# Section 5. Working with spatial data

#### Instructor: Pierre Biscaye

The content of this notebook draws on material from my own research. 
    
### Learning Objectives 
    
* Think about what processing steps are needed to prepare raw spatial data for analysis
* Understand some of the common steps, such as changing resolutions, going from points to rasters, and deriving new variables
* Think about ways to match spatial data to observation units for analysis

### Sections

1. Processing spatial data for analysis
2. Creating a dynamic raster
3. Point-level data analysis

### Required Data
* AustraliaRainfall.nc
* ucdp_geo_DZA.csv
* GPW.tif
* NGA_HouseholdGeovariables_Y1.dta
* NGA_water_areas_dcw.shp

### Required Packages
* numpy
* matplotlib
* geopandas
* pandas
* sys
* rasterio
* xarray
* shapely
* seaborn
* contextily

In [None]:
# Import Packages

import numpy as np
import matplotlib.pyplot as plt
import geopandas as gpd
import pandas as pd
import sys
import rasterio
import xarray as xr
from matplotlib.colors import LinearSegmentedColormap

# 1. Processing spatial data for analysis

Most analysis with spatial data requires a great deal of processing in order to get the data into a useful combined form. 

We will first go through examples of this processing in the context of code I used for my paper on the impacts of exposure to locust swarms on the risk of conflict. We will then work through some examples ourselves with data I have provided.

In this paper, the unit of observation for the analysis was 0.25$^{\circ}$ cells at the annual level. This means that all my spatial data have to be processed to be at this level. 

I first constructed a grid of 0.25$^{\circ}$ cells defined by cell centroids. This is what we have done previously in this class when creating meshed grids. I then duplicated this grid for the number of years in my dataset, to create a 3 dimensional array. The next step was to prepare my individual datasets to merge into this grid to create a master dataset for analysis. You can think of this dataset as a 3D array with many bands.
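As a rough sketch of this setup (not the code used in the paper), the grid of centroids and its cell-year expansion might look like the following. The bounding box, year range, and all variable names are hypothetical values chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical bounding box and years, for illustration only
demo_res = 0.25
lon_min, lon_max, lat_min, lat_max = 0.0, 2.0, 10.0, 11.0
years_demo = np.arange(2000, 2003)

# Cell centroids sit half a cell in from the edges
lon_centroids = np.arange(lon_min + demo_res / 2, lon_max, demo_res)
lat_centroids = np.arange(lat_max - demo_res / 2, lat_min, -demo_res)  # north to south

# 2D grid of centroids, then one copy per year -> 3D (year, lat, lon)
lon_grid, lat_grid = np.meshgrid(lon_centroids, lat_centroids)
cell_id = np.arange(lon_grid.size).reshape(lon_grid.shape)
cell_id_3d = np.broadcast_to(cell_id, (len(years_demo),) + cell_id.shape)

# The same structure as a long cell-year table
panel_demo = pd.DataFrame({
    "year": np.repeat(years_demo, lon_grid.size),
    "cell_id": np.tile(cell_id.ravel(), len(years_demo)),
    "lon": np.tile(lon_grid.ravel(), len(years_demo)),
    "lat": np.tile(lat_grid.ravel(), len(years_demo)),
})
print(cell_id_3d.shape, panel_demo.shape)
```

Each additional dataset then becomes either a new band in the 3D array or a new column in the long table.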

Here is a basic summary of my main data inputs and the necessary transformations:
* Locust swarm point data: use coordinates and date to merge to main array
* Conflict event point data: use coordinates and date to merge to main array
* Administrative boundaries: identify countries and administrative units for each grid cell midpoint; deal with edge cases
* CHIRPS precipitation: daily at 0.05 degree resolution globally, resample taking total within years and mean within grid cells
* ERA-5 temperature: monthly mean at 9 km resolution globally, resample taking maximum within years and mean within grid cells
* NDVI: every 16 days at 1 km resolution globally, resample taking mean within months, max within years, and mean within grid cells
* WorldClim precipitation and temperature: monthly at 2.5 arcmin (~0.04 degrees) resolution globally, resample taking total precip and max temp within years and mean within grid cells
* GPW population: every 5 years at 0.25 degree resolution globally, interpolate linearly between years
* Land cover: static raster at 0.0833 degree resolution, resample taking mean within grid cells
* Net migration: annual data at 5 arcmin resolution globally, resample taking mean within grid cells
* Crop yields: annual data at 5 arcmin resolution globally, resample taking mean within grid cells
* PRIO-GRID data: annual at 0.5 degree resolution globally, resample spreading across included grid cells

As you can see, one of the main tasks I had to deal with was resampling/transforming the resolution of different datasets to match my target dataset.
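Not every transformation is a spatial resample. For a variable like GPW population, which is only observed every 5 years, the step is linear interpolation between observation years. A minimal sketch, assuming small made-up rasters rather than the actual GPW files:

```python
import numpy as np

# Hypothetical population rasters (2x2 cells) observed every 5 years
obs_years = np.array([2000, 2005, 2010])
pop_obs = np.array([
    [[100., 200.], [300., 400.]],   # 2000
    [[150., 200.], [250., 500.]],   # 2005
    [[200., 200.], [200., 600.]],   # 2010
])

target_years = np.arange(2000, 2011)

# Interpolate each cell's time series linearly onto the annual years
flat = pop_obs.reshape(len(obs_years), -1)   # (time, cells)
annual_flat = np.column_stack(
    [np.interp(target_years, obs_years, flat[:, j]) for j in range(flat.shape[1])]
)
pop_annual = annual_flat.reshape(len(target_years), *pop_obs.shape[1:])

print(pop_annual[2])  # interpolated raster for 2002
```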

### Changing data resolution

* **High** resolution: more detailed, more fine
* **Low** resolution: less detailed, more aggregated

What are the implications of using higher-resolution data? How do you decide?

#### Spatial resolution

When combining data sources it is necessary to match data resolution (as well as CRS!!).

Most of the data I used for my locusts and conflict project were at a finer resolution than 0.25$^{\circ}$ (why did I use this resolution?). This means I had multiple pixels for each 0.25$^{\circ}$ grid cell. The approach here is to **rescale** the original data to match the desired resolution. You could see that I did this in some of the GEE scripts I showed you (much of my processing was in GEE).

When the original data is at a higher resolution, you are **collapsing** it to the lower target resolution. You can take a mean, a sum, a max, or other statistics for higher resolution pixels to then generate statistics at the lower target resolution. What you use depends on what you are trying to measure.
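When the resolutions are exact multiples, collapsing can be sketched with a reshape that groups pixels into blocks. The 4x4 raster below is made up, and the pairing of statistics with variables in the comments is only suggestive.

```python
import numpy as np

# Hypothetical 4x4 high-resolution raster
fine = np.arange(16, dtype=float).reshape(4, 4)

# Group pixels into 2x2 blocks: (block_row, row_in_block, block_col, col_in_block)
blocks = fine.reshape(2, 2, 2, 2)

coarse_mean = blocks.mean(axis=(1, 3))  # e.g., average temperature
coarse_sum = blocks.sum(axis=(1, 3))    # e.g., total population
coarse_max = blocks.max(axis=(1, 3))    # e.g., whether any extreme event occurred

print(coarse_mean)
```

The same input gives three different coarse rasters, which is why the choice of statistic has to follow from what the variable measures.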

When the original data is at a lower resolution, you are **spreading** the data across the higher target resolution. One approach is to assign all smaller pixels within the larger pixel the value of the larger pixel. More sophisticated approaches require assumptions about how the values in the larger pixel were generated, and using that as an input to determine how to spread the data. You might not want to do this for something like temperature, but you might for something like GDP or agricultural output, for example if you also have high-resolution data on population, economic activity, or land cover.
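A minimal sketch of spreading, assuming a made-up coarse raster of totals (think of regional output) and a made-up fine population raster used as weights. It assumes every coarse cell contains some population; otherwise the division would need special handling.

```python
import numpy as np

coarse_total = np.array([[100., 40.]])        # 1x2 coarse cells, hypothetical totals
fine_pop = np.array([[10., 30., 5., 5.],
                     [40., 20., 0., 10.]])    # 2x4 fine cells, hypothetical population

# Simple approach: repeat each coarse value over its 2x2 block of fine cells
repeated = np.repeat(np.repeat(coarse_total, 2, axis=0), 2, axis=1)

# Weighted approach: allocate each total in proportion to population shares
pop_blocks = fine_pop.reshape(1, 2, 2, 2).sum(axis=(1, 3))   # population per coarse cell
pop_repeated = np.repeat(np.repeat(pop_blocks, 2, axis=0), 2, axis=1)
fine_total = repeated * fine_pop / pop_repeated

print(fine_total)
```

Repeating values is appropriate for levels such as temperature. For totals, the weighted version keeps the sum within each coarse cell unchanged.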

One concern is that your pixels may not overlap exactly, e.g., if the spatial resolutions are not multiples of each other. Here you have to make decisions about how you deal with small pixels that are not entirely contained within larger pixels. An advanced approach would take weighted statistics based on the share of the small pixel area that overlaps with the larger pixel. A simplistic approach would just assign small pixels in their entirety to large pixels based on their centroids. In many cases these decisions may not make a large difference, but in some cases they might.
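The centroid-based approach can be sketched as follows, assuming hypothetical fine cells of 0.1 degrees and coarse cells of 0.25 degrees, so that the resolutions are not multiples of each other. All values are made up.

```python
import numpy as np
import pandas as pd

coarse_res = 0.25

# Hypothetical centroids of ten 0.1-degree cells along one row, each with a value
fine_cells = pd.DataFrame({
    "lon": [0.04, 0.14, 0.24, 0.34, 0.44, 0.54, 0.64, 0.74, 0.84, 0.94],
    "lat": 0.05,
    "value": np.arange(10, dtype=float),
})

# Index of the coarse cell that contains each fine centroid
fine_cells["col"] = np.floor(fine_cells["lon"] / coarse_res).astype(int)
fine_cells["row"] = np.floor(fine_cells["lat"] / coarse_res).astype(int)

# Collapse to the coarse grid, keeping track of how many fine cells fed each mean
coarse_cells = (fine_cells.groupby(["row", "col"])["value"]
                .agg(["mean", "count"]).reset_index())
print(coarse_cells)
```

Note that the coarse cells receive different numbers of fine cells, which is exactly the kind of unevenness an area-weighted approach would smooth out.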

#### Temporal resolution

Merging data sources also requires matching the temporal resolution.

How you approach this again depends on the relative resolutions of the original and target data sources. You can **collapse** data to lower temporal resolutions (e.g., months to years), or **spread** data across higher resolutions (e.g., years to days). Merging a static variable into a dynamic dataset implicitly spreads its values across all time periods.
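A small sketch of both directions with `pandas`, using a made-up monthly series for a single cell and a made-up static variable (`elevation`). The column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical monthly series for one cell
monthly = pd.DataFrame({
    "year": [2000] * 3 + [2001] * 3,
    "month": [1, 2, 3, 1, 2, 3],
    "precip": [10., 20., 30., 0., 5., 15.],
    "tmax": [30., 35., 33., 29., 31., 36.],
})

# Collapse months to years: total precipitation and hottest month
annual = monthly.groupby("year").agg(
    precip_total=("precip", "sum"),
    tmax_max=("tmax", "max"),
).reset_index()

# Spread a static variable across years: every year gets the same value
static = pd.DataFrame({"cell_id": [1], "elevation": [450.]})
annual["cell_id"] = 1
annual = annual.merge(static, on="cell_id", how="left")
print(annual)
```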

#### Example: Preparing WorldClim weather data

WorldClim has data on monthly total precipitation and maximum temperature available at a 2.5 arcminute resolution (around 0.04 degrees) globally for every month from 1985-2018. 

I needed to merge this 0.04$^{\circ}$ by month data to a 0.25$^{\circ}$ by year target dataset.

**Temporal resolution:** I took the sum of precipitation across months to get an annual measure and took the max of maximum monthly temperature. 

*Question:* Why did I use different methods for the two measures?

**Spatial resolution:** I assigned every original 0.04$^{\circ}$ cell to the larger grid cell its centroid fell in, and took means across smaller cells to get an average value for the larger cells. 

*Question:* When might taking a sum or a max have been appropriate?

### Creating derived variables

In some cases you want to assign raster data to point locations, which we will discuss below. 

In other cases, you want to use information from point events to create a raster. This is what I did for the conflict and locust swarms data in my paper. 

I first identified which grid cell and year each point event was located in.

I then assigned values to cell-years based on the distribution of point events. I calculated two statistics for each cell-year:
* Count of events
* Indicators for any events

Finally, I created additional variables for each cell-year based on proximity to events outside of the cell, to consider potential spillovers.

When working with rasters, you may want to create some derived variables in the original resolution. Transforming first and then creating variables in a combined dataset at the target resolution may lead to different values of derived variables. You should think carefully about the math of what exactly the derived variables will be measuring depending on the order of operations and decide how to proceed. You also need to think hard about how transforming/rescaling your data can affect the variables you have created, and how you will interpret them. 

With spatial data, we generally consider it best to create derived variables at the highest resolution, but even then we must be careful about the implications of how we rescale it. We will show a simulation with an example shortly.
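A tiny numerical illustration of why the order matters, using made-up rainfall values for the four sub-cells of one coarse cell and an arbitrary threshold:

```python
import numpy as np

# Hypothetical rainfall in the four sub-cells of one coarse cell, and a threshold
sub_rain = np.array([50., 60., 70., 140.])
threshold = 100.

# Derive first, then rescale
flags = (sub_rain >= threshold).astype(int)
max_of_flags = flags.max()     # 1 if any sub-cell is above the threshold
mean_of_flags = flags.mean()   # share of sub-cells above the threshold

# Rescale first, then derive
flag_of_mean = int(sub_rain.mean() >= threshold)

print(max_of_flags, mean_of_flags, flag_of_mean)
```

The three numbers answer different questions: whether anywhere in the cell was extreme, how much of the cell was extreme, and whether the cell was extreme on average.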

#### Testing effects of order of operations

To illustrate how decisions about when to derive variables and rescale data can affect the final data, we will do some examples using the Australia rainfall data. 

We first load the data and subset it to the southeast quadrant to reduce the time for later computations.

The data are rainfall by month and year at a 0.25 degree resolution. We will create some variables:
* a rainfall z-score within cell-months across years
* an indicator for rainfall above the 90th percentile within cell-months across years

We will create these before rescaling, then rescale to a 0.5 degree resolution by taking means of rainfall across sub-cells. In the rescale, we will also take the mean of z-score and the max of the high rainfall indicator, but also recalculate these from the rescaled rainfall data to compare.

In [None]:
aus = xr.open_dataset('Data/AustraliaRainfall.nc')
seaus = aus.where((aus.lat < -30) & (aus.lon > 135), drop=True)

In [None]:
seaus

In [None]:
seaus['rainfall'].sel(month = 1).mean(dim = 'year').plot()
plt.show()

In [None]:
# 1. Calculate required stats by cell-month across years
rain_m = seaus['rainfall'].groupby('month').mean(dim='year')
rain_s = seaus['rainfall'].groupby('month').std(dim='year')
rain_p90 = seaus['rainfall'].groupby('month').quantile(0.9, dim='year')
land_mask = seaus['rainfall'].mean(dim='year').notnull() & (seaus['rainfall'].mean(dim='year') > 0)

# 2. Create new variables
# Xarray automatically aligns the 'month' dimension
seaus['rain_z'] = ((seaus['rainfall'].groupby('month') - rain_m) / rain_s).where(land_mask)
seaus['rain_high'] = (seaus['rainfall'].groupby('month') >= rain_p90).astype(int).where(land_mask)

In [None]:
# 3. Rescale the data
# boundary='trim' ensures that if your dimensions aren't perfectly divisible by 2, it drops the edge
coarse_obj = seaus.coarsen(lat=2, lon=2, boundary='trim') # sets up 2x2 blocks of subcells
# collapse taking means
seaus_low = coarse_obj.mean()  # This handles 'rainfall' and 'rain_z' as means
# Handle 'rain_high' separately as a 'max' 
seaus_low['rain_high'] = coarse_obj.max()['rain_high']

seaus_low

In [None]:
# 4. Calculate new variables at low resolution
# Get required stats by cell-month across years
rain_m = seaus_low['rainfall'].groupby('month').mean(dim='year')
rain_s = seaus_low['rainfall'].groupby('month').std(dim='year')
rain_p90 = seaus_low['rainfall'].groupby('month').quantile(0.9, dim='year')
land_mask_low = land_mask.coarsen(lat=2, lon=2, boundary='trim').max() > 0

# Create new variables (use >= as at the original resolution, so the two indicators are defined the same way)
seaus_low['rain_z_lores'] = ((seaus_low['rainfall'].groupby('month') - rain_m) / rain_s).where(land_mask_low)
seaus_low['rain_high_lores'] = (seaus_low['rainfall'].groupby('month') >= rain_p90).astype(int).where(land_mask_low)

Let's visualize what it looks like to have **coarsened** the data.

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

seaus['rainfall'].sel(year=1981, month=1).plot(ax=ax1, vmax=200)
ax1.set_title("Original Rainfall (0.25°)")
seaus_low['rainfall'].sel(year=1981, month=1).plot(ax=ax2, vmax=200)
ax2.set_title("Coarsened Rainfall (0.5°)")

plt.tight_layout()

Now let's compare what the derived rainfall z-scores and high rainfall indicators look like, in the original resolution and under the two methods for the coarse resolution.

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(18, 5), sharey=True)

seaus['rain_z'].sel(year=1981, month=1).plot(ax=ax1, cmap='RdBu', vmin=-3, vmax=3)
ax1.set_title("Original Z (0.25°)")
seaus_low['rain_z'].sel(year=1981, month=1).plot(ax=ax2, cmap='RdBu', vmin=-3, vmax=3)
ax2.set_title("Method A: Mean of Zs (0.5°)")
seaus_low['rain_z_lores'].sel(year=1981, month=1).plot(ax=ax3, cmap='RdBu', vmin=-3, vmax=3)
ax3.set_title("Method B: Z of Mean Rain (0.5°)")

plt.tight_layout()

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(18, 5), sharey=True)

seaus['rain_high'].sel(year=1981, month=1).plot(ax=ax1, cmap='Greens')
ax1.set_title("Original High Rain (0.25°)")
seaus_low['rain_high'].sel(year=1981, month=1).plot(ax=ax2, cmap='Greens')
ax2.set_title("Method A: Max of Flags (0.5°)")
seaus_low['rain_high_lores'].sel(year=1981, month=1).plot(ax=ax3, cmap='Greens')
ax3.set_title("Method B: Flag of Mean (0.5°)")

plt.tight_layout()

What do you observe?

Let's plot the distributions.

In [None]:
import seaborn as sns

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Left: Z-score Distributions
sns.kdeplot(seaus['rain_z'].values.flatten(), label="Original", ax=ax1, color='black', ls='--')
sns.kdeplot(seaus_low['rain_z'].values.flatten(), label="Method A (Mean of Z)", ax=ax1, color='royalblue', fill=True)
sns.kdeplot(seaus_low['rain_z_lores'].values.flatten(), label="Method B (Z of Mean)", ax=ax1, color='crimson')
ax1.set_title("Z-Score Distribution Comparison")
ax1.legend()

# Right: High Rain Frequency (Bar plot)
# We compare the mean of the binary flags, which represents the probability/fraction of extremes
freqs = [seaus['rain_high'].mean().values, 
         seaus_low['rain_high'].mean().values, 
         seaus_low['rain_high_lores'].mean().values]
labels = ['Original', 'Method A (Max)', 'Method B (Lores)']

ax2.bar(labels, freqs, color=['gray', 'royalblue', 'crimson'])
ax2.set_title("Frequency of 'High Rain' Events")
ax2.set_ylabel("Fraction of total observations")

plt.tight_layout()
plt.show()

What is happening here? 

The z-score distributions overlap almost perfectly, which suggests that the data have very high spatial autocorrelation. In this case, the z-score of the average is approximately equal to the average of the z-scores. The implication is that coarsening the data doesn't lose much information, because neighboring cells were already largely in agreement.
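To see why spatial correlation drives this result, here is a small simulation on synthetic standard-normal "rainfall" (not the Australia data). It assumes four sub-cells per coarse cell that share a common shock, with pairwise correlation `rho`; the function name and setup are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_years = 2000

def compare(rho):
    """Simulate 4 sub-cells with pairwise correlation rho; return the std of each coarse z-score."""
    common = rng.standard_normal((n_years, 1))
    own = rng.standard_normal((n_years, 4))
    rain = np.sqrt(rho) * common + np.sqrt(1 - rho) * own        # (years, sub-cells)
    z_sub = (rain - rain.mean(axis=0)) / rain.std(axis=0)
    mean_of_z = z_sub.mean(axis=1)                               # Method A
    cell_mean = rain.mean(axis=1)
    z_of_mean = (cell_mean - cell_mean.mean()) / cell_mean.std() # Method B
    return mean_of_z.std(), z_of_mean.std()

for rho in (0.0, 0.95):
    a, b = compare(rho)
    print(f"rho={rho}: std(mean of z)={a:.2f}, std(z of mean)={b:.2f}")
```

Method B always has unit variance by construction. Method A shrinks toward zero when sub-cells are weakly correlated, and only approaches Method B when they move together, which is the pattern we see in the rainfall data.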

There are more differences with the high rainfall indicator. Of course, the method where we take the max will result in a higher frequency, because high rainfall in any subcell is sufficient to trigger the indicator in the coarser cell. We are effectively "expanding the footprint" of every high rainfall event. When we instead recalculate the indicator from the coarsened rainfall (Method B), the frequency of high rainfall is pretty similar to that in the original data, again suggesting high spatial correlations. 
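Some back-of-the-envelope bounds help interpret the max-based frequency. This sketch assumes each sub-cell exceeds its own 90th percentile about 10% of the time and considers two extreme cases for how sub-cells co-move.

```python
p = 0.10   # a sub-cell exceeds its own 90th percentile about 10% of the time
k = 4      # sub-cells per coarse cell (2x2)

freq_if_identical = p                    # sub-cells always move together
freq_if_independent = 1 - (1 - p) ** k   # at least one of k independent sub-cells

print(freq_if_identical, round(freq_if_independent, 4))
```

The frequency under the max method should fall somewhere between these two values, and sits closer to the lower one the more spatially correlated the data are.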

In general we can be somewhat reassured by this simulation, but we should still be aware of the implications of how we approach this process. In particular, we show that taking maximums vs means when coarsening data can lead to big differences, which matters if we are creating binary variables.

### Examples from my project data processing

*Note*: This code will not run on your machines.

In [None]:
import getpass
from pathlib import Path

# Get the current system username
user = getpass.getuser()

# Set the home path based on the user
if user == "pierrebiscaye":
    home = Path("/Users/pierrebiscaye/Dropbox")
elif user == "pibiscay":
    home = Path(r"C:\Users\pibiscay\Dropbox")
else:
    home = None
    print(f"Username '{user}' not recognized.")

# Example of how to join paths later:
# data_path = home / "Project" / "data.csv"

In [None]:
world=gpd.read_file(home / "Data/Country boundaries/Country raw" /
                    "UIA_World_Countries_Boundaries/World_Countries__Generalized_.shp")
data = pd.read_csv(home / "Locusts/Clean data/mapping.csv")
x = data['lon']
y = data['lat']

In [None]:
data.columns

In [None]:
crop = rasterio.open(home / "Data/Spatial/Land Cover/CIESIN 2000/gl-croplands-geotif/cropland.tif")
pasture = rasterio.open(home / "Data/Spatial/Land Cover/CIESIN 2000/gl-pastures-geotif/pasture.tif")

In [None]:
# setup
style = {
    'cmap': 'YlGn',
    'vmin': 0, 
    'vmax': 1,
    'extent': (-180, 180, -90, 90)
}
font_cfg = {'title': 12, 'label': 10, 'tick': 8}

plot_configs = [
    {'data': crop.read(1),    'title': 'Crop land share in 2000 (Source: CIESIN)',    'label': 'Crop share'},
    {'data': pasture.read(1), 'title': 'Pasture land share in 2000 (Source: CIESIN)', 'label': 'Pasture share'},
    {'data': data['ag_share'],'title': 'Ag (crop+pasture) share in 2000',           'label': 'Ag share', 'type': 'scatter'}
]

fig, axes = plt.subplots(1, 3, figsize=(20, 6))

# plots
for ax, cfg in zip(axes, plot_configs):
    # Handle imshow vs scatter
    if cfg.get('type') == 'scatter':
        im = ax.scatter(x, y, marker="s", s=10, c=cfg['data'], **{k: v for k, v in style.items() if k != 'extent'})
    else:
        im = ax.imshow(cfg['data'], **style)
    
    # Subplot Formatting
    ax.set_title(cfg['title'], fontsize=font_cfg['title'])
    world.plot(ax=ax, color='none', edgecolor='k', alpha=0.3)
    ax.set_xlim([30, 50])
    ax.set_ylim([5, 20])
    
    # Colorbar Logic
    cbar = fig.colorbar(im, ax=ax)
    cbar.set_label(cfg['label'], fontsize=font_cfg['label'])
    cbar.ax.tick_params(labelsize=font_cfg['tick'])

plt.tight_layout()
plt.show()

In [None]:
swarms = pd.read_csv(home / "Data/Locust Hub/Retrieved 9.13.20/Swarms_geo_clean.csv")
swarms=swarms[swarms['yr']>1989] 
swarms2=swarms[swarms['yr']>1997] 

In [None]:
# New color scheme
nodes = [0,  1] 
color_scheme3 = ['white', 'orange']  # corresponds to nodes
custom_cmap3 = LinearSegmentedColormap.from_list(
    'custom_name3', list(zip(nodes, color_scheme3)))
custom_cmap3.set_under('white')  # set values under vmin to white
custom_cmap3.set_over('orange')  # set values over vmax to orange

# Define a function to handle repetitive formatting
def format_map(ax, im, title, label, ticks=None):
    ax.set_title(title, fontsize=15)
    world.plot(ax=ax, color='none', edgecolor='k', alpha=0.3)
    ax.set_xlim([-18, 60])
    ax.set_ylim([-2, 38])
    cbar = fig.colorbar(im, ax=ax, ticks=ticks)
    cbar.set_label(label, fontsize=12)
    cbar.ax.tick_params(labelsize=10)

# Plotting 
fig, axes = plt.subplots(4, 1, figsize=(10, 15))

im0 = axes[0].scatter(swarms['x'], swarms['y'], s=1, c=swarms['yr'], cmap='jet')
format_map(axes[0], im0, "A. Swarm locations (raw)", "Year")

im1 = axes[1].scatter(x, y, s=1, c=data['treat_yr2'], cmap='jet')
format_map(axes[1], im1, "B. First exposure (derived)", "Year")

im2 = axes[2].scatter(swarms2['x'], swarms2['y'], s=1, c=swarms2['yr'], cmap='jet')
format_map(axes[2], im2, "C. Swarm locations post-1997 (raw)", "Year")

im3 = axes[3].scatter(x, y, s=1, c=data['swarm100max'], cmap=custom_cmap3)
format_map(axes[3], im3, "D. Within 100km (1997-2020)", "0=No, 1=Yes", ticks=[0, 1])

plt.tight_layout()

plt.show()

# 2. Creating a dynamic raster

Let's practice building a dynamic raster. We'll use data on conflict events from the [UCDP](https://www.uu.se/en/websites/ucdp---uppsala-conflict-data-program) and add to this a static layer with 2000 population data from GPW. 

To make this tractable, we'll focus on just one country: Algeria. 

As a first step, let's import a shapefile of Algeria's boundaries. Rather than downloading and importing it, let's import it directly from the [GADM](https://gadm.org/download_country.html) website. We can use GeoPandas to directly pull a file from a URL, including a specific subfile from a zipped folder.

In [None]:
# Direct URL for the zipped GADM shapefiles for Algeria
url = "https://geodata.ucdavis.edu/gadm/gadm4.1/shp/gadm41_DZA_shp.zip"

# Read in the country-level shapefile
# The '!' symbol is used to separate the zip path from the specific file inside
dza = gpd.read_file(f"zip+{url}!gadm41_DZA_0.shp")

dza.head()

Now let's load the GPW population data and mask it to keep only the country of Algeria. 

In [None]:
import rasterio.mask 

with rasterio.open('Data/GPW.tif') as src:
    # Apply the mask directly to the open file object
    # crop=True drops the rest of the world
    gpw_dza_image, gpw_dza_transform = rasterio.mask.mask(src, dza.geometry, crop=True, nodata=np.nan)
    # Copy the metadata and update it for the new smaller footprint
    gpw_dza_meta = src.meta.copy()
    gpw_dza_meta.update({
        "driver": "GTiff",
        "height": gpw_dza_image.shape[1],
        "width": gpw_dza_image.shape[2],
        "transform": gpw_dza_transform,
        "nodata": np.nan  # match the nodata value used in the mask above
    })

# Keep the single population band as a 2D array (Algeria only)
gpw_dza = gpw_dza_image[0]


Let's check that this worked by plotting the masked population data over a background basemap.

We'll look at a **basemap** using the `contextily` package. We'll have to make sure the projections are aligned!

In [None]:
import contextily as cx
from rasterio.plot import plotting_extent

fig, ax = plt.subplots(figsize=(10, 8))

# Get extent of Algeria data so raster knows where to plot it
dza_extent = plotting_extent(gpw_dza, gpw_dza_transform)

# Plot the population raster
im = ax.imshow(gpw_dza, 
               cmap='Reds', 
               extent=dza_extent, 
               vmin=0.1, vmax=100000, 
               alpha=0.6, # Alpha allows the basemap to peek through
               zorder=2) # controls drawing order: above the basemap, below the boundary
fig.colorbar(im)

# Plot the boundary (Transparent fill)
dza.plot(ax=ax, facecolor='none', edgecolor='black', linewidth=1.5, zorder=3)

# Add the Basemap
cx.add_basemap(ax, 
               crs=dza.crs.to_string(), # tell contextily the CRS of the data already on the axes
               source=cx.providers.CartoDB.Positron,
               zorder=1) # says to plot this first (at the bottom/back)

ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
plt.show()

There are many `contextily` basemaps. We can browse the catalog.

Some potentially useful ones:
* Minimalist (Light): `cx.providers.CartoDB.Positron`, best for colorful heatmaps/rasters.
* Minimalist (Dark): `cx.providers.CartoDB.DarkMatter`, great for "neon" or glowing data visualizations.
* Satellite: `cx.providers.Esri.WorldImagery`, showing terrain, vegetation, or urban density.
* Standard: `cx.providers.OpenStreetMap.Mapnik`, familiar "Google Maps" look with street names.
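* Terrain: `cx.providers.Stamen.Terrain`, highlighting mountains and physical geography. (Stamen tiles have moved to Stadia Maps, so in recent versions of `contextily` this provider may instead be listed under `cx.providers.Stadia` and may require an API key.)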

In [None]:
# List all available providers
print(cx.providers.keys())
# List all styles for a specific provider (e.g., CartoDB)
print(cx.providers.CartoDB.keys())

All right, now we have the static population data and grid. We need to import the conflict event data, saved in `ucdp_geo_DZA.csv`, and turn it into a cell-year raster. 

We could choose any resolution, but for simplicity we'll set it to the same resolution as the GPW data.

In [None]:
ucdp = pd.read_csv("Data/ucdp_geo_DZA.csv")
ucdp.head()

In [None]:
# Identify the variables we need
ucdp.columns

Let's lay out our strategy:
1. Extract grid coordinates for the target raster from the `gpw_dza` metadata
2. Map each conflict event to a grid cell
3. Group by year and cell to get counts of events and deaths (the 'best' column)
4. Create a 3D year-lat-lon xarray

In [None]:
# 1. Get the coordinate arrays from the raster metadata
# First build a meshgrid of sufficient size to include all grid cells
cols, rows = np.meshgrid(np.arange(gpw_dza.shape[1]), np.arange(gpw_dza.shape[0]))
# Then use the transform from earlier to reconstruct the lat/lons
lons, lats = rasterio.transform.xy(gpw_dza_transform, rows, cols)
# Identify cell centroids to match with event lons and lats
lon_coords = np.unique(lons) # longitude centroids
lat_coords = np.sort(np.unique(lats))[::-1] # latitude centroids, top-down to maintain North-South orientation
# Identify cell edges
res = 0.25 # or gpw_dza_transform[0]
lon_edges = np.append(lon_coords - (res / 2), lon_coords[-1] + (res / 2))
lat_edges = np.append(lat_coords + (res / 2), lat_coords[-1] - (res / 2))

# 2. Map UCDP points to grid indices
# np.digitize() finds the index of the interval between edges that each event falls into
ucdp['col_idx'] = np.digitize(ucdp['longitude'], lon_edges) - 1 # because the first interval (1) maps to the first column (0)
ucdp['row_idx'] = np.digitize(ucdp['latitude'], lat_edges[::-1]) - 1 # flip edges to South-North (ascending) for digitize; -1 for a 0-based index
ucdp['row_idx'] = (len(lat_coords) - 1) - ucdp['row_idx'] # flip back again to North-South

# 3. Collapse (aggregate) UCDP data by Year and Grid Cell
grid_agg = ucdp.groupby(['year', 'row_idx', 'col_idx']).agg(
    conflict_count=('year', 'count'),
    total_deaths=('best', 'sum')
).reset_index()
# Convert the grid_agg indices to the actual coordinate values
grid_agg['lat'] = lat_coords[grid_agg['row_idx'].astype(int)]
grid_agg['lon'] = lon_coords[grid_agg['col_idx'].astype(int)]

# 4. Initialize the dynamic xarray Dataset (3D), with zeros for every cell-year
years = sorted(ucdp['year'].unique())
ds = xr.Dataset(
    data_vars={
        "conflict_count": (("year", "lat", "lon"), np.zeros((len(years), len(lat_coords), len(lon_coords)))),
        "total_deaths": (("year", "lat", "lon"), np.zeros((len(years), len(lat_coords), len(lon_coords)))),
    },
    coords={
        "year": years,
        "lat": lat_coords,
        "lon": lon_coords,
    }
)

# 5. Fill the xarray Dataset with the aggregated data
# Convert the dataframe to an xarray object (only cell-years with events are present)
sparse_ds = grid_agg.set_index(['year', 'lat', 'lon'])[['conflict_count', 'total_deaths']].to_xarray()
# Expand to the full grid; cell-years without events come back as NaN, so set them to 0
sparse_ds = sparse_ds.reindex_like(ds).fillna(0)
# Overwrite the zero-filled placeholders with the aggregated values
ds.update(sparse_ds)
# Merge in the static Population layer
ds['population_2000'] = (("lat", "lon"), gpw_dza)

print(ds)

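Index arithmetic like this is easy to get wrong by one cell, so it is worth checking on a tiny grid where the answer is known. The grid and events below are hypothetical, and the `toy_` names are only for this check.

```python
import numpy as np

# Hypothetical 3x2 grid of 1-degree cells: centroids listed north to south, west to east
toy_lat = np.array([2.5, 1.5, 0.5])
toy_lon = np.array([10.5, 11.5])
toy_res = 1.0

toy_lon_edges = np.append(toy_lon - toy_res / 2, toy_lon[-1] + toy_res / 2)  # ascending
toy_lat_edges = np.append(toy_lat + toy_res / 2, toy_lat[-1] - toy_res / 2)  # descending

# Two events: one in the north-west cell, one in the south-east cell
ev_lon = np.array([10.2, 11.9])
ev_lat = np.array([2.8, 0.1])

toy_col = np.digitize(ev_lon, toy_lon_edges) - 1
south_north = np.digitize(ev_lat, toy_lat_edges[::-1]) - 1   # 0 = southernmost row
toy_row = (len(toy_lat) - 1) - south_north                   # 0 = northernmost row

print(toy_row, toy_col)
print(toy_lat[toy_row], toy_lon[toy_col])
```

The north-west event should land in row 0, column 0 and the south-east event in the last row and column. Events outside the grid would get an index of -1 or beyond the last cell, so it can also be worth checking the range of the indices on the real data.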
Let's take a quick look to check that the data look reasonable.

In [None]:
ds['conflict_count'].sel(year=2012).plot();

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(20, 6))

# Define the variables and titles to plot
plot_configs = [
    ('conflict_count', 'Conflict Events (2011)', 'Reds'),
    ('total_deaths', 'Total Deaths (2011)', 'Oranges'),
    ('population_2000', 'Population (2000)', 'Blues')
]

for i, (var, title, cmap) in enumerate(plot_configs):
    # Plot the raster data
    im = ds.sel(year=2011)[var].plot(ax=axes[i], cmap=cmap, add_colorbar=True, 
                           cbar_kwargs={'shrink': 0.8, 'label': ''})
    
    # Overlay the national boundary
    dza.plot(ax=axes[i], facecolor='none', edgecolor='black', linewidth=1)
    
    axes[i].set_title(title)
    axes[i].set_xlabel('Longitude')
    axes[i].set_ylabel('Latitude')

plt.tight_layout()
plt.show()

In [None]:
# Zooming in
fig, axes = plt.subplots(1, 3, figsize=(20, 6))

plot_configs = [
    ('conflict_count', 'Conflict Events (2011)', 'Reds', 15),
    ('total_deaths', 'Total Deaths (2011)', 'Oranges', 30),
    ('population_2000', 'Population (2000)', 'Blues', 150000)
]

for i, (var, title, cmap, vmax) in enumerate(plot_configs):
    im = ds.sel(year=2011)[var].plot(ax=axes[i], cmap=cmap, vmax=vmax, add_colorbar=True, 
                           cbar_kwargs={'shrink': 0.8, 'label': ''})
    
    dza.plot(ax=axes[i], facecolor='none', edgecolor='black', linewidth=1)
    axes[i].set_xlim([-4, 8])
    axes[i].set_ylim([32, 38])
    axes[i].set_title(title)
    axes[i].set_xlabel('Longitude')
    axes[i].set_ylabel('Latitude')

plt.tight_layout()
plt.show()

Now we have a 3D raster dataset which is great for visualization. But for analysis we would like to have a 2D dataset. 

We will **restructure the data** to a 2D cell-year panel where the cell centroids become lat and lon columns. We'll set this up as a pandas dataframe. This is fortunately pretty easy to do with built-in functions!

In [None]:
# to_dataframe() creates a MultiIndex (year, lat, lon); reset_index() turns these into regular columns
df_analysis = ds.to_dataframe().reset_index()
df_analysis.head()

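One thing to notice in this restructuring is what happens to static layers such as population: they are repeated for every year. A minimal sketch on a made-up dataset (the `toy_` names and values are hypothetical):

```python
import numpy as np
import xarray as xr

toy_ds = xr.Dataset(
    data_vars={
        # Dynamic variable: varies by year and cell
        "events": (("year", "lat", "lon"), np.arange(8, dtype=float).reshape(2, 2, 2)),
        # Static layer: varies by cell only
        "pop": (("lat", "lon"), np.array([[10., 20.], [30., 40.]])),
    },
    coords={"year": [2000, 2001], "lat": [1.5, 0.5], "lon": [10.5, 11.5]},
)

# Each variable is broadcast against all dimensions, so 'pop' repeats across years
toy_df = toy_ds.to_dataframe().reset_index()
print(toy_df)
```

This is the implicit spreading of a static variable across time periods discussed earlier.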
# 3. Point-level analysis

We've just looked at using point data to create rasters. But in many cases we are interested in conducting analyses at the level of points - locations of individuals, cities, businesses, etc. 

There are a huge number of datasets with geolocated information on events, communities, survey locations, etc. It is often very useful to map these to other spatial data for analysis.

Let's work with locations of survey communities in the Nigeria General Household Survey Panel ([GHSP](https://microdata.worldbank.org/index.php/catalog/5835)). 

*Note*: these are not exact coordinates, which creates uncertainty with spatial joins.


### Plotting point data over a raster

Suppose we want to link the point data to some raster data. Let's first plot **GHSP survey locations** over a map of **population**, and then write some code to estimate population around the survey locations.

In [None]:
# Get Nigeria ADM1 boundaries
url = "https://geodata.ucdavis.edu/gadm/gadm4.1/shp/gadm41_NGA_shp.zip"

# Read in the country-level (ADM0) and state-level (ADM1) shapefiles
# The '!' symbol is used to separate the zip path from the specific file inside
nga0 = gpd.read_file(f"zip+{url}!gadm41_NGA_0.shp")
nga1 = gpd.read_file(f"zip+{url}!gadm41_NGA_1.shp")
nga1.head()

In [None]:
# Load survey geovariables from 2010-11 round
ghsp = pd.read_stata("Data/NGA_HouseholdGeovariables_Y1.dta", convert_categoricals=False)
ghsp.columns

In [None]:
# Load population and keep only NGA
with rasterio.open('Data/GPW.tif') as src:
    # Apply the mask directly to the open file object
    # crop=True drops the rest of the world
    gpw_nga_image, gpw_nga_transform = rasterio.mask.mask(src, nga0.geometry, crop=True, nodata=np.nan)
    # Copy the metadata and update it for the new smaller footprint
    gpw_nga_meta = src.meta.copy()
    gpw_nga_meta.update({
        "driver": "GTiff",
        "height": gpw_nga_image.shape[1],
        "width": gpw_nga_image.shape[2],
        "transform": gpw_nga_transform,
        "nodata": np.nan  # match the nodata value used in the mask above
    })
gpw_nga = gpw_nga_image[0]

In [None]:
fig, ax = plt.subplots(ncols=1, figsize=(10,7))

nga1.plot(ax=ax, color='none', edgecolor='k', label="State boundaries")

nga_extent = plotting_extent(gpw_nga, gpw_nga_transform)
im = ax.imshow(gpw_nga, 
               cmap='viridis', 
               extent=nga_extent, 
               vmin=0.1, vmax=250000, 
               alpha=0.6) # partial transparency so the state boundaries remain visible
fig.colorbar(im)

ax.scatter(ghsp['lon_dd_mod'],ghsp['lat_dd_mod'], marker='+', c='red')

ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.set_xlim([2,15])
ax.set_ylim([4,14])
ax.set_title('GHSP 2010-11 survey locations and 2000 population')
plt.show()

In [None]:
lon_coords

### Geospatial calculations

Now that we've brought together data sources, we can use them to conduct calculations.

Let's calculate population around the survey locations by taking the mean for all cells with centroids within 25 km of the location.

This requires a few steps:
1) Set up a grid of centroids for the GPW data
2) Write a function to estimate Haversine (great circle) distance between pairs of points
3) Determine which grid centroids are within 25 km of each survey location
4) Take the mean of population of those grid cells and assign it to each survey location

In [None]:
# 1. Make a grid of centroid lon_coords and lat_coords 
# gpw_nga.shape gives us (rows, cols)
rows, cols = gpw_nga.shape
col_indices = np.arange(cols)
row_indices = np.arange(rows)
# Use the transform to get the x (lon) and y (lat) of the centroids
# Note: offset='center' returns the coordinates of the center of each pixel
lon_coords, _ = rasterio.transform.xy(gpw_nga_transform, [0] * cols, col_indices, offset='center')
_, lat_coords = rasterio.transform.xy(gpw_nga_transform, row_indices, [0] * rows, offset='center')
# Convert to clean 1D arrays
lon_coords = np.array(lon_coords)
lat_coords = np.array(lat_coords)
# Create the grid of all centroids for Nigeria
lon_mesh, lat_mesh = np.meshgrid(lon_coords, lat_coords)
# ravel() flattens a multidimensional array into a single 1D line of numbers
grid_points = np.vstack([lon_mesh.ravel(), lat_mesh.ravel()]).T

In [None]:
# 2. Write a Haversine distance function
def haversine_km(lon1, lat1, lon2, lat2):
    R = 6371.0 # radius of the earth to convert radians to km
    # distance varies by latitude
    phi1, phi2 = np.radians(lat1), np.radians(lat2) # convert to radians
    # relevant differences
    dphi = np.radians(lat2 - lat1)
    dlambda = np.radians(lon2 - lon1)
    # formula
    a = np.sin(dphi/2)**2 + np.cos(phi1)*np.cos(phi2)*np.sin(dlambda/2)**2
    # convert radian distance to km
    return 2 * R * np.arctan2(np.sqrt(a), np.sqrt(1-a))
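
Before relying on a distance function, it is worth checking it against values we know. The sketch below restates the function so that it runs on its own, then checks that one degree of latitude comes out at about 111.19 km (which is R × π / 180) and that one degree of longitude shrinks with the cosine of latitude. The test coordinates are arbitrary.

```python
import numpy as np

def haversine_km(lon1, lat1, lon2, lat2):
    # Same formula as above, restated so this cell runs on its own
    R = 6371.0
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi = np.radians(lat2 - lat1)
    dlambda = np.radians(lon2 - lon1)
    a = np.sin(dphi/2)**2 + np.cos(phi1)*np.cos(phi2)*np.sin(dlambda/2)**2
    return 2 * R * np.arctan2(np.sqrt(a), np.sqrt(1-a))

# One degree of latitude along a meridian
one_deg_lat = haversine_km(8.0, 9.0, 8.0, 10.0)
# One degree of longitude at the equator and at 60 degrees north
one_deg_lon_equator = haversine_km(8.0, 0.0, 9.0, 0.0)
one_deg_lon_60n = haversine_km(8.0, 60.0, 9.0, 60.0)
# The function also broadcasts over arrays of destination points
dists = haversine_km(8.0, 9.0, np.array([8.0, 8.0]), np.array([9.0, 10.0]))

print(round(one_deg_lat, 2), round(one_deg_lon_equator, 2), round(one_deg_lon_60n, 2))
print(dists)
```

The broadcasting behavior is what lets the next step pass a single survey location together with the full array of grid centroids.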

In [None]:
# 3-4. Identify nearby centroids and take the mean population
# Note: this computes the distance from every survey point to every grid cell,
# so it can be slow with many points and would need refining for larger datasets

# Prep the pop data and survey data
flat_pop = gpw_nga.ravel() # convert to 1D
survey_points = ghsp[['lon_dd_mod', 'lat_dd_mod']].values

# Loop through survey points
est_pop = []
for survey in survey_points:
    # Calculate distance from this survey to every centroid in the Nigeria grid
    dist = haversine_km(survey[0], survey[1], grid_points[:, 0], grid_points[:, 1])
    
    # Filter for centroids within 25km
    nearby_pop = flat_pop[dist <= 25]
    
    # Take the mean, ignoring NaNs (common at borders/coastlines);
    # assign NaN when no cell within range has valid data
    if (~np.isnan(nearby_pop)).any():
        est_pop.append(np.nanmean(nearby_pop))
    else:
        est_pop.append(np.nan)

ghsp['pop_25km_mean'] = est_pop
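
One possible refinement, sketched here on synthetic data, is to discard most grid cells with a cheap bounding-box test so that the Haversine formula only runs on the few candidates near each survey point. The helper `mean_within_radius`, the toy grid, and the toy survey coordinates are all hypothetical, and the box is only meant for small radii away from the poles and the antimeridian.

```python
import numpy as np

def haversine_km(lon1, lat1, lon2, lat2):
    # Same formula as above, restated so this cell runs on its own
    R = 6371.0
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi = np.radians(lat2 - lat1)
    dlambda = np.radians(lon2 - lon1)
    a = np.sin(dphi/2)**2 + np.cos(phi1)*np.cos(phi2)*np.sin(dlambda/2)**2
    return 2 * R * np.arctan2(np.sqrt(a), np.sqrt(1-a))

def mean_within_radius(lon, lat, grid_lon, grid_lat, values, radius_km=25.0):
    """Hypothetical helper: mean of `values` at grid points within radius_km of (lon, lat)."""
    # Deliberately generous box: a degree of latitude is about 111.19 km, and a
    # degree of longitude is shorter by roughly cos(latitude)
    dlat = radius_km / 111.0
    dlon = dlat / np.cos(np.radians(min(abs(lat) + dlat, 89.0)))
    in_box = (np.abs(grid_lat - lat) <= dlat) & (np.abs(grid_lon - lon) <= dlon)
    idx = np.flatnonzero(in_box)
    # Exact great-circle distance, computed only for the candidates in the box
    dist = haversine_km(lon, lat, grid_lon[idx], grid_lat[idx])
    nearby = values[idx][dist <= radius_km]
    nearby = nearby[~np.isnan(nearby)]
    return nearby.mean() if nearby.size else np.nan

# Synthetic 0.05-degree grid over a Nigeria-sized extent, with some missing cells
rng = np.random.default_rng(0)
toy_lon, toy_lat = np.meshgrid(np.arange(2.025, 15, 0.05), np.arange(13.975, 4, -0.05))
toy_lon, toy_lat = toy_lon.ravel(), toy_lat.ravel()
toy_pop = rng.gamma(2.0, 50.0, size=toy_lon.size)
toy_pop[rng.random(toy_pop.size) < 0.05] = np.nan

toy_surveys = np.array([[3.4, 6.5], [7.5, 9.1], [13.2, 11.8]])  # made-up (lon, lat) pairs
fast = [mean_within_radius(lo, la, toy_lon, toy_lat, toy_pop) for lo, la in toy_surveys]
brute = [np.nanmean(toy_pop[haversine_km(lo, la, toy_lon, toy_lat) <= 25])
         for lo, la in toy_surveys]
print(np.allclose(fast, brute))
```

Comparing against the brute-force version on a small sample, as in the last lines, is a simple way to confirm that a faster approach has not changed the results.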

In [None]:
pd.set_option('display.float_format', lambda x: f'{x:,.2f}')
ghsp['pop_25km_mean'].describe()

Let's do another example, this time using a shapefile of major rivers and water bodies in Nigeria. We'll create an indicator for whether each survey location is within 10 km of a (non-ocean) body of water.

First we'll look at the water data and plot it with the survey locations.

In [None]:
nga_wat_area = gpd.read_file("Data/NGA_water_areas_dcw.shp")

In [None]:
fig, ax = plt.subplots(ncols=1, figsize=(10,10))
nga1.plot(ax=ax, color='none', edgecolor='k', label="State boundaries")
nga_wat_area.plot(ax=ax, color='blue', edgecolor='b', alpha=0.1, label="Bodies of water")
ax.scatter(ghsp['lon_dd_mod'],ghsp['lat_dd_mod'], color='r', marker='+',alpha=0.6)
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.set_title('GHSP 2010-11 survey locations and Nigeria water bodies')
plt.show()

Now, let's create the indicator. Instead of manually calculating all the distances, we will do something cleaner and more efficient: we will **use a buffer and a spatial join**, as we saw in previous notebooks.

Here are the steps we will need to follow:
1. Turn the pandas GHSP dataframe into a geopandas dataframe.
2. Reproject the water shapes and the survey points into a coordinate system measured in meters instead of degrees (for distance calculations).
3. Add a 10 km buffer to the water shapes.
4. Do a spatial join to check which points fall inside the buffered water polygons.
5. Add the indicator to the dataframe.
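
To see why step 2 matters, the short sketch below works out how many degrees correspond to 10 km at different latitudes, using a spherical Earth with the same radius as in the Haversine function. The three latitudes are approximate values chosen to span Nigeria.

```python
import math

R_KM = 6371.0  # mean Earth radius on a spherical approximation
km_per_deg_lat = R_KM * math.pi / 180  # roughly constant on a sphere

deg_per_10km = {}
for lat in (4, 9, 14):  # approximate southern, central, and northern latitudes
    # A degree of longitude shrinks with the cosine of latitude
    km_per_deg_lon = km_per_deg_lat * math.cos(math.radians(lat))
    deg_per_10km[lat] = 10 / km_per_deg_lon
    print(f"lat {lat:>2}: 1 degree of longitude is about {km_per_deg_lon:.1f} km; "
          f"10 km is about {deg_per_10km[lat]:.4f} degrees")
```

The distortion is small this close to the equator, but a buffer specified in degrees would still cover slightly different distances east-west and north-south. Projecting to a meter-based system avoids having to reason about this.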

In [None]:
# 1. Convert ghsp to a GeoDataFrame (it is currently a plain pandas DataFrame)
ghsp_gpd = gpd.GeoDataFrame(
    ghsp, 
    geometry=gpd.points_from_xy(ghsp['lon_dd_mod'], ghsp['lat_dd_mod']),
    crs="EPSG:4326" # Start with standard Lat/Lon
)

# 2. Re-project both to a Nigeria-specific meter-based system
# EPSG:26392 (Minna / Nigeria Mid Belt) is a projected CRS for Nigeria with units in meters
ghsp_projected = ghsp_gpd.to_crs(epsg=26392)
water_projected = nga_wat_area.to_crs(epsg=26392)

# 3. Create the 10km buffer around the water bodies
# 10,000 meters = 10km
water_buffer = water_projected.copy()
water_buffer['geometry'] = water_projected.buffer(10000)

# 4. Perform a Spatial Join
# 'predicate="within"' checks if the point is inside the buffer
points_near_water = gpd.sjoin(ghsp_projected, water_buffer, how="left", predicate="within")
# A point can match several buffers, so collapse to one row per original index; we just need ANY match
# If the join found a match, the index of the water object (index_right) won't be NaN
is_near = points_near_water['index_right'].groupby(points_near_water.index).first().notnull().astype(int)

# 5. Create the indicator (the assignment aligns on the shared index)
ghsp['near_water_10km'] = is_near

In [None]:
# Plot it
fig, ax = plt.subplots(ncols=1, figsize=(10,10))
nga1.plot(ax=ax, color='none', edgecolor='k', label="State boundaries")
nga_wat_area.plot(ax=ax, color='blue', edgecolor='b', alpha=0.1, label="Bodies of water")
scatter = ax.scatter(ghsp['lon_dd_mod'],ghsp['lat_dd_mod'], 
                     c=ghsp['near_water_10km'], marker='o', 
                     cmap='PuOr', edgecolors='black',
                     linewidths=0.1, alpha=0.6, s=40)
handles, labels = scatter.legend_elements()
ax.legend(handles, ["Far (>10km)", "Near (<10km)"], title="Water Proximity", loc='lower right')
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.set_title('GHSP 2010-11 survey locations and Nigeria water bodies')
plt.show()