# Spatial Analysis

"Everything is related to everything else, but near things are more related than distant things." -Waldo Tobler

If the strength of a relationship between entities increases with their proximity, then spatial analysis/modeling is essential for understanding the relationship's process and pattern. Today we focus on exploratory spatial data analysis (ESDA) to discover patterns in spatial data.

Overview of today's topics:
  - Tobler's first law of geography
  - spatial weights
  - spatial interpolation
  - spatial lag
  - spatial autocorrelation
  - hot spot mapping
  
Today we will conduct an exploratory spatial analysis of LA county household income.

In [None]:
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pysal as ps
import seaborn as sns
from scipy import stats

np.random.seed(0)

## 1. Data prep

### 1.1. California census tract geometries

In [None]:
# load CA tracts, display shape
tracts_ca = gpd.read_file('../../data/tl_2017_06_tract/')
tracts_ca = tracts_ca.set_index('GEOID')
tracts_ca.shape

In [None]:
# what variables are present?
tracts_ca.columns

In [None]:
# inspect the first 5 rows
tracts_ca.head()

In [None]:
# retain LA county only (and drop channel island tracts)
tracts_ca = tracts_ca[tracts_ca['COUNTYFP']=='037'].drop(index=['06037599100', '06037599000'])
tracts_ca.shape

In [None]:
# project spatial geometries to a meter-based projection for SoCal
crs = '+proj=utm +zone=11 +ellps=WGS84 +datum=WGS84 +units=m +no_defs'
tracts_ca = tracts_ca.to_crs(crs)

### 1.2. California tract-level census variables

In [None]:
# load CA tract-level census variables
df_census = pd.read_csv('../../data/census_tracts_data_ca.csv', dtype={'GEOID10':str}).set_index('GEOID10')
df_census.shape

In [None]:
df_census.columns

In [None]:
df_census.head()

### 1.3. Merge the data

In [None]:
# merge tract geometries with census variables
tracts = tracts_ca.merge(df_census, left_index=True, right_index=True, how='left')
tracts.shape

In [None]:
# calculate pop density in persons per sq km
# turn any infinities into nulls
tracts['pop_density'] = tracts['total_pop'] / (tracts['ALAND'] / 1e6)
tracts = tracts.replace([np.inf, -np.inf], np.nan)

In [None]:
tracts.columns

## 2. Initial exploration

Let's do some quick mapping and analysis of distributions and correlations for a couple variables of interest.

In [None]:
# descriptive stats
tracts['med_household_income'].describe()

In [None]:
# descriptive stats
tracts['pop_density'].describe()

In [None]:
# inspect these variables' statistical distributions
fig, axes = plt.subplots(1, 2, figsize=(12, 2))
ax1 = sns.boxplot(ax=axes[0], x=tracts['med_household_income'])
ax2 = sns.boxplot(ax=axes[1], x=tracts['pop_density'])

In [None]:
# map a couple variables to inspect their spatial distributions
cols = ['pop_density', 'med_household_income']
for col in cols:
    ax = tracts.dropna(subset=[col]).plot(column=col,
                                          scheme='NaturalBreaks',
                                          cmap='plasma',
                                          figsize=(4, 4),
                                          legend=True,
                                          legend_kwds={'bbox_to_anchor': (1.7, 1)})
    ax.set_title(col)
    _ = ax.axis('off')

Looks like we have some missing values. We'll spatially interpolate them later.

Visually, these two variables appear to be negatively correlated: in general, where one is high, the other is low.

In [None]:
# calculate correlation coefficient and p-value
subset = tracts.dropna(subset=['pop_density', 'med_household_income'])
r, p = stats.pearsonr(x=subset['pop_density'],
                      y=subset['med_household_income'])
print('r={:.4f}, p={:.4f}'.format(r, p))

In [None]:
# quick and dirty scatter plot with matplotlib
fig, ax = plt.subplots()
sc = ax.scatter(x=subset['pop_density'],
                y=subset['med_household_income'],
                s=1)

In [None]:
# estimate a simple linear regression model with scipy
# what if you log transform your variables first?
m, b, r, p, se = stats.linregress(x=subset['pop_density'],
                                  y=subset['med_household_income'])
print(f'm={m:.4f}, b={b:.4f}, r^2={r**2:.4f}, p={p:.4f}')

Every 1 person/km^2 increase in density is associated with a change of *m* (the estimated slope) in median household income.
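The code comment above asks what happens if you log transform the variables first. As a rough sketch on synthetic data (the values, the seed, and the "true" elasticity of -0.3 are all assumptions for illustration, not results from the tract data), the slope of a log-log fit can be read as an elasticity:

```python
import numpy as np

# synthetic data (assumed values, not the tract data): income falls with density
rng = np.random.default_rng(0)
density = rng.lognormal(mean=8, sigma=1, size=500)   # persons per sq km
noise = rng.normal(loc=0, scale=0.1, size=500)
income = np.exp(11) * density ** -0.3 * np.exp(noise)  # built-in elasticity of -0.3

# fit log(income) = b + m * log(density)
m_log, b_log = np.polyfit(np.log(density), np.log(income), deg=1)
print(f'elasticity estimate m={m_log:.3f}')
```

In a log-log model like this one, a 1% increase in density is associated with roughly an *m*% change in income, which is often easier to interpret for skewed variables.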

In [None]:
# now it's your turn
# look through the list of columns, pick two new variables, and map them
# do they look like they are correlated? would you expect them to be?


## 3. Spatial weights matrix

Spatial analysis depends on spatial relationships. A spatial weights matrix defines the spatial relationships among our units of analysis (tracts, in this case). It tells how they're spatially connected to one another. These weights can take on many different forms. Pick the right form for your theoretical needs, including:

  - rook contiguity
  - queen contiguity
  - k-nearest neighbors
  - distance band

### 3.1. Contiguity-based weights: rook contiguity

Using rook contiguity, two spatial units must share an edge of their boundaries to be considered neighbors. This isn't terribly common in practice since queen contiguity is usually more useful, but it's worth understanding as a simple example.

In [None]:
# get the tract labels (GEOIDs) and pick one (arbitrarily) to work with throughout
labels = tracts.index.tolist()
label = labels[603]
label

In [None]:
%%time
# calculate rook spatial weights
w_rook = ps.lib.weights.Rook.from_dataframe(tracts, ids=labels, id_order=labels)

In [None]:
# find the neighbors of some tract
# this is a raw contiguity matrix, so weights are binary 1s and 0s meaning neighbor/not
w_rook[label]

### 3.2. Contiguity-based weights: queen contiguity

Using queen contiguity, two spatial units need only share a vertex (a single point) of their boundaries to be considered neighbors.
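To make the rook/queen distinction concrete, here is a small sketch on a hypothetical 3x3 grid of square cells (the `grid_neighbors` helper is made up for this illustration and is not part of PySAL). Cells are numbered 0 to 8, row by row:

```python
def grid_neighbors(nrows, ncols, queen=False):
    """Return {cell_id: sorted list of neighbor ids} for a regular grid."""
    # rook moves share an edge; queen adds the four diagonal (vertex-only) moves
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if queen:
        steps += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    neighbors = {}
    for r in range(nrows):
        for c in range(ncols):
            found = []
            for dr, dc in steps:
                rr, cc = r + dr, c + dc
                # keep only moves that stay inside the grid
                if 0 <= rr < nrows and 0 <= cc < ncols:
                    found.append(rr * ncols + cc)
            neighbors[r * ncols + c] = sorted(found)
    return neighbors

rook = grid_neighbors(3, 3)
queen = grid_neighbors(3, 3, queen=True)
print('rook neighbors of center cell: ', rook[4])
print('queen neighbors of center cell:', queen[4])
```

The center cell has four rook neighbors but eight queen neighbors, because the four corner cells touch it only at a single vertex.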

In [None]:
%%time
# calculate queen spatial weights
w_queen = ps.lib.weights.Queen.from_dataframe(tracts, ids=labels, id_order=labels)

In [None]:
# find the neighbors of some tract
# this is a raw contiguity matrix, so weights are binary 1s and 0s meaning neighbor/not
w_queen[label]

In [None]:
# how many neighbors does this tract have?
w_queen.cardinalities[label]

In [None]:
# convert cardinalities to series and describe data
pd.Series(w_queen.cardinalities).describe()

How many neighbors does the average tract have?

In [None]:
# min number of neighbors
w_queen.min_neighbors

In [None]:
# max number of neighbors
w_queen.max_neighbors

In [None]:
# islands are observations with no neighbors, disconnected in space (can cause modeling problems)
w_queen.islands

##### Plot a census tract of interest, along with its neighbors:

In [None]:
fig, ax = plt.subplots(figsize=(6, 6))
tracts.plot(ax=ax, facecolor='#666666', edgecolor='w', linewidth=0.5)

# plot some tract of interest in red
tract = tracts.loc[[label]]
tract.plot(ax=ax, facecolor='#ff0000', edgecolor='w', linewidth=2)

# plot the neighbors in blue
neighbors = tracts.loc[list(w_queen[label])]
neighbors.plot(ax=ax, facecolor='#0033cc', edgecolor='w', linewidth=2)

# zoom to area of interest
xmin, ymin, xmax, ymax = neighbors.unary_union.bounds
ax.axis('equal')
ax.set_xlim(xmin-100, xmax+100)  # +/- 100 meters
ax.set_ylim(ymin, ymax)

ax.set_title('Neighbors of tract {}'.format(label))
_ = ax.axis('off')

In [None]:
%%time
# draw a queen-contiguity graph of the tracts
fig, ax = plt.subplots(figsize=(12, 12), facecolor='#111111')
tracts.plot(ax=ax, facecolor='#333333', edgecolor='k', linewidth=0.3)

# extract centroids of tract and its neighbors, then draw lines between them
for tract, neighbors in w_queen:
    tract_centroid = tracts.loc[tract, 'geometry'].centroid
    for neighbor_centroid in tracts.loc[list(neighbors), 'geometry'].centroid:
        Xs = [tract_centroid.x, neighbor_centroid.x]
        Ys = [tract_centroid.y, neighbor_centroid.y]
        ax.plot(Xs, Ys, color='r', linewidth=0.3)
_ = ax.axis('off')

### 3.3. Distance-based weights: *k*-nn

Find the *k*-nearest neighbors of each tract, by centroid.
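As a minimal sketch of the idea, independent of PySAL, the following finds each point's *k* nearest neighbors from a handful of hypothetical centroid coordinates (the coordinates and `k=2` are assumptions for illustration):

```python
import numpy as np

# hypothetical centroid coordinates in meters (not real tract centroids)
centroids = np.array([[0, 0], [100, 0], [0, 250], [400, 400], [120, 90]], dtype=float)
k = 2

# pairwise Euclidean distances between all centroids
diffs = centroids[:, np.newaxis, :] - centroids[np.newaxis, :, :]
dists = np.sqrt((diffs ** 2).sum(axis=2))

# exclude each point from its own neighbor list, then take the k smallest distances
np.fill_diagonal(dists, np.inf)
knn = np.argsort(dists, axis=1)[:, :k]
print(knn)
```

Note that *k*-nn relationships need not be symmetric: in this toy example, point 4 is among the nearest neighbors of point 3, but point 3 is not among the nearest neighbors of point 4.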

In [None]:
%%time
# k-nearest neighbors finds the closest k tract centroids to each tract centroid
w_knn = ps.lib.weights.KNN.from_dataframe(tracts, k=6)

In [None]:
# they all have exactly k neighbors
w_knn.neighbors[label]

### 3.4. Distance-based weights: distance band

Here, other tracts are considered neighbors of some tract if they are within a given threshold distance of it, by centroid. Distance band weights can be specified to take on continuous values rather than binary (1s and 0s), with these values being the inverse distance between each pair of "neighboring" units.

  - linear distance-decay exponent is -1, so $w_l=\frac{1}{d}$
  - gravity model distance-decay exponent is -2, so $w_g=\frac{1}{d^2}$
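The two decay choices can be compared with a short sketch. The distances and the threshold below are hypothetical values chosen for illustration:

```python
import numpy as np

# hypothetical centroid-to-centroid distances (meters) from one tract to four others
d = np.array([500.0, 1000.0, 2000.0, 5000.0])
threshold = 2500.0  # assumed distance band

# only tracts within the band are neighbors
in_band = d <= threshold

# inverse-distance weights: w = d ** alpha, with alpha = -1 (linear) or -2 (gravity)
w_linear = np.where(in_band, d ** -1.0, 0.0)
w_gravity = np.where(in_band, d ** -2.0, 0.0)

# how much more does the 500 m neighbor count than the 2000 m neighbor?
ratio_linear = w_linear[0] / w_linear[2]
ratio_gravity = w_gravity[0] / w_gravity[2]
print(f'linear decay ratio:  {ratio_linear:.1f}')
print(f'gravity decay ratio: {ratio_gravity:.1f}')
```

A neighbor four times closer gets about four times the weight under linear decay and about sixteen times the weight under gravity decay, so the steeper exponent concentrates influence on the very nearest units.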

In [None]:
# calculate maximum nearest neighbor distance so each unit is assured of >=1 neighbor
x = tracts.centroid.x
y = tracts.centroid.y
coords = np.array([x, y]).T
threshold = ps.lib.weights.min_threshold_distance(coords)
threshold

In [None]:
%%time
# calculate linear decay continuous weights
w_dist = ps.lib.weights.distance.DistanceBand.from_dataframe(tracts,
                                                             threshold=threshold,
                                                             binary=False,
                                                             alpha=-1)

In [None]:
# how many distance-band neighbors does our tract have?
len(w_dist.neighbors[label])

In [None]:
# map the neighbors, colored by weight from nearest to furthest
fig, ax = plt.subplots(figsize=(6, 6))
tracts.plot(ax=ax, facecolor='#333333', edgecolor='gray', linewidth=0.1)

# get the tract of interest and its neighbors/weights
tract = tracts.loc[[label]]
weights = pd.Series(w_dist[label])
neighbors = tracts.loc[weights.index, ['geometry']]
neighbors['weights_scaled'] = weights

# plot the tract's neighbors in blues by weight
neighbors.plot(ax=ax,
               column='weights_scaled',
               cmap='Blues_r',
               edgecolor='gray',
               linewidth=0.3,
               scheme='NaturalBreaks')

# plot the tract of interest in red
tract.plot(ax=ax, facecolor='r', edgecolor='r', linewidth=0.1)

# zoom to area of interest
xmin, ymin, xmax, ymax = neighbors.unary_union.bounds
ax.set_xlim(xmin, xmax)
ax.set_ylim(ymin, ymax)
ax.set_title('Neighbors of tract {}'.format(label))
_ = ax.axis('off')

In [None]:
# now it's your turn
# recompute the distance-based spatial weights with a gravity decay
# how does this impact the number of neighbors and the map above? why?


### 3.5. Standardizing weights

A spatial weights matrix with raw values (e.g., binary 1s and 0s for neighbor/not) is not always the best for analysis. Some sort of standardization is useful. Typically, we want to apply a row-based transformation so every row of the matrix sums up to 1. We'll see some examples of why this matters in practice shortly.
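A minimal sketch of row standardization with plain NumPy, using a hypothetical binary contiguity matrix for four units (it assumes no unit is an island, since a row of all zeros cannot be rescaled this way):

```python
import numpy as np

# binary contiguity matrix for four hypothetical units (1 = neighbors)
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)

# divide each row by its sum so every row's weights add up to 1
W_row = W / W.sum(axis=1, keepdims=True)
print(W_row)
print(W_row.sum(axis=1))
```

After the transformation, a unit with three neighbors gives each a weight of 1/3, while a unit with a single neighbor gives it a weight of 1, so units with many neighbors no longer dominate sums over the matrix.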

In [None]:
# inspect the neighbors and weights of our tract
w_queen[label]

In [None]:
# check the current transformation of the weights matrix (O = original)
w_queen.get_transform()

In [None]:
# transform the queen weights
w_queen.set_transform('R')
w_queen[label]

In [None]:
# transform the linear-decay distance-based weights
w_dist.set_transform('R')
#w_dist[label]

PySAL supports the following transformations:

  - O: original, returning the object to the initial state
  - B: binary, with every neighbor assigned a weight of 1
  - R: row-based, with all the neighbors of a given observation adding up to 1
  - V: variance stabilizing, with the sum of all the weights being constrained to the number of observations

It can take a long time to calculate a weights matrix for a large data set. Once you've created yours, you might want to save it to disk to re-use in subsequent analyses.

In [None]:
# save your matrix to disk
f = ps.lib.io.open('tracts_queen.gal', 'w')
f.write(w_queen)
f.close()

# read a matrix from disk (notice its transformation)
w_queen = ps.lib.io.open('tracts_queen.gal', 'r').read()
w_queen[label]

## 4. Spatial interpolation

Interpolation lets you estimate unobserved values based on observed values. With spatial data, you can perform spatial interpolation by filling in missing data points based on nearby values. This assumes positive spatial autocorrelation exists: more on that in a moment. Remember Tobler's first law of geography.
  
  - **Nearest neighbor** interpolation is perhaps the simplest method: just assign the value of the nearest neighbor
  - **Local averaging** assigns missing values by taking the average of adjacent neighbors or neighbors within some radius
  - **Inverse distance weighting** assigns missing values using a distance weighted average: that is, the mean weighs nearby values more than it weighs distant values (and your distance decay choice is important!)
  - **Kriging** is a sophisticated method that incorporates information about spatial trends and autocorrelation with a variogram

We'll look at an example comparing local averaging to inverse distance weighting.
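Before applying these to the tract data, here is a toy version with made-up numbers (three hypothetical neighbors, with assumed values and distances) showing how the two estimates can differ:

```python
import numpy as np

# hypothetical observed values at three neighbors and their distances (meters)
values = np.array([40000.0, 60000.0, 90000.0])
dists = np.array([500.0, 1000.0, 2000.0])

# local averaging: every neighbor counts equally
local_avg = values.mean()

# inverse distance weighting: weights are 1/d, normalized to sum to 1
weights = 1 / dists
weights = weights / weights.sum()
idw = (values * weights).sum()

print(f'local average: {local_avg:.0f}')
print(f'IDW estimate:  {idw:.0f}')
```

The inverse distance weighted estimate is pulled toward the nearest neighbor's value, so here it comes out lower than the simple local average.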

In [None]:
# how many tracts are missing values for this variable?
col = 'med_household_income'
nulls = tracts[pd.isnull(tracts[col])].index
len(nulls)

In [None]:
# for example, this tract is missing a value
tract = nulls[0]
tract

In [None]:
# local averaging: equal-weighted queen-adjacent tracts
# (w_queen[tract] is a dict of neighbor: weight, so take its keys as a list)
neighbors = list(w_queen[tract])
tracts.loc[neighbors, col].mean()

In [None]:
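# or, calculate inverse distance weighted mean
# (re-normalize the weights over the neighbors that have an observed value)
inv_dist_wt = pd.Series(w_dist[tract])
values = tracts.loc[inv_dist_wt.index, col]
(values * inv_dist_wt).sum() / inv_dist_wt[values.notnull()].sum()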

In [None]:
# or, interpolate all the missing values across this variable
# (re-normalize the weights over the neighbors that have an observed value)
estimates = {}
for tract in nulls:
    inv_dist_wt = pd.Series(w_dist[tract])
    values = tracts.loc[inv_dist_wt.index, col]
    estimates[tract] = (values * inv_dist_wt).sum() / inv_dist_wt[values.notnull()].sum()
pd.Series(estimates).head()

In [None]:
# now it's your turn
# spatially interpolate missing values for median home value


## 4. Spatial lag

Spatial lag tells us how values are located relative to other (dis)similar values nearby. While spatial interpolation filled in unobserved (missing) values using nearby values, spatial lag lets us compare observed values to their nearby values.

Here we calculate the spatial lag of a variable. If the spatial weights matrix is row-standardized (important), then the spatial lag is the average value of an observation's neighbors, however "neighbor" is defined in the matrix.
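To see why row standardization matters here, consider a sketch with a hypothetical four-unit contiguity matrix and made-up values. The spatial lag is just the matrix product of the weights and the variable:

```python
import numpy as np

# binary contiguity matrix for four hypothetical units and their observed values
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
y = np.array([10.0, 20.0, 30.0, 40.0])

# with raw binary weights, the lag is the SUM of each unit's neighbors' values
lag_raw = W @ y

# with row-standardized weights, the lag is the MEAN of the neighbors' values
W_row = W / W.sum(axis=1, keepdims=True)
lag_row = W_row @ y

print(lag_raw)
print(lag_row)
```

For example, the first unit neighbors the second and third units (values 20 and 30), so its raw lag is their sum and its row-standardized lag is their average.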

In [None]:
# pick a variable to investigate and drop null rows
col = 'med_household_income'
tracts_not_null = tracts[[col, 'geometry']].dropna()
y = tracts_not_null[col]

In [None]:
# recompute spatial weights for just these observations then row-standardize
w_queen = ps.lib.weights.Queen.from_dataframe(tracts_not_null)
w_queen.set_transform('R')

In [None]:
# compute spatial lag
y_lag = ps.lib.weights.lag_spatial(w_queen, y)

In [None]:
# is a tract's med income similar to those of its neighbors?
col_lag = f'{col}_lag'
data_lag = pd.DataFrame(data={col:y, col_lag:y_lag}).astype(int)
data_lag

## 5. Spatial autocorrelation

Spatial autocorrelation is a central question in ESDA. Statistical models typically assume that the observations are independent of each other. This assumption is violated when a variable's value at one location is correlated with its value at nearby locations.

Such spatial autocorrelation is common in the real world due to proximity-based spillover effects. For example, a home's value may be a function of its own characteristics and accessibility, but it is also a function of nearby homes' values. In other words, homes near one another tend to have similar home values.

  - **positive** spatial autocorrelation: nearby values tend to be more similar (e.g. income, home values, temperature, rainfall)
  - **negative** spatial autocorrelation: nearby values tend to be more dissimilar (e.g. fire stations, libraries)

Substantive spatial autocorrelation can be explained by social or economic theory that describes a spatial relationship. Nuisance spatial autocorrelation stems from data problems.

In [None]:
# does household income exhibit spatial autocorrelation?
# let's find out
data_lag.sample(5)

### 5.1. Moran's I

Moran's I measures *global* spatial autocorrelation: do things tend to be near other (dis)similar things? Values > 0 indicate positive spatial autocorrelation, and values < 0 indicate negative spatial autocorrelation.
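As a sketch of what the statistic computes, Moran's I can be written as $I=\frac{n}{S_0}\frac{z^\top W z}{z^\top z}$, where $z$ holds the deviations from the mean, $W$ is the weights matrix, and $S_0$ is the sum of all weights. The toy function and data below are assumptions for illustration, not the `esda` implementation:

```python
import numpy as np

def morans_i(y, W):
    """Global Moran's I for values y and an n-by-n spatial weights matrix W."""
    y = np.asarray(y, dtype=float)
    z = y - y.mean()   # deviations from the mean
    s0 = W.sum()       # sum of all weights
    return (len(y) / s0) * (z @ W @ z) / (z @ z)

# six hypothetical units along a line; each unit neighbors the unit(s) next to it
n = 6
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1

clustered = [1, 1, 1, 9, 9, 9]    # similar values sit next to each other
alternating = [1, 9, 1, 9, 1, 9]  # dissimilar values sit next to each other

i_clustered = morans_i(clustered, W)
i_alternating = morans_i(alternating, W)
print(f'clustered:   {i_clustered:.2f}')
print(f'alternating: {i_alternating:.2f}')
```

The clustered arrangement yields a positive I and the alternating arrangement yields a negative I, even though both contain exactly the same set of values.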

In [None]:
# calculate the statistic
mi = ps.explore.esda.Moran(data_lag[col], w_queen)

In [None]:
# show the I value
mi.I

In [None]:
# statistical inference: show the p value
mi.p_sim

If we generated a large number of maps with the same values but randomly allocated over space, and calculated Moran's I for each of these maps, a `p_sim` of 0.001 means that only about 1/1000 of them would display a value as extreme as the one we computed from the real-world data set. In other words, there is roughly a 1/1000 chance of observing a Moran's I this extreme if the variable's spatial distribution were random. With a p-value that small, together with a positive I, we can conclude that the variable's distribution is statistically significantly positively spatially autocorrelated.
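The permutation logic described above can be sketched in a few lines. This is a simplified, one-sided version on synthetic data (the line-shaped neighbor structure, the sorted values, and the seed are all assumptions), not a reproduction of how `esda` computes `p_sim`:

```python
import numpy as np

def morans_i(y, W):
    """Global Moran's I for values y and an n-by-n spatial weights matrix W."""
    z = y - y.mean()
    return (len(y) / W.sum()) * (z @ W @ z) / (z @ z)

# 30 hypothetical units along a line; each unit neighbors the unit(s) next to it
n = 30
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1

# sorted values are strongly clustered in space: low at one end, high at the other
rng = np.random.default_rng(0)
y = np.sort(rng.normal(size=n))
observed = morans_i(y, W)

# shuffle the same values over space many times and recompute the statistic
permutations = 999
simulated = np.array([morans_i(rng.permutation(y), W) for _ in range(permutations)])

# pseudo p-value: share of maps (counting the observed one) at least as extreme
p_sim = ((simulated >= observed).sum() + 1) / (permutations + 1)
print(f'I={observed:.3f}, p_sim={p_sim:.3f}')
```

Because the observed map is counted among the simulations, the smallest possible pseudo p-value with 999 permutations is 1/1000.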

In [None]:
# now it's your turn
# calculate the moran's I of median home values
# is it statistically significant? what does it tell you?


### 5.2. Moran plots

A Moran plot is a scatter plot of the spatially lagged values (y-axis) against the original variable's values (x-axis).

In [None]:
fig, ax = plt.subplots(figsize=(6, 6))
sns.regplot(x=col, y=col_lag, data=data_lag, scatter_kws={'s':1, 'color':'gray'})
plt.show()

Notice the 95% confidence interval shading and the positive slope. Given the p-value of Moran's I that we calculated earlier, we can conclude that the slope of the line is statistically-significantly different from zero.

More useful, however, is a **standardized** Moran plot. Moran's I is the slope of the line in the standardized Moran plot, which makes this all a bit easier to conceptualize.
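The claim that Moran's I equals the slope in the standardized Moran plot can be checked numerically. This sketch uses synthetic values and a hypothetical ring-shaped neighbor structure, and it assumes row-standardized weights:

```python
import numpy as np

# hypothetical neighbor structure: eight units on a ring, each touching two others
n = 8
W = np.zeros((n, n))
for i in range(n):
    W[i, (i + 1) % n] = W[(i + 1) % n, i] = 1
W_row = W / W.sum(axis=1, keepdims=True)

# synthetic values, standardized to z-scores, and their spatial lag
rng = np.random.default_rng(0)
y = rng.normal(size=n)
z = (y - y.mean()) / y.std()
z_lag = W_row @ z

# slope of the regression of the lag on the standardized values
slope, intercept = np.polyfit(z, z_lag, deg=1)

# Moran's I from its formula (n / S0 equals 1 when weights are row-standardized)
moran = (n / W_row.sum()) * (z @ W_row @ z) / (z @ z)
print(f'slope={slope:.6f}, I={moran:.6f}')
```

The two numbers agree because the z-scores have mean zero, so the least-squares slope reduces to the same ratio that defines Moran's I.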

In [None]:
# standardize the variable's values (i.e., calculate z-scores)
y_std = (y - y.mean()) / y.std()
y_std.head()

In [None]:
# compute spatial lag of standardized values and save as series with same index
y_std_lag = pd.Series(ps.lib.weights.lag_spatial(w_queen, y_std),
                      index=y_std.index,
                      name=col_lag)
y_std_lag

In [None]:
# estimate a simple linear regression model
m, b, r, p, se = stats.linregress(x=y_std, y=y_std_lag)
print('m={:.4f}, b={:.4f}, r^2={:.4f}, p={:.4f}'.format(m, b, r ** 2, p))

In [None]:
# the slope is the same as moran's I, calculated earlier
mi.I

In [None]:
# standardized moran's plot
fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(x=y_std, y=y_std_lag, s=1, color='gray')

# draw quadrants and ignore outliers beyond 3 std devs (~99.7% of a normal distribution)
plt.axvline(0, c='k', alpha=0.5)
plt.axhline(0, c='k', alpha=0.5)
ax.set_xlim(-3, 3)
ax.set_ylim(-3, 3)

# draw a line with moran's I as the slope
Xs = pd.Series([-3, 3])
Ys = Xs * mi.I
line = ax.plot(Xs, Ys, lw=2)

In [None]:
# now it's your turn
# visualize a standardized moran's plot of median home values


### 5.3. LISAs

Local Indicators of Spatial Autocorrelation: are there specific areas with high concentrations of (dis)similar values?

Moran's I tells us about spatial clustering globally across the data set as a whole. However, it does not tell us where these clusters occur. For that, we need a local measure. Essentially, we will classify the data set's observations into four groups based on the four quadrants of the Moran plot:

  1. **HH**: high value near other high values (*hot spots*)
  1. **LL**: low value near other low values (*cold spots*)
  1. **HL**: high value near low values (*spatial outliers*)
  1. **LH**: low value near high values (*spatial outliers*)
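A minimal sketch of this classification, using made-up standardized values and lags for five hypothetical units. It only assigns quadrants from the signs of the two quantities and says nothing about statistical significance:

```python
import numpy as np

# hypothetical standardized values and their spatial lags for five units
z = np.array([1.2, -0.8, 0.9, -1.1, 0.3])
z_lag = np.array([0.7, -0.5, -0.4, 0.6, 0.1])

# the quadrant is determined by the signs of the value and of its spatial lag
quadrant = np.where((z > 0) & (z_lag > 0), 'HH',
                    np.where((z <= 0) & (z_lag <= 0), 'LL',
                             np.where((z > 0) & (z_lag <= 0), 'HL', 'LH')))
print(quadrant.tolist())
```

The first letter describes the unit's own value and the second describes its neighbors' average, so the third unit (high value, low lag) is labeled HL.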

Let's see what that looks like, visually.

In [None]:
# standardized moran's plot again, from above, but labeled this time
fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(x=y_std, y=y_std_lag, s=1, color='gray')

# draw quadrants and ignore outliers beyond 3 std devs
plt.axvline(0, c='k', alpha=0.5)
plt.axhline(0, c='k', alpha=0.5)
ax.set_xlim(-3, 3)
ax.set_ylim(-3, 3)

# label the quadrants
ax.text(1.25, 1.25, 'HH', fontsize=30)
ax.text(1.25, -1.75, 'HL', fontsize=30)
ax.text(-1.75, 1.25, 'LH', fontsize=30)
ax.text(-1.75, -1.75, 'LL', fontsize=30)

# draw a line with moran's I as the slope
Xs = pd.Series([-3, 3])
Ys = Xs * mi.I
line = ax.plot(Xs, Ys, lw=2)

In [None]:
# calculate LISA values using the queen spatial weights
lisa = ps.explore.esda.Moran_Local(data_lag[col], w_queen)

In [None]:
# set the statistical significance threshold (alpha)
alpha = 0.05

In [None]:
# identify whether each observation is significant or not
# p-value interpretation same as earlier with moran's I
data_lag['significant'] = lisa.p_sim < alpha
data_lag['significant'].value_counts()

In [None]:
# identify the quadrant each observation belongs to
data_lag['quadrant'] = lisa.q
data_lag['quadrant'] = data_lag['quadrant'].replace({1:'HH', 2:'LH', 3:'LL', 4:'HL'})
data_lag['quadrant'].sort_values().value_counts()

In [None]:
# what have we got in the end?
data_lag

##### Now map the tracts, colored according to their LISA quadrants, to identify clusters:

In [None]:
fig, ax = plt.subplots(figsize=(9, 9))

# merge original tracts and LISA quadrants data together, plot tracts basemap
tracts_lisa = tracts.merge(data_lag, how='left', left_index=True, right_index=True)
tracts_lisa.plot(ax=ax, facecolor='#999999', edgecolor='k', linewidth=0.1)

# plot each quadrant's tracts (if significant LISA) in a different color
quadrant_colors = {'HH':'r', 'LL':'b', 'LH':'skyblue', 'HL':'pink'}
for q, c in quadrant_colors.items():
    mask = tracts_lisa['significant'] & (tracts_lisa['quadrant']==q)
    rows = tracts_lisa.loc[mask]
    rows.plot(ax=ax, color=c, edgecolor='k', linewidth=0.1)

ax.axis('off')
fig.savefig('clusters.png', dpi=600, bbox_inches='tight')

How do we interpret this map?

  - Gray tracts have a statistically insignificant LISA value (no significant local spatial autocorrelation)
  - In red we see clusters of tracts with high values surrounded by other high values
  - In blue we see clusters of tracts with low values surrounded by other low values
  - In pink, we see the first type of spatial outliers: tracts with high values but surrounded by low values
  - In light blue we see the other type of spatial outlier: tracts with low values surrounded by other tracts with high values

## In-class exercise

To practice exploratory spatial analysis, do the following:

  1. Select the tracts in a different CA county
  1. Calculate a new spatial weights matrix for this subset
  1. Choose a new variable from the data set
  1. Calculate its Moran's I
  1. Visualize its Moran's plot
  1. Calculate and map its LISA values