# Lecture 15: Geospatial Analysis

## Why Geospatial Analysis?

* *"Everything is related to everything else, but near things are more related than distant things."* (Tobler 1979)
* *"...the purpose of geographic inquiry is to examine relationships between geographic features collectively and to describe the real-world phenomena that map features present."* (Clarke 2001)
* Clearly visualizes important differences

## Visualizing Geospatial Data

* Choropleth maps are useful for visualizing *clear regional patterns* in the data
* Use light colors for low values, dark colors for high values
* Choropleth maps should display relative differences, *not* absolute numbers
* Choropleth maps can be misleading
* Consider using the smallest unit possible (but there are exceptions!)

### Bubble graphs can represent a third (or fourth!) dimension, giving a sense of relative contribution

* Useful for comparing data in 3 numeric-data dimensions
* A 4th dimension can be represented by color
* The X and Y values are plotted as in a scatterplot; the third dimension is shown as a circle whose size is determined by its value
* Helps with relative comparisons
* Cannot be used to display a lot of data, difficult to express actual values

### Bubble Maps

* Bubble position is given by latitude and longitude coordinates
* Bubble size encodes a third variable, such as population density, COVID cases, etc.
* Notes:
    * Consider scaling bubble area (rather than radius) to the value, to avoid exaggerating bubble sizes
    * Use transparency so that overlapping bubbles remain visible
    * Use a legend
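The area-versus-radius note can be made concrete with a small sketch. The function name `bubble_radius`, the `max_radius` parameter, and the case counts below are hypothetical, chosen only for illustration. Because a circle's area grows with the square of its radius, the radius is scaled by the square root of the value so that area stays proportional to the data.

```python
import math

def bubble_radius(value, max_value, max_radius=20.0):
    """Return a radius such that bubble AREA is proportional to value."""
    if value < 0 or max_value <= 0:
        raise ValueError("value must be non-negative and max_value positive")
    return max_radius * math.sqrt(value / max_value)

# Hypothetical case counts for three locations (illustrative only)
cases = {"A": 100, "B": 400, "C": 1600}
largest = max(cases.values())
radii = {name: bubble_radius(v, largest) for name, v in cases.items()}

for name, r in radii.items():
    print(name, round(r, 2), round(math.pi * r ** 2, 1))
```

Location C has 16 times the cases of location A, and its bubble has 16 times the area but only 4 times the radius. Scaling the radius directly would have made C's bubble 256 times larger in area.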

## Visualization Choices

* Cartograms should be considered when displaying how many people were affected
* Isarithmic maps demonstrate smooth, continuous phenomena (temperature, elevation, rainfall, etc.)

**CLICKER QUESTION**

You want to visualize how many people have been affected by COVID-19 worldwide. What is the best approach to visualizing these data?

A) Choropleth

**B) Cartogram**

C) Isarithmic Graph

D) Bar Chart

E) Scatterplot

## Spatial Statistics: The Why

### Spatial Statistics

The statistical techniques we've discussed so far don't work well when considering spatial distributions...

...which means we have a chance to look at the data, and the relationships within the data, in new and interesting ways (distance, adjacency, interaction, and neighborhood)

### Spatial Data Violate Conventional Statistics

Violations of conventional statistics:

* Spatial autocorrelation
* Modifiable areal unit problem (MAUP)
* Edge effects (Boundary problem)
* Ecological fallacy
* Nonuniformity of space

#### Spatial Autocorrelation

Data from locations near one another in space are more likely to be similar than data from locations remote from one another:

* Housing market
* Elevation change
* Temperature

#### Modifiable Areal Unit Problem (MAUP)

The aggregation units used are arbitrary with respect to the phenomena under investigation, yet the aggregation units used will affect statistics determined on the basis of data reported in this way.

If spatial units in a particular study were specified differently, we might observe very different patterns and relationships.

**Modifiable Area**: Units are arbitrarily defined and different organization of the units may create different analytical results.

* Potential problems in almost every field that utilizes spatial data
* One of the most stubborn problems in spatial analysis when spatially aggregated data is used
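A toy illustration of MAUP, using a hypothetical 4x4 grid of event counts (the numbers are invented for the example). The same cells are aggregated into four zones in two different ways, as horizontal strips and as vertical strips, and the resulting zone totals suggest quite different spatial patterns.

```python
# Hypothetical event counts on a 4x4 grid of small cells (illustrative only)
grid = [
    [9, 1, 1, 9],
    [9, 1, 1, 9],
    [9, 1, 1, 9],
    [9, 1, 1, 9],
]

# Zoning 1: four horizontal strips (one per row)
horizontal_zones = [sum(row) for row in grid]

# Zoning 2: four vertical strips (one per column)
vertical_zones = [sum(col) for col in zip(*grid)]

print("horizontal strips:", horizontal_zones)
print("vertical strips:  ", vertical_zones)
```

With horizontal strips every zone has the same total, so the events look evenly spread. With vertical strips the outer zones hold most of the events, so the same data look strongly concentrated at the east and west edges.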

#### Edge Effects (Boundary Problem)

Analyzing A vs B ignores similarities between the two based on their shared boundary.

#### Ecological Fallacy

The ecological fallacy is a situation that can occur when a researcher or analyst makes an inference about an individual based on aggregate data for a group.

**Example**: We might observe a *strong relationship between income and crime at the county level*, with lower-income areas being associated with higher crime rates.

**Conclusion:**

* Lower-income persons are more likely to commit crime
* Lower-income areas are associated with higher crime rates
* Lower-income counties tend to experience higher crime rates

**Issues:**

Inferences drawn about associations between the characteristics of an aggregate population and the characteristics of sub-units within the population are wrong. That is: *results from aggregated data (e.g., counties) cannot be applied to individual people.*

**What should we do?**

Be aware that aggregating or disaggregating data may conceal variation that is not visible at the larger aggregate level.
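The ecological fallacy can be demonstrated with a small constructed dataset. The three groups and their `x` and `y` values below are hypothetical and deliberately extreme. Across group averages the relationship between `x` and `y` is perfectly negative, yet within every group the individual-level relationship is perfectly positive.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mean(values):
    return sum(values) / len(values)

# Hypothetical individual-level data, grouped by area (illustrative only)
groups = {
    "area_1": ([1, 2, 3], [7, 8, 9]),
    "area_2": ([4, 5, 6], [4, 5, 6]),
    "area_3": ([7, 8, 9], [1, 2, 3]),
}

# Aggregate level: correlate the per-area averages
group_x = [mean(xs) for xs, _ in groups.values()]
group_y = [mean(ys) for _, ys in groups.values()]
aggregate_r = pearson(group_x, group_y)

# Individual level: correlate within each area separately
within_r = {name: pearson(xs, ys) for name, (xs, ys) in groups.items()}

print("aggregate r:", round(aggregate_r, 3))
print("within-area r:", {k: round(v, 3) for k, v in within_r.items()})
```

Anyone who applied the aggregate correlation to individuals would get the direction of the relationship wrong for every person in the data.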

#### Nonuniformity of Space

**Example:** Crime locations

**Conclusion:** Bank robberies are clustered, but only because banks are clustered

**Clicker Question**

In Baltimore City, police spend more time in a few neighborhoods. Crime rates are higher in those neighborhoods. What explains what's going on here?

A) Spatial Autocorrelation

B) MAUP

C) Edge Effects

D) Ecological Fallacy

**E) Nonuniformity**

**Clicker Question**

A Trader Joe's just opened in a new neighborhood. Nearby homes are now worth more money. What explains what's going on here?

**A) Spatial Autocorrelation**

B) MAUP

C) Edge Effects

D) Ecological Fallacy

E) Nonuniformity

## Spatial Statistics: The Basics

**Question:** Are countries with a high conflict index score geographically clustered?

### Global Point Density

The ratio of the observed number of points to the study region's surface area
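As a minimal sketch, assuming a rectangular study region and a hypothetical list of point coordinates, global density is a single division:

```python
def global_density(points, width, height):
    """Number of points per unit area over a width x height rectangle."""
    return len(points) / (width * height)

# Hypothetical point locations inside a 10 x 5 study region
points = [(1.0, 1.0), (2.5, 4.0), (7.0, 0.5), (9.0, 3.0), (4.0, 2.0)]
density = global_density(points, width=10, height=5)
print(density)
```

The result is one number for the whole region, so it says nothing about where within the region the points are concentrated.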

### Quadrat Density (Local)

The study surface is divided into quadrats, and point density is then calculated within each quadrat

Note: Quadrat number and shape will affect measurement estimate. Suffers from MAUP.
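A sketch of quadrat density, assuming a rectangular region split into an `nx` by `ny` grid of equal cells. The function name, the grid parameters, and the points are hypothetical. Running it with two different grids shows how the choice of quadrats changes the estimate.

```python
def quadrat_density(points, width, height, nx, ny):
    """Return a ny x nx grid of point densities (points per unit area)."""
    cell_w, cell_h = width / nx, height / ny
    counts = [[0] * nx for _ in range(ny)]
    for x, y in points:
        # Clamp so points on the far edge fall in the last quadrat
        col = min(int(x / cell_w), nx - 1)
        row = min(int(y / cell_h), ny - 1)
        counts[row][col] += 1
    cell_area = cell_w * cell_h
    return [[c / cell_area for c in row] for row in counts]

# Hypothetical points in a 10 x 10 region: three close together, one far away
points = [(1, 1), (2, 1), (1, 2), (9, 9)]

fine = quadrat_density(points, 10, 10, nx=2, ny=2)    # four quadrats
coarse = quadrat_density(points, 10, 10, nx=1, ny=1)  # one quadrat

print(fine)
print(coarse)
```

With a single quadrat the density is uniform by construction. With four quadrats, one cell has three times the density of another and two cells are empty.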

### Kernel Density (Local)

Point density is calculated within sliding windows (window size = kernel)

Note: Kernel will affect measurement estimate, but this is less susceptible to MAUP
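A minimal sketch of the sliding-window idea, assuming the simplest possible kernel: a circular window of fixed radius in which every point counts equally. Smoother kernels such as a Gaussian are common in practice. The function name, the `bandwidth` parameter, and the points are hypothetical.

```python
import math

def kernel_density(points, x, y, bandwidth):
    """Density at (x, y): points within `bandwidth` divided by window area."""
    inside = sum(
        1 for px, py in points
        if math.hypot(px - x, py - y) <= bandwidth
    )
    return inside / (math.pi * bandwidth ** 2)

# Hypothetical points: three close together, one far away
points = [(1, 1), (2, 1), (1, 2), (9, 9)]

near_cluster = kernel_density(points, 1, 1, bandwidth=1.5)
empty_area = kernel_density(points, 5, 5, bandwidth=1.5)
print(near_cluster, empty_area)
```

Because the window can be centered anywhere, the estimate does not depend on fixed zone boundaries. It still depends on the chosen `bandwidth`.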

## Modeling These Data: Poisson Point Process

(Density-based Methods -- How the points are distributed relative to study space)

$$\lambda(i) = e^{\alpha + \beta Z (i)}$$

* $\lambda(i)$ is the modeled density at location $i$
* $Z(i)$ is the value of the covariate at location $i$
* $e^{\alpha}$ is the base density when the covariate is zero
* $e^{\beta}$ is the multiplier by which the intensity increases (or decreases) for each 1 unit increase in the covariate

The Poisson distribution models the number of events in fixed intervals of time or space, given a known average rate (and independent occurrences)
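To make the roles of $\alpha$ and $\beta$ concrete, here is a small sketch that evaluates the intensity formula above. The parameter values `alpha = -2.0` and `beta = 0.5` are hypothetical, picked only for illustration.

```python
import math

def intensity(z, alpha, beta):
    """Modeled density lambda = exp(alpha + beta * z) for covariate value z."""
    return math.exp(alpha + beta * z)

alpha, beta = -2.0, 0.5  # hypothetical parameter values

base = intensity(0, alpha, beta)                   # equals exp(alpha)
ratio = intensity(3, alpha, beta) / intensity(2, alpha, beta)  # equals exp(beta)

print(round(base, 4), round(ratio, 4))
```

Setting the covariate to zero recovers $e^{\alpha}$, and raising the covariate by one unit multiplies the intensity by $e^{\beta}$, whatever the starting value.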

## Modeling These Data: Average Nearest Neighbor

(Distance-based Methods -- How the points are distributed relative to one another)

* Plot the ANN values for different order neighbors, that is for the first closest point, then the second closest point, and so forth
* ANN vs neighbor order offers insight into underlying spatial relationship
* Note: Study space definition affects this measure
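A sketch of the ANN calculation for different neighbor orders, using brute-force distance computation and a hypothetical set of four points at the corners of a unit square. The function name and its `order` parameter are illustrative.

```python
import math

def average_nearest_neighbor(points, order=1):
    """Mean distance from each point to its `order`-th closest other point."""
    total = 0.0
    for i, p in enumerate(points):
        dists = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        total += dists[order - 1]
    return total / len(points)

# Hypothetical points: corners of a unit square
points = [(0, 0), (1, 0), (0, 1), (1, 1)]

ann_by_order = {
    k: average_nearest_neighbor(points, order=k) for k in (1, 2, 3)
}
print(ann_by_order)
```

Plotting `ann_by_order` against the neighbor order gives the ANN curve described above. Here the first and second neighbors are both one unit away, and the third is the diagonal.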

## KNN: K Nearest Neighbor for Classification

To which class does the new data point belong?

### KNN: Choosing K

* K specifies how many neighbors to consider
* Note that as more neighbors are considered, the boundary smooths out
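A minimal KNN classifier sketch using majority vote over Euclidean distance. The training points, the labels, and the query location are hypothetical and arranged so that the predicted class changes with K.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Predict the label of `query` by majority vote of its k nearest points."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical labeled training data: (coordinates, class)
train = [
    ((0, 0), "red"), ((0, 1), "red"), ((1, 0), "red"),
    ((2.2, 2.2), "red"),            # a lone red point near the blue group
    ((3, 3), "blue"), ((3, 4), "blue"),
]

query = (2.5, 2.5)
pred_k1 = knn_predict(train, query, k=1)
pred_k3 = knn_predict(train, query, k=3)
print(pred_k1, pred_k3)
```

With K = 1 the single closest point decides the class, so the lone red point wins. With K = 3 the two nearby blue points outvote it, which is the smoothing effect of a larger K.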

### KNN: Pros and Cons

**Pros:**

* No assumptions about data (good for nonlinear)
* Simple and interpretable
* Relatively high accuracy
* Versatile (classification and regression)

**Cons:**

* Computationally intensive
* High memory requirements
* Stores all (or most) of training data
* Prediction slow with large N
* Sensitive to outliers/irrelevant features

## Hypothesis Testing: CSR/IRP

(Distance-based Methods -- How the points are distributed relative to one another)

Compare observed point patterns to ones generated by an independent random process (IRP), aka complete spatial randomness (CSR)

CSR/IRP satisfy two conditions:

1. Any event has equal probability of being in any location, a 1st order effect
2. The location of one event is independent of the location of another event, a 2nd order effect

### Hypothesis Testing: A Monte Carlo Test

Is this distribution of Walmarts in MA the result of CSR?

* $H_{0}$: Distributed randomly (yes CSR)
* $H_{a}$: NOT distributed randomly (no CSR)

1. First, we postulate a process - our null hypothesis $H_{0}$. For example, we hypothesize that the distribution of Walmart stores is consistent with a completely random process (CSR).
2. Next, we simulate many realizations of our postulated process and compute a statistic (e.g., ANN) for each realization.
3. Finally, we compare our observed data to the patterns generated by our simulated processes and assess (via a measure of probability) if our pattern is a likely realization of the hypothesized process.

This is a simulation-based test, similar in spirit to bootstrapping!

Failing to reject the null suggests that our observed pattern is consistent with CSR.
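The three steps of the Monte Carlo test can be sketched as follows. This is an illustration under stated assumptions, not the analysis behind the Walmart example: the study region is a hypothetical 10 x 10 square, the observed points are an invented tight cluster, first-order ANN is the test statistic, and the p-value is one-sided (testing for clustering).

```python
import math
import random

def ann(points):
    """First-order average nearest neighbor distance."""
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / len(points)

def simulate_csr(n, width, height, rng):
    """One CSR realization: n points placed uniformly and independently."""
    return [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(n)]

def monte_carlo_p_value(observed, width, height, n_sim=199, seed=0):
    """Share of patterns (simulated plus observed) with ANN <= observed ANN."""
    rng = random.Random(seed)
    obs_stat = ann(observed)                       # statistic for our data
    sim_stats = [
        ann(simulate_csr(len(observed), width, height, rng))
        for _ in range(n_sim)                      # step 2: many realizations
    ]
    n_extreme = sum(1 for s in sim_stats if s <= obs_stat)
    return (n_extreme + 1) / (n_sim + 1)           # step 3: compare

# Hypothetical observed pattern: 20 points packed into one corner
observed = [(0.2 * (i % 5), 0.2 * (i // 5)) for i in range(20)]

p_value = monte_carlo_p_value(observed, width=10, height=10)
print(p_value)
```

A small p-value means that CSR rarely produces points as tightly packed as the observed ones, so the null hypothesis of CSR is rejected for this invented pattern.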

When controlling for population density, are Walmarts randomly distributed?

* $H_{0}$: Walmarts are distributed according to population density alone
* $H_{a}$: Walmarts are *not* distributed based on population density alone

Two randomly generated point patterns, with population density used as the underlying process.
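One way to generate such patterns, sketched under simplifying assumptions: the region is divided into square grid cells, each cell is picked with probability proportional to its population, and the point is then placed uniformly inside the chosen cell. The population grid and the function name are hypothetical.

```python
import random

def simulate_by_density(n, cell_weights, cell_size, rng):
    """Place n points; each cell is chosen in proportion to its weight."""
    cells = list(cell_weights)
    weights = [cell_weights[c] for c in cells]
    chosen = rng.choices(cells, weights=weights, k=n)
    # Jitter each point uniformly within its chosen (col, row) cell
    return [
        ((col + rng.random()) * cell_size, (row + rng.random()) * cell_size)
        for col, row in chosen
    ]

# Hypothetical population per grid cell, keyed by (col, row)
population = {(0, 0): 900, (1, 0): 100, (0, 1): 0, (1, 1): 0}

rng = random.Random(42)
pattern = simulate_by_density(1000, population, cell_size=1.0, rng=rng)

in_dense_cell = sum(1 for x, y in pattern if x < 1.0)
in_sparse_cell = sum(1 for x, y in pattern if x >= 1.0)
print(in_dense_cell, in_sparse_cell)
```

Patterns simulated this way can replace the CSR realizations in a Monte Carlo test, so the null hypothesis becomes "points follow population density" rather than "points are completely random".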

Rejecting the null means population is not the sole driving force!

Maybe median household income is the driving force?

...Is it CSR or median household income?

Hints at plausible scenarios, but doesn't tell us which one it is definitively.

## Basic Geospatial Analysis: Summary

1. Choices made when visualizing spatial data are important to the conclusions drawn

* Values to plot?
* Map type?
* Color scale?

2. Traditional statistics fail with geospatial data

* Spatial autocorrelation
* MAUP
* Edge effects
* Ecological fallacy
* Nonuniformity of space

3. Analysis is still possible

* Global Point Density, Quadrat Density, Kernel Density
* Poisson Point Process
* K-Nearest Neighbor (k-NN)
* Comparison to CSR (using simulation)