# Performing Spatial Joins in GeoPandas
This notebook covers how to use spatial joins in GeoPandas to combine datasets based on geographic relationships rather than matching columns.

## What Is a Spatial Join?
In pandas, a **column join** combines two datasets based on a shared column (e.g., zip code).

A GeoPandas **spatial join**, by contrast, combines datasets based on their **geographic relationship** (e.g., points within polygons).

### Example Use Cases
- Join restaurants (points) with neighborhoods (polygons)
- Match roads (lines) to cities (polygons)
- Combine pollution sensors (points) with state boundaries

## Setting Up Our Data
Let's assume we have two GeoDataFrames:
- `states`: U.S. states (polygons)
- `plants`: Power plants (points)

In [None]:
import geopandas as gpd

# Sample loading (replace with actual paths)
states = gpd.read_file("path/to/states.shp")
plants = gpd.read_file("path/to/power_plants.shp")
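
If you don't have shapefiles at hand, a small in-memory stand-in is enough to try the joins in this notebook. The sketch below is a toy under stated assumptions: the state names, plant names, coordinates, and attribute values are all invented for illustration, and the variables carry a hypothetical `toy_` prefix so they don't overwrite the real `states` and `plants`.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Two hypothetical square "states" that share an edge (made-up coordinates)
toy_states = gpd.GeoDataFrame(
    {"name": ["Alpha", "Beta"]},
    geometry=[box(0, 0, 2, 2), box(2, 0, 4, 2)],
    crs="EPSG:4326",
)

# Three hypothetical plants: two inside a state, one outside both
toy_plants = gpd.GeoDataFrame(
    {
        "plant": ["P1", "P2", "P3"],
        "source": ["coal", "wind", "coal"],
        "megawatts": [100, 50, 200],
    },
    geometry=[Point(1, 1), Point(3, 1), Point(5, 1)],
    crs="EPSG:4326",
)

print(toy_states.geom_type.tolist())
print(toy_plants.geom_type.tolist())
```

Both toy layers are created with the same CRS, so they can be joined directly.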

## Coordinate Reference Systems (CRS) Must Match
Before performing a spatial join, ensure both GeoDataFrames use the same CRS.

In [None]:
print(states.crs)
print(plants.crs)

Which CRS you choose matters less than making sure both GeoDataFrames use the same one. If they don't match, the join compares coordinates from two different systems and the results will be meaningless.

To convert one to the other's CRS, use `.to_crs()`. 

In [None]:
# Reproject states into the CRS used by plants
states = states.to_crs(plants.crs)
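
The check-then-convert step can be sketched on toy data. The layers, names, and coordinates below are invented, and the mismatch is simulated by deliberately storing the polygon layer in Web Mercator (EPSG:3857); a real dataset may use any pair of CRSs.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical point layer in longitude/latitude
toy_plants = gpd.GeoDataFrame(
    {"plant": ["P1"]}, geometry=[Point(1, 1)], crs="EPSG:4326"
)

# Hypothetical polygon layer, reprojected to simulate a CRS mismatch
toy_states = gpd.GeoDataFrame(
    {"name": ["Alpha"]}, geometry=[box(0, 0, 2, 2)], crs="EPSG:4326"
).to_crs("EPSG:3857")

crs_matched_before = toy_states.crs == toy_plants.crs

# Reproject only when needed; to_crs returns a new GeoDataFrame by default
if toy_states.crs != toy_plants.crs:
    toy_states = toy_states.to_crs(toy_plants.crs)

crs_matched_after = toy_states.crs == toy_plants.crs
print(crs_matched_before, crs_matched_after)
```

Guarding the conversion with a comparison keeps the cell safe to re-run.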

## Performing the Spatial Join
Now that the CRSs match, we can join the data. We want to find which state each power plant is in.

In [None]:
# Attach state attributes to every plant that lies within a state polygon
joined = gpd.sjoin(plants, states, how="inner", predicate="within")

### Join Parameters Explained
- `plants`: The left GeoDataFrame, whose geometry (points) is kept in the result
- `states`: The right GeoDataFrame, whose attributes are attached (polygons)
- `how='inner'`: Keeps only points that fall within a polygon
- `predicate='within'`: Tests whether each point is inside a polygon (older GeoPandas releases called this parameter `op`)

In [None]:
joined.head()

## What Did We Get?
The result contains:
- All columns from `plants`
- All columns from `states` except `geometry`, because the result keeps the `plants` geometry
- Typically an `index_right` column holding the index of the matching `states` row

Example: `megawatts`, `plant`, `source` (from plants) + `name`, `population` (from states)
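
To see the shape of the result without the real files, here is a minimal sketch on invented data. The `toy_` variables, the names, the population figures, and the coordinates are all hypothetical.

```python
import geopandas as gpd
from shapely.geometry import Point, box

toy_states = gpd.GeoDataFrame(
    {"name": ["Alpha", "Beta"], "population": [10, 20]},
    geometry=[box(0, 0, 2, 2), box(2, 0, 4, 2)],
    crs="EPSG:4326",
)
toy_plants = gpd.GeoDataFrame(
    {
        "plant": ["P1", "P2", "P3"],
        "source": ["coal", "wind", "coal"],
        "megawatts": [100, 50, 200],
    },
    geometry=[Point(1, 1), Point(3, 1), Point(5, 1)],
    crs="EPSG:4326",
)

# P3 lies outside both polygons, so the inner join drops it
toy_joined = gpd.sjoin(toy_plants, toy_states, how="inner", predicate="within")

print(toy_joined.columns.tolist())
print(toy_joined[["plant", "name", "population"]])
print(toy_joined.geom_type.unique())
```

The printed column list should show the plant attributes alongside `name` and `population`, with point geometry retained.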

## Aggregating Results
Now that each power plant has state information, we can group and analyze. For example, how many plants does each state have?

In [None]:
joined['name'].value_counts()

### Count Coal Plants by State

In [None]:
coal = joined.loc[joined['source'] == 'coal']
coal['name'].value_counts()
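
`value_counts()` answers "how many"; when you also want totals, a pandas `groupby` with named aggregation is one option. The sketch below swaps in that technique on invented toy data (hypothetical names, coordinates, and capacities), so treat it as an illustration rather than part of the original workflow.

```python
import geopandas as gpd
from shapely.geometry import Point, box

toy_states = gpd.GeoDataFrame(
    {"name": ["Alpha", "Beta"]},
    geometry=[box(0, 0, 2, 2), box(2, 0, 4, 2)],
    crs="EPSG:4326",
)
toy_plants = gpd.GeoDataFrame(
    {
        "plant": ["P1", "P2", "P3", "P4"],
        "source": ["coal", "wind", "coal", "gas"],
        "megawatts": [100, 50, 200, 75],
    },
    geometry=[Point(1, 1), Point(1.5, 0.5), Point(3, 1), Point(5, 1)],
    crs="EPSG:4326",
)

toy_joined = gpd.sjoin(toy_plants, toy_states, how="inner", predicate="within")

# One row per state: number of plants and their combined capacity
per_state = toy_joined.groupby("name").agg(
    plant_count=("plant", "count"),
    total_megawatts=("megawatts", "sum"),
)
print(per_state)
```

Because the join is `inner`, the plant outside both polygons (`P4`) contributes to neither count nor total.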

## Summary
- Spatial joins let you combine datasets based on geography.
- Always match CRS before a join.
- Use `predicate='within'` to find points inside polygons.
- Choose `how='left'` or `how='inner'` based on whether unmatched records should be kept.
- You can now perform powerful analyses like counting points per region.

# Changing the Order of the Join
Changing the order of the datasets in a spatial join affects the result, most visibly the geometry column. Depending on the `how` and `predicate` parameters, swapping the order may also change which rows are kept, or it may leave the rows unchanged.

## Join: Power Plants within States (Points First)
This keeps power plant geometries and adds columns from the states they fall inside.

In [None]:
plants_with_states = gpd.sjoin(plants, states, how="inner", predicate="within")
plants_with_states.head()

## Swapping the Join Order: States with Power Plants
Now we place `states` first and `plants` second. This will:
- Keep the **geometry of states**
- Append the **columns from matching plants**

In the original example, we looked for plants that are within states. Here, we look for states that contain plants. To reflect this reversal, we also change the `predicate` parameter from `within` to `contains`.

In [None]:
states_with_plants = gpd.sjoin(states, plants, how="inner", predicate="contains")
states_with_plants.head()

## Comparing Row Counts
Let's confirm both joins return the same number of matched rows.

In [None]:
print("Plants with States (geometry = points):", plants_with_states.shape)
print("States with Plants (geometry = polygons):", states_with_plants.shape)

## Key Differences
| Join Version               | Geometry Type | Column Order                |
|---------------------------|----------------|-----------------------------|
| `plants_with_states`      | Points         | Plant info, then State info |
| `states_with_plants`      | Polygons       | State info, then Plant info |

The matched rows are the same because both joins are **inner joins**, meaning only matched records are kept, and `contains` with states first is the mirror image of `within` with plants first.
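
The comparison can be checked on a small invented dataset. Everything in the sketch below (the `toy_` variables, names, and coordinates) is hypothetical; it only illustrates that the two inner joins match the same plant-state pairs while keeping different geometry.

```python
import geopandas as gpd
from shapely.geometry import Point, box

toy_states = gpd.GeoDataFrame(
    {"name": ["Alpha", "Beta"]},
    geometry=[box(0, 0, 2, 2), box(2, 0, 4, 2)],
    crs="EPSG:4326",
)
toy_plants = gpd.GeoDataFrame(
    {"plant": ["P1", "P2", "P3"]},
    geometry=[Point(1, 1), Point(3, 1), Point(5, 1)],
    crs="EPSG:4326",
)

points_first = gpd.sjoin(toy_plants, toy_states, how="inner", predicate="within")
polygons_first = gpd.sjoin(toy_states, toy_plants, how="inner", predicate="contains")

# Same matched (plant, state) pairs in both results
pairs_points_first = set(zip(points_first["plant"], points_first["name"]))
pairs_polygons_first = set(zip(polygons_first["plant"], polygons_first["name"]))
print(pairs_points_first == pairs_polygons_first)

# Different geometry type depending on which layer came first
print(points_first.geom_type.unique(), polygons_first.geom_type.unique())
```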

## Geometry Implications
In `states_with_plants`, **every row** has the geometry of the state, even if multiple rows refer to different power plants within that state.

## Counting Power Plants by State (Same Either Way)
We can still count how many power plants are in each state using `value_counts()` on the state name.

In [None]:
plants_with_states['name'].value_counts()

## Summary
- Reversing the order in an inner spatial join changes the **geometry** and **column order**, but not the set of matched **rows**.
- Use `predicate='within'` when the first dataset is **contained in** the second.
- Use `predicate='contains'` when the first dataset **contains** the second.
- Both methods can support the same kinds of analysis.

✅ Whether you start with points or polygons, understanding join direction helps manage geometry and prepare for visualization.

# "inner" vs "left" joins

## What Does `how` Mean in a Spatial Join?

The `how` parameter in `gpd.sjoin()` determines how unmatched rows are handled:
- `inner`: Only matched records are retained.
- `left`: All rows from the left GeoDataFrame are kept, even if they don’t match anything on the right.

### `how='inner'` example
Match all power plants with states. Drop plants that don't fall inside any state.

In [None]:
inner_join = gpd.sjoin(plants, states, how='inner', predicate='within')
print(f"Number of rows (inner join): {len(inner_join)}")

### `how='left'` example
Keep all power plants. If a plant isn't inside any state, the columns that come from `states` (such as `name`) will be `NaN`.

In [None]:
left_join = gpd.sjoin(plants, states, how='left', predicate='within')
print(f"Number of rows (left join): {len(left_join)}")

### Finding Unmatched Points in Left Join
These are plants that were not matched to any state in the left join.

In [None]:
left_join[left_join['name'].isna()].head()
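
Here is the same pattern on invented toy data, where one hypothetical plant (`P3`) sits outside every polygon. The names and coordinates are assumptions made for the sketch.

```python
import geopandas as gpd
from shapely.geometry import Point, box

toy_states = gpd.GeoDataFrame(
    {"name": ["Alpha", "Beta"]},
    geometry=[box(0, 0, 2, 2), box(2, 0, 4, 2)],
    crs="EPSG:4326",
)
toy_plants = gpd.GeoDataFrame(
    {"plant": ["P1", "P2", "P3"]},
    geometry=[Point(1, 1), Point(3, 1), Point(5, 1)],
    crs="EPSG:4326",
)

toy_inner = gpd.sjoin(toy_plants, toy_states, how="inner", predicate="within")
toy_left = gpd.sjoin(toy_plants, toy_states, how="left", predicate="within")

# Rows whose state columns are missing were not matched to any polygon
toy_unmatched = toy_left[toy_left["name"].isna()]

print(len(toy_inner), len(toy_left))
print(toy_unmatched["plant"].tolist())
```

The difference between the two row counts equals the number of unmatched plants.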

## Case Study: The Winnetka Power Plant
The Winnetka plant is a real power plant located on the Illinois shoreline. In this dataset, its coordinates fall slightly outside the state polygon, so a `within` test does not match it to Illinois.

In [None]:
"Winnetka" in left_join['plant'].values


### Why Join Type Matters
If we had used `inner`, the Winnetka plant would have been dropped from our dataset, which could skew any geographic analysis.

This shows how **inner joins eliminate unmatched data silently**, while **left joins preserve all original data**, letting you manually inspect or correct it later.
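
One possible way to correct such points, not used elsewhere in this notebook, is to assign each unmatched point to its nearest polygon with `gpd.sjoin_nearest` (available in GeoPandas 0.10 and later). The sketch below is a toy under stated assumptions: the layers, the `Shoreline` plant, the coordinates, and the 100-unit `max_distance` cutoff are all invented, and a projected CRS is assumed so that distances are in planar units.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical layers in a projected CRS (coordinates are made up)
toy_states = gpd.GeoDataFrame(
    {"name": ["Alpha", "Beta"]},
    geometry=[box(0, 0, 1000, 1000), box(1000, 0, 2000, 1000)],
    crs="EPSG:3857",
)
toy_plants = gpd.GeoDataFrame(
    {"plant": ["P1", "P2", "Shoreline"]},
    # "Shoreline" sits 50 units beyond Alpha's northern edge
    geometry=[Point(500, 500), Point(1500, 500), Point(500, 1050)],
    crs="EPSG:3857",
)

toy_left = gpd.sjoin(toy_plants, toy_states, how="left", predicate="within")

# Pull the unmatched plants back out of the original layer
unmatched_names = toy_left.loc[toy_left["name"].isna(), "plant"]
toy_unmatched = toy_plants[toy_plants["plant"].isin(unmatched_names)]

# Assign each unmatched plant to the nearest state within the cutoff
toy_repaired = gpd.sjoin_nearest(
    toy_unmatched,
    toy_states,
    how="inner",
    max_distance=100,
    distance_col="distance",
)
print(toy_repaired[["plant", "name", "distance"]])
```

Keeping `max_distance` small limits the repair to points that are plausibly just outside a boundary, rather than attaching far-away points to an unrelated polygon.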


## Revisiting Join Order and Geometry
Recall that the first GeoDataFrame you pass to `gpd.sjoin()` determines which geometry is kept in the result.

In [None]:
# Example:
# This keeps point geometry (plants)
plants_geom = gpd.sjoin(plants, states, how='inner', predicate='within')

# This keeps polygon geometry (states)
states_geom = gpd.sjoin(states, plants, how='inner', predicate='contains')

In [None]:
print("plants_geom shape:", plants_geom.shape)
print("states_geom shape:", states_geom.shape)

## Now Let’s Try `left` With Polygons First
What happens when we put `states` first in a left join? We’ll keep all the states, but not necessarily all the plants.

In [None]:
left_states = gpd.sjoin(states, plants, how='left', predicate='contains')
print(f"Left join with states first: {len(left_states)} rows")

In [None]:
"Winnetka" in left_states['plant'].values  # Expected: False

## Summary Table: Join Type + Order
| Order         | `how='inner'`                   | `how='left'`                                    |
|---------------|----------------------------------|-------------------------------------------------|
| `plants, states` | Keeps matching plants only       | Keeps **all** plants; unmatched get `NaN`         |
| `states, plants` | Keeps matching states only       | Keeps **all** states; unmatched plants are lost |

- Join type (`how`) decides if unmatched rows are kept
- Join direction decides which geometry appears in the result
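
The `states, plants` row of the table can be checked on invented data. In the sketch below, `Gamma` is a hypothetical state with no plants and `P3` is a hypothetical plant outside every state; all names and coordinates are assumptions.

```python
import geopandas as gpd
from shapely.geometry import Point, box

toy_states = gpd.GeoDataFrame(
    {"name": ["Alpha", "Beta", "Gamma"]},
    geometry=[box(0, 0, 2, 2), box(2, 0, 4, 2), box(4, 0, 6, 2)],
    crs="EPSG:4326",
)
toy_plants = gpd.GeoDataFrame(
    {"plant": ["P1", "P2", "P3"]},
    geometry=[Point(1, 1), Point(3, 1), Point(9, 9)],
    crs="EPSG:4326",
)

toy_inner_states = gpd.sjoin(toy_states, toy_plants, how="inner", predicate="contains")
toy_left_states = gpd.sjoin(toy_states, toy_plants, how="left", predicate="contains")

# Inner: only states that contain a plant; left: every state is kept
print(sorted(toy_inner_states["name"]))
print(sorted(toy_left_states["name"]))

# The plant outside every state appears in neither result
print("P3" in toy_left_states["plant"].values)
```

The empty state survives the left join with a missing `plant` value, while the unmatched plant disappears because it belongs to the right-hand layer.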


## ✅ Key Takeaways
- Use `how='left'` when you want to keep all rows from your main dataset (even if they don't match).
- Always check for unmatched records by calling `.isna()` on a column that comes from the right GeoDataFrame (for example, `name`).
- Use `predicate='within'` when points come first and polygons second; use `predicate='contains'` when the order is reversed.
- The first dataset in `gpd.sjoin()` determines the geometry kept.
