# Geospatial Data, Python & Pandas

<blockquote>"GeoPandas, as the name suggests, extends the popular data science library pandas by adding support for geospatial data." (GeoPandas, "<a href="https://geopandas.org/en/stable/getting_started/introduction.html">Introduction to GeoPandas</a>")</blockquote>

You could spend a whole semester on `GeoPandas` and Python's geospatial data workflows. We'll cover some basic workflows and highlight additional resources if folks want to go further.


## Setup & Environment

First, we'll need to install `GeoPandas`.

Installing and configuring `GeoPandas` usually works best in a new, dedicated Python environment, because the package depends on several compiled geospatial libraries.

A few resources that can get folks started:
  
- Anaconda
  * Tanish Gupta, "[Fastest Way to Install Geopandas in Jupyter Notebooks](https://medium.com/analytics-vidhya/fastest-way-to-install-geopandas-in-jupyter-notebook-on-windows-8f734e11fa2b)" *Analytics Vidhya* (6 December 2020)
  * Anaconda, "[conda-forge packages, geopandas](https://anaconda.org/conda-forge/geopandas)" *Anaconda documentation*
  * GeoPandas, "[Installation](https://geopandas.org/getting_started/install.html)" *GeoPandas documentation*
- Google Colab
  * Abdishakur Hassan, Jupyter notebook on using `geopandas` in Google Colab, from "[Geographic data science tutorials with Python](https://github.com/shakasom/GDS)" *GitHub repository*
    * [Google Colab](https://colab.research.google.com/github/shakasom/GDS/blob/master/Part1%20-%20Introduction.ipynb)
    * [GitHub](https://github.com/shakasom/GDS/blob/master/Part1%20-%20Introduction.ipynb)

Additional `GeoPandas` resources:
- Jonathan Soma, "[Mapping with geopandas](https://jonathansoma.com/lede/foundations-2017/classes/geopandas/mapping-with-geopandas/)" from 2017 "[Foundations of Computing](https://jonathansoma.com/lede/foundations-2017/)" course, Columbia Graduate School of Journalism
- CoderzColumn, "[Plotting Static Maps with geopandas](https://coderzcolumn.com/tutorials/data-science/plotting-static-maps-with-geopandas-working-with-geospatial-data)" *CoderzColumn* (11 March 2020)
- GeoPandas, "[Plotting with Geoplot and GeoPandas](https://geopandas.org/gallery/plotting_with_geoplot.html)" *GeoPandas documentation*

In [1]:
# if working in Google Colab
!pip install geopandas



In [2]:
# import statements
import json

import geopandas as gpd
import pandas as pd
import requests

## GeoDataFrame

When possible, loading geospatial data (especially polygon data) through `GeoPandas` will simplify other workflows.

What distinguishes a `GeoDataFrame` from a standard `DataFrame`? The all-important `geometry` column, which stores each row's shape (point, line, or polygon).

For more on data structures in `GeoPandas`:
- [GeoPandas documentation](https://geopandas.org/en/stable/docs/user_guide/data_structures.html)
- [Spatial analysis with Python tutorial](https://sustainability-gis.readthedocs.io/en/latest/lessons/L1/intro-to-python-geostack.html)
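
To make the role of the `geometry` column concrete, here is a minimal sketch that builds a tiny `GeoDataFrame` by hand. The point coordinates and the `name` values are made up for illustration, and `toy` is a hypothetical variable name that is not used elsewhere in this chapter.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical sample data: three invented points in (longitude, latitude) order
toy = gpd.GeoDataFrame(
    {"name": ["a", "b", "c"]},
    geometry=[Point(-86.25, 41.68), Point(-86.18, 41.66), Point(-86.06, 41.73)],
    crs="EPSG:4326",  # WGS 84 longitude/latitude
)

print(type(toy).__name__)   # the object is a GeoDataFrame
print(toy.geometry.name)    # name of the active geometry column
print(toy.dtypes)           # the geometry column has its own dtype

# Selecting only non-geometry columns falls back to a plain pandas DataFrame
plain = toy[["name"]]
print(type(plain).__name__)
```

The last two lines hint at why the `geometry` column matters: once it is dropped, the result should behave like an ordinary `DataFrame` again.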

Let's work with the St. Joseph County zip code boundary file.
- [Link to download](https://sjcgis-stjocogis.hub.arcgis.com/datasets/stjocogis::zip-code-boundaries-3/about)
  * *Note: I've renamed this file `zip.geojson`.*

In [4]:
gdf = gpd.read_file("zip.geojson") # load file
gdf.head() # show geo dataframe head

| | FID | ZIP | City_Town | Shape__Area | Shape__Length | geometry |
|---|---|---|---|---|---|---|
| 0 | 1 | 46506 | Bremen | 183694700.0 | 71554.432719 | POLYGON ((-86.09920 41.50815, -86.09914 41.500... |
| 1 | 2 | 46530 | Granger | 107468900.0 | 59262.19356 | POLYGON ((-86.06268 41.73220, -86.06268 41.731... |
| 2 | 3 | 46536 | Lakeville | 120710700.0 | 50337.989555 | POLYGON ((-86.24495 41.56449, -86.24457 41.563... |
| 3 | 4 | 46544 | Mishawaka | 179076500.0 | 78110.454324 | POLYGON ((-86.06125 41.60846, -86.06126 41.608... |
| 4 | 5 | 46545 | Mishawaka | 79112720.0 | 56893.621236 | POLYGON ((-86.09635 41.72453, -86.09639 41.719... |


In [5]:
gdf.info() # show geodataframe info

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 23 entries, 0 to 22
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   FID            23 non-null     int64   
 1   ZIP            23 non-null     object  
 2   City_Town      23 non-null     object  
 3   Shape__Area    23 non-null     float64 
 4   Shape__Length  23 non-null     float64 
 5   geometry       23 non-null     geometry
dtypes: float64(2), geometry(1), int64(1), object(2)
memory usage: 1.2+ KB


Folks should notice how closely this mirrors `pandas` syntax and workflows, with some additional data types and file-handling functionality.
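
As a rough illustration of that overlap, the sketch below applies ordinary `pandas` operations to a small `GeoDataFrame` and then one geometry-aware attribute. The rectangles, ZIP values, and town labels are invented stand-ins, not the county data loaded above.

```python
import geopandas as gpd
from shapely.geometry import box

# Hypothetical rectangles standing in for zip code polygons (no CRS set)
toy = gpd.GeoDataFrame(
    {"ZIP": ["00001", "00002", "00003"], "City_Town": ["A", "B", "A"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 3, 1), box(0, 1, 1, 4)],
)

subset = toy[toy["City_Town"] == "A"]      # boolean filtering, as in pandas
counts = toy["City_Town"].value_counts()   # an ordinary pandas Series method
toy["area"] = toy.geometry.area            # geometry-aware attribute from GeoPandas

print(type(subset).__name__)
print(counts)
print(toy[["ZIP", "area"]])
```

Because these toy shapes have no coordinate reference system, the areas are in arbitrary units; with real longitude/latitude data you would normally reproject before measuring area.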

## Attribute Data

Now that we have a `GeoDataFrame`, we might want attribute data that connects with our geospatial data.

For example, we might want to be able to connect educational outcome attribute data from the American Community Survey with the zip code boundary file.
- <a href="https://data.census.gov/table/ACSST5Y2022.S1501?t=Education:Educational Attainment&g=050XX00US18141,18141$8600000&moe=false">Link to the ACS dataset</a>
  * *Note: I'm working with the 2022 5-year estimates, which I've renamed `data.csv`.*

In [51]:
df = pd.read_csv("data.csv") # load attribute data
# df # show df

We'll cover more on reshaping data in Pandas in a future chapter, so I'm not going to provide in-depth explanations for all the steps happening here. More to come soon!

In [None]:
df.columns = df.columns.str.split("!!", n=2, expand=True) # split column headers into multi-level index based on separator (n must be passed by keyword in pandas 2.x)
df = df.T # transpose dataframe
header = df.iloc[0] # isolate first row to be new header
df = df[1:] # subset dataframe (everything past the first row)
df.columns = header # reassign headers
df = df.reset_index() # reset the index
cols = df.columns.tolist() # copy the column labels into a list
cols[:3] = ['Area', 'Coverage', 'Type'] # rename the first three columns
df.columns = cols # assign the updated labels back to the dataframe
# df # show updated df
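
For readers who want to see the shape of this transformation on something small, here is a sketch on a made-up miniature of an ACS-style export. It takes a slightly different route from the cell above (`set_index`, transpose, then `Series.str.split`), and every label and number in `raw` is invented, so treat it as an illustration of the idea rather than the exact steps used on `data.csv`.

```python
import pandas as pd

# Made-up miniature of an ACS-style export; the real column labels will differ
raw = pd.DataFrame({
    "Label (Grouping)": ["Population 25 years and over", "Bachelor's degree"],
    "ZCTA5 00001!!Total!!Estimate": [100, 20],
    "ZCTA5 00002!!Total!!Estimate": [200, 50],
})

# Transpose so that each original column header becomes a row label
wide = raw.set_index("Label (Grouping)").T

# Split the "!!"-separated row labels into three separate columns
parts = wide.index.to_series().str.split("!!", n=2, expand=True)
parts.columns = ["Area", "Coverage", "Type"]

# Put the split labels next to the values and tidy up the index
tidy = pd.concat([parts, wide], axis=1).reset_index(drop=True)
tidy.columns.name = None
print(tidy)
```

The result has one row per area, with `Area`, `Coverage`, and `Type` as ordinary columns followed by one column per survey label.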

Again, more on these kinds of data reshaping operations to come. But we're close to being able to connect our attribute data with our polygon data.

We just need to remove the `ZCTA5 ` prefix from the `Area` column so the values match the `ZIP` column in our `GeoDataFrame`. We can do that with the `str.replace` string method.

In [None]:
df['Area'] = df['Area'].str.replace("ZCTA5 ", "")
df
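
A slightly more defensive version of this cleanup, sketched on a few sample labels, is shown below. Passing `regex=False` makes the literal replacement explicit, `str.strip()` guards against stray whitespace, and the five-digit check is my own assumption about what a clean key should look like. The names `areas` and `zips` are hypothetical.

```python
import pandas as pd

# Sample labels in the same style as the Area column (one has trailing whitespace)
areas = pd.Series(["ZCTA5 46506", "ZCTA5 46530 ", "ZCTA5 46536"])

# Literal (non-regex) replacement, then trim any leftover whitespace
zips = areas.str.replace("ZCTA5 ", "", regex=False).str.strip()

# Assumed sanity check: every cleaned value should be exactly five digits
all_clean = bool(zips.str.fullmatch(r"\d{5}").all())
print(zips.tolist(), all_clean)
```

A check like this is cheap and can catch formatting surprises before they turn into silently dropped rows in a merge.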

## Workflows

`GeoPandas` will come in especially handy when we start exploring visualization.

But `GeoPandas` can also help with connecting geospatial data with attribute data.

In [None]:
merged = gdf.merge(df, left_on="ZIP", right_on="Area") # merge attribute and geospatial data on the zip code
merged # show merged geodataframe
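
A default merge is an inner join, so any zip code without a matching `Area` value quietly disappears. One way to check for that, sketched here with invented stand-ins for `gdf` and `df`, is to use the `how="left"` and `indicator=True` options of `merge`. The names `shapes`, `attrs`, and `checked`, the rectangles, and the `Estimate` numbers are all hypothetical.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Hypothetical stand-ins for gdf and df; shapes and estimates are invented
shapes = gpd.GeoDataFrame(
    {"ZIP": ["46506", "46530", "46536"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(2, 0, 3, 1)],
)
attrs = pd.DataFrame({"Area": ["46506", "46530"], "Estimate": [10, 20]})

# how="left" keeps every polygon; indicator=True adds a "_merge" column
checked = shapes.merge(
    attrs, left_on="ZIP", right_on="Area", how="left", indicator=True
)

# Zip codes that found no attribute row
unmatched = checked.loc[checked["_merge"] == "left_only", "ZIP"].tolist()
print(type(checked).__name__, unmatched)
```

Calling `merge` on the `GeoDataFrame` (rather than on the plain `DataFrame`) keeps the geometry-aware type in the result. Both key columns also need the same dtype; here, as in the county file, the zip codes are strings.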

## Additional Resources

We'll come back to `GeoPandas` when we start exploring data visualization.

But we're only scratching the surface of the data tasks and workflows `GeoPandas` can facilitate. A good place to start is the [GeoPandas User Guide](https://geopandas.org/en/stable/docs/user_guide.html).