# Getting Started With `GeoPandas`

## Setup & Environment

First, we'll need to install `GeoPandas`.

Installing and configuring `GeoPandas` is usually easiest in a new Python environment, because the package depends on several compiled geospatial libraries.

A few resources that can get folks started:
  
- Anaconda
  * Tanish Gupta, "[Fastest Way to Install Geopandas in Jupyter Notebooks](https://medium.com/analytics-vidhya/fastest-way-to-install-geopandas-in-jupyter-notebook-on-windows-8f734e11fa2b)" *Analytics Vidhya* (6 December 2020)
  * Anaconda, "[conda-forge packages, geopandas](https://anaconda.org/conda-forge/geopandas)" *Anaconda documentation*
  * GeoPandas, "[Installation](https://geopandas.org/getting_started/install.html)" *GeoPandas documentation*
- Google CoLab
  * Abdishakur Hassan, Jupyter notebook on using `geopandas` in Google CoLab, from "[Geographic data science tutorials with Python](https://github.com/shakasom/GDS)" *GitHub repository*
    * [Google CoLab](https://colab.research.google.com/github/shakasom/GDS/blob/master/Part1%20-%20Introduction.ipynb)
    * [GitHub](https://github.com/shakasom/GDS/blob/master/Part1%20-%20Introduction.ipynb)

Additional `GeoPandas` resources:
- Jonathan Soma, "[Mapping with geopandas](https://jonathansoma.com/lede/foundations-2017/classes/geopandas/mapping-with-geopandas/)" from 2017 "[Foundations of Computing](https://jonathansoma.com/lede/foundations-2017/)" course, Columbia Graduate School of Journalism
- CoderzColumn, "[Plotting Static Maps with geopandas](https://coderzcolumn.com/tutorials/data-science/plotting-static-maps-with-geopandas-working-with-geospatial-data)" *CoderzColumn* (11 March 2020)
- GeoPandas, "[Plotting with Geoplot and GeoPandas](https://geopandas.org/gallery/plotting_with_geoplot.html)" *GeoPandas documentation*

In [None]:
# if working in Google Colab
!pip install geopandas

In [None]:
# import statements
import pandas as pd, geopandas as gpd, json, requests

When possible, loading geospatial data (especially polygon data) through `GeoPandas` will simplify other workflows.

What distinguishes a `GeoDataFrame` from a standard `DataFrame`? The all-important `geometry` column, which stores a shape (point, line, or polygon) for each row.

For more on data structures in `GeoPandas`:
- [GeoPandas documentation](https://geopandas.org/en/stable/docs/user_guide/data_structures.html)
- [Spatial analysis with Python tutorial](https://sustainability-gis.readthedocs.io/en/latest/lessons/L1/intro-to-python-geostack.html)
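
To make the role of the `geometry` column concrete, here is a minimal sketch that builds a tiny `GeoDataFrame` by hand. The place names and coordinates are made up for illustration; only `gpd.GeoDataFrame` and `shapely.geometry.Point` come from the libraries themselves.

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical attribute table: two made-up places with coordinates
df = pd.DataFrame(
    {"name": ["Place A", "Place B"], "lon": [-86.25, -86.20], "lat": [41.68, 41.70]}
)

# Build one shapely Point per row; x is longitude, y is latitude
points = [Point(lon, lat) for lon, lat in zip(df["lon"], df["lat"])]

# Attaching the points as the geometry column turns the table into a GeoDataFrame
gdf = gpd.GeoDataFrame(df, geometry=points, crs="EPSG:4326")

print(type(gdf).__name__)     # class of the new object
print(gdf.geometry.name)      # name of the active geometry column
print(gdf.geom_type.tolist()) # geometry type of each row
```

Everything else about the object still behaves like a regular `DataFrame`, so familiar `pandas` methods remain available.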

## Dataset #1

The first dataset we'll use in this chapter is data about [City of South Bend parks](https://data-southbend.opendata.arcgis.com/datasets/SouthBend::parks-locations-and-features/about).

An API call to bring that data into Python:

In [None]:
import pandas as pd, json, requests # import statements
r = requests.get('https://services1.arcgis.com/0n2NelSAfR7gTkr1/arcgis/rest/services/Parks_Locations_and_Features/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson') # load page
d = r.json() # store as json object

data = [] # empty list for data

for i in d['features']: # iterate over list
  data.append(i['properties']) # isolate value, append to list

df = pd.DataFrame(data) # create dataframe
df.info() # show output
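
The loop above relies on the standard GeoJSON layout, where each entry in `features` carries a `properties` dictionary and a `geometry` object. The sketch below walks through that layout on a small hand-written `FeatureCollection`; the field names (`Park_Name`, `Acres`) and values are hypothetical and are not taken from the South Bend dataset.

```python
import pandas as pd

# Hypothetical FeatureCollection shaped like a GeoJSON API response
sample = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"Park_Name": "Example Park", "Acres": 12.5},
            "geometry": {"type": "Point", "coordinates": [-86.25, 41.68]},
        },
        {
            "type": "Feature",
            "properties": {"Park_Name": "Sample Green", "Acres": 3.0},
            "geometry": {"type": "Point", "coordinates": [-86.20, 41.70]},
        },
    ],
}

# Keep only each feature's attribute dictionary
records = [feature["properties"] for feature in sample["features"]]
props = pd.DataFrame(records)

# GeoJSON point coordinates are ordered [longitude, latitude]
props["lon"] = [f["geometry"]["coordinates"][0] for f in sample["features"]]
props["lat"] = [f["geometry"]["coordinates"][1] for f in sample["features"]]

print(props)
```

When a response includes usable `geometry` objects, `gpd.GeoDataFrame.from_features` can build the `GeoDataFrame` directly from the feature list instead.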

We'll need `latitude` and `longitude`, and those values are currently buried in the `Location_1` column. So we'll start there by splitting out that column on the `\n` character.

In [None]:
df[['Address', 'City', 'LatLon']] = df['Location_1'].str.split(r'\n', expand=True) # split column
df.head() # show output

We're closer!

The next step is breaking out the `latitude` and `longitude` values, and removing the `()` characters.

In [None]:
df['LatLon'] = df['LatLon'].str.replace('[()]', '', regex=True) # remove parentheses (regex=True is required in pandas 2.0+)
df[['Latitude', 'Longitude']] = df['LatLon'].str.split(',', expand=True).astype(float) # split column, convert text to numbers
df.head() # show output
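
As an alternative sketch of the same parsing step, a single regular expression can pull both numbers out of the coordinate string and convert them to floats in one pass. The sample value below is hypothetical, and it assumes the pair is written as `(latitude, longitude)`, matching the column order used above.

```python
import pandas as pd

# Hypothetical value: address line, city line, then "(lat, lon)"
toy = pd.DataFrame(
    {"Location_1": ["123 Example St\nSouth Bend, IN\n(41.68, -86.25)"]}
)

# Split the three lines into separate columns
parts = toy["Location_1"].str.split("\n", expand=True)
parts.columns = ["Address", "City", "LatLon"]

# Capture two signed decimal numbers inside the parentheses
pattern = r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)"
coords = parts["LatLon"].str.extract(pattern)

# Convert the captured text to numeric columns
parts["Latitude"] = coords[0].astype(float)
parts["Longitude"] = coords[1].astype(float)

print(parts[["Address", "City", "Latitude", "Longitude"]])
```

Rows that do not match the pattern come back as missing values rather than raising an error, which makes malformed entries easy to find with `isna()`.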

Now we can convert this to a `GeoDataFrame`.

In [None]:
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.Longitude, df.Latitude), crs="EPSG:4326") # convert to gdf; x is longitude, y is latitude
gdf.to_file("parks.json", driver='GeoJSON')
gdf.info() # inspect output
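
Argument order is the most common pitfall in this step: `gpd.points_from_xy` takes `x` first and `y` second, which for geographic coordinates means longitude, then latitude. The sketch below uses one made-up coordinate pair, stored as text the way a string split would leave it, and checks the result.

```python
import pandas as pd
import geopandas as gpd

# Hypothetical coordinates stored as text, with a stray leading space
toy = pd.DataFrame({"Latitude": ["41.68"], "Longitude": [" -86.25"]})

# Strip whitespace and convert to floats before building geometry
toy["Latitude"] = toy["Latitude"].str.strip().astype(float)
toy["Longitude"] = toy["Longitude"].str.strip().astype(float)

# x = longitude, y = latitude
toy_gdf = gpd.GeoDataFrame(
    toy,
    geometry=gpd.points_from_xy(toy["Longitude"], toy["Latitude"]),
    crs="EPSG:4326",
)

# The point's x should equal the longitude and its y the latitude
print(toy_gdf.geometry.x.iloc[0], toy_gdf.geometry.y.iloc[0])
```

A simple range check would not catch swapped arguments here, because a longitude near -86 is also a valid latitude. Comparing `geometry.x` against the longitude column, or plotting the points, is a more reliable test.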

Now we have the all-important `geometry` column for our first dataset.

## Polygon Data

For our second dataset, let's work with the St. Joseph County zip code boundary file.
- [Link to download](https://sjcgis-stjocogis.hub.arcgis.com/datasets/stjocogis::zip-code-boundaries-3/about)
  * *Note: I've renamed this file `zip.geojson`.*

In [None]:
gdf = gpd.read_file("zip.geojson") # load file
gdf.head() # show geo dataframe head

| | FID | ZIP | City_Town | Shape__Area | Shape__Length | geometry |
|---|---|---|---|---|---|---|
| 0 | 1 | 46506 | Bremen | 183694700.0 | 71554.432719 | POLYGON ((-86.09920 41.50815, -86.09914 41.500... |
| 1 | 2 | 46530 | Granger | 107468900.0 | 59262.19356 | POLYGON ((-86.06268 41.73220, -86.06268 41.731... |
| 2 | 3 | 46536 | Lakeville | 120710700.0 | 50337.989555 | POLYGON ((-86.24495 41.56449, -86.24457 41.563... |
| 3 | 4 | 46544 | Mishawaka | 179076500.0 | 78110.454324 | POLYGON ((-86.06125 41.60846, -86.06126 41.608... |
| 4 | 5 | 46545 | Mishawaka | 79112720.0 | 56893.621236 | POLYGON ((-86.09635 41.72453, -86.09639 41.719... |


Let's connect those zip code polygons with educational attainment data from the American Community Survey (ACS).
- <a href="https://data.census.gov/table/ACSST5Y2022.S1501?t=Education:Educational Attainment&g=050XX00US18141,18141$8600000&moe=false">Link to the ACS dataset</a>
  * *Note: I'm working with the 2022 5-year estimates, which I've renamed `data.csv`.*

In [None]:
df = pd.read_csv("data.csv") # load attribute data
df.columns = df.columns.str.split("!!", n=2, expand=True) # split column headers into multi-level index based on separator (n must be a keyword argument in pandas 2.0+)
df = df.T # transpose dataframe
header = df.iloc[0] # isolate first row to be new header
df = df[1:] # subset dataframe (everything past the first row)
df.columns = header # reassign headers
df = df.reset_index() # reset the index
df.columns = ['Area', 'Coverage', 'Type', *df.columns[3:]] # rename the first three columns by position, keep the rest
df['Area'] = df['Area'].str.replace("ZCTA5 ", "", regex=False) # standardize zip code column
df # show output


| (Label (Grouping), nan, nan) | Area | Coverage | Type | AGE BY EDUCATIONAL ATTAINMENT | Population 18 to 24 years | Less than high school graduate | High school graduate (includes equivalency) | Some college or associate's degree | Bachelor's degree or higher | Population 25 years and over | ... | High school graduate (includes equivalency).1 | Some college or associate's degree.1 | Bachelor's degree or higher.1 | MEDIAN EARNINGS IN THE PAST 12 MONTHS (IN 2022 INFLATION-ADJUSTED DOLLARS) | Population 25 years and over with earnings | Less than high school graduate.1 | High school graduate (includes equivalency).2 | Some college or associate's degree.2 | Bachelor's degree | Graduate or professional degree |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | St. Joseph County, Indiana | Total | Estimate | | 31651 | 2781 | 11312 | 13760 | 3798 | 177621 | ... | (X) | (X) | (X) | | 43401 | 28431 | 35439 | 40901 | 51191 | 70582 |
| 1 | St. Joseph County, Indiana | Male | Estimate | | 15346 | 1580 | 6063 | 6037 | 1666 | 85726 | ... | (X) | (X) | (X) | | 51874 | 32554 | 44936 | 51189 | 64026 | 85742 |
| 2 | St. Joseph County, Indiana | Female | Estimate | | 16305 | 1201 | 5249 | 7723 | 2132 | 91895 | ... | (X) | (X) | (X) | | 35489 | 24167 | 27320 | 33550 | 43238 | 59452 |
| 3 | 46506 | Total | Estimate | | 773 | 256 | 263 | 240 | 14 | 6667 | ... | (X) | (X) | (X) | | 46271 | 34290 | 45855 | 43802 | 53264 | - |
| 4 | 46506 | Male | Estimate | | 359 | 94 | 192 | 73 | 0 | 3221 | ... | (X) | (X) | (X) | | 59384 | 34321 | 53333 | 68245 | 85303 | - |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 73 | 46635 | Male | Estimate | | 132 | 9 | 110 | 13 | 0 | 2177 | ... | (X) | (X) | (X) | | 50234 | 39219 | 55801 | 49879 | 45658 | 75625 |
| 74 | 46635 | Female | Estimate | | 334 | 193 | 89 | 42 | 10 | 2579 | ... | (X) | (X) | (X) | | 42725 | - | - | 38750 | 45556 | 76154 |
| 75 | 46637 | Total | Estimate | | 1956 | 114 | 566 | 782 | 494 | 11516 | ... | (X) | (X) | (X) | | 45659 | 45724 | 34778 | 39607 | 66181 | 53140 |
| 76 | 46637 | Male | Estimate | | 1091 | 31 | 373 | 427 | 260 | 5669 | ... | (X) | (X) | (X) | | 56035 | 63088 | 45677 | 48419 | 70833 | 72875 |
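
The reshaping above is easier to follow on a miniature table. The sketch below is a different route to the same layout, not the code used above: it transposes first, then splits the `!!`-separated headers into named index levels. The three-column input is a hand-made stand-in for `data.csv`, and it assumes the placeholder `-` should become a missing value.

```python
import pandas as pd

# Hypothetical miniature of the ACS export: a label column plus
# "<area>!!<coverage>!!<type>" headers
raw = pd.DataFrame(
    {
        "Label (Grouping)": ["Population 18 to 24 years", "Graduate or professional degree"],
        "ZCTA5 46506!!Total!!Estimate": ["773", "-"],
        "ZCTA5 46506!!Male!!Estimate": ["359", "-"],
    }
)

# One row per original column, one column per original label
long = raw.set_index("Label (Grouping)").T
long.columns.name = None

# Split each header into three index levels and name them
long.index = long.index.str.split("!!", n=2, expand=True)
long.index.names = ["Area", "Coverage", "Type"]
long = long.reset_index()

# Standardize the zip code and convert values to numbers ("-" becomes NaN)
long["Area"] = long["Area"].str.replace("ZCTA5 ", "", regex=False)
value_cols = long.columns[3:]
long[value_cols] = long[value_cols].apply(pd.to_numeric, errors="coerce")

print(long)
```

Naming the index levels before `reset_index()` avoids renaming columns by position afterwards.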


Now we can use `GeoPandas` to merge these datasets.

In [None]:
merged = gdf.merge(df, left_on="ZIP", right_on="Area") # merged attribute and geospatial data
merged # show merged geodataframe
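
Two things are worth checking around this merge: the key columns must share a data type, and the attribute table holds three rows per zip code (`Total`, `Male`, `Female`), so an unfiltered merge repeats each polygon three times. The sketch below shows one way to handle both on toy tables. It uses plain `DataFrame` objects with made-up values and assumes `ZIP` was read as an integer, which may not be true of your file; `GeoDataFrame.merge` accepts the same arguments.

```python
import pandas as pd

# Hypothetical stand-in for the boundary table (geometry omitted)
zips = pd.DataFrame({"ZIP": [46506, 46530], "City_Town": ["Bremen", "Granger"]})

# Hypothetical stand-in for the reshaped ACS table; values are made up
acs = pd.DataFrame(
    {
        "Area": ["46506", "46506", "46506", "46530"],
        "Coverage": ["Total", "Male", "Female", "Total"],
        "Population": [100, 48, 52, 200],
    }
)

# Make the key types match: both sides as strings
zips["ZIP"] = zips["ZIP"].astype(str)

# Keep one row per zip code before merging
totals = acs[acs["Coverage"] == "Total"]

# validate raises if either key is duplicated; indicator flags unmatched rows
checked = zips.merge(
    totals,
    left_on="ZIP",
    right_on="Area",
    how="left",
    validate="one_to_one",
    indicator=True,
)

print(checked)
```

Rows where `_merge` is `left_only` are polygons with no matching attribute record, which usually points to a formatting difference in the key column.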