# Learning goals
After this week's lesson you should be able to:
- Read and write spatial data formats using GeoPandas
- Explore spatial data in a map
- Set and change map projections for GeoDataFrames.
- Create GeoDataFrames with CSVs.
- Perform a spatial and attribute join.

This week's lessons are adapted from:
- [Automating GIS Processes Lesson 2](https://autogis-site.readthedocs.io/en/latest/lessons/lesson-2/geopandas-an-introduction.html)
- Wenzheng Li's materials from CRP 5680 Spring 2022. 


# 0. What is geopandas? 

**Geopandas** is a Python library that allows us to ingest, analyze, and map geospatial vector data. It combines what we have learned in the previous two classes: the tabular data analysis tools in **Pandas** with the geometry handling of **shapely**. Under the hood, it uses a Python library called **fiona**, which handles all different kinds of spatial file formats, and **pyproj**, which manages our coordinate reference systems.

The main data structures in Geopandas are GeoDataFrames and GeoSeries, which are intended to mirror the Pandas DataFrame and Series structures. 

The key distinction in Geopandas is that we will always have a column called `geometry` that contains the geometries related to each row, like so:
<figure class="image">
<img src="https://autogis-site.readthedocs.io/en/2019/_images/geodataframe.png" alt="drawing" width="500" style="display: block; margin: 0 auto"/>
 <figcaption><center>(From Automating GIS Processes)</center></figcaption>
</figure>

## 0.1 The three components of a GeoPandas GeoDataFrame
To create a GeoDataFrame, we need three things:

1. a pandas *DataFrame (df)*
2. a *CRS* (coordinate reference system, represented by an EPSG code, e.g., "EPSG:4326");
3. a shapely *geometry list* which defines the geometric object types of each observation, e.g., points, lines, or polygons.

We are familiar with 1. and 3. from the previous two classes and the Coordinate Reference System from your GIS knowledge. There is a different EPSG code for each CRS. The two most common that we will use are `EPSG:4326` (WGS84) and `EPSG:3857` (WGS84 / pseudo-Mercator, which is a common projection used by Google and OSM).
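
To make the three components concrete, here is a minimal sketch that builds a GeoDataFrame from scratch. The station names and coordinates are made up for illustration; only the constructor pattern matters.

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

# 1. A plain pandas DataFrame (hypothetical station names)
df = pd.DataFrame({"name": ["Station A", "Station B"]})

# 3. A list of shapely geometries, one per row, as (longitude, latitude)
geometry_list = [Point(-73.99, 40.73), Point(-73.95, 40.78)]

# 2. The CRS, given as an EPSG code
toy_gdf = gpd.GeoDataFrame(df, geometry=geometry_list, crs="EPSG:4326")

print(type(toy_gdf["geometry"]))  # the geometry column is a GeoSeries
print(toy_gdf.crs)
```

The `geometry` column behaves like any other column, except that it holds shapely objects and carries the CRS with it.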



# 1. Reading different spatial data file formats
Because of fiona's great under-the-hood functionality, Geopandas supports almost every vector spatial data format. For us, the two most common formats will be ESRI shapefiles and GeoJSONs.

Let's take a look at NYC subway data.

Before we get started, let's orient where we are in our directory: 

In [None]:
# get your jupyter notebook path
import os
os.getcwd()

Now let's read in the NYC subway stations dataset by doing the following: 
- Go to the NYC OpenData portal's page on the data for [Subway Stations](https://data.cityofnewyork.us/Transportation/Subway-Stations/arq3-7z49)
- Select **Export** in the upper-right hand corner of the page and select **Shapefile**.
- Save this file in the same folder as this notebook. 
- Keep it as a zipped file! (Geopandas can read both the zipped and unzipped version of the shapefile, but I like to keep it in zipped format because it's cleaner)

In [None]:
# We will need this to use geopandas
import geopandas as gpd
# and remember that we're assigning a nickname to the package

# Now let's assign the variable name `stations` to the geodataframe
# These file names have spaces in them, which, as a reminder, 
# you should not do in your own work!
stations = gpd.read_file('Subway Stations.zip')


# 2. Initial data exploration with Geopandas
Once we've read in the file, let's take a look at the data. I'm sure somewhere online is the data dictionary (which I couldn't find :/) but this table contains what we'd expect: 
- `objectid` which is the ID number for each geometry
- `name` the name of the station 
- `line` which are all the lines that stop at that station
- `geometry` which is the shapely geometry
- and other columns that we'll probably not need. 

In [None]:
# Same as you would do in Pandas
stations.head()

But unlike a pandas DataFrame, we can also do this with our data:

In [None]:
stations.plot()

We can also find the CRS for this dataset: 

In [None]:
stations.crs

# 3. Analyzing and manipulating data in geopandas

## 3.1 Changing projections 

Currently my data is in `EPSG:4326` but say I wanted to change my CRS to `EPSG:3857` as I know other datasets I'm working in are in 3857. 


I can use the `.to_crs()` function to do this: 

In [None]:
# Note here that this function requires as an input
# the name of the new coordinate system
stations.to_crs(epsg=3857)

Now let's check our dataset to make sure the CRS was changed: 

In [None]:
stations.crs

Hm, it wasn't changed! Why not? 

`stations.to_crs(epsg=3857)` returns a re-projected copy and leaves the original untouched, so unless we re-assign our variable name `stations` to this re-projected version, the old version will not be updated.

In [None]:
stations = stations.to_crs(epsg=3857)
stations.crs

That worked! A good thing about using 3857 is that our units are in meters. This means that when we perform calculations on these geometries, the values we get are also going to be in meters.
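
One caveat worth checking for yourself: the units of 3857 are meters, but Web Mercator stretches distances as you move away from the equator. The sketch below measures the distance between the same two made-up points in 3857 and in UTM zone 18N (`EPSG:32618`). That second CRS is not used elsewhere in this notebook; it is an assumption chosen here because it covers NYC and is also in meters.

```python
import geopandas as gpd
from shapely.geometry import Point

# Two hypothetical points in NYC, 0.1 degrees of longitude apart
pts = gpd.GeoSeries([Point(-74.0, 40.7), Point(-73.9, 40.7)], crs="EPSG:4326")

web_mercator = pts.to_crs(epsg=3857)  # units: meters (Web Mercator)
utm_18n = pts.to_crs(epsg=32618)      # units: meters (UTM zone 18N)

# Straight-line distance between the two points in each CRS
dist_3857 = web_mercator.iloc[0].distance(web_mercator.iloc[1])
dist_utm = utm_18n.iloc[0].distance(utm_18n.iloc[1])

ratio = dist_3857 / dist_utm
print(round(dist_3857), round(dist_utm), round(ratio, 2))
```

If the ratio comes out above 1, distances and buffer radii measured in 3857 overstate the distance on the ground at this latitude, so treat them as approximate.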


## 3.2 Combining Pandas and Shapely functionalities

Note that you can use the shapely functionalities we covered in class on Monday by selecting the geometries from this GeoDataFrame.
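
If you want to try these without the stations file, here is a small sketch on a made-up GeoSeries of two points. The coordinates are hypothetical and the CRS is only there so the units read as meters.

```python
import geopandas as gpd
from shapely.geometry import Point

# A tiny hypothetical GeoSeries of points in a meter-based CRS
points = gpd.GeoSeries([Point(0, 0), Point(100, 50)], crs="EPSG:3857")

areas = points.area          # points have zero area
buffers = points.buffer(10)  # polygons approximating circles of radius 10
xs = points.x                # x coordinate of each point

print(areas.tolist())
print(buffers.area.round(1).tolist())
print(xs.tolist())
```

Note that a buffer is a polygon that approximates a circle, so its area is slightly smaller than pi times the radius squared.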

In [None]:
# Since the stations are points, they have no area
stations['geometry'].area

In [None]:
stations['geometry'].buffer(10)

In [None]:
# This returns the x coordinate of each row's point
stations['geometry'].x

You can also use the Pandas functionalities: 

In [None]:
stations.loc[1:10,'geometry']

In [None]:
# Note that we can take the head of any column 
# or subset of columns 

stations['name'].head()

## 3.3 Mapping
Let's talk about mapping in a bit more detail here.

You can also read the [GeoPandas user guide](https://geopandas.org/en/stable/docs/user_guide/mapping.html) for more on mapping. 

Beyond the basic map we just made above, we use the following to enhance our map a bit: 

We can specify which column will determine the color scheme of the plot. If the column is a numeric data type (`int` or `float`), `plot()` will try to create a **choropleth** map; if the column is a `str` or `object` type, it will try to create a **categorical** map.

In [None]:
# The column 'line' is an object type here
stations['line'].dtype

In [None]:
stations.plot(column='line')

# This plot is not very informative
# since there are many different combinations of lines that pass through 
# each station
# The map does not exactly line up with subway lines

We can also specify the size of our plot using the `figsize=` optional input.

In [None]:
# Here, we're using the `figsize` argument to make the plot bigger
# figsize takes a tuple of (width, height) in inches
stations.plot(column='line',
            figsize=(10,10))

Because the plot keeps the aspect ratio of the data's coordinates fixed (which depends on the CRS we use), the map is drawn as large as it can be within the `figsize` we give it. Making the figure wider than the data needs does not make the map any bigger.

In [None]:
# This produces the same size plot as above
stations.plot(column='line',
            figsize=(20,10))

You can also add a legend to this plot

In [None]:
# Same plot as above, now with a legend
stations.plot(column='line',
            figsize=(20,10),
            legend=True)

# This legend is not very informative 
# given the issue mentioned above 
# that there are many different combinations of lines that pass through
# each station

# 4. Working with multiple geospatial datasets. 

Most often, we are working with multiple datasets in order to analyze their relationship to each other.


For this example, let's take a look at public housing accessibility from a transit perspective. 

First, download the NYCHA public housing data from [here](https://data.cityofnewyork.us/Housing-Development/Map-of-NYCHA-Developments/i9rv-hdr5) and save it in the same folder that contains this notebook.

In [None]:
# A bad file name again! Also, why do they call it a "Map"?
public_housing = gpd.read_file('Map of NYCHA Developments.zip')

# We are going to filter the data to exclude MultiPolygon objects for this exercise
public_housing = public_housing[public_housing['geometry'].geom_type != 'MultiPolygon']

# oops, there's a typo in the column name where "development" is spelled
# "developmen". 
# Let's fix that.
public_housing = public_housing.rename(columns={'developmen':'development'})


In [None]:
public_housing.head()

In [None]:
public_housing.crs

In [None]:
# This looks very faint, because we're working with building footprints 
# at the scale of the entire city
public_housing.plot(figsize=(10,10))

Let's first change our CRS to 3857. 

In [None]:
public_housing = public_housing.to_crs(epsg=3857)

Now, let's say we want to calculate **how many subway stations are within a 10 minute walk of each public housing building**.

We are going to do this by: 
- Providing an estimate of the distance a typical person can walk in 10 minutes
- Creating a new geometry that is buffered around each building by that distance.


A quick google search tells me 10 minutes is about 800 meters based on average walking speeds. 

## 4.1 Making a new GeoDataFrame
Let's make a new dataset that buffers each building with an 800 meter distance but still has the original tabular data of our public housing dataset. Recall that a GeoDataFrame takes a dataframe, a CRS, and a set of geometries.

In [None]:
# First, let's make our geometries
buffer_geom = public_housing['geometry'].buffer(800)

# Second, we already know the CRS
# This is the same as the CRS of public housing data 
buffer_crs = public_housing.crs 

# Third, let's grab the data we want
buffer_data = public_housing[['borough', 'development', 'tds_num']]  

# Now, let's put it all together using the GeoDataFrame constructor
public_housing_buffer = gpd.GeoDataFrame(buffer_data, crs=buffer_crs, geometry=buffer_geom)

In [None]:
# Note that I've started new lines for each argument 
# to make it easier to read
public_housing_buffer.plot(figsize=(10,10),
                            facecolor="none", 
                            edgecolor="green", 
                            lw=.5)

Whoa, what happened here?? Beyond setting the figure size, I'm including other optional inputs that allow me to style these more clearly.
- `facecolor` is the fill color, which I want to set to "none" to make the polygons transparent
- `edgecolor` is the edge color, and `plot()` [recognizes certain named colors](https://matplotlib.org/stable/gallery/color/named_colors.html). 
- `lw` allows me to set the line weight. 

## 4.1.1 (Detour) Making a new GeoDataFrame from a CSV  
Let's say we have a CSV with a latitude and longitude column. We can easily turn this into a GeoDataFrame by transforming these lat/lng pairs into shapely `Point`s.

As an example: 
- download the CSV of [NYC Firehouses](https://data.cityofnewyork.us/Public-Safety/FDNY-Firehouse-Listing/hc8x-tcnd). (It's under **Export**.)
- Make sure this CSV is in the same folder as this notebook. 

In [None]:
import pandas as pd
firehouses_csv = pd.read_csv('FDNY_Firehouse_Listing.csv')
firehouses_csv.head()

In order to create a GeoDataFrame, we need to give Geopandas all three components (data/DF, CRS, and geometry) of the GeoDataFrame. We will use the `points_from_xy()` function in GeoPandas. This is basically using shapely `Points` under the hood.
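
If you don't have the CSV handy, the same pattern can be sketched on a tiny in-memory table. The column names and rows below are hypothetical stand-ins, not the real firehouse listing.

```python
import io
import pandas as pd
import geopandas as gpd

# A hypothetical two-row CSV standing in for a downloaded file
csv_text = """FacilityName,Latitude,Longitude
Engine 1,40.70,-74.00
Engine 2,40.75,-73.98
"""
toy_csv = pd.read_csv(io.StringIO(csv_text))

# points_from_xy takes x first, then y: longitude, then latitude
toy_points = gpd.GeoDataFrame(
    toy_csv,
    geometry=gpd.points_from_xy(toy_csv["Longitude"], toy_csv["Latitude"]),
    crs="EPSG:4326",
)
print(toy_points[["FacilityName", "geometry"]])
```

The most common mistake here is swapping the order: x is longitude and y is latitude.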

In [None]:
# I know this is 4326 but it also says on the Firehouses listing page:
# "Latitude coordinate for Global Coordinate System, WGS 1984, decimal degrees (EPSG 4326)"
firehouses = gpd.GeoDataFrame(firehouses_csv, 
                            geometry=gpd.points_from_xy(firehouses_csv['Longitude'], firehouses_csv['Latitude']),
                            crs='EPSG:4326')

In [None]:
# Voila
firehouses.head()

Going back to our public housing buffer data, let's see what it looks like.

## 4.2 Creating a spatial join 

Now, let's count how many subway stations are in each buffer to get a sense of transit accessibility by using a spatial join between our new `public_housing_buffer` dataset and our `stations` dataset.

We are going to use `gpd.sjoin(left_geoDF,right_geoDF)`. This function optionally takes as an input `how` to specify what type of spatial join.

There are a couple of different types of spatial joins:
- `Left outer join`: In a LEFT OUTER JOIN (how='left'), we keep all rows from the left and duplicate them if necessary to represent multiple hits between the two dataframes. We retain attributes of the right if they intersect and lose right rows that don’t intersect. A left outer join implies that we are interested in retaining the geometries of the left.
- `Right outer join`: In a RIGHT OUTER JOIN (how='right'), we keep all rows from the right and duplicate them if necessary to represent multiple hits between the two dataframes. We retain attributes of the left if they intersect and lose left rows that don’t intersect. A right outer join implies that we are interested in retaining the geometries of the right.
- `Inner join` (this is the default setting): In an INNER JOIN (how='inner'), we keep rows from the right and left only where their binary predicate is True. We duplicate them if necessary to represent multiple hits between the two dataframes. We retain attributes of the right and left only if they intersect and lose all rows that do not. An inner join implies that we are interested in retaining the geometries of the left.

In this case, we want to join `public_housing_buffer` and `stations` using a **left outer join**, because we want to keep every public housing buffer along with all of its hits among the subway stations.
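
To see the difference between join types on something small, here is a sketch with two made-up square zones and three made-up points. All names and coordinates are hypothetical.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical polygons: zone A contains two points, zone B contains none
zones = gpd.GeoDataFrame(
    {"zone": ["A", "B"]},
    geometry=[box(0, 0, 10, 10), box(20, 20, 30, 30)],
    crs="EPSG:3857",
)
stops = gpd.GeoDataFrame(
    {"stop_id": [1, 2, 3]},
    geometry=[Point(2, 2), Point(8, 8), Point(50, 50)],
    crs="EPSG:3857",
)

# Left join keeps every zone; inner join keeps only zones with a hit
left_join = gpd.sjoin(zones, stops, how="left")
inner_join = gpd.sjoin(zones, stops, how="inner")

print(left_join[["zone", "stop_id"]])
print(inner_join[["zone", "stop_id"]])
```

In the left join, zone A appears twice (once per matching point) and zone B appears once with a missing `stop_id`. In the inner join, zone B drops out entirely.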

In [None]:
# Before we assign this to a new variable, 
# let's check to see what the join looks like
gpd.sjoin(public_housing_buffer,stations,how='left')

As we can see, there are duplicate rows of the buffers where each buffered geometry has intersected with multiple stations.

Great, since that looked like it worked, let's assign it to a new variable name: 


In [None]:
buffers_w_stations = gpd.sjoin(public_housing_buffer,stations,how='left')

Lastly, we are going to use a common pandas operation I'm going to call **groupby-and-summarize**. Here, we aggregate our tabular data by a column containing a category (here, we are aggregating by `development`) and perform some kind of summary function, typically something like a count, mean, min, or max.

It's constructed like so: 
- `df.groupby('col_name').sum()` or
- `df.groupby('col_name').count()` or
- `df.groupby('col_name').min()`

Note, however, that the sum/count/min is applied across the table's other columns, not just the one we are interested in.
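
Here is a minimal groupby-and-summarize sketch on a made-up table with one row per (development, station) match. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical joined table: one row per (development, station) match
matches = pd.DataFrame({
    "development": ["North", "North", "South", "South", "South"],
    "station_id": [11, 12, 21, 22, 23],
    "platforms": [2, 4, 1, 2, 2],
})

# One row per development, with the summary computed for each other column
counts = matches.groupby("development").count()
totals = matches.groupby("development").sum()

print(counts)
print(totals)
```

Because the summary runs on every remaining column, some results are not meaningful (summing `station_id`, for example), so you will usually pick out just the column you need afterwards.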

In [None]:
# Here we are counting all the instances of each column value per development. 
buffers_w_stations.groupby('development').count()

## Q.1 In-Class Exercise #1 (2pts)
Q: In the code `buffers_w_stations.groupby('development').count()`, why do some rows have the same count all the way across while other rows have different counts within the row?

A: [answer in the markdown cell here]

Here, the column we are interested in is `objectid` which is now displaying a count of the different station IDs within each development's buffer. 

In [None]:
station_counts = buffers_w_stations.groupby('development').count()['objectid']
station_counts

## 4.3 Attribute joins 
Now, let's join this back to our `public_housing_buffer` GeoDataFrame so we can map it.

In an attribute join, a `GeoSeries` or `GeoDataFrame` is combined with a regular `pandas.Series` or `pandas.DataFrame` based on a common variable. This is analogous to normal merging or joining in pandas.

This is what a merge looks like visually

<figure class="image">
<img src="https://miro.medium.com/max/1400/1*ZCpo3gXuXI4KFhKivEt2ZA.png" alt="drawing" width="700" style="display: block; margin: 0 auto"/>
</figure>
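
The mechanics of merging a table with a Series on its index can be sketched in plain pandas. The development names, boroughs, and counts below are made up for illustration.

```python
import pandas as pd

# Hypothetical table with one row per development
developments = pd.DataFrame({
    "development": ["North", "South", "East"],
    "borough": ["Bronx", "Brooklyn", "Queens"],
})

# Hypothetical named Series indexed by development, like a groupby result
station_counts_toy = pd.Series(
    [2, 3],
    index=pd.Index(["North", "South"], name="development"),
    name="station_count",
)

# Match the left table's column against the Series index
merged = developments.merge(
    station_counts_toy, left_on="development", right_index=True
)
print(merged)
```

By default `merge` does an inner join, so "East" (which has no entry in the Series) is dropped; passing `how='left'` would keep it with a missing count.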


In [None]:
# .merge() takes as an argument the dataframe you want to merge with
# and the left and right columns you want to merge on
# Here, we're merging on the index of the station_counts pandas.Series
# and the development column of the public_housing_buffer geodataframe
public_housing_buffer.merge(station_counts, 
                            left_on='development', 
                            right_index=True)

# You will typically be merging on a column, not an index
# Here station_counts has the development column as its index   

That worked! Let's update our `public_housing_buffer` variable name to point to our updated geodataframe with this new column

In [None]:
public_housing_buffer = public_housing_buffer.merge(station_counts, 
                            left_on='development', 
                            right_index=True)

The last thing I want to change is the column name from `objectid`, which it is not, to something more descriptive.

In [None]:
public_housing_buffer = public_housing_buffer.rename(columns={'objectid':'station_count'})

## 4.4 Writing to a file
Now let's write this buffer data we've created to a file. The default is writing to a shapefile. 

In [None]:
# this will write to a folder containing a .shp, .shx, .dbf, and .prj file
public_housing_buffer.to_file('public_housing_buffer')


In [None]:
# This will write to a single .geojson file
# You need to specify the driver
public_housing_buffer.to_file('public_housing_buffer.geojson',driver='GeoJSON')

## 4.5 Making a choropleth map

Now let's make a choropleth map.

In [None]:
public_housing_buffer.plot(column='station_count',
                            figsize=(15,15),
                            legend=True,
                            alpha=.6)
# Alpha is a value between 0 and 1 that controls the transparency of the fill color

The default color map (`cmap`) in GeoPandas is `viridis`. You can find other ones [here](https://matplotlib.org/stable/tutorials/colors/colormaps.html). Let's change our color map to `Reds`.

In [None]:
public_housing_buffer.plot(column='station_count',
                            figsize=(15,15),
                            legend=True,
                            alpha=.6,
                            cmap='Reds')

Ok, this is not a great map (and probably won't be today) since it's, without more context, a poetic suggestion of transit accessibility. 

## 4.6 Mapping multiple layers on the same map. 

To give some context to our map, let's plot our subway stations, public housing buildings, and building buffer data together. 

We are going to use a library called `matplotlib` (it's actually being used by `.plot()`). We'll cover this library more extensively in the coming week. 

In [None]:
# Import the pyplot module from matplotlib and assign it the nickname plt
import matplotlib.pyplot as plt

# plt.subplots() returns a tuple of the figure and the axis, 
# which we are assigning to fig and ax, respectively

# Note that using this method of creating a plot, 
# we set the figure size using the figsize argument in plt.subplots()
fig1, ax1 = plt.subplots(figsize=(15, 15))

# The drawing order is determined by the order in which we call the plot methods

# We can use the ax argument to specify the axis we want to plot on
# Here, we're plotting the public housing buffer on the axis we just created
public_housing_buffer.plot(column='station_count',
                            ax=ax1,
                            alpha=.6,
                            cmap='Reds')

# markersize is the marker size
stations.plot(markersize=5,
            color='black',
            ax=ax1)
# Plot the building footprints last so they are drawn on top
public_housing.plot(
            color='red',
            ax=ax1)



Still not a great map, but at least we have a bit more context.

## Q.2 In-Class Exercise 2 (2pts)
Which building or buildings have the highest number of stations within a 10 min walk? Show the code you used to get this answer.

In [None]:
## Insert your code here


## Q.3 In-Class Exercise 3 (5pts)
Create a new column called `area` in `public_housing_buffer` that contains the area of the original building footprints in square meters. (Hint: you'll need to do a `merge`.)

In [None]:
## Insert your code here


## Q.4 In-Class Exercise 4 (2 pts)
- From the NYC open data portal, download a [shapefile of neighborhoods](https://data.cityofnewyork.us/City-Government/2010-Neighborhood-Tabulation-Areas-NTAs-/cpf4-rkhq)
- Make sure to change the CRS so it matches the other layers. 
- Add it to the map in section 4.6 *first* (i.e. below the other layers)
- Make the fill color of the neighborhoods `lightgray`.


In [None]:
## Insert your code here


## Q.5 In-Class Exercise 5 - OPTIONAL (5pts)
Separately, create a choropleth map that shows the number of subway stations in each neighborhood. 

In [None]:
## Insert your code here
