# Notebook 4: Voting data à la Feng-Porter

Inspired by the [recent paper](https://epubs.siam.org/doi/abs/10.1137/19M1241519?casa_token=dNUS-FzzEf8AAAAA:Kcuzuz9KxnP7Q0EwdgKF7dBKVrN9AR_yeR6f9FyEABRS80oEp7facyYxcxjUtrS78PVZNwsG4Ng) by Feng and Porter, we will use persistent homology to find patterns in North Carolina voting data. Along the way we'll see some ways to handle geospatial data.

This Notebook has **two compulsory Exercises**. 

First we need to fetch the vote data from a GitHub repository I set up for this purpose. Running the next block should (after a short delay) create a folder called `TDA-Class-Notebook4` in the current working directory, which is usually the directory this notebook is in.

In [None]:
!git clone https://github.com/thomasweighill/TDA-Class-Notebook4.git

Next, let's install `gudhi`, along with a GIS library (`geopandas`) and `networkx`, which we'll need later.

In [None]:
!pip install gudhi

In [None]:
!pip install geopandas

In [None]:
!pip install networkx==2.5

We will also need the GerryChain library. This library is designed for large-scale analysis of gerrymandering, but we'll just need a few helpful functions from it.

In [None]:
!pip install gerrychain

Import some libraries.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import geopandas as gpd
from gerrychain import Graph
import networkx as nx
import gudhi

# Loading in geodata
We are going to load the shapefile `NC_VTD.shp` using `geopandas`. A _shapefile_ consists of shapes (polygons) with attached data (e.g. population, votes). The `geopandas` library reads it in and stores it as a _geodataframe_. Let's read in the shapefile and take a look.

In [None]:
gdf = gpd.read_file('TDA-Class-Notebook4/NC_VTD/NC_VTD.shp')

In [None]:
gdf.head()

Take a look at the printout above showing the first few rows of `gdf`. Each row is a voting precinct and carries a lot of data, such as various IDs and population totals (VAP stands for "Voting Age Population"). The last column is the shape of that precinct (this is what makes this a geodataframe).

We might as well learn a few GIS tricks while we are here. You can plot the geodataframe like this:

In [None]:
gdf.plot()

If you want to color the map based on various data, you can just tell it which column to use. For example, we can color the precincts by county.

In [None]:
gdf.plot(column='County')

Let's try and color our precincts based on the 2016 election. The columns for this election are `EL16G_PR_R` and `EL16G_PR_D`. We'll color the precincts based on the Democratic vote share: that is, `EL16G_PR_D`/(`EL16G_PR_R`+`EL16G_PR_D`). We'll use the color map `RdBu` to get the traditional red to blue shading.

In [None]:
gdf.plot(column=gdf.EL16G_PR_D/(gdf.EL16G_PR_R+gdf.EL16G_PR_D), cmap='RdBu')
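
One caveat with the share formula: a precinct with no recorded votes for either party gives 0/0. The sketch below uses made-up vote counts (not values from `NC_VTD.shp`) and a hypothetical helper `dem_share` to show one way of computing the share so that such precincts come out as `NaN` instead of triggering a division warning.

```python
import numpy as np

def dem_share(dem_votes, rep_votes):
    """Two-party Democratic vote share; NaN where no two-party votes were cast."""
    dem = np.asarray(dem_votes, dtype=float)
    rep = np.asarray(rep_votes, dtype=float)
    total = dem + rep
    share = np.full_like(total, np.nan)
    # Only divide where the denominator is positive; other entries stay NaN.
    np.divide(dem, total, out=share, where=total > 0)
    return share

# Made-up counts for three precincts; the last one has no votes at all.
shares = dem_share([300, 100, 0], [100, 300, 0])
```

The same helper should accept the `EL16G_PR_D` and `EL16G_PR_R` columns directly, since `np.asarray` converts a dataframe column to an array.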

Does it look realistic? Let's zoom in on Guilford county. Geodataframes can be manipulated just like ordinary `pandas` dataframes. If you haven't encountered dataframes before, don't worry: we won't need much dataframe manipulation. Here is how to pull out a specific county.

In [None]:
guilford_county = gdf[gdf.County == 'Guilford']

Let's look at what data we selected.

In [None]:
guilford_county

This doesn't look right. No precincts were selected! Let's take a closer look at the `County` column.

In [None]:
gdf.County

The problem is clear now: counties are encoded by numbers. These numbers are called _FIPS_ codes. You can Google the FIPS code for Guilford county; it's 37081. Note the quotes in the next block: the codes are stored as strings, not integers.

In [None]:
guilford_county = gdf[gdf.County == '37081']
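
A county FIPS code is the two-digit state code followed by a three-digit county code. Here is a small sketch of how the pieces fit together, and of why the quotes matter. The helper `county_fips` is hypothetical and only for illustration.

```python
def county_fips(state_code, county_code):
    """Build a 5-character county FIPS string from numeric state and county codes."""
    # Zero-pad so that, e.g., county 81 becomes '081'.
    return f"{int(state_code):02d}{int(county_code):03d}"

# North Carolina is state 37 and Guilford is county 081.
guilford = county_fips(37, 81)

# A string never equals an integer, so comparing against 37081 without
# quotes would select nothing, just like comparing against 'Guilford' did.
matches_string = (guilford == "37081")
matches_int = (guilford == 37081)
```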

In [None]:
guilford_county.head()

This looks much better. Let's plot it, first blank and then with vote data.

In [None]:
guilford_county.plot()

In [None]:
guilford_county.plot(
    column=guilford_county.EL16G_PR_D/(guilford_county.EL16G_PR_R+guilford_county.EL16G_PR_D),
    cmap='RdBu'
)

# Vietoris-Rips filtration

Let's imitate the paper of Feng and Porter by taking the centroids of the Republican-won precincts and building a Vietoris-Rips (VR) complex on them. Let's first cut down Guilford county to just the R precincts by taking only those precincts where the Republican votes outnumber the Democratic ones.

We'll extract the centroids afterwards. The selection looks like this:

In [None]:
guilford_county_R_precincts = guilford_county[guilford_county.EL16G_PR_R  > guilford_county.EL16G_PR_D]

**Side note:** This kind of data selection and manipulation takes practice and I don't expect everyone to know how to deal with dataframes. Feel free to ask if there's a specific type of data operation you want to know how to do.

Let's check we did it right by plotting the R precincts.

In [None]:
guilford_county_R_precincts.plot()

It's time to find the centroids of all these precincts. Fortunately, `geopandas` computes them for us through the `centroid` attribute of the geometry column. Extracting the coordinates as an array is a little bit tricky, but here's how to do it.

In [None]:
centroids = np.array([
    x.coords[0] for x in guilford_county_R_precincts.geometry.centroid
])
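
For intuition about what the `centroid` attribute returns: the centroid of a polygon is its center of mass, which in general differs from the average of its corner points. The sketch below computes it with the standard shoelace formulas for a simple polygon without holes. It is an illustration in plain `numpy` with a hypothetical helper `polygon_centroid`, not the routine `geopandas` uses internally.

```python
import numpy as np

def polygon_centroid(vertices):
    """Centroid of a simple polygon given as an (n, 2) sequence of vertices in order."""
    pts = np.asarray(vertices, dtype=float)
    x, y = pts[:, 0], pts[:, 1]
    # Pair each vertex with the next one, wrapping around at the end.
    x_next, y_next = np.roll(x, -1), np.roll(y, -1)
    cross = x * y_next - x_next * y
    signed_area = cross.sum() / 2.0
    cx = ((x + x_next) * cross).sum() / (6.0 * signed_area)
    cy = ((y + y_next) * cross).sum() / (6.0 * signed_area)
    return np.array([cx, cy])

# An L-shaped region: its centroid and the plain average of its corners differ.
l_shape = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
centroid = polygon_centroid(l_shape)
vertex_mean = np.mean(l_shape, axis=0)
```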

Let's see what we got.

In [None]:
centroids

Since this is just an array of points, one row per precinct, we can plot them as a scatter plot like so.

In [None]:
plt.scatter(centroids[:, 0], centroids[:, 1])

# Exercise 1: VR persistence

Now that you have the point cloud data you need, write code to compute the persistent homology of this point cloud in dimensions 0 and 1. Does $H_1$ find the blue islands (clusters of Democratic-won precincts surrounded by Republican-won ones)?

# Adjacency complex

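Let's try another method and construct the dual graph (also called the adjacency graph) of the Guilford county data: one node per precinct, with an edge between two precincts whenever they are adjacent. This is what we need `gerrychain` for. With `adjacency='queen'`, two precincts count as adjacent if their boundaries share at least one point.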

In [None]:
graph = Graph.from_geodataframe(guilford_county, adjacency='queen')

Just for fun, let's draw the graph and see that it doesn't really look anything like Guilford county. That's because `nx.draw` only uses the vertices and edges, and picks its own layout without any spatial information.

In [None]:
nx.draw(graph)

We are now going to build a `gudhi` `SimplexTree` out of this graph so that we can compute persistence with it. First we add a vertex for every node in our graph.

In [None]:
scomplex = gudhi.SimplexTree()
for i in graph.nodes:
    scomplex.insert([i]) #add a 0-simplex, given as a list with just one vertex

Now we add a 1-simplex for every edge in the graph. This may seem redundant, but we are just translating a `networkx` graph object into a language that `gudhi` understands.

In [None]:
for e in graph.edges:
    scomplex.insert([e[0], e[1]]) #add a 1-simplex

One last thing remains: adding a 2-simplex for every triangle in the graph. `networkx` has a way of enumerating all cliques in the graph (cliques are subgraphs with all possible edges present, e.g. triangles or four vertices with every pair connected by an edge). We'll grab all the cliques and then filter to just the ones of size three; these are our triangles.

In [None]:
all_cliques = nx.enumerate_all_cliques(graph)

In [None]:
all_triangles = [x for x in all_cliques if len(x) == 3]
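
To see what the clique enumeration produces, here is a sketch on a made-up toy graph (not the Guilford graph). It also shows that `enumerate_all_cliques` returns a generator, which can only be looped over once.

```python
import networkx as nx

# Toy graph: a square 0-1-2-3 with one diagonal (0-2), plus a pendant vertex 4.
toy = nx.Graph([(0, 1), (1, 2), (2, 3), (3, 0), (0, 2), (3, 4)])

# enumerate_all_cliques yields cliques of every size, smallest first:
# single vertices, then edges, then triangles, and so on.
gen = nx.enumerate_all_cliques(toy)
cliques = list(gen)
triangles = sorted(sorted(c) for c in cliques if len(c) == 3)

# The generator is now exhausted, so a second pass finds nothing.
leftover = list(gen)
```

In particular, if you rerun the filtering step without first recreating `all_cliques`, you should expect an empty list of triangles.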

Let's add all those triangles as 2-simplices.

In [None]:
for t in all_triangles:
    scomplex.insert([t[0], t[1], t[2]]) #add a 2-simplex

We need a filtration function on the vertices, which for us is just the Democratic vote share of the corresponding precinct.

In [None]:
for v in graph.nodes:
    dem = graph.nodes[v]['EL16G_PR_D']
    rep = graph.nodes[v]['EL16G_PR_R']
    scomplex.assign_filtration(
        [v],  # we have to put [] here because a 0-simplex is a list with one vertex
        filtration=dem / (rep + dem)
    )

And now, very importantly, we need to assign a filtration value to every 1-simplex and 2-simplex. This value is just the highest filtration value among the vertices of that simplex. Do you see how this ensures that each edge appears only when _both_ of its vertices have appeared?
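
To make the rule concrete, here is a small sketch in plain Python on three mutually adjacent precincts with made-up vote shares. The names `vertex_value` and `present_at` are hypothetical and only for illustration.

```python
# Hypothetical vote shares on three mutually adjacent precincts a, b, c.
vertex_value = {"a": 0.2, "b": 0.7, "c": 0.4}

simplices = [
    ("a",), ("b",), ("c",),
    ("a", "b"), ("a", "c"), ("b", "c"),
    ("a", "b", "c"),
]

# Each simplex enters the filtration when its last vertex does.
filtration = {s: max(vertex_value[v] for v in s) for s in simplices}

def present_at(threshold):
    """Simplices whose filtration value is at most the threshold."""
    return [s for s in simplices if filtration[s] <= threshold]

# At threshold 0.5 only a, c and the edge between them are present.
at_half = present_at(0.5)
```

Because every face of a simplex has a value no larger than the simplex itself, each threshold gives a valid simplicial complex.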

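Fortunately, `gudhi` has a pre-built function for this. `make_filtration_non_decreasing` raises the filtration value of any simplex that is lower than the value of one of its faces. Since we inserted the edges and triangles without a filtration value (so they default to 0), each one ends up at exactly the maximum over its vertices.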

In [None]:
scomplex.make_filtration_non_decreasing()

# Exercise 2
Now that we have the simplex tree `scomplex` all ready to go, let's look at the persistence.

(a) Plot the persistence diagram for `scomplex`. Can you find the $H_1$ points corresponding to the blue islands?

(b) Repeat the above in the code blocks below, except instead of making the filtration value the Democratic vote share, make it the Republican vote share. Look at the 0-dimensional persistence diagram. Do you see two points away from the diagonal corresponding to blue islands?
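
To build intuition for what the 0-dimensional diagram records, here is a sketch in plain Python (a union-find implementation of the elder rule, not `gudhi`) on a made-up path graph. Each local minimum of the vertex function starts a component, and a component dies when it merges into an older one. The helper `h0_persistence` and the vote shares are hypothetical.

```python
def h0_persistence(values, edges):
    """0-dimensional persistence pairs of the sublevel filtration on a graph.

    values: dict vertex -> filtration value; edges: iterable of vertex pairs.
    Returns (birth, death) pairs with death > birth, plus (birth, inf) for
    each component that never merges.
    """
    neighbors = {v: [] for v in values}
    for u, w in edges:
        neighbors[u].append(w)
        neighbors[w].append(u)

    parent = {}   # union-find forest over the vertices added so far
    birth = {}    # root -> birth value of its component

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    pairs = []
    for v in sorted(values, key=values.get):
        parent[v] = v
        birth[v] = values[v]
        for w in neighbors[v]:
            if w not in parent:
                continue  # neighbor has not entered the filtration yet
            a, b = find(v), find(w)
            if a == b:
                continue
            # Elder rule: the component born later dies at this merge.
            young, old = (a, b) if birth[a] > birth[b] else (b, a)
            if values[v] > birth[young]:
                pairs.append((birth[young], values[v]))
            parent[young] = old
    roots = {find(v) for v in values}
    pairs.extend((birth[r], float("inf")) for r in roots)
    return sorted(pairs)

# Path graph 0-1-2-3-4 with made-up Republican shares;
# vertices 0, 2 and 4 are local minima.
rep_share = {0: 0.2, 1: 0.8, 2: 0.3, 3: 0.9, 4: 0.1}
path_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
pairs = h0_persistence(rep_share, path_edges)
```

Here the three local minima give three points in the diagram: one that never dies and two that die when their components merge into an older one.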

# Exercise 3
(a) Pick another county in North Carolina and select the data for that county only, making a geodataframe just like we did with Guilford county. Plot the county with Blue-Red shading showing vote shares just like above.

(b) Pick one of the above persistence methods (Exercise 1, Exercise 2(a) or Exercise 2(b)) and plot the persistence diagram for your chosen county.

(c) Briefly comment in a text block below whether you think the method you chose did a good job at finding the blue islands (or other patterns in the vote data), and why. 

# Challenge questions

(600-level students are required to attempt at least one of these.)

- Pick a county (you can use Guilford if you like) and make a persistence diagram for two different elections (e.g. 2012 and 2016 Presidential elections) using whatever method you like best. To see which columns of the dataframe encode which elections, look [here](https://github.com/mggg-states/NC-shapefiles). Is there a small or large difference between the two diagrams? 
- Repeat the above experiment for multiple pairs of elections and find which two elections are the closest and furthest apart in terms of Wasserstein or bottleneck distance.
- Repeat one of the above methods but instead of Democratic or Republican vote share, use the fraction of Non-Hispanic Black population (or 1 minus this value). The column for Non-Hispanic Black population (as collected by the Census) is `NH_BLACK`, and the column for total population is `TOTPOP`. Do you see some similarities between the persistence diagrams for racial data and the persistence diagram for vote data? (The answer may depend on many different choices, so there's no single right answer.)