# Connecticut Redistricting Analysis: Data Wrangling

- Project Objective: Analyze final 2021 CT State House and State Senate maps relative to incumbent protection
- Notebook Objective: Collect and organize all available public data necessary for GerryChain analysis

## Data Source

The main source of data for this project is the Connecticut General Assembly's [2021 Redistricting Project](https://www.cga.ct.gov/rr/taskforce.asp?TF=20210401_2021%20Redistricting%20Project) committee page.

- [Geographic Data](https://www.cga.ct.gov/rr/tfs/20210401_2021%20Redistricting%20Project/data.asp)
- [Election Data](https://www.cga.ct.gov/rr/tfs/20210401_2021%20Redistricting%20Project/data.asp)
- [Incumbent Data](https://www.cga.ct.gov/rr/tfs/20210401_2021%20Redistricting%20Project/data.asp)
- [2021 Final House Map](https://www.cga.ct.gov/rr/tfs/20210401_2021%20Redistricting%20Project/hmaps.asp)
- [2021 Final Senate Map](https://www.cga.ct.gov/rr/tfs/20210401_2021%20Redistricting%20Project/hmaps.asp)

The 2020 U.S. Census data for the voting-age population (VAP) at the census block level was downloaded from [Connecticut Open Data](https://data.ct.gov/Government/2020-U-S-Census-Block-Adjustments/bary-ntej/). This dataset includes adjustments made by the Connecticut Office of Policy and Management, which call for "most individuals who are incarcerated to be counted at their address before incarceration". The technical report can be [viewed here](https://portal.ct.gov/-/media/OPM/CJPPD/CjAbout/SAC-Documents-from-2021-2022/PA21-13_OPM_Summary_Report_20210921.pdf).

All the data used in this project has been downloaded and is hosted on [GitHub](https://github.com/ka-chang/RedistrictingCT). Other relevant data sources can be found at the [Redistricting Data Hub](https://redistrictingdatahub.org/state/connecticut/).

## Setup

In [1]:
import os
import sys
from pathlib import Path

import geopandas as gpd
from gerrychain import Graph
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd

In [2]:
github_file_path = str(Path(os.getcwd())) #Sets to local Github directory path

sys.path.insert(1, github_file_path) 

github_file_path

'/Users/katherinechang/RedistrictingCT'

## Data Exploration

In [3]:
ct_vap_2020_df = gpd.read_file("./data/2020_census_vap/geo_export_862c4487-53ad-4b53-bb2a-5fe128833d9b.shp")
house_block = pd.read_csv("./data/HOU.csv", dtype=str)
senate_block = pd.read_csv("./data/SEN.csv", dtype=str)

In [4]:
ct_vap_2020_df.head()
#Census block level, keyed by "geoid20"
#"p0030001" is unadjusted, "p0030001_a" is adjusted, "p0030001_d" is the difference

| | town | geoid20 | p0030001 | p0030001_a | p0030001_d | geometry |
|---|---|---|---|---|---|---|
| 0 | Greenwich | 90010101011000 | 23.0 | 23.0 | 0.0 | POLYGON ((-73.67642 41.12467, -73.66993 41.127... |
| 1 | Greenwich | 90010101011001 | 149.0 | 149.0 | 0.0 | POLYGON ((-73.68429 41.11007, -73.68420 41.110... |
| 2 | Greenwich | 90010101011002 | 12.0 | 13.0 | 1.0 | POLYGON ((-73.69362 41.10838, -73.69349 41.108... |
| 3 | Greenwich | 90010101011003 | 0.0 | 0.0 | 0.0 | POLYGON ((-73.68828 41.10238, -73.68821 41.102... |
| 4 | Greenwich | 90010101011004 | 2.0 | 2.0 | 0.0 | POLYGON ((-73.68926 41.11859, -73.68607 41.120... |
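
As a quick sanity check on the three population columns, the difference column should equal the adjusted count minus the unadjusted count. The sketch below runs that check on a small hand-built frame that mirrors a few of the rows shown above, not on the full shapefile; `check_adjustment` is a hypothetical helper name introduced only for illustration.

```python
import pandas as pd


def check_adjustment(df, raw="p0030001", adj="p0030001_a", diff="p0030001_d"):
    """Return the rows where adjusted - unadjusted does not match the reported difference."""
    mismatch = (df[adj] - df[raw]) != df[diff]
    return df[mismatch]


# Hand-copied sample rows; the real notebook would pass ct_vap_2020_df instead.
sample = pd.DataFrame({
    "geoid20": ["90010101011000", "90010101011001", "90010101011002"],
    "p0030001": [23.0, 149.0, 12.0],
    "p0030001_a": [23.0, 149.0, 13.0],
    "p0030001_d": [0.0, 0.0, 1.0],
})

bad_rows = check_adjustment(sample)
print(len(bad_rows))
```

An empty result suggests the three columns are internally consistent; any returned rows would be worth inspecting before the columns are renamed.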


## Data Cleaning

### Geocode Incumbent Addresses

Incumbent addresses were downloaded as a CSV and then geocoded to identify the census block in which each address is located.

- Nominatim is an open-source option, but it left 34 of 193 addresses unidentified
- The Google Maps API identified all 193 addresses; for reproducibility, supply your own API key from GCP

In [5]:
incumbent_address = pd.read_csv("./data/2021_incumbent_addresses.csv", dtype=str)

In [None]:
#Open source option using Nominatim

#from geopandas.tools import geocode
#from shapely.geometry import Point
#incumbent_address_geo = geocode(incumbent_address["Fulladdr"], provider="nominatim", 
#                                user_agent="geocode_redistricting_ct", timeout=5)

In [None]:
import googlemaps

#Uncomment and supply your own Google Maps API key before running the geocoding loop
#gmaps = googlemaps.Client(key="####")

incumbent_address["Longitude"] = pd.Series(dtype='float')
incumbent_address["Latitude"] = pd.Series(dtype='float')

In [None]:
for n in incumbent_address.index:
    address = incumbent_address.loc[n, "Fulladdr"]
    geocode_result = gmaps.geocode(address)
    #An empty result list means the address could not be geocoded
    if geocode_result:
        location = geocode_result[0]["geometry"]["location"]
        #Use .loc to avoid chained assignment, which may not write back to the DataFrame
        incumbent_address.loc[n, "Longitude"] = location["lng"]
        incumbent_address.loc[n, "Latitude"] = location["lat"]
        print("MAPPED", n)
    else:
        print("NOT MAPPED:", address)
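
The coordinate extraction can be tested without calling the API. The sketch below assumes the response shape used in the loop above: a list of results, each with `geometry.location.lat` and `geometry.location.lng`. The helper name `extract_lat_lng` and both sample responses are hypothetical, written only for illustration.

```python
def extract_lat_lng(geocode_result):
    """Return (lat, lng) from a geocoding response, or None if it is empty or malformed."""
    if not geocode_result:
        return None
    location = geocode_result[0].get("geometry", {}).get("location", {})
    lat, lng = location.get("lat"), location.get("lng")
    if isinstance(lat, (int, float)) and isinstance(lng, (int, float)):
        return float(lat), float(lng)
    return None


# Made-up responses for illustration; not real API output.
hit = [{"geometry": {"location": {"lat": 41.7637, "lng": -72.6851}}}]
miss = []

print(extract_lat_lng(hit))
print(extract_lat_lng(miss))
```

Returning `None` for an unmatched address keeps the "not mapped" case explicit rather than letting an `IndexError` stop the loop partway through.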

In [None]:
incumbent_address_geo = gpd.GeoDataFrame(incumbent_address,
                                         geometry=gpd.points_from_xy(incumbent_address.Longitude,
                                                                     incumbent_address.Latitude))
incumbent_address_geo = incumbent_address_geo.set_crs(epsg=4326)

In [None]:
#incumbent_address_geo.to_file("./data/2021_incumbent_addresses/2021IncumbentsGeocoded.shp")  

### Clean up VAP files

In [None]:
#Remove "not defined" under Town

ct_vap_2020_df = ct_vap_2020_df[~ct_vap_2020_df.town.str.contains("County subdivisions not defined")]
ct_vap_2020_df = ct_vap_2020_df[~ct_vap_2020_df.town.str.contains("Not in a specific geographic unit")]

In [None]:
ct_vap_2020_df = ct_vap_2020_df.dropna(subset = ["p0030001"])

In [None]:
ct_vap_2020_df.town.unique()

In [None]:
ct_vap_2020_df = ct_vap_2020_df.rename(columns={"p0030001": "VAP", 
                                                "p0030001_a": "VAP_adj",
                                                "p0030001_d": "VAP_diff"})

In [None]:
house_block=house_block.rename(columns={'BLOCKID':'geoid20'})
senate_block=senate_block.rename(columns={'BLOCKID':'geoid20'})

## Data Merge

In [None]:
ct_vap_2020_df = pd.merge(ct_vap_2020_df, house_block, on="geoid20")
ct_vap_2020_df = pd.merge(ct_vap_2020_df, senate_block, on="geoid20")
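
Because `pd.merge` defaults to an inner join, blocks without a district assignment are silently dropped. The sketch below shows one way to surface such blocks, using the `validate` and `indicator` options of `DataFrame.merge`. The toy frames and their values are hypothetical stand-ins for the VAP and block-assignment tables.

```python
import pandas as pd

# Toy stand-ins: three blocks, only two of which have a House district assignment.
blocks = pd.DataFrame({"geoid20": ["A1", "A2", "A3"], "VAP": [10, 20, 30]})
house = pd.DataFrame({"geoid20": ["A1", "A2"], "HOUSE": ["1", "2"]})

# validate raises if geoid20 is duplicated on either side;
# indicator adds a "_merge" column recording where each row came from.
checked = blocks.merge(
    house, on="geoid20", how="left", validate="one_to_one", indicator=True
)

unmatched = checked.loc[checked["_merge"] == "left_only", "geoid20"].tolist()
print(unmatched)
```

If `unmatched` is empty for both the House and Senate files, the inner joins used in this notebook lose no blocks.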

In [None]:
inc_block = {}
b_indices = np.arange(len(ct_vap_2020_df))

In [None]:
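for index, row in incumbent_address_geo.iterrows():
    #Positions of the census blocks whose polygon contains this address point
    assignment = b_indices[ct_vap_2020_df.contains(row["geometry"])]
    if len(assignment) > 0:
        #Positional lookup, so the result does not depend on the DataFrame's index labels
        inc_block[index] = ct_vap_2020_df["geoid20"].iloc[assignment[0]]
    else:
        inc_block[index] = np.nan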

In [None]:
incumbent_address_geo["geoid20"] =  incumbent_address_geo.index.map(inc_block)  
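
The row-by-row `contains` loop above could also be expressed as a single spatial join. The sketch below is an alternative, not the approach used in this notebook. It assumes a GeoPandas version in which `sjoin` accepts the `predicate` argument, and it uses hypothetical toy geometries in place of the census blocks and geocoded addresses.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Toy polygons standing in for census blocks (hypothetical IDs).
blocks = gpd.GeoDataFrame(
    {"geoid20": ["B1", "B2"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs="EPSG:4326",
)

# Toy points standing in for geocoded addresses; "c" lies outside every block.
points = gpd.GeoDataFrame(
    {"name": ["a", "b", "c"]},
    geometry=[Point(0.5, 0.5), Point(1.5, 0.5), Point(5, 5)],
    crs="EPSG:4326",
)

# A left join keeps every address; unmatched addresses get NaN for geoid20.
joined = gpd.sjoin(
    points, blocks[["geoid20", "geometry"]], how="left", predicate="within"
)
result = dict(zip(joined["name"], joined["geoid20"]))
print(result)
```

Both layers need to share a coordinate reference system for the join to be meaningful, which is why the geocoded points are set to EPSG:4326 earlier in the notebook.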

In [None]:
ct_vap_2020_df["INCUMBENT"] = 0
ct_vap_2020_df.loc[ct_vap_2020_df['geoid20'].isin(incumbent_address_geo['geoid20']), 'INCUMBENT'] = 1

incumbent_address_geo = incumbent_address_geo.drop(columns=["geometry", "Longitude", "Latitude"])

In [None]:
ct_df = ct_vap_2020_df.merge(incumbent_address_geo, on='geoid20', how="left")

In [None]:
ct_df.plot(column="SENATE");

In [None]:
ct_df.to_file("./data/CT_analysis.shp")

## Dual Graph

GerryChain uses dual graphs for analysis; this section builds dual graphs directly from the shapefile. 
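
In a dual graph, each geographic unit becomes a node, and two nodes are joined by an edge when their units share a border. The sketch below illustrates the idea in plain Python on a hypothetical 2x2 grid of squares. It treats squares that meet only at a corner as non-adjacent (rook-style adjacency) and is a conceptual illustration, not GerryChain's implementation.

```python
from itertools import combinations

# Hypothetical 2x2 grid of unit squares: block ID -> (xmin, ymin, xmax, ymax).
blocks = {
    "NW": (0, 1, 1, 2), "NE": (1, 1, 2, 2),
    "SW": (0, 0, 1, 1), "SE": (1, 0, 2, 1),
}


def share_border(a, b):
    """True when two axis-aligned rectangles touch along a segment, not just at a corner."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    x_overlap = min(ax1, bx1) - max(ax0, bx0)
    y_overlap = min(ay1, by1) - max(ay0, by0)
    # Shared vertical edge: zero x-overlap and positive y-overlap (and vice versa).
    return (x_overlap == 0 and y_overlap > 0) or (y_overlap == 0 and x_overlap > 0)


# One edge per pair of blocks that share a border.
edges = sorted(
    tuple(sorted((u, v)))
    for u, v in combinations(blocks, 2)
    if share_border(blocks[u], blocks[v])
)
print(edges)
```

The diagonal pairs (NW and SE, NE and SW) touch only at the center point, so they produce no edge.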

In [None]:
ct_graph = Graph.from_file("./data/CT_analysis.shp")

In [None]:
ct_df = gpd.read_file("./data/CT_analysis.shp")

In [None]:
centroids = ct_df.centroid
ct_df["C_X"] = centroids.x
ct_df["C_Y"] = centroids.y

ct_graph.add_data(ct_df,columns=["C_X","C_Y"])

In [None]:
for node in ct_graph.nodes():
    ct_graph.nodes[node]["VAP"] = int(ct_graph.nodes[node]["VAP"])
    ct_graph.nodes[node]["VAP_adj"] = int(ct_graph.nodes[node]["VAP_adj"])
    ct_graph.nodes[node]["VAP_diff"] = int(ct_graph.nodes[node]["VAP_diff"])
    ct_graph.nodes[node]["INCUMBENT"] = int(ct_graph.nodes[node]["INCUMBENT"])
    ct_graph.nodes[node]["HOUSE"] = int(ct_graph.nodes[node]["HOUSE"])
    ct_graph.nodes[node]["SENATE"] = int(ct_graph.nodes[node]["SENATE"])

In [None]:
ct_graph.to_json("./data/CT_dual_graph.json")