# Connecticut Redistricting Analysis: Data Wrangling

- Project Objective: Analyze final 2021 CT State House and State Senate maps relative to incumbent protection
- Notebook Objective: Collect and organize all available public data necessary for GerryChain analysis

## Data Source

The main source of data for this project is the Connecticut General Assembly website, on its [2021 Redistricting Project](https://www.cga.ct.gov/rr/taskforce.asp?TF=20210401_2021%20Redistricting%20Project) committee page.

- [Geographic Data](https://www.cga.ct.gov/rr/tfs/20210401_2021%20Redistricting%20Project/data.asp)
- [Election Data](https://www.cga.ct.gov/rr/tfs/20210401_2021%20Redistricting%20Project/data.asp)
- [Incumbent Data](https://www.cga.ct.gov/rr/tfs/20210401_2021%20Redistricting%20Project/data.asp)
- [2021 Final House Map](https://www.cga.ct.gov/rr/tfs/20210401_2021%20Redistricting%20Project/hmaps.asp)
- [2021 Final Senate Map](https://www.cga.ct.gov/rr/tfs/20210401_2021%20Redistricting%20Project/hmaps.asp)

The 2020 U.S. Census data for the voting-age population (VAP) at the census block level was downloaded from [Connecticut Open Data](https://data.ct.gov/Government/2020-U-S-Census-Block-Adjustments/bary-ntej/). This dataset includes adjustments made by the Connecticut Office of Policy and Management, which call for "most individuals who are incarcerated to be counted at their address before incarceration". The technical report can be [viewed here](https://portal.ct.gov/-/media/OPM/CJPPD/CjAbout/SAC-Documents-from-2021-2022/PA21-13_OPM_Summary_Report_20210921.pdf).

All the data used in this project has been downloaded and is hosted on [GitHub](https://github.com/ka-chang/RedistrictingCT). Other relevant data sources can be found at the [Redistricting Data Hub](https://redistrictingdatahub.org/state/connecticut/).

## Setup

In [1]:
import os
import sys
from pathlib import Path

import geopandas as gpd
from gerrychain import Graph
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd

In [2]:
github_file_path = str(Path(os.getcwd()))  # Local GitHub repository path (current working directory)

sys.path.insert(1, github_file_path) 

github_file_path

'/Users/katherinechang/RedistrictingCT'

## Data Exploration

In [3]:
ct_vap_2020_df = gpd.read_file("./data/2020_census_vap/geo_export_862c4487-53ad-4b53-bb2a-5fe128833d9b.shp")
house_block = pd.read_csv("./data/HOU.csv", dtype=str)
senate_block = pd.read_csv("./data/SEN.csv", dtype=str)
incumbent_address = gpd.read_file("./data/2021_incumbent/21IncumbentsGeocoded.shp")

In [4]:
ct_vap_2020_df.head()
# Census block level, keyed by "geoid20"
# "p0030001" is unadjusted VAP, "p0030001_a" is adjusted, "p0030001_d" is the difference

|  | town | geoid20 | p0030001 | p0030001_a | p0030001_d | geometry |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Greenwich | 90010101011000 | 23.0 | 23.0 | 0.0 | POLYGON ((-73.67642 41.12467, -73.66993 41.127... |
| 1 | Greenwich | 90010101011001 | 149.0 | 149.0 | 0.0 | POLYGON ((-73.68429 41.11007, -73.68420 41.110... |
| 2 | Greenwich | 90010101011002 | 12.0 | 13.0 | 1.0 | POLYGON ((-73.69362 41.10838, -73.69349 41.108... |
| 3 | Greenwich | 90010101011003 | 0.0 | 0.0 | 0.0 | POLYGON ((-73.68828 41.10238, -73.68821 41.102... |
| 4 | Greenwich | 90010101011004 | 2.0 | 2.0 | 0.0 | POLYGON ((-73.68926 41.11859, -73.68607 41.120... |
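
The column notes above say that `p0030001_d` is the difference between the adjusted and unadjusted counts. A minimal sketch of how that relationship could be checked, using only the five rows printed above (the names `sample`, `recomputed`, and `mismatches` are illustrative and not part of the notebook):

```python
import pandas as pd

# The five rows shown in the head() output above.
sample = pd.DataFrame({
    "p0030001":   [23.0, 149.0, 12.0, 0.0, 2.0],  # unadjusted VAP
    "p0030001_a": [23.0, 149.0, 13.0, 0.0, 2.0],  # adjusted VAP
    "p0030001_d": [0.0, 0.0, 1.0, 0.0, 0.0],      # reported difference
})

# Recompute the difference and list any rows where it disagrees
# with the reported column.
recomputed = sample["p0030001_a"] - sample["p0030001"]
mismatches = sample[recomputed != sample["p0030001_d"]]

print(f"{len(mismatches)} mismatching rows")
```

Running the same comparison on the full `ct_vap_2020_df` would show whether the relationship holds for every block, not just these five.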


In [5]:
house_block.head()

|  | BLOCKID | HOUSE |
| --- | --- | --- |
| 0 | 90035042001002 | 1 |
| 1 | 90035245021002 | 1 |
| 2 | 90035245022007 | 1 |
| 3 | 90035246003005 | 1 |
| 4 | 90035245012000 | 1 |


In [6]:
senate_block.head()

|  | BLOCKID | SENATE |
| --- | --- | --- |
| 0 | 90034921001033 | 1 |
| 1 | 90035023005001 | 1 |
| 2 | 90035049001009 | 1 |
| 3 | 90035247002007 | 1 |
| 4 | 90035049001007 | 1 |
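
The block IDs printed above appear with 14 digits, while census block GEOIDs are 15 characters long (2 state + 3 county + 6 tract + 4 block) and Connecticut's state code is `09`. Whether the source files actually lack the leading zero is not confirmed here; it may only be a display artifact. If an ID column ever does lose its leading zero (for example, by being parsed as an integer), a sketch like the following could restore it before merging. The frame `house_block_toy` is a hypothetical stand-in for `house_block`:

```python
import pandas as pd

# Hypothetical ID column that was parsed as an integer and lost its leading zero.
house_block_toy = pd.DataFrame({
    "BLOCKID": [90035042001002, 90035245021002],
    "HOUSE": ["1", "1"],
})

# Convert to string and left-pad with zeros to the 15-character GEOID width.
house_block_toy["BLOCKID"] = house_block_toy["BLOCKID"].astype(str).str.zfill(15)

# Every ID should now be exactly 15 characters long.
all_fifteen = house_block_toy["BLOCKID"].str.len().eq(15).all()
```

Reading the CSVs with `dtype=str`, as the notebook already does, avoids the problem at load time; the padding step only matters when the IDs arrive as numbers.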


## Data Cleaning

To Do:

- Clean and add incumbents to the final ct_df before saving CT_analysis.shp
- Address accuracy: three of 36 incumbents are not in the shapefile; check that every address in incumbent_address is accurate
- ~Remove NaN in VAP df~
- ~Rename variables~

In [7]:
#Remove "not defined" under Town

ct_vap_2020_df = ct_vap_2020_df[~ct_vap_2020_df.town.str.contains("County subdivisions not defined")]
ct_vap_2020_df = ct_vap_2020_df[~ct_vap_2020_df.town.str.contains("Not in a specific geographic unit")]

In [10]:
ct_vap_2020_df = ct_vap_2020_df.dropna(subset = ["p0030001"])

In [11]:
ct_vap_2020_df.town.unique()

array(['Greenwich', 'Stamford', 'Darien', 'New Canaan', 'Norwalk',
       'Wilton', 'Westport', 'Weston', 'Fairfield', 'Bridgeport',
       'Stratford', 'Trumbull', 'Monroe', 'Easton', 'Shelton', 'Bethel',
       'Brookfield', 'Danbury', 'New Fairfield', 'Newtown', 'Redding',
       'Ridgefield', 'Sherman', 'Hartland', 'Berlin', 'Bristol',
       'Burlington', 'New Britain', 'Plainville', 'Southington',
       'Farmington', 'Avon', 'Canton', 'Simsbury', 'Granby',
       'East Granby', 'Bloomfield', 'Windsor', 'Windsor Locks',
       'Suffield', 'Enfield', 'East Windsor', 'South Windsor',
       'Rocky Hill', 'Wethersfield', 'Newington', 'West Hartford',
       'Hartford', 'East Hartford', 'Manchester', 'Glastonbury',
       'Marlborough', 'Bridgewater', 'New Milford', 'North Canaan',
       'Salisbury', 'Sharon', 'Cornwall', 'Warren', 'Kent', 'Washington',
       'Roxbury', 'Barkhamsted', 'Colebrook', 'Goshen', 'Harwinton',
       'Litchfield', 'Morris', 'New Hartford', 'Torrington', '

In [12]:
ct_vap_2020_df = ct_vap_2020_df.rename(columns={"p0030001": "VAP", 
                                                "p0030001_a": "VAP_adj",
                                                "p0030001_d": "VAP_diff"})

In [13]:
ct_vap_2020_df = ct_vap_2020_df.rename(columns={"geoid20": "BLOCKID"})

## Data Merge

In [15]:
ct_vap_2020_df = pd.merge(ct_vap_2020_df, house_block, on="BLOCKID")
ct_vap_2020_df = pd.merge(ct_vap_2020_df, senate_block, on="BLOCKID")
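
`pd.merge` defaults to an inner join, so any block without a district assignment is dropped silently. A minimal sketch, on toy data, of how a left join with `validate` and `indicator` could surface unmatched or duplicated block IDs (the frames `blocks` and `house` are hypothetical stand-ins for `ct_vap_2020_df` and `house_block`):

```python
import pandas as pd

# Toy stand-ins: three blocks, but only two have a House assignment.
blocks = pd.DataFrame({
    "BLOCKID": ["090010101011000", "090010101011001", "090010101011002"],
    "VAP": [23.0, 149.0, 12.0],
})
house = pd.DataFrame({
    "BLOCKID": ["090010101011000", "090010101011001"],
    "HOUSE": ["149", "149"],
})

# validate="one_to_one" raises MergeError if a BLOCKID repeats on either side;
# indicator=True adds a "_merge" column recording where each row came from.
checked = pd.merge(blocks, house, on="BLOCKID", how="left",
                   validate="one_to_one", indicator=True)

# Blocks present in the census data but missing from the assignment file.
unassigned = checked.loc[checked["_merge"] == "left_only", "BLOCKID"].tolist()
print(unassigned)
```

If `unassigned` is empty for both the House and Senate files, the inner joins above lose no blocks.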

In [17]:
ct_vap_2020_df

|  | town | BLOCKID | VAP | VAP_adj | VAP_diff | geometry | HOUSE | SENATE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Greenwich | 090010101011000 | 23.0 | 23.0 | 0.0 | POLYGON ((-73.67642 41.12467, -73.66993 41.127... | 149 | 36 |
| 1 | Greenwich | 090010101011001 | 149.0 | 149.0 | 0.0 | POLYGON ((-73.68429 41.11007, -73.68420 41.110... | 149 | 36 |
| 2 | Greenwich | 090010101011002 | 12.0 | 13.0 | 1.0 | POLYGON ((-73.69362 41.10838, -73.69349 41.108... | 149 | 36 |
| 3 | Greenwich | 090010101011003 | 0.0 | 0.0 | 0.0 | POLYGON ((-73.68828 41.10238, -73.68821 41.102... | 149 | 36 |
| 4 | Greenwich | 090010101011004 | 2.0 | 2.0 | 0.0 | POLYGON ((-73.68926 41.11859, -73.68607 41.120... | 149 | 36 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49840 | Sterling | 090159081004020 | 16.0 | 16.0 | 0.0 | POLYGON ((-71.84729 41.64574, -71.84694 41.645... | 44 | 18 |
| 49841 | Sterling | 090159081004021 | 16.0 | 16.0 | 0.0 | POLYGON ((-71.84796 41.65310, -71.84794 41.653... | 44 | 18 |
| 49842 | Sterling | 090159081004022 | 14.0 | 14.0 | 0.0 | POLYGON ((-71.81840 41.64453, -71.81837 41.644... | 44 | 18 |
| 49843 | Sterling | 090159081004023 | 0.0 | 0.0 | 0.0 | POLYGON ((-71.82013 41.64191, -71.82012 41.641... | 44 | 18 |


In [18]:
inc_block = {}
b_indices = np.array(range(len(ct_vap_2020_df)))

In [19]:
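# For each incumbent address point, find the census block polygon that contains it.
for index, row in incumbent_address.iterrows():
    # Boolean mask over all blocks -> positional index of the containing block(s)
    assignment = b_indices[ct_vap_2020_df.contains(row['geometry'])]
    if len(assignment) > 0:
        inc_block[index] = assignment[0].astype(int)
    else:
        # No containing block found for this address
        inc_block[index] = np.nan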

In [20]:
incumbent_address['BLOCKINDEX'] =  incumbent_address.index.map(inc_block)
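
The loop above tests each incumbent point against every block polygon. A spatial join is one possible vectorized alternative; the sketch below uses toy geometries and is not the notebook's method. The frames `blocks` and `incumbents` are hypothetical stand-ins for `ct_vap_2020_df` and `incumbent_address`, and the `predicate` argument assumes geopandas 0.10 or newer (older releases call it `op`):

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Toy stand-ins: two adjacent square "blocks" and three address points,
# one of which falls outside every block.
blocks = gpd.GeoDataFrame(
    {"BLOCKID": ["A", "B"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
)
incumbents = gpd.GeoDataFrame(
    {"lname": ["One", "Two", "Nowhere"]},
    geometry=[Point(0.5, 0.5), Point(1.5, 0.5), Point(5, 5)],
)

# Left join keeps every incumbent; unmatched points get NaN block attributes.
matched = gpd.sjoin(incumbents, blocks, how="left", predicate="within")

# Incumbents whose address did not land in any block.
unmatched = matched.loc[matched["BLOCKID"].isna(), "lname"].tolist()
print(unmatched)
```

The `unmatched` list plays the same role as the `np.nan` entries in `inc_block`: it identifies the addresses that need to be re-checked.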

In [22]:
incumbent_address_blockid = pd.merge(ct_vap_2020_df, incumbent_address, left_index=True, right_on='BLOCKINDEX')

In [25]:
incumbent_address_blockid

|  | town | BLOCKID | VAP | VAP_adj | VAP_diff | geometry_x | HOUSE | SENATE | Status | Score | ... | dist_code | dist_ordin | lname | fname | mid_init | Office | FmtName | Fulladdr | geometry_y | BLOCKINDEX |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 190 | Greenwich | 090010101022023 | 25.0 | 25.0 | 0.0 | POLYGON ((-73.65566 41.06923, -73.65493 41.069... | 149 | 36 | M | 100.00 | ... | SEN | SEN | Blumenthal | Richard |  | FED | Sen. Richard Blumenthal | 145 Clapboard Ridge Road Greenwich, CT 06831 | POINT (-73.64292 41.07639) | 130.0 |
| 151 | Greenwich | 090010102011010 | 84.0 | 84.0 | 0.0 | POLYGON ((-73.63989 41.06367, -73.63988 41.063... | 151 | 36 | M | 100.00 | ... | 151 | 151st | Arora | Harry |  | HRO | Rep. Harry Arora of the 151st | 56 Rockwood Lane Greenwich, CT 06830 | POINT (-73.63403 41.05958) | 182.0 |
| 191 | Greenwich | 090010102022002 | 114.0 | 114.0 | 0.0 | POLYGON ((-73.59116 41.04415, -73.59102 41.044... | 151 | 36 | M | 100.00 | ... | CT4 | CT4 | Himes | Jim |  | FED | Rep. Jim Himes of the 4th | 197 Valley Road Cos Cob, CT 06807 | POINT (-73.58715 41.05160) | 226.0 |
| 149 | Greenwich | 090010103002020 | 42.0 | 42.0 | 0.0 | POLYGON ((-73.63834 41.03081, -73.63763 41.030... | 149 | 36 | M | 100.00 | ... | 149 | 149th | Fiorello | Kimberly |  | HRO | Rep. Kimberly Fiorello of the 149th | 1 Grove Lane Greenwich, CT 06831 | POINT (-73.63188 41.03115) | 289.0 |
| 94 | Greenwich | 090010110002002 | 101.0 | 101.0 | 0.0 | POLYGON ((-73.56637 41.03223, -73.56596 41.032... | 150 | 36 | M | 100.00 | ... | 150 | 150th | Meskers | Stephen | R | HDO | Rep. Stephen Meskers of the 150th | 18 Lockwood Avenue Old Greenwich , CT 06870 | POINT (-73.56419 41.02788) | 598.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 181 | Windham | 090158007002003 | 30.0 | 30.0 | 0.0 | POLYGON ((-72.21569 41.71903, -72.21568 41.719... | 49 | 29 | M | 100.00 | ... | S29 | 29th | Flexer | Mae |  | SDO | Sen. Mae Flexer of the 29th | 236 Walnut Street Willimantic, CT 06226 | POINT (-72.21390 41.71896) | 47461.0 |
| 113 | Chaplin | 090158150002012 | 18.0 | 18.0 | 0.0 | POLYGON ((-72.12024 41.79792, -72.12004 41.798... | 47 | 35 | M | 100.00 | ... | 047 | 47th | Dubitsky | Doug |  | HRO | Rep. Doug Dubitsky of the 47th | 125 North Bear Hill Road Chaplin, CT 06235 | POINT (-72.11484 41.79967) | 47528.0 |
| 34 | Pomfret | 090159025003019 | 331.0 | 331.0 | 0.0 | POLYGON ((-71.96549 41.88603, -71.96544 41.886... | 50 | 29 | M | 99.49 | ... | 050 | 50th | Boyd | Patrick | S. | HDO | Rep. Patrick Boyd of the 50th | 398 Pomfret Street Pomfret, CT 06258 | POINT (-71.96200 41.88427) | 48473.0 |
| 109 | Putnam | 090159031013008 | 52.0 | 52.0 | 0.0 | POLYGON ((-71.91958 41.92071, -71.91912 41.920... | 51 | 29 | M | 100.00 | ... | 051 | 51st | Hayes | Rick | L. | HRO | Rep. Rick Hayes of the 51st | 78 S. Prospect Street Putnam, CT 06260 | POINT (-71.91785 41.92035) | 48559.0 |


In [26]:
ct_df = ct_vap_2020_df

In [27]:
ct_df.to_file("./data/CT_analysis.shp")



## Dual Graph

GerryChain uses dual graphs for analysis; this section builds dual graphs directly from the shapefile. 
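
As a toy illustration of the dual graph idea (not GerryChain's implementation), the sketch below builds one by hand with `networkx`: each geographic unit becomes a node, and two nodes are connected when their units share a boundary segment (rook adjacency). The 2x2 grid and the `VAP` values are made up for illustration:

```python
import networkx as nx

# Toy 2x2 grid of "blocks"; each cell is identified by (row, col).
cells = [(r, c) for r in range(2) for c in range(2)]

dual = nx.Graph()
for cell in cells:
    dual.add_node(cell, VAP=10)  # made-up population attribute

# Rook adjacency: connect cells that share an edge, not just a corner.
for r, c in cells:
    for dr, dc in ((0, 1), (1, 0)):
        neighbor = (r + dr, c + dc)
        if neighbor in dual:
            dual.add_edge((r, c), neighbor)

print(dual.number_of_nodes(), dual.number_of_edges())
```

Diagonal cells such as `(0, 0)` and `(1, 1)` touch only at a corner, so they are not joined. With real block polygons, the adjacency test is geometric rather than grid-based.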

To Do:
- Look into warnings

In [None]:
ct_graph = Graph.from_file("./data/CT_analysis.shp")

In [None]:
ct_graph.nodes[50]

In [None]:
#ct_df = ct_df.to_crs(epsg=4326) 

In [None]:
ct_centroids = ct_df.centroid 

#Getting warning, still runs but breaks down when mapping centroids C_X and C_Y. 
#Check which CRS to transform to
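
The centroid warning noted above is typically raised when geometries are in a geographic (longitude/latitude) CRS. One possible approach, sketched here on a toy polygon, is to project to a planar CRS, compute centroids there, and convert them back. EPSG:26956 (NAD83 / Connecticut, meters) is an assumption, not something the notebook specifies; any projected CRS suited to Connecticut should behave similarly. The frame `toy` is a hypothetical stand-in for `ct_df`:

```python
import geopandas as gpd
from shapely.geometry import box

# Toy polygon in lon/lat, standing in for a census block in ct_df.
toy = gpd.GeoDataFrame(
    {"BLOCKID": ["A"]},
    geometry=[box(-73.0, 41.0, -72.9, 41.1)],
    crs="EPSG:4326",
)

# Project to a planar CRS (assumed: EPSG:26956) before computing centroids,
# then transform the centroid points back to the original CRS.
projected = toy.to_crs(epsg=26956)
centroids = projected.centroid.to_crs(toy.crs)

toy["C_X"] = centroids.x
toy["C_Y"] = centroids.y
```

Because the centroids are converted back to the original CRS, `C_X` and `C_Y` stay in the same coordinates as the block geometries used for plotting.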

In [None]:
ct_df["C_X"] = ct_centroids.x
ct_df["C_Y"] = ct_centroids.y

ct_graph.add_data(ct_df,
                  columns=["C_X", "C_Y"])

In [None]:
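# Cast node attributes to integers. The shapefile stores the district
# assignments as "HOUSE" and "SENATE" (the columns added in the merge above),
# so read from those and store them under the names used when plotting.
for node in ct_graph.nodes():
    ct_graph.nodes[node]["VAP"] = int(ct_graph.nodes[node]["VAP"])
    ct_graph.nodes[node]["VAP_adj"] = int(ct_graph.nodes[node]["VAP_adj"])
    ct_graph.nodes[node]["VAP_diff"] = int(ct_graph.nodes[node]["VAP_diff"])
    ct_graph.nodes[node]["HDIST21"] = int(ct_graph.nodes[node]["HOUSE"])
    ct_graph.nodes[node]["SEND21"] = int(ct_graph.nodes[node]["SENATE"])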

In [None]:
nx.draw(ct_graph,pos = {node:(ct_graph.nodes[node]['C_X'],
                              ct_graph.nodes[node]['C_Y']) 
                        for node in ct_graph.nodes()},
        node_color=[ct_graph.nodes[node]["HDIST21"] for node in ct_graph.nodes()],
        node_size=10,
        cmap='tab20')

In [None]:
#ct_graph.to_json("./data/CT_analysis.json")