# VEST VA 2017 Governor, Lt Governor, Attorney General

In [1]:
import pandas as pd
import geopandas as gp
import matplotlib.pyplot as plt
import numpy as np
import os
from functools import reduce

In [2]:
pd.set_option('display.max_columns', None)

## VEST Documentation

Election results from Virginia Department of Elections (https://historical.elections.virginia.gov/)

Absentee ballots and provisional votes were reported at the county or city level throughout the state. These were distributed by candidate to precincts based on their share of the precinct-level reported vote.

Precinct shapefile primarily from the U.S. Census Bureau's 2020 Redistricting Data Program Phase 2 release. Virginia election reports often include precinct splits that are obsolete or unused in practice. These have been omitted. In cases where voters were incorrectly assigned to the wrong district the de facto precinct split has been included for that election.

The following modifications were made to recreate the 2017 precinct boundaries.

Albemarle: Merge Cale/Biscuit Run, Free Bridge/Pantops; Add Belfield from 2010 VTD shapefile; Adjust Brownsville/Crozet to match 2010 VTD shapefile  
Arlington: Adjust Gunston/Oakridge to match county GIS shapefile  
Bedford: Merge New London Academy/Forest Fire Station #2 to reverse 2018 split  
Bristol City: Adjust Ward 2/Ward 4 to match description in municipal code  
Carroll: Split Oakland A/Oakland D to match county GIS shapefile  
Charles City County: Adjust District 1/District 2 boundary to match county code  
Covington City: Realign Ward 1, Ward 2, Ward 3 to match city PDF map and municipal code  
Culpeper: Adjust East Fairfax/Brandy Station boundary to match county GIS shapefile  
Emporia City: Adjust Precincts 1/7, Precincts 2/5 to match municipal code  
Essex: Adjust South Precinct/Central Precinct boundary to match county PDF  
Fairfax: Adjust Virginia Run/Bull Run to match county GIS shapefile  
Fredericksburg City: Adjust District 1/3 boundaries to match municipal code  
Galax City: Adjust North/South precinct boundary to match municipal GIS shapefile  
Goochland: Adjust Hadensville/Fife boundary to match description in county code  
Halifax: Merge South Boston East/West; Adjust Meadville/Republican Grove to match 2011 redistricting PDF map  
Hampton City: Add US House District 2 segment of Tyler Precinct to match county PDF  
Hanover: Adjust Blunts/Beaverdam boundary to match county PDF  
Henrico: Split Glenside/Johnson to match 2010 VTD shapefile  
Henry: Adjust 10 precinct boundaries to align VTDs with county GIS shapefile  
Madison: Adjust all precincts to align VTDs with county GIS shapefile  
Newport News City: Adjust Sanford/Riverview boundary to match county GIS shapefile  
Prince William: Adjust Ben Lomond/Mullen, Freedom/Leesylvania to match county GIS shapefile  
Pulaski: Adjust Dublin/New River to match precinct assignments on county GIS parcel viewer  
Rappahannock: Adjust Sperryville/Washington boundary to match county PDF  
Richmond County: Adjust Precincts 2-1/3-1 boundary to match description in county ordinance  
Roanoke City: Add Virginia Heights-Norwich; Adjust Forest Park/Eureka Park based on county GIS shapefile and description in municipal ordinance  
Roanoke County: Adjust 12 precinct boundaries to match county GIS shapefile  
Rockingham: Adjust Bridgewater Precinct to match municipal boundary  
Russell: Adjust Daugherty/West Lebanon boundary to match county PDF  
Tazewell: Adjust nearly all precinct boundaries to align VTDs with county GIS shapefile  
Virginia Beach City: Merge Salem Woods/Rosemont Forest, Sigma/Sandbridge to match 2015 PDF; Adjust Centerville/Colonial to match county GIS shapefile  
Williamsburg City: Revise Matoaka/Stryker to match municipal PDF map and municipal code  
Wise: Adjust Big Stone Gap/East Stone Gap boundary to match county GIS shapefile  
Wythe: Adjust West Wytheville/East Wytheville boundary to match county GIS shapefile  

Results are divided across three files. Because precincts can be split across legislative districts, the legislative races are reported with their own geography that divides these split precincts, resulting in shapes that are assigned to exactly one district.  

*va_2017 file*
G17GOVDNOR - Ralph Northam (Democratic Party)  
G17GOVRGIL - Ed Gillespie (Republican Party)  
G17GOVLHYR - Cliff Hyra (Libertarian Party)  
G17GOVOWRI - Write-in Votes  

G17LTGDFAI - Justin Fairfax (Democratic Party)  
G17LTGRVOG - Jill Vogel (Republican Party)  
G17LTGOWRI - Write-in Votes  

G17ATGDHER - Mark Herring (Democratic Party)  
G17ATGRADA - John Adams (Republican Party)  
G17ATGOWRI - Write-in Votes  

## Load VEST file

In [3]:
gdfv = gp.read_file('./raw_from_source/va_2017/va_2017.shp')
gdfv.head()

Unnamed: 0,COUNTYFP,LOCALITY,VTDST,PRECINCT,G17GOVDNOR,G17GOVRGIL,G17GOVLHYR,G17GOVOWRI,G17LTGDFAI,G17LTGRVOG,G17LTGOWRI,G17ATGDHER,G17ATGRADA,G17ATGOWRI,geometry
0,1,Accomack County,101,Chincoteague,455,784,11,0,410,829,0,416,815,2,"POLYGON Z ((-75.42507 37.89957 0.00000, -75.42..."
1,1,Accomack County,201,Atlantic,144,414,1,0,137,414,0,132,422,0,"POLYGON Z ((-75.59978 37.87664 0.00000, -75.59..."
2,1,Accomack County,202,Greenbackville,225,468,8,0,222,471,0,216,477,0,"POLYGON Z ((-75.49919 37.93416 0.00000, -75.49..."
3,1,Accomack County,301,New Church,395,383,5,0,386,389,1,384,393,1,"POLYGON Z ((-75.64987 37.92702 0.00000, -75.64..."
4,1,Accomack County,401,Bloxom,103,232,1,0,89,239,0,92,236,0,"POLYGON Z ((-75.71556 37.87513 0.00000, -75.71..."


In [4]:
county_dict = pd.Series(gdfv['COUNTYFP'].values, index = gdfv['LOCALITY']).to_dict()

## Load election results

In [5]:
#Governor
gov = pd.read_csv('./raw_from_source/Virginia_Elections_Database__2017_Governor_General_Election_including_precincts.csv')
#Lt Gov
ltg = pd.read_csv('./raw_from_source/Virginia_Elections_Database__2017_Lieutenant_Governor_General_Election_including_precincts.csv')
#Attorney General
atg = pd.read_csv('./raw_from_source/Virginia_Elections_Database__2017_Attorney_General_General_Election_including_precincts.csv')

gov['join_id'] = gov['County/City']+gov['Pct']
ltg['join_id']= ltg['County/City']+ltg['Pct']
atg['join_id'] = atg['County/City']+atg['Pct']

gov_ltg = pd.merge(gov, ltg, on = 'join_id', how = 'outer')
df = pd.merge(atg, gov_ltg, on = 'join_id', how = 'outer')

df.columns

Index(['County/City', 'Ward', 'Pct', 'Mark Rankin Herring',
       'John Donley Adams', 'All Others', 'Total Votes Cast', 'join_id',
       'County/City_x', 'Ward_x', 'Pct_x', 'Ralph Shearer Northam',
       'Edward Walter Gillespie', 'Clifford Daniel Hyra', 'All Others_x',
       'Total Votes Cast_x', 'County/City_y', 'Ward_y', 'Pct_y',
       'Justin Edward Fairfax', 'Jill Holtzman Vogel', 'All Others_y',
       'Total Votes Cast_y'],
      dtype='object')

In [6]:
df = df.fillna(value = 0)
df = df[(df['County/City'] != 'TOTALS') & (df['join_id'] != 0)]
df['LOCALITY'] = df['County/City']
#Import county fip number values
df['COUNTYFP'] = df['LOCALITY'].map(county_dict)
#Change columns to match vest candidate ids
df['G17GOVDNOR'] = df['Ralph Shearer Northam'].map(lambda x: str(x).replace(',', '')).astype(str).astype(float).astype(int)
df['G17GOVRGIL'] = df['Edward Walter Gillespie'].map(lambda x: str(x).replace(',', '')).astype(str).astype(float).astype(int)
df['G17GOVLHYR'] = df['Clifford Daniel Hyra'].map(lambda x: str(x).replace(',', '')).astype(str).astype(float).astype(int)
df['G17GOVOWRI'] = df['All Others_x'].map(lambda x: str(x).replace(',', '')).astype(str).astype(float).astype(int)

df['G17LTGDFAI'] = df['Justin Edward Fairfax'].map(lambda x: str(x).replace(',', '')).astype(str).astype(float).astype(int)
df['G17LTGRVOG'] = df['Jill Holtzman Vogel'].map(lambda x: str(x).replace(',', '')).astype(str).astype(float).astype(int)
df['G17LTGOWRI'] = df['All Others_y'].map(lambda x: str(x).replace(',', '')).astype(str).astype(float).astype(int)

df['G17ATGDHER'] = df['Mark Rankin Herring'].map(lambda x: str(x).replace(',', '')).astype(str).astype(float).astype(int)
df['G17ATGRADA'] = df['John Donley Adams'].map(lambda x: str(x).replace(',', '')).astype(str).astype(float).astype(int)
df['G17ATGOWRI'] = df['All Others'].map(lambda x: str(x).replace(',', '')).astype(str).astype(float).astype(int)
#drop repeat columns
df = df.drop(['County/City_x', 'Ward_x', 'Total Votes Cast_x','County/City_y', 'Ward_y', 'Total Votes Cast_y',
             'Ralph Shearer Northam','Edward Walter Gillespie','Clifford Daniel Hyra','All Others_x','Justin Edward Fairfax','Jill Holtzman Vogel','All Others_y',
             'Mark Rankin Herring','John Donley Adams','All Others', 'Pct_x', 'Pct_y'], axis = 1)
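
The ten conversions above all apply the same cleaning chain. As a minimal sketch, that chain could be factored into one helper; `parse_votes` is a hypothetical name that does not appear in the notebook, and the sample inputs are made up to show the shapes the chain has to handle (strings with thousands separators, plain strings, and the `0` values left by `fillna`).

```python
def parse_votes(value):
    """Convert a raw vote count such as '1,152', '308' or 0.0 to an int."""
    # Same chain as the notebook: str -> strip commas -> float -> int
    return int(float(str(value).replace(',', '')))

# Hypothetical raw values, not taken from the source CSVs
raw_counts = ['1,152', '308', 0.0, '27,987']
clean_counts = [parse_votes(v) for v in raw_counts]
print(clean_counts)
```

With pandas, the same helper could be applied per column as `df[col].map(parse_votes)`.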

In [7]:
df.head()

Unnamed: 0,County/City,Ward,Pct,Total Votes Cast,join_id,LOCALITY,COUNTYFP,G17GOVDNOR,G17GOVRGIL,G17GOVLHYR,G17GOVOWRI,G17LTGDFAI,G17LTGRVOG,G17LTGOWRI,G17ATGDHER,G17ATGRADA,G17ATGOWRI
8,Accomack County,-,101 - Chincoteague,1152,Accomack County101 - Chincoteague,Accomack County,1,411,746,10,0,369,789,0,374,776,2
9,Accomack County,-,201 - Atlantic,521,Accomack County201 - Atlantic,Accomack County,1,130,394,1,0,123,394,0,119,402,0
10,Accomack County,-,202 - Greenbackville,648,Accomack County202 - Greenbackville,Accomack County,1,203,445,7,0,200,448,0,194,454,0
11,Accomack County,-,301 - New Church,721,Accomack County301 - New Church,Accomack County,1,357,365,4,0,347,370,1,346,374,1
12,Accomack County,-,401 - Bloxom,308,Accomack County401 - Bloxom,Accomack County,1,93,221,1,0,80,227,0,83,225,0


### Check vote totals - pre absentee reallocation

In [8]:
#column/race
column_list = ['G17GOVDNOR', 'G17GOVRGIL', 'G17GOVLHYR', 'G17GOVOWRI', 'G17LTGDFAI', 'G17LTGRVOG', 'G17LTGOWRI','G17ATGDHER', 'G17ATGRADA', 'G17ATGOWRI']
for val in column_list:
    vote_dif = df[val].sum()-gdfv[val].sum()
    if (vote_dif == 0):
        print(val+": EQUAL")
    else:
        print(val+": DIFFERENCE OF " + str(vote_dif)+ " VOTES")

G17GOVDNOR: EQUAL
G17GOVRGIL: EQUAL
G17GOVLHYR: EQUAL
G17GOVOWRI: EQUAL
G17LTGDFAI: EQUAL
G17LTGRVOG: EQUAL
G17LTGOWRI: EQUAL
G17ATGDHER: EQUAL
G17ATGRADA: EQUAL
G17ATGOWRI: EQUAL


In [9]:
#county
print("Counties with differences printed below:")
diff_counties=[]
for i in column_list:
    diff = df.groupby(['LOCALITY']).sum()[i]-gdfv.groupby(['LOCALITY']).sum()[i]
    for val in diff[diff != 0].index.values.tolist():
        if val not in diff_counties:
            diff_counties.append(val)
    if len(diff[diff != 0]) != 0:
        print(diff[diff != 0].to_string(header=False))
print("")
print("All other races in all counties are equal")

Counties with differences printed below:

All other races in all counties are equal


## Absentee reallocation

In [10]:
#Function to account for counties split by CDs in absentee reallocation to better match VEST's steps
def add_cd_to_county(county_list, precinct, countyfp):
    if (countyfp in county_list):
        countyfp_cd = countyfp + '-' + precinct[-5:-1]
        return countyfp_cd
    else:
        countyfp_cd = countyfp
        return countyfp_cd
#Set-up for absentee reallocation
cd_abs_prov_prec = df[((df['Pct'].map(lambda x: 'Absentee' in str(x))) &(df['Pct'].map(lambda x: 'CD' in str(x)))) | ((df['Pct'].map(lambda x: 'Provisional' in str(x))) & (df['Pct'].map(lambda x: 'CD' in str(x))))]
county_with_cd_nec_list = list(cd_abs_prov_prec['COUNTYFP'])

df['countyfp_cd'] = df.apply(lambda row: add_cd_to_county(county_with_cd_nec_list, row['Pct'], row['COUNTYFP']), axis = 1)

absentee_and_prov = df[(df['Pct'].map(lambda x: 'Absentee' in str(x))) | (df['Pct'].map(lambda x: 'Provisional' in str(x)))]
groupby_absentee_and_prov_tot = absentee_and_prov.groupby(['countyfp_cd']).sum()

groupby_county_df_tot = df.groupby(['countyfp_cd']).sum()
df_no_absent_or_provisional = df[(df['Pct'].map(lambda x: 'Absentee' not in str(x))) & (df['Pct'].map(lambda x: 'Provisional' not in str(x)))
                                & (df['LOCALITY'] != 'TOTALS')]
groupby_county_tot_no_absentee = df_no_absent_or_provisional.groupby('countyfp_cd').sum()

In [None]:
df_with_absentee_reallocated = df_no_absent_or_provisional.copy()
groupby_absentee_and_prov_tot.reset_index(inplace=True,drop=False)
groupby_county_tot_no_absentee.reset_index(inplace=True,drop=False)

#Create copies of the subset dfs so the originals stay unmodified in case we want to check back later
to_dole_out_totals = groupby_absentee_and_prov_tot.copy()
precinct_specific_totals = groupby_county_tot_no_absentee.copy()

## PH CODE for vote allocation

#Create some new columns for each of these races to deal with the allocation
for race in column_list:
    add_var = race+"_add"
    rem_var = race+"_rem"
    floor_var = race+"_floor"
    df_with_absentee_reallocated.loc[:,add_var]=0.0
    df_with_absentee_reallocated.loc[:,rem_var]=0.0
    df_with_absentee_reallocated.loc[:,floor_var]=0.0

#Iterate over the rows
#Note this function iterates over the dataframe two times so the rounded vote totals match the totals to allocate
for index, row in df_no_absent_or_provisional.iterrows():
    for race in column_list:
        add_var = race+"_add"
        rem_var = race+"_rem"
        floor_var = race+"_floor"
        #Grab the district
        county_id = row["countyfp_cd"]
        #Get the denominator for the allocation (the precinct vote totals)
        denom = precinct_specific_totals.loc[precinct_specific_totals["countyfp_cd"]==county_id][race]
        #Get one of the numerators, how many districtwide votes to allocate
        numer = to_dole_out_totals.loc[to_dole_out_totals["countyfp_cd"]==county_id][race]
        #Get the vote totals for this race in this precinct
        val = df_with_absentee_reallocated.at[index,race]
        #Get the vote share: the precinct's % of total precinct votes in the district, times the votes to allocate
        if ((float(denom)==0)):
            vote_share = 0
        else:
            vote_share = (float(val)/float(denom))*float(numer)
        df_with_absentee_reallocated.at[index,add_var] = vote_share
        #Take the decimal remainder of the allocation
        df_with_absentee_reallocated.at[index,rem_var] = vote_share%1
        #Take the floor of the allocation
        df_with_absentee_reallocated.at[index,floor_var] = np.floor(vote_share)

#After the first pass through, get the sums of the races by district to assist in the rounding            
first_allocation = pd.DataFrame(df_with_absentee_reallocated.groupby(["countyfp_cd"]).sum())

#Now we want to iterate district by district to work on rounding
county_list = list(to_dole_out_totals["countyfp_cd"].unique()) 

#Iterate over the district
for county in county_list:
    for race in column_list:
        add_var = race+"_add"
        rem_var = race+"_rem"
        floor_var = race+"_floor"
        #Count how many votes still need to be allocated (because we took the floor of all the initial allocations)
        to_go = int(np.round((int(to_dole_out_totals.loc[to_dole_out_totals["countyfp_cd"]==county][race])-first_allocation.loc[first_allocation.index==county,floor_var])))
        #Grab the n precincts with the highest remainders and round these up, where n is the # of votes that still need to be allocated
        for index in df_with_absentee_reallocated.loc[df_with_absentee_reallocated["countyfp_cd"]==county][rem_var].nlargest(to_go).index:
            df_with_absentee_reallocated.at[index,add_var] = np.ceil(df_with_absentee_reallocated.at[index,add_var])

#Iterate over every race again
for race in column_list:
    add_var = race+"_add"
    #Round every allocation down to not add fractional votes
    df_with_absentee_reallocated.loc[:,add_var]=np.floor(df_with_absentee_reallocated.loc[:,add_var])
    df_with_absentee_reallocated.loc[:,race]+=df_with_absentee_reallocated.loc[:,add_var]
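
The allocation above follows a largest-remainder pattern: give each precinct the floor of its proportional share, then hand the leftover votes to the precincts with the largest fractional remainders. A minimal pure-Python sketch of that idea for a single candidate in a single county follows. The function name `allocate_largest_remainder` and the numbers are hypothetical; this illustrates the technique and is not a reproduction of VEST's exact procedure.

```python
import math

def allocate_largest_remainder(precinct_votes, votes_to_allocate):
    """Split votes_to_allocate across precincts in proportion to precinct_votes.

    Returns whole-number allocations that sum to votes_to_allocate,
    or all zeros when there are no precinct votes to base shares on.
    """
    total = sum(precinct_votes)
    if total == 0:
        return [0] * len(precinct_votes)
    shares = [v / total * votes_to_allocate for v in precinct_votes]
    allocation = [math.floor(s) for s in shares]
    remainders = [s - f for s, f in zip(shares, allocation)]
    # Votes still unassigned after flooring every share
    to_go = votes_to_allocate - sum(allocation)
    # Precinct positions ordered by largest fractional remainder first
    order = sorted(range(len(shares)), key=lambda i: remainders[i], reverse=True)
    for i in order[:to_go]:
        allocation[i] += 1
    return allocation

# Hypothetical county: three precincts and 7 absentee votes for one candidate
election_day = [60, 25, 15]
extra = allocate_largest_remainder(election_day, 7)
print(extra, [a + b for a, b in zip(election_day, extra)])
```

Here the raw shares are 4.2, 1.75 and 1.05, so the floors assign 6 votes and the remaining vote goes to the second precinct, which has the largest remainder.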

### Check vote totals - post absentee reallocation

In [None]:
#Column/race total check
print("Columns with differences printed below:")
for val in column_list:
    vote_dif = df_with_absentee_reallocated[val].sum()-gdfv[val].sum()
    if (vote_dif == 0):
        print(val+": EQUAL - "+ str(df_with_absentee_reallocated[val].sum()))
    else:
        print(val+": DIFFERENCE OF " + str(vote_dif)+ " VOTES")

In [None]:
#Differences between RDH/Partner total and VA Dept of Elections totals
one = 1409175 - 1408818.0
two = 1175731 - 1175732.0
three = 27987 - 27987.0
four = 1389 - 1528.0
five = 1368261 - 1368412.0
six = 1224519 - 1224520.0
seven = 2446 - 2606.0
eight = 1385389 - 1385390.0
nine = 1209339 - 1209540.0
ten = 2486 - 2614.0

print(one, two, three, four, five, six, seven, eight, nine, ten)

In [None]:
def county_totals_check(partner_df,source_df,column_list,county_col,full_print=False):
    print("***Countywide Totals Check***")
    print("")
    diff_counties=[]
    for race in column_list:
        diff = partner_df.groupby([county_col]).sum()[race]-source_df.groupby([county_col]).sum()[race]
        for val in diff[diff != 0].index.values.tolist():
            if val not in diff_counties:
                diff_counties.append(val)
        if len(diff[diff != 0]) != 0:
            print(race + " contains differences in these counties:")
            for val in diff[diff != 0].index.values.tolist():
                county_differences = diff[diff != 0]
                print("\t"+val+" has a difference of "+str(county_differences[val])+" votes")
                print("\t\tVEST: "+str(partner_df.groupby([county_col]).sum().loc[val,race])+" votes")
                print("\t\tSOURCES: "+str(source_df.groupby([county_col]).sum().loc[val,race])+" votes")
            if (full_print):
                for val in diff[diff == 0].index.values.tolist():
                    county_similarities = diff[diff == 0]
                    print("\t"+val + ": "+ str(partner_df.groupby([county_col]).sum().loc[val,race])+" votes")
        else:
            print(race + " is equal across all counties")
            if (full_print):
                for val in diff[diff == 0].index.values.tolist():
                    county_similarities = diff[diff == 0]
                    print("\t"+val + ": "+ str(partner_df.groupby([county_col]).sum().loc[val,race])+" votes")

In [None]:
county_totals_check(df_with_absentee_reallocated,gdfv,column_list,"LOCALITY",False)

## Unique Identifier to enable merge between election results and vest file

In [None]:
#Rely on VTDST code from vest file, and subset code from election results precinct column
def vtdst_changer(vtdst):
    if (vtdst[1:3] == ' -'):
        two_lead_zero = '00' + vtdst[:1]
        return two_lead_zero
    elif (vtdst[1:3] == '- '):
        two_lead_zero = '00' + vtdst[:1]
        return two_lead_zero
    elif (vtdst[-1:] == ' '):
        one_lead_zero = '0' + vtdst[:2]
        return one_lead_zero
    elif (vtdst[-1:] == '-'):
        one_lead_zero = '0' + vtdst[:2]
        return one_lead_zero
    else:
        return vtdst
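
For the five input shapes that `vtdst_changer` handles, the same padding can be sketched more compactly by stripping the trailing separator characters and left-padding with zeros. `pad_vtdst` is a hypothetical name, and the prefixes below are made-up examples of what the first three characters of a `Pct` label might look like.

```python
def pad_vtdst(prefix):
    """Left-pad the code in a 3-character precinct prefix to 3 characters."""
    # Drop the space/hyphen left over from slicing labels like '1 - Name'
    code = prefix.rstrip(' -')
    return code.zfill(3)

# Hypothetical 3-character prefixes sliced from precinct labels
for prefix in ['1 -', '1- ', '12 ', '12-', '101']:
    print(repr(prefix), '->', pad_vtdst(prefix))
```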

In [None]:
#Isolating 3 digit VTDST code in election results as it appears in the shapefile and vest file, then creating unique id
df_with_absentee_reallocated['vtdst'] = df_with_absentee_reallocated.Pct.str.slice(stop = 3)
df_with_absentee_reallocated['vtdst'] = df_with_absentee_reallocated['vtdst'].apply(vtdst_changer)
df_with_absentee_reallocated['unique_id'] = df_with_absentee_reallocated['COUNTYFP'] + df_with_absentee_reallocated['vtdst']
gdfv['unique_id'] = gdfv['COUNTYFP'] + gdfv['VTDST'].str.slice(start = 3)

print('id in vest file not in df: ', set(gdfv['unique_id']) - set(df_with_absentee_reallocated['unique_id']))
print('id in df not in vest file: ', set(df_with_absentee_reallocated['unique_id']) - set(gdfv['unique_id']))

In [None]:
double_in_df = df_with_absentee_reallocated['unique_id'].value_counts()
df_double_list = double_in_df[double_in_df > 1].index
double_in_vest = gdfv['unique_id'].value_counts()
vest_double_list = double_in_vest[double_in_vest > 1].index
print('doubled in vest file not doubled in df',set(vest_double_list) - set(df_double_list))
print('doubled in df not doubled in vest file', set(df_double_list) - set(vest_double_list))

In [None]:
#Number of "unique" values that are not unique - they are doubled and need to be made unique
df_double_list.shape

### Add cd to unique_id to add uniqueness to the doubled ids in vest file and election results df

In [None]:
gdfv[gdfv['unique_id'].isin(df_double_list)].head()

In [None]:
gdfv['old_unique_id'] = gdfv['unique_id']
df_with_absentee_reallocated['old_unique_id'] = df_with_absentee_reallocated['unique_id']

gdfv['cd'] = gdfv['PRECINCT'].str.slice(start=-3, stop=-1)
df_with_absentee_reallocated['cd'] = df_with_absentee_reallocated['Pct'].str.slice(start=-3, stop=-1)

gdfv['id_w_cd'] = gdfv['unique_id']+'-'+gdfv['cd']
df_with_absentee_reallocated['id_w_cd'] = df_with_absentee_reallocated['unique_id']+'-'+df_with_absentee_reallocated['cd']

gdfv.loc[gdfv['unique_id'].isin(df_double_list), 'unique_id'] = gdfv.loc[gdfv['unique_id'].isin(df_double_list), 'id_w_cd']
df_with_absentee_reallocated.loc[df_with_absentee_reallocated['unique_id'].isin(df_double_list), 'unique_id'] = df_with_absentee_reallocated.loc[df_with_absentee_reallocated['unique_id'].isin(df_double_list), 'id_w_cd']
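
The `cd` column relies on a fixed-width slice, which keeps the two characters before the final character of the label. The sketch below shows what that slice returns on two made-up labels and contrasts it with a regular-expression alternative. The `(CD n)` suffix is an assumption about the label format, not something confirmed by the source files, and `cd_from_label` is a hypothetical helper.

```python
import re

def cd_from_label(label):
    """Return the district number from a label ending in '(CD n)', else None."""
    match = re.search(r'\(CD\s*(\d+)\)\s*$', label)
    return match.group(1) if match else None

# Hypothetical labels; the '(CD n)' suffix is an assumed format
one_digit = '513 - Example Precinct (CD 8)'
two_digit = '112 - Example Precinct (CD 10)'

# Fixed-width slice, as in the notebook: two characters before the last one
print(repr(one_digit[-3:-1]), repr(two_digit[-3:-1]))
# Regex alternative: digits only, regardless of how many there are
print(cd_from_label(one_digit), cd_from_label(two_digit))
```

Under this assumed format, the slice returns a leading space for one-digit districts, while the regex returns only the digits. Either approach works for the join as long as both sides of the merge build their ids the same way.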

In [None]:
df_with_absentee_reallocated[['old_unique_id', 'unique_id']][df_with_absentee_reallocated['old_unique_id'].isin(df_double_list)].head()

## Join attempt 1 - election results to vest to check precinct totals

In [None]:
join_1_df_vest = pd.merge(df_with_absentee_reallocated, gdfv, on = 'unique_id', how = 'outer', indicator = True)

In [None]:
print(join_1_df_vest["_merge"].value_counts())
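
With `indicator=True`, the `_merge` column labels each row as `both`, `left_only`, or `right_only`. When the join keys are unique on each side, those labels correspond to simple set membership, which the plain-Python sketch below illustrates. `merge_status` is a hypothetical helper and the ids are made up.

```python
from collections import Counter

def merge_status(left_ids, right_ids):
    """Classify each distinct key as 'both', 'left_only' or 'right_only'."""
    left, right = set(left_ids), set(right_ids)
    status = {}
    for key in left | right:
        if key in left and key in right:
            status[key] = 'both'
        elif key in left:
            status[key] = 'left_only'
        else:
            status[key] = 'right_only'
    return status

# Hypothetical ids: one key is missing from each side
status = merge_status(['001101', '001201', '770020'],
                      ['001101', '001201', '770021'])
print(Counter(status.values()))
```

A `left_only` key here would be a precinct present in the election results but absent from the VEST file, and a `right_only` key the reverse.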

In [None]:
gdfv.shape

In [None]:
df_with_absentee_reallocated.shape

The election results comparison (validation run 1) shows that the only mismatches greater than 1 vote are in Roanoke City. Inspect the unmatched rows and compare:

In [None]:
join_1_df_vest[join_1_df_vest["_merge"]=="left_only"]

In [None]:
join_1_df_vest[join_1_df_vest["_merge"]=="right_only"]

In [None]:
df_with_absentee_reallocated[df_with_absentee_reallocated['unique_id'] == '770020']

In [None]:
gdfv[gdfv['unique_id'] == '770019']

In [None]:
df_with_absentee_reallocated[df_with_absentee_reallocated['unique_id'] == '770019']

In [None]:
gdfv[gdfv['unique_id'] == '770020']

**Modifications to create match**

Election results df `770019` = gdfv `770020`

Election results df `770020` != gdfv `770019` -- why are df `770020` election results so off?

Election results df `770018` = gdfv `770019`

Election results df `770020` = gdfv `770021`

### Make modifications based on first join attempt

In [None]:
#Fix Roanoke City
df_with_absentee_reallocated.loc[df_with_absentee_reallocated['old_unique_id']=='770019', 'unique_id'] = '770020'
df_with_absentee_reallocated.loc[df_with_absentee_reallocated['old_unique_id']=='770018', 'unique_id'] = '770019'
df_with_absentee_reallocated.loc[df_with_absentee_reallocated['old_unique_id']=='770020', 'unique_id'] = '770021'

## Join attempt 2 - election results to vest 

In [None]:
join_2_df_vest = pd.merge(df_with_absentee_reallocated, gdfv, on = 'unique_id', how = 'outer', indicator = True)

In [None]:
print(join_2_df_vest["_merge"].value_counts())

In [None]:
join_2_df_vest[join_2_df_vest["_merge"]=="right_only"]

After running validation on join attempt 2, `770019` still shows differences:

In [None]:
join_2_df_vest[join_2_df_vest['unique_id']=='770019']

In [None]:
#Run modification to better match results
df_with_absentee_reallocated.loc[df_with_absentee_reallocated['id_w_cd']=='770018-ic', 'unique_id']='770018'

## Join attempt 3 - election results to vest

In [None]:
join_3_df_vest = pd.merge(df_with_absentee_reallocated, gdfv, on = 'unique_id', how = 'outer', indicator = True)

In [None]:
print(join_3_df_vest["_merge"].value_counts())

In [None]:
join_3_df_vest[join_3_df_vest["_merge"]=="left_only"]

In [None]:
join_3_df_vest[join_3_df_vest["_merge"]=="right_only"]

The unique id for `770018` is listed differently on the left and right sides, but the election results match. The one precinct that appears only in VEST's file and not in the election results, `Fairfax Court` (`059700`), is a zero-vote precinct, so it does not affect the election results validation. Overall, the match is very good.

### Preliminary precinct level election result comparison

In [None]:
def validater_row (df, column_List):
    matching_rows = 0
    different_rows = 0
    diff_list=[]
    diff_values = []
    max_diff = 0
    for j in range(0,len(df.index)):
        same = True
        for i in column_List:
            left_Data = i + "_x"
            right_Data = i + "_y"
            diff = abs(df.iloc[j][left_Data]-df.iloc[j][right_Data])
            if(diff >0):
                if(diff>1): #7/12/21 LF mod to be >1 instead of >0 to print fewer results
                    print(i, "{:.>72}".format(df.iloc[j]["unique_id"]), "(V)","{:.>5}".format(int(df.iloc[j][left_Data]))," (S){:.>5}".format(int(df.iloc[j][right_Data])),"(D):{:>5}".format(int(df.iloc[j][right_Data])-int(df.iloc[j][left_Data])))           
                #print(df.iloc[j]['countypct'])
                
                diff_values.append(abs(diff))
                same = False
                if(np.isnan(diff)):
                    print("NaN value at diff is: ", df.iloc[j]["unique_id"])
                    print(df.iloc[j][left_Data])
                    print(df.iloc[j][right_Data])
                if (diff>max_diff):
                    max_diff = diff
                    #print("New max diff is: ", str(max_diff))
                    #print(df.iloc[j]['cty_pct'])
        if(same != True):
            different_rows +=1
            diff_list.append(df.iloc[j]["unique_id"])
        else:
            matching_rows +=1
    print("")
    print("There are ", len(df.index)," total rows")
    print(different_rows," of these rows have election result differences")
    print(matching_rows," of these rows are the same")
    print("")
    print("The max difference between any one shared column in a row is: ", max_diff)
    if(len(diff_values)!=0):
        print("The average difference is: ", str(sum(diff_values)/len(diff_values)))
    count_big_diff = len([i for i in diff_values if i > 10])
    print("There are ", str(count_big_diff), "precinct results with a difference greater than 10")
    diff_list.sort()
    #print(diff_list)

In [None]:
validater_row(join_3_df_vest[join_3_df_vest['_merge'] == 'both'].sort_values("unique_id"),column_list)

## Precinct Shapefile

In [None]:
county_fips = []
for directory in os.listdir('./raw_from_source/census_shps_by_county_all_unzip/'):
    if not directory[0] == '.':
        county_fips.append(directory[-5:])
        
proj = gdfv.crs   

county_vtds = []
for i in county_fips: #FIPS codes come from the directory names above, since no fips_codes file is available
    ref = './raw_from_source/census_shps_by_county_all_unzip/partnership_shapefiles_19v2_'
    vtd_ref = ref + i + '/PVS_19_v2_vtd_' + i + '.shp' 
    vtd_shp = gp.read_file(vtd_ref)
    county_vtds.append(vtd_shp)

shp = gp.GeoDataFrame(pd.concat(county_vtds, axis = 0) , crs = proj) 

shp.plot()
gdfv.plot()

In [None]:
shp['unique_id'] = shp['COUNTYFP'] + shp['VTDST'].str.slice(start = 3)
print('preliminary id in shp not in vest: ', len((set(shp['unique_id']) - set(gdfv['unique_id']))), 'shp length:', shp.shape[0])
print('preliminary id in vest not in shp: ', len((set(gdfv['unique_id']) - set(shp['unique_id']))), 'vest length', gdfv.shape[0])

### CD Shapefile - Load in CD info to make splits to match VEST

In [None]:
county_cd = []

for i in county_fips:
    ref = './raw_from_source/census_shps_by_county_all_unzip/partnership_shapefiles_19v2_'
    cd_ref = ref + i + '/PVS_19_v2_cd_' + i + '.shp' 
    cd_shp = gp.read_file(cd_ref)
    county_cd.append(cd_shp)
cd = gp.GeoDataFrame(pd.concat(county_cd, axis = 0) , crs = proj) 

cd.plot()
overlay = gp.overlay(cd, shp, how = 'union', make_valid = True, keep_geom_type = True)
overlay.plot()

In [None]:
overlay_w_shp = gp.GeoDataFrame(pd.merge(overlay, shp, on = 'unique_id', how = 'outer'), crs = proj)
overlay_w_shp['old_unique_id'] = overlay_w_shp['unique_id']

overlay_w_shp['id_w_cd'] = overlay_w_shp['unique_id'] + '- ' +overlay_w_shp['CDFP'].str.lstrip('0')

overlay_w_shp.loc[overlay_w_shp['old_unique_id'].isin(df_double_list), 'unique_id'] = overlay_w_shp.loc[overlay_w_shp['unique_id'].isin(df_double_list), 'id_w_cd']

In [None]:
#overlay_w_shp['unique_id'] = shp['COUNTYFP'] + shp['VTDST'].str.slice(start = 3)
print('preliminary id in overlay not in vest: ', len((set(overlay_w_shp['unique_id']) - set(gdfv['unique_id']))), 'overlay length:', overlay_w_shp.shape[0])
print('preliminary id in vest not in overlay: ', len((set(gdfv['unique_id']) - set(overlay_w_shp['unique_id']))), 'vest length', gdfv.shape[0])

In [None]:
join_overlay = pd.merge(gdfv, overlay_w_shp, on = 'unique_id', how = 'outer', indicator = True)
print(join_overlay["_merge"].value_counts())

In [None]:
left_only = join_overlay[join_overlay["_merge"]=="left_only"]
right_only = join_overlay[join_overlay["_merge"]=="right_only"]
left_only.to_csv("./gdfv1_only.csv")
right_only.to_csv("./overlay_only.csv")

The unmatched rows were hand matched in Excel, using the CSVs and precinct names, to determine which precincts need to be merged and which need to be split.

### Modify overlay to match gdfv based on hand matching in Excel

In [None]:
#Dict based on Excel hand matching
overlay_to_gdf_dict = {"520041":"520004",
"520042":"520004",
"077011":"077401",
"077012":"077401",
"035401":"035405",
"095041":"095104",
"095042":"095104",
"153112- 10":"153112-10",
"059513- 11":"059513-11",
"153061":"153106",
"153062":"153106",
"685031":"685003",
"685032":"685003",               
"153110- 10":"153110-10",
"153210- 10":"153210-10",
"153210- 11":"153210-11",
"153312- 11":"153312-11",
"153609- 11":"153609-11"}

In [None]:
#Apply dictionary to improve match rate
overlay_w_shp['old_unique_id_w_cd'] = overlay_w_shp['unique_id']
overlay_w_shp.loc[overlay_w_shp['old_unique_id_w_cd'].isin(overlay_to_gdf_dict.keys()), 'unique_id'] = overlay_w_shp['old_unique_id_w_cd'].map(overlay_to_gdf_dict)
#clean up geometry columns
overlay_w_shp['geometry'] = overlay_w_shp['geometry_x']
overlay_w_shp.loc[overlay_w_shp['geometry_x'].isna(), 'geometry'] = overlay_w_shp.loc[overlay_w_shp['geometry_x'].isna(), 'geometry_y']
#Dissolve: rows that share the same unique_id have their geometries combined into one
overlay_w_shp = overlay_w_shp.dissolve(by = 'unique_id', as_index = False)

In [None]:
overlay_w_shp.columns

# Join shapefile and election results

In [None]:
shp_df_merge = pd.merge(overlay_w_shp, df_with_absentee_reallocated, on = 'unique_id', how = 'outer', suffixes = ['_x', '_y'], indicator=True)
shp_df_gdf = gp.GeoDataFrame(shp_df_merge, geometry = 'geometry')

shp_df_gdf = shp_df_gdf.drop(['geometry_x', 'geometry_y'], axis = 1)

print(shp_df_merge["_merge"].value_counts())

In [None]:
overlay_w_shp.shape

# Validation

## Shapefile

In [None]:
shp_gdfv_merge = pd.merge(shp_df_gdf, gdfv, on = 'unique_id', how = 'outer', suffixes = ['_x', '_y'])
shp_gdfv_merge = shp_gdfv_merge.reset_index()

both = shp_gdfv_merge[shp_gdfv_merge["_merge"]=="both"]
both.reset_index(drop=True,inplace=True)
source_geoms = gp.GeoDataFrame(both,geometry="geometry_x",crs=gdfv.crs)
vest_geoms = gp.GeoDataFrame(both,geometry="geometry_y",crs=gdfv.crs)
source_geoms = source_geoms.to_crs(3857)
vest_geoms = vest_geoms.to_crs(3857)
source_geoms["geometry_x"]=source_geoms.buffer(0)
vest_geoms["geometry_y"]=vest_geoms.buffer(0)
vals = source_geoms.geom_almost_equals(vest_geoms,decimal=0)
print(vals.value_counts())

In [None]:
count = 0
area_list = []
big_diff = pd.DataFrame(columns=["area"])
for i in range(0,len(source_geoms)):
    diff = source_geoms.iloc[[i]].symmetric_difference(vest_geoms.iloc[[i]])
    intersection = source_geoms.iloc[[i]].intersection(vest_geoms.iloc[[i]])
    area = float(diff.area.iloc[0]/1e6) #convert m^2 to km^2 (1 km^2 = 1e6 m^2)
    area_list.append(area)
    #print("Area is " + str(area))

    if (area > 1):
        count += 1
        name = source_geoms.at[i,"unique_id"]
        big_diff.loc[name]=area
        print(str(count)+") For SOURCE: " + name + ', VEST: '+ vest_geoms.at[i,"unique_id"]+ " difference in area is " + str(area))
        if (intersection.iloc[0].is_empty):
            base = diff.plot(color="red")
            source_geoms.iloc[[i]].plot(color="orange",ax=base)
            vest_geoms.iloc[[i]].plot(color="blue",ax=base)
            base.set_title(name)
        else:
            base = diff.plot(color="red")
            source_geoms.iloc[[i]].plot(color="orange",ax=base)
            vest_geoms.iloc[[i]].plot(color="blue",ax=base)
            intersection.plot(color="green",ax=base)
            base.set_title(name)
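
Because the geometries are reprojected to EPSG:3857, whose units are meters, the `.area` values are in square meters of the projected plane. A quick check of the unit arithmetic is sketched below, since scientific-notation literals are easy to misread: `10e6` is 1e7, whereas one square kilometer is 1e6 square meters. The helper name `sq_m_to_sq_km` is hypothetical.

```python
SQ_M_PER_SQ_KM = 1_000 * 1_000  # 1 km = 1,000 m, so 1 km^2 = 1e6 m^2

def sq_m_to_sq_km(area_sq_m):
    """Convert an area in square meters to square kilometers."""
    return area_sq_m / SQ_M_PER_SQ_KM

print(sq_m_to_sq_km(2_500_000))  # 2.5
print(10e6 == 1e7, 1e6 == SQ_M_PER_SQ_KM)
```

Web Mercator also inflates areas away from the equator, so these figures are best read as relative indicators of mismatch rather than as surveyed areas.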

Print the rows for `unique_id` values whose geometries look similar in the plots above, to see whether they can be combined.

In [None]:
source_geoms[(source_geoms['unique_id']=='057301')]

In [None]:
source_geoms[(source_geoms['unique_id']=='057401')]

In [None]:
source_geoms[(source_geoms['unique_id']=='075401')]

In [None]:
source_geoms[(source_geoms['unique_id']=='075402')]

In [None]:
source_geoms[(source_geoms['unique_id']=='085201')]

In [None]:
source_geoms[(source_geoms['unique_id']=='085202')]

In [None]:
source_geoms[(source_geoms['unique_id']=='147101')]

In [None]:
source_geoms[(source_geoms['unique_id']=='147201')]

In [None]:
df = pd.DataFrame(area_list)
print(df.shape)

print(str(len(df[df[0]==0]))+" precincts w/ a difference of 0 km^2")
print(str(len(df[(df[0]<.1) & (df[0]>0)]))+ " precincts w/ a difference between 0 and .1 km^2")
print(str(len(df[(df[0]<.5) & (df[0]>=.1)]))+ " precincts w/ a difference between .1 and .5 km^2")
print(str(len(df[(df[0]<1) & (df[0]>=.5)]))+ " precincts w/ a difference between .5 and 1 km^2")
print(str(len(df[(df[0]<2) & (df[0]>=1)]))+ " precincts w/ a difference between 1 and 2 km^2")
print(str(len(df[(df[0]<5) & (df[0]>=2)]))+ " precincts w/ a difference between 2 and 5 km^2")
print(str(len(df[(df[0]>=5)]))+ " precincts w/ a difference greater than 5 km^2")