In [1]:
import maup # mggg's library for proration, see documentation here: https://github.com/mggg/maup
import pandas as pd # standard python data library
import geopandas as gp # the geo-version of pandas
import numpy as np 
import os
import fiona
from statistics import mean, median
from pandas import read_csv
gp.io.file.fiona.drvsupport.supported_drivers['KML'] = 'rw' #To load KML files

# VEST PA Validation

In [2]:
vest_pa_18 = gp.read_file("./raw-from-source/VEST/pa_2018/pa_2018.shp")

Election results from the Pennsylvania Secretary of State's office via OpenElections (https://github.com/openelections/openelections-data-pa/). Precinct data was corrected with canvass reports for the following counties: Berks, Blair, Bradford, Cambria, Carbon, Crawford, Elk, Forest, Franklin, Lawrence, Lycoming, Mifflin, Montgomery, Montour, Northumberland, Susquehanna. The candidate totals for Berks, Blair, Crawford, and Mifflin differ from the county totals reported by the state and therefore the statewide totals differ from the official results accordingly.

Precinct shapefiles primarily from the U.S. Census Bureau's 2020 Redistricting Data Program Phase 2 release. The shapefiles from Delaware County and the City of Pittsburgh are from the respective jurisdictions instead. Precinct numbers were corrected to match the voter file in the following locales: Allegheny (Elizabeth, Pittsburgh W12), Blair (Greenfield), Bradford (Athens), Greene (Monongahela), Monroe (Smithfield), Montgomery (Hatfield), Northampton (Bethlehem Twp), Perry (Toboyne), Washington (New Eagle, Somerset), York (Fairview).

Precinct boundaries throughout the state were edited to match voter assignments in the PA Secretary of State voter file from the 2018 election cycle. While some edits reflect official updates to wards or divisions, the great majority involve voters incorrectly assigned to voting districts by the counties. As such, the VEST shapefile endeavors to reflect the de facto precinct boundaries, and these often differ from the official voting district boundaries, in some cases quite drastically. Wherever possible, edits were made using census boundaries or, alternatively, the parcel shapefiles from the respective counties.

In certain areas voter assignments appear so erratic that it is impractical to place all voters within their assigned precinct. These areas were edited so as to place as many voters as possible within their assigned precinct without displacing a greater number from their assigned precinct. In general, municipal boundaries were retained except where significant numbers of voters were misassigned to the wrong municipality. In cases where the odd/even split was incorrectly reversed for precinct boundary streets, the official boundary was retained. All such cases involved nearly equal numbers of voters swapped between voting districts.

The following revisions were made to the base shapefiles to match the de facto 2018 precinct boundaries consistent with the voter file. Individual precincts are noted in cases of splits or merges. Due to the sheer number of edits, boundary adjustments are noted at the borough/township level. There may be as many as two dozen individual precincts that were revised within a given municipality.

Adams: Adjust Cumberland, Franklin  
Allegheny: Merge CD splits for S Fayette 3/5; Split Pittsburgh W5 11/17; Merge Pittsburgh W16 9/11/12; Align McCandless with municipal boundary; Adjust Avalon, Baldwin, Bethel Park, Braddock, Brentwood, Castle Shannon, Clairton, Collier, Coraopolis, Crescent, Dormont, Dravosburg, Duquesne, E Deer, E McKeesport, E Pittsburgh, Elizabeth, Emsworth, Forward, Glassport, Hampton, Harmar, Ingram, Jefferson Hills, Kennedy, Leet, Liberty, Marshall, McCandless, McKees Rocks, McKeesport, Monroeville, Moon, Mount Lebanon, Munhall, N Fayette, N Versailles, O'Hara, Oakdale, Penn Hills, Pine, Pittsburgh (nearly all wards), Pleasant Hills, Reserve, Richland, Ross, Scott, Sewickley, Shaler, S Fayette, S Park, Stowe, Swissvale, Upper St. Clair, W Deer, W Homestead, W Mifflin, W View, Whitaker, Whitehall, Wilkins, Wilkinsburg
Armstrong: Align Dayton, Elderton, Ford City, Kittanning, N Apollo with municipal boundaries; Adjust Ford City, Gilpin, Kiskiminetas, Kittanning, Manor, N Buffalo, Parks, Parker City, S Buffalo  
Beaver: Adjust Aliquippa, Ambridge, Baden, Beaver, Brighton, Center, Chippewa, Conway, Economy, Franklin, Hanover, Harmony, Hopewell, Midland, Monaca, N Sewickley  
Bedford: Adjust Bedford Boro, Bedford Twp  
Berks: Adjust Cumru, Douglass, Oley, Maxatawny, Robeson, Sinking Spring, Spring, Union  
Blair: Merge Tunnelhill/Allegheny Twp 4; Align Altoona, Bellwood, Duncansville, Hollidaysburg, Newry, Roaring Spring, Tyrone, Williamsburg with municipal boundaries; Adjust Allegheny, Altoona, Antis, Frankstown, Freedom, Greenfield, Huston, Juniata, N Woodbury, Logan, Snyder, Tyrone Boro, Tyrone Twp  
Bucks: Align Sellersville, Tullytown with municipal boundaries; Adjust Bensalem, Bristol Boro, Bristol Twp, Buckingham, Doylestown Twp, Falls, Hilltown, Lower Makefield N, Lower Southampton E, Middletown, Milford, Morrisville, Newtown Twp, Northampton, Solebury Lower, Solebury, Springfield, Tinicum, Upper Makefield, Upper Southampton E, Warminster, Warrington, W Rockhill  
Butler: Merge CD splits for Cranberry E 2, 3, Cranberry W 1, 2, Jefferson 1, 2; Align Butler Twp, Valencia with municipal boundaries; Adjust Adams, Buffalo, Butler Boro, Butler Twp, Center, Cranberry E, Cranberry W, Jackson, Jefferson, Zelienople
Cambria: Align Daisytown, Sankertown, W Taylor, Wilmore with municipal boundaries; Adjust Cambria, Conemaugh, Croyle, E Taylor, Ebensburg, E Carroll, Geistown, Jackson, Johnstown W8, W17, W20, Lower Yoder, Northern Cambria, Portage Boro, Portage Twp, Richland, Southmont, Stonycreek, Summerhill, Susquehanna, Upper Yoder, W Carroll, Westmont
Cameron: Adjust Emporium, Shippen
Carbon: Adjust Jim Thorpe, Kidder, Mahoning, New Mahoning, Summit Hill
Centre: Merge CD splits for Halfmoon E Central/Proper; Merge Ferguson Northeast 1 A/B; Adjust Benner, College, Ferguson, Patton
Chester: Merge CD/LD splits for Birmingham 2, Phoenixville M 1; Adjust Birmingham, E Bradford S, E Fallowfield, E Goshen, E Marlborough, Easttown, N Coventry, Spring City, Tredyffrin M, Uwchlan, W Bradford, W Caln, W Goshen N, W Goshen S, Westtown
Clarion: Merge Emlenton/Richland; Adjust Clarion, Highland, Farmington, Knox
Clearfield: Adjust Bradford, Cooper, Decatur, Golden Rod, Lawrence Glen Richie, Morris, Plympton, Woodward
Columbia: Merge Ashland/Conyngham; Adjust Orange, Scott West
Crawford: Align Mead, Woodcock with municipal boundaries
Cumberland: Merge CD splits for N Middleton 1, 3; Split Lower Allen 1/Annex; Align Carlisle, E Pennsboro, Hampton, Lemoyne, Lower Allen, Mechanicsburg, Middlesex, Mount Holly Springs, N Middleton, Shiremanstown, Silver Spring, W Pennsboro, Wormleysburg with municipal boundaries
Dauphin: Align Middletown with municipal boundary; Adjust Derry, Harrisburg W1, W7, W8, W9, Hummelstown, Lower Paxton, Lykens, Middletown
Delaware: Adjust Chester, Concord, Darby Boro, Darby Twp, Haverford, Marple, Nether Providence, Newtown, Radnor, Ridley, Sharon Hill, Thornbury, Tinicum, Trainer, Upper Chichester, Upper Darby, Upper Providence
Elk: Split N/S Horton; Adjust Johnsonburg, Ridgway Boro, Ridgway Twp, St. Marys
Erie: Adjust Erie W1, W4, W5, W6, Greene, Lawrence Park, McKean, Millcreek, North East
Fayette: Align Dunbar with municipal boundary; Adjust Brownsville, Bullskin, Dunbar, Georges, German, Luzerne, N Union, Redstone
Franklin: Align Mercersburg with municipal boundary; Adjust Antrim, Fannett, Greene, Guilford, Hamilton, Metal, Peters, Quincy, St. Thomas, Southampton, Washington
Fulton: Align McConnellsburg with municipal boundary
Greene: Align Carmichaels with municipal boundary; Adjust Cumberland, Dunkard, Franklin, Jefferson, Lipencott, Mather, Morgan Chart, Monongahela, Nemacolin
Huntingdon: Merge CD splits for Penn; Adjust Huntingdon, Mount Union
Jefferson: Align Reynoldsville with municipal boundary; Adjust Punxsutawney
Lackawanna: Adjust Archbald, Blakely, Carbondale, Clarks Summit, Dickson City, Dunmore, Fell, Jermyn, Jessup, Mayfield, Moosic, Old Forge, Olyphant, Scranton W1, W2, W3, W6, W7, W10, W12, W13, W14, W15, W16, W19, W20, W23, S Abington, Taylor, Throop
Lancaster: Split Lancaster 7-8 CV/LS; Adjust Brecknock, Columbia, E Hempfield, E Lampeter, E Petersburg, Elizabethtown, Ephrata, Lancaster W4, W8, Lititz, Manheim, Manor, Millersville, Mt Joy Boro, Mt Joy Twp, New Holland, Penn, Providence, Rapho, Warwick, W Cocalico, W Donegal, W Hempfield
Lawrence: Adjust Neshannock
Lebanon: Adjust Jackson, Lickdale, S Lebanon, Union Green Pt
Lehigh: Adjust Lower Macungie, Salisbury
Luzerne: Merge CD splits for Hazle 1; Align Avoca, Pittston with municipal boundaries; Adjust Butler, Dallas, Exeter, Foster, Freeland, Hanover, Hazle, Jenkins, Kingston Boro, Kingston Twp, Larksville, Lehman, Nanticoke, Newport, Plains, Salem, Swoyersville, W Wyoming, Wilkes-Barre
Lycoming: Align Williamsport with municipal boundary; Adjust Jersey Shore
McKean: Adjust Bradford City, Bradford Twp, Foster, Keating, Otto
Mercer: Adjust Delaware, Fredonia, Greenville, Hempfield, Hermitage, Sharon, Sharpsville, S Pymatuning, W Salem
Mifflin: Split Brown Reedsville/Church Hill
Monroe: Align E Stroudsburg with municipal boundary; Adjust E Stroudsburg, Smithfield
Montgomery: Add CD special election splits for Horsham 2-2, Perkiomen 1, Plymouth 2-3; Adjust Abington, Lower Merion, Pottstown, Springfield, Upper Moreland, Upper Merion, Upper Providence
Northampton: Align Glendon, Walnutport with municipal boundaries; Adjust Bangor, Bethlehem W2, W3, W4, W7, W9, W14, W15, Bethlehem Twp, Bushkill, Easton, Forks, Hanover, Hellertown, Lehigh, Lower Mt Bethel, Lower Saucon, Moore, Nazareth, Palmer, Plainfield, Upper Mt Bethel, Washington, Williams
Northumberland: Align Northumberland with municipal boundary; Adjust Coal, Milton, Mount Carmel W, Natalie-Strong, Northumberland, Point, Ralpho, Shamokin, Sunbury, Upper Augusta
Philadelphia: Adjust 1-19/21, 5-3/19, 7-2/3/17, 7-6/7, 9-5/6, 15-7/10, 17-20/26, 20-5/10, 21-1/15, 21-40/41, 22-21/26, 23-11/12, 25-9/17, 25-4/7/12, 25-10/12, 26-1/2, 27-7/8, 27-18/20/21, 28-1/8, 29-9/11, 29-10/17, 30-14/15, 31-5/6, 38-11/17, 38-13/20, 38-15/19, 40-12/18/19, 40-17/19, 42-3/4/7, 44-8/14, 50-3/12, 50-11/27, 52-2/6/9, 52-3/8, 57-6/7, 57-10/27, 57-17/28, 58-6/12, 62-5/19, 65-4/7, 65-11/16, 66-22/34  
Pike: Adjust Matamoras  
Potter: Adjust Galeton, Sharon  
Schuylkill: Adjust Coaldale, N Manheim, Norwegian, Porter, Pottsville
Somerset: Align New Centerville with municipal boundary; Adjust Conemaugh, Jefferson, Middlecreek, Paint, Somerset Boro  
Susquehanna: Adjust Montrose, Lanesboro, Susquehanna Depot  
Tioga: Adjust Delmar, Wellsboro  
Union: Adjust Buffalo, White Deer  
Venango: Adjust Franklin, Sugarcreek, Cornplanter, Oil City  
Warren: Adjust Conewango  
Washington: Align Allenport, Beallsville, Burgettstown, Canonsburg, Carroll, Charleroi, Claysville, Elco, Finleyville, Houston, Long Branch, McDonald, Monongahela, Speers, Twilight with municipal boundaries; Adjust Amwell, Bentleyville, California, Canonsburg, Canton, Cecil, Centerville, Chartiers, Donegal, Donora, Fallowfield, Hanover, Independence, Mount Pleasant, N Franklin, N Strabane, Peters, Robinson, Smith, Somerset, S Franklin, S Strabane, Union Washington, W Brownsville  
Wayne: Adjust Honesdale  
Westmoreland: Merge CD splits for Unity Pleasant Unity; Align Greensburg with municipal boundary; Adjust Allegheny, Arnold, Bell, Derry, E Huntingdon, Fairfield, Greensburg W1-W8, Hempfield, Jeannette, Latrobe, Ligonier, Lower Burrell, Monessen, Mount Pleasant, Murrysville, New Kensington, N Belle Vernon, N Huntingdon, Penn, Rostraver, St. Clair, Scottdale, Sewickley, S Greensburg, S Huntingdon, Trafford, Upper Burrell, Unity, Vandergrift, Washington, Youngwood  
Wyoming: Adjust Falls  
York: Merge CD splits for York Twp 5-3; Align E Prospect, Goldsboro, Jefferson, Manchester, Monaghan, Wellsville, York with municipal boundaries; Adjust Chanceford, Codorus, Conewago, Dover, Fairview, Hanover, Jackson, Lower Windsor, New Freedom, Newberry, N Codorus, Penn, Red Lion, Shrewsbury, Spring Garden, Springettsbury, W Manchester, Windsor Boro, Windsor Twp, Wrightsville, York Twp, York W5, W6, W15  

In [3]:
print(vest_pa_18.head())
print(vest_pa_18.columns)

col_list = ['G18USSDCAS', 'G18USSRBAR','G18USSLKER', 'G18USSGGAL', 'G18GOVDWOL', 'G18GOVRWAG', 'G18GOVLKRA','G18GOVGGLO']
print("")
print("Here are the vote totals:")
for i in col_list:
    print(i + ": "+str(sum(vest_pa_18[i])))

  STATEFP COUNTYFP   VTDST          NAME  G18USSDCAS  G18USSRBAR  G18USSLKER  \
0      42      001  000010   ABBOTTSTOWN         120         183           5   
1      42      001  000020  ARENDTSVILLE         151         178           6   
2      42      001  000030  BENDERSVILLE          74         103           1   
3      42      001  000040       BERWICK         289         575          14   
4      42      001  000050   BIGLERVILLE         152         231           3   

   G18USSGGAL  G18GOVDWOL  G18GOVRWAG  G18GOVLKRA  G18GOVGGLO  \
0           2         120         185           2           2   
1           3         160         172           4           2   
2           2          76          98           3           2   
3           5         318         554           9           5   
4           7         168         215           5           2   

                                            geometry  
0  POLYGON Z ((-76.99801 39.88359 0.00000, -76.99...  
1  POLYGON Z ((-77

In [4]:
fips_file = pd.read_csv("./raw-from-source/FIPS/US_FIPS_Codes.csv")
fips_file = fips_file[fips_file["State"]=="Pennsylvania"]
fips_file["FIPS County"]=fips_file["FIPS County"].astype(str)
fips_file["FIPS County"]=fips_file["FIPS County"].str.zfill(3)
fips_file["unique_ID"] =  "42" + fips_file["FIPS County"]
fips_codes = fips_file["unique_ID"].tolist()
print(fips_file["County Name"].unique())
pa_fips_dict = dict(zip(fips_file["County Name"],fips_file["FIPS County"]))

['Adams' 'Allegheny' 'Armstrong' 'Beaver' 'Bedford' 'Berks' 'Blair'
 'Bradford' 'Bucks' 'Butler' 'Cambria' 'Cameron' 'Carbon' 'Centre'
 'Chester' 'Clarion' 'Clearfield' 'Clinton' 'Columbia' 'Crawford'
 'Cumberland' 'Dauphin' 'Delaware' 'Elk' 'Erie' 'Fayette' 'Forest'
 'Franklin' 'Fulton' 'Greene' 'Huntingdon' 'Indiana' 'Jefferson' 'Juniata'
 'Lackawanna' 'Lancaster' 'Lawrence' 'Lebanon' 'Lehigh' 'Luzerne'
 'Lycoming' 'McKean' 'Mercer' 'Mifflin' 'Monroe' 'Montgomery' 'Montour'
 'Northampton' 'Northumberland' 'Perry' 'Philadelphia' 'Pike' 'Potter'
 'Schuylkill' 'Snyder' 'Somerset' 'Sullivan' 'Susquehanna' 'Tioga' 'Union'
 'Venango' 'Warren' 'Washington' 'Wayne' 'Westmoreland' 'Wyoming' 'York']
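
The county lookup built in this cell can be sketched on toy data. The two-row frame below is a made-up stand-in for `US_FIPS_Codes.csv`; only the column names (`State`, `County Name`, `FIPS County`) are taken from the cell, and `toy_fips` and `toy_dict` are hypothetical names.

```python
import pandas as pd

# Hypothetical two-row stand-in for the FIPS file (not the real CSV)
toy_fips = pd.DataFrame({
    "State": ["Pennsylvania", "Pennsylvania"],
    "County Name": ["Adams", "York"],
    "FIPS County": [1, 133],
})

# Zero-pad to the three-character county code used by COUNTYFP
toy_fips["FIPS County"] = toy_fips["FIPS County"].astype(str).str.zfill(3)

# Map county name -> three-character county code
toy_dict = dict(zip(toy_fips["County Name"], toy_fips["FIPS County"]))
print(toy_dict)
```

Keeping the codes as zero-padded strings matters because they are later concatenated with precinct names to form a join key.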


In [5]:
pa_election = pd.read_csv("./raw-from-source/Election_Results/openelections-data-pa-master/2018/20181106__pa__general__precinct.csv")


Do not include the "Straight Party" votes
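
A minimal sketch of one way to leave those rows out, assuming they appear under their own label in the `office` column. The label `"Straight Party"` and the toy rows are assumptions for illustration, not values checked against the OpenElections file.

```python
import pandas as pd

# Made-up rows; the "Straight Party" label is an assumption
toy = pd.DataFrame({
    "office": ["U.S. Senate", "Straight Party", "Governor"],
    "votes": [10, 7, 12],
})

# Keeping an explicit allow-list of offices drops everything else,
# including any straight-party tallies
kept = toy[toy["office"].isin(["U.S. Senate", "Governor"])]
print(kept)
```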

In [6]:
#The full file does not include the Governors results from Butler
#Clearfield does not include Senate results
#Westmoreland does not include governor results

In [7]:
pa_election[pa_election["county"]=="Westmoreland"].to_csv("./westmoreland.csv")
pa_election[pa_election["county"]=="Butler"].to_csv("./butler.csv")
pa_election[pa_election["county"]=="Clearfield"].to_csv("./clearfield.csv")

In [8]:
office_list = ["U.S. Senate", 'Governor']
# Take an explicit copy so the column assignments below modify an independent DataFrame
filtered_pa_election = pa_election[pa_election["office"].isin(office_list)].copy()
# One county label carries a trailing space; normalize it so it matches the FIPS lookup
county_changes_dict = {"Washington ":"Washington"}
filtered_pa_election["county"] = filtered_pa_election["county"].map(county_changes_dict).fillna(filtered_pa_election["county"])
filtered_pa_election["County_FIPS"]=filtered_pa_election.loc[:,"county"].map(pa_fips_dict).fillna(filtered_pa_election.loc[:,"county"])
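
Assigning to a column of a filtered frame is the situation in which pandas may emit a `SettingWithCopyWarning`. A small sketch on made-up rows of filtering with an explicit `.copy()` and then applying the same `map(...).fillna(...)` relabeling; `raw`, `subset`, and `fixes` are hypothetical names.

```python
import pandas as pd

# Made-up rows; note the trailing space in the first county label
raw = pd.DataFrame({
    "county": ["Washington ", "York"],
    "office": ["Governor", "Governor"],
})

# .copy() makes the filtered frame independent of `raw`
subset = raw[raw["office"] == "Governor"].copy()

# map() returns NaN for labels missing from the dict; fillna() restores them
fixes = {"Washington ": "Washington"}
subset["county"] = subset["county"].map(fixes).fillna(subset["county"])
print(subset)
```

The original frame is left untouched, which makes it easier to re-run the cleaning steps from the raw data.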


In [9]:
19,33,129

(19, 33, 129)

In [10]:
filtered_pa_election["pivot_col"]=filtered_pa_election["County_FIPS"]+filtered_pa_election["precinct"]
filtered_pa_election["candidate"]=filtered_pa_election["candidate"].str.upper()
filtered_pa_election["candidate"] = filtered_pa_election["candidate"].str.strip()
filtered_pa_election["party"] = filtered_pa_election["party"].str.upper()


print(filtered_pa_election["party"].unique())

party_changes_dict = {"DEMOCRATIC":"DEM","REPUBLICAN":"REP","LIBERTARIAN":"LIB","GREEN":"GRN",
                     "GR":"GRN","GRE":"GRN","DEMOCRAT":"DEM"}

filtered_pa_election["party"] = filtered_pa_election["party"].map(party_changes_dict).fillna(filtered_pa_election["party"])



['DEM' 'REP' 'GRN' 'LIB' nan 'GREEN' 'GR' 'GRE' 'DEMOCRATIC' 'REPUBLICAN'
 'LIBERTARIAN' 'DEMOCRAT']




In [11]:
print(filtered_pa_election.head())
examine_list = ["019","129","033"]
print(filtered_pa_election[filtered_pa_election["County_FIPS"].isin(examine_list)])

filtered_pa_election[filtered_pa_election["County_FIPS"].isin(examine_list)].to_csv("./problem_counties_pre.csv")

  county          precinct       office  district           candidate party  \
0   York  Carroll Township  U.S. Senate       NaN      BOB CASEY, JR.   DEM   
1   York  Carroll Township  U.S. Senate       NaN        LOU BARLETTA   REP   
2   York  Carroll Township  U.S. Senate       NaN           NEAL GALE   GRN   
3   York  Carroll Township  U.S. Senate       NaN  DALE R. KERNS, JR.   LIB   
4   York  Carroll Township  U.S. Senate       NaN           WRITE-INS   NaN   

  votes  absentee  election_day County_FIPS            pivot_col  
0   958       NaN           NaN         133  133Carroll Township  
1  1858       NaN           NaN         133  133Carroll Township  
2    18       NaN           NaN         133  133Carroll Township  
3    32       NaN           NaN         133  133Carroll Township  
4     0       NaN           NaN         133  133Carroll Township  
             county                    precinct       office  district  \
53778        Butler       0001 ADAMS TOWNSHIP 1  

In [12]:
filtered_pa_election = filtered_pa_election[~(filtered_pa_election["candidate"].str[-3:]=="(W)")]

In [13]:
#Things to look into

party_cand_list = [
  'DEMOCRATIC', 
 'REPUBLICAN',
 'GREEN', 
 'INDEPENDENT', 
 'LIBERTARIAN', 

]

In [14]:
candidate_name_changes = {
   'DEMOCRATIC':'DEM', 
 'REPUBLICAN':"REP",
 'GREEN':"GRN", 
 'INDEPENDENT':"IND", 
 'LIBERTARIAN':"LIB",
    
    
    'LOU BARLETTA':'BARLETTA',
 'LOU  BARLETTA':'BARLETTA',
 'LOU BARLETTA JR':'BARLETTA',
 'BARLETTA, LOU':'BARLETTA',

    'KEN V KRAWCHUK, GOVERNOR':'KRAWCHUK',
    'KEN V. KRAWCHUK/K.S. SMITH':'KRAWCHUK',
    'KRAWCHUK /SMITH':'KRAWCHUK',
    'KEN V. KRAWCHUK KATHLEEN S. SMITH':'KRAWCHUK',
    'KRAWCHUK\\SMITH':'KRAWCHUK',
     'KEN V KRAWCHUK':'KRAWCHUK', 
 'KRAWCHUK / SMITH':'KRAWCHUK',
 'KRAWCHUK/SMITH':'KRAWCHUK',
 'KEN V. KRAWCHUK/K. S. SMITH':'KRAWCHUK',
 'KEN KRAWCHUK':'KRAWCHUK',
 'KRAWCHUK/ SMITH':'KRAWCHUK',
 'KEN V. KRAWCHUK':'KRAWCHUK',
 'KRAWCHUK, KEN V.':'KRAWCHUK',
 'KEN V. KRAWCHUK / K. S. SMITH':'KRAWCHUK',
    
    'GLOVER / BOSTICK':'GLOVER',
    'PAUL GLOVER, GOVERNOR':'GLOVER',
    'GLOVER / BOWSER BOSTICK':'GLOVER',
    'PAUL GLOVER/J. BOWSER-BOSTICK':'GLOVER',
    'GLOVER/BOSTICK':'GLOVER',
    'GLOVER/BOWSER-BOSTIC':'GLOVER',
    'PAUL GLOVER JOCOLYN BOWSER-BOSTICK':'GLOVER',
    'GLOVER/BOWSERBOS':'GLOVER',
    'GLOVER\\BOWSERBOSTICK':'GLOVER',
     'GLOVER / BOWSER-BOSTICK':'GLOVER', 
 'GLOVER/BOWSER-BOSTICK':'GLOVER',
 'GLOVER/BOWSER-BOS':'GLOVER', 
 'PAUL GLOVER/JOCOLYN BOWER-BOSTICK':'GLOVER', 
 'PAUL  GLOVER':'GLOVER',
 'GLOVER / BOWSER-BOS':'GLOVER',
 'PAUL GLOVER':'GLOVER',
 'GLOVER, PAUL':'GLOVER',
 'PAUL GLOVER / J. BOWSER BOSTICK':'GLOVER',
    
    'SCOTT R WAGNER, GOVERNOR':'WAGNER',
    'SCOTT R. WAGNER JEFF BARTOS':'WAGNER',
    'WAGNER\\BARTOS':'WAGNER',
     'SCOTT R WAGNER':'WAGNER', 
    'WAGNER/BARTOS':'WAGNER',
 'WAGNER / BARTOS':'WAGNER',
  'SCOTT R. WAGNER/JEFF BARTOS':'WAGNER',
 'WAGNER/ BARTOS':'WAGNER',
 'SCOTT R WAGNER AND JEFF BARTOS':'WAGNER',
 'SCOTT WAGNER':'WAGNER',
 'SCOTT R. WAGNER':'WAGNER',
 'WAGNER, SCOTT R.':'WAGNER',
 'SCOTT R. WAGNER / JEFF BARTOS':'WAGNER',
    
    'TOM WOLF, GOVERNOR':'WOLF',
    'TOM WOLF JOHN FETTERMAN':'WOLF',
    'WOLF\\FETTERMAN':'WOLF',
     'WOLF / FETTERMAN':'WOLF',
 'WOLF/FETTERMAN':'WOLF',
 'TOM WOLF/JOHN FETTERMAN':'WOLF', 
 'TOM  WOLF':'WOLF',
 'TOM WOLF AND JOHN FETTERMAN':'WOLF',
 'TOM WOLF':'WOLF',
 'WOLF, TOM':'WOLF',
 'TOM WOLF / JOHN FETTERMAN':'WOLF',
    
    'DALE KERNS':"KERNS",
    'DALE R KEARNS, JR':"KERNS",
 'DALE R KERNS, JR':"KERNS",
 'DALE R. KERNS JR.':"KERNS",
  'DALE KERNS JR':"KERNS", 
 'DALE R. KERNS, JR':"KERNS",
 'DALE R. KERNS, JR.':"KERNS",
 'DALE R. KERNS JR':"KERNS", 
 'DALE R KERNS JR':"KERNS",
    'KERNS, JR., DALE R.':"KERNS",
    
    'ROBERT CASEY JR.':"CASEY",
     'BOB CASEY, JR':"CASEY",
 'BOB CASEY JR.':"CASEY",
 'BOB  CASEY, JR.':"CASEY",
 'BOB CASEY':"CASEY",
 'CASEY, JR., BOB':"CASEY",
 'BOB CASEY, JR.':"CASEY", 
 'BOB CASEY JR':"CASEY", 
    
    'NEAL GALE':"GALE",
 'NEAL  GALE':"GALE",
 'GALE, NEAL':"GALE",
 'NEALE GALE':"GALE"}

filtered_pa_election["candidate"] = filtered_pa_election["candidate"].map(candidate_name_changes).fillna(filtered_pa_election["candidate"])
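
The explicit dictionary above enumerates every spelling seen in the county files. As a possible alternative, sketched here and not the notebook's method, a surname search can cover most variants with less typing. `SURNAMES` and `surname_of` are hypothetical names, and misspellings such as 'DALE R KEARNS, JR' would still need an explicit entry.

```python
# Surnames of the eight candidates being validated
SURNAMES = ["BARLETTA", "KRAWCHUK", "GLOVER", "WAGNER",
            "WOLF", "KERNS", "CASEY", "GALE"]


def surname_of(raw_name):
    """Return the first known surname found in raw_name, else the cleaned input."""
    # Upper-case and collapse runs of whitespace
    cleaned = " ".join(raw_name.upper().split())
    for surname in SURNAMES:
        if surname in cleaned:
            return surname
    return cleaned


print(surname_of("Ken V. Krawchuk / K. S. Smith"))
print(surname_of("DALE R KEARNS, JR"))
```

Substring matching is more permissive than a dictionary lookup, so the unmatched values are worth printing and reviewing before relying on it.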

In [15]:
candidates_to_remove = ["NO AFFILIATION",'WRITE - IN','BLANK VOTES',
                      'WRITE-INS','WRITE IN','CAST VOTES','OVER VOTES',
                     'UNDER VOTES','WRITE IN VOTES','WRITE-IN VOTES']

parties_to_remove = ["NAF","IND"]

In [16]:
print(filtered_pa_election[filtered_pa_election["votes"].isna()])
filtered_pa_election["votes"]=filtered_pa_election["votes"].fillna(0)
print(filtered_pa_election[filtered_pa_election["votes"].isna()])

             county            precinct       office  district  \
67172      Crawford  Conneautville Boro     Governor       NaN   
67457      Crawford       Meadville 3-1     Governor       NaN   
74596     Jefferson        BROOKVILLE 2  U.S. Senate       NaN   
74605     Jefferson           MCCALMONT  U.S. Senate       NaN   
74608     Jefferson           PINECREEK  U.S. Senate       NaN   
...             ...                 ...          ...       ...   
145127  Susquehanna     Springville Twp     Governor       NaN   
145128  Susquehanna     Springville Twp     Governor       NaN   
145129  Susquehanna     Springville Twp     Governor       NaN   
145139  Susquehanna     Springville Twp     Governor       NaN   
156860   Montgomery          Green Lane  U.S. Senate       NaN   

             candidate party votes  absentee  election_day County_FIPS  \
67172           GLOVER   GRN   NaN       NaN           NaN         039   
67457           GLOVER   GRN   NaN       NaN           NaN 

In [17]:
filtered_pa_election = filtered_pa_election[~(filtered_pa_election["candidate"].isin(candidates_to_remove))]
filtered_pa_election = filtered_pa_election[~(filtered_pa_election["party"].isin(parties_to_remove))]
filtered_pa_election["party"] = filtered_pa_election["party"].fillna(filtered_pa_election["candidate"])
filtered_pa_election["candidate"] = filtered_pa_election["candidate"].fillna(filtered_pa_election["party"])
# Strip thousands separators, then cast the vote counts to integers
filtered_pa_election["votes"]=filtered_pa_election["votes"].astype(str).str.replace(',', '', regex=False).astype(int)
filtered_pa_election = filtered_pa_election.drop_duplicates()
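
The vote-cleaning steps (fill missing values, strip thousands separators, cast to integer) can be checked on a small made-up series. `raw_votes` and `clean` are hypothetical names.

```python
import pandas as pd

# Made-up values mixing formatted strings, a missing entry, and a plain integer
raw_votes = pd.Series(["1,858", "958", None, 18])

clean = (
    raw_votes.fillna(0)                  # missing counts become 0
    .astype(str)                         # uniform string form
    .str.replace(",", "", regex=False)   # drop thousands separators
    .astype(int)                         # cast to integer counts
)
print(clean.tolist())
```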

In [18]:
#Delaware seems to list duplicate precincts in its data
#print(filtered_pa_election[filtered_pa_election["County_FIPS"].isin(examine_list)])
print(filtered_pa_election[filtered_pa_election["County_FIPS"]=="019"])

#print(filtered_pa_election[filtered_pa_election["pivot_col"]=="045SWARTHMORE Northern"])

       county               precinct       office  district candidate party  \
53778  Butler  0001 ADAMS TOWNSHIP 1  U.S. Senate       NaN     CASEY   DEM   
53779  Butler  0001 ADAMS TOWNSHIP 1  U.S. Senate       NaN  BARLETTA   REP   
53780  Butler  0001 ADAMS TOWNSHIP 1  U.S. Senate       NaN      GALE   GRN   
53781  Butler  0001 ADAMS TOWNSHIP 1  U.S. Senate       NaN     KERNS   LIB   
53795  Butler  0002 ADAMS TOWNSHIP 2  U.S. Senate       NaN     CASEY   DEM   
...       ...                    ...          ...       ...       ...   ...   
55247  Butler   0088 BUTLER CITY 4-2  U.S. Senate       NaN     KERNS   LIB   
55261  Butler     0089 BUTLER CITY 5  U.S. Senate       NaN     CASEY   DEM   
55262  Butler     0089 BUTLER CITY 5  U.S. Senate       NaN  BARLETTA   REP   
55263  Butler     0089 BUTLER CITY 5  U.S. Senate       NaN      GALE   GRN   
55264  Butler     0089 BUTLER CITY 5  U.S. Senate       NaN     KERNS   LIB   

       votes  absentee  election_day County_FIPS   

In [19]:


filtered_pa_election[filtered_pa_election["County_FIPS"].isin(examine_list)].to_csv("./problem_counties_post.csv")

In [20]:
#print(len(filtered_pa_election["votes"].unique()))
vote_values = set(filtered_pa_election["votes"].unique())
print(vote_values)

# Print the smallest non-negative integer that never appears as a vote count
i = 0
while i in vote_values:
    i += 1
print(i)
        

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221,

In [21]:
filtered_pa_election["cand_col"]=filtered_pa_election["office"]+filtered_pa_election["candidate"]
#filtered_pa_election["votes"] = filtered_pa_election["votes"].str.replace(',', '').astype(float)
#filtered_pa_election["votes"]=filtered_pa_election["votes"].astype(float)

In [22]:
pd.set_option('display.max_rows', 500)
filtered_pa_election[filtered_pa_election["County_FIPS"]=="005"].to_csv("./005_pa.csv")

In [23]:
# Reshape to one row per county+precinct key and one column per office+candidate
pivoted_2018 = pd.pivot_table(filtered_pa_election, values=["votes"], index=["pivot_col"],columns=["cand_col"],aggfunc="sum")
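
A sketch of the long-to-wide reshape on four made-up rows that reuse key and candidate labels of the same form as this notebook. `long_rows` and `wide` are hypothetical names. Passing `values` as a plain string, which differs from the list used in the cell, gives a single-level column index.

```python
import pandas as pd

# Long format: one row per precinct key and candidate
long_rows = pd.DataFrame({
    "pivot_col": ["001Abbottstown"] * 2 + ["001Arendtsville"] * 2,
    "cand_col": ["GovernorWOLF", "GovernorWAGNER"] * 2,
    "votes": [120, 185, 160, 172],
})

# Wide format: one row per precinct key, one column per candidate
wide = pd.pivot_table(long_rows, values="votes", index="pivot_col",
                      columns="cand_col", aggfunc="sum")
print(wide)
```

Rows that share the same key and candidate are summed, so duplicate precinct records in the source would inflate totals without raising an error.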

In [24]:
print(pivoted_2018.head())

                         votes                                               \
cand_col        GovernorGLOVER GovernorKRAWCHUK GovernorWAGNER GovernorWOLF   
pivot_col                                                                     
001Abbottstown             2.0              2.0          185.0        120.0   
001Arendtsville            2.0              4.0          172.0        160.0   
001Bendersville            2.0              3.0           98.0         76.0   
001Berwick                 5.0              9.0          554.0        318.0   
001Biglerville             2.0              5.0          215.0        168.0   

                                                                      \
cand_col        U.S. SenateBARLETTA U.S. SenateCASEY U.S. SenateGALE   
pivot_col                                                              
001Abbottstown                183.0            120.0             2.0   
001Arendtsville               178.0            151.0             3.0   
001Bend

In [25]:
pivoted_2018.reset_index(drop=False,inplace=True)
print(pivoted_2018.head())
#pivoted_2018.drop(['level_1'], axis=1,inplace=True)
#print(pivoted_2018.head())

                pivot_col          votes                                  \
cand_col                  GovernorGLOVER GovernorKRAWCHUK GovernorWAGNER   
0          001Abbottstown            2.0              2.0          185.0   
1         001Arendtsville            2.0              4.0          172.0   
2         001Bendersville            2.0              3.0           98.0   
3              001Berwick            5.0              9.0          554.0   
4          001Biglerville            2.0              5.0          215.0   

                                                                            \
cand_col GovernorWOLF U.S. SenateBARLETTA U.S. SenateCASEY U.S. SenateGALE   
0               120.0               183.0            120.0             2.0   
1               160.0               178.0            151.0             3.0   
2                76.0               103.0             74.0             2.0   
3               318.0               575.0            289.0             5.0   

In [26]:
print(pivoted_2018.columns)

MultiIndex([('pivot_col',                    ''),
            (    'votes',      'GovernorGLOVER'),
            (    'votes',    'GovernorKRAWCHUK'),
            (    'votes',      'GovernorWAGNER'),
            (    'votes',        'GovernorWOLF'),
            (    'votes', 'U.S. SenateBARLETTA'),
            (    'votes',    'U.S. SenateCASEY'),
            (    'votes',     'U.S. SenateGALE'),
            (    'votes',    'U.S. SenateKERNS')],
           names=[None, 'cand_col'])


In [27]:
pivoted_2018.columns = pivoted_2018.columns.droplevel(0)
print(pivoted_2018.head())

cand_col                   GovernorGLOVER  GovernorKRAWCHUK  GovernorWAGNER  \
0          001Abbottstown             2.0               2.0           185.0   
1         001Arendtsville             2.0               4.0           172.0   
2         001Bendersville             2.0               3.0            98.0   
3              001Berwick             5.0               9.0           554.0   
4          001Biglerville             2.0               5.0           215.0   

cand_col  GovernorWOLF  U.S. SenateBARLETTA  U.S. SenateCASEY  \
0                120.0                183.0             120.0   
1                160.0                178.0             151.0   
2                 76.0                103.0              74.0   
3                318.0                575.0             289.0   
4                168.0                231.0             152.0   

cand_col  U.S. SenateGALE  U.S. SenateKERNS  
0                     2.0               5.0  
1                     3.0               6.

In [28]:
print(pivoted_2018.columns)

Index(['', 'GovernorGLOVER', 'GovernorKRAWCHUK', 'GovernorWAGNER',
       'GovernorWOLF', 'U.S. SenateBARLETTA', 'U.S. SenateCASEY',
       'U.S. SenateGALE', 'U.S. SenateKERNS'],
      dtype='object', name='cand_col')
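The Index printed above shows that `droplevel(0)` leaves the precinct identifier column with an empty-string name, which is why the columns are renamed by position later on. As a minimal sketch on a toy frame (the helper name `flatten_columns` and the toy values are assumptions, not part of the notebook), the two levels could instead be flattened while keeping a readable name for that column:

```python
import pandas as pd

# Toy stand-in for the pivoted frame after reset_index (illustrative values only).
toy = pd.DataFrame(
    [["001A", 2.0, 120.0], ["001B", 5.0, 318.0]],
    columns=pd.MultiIndex.from_tuples(
        [("pivot_col", ""), ("votes", "GovernorGLOVER"), ("votes", "GovernorWOLF")],
        names=[None, "cand_col"],
    ),
)

def flatten_columns(df):
    """Hypothetical helper: use the lower level where it exists, else the upper level."""
    out = df.copy()
    out.columns = [lower if lower != "" else upper for upper, lower in out.columns]
    return out

flat = flatten_columns(toy)
print(list(flat.columns))
```

With named columns, a later rename can use a dictionary keyed on the candidate labels rather than relying on column order.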


G18USSDCAS - Robert P. Casey Jr. (Democratic Party)  
G18USSRBAR - Louis J. Barletta (Republican Party)  
G18USSLKER - Dale R. Kerns (Libertarian Party)  
G18USSGGAL - Neal Taylor Gale (Green Party)  
  
G18GOVDWOL - Thomas W. Wolf (Democratic Party)  
G18GOVRWAG - Scott R. Wagner (Republican Party)  
G18GOVLKRA - Kenneth V. Krawchuk (Libertarian Party)  
G18GOVGGLO - Paul Glover (Green Party)  

In [29]:
print(pivoted_2018.shape)
print(vest_pa_18.shape)

(9165, 9)
(9160, 13)


In [30]:
pivoted_2018.columns=["cty_pct","G18GOVGGLO","G18GOVLKRA","G18GOVRWAG","G18GOVDWOL","G18USSRBAR","G18USSDCAS","G18USSGGAL","G18USSLKER"]

In [31]:
pivoted_2018 = pivoted_2018.fillna(0)

In [32]:
print(pivoted_2018.head())
pivoted_2018["county"]=pivoted_2018["cty_pct"].str[0:3]
print(pivoted_2018.head())

           cty_pct  G18GOVGGLO  G18GOVLKRA  G18GOVRWAG  G18GOVDWOL  \
0   001Abbottstown         2.0         2.0       185.0       120.0   
1  001Arendtsville         2.0         4.0       172.0       160.0   
2  001Bendersville         2.0         3.0        98.0        76.0   
3       001Berwick         5.0         9.0       554.0       318.0   
4   001Biglerville         2.0         5.0       215.0       168.0   

   G18USSRBAR  G18USSDCAS  G18USSGGAL  G18USSLKER  
0       183.0       120.0         2.0         5.0  
1       178.0       151.0         3.0         6.0  
2       103.0        74.0         2.0         1.0  
3       575.0       289.0         5.0        14.0  
4       231.0       152.0         7.0         3.0  
           cty_pct  G18GOVGGLO  G18GOVLKRA  G18GOVRWAG  G18GOVDWOL  \
0   001Abbottstown         2.0         2.0       185.0       120.0   
1  001Arendtsville         2.0         4.0       172.0       160.0   
2  001Bendersville         2.0         3.0        98.0   

In [33]:
pivoted_2018 = pivoted_2018[~(pivoted_2018["cty_pct"].str[3:]=="Total")]

In [34]:
col_list=["G18GOVGGLO","G18GOVLKRA","G18GOVRWAG","G18GOVDWOL","G18USSRBAR","G18USSDCAS","G18USSGGAL","G18USSLKER"]
for i in col_list:
    print(i)
    print(sum(vest_pa_18[i]))
    print(sum(pivoted_2018[i]))
    print("")

G18GOVGGLO
27797
26627.0

G18GOVLKRA
49238
46621.0

G18GOVRWAG
2040233
1918230.0

G18GOVDWOL
2895931
2795233.0

G18USSRBAR
2135223
2118139.0

G18USSDCAS
2792693
2783116.0

G18USSGGAL
31228
31119.0

G18USSLKER
50927
50274.0
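The loop above prints each pair of totals on separate lines. A rough sketch of the same comparison collected into one table, which makes the gaps easier to scan; `compare_totals` is a hypothetical helper and the toy frames hold made-up values:

```python
import pandas as pd

def compare_totals(vest_df, source_df, columns):
    """Hypothetical helper: statewide totals from both frames, side by side."""
    table = pd.DataFrame({
        "vest": vest_df[columns].sum(),
        "source": source_df[columns].sum(),
    })
    table["source_minus_vest"] = table["source"] - table["vest"]
    return table

# Toy frames with made-up values, only to show the shape of the result.
toy_vest = pd.DataFrame({"G18GOVDWOL": [120, 160], "G18GOVRWAG": [185, 172]})
toy_source = pd.DataFrame({"G18GOVDWOL": [120.0, 150.0], "G18GOVRWAG": [185.0, 172.0]})
totals = compare_totals(toy_vest, toy_source, ["G18GOVDWOL", "G18GOVRWAG"])
print(totals)
```

In the notebook this would presumably be called as `compare_totals(vest_pa_18, pivoted_2018, col_list)`.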



In [35]:
#005 total precinct
#019 governors races
#025 Total
#129 governors races


In [36]:
diff_counties = []

for i in col_list:
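    # Select the vote column before summing so non-numeric columns (names, geometry) are never aggregated
    diff = pivoted_2018.groupby("county")[i].sum() - vest_pa_18.groupby("COUNTYFP")[i].sum()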
    print(i)
    for val in diff[diff != 0].index.values.tolist():
        if val not in diff_counties:
            diff_counties.append(val)
    print(diff[diff != 0])
    print("")

#print(diff_counties)

G18GOVGGLO
county
019   -490.0
021      2.0
025     -2.0
039      6.0
047      4.0
129   -690.0
Name: G18GOVGGLO, dtype: float64

G18GOVLKRA
county
019   -1088.0
039       1.0
097      -1.0
129   -1529.0
Name: G18GOVLKRA, dtype: float64

G18GOVRWAG
county
015      -58.0
019   -45242.0
021       11.0
039      288.0
055     -260.0
067       -8.0
073        3.0
091     -700.0
093       -2.0
097       -9.0
115      100.0
129   -76126.0
Name: G18GOVRWAG, dtype: float64

G18GOVDWOL
county
015      -28.0
019   -32891.0
021       20.0
039      194.0
053       -1.0
067       -7.0
081        3.0
091      -30.0
093       -4.0
097       -4.0
129   -67950.0
Name: G18GOVDWOL, dtype: float64

G18USSRBAR
county
015      -61.0
033   -16852.0
067       -9.0
091     -150.0
093       -2.0
097      -11.0
115        1.0
Name: G18USSRBAR, dtype: float64

G18USSDCAS
county
015     -26.0
033   -9540.0
053      -1.0
067      -3.0
093      -3.0
097      -3.0
115      -1.0
Name: G18USSDCAS, dtype: float64

G18USS

In [37]:
print(len(diff_counties))

18


In [38]:
print(fips_file.head())
fips_name_dict=dict(zip(fips_file["FIPS County"],fips_file["County Name"]))
print(fips_name_dict)
for i in diff_counties:
    print(fips_name_dict.get(i))

             State County Name  FIPS State FIPS County unique_ID
2241  Pennsylvania       Adams          42         001     42001
2242  Pennsylvania   Allegheny          42         003     42003
2243  Pennsylvania   Armstrong          42         005     42005
2244  Pennsylvania      Beaver          42         007     42007
2245  Pennsylvania     Bedford          42         009     42009
{'001': 'Adams', '003': 'Allegheny', '005': 'Armstrong', '007': 'Beaver', '009': 'Bedford', '011': 'Berks', '013': 'Blair', '015': 'Bradford', '017': 'Bucks', '019': 'Butler', '021': 'Cambria', '023': 'Cameron', '025': 'Carbon', '027': 'Centre', '029': 'Chester', '031': 'Clarion', '033': 'Clearfield', '035': 'Clinton', '037': 'Columbia', '039': 'Crawford', '041': 'Cumberland', '043': 'Dauphin', '045': 'Delaware', '047': 'Elk', '049': 'Erie', '051': 'Fayette', '053': 'Forest', '055': 'Franklin', '057': 'Fulton', '059': 'Greene', '061': 'Huntingdon', '063': 'Indiana', '065': 'Jefferson', '067': 'Juniata',

In [39]:
filtered_pivot = pivoted_2018[pivoted_2018["county"].isin(diff_counties)]
filtered_pivot.to_csv("./pa_diff_counties.csv")

#005 - Total


In [40]:
print(vest_pa_18.head())

  STATEFP COUNTYFP   VTDST          NAME  G18USSDCAS  G18USSRBAR  G18USSLKER  \
0      42      001  000010   ABBOTTSTOWN         120         183           5   
1      42      001  000020  ARENDTSVILLE         151         178           6   
2      42      001  000030  BENDERSVILLE          74         103           1   
3      42      001  000040       BERWICK         289         575          14   
4      42      001  000050   BIGLERVILLE         152         231           3   

   G18USSGGAL  G18GOVDWOL  G18GOVRWAG  G18GOVLKRA  G18GOVGGLO  \
0           2         120         185           2           2   
1           3         160         172           4           2   
2           2          76          98           3           2   
3           5         318         554           9           5   
4           7         168         215           5           2   

                                            geometry  
0  POLYGON Z ((-76.99801 39.88359 0.00000, -76.99...  
1  POLYGON Z ((-77

In [41]:
#Combine all the data from separate files into one
li = []
for i in fips_codes:
    ref = "./raw-from-source/Census/partnership_shapefiles_19v2_"
    file_ref = ref+i+"/PVS_19_v2_vtd_"+i+".shp"
    file_prev = gp.read_file(file_ref)
    #print(file_prev.shape)
    li.append(file_prev)
shapefiles_census = pd.concat(li, axis=0, ignore_index=True)

In [44]:
#print(pivoted_2018[pivoted_2018["county"]=="001"])
pivoted_2018["cty_pct"] = pivoted_2018["cty_pct"].str.upper()
print(pivoted_2018[pivoted_2018["county"]=="001"])

                 cty_pct  G18GOVGGLO  G18GOVLKRA  G18GOVRWAG  G18GOVDWOL  \
0         001ABBOTTSTOWN         2.0         2.0       185.0       120.0   
1        001ARENDTSVILLE         2.0         4.0       172.0       160.0   
2        001BENDERSVILLE         2.0         3.0        98.0        76.0   
3             001BERWICK         5.0         9.0       554.0       318.0   
4         001BIGLERVILLE         2.0         5.0       215.0       168.0   
5        001BONNEAUVILLE         5.0         6.0       368.0       231.0   
6              001BUTLER         5.0         9.0       657.0       362.0   
7   001CARROLL VALLEY #1         5.0        13.0       592.0       331.0   
8   001CARROLL VALLEY #2         5.0        10.0       463.0       307.0   
9         001CONEWAGO #1         5.0        15.0       679.0       527.0   
10        001CONEWAGO #2         7.0         9.0       827.0       506.0   
11      001CUMBERLAND #1         2.0         5.0       422.0       533.0   
12      001C

In [45]:
shapefiles_census["join_col"] = shapefiles_census["COUNTYFP"]+shapefiles_census["NAME"]

In [46]:
# Normalize precinct-name abbreviations; regex=False makes each pattern a literal string on every pandas version
pivoted_2018["cty_pct"] = pivoted_2018["cty_pct"].str.replace(" DIST ", " DISTRICT ", regex=False)
pivoted_2018["cty_pct"] = pivoted_2018["cty_pct"].str.replace("#", "", regex=False)
pivoted_2018["cty_pct"] = pivoted_2018["cty_pct"].str.replace(" TOWNSHIP", "", regex=False)
pivoted_2018["cty_pct"] = pivoted_2018["cty_pct"].str.replace(" TWP", "", regex=False)
pivoted_2018["cty_pct"] = pivoted_2018["cty_pct"].str.replace(" BORO", "", regex=False)

In [47]:
pivoted_2018["pct"]=pivoted_2018["cty_pct"].str[3:]
print(pivoted_2018.head())
for word, initial in {" 1":" 01"," 2":" 02"," 3":" 03"," 4":" 04"," 5":" 05"," 6":" 06"," 7":" 07"," 8":" 08"," 9":" 09"}.items():
    pivoted_2018.loc[:,"pct"] = pivoted_2018.loc[:,"pct"].str.replace(word, initial)
pivoted_2018["join_col"]=pivoted_2018["county"]+pivoted_2018["pct"]

           cty_pct  G18GOVGGLO  G18GOVLKRA  G18GOVRWAG  G18GOVDWOL  \
0   001ABBOTTSTOWN         2.0         2.0       185.0       120.0   
1  001ARENDTSVILLE         2.0         4.0       172.0       160.0   
2  001BENDERSVILLE         2.0         3.0        98.0        76.0   
3       001BERWICK         5.0         9.0       554.0       318.0   
4   001BIGLERVILLE         2.0         5.0       215.0       168.0   

   G18USSRBAR  G18USSDCAS  G18USSGGAL  G18USSLKER county           pct  
0       183.0       120.0         2.0         5.0    001   ABBOTTSTOWN  
1       178.0       151.0         3.0         6.0    001  ARENDTSVILLE  
2       103.0        74.0         2.0         1.0    001  BENDERSVILLE  
3       575.0       289.0         5.0        14.0    001       BERWICK  
4       231.0       152.0         7.0         3.0    001   BIGLERVILLE  
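The zero-padding loop above replaces every occurrence of `" 1"` through `" 9"`, so it also matches the leading digit of a multi-digit number: `" 12"` would become `" 012"`. A minimal sketch of a stricter alternative, assuming the goal is to pad only standalone single digits (the helper name `pad_single_digits` is hypothetical):

```python
import re

# Match a space followed by exactly one digit that is NOT followed by another digit.
SINGLE_DIGIT = re.compile(r" (\d)(?!\d)")

def pad_single_digits(name):
    """Hypothetical helper: ' 1' -> ' 01', while ' 12' and ' 01' are left alone."""
    return SINGLE_DIGIT.sub(r" 0\1", name)

print(pad_single_digits("CARROLL VALLEY 1"))
print(pad_single_digits("HAMPTON DISTRICT 12"))
# Plain substring replacement, as in the loop above, for comparison:
print("HAMPTON DISTRICT 12".replace(" 1", " 01"))
```

The same pattern could be applied to the whole column with `.str.replace(r" (\d)(?!\d)", r" 0\1", regex=True)`. Whether the stricter rule yields more matches against the Census names would need to be checked against the merge counts.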


In [49]:
merged_source = pd.merge(pivoted_2018,shapefiles_census,how="outer",on="join_col",indicator=True)

print(merged_source["_merge"].value_counts())
right_only = merged_source[merged_source["_merge"]=="right_only"]["join_col"]
left_only = merged_source[merged_source["_merge"]=="left_only"]["join_col"]
right_only.to_csv("./shapefiles.csv")
left_only.to_csv("./elections.csv")

right_only    7727
left_only     7722
both          1479
Name: _merge, dtype: int64


In [50]:
vest_pa_18["unique_vote_id"]=vest_pa_18["G18USSDCAS"].astype(str)+vest_pa_18["G18USSRBAR"].astype(str)+vest_pa_18["G18USSLKER"].astype(str)+vest_pa_18["G18USSGGAL"].astype(str)+vest_pa_18["G18GOVDWOL"].astype(str)+vest_pa_18["G18GOVRWAG"].astype(str)
vest_pa_18["join_col_vest"]=vest_pa_18["COUNTYFP"]+vest_pa_18["NAME"]

In [51]:
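# Cast vote columns to int so the string keys below match VEST's integer columns (no trailing ".0").
# Assigning the columns directly replaces them, whereas .loc[:, i] = ... can keep the float dtype on newer pandas.
pivoted_2018[col_list] = pivoted_2018[col_list].astype(int)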
    
pivoted_2018["unique_vote_id"]=pivoted_2018["G18USSDCAS"].astype(str)+pivoted_2018["G18USSRBAR"].astype(str)+pivoted_2018["G18USSLKER"].astype(str)+pivoted_2018["G18USSGGAL"].astype(str)+pivoted_2018["G18GOVDWOL"].astype(str)+pivoted_2018["G18GOVRWAG"].astype(str)
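The `unique_vote_id` key concatenates six of the eight vote columns (it omits G18GOVLKRA and G18GOVGGLO) with no separator, so different vote counts can produce the same string. A small sketch of the ambiguity and of one way around it, assuming a separator is acceptable in the key (`vote_key` is a hypothetical helper):

```python
def vote_key(counts, sep="-"):
    """Hypothetical helper: join vote counts with a separator so digit boundaries stay visible."""
    return sep.join(str(int(c)) for c in counts)

row_a = [12, 3, 0]
row_b = [1, 23, 0]

# Without a separator both rows collapse to the same key, "1230".
print("".join(map(str, row_a)) == "".join(map(str, row_b)))  # True
# With a separator the keys stay distinct: "12-3-0" vs "1-23-0".
print(vote_key(row_a) == vote_key(row_b))  # False
```

Since the key is only used as a heuristic to recover name pairings, collisions of this kind would surface as extra rows in the merge rather than as silent errors, but a separator removes that class of false match.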

In [52]:
print(vest_pa_18.head())
print(pivoted_2018.head())

  STATEFP COUNTYFP   VTDST          NAME  G18USSDCAS  G18USSRBAR  G18USSLKER  \
0      42      001  000010   ABBOTTSTOWN         120         183           5   
1      42      001  000020  ARENDTSVILLE         151         178           6   
2      42      001  000030  BENDERSVILLE          74         103           1   
3      42      001  000040       BERWICK         289         575          14   
4      42      001  000050   BIGLERVILLE         152         231           3   

   G18USSGGAL  G18GOVDWOL  G18GOVRWAG  G18GOVLKRA  G18GOVGGLO  \
0           2         120         185           2           2   
1           3         160         172           4           2   
2           2          76          98           3           2   
3           5         318         554           9           5   
4           7         168         215           5           2   

                                            geometry   unique_vote_id  \
0  POLYGON Z ((-76.99801 39.88359 0.00000, -76.99...   

In [53]:
print(vest_pa_18["unique_vote_id"].value_counts())
print(vest_pa_18[vest_pa_18["unique_vote_id"]=="000000"])



2591002611           2
000000               2
430392042741         1
354382117363376      1
3573103554           1
                    ..
422913142391         1
20310463200107       1
84212461969101190    1
624406762740         1
4775551511519517     1
Name: unique_vote_id, Length: 9158, dtype: int64
     STATEFP COUNTYFP   VTDST       NAME  G18USSDCAS  G18USSRBAR  G18USSLKER  \
911       42      011  1578P1  ADAMSTOWN           0           0           0   
2108      42      049  999999  LAKE ERIE           0           0           0   

      G18USSGGAL  G18GOVDWOL  G18GOVRWAG  G18GOVLKRA  G18GOVGGLO  \
911            0           0           0           0           0   
2108           0           0           0           0           0   

                                               geometry unique_vote_id  \
911   POLYGON Z ((-76.04949 40.25063 0.00000, -76.04...         000000   
2108  POLYGON Z ((-80.51985 42.32713 0.00000, -80.49...         000000   

     join_col_vest  
911   01

In [54]:
dup_list = ["000000","2591002611"]
joined_vest_pa_18 = vest_pa_18[~(vest_pa_18["unique_vote_id"].isin(dup_list))]

In [55]:
vote_name_hack = pd.merge(joined_vest_pa_18,pivoted_2018,how="outer",on="unique_vote_id",indicator=True)

In [56]:
print(pivoted_2018.head())

           cty_pct  G18GOVGGLO  G18GOVLKRA  G18GOVRWAG  G18GOVDWOL  \
0   001ABBOTTSTOWN           2           2         185         120   
1  001ARENDTSVILLE           2           4         172         160   
2  001BENDERSVILLE           2           3          98          76   
3       001BERWICK           5           9         554         318   
4   001BIGLERVILLE           2           5         215         168   

   G18USSRBAR  G18USSDCAS  G18USSGGAL  G18USSLKER county           pct  \
0         183         120           2           5    001   ABBOTTSTOWN   
1         178         151           3           6    001  ARENDTSVILLE   
2         103          74           2           1    001  BENDERSVILLE   
3         575         289           5          14    001       BERWICK   
4         231         152           7           3    001   BIGLERVILLE   

          join_col   unique_vote_id  
0   001ABBOTTSTOWN   12018352120185  
1  001ARENDTSVILLE   15117863160172  
2  001BENDERSVILLE  

In [57]:
print(vote_name_hack["_merge"].value_counts())

both          8555
right_only     606
left_only      601
Name: _merge, dtype: int64


In [58]:
name_quick = vote_name_hack[vote_name_hack["_merge"]=="both"][["join_col","join_col_vest"]]
name_quick["use"] = name_quick["join_col"]==name_quick["join_col_vest"]
name_quick=name_quick[name_quick["use"]==False]

In [59]:
print(name_quick.head())

                join_col                  join_col_vest    use
7   001CARROLL VALLEY 01  001CARROLL VALLEY DISTRICT 01  False
8   001CARROLL VALLEY 02  001CARROLL VALLEY DISTRICT 02  False
9       001CUMBERLAND 02      001CUMBERLAND DISTRICT 02  False
10        001CONEWAGO 01        001CONEWAGO DISTRICT 01  False
11        001CONEWAGO 02        001CONEWAGO DISTRICT 02  False


In [63]:
rename_dict = dict(zip(name_quick["join_col"],name_quick["join_col_vest"]))

In [64]:
print(rename_dict)

{'001CARROLL VALLEY 01': '001CARROLL VALLEY DISTRICT 01', '001CARROLL VALLEY 02': '001CARROLL VALLEY DISTRICT 02', '001CUMBERLAND 02': '001CUMBERLAND DISTRICT 02', '001CONEWAGO 01': '001CONEWAGO DISTRICT 01', '001CONEWAGO 02': '001CONEWAGO DISTRICT 02', '001CUMBERLAND 04': '001CUMBERLAND DISTRICT 04', '001GETTYSBURG 02': '001GETTYSBURG WARD 02', '001MT JOY 01': '001MOUNT JOY DISTRICT 01', '001GETTYSBURG 01': '001GETTYSBURG WARD 01 PRECINCT 01', '001GETTYSBURG 03': '001GETTYSBURG WARD 03 PRECINCT 01', '001HUNTINGTON': '001HUNTINGTON DISTRICT 01', '001LITTLESTOWN 01': '001LITTLESTOWN WARD 01', '001LITTLESTOWN 02': '001LITTLESTOWN WARD 02', '001MCSHERRYSTOWN 01': '001MCSHERRYSTOWN DISTRICT 01', '001MCSHERRYSTOWN 02': '001MCSHERRYSTOWN DISTRICT 02', '001MT JOY 02': '001MOUNT JOY DISTRICT 02', '001MT PLEASANT 01': '001MOUNT PLEASANT DISTRICT 01', '001MT PLEASANT 02': '001MOUNT PLEASANT DISTRICT 02', '001OXFORD 01': '001OXFORD DISTRICT 01', '001OXFORD 02': '001OXFORD DISTRICT 02', '001READIN

In [65]:
print(pivoted_2018.head())

           cty_pct  G18GOVGGLO  G18GOVLKRA  G18GOVRWAG  G18GOVDWOL  \
0   001ABBOTTSTOWN           2           2         185         120   
1  001ARENDTSVILLE           2           4         172         160   
2  001BENDERSVILLE           2           3          98          76   
3       001BERWICK           5           9         554         318   
4   001BIGLERVILLE           2           5         215         168   

   G18USSRBAR  G18USSDCAS  G18USSGGAL  G18USSLKER county           pct  \
0         183         120           2           5    001   ABBOTTSTOWN   
1         178         151           3           6    001  ARENDTSVILLE   
2         103          74           2           1    001  BENDERSVILLE   
3         575         289           5          14    001       BERWICK   
4         231         152           7           3    001   BIGLERVILLE   

          join_col   unique_vote_id  
0   001ABBOTTSTOWN   12018352120185  
1  001ARENDTSVILLE   15117863160172  
2  001BENDERSVILLE  

In [66]:
pivoted_2018["join_col"]=pivoted_2018["join_col"].map(rename_dict).fillna(pivoted_2018["join_col"])

In [73]:
vest_pa_18.rename(columns={"join_col_vest":"join_col"},inplace=True)
print(vest_pa_18.head())

  STATEFP COUNTYFP   VTDST          NAME  G18USSDCAS  G18USSRBAR  G18USSLKER  \
0      42      001  000010   ABBOTTSTOWN         120         183           5   
1      42      001  000020  ARENDTSVILLE         151         178           6   
2      42      001  000030  BENDERSVILLE          74         103           1   
3      42      001  000040       BERWICK         289         575          14   
4      42      001  000050   BIGLERVILLE         152         231           3   

   G18USSGGAL  G18GOVDWOL  G18GOVRWAG  G18GOVLKRA  G18GOVGGLO  \
0           2         120         185           2           2   
1           3         160         172           4           2   
2           2          76          98           3           2   
3           5         318         554           9           5   
4           7         168         215           5           2   

                                            geometry   unique_vote_id  \
0  POLYGON Z ((-76.99801 39.88359 0.00000, -76.99...   

In [111]:
print(shapefiles_census["join_col"].value_counts())

107PINE GROVE PRECINCT 01             2
035BEECH CREEK                        2
009HOPEWELL                           2
079NESCOPECK                          2
107PINE GROVE PRECINCT 02             2
                                     ..
003HAMPTON DISTRICT 12                1
043HARRISBURG WARD 09 DISTRICT 01     1
003PITTSBURGH WARD 08 DISTRICT 11     1
077BETHLEHEM WARD 10                  1
129EAST HUNTINGDON DISTRICT WHITES    1
Name: join_col, Length: 9121, dtype: int64


In [97]:
merged_source = pd.merge(pivoted_2018,shapefiles_census,how="outer",on="join_col",indicator=True)

print(merged_source["_merge"].value_counts())
right_only = merged_source[merged_source["_merge"]=="right_only"]["join_col"]
left_only = merged_source[merged_source["_merge"]=="left_only"]["join_col"]
right_only.to_csv("./shapefiles.csv")
left_only.to_csv("./elections.csv")

both          8199
right_only    1073
left_only     1057
Name: _merge, dtype: int64


In [69]:
print(vest_pa_18.head())

  STATEFP COUNTYFP   VTDST          NAME  G18USSDCAS  G18USSRBAR  G18USSLKER  \
0      42      001  000010   ABBOTTSTOWN         120         183           5   
1      42      001  000020  ARENDTSVILLE         151         178           6   
2      42      001  000030  BENDERSVILLE          74         103           1   
3      42      001  000040       BERWICK         289         575          14   
4      42      001  000050   BIGLERVILLE         152         231           3   

   G18USSGGAL  G18GOVDWOL  G18GOVRWAG  G18GOVLKRA  G18GOVGGLO  \
0           2         120         185           2           2   
1           3         160         172           4           2   
2           2          76          98           3           2   
3           5         318         554           9           5   
4           7         168         215           5           2   

                                            geometry   unique_vote_id  \
0  POLYGON Z ((-76.99801 39.88359 0.00000, -76.99...   

In [71]:
print(merged_source.columns)

Index(['cty_pct', 'G18GOVGGLO', 'G18GOVLKRA', 'G18GOVRWAG', 'G18GOVDWOL',
       'G18USSRBAR', 'G18USSDCAS', 'G18USSGGAL', 'G18USSLKER', 'county', 'pct',
       'join_col', 'unique_vote_id', 'STATEFP', 'COUNTYFP', 'VTDST',
       'NAMELSAD', 'VTDI', 'LSAD', 'CHNG_TYPE', 'ORIG_NAME', 'ORIG_CODE',
       'RELATE', 'NAME', 'VINTAGE', 'FUNCSTAT', 'JUSTIFY', 'MTFCC', 'geometry',
       '_merge'],
      dtype='object')


In [91]:
name_check = pd.merge(merged_source[merged_source["_merge"]=="both"],vest_pa_18,how="outer",on="join_col",indicator="merge")

print(name_check["merge"].value_counts())
right_only = name_check[name_check["merge"]=="right_only"]
left_only = name_check[name_check["merge"]=="left_only"]
both = name_check[name_check["merge"]=="both"]

both          8389
right_only    1058
left_only        0
Name: merge, dtype: int64


In [92]:
print(name_check.columns)

Index(['cty_pct', 'G18GOVGGLO_x', 'G18GOVLKRA_x', 'G18GOVRWAG_x',
       'G18GOVDWOL_x', 'G18USSRBAR_x', 'G18USSDCAS_x', 'G18USSGGAL_x',
       'G18USSLKER_x', 'county', 'pct', 'join_col', 'unique_vote_id_x',
       'STATEFP_x', 'COUNTYFP_x', 'VTDST_x', 'NAMELSAD', 'VTDI', 'LSAD',
       'CHNG_TYPE', 'ORIG_NAME', 'ORIG_CODE', 'RELATE', 'NAME_x', 'VINTAGE',
       'FUNCSTAT', 'JUSTIFY', 'MTFCC', 'geometry_x', '_merge', 'STATEFP_y',
       'COUNTYFP_y', 'VTDST_y', 'NAME_y', 'G18USSDCAS_y', 'G18USSRBAR_y',
       'G18USSLKER_y', 'G18USSGGAL_y', 'G18GOVDWOL_y', 'G18GOVRWAG_y',
       'G18GOVLKRA_y', 'G18GOVGGLO_y', 'geometry_y', 'unique_vote_id_y',
       'merge'],
      dtype='object')


In [93]:
print(left_only["join_col"].str[0:3].value_counts())

Series([], Name: join_col, dtype: int64)


In [86]:
print(both["COUNTYFP_y"].value_counts())
print(left_only["county"].value_counts())

129    304
019     89
033     72
097     57
119     18
091      8
067      8
093      5
025      2
095      2
055      2
101      2
013      2
015      1
049      1
011      1
079      1
073      1
053      1
047      1
021      1
081      1
Name: COUNTYFP_y, dtype: int64
Series([], Name: county, dtype: int64)


In [90]:
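def validater_row (df, column_List):
    """Compare each "<col>_x" / "<col>_y" pair row by row and print a summary of the differences.

    df is expected to be the result of a merge, with a "join_col" column identifying each row.
    """
    matching_rows = 0
    different_rows = 0
    diff_list=[]
    diff_values = []
    max_diff = 0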
    
    for j in range(0,len(df.index)):
        same = True
        for i in column_List:
            left_Data = i + "_x"
            right_Data = i + "_y"
            diff = abs(df.iloc[j][left_Data]-df.iloc[j][right_Data])
            if(diff != 0):
                #print(df.iloc[j]['countypct'])
                #print(i)
                diff_values.append(abs(diff))
                same = False
                if(np.isnan(diff)):
                    print("NaN value at diff is: ", df.iloc[j]['join_col'])
                    print(df.iloc[j][left_Data])
                    print(df.iloc[j][right_Data])
                if (diff>max_diff):
                    max_diff = diff
                    print("New max diff is: ", str(max_diff))
                    print(df.iloc[j]['join_col'])
        if(same != True):
            different_rows +=1
            diff_list.append(df.iloc[j]['join_col'])
        else:
            matching_rows +=1
    print("There are ", len(df.index)," total rows")
    print(different_rows," of these rows have election result differences")
    print(matching_rows," of these rows are the same")
    print("")
    print("The max difference between any one shared column in a row is: ", max_diff)
    if(len(diff_values)!=0):
        print("The average difference is: ", str(sum(diff_values)/len(diff_values)))
    count_big_diff = len([i for i in diff_values if i > 10])
    print("There are ", str(count_big_diff), "precinct results with a difference greater than 10")
    diff_list.sort()
    print(diff_list)
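The function above walks the frame with `df.iloc[j]` for every row and column, which is slow on several thousand rows. A minimal vectorized sketch of the same `_x` / `_y` comparison on a toy frame; `compare_columns` is a hypothetical helper, the values are made up, and it returns the differences rather than printing a summary:

```python
import pandas as pd

def compare_columns(df, columns, key="join_col"):
    """Hypothetical helper: absolute _x/_y differences for every row, computed column-wise."""
    diffs = pd.DataFrame(
        {c: (df[c + "_x"] - df[c + "_y"]).abs() for c in columns}
    )
    diffs.insert(0, key, df[key].values)
    return diffs

toy = pd.DataFrame({
    "join_col": ["001A", "001B", "001C"],
    "G18GOVDWOL_x": [120, 160, 76],
    "G18GOVDWOL_y": [120, 150, 76],
    "G18GOVRWAG_x": [185, 172, 98],
    "G18GOVRWAG_y": [185, 172, 110],
})
diffs = compare_columns(toy, ["G18GOVDWOL", "G18GOVRWAG"])

# Rows where at least one shared column differs.
rows_with_diff = diffs[(diffs.drop(columns="join_col") != 0).any(axis=1)]
print(rows_with_diff["join_col"].tolist())
```

The row counts, maximum, and mean reported by `validater_row` can then be read off `diffs` with ordinary pandas reductions.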

In [96]:
validater_row(both,col_list)

New max diff is:  3.0
003BALDWIN DISTRICT 01
New max diff is:  77.0
003BALDWIN DISTRICT 01
New max diff is:  133.0
003BALDWIN DISTRICT 01
New max diff is:  172.0
003BALDWIN DISTRICT 02
New max diff is:  364.0
007DARLINGTON
New max diff is:  379.0
007DARLINGTON
New max diff is:  558.0
009HOPEWELL
New max diff is:  657.0
039WOODCOCK
There are  8389  total rows
220  of these rows have election result differences
8169  of these rows are the same

The max difference between any one shared column in a row is:  657.0
The average difference is:  68.70584192439863
There are  737 precinct results with a difference greater than 10
['003BALDWIN DISTRICT 01', '003BALDWIN DISTRICT 01', '003BALDWIN DISTRICT 01', '003BALDWIN DISTRICT 01', '003BALDWIN DISTRICT 02', '003BALDWIN DISTRICT 02', '003BALDWIN DISTRICT 02', '003BALDWIN DISTRICT 02', '007DARLINGTON', '007DARLINGTON', '007DARLINGTON', '007DARLINGTON', '009HOPEWELL', '009HOPEWELL', '009HOPEWELL', '009HOPEWELL', '009WOODBURY', '009WOODBURY', '009W

In [105]:
count_vals = vest_pa_18["join_col"].value_counts().to_frame()

In [108]:
print(len(count_vals[count_vals["join_col"]==2]))

52
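The filter `count_vals["join_col"]==2` relies on `value_counts().to_frame()` naming its column after the Series. That holds on the pandas version used here, but from pandas 2.0 the column is named `count`. A sketch that avoids the dependency by working on the counts Series directly (toy names only):

```python
import pandas as pd

toy = pd.Series(
    ["009HOPEWELL", "009HOPEWELL", "035BEECH CREEK", "001BERWICK"],
    name="join_col",
)

counts = toy.value_counts()            # Series indexed by name, on any pandas version
n_twice = int((counts == 2).sum())     # number of names that appear exactly twice
dupe_rows = toy[toy.duplicated(keep=False)]  # every row whose name is repeated

print(n_twice)
print(dupe_rows.tolist())
```

Duplicate `join_col` values matter here because each one multiplies rows in the merges above, which is consistent with `name_check` reporting more `both` rows than `merged_source` did.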
