## Filter County Data Set

The master datasets can be found at https://drive.google.com/drive/u/2/folders/1Ag39uw9kKCKqlEOUkCxcezAutJI6mWyC

The master datasets are not present in the GitHub repository for our project.

This notebook combines two master data sources:

* A dataset containing presidential election returns, by county, for elections from 2000 through 2020
* A dataset containing the counties that make up the thirteen television markets of interest for our project

This notebook produces a dataset restricted to the 2016 election and to counties in the regional markets of interest.
Each row in the dataset contains one county's votes for the Democratic candidate, the Republican candidate and all other candidates, plus the total votes cast.
Each row is also tagged with the relevant television market.

In [1]:
# Source datasets are located at:
# https://drive.google.com/drive/u/2/folders/1Ag39uw9kKCKqlEOUkCxcezAutJI6mWyC

import pandas as pd

results_directory = "../../datasets/MIT_Election_Lab/"

votes_df = pd.read_csv(results_directory + "countypres_2000-2020.csv",
                       dtype={'county_fips': str})

# FIPS codes are 5-digit strings with leading zeros
# Cast vote counts to ints rather than floats

votes_df['county_fips'] = votes_df['county_fips'].str.zfill(5)
votes_df['candidatevotes'] = votes_df['candidatevotes'].fillna(0).astype(int)

region_df = pd.read_csv(results_directory + "region_county_makeup.csv")
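
The FIPS handling above can be illustrated on a tiny in-memory sample. This is only a sketch: `sample_csv` and `sample_df` are hypothetical names, and the two rows stand in for the much wider source file.

```python
import io

import pandas as pd

# Hypothetical two-row sample; the real file has many more columns.
sample_csv = io.StringIO(
    "county_fips,candidatevotes\n"
    "4001,17083\n"
    "56031,\n"
)

# Reading the code as a string keeps pandas from treating it as a number.
sample_df = pd.read_csv(sample_csv, dtype={"county_fips": str})

# Pad to five characters so a code such as "4001" becomes "04001".
sample_df["county_fips"] = sample_df["county_fips"].str.zfill(5)

# A missing vote count is read as NaN (a float); fill it before casting.
sample_df["candidatevotes"] = sample_df["candidatevotes"].fillna(0).astype(int)

print(sample_df)
```

Note that filling missing counts with 0 treats "not reported" the same as "no votes", which may or may not be appropriate for a given analysis.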

In [2]:
votes_df.columns

Index(['year', 'state', 'state_po', 'county_name', 'county_fips', 'office',
       'candidate', 'party', 'candidatevotes', 'totalvotes', 'version',
       'mode'],
      dtype='object')

In [3]:
# Limit votes data to the 2016 results
# Remove the year, office and version columns, which are uniform across the dataset
# Remove the state_po column, which provides the same information as state
# Remove the totalvotes column from the source dataset; we derive our own below
# Remove the mode column, which we do not use

votes_df = votes_df[votes_df.year == 2016]
votes_df = votes_df.drop(
    columns=['year', 'state_po', 'office', 'totalvotes', 'version', 'mode'])

votes_df.columns

Index(['state', 'county_name', 'county_fips', 'candidate', 'party',
       'candidatevotes'],
      dtype='object')

In [4]:
votes_df.dtypes

state             object
county_name       object
county_fips       object
candidate         object
party             object
candidatevotes     int64
dtype: object

In [5]:
region_df.columns

Index(['region_id', 'state', 'county_name'], dtype='object')

In [6]:
region_df

Unnamed: 0,region_id,state,county_name
0,tampa_region,FLORIDA,CITRUS
1,tampa_region,FLORIDA,HARDEE
2,tampa_region,FLORIDA,HERNANDO
3,tampa_region,FLORIDA,HIGHLANDS
4,tampa_region,FLORIDA,HILLSBOROUGH
...,...,...,...
269,new_york_city_region,NEW JERSEY,SOMERSET
270,new_york_city_region,NEW JERSEY,SUSSEX
271,new_york_city_region,NEW JERSEY,UNION
272,new_york_city_region,NEW JERSEY,WARREN


In [7]:
# Merging with the region dataframe maps each county to its market region.
# By using an "inner" merge we also eliminate all counties
# that are not within the scope of our project.

merged = pd.merge(votes_df, region_df, on=['state', 'county_name'], how='inner')
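
Because the join key is a pair of text columns, an inner merge silently drops any region county whose name is spelled differently in the votes data. One possible check, sketched here on made-up frames (`votes_sample`, `region_sample` and all state, county and region names are hypothetical), is an outer merge with `indicator=True`:

```python
import pandas as pd

# Hypothetical stand-ins for votes_df and region_df.
votes_sample = pd.DataFrame({
    "state": ["STATE_X", "STATE_X", "STATE_Y"],
    "county_name": ["COUNTY_A", "COUNTY_B", "COUNTY_C"],
    "candidatevotes": [100, 200, 300],
})
region_sample = pd.DataFrame({
    "region_id": ["region_1", "region_1", "region_1"],
    "state": ["STATE_X", "STATE_X", "STATE_X"],
    "county_name": ["COUNTY_A", "COUNTY_B", "COUNTY_D"],
})

# The outer merge keeps every row from both sides; the _merge column
# records whether a row matched ("both") or came from one side only.
check = pd.merge(votes_sample, region_sample,
                 on=["state", "county_name"],
                 how="outer", indicator=True)

# Region counties with no matching votes rows would vanish in the inner merge.
missing_from_votes = check[check["_merge"] == "right_only"]
print(missing_from_votes[["state", "county_name", "region_id"]])
```

An empty `missing_from_votes` would suggest that every county listed for a region found a match.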

In [8]:
merged

Unnamed: 0,state,county_name,county_fips,candidate,party,candidatevotes,region_id
0,ARIZONA,APACHE,04001,HILLARY CLINTON,DEMOCRAT,17083,phoenix_region
1,ARIZONA,APACHE,04001,DONALD TRUMP,REPUBLICAN,8240,phoenix_region
2,ARIZONA,APACHE,04001,OTHER,OTHER,2338,phoenix_region
3,ARIZONA,COCONINO,04005,HILLARY CLINTON,DEMOCRAT,32404,phoenix_region
4,ARIZONA,COCONINO,04005,DONALD TRUMP,REPUBLICAN,21108,phoenix_region
...,...,...,...,...,...,...,...
799,WYOMING,NIOBRARA,56027,DONALD TRUMP,REPUBLICAN,1116,denver_region
800,WYOMING,NIOBRARA,56027,OTHER,OTHER,83,denver_region
801,WYOMING,PLATTE,56031,HILLARY CLINTON,DEMOCRAT,719,denver_region
802,WYOMING,PLATTE,56031,DONALD TRUMP,REPUBLICAN,3437,denver_region


In [9]:
merged.region_id.value_counts()

denver_region           174
washington_dc_region    123
new_york_city_region     87
raleigh_region           69
cedar_rapids_region      63
philadelphia_region      54
cleveland_region         51
boston_region            48
phoenix_region           33
san_francisco_region     33
tampa_region             30
milwaukee_region         30
las_vegas_region          9
Name: region_id, dtype: int64
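
The counts above are rows, not counties, since each county contributes one row per party. A small sketch of counting distinct counties per region instead, using a hypothetical `rows` frame with invented region names and FIPS codes:

```python
import pandas as pd

# Hypothetical long-format rows: three rows per county, as in the merged data.
rows = pd.DataFrame({
    "region_id": ["region_a"] * 6 + ["region_b"] * 3,
    "county_fips": ["00001"] * 3 + ["00002"] * 3 + ["00003"] * 3,
})

# nunique counts each county once, however many party rows it has.
counties_per_region = rows.groupby("region_id")["county_fips"].nunique()
print(counties_per_region)
```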

In [10]:
# The votes dataset is in long format with three rows for every county in
# each election: one for Republican votes, one for Democratic votes
# and one for all others.  We reshape the dataset to wide format, with
# three vote columns in every row instead of three rows per county
# as in the source dataset.

pivot_index = ['state',
               'county_name',
               'county_fips', 
               'region_id']

pivot = merged.pivot(columns = 'party', 
                     index = pivot_index, 
                     values = 'candidatevotes')

# The pivot creates a Pandas multi index as a side effect.
# We reset the index back to a simple range and
# restore the columns used for the pivot index back
# to their original status as columns.

pivot = pivot.reset_index()

# Convert column names to lower case
pivot.columns = pivot.columns.str.lower()

# Derive the totalvotes column instead of relying on the source
# data as that data had a few mistakes in derivation.
pivot['totalvotes'] = pivot.democrat + pivot.republican + pivot.other
pivot.totalvotes = pivot.totalvotes.astype(int)
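
The long-to-wide reshape can be seen on a single county. The sketch below reuses the three APACHE rows shown in the merged output; `long_df` and `wide_df` are hypothetical names, and passing a list as `index` assumes a pandas version that supports it (1.1 or later).

```python
import pandas as pd

# Three long-format rows for one county, taken from the merged output above.
long_df = pd.DataFrame({
    "state": ["ARIZONA"] * 3,
    "county_name": ["APACHE"] * 3,
    "county_fips": ["04001"] * 3,
    "region_id": ["phoenix_region"] * 3,
    "party": ["DEMOCRAT", "REPUBLICAN", "OTHER"],
    "candidatevotes": [17083, 8240, 2338],
})

# One output row per county; each party value becomes its own column.
wide_df = long_df.pivot(
    index=["state", "county_name", "county_fips", "region_id"],
    columns="party",
    values="candidatevotes",
).reset_index()

wide_df.columns = wide_df.columns.str.lower()

# Derive the total from the three party columns.
wide_df["totalvotes"] = wide_df[["democrat", "republican", "other"]].sum(axis=1)

print(wide_df)
```

`pivot` raises an error if any index and party combination appears more than once, so a successful pivot also confirms that each county has at most one row per party.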

In [11]:
pivot.columns

Index(['state', 'county_name', 'county_fips', 'region_id', 'democrat', 'other',
       'republican', 'totalvotes'],
      dtype='object', name='party')

In [12]:
# Reorder the columns a little bit

columns = ['state', 
           'county_name', 
           'county_fips', 
           'region_id', 
           'democrat', 
           'republican',
           'other',
           'totalvotes']

pivot = pivot[columns]

pivot.columns

Index(['state', 'county_name', 'county_fips', 'region_id', 'democrat',
       'republican', 'other', 'totalvotes'],
      dtype='object', name='party')
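
The `name='party'` in the output above is a leftover from the pivot: the column index still carries the name of the pivoted column. It is harmless, but it can be cleared if a plain column index is preferred. A minimal sketch on a hypothetical `demo` frame:

```python
import pandas as pd

# Hypothetical frame whose column index is named "party", as after a pivot.
demo = pd.DataFrame({"democrat": [1], "republican": [2]})
demo.columns.name = "party"

# Passing None as the new axis name removes the label from the column index.
demo = demo.rename_axis(None, axis="columns")
print(demo.columns)
```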

In [13]:
pivot.dtypes

party
state          object
county_name    object
county_fips    object
region_id      object
democrat        int64
republican      int64
other           int64
totalvotes      int64
dtype: object

In [15]:
pivot.to_csv(
    results_directory + "enriched_county_pres_2016.gz",
    compression="gzip",
    index=False,
)
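
When the output file is read back, the FIPS column needs the same string treatment as on input, otherwise the leading zeros are lost again. The sketch below round-trips a one-row frame through a temporary directory; `out_df`, `round_trip` and the file name `example_output.gz` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical one-row stand-in for the pivoted output.
out_df = pd.DataFrame({"county_fips": ["04001"], "democrat": [17083]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_output.gz")
    out_df.to_csv(path, compression="gzip", index=False)

    # Without the dtype argument, "04001" would be parsed as the integer 4001.
    round_trip = pd.read_csv(path, compression="gzip",
                             dtype={"county_fips": str})

print(round_trip)
```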