## Goal
In this notebook, I will clean and preprocess the data, then feed it to some basic machine learning models, with a goal of predicting the number of days until a given university acts (relative to the first university's action in that category). So, if I put the features corresponding to WashU in my model, I may get 3 as an output, which would mean WashU is predicted to impose covid restrictions 3 days after the first mover.

**Note (important):** There are many problems in my input data. First, there's not enough of it (only 51 samples) due to the difficulty of collecting the data. There are also likely not enough features: a large one I'm missing is the incidence rate of COVID in each zip code during the period when decisions were made, and including it might make the model more accurate. Second, the observations are interconnected, meaning the iid assumption is violated. In practice this assumption is almost always violated to some degree, but here it is especially problematic, as universities likely base their decisions on what other, similar universities do. So, this model is flawed. However, it addresses a novel problem I thought about and could possibly possess some small degree of predictive power. Whatever the outcome, it will have been good practice, especially with getting the data into a workable form.

Notes for running:
- You can change ```census_vars``` to include any variables from the Census Bureau's ACS 5-year data.

In [577]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

## Preparing the Data

In [578]:
covid_dates = pd.read_excel("cleaned_university_covid_dates.xlsx")

In [579]:
covid_dates.head()

Unnamed: 0,displayName,Unofficial Ranking,rankingDisplayRank,state,city,zip,Ivy,institution type,Date of Spring 2020 Move Online (first action to move classes online and tell students not to return to campus (some require it); many acted later to extend move to end of semester;),Date of Vaccine Requirement for Students in FL2021,Date of Booster Requirement,Date of Spring 2022 Move Online/Delay,description
0,Princeton University,1,#1,NJ,Princeton,8544,True,Private,2020-03-11,2021-04-20,2021-12-16,2021-12-27,The ivy-covered campus of Princeton University...
1,Columbia University,2,#2,NY,New York,10027,True,Private,2020-03-12,2021-04-19,2021-12-16,2021-12-22,Columbia University has three undergraduate sc...
2,Harvard University,3,#2,MA,Cambridge,2138,True,Private,2020-03-10,2021-05-05,2021-12-16,NaT,Harvard University is a private institution in...
3,Massachusetts Institute of Technology,4,#2,MA,Cambridge,2139,False,Private,2020-03-10,2021-04-30,2021-12-13,NaT,Though the Massachusetts Institute of Technolo...
4,Yale University,5,#5,CT,New Haven,6520,True,Private,2020-03-10,2021-04-19,2021-12-17,2021-12-22,"Yale University, located in New Haven, Connect..."


First, I'll shorten the column titles; they are a bit too long. I'm also not using the descriptions from USNews, so I can drop that column, as well as the display rank, display name, and city (we'll just use state and zip here).

In [580]:
covid_dates = covid_dates.rename(columns=
                       {"Date of Spring 2020 Move Online (first action to move classes online and tell students not to return to campus (some require it); many acted later to extend move to end of semester;)": "Spring2020",
                        "Date of Vaccine Requirement for Students in FL2021": "FirstVaccine",
                        "Date of Booster Requirement": "Booster",
                        "Date of Spring 2022 Move Online/Delay": "Spring2022"}).drop(columns=["description", "displayName", "rankingDisplayRank", "city"])

In [581]:
covid_dates.head()

Unnamed: 0,Unofficial Ranking,state,zip,Ivy,institution type,Spring2020,FirstVaccine,Booster,Spring2022
0,1,NJ,8544,True,Private,2020-03-11,2021-04-20,2021-12-16,2021-12-27
1,2,NY,10027,True,Private,2020-03-12,2021-04-19,2021-12-16,2021-12-22
2,3,MA,2138,True,Private,2020-03-10,2021-05-05,2021-12-16,NaT
3,4,MA,2139,False,Private,2020-03-10,2021-04-30,2021-12-13,NaT
4,5,CT,6520,True,Private,2020-03-10,2021-04-19,2021-12-17,2021-12-22


Now, instead of using dates, I will use days after the first university acted in each category.

In [582]:
covid_dates_only_d = covid_dates[['Spring2020', 'FirstVaccine', 'Booster', 'Spring2022']]
first_dates = covid_dates_only_d.min()
date_diff = covid_dates_only_d - first_dates
covid_dates[['Spring2020', 'FirstVaccine', 'Booster', 'Spring2022']] = date_diff.apply(lambda x: x.dt.days)
covid_dates.head()

Unnamed: 0,Unofficial Ranking,state,zip,Ivy,institution type,Spring2020,FirstVaccine,Booster,Spring2022
0,1,NJ,8544,True,Private,5,18.0,10.0,11.0
1,2,NY,10027,True,Private,6,17.0,10.0,6.0
2,3,MA,2138,True,Private,4,33.0,10.0,
3,4,MA,2139,False,Private,4,28.0,7.0,
4,5,CT,6520,True,Private,4,17.0,11.0,6.0
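The date-to-days conversion above can be checked on a tiny example. This is a minimal sketch on hypothetical toy dates (not the real dataset); it assumes, as in the cell above, that each column's minimum is the first mover and that missing dates are `NaT`.

```python
import pandas as pd

# Toy data: hypothetical dates, one university never acted in Spring 2022
toy = pd.DataFrame({
    "Spring2020": pd.to_datetime(["2020-03-11", "2020-03-06", "2020-03-10"]),
    "Spring2022": pd.to_datetime(["2021-12-27", "2021-12-16", None]),
})

first = toy.min()  # earliest date per column; NaT is skipped
# Subtracting a Series aligns it with the columns, giving timedeltas per cell
days_after = (toy - first).apply(lambda col: col.dt.days)
print(days_after)
```

The first mover in each column gets 0, and a university that never acted keeps a missing value, which is why the imputation step is needed afterwards.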


There are two different types of dates here: those that cause a shift to online learning and those that mandate vaccinations. I will average the days for both these categories and create a new column signifying 0 for the former type of date (move online) and 1 for the latter (vaccination).

Note that there are some null values in the data, which signify that a university did not act on that specific outcome. This is problematic: if we drop them or impute a typical value, we lose valuable information, since perhaps a certain combination of features leads schools not to act, and that would then be ignored in our data (see the section [exploring a different task](#exploring-a-different-task) below for more). So, I will treat not acting as taking twice the largest observed number of days in that category (the factor of 2 is an arbitrary choice). Thus the predicted outcome will be much higher if a university did not act. I will impute the values below.

In [583]:
arb_constant = 2
max_dates_online = covid_dates[['Spring2020', 'FirstVaccine', 'Booster', 'Spring2022']].max() * arb_constant
covid_dates_imputed = covid_dates.fillna(max_dates_online)
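To make the per-column behaviour of this imputation concrete, here is a minimal sketch on hypothetical toy values. It relies on the fact that `fillna` with a Series fills each column with the value stored under that column's name; the name `penalty` is mine, not from the notebook.

```python
import numpy as np
import pandas as pd

# Toy data: the third university never acted in Spring 2022
toy = pd.DataFrame({"Spring2020": [5, 6, 4], "Spring2022": [11.0, 6.0, np.nan]})

penalty = toy.max() * 2            # per-column "did not act" value: twice the slowest observed
toy_imputed = toy.fillna(penalty)  # each column is filled with its own penalty
print(toy_imputed)
```

Columns without missing values are left unchanged, and columns that do not appear in the fill Series are not touched at all.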

To skip the imputation, uncomment the line below (it overwrites the imputed frame with the original one).

In [584]:
# covid_dates_imputed = covid_dates

In [585]:
# row-wise means skip NaN by default, so a university with one missing date keeps its other date
covid_dates_imputed['online'] = covid_dates_imputed[['Spring2020', 'Spring2022']].mean(axis=1)
covid_dates_imputed['vaccine'] = covid_dates_imputed[['FirstVaccine', 'Booster']].mean(axis=1)
covid_dates_cleaned = covid_dates_imputed.drop(columns=['Spring2020', 'FirstVaccine', 'Booster', 'Spring2022'])

Now, I'll separate each university into two rows, one for online and one for vaccine.

In [586]:
covid_dates_cleaned.rename(columns={'Unofficial Ranking': 'ranking', 'Ivy': 'ivy', 'institution type': 'institution_type'}, 
                           inplace=True)

In [587]:
covid_dates_cleaned = covid_dates_cleaned.melt(id_vars=['ranking', 'state', 'zip', 'ivy', 'institution_type'], var_name="decision_type", value_name="days_after_first")
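As a small illustration of what `melt` does here, the sketch below reshapes two universities (using the averaged values for the first two rows of this notebook, with only two id columns kept for brevity) from one row per university to one row per university and decision type.

```python
import pandas as pd

wide = pd.DataFrame({
    "ranking": [1, 2],
    "state": ["NJ", "NY"],
    "online": [8.0, 6.0],
    "vaccine": [14.0, 13.5],
})

# Every column not listed in id_vars becomes a (decision_type, days_after_first) pair
long = wide.melt(id_vars=["ranking", "state"],
                 var_name="decision_type",
                 value_name="days_after_first")
print(long)
```

The result has twice as many rows as the input: all `online` rows first, followed by all `vaccine` rows.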

Finally, I'll bin ranking into groups of 10, i.e., rankings 1-10 are encoded as 0, rankings 11-20 as 1, etc. (note that 51 is folded into the last group and encoded as 4). This makes more sense, as individual rankings don't matter as much as the general tier.

In [588]:
covid_dates_cleaned['ranking'] = (covid_dates_cleaned['ranking'] - 1)//10
covid_dates_cleaned['ranking'] = covid_dates_cleaned['ranking'].where(covid_dates_cleaned["ranking"] != 5, 4)
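The binning rule can be verified on a few boundary values. This sketch uses `clip(upper=4)` as an alternative to the `where` call above; that substitution is my own and assumes rankings never exceed 60, so that 5 is the only bin that needs folding.

```python
import pandas as pd

ranking = pd.Series([1, 10, 11, 20, 50, 51])

# Integer-divide into groups of 10, then fold the lone rank 51 into the last group
binned = ((ranking - 1) // 10).clip(upper=4)
print(binned.tolist())
```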

In [589]:
covid_dates_cleaned.head()

Unnamed: 0,ranking,state,zip,ivy,institution_type,decision_type,days_after_first
0,0,NJ,8544,True,Private,online,8.0
1,0,NY,10027,True,Private,online,6.0
2,0,MA,2138,True,Private,online,24.0
3,0,MA,2139,False,Private,online,24.0
4,0,CT,6520,True,Private,online,5.0


### Extracting Data from Zip Codes Using the Census Bureau Data API
As inspired by [this article on leveraging value from postal codes](https://towardsdatascience.com/leveraging-value-from-postal-codes-naics-codes-area-codes-and-other-funky-arse-categorical-be9ce75b6d5a), I'll extract summary statistics for each zip code and use those as features, rather than the zip code itself. I'm following [these reddit comments](https://www.reddit.com/r/datasets/comments/i2g55u/demographic_data_sets_by_zip_code_free/) as a guide. I will be using the [census package](https://pypi.org/project/census/) for Python and the census api. Also, I'm storing my api key as an env variable so I don't accidentally push it to GitHub. See [here](https://able.bio/rhett/how-to-set-and-get-environment-variables-in-python--274rgt5) for more.

"This product uses the Census Bureau Data API but is not endorsed or certified by the Census Bureau."

In [590]:
from census import Census
import os
census_api_key = os.getenv('api_key_census')
c = Census(census_api_key, year=2020)

I will get data from [ACS 5-year data](https://www.census.gov/data/developers/data-sets/acs-5year.html) in 2020, as that's the closest available year to the pandemic and will give us a general sense of the demographic variables of the location the university is in. I would prefer to use data from 2020-2022, which is when colleges made these decisions, but unfortunately that's not available. There is ACS 1-year data, but that doesn't have zip code support, at least not in the package.

Here's a list of variables.

In [591]:
pd.DataFrame.from_dict(c.acs5.tables())

Unnamed: 0,name,description,variables,universe
0,B17015,POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILI...,http://api.census.gov/data/2020/acs/acs5/group...,FAMILY
1,B18104,SEX BY AGE BY COGNITIVE DIFFICULTY,http://api.census.gov/data/2020/acs/acs5/group...,NONINST_05_OVER
2,B17016,POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILI...,http://api.census.gov/data/2020/acs/acs5/group...,FAMILY
3,B18105,SEX BY AGE BY AMBULATORY DIFFICULTY,http://api.census.gov/data/2020/acs/acs5/group...,NONINST_05_OVER
4,B17017,POVERTY STATUS IN THE PAST 12 MONTHS BY HOUSEH...,http://api.census.gov/data/2020/acs/acs5/group...,HSHLD
...,...,...,...,...
1135,B99131,ALLOCATION OF MARITAL STATUS FOR FEMALES 15 TO...,http://api.census.gov/data/2020/acs/acs5/group...,WOMEN_15_50
1136,B09018,RELATIONSHIP TO HOUSEHOLDER FOR CHILDREN UNDER...,http://api.census.gov/data/2020/acs/acs5/group...,POP_18_UNDER_HSHLD_EXCL
1137,B09019,HOUSEHOLD TYPE (INCLUDING LIVING ALONE) BY REL...,http://api.census.gov/data/2020/acs/acs5/group...,TOTAL_POP
1138,B99132,ALLOCATION OF FERTILITY OF WOMEN 15 TO 50 YEARS,http://api.census.gov/data/2020/acs/acs5/group...,WOMEN_15_50


In this project, I will just get an assortment of basic demographic variables into a new dataframe, then merge it with my university data. A more sophisticated analysis could have a greater basis for choosing such variables.

Save variables to excel for easy browsing. Use [this guide to subject definitions](https://www2.census.gov/programs-surveys/acs/tech_docs/subject_definitions/2020_ACSSubjectDefinitions.pdf) for detailed help.

In [592]:
# pd.DataFrame.from_dict(c.acs5.tables()).to_excel('acs5_vars.xlsx')

An example for median income in past 12 months for Columbia's zip code.

In [593]:
c.acs5.state_zipcode('B07011_001E', 36, 10027)

[{'B07011_001E': 32100.0, 'zip code tabulation area': '10027'}]

I need to collect the codes for all the variables and the fips for each state (using [unitedstates package](https://github.com/unitedstates/python-us)). First I'll do fips.

In [594]:
import us

In [595]:
state_fips = us.states.mapping('abbr', 'fips')
covid_dates_cleaned['state_fips'] = covid_dates_cleaned['state'].apply(lambda x: state_fips[x])
covid_dates_cleaned.head()

Unnamed: 0,ranking,state,zip,ivy,institution_type,decision_type,days_after_first,state_fips
0,0,NJ,8544,True,Private,online,8.0,34
1,0,NY,10027,True,Private,online,6.0,36
2,0,MA,2138,True,Private,online,24.0,25
3,0,MA,2139,False,Private,online,24.0,25
4,0,CT,6520,True,Private,online,5.0,9


Now, I need to **select variables**. I think median income, population size, and political leaning could be interesting variables to start with. Note that political leaning is not available; I need to obtain it myself. It's specifically important here because Republicans and Democrats have different COVID responses; for example, a lot of Republican-led states may have more lax restrictions.

In [597]:
census_vars = {'B07011_001E': 'median_income', 'B01003_001E': 'total_population'}

Note that some zip codes don't have corresponding values. See the zip code for Princeton, NJ.

In [599]:
v = list(census_vars.keys())[0]

In [600]:
c.acs5.state_zipcode(v, 34, '08544')

[]

So, I'll try to get the corresponding variables from one geographic level up. For example, the county fips for Mercer County (where Princeton, NJ is located) is 021. Using this gets a result.

In [601]:
c.acs5.state_county(v, 34, '021')

[{'B07011_001E': 37223.0, 'state': '34', 'county': '021'}]

So, I need to extract the county fips from the zip code. I can do this using the [US Zipcodes to County State to FIPS Crosswalk](https://www.kaggle.com/datasets/danofer/zipcodes-county-fips-crosswalk) dataset. Note that the first two digits of STCOUNTYFP correspond to the state fips, and the last three digits correspond to the county fips [as seen here](https://www2.census.gov/geo/pdfs/maps-data/data/tiger/tiger2006se/app_a03.pdf). The column is read as an integer, so a leading zero in the state fips is dropped (e.g., 1001 for Autauga County, AL).

In [602]:
county_zips = pd.read_csv('zip-county-fips/ZIP-COUNTY-FIPS_2017-06.csv')
county_zips.head()

Unnamed: 0,ZIP,COUNTYNAME,STATE,STCOUNTYFP,CLASSFP
0,36003,Autauga County,AL,1001,H1
1,36006,Autauga County,AL,1001,H1
2,36067,Autauga County,AL,1001,H1
3,36066,Autauga County,AL,1001,H1
4,36703,Autauga County,AL,1001,H1


In [603]:
covid_dates_cleaned = covid_dates_cleaned.merge(county_zips[["ZIP", "STATE", "STCOUNTYFP"]], left_on=["state", "zip"], right_on=["STATE", "ZIP"]).drop(columns=["ZIP", "STATE"])

In [604]:
covid_dates_cleaned["county_fips"] = covid_dates_cleaned["STCOUNTYFP"]%1000

Now, try the api at the county level.

In [605]:
c.acs5.state_county(v, 34, '021')

[{'B07011_001E': 37223.0, 'state': '34', 'county': '021'}]

In [606]:
covid_dates_cleaned[['state_fips', 'county_fips']]

Unnamed: 0,state_fips,county_fips
0,34,21
1,34,21
2,36,61
3,36,61
4,25,17
...,...,...
109,42,95
110,25,25
111,25,25
112,39,49


In [607]:
c.acs5.state_county(v, 34, '021')

[{'B07011_001E': 37223.0, 'state': '34', 'county': '021'}]

This works, so I will use counties, which should all have values (unlike zip codes). Before I call the api, I need to make sure I found county fips for all counties present.

In [608]:
covid_dates_cleaned['county_fips'].isna().sum()

0

Also, I need to pad zeros for the api.

In [609]:
covid_dates_cleaned['county_fips_str'] = covid_dates_cleaned['county_fips'].astype(str).str.zfill(3)
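The split of the combined code and the zero-padding can be sketched together on a few example codes. The variable names below are illustrative only; the logic assumes STCOUNTYFP is stored as an integer, so leading zeros have to be restored after splitting.

```python
import pandas as pd

# Combined state+county codes as integers (leading zeros already lost)
stcountyfp = pd.Series([34021, 36061, 9009, 1001])

state_part = stcountyfp // 1000   # everything before the last three digits
county_part = stcountyfp % 1000   # last three digits

# Restore the fixed widths: 2 digits for state, 3 for county
state_str = state_part.astype(str).str.zfill(2)
county_str = county_part.astype(str).str.zfill(3)
print(pd.DataFrame({"state": state_str, "county": county_str}))
```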

Now, create a function that we can apply to each row of the dataframe that will call the api and get the desired census variables (held in ```census_vars```). Note that we only want this for unique rows.

In [610]:
api_return_cols = list(census_vars.values()) + ['state', 'county']
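def separate_county_fips(x, *v):
    # x is a row holding (state fips, zero-padded county fips); v holds the census variable codes
    api_return = c.acs5.state_county(v, x.iloc[0], x.iloc[1]) # returns dict in a list if found; empty list if not
    try:
        api_return_clean = pd.Series(api_return[0])
        return api_return_clean.rename(census_vars)
    except IndexError:
        # nothing returned for this county: keep the fips codes and leave the census variables as NaN
        no_api_return = pd.Series(index=api_return_cols, dtype='object')
        no_api_return[['state', 'county']] = x.values
        return no_api_return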

Note that I can obtain multiple fields from the same api call. This will be better than using a for-loop.

In [611]:
census_var_names = list(census_vars.keys())
census_var_names

['B07011_001E', 'B01003_001E']

In [613]:
c.acs5.state_county(census_var_names, 34, '021')

[{'B07011_001E': 37223.0,
  'B01003_001E': 368085.0,
  'state': '34',
  'county': '021'}]

Do a quick check to make sure it handles nonexistent inputs correctly.

In [614]:
testing_df = pd.DataFrame({'state_fips': [34, 36], 'county_fips_str': ['111', '061']})
testing_output = testing_df.drop_duplicates().apply(separate_county_fips, args=tuple(census_var_names), axis=1)
testing_output

Unnamed: 0,median_income,total_population,state,county
0,,,34,111
1,52409.0,1629153.0,36,61
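The same fallback behaviour can be reproduced without network access. In this sketch, `fake_state_county` is a hypothetical stand-in I wrote for `c.acs5.state_county` (it only mimics the "one dict in a list, or an empty list" return shape seen above), and `fetch_row` is an illustrative variant of the function, not the notebook's own.

```python
import pandas as pd

census_vars = {"B07011_001E": "median_income", "B01003_001E": "total_population"}

def fake_state_county(fields, state_fips, county_fips):
    """Hypothetical stand-in for the API call: a one-dict list if known, else []."""
    known = {("36", "061"): {"B07011_001E": 52409.0, "B01003_001E": 1629153.0}}
    hit = known.get((state_fips, county_fips))
    return [dict(hit, state=state_fips, county=county_fips)] if hit else []

def fetch_row(row):
    result = fake_state_county(list(census_vars), row["state_fips"], row["county_fips_str"])
    if result:
        # Found: rename the variable codes to readable column names
        return pd.Series(result[0]).rename(census_vars)
    # Not found: keep the keys and leave the census variables as NaN
    empty = pd.Series(index=list(census_vars.values()) + ["state", "county"], dtype="object")
    empty["state"], empty["county"] = row["state_fips"], row["county_fips_str"]
    return empty

keys = pd.DataFrame({"state_fips": ["34", "36"], "county_fips_str": ["111", "061"]})
out = keys.apply(fetch_row, axis=1)
print(out)
```

Checking for an empty list explicitly, rather than catching an exception, keeps unrelated errors from being silently turned into missing values.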


Now, I need to call the api on all my census variables using the above. Create a new df with target values, then merge.

In [615]:
census_vars_counties = covid_dates_cleaned[['state_fips', 'county_fips_str']].drop_duplicates().apply(separate_county_fips, args=tuple(census_var_names), axis=1)
census_vars_counties.head()

Unnamed: 0,median_income,total_population,state,county
0,37223.0,368085.0,34,21
2,52409.0,1629153.0,36,61
4,48230.0,1605899.0,25,17
8,36670.0,855733.0,9,9
10,50030.0,1924379.0,6,85


In [625]:
covid_dates_cleaned = (covid_dates_cleaned.merge(census_vars_counties, 
                                                 left_on=['state_fips', 'county_fips_str'], 
                                                 right_on=['state', 'county'], 
                                                 suffixes=('', '_new'))
                       .drop(columns=['state_new', 'county']))

covid_dates_cleaned.head()

Unnamed: 0,ranking,state,zip,ivy,institution_type,decision_type,days_after_first,state_fips,STCOUNTYFP,county_fips,county_fips_str,median_income,total_population
0,0,NJ,8544,True,Private,online,8.0,34,34021,21,21,37223.0,368085.0
1,0,NJ,8544,True,Private,vaccine,14.0,34,34021,21,21,37223.0,368085.0
2,0,NY,10027,True,Private,online,6.0,36,36061,61,61,52409.0,1629153.0
3,0,NY,10027,True,Private,vaccine,13.5,36,36061,61,61,52409.0,1629153.0
4,2,NY,10012,False,Private,online,23.5,36,36061,61,61,52409.0,1629153.0


### County-Level COVID Data
County data is easier to retrieve and more consistent with the rest of our analysis, so I will keep using it instead of zip codes. I will use the [CovidActNow API](https://covidactnow.org/data-api) to get time series data by county. I will then merge the data using the 5-digit county fips.

In [876]:
# import requests
# api_key_covidnow = os.getenv('api_key_covidnow')
# url = f'https://api.covidactnow.org/v2/counties.timeseries.json?apiKey={api_key_covidnow}'
# r = requests.get(url)
# covid_county_basic = r.json()

In [884]:
# import json
# with open('covid_county.json', 'w', encoding='utf-8') as f:
#     json.dump(covid_county_basic, f, ensure_ascii=False)

This JSON file seems to be too large for Pandas to handle, so I'll use Dask (a parallel computing library that extends Pandas and other Python libraries).

Note that I needed to restart my kernel and run the cell below first for this to work. It does work; the operations are just taking a long time.

I may not need Dask here; perhaps I can just use SQL, as I need to aggregate the dates anyway into some constant measure per university.

In [1]:
import dask.dataframe as dd
covid_county = dd.read_json('covid_county.json')
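Since the time series has to be collapsed into one number per county anyway, the aggregation could also be done while walking the parsed JSON, before building any large dataframe. The sketch below runs on a hypothetical miniature response; the field names (`fips`, `actualsTimeseries`, `date`, `newCases`) are assumptions about the response layout and should be checked against the actual file, and `mean_new_cases` is an illustrative helper.

```python
import pandas as pd

# Hypothetical miniature of the parsed response; field names are assumptions
sample = [
    {"fips": "34021", "actualsTimeseries": [
        {"date": "2020-03-01", "newCases": 2},
        {"date": "2020-03-02", "newCases": 4},
        {"date": "2020-04-01", "newCases": 50},
    ]},
    {"fips": "36061", "actualsTimeseries": [
        {"date": "2020-03-01", "newCases": 10},
        {"date": "2020-03-02", "newCases": None},
    ]},
]

def mean_new_cases(counties, start, end):
    """Mean daily new cases per county within [start, end] (ISO date strings)."""
    rows = []
    for county in counties:
        for day in county["actualsTimeseries"]:
            # ISO dates compare correctly as strings; skip days with no report
            if start <= day["date"] <= end and day["newCases"] is not None:
                rows.append({"STCOUNTYFP": county["fips"], "newCases": day["newCases"]})
    return pd.DataFrame(rows).groupby("STCOUNTYFP")["newCases"].mean()

march_mean = mean_new_cases(sample, "2020-03-01", "2020-03-31")
print(march_mean)
```

The resulting Series is keyed by the 5-digit county fips, so it could be merged onto the main dataframe on `STCOUNTYFP`.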

In [None]:
covid_county.head(1)

In [None]:
covid_county

In [None]:
covid_dates_cleaned

### Political Leaning
There is no simple way to gauge political leanings, so I will use composition of state legislature, as well as the county presidential election returns from 2020 as a proxy. A college town may often lean left, but the state legislature represents overall political sentiment and has a direct impact on COVID guidelines, so I think it'll be useful as well and not redundant.

I will use the data from [the NCSL](https://www.ncsl.org/Portals/1/Documents/Elections/Legis_Control_2020_April%201.pdf) for who controlled each state legislature in 2020. I filled in Rep for Nebraska after a Google search and Dem for DC, as it votes heavily Democratic. The mapping is hardcoded because it's only 51 entries and pretty basic.

In [116]:
political_control_state = {'DC': 'Dem', 'AL': 'Rep', 'AK': 'Rep', 'AZ': 'Rep', 'AR': 'Rep', 'CA': 'Dem', 'CO': 'Dem', 'CT': 'Dem', 'DE': 'Dem', 'FL': 'Rep', 'GA': 'Rep', 'HI': 'Dem', 'ID': 'Rep', 'IL': 'Dem', 'IN': 'Rep', 'IA': 'Rep', 'KS': 'Div', 'KY': 'Div', 'LA': 'Div', 'ME': 'Dem', 'MD': 'Div', 'MA': 'Div', 'MI': 'Div', 'MN': 'Div', 'MS': 'Rep', 'MO': 'Rep', 'MT': 'Div', 'NE': 'Rep', 'NV': 'Dem', 'NH': 'Div', 'NJ': 'Dem', 'NM': 'Dem', 'NY': 'Dem', 'NC': 'Div', 'ND': 'Rep', 'OH': 'Rep', 'OK': 'Rep', 'OR': 'Dem', 'PA': 'Div', 'RI': 'Dem', 'SC': 'Rep', 'SD': 'Rep', 'TN': 'Rep', 'TX': 'Rep', 'UT': 'Rep', 'VT': 'Div', 'VA': 'Dem', 'WA': 'Dem', 'WV': 'Rep', 'WI': 'Div', 'WY': 'Rep'}

In [319]:
covid_dates_cleaned['political_control_state'] = covid_dates_cleaned['state'].map(political_control_state)
covid_dates_cleaned.head()

Unnamed: 0,ranking,state,zip,ivy,institution_type,decision_type,days_after_first,political_control_state
0,0,NJ,8544,True,Private,online,8.0,Dem
1,0,NY,10027,True,Private,online,6.0,Dem
2,0,MA,2138,True,Private,online,4.0,Div
3,0,MA,2139,False,Private,online,4.0,Div
4,0,CT,6520,True,Private,online,5.0,Dem


For the local data, I'll use [County Presidential Election Returns 2000-2020](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ) from the MIT Election Data and Science Lab.

In [781]:
county_pres = pd.read_csv('political-data/countypres_2000-2020.csv')
county_pres.head()

Unnamed: 0,year,state,state_po,county_name,county_fips,office,candidate,party,candidatevotes,totalvotes,version,mode
0,2000,ALABAMA,AL,AUTAUGA,1001.0,US PRESIDENT,AL GORE,DEMOCRAT,4942,17208,20220315,TOTAL
1,2000,ALABAMA,AL,AUTAUGA,1001.0,US PRESIDENT,GEORGE W. BUSH,REPUBLICAN,11993,17208,20220315,TOTAL
2,2000,ALABAMA,AL,AUTAUGA,1001.0,US PRESIDENT,RALPH NADER,GREEN,160,17208,20220315,TOTAL
3,2000,ALABAMA,AL,AUTAUGA,1001.0,US PRESIDENT,OTHER,OTHER,113,17208,20220315,TOTAL
4,2000,ALABAMA,AL,BALDWIN,1003.0,US PRESIDENT,AL GORE,DEMOCRAT,13997,56480,20220315,TOTAL


Note that there isn't data for every state in 2020 (for example, the query below returns no rows for North Carolina). So, I'll use 2016, which should be fairly similar; I don't expect many counties to have shifted much. However, when the data becomes available, I will switch to 2020 for all states.

In [782]:
county_pres.query("state_po == 'NC' and mode == 'TOTAL' and year == 2020")

Unnamed: 0,year,state,state_po,county_name,county_fips,office,candidate,party,candidatevotes,totalvotes,version,mode


Restrict to 2016

In [783]:
election_year = 2016
county_pres = county_pres.query("year == @election_year")

Change county_fips to an int. 

Note there are 9 counties without a fips. We'll just drop those. If it becomes a problem, I'll revisit this.

In [784]:
county_pres['county_fips'].isna().sum()

9

In [785]:
# take an explicit copy so the column assignments below act on an independent frame
county_pres = county_pres.dropna(subset=['county_fips']).copy()
county_pres['county_fips'] = county_pres['county_fips'].astype(int)

In [786]:
county_pres.head(2)

Unnamed: 0,year,state,state_po,county_name,county_fips,office,candidate,party,candidatevotes,totalvotes,version,mode
40517,2016,ALABAMA,AL,AUTAUGA,1001,US PRESIDENT,HILLARY CLINTON,DEMOCRAT,5936,24973,20220315,TOTAL
40518,2016,ALABAMA,AL,AUTAUGA,1001,US PRESIDENT,DONALD TRUMP,REPUBLICAN,18172,24973,20220315,TOTAL


Now find percent vote for each party by county for total election results.

In [787]:
county_pres['mode'].value_counts()

TOTAL    9465
Name: mode, dtype: int64

In [788]:
county_pres['percentvote'] = county_pres['candidatevotes']/county_pres['totalvotes']

Measure how Democratic or Republican each county is. This can either be converted to a discrete variable (who won) or left continuous. I'll do the latter, but note that 20% vs. 30% may not be that different in practice, so a categorical variable may capture this better.

Before doing this, make sure data makes sense for every county (i.e., both Democrats and Republicans got votes).

In [789]:
county_pres.query("(party == 'REPUBLICAN' or party == 'DEMOCRAT') and percentvote == 0 and mode == 'TOTAL'")

Unnamed: 0,year,state,state_po,county_name,county_fips,office,candidate,party,candidatevotes,totalvotes,version,mode,percentvote


Set the new variable as the difference in percent vote between Democrats and Republicans (defined as percent vote Democrat - percent vote Republican).

I want to create a new df with a column for Democrat percent vote and a column for Republican percent vote so I can easily subtract.

In [790]:
county_pres = county_pres.query("(party == 'REPUBLICAN' or party == 'DEMOCRAT') and mode == 'TOTAL'")

In [791]:
county_pres_percents = (county_pres.query("party == 'DEMOCRAT'")[['county_fips', 'percentvote']]
                        .merge(county_pres.query("party == 'REPUBLICAN'")[['county_fips', 'percentvote']], 
                               on='county_fips', suffixes=('_D', '_R')))

In [792]:
county_pres_percents['county_vote_diff'] = county_pres_percents['percentvote_D'] - county_pres_percents['percentvote_R']
county_pres_percents.head()

Unnamed: 0,county_fips,percentvote_D,percentvote_R,county_vote_diff
0,1001,0.237697,0.727666,-0.489969
1,1003,0.193856,0.765457,-0.571601
2,1005,0.465278,0.520967,-0.055688
3,1007,0.212496,0.764032,-0.551536
4,1009,0.084258,0.893348,-0.80909
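The same difference can also be computed with a pivot instead of a self-merge. This is a sketch of that alternative, not the approach used above; the first county's vote counts are the Autauga figures shown earlier, while the second county's numbers are made up.

```python
import pandas as pd

votes = pd.DataFrame({
    "county_fips": [1001, 1001, 1001, 1003, 1003],
    "party": ["DEMOCRAT", "REPUBLICAN", "OTHER", "DEMOCRAT", "REPUBLICAN"],
    "candidatevotes": [5936, 18172, 865, 200, 800],   # county 1003 is hypothetical
    "totalvotes": [24973, 24973, 24973, 1000, 1000],
})
votes["percentvote"] = votes["candidatevotes"] / votes["totalvotes"]

# One row per county, one column per party; summing guards against a party
# being split over several rows within the same county
shares = votes.pivot_table(index="county_fips", columns="party",
                           values="percentvote", aggfunc="sum")
shares["county_vote_diff"] = shares["DEMOCRAT"] - shares["REPUBLICAN"]
print(shares[["DEMOCRAT", "REPUBLICAN", "county_vote_diff"]])
```

A positive `county_vote_diff` means the county leaned Democratic, a negative one Republican, matching the sign convention defined above.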


Merge with main df.

In [793]:
county_pres_percents['STCOUNTYFP'] = county_pres_percents['county_fips'].astype(str)

In [794]:
covid_dates_cleaned['STCOUNTYFP'] = covid_dates_cleaned['STCOUNTYFP'].astype(str)

In [795]:
covid_dates_all = covid_dates_cleaned.merge(county_pres_percents[['STCOUNTYFP', 'county_vote_diff']], on='STCOUNTYFP', how='left')
covid_dates_all[covid_dates_all['county_vote_diff'].isna()] # check na

Unnamed: 0,ranking,state,zip,ivy,institution_type,decision_type,days_after_first,state_fips,STCOUNTYFP,county_fips,county_fips_str,median_income,total_population,county_vote_diff
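The empty result above means every row found a match. A minimal sketch of the same check on hypothetical toy keys is below; it uses the `indicator` and `validate` options of `merge`, which are my additions here, to show unmatched rows explicitly and to assert that each county appears at most once on the right side.

```python
import pandas as pd

# Hypothetical keys and values; 99999 has no counterpart on the right
left = pd.DataFrame({"STCOUNTYFP": [34021, 36061, 99999]})
right = pd.DataFrame({"county_fips": [34021, 36061], "county_vote_diff": [0.1, 0.7]})

# Both keys must share a dtype, otherwise nothing matches
left["STCOUNTYFP"] = left["STCOUNTYFP"].astype(str)
right["STCOUNTYFP"] = right["county_fips"].astype(str)

merged = left.merge(right[["STCOUNTYFP", "county_vote_diff"]],
                    on="STCOUNTYFP", how="left",
                    indicator=True, validate="many_to_one")
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```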


## Save