## Goal
In this notebook, I will clean and preprocess the data, then feed it to some basic machine learning models, with the goal of predicting the number of days until a given university acts (relative to the first university's action in that category). For example, if I feed the features corresponding to WashU into my model, I may get 3 as an output, which would mean WashU is predicted to impose COVID restrictions 3 days after the first mover.

**Note (important):** My input data has many problems. First, there isn't enough of it (only 51 samples), because the data was difficult to collect. There are also likely too few features: a major one I'm missing is the COVID incidence rate in each zip code during the period when decisions were made, and including it might make the model more accurate. Second, the observations are interdependent, meaning the iid assumption is violated. This assumption is rarely fully satisfied in practice, but in this model the violation is especially problematic, as universities likely base their decisions on what other, similar universities do. So, this model is flawed. However, it is based on a novel problem I thought about and could possibly possess some small degree of predictive power. Whatever the outcome, it will have been good practice, especially with getting the data into a workable form.

In [49]:
import pandas as pd
import matplotlib.pyplot as plt

## Preparing the Data

In [50]:
covid_dates = pd.read_excel("cleaned_university_covid_dates.xlsx")

In [51]:
covid_dates.head()

Unnamed: 0,displayName,Unofficial Ranking,rankingDisplayRank,state,city,zip,Ivy,institution type,Date of Spring 2020 Move Online (first action to move classes online and tell students not to return to campus (some require it); many acted later to extend move to end of semester;),Date of Vaccine Requirement for Students in FL2021,Date of Booster Requirement,Date of Spring 2022 Move Online/Delay,description
0,Princeton University,1,#1,NJ,Princeton,8544,True,Private,2020-03-11,2021-04-20,2021-12-16,2021-12-27,The ivy-covered campus of Princeton University...
1,Columbia University,2,#2,NY,New York,10027,True,Private,2020-03-12,2021-04-19,2021-12-16,2021-12-22,Columbia University has three undergraduate sc...
2,Harvard University,3,#2,MA,Cambridge,2138,True,Private,2020-03-10,2021-05-05,2021-12-16,NaT,Harvard University is a private institution in...
3,Massachusetts Institute of Technology,4,#2,MA,Cambridge,2139,False,Private,2020-03-10,2021-04-30,2021-12-13,NaT,Though the Massachusetts Institute of Technolo...
4,Yale University,5,#5,CT,New Haven,6520,True,Private,2020-03-10,2021-04-19,2021-12-17,2021-12-22,"Yale University, located in New Haven, Connect..."


First, I'll shorten the column titles; they are a bit too long. Also, I'm not using the descriptions from USNews, so I can drop that column, along with the display rank, display name, and city (we'll just use state and zip here).

In [52]:
covid_dates = covid_dates.rename(columns=
                       {"Date of Spring 2020 Move Online (first action to move classes online and tell students not to return to campus (some require it); many acted later to extend move to end of semester;)": "Spring2020",
                        "Date of Vaccine Requirement for Students in FL2021": "FirstVaccine",
                        "Date of Booster Requirement": "Booster",
                        "Date of Spring 2022 Move Online/Delay": "Spring2022"}).drop(columns=["description", "displayName", "rankingDisplayRank", "city"])

In [53]:
covid_dates.head()

Unnamed: 0,Unofficial Ranking,state,zip,Ivy,institution type,Spring2020,FirstVaccine,Booster,Spring2022
0,1,NJ,8544,True,Private,2020-03-11,2021-04-20,2021-12-16,2021-12-27
1,2,NY,10027,True,Private,2020-03-12,2021-04-19,2021-12-16,2021-12-22
2,3,MA,2138,True,Private,2020-03-10,2021-05-05,2021-12-16,NaT
3,4,MA,2139,False,Private,2020-03-10,2021-04-30,2021-12-13,NaT
4,5,CT,6520,True,Private,2020-03-10,2021-04-19,2021-12-17,2021-12-22


Now, instead of using dates, I will use days after the first university acted in each category.

In [54]:
covid_dates_only_d = covid_dates[['Spring2020', 'FirstVaccine', 'Booster', 'Spring2022']]
first_dates = covid_dates_only_d.min()
date_diff = covid_dates_only_d - first_dates
covid_dates[['Spring2020', 'FirstVaccine', 'Booster', 'Spring2022']] = date_diff.apply(lambda x: x.dt.days)
covid_dates.head()

Unnamed: 0,Unofficial Ranking,state,zip,Ivy,institution type,Spring2020,FirstVaccine,Booster,Spring2022
0,1,NJ,8544,True,Private,5,18.0,10.0,11.0
1,2,NY,10027,True,Private,6,17.0,10.0,6.0
2,3,MA,2138,True,Private,4,33.0,10.0,
3,4,MA,2139,False,Private,4,28.0,7.0,
4,5,CT,6520,True,Private,4,17.0,11.0,6.0


There are two different types of dates here: those that mark a shift to online learning and those that mandate vaccinations. I will average the days within each of these categories, creating an `online` column for the former (move online) and a `vaccine` column for the latter (vaccination).

Note that there are some null values in the data, which indicate that a university did not act on that specific outcome. This is problematic: if we ignore them or impute a typical value, we lose valuable information, since perhaps a certain combination of features leads schools not to act. That signal would then be missing from our data (see the section [exploring a different task](#exploring-a-different-task) below for more). So, I will treat not acting as taking twice the largest observed number of days for that decision (the factor of 2 is an arbitrary choice). Thus the target value will be much higher if a university did not act. I impute the values below.

In [55]:
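arb_constant = 2  # arbitrary multiplier applied when a university never acted
date_cols = ['Spring2020', 'FirstVaccine', 'Booster', 'Spring2022']
# Per-column penalty: twice the largest observed number of days
max_days_penalty = covid_dates[date_cols].max() * arb_constant
# fillna with a Series fills each column using the value under its column label
covid_dates_imputed = covid_dates.fillna(max_days_penalty)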

In [56]:
covid_dates_imputed['online'] = covid_dates_imputed[['Spring2020', 'Spring2022']].mean(axis=1)
covid_dates_imputed['vaccine'] = covid_dates_imputed[['FirstVaccine', 'Booster']].mean(axis=1)
covid_dates_cleaned = covid_dates_imputed.drop(columns=['Spring2020', 'FirstVaccine', 'Booster', 'Spring2022'])

In [57]:
covid_dates_cleaned.rename(columns={'Unofficial Ranking': 'ranking', 'Ivy': 'ivy', 'institution type': 'institution_type'}, 
                           inplace=True)

In [58]:
covid_dates_cleaned.head()

Unnamed: 0,ranking,state,zip,ivy,institution_type,online,vaccine
0,1,NJ,8544,True,Private,8.0,14.0
1,2,NY,10027,True,Private,6.0,13.5
2,3,MA,2138,True,Private,24.0,21.5
3,4,MA,2139,False,Private,24.0,17.5
4,5,CT,6520,True,Private,5.0,14.0
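
To make the effect of the penalty concrete, here is a small worked check against the rows shown above. It assumes the largest observed `Spring2022` offset is 22 days, a value inferred from the displayed output rather than read from the data directly: Harvard's `online` value of 24.0 with a `Spring2020` offset of 4 implies an imputed value of 44.

```python
# Worked check of the imputation rule, using values shown in the tables above.
# ASSUMPTION: the maximum Spring2022 offset is 22 days (inferred from the output).
arb_constant = 2
max_spring2022 = 22
penalty = max_spring2022 * arb_constant  # value assigned to "did not act"


def mean_days(a, b):
    """Average two day offsets, as done for the `online` and `vaccine` columns."""
    return (a + b) / 2


harvard_online = mean_days(4, penalty)  # Spring2020 = 4, Spring2022 missing
princeton_online = mean_days(5, 11)     # both dates observed
print(harvard_online, princeton_online)  # 24.0 8.0
```

A single missing date therefore dominates the average, which is worth keeping in mind when reading the model's errors later.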


### Extracting Data from Zip Codes
Inspired by [this article on leveraging value from postal codes](https://towardsdatascience.com/leveraging-value-from-postal-codes-naics-codes-area-codes-and-other-funky-arse-categorical-be9ce75b6d5a), I'll extract summary demographic information for each zip code and use it as features, rather than the zip code itself. I'm following [these reddit comments](https://www.reddit.com/r/datasets/comments/i2g55u/demographic_data_sets_by_zip_code_free/) as a guide. I will be using the [census package](https://pypi.org/project/census/) for Python and the Census API. Also, I'm storing my API key as an environment variable so I don't accidentally push it to GitHub. See [here](https://able.bio/rhett/how-to-set-and-get-environment-variables-in-python--274rgt5) for more.

"This product uses the Census Bureau Data API but is not endorsed or certified by the Census Bureau."

In [59]:
from census import Census
import os
census_api_key = os.getenv('api_key_census')
c = Census(census_api_key)

I will get data from [ACS 5-year data](https://www.census.gov/data/developers/data-sets/acs-5year.html) in 2020, as that's the closest available year to the pandemic and will give us a general sense of the demographic variables of the location the university is in. I would prefer to use data from 2020-2022, which is when colleges made these decisions, but unfortunately that's not available. There is ACS 1-year data, but that doesn't have zip code support, at least not in the package.

Here's a list of variables.

In [60]:
pd.DataFrame.from_dict(c.acs5.tables())

Unnamed: 0,name,description,variables,universe
0,B17015,POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILI...,http://api.census.gov/data/2020/acs/acs5/group...,FAMILY
1,B18104,SEX BY AGE BY COGNITIVE DIFFICULTY,http://api.census.gov/data/2020/acs/acs5/group...,NONINST_05_OVER
2,B17016,POVERTY STATUS IN THE PAST 12 MONTHS OF FAMILI...,http://api.census.gov/data/2020/acs/acs5/group...,FAMILY
3,B18105,SEX BY AGE BY AMBULATORY DIFFICULTY,http://api.census.gov/data/2020/acs/acs5/group...,NONINST_05_OVER
4,B17017,POVERTY STATUS IN THE PAST 12 MONTHS BY HOUSEH...,http://api.census.gov/data/2020/acs/acs5/group...,HSHLD
...,...,...,...,...
1135,B99131,ALLOCATION OF MARITAL STATUS FOR FEMALES 15 TO...,http://api.census.gov/data/2020/acs/acs5/group...,WOMEN_15_50
1136,B09018,RELATIONSHIP TO HOUSEHOLDER FOR CHILDREN UNDER...,http://api.census.gov/data/2020/acs/acs5/group...,POP_18_UNDER_HSHLD_EXCL
1137,B09019,HOUSEHOLD TYPE (INCLUDING LIVING ALONE) BY REL...,http://api.census.gov/data/2020/acs/acs5/group...,TOTAL_POP
1138,B99132,ALLOCATION OF FERTILITY OF WOMEN 15 TO 50 YEARS,http://api.census.gov/data/2020/acs/acs5/group...,WOMEN_15_50


In this project, I will just get an assortment of basic demographic variables into a new dataframe, then merge it with my university data. A more sophisticated analysis could have a greater basis for choosing such variables.

Save variables to excel for easy browsing. Use [this guide to subject definitions](https://www2.census.gov/programs-surveys/acs/tech_docs/subject_definitions/2020_ACSSubjectDefinitions.pdf) for detailed help.

In [61]:
# pd.DataFrame.from_dict(c.acs5.tables()).to_excel('acs5_vars.xlsx')

An example for median income in past 12 months for Columbia's zip code.

In [62]:
c.acs5.state_zipcode('B07011_001E', 36, 10027)

[{'B07011_001E': 32100.0, 'zip code tabulation area': '10027'}]

I need to collect the codes for all the variables and the fips for each state (using [unitedstates package](https://github.com/unitedstates/python-us)). First I'll do fips.

In [63]:
import us

In [75]:
state_fips = us.states.mapping('abbr', 'fips')
covid_dates_cleaned['state_fips'] = covid_dates_cleaned['state'].apply(lambda x: state_fips[x])
covid_dates_cleaned.head()

Unnamed: 0,ranking,state,zip,ivy,institution_type,online,vaccine,state_fips
0,1,NJ,8544,True,Private,8.0,14.0,34
1,2,NY,10027,True,Private,6.0,13.5,36
2,3,MA,2138,True,Private,24.0,21.5,25
3,4,MA,2139,False,Private,24.0,17.5,25
4,5,CT,6520,True,Private,5.0,14.0,9


Now, I need to select variables. I think median income, population size, and political leaning could be interesting variables to start with. Note that political leaning is not available; I need to obtain it myself. It's specifically important here because Republicans and Democrats have different COVID responses; for example, a lot of Republican-led states may have more lax restrictions.

In [77]:
census_vars = {'median_income': 'B07011_001E', 'total_population': 'B01003_001E'}

### TODO
*In progress:* Note that some zip codes don't have corresponding values. Maybe use city names instead?

In [96]:
c.acs5.state_zipcode(v, 34, 8544)

[]
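
One possible cause of the empty result above is that the zip codes were read as integers, so leading zeros were lost: Princeton's zip code is 08544, not 8544. The sketch below, on a toy frame rather than the real data, restores five-character zip codes and two-character state FIPS codes as strings. The column names `zip_str` and `state_fips_str` are made up for illustration, and whether the API then returns a value for every zip code is not verified here.

```python
import pandas as pd

# Toy rows mirroring the notebook's columns; zip codes were parsed as integers.
toy = pd.DataFrame({"state_fips": ["34", "36", "9"], "zip": [8544, 10027, 6520]})

# Restore the leading zeros by converting to zero-padded strings.
toy["zip_str"] = toy["zip"].astype(str).str.zfill(5)
toy["state_fips_str"] = toy["state_fips"].astype(str).str.zfill(2)

print(toy[["state_fips_str", "zip_str"]].values.tolist())
# [['34', '08544'], ['36', '10027'], ['09', '06520']]
```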

In [86]:
for v in census_vars.values():
    test = covid_dates_cleaned[['state_fips', 'zip']].apply(lambda x: print(x[0], x[1]), axis=1) #c.acs5.state_zipcode(v, x[0], x[1]), axis=1)
    print(test)
#     c.acs5.state_zipcode(v, covid_dates_cleaned['state_fips'], covid_dates_cleaned['zip'])

34 8544
36 10027
25 2138
25 2139
09 6520
06 94305
17 60637
42 19104
06 91125
37 27708
24 21218
17 60208
33 3755
44 2912
47 37240
29 63130
36 14853
48 77251
18 46556
06 90095
13 30322
06 94720
11 20057
26 48109
42 15213
51 22904
06 90089
36 10012
25 2155
06 93106
12 32611
37 27599
37 27109
06 92093
36 14627
25 2467
06 92697
13 30332
06 95616
48 78705
51 23187
25 2215
25 2454
39 44106
22 70118
55 53706
17 61820
13 30602
42 18015
25 2115
39 43210
0     None
1     None
2     None
3     None
4     None
5     None
6     None
7     None
8     None
9     None
10    None
11    None
12    None
13    None
14    None
15    None
16    None
17    None
18    None
19    None
20    None
21    None
22    None
23    None
24    None
25    None
26    None
27    None
28    None
29    None
30    None
31    None
32    None
33    None
34    None
35    None
36    None
37    None
38    None
39    None
40    None
41    None
42    None
43    None
44    None
45    None
46    None
47    None
48    None
49    None
50
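
A minimal sketch of what the finished lookup might look like, once the debugging prints are replaced by real calls. It assumes the lookup behaves like the `c.acs5.state_zipcode` calls above: a list containing one dict on success and an empty list when nothing is found. `lookup_census_vars` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for the real API so that the example runs offline.

```python
import pandas as pd

census_vars = {"median_income": "B07011_001E", "total_population": "B01003_001E"}


def lookup_census_vars(df, fetch, variables):
    """Return one row of census values per row of df; failed lookups become None."""
    rows = []
    for state_fips, zcta in zip(df["state_fips"], df["zip"]):
        row = {}
        for name, code in variables.items():
            result = fetch(code, state_fips, zcta)  # list of dicts, or []
            row[name] = result[0][code] if result else None
        rows.append(row)
    return pd.DataFrame(rows, index=df.index)


# Hypothetical stand-in for c.acs5.state_zipcode; it only knows Columbia's zip code.
def fake_fetch(code, state_fips, zcta):
    known = {("B07011_001E", "36", "10027"): 32100.0}
    value = known.get((code, state_fips, zcta))
    if value is None:
        return []
    return [{code: value, "zip code tabulation area": zcta}]


toy = pd.DataFrame({"state_fips": ["36", "34"], "zip": ["10027", "8544"]})
features = lookup_census_vars(toy, fake_fetch, census_vars)
print(features)
```

With the real client, `fetch` would be `c.acs5.state_zipcode`, and the result could be attached with `covid_dates_cleaned.join(features)`. Rows that come back as missing make it easy to see which zip codes still need attention.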

### Political Leaning
There is no simple way to gauge political leanings, so I will use the composition of the state legislature as a proxy. A college town may often lean left, but the state legislature represents overall political sentiment and has a direct impact on COVID guidelines, so I think it'll be more useful. I will use the data from [the ncsl](https://www.ncsl.org/Portals/1/Documents/Elections/Legis_Control_2020_April%201.pdf) on who controlled each state legislature in 2020. I filled in Rep for Nebraska after a quick search (its legislature is officially nonpartisan) and Dem for DC, as it votes heavily Democratic. The mapping is hardcoded because it has only 51 entries (50 states plus DC) and is pretty basic.

In [116]:
political_control_state = {'DC': 'Dem', 'AL': 'Rep', 'AK': 'Rep', 'AZ': 'Rep', 'AR': 'Rep', 'CA': 'Dem', 'CO': 'Dem', 'CT': 'Dem', 'DE': 'Dem', 'FL': 'Rep', 'GA': 'Rep', 'HI': 'Dem', 'ID': 'Rep', 'IL': 'Dem', 'IN': 'Rep', 'IA': 'Rep', 'KS': 'Div', 'KY': 'Div', 'LA': 'Div', 'ME': 'Dem', 'MD': 'Div', 'MA': 'Div', 'MI': 'Div', 'MN': 'Div', 'MS': 'Rep', 'MO': 'Rep', 'MT': 'Div', 'NE': 'Rep', 'NV': 'Dem', 'NH': 'Div', 'NJ': 'Dem', 'NM': 'Dem', 'NY': 'Dem', 'NC': 'Div', 'ND': 'Rep', 'OH': 'Rep', 'OK': 'Rep', 'OR': 'Dem', 'PA': 'Div', 'RI': 'Dem', 'SC': 'Rep', 'SD': 'Rep', 'TN': 'Rep', 'TX': 'Rep', 'UT': 'Rep', 'VT': 'Div', 'VA': 'Dem', 'WA': 'Dem', 'WV': 'Rep', 'WI': 'Div', 'WY': 'Rep'}

In [119]:
covid_dates_cleaned['political_control_state'] = covid_dates_cleaned['state'].map(political_control_state)
covid_dates_cleaned.head()

Unnamed: 0,ranking,state,zip,ivy,institution_type,online,vaccine,state_fips,political_control_state
0,1,NJ,8544,True,Private,8.0,14.0,34,Dem
1,2,NY,10027,True,Private,6.0,13.5,36,Dem
2,3,MA,2138,True,Private,24.0,21.5,25,Div
3,4,MA,2139,False,Private,24.0,17.5,25,Div
4,5,CT,6520,True,Private,5.0,14.0,9,Dem


### Preprocessing

Next, I'll turn state, zip, ivy, and institution type into categorical variables using one-hot encoding. Note that my preprocessing is based on [this sklearn course](https://inria.github.io/scikit-learn-mooc/python_scripts/03_categorical_pipeline_column_transformer.html).

Note that only 23 states are present in this dataset.

In [147]:
len(covid_dates_cleaned['state'].unique())

23

In [140]:
covid_dates_cleaned.dtypes

ranking               int64
state                object
zip                   int64
ivy                    bool
institution_type     object
vaccine             float64
online              float64
dtype: object

In [142]:
from sklearn.compose import make_column_selector as selector

# Separate columns into numerical and categorical by dtype
numerical_columns_selector = selector(dtype_exclude=[object, bool])
categorical_columns_selector = selector(dtype_include=[object, bool])

numerical_columns = numerical_columns_selector(covid_dates_cleaned)
categorical_columns = categorical_columns_selector(covid_dates_cleaned)

In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
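
A minimal sketch of where this cell might be heading, following the pattern in the sklearn course linked above: one-hot encode the categorical columns and standardize the numerical ones inside a `ColumnTransformer`. The toy frame and the choice of columns are assumptions for illustration, not the notebook's final feature set.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy feature frame with the same kinds of columns as covid_dates_cleaned.
X = pd.DataFrame({
    "ranking": [1, 2, 3, 4],
    "state": ["NJ", "NY", "MA", "MA"],
    "ivy": [True, True, True, False],
    "institution_type": ["Private", "Private", "Private", "Private"],
})

# ASSUMPTION: columns are listed by hand here instead of selected by dtype.
categorical_columns = ["state", "ivy", "institution_type"]
numerical_columns = ["ranking"]

preprocessor = ColumnTransformer([
    # handle_unknown="ignore" encodes unseen categories as all zeros at predict time
    ("one_hot", OneHotEncoder(handle_unknown="ignore"), categorical_columns),
    ("scale", StandardScaler(), numerical_columns),
])

X_transformed = preprocessor.fit_transform(X)
print(X_transformed.shape)  # (4, 7): 3 states + 2 ivy values + 1 type + 1 scaled column
```

Ignoring unknown categories matters here because only 23 states appear in the data, so a new university could easily come from a state the encoder has never seen.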


## Exploring a Different Task
*TODO* Can we create a classification model that tells us whether or not universities have a vaccine mandate? Is there even enough data to do this?
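
As a first step toward answering the question above, one could derive a binary label from whether a mandate date is missing and look at the class balance. The sketch uses a toy frame. The column name `FirstVaccine` matches the notebook, but `has_mandate` is a hypothetical name, and treating a missing date as "no mandate" is an assumption about how the data was recorded.

```python
import pandas as pd

# Toy data: three universities with a mandate date, one without.
toy = pd.DataFrame({
    "FirstVaccine": pd.to_datetime(["2021-04-20", "2021-04-19", None, "2021-05-05"]),
})

# ASSUMPTION: a missing date means the university never imposed the mandate.
toy["has_mandate"] = toy["FirstVaccine"].notna()

# Class balance: a very lopsided split would make classification hard with ~50 rows.
n_with = int(toy["has_mandate"].sum())
n_without = len(toy) - n_with
print(n_with, n_without)  # 3 1
```

If nearly every university in the real data has a mandate, a classifier would have very few negative examples to learn from, which speaks directly to the "enough data" question.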