## Goal
Predict how long it takes colleges to implement a vaccine requirement.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Preprocessing and Feature Engineering
Some preprocessing was done in the previous notebook, but most of it happens here. I'm using the cleaned vaccine data from my 'Vaccine Mandates' notebook.

In [74]:
vacc_data = pd.read_csv('vacc_mandates_cleaned.csv')
vacc_data.head()

Unnamed: 0,College,ranking,state,city,zip,announce_date,Type,State_x,all_employee_vacc,some_employee_vacc,...,median_income,total_population,avg_hhsize,avg_community_level,political_control_state,county_vote_diff,State_y,State Code,Region,Division
0,Princeton University,0,NJ,Princeton,8544,19.0,Private,NJ,1,0,...,37223.0,368085.0,2.46,0.62766,Dem,0.374257,New Jersey,NJ,Northeast,Middle Atlantic
1,Columbia University,1,NY,New York,10027,18.0,Private,NY,1,0,...,52409.0,1629153.0,2.07,0.535047,Dem,0.768507,New York,NY,Northeast,Middle Atlantic
2,New York University,27,NY,New York,10012,22.0,Private,NY,1,0,...,52409.0,1629153.0,2.07,0.535047,Dem,0.768507,New York,NY,Northeast,Middle Atlantic
3,Fordham University,67,NY,New York,10023,15.0,Private,NY,1,0,...,52409.0,1629153.0,2.07,0.535047,Dem,0.768507,New York,NY,Northeast,Middle Atlantic
4,Yeshiva University,73,NY,New York,10033,33.0,Private,NY,0,0,...,52409.0,1629153.0,2.07,0.535047,Dem,0.768507,New York,NY,Northeast,Middle Atlantic


First, drop the columns that won't be used as features in my model. Drop names and locations (I already extracted data from them) and ```state_pol```, which came from the Chronicle data and is superseded by the political data I found myself. Also drop the employee vaccination records, since I'm trying to predict the original student vaccination requirement. The ```booster``` column stays for now and is excluded from the features later. I will keep ```all_students_vacc``` instead of ```res_students_vacc```, as it is less dependent on each college's dorm situation. Finally, drop ```Division```: with this little data, the broader ```Region``` category will probably work better.

In [75]:
vacc_data.drop(columns=['College', 'state', 'city', 'zip', 'State_x', 'state_fips', 
                        'STCOUNTYFP', 'county_fips', 'county_fips_str', 'state_pol', 
                        'State_y', 'State Code', 'Division',
                        'all_employee_vacc', 'some_employee_vacc', 'res_students_vacc'], 
               inplace=True)

In [76]:
vacc_data.head()

Unnamed: 0,ranking,announce_date,Type,all_students_vacc,booster,median_income,total_population,avg_hhsize,avg_community_level,political_control_state,county_vote_diff,Region
0,0,19.0,Private,1,1,37223.0,368085.0,2.46,0.62766,Dem,0.374257,Northeast
1,1,18.0,Private,1,1,52409.0,1629153.0,2.07,0.535047,Dem,0.768507,Northeast
2,27,22.0,Private,1,1,52409.0,1629153.0,2.07,0.535047,Dem,0.768507,Northeast
3,67,15.0,Private,1,1,52409.0,1629153.0,2.07,0.535047,Dem,0.768507,Northeast
4,73,33.0,Private,1,0,52409.0,1629153.0,2.07,0.535047,Dem,0.768507,Northeast


Now, I need to choose my target variable. There are two possibilities: a categorical variable for vaccination status, or a continuous variable for the number of days after the first announcement.

In [72]:
vacc_data['all_students_vacc'].value_counts()

1    161
0     14
Name: all_students_vacc, dtype: int64

In [78]:
vacc_data[['all_students_vacc', 'booster']].value_counts()

all_students_vacc  booster
1                  0          108
                   1           53
0                  0           14
dtype: int64

In [73]:
vacc_data['announce_date'].isna().sum()

15

In [82]:
vacc_data.dropna(subset=['announce_date'], inplace=True)

With only 14 colleges not requiring the vaccine, there's likely not enough data to train a good classifier. So, I can drop these and instead fit a regression that estimates how long it takes a college to institute a vaccination requirement, under the assumption that it will make one. This is less useful, but it can help gauge each college's decision time and urgency regarding its requirement.

Note that in the future, I can also try to predict if a college that already had a vaccine requirement will require a booster.
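Before relying on the ```dropna``` step to remove the non-mandate colleges, it may be worth checking how missing dates line up with the mandate flag. The sketch below uses a small invented frame (not the real ```vacc_data```) and a hypothetical ```mandate_without_date``` counter, purely to illustrate the check.

```python
import numpy as np
import pandas as pd

# Toy stand-in for vacc_data; the values are invented for illustration only.
toy = pd.DataFrame({
    'all_students_vacc': [1, 1, 1, 0, 0, 1],
    'announce_date': [19.0, 18.0, np.nan, np.nan, np.nan, 22.0],
})

# Label each row by whether its announcement date is missing.
date_status = toy['announce_date'].isna().map({True: 'missing', False: 'present'})

# Rows: mandate flag. Columns: date status.
overlap = pd.crosstab(toy['all_students_vacc'], date_status)
print(overlap)

# Colleges with a mandate but no recorded date are also removed by dropna.
mandate_without_date = int(
    ((toy['all_students_vacc'] == 1) & toy['announce_date'].isna()).sum()
)
print(mandate_without_date)
```

If the count of mandate rows without a date is nonzero on the real data, the regression set is slightly smaller than "all colleges with a mandate".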

In [122]:
target = vacc_data['announce_date'] # Note: is continuous
features = vacc_data.drop(columns=['announce_date', 'all_students_vacc', 'booster'])

Now make sure all the variables have the correct type.

In [123]:
vacc_data.dtypes

ranking                      int64
announce_date              float64
Type                        object
all_students_vacc            int64
booster                      int64
median_income              float64
total_population           float64
avg_hhsize                 float64
avg_community_level        float64
political_control_state     object
county_vote_diff           float64
Region                      object
dtype: object

In [124]:
vacc_data.columns

Index(['ranking', 'announce_date', 'Type', 'all_students_vacc', 'booster',
       'median_income', 'total_population', 'avg_hhsize',
       'avg_community_level', 'political_control_state', 'county_vote_diff',
       'Region'],
      dtype='object')

Next, use one-hot encoding for the categorical variables. Note that my preprocessing is based on [this sklearn course](https://inria.github.io/scikit-learn-mooc/python_scripts/03_categorical_pipeline_column_transformer.html).

In [125]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
categorical_preprocessor = OneHotEncoder(handle_unknown='ignore')
numerical_preprocessor = StandardScaler() # standardize numerical features (zero mean, unit variance) so they share a scale

Testing on one column.

In [126]:
type_encoded = categorical_preprocessor.fit_transform(vacc_data[['Type']])
print(type_encoded.toarray()[:5])

[[1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]
 [1. 0.]]
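The printed array does not say which column stands for which category. A minimal sketch, using an invented ```Type``` column rather than the real data, shows how the fitted encoder's ```categories_``` attribute gives that mapping and what ```handle_unknown='ignore'``` does with a category that was not seen during fitting.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for vacc_data[['Type']]; the values are illustrative.
toy = pd.DataFrame({'Type': ['Private', 'Private', 'Public', 'Private']})

encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(toy).toarray()

# categories_ holds one array per input column, in output-column order.
type_categories = list(encoder.categories_[0])
print(type_categories)
print(encoded)

# With handle_unknown='ignore', an unseen category becomes an all-zero row.
# 'For-profit' is a made-up category used only to trigger that behavior.
unseen = encoder.transform(pd.DataFrame({'Type': ['For-profit']})).toarray()
print(unseen)
```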


With a ```ColumnTransformer```, apply the encoder to all categorical features and standardize the numerical features.

In [135]:
categorical_columns = ['Type', 'political_control_state', 'Region']
numerical_columns = list(set(features.columns).difference(categorical_columns))

In [140]:
from sklearn.compose import ColumnTransformer # splits the columns into subsets, transforms each subset differently, then concatenates the results
preprocessor = ColumnTransformer([('one-hot-encoder', categorical_preprocessor, categorical_columns),
                                  ('standard_scaler', numerical_preprocessor, numerical_columns)])
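To see what the combined preprocessor produces, the sketch below fits the same kind of ```ColumnTransformer``` on a four-row invented frame. It also builds the numerical column list with a list comprehension, which keeps the DataFrame's column order; that is a swapped-in alternative to the set difference used above, since a set does not guarantee an order.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the features; the values are illustrative only.
toy = pd.DataFrame({
    'Type': ['Private', 'Public', 'Private', 'Public'],
    'Region': ['Northeast', 'South', 'West', 'South'],
    'ranking': [0, 27, 67, 361],
    'median_income': [37223.0, 52409.0, 27231.0, 31110.0],
})
categorical_columns = ['Type', 'Region']
# Keeps the original column order, unlike a set difference.
numerical_columns = [c for c in toy.columns if c not in categorical_columns]

preprocessor = ColumnTransformer([
    ('one-hot-encoder', OneHotEncoder(handle_unknown='ignore'), categorical_columns),
    ('standard_scaler', StandardScaler(), numerical_columns),
])
transformed = preprocessor.fit_transform(toy)
# The result can be sparse depending on its density; densify for inspection.
if hasattr(transformed, 'toarray'):
    transformed = transformed.toarray()

# 2 Type columns + 3 Region columns + 2 scaled numerical columns.
print(transformed.shape)
# The scaled columns come last and have (approximately) zero mean.
print(transformed[:, -2:].mean(axis=0))
```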

Now split into training and testing sets. By default, ```train_test_split``` shuffles the rows before splitting and holds out 25% of them for testing.

In [141]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=42)
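With roughly 160 rows left after dropping missing dates, a single held-out split contains only about 40 colleges, so its score can swing a lot with the random seed. One possible alternative is k-fold cross-validation. The sketch below is an assumption-laden illustration: the data are synthetic (generated with an invented linear relationship), and only the overall pattern of wrapping the pipeline in ```cross_val_score``` carries over.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for features/target; the relationship is made up.
rng = np.random.default_rng(42)
n = 160
X = pd.DataFrame({
    'Type': rng.choice(['Private', 'Public'], size=n),
    'ranking': rng.integers(0, 380, size=n),
    'county_vote_diff': rng.uniform(-0.5, 0.8, size=n),
})
y = 20 + 0.05 * X['ranking'] - 10 * X['county_vote_diff'] + rng.normal(0, 5, size=n)

preprocessor = ColumnTransformer([
    ('one-hot-encoder', OneHotEncoder(handle_unknown='ignore'), ['Type']),
    ('standard_scaler', StandardScaler(), ['ranking', 'county_vote_diff']),
])
pipe = make_pipeline(preprocessor, LinearRegression())

# Shuffle before splitting, since the rows may be ordered (e.g., by ranking).
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv)  # one R^2 value per fold
print(scores.mean(), scores.std())
```

The spread across folds gives a rough sense of how much a single split's score should be trusted.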

In [147]:
features

Unnamed: 0,ranking,Type,median_income,total_population,avg_hhsize,avg_community_level,political_control_state,county_vote_diff,Region
0,0,Private,37223.0,368085.0,2.46,0.627660,Dem,0.374257,Northeast
1,1,Private,52409.0,1629153.0,2.07,0.535047,Dem,0.768507,Northeast
2,27,Private,52409.0,1629153.0,2.07,0.535047,Dem,0.768507,Northeast
3,67,Private,52409.0,1629153.0,2.07,0.535047,Dem,0.768507,Northeast
4,73,Private,52409.0,1629153.0,2.07,0.535047,Dem,0.768507,Northeast
...,...,...,...,...,...,...,...,...,...
170,361,Private,27231.0,181014.0,2.11,0.632275,Rep,-0.205932,South
171,363,Public,31110.0,243692.0,2.43,0.482940,Div,-0.335871,South
172,369,Public,33377.0,315389.0,2.98,0.497409,Dem,-0.222495,West
173,379,Public,28467.0,430319.0,2.14,0.633075,Rep,0.177848,Midwest


## Modeling

I'll start with ordinary least-squares linear regression.

In [142]:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(preprocessor, LinearRegression())
pipe.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('one-hot-encoder',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['Type',
                                                   'political_control_state',
                                                   'Region']),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  ['avg_community_level',
                                                   'avg_hhsize',
                                                   'county_vote_diff',
                                                   'median_income', 'ranking',
                                                   'total_population'])])),
                ('linearregression', LinearRegression())])

In [143]:
pipe.score(X_test, y_test)

0.3050603083911495
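For a regressor, ```score``` returns the coefficient of determination $R^2$ on the given data, so a value near 0.3 means the model explains about 30% of the variance in the held-out announcement dates. To put such a number in context, it can help to compare against a mean-only baseline and to report an error in days. A minimal sketch on synthetic data (the numbers it prints say nothing about the real dataset):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data with an invented linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(160, 3))
y = 25 + 4 * X[:, 0] + rng.normal(0, 6, size=160)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LinearRegression().fit(X_train, y_train)
# Baseline that always predicts the training mean.
baseline = DummyRegressor(strategy='mean').fit(X_train, y_train)

# score() and r2_score() compute the same quantity.
r2_model = r2_score(y_test, model.predict(X_test))
r2_baseline = baseline.score(X_test, y_test)

# Mean absolute error is in the target's units (days, for announce_date).
mae_model = mean_absolute_error(y_test, model.predict(X_test))
mae_baseline = mean_absolute_error(y_test, baseline.predict(X_test))
print(r2_model, r2_baseline, mae_model, mae_baseline)
```

A constant prediction can never score above zero on $R^2$, which makes the baseline a useful floor.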

Maybe the model should output a rank instead (i.e., which universities acted first)? Also, the ```ranking``` feature should probably be transformed, since its relationship with the target may not be linear.
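Both ideas can be tried cheaply. The sketch below is a hypothetical illustration, not a result: the synthetic target is deliberately built to depend on the logarithm of the ranking, so the log-transformed model winning here is by construction. It shows one way to apply ```np.log1p``` inside a pipeline and to judge predictions by rank order using a Spearman correlation (computed as the Pearson correlation of ranks).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Synthetic data: the target depends on log(1 + ranking) by construction.
rng = np.random.default_rng(1)
ranking = rng.integers(0, 380, size=160).reshape(-1, 1)
y = 10 + 5 * np.log1p(ranking[:, 0]) + rng.normal(0, 1, size=160)

X_train, X_test, y_train, y_test = train_test_split(ranking, y, random_state=42)

raw_model = LinearRegression().fit(X_train, y_train)
# log1p handles a ranking of 0 (log1p(0) == 0).
log_model = make_pipeline(FunctionTransformer(np.log1p), LinearRegression())
log_model.fit(X_train, y_train)

r2_raw = raw_model.score(X_test, y_test)
r2_log = log_model.score(X_test, y_test)

# Spearman correlation: do the predictions order the colleges correctly?
pred_ranks = pd.Series(log_model.predict(X_test)).rank()
true_ranks = pd.Series(y_test).rank()
spearman = pred_ranks.corr(true_ranks)
print(r2_raw, r2_log, spearman)
```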

## Exploring a Different Task
*TODO* Can we create a classification model that tells us whether or not a university has a vaccine mandate? Is there even enough data to do this? I will try to get the data from another source. This will be a more interesting task, as it has more data and is not subject to the first model's problem of handling a university that didn't make one of the recorded COVID decisions. Also, it may provide better results.

First, I'll train a model to separate the universities that required the vaccine from those that didn't. Then, if the data I find are rich enough, I may try a multi-class approach with three options: no mandate, a regular mandate, and a booster mandate.
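With the current counts (161 mandate vs. 14 no mandate), a classifier that always predicts "mandate" would already be about 92% accurate, so plain accuracy is a misleading target. A rough sketch of one way to handle this, assuming class weighting and stratified folds are acceptable choices; the features are synthetic and only the class counts mirror the notebook:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Labels mirror the observed imbalance; the features are invented.
rng = np.random.default_rng(0)
y = np.array([0] * 14 + [1] * 161)
X = rng.normal(size=(175, 3)) + y[:, None] * 1.0  # shift class 1 by one unit

# Accuracy of always predicting the majority class.
majority_accuracy = np.bincount(y).max() / len(y)

# class_weight='balanced' reweights classes inversely to their frequency.
clf = make_pipeline(StandardScaler(), LogisticRegression(class_weight='balanced'))

# Stratified folds keep some no-mandate colleges in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring='balanced_accuracy')
print(majority_accuracy, scores.mean())
```

Balanced accuracy averages the recall of each class, so the always-"mandate" predictor scores 0.5 on it rather than 0.92.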

## TODO
Find papers that illustrate where COVID-related sentiment comes from and what factors may cause people/administrations to act as they do.