## Goal
Predict how long it takes for a college to implement a vaccine requirement.

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Preprocessing and Feature Engineering
Some of this was done in a previous notebook, but most will be done here. I'm using the cleaned vaccine data from my 'Vaccine Mandates' notebook.

In [114]:
vacc_data = pd.read_csv('vacc_mandates_cleaned_school.csv')
vacc_data.head()

| | College | ranking | state | city | zip | announce_date | Type | State_x | all_employee_vacc | some_employee_vacc | ... | Region | Division | zip_str | cleaned_name_list | 2020.student.size | school.name | school.zip | id | cleaned_school.name_list | name_similarity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adelphi University | 171 | NY | Garden City | 11530 | 116.0 | Private | NY | 0 | 0 | ... | Northeast | Middle Atlantic | 11530 | ['adelphi'] | 5076.0 | Adelphi University | 11530 | 188429.0 | ['adelphi'] | 1.0 |
| 1 | American University | 78 | DC | Washington | 20016 | 13.0 | Private | DC | 1 | 0 | ... | South | South Atlantic | 20016 | ['american'] | 7510.0 | American University | 20016 | 131159.0 | ['american'] | 1.0 |
| 2 | Arizona State University | 116 | AZ | Tempe | 85287 | 197.0 | Public | AZ | 1 | 0 | ... | West | Mountain | 85287 | ['arizona', 'state'] | 62633.0 | Arizona State University Campus Immersion | 85287 | 104151.0 | ['arizona', 'state', 'immersion'] | 0.816497 |
| 3 | Aurora University | 302 | IL | Aurora | 60506 | 139.0 | Private | IL | 1 | 0 | ... | Midwest | East North Central | 60506 | ['aurora'] | 4123.0 | Aurora University | 60506 | 143118.0 | ['aurora'] | 1.0 |
| 4 | Bellarmine University | 202 | KY | Louisville | 40205 | NaN | Private | KY | 1 | 0 | ... | South | East South Central | 40205 | ['bellarmine'] | 2431.0 | Bellarmine University | 40205 | 156286.0 | ['bellarmine'] | 1.0 |


First, drop unnecessary columns (i.e., ones that won't be used as features in my model). I drop names and locations (I already extracted data from them) and `state_pol`, which came from the Chronicle data; I found different political data myself. I also drop the employee vaccination records, since I'm trying to predict the original student requirement; the `booster` column stays for now and is removed from the features later. I will consider `all_students_vacc` instead of `res_students_vacc`, as it is less dependent on each college's dorm situation. Finally, I drop `Division`: I don't have much data, so the broader `Region` category will probably work better.

In [115]:
vacc_data.drop(columns=['College', 'state', 'city', 'zip', 'State_x', 'state_fips', 
                        'STCOUNTYFP', 'county_fips', 'county_fips_str', 'state_pol', 
                        'State_y', 'State Code', 'Division', 'zip_str', 'cleaned_name_list',
                        'school.name', 'school.zip', 'id', 'cleaned_school.name_list', 'name_similarity',
                        'all_employee_vacc', 'some_employee_vacc', 'res_students_vacc'], 
               inplace=True)

In [116]:
vacc_data.head()

| | ranking | announce_date | Type | all_students_vacc | booster | median_income | total_population | avg_hhsize | avg_community_level | political_control_state | county_vote_diff | Region | 2020.student.size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 171 | 116.0 | Private | 1 | 0 | 47254.0 | 1355683.0 | 2.62 | 0.624352 | Dem | 0.059304 | Northeast | 5076.0 |
| 1 | 78 | 13.0 | Private | 1 | 1 | 52328.0 | 701974.0 | 2.19 | 0.726562 | Dem | 0.867763 | South | 7510.0 |
| 2 | 116 | 197.0 | Public | 0 | 0 | 34032.0 | 4412779.0 | 2.64 | 0.492925 | Rep | -0.028354 | West | 62633.0 |
| 3 | 302 | 139.0 | Private | 1 | 0 | 35433.0 | 531756.0 | 2.81 | 0.568831 | Dem | 0.105015 | Midwest | 4123.0 |
| 4 | 202 | NaN | Private | 1 | 0 | 32123.0 | 768419.0 | 2.25 | 1.062827 | Div | 0.1333 | South | 2431.0 |


As rankings are tied from 299-391 (see "Vaccine Mandates.ipynb") and a change in ranking won't usually signal a large change in a college's operations unless it moves to a new echelon, I will place the rankings into bins. The bin edges are fairly arbitrary apart from the tied 299-391 group. I isolated the top 20 because they are often spoken of differently. Then, I separated the rest of the top 100 (another arbitrary cut, but one that makes sense when talking about top colleges).

In [117]:
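# Note: right=False makes the bins left-closed ([0, 20), [20, 100), ...), so rank 20 lands in 'b' and rank 298 in 'e'.
vacc_data['ranking'] = pd.cut(vacc_data['ranking'], bins=[0, 20, 100, 200, 298, 400], labels=['a', 'b', 'c', 'd', 'e'], right=False)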
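
To see how the `right` argument of `pd.cut` decides which bin an edge value falls into, here is a minimal sketch. The ranks are hypothetical values picked to sit on the bin edges; they are not rows from the dataset.

```python
import pandas as pd

# Hypothetical ranks chosen to sit on or next to the bin edges used above.
ranks = pd.Series([1, 19, 20, 100, 298, 299, 391])
bins = [0, 20, 100, 200, 298, 400]
labels = ['a', 'b', 'c', 'd', 'e']

# right=False: bins are [0, 20), [20, 100), [100, 200), [200, 298), [298, 400)
left_closed = pd.cut(ranks, bins=bins, labels=labels, right=False)
# default (right=True): bins are (0, 20], (20, 100], (100, 200], (200, 298], (298, 400]
right_closed = pd.cut(ranks, bins=bins, labels=labels)

print(pd.DataFrame({'rank': ranks,
                    'right=False': left_closed,
                    'right=True': right_closed}))
```

If "top 20" is meant to include rank 20, and the tied group is meant to start at 299, the default `right=True` matches that reading more closely. Switching would change the downstream scores slightly.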

Now, I need to figure out my target variable. Explore the two possibilities: a categorical variable for vaccination status, or a continuous variable for days after the first announcement.

In [96]:
vacc_data['all_students_vacc'].value_counts()

1    141
0     12
Name: all_students_vacc, dtype: int64

In [98]:
vacc_data[['all_students_vacc', 'booster']].value_counts()

all_students_vacc  booster
1                  0          99
                   1          42
0                  0          12
dtype: int64

In [99]:
vacc_data['announce_date'].isna().sum()

14

In [100]:
vacc_data.dropna(subset=['announce_date'], inplace=True)

Only 12 colleges lack an all-student requirement (and 14 have no recorded announcement date), so there's likely not enough data to train a good classifier. Instead, I drop the rows with no announcement date and fit a regression estimating how long it takes a college to institute a vaccination requirement, under the assumption that it will make one. This is less useful, but it can help gauge each college's decision time and urgency regarding its requirement.

Note that in the future, I can also try to predict if a college that already had a vaccine requirement will require a booster.

In [101]:
target_date = vacc_data['announce_date'] # Note: is continuous
features_date = vacc_data.drop(columns=['announce_date', 'all_students_vacc', 'booster'])
target_booster = vacc_data['booster'] # Note: is binary (0/1), not continuous
features_booster = vacc_data.drop(columns=['all_students_vacc', 'booster'])

Now make sure all the vars are the correct type.

In [102]:
vacc_data.dtypes

ranking                    category
announce_date               float64
Type                         object
all_students_vacc             int64
booster                       int64
median_income               float64
total_population            float64
avg_hhsize                  float64
avg_community_level         float64
political_control_state      object
county_vote_diff            float64
Region                       object
2020.student.size           float64
dtype: object

In [103]:
vacc_data.columns

Index(['ranking', 'announce_date', 'Type', 'all_students_vacc', 'booster',
       'median_income', 'total_population', 'avg_hhsize',
       'avg_community_level', 'political_control_state', 'county_vote_diff',
       'Region', '2020.student.size'],
      dtype='object')

## Predicting Dates

Next, use one-hot encoding for the categorical variables. Note that my preprocessing is copied/based on [this sklearn course](https://inria.github.io/scikit-learn-mooc/python_scripts/03_categorical_pipeline_column_transformer.html).

In [104]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
categorical_preprocessor = OneHotEncoder(drop='first') # drop to avoid multicollinearity
numerical_preprocessor = StandardScaler() # standardize to zero mean and unit variance so features are on a comparable scale

Testing on one column.

In [105]:
type_encoded = categorical_preprocessor.fit_transform(vacc_data[['Type']])
print(type_encoded.toarray()[:5])

[[0.]
 [0.]
 [1.]
 [0.]
 [0.]]


With a pipeline, apply to all categorical features. Also, standardize numerical features.

In [106]:
categorical_columns = ['ranking', 'Type', 'political_control_state', 'Region']
numerical_columns = list(set(features_date.columns).difference(categorical_columns))

In [107]:
from sklearn.compose import ColumnTransformer # splits the column, transforms each subset differently, then concatenates
preprocessor = ColumnTransformer([('one-hot-encoder', categorical_preprocessor, categorical_columns),
                                  ('standard_scaler', numerical_preprocessor, numerical_columns)])
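
As a sanity check on what the `ColumnTransformer` produces, here is a sketch on a made-up four-row frame. The column names are borrowed from the data, the values are invented, and `sparse_threshold=0` is set only to force a dense array for easy inspection.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented rows; only the column names come from the notebook.
toy = pd.DataFrame({
    'Type': ['Private', 'Public', 'Private', 'Public'],
    'Region': ['South', 'West', 'Midwest', 'South'],
    'median_income': [47000.0, 34000.0, 35000.0, 52000.0],
})

toy_preprocessor = ColumnTransformer(
    [('one-hot-encoder', OneHotEncoder(drop='first'), ['Type', 'Region']),
     ('standard_scaler', StandardScaler(), ['median_income'])],
    sparse_threshold=0,  # always return a dense array
)
transformed = toy_preprocessor.fit_transform(toy)

# 1 column for Type (2 levels - 1), 2 for Region (3 levels - 1), 1 scaled numeric column
print(transformed.shape)
print(transformed)
```

The output columns follow the order of the transformers, so the one-hot columns come first and the scaled numeric columns last.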

Now split into training and testing sets (`train_test_split` shuffles the rows by default).

In [108]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features_date, target_date, random_state=42)
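
With no `test_size` given, `train_test_split` holds out 25% of the rows. The sketch below uses placeholder arrays of 139 rows, assuming 153 colleges minus the 14 dropped for a missing `announce_date` as the outputs above suggest, to show how small the resulting test set is.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the assumed row count (153 - 14 = 139); values are meaningless.
X_demo = np.arange(139).reshape(-1, 1)
y_demo = np.arange(139)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
print(len(X_tr), len(X_te))
```

A test set this small makes any single score noisy, which is worth keeping in mind when comparing models below.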

## Modeling

I'll use ordinary least-squares linear regression first.

In [109]:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(preprocessor, LinearRegression())
pipe.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('one-hot-encoder',
                                                  OneHotEncoder(drop='first'),
                                                  ['ranking', 'Type',
                                                   'political_control_state',
                                                   'Region']),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  ['avg_hhsize',
                                                   'median_income',
                                                   'total_population',
                                                   'county_vote_diff',
                                                   'avg_community_level',
                                                   '2020.student.size'])])),
                ('linearregression', Lin

In [110]:
pipe.score(X_test, y_test)

0.4361647632216755
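
For a regressor, `score` returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot, so the value above means the model explains roughly 44% of the variance in the test targets. A minimal sketch with made-up numbers (not the notebook's predictions) computes it by hand and compares against `sklearn.metrics.r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up announcement days and predictions, for illustration only.
y_true = np.array([13.0, 116.0, 139.0, 197.0])
y_pred = np.array([40.0, 100.0, 150.0, 170.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
manual_r2 = 1 - ss_res / ss_tot

print(manual_r2, r2_score(y_true, y_pred))
```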

Add regularization. Let's try Lasso, as some features may not be important, using cross-validation to select the hyperparameter.

In [111]:
from sklearn.linear_model import LassoCV
pipe = make_pipeline(preprocessor, LassoCV())
pipe.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('one-hot-encoder',
                                                  OneHotEncoder(drop='first'),
                                                  ['ranking', 'Type',
                                                   'political_control_state',
                                                   'Region']),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  ['avg_hhsize',
                                                   'median_income',
                                                   'total_population',
                                                   'county_vote_diff',
                                                   'avg_community_level',
                                                   '2020.student.size'])])),
                ('lassocv', LassoCV())])

In [112]:
pipe.score(X_test, y_test)

0.40396006163919096
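
To check whether Lasso actually discarded any features, the fitted `LassoCV` step exposes the selected penalty (`alpha_`) and the coefficients (`coef_`). The sketch below uses synthetic data, not the college data, in which only two of six columns carry signal.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two columns drive the target, the rest are noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 6))
y_demo = 30 * X_demo[:, 0] - 20 * X_demo[:, 1] + rng.normal(scale=5, size=120)

demo_pipe = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
demo_pipe.fit(X_demo, y_demo)

lasso = demo_pipe.named_steps['lassocv']   # make_pipeline names steps by lowercased class name
print('chosen alpha:', lasso.alpha_)
print('coefficients:', lasso.coef_)
```

On the real pipeline the same lookup would be `pipe.named_steps['lassocv']`. Coefficients shrunk to exactly zero mark features the model dropped.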

Also not working too well. Try ensemble learning.
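
One possible ensemble to try is a random forest, scored with cross-validation rather than a single split because the dataset is small. This is only a sketch on synthetic data; the choice of `RandomForestRegressor`, 200 trees, and 5 folds are assumptions, not settings from this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed features and the days-to-announcement target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(139, 6))
y_demo = 100 + 30 * X_demo[:, 0] + rng.normal(scale=10, size=139)

forest = RandomForestRegressor(n_estimators=200, random_state=42)
scores = cross_val_score(forest, X_demo, y_demo, cv=5, scoring='r2')

print(scores.mean(), scores.std())
```

On the real data, the forest could replace the final step, e.g. `make_pipeline(preprocessor, RandomForestRegressor(random_state=42))`, and be passed to `cross_val_score` with `features_date` and `target_date`.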

Maybe the model should output a rank instead (i.e., which universities acted first)?
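
If only the ordering matters, one way to evaluate it is a rank correlation between predicted and actual days. The sketch below computes Spearman's coefficient as the Pearson correlation of the ranks, on made-up values; it is an illustration of the idea, not a result from this model.

```python
import pandas as pd

# Made-up actual and predicted days-to-announcement for five colleges.
actual_days = pd.Series([13.0, 116.0, 139.0, 197.0, 60.0])
predicted_days = pd.Series([40.0, 150.0, 100.0, 170.0, 90.0])

# Rank 1 = acted first. Spearman's rho is the Pearson correlation of the ranks.
actual_rank = actual_days.rank()
predicted_rank = predicted_days.rank()
rank_corr = actual_rank.corr(predicted_rank)

print(pd.DataFrame({'actual_rank': actual_rank, 'predicted_rank': predicted_rank}))
print(rank_corr)
```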

## Predicting Booster
*TODO* Can we create a classification model that tells us whether or not a university has a vaccine mandate? Is there even enough data to do this? I will try to get the data from another source. This would be a more interesting task, as it would have more data and would not share the first model's problem when it encounters a university that didn't make one of the recorded COVID decisions. It may also provide better results.

First, I'll train a model to classify the universities that required the vaccine and those that didn't. Then, I may try a multi-class classification approach if the data I find are rich enough, with three classes: no mandate, a regular mandate, and a booster mandate.
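
The three-class target could be built from the two existing indicator columns. A minimal sketch on a toy frame follows; the column name `mandate_level` and the label strings are hypothetical, and it treats any booster requirement as the booster class.

```python
import numpy as np
import pandas as pd

# Toy rows covering the three combinations seen in the value counts above.
toy = pd.DataFrame({'all_students_vacc': [1, 1, 0],
                    'booster':           [0, 1, 0]})

# Conditions are checked in order, so a booster mandate takes precedence.
conditions = [toy['booster'] == 1, toy['all_students_vacc'] == 1]
toy['mandate_level'] = np.select(conditions, ['booster', 'regular'], default='none')

print(toy)
```

The resulting classes would be heavily imbalanced on the current data (12 with no mandate), so stratified splits and a metric other than plain accuracy may be worth considering.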

## TODO
Find papers that illustrate where COVID-related sentiment comes from and what factors may cause people/administrations to act as they do.

## Sources
- https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html