In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Preprocessing

Next, one-hot encode the categorical variables. My preprocessing is adapted from [this sklearn course](https://inria.github.io/scikit-learn-mooc/python_scripts/03_categorical_pipeline_column_transformer.html).

Note that only 23 states are present in this dataset:

In [214]:
len(covid_dates_cleaned['state'].unique())

23

Drop the `zip` column, since we already extracted data from it.

In [320]:
covid_dates_data = covid_dates_cleaned.drop(columns="zip").dropna() # drop rows if NaN
covid_dates_target = covid_dates_data["days_after_first"]
covid_dates_data.drop(columns="days_after_first", inplace=True)

In [321]:
covid_dates_data.dtypes

ranking                     int64
state                      object
ivy                          bool
institution_type           object
decision_type              object
political_control_state    object
dtype: object

In [322]:
from sklearn.compose import make_column_selector as selector # separate columns into numerical and categorical
numerical_columns_selector = selector(dtype_exclude=[object, bool])
categorical_columns_selector = selector(dtype_include=[object, bool])

numerical_columns = numerical_columns_selector(covid_dates_data)
categorical_columns = categorical_columns_selector(covid_dates_data)

In [323]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
categorical_preprocessor = OneHotEncoder(handle_unknown='ignore')
numerical_preprocessor = StandardScaler() # standardize to zero mean and unit variance to make the data easier for sklearn models to handle
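
To see what `handle_unknown='ignore'` does, here is a minimal sketch on made-up data (the toy frames and state codes below are illustrative and do not come from `covid_dates_cleaned`). A category that was not seen during `fit` is encoded as a row of all zeros instead of raising an error. That matters here because, with only a few rows per state, the test split may contain a state that the training split lacks.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "TX" appears only in the new data, never during fit.
toy_train = pd.DataFrame({"state": ["CA", "NY", "CA"]})
toy_new = pd.DataFrame({"state": ["NY", "TX"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(toy_train)

# The encoder returns a sparse matrix by default; convert it for display.
encoded = encoder.transform(toy_new).toarray()
print(encoder.categories_)
print(encoded)  # the "TX" row contains only zeros
```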

In [324]:
from sklearn.compose import ColumnTransformer # splits the columns into subsets, transforms each subset differently, then concatenates the results
preprocessor = ColumnTransformer([('one-hot-encoder', categorical_preprocessor, categorical_columns),
                                  ('standard_scalar', numerical_preprocessor, numerical_columns)])
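
As a quick check of what a `ColumnTransformer` like this produces, the sketch below runs the same two kinds of transformer on a small invented frame (the `toy` data and the explicit column lists are assumptions for illustration). The one-hot columns come first, followed by the standardized numeric column, matching the order of the transformer list.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented data: one categorical column and one numerical column.
toy = pd.DataFrame({
    "state": ["CA", "NY", "TX", "CA"],
    "ranking": [1, 50, 100, 150],
})

toy_preprocessor = ColumnTransformer([
    ("one-hot-encoder", OneHotEncoder(handle_unknown="ignore"), ["state"]),
    ("standard_scalar", StandardScaler(), ["ranking"]),
])

transformed = toy_preprocessor.fit_transform(toy)
# The result can be sparse or dense depending on its density, so normalize it.
if hasattr(transformed, "toarray"):
    transformed = transformed.toarray()
transformed = np.asarray(transformed)

print(transformed.shape)  # 3 one-hot columns plus 1 scaled column
print(transformed)
```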

In [325]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(covid_dates_data, covid_dates_target)
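
`train_test_split` shuffles the rows randomly and, by default, holds out 25% of them for testing, so the score changes from run to run. Below is a minimal sketch of making the split reproducible, assuming a fixed seed is acceptable. The `random_state=0` value and the toy arrays are arbitrary choices and are not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the feature table and the target.
toy_X = np.arange(16).reshape(8, 2)
toy_y = np.arange(8)

# The same random_state gives the same shuffle on every call.
split_a = train_test_split(toy_X, toy_y, random_state=0)
split_b = train_test_split(toy_X, toy_y, random_state=0)

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te))
print(np.array_equal(y_te, split_b[3]))
```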

## Modeling
Choose a good one!

I'll start with ordinary linear regression.

In [326]:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(preprocessor, LinearRegression())
pipe.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('one-hot-encoder',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['state', 'ivy',
                                                   'institution_type',
                                                   'decision_type',
                                                   'political_control_state']),
                                                 ('standard_scalar',
                                                  StandardScaler(),
                                                  ['ranking'])])),
                ('linearregression', LinearRegression())])

In [327]:
pipe.score(X_test, y_test)

0.3152768081719818

With my fairly arbitrary imputation method, I get an R² score of about -0.003, while dropping the rows with NaNs gives a nicer result of about 0.3.
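
For a regressor, `pipe.score` returns the coefficient of determination R². A model that always predicts the mean of the targets scores 0 and anything worse scores negative, so a score of about -0.003 means that version did no better than a constant prediction. The sketch below uses invented numbers to compute R² by hand and compare it with `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Invented targets and two sets of predictions.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_mean_pred = np.full_like(y_true, y_true.mean())  # always predict the mean
y_good_pred = np.array([12.0, 18.0, 33.0, 37.0])   # close to the targets

r2_mean = r2_manual(y_true, y_mean_pred)
r2_good = r2_manual(y_true, y_good_pred)
print(r2_mean, r2_good)
print(r2_score(y_true, y_good_pred))
```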

Maybe the model should output a rank (i.e., which universities acted first)? Also, the `ranking` feature should probably be transformed, since its relationship with the target is likely not linear.
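
One way to handle a non-linear `ranking` relationship is to transform the feature before scaling it. The sketch below is only an illustration under loud assumptions: the data are synthetic, and the logarithmic shape of the relationship is invented for the demonstration, not something established about the real dataset. It compares a plain `StandardScaler` pipeline with one that applies `np.log1p` first.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic data (assumption): the target grows with the log of the ranking.
rng = np.random.default_rng(0)
toy_ranking = np.arange(1, 201, dtype=float).reshape(-1, 1)
toy_days = 50 * np.log1p(toy_ranking[:, 0]) + rng.normal(scale=5, size=200)

raw_pipe = make_pipeline(StandardScaler(), LinearRegression())
log_pipe = make_pipeline(
    FunctionTransformer(np.log1p),  # log-transform before scaling
    StandardScaler(),
    LinearRegression(),
)

# Scores are computed on the training data, purely to compare the two fits.
raw_score = raw_pipe.fit(toy_ranking, toy_days).score(toy_ranking, toy_days)
log_score = log_pipe.fit(toy_ranking, toy_days).score(toy_ranking, toy_days)
print(f"raw ranking R^2: {raw_score:.3f}")
print(f"log ranking R^2: {log_score:.3f}")
```

In the real pipeline, a step like this would take the place of `numerical_preprocessor`. Whether a log transform, binning, or splines fits better would need to be checked against the actual data.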

## Exploring a Different Task
*TODO* Can we create a classification model that predicts whether or not a university has a vaccine mandate? Is there even enough data to do this? I will try to get the data from another source. This should be a more interesting task: it has more data, and it avoids the first model's problem with universities that didn't make one of the recorded COVID decisions. It may also provide better results.

First, I'll train a model to classify universities by whether or not they required the vaccine. Then, if the data I find are rich enough, I may try a multi-class classification approach with three classes: no mandate, a regular mandate, and a booster mandate.
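
A rough sketch of how the three-class version could reuse the same preprocessing pattern follows. Everything in it is hypothetical: the `toy_mandates` rows, the `mandate` label column, and the choice of `LogisticRegression` are placeholders until real mandate data is found.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented rows; the real data would come from another source.
toy_mandates = pd.DataFrame({
    "ranking": [1, 5, 20, 50, 80, 120, 150, 200],
    "institution_type": ["private", "private", "public", "private",
                         "public", "public", "private", "public"],
    "mandate": ["booster", "booster", "regular", "regular",
                "regular", "none", "none", "none"],
})
toy_features = toy_mandates.drop(columns="mandate")
toy_labels = toy_mandates["mandate"]

toy_clf = make_pipeline(
    ColumnTransformer([
        ("one-hot-encoder", OneHotEncoder(handle_unknown="ignore"), ["institution_type"]),
        ("standard_scalar", StandardScaler(), ["ranking"]),
    ]),
    LogisticRegression(),  # handles more than two classes directly
)
toy_clf.fit(toy_features, toy_labels)

# One probability per class for each row; class order follows classes_.
classes = list(toy_clf.named_steps["logisticregression"].classes_)
probabilities = toy_clf.predict_proba(toy_features)
print(classes)
print(np.round(probabilities, 2))
```

The binary version (mandate or no mandate) is the same sketch with a two-valued label column.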

## TODO
Find papers that illustrate where COVID-related sentiment comes from and what factors may cause people/administrations to act as they do.