In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Predicting Booster
Create a classification model that predicts whether a university mandates a booster *given* that it has already mandated a vaccine. This uses the preprocessing from the "Covid Model Creation" notebook. Note that this analysis aligns better with the iid assumption needed for machine learning: choosing the date on which to implement a vaccine requirement is very much based on the actions of other colleges. However, we can assume that choosing whether to implement a requirement is not *entirely* based on the actions of other institutions; once vaccination requirements seemed imminent, colleges evaluated their own situations and (likely) chose based on the science and the social consequences for their students and surrounding environment, especially for a booster.

Ideally, I'd like to train a model to classify the universities that required the vaccine and those that didn't. Then, I would want to try a multi-class classification with three options: one for no mandate, one for a regular mandate, and one for a booster mandate. However, the lack of schools in the data without a vaccine mandate makes this analysis more difficult. Since my dataset is small, the model will likely overfit to those few examples and generalize poorly. Still, I may try this after I finish with my booster analysis.

In [4]:
target_booster = pd.read_pickle('target_booster.pkl')
features_booster = pd.read_pickle('features_booster.pkl')

In [6]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
categorical_preprocessor = OneHotEncoder(drop='first') # drop to avoid multicollinearity
numerical_preprocessor = StandardScaler() # standardize (zero mean, unit variance) to make the data easier for sklearn models to handle

In [195]:
from sklearn.compose import ColumnTransformer # splits the column, transforms each subset differently, then concatenates
categorical_columns = ['ranking', 'Type', 'political_control_state', 'Region']
numerical_columns = list(set(features_booster.columns).difference(categorical_columns))
preprocessor = ColumnTransformer([('one-hot-encoder', categorical_preprocessor, categorical_columns),
                                  ('standard_scaler', numerical_preprocessor, numerical_columns)])

### Metrics
Accuracy is the fraction of cases identified correctly $= \frac{tp + tn}{tp + tn + fp + fn}$

Precision is the proportion of predicted positives that are true positives $= \frac{tp}{tp + fp}$

Recall is the proportion of actual positives that are correctly identified $= \frac{tp}{tp + fn}$

Look at the distribution of the target values to choose the best metric.

In [15]:
target_booster.value_counts()/target_booster.shape[0]

0    0.71223
1    0.28777
Name: booster, dtype: float64

So, we have an imbalanced dataset (about 71% of the schools have no booster mandate), meaning I won't use regular accuracy--I will use balanced accuracy instead.

As I don't have a preference for precision or recall in this data, I'll use the F1 score, which is the harmonic mean of the two.
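
To see why plain accuracy is misleading here, below is a small sketch with made-up labels (not the real data) in roughly the same 71/29 proportion. It scores a degenerate classifier that always predicts "no booster"; the label lists and variable names are purely illustrative.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Toy labels (hypothetical): 5 "no booster" (0) and 2 "booster" (1),
# roughly the 71/29 split seen above.
y_true = [0, 0, 0, 0, 0, 1, 1]
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * len(y_true)

acc = accuracy_score(y_true, y_pred)               # 5/7: looks respectable
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall: (1.0 + 0.0) / 2
f1 = f1_score(y_true, y_pred, zero_division=0)     # no true positives, so F1 is 0

print(f'accuracy: {acc:.3f}, balanced accuracy: {bal_acc:.3f}, F1: {f1:.3f}')
```

Accuracy rewards the classifier for the majority class alone, while balanced accuracy and F1 both expose that it never finds a booster mandate.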

In [43]:
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report
def show_metrics_classification(model, X_test, y_test):
    """
    Print balanced accuracy, F1 score, and the full classification report
    for a fitted model on the given data
    """
    y_pred = model.predict(X_test)
    print(f'balanced accuracy: {balanced_accuracy_score(y_test, y_pred)}')
    print(f'F1 Score: {f1_score(y_test, y_pred)}') # equal weight to precision and recall
    print(classification_report(y_test, y_pred, labels=[0, 1], target_names=['no booster', 'booster']))

### Models
Now train models, using [Sebastian Raschka's paper as a reference](https://arxiv.org/abs/1811.12808), especially his [code on nested CV, which I borrowed from heavily](https://github.com/rasbt/model-eval-article-supplementary/blob/master/code/nested_cv_code.ipynb). I initially did some model selection (you can see my previous commits on GitHub), but I am going to take a more comprehensive approach going forward. Here's a summary of some of the things discussed in the paper:

Evaluate overall model performance:
- Use Monte-Carlo Cross-Validation
- Bootstrapping (leave-one-out bootstrap, LOOB); use 50-200 bootstrap samples
- 3 way holdout -- used in deep learning when dataset is large
- k-fold CV
    - can repeat many times (unnecessary for LOOCV), e.g., run 5-fold cross validation 100 times (with different random seeds), getting 500 test fold estimates
    - use LOOCV for small datasets--note it's approximately unbiased but with high variance
    - generally, increasing k decreases bias but increases variance and computation time
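
As a rough sketch of the repeated k-fold idea from the list above, the snippet below runs 5-fold CV 100 times to get 500 test-fold estimates. It uses synthetic data from `make_classification` as a stand-in for the real features (an assumption, sized like this dataset), and the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for the real data (assumption): 139 samples, ~29% positives.
X_demo, y_demo = make_classification(n_samples=139, n_features=10,
                                     weights=[0.71, 0.29], random_state=0)

# 5-fold CV repeated 100 times with different shuffles -> 500 test-fold estimates.
repeated_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=100, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo,
                         scoring='balanced_accuracy', cv=repeated_cv)

print(len(scores))
print(f'{np.mean(scores):.3f} +/- {np.std(scores):.3f}')
```

The spread of the 500 scores gives a feel for how much a single 5-fold estimate could move around on a dataset this small.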

Hyperparameter tuning using CV:
- find the best params using k-fold CV, fit a model with those params on the entire training set and evaluate it on the test set, then use all the data to fit the final model
- feature selection could be done inside or outside the loop

Use Nested CV--outer loop estimates generalization error, inner loop selects model. For example, if we are doing 5-fold CV in the outer loop, we take the data from 4 folds, combine it into one dataset, then split that set into k folds and run CV.

In [211]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
X_train, X_test, y_train, y_test = train_test_split(features_booster, target_booster, test_size=0.2, 
                                                    random_state=42, stratify=target_booster)
# these will be passed to other methods to make sure the CV split is the same for every model
inner_cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

Set up many models and grid search for each of them. After finding the best hyperparameters with the inner loop, train each model and evaluate on each of the k folds in the outer loop. Sklearn lets me do this nicely by passing in the grid search as a parameter for ```cross_val_score```.
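
Here is a minimal sketch of that pattern, assuming synthetic data in place of the real training set and a small, purely illustrative grid over logistic regression's `C`. The `_demo` names are hypothetical; the splitters mirror the `inner_cv` and `outer_cv` settings above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption): 104 samples, ~29% positives.
X_demo, y_demo = make_classification(n_samples=104, n_features=10,
                                     weights=[0.71, 0.29], random_state=0)

inner_cv_demo = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
outer_cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Inner loop: grid search picks C using only the data it is given.
# The grid values are illustrative, not tuned choices.
inner_search = GridSearchCV(
    estimator=make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    param_grid={'logisticregression__C': [0.01, 0.1, 1, 10]},
    scoring='balanced_accuracy',
    cv=inner_cv_demo,
)

# Outer loop: each outer training split is handed to the grid search, which
# tunes on it and refits; the held-out outer fold then scores that refit model.
nested_scores = cross_val_score(inner_search, X_demo, y_demo,
                                scoring='balanced_accuracy', cv=outer_cv_demo)
print(f'nested CV balanced accuracy: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}')
```

Because the outer folds never take part in choosing `C`, the five outer scores estimate how the whole tuning procedure generalizes, not just one fixed model.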

Models

In [None]:
from sklearn.pipeline import make_pipeline

Start with Logistic Regression.

In [9]:
from sklearn.linear_model import LogisticRegression
pipe_log = make_pipeline(preprocessor, LogisticRegression())
pipe_log.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('one-hot-encoder',
                                                  OneHotEncoder(drop='first'),
                                                  ['ranking', 'Type',
                                                   'political_control_state',
                                                   'Region']),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  ['total_population',
                                                   '2020.student.size',
                                                   'county_vote_diff',
                                                   'announce_date',
                                                   'avg_community_level',
                                                   'median_income',
                                          

In [10]:
pipe_log.score(X_test, y_test) # mean accuracy

0.7142857142857143

In [32]:
show_metrics_classification(pipe_log, X_test, y_test)

balanced accuracy: 0.6032608695652174
F1 Score: 0.375
              precision    recall  f1-score   support

  no booster       0.71      0.96      0.81        23
     booster       0.75      0.25      0.38        12

    accuracy                           0.71        35
   macro avg       0.73      0.60      0.59        35
weighted avg       0.72      0.71      0.66        35
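
As a sanity check, the headline numbers can be reproduced by hand from the formulas in the Metrics section. The confusion-matrix counts below are not printed anywhere; I inferred them from the report (booster recall 0.25 on 12 schools, booster precision 0.75, 23 no-booster schools), so treat them as an assumption.

```python
# Confusion-matrix counts inferred from the report (booster = positive class):
# recall 0.25 on 12 boosters -> tp = 3, fn = 9; precision 0.75 -> fp = 1; tn = 23 - 1 = 22.
tp, fn, fp, tn = 3, 9, 1, 22

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)           # recall of the booster class
specificity = tn / (tn + fp)      # recall of the no-booster class
balanced_accuracy = (recall + specificity) / 2
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, balanced_accuracy, f1)
```

These agree with the accuracy, balanced accuracy, and F1 score printed above.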



Now, use random forests, first with the default settings, then with grid search.

In [138]:
from sklearn.ensemble import RandomForestClassifier
pipe_forest = make_pipeline(preprocessor, RandomForestClassifier())
pipe_forest.fit(X_train, y_train)
show_metrics_classification(pipe_forest, X_test, y_test)

balanced accuracy: 0.5815217391304348
F1 Score: 0.35294117647058826
              precision    recall  f1-score   support

  no booster       0.70      0.91      0.79        23
     booster       0.60      0.25      0.35        12

    accuracy                           0.69        35
   macro avg       0.65      0.58      0.57        35
weighted avg       0.67      0.69      0.64        35



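Do grid search, using OOB (out-of-bag) error as the metric as in the "Covid Model Creation" notebook. Since the OOB score already acts as a validation estimate, I use a split that holds out a single sample, so each forest is fit on nearly all of the training data.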

In [49]:
from sklearn.model_selection import PredefinedSplit
cv = PredefinedSplit([-1]*(X_train.shape[0]-1) + [0])
for (train, test) in cv.split(X_train, y_train):
    print(len(train), len(test))

103 1


In [62]:
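def oob_scorer(estimator, X, y):
    """
    Get the OOB score, where the RandomForest is the second step (index 1) of the Pipeline.
    X and y are ignored; they are only there to match the scorer signature.
    """
    return estimator[1].oob_score_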
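
To show how the single-sample split and the OOB scorer fit together, here is a self-contained sketch on synthetic data (an assumption, standing in for the real training set). The one-parameter grid, the fixed `random_state`, and the `_demo` names are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (assumption): 104 samples, ~29% positives.
X_demo, y_demo = make_classification(n_samples=104, n_features=10,
                                     weights=[0.71, 0.29], random_state=0)

# -1 means "always in the training part"; only the last sample forms the single test fold.
single_split = PredefinedSplit([-1] * (len(X_demo) - 1) + [0])

def oob_scorer_demo(estimator, X, y):
    # Ignores X and y: the score is the forest's out-of-bag accuracy from fitting.
    return estimator[-1].oob_score_

search = GridSearchCV(
    estimator=make_pipeline(StandardScaler(),
                            RandomForestClassifier(oob_score=True, random_state=0)),
    param_grid={'randomforestclassifier__min_samples_leaf': [1, 3, 5]},  # illustrative grid
    scoring=oob_scorer_demo,
    cv=single_split,
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

One thing to keep in mind with this setup: the held-out sample plays no role in the score, so the search is really comparing OOB accuracy across parameter settings.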

Also use [this post on tuning hyperparameters](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74)

In [102]:
from sklearn.model_selection import GridSearchCV
param_grid = [
    {'randomforestclassifier__n_estimators': [25, 75, 100, 150, 200],
     'randomforestclassifier__max_depth': [None, 10, 25, 50, 75, 100],
     'randomforestclassifier__max_features': [None, 0.25, 0.5, 0.75], # Note I have 10 features (when they're not encoded)
     'randomforestclassifier__min_samples_split': [2, 3, 5, 7], 
     'randomforestclassifier__min_samples_leaf': [1, 2, 3, 5] # 104 samples in training data -- 84 in validation
    }
]
pipe_forest_grid = GridSearchCV(estimator=make_pipeline(preprocessor, RandomForestClassifier(oob_score=True)), param_grid=param_grid, scoring=oob_scorer, cv=cv)

In [103]:
pipe_forest_grid.fit(X_train, y_train)

GridSearchCV(cv=PredefinedSplit(test_fold=array([-1, -1, ..., -1,  0])),
             estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(transformers=[('one-hot-encoder',
                                                                         OneHotEncoder(drop='first'),
                                                                         ['ranking',
                                                                          'Type',
                                                                          'political_control_state',
                                                                          'Region']),
                                                                        ('standard_scaler',
                                                                         StandardScaler(),
                                                                         ['total_population',
                                         

In [128]:
display(pipe_forest_grid.best_params_)
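pipe_forest_grid.score(X_test, y_test) # uses the search's scoring (oob_scorer), so this is the OOB score, not test accuracy
pipe_forest.score(X_test, y_test) # mean accuracy; only this last expression is displayed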

{'randomforestclassifier__max_depth': None,
 'randomforestclassifier__max_features': 0.75,
 'randomforestclassifier__min_samples_leaf': 5,
 'randomforestclassifier__min_samples_split': 5,
 'randomforestclassifier__n_estimators': 200}

0.7142857142857143

In [173]:
print(pipe_forest_grid.predict(X_test))
print(pipe_forest.predict(X_test))

[0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0]
[0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0]


In [139]:
show_metrics_classification(pipe_forest_grid, X_test, y_test)

balanced accuracy: 0.5815217391304348
F1 Score: 0.35294117647058826
              precision    recall  f1-score   support

  no booster       0.70      0.91      0.79        23
     booster       0.60      0.25      0.35        12

    accuracy                           0.69        35
   macro avg       0.65      0.58      0.57        35
weighted avg       0.67      0.69      0.64        35



Here, I get the same outcome (or one prediction different) as the default random forest, depending on the run: no `random_state` is set, so the trees split slightly differently each time, and I have seen two similar outcomes. So, the hyperparameter tuning changed very little.

Now, I will use normal cross-validation, not the OOB error. However, I'll use random search so it's quick.

In [176]:
from sklearn.model_selection import RandomizedSearchCV
rand_param_grid = [
    {'randomforestclassifier__n_estimators': np.linspace(10, 250, 20).astype(int),
     'randomforestclassifier__max_depth': [None, 10, 25, 50, 75, 100],
     'randomforestclassifier__max_features': [None, 0.25, 0.5, 0.75], # Note I have 10 features (when they're not encoded)
     'randomforestclassifier__min_samples_split': [2, 3, 5, 7], 
     'randomforestclassifier__min_samples_leaf': [1, 2, 3, 5] # 104 samples in training data -- 84 in validation
    }
]
forest_random_grid = RandomizedSearchCV(make_pipeline(preprocessor, RandomForestClassifier()), rand_param_grid, n_iter=20, cv=3)
forest_random_grid.fit(X_train, y_train)

RandomizedSearchCV(cv=3,
                   estimator=Pipeline(steps=[('columntransformer',
                                              ColumnTransformer(transformers=[('one-hot-encoder',
                                                                               OneHotEncoder(drop='first'),
                                                                               ['ranking',
                                                                                'Type',
                                                                                'political_control_state',
                                                                                'Region']),
                                                                              ('standard_scaler',
                                                                               StandardScaler(),
                                                                               ['total_population',
                             

In [177]:
forest_random_grid.best_params_

{'randomforestclassifier__n_estimators': 186,
 'randomforestclassifier__min_samples_split': 3,
 'randomforestclassifier__min_samples_leaf': 1,
 'randomforestclassifier__max_features': 0.5,
 'randomforestclassifier__max_depth': 25}

In [180]:
show_metrics_classification(forest_random_grid, X_test, y_test)

balanced accuracy: 0.6231884057971014
F1 Score: 0.4444444444444444
              precision    recall  f1-score   support

  no booster       0.72      0.91      0.81        23
     booster       0.67      0.33      0.44        12

    accuracy                           0.71        35
   macro avg       0.70      0.62      0.63        35
weighted avg       0.70      0.71      0.68        35



When using cv=5, this gives an estimator that performs identically to the random forest tuned using OOB error. I set cv=3, as I thought it might give different performance. The cross-validation happens on the training set, which has only 104 samples, so with cv=5 each validation fold has only ~21 samples, whereas cv=3 gives ~35. As there's already a class imbalance in the data, larger validation folds are likely to be more representative and less likely to include almost only colleges without booster mandates.

Note that ```GridSearchCV``` and ```RandomizedSearchCV``` both already use stratified CV for classifiers when ```cv``` is an integer, keeping the percentage of samples in each class the same for each fold.
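
A quick way to check this, and to see the fold sizes for cv=3 versus cv=5, is sketched below. It uses made-up labels sized like the training set (104 samples, about 29% positive), and relies on `check_cv`, which to my understanding is the helper scikit-learn uses to turn an integer `cv` into a splitter.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, check_cv

# Hypothetical labels matching the training set's size and rough class balance:
# 104 samples, 30 of them positive (about 29%).
y_demo = np.array([0] * 74 + [1] * 30)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features; only the labels drive the split

for k in (3, 5):
    # For a classifier and an integer cv, check_cv returns a stratified splitter.
    splitter = check_cv(cv=k, y=y_demo, classifier=True)
    folds = [val_idx for _, val_idx in splitter.split(X_demo, y_demo)]
    fold_sizes = [len(val_idx) for val_idx in folds]
    positives = [int(y_demo[val_idx].sum()) for val_idx in folds]
    print(type(splitter).__name__, k, fold_sizes, positives)
```

Each validation fold keeps roughly the same share of booster mandates, so no fold ends up without any positives.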

So, using random forest methods gives a similar result to using logistic regression.

## READ MORE on SVM before I do more
Now, try SVM. I'll start with the defaults (rbf kernel).

In [33]:
from sklearn import svm
pipe_svm = make_pipeline(preprocessor, svm.SVC(kernel='rbf', gamma='scale'))
pipe_svm.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('one-hot-encoder',
                                                  OneHotEncoder(drop='first'),
                                                  ['ranking', 'Type',
                                                   'political_control_state',
                                                   'Region']),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  ['total_population',
                                                   '2020.student.size',
                                                   'county_vote_diff',
                                                   'announce_date',
                                                   'avg_community_level',
                                                   'median_income',
                                          

In [46]:
show_metrics_classification(pipe_svm, X_test, y_test)
print(pipe_svm.predict(X_test))

balanced accuracy: 0.5
F1 Score: 0.0
              precision    recall  f1-score   support

  no booster       0.66      1.00      0.79        23
     booster       0.00      0.00      0.00        12

    accuracy                           0.66        35
   macro avg       0.33      0.50      0.40        35
weighted avg       0.43      0.66      0.52        35

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


  _warn_prf(average, modifier, msg_start, len(result))


Now, try linear kernel.

In [39]:
pipe_svm_lin = make_pipeline(preprocessor, svm.SVC(kernel='linear'))
pipe_svm_lin.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('one-hot-encoder',
                                                  OneHotEncoder(drop='first'),
                                                  ['ranking', 'Type',
                                                   'political_control_state',
                                                   'Region']),
                                                 ('standard_scaler',
                                                  StandardScaler(),
                                                  ['total_population',
                                                   '2020.student.size',
                                                   'county_vote_diff',
                                                   'announce_date',
                                                   'avg_community_level',
                                                   'median_income',
                                          

In [40]:
show_metrics_classification(pipe_svm_lin, X_test, y_test)

balanced accuracy: 0.5597826086956521
F1 Score: 0.3333333333333333


## Final Evaluation and Training
Evaluate the chosen model on the test set, then train the model using all the data to get a final predictor.

## Results & Takeaway
Create a mini web-page where people can input a state and zip-code and then our model can predict vaccine classification. 

Also, create a map, using slider bars to indicate university-specific variables not specified by state or county. 

Shade each region differently based on its classification. I have a drawing of this in my notebook.

## Sources (some in previous notebooks)
- https://towardsdatascience.com/the-5-classification-evaluation-metrics-you-must-know-aa97784ff226