In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

##  Predicting Booster
Create a classification model that predicts whether a university mandates a booster *given* that it has already mandated a vaccine. This uses the preprocessing from the "Covid Model Creation" notebook. Note that this analysis aligns better with the i.i.d. assumption needed for machine learning than an analysis of mandate dates would: choosing when to implement a vaccine requirement depends heavily on the actions of other colleges. However, we can assume that choosing whether to implement a requirement is not *entirely* based on the actions of other institutions. Once vaccination requirements seemed imminent, colleges evaluated their own situations and (likely) chose based on the science and the social consequences for their students and surrounding environment, especially for a booster.

Ideally, I'd like to train a model to classify the universities that required the vaccine and those that didn't. Then, I would want to try a multi-class classification with three options: one for no mandate, one for a regular mandate, and one for a booster mandate. However, the lack of schools in the data without a vaccine mandate makes this analysis more difficult. Since my dataset is small, the model would likely overfit to those few examples and generalize poorly. Still, I may try this after I finish with my booster analysis.

In [248]:
target_booster = pd.read_pickle('target_booster.pkl')
features_booster = pd.read_pickle('features_booster.pkl')
num_features = features_booster.shape[1]

In [6]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
categorical_preprocessor = OneHotEncoder(drop='first') # drop to avoid multicollinearity
numerical_preprocessor = StandardScaler() # normalize data to make it easier for sklearn models to handle

In [195]:
from sklearn.compose import ColumnTransformer # splits the column, transforms each subset differently, then concatenates
categorical_columns = ['ranking', 'Type', 'political_control_state', 'Region']
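# keep the original column order; iterating over a set of strings can change order between sessions
numerical_columns = [col for col in features_booster.columns if col not in categorical_columns]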
preprocessor = ColumnTransformer([('one-hot-encoder', categorical_preprocessor, categorical_columns),
                                  ('standard_scaler', numerical_preprocessor, numerical_columns)])
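
To make the effect of this preprocessor concrete, here is a minimal sketch on a toy frame. The data is made up: `Region` mirrors one of the real column names, while `enrollment` and the `toy_*` names are hypothetical and exist only for this illustration. `sparse_threshold=0` is set only so the result comes back as a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, not the booster dataset.
toy = pd.DataFrame({
    "Region": ["West", "South", "West", "Northeast"],
    "enrollment": [1000.0, 2000.0, 3000.0, 4000.0],
})

toy_preprocessor = ColumnTransformer(
    [
        ("one-hot-encoder", OneHotEncoder(drop="first"), ["Region"]),
        ("standard_scaler", StandardScaler(), ["enrollment"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

transformed = toy_preprocessor.fit_transform(toy)

# 3 region categories with the first dropped -> 2 dummy columns, plus 1 scaled column
print(transformed.shape)
print(transformed)
```

The first category in sorted order is the one dropped, so a row of all zeros in the dummy columns stands for that category.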

### Metrics
Accuracy is the fraction of cases identified correctly $= \frac{tp + tn}{tp + tn + fp + fn}$

Precision is the proportion of predicted positives that are true positives $= \frac{tp}{tp + fp}$

Recall is the proportion of actual positives that are correctly identified $= \frac{tp}{tp + fn}$
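
As a quick worked example of the three formulas, the sketch below plugs in hypothetical confusion-matrix counts (the numbers are invented for illustration and do not come from the booster data).

```python
# Hypothetical confusion-matrix counts, chosen only for illustration.
tp, tn, fp, fn = 20, 60, 10, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)   # fraction of all cases classified correctly
precision = tp / (tp + fp)                   # of the predicted positives, how many are real
recall = tp / (tp + fn)                      # of the real positives, how many were found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```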

**Use ROC AUC?**

Check the class distribution of the target to choose the best metric.

In [15]:
target_booster.value_counts()/target_booster.shape[0]

0    0.71223
1    0.28777
Name: booster, dtype: float64

So, we have an imbalanced dataset, meaning I won't use regular accuracy; I'll use balanced accuracy instead.

As I don't have a preference for precision or recall in this data, I'll use the F1 score, which is the harmonic mean of the two.
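
To see why plain accuracy is misleading here, consider a baseline that always predicts "no booster". The sketch below uses a hypothetical 100-school sample that mirrors the roughly 71/29 split above. The counts are an assumption for illustration, not the real dataset size.

```python
# Hypothetical 100-school sample mirroring the roughly 71/29 class split.
n_neg, n_pos = 71, 29

# Baseline that always predicts class 0 ("no booster").
tp, fn = 0, n_pos
tn, fp = n_neg, 0

accuracy = (tp + tn) / (n_neg + n_pos)
recall_pos = tp / (tp + fn)                        # recall on the booster class
recall_neg = tn / (tn + fp)                        # recall on the no-booster class
balanced_accuracy = (recall_pos + recall_neg) / 2  # mean of the per-class recalls

# F1 written in terms of counts: 2*tp / (2*tp + fp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)

print(accuracy, balanced_accuracy, f1)
```

The baseline looks decent on accuracy (0.71) but scores 0.5 on balanced accuracy and 0 on F1, which is why those two metrics are the more informative choice for this target.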

In [43]:
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report
def show_metrics_classification(model, X_test, y_test):
    """Print balanced accuracy, F1 score, and a per-class report for a fitted classifier."""
    y_pred = model.predict(X_test)
    print(f'balanced accuracy: {balanced_accuracy_score(y_test, y_pred)}')
    print(f'F1 Score: {f1_score(y_test, y_pred)}') # equal weight to precision and recall
    print(classification_report(y_test, y_pred, labels=[0, 1], target_names=['no booster', 'booster']))

### Models
Now train the models. I'm using [Sebastian Raschka's paper as a reference](https://arxiv.org/abs/1811.12808), especially his [code on nested CV, which I borrowed from heavily](https://github.com/rasbt/model-eval-article-supplementary/blob/master/code/nested_cv_code.ipynb). I initially did some model selection (you can see my previous commits on GitHub), but I'm going to take a more comprehensive approach going forward. Here's a summary of some of the things discussed in the paper:

Evaluate overall model performance:
- Use Monte Carlo cross-validation
- Bootstrapping (leave-one-out bootstrap, LOOB); use 50-200 bootstrap samples
- 3-way holdout: used in deep learning when the dataset is large
- k-fold CV
    - can repeat many times (unnecessary for LOOCV), e.g., run 5-fold cross validation 100 times (with different random seeds), getting 500 test fold estimates
    - use LOOCV for small datasets--note it's approximately unbiased but with high variance
    - generally, increasing k decreases bias but increases variance and computation time
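
The repeated k-fold idea from the list can be sketched with scikit-learn's `RepeatedStratifiedKFold`. This is only an illustration on synthetic stand-in data (`X_demo`, `y_demo` are hypothetical), and it uses 10 repeats rather than 100 to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in data, not the booster dataset.
X_demo, y_demo = make_classification(n_samples=120, n_features=8, random_state=0)

# 5-fold CV repeated 10 times gives 50 test-fold estimates;
# n_repeats=100 would give the 500 estimates mentioned in the list.
repeated_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=repeated_cv)

print(len(scores), scores.mean(), scores.std())
```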

Hyperparameter tuning using CV:
- find the best params using k-fold CV, fit a model with those params on the entire training set and evaluate it on the test set, then use all the data to fit the final model
- feature selection could be done inside or outside the loop

Use Nested CV--outer loop estimates generalization error, inner loop selects model. For example, if we are doing 5-fold CV in the outer loop, we take the data from 4 folds, combine it into one dataset, then split that set into k folds and run CV. This will better account for the variance of the test set--same motivation as regular cross validation accounting for the variance of the validation set. Also note that in the previous example we select 5 best models and see the generalization error on each of them. We can choose one of those models or ensemble those models. See [Sergey Feldman's lecture](https://www.youtube.com/watch?v=DuDtXtKNpZs) for more.
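
Written out as explicit loops, nested CV looks roughly like the sketch below. It runs on synthetic stand-in data with a hypothetical logistic regression grid, so the names and numbers are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data, not the booster dataset.
X_demo, y_demo = make_classification(n_samples=150, n_features=8, random_state=0)

outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
grid = {"C": [0.01, 0.1, 1.0, 10.0]}  # hypothetical grid

outer_scores, chosen_params = [], []
for train_idx, test_idx in outer.split(X_demo, y_demo):
    # Inner loop: model selection using only the 4 combined outer-training folds.
    search = GridSearchCV(LogisticRegression(max_iter=1000), grid, cv=inner)
    search.fit(X_demo[train_idx], y_demo[train_idx])
    # Outer loop: score the refit winner on the held-out outer fold.
    outer_scores.append(search.score(X_demo[test_idx], y_demo[test_idx]))
    chosen_params.append(search.best_params_)

print(outer_scores)   # one generalization estimate per outer fold
print(chosen_params)  # the parameters selected in each outer fold may differ
```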

Figuring out what to do after nested cross-validation has been difficult. This is a summary of the previous resources and the Cross Validated (Stack Exchange) answers listed here ([1](https://stats.stackexchange.com/questions/232897/how-to-build-the-final-model-and-tune-probability-threshold-after-nested-cross-v/233027#233027), [2](https://stats.stackexchange.com/questions/65128/nested-cross-validation-for-model-selection/65156#65156), [3](https://stats.stackexchange.com/questions/341229/an-intuitive-understanding-of-each-fold-of-a-nested-cross-validation-for-paramet?rq=1)):

My takeaways are that the inner loop is for model selection, while the outer loop is for estimating generalization error. Therefore it would be wrong to choose a model based on the results of the outer loop. Instead, the error estimates from all the outer folds for a specific model are averaged to give the approximate generalization error for that model-fitting method when all the data is used. After nested CV is done, apply the inner CV procedure to all the data to select the optimal hyperparameters, using the results from nested CV as the performance estimate for this procedure.
From answer 3:
>Thus: run the auto-tuning of hyperparameters on the whole data set just as you do during cross validation. Same hyperparameter combinations to consider, same strategy for selecting the optimum. In short: same training algorithm, just slightly different data (1/k additional cases).



Note that if the inner and outer estimates for a model are very different, this could signify overfitting. Also note that if the hyperparameters selected by the inner loops vary widely across outer folds, the model is likely not stable. Iterated/repeated CV can be used to inspect this further.
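
One way to run both checks is `cross_validate` with `return_estimator=True`, which keeps the fitted grid search from every outer fold. The sketch below is an illustration on synthetic stand-in data with a hypothetical SVM grid, not the notebook's actual pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.svm import SVC

# Synthetic stand-in data, not the booster dataset.
X_demo, y_demo = make_classification(n_samples=150, n_features=8, random_state=0)

inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
search = GridSearchCV(SVC(), {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}, cv=inner)

# return_estimator=True keeps the fitted GridSearchCV object from each outer fold.
results = cross_validate(search, X_demo, y_demo, cv=outer, return_estimator=True)

for fitted, outer_score in zip(results["estimator"], results["test_score"]):
    # best_score_ is the inner CV mean for the selected parameters.
    # A large gap to the outer score hints at overfitting; parameters that
    # jump around from fold to fold hint at an unstable selection procedure.
    print(fitted.best_params_, round(fitted.best_score_, 3), round(outer_score, 3))
```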

In [271]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
X_train, X_test, y_train, y_test = train_test_split(features_booster, target_booster, test_size=0.2, 
                                                    random_state=42, stratify=target_booster)
# use 5-fold for inner and outer so there's enough data in validation set but not too much
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

Set up several models and a grid search for each of them. After finding the best hyperparameters with the inner loop, train each model and evaluate it on each of the k folds in the outer loop. Sklearn lets me do this nicely by passing the grid search as the estimator to `cross_val_score`.

Create all the models I want to use and a parameter grid for each of them. I'll use logistic regression, random forest, and SVM. Nested CV should give 3\*k total estimates of model performance: k for each algorithm, where k is the number of folds in the outer CV. I'll average the estimates by model and report the standard deviation for each of the 3 models. Then I'll choose the best model according to these averages and apply CV (the same as the inner CV used previously, but with more data) with the same grid search parameters on the training set. Note that the generalization error estimates are no longer unbiased once I've picked the minimum. Finally, I'll fit the model with the selected parameters on the entire training set and use the test set to get an unbiased estimate.

In [276]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

model_names = ['log', 'for', 'svm']

pipe_log = Pipeline([('pre', preprocessor), (model_names[0], LogisticRegression(random_state=42))])
pipe_for = Pipeline([('pre', preprocessor), (model_names[1], RandomForestClassifier(random_state=42, oob_score=True))])
pipe_svm = Pipeline([('pre', preprocessor), (model_names[2], SVC(random_state=42))])
param_grid_log = [
    {'log__C': [0.1**(x) for x in range(0, 10, 2)]} # l2 default
] 
param_grid_for = [
    {'for__max_features': [0.5, 0.75, 'sqrt', None], 
     'for__max_samples': [0.25, 0.5, 0.75, None],
     'for__min_samples_leaf': [1, 2, 5],
#      'for__max_leaf_nodes': [None, 2, 5, 10], 
     'for__max_depth': [10, 25, 50, None],
     # 'for__min_samples_split': [2, 3, 5, 7, 10],
    }
]
param_grid_svm = [
    {'svm__kernel': ['linear', 'poly', 'rbf'],
     'svm__gamma': [2**x for x in range(-12, 2, 2)],
     'svm__C': [2**x for x in range(-4, 10, 2)]
    }
]

def oob_scorer(estimator, X, y):
    """
    Return the out-of-bag score, assuming the RandomForest is the step at
    index 1 of the Pipeline (index 0 is the preprocessor)
    """
    return estimator[1].oob_score_

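# one row per model, one column per outer fold
validation_scores = np.zeros((len(model_names), outer_cv.get_n_splits()))
pipes_and_grids = zip([pipe_log, pipe_for, pipe_svm], [param_grid_log, param_grid_for, param_grid_svm])
for i, (pipe, param_grid) in enumerate(pipes_and_grids):
    name = pipe.steps[1][0]  # name of the model step (index 0 is the preprocessor)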
#     if name == 'for': # use oob error instead of CV estimates for inner CV--but this would require changing CV split which I'll ignore for now
#         gcv = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring=oob_scorer)
#     else:
    gcv = GridSearchCV(pipe, param_grid, cv=inner_cv)
    nested_score = cross_val_score(gcv, X=X_train, y=y_train, cv=outer_cv)
    validation_scores[i, :] = nested_score
    print(f'{name}: {nested_score}') # should output k_outer folds--the number of folds used in outer cv 

log: [0.60869565 0.68181818 0.77272727 0.81818182 0.72727273]
for: [0.60869565 0.77272727 0.68181818 0.77272727 0.77272727]
svm: [0.60869565 0.68181818 0.72727273 0.68181818 0.68181818]


In [286]:
validation_scores_mean = validation_scores.mean(axis=1)
validation_scores_std = validation_scores.std(axis=1)
for i in range(3):
    print(f'{model_names[i]} | Outer accuracy: {validation_scores_mean[i]} +/- {validation_scores_std[i]}')

log | Outer accuracy: 0.7217391304347827 +/- 0.07253152898435454
for | Outer accuracy: 0.7217391304347827 +/- 0.06659111364000848
svm | Outer accuracy: 0.6762845849802371 +/- 0.0381048988300238
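
The scores above are plain accuracy: neither `GridSearchCV` nor `cross_val_score` was given a `scoring` argument, and classifiers default to accuracy. For comparison, always predicting "no booster" would score roughly 0.71 on this target. A minimal sketch of running the same nested procedure with balanced accuracy follows. It uses synthetic, imbalanced stand-in data and a hypothetical grid, so its numbers say nothing about the booster models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in data with a roughly 70/30 class split, not the booster dataset.
X_demo, y_demo = make_classification(
    n_samples=150, n_features=8, weights=[0.7, 0.3], random_state=0
)

inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# scoring is set in both places: inner model selection and outer evaluation.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 0.1, 1.0]},  # hypothetical grid
    cv=inner,
    scoring="balanced_accuracy",
)
balanced_scores = cross_val_score(search, X_demo, y_demo, cv=outer, scoring="balanced_accuracy")

print(balanced_scores.mean(), balanced_scores.std())
```

Passing `scoring="f1"` in the same two places would select and evaluate on the F1 score instead.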


So, I have an approximately unbiased estimate of the generalization error for each of the three model selection procedures. Logistic regression and random forest have the same estimated accuracy and similar standard deviations. SVM performs worse but has a smaller standard deviation. Now I run the model selection procedure for logistic regression on the entire training set.

In [288]:
final_gcv = GridSearchCV(pipe_log, param_grid_log, cv=inner_cv)
final_gcv.fit(X_train, y_train)
final_gcv.best_params_

{'log__C': 1.0}

## Final Evaluation and Training
Evaluate the chosen model on the test set, then train the model using all the data to get a final predictor.
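
A minimal sketch of these two steps is below, on synthetic stand-in data with a hypothetical grid. It reuses the selected parameters for the final fit, which is one reading of "train the model using all the data". Rerunning the full grid search on all the data, as the quoted answer above suggests, is the other.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in data, not the booster dataset.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, weights=[0.7, 0.3], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1.0]}, cv=cv)

# Step 1: tune on the training set. refit=True (the default) retrains the best
# configuration on all of X_tr, and only then is the test set used, once.
search.fit(X_tr, y_tr)
y_pred = search.predict(X_te)
test_bal_acc = balanced_accuracy_score(y_te, y_pred)
test_f1 = f1_score(y_te, y_pred)
print(test_bal_acc, test_f1)

# Step 2: fit the final predictor on every available row with the chosen parameters.
final_model = LogisticRegression(max_iter=1000, **search.best_params_).fit(X_demo, y_demo)
```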

## Results & Takeaway
Create a mini web page where people can input a state and ZIP code, and the model then predicts the vaccine classification.

Also, create a map, using slider bars to set university-specific variables not determined by state or county.

Shade each region differently based on its classification. I have a drawing of this in my notebook.

## Sources (some in previous notebooks)
- https://towardsdatascience.com/the-5-classification-evaluation-metrics-you-must-know-aa97784ff226
- random forest parameter tuning
    - https://stats.stackexchange.com/questions/344220/how-to-tune-hyperparameters-in-a-random-forest
    - https://arxiv.org/pdf/1804.03515.pdf
- svm parameter tuning
    - https://stats.stackexchange.com/questions/43943/which-search-range-for-determining-svm-optimal-c-and-gamma-parameters
    - https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
    - https://stats.stackexchange.com/questions/249881/svm-hyperparameters-tuning (use Bayesian optimization in the future?)