# Practicum 05 - Machine Learning Model Development Process

In this practicum, we will revisit the [SUPPORT2](https://archive.ics.uci.edu/dataset/880/support2) dataset to create a classification model that predicts in-patient mortality for critically ill patients from data gathered within the first three days of admission. We will follow the full model development process discussed in the last several lectures to:
1. Split the data into a training and test set.
2. Preprocess the training data to:
    1. Impute missing values
    2. Standardize it
3. Model selection
    1. Randomized search for hyperparameter selection
    2. Stratified K-fold validation for model selection
4. Model Assessment
    1. Evaluate performance on the test set
    2. Examine learning curves

In [None]:
# Google Colab setup
# mount the google drive - this is necessary to access supporting src
from google.colab import drive
drive.mount("/content/drive")

In [None]:
# install any packages not found in the Colab environment
!pip install ucimlrepo

In [None]:
from ucimlrepo import fetch_ucirepo
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split, StratifiedKFold, RandomizedSearchCV, learning_curve
from sklearn.impute import KNNImputer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay, RocCurveDisplay, PrecisionRecallDisplay
from prettytable import PrettyTable
from functools import reduce
from statsmodels.stats.contingency_tables import mcnemar
import numpy as np

# local project imports
import sys
sys.path.append("/content/drive/MyDrive/Colab Notebooks/CPSC-8810-ML-BioMed/src")
from uci_utils import get_vars_of_type, get_vars_of_type_in_list

In [None]:
# fetch the SUPPORT2 dataset
support2 = fetch_ucirepo(id=880)

# data (as pandas dataframes)
X = support2.data.features.drop(['charges', 'totcst', 'totmcst',
                                 'hday', 'dnrday' ], axis=1)
y = support2.data.targets['hospdead'] # death in hospital
meta_vars = support2.variables
feature_type_corrections = [('edu', 'Integer'),
                            ('prg6m', 'Continuous'),
                            ('adls', 'Categorical'),
                            ('diabetes', 'Categorical'),
                            ('dementia', 'Categorical')]
# several of the features have the wrong type, so we correct them here
for tpl in feature_type_corrections:
    row = meta_vars[meta_vars.name == tpl[0]].index[0]
    meta_vars.loc[row, 'type'] = tpl[1]

# convert categorical variables to dummies and create a new dataframe for the features
quant_vars, X_quant = get_vars_of_type_in_list(X, meta_vars, var_type_key = 'type', var_name_key = 'name', type_list = ['Continuous', 'Integer', 'Binary'])
cat_vars, X_cat = get_vars_of_type(X, meta_vars, var_type_key = 'type', var_name_key = 'name', type_kw = 'Categorical')
X_dummy = pd.get_dummies(X_cat, columns=cat_vars,drop_first=True, dtype=int, dummy_na=True)
X = pd.concat([pd.DataFrame(X_quant), X_dummy.reset_index(drop=True)], axis=1)

In [None]:
# global settings
pd.options.display.max_columns = 100
rs = 654321 # random state, use this to ensure reproducibility

# Data Preprocessing
Let's begin with data preprocessing in preparation for model training. Our first step is to split the data into a training set and a held-out test set. We set `stratify=y` to ensure the target class balance is the same across both sets.

In [None]:
# split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=rs)
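
To see what `stratify` does in isolation, here is a minimal sketch on synthetic labels (the arrays `X_demo` and `y_demo` are made up for illustration and are not part of the SUPPORT2 data). It checks that the positive rate is preserved in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic, imbalanced binary labels: 750 negatives and 250 positives (25% positive)
y_demo = np.array([0] * 750 + [1] * 250)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# stratify on the labels so that both splits keep the same positive rate
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.2, random_state=0
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"positive rate: train={train_rate:.3f}, test={test_rate:.3f}")
```

Without `stratify`, the two rates would generally differ by chance, which matters more as the minority class gets smaller.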

Now let's examine the class balance. We anticipate that the classes are imbalanced with respect to the target label and that we will want to address this during model training and selection.

In [None]:
# count plot of the target variable
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))

d = len(y_train.index)
vc = y_train.value_counts()
ax = axes[0]
vc.plot(kind='bar', title='Training Data Mortality Class Balance', ax=ax)
ax.set_xticklabels([f"No - {vc[0]/d*100:.1f}%", f"Yes - {vc[1]/d*100:.1f}%"], rotation=0)
ax.set_xlabel('Patient In Hospital Mortality Outcome')
ax.set_ylabel('Sample Counts');

d = len(y_test.index)
vc = y_test.value_counts()
ax = axes[1]
vc.plot(kind='bar', title='Test Data Mortality Class Balance', ax=ax)
ax.set_xticklabels([f"No - {vc[0]/d*100:.1f}%", f"Yes - {vc[1]/d*100:.1f}%"], rotation=0)
ax.set_xlabel('Patient In Hospital Mortality Outcome')
ax.set_ylabel('Sample Counts');

## Missing Data
Next, let's examine the presence of missing data in our feature set.

In [None]:
mp_train = (len(X_train.index) - X_train.count())/len(X_train.index)*100
mp_test = (len(X_test.index) - X_test.count())/len(X_test.index)*100
mp_table = PrettyTable(['Feature', 'Training Missing %', 'Test Missing %'])
for j in range(len(mp_train.index)):
    label = mp_train.index[j]
    if mp_train[label] > 0:
        mp_table.add_row([label, f'{mp_train[label]:.1f}', f'{mp_test[label]:.1f}'])
print(mp_table)

We can see that most features have some fraction of missing observations. We will perform imputation to address these. However, note that if we were developing this model for a real-world application, we would need to consider the manner in which the data are missing, i.e., _missing completely at random_ (MCAR), _missing at random_ (MAR), or _missing not at random_ (MNAR). Features thought to be MNAR would likely need to be dropped from the analysis or possibly treated as indicator variables. For purposes of illustration, we will assume the data is either MCAR or MAR.
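
As a rough illustration of the indicator-variable option mentioned above, the sketch below adds a 0/1 column that records whether a value was missing. The dataframe `demo` and the column name `lab_value` are hypothetical and only serve the example.

```python
import numpy as np
import pandas as pd

# hypothetical feature with two missing observations
demo = pd.DataFrame({"lab_value": [1.2, np.nan, 0.8, np.nan, 1.5]})

# indicator column: 1 where the original value was missing, 0 otherwise
demo["lab_value_missing"] = demo["lab_value"].isna().astype(int)
print(demo)
```

The indicator lets a model use the fact that a measurement was not taken, which can itself carry information when the data is MNAR.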

### Imputation
To address the missing data, we will perform imputation using the [scikit-learn nearest neighbors imputation](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html#sklearn.impute.KNNImputer) module. We want to ensure that information from the test data does not _leak_ into the training data. Therefore, we will fit the imputation model using __only__ the training data. We will then use the imputation model fit with the training data to impute values for both the training and test set data.

In [None]:
# create a KNNImputer and fit it to the training data
imputer = KNNImputer(n_neighbors=5, weights='uniform')
imputer.fit(X_train)
X_train_imputed = pd.DataFrame(data=imputer.transform(X_train), columns=X_train.columns)
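
To make the mechanics concrete, here is a toy sketch (the arrays `train` and `new` are invented for illustration). With `weights='uniform'`, a missing entry is filled with the mean of that feature over the nearest rows of the data the imputer was fit on, so the donors always come from the fitted (training) rows.

```python
import numpy as np
from sklearn.impute import KNNImputer

# toy "training" data: the second feature is 10x the first
train = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0],
                  [4.0, 40.0]])

# a new row with the second feature missing
new = np.array([[2.1, np.nan]])

# fit on the training rows only, then impute the new row
imp = KNNImputer(n_neighbors=2, weights="uniform").fit(train)
filled = imp.transform(new)

# the two nearest training rows (by the observed first feature) are 2.0 and 3.0,
# so the imputed value is the mean of 20.0 and 30.0
print(filled)  # [[ 2.1 25. ]]
```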

In [None]:
# plot the KDE of the original and imputed data
mp_train = (len(X_train.index) - X_train.count())/len(X_train.index)*100
keys = [key for key in mp_train.index if mp_train[key] > 0.5]
ncols = 4
nrows = len(keys) // ncols + 1
fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(20, 20))
for i, key in enumerate(keys):
    ax = axes[i // ncols, i % ncols]
    sns.kdeplot(X_train[key], ax=ax, label='original')
    sns.kdeplot(X_train_imputed[key], ax=ax, label='imputed')
    ax.set_title(key)
    ax.legend()
plt.show()

# Problem 1 (1 point)
Impute values for the missing data in `X_test`. The imputed values should be assigned to the variable `X_test_imputed`. Be sure to use the `imputer` variable that __was fit to the training data__, `X_train`. The test data should __NOT__ be used to inform the imputation fit. Plot the distributions of the imputed test data as a check on the quality of the imputation.

In [None]:
## PROBLEM 1 - YOUR CODE HERE ##############################################
# Impute the missing values in the test data
# Store the imputed data in a new dataframe called X_test_imputed
X_test_imputed = None

# plot the KDE of the original and imputed data on the test set

## Data Standardization

Now let's standardize the data to have zero mean and unit variance for the continuous features. As with imputation, we want to prevent test set information from leaking into the training set. Therefore, we will _fit_ our standardization model to the training data and then use it to standardize both the training data and the test data. This amounts to computing the mean, $\mu$, and standard deviation, $\sigma$, on the training data and using those values to center and scale both the training and test sets.
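
The arithmetic can be sketched by hand with NumPy. The arrays `train_col` and `test_col` are made-up values; the point is that $\mu$ and $\sigma$ come from the training column only and are reused for the test column, i.e. $z = (x - \mu)/\sigma$.

```python
import numpy as np

# made-up values for a single continuous feature
train_col = np.array([2.0, 4.0, 6.0, 8.0])
test_col = np.array([5.0, 10.0])

# statistics are computed on the training data only
mu = train_col.mean()
sigma = train_col.std()  # population standard deviation (ddof=0), as StandardScaler uses

# the same mu and sigma are applied to both sets
z_train = (train_col - mu) / sigma
z_test = (test_col - mu) / sigma

print(z_train.mean(), z_train.std())  # approximately 0 and 1
print(z_test)                         # not guaranteed to have mean 0 or unit variance
```

Note that the standardized test values need not have zero mean or unit variance, since they are scaled with the training statistics.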

We also want to be careful to standardize only the numeric features (continuous and integer). We do __not__ want to standardize the dummy variable (binary) features. We can do this in a single step by using the [scikit-learn column transformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) module. This allows us to specify which columns should be altered.

In [None]:
# standardize the continuous and integer features on the training data
continuous_vars, _ = get_vars_of_type_in_list(X_train_imputed, meta_vars, var_type_key = 'type', var_name_key = 'name', type_list = ['Continuous', 'Integer'])
# ColumnTransformer outputs the transformed columns first, followed by the passthrough columns,
# so take the column labels from the fitted transformer rather than from the input dataframe
standardizer = ColumnTransformer([("standardize", preprocessing.StandardScaler(), continuous_vars)], remainder='passthrough', verbose_feature_names_out=False)
standardizer.fit(X_train_imputed)
X_train_scaled = pd.DataFrame(standardizer.transform(X_train_imputed), columns=standardizer.get_feature_names_out())

Standardize the test values in `X_test_imputed`. Be sure to use the `standardizer` variable that __was fit to the training data__, `X_train_imputed`. The test data should __NOT__ be used to inform the standardization fit.

In [None]:
# standardize the continuous and integer features on the test data using the transformer fit to the training data
X_test_scaled = pd.DataFrame(standardizer.transform(X_test_imputed), columns=standardizer.get_feature_names_out())

# Model Training
Now that we've prepared the data by addressing missing observations and applying standardization, we are ready to train our prediction models. We will consider two models: (1) Logistic regression classifier and (2) Random Forest.

As these models have several tuning parameters (i.e., hyperparameters that are not learned from the data), we will need to evaluate a variety of hyperparameter combinations in a systematic way to select the optimal values. We will use the scikit-learn [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV) module to perform the hyperparameter search.

The `RandomizedSearchCV` module randomly samples hyperparameter combinations from a user-specified dictionary, up to a budget that sets the number of combinations to test. We will specify a set of allowed values for each hyperparameter. Importantly, we will include different options for the `class_weight` parameter in both models as a potential mechanism to address class imbalance. Note that our total parameter space will be larger than our budget, so not all hyperparameter combinations will be tested.
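
The sampling step can be previewed without fitting any model. The sketch below uses scikit-learn's `ParameterGrid` and `ParameterSampler` on a small, made-up dictionary (`demo_space` and `demo_budget` are illustrative names) to show the full space size and a random subset of it.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# a small illustrative hyperparameter space (not the one used in this practicum)
demo_space = {
    "C": [0.01, 0.1, 1, 10],
    "penalty": ["l1", "l2"],
    "class_weight": [None, "balanced"],
}

# the full grid has 4 * 2 * 2 = 16 combinations
space_size = len(ParameterGrid(demo_space))

# draw a budget of 5 combinations; with list-valued parameters,
# sampling is done without replacement
demo_budget = 5
sampled = list(ParameterSampler(demo_space, n_iter=demo_budget, random_state=0))

print(f"{demo_budget} of {space_size} combinations sampled:")
for params in sampled:
    print(params)
```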

For each hyperparameter combination, `RandomizedSearchCV` applies cross-validation to score the combination. We will use the scikit-learn [StratifiedKFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold) module to ensure that the cross-validation folds are balanced relative to the outcome. We will use `balanced_accuracy` as our validation metric (for other metrics, see [Metrics and scoring](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter)).
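
A small sketch on made-up labels shows why balanced accuracy (the mean of the per-class recalls) is a reasonable choice under class imbalance, and how stratified folds keep the minority class represented in every fold. The arrays below are illustrative only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold

# made-up imbalanced labels: 8 negatives, 2 positives
y_true = np.array([0] * 8 + [1] * 2)

# a trivial "model" that always predicts the majority class
y_majority = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_majority)               # 0.8, looks deceptively good
bal_acc = balanced_accuracy_score(y_true, y_majority)  # (1.0 + 0.0) / 2 = 0.5

# stratified folds: each validation fold receives one of the two positives
skf_demo = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
positives_per_fold = [
    int(y_true[val_idx].sum())
    for _, val_idx in skf_demo.split(np.zeros((len(y_true), 1)), y_true)
]

print(acc, bal_acc, positives_per_fold)
```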

## Logistic Regression
Let's start by training a logistic regression classifier using the training data with imputed and standardized values, `X_train_scaled`.

In [None]:
# specify the model
# set the random state for reproducibility
# set the solver to saga to allow for all regularization types
# set the l1_ratio to 0.5 to allow for elasticnet mixing when the penalty is elasticnet, ignored otherwise
model = LogisticRegression(random_state=rs, max_iter=1000, solver='saga', l1_ratio=0.5)

# specify the hyperparameter space
parameter_space = {
    'C': [0.001, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 50, 100],
    'penalty': ['l1', 'l2', 'elasticnet', None],
    'class_weight': [None, 'balanced'],
    'fit_intercept': [True, False]
}
parameter_space_size = reduce(lambda left, right: left*len(right), parameter_space.values(), 1)

# specify the cross-validation method. Use stratified k-fold because the target variable is imbalanced
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=rs)

# specify the budget (number of hyperparameter combinations to try)
budget = 2

# select a score to optimize
score = 'balanced_accuracy'

# number of jobs to run in parallel, -1 means use all CPU processors
n_jobs = -1

search_lr = RandomizedSearchCV(estimator=model,
                            param_distributions=parameter_space,
                            n_iter=budget,
                            scoring=score,
                            n_jobs=n_jobs,
                            cv=skf,
                            random_state=rs,
                            return_train_score=True,
                            verbose=0)
rslt_lr = search_lr.fit(X_train_scaled, y_train)

In [None]:
print(f'The search space has {parameter_space_size} hyperparameter combinations.\nWe evaluated {budget/parameter_space_size*100:.1f}% of them.')

Let's take a look at the best overall model from the randomized search over hyperparameter combinations. The `rslt_lr` variable includes several attributes that provide information on the search results. These include:
1. `best_score_` - the highest cross-validation score among the hyperparameter combinations
2. `best_params_` - the hyperparameter combination that resulted in the best cross-validation score
3. `best_estimator_` - the model with the best hyperparameter combination that has been refit to __all__ of the training data (assuming `refit=True`)
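
Beyond these three attributes, the fitted search object also exposes `cv_results_`, which holds the scores for every sampled combination. The sketch below runs a tiny search on synthetic data (generated with `make_classification`; the names `X_demo`, `y_demo`, and `search_demo` are illustrative) and tabulates the results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# synthetic, mildly imbalanced data for illustration only
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, weights=[0.75, 0.25], random_state=0
)

search_demo = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": [0.01, 0.1, 1, 10], "class_weight": [None, "balanced"]},
    n_iter=4,
    scoring="balanced_accuracy",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    random_state=0,
    return_train_score=True,
)
search_demo.fit(X_demo, y_demo)

# one row per sampled hyperparameter combination, best rank first
results = pd.DataFrame(search_demo.cv_results_)
cols = ["params", "mean_test_score", "std_test_score", "mean_train_score", "rank_test_score"]
summary = results[cols].sort_values("rank_test_score")
print(summary)
```

Comparing `mean_train_score` with `mean_test_score` for each row gives an early hint of overfitting before the test set is ever touched.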

In [None]:
print(f'Best {score} score: {rslt_lr.best_score_:.2f}')
rslt_lr.best_estimator_

## Random Forest
Now let's train a random forest model.

# Problem 2 (2 points)
In the code cell below, use the same procedure as shown for the logistic regression model to train a random forest model. Specifically, using the defined variables `model`, `parameter_space`, `skf`, `budget`, `score`, and `n_jobs`, create a `RandomizedSearchCV` object and assign it to the `search_rf` variable. Then fit it to the training data and assign the result to the `rslt_rf` variable.


In [None]:
# specify the model
# set the random state for reproducibility
model = RandomForestClassifier(random_state=rs, bootstrap=True)

# specify the hyperparameter space
parameter_space = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4, 8],
    'max_features': ['sqrt', 'log2'],
    'class_weight': [None, 'balanced']
}
parameter_space_size = reduce(lambda left, right: left*len(right), parameter_space.values(), 1)

# specify the cross-validation method. Use stratified k-fold because the target variable is imbalanced
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=rs)

# specify the budget (number of hyperparameter combinations to try)
# budget = int(0.125*parameter_space_size)
budget = 50

# select a score to optimize
score = 'balanced_accuracy'

# number of jobs to run in parallel, -1 means use all CPU processors
n_jobs = -1

######################################### PROBLEM 2 - YOUR CODE HERE #####################################################
# create a RandomizedSearchCV using the RandomForestClassifier object and fit it to the training data
search_rf = None
rslt_rf = None

In [None]:
print(f'The search space has {parameter_space_size} hyperparameter combinations.\nWe evaluated {budget} of them.')

Let's take a look at the selected random forest hyperparameters and corresponding cross-validation score.

In [None]:
print(f'Best {score} score: {rslt_rf.best_score_:.2f}')
rslt_rf.best_estimator_

# Model Assessment

Now that we've selected our hyperparameters and trained our two models, let's begin assessing their characteristics.

## Test set performance
Naturally, the first thing we want to consider is the test set performance. Let's start by examining the logistic regression model's test set performance. We will examine the confusion matrix, the classification report (point metrics), the ROC curve, and the PR curve.

In [None]:
# set clf to the best logistic regression model
clf = rslt_lr.best_estimator_

# generate the test predictions
y_pred = clf.predict(X_test_scaled)

# create a figure with three side-by-side axes
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 5))

# plot the confusion matrix on the first axis
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(cm, display_labels=['Mortality No', 'Mortality Yes'])
disp.plot(cmap='Blues', ax=axes[0])

# plot the ROC curve on the second axis
disp = RocCurveDisplay.from_estimator(clf, X_test_scaled, y_test, ax=axes[1])
axes[1].grid()

# Plot the precision-recall curve on the third axis
disp = PrecisionRecallDisplay.from_estimator(clf, X_test_scaled, y_test, ax=axes[2])
axes[2].grid()

print(classification_report(y_test, y_pred, target_names=['Mortality No', 'Mortality Yes']))

# Problem 3 (1 point)
As shown above for the logistic regression model, evaluate the Random Forest test set performance by printing the classification report and plotting the confusion matrix, ROC curve, and PR curve.

In [None]:
# set clf to the best random forest model
clf = rslt_rf.best_estimator_

# generate the test predictions
y_pred = clf.predict(X_test_scaled)

# create a figure with three side-by-side axes
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 5))

# plot the confusion matrix on the first axis
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(cm, display_labels=['Mortality No', 'Mortality Yes'])
disp.plot(cmap='Blues', ax=axes[0])

# plot the ROC curve on the second axis
disp = RocCurveDisplay.from_estimator(clf, X_test_scaled, y_test, ax=axes[1])
axes[1].grid()

# Plot the precision-recall curve on the third axis
disp = PrecisionRecallDisplay.from_estimator(clf, X_test_scaled, y_test, ax=axes[2])
axes[2].grid()

print(classification_report(y_test, y_pred, target_names=['Mortality No', 'Mortality Yes']))

## Statistical Significance of Test Set Performance

We can see in the classification report and the confusion matrix that the models make somewhat different predictions on the test set. Let's evaluate whether the prediction distributions are significantly different. Since we have only two models that make binary classification predictions, we may apply McNemar's test. To perform McNemar's test, we need to form a contingency table that contains four values:
1. `n00`: the number of samples misclassified by both models
2. `n01` : the number of samples misclassified by the logistic regression model, but correctly classified by the random forest model
3. `n10` : the number of samples correctly classified by the logistic regression model, but misclassified by the random forest model
4. `n11` : the number of samples correctly classified by both models
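
For intuition, the exact form of McNemar's test depends only on the two discordant counts, `n01` and `n10`. Under the null hypothesis, each discordant sample is equally likely to fall in either cell, so the p-value is a two-sided binomial tail with success probability 0.5. The helper `mcnemar_exact_pvalue` below is a hypothetical, hand-rolled sketch of that calculation with made-up counts; it is not the statsmodels implementation.

```python
from math import comb

def mcnemar_exact_pvalue(n01, n10):
    """Two-sided exact McNemar p-value from the discordant counts (illustrative helper)."""
    n = n01 + n10
    if n == 0:
        return 1.0  # no discordant samples, so no evidence of a difference
    k = min(n01, n10)
    # P(X <= k) for X ~ Binomial(n, 0.5)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# made-up example: one model fixes 10 errors of the other but introduces 2 new ones
p_demo = mcnemar_exact_pvalue(n01=2, n10=10)
print(f"p-value: {p_demo:.4f}")  # p-value: 0.0386
```

Note that `n00` and `n11` do not enter the calculation: samples on which both models agree carry no information about which model is better.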

In [None]:
significance_threshold = 0.05

# boolean masks marking which test samples each model classified correctly
lr_correct = rslt_lr.best_estimator_.predict(X_test_scaled) == y_test.to_numpy()
rf_correct = rslt_rf.best_estimator_.predict(X_test_scaled) == y_test.to_numpy()

# counts for the 2x2 contingency table
n00 = int(np.sum(~lr_correct & ~rf_correct))
n01 = int(np.sum(~lr_correct & rf_correct))
n10 = int(np.sum(lr_correct & ~rf_correct))
n11 = int(np.sum(lr_correct & rf_correct))

table = [[n00, n01],[n10, n11]]
test = mcnemar(table, exact=True)

c_table = PrettyTable(['','RF Wrong', 'RF Correct'])
c_table.add_row(['LR Wrong', n00, n01])
c_table.add_row(['LR Correct', n10, n11])
print(c_table)

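if test.pvalue < significance_threshold:
    print(f'The models are significantly different with p-value {test.pvalue:.3f}')
else:
    print(f'No significant difference between the models was detected (p-value {test.pvalue:.3f})')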

## Learning Curve Analysis

Now that we have tested whether the two models' predictions differ significantly, let's examine their learning curves.

In [None]:
def plot_learning_curve(model, model_name, scoring='balanced_accuracy', ax=None):
    train_sizes = np.linspace(0.1, 1.0, 10)
    train_sizes, train_scores, validation_scores = learning_curve(
        model, X_train_scaled, y_train, train_sizes=train_sizes, cv=3, scoring=scoring, n_jobs=-1)
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)

    validation_mean = np.mean(validation_scores, axis=1)
    validation_std = np.std(validation_scores, axis=1)
    if ax is None:
        ax = plt.gca()
    ax.plot(train_sizes, train_mean, label='Training score', color='blue', marker='o')
    ax.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, color='blue', alpha=0.15)

    ax.plot(train_sizes, validation_mean, label='Validation score', color='green', marker='o')
    ax.fill_between(train_sizes, validation_mean - validation_std, validation_mean + validation_std, color='green', alpha=0.15)

    ax.set_title(f'{model_name} Learning Curve')
    ax.set_xlabel('Training Data Size')
    ax.set_ylabel(scoring.replace('_', ' ').title())
    ax.legend(loc='best')
    ax.grid()


In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
plot_learning_curve(rslt_rf.best_estimator_, 'Random Forest', ax=axes[1])
plot_learning_curve(rslt_lr.best_estimator_, 'Logistic Regression', ax=axes[0])

# Problem 4 (2 points)
Based on the learning curves:
1. Both models display some bias (insufficient capacity). Which model do you think has worse bias and why?
2. Which of the two models displays variance (overfitting)? Justify your answer.
3. Which model do you think might benefit from additional training samples? Justify your answer.

Problem 4 - your response here.