CPSC 8810 Machine Learning for Biomedical Applications

# Assignment 2 - Classifier Model Development with Structured Data
# Bacteremia Dataset
In this assignment, you are asked to use the `a2-bactermia.csv` dataset to create classification models that predict whether or not bacteria are present in a given blood sample based on input features that include patient age, sex, and 50 laboratory measurements. The data and the associated data dictionary are contained in the course repository in the _assignments/source_data_ folder. Detailed information can be found in the related journal article [A Risk Prediction Model for Screening Bacteremic Patients: A Cross Sectional Study](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0106765).

__Please read through the notebook and follow the instructions for each of the 9 problems.__

In [None]:
# Google Colab setup
# mount the google drive - this is necessary to access supporting src
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
# imports
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split, StratifiedKFold, RandomizedSearchCV, learning_curve
from sklearn.metrics import balanced_accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay, RocCurveDisplay, PrecisionRecallDisplay
from prettytable import PrettyTable
from sklearn.impute import KNNImputer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar
from imblearn.over_sampling import SMOTE

# local project imports
import sys
sys.path.append("/content/drive/MyDrive/Colab Notebooks/CPSC-8810-ML-BioMed/src")
from plotting import plt_box_grid_by_target, plt_box_grid, plt_xy_scatter_grid
from cluster_utils import plot_hclust
from util import load_data

In [None]:
# global settings
pd.options.display.max_columns = 100
rs = 654321 # random state, use this to ensure reproducibility

In [None]:
####################################################################################################
# DO NOT CHANGE THIS CELL
####################################################################################################
# load the bactermia dataset
X, y = load_data('bacteremia', '../source_data')
#X, X_discard, y, y_discard = train_test_split(X_all, y_all, stratify=y_all, test_size=0.5, random_state=rs)
X.head()

# load the data dictionary
dd = load_data('bacteremia_dictionary', '../source_data')
dd.head()

# Data Preprocessing

We will first split the data into a training set (70%) and a test set (30%).

In [None]:
# split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=rs)

# READ THIS
Before proceeding further, note that the training dataset is rather large. Many of the evaluations in the code cells that follow may take long periods of time to complete on computers with limited resources. It is recommended that you uncomment the code below while you are developing and testing your code. Once you are confident your code is correct, then comment this line out and rerun the entire notebook.

In [None]:
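# train_test_split returns four objects here; keep only the 5% stratified subsample for development
#_, X_train, _, y_train = train_test_split(X_train, y_train, stratify=y_train, test_size=0.05, random_state=rs)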

### Missing Data
Next, let's examine the percentage of missing data for each feature in the training and test sets. In a real-world problem, we would need to assess whether any of the features are suspected to be missing not at random prior to imputation. For the purposes of this assignment, we'll assume all missing observations are either missing completely at random or missing at random.

In [None]:
mp_train = (len(X_train.index) - X_train.count())/len(X_train.index)*100
mp_test = (len(X_test.index) - X_test.count())/len(X_test.index)*100
mp_table = PrettyTable(['Feature', 'Training Missing %', 'Test Missing %'])
for j in range(len(mp_train.index)):
    label = mp_train.index[j]
    if mp_train[label] > 0:
        mp_table.add_row([label, f'{mp_train[label]:.1f}', f'{mp_test[label]:.1f}'])
print(mp_table)

# Problem 1 (1 point)
In the code cell below, use _K Nearest Neighbors_ imputation to impute missing observations for the training data, `X_train`. Store the new values in a dataframe variable `X_train_imputed`. Use the imputation model fit on the training data to impute the missing observations for the test set, `X_test`, and store the result in `X_test_imputed`. The test data should __not__ be used to fit the imputation model. __HINT__: See Practicum 5.
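
The fit-on-train, transform-on-test pattern described above can be sketched on a toy example. This is only an illustration under stated assumptions: the column names (`wbc`, `crp`, `age`), the tiny dataframes, and `n_neighbors=2` are hypothetical and are not taken from the bacteremia data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data with a few missing values (not the assignment data)
train = pd.DataFrame({
    "wbc": [5.0, 7.0, np.nan, 9.0, 6.0],
    "crp": [1.0, np.nan, 3.0, 8.0, 2.0],
    "age": [30.0, 45.0, 50.0, 70.0, 35.0],
})
test = pd.DataFrame({
    "wbc": [np.nan, 8.0],
    "crp": [2.0, np.nan],
    "age": [40.0, 65.0],
})

# Fit the imputer on the training data only
knn = KNNImputer(n_neighbors=2)
train_imputed = pd.DataFrame(
    knn.fit_transform(train), columns=train.columns, index=train.index
)

# Reuse the fitted imputer on the test data (no refitting)
test_imputed = pd.DataFrame(
    knn.transform(test), columns=test.columns, index=test.index
)

print(train_imputed)
print(test_imputed)
```

Wrapping the returned arrays in a dataframe with the original columns and index keeps the result compatible with later cells that expect pandas objects.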

In [None]:
##### Problem 1 Your code here #####
imputer = None

X_train_imputed = None
X_test_imputed = None

# Problem 2 (1 point)
In the code cell below, plot kernel density estimates (KDEs) of the features in the training data before and after imputation. Only plot the features with greater than 1% missing observations. The plots should be arranged in a subplot grid with 4 or 5 columns. __HINT__: See Practicum 5.

In [None]:
##### Problem 2 Your code here #####

# Problem 3 (1 point)
In the code cell below, standardize the continuous features (all but `AGE` and `SEX`) in the training set, `X_train_imputed`, to have zero mean and unit variance. Store the standardized features in a pandas dataframe variable `X_train_scaled`. Using the training data feature means and variances, scale the feature values in the test set, `X_test_imputed`, and store the result in a pandas dataframe variable `X_test_scaled`. __HINT__: The scikit-learn `ColumnTransformer` class will be useful here; see Practicum 5.
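
As a rough illustration of selective standardization, the sketch below scales two columns and passes the rest through unchanged. The toy dataframes and the `WBC` and `CRP` column names are assumptions made for this example; only `AGE` and `SEX` come from the assignment text.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (WBC and CRP are illustrative column names)
train = pd.DataFrame({
    "AGE": [30, 45, 50, 70],
    "SEX": [0, 1, 1, 0],
    "WBC": [5.0, 7.0, 9.0, 11.0],
    "CRP": [1.0, 3.0, 8.0, 2.0],
})
test = pd.DataFrame({
    "AGE": [40, 65],
    "SEX": [1, 0],
    "WBC": [8.0, 6.0],
    "CRP": [2.0, 4.0],
})

cont_cols = ["WBC", "CRP"]
other_cols = [c for c in train.columns if c not in cont_cols]

# Scale only cont_cols; the remaining columns pass through untouched
ct = ColumnTransformer(
    [("scale", StandardScaler(), cont_cols)], remainder="passthrough"
)

# Output order is: transformed columns first, then the passthrough columns
out_cols = cont_cols + other_cols
train_scaled = pd.DataFrame(
    ct.fit_transform(train), columns=out_cols, index=train.index
)[train.columns]

# The test set is scaled with the training means and variances
test_scaled = pd.DataFrame(
    ct.transform(test), columns=out_cols, index=test.index
)[train.columns]

print(train_scaled)
print(test_scaled)
```

Because `ColumnTransformer` reorders its output, the final indexing step restores the original column order.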

In [None]:
##### Problem 3 Your code here #####
# standardize the continuous and integer features. Do not standardize the binary features.
continuous_vars = X.columns.drop(['AGE', 'SEX'])
standarizer = None

X_train_scaled = None

# apply the standardization fit on the training data to the test set features
X_test_scaled = None

# Imbalanced Data
In the bar graphs below, we show the class balance of the target label for both the training data and the test data. In both cases, only 8% of the samples are positive for bacteremia. It turns out that this imbalance significantly degrades model performance on the training data.

In [None]:
# count plot of the target variable
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 3))

d = len(y_train.index)
vc = y_train.value_counts()
ax = axes[0]
vc.plot(kind='bar', title='Training Data Bacteremia Class Balance', ax=ax)
ax.set_xticklabels([f"No - {vc[0]/d*100:.1f}%", f"Yes - {vc[1]/d*100:.1f}%"], rotation=0)
ax.set_xlabel('Bacteremia')
ax.set_ylabel('Sample Counts');

d = len(y_test.index)
vc = y_test.value_counts()
ax = axes[1]
vc.plot(kind='bar', title='Test Data Bacteremia Class Balance', ax=ax)
ax.set_xticklabels([f"No - {vc[0]/d*100:.1f}%", f"Yes - {vc[1]/d*100:.1f}%"], rotation=0)
ax.set_xlabel('Bacteremia')
ax.set_ylabel('Sample Counts');

### SMOTE oversampling
Because the class imbalance will impact our model performance, we will try applying an oversampling strategy to add positive samples to our data for model training. In the code cell below, the [SMOTE class](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html) from the [imbalanced-learn](https://imbalanced-learn.org/stable/index.html) library is used to increase the number of bacteremia-positive samples in the _training data only_. The test set is not altered.

__OPTIONAL__:<br/>
If you are interested in seeing the impact of using SMOTE on the model training and results, you can add the following line to the end of the code cell below to nullify the effect of SMOTE oversampling. Then rerun the entire notebook and examine the model performances. <br/><br/>
`X_train_resampled, y_train_resampled = X_train_scaled, y_train` <br/><br/>
You should see that accuracy and ROC are not significantly impacted, however recall on the test set is reduced to nearly 0% for both models. Please be sure to rerun the notebook with SMOTE enabled before submitting.

In [None]:
sm = SMOTE(random_state=rs, k_neighbors=10, sampling_strategy='minority')
X_train_resampled, y_train_resampled = sm.fit_resample(X_train_scaled, y_train)
print(f"Original training data shape: {X_train_scaled.shape}")
print(f"Resampled training data shape: {X_train_resampled.shape}")

In [None]:
# count plot of the target variable
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(6, 3))

d = len(y_train_resampled.index)
vc = y_train_resampled.value_counts()
ax = plt.gca()
vc.plot(kind='bar', title='Training Data After Resampling Bacteremia Class Balance', ax=ax)
ax.set_xticklabels([f"No - {vc[0]/d*100:.1f}%", f"Yes - {vc[1]/d*100:.1f}%"], rotation=0)
ax.set_xlabel('Bacteremia')
ax.set_ylabel('Sample Counts');

# Problem 4 (1 point)
In the code cell below, use the scikit-learn `RandomizedSearchCV` class with `StratifiedKFold` cross validation to perform hyperparameter selection for a logistic regression model using `X_train_resampled` and `y_train_resampled` to fit the model. Your parameter space should include the following tuning parameter values:
- `C` in [0.001, 0.01, 0.1, 0.5, 1, 10, 100]
- `penalty` in ['l1', 'l2', 'elasticnet', None]
- `fit_intercept` in [True, False]

Store the best model in the variable `rslt_lr`. The best model should be fit on the entire SMOTE training data, `X_train_resampled` and `y_train_resampled`.
__HINT:__ See Practicum 5.
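
For orientation, here is a minimal sketch of a randomized search with stratified folds. It runs on synthetic data from `make_classification` and uses a deliberately reduced parameter space (`C` only), so the data, the variable names, and the settings are assumptions for illustration rather than the configuration this problem asks for.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in data (not the bacteremia dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Reduced, illustrative parameter space
param_space = {"C": [0.01, 0.1, 1, 10]}

# Stratified folds keep the class ratio roughly constant in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions=param_space,
    n_iter=3,                      # budget: number of sampled combinations
    scoring="balanced_accuracy",
    cv=cv,
    random_state=0,
    refit=True,                    # refit the best model on all supplied data
)
result = search.fit(X_demo, y_demo)

print(result.best_params_)
print(f"{result.best_score_:.2f}")
```

With `refit=True` (the default), `best_estimator_` is retrained on the full data passed to `fit`, which matches the requirement that the best model be fit on the entire training set.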

In [None]:
##### Problem 4 Your code here #####
model = LogisticRegression(random_state=rs, max_iter=1000, solver='saga', l1_ratio=0.5)

# specify the hyperparameter space
parameter_space = None

# specify the cross-validation method. Use stratified k-fold because the target variable is imbalanced
skf = None

# specify the budget (number of hyperparameter combinations to try)
budget = 20

# select a score to optimize
score = 'balanced_accuracy'

# number of jobs to run in parallel, -1 means use all CPU processors
n_jobs = -1

search_lr = None
rslt_lr = None

In [None]:
print(f'Best {score} score: {rslt_lr.best_score_:.2f}')
rslt_lr.best_estimator_

# Problem 5 (1 point)
In the code cell below, use the scikit-learn `RandomizedSearchCV` class with `StratifiedKFold` cross validation to perform hyperparameter selection for a random forest model using `X_train_resampled` and `y_train_resampled` to fit the model. Your parameter space should include the following tuning parameter values:
- `n_estimators` in [100, 200, 300, 400, 500]
- `max_depth` in [4, 8, 10, 12]
- `min_samples_leaf` in [1, 2]
- `max_features` in ['sqrt', 'log2']

Store the best model in the variable `rslt_rf`. The best model should be fit on the entire SMOTE training data, `X_train_resampled` and `y_train_resampled`.
__HINT:__ See Practicum 5.

In [None]:
##### Problem 5 Your code here #####
# specify the model
model = RandomForestClassifier(random_state=rs, bootstrap=True)

# specify the hyperparameter space
parameter_space = None

# specify the cross-validation method. Use stratified k-fold because the target variable is imbalanced
skf = None

# specify the budget (number of hyperparameter combinations to try)
budget = 20

# select a score to optimize
score = 'balanced_accuracy'

# number of jobs to run in parallel, -1 means use all CPU processors
n_jobs = -1

search_rf = None
rslt_rf = None

In [None]:
rslt_rf.best_params_

In [None]:
print(f'Best {score} score: {rslt_rf.best_score_:.2f}')
rslt_rf.best_estimator_

# Model Assessment

# Problem 6 (1 point)
In the code cell below, assess the logistic regression model performance on the test data `X_test_scaled` and `y_test`. Specifically, provide the following performance assessments:<br/>
1. Print the classification report (point metrics)
2. Plot the confusion matrix
3. Plot the ROC curve
4. Plot the Precision-Recall curve

__HINT__: See Practicum 5.
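
The numeric counterparts of these assessments can be sketched on a small hand-made example. The labels and scores below are invented for illustration; in the notebook they would come from a fitted model's `predict` and `predict_proba` outputs. The plots themselves can be produced with scikit-learn's `ConfusionMatrixDisplay`, `RocCurveDisplay`, and `PrecisionRecallDisplay` classes, which are already imported above.

```python
import numpy as np
from sklearn.metrics import (
    balanced_accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
)

# Invented labels and predicted probabilities for the positive class
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 1])
y_score_demo = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.2, 0.9, 0.1, 0.8])

# Hard predictions at the conventional 0.5 threshold
y_pred_demo = (y_score_demo >= 0.5).astype(int)

# Point metrics per class
print(classification_report(y_true_demo, y_pred_demo))

# Rows are true classes, columns are predicted classes
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
print(cm_demo)

# Threshold-dependent and threshold-free summaries
bal_acc_demo = balanced_accuracy_score(y_true_demo, y_pred_demo)
auc_demo = roc_auc_score(y_true_demo, y_score_demo)
print(f"balanced accuracy: {bal_acc_demo:.3f}, ROC AUC: {auc_demo:.3f}")
```

Note that the confusion matrix and classification report depend on the chosen threshold, while the ROC AUC is computed from the scores directly.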

In [None]:
##### Problem 6 Your code here #####

# Problem 7 (1 point)
In the code cell below, assess the random forest model performance on the test data `X_test_scaled` and `y_test`. Specifically, provide the following performance assessments:<br/>
1. Print the classification report (point metrics)
2. Plot the confusion matrix
3. Plot the ROC curve
4. Plot the Precision-Recall curve

__HINT__: See Practicum 5.

In [None]:
##### Problem 7 Your code here #####

# Problem 8 (1 point)
In the code cell below, use McNemar's test to determine if there is a significant difference in the predictions made on the test set between the logistic regression model and the random forest model. __HINT__: See Practicum 5.
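
As a sketch of the mechanics, McNemar's test operates on a 2x2 table that cross-tabulates whether each of two models classified each sample correctly. The correctness vectors below are invented, and the names `correct_a` and `correct_b` are hypothetical; in practice they would be derived by comparing each model's test-set predictions to `y_test`.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Invented per-sample correctness for two hypothetical models:
# 10 both correct, 6 only A correct, 1 only B correct, 3 both wrong
correct_a = np.array([1] * 10 + [1] * 6 + [0] * 1 + [0] * 3, dtype=bool)
correct_b = np.array([1] * 10 + [0] * 6 + [1] * 1 + [0] * 3, dtype=bool)

# Rows: model A correct / incorrect; columns: model B correct / incorrect
table = np.array([
    [np.sum(correct_a & correct_b), np.sum(correct_a & ~correct_b)],
    [np.sum(~correct_a & correct_b), np.sum(~correct_a & ~correct_b)],
])

# The exact (binomial) version is suited to small discordant counts
result = mcnemar(table, exact=True)
print(table)
print(f"statistic={result.statistic}, p-value={result.pvalue:.3f}")
```

Only the off-diagonal (discordant) cells drive the test; samples that both models get right or both get wrong carry no information about which model is better.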

In [None]:
##### Problem 8 Your code here #####

# Problem 9 (2 points)
Assume you are building a system to detect bacteremia for a clinic. The clinicians at the clinic have informed you that they are very concerned about false positive results. They are willing to accept a minimum sensitivity (recall) of 30% for bacteremia cases in order to minimize false positives. Which model, logistic regression or random forest, would you recommend? Justify your answer.
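
One way to reason about a constraint like this is to look for the operating point with the highest precision among thresholds that still meet the minimum recall. The sketch below does this on invented scores; the helper name `best_threshold_for_min_recall` and the toy data are hypothetical, and this is only one possible way to frame the comparison.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_threshold_for_min_recall(y_true, y_score, min_recall=0.30):
    """Return (threshold, precision, recall) with the highest precision
    among thresholds whose recall is at least min_recall."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # The final precision/recall pair has no associated threshold; drop it
    precision, recall = precision[:-1], recall[:-1]
    ok = recall >= min_recall
    idx = np.flatnonzero(ok)[np.argmax(precision[ok])]
    return thresholds[idx], precision[idx], recall[idx]

# Invented labels and scores for illustration
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 1])
y_score_demo = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.2, 0.9, 0.1, 0.8])

thr, prec, rec = best_threshold_for_min_recall(y_true_demo, y_score_demo)
print(f"threshold={thr:.2f}, precision={prec:.2f}, recall={rec:.2f}")
```

Applying the same procedure to each model's test-set scores gives a like-for-like comparison of precision at an acceptable recall level.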

#Problem 9: Enter your response here.