# Practicum 03 - Classification with Structured Biomedical Data

In this practicum, we apply classification models to the [HCV](https://archive.ics.uci.edu/dataset/571/hcv+data) dataset. Per the [UCI website](https://archive.ics.uci.edu/dataset/571/hcv+data), the dataset __"contains laboratory values of blood donors and Hepatitis C patients and demographic values like age"__. Specifically, there are 12 predictor variables: age, sex, and 10 laboratory measurements that are known to be associated with Hepatitis C presence and stage. __Our goal is to develop classification models that can accurately predict whether a potential blood donor has Hepatitis C.__ In the published paper by [Hoffmann et al.](https://jlpm.amegroups.org/article/view/4401/5425), the authors developed a multiclass model to predict Hepatitis stage. We will simplify the problem here to binary classification, in which we predict whether or not the potential donor has Hepatitis. We will develop a logistic regression model using [scikit-learn](https://scikit-learn.org/stable/index.html) and a gradient boosted tree model using [catboost](https://catboost.ai/).

Before working with the HCV data, we will first illustrate the model development and evaluation approaches using a simulated dataset.

This practicum will implement some of the concepts discussed in the two previous lectures, _Logistic Regression_ and _Tree-based classification_. Specifically, we will:
-  Demonstrate logistic regression classification on a simulated dataset
-  Apply logistic regression classification to the _HCV_ dataset
-  Demonstrate gradient boosting trees on the same simulated dataset
-  Apply gradient boosting trees to the _HCV_ dataset
-  Examine standard performance metrics on all models

In [None]:
# Google Colab setup
# mount the google drive - this is necessary to access supporting src
from google.colab import drive
drive.mount("/content/drive")

In [None]:
# install any packages not found in the Colab environment
!pip install ucimlrepo
!pip install catboost

In [None]:
# imports
from ucimlrepo import fetch_ucirepo
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.datasets import make_classification
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix,  ConfusionMatrixDisplay, RocCurveDisplay, PrecisionRecallDisplay
import numpy as np
import catboost as cb

# local project imports
import sys
sys.path.append("/content/drive/MyDrive/Colab Notebooks/CPSC-8810-ML-BioMed/src")
from plotting import plt_box_grid_by_target, plt_box_grid, plt_xy_scatter_grid
from uci_utils import get_vars_of_type
from regression_util import plot_fitted_resids, plot_outliers, plot_leverage
from filter import correlation_filter

In [None]:
# global settings
pd.options.display.max_columns = 100
rs = 654321 # random state, use this to ensure reproducibility

# Simulated data

To get started, let's generate simulated data for classification. We will use this data to illustrate the classification model development and evaluation process using both logistic regression and gradient boosting tree classification approaches.

To generate the simulated data, we use the `make_classification` function from [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html#). For demonstration, we will split the data into training and test sets.

In [None]:
# generate simulated classification data
n = 250 # number of samples
n_train = 200 # number of training samples
balance = 0.4 # fraction of negative class
X, y = make_classification(n_samples=n, n_features=10, n_informative=8, n_redundant=2,
                           n_classes=2, n_clusters_per_class=2,
                           flip_y=0.01, class_sep=1.0, hypercube=True, weights=[balance, 1-balance],
                           shift=0.0, scale=1.0, shuffle=True, random_state=rs)
# for plotting purposes, let's convert to a dataframe
df = pd.DataFrame(X, columns=[f'x{i}' for i in range(1, 11)])
df['target'] = y

# split into train and test
X_train = X[:n_train]
y_train = y[:n_train]
X_test = X[n_train:]
y_test = y[n_train:]

Let's make a bar plot of the target variable to see the distribution of the target class.

In [None]:
# count plot of the target variable
sns.countplot(df, x='target')

Next, let's generate boxplots of the predictor features grouped by the class label, $y$. This will give us a sense of the inter-class distribution differences for each feature. We should expect to see that most of the features have different means and IQRs across the classes, indicating an association between the feature and the outcome class.

In [None]:
# using the dataframe built above, create a grid of boxplots showing the distribution of each feature grouped by the target
plt_box_grid_by_target(df, target_label='target', num_cols=5, fig_size=(16,9));

# Logistic Regression Models
## Simulated data model

Let's start by fitting a logistic regression classifier model to the simulated data. We will use the [scikit-learn Logistic Regression module](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression) to create the model. We will rely on the default parameters which include _l2 regularization_. Later in the course, we will discuss hyperparameter selection approaches where we will modify the default values. Here, we fit the model to the _training data_.

In [None]:
clf = LogisticRegression(random_state=rs).fit(X_train, y_train)
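
As a hedged illustration of what the fitted model computes, the sketch below refits a logistic regression on a separate simulated dataset and checks that the predicted probability of class 1 is the logistic (sigmoid) function of the linear score returned by `decision_function`. The names `X_demo`, `y_demo`, and `demo_clf` are hypothetical and chosen so that the notebook's own variables are not overwritten. The argument `C=1.0` is written out only to show the default regularization strength.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# small simulated dataset used only for this illustration (hypothetical names)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, n_informative=8,
                                     n_redundant=2, random_state=0)

# C=1.0 is the default inverse regularization strength, written out explicitly
demo_clf = LogisticRegression(C=1.0, random_state=0).fit(X_demo, y_demo)

# linear score (weights times features plus intercept) for each sample
scores = demo_clf.decision_function(X_demo)

# the sigmoid maps the linear score to the probability of class 1
p_manual = 1.0 / (1.0 + np.exp(-scores))
p_model = demo_clf.predict_proba(X_demo)[:, 1]

print("sigmoid(score) matches predict_proba:", np.allclose(p_manual, p_model))
```

Smaller values of `C` correspond to stronger regularization, which is one of the hyperparameters we will revisit later in the course.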

Now let's examine the performance on the test set. Let's start by looking at the confusion matrix. In the confusion matrix, we see the counts of __true negative__ and __true positive__ predictions in the __top left__ and __bottom right corners__, respectively. We see the counts of __false positive__ and __false negative__ predictions in the top right and bottom left corners, respectively.

In [None]:
y_pred = clf.predict(X_test)
target_names = ['class 0', 'class 1']
cm = confusion_matrix(y_test, y_pred, labels=clf.classes_)
disp = ConfusionMatrixDisplay(cm, display_labels=target_names)
disp.plot(cmap='Blues')
plt.show()
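
To make the layout of the matrix concrete, here is a minimal sketch on a handful of hypothetical toy labels (`y_true_toy` and `y_pred_toy` are made up for illustration and are not part of the practicum data). For binary labels, flattening the matrix row by row yields the four counts in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# hypothetical toy labels, used only to show the layout of the matrix
y_true_toy = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_toy = [0, 1, 0, 1, 0, 1, 1, 0]

# rows are true classes, columns are predicted classes
cm_toy = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1])

# ravel() flattens row by row: top left, top right, bottom left, bottom right
tn, fp, fn, tp = cm_toy.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```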

Now let's examine some of the common point metrics used to evaluate classifier performance. We will use the [scikit-learn classification_report module](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) to generate a summary of the performance metrics.

In [None]:
print(classification_report(y_test, y_pred, target_names=target_names))

To keep the table small and accommodate multiclass scenarios, scikit-learn presents _per-class_ metrics. Let's break these down starting with the first two rows. Treating class 1 as the positive class, let _P_ be the number of positive samples, _N_ be the number of negative samples, _TP_ be the number of true positives (correctly predicted positives), _FP_ be the number of false positives, _TN_ be the number of true negatives (correctly predicted negatives), and _FN_ be the number of false negatives:
- __class 0 precision ($C_0P$)__: equal to _TN_/(_TN_+_FN_), also known as _negative predictive value_ (NPV)
- __class 0 recall ($C_0R$)__: equal to _TN/N_ = _TN_/(_TN_+_FP_), also known as _specificity_
- __class 0 f1-score__ : equal to $2\frac{C_0P\times C_0R}{C_0P+C_0R}$
- __class 1 precision ($C_1P$)__: equal to _TP_/(_TP_+_FP_), also known as _positive predictive value_ (PPV)
- __class 1 recall ($C_1R$)__: equal to _TP/P_ = _TP_/(_TP_+_FN_), also known as _sensitivity_
- __class 1 f1-score__ : equal to $2\frac{C_1P\times C_1R}{C_1P+C_1R}$

The _accuracy_ row is simply the overall model accuracy = (TP+TN)/(P+N).

The _macro avg_ row is the unweighted average over the class metric scores for each class. For example, the _macro avg precision_ = $\frac{1}{2}(C_0P + C_1P)$. The _macro avg_ treats all classes equally.

The _weighted avg_ row is similar to the _macro avg_ but weights each class metric score by the support proportion. For example, with supports of 16 and 34 in the 50-sample test set, the _weighted avg precision_ = $\frac{1}{50}(16C_0P + 34C_1P)$.
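
The definitions above can be checked numerically. The following sketch uses hypothetical toy labels (not the practicum data) to compute the per-class precision and recall directly from the four counts, then compares them against scikit-learn's `precision_recall_fscore_support`, the function that underlies the classification report.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# hypothetical toy labels, used only to check the formulas
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# counts with class 1 treated as the positive class
tp = np.sum((y_true_toy == 1) & (y_pred_toy == 1))
tn = np.sum((y_true_toy == 0) & (y_pred_toy == 0))
fp = np.sum((y_true_toy == 0) & (y_pred_toy == 1))
fn = np.sum((y_true_toy == 1) & (y_pred_toy == 0))

c0_precision = tn / (tn + fn)   # negative predictive value
c0_recall = tn / (tn + fp)      # specificity
c1_precision = tp / (tp + fp)   # positive predictive value
c1_recall = tp / (tp + fn)      # sensitivity

# macro average: every class counts equally
macro_precision = 0.5 * (c0_precision + c1_precision)

# weighted average: each class is weighted by its support (number of true samples)
support = np.bincount(y_true_toy)
weighted_precision = (support[0] * c0_precision + support[1] * c1_precision) / support.sum()

# reference values from scikit-learn, one entry per class
prec, rec, f1, sup = precision_recall_fscore_support(y_true_toy, y_pred_toy, labels=[0, 1])
print("manual precision :", c0_precision, c1_precision)
print("sklearn precision:", prec)
print("macro avg precision:", macro_precision, " weighted avg precision:", weighted_precision)
```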

Next, let's look at the _Receiver Operating Characteristic_ (ROC) curve, which tells us how well the model can trade off between recall (a.k.a. sensitivity or TPR) and specificity = 1 - FPR as the decision threshold varies (the threshold is the probability value of the positive class above which we decide the input belongs to the positive class). Ideally, we seek a model where the ROC curve rises sharply to 1.0 on the y-axis and doesn't decline. The _Area Under the ROC Curve_ (AUC) is a summary statistic for the ROC. The optimal value of AUC is 1.0. We can generate the plot directly from the model using the [scikit-learn RocCurveDisplay module](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.RocCurveDisplay.html). As we expected from the confusion matrix, the model performs very well in terms of the ROC.

In [None]:
# Plot the ROC
RocCurveDisplay.from_estimator(clf, X_test, y_test)
plt.grid()
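
To see how the decision threshold drives the ROC curve, the sketch below computes the TPR and FPR by hand at a few thresholds and then obtains the full curve and its area with `roc_curve` and `roc_auc_score`. It is only an illustration under stated assumptions: the data are freshly simulated, and the names `X_demo`, `demo_clf`, and `p_te` are hypothetical so that the notebook's variables are left untouched.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve

# separate simulated dataset for this illustration (hypothetical names)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, n_informative=8,
                                     n_redundant=2, random_state=0)
X_tr, y_tr = X_demo[:200], y_demo[:200]
X_te, y_te = X_demo[200:], y_demo[200:]

demo_clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p_te = demo_clf.predict_proba(X_te)[:, 1]  # predicted probability of class 1

# each threshold gives one (FPR, TPR) point on the ROC curve
manual_points = []
for threshold in (0.3, 0.5, 0.7):
    y_hat = (p_te >= threshold).astype(int)
    tpr = np.sum((y_hat == 1) & (y_te == 1)) / np.sum(y_te == 1)
    fpr = np.sum((y_hat == 1) & (y_te == 0)) / np.sum(y_te == 0)
    manual_points.append((fpr, tpr))
    print(f"threshold={threshold:.1f}  TPR={tpr:.3f}  FPR={fpr:.3f}")

# roc_curve sweeps over all thresholds; the AUC is the area under that curve
fpr_curve, tpr_curve, thresholds = roc_curve(y_te, p_te)
auc_value = roc_auc_score(y_te, p_te)
print(f"AUC = {auc_value:.3f}")
```

Raising the threshold can only shrink the set of predicted positives, so both TPR and FPR are non-increasing as the threshold grows.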

Finally, let's look at the _Precision Recall_ curve which tells us how well the model can trade off between recall and precision (a.k.a. positive predictive value). In cases of strong class imbalance (i.e., where there are many more samples in one class than the other for binary classification), the PR curve is a better measure of performance than the ROC. Similar to the AUC, the _average precision_ (AP) is a point metric that summarizes the PR curve with the optimal value being 1.0. We can generate the plot directly from the model using the [scikit-learn PrecisionRecallDisplay module](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.PrecisionRecallDisplay.html#sklearn-metrics-precisionrecalldisplay). Again, we see that this model has good PR performance as we expected from the confusion matrix.

In [None]:
PrecisionRecallDisplay.from_estimator(clf, X_test, y_test)
plt.grid()
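
The point about class imbalance can be illustrated with a short sketch. Here we simulate a dataset in which roughly 10% of samples are positive (an assumption made purely for illustration; the names `X_imb`, `imb_clf`, and so on are hypothetical) and compare the AUC with the average precision. A useful reference is that a classifier with no skill has an expected AP close to the positive class prevalence, whereas its AUC is about 0.5 regardless of the class balance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# imbalanced simulated data: about 90% class 0 and 10% class 1 (illustrative assumption)
X_imb, y_imb = make_classification(n_samples=1000, n_features=10, n_informative=8,
                                   n_redundant=2, weights=[0.9, 0.1], random_state=0)

# stratify so that train and test keep the same class proportions
X_tr, X_te, y_tr, y_te = train_test_split(X_imb, y_imb, test_size=0.3,
                                          stratify=y_imb, random_state=0)

imb_clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p_te = imb_clf.predict_proba(X_te)[:, 1]  # predicted probability of class 1

prevalence = y_te.mean()                       # no-skill reference level for AP
ap_value = average_precision_score(y_te, p_te)
auc_value = roc_auc_score(y_te, p_te)

# the full PR curve, one (precision, recall) pair per threshold
precision_curve, recall_curve, _ = precision_recall_curve(y_te, p_te)

print(f"positive prevalence (no-skill AP): {prevalence:.3f}")
print(f"average precision: {ap_value:.3f}")
print(f"ROC AUC: {auc_value:.3f}")
```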

# Logistic Regression Model for HCV data

Now, let's apply the logistic regression model analysis to the [HCV](https://archive.ics.uci.edu/dataset/571/hcv+data) dataset. Our goal is to use the features of this dataset to predict whether the potential blood donor has Hepatitis C or not.

The predictors for this dataset include demographics and several blood test features.

In a complete model development process, we would explore the dataset as we did in Practicum 1. In the interest of time, here we will proceed directly to model development. You are encouraged to use the lessons of Practicum 1 and the above data exploration techniques for the simulated dataset to explore this dataset.

Before developing our model, we will need to:
1. Pull the data from the UCI website
2. Address any missing data
3. Address feature cross-correlation
4. Standardize features

This dataset does not have categorical features, so we will not need to create dummy values.

We will address these concerns in more detail in future lectures. For now, we will address missing data by dropping incomplete cases. We will address feature cross-correlation by dropping one feature from any pair that has a correlation coefficient >0.95. Finally, we will standardize continuous features as we did in Practicum 1.
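
The preprocessing steps described above can be sketched on a small synthetic table. This is a rough illustration, not the course's implementation: `drop_highly_correlated` is a hypothetical helper written for this example (the `correlation_filter` function in the course `src` folder may behave differently), the toy columns are made up, and the filter is applied to the absolute value of the correlation, which is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# synthetic table with one nearly duplicated column and two missing values
rng = np.random.default_rng(0)
n_rows = 200
toy = pd.DataFrame({'a': rng.normal(size=n_rows), 'b': rng.normal(size=n_rows)})
toy['a_copy'] = 2.0 * toy['a'] + rng.normal(scale=0.01, size=n_rows)
toy.loc[[3, 17], 'b'] = np.nan

# missing data: drop incomplete cases
toy_complete = toy.dropna()

def drop_highly_correlated(frame, threshold=0.95):
    """Hypothetical helper: drop the later column of each pair with |r| > threshold."""
    corr = frame.corr().abs()
    # keep only the upper triangle so each pair is examined once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop

# cross-correlation: remove one feature from each highly correlated pair
toy_filtered, dropped = drop_highly_correlated(toy_complete)
print("dropped columns:", dropped)

# standardization: fit the scaler on the training rows only, then reuse it on the test rows
train, test = toy_filtered.iloc[:150], toy_filtered.iloc[150:]
scaler = StandardScaler().fit(train)
train_scaled = pd.DataFrame(scaler.transform(train), columns=train.columns, index=train.index)
test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns, index=test.index)
print(train_scaled.mean().round(3).to_dict())
```

Fitting the scaler on the training rows only means the test rows are transformed with the training mean and standard deviation, so no information from the test set leaks into the preprocessing.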

Once these steps are completed, we will complete the following:
1. Fit a logistic regression classification model to the training data.
2. Assess performance on the test set using point metrics, ROC and PR curves.

In [None]:
# fetch dataset
hcv_data = fetch_ucirepo(id=571)

# data (as pandas dataframes)
X = hcv_data.data.features.dropna().copy()
X['Sex_male'] = [1 if x=='m' else 0 for x in X.Sex] # convert to binary from string
X.drop(columns=['Sex'], inplace=True) # drop the original column

# converting this to a binary classification problem where the
# positive class is hepatitis, fibrosis, or cirrhosis
y = hcv_data.data.targets.loc[X.index].isin(['1=Hepatitis', '2=Fibrosis', '3=Cirrhosis']).astype(int)
meta_vars = hcv_data.variables

# split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=rs)

# standardize the continuous features
continuous_vars, X_train_continuous = get_vars_of_type(X_train, meta_vars, var_type_key = 'type', var_name_key = 'name', type_kw = 'Continuous')
scaler = preprocessing.StandardScaler().fit(X_train_continuous)
X_train_continuous_scaled = scaler.transform(X_train_continuous)
# note we use the same scaler for the test data to prevent data leakage
continuous_vars, X_test_continuous = get_vars_of_type(X_test, meta_vars, var_type_key = 'type', var_name_key = 'name', type_kw = 'Continuous')
X_test_continuous_scaled = scaler.transform(X_test_continuous)

# this dataset does not contain any categorical features

# rebuild dataframes from the scaled continuous features and reset the target indices to match
X_train_new_lr = pd.DataFrame(X_train_continuous_scaled, columns=X_train_continuous.columns)
X_test_new_lr = pd.DataFrame(X_test_continuous_scaled, columns=X_train_continuous.columns)
y_train_new_lr = y_train.reset_index(drop=True)
y_test_new_lr = y_test.reset_index(drop=True)

# Problem 1 (1 point)
In the code cell below, (1) use the scikit-learn `LogisticRegression` module to fit a logistic regression model to the HCV data using the `X_train_new_lr` and `y_train_new_lr` variables. Store the fit model in the `clf` variable; and (2) generate the model predictions on the test set `X_test_new_lr`. Store the model predictions in the `y_pred` variable.

In [None]:
################# YOUR CODE HERE ##################
clf = None
y_pred = None

Now let's look at the confusion matrix to get an initial assessment of the model performance on the HCV test set.

In [None]:
target_names = ['blood donor', 'hepatitis']
cm = confusion_matrix(y_test_new_lr, y_pred, labels=clf.classes_)
disp = ConfusionMatrixDisplay(cm, display_labels=target_names)
disp.plot(cmap='Blues')
plt.show()

# Problem 2 (1 point)

Now let's generate some formal performance metrics. First, let's view the standard point metrics. In the code cell below, use the `y_test_new_lr` and `y_pred` variables to print the _classification report_ from scikit-learn. It will make the table easier to read if you provide the `target_names` keyword argument using the target names as in the confusion matrix.

In [None]:
################# YOUR CODE HERE ##################

# Problem 3 (1 point)
Finally, let's examine the Receiver Operating Characteristic curve and the Precision Recall curve. In the two code cells below, plot the ROC and PR on the test data, `X_test_new_lr` and `y_test_new_lr`, for the logistic regression classifier `clf`.

In [None]:
################# YOUR CODE HERE ##################
# Plot the ROC

In [None]:
################# YOUR CODE HERE ##################
# Plot the PR

# Tree-based Classification Models using CatBoost

## Simulated data model

Let's now investigate a tree-based model. As in Practicum 2, we will use the [CatBoost.ai](https://catboost.ai/) library to create a gradient boosted tree ensemble model. For more details on the implementation, see the original [CatBoost paper](https://arxiv.org/abs/1706.09516).

First, let's recreate our simulated data.

In [None]:
X, y = make_classification(n_samples=n, n_features=10, n_informative=8, n_redundant=2,
                           n_classes=2, n_clusters_per_class=2,
                           flip_y=0.01, class_sep=1.0, hypercube=True, weights=[balance, 1-balance],
                           shift=0.0, scale=1.0, shuffle=True, random_state=rs)

# split into train and test
X_train = X[:n_train]
y_train = y[:n_train]
X_test = X[n_train:]
y_test = y[n_train:]

# wrap the features and labels in CatBoost Pool objects
train_dataset = cb.Pool(X_train, y_train)
test_dataset = cb.Pool(X_test, y_test)

We can now fit a gradient boosted tree model to the simulated data using the [CatBoostClassifier module](https://catboost.ai/en/docs/concepts/python-reference_catboostclassifier). As with the scikit-learn logistic regression classifier, the CatBoost classifier module has numerous tuning parameters. Here, we will use the default parameters, again noting that in later sessions we will instead optimize these tuning parameters using a systematic approach.

In [None]:
cb_clf = cb.CatBoostClassifier(random_seed=rs, verbose=False)
cb_clf.fit(train_dataset)

Just as in the logistic regression model, once we have fit the model with the training data, we can use it to predict the class labels for the test data using the `predict` method.

In [None]:
y_pred = cb_clf.predict(X_test)

Now let's examine the performance on the test set. __Fortunately, the CatBoost API uses the same conventions as scikit-learn, and so we can use all the scikit-learn performance assessment modules directly on the CatBoost model object `cb_clf` and the test set predictions, `y_pred`.__

Let's start by looking at the confusion matrix. We can see that the CatBoost model effectively performs the same as the logistic regression model.

In [None]:
target_names = ['class 0', 'class 1']
cm = confusion_matrix(y_test, y_pred, labels=cb_clf.classes_)
disp = ConfusionMatrixDisplay(cm, display_labels=target_names)
disp.plot(cmap='Blues')
plt.show()

Next let's examine the point metric values for the CatBoost model using the sklearn `classification_report` module.

In [None]:
print(classification_report(y_test, y_pred, target_names=target_names))

Finally, let's examine the ROC and PR curves for the CatBoost model. Not surprisingly, we see that the model has achieved nearly optimal performance.

In [None]:
# Plot the ROC
RocCurveDisplay.from_estimator(cb_clf, X_test, y_test)
plt.grid()

In [None]:
PrecisionRecallDisplay.from_estimator(cb_clf, X_test, y_test)
plt.grid()

# CatBoost Classifier for HCV Data

Now, let's apply the gradient boosted tree model analysis to the [HCV](https://archive.ics.uci.edu/dataset/571/hcv+data) dataset. We will first reload the dataset. Notice two important items here:
1. Because we are using decision trees, we do not need to standardize the variables. Recall that no weights are applied to the inputs; only cutpoints need to be selected.
2. There are no categorical variables in this dataset.

As with the simulated data, we use the CatBoost `Pool` class to construct training and testing pools.
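
The claim that trees do not require standardization can be checked with a small experiment. The sketch below uses a single scikit-learn `DecisionTreeClassifier` as a stand-in for the boosted ensemble (a simplification, not the CatBoost algorithm itself) on freshly simulated data with hypothetical variable names. It fits one tree on the raw features and one on standardized features; because standardization preserves the ordering of each feature's values, the two trees should choose equivalent cutpoints and make the same predictions, up to floating-point effects.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# separate simulated dataset for this illustration (hypothetical names)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, n_informative=8,
                                     n_redundant=2, random_state=0)
X_tr, y_tr = X_demo[:300], y_demo[:300]
X_te = X_demo[300:]

# standardization is fit on the training rows only
scaler = StandardScaler().fit(X_tr)

# same tree settings and seed, trained on raw versus standardized features
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(scaler.transform(X_tr), y_tr)

pred_raw = tree_raw.predict(X_te)
pred_scaled = tree_scaled.predict(scaler.transform(X_te))

# fraction of test samples on which the two trees agree
agreement = np.mean(pred_raw == pred_scaled)
print(f"fraction of identical predictions: {agreement:.3f}")
```

A logistic regression model does not share this property when regularization is used, since the penalty acts on the weights and the weights depend on the feature scales.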

In [None]:
# fetch dataset
hcv_data = fetch_ucirepo(id=571)

# data (as pandas dataframes)
X = hcv_data.data.features.dropna().copy()
X['Sex_male'] = [1 if x=='m' else 0 for x in X.Sex] # convert to binary from string
X.drop(columns=['Sex'], inplace=True) # drop the original column
# converting this to a binary classification problem where the
# positive class is hepatitis, fibrosis, or cirrhosis
y = hcv_data.data.targets.loc[X.index].isin(['1=Hepatitis', '2=Fibrosis', '3=Cirrhosis']).astype(int)
meta_vars = hcv_data.variables

# split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=rs)

# there are no categorical variables in this dataset
# create the catboost datasets
train_dataset = cb.Pool(X_train, y_train, cat_features=None)
test_dataset = cb.Pool(X_test, y_test, cat_features=None)

# Problem 4 (1 point)
In the code cell below, (1) use the CatBoost `CatBoostClassifier` module to fit a boosted tree model to the HCV data using the `train_dataset` pool variable. Store the fit model in the `cb_clf` variable; and (2) generate the model predictions on the test set `test_dataset` pool variable. Store the model predictions in the `y_pred` variable.

In [None]:
################# YOUR CODE HERE ##################
cb_clf = None
y_pred = None

Now let's look at the confusion matrix to get an initial assessment of the model performance on the HCV test set.

In [None]:
y_pred = cb_clf.predict(X_test)
target_names = ['blood donor', 'hepatitis']
cm = confusion_matrix(y_test, y_pred, labels=cb_clf.classes_)
disp = ConfusionMatrixDisplay(cm, display_labels=target_names)
disp.plot(cmap='Blues')
plt.show()

# Problem 5 (1 point)

Now let's generate some formal performance metrics for the CatBoost classifier. First, let's view the standard point metrics. In the code cell below, use the `y_test` and `y_pred` variables to print the _classification report_ from scikit-learn. It will make the table easier to read if you provide the `target_names` keyword argument using the target names as in the confusion matrix.

In [None]:
################# YOUR CODE HERE ##################

# Problem 6 (1 point)
Finally, let's examine the Receiver Operating Characteristic curve and the Precision Recall curve. In the two code cells below, plot the ROC and PR on the test data, `X_test` and `y_test`, for the CatBoost classifier `cb_clf`.

In [None]:
################# YOUR CODE HERE ##################
# Plot the ROC

In [None]:
################# YOUR CODE HERE ##################
# Plot the PR