# Modelling - Classification

This notebook explores classification models to **predict risk of becoming a long stayer** as a baseline to the [Long Stayer Risk Stratification](https://github.com/nhsx/skunkworks-long-stayer-risk-stratification) model.

This notebook is broken down into:

1. Converting the length of stay into a relative risk
2. Training a range of baseline models using cross validation
3. Testing final models on a test dataset
4. Exploring in more detail the best performing baseline model

In [None]:
import math
import pickle
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
from catboost import CatBoostClassifier
from sklearn import preprocessing
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_class_weight
from xgboost import XGBClassifier

sys.path.append("../src/")

from utils import risk_score, train_and_test_model

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

%matplotlib inline
plt.rcParams["figure.figsize"] = [15, 8]

## Load features

In [None]:
features_df = pd.read_parquet("../../data/processed/features.parquet")
features_df.shape

In [None]:
features_catboost_df = pd.read_parquet("../../data/processed/features-catboost.parquet")
features_catboost_df.shape

## Calculate risk scores

We will convert actual Length of Stay (days) into a risk score defined as:

Risk Category|Day Range for Risk Category
-----|------
1 - Very low risk|0-6
2 - Low risk|7-10
3 - Normal risk|11-13
4 - Elevated risk|14-15
5 - High risk|>15
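
The mapping in the table can be sketched as a small function. This is an illustration only: `risk_score_sketch` is a hypothetical name, the real `risk_score` imported from `utils` is not shown in this notebook and may differ, and the treatment of fractional days is an assumption.

```python
def risk_score_sketch(length_of_stay_days: float) -> int:
    """Map a length of stay in days to a risk category (1-5) using the table above."""
    if length_of_stay_days < 0:
        raise ValueError("length of stay cannot be negative")
    if length_of_stay_days < 7:
        return 1  # very low risk: 0-6 days
    if length_of_stay_days < 11:
        return 2  # low risk: 7-10 days
    if length_of_stay_days < 14:
        return 3  # normal risk: 11-13 days
    if length_of_stay_days <= 15:
        return 4  # elevated risk: 14-15 days
    return 5  # high risk: more than 15 days


# hypothetical lengths of stay, chosen to hit every boundary in the table
example_stays = [0, 6, 7, 10, 11, 13, 14, 15, 16, 40]
example_risks = [risk_score_sketch(days) for days in example_stays]
print(example_risks)
```

The boundary values (6/7, 10/11, 13/14, 15/16) are the cases worth checking against the real helper.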

In [None]:
# actual risk scores
risk_labels = [
    "1 - Very Low Risk",
    "2 - Low Risk",
    "3 - Normal Risk",
    "4 - Elevated Risk",
    "5 - High Risk",
]
features_df["risk"] = [risk_score(los) for los in features_df.LENGTH_OF_STAY]
features_catboost_df["risk"] = [
    risk_score(los) for los in features_catboost_df.LENGTH_OF_STAY
]

## Define target and training features

In [None]:
X = features_df.drop(columns=["LENGTH_OF_STAY"])
y = features_df.risk

# Non-one-hot encoded data for catboost
X_catboost = features_catboost_df.drop(columns=["LENGTH_OF_STAY"])
y_catboost = features_catboost_df.risk

## Train/test split

For model evaluation, we will hold back a 25% test set, and use cross-validation on the remaining 75% for all models until the final comparison is made.

In [None]:
# Split data for train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=42
)
print(X_train.shape, X_test.shape)

# Scale data for LogReg, fitting the scaler on the training data only
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = pd.DataFrame(
    scaler.transform(X_train), index=X_train.index, columns=X_train.columns
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), index=X_test.index, columns=X_test.columns
)
print(X_train_scaled.shape, X_test_scaled.shape)

# Split data for train/test
X_train_catboost, X_test_catboost, y_train_catboost, y_test_catboost = train_test_split(
    X_catboost, y_catboost, train_size=0.75, random_state=42
)
print(X_train_catboost.shape, X_test_catboost.shape)

## Explore class imbalance

This is a multi-class classification problem, so we need to understand where any class imbalance lies; otherwise the models will be skewed towards the larger classes.

In [None]:
# Show how many of each class are present in the training set:
X_train.risk.value_counts().sort_index()

### Weight samples

Given significant class imbalance, we will weight samples so that the smallest class has a weight of 1 and every other class has a proportionally smaller weight (<1). This is called _balanced_ weighting and can be calculated automatically by many algorithms, but not all (e.g. XGBoost).

e.g.

class|count|weight
---|---|---
1|1000|0.1
2|100|1
3|500|0.2
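
As a rough check of the example table, the weights can be reproduced with the standard library alone. The labels below are made up to match the table, and the comparison with scikit-learn's "balanced" heuristic (`n_samples / (n_classes * count)`) is computed by hand rather than by calling the library.

```python
from collections import Counter

# Toy labels matching the example table: 1000 of class 1, 100 of class 2, 500 of class 3
labels = [1] * 1000 + [2] * 100 + [3] * 500
counts = Counter(labels)

# Smallest class gets weight 1; every other class gets min_count / its own count
min_count = min(counts.values())
class_weights = {cls: min_count / n for cls, n in sorted(counts.items())}

# The "balanced" heuristic is n_samples / (n_classes * count);
# dividing by its largest value gives the same weights as above
n_samples, n_classes = len(labels), len(counts)
balanced = {cls: n_samples / (n_classes * n) for cls, n in counts.items()}
max_balanced = max(balanced.values())
rescaled = {cls: w / max_balanced for cls, w in sorted(balanced.items())}

print(class_weights)
print(rescaled)
```

Both dictionaries agree up to floating point error, which is why the same weights can be reused for algorithms that do not offer a balanced option.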

In [None]:
# note these are same for one-hot encoded and unencoded dataframes
# note this is same as compute_class_weight/(max weight from compute_class_weight)
weights = X_train.risk.value_counts().min() / X_train.risk.value_counts().sort_index()
sample_weights = [weights[risk] for risk in X_train["risk"].values]

# drop risk from the training/test sets
X_train.drop(columns="risk", inplace=True)
X_test.drop(columns="risk", inplace=True)

X_train_scaled.drop(columns="risk", inplace=True)
X_test_scaled.drop(columns="risk", inplace=True)

X_train_catboost.drop(columns="risk", inplace=True)
X_test_catboost.drop(columns="risk", inplace=True)

# Modelling

The strategy is to try a number of classification models as follows:

* Baseline models for each algorithm trained on the training set with default parameters
* Baseline models tested on the test set
* GridSearchCV for hyperparameter tuning on the best performing model
* Explore feature importance of the final model

In [None]:
# Initiate empty models dictionary
models = {}

### Prior model

The simplest baseline model takes the most frequent class label as its prediction.
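
To make the idea concrete, here is a minimal sketch of a most-frequent-class baseline in plain Python. The labels are made up and `most_frequent_class` is a hypothetical helper; the notebook itself uses `DummyClassifier`.

```python
from collections import Counter


def most_frequent_class(train_labels):
    """Return the most common label in the training set."""
    return Counter(train_labels).most_common(1)[0][0]


# made-up risk labels for illustration
train_labels = [1, 1, 1, 2, 3, 1, 2, 5]
test_labels = [1, 2, 1, 4]

# Every test row receives the same prediction, regardless of its features
prediction = most_frequent_class(train_labels)
predictions = [prediction] * len(test_labels)
accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
print(prediction, accuracy)
```

Any useful model should beat this score, since it uses no information from the features.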

In [None]:
model_name = "prior"

# define estimator
estimator = DummyClassifier(strategy="prior")

# takes ~1 s to run on a STANDARD_D13_V2
models[model_name] = train_and_test_model(
    estimator, X_train, y_train, X_test, y_test, scoring_metric="f1_weighted"
)
models[model_name]

### Logistic Regression (elastic net regularisation)

Multi-class balanced and regularised (l1/l2 ratio of 0.5) logistic model
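
For intuition, the elastic net penalty mixes the L1 and L2 penalties according to `l1_ratio`. The sketch below follows the form given in the scikit-learn user guide as far as I recall it, leaves out the overall regularisation strength `C`, and uses a hypothetical helper name and made-up coefficients.

```python
def elastic_net_penalty(weights, l1_ratio=0.5):
    """Penalty = l1_ratio * sum(|w|) + (1 - l1_ratio) / 2 * sum(w^2)."""
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return l1_ratio * l1 + (1 - l1_ratio) / 2 * l2


coefficients = [1.0, -2.0, 0.0]  # made-up model coefficients
mixed = elastic_net_penalty(coefficients, l1_ratio=0.5)
pure_l1 = elastic_net_penalty(coefficients, l1_ratio=1.0)
pure_l2 = elastic_net_penalty(coefficients, l1_ratio=0.0)
print(mixed, pure_l1, pure_l2)
```

Setting `l1_ratio` to 1 or 0 recovers the pure L1 or pure L2 penalty, so 0.5 sits halfway between the two.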

In [None]:
model_name = "elastic"

# define estimator
estimator = LogisticRegression(
    class_weight="balanced",
    penalty="elasticnet",
    solver="saga",
    l1_ratio=0.5,
    random_state=42,
)

# takes ~2 minutes to run on a STANDARD_D13_V2
# train and test on the standardised features prepared above
models[model_name] = train_and_test_model(
    estimator,
    X_train_scaled,
    y_train,
    X_test_scaled,
    y_test,
    scoring_metric="f1_weighted",
)
models[model_name]

### Decision tree classifier (weighted)

Simplest tree classifier using one tree

In [None]:
model_name = "decisiontree"

# define estimator
estimator = DecisionTreeClassifier(class_weight="balanced", random_state=42)

# takes ~5s to run on a STANDARD_D13_V2
models[model_name] = train_and_test_model(
    estimator, X_train, y_train, X_test, y_test, scoring_metric="f1_weighted"
)
models[model_name]

### Random forest (weighted)

In [None]:
model_name = "randomforest"

estimator = RandomForestClassifier(class_weight="balanced", random_state=42)

# takes ~1 min to run on a STANDARD_D13_V2
models[model_name] = train_and_test_model(
    estimator, X_train, y_train, X_test, y_test, scoring_metric="f1_weighted"
)
models[model_name]

### CatBoost

Boosted tree optimised for categorical features. Note this requires **non-one-hot encoded features**.

In [None]:
model_name = "catboost"

# extract categorical features
num_features = [
    "AGE_ON_ADMISSION",
    "EL CountLast12m",
    "EMCountLast12m",
    "OP First CountLast12m",
    "OP FU CountLast12m",
]
cat_features = list(set(X_train_catboost.columns) - set(num_features))

estimator = CatBoostClassifier(
    verbose=False,
    auto_class_weights="Balanced",
    cat_features=cat_features,
    random_state=42,
)

# takes ~8 mins to run on a STANDARD_D13_V2
models[model_name] = train_and_test_model(
    estimator,
    X_train_catboost,
    y_train_catboost,
    X_test_catboost,
    y_test_catboost,
    scoring_metric="f1_weighted",
)
models[model_name]

### XGBoost

In [None]:
model_name = "xgboost"
models[model_name] = {}

# note XGBoost only accepts the "sample_weight" parameter in the .fit() function
# and must be trained explicitly
# see https://discuss.xgboost.ai/t/multi-class-classification-weighting-for-unbalanced-datasets/2789

clf = XGBClassifier(random_state=42)

# takes ~1 min to run on a STANDARD_D13_V2
models[model_name]["model"] = clf.fit(X_train, y_train, sample_weight=sample_weights)

# perform inference on both training and test set
preds_train = np.clip(models[model_name]["model"].predict(X_train), 0, None)
preds_test = np.clip(models[model_name]["model"].predict(X_test), 0, None)

# calculate performance
models[model_name]["train_metric"] = f1_score(y_train, preds_train, average="weighted")
models[model_name]["test_metric"] = f1_score(y_test, preds_test, average="weighted")

models[model_name]

## Evaluate model performance visually

Use the held-out test set to evaluate and visualise the performance of all the baseline models.

We will also calculate a range of metrics for the classification models:

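* balanced_accuracy - the mean of per-class recall, so every class counts equally regardless of its size
* f1_score_weighted - the per-class f1 score (harmonic mean of precision and recall), averaged with weights proportional to class size
* auc - the area under the receiver operating characteristic (ROC) curve, computed one-vs-rest for each class and averaged with weights proportional to class size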
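
As a worked example of the first two metrics, the sketch below computes them by hand on made-up labels. The helper names are hypothetical; the notebook itself relies on `balanced_accuracy_score` and `f1_score` from scikit-learn.

```python
def per_class_stats(y_true, y_pred):
    """Return {class: (precision, recall, support)} for each class in y_true."""
    stats = {}
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        predicted = sum(p == cls for p in y_pred)
        support = sum(t == cls for t in y_true)
        precision = tp / predicted if predicted else 0.0
        recall = tp / support
        stats[cls] = (precision, recall, support)
    return stats


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall."""
    stats = per_class_stats(y_true, y_pred)
    return sum(recall for _, recall, _ in stats.values()) / len(stats)


def weighted_f1(y_true, y_pred):
    """Per-class f1, averaged with weights proportional to class support."""
    stats = per_class_stats(y_true, y_pred)
    total = 0.0
    for precision, recall, support in stats.values():
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += support * f1
    return total / len(y_true)


# made-up labels: class 1 is twice as common as classes 2 and 3
y_true = [1, 1, 1, 1, 2, 2, 3, 3]
y_pred = [1, 1, 1, 2, 2, 2, 3, 1]
bal_acc = balanced_accuracy(y_true, y_pred)
f1_w = weighted_f1(y_true, y_pred)
print(round(bal_acc, 3), round(f1_w, 3))
```

On these labels the per-class recalls are 0.75, 1.0 and 0.5, so the balanced accuracy is 0.75, and the weighted f1 score is roughly 0.742.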

In [None]:
# setup a subplot figure
fig, axs = plt.subplots(len(models), 2)
fig.set_size_inches(15, 7 * len(models))

i = 0

for model in models:
    if model == "catboost":
        model_X_test = X_test_catboost
        model_y_test = y_test
    elif model == "elastic":
        model_X_test = X_test_scaled
        model_y_test = y_test
    else:
        model_X_test = X_test
        model_y_test = y_test

    # perform inference
    preds = models[model]["model"].predict(model_X_test)
    probs = models[model]["model"].predict_proba(model_X_test)

    # calculate performance metrics
    balanced_accuracy = balanced_accuracy_score(model_y_test, preds)
    f1_score_weighted = f1_score(model_y_test, preds, average="weighted")
    auc = roc_auc_score(
        model_y_test, probs, multi_class="ovr", average="weighted"
    )  # one-vs-rest

    # output metrics
    print(
        f"{model} test balanced accuracy: {balanced_accuracy.round(3)}, f1 score (weighted): {f1_score_weighted.round(3)}, auc (ovr, weighted): {auc.round(3)}"
    )

    # create a prediction dataframe
    predictions_df = pd.DataFrame(data=model_y_test.reset_index(drop=True))
    predictions_df["pred"] = preds

    # plot actual vs predicted COUNTS
    axs[i, 0].hist([predictions_df.risk, predictions_df.pred])
    axs[i, 0].legend(["Actual risk", "Predicted risk"])
    axs[i, 0].set_title(f"{model} - f1 weighted: {f1_score_weighted.round(2)}")
    axs[i, 0].set_xticks([1, 2, 3, 4, 5], labels=risk_labels, minor=False)
    axs[i, 0].set_ylabel("Count of risk")

    # plot predicted vs actual CLASSES
    risks = dict.fromkeys(risk_labels)
    for proportion in risks:
        risks[proportion] = np.array([0.0, 0.0, 0.0, 0.0, 0.0])

        for label in risk_labels:
            this_risk = int(label[0])

            # extract the predicted risk
            subset = predictions_df[predictions_df.pred == this_risk]

            # the leading digit of the label is the actual risk category
            count = (subset.risk == int(proportion[0])).sum()

            prop = 0 if count == 0 else count / subset.shape[0]

            risks[proportion][this_risk - 1] = prop

    bottom = np.array([0.0, 0.0, 0.0, 0.0, 0.0])
    for proportion in risks:
        if proportion == "1 - Very Low Risk":
            data = risks[proportion]
            axs[i, 1].bar(risk_labels, data, label=proportion, width=0.35)
        else:
            bottom += data
            data = risks[proportion]
            axs[i, 1].bar(
                risk_labels, data, label=proportion, bottom=bottom, width=0.35
            )
    handles, labels = axs[i, 1].get_legend_handles_labels()
    axs[i, 1].legend(handles[::-1], labels[::-1], bbox_to_anchor=(1.05, 1))
    axs[i, 1].set_xlabel("Predicted risk")
    axs[i, 1].set_ylabel("Actual risk proportion")
    axs[i, 1].set_title(f"{model} - f1 weighted: {f1_score_weighted.round(2)}")
    i += 1

While `randomforest` has a higher weighted f1 score, `catboost` has a higher AUC and is able to predict across all the classes.

We select catboost as the best performing model.

## Model tuning

We will take the best performing model with default parameters, `catboost`, and use GridSearchCV to fine-tune its hyperparameters.

In [None]:
# note the baseline performance of the chosen model
model_name = "catboost"

print(models[model_name]["test_metric"])

### Re-train best model

Using GridSearchCV and an appropriate parameter grid for the chosen model.
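
To get a feel for the cost of an exhaustive search, the number of fits can be counted before running it. The sketch below enumerates the grid used in this notebook with `itertools.product`; it only counts combinations and does not train anything.

```python
from itertools import product

# the parameter grid and fold count used for the CatBoost search in this notebook
param_grid = {
    "learning_rate": [0.03, 0.1],
    "depth": [4, 6, 10],
    "l2_leaf_reg": [1, 3, 5, 7, 9],
}
cv_folds = 5

# every combination of one value per parameter is one candidate model
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_fits = len(candidates) * cv_folds
print(len(candidates), n_fits)
```

With this grid that is 30 candidates and 150 cross-validation fits, plus one final refit because `refit=True`.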

In [None]:
model_name = "catboost"

final_model = {model_name: {}}

# example from https://catboost.ai/en/docs/concepts/python-reference_catboostregressor_grid_search
# see https://catboost.ai/en/docs/concepts/parameter-tuning for other options

param_grid = {
    "learning_rate": [0.03, 0.1],
    "depth": [4, 6, 10],
    "l2_leaf_reg": [1, 3, 5, 7, 9],
}

# extract categorical features
num_features = [
    "AGE_ON_ADMISSION",
    "EL CountLast12m",
    "EMCountLast12m",
    "OP First CountLast12m",
    "OP FU CountLast12m",
]
cat_features = list(set(X_train_catboost.columns) - set(num_features))

gsc = GridSearchCV(
    estimator=CatBoostClassifier(
        verbose=False,
        auto_class_weights="Balanced",
        cat_features=cat_features,
        random_state=42,
    ),
    param_grid=param_grid,
    cv=5,
    scoring="f1_weighted",
    verbose=1,
    n_jobs=-1,
    refit=True,
)

# takes ~65 mins to run on a STANDARD_D13_V2
grid_result = gsc.fit(X_train_catboost, y_train_catboost)

final_model[model_name]["train_metric"] = grid_result.best_score_
# store model and parameters
final_model[model_name]["model"] = grid_result.best_estimator_
final_model[model_name]["params"] = grid_result.best_params_
final_model[model_name]

## Model evaluation

Now we have tuned the best model given the parameters specified, we will test the model on the test set.

In [None]:
# generate predictions
preds_test = final_model[model_name]["model"].predict(X_test_catboost)

# append the test metrics to the model
final_model[model_name]["test_metric"] = f1_score(
    y_test_catboost, preds_test, average="weighted"
)
final_model

How much does hyperparameter tuning on the training set improve the performance on the test set?

In our case, the performance decreases by 0.4%, indicating that hyperparameter tuning has negligible impact on the performance of the model.

## Save models

In [None]:
models["final_model"] = final_model

# save models outside the git tree
with open("../../models/classification.pickle", "wb") as handle:
    pickle.dump(models, handle)

## Model exploration

We will take the tuned version of the best performing model, and explore in more detail how it performs.

In [None]:
# setup a subplot figure
fig, axs = plt.subplots(1, 2)
fig.set_size_inches(15, 7)

if model_name == "catboost":
    model_X_test = X_test_catboost
    model_y_test = y_test
elif model_name == "elastic":
    model_X_test = X_test_scaled
    model_y_test = y_test
else:
    model_X_test = X_test
    model_y_test = y_test

# perform inference
preds = models["final_model"][model_name]["model"].predict(model_X_test)
probs = models["final_model"][model_name]["model"].predict_proba(model_X_test)

# calculate performance metrics
balanced_accuracy = balanced_accuracy_score(model_y_test, preds)
f1_score_weighted = f1_score(model_y_test, preds, average="weighted")
auc = roc_auc_score(
    model_y_test, probs, multi_class="ovr", average="weighted"
)  # one-vs-rest

# output metrics
print(
    f"{model_name} test balanced accuracy: {balanced_accuracy.round(3)}, f1 score (weighted): {f1_score_weighted.round(3)}, auc (ovr, weighted): {auc.round(3)}"
)

# create a prediction dataframe
predictions_df = pd.DataFrame(data=model_y_test.reset_index(drop=True))
predictions_df["pred"] = preds

# plot actual vs predicted COUNTS
axs[0].hist([predictions_df.risk, predictions_df.pred])
axs[0].legend(["Actual risk", "Predicted risk"])
axs[0].set_title(f"{model_name} - f1 weighted: {f1_score_weighted.round(2)}")
axs[0].set_xticks([1, 2, 3, 4, 5], labels=risk_labels, minor=False)
axs[0].set_ylabel("Count of risk")

# plot predicted vs actual CLASSES
risks = dict.fromkeys(risk_labels)
for proportion in risks:
    risks[proportion] = np.array([0.0, 0.0, 0.0, 0.0, 0.0])

    for label in risk_labels:
        this_risk = int(label[0])

        # extract the predicted risk
        subset = predictions_df[predictions_df.pred == this_risk]

        # the leading digit of the label is the actual risk category
        count = (subset.risk == int(proportion[0])).sum()

        prop = 0 if count == 0 else count / subset.shape[0]

        risks[proportion][this_risk - 1] = prop

bottom = np.array([0.0, 0.0, 0.0, 0.0, 0.0])
for proportion in risks:
    if proportion == "1 - Very Low Risk":
        data = risks[proportion]
        axs[1].bar(risk_labels, data, label=proportion, width=0.35)
    else:
        bottom += data
        data = risks[proportion]
        axs[1].bar(risk_labels, data, label=proportion, bottom=bottom, width=0.35)
handles, labels = axs[1].get_legend_handles_labels()
axs[1].legend(handles[::-1], labels[::-1], bbox_to_anchor=(1.05, 1))
axs[1].set_xlabel("Predicted risk")
axs[1].set_ylabel("Actual risk proportion")
axs[1].set_title(f"{model_name} - f1 weighted: {f1_score_weighted.round(2)}")

fig.suptitle("Final model");

### Severity of misclassification

When the model incorrectly predicts a class, how badly does it do this?

Because risk categories are numerical (1-5), we can measure how far off a prediction was as the difference between the predicted and actual category, i.e. the number of classes by which the prediction missed.
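
A minimal sketch of this signed difference (predicted minus actual) on made-up labels, using only the standard library:

```python
from collections import Counter

# made-up actual and predicted risk categories
actual = [1, 1, 2, 3, 5, 5, 4]
predicted = [1, 2, 2, 1, 5, 3, 5]

# positive = risk over-predicted, negative = risk under-predicted, 0 = correct
differences = [p - a for p, a in zip(predicted, actual)]
distribution = Counter(differences)
print(sorted(distribution.items()))
```

A distribution concentrated around 0 and ±1 means that most errors land in a neighbouring category rather than at the opposite end of the scale.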

In [None]:
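# signed error in classes: positive = risk over-predicted, negative = under-predicted
# (DataFrame.diff() would compare consecutive rows, not predictions with actuals)
predictions_df["diff"] = predictions_df["pred"] - predictions_df["risk"]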
fig = px.histogram(predictions_df, x="diff")
fig.show()

### Feature importance

Which features does the model ascribe predictive power to?

In [None]:
# Feature names
coef = pd.DataFrame(data=list(model_X_test.columns))
# Feature importances, sorted
coef["coef"] = models["final_model"][model_name]["model"].feature_importances_
coef.sort_values("coef", ascending=False, inplace=True)
coef.set_index(0, inplace=True)
# Plot interactive plot
# Hover over a feature for full feature name
fig = px.bar(coef, x=coef.index, y="coef")
fig.show()

## Extensions

- Fairness analysis
- Analysis of distribution of probabilities e.g. `predict_proba` to see how changes to threshold affect performance
- Plot PR curves per class as per https://stackoverflow.com/questions/56090541/how-to-plot-precision-and-recall-of-multiclass-classifier
- Train a binary classifier on Long Stay (21+ days) or not, use it as a precursor to two different regression models (one for long stayer, one for not)
- Include IS_MINOR data