# Modelling - Classification

This notebook explores classification models to **predict risk of becoming a long stayer** as a baseline for the [Long Stayer Risk Stratification](https://github.com/nhsx/skunkworks-long-stayer-risk-stratification) model.

This notebook is broken down into:

1. Converting the length of stay into a relative risk
2. Training a range of baseline models using cross-validation
3. Testing final models on a test dataset
4. Exploring in more detail the best performing baseline model

In [None]:
import math
import pickle

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
from catboost import CatBoostClassifier
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    f1_score,
    roc_auc_score,
)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

%matplotlib inline
plt.rcParams["figure.figsize"] = [15, 8]

In [None]:
# Helper functions


def train_model(gsc, X_train, y_train):
    """Uses a GridSearchCV instance to find a reasonable model, and store
    performance and fitted model into a python dict

    Parameters:

        gsc (sklearn.model_selection.GridSearchCV object): defined model
        X_train (pandas dataframe): training dataframe with features
        y_train (pandas series): training targets

    Returns:

        (dict): resulting fitted model and performance metrics
    """

    grid_result = gsc.fit(X_train, y_train)

    # note model fitted/scored on balanced accuracy
    model = {
        "cv_balanced_accuracy_mean": np.round(
            grid_result.cv_results_["mean_test_score"][grid_result.best_index_], 3
        ),
        "cv_balanced_accuracy_std": np.round(
            grid_result.cv_results_["std_test_score"][grid_result.best_index_], 2
        ),
        "model": grid_result.best_estimator_,
    }

    # refit the best estimator on the full training set; GridSearchCV(refit=True)
    # already does this, so the explicit fit is a safeguard rather than a requirement
    # note the final metric is balanced accuracy measured on the training data
    model["model"].fit(X_train, y_train)
    model["balanced_accuracy"] = np.round(
        balanced_accuracy_score(y_train, model["model"].predict(X_train)), 3
    )

    return model


def risk_score(los):
    """Return risk score (1-5) based on LoS

    Parameters:
        los (float): length of stay in days

    Returns:
        (int): risk score (1 = Very low risk, 5 = High risk)
    """

    # round los up to whole days
    los = math.ceil(los)

    if los > 15:
        return 5
    elif los > 13:
        return 4
    elif los > 10:
        return 3
    elif los > 6:
        return 2
    else:
        return 1

## Load features

In [None]:
features_df = pd.read_parquet("../../data/features.parquet")
features_df.shape

## Calculate risk scores

We will convert actual Length of Stay (days) into a risk score defined as:

Risk Category|Day Range for Risk Category
-----|------
1 - Very low risk|0-6
2 - Low risk|7-10
3 - Normal risk|11-13
4 - Elevated risk|14-15
5 - High risk|>15

In [None]:
# actual risk scores
risk_labels = [
    "1 - Very Low Risk",
    "2 - Low Risk",
    "3 - Normal Risk",
    "4 - Elevated Risk",
    "5 - High Risk",
]
features_df["risk"] = [risk_score(los) for los in features_df.LENGTH_OF_STAY]
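
As a cross-check on the band edges in the table above, the sketch below applies the same banding to a handful of made-up lengths of stay using `pd.cut`. The values in `toy_los` are hypothetical and are not taken from the dataset; the point is only to show where fractional days land after rounding up.

```python
import numpy as np
import pandas as pd

# Hypothetical lengths of stay in days, chosen to sit on or near the band edges
toy_los = pd.Series([0.5, 6.0, 6.2, 10.0, 13.5, 15.0, 15.1, 40.0])

# Round up to whole days, then bin with right-inclusive edges:
# (-inf, 6], (6, 10], (10, 13], (13, 15], (15, inf)
toy_risk = pd.cut(
    np.ceil(toy_los),
    bins=[-np.inf, 6, 10, 13, 15, np.inf],
    labels=[1, 2, 3, 4, 5],
).astype(int)

print(pd.DataFrame({"los": toy_los, "risk": toy_risk}))
```

Note that a stay of 6.2 days is rounded up to 7 and therefore falls into risk category 2, not 1.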

## Define target and training features

In [None]:
X = features_df.drop(columns=["LENGTH_OF_STAY"])
y = features_df.risk

## Train/test split

For model evaluation, we will hold back a 25% test set, and use cross-validation on the remaining 75% for all models until the final comparison is made.

In [None]:
# Split data for train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=42
)
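
The split above is not stratified, so rare risk categories may end up with slightly different shares in the training and test sets. A possible alternative, shown here on synthetic data only, is to pass `stratify` to `train_test_split`. The `toy_` names and class proportions are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced 5-class target (proportions are made up)
rng = np.random.default_rng(42)
toy_X = pd.DataFrame({"feature": rng.normal(size=1000)})
toy_y = pd.Series(
    rng.choice([1, 2, 3, 4, 5], size=1000, p=[0.6, 0.2, 0.1, 0.05, 0.05]),
    name="risk",
)

# stratify=toy_y keeps the class shares (almost) identical in both splits
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, train_size=0.75, random_state=42, stratify=toy_y
)

train_share = toy_y_train.value_counts(normalize=True).sort_index()
test_share = toy_y_test.value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_share, "test": test_share}).round(3))
```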

## Explore class imbalance

This is a multi-class classification problem, so we need to understand any class imbalance; otherwise the models will skew towards the larger classes.

In [None]:
# Show how many of each class are present in the training set:
X_train.risk.value_counts().sort_index()

### Weight samples

Given significant class imbalance, we will weight samples so that the smallest class has a weight of 1 and every other class has a weight equal to the smallest class count divided by its own count (<1).

e.g.

class|count|weight
---|---|---
1|1000|0.1
2|100|1
3|500|0.2

In [None]:
weights = X_train.risk.value_counts().min() / X_train.risk.value_counts().sort_index()

# Drop risk from training/test set so the target does not leak into the features
X_train = X_train.drop(columns="risk")
X_test = X_test.drop(columns="risk")
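
The weighting rule can be checked against the example table above with a minimal sketch. The counts (1000, 100, 500) come from that table; `toy_risk` and `toy_weights` are illustrative names only.

```python
import pandas as pd

# Recreate the example class counts: 1000 of class 1, 100 of class 2, 500 of class 3
toy_risk = pd.Series([1] * 1000 + [2] * 100 + [3] * 500)

# Smallest class count divided by each class count
toy_counts = toy_risk.value_counts().sort_index()
toy_weights = toy_counts.min() / toy_counts

print(toy_weights.to_dict())
```

The smallest class (2) gets a weight of 1, while the largest class (1) is down-weighted the most.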

## Modelling

The strategy is to try a number of classification models:

* Tune hyperparameters with `GridSearchCV` and cross-validation, refitting the best model on the full training set
* Test all final models against the held-out test set
* Explore feature importance of the best performing model
* Explore fairness (next notebook) of the best performing model

Logistic regression models are excluded due to significant multi-collinearity (see analysis in Regression notebook).

In [None]:
# Initiate empty models dictionary
models = {}

### Prior model

The simplest baseline model predicts the most frequent class label for every sample.

In [None]:
model_name = "prior"

# define gridsearch parameters
gsc = GridSearchCV(
    estimator=DummyClassifier(strategy="prior"),
    param_grid={},
    cv=5,
    scoring="balanced_accuracy",
    verbose=1,
    n_jobs=-1,
    refit=True,
)

# takes ~1 second to run on a STANDARD_DS3_V2
models[model_name] = train_model(gsc, X_train, y_train)
models[model_name]
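
To see why balanced accuracy is used as the scoring metric, the sketch below fits the same kind of prior model on made-up labels. Balanced accuracy is the mean of per-class recall, so a model that always predicts one of five classes scores 1/5 regardless of how common that class is. The label counts here are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Made-up imbalanced labels: 60% of samples are in class 1
toy_y = np.array([1] * 60 + [2] * 20 + [3] * 10 + [4] * 5 + [5] * 5)
toy_X = np.zeros((len(toy_y), 1))  # features are ignored by DummyClassifier

prior = DummyClassifier(strategy="prior").fit(toy_X, toy_y)
toy_pred = prior.predict(toy_X)

toy_accuracy = accuracy_score(toy_y, toy_pred)
toy_balanced_accuracy = balanced_accuracy_score(toy_y, toy_pred)
print(f"accuracy: {toy_accuracy:.2f}, balanced accuracy: {toy_balanced_accuracy:.2f}")
```

Plain accuracy rewards the prior model for the size of the majority class, whereas balanced accuracy does not, which makes 0.2 a useful floor when reading the scores of the other models.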

### Decision tree classifier (weighted)

Simplest tree classifier using one tree

In [None]:
model_name = "decisiontree"

# define gridsearch parameters
gsc = GridSearchCV(
    estimator=DecisionTreeClassifier(class_weight=weights.to_dict()),
    param_grid={"max_depth": [5, 10, None]},
    cv=5,
    scoring="balanced_accuracy",
    verbose=1,
    n_jobs=-1,
    refit=True,
)

# takes ~10s to run on a STANDARD_DS3_V2
models[model_name] = train_model(gsc, X_train, y_train)
models[model_name]

### Random forest (weighted)

In [None]:
model_name = "randomforest"

# define gridsearch parameters
gsc = GridSearchCV(
    estimator=RandomForestClassifier(class_weight=weights.to_dict()),
    param_grid={"n_estimators": [10, 100, 500], "max_depth": [5, 10, None]},
    cv=5,
    scoring="balanced_accuracy",
    verbose=1,
    n_jobs=-1,
    refit=True,
)

# takes ~8 mins to run on a STANDARD_DS3_V2
models[model_name] = train_model(gsc, X_train, y_train)
models[model_name]

### CatBoost

Boosted tree optimised for categorical features

In [None]:
model_name = "catboost"

# list the numeric features; treat all remaining columns as categorical
num_features = [
    "AGE_ON_ADMISSION",
    "EL CountLast12m",
    "EMCountLast12m",
    "OP First CountLast12m",
    "OP FU CountLast12m",
]
cat_features = list(set(X_train.columns) - set(num_features))

# define gridsearch parameters
gsc = GridSearchCV(
    estimator=CatBoostClassifier(
        verbose=False, class_weights=weights.to_dict(), cat_features=cat_features
    ),
    param_grid={
        "max_depth": [5, 10, None],
        "learning_rate": [0.01, 0.1, 1],
        "iterations": [10, 100, 500],
    },
    cv=5,
    scoring="balanced_accuracy",
    verbose=1,
    n_jobs=-1,
    refit=True,
)

# takes ~25 mins to run on a STANDARD_DS3_V2
models[model_name] = train_model(gsc, X_train, y_train)
models[model_name]

### XGBoost

In [None]:
model_name = "xgboost"

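# define gridsearch parameters
# NOTE: `weight` does not appear to be a documented XGBClassifier argument, so the
# class weights are probably not applied here; per-sample weights are normally
# passed to fit() via `sample_weight`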
gsc = GridSearchCV(
    estimator=XGBClassifier(weight=weights.to_dict(), random_state=42),
    param_grid={
        "n_estimators": [1, 5],
        "learning_rate": [0.01, 0.1, 1],
        "max_depth": [5, 10, None],
    },
    cv=5,
    scoring="balanced_accuracy",
    verbose=1,
    n_jobs=-1,
    refit=True,
)

# takes ~1 min to run on a STANDARD_DS3_V2
models[model_name] = train_model(gsc, X_train, y_train)
models[model_name]
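
Estimators without a `class_weight` argument usually accept per-sample weights at fit time instead. The sketch below expands per-class weights into per-sample weights with scikit-learn's `compute_sample_weight`. It uses made-up labels, and passing the result to `fit` as `sample_weight` is an assumption about how it could be wired in, not something this notebook does.

```python
import pandas as pd
from sklearn.utils.class_weight import compute_sample_weight

# Made-up training labels
toy_y = pd.Series([1, 1, 1, 1, 2, 3, 3])

# Same rule as above: smallest class count divided by each class count
toy_counts = toy_y.value_counts().sort_index()
toy_weights = toy_counts.min() / toy_counts

# One weight per training row, looked up from the row's class
toy_sample_weight = compute_sample_weight(class_weight=toy_weights.to_dict(), y=toy_y)
print(toy_sample_weight)
```

The resulting array has the same length as the training labels and could then be supplied as `sample_weight` when fitting.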

## Save models

In [None]:
# save models outside the git tree
with open("../../models/classification.pickle", "wb") as handle:
    pickle.dump(models, handle)

## Load models

In [None]:
# load models from outside the git tree
with open("../../models/classification.pickle", "rb") as handle:
    models = pickle.load(handle)
models

## Validate models

Use the held-out test set to evaluate the performance of all the tuned models.

In [None]:
for model in models:
    preds = models[model]["model"].predict(X_test)
    probs = models[model]["model"].predict_proba(X_test)
    # calculate performance on the held-out test set
    balanced_accuracy = balanced_accuracy_score(y_test, preds)
    f1_score_weighted = f1_score(y_test, preds, average="weighted")
    # one-vs-rest AUC, weighted by class prevalence
    auc = roc_auc_score(y_test, probs, multi_class="ovr", average="weighted")
    print(
        f"{model} test balanced accuracy: {balanced_accuracy:.3f}, "
        f"f1 score (weighted): {f1_score_weighted:.3f}, "
        f"auc (ovr, weighted): {auc:.3f}"
    )
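
As a reference point for reading the AUC values, the sketch below computes the same one-vs-rest weighted AUC on made-up labels for two extremes: constant class-prior probabilities (what a prior model outputs) and perfect one-hot probabilities. The labels and probabilities are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels for three classes
toy_y = np.array([1] * 6 + [2] * 3 + [3] * 1)

# Constant probabilities equal to the class priors, repeated for every sample
prior_probs = np.tile([0.6, 0.3, 0.1], (len(toy_y), 1))
# Perfect probabilities: 1.0 for the true class, 0.0 elsewhere
perfect_probs = np.eye(3)[toy_y - 1]

auc_prior = roc_auc_score(toy_y, prior_probs, multi_class="ovr", average="weighted")
auc_perfect = roc_auc_score(toy_y, perfect_probs, multi_class="ovr", average="weighted")
print(f"prior: {auc_prior:.2f}, perfect: {auc_perfect:.2f}")
```

A model that cannot rank samples at all sits at 0.5, so AUC values only slightly above that indicate weak discrimination.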

## Model exploration

A single performance metric can be a misleading summary of how a model performs. We will take the "best performing" baseline model, and explore in more detail how the model performs.

In [None]:
model = "xgboost"
# generate predictions
predictions_df = pd.DataFrame(data=y_test.reset_index(drop=True))
predictions_df["pred"] = models[model]["model"].predict(X_test)

### Actual vs predicted plot

Let's visualise model performance:

In [None]:
# plot actual vs predicted
predictions_df.risk.hist(alpha=0.5)
predictions_df.pred.hist(alpha=0.5)

plt.legend(["Actual risk", "Predicted risk"])
plt.xticks([1, 2, 3, 4, 5], risk_labels);

### Plot accuracy by risk category

In [None]:
# for each predicted category, the share of each actual category given that prediction
risks = {}
for pred_label in risk_labels:
    pred_risk = int(pred_label[0])
    risks[pred_label] = np.zeros(len(risk_labels))

    for actual_label in risk_labels:
        actual_risk = int(actual_label[0])

        # extract the rows with this actual risk
        subset = predictions_df[predictions_df.risk == actual_risk]
        prop = (subset.pred == pred_risk).sum() / subset.shape[0]

        risks[pred_label][actual_risk - 1] = prop

# stacked bar chart: one bar per actual risk, one segment per predicted risk
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
bottom = np.zeros(len(risk_labels))
for pred_label, data in risks.items():
    ax.bar(risk_labels, data, label=pred_label, bottom=bottom, width=0.35)
    bottom = bottom + data
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[::-1], labels[::-1], bbox_to_anchor=(1.05, 1));
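
The same per-category proportions can also be obtained in one call with `pd.crosstab`, normalising over each actual risk category. This is an alternative sketch on a made-up `toy_predictions` frame, not the approach used above.

```python
import pandas as pd

# Made-up actual and predicted risk categories
toy_predictions = pd.DataFrame(
    {"risk": [1, 1, 1, 2, 2, 3], "pred": [1, 1, 2, 2, 1, 3]}
)

# Rows: actual risk, columns: predicted risk, values: share of each actual category
toy_proportions = pd.crosstab(
    toy_predictions["risk"], toy_predictions["pred"], normalize="index"
)
print(toy_proportions.round(2))
```

Each row sums to 1, and the diagonal holds the per-category recall. Calling `toy_proportions.plot.bar(stacked=True)` would give a chart comparable to the one above.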

### Severity of misclassification

When the model incorrectly predicts a class, how badly does it do this?

Because risk categories are ordinal (1-5), we can take the difference between the predicted and actual category as the number of classes by which each prediction was off.

In [None]:
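# signed error: positive = risk over-predicted, negative = risk under-predicted
predictions_df["diff"] = predictions_df["pred"] - predictions_df["risk"]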
fig = px.histogram(predictions_df, x="diff")
fig.show()
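
A worked example on made-up predictions shows how the class difference reads. The subtraction is done column against column for each row; note that `DataFrame.diff()` with default arguments instead differences consecutive rows. The `toy_` values are hypothetical.

```python
import pandas as pd

# Made-up actual and predicted risk categories
toy_predictions = pd.DataFrame({"risk": [1, 3, 5, 2], "pred": [1, 5, 2, 2]})

# Signed error per row: predicted minus actual category
toy_predictions["diff"] = toy_predictions["pred"] - toy_predictions["risk"]

# Count how many predictions were off by 0, 1, 2, ... classes
toy_severity = toy_predictions["diff"].abs().value_counts().sort_index()
print(toy_predictions)
print(toy_severity)
```

Here the third row is the most severe error: an actual risk of 5 predicted as 2, an under-prediction by three classes.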

### Feature importance

Which features does the model ascribe predictive power to?

In [None]:
# Feature names
coef = pd.DataFrame(data=list(X_train.columns))
# Feature importances, sorted
coef["coef"] = models[model]["model"].feature_importances_
coef.sort_values("coef", ascending=False, inplace=True)
coef.set_index(0, inplace=True)
# Plot interactive plot
# Hover over a feature for full feature name
fig = px.bar(coef, x=coef.index, y="coef")
fig.show()
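
To illustrate what `feature_importances_` reports for tree-based models, the sketch below fits a single decision tree on synthetic data in which only one column carries information. The column names `signal` and `noise` are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the target depends only on the sign of "signal"
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(
    {"signal": rng.normal(size=200), "noise": rng.normal(size=200)}
)
toy_y = (toy_X["signal"] > 0).astype(int)

tree = DecisionTreeClassifier(random_state=0).fit(toy_X, toy_y)

# Importances are normalised to sum to 1 across features
toy_importances = pd.Series(
    tree.feature_importances_, index=toy_X.columns
).sort_values(ascending=False)
print(toy_importances)
```

Importances describe how much each feature contributed to the fitted model's splits, not whether the feature has a causal effect on length of stay.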

## TODO

- Plot PR curves per class as per https://stackoverflow.com/questions/56090541/how-to-plot-precision-and-recall-of-multiclass-classifier
- Explore poor predictive power