# Modelling - Regression

This notebook explores regression models to **predict Length of Stay (days)** as a baseline for the [Long Stayer Risk Stratification](https://github.com/nhsx/skunkworks-long-stayer-risk-stratification) model.

The best-performing model used CatBoost and achieved a Mean Absolute Error (MAE) of **4.1 days**.

This notebook is broken down into:

1. Statistical tests to check the validity of linear models using Ordinary Least Squares (OLS)
2. Splitting the data into a training, validation and test set
3. Training a range of baseline models
4. Testing models on the validation dataset
5. Tuning the best model and testing on the test set

Regression models selected:

Model|Rationale
---|---
[Mean](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyRegressor.html)|The simplest baseline, uses the mean length of stay as the prediction in all cases
[ElasticNet](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html)|A regularised implementation of linear regression that can cope with multi-collinear features, as found in this dataset
[DecisionTreeRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)|A simple, single tree regressor that is highly explainable
[RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)|An ensemble technique with potentially better performance than a single tree
[CatBoostRegressor](https://catboost.ai/en/docs/concepts/python-reference_catboostregressor)|A boosted tree technique designed specifically for datasets with high levels of categorical features as in this dataset
[XGBRegressor](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRegressor)|A boosted tree technique that can improve on ensemble techniques such as RandomForest

Inputs|Outputs
---|---
`processed/features.parquet`|`models/regression.pickle`
`processed/features-sensitive.parquet`|&nbsp;
`processed/features-catboost.parquet`|&nbsp;

In [None]:
import math
import pickle
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
from catboost import CatBoostRegressor
from scipy.stats import anderson, kstest, shapiro
from sklearn import linear_model
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor
from statsmodels.stats.outliers_influence import variance_inflation_factor
from xgboost import XGBRegressor

sys.path.append("../src/")

from utils import train_and_test_model, train_test_validate_split

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

%matplotlib inline
plt.rcParams["figure.figsize"] = [15, 8]

## Load features

In [None]:
features_df = pd.read_parquet("../../data/processed/features.parquet")
features_df.shape

In [None]:
# we will also load all features so we can analyse the impact of the train/validate/test split on sensitive features
features_sensitive_df = pd.read_parquet(
    "../../data/processed/features-sensitive.parquet"
)
features_sensitive_df.shape

In [None]:
# note: CatBoost expects categorical features that are not one-hot encoded, as it encodes them itself during training
features_catboost_df = pd.read_parquet("../../data/processed/features-catboost.parquet")
features_catboost_df.shape

In [None]:
# define sensitive columns for fairness testing later
sensitive_columns = [
    "ETHNIC_CATEGORY_CODE_DESCRIPTION",
    "IMD county decile",
    "OAC Group Name",
    "OAC Subgroup Name",
    "OAC Supergroup Name",
    "PATIENT_GENDER_CURRENT_DESCRIPTION",
    "POST_CODE_AT_ADMISSION_DATE_DISTRICT",
    "Rural urban classification",
]

## Define target and training features

In [None]:
X = features_df.drop(columns="LENGTH_OF_STAY")
y = features_df.LENGTH_OF_STAY

X_sensitive = features_sensitive_df.drop(columns="LENGTH_OF_STAY")
y_sensitive = features_sensitive_df.LENGTH_OF_STAY

X_catboost = features_catboost_df.drop(columns="LENGTH_OF_STAY")
y_catboost = features_catboost_df.LENGTH_OF_STAY

## Variance Inflation Factors

Simple linear models (ordinary least squares) assume there is no multi-collinearity.

Variance inflation factors (VIF) help quantify the extent of any collinearity present.

We are looking for a VIF below roughly 10 for each of our features.
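
For intuition, the VIF of feature $i$ is $1 / (1 - R_i^2)$, where $R_i^2$ comes from regressing feature $i$ on the remaining features. The following is a minimal NumPy sketch of that definition on synthetic data. The helper name `vif_from_definition` and the data are illustrative assumptions, not part of this project. It adds an intercept to each auxiliary regression, so its values can differ from those of `variance_inflation_factor` when the design matrix has no constant column.

```python
import numpy as np


def vif_from_definition(X: np.ndarray) -> np.ndarray:
    """Return 1 / (1 - R^2) for each column regressed on the other columns."""
    n_rows, n_cols = X.shape
    vifs = np.empty(n_cols)
    for i in range(n_cols):
        target = X[:, i]
        # Remaining features plus an intercept column
        others = np.column_stack([np.ones(n_rows), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        r_squared = 1.0 - residual.var() / target.var()
        vifs[i] = 1.0 / (1.0 - r_squared)
    return vifs


rng = np.random.default_rng(42)
a = rng.normal(size=500)
b = rng.normal(size=500)  # independent of a
c = a + 0.1 * rng.normal(size=500)  # almost a copy of a

demo_vifs = vif_from_definition(np.column_stack([a, b, c]))
print(demo_vifs.round(1))  # a and c should be far above 10, b close to 1
```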

In [None]:
# Takes ~6 minutes to run on a STANDARD_DS3_V2
vif = pd.DataFrame()
vif["feature"] = X.columns

vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]
vif

## Residual distributions of [OLS](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)

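Even though the VIFs do indicate multi-collinearity, we also check whether the residuals are normally distributed. This is another assumption behind OLS inference, and is distinct from homoscedasticity (constant variance of the residuals).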

This requires training a linear model, calculating the residuals and checking visually and statistically that they are normally distributed.

In [None]:
# Train basic OLS model for statistical testing only
reg = linear_model.LinearRegression()
reg.fit(X, y)
# reuse y's index so the subtraction below aligns row by row
pred = pd.Series(reg.predict(X), index=y.index)
resid = y - pred

In [None]:
# Visual inspection
resid.hist(bins=30);

### Shapiro-Wilk test for normality

* Null hypothesis: the residuals are drawn from a normal distribution
* Alternative hypothesis: the residuals are not drawn from a normal distribution (and so fail the requirements of an OLS model)
* The test statistic measures how far the distribution of the residuals departs from a normal distribution
* The p-value is the probability of seeing a test statistic at least this extreme if the null hypothesis were true
* A p-value < 0.05 leads us to reject the null hypothesis

In [None]:
shapiro(resid)

### One-sample Kolmogorov-Smirnov test for normality

* Null hypothesis: the residuals are drawn from a normal distribution
* Alternative hypothesis: the residuals are not drawn from a normal distribution (and so fail the requirements of an OLS model)
* The test statistic measures how far the distribution of the residuals departs from a normal distribution
* The p-value is the probability of seeing a test statistic at least this extreme if the null hypothesis were true
* A p-value < 0.05 leads us to reject the null hypothesis

In [None]:
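# kstest compares against the standard normal N(0, 1), so standardise the residuals first
kstest((resid - resid.mean()) / resid.std(), "norm")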

### Anderson-Darling test for normality

* Null hypothesis: the residuals are drawn from a normal distribution
* Alternative hypothesis: the residuals are not drawn from a normal distribution (and so fail the requirements of an OLS model)
* The test statistic is compared to the critical value at the required significance level (e.g. 5%)
* A test statistic greater than the critical value at the 5% significance level leads us to reject the null hypothesis

In [None]:
anderson(resid, "norm")

**Statistical testing invalidates the assumptions of OLS models**

## Train/validation/test split

We will use 70% of the data for training, leaving 15% for validation and 15% for testing.

This split was chosen based on the quantity of data (>100,000 rows). We will run some simple checks to make sure that the splits are representative across Length of Stay, age, ethnicity and gender.

Models will be trained on the training set, and tested on the validation set. This will help select a basic model using data it was not trained on, to reduce the risk of overfitting.

A final model will be trained using GridSearchCV across the training and validation data, before final performance metrics are generated using the unseen test data set.
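
The `train_test_validate_split` helper is imported from `utils` and is not shown in this notebook. As a rough sketch of one way a 70/15/15 split can be produced (an assumption, not necessarily how the helper is implemented), a shuffled index can be cut at the 70% and 85% marks. The name `three_way_split` is hypothetical.

```python
import numpy as np


def three_way_split(n_rows, train_size=0.70, validate_size=0.15, random_state=42):
    """Shuffle row positions and cut them into train/validate/test index arrays."""
    rng = np.random.default_rng(random_state)
    shuffled = rng.permutation(n_rows)
    train_end = int(round(n_rows * train_size))
    validate_end = train_end + int(round(n_rows * validate_size))
    # Whatever remains after train and validate becomes the test set
    return (
        shuffled[:train_end],
        shuffled[train_end:validate_end],
        shuffled[validate_end:],
    )


train_idx, validate_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(validate_idx), len(test_idx))  # 700 150 150
```

The index arrays could then be applied with `X.iloc[train_idx]`, `y.iloc[train_idx]`, and so on.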

In [None]:
# Split data for train/validate+test
X_train, X_validate, X_test, y_train, y_validate, y_test = train_test_validate_split(
    X_sensitive,
    y_sensitive,
    train_size=0.70,
    validate_size=0.15,
    test_size=0.15,
    random_state=42,
)

print(X_train.shape, X_validate.shape, X_test.shape)
print(y_train.shape, y_validate.shape, y_test.shape)

# Repeat the split for the CatBoost features (same sizes and random_state)
(
    X_train_catboost,
    X_validate_catboost,
    X_test_catboost,
    y_train_catboost,
    y_validate_catboost,
    y_test_catboost,
) = train_test_validate_split(
    X_catboost,
    y_catboost,
    train_size=0.70,
    validate_size=0.15,
    test_size=0.15,
    random_state=42,
)

print(X_train_catboost.shape, X_validate_catboost.shape, X_test_catboost.shape)
print(y_train_catboost.shape, y_validate_catboost.shape, y_test_catboost.shape)

We will now explore the distributions of data across the splits to ensure we are not accidentally introducing selection bias into the different datasets.
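
One way to compare a categorical feature across splits is to line the category shares up side by side, so that differences are visible per category. The sketch below uses small made-up series in place of the real columns. The helper `share_table` is hypothetical and not part of `utils`.

```python
import pandas as pd


def share_table(splits: dict) -> pd.DataFrame:
    """Return one row per category and one column per split, holding category shares."""
    shares = {
        name: series.value_counts(dropna=False, normalize=True)
        for name, series in splits.items()
    }
    # Align on category label; a category missing from a split gets share 0
    return pd.DataFrame(shares).fillna(0.0)


# Made-up example data standing in for a sensitive column in each split
toy_splits = {
    "train": pd.Series(["A", "A", "B", "B", "B", "C", "A", "B", "A", "B"]),
    "validate": pd.Series(["A", "B", "B", "A", "B"]),
    "test": pd.Series(["B", "A", "B", "C", "A"]),
}

table = share_table(toy_splits)
# Largest difference in share between any two splits, per category
table["max_gap"] = table.max(axis=1) - table.min(axis=1)
print(table.round(2))
```

A large `max_gap` for a category would suggest that the splits are not equally representative of it.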

In [None]:
# define splits
x_splits = {"train": X_train, "validate": X_validate, "test": X_test}
y_splits = {"train": y_train, "validate": y_validate, "test": y_test}
colours = {"train": "#f00", "validate": "#0f0", "test": "#00f"}

### Length of stay

In [None]:
num_bins = 30
for split in y_splits:
    y_splits[split].hist(
        density=True, alpha=0.5, bins=num_bins, label=split, color=colours[split]
    )
plt.legend()
# the histogram's x-axis holds the values and the y-axis holds the density
plt.xlabel("Length of stay (days)")
plt.ylabel("Density");

### Age

In [None]:
num_bins = 15
for split in x_splits:
    x_splits[split].AGE_ON_ADMISSION.hist(
        density=True, alpha=0.5, bins=num_bins, label=split, color=colours[split]
    )
plt.legend()
plt.xlabel("Age (years)")
plt.ylabel("Density");

### Sex

In [None]:
for split in x_splits:
    print(
        f"{split}: {x_splits[split].PATIENT_GENDER_CURRENT_DESCRIPTION.value_counts(dropna=False, normalize=True).round(2).tolist()}"
    )

### Ethnic category

In [None]:
for split in x_splits:
    print(
        f"{split}: {x_splits[split].ETHNIC_CATEGORY_CODE_DESCRIPTION.value_counts(dropna=False, normalize=True).round(2).tolist()}"
    )

Having established that the train/validate/test split does not introduce any bias in the representation of age, gender or ethnicity, we will drop the sensitive features which are not included in the model, but will be used to explore model bias later.

In [None]:
X_train = X_train.drop(columns=sensitive_columns)
X_validate = X_validate.drop(columns=sensitive_columns)
X_test = X_test.drop(columns=sensitive_columns)
print(X_train.shape, X_validate.shape, X_test.shape)
print(y_train.shape, y_validate.shape, y_test.shape)

# Modelling

The strategy is to try a number of regression models with:

* Baseline models for each algorithm trained on the training set with default parameters
* Baseline models tested on the validation set
* GridSearchCV for hyperparameter tuning on the best-performing model
* Explore feature importance of final model
* Explore fairness (next notebook) of final model

OLS models are excluded because their statistical assumptions are not met. Neural networks (NN) are excluded due to complexity and interpretability issues.

In [None]:
# Initiate empty models dictionary
models = {}

### Mean model

The simplest baseline model takes the mean length of stay as its prediction
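
To make the baseline concrete, the sketch below reproduces what a mean predictor does on a handful of made-up lengths of stay (the numbers are illustrative only): the training mean is used as the prediction for every validation row.

```python
from statistics import mean

# Made-up lengths of stay in days, for illustration only
toy_train_los = [1, 2, 2, 3, 12]
toy_validate_los = [1, 4, 10]

# "Training" a mean model is just computing the training mean
mean_prediction = mean(toy_train_los)

# Every validation row receives the same prediction
baseline_mae = mean(abs(actual - mean_prediction) for actual in toy_validate_los)

print(mean_prediction, baseline_mae)
```

Any useful model should beat the error of this constant prediction.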

In [None]:
model_name = "mean"

# define an estimator for this model
estimator = DummyRegressor(strategy="mean")

# takes ~1 second to run on a STANDARD_D13_V2
models[model_name] = train_and_test_model(
    estimator, X_train, y_train, X_validate, y_validate, scoring_metric="rmse"
)
models[model_name]

### Elastic net regression

A regularised linear model.
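
For reference, scikit-learn's `ElasticNet` minimises the objective below, where $n$ is the number of samples, $\alpha$ is the overall penalty strength (`alpha`) and $\rho$ is the mix between the L1 and L2 penalties (`l1_ratio`). The estimator in the next cell leaves both at their defaults.

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
$$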

In [None]:
model_name = "elastic"

# define an estimator for this model
estimator = linear_model.ElasticNet(random_state=42)

# takes ~1 second to run on a STANDARD_D13_V2
models[model_name] = train_and_test_model(
    estimator, X_train, y_train, X_validate, y_validate, scoring_metric="rmse"
)
models[model_name]

### Decision Tree regressor

In [None]:
model_name = "decisiontree"

estimator = DecisionTreeRegressor(random_state=42)

# takes ~10 seconds to run on a STANDARD_D13_V2
models[model_name] = train_and_test_model(
    estimator, X_train, y_train, X_validate, y_validate, scoring_metric="rmse"
)

models[model_name]

### Random forest regressor

In [None]:
model_name = "randomforest"

estimator = RandomForestRegressor(random_state=42)

# takes ~3 mins to run on a STANDARD_D13_V2
models[model_name] = train_and_test_model(
    estimator, X_train, y_train, X_validate, y_validate, scoring_metric="rmse"
)
models[model_name]

### Catboost

Boosted tree optimised for categorical features

In [None]:
model_name = "catboost"

# extract categorical features
num_features = [
    "AGE_ON_ADMISSION",
    "EL CountLast12m",
    "EMCountLast12m",
    "OP First CountLast12m",
    "OP FU CountLast12m",
]
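# keep the original column order (a set difference returns columns in arbitrary order)
cat_features = [col for col in X_train_catboost.columns if col not in num_features]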

estimator = CatBoostRegressor(verbose=False, cat_features=cat_features, random_state=42)

# takes ~30 secs to run on a STANDARD_D13_V2
models[model_name] = train_and_test_model(
    estimator,
    X_train_catboost,
    y_train_catboost,
    X_validate_catboost,
    y_validate_catboost,
    scoring_metric="rmse",
)
models[model_name]

### XGBoost

In [None]:
model_name = "xgboost"

estimator = XGBRegressor(random_state=42)

# takes ~10s to run on a STANDARD_D13_V2
models[model_name] = train_and_test_model(
    estimator, X_train, y_train, X_validate, y_validate, scoring_metric="rmse"
)
models[model_name]

## Evaluate model performance visually

We will evaluate how the models performed so we can select the best model for hyperparameter tuning.

In [None]:
# setup a subplot figure
fig, axs = plt.subplots(len(models), 2)
fig.set_size_inches(15, 7 * len(models))

i = 0

for model in models:
    if model == "catboost":
        model_X_test = X_test_catboost
        model_y_test = y_test_catboost
    else:
        model_X_test = X_test
        model_y_test = y_test

    # inference - ensure smallest LoS is 0 days (not negative value)
    preds = np.clip(models[model]["model"].predict(model_X_test), 0, None)

    # calculate RMSE and MAE
    # (np.sqrt of the MSE avoids `squared=False`, which newer scikit-learn releases no longer accept)
    rmse = np.sqrt(mean_squared_error(model_y_test, preds))
    mae = mean_absolute_error(model_y_test, preds)

    print(
        f"{model} test rmse: {rmse.round(3)} days, mae: {mae.round(3)}, range ({preds.min().round(1)} - {preds.max().round(1)} days)"
    )

    # create prediction dataframe
    predictions_df = pd.DataFrame(data=model_y_test.reset_index(drop=True))
    predictions_df["pred"] = preds

    # calculate signed error (predicted - actual)
    predictions_df["error"] = predictions_df.pred - predictions_df.LENGTH_OF_STAY

    # plot predicted vs actual
    axs[i, 0].scatter(predictions_df.pred, predictions_df.LENGTH_OF_STAY, alpha=0.1)
    # plot ideal 1:1 prediction line. Max LoS = 30
    axs[i, 0].plot(np.arange(0, 31), np.arange(0, 31), "r--")
    axs[i, 0].set_xlabel("Predicted Length of Stay (days)")
    axs[i, 0].set_ylabel("Actual Length of Stay (days)")
    axs[i, 0].set_xlim([-1, 31])
    axs[i, 0].set_ylim([-1, 31])
    axs[i, 0].set_title(f"{model} - RMSE {rmse.round(2)} days")

    # plot error against actual length of stay
    axs[i, 1].scatter(predictions_df.LENGTH_OF_STAY, predictions_df.error, alpha=0.1)
    axs[i, 1].set_xlabel("Length of Stay (days)")
    axs[i, 1].set_ylabel("Error (days)")
    # plot mean error and approximate 95% limits of agreement (mean ± 2 std)
    axs[i, 1].plot(np.arange(0, 31), np.ones(31) * predictions_df.error.mean(), "r")
    axs[i, 1].plot(
        np.arange(0, 51),
        np.ones(51) * (predictions_df.error.mean() + (2 * predictions_df.error.std())),
        "g--",
    )
    axs[i, 1].plot(
        np.arange(0, 51),
        np.ones(51) * (predictions_df.error.mean() - (2 * predictions_df.error.std())),
        "g--",
    )
    # scale plot
    axs[i, 1].set_xlim([-1, 31])
    # add statistical data in legend. LoA = limits of agreement
    # flag: errors are not normally distributed, so mean ± 2*std may not capture a 95% interval

    axs[i, 1].legend(
        [
            f"\u03bc ({np.round(predictions_df.error.mean(),2)} days)",
            f"95% LoA (\u03C3 {np.round(predictions_df.error.std(),2)} days gives {2*np.round(predictions_df.error.std(),2)})",
        ]
    )
    axs[i, 1].set_title(f"{model} - RMSE {rmse.round(2)} days")
    i += 1

## Model tuning

We will select the best-performing model using default parameters, `catboost`, and use GridSearchCV to fine-tune its hyperparameters.

In [None]:
# note the baseline performance of the chosen model
model_name = "catboost"

print(models[model_name]["test_metric"])

### Re-train best model

Using GridSearchCV and an appropriate parameter grid for the chosen model.
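
As a quick sanity check on the cost of the search, the number of model fits can be counted from the grid itself. The sketch below copies the `param_grid` and `cv=5` used in the next cell, and assumes the usual `GridSearchCV` behaviour of one fit per parameter combination per fold, plus a single refit on all the data when `refit=True`.

```python
from itertools import product

param_grid = {
    "learning_rate": [0.03, 0.1],
    "depth": [4, 6, 10],
    "l2_leaf_reg": [1, 5, 9],
}
n_folds = 5

# Every combination of one value per parameter
combinations = list(product(*param_grid.values()))
n_fits = len(combinations) * n_folds + 1  # +1 for the final refit

print(len(combinations), n_fits)  # 18 91
```

Adding a value to any one parameter multiplies the number of combinations, which is why the grid is kept small.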

In [None]:
model_name = "catboost"

final_model = {model_name: {}}

# example adapted from https://catboost.ai/en/docs/concepts/python-reference_catboostregressor_grid_search
# see https://catboost.ai/en/docs/concepts/parameter-tuning for other options

param_grid = {
    "learning_rate": [0.03, 0.1],
    "depth": [4, 6, 10],
    "l2_leaf_reg": [1, 5, 9],
}

# extract categorical features
num_features = [
    "AGE_ON_ADMISSION",
    "EL CountLast12m",
    "EMCountLast12m",
    "OP First CountLast12m",
    "OP FU CountLast12m",
]
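# keep the original column order (a set difference returns columns in arbitrary order)
cat_features = [col for col in X_train_catboost.columns if col not in num_features]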

gsc = GridSearchCV(
    estimator=CatBoostRegressor(verbose=False, cat_features=cat_features),
    param_grid=param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
    verbose=1,
    n_jobs=-1,
    refit=True,
)

# note: we will use both the training and validation dataset as we are using GridSearchCV across 5-folds
# takes ~70 mins to run on a STANDARD_D13_V2
grid_result = gsc.fit(
    pd.concat([X_train_catboost, X_validate_catboost]),
    pd.concat([y_train_catboost, y_validate_catboost]),
)

# store performance; neg_root_mean_squared_error = -root_mean_squared_error
final_model[model_name]["train_rmse"] = -grid_result.best_score_
# store model and parameters
final_model[model_name]["model"] = grid_result.best_estimator_
final_model[model_name]["params"] = grid_result.best_params_
final_model[model_name]

## Model evaluation

Now that we have tuned the best model over the parameter grid specified, we will evaluate it on the test set. We also calculate the `mean_absolute_error`, an additional metric that is easier to interpret: it is the average absolute error, in days, of the length of stay predictions:
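
The difference between the two metrics is easiest to see on a small made-up example (the numbers below are illustrative, not results from this model): MAE averages the absolute errors, whereas RMSE squares them first and so is pulled up by a single badly missed long stay.

```python
import math

# Made-up actual and predicted lengths of stay in days
toy_actual = [2, 3, 5, 4, 30]
toy_predicted = [3, 3, 4, 5, 10]

errors = [p - a for p, a in zip(toy_predicted, toy_actual)]

toy_mae = sum(abs(e) for e in errors) / len(errors)
toy_rmse = math.sqrt(sum(e**2 for e in errors) / len(errors))

print(f"MAE {toy_mae:.2f} days, RMSE {toy_rmse:.2f} days")
```

Four of the five predictions are within a day, yet the one 20-day miss dominates the RMSE.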

In [None]:
# generate predictions
preds_test = final_model[model_name]["model"].predict(X_test_catboost)

# append the test metrics to the model
# np.sqrt of the MSE avoids `squared=False`, which newer scikit-learn releases no longer accept
final_model[model_name]["test_rmse"] = np.sqrt(
    mean_squared_error(y_test_catboost, preds_test)
)
final_model[model_name]["test_mae"] = mean_absolute_error(y_test_catboost, preds_test)
final_model

We now have our best performing baseline model, which achieves an **MAE of 4.1 days**, 0.3 days greater than the original neural network approach.

However, the predicted vs actual plots and error plots show that the model struggles to capture long stayers, and is biased to short stays. Further work is needed to improve the modelling approach.

## Save models

In [None]:
models["final_model"] = final_model

# save models outside the git tree
with open("../../models/regression.pickle", "wb") as handle:
    pickle.dump(models, handle)

## Model exploration

A single performance metric can be a misleading summary of how a model performs. We will take the "best performing" baseline model, and explore in more detail how the model performs.
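
The error plot in the next cell draws limits of agreement at the mean error ± 2 standard deviations, which only corresponds to a 95% interval if the errors are roughly normal. A hedged alternative, sketched here on synthetic skewed errors rather than the model's real output, is to read the 2.5th and 97.5th percentiles directly from the errors.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, right-skewed errors standing in for real prediction errors (days)
toy_errors = rng.exponential(scale=3.0, size=10_000) - 3.0

mean_error = toy_errors.mean()
std_error = toy_errors.std()

# Normal-theory limits of agreement: mean +/- 2 standard deviations
loa_low, loa_high = mean_error - 2 * std_error, mean_error + 2 * std_error

# Empirical limits: the central 95% of the observed errors
pct_low, pct_high = np.percentile(toy_errors, [2.5, 97.5])

print(f"mean +/- 2 std: ({loa_low:.1f}, {loa_high:.1f}) days")
print(f"2.5th to 97.5th percentile: ({pct_low:.1f}, {pct_high:.1f}) days")
```

On skewed errors like these, the symmetric limits extend below any error actually observed, while the percentile limits follow the shape of the data.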

In [None]:
if model_name == "catboost":
    model_X_test = X_test_catboost
    model_y_test = y_test_catboost
else:
    model_X_test = X_test
    model_y_test = y_test

# setup a subplot figure
fig, axs = plt.subplots(1, 2)
fig.set_size_inches(15, 7)

# inference - ensure smallest LoS is 0 days (not negative value)
preds = np.clip(
    models["final_model"][model_name]["model"].predict(model_X_test), 0, None
)

# calculate RMSE and range of LoS
# (np.sqrt of the MSE avoids `squared=False`, which newer scikit-learn releases no longer accept)
rmse = np.sqrt(mean_squared_error(model_y_test, preds))
print(
    f"{model_name} test rmse: {rmse.round(3)} days, range ({preds.min().round(1)} - {preds.max().round(1)} days)"
)

# create prediction dataframe
predictions_df = pd.DataFrame(data=model_y_test.reset_index(drop=True))
predictions_df["pred"] = preds

# calculate signed error (predicted - actual)
predictions_df["error"] = predictions_df.pred - predictions_df.LENGTH_OF_STAY

# plot predicted vs actual
axs[0].scatter(predictions_df.pred, predictions_df.LENGTH_OF_STAY, alpha=0.1)
# plot ideal 1:1 prediction line. Max LoS = 30
axs[0].plot(np.arange(0, 31), np.arange(0, 31), "r--")
axs[0].set_xlabel("Predicted Length of Stay (days)")
axs[0].set_ylabel("Actual Length of Stay (days)")
axs[0].set_xlim([-1, 31])
axs[0].set_ylim([-1, 31])
axs[0].set_title(f"{model_name} - RMSE {rmse.round(2)} days")

# plot error against actual length of stay
axs[1].scatter(predictions_df.LENGTH_OF_STAY, predictions_df.error, alpha=0.1)
axs[1].set_xlabel("Length of Stay (days)")
axs[1].set_ylabel("Error (days)")
# plot mean error and approximate 95% limits of agreement (mean ± 2 std)
axs[1].plot(np.arange(0, 31), np.ones(31) * predictions_df.error.mean(), "r")
axs[1].plot(
    np.arange(0, 51),
    np.ones(51) * (predictions_df.error.mean() + (2 * predictions_df.error.std())),
    "g--",
)
axs[1].plot(
    np.arange(0, 51),
    np.ones(51) * (predictions_df.error.mean() - (2 * predictions_df.error.std())),
    "g--",
)
# scale plot
axs[1].set_xlim([-1, 31])
# add statistical data in legend. LoA = limits of agreement
# flag: errors are not normally distributed, so mean ± 2*std may not capture a 95% interval

axs[1].legend(
    [
        f"\u03bc ({np.round(predictions_df.error.mean(),2)} days)",
        f"95% LoA (\u03C3 {np.round(predictions_df.error.std(),2)} days gives {2*np.round(predictions_df.error.std(),2)})",
    ]
)
axs[1].set_title(f"{model_name} - RMSE {rmse.round(2)} days")

fig.suptitle("Final model");

### Feature importance

Which features does the model ascribe predictive power to?

In [None]:
# Feature names
coef = pd.DataFrame(data=list(model_X_test.columns))
# Feature importances, sorted
coef["coef"] = models["final_model"][model_name]["model"].feature_importances_
coef.sort_values("coef", ascending=False, inplace=True)
coef.set_index(0, inplace=True)
# Plot interactive plot
# Hover over a feature for full feature name
fig = px.bar(coef, x=coef.index, y="coef")
fig.show()

# Extensions

- Build two separate regression models - one for long stayers (21+ days), one for not long-stayers.
- Include IS_MINOR data in conjunction with above