# Modelling - Regression

This notebook explores regression models that **predict Length of Stay (days)** as a baseline for the [Long Stayer Risk Stratification](https://github.com/nhsx/skunkworks-long-stayer-risk-stratification) model, which achieved a **Mean Absolute Error (MAE) of 3.8 days** (median absolute error of 2.2 days).
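
For reference, the two error metrics quoted above differ only in how the absolute errors are summarised. A minimal sketch, using made-up numbers that are purely illustrative and are not results from either model:

```python
from statistics import mean, median

# Illustrative values only (hypothetical lengths of stay, in days)
actual = [1, 2, 3, 5, 30]
predicted = [2, 2, 4, 4, 10]

abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]

mae = mean(abs_errors)      # pulled upwards by the single large miss
medae = median(abs_errors)  # unaffected by it

print(f"MAE: {mae}, median absolute error: {medae}")
```

In the figures quoted above, the MAE (3.8) sitting well above the median absolute error (2.2) suggests a similar pattern: a minority of stays with large errors.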

This notebook begins with statistical tests to check the validity of linear models before implementing tree-based models.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import kstest, shapiro, anderson
from catboost import CatBoostRegressor
from xgboost import XGBRegressor

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

%matplotlib inline
plt.rcParams["figure.figsize"] = [15, 8]

## Load features

In [None]:
features = pd.read_parquet("../../data/features.parquet")
features.shape

## Cap length of stay

The highest length of stay is ~250 days, so we will check the distribution of length of stay and cap high values.

In [None]:
# Check distribution of length of stay
features.groupby(by="LENGTH_OF_STAY").count().AGE_ON_ADMISSION.plot();

In [None]:
# Cap maximum length of stay to 30 days
features["LENGTH_OF_STAY"] = features["LENGTH_OF_STAY"].clip(upper=30)

## Define target and training features

In [None]:
X = features.drop(columns="LENGTH_OF_STAY")
y = features.LENGTH_OF_STAY

## Variance Inflation Factors

Simple linear models (ordinary least squares) assume there is no multicollinearity between features.

Variance inflation factors (VIF) quantify the extent of any collinearity present.

We are looking for a VIF below roughly 10 for every feature.
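
To make the quantity concrete: the VIF of feature $j$ is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing feature $j$ on all the other features. The sketch below computes this with NumPy on synthetic data. The helper name `vif_from_r2` and the data are assumptions for illustration, and the sketch adds an intercept to each regression.

```python
import numpy as np


def vif_from_r2(X: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the others plus an intercept."""
    n, k = X.shape
    vifs = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1 - resid.var() / target.var()
        vifs[j] = 1 / (1 - r2)
    return vifs


# Synthetic example: `c` is almost a copy of `a`, `b` is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

vifs = vif_from_r2(np.column_stack([a, b, c]))
print(vifs.round(1))
```

The collinear pair (`a`, `c`) should come out far above 10, while `b` stays close to 1. As far as we are aware, statsmodels' `variance_inflation_factor` does not add an intercept itself, so its values can differ from this sketch unless `X` contains a constant column.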

In [None]:
# Takes ~6 minutes to run on a STANDARD_DS3_V2
vif = pd.DataFrame()
vif["feature"] = X.columns

vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]
vif

## Residual distributions of OLS

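If the VIFs indicated a lack of multicollinearity (they do not), the next step would be to check the residuals for normality. Note that this is a separate assumption from homoscedasticity, which concerns constant residual variance.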

This requires training a linear model, calculating the residuals and checking visually and statistically that they are normally distributed.

In [None]:
# Train basic OLS model for statistical testing only
reg = linear_model.LinearRegression()
reg.fit(X, y)
pred = pd.Series(reg.predict(X), index=y.index)  # share y's index so the subtraction aligns row by row
resid = y - pred

In [None]:
# Visual inspection
resid.hist(bins=30);

### Shapiro-Wilk test for normality

* Null hypothesis: our residuals are drawn from a normal distribution
* Alternative hypothesis: our residuals are not drawn from a normal distribution (and so fail the requirements of an OLS model)
* The test statistic W measures how closely the sample matches a normal distribution (values near 1 indicate close agreement)
* The p-value is the probability of seeing a test statistic at least this extreme if the null hypothesis is true
* A p-value < 0.05 leads us to reject the null hypothesis

In [None]:
shapiro(resid)

### One-sample Kolmogorov-Smirnov test for normality

* Null hypothesis: our residuals are drawn from a normal distribution
* Alternative hypothesis: our residuals are not drawn from a normal distribution (and so fail the requirements of an OLS model)
* The test statistic is the largest distance between the empirical distribution function of the residuals and the normal distribution function
* The p-value is the probability of seeing a test statistic at least this extreme if the null hypothesis is true
* A p-value < 0.05 leads us to reject the null hypothesis

In [None]:
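# "norm" is the standard normal N(0, 1), so standardise the residuals first
kstest((resid - resid.mean()) / resid.std(), "norm")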
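
Because `kstest(..., "norm")` compares against a standard normal (mean 0, standard deviation 1), data on any other scale is rejected regardless of its shape. The pure-Python sketch below illustrates this on synthetic data. The helper `ks_statistic_vs_std_normal` is a hypothetical name, and it computes only the test statistic, not a p-value.

```python
import math
import random


def ks_statistic_vs_std_normal(sample):
    """Largest gap between the empirical CDF of `sample` and the N(0, 1) CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = 0.5 * (1 + math.erf(x / math.sqrt(2)))
        # The empirical CDF jumps from i/n to (i + 1)/n at x; check both sides
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


random.seed(0)
raw = [random.gauss(5, 2) for _ in range(500)]  # normal, but not N(0, 1)

mu = sum(raw) / len(raw)
sd = math.sqrt(sum((x - mu) ** 2 for x in raw) / (len(raw) - 1))
standardised = [(x - mu) / sd for x in raw]

d_raw = ks_statistic_vs_std_normal(raw)
d_std = ks_statistic_vs_std_normal(standardised)
print(f"raw: {d_raw:.3f}, standardised: {d_std:.3f}")
```

The raw sample gives a very large statistic even though it is normal by construction; standardising removes that artefact. Standardising with a mean and standard deviation estimated from the same data tends to make the usual p-value conservative, so treat it as a rough guide.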

### Anderson-Darling test for normality

* Null hypothesis: our residuals are drawn from a normal distribution
* Alternative hypothesis: our residuals are not drawn from a normal distribution (and so fail the requirements of an OLS model)
* The test statistic is compared to the critical value at the required significance level (e.g. 5%)
* A test statistic greater than the critical value for the 5% significance level leads us to reject the null hypothesis

In [None]:
anderson(resid, "norm")

**Statistical testing invalidates the assumptions required for OLS models**

## Train/test split

For model evaluation, we will hold back a 25% test set, and use cross-validation on the remaining 75% for all models until the final comparison is made.

In [None]:
# Split data for train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

## Modelling

The strategy is to try a number of regression models:

* Tune hyperparameters with `GridSearchCV` under cross-validation, refitting the best configuration on the full training set
* Test all final models against the held-out test set
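
The cost of a grid search grows with the product of the grid sizes and the number of folds. A small sketch of that arithmetic, assuming a hypothetical helper `count_fits` and using the random forest grid that appears later in this notebook:

```python
from itertools import product


def count_fits(param_grid: dict, cv: int) -> int:
    """Model fits for an exhaustive grid search: candidates x folds, plus one refit."""
    n_candidates = len(list(product(*param_grid.values())))
    return n_candidates * cv + 1  # the +1 is the final refit (refit=True)


# Grid copied from the random forest cell in this notebook
rf_grid = {"n_estimators": [10, 100, 1000], "max_depth": [5, 10, None]}
rf_fits = count_fits(rf_grid, cv=5)
print(rf_fits)  # 9 candidates x 5 folds + 1 refit = 46
```

This makes it easier to judge whether a larger grid is affordable before launching it.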

OLS models are excluded because their statistical assumptions are not met. Neural networks (NN) are excluded due to complexity and interpretability concerns.

### Elastic net regression

A regularised linear model.
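
For reference, scikit-learn's `ElasticNet` minimises the objective below. Here $\alpha$ is the overall penalty strength (`alpha`) and $\rho$ is `l1_ratio`; the symbols $w$, $n$ and $\rho$ are notation chosen here rather than taken from the notebook:

$$
\min_{w}\; \frac{1}{2n}\lVert y - Xw \rVert_2^2 \;+\; \alpha\rho\,\lVert w \rVert_1 \;+\; \frac{\alpha(1-\rho)}{2}\,\lVert w \rVert_2^2
$$

With $\rho = 1$ the penalty is purely L1 (the lasso). The grid in the next cell tunes only `l1_ratio`, so `alpha` stays at its default of 1.0.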

In [None]:
# Takes ~2 seconds to run on a STANDARD_DS3_V2
gsc = GridSearchCV(
    estimator=linear_model.ElasticNet(),
    param_grid={
        "l1_ratio": [0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1],
    },
    cv=5,
    scoring="neg_mean_absolute_error",
    verbose=1,
    n_jobs=-1,
    refit=True,
)
grid_result = gsc.fit(X_train, y_train)
# store the performance metrics and model in a dictionary
models = {
    "elastic": {
        "mae_mean": np.round(-grid_result.best_score_, 2),
        "model": grid_result.best_estimator_,
    }
}
models["elastic"]

### Random forest regressor

Tuned with a grid search over the number of trees and the maximum tree depth.

In [None]:
# Takes ~2 mins to run on a STANDARD_DS3_V2
gsc = GridSearchCV(
    estimator=RandomForestRegressor(),
    param_grid={"n_estimators": [10, 100, 1000], "max_depth": [5, 10, None]},
    cv=5,
    scoring="neg_mean_absolute_error",
    verbose=1,
    n_jobs=-1,
    refit=True,
)
grid_result = gsc.fit(X_train, y_train)
# store the performance metrics and model in a dictionary
models["randomforest"] = {
    "mae_mean": np.round(-grid_result.best_score_, 2),
    "model": grid_result.best_estimator_,
}
models["randomforest"]

### CatBoost

A gradient-boosted tree model with native handling of categorical features.

In [None]:
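# Treat every column that is not listed as numeric as categorical
num_features = [
    "AGE_ON_ADMISSION",
    "EL CountLast12m",
    "EMCountLast12m",
    "OP First CountLast12m",
    "OP FU CountLast12m",
]
# A list comprehension keeps the original column order (a set difference does not)
cat_features = [col for col in X_train.columns if col not in num_features]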

In [None]:
# Takes ~2 mins to run on a STANDARD_DS3_V2
gsc = GridSearchCV(
    estimator=CatBoostRegressor(verbose=False, cat_features=cat_features),
    param_grid={
        "max_depth": [5, 10, 16],
        "learning_rate": [0.01, 0.1, 0.5, 1],
        "iterations": [10, 100, 1000],
    },
    cv=5,
    scoring="neg_mean_absolute_error",
    verbose=1,
    n_jobs=-1,
    refit=True,
)
grid_result = gsc.fit(X_train, y_train)
# store the performance metrics and model in a dictionary
models["catboost"] = {
    "mae_mean": np.round(-grid_result.best_score_, 2),
    "model": grid_result.best_estimator_,
}
models["catboost"]

In [None]:
grid_result.best_params_

In [None]:
reg = CatBoostRegressor(depth=5, verbose=False)
cv = -cross_val_score(reg, X_train, y_train, scoring="neg_mean_absolute_error", cv=5)
# store the performance metrics in a dictionary (this overwrites the grid-search entry above)
models["catboost"] = {"cv_mae_mean": cv.mean(), "cv_mae_std": cv.std()}
# add the model fitted to the full training set for final testing
models["catboost"]["model"] = reg.fit(X_train, y_train)
print(
    f'Catboost MAE: {models["catboost"]["cv_mae_mean"].round(2)} ({models["catboost"]["cv_mae_std"].round(2)})'
)

### XGBoost

In [None]:
reg = XGBRegressor(max_depth=5, random_state=42)
cv = -cross_val_score(reg, X_train, y_train, scoring="neg_mean_absolute_error", cv=5)
# store the performance metrics in a dictionary
models["xgboost"] = {"cv_mae_mean": cv.mean(), "cv_mae_std": cv.std()}
# add the model fitted to the full training set for final testing
models["xgboost"]["model"] = reg.fit(X_train, y_train)
print(
    f'XGBoost MAE: {models["xgboost"]["cv_mae_mean"].round(2)} ({models["xgboost"]["cv_mae_std"].round(2)})'
)

In [None]:
models
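
The last step of the strategy, scoring each stored model on the held-out test set, could be sketched as follows. The helper `held_out_mae` and the stand-in `ConstantModel` are hypothetical and exist only so the example runs on its own. In the notebook the call would be `held_out_mae(models, X_test, y_test)`, assuming every entry in `models` has a fitted `"model"`.

```python
import numpy as np


def held_out_mae(models: dict, X_test, y_test) -> dict:
    """Return {name: mean absolute error on the held-out set} for each stored model."""
    y_true = np.asarray(y_test, dtype=float)
    scores = {}
    for name, entry in models.items():
        y_pred = np.asarray(entry["model"].predict(X_test), dtype=float)
        scores[name] = float(np.mean(np.abs(y_true - y_pred)))
    return scores


class ConstantModel:
    """Stand-in for a fitted regressor: always predicts the same value."""

    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(len(X), self.value)


# Toy data, for illustration only
demo_models = {
    "always_3": {"model": ConstantModel(3.0)},
    "always_5": {"model": ConstantModel(5.0)},
}
demo_X = np.zeros((4, 2))
demo_y = [2, 3, 4, 7]

scores = held_out_mae(demo_models, demo_X, demo_y)
print(scores)
```

Because the target was capped at 30 days before the split, these errors are measured against the capped length of stay.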