# Modelling - Regression

This notebook explores regression models that **predict Length of Stay (days)**, as a baseline for the [Long Stayer Risk Stratification](https://github.com/nhsx/skunkworks-long-stayer-risk-stratification) model, which achieved a **Mean Absolute Error (MAE) of 3.8 days** (median absolute error of 2.2).

This notebook is broken down into:

1. Statistical tests to check the validity of linear models using Ordinary Least Squares (OLS)
2. Training a range of baseline models using cross validation
3. Testing final models on a test dataset
4. Exploring in more detail the best performing baseline model

In [None]:
import pandas as pd
import pickle
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn import linear_model
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from scipy.stats import kstest, shapiro, anderson
from catboost import CatBoostRegressor
from xgboost import XGBRegressor

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

%matplotlib inline
plt.rcParams["figure.figsize"] = [15, 8]

In [None]:
def train_model(gsc, X_train, y_train):
    """Uses a GridSearchCV instance to find a reasonable model, and store
    performance and fitted model into a python dict

    Parameters:

        gsc (sklearn.model_selection.GridSearchCV object): defined model
        X_train (pandas dataframe): training dataframe with features
        y_train (pandas series): training targets

    Returns:

        (dict): resulting fitted model and performance metrics
    """

    grid_result = gsc.fit(X_train, y_train)

    model = {
        "cv_mae_mean": np.round(
            -grid_result.cv_results_["mean_test_score"][grid_result.best_index_], 3
        ),
        "cv_mae_std": np.round(
            grid_result.cv_results_["std_test_score"][grid_result.best_index_], 2
        ),
        "model": grid_result.best_estimator_,
    }

    # refit=True already refits best_estimator_ on the full training set;
    # the explicit fit below is redundant but harmless
    model["model"].fit(X_train, y_train)
    # training-set MAE (optimistic; compare against cv_mae_mean)
    model["mae"] = np.round(
        mean_absolute_error(y_train, model["model"].predict(X_train)), 3
    )

    return model

## Load features

In [None]:
features = pd.read_parquet("../../data/features.parquet")
features.shape

## Cap length of stay

The highest length of stay is ~250 days, so we will check the distribution of length of stay and cap high values.

In [None]:
# Check distribution of length of stay
features.groupby(by="LENGTH_OF_STAY").count().AGE_ON_ADMISSION.plot();

In [None]:
# Cap maximum length of stay to 30 days
features["LENGTH_OF_STAY"] = features["LENGTH_OF_STAY"].clip(upper=30)

## Define target and training features

In [None]:
X = features.drop(columns="LENGTH_OF_STAY")
y = features.LENGTH_OF_STAY

## Variance Inflation Factors

Simple linear models (ordinary least squares) assume there is no multicollinearity between features.

Variance inflation factors (VIF) quantify the extent of any collinearity present.

As a rule of thumb, we are looking for a VIF below roughly 10 for every feature.

In [None]:
# Takes ~6 minutes to run on a STANDARD_DS3_V2
vif = pd.DataFrame()
vif["feature"] = X.columns

vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]
vif
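
For intuition, the VIF of feature j is 1 / (1 - R_j²), where R_j² is obtained by regressing feature j on all the other features. The sketch below computes this by hand on synthetic data. The helper `vif_by_hand` and the demo arrays are illustrative only and are not part of this notebook's pipeline. The sketch fits an intercept, so its values may differ from the statsmodels output when the design matrix has no constant column.

```python
import numpy as np


def vif_by_hand(X: np.ndarray, j: int) -> float:
    """VIF for column j: 1 / (1 - R^2) from regressing column j on the others."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    # Add an intercept column so R^2 is the usual centred R^2
    design = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    r_squared = 1.0 - residuals.var() / target.var()
    return 1.0 / (1.0 - r_squared)


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)  # nearly collinear with a
X_demo = np.column_stack([a, b, c])

vifs = [vif_by_hand(X_demo, j) for j in range(X_demo.shape[1])]
print([round(v, 1) for v in vifs])
```

In this synthetic case the two nearly collinear columns should show a VIF well above 10, while the independent column stays close to 1.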

## Residual distributions of OLS

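The VIFs above do not indicate a lack of multicollinearity. For completeness, we also check a second OLS assumption: normality of the residuals. (This is distinct from homoscedasticity, which concerns constant variance of the residuals.)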

This requires training a linear model, calculating the residuals and checking visually and statistically that they are normally distributed.

In [None]:
# Train basic OLS model for statistical testing only
reg = linear_model.LinearRegression()
reg.fit(X, y)
# Reuse y's index so the subtraction below aligns row by row
pred = pd.Series(reg.predict(X), index=y.index)
resid = y - pred

In [None]:
# Visual inspection
resid.hist(bins=30);

### Shapiro-Wilk test for normality

* Null hypothesis = our residuals are drawn from a normal distribution
* Alternative hypothesis = our residuals are not drawn from a normal distribution (and fail the requirements of an OLS model)
* The test statistic W measures how closely the sample matches a normal distribution (values near 1 are consistent with normality)
* The p-value is the probability of observing a test statistic at least this extreme if the null hypothesis were true
* A p-value < 0.05 leads us to reject the null hypothesis

In [None]:
shapiro(resid)
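
To illustrate how the output reads, here is a minimal sketch on synthetic samples (the sample names and sizes are assumptions chosen for the demo, not data from this notebook). Note that SciPy documents that for more than 5000 observations the W statistic is accurate but the p-value may not be, so on a large dataset the result is best read alongside the histogram above.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=500)       # drawn from a normal distribution
skewed_sample = rng.exponential(size=500)  # strongly right-skewed

w_normal, p_normal = shapiro(normal_sample)
w_skewed, p_skewed = shapiro(skewed_sample)

print(f"normal: W={w_normal:.3f}, p={p_normal:.3g}")
print(f"skewed: W={w_skewed:.3f}, p={p_skewed:.3g}")
```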

### One-sample Kolmogorov-Smirnov test for normality

* Null hypothesis = our residuals are drawn from a normal distribution
* Alternative hypothesis = our residuals are not drawn from a normal distribution (and fail the requirements of an OLS model)
* The test statistic is the largest distance between the empirical distribution of the sample and the reference normal distribution
* The p-value is the probability of observing a test statistic at least this extreme if the null hypothesis were true
* A p-value < 0.05 leads us to reject the null hypothesis

In [None]:
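# "norm" is the standard normal N(0, 1), so standardise first (p-value is then approximate)
kstest((resid - resid.mean()) / resid.std(), "norm")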
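
One detail worth checking: `kstest(..., "norm")` compares the sample against the standard normal N(0, 1), not against a normal with the sample's own mean and spread. The sketch below, on synthetic data that is normal but has a standard deviation of 5 (an assumption for the demo), shows how the unstandardised test would reject normality purely because of scale. Estimating the mean and standard deviation from the same sample makes the p-value approximate, so treat it as indicative.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=5.0, size=1000)  # normal, but not N(0, 1)

# Compared directly against N(0, 1): rejected because of the scale mismatch
raw = kstest(sample, "norm")

# Standardised first: now the comparison is about shape only
standardised = kstest((sample - sample.mean()) / sample.std(ddof=1), "norm")

print(f"raw:          D={raw.statistic:.3f}, p={raw.pvalue:.3g}")
print(f"standardised: D={standardised.statistic:.3f}, p={standardised.pvalue:.3g}")
```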

### Anderson-Darling test for normality

* Null hypothesis = our residuals are drawn from normal distribution
* Alternate hypothesis = our residuals are not drawn from normal distribution (and fail requirements of OLS model)
* Test statistic is compared to critical value at the significance level required (e.g. 5%)
* Test statistic > critical value for 5% significance level leads us to reject null hypothesis

In [None]:
anderson(resid, "norm")
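
The Anderson-Darling result has no single p-value to read off. Instead, we compare the statistic with the critical value at the chosen significance level. A minimal sketch of that comparison on a synthetic skewed sample (the sample is an assumption for illustration only):

```python
import numpy as np
from scipy.stats import anderson

rng = np.random.default_rng(0)
skewed_sample = rng.exponential(size=500)  # clearly non-normal

result = anderson(skewed_sample, dist="norm")

# significance_level is given in percent; pick the critical value at 5%
levels = list(result.significance_level)
crit_5pct = result.critical_values[levels.index(5.0)]
reject_normality = bool(result.statistic > crit_5pct)

print(f"statistic={result.statistic:.2f}, 5% critical value={crit_5pct:.3f}")
print(f"reject normality at 5%: {reject_normality}")
```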

**Statistical testing invalidates the assumptions required for OLS models.**

## Train/test split

For model evaluation, we will hold back a 25% test set, and use cross-validation on the remaining 75% for all models until the final comparison is made.

In [None]:
# Split data for train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
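
Because `train_size` and `test_size` are easy to confuse, a quick check of the resulting row counts can confirm that the split matches the intended 75/25 division. A minimal sketch on synthetic arrays (`X_demo` and `y_demo` are placeholders, not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

test_fraction = len(X_te) / len(X_demo)
print(f"train rows: {len(X_tr)}, test rows: {len(X_te)}, test fraction: {test_fraction:.2f}")
```

The same check applied to `X_train` and `X_test` should report a test fraction of about 0.25.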

## Modelling

The strategy is to try a number of regression models:

* Use `GridSearchCV` for hyperparameter tuning with cross-validation, refitting the best model on the full training set
* Test all final models against the held-out test set

OLS models are excluded because their statistical assumptions are not met. Neural networks (NN) are excluded due to complexity and interpretability concerns.
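
Two details of this pattern are worth making explicit. First, scikit-learn maximises scores, so with `scoring="neg_mean_absolute_error"` the reported score is the negated MAE and has to be flipped back. Second, with `refit=True` the `best_estimator_` is refitted on all the data passed to `fit`. The sketch below illustrates both on synthetic data, using `Ridge` purely as a stand-in estimator (it is not one of the models in this notebook).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

grid = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_mean_absolute_error",
    refit=True,
)
grid.fit(X_demo, y_demo)

# Scores are negated MAE, so flip the sign to report MAE
cv_mae = -grid.best_score_

# Fitting the best parameters manually on all the data gives the same model
manual = Ridge(**grid.best_params_).fit(X_demo, y_demo)
same_fit = bool(np.allclose(manual.coef_, grid.best_estimator_.coef_))

print(f"best params: {grid.best_params_}, CV MAE: {cv_mae:.3f}, same fit: {same_fit}")
```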

In [None]:
# Initiate empty models dictionary
models = {}

### Mean model

The simplest baseline model takes the mean length of stay as its prediction.

In [None]:
model_name = "mean"

# define gridsearch parameters
gsc = GridSearchCV(
    estimator=DummyRegressor(strategy="mean"),
    param_grid={},
    cv=5,
    scoring="neg_mean_absolute_error",
    verbose=1,
    n_jobs=-1,
    refit=True,
)

# takes ~1 second to run on a STANDARD_DS3_V2
models[model_name] = train_model(gsc, X_train, y_train)
models[model_name]
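
Since models are compared on MAE, it may be worth knowing that a constant median prediction never has a higher in-sample MAE than a constant mean prediction. The sketch below shows this on a synthetic right-skewed target capped at 30 (the distribution and its scale are assumptions for the demo). `DummyRegressor(strategy="median")` would give the equivalent baseline in scikit-learn, although the ordering is not guaranteed on held-out data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Right-skewed synthetic "length of stay", capped at 30 days
los = np.minimum(rng.exponential(scale=5.0, size=10_000), 30.0)

mae_mean = float(np.mean(np.abs(los - los.mean())))
mae_median = float(np.mean(np.abs(los - np.median(los))))

print(f"MAE predicting the mean:   {mae_mean:.3f}")
print(f"MAE predicting the median: {mae_median:.3f}")
```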

### Elastic net regression

A regularised linear model.

In [None]:
model_name = "elastic"

# define gridsearch parameters
gsc = GridSearchCV(
    estimator=linear_model.ElasticNet(),
    param_grid={
        "l1_ratio": [0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1],
    },
    cv=5,
    scoring="neg_mean_absolute_error",
    verbose=1,
    n_jobs=-1,
    refit=True,
)

# takes ~2 seconds to run on a STANDARD_DS3_V2
models[model_name] = train_model(gsc, X_train, y_train)
models[model_name]
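
The grid above tunes only `l1_ratio` and leaves the regularisation strength `alpha` at its default of 1.0. As a possible extension, the sketch below tunes both and standardises the features first, since elastic net penalties are sensitive to feature scale. The data, the grid values and `max_iter` are assumptions for illustration, not settings taken from this notebook.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features on very different scales
X_demo = rng.normal(size=(300, 4)) * np.array([1.0, 10.0, 100.0, 0.1])
y_demo = X_demo[:, 0] + 0.05 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

# make_pipeline names each step after its lowercased class name
pipe = make_pipeline(StandardScaler(), ElasticNet(max_iter=10_000))

grid = GridSearchCV(
    estimator=pipe,
    param_grid={
        "elasticnet__alpha": [0.01, 0.1, 1.0],
        "elasticnet__l1_ratio": [0.1, 0.5, 0.9],
    },
    cv=5,
    scoring="neg_mean_absolute_error",
)
grid.fit(X_demo, y_demo)

best = grid.best_params_
print(best, f"CV MAE: {-grid.best_score_:.3f}")
```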

### Random forest regressor

In [None]:
model_name = "randomforest"

# define gridsearch parameters
gsc = GridSearchCV(
    estimator=RandomForestRegressor(),
    param_grid={"n_estimators": [10, 100, 1000], "max_depth": [5, 10, None]},
    cv=5,
    scoring="neg_mean_absolute_error",
    verbose=1,
    n_jobs=-1,
    refit=True,
)

# takes ~15 mins to run on a STANDARD_DS3_V2
models[model_name] = train_model(gsc, X_train, y_train)
models[model_name]

### Catboost

Boosted tree optimised for categorical features

In [None]:
model_name = "catboost"

# list the numerical features; every other column is treated as categorical
num_features = [
    "AGE_ON_ADMISSION",
    "EL CountLast12m",
    "EMCountLast12m",
    "OP First CountLast12m",
    "OP FU CountLast12m",
]
cat_features = list(set(X_train.columns) - set(num_features))

# define gridsearch parameters
gsc = GridSearchCV(
    estimator=CatBoostRegressor(verbose=False, cat_features=cat_features),
    param_grid={
        "max_depth": [5, 10, None],
        "learning_rate": [0.01, 0.1, 1],
        "iterations": [10, 100, 1000],
    },
    cv=5,
    scoring="neg_mean_absolute_error",
    verbose=1,
    n_jobs=-1,
    refit=True,
)

# takes ~20 mins to run on a STANDARD_DS3_V2
models[model_name] = train_model(gsc, X_train, y_train)
models[model_name]

### XGBoost

In [None]:
model_name = "xgboost"

# define gridsearch parameters
gsc = GridSearchCV(
    estimator=XGBRegressor(random_state=42),
    param_grid={
        "n_estimators": [1, 5],
        "learning_rate": [0.01, 0.1, 1],
        "max_depth": [5, 10, None],
    },
    cv=5,
    scoring="neg_mean_absolute_error",
    verbose=1,
    n_jobs=-1,
    refit=True,
)

# takes ~1 min to run on a STANDARD_DS3_V2
models[model_name] = train_model(gsc, X_train, y_train)
models[model_name]

## Save models

In [None]:
# save models outside the git tree
with open("../../models/regression.pickle", "wb") as handle:
    pickle.dump(models, handle)

## Load models

In [None]:
# load models from outside the git tree
with open("../../models/regression.pickle", "rb") as handle:
    models = pickle.load(handle)
models

## Validate models

Use the held-out test set to evaluate the performance of all the tuned models.

In [None]:
for model in models:
    preds = models[model]["model"].predict(X_test)
    mae = mean_absolute_error(y_test, preds)
    print(f"{model} test mae: {mae.round(3)}")
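
The reference model is quoted with both a mean and a median absolute error, so it may help to report both here as well. A minimal sketch of such a summary follows; the helper `absolute_error_summary` is hypothetical and the numbers are toy values. `sklearn.metrics.median_absolute_error` provides the median variant directly.

```python
import numpy as np


def absolute_error_summary(y_true, y_pred):
    """Return the mean and median absolute error for a set of predictions."""
    errors = np.abs(
        np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    )
    return {"mae": float(errors.mean()), "median_ae": float(np.median(errors))}


# Toy values: absolute errors are [0, 1, 0, 6]
summary = absolute_error_summary([1, 2, 3, 10], [1, 3, 3, 4])
print(summary)  # {'mae': 1.75, 'median_ae': 0.5}
```

A large gap between the two, as in this toy case, suggests that a few large errors dominate the mean.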

## Model exploration

A single performance metric can be a misleading summary of how a model performs. We will take the "best performing" baseline model, an `XGBRegressor`, and explore in more detail how it performs.

Todo:

- [ ] Plot predicted vs actual
- [ ] Plot errors
- [ ] Convert into risk scores
- [ ] Explore feature importance
- [ ] Retrain simpler model with only top ~10 features?
- [ ] Fairness analysis
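
As a possible starting point for the error analysis, the sketch below breaks MAE down by bands of actual length of stay, which can show whether the headline figure hides poor performance on longer stays. The helper `mae_by_band`, the band edges and the toy values are all assumptions for illustration.

```python
import numpy as np
import pandas as pd


def mae_by_band(y_true, y_pred, bins=(0, 1, 3, 7, 14, 30)):
    """Mean absolute error and row count per band of actual length of stay."""
    frame = pd.DataFrame(
        {
            "actual": np.asarray(y_true, dtype=float),
            "pred": np.asarray(y_pred, dtype=float),
        }
    )
    frame["abs_error"] = (frame["actual"] - frame["pred"]).abs()
    # include_lowest=True keeps stays of exactly 0 days in the first band
    frame["band"] = pd.cut(frame["actual"], bins=list(bins), include_lowest=True)
    return frame.groupby("band", observed=True)["abs_error"].agg(["mean", "count"])


demo = mae_by_band([0, 1, 2, 5, 10, 30], [1, 1, 4, 5, 6, 20])
print(demo)
```

With real predictions, this could be called as `mae_by_band(y_test, preds)` for each model.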