# Fairness analysis - regression model

Fairness can mean many things.

In this analysis, we focus on one aspect: whether the accuracy of the machine learning model's predictions varies across demographic groups.

> Does the machine learning model perform better or worse for any of the following demographics: "ETHNIC_CATEGORY_CODE_DESCRIPTION", "IMD county decile", "OAC Group Name", "OAC Subgroup Name", "OAC Supergroup Name", "PATIENT_GENDER_CURRENT_DESCRIPTION", "POST_CODE_AT_ADMISSION_DATE_DISTRICT", "Rural urban classification"?

This demographic data was **excluded** when training the models. This notebook reintroduces it so we can plot, for each demographic, the underlying distribution of counts, Length of Stay, and the error in the model's predictions.

Many demographic subgroups have very small representation (count), so it is important to understand the distribution of your data before drawing conclusions. Poor model performance for a group of ~100 individuals out of 200,000 in total may reflect small-sample noise rather than a systematic problem.
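
One way to make this check concrete is to tabulate group sizes and flag those that fall below a chosen threshold. The sketch below runs on a small synthetic frame. The helper name `flag_small_groups` and the `min_count` threshold of 100 are illustrative assumptions, not part of this analysis.

```python
import pandas as pd


def flag_small_groups(df: pd.DataFrame, column: str, min_count: int = 100) -> pd.DataFrame:
    """Return count and share per subcategory, flagging groups below min_count.

    Hypothetical helper; min_count is an arbitrary illustrative threshold.
    """
    counts = df[column].value_counts(dropna=False).rename("count").to_frame()
    counts["share"] = counts["count"] / len(df)
    counts["small_group"] = counts["count"] < min_count
    return counts


# Synthetic data for illustration only
toy = pd.DataFrame({"group": ["A"] * 500 + ["B"] * 450 + ["C"] * 50})
summary = flag_small_groups(toy, "group", min_count=100)
print(summary)
```

Any subcategory flagged as `small_group` deserves extra caution when reading the error plots later in the notebook.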

In [None]:
import math
import pickle

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

%matplotlib inline
plt.rcParams["figure.figsize"] = [15, 8]

## Load data

In [None]:
# sensitive columns before one-hot-encoding
sensitive_columns = [
    "ETHNIC_CATEGORY_CODE_DESCRIPTION",
    "IMD county decile",
    "OAC Group Name",
    "OAC Subgroup Name",
    "OAC Supergroup Name",
    "PATIENT_GENDER_CURRENT_DESCRIPTION",
    "POST_CODE_AT_ADMISSION_DATE_DISTRICT",
    "Rural urban classification",
]

# one-hot encoded dataframe
features_sensitive_df = pd.read_parquet(
    "../../data/processed/features-sensitive.parquet"
)

# note: CatBoost handles categorical features itself during training, so this frame is not one-hot encoded
features_sensitive_catboost_df = pd.read_parquet(
    "../../data/processed/features-sensitive-catboost.parquet"
)

# Separate features (X) from the target (y)
X = features_sensitive_df.drop(columns="LENGTH_OF_STAY")
y = features_sensitive_df.LENGTH_OF_STAY
X_catboost = features_sensitive_catboost_df.drop(columns="LENGTH_OF_STAY")
y_catboost = features_sensitive_catboost_df.LENGTH_OF_STAY

# Split data for train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=42
)
print(X_train.shape, X_test.shape)

X_train_catboost, X_test_catboost, y_train_catboost, y_test_catboost = train_test_split(
    X_catboost, y_catboost, train_size=0.75, random_state=42
)
print(X_train_catboost.shape, X_test_catboost.shape)

## Explore underlying distribution of sensitive features

Before exploring the models, we need an understanding of actual variation in the data for each subcategory.

First we will plot the count of each demographic across the sensitive features, so we can identify any low `n` that may skew further analysis.

In [None]:
for feature in sensitive_columns:
    data = pd.DataFrame(columns=["subfeature", "count"])
    # iterate over all subcategories for the feature
    for subfeature in features_sensitive_df[feature].unique():
        subset = features_sensitive_df[features_sensitive_df[feature] == subfeature]
        count = subset.LENGTH_OF_STAY.count()
        df = pd.DataFrame(data=[{"subfeature": subfeature, "count": count}])
        data = pd.concat([data, df])

    fig = px.bar(data, x="subfeature", y="count", title=feature, orientation="v")
    # note: horizontal lines don't render correctly in Azure ML Notebooks, but do in VS Code
    fig.add_hline(y=data["count"].mean(), line_dash="dash", line_color="green")
    fig.update_xaxes(title="")
    fig.show()

### Explore mean Length of Stay by subcategory

How does the actual Length of Stay vary by subcategory? Knowing this helps us judge whether the model is capturing the historical reality or introducing additional errors.

In [None]:
global_mean = features_sensitive_df.LENGTH_OF_STAY.mean()
for feature in sensitive_columns:
    data = pd.DataFrame(columns=["subfeature", "mean_los"])
    # iterate through all subcategories
    for subfeature in features_sensitive_df[feature].unique():
        subset = features_sensitive_df[features_sensitive_df[feature] == subfeature]
        mean_los = subset.LENGTH_OF_STAY.mean()
        df = pd.DataFrame(data=[{"subfeature": subfeature, "mean_los": mean_los}])
        data = pd.concat([data, df])

    fig = px.bar(
        data,
        x="subfeature",
        y="mean_los",
        title=feature,
        orientation="v",
        labels={"mean_los": "Mean Length of Stay (days)"},
    )
    fig.add_hline(y=global_mean, line_dash="dash", line_color="green")
    fig.update_xaxes(title="")
    fig.show()

## Load models

With an understanding of the underlying distribution of demographics and their length of stay, we'll load the trained models to explore how the performance varies by demographic.

In [None]:
# load models from outside the git tree
with open("../../models/regression.pickle", "rb") as handle:
    models = pickle.load(handle)
models

## Generate model predictions

In [None]:
# pick one model to analyse:
model = "xgboost"

if model == "catboost":
    model_X_test = X_test_catboost
    model_y_test = y_test_catboost
else:
    model_X_test = X_test
    model_y_test = y_test

# Generate predictions
preds = np.clip(
    models[model]["model"].predict(model_X_test.drop(columns=sensitive_columns)),
    0,
    None,
)

# Create a combined data frame
fairness_df = model_X_test.copy()
fairness_df["LENGTH_OF_STAY"] = model_y_test
fairness_df["LENGTH_OF_STAY_PREDICTED"] = preds

### Explore MAE by feature

Do some subcategories of a feature have a higher error than others? That is, is the model less accurate for certain demographics?

Rather than plotting the raw MAE per subcategory, we calculate the difference between each subcategory's MAE and the mean MAE across subcategories, to highlight any subcategories with above-average error.

In [None]:
for feature in sensitive_columns:
    data = pd.DataFrame(columns=["subfeature", "mae"])
    for subfeature in fairness_df[feature].unique():
        subset = fairness_df[fairness_df[feature] == subfeature]
        mae = mean_absolute_error(
            subset.LENGTH_OF_STAY, subset.LENGTH_OF_STAY_PREDICTED
        )
        df = pd.DataFrame(data=[{"subfeature": subfeature, "mae": mae}])
        data = pd.concat([data, df])
    # unweighted mean of the subcategory MAEs; this differs from the model's overall MAE
    mean_mae = data["mae"].mean()
    data["mae_diff"] = data.mae - mean_mae
    fig = px.bar(
        data,
        x="subfeature",
        y="mae_diff",
        title=feature,
        orientation="v",
        labels={"mae_diff": "Error compared to mean error (days)"},
    )
    fig.add_hline(y=data.mae_diff.mean(), line_dash="dash", line_color="green")
    fig.update_xaxes(title="")
    fig.show()

### Explore MAE ratio to LoS per feature

The models developed here showed a higher MAE for higher LoS, so subcategories with a higher LoS to start with will also have a higher MAE. This is not an indication of discrimination, but of the model's limited predictive power at higher LoS.

To account for this, we can calculate the MAE difference for each subfeature, scaled to the mean LoS for that subfeature, to check for any subfeatures which stand out.
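
A small sketch of this scaling, on synthetic data, is shown below. The `group` column and all values are made up for illustration, and treating a mean LoS of zero as an undefined ratio (`NaN`) is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only; column names mirror this notebook
toy = pd.DataFrame(
    {
        "group": ["short", "short", "long", "long"],
        "LENGTH_OF_STAY": [1.0, 3.0, 10.0, 30.0],
        "LENGTH_OF_STAY_PREDICTED": [2.0, 2.0, 15.0, 25.0],
    }
)
toy["abs_error"] = (toy["LENGTH_OF_STAY"] - toy["LENGTH_OF_STAY_PREDICTED"]).abs()

summary = toy.groupby("group").agg(
    mean_los=("LENGTH_OF_STAY", "mean"),
    mae=("abs_error", "mean"),
)
# Relative error: MAE as a fraction of the group's mean LoS (dimensionless).
# A mean LoS of zero leaves the ratio undefined, marked as NaN here (an assumption).
summary["mae_mean_los_ratio"] = summary["mae"] / summary["mean_los"].replace(0, np.nan)
summary["ratio_diff"] = (
    summary["mae_mean_los_ratio"] - summary["mae_mean_los_ratio"].mean()
)
print(summary)
```

In this toy example the `long` group has the larger MAE in days but the smaller error relative to its mean LoS, which is the effect the scaling is meant to expose.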



In [None]:
for feature in sensitive_columns:
    # capture the mae_mean_ratio for each subfeature
    data = pd.DataFrame(columns=["subfeature", "mae_mean_los_ratio"])

    for subfeature in fairness_df[feature].unique():
        subset = fairness_df[fairness_df[feature] == subfeature]
        mean_los = subset.LENGTH_OF_STAY.mean()
        mae = mean_absolute_error(
            subset.LENGTH_OF_STAY, subset.LENGTH_OF_STAY_PREDICTED
        )
        mae_mean_los_ratio = mae / mean_los if mean_los > 0 else 0
        df = pd.DataFrame(
            data=[{"subfeature": subfeature, "mae_mean_los_ratio": mae_mean_los_ratio}]
        )
        data = pd.concat([data, df])

    # calculate the difference of each ratio to the mean, to highlight any anomalies
    data["mae_mean_los_ratio_diff"] = (
        data["mae_mean_los_ratio"] - data["mae_mean_los_ratio"].mean()
    )
    fig = px.bar(
        data,
        x="subfeature",
        y="mae_mean_los_ratio_diff",
        title=feature,
        orientation="v",
        labels={
            "mae_mean_los_ratio_diff": "MAE / mean LoS ratio compared to mean ratio (dimensionless)"
        },
    )
    fig.add_hline(
        y=data.mae_mean_los_ratio_diff.mean(), line_dash="dash", line_color="green"
    )
    fig.update_xaxes(title="")
    fig.show()

## Extensions

* Establish statistical significance tests for differences in performance between demographics
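
As one possible starting point for this extension, the sketch below runs a two-sided permutation test on the difference in MAE between one subcategory and the rest. This is an illustrative choice of test, not something used in this notebook. The helper name `permutation_test_mae`, its parameters, and the synthetic errors are all assumptions.

```python
import numpy as np


def permutation_test_mae(abs_err_group, abs_err_rest, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in MAE between two groups.

    Hypothetical helper: inputs are 1-D arrays of absolute errors.
    Returns the observed MAE difference and an approximate p-value.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(abs_err_group, dtype=float)
    b = np.asarray(abs_err_rest, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign errors to the two groups, keeping group sizes fixed
        shuffled = rng.permutation(pooled)
        diff = shuffled[: a.size].mean() - shuffled[a.size :].mean()
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the p-value strictly positive
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Synthetic absolute errors for illustration only
rng = np.random.default_rng(42)
group_errors = rng.exponential(scale=3.0, size=100)
rest_errors = rng.exponential(scale=2.0, size=5000)
observed, p_value = permutation_test_mae(group_errors, rest_errors)
print(f"MAE difference: {observed:.2f} days, p ~ {p_value:.4f}")
```

With many subcategories tested at once, some correction for multiple comparisons would also be needed before treating any single small p-value as meaningful.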