# HSMA 6 - Session 4G - Exercise 2B - LoS Dataset Explainable AI

## Core

We're going to work with a dataset to try to predict patient length of stay. 

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from xgboost import XGBRegressor
# import any other libraries you need
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, \
                            r2_score, root_mean_squared_error

# Additional imports for explainable AI
from sklearn.inspection import PartialDependenceDisplay, permutation_importance

# Import shap for shapley values
import shap

# Load the JavaScript needed for the interactive charts later on
shap.initjs()

Run this cell to load in the dataframe containing the LOS data and view the dataframe.

In [None]:
los_df = pd.read_csv("../datasets/los_dataset/LengthOfStay.csv", index_col="eid")
los_df.head()

Run this cell to tidy the dataframe.

In [None]:
# Remove the date column
los_df = los_df.drop(columns="vdate")
# Convert the gender column to 0/1 (M = 0, F = 1), keeping the column name 'gender'
# Assigning the result back (rather than replacing in place on a selected column)
# makes sure the change is applied to the dataframe itself
los_df['gender'] = los_df['gender'].map({'M': 0, 'F': 1})

# Convert the facilities to one hot encoding
# Bonus - astype('int') will convert the true/false values to 0/1
# not necessary - it will work regardless
one_hot = pd.get_dummies(los_df['facid']).astype('int')
los_df = los_df.drop('facid', axis=1)
los_df = los_df.join(one_hot)

# Convert the readmission count to one hot encoding
# Bonus - astype('int') will convert the true/false values to 0/1
# not necessary - it will work regardless
one_hot = pd.get_dummies(los_df['rcount'], prefix="rcount").astype('int')
los_df = los_df.drop('rcount', axis=1)
los_df = los_df.join(one_hot)
los_df.head()

Train an XGBoost model (an ensemble of gradient-boosted decision trees) to predict length of stay from the variables in this dataset.

In [None]:
X = los_df.drop(columns='lengthofstay')
y = los_df['lengthofstay']

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size = 0.25,
    random_state=42
    )

regr_xgb = XGBRegressor(random_state=42)

# Train the model using the training set
regr_xgb.fit(X_train, y_train)

# Make predictions for both the training and testing sets
y_pred_train = regr_xgb.predict(X_train)
y_pred_test = regr_xgb.predict(X_test)

Assess the performance of this model.

In [None]:
print("TRAINING DATA")
print(f"Mean absolute error: {mean_absolute_error(y_train, y_pred_train):.2f}")
print(f"Mean absolute percentage error: {mean_absolute_percentage_error(y_train, y_pred_train):.2%}" )
print(f"Root mean squared error: {root_mean_squared_error(y_train, y_pred_train):.2f}")
# The coefficient of determination: 1 is perfect prediction
print(f"Coefficient of determination: {r2_score(y_train, y_pred_train):.2f}")

In [None]:
print("TESTING DATA")
print(f"Mean absolute error: {mean_absolute_error(y_test, y_pred_test):.2f}")
print(f"Mean absolute percentage error: {mean_absolute_percentage_error(y_test, y_pred_test):.2%}" )
print(f"Root mean squared error: {root_mean_squared_error(y_test, y_pred_test):.2f}")
# The coefficient of determination: 1 is perfect prediction
print(f"Coefficient of determination: {r2_score(y_test, y_pred_test):.2f}")

In [None]:
def plot_residuals(actual, predicted):
    residuals = actual - predicted

    plt.figure(figsize=(10, 5))
    plt.hist(residuals, bins=20)
    plt.axvline(x = 0, color = 'r')
    plt.xlabel('Residual')
    plt.ylabel('Frequency')
    plt.title('Distribution of Residuals')
    plt.show()

plot_residuals(y_test, y_pred_test)

In [None]:
def plot_actual_vs_predicted(actual, predicted):
    fig, ax = plt.subplots(figsize=(6, 6))

    ax.scatter(actual, predicted, color="black", alpha=0.05)
    ax.axline((1, 1), slope=1)
    plt.xlabel('True Values')
    plt.ylabel('Predicted Values')
    plt.title('True vs Predicted Values')
    plt.show()

plot_actual_vs_predicted(y_test, y_pred_test)

## Explainable AI

### Explore feature importance

#### Importance with MDI
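
As a starting point, here is a minimal sketch of reading impurity-based (MDI) importances from a fitted tree ensemble. It is self-contained, so it uses synthetic data and scikit-learn's `GradientBoostingRegressor` as a stand-in; the names `X_demo`, `y_demo` and `model_demo` are hypothetical and not part of the exercise. The notebook's `regr_xgb` exposes a `feature_importances_` attribute in the same way, although XGBoost's default importance type may not match scikit-learn's MDI exactly.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data (hypothetical): the target depends strongly on
# "strong", weakly on "weak", and not at all on "noise"
rng = np.random.default_rng(42)
n_rows = 500
X_demo = pd.DataFrame({
    "strong": rng.normal(size=n_rows),
    "weak": rng.normal(size=n_rows),
    "noise": rng.normal(size=n_rows),
})
y_demo = 5 * X_demo["strong"] + X_demo["weak"] + rng.normal(scale=0.1, size=n_rows)

model_demo = GradientBoostingRegressor(random_state=42)
model_demo.fit(X_demo, y_demo)

# feature_importances_ holds one impurity-based importance per column;
# pairing it with the column names makes it readable and sortable
mdi_importance = pd.Series(
    model_demo.feature_importances_, index=X_demo.columns, name="MDI"
).sort_values(ascending=False)
print(mdi_importance)
```

For the exercise, the same pattern would presumably be applied to `regr_xgb.feature_importances_` with `X.columns` as the index.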

#### Importance using PFI
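
The sketch below shows one way permutation feature importance could be computed with `permutation_importance` from `sklearn.inspection`, which the notebook already imports. It is self-contained, so the data is synthetic and the model is a scikit-learn `GradientBoostingRegressor` stand-in; `X_demo`, `y_demo`, `model_demo` and the split names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical)
rng = np.random.default_rng(42)
n_rows = 500
X_demo = pd.DataFrame({
    "strong": rng.normal(size=n_rows),
    "weak": rng.normal(size=n_rows),
    "noise": rng.normal(size=n_rows),
})
y_demo = 5 * X_demo["strong"] + X_demo["weak"] + rng.normal(scale=0.1, size=n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
model_demo = GradientBoostingRegressor(random_state=42).fit(X_tr, y_tr)

# Shuffle each column n_repeats times on held-out data and record how much
# the model's score (R^2 for regressors by default) drops each time
pfi_result = permutation_importance(
    model_demo, X_te, y_te, n_repeats=10, random_state=42
)

pfi_importance = pd.DataFrame(
    {
        "PFI mean": pfi_result.importances_mean,
        "PFI std": pfi_result.importances_std,
    },
    index=X_demo.columns,
).sort_values("PFI mean", ascending=False)
print(pfi_importance)
```

In the notebook, the equivalent call would presumably take `regr_xgb`, `X_test` and `y_test`. Scoring on held-out data is a design choice here: it measures how much the model relies on each feature when generalising, not when memorising.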

### PDPs + ICE Plots

### SHAP

Create the SHAP explainer and the shap values.

#### Return just the values

#### Feature Table

Create a table that shows feature importance for MDI, PFI and SHAP.

##### Display the top 10 features according to MDI, PFI and SHAP.
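
One possible way to build such a table is sketched below. The helper names `build_importance_table` and `top_n` are hypothetical, and the numbers are illustrative placeholders only, not results from the LoS model. SHAP is summarised as the mean absolute SHAP value per feature, which is a common (but not the only) global summary.

```python
import numpy as np
import pandas as pd

def build_importance_table(feature_names, mdi, pfi, shap_array):
    """Combine MDI, PFI and mean |SHAP| into one dataframe with per-method ranks."""
    table = pd.DataFrame(
        {
            "MDI": mdi,
            "PFI": pfi,
            "SHAP": np.abs(shap_array).mean(axis=0),
        },
        index=pd.Index(feature_names, name="feature"),
    )
    # Rank 1 = most important according to that method
    for col in ["MDI", "PFI", "SHAP"]:
        table[f"{col} rank"] = table[col].rank(ascending=False, method="min").astype(int)
    return table

def top_n(table, column, n=10):
    """Return the n most important features according to one method."""
    return table[column].sort_values(ascending=False).head(n)

# Illustrative placeholder numbers only (hypothetical, not model output)
features = ["a", "b", "c"]
mdi = np.array([0.7, 0.2, 0.1])
pfi = np.array([1.5, 0.1, 0.3])
shap_matrix = np.array([[2.0, -0.5, 0.1],
                        [-1.0, 0.5, 0.3]])

table = build_importance_table(features, mdi, pfi, shap_matrix)
print(table)
print(top_n(table, "PFI", n=2))
```

Keeping the ranks next to the raw scores makes it easier to spot features the three methods disagree on, since the raw scales are not comparable across methods.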

### SHAP Plots

#### Global: Beeswarm

Create a beeswarm plot.

#### Global: Bar

Create a bar plot.

#### Local: Waterfall Plots

Create a waterfall plot.

#### Local: Force Plots

#### Global: Force Plots

### Dependence Plots

#### Simple scatter of a single feature

Create a scatterplot of the respiration feature.

Colour the respiration plot by the most strongly interacting feature.

Colour the respiration plot by gender.

#### Scatter of multiple features

Create a dependence plot for two different features on the same plot (plotted side-by-side).