# Interpretability for XGBoost models

### Setup

In [None]:
import numpy as np
import boto3
import pickle as pkl    

s3 = boto3.resource('s3')
s3client = boto3.client('s3')

The following inputs are used by the visualizations in this notebook.

In [None]:
## model (XGBoost booster) file ## 

# s3client.download_file('BUCKET_NAME', 'MODEL_LOCATION', 'xgboost_model')
with open('xgboost_model', 'rb') as f:
    model = pkl.load(f)

In [None]:
## shap values ##

# s3client.download_file('BUCKET_NAME', 'SHAP_LOCATION', 'shap_values')
with open('shap_values', 'rb') as f:
    shap_values = pkl.load(f)

baseline_value = shap_values[0, -1]  # last column is the baseline
shap_values = np.delete(shap_values, -1, axis=1)  # remove the baseline column

# shapes are taken after dropping the baseline, so they count features only
N_ROWS, N_FEATURES = shap_values.shape

In [None]:
## feature_names ## 

feature_names = None
# if feature names are not available then create a dummy list
# (the baseline column was already removed, so every column is a feature)
if not feature_names: 
    feature_names = ['f{0:0>2d}'.format(i) for i in range(shap_values.shape[1])]
model.feature_names = feature_names

In [None]:
## reference dataset used for SHAP computation ## 

# s3client.download_file('BUCKET_NAME', 'DATA_LOCATION', 'shap_reference_data.csv')

# This dataset is assumed to *NOT* contain the output label 
# and have the same number of features as the shap_values
data = np.loadtxt('shap_reference_data.csv', delimiter=',')

## Tree plot

An XGBoost model consists of an ensemble of classification and regression trees (CART). Change the `NUM_TREE` value (the zero-based index of the target tree) to plot other trees.

In [None]:
import xgboost as xgb
import matplotlib.pyplot as plt 

NUM_TREE = 3

fig, ax = plt.subplots(figsize=(30, 30))
xgb.plot_tree(model, num_trees=NUM_TREE, ax=ax)
plt.show()

## XGBoost Feature importances

XGBoost provides multiple feature importance metrics to understand the influence of each feature on the model.

- *gain*: contribution of a feature to the model, measured by the improvement in the training objective (loss reduction) brought by splits on that feature ('total_gain' is the sum of the gain over all splits that use the feature, 'gain' is the average over those splits)

- *cover*: coverage of a feature, measured by the number of data points affected by splits involving the feature ('total_cover' is the sum of the coverage over all such splits, 'cover' is the average over those splits)

- *weight*: the number of times a feature is used to split the data across all trees of the model.
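
To make the relationship between these metrics concrete, here is a minimal sketch that recomputes them from a hand-written list of splits. The split records and their numbers are invented for illustration and do not come from a real model. Weight is taken here as the raw number of splits that use a feature.

```python
from collections import defaultdict

# Hypothetical split records: (feature, gain of the split, cover of the split).
# These values are invented for illustration only.
splits = [
    ("f00", 4.0, 100.0),
    ("f01", 1.0, 60.0),
    ("f00", 2.0, 40.0),
    ("f02", 0.5, 10.0),
]

total_gain = defaultdict(float)
total_cover = defaultdict(float)
weight = defaultdict(int)

for feature, split_gain, split_cover in splits:
    total_gain[feature] += split_gain    # sum of gain over the feature's splits
    total_cover[feature] += split_cover  # sum of cover over the feature's splits
    weight[feature] += 1                 # number of splits using the feature

# 'gain' and 'cover' are the per-split averages of the totals
gain = {f: total_gain[f] / weight[f] for f in weight}
cover = {f: total_cover[f] / weight[f] for f in weight}

for f in sorted(weight):
    print(f, weight[f], total_gain[f], gain[f], total_cover[f], cover[f])
```

A feature that is used in many low-gain splits can rank high by weight and low by gain, which is one reason to compare several metrics side by side.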

In [None]:
MAX_FEATURES = 10

fig = plt.figure(figsize=(18, 15))
fig.suptitle('Feature importances for the top {} features'.format(MAX_FEATURES), fontsize=24)

# 'importance_type' is used as the loop variable to avoid shadowing the built-in 'type'
for index, importance_type in enumerate(('gain', 'total_gain', 'cover', 'total_cover', 'weight'), start=1): 
    ax = fig.add_subplot(3, 2, index)
    xgb.plot_importance(model, importance_type=importance_type, title=importance_type,
                        ax=ax, grid=False, height=0.4, 
                        max_num_features=MAX_FEATURES, show_values=False)

## Explaining the model using SHAP

We use SHAP values to understand how each feature contributes to the model's predictions. SHAP (SHapley Additive exPlanations) by Lundberg and Lee (2017) is a method to
explain individual predictions and is based on the game-theoretically optimal
Shapley values.

A prediction can be explained by assuming that each feature value of the
instance is a “player” in a game where the prediction is the payout. Shapley
values, a method from coalitional game theory, tell us how to fairly
distribute the “payout” among the features.
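
As an illustration of this idea, the sketch below computes exact Shapley values for a small three-player game by averaging each player's marginal contribution over every order in which the players can join. The players and the payout table are hypothetical. This brute-force enumeration is only practical for a handful of players and is not how the `shap` library computes values for tree models.

```python
from itertools import permutations

players = ("A", "B", "C")

# Hypothetical payout for every coalition of players (invented numbers).
payout = {
    frozenset(): 0.0,
    frozenset("A"): 10.0,
    frozenset("B"): 20.0,
    frozenset("C"): 0.0,
    frozenset("AB"): 40.0,
    frozenset("AC"): 10.0,
    frozenset("BC"): 20.0,
    frozenset("ABC"): 40.0,
}

def shapley_values(players, payout):
    """Average each player's marginal contribution over all join orders."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # marginal contribution of p when joining the current coalition
            totals[p] += payout[with_p] - payout[coalition]
            coalition = with_p
    return {p: totals[p] / len(orders) for p in players}

phi = shapley_values(players, payout)
print(phi)
```

The resulting values add up to the payout of the full coalition minus the payout of the empty one, and player C, who never changes the payout of any coalition, receives a value of zero.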

Reference: "Explainable machine-learning predictions for the prevention of hypoxaemia during surgery", Nature Biomedical Engineering, 2018

------

**Be careful to interpret the Shapley value correctly**: The Shapley value is the
average contribution of a feature value to the prediction across different
coalitions. The Shapley value is NOT the difference in prediction that we would
see if we removed the feature from the model.
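
A small worked example can help keep the two quantities apart. In the hypothetical two-feature game below, the payout is 1 only when both features are present. Removing feature A from the full coalition drops the payout by 1, yet the Shapley value of A is only 0.5, because the credit is shared with B. All numbers are invented for illustration.

```python
# Hypothetical two-feature "game": the payout is 1 only when both
# features A and B are present, and 0 otherwise.
v_empty, v_a, v_b, v_ab = 0.0, 0.0, 0.0, 1.0

# With two players there are two join orders, so each Shapley value is the
# average of two marginal contributions.
shapley_a = 0.5 * (v_a - v_empty) + 0.5 * (v_ab - v_b)
shapley_b = 0.5 * (v_b - v_empty) + 0.5 * (v_ab - v_a)

# Drop in payout when A is removed from the full coalition.
removal_effect_a = v_ab - v_b

print(shapley_a, shapley_b, removal_effect_a)
```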

In [None]:
import shap
shap.initjs()

### SHAP summary

A global aggregation of the individual Shapley values gives the overall average contributions of the features.
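
The aggregation used by the bar chart is the mean absolute SHAP value per feature. The sketch below reproduces that calculation with NumPy on a small hypothetical SHAP matrix. The array `toy_shap` and the names in `toy_names` are made up and stand in for `shap_values` and `feature_names`.

```python
import numpy as np

# Hypothetical SHAP matrix: 4 samples x 3 features (baseline column removed).
toy_shap = np.array([
    [ 0.5, -0.1,  0.0],
    [-0.3,  0.2,  0.1],
    [ 0.4, -0.2,  0.0],
    [-0.4,  0.1, -0.1],
])
toy_names = ["f00", "f01", "f02"]

# Mean of |SHAP| over all samples, one value per feature.
mean_abs_shap = np.abs(toy_shap).mean(axis=0)

# Indices of the features, most important first.
ranking = np.argsort(-mean_abs_shap)

for i in ranking:
    print(toy_names[i], mean_abs_shap[i])
```

Absolute values are used because positive and negative contributions of the same feature would otherwise cancel each other out in the average.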

In [None]:
# the baseline column was already removed in the setup, so pass all columns
shap.summary_plot(shap_values, feature_names=feature_names, plot_type="bar")

The **summary plot** below gives more context than the bar chart of feature importances. It shows which features are most important and the range of their effects over the dataset. The color encodes the feature value, which lets us see how changes in a feature's value affect the model output.

In [None]:
# the baseline column was already removed in the setup, so shap_values matches data column for column
shap.summary_plot(shap_values, features=data, feature_names=feature_names)

### SHAP dependence plots

A **SHAP dependence plot** shows how the model output varies with a feature's value while also showing the interaction between two features. The feature used for coloring is chosen automatically to highlight what might be driving these interactions. The plot shows how the model depends on the given feature and can be considered a richer extension of the classical partial dependence plot. Vertical dispersion of the data points represents interaction effects. Grey ticks along the y-axis are data points where the feature’s value was NaN.

Here we plot the top-ranked features (ordered by mean absolute SHAP value over all the samples).

In [None]:
N_RANKS = 3
for r in range(N_RANKS):
    shap.dependence_plot("rank({})".format(r), 
                         shap_values, 
                         data, 
                         feature_names=feature_names,
                         interaction_index='auto')

### SHAP force plots

A **force plot** explanation shows how features are contributing to push the model output from the base value (the average model output over the dataset) to the actual prediction. Features pushing the prediction higher are shown in **red**, those pushing the prediction lower are in **blue**. 
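
The arithmetic behind a single force plot can be sketched with NumPy: the base value plus the sum of a row of SHAP values gives the model output for that row. The numbers below are hypothetical. For XGBoost classifiers the output is typically in the raw margin space in which the SHAP values were computed, not a probability.

```python
import numpy as np

# Hypothetical values for one sample; not taken from a real model.
toy_baseline = 0.25
toy_row = np.array([0.40, -0.15, 0.05, -0.10])  # SHAP values of 4 features

pushes_up = toy_row[toy_row > 0].sum()    # contributions drawn in red
pushes_down = toy_row[toy_row < 0].sum()  # contributions drawn in blue

# Additivity: base value plus all contributions gives the model output.
model_output = toy_baseline + toy_row.sum()

print(pushes_up, pushes_down, model_output)
```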

In [None]:
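N_ROWS = shap_values.shape[0]  # number of samples (rows), not features
N_SAMPLES = min(1000, N_ROWS)
# sample row indices without replacement so that no row is plotted twice
sampled_indices = np.random.choice(N_ROWS, size=N_SAMPLES, replace=False)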

In [None]:
shap.force_plot(baseline_value, 
                shap_values[sampled_indices, :], 
                data[sampled_indices, :], 
                feature_names=feature_names)

### Outliers

Outliers are extreme values that deviate from the other observations in the data. It's useful to understand the influence of the various features on outlier predictions to determine whether an outlier is a novelty, an experimental error, or a shortcoming of the model. 

Here we show force plots for prediction outliers on either side of the baseline value. 
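
One way to pick such outliers is to standardize the per-row sum of SHAP values and take the rows with the largest and smallest z-scores. The sketch below illustrates this on a small hypothetical array. The names `toy_shap_sum` and `k` are made up, and the z-score is computed directly with NumPy so that the example has no other dependencies.

```python
import numpy as np

# Hypothetical per-row sums of SHAP values (prediction minus baseline).
toy_shap_sum = np.array([0.1, -0.2, 3.0, 0.0, -2.5, 0.3, 0.2, -0.1])
k = 1  # number of outliers to pick on each side

# Population z-score (ddof=0), which is also the scipy.stats.zscore default.
toy_z = (toy_shap_sum - toy_shap_sum.mean()) / toy_shap_sum.std()

# argpartition places the k largest values in the last k positions and the
# k smallest in the first k positions, without fully sorting the array.
highest = np.argpartition(toy_z, -k)[-k:].tolist()
lowest = np.argpartition(toy_z, k)[:k].tolist()

print(highest, lowest)
```

Using `argpartition` instead of a full sort keeps the selection cheap on large datasets, at the cost of the selected indices not being ordered among themselves when `k` is greater than 1.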

In [None]:
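# top outliers
from scipy import stats
N_OUTLIERS = 3  # number of outliers on each side of the tail

# per-row sum of SHAP values = offset of the prediction from the baseline
shap_sum = np.sum(shap_values, axis=1)
z_scores = stats.zscore(shap_sum)
# rows with the N_OUTLIERS largest z-scores (far above the baseline)
outlier_indices = (np.argpartition(z_scores, -N_OUTLIERS)[-N_OUTLIERS:]).tolist()
# rows with the N_OUTLIERS smallest z-scores (far below the baseline)
outlier_indices += (np.argpartition(z_scores, N_OUTLIERS)[:N_OUTLIERS]).tolist()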

In [None]:
for outlier_index in outlier_indices: 
    shap.force_plot(baseline_value, 
                    shap_values[outlier_index, :], 
                    data[outlier_index, :], 
                    matplotlib=True, 
                    feature_names=feature_names) 