
Ahh, SHAP. As you know, it's become one of the leading frameworks for explaining ML model predictions. I'd guess its popularity is due to its appealing theoretical basis, its applicability to any type of ML model, and its easy-to-use Python package. SHAP promises to turn your black box ML model into a nice friendly interpretable model. The hilarious irony is that, when I first started using it in my work, SHAP itself was a complete black box to me. In this post, we'll change all that by diving into the SHAP paper, illuminating the key theoretical ideas behind its development step by step, and implementing it from scratch in Python. If you aren't already familiar with how to compute and interpret SHAP values in practice, I'd recommend that you go check out the [documentation for the shap Python package](https://shap.readthedocs.io/en/latest/index.html) before diving into this post.


![Snow, trees, and mountain overlook Lake Tahoe](shap-thumbnail.jpg "")





## What is SHAP?

SHAP (SHapley Additive exPlanations) is a conceptual framework, a set of computational methods, and a Python library for generating explanations of ML model predictions. The "SHAP" [backronym](https://en.wikipedia.org/wiki/Backronym) was introduced in [Lundberg and Lee 2017](https://arxiv.org/abs/1705.07874), which I'll call the _SHAP paper_. The paper expanded on several previously existing ideas, which we'll build up in the following sections.



* First, _Shapley values_, a concept from cooperative game theory which originally had nothing to do with machine learning.
* Next, _Shapley regression values_, which showed how to use Shapley values to generate explanations of model predictions.
* Finally, _Shapley sampling values_, which offered a computationally tractable way to compute Shapley regression values for any type of model.

The SHAP paper tied Shapley regression values and several other existing model explanation methods together by showing they are all members of a class called "additive feature attribution methods." Under the right conditions, these additive feature attribution methods can generate Shapley values, and when they do we can call them SHAP values.

After establishing this theoretical framework, the authors go on to discuss various computational methods for computing SHAP values; some are model-agnostic, meaning they work with any type of model, and others are model-specific, meaning they work for specific types of models. It turns out that the previously existing Shapley sampling values method is a model-agnostic approach; it's the most intuitive one, but computationally it's relatively inefficient.
Thus the authors propose a novel model-agnostic approach called Kernel SHAP, which is really just [LIME](https://lime-ml.readthedocs.io/en/latest/) parameterized to yield SHAP values.

Model-specific approaches can be potentially much more efficient than model-agnostic ones by taking advantage of model idiosyncrasies.
For example, there is an analytical solution for the SHAP values of linear models, so Linear SHAP is extremely efficient.
Similarly, Deep SHAP and Tree SHAP (proposed later in [Lundberg et al 2020](https://www.sciencedirect.com/science/article/pii/S2666827022000500#b20)) take advantage of idiosyncrasies of deep learning and tree-based models to compute SHAP values efficiently.

The important thing about these different methods is that they provide computationally tractable ways to compute SHAP values, but ultimately, they are all based on the original method: Shapley sampling values. Thus, for the remainder of this post, we'll focus on this method, building it up from Shapley values to Shapley regression values to Shapley sampling values and ultimately implementing it from scratch in Python.


## Shapley Values

The [Shapley value](https://en.wikipedia.org/wiki/Shapley_value) is named in honor of Nobel prize winning economist Lloyd Shapley, who introduced the idea in the field of coalitional game theory in the 1950s. Shapley proposed a way to determine how a coalition of players can fairly share the payout they receive from a cooperative game. We'll introduce the mathematical formalism in the next section, so for now let's just touch on the intuition for the approach. Essentially, the method distributes the payout among the players according to the expected contribution of each player across all possible combinations of the players. The thought experiment works as follows:



1. Draw a random permutation (ordering) of the players.
2. Have the first player play alone, generating some payout. Then have the first two players play together, generating some payout. Then the first three, and so on.
3. As each new player is added, attribute the change in the payout to this new player.
4. Repeat this experiment for all permutations of the players. A player's Shapley value is the average change in payout (across all permutations) when that player is added to the game.
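To make the thought experiment concrete, here's a minimal sketch that follows the four steps literally for a tiny three-player game. The players `A`, `B`, `C` and the payout numbers are made up purely for illustration; nothing here comes from the SHAP paper.

```python
from itertools import permutations

# Hypothetical 3-player game: the payout earned by each possible coalition.
# These numbers are invented for illustration only.
payout = {
    frozenset(): 0,
    frozenset("A"): 10,
    frozenset("B"): 20,
    frozenset("C"): 30,
    frozenset("AB"): 40,
    frozenset("AC"): 50,
    frozenset("BC"): 60,
    frozenset("ABC"): 90,
}

def shapley_values(players, payout):
    """Average each player's marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for ordering in orderings:
        coalition = frozenset()
        for player in ordering:
            # Add the next player and credit them with the change in payout.
            with_player = coalition | {player}
            totals[player] += payout[with_player] - payout[coalition]
            coalition = with_player
    return {p: total / len(orderings) for p, total in totals.items()}

values = shapley_values("ABC", payout)
print(values)                 # {'A': 20.0, 'B': 30.0, 'C': 40.0}
print(sum(values.values()))   # 90.0, the payout of the full coalition
```

Notice that the individual values add up to the payout of the full coalition, which is what makes this a way of sharing the payout.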

Next we'll see how this idea can be applied to model explanations.



## Shapley Regression Values

The next idea came from [Lipovetsky and Conklin 2001](https://onlinelibrary.wiley.com/doi/abs/10.1002/asmb.446), who proposed a way to use Shapley values to explain the predictions of a linear regression model. 
_Shapley regression values_ assign an importance value to each feature that represents the effect on the model prediction of including that feature. 
The basic idea is to train a second model without the feature of interest, and then to compare the predictions from the model with the feature and the model without the feature.
This procedure of training two models and comparing their predictions is repeated for all possible subsets of the other features; the average difference in predictions is the Shapley value for the feature of interest.

The Shapley value for feature $i$ on instance $x$ is given by equation 4 in the SHAP paper:

$$
\phi_i = \sum_{S \subseteq F \setminus \{i\}} 
\frac{|S|!(|F| - |S| - 1)!}{|F|!}
[f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S) ]
$$

where 

* $\phi_i$ is the Shapley value for feature of interest $i$,
* $F$ is the set of all features,
* $F \setminus \{i\}$ is the set of all features except the feature of interest,
* $S$ is a subset of features not including the feature of interest,
* $f_{S}$ is a model trained on the feature subset $S$,
* and $f_{S \cup \{i\}}$ is a model trained on the feature subset $S$ plus the feature of interest.

Let's break this down a bit to ensure we have the intuition.
The sum is indexed by $S \subseteq F \setminus \{i\}$, all subsets of the "other features".
Let's say we're predicting income from age, sex, and education. For feature of interest education, we need to consider the following subsets of the other features: [], [age], [sex], and [age, sex].
For each of these subsets, we compute the difference in prediction on a given instance $x$ between a model trained on the subset and a model trained on the subset plus the education feature.
The terms for our income example would look like this.

| $S$ | $ \frac{\lvert S\rvert !(\lvert F\rvert  - \lvert S\rvert  - 1)!}{\lvert F\rvert !} $ | $[f_{S \cup \{i\}}(x_{S \cup \{i\}}) - f_S(x_S) ]$ |
|---|---|---|
| [] | $ 0! (3 - 0 - 1)! / 3! = 1/3$ | $f_{\text{[educ]}}(x) - f_{\text{[]}}(x) $ |
| [age] | $ 1! (3 - 1 - 1)! / 3! = 1/6$ | $f_{\text{[age, educ]}}(x) - f_{\text{[age]}}(x) $ |
| [sex] | $ 1! (3 - 1 - 1)! / 3! = 1/6$ | $f_{\text{[sex,educ]}}(x) - f_{\text{[sex]}}(x) $ |
| [age,sex] | $ 2! (3 - 2 - 1)! / 3! = 1/3$ | $f_{\text{[age,sex,educ]}}(x) - f_{\text{[age,sex]}}(x) $ |

What's with the factor $|S|!(|F| - |S| - 1)! / |F|! $?
The keen reader will notice this factor kind of looks like the answers to those combinatorics questions like how many unique ways can you order the letters in the word MISSISSIPPI. 
The combinatorics connection is that Shapley values are defined in terms of all permutations of the players, where the included players come first, then the player of interest, followed by the excluded players. In ML models, the order of features doesn't matter, so we can work with unordered subsets of features, scaling the prediction difference terms by the number of permutations that involve the same sets of included and excluded features. With that in mind, we can see the factor gives us a weighted average over all feature combinations, where the numerator gives the number of permutations in which the included features come first, followed by the feature of interest, followed by the excluded features, and the denominator is the total number of feature permutations.
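If you'd like to convince yourself of the permutation-counting argument, here's a small sketch that computes the weights for the income example two ways: with the factorial formula, and by brute-force counting of orderings. The helper names `shapley_weight` and `count_orderings` are my own, introduced just for this check.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

def shapley_weight(subset_size: int, n_features: int) -> Fraction:
    """Weight |S|! (|F| - |S| - 1)! / |F|! as an exact fraction."""
    return Fraction(
        factorial(subset_size) * factorial(n_features - subset_size - 1),
        factorial(n_features),
    )

features = ["age", "sex", "educ"]
feature_of_interest = "educ"

def count_orderings(subset):
    """Count orderings in which exactly the features in `subset` precede educ."""
    count = 0
    for ordering in permutations(features):
        position = ordering.index(feature_of_interest)
        if set(ordering[:position]) == set(subset):
            count += 1
    return count

subsets = [(), ("age",), ("sex",), ("age", "sex")]
weights = {s: shapley_weight(len(s), len(features)) for s in subsets}
for s in subsets:
    # The formula agrees with the brute-force count divided by |F|!.
    assert weights[s] == Fraction(count_orderings(s), factorial(len(features)))
    print(s, weights[s])

print(sum(weights.values()))  # 1, so the sum really is a weighted average
```

The printed weights match the table above: 1/3, 1/6, 1/6, and 1/3.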

This Shapley regression value concept is the essence of SHAP. But there's a big problem with the formulation above. Namely, we are going to have to train a whole bunch of new subset models: one for each subset of the features. If our model has $M$ features, we'll have to train $2^M$ models, so this will get impractical in a hurry.
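To see what "retrain a model for every subset" looks like in practice, here's a rough sketch of Shapley regression values for an ordinary least squares model. Everything in it is an assumption for illustration: the synthetic data, the coefficients, and the helper names `fit_and_predict` and `shapley_regression_value`. It is not the procedure from Lipovetsky and Conklin's paper, just a direct reading of the formula above.

```python
from itertools import combinations
from math import factorial

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data, purely illustrative: 3 independent features, linear target.
n_rows, n_features = 200, 3
X = rng.normal(size=(n_rows, n_features))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=n_rows)

def fit_and_predict(subset, x):
    """Train an OLS model on the columns in `subset`, then predict for instance x."""
    design = np.column_stack([np.ones(n_rows)] + [X[:, j] for j in subset])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0] + sum(c * x[j] for c, j in zip(coef[1:], subset))

def shapley_regression_value(i, x):
    """Weighted average of prediction differences over all subsets without feature i."""
    others = [j for j in range(n_features) if j != i]
    phi = 0.0
    for size in range(len(others) + 1):
        weight = (factorial(size) * factorial(n_features - size - 1)
                  / factorial(n_features))
        for subset in combinations(others, size):
            # Two freshly trained models per subset: with and without feature i.
            phi += weight * (fit_and_predict(subset + (i,), x)
                             - fit_and_predict(subset, x))
    return phi

x = X[0]
phi = np.array([shapley_regression_value(i, x) for i in range(n_features)])
print(phi)
# The values sum to the full-model prediction minus the intercept-only prediction.
print(phi.sum(), fit_and_predict((0, 1, 2), x) - fit_and_predict((), x))
```

Even in this toy setting every single Shapley value requires fitting a model for each subset, which is fine for three features and hopeless for thirty.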


## Shapley Sampling Values

Next, [Štrumbelj and Kononenko 2014](https://link.springer.com/article/10.1007/s10115-013-0679-x) proposed _Shapley sampling values_, a method which provides a much more efficient way to approximate the subset models used to calculate Shapley regression values. In this approach, the effect of removing some features from the model is approximated by the conditional expectation of the model given the known features.

$$ f_S(x_S)  := E[f(x) | x_S]  $$ 

This means we're approximating the output of a subset model by averaging over outputs of the full model. That's great because now we don't have to train all those new subset models, we can just query our full model over some set of inputs and average over the outputs to compute these conditional expectation subset models.

Now how exactly do we compute that conditional expectation? First we rewrite the above conditional expectation (equation 10 in the SHAP paper)

$$ E[f(x) | x_S]  = E_{x_{\bar{S}}|x_S} [f(x)]$$ 

where $\bar{S}$ is the set of excluded or missing features.
Beside this equation in the paper they give the note "expectation over $x_{\bar{S}} | x_S$", which means we're taking the expectation over the missing features given the known features.
Then we get another step (equation 11)

$$E_{x_{\bar{S}}|x_S} [f(x)] \approx E_{x_{\bar{S}}} [f(x)]$$ 

Now it's not an equality but an approximation. The authors give the note "assume feature independence". 
The intuition here is that if the missing features are correlated with the known features, then their distribution depends on the particular values taken by the known features. But here the authors make the simplifying assumption that known and missing features are independent, which allows us to replace the conditional expectation with an unconditional expectation over the missing features.
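Here's a small numerical sketch of what the independence assumption costs us when it doesn't hold. The setup is entirely made up: two standard normal features with correlation `rho`, and a toy model `f` that just adds them. We compare the conditional expectation with the unconditional one when the first feature is known.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumption: two standard-normal features with correlation rho,
# and a toy model f(x) = x1 + x2.
rho = 0.8
n = 200_000
x1 = rng.normal(size=n)
x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)

def f(a, b):
    return a + b

known_x1 = 1.0

# Conditional expectation E[f(x) | x1 = 1]: x2 is drawn given the known x1.
x2_given_x1 = rho * known_x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
conditional = f(known_x1, x2_given_x1).mean()

# Equation 11's approximation: x2 is drawn from its marginal distribution,
# ignoring the known value of x1.
marginal = f(known_x1, x2).mean()

print(conditional)  # close to 1 + rho = 1.8
print(marginal)     # close to 1.0
```

With `rho = 0` the two numbers would agree up to sampling noise; the stronger the correlation, the further apart they drift.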

:::{.callout-note}
Is that a problem?  🤷‍♀️ Uh, maybe. Problem enough that people have worked out some ways to relax this assumption, e.g. [partition masking](https://shap-lrjball.readthedocs.io/en/latest/generated/shap.PartitionExplainer.html), but that makes Owen values, not Shapley values, and today we're talking about Shapley values.
:::

Anyway, how do we compute this unconditional expectation over the missing features in practice?
We'll need to use a so-called *background dataset*, which is just some set of observations of our feature variables that represents their distribution. A good candidate is the training data we used to train our model. Štrumbelj and Kononenko 2014 propose a way to estimate this expectation using resampling of the background dataset.

The idea is to notice that the instance of interest $x$ is a feature vector comprised of the set of "known" features $x_S$ and the set of excluded features $x_{\bar{S}}$ such that $x=\{x_S,x_{\bar{S}} \}$.
Our resampling scheme will be based on constructing Frankenstein samples $x^*=\{x_S,z_{\bar{S}} \}$ where $z_{\bar{S}}$ are values of the missing features drawn from some random observation in the background dataset.
We can then compute an estimate $\hat{f}_S(x)$ of the expectation $E_{x_{\bar{S}}}[f(x)]$ as

$$\hat{f}_S(x) = \frac{1}{n} \sum_{k=1}^n f(\{x_S, z_{\bar{S}}^{(k)} \}) $$

where $z_{\bar{S}}^{(k)}$ is the vector of values of the excluded features from the $k$-th row of the background dataset and $n$ is the number of rows.
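As a quick illustration of the Frankenstein-sample idea, here's a minimal sketch on a three-row background dataset. The function name `estimate_subset_model`, the toy `model`, and all the numbers are invented for this example.

```python
import numpy as np

def estimate_subset_model(model, x, subset, background):
    """Average model output over background rows with the `subset` features fixed to x."""
    frankenstein = background.copy()
    idx = np.asarray(list(subset), dtype=int)
    frankenstein[:, idx] = x[idx]  # overwrite the known features with x's values
    return model(frankenstein).mean()

# Toy model, purely illustrative: takes a 2D array, returns one output per row.
def model(X):
    return X[:, 0] * X[:, 1] + X[:, 2]

background = np.array([[1.0, 2.0, 3.0],
                       [2.0, 0.0, 1.0],
                       [0.0, 4.0, 5.0]])
x = np.array([10.0, 1.0, -1.0])

# Only feature 0 is known: rows become (10, 2, 3), (10, 0, 1), (10, 4, 5).
print(estimate_subset_model(model, x, [0], background))        # 23.0
# No features known: plain average of the model over the background rows.
print(estimate_subset_model(model, x, [], background))
# All features known: every row equals x, so we get the model's output at x.
print(estimate_subset_model(model, x, [0, 1, 2], background))  # 9.0
```

The two extremes are worth remembering: the empty subset gives the average prediction over the background data, and the full subset gives the prediction for the instance itself.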
With this method of estimating the subset models, we can now adapt the Shapley regression value formula to estimate the Shapley value for the $i$-th feature on instance $x$

$$
\phi_i = \sum_{S \subseteq F \setminus \{i\}} 
\frac{|S|!(|F| - |S| - 1)!}{|F|!}
[\hat{f}_{S \cup \{i\}}(x_{S \cup \{i\}}) - \hat{f}_S(x_S) ]
$$

Now we've got everything we need to compute Shapley sampling values. And according to the SHAP paper: "if we assume feature independence when approximating conditional expectations (Equation 11) … then SHAP values can be estimated directly using the Shapley sampling values method." Nice, that means we're ready to implement an algorithm to compute SHAP values.


## How to Implement SHAP from Scratch

Here's my implementation of a class that can compute SHAP values using the Shapley sampling values method.  We'll talk through it after the code.


In [1]:
import numpy as np 
from typing import Callable, Iterable, Optional
from math import factorial
from itertools import chain, combinations

class ShapExplainerFromScratch():
    def __init__(self,
                 model: Callable[[np.ndarray], np.ndarray], 
                 background_dataset: np.ndarray,
                 max_samples: Optional[int] = None):
        self.model = model
        if max_samples:
            max_samples = min(max_samples, background_dataset.shape[0]) 
            rng = np.random.default_rng()
            self.background_dataset = rng.choice(background_dataset, size=max_samples, replace=False, axis=0)
        else:
            self.background_dataset = background_dataset

    def shap_values(self, X: np.ndarray) -> np.ndarray:
        "SHAP values for each row of a 2D array of instances"
        shap_values = np.empty(X.shape)
        for i in range(X.shape[0]):
            for j in range(X.shape[1]):
                shap_values[i, j] = self._compute_single_shap_value(X[i, :], j)
        return shap_values
       
    def _compute_single_shap_value(self, 
                                   instance: np.ndarray,
                                   feature: int) -> float:
        "Compute a single SHAP value (equation 4)"
        other_features = [j for j in range(len(instance)) if j != feature]
        other_feature_subsets = self._get_all_subsets(other_features)
        shap_value = 0
        for other_feature_subset in other_feature_subsets:
            prediction_without_feature = self._subset_model(other_feature_subset, instance)
            prediction_with_feature = self._subset_model(other_feature_subset + (feature,), instance)
            multiplier = (
                factorial(len(other_feature_subset)) 
                * factorial(len(instance) - len(other_feature_subset) - 1) 
                / factorial(len(instance))
            )
            shap_value += multiplier * (prediction_with_feature - prediction_without_feature)
        return shap_value
    
    def _get_all_subsets(self, items: list) -> Iterable:
        "_get_all_subsets([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
        return chain.from_iterable(combinations(items, r) for r in range(len(items)+1))
    
    def _subset_model(self, feature_subset: tuple[int, ...], instance: np.ndarray) -> float:
        "subset model prediction f_S(x) for feature subset S on single instance x (Eq 11)"
        masked_background_dataset = self.background_dataset.copy()
        for j in range(masked_background_dataset.shape[1]):
            if j in feature_subset:
                masked_background_dataset[:, j] = instance[j]
        conditional_expectation_of_model = self.model(masked_background_dataset).mean()
        return conditional_expectation_of_model          

The `ShapExplainerFromScratch` API is similar to that of the [`KernelExplainer`](https://shap-lrjball.readthedocs.io/en/latest/generated/shap.KernelExplainer.html) from the Python library, taking two required arguments during instantiation:

* `model`: "User supplied function that takes a matrix of samples (# samples x # features) and computes the output of the model for those samples." That means if our model is a scikit-learn model, we'll need to pass in its predict method, not the model object itself.
* `background_dataset`: "The background dataset to use for integrating out features." We know about this idea from the Shapley sampling values section above; a good choice for this data could be the training dataset we used to fit the model. By default, we'll use all the rows of this background dataset, but we'll also implement the ability to sample down to the desired number of rows with an argument called `max_samples`.

Like the `KernelExplainer`, this class has a method called `shap_values` which estimates the SHAP values for a set of instances. It takes an argument `X` which is "a matrix of samples (# samples x # features) on which to explain the model’s output."
This `shap_values` method just loops through each feature value of each instance of the input samples `X` and calls an internal method named `_compute_single_shap_value` to compute each SHAP value. 

The `_compute_single_shap_value` method is the real workhorse of the class. It implements equation 4 from the SHAP paper as described in the Shapley regression values section above.
This method takes two arguments: a single instance (a 1D array of feature values) and the index of the feature of interest (the one we want the SHAP value for).
Equation 4 is a sum over all subsets of the model features (excluding the feature of interest), so we need to loop over each feature subset and compute the summand for each one.
We'll loop over an iterable named `other_feature_subsets` which yields each possible subset of features excluding the feature of interest.
Before starting the loop, we initialize our shap value to zero; inside the loop, for each subset, we compute the difference in the predictions of the two subset models, one with the feature of interest and the other without. 
The predictions of these subset models are generated by an internal method named `_subset_model` which takes the subset and the instance and returns the prediction.
We then compute the summand for the current subset and add it to the current SHAP value.
Once we finish the loop, we've got the shap value for the current instance and feature of interest.

The `_subset_model` method implements Equation 11 from the SHAP paper, taking the feature subset and an instance and returning the subset model prediction for the given instance.
To do this it creates a "masked" background dataset full of Frankenstein samples by copying the (possibly down-sampled) background dataset and replacing values of the features in the subset with the known values from the instance. It then generates predictions for each row of this "masked" background dataset, and returns the average prediction over all rows.
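One way to sanity check this logic without any external library is to use the fact that linear models have an analytical solution. Under the feature independence assumption, the SHAP value of feature $j$ for a linear model is its coefficient times the difference between the instance's value and the feature's mean in the background data. Here's a compact standalone sketch of the same algorithm as a single function, checked against that closed form. The function name `exact_shapley_sampling_values`, the random data, and the coefficients are all made up for this check.

```python
from itertools import combinations
from math import factorial

import numpy as np

def exact_shapley_sampling_values(model, x, background):
    """Shapley sampling values for one instance, enumerating all subsets."""
    n_features = len(x)

    def subset_model(subset):
        # Fix the known features to x's values and average over the background.
        masked = background.copy()
        idx = np.asarray(subset, dtype=int)
        masked[:, idx] = x[idx]
        return model(masked).mean()

    phi = np.zeros(n_features)
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(n_features):
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                phi[i] += weight * (subset_model(subset + (i,)) - subset_model(subset))
    return phi

rng = np.random.default_rng(0)
background = rng.normal(size=(50, 4))
x = rng.normal(size=4)

# Hypothetical linear model with made-up coefficients.
w, b = np.array([1.5, -2.0, 0.0, 3.0]), 0.7
def linear_model(X):
    return X @ w + b

phi = exact_shapley_sampling_values(linear_model, x, background)

# Closed form for a linear model: w_j * (x_j - mean of feature j in the background).
print(np.allclose(phi, w * (x - background.mean(axis=0))))
# Additivity: the values sum to f(x) minus the average background prediction.
print(np.isclose(phi.sum(),
                 linear_model(x[None, :])[0] - linear_model(background).mean()))
```

Both checks should print `True`. The second one is the "additive" part of additive feature attribution: the SHAP values account for the whole gap between this prediction and the average prediction.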


## Testing the Implementation

Let's check our work by comparing SHAP values computed by our implementation with those from the SHAP Python library.
We'll use our old friend the diabetes dataset, training a linear model, a gradient boosting tree, and a support vector machine—just for the hell of it 😅. 


In [2]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR 
from sklearn.ensemble import GradientBoostingRegressor

X, y = load_diabetes(as_frame=False, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state=42)

gbt_model = GradientBoostingRegressor()
gbt_model.fit(X_train, y_train);
svm_model = SVR() 
svm_model.fit(X_train, y_train);
linear_model = LinearRegression()
linear_model.fit(X_train, y_train);

Here's a little function to compare the SHAP values generated by our implementation and those from the library `KernelExplainer`.

In [4]:
import shap

def compare_methods(model, X_background, X_instances):
        
    library_explainer = shap.KernelExplainer(model.predict, X_background)
    library_shap_values = library_explainer.shap_values(X_instances)

    from_scratch_explainer = ShapExplainerFromScratch(model.predict, X_background)
    from_scratch_shap_values = from_scratch_explainer.shap_values(X_instances)

    return np.allclose(library_shap_values, from_scratch_shap_values)

In [5]:
compare_methods(linear_model, 
                X_background=X_train[:100, :], 
                X_instances=X_test[:5, :])

True

In [6]:
compare_methods(svm_model, 
                X_background=X_train[:100, :], 
                X_instances=X_test[:5, :])

True

In [7]:
compare_methods(gbt_model, 
                X_background=X_train[:100, :], 
                X_instances=X_test[:5, :])

True

Beautiful! Our implementation is consistent with the SHAP library explainer!

## Wrapping Up

Well I hope this one was helpful to you. The research phase actually took me a lot longer than I expected; it just took me a while to figure out what SHAP really is and how those different ideas and papers fit together. I thought the implementation itself was pretty fun and relatively easy. What do you think? 

## References

* [The SHAP Paper (Lundberg and Lee, 2017)](https://arxiv.org/abs/1705.07874)
* [Interpretable Machine Learning by Christoph Molnar](https://christophm.github.io/interpretable-ml-book/)