## How to apply Shapley values in practice outside a game theory setting

Suppose we are interested in how each feature affects the prediction for a given data point in a linear model. Consider $y, \epsilon \in \mathbb{R}^n$, $X \in \mathbb{R}^{n\times p}$, and $\beta \in \mathbb{R}^p$, and the following linear model, where $x_i^T$ denotes the $i$-th row of $X$:

$$y_i = x_i^T\beta + \epsilon_i, \quad i = 1, \dots, n$$

In words, the Shapley value $\phi_j$ of feature $j$ corresponds to how much $j$ contributes to the prediction of some data point compared to the average prediction for the dataset.

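Denote by $\hat f(x) = x^T\hat\beta$ the prediction function for a data point $x \in \mathbb{R}^p$, where $\hat\beta$ is the estimated coefficient vector.

Then, writing $x_j$ for the value of feature $j$ in $x$ and $E[X_j]$ for the mean of that feature,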

$$\phi_j(\hat f) = \hat\beta_jx_j - E[\hat\beta_jX_j] = \hat\beta_j(x_j - E[X_j])$$
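The closed form above can be checked numerically. The following is a minimal sketch, not the notebook's own code: the data are synthetic, the names `beta_true`, `beta_hat`, and `phi` are chosen here for illustration, and $E[X_j]$ is assumed to be estimated by the column mean of $X$. The model is fitted without an intercept, matching the linear model stated earlier.

```python
import numpy as np

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.normal(loc=[1.0, -2.0, 0.5], scale=[1.0, 2.0, 0.5], size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Least-squares estimate of beta (no intercept, as in the model above).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Shapley values for one data point x: beta_hat_j * (x_j - mean of feature j).
x = X[0]
phi = beta_hat * (x - X.mean(axis=0))

# The contributions should add up to "prediction minus mean prediction".
prediction = x @ beta_hat
mean_prediction = (X @ beta_hat).mean()
print(phi, phi.sum(), prediction - mean_prediction)
```

The last line prints the per-feature contributions together with their sum and the gap between the prediction and the mean prediction; for a linear model the last two numbers agree up to floating-point error.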

The Shapley value of a feature value is not the change in the prediction obtained by removing the feature and retraining the model. Its interpretation is: given the current set of feature values, the Shapley value of a feature value is its contribution to the difference between the actual prediction and the mean prediction.
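To make this interpretation concrete, the sketch below computes a Shapley value exactly by enumerating every coalition $S$ of the other features and weighting each marginal contribution by $|S|!\,(p-|S|-1)!/p!$. It is a sketch under stated assumptions rather than a reference implementation: the helper `exact_shapley` is a hypothetical name, and the value of a coalition is assumed to be the average prediction when the features in $S$ are fixed to their values in $x$ and the remaining features take the values found in each row of $X$.

```python
from itertools import combinations
from math import factorial

import numpy as np


def exact_shapley(predict, x, X, j):
    """Exact Shapley value of feature j for instance x (hypothetical helper)."""
    p = X.shape[1]
    others = [k for k in range(p) if k != j]

    def value(coalition):
        # Average prediction with the coalition's features fixed to x
        # and all other features taken from the rows of X.
        idx = np.array(coalition, dtype=int)
        Z = X.copy()
        Z[:, idx] = x[idx]
        return predict(Z).mean()

    phi = 0.0
    for size in range(p):  # coalition sizes 0 .. p-1
        weight = factorial(size) * factorial(p - size - 1) / factorial(p)
        for S in combinations(others, size):  # 2^(p-1) coalitions in total
            phi += weight * (value(S + (j,)) - value(S))
    return phi


# Synthetic linear model (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
beta_hat = np.array([1.5, -2.0, 0.5, 3.0])


def predict(A):
    return A @ beta_hat


x = X[0]
phi_exact = np.array([exact_shapley(predict, x, X, j) for j in range(4)])
phi_closed = beta_hat * (x - X.mean(axis=0))
print(np.round(phi_exact, 4))
print(np.round(phi_closed, 4))
```

For this linear model the enumeration reproduces $\hat\beta_j(x_j - E[X_j])$. The inner loop visits $2^{p-1}$ coalitions per feature.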

In general, exact calculation of Shapley values becomes intractable as more features are added, since the number of feature subsets, and with it the computation time, grows exponentially in $p$. A Monte Carlo approximation is available:

$$\phi_j \approx \frac{1}{M} \sum_{m=1}^M \left( \hat{f}(x^m_{+j}) - \hat{f}(x^m_{-j}) \right),$$

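where $\hat{f}(x^m_{+j})$ is the prediction for $x$ with a random subset of its feature values replaced by the corresponding values from a randomly drawn data point $z$, while the value of feature $j$ is kept from $x$. The instance $x^m_{-j}$ is identical to $x^m_{+j}$ except that the value of feature $j$ is also taken from $z$.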

Algorithm:

for $m = 1, \dots, M$ do:

1. Draw a random instance $z$ from the data matrix $X$.
2. Select a random subset of features $S \subseteq [p]$ with $j \notin S$.
3. Construct two new instances:
   1. With $j$ from $x$: $x_{+j}$, where the values of $x$ for all features in $S$ are replaced by the corresponding values in $z$.
   2. Without $j$ from $x$: $x_{-j}$, where the values of $x$ for all features in $S$, and also the value of feature $j$, are replaced by the corresponding values in $z$.
4. Compute the marginal contribution: $\phi_j^m = \hat{f}(x_{+j}) - \hat{f}(x_{-j})$

The estimated Shapley value is then $\phi_j = \frac{1}{M} \sum_{m=1}^M \phi_j^m$
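The algorithm above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the notebook's implementation: `shapley_monte_carlo` is a hypothetical helper name, the data and coefficients are synthetic, and the random subset $S$ is assumed to be drawn as the set of features that follow $j$ in a uniformly random permutation of the features, which is one common way to obtain the Shapley weighting.

```python
import numpy as np


def shapley_monte_carlo(predict, x, X, j, M=1000, seed=None):
    """Monte Carlo estimate of the Shapley value of feature j for instance x."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    contributions = np.empty(M)
    for m in range(M):
        # Step 1: draw a random instance z from the data matrix.
        z = X[rng.integers(n)]
        # Step 2: random subset S (features after j in a random permutation).
        order = rng.permutation(p)
        position = int(np.flatnonzero(order == j)[0])
        S = order[position + 1:]
        # Step 3.1: x_plus keeps feature j from x, features in S come from z.
        x_plus = x.copy()
        x_plus[S] = z[S]
        # Step 3.2: x_minus additionally takes feature j from z.
        x_minus = x_plus.copy()
        x_minus[j] = z[j]
        # Step 4: marginal contribution of feature j for this draw.
        contributions[m] = predict(x_plus[None, :])[0] - predict(x_minus[None, :])[0]
    return contributions.mean()


# Synthetic linear model (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
beta_hat = np.array([1.5, -2.0, 0.5, 3.0])


def predict(A):
    return A @ beta_hat


x = X[0]
phi_mc = np.array(
    [shapley_monte_carlo(predict, x, X, j, M=4000, seed=j) for j in range(4)]
)
phi_closed = beta_hat * (x - X.mean(axis=0))
print(np.round(phi_mc, 3))
print(np.round(phi_closed, 3))
```

For the linear model the estimates should land close to the closed form $\hat\beta_j(x_j - E[X_j])$, with the gap shrinking as $M$ grows. Any model exposing a batch `predict` function could be passed in place of the linear one.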

Let's try it out

In [4]:
from causalAssembly.models_fcm import FCM
from sympy import symbols, Eq
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

import numpy as np
import pandas as pd
