

## Mean Decrease in Impurity (MDI)

**What is it?**  
The default “feature importance” metric in decision‑tree and tree‑ensemble libraries; in scikit‑learn it is exposed as `feature_importances_` on estimators such as `RandomForestClassifier` and `GradientBoostingClassifier`.  
> “Across all splits using feature _j_, how much did those splits reduce impurity?”

---

### 1. Node impurity
- **Classification**  
  - Gini:  
    $G = 1 - \sum_{k=1}^K p_k^2$  
  - Entropy:  
    $H = -\sum_{k=1}^K p_k \log p_k$
- **Regression**  
  - Mean squared error:  
    $I = \tfrac{1}{n}\sum_i (y_i - \bar y)^2$
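
As a minimal sketch of the three impurity measures above, the following uses only the standard library. The function names and the toy label lists are illustrative assumptions, not part of any library API.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Entropy in nats: -sum_k p_k log p_k (only classes present are counted)."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mse(values):
    """Regression impurity: mean squared deviation from the node mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


balanced = ["a", "a", "b", "b"]   # maximally mixed two-class node
pure = ["a", "a", "a", "a"]       # single-class node

print(gini(balanced), gini(pure))   # 0.5 for the mixed node, 0.0 for the pure one
print(entropy(balanced))            # log(2), about 0.693
print(mse([1.0, 2.0, 3.0]))         # 2/3
```

A pure node has zero impurity under all three measures, which is why splits that produce purer children are rewarded.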

---

### 2. Impurity decrease for one split
For node _t_ with _nₜ_ samples and impurity $I(t)$, split into left _L_ and right _R_:

$$
\Delta I(t,j) = I(t) - \Bigl(\tfrac{n_L}{n_t}I(L) + \tfrac{n_R}{n_t}I(R)\Bigr)
$$
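
The split formula can be checked by hand on a tiny node. This is a sketch under the assumption of Gini impurity and made-up label lists; `impurity_decrease` is a hypothetical helper name.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def impurity_decrease(parent, left, right, impurity=gini):
    """I(t) minus the sample-weighted impurity of the two children."""
    n_t = len(parent)
    children = (len(left) / n_t) * impurity(left) + (len(right) / n_t) * impurity(right)
    return impurity(parent) - children


parent = [0, 0, 0, 1, 1, 1]
perfect = impurity_decrease(parent, [0, 0, 0], [1, 1, 1])    # children are pure
useless = impurity_decrease(parent, [0, 1], [0, 0, 1, 1])    # children as mixed as the parent
partial = impurity_decrease(parent, [0, 0, 0, 1], [1, 1])    # one pure child, one mixed

print(perfect, useless, partial)   # about 0.5, 0.0 and 0.25
```

A split that leaves both children as mixed as the parent earns no decrease, while a perfect split recovers the parent's entire impurity.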

---

### 3. From one tree to many
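- **Single tree** (each split is weighted by the fraction of training samples reaching its node, with $n$ the total number of samples):  
  $\text{MDI}_j = \sum_{t:\,\text{feat}(t)=j} \tfrac{n_t}{n}\,\Delta I(t,j)$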

- **Ensemble of M trees**:  
  $$
  \frac{1}{M}\sum_{m=1}^M \text{MDI}_j^{(m)}
  $$  
  (Often normalized so $\sum_j \text{MDI}_j = 1$)
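
To make the aggregation concrete, here is a sketch on a hand-built two-feature tree. The node layout (a dict of class counts and child ids) is purely hypothetical. Each split's decrease is weighted by the fraction of samples reaching the node, which is the convention scikit-learn follows, and each tree is normalized before averaging.

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def tree_mdi(tree, n_features):
    """Sample-weighted impurity decrease per feature, normalized to sum to 1."""
    n_root = sum(tree[0]["counts"])
    mdi = [0.0] * n_features
    for node in tree.values():
        if node["feature"] is None:          # leaf: no split, no contribution
            continue
        left, right = tree[node["left"]], tree[node["right"]]
        n_t, n_l, n_r = (sum(m["counts"]) for m in (node, left, right))
        decrease = (gini(node["counts"])
                    - (n_l / n_t) * gini(left["counts"])
                    - (n_r / n_t) * gini(right["counts"]))
        mdi[node["feature"]] += (n_t / n_root) * decrease
    total = sum(mdi)
    return [v / total for v in mdi] if total > 0 else mdi


def forest_mdi(trees, n_features):
    """Average the per-tree importances over the ensemble."""
    per_tree = [tree_mdi(t, n_features) for t in trees]
    return [sum(col) / len(trees) for col in zip(*per_tree)]


# Hypothetical tree: root splits on feature 0, its right child splits on feature 1.
tree_a = {
    0: {"feature": 0, "counts": [50, 50], "left": 1, "right": 2},
    1: {"feature": None, "counts": [40, 0]},
    2: {"feature": 1, "counts": [10, 50], "left": 3, "right": 4},
    3: {"feature": None, "counts": [0, 30]},
    4: {"feature": None, "counts": [10, 20]},
}
# Second tree: same shape with the two features swapped.
tree_b = {k: {**v, "feature": None if v["feature"] is None else 1 - v["feature"]}
          for k, v in tree_a.items()}

print(tree_mdi(tree_a, 2))               # about [0.909, 0.091]
print(forest_mdi([tree_a, tree_b], 2))   # about [0.5, 0.5]
```

The root split dominates because it acts on all 100 samples, while the deeper split only sees 60 of them.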

---

### 4. Key caveats
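1. **Bias** toward high‑cardinality/continuous features: they offer more candidate split points, so even an uninformative one can win splits by chance  
2. **Dilutes** correlated features: importance is shared among them, so each can look weaker than it is  
3. **Model‑specific**, not causal; it is computed on training data, so it can reward features the model overfit  
4. **Global only** (no sample‑level insight)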
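
The first caveat can be illustrated with a small simulation. This is a sketch under simplifying assumptions: binary labels that are pure noise, one noise feature with two distinct values and one with many, and only the best single split considered rather than a full tree. All names are illustrative.

```python
import random


def gini(labels):
    p1 = sum(labels) / len(labels)
    return 2.0 * p1 * (1.0 - p1)   # binary-label Gini: 1 - p0^2 - p1^2


def best_split_decrease(feature, labels):
    """Largest Gini decrease over all thresholds between distinct feature values."""
    pairs = sorted(zip(feature, labels))
    ys = [y for _, y in pairs]
    n, parent = len(ys), gini(ys)
    best = 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue               # no threshold separates equal values
        decrease = parent - (i / n) * gini(ys[:i]) - ((n - i) / n) * gini(ys[i:])
        best = max(best, decrease)
    return best


rng = random.Random(0)
n, trials = 200, 50
binary_gain = continuous_gain = 0.0
for _ in range(trials):
    y = [rng.randint(0, 1) for _ in range(n)]       # labels unrelated to any feature
    x_bin = [rng.randint(0, 1) for _ in range(n)]   # noise feature, 2 distinct values
    x_cont = [rng.random() for _ in range(n)]       # noise feature, about n distinct values
    binary_gain += best_split_decrease(x_bin, y) / trials
    continuous_gain += best_split_decrease(x_cont, y) / trials

print(binary_gain, continuous_gain)
```

Neither feature carries signal, yet the continuous one achieves a larger average decrease simply because it has many thresholds to choose from.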

---

### 5. Quick recipe
1. Train your tree ensemble.  
2. Sum the sample‑weighted impurity decreases per feature across all splits, then average over trees.  
3. (Optionally) normalize to sum to 1.



In [1]:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Load data
X, y = load_iris(return_X_y=True)
feature_names = load_iris().feature_names

# Train
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Extract MDI importances
importances = rf.feature_importances_
df_imp = pd.DataFrame({
    'feature': feature_names,
    'MDI_importance': importances
}).sort_values('MDI_importance', ascending=False)

print(df_imp)

             feature  MDI_importance
2  petal length (cm)        0.436130
3   petal width (cm)        0.436065
0  sepal length (cm)        0.106128
1   sepal width (cm)        0.021678




## Mean Decrease in Accuracy (MDA) / Permutation Importance

**Idea:** “How much does model performance drop if I break feature $j$?”  
> **$ \text{Importance}_j = \text{Baseline score} - \text{Permuted score} $**

---

### 1. Quick Recipe
1. **Choose metric** (e.g. accuracy, AUC, $R^2$; for error metrics such as MAE, negate the score so that higher is better)  
2. **Baseline**: evaluate model on held‑out data → $ \text{score}_0 $  
3. **For each feature $j$**  
   - Shuffle column $j$ in $X_{\text{val}}$  
   - Recompute $ \text{score}_j $  
   - $ \text{Drop}_j = \text{score}_0 - \text{score}_j $  
4. **Repeat** the shuffle $n$ times with different random permutations and average the drops to reduce noise  
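
The recipe above can be sketched from scratch in a few lines. This is not scikit-learn's implementation; the rule-based `model`, the synthetic data, and the helper name `permutation_drops` are assumptions made for illustration.

```python
import random


def accuracy(model, X, y):
    return sum(model(row) == target for row, target in zip(X, y)) / len(y)


def permutation_drops(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each column of X is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    drops = []
    for j in range(len(X[0])):
        total = 0.0
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)                      # break the link between column j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            total += baseline - accuracy(model, X_perm, y)
        drops.append(total / n_repeats)
    return drops


def model(row):
    """Hypothetical fitted model: it only ever looks at feature 0."""
    return int(row[0] > 0.5)


data_rng = random.Random(42)
X_val = [[data_rng.random(), data_rng.random()] for _ in range(500)]
y_val = [int(row[0] > 0.5) for row in X_val]

drops = permutation_drops(model, X_val, y_val)
print(drops)
```

Shuffling feature 0 costs roughly half the accuracy, while shuffling feature 1 costs nothing because the model never uses it.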



In [3]:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
import pandas as pd

# 1. Data split
X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# 2. Train
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 3. Permutation importance
results = permutation_importance(
    model, X_val, y_val,
    n_repeats=10,
    random_state=42,
    scoring='accuracy'
)
imp_df = pd.DataFrame({
    'feature': load_iris().feature_names,
    'mean_drop': results.importances_mean,
    'std_drop': results.importances_std
}).sort_values('mean_drop', ascending=False)

print(imp_df)


             feature  mean_drop  std_drop
3   petal width (cm)   0.175556  0.036447
2  petal length (cm)   0.144444  0.038809
1   sepal width (cm)   0.000000  0.000000
0  sepal length (cm)   0.000000  0.000000



## Local Surrogate Methods (e.g. LIME)

**Key idea:** Fit a simple interpretable model in a small neighborhood around one instance to explain its prediction.

---

### 1. Quick Recipe
1. **Select** instance $x_0$.  
2. **Generate** $N$ perturbed samples $\{x^{(i)}\}$ near $x_0$.  
3. **Predict** black‑box outputs $y^{(i)} = f(x^{(i)})$.  
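4. **Weight** each sample by its proximity to $x_0$, e.g. with an exponential kernel $w_i = \exp\bigl(-\lVert x^{(i)} - x_0\rVert^2 / \sigma^2\bigr)$.  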
5. **Fit** weighted simple model $g$ (e.g. sparse linear):  
   $$
   \min_g \sum_i w_i\,(y^{(i)} - g(x^{(i)}))^2 + \Omega(g)
   $$
6. **Read off** $g$’s coefficients or rules as local explanations.
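
The six steps can be sketched for a one-dimensional black box, where weighted least squares has a closed form. This is a simplified illustration, not the `lime` package's algorithm: the regularizer $\Omega(g)$ is omitted, and `black_box`, the sampling scale, and the kernel width are all assumed values.

```python
import math
import random


def black_box(x):
    return x ** 2   # stand-in for an opaque model f


def local_linear_surrogate(f, x0, n_samples=2000, scale=1.0, kernel_width=0.5, seed=0):
    """Fit g(x) = a + b*x by weighted least squares around x0; return (a, b)."""
    rng = random.Random(seed)
    xs = [x0 + rng.gauss(0.0, scale) for _ in range(n_samples)]    # 2. perturb
    ys = [f(x) for x in xs]                                         # 3. predict
    ws = [math.exp(-((x - x0) ** 2) / kernel_width ** 2) for x in xs]  # 4. weight
    w_sum = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / w_sum
    y_bar = sum(w * y for w, y in zip(ws, ys)) / w_sum
    cov = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    var = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    slope = cov / var                                               # 5. fit
    return y_bar - slope * x_bar, slope


intercept, slope = local_linear_surrogate(black_box, x0=3.0)        # 1. select x0
print(slope)   # 6. read off: close to the true local gradient 2 * x0 = 6
```

The surrogate's slope approximates the derivative of the black box at $x_0$, which is the sense in which the explanation is local: a different $x_0$ gives a different slope.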


In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

# 1. Load & split
X, y = load_iris(return_X_y=True)
feature_names = load_iris().feature_names
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Train black‑box
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 3. Explain one instance
i = 0
explainer = LimeTabularExplainer(
    X_train,
    feature_names=feature_names,
    class_names=load_iris().target_names,
    discretize_continuous=True
)
exp = explainer.explain_instance(
    X_test[i],
    model.predict_proba,
    num_features=4
)

# 4. Show local feature weights (as_list() reports class index 1 by default)
print(exp.as_list())

## Intuition: SHAP as “Fair Credit Assignment”

Imagine you and your friends jointly win a \$100 prize and want to split it fairly according to each person’s contribution. Shapley values do exactly that for features: they ask, “How much extra value does feature $j$ bring when added to a coalition of other features, averaged over all possible coalitions?”

1. **Baseline**  
   Think of the baseline as the prize if no one contributed:  
   $$
   \phi_0 = \mathbb E[f(X)].
   $$
2. **Marginal Contribution**  
   For each feature $j$, consider every possible group $S$ of the other features:
   - Measure $f_{S\cup\{j\}}(x)$: prediction when you “add” feature $j$ to $S$.
   - Subtract $f_S(x)$: prediction with only $S$.
   - That difference is $j$’s marginal contribution in coalition $S$.
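3. **Weighted Average**  
   With $d$ features in total, weight each coalition $S$ by the fraction of feature orderings in which exactly the members of $S$ come before $j$, then sum the weighted marginal contributions:
   $$
   \phi_j = \sum_{S \subseteq \{1,\dots,d\}\setminus\{j\}} \frac{|S|!\,(d-|S|-1)!}{d!}\,\bigl(f_{S\cup\{j\}}(x) - f_S(x)\bigr).
   $$
   These weights ensure symmetry and efficiency: $\phi_0 + \sum_j \phi_j = f(x)$.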

The result $\phi_j$ tells you, on average, how much feature $j$ increased (or decreased) the prediction relative to the baseline.
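
For a handful of features, the definition can be evaluated exactly by enumerating every coalition. The sketch below makes one simplifying assumption that should be read loudly: an "absent" feature is replaced by a single fixed baseline value rather than averaged over a data distribution. The model `f` and its inputs are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, baseline):
    """Exact Shapley values; absent features take their baseline value."""
    d = len(x)

    def value(S):
        # Prediction with features in S taken from x and the rest from the baseline.
        return f([x[i] if i in S else baseline[i] for i in range(d)])

    phi = []
    for j in range(d):
        others = [i for i in range(d) if i != j]
        total = 0.0
        for size in range(d):
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for S in combinations(others, size):
                total += weight * (value(set(S) | {j}) - value(set(S)))
        phi.append(total)
    return phi


def f(v):
    return 2 * v[0] + v[1] * v[2]   # hypothetical model with an interaction term


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(f, x, baseline)
print(phi)                           # about [2.0, 3.0, 3.0]
print(sum(phi), f(x) - f(baseline))  # both about 8.0 (efficiency)
```

The interaction term $x_2 x_3$ contributes 6 in total and is split evenly between its two features, which is the symmetry property at work. Enumeration costs $O(2^d)$ model evaluations per feature, so it is only feasible for small $d$.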

---

## Step‐by‐Step Breakdown

1. **Choose background** $B$  
   A set of “reference” samples to estimate $\mathbb E[f(X)]$ and $f_S(x)$.
2. **Simplify inputs**  
   Encode presence/absence of each feature as a binary vector $z'\in\{0,1\}^d$.
3. **Define surrogate**  
   $$
   g(z') = \phi_0 + \sum_{j=1}^d \phi_j\,z'_j.
   $$
4. **Fit surrogate** by minimizing weighted loss  
   $$
   \min_{\phi}\; \sum_{z'} \pi_x(z')\,\bigl(f(h_x(z')) - g(z')\bigr)^2,
   $$
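   where $h_x(z')$ fills in original feature values when $z'_j=1$ and samples from $B$ when $z'_j=0$, and
   $$
   \pi_x(z') = \frac{(d-1)}{\binom{d}{|z'|}\,|z'|\,(d-|z'|)}.
   $$
   Here $|z'|$ is the number of present features. The weight is infinite for $|z'|=0$ and $|z'|=d$, so those two coalitions are enforced as constraints: $g(\mathbf 0)=\phi_0$ and $g(\mathbf 1)=f(x)$.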
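
Two ingredients of this procedure are easy to sketch on their own: the kernel weight $\pi_x(z')$ and the mapping $h_x(z')$ averaged over a background set. The helper names, the linear model, and the two-row background below are assumptions for illustration; the weighted regression itself is not shown.

```python
from math import comb


def shapley_kernel(d, k):
    """pi_x(z') for a coalition with k of d features present (0 < k < d)."""
    if k == 0 or k == d:
        raise ValueError("weight is infinite for the empty and full coalitions")
    return (d - 1) / (comb(d, k) * k * (d - k))


def masked_prediction(f, x, z, background):
    """Average f(h_x(z')) over the background: keep x_j where z_j = 1,
    otherwise take the value from the background row."""
    preds = []
    for row in background:
        hybrid = [xj if zj else bj for xj, zj, bj in zip(x, z, row)]
        preds.append(f(hybrid))
    return sum(preds) / len(preds)


def f(v):
    return 10 + 2 * v[0] + 3 * v[1]   # hypothetical linear model


x = [1, 2]
background = [[-1, 0], [1, 0]]        # assumed background with zero feature means

print([shapley_kernel(4, k) for k in (1, 2, 3)])   # [0.25, 0.125, 0.25]
print(masked_prediction(f, x, [0, 0], background))  # 10.0
print(masked_prediction(f, x, [1, 0], background))  # 12.0
print(masked_prediction(f, x, [0, 1], background))  # 16.0
```

The kernel puts the most weight on very small and very large coalitions, since those isolate individual feature effects most directly.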

---

## Toy Example (2 Features)

Model: $$f(x) = 10 + 2x_1 + 3x_2.$$

Instance: $x=(x_1=1,\;x_2=2)$.

- Baseline: $\mathbb E[f(X)] = 10$, assuming the background data is centered so that $\mathbb E[x_1]=\mathbb E[x_2]=0$.
- Coalitions for feature 1:
  - $S=\varnothing$: $f_{\{1\}}(x)=10+2\cdot1=12$, $f_{\varnothing}(x)=10$ → marginal = $2$.
  - $S=\{2\}$: $f_{\{1,2\}}(x)=10+2\cdot1+3\cdot2=18$, $f_{\{2\}}(x)=10+3\cdot2=16$ → marginal = $2$.
  - Weight for each: $\tfrac{0!\,1!}{2!}=\tfrac{1}{2}$; $\tfrac{1!\,0!}{2!}=\tfrac{1}{2}$.
  - $\phi_1 = \tfrac{1}{2}\cdot2 + \tfrac{1}{2}\cdot2 = 2$.
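- Similarly for feature 2: both marginal contributions equal $3\cdot2 = 6$ (that is, $16-10$ and $18-12$), so $\phi_2 = \tfrac{1}{2}\cdot6 + \tfrac{1}{2}\cdot6 = 6$.
- Check: $10 + 2 + 6 = 18 = f(x)$.

So SHAP gives $\phi_1=2,\ \phi_2=6$. For a linear model with features treated as independent, $\phi_j = w_j\,(x_j - \mathbb E[x_j])$: the weight times the feature’s deviation from its mean. Here $\phi_1$ equals the weight $2$ only because $x_1 - \mathbb E[x_1] = 1$.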

---

## Why SHAP Matters

- **Fairness**: satisfies efficiency, symmetry, dummy, linearity.
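- **Consistency**: if a model changes so that a feature’s marginal contribution increases or stays the same for every coalition, that feature’s attribution does not decrease.
- **Local & Global coherence**: each prediction is decomposed exactly into per‑feature contributions, and averaging $|\phi_j|$ over a dataset gives a global importance score built from those same local values.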
- **Model‑specific optimizations**:  
  - **TreeSHAP** computes exact values in $O(TLD^2)$ time for $T$ trees with at most $L$ leaves and depth $D$.  
  - **KernelSHAP** approximates with weighted regression.

---

## Practical Tips

- **Choose background wisely**: sample ~50–200 representative points.
- **Use fast algorithms**: `TreeExplainer` for trees; use `KernelExplainer` sparingly and with a small background set, since its cost grows linearly with the number of background points.
- **Visualize**:  
  - **Force plot** for single predictions  
  - **Summary plot** (beeswarm) for global view  
  - **Dependence plot** for feature effects
- **Watch for pitfalls**: correlated features, baseline choice, computational cost.
