# TreeSHAP Algorithm: A Complete Pedagogical Guide

## 🎯 What is SHAP and Why Do We Need It?

Imagine you're a doctor using an AI model to diagnose patients. The model says "80% chance of disease X" for a patient. As a doctor, you'd naturally ask: **"Why? Which symptoms led to this conclusion?"**

SHAP (SHapley Additive exPlanations) answers this question by telling us **how much each feature contributed** to the final prediction.

### The Fundamental SHAP Equation

For any prediction, SHAP guarantees:

$$\text{Prediction} = \text{Expected Value} + \sum_{i=1}^{n} \phi_i$$

Where:
- $\phi_i$ = SHAP value for feature $i$ (its contribution)
- Expected Value = average model prediction over the training (background) data
- $n$ = number of features

**Example**: If a house price model predicts \$400K:
- Expected price (baseline): \$300K
- Square footage contribution: +\$80K
- Location contribution: +\$30K
- Age contribution: -\$10K
- **Total**: \$300K + \$80K + \$30K - \$10K = \$400K ✓
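
The same bookkeeping can be checked in a few lines of Python. This is only a sketch of the additivity identity: the dictionary keys are illustrative names, and the numbers are the ones from the example above, in thousands of dollars.

```python
# Baseline and per-feature contributions from the house-price example (in thousands).
expected_value = 300
contributions = {"square_footage": 80, "location": 30, "age": -10}

# Additivity: prediction = expected value + sum of SHAP values.
prediction = expected_value + sum(contributions.values())
print(prediction)  # 400
```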

---

## 🧠 The Core Intuition: Shapley Values from Game Theory

SHAP values come from **cooperative game theory**. Think of features as "players" in a team trying to achieve a prediction.

### The Coalition Game

Imagine features as teammates:
1. **Solo performance**: How good is each feature alone?
2. **Team performance**: How good are they together?
3. **Fair credit**: How do we fairly distribute the team's success?

**Key insight**: A feature's contribution = its **marginal impact** when added to different coalitions of other features.

### Mathematical Definition

For feature $i$, its Shapley value is:

$$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!(|N|-|S|-1)!}{|N|!} [f(S \cup \{i\}) - f(S)]$$

Where:
- $S$ = coalition of features (subset not including $i$)
- $N$ = all features
- $f(S)$ = model's expected output using only features in $S$
- $\frac{|S|!(|N|-|S|-1)!}{|N|!}$ = combinatorial weight (the fraction of feature orderings in which exactly the features in $S$ come before $i$)

**Translation**: Average the marginal contribution of feature $i$ over all possible coalitions, with each coalition weighted by how often it arises when features are added in a random order.
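
To make the formula concrete, here is a minimal brute-force sketch that enumerates every coalition. The function name `shapley_values` and the two-feature lookup table `toy_values` are hypothetical and chosen only for illustration. The cost grows exponentially with the number of features, so this is practical only for tiny examples.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value_fn):
    """Exact Shapley values by enumerating every coalition (exponential cost)."""
    n = len(features)
    phi = {}
    for i in features:
        others = [f for f in features if f != i]
        total = 0.0
        for size in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of i when added to coalition s
                total += weight * (value_fn(s | {i}) - value_fn(s))
        phi[i] = total
    return phi

# Hypothetical toy game: a lookup table of f(S) for two features.
toy_values = {
    frozenset(): 10.0,
    frozenset({"a"}): 14.0,
    frozenset({"b"}): 12.0,
    frozenset({"a", "b"}): 20.0,
}
phi = shapley_values(["a", "b"], toy_values.__getitem__)
print(phi)  # {'a': 6.0, 'b': 4.0}
```

The two values sum to 10, which is exactly $f(\{a, b\}) - f(\emptyset) = 20 - 10$.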

---

## 🌳 Why TreeSHAP is Special

Computing exact Shapley values normally requires $2^n$ evaluations (exponential!). For 20 features, that's over 1 million calculations per prediction.

**TreeSHAP breakthrough**: Exploit tree structure to compute exact Shapley values in **polynomial time**.

### Key Insight: Trees Have Structure

In a tree model:
1. **Path uniqueness**: Each prediction follows exactly one path
2. **Feature splits**: Each internal node tests one feature
3. **Conditional expectations**: We can compute $f(S)$ efficiently using tree structure

---

## 🔧 TreeSHAP Algorithm: Step by Step

### Step 1: Understanding Tree Predictions

For any tree, a prediction is determined by:
1. **Starting at root**: Begin with all data
2. **Following splits**: At each node, go left or right based on feature values
3. **Reaching leaf**: Final prediction is the leaf value

```
Example Tree:
     [Root: Age < 30?]
    /                \
[Income < 50K?]    [Leaf: +0.8]
  /          \
[Leaf: -0.2] [Leaf: +0.3]
```
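
The traversal can be sketched with a nested-dict encoding of the example tree. The encoding, the key names, and the convention that the left branch is taken when the test is true are assumptions made for this illustration; real libraries store trees differently.

```python
# Nested-dict encoding of the example tree. Assumption: the left branch is
# taken when the test "feature < threshold" is true.
tree = {
    "feature": "age", "threshold": 30,
    "left": {
        "feature": "income", "threshold": 50_000,
        "left": {"value": -0.2},
        "right": {"value": 0.3},
    },
    "right": {"value": 0.8},
}

def predict(node, x):
    """Walk from the root to a leaf and return the leaf value."""
    while "value" not in node:
        branch = "left" if x[node["feature"]] < node["threshold"] else "right"
        node = node[branch]
    return node["value"]

print(predict(tree, {"age": 25, "income": 40_000}))  # -0.2
print(predict(tree, {"age": 25, "income": 60_000}))  # 0.3
print(predict(tree, {"age": 45, "income": 60_000}))  # 0.8
```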

### Step 2: Feature Presence vs Absence

TreeSHAP compares two scenarios for each feature:
- **Present**: Feature has its actual value (sample follows its natural path)
- **Absent**: Feature is unknown (at every split on that feature, both branches are followed, weighted by the fraction of training data that went each way)

### Step 3: Path Probability Tracking

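For each node in the tree, and for each feature split on along the path to it, we track:
- $p_0$ = probability of reaching this node when the feature is **absent**
- $p_1$ = probability of reaching this node when the feature is **present**

**Key equations**:
- When the feature is absent: $p_0 = \text{fraction of training data that went this way}$
- When the feature is present: $p_1 = 1$ (if the sample goes this way) or $0$ (if not)

The full algorithm keeps one such pair per feature on the path, plus weights for each possible coalition size. Those weights are what recover the exact Shapley weighting.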
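
The present/absent distinction can be sketched as a recursive expected-output function. Everything here is an illustrative assumption: the dict-based tree encoding, the `samples` counts (made-up training counts), and the function name `expected_output`. It follows the sample's own branch when the split feature is known and averages both branches by training fractions otherwise.

```python
def expected_output(node, x, known):
    """Expected tree output when only the features in `known` are observed."""
    if "value" in node:
        return node["value"]
    left, right = node["left"], node["right"]
    if node["feature"] in known:
        # Feature present: follow the sample's own branch (p_1 is 1 or 0).
        child = left if x[node["feature"]] < node["threshold"] else right
        return expected_output(child, x, known)
    # Feature absent: average both branches by training fractions (p_0).
    n_left, n_right = left["samples"], right["samples"]
    return (n_left * expected_output(left, x, known)
            + n_right * expected_output(right, x, known)) / (n_left + n_right)

# Hypothetical one-split tree; "samples" are made-up training counts.
stump = {
    "feature": "age", "threshold": 30,
    "left": {"value": -0.5, "samples": 40},
    "right": {"value": 0.8, "samples": 60},
}
x = {"age": 35}

absent = expected_output(stump, x, known=set())
present = expected_output(stump, x, known={"age"})
print(f"absent={absent:.2f} present={present:.2f}")  # absent=0.28 present=0.80
```

With a single split, the difference `present - absent` is the whole contribution of `age`.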

### Step 4: The Recursive Magic

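```python
def tree_shap_recursive(node, x, shap_values, p_zero=1.0, p_one=1.0, parent_feature=None):
    """Simplified sketch of the recursion (not the full TreeSHAP algorithm).

    Assumes each node exposes is_leaf, leaf_value, split_feature, threshold,
    left_child, right_child, left_samples, right_samples and total_samples,
    that x maps feature names to values, and that shap_values is a dict.
    Each leaf's credit goes to the last feature split on, so the values sum
    to prediction - expected, but the split between features is exact only
    when the tree uses a single feature. Full TreeSHAP also weights every
    subset of the features on the path.
    """
    if node.is_leaf:
        if parent_feature is not None:
            # Contribution = difference in probabilities * leaf value
            contribution = (p_one - p_zero) * node.leaf_value
            shap_values[parent_feature] = shap_values.get(parent_feature, 0.0) + contribution
        return

    # Get training data split probabilities
    p_left = node.left_samples / node.total_samples
    p_right = node.right_samples / node.total_samples

    feature = node.split_feature
    goes_left = x[feature] < node.threshold

    # With the feature present, p_one stays on the sample's branch and is 0 on the other
    tree_shap_recursive(node.left_child, x, shap_values,
                        p_zero * p_left, p_one if goes_left else 0.0, feature)
    tree_shap_recursive(node.right_child, x, shap_values,
                        p_zero * p_right, 0.0 if goes_left else p_one, feature)
```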

### Step 5: Combining Across Trees

For ensemble models (like gradient boosting):

$$\phi_i^{\text{total}} = \sum_{t=1}^{T} \eta \cdot \phi_i^{(t)}$$

Where:
- $T$ = number of trees
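- $\eta$ = learning rate (use $\eta = 1$ when the learning rate is already folded into the stored leaf values, as gradient-boosting libraries commonly do)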
- $\phi_i^{(t)}$ = SHAP value for feature $i$ from tree $t$
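
A minimal sketch of this sum, assuming each tree's SHAP values are already available as a dict keyed by feature name. The helper `combine_tree_shap` and the numbers in `per_tree` are hypothetical. The default `learning_rate=1.0` covers the common case where shrinkage is already baked into the leaf values.

```python
def combine_tree_shap(per_tree_shap, learning_rate=1.0):
    """Sum per-tree SHAP dicts, scaling each tree's values by the learning rate."""
    total = {}
    for tree_shap in per_tree_shap:
        for feature, value in tree_shap.items():
            total[feature] = total.get(feature, 0.0) + learning_rate * value
    return total

# Made-up per-tree SHAP values for a two-tree ensemble.
per_tree = [
    {"age": 0.2, "income": 0.4},
    {"age": -0.1, "income": 0.3},
]
combined = combine_tree_shap(per_tree, learning_rate=0.5)
print({name: round(value, 4) for name, value in combined.items()})
```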

---

## 📊 Concrete Example: How It Works

Let's trace through a simple example.

### The Data
```
Training data:
Age | Income | Bought
25  | 30K    | No
35  | 60K    | Yes  
45  | 40K    | No
```

### The Tree
```
     [Age < 30?]
    /           \
 [No: -0.5]   [Income < 50K?]
               /            \
            [No: -0.2]   [Yes: +0.8]
```

### New Sample to Explain
```
Age: 35, Income: 70K
Prediction: +0.8
```

### TreeSHAP Calculation

**Step 1**: Sample path = Right → Right (Age ≥ 30, Income ≥ 50K)

**Step 2**: Calculate contributions

*For Age feature:*
- At root: 1/3 of training data goes left (Age < 30), 2/3 goes right
- When Age is absent: $p_0 = 1.0$ (start), then splits to 1/3 left, 2/3 right
- When Age is present: Sample goes right, so $p_1 = 1.0$ for right subtree

*For Income feature:*
- At Income node: 1/2 of remaining data goes left, 1/2 goes right  
- When Income is absent: Follow training distribution
- When Income is present: Sample goes right to +0.8 leaf

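**Step 3**: Final SHAP values

Write $f(S)$ for the expected output when only the features in $S$ are known:
- $f(\emptyset) = \frac{1}{3}(-0.5) + \frac{2}{3}\left[\frac{1}{2}(-0.2) + \frac{1}{2}(0.8)\right] \approx 0.033$ (the expected baseline)
- $f(\{\text{Age}\}) = \frac{1}{2}(-0.2) + \frac{1}{2}(0.8) = 0.3$
- $f(\{\text{Income}\}) = \frac{1}{3}(-0.5) + \frac{2}{3}(0.8) \approx 0.367$
- $f(\{\text{Age}, \text{Income}\}) = 0.8$

Averaging each feature's marginal contribution over both orderings:
- Age contribution: $\phi_{\text{Age}} = \frac{1}{2}(0.3 - 0.033) + \frac{1}{2}(0.8 - 0.367) = 0.35$
- Income contribution: $\phi_{\text{Income}} = \frac{1}{2}(0.367 - 0.033) + \frac{1}{2}(0.8 - 0.3) \approx 0.417$
- Must sum to: Prediction - Expected = $0.8 - 0.033 = 0.767 = 0.35 + 0.417$ ✓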
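
The example is small enough to check by brute force. The sketch below is not the polynomial-time TreeSHAP algorithm: it averages marginal contributions over both feature orderings, using the tree and training counts from this section. The dict encoding and the function names are assumptions made for illustration.

```python
from itertools import permutations

# The example tree, with the number of training rows reaching each child.
tree = {
    "feature": "age", "threshold": 30,
    "left": {"value": -0.5, "samples": 1},
    "right": {
        "feature": "income", "threshold": 50_000, "samples": 2,
        "left": {"value": -0.2, "samples": 1},
        "right": {"value": 0.8, "samples": 1},
    },
}

def expected_output(node, x, known):
    """Expected output when only the features in `known` are observed."""
    if "value" in node:
        return node["value"]
    left, right = node["left"], node["right"]
    if node["feature"] in known:
        child = left if x[node["feature"]] < node["threshold"] else right
        return expected_output(child, x, known)
    n_left, n_right = left["samples"], right["samples"]
    return (n_left * expected_output(left, x, known)
            + n_right * expected_output(right, x, known)) / (n_left + n_right)

def shapley_by_orderings(tree, x, features):
    """Average each feature's marginal contribution over all orderings."""
    orderings = list(permutations(features))
    phi = {f: 0.0 for f in features}
    for order in orderings:
        known = set()
        previous = expected_output(tree, x, known)
        for f in order:
            known.add(f)
            current = expected_output(tree, x, known)
            phi[f] += (current - previous) / len(orderings)
            previous = current
    return phi

x = {"age": 35, "income": 70_000}
baseline = expected_output(tree, x, set())
phi = shapley_by_orderings(tree, x, ["age", "income"])
print(f"baseline={baseline:.4f} age={phi['age']:.4f} income={phi['income']:.4f}")
# baseline=0.0333 age=0.3500 income=0.4167
```

The baseline plus both contributions gives back the prediction of 0.8.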

---

## 🎨 Visual Intuition

Think of TreeSHAP as asking: **"What if this feature didn't exist?"**

```
Without Feature A:     With Feature A:
     [?]         →         [A < 5?]
   /     \               /         \
[Mixed]  [Mixed]    [Clear]    [Clear]
```

The **difference** in expected outcomes gives us Feature A's contribution.

### Why This Works

1. **Counterfactual reasoning**: "What would happen if..."
2. **Fair attribution**: Each feature gets credit for its unique contribution
3. **Mathematical guarantee**: The baseline plus all contributions always adds up to the actual prediction

---

## 🔍 Advanced Concepts

### Handling Missing Values

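When a feature value is missing, implementations differ. One simple convention, sketched here, routes the sample along the tree's default direction and gives that split no contribution:
```python
if feature_is_missing:
    # Follow the tree's default direction for missing values and treat the
    # feature as absent at this split (a simplifying assumption, not a
    # universal rule; check how your library handles missing values)
    p_one = p_zero  # No contribution from this feature
```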

### Feature Interactions

TreeSHAP naturally captures feature interactions:
- If features A and B work together, their interaction effect is split between their individual SHAP values (SHAP interaction values can separate it out)
- Non-linear relationships are automatically handled by tree structure
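
As a small worked illustration, take a hypothetical two-feature model (not drawn from any dataset) whose output is 1 only when both A and B are known, so that $f(\emptyset) = f(\{A\}) = f(\{B\}) = 0$ and $f(\{A, B\}) = 1$. The Shapley formula then gives:

$$\phi_A = \frac{1}{2}\,[f(\{A\}) - f(\emptyset)] + \frac{1}{2}\,[f(\{A, B\}) - f(\{B\})] = \frac{1}{2} \cdot 0 + \frac{1}{2} \cdot 1 = \frac{1}{2}$$

By symmetry $\phi_B = \frac{1}{2}$, so a pure interaction effect is shared evenly between the two features.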

### Ensemble Models

For gradient boosting with multiple trees:
1. Compute SHAP for each tree independently  
2. Scale by learning rate (only if it is not already folded into the stored leaf values)
3. Sum across all trees
4. Result: Total feature contribution across entire ensemble

---

## ⚙️ Implementation Details

### Computational Complexity

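- **Naive Shapley**: $O(T \cdot L \cdot 2^n)$
- **TreeSHAP**: $O(T \cdot L \cdot D^2)$

Here $T$ = number of trees, $L$ = maximum number of leaves per tree, $D$ = maximum tree depth, and $n$ = number of features.

For 10 features and 100 trees, assuming depth-6 trees with up to 64 leaves:
- Naive: ~6.5 million operations ($100 \cdot 64 \cdot 2^{10}$)
- TreeSHAP: ~230,000 operations ($100 \cdot 64 \cdot 6^2$), roughly 28× fewer, and the gap doubles with every added feature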

### Memory Requirements

TreeSHAP needs to store:
- Path state (the features on the current path and their weights, copied at each recursion level): $O(D^2)$, where $D$ is the maximum tree depth
- Feature contributions: $O(n)$
- Total: $O(D^2 + n)$ per prediction

### Numerical Stability

Key considerations:
1. **Probability tracking**: Use log-space for very deep trees
2. **Additivity enforcement**: Explicitly correct small numerical errors
3. **Missing value handling**: Ensure consistent behavior

---

## 🧪 Validation and Debugging

### The Additivity Test

Every SHAP implementation must pass:
```python
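import numpy as np

def test_additivity(model, X, atol=1e-6):
    # `model` is assumed to expose shap_values, predict and expected_value.
    # predict must return raw (margin) outputs on the same scale as the
    # SHAP values. atol allows for float32 arithmetic inside many libraries.
    shap_values = np.asarray(model.shap_values(X))
    predictions = np.asarray(model.predict(X))
    reconstructed = model.expected_value + shap_values.sum(axis=1)
    errors = np.abs(reconstructed - predictions)
    assert np.all(errors < atol), f"Additivity violated: max error {errors.max():.3g}"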
```

### Common Issues and Fixes

1. **Large additivity errors**: Usually indicates wrong probability calculations
2. **Inconsistent explanations**: Check feature masking and tree traversal
3. **Performance issues**: Verify tree depth and ensemble size

### Debugging Tips

```python
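# Check individual tree contributions.
# Assumes model.trees yields (tree, mask) pairs, where mask selects the
# features that tree uses, and compute_tree_shap is your own per-tree routine.
for tree_idx, (tree, mask) in enumerate(model.trees):
    tree_shap = compute_tree_shap(tree, mask, x)
    tree_pred = tree.predict(x[mask])
    tree_expected = tree.expected_value
    # The two printed numbers should match for every tree
    print(f"Tree {tree_idx}: sum(SHAP)={sum(tree_shap):.6f}, "
          f"pred-expected={tree_pred - tree_expected:.6f}")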
```

---

## 🎓 Key Takeaways

### Theoretical Foundations
1. **Game theory**: Shapley values are the unique attribution satisfying the efficiency, symmetry, dummy, and additivity axioms
2. **Efficiency**: TreeSHAP exploits tree structure for polynomial-time computation
3. **Additivity**: Mathematical guarantee that the baseline plus the SHAP values equals the prediction

### Practical Benefits
1. **Local explanations**: Understand individual predictions
2. **Global insights**: Aggregate SHAP values for feature importance
3. **Model debugging**: Identify problematic features or interactions

### When to Use TreeSHAP
- ✅ Tree-based models (Random Forest, XGBoost, LightGBM, etc.)
- ✅ Need exact explanations (not approximations)
- ✅ Want guaranteed additivity
- ✅ Have reasonable number of features (< 1000)

### Limitations
- ❌ Only works for tree-based models
- ❌ Attributions can be misleading when features are strongly correlated; how dependence is handled depends on the variant (path-dependent vs. interventional)
- ❌ Can be slow for very large ensembles
- ❌ Requires understanding of tree structure

---

## 📚 Further Reading

### Essential Papers
1. **Original SHAP**: Lundberg & Lee (2017) - "A Unified Approach to Interpreting Model Predictions"
2. **TreeSHAP**: Lundberg et al. (2020) - "From local explanations to global understanding with explainable AI for trees"
3. **Shapley Values**: Shapley (1953) - "A value for n-person games"

### Practical Resources
- **SHAP Python Library**: `pip install shap`
- **XGBoost Integration**: Built-in `.get_booster().predict(dmatrix, pred_contribs=True)`, where `dmatrix` is an `xgboost.DMatrix`
- **Interpretability Guidelines**: Model Cards, Fairness considerations

### Advanced Topics
- **SHAP for Deep Learning**: DeepLIFT, GradientSHAP
- **Causal SHAP**: Handling confounded features
- **Distributional SHAP**: Explanations for probability distributions

---

*TreeSHAP represents a beautiful intersection of game theory, computer science, and machine learning - turning the complex problem of model interpretation into an elegant, efficient algorithm that provides mathematically principled explanations for tree-based models.*