# SHAP v1.1: Model-Agnostic Feature Importance

**NEW in ml4t-diagnostic v1.1**: SHAP importance now works with **ANY sklearn-compatible model**!

This notebook demonstrates:
1. **TreeExplainer**: Fast, exact computation for tree models (LightGBM, XGBoost)
2. **LinearExplainer**: Fast, exact computation for linear models (LogisticRegression)
3. **KernelExplainer**: Model-agnostic fallback (SVM, KNN, ANY model)
4. **Auto-Selection**: Automatic explainer selection based on model type
5. **Performance Comparison**: Speed vs quality trade-offs
6. **Best Practices**: Tips for using each explainer effectively

## Installation

```bash
# Standard ML support (Tree, Linear, Kernel explainers)
# Quotes keep shells such as zsh from treating the brackets as a glob pattern
pip install "ml4t-diagnostic[ml]"

# Neural network support (adds Deep explainer)
pip install "ml4t-diagnostic[deep]"

# GPU acceleration (10-50x speedup for large datasets)
pip install "ml4t-diagnostic[gpu]"

# Everything (all explainers + GPU)
pip install "ml4t-diagnostic[all-ml]"
```

In [None]:
# Standard library
import time

# Data handling
import numpy as np
import polars as pl

# Models
import lightgbm as lgb
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Visualization
import plotly.graph_objects as go

# SHAP importance
from ml4t.diagnostic.evaluation import compute_shap_importance

print("✅ All imports successful!")

## 1. Generate Synthetic Trading Data

Create a realistic quantitative trading dataset with:
- **Features**: momentum, volatility, volume, spread, etc.
- **Target**: Binary classification (trade success/failure)
- **Signal**: momentum + volatility interaction (typical quant pattern)

In [None]:
# Generate synthetic trading data
np.random.seed(42)
n_samples = 1000
n_features = 10

# Feature names
feature_names = [
    "momentum_5d",
    "momentum_20d",
    "volatility_5d",
    "volatility_20d",
    "volume_ratio",
    "spread",
    "rsi",
    "macd",
    "atr",
    "beta",
]

# Generate features
X = np.random.randn(n_samples, n_features)

# Create target with momentum + volatility interaction (realistic quant signal)
signal = (
    0.5 * X[:, 0]  # momentum_5d (strong)
    + 0.3 * X[:, 2]  # volatility_5d (medium)
    + 0.2 * X[:, 0] * X[:, 2]  # interaction
    + 0.1 * np.random.randn(n_samples)  # noise
)
y = (signal > 0).astype(int)

# Convert to Polars DataFrame
X_df = pl.DataFrame(X, schema=feature_names)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X_df.to_numpy(), y, test_size=0.2, random_state=42
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"Features: {X_train.shape[1]}")
print(f"Class balance: {np.mean(y_train):.2%} positive")
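
As a quick sanity check on data built this way, the sketch below regenerates a signal of the same functional form and ranks features by absolute correlation with the label. It is an illustration only and not part of `ml4t-diagnostic`; the seed and the `_demo` variable names are assumptions. Features 0 (`momentum_5d`) and 2 (`volatility_5d`) should stand out, which gives a rough ground truth to compare SHAP rankings against.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed, not the notebook's
n_samples, n_features = 1000, 10
X_demo = rng.standard_normal((n_samples, n_features))

# Same functional form as the notebook's signal
signal_demo = (
    0.5 * X_demo[:, 0]
    + 0.3 * X_demo[:, 2]
    + 0.2 * X_demo[:, 0] * X_demo[:, 2]
    + 0.1 * rng.standard_normal(n_samples)
)
y_demo = (signal_demo > 0).astype(int)

# Absolute Pearson correlation of each feature with the binary label
abs_corr = np.array(
    [abs(np.corrcoef(X_demo[:, j], y_demo)[0, 1]) for j in range(n_features)]
)
ranking = np.argsort(abs_corr)[::-1]  # feature indices, most correlated first
print(ranking[:2])
```

Correlation only captures the linear part of the relationship, so it is a coarse reference rather than a substitute for SHAP.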

## 2. TreeExplainer: Fast, Exact for Tree Models

**Best for**: LightGBM, XGBoost, RandomForest  
**Speed**: <10ms per sample  
**Quality**: Exact SHAP values  
**Use when**: You have tree-based models (most quant ML workflows)

In [None]:
# Train LightGBM model
lgb_model = lgb.LGBMClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=5, random_state=42, verbose=-1
)
lgb_model.fit(X_train, y_train)

# Compute SHAP importance (auto-selects TreeExplainer)
start_time = time.time()
result_tree = compute_shap_importance(model=lgb_model, X=X_test, feature_names=feature_names)
elapsed_time = time.time() - start_time

# Display results
print(f"Explainer used: {result_tree['explainer_type']}")
print(f"Computation time: {elapsed_time:.2f} seconds")
print(f"Time per sample: {elapsed_time / result_tree['n_samples'] * 1000:.2f}ms\n")

print("Top 5 features:")
for feat, imp in zip(result_tree["feature_names"][:5], result_tree["importances"][:5]):
    print(f"  {feat:20s}: {imp:.4f}")
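
The `importances` printed above are presumably mean absolute SHAP values sorted in descending order, as the `Mean |SHAP value|` axis label later in this notebook suggests. Under that assumption, the aggregation step can be sketched as follows. The helper `mean_abs_importance` and the toy numbers are hypothetical, not the library's code.

```python
import numpy as np

def mean_abs_importance(shap_values, feature_names):
    """Rank features by mean |SHAP value| across samples (descending)."""
    shap_values = np.asarray(shap_values, dtype=float)
    importances = np.abs(shap_values).mean(axis=0)  # one value per feature
    order = np.argsort(importances)[::-1]           # largest first
    return [feature_names[i] for i in order], importances[order]

# Toy SHAP matrix: 3 samples x 3 features (made-up numbers)
toy_shap = [[0.4, -0.1, 0.0], [-0.6, 0.2, 0.1], [0.5, -0.3, 0.0]]
names, imps = mean_abs_importance(toy_shap, ["momentum_5d", "spread", "beta"])
print(names)  # ['momentum_5d', 'spread', 'beta']
```

Taking absolute values first matters: the signed SHAP values for `momentum_5d` above nearly cancel, yet the feature clearly drives the predictions.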

## 3. LinearExplainer: Fast, Exact for Linear Models

**Best for**: LogisticRegression, Ridge, Lasso, LinearSVC  
**Speed**: <100ms per sample  
**Quality**: Exact SHAP values  
**Use when**: You have linear models (e.g., factor models, simple baselines)

In [None]:
# Train Logistic Regression model
lr_model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000, random_state=42)
lr_model.fit(X_train, y_train)

# Compute SHAP importance (auto-selects LinearExplainer)
start_time = time.time()
result_linear = compute_shap_importance(model=lr_model, X=X_test, feature_names=feature_names)
elapsed_time = time.time() - start_time

# Display results
print(f"Explainer used: {result_linear['explainer_type']}")
print(f"Computation time: {elapsed_time:.2f} seconds")
print(f"Time per sample: {elapsed_time / result_linear['n_samples'] * 1000:.2f}ms\n")

print("Top 5 features:")
for feat, imp in zip(result_linear["feature_names"][:5], result_linear["importances"][:5]):
    print(f"  {feat:20s}: {imp:.4f}")
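
Why linear models admit exact values: when features are treated as independent, the SHAP value of feature i for a linear model is simply its coefficient times the feature's deviation from the background mean. The sketch below checks this on made-up coefficients; the names `w`, `b`, and `X_bg` are illustrative and nothing here calls `ml4t-diagnostic`. For a logistic regression the same identity applies to the linear (log-odds) part of the model.

```python
import numpy as np

rng = np.random.default_rng(0)
X_bg = rng.standard_normal((200, 3))   # background data
w = np.array([0.5, -0.3, 0.0])         # made-up coefficients
b = 0.1                                # made-up intercept

def f(X):
    """Linear model output."""
    return X @ w + b

x = np.array([1.0, 2.0, -1.0])         # sample to explain
phi = w * (x - X_bg.mean(axis=0))      # closed-form SHAP values

# Additivity: contributions sum to f(x) minus the average prediction
assert np.isclose(phi.sum(), f(x) - f(X_bg).mean())
print(phi)
```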

## 4. KernelExplainer: Model-Agnostic Fallback

**Best for**: SVM, KNN, ANY sklearn-compatible model  
**Speed**: 100-5000ms per sample (SLOW!)  
**Quality**: Approximate SHAP values  
**Use when**: No specialized explainer available (universal fallback)

⚠️ **Performance tip**: Use `max_samples` parameter to limit computation time!

In [None]:
# Train SVM model (no specialized explainer available)
svm_model = SVC(
    kernel="rbf",
    C=1.0,
    probability=True,  # Enables predict_proba (required for SHAP here)
    random_state=42,
)
svm_model.fit(X_train, y_train)

# Compute SHAP importance (auto-selects KernelExplainer)
# Use max_samples for speed (KernelExplainer is slow!)
start_time = time.time()
result_kernel = compute_shap_importance(
    model=svm_model,
    X=X_test,
    feature_names=feature_names,
    max_samples=50,  # Limit to 50 samples for demo (faster)
    performance_warning=True,  # Warn if computation will be slow
    show_progress=False,  # Set to True to see progress bar
)
elapsed_time = time.time() - start_time

# Display results
print(f"Explainer used: {result_kernel['explainer_type']}")
print(f"Computation time: {elapsed_time:.2f} seconds")
print(f"Time per sample: {elapsed_time / result_kernel['n_samples'] * 1000:.2f}ms\n")

print("Top 5 features:")
for feat, imp in zip(result_kernel["feature_names"][:5], result_kernel["importances"][:5]):
    print(f"  {feat:20s}: {imp:.4f}")

print("\n⚠️ Note: KernelExplainer is MUCH slower than Tree/Linear explainers!")
print("   Always use max_samples parameter to limit computation time.")
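
To see why a model-agnostic explainer is slow, it helps to look at the exact computation it approximates. The sketch below enumerates every feature coalition for a tiny made-up model, which costs on the order of 2^n model calls per sample. Kernel SHAP avoids this by sampling coalitions and fitting a weighted regression, which is also why its values are approximate. The single-point `baseline` is a simplification (real explainers average over a background dataset), and `exact_shapley` and `toy_model` are hypothetical names.

```python
from itertools import combinations
from math import factorial

import numpy as np

def exact_shapley(f, x, baseline):
    """Exact Shapley values by enumerating every coalition of features."""
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                z = baseline.copy()
                for j in subset:        # "present" features take x's values
                    z[j] = x[j]
                without_i = f(z)
                z[i] = x[i]             # now add feature i to the coalition
                phi[i] += weight * (f(z) - without_i)
    return phi

def toy_model(z):
    """Made-up model with an interaction between features 0 and 1."""
    return 0.5 * z[0] + 0.3 * z[1] + 0.2 * z[0] * z[1]

x = np.array([1.0, 2.0, -1.0])
baseline = np.zeros(3)
phi = exact_shapley(toy_model, x, baseline)
print(phi)  # the interaction term is split equally between features 0 and 1
```

With 10 features this already means about a thousand coalitions per sample, which is the cost that `max_samples` keeps in check.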

## 5. Auto-Selection Behavior

When `explainer_type='auto'` (default), the function tries explainers in order:

1. **TreeExplainer**: Check for tree-like attributes (tree_, estimators_, booster_)
2. **LinearExplainer**: Check for linear attributes (coef_, intercept_)
3. **KernelExplainer**: Universal fallback (works for ANY model)

You can override this with an explicit `explainer_type` argument.

In [None]:
# Demonstrate auto-selection
models = [("LightGBM", lgb_model), ("LogisticRegression", lr_model), ("SVM", svm_model)]

print("Auto-selection results:\n")
for name, model in models:
    result = compute_shap_importance(
        model=model,
        X=X_test[:10],  # Small subset for speed
        feature_names=feature_names,
        performance_warning=False,
    )
    print(f"{name:25s} → {result['explainer_type']:10s} explainer")
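
The selection order described in this section can be sketched as a simple attribute check. This is a hypothetical illustration of the idea, not the library's actual implementation: `pick_explainer` and the stand-in classes are invented for the example.

```python
def pick_explainer(model):
    """Hypothetical helper mirroring the tree -> linear -> kernel order."""
    tree_attrs = ("tree_", "estimators_", "booster_")
    linear_attrs = ("coef_", "intercept_")
    if any(hasattr(model, attr) for attr in tree_attrs):
        return "tree"
    if all(hasattr(model, attr) for attr in linear_attrs):
        return "linear"
    return "kernel"  # universal fallback

# Stand-ins for fitted models (only the attributes matter here)
class FakeTree:
    booster_ = object()

class FakeLinear:
    coef_ = [0.1]
    intercept_ = 0.0

class FakeOther:
    pass

print([pick_explainer(m) for m in (FakeTree(), FakeLinear(), FakeOther())])
# → ['tree', 'linear', 'kernel']
```

Requiring both `coef_` and `intercept_` for the linear branch is a choice made in this sketch, since some non-linear estimators also expose `intercept_`. Attribute checks are heuristics, so printing `result['explainer_type']` as done above remains the reliable way to confirm what was used.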

## 6. Explicit Explainer Selection

Force a specific explainer (useful for comparison or debugging):

In [None]:
# Compare TreeExplainer vs KernelExplainer on same model
print("Comparing explainers on LightGBM model:\n")

# Default: TreeExplainer (fast, exact)
start_time = time.time()
result_tree = compute_shap_importance(
    model=lgb_model, X=X_test, feature_names=feature_names, explainer_type="tree"
)
time_tree = time.time() - start_time

# Force: KernelExplainer (slow, approximate)
start_time = time.time()
result_kernel_lgb = compute_shap_importance(
    model=lgb_model,
    X=X_test,
    feature_names=feature_names,
    explainer_type="kernel",
    max_samples=50,  # Limit for speed
    performance_warning=False,
)
time_kernel = time.time() - start_time

# Compare
print(
    f"TreeExplainer:   {time_tree:.2f}s ({time_tree / result_tree['n_samples'] * 1000:.2f}ms/sample)"
)
print(
    f"KernelExplainer: {time_kernel:.2f}s ({time_kernel / result_kernel_lgb['n_samples'] * 1000:.2f}ms/sample)"
)
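# Compare per-sample cost: the two runs used different numbers of samples
per_sample_tree = time_tree / result_tree["n_samples"]
per_sample_kernel = time_kernel / result_kernel_lgb["n_samples"]
print(f"\nSpeedup: {per_sample_kernel / per_sample_tree:.1f}x faster per sample with TreeExplainer!")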

print("\n💡 Tip: Always use specialized explainers (Tree, Linear) when available.")
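
The timing pattern repeated in the cells above can be wrapped in a small helper. The sketch below is one possible version, using `time.perf_counter` (better suited to short intervals than `time.time`) and keeping the best of a few runs to reduce noise. `time_per_sample` is a hypothetical name, and the toy workload stands in for a call to `compute_shap_importance`.

```python
import time

def time_per_sample(fn, n_samples, repeats=3):
    """Return (result, best milliseconds per sample) over repeated calls."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        best = min(best, time.perf_counter() - start)
    return result, best / n_samples * 1000.0

# Toy workload standing in for a SHAP computation on 50 samples
result, ms = time_per_sample(lambda: sum(i * i for i in range(10_000)), n_samples=50)
print(f"{ms:.4f} ms/sample")
```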

## 7. Performance Comparison Visualization

Visualize speed vs quality trade-offs:

In [None]:
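# Collect performance data
# Note: `elapsed_time` now holds the SVM timing from section 4, so the
# LinearExplainer is re-timed here instead of reusing that variable.
start_time = time.time()
result_linear = compute_shap_importance(model=lr_model, X=X_test, feature_names=feature_names)
time_linear = time.time() - start_time

explainers = ["Tree", "Linear", "Kernel"]
times = [
    time_tree / result_tree["n_samples"] * 1000,
    time_linear / result_linear["n_samples"] * 1000,
    time_kernel / result_kernel_lgb["n_samples"] * 1000,
]
quality = ["Exact", "Exact", "Approx"]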

# Create bar chart
fig = go.Figure(
    [
        go.Bar(
            x=explainers,
            y=times,
            text=[f"{t:.1f}ms ({q})" for t, q in zip(times, quality)],
            textposition="auto",
            marker_color=["green", "blue", "red"],
        )
    ]
)

fig.update_layout(
    title="SHAP Explainer Performance Comparison",
    xaxis_title="Explainer Type",
    yaxis_title="Time per Sample (ms, log scale)",
    yaxis_type="log",
    showlegend=False,
    height=400,
)

fig.show()

print("\n📊 Key Takeaways:")
print("  • TreeExplainer: Fastest, exact (use for tree models)")
print("  • LinearExplainer: Fast, exact (use for linear models)")
print("  • KernelExplainer: Slowest, approximate (universal fallback)")

## 8. Feature Importance Comparison

Compare SHAP importance across different models:

In [None]:
# Create comparison DataFrame

comparison = pl.DataFrame(
    {
        "Feature": feature_names,
        "Tree (LightGBM)": [
            result_tree["importances"][result_tree["feature_names"].index(f)] for f in feature_names
        ],
        "Linear (LogReg)": [
            result_linear["importances"][result_linear["feature_names"].index(f)]
            for f in feature_names
        ],
        "Kernel (SVM)": [
            result_kernel["importances"][result_kernel["feature_names"].index(f)]
            for f in feature_names
        ],
    }
)

# Sort by tree importance
comparison = comparison.sort("Tree (LightGBM)", descending=True)

print("\nFeature Importance Comparison:")
print(comparison)
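
Raw mean |SHAP| values from different models may not share a scale (for example, log-odds for one explainer and probabilities for another), so comparing shares and rankings is often more informative than comparing magnitudes. The sketch below shows one way to do that with made-up numbers. `normalize` and `rank_agreement` are hypothetical helpers, and whether the scales actually differ here depends on how the library configures each explainer.

```python
import numpy as np

def normalize(importances):
    """Scale importances to sum to 1 so models are compared on shares."""
    importances = np.asarray(importances, dtype=float)
    return importances / importances.sum()

def rank_agreement(a, b):
    """Spearman rank correlation between two importance vectors (no ties assumed)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

# Made-up importances for three features from two models, on different scales
tree_imp = [1.20, 0.60, 0.20]
kernel_imp = [0.12, 0.07, 0.01]

print(normalize(tree_imp))
print(normalize(kernel_imp))
print(rank_agreement(tree_imp, kernel_imp))
```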

In [None]:
# Visualize comparison
fig = go.Figure()

# Add traces for each model
fig.add_trace(
    go.Bar(
        name="Tree (LightGBM)",
        x=comparison["Feature"].to_list(),
        y=comparison["Tree (LightGBM)"].to_list(),
        marker_color="green",
    )
)

fig.add_trace(
    go.Bar(
        name="Linear (LogReg)",
        x=comparison["Feature"].to_list(),
        y=comparison["Linear (LogReg)"].to_list(),
        marker_color="blue",
    )
)

fig.add_trace(
    go.Bar(
        name="Kernel (SVM)",
        x=comparison["Feature"].to_list(),
        y=comparison["Kernel (SVM)"].to_list(),
        marker_color="red",
    )
)

fig.update_layout(
    title="SHAP Feature Importance: Model Comparison",
    xaxis_title="Feature",
    yaxis_title="Mean |SHAP value|",
    barmode="group",
    height=500,
    xaxis_tickangle=-45,
)

fig.show()

print("\n📊 Interpretation:")
print("  • momentum_5d: Expected to rank first (largest weight in the synthetic signal)")
print("  • volatility_5d: Expected to rank high (direct effect plus momentum-volatility interaction)")
print("  • Different models capture different aspects of the signal")

## 9. Best Practices

### TreeExplainer
✅ **Use for**: LightGBM, XGBoost, RandomForest  
✅ **Performance**: Fast (<10ms/sample)  
✅ **Quality**: Exact SHAP values  
✅ **Tip**: Default choice for tree models

### LinearExplainer
✅ **Use for**: LogisticRegression, Ridge, Lasso  
✅ **Performance**: Fast (<100ms/sample)  
✅ **Quality**: Exact SHAP values  
✅ **Tip**: Great for factor models and baselines

### KernelExplainer
⚠️ **Use for**: SVM, KNN, ANY model (universal fallback)  
⚠️ **Performance**: SLOW (100-5000ms/sample)  
⚠️ **Quality**: Approximate SHAP values  
⚠️ **Tip**: Always use `max_samples` parameter!

```python
# Good: Limit samples for speed
result = compute_shap_importance(
    model, X, 
    max_samples=100,  # Much faster
    show_progress=True  # Show progress
)

# Bad: Full dataset (can take hours!)
result = compute_shap_importance(model, X_large)  # Slow!
```
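
What limiting samples amounts to can be sketched as a random row subsample taken before the expensive computation. How `max_samples` is implemented inside `compute_shap_importance` is not shown in this notebook, so the helper below (`subsample_rows`, a hypothetical name) is only an illustration of the idea.

```python
import numpy as np

def subsample_rows(X, max_samples, seed=0):
    """Randomly keep at most `max_samples` rows, without replacement."""
    X = np.asarray(X)
    if max_samples is None or len(X) <= max_samples:
        return X  # nothing to trim
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=max_samples, replace=False)
    return X[np.sort(idx)]  # keep the original row order

X_big = np.arange(2000).reshape(1000, 2)
X_small = subsample_rows(X_big, max_samples=100)
print(X_small.shape)  # (100, 2)
```

Fixing the seed keeps the subsample, and therefore the resulting importances, reproducible between runs.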

### GPU Acceleration
🚀 **Use for**: Large datasets (>10K samples)  
🚀 **Speedup**: 10-50x faster  
🚀 **Requires**: `pip install "ml4t-diagnostic[gpu]"`

```python
result = compute_shap_importance(
    model, X_large,
    use_gpu=True  # or 'auto' for automatic detection
)
```

## 10. Summary

### v1.1 Key Features
- ✅ **Multi-Explainer Support**: Tree, Linear, Kernel, Deep
- ✅ **Universal Compatibility**: Works with ANY sklearn model
- ✅ **Smart Auto-Selection**: Automatically picks best explainer
- ✅ **100% Backward Compatible**: All v1.0 code works unchanged

### Quick Reference

```python
from ml4t.diagnostic.evaluation import compute_shap_importance

# Auto-selection (recommended)
result = compute_shap_importance(model, X)
print(f"Used: {result['explainer_type']}")  # 'tree', 'linear', 'kernel', 'deep'

# Explicit selection
result = compute_shap_importance(model, X, explainer_type='kernel')

# Performance optimization
result = compute_shap_importance(
    model, X,
    max_samples=100,  # Limit samples (for KernelExplainer)
    use_gpu=True,  # Enable GPU (if available)
    show_progress=True  # Show progress bar
)
```

### When to Use Each Explainer

| Model Type | Recommended Explainer | Speed | Quality |
|------------|----------------------|-------|--------|
| LightGBM, XGBoost | TreeExplainer | ✅ Fast | ✅ Exact |
| LogisticRegression, Ridge | LinearExplainer | ✅ Fast | ✅ Exact |
| TensorFlow, PyTorch | DeepExplainer | ⚠️ Medium | ⚠️ Approx |
| SVM, KNN, Other | KernelExplainer | ❌ Slow | ⚠️ Approx |

### Additional Resources
- **Documentation**: See `compute_shap_importance` docstring
- **Migration Guide**: `docs/MIGRATION.md`
- **README**: `README.md` (v1.1 section)

### Feedback
Questions or issues? File an issue on GitHub!