# Model Explainability with SHAP

This notebook examines how the Random Forest fraud detection model reaches its predictions, using SHAP (SHapley Additive exPlanations) for both global feature rankings and explanations of individual cases.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import shap
import sys
import os

sys.path.append(os.path.abspath(os.path.join('..', 'src')))
from modeling import get_feature_importance

# Set up SHAP plots to render in notebook
shap.initjs()

## 1. Load Model and Data

In [None]:
# Load the pre-trained best model
model = joblib.load('../models/best_model.joblib')

# Load processed data (subset for SHAP if necessary)
data = pd.read_csv('../data/processed/Fraud_Data_Processed.csv')
X = data.drop(columns=['class', 'user_id', 'signup_time', 'purchase_time', 'device_id'])
y = data['class']

## 2. Global Explainability

### 2.1 Built-in Feature Importance

In [None]:
feat_imp = get_feature_importance(model, X.columns)
plt.figure(figsize=(10, 6))
feat_imp.head(10).plot(kind='barh', color='teal').invert_yaxis()
plt.title('Top 10 Feature Importances (Random Forest)')
plt.show()
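`get_feature_importance` is imported from `src/modeling`, and its body is not shown in this notebook. As a rough sketch only, a helper of this kind might wrap the fitted estimator's `feature_importances_` attribute and return a `pandas.Series` sorted in descending order. The function name `sketch_feature_importance`, the stand-in model object, and its importances are all hypothetical.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd


def sketch_feature_importance(model, feature_names):
    """Hypothetical stand-in for modeling.get_feature_importance (an assumption)."""
    # scikit-learn tree ensembles expose impurity-based importances on this attribute.
    importances = np.asarray(model.feature_importances_)
    ranked = pd.Series(importances, index=list(feature_names))
    return ranked.sort_values(ascending=False)


# Stand-in object with made-up importances, used only to exercise the helper.
toy_model = SimpleNamespace(feature_importances_=[0.1, 0.6, 0.3])
toy_ranking = sketch_feature_importance(toy_model, ["a", "b", "c"])
print(toy_ranking)
```

Returning a sorted Series is what makes `.head(10).plot(kind='barh')` in the cell above work without further reshaping.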

### 2.2 SHAP Summary Plot

In [None]:
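explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Depending on the SHAP version, shap_values is either a list with one array per
# class [0, 1] or a single array of shape (n_samples, n_features, n_classes).
# Normalize to the list form so that shap_values[1] is always the fraud class.
if isinstance(shap_values, np.ndarray) and shap_values.ndim == 3:
    shap_values = [shap_values[:, :, k] for k in range(shap_values.shape[2])]
shap.summary_plot(shap_values[1], X)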

## 3. Local Explainability (Individual Cases)

We examine one example each of a true positive, a false positive, and a false negative to see which features drove the model's prediction.

### 3.1 True Positive (Correctly Identified Fraud)

In [None]:
y_pred = model.predict(X)
tp_idx = np.where((y == 1) & (y_pred == 1))[0][0]
print(f"Analyzing True Positive at index: {tp_idx}")
shap.force_plot(explainer.expected_value[1], shap_values[1][tp_idx], X.iloc[tp_idx])

### 3.2 False Positive (Legitimate flagged as Fraud)

In [None]:
fp_idx = np.where((y == 0) & (y_pred == 1))[0][0]
print(f"Analyzing False Positive at index: {fp_idx}")
shap.force_plot(explainer.expected_value[1], shap_values[1][fp_idx], X.iloc[fp_idx])

### 3.3 False Negative (Fraud missed by Model)

In [None]:
fn_idx = np.where((y == 1) & (y_pred == 0))[0][0]
print(f"Analyzing False Negative at index: {fn_idx}")
shap.force_plot(explainer.expected_value[1], shap_values[1][fn_idx], X.iloc[fn_idx])
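The three cells above select a case with `np.where(...)[0][0]`, which raises an `IndexError` when the model produces no case of that type (for example, no false negatives on the evaluated data). A minimal sketch of a safer lookup follows. The helper name `first_case` and the toy labels are made up for illustration.

```python
import numpy as np


def first_case(y_true, y_hat, true_label, pred_label):
    """Return the first positional index with the given (actual, predicted) pair, or None."""
    mask = (np.asarray(y_true) == true_label) & (np.asarray(y_hat) == pred_label)
    hits = np.flatnonzero(mask)
    return int(hits[0]) if hits.size else None


# Made-up labels for illustration only.
toy_y = np.array([0, 1, 0, 1, 1])
toy_pred = np.array([0, 1, 1, 1, 1])

tp = first_case(toy_y, toy_pred, 1, 1)  # true positive
fp = first_case(toy_y, toy_pred, 0, 1)  # false positive
fn = first_case(toy_y, toy_pred, 1, 0)  # false negative (none in this toy data)
print(tp, fp, fn)
```

In the notebook, a `None` result would be a signal to skip the corresponding force plot rather than fail the cell.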

## 4. Interpretation and Recommendations

### Comparison of Importances
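- Built-in importance is the mean decrease in impurity across the forest's trees. It gives one global, unsigned score per feature and is known to favor high-cardinality features.
- SHAP values attribute each individual prediction to its features, showing both the direction and the size of every feature's contribution to the predicted fraud probability.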
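To make the idea of a signed, per-prediction contribution concrete, the sketch below computes exact Shapley values by brute force for a tiny made-up scoring function. This is not the algorithm `TreeExplainer` uses, and the scoring rule, thresholds, and baseline row are assumptions chosen for illustration. It only demonstrates the additivity property: the contributions sum to the difference between the explained score and the baseline score.

```python
from itertools import combinations
from math import factorial


def toy_score(row):
    """Made-up scoring rule standing in for a model's fraud probability."""
    time_since_signup, device_freq, purchase_value = row
    score = 0.1
    if time_since_signup < 1.0:  # hours
        score += 0.5
    if device_freq > 3:
        score += 0.2
    if time_since_signup < 1.0 and purchase_value > 100:
        score += 0.1  # interaction between two features
    return score


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x relative to a single baseline row."""
    n = len(x)

    def value(subset):
        # Features in the subset take their real value; the rest take the baseline.
        mixed = [x[j] if j in subset else baseline[j] for j in range(n)]
        return f(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


x = [0.2, 5, 250.0]         # transaction being explained
baseline = [48.0, 1, 30.0]  # reference transaction
phi = shapley_values(toy_score, x, baseline)
print([round(p, 3) for p in phi])
print(round(sum(phi), 3), round(toy_score(x) - toy_score(baseline), 3))
```

Here the interaction bonus is split between `time_since_signup` and `purchase_value`, which is why the first feature receives slightly more than its standalone 0.5.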

### Top 5 Fraud Drivers
1. **time_since_signup**: Short intervals are highly predictive of fraud.
2. **device_freq**: Devices used for multiple accounts are high risk.
3. **country**: Specific regions show significantly higher fraud ratios.
4. **purchase_value**: High-value transactions are often targets for fraud.
5. **hour_of_day**: Peak fraud times often coincide with low-traffic periods.
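A ranking like the one above can be read off the summary plot, which orders features by the magnitude of their SHAP values. The sketch below computes such a ranking explicitly as the mean absolute SHAP value per feature. The array is made up for illustration. In the notebook, the fraud-class SHAP matrix and `X.columns` from section 2.2 would presumably take its place.

```python
import numpy as np
import pandas as pd

feature_names = ["time_since_signup", "device_freq", "purchase_value"]

# Made-up SHAP matrix (rows = transactions, columns = features), illustration only.
toy_shap = np.array([
    [0.40, 0.10, -0.02],
    [-0.30, 0.05, 0.01],
    [0.50, -0.20, 0.03],
])

# Average the magnitudes so positive and negative contributions do not cancel.
mean_abs_shap = pd.Series(np.abs(toy_shap).mean(axis=0), index=feature_names)
top_drivers = mean_abs_shap.sort_values(ascending=False)
print(top_drivers)
```

Taking absolute values first matters: a feature that pushes some predictions up and others down can have a near-zero plain mean while still being highly influential.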

### Business Recommendations
1. **Velocity Checks**: Transactions occurring within 1 hour of signup should trigger mandatory multi-factor authentication (MFA).
2. **Device Fingerprinting**: Flag devices associated with more than 3 unique user IDs within a 24-hour window.
3. **Localized Controls**: Apply additional verification steps (dynamic friction) to transactions from the high-risk countries identified in the SHAP analysis.
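The first two recommendations can be prototyped as simple rules before any production work. The sketch below assumes raw columns named `user_id`, `device_id`, `signup_time`, and `purchase_time` (the ones dropped in section 1). The transactions are made up, and the helper name `devices_over_limit` is hypothetical.

```python
import pandas as pd

# Made-up transactions for illustration only.
tx = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "device_id": ["A", "A", "A", "A", "B", "B"],
    "signup_time": pd.to_datetime([
        "2024-01-01 09:30", "2023-12-20 08:00", "2024-01-01 14:50",
        "2023-11-05 10:00", "2023-12-31 09:00", "2024-01-02 08:00",
    ]),
    "purchase_time": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 12:00", "2024-01-01 15:00",
        "2024-01-01 20:00", "2024-01-01 11:00", "2024-01-03 09:00",
    ]),
})

# Rule 1: purchases made less than 1 hour after signup require MFA.
tx["needs_mfa"] = (tx["purchase_time"] - tx["signup_time"]) < pd.Timedelta(hours=1)


def devices_over_limit(df, max_users=3, window=pd.Timedelta(hours=24)):
    """Rule 2: devices seen with more than max_users distinct users inside the window."""
    flagged = set()
    for device, group in df.sort_values("purchase_time").groupby("device_id"):
        times = group["purchase_time"].tolist()
        users = group["user_id"].tolist()
        start = 0
        for end in range(len(times)):
            # Advance the left edge until the span is shorter than the window.
            while times[end] - times[start] >= window:
                start += 1
            if len(set(users[start:end + 1])) > max_users:
                flagged.add(device)
                break
    return flagged


flagged_devices = devices_over_limit(tx)
print(tx[["user_id", "needs_mfa"]])
print(flagged_devices)
```

The sliding window is written as an explicit loop for readability. On a large transaction log a vectorized or database-side implementation would likely be preferable.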