# German Credit — Random Forest SHAP Computation

This notebook trains a Random Forest classifier on the **German Credit** dataset and computes SHAP values and SHAP interaction values for model explainability. The results are saved to disk for downstream visualization.

In [1]:
import os
import shap
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_openml

## Load the dataset

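Fetch the German Credit dataset (`credit-g`, version 1) from OpenML. The target is binarized as `1` (good credit) vs `0` (bad credit). Categorical features are one-hot encoded with the first level of each feature dropped (`drop_first=True`), and all columns are cast to `float`. The feature matrix and the target are persisted as pickles for reuse in visualization notebooks.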

In [2]:
credit = fetch_openml(name="credit-g", version=1, as_frame=True)
X = credit.data
y = (credit.target == "good").astype(int)
X = pd.get_dummies(X, drop_first=True).astype(float)

os.makedirs("../../data/credit/rf", exist_ok=True)
X.to_pickle("../../data/credit/x_values.pkl")
y.to_pickle("../../data/credit/y_values.pkl")
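
To illustrate what `drop_first=True` does to the column layout, here is a minimal sketch on a toy frame. The column names `age` and `housing` and their values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame: one numeric column, one categorical column.
toy = pd.DataFrame(
    {
        "age": [25, 40, 33],
        "housing": pd.Categorical(["own", "rent", "free"]),
    }
)

# Same call pattern as in the cell above.
encoded = pd.get_dummies(toy, drop_first=True).astype(float)

# The first category in sorted order ("free") gets no column of its own;
# a row with zeros in every housing_* column stands for it.
print(list(encoded.columns))
print(encoded)
```

Dropping one level per feature removes a redundant column, which also means the dropped level never appears as its own feature in later SHAP output.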

## Train/test split

Split the data into 80% training and 20% test sets with a fixed random seed for reproducibility.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

## Train the Random Forest classifier

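Fit a `RandomForestClassifier` with 500 trees, a maximum depth of 6, and at least 5 samples per leaf. The model is trained on the **full** dataset (not just the training split) so that SHAP explanations cover all observations. The `X_train`/`X_test` split above is therefore not used for fitting, and a score computed on `X_test` with this model would not be a held-out estimate.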

In [None]:
model = RandomForestClassifier(
    n_estimators=500,
    max_depth=6,
    min_samples_leaf=5,
    random_state=7,
    n_jobs=-1,
)
model.fit(X, y)
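
Because the model above sees every row during fitting, it cannot be scored fairly on `X_test`. The following is a minimal sketch of how a held-out estimate could be obtained from a second model fit only on a training split. It is not part of the original notebook: the synthetic data from `make_classification`, the name `eval_model`, and the reduced tree count are all assumptions chosen to keep the block self-contained and fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (X, y); in the notebook the real frames would be used.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7
)

# Hypothetical evaluation-only model: same depth and leaf settings as above,
# fewer trees to keep the example fast.
eval_model = RandomForestClassifier(
    n_estimators=100, max_depth=6, min_samples_leaf=5, random_state=7
)
eval_model.fit(X_tr, y_tr)

# Probability of the positive class on rows the model has never seen.
test_auc = roc_auc_score(y_te, eval_model.predict_proba(X_te)[:, 1])
print(f"held-out ROC AUC: {test_auc:.3f}")
```

Keeping the evaluation model separate leaves the full-data model untouched for the SHAP computation.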

## Compute SHAP values

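Use `shap.TreeExplainer` to compute SHAP values for the first 500 rows of `X`. For a binary classifier the explainer returns one set of values per class; only the **positive class** (good credit) values are extracted and saved. Since the two class probabilities of the forest sum to 1, the class-0 values are the negatives of the class-1 values, so nothing is lost by keeping one class.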

In [None]:
num_samples = 500
X_shapley = X.iloc[:num_samples, :]
explainer = shap.TreeExplainer(model)

In [None]:
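shap_values = explainer.shap_values(X_shapley)
# Recent shap: array (n_samples, n_features, 2), last axis [class_0, class_1].
# Older shap: list of two (n_samples, n_features) arrays, one per class.
if isinstance(shap_values, list):
    shap_values_positive = shap_values[1]
else:
    shap_values_positive = shap_values[:, :, 1]
np.save("../../data/credit/rf/shap_values.npy", shap_values_positive)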
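
A quick sanity check before relying on the saved array is the additivity property of SHAP values: for each row, the base value plus the sum of that row's SHAP values should reproduce the model output. The sketch below is an illustration under stated assumptions, not part of the original notebook; `additivity_gap` is a hypothetical helper and the demo arrays are synthetic.

```python
import numpy as np


def additivity_gap(shap_vals, base_value, predictions):
    """Largest absolute difference, over all rows, between
    base_value + sum of SHAP values and the model output."""
    reconstructed = base_value + np.asarray(shap_vals).sum(axis=1)
    return float(np.max(np.abs(reconstructed - np.asarray(predictions))))


# Synthetic stand-in: 4 rows, 3 features, predictions built to be additive.
rng = np.random.default_rng(0)
demo_shap = rng.normal(size=(4, 3))
demo_base = 0.7
demo_pred = demo_base + demo_shap.sum(axis=1)

gap = additivity_gap(demo_shap, demo_base, demo_pred)
print(gap)
```

In this notebook the call might look like `additivity_gap(shap_values_positive, explainer.expected_value[1], model.predict_proba(X_shapley)[:, 1])`, assuming `explainer.expected_value` holds one base value per class; a result close to zero would be expected.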

## Compute SHAP interaction values

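Compute pairwise SHAP interaction values for the same 500 samples. Each sample gets an `n_features × n_features` matrix whose diagonal holds the main effects and whose off-diagonal entries hold the pairwise interaction effects. These capture feature-pair synergies and redundancies and are saved for network-based visualization.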

In [None]:
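shap_interaction_values = explainer.shap_interaction_values(X_shapley)
# Recent shap: array (n_samples, n_features, n_features, 2), last axis [class_0, class_1].
# Older shap: list of two (n_samples, n_features, n_features) arrays, one per class.
if isinstance(shap_interaction_values, list):
    shap_interaction_positive = shap_interaction_values[1]
else:
    shap_interaction_positive = shap_interaction_values[:, :, :, 1]
np.save("../../data/credit/rf/shap_interaction_values.npy", shap_interaction_positive)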
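
The interaction array can be cross-checked against the SHAP values: for tree models, each row of a sample's interaction matrix should sum to that feature's SHAP value, and the matrix should be symmetric. The sketch below shows one way to test both properties. It is a hedged illustration; `interaction_row_sum_gap` and `is_symmetric` are hypothetical helpers, and the demo arrays are synthetic.

```python
import numpy as np


def interaction_row_sum_gap(shap_vals, interaction_vals):
    """Largest absolute difference between the row sums of each
    interaction matrix and the corresponding SHAP values."""
    row_sums = np.asarray(interaction_vals).sum(axis=2)
    return float(np.max(np.abs(row_sums - np.asarray(shap_vals))))


def is_symmetric(interaction_vals, atol=1e-8):
    """True if every per-sample interaction matrix equals its transpose."""
    vals = np.asarray(interaction_vals)
    return bool(np.allclose(vals, np.swapaxes(vals, 1, 2), atol=atol))


# Synthetic stand-in: 4 samples, 3 features, symmetric by construction.
rng = np.random.default_rng(0)
raw = rng.normal(size=(4, 3, 3))
demo_inter = (raw + np.swapaxes(raw, 1, 2)) / 2
demo_shap = demo_inter.sum(axis=2)

print(interaction_row_sum_gap(demo_shap, demo_inter))
print(is_symmetric(demo_inter))
```

Applied to the arrays in this notebook, the call would be `interaction_row_sum_gap(shap_values_positive, shap_interaction_positive)`. Small non-zero gaps from floating-point error are to be expected.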