# Notebook 2 — Breast Cancer Wisconsin
## Explainability Demo (Global + Local SHAP)

## Scope & Disclaimer

This notebook demonstrates SHAP-based explainability on a clean, well-known medical classification dataset (Breast Cancer Wisconsin).

It is intended as a **method demonstration** of explainability (**global + local**). No clinical performance or deployment claims are made.

Outputs from this notebook are treated as explainability evidence artefacts (C2/C3) that support governance review and auditability.

## Key Goal
Produce audit-friendly explainability artefacts:
- **C2**: Global SHAP summary (feature importance + direction)
- **C3**: Local SHAP waterfall (case-level explanation)



## Environment & Data Access (Sanitized)

To keep the notebook concise and review-focused, environment setup
and data download steps are not included.

The notebook assumes that required data and dependencies are
available in the execution environment.



## Step 1 — Dataset Loading (Breast Cancer Wisconsin)

Load the Breast Cancer Wisconsin dataset from scikit-learn.

**Purpose**
Use a clean, well-known medical dataset to demonstrate explainability methods in a stable “best-case” scenario.


In [None]:
# === Step 1 — Dataset Loading ===

# -----------------------------
# Imports used in this cell
# -----------------------------
import pandas as pd
from IPython.display import display
from sklearn.datasets import load_breast_cancer

# -----------------------------
# Load dataset from sklearn
# -----------------------------
data = load_breast_cancer()
# NOTE: sklearn breast cancer target: 0 = malignant, 1 = benign

X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target, name="target")

label_map = {0: "malignant", 1: "benign"}
y_label = y.map(label_map)

print("X shape:", X.shape)
print("Class distribution:\n", y_label.value_counts())
display(X.head())


## Step 2 — Train/Test Split

Split the dataset into train and test sets (stratified).

**Purpose**
Ensure a fair and reproducible evaluation split while preserving class distribution.


In [None]:
# === Step 2 — Train/Test Split ===

# -----------------------------
# Split data into train/test
# -----------------------------
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    random_state=42,
    stratify=y
)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)

print("Train class ratio:", y_train.value_counts(normalize=True).to_dict())
print("Test class ratio:", y_test.value_counts(normalize=True).to_dict())


## Step 3 — Minimal Data Preprocessing (Stability-Oriented)

Apply a minimal preprocessing setup to ensure numerical stability
and reproducibility, without introducing complex feature engineering.

Preprocessing choices are intentionally limited to:
- Simple imputation for missing values (if present)
- Standardization of numeric features

**Purpose**  
Provide stable, well-conditioned inputs for the model while keeping
the preprocessing pipeline transparent and easy to review, with no hidden transformations.
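
A minimal, dependency-free sketch of what the two preprocessing steps do. It assumes missing values are represented as `None` (scikit-learn's `SimpleImputer` looks for `NaN` by default), and the helper names `fit_preprocess` and `apply_preprocess` are hypothetical. It illustrates that the median, mean, and standard deviation are learned from training values only and then reused unchanged on test values, which is what fitting the pipeline on `X_train` is meant to achieve.

```python
from statistics import mean, median, pstdev

def fit_preprocess(train_column):
    """Learn imputation and scaling parameters from training values only."""
    observed = [v for v in train_column if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in train_column]
    # pstdev is the population standard deviation (ddof=0), as StandardScaler uses
    return {"fill": fill, "mean": mean(filled), "std": pstdev(filled)}

def apply_preprocess(column, params):
    """Impute with the training median, then standardize with training statistics."""
    filled = [params["fill"] if v is None else v for v in column]
    return [(v - params["mean"]) / params["std"] for v in filled]

# Toy single-feature data, for illustration only.
train = [10.0, 12.0, None, 14.0, 16.0]
test = [None, 20.0]

params = fit_preprocess(train)
test_scaled = apply_preprocess(test, params)
print(params)
print(test_scaled)
```

The sketch omits edge cases such as zero-variance features, which the real transformers handle on their own.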




In [None]:
# === Step 3 — Data Preprocessing ===

# -----------------------------
# Define preprocessing pipeline
# -----------------------------
preprocess = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])


## Step 4 — Model Training (Logistic Regression)

Train a Logistic Regression classifier using the preprocessing pipeline.

**Purpose**
Use an interpretable baseline model that works well on tabular data and is compatible with SHAP.
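
As a sketch of why a linear model is a convenient SHAP target, assume the features are treated as independent and the explanation is computed on the model's log-odds output (both are assumptions about the explainer configuration, not statements verified in this notebook). With standardized inputs $x_j$, coefficients $w_j$, and intercept $b$:

$$
f(x) = b + \sum_j w_j x_j, \qquad
\phi_j(x) = w_j \left( x_j - \mathbb{E}[x_j] \right), \qquad
f(x) = \mathbb{E}[f(X)] + \sum_j \phi_j(x)
$$

Here $f(x)$ is the log-odds of class 1 (benign), $\phi_j(x)$ is the SHAP value of feature $j$, and the expectation is taken over the background data. Under these assumptions each contribution is the coefficient times the feature's deviation from its background mean, so SHAP values can be checked by hand against `clf.coef_`.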


In [None]:
# === Step 4 — Model Training ===

# -----------------------------
# Define model
# -----------------------------
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=5000, random_state=42, class_weight="balanced")


# -----------------------------
# Build end-to-end pipeline
# -----------------------------
model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("clf", clf)
])

# -----------------------------
# Train model
# -----------------------------
model.fit(X_train, y_train)

print("Training done.")


## Step 5 — Model Performance Evaluation

Evaluate the trained model on a held-out test set
and report baseline classification metrics.

Reported metrics are provided as **descriptive evidence**
to contextualize the explainability results.
They must not be interpreted as indicators of clinical utility
or deployment readiness.

**Purpose**  
Establish a transparent performance reference
to support later explainability and audit review.



In [None]:
# === Step 5 — Model Performance Evaluation ===

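# -----------------------------
# Predict probabilities on the test set
# -----------------------------
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay, classification_report, confusion_matrix, roc_auc_score

y_proba = model.predict_proba(X_test)[:, 1]  # probability for class 1 (benign)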

# -----------------------------
# ROC-AUC
# -----------------------------
auc = roc_auc_score(y_test, y_proba)
print(f"Test ROC-AUC: {auc:.4f}")

# -----------------------------
# ROC curve plot
# -----------------------------
# NOTE: Here ROC treats label=1 (benign) as the positive class,
# because we use predict_proba[:, 1].

RocCurveDisplay.from_predictions(y_test, y_proba)
plt.title("ROC Curve (Test Set)")
plt.show()

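# -----------------------------
# Confusion matrix (default threshold 0.5 for benign)
# NOTE: benign (1) is the positive class here, so
#   TN = malignant predicted malignant, FP = malignant predicted benign,
#   FN = benign predicted malignant,    TP = benign predicted benign
# -----------------------------
y_pred = (y_proba >= 0.5).astype(int)
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print("Confusion Matrix:\n", cm)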

# -----------------------------
# Classification report
# -----------------------------
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred, target_names=["malignant(0)", "benign(1)"]))


AUC is reported as a baseline metric in this demonstration setup.

This result is provided as **evidence** and is **not** used to claim clinical readiness or deployment suitability.
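
Because label 1 (benign) is the positive class in this notebook, the usual "sensitivity" reading of the confusion matrix refers to benign cases. The sketch below re-reads the same four counts with malignant as the class of interest. The function name `malignant_metrics` and the counts are hypothetical, chosen only to illustrate the arithmetic.

```python
def malignant_metrics(tn, fp, fn, tp):
    """Re-read a benign-positive confusion matrix with malignant as the class of interest."""
    # With labels 0 = malignant and 1 = benign:
    #   tn: malignant predicted malignant, fp: malignant predicted benign
    #   fn: benign predicted malignant,    tp: benign predicted benign
    sensitivity = tn / (tn + fp)  # share of malignant cases that were caught
    specificity = tp / (tp + fn)  # share of benign cases that were cleared
    return {"malignant_sensitivity": sensitivity, "malignant_specificity": specificity}

# Hypothetical counts for illustration only, not results from this notebook.
result = malignant_metrics(tn=45, fp=5, fn=10, tp=90)
print(result)
```

In this framing, `fp` counts malignant cases predicted as benign, which is typically the error a reviewer would want to see first.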


## Step 6 — Global Explainability (C2: SHAP Summary)

Compute SHAP values on the test set and generate a global SHAP summary plot.

**Evidence Produced**
- Feature importance ranking
- Direction and magnitude of feature contributions

**Purpose**
Provide global transparency into which features most influence model predictions (audit-friendly evidence).
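
A common way to turn per-case SHAP values into a global ranking is the mean absolute SHAP value per feature; the summary plot's feature ordering is understood to follow an equivalent rule. The sketch below shows that calculation in plain Python. The function name `rank_features` and the SHAP values are hypothetical, not outputs of this notebook.

```python
def rank_features(shap_rows, feature_names):
    """Rank features by mean absolute SHAP value across rows (largest first)."""
    n_rows = len(shap_rows)
    scores = {
        name: sum(abs(row[j]) for row in shap_rows) / n_rows
        for j, name in enumerate(feature_names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical SHAP values (rows = cases, columns = features), for illustration only.
names = ["mean radius", "mean texture", "worst area"]
rows = [
    [0.5, -0.1, -2.0],
    [-1.5, 0.3, 1.0],
]

ranking = rank_features(rows, names)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

The absolute value matters: a feature that pushes some cases strongly towards benign and others strongly towards malignant would average out near zero without it.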


In [None]:
# === Step 6 — Global Explainability (C2) ===

# -----------------------------
# Transform data using the same preprocessing steps
# (SHAP needs the final numeric feature space)
# -----------------------------
fitted_preprocess = model.named_steps["preprocess"]
trained_clf = model.named_steps["clf"]
feature_names = X.columns.tolist()

X_train_proc = fitted_preprocess.transform(X_train)
X_test_proc  = fitted_preprocess.transform(X_test)


# -----------------------------
# Build SHAP explainer for linear model
# (training data serves as the background distribution)
# -----------------------------
import matplotlib.pyplot as plt
import shap

explainer = shap.Explainer(trained_clf, X_train_proc, feature_names=feature_names)

# -----------------------------
# Compute SHAP values for test set
# -----------------------------
shap_values = explainer(X_test_proc)

# -----------------------------
# Global SHAP summary plot (beeswarm)
# -----------------------------
shap.summary_plot(shap_values, features=X_test_proc, feature_names=feature_names, show=False)
# Positive SHAP values push the prediction towards class 1 (benign)
plt.title("C2 — Global SHAP Summary (Test Set)")
plt.show()


## Step 7 — Local Explainability (C3: SHAP Waterfall)

Select representative individual cases (e.g., high-confidence malignant, or a misclassified case)
and generate SHAP waterfall plots.

**Evidence Produced**
- Case-level explanation of model decisions

**Purpose**
Support human-in-the-loop review by answering:
“Why did the model produce this output for this specific case?”
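
A waterfall plot can be read as a running sum: start at the base value and add each feature's contribution to reach the model output for that case. The sketch below assumes the explanation is expressed in log-odds (which is expected for a linear explainer on logistic regression, but worth confirming for the installed SHAP version) and converts the total to a benign probability. The function name `waterfall_total` and all numbers are hypothetical.

```python
import math

def waterfall_total(base_value, contributions):
    """Add per-feature SHAP contributions to the base value and map log-odds to a probability."""
    log_odds = base_value + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-log_odds))
    return log_odds, probability

# Hypothetical values for one case, for illustration only.
base = 0.5
contribs = {"worst area": -2.0, "mean radius": -1.0, "mean texture": 0.5}

log_odds, p_benign = waterfall_total(base, contribs)
print(f"log-odds = {log_odds:.2f}, P(benign) = {p_benign:.3f}")
```

Comparing this reconstructed probability with the model's own `predict_proba` output for the same case is a quick consistency check before a waterfall is filed as evidence.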


In [None]:
# === Step 7 — Local Explainability (C3) ===

# -----------------------------
# Select a representative case:
# Option A: pick the most "malignant-like" case (lowest benign probability)
# NOTE: idx is a positional index into the test set, not a DataFrame label
# -----------------------------
import numpy as np

idx = int(np.argmin(y_proba))
print("Selected test index:", idx)
print("True label:", label_map[int(y_test.iloc[idx])])
print("Predicted benign probability:", float(y_proba[idx]))

# -----------------------------
# Build local explanation
# -----------------------------
local_exp = shap_values[idx]

# Waterfall plot (top features)
shap.plots.waterfall(local_exp, max_display=12, show=False)
plt.gca().set_title("C3 — Local SHAP Waterfall (Selected Case)")
plt.show()




## Step 8 (Optional) — Reliability Signal via Simple Slicing

This optional step illustrates group-wise performance differences; it is not a fairness conclusion.
The dataset contains no protected attributes (e.g., sex, race, age), so we do **not** perform a fairness assessment.
Instead, we demonstrate a minimal **reliability slicing** approach:

- Choose a proxy feature (e.g., `mean radius`)
- Split the test set into two groups (low vs high, based on median)
- Compare performance (AUC) across slices

**Purpose**
Provide a governance-style early warning signal: if performance differs strongly across slices, trigger review.
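
One practical caveat with slicing is that AUC is undefined when a slice contains only one class, which can happen when the proxy feature is strongly related to the label. The sketch below is a dependency-free illustration of a slice comparison that returns `None` instead of failing in that case. It uses a pairwise definition of AUC as a stand-in for a library implementation; the function names and data are hypothetical.

```python
def pairwise_auc(labels, scores):
    """AUC as the probability that a positive case outscores a negative one (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined when a slice contains a single class
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def slice_report(labels, scores, in_low_slice):
    """Compute per-slice AUC and the gap; the gap is None if either slice is undefined."""
    low = [(y, s) for y, s, m in zip(labels, scores, in_low_slice) if m]
    high = [(y, s) for y, s, m in zip(labels, scores, in_low_slice) if not m]
    auc_low = pairwise_auc(*zip(*low)) if low else None
    auc_high = pairwise_auc(*zip(*high)) if high else None
    gap = None if None in (auc_low, auc_high) else abs(auc_high - auc_low)
    return {"auc_low": auc_low, "auc_high": auc_high, "gap": gap}

# Toy labels (1 = benign), scores, and slice membership, for illustration only.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.7, 0.6, 0.1]
in_low = [True, True, True, True, False, False, False, False]

report = slice_report(labels, scores, in_low)
print(report)
```

On small slices a gap can be large purely by chance, so a signal like this is best read as a prompt for review rather than a finding.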


In [None]:
# === Step 8 — Optional Proxy Slicing ===

# -----------------------------
# Proxy subgroup split by median of "mean radius"
# -----------------------------
proxy_feature = "mean radius"
median_val = X_test[proxy_feature].median()

group_low = X_test[proxy_feature] < median_val
group_high = ~group_low

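def group_auc(mask):
    # Convert the boolean Series to a NumPy mask so that indexing y_proba is explicitly positional
    m = mask.to_numpy()
    return roc_auc_score(y_test[m], y_proba[m])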

auc_low = group_auc(group_low)
auc_high = group_auc(group_high)
ratio = max(auc_low, auc_high) / min(auc_low, auc_high)
gap = abs(auc_high - auc_low)

print(f"Proxy split feature: {proxy_feature}")
print(f"AUC (low):  {auc_low:.4f}")
print(f"AUC (high): {auc_high:.4f}")
print(f"Ratio (max/min): {ratio:.3f}")
print(f"Gap: {gap:.4f}")


## Step 9 (Optional) — Review Trigger (Demo Policy)

This policy is included for portfolio completeness and mirrors the governance pattern used in Notebook 1.

Define a simple human-in-the-loop review policy:
- Flag low-confidence predictions for manual review
- Track proxy subgroup gaps as a review trigger
- When triggered → log + manual review + document decision

**Purpose**
Demonstrate an operational oversight mechanism (review triggers + thresholds).
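
The two rules can be expressed as one small function that returns a reviewable record. The sketch below is a plain-Python illustration; the function names, the record fields, and the example probabilities are hypothetical, while the band `[0.40, 0.60]` and the gap threshold `0.05` are the demo values used in this step.

```python
def flag_low_confidence(probabilities, band=(0.40, 0.60)):
    """Return positions of predictions whose benign probability falls inside the review band."""
    low, high = band
    return [i for i, p in enumerate(probabilities) if low <= p <= high]

def review_record(probabilities, proxy_gap, gap_threshold=0.05):
    """Summarize which demo review rules fired for a run."""
    flagged = flag_low_confidence(probabilities)
    gap_triggered = proxy_gap is not None and proxy_gap > gap_threshold
    return {
        "low_confidence_positions": flagged,
        "low_confidence_rate": len(flagged) / len(probabilities),
        "proxy_gap_triggered": gap_triggered,
        "manual_review_required": bool(flagged) or gap_triggered,
    }

# Hypothetical probabilities and gap, for illustration only.
record = review_record([0.05, 0.45, 0.97, 0.60, 0.88], proxy_gap=0.02)
print(record)
```

Returning a dictionary rather than printing keeps the outcome easy to attach to an audit log entry.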

In [None]:
# === Step 9 — Human Review Policy (Demo) ===

# -----------------------------
# Low-confidence rule (demo):
# Review if predicted benign probability is between [0.40, 0.60]
# -----------------------------
low_conf_mask = (y_proba >= 0.40) & (y_proba <= 0.60)
low_conf_rate = low_conf_mask.mean()
review_indices = X_test.index[low_conf_mask].tolist()


print(f"Low-confidence review rate: {low_conf_rate:.3f}")
print("Example review indices (first 10):", review_indices[:10])
print("Example review probs (first 10):", y_proba[low_conf_mask][:10])

# -----------------------------
# Proxy gap review rule (demo):
# If AUC gap > 0.05 => trigger review
# -----------------------------
proxy_gap = abs(auc_high - auc_low)
review_trigger = proxy_gap > 0.05

print(f"Proxy AUC gap: {proxy_gap:.4f}")
print("Review trigger:", review_trigger)


## Step 10 — Record-Keeping & Audit Trail (Art. 12 Spirit)

For each run, a machine-readable audit log is generated, capturing:
- Dataset and split identifiers
- Model configuration
- Evaluation metrics
- Explainability artefacts
- Reliability slicing signals (proxy-based)
- Review policy outcomes

**Purpose**  
Provide an auditable, traceable evidence trail that bridges
a notebook-based demo and regulatory inspection requirements.

Audit logging logic is shown in abstracted form.

Exact variable bindings are omitted in this sanitized public version, as data acquisition and execution context are intentionally excluded.
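
For orientation only, one possible shape for such a record is sketched below using the standard library. This is not the logging code used in the full environment: the function name `build_audit_record`, the field names, and the artefact file names are hypothetical, and metric values are left as `None` placeholders.

```python
import hashlib
import json
from datetime import datetime, timezone

def build_audit_record(run_config, metrics, artefacts, signals, review):
    """Assemble one machine-readable audit record for a notebook run."""
    record = {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "run_config": run_config,
        "metrics": metrics,
        "artefacts": artefacts,
        "signals": signals,
        "review": review,
    }
    # Hash the content without the timestamp so identical runs produce the same digest
    content = {k: v for k, v in record.items() if k != "timestamp_utc"}
    payload = json.dumps(content, sort_keys=True)
    record["content_sha256"] = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return record

# All values below are placeholders, not results from this notebook.
record = build_audit_record(
    run_config={"dataset": "sklearn breast cancer", "test_size": 0.25,
                "random_state": 42, "model": "LogisticRegression"},
    metrics={"roc_auc": None},
    artefacts={"C2": "c2_global_shap.png", "C3": "c3_local_waterfall.png"},
    signals={"proxy_feature": "mean radius", "auc_gap": None},
    review={"low_confidence_band": [0.40, 0.60], "gap_threshold": 0.05, "triggered": None},
)
print(json.dumps(record, indent=2))
```

Writing one such JSON object per run to an append-only file would give a reviewer a simple trail to follow, though the storage choice is outside the scope of this demo.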


In [None]:
# Pseudo-code (sanitized)
# In the full execution environment, audit records would capture
# run metadata, evaluation metrics, and explainability artefacts.

# implementation detail intentionally abstracted


## Summary — Evidence Generated

### Explainability Evidence
- ✅ C2: Global SHAP summary
- ✅ C3: Local SHAP waterfall

### (Optional) Reliability / Governance Signals
- ◻️ Simple slicing or proxy comparison
- ◻️ Review trigger logic (demo-level)

### Record-Keeping
- ✅ Audit log structure demonstrated (see Notebook 1)

