# Notebook 1 — Stroke Dataset
## Bias Analysis & Explainability (XGBoost + SHAP)

(Sanitized public version)


## Scope & Governance Disclaimer

This notebook demonstrates a **high-risk medical tabular AI use case**
(stroke risk prediction) with a focus on:

- explainability (SHAP),
- bias and subgroup reliability analysis,
- and human-in-the-loop review policies.

It is intended as a **governance- and compliance-oriented demonstration**
and does **not** represent a deployed clinical decision system.

**Outputs from this notebook are treated as governance evidence artefacts (C2/C3/D1/D3/D4) and are linked in the repository’s evidence structure. No clinical claims are made.**


## Environment & Data Access (Sanitized)

Environment setup and dataset acquisition steps are intentionally excluded
to avoid exposing operational details and credential handling practices.

The scope of this notebook is limited to governance-relevant artifacts,
including explainability outputs, evaluation evidence, and audit logs.


### Dataset Provenance & Reproducibility Notes

* The dataset is retrieved from Kaggle using the official Kaggle CLI.
* The dataset slug is recorded to support reproducibility.
* Results may change if the dataset is updated on Kaggle; therefore, this notebook documents the access method and dataset identifier used at execution time.


## Step 1 — Data Preprocessing

Perform basic data cleaning and preparation:

* Handle missing values (e.g. BMI, smoking status)
* Separate features and target label (stroke)
* Split data into training and test sets
* Encode categorical variables and pass through numerical features
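
Because results may change if the dataset is updated on Kaggle, one way to make a run verifiable is to store a content hash of the downloaded file next to the dataset identifier. The sketch below is an assumption about how this could be done, not the notebook's actual procedure. `file_sha256`, `provenance_record`, and the record fields are hypothetical names, and the slug is supplied by the caller rather than hard-coded.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_record(path, dataset_slug):
    """Bundle the dataset identifier with a content hash and access time."""
    return {
        "dataset_slug": dataset_slug,  # identifier recorded at execution time
        "file_name": Path(path).name,
        "sha256": file_sha256(path),   # changes if the file content changes
        "accessed_at": datetime.now(timezone.utc).isoformat(),
    }
```

A call such as `provenance_record(DATA_PATH, slug)` could be made once after loading, with `slug` set to the identifier recorded for the run, and the result stored with the other run evidence.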

### Purpose
Prepare data for model training while preserving demographic attributes (age, gender) for subgroup analysis.

Preprocessing choices are kept minimal for demonstration and auditability; extensive feature engineering is omitted.

### Expected columns
age, gender, hypertension, heart_disease, avg_glucose_level, bmi, smoking_status, stroke
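
A quick guard against schema drift is to compare the loaded column names with the expected list above before any cleaning. This is a minimal sketch; `EXPECTED_COLUMNS` and `missing_columns` are hypothetical helper names, and extra columns (such as an `id` field) are deliberately tolerated.

```python
EXPECTED_COLUMNS = {
    "age", "gender", "hypertension", "heart_disease",
    "avg_glucose_level", "bmi", "smoking_status", "stroke",
}


def missing_columns(columns, expected=EXPECTED_COLUMNS):
    """Return the expected columns that are absent, sorted for stable output."""
    return sorted(set(expected) - set(columns))


# Usage with the notebook's DataFrame (assumed to be named `df`):
#   missing = missing_columns(df.columns)
#   if missing:
#       raise ValueError(f"Missing expected columns: {missing}")
```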


In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# -----------------------------
# Load dataset
# -----------------------------
# DATA_PATH intentionally abstracted in the public version.
# Expected schema is documented below for reproducibility.
DATA_PATH = "data/stroke/healthcare-dataset-stroke-data.csv"  # placeholder
df = pd.read_csv(DATA_PATH)

# -----------------------------
# Basic cleaning: missing values
# -----------------------------
# Fill missing BMI with the median (simple baseline strategy).
# Note: the median is computed on the full dataset, i.e. before the train/test split.
df["bmi"] = df["bmi"].fillna(df["bmi"].median())

# Treat missing smoking status as an explicit category
df["smoking_status"] = df["smoking_status"].fillna("Unknown")

# -----------------------------
# Separate features and target
# -----------------------------
y = df["stroke"].astype(int)
X = df.drop(columns=["stroke", "id"], errors="ignore")

# -----------------------------
# Train/test split (keep demographic fields for later slicing)
# -----------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# -----------------------------
# Encoding: identify categorical vs numerical columns
# -----------------------------
categorical_cols = [c for c in X.columns if X[c].dtype == "object"]
numerical_cols = [c for c in X.columns if c not in categorical_cols]

# One-hot encode categorical features; keep numerical features as-is
preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("num", "passthrough", numerical_cols),
    ]
)

print("Preprocessing ready.")
print("Categorical columns:", categorical_cols)
print("Numerical columns:", numerical_cols)
print("Train shape:", X_train.shape, "Test shape:", X_test.shape)


## Step 2 — Model Training (XGBoost Classifier)

Train an XGBoost classification model using a preprocessing pipeline. Hyperparameter tuning is intentionally omitted; the goal is explainability and governance evidence rather than optimization.

### Model Characteristics

* Gradient-boosted decision trees
* Suitable for tabular medical data
* Compatible with SHAP explainability

### Purpose
Establish a baseline predictive model for stroke risk estimation.

In [None]:
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

# -----------------------------
# Define XGBoost model
# -----------------------------
xgb_model = XGBClassifier(
    n_estimators=200,
    max_depth=4,
    learning_rate=0.05,
    subsample=0.9,
    colsample_bytree=0.9,
    # Hyperparameters are chosen as stable defaults for reproducibility, not optimized for performance.
    reg_lambda=1.0,
    random_state=42,
    eval_metric="logloss",
    n_jobs=-1
)

# -----------------------------
# Build training pipeline
# -----------------------------
# Pipeline includes preprocessing + model training
pipeline = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("model", xgb_model),
    ]
)

pipeline.fit(X_train, y_train)

print("Model training completed.")


## Step 3 — Model Performance Evaluation

Evaluate model performance on the test set using:

* Predicted probabilities
* Area Under the ROC Curve (AUC)

### Purpose
Verify that the model reaches a reasonable performance level before explainability and bias analysis.

**Interpretation**
Performance metrics (e.g., ROC-AUC) serve only as a **sanity check** and must not be interpreted as evidence of clinical utility or deployment readiness.

In [None]:
from sklearn.metrics import roc_auc_score

# -----------------------------
# Predict probabilities on the test set
# -----------------------------
y_proba = pipeline.predict_proba(X_test)[:, 1]

# -----------------------------
# Compute AUC
# -----------------------------
auc = roc_auc_score(y_test, y_proba)
print(f"Test AUC: {auc:.4f}")


The model achieves an AUC of ~0.82 on a held-out test set in this demonstration setup.

This metric is reported as **baseline evidence** and is **not** used to claim clinical readiness or deployment suitability.
The primary focus of this notebook is explainability and governance-oriented review.
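
To make the meaning of the AUC concrete: it can be read as the fraction of (positive, negative) case pairs in which the positive case receives the higher score. The sketch below computes it that way on made-up labels and scores. `pairwise_auc` is a hypothetical helper written for illustration; it is quadratic in the number of cases and is not a replacement for `roc_auc_score`.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    Returns NaN when only one class is present.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        return float("nan")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative values only (not taken from the stroke dataset).
labels = [0, 0, 1, 0, 1]
scores = [0.05, 0.20, 0.30, 0.40, 0.90]
print(f"Pairwise AUC: {pairwise_auc(labels, scores):.4f}")  # 5 of 6 pairs ranked correctly
```

Because this view depends only on ranking, the AUC says nothing about calibration or about error rates at a particular threshold, which is one reason it is treated here as a sanity check.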


## Step 4 — Global Explainability (C2: SHAP Summary)

Compute SHAP values for the test set and generate a global SHAP summary plot.

### Evidence Produced

* Feature importance ranking
* Direction and magnitude of feature contributions

### Purpose
Provide global transparency into which clinical and demographic features most influence model predictions
(EU AI Act Art. 13 – Transparency).

**High-level Interpretation (Global)**
The global SHAP summary indicates that demographic and clinical features
(e.g., age, average glucose level, BMI) are the dominant drivers of stroke risk prediction.
Higher values of these features tend to increase the predicted stroke risk,
which is consistent with domain expectations.
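
The feature ranking in a SHAP summary is commonly based on the mean absolute SHAP value per feature. The sketch below reproduces that ranking on a small hand-written matrix. The numbers and feature names are illustrative assumptions, and `rank_by_mean_abs` is a hypothetical helper rather than a SHAP function.

```python
def rank_by_mean_abs(shap_matrix, feature_names):
    """Rank features by mean |SHAP| across rows, largest first."""
    totals = [0.0] * len(feature_names)
    for row in shap_matrix:
        if len(row) != len(feature_names):
            raise ValueError("row length does not match number of feature names")
        for j, value in enumerate(row):
            totals[j] += abs(value)
    means = [total / len(shap_matrix) for total in totals]
    return sorted(zip(feature_names, means), key=lambda kv: kv[1], reverse=True)


# Illustrative SHAP values: one row per patient, one column per feature.
names = ["num__age", "num__avg_glucose_level", "cat__gender_Male"]
toy_shap = [
    [0.80, -0.10, 0.00],
    [-0.60, 0.30, 0.05],
    [0.40, -0.20, -0.05],
]
for name, mean_abs in rank_by_mean_abs(toy_shap, names):
    print(f"{name:<26} mean|SHAP| = {mean_abs:.3f}")
```

The ranking captures magnitude only; the direction of each feature's effect has to be read from the signed values, as the summary plot does through its colour coding.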


In [None]:
import shap
import matplotlib.pyplot as plt

# -----------------------------
# Transform features for SHAP
# -----------------------------
X_test_transformed = pipeline.named_steps["preprocess"].transform(X_test)
feature_names = pipeline.named_steps["preprocess"].get_feature_names_out()

# -----------------------------
# SHAP TreeExplainer for XGBoost models
# -----------------------------
explainer = shap.TreeExplainer(pipeline.named_steps["model"])
shap_values = explainer.shap_values(X_test_transformed)

# -----------------------------
# Global SHAP summary plot (C2 evidence)
# -----------------------------
plt.figure()
shap.summary_plot(
    shap_values,
    X_test_transformed,
    feature_names=feature_names,
    show=False
)
plt.tight_layout()
plt.savefig("C2_stroke_shap_summary.png", dpi=200)
plt.show()

print("Saved: C2_stroke_shap_summary.png")


## Step 5 — Local Explainability (C3: SHAP Waterfall)

Select representative individual cases (e.g. true positive, false positive) and generate SHAP waterfall plots.

### Evidence Produced

* Case-level explanation of model decisions

### Purpose
Support human-in-the-loop review by answering
“Why did the model produce this output for this specific patient?”
(EU AI Act Art. 13 – Transparency; Art. 14 – Human Oversight).

**Human-in-the-Loop Interpretation**
This local SHAP explanation shows how individual features contributed
to the prediction for a specific patient.
Such case-level explanations enable a human reviewer to assess
whether the model’s reasoning is plausible before acting on the output.
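
A waterfall plot relies on the additivity property of Shapley values: the base value plus all feature contributions equals the model output for that case. The sketch below checks this property on a toy scoring function by brute-force enumeration of feature orderings. It is a simplified illustration under assumptions: `toy_score` is hypothetical, "absent" features are simply ignored, and this is not the tree-specific algorithm used by `shap.TreeExplainer`.

```python
from itertools import permutations


def exact_shapley(value_fn, features):
    """Average each feature's marginal contribution over all orderings."""
    names = list(features)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        present = set()
        for name in order:
            before = value_fn(present)
            present = present | {name}
            totals[name] += value_fn(present) - before
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_score(present):
    """Hypothetical risk score as a function of which features are known."""
    score = 0.10  # base value when no feature is known
    if "age" in present:
        score += 0.30
    if "glucose" in present:
        score += 0.10
    if "age" in present and "glucose" in present:
        score += 0.06  # interaction term, split evenly between both features
    return score


phi = exact_shapley(toy_score, ["age", "glucose", "bmi"])
base = toy_score(set())
full = toy_score({"age", "glucose", "bmi"})
print(phi)
print("additivity holds:", abs(base + sum(phi.values()) - full) < 1e-9)
```

A reviewer can apply the same check to a real explanation: if the base value and the plotted contributions do not sum to the model output for that case, the explanation and the prediction are not aligned.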

### Compatibility Note (Sanitized Public Version)

In some execution environments, SHAP waterfall plots may display
inconsistent feature alignment due to preprocessing pipelines
(e.g. one-hot encoding) and library version differences.

In such cases, the implementation may switch to a simplified
`shap.Explanation(...)` representation or alternative visualization
to preserve interpretability.

These implementation details are intentionally abstracted
in this public version, as the focus of this notebook is on
governance-relevant explainability evidence rather than
visualization mechanics.
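
One possible alternative representation, sketched here under assumptions, is a plain-text table of the largest contributions with an explicit alignment check. `top_contributions` and `format_contributions` are hypothetical helpers and not part of SHAP; they expect plain sequences, so a sparse feature row would need to be converted to a dense one first.

```python
def top_contributions(shap_row, feature_names, feature_row, k=5):
    """Return the k (name, value, contribution) triples with the largest |contribution|."""
    if not (len(shap_row) == len(feature_names) == len(feature_row)):
        raise ValueError("SHAP values, feature names, and feature values are misaligned")
    triples = sorted(
        zip(feature_names, feature_row, shap_row),
        key=lambda t: abs(t[2]),
        reverse=True,
    )
    return triples[:k]


def format_contributions(triples):
    """Render the triples as fixed-width text lines for a review record."""
    return "\n".join(
        f"{name:<28} value={value:>8.2f}  shap={contrib:+.3f}"
        for name, value, contrib in triples
    )


# Illustrative values only.
print(format_contributions(
    top_contributions([0.10, -0.50, 0.20], ["a", "b", "c"], [1.0, 2.0, 3.0], k=2)
))
```

Failing loudly on a length mismatch is a deliberate choice here: a misaligned explanation is worse evidence than a missing one.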


In [None]:
import numpy as np
from sklearn.metrics import confusion_matrix

# -----------------------------
# Convert probabilities to binary predictions using a fixed threshold
# -----------------------------
THRESHOLD = 0.5
y_pred = (y_proba >= THRESHOLD).astype(int)

# -----------------------------
# Select representative cases:
# - True Positive (TP): y=1, pred=1
# - False Positive (FP): y=0, pred=1
# -----------------------------
tp_indices = np.where((y_test.values == 1) & (y_pred == 1))[0]
fp_indices = np.where((y_test.values == 0) & (y_pred == 1))[0]

def plot_waterfall(index, filename):
    # Waterfall plot shows case-level feature contributions
    shap.plots._waterfall.waterfall_legacy(
        explainer.expected_value,
        shap_values[index],
        feature_names=feature_names,
        features=X_test_transformed[index],
        show=False
    )
    plt.tight_layout()
    plt.savefig(filename, dpi=200)
    plt.show()
    print(f"Saved: {filename}")

if len(tp_indices) > 0:
    plot_waterfall(tp_indices[0], "C3_stroke_shap_waterfall_TP.png")
else:
    print("No TP case found under current threshold; consider adjusting threshold or checking class imbalance.")

if len(fp_indices) > 0:
    plot_waterfall(fp_indices[0], "C3_stroke_shap_waterfall_FP.png")
else:
    print("No FP case found under current threshold; consider adjusting threshold or checking class imbalance.")


## Step 6 — Bias Analysis (D1: Subgroup Slicing)

Conduct subgroup performance analysis using:

* Age buckets (e.g. <40, 40–60, >60)
* Gender groups

Metrics evaluated:

* AUC
* False Negative Rate (FNR)
* False Positive Rate (FPR)
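
For reference, the two error rates are derived from the confusion-matrix counts at the fixed decision threshold, which is how `compute_group_metrics` calculates them:

$$
\mathrm{FNR} = \frac{FN}{FN + TP}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}
$$

Each rate is undefined when its denominator is zero (no positives or no negatives in the subgroup); the code reports `NaN` in that case.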

### Purpose
Identify potential reliability gaps across demographic subgroups
(EU AI Act Art. 9, Art. 10, Art. 15).

**Interpretation note (governance):**  

Small subgroup sample sizes may lead to unstable estimates; results should be treated as screening signals rather than conclusions.

This subgroup comparison does **not** prove discrimination.  

It serves as an early **reliability signal** that may trigger further review (data quality, representativeness, proxy feature assessment).
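
To illustrate why small subgroups give unstable estimates, the sketch below attaches a Wilson score interval to an error rate. This is an added illustration, not part of the notebook's evidence; `wilson_interval` is a hypothetical helper, `z=1.96` assumes an approximate 95% interval, and the counts are made up.

```python
import math


def wilson_interval(errors, total, z=1.96):
    """Wilson score interval for a proportion errors/total."""
    if total == 0:
        return (float("nan"), float("nan"))
    p = errors / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (max(0.0, center - half), min(1.0, center + half))


# Same observed FNR of 0.80, very different certainty (illustrative counts).
for fn, positives in [(4, 5), (400, 500)]:
    lo, hi = wilson_interval(fn, positives)
    print(f"FNR={fn / positives:.2f} with n={positives}: [{lo:.3f}, {hi:.3f}]")
```

Reporting the interval next to each subgroup rate makes it easier to separate a real gap from one caused by a handful of positive cases.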



In [None]:
import pandas as pd
from sklearn.metrics import roc_auc_score, confusion_matrix

# -----------------------------
# Helper: compute subgroup metrics
# -----------------------------
def compute_group_metrics(mask, label):
    y_true = y_test.values[mask]
    y_score = y_proba[mask]
    y_hat = (y_score >= THRESHOLD).astype(int)

    # AUC requires both classes present
    auc_val = roc_auc_score(y_true, y_score) if len(np.unique(y_true)) > 1 else np.nan

    tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()
    fnr = fn / (fn + tp) if (fn + tp) > 0 else np.nan
    fpr = fp / (fp + tn) if (fp + tn) > 0 else np.nan

    return {
        "group": label,
        "n_samples": int(mask.sum()),
        "AUC": auc_val,
        "FNR": fnr,
        "FPR": fpr,
    }

results = []

# -----------------------------
# Age buckets
# -----------------------------
age_bins = pd.cut(
    X_test["age"],
    bins=[0, 40, 60, 120],
    labels=["<40", "40-60", ">60"],
    include_lowest=True
)

for g in ["<40", "40-60", ">60"]:
    mask = (age_bins == g)
    results.append(compute_group_metrics(mask, f"age:{g}"))

# -----------------------------
# Gender groups
# -----------------------------
gender_series = X_test["gender"].astype(str)
for g in sorted(gender_series.unique()):
    mask = (gender_series == g)
    results.append(compute_group_metrics(mask, f"gender:{g}"))

bias_df = pd.DataFrame(results)
bias_df.to_csv("D1_bias_metrics_stroke.csv", index=False)

print(bias_df)
print("Saved: D1_bias_metrics_stroke.csv")


## Step 7 — Bias Visualization

Visualize subgroup performance differences using bar charts (e.g. FNR and FPR by subgroup).

### Purpose
Enable intuitive inspection of bias patterns and support governance discussions.

In [None]:
import matplotlib.pyplot as plt

# -----------------------------
# Plot FNR by subgroup
# -----------------------------
plt.figure()
bias_df.set_index("group")["FNR"].plot(kind="bar")
plt.title("False Negative Rate by Subgroup (Stroke)")
plt.ylabel("FNR")
plt.tight_layout()
plt.savefig("D1_bias_FNR_stroke.png", dpi=200)
plt.show()
print("Saved: D1_bias_FNR_stroke.png")

# -----------------------------
# Plot FPR by subgroup
# -----------------------------
plt.figure()
bias_df.set_index("group")["FPR"].plot(kind="bar")
plt.title("False Positive Rate by Subgroup (Stroke)")
plt.ylabel("FPR")
plt.tight_layout()
plt.savefig("D1_bias_FPR_stroke.png", dpi=200)
plt.show()
print("Saved: D1_bias_FPR_stroke.png")


## Step 8 — Human Review Policy Definition (D3)

Define a simple human-in-the-loop review policy, including:

* Low-confidence prediction thresholds
* Monitoring of high-risk subgroups
* Override and logging requirements

### Purpose
Demonstrate operational human oversight mechanisms
(EU AI Act Art. 14 – Human Oversight).

### Demo policy:

If a subgroup's error ratio (subgroup error rate divided by the global error rate) exceeds a defined threshold (e.g. > 1.2), trigger a human review:

- document the finding,
- inspect data representativeness and potential proxy effects,
- decide whether mitigation or retraining is required.

Note: The formal policy version is maintained under docs/policies/; the output of this notebook serves as an evidence snapshot for a specific run.
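
The ratio rule from the demo policy can be sketched as a small predicate. A ratio above 1.2 is the same condition as a relative gap above 20% over the global rate. `exceeds_ratio` is a hypothetical helper, and the rates in the example are illustrative rather than results from a run.

```python
import math


def exceeds_ratio(subgroup_rate, global_rate, max_ratio=1.2):
    """True when subgroup_rate / global_rate is strictly above max_ratio.

    Returns False when the ratio is undefined (NaN input or non-positive
    global rate); such cases need manual inspection instead.
    """
    if math.isnan(subgroup_rate) or math.isnan(global_rate) or global_rate <= 0:
        return False
    return subgroup_rate / global_rate > max_ratio


# Illustrative rates: global FNR of 0.50.
print(exceeds_ratio(0.65, 0.50))  # ratio 1.3, review triggered
print(exceeds_ratio(0.55, 0.50))  # ratio 1.1, no trigger
```

Treating an undefined ratio as "no trigger" is an assumption of this sketch; a stricter policy could escalate those cases instead.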

In [None]:
import numpy as np

# -----------------------------
# Compute global baseline metrics (for comparison)
# -----------------------------
global_row = compute_group_metrics(np.ones(len(X_test), dtype=bool), "global")
global_fnr = global_row["FNR"]
global_fpr = global_row["FPR"]

# -----------------------------
# Identify high-risk subgroups based on relative gap
# Example rule: subgroup FNR or FPR exceeds global by > 20% relative
# -----------------------------
RISK_GAP_REL = 0.20

high_risk_groups = []
for _, r in bias_df.iterrows():
    if r["group"] == "global":
        continue
    fnr_gap = (r["FNR"] - global_fnr) / global_fnr if (global_fnr and not np.isnan(global_fnr) and global_fnr > 0) else np.nan
    fpr_gap = (r["FPR"] - global_fpr) / global_fpr if (global_fpr and not np.isnan(global_fpr) and global_fpr > 0) else np.nan
    if (not np.isnan(fnr_gap) and fnr_gap > RISK_GAP_REL) or (not np.isnan(fpr_gap) and fpr_gap > RISK_GAP_REL):
        high_risk_groups.append(r["group"])

# -----------------------------
# Define low-confidence band for mandatory review
# -----------------------------
LOW_CONF_LO = 0.40
LOW_CONF_HI = 0.60

policy_md = f"""# D3_human_review_policy.md

## Intended Use
This model provides decision support for stroke risk prediction. Final clinical decisions remain with healthcare professionals.

## Review Triggers (Human-in-the-loop)
### 1) Low-confidence predictions (mandatory review)
- If predicted risk is between **{LOW_CONF_LO:.2f} and {LOW_CONF_HI:.2f}**, a human reviewer must validate the case.

### 2) High-risk subgroup monitoring (heightened oversight)
- The following subgroups showed elevated error risk compared to global baseline (relative gap threshold: {RISK_GAP_REL:.0%}):
{chr(10).join([f"- {g}" for g in high_risk_groups]) if high_risk_groups else "- (No subgroup exceeded the predefined risk gap threshold in this run.)"}

### 3) Override and logging requirements
- A human reviewer may override the model output.
- Each override must be recorded with:
  - reviewer role / ID (pseudonymous)
  - timestamp
  - model version / run ID
  - original prediction + confidence
  - final decision
  - short rationale (free text)
- Logs must not contain direct personal identifiers.

## Operational Notes
- Recalibration or retraining should be considered if persistent subgroup gaps are observed.
- Incident handling should be triggered if performance drops below defined safety thresholds.

## Evidence Links
- Bias metrics: `D1_bias_metrics_stroke.csv`
- Bias plots: `D1_bias_FNR_stroke.png`, `D1_bias_FPR_stroke.png`
- Explainability: `C2_stroke_shap_summary.png`, local waterfalls in `C3_*`
"""

with open("D3_human_review_policy.md", "w", encoding="utf-8") as f:
    f.write(policy_md)

print("Saved: D3_human_review_policy.md")
print("High-risk subgroups flagged:", high_risk_groups)


## Step 9 — Record-Keeping & Audit Trail (Art. 12 Spirit)

For each run, a machine-readable audit log is generated, capturing:
- Dataset and split identifiers
- Model configuration
- Evaluation metrics
- Explainability artefacts
- Bias/robustness signals
- Review policy outcomes

**Purpose**  
Provide an auditable, traceable evidence trail that bridges
a notebook-based demo and regulatory inspection requirements.
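
An append-only JSON Lines log is only useful as evidence if it can be read back and checked. The sketch below loads such a log and verifies that each record carries a minimal set of keys. `read_audit_log` is a hypothetical helper, and treating `run_id`, `timestamp`, `model`, and `dataset` as required is an assumption based on the record written in this step.

```python
import json

REQUIRED_KEYS = {"run_id", "timestamp", "model", "dataset"}


def read_audit_log(path):
    """Load all records from a JSON Lines audit log, validating required keys."""
    records = []
    with open(path, "r", encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
            records.append(record)
    return records


# Usage (path as written by the audit cell in this step):
#   runs = read_audit_log("audit/audit_log.jsonl")
#   print(len(runs), "audit records; latest run:", runs[-1]["run_id"])
```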

In [None]:
from datetime import datetime, timezone
from sklearn.metrics import accuracy_score, confusion_matrix
import json, uuid, os

# -----------------------------
# Prepare audit directory
# -----------------------------
os.makedirs("audit", exist_ok=True)

# -----------------------------
# Run identifiers
# -----------------------------
run_id = str(uuid.uuid4())
timestamp = datetime.now(timezone.utc).isoformat()

# -----------------------------
# Basic performance snapshot
# -----------------------------
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred).tolist()

# -----------------------------
# Audit record (Art. 12 spirit)
# -----------------------------
audit_record = {
    "run_id": run_id,
    "timestamp": timestamp,
    "project": "EU AI Act Governance Demo",
    "project_id": "Project-1-Stroke-XGBoost",
    "model": {
        "type": "XGBoost",
        "task": "stroke risk prediction (tabular)",
        "explainability_method": "SHAP"
    },
    "dataset": {
        "source": "Kaggle (public benchmark)",
        "dataset_type": "tabular clinical-style data",
        "split": "test"
    },
    "performance_snapshot": {
        "roc_auc": float(auc),  # headline metric computed in Step 3
        "accuracy": float(accuracy),
        "confusion_matrix": cm
    },
    "explainability_artifacts": [
        "C2_stroke_shap_summary.png",
        "C3_stroke_shap_waterfall_TP.png",
        "C3_stroke_shap_waterfall_FP.png"
    ],
    "bias_artifacts": [
        "D1_bias_metrics_stroke.csv",
        "D1_bias_FNR_stroke.png",
        "D1_bias_FPR_stroke.png"
    ],
    "human_oversight": {
        "review_required": True,
        "trigger": "False positive example included for explainability review",
        "policy_reference": "P4_human_oversight_review_triggers.md"
    },
    "notes": (
        "This audit record is a demonstration artifact generated "
        "for governance and compliance illustration purposes. "
        "It does not represent a production deployment log."
    )
}

# -----------------------------
# Write append-only audit log
# -----------------------------
with open("audit/audit_log.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(audit_record) + "\n")

print("Audit record written:", run_id)


## Summary of Evidence Generated

This notebook produces the following compliance-relevant artefacts:
- **C2**: Global explainability (SHAP summary)
- **C3**: Local explainability (SHAP waterfall plots)
- **D1**: Subgroup reliability analysis (age, gender)
- **D3**: Human-in-the-loop review policy
- **Audit trail**: Append-only run log (`audit/audit_log.jsonl`)
