**Notebook 04 — Model Implementation & Comparison (Baseline → Tuned)**

**1. Scope**

This notebook implements and compares predictive models for 30-day readmission using the leakage-safe dataset prepared in Notebook 02. It focuses on:

- Loading train/test splits and preprocessing artefacts
- Training baseline and candidate models (logistic regression, tree-based, boosting)
- Hyperparameter tuning (lightweight and reproducible)
- Evaluation using clinically relevant metrics (ROC-AUC, PR-AUC, Recall, Precision, F1, confusion matrix)
- Threshold analysis (default 0.5 vs optimized thresholds)
- Calibration diagnostics (optional)
- Saving the best model and evaluation artefacts for Notebook 05

Out of scope for this notebook:
- Explainability (SHAP/LIME/PDP)
- Fairness/bias auditing and mitigations
- Final narrative report and slide deck

This notebook addresses the following capstone stage:

**Step 4 – Modelling & Evaluation:**
Model implementation, tuning, comparison, thresholding, and performance reporting on a held-out test set.

Subsequent notebooks will cover:
- **Notebook 05:** Explainability + bias/fairness audit + mitigations
- **Notebook 06:** Final communication and presentation assets


**2. Inputs & Artefacts**

**2.1 Load**

In [35]:
import numpy as np
import pandas as pd
import joblib
from scipy import sparse

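# Load processed features
# (X_train_path, X_test_path, y_train_path, y_test_path and preprocess_path
# are assumed to be defined in an earlier setup cell not shown here)
X_train_processed = sparse.load_npz(X_train_path)
X_test_processed  = sparse.load_npz(X_test_path)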

# Load targets
y_train = pd.read_csv(y_train_path).squeeze()
y_test  = pd.read_csv(y_test_path).squeeze()

# Load preprocessing pipeline (optional: used for feature names / later work)
preprocess = joblib.load(preprocess_path)

print("✅ Data and preprocessing pipeline loaded.")
print("X_train shape:", X_train_processed.shape)
print("X_test shape :", X_test_processed.shape)
print("y_train shape:", y_train.shape)
print("y_test shape :", y_test.shape)


✅ Data and preprocessing pipeline loaded.
X_train shape: (81613, 2376)
X_test shape : (20153, 2376)
y_train shape: (81613,)
y_test shape : (20153,)


**2) Sanity Checks**

2.1 Why this matters

We confirm that:

- Matrices are sparse (memory efficient)
- Labels are binary
- Train/test class balance is similar (stratified split)

2.2 Checks

In [36]:
print("X_train type:", type(X_train_processed))
print("X_test type :", type(X_test_processed))

print("\nTrain target distribution:")
print(y_train.value_counts(dropna=False))
print(y_train.value_counts(normalize=True).round(4))

print("\nTest target distribution:")
print(y_test.value_counts(dropna=False))
print(y_test.value_counts(normalize=True).round(4))

X_train type: <class 'scipy.sparse._csr.csr_matrix'>
X_test type : <class 'scipy.sparse._csr.csr_matrix'>

Train target distribution:
readmitted_30_days
0    72407
1     9206
Name: count, dtype: int64
readmitted_30_days
0    0.8872
1    0.1128
Name: proportion, dtype: float64

Test target distribution:
readmitted_30_days
0    18002
1     2151
Name: count, dtype: int64
readmitted_30_days
0    0.8933
1    0.1067
Name: proportion, dtype: float64
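
The checks above can also be made fail-fast so that a bad artefact stops the notebook early. The following is a minimal sketch, assuming the same kinds of objects used here (a SciPy sparse matrix and a pandas Series of labels). The helper name `check_split` and the toy data are illustrative and not part of the original pipeline.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def check_split(X, y, name="split"):
    """Raise AssertionError if a feature matrix / label pair looks wrong."""
    # Feature matrix should be a SciPy sparse matrix (memory efficient)
    assert sparse.issparse(X), f"{name}: features are not sparse"
    # One label per row
    assert X.shape[0] == len(y), f"{name}: row count mismatch"
    # Labels must be binary with no missing values
    assert y.notna().all(), f"{name}: labels contain missing values"
    assert set(y.unique()) <= {0, 1}, f"{name}: labels are not binary"
    # Return the positive rate so train and test can be compared
    return float(y.mean())


# Toy stand-ins for the real matrices (hypothetical data)
X_toy = sparse.csr_matrix(np.eye(4))
y_toy = pd.Series([0, 0, 0, 1], name="readmitted_30_days")

toy_rate = check_split(X_toy, y_toy, name="toy")
print("positive rate:", toy_rate)
```

Calling the helper once for the train split and once for the test split would give two positive rates that can be compared directly.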


**3) Metrics & Evaluation Helper**

3.1 Why these metrics

Only about 11% of patients in this dataset are readmitted within 30 days, so accuracy alone is misleading. We therefore report:

- ROC-AUC (ranking quality)
- PR-AUC (focus on positive class performance)
- Recall (clinical sensitivity)
- Precision / F1 (quality trade-off)

3.2 Helper

In [37]:
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score, confusion_matrix
)

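def evaluate_classifier(name, y_true, y_prob, threshold=0.5):
    """Score positive-class probabilities at a given decision threshold."""
    # Convert probabilities to hard labels at the chosen threshold
    y_pred = (y_prob >= threshold).astype(int)

    return {
        "model": name,
        "threshold": threshold,
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        # Ranking metrics use the raw probabilities, not the thresholded labels
        "roc_auc": roc_auc_score(y_true, y_prob),
        "pr_auc": average_precision_score(y_true, y_prob),
        # labels=[0, 1] keeps the matrix 2x2 even if only one class appears
        "tn_fp_fn_tp": confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel().tolist()
    }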


**4) Baselines**
**4.1 Dummy baseline**

A dummy model that always predicts the majority class sets the minimum performance bar.

**4.2 Code**

In [38]:
from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy="most_frequent", random_state=42)
dummy.fit(X_train_processed, y_train)

dummy_prob = dummy.predict_proba(X_test_processed)[:, 1]
dummy_metrics = evaluate_classifier("Dummy (most_frequent)", y_test, dummy_prob, threshold=0.5)

dummy_metrics


{'model': 'Dummy (most_frequent)',
 'threshold': 0.5,
 'accuracy': 0.893266511189401,
 'precision': 0.0,
 'recall': 0.0,
 'f1': 0.0,
 'roc_auc': np.float64(0.5),
 'pr_auc': np.float64(0.10673348881059892),
 'tn_fp_fn_tp': [18002, 0, 2151, 0]}
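
The ROC-AUC of 0.5 and PR-AUC of about 0.107 above are what any constant score produces: with no ranking information, ROC-AUC is 0.5 and average precision equals the positive-class prevalence. A small sketch on synthetic labels (the data is made up for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Synthetic labels: 2 positives out of 20 (prevalence 0.10)
y_true = np.array([1, 1] + [0] * 18)

# A majority-class dummy assigns the same positive-class probability to every row
constant_scores = np.zeros_like(y_true, dtype=float)

prevalence = y_true.mean()
roc_auc = roc_auc_score(y_true, constant_scores)
pr_auc = average_precision_score(y_true, constant_scores)

print("prevalence:", prevalence)
print("ROC-AUC   :", roc_auc)
print("PR-AUC    :", pr_auc)
```

This is why PR-AUC values for real models are best read against the prevalence rather than against 0.5.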

**4.3 Logistic Regression baseline**

Logistic regression is a strong and interpretable benchmark for sparse one-hot features.
We set class_weight="balanced" so that errors on the minority (readmitted) class are weighted more heavily during training.
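
For intuition, scikit-learn's "balanced" mode weights each class by n_samples / (n_classes * class_count). A minimal sketch using the training class counts reported in the sanity checks (72,407 negatives and 9,206 positives); the label array is rebuilt from those counts purely for illustration:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts taken from the train target distribution above
n_neg, n_pos = 72407, 9206
y = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])

# Same heuristic that class_weight="balanced" applies inside the estimator
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y
)

# Manual formula: n_samples / (n_classes * count_per_class)
manual = len(y) / (2 * np.bincount(y))

print("balanced weights (class 0, class 1):", weights)
print("manual formula                     :", manual)
```

The positive class receives the larger weight, so each misclassified readmission costs more in the loss than a misclassified non-readmission.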

**4.4 Code**

In [39]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(
    max_iter=3000,
    solver="liblinear",
    class_weight="balanced",
    random_state=42
)
logreg.fit(X_train_processed, y_train)

logreg_prob = logreg.predict_proba(X_test_processed)[:, 1]
logreg_metrics = evaluate_classifier("LogReg (balanced)", y_test, logreg_prob, threshold=0.5)

logreg_metrics

{'model': 'LogReg (balanced)',
 'threshold': 0.5,
 'accuracy': 0.6393092839775716,
 'precision': 0.15484219045049905,
 'recall': 0.5337052533705253,
 'f1': 0.240041819132253,
 'roc_auc': np.float64(0.6292286031961634),
 'pr_auc': np.float64(0.18169009759776394),
 'tn_fp_fn_tp': [11736, 6266, 1003, 1148]}
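
The headline metrics can be recovered directly from the confusion matrix counts, which is a useful cross-check. A short sketch using the [tn, fp, fn, tp] values reported above:

```python
# Confusion matrix counts from the LogReg (balanced) test output above
tn, fp, fn, tp = 11736, 6266, 1003, 1148

precision = tp / (tp + fp)   # share of flagged patients who were readmitted
recall = tp / (tp + fn)      # share of readmitted patients who were flagged
accuracy = (tp + tn) / (tn + fp + fn + tp)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.4f} recall={recall:.4f} "
      f"accuracy={accuracy:.4f} f1={f1:.4f}")
```

Of the 7,414 patients flagged at the 0.5 threshold, 1,148 were readmitted, which is where the precision of roughly 0.15 comes from.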

**4.5 Compare baseline results**

In [40]:
baseline_df = pd.DataFrame([dummy_metrics, logreg_metrics]).sort_values("roc_auc", ascending=False)
baseline_df

| | model | threshold | accuracy | precision | recall | f1 | roc_auc | pr_auc | tn_fp_fn_tp |
|---|---|---|---|---|---|---|---|---|---|
| 1 | LogReg (balanced) | 0.5 | 0.639309 | 0.154842 | 0.533705 | 0.240042 | 0.629229 | 0.18169 | [11736, 6266, 1003, 1148] |
| 0 | Dummy (most_frequent) | 0.5 | 0.893267 | 0.0 | 0.0 | 0.0 | 0.5 | 0.106733 | [18002, 0, 2151, 0] |


**5. Cross-validated model selection**


In [41]:
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import roc_auc_score, average_precision_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "LogReg_balanced": LogisticRegression(max_iter=3000, solver="liblinear", class_weight="balanced", random_state=42),
}

In [42]:
cv_results = []

for name, model in models.items():
    # cross_val_predict gives out-of-fold probabilities on TRAIN (leakage-safe)
    oof_prob = cross_val_predict(
        model, X_train_processed, y_train,
        cv=cv, method="predict_proba", n_jobs=-1
    )[:, 1]

    cv_results.append({
        "model": name,
        "cv_roc_auc": roc_auc_score(y_train, oof_prob),
        "cv_pr_auc": average_precision_score(y_train, oof_prob),
    })

cv_results_df = pd.DataFrame(cv_results).sort_values("cv_roc_auc", ascending=False)
cv_results_df


| | model | cv_roc_auc | cv_pr_auc |
|---|---|---|---|
| 0 | LogReg_balanced | 0.634841 | 0.195926 |
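
Pooling out-of-fold predictions gives a single score per model but hides fold-to-fold variation. As a complementary view, per-fold scores could be collected with `cross_validate`. The sketch below uses synthetic data from `make_classification` as a stand-in for the real matrices, so the numbers it prints are not comparable to the results above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in data (roughly 11% positives)
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.89, 0.11], random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(
    max_iter=3000, solver="liblinear", class_weight="balanced", random_state=42
)

# One ROC-AUC and one PR-AUC value per fold
scores = cross_validate(
    model, X, y, cv=cv, scoring=["roc_auc", "average_precision"]
)

for metric in ["test_roc_auc", "test_average_precision"]:
    vals = scores[metric]
    print(f"{metric}: mean={vals.mean():.3f} std={vals.std():.3f}")
```

A large standard deviation across folds would suggest that differences between candidate models of a similar size should not be over-interpreted.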


**6. Threshold tuning**

In [43]:
# Fit best model on full train
best_model = LogisticRegression(max_iter=3000, solver="liblinear", class_weight="balanced", random_state=42)
best_model.fit(X_train_processed, y_train)

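# Predict probabilities on train and test
# Note: train_prob is in-sample (the model was fitted on these rows),
# so threshold metrics computed from it are optimistic
train_prob = best_model.predict_proba(X_train_processed)[:, 1]
test_prob  = best_model.predict_proba(X_test_processed)[:, 1]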

thresholds = np.linspace(0.05, 0.95, 19)

rows = []
for t in thresholds:
    rows.append(evaluate_classifier("BestModel(LogReg)", y_train, train_prob, threshold=t))

thresh_df = pd.DataFrame(rows).sort_values(["recall","precision"], ascending=False)
thresh_df.head(10)


| | model | threshold | accuracy | precision | recall | f1 | roc_auc | pr_auc | tn_fp_fn_tp |
|---|---|---|---|---|---|---|---|---|---|
| 0 | BestModel(LogReg) | 0.05 | 0.117236 | 0.113303 | 1.0 | 0.203544 | 0.706699 | 0.239807 | [362, 72045, 0, 9206] |
| 1 | BestModel(LogReg) | 0.1 | 0.126732 | 0.114375 | 0.999783 | 0.205268 | 0.706699 | 0.239807 | [1139, 71268, 2, 9204] |
| 2 | BestModel(LogReg) | 0.15 | 0.142097 | 0.116137 | 0.99924 | 0.208089 | 0.706699 | 0.239807 | [2398, 70009, 7, 9199] |
| 3 | BestModel(LogReg) | 0.2 | 0.165035 | 0.118732 | 0.99685 | 0.21219 | 0.706699 | 0.239807 | [4292, 68115, 29, 9177] |
| 4 | BestModel(LogReg) | 0.25 | 0.20433 | 0.123251 | 0.990224 | 0.219216 | 0.706699 | 0.239807 | [7560, 64847, 90, 9116] |
| 5 | BestModel(LogReg) | 0.3 | 0.267935 | 0.130383 | 0.968282 | 0.229819 | 0.706699 | 0.239807 | [12953, 59454, 292, 8914] |
| 6 | BestModel(LogReg) | 0.35 | 0.354858 | 0.140609 | 0.923202 | 0.244049 | 0.706699 | 0.239807 | [20462, 51945, 707, 8499] |
| 7 | BestModel(LogReg) | 0.4 | 0.459045 | 0.155061 | 0.853139 | 0.262425 | 0.706699 | 0.239807 | [29610, 42797, 1352, 7854] |
| 8 | BestModel(LogReg) | 0.45 | 0.563158 | 0.171233 | 0.748099 | 0.278679 | 0.706699 | 0.239807 | [39074, 33333, 2319, 6887] |
| 9 | BestModel(LogReg) | 0.5 | 0.659907 | 0.191553 | 0.625679 | 0.293309 | 0.706699 | 0.239807 | [48097, 24310, 3446, 5760] |
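
The sweep above scores thresholds on in-sample training probabilities, which tend to look better than unseen data (training ROC-AUC 0.707 versus 0.635 out-of-fold). One alternative, sketched here on synthetic data, is to sweep thresholds over out-of-fold probabilities from `cross_val_predict`. The data and variable names are illustrative assumptions, and this is not the procedure the notebook actually ran.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for the training data
X, y = make_classification(
    n_samples=3000, n_features=20, weights=[0.89, 0.11], random_state=0
)

model = LogisticRegression(
    max_iter=3000, solver="liblinear", class_weight="balanced", random_state=42
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Each row is scored by a model that never saw it during fitting
oof_prob = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]

target_recall = 0.70
best_t, best_precision = None, -1.0
for t in np.linspace(0.05, 0.95, 19):
    y_pred = (oof_prob >= t).astype(int)
    rec = recall_score(y, y_pred, zero_division=0)
    prec = precision_score(y, y_pred, zero_division=0)
    # Keep the highest-precision threshold that still meets the recall target
    if rec >= target_recall and prec > best_precision:
        best_t, best_precision = float(t), prec

if best_t is None:
    best_t = 0.5  # fallback, mirroring the notebook's default

print("threshold chosen on out-of-fold probabilities:", best_t)
```

Because every probability comes from a model that did not train on that row, the recall and precision seen during the sweep should be closer to what the held-out test set will show.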


In [44]:
target_recall = 0.70
eligible = thresh_df[thresh_df["recall"] >= target_recall]

if len(eligible) == 0:
    chosen_t = 0.5
    print("⚠️ No threshold met target recall; fallback to 0.5")
else:
    chosen_t = eligible.sort_values("precision", ascending=False).iloc[0]["threshold"]

print("✅ Chosen threshold:", chosen_t)


✅ Chosen threshold: 0.44999999999999996
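
Once a threshold is fixed on training data, it can be applied unchanged to the held-out set to see how well the recall target carries over. A minimal sketch on synthetic data; `report_at_threshold` is a hypothetical helper, and the split and model stand in for the notebook's real objects.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split


def report_at_threshold(y_true, y_prob, threshold):
    """Return recall, precision and confusion counts at a fixed threshold."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {
        "threshold": threshold,
        "recall": float(recall),
        "precision": float(precision),
        "tn_fp_fn_tp": [int(tn), int(fp), int(fn), int(tp)],
    }


# Synthetic stand-in data with a stratified hold-out split
X, y = make_classification(
    n_samples=3000, n_features=20, weights=[0.89, 0.11], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(
    max_iter=3000, solver="liblinear", class_weight="balanced", random_state=42
)
model.fit(X_tr, y_tr)

# Threshold value reported above; it is not re-tuned on the test split
chosen_t = 0.45
test_report = report_at_threshold(y_te, model.predict_proba(X_te)[:, 1], chosen_t)
print(test_report)
```

Keeping the threshold frozen at this step matters: adjusting it after looking at test metrics would turn the test set into a tuning set.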


In [45]:
import joblib
from pathlib import Path

# choose a safe save folder in Drive/repo
MODEL_OUT_DIR = MODELS_DIR  # or BASE_DIR/"models"
MODEL_OUT_DIR.mkdir(parents=True, exist_ok=True)

joblib.dump(best_model, MODEL_OUT_DIR / "final_model_logreg.joblib")

with open(MODEL_OUT_DIR / "final_model_threshold.txt", "w") as f:
    f.write(str(chosen_t))

print("✅ Saved model + threshold in:", MODEL_OUT_DIR)

✅ Saved model + threshold in: /content/drive/MyDrive/JohnRaffyRaymundo_AIMCapstone2025/models
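
For the hand-off to Notebook 05, the saved files can be read back with `joblib.load` and `float`. The sketch below round-trips a small stand-in model and the threshold through a temporary directory. The file names match those above, while the temporary directory and toy model are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model (the real notebook saves the fitted best_model)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
toy_model = LogisticRegression().fit(X, y)
chosen_t = 0.44999999999999996  # value printed by the threshold cell

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)

    # Save, mirroring the cell above
    joblib.dump(toy_model, out_dir / "final_model_logreg.joblib")
    (out_dir / "final_model_threshold.txt").write_text(str(chosen_t))

    # Load, as a downstream notebook might
    loaded_model = joblib.load(out_dir / "final_model_logreg.joblib")
    loaded_t = float((out_dir / "final_model_threshold.txt").read_text())

# Apply the stored threshold to predicted probabilities
y_pred = (loaded_model.predict_proba(X)[:, 1] >= loaded_t).astype(int)
print("threshold:", loaded_t, "| predictions:", y_pred)
```

Writing the threshold with `str` and reading it with `float` preserves the exact floating-point value, so the downstream notebook applies the same cut-off that was chosen here.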


**Final Model Summary**

The final model selected for downstream interpretation is a class-weighted Logistic Regression trained on leakage-safe, preprocessed features.

Model selection was based on:

- Cross-validated performance on training data only
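- Higher ROC-AUC and PR-AUC than the dummy baseline (cross-validated ROC-AUC 0.635 and PR-AUC 0.196, versus 0.500 and 0.107 for the dummy model on the test set)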
- Stability and interpretability in a high-dimensional sparse feature space

A decision threshold of 0.45 was selected as the highest-precision candidate with a training recall of at least 0.70, reflecting the clinical importance of identifying patients at higher risk of 30-day readmission.

The trained model and threshold have been saved for use in Notebook 05.

**Notebook Summary**

In this notebook, we:

- Loaded leakage-safe train/test splits and preprocessing artefacts
- Established baseline performance using dummy and logistic regression models
- Performed cross-validated model selection on training data only
- Tuned the classification threshold to align with clinical priorities
- Evaluated the logistic regression on the held-out test set at the default 0.5 threshold
- Saved the final model and threshold for downstream analysis

Next (Notebook 05):
Model explainability and responsible AI analysis, including:

- Feature importance interpretation (coefficients, SHAP)
- Partial dependence analysis
- Fairness and bias assessment across key patient subgroups