# 03 — Train Baseline Model (Logistic Regression)

**Project:** Early ICU Mortality Prediction Using Structured EHR Data  
**Dataset:** MIMIC-IV Clinical Database Demo (v2.2)

## Goal of this notebook
Train and evaluate a **baseline** mortality model using lab features from the first 24 hours of ICU admission.

We will:
1. Load `dataset_model_ready.csv` (cohort + label + lab features)
2. Define feature columns and the label
3. Split into train/test sets (patient-level split to reduce leakage across multiple stays)
4. Train a baseline pipeline:
   - Impute missing values
   - Standardize features
   - Logistic Regression (with `class_weight='balanced'`)
5. Evaluate:
   - ROC-AUC
   - PR-AUC (Average Precision)
   - Confusion matrix at a chosen threshold
   - Simple calibration check (optional plot)
6. Save:
   - `baseline_logreg.joblib`
   - `baseline_metrics.json`

## Inputs
- `dataset_model_ready.csv`

## Outputs
- `baseline_logreg.joblib`
- `baseline_metrics.json`


In [None]:
# Setup
import pandas as pd
import numpy as np
from pathlib import Path

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 120)

import sys, platform
print("Python:", sys.version.split()[0])
print("Platform:", platform.platform())
print("Pandas:", pd.__version__)


## 1) Load dataset

In [None]:
DATA_DIR = Path(".")

if not (DATA_DIR / "dataset_model_ready.csv").exists():
    alt = Path("/mnt/data")
    if (alt / "dataset_model_ready.csv").exists():
        DATA_DIR = alt

DATASET_PATH = DATA_DIR / "dataset_model_ready.csv"
print("Dataset path:", DATASET_PATH.resolve())

df = pd.read_csv(DATASET_PATH)

print("Loaded dataset:", df.shape)
display(df.head(5))


## 2) Define label and feature columns

- **Label:** `label_mortality`  
- **Features:** lab aggregates and measured indicators (columns starting with `lab_`)

We exclude identifiers and timestamps from the feature set.


In [None]:
LABEL_COL = "label_mortality"

# Feature columns created in notebook 02
feature_cols = [c for c in df.columns if c.startswith("lab_")]

# Basic checks
assert LABEL_COL in df.columns, f"Missing label column: {LABEL_COL}"
assert len(feature_cols) > 0, "No lab feature columns found (expected columns starting with 'lab_')"

print("Num feature columns:", len(feature_cols))
print("Num rows:", len(df))
print("Label distribution:")
display(df[LABEL_COL].value_counts())

# Drop rows with missing label (should not happen, but safe)
df = df[df[LABEL_COL].notna()].copy()
df[LABEL_COL] = df[LABEL_COL].astype(int)


## 3) Train/test split (patient-level)

To reduce information leakage when a patient has multiple ICU stays, we split by `subject_id`:
- All stays of a patient go to either train or test, not both.

> On the demo dataset, sizes are small, so results will be noisy. The goal is to validate the pipeline end-to-end.
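
As a quick illustration of the patient-level property, here is a minimal sketch on a toy frame. The `subject_id` values and the `lab_x` column are made up for the example; the point is only that train and test share no patients after `GroupShuffleSplit`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy cohort: 10 hypothetical patients with 2 ICU stays each (illustrative values only)
toy = pd.DataFrame({
    "subject_id": np.repeat(np.arange(10), 2),
    "lab_x": np.arange(20, dtype=float),
})

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(toy, groups=toy["subject_id"]))

train_patients = set(toy["subject_id"].iloc[train_idx])
test_patients = set(toy["subject_id"].iloc[test_idx])

# No patient should appear on both sides of the split
overlap = train_patients & test_patients
print("Train patients:", len(train_patients), "Test patients:", len(test_patients))
print("Overlap:", overlap)
```

The same overlap check can be run on the real `train_idx` and `test_idx` as a guard against accidental leakage.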


In [None]:
from sklearn.model_selection import GroupShuffleSplit

X = df[feature_cols]
y = df[LABEL_COL]
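assert "subject_id" in df.columns, "Missing subject_id column (needed for the patient-level split)"
groups = df["subject_id"]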

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=groups))

X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

print("Train size:", X_train.shape, " Test size:", X_test.shape)
print("Train label distribution:", y_train.value_counts().to_dict())
print("Test label distribution:", y_test.value_counts().to_dict())


## 4) Baseline model pipeline

We use a simple but strong baseline for tabular EHR data:
- `SimpleImputer(median)` for missing values
- `StandardScaler` for feature scaling
- `LogisticRegression` with `class_weight='balanced'`

This yields a solid starting point and is easy to interpret.
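
To make `class_weight='balanced'` concrete: scikit-learn weights each class by `n_samples / (n_classes * count_of_class)`. The sketch below applies that formula to a made-up label vector; the 8:2 split is illustrative and not the dataset's real class balance.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 8 survivors (0) and 2 deaths (1)
y_toy = np.array([0] * 8 + [1] * 2)
classes = np.array([0, 1])

# The heuristic applied when class_weight="balanced"
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# Manual computation of the same formula for comparison
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

print(dict(zip(classes.tolist(), weights.tolist())))  # {0: 0.625, 1: 2.5}
```

Here each death counts four times as much as each survivor in the loss, which is why the minority class is not ignored during fitting.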


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

model = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler(with_mean=True, with_std=True)),
    ("clf", LogisticRegression(
        max_iter=2000,
        class_weight="balanced",
        solver="lbfgs"
    ))
])

model


## 5) Train

In [None]:
model.fit(X_train, y_train)
print("Trained baseline logistic regression ✅")


## 6) Evaluate (ROC-AUC, PR-AUC, confusion matrix)

We evaluate on the test set using predicted probabilities.

- **ROC-AUC**: overall ranking quality
- **PR-AUC (Average Precision)**: more informative for imbalanced outcomes
- **Confusion matrix** at threshold 0.5, plus the threshold that maximizes F1 as a reference (it is chosen on the test set, so it is optimistic and not a tuned operating point)
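
The goals list a simple calibration check, which the evaluation cell does not include. Below is a minimal sketch of one option, using `calibration_curve` and the Brier score. The arrays `y_true_toy` and `proba_toy` are synthetic stand-ins generated for illustration; in this notebook they would be `y_test` and `proba_test`.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)

# Synthetic stand-ins for y_test / proba_test (illustrative only)
proba_toy = rng.uniform(0, 1, size=500)
# Labels drawn so that the probabilities are well calibrated by construction
y_true_toy = (rng.uniform(0, 1, size=500) < proba_toy).astype(int)

# Observed event rate vs. mean predicted probability in each bin
frac_pos, mean_pred = calibration_curve(
    y_true_toy, proba_toy, n_bins=5, strategy="quantile"
)
brier = brier_score_loss(y_true_toy, proba_toy)

for p, f in zip(mean_pred, frac_pos):
    print(f"mean predicted={p:.2f}  observed rate={f:.2f}")
print(f"Brier score: {brier:.3f}")
```

Because `class_weight='balanced'` reweights the classes, this baseline's probabilities will likely overestimate risk when deaths are the minority class. A check like this is worth running before reading them as absolute risks.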


In [None]:
from sklearn.metrics import (
    roc_auc_score,
    average_precision_score,
    confusion_matrix,
    classification_report,
    precision_recall_curve,
)

# Predict probabilities for the positive class
proba_test = model.predict_proba(X_test)[:, 1]

roc = roc_auc_score(y_test, proba_test) if y_test.nunique() > 1 else np.nan
pr  = average_precision_score(y_test, proba_test) if y_test.nunique() > 1 else np.nan

print(f"ROC-AUC: {roc:.3f}")
print(f"PR-AUC (Avg Precision): {pr:.3f}")

# Threshold 0.5
pred_05 = (proba_test >= 0.5).astype(int)
cm_05 = confusion_matrix(y_test, pred_05, labels=[0, 1])
print("\nConfusion matrix @ 0.5 threshold (rows=true, cols=pred):\n", cm_05)
print("\nClassification report @ 0.5 threshold:")
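# zero_division=0 avoids warnings when a class is never predicted (likely on tiny test sets)
print(classification_report(y_test, pred_05, digits=3, zero_division=0))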

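# Find threshold that maximizes F1 (for reference only; selected on the test set)
prec, rec, thr = precision_recall_curve(y_test, proba_test)
# prec[i] and rec[i] correspond to thr[i]; the final (precision=1, recall=0) point
# has no threshold, so drop it before computing F1
f1 = np.nan_to_num(2 * prec[:-1] * rec[:-1] / (prec[:-1] + rec[:-1] + 1e-12))
best_idx = int(np.argmax(f1))
best_thr = float(thr[best_idx]) if len(thr) > 0 else 0.5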

print(f"\nBest-F1 threshold (reference): {best_thr:.3f}")

pred_best = (proba_test >= best_thr).astype(int)
cm_best = confusion_matrix(y_test, pred_best, labels=[0, 1])
print("Confusion matrix @ best-F1 threshold:\n", cm_best)


## 7) Quick feature importance (coefficients)

For logistic regression fitted on standardized features, the sign of each coefficient shows whether a lab feature pushes predicted risk up or down, and the magnitude shows how strongly.

> These are *not* causal. They’re a quick sanity check.
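
Since the features are standardized, `exp(coef)` can be read as the multiplicative change in the odds of the outcome per one-standard-deviation increase in that feature. Here is a small sketch on synthetic data; the feature names `lab_a` and `lab_b` and the data-generating process are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: lab_a raises risk, lab_b is pure noise (illustrative only)
X_toy = pd.DataFrame({
    "lab_a": rng.normal(50, 10, size=400),
    "lab_b": rng.normal(5, 2, size=400),
})
logit = (X_toy["lab_a"] - 50) / 10  # true log-odds depend only on lab_a
y_toy = (rng.uniform(size=400) < 1 / (1 + np.exp(-logit))).astype(int)

toy_model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X_toy, y_toy)

coef = toy_model.named_steps["clf"].coef_.ravel()
odds_df = pd.DataFrame({
    "feature": X_toy.columns,
    "coef": coef,
    "odds_ratio_per_sd": np.exp(coef),  # >1 raises the odds, <1 lowers them
})
print(odds_df)
```

An odds ratio near 1 (as expected for `lab_b` here) suggests the feature contributes little to the model's predictions.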


In [None]:
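# Extract coefficients from the pipeline
coef = model.named_steps["clf"].coef_.ravel()

# SimpleImputer drops columns that were entirely missing in the training split
# (their statistics_ entry is NaN), so keep only the feature names that survived
kept = ~np.isnan(model.named_steps["imputer"].statistics_)
kept_features = np.asarray(feature_cols)[kept]

coef_df = pd.DataFrame({
    "feature": kept_features,
    "coef": coef,
    "abs_coef": np.abs(coef)
}).sort_values("abs_coef", ascending=False)

display(coef_df.head(20))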


## 8) Save model + metrics

We save:
- the trained sklearn pipeline as `baseline_logreg.joblib`
- metrics in `baseline_metrics.json`
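
Below is a minimal sketch of reloading a saved pipeline and confirming that it reproduces the same probabilities. It trains a throwaway model on synthetic data and writes to a temporary directory, so the file name and data are placeholders rather than this notebook's artifacts.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
X_toy[rng.uniform(size=X_toy.shape) < 0.1] = np.nan  # sprinkle in missing values
y_toy = rng.integers(0, 2, size=100)

toy_model = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_model.joblib"  # placeholder file name
    joblib.dump(toy_model, path)
    reloaded = joblib.load(path)

# The reloaded pipeline should give identical probabilities
same = np.allclose(
    toy_model.predict_proba(X_toy)[:, 1],
    reloaded.predict_proba(X_toy)[:, 1],
)
print("Reloaded model matches:", same)
```

joblib files are pickle-based, so they should be loaded only from trusted sources and ideally with the same scikit-learn version that wrote them.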


In [None]:
import json
import joblib

MODEL_PATH = Path("baseline_logreg.joblib")
METRICS_PATH = Path("baseline_metrics.json")

joblib.dump(model, MODEL_PATH)

metrics = {
    "n_rows": int(len(df)),
    "n_features": int(len(feature_cols)),
    "train_rows": int(len(X_train)),
    "test_rows": int(len(X_test)),
    "roc_auc": None if np.isnan(roc) else float(roc),
    "pr_auc": None if np.isnan(pr) else float(pr),
    "threshold_0.5_confusion_matrix": cm_05.tolist(),
    "best_f1_threshold": float(best_thr),
    "best_f1_confusion_matrix": cm_best.tolist(),
}

with open(METRICS_PATH, "w") as f:
    json.dump(metrics, f, indent=2)

print("Saved:")
print(" ", MODEL_PATH.resolve())
print(" ", METRICS_PATH.resolve())
