# 05 — Error Analysis & Case Review

**Project:** Early ICU Mortality Prediction Using Structured EHR Data  
**Dataset:** MIMIC-IV Clinical Database Demo (v2.2)

## Goal of this notebook
Go beyond aggregate metrics and answer:
- *Which* cases does the model get wrong?
- Are mistakes clustered (e.g., by low data availability, certain lab patterns)?
- Are errors driven by missingness or extreme values?
- What should we do next to improve the model?

We will:
1. Load test predictions (`eval_predictions.csv`) + threshold report (`threshold_report.csv`)
2. Join predictions back to the model-ready dataset (`dataset_model_ready.csv`)
3. Load `labevents.csv` to inspect raw time-stamped labs within the 0–24h window
4. (Optional) Load `d_labitems.csv` or `d_labitems.csv.gz` (also looked up under `hosp/`, if available) to map `itemid -> lab name`
5. Create a small set of **case review tables** for:
   - Top false positives (high risk predicted, survived)
   - Top false negatives (low risk predicted, died)
6. Summarize common error patterns and produce an “action list” for improvements

## Inputs
- `eval_predictions.csv`
- `threshold_report.csv`
- `dataset_model_ready.csv`
- `labevents.csv`
- (optional) `d_labitems.csv` or `d_labitems.csv.gz` (to label labs)

## Outputs
- `case_review_false_positives.csv`
- `case_review_false_negatives.csv`
- `error_analysis_summary.md` (a short write-up you can paste into README)


In [None]:
# Setup
import pandas as pd
import numpy as np
from pathlib import Path

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 120)

import sys, platform
print("Python:", sys.version.split()[0])
print("Platform:", platform.platform())
print("Pandas:", pd.__version__)


## 1) Resolve paths and load artifacts

In [None]:
def resolve_dir():
    d = Path(".")
    if (d / "eval_predictions.csv").exists():
        return d
    alt = Path("/mnt/data")
    if (alt / "eval_predictions.csv").exists():
        return alt
    return d

DATA_DIR = resolve_dir()
print("Using DATA_DIR:", DATA_DIR.resolve())

PRED_PATH = DATA_DIR / "eval_predictions.csv"
THR_PATH = DATA_DIR / "threshold_report.csv"
DATASET_PATH = DATA_DIR / "dataset_model_ready.csv"
LABEVENTS_PATH = DATA_DIR / "labevents.csv"

for p in [PRED_PATH, THR_PATH, DATASET_PATH, LABEVENTS_PATH]:
    assert p.exists(), f"Missing required file: {p}"

pred = pd.read_csv(PRED_PATH)
thr = pd.read_csv(THR_PATH)
df = pd.read_csv(DATASET_PATH)
labevents = pd.read_csv(LABEVENTS_PATH)

print("Loaded:")
print("  pred:", pred.shape)
print("  thr:", thr.shape)
print("  dataset:", df.shape)
print("  labevents:", labevents.shape)

display(pred.head(5))
display(thr.head(5))


## 2) Join predictions back to the dataset

This gives us the full feature vector and metadata for each evaluated ICU stay.
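
One way to audit a join like this is pandas' `indicator=True` option, which records whether each prediction found a matching dataset row. The sketch below uses made-up toy frames; the feature column `lab_50912_mean` is a hypothetical name, not one taken from the project's files.

```python
import pandas as pd

# Toy stand-ins for eval_predictions.csv and dataset_model_ready.csv (values are made up).
toy_pred = pd.DataFrame({
    "stay_id": [101, 102, 103],
    "label_mortality": [0, 1, 0],
    "pred_proba": [0.82, 0.10, 0.35],
})
toy_dataset = pd.DataFrame({
    "stay_id": [101, 102],          # stay 103 is deliberately absent
    "label_mortality": [0, 1],
    "lab_50912_mean": [1.1, 2.4],   # hypothetical feature column
})

# indicator=True adds a `_merge` column: "both" for matched rows,
# "left_only" for predictions with no matching dataset row.
audit = toy_pred.merge(
    toy_dataset,
    on=["stay_id", "label_mortality"],
    how="left",
    validate="one_to_one",
    indicator=True,
)

unmatched_stays = audit.loc[audit["_merge"] == "left_only", "stay_id"].tolist()
print("Unmatched stay_ids:", unmatched_stays)
```

Because a left merge always keeps the key columns from the predictions, the `_merge` column is a more direct signal of a failed match than scanning for null rows.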


In [None]:
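LABEL_COL = "label_mortality"
assert LABEL_COL in df.columns, "dataset_model_ready.csv must contain label_mortality"
# The merge below also keys on the label, so the predictions file must carry it too.
assert LABEL_COL in pred.columns, "eval_predictions.csv must contain label_mortality"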

df_merged = pred.merge(
    df,
    on=["subject_id", "hadm_id", "stay_id", LABEL_COL],
    how="left",
    validate="one_to_one"
)

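print("Merged rows:", df_merged.shape)
# Key columns are never null after a left merge, so check only the columns contributed by the dataset.
dataset_only_cols = [c for c in df.columns if c not in pred.columns]
missing_rows = df_merged[dataset_only_cols].isna().all(axis=1).sum()
print("Predictions without a matching dataset row (should be 0):", missing_rows)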

display(df_merged.head(5))


## 3) Select cases for review (false positives / false negatives)

We'll use the **same threshold definition** as in notebook 04:
- default `threshold = 0.5` (a stay is predicted positive when `pred_proba >= threshold`)
- the best-F1 threshold from the threshold report is printed for context only; it is not used to select cases.

Then we select:
- Top `K` false positives (highest predicted risk among survivors)
- Top `K` false negatives (lowest predicted risk among deaths)
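
To see how the choice of threshold shifts the balance between the two error types, a minimal sketch on made-up predictions follows. The helper `error_counts` and the value `0.3` (standing in for a best-F1 threshold) are illustrative assumptions, not values from `threshold_report.csv`.

```python
import pandas as pd

# Toy predictions (made-up values); column names follow this notebook.
toy = pd.DataFrame({
    "label_mortality": [0, 0, 0, 1, 1, 1],
    "pred_proba":      [0.10, 0.40, 0.70, 0.20, 0.45, 0.90],
})

def error_counts(frame, threshold):
    """Count false positives and false negatives when pred_proba >= threshold is called positive."""
    pred_label = (frame["pred_proba"] >= threshold).astype(int)
    n_fp = int(((frame["label_mortality"] == 0) & (pred_label == 1)).sum())
    n_fn = int(((frame["label_mortality"] == 1) & (pred_label == 0)).sum())
    return {"threshold": threshold, "false_positives": n_fp, "false_negatives": n_fn}

# 0.5 is the default used in this notebook; 0.3 is a hypothetical best-F1 threshold.
comparison = pd.DataFrame([error_counts(toy, t) for t in (0.5, 0.3)])
print(comparison)
```

Lowering the threshold trades false negatives for false positives, so the set of cases selected for review depends on which threshold is fixed first.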


In [None]:
# Choose a default threshold
DEFAULT_THRESHOLD = 0.5

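# Best-F1 threshold from report (if present)
if {"f1", "threshold"}.issubset(thr.columns) and thr["f1"].notna().any():
    best_thr = float(thr.loc[thr["f1"].idxmax(), "threshold"])
    print("Best-F1 threshold (from threshold_report.csv):", best_thr)
else:
    best_thr = None
    print("threshold_report.csv has no usable f1/threshold columns; skipping best-F1 threshold.")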

# Label predictions at default threshold
df_merged["pred_label"] = (df_merged["pred_proba"] >= DEFAULT_THRESHOLD).astype(int)

fp = df_merged[(df_merged[LABEL_COL] == 0) & (df_merged["pred_label"] == 1)].copy()
fn = df_merged[(df_merged[LABEL_COL] == 1) & (df_merged["pred_label"] == 0)].copy()

print("False positives:", len(fp))
print("False negatives:", len(fn))

K = 10
fp_top = fp.sort_values("pred_proba", ascending=False).head(K)
fn_top = fn.sort_values("pred_proba", ascending=True).head(K)

display(fp_top[["subject_id", "hadm_id", "stay_id", LABEL_COL, "pred_proba"]])
display(fn_top[["subject_id", "hadm_id", "stay_id", LABEL_COL, "pred_proba"]])


## 4) Optional: map lab `itemid` to names via `d_labitems`

If `d_labitems.csv` (or `d_labitems.csv.gz`) is in your data folder, we'll load it.
Otherwise, we'll proceed using itemids only.


In [None]:
d_labitems = None

candidates = [
    DATA_DIR / "d_labitems.csv",
    DATA_DIR / "d_labitems.csv.gz",
    # if user still has the original folder structure:
    DATA_DIR / "hosp" / "d_labitems.csv.gz",
    DATA_DIR / "hosp" / "d_labitems.csv",
]

for c in candidates:
    if c.exists():
        if c.suffix == ".gz":
            d_labitems = pd.read_csv(c, compression="gzip")
        else:
            d_labitems = pd.read_csv(c)
        print("Loaded d_labitems from:", c.resolve())
        break

if d_labitems is None:
    print("d_labitems not found; continuing with itemid only.")
else:
    # Standard columns: itemid, label, fluid, category, etc.
    display(d_labitems.head(5))


## 5) Build a simple "chart pack" for each case

For each selected ICU stay:
- Pull **raw** lab events within the leakage-safe window:
  `intime <= charttime <= prediction_time`
- Summarize:
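  - top labs by within-case deviation (largest absolute z-score of a value against that stay's own mean for the same lab); this is a simple heuristic and does not use clinical reference ranges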
  - count of labs measured
  - earliest and latest labs

This is a lightweight case-review aid.
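
The behavior of the within-case z-score heuristic is easiest to see on a tiny example. The sketch below uses made-up values for a single stay and treats the `itemid` values as opaque labels; it is an illustration of the formula, not output from the real data.

```python
import pandas as pd

# Toy lab events for a single stay (made-up values).
toy_labs = pd.DataFrame({
    "itemid":   [50912, 50912, 50912, 50971, 50971, 50931],
    "valuenum": [1.0,   1.0,   4.0,   4.0,   4.0,   180.0],
})

grp = toy_labs.groupby("itemid")["valuenum"]
mean = grp.transform("mean")
std = grp.transform("std")   # sample std (ddof=1); NaN when a lab has a single measurement

# The small constant guards against division by zero when all values of a lab are identical.
toy_labs["abs_z"] = ((toy_labs["valuenum"] - mean) / (std + 1e-12)).abs()

max_abs_z = toy_labs.groupby("itemid")["abs_z"].max()
print(max_abs_z)
```

Two edge cases matter when reading the ranking: a lab with identical repeated values scores 0, and a lab measured only once scores NaN, so a single extreme value never rises to the top of the list.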


In [None]:
# Parse times needed for window filtering
for c in ["intime", "prediction_time"]:
    if c in df_merged.columns:
        df_merged[c] = pd.to_datetime(df_merged[c], errors="coerce")

labevents["charttime"] = pd.to_datetime(labevents["charttime"], errors="coerce")
labevents["valuenum"] = pd.to_numeric(labevents["valuenum"], errors="coerce")

# Helper: lab window for a single stay
def labs_for_stay(stay_row):
    subject_id = int(stay_row["subject_id"])
    hadm_id = int(stay_row["hadm_id"])
    intime = stay_row.get("intime", pd.NaT)
    ptime = stay_row.get("prediction_time", pd.NaT)
    if pd.isna(intime) or pd.isna(ptime):
        return pd.DataFrame()

    labs = labevents[(labevents["subject_id"] == subject_id) & (labevents["hadm_id"] == hadm_id)].copy()
    labs = labs[labs["charttime"].notna()]
    labs = labs[(labs["charttime"] >= intime) & (labs["charttime"] <= ptime)]
    labs = labs[labs["valuenum"].notna()].copy()

    if d_labitems is not None and "itemid" in labs.columns and "itemid" in d_labitems.columns:
        labs = labs.merge(d_labitems[["itemid", "label"]], on="itemid", how="left")
    return labs

# Helper: within-case lab summary
def summarize_labs(labs):
    if labs.empty:
        return pd.DataFrame(), {"n_events": 0, "n_unique_labs": 0}
    # Heuristic: within-case zscore per itemid (if multiple values)
    labs2 = labs.copy()
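    # Prefer the human-readable label; fall back to the itemid where no label was matched.
    labs2["lab_name"] = labs2["itemid"].astype(str)
    if "label" in labs2.columns:
        labs2["lab_name"] = labs2["label"].fillna(labs2["lab_name"])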
    grp = labs2.groupby("lab_name")["valuenum"]
    labs2["z"] = (labs2["valuenum"] - grp.transform("mean")) / (grp.transform("std") + 1e-12)
    labs2["abs_z"] = labs2["z"].abs()

    # Summary per lab_name
    summ = (labs2.groupby("lab_name")["valuenum"]
            .agg(["count", "min", "max", "mean"])
            .reset_index())
    # Add a "max abs z" score as a crude salience marker
    zmax = labs2.groupby("lab_name")["abs_z"].max().reset_index().rename(columns={"abs_z": "max_abs_z"})
    summ = summ.merge(zmax, on="lab_name", how="left").sort_values(["max_abs_z", "count"], ascending=False)

    stats = {"n_events": int(len(labs2)), "n_unique_labs": int(labs2["lab_name"].nunique())}
    return summ, stats

# Example: build a pack for the first FP if any
if len(fp_top) > 0:
    row0 = fp_top.iloc[0]
    labs0 = labs_for_stay(row0)
    summ0, stats0 = summarize_labs(labs0)
    print("Example case:", int(row0["stay_id"]), "pred_proba:", float(row0["pred_proba"]), "label:", int(row0[LABEL_COL]))
    print("Lab stats:", stats0)
    display(summ0.head(15))
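    # labs0 has no columns when intime/prediction_time is missing, so guard the sort.
    if not labs0.empty:
        display(labs0.sort_values("charttime").head(20))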
else:
    print("No false positives found at the default threshold.")


## 6) Generate case review tables for top false positives and false negatives

For each case, we store:
- identifiers + true label + predicted probability
- number of labs observed in the window
- number of unique lab tests
- top 10 lab summaries (as a JSON string for portability)

This makes it easy to review cases and share results.
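
A reviewer who receives the CSV will need to decode the JSON column again. The sketch below shows one possible way to do that, round-tripping a made-up one-row table through CSV text and expanding it to one row per lab. The column names mirror this notebook's output; the values are invented.

```python
import io
import json
import pandas as pd

# Toy case-review table with the same column layout as this notebook's output (made-up values).
toy_review = pd.DataFrame({
    "stay_id": [101],
    "pred_proba": [0.82],
    "top_labs_summary": [json.dumps([
        {"lab_name": "50912", "count": 3, "min": 1.0, "max": 4.0, "mean": 2.0},
        {"lab_name": "50971", "count": 2, "min": 4.0, "max": 4.0, "mean": 4.0},
    ])],
})

# Round-trip through CSV text to mimic saving and reloading the file.
buffer = io.StringIO()
toy_review.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Decode the JSON string, then expand to one row per (stay, lab).
reloaded["top_labs_summary"] = reloaded["top_labs_summary"].apply(json.loads)
exploded = reloaded.explode("top_labs_summary", ignore_index=True)
labs_long = pd.concat(
    [exploded[["stay_id", "pred_proba"]],
     pd.json_normalize(exploded["top_labs_summary"].tolist())],
    axis=1,
)
print(labs_long)
```

Cases whose summary is an empty list would need to be filtered out before the `json_normalize` step, since `explode` turns an empty list into a missing value.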


In [None]:
import json

def build_case_review(df_cases, tag, top_labs=10):
    rows = []
    for _, r in df_cases.iterrows():
        labs = labs_for_stay(r)
        summ, stats = summarize_labs(labs)
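        # Round-trip through to_json so NaN (e.g., max_abs_z for single-measurement labs) becomes null and the string stays valid JSON.
        top = json.loads(summ.head(top_labs).to_json(orient="records")) if not summ.empty else []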
        rows.append({
            "tag": tag,
            "subject_id": int(r["subject_id"]),
            "hadm_id": int(r["hadm_id"]),
            "stay_id": int(r["stay_id"]),
            "label_mortality": int(r[LABEL_COL]),
            "pred_proba": float(r["pred_proba"]),
            "n_lab_events_0_24h": int(stats["n_events"]),
            "n_unique_labs_0_24h": int(stats["n_unique_labs"]),
            "top_labs_summary": json.dumps(top),
        })
    return pd.DataFrame(rows)

fp_review = build_case_review(fp_top, "false_positive")
fn_review = build_case_review(fn_top, "false_negative")

display(fp_review.head())
display(fn_review.head())


## 7) Look for common error patterns (lightweight)

We compare false positives vs false negatives on:
- predicted probability distribution
- lab measurement density (how much data is available)
- missingness (the fraction of `lab_*_measured` indicators that are 0 for each case)

This is meant to produce *actionable next steps*.
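
A per-case missingness rate says how sparse a stay is, but not which labs are missing. As a possible complement, the sketch below breaks missingness down per indicator and per error type on made-up data; the lab names (`lab_a`, `lab_b`, `lab_c`) and the `gap_fn_minus_fp` column are hypothetical.

```python
import pandas as pd

# Toy reviewed cases (made-up values); `lab_*_measured` mirrors the indicator naming in this notebook.
toy_cases = pd.DataFrame({
    "tag": ["false_positive", "false_positive", "false_negative", "false_negative"],
    "lab_a_measured": [1, 1, 0, 0],   # hypothetical lab names
    "lab_b_measured": [1, 0, 1, 0],
    "lab_c_measured": [1, 1, 1, 1],
})

measured_cols = [c for c in toy_cases.columns if c.startswith("lab_") and c.endswith("_measured")]

# Fraction of cases in each group where the lab was NOT measured (indicator == 0).
missing_by_tag = (toy_cases[measured_cols] == 0).groupby(toy_cases["tag"]).mean().T
missing_by_tag["gap_fn_minus_fp"] = (
    missing_by_tag["false_negative"] - missing_by_tag["false_positive"]
)
print(missing_by_tag.sort_values("gap_fn_minus_fp", ascending=False))
```

Labs with a large positive gap are missing more often among false negatives and are candidates for a closer look, keeping in mind that with only a handful of reviewed cases the differences are descriptive rather than statistically reliable.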


In [None]:
# Measurement density comparison
combined = pd.concat([fp_review, fn_review], ignore_index=True)

if not combined.empty:
    display(combined.groupby("tag")[["pred_proba", "n_lab_events_0_24h", "n_unique_labs_0_24h"]].describe())

# Missingness summary from model-ready dataset (measured indicators)
measured_cols = [c for c in df_merged.columns if c.endswith("_measured") and c.startswith("lab_")]
if measured_cols:
    # Only for the cases we are reviewing
    case_ids = set(combined["stay_id"].tolist()) if not combined.empty else set()
    cases_full = df_merged[df_merged["stay_id"].isin(case_ids)].copy()
    if not cases_full.empty:
        cases_full["tag"] = np.where(
            (cases_full[LABEL_COL] == 0) & (cases_full["pred_label"] == 1),
            "false_positive",
            np.where((cases_full[LABEL_COL] == 1) & (cases_full["pred_label"] == 0), "false_negative", "other")
        )
        miss_rate = (cases_full[measured_cols] == 0).mean(axis=1)
        cases_full["missingness_rate_measured"] = miss_rate
        display(cases_full.groupby("tag")["missingness_rate_measured"].describe())
else:
    print("No measured indicator columns found.")


## 8) Write a short error analysis summary (Markdown)

This generates a concise summary you can paste into your README.

It includes:
- How many FPs/FNs at threshold 0.5
- A quick note on measurement density
- Suggested next improvements


In [None]:
summary_lines = []
summary_lines.append("# Error Analysis Summary (Baseline Logistic Regression)")
summary_lines.append("")
summary_lines.append(f"- Threshold used for review: **{DEFAULT_THRESHOLD:.2f}**")
summary_lines.append(f"- False positives: **{len(fp)}** (survived, predicted high risk)")
summary_lines.append(f"- False negatives: **{len(fn)}** (died, predicted low risk)")
summary_lines.append("")

if not combined.empty:
    for tag in ["false_positive", "false_negative"]:
        sub = combined[combined["tag"] == tag]
        if not sub.empty:
            summary_lines.append(f"## {tag.replace('_',' ').title()}")
            summary_lines.append(f"- Mean predicted risk: **{sub['pred_proba'].mean():.3f}**")
            summary_lines.append(f"- Mean lab events in 0–24h: **{sub['n_lab_events_0_24h'].mean():.1f}**")
            summary_lines.append(f"- Mean unique labs in 0–24h: **{sub['n_unique_labs_0_24h'].mean():.1f}**")
            summary_lines.append("")

summary_lines.append("## Hypotheses for errors")
summary_lines.append("- Some errors may be driven by **limited lab coverage** (missingness) in the first 24 hours.")
summary_lines.append("- Some false positives may represent **high acuity survivors** (treated effectively) — mortality is not the only notion of risk.")
summary_lines.append("- Some false negatives may require **vitals/clinical context** not present in labs alone.")
summary_lines.append("")
summary_lines.append("## Next improvements (action list)")
summary_lines.append("1. Add **vitals** from `icu.chartevents` (first 24h) and compare.")
summary_lines.append("2. Add `d_labitems` labels and build a **human-readable** report of top contributing labs.")
summary_lines.append("3. Try a stronger model for tabular data (e.g., **HistGradientBoosting** or **XGBoost/LightGBM**).")
summary_lines.append("4. Use time-aware splits when moving to full MIMIC-IV.")
summary_lines.append("")

summary_md = "\n".join(summary_lines)
print(summary_md)

OUT_MD = Path("error_analysis_summary.md")
OUT_MD.write_text(summary_md, encoding="utf-8")
print("\nSaved:", OUT_MD.resolve())


## 9) Save case review CSVs

In [None]:
FP_OUT = Path("case_review_false_positives.csv")
FN_OUT = Path("case_review_false_negatives.csv")

fp_review.to_csv(FP_OUT, index=False)
fn_review.to_csv(FN_OUT, index=False)

print("Saved:")
print(" ", FP_OUT.resolve())
print(" ", FN_OUT.resolve())
