# 02 — Engineer Lab Features (First 24h Window)

**Project:** Early ICU Mortality Prediction Using Structured EHR Data  
**Dataset:** MIMIC-IV Clinical Database Demo (v2.2)

## Goal of this notebook
Create a **model-ready feature table** from laboratory measurements.

We will:
1. Load the cohort table created in `01_build_icustay_cohort.ipynb`
2. Load `labevents`
3. **Filter labs to the leakage-safe window:** `intime <= charttime <= prediction_time`  
   (the upper bound prevents leakage; the lower bound keeps labs within the ICU stay)
4. Aggregate lab results into features per ICU stay
5. Save:
   - `lab_features_24h.csv`
   - `dataset_model_ready.csv` (cohort + labs)

## Inputs
- `cohort_icustay_mortality.csv`
- `labevents.csv`

## Outputs
- `lab_features_24h.csv`
- `dataset_model_ready.csv`


In [None]:
# Setup
import pandas as pd
import numpy as np
from pathlib import Path

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 120)

import sys, platform
print("Python:", sys.version.split()[0])
print("Platform:", platform.platform())
print("Pandas:", pd.__version__)


## 1) Locate files and load cohort + labevents

In [None]:
DATA_DIR = Path(".")

# If not found in current directory, fall back to /mnt/data (common in hosted notebooks)
if not (DATA_DIR / "cohort_icustay_mortality.csv").exists():
    alt = Path("/mnt/data")
    if (alt / "cohort_icustay_mortality.csv").exists():
        DATA_DIR = alt

COHORT_PATH = DATA_DIR / "cohort_icustay_mortality.csv"
LABEVENTS_PATH = DATA_DIR / "labevents.csv"

print("Using DATA_DIR:", DATA_DIR.resolve())
print("Cohort path:", COHORT_PATH.resolve())
print("Labevents path:", LABEVENTS_PATH.resolve())

cohort = pd.read_csv(COHORT_PATH)
labevents = pd.read_csv(LABEVENTS_PATH)

print("\nLoaded:")
print("  cohort:", cohort.shape)
print("  labevents:", labevents.shape)

display(cohort.head(5))
display(labevents.head(5))


## 2) Parse timestamps and validate required columns

We need:
- From cohort: `subject_id`, `hadm_id`, `stay_id`, `intime`, `prediction_time`, `label_mortality`
- From labs: `subject_id`, `hadm_id`, `itemid`, `charttime`, `valuenum`

We use `valuenum` for numeric labs. Rows with no numeric result (for example, a text-only entry in `value`) are dropped in this baseline.
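
As a small illustration of what numeric coercion does, the sketch below runs `pd.to_numeric` on a handful of made-up strings. The values are invented for this example and are not taken from MIMIC; they only show which kinds of entries would end up as NaN and be dropped by a later `notna()` filter.

```python
import pandas as pd

# Made-up examples of strings a lab result column might contain (not real data).
raw = pd.Series(["7.4", "138", "NEG", ">500", None])

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
numeric = pd.to_numeric(raw, errors="coerce")

print(numeric.tolist())
print("Rows a notna() filter would drop:", int(numeric.isna().sum()))
```

Entries such as `">500"` are lost entirely under this approach, which is acceptable for a baseline but worth revisiting if censored values matter.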


In [None]:
required_cohort = {"subject_id", "hadm_id", "stay_id", "intime", "prediction_time", "label_mortality"}
required_labs = {"subject_id", "hadm_id", "itemid", "charttime", "valuenum"}

missing_cohort = required_cohort - set(cohort.columns)
missing_labs = required_labs - set(labevents.columns)

assert not missing_cohort, f"Cohort missing: {missing_cohort}"
assert not missing_labs, f"Labevents missing: {missing_labs}"

# Parse datetimes
cohort["intime"] = pd.to_datetime(cohort["intime"], errors="coerce")
cohort["prediction_time"] = pd.to_datetime(cohort["prediction_time"], errors="coerce")

labevents["charttime"] = pd.to_datetime(labevents["charttime"], errors="coerce")

# Ensure numeric valuenum
labevents["valuenum"] = pd.to_numeric(labevents["valuenum"], errors="coerce")

print("Parsed times and coerced numeric values ✅")


## 3) Link labs to ICU stays and enforce the 24h leakage-safe window

### Important notes
- `labevents` is keyed by patient and **hospital admission** (`subject_id`, `hadm_id`); it has no `stay_id`.
- An admission can have multiple ICU stays.
- We attach labs to each ICU stay by joining on (`subject_id`, `hadm_id`), then filtering by time.
- Lab rows without a `hadm_id` cannot be matched by this join and are dropped.

### Window definition (default)
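- Keep labs where:  
  `intime <= charttime <= prediction_time`  
This keeps only labs charted from ICU admission through `prediction_time`, the end of the first 24 hours. `charttime` records when a lab was charted, which can be earlier than when its result became available to clinicians.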
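
To make the join-then-filter logic concrete, here is a minimal sketch on toy data: one admission with two ICU stays and four labs. All IDs, timestamps, and values are invented, and `prediction_time` is assumed to be `intime` plus 24 hours for the purpose of the example.

```python
import pandas as pd

# Toy cohort: one admission (hadm_id=10) containing two ICU stays.
cohort_toy = pd.DataFrame({
    "subject_id": [1, 1],
    "hadm_id": [10, 10],
    "stay_id": [100, 101],
    "intime": pd.to_datetime(["2020-01-01 08:00", "2020-01-05 12:00"]),
})
# Assumption for this sketch: prediction_time is 24 hours after ICU admission.
cohort_toy["prediction_time"] = cohort_toy["intime"] + pd.Timedelta(hours=24)

labs_toy = pd.DataFrame({
    "subject_id": [1, 1, 1, 1],
    "hadm_id": [10, 10, 10, 10],
    "itemid": [500, 500, 500, 500],  # arbitrary placeholder itemid
    "charttime": pd.to_datetime([
        "2020-01-01 06:00",  # before the first stay starts
        "2020-01-01 20:00",  # inside the first stay's window
        "2020-01-03 09:00",  # after the first window, before the second stay
        "2020-01-05 13:00",  # inside the second stay's window
    ]),
    "valuenum": [1.0, 1.2, 1.5, 1.1],
})

# Every lab is paired with every stay of the same admission (4 labs x 2 stays).
joined = labs_toy.merge(cohort_toy, on=["subject_id", "hadm_id"], how="inner")

# Series.between is inclusive on both ends by default.
in_window = joined["charttime"].between(joined["intime"], joined["prediction_time"])
window_toy = joined[in_window]

print(len(joined), "joined rows ->", len(window_toy), "rows in window")
print(window_toy[["stay_id", "charttime", "valuenum"]])
```

The join temporarily multiplies rows by the number of stays per admission, which is why the time filter has to run before any aggregation.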


In [None]:
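# Join labs to cohort using admission identifiers.
# If an admission has several ICU stays, each lab is paired with every one of them;
# the time filter below keeps only pairs where the lab falls inside that stay's window.
labs_joined = labevents.merge(
    cohort[["subject_id", "hadm_id", "stay_id", "intime", "prediction_time"]],
    on=["subject_id", "hadm_id"],
    how="inner",
    validate="many_to_many"  # documents intent; pandas performs no check for this mode
)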

print("Joined labs rows:", labs_joined.shape)

# Filter to leakage-safe window
labs_window = labs_joined[
    (labs_joined["charttime"].notna()) &
    (labs_joined["intime"].notna()) &
    (labs_joined["prediction_time"].notna()) &
    (labs_joined["charttime"] >= labs_joined["intime"]) &
    (labs_joined["charttime"] <= labs_joined["prediction_time"])
].copy()

print("Labs in 0-24h ICU window:", labs_window.shape)

# Optional: keep only numeric labs for baseline
labs_window = labs_window[labs_window["valuenum"].notna()].copy()
print("Labs in window with numeric valuenum:", labs_window.shape)

display(labs_window[["stay_id", "itemid", "charttime", "valuenum"]].head(10))


### Leakage sanity check

We should have **zero** lab events with `charttime > prediction_time` after filtering.
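
The same idea extends to the lower bound of the window. The sketch below uses a hypothetical helper, `count_window_violations`, which is not part of this notebook, to count rows on either side of `[intime, prediction_time]` in a toy frame.

```python
import pandas as pd

def count_window_violations(df: pd.DataFrame) -> dict:
    """Count rows outside [intime, prediction_time] (hypothetical helper)."""
    return {
        "before_intime": int((df["charttime"] < df["intime"]).sum()),
        "after_prediction_time": int((df["charttime"] > df["prediction_time"]).sum()),
    }

# Toy frame with one row too early, one inside the window, and one too late.
toy = pd.DataFrame({
    "intime": pd.to_datetime(["2020-01-01 08:00"] * 3),
    "prediction_time": pd.to_datetime(["2020-01-02 08:00"] * 3),
    "charttime": pd.to_datetime([
        "2020-01-01 07:00",
        "2020-01-01 12:00",
        "2020-01-02 09:00",
    ]),
})

violations = count_window_violations(toy)
print(violations)
```

On a correctly filtered table both counts should be zero, so the helper could back an assertion for each bound.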


In [None]:
leak_count = (labs_window["charttime"] > labs_window["prediction_time"]).sum()
print("Leakage events (charttime > prediction_time):", int(leak_count))
assert leak_count == 0, "Leakage detected: some labs occur after prediction_time"


## 4) Choose which labs to featurize

MIMIC has many lab `itemid`s. For a baseline model, we can:
- take the **top N most frequently measured labs** in the 24h window, and featurize those.

This avoids overly sparse feature matrices on small datasets like the demo.
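
Raw row counts favor labs that are drawn many times in a few stays. One alternative, sketched here on toy data and not what this notebook does, is to rank labs by how many distinct stays have at least one measurement. The item IDs below are placeholders.

```python
import pandas as pd

# Toy data: lab 500 is drawn five times in a single stay,
# lab 501 is drawn once in each of three stays.
toy = pd.DataFrame({
    "stay_id": [1, 1, 1, 1, 1, 1, 2, 3],
    "itemid":  [500, 500, 500, 500, 500, 501, 501, 501],
})

# Ranking used in this notebook: number of rows per itemid.
by_rows = toy["itemid"].value_counts()

# Alternative ranking: number of distinct stays per itemid.
by_stays = (
    toy.groupby("itemid")["stay_id"]
    .nunique()
    .sort_values(ascending=False)
)

print("Top lab by row count:  ", by_rows.index[0])
print("Top lab by stay count: ", by_stays.index[0])
```

Ranking by stay coverage tends to produce columns with fewer missing values, at the cost of ignoring labs that are monitored intensively in a small subgroup.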


In [None]:
# Top labs by frequency in the window
lab_counts = labs_window["itemid"].value_counts()
display(lab_counts.head(20))

TOP_N = 30  # good starting point for demo scale
top_itemids = lab_counts.head(TOP_N).index.tolist()
print(f"Using TOP_N={TOP_N} labs -> {len(top_itemids)} itemids")


## 5) Aggregate lab measurements into per-stay features

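For each (`stay_id`, `itemid`), compute:
- `min`, `max`, `mean`, `std`, `count`

`std` is the sample standard deviation, so it is NaN when a stay has only one measurement of that lab.
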
Then pivot so each `stay_id` becomes a single row with many columns like:
- `lab_50882_mean`, `lab_50882_min`, ...
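
The aggregate-then-pivot step can be tried on a toy table first. The sketch below uses invented values and a reduced set of statistics; it shows how the `(stat, itemid)` column pairs produced by `pivot` are flattened into names, and where NaN appears.

```python
import pandas as pd

# Toy lab rows: stay 1 has two measurements of lab 500 and one of lab 501;
# stay 2 has a single measurement of lab 500 and none of lab 501.
toy = pd.DataFrame({
    "stay_id": [1, 1, 1, 2],
    "itemid": [500, 500, 501, 500],
    "valuenum": [1.0, 3.0, 7.0, 2.0],
})

agg_toy = (
    toy.groupby(["stay_id", "itemid"])["valuenum"]
    .agg(["mean", "std", "count"])
    .reset_index()
)

# Without `values`, pivot keeps every remaining column and returns
# MultiIndex columns of the form (stat, itemid).
wide = agg_toy.pivot(index="stay_id", columns="itemid")
wide.columns = [f"lab_{int(itemid)}_{stat}" for stat, itemid in wide.columns]

print(wide)
```

Two kinds of NaN show up in the result: `std` for a lab measured only once, and every statistic for a lab that was never measured in that stay.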


In [None]:
labs_top = labs_window[labs_window["itemid"].isin(top_itemids)].copy()

agg = (
    labs_top
    .groupby(["stay_id", "itemid"])["valuenum"]
    .agg(["min", "max", "mean", "std", "count"])
    .reset_index()
)

display(agg.head(10))
print("Aggregated rows:", agg.shape)


### Pivot into a wide feature table

In [None]:
# Create wide features: columns = f"lab_{itemid}_{stat}"
features_wide = agg.pivot(index="stay_id", columns="itemid")
features_wide.columns = [f"lab_{int(itemid)}_{stat}" for stat, itemid in features_wide.columns]
features_wide = features_wide.reset_index()

print("Wide feature table:", features_wide.shape)
display(features_wide.head(5))


## 6) Add missingness indicators (very useful in EHR)

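Whether a lab was ordered at all reflects clinical judgment, so missingness itself can be predictive in clinical data.  
We add a per-lab indicator: 1 if that lab was measured at least once in the window, else 0.

We define "measured" as `count > 0`; a missing `count` (lab never drawn for that stay) counts as not measured.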
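
One subtlety is that a stay with no selected labs at all never appears in the feature table, so a later left merge onto the cohort leaves its indicator as NaN rather than 0. The toy sketch below, with invented column names and values, shows the effect and one way to handle it.

```python
import pandas as pd

cohort_toy = pd.DataFrame({"stay_id": [1, 2, 3]})

# Stay 3 has no labs in the window, so it is absent from the feature table.
features_toy = pd.DataFrame({
    "stay_id": [1, 2],
    "lab_500_count": [2.0, None],
})
features_toy["lab_500_measured"] = (
    features_toy["lab_500_count"].fillna(0) > 0
).astype("int8")

merged = cohort_toy.merge(features_toy, on="stay_id", how="left")
n_nan_before = int(merged["lab_500_measured"].isna().sum())

# Treat "absent from the feature table" as "not measured".
merged["lab_500_measured"] = merged["lab_500_measured"].fillna(0).astype("int8")

print("NaN indicators right after the merge:", n_nan_before)
print(merged)
```

Filling the indicator with 0 is not imputation of a lab value; it only records the known fact that no measurement exists.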


In [None]:
# Create missingness indicators from count columns
count_cols = [c for c in features_wide.columns if c.endswith("_count") and c.startswith("lab_")]
miss_indicators = features_wide[["stay_id"]].copy()

for c in count_cols:
    base = c.replace("_count", "")
    miss_indicators[f"{base}_measured"] = (features_wide[c].fillna(0) > 0).astype("int8")

print("Missingness indicators:", miss_indicators.shape)
display(miss_indicators.head(5))


## 7) Merge features back onto the cohort and create a model-ready dataset

We keep:
- identifiers + label from cohort
- lab feature columns
- measured indicators

No imputation here yet — that belongs in the modeling notebook/pipeline.


In [None]:
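lab_features = features_wide.merge(miss_indicators, on="stay_id", how="left")

dataset = cohort.merge(lab_features, on="stay_id", how="left")

# Stays with none of the selected labs are absent from lab_features, so the left merge
# leaves their indicators as NaN. Those labs were not measured, so set them to 0.
measured_cols = [c for c in dataset.columns if c.startswith("lab_") and c.endswith("_measured")]
dataset[measured_cols] = dataset[measured_cols].fillna(0).astype("int8")

print("Dataset (cohort + lab features):", dataset.shape)
display(dataset.head(5))

# Quick: how many stays have at least one lab feature?
# Indicators are excluded because they are never null after the fill above.
feature_cols = [c for c in dataset.columns if c.startswith("lab_") and c not in measured_cols]
has_any = dataset[feature_cols].notna().any(axis=1).mean()
print(f"Fraction of ICU stays with ANY lab feature (non-null): {has_any:.3f}")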


## 8) Save artifacts

- `lab_features_24h.csv`: features only (keyed by stay_id)
- `dataset_model_ready.csv`: cohort + label + features


In [None]:
LAB_FEATURES_PATH = Path("lab_features_24h.csv")
DATASET_PATH = Path("dataset_model_ready.csv")

lab_features.to_csv(LAB_FEATURES_PATH, index=False)
dataset.to_csv(DATASET_PATH, index=False)

print("Saved:")
print(" ", LAB_FEATURES_PATH.resolve())
print(" ", DATASET_PATH.resolve())


## Next notebook
**`03_train_baseline_model.ipynb`**
- Load `dataset_model_ready.csv`
- Split train/test by ICU stay, keeping all stays of one patient (`subject_id`) on the same side (and later by time if using full data)
- Build a preprocessing + model pipeline:
  - imputation
  - scaling (for linear models)
  - baseline logistic regression
- Evaluate with ROC-AUC and PR-AUC
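
As a preview of the split step, here is a minimal sketch of a patient-grouped split using only pandas and NumPy on toy IDs. The 25% test fraction and the seed are arbitrary choices for illustration; scikit-learn's `GroupShuffleSplit` implements the same idea and may be the more convenient option in the modeling notebook.

```python
import numpy as np
import pandas as pd

# Toy table: subjects 1 and 3 each have two ICU stays.
toy = pd.DataFrame({
    "subject_id": [1, 1, 2, 3, 3, 4, 5, 6],
    "stay_id": list(range(100, 108)),
})

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
subjects = rng.permutation(toy["subject_id"].unique())

# Arbitrary 25% of patients go to the test set (at least one).
n_test = max(1, int(round(0.25 * len(subjects))))
test_subjects = subjects[:n_test].tolist()

is_test = toy["subject_id"].isin(test_subjects)
train, test = toy[~is_test], toy[is_test]

print("Train stays:", len(train), "| Test stays:", len(test))
print("Shared patients:", set(train["subject_id"]) & set(test["subject_id"]))
```

Splitting on patients rather than rows means the test fraction is only approximate in terms of stays, since patients contribute different numbers of stays.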
