# Preprocessing Pipeline (Leakage-Safe)
## Early Sepsis Prediction (PhysioNet 2019 / Kaggle consolidated)

### Objectives
- Load the raw dataset (`Dataset.csv`) in a reproducible manner
- Create a leakage-safe patient-level Train/Validation/Test split (stratified by patient-level sepsis)
- Save split patient IDs for consistent downstream experiments
- Prepare the data representation required for baseline ML models (later)
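
To illustrate why the split is done at the patient level, here is a minimal sketch on toy data (the frame `toy` and its 20 hypothetical patients are assumptions for illustration, not part of the real dataset). It contrasts a row-level split, where hourly rows of one patient can end up in both train and test, with a split over unique patient IDs.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 20 patients, 5 hourly rows each.
toy = pd.DataFrame({
    "Patient_ID": np.repeat(np.arange(20), 5),
    "ICULOS": np.tile(np.arange(1, 6), 20),
})

# Row-level split: rows of the same patient can land on both sides.
row_train, row_test = train_test_split(toy, test_size=0.3, random_state=0)
row_shared = set(row_train["Patient_ID"]) & set(row_test["Patient_ID"])

# Patient-level split: split the unique IDs, then select rows by ID.
ids = toy["Patient_ID"].unique()
id_train, id_test = train_test_split(ids, test_size=0.3, random_state=0)
pat_train = toy[toy["Patient_ID"].isin(id_train)]
pat_test = toy[toy["Patient_ID"].isin(id_test)]
pat_shared = set(pat_train["Patient_ID"]) & set(pat_test["Patient_ID"])

print("Patients shared by a row-level split:    ", len(row_shared))
print("Patients shared by a patient-level split:", len(pat_shared))
```

With a row-level split, a model could see earlier hours of a patient during training and later hours of the same patient at test time, which tends to inflate test metrics.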

In [1]:
from pathlib import Path
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

Load Data

In [2]:
# Paths
DATA_DIR = Path("../data/raw")
DATA_FILE = DATA_DIR / "Dataset.csv"

INTERIM_DIR = Path("../data/interim")
INTERIM_DIR.mkdir(parents=True, exist_ok=True)

# Load dataset
df = pd.read_csv(DATA_FILE)

print("Loaded Dataset.csv")
print("Shape:", df.shape)
print("Columns:", len(df.columns))

Loaded Dataset.csv
Shape: (1552210, 44)
Columns: 44


Defining Key Columns

In [3]:
patient_id_col = "Patient_ID"
time_col = "ICULOS"
target_col = "SepsisLabel"

# Drop obvious index artifact if present
if "Unnamed: 0" in df.columns:
    df = df.drop(columns=["Unnamed: 0"])

print("After dropping index artifact (if present):", df.shape)
print("Unique patients:", df[patient_id_col].nunique())

After dropping index artifact (if present): (1552210, 43)
Unique patients: 40336
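
Before relying on the key columns, it may help to run a few basic sanity checks on them. The sketch below uses a hypothetical helper, `summarize_key_columns`, on a small toy frame; the helper name and the specific checks are assumptions for illustration, and whether the real dataset passes them has not been verified here.

```python
import pandas as pd

def summarize_key_columns(frame, id_col="Patient_ID", time_col="ICULOS",
                          target_col="SepsisLabel"):
    """Hypothetical helper: count basic problems in the key columns."""
    keys = [id_col, time_col, target_col]
    missing = [c for c in keys if c not in frame.columns]
    if missing:
        raise KeyError(f"Missing required columns: {missing}")
    return {
        # Missing values in any of the three key columns
        "missing_values": int(frame[keys].isna().sum().sum()),
        # Labels that are not exactly 0 or 1 (NaN counts as non-binary)
        "non_binary_labels": int((~frame[target_col].isin([0, 1])).sum()),
        # Repeated (patient, hour) pairs
        "duplicate_patient_hours": int(
            frame.duplicated(subset=[id_col, time_col]).sum()
        ),
    }

# Toy frame (hypothetical) with one repeated (patient, hour) row.
toy = pd.DataFrame({
    "Patient_ID":  [1, 1, 1, 2, 2],
    "ICULOS":      [1, 2, 2, 1, 2],
    "SepsisLabel": [0, 0, 0, 0, 1],
})
summary = summarize_key_columns(toy)
print(summary)
```

On the real data the same call would be `summarize_key_columns(df, patient_id_col, time_col, target_col)`.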


In [4]:
RANDOM_SEED = 42

# Patient-level label: 1 if patient ever becomes septic, else 0
patient_labels = (
    df.groupby(patient_id_col)[target_col]
      .max()
      .astype(int)
)

print("Patients:", patient_labels.shape[0])
print("Sepsis patients:", int(patient_labels.sum()))
print("Patient-level prevalence:", round(patient_labels.mean(), 4))

Patients: 40336
Sepsis patients: 2932
Patient-level prevalence: 0.0727
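
The patient-level label above collapses each patient's hourly labels with `max()`. A minimal sketch on toy data (the frame `toy` is hypothetical) shows what that aggregation does and why patient-level prevalence differs from row-level prevalence.

```python
import pandas as pd

# Toy data (hypothetical): patient 2 becomes septic, patients 1 and 3 do not.
toy = pd.DataFrame({
    "Patient_ID":  [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "SepsisLabel": [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
})

# max() over a 0/1 column is 1 if any hour is labeled septic, else 0.
toy_patient_labels = toy.groupby("Patient_ID")["SepsisLabel"].max().astype(int)

row_prev = toy["SepsisLabel"].mean()   # share of septic rows
pat_prev = toy_patient_labels.mean()   # share of septic patients

print(toy_patient_labels.to_dict())
print("Row-level prevalence:    ", round(row_prev, 4))
print("Patient-level prevalence:", round(pat_prev, 4))
```

Stratifying on the patient-level label keeps the share of septic patients similar across splits; it does not by itself guarantee equal row-level prevalence.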


Leakage-Safe Split

In [5]:
# 70% train, 30% temp
train_ids, temp_ids = train_test_split(
    patient_labels.index,
    test_size=0.30,
    random_state=RANDOM_SEED,
    stratify=patient_labels.values
)

# temp -> 15% val, 15% test
val_ids, test_ids = train_test_split(
    temp_ids,
    test_size=0.50,
    random_state=RANDOM_SEED,
    stratify=patient_labels.loc[temp_ids].values
)

print("Train patients:", len(train_ids))
print("Val patients:  ", len(val_ids))
print("Test patients: ", len(test_ids))

def prevalence(ids):
    return patient_labels.loc[ids].mean()

print("\nPatient-level sepsis prevalence:")
print("Train:", round(prevalence(train_ids), 4))
print("Val:  ", round(prevalence(val_ids), 4))
print("Test: ", round(prevalence(test_ids), 4))

# Sanity: no overlap
overlap = (set(train_ids) & set(val_ids)) | (set(train_ids) & set(test_ids)) | (set(val_ids) & set(test_ids))
print("\nPatient overlap across splits:", len(overlap))

Train patients: 28235
Val patients:   6050
Test patients:  6051

Patient-level sepsis prevalence:
Train: 0.0727
Val:   0.0727
Test:  0.0727

Patient overlap across splits: 0
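
Downstream, the split IDs are typically mapped back to hourly rows. The following is a minimal sketch on toy data, assuming row selection with `isin`; the frame `toy` and the ID lists in `splits` are hypothetical stand-ins for `df` and `train_ids` / `val_ids` / `test_ids`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 4 patients, 3 hourly rows each.
toy = pd.DataFrame({
    "Patient_ID": np.repeat([1, 2, 3, 4], 3),
    "ICULOS": np.tile([1, 2, 3], 4),
    "SepsisLabel": [0, 0, 0,  0, 0, 1,  0, 0, 0,  0, 1, 1],
})
splits = {"train": [1, 2], "val": [3], "test": [4]}  # hypothetical ID lists

# Select all rows belonging to each split's patients.
frames = {
    name: toy[toy["Patient_ID"].isin(ids)].copy()
    for name, ids in splits.items()
}

# Every row should land in exactly one split.
total_rows = sum(len(f) for f in frames.values())
print("Rows accounted for:", total_rows, "of", len(toy))

for name, f in frames.items():
    print(name, f.shape, "row-level prevalence:", round(f["SepsisLabel"].mean(), 4))
```

Row-level prevalence can differ between splits even when patient-level prevalence matches, because patients have different lengths of stay.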


In [6]:
pd.Series(train_ids).to_csv(INTERIM_DIR / "train_patient_ids.csv", index=False, header=[patient_id_col])
pd.Series(val_ids).to_csv(INTERIM_DIR / "val_patient_ids.csv", index=False, header=[patient_id_col])
pd.Series(test_ids).to_csv(INTERIM_DIR / "test_patient_ids.csv", index=False, header=[patient_id_col])

print("Saved split IDs to:", INTERIM_DIR.resolve())

Saved split IDs to: C:\Users\Nikhitha\OneDrive\Documents\EarlySepsisPrediction\data\interim
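
To reuse the saved splits in later notebooks, the ID files can be read back with `pd.read_csv`. This is a minimal round-trip sketch that writes to a temporary directory rather than `INTERIM_DIR`; the three IDs are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

patient_id_col = "Patient_ID"
ids = pd.Index([101, 205, 309], name=patient_id_col)  # hypothetical IDs

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train_patient_ids.csv"

    # Same write pattern as the cell above: one column, no index.
    pd.Series(ids).to_csv(path, index=False, header=[patient_id_col])

    # Read the IDs back as a plain array for use with DataFrame.isin.
    reloaded = pd.read_csv(path)[patient_id_col].to_numpy()

print("Round-trip OK:", list(reloaded) == list(ids))
```

Reading the IDs from disk, instead of re-running the split, keeps experiments on the same patients even if the splitting code or library versions change.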
