# Stage 2 — Onset Definition & Label Construction

STAGE ALIGNMENT:

Input:
    Data/interim/sepsis_timeseries_full.pkl

Output:
    Data/processed/sepsis_labeled_2h.pkl
    Data/processed/sepsis_labeled_4h.pkl
    Data/processed/sepsis_labeled_6h.pkl

Next Stage Dependency:
    Baseline modeling and custom neural network training.

In [1]:
import pandas as pd
from pathlib import Path

INTERIM_PATH = Path("../Data/interim")
PROCESSED_PATH = Path("../Data/processed")

df = pd.read_pickle(INTERIM_PATH / "sepsis_timeseries_full.pkl")

print("Loaded dataset shape:", df.shape)
print("Unique patients:", df['id'].nunique())


Loaded dataset shape: (602568, 47)
Unique patients: 1275


### Operational Definition of Onset

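The sepsis label is constant per patient, and no 0→1 transition exists
within any trajectory, so the data contains no explicit onset time.
We therefore define sepsis onset for septic patients as:

The final recorded timestep in their trajectory.

For each prediction horizon H ∈ {2h, 4h, 6h}, at time t:

- Label_H(t) = 1 if onset_time − H ≤ t ≤ onset_time, i.e. onset occurs within the next H hours (the onset timestep itself counts as positive).
- Label_H(t) = 0 otherwise.

Here t and onset_time are values of the `timestep` column, which the labeling code treats as hours.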

For non-septic patients:
Label_H = 0 for all timesteps.
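
The premise that the `sepsis` flag never changes within a patient can be checked directly. A minimal sketch on a toy frame, assuming the same `id` and `sepsis` column names as the loaded dataset; the toy values are made up for illustration:

```python
import pandas as pd

# Toy stand-in for the real dataframe (values are illustrative only).
toy = pd.DataFrame({
    "id":       [1, 1, 1, 2, 2, 2],
    "timestep": [0.0, 0.5, 1.0, 0.0, 0.5, 1.0],
    "sepsis":   [1, 1, 1, 0, 0, 0],
})

# Number of distinct sepsis values seen per patient.
# If every patient has exactly one distinct value, the flag is constant
# within each trajectory and no 0 -> 1 transition can exist.
distinct_per_patient = toy.groupby("id")["sepsis"].nunique()
label_is_constant = bool((distinct_per_patient == 1).all())

print("Label constant per patient:", label_is_constant)
```

Running the same two lines against `df` instead of `toy` would confirm the premise on the full dataset.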


In [2]:
# Extract septic patients
septic_ids = df[df['sepsis'] == 1]['id'].unique()

# Compute onset time (final timestep) per septic patient
onset_times = (
    df[df['id'].isin(septic_ids)]
    .groupby('id')['timestep']
    .max()
    .reset_index()
)

onset_times.columns = ['id', 'onset_time']

print("Septic patients:", len(onset_times))
onset_times.head()


Septic patients: 296


|   | id | onset_time |
|---|-------|--------|
| 0 | 11555 | 1978.5 |
| 1 | 11592 | 1646.0 |
| 2 | 11626 | 522.0 |
| 3 | 11657 | 568.5 |
| 4 | 11658 | 952.0 |


In [3]:
# Merge onset time into main dataframe
df = df.merge(onset_times, on='id', how='left')

# Non-septic patients will have NaN onset_time
print("Rows with onset_time:", df['onset_time'].notna().sum())
print("Rows without onset_time:", df['onset_time'].isna().sum())


Rows with onset_time: 364745
Rows without onset_time: 237823
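
The row counts above rely on two properties of the left merge: patients absent from `onset_times` receive NaN, and no timeseries row is duplicated. A small sketch on made-up data, using the `validate` argument of pandas `merge` to enforce one onset row per patient (the toy frames and the name `merged` are illustrative):

```python
import pandas as pd

rows = pd.DataFrame({
    "id":       [1, 1, 2, 2],
    "timestep": [0.0, 0.5, 0.0, 0.5],
})
# Only patient 1 is septic, so only patient 1 has an onset row.
onsets = pd.DataFrame({"id": [1], "onset_time": [0.5]})

# validate="many_to_one" raises MergeError if `onsets` contains duplicate ids,
# which would otherwise silently duplicate timeseries rows.
merged = rows.merge(onsets, on="id", how="left", validate="many_to_one")

assert len(merged) == len(rows)  # every original row is kept exactly once
print("Rows with onset_time:", merged["onset_time"].notna().sum())
print("Rows without onset_time:", merged["onset_time"].isna().sum())
```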


In [4]:
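def create_horizon_label(df, horizon):
    """Return a copy of df with a binary label_{horizon}h column.

    horizon is in the same units as the timestep column (treated as hours).
    """
    df_copy = df.copy()
    label_col = f'label_{horizon}h'

    # Initialize label
    df_copy[label_col] = 0

    # Only septic patients (non-NaN onset_time) can have positive labels
    mask = df_copy['onset_time'].notna()

    # Label = 1 if onset occurs within next H hours,
    # i.e. timestep >= onset_time - horizon (the onset row is included)
    df_copy.loc[mask, label_col] = (
        df_copy.loc[mask, 'timestep'] >=
        df_copy.loc[mask, 'onset_time'] - horizon
    ).astype(int)

    return df_copy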
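
To see which rows the rule marks positive, here is a sketch that applies the same comparison to a made-up trajectory. The 0.5-hour sampling interval and all values are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# One septic patient (id 1, onset at the last timestep, 3.0 h)
# and one non-septic patient (id 2, onset_time is NaN).
toy = pd.DataFrame({
    "id":         [1] * 7 + [2] * 3,
    "timestep":   [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 0.0, 0.5, 1.0],
    "onset_time": [3.0] * 7 + [np.nan] * 3,
})

horizon = 2  # hours
# Same rule as create_horizon_label: positive when timestep >= onset_time - H.
# Comparisons against NaN evaluate to False, so non-septic rows stay 0.
toy["label_2h"] = (toy["timestep"] >= toy["onset_time"] - horizon).astype(int)

print(toy[["id", "timestep", "label_2h"]])
print("Positives for patient 1:", toy.loc[toy["id"] == 1, "label_2h"].sum())
```

In this toy case the septic patient gets 5 positive rows (timesteps 1.0 through 3.0), because the onset row itself is included in the window.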


In [5]:
df_2h = create_horizon_label(df, 2)
df_4h = create_horizon_label(df, 4)
df_6h = create_horizon_label(df, 6)

print("2h positives:", df_2h['label_2h'].sum())
print("4h positives:", df_4h['label_4h'].sum())
print("6h positives:", df_6h['label_6h'].sum())


2h positives: 1480
4h positives: 2664
6h positives: 3848
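
These counts can be put in context with a little arithmetic. The sketch below uses only figures printed earlier in this notebook (602,568 rows, 296 septic patients) to derive the positive rate and the number of positive rows per septic patient:

```python
total_rows = 602_568
septic_patients = 296
positives = {2: 1_480, 4: 2_664, 6: 3_848}  # horizon in hours -> positive rows

# Share of all rows that are positive, per horizon.
rates = {h: n / total_rows for h, n in positives.items()}
# Average number of positive rows contributed by each septic patient.
per_patient = {h: n / septic_patients for h, n in positives.items()}

for h in positives:
    print(f"{h}h: {rates[h]:.2%} positive, "
          f"{per_patient[h]:.0f} rows per septic patient")
```

The per-patient counts come out to 5, 9, and 13, i.e. 2H + 1 rows for each horizon H. That is consistent with `timestep` being expressed in hours and sampled every 0.5 hours, with the onset row included.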


In [6]:
# Drop helper column
df_2h = df_2h.drop(columns=['onset_time'])
df_4h = df_4h.drop(columns=['onset_time'])
df_6h = df_6h.drop(columns=['onset_time'])

# Save processed datasets (create the output directory if it is missing)
PROCESSED_PATH.mkdir(parents=True, exist_ok=True)
df_2h.to_pickle(PROCESSED_PATH / "sepsis_labeled_2h.pkl")
df_4h.to_pickle(PROCESSED_PATH / "sepsis_labeled_4h.pkl")
df_6h.to_pickle(PROCESSED_PATH / "sepsis_labeled_6h.pkl")

print("Saved labeled datasets successfully.")


Saved labeled datasets successfully.


## Stage 2 — Summary

Objective:
To operationally define sepsis onset and construct prediction labels
for 2-hour, 4-hour, and 6-hour horizons.

Key Decisions:

1. Onset Definition:
Since sepsis labels are constant per patient and no 0→1 transition exists,
sepsis onset was defined as the final recorded timestep
for septic patients.

2. Horizon-Based Labeling:
For each horizon H ∈ {2h, 4h, 6h}:
A timestep is labeled 1 if onset occurs within the next H hours.
All other timesteps are labeled 0.
Non-septic patients are labeled 0 for all timesteps.

3. Dataset Construction:
Three processed datasets were created:
- sepsis_labeled_2h.pkl
- sepsis_labeled_4h.pkl
- sepsis_labeled_6h.pkl

Observations:

- Total patients: 1,275
- Septic patients: 296
- Non-septic patients: 979

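Positive samples per horizon (out of 602,568 rows):
- 2h: 1,480 (≈0.25%)
- 4h: 2,664 (≈0.44%)
- 6h: 3,848 (≈0.64%)

Positives make up less than 1% of rows at every horizon, and the imbalance becomes more extreme as the horizon shortens.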

Implications for Next Stage:

- Severe class imbalance must be handled during training.
- Evaluation metrics must go beyond accuracy.
- Patient-level splitting must be enforced to prevent leakage.
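
One way to enforce the patient-level split is to partition the unique `id` values first and then select rows by membership. A minimal sketch with NumPy and pandas on made-up data; the function name `split_by_patient`, the 80/20 ratio, and the seed are illustrative assumptions, not choices made in this stage:

```python
import numpy as np
import pandas as pd

def split_by_patient(df, test_fraction=0.2, seed=0):
    """Split rows into train/test so that no patient id appears in both."""
    rng = np.random.default_rng(seed)
    shuffled_ids = rng.permutation(df["id"].unique())
    n_test = max(1, int(round(len(shuffled_ids) * test_fraction)))
    test_ids = shuffled_ids[:n_test]
    is_test = df["id"].isin(test_ids)
    return df[~is_test], df[is_test]

# Toy data: 10 patients with 4 timesteps each.
toy = pd.DataFrame({
    "id": np.repeat(np.arange(10), 4),
    "timestep": np.tile([0.0, 0.5, 1.0, 1.5], 10),
})
train, test = split_by_patient(toy)

# No patient contributes rows to both sides of the split.
assert set(train["id"]).isdisjoint(set(test["id"]))
print("Train patients:", train["id"].nunique(),
      "| Test patients:", test["id"].nunique())
```

With only 296 septic patients out of 1,275, the split may also need to be stratified on the patient-level `sepsis` flag so that both sides contain septic trajectories; this sketch does not do that.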

Stage 2 transforms the interim time-series dataset
into three horizon-specific supervised learning datasets,
ready for baseline and neural network modeling.
