# Early Prediction of Sepsis Using ICU Time-Series Data
## Exploratory Data Analysis (EDA)

**Dataset:** PhysioNet / Computing in Cardiology Challenge 2019  
**Prediction Task:** Predict onset of sepsis within the next 6 hours  
**Unit of analysis:** Hourly ICU time steps per patient

### Objectives of this notebook
- Understand the structure and format of the raw ICU time-series data
- Examine feature groups (vitals, labs, demographics, outcome)
- Analyse missingness patterns and temporal coverage
- Inspect the distribution and behaviour of the `SepsisLabel`
- Identify challenges relevant for preprocessing and modelling

### Key EDA Questions

1. How many patient ICU stays are available in the dataset?
2. What is the typical length of an ICU time series?
3. How prevalent is sepsis at the time-step and patient level?
4. How severe is missingness across vital signs and laboratory variables?
5. Are there observable temporal patterns prior to sepsis onset?


In [1]:
# Core Python utilities
from pathlib import Path
import os

# Numerical and data handling
import numpy as np
import pandas as pd

# Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# Display settings for clarity
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 50)

# Set a clean plotting style (set_theme is the current name; sns.set is an older alias)
sns.set_theme(style="whitegrid")

### Raw Data Directory

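The raw PhysioNet 2019 challenge data is stored in `data/raw/`.
The original dataset consists of multiple files, where each file corresponds
to a single ICU patient stay with hourly time-series observations.
This notebook reads a single combined file, `Dataset.csv`, in which the
rows of all patient stays are stacked and distinguished by `Patient_ID`.

At this stage, we only confirm that the file exists, load it, and inspect
its basic structure for exploratory purposes.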
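Before loading anything, it can help to see what the raw directory actually contains. The following is a minimal sketch, not part of the original notebook: `summarise_directory` is a hypothetical helper, and the demonstration runs on a throwaway temporary directory with invented files rather than on `data/raw/`.

```python
from pathlib import Path
import tempfile


def summarise_directory(data_dir):
    """Return {file suffix: count} for files directly under data_dir."""
    counts = {}
    for path in Path(data_dir).iterdir():
        if path.is_file():
            counts[path.suffix] = counts.get(path.suffix, 0) + 1
    return counts


# Demonstration on a temporary directory (toy files, not the real data)
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "Dataset.csv").write_text("Hour,HR\n0,65\n")
    (Path(tmp) / "notes.txt").write_text("toy file")
    demo_counts = summarise_directory(tmp)

print(demo_counts)
```

On the real project, the same helper could be called as `summarise_directory("../data/raw")` to confirm whether the directory holds one combined CSV or many per-patient files.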

In [3]:
# Paths
DATA_DIR = Path("../data/raw")
DATA_FILE = DATA_DIR / "Dataset.csv"

# Sanity check
print("Dataset.csv exists:", DATA_FILE.exists())
print("Dataset.csv path:", DATA_FILE.resolve())

# Load (keep it simple for EDA; we'll optimize later if needed)
# Note: the file carries an unnamed leading column ('Unnamed: 0'), probably a saved index
df = pd.read_csv(DATA_FILE)

# Basic inspection
print("\nShape (rows, columns):", df.shape)
print("\nColumn names:")
print(df.columns.tolist())

print("\nPreview (first 5 rows):")
display(df.head())

print("\nData types summary:")
display(df.dtypes.value_counts())

print("\nMissing values (top 15 columns by missing count):")
missing_counts = df.isna().sum().sort_values(ascending=False)
display(missing_counts.head(15))


Dataset.csv exists: True
Dataset.csv path: C:\Users\Nikhitha\OneDrive\Desktop\EarlySepsisPrediction\data\raw\Dataset.csv

Shape (rows, columns): (1552210, 44)

Column names:
['Unnamed: 0', 'Hour', 'HR', 'O2Sat', 'Temp', 'SBP', 'MAP', 'DBP', 'Resp', 'EtCO2', 'BaseExcess', 'HCO3', 'FiO2', 'pH', 'PaCO2', 'SaO2', 'AST', 'BUN', 'Alkalinephos', 'Calcium', 'Chloride', 'Creatinine', 'Bilirubin_direct', 'Glucose', 'Lactate', 'Magnesium', 'Phosphate', 'Potassium', 'Bilirubin_total', 'TroponinI', 'Hct', 'Hgb', 'PTT', 'WBC', 'Fibrinogen', 'Platelets', 'Age', 'Gender', 'Unit1', 'Unit2', 'HospAdmTime', 'ICULOS', 'SepsisLabel', 'Patient_ID']

Preview (first 5 rows):


Unnamed: 0.1,Unnamed: 0,Hour,HR,O2Sat,Temp,SBP,MAP,DBP,Resp,EtCO2,BaseExcess,HCO3,FiO2,pH,PaCO2,SaO2,AST,BUN,Alkalinephos,Calcium,Chloride,Creatinine,Bilirubin_direct,Glucose,Lactate,Magnesium,Phosphate,Potassium,Bilirubin_total,TroponinI,Hct,Hgb,PTT,WBC,Fibrinogen,Platelets,Age,Gender,Unit1,Unit2,HospAdmTime,ICULOS,SepsisLabel,Patient_ID
0,0,0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,68.54,0,,,-0.02,1,0,17072
1,1,1,65.0,100.0,,,72.0,,16.5,,,,0.4,,,,,,,,,,,,,,,,,,,,,,,,68.54,0,,,-0.02,2,0,17072
2,2,2,78.0,100.0,,,42.5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,68.54,0,,,-0.02,3,0,17072
3,3,3,73.0,100.0,,,,,17.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,68.54,0,,,-0.02,4,0,17072
4,4,4,70.0,100.0,,129.0,74.0,69.0,14.0,,,26.0,0.4,,,,,23.0,,9.6,104.0,0.8,,161.0,,1.6,2.1,3.2,,,29.7,9.5,30.6,11.3,,330.0,68.54,0,,,-0.02,5,0,17072



Data types summary:


float64    38
int64       6
Name: count, dtype: int64


Missing values (top 15 columns by missing count):


Bilirubin_direct    1549220
Fibrinogen          1541968
TroponinI           1537429
Bilirubin_total     1529069
Alkalinephos        1527269
AST                 1527027
Lactate             1510764
PTT                 1506511
SaO2                1498649
EtCO2               1494574
Phosphate           1489909
HCO3                1487182
Chloride            1481744
BaseExcess          1468065
PaCO2               1465909
dtype: int64
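Raw counts are hard to compare at a glance, so it may be more readable to express missingness as a fraction of rows. For instance, `Bilirubin_direct` is missing in 1,549,220 of 1,552,210 rows, roughly 99.8%. The sketch below shows one way to compute this; `missing_fraction` is a hypothetical helper and the small DataFrame is invented toy data, not taken from `Dataset.csv`.

```python
import numpy as np
import pandas as pd


def missing_fraction(frame):
    """Fraction of missing values per column, sorted from most to least missing."""
    return frame.isna().mean().sort_values(ascending=False)


# Toy data for illustration only
toy = pd.DataFrame({
    "HR": [65.0, 78.0, np.nan, 70.0],
    "Lactate": [np.nan, np.nan, np.nan, 1.6],
    "Age": [68.54, 68.54, 68.54, 68.54],
})

toy_missing = missing_fraction(toy)
print(toy_missing)
```

Applied to the notebook's data, `missing_fraction(df).head(15)` would give the same ranking as the counts above, on a 0 to 1 scale.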

### Identifying Key Columns

In [4]:
# Identify key columns
patient_id_col = "Patient_ID"
time_col = "ICULOS"
target_col = "SepsisLabel"

print("Patient ID column:", patient_id_col)
print("Time column:", time_col)
print("Target column:", target_col)

# Basic sanity checks
print("\nUnique patients:", df[patient_id_col].nunique())
print("Min ICULOS:", df[time_col].min())
print("Max ICULOS:", df[time_col].max())

print("\nTarget value counts (time-step level):")
print(df[target_col].value_counts())

Patient ID column: Patient_ID
Time column: ICULOS
Target column: SepsisLabel

Unique patients: 40336
Min ICULOS: 1
Max ICULOS: 336

Target value counts (time-step level):
SepsisLabel
0    1524294
1      27916
Name: count, dtype: int64
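Since the unit of analysis is the hourly time step, a further sanity check worth considering is whether each patient's `ICULOS` values are unique and consecutive. The sketch below is one possible way to do this under the assumption that a regular series increases by exactly one hour per row; `timeline_report` is a hypothetical helper and the DataFrame is toy data.

```python
import pandas as pd


def timeline_report(frame, id_col="Patient_ID", time_col="ICULOS"):
    """Count duplicated (patient, hour) rows and patients with skipped hours."""
    ordered = frame.sort_values([id_col, time_col])
    # Rows that repeat a (patient, hour) pair already seen
    n_duplicate_rows = int(ordered.duplicated(subset=[id_col, time_col]).sum())
    # Hour-to-hour step within each patient; the first row per patient is NaN
    step = ordered.groupby(id_col)[time_col].diff()
    # A step larger than one hour means at least one hour is missing
    has_gap = (step > 1).groupby(ordered[id_col]).any()
    return {
        "duplicate_rows": n_duplicate_rows,
        "patients_with_gaps": int(has_gap.sum()),
    }


# Toy data: patient 2 skips hour 2, patient 3 repeats hour 1
toy = pd.DataFrame({
    "Patient_ID": [1, 1, 1, 2, 2, 2, 3, 3],
    "ICULOS": [1, 2, 3, 1, 3, 4, 1, 1],
})

report = timeline_report(toy)
print(report)
```

On the notebook's data this would be called as `timeline_report(df)`. Non-zero counts would suggest that time steps need re-indexing before windowed features are built.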


### Sepsis Prevalence

In [5]:
# Time-step level prevalence
timestep_prevalence = df[target_col].mean()

# Patient-level prevalence (did the patient ever have SepsisLabel = 1?)
patient_prevalence = (
    df.groupby(patient_id_col)[target_col]
      .max()
      .mean()
)

print(f"Time-step level sepsis prevalence: {timestep_prevalence:.4f}")
print(f"Patient-level sepsis prevalence: {patient_prevalence:.4f}")

Time-step level sepsis prevalence: 0.0180
Patient-level sepsis prevalence: 0.0727
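The two prevalence figures can also be read as a class-imbalance problem: from the counts above, 1,524,294 negative time steps against 27,916 positive ones is roughly 55 to 1. The sketch below computes the same quantities in one place; `imbalance_summary` is a hypothetical helper and the DataFrame is toy data.

```python
import pandas as pd


def imbalance_summary(frame, id_col="Patient_ID", target_col="SepsisLabel"):
    """Summarise class imbalance at the patient and time-step level."""
    # A patient counts as septic if any of their time steps is labelled 1
    per_patient = frame.groupby(id_col)[target_col].max()
    n_pos_steps = int(frame[target_col].sum())
    n_neg_steps = int(len(frame) - n_pos_steps)
    ratio = n_neg_steps / n_pos_steps if n_pos_steps else float("inf")
    return {
        "septic_patients": int(per_patient.sum()),
        "total_patients": int(per_patient.size),
        "neg_to_pos_ratio": ratio,
    }


# Toy data: only patient 1 is ever labelled septic
toy = pd.DataFrame({
    "Patient_ID": [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    "SepsisLabel": [0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
})

summary = imbalance_summary(toy)
print(summary)
```

A ratio of this kind is one possible starting point for class weights or resampling during modelling, although the appropriate choice depends on the model and the evaluation metric.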


### ICU length-of-stay distribution

In [6]:
# ICU length of stay per patient
icu_los = (
    df.groupby(patient_id_col)[time_col]
      .max()
)

print("ICU length-of-stay summary (hours):")
display(icu_los.describe())

ICU length-of-stay summary (hours):


count    40336.000000
mean        39.010115
std         22.924641
min          8.000000
25%         24.000000
50%         39.000000
75%         47.000000
max        336.000000
Name: ICULOS, dtype: float64
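The summary above pools all patients. A natural follow-up is to split length of stay by whether the patient was ever labelled septic. The sketch below shows one way to do this, with length of stay taken as the maximum `ICULOS` per patient as in the cell above; `los_by_sepsis_status` is a hypothetical helper and the DataFrame is toy data.

```python
import pandas as pd


def los_by_sepsis_status(frame, id_col="Patient_ID", time_col="ICULOS",
                         target_col="SepsisLabel"):
    """Describe per-patient length of stay, grouped by ever-septic status."""
    per_patient = frame.groupby(id_col).agg(
        los_hours=(time_col, "max"),      # last recorded ICU hour
        ever_septic=(target_col, "max"),  # 1 if any time step is labelled 1
    )
    return per_patient.groupby("ever_septic")["los_hours"].describe()


# Toy data: patient 1 (septic) stays 4 hours; patients 2 and 3 stay 3 and 7 hours
toy = pd.DataFrame({
    "Patient_ID": [1] * 4 + [2] * 3 + [3] * 7,
    "ICULOS": [1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 5, 6, 7],
    "SepsisLabel": [0, 0, 1, 1] + [0] * 3 + [0] * 7,
})

los_summary = los_by_sepsis_status(toy)
print(los_summary)
```

On the notebook's data this would be called as `los_by_sepsis_status(df)`. Any difference between the two groups should be read with care, since records in this dataset end at different points for different patients.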