# 05a: Data Preparation for CNN-LSTM (Temporal Resolution Experiments)

This notebook prepares ICU time series and static data for CNN-LSTM modeling. Each patient's 48-hour record is binned into windows of 6, 12, 24, or 36 hours, so that downstream experiments can compare temporal resolutions systematically and identify the earliest point at which accurate mortality predictions can be made.

**Goals:**
- Prepare dynamic (time series), static, and outcome data for CNN-LSTM modeling
- Compare temporal resolutions: 6, 12, 24, and 36 hours
- Save processed datasets for each resolution for downstream modeling and analysis

## Workflow Overview

1. Data Loading and Audit
2. Feature Selection
3. Sequence Construction for Multiple Temporal Resolutions
4. Padding, Scaling, and Splitting
5. Save Processed Data for Each Resolution

In [1]:
# Imports and Configuration
import pandas as pd
import numpy as np
import random
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
import warnings
warnings.filterwarnings('ignore')
# Seed every RNG in use so that splits and resampling are reproducible
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

## 1. Data Loading and Initial Audit

Load the cleaned ICU dataset and missingness mask.

In [2]:
data = pd.read_csv('../data/processed/timeseries_cleaned_all_features.csv')
mask = pd.read_csv('../data/processed/timeseries_missingness_mask.csv')
print('Data shape:', data.shape)
display(data.head())

Data shape: (295354, 48)


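| Unnamed: 0 | RecordID | Minutes | ALP | ALT | AST | Albumin | BUN | Bilirubin | Cholesterol | Creatinine | ... | Age | Gender | Height | ICUType | In-hospital_death | Length_of_stay | SAPS-I | SOFA | Survival | Weight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 132539 | 7 | 0.0 | 0.0 | 0.0 | 0.0 | 13.0 | 0.0 | 0.0 | 0.8 | ... | 54.0 | 0.0 | 170.2 | 4.0 | 0.0 | 5.0 | 6.0 | 1.0 | 70.0 | 78.6 |
| 1 | 132539 | 37 | 0.0 | 0.0 | 0.0 | 0.0 | 13.0 | 0.0 | 0.0 | 0.8 | ... | 54.0 | 0.0 | 170.2 | 4.0 | 0.0 | 5.0 | 6.0 | 1.0 | 70.0 | 78.6 |
| 2 | 132539 | 97 | 0.0 | 0.0 | 0.0 | 0.0 | 13.0 | 0.0 | 0.0 | 0.8 | ... | 54.0 | 0.0 | 170.2 | 4.0 | 0.0 | 5.0 | 6.0 | 1.0 | 70.0 | 78.6 |
| 3 | 132539 | 157 | 0.0 | 0.0 | 0.0 | 0.0 | 13.0 | 0.0 | 0.0 | 0.8 | ... | 54.0 | 0.0 | 170.2 | 4.0 | 0.0 | 5.0 | 6.0 | 1.0 | 70.0 | 78.6 |
| 4 | 132539 | 188 | 0.0 | 0.0 | 0.0 | 0.0 | 13.0 | 0.0 | 0.0 | 0.8 | ... | 54.0 | 0.0 | 170.2 | 4.0 | 0.0 | 5.0 | 6.0 | 1.0 | 70.0 | 78.6 |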
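
Beyond `head()`, a quick audit can confirm how many patients the table covers and how imbalanced the outcome is. The sketch below runs on a tiny made-up frame that reuses the `RecordID`, `Minutes`, and `In-hospital_death` column names; the helper name `audit_summary` is hypothetical, and the numbers are illustrative only, not taken from the real dataset.

```python
import pandas as pd

def audit_summary(df, id_col="RecordID", target_col="In-hospital_death"):
    """Return patient count, median rows per patient, and per-patient mortality rate."""
    rows_per_patient = df.groupby(id_col).size()
    # The outcome is repeated on every row, so take one value per patient.
    outcome_per_patient = df.groupby(id_col)[target_col].first()
    return {
        "n_patients": int(rows_per_patient.shape[0]),
        "median_rows_per_patient": float(rows_per_patient.median()),
        "mortality_rate": float(outcome_per_patient.mean()),
    }

# Toy data: three patients, one of whom died (illustrative values only).
toy = pd.DataFrame({
    "RecordID": [1, 1, 1, 2, 2, 3],
    "Minutes": [7, 37, 97, 5, 65, 10],
    "In-hospital_death": [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
})
summary = audit_summary(toy)
print(summary)
```

Computing the mortality rate per patient instead of per row avoids weighting patients by how many measurements they have.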


## 2. Feature Selection
Select features based on EDA findings and clinical relevance.

In [3]:
time_series_features = [
    'HR', 'SysABP', 'DiasABP', 'MAP', 'NISysABP', 'NIDiasABP', 'NIMAP', 'MechVent',
    'RespRate', 'SaO2', 'FiO2', 'PaO2', 'PaCO2',
    'Creatinine', 'BUN', 'Urine',
    'Na', 'K', 'Glucose', 'Lactate', 'HCO3', 'pH',
    'GCS', 'Temp',
    'Shock Index', 'PaO2/FiO2 ratio', 'Pulse Pressure', 'MAP to HR ratio'
]
static_features = ['Age', 'Gender', 'Height', 'Weight', 'ICUType', 'SAPS-I', 'SOFA']
target_col = 'In-hospital_death'
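
Four of the names in `time_series_features` are derived columns that the next cell computes, so they are not expected to be present when the list is defined. A small check like the one below makes that dependency explicit before any indexing fails with a `KeyError`. This is only a sketch: `missing_columns` is a hypothetical helper, and the column lists are shortened for illustration.

```python
def missing_columns(available, required):
    """Return the required column names that are absent, in their original order."""
    available = set(available)
    return [name for name in required if name not in available]

# Shortened, illustrative column lists (not the full dataset schema).
csv_columns = ["RecordID", "Minutes", "HR", "SysABP", "DiasABP", "MAP"]
wanted = ["HR", "SysABP", "Shock Index", "Pulse Pressure"]

absent = missing_columns(csv_columns, wanted)
print(absent)
```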


In [4]:
# Derived feature engineering: add the derived columns listed in time_series_features.
# Zero denominators are mapped to NaN so the ratios are undefined rather than infinite.
data['Shock Index'] = data['HR'] / data['SysABP'].replace(0, np.nan)
data['PaO2/FiO2 ratio'] = data['PaO2'] / data['FiO2'].replace(0, np.nan)
data['Pulse Pressure'] = data['SysABP'] - data['DiasABP']
data['MAP to HR ratio'] = data['MAP'] / data['HR'].replace(0, np.nan)
# Replace any remaining inf/NaN with 0 for the derived features
for col in ['Shock Index', 'PaO2/FiO2 ratio', 'Pulse Pressure', 'MAP to HR ratio']:
    data[col] = data[col].replace([np.inf, -np.inf], np.nan).fillna(0)
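
The zero-denominator handling above can be illustrated in isolation. This is a minimal NumPy sketch, not the notebook's code: `safe_ratio` is a hypothetical helper that returns 0 wherever the denominator is 0 or the result is not finite, mirroring the convention of filling undefined derived values with 0.

```python
import numpy as np

def safe_ratio(numerator, denominator):
    """Element-wise numerator / denominator, with 0 where the result is undefined."""
    numerator = np.asarray(numerator, dtype=float)
    denominator = np.asarray(denominator, dtype=float)
    out = np.zeros_like(numerator)
    # Divide only where the denominator is non-zero; other positions keep the 0 in `out`.
    np.divide(numerator, denominator, out=out, where=denominator != 0)
    # NaN inputs propagate through the division, so zero them as well.
    out[~np.isfinite(out)] = 0.0
    return out

hr = np.array([80.0, 120.0, 0.0, np.nan])
sys_abp = np.array([120.0, 0.0, 100.0, 110.0])
shock_index = safe_ratio(hr, sys_abp)
print(shock_index)
```

With this convention a 0 can mean either a true zero or "not computable"; a separate indicator column would be one way to preserve that distinction.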

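## 3. Sequence Construction for Multiple Temporal Resolutions

Bin each patient's records into fixed-width windows for each temporal resolution, keeping the first observation in each window.

In [5]: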
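def construct_sequences(data, time_series_features, hours):
    """Build one fixed-length sequence per patient.

    Each window of `hours` hours is represented by the first row recorded in it;
    windows without any rows are filled with NaN.
    """
    minute_window = hours * 60
    max_time = 2880  # 48 hours in minutes
    # Window edges cover at least [0, max_time). When `hours` does not divide 48
    # evenly (e.g. 36), the last window extends past max_time.
    window_boundaries = np.arange(0, max_time + minute_window, minute_window)
    sequences = []
    # groupby iterates over patients in sorted RecordID order
    for record_id, group in data.groupby('RecordID'):
        patient_sequence = []
        for i in range(len(window_boundaries) - 1):
            start_time = window_boundaries[i]
            end_time = window_boundaries[i+1]
            # Half-open window: start_time <= Minutes < end_time
            window_data = group[(group['Minutes'] >= start_time) & (group['Minutes'] < end_time)]
            if not window_data.empty:
                features = window_data[time_series_features].iloc[0].values
            else:
                features = np.full(len(time_series_features), np.nan)
            patient_sequence.append(features)
        sequences.append(patient_sequence)
    # Result shape: (n_patients, n_windows, n_features)
    return np.array(sequences, dtype=np.float32)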
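
The nested Python loops above can be slow on a few hundred thousand rows. One possible vectorized alternative is sketched below; it is not the notebook's implementation, the function name `first_per_window` is hypothetical, and it is intended (not verified against the real data) to produce the same first-row-per-window result by assigning each row a window index with integer division.

```python
import numpy as np
import pandas as pd

def first_per_window(df, features, hours, max_minutes=2880):
    """Return an array (n_patients, n_windows, n_features) holding the first row per window."""
    width = hours * 60
    n_windows = int(np.ceil(max_minutes / width))
    ids = np.sort(df["RecordID"].unique())
    # Drop rows that fall beyond the last window, then label each row with its window.
    kept = df[df["Minutes"] < n_windows * width].copy()
    kept["window"] = kept["Minutes"] // width
    # head(1) keeps the first *row* per group (unlike .first(), which skips NaNs per column).
    first = (
        kept.groupby(["RecordID", "window"], sort=False)
        .head(1)
        .set_index(["RecordID", "window"])[features]
    )
    # Reindex onto the full patient x window grid; missing windows become NaN.
    full_index = pd.MultiIndex.from_product(
        [ids, range(n_windows)], names=["RecordID", "window"]
    )
    grid = first.reindex(full_index)
    return grid.to_numpy(dtype=np.float32).reshape(len(ids), n_windows, len(features))

# Toy data: patient 1 has two rows in the second 6-hour window, patient 2 has one row overall.
toy = pd.DataFrame({
    "RecordID": [1, 1, 1, 2],
    "Minutes": [10, 400, 420, 30],
    "HR": [80.0, 90.0, 95.0, 70.0],
})
seq = first_per_window(toy, ["HR"], hours=6)
print(seq.shape)
```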

In [11]:
temporal_resolutions = [6, 12, 24, 36]
sequences_dict = {}
for hours in temporal_resolutions:
    print(f'Constructing sequences for {hours}-hour resolution...')
    X_seq = construct_sequences(data, time_series_features, hours)
    sequences_dict[hours] = X_seq
    print(f'Shape for {hours}-hour: {X_seq.shape}')

Constructing sequences for 6-hour resolution...
Shape for 6-hour: (3997, 8, 28)
Constructing sequences for 12-hour resolution...
Shape for 12-hour: (3997, 4, 28)
Constructing sequences for 24-hour resolution...
Shape for 24-hour: (3997, 2, 28)
Constructing sequences for 36-hour resolution...
Shape for 36-hour: (3997, 2, 28)
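
The sequence lengths in this output follow directly from how the window boundaries are generated. The sketch below re-creates only the `np.arange` call from `construct_sequences` (the helper name `window_edges` is an assumption). It shows that a 36-hour width does not divide 48 hours evenly, so its second window nominally spans hours 36 to 72 even though `max_time` is set to 48 hours.

```python
import numpy as np

MAX_MINUTES = 2880  # 48 hours, matching max_time in construct_sequences

def window_edges(hours, max_minutes=MAX_MINUTES):
    """Reproduce the boundary array built inside construct_sequences."""
    width = hours * 60
    return np.arange(0, max_minutes + width, width)

n_windows = {}
last_edge_hours = {}
for hours in (6, 12, 24, 36):
    edges = window_edges(hours)
    n_windows[hours] = len(edges) - 1
    last_edge_hours[hours] = edges[-1] / 60
    print(f"{hours:>2}-hour width: {n_windows[hours]} windows, last edge at {last_edge_hours[hours]:.0f} h")
```

As a consequence, the 24-hour and 36-hour settings both yield two time steps, and the second 36-hour step can only contain data from hours 36 to 48.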


## 4. Padding, Scaling, and Splitting

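For each temporal resolution, pad the sequences, split patients into stratified train, validation, and test sets (roughly 64/16/20), replace non-finite values, scale features with statistics fitted on the training set only, and oversample the minority class in the training set with SMOTE. The processed arrays are collected in the `results` dictionary.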
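
Scaling a 3D array of shape (patients, time steps, features) means flattening the first two axes, fitting per-feature statistics on the training split only, and reusing those statistics for the other splits. The sketch below uses plain NumPy as a stand-in for `StandardScaler`, on random placeholder data; the helper names `fit_feature_stats` and `apply_feature_stats` are hypothetical.

```python
import numpy as np

def fit_feature_stats(x_train):
    """Per-feature mean and std over all patients and time steps of a 3D array."""
    flat = x_train.reshape(-1, x_train.shape[-1])
    mean = flat.mean(axis=0)
    std = flat.std(axis=0)  # population std (ddof=0)
    std[std == 0] = 1.0     # leave constant features unscaled instead of dividing by zero
    return mean, std

def apply_feature_stats(x, mean, std):
    """Standardize a 3D array with previously fitted statistics (broadcast over the last axis)."""
    return (x - mean) / std

rng = np.random.default_rng(0)
x_train = rng.normal(loc=5.0, scale=2.0, size=(100, 8, 3))  # placeholder data
x_val = rng.normal(loc=5.0, scale=2.0, size=(20, 8, 3))

mean, std = fit_feature_stats(x_train)
x_train_scaled = apply_feature_stats(x_train, mean, std)
x_val_scaled = apply_feature_stats(x_val, mean, std)  # reuses training statistics
print(x_train_scaled.shape, x_val_scaled.shape)
```

Fitting on the training split alone keeps information about the validation and test distributions out of the preprocessing step.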

In [12]:
from collections import defaultdict
results = defaultdict(dict)
patient_static_data = data.groupby('RecordID')[static_features].first().reset_index()
patient_outcomes = data.groupby('RecordID')[target_col].first().reset_index()
X_static = patient_static_data[static_features].values
y = patient_outcomes[target_col].values
for hours, X_seq in sequences_dict.items():
    max_seq_len = X_seq.shape[1]
    X_seq_padded = pad_sequences(X_seq, maxlen=max_seq_len, dtype='float32', padding='post', value=0.0)
    n_patients = X_seq_padded.shape[0]
    indices = np.arange(n_patients)
    train_idx, test_idx = train_test_split(indices, test_size=0.2, stratify=y, random_state=42)
    train_idx, val_idx  = train_test_split(train_idx, test_size=0.2, stratify=y[train_idx], random_state=42)
    X_train, X_val, X_test = X_seq_padded[train_idx], X_seq_padded[val_idx], X_seq_padded[test_idx]
    static_train, static_val, static_test = X_static[train_idx], X_static[val_idx], X_static[test_idx]
    y_train, y_val, y_test = y[train_idx], y[val_idx], y[test_idx]
    # Clean NaNs/Infs: NaN -> 0, +/-inf -> +/-1e6.
    # Empty windows (all-NaN time steps) therefore become zeros before scaling.
    def clean_array(arr):
        return np.nan_to_num(arr, nan=0.0, posinf=1e6, neginf=-1e6)
    X_train, X_val, X_test = clean_array(X_train), clean_array(X_val), clean_array(X_test)
    static_train = clean_array(static_train)
    static_val   = clean_array(static_val)
    static_test  = clean_array(static_test)
    # Scale time series features
    n_features = X_train.shape[2]
    seq_scaler = StandardScaler()
    X_train_reshaped = X_train.reshape(-1, n_features)
    X_train_scaled = seq_scaler.fit_transform(X_train_reshaped).reshape(-1, max_seq_len, n_features)
    X_val_reshaped = X_val.reshape(-1, n_features)
    X_val_scaled = seq_scaler.transform(X_val_reshaped).reshape(-1, max_seq_len, n_features)
    X_test_reshaped = X_test.reshape(-1, n_features)
    X_test_scaled = seq_scaler.transform(X_test_reshaped).reshape(-1, max_seq_len, n_features)
    # Scale static features
    static_scaler = StandardScaler()
    static_train_scaled = static_scaler.fit_transform(static_train)
    static_val_scaled   = static_scaler.transform(static_val)
    static_test_scaled  = static_scaler.transform(static_test)
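    # Impute dynamic features (flattened to 2D for SimpleImputer).
    # NaNs were already replaced by 0 in clean_array, so this step only guards
    # against any that remain and should leave the arrays unchanged.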
    X_train_flat = X_train_scaled.reshape(X_train_scaled.shape[0], -1)
    imputer = SimpleImputer(strategy='mean')
    X_train_imputed_flat = imputer.fit_transform(X_train_flat)
    X_val_flat  = X_val_scaled.reshape(X_val_scaled.shape[0], -1)
    X_test_flat = X_test_scaled.reshape(X_test_scaled.shape[0], -1)
    X_val_imputed_flat  = imputer.transform(X_val_flat)
    X_test_imputed_flat = imputer.transform(X_test_flat)
    X_train_imputed = X_train_imputed_flat.reshape(-1, X_train_scaled.shape[1], X_train_scaled.shape[2])
    X_val_imputed   = X_val_imputed_flat.reshape(-1, X_val_scaled.shape[1], X_val_scaled.shape[2])
    X_test_imputed  = X_test_imputed_flat.reshape(-1, X_test_scaled.shape[1], X_test_scaled.shape[2])
    # SMOTE on training set
    smote = SMOTE(random_state=42)
    X_train_res_flat, y_train_res = smote.fit_resample(X_train_imputed_flat, y_train)
    X_train_res = X_train_res_flat.reshape(-1, X_train_imputed.shape[1], X_train_imputed.shape[2])
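    # SMOTE only resamples the sequences, so static features for the synthetic
    # samples are borrowed from randomly chosen minority-class patients. The
    # pairing between a synthetic sequence and its static vector is approximate.
    # This assumes fit_resample returns the original rows first, then the synthetic ones.
    n_orig = static_train_scaled.shape[0]
    n_total = X_train_res.shape[0]
    n_synth = n_total - n_orig
    static_train_res = static_train_scaled.copy()
    if n_synth > 0:
        minority_class = 1 if np.sum(y_train == 1) < np.sum(y_train == 0) else 0
        minority_indices = np.where(y_train == minority_class)[0]
        synth_static = static_train_scaled[np.random.choice(minority_indices, size=n_synth, replace=True)]
        static_train_res = np.concatenate([static_train_scaled, synth_static], axis=0)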
    results[hours]['X_train_final'] = X_train_res
    results[hours]['y_train_final'] = y_train_res
    results[hours]['static_train_final'] = static_train_res
    results[hours]['X_val_final'] = X_val_imputed
    results[hours]['static_val_final'] = static_val_scaled
    results[hours]['y_val_final'] = y_val
    results[hours]['X_test_final'] = X_test_imputed
    results[hours]['static_test_final'] = static_test_scaled
    results[hours]['y_test_final'] = y_test
    print(f'Processed arrays for {hours}-hour resolution saved in results dict.')

Processed arrays for 6-hour resolution saved in results dict.
Processed arrays for 12-hour resolution saved in results dict.
Processed arrays for 24-hour resolution saved in results dict.
Processed arrays for 36-hour resolution saved in results dict.
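
In the cell above, SMOTE interpolates the flattened sequences, while the static rows for synthetic samples are drawn at random from minority-class patients, so a synthetic sequence and its static vector do not necessarily belong together. One simpler alternative, shown here as a sketch and not as the notebook's method, is random oversampling by index: it duplicates minority patients and keeps each sequence paired with its own static features. The function name `oversample_indices` and the toy arrays are hypothetical.

```python
import numpy as np

def oversample_indices(y, seed=42):
    """Return indices that balance a binary label vector by repeating minority samples."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    extra = rng.choice(minority_idx, size=n_extra, replace=True)
    # Original rows first, then the resampled minority rows.
    return np.concatenate([np.arange(len(y)), extra])

# Toy data: 6 survivors, 2 deaths; shapes are illustrative.
y_train = np.array([0, 0, 0, 0, 0, 0, 1, 1])
X_seq = np.arange(8 * 2 * 3, dtype=float).reshape(8, 2, 3)
X_static = np.arange(8 * 4, dtype=float).reshape(8, 4)

idx = oversample_indices(y_train)
# The same index array is applied to every input, so rows stay aligned.
X_seq_bal, X_static_bal, y_bal = X_seq[idx], X_static[idx], y_train[idx]
print(X_seq_bal.shape, X_static_bal.shape, np.bincount(y_bal))
```

The trade-off is that duplicated patients add no new information, whereas SMOTE creates interpolated samples at the cost of the looser static pairing.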


## 5. Save Processed Data for Each Resolution

Save the processed arrays for each temporal resolution to separate files for downstream modeling.

In [13]:
for hours in temporal_resolutions:
    np.savez(f'../data/processed/cnn_lstm_{hours}hr_data.npz',
             X_train=results[hours]['X_train_final'], y_train=results[hours]['y_train_final'],
             X_val=results[hours]['X_val_final'],   y_val=results[hours]['y_val_final'],
             X_test=results[hours]['X_test_final'], y_test=results[hours]['y_test_final'],
             static_train=results[hours]['static_train_final'],
             static_val=results[hours]['static_val_final'],
             static_test=results[hours]['static_test_final'])
    print(f'Prepared {hours}-hour sequence data saved to ../data/processed/cnn_lstm_{hours}hr_data.npz')

Prepared 6-hour sequence data saved to ../data/processed/cnn_lstm_6hr_data.npz
Prepared 12-hour sequence data saved to ../data/processed/cnn_lstm_12hr_data.npz
Prepared 24-hour sequence data saved to ../data/processed/cnn_lstm_24hr_data.npz
Prepared 36-hour sequence data saved to ../data/processed/cnn_lstm_36hr_data.npz
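
Downstream notebooks can read these archives back with `np.load`. The round trip below is a self-contained sketch that writes a small placeholder archive to a temporary directory instead of touching `../data/processed/`; the array contents are zeros and only three of the nine keys are included for brevity.

```python
import os
import tempfile
import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cnn_lstm_6hr_data.npz")
    # Placeholder arrays with the 6-hour layout: 8 time steps, 28 dynamic and 7 static features.
    np.savez(
        path,
        X_train=np.zeros((4, 8, 28), dtype=np.float32),
        static_train=np.zeros((4, 7), dtype=np.float32),
        y_train=np.array([0, 1, 0, 1]),
    )
    # np.load returns a lazy archive; copy the arrays out before the file is closed.
    with np.load(path) as archive:
        arrays = {name: archive[name] for name in archive.files}

for name, arr in arrays.items():
    print(name, arr.shape, arr.dtype)
```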
