# 05a: Data Preparation for CNN-LSTM

This notebook provides a comprehensive, stepwise workflow for preparing ICU time series and static data for deep learning models.

**Goals:**
- Prepare dynamic (time series), static, and outcome data for CNN-LSTM modeling
- Select clinically relevant features and apply appropriate transformations
- Address missingness, scaling, and class imbalance
- Construct patient-level sequences and align static features
- Save processed datasets for downstream modeling and analysis


## Workflow Overview

The workflow consists of eight steps:

1. **Data Loading and Audit:** Load cleaned data, preview, and check types.
2. **Feature Selection:** Select EDA-driven dynamic and static features.
3. **Log Transformation:** Apply log1p to skewed static features.
4. **Sequence Construction:** Group by patient, extract sequences and static features.
5. **Padding and Scaling:** Pad sequences and scale features.
6. **Train/Val/Test Split:** Patient-index-based splitting for alignment.
7. **Class Imbalance Handling:** Apply SMOTE to training set.
8. **Save Processed Data:** Store arrays for modeling.






## Imports and Configuration

Import all necessary libraries for data preparation, including pandas, numpy, scikit-learn, imbalanced-learn, and TensorFlow. Set random seeds for reproducibility and configure display/plotting options for consistency.


In [49]:
# Imports and Configuration
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
import matplotlib.pyplot as plt
import seaborn as sns
from imblearn.over_sampling import SMOTE
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)
tf.random.set_seed(42)
pd.set_option('display.max_columns', 100)
sns.set(style='whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## 1. Data Loading and Initial Audit

Load the cleaned ICU dataset and missingness mask. Preview the first few rows and check data types to confirm correct loading. Summarize the shape and schema of the dataset for transparency.

In [50]:
# Load cleaned data
data = pd.read_csv('/content/sample_data/timeseries_cleaned_all_features.csv')
mask = pd.read_csv('/content/sample_data/timeseries_missingness_mask.csv')
print('Data shape:', data.shape)

# Preview
display(data.head())
print(data.dtypes)

Data shape: (295354, 48)


Unnamed: 0,RecordID,Minutes,ALP,ALT,AST,Albumin,BUN,Bilirubin,Cholesterol,Creatinine,DiasABP,FiO2,GCS,Glucose,HCO3,HCT,HR,K,Lactate,MAP,MechVent,Mg,NIDiasABP,NIMAP,NISysABP,Na,PaCO2,PaO2,Platelets,RespRate,SaO2,SysABP,Temp,TroponinI,TroponinT,Urine,WBC,pH,Age,Gender,Height,ICUType,In-hospital_death,Length_of_stay,SAPS-I,SOFA,Survival,Weight
0,132539,7,0.0,0.0,0.0,0.0,13.0,0.0,0.0,0.8,0.0,0.0,15.0,205.0,26.0,33.7,73.0,4.4,0.0,0.0,0.0,1.5,65.0,92.33,147.0,137.0,0.0,0.0,221.0,19.0,0.0,0.0,35.1,0.0,0.0,900.0,11.2,0.0,54.0,0.0,170.2,4.0,0.0,5.0,6.0,1.0,70.0,78.6
1,132539,37,0.0,0.0,0.0,0.0,13.0,0.0,0.0,0.8,0.0,0.0,15.0,205.0,26.0,33.7,77.0,4.4,0.0,0.0,0.0,1.5,58.0,91.0,157.0,137.0,0.0,0.0,221.0,19.0,0.0,0.0,35.6,0.0,0.0,60.0,11.2,0.0,54.0,0.0,170.2,4.0,0.0,5.0,6.0,1.0,70.0,78.6
2,132539,97,0.0,0.0,0.0,0.0,13.0,0.0,0.0,0.8,0.0,0.0,15.0,205.0,26.0,33.7,60.0,4.4,0.0,0.0,0.0,1.5,62.0,87.0,137.0,137.0,0.0,0.0,221.0,18.0,0.0,0.0,35.6,0.0,0.0,30.0,11.2,0.0,54.0,0.0,170.2,4.0,0.0,5.0,6.0,1.0,70.0,78.6
3,132539,157,0.0,0.0,0.0,0.0,13.0,0.0,0.0,0.8,0.0,0.0,15.0,205.0,26.0,33.7,62.0,4.4,0.0,0.0,0.0,1.5,52.0,75.67,123.0,137.0,0.0,0.0,221.0,19.0,0.0,0.0,35.6,0.0,0.0,170.0,11.2,0.0,54.0,0.0,170.2,4.0,0.0,5.0,6.0,1.0,70.0,78.6
4,132539,188,0.0,0.0,0.0,0.0,13.0,0.0,0.0,0.8,0.0,0.0,15.0,205.0,26.0,33.7,62.0,4.4,0.0,0.0,0.0,1.5,52.0,75.67,123.0,137.0,0.0,0.0,221.0,19.0,0.0,0.0,35.6,0.0,0.0,170.0,11.2,0.0,54.0,0.0,170.2,4.0,0.0,5.0,6.0,1.0,70.0,78.6


RecordID               int64
Minutes                int64
ALP                  float64
ALT                  float64
AST                  float64
Albumin              float64
BUN                  float64
Bilirubin            float64
Cholesterol          float64
Creatinine           float64
DiasABP              float64
FiO2                 float64
GCS                  float64
Glucose              float64
HCO3                 float64
HCT                  float64
HR                   float64
K                    float64
Lactate              float64
MAP                  float64
MechVent             float64
Mg                   float64
NIDiasABP            float64
NIMAP                float64
NISysABP             float64
Na                   float64
PaCO2                float64
PaO2                 float64
Platelets            float64
RespRate             float64
SaO2                 float64
SysABP               float64
Temp                 float64
TroponinI            float64
TroponinT     
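
The preview shows 0.0 for many labs (for example `ALP` and `pH`), which suggests that zeros may stand in for unmeasured values. One way to quantify this during the audit is sketched below on a toy frame. The helper name `zero_fraction` and the toy values are illustrative assumptions, not part of the notebook's pipeline.

```python
import pandas as pd

def zero_fraction(df: pd.DataFrame, exclude=()) -> pd.Series:
    """Fraction of rows equal to exactly 0 per numeric column, largest first."""
    numeric = df.select_dtypes(include="number").drop(columns=list(exclude), errors="ignore")
    return (numeric == 0).mean().sort_values(ascending=False)

# Toy frame mimicking the preview above (values are made up)
toy = pd.DataFrame({
    "RecordID": [1, 1, 2, 2],
    "HR": [73.0, 77.0, 60.0, 62.0],
    "pH": [0.0, 0.0, 7.4, 0.0],
    "ALP": [0.0, 0.0, 0.0, 0.0],
})
zero_report = zero_fraction(toy, exclude=["RecordID"])
print(zero_report)
```

On the real data the call would be `zero_fraction(data, exclude=['RecordID', 'Minutes'])`; columns with a high zero fraction and no physiological meaning for zero are candidates for being treated as missing.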

## 2. Feature Selection

Select features based on EDA findings and clinical relevance. Avoid duplicate or already-encoded columns. Document the rationale for each feature group.

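- **Dynamic (time series) features:**
    - Cardiovascular: HR, SysABP, DiasABP, MAP, NISysABP, NIDiasABP, NIMAP
    - Respiratory: RespRate, SaO2, FiO2, PaO2, PaCO2, MechVent
    - Renal: Creatinine, BUN, Urine
    - Metabolic/Electrolytes: Na, K, Glucose, Lactate, HCO3, pH
    - Neurological: GCS
    - Other: Temp
    - Derived: Shock Index (HR / SysABP), PaO2/FiO2 ratio (PaO2 / FiO2), Pulse Pressure (SysABP − DiasABP), MAP to HR ratio (MAP / HR)
- **Static features:**
    - Age, Gender, Height, Weight, ICUType, SAPS-I, SOFA, Length_of_stay, Survival
    - Note: Length_of_stay and Survival are outcome measurements that are only known once the stay has ended, so using them as model inputs risks leaking the target.
- **Target:**
    - In-hospital_death (binary)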

In [51]:
# Dynamic (time series) features
time_series_features = [
    'HR', 'SysABP', 'DiasABP', 'MAP', 'NISysABP', 'NIDiasABP', 'NIMAP', 'MechVent',
    'RespRate', 'SaO2', 'FiO2', 'PaO2', 'PaCO2',
    'Creatinine', 'BUN', 'Urine',
    'Na', 'K', 'Glucose', 'Lactate', 'HCO3', 'pH',
    'GCS', 'Temp',
    # Clinically meaningful combinations of vitals
    'Shock Index',          # HR / SysABP
    'PaO2/FiO2 ratio',      # PaO₂ / FiO₂
    'Pulse Pressure',       # SysABP − DiasABP
    'MAP to HR ratio'       # MAP / HR
]

# Static features
static_features = [
    'Age', 'Gender', 'Height', 'Weight', 'ICUType', 'SAPS-I', 'SOFA',
    'Length_of_stay', 'Survival'
]

target_col = 'In-hospital_death'

# Derived feature calculations with safe division (numpy is already imported as np)

data['Shock Index'] = np.where(
    data['SysABP'] > 0,
    data['HR'] / data['SysABP'],
    np.nan
)

data['PaO2/FiO2 ratio'] = np.where(
    data['FiO2'] > 0,
    data['PaO2'] / data['FiO2'],
    np.nan
)

# Subtraction propagates NaN, so no explicit missing-value check is needed
data['Pulse Pressure'] = data['SysABP'] - data['DiasABP']

data['MAP to HR ratio'] = np.where(
    data['HR'] > 0,
    data['MAP'] / data['HR'],
    np.nan
)

print("Dynamic Features:", time_series_features)
print("Static Features:", static_features)
print("Target:", target_col)


Dynamic Features: ['HR', 'SysABP', 'DiasABP', 'MAP', 'NISysABP', 'NIDiasABP', 'NIMAP', 'MechVent', 'RespRate', 'SaO2', 'FiO2', 'PaO2', 'PaCO2', 'Creatinine', 'BUN', 'Urine', 'Na', 'K', 'Glucose', 'Lactate', 'HCO3', 'pH', 'GCS', 'Temp', 'Shock Index', 'PaO2/FiO2 ratio', 'Pulse Pressure', 'MAP to HR ratio']
Static Features: ['Age', 'Gender', 'Height', 'Weight', 'ICUType', 'SAPS-I', 'SOFA', 'Length_of_stay', 'Survival']
Target: In-hospital_death
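
In the preview, `SysABP`, `DiasABP`, and `MAP` are 0.0 on rows where only non-invasive pressures appear, so `Pulse Pressure` and `MAP to HR ratio` evaluate to 0 rather than missing. A minimal sketch of one alternative, assuming that non-positive vitals really are placeholders for missing values, masks them before computing the ratios. The helper `positive_or_nan` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def positive_or_nan(s: pd.Series) -> pd.Series:
    """Keep strictly positive values; replace everything else with NaN."""
    return s.where(s > 0)

# Toy rows: row 0 has no invasive pressures, row 2 has no heart rate
toy = pd.DataFrame({
    "HR":      [73.0, 80.0, 0.0],
    "SysABP":  [0.0, 120.0, 110.0],
    "DiasABP": [0.0, 80.0, 70.0],
    "MAP":     [0.0, 93.0, 83.0],
})

hr, sbp, dbp, mean_ap = (positive_or_nan(toy[c]) for c in ["HR", "SysABP", "DiasABP", "MAP"])

# NaN in any input propagates to the derived value
derived = pd.DataFrame({
    "Shock Index": hr / sbp,
    "Pulse Pressure": sbp - dbp,
    "MAP to HR ratio": mean_ap / hr,
})
print(derived)
```

With this masking, a row without invasive pressures yields NaN for all three derived features instead of a mix of NaN and 0.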


## 4. Log Transformation of Skewed Features

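Apply a log1p transformation to highly skewed static features to reduce the impact of outliers and long-tailed distributions. The EDA recommended `Weight`, `Length_of_stay`, and `Survival`; the cell below selects features dynamically (absolute skewness > 1), which in this run also picks up `Height`.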

In [52]:
# Identify highly skewed static features and apply log1p to them.
# 'Gender' and 'ICUType' are categorical/encoded, so they are excluded from the skewness check.
static_features_for_skew_check = [col for col in static_features if col not in ['Gender', 'ICUType']]

# DataFrame.skew() skips NaN values by default
skewness = data[static_features_for_skew_check].skew().sort_values(ascending=False)

# Threshold for high skewness (absolute skewness > 1)
skewness_threshold = 1.0
highly_skewed_features = skewness[skewness.abs() > skewness_threshold].index.tolist()

print(f"Highly skewed features (absolute skewness > {skewness_threshold}): {highly_skewed_features}")

# Apply log1p to the identified features, skipping any column with negative values
for col in highly_skewed_features:
    if col not in data.columns:
        print(f"Warning: Column '{col}' not found in data.")
    elif (data[col] < 0).any():
        print(f"Warning: Column '{col}' contains negative values. log1p may not be appropriate.")
    else:
        data[col] = np.log1p(data[col])
        print(f"Applied log1p to '{col}'")

Highly skewed features (absolute skewness > 1.0): ['Height', 'Survival', 'Length_of_stay', 'Weight']
Applied log1p to 'Height'
Applied log1p to 'Survival'
Applied log1p to 'Length_of_stay'
Applied log1p to 'Weight'
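
Static columns repeat on every timestep row, so a row-level `skew()` weights each patient by the number of rows they contribute. A hedged sketch of a per-patient check follows; the toy `RecordID` and `Weight` values are made up to show how the two views can disagree.

```python
import numpy as np
import pandas as pd

# Patient 1 has four rows and an extreme weight; the others have one row each
toy = pd.DataFrame({
    "RecordID": [1, 1, 1, 1, 2, 3, 4, 5],
    "Weight":   [300.0, 300.0, 300.0, 300.0, 70.0, 72.0, 68.0, 75.0],
})

# One row per patient, so every patient contributes equally
per_patient = toy.groupby("RecordID")[["Weight"]].first()

row_level_skew = toy["Weight"].skew()
patient_level_skew = per_patient["Weight"].skew()
print(f"row-level skew: {row_level_skew:.2f}, patient-level skew: {patient_level_skew:.2f}")

# log1p is applied to non-negative values only, as in the cell above
weight_log = np.log1p(per_patient["Weight"])
```

In this toy case the row-level skewness is close to zero while the patient-level skewness exceeds the threshold of 1, so the two approaches would select different features.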


## 5. Sequence Construction and Target Extraction

Group by `RecordID` to create time series sequences and extract static features and target labels for each patient. This ensures each patient is represented as a sequence for the CNN-LSTM model.

In [53]:
# Group by RecordID and create sequences
grouped = data.groupby('RecordID')
X_seq = [group[time_series_features].values for _, group in grouped]
X_static = grouped[static_features].first().values
y = grouped[target_col].first().values
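
The grouping above relies on rows already being in chronological order within each patient. A minimal sketch on a toy frame, assuming `Minutes` is the time index, sorts explicitly and checks that sequences, static features, and labels share one patient order. The toy values and the `record_ids` array are illustrative additions.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "RecordID": [20, 20, 10, 10, 10],
    "Minutes":  [60, 0, 30, 0, 90],
    "HR":       [82.0, 80.0, 71.0, 70.0, 72.0],
    "Age":      [64, 64, 51, 51, 51],
    "In-hospital_death": [1, 1, 0, 0, 0],
})

# Order rows chronologically within each patient before grouping
ordered = toy.sort_values(["RecordID", "Minutes"])
grouped = ordered.groupby("RecordID")  # groups are iterated in sorted RecordID order

record_ids = np.array([rid for rid, _ in grouped])
X_seq = [g[["HR"]].to_numpy() for _, g in grouped]
X_static = grouped[["Age"]].first().to_numpy()
y = grouped["In-hospital_death"].first().to_numpy()

# All three outputs share the same patient order
assert list(grouped[["Age"]].first().index) == list(record_ids)
```

Keeping `record_ids` alongside the arrays makes it possible to trace any row of `X_seq`, `X_static`, or `y` back to its patient after splitting.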

## 6. Sequence Padding and Feature Scaling

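Pad all sequences with trailing zeros to the length of the longest sequence, then standardize each dynamic feature with `StandardScaler`. This gives the neural network a uniform input shape and normalized values. The scaler in this cell is fit on every timestep of every patient, including padded rows, so padded positions no longer equal zero after scaling.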

In [54]:
# Pad sequences
max_seq_len = max([seq.shape[0] for seq in X_seq])
X_seq_padded = pad_sequences(X_seq, maxlen=max_seq_len, dtype='float32', padding='post', value=0.0)

# Scale features
scaler = StandardScaler()
n_features = len(time_series_features)
X_seq_reshaped = X_seq_padded.reshape(-1, n_features)
X_seq_scaled = scaler.fit_transform(X_seq_reshaped).reshape(-1, max_seq_len, n_features)
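
Because the scaler above sees all patients and the padded rows, its statistics include test-set values and artificial zeros. The sketch below shows one way to avoid both, under stated assumptions: it computes per-feature mean and standard deviation with plain NumPy (a stand-in for `StandardScaler`) from the real timesteps of training patients only, then resets padded positions to the pad value. The helper names `pad_post`, `fit_stats`, and `scale_keep_padding` are hypothetical.

```python
import numpy as np

def pad_post(seqs, n_features, value=0.0):
    """Post-pad variable-length (T_i, F) arrays to (N, T_max, F); also return lengths."""
    lengths = np.array([len(s) for s in seqs])
    out = np.full((len(seqs), lengths.max(), n_features), value, dtype="float32")
    for i, s in enumerate(seqs):
        out[i, : len(s)] = s
    return out, lengths

def fit_stats(X, lengths, train_idx):
    """Per-feature mean/std over real (non-padded) timesteps of training patients only."""
    real = np.concatenate([X[i, : lengths[i]] for i in train_idx], axis=0)
    mean = np.nanmean(real, axis=0)
    std = np.nanstd(real, axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return mean, std

def scale_keep_padding(X, lengths, mean, std, pad_value=0.0):
    """Standardize all timesteps, then reset padded positions to pad_value."""
    X_scaled = (X - mean) / std
    step = np.arange(X.shape[1])[None, :]   # shape (1, T_max)
    is_pad = step >= lengths[:, None]       # shape (N, T_max)
    X_scaled[is_pad] = pad_value
    return X_scaled

# Toy data: three patients, one feature; patient 2 is held out of the statistics
seqs = [np.array([[1.0], [3.0]]), np.array([[5.0]]), np.array([[100.0], [200.0], [300.0]])]
X_pad, lengths = pad_post(seqs, n_features=1)
mean, std = fit_stats(X_pad, lengths, train_idx=[0, 1])
X_scaled = scale_keep_padding(X_pad, lengths, mean, std)
```

In the notebook this would require splitting patient indices before scaling. Keeping padded timesteps at exactly 0 also leaves open the option of masking them in the model, for example with a Keras `Masking` layer.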

## 7. Train/Validation/Test Split and Class Imbalance Handling

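Split the patients into train, validation, and test sets, stratified on the target. With `test_size=0.2` applied twice, the proportions are 64% train, 16% validation, and 20% test. Use SMOTE to address class imbalance in the training set only; this is critical due to the observed imbalance in the target variable (`In-hospital_death`).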

In [55]:
# Patient-index-based splitting for aligned arrays
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# Get patient indices
n_patients = len(y)
indices = np.arange(n_patients)

# Split indices for train, val, test
train_idx, test_idx = train_test_split(indices, test_size=0.2, stratify=y, random_state=42)
train_idx, val_idx = train_test_split(train_idx, test_size=0.2, stratify=y[train_idx], random_state=42)

# Use indices to split all arrays
X_train, X_val, X_test = X_seq_scaled[train_idx], X_seq_scaled[val_idx], X_seq_scaled[test_idx]
static_train, static_val, static_test = X_static[train_idx], X_static[val_idx], X_static[test_idx]
y_train, y_val, y_test = y[train_idx], y[val_idx], y[test_idx]

# Scale static features
static_scaler = StandardScaler()
static_train = static_scaler.fit_transform(static_train)
static_val = static_scaler.transform(static_val)
static_test = static_scaler.transform(static_test)

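# Impute NaN values (from the derived ratios) before SMOTE.
# Fit the imputer on the training set only, then apply it to val/test so the saved arrays contain no NaN
imputer = SimpleImputer(strategy='mean')
X_train_flat = X_train.reshape(X_train.shape[0], -1)
X_train_imputed = imputer.fit_transform(X_train_flat)
X_val = imputer.transform(X_val.reshape(X_val.shape[0], -1)).reshape(X_val.shape)
X_test = imputer.transform(X_test.reshape(X_test.shape[0], -1)).reshape(X_test.shape)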


# SMOTE on flattened and imputed time series
sm = SMOTE(random_state=42)
X_train_res, y_train_res = sm.fit_resample(X_train_imputed, y_train)
X_train_res = X_train_res.reshape(-1, X_train.shape[1], X_train.shape[2])

# For static features: assign static features for synthetic samples by random sampling from minority class
n_orig = static_train.shape[0]
n_total = X_train_res.shape[0]
n_synth = n_total - n_orig
static_train_res = static_train.copy()
if n_synth > 0:
    # Find indices of minority class in original static_train
    minority_class = 1 if np.sum(y_train == 1) < np.sum(y_train == 0) else 0
    minority_indices = np.where(y_train == minority_class)[0]
    synth_static = static_train[np.random.choice(minority_indices, size=n_synth, replace=True)]
    static_train_res = np.concatenate([static_train, synth_static], axis=0)
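
Drawing static features at random for synthetic samples pairs each synthetic sequence with an unrelated patient's static vector, and it assumes the resampler appends synthetic rows after the originals. One alternative, sketched here under stated assumptions, is to resample the flattened sequence and the static features as a single matrix and split them afterwards. `resample_jointly` is a hypothetical helper, and `DuplicateMinority` is a stand-in resampler so the sketch runs without imbalanced-learn; it is not SMOTE.

```python
import numpy as np

class DuplicateMinority:
    """Stand-in resampler exposing fit_resample(X, y), the method imblearn's SMOTE provides.
    It only duplicates minority rows; it does not interpolate."""
    def __init__(self, random_state=0):
        self.rng = np.random.default_rng(random_state)

    def fit_resample(self, X, y):
        classes, counts = np.unique(y, return_counts=True)
        minority = classes[counts.argmin()]
        n_extra = counts.max() - counts.min()
        picks = self.rng.choice(np.where(y == minority)[0], size=n_extra, replace=True)
        return np.vstack([X, X[picks]]), np.concatenate([y, y[picks]])

def resample_jointly(X_seq, X_static, y, resampler):
    """Resample sequences and static features together so each new row stays paired."""
    n, t, f = X_seq.shape
    joint = np.hstack([X_seq.reshape(n, t * f), X_static])
    joint_res, y_res = resampler.fit_resample(joint, y)
    X_seq_res = joint_res[:, : t * f].reshape(-1, t, f)
    X_static_res = joint_res[:, t * f :]
    return X_seq_res, X_static_res, y_res

# Toy data: 10 patients, 4 timesteps, 3 dynamic and 2 static features, 8:2 imbalance
rng = np.random.default_rng(42)
X_seq_demo = rng.normal(size=(10, 4, 3))
X_static_demo = rng.normal(size=(10, 2))
y_demo = np.array([0] * 8 + [1] * 2)

X_seq_res, X_static_res, y_res = resample_jointly(
    X_seq_demo, X_static_demo, y_demo, DuplicateMinority(random_state=42)
)
```

Passing `SMOTE(random_state=42)` in place of the stand-in would interpolate the sequence and static parts of each synthetic sample from the same pair of neighbours. Inputs must be NaN-free, and encoded categorical columns such as `Gender` and `ICUType` would be interpolated to non-integer values.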

## 8. Save Prepared Data

Save the processed arrays for model training and evaluation. This ensures reproducibility and easy loading for downstream modeling notebooks.

In [56]:
# Save processed data
np.savez('/content/sample_data/cnn_lstm_data.npz',
         X_train=X_train_res, y_train=y_train_res,
         X_val=X_val, y_val=y_val,
         X_test=X_test, y_test=y_test,
         static_train=static_train_res, static_val=static_val, static_test=static_test)
print("Prepared data saved.")

Prepared data saved.
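
A downstream notebook can reload the archive with `np.load` and verify that the arrays in each split line up. The sketch below uses small toy arrays and a temporary file rather than the real archive; the shapes and the file name `cnn_lstm_data_demo.npz` are illustrative assumptions.

```python
import os
import tempfile
import numpy as np

# Toy arrays standing in for one saved split (shapes are illustrative)
arrays = {
    "X_train": np.zeros((6, 5, 3), dtype="float32"),
    "static_train": np.zeros((6, 2)),
    "y_train": np.array([0, 0, 0, 1, 1, 1]),
}

path = os.path.join(tempfile.mkdtemp(), "cnn_lstm_data_demo.npz")
np.savez(path, **arrays)

# Read every array into memory, then close the file
with np.load(path) as saved:
    loaded = {name: saved[name] for name in saved.files}

# Every array in a split must describe the same number of samples
n_train = len(loaded["y_train"])
assert loaded["X_train"].shape[0] == n_train
assert loaded["static_train"].shape[0] == n_train

# Flag any array that still contains NaN before it reaches the model
has_nan = {name: bool(np.isnan(arr).any()) for name, arr in loaded.items()}
print(has_nan)
```

The same checks apply to the validation and test arrays stored in the archive.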


In [57]:
# Recalculate the derived features (first computed in the feature selection cell) and preview them
# Shock Index = HR / SysABP
# Replace 0 with NaN before dividing so that a zero SysABP yields NaN instead of inf
data['Shock Index'] = data['HR'] / data['SysABP'].replace(0, np.nan)
# Guard against any remaining inf values
data['Shock Index'] = data['Shock Index'].replace([np.inf, -np.inf], np.nan)


# PaO₂/FiO₂ ratio (oxygenation)
# Handle division by zero or very small FiO2 values
data['PaO2/FiO2 ratio'] = data['PaO2'] / data['FiO2'].replace(0, np.nan)
# Address potential inf values
data['PaO2/FiO2 ratio'] = data['PaO2/FiO2 ratio'].replace([np.inf, -np.inf], np.nan)

# Pulse Pressure = SysABP − DiasABP
data['Pulse Pressure'] = data['SysABP'] - data['DiasABP']

# MAP to HR ratio
# Handle division by zero or very small HR values
data['MAP to HR ratio'] = data['MAP'] / data['HR'].replace(0, np.nan)
# Address potential inf values
data['MAP to HR ratio'] = data['MAP to HR ratio'].replace([np.inf, -np.inf], np.nan)


print("Derived features calculated.")
display(data[time_series_features].head())

Derived features calculated.


Unnamed: 0,HR,SysABP,DiasABP,MAP,NISysABP,NIDiasABP,NIMAP,MechVent,RespRate,SaO2,FiO2,PaO2,PaCO2,Creatinine,BUN,Urine,Na,K,Glucose,Lactate,HCO3,pH,GCS,Temp,Shock Index,PaO2/FiO2 ratio,Pulse Pressure,MAP to HR ratio
0,73.0,0.0,0.0,0.0,147.0,65.0,92.33,0.0,19.0,0.0,0.0,0.0,0.0,0.8,13.0,900.0,137.0,4.4,205.0,0.0,26.0,0.0,15.0,35.1,,,0.0,0.0
1,77.0,0.0,0.0,0.0,157.0,58.0,91.0,0.0,19.0,0.0,0.0,0.0,0.0,0.8,13.0,60.0,137.0,4.4,205.0,0.0,26.0,0.0,15.0,35.6,,,0.0,0.0
2,60.0,0.0,0.0,0.0,137.0,62.0,87.0,0.0,18.0,0.0,0.0,0.0,0.0,0.8,13.0,30.0,137.0,4.4,205.0,0.0,26.0,0.0,15.0,35.6,,,0.0,0.0
3,62.0,0.0,0.0,0.0,123.0,52.0,75.67,0.0,19.0,0.0,0.0,0.0,0.0,0.8,13.0,170.0,137.0,4.4,205.0,0.0,26.0,0.0,15.0,35.6,,,0.0,0.0
4,62.0,0.0,0.0,0.0,123.0,52.0,75.67,0.0,19.0,0.0,0.0,0.0,0.0,0.8,13.0,170.0,137.0,4.4,205.0,0.0,26.0,0.0,15.0,35.6,,,0.0,0.0


In [None]:
# Visualize the distributions of Shock Index, PaO2/FiO2 ratio, and Pulse Pressure

