<a href="https://colab.research.google.com/github/mohsinposts/CS598-DLH-LLM-eICU/blob/main/CS598_DLH.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Evaluating LLM-Generated Synthetic ICU Data for Privacy-Preserving Machine Learning

## CS598 Deep Learning for Healthcare - Final Project

**Objective:** This project evaluates whether large language models (LLMs) can generate realistic synthetic ICU patient data that preserves privacy while maintaining utility for machine learning tasks.

---

## Project Overview

Healthcare data is highly sensitive and subject to strict privacy regulations. This project explores using GPT-5.1 extended thinking to generate synthetic ICU patient data and compares:

1. **Real data baseline** - Model trained and tested on real eICU data
2. **Synthetic baseline** - Model trained on LLM-generated synthetic data, tested on real data
3. **Privacy-aware synthetic** - Model trained on privacy-enhanced LLM-generated data, tested on real data

We measure both **predictive performance** (AUROC, AUPRC) and **distributional similarity** (KL divergence) to assess the utility and fidelity of the synthetic data. Privacy is addressed only through prompt design and is not measured directly.

---

## Dataset

**Source:** eICU Collaborative Research Database Demo v2.0.1 from PhysioNet

**Task:** Predict ICU mortality using 10 clinical features

**Features:**
- Demographics: age, gender, diabetes
- Vitals: heart rate, mean blood pressure, respiratory rate, temperature
- Labs: white blood cell count, creatinine, blood urea nitrogen

**Target:** Binary ICU mortality (0 = survived, 1 = died)

---

## 1. Setup and Dependencies

Installing required packages for machine learning and evaluation.

In [None]:
!pip install xgboost -q

In [None]:
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score, average_precision_score
import xgboost as xgb

## 2. Load eICU Data

Loading the eICU Collaborative Research Database demo files. This dataset contains de-identified ICU patient records including physiological measurements, patient demographics, and mortality outcomes.

In [None]:
# Data source:
# eICU Collaborative Research Database Demo v2.0.1 from PhysioNet
# https://physionet.org/content/eicu-crd-demo/2.0.1/

# Physiology variables (vitals/labs summary)
aps = pd.read_csv("/content/apacheApsVar.csv.gz", compression="gzip")

# APACHE results (includes mortality outcomes & scores)
apache_res = pd.read_csv("/content/apachePatientResult.csv.gz", compression="gzip")

# APACHE prediction variables (extra features)
apache_pred = pd.read_csv("/content/apachePredVar.csv.gz", compression="gzip")

# Core patient info (age, gender, etc.)
patient = pd.read_csv("/content/patient.csv.gz", compression="gzip")

# Quick descriptions so we know what we have
for name, df in [
    ("aps", aps),
    ("apache_res", apache_res),
    ("apache_pred", apache_pred),
    ("patient", patient),
]:
    print(f"\n=== {name} ===")
    print("Shape:", df.shape)
    print("Columns:", list(df.columns))
    print("First 3 rows:")
    print(df.head(3))



=== aps ===
Shape: (2205, 26)
Columns: ['apacheapsvarid', 'patientunitstayid', 'intubated', 'vent', 'dialysis', 'eyes', 'motor', 'verbal', 'meds', 'urine', 'wbc', 'temperature', 'respiratoryrate', 'sodium', 'heartrate', 'meanbp', 'ph', 'hematocrit', 'creatinine', 'albumin', 'pao2', 'pco2', 'bun', 'glucose', 'bilirubin', 'fio2']
First 3 rows:
   apacheapsvarid  patientunitstayid  intubated  vent  dialysis  eyes  motor  \
0           92788             141765          0     0         0     4      6   
1            8893             143870          0     0         0     4      6   
2           79585             144815          0     0         0     4      6   

   verbal  meds  urine  ...   ph  hematocrit  creatinine  albumin  pao2  pco2  \
0       5     0   -1.0  ... -1.0        37.8        1.04     -1.0  -1.0  -1.0   
1       5     0   -1.0  ... -1.0        34.1        1.14     -1.0  -1.0  -1.0   
2       5     0   -1.0  ... -1.0        36.6        0.63      3.6  -1.0  -1.0   

    bun  

## 3. Prepare Binary Mortality Labels

Creating the target variable: a binary indicator of whether the patient died during their ICU stay.
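One subtlety of comparing against `"ALIVE"` is that every other value, including a missing entry, ends up labeled as a death. The sketch below illustrates this on a toy series (made-up values, not eICU rows) and shows a stricter mapping that keeps unknown outcomes as missing. It assumes `"EXPIRED"` is the only other category in the column, which should be checked against the actual data.

```python
import pandas as pd

# Toy outcome column (illustrative values only, not real eICU rows)
outcome = pd.Series(["ALIVE", "EXPIRED", None, "ALIVE"])

# The notebook's rule: everything that is not exactly "ALIVE" becomes 1
label_loose = (outcome != "ALIVE").astype(int)

# Stricter variant: map known strings explicitly, leave unknowns as NaN
# ("EXPIRED" is assumed to be the only other category)
label_strict = outcome.map({"ALIVE": 0, "EXPIRED": 1})

print(label_loose.tolist())
print(int(label_strict.isna().sum()), "row(s) with unknown outcome")
```

If the strict mapping produces no missing labels on the real table, the two rules are equivalent for this dataset.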

In [None]:
# One ICU stay per row
apache_res_unique = (
    apache_res
    .sort_values("apachepatientresultsid")
    .drop_duplicates(subset="patientunitstayid")
)

# Binary ICU mortality label: 1 = died in ICU, 0 = survived
apache_res_unique["icu_mortality"] = (
    apache_res_unique["actualicumortality"] != "ALIVE"
).astype(int)

# Check label distribution and a few rows
apache_res_unique["icu_mortality"].value_counts(), \
apache_res_unique[["patientunitstayid", "actualicumortality", "icu_mortality"]].head()


(icu_mortality
 0    1747
 1      91
 Name: count, dtype: int64,
      patientunitstayid actualicumortality  icu_mortality
 4               144815              ALIVE              0
 15              149713              ALIVE              0
 25              155961              ALIVE              0
 158             211715              ALIVE              0
 161             214497              ALIVE              0)

## 4. Feature Engineering

Merging patient demographics, physiological measurements, and lab values into a single dataset. We select 10 clinically relevant features:

- **Demographics:** age, gender, diabetes status
- **Vital signs:** heart rate, mean arterial blood pressure, respiratory rate, temperature
- **Laboratory values:** white blood cell count, creatinine, blood urea nitrogen (BUN)

Rows with any missing value (encoded as -1) are dropped, leaving 1,130 complete stays (55 deaths).
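To see how much data the sentinel handling removes, it can help to count the -1 values per column before dropping rows. The following is a minimal sketch on a toy frame with made-up values; the column names mirror the notebook but the data does not.

```python
import numpy as np
import pandas as pd

# Toy frame with -1 used as a missing-value sentinel (made-up values)
toy = pd.DataFrame({
    "heartrate": [88.0, -1.0, 102.0, 95.0],
    "wbc": [7.9, 12.0, -1.0, 6.1],
    "icu_mortality": [0, 0, 1, 0],
})

# Count sentinel values per column before dropping anything
missing_per_col = (toy == -1).sum()

# Same cleaning step as in the notebook: sentinel -> NaN, then drop rows
clean = toy.replace(-1, np.nan).dropna()

print(missing_per_col)
print(len(toy), "rows before,", len(clean), "rows after")
```

Note that `replace(-1, np.nan)` applies to every column, so it is only safe when -1 is never a legitimate value in any of them.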

In [None]:
# Merge mortality label with prediction and physiology variables
df = (
    apache_res_unique[["patientunitstayid", "icu_mortality"]]
    .merge(
        apache_pred[["patientunitstayid", "age", "gender", "diabetes"]],
        on="patientunitstayid",
        how="inner",
    )
    .merge(
        aps[
            [
                "patientunitstayid",
                "heartrate",
                "meanbp",
                "respiratoryrate",
                "temperature",
                "wbc",
                "creatinine",
                "bun",
            ]
        ],
        on="patientunitstayid",
        how="inner",
    )
)

# Replace sentinel -1 values with NaN and drop incomplete rows
df = df.replace(-1, np.nan).dropna()

feature_cols = [
    "age",
    "gender",
    "diabetes",
    "heartrate",
    "meanbp",
    "respiratoryrate",
    "temperature",
    "wbc",
    "creatinine",
    "bun",
]
label_col = "icu_mortality"

# Keep ID, features, and label
df = df[["patientunitstayid"] + feature_cols + [label_col]]

df.shape, df[label_col].value_counts(), df.head()


((1130, 12),
 icu_mortality
 0    1075
 1      55
 Name: count, dtype: int64,
    patientunitstayid   age  gender  diabetes  heartrate  meanbp  \
 0             144815  34.0     1.0         0      131.0    61.0   
 2             155961  57.0     1.0         0      100.0    54.0   
 3             211715  53.0     1.0         0       94.0   170.0   
 5             238463  83.0     0.0         0      136.0    46.0   
 6             156308  87.0     0.0         1       98.0    61.0   
 
    respiratoryrate  temperature   wbc  creatinine   bun  icu_mortality  
 0              6.0         36.7   7.9        0.63   6.0              0  
 2              6.0         36.0  12.0        0.80  20.0              0  
 3             36.0         36.6   3.2        0.63  10.0              0  
 5             36.0         37.0  13.2        1.88  52.0              1  
 6             45.0         35.8   6.1        1.30  36.0              0  )

## 5. Train/Test Split

Splitting the real data into training (80%) and test (20%) sets. We use stratified sampling to maintain the same class balance in both sets.
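The effect of stratification can be checked on a toy label vector. This sketch uses made-up data with a 5% positive rate, roughly the imbalance seen here, and confirms that both splits keep that rate.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 10 positives out of 200 (made-up data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([1] * 10 + [0] * 190)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratify=y_toy, both splits keep the 5% positive rate
print("train positives:", int(y_tr.sum()), "of", len(y_tr))
print("test positives:", int(y_te.sum()), "of", len(y_te))
```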

In [None]:
feature_cols = [
    "age",
    "gender",
    "diabetes",
    "heartrate",
    "meanbp",
    "respiratoryrate",
    "temperature",
    "wbc",
    "creatinine",
    "bun",
]
label_col = "icu_mortality"

X = df[feature_cols].values
y = df[label_col].values

# stratify to preserve class balance in train/test
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y,
)

X_train.shape, X_test.shape, np.bincount(y_train), np.bincount(y_test)

((904, 10), (226, 10), array([860,  44]), array([215,  11]))

## 6. Baseline: Train on Real Data, Test on Real Data

Training an XGBoost classifier on real ICU data and evaluating on the held-out real test set. This establishes our **real-data reference**: the performance obtained when real data is used for both training and testing. Because the test set contains only 11 deaths, both metrics carry wide uncertainty.

**Metrics:**
- **AUROC (Area Under ROC Curve):** Overall classifier discrimination ability
- **AUPRC (Average Precision):** Performance on the imbalanced mortality prediction task. A random classifier scores roughly the positive rate, about 0.05 on this test set (11 of 226).
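With so few positive cases, a single AUROC value says little about how stable the estimate is. One common way to gauge this is a percentile bootstrap over the test set. The helper below, `bootstrap_auroc_ci`, is a hypothetical addition and not part of the notebook; it is demonstrated on made-up scores with a size and imbalance similar to the test set.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUROC (hypothetical helper)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        # AUROC is undefined if a resample contains only one class
        if y_true[idx].min() == y_true[idx].max():
            continue
        stats.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Toy example: 11 positives and 215 negatives with weakly informative scores
rng = np.random.default_rng(1)
y_toy = np.array([1] * 11 + [0] * 215)
score_toy = rng.normal(size=226) + 0.5 * y_toy
lo, hi = bootstrap_auroc_ci(y_toy, score_toy)
print(f"95% CI for AUROC: [{lo:.2f}, {hi:.2f}]")
```

Applied to `y_test` and `y_pred_proba`, the same call would give an interval for the real-data model.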

In [None]:
# Wrap arrays for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Simple baseline XGBoost config
params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "max_depth": 3,
    "eta": 0.1,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "seed": 42,
}

# Train model
model_real = xgb.train(params, dtrain, num_boost_round=200)

# Predict on real test set
y_pred_proba = model_real.predict(dtest)

# AUROC and AUPRC on real test set
auc = roc_auc_score(y_test, y_pred_proba)
auprc = average_precision_score(y_test, y_pred_proba)

auc, auprc

(np.float64(0.6329809725158562), np.float64(0.13227290693385071))

## 7. Prepare Real Data Sample for LLM

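Sampling a class-stratified subset of real ICU records to use as examples for the LLM. This sample will be included in the prompt to help the LLM understand the data structure and realistic value ranges.

We include:
- 30 patients who survived (icu_mortality = 0)
- 10 patients who died (icu_mortality = 1)

Deaths are deliberately over-represented (25% of the sample versus about 5% of the dataset) so the LLM sees enough positive examples without overwhelming its context window. Note that the sample is drawn from the full dataset `df`, so it can include rows that also appear in the held-out test set.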
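Because the code below samples from `df` (all rows), the prompt can contain patients that later appear in the test set. One option, sketched here under the assumption that a training-only DataFrame is available, is to draw the examples from the training portion alone. `sample_prompt_examples` and `toy_train` are hypothetical names used for illustration.

```python
import pandas as pd

def sample_prompt_examples(frame, label_col, n_neg=30, n_pos=10, seed=1):
    """Draw up to n_neg survivors and n_pos deaths, then shuffle (hypothetical helper)."""
    neg = frame[frame[label_col] == 0]
    pos = frame[frame[label_col] == 1]
    picked = pd.concat([
        neg.sample(n=min(n_neg, len(neg)), random_state=seed),
        pos.sample(n=min(n_pos, len(pos)), random_state=seed),
    ])
    # Shuffle so the classes are interleaved in the prompt
    return picked.sample(frac=1, random_state=seed + 1)

# Toy frame standing in for the training portion of the data (made-up values)
toy_train = pd.DataFrame({
    "age": list(range(40, 90)),
    "icu_mortality": [1] * 5 + [0] * 45,
})
examples = sample_prompt_examples(toy_train, "icu_mortality")
print(examples["icu_mortality"].value_counts().to_dict())
```

The `min(...)` caps mirror the notebook's handling of classes with fewer rows than requested.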

In [None]:
# Sample real rows to show the LLM (10 features + label)

# Separate by class
df_alive = df[df["icu_mortality"] == 0]
df_dead = df[df["icu_mortality"] == 1]

# Sample rows from each class (cap by available count)
n_alive = min(30, len(df_alive))
n_dead = min(10, len(df_dead))

sample_df = pd.concat([
    df_alive.sample(n=n_alive, random_state=1),
    df_dead.sample(n=n_dead, random_state=1)
])

# Shuffle rows
sample_df = sample_df.sample(frac=1, random_state=2)

csv_snippet = sample_df[feature_cols + [label_col]].to_csv(index=False)
print(csv_snippet)


age,gender,diabetes,heartrate,meanbp,respiratoryrate,temperature,wbc,creatinine,bun,icu_mortality
75.0,0.0,0,89.0,51.0,4.0,36.4,6.4,1.28,16.0,0
72.0,0.0,0,102.0,134.0,6.0,39.2,24.8,0.8,25.0,0
43.0,1.0,0,94.0,122.0,9.0,36.4,6.7,1.15,18.0,0
65.0,0.0,0,100.0,63.0,7.0,36.3,9.7,0.93,18.0,0
26.0,0.0,0,124.0,52.0,36.0,36.4,17.4,13.2,37.0,0
64.0,1.0,1,110.0,139.0,41.0,36.8,2.7,0.62,12.0,1
89.0,0.0,0,40.0,60.0,44.0,36.4,10.7,1.7,36.0,0
61.0,1.0,0,108.0,47.0,34.0,35.5,37.19,2.63,68.0,1
56.0,0.0,0,94.0,52.0,12.0,36.2,14.7,0.8,11.0,0
54.0,0.0,0,103.0,61.0,9.0,39.7,13.0,1.26,25.0,1
52.0,0.0,0,155.0,127.0,37.0,36.7,14.42,0.9,21.0,1
47.0,1.0,0,109.0,47.0,10.0,35.6,3.2,0.7,10.0,0
73.0,0.0,1,52.0,61.0,5.0,37.7,5.4,1.61,27.0,0
76.0,0.0,0,126.0,134.0,36.0,36.7,8.7,0.73,37.0,0
69.0,0.0,1,41.0,57.0,30.0,36.8,5.1,6.61,36.0,0
82.0,0.0,1,60.0,49.0,6.0,32.9,8.68,2.2,62.0,0
49.0,0.0,0,117.0,129.0,12.0,36.4,21.5,3.35,22.0,1
68.0,0.0,0,128.0,66.0,28.0,36.2,7.6,3.6,98.0,0
81.0,1.0,1,114.0,120.0,14.0,36.9,6.4,1.0,14.0,0

---

## Part A: Synthetic Baseline (Standard LLM Generation)

In this section, we evaluate synthetic data generated using GPT-5.1 extended thinking with a **standard prompt** that asks for realistic data without specific privacy constraints.

### Standard prompt used with GPT-5.1 extended thinking

You are generating synthetic ICU patient data for a research project.

Each row is one ICU stay. All values are numeric except the label.
Columns (in order):

1. age (years, roughly 20–90)
2. gender (0 = male, 1 = female)
3. diabetes (0 = no diabetes, 1 = diabetes)
4. heartrate (beats per minute)
5. meanbp (mean arterial blood pressure, mmHg)
6. respiratoryrate (breaths per minute)
7. temperature (Celsius)
8. wbc (white blood cell count, 10^9/L)
9. creatinine (mg/dL)
10. bun (blood urea nitrogen, mg/dL)
11. icu_mortality (0 = survived ICU, 1 = died in ICU)

Here are real example rows from the eICU database:

age,gender,diabetes,heartrate,meanbp,respiratoryrate,temperature,wbc,creatinine,bun,icu_mortality
75.0,0.0,0,89.0,51.0,4.0,36.4,6.4,1.28,16.0,0
72.0,0.0,0,102.0,134.0,6.0,39.2,24.8,0.8,25.0,0
43.0,1.0,0,94.0,122.0,9.0,36.4,6.7,1.15,18.0,0
65.0,0.0,0,100.0,63.0,7.0,36.3,9.7,0.93,18.0,0
26.0,0.0,0,124.0,52.0,36.0,36.4,17.4,13.2,37.0,0
64.0,1.0,1,110.0,139.0,41.0,36.8,2.7,0.62,12.0,1
89.0,0.0,0,40.0,60.0,44.0,36.4,10.7,1.7,36.0,0
61.0,1.0,0,108.0,47.0,34.0,35.5,37.19,2.63,68.0,1
56.0,0.0,0,94.0,52.0,12.0,36.2,14.7,0.8,11.0,0
54.0,0.0,0,103.0,61.0,9.0,39.7,13.0,1.26,25.0,1
52.0,0.0,0,155.0,127.0,37.0,36.7,14.42,0.9,21.0,1
47.0,1.0,0,109.0,47.0,10.0,35.6,3.2,0.7,10.0,0
73.0,0.0,1,52.0,61.0,5.0,37.7,5.4,1.61,27.0,0
76.0,0.0,0,126.0,134.0,36.0,36.7,8.7,0.73,37.0,0
69.0,0.0,1,41.0,57.0,30.0,36.8,5.1,6.61,36.0,0
82.0,0.0,1,60.0,49.0,6.0,32.9,8.68,2.2,62.0,0
49.0,0.0,0,117.0,129.0,12.0,36.4,21.5,3.35,22.0,1
68.0,0.0,0,128.0,66.0,28.0,36.2,7.6,3.6,98.0,0
81.0,1.0,1,114.0,120.0,14.0,36.9,6.4,1.0,14.0,0
83.0,1.0,0,124.0,50.0,43.0,36.1,11.2,2.32,57.0,0
67.0,0.0,0,121.0,200.0,39.0,36.1,51.7,2.25,55.0,1
69.0,0.0,0,112.0,44.0,35.0,35.6,20.8,1.34,16.0,0
30.0,1.0,0,106.0,70.0,7.0,36.4,8.3,0.7,6.0,0
54.0,0.0,0,56.0,56.0,12.0,35.8,8.8,0.54,11.0,0
69.0,0.0,0,70.0,44.0,27.0,37.1,9.41,0.78,15.0,0
87.0,0.0,0,145.0,48.0,11.0,35.4,17.8,4.27,75.0,1
68.0,0.0,1,108.0,113.0,22.0,37.1,12.6,0.6,17.0,0
71.0,0.0,0,104.0,47.0,11.0,36.8,9.5,0.84,10.0,0
37.0,0.0,1,125.0,104.0,5.0,36.5,7.89,0.66,10.0,0
70.0,1.0,0,179.0,65.0,50.0,36.7,4.3,0.5,9.0,0
53.0,0.0,0,135.0,42.0,36.0,36.3,49.03,1.7,23.0,1
74.0,1.0,0,100.0,49.0,13.0,36.4,8.1,0.9,17.0,0
84.0,1.0,0,109.0,134.0,12.0,35.0,28.1,1.8,29.0,1
78.0,0.0,0,67.0,131.0,56.0,32.9,21.0,0.61,24.0,1
57.0,0.0,0,56.0,110.0,30.0,35.8,9.8,1.4,28.0,0
80.0,0.0,0,112.0,46.0,32.0,37.3,14.6,1.5,33.0,0
84.0,0.0,0,108.0,123.0,35.0,36.3,6.2,2.11,54.0,0
64.0,0.0,1,134.0,74.0,50.0,36.6,7.7,0.66,15.0,0
73.0,1.0,0,110.0,59.0,29.0,36.6,7.3,0.72,18.0,0
71.0,0.0,0,105.0,59.0,31.0,35.7,2.1,3.43,67.0,0

Using the same columns and order, generate 500 NEW synthetic rows as a CSV table.
Requirements:
- Follow similar ranges and relationships as in the examples.
- Include both icu_mortality = 0 and icu_mortality = 1.
- Do NOT copy any example row exactly.
- Make every row a plausible ICU patient.

Output ONLY the CSV header and rows, with no explanation and no code fences.

## 8. Load Baseline Synthetic Data

Loading the synthetic ICU records generated using GPT-5.1 extended thinking with the standard prompt (shown in the previous cell). The prompt requested 500 rows; the loaded file contains 269 rows (262 survivors, 7 deaths).
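The notebook does not show how the raw LLM response was turned into the CSV file that is loaded here. A minimal sketch of one possible validation step follows: it parses the text, coerces all columns to numeric, and drops malformed rows. `clean_llm_csv` and `fake_response` are hypothetical and only illustrate the idea.

```python
import io
import pandas as pd

EXPECTED_COLS = [
    "age", "gender", "diabetes", "heartrate", "meanbp", "respiratoryrate",
    "temperature", "wbc", "creatinine", "bun", "icu_mortality",
]

def clean_llm_csv(text, expected_cols=EXPECTED_COLS):
    """Parse LLM CSV output and keep only well-formed rows (hypothetical helper)."""
    raw = pd.read_csv(io.StringIO(text), on_bad_lines="skip")
    raw = raw.reindex(columns=expected_cols)
    # Coerce everything to numeric; unparseable cells become NaN and are dropped
    num = raw.apply(pd.to_numeric, errors="coerce").dropna()
    # Binary columns must be exactly 0 or 1
    for col in ["gender", "diabetes", "icu_mortality"]:
        num = num[num[col].isin([0, 1])]
    return num.drop_duplicates().reset_index(drop=True)

# Tiny made-up response: one good row, one with a missing value, one with a bad label
fake_response = """age,gender,diabetes,heartrate,meanbp,respiratoryrate,temperature,wbc,creatinine,bun,icu_mortality
70,0,1,95,62,22,36.8,11.2,1.4,30,0
65,1,0,n/a,70,18,37.0,9.0,1.0,20,0
58,0,0,110,55,30,38.1,15.0,2.1,45,2
"""
cleaned = clean_llm_csv(fake_response)
print(cleaned.shape)
```

Logging how many rows each filter removes would make any gap between requested and usable rows easier to explain.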

In [None]:
# Load baseline synthetic data generated by the LLM
synth_base = pd.read_csv("/content/synthetic_baseline_10feat.csv")

synth_base.shape, synth_base["icu_mortality"].value_counts(), synth_base.head()

((269, 11),
 icu_mortality
 0    262
 1      7
 Name: count, dtype: int64,
     age  gender  diabetes  heartrate  meanbp  respiratoryrate  temperature  \
 0  34.0     0.0         1       90.4    91.0             27.2         36.9   
 1  45.0     0.0         0      110.4    80.1             13.5         36.2   
 2  29.0     0.0         0       77.2    74.5             14.1         37.3   
 3  86.0     1.0         0       78.9    87.4             17.6         36.1   
 4  69.0     0.0         0       70.9    71.4             20.5         37.3   
 
      wbc  creatinine   bun  icu_mortality  
 0  12.95        1.32  20.9              0  
 1   9.44        1.01  15.7              0  
 2   8.22        0.94  18.3              0  
 3  11.79        1.71  22.4              0  
 4  13.09        1.45  22.4              0  )

## 9. Experiment 1: Train on Synthetic Baseline, Test on Real Data

Training an XGBoost model on the baseline synthetic data and testing on real data. This evaluates whether models trained on LLM-generated synthetic data can generalize to real ICU patients.

**Key question:** Can synthetic data serve as a substitute for real training data?

In [None]:
X_synth = synth_base[feature_cols].values
y_synth = synth_base[label_col].values

dtrain_synth = xgb.DMatrix(X_synth, label=y_synth)
dtest_real = xgb.DMatrix(X_test, label=y_test)

params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "max_depth": 3,
    "eta": 0.1,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "seed": 42,
}

model_synth_base = xgb.train(params, dtrain_synth, num_boost_round=200)

y_pred_synth_base = model_synth_base.predict(dtest_real)

auc_synth_base = roc_auc_score(y_test, y_pred_synth_base)
auprc_synth_base = average_precision_score(y_test, y_pred_synth_base)

auc_synth_base, auprc_synth_base

(np.float64(0.6604651162790698), np.float64(0.18589032789831222))

## 10. Evaluate Distributional Similarity: Baseline Synthetic vs Real

Computing **KL divergence** for each feature to measure how closely the synthetic data distributions match the real data distributions.

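**KL Divergence (Kullback-Leibler):** Measures how much a distribution Q (here, synthetic) diverges from a reference distribution P (here, real). It is non-negative and asymmetric; lower values indicate closer agreement. As rough rules of thumb for the histogram estimate used here: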

- **KL = 0:** Identical distributions
- **KL < 0.1:** Very similar
- **KL > 1.0:** Substantial differences

This helps us understand if the LLM successfully captured the statistical properties of real ICU data.
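For discrete bins, the quantity is KL(P || Q) = sum over bins of p_i * log(p_i / q_i). The worked example below uses made-up bin counts to show two properties that matter when reading the results: identical shapes give 0 regardless of sample size, and a bin that is populated in P but empty in Q produces a very large value. `kl_from_counts` is an illustrative variant that smooths probabilities rather than densities, so its numbers are not directly comparable to the notebook's function.

```python
import numpy as np

def kl_from_counts(p_counts, q_counts, eps=1e-8):
    """KL(P || Q) from two histograms over the same bins (illustrative helper)."""
    p = np.asarray(p_counts, dtype=float)
    q = np.asarray(q_counts, dtype=float)
    # Convert counts to probabilities, smooth empty bins, renormalize
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

real = [10, 30, 40, 20]          # made-up bin counts
same_shape = [5, 15, 20, 10]     # same proportions, half the sample size
missing_tail = [20, 40, 40, 0]   # the last bin is never reached

kl_same = kl_from_counts(real, same_shape)
kl_missing = kl_from_counts(real, missing_tail)
kl_reverse = kl_from_counts(missing_tail, real)

print(f"same shape:      {kl_same:.4f}")
print(f"missing tail:    {kl_missing:.4f}")
print(f"reverse (Q || P): {kl_reverse:.4f}")
```

The gap between the last two values reflects the asymmetry of KL: an empty bin in Q is penalized heavily, while an empty bin in P contributes almost nothing.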

In [None]:
import numpy as np
import pandas as pd

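def kl_divergence_hist(real_vals, synth_vals, n_bins=20, eps=1e-8):
    """KL(P || Q) using histogram estimates over a shared range.

    P is estimated from the real values and Q from the synthetic values.
    """
    r = np.asarray(real_vals)
    s = np.asarray(synth_vals)

    # Shared min/max so both histograms use identical bin edges
    vmin = min(r.min(), s.min())
    vmax = max(r.max(), s.max())

    # Histograms -> density estimates on the shared bins
    p_hist, _ = np.histogram(r, bins=n_bins, range=(vmin, vmax), density=True)
    q_hist, _ = np.histogram(s, bins=n_bins, range=(vmin, vmax), density=True)

    # Smooth empty bins with eps so the log ratio stays finite.
    # A bin that is populated in the real data but empty in the synthetic
    # data still contributes a very large term, which inflates the KL value.
    p = p_hist + eps
    q = q_hist + eps

    # Renormalize so both vectors sum to 1
    p /= p.sum()
    q /= q.sum()

    return float(np.sum(p * np.log(p / q)))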

kl_results = {}

for col in feature_cols:
    kl = kl_divergence_hist(df[col].values, synth_base[col].values)
    kl_results[col] = kl

kl_series = pd.Series(kl_results)
kl_series, kl_series.mean()


(age                0.142086
 gender             0.017154
 diabetes           0.000068
 heartrate          4.561155
 meanbp             8.774856
 respiratoryrate    7.114293
 temperature        0.898168
 wbc                2.804797
 creatinine         2.306153
 bun                1.629345
 dtype: float64,
 np.float64(2.8248075281463203))

### Privacy-aware prompt used with GPT-5.1 extended thinking

You are generating synthetic ICU patient data for a research project.

Each row is one ICU stay. All values are numeric except the label.
Columns (in order):

1. age (years, roughly 20–90)
2. gender (0 = male, 1 = female)
3. diabetes (0 = no diabetes, 1 = diabetes)
4. heartrate (beats per minute)
5. meanbp (mean arterial blood pressure, mmHg)
6. respiratoryrate (breaths per minute)
7. temperature (Celsius)
8. wbc (white blood cell count, 10^9/L)
9. creatinine (mg/dL)
10. bun (blood urea nitrogen, mg/dL)
11. icu_mortality (0 = survived ICU, 1 = died in ICU)

Here are real example rows from the eICU database:

age,gender,diabetes,heartrate,meanbp,respiratoryrate,temperature,wbc,creatinine,bun,icu_mortality
75.0,0.0,0,89.0,51.0,4.0,36.4,6.4,1.28,16.0,0
72.0,0.0,0,102.0,134.0,6.0,39.2,24.8,0.8,25.0,0
43.0,1.0,0,94.0,122.0,9.0,36.4,6.7,1.15,18.0,0
65.0,0.0,0,100.0,63.0,7.0,36.3,9.7,0.93,18.0,0
26.0,0.0,0,124.0,52.0,36.0,36.4,17.4,13.2,37.0,0
64.0,1.0,1,110.0,139.0,41.0,36.8,2.7,0.62,12.0,1
89.0,0.0,0,40.0,60.0,44.0,36.4,10.7,1.7,36.0,0
61.0,1.0,0,108.0,47.0,34.0,35.5,37.19,2.63,68.0,1
56.0,0.0,0,94.0,52.0,12.0,36.2,14.7,0.8,11.0,0
54.0,0.0,0,103.0,61.0,9.0,39.7,13.0,1.26,25.0,1
52.0,0.0,0,155.0,127.0,37.0,36.7,14.42,0.9,21.0,1
47.0,1.0,0,109.0,47.0,10.0,35.6,3.2,0.7,10.0,0
73.0,0.0,1,52.0,61.0,5.0,37.7,5.4,1.61,27.0,0
76.0,0.0,0,126.0,134.0,36.0,36.7,8.7,0.73,37.0,0
69.0,0.0,1,41.0,57.0,30.0,36.8,5.1,6.61,36.0,0
82.0,0.0,1,60.0,49.0,6.0,32.9,8.68,2.2,62.0,0
49.0,0.0,0,117.0,129.0,12.0,36.4,21.5,3.35,22.0,1
68.0,0.0,0,128.0,66.0,28.0,36.2,7.6,3.6,98.0,0
81.0,1.0,1,114.0,120.0,14.0,36.9,6.4,1.0,14.0,0
83.0,1.0,0,124.0,50.0,43.0,36.1,11.2,2.32,57.0,0
67.0,0.0,0,121.0,200.0,39.0,36.1,51.7,2.25,55.0,1
69.0,0.0,0,112.0,44.0,35.0,35.6,20.8,1.34,16.0,0
30.0,1.0,0,106.0,70.0,7.0,36.4,8.3,0.7,6.0,0
54.0,0.0,0,56.0,56.0,12.0,35.8,8.8,0.54,11.0,0
69.0,0.0,0,70.0,44.0,27.0,37.1,9.41,0.78,15.0,0
87.0,0.0,0,145.0,48.0,11.0,35.4,17.8,4.27,75.0,1
68.0,0.0,1,108.0,113.0,22.0,37.1,12.6,0.6,17.0,0
71.0,0.0,0,104.0,47.0,11.0,36.8,9.5,0.84,10.0,0
37.0,0.0,1,125.0,104.0,5.0,36.5,7.89,0.66,10.0,0
70.0,1.0,0,179.0,65.0,50.0,36.7,4.3,0.5,9.0,0
53.0,0.0,0,135.0,42.0,36.0,36.3,49.03,1.7,23.0,1
74.0,1.0,0,100.0,49.0,13.0,36.4,8.1,0.9,17.0,0
84.0,1.0,0,109.0,134.0,12.0,35.0,28.1,1.8,29.0,1
78.0,0.0,0,67.0,131.0,56.0,32.9,21.0,0.61,24.0,1
57.0,0.0,0,56.0,110.0,30.0,35.8,9.8,1.4,28.0,0
80.0,0.0,0,112.0,46.0,32.0,37.3,14.6,1.5,33.0,0
84.0,0.0,0,108.0,123.0,35.0,36.3,6.2,2.11,54.0,0
64.0,0.0,1,134.0,74.0,50.0,36.6,7.7,0.66,15.0,0
73.0,1.0,0,110.0,59.0,29.0,36.6,7.3,0.72,18.0,0
71.0,0.0,0,105.0,59.0,31.0,35.7,2.1,3.43,67.0,0

Using the same columns and order, generate 500 NEW synthetic rows as a CSV table.

Privacy and realism requirements:
- Match overall ranges and trends of the examples, but DO NOT memorize them.
- Do NOT copy any example row exactly.
- Avoid rows that are extremely similar to any example (e.g., same values in most columns).
- Avoid reproducing extreme or rare combinations exactly as seen in the examples.
- Include both icu_mortality = 0 and icu_mortality = 1.
- Make each row a plausible ICU patient.

Output ONLY the CSV header and rows, with no explanation and no code fences.


---

## Part B: Privacy-Aware Synthetic Data Generation

In this section, we evaluate synthetic data generated with an **enhanced prompt** (shown above) that explicitly instructs the LLM to avoid memorization and reduce privacy risks.

## 11. Load Privacy-Aware Synthetic Data

Loading the synthetic ICU records generated using GPT-5.1 extended thinking with the privacy-aware prompt. The prompt requested 500 rows; the loaded file contains 336 rows (281 survivors, 55 deaths).

**Key differences from the baseline prompt** (both prompts already forbid copying an example row exactly):
- Match overall ranges and trends without memorizing the examples
- Avoid rows that are extremely similar to any example
- Avoid reproducing rare or extreme combinations exactly
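The notebook does not verify whether generated rows are near-copies of the prompt examples. One simple check, sketched below under stated assumptions, is the distance from each synthetic row to its closest example row after standardizing the features. `min_distance_to_examples` is a hypothetical helper, the toy values are made up, and a small distance is only a warning sign, not a formal privacy measure.

```python
import numpy as np
import pandas as pd

def min_distance_to_examples(synth, examples, cols):
    """Z-scored Euclidean distance from each synthetic row to its nearest example row."""
    mu = examples[cols].mean()
    sd = examples[cols].std(ddof=0).replace(0, 1.0)  # guard against constant columns
    s = ((synth[cols] - mu) / sd).to_numpy(dtype=float)
    e = ((examples[cols] - mu) / sd).to_numpy(dtype=float)
    # Pairwise distances via broadcasting: shape (n_synth, n_examples)
    d = np.sqrt(((s[:, None, :] - e[None, :, :]) ** 2).sum(axis=2))
    return d.min(axis=1)

cols = ["age", "heartrate", "meanbp"]
# Made-up rows standing in for the prompt examples and the generated data
examples = pd.DataFrame({"age": [70, 45, 60], "heartrate": [90, 100, 115], "meanbp": [55, 120, 135]})
synth = pd.DataFrame({"age": [70, 52], "heartrate": [90, 101], "meanbp": [55, 92]})

dist = min_distance_to_examples(synth, examples, cols)
print(dist)  # the first synthetic row copies an example exactly, so its distance is 0
```

Comparing the distribution of these distances between the baseline and privacy-aware files would give a first indication of whether the extra prompt instructions changed anything.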

In [None]:
synth_priv = pd.read_csv("/content/synthetic_privacy_10feat.csv")

synth_priv.shape, synth_priv["icu_mortality"].value_counts(), synth_priv.head()

((336, 11),
 icu_mortality
 0    281
 1     55
 Name: count, dtype: int64,
    age  gender  diabetes  heartrate  meanbp  respiratoryrate  temperature  \
 0   79     0.0         0       65.0    40.0             16.3         37.6   
 1   50     0.0         1       49.7    95.6             12.4         36.9   
 2   49     1.0         0       95.2    57.4             26.2         35.3   
 3   67     1.0         0       48.3    79.0             13.0         36.8   
 4   84     0.0         0      123.3    92.5             18.4         36.4   
 
      wbc  creatinine   bun  icu_mortality  
 0  12.26        2.03   5.0              0  
 1   8.90        2.70   6.8              0  
 2  10.93        0.73  11.5              0  
 3   3.59        1.47  31.5              0  
 4   8.88        0.79  14.4              0  )

## 12. Experiment 2: Train on Privacy-Aware Synthetic, Test on Real Data

Training an XGBoost model on the privacy-aware synthetic data and testing on real data.

**Key question:** Does adding privacy constraints to the LLM prompt degrade model performance?

In [None]:
# Features and label from privacy-aware synthetic data
X_synth_priv = synth_priv[feature_cols].values
y_synth_priv = synth_priv[label_col].values

dtrain_synth_priv = xgb.DMatrix(X_synth_priv, label=y_synth_priv)
dtest_real = xgb.DMatrix(X_test, label=y_test)

params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "max_depth": 3,
    "eta": 0.1,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "seed": 42,
}

model_synth_priv = xgb.train(params, dtrain_synth_priv, num_boost_round=200)

y_pred_synth_priv = model_synth_priv.predict(dtest_real)

auc_synth_priv = roc_auc_score(y_test, y_pred_synth_priv)
auprc_synth_priv = average_precision_score(y_test, y_pred_synth_priv)

auc_synth_priv, auprc_synth_priv

(np.float64(0.7120507399577166), np.float64(0.2894544561189666))

## 13. Evaluate Distributional Similarity: Privacy-Aware Synthetic vs Real

Computing KL divergence for each feature between privacy-aware synthetic data and real data.

**Key question:** Does the privacy-aware prompt maintain distributional similarity to real data?

In [None]:
kl_results_priv = {}

for col in feature_cols:
    kl = kl_divergence_hist(df[col].values, synth_priv[col].values)
    kl_results_priv[col] = kl

kl_priv_series = pd.Series(kl_results_priv)
kl_priv_series, kl_priv_series.mean()

(age                0.084242
 gender             0.008316
 diabetes           0.003828
 heartrate          3.441155
 meanbp             4.130930
 respiratoryrate    7.414505
 temperature        1.037607
 wbc                0.547082
 creatinine         1.207448
 bun                1.048913
 dtype: float64,
 np.float64(1.8924026999576462))

---

## 14. Results Summary

Consolidating all experimental results into a single comparison table.

**Three experimental settings:**

1. **Real → Real:** Real-data reference baseline (train and test on real data)
2. **Synthetic Baseline → Real:** Standard LLM-generated data (no privacy constraints)
3. **Synthetic Privacy-Aware → Real:** Privacy-enhanced LLM-generated data

**Metrics compared:**
- **AUC (AUROC):** Overall classification performance
- **AUPRC (Average Precision):** Performance on imbalanced mortality prediction
- **Mean KL Divergence:** Average divergence from the real data across the 10 features (lower is better)

In [None]:
results = []

# Real → Real (KL of the real data against itself is 0 by definition)
results.append({
    "setting": "real_train_real_test",
    "train_source": "real",
    "test_source": "real",
    "auc": float(auc),
    "auprc": float(auprc),
    "mean_kl_vs_real": 0.0,
})

# Synthetic baseline → Real
results.append({
    "setting": "synth_baseline_train_real_test",
    "train_source": "synthetic_baseline",
    "test_source": "real",
    "auc": float(auc_synth_base),
    "auprc": float(auprc_synth_base),
    "mean_kl_vs_real": float(kl_series.mean()),
})

# Synthetic privacy-aware → Real
results.append({
    "setting": "synth_privacy_train_real_test",
    "train_source": "synthetic_privacy",
    "test_source": "real",
    "auc": float(auc_synth_priv),
    "auprc": float(auprc_synth_priv),
    "mean_kl_vs_real": float(kl_priv_series.mean()),
})

results_df = pd.DataFrame(results)
results_df

| | setting | train_source | test_source | auc | auprc | mean_kl_vs_real |
|---|---|---|---|---|---|---|
| 0 | real_train_real_test | real | real | 0.632981 | 0.132273 | 0.0 |
| 1 | synth_baseline_train_real_test | synthetic_baseline | real | 0.660465 | 0.18589 | 2.824808 |
| 2 | synth_privacy_train_real_test | synthetic_privacy | real | 0.712051 | 0.289454 | 1.892403 |


## 15. Export Results

Saving all datasets and results for reproducibility and further analysis.

**Exported files:**
- `real_icu_10feat.csv` - Real ICU patient data with 10 features
- `synthetic_baseline_10feat_clean.csv` - Standard LLM-generated synthetic data
- `synthetic_privacy_10feat_clean.csv` - Privacy-aware LLM-generated synthetic data
- `results_10feat_experiments.csv` - Summary table of all experimental results

In [None]:
# Real 10-feature dataset with label
df.to_csv("/content/real_icu_10feat.csv", index=False)

# Baseline synthetic dataset
synth_base.to_csv("/content/synthetic_baseline_10feat_clean.csv", index=False)

# Privacy-aware synthetic dataset
synth_priv.to_csv("/content/synthetic_privacy_10feat_clean.csv", index=False)

# Summary results table
results_df.to_csv("/content/results_10feat_experiments.csv", index=False)
results_df

| | setting | train_source | test_source | auc | auprc | mean_kl_vs_real |
|---|---|---|---|---|---|---|
| 0 | real_train_real_test | real | real | 0.632981 | 0.132273 | 0.0 |
| 1 | synth_baseline_train_real_test | synthetic_baseline | real | 0.660465 | 0.18589 | 2.824808 |
| 2 | synth_privacy_train_real_test | synthetic_privacy | real | 0.712051 | 0.289454 | 1.892403 |


---

## Conclusions

This project suggests that:

1. **LLMs can generate usable synthetic ICU data**: demographic features match the real distributions closely (KL below 0.15), while vitals and labs diverge substantially (KL above 1 for most of them)
2. **Privacy-aware prompting** is intended to reduce memorization risks while preserving data utility, although we did not directly test this with membership inference attacks in this project
3. **Synthetic data can support more privacy-preserving ML workflows** by allowing training without sharing raw patient records, though full privacy guarantees were not evaluated here
4. **In general, there may be trade-offs** between privacy constraints and model accuracy, but in this demo the privacy-aware prompt did not hurt performance (AUROC 0.71 versus 0.66 for the baseline prompt)

### Key Findings

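- Models trained on synthetic data can generalize to real test data, and in this small demo even outperform real-data training (AUROC 0.66 and 0.71 versus 0.63). With only 11 deaths in the test set, these differences are likely due to noise or simplifications in the experimental setup
- Privacy-aware generation did not worsen distributional similarity: mean KL was 1.89 versus 2.82 for the baseline prompt, although both remain far from the real distributions for vital signs
- KL divergence provides a useful metric for quantifying how well synthetic data matches real data distributions, but the histogram estimate is sensitive to bins that are empty in the synthetic data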

### Future Directions

- Evaluate additional privacy metrics (e.g., membership inference attacks)
- Test on larger, more complex healthcare datasets
- Compare LLM-based synthesis with other methods (GANs, differential privacy, SMOTE)
- Explore few-shot learning with synthetic data augmentation
- Investigate optimal prompt engineering strategies for privacy-utility balance

---

**Data Source:** eICU Collaborative Research Database Demo v2.0.1 from PhysioNet  
**LLM Used:** GPT-5.1 extended thinking  
**Course:** CS598 Deep Learning for Healthcare