# MIMIC Notebook

This notebook runs through investigations on internal NHSX datasets collated from the MIMIC-III dataset.

Users without MIMIC-III access cannot run these investigations until access has been granted. In the meantime, the same investigations can be run on the open-access SUPPORT dataset.

The notebook that produces our single table is available at <https://github.com/DaveBrind/SynthVAE/blob/main/MIMIC_preproc.ipynb>. To create a single table yourself, follow the format of the example CSV file at <https://github.com/DaveBrind/SynthVAE/blob/main/example_input.csv>.

In [None]:
#%% -------- Import Libraries -------- #

# Standard imports
import numpy as np
import pandas as pd
import torch

# VAE is in other folder
import sys

sys.path.append("../")

# Opacus support for differential privacy
from opacus.utils.uniform_sampler import UniformWithReplacementSampler

# For VAE dataset formatting
from torch.utils.data import TensorDataset, DataLoader

# VAE functions
from VAE import Decoder, Encoder, VAE

# For datetime columns we need a transformer
from rdt.transformers import datetime

# Utility file contains all functions required to run notebook
from utils import (
    set_seed,
    mimic_pre_proc,
    constraint_filtering,
    plot_elbo,
    plot_likelihood_breakdown,
    plot_variable_distributions,
    reverse_transformers,
)
from metrics import distribution_metrics, privacy_metrics

import warnings

# We suppress warnings to avoid SDMETRICS throwing unique synthetic data warnings
# (i.e. data in the synthetic set is not in the real data set) as well as SKLEARN
# throwing convergence warnings (pre-processing uses GMM from sklearn and this
# throws non-convergence warnings)
warnings.filterwarnings("ignore")

set_seed(0)

## Data Loading & Column Definitions

First we need to load in the MIMIC dataset from a specified filepath. 

We then need to create lists indicating which columns are:
- continuous
- categorical
- datetime

Other data types are not currently supported. Importantly, columns that contain missing data must be dropped: do not include them in the original column lists, and drop them from the loaded set in the cell below.
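As a minimal sketch of the check described above, the snippet below lists the columns that contain missing values so they can be dropped before the column lists are built. The toy DataFrame and the helper name `columns_with_missing` are assumptions for illustration; with the real data you would pass <b>data_supp</b> instead.

```python
import pandas as pd


def columns_with_missing(df: pd.DataFrame) -> list:
    """Return the names of columns that contain at least one missing value."""
    return df.columns[df.isna().any()].tolist()


# Toy frame standing in for the loaded table (illustrative values only)
toy = pd.DataFrame(
    {
        "GENDER": ["F", "M", "F"],
        "VALUE": [1.2, 3.4, 5.6],
        "DOD": [None, "2150-01-01", None],
    }
)

missing = columns_with_missing(toy)
toy_clean = toy.drop(columns=missing)
print(missing)
```

Any column reported here should be left out of the categorical, continuous and datetime lists.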

In [None]:
# Load in the mimic single table data

filepath = ""

data_supp = pd.read_csv(filepath)
original_categorical_columns = [
    "ETHNICITY",
    "DISCHARGE_LOCATION",
    "GENDER",
    "FIRST_CAREUNIT",
    "VALUEUOM",
    "LABEL",
]
original_continuous_columns = ["SUBJECT_ID", "VALUE", "age"]
original_datetime_columns = ["ADMITTIME", "DISCHTIME", "DOB", "CHARTTIME"]

# Drop the DOD column as it contains NaNs - for now

# data_supp = data_supp.drop('DOD', axis = 1)

## Data Pre-Processing

Data can be pre-processed in two ways. The first is the <b>"standard"</b> option, which applies a standard scaler to the continuous variables. This has a known limitation:

- Data in tables is usually non-Gaussian and SynthVAE implements a Gaussian loss, so this option will perform worse unless the data is already known to follow a Gaussian distribution.

The second option is <b>"GMM"</b>. This fits a variational Gaussian mixture model to scale the data and transform it towards a Gaussian distribution. We allow a maximum of 10 clusters, and the variational method selects the number of clusters actually used for each continuous variable. This also has known limitations:

- The maximum of 10 clusters is arbitrary and may not be enough for certain variables.
- We fit a model to transform the data, so we are already approximating before the VAE is trained. This loses fidelity because the distribution will not be transformed perfectly.
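The limitation noted for the "standard" option can be checked numerically. The sketch below standardises a skewed variable with plain NumPy (not the scaler used inside `mimic_pre_proc`) and shows that the result has zero mean and unit variance but exactly the same skewness, so it is no closer to a Gaussian shape. The exponential sample and the `skewness` helper are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A skewed, clearly non-Gaussian variable (exponential draws as a stand-in)
x = rng.exponential(scale=2.0, size=10_000)


def skewness(values: np.ndarray) -> float:
    """Sample skewness: the third standardised moment."""
    centred = values - values.mean()
    return float((centred ** 3).mean() / values.std() ** 3)


# "standard" pre-processing: subtract the mean, divide by the standard deviation
x_scaled = (x - x.mean()) / x.std()

print(f"mean={x_scaled.mean():.3f}, std={x_scaled.std():.3f}")
print(f"skewness before={skewness(x):.3f}, after={skewness(x_scaled):.3f}")
```

Standardisation is a linear map, so it changes location and scale only. Reshaping the distribution is what the "GMM" option attempts.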


For datasets that include datetime columns, <b>original_metric_set</b> holds the initial dataset after these columns have been transformed. This is because:

- Our evaluation suite cannot calculate certain metrics on datetime objects, so these need to be converted to continuous values first.
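To illustrate what converting a datetime column to a continuous value means, here is a small sketch that maps timestamps to whole seconds since 1970-01-01 with pandas. The notebook itself uses the RDT datetime transformer for this; the helper `to_epoch_seconds` and the sample timestamps are hypothetical and only show the idea.

```python
import pandas as pd

EPOCH = pd.Timestamp("1970-01-01")


def to_epoch_seconds(series: pd.Series) -> pd.Series:
    """Convert a datetime column to whole seconds since 1970-01-01."""
    return (pd.to_datetime(series) - EPOCH) // pd.Timedelta(seconds=1)


admit = pd.Series(pd.to_datetime(["1970-01-02 00:00:00", "2150-03-01 10:30:00"]))
admit_seconds = to_epoch_seconds(admit)

# Converting back recovers the original timestamps
restored = pd.to_datetime(admit_seconds, unit="s")
print(admit_seconds.tolist())
```

Once in this form, a datetime column can be treated like any other continuous variable by the metrics.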

In [None]:
pre_proc_method = "standard"  # Select pre-processing method standard or GMM

#%% -------- Data Pre-Processing -------- #

(
    x_train,
    original_metric_set,
    reordered_dataframe_columns,
    continuous_transformers,
    categorical_transformers,
    datetime_transformers,
    num_categories,
    num_continuous,
) = mimic_pre_proc(data_supp=data_supp, pre_proc_method=pre_proc_method)



## Creation & Training of VAE

We can adapt certain parameters of the model, e.g. batch size and latent dimension size. The model implements early stopping, and its settings (<b>patience</b> and <b>delta</b>) can also be adapted.
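The early stopping mentioned above can be sketched as follows. This is an assumed, simplified rule, not the logic inside `VAE.train`: it treats the logged value as a loss to minimise, counts an epoch as an improvement only if the loss drops by more than `delta`, and stops after `patience` consecutive epochs without improvement. The function name `epochs_until_stop` and the loss values are hypothetical.

```python
def epochs_until_stop(losses, patience, delta):
    """Return the epoch (1-indexed) at which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses, start=1):
        if best - loss > delta:
            # Improvement larger than delta: record it and reset the counter
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(losses)


# Made-up per-epoch loss values
losses = [500.0, 450.0, 445.0, 444.0, 430.0, 429.0, 428.0, 427.0, 426.0]
stopped_at = epochs_until_stop(losses, patience=3, delta=10)
print(stopped_at)
```

A larger `delta` makes the rule stricter about what counts as progress, while a larger `patience` tolerates longer plateaus.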

We can also activate differential privacy by implementing DP-SGD through the Opacus library.
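For readers unfamiliar with DP-SGD, the core of one gradient step can be sketched in NumPy: clip each per-sample gradient to a maximum L2 norm, sum the clipped gradients, add Gaussian noise scaled by the clipping threshold, and average. This is a generic illustration, not the Opacus implementation; `dp_sgd_gradient` and its arguments are hypothetical names, with `clip_norm` playing the role of `C` and `noise_multiplier` the role of `noise_scale` in the cell below.

```python
import numpy as np


def dp_sgd_gradient(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-sample gradient, sum, add Gaussian noise, then average."""
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    # Scale down only those gradients whose L2 norm exceeds clip_norm
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_sample_grads * scale
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=clipped.shape[1])
    return clipped, (clipped.sum(axis=0) + noise) / len(per_sample_grads)


rng = np.random.default_rng(0)
grads = np.array([[3.0, 4.0], [0.3, 0.4], [30.0, 40.0]])  # L2 norms 5, 0.5, 50
clipped, noisy_mean = dp_sgd_gradient(
    grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng
)
print(np.linalg.norm(clipped, axis=1))
```

Clipping bounds the influence of any single record, and the noise is calibrated to that bound. Together these make a privacy accountant's (epsilon, delta) statement possible.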

In [None]:
#%% -------- Create & Train VAE -------- #

# User defined hyperparams
# General training
batch_size = 32
latent_dim = 256
hidden_dim = 256
n_epochs = 5
logging_freq = 1  # How often (in epochs) results are logged to the user
patience = 5  # How many epochs to keep training without improvement
# before stopping early
delta = 10  # The difference between ELBO values that registers an improvement
filepath = None  # Where to save the best model


# Privacy params
differential_privacy = False  # Do we want to implement differential privacy
sample_rate = 0.1  # Sampling rate (recomputed below as batch_size / len(dataset))
C = 1e16  # Clipping threshold - gradients with a norm above this are clipped
noise_scale = None  # Noise multiplier - influences how much noise to add
target_eps = 1  # Target epsilon for privacy accountant
target_delta = 1e-5  # Target delta for privacy accountant


# Prepare data for interaction with torch VAE
Y = torch.Tensor(x_train)
dataset = TensorDataset(Y)

generator = None
sample_rate = batch_size / len(dataset)
data_loader = DataLoader(
    dataset,
    batch_sampler=UniformWithReplacementSampler(
        num_samples=len(dataset), sample_rate=sample_rate, generator=generator
    ),
    pin_memory=True,
    generator=generator,
)

# Create VAE

encoder = Encoder(x_train.shape[1], latent_dim, hidden_dim=hidden_dim)
decoder = Decoder(latent_dim, num_continuous, num_categories=num_categories)

vae = VAE(encoder, decoder)

print(vae)

if not differential_privacy:
    (
        training_epochs,
        log_elbo,
        log_reconstruction,
        log_divergence,
        log_categorical,
        log_numerical,
    ) = vae.train(
        data_loader, 
        n_epochs=n_epochs,
        logging_freq=logging_freq,
        patience=patience,
        delta=delta,
    )

else:
    (
        training_epochs,
        log_elbo,
        log_reconstruction,
        log_divergence,
        log_categorical,
        log_numerical,
    ) = vae.diff_priv_train(
        data_loader,
        n_epochs=n_epochs,
        logging_freq=logging_freq,
        patience=patience,
        delta=delta,
        C=C,
        target_eps=target_eps,
        target_delta=target_delta,
        sample_rate=sample_rate,
        noise_scale=noise_scale,
    )
    print(f"(epsilon, delta): {vae.get_privacy_spent(target_delta)}")


## Plotting ELBO Functionality

Here we can plot and save the ELBO graph for the training run.
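As a reminder of what the plotted quantities represent, the ELBO of a VAE is the reconstruction log-likelihood minus the KL divergence between the approximate posterior and the prior. The sketch below computes the closed-form KL term for a diagonal Gaussian posterior against a standard normal prior. It is a textbook formulation with hypothetical function names; the sign conventions used for `log_elbo` and `log_divergence` inside SynthVAE may differ.

```python
import numpy as np


def gaussian_kl(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)) for one latent vector."""
    return 0.5 * float(np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar))


def elbo(log_likelihood, mu, logvar):
    """ELBO = reconstruction log-likelihood minus the KL divergence."""
    return log_likelihood - gaussian_kl(mu, logvar)


mu = np.array([0.5, -0.5])
logvar = np.zeros(2)

# When the posterior equals the prior, the KL term vanishes
print(gaussian_kl(np.zeros(2), np.zeros(2)))
print(elbo(-10.0, mu, logvar))
```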

In [None]:
#%% -------- Plot Loss Features ELBO Breakdown -------- #

elbo_fig = plot_elbo(
    n_epochs=training_epochs,
    log_elbo=log_elbo,
    log_reconstruction=log_reconstruction,
    log_divergence=log_divergence,
    saving_filepath="",
    pre_proc_method=pre_proc_method,
)


## Plotting Reconstruction Breakdown

Here we can plot the breakdown of the reconstruction loss, i.e. visualise how the categorical and numerical losses change over training.
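To make the two components concrete, the sketch below computes a categorical term (negative log-likelihood of a one-hot target under predicted probabilities) and a numerical term (Gaussian negative log-likelihood) for a single made-up row, then adds them. It assumes a standard formulation; the exact loss terms and weighting inside SynthVAE are not shown in this notebook and may differ.

```python
import numpy as np


def categorical_nll(one_hot, probs):
    """Negative log-likelihood of one-hot targets under predicted probabilities."""
    return float(-np.sum(one_hot * np.log(probs)))


def gaussian_nll(x, mean, std):
    """Negative log-likelihood of x under independent Gaussians."""
    var = std ** 2
    return float(np.sum(0.5 * np.log(2 * np.pi * var) + (x - mean) ** 2 / (2 * var)))


# One categorical variable with three levels, two continuous variables
one_hot = np.array([0.0, 1.0, 0.0])
probs = np.array([0.2, 0.7, 0.1])
x = np.array([0.3, -1.2])
mean = np.array([0.0, -1.0])
std = np.array([1.0, 1.0])

categorical_loss = categorical_nll(one_hot, probs)
numerical_loss = gaussian_nll(x, mean, std)
reconstruction_loss = categorical_loss + numerical_loss
print(categorical_loss, numerical_loss, reconstruction_loss)
```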

In [None]:
#%% -------- Plot Loss Features Reconstruction Breakdown -------- #

likelihood_fig = plot_likelihood_breakdown(
    n_epochs=training_epochs,
    log_categorical=log_categorical,
    log_numerical=log_numerical,
    saving_filepath="",
    pre_proc_method=pre_proc_method,
)


## Synthetic Data Generation

Here we create synthetic data ready for metric testing as well as visualisation of variable reconstruction.

If you are using the MIMIC-III internal set from NHSX, constraint sampling checks that the following constraints are obeyed in the synthetic set:

- age is greater than or equal to 0
- The admission date is after the date of birth
- The discharge date is after or equal to the admission date
- The first chart time is after or equal to the admission date
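The four constraints listed above can be expressed as a boolean row filter. The sketch below applies them to a toy DataFrame using the column names from this notebook; how `constraint_filtering` enforces them internally (for example, whether it resamples rejected rows) is not shown here, and `satisfies_constraints` is a hypothetical helper.

```python
import pandas as pd


def satisfies_constraints(df: pd.DataFrame) -> pd.Series:
    """Boolean mask of rows that obey all four constraints."""
    return (
        (df["age"] >= 0)
        & (df["ADMITTIME"] > df["DOB"])
        & (df["DISCHTIME"] >= df["ADMITTIME"])
        & (df["CHARTTIME"] >= df["ADMITTIME"])
    )


# Toy rows: the second has a negative age, the third is discharged before admission
toy = pd.DataFrame(
    {
        "age": [40, -1, 65],
        "DOB": pd.to_datetime(["2100-01-01", "2100-01-01", "2100-01-01"]),
        "ADMITTIME": pd.to_datetime(["2140-05-01", "2140-05-01", "2165-02-01"]),
        "DISCHTIME": pd.to_datetime(["2140-05-09", "2140-05-09", "2165-01-20"]),
        "CHARTTIME": pd.to_datetime(["2140-05-02", "2140-05-02", "2165-02-03"]),
    }
)

mask = satisfies_constraints(toy)
valid_rows = toy[mask]
print(mask.tolist())
```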

## Either run the cell directly below to sample with constraints OR run the second cell below to generate a sample without constraints

In [None]:
#%% -------- Constraint Sampling -------- #

# For NHSX internal MIMIC set OR sets which follow a similar data structure

synthetic_supp = constraint_filtering(
    n_rows=data_supp.shape[0],
    vae=vae,
    reordered_cols=reordered_dataframe_columns,
    data_supp_columns=data_supp.columns,
    cont_transformers=continuous_transformers,
    cat_transformers=categorical_transformers,
    date_transformers=datetime_transformers,
    pre_proc_method=pre_proc_method,
)


In [None]:
#%% -------- Synthetic Data Generation Without Constraints -------- #

# For any other datasets OR for running without constraint sampling

synthetic_sample = vae.generate(data_supp.shape[0])

# Reverse the transformations

synthetic_supp = reverse_transformers(
    synthetic_set=synthetic_sample,
    data_supp_columns=data_supp.columns,
    cont_transformers=continuous_transformers,
    cat_transformers=categorical_transformers,
    date_transformers=datetime_transformers,
    pre_proc_method=pre_proc_method,
)



## Synthetic Variable Visualisation

Here we visualise the synthetic variables generated and compare them to the original set.

In [None]:
#%% -------- Plot Histograms For All The Variable Distributions -------- #

plot_variable_distributions(
    categorical_columns=original_categorical_columns,
    continuous_columns=original_continuous_columns,
    data_supp=data_supp,
    synthetic_supp=synthetic_supp,
    saving_filepath="",
    pre_proc_method=pre_proc_method,
)


## Metric evaluation

For datasets that have datetime columns, we need to transform these back into numerical values because our metrics cannot handle datetime objects. We then input <b>original_metric_set</b> alongside the newly transformed synthetic set, i.e. <b>metric_synthetic_supp</b>. If the set contains no datetimes, you can run <b>data_supp</b> against <b>synthetic_supp</b> directly and skip the datetime handling.

We use the SDV evaluation framework. List the metrics you wish to compute in <b>distributional_metrics</b>, following the SDV guidance. A good starting point is https://sdv.dev/SDV/user_guides/evaluation/single_table_metrics.html

Note that not all of these will work; some are hit and miss. We predominantly rely on the continuous and discrete KL divergence measures. You can also enable <b>gower</b> to calculate the Gower distance using the gower library.
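Since the discrete KL divergence is one of the measures we rely on most, here is a simplified single-column sketch of the idea: compare the category frequencies of a real and a synthetic column. It is not the SDV implementation, which may select, bin or normalise differently; `discrete_kl`, the smoothing constant and the toy columns are assumptions for illustration.

```python
import math
from collections import Counter


def discrete_kl(real, synthetic, smoothing=1e-9):
    """KL divergence between the category frequencies of two columns."""
    categories = sorted(set(real) | set(synthetic))
    real_counts, synth_counts = Counter(real), Counter(synthetic)
    k = len(categories)
    kl = 0.0
    for category in categories:
        # Smoothed relative frequencies avoid division by zero for unseen levels
        p = (real_counts[category] + smoothing) / (len(real) + smoothing * k)
        q = (synth_counts[category] + smoothing) / (len(synthetic) + smoothing * k)
        kl += p * math.log(p / q)
    return kl


real_gender = ["F", "F", "M", "M"]
synthetic_gender = ["F", "F", "F", "M"]

print(discrete_kl(real_gender, real_gender))
print(discrete_kl(real_gender, synthetic_gender))
```

A value of zero means the two frequency tables match, and larger values indicate a bigger mismatch.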

In [None]:
#%% -------- Datetime Handling -------- #

# If the dataset has datetimes then we need to convert these back to a numerical
# value representing seconds, so that we can evaluate the metrics on them

metric_synthetic_supp = synthetic_supp.copy()

for column in original_datetime_columns:

    # Fit datetime transformer - converts to seconds
    temp_datetime = datetime.DatetimeTransformer()
    temp_datetime.fit(metric_synthetic_supp, columns=column)

    metric_synthetic_supp = temp_datetime.transform(metric_synthetic_supp)



In [None]:
#%% -------- SDV Metrics -------- #

# Define the metrics you want the model to evaluate

# Define distributional metrics required - for sdv_baselines this is set by default
distributional_metrics = [
    "SVCDetection",
    "GMLogLikelihood",
    "CSTest",
    "KSTest",
    "KSTestExtended",
    "ContinuousKLDivergence",
    "DiscreteKLDivergence",
]

gower = False

metrics = distribution_metrics(
    gower_bool=gower,
    distributional_metrics=distributional_metrics,
    data_supp=original_metric_set,
    synthetic_supp=metric_synthetic_supp,
    categorical_columns=original_categorical_columns,
    continuous_columns=original_continuous_columns,
    saving_filepath="",
    pre_proc_method=pre_proc_method,
)


In [None]:
metrics

## Privacy Metric Evaluation

Using SDV privacy metrics, we can gain insight into how well privacy is preserved when utilising DP-SGD methods. SDV's privacy metrics are limited in that they can only be used on columns of the same data type. For example, if we choose age as the sensitive variable, we can build ML-based models to predict a user's age from the other columns, but we are forced to use only columns that are also continuous variables.
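The attacker idea described above can be sketched for a categorical sensitive variable. An attacker sees only the synthetic rows, learns the most common sensitive value for each combination of the other categorical columns, and then guesses the sensitive value of real rows. This is a simplified stand-in rather than the SDV metric. The function `attack_accuracy` and the toy values (placeholder ethnicity codes "A", "B") are hypothetical.

```python
from collections import Counter, defaultdict


def attack_accuracy(synthetic_rows, real_rows, key_fields, sensitive_field):
    """Fraction of real rows whose sensitive value is guessed correctly by a
    majority-vote attacker trained only on the synthetic rows."""
    votes = defaultdict(Counter)
    for row in synthetic_rows:
        key = tuple(row[field] for field in key_fields)
        votes[key][row[sensitive_field]] += 1

    correct = 0
    for row in real_rows:
        key = tuple(row[field] for field in key_fields)
        if key in votes:  # unseen key combinations count as a failed guess
            guess = votes[key].most_common(1)[0][0]
            correct += guess == row[sensitive_field]
    return correct / len(real_rows)


synthetic = [
    {"GENDER": "F", "FIRST_CAREUNIT": "MICU", "ETHNICITY": "A"},
    {"GENDER": "F", "FIRST_CAREUNIT": "MICU", "ETHNICITY": "A"},
    {"GENDER": "M", "FIRST_CAREUNIT": "CCU", "ETHNICITY": "B"},
]
real = [
    {"GENDER": "F", "FIRST_CAREUNIT": "MICU", "ETHNICITY": "A"},
    {"GENDER": "M", "FIRST_CAREUNIT": "CCU", "ETHNICITY": "A"},
    {"GENDER": "M", "FIRST_CAREUNIT": "SICU", "ETHNICITY": "A"},
    {"GENDER": "F", "FIRST_CAREUNIT": "MICU", "ETHNICITY": "B"},
]

accuracy = attack_accuracy(synthetic, real, ["GENDER", "FIRST_CAREUNIT"], "ETHNICITY")
print(accuracy)
```

A lower attack accuracy suggests the synthetic data reveals less about the sensitive column, although a single attacker model gives only a partial picture of privacy.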

In [None]:
# Specify our private variable

private_variable = "ETHNICITY"

privacy_metric = privacy_metrics(
    private_variable=private_variable,
    data_supp=data_supp,
    synthetic_supp=synthetic_supp,
    categorical_columns=original_categorical_columns,
    continuous_columns=original_continuous_columns,
    saving_filepath=None,
    pre_proc_method=pre_proc_method,
)



In [None]:
privacy_metric