
# SUPPORT Notebook

This notebook walks through experiments on the open-access SUPPORT dataset.

Use this notebook if you have limited computational resources or do not have access to MIMIC-III.

In [1]:
#%% -------- Import Libraries -------- #

# Standard imports
import numpy as np
import pandas as pd
import torch

# The VAE module and the adapted Opacus library live in the parent folder
import sys
sys.path.append('../')

# Opacus support for differential privacy
from opacus.utils.uniform_sampler import UniformWithReplacementSampler

# For the SUPPORT dataset
from pycox.datasets import support

# For VAE dataset formatting
from torch.utils.data import TensorDataset, DataLoader

# VAE functions
from VAE import Decoder, Encoder, VAE

# Utility file contains all functions required to run notebook
from utils import set_seed, support_pre_proc, plot_elbo, plot_likelihood_breakdown, plot_variable_distributions, reverse_transformers
from metrics import distribution_metrics, privacy_metrics

## Data Loading & Column Definitions

First we load the SUPPORT dataset from pycox datasets. Then we define which columns in that dataset are continuous and which are categorical.
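
As a quick check on the column definitions in the next cell, the sketch below expands the same two list comprehensions. Python's range excludes its upper bound, so range(7, 15) stops at x14 and range(1, 7) stops at x6. The variable names here are shortened for illustration only.

```python
# Same expressions as in the notebook cell, expanded so the names are visible.
continuous = ['duration'] + [f"x{i}" for i in range(7, 15)]
categorical = ['event'] + [f"x{i}" for i in range(1, 7)]

print(continuous)   # 'duration' followed by x7 ... x14 (9 names)
print(categorical)  # 'event' followed by x1 ... x6 (7 names)

# The two groups should not share any column.
overlap = set(continuous) & set(categorical)
print(overlap)
```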

In [2]:
set_seed(0)

# Load in the support data
data_supp = support.read_df()

# Column Definitions
original_continuous_columns = ['duration'] + [f"x{i}" for i in range(7,15)]
original_categorical_columns = ['event'] + [f"x{i}" for i in range(1,7)]

## Data Pre-Processing

Data can be pre-processed in two ways. The <b>"standard"</b> option applies a standard scaler to the continuous variables. This has a known limitation:

- Tabular data is usually non-Gaussian and SynthVAE implements a Gaussian loss, so this option will perform worse unless the data is already KNOWN to follow a Gaussian distribution.
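
To illustrate this limitation, here is a minimal sketch (not the code used by support_pre_proc) of standard scaling on a made-up, right-skewed sample. Scaling fixes the mean and the spread but leaves the shape of the distribution, measured here by skewness, unchanged. The helper names and the sample values are assumptions for illustration.

```python
import statistics

def standard_scale(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def skewness(values):
    """Third standardised moment; 0 for a symmetric distribution."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / std) ** 3 for v in values])

# Made-up, right-skewed sample (think of a duration-like variable).
raw = [1, 1, 2, 2, 3, 4, 6, 9, 15, 40]
scaled = standard_scale(raw)

# Mean is ~0 and standard deviation is ~1 after scaling ...
print(round(statistics.fmean(scaled), 6), round(statistics.pstdev(scaled), 6))
# ... but the skewness is the same before and after.
print(round(skewness(raw), 3), round(skewness(scaled), 3))
```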

The second option is <b>"GMM"</b>. This fits a variational Gaussian mixture model to scale the data and transform it towards a Gaussian distribution. We allow a maximum of 10 clusters, but the variational method selects the best number of clusters for each continuous variable. This also has known limitations:

- The limit of 10 clusters is arbitrary and may not be enough for certain variables.
- We are fitting a model to transform the data, so we are already approximating before the VAE is trained. This loses fidelity because the distribution will not be transformed perfectly.
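
The idea behind the mixture-based transform can be sketched as follows. This is an illustration under strong assumptions, not the transformer used by support_pre_proc: the two components are hand-picked rather than fitted, and the helper names (components, transform, inverse) are hypothetical. Each value is assigned to its most likely component and then normalised within that component, which is what makes a multi-modal variable look closer to Gaussian.

```python
import math

# Hypothetical, hand-picked components (weight, mean, std) for one bimodal
# variable. A real variational mixture model would learn these from the data.
components = [(0.6, 10.0, 2.0), (0.4, 50.0, 5.0)]

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def transform(x):
    """Return (component index, value normalised within that component)."""
    scores = [w * gaussian_pdf(x, m, s) for w, m, s in components]
    k = max(range(len(components)), key=scores.__getitem__)
    _, mean, std = components[k]
    return k, (x - mean) / std

def inverse(k, z):
    """Undo transform() given the component index and the normalised value."""
    _, mean, std = components[k]
    return z * std + mean

values = [8.0, 12.0, 45.0, 55.0]
encoded = [transform(v) for v in values]
decoded = [inverse(k, z) for k, z in encoded]
print(encoded)
print(decoded)
```

The round trip is exact here only because the components are fixed by hand. With a fitted model, the approximation error described in the second bullet applies.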

SUPPORT is a limited dataset: it has no missingness (which our model currently does NOT handle), no datetime columns and no other data types. Be wary of drawing conclusions from it, given these constraints and the dataset's size. Testing or training new models on this dataset can be useful, but conclusive results should be validated on other datasets.

In [3]:
#%% -------- Data Pre-Processing -------- #

pre_proc_method = "GMM"

x_train, data_supp, reordered_dataframe_columns, continuous_transformers, categorical_transformers, num_categories, num_continuous = support_pre_proc(data_supp=data_supp, pre_proc_method=pre_proc_method)



## Creation & Training of VAE

We can adjust certain parameters of the model, e.g. batch size and latent dimension size. The model also implements early stopping, and its settings (patience and delta) can be adjusted too.
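
The early-stopping rule can be sketched as below. This is an assumption about how patience and delta interact, not the repository's implementation: it treats the logged value as a loss (lower is better) and counts an epoch as an improvement only if it beats the best value so far by more than delta. The function name is hypothetical.

```python
def early_stopping_epoch(losses, patience=5, delta=10):
    """Return the epoch at which training would stop, or None if it never does."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses):
        if best - loss > delta:
            # Improved by more than delta: remember it and reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Made-up loss curve that plateaus after the second epoch.
toy_losses = [100, 80, 75, 74, 73, 72, 71, 70]
stop_at = early_stopping_epoch(toy_losses, patience=3, delta=10)
print(stop_at)
```

In this toy curve the small improvements after epoch 1 never exceed delta, so the counter keeps growing until patience runs out.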

We can also enable differential privacy, which is implemented with DP-SGD through the Opacus library.
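
For readers new to DP-SGD, the following is a minimal, library-free sketch of the core gradient step: clip each per-sample gradient to a maximum norm, sum, add Gaussian noise scaled by the clipping threshold, and average. It is not how Opacus is implemented internally, and the function and argument names are chosen here for illustration.

```python
import math
import random

def dp_sgd_gradient(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-sample gradient, sum, add Gaussian noise, then average."""
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for grad in per_sample_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale the gradient down only if its norm exceeds the threshold.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            total[i] += g * scale
    noise_std = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, noise_std) for t in total]
    return [n / len(per_sample_grads) for n in noisy]

# Toy per-sample gradients with norms 5.0 and 0.5.
grads = [[3.0, 4.0], [0.3, 0.4]]
rng = random.Random(0)
no_noise = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
with_noise = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.0, rng=rng)
print(no_noise)
print(with_noise)
```

With the noise switched off, only the first gradient is rescaled (from norm 5.0 to norm 1.0), which shows the clipping step in isolation.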

In [8]:
#%% -------- Create & Train VAE -------- #

# User defined hyperparams
# General training
batch_size=32
latent_dim=2
hidden_dim=256
n_epochs=50
logging_freq=1 # Log the results to the user every logging_freq epochs
patience=5 # How many epochs to wait for an improvement before
# stopping early
delta=10 # The difference between elbo values that registers an improvement
filepath=None # Where to save the best model


# Privacy params
differential_privacy = True # Whether to train with differential privacy
sample_rate=0.1 # Sampling rate (overwritten below by batch_size / len(dataset))
C = 1e16 # Clipping threshold - per-sample gradient norms above this are clipped
noise_scale=None # Noise multiplier - influences how much noise to add
target_eps=1 # Target epsilon for privacy accountant
target_delta=1e-5 # Target delta for privacy accountant

# Prepare data for interaction with torch VAE
Y = torch.Tensor(x_train)
dataset = TensorDataset(Y)

generator = None
sample_rate = batch_size / len(dataset)
data_loader = DataLoader(
    dataset,
    batch_sampler=UniformWithReplacementSampler(
        num_samples=len(dataset), sample_rate=sample_rate, generator=generator
    ),
    pin_memory=True,
    generator=generator,
)

# Create VAE
encoder = Encoder(x_train.shape[1], latent_dim, hidden_dim=hidden_dim)
decoder = Decoder(
    latent_dim, num_continuous, num_categories=num_categories
)

vae = VAE(encoder, decoder)

if not differential_privacy:
    log_elbo, log_reconstruction, log_divergence, log_categorical, log_numerical = vae.train(data_loader, n_epochs=n_epochs)

else:
    log_elbo, log_reconstruction, log_divergence, log_categorical, log_numerical = vae.diff_priv_train(
        data_loader,
        n_epochs=n_epochs,
        C=C,
        target_eps=target_eps,
        target_delta=target_delta,
        sample_rate=sample_rate,
        noise_scale=noise_scale
    )
    # NOTE: the label says "delta", but the second value printed in the run
    # output (18.0) cannot be a delta. If get_privacy_spent forwards to the
    # Opacus privacy engine, it is likely the best Renyi order (alpha) instead.
    print(f"(epsilon, delta): {vae.get_privacy_spent(target_delta)}")



Encoder: gpu specified, cuda:0 used
Decoder: gpu specified, cuda:0 used


	Epoch:  0. Elbo:    85724.07. Reconstruction Loss:    85513.07. KL Divergence:      211.00. Categorical Loss:    -7883.67. Numerical Loss:   -77629.40
	Epoch:  1. Elbo:    86815.58. Reconstruction Loss:    86583.17. KL Divergence:      232.41. Categorical Loss:    -7876.63. Numerical Loss:   -78706.54
	Epoch:  2. Elbo:    85620.12. Reconstruction Loss:    85338.90. KL Divergence:      281.23. Categorical Loss:    -7896.74. Numerical Loss:   -77442.16
	Epoch:  3. Elbo:    86206.56. Reconstruction Loss:    85756.85. KL Divergence:      449.71. Categorical Loss:    -7916.10. Numerical Loss:   -77840.75
	Epoch:  4. Elbo:    85372.96. Reconstruction Loss:    84407.28. KL Divergence:      965.67. Categorical Loss:    -7892.44. Numerical Loss:   -76514.84
	Epoch:  5. Elbo:    87059.09. Reconstruction Loss:    85820.44. KL Divergence:     1238.65. Categorical Loss:    -7874.36. Numerical Loss:   -77946.08
	Epoch:  6. Elbo:    89224.93. Reconstruction Loss:    87528.26. KL Divergence:     1696.68. Categorical Loss:    -7885.56. Numerical Loss:   -79642.70
	Epoch:  7. Elbo:    89902.90. Reconstruction Loss:    87704.47. KL Divergence:     2198.43. Categorical Loss:    -7853.28. Numerical Loss:   -79851.19
	Epoch:  8. Elbo:    88988.45. Reconstruction Loss:    86623.15. KL Divergence:     2365.30. Categorical Loss:    -7832.17. Numerical Loss:   -78790.98
	Epoch:  9. Elbo:    89600.27. Reconstruction Loss:    87228.53. KL Divergence:     2371.74. Categorical Loss:    -7810.04. Numerical Loss:   -79418.49
	Epoch: 10. Elbo:    89288.02. Reconstruction Loss:    86878.84. KL Divergence:     2409.18. Categorical Loss:    -7804.59. Numerical Loss:   -79074.25
	Epoch: 11. Elbo:    89330.88. Reconstruction Loss:    86424.33. KL Divergence:     2906.55. Categorical Loss:    -7822.80. Numerical Loss:   -78601.53
	Epoch: 12. Elbo:    90695.30. Reconstruction Loss:    87203.22. KL Divergence:     3492.08. Categorical Loss:    -7848.05. Numerical Loss:   -79355.17
	Epoch: 13. Elbo:    91739.65. Reconstruction Loss:    86322.17. KL Divergence:     5417.48. Categorical Loss:    -7858.47. Numerical Loss:   -78463.70
	Epoch: 14. Elbo:    90681.59. Reconstruction Loss:    86736.12. KL Divergence:     3945.47. Categorical Loss:    -7863.74. Numerical Loss:   -78872.39
	Epoch: 15. Elbo:    89844.85. Reconstruction Loss:    85661.78. KL Divergence:     4183.07. Categorical Loss:    -7870.70. Numerical Loss:   -77791.09
	Epoch: 16. Elbo:    91469.09. Reconstruction Loss:    86201.69. KL Divergence:     5267.40. Categorical Loss:    -7889.14. Numerical Loss:   -78312.56
	Epoch: 17. Elbo:    89708.37. Reconstruction Loss:    84516.59. KL Divergence:     5191.78. Categorical Loss:    -7895.22. Numerical Loss:   -76621.37
	Epoch: 18. Elbo:    91762.05. Reconstruction Loss:    86621.11. KL Divergence:     5140.94. Categorical Loss:    -7890.84. Numerical Loss:   -78730.26
	Epoch: 19. Elbo:    93487.25. Reconstruction Loss:    88151.89. KL Divergence:     5335.36. Categorical Loss:    -7886.09. Numerical Loss:   -80265.80
	Epoch: 20. Elbo:    93210.17. Reconstruction Loss:    88290.59. KL Divergence:     4919.57. Categorical Loss:    -7872.18. Numerical Loss:   -80418.42
	Epoch: 21. Elbo:    92379.98. Reconstruction Loss:    87635.46. KL Divergence:     4744.52. Categorical Loss:    -7839.04. Numerical Loss:   -79796.42
	Epoch: 22. Elbo:    90989.83. Reconstruction Loss:    85019.93. KL Divergence:     5969.90. Categorical Loss:    -7860.57. Numerical Loss:   -77159.36
	Epoch: 23. Elbo:    93554.12. Reconstruction Loss:    85953.78. KL Divergence:     7600.34. Categorical Loss:    -7886.46. Numerical Loss:   -78067.33
	Epoch: 24. Elbo:    95048.27. Reconstruction Loss:    86177.81. KL Divergence:     8870.46. Categorical Loss:    -7884.28. Numerical Loss:   -78293.52
	Epoch: 25. Elbo:    96630.99. Reconstruction Loss:    86882.05. KL Divergence:     9748.94. Categorical Loss:    -7903.85. Numerical Loss:   -78978.20
	Epoch: 26. Elbo:    96565.89. Reconstruction Loss:    85853.92. KL Divergence:    10711.97. Categorical Loss:    -7917.76. Numerical Loss:   -77936.16
	Epoch: 27. Elbo:    99025.04. Reconstruction Loss:    85539.54. KL Divergence:    13485.50. Categorical Loss:    -7940.49. Numerical Loss:   -77599.05
	Epoch: 28. Elbo:    99087.67. Reconstruction Loss:    86067.79. KL Divergence:    13019.88. Categorical Loss:    -7918.98. Numerical Loss:   -78148.81
	Epoch: 29. Elbo:    96403.03. Reconstruction Loss:    83838.81. KL Divergence:    12564.22. Categorical Loss:    -7920.79. Numerical Loss:   -75918.02
	Epoch: 30. Elbo:   102230.41. Reconstruction Loss:    86258.38. KL Divergence:    15972.03. Categorical Loss:    -7948.26. Numerical Loss:   -78310.13
	Epoch: 31. Elbo:   105943.85. Reconstruction Loss:    86833.13. KL Divergence:    19110.72. Categorical Loss:    -7975.47. Numerical Loss:   -78857.66
	Epoch: 32. Elbo:   108519.30. Reconstruction Loss:    87006.02. KL Divergence:    21513.28. Categorical Loss:    -7966.84. Numerical Loss:   -79039.18
	Epoch: 33. Elbo:   109417.10. Reconstruction Loss:    87378.44. KL Divergence:    22038.66. Categorical Loss:    -7949.24. Numerical Loss:   -79429.20
	Epoch: 34. Elbo:   109008.13. Reconstruction Loss:    87933.75. KL Divergence:    21074.38. Categorical Loss:    -7957.61. Numerical Loss:   -79976.14
	Epoch: 35. Elbo:   108148.47. Reconstruction Loss:    87565.31. KL Divergence:    20583.16. Categorical Loss:    -7970.29. Numerical Loss:   -79595.02
	Epoch: 36. Elbo:   106463.46. Reconstruction Loss:    86627.59. KL Divergence:    19835.87. Categorical Loss:    -7960.65. Numerical Loss:   -78666.94
	Epoch: 37. Elbo:   110657.08. Reconstruction Loss:    89300.59. KL Divergence:    21356.49. Categorical Loss:    -7966.75. Numerical Loss:   -81333.84
	Epoch: 38. Elbo:   107234.63. Reconstruction Loss:    84748.58. KL Divergence:    22486.05. Categorical Loss:    -7962.37. Numerical Loss:   -76786.22
	Epoch: 39. Elbo:   112319.66. Reconstruction Loss:    87778.77. KL Divergence:    24540.89. Categorical Loss:    -7997.07. Numerical Loss:   -79781.70
	Epoch: 40. Elbo:   107543.08. Reconstruction Loss:    85593.88. KL Divergence:    21949.20. Categorical Loss:    -7980.04. Numerical Loss:   -77613.84
	Epoch: 41. Elbo:   112578.01. Reconstruction Loss:    84980.27. KL Divergence:    27597.74. Categorical Loss:    -8018.63. Numerical Loss:   -76961.63
	Epoch: 42. Elbo:   114319.19. Reconstruction Loss:    84730.64. KL Divergence:    29588.55. Categorical Loss:    -7971.37. Numerical Loss:   -76759.27
	Epoch: 43. Elbo:   111593.66. Reconstruction Loss:    84964.64. KL Divergence:    26629.03. Categorical Loss:    -7961.83. Numerical Loss:   -77002.80
	Epoch: 44. Elbo:   110490.46. Reconstruction Loss:    84179.62. KL Divergence:    26310.84. Categorical Loss:    -7957.32. Numerical Loss:   -76222.30
	Epoch: 45. Elbo:   108038.38. Reconstruction Loss:    83099.59. KL Divergence:    24938.79. Categorical Loss:    -7979.68. Numerical Loss:   -75119.91
	Epoch: 46. Elbo:   107844.08. Reconstruction Loss:    83320.78. KL Divergence:    24523.30. Categorical Loss:    -8020.73. Numerical Loss:   -75300.05
	Epoch: 47. Elbo:   111937.51. Reconstruction Loss:    86290.34. KL Divergence:    25647.17. Categorical Loss:    -8023.97. Numerical Loss:   -78266.37
	Epoch: 48. Elbo:   110564.21. Reconstruction Loss:    83760.34. KL Divergence:    26803.87. Categorical Loss:    -7997.39. Numerical Loss:   -75762.95
	Epoch: 49. Elbo:   115611.67. Reconstruction Loss:    85104.36. KL Divergence:    30507.31. Categorical Loss:    -8005.94. Numerical Loss:   -77098.42
(epsilon, delta): (0.9954940576041075, 18.0)





## Plotting ELBO Functionality

Here we can plot and save the ELBO graph for the training run.
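
In the training log above, the reported Elbo appears to be the sum of the reconstruction loss and the KL divergence. The check below uses the logged values for the first three epochs. Because the log rounds to two decimals, a small tolerance is needed.

```python
# (elbo, reconstruction, kl) copied from the training log for epochs 0 to 2.
logged = [
    (85724.07, 85513.07, 211.00),
    (86815.58, 86583.17, 232.41),
    (85620.12, 85338.90, 281.23),
]

# Difference between the logged Elbo and reconstruction + KL for each epoch.
residuals = [abs(elbo - (recon + kl)) for elbo, recon, kl in logged]
for epoch, residual in enumerate(residuals):
    print(epoch, round(residual, 2))
```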

In [None]:
#%% -------- Plot Loss Features ELBO Breakdown -------- #

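# Use the number of epochs actually logged, which can be fewer than n_epochs
# if early stopping ended the run
elbo_fig = plot_elbo(
    n_epochs=len(log_elbo), log_elbo=log_elbo, log_reconstruction=log_reconstruction,
    log_divergence=log_divergence, saving_filepath=""
)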

## Plotting Reconstruction Breakdown

Here we can plot the breakdown of the reconstruction loss, i.e. visualise how the categorical and numerical losses change over training.
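
Judging by the training log above, the reconstruction loss appears to equal the negated sum of the categorical and numerical terms. The sketch below checks that relationship on the first three logged epochs and reports how much of the reconstruction loss comes from the numerical term.

```python
# (reconstruction, categorical, numerical) copied from the log for epochs 0 to 2.
logged = [
    (85513.07, -7883.67, -77629.40),
    (86583.17, -7876.63, -78706.54),
    (85338.90, -7896.74, -77442.16),
]

# Difference between the logged reconstruction loss and -(categorical + numerical).
residuals = [abs(recon + (cat + num)) for recon, cat, num in logged]
# Fraction of the reconstruction loss contributed by the numerical term.
numerical_share = [-num / recon for recon, cat, num in logged]

for epoch, share in enumerate(numerical_share):
    print(epoch, round(residuals[epoch], 2), round(share, 3))
```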

In [None]:
#%% -------- Plot Loss Features Reconstruction Breakdown -------- #

# Use the number of epochs actually logged (see the ELBO plot above)
likelihood_fig = plot_likelihood_breakdown(
    n_epochs=len(log_categorical), log_categorical=log_categorical, log_numerical=log_numerical,
    saving_filepath="", pre_proc_method=pre_proc_method
)

## Synthetic Data Generation

Here we create synthetic data for metric testing and for visualising how well each variable is reconstructed. We sample from the trained generative model and then reverse the pre-processing using the transformers fitted earlier.
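
Conceptually, generation means drawing latent vectors from the prior, decoding them, and undoing the scaling. The sketch below shows those three steps with a toy linear decoder and a standard-scaling inverse. Everything in it (toy_decoder, its weights, the means and stds) is hypothetical and stands in for vae.generate and reverse_transformers, whose internals are not shown in this notebook.

```python
import random

def toy_decoder(z):
    """Hypothetical stand-in for a trained decoder: 2-d latent -> 3 scaled features."""
    weights = [[0.5, -0.2], [0.1, 0.9], [-0.7, 0.3]]
    return [sum(w * zi for w, zi in zip(row, z)) for row in weights]

def generate(n_rows, latent_dim, rng):
    """Sample latent vectors from a standard normal prior and decode them."""
    rows = []
    for _ in range(n_rows):
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        rows.append(toy_decoder(z))
    return rows

def inverse_standard_scale(rows, means, stds):
    """Map scaled features back to their original units."""
    return [[v * s + m for v, m, s in zip(row, means, stds)] for row in rows]

rng = random.Random(0)
means, stds = [60.0, 100.0, 37.0], [15.0, 20.0, 1.0]
scaled_rows = generate(n_rows=5, latent_dim=2, rng=rng)
synthetic_rows = inverse_standard_scale(scaled_rows, means, stds)
print(synthetic_rows[0])
```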

In [9]:
#%% -------- Synthetic Data Generation -------- #

synthetic_sample = vae.generate(data_supp.shape[0])

# .cpu() is a no-op when the tensor is already on the CPU, so no device check is needed
synthetic_sample = pd.DataFrame(synthetic_sample.detach().cpu().numpy(), columns=reordered_dataframe_columns)

# Reverse the transformations

synthetic_supp = reverse_transformers(synthetic_set=synthetic_sample, data_supp_columns=data_supp.columns, 
                                      cont_transformers=continuous_transformers, cat_transformers=categorical_transformers,
                                      pre_proc_method=pre_proc_method
                                     )

## Synthetic Variable Visualisation

Here we visualise the generated synthetic variables and compare them with the original dataset.

In [None]:
#%% -------- Plot Histograms For All The Variable Distributions -------- #

plot_variable_distributions(
    categorical_columns=original_categorical_columns, continuous_columns=original_continuous_columns,
    data_supp=data_supp, synthetic_supp=synthetic_supp,saving_filepath="",
    pre_proc_method=pre_proc_method
)

## Metric Evaluation

We use the SDV evaluation framework. Supply the metrics you wish to compute in the user_metrics list, following the SDV guidance. A good starting point is https://sdv.dev/SDV/user_guides/evaluation/single_table_metrics.html

Note that not all of these metrics will work; some are hit and miss. We predominantly rely on the continuous and discrete KL divergence measures. You can also input <b>"gower"</b> to calculate the Gower distance using the gower library.
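
To make the two measures concrete, here is a small, library-free sketch of a discrete KL divergence between category frequencies and of the Gower distance between two mixed-type records. It is not the SDV or gower implementation, which may normalise or aggregate differently. The function names and the example records are made up for illustration.

```python
import math
from collections import Counter

def discrete_kl(real, synthetic, eps=1e-9):
    """KL(real || synthetic) over the observed category frequencies."""
    real_counts, synth_counts = Counter(real), Counter(synthetic)
    kl = 0.0
    for category in set(real) | set(synthetic):
        p = real_counts[category] / len(real)
        q = max(synth_counts[category] / len(synthetic), eps)  # avoid log(0)
        if p > 0:
            kl += p * math.log(p / q)
    return kl

def gower_distance(a, b, numeric_ranges):
    """Average per-feature distance: range-scaled for numeric, 0/1 for categorical."""
    parts = []
    for key, value in a.items():
        if key in numeric_ranges:
            parts.append(abs(value - b[key]) / numeric_ranges[key])
        else:
            parts.append(0.0 if value == b[key] else 1.0)
    return sum(parts) / len(parts)

real_event = ["a"] * 50 + ["b"] * 50
synth_event = ["a"] * 75 + ["b"] * 25
kl = discrete_kl(real_event, synth_event)

record_a = {"age": 30, "sex": "m"}
record_b = {"age": 50, "sex": "f"}
distance = gower_distance(record_a, record_b, numeric_ranges={"age": 40})
print(round(kl, 4), distance)
```

A KL divergence of 0 means the two frequency tables match exactly, and a Gower distance of 0 means the two records are identical on every feature.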

In [6]:
#%% -------- SDV Metrics -------- #

# Define the metrics you want the model to evaluate

gower=False

metrics = distribution_metrics(
    gower=gower, data_supp=data_supp, synthetic_supp=synthetic_supp,
    categorical_columns=original_categorical_columns, continuous_columns=original_continuous_columns,
    saving_filepath=None, pre_proc_method=pre_proc_method
)



## Privacy Metric Evaluation

Using SDV's privacy metrics, we can gain insight into how well privacy is preserved when using DP-SGD. These metrics are limited in that they can only be used across columns of the same data type. For example, if we choose age as the sensitive variable, we can build ML-based models that predict a user's age from the other columns, but we are forced to use only columns that are also continuous.
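
The attack idea behind such a metric can be sketched as follows: fit a model on the synthetic data that predicts the sensitive column from another continuous column, then see how well it predicts the real sensitive values. This is a simplified illustration with a one-variable least-squares fit and made-up numbers, not SDV's implementation, and SDV may orient or normalise its score differently.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up data: one continuous key column and one continuous sensitive column.
synthetic_key = [1.0, 2.0, 3.0, 4.0]
synthetic_sensitive = [3.0, 5.0, 7.0, 9.0]
real_key = [1.5, 2.5, 3.5]
real_sensitive = [4.5, 5.0, 9.0]

# The "attacker" only sees the synthetic data when fitting the model.
slope, intercept = fit_line(synthetic_key, synthetic_sensitive)
predictions = [slope * x + intercept for x in real_key]

# Mean absolute error on the real sensitive values: the smaller the error,
# the more the synthetic data reveals about the sensitive column.
attack_error = sum(abs(p - y) for p, y in zip(predictions, real_sensitive)) / len(real_key)
print(slope, intercept, round(attack_error, 4))
```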

In [10]:
# Specify our private variable

private_variable = 'x14'

privacy_metric = privacy_metrics(private_variable=private_variable, data_supp=data_supp,
                                synthetic_supp=synthetic_supp, categorical_columns=original_categorical_columns,
                                continuous_columns=original_continuous_columns, saving_filepath=None, pre_proc_method=pre_proc_method)

