# Validation of Synthesised Patient Trajectories

This notebook contains the code used to generate a small example synthetic dataset with three models: TimeGAN, DeepEcho, and ehrMGAN.

Before running the models, download TimeGAN and ehrMGAN from the following links and save them in the same directory as this file:

https://github.com/jsyoon0823/TimeGAN

https://github.com/jli0117/ehrMGAN

Note that TimeGAN and ehrMGAN need to be run with Python 3.7.
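
The version requirement above can be checked before any model is imported. This is a minimal sketch; the helper name `is_supported_python` is hypothetical and not part of either repository.

```python
import sys

def is_supported_python(version_info=sys.version_info):
    """Return True when the interpreter is Python 3.7 (hypothetical helper)."""
    return tuple(version_info[:2]) == (3, 7)

# warn rather than fail, so the DeepEcho section can still be run on newer versions
if not is_supported_python():
    print('Warning: TimeGAN and ehrMGAN expect Python 3.7, found',
          '.'.join(str(part) for part in sys.version_info[:3]))
```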

In [2]:
import numpy as np
import pandas as pd
import os

In [3]:
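# create example data: 1000 identical patients, each a sequence of
# 4 time steps with 2 features (a continuous value and a binary flag)
temp = []
for x in range(1000):
    arr = np.array([[1, 0], [2, 0], [3, 0], [4, 1]])
    temp.append(arr)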
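
As a quick sanity check, the list can be stacked into a single array to confirm its dimensions. This is a minimal sketch; the name `stacked` is introduced here for illustration and does not appear elsewhere in the notebook.

```python
import numpy as np

# rebuild the example data: 1000 identical patients, 4 time steps, 2 features
temp = [np.array([[1, 0], [2, 0], [3, 0], [4, 1]]) for _ in range(1000)]

# stack into one array with axes (patients, time steps, features)
stacked = np.stack(temp)
print(stacked.shape)
```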

# TimeGAN implementation

Ensure the required version of TensorFlow is installed:

#!pip install --upgrade tensorflow==1.15.0

In [34]:
pwd

'/Users/lilyfelstead/Desktop/genreportdata/TimeGAN'

In [14]:
import os
os.chdir('TimeGAN')

In [7]:
pwd

'/Users/lilyfelstead/Desktop/genreportdata/TimeGAN'

In [8]:
## Necessary packages
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import warnings
warnings.filterwarnings("ignore")

# 1. TimeGAN model
from timegan import timegan
# 2. Data loading
from data_loading import real_data_loading, sine_data_generation
# 3. Metrics
from metrics.discriminative_metrics import discriminative_score_metrics
from metrics.predictive_metrics import predictive_score_metrics
from metrics.visualization_metrics import visualization

In [9]:
## Network parameters
parameters = dict()

parameters['module'] = 'gru' 
parameters['hidden_dim'] = 10
parameters['num_layer'] = 3
parameters['iterations'] = 200
parameters['batch_size'] = 50

In [10]:
generated_data = timegan(temp, parameters)   
print('Finish Synthetic Data Generation')





Instructions for updating:
This class is equivalent as tf.keras.layers.GRUCell, and will be replaced by that in Tensorflow 2.0.
Instructions for updating:
This class is equivalent as tf.keras.layers.StackedRNNCells, and will be replaced by that in Tensorflow 2.0.
Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
Instructions for updating:
Please use `layer.add_weight` method instead.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons


2023-04-28 15:38:20.050480: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2023-04-28 15:38:20.082225: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f9a50159af0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-04-28 15:38:20.082246: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version


Start Embedding Network Training
step: 0/200, e_loss: 0.4388
Finish Embedding Network Training
Start Training with Supervised Loss Only
step: 0/200, s_loss: 0.3495
Finish Training with Supervised Loss Only
Start Joint Training
step: 0/200, d_loss: 2.0915, g_loss_u: 0.6866, g_loss_s: 0.0963, g_loss_v: 0.058, e_loss_t0: 0.0664
Finish Joint Training
Finish Synthetic Data Generation


In [11]:
# look at first synthetic patient 
generated_data[0]

array([[1.01780093e+00, 2.13903189e-03],
       [2.01272541e+00, 2.17914581e-03],
       [3.01145506e+00, 3.80960107e-03],
       [3.98228419e+00, 9.97177720e-01]])
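
One simple way to check a synthetic patient against the example data is to compute the element-wise absolute error and to threshold the second column back to a binary flag. The sketch below copies the values printed above so that it runs without TimeGAN. The 0.5 threshold and the variable names are assumptions made for illustration, not part of the TimeGAN code.

```python
import numpy as np

# the real example patient and the first synthetic patient printed above
real_patient = np.array([[1, 0], [2, 0], [3, 0], [4, 1]], dtype=float)
synthetic_patient = np.array([
    [1.01780093e+00, 2.13903189e-03],
    [2.01272541e+00, 2.17914581e-03],
    [3.01145506e+00, 3.80960107e-03],
    [3.98228419e+00, 9.97177720e-01],
])

# element-wise absolute error and its maximum over all time steps and features
abs_error = np.abs(synthetic_patient - real_patient)
max_error = abs_error.max()

# assumption: treat values of at least 0.5 in the second column as the flag being set
binary_flag = (synthetic_patient[:, 1] >= 0.5).astype(int)

print(max_error)
print(binary_flag)
```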

# DeepEcho implementation

In [3]:
!pip install deepecho

In [4]:
# reformat example data
example = []
for i in range(1000):
    example.append([i,0,1,0])
    example.append([i,1,2,0])
    example.append([i,2,3,0])
    example.append([i,3,4,1])
exampledf = pd.DataFrame(example, columns = ['id','time','c1','d1'])
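
The same long-format table can be derived directly from the list of arrays used for TimeGAN, which avoids writing the rows out by hand. This is a sketch only; `to_long_format` is a hypothetical helper and is not part of DeepEcho.

```python
import numpy as np
import pandas as pd

def to_long_format(sequences, feature_names):
    """Convert a list of (time steps, features) arrays into one long DataFrame.

    Hypothetical helper: adds an 'id' column (position in the list) and a
    'time' column (row index within each sequence).
    """
    frames = []
    for entity_id, seq in enumerate(sequences):
        frame = pd.DataFrame(seq, columns=feature_names)
        frame.insert(0, 'time', np.arange(len(seq)))
        frame.insert(0, 'id', entity_id)
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)

# same example data as in the TimeGAN section
temp = [np.array([[1, 0], [2, 0], [3, 0], [4, 1]]) for _ in range(1000)]
exampledf = to_long_format(temp, ['c1', 'd1'])
print(exampledf.head())
```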

In [5]:
data_types = {
    'id': 'categorical',
    'c1': 'continuous',
    'd1': 'categorical',
}
sequence_index = 'time'
context_columns = []
entity_columns = ['id']

In [8]:
from deepecho import PARModel

model = PARModel(epochs=100, cuda=False)
model.fit(
    data=exampledf,
    entity_columns=entity_columns,
    context_columns=context_columns,
    data_types=data_types,
    sequence_index=sequence_index
)

Epoch 100 | Loss -0.0020709929522126913: 100%|█| 100/100 [01:01<00:00,  1.62it/s


In [12]:
# look at generated data
model.sample(5)

100%|█████████████████████████████████████████████| 5/5 [00:00<00:00, 99.15it/s]


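|   | id | c1 | d1 |
|---|----|----|----|
| 0 | 0 | 1.194418 | 0 |
| 1 | 0 | 1.675329 | 0 |
| 2 | 0 | 2.786574 | 0 |
| 3 | 0 | 3.894432 | 0 |
| 4 | 1 | 1.454316 | 0 |
| 5 | 1 | 2.040868 | 0 |
| 6 | 1 | 3.058112 | 1 |
| 7 | 1 | 3.162493 | 0 |
| 8 | 2 | 0.89841 | 0 |
| 9 | 2 | 2.047398 | 0 |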
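
To compare the DeepEcho samples with the TimeGAN output, the long table can be split back into one array per patient. The sketch below uses the first eight rows shown above as stand-in data so that it runs without a fitted model. The names `sampled` and `sequences` are illustrative.

```python
import pandas as pd

# first eight sampled rows from the output above (patients 0 and 1)
sampled = pd.DataFrame({
    'id': [0, 0, 0, 0, 1, 1, 1, 1],
    'c1': [1.194418, 1.675329, 2.786574, 3.894432,
           1.454316, 2.040868, 3.058112, 3.162493],
    'd1': [0, 0, 0, 0, 0, 0, 1, 0],
})

# one (time steps, features) array per patient, in order of id
sequences = [
    group[['c1', 'd1']].to_numpy()
    for _, group in sampled.groupby('id', sort=True)
]
print(len(sequences), sequences[0].shape)
```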


# EHR-M-GAN implementation

In [42]:
# needs Python 3.7
#!pip3 install pickle5
#!pip uninstall tensorflow
#!pip install tensorflow==1.13.2

In [35]:
# navigate to ehrMGAN directory
os.chdir('..')
os.chdir('ehrMGAN')

In [34]:
%run -i 'main_train1.py' --dataset mimic --num_pre_epochs 50 --num_epochs 50 --epoch_ckpt_freq 50

start pretraining
pretraining epoch 0
pretraining epoch 1
pretraining epoch 2
pretraining epoch 3
pretraining epoch 4
pretraining epoch 5
pretraining epoch 6
pretraining epoch 7
pretraining epoch 8
pretraining epoch 9
pretraining epoch 10
pretraining epoch 11
pretraining epoch 12
pretraining epoch 13
pretraining epoch 14
pretraining epoch 15
pretraining epoch 16
pretraining epoch 17
pretraining epoch 18
pretraining epoch 19
pretraining epoch 20
pretraining epoch 21
pretraining epoch 22
pretraining epoch 23
pretraining epoch 24
pretraining epoch 25
pretraining epoch 26
pretraining epoch 27
pretraining epoch 28
pretraining epoch 29
pretraining epoch 30
pretraining epoch 31
pretraining epoch 32
pretraining epoch 33
pretraining epoch 34
pretraining epoch 35
pretraining epoch 36
pretraining epoch 37
pretraining epoch 38
pretraining epoch 39
pretraining epoch 40
pretraining epoch 41
pretraining epoch 42
pretraining epoch 43
pretraining epoch 44
pretraining epoch 45
pretraining epoch 46
pretr

FileNotFoundError: [Errno 2] No such file or directory: 'data/real/mimic/norm_stats.npz'

In [37]:
# read in output
gen_data =  np.load('data/fake/epoch4/gen_data.npz')
print(gen_data['c_gen_data'][1])
print(gen_data['d_gen_data'][1])

[[0.261321  ]
 [0.2884156 ]
 [0.67058027]
 [0.99934363]]
[[0.]
 [0.]
 [0.]
 [1.]]


In [38]:
# ehrMGAN returns normalised data, so the continuous
# output is rescaled below to the range of the original data
e = gen_data['c_gen_data'][1]

In [39]:
e

array([[0.261321  ],
       [0.2884156 ],
       [0.67058027],
       [0.99934363]], dtype=float32)

In [40]:
# multiply by 4, the maximum of the continuous feature in the example data
scaled = []
for x in e:
    scaled.append(x * 4)

In [41]:
print(scaled)

[array([1.045284], dtype=float32), array([1.1536624], dtype=float32), array([2.682321], dtype=float32), array([3.9973745], dtype=float32)]
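
The same rescaling can be written as a single vectorised operation. The sketch below copies the values printed above and adds a hypothetical helper, `denormalise`, that inverts min-max scaling. Setting the minimum to 0 and the maximum to 4 reproduces the multiplication by 4 used in this notebook; whether ehrMGAN normalises with exactly these statistics is an assumption.

```python
import numpy as np

def denormalise(values, data_min, data_max):
    """Invert min-max scaling: map values in [0, 1] back to [data_min, data_max].

    Hypothetical helper, not part of the ehrMGAN code.
    """
    return values * (data_max - data_min) + data_min

# generated continuous values for one patient, copied from the output above
e = np.array([[0.261321], [0.2884156], [0.67058027], [0.99934363]],
             dtype=np.float32)

# assumption: minimum 0 and maximum 4, which is equivalent to multiplying by 4
scaled = denormalise(e, 0.0, 4.0)
print(scaled.ravel())
```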
