
## 2.2 Discriminative Model
For a quantitative measure of similarity, we train a post-hoc time-series classification model (a GRU-based network) to distinguish between sequences from the original and generated datasets. First, each original sequence is labeled **'1'**, and each generated sequence is labeled **'0'**. Then, an off-the-shelf (RNN) classifier is trained to distinguish between the two classes as a standard supervised task. We then report the classification error on the held-out test set, which gives a quantitative assessment of fidelity: an error close to 0.5 (chance level) means the classifier cannot tell generated sequences from original ones, while an error close to 0 means the generated data are easy to detect.

In [4]:
import tensorflow as tf
from tensorflow.keras.models import model_from_json

import numpy as np
import pandas as pd

from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split

For help reading the evaluation output, see the scikit-learn documentation on the [classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report).

# Functions for Loading Data

Lulu's Notes:

* separated data loading into functions
* returned `hidden_dim` 
* replaced `mix_divide()` with `train_val_test_split()`

In [13]:
def reshape_removena_stack(ori_data):
  # Split the flat (visits, features) table into one block of 10 visits per patient
  ori_data = np.split(ori_data, np.shape(ori_data)[0] // 10, axis=0)
  ori_data_new = []
  for array in ori_data:
    if not np.isnan(array).any(): # keep only patients with no missing values
      ori_data_new.append(array)
  return ori_data_new

# Earlier version of the DoppelGANger loader, superseded by load_DoppelGANger() further down:
# def load_DoppelGANger():
#   ori_data = np.load('synthetic data/doppelGANger/ori_features_prism.npy') #Aisha: Change the path of loaded data for consistency
#   gen_data = np.load('synthetic data/doppelGANger/features_600.npy')
#   return ori_data, gen_data

# Needed by the tGAN section, which calls load_tGAN()
def load_tGAN():
  ori_data = pd.read_csv('synthetic data/TGAN/cat_time_10visits_all_noid.csv').values # shape (12390 patients visits, 10 features)
  ori_data = reshape_removena_stack(ori_data) # shape (841 patients, 10 visits, 10 features)
  gen_data = np.load('synthetic data/TGAN/gen_cat_time_10visits_wl_5000it.npy')[:np.shape(ori_data)[0]] # shape (841 patients, 10 visits, 10 features)
  return ori_data, gen_data

def load_DoppelGANger():
  ori_data = pd.read_csv('/content/cat_time_5abovevisits_all.csv') # max timeseries length = 130
  # gen_data = pd.read_csv('/content/gen_doptf2_cat_5abovevisits_e100_lstm.csv') # max timeseries length = 107
  gen_data = pd.read_csv('/content/gen_doptf2_cat_5abovevisits_e200_lstm.csv') # max timeseries length = 111
  ori_data = ori_data.drop(columns=['diar_No', 'diar_Yes', 'head_No', 'head_Yes'])
  gen_data = gen_data.drop(columns=['diar_No', 'diar_Yes', 'head_No', 'head_Yes'])
  ori_data = cat_df_to_3d_array(ori_data, 130) # array (1347, 130, 6) after dropping the four columns above and 'id'
  gen_data = cat_df_to_3d_array(gen_data, 130) # array (1347, 130, 6)
  return np.nan_to_num(ori_data), np.nan_to_num(gen_data)

def cat_df_to_3d_array(data, max_length):
  data = data.fillna(0) # fillna returns a new DataFrame, so assign it to actually replace the NaNs
  # max_length = data['id'].value_counts().max() # if you want to get it from the data, but ori/gen may have different max lengths
  lst = []
  for i in data.id.unique():
    timeseries = data[data['id']==i].drop(columns='id').to_numpy()
    length = np.shape(timeseries)[0]
    timeseries = np.pad(timeseries, pad_width=((0,max_length-length), (0,0)), mode='constant') # fill remaining rows with zeros
    lst.append(timeseries)
  array = np.stack(lst)
  return array
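
To make the padding step concrete, here is a minimal sketch on a toy long-format table. The helper name `pad_to_3d` and the toy columns `a` and `b` are made up for illustration. The sketch assumes, like `cat_df_to_3d_array`, that the table has an `id` column and that no sequence is longer than `max_length`.

```python
import numpy as np
import pandas as pd

def pad_to_3d(df, max_length, id_col="id"):
    """Turn a long-format table into a zero-padded (n_ids, max_length, n_features) array."""
    sequences = []
    for _, group in df.groupby(id_col, sort=False):
        values = group.drop(columns=id_col).to_numpy(dtype=float)
        pad_rows = max_length - values.shape[0]  # assumed non-negative
        # Pad only at the end of the time axis; leave the feature axis untouched
        sequences.append(np.pad(values, ((0, pad_rows), (0, 0)), mode="constant"))
    return np.stack(sequences)

# Toy data: id 1 has three visits, id 2 has two visits
toy = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "a": [0.1, 0.2, 0.3, 0.4, 0.5],
    "b": [1, 0, 1, 0, 1],
})
toy_array = pad_to_3d(toy, max_length=4)
print(toy_array.shape)  # (2, 4, 2)
```

Because the padding value is zero, the classifier sees padded rows as all-zero visits. If the original and generated sequences have different length distributions, the amount of padding is itself a signal the discriminator could pick up.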

In [6]:
def MinMaxScaler(data): # This is a normalisation method copied from TGANs code # Lulu: not used
  """Min Max normalizer.
  
  Args:
    - data: original data
  
  Returns:
    - norm_data: normalized data
  """
  numerator = data - np.min(data, 0)
  denominator = np.max(data, 0) - np.min(data, 0)
  norm_data = numerator / (denominator + 1e-7)
  return norm_data


def InputSize(ori_data): # Set the input size to the model
    no, seq_len, dim = np.asarray(ori_data).shape 
    hidden_dim = int(dim/2)
    input_dim = [None,dim]
    return input_dim, hidden_dim # Lulu: added hidden_dim and renamed input_size because of later conflict

In [7]:
def train_val_test_split(ori_data, gen_data, rate=(0.65, 0.2, 0.15)): # Lulu: using sklearn, replaces mix_divide
  # rate = (train, val, test) must sum to one

  data = np.concatenate([ori_data,gen_data],axis=0)
  labels = np.concatenate([np.ones(len(ori_data)), np.zeros(len(gen_data))], axis=0)

  train_data, test_data, train_labels, test_labels = train_test_split(data, labels, test_size=rate[2])
  train_data, val_data, train_labels, val_labels = train_test_split(train_data, train_labels, train_size=rate[0]/(rate[0]+rate[1]))
  return train_data, val_data, test_data, train_labels, val_labels, test_labels
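
The two-stage split first removes the test fraction, then rescales the train fraction relative to what is left (0.65 / 0.85 of the remaining samples). The sketch below checks this arithmetic on an index array. The total of 2694 samples is taken from the shapes printed in the DoppelGANger section, and `random_state=0` is added here only for reproducibility (the notebook function does not set it).

```python
import numpy as np
from sklearn.model_selection import train_test_split

rate = (0.65, 0.2, 0.15)  # (train, val, test)
n_total = 2694            # 1750 + 539 + 405, from the printed shapes
indices = np.arange(n_total)

# Stage 1: hold out the test fraction
rest, test = train_test_split(indices, test_size=rate[2], random_state=0)
# Stage 2: split the remainder so train ends up as ~65% of the full set
train, val = train_test_split(rest, train_size=rate[0] / (rate[0] + rate[1]), random_state=0)

split_sizes = (len(train), len(val), len(test))
print(split_sizes)
print([round(size / n_total, 3) for size in split_sizes])
```

The resulting sizes should match the shapes printed in the DoppelGANger training cell.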

# Define Model

Lulu's Notes:

* No normalisation used
* Added intermediate dense layer, which improved the score. This also makes the class much more flexible between "more features, shorter sequences" and "fewer features, longer sequences"
* Changed loss to `BinaryCrossentropy`

In [8]:
def discriminative_model(input_size, hidden_dim): 
    inputs = tf.keras.Input(shape = input_size)
    # normalised1 = LayerNormalization()(inputs1)
    GRU_output_sequence, GRU_last_state = tf.keras.layers.GRU(hidden_dim, return_sequences = True, return_state = True)(inputs)
    # Dense1 is the y_hat_logit in the original code
    Dense1 = tf.keras.layers.Dense(hidden_dim)(GRU_last_state) # Lulu: added intermediate dense layer with increased dimension, scores much better
    Dense2 = tf.keras.layers.Dense(1)(Dense1)

    # Acti1 is the y_hat in the original code.
    # The original code compares the dense-layer output (the logit, y_hat_logit) with the 0/1 label
    # while using Acti1 as the prediction. Lulu: this is OK, there are losses that take logits directly.
    # Here the sigmoid output Acti1 is used both as the prediction and as the input to the loss.

    Acti1 = tf.keras.layers.Activation(tf.keras.activations.sigmoid)(Dense2)  # Lulu: might not need separate activation layer
    
    model = tf.keras.Model(inputs = inputs, outputs = [Acti1])
    model.compile(optimizer = "adam", loss = tf.keras.losses.BinaryCrossentropy()) # Lulu: I think this is a better choice of loss for us
    
    return model 
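
The comments in the model cell discuss whether the loss should see the logit or the sigmoid output. As a minimal sketch in plain NumPy (the helper names `bce_from_probs` and `bce_from_logits` are illustrative and are not part of the notebook or of Keras), the two formulations give the same binary cross-entropy up to numerical clipping:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_probs(y, p, eps=1e-7):
    """Binary cross-entropy computed from probabilities (sigmoid outputs)."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))

def bce_from_logits(y, z):
    """Same loss computed directly from logits, in a numerically stable form."""
    return float(np.mean(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))))

logits = np.array([2.0, -1.0, 0.5, -3.0])
labels = np.array([1.0, 0.0, 1.0, 0.0])

loss_probs = bce_from_probs(labels, sigmoid(logits))
loss_logits = bce_from_logits(labels, logits)
print(loss_probs, loss_logits)
```

So comparing the logit with the label is fine as long as the loss applies the sigmoid internally, and comparing the sigmoid output with the label is fine as long as the loss expects probabilities. Mixing the two conventions is what would give a wrong loss.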
                         


# tGAN

## Train

In [9]:
ori_data_tgan, gen_data_tgan = load_tGAN()
train_data_tgan, val_data_tgan, test_data_tgan, train_labels_tgan, val_labels_tgan, test_labels_tgan = train_val_test_split(ori_data=ori_data_tgan, gen_data=gen_data_tgan)
# Check shapes:
for array in [train_data_tgan, val_data_tgan, test_data_tgan, train_labels_tgan, val_labels_tgan, test_labels_tgan]:
  print(np.shape(array))


input_dim, hidden_dim = InputSize(ori_data_tgan)
model_tgan = discriminative_model(input_size=input_dim, hidden_dim=hidden_dim)

history_model_tgan = model_tgan.fit(train_data_tgan, train_labels_tgan, batch_size=128, epochs=200, validation_data=(val_data_tgan, val_labels_tgan))


FileNotFoundError: ignored

## Evaluate

In [None]:
model_tgan.evaluate(test_data_tgan, test_labels_tgan) # Keras built-in evaluation; returns the test loss (binary cross-entropy) because no metrics were compiled



0.007770351134240627

In [None]:
test_raw_pred_tgan = model_tgan.predict(test_data_tgan)
test_pred_tgan = np.round(test_raw_pred_tgan)

print(classification_report(test_labels_tgan, test_pred_tgan, digits=5)) # more detailed classification report using sklearn

              precision    recall  f1-score   support

         0.0    0.98561   1.00000   0.99275       137
         1.0    1.00000   0.98276   0.99130       116

    accuracy                        0.99209       253
   macro avg    0.99281   0.99138   0.99203       253
weighted avg    0.99221   0.99209   0.99209       253



In [None]:
exp_acc_tgan = np.sum(test_labels_tgan)/np.shape(test_labels_tgan)[0]
print('Expected accuracy for an untrained discriminative model = ', str(exp_acc_tgan))
print('Final accuracy of trained discriminative model = ', str(accuracy_score(test_labels_tgan, test_pred_tgan)))

Expected accuracy for an untrained discriminative model =  0.45849802371541504
Final accuracy of trained discriminative model =  0.9920948616600791
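
The "expected accuracy" printed above is the fraction of original (label 1) sequences in the test set, which is the accuracy of a classifier that always predicts "original". A slightly stricter reference point is the majority-class baseline. The sketch below works both out from the support counts in the classification report (137 generated, 116 original). The variable names are illustrative.

```python
# Support counts taken from the tGAN classification report above
n_generated = 137  # label 0
n_original = 116   # label 1
n_test = n_generated + n_original

# Accuracy of always predicting "original" (what the notebook prints)
frac_original = n_original / n_test

# Accuracy of always predicting the more frequent class
majority_baseline = max(frac_original, 1.0 - frac_original)

print(f"always-original baseline: {frac_original:.4f}")
print(f"majority-class baseline:  {majority_baseline:.4f}")
```

Either way the baseline sits near 0.5, far below the trained model's accuracy.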


# DoppelGANger

## Train

Lulu: I chose to increase the hidden dimension to 64 because there are only a few features (the printed shapes show 6 per time step, so `InputSize` would give a hidden dimension of just 3). This allows the additional dense layer to train from the longer sequences of 130 (compared to length 10 in the tGAN output). Accuracy improved significantly.

In [14]:
ori_data_dop, gen_data_dop = load_DoppelGANger()
train_data_dop, val_data_dop, test_data_dop, train_labels_dop, val_labels_dop, test_labels_dop = train_val_test_split(ori_data=ori_data_dop, gen_data=gen_data_dop)
# Check shapes
for array in [train_data_dop, val_data_dop, test_data_dop, train_labels_dop, val_labels_dop, test_labels_dop]:
  print(np.shape(array))


input_dim, hidden_dim = InputSize(ori_data_dop)
model_dop = discriminative_model(input_size=input_dim, hidden_dim=64)

history_model_dop = model_dop.fit(train_data_dop, train_labels_dop, batch_size=128, epochs=100, validation_data=(val_data_dop, val_labels_dop))


(1750, 130, 6)
(539, 130, 6)
(405, 130, 6)
(1750,)
(539,)
(405,)
Epoch 1/100
...
Epoch 72/100
Epoch 73

## Evaluate

In [15]:
model_dop.evaluate(test_data_dop, test_labels_dop) # Keras built-in evaluation; returns the test loss (binary cross-entropy) because no metrics were compiled



0.028805462643504143

In [22]:
test_raw_pred_dop = model_dop.predict(test_data_dop)
test_pred_dop = np.round(test_raw_pred_dop)

report = classification_report(test_labels_dop, test_pred_dop, digits=5, output_dict=True) # more detailed classification report using sklearn
report = pd.DataFrame(report).transpose()
report.to_csv('discriminative_dop_results_0827am.csv')
print(report)

              precision    recall  f1-score     support
0.0            0.990050  1.000000  0.995000  199.000000
1.0            1.000000  0.990291  0.995122  206.000000
accuracy       0.995062  0.995062  0.995062    0.995062
macro avg      0.995025  0.995146  0.995061  405.000000
weighted avg   0.995111  0.995062  0.995062  405.000000


In [17]:
exp_acc_dop = np.sum(test_labels_dop)/np.shape(test_labels_dop)[0]
print('Expected accuracy for an untrained discriminative model = ', str(exp_acc_dop))
print('Final accuracy of trained discriminative model = ', str(accuracy_score(test_labels_dop, test_pred_dop)))

Expected accuracy for an untrained discriminative model =  0.508641975308642
Final accuracy of trained discriminative model =  0.9950617283950617
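
To compare the two generators on one scale, the accuracies can be folded into a single "discriminative score". The sketch below assumes the definition |0.5 - accuracy|, which is how I recall the TimeGAN reference code summarising this metric, so treat that definition as an assumption. Under it, 0 means indistinguishable from real data and 0.5 means perfectly separable. The accuracies are the ones printed in this notebook.

```python
def discriminative_score(accuracy):
    """Distance of the test accuracy from chance level (assumed definition)."""
    return abs(0.5 - accuracy)

# Test accuracies printed in the tGAN and DoppelGANger sections
accuracies = {
    "tGAN": 0.9920948616600791,
    "DoppelGANger": 0.9950617283950617,
}

scores = {name: discriminative_score(acc) for name, acc in accuracies.items()}
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
```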


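On both datasets the discriminative model separates synthetic from real sequences almost perfectly: test accuracy is about 0.992 for tGAN and 0.995 for DoppelGANger, against roughly 0.5 for an untrained model. By this measure the synthetic data are still easy to tell apart from the real data, so there is considerable room to improve the generators.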

References:
* [Jinsung Yoon, Daniel Jarrett, Mihaela van der Schaar. Time-series Generative Adversarial Networks](https://papers.nips.cc/paper/8789-time-series-generative-adversarial-networks.pdf)
