# An Example of Generating Synthetic Data

This notebook walks through generating synthetic data with DoppelGANger. We start by cleaning the `tsv` dataset and normalising the data, then transform it into an `npz` file that can be fed into the model.

We then show how to convert the `npz` file generated by DoppelGANger into a `csv` file for evaluation and for your own use.

## Import Packages

In [1]:
import pandas as pd 
import numpy as np

from prism_prep.prism_ori_data_loading import *
from prism_prep.prsim_gen_data_loading import * 

## Load original prism tsv

In [2]:
data = pd.read_csv('../../../synthetic-data-service/isaFull.tsv', delimiter='\t')
data.head()

Unnamed: 0,Observation_Id,Participant_Id,Household_Id,Abdominal pain [HP_0002027],Abdominal pain duration (days) [EUPATH_0000154],Admitting hospital [EUPATH_0000318],Age at visit (years) [EUPATH_0000113],Anorexia [SYMP_0000523],Anorexia duration (days) [EUPATH_0000155],"Asexual Plasmodium parasite density, by microscopy [EUPATH_0000092]",...,Seizures duration (days) [EUPATH_0000163],Severe malaria criteria [EUPATH_0000046],Subjective fever [EUPATH_0000100],"Submicroscopic Plasmodium present, by LAMP [EUPATH_0000487]",Temperature (C) [EUPATH_0000110],Visit date [EUPATH_0000091],Visit type [EUPATH_0000311],Vomiting [HP_0002013],Vomiting duration (days) [EUPATH_0000165],Weight (kg) [EUPATH_0000732]
0,100118961,1001,HH216001613,No,0.0,,34.08,No,0.0,,...,0.0,,No,,36.9,2011-11-27,Unscheduled visit,No,0.0,67.0
1,100120115,1001,HH216001613,No,0.0,,37.24,No,0.0,0.0,...,0.0,,No,,36.7,2015-01-24,Scheduled visit,No,0.0,72.0
2,100120199,1001,HH216001613,No,0.0,,37.47,No,0.0,0.0,...,0.0,,No,,37.0,2015-04-18,Scheduled visit,No,0.0,72.0
3,100120031,1001,HH216001613,No,0.0,,37.01,No,0.0,0.0,...,0.0,,No,,36.9,2014-11-01,Scheduled visit,No,0.0,73.0
4,100119668,1001,HH216001613,No,0.0,,36.02,No,0.0,0.0,...,0.0,,No,,36.2,2013-11-03,Scheduled visit,No,0.0,74.0


## Prepare a clean version of the prism csv

Clean means:

- categorical data are one-hot encoded
- NAs are dealt with
- dates are converted to `delta_dday`, a.k.a. `dday` (the difference in dates between subsequent visits)
- unwanted columns are removed
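
The cleaning steps listed above can be sketched on a toy table. This is only an illustration under assumed column names and an assumed NA policy (forward-fill per patient); it is not the body of `clean_prism`, which is specific to the PRISM data. The first visit of each patient has no previous visit, so its `dday` is left as NaN here.

```python
import pandas as pd

# Toy visits table; column names are illustrative, not the PRISM schema.
visits = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "visit_date": pd.to_datetime(
        ["2011-08-01", "2011-08-31", "2011-10-30", "2011-09-10", "2011-09-20"]
    ),
    "visit_type": ["enrollment", "scheduled", "unscheduled", "enrollment", "scheduled"],
    "weight": [67.0, None, 72.0, 24.8, 23.6],
    "notes": ["a", "b", "c", "d", "e"],
})

# 1. One-hot encode the categorical column.
cleaned = pd.get_dummies(visits, columns=["visit_type"], dtype=int)

# 2. Deal with NAs (assumed policy: forward-fill within each patient).
cleaned["weight"] = cleaned.groupby("id")["weight"].ffill()

# 3. Replace dates with the gap in days since the patient's previous visit.
cleaned = cleaned.sort_values(["id", "visit_date"])
cleaned["dday"] = cleaned.groupby("id")["visit_date"].diff().dt.days

# 4. Remove unwanted columns.
cleaned = cleaned.drop(columns=["visit_date", "notes"])
print(cleaned)
```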

#### Two options for dealing with first dday for each patient  

- The first visit day for each patient is converted to `first_dday` (first visit date minus the earliest visit date in the whole dataset) and is treated as an attribute to capture the "global distribution" (hence an extra column). All files for this option can be found in the `data_attr` folder.
- The first visit day for each patient is calculated the same way (first visit date minus the earliest visit date in the whole dataset), but is not treated as an attribute (no extra column). All files for this option can be found in the `data` folder.
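
A minimal sketch of the first-dday calculation, assuming a toy table with hypothetical `id` and `visit_date` columns. It fills the first `dday` of each patient with the offset from the earliest visit in the dataset, and also extracts that offset per patient, which is what the attribute option would keep as `first_dday`. How `clean_prism` stores it internally is not shown in this notebook.

```python
import pandas as pd

visits = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "visit_date": pd.to_datetime(
        ["2011-08-05", "2011-09-04", "2011-08-01", "2011-08-11"]
    ),
}).sort_values(["id", "visit_date"])

# Gap in days to the previous visit of the same patient (NaN on a first visit).
dday = visits.groupby("id")["visit_date"].diff().dt.days
is_first = dday.isna()

# Offset of every visit from the earliest visit date in the whole dataset.
offset = (visits["visit_date"] - visits["visit_date"].min()).dt.days

# The offset replaces the missing first dday of each patient.
visits["dday"] = dday.where(~is_first, offset).astype(int)

# One value per patient, usable as a `first_dday` attribute.
first_dday = visits[is_first].set_index("id")["dday"]
print(visits)
print(first_dday)
```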

The cleaned `csv` file can be used to compare real and generated data during evaluation.

In [3]:
# define if first dday for each patient should be considered as an attribute or feature
first_dday_as_attr = False
cleaned_df = clean_prism(data, first_dday_as_attr)  # clean_prism is a highly specific function!
cleaned_df.head()

Unnamed: 0,id,ab_pain_dur,age,aneroxia_dur,plasmodium_density,cough_dur,diarrhea_dur,fatigue_dur,fever_dur,headache_dur,...,malaria_treatment_quinine_or_artesunate_for_complicated_malaria,plasmodium_gametocytes_no,plasmodium_gametocytes_yes,plasmodium_lamp_negative,plasmodium_lamp_no_result,plasmodium_lamp_positive,visit_type_enrollment,visit_type_scheduled_visit,visit_type_unscheduled_visit,dday
0,1001,0.0,33.76,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,1,0,1,0,0,1,0,0,4
1,1001,0.0,34.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,1,0,1,0,0,0,1,0,88
2,1001,0.0,34.08,0.0,0.0,7.0,0.0,0.0,0.0,0.0,...,0,1,0,0,1,0,0,0,1,29
3,1001,0.0,34.25,0.0,0.0,0.0,0.0,1.0,2.0,0.0,...,0,1,0,1,0,0,0,1,0,61
4,1001,0.0,34.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,1,0,1,0,0,0,1,0,91


## Convert CSV into a form that can be fed into the DoppelGANger model

### Example of saving data in a format that DoppelGANger can process

As above, two options are shown  
- the first dday as attribute
- the first dday not as attribute

`real_data_loading` normalises the input data and prepares the features and attributes from the cleaned `csv`. The returned data is then saved to an `npz` file.
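
To make the shapes concrete, here is a rough sketch of min-max normalisation followed by padding to a fixed sequence length. `pad_and_normalise` is a hypothetical stand-in written for this illustration; the actual `real_data_loading` may handle columns, attributes and edge cases differently.

```python
import numpy as np

def pad_and_normalise(sequences, seq_len):
    """Min-max scale variable-length sequences and zero-pad them to seq_len.

    Hypothetical helper for illustration; not the notebook's real_data_loading.
    """
    stacked = np.concatenate(sequences, axis=0)
    min_, max_ = stacked.min(axis=0), stacked.max(axis=0)
    scale = np.where(max_ > min_, max_ - min_, 1.0)  # avoid division by zero

    n_features = stacked.shape[1]
    data_feature = np.zeros((len(sequences), seq_len, n_features))
    data_gen_flag = np.zeros((len(sequences), seq_len))
    for i, seq in enumerate(sequences):
        data_feature[i, : len(seq)] = (seq - min_) / scale
        data_gen_flag[i, : len(seq)] = 1.0  # 1 marks a real (non-padded) step
    return data_feature, data_gen_flag, min_, max_

# Two patients with 3 and 1 visits, 2 features each.
sequences = [
    np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]),
    np.array([[2.0, 15.0]]),
]
data_feature, data_gen_flag, min_, max_ = pad_and_normalise(sequences, seq_len=4)
print(data_feature.shape, data_gen_flag.shape)
```

The returned `min_` and `max_` are what later allow generated values to be scaled back to their original range.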

In [9]:
if first_dday_as_attr: 
    data_feature, data_attribute, data_gen_flag, min_, max_, feature_cols, attribute_cols = real_data_loading('data_attr/ori_prism_cleaned.csv')
    # first dday as attr
    data_attribute = data_attribute.reshape((-1, 1))

    print(data_feature.shape)
    print(data_attribute.shape)
    print(data_gen_flag.shape)
    # save to npz file
    #np.savez('data_attr/data_train.npz', data_feature=data_feature, data_attribute=data_attribute, data_gen_flag=data_gen_flag)
else:
    data_feature, _, data_gen_flag, min_, max_, feature_cols, attribute_cols = real_data_loading('data/ori_prism_cleaned.csv')
    # temporarily use all ones for attributes (one row per sample)
    data_attribute = np.ones((data_feature.shape[0], 1))

    print(data_feature.shape)
    print(data_attribute.shape)
    print(data_gen_flag.shape)
    # save to npz file
    #np.savez('data/data_train.npz', data_feature=data_feature, data_attribute=data_attribute, data_gen_flag=data_gen_flag)


(1347, 130, 47)
(1347, 1)
(1347, 130)
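
Before training, it can help to confirm that the saved `npz` file round-trips with the expected keys and shapes. The sketch below uses small stand-in arrays and a temporary path (both assumptions for illustration) with the same three keys as the commented-out `np.savez` calls.

```python
import os
import tempfile

import numpy as np

# Stand-in arrays with the same layout as the real ones:
# (samples, seq_len, features), (samples, attributes), (samples, seq_len).
data_feature = np.zeros((3, 5, 2))
data_attribute = np.ones((3, 1))
data_gen_flag = np.ones((3, 5))

path = os.path.join(tempfile.mkdtemp(), "data_train.npz")
np.savez(
    path,
    data_feature=data_feature,
    data_attribute=data_attribute,
    data_gen_flag=data_gen_flag,
)

# Reload and list what the archive contains.
with np.load(path) as loaded:
    keys = sorted(loaded.files)
    shapes = {k: loaded[k].shape for k in keys}
print(shapes)
```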


### Example for the `pkl` file

Two `pkl` files are needed to describe the features and attributes in the dataset.

Note: Regardless of whether first_dday is treated as an attribute, both options have the same number of attributes. With `first_dday_as_attr = True`, the attribute is first_dday. With `first_dday_as_attr = False`, `np.ones` values are temporarily used as the attribute. Hence, both options share the same two `pkl` files, which are shown below.

Explanation:  

The first line `Output(type_=OutputType.CONTINUOUS, dim=1, normalization=Normalization.ZERO_ONE, is_gen_flag=False)` corresponds to `ab_pain_dur`, which is continuous, has only one dimension, and is normalized between 0 and 1.  
 
The second-to-last line `Output(type_=OutputType.DISCRETE, dim=3, normalization=None, is_gen_flag=False)` corresponds to `visit_type`, which is categorical (discrete) and contains 3 categories (enrollment, scheduled, unscheduled).

In [50]:
import pickle
from output import *
data_feature_output = [
    Output(type_=OutputType.CONTINUOUS, dim=1, normalization=Normalization.ZERO_ONE, is_gen_flag=False),
    Output(type_=OutputType.CONTINUOUS, dim=1, normalization=Normalization.ZERO_ONE, is_gen_flag=False),
    Output(type_=OutputType.CONTINUOUS, dim=1, normalization=Normalization.ZERO_ONE, is_gen_flag=False),
    Output(type_=OutputType.CONTINUOUS, dim=1, normalization=Normalization.ZERO_ONE, is_gen_flag=False),
    Output(type_=OutputType.CONTINUOUS, dim=1, normalization=Normalization.ZERO_ONE, is_gen_flag=False),
    Output(type_=OutputType.CONTINUOUS, dim=1, normalization=Normalization.ZERO_ONE, is_gen_flag=False),
    Output(type_=OutputType.CONTINUOUS, dim=1, normalization=Normalization.ZERO_ONE, is_gen_flag=False),
    Output(type_=OutputType.CONTINUOUS, dim=1, normalization=Normalization.ZERO_ONE, is_gen_flag=False),
    Output(type_=OutputType.CONTINUOUS, dim=1, normalization=Normalization.ZERO_ONE, is_gen_flag=False),
    Output(type_=OutputType.CONTINUOUS, dim=1, normalization=Normalization.ZERO_ONE, is_gen_flag=False),
    Output(type_=OutputType.CONTINUOUS, dim=1, normalization=Normalization.ZERO_ONE, is_gen_flag=False),
    Output(type_=OutputType.CONTINUOUS, dim=1, normalization=Normalization.ZERO_ONE, is_gen_flag=False),
    Output(type_=OutputType.CONTINUOUS, dim=1, normalization=Normalization.ZERO_ONE, is_gen_flag=False),
    Output(type_=OutputType.CONTINUOUS, dim=1, normalization=Normalization.ZERO_ONE, is_gen_flag=False),
    Output(type_=OutputType.CONTINUOUS, dim=1, normalization=Normalization.ZERO_ONE, is_gen_flag=False),
    Output(type_=OutputType.CONTINUOUS, dim=1, normalization=Normalization.ZERO_ONE, is_gen_flag=False),
    Output(type_=OutputType.DISCRETE, dim=3, normalization=None, is_gen_flag=False),
    Output(type_=OutputType.DISCRETE, dim=2, normalization=None, is_gen_flag=False),
    Output(type_=OutputType.DISCRETE, dim=3, normalization=None, is_gen_flag=False),
    Output(type_=OutputType.DISCRETE, dim=2, normalization=None, is_gen_flag=False),
    Output(type_=OutputType.DISCRETE, dim=7, normalization=None, is_gen_flag=False),
    Output(type_=OutputType.DISCRETE, dim=5, normalization=None, is_gen_flag=False),
    Output(type_=OutputType.DISCRETE, dim=2, normalization=None, is_gen_flag=False),
    Output(type_=OutputType.DISCRETE, dim=3, normalization=None, is_gen_flag=False),
    Output(type_=OutputType.DISCRETE, dim=3, normalization=None, is_gen_flag=False),
    Output(type_=OutputType.CONTINUOUS, dim=1, normalization=Normalization.ZERO_ONE, is_gen_flag=False)]

# with open('data/data_feature_output.pkl', 'wb') as f:
#     pickle.dump(data_feature_output, f)

In [51]:
data_attribute_output = [Output(type_=OutputType.CONTINUOUS, dim=1, normalization=Normalization.ZERO_ONE, is_gen_flag=False)]

# with open('data/data_attribute_output.pkl', 'wb') as f:
#     pickle.dump(data_attribute_output, f)
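
A quick consistency check is that the `dim` values in the feature description add up to the last axis of `data_feature` (47 in the shapes printed earlier). The sketch below uses a lightweight stand-in named `Spec` rather than the real `Output` class, and assumes the final continuous entry is `dday`.

```python
from collections import namedtuple

# Stand-in for the `Output` class, keeping only the fields needed here.
Spec = namedtuple("Spec", ["kind", "dim"])

# 16 continuous features, 9 one-hot groups, then one more continuous feature.
discrete_dims = [3, 2, 3, 2, 7, 5, 2, 3, 3]
specs = (
    [Spec("continuous", 1)] * 16
    + [Spec("discrete", d) for d in discrete_dims]
    + [Spec("continuous", 1)]
)

total_dim = sum(s.dim for s in specs)
print(len(specs), total_dim)  # 26 47
```

If the total differs from `data_feature.shape[2]`, the description list and the cleaned columns are out of sync.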

## Training DoppelGANger and generating data

In the terminal, run `python main.py`. Before running, adjust the number of epochs, the path to the data, the path where the generated data will be saved, and any similar settings as needed.

After training, DoppelGANger generates an `npz` file.

## Converting generated `npz` file back to original data format  
### 3 steps
#### 1. Load the original data to get the column names and min/max values using `real_data_loading`

#### 2. Load the generated `npz` data into an intermediate CSV for evaluation purposes  
The intermediate CSV contains one-hot encodings (hard 0s and 1s instead of probabilities), unrounded values and dday. It is compared with the cleaned version of the prism data during evaluation. A walkthrough of evaluation methods can be found in `prism_eval.ipynb`.
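
As an illustration of this step, the sketch below turns category probabilities into hard 0/1 one-hot rows and inverts zero-one scaling using stored min/max values. Both helper functions are hypothetical and only approximate what `gen_data_loading` might do.

```python
import numpy as np

def to_hard_one_hot(probs):
    """Turn each row of category probabilities into a 0/1 one-hot row."""
    hard = np.zeros_like(probs, dtype=int)
    hard[np.arange(len(probs)), probs.argmax(axis=1)] = 1
    return hard

def denormalise(scaled, min_val, max_val):
    """Invert zero-one scaling back to the original value range."""
    return scaled * (max_val - min_val) + min_val

probs = np.array([[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]])
hard = to_hard_one_hot(probs)
weights = denormalise(np.array([0.0, 0.5, 1.0]), min_val=20.0, max_val=80.0)
print(hard)
print(weights)
```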

#### 3. Transform the intermediate CSV into a final CSV in the same format as the original data  
Convert the intermediate CSV above to a "normal" CSV format in which the one-hot encoding has been reversed and the dday values have been converted to calendar dates.
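
A minimal sketch of this conversion on a toy intermediate table. The reference start date is a made-up value, and the column handling is an assumption rather than the actual logic of `convert_df_to_ori_format`.

```python
import pandas as pd

inter = pd.DataFrame({
    "id": [1, 1, 1],
    "visit_type_enrollment": [1, 0, 0],
    "visit_type_scheduled_visit": [0, 1, 0],
    "visit_type_unscheduled_visit": [0, 0, 1],
    "dday": [5, 30, 61],
})

# Reverse the one-hot encoding: take the column holding the 1, drop the prefix.
prefix = "visit_type_"
onehot_cols = [c for c in inter.columns if c.startswith(prefix)]
visit_type = (
    inter[onehot_cols].idxmax(axis=1).str[len(prefix):].str.replace("_", " ")
)

# Turn day gaps into dates: cumulative days per patient from a reference date.
start = pd.Timestamp("2011-08-01")  # hypothetical earliest visit date
days_since_start = inter.groupby("id")["dday"].cumsum()
visit_date = start + pd.to_timedelta(days_since_start, unit="D")

final = pd.DataFrame({
    "Participant_Id": inter["id"],
    "Visit date": visit_date,
    "Visit type": visit_type,
})
print(final)
```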

In [4]:
if first_dday_as_attr:
    features, attributes, gen_flag, min_val, max_val, feature_cols, attribute_cols = real_data_loading('data_attr/ori_prism_cleaned.csv')
    gen_prism_at_int_c1000 = gen_data_loading('data_attr/generated_data_train_c1000.npz', feature_cols, attribute_cols, min_val, max_val, True, seq_len=130)
    #gen_prism_at_int_c1000.to_csv('data_attr/gen_prism_at_int_c1000.csv', index=False)
    gen_prism_at_final_c1000 = convert_df_to_ori_format(gen_prism_at_int_c1000, True)
    #gen_prism_at_final_c1000.to_csv('data_attr/gen_prism_at_final_c1000.csv', index=False)

else:
    features, attributes, gen_flag, min_val, max_val, feature_cols, attribute_cols = real_data_loading('data/ori_prism_cleaned.csv')
    gen_prism_int_c1000 = gen_data_loading('data/generated_data_train_c1000_cumsum.npz', feature_cols, attribute_cols, min_val, max_val, False, seq_len=130)
    #gen_prism_int_c1000.to_csv('data/gen_prism_int_c1000_cumsum.csv', index=False)
    gen_prism_final_c1000 = convert_df_to_ori_format(gen_prism_int_c1000, False)
    #gen_prism_final_c1000.to_csv('data/gen_prism_final_c1000_cumsum.csv', index=False)

### An example of the final form of generated data


In [5]:
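# NOTE: `gen_prism_at_final_e200` is not created in the cells above; it is assumed to come from a separate run with `first_dday_as_attr = True`
gen_prism_at_final_e200.head()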

Unnamed: 0,Participant_Id,Visit date [EUPATH_0000091],Abdominal pain duration (days) [EUPATH_0000154],Age at visit (years) [EUPATH_0000113],Anorexia duration (days) [EUPATH_0000155],"Asexual Plasmodium parasite density, by microscopy [EUPATH_0000092]",Cough duration (days) [EUPATH_0000156],Diarrhea duration (days) [EUPATH_0000157],Fatigue duration (days) [EUPATH_0000158],"Fever, subjective duration (days) [EUPATH_0000164]",...,Weight (kg) [EUPATH_0000732],Complicated malaria [EUPATH_0000040],Febrile [EUPATH_0000097],ITN last night [EUPATH_0000216],Malaria diagnosis and parasite status [EUPATH_0000338],Malaria treatment [EUPATH_0000740],"Plasmodium gametocytes present, by microscopy [EUPATH_0000207]","Submicroscopic Plasmodium present, by LAMP [EUPATH_0000487]",Visit type [EUPATH_0000311],Malaria diagnosis [EUPATH_0000090]
0,1,2011-08-11,0.0,21.84,0.0,32.3,0.0,0.0,0.0,0.0,...,66.9,not assessed,no,not applicable,blood smear negative / lamp positive,no malaria medications given,no,negative,enrollment,no
1,1,2011-10-01,0.0,23.32,0.0,32.5,1.0,0.0,0.0,0.0,...,67.0,not assessed,yes,yes,blood smear negative / lamp not done,no malaria medications given,no,no result,scheduled visit,no
2,1,2011-11-20,0.0,24.65,0.0,32.4,0.0,0.0,0.0,0.0,...,68.6,not assessed,no,yes,blood smear negative / lamp negative,no malaria medications given,no,no result,scheduled visit,no
3,1,2012-01-08,0.0,24.55,0.0,32.8,0.0,0.0,0.0,0.0,...,68.1,not assessed,no,yes,blood smear negative / lamp negative,no malaria medications given,no,no result,scheduled visit,no
4,1,2012-02-26,0.0,25.54,0.0,32.6,0.0,0.0,0.0,0.0,...,68.7,not assessed,no,yes,blood smear negative / lamp not done,no malaria medications given,no,no result,scheduled visit,no


In [6]:
gen_prism_final_c1000.head()

Unnamed: 0,Participant_Id,Visit date [EUPATH_0000091],Abdominal pain duration (days) [EUPATH_0000154],Age at visit (years) [EUPATH_0000113],Anorexia duration (days) [EUPATH_0000155],"Asexual Plasmodium parasite density, by microscopy [EUPATH_0000092]",Cough duration (days) [EUPATH_0000156],Diarrhea duration (days) [EUPATH_0000157],Fatigue duration (days) [EUPATH_0000158],"Fever, subjective duration (days) [EUPATH_0000164]",...,Weight (kg) [EUPATH_0000732],Complicated malaria [EUPATH_0000040],Febrile [EUPATH_0000097],ITN last night [EUPATH_0000216],Malaria diagnosis and parasite status [EUPATH_0000338],Malaria treatment [EUPATH_0000740],"Plasmodium gametocytes present, by microscopy [EUPATH_0000207]","Submicroscopic Plasmodium present, by LAMP [EUPATH_0000487]",Visit type [EUPATH_0000311],Malaria diagnosis [EUPATH_0000090]
0,1,2011-08-06,0.0,6.44,0.0,0.0,1.0,0.0,0.0,0.0,...,24.8,not assessed,no,not applicable,blood smear negative / lamp negative,no malaria medications given,no,negative,enrollment,no
1,1,2011-09-06,0.0,6.9,0.0,106.1,1.0,0.0,0.0,1.0,...,23.6,not assessed,yes,yes,blood smear negative / lamp not done,no malaria medications given,no,no result,unscheduled visit,no
2,1,2011-10-03,0.0,7.0,0.0,393.8,1.0,0.0,0.0,0.0,...,23.3,not assessed,no,yes,blood smear not indicated,no malaria medications given,no,no result,scheduled visit,no
3,1,2011-11-01,0.0,7.0,0.0,107.9,1.0,0.0,0.0,0.0,...,24.8,not assessed,no,yes,blood smear negative / lamp negative,no malaria medications given,no,negative,scheduled visit,no
4,1,2011-11-25,0.0,7.76,0.0,0.0,0.0,0.0,0.0,0.0,...,23.4,not assessed,no,yes,blood smear negative / lamp negative,no malaria medications given,no,negative,scheduled visit,no
