# 01 Introductory Experiments
In this notebook, we perform the basic experiments that motivate the use of disjoint generation.

Disjoint generation, in which the columns are divided among several generative models whose outputs are joined afterwards, remains largely unexplored in the literature compared with single-synthesis generation. Single-synthesis generation seems the obvious choice for most applications, because access to all variables may allow the model to better capture the underlying distribution. However, as we shall show, disjoint generation can offer benefits such as computational speedup, better generalization, and better privacy. Additionally, distributing the generation process across multiple models allows the use of specialised models for different variable types.
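To make the idea concrete, here is a minimal sketch of disjoint generation on a toy table. It is not the `disjoint_generative_model` implementation: `toy_generator` (plain bootstrap resampling) and `toy_disjoint_generate` are hypothetical stand-ins invented for illustration, and the toy rows are taken from the first records of the dataset shown further down.

```python
import numpy as np
import pandas as pd

def toy_generator(df_part: pd.DataFrame, rng: np.random.Generator) -> pd.DataFrame:
    """Hypothetical stand-in for a real generative model: bootstrap-resample rows."""
    idx = rng.integers(0, len(df_part), size=len(df_part))
    return df_part.iloc[idx].reset_index(drop=True)

def toy_disjoint_generate(df: pd.DataFrame, partitions, seed: int = 0) -> pd.DataFrame:
    """Synthesize each column partition independently, then join by random concatenation."""
    rng = np.random.default_rng(seed)
    parts = []
    for cols in partitions:
        synth = toy_generator(df[cols], rng)
        # Shuffle the rows so that records from different partitions are paired at random.
        synth = synth.iloc[rng.permutation(len(synth))].reset_index(drop=True)
        parts.append(synth)
    # Concatenate side by side and restore the original column order.
    return pd.concat(parts, axis=1)[list(df.columns)]

df_toy = pd.DataFrame({
    "Age": [45, 61, 61, 40, 45],
    "T Stage": ["T1", "T1", "T2", "T3", "T1"],
    "Tumor Size": [15, 17, 28, 60, 14],
    "Status": ["Alive", "Alive", "Alive", "Dead", "Alive"],
})

# Two disjoint column partitions that together cover every column.
df_toy_synth = toy_disjoint_generate(df_toy, [["Age", "Status"], ["T Stage", "Tumor Size"]])
print(df_toy_synth)
```

Because each partition is generated without seeing the other, any dependence between, say, `Age` and `Tumor Size` is lost under random joining. That trade-off is what the experiments below try to quantify.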

In [1]:
import time
import pandas as pd

In [2]:
df_train = pd.read_csv('datasets/breast_cancer_train.csv')
df_test = pd.read_csv('datasets/breast_cancer_test.csv')

df_train.head()

Unnamed: 0,Age,Race,Marital Status,T Stage,N Stage,Sixth Stage,differentiate,Grade,A Stage,Tumor Size,Estrogen Status,Progesterone Status,Regional Node Examined,Reginol Node Positive,Survival Months,Status
0,45,White,Married,T1,N1,IIA,Poorly differentiated,3,Regional,15,Positive,Positive,24,1,80,Alive
1,61,White,Married,T1,N1,IIA,Moderately differentiated,2,Regional,17,Positive,Positive,8,1,48,Alive
2,61,White,Married,T2,N2,IIIA,Moderately differentiated,2,Regional,28,Positive,Negative,32,5,61,Alive
3,40,White,Married,T3,N2,IIIA,Moderately differentiated,2,Regional,60,Positive,Positive,14,5,53,Dead
4,45,White,Married,T1,N1,IIA,Well differentiated,1,Regional,14,Positive,Positive,13,1,94,Alive


In [5]:
# Establish the baseline: single-synthesis with a Bayesian network model (probably the worst-case classical ML model in terms of time complexity)
from disjoint_generative_model.utils.generative_model_adapters import generate_synthetic_data

start = time.time()
df_bn = generate_synthetic_data(df_train,'datasynthesizer')
end = time.time()

df_bn.to_csv('datasets/breast_cancer_datasynthesizer.csv', index=False)

print('Time taken for single-synthesis with bayesian network model: ', end-start)
# multiple repeats: 27.05, 25.92, 25.97, 26.30, 24.31

Adding ROOT Regional Node Examined
Adding attribute Reginol Node Positive
Adding attribute N Stage
Adding attribute Sixth Stage
Adding attribute T Stage
Adding attribute Tumor Size
Adding attribute Age
Adding attribute Survival Months
Adding attribute Marital Status
Adding attribute Status
Adding attribute differentiate
Adding attribute Grade
Adding attribute Race
Adding attribute Progesterone Status
Adding attribute Estrogen Status
Adding attribute A Stage
Time taken for single-synthesis with bayesian network model:  25.398772716522217


In [6]:
# Establish the baseline with single-synthesis with a CART model (the fastest classical ML model we encountered)
from disjoint_generative_model.utils.generative_model_adapters import generate_synthetic_data

start = time.time()
df_cart = generate_synthetic_data(df_train,'synthpop')
end = time.time()

df_cart.to_csv('datasets/breast_cancer_synthpop.csv', index=False)

print('Time taken for single-synthesis with CART model: ', end-start)
# multiple repeats: 4.18, 4.25, 4.06, 4.24, 4.24

Find out more at https://www.synthpop.org.uk/



Variable(s): Race, Marital.Status, T.Stage, N.Stage, Sixth.Stage, differentiate, A.Stage, Estrogen.Status, Progesterone.Status, Status have been changed for synthesis from character to factor.
Consider changing them to factors. You can do it using parameter 'minnumlevels'.
Synthetic data exported as csv file(s).
Information on synthetic data written to
  /home/lautrup/repositories/disjoint-synthetic-data-generation/synthesis_info_synthpop_temp_synthpop.txt 
Time taken for single-synthesis with CART model:  4.018636226654053


In [None]:
# Establish the baseline with single-synthesis with a GAN model
from disjoint_generative_model.utils.generative_model_adapters import generate_synthetic_data

start = time.time()
df_gan = generate_synthetic_data(df_train,'ctgan')
end = time.time()

df_gan.to_csv('datasets/breast_cancer_ctgan.csv', index=False)

print('Time taken for single-synthesis with GAN model: ', end-start)
# multiple repeats: 452.69, 919.70, 1206.95

  from .autonotebook import tqdm as notebook_tqdm
[2024-11-20T15:50:54.623156+0100][743091][CRITICAL] module disabled: /home/lautrup/sdg_env/lib/python3.10/site-packages/synthcity/plugins/generic/plugin_goggle.py
 57%|█████▋    | 1149/2000 [19:55<14:45,  1.04s/it] 

Time taken for single-synthesis with GAN model:  1206.9541540145874





In [None]:
# Setup the disjoint generative model with the same dataset
from disjoint_generative_model import DisjointGenerativeModels

# By default, the joining scheme is random concatenation
start = time.time()
dgms = DisjointGenerativeModels(df_train, ['synthpop', 'synthpop'])
df_dgms = dgms.fit_generate()
end = time.time()

df_dgms.to_csv('datasets/breast_cancer_dgms_random.csv', index=False)

print('Time taken for disjoint generative models: ', end-start)
# multiple repeats: 5.88, 5.72, 5.84

Find out more at https://www.synthpop.org.uk/


Consider changing them to factors. You can do it using parameter 'minnumlevels'.

Variable(s): A.Stage, Progesterone.Status, N.Stage, Status numeric but with only 3 or fewer distinct values turned into factor(s) for synthesis.

Synthetic data exported as csv file(s).
Information on synthetic data written to
  /home/lautrup/repositories/disjoint-synthetic-data-generation/synthesis_info_synthpop_temp_synthpop.txt 


Find out more at https://www.synthpop.org.uk/


Consider changing them to factors. You can do it using parameter 'minnumlevels'.

Variable(s): Estrogen.Status, Race numeric but with only 3 or fewer distinct values turned into factor(s) for synthesis.

Synthetic data exported as csv file(s).
Information on synthetic data written to
  /home/lautrup/repositories/disjoint-synthetic-data-generation/synthesis_info_synthpop_temp_synthpop.txt 
Time taken for disjoint generative models:  5.8448545932769775
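The repeat timings noted in the comments of the cells above can be summarised as follows. This is only a sketch: the numbers are copied from those comments, and the dictionary labels are made up for readability.

```python
from statistics import mean, stdev

# Wall-clock times in seconds, copied from the "multiple repeats" comments above.
repeat_times = {
    "datasynthesizer (single)": [27.05, 25.92, 25.97, 26.30, 24.31],
    "synthpop (single)": [4.18, 4.25, 4.06, 4.24, 4.24],
    "ctgan (single)": [452.69, 919.70, 1206.95],
    "synthpop x2 (disjoint)": [5.88, 5.72, 5.84],
}

# Mean and sample standard deviation per configuration.
summary = {name: (mean(ts), stdev(ts)) for name, ts in repeat_times.items()}

for name, (m, s) in summary.items():
    print(f"{name:26s} mean = {m:8.2f} s   std = {s:7.2f} s")
```

With only three to five repeats per configuration, the standard deviations are rough indications rather than reliable estimates.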


In [8]:
# Setup the evaluation methodology 
from syntheval import SynthEval

metrics = {
    "corr_diff" : {"mixed_corr": True},
    "auroc_diff" : {"model": "rf_cls"},
    "cls_acc"   : {"F1_type": "macro"},
    "eps_risk"  : {},
}

dfs = {
    'datasynthesizer': pd.read_csv('datasets/breast_cancer_datasynthesizer.csv'),
    'synthpop': pd.read_csv('datasets/breast_cancer_synthpop.csv'),
    'ctgan': pd.read_csv('datasets/breast_cancer_ctgan.csv'),
    'dgms_random': pd.read_csv('datasets/breast_cancer_dgms_random.csv')
}

SE = SynthEval(df_train, df_test, verbose=False)
res, rank = SE.benchmark(dfs, analysis_target_var='Status', **metrics, rank_strategy='summation')

res

dataset,corr_mat_diff (value),corr_mat_diff (error),auroc (value),auroc (error),cls_F1_diff (value),cls_F1_diff (error),cls_F1_diff_hout (value),cls_F1_diff_hout (error),eps_identif_risk (value),eps_identif_risk (error),priv_loss_eps (value),priv_loss_eps (error),rank,u_rank,p_rank
datasynthesizer,0.402019,,0.010163,,0.021067,0.00851,0.013658,0.004802,0.45563,,0.16182,,5.334312,3.951762,1.38255
synthpop,0.194283,,0.011891,,0.013569,0.007525,0.013176,0.004996,0.59657,,0.269575,,5.0936,3.959745,1.133855
ctgan,1.111686,,0.056804,,0.109858,0.00763,0.115485,0.015627,0.400075,,0.1305,,5.178015,3.708589,1.469426
dgms_random,1.436618,,0.011438,,0.019345,0.008425,0.031602,0.015937,0.342655,,0.107755,,5.475234,3.925644,1.54959


## Controlling for randomness
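Even with the default concatenation joining scheme, we appear to see a substantial benefit to privacy at only a small cost to utility: compared with single-synthesis synthpop, `dgms_random` lowers `eps_identif_risk` from 0.597 to 0.343 and `priv_loss_eps` from 0.270 to 0.108, while the aggregate `u_rank` drops only from 3.960 to 3.926 (although `corr_mat_diff` rises from 0.194 to 1.437). With this anecdotally established, many experiments are needed to confirm it. We start by investigating how much the quality of the data varies due to the random subdivision of the columns and the random joining.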

### Experiment 1: Random Subdivision into equal parts
In this experiment, we repeatedly generate the same dataset with the same models, but with different random subdivisions of the columns. We then compare the quality of the generated data using the same metrics as before.
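A random subdivision into equally sized parts can be sketched as below. The helper `random_equal_partition` is hypothetical and only mimics the idea; whether `DisjointGenerativeModels` partitions the columns in exactly this way is an assumption. The column names are those of the training data shown earlier.

```python
import random

def random_equal_partition(columns, n_parts=2, seed=None):
    """Shuffle the columns and deal them into n_parts disjoint, near-equal groups."""
    rng = random.Random(seed)
    shuffled = list(columns)
    rng.shuffle(shuffled)
    # Round-robin dealing keeps the group sizes within one column of each other.
    return [shuffled[i::n_parts] for i in range(n_parts)]

columns = [
    "Age", "Race", "Marital Status", "T Stage", "N Stage", "Sixth Stage",
    "differentiate", "Grade", "A Stage", "Tumor Size", "Estrogen Status",
    "Progesterone Status", "Regional Node Examined", "Reginol Node Positive",
    "Survival Months", "Status",
]

# Different seeds give different subdivisions of the same 16 columns.
for seed in range(3):
    part_a, part_b = random_equal_partition(columns, n_parts=2, seed=seed)
    print(seed, len(part_a), len(part_b), part_a)
```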

In [None]:
from syntheval import SynthEval
from itertools import product
from disjoint_generative_model import DisjointGenerativeModels

NUM_REPEATS = 3

metrics = {
    "corr_diff" : {"mixed_corr": True},
    "auroc_diff" : {"model": "rf_cls"},
    "cls_acc"   : {"F1_type": "macro"},
    "eps_risk"  : {},
}

models = ['synthpop', 'datasynthesizer', 'ctgan']

times = []
dgms_dict = {}
for i, model in product(range(NUM_REPEATS), models):
    start = time.time()
    dgms = DisjointGenerativeModels(df_train, [model, model])
    df_dgms = dgms.fit_generate()
    end = time.time()

    times.append(end-start)
    dgms_dict[f'{model}_{i}'] = df_dgms

SE = SynthEval(df_train, df_test, verbose=False)
res, rank = SE.benchmark(dgms_dict, analysis_target_var='Status', **metrics, rank_strategy='summation')

Find out more at https://www.synthpop.org.uk/


Consider changing them to factors. You can do it using parameter 'minnumlevels'.

Variable(s): Race, Progesterone.Status, Status numeric but with only 3 or fewer distinct values turned into factor(s) for synthesis.

Synthetic data exported as csv file(s).
Information on synthetic data written to
  /home/lautrup/repositories/disjoint-synthetic-data-generation/synthesis_info_synthpop_temp_0_synthpop.txt 


Find out more at https://www.synthpop.org.uk/


Consider changing them to factors. You can do it using parameter 'minnumlevels'.

Variable(s): Estrogen.Status, N.Stage, A.Stage numeric but with only 3 or fewer distinct values turned into factor(s) for synthesis.

Synthetic data exported as csv file(s).
Information on synthetic data written to
  /home/lautrup/repositories/disjoint-synthetic-data-generation/synthesis_info_synthpop_temp_1_synthpop.txt 
Adding ROOT T Stage
Adding ROOT Marital Status
Adding attribute Tumor Size
Adding attribute Age
Adding attribute Survival Months
Adding attribute Sixth Stage
Adding attribute N Stage
Adding attribute Status
Adding attribute Reginol Node Positive
Adding attribute Grade
Adding attribute Regional Node Examined
Adding attribute differentiate
Adding attribute A Stage
Adding attribute Race
Adding attribute Progesterone Status
Adding attribute Estrogen Status


[2024-11-25T12:48:57.720145+0100][1179366][CRITICAL] module disabled: /home/lautrup/sdg_env/lib/python3.10/site-packages/synthcity/plugins/generic/plugin_goggle.py
[2024-11-25T12:48:57.815345+0100][1179367][CRITICAL] module disabled: /home/lautrup/sdg_env/lib/python3.10/site-packages/synthcity/plugins/generic/plugin_goggle.py
 40%|███▉      | 799/2000 [08:27<12:42,  1.58it/s]  
 37%|███▋      | 749/2000 [09:44<16:15,  1.28it/s]
Find out more at https://www.synthpop.org.uk/


Consider changing them to factors. You can do it using parameter 'minnumlevels'.

Variable(s): Progesterone.Status, A.Stage numeric but with only 3 or fewer distinct values turned into factor(s) for synthesis.

Synthetic data exported as csv file(s).
Information on synthetic data written to
  /home/lautrup/repositories/disjoint-synthetic-data-generation/synthesis_info_synthpop_temp_0_synthpop.txt 


Find out more at https://www.synthpop.org.uk/


Consider changing them to factors. You can do it using parameter 'minnumlevels'.

Variable(s): Estrogen.Status, Status, Race, N.Stage numeric but with only 3 or fewer distinct values turned into factor(s) for synthesis.

Synthetic data exported as csv file(s).
Information on synthetic data written to
  /home/lautrup/repositories/disjoint-synthetic-data-generation/synthesis_info_synthpop_temp_1_synthpop.txt 
Adding ROOT Progesterone Status
Adding ROOT T Stage
Adding attribute Grade
Adding attribute Tumor Size
Adding attribute differentiate
Adding attribute Survival Months
Adding attribute Age
Adding attribute Reginol Node Positive
Adding attribute Sixth Stage
Adding attribute Status
Adding attribute N Stage
Adding attribute Regional Node Examined
Adding attribute Race
Adding attribute Marital Status
Adding attribute Estrogen Status
Adding attribute A Stage


[2024-11-25T12:59:12.738857+0100][1184743][CRITICAL] module disabled: /home/lautrup/sdg_env/lib/python3.10/site-packages/synthcity/plugins/generic/plugin_goggle.py
[2024-11-25T12:59:12.955432+0100][1184741][CRITICAL] module disabled: /home/lautrup/sdg_env/lib/python3.10/site-packages/synthcity/plugins/generic/plugin_goggle.py
 27%|██▋       | 549/2000 [05:23<14:15,  1.70it/s]  
 37%|███▋      | 749/2000 [08:21<13:56,  1.49it/s]
Find out more at https://www.synthpop.org.uk/


Consider changing them to factors. You can do it using parameter 'minnumlevels'.

Variable(s): Race, N.Stage, Status, Progesterone.Status numeric but with only 3 or fewer distinct values turned into factor(s) for synthesis.

Synthetic data exported as csv file(s).
Information on synthetic data written to
  /home/lautrup/repositories/disjoint-synthetic-data-generation/synthesis_info_synthpop_temp_0_synthpop.txt 


Find out more at https://www.synthpop.org.uk/


Consider changing them to factors. You can do it using parameter 'minnumlevels'.

Variable(s): Estrogen.Status, A.Stage numeric but with only 3 or fewer distinct values turned into factor(s) for synthesis.

Synthetic data exported as csv file(s).
Information on synthetic data written to
  /home/lautrup/repositories/disjoint-synthetic-data-generation/synthesis_info_synthpop_temp_1_synthpop.txt 
Adding ROOT Race
Adding ROOT Survival Months
Adding attribute Age
Adding attribute Tumor Size
Adding attribute Regional Node Examined
Adding attribute Marital Status
Adding attribute Sixth Stage
Adding attribute T Stage
Adding attribute N Stage
Adding attribute Reginol Node Positive
Adding attribute Estrogen Status
Adding attribute A Stage
Adding attribute differentiate
Adding attribute Grade
Adding attribute Status
Adding attribute Progesterone Status


[2024-11-25T13:08:15.399661+0100][1187317][CRITICAL] module disabled: /home/lautrup/sdg_env/lib/python3.10/site-packages/synthcity/plugins/generic/plugin_goggle.py
[2024-11-25T13:08:15.425318+0100][1187316][CRITICAL] module disabled: /home/lautrup/sdg_env/lib/python3.10/site-packages/synthcity/plugins/generic/plugin_goggle.py
 25%|██▍       | 499/2000 [05:39<17:00,  1.47it/s]  
 22%|██▏       | 449/2000 [05:40<19:35,  1.32it/s]


In [6]:
print(times)

res

[4.390730619430542, 17.136704921722412, 593.9686894416809, 3.6662650108337402, 17.34406304359436, 511.2317461967468, 4.5235512256622314, 26.698811292648315, 350.2445001602173]


dataset,corr_mat_diff (value),corr_mat_diff (error),auroc (value),auroc (error),cls_F1_diff (value),cls_F1_diff (error),cls_F1_diff_hout (value),cls_F1_diff_hout (error),eps_identif_risk (value),eps_identif_risk (error),priv_loss_eps (value),priv_loss_eps (error),rank,u_rank,p_rank
synthpop_0,2.125247,,0.005504,,0.014278,0.008249,0.011222,0.004756,0.217002,,0.0522,,5.682083,3.951285,1.730798
datasynthesizer_0,1.160832,,0.019208,,0.022066,0.008552,0.019962,0.006443,0.413497,,0.134228,,5.381364,3.92909,1.452274
ctgan_0,2.653008,,0.130907,,0.23393,0.009211,0.210169,0.037832,0.110365,,0.014169,,5.278352,3.402885,1.875466
synthpop_1,2.551341,,0.028732,,0.019998,0.008712,0.017652,0.01075,0.187919,,0.056301,,5.668135,3.912356,1.755779
datasynthesizer_1,2.019936,,0.01716,,0.01682,0.008174,0.026643,0.011433,0.237509,,0.054437,,5.630598,3.922544,1.708054
ctgan_1,2.170008,,0.204638,,0.25388,0.007676,0.234362,0.028539,0.15921,,0.020507,,5.10932,3.289037,1.820283
synthpop_2,1.18446,,0.008992,,0.025971,0.008363,0.015773,0.009322,0.391872,,0.140194,,5.407328,3.939394,1.467934
datasynthesizer_2,2.367455,,0.132487,,0.230625,0.00836,0.212417,0.03202,0.186055,,0.04698,,5.171707,3.404742,1.766965
ctgan_2,2.085125,,0.120284,,0.260141,0.007292,0.234294,0.030191,0.228561,,0.050708,,5.088636,3.367905,1.720731
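The flat `times` list interleaves the three models, so a per-model view is easier to read. The sketch below copies the printed list and regroups it, relying on the fact that `product(range(NUM_REPEATS), models)` varies the model fastest.

```python
from statistics import mean

models = ["synthpop", "datasynthesizer", "ctgan"]

# Copied from the printed output of `times` above (seconds).
times = [
    4.390730619430542, 17.136704921722412, 593.9686894416809,
    3.6662650108337402, 17.34406304359436, 511.2317461967468,
    4.5235512256622314, 26.698811292648315, 350.2445001602173,
]

# Entry k of `times` belongs to models[k % len(models)].
times_per_model = {m: times[j::len(models)] for j, m in enumerate(models)}
mean_time = {m: mean(ts) for m, ts in times_per_model.items()}

for m in models:
    print(f"{m:16s} mean = {mean_time[m]:7.2f} s over {len(times_per_model[m])} repeats")
```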


In [None]:
# 'dataset' appears to be the index of `res` (see the table above), so filter on the index
res_datasynthesizer = res[res.index.str.startswith('datasynthesizer')]
res_synthpop = res[res.index.str.startswith('synthpop')]
res_ctgan = res[res.index.str.startswith('ctgan')]

res_datasynthesizer, res_synthpop, res_ctgan

In [7]:
res.std()

corr_mat_diff     value    0.533358
                  error         NaN
auroc             value    0.073372
                  error         NaN
cls_F1_diff       value    0.118868
                  error    0.000558
cls_F1_diff_hout  value    0.108196
                  error    0.012845
eps_identif_risk  value    0.101673
                  error         NaN
priv_loss_eps     value    0.044531
                  error         NaN
rank                       0.237092
u_rank                     0.299699
p_rank                     0.145567
dtype: object
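Note that `res.std()` pools all nine runs, so the spread mixes differences between models with the variation caused by the random subdivision. A per-model breakdown could look like the sketch below. It uses only the two privacy columns, with values copied from the table above, and flat column names as a simplification of the two-level columns in `res`.

```python
import pandas as pd

# Values copied from the Experiment 1 results table (the "value" sub-columns).
runs = pd.DataFrame(
    {
        "eps_identif_risk": [0.217002, 0.413497, 0.110365, 0.187919, 0.237509,
                             0.159210, 0.391872, 0.186055, 0.228561],
        "priv_loss_eps": [0.052200, 0.134228, 0.014169, 0.056301, 0.054437,
                          0.020507, 0.140194, 0.046980, 0.050708],
    },
    index=["synthpop_0", "datasynthesizer_0", "ctgan_0",
           "synthpop_1", "datasynthesizer_1", "ctgan_1",
           "synthpop_2", "datasynthesizer_2", "ctgan_2"],
)

# Strip the repeat suffix ("_0", "_1", ...) to recover the model name.
runs["model"] = [name.rsplit("_", 1)[0] for name in runs.index]

# Mean and sample standard deviation across the repeats of each model.
per_model = runs.groupby("model").agg(["mean", "std"])
print(per_model)
```

With three repeats per model, these standard deviations are only indicative.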

The quality of the data varies with both the subdivision and the model. This points to two important observations:
 - Random subdivision and joining can lead to data of differing quality.
 - Some of the resulting datasets are actually quite good.

This is a good sign, as it means that we can potentially find a subdivision and joining that lead to high-quality data. It also raises further research questions, such as "how can records be joined such that the quality of the data is maximized?" and "do methods of subdivision other than random lead to better-quality data?".

Before moving on to answer any of these questions, we will first investigate the effect of different sized subdivisions on the quality of the data.

### Experiment 2: Random Subdivision into different parts
This experiment continues from the previous one, but instead of subdividing the data into equally sized parts, we randomly subdivide it into different fractions. We then compare the quality of the generated data using the same metrics as before.
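An unequal random subdivision can be sketched in the same spirit. The helper `random_fraction_partition` and its `fraction` parameter are hypothetical and used here for illustration only; they are not part of the `disjoint_generative_model` API.

```python
import random

def random_fraction_partition(columns, fraction, seed=None):
    """Split columns into two disjoint parts; the first gets roughly `fraction` of them."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must lie strictly between 0 and 1")
    rng = random.Random(seed)
    shuffled = list(columns)
    rng.shuffle(shuffled)
    # Clamp the cut point so that each part keeps at least one column.
    cut = min(max(1, round(fraction * len(shuffled))), len(shuffled) - 1)
    return shuffled[:cut], shuffled[cut:]

all_columns = [
    "Age", "Race", "Marital Status", "T Stage", "N Stage", "Sixth Stage",
    "differentiate", "Grade", "A Stage", "Tumor Size", "Estrogen Status",
    "Progesterone Status", "Regional Node Examined", "Reginol Node Positive",
    "Survival Months", "Status",
]

for fraction in (0.25, 0.5, 0.75):
    small, large = random_fraction_partition(all_columns, fraction, seed=0)
    print(fraction, len(small), len(large))
```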