# Experiments with marginal metrics

While pairwise and dataset-wide metrics are certain to be affected by the disjoint generative models (DGMs) workflow, marginal metrics were initially assumed to be unaffected, apart from being aggregated in proportion to the size of each partition. This notebook explores the behavior of marginal metrics in the context of DGMs and shows that the joining operation does affect them.

The following metrics are considered:

- Hellinger distance
- Kolmogorov-Smirnov/TVD statistic
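
For reference, the two marginal metrics can be sketched for a single column as below. This is a minimal illustration using NumPy and SciPy, not SynthEval's implementation: the histogram binning (`bins=10`), the helper names `hellinger_distance` and `total_variation_distance`, and the split of "KS for numerical columns, TVD for categorical columns" are assumptions made for the example.

```python
import numpy as np
from scipy.stats import ks_2samp


def hellinger_distance(real, synth, bins=10):
    """Hellinger distance between histograms of two 1-D numeric samples."""
    real = np.asarray(real, dtype=float)
    synth = np.asarray(synth, dtype=float)
    # Use shared bin edges so that the two histograms are comparable.
    edges = np.histogram_bin_edges(np.concatenate([real, synth]), bins=bins)
    p, _ = np.histogram(real, bins=edges)
    q, _ = np.histogram(synth, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))


def total_variation_distance(real, synth):
    """Total variation distance between empirical category frequencies."""
    real = np.asarray(real)
    synth = np.asarray(synth)
    categories = np.union1d(real, synth)
    p = np.array([np.mean(real == c) for c in categories])
    q = np.array([np.mean(synth == c) for c in categories])
    return float(0.5 * np.abs(p - q).sum())


# Toy data: two normal samples with slightly different means.
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 1000)
b = rng.normal(0.5, 1.0, 1000)

ks = ks_2samp(a, b)  # two-sample Kolmogorov-Smirnov test
print(hellinger_distance(a, b), ks.statistic, ks.pvalue)
```

Both distances lie in [0, 1], with 0 meaning identical empirical marginals. A dataset-level score such as `avg_h_dist` would then be an average of the per-column values.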

In [1]:
### Imports
import pandas as pd

from pandas import DataFrame
from typing import List, Dict

from joblib import Parallel, delayed

from syntheval import SynthEval

from sklearn.ensemble import RandomForestClassifier
from disjoint_generative_model import DisjointGenerativeModels
from disjoint_generative_model.utils.joining_validator import JoiningValidator
from disjoint_generative_model.utils.joining_strategies import UsingJoiningValidator
from disjoint_generative_model.utils.dataset_manager import random_split_columns
from disjoint_generative_model.utils.generative_model_adapters import generate_synthetic_data

### Metrics
metrics = {
    "h_dist"    : {},
    "ks_test"   : {},
}

In [2]:
### Load data
df_train = pd.read_csv('experiments/datasets/hepatitis_train.csv')
df_test = pd.read_csv('experiments/datasets/hepatitis_test.csv')

label = 'b_class'

In [None]:
df_cart = generate_synthetic_data(df_train, 'synthpop')

df_bn = generate_synthetic_data(df_train, 'datasynthesizer')

df_ctgan = generate_synthetic_data(df_train, 'ctgan')

Find out more at https://www.synthpop.org.uk/



Variable(s): WBC, RBC, Plat, RNA.Base, RNA.4, RNA.12, RNA.EOT, RNA.EF have been changed for synthesis from character to factor.

Variable(s): Gender, Fever, Nausea.Vomting, Headache, Diarrhea, Fatigue...generalized.bone.ache, Jaundice, Epigastric.pain, b_class numeric but with only 3 or fewer distinct values turned into factor(s) for synthesis.

Synthetic data exported as csv file(s).
Information on synthetic data written to
  /home/lautrup/repositories/disjoint-synthetic-data-generation/synthesis_info_synthpop_temp_0_synthpop.txt 
Adding ROOT Baseline histological Grading
Adding attribute ALT 24
Adding attribute ALT 4
Adding attribute ALT 1


Exception ignored in: <function _releaseLock at 0x7f18da9a0940>
Traceback (most recent call last):
  File "/usr/lib/python3.10/logging/__init__.py", line 228, in _releaseLock
    def _releaseLock():
KeyboardInterrupt: 


Adding attribute ALT 12
Adding attribute AST 1
Adding attribute Age


In [5]:
### DGM with Random Forest
Rf = RandomForestClassifier(n_estimators=100)
JS = UsingJoiningValidator(JoiningValidator(Rf, verbose=False), patience=5)

prepared_splits = random_split_columns(df_train, {'split1': 1, 'split2': 1})

dgms = DisjointGenerativeModels(df_train,['synthpop', 'datasynthesizer'], 
                                prepared_splits= prepared_splits, 
                                joining_strategy=JS)
dgms.join_multiplier = 8    # to ensure high enough resolution

df_dgms = dgms.fit_generate()[:len(df_train)]

Adding ROOT Fever
Adding attribute ALT 24


Find out more at https://www.synthpop.org.uk/



Variable(s): Plat, RNA.Base, RBC, RNA.EF, WBC, Headache, Nausea.Vomting numeric but with only 3 or fewer distinct values turned into factor(s) for synthesis.

Adding attribute ALT 4
Synthetic data exported as csv file(s).
Information on synthetic data written to
  /home/lautrup/repositories/disjoint-synthetic-data-generation/synthesis_info_synthpop_temp_0_synthpop.txt 
Adding attribute AST 1
Adding attribute HGB
Adding attribute Fatigue & generalized bone ache
Adding attribute Jaundice
Adding attribute Diarrhea
Adding attribute Epigastric pain
Adding attribute Gender
Adding attribute b_class
Adding attribute RNA 12
Adding attribute RNA EOT
Adding attribute RNA 4
Final size of synthetic data: 5090


In [6]:
dfs = {
    'sp' : df_cart,
    'ds' : df_bn,
    'gan' : df_ctgan,
    'dgms' : df_dgms
}

SE = SynthEval(df_train, df_test, verbose=False)
res, _ = SE.benchmark(dfs, analysis_target_var=label,**metrics, rank_strategy='summation')

In [7]:
res

| dataset | avg_h_dist value | avg_h_dist error | ks_tvd_stat value | ks_tvd_stat error | frac_ks_sigs value | frac_ks_sigs error | rank | u_rank | p_rank | f_rank |
|---|---|---|---|---|---|---|---|---|---|---|
| sp | 0.005138 | 0.001426 | 0.014506 | 0.002161 | 0.0 | | 2.980356 | 2.980356 | 0.0 | 0.0 |
| ds | 0.011269 | 0.003119 | 0.020845 | 0.002543 | 0.137931 | | 2.829955 | 2.829955 | 0.0 | 0.0 |
| gan | 0.015594 | 0.003316 | 0.05873 | 0.008581 | 0.551724 | | 2.373952 | 2.373952 | 0.0 | 0.0 |
| dgms | 0.015053 | 0.00335 | 0.038575 | 0.004713 | 0.241379 | | 2.704992 | 2.704992 | 0.0 | 0.0 |


In [6]:
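# Expected avg_h_dist if the marginals were simply aggregated in proportion to partition size
# (0.005138 and 0.011269 are the avg_h_dist values of 'sp' and 'ds' in the table above)
exp = (0.005138*len(prepared_splits['split1'])+0.011269*len(prepared_splits['split2']))/len(df_train.columns)
print(exp)
# Deviation of the observed DGM value (0.015 +/- 0.003, rounded) from the expectation, in standard errors
t = abs(0.015-exp)/0.003
t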

0.008097793103448275


2.3007356321839083

It appears that both metrics are affected by the DGMs workflow. This is likely because the joining operation is not completely random, which affects which records are carried over to the joined dataset.
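
The suspected mechanism can be illustrated with a toy simulation that is independent of the DGM library. In this sketch a single binary column is oversampled, and rows are then kept either uniformly at random or with a value-dependent preference. The acceptance weights are a hypothetical stand-in for a validator, not a description of how `JoiningValidator` actually scores records.

```python
import numpy as np

rng = np.random.default_rng(0)
n, multiplier = 1000, 8  # oversample the candidates, then keep only n rows

real = rng.binomial(1, 0.3, size=n)
pool = rng.binomial(1, 0.3, size=n * multiplier)  # candidates with the same marginal


def tvd_binary(x, y):
    """Total variation distance between two binary samples."""
    return float(abs(np.mean(x) - np.mean(y)))


# Random joining: every candidate is equally likely to be kept.
random_keep = rng.choice(pool, size=n, replace=False)

# Selective joining (hypothetical): rows with value 1 are twice as likely to be kept.
weights = np.where(pool == 1, 2.0, 1.0)
selective_keep = rng.choice(pool, size=n, replace=False, p=weights / weights.sum())

print("random   :", tvd_binary(real, random_keep))
print("selective:", tvd_binary(real, selective_keep))
```

Even though every candidate is drawn from the correct marginal, the selective subset drifts away from it, because the selection itself depends on the column values.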

To check that the joining operation is responsible, we run one more experiment using plain concatenation instead of the joining validator.

In [4]:
### DGM with Concatenation
prepared_splits = random_split_columns(df_train, {'split1': 1, 'split2': 1})

dgms = DisjointGenerativeModels(df_train,['synthpop', 'datasynthesizer'], 
                                prepared_splits= prepared_splits)

df_dgms = dgms.fit_generate()[:len(df_train)]

SE = SynthEval(df_train, df_test, verbose=False)
SE.evaluate(df_dgms, analysis_target_var=label, **metrics)



Adding ROOT Baseline histological Grading
Adding attribute ALT 4


Find out more at https://www.synthpop.org.uk/



Variable(s): Fever, WBC, RNA.4, RNA.Base, RNA.EF, RNA.12, Jaundice, Gender, Headache, Plat, RBC, Epigastric.pain numeric but with only 3 or fewer distinct values turned into factor(s) for synthesis.

Adding attribute ALT 12
Synthetic data exported as csv file(s).
Information on synthetic data written to
  /home/lautrup/repositories/disjoint-synthetic-data-generation/synthesis_info_synthpop_temp_0_synthpop.txt 
Adding attribute ALT 1
Adding attribute AST 1
Adding attribute ALT 48
Adding attribute ALT 36
Adding attribute BMI
Adding attribute HGB
Adding attribute Fatigue & generalized bone ache
Adding attribute Diarrhea
Adding attribute Nausea/Vomting
Adding attribute b_class
Adding attribute RNA EOT


| | metric | dim | val | err | n_val | n_err |
|---|---|---|---|---|---|---|
| 0 | avg_h_dist | u | 0.006548 | 0.001403 | 0.993452 | 0.001403 |
| 1 | ks_tvd_stat | u | 0.016847 | 0.001765 | 0.983153 | 0.001765 |
| 2 | frac_ks_sigs | u | 0.0 | | 1.0 | |

In [7]:
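# Same size-weighted expectation as before ('sp' and 'ds' avg_h_dist values from the benchmark table)
exp = (0.005138*len(prepared_splits['split1'])+0.011269*len(prepared_splits['split2']))/len(df_train.columns)
print(exp)
# Deviation of the observed concatenation value (0.007 +/- 0.0014, rounded) in standard errors
t = abs(0.007-exp)/0.0014
t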

0.008097793103448275


0.784137931034482

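With concatenation, the observed average Hellinger distance deviates from the size-weighted expectation by less than one standard error (t ≈ 0.78 with the rounded values used above), compared with t ≈ 2.3 for the joining validator. This is more plausible, and it is consistent with the selective joining being the cause of the shift.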
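
The hand computation in the two check cells can be wrapped in small helpers, sketched below. The function names are illustrative. The partition sizes 15 and 14 are not printed anywhere in the notebook; they are inferred from the 29 columns and the printed expectation of 0.008097793, so treat them as an assumption.

```python
def size_weighted_expectation(values, sizes):
    """Mean of per-model metric values, weighted by partition size."""
    if len(values) != len(sizes):
        raise ValueError("values and sizes must have the same length")
    return sum(v * s for v, s in zip(values, sizes)) / sum(sizes)


def deviation_in_standard_errors(observed, expected, error):
    """Absolute deviation from the expectation, in units of the standard error."""
    return abs(observed - expected) / error


# avg_h_dist of 'sp' and 'ds' from the benchmark table; sizes 15 and 14 are inferred.
exp = size_weighted_expectation([0.005138, 0.011269], [15, 14])

# Unrounded observed values and errors from the two result tables.
t_validator = deviation_in_standard_errors(0.015053, exp, 0.00335)
t_concat = deviation_in_standard_errors(0.006548, exp, 0.001403)
print(exp, t_validator, t_concat)
```

With the unrounded table values the deviations come out at roughly 2.1 and 1.1 standard errors, so the qualitative picture is the same as with the rounded inputs: the joining validator result sits further from the size-weighted expectation than the concatenation result.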