# Application in Synthetic Data Evaluation

This notebook illustrates how the PCA metrics can be applied to evaluate the quality of tabular synthetic data.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pcametric import PCAMetric

## Recreating Table 1 from the paper
In this section we show a basic example of the PCA metrics in action. We generate several synthetic datasets of varying quality (some engineered as adversarial examples) and compare how the PCA metrics and the regular metrics score each of them.
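
To give an intuition for what a PCA-based comparison measures, here is a minimal NumPy sketch. It is not the `pcametric` implementation: the function name `pca_comparison`, the standardisation by the real data's mean and standard deviation, and the exact formulas are assumptions made for illustration only. The paper gives the actual definitions. The sketch compares the explained-variance spectrum of the two datasets and the angle between their first principal components.

```python
import numpy as np

def pca_comparison(real, synth):
    """Illustrative only: compare the PCA structure of two numeric arrays.

    Rows are records, columns are features. Returns a tuple
    (eigenvalue difference, angle between first principal components in radians).
    """
    # Assumption: scale both datasets with the real data's mean and std.
    mu, sd = real.mean(axis=0), real.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)  # avoid division by zero for constant columns
    r = (real - mu) / sd
    s = (synth - mu) / sd

    # eigh returns eigenvalues in ascending order, so the last column of the
    # eigenvector matrix is the first principal component.
    r_vals, r_vecs = np.linalg.eigh(np.cov(r, rowvar=False))
    s_vals, s_vecs = np.linalg.eigh(np.cov(s, rowvar=False))

    # Compare explained-variance ratios rather than raw eigenvalues.
    eigval_diff = float(np.abs(r_vals / r_vals.sum() - s_vals / s_vals.sum()).sum())

    # Eigenvectors are defined up to sign, so use the absolute cosine.
    cos = abs(float(r_vecs[:, -1] @ s_vecs[:, -1]))
    angle = float(np.arccos(np.clip(cos, 0.0, 1.0)))
    return eigval_diff, angle


rng = np.random.default_rng(0)
z = rng.normal(size=500)
real = np.column_stack([z, z + 0.1 * rng.normal(size=500)])    # positively correlated
synth = np.column_stack([z, -z + 0.1 * rng.normal(size=500)])  # negatively correlated
print(pca_comparison(real, real))
print(pca_comparison(real, synth))
```

In this toy setup both columns have nearly the same marginal distribution in `real` and `synth`, but the sign of the dependence is flipped, so the first principal components are close to orthogonal.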

We use the cardiotocography dataset from the [UCI repository](https://archive.ics.uci.edu/dataset/193/cardiotocography) as the base dataset (split randomly into train and test sets at a 66%-33% ratio). Synthetic datasets are generated using several backends, namely [SynthCity](https://github.com/vanderschaarlab/synthcity), [CTGAN](https://github.com/sdv-dev/CTGAN), [DataSynthesizer](https://github.com/DataResponsibly/DataSynthesizer) and [Synthpop](https://www.synthpop.org.uk/get-started.html) in R. The quality of the synthetic datasets is determined using various common metrics implemented in the [SynthEval](https://github.com/schneiderkamplab/syntheval) library. The PCA metrics are then used to compare the quality of the synthetic datasets.
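
The notebook loads pre-split CSV files, so the split itself is not shown. As a rough sketch of how a random 66%-33% split can be produced with pandas, assuming a hypothetical helper `random_split` and an arbitrary seed (neither is taken from the repository):

```python
import pandas as pd

def random_split(df, train_frac=0.66, seed=0):
    """Illustrative only: randomly split a DataFrame into train and test parts."""
    # Sample the training rows, then keep every remaining row as the test set.
    train = df.sample(frac=train_frac, random_state=seed)
    test = df.drop(index=train.index)
    return train.reset_index(drop=True), test.reset_index(drop=True)


# Toy frame standing in for the real dataset.
toy = pd.DataFrame({"feature": range(100), "Class": [i % 3 for i in range(100)]})
toy_train, toy_test = random_split(toy)
print(len(toy_train), len(toy_test))
```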

In [8]:
import pandas as pd

DATA_NAME = 'cardiotocography'

df_train = pd.read_csv(f'datasets/{DATA_NAME}_train.csv')
df_test = pd.read_csv(f'datasets/{DATA_NAME}_test.csv')

In [9]:
from utils.synthetic_data import add_noise_to_dataset, independent_sampling
# Code for generating the data is also available in the utils folder, but we load CSVs here for efficiency.

df_noisy = add_noise_to_dataset(df_train, noise_level=0.1, threshold=5).round(1)

df_indpt = independent_sampling(df_train)

df_tvae_100 = pd.read_csv(f'datasets/synthetic/{DATA_NAME}_tvae_100.csv').round(1)
df_tvae_180 = pd.read_csv(f'datasets/synthetic/{DATA_NAME}_tvae_180.csv').round(1)
df_tvae_300 = pd.read_csv(f'datasets/synthetic/{DATA_NAME}_tvae_300.csv').round(1)
df_tvae_500 = pd.read_csv(f'datasets/synthetic/{DATA_NAME}_tvae_500.csv').round(1)

df_adsgan = pd.read_csv(f'datasets/synthetic/{DATA_NAME}_adsgan.csv').round(1)
df_bn = pd.read_csv(f'datasets/synthetic/{DATA_NAME}_datasynthesizer.csv').round(1)
df_cart = pd.read_csv(f'datasets/synthetic/{DATA_NAME}_synthpop.csv').round(1)

df_best = df_test.sample(frac=0.8, random_state=42)
df_val = df_test.drop(index=df_best.index)
df_best.reset_index(drop=True, inplace=True)
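
The helper `independent_sampling` lives in the repository's `utils` folder and its body is not shown here. As an illustration of what such an adversarial baseline can look like, the sketch below assumes it resamples every column on its own, which keeps each marginal distribution intact while destroying the dependencies between columns. The function name `shuffle_columns_independently` and the seed are hypothetical, and the real helper may work differently.

```python
import numpy as np
import pandas as pd

def shuffle_columns_independently(df, seed=0):
    """Illustrative only: permute each column separately.

    Marginals are preserved exactly; relationships between columns are broken.
    """
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        {col: rng.permutation(df[col].to_numpy()) for col in df.columns}
    )


# Toy example with two perfectly correlated columns.
toy = pd.DataFrame({"x": np.arange(200.0)})
toy["y"] = 2 * toy["x"]
shuffled = shuffle_columns_independently(toy)
print(toy["x"].corr(toy["y"]), shuffled["x"].corr(shuffled["y"]))
```

A dataset built this way can look perfect to metrics that only inspect single columns, which is why it is a useful stress test for metrics that are meant to capture joint structure.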

In [10]:
### Evaluate the synthetic data
from syntheval import SynthEval

metrics = {
    "pca"       : {"preprocess": "mean"},
    "corr_diff" : {"mixed_corr": True},
    "mi_diff"   : {},
    "ks_test"   : {"sig_lvl": 0.05, "n_perms": 1000},
    "hit_rate"  : {"thres_percent": 0.05},
    "eps_risk"  : {},
    "mia_risk"  : {"num_eval_iter": 5}
}

SE = SynthEval(df_train, df_val, unique_threshold=10)
res_df, rank_df = SE.benchmark({'noisy': df_noisy,
                                'indpt': df_indpt,
                                'tvae_100': df_tvae_100,
                                'tvae_180': df_tvae_180,
                                'tvae_300': df_tvae_300,
                                'tvae_500': df_tvae_500,
                                'adsgan': df_adsgan,
                                'bn': df_bn,
                                'cart': df_cart,
                                'best': df_best}, analysis_target_var='Class', rank_strategy='summation', **metrics)

SynthEval: inferred categorical columns...


In [6]:
### Table 1 in the paper ###
res_df.T

| metric | stat | noisy | indpt | tvae_100 | tvae_180 | tvae_300 | tvae_500 | adsgan | bn | cart | best |
|---|---|---|---|---|---|---|---|---|---|---|---|
| pca_eigval_diff | value | 2e-06 | 0.452948 | 0.039835 | 0.208126 | 0.06422 | 0.063852 | 0.119112 | 0.00184 | 0.000222 | 0.003263 |
| pca_eigval_diff | error | | | | | | | | | | |
| pca_eigvec_ang | value | 8e-06 | 0.350485 | 0.480847 | 0.340335 | 0.012754 | 0.035344 | 0.006887 | 0.003861 | 0.003029 | 0.004097 |
| pca_eigvec_ang | error | | | | | | | | | | |
| corr_mat_diff | value | 0.072219 | 6.905142 | 4.383363 | 3.378907 | 2.57978 | 2.235645 | 1.780792 | 2.643394 | 1.200779 | 0.978702 |
| corr_mat_diff | error | | | | | | | | | | |
| mutual_inf_diff | value | 0.169436 | 2.733968 | 10.064334 | 5.396227 | 2.70753 | 2.183597 | 1.803873 | 1.553276 | 1.094276 | 1.982659 |
| mutual_inf_diff | error | | | | | | | | | | |
| ks_tvd_stat | value | 0.001821 | 0.017224 | 0.213303 | 0.144682 | 0.095739 | 0.084122 | 0.082205 | 0.080502 | 0.021894 | 0.030964 |
| ks_tvd_stat | error | 0.001685 | 0.001771 | 0.027761 | 0.020624 | 0.011609 | 0.011684 | 0.015231 | 0.023369 | 0.002748 | 0.003333 |


### Figure 3
Figure 3 shows the behaviour of the PCA metrics during training of a generative autoencoder model. Because of the licensing of the CTGAN code, which we had to adjust slightly, the code for this figure is kept in a separate notebook in a fork of the original repository.

[Link to Notebook in forked repository](https://github.com/notna07/ctgan-with-checkpoints/blob/main/gen_model_training_behaviour.ipynb)

<p align="center">
  <img src="datasets/results/tvae_loss.png" />
</p>


### Figure 4
Figure 4 examines the correlations between the PCA metrics and the regular metrics.

The process and code for creating the correlation heatmap are part of a separate notebook. The blue annotations were added post hoc.

[Link to Notebook in separate repository](https://github.com/schneiderkamplab/syntheval-model-benchmark-example/blob/main/metric_correlations.ipynb)

<p align="center">
  <img src="datasets/results/corr_clust_result.png" />
</p>
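
The heatmap itself is produced in the linked notebook, whose procedure is not reproduced here. As a small-scale sketch of the same idea, the snippet below computes rank correlations between the five metrics shown in the Table 1 output, across the ten synthetic datasets. The choice of Spearman correlation is an assumption made for this illustration; it is computed as the Pearson correlation of the ranks.

```python
import pandas as pd

# Values copied from the Table 1 output in this notebook
# (one row per synthetic dataset, one column per metric).
datasets = ["noisy", "indpt", "tvae_100", "tvae_180", "tvae_300",
            "tvae_500", "adsgan", "bn", "cart", "best"]
table = pd.DataFrame({
    "pca_eigval_diff": [2e-06, 0.452948, 0.039835, 0.208126, 0.06422,
                        0.063852, 0.119112, 0.00184, 0.000222, 0.003263],
    "pca_eigvec_ang": [8e-06, 0.350485, 0.480847, 0.340335, 0.012754,
                       0.035344, 0.006887, 0.003861, 0.003029, 0.004097],
    "corr_mat_diff": [0.072219, 6.905142, 4.383363, 3.378907, 2.57978,
                      2.235645, 1.780792, 2.643394, 1.200779, 0.978702],
    "mutual_inf_diff": [0.169436, 2.733968, 10.064334, 5.396227, 2.70753,
                        2.183597, 1.803873, 1.553276, 1.094276, 1.982659],
    "ks_tvd_stat": [0.001821, 0.017224, 0.213303, 0.144682, 0.095739,
                    0.084122, 0.082205, 0.080502, 0.021894, 0.030964],
}, index=datasets)

# Spearman correlation: rank each metric across datasets, then take Pearson.
spearman = table.rank().corr()
print(spearman.round(2))
```

With only ten datasets the estimates are noisy, so this is best read as a sanity check rather than a substitute for the larger benchmark behind Figure 4.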
