# YData-Synthetic Demo

https://docs.synthetic.ydata.ai/1.3/

https://github.com/ydataai/ydata-synthetic

`ydata-synthetic` is a GAN-oriented synthetic data (SD) library. It provides several GANs for synthesising tabular and sequential data. At the moment, it does not support GANs with differential privacy (DP). As of 2024-01-11, the library includes the following models:

* GAN
* CGAN (Conditional GAN)
* WGAN (Wasserstein GAN)
* WGAN-GP (Wasserstein GAN with Gradient Penalty)
* DRAGAN (Deep Regret Analytic GAN)
* Cramer GAN (Cramer Distance Solution to Biased Wasserstein Gradients)
* CWGAN-GP (Conditional Wasserstein GAN with Gradient Penalty)
* CTGAN (Conditional Tabular GAN)
* TimeGAN (specifically for time-series data)
* DoppelGANger (specifically for time-series data)

In addition, it supports one probabilistic model, the Gaussian mixture model (GMM), which models the data as a mixture of several Gaussian distributions. Compared with GANs, GMMs are fast and easy to train. However, they may struggle to capture the complexity of real-world data distributions.
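
To make the idea of a Gaussian mixture concrete, here is a minimal sketch of sampling from a one-dimensional mixture using only the standard library. The function `sample_gmm` and the weights, means and standard deviations are hypothetical and chosen for illustration; this is not how `ydata-synthetic` implements its GMM.

```python
import random

def sample_gmm(weights, means, stds, n, seed=0):
    """Draw n samples from a 1-D Gaussian mixture (illustrative only)."""
    rng = random.Random(seed)
    components = list(range(len(weights)))
    samples = []
    for _ in range(n):
        # Pick a component according to the mixture weights...
        k = rng.choices(components, weights=weights)[0]
        # ...then draw from that component's Gaussian.
        samples.append(rng.gauss(means[k], stds[k]))
    return samples

# Hypothetical two-component mixture, not fitted to any real data
gmm_samples = sample_gmm(weights=[0.5, 0.5], means=[0.0, 10.0],
                         stds=[1.0, 1.0], n=1000)
gmm_mean = sum(gmm_samples) / len(gmm_samples)
print(f"sample mean: {gmm_mean:.2f}")
```

With equal weights, the sample mean should land roughly halfway between the two component means. A real-world column with heavy tails or spikes (such as `capital-gain`) would need many components to approximate well, which is one way the limitation mentioned above can show up.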

Note that importing `pandas`, `numpy`, `matplotlib` and `seaborn` may raise an error after installing `ydata-synthetic`, because of inconsistent dependencies.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

When installing `ydata-synthetic`, `conda` also installs `pmlb` (a dataset library that provides access to public datasets, including the Adult dataset). However, that installation is incomplete. After installing `ydata-synthetic`, uninstall `pmlb` with `conda remove pmlb --force` and then install the newest version of `pmlb`.

In [3]:
from pmlb import fetch_data
from ydata_synthetic.synthesizers.regular import RegularSynthesizer
from ydata_synthetic.synthesizers import ModelParameters, TrainParameters

In [4]:
%cd /Users/alex/PETsARD

/Users/alex/PETsARD


In [4]:
data = fetch_data('adult')

The column types must be specified before fitting the synthesizer. Numerical columns are min-max scaled, while categorical columns are one-hot encoded.
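
As a rough illustration of the two transformations named above, the following sketch applies min-max scaling and one-hot encoding to toy values in plain Python. The helper names `min_max_scale` and `one_hot` are hypothetical, and the library's own preprocessing may differ in its details.

```python
def min_max_scale(values):
    """Rescale numbers linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """Encode each value as a 0/1 vector over the sorted unique categories."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

# Toy values, chosen for illustration only
ages = [20, 30, 40]
genders = ['Male', 'Female', 'Male']

scaled_ages = min_max_scale(ages)
categories, encoded = one_hot(genders)
print(scaled_ages)
print(categories, encoded)
```

One-hot encoding adds one column per category, so a high-cardinality column such as `native-country` widens the training matrix considerably.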

In [5]:
# Column names as used in the pmlb version of the Adult dataset
num_cols = ['age', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']
cat_cols = ['workclass', 'education', 'education-num', 'marital-status',
            'occupation', 'relationship', 'race', 'sex', 'native-country', 'target']

In [5]:
df = pd.read_csv('[Adt Income] adult.csv')

In [6]:
df.dtypes

age                 int64
workclass          object
fnlwgt              int64
education          object
educational-num     int64
marital-status     object
occupation         object
relationship       object
race               object
gender             object
capital-gain        int64
capital-loss        int64
hours-per-week      int64
native-country     object
income             object
dtype: object

In [16]:
# Column names as they appear in the CSV file (see df.dtypes);
# 'educational-num' is treated as numerical here
num_cols = ['age', 'fnlwgt', 'educational-num', 'capital-gain', 'capital-loss', 'hours-per-week']
cat_cols = ['workclass', 'education', 'marital-status',
            'occupation', 'relationship', 'race', 'gender', 'native-country', 'income']

`ModelParameters` defines the hyperparameters used to construct a model, and `TrainParameters` contains the parameters used in the training process. All customisable parameters are shown below. Unfortunately, there is no documentation explaining these parameters, so further code review is needed for a better understanding.

```python
_model_parameters = ['batch_size', 'lr', 'betas', 'layers_dim', 'noise_dim',
                     'n_cols', 'seq_len', 'condition', 'n_critic', 'n_features', 
                     'tau_gs', 'generator_dims', 'critic_dims', 'l2_scale', 
                     'latent_dim', 'gp_lambda', 'pac', 'gamma', 'tanh']
_model_parameters_df = [128, 1e-4, (None, None), 128, 264,
                        None, None, None, 1, None, 0.2, [256, 256], 
                        [256, 256], 1e-6, 128, 10.0, 10, 1, False]

_train_parameters = ['cache_prefix', 'label_dim', 'epochs', 'sample_interval', 
                     'labels', 'n_clusters', 'epsilon', 'log_frequency', 
                     'measurement_cols', 'sequence_length', 'number_sequences', 
                     'sample_length', 'rounds']
defaults=('', None, 300, 50, None, 10, 0.005, True, None, 1, 1, 1, 1)
```
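
Because the listing above gives names and default values as separate sequences, it can be hard to see which default belongs to which parameter. The sketch below pairs them by position. It assumes the defaults are listed in the same order as the names, which the matching lengths suggest but which has not been confirmed against the library's documentation. The variable names are hypothetical.

```python
# Names and defaults copied from the listing above
model_parameter_names = ['batch_size', 'lr', 'betas', 'layers_dim', 'noise_dim',
                         'n_cols', 'seq_len', 'condition', 'n_critic', 'n_features',
                         'tau_gs', 'generator_dims', 'critic_dims', 'l2_scale',
                         'latent_dim', 'gp_lambda', 'pac', 'gamma', 'tanh']
model_parameter_defaults = [128, 1e-4, (None, None), 128, 264,
                            None, None, None, 1, None, 0.2, [256, 256],
                            [256, 256], 1e-6, 128, 10.0, 10, 1, False]

train_parameter_names = ['cache_prefix', 'label_dim', 'epochs', 'sample_interval',
                         'labels', 'n_clusters', 'epsilon', 'log_frequency',
                         'measurement_cols', 'sequence_length', 'number_sequences',
                         'sample_length', 'rounds']
train_parameter_defaults = ('', None, 300, 50, None, 10, 0.005, True, None, 1, 1, 1, 1)

# Assumption: the i-th default belongs to the i-th name
assert len(model_parameter_names) == len(model_parameter_defaults)
assert len(train_parameter_names) == len(train_parameter_defaults)

model_defaults = dict(zip(model_parameter_names, model_parameter_defaults))
train_defaults = dict(zip(train_parameter_names, train_parameter_defaults))

for name, value in {**model_defaults, **train_defaults}.items():
    print(f"{name:>18} = {value!r}")
```

Under this pairing, the CTGAN run in this notebook overrides `batch_size`, `lr`, `betas` and `epochs`, and leaves every other parameter at its default.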

In [8]:
# Define model and training parameters
ctgan_args = ModelParameters(batch_size=500, lr=2e-4, betas=(0.5, 0.9))
train_args = TrainParameters(epochs=101)

# Train the generator model
synth = RegularSynthesizer(modelname='ctgan', model_parameters=ctgan_args)
synth.fit(data=df, train_arguments=train_args, num_cols=num_cols, cat_cols=cat_cols)

# Generate 1000 new synthetic samples
synth_data = synth.sample(1000) 



Epoch: 0 | critic_loss: 0.25221312046051025 | generator_loss: 1.2265512943267822
Epoch: 1 | critic_loss: 0.21777501702308655 | generator_loss: 0.8753871917724609
Epoch: 2 | critic_loss: 0.2642207145690918 | generator_loss: 0.1758331060409546
Epoch: 3 | critic_loss: 0.09298767149448395 | generator_loss: -0.6084940433502197
Epoch: 4 | critic_loss: 0.05742849409580231 | generator_loss: -0.6408810019493103
Epoch: 5 | critic_loss: 0.07060733437538147 | generator_loss: -0.9243089556694031
Epoch: 6 | critic_loss: 0.08761823177337646 | generator_loss: -0.9330343008041382
Epoch: 7 | critic_loss: 0.046502768993377686 | generator_loss: -0.9552236795425415
Epoch: 8 | critic_loss: 0.13399776816368103 | generator_loss: -0.734030544757843
Epoch: 9 | critic_loss: 0.16479483246803284 | generator_loss: -0.8043594360351562
Epoch: 10 | critic_loss: -0.04691226780414581 | generator_loss: -0.7169569730758667
Epoch: 11 | critic_loss: -0.12420661002397537 | generator_loss: -0.4807487428188324
Epoch: 12 | crit

In [9]:
synth_data

| | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 53 | Private | 158766 | HS-grad | 8 | Divorced | Craft-repair | Not-in-family | White | Female | -10 | -1 | 39 | United-States | <=50K |
| 1 | 50 | ? | 166353 | HS-grad | 9 | Widowed | ? | Unmarried | White | Female | 5 | 0 | 40 | United-States | <=50K |
| 2 | 17 | Private | 301585 | HS-grad | 9 | Never-married | Protective-serv | Own-child | White | Male | -2 | 0 | 39 | United-States | <=50K |
| 3 | 33 | Private | 121284 | Some-college | 9 | Divorced | Tech-support | Not-in-family | White | Male | -17 | -1 | 39 | United-States | <=50K |
| 4 | 17 | Private | 101175 | Some-college | 10 | Never-married | Other-service | Not-in-family | White | Female | -17 | 0 | 19 | United-States | <=50K |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 17 | Private | 98544 | Some-college | 10 | Never-married | Sales | Own-child | White | Female | -11 | 0 | 30 | United-States | <=50K |
| 996 | 32 | Private | 323185 | Bachelors | 12 | Divorced | Prof-specialty | Not-in-family | White | Male | -25 | 0 | 59 | United-States | <=50K |
| 997 | 17 | Private | 93866 | Some-college | 9 | Never-married | Farming-fishing | Own-child | White | Female | 8 | 0 | 9 | United-States | <=50K |
| 998 | 16 | ? | 188623 | HS-grad | 9 | Never-married | Protective-serv | Own-child | White | Male | -12 | 0 | 39 | United-States | <=50K |


In [17]:
# Train the GMM
synth_gmm = RegularSynthesizer(modelname='fast')
synth_gmm.fit(data=df, cat_cols=cat_cols, num_cols=num_cols)

# Generate 1000 new synthetic samples
synth_gmm_data = synth_gmm.sample(1000) 

Hyperparameter search: 100%|██████████| 8/8 [07:12<00:00, 54.04s/it]


In [18]:
synth_gmm_data

| | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 35 | Private | 165855 | Masters | 13 | Never-married | Prof-specialty | Not-in-family | White | Male | 1157 | -44 | 48 | United-States | <=50K |
| 1 | 58 | Private | 237037 | Bachelors | 10 | Divorced | Sales | Not-in-family | White | Male | 1322 | 378 | 45 | United-States | <=50K |
| 2 | 37 | Private | 118320 | HS-grad | 6 | Divorced | Craft-repair | Own-child | White | Female | 4773 | 682 | 38 | United-States | <=50K |
| 3 | 20 | Private | 101727 | HS-grad | 7 | Never-married | Adm-clerical | Own-child | Black | Male | 4788 | -764 | 50 | United-States | <=50K |
| 4 | 34 | Private | 281083 | Some-college | 10 | Never-married | Prof-specialty | Not-in-family | White | Female | 3464 | 371 | 10 | United-States | <=50K |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | 30 | Private | 275246 | Some-college | 12 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 17781 | -458 | 48 | United-States | <=50K |
| 996 | 20 | Private | 20268 | HS-grad | 5 | Married-civ-spouse | Transport-moving | Husband | White | Male | -1311 | 175 | 45 | United-States | <=50K |
| 997 | 39 | Private | 345160 | HS-grad | 10 | Married-civ-spouse | Craft-repair | Husband | White | Male | -5733 | 380 | 24 | United-States | <=50K |
| 998 | 32 | Private | 221465 | Assoc-voc | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | -8204 | 134 | 66 | United-States | >50K |
