# `smartnoise-synth` Demo

https://docs.smartnoise.org/synth/index.html

https://github.com/opendp/smartnoise-sdk

`ydata-synthetic` is a GAN-oriented synthetic data (SD) library. It provides several GANs for synthesising tabular and sequential data. It does not currently support GANs with differential privacy (DP). As of 2024-01-11, the library includes the following models:

* GAN
* CGAN (Conditional GAN)
* WGAN (Wasserstein GAN)
* WGAN-GP (Wasserstein GAN with Gradient Penalty)
* DRAGAN (Deep Regret Analytic GAN)
* Cramer GAN (Cramer Distance Solution to Biased Wasserstein Gradients)
* CWGAN-GP (Conditional Wasserstein GAN with Gradient Penalty)
* CTGAN (Conditional Tabular GAN)
* TimeGAN (specifically for time-series data)
* DoppelGANger (specifically for time-series data)

It also supports one probabilistic model, the Gaussian mixture model (GMM), which represents the data as a mixture of several Gaussian distributions. Compared to GANs, GMMs are fast and easy to train, but they may fail to capture complex real-world data distributions.

Note that importing `pandas`, `numpy`, `matplotlib`, and `seaborn` may raise errors after installing `ydata-synthetic`, because of inconsistent dependencies.
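One way to see which of these imports is affected is to try each one in turn and record either its version or the error it raises. The sketch below is a minimal check using only the standard library; the helper name `check_imports` is hypothetical and not part of any of the packages mentioned.

```python
import importlib


def check_imports(packages):
    """Try to import each package; map its name to a version string or an error message."""
    report = {}
    for name in packages:
        try:
            module = importlib.import_module(name)
            # Most scientific packages expose __version__; fall back if they do not.
            report[name] = str(getattr(module, "__version__", "unknown version"))
        except Exception as exc:  # ImportError, or errors raised by a broken install
            report[name] = f"FAILED: {type(exc).__name__}: {exc}"
    return report


report = check_imports(["pandas", "numpy", "matplotlib", "seaborn"])
for name, status in report.items():
    print(f"{name}: {status}")
```

Any entry starting with `FAILED` points at a package whose installed version likely conflicts with the other dependencies.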

In [1]:
from snsynth import Synthesizer

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
%cd /Users/alex/PETsARD

/Users/alex/PETsARD


In [4]:
df = pd.read_csv('[Adt Income] adult.csv')

In [5]:
df.dtypes

age                 int64
workclass          object
fnlwgt              int64
education          object
educational-num     int64
marital-status     object
occupation         object
relationship       object
race               object
gender             object
capital-gain        int64
capital-loss        int64
hours-per-week      int64
native-country     object
income             object
dtype: object

In [6]:
synth = Synthesizer.create('dpctgan', epsilon=1.0, verbose=True)
synth_data = synth.fit_sample(df, preprocessor_eps=0.5)

Spent 0.5 epsilon on preprocessor, leaving 0.5 for training

Epoch 1, Loss G: 0.6870, Loss D: 1.3878
epsilon is 0.17837682139144928, alpha is 63.0
Epoch 2, Loss G: 0.6962, Loss D: 1.3859
epsilon is 0.20520467877255033, alpha is 63.0
Epoch 3, Loss G: 0.7053, Loss D: 1.3901
epsilon is 0.23203253615365138, alpha is 63.0
Epoch 4, Loss G: 0.7033, Loss D: 1.3864
epsilon is 0.25886039353475243, alpha is 63.0
Epoch 5, Loss G: 0.7063, Loss D: 1.3871
epsilon is 0.2856882509158534, alpha is 63.0
Epoch 6, Loss G: 0.7097, Loss D: 1.3875
epsilon is 0.3125161082969545, alpha is 63.0
Epoch 7, Loss G: 0.7132, Loss D: 1.3904
epsilon is 0.3393439656780555, alpha is 63.0
Epoch 8, Loss G: 0.7134, Loss D: 1.3850
epsilon is 0.3661718230591566, alpha is 63.0
Epoch 9, Loss G: 0.7145, Loss D: 1.3868
epsilon is 0.3924151985062195, alpha is 60.0
Epoch 10, Loss G: 0.7056, Loss D: 1.3917
epsilon is 0.4171742677984172, alpha is 57.0
Epoch 11, Loss G: 0.7127, Loss D: 1.3824
epsilon is 0.4406523915074723, alpha is 54.0
Epoch 12, Loss G: 0.7044, Loss D: 1.3887
epsilon is 0.46306

The number of epochs that run is determined by $\epsilon$: training stops once the budget is used up. Preprocessing also consumes part of $\epsilon$.
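The budget arithmetic can be sketched in plain Python using the values from the log above. This is only an illustration: the stopping rule in `epochs_within_budget` (a hypothetical helper) is an assumption about how the budget check behaves, not the library's actual implementation. The epoch 12 value is truncated in the log, so it is left out.

```python
total_epsilon = 1.0      # epsilon passed to Synthesizer.create
preprocessor_eps = 0.5   # preprocessor_eps passed to fit_sample
training_budget = total_epsilon - preprocessor_eps  # 0.5 left for training

# Cumulative epsilon reported after epochs 1 to 11 in the log above.
logged_epsilon = [
    0.17837682139144928, 0.20520467877255033, 0.23203253615365138,
    0.25886039353475243, 0.2856882509158534, 0.3125161082969545,
    0.3393439656780555, 0.3661718230591566, 0.3924151985062195,
    0.4171742677984172, 0.4406523915074723,
]


def epochs_within_budget(cumulative_eps, budget):
    """Count the leading epochs whose cumulative epsilon stays within the budget."""
    count = 0
    for eps in cumulative_eps:
        if eps > budget:
            break  # assumed stopping rule: halt once the budget is exceeded
        count += 1
    return count


completed = epochs_within_budget(logged_epsilon, training_budget)
spent = logged_epsilon[completed - 1] if completed else 0.0
remaining = training_budget - spent
print(f"training budget: {training_budget}")
print(f"logged epochs within budget: {completed}, epsilon remaining: {remaining:.4f}")
```

With a smaller training budget, fewer of the logged epochs fit: `epochs_within_budget(logged_epsilon, 0.3)` counts only the first five.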

In [7]:
synth_data

| | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 46408.514160 | HS-grad | 15 | Married-civ-spouse | Sales | Not-in-family | White | Male | 12.148438 | 0.096313 | 40 | United-States | <=50K |
| 1 | 49 | Private | 39769.643066 | Bachelors | 15 | Never-married | Craft-repair | Not-in-family | White | Female | 11.445312 | 0.232422 | 4 | United-States | <=50K |
| 2 | 49 | Private | 27976.399902 | Some-college | 13 | Married-civ-spouse | Sales | Husband | White | Male | 6.187500 | 0.059692 | 60 | United-States | <=50K |
| 3 | 30 | Private | 25726.887695 | Bachelors | 15 | Never-married | Machine-op-inspct | Husband | White | Male | 14.785156 | 0.076416 | 24 | United-States | <=50K |
| 4 | 49 | ? | 22632.566406 | Some-college | 10 | Never-married | Machine-op-inspct | Not-in-family | White | Male | 14.554688 | 0.234619 | 60 | United-States | >50K |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 48837 | 23 | ? | 19822.883301 | Some-college | 10 | Never-married | Sales | Husband | White | Male | 9.707031 | 0.107300 | 20 | United-States | <=50K |
| 48838 | 49 | Private | 23896.626953 | Bachelors | 9 | Divorced | Sales | Husband | White | Male | 5.003906 | 0.042603 | 60 | United-States | <=50K |
| 48839 | 23 | ? | 34436.853027 | 10th | 12 | Never-married | Machine-op-inspct | Husband | White | Male | 7.265625 | 0.114990 | 60 | United-States | <=50K |
| 48840 | 76 | Private | 27775.310547 | Some-college | 13 | Never-married | Prof-specialty | Not-in-family | White | Male | 4.777344 | 0.035889 | 20 | United-States | <=50K |


In [8]:
from snsynth.transform import NoTransformer

synth_no = Synthesizer.create('dpctgan', epsilon=1.0, verbose=True)
synth_data_no = synth_no.fit_sample(df, transformer=NoTransformer())

ValueError: could not convert string to float: 'a'

In [10]:
synth_no = Synthesizer.create('dpctgan', epsilon=1.0, verbose=True)
synth_data_no = synth_no.fit_sample(df.loc[:, ['age', 'capital-gain']].values, transformer=NoTransformer())

AssertionError: 

In [11]:
from snsynth.transform import TableTransformer
from snsynth.transform.identity import IdentityTransformer

tt = TableTransformer([IdentityTransformer(), IdentityTransformer()])

synth_no = Synthesizer.create('dpctgan', epsilon=1.0, verbose=True)
synth_data_no = synth_no.fit_sample(df.loc[:, ['age', 'capital-gain']].values, transformer=tt)

NotImplementedError: 

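Currently, none of the GAN-style synthesizers can be used with `NoTransformer` or `IdentityTransformer`, so input data must always go through preprocessing before it can be synthesized. Cube-style synthesizers, however, can synthesize data with `IdentityTransformer`.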