# Tabular Data Synthesis with [Synthetic Data Vault](https://docs.sdv.dev/sdv) (CTGAN)

Let's use SDV's CTGAN algorithm to create synthetic data for a single table and evaluate it. CTGAN uses generative adversarial networks (GANs) to synthesize data with high fidelity.

## 0. Background

Synthetic Data Vault (SDV) is a Python library for tabular data synthesis. Along with SynthPop, it is one of the most widely used tabular data synthesis libraries. The library offers three main approaches to data synthesis.

### Main methods

1. **Gaussian copula**: This method models the dependence between the columns of a tabular dataset with a multivariate Gaussian, after transforming each column's marginal distribution. This is the fastest method, but its assumptions may not hold for many real-world datasets.

2. **CTGAN**: Conditional Tabular GAN (CTGAN) is the library's main method, based on the 2019 paper "Modeling Tabular Data using Conditional GAN." It uses a GAN (generative adversarial network) to estimate the complex joint distribution of a tabular dataset. The generator learns to approximate the joint distribution of the real data and generates batches of synthetic rows. The discriminator (the critic) compares synthetic rows with real rows, and its loss drives training until it can no longer distinguish the two.

3. **TVAE**: Tabular variational autoencoder (TVAE) was introduced in the same 2019 paper as a comparison method that uses a VAE instead of a GAN. As a VAE, this unsupervised model encodes the rows of a tabular dataset (the input) into a multivariate Gaussian latent layer and then uses the trained decoder network to generate synthetic data.
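
To make the first method in the list above more concrete, here is a minimal two-column sketch of the Gaussian copula idea using only the standard library. It is not SDV's implementation: the function names, the toy columns, and the use of empirical quantiles for the marginals are all assumptions chosen for illustration.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def to_normal_scores(values):
    """Map each value to a standard-normal score through its empirical rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        # (rank + 0.5) / n keeps the probability strictly inside (0, 1).
        scores[i] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def empirical_quantile(sorted_values, u):
    """Return the value at probability u of the empirical distribution."""
    index = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[index]


def sample_copula(col_a, col_b, num_rows, seed=0):
    """Sample (a, b) pairs whose rank dependence mimics the input columns."""
    rng = random.Random(seed)
    # Correlation of the Gaussianized columns defines the copula.
    rho = pearson(to_normal_scores(col_a), to_normal_scores(col_b))
    spread = max(0.0, 1.0 - rho * rho) ** 0.5
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    rows = []
    for _ in range(num_rows):
        # Draw a correlated pair of standard normals.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + spread * rng.gauss(0.0, 1.0)
        # Map each normal back to the column's own marginal distribution.
        rows.append((
            empirical_quantile(sorted_a, STD_NORMAL.cdf(z1)),
            empirical_quantile(sorted_b, STD_NORMAL.cdf(z2)),
        ))
    return rows


# Hypothetical toy columns (not taken from the demo dataset).
rate = [100.0, 120.0, 150.0, 200.0, 300.0, 350.0]
fee = [5.0, 8.0, 7.0, 20.0, 30.0, 35.0]
synthetic_rows = sample_copula(rate, fee, num_rows=200)
```

Because the sketch resamples observed values, it can never produce a value outside the original columns; a real implementation would typically fit a parametric distribution to each marginal instead.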

### CTGAN

GANs have been used frequently for tabular data synthesis because they can learn flexible distributions. Traditionally, Bayesian networks have been the main methods for data synthesis (because they are generative models), but they have often been considered too restrictive because of their limited choice of probability density functions, although a comprehensive comparison between GANs and Bayesian networks seems to be lacking.

GANs are still actively studied for tabular data synthesis. The main motivation, also mentioned in the CTGAN paper, is that the joint distribution of columns in real-world tabular data is quite complex. First, real-world data often contain mixed data types, such as continuous and discrete columns. Second, continuous columns tend to have multiple modes and therefore do not follow a Gaussian distribution. Finally, categorical columns often show severe class imbalance.

CTGAN's three main features address these challenges. First, it uses mode-specific normalization, which encodes (preprocesses) each value of a continuous column as an N-dimensional one-hot vector plus a scalar. The one-hot vector indicates which of the N modes in the column's distribution the value belongs to, and the scalar gives the value's normalized position within that mode. The number of modes is estimated with a variational Gaussian mixture model (VGM).
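
As a rough illustration of mode-specific normalization, the sketch below encodes a single value given an already-fitted mixture. The mixture parameters in `MODES` are hypothetical (in CTGAN they come from the fitted VGM), and picking the most likely mode is a simplification: the paper samples the mode from the per-mode probabilities. The scalar follows the paper's normalization, (value - mode mean) / (4 * mode standard deviation).

```python
import math

# Hypothetical fitted mixture for a room_rate-like column:
# one (weight, mean, std) tuple per mode.
MODES = [(0.7, 120.0, 15.0), (0.3, 350.0, 40.0)]


def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def encode(value, modes=MODES):
    """Return (one_hot, alpha) for one continuous value."""
    # Weighted density of the value under each mode.
    densities = [w * normal_pdf(value, m, s) for w, m, s in modes]
    # Simplification: take the most likely mode instead of sampling one.
    k = max(range(len(modes)), key=lambda i: densities[i])
    one_hot = [1 if i == k else 0 for i in range(len(modes))]
    _, mean, std = modes[k]
    # Normalized position of the value inside the chosen mode.
    alpha = (value - mean) / (4.0 * std)
    return one_hot, alpha


def decode(one_hot, alpha, modes=MODES):
    """Invert encode(): recover the original value."""
    _, mean, std = modes[one_hot.index(1)]
    return alpha * 4.0 * std + mean


one_hot, alpha = encode(131.23)
restored = decode(one_hot, alpha)
```

The encoding is invertible, which is what lets the synthesizer turn generated (one-hot, scalar) pairs back into values on the original scale.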

Second, CTGAN uses a conditional generator, which estimates the conditional distribution of rows given a particular value of a particular discrete column. To fit into the GAN architecture, the conditional generator must learn the conditional distribution of the real data. For discrete data, the original distribution can then be reconstructed by marginalizing this conditional distribution over all categories of the given column.
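
The condition is passed to the generator as a vector. A minimal sketch, assuming the layout described in the CTGAN paper (one mask per discrete column, concatenated, with a single 1 marking the chosen category); the column dictionary and function name here are hypothetical:

```python
# Hypothetical discrete columns and their categories.
DISCRETE_COLUMNS = {
    "has_rewards": [False, True],
    "room_type": ["BASIC", "DELUXE", "SUITE"],
}


def build_cond_vector(column, category, columns=DISCRETE_COLUMNS):
    """Concatenate one mask per discrete column; only the chosen category is 1."""
    cond = []
    for name, categories in columns.items():
        for value in categories:
            is_selected = name == column and value == category
            cond.append(1 if is_selected else 0)
    return cond


# Condition "room_type == DELUXE": positions are
# [has_rewards=False, has_rewards=True, BASIC, DELUXE, SUITE].
cond = build_cond_vector("room_type", "DELUXE")
```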

Finally, to address class imbalance, CTGAN uses a training-by-sampling method: the condition for each training sample is drawn from a probability mass function over the categories of a discrete column, where the mass of each category is the logarithm of its frequency.
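
The effect of log-frequency sampling is easy to see on a toy column. In this sketch the category counts are made up, and `log(count + 1)` is an assumption that keeps very rare categories at a non-zero mass:

```python
import math
import random
from collections import Counter


def log_frequency_pmf(values):
    """Probability of each category, proportional to the log of its count."""
    counts = Counter(values)
    # Assumption: log(count + 1), so a category seen once keeps some mass.
    masses = {cat: math.log(n + 1) for cat, n in counts.items()}
    total = sum(masses.values())
    return {cat: mass / total for cat, mass in masses.items()}


# Hypothetical, heavily imbalanced column.
room_types = ["BASIC"] * 400 + ["DELUXE"] * 80 + ["SUITE"] * 20
pmf = log_frequency_pmf(room_types)

# Draw the category that one training sample is conditioned on.
rng = random.Random(0)
sampled_category = rng.choices(list(pmf), weights=list(pmf.values()), k=1)[0]
```

Here `SUITE` makes up 4% of the rows but receives a noticeably larger share of the sampling mass, so the generator sees minority categories more often during training than uniform row sampling would allow.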

## 1. Loading the demo data

SDV provides a handful of demo datasets. Use `sdv.datasets.demo.get_available_demos(modality='single_table')` to get all single-table examples.

In [1]:
import sdv
from sdv.datasets import demo

In [2]:
demo.get_available_demos(modality='single_table')

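|  | dataset_name | size_MB | num_tables |
|---|---|---|---|
| 0 | KRK_v1 | 0.06 | 1 |
| 1 | adult | 3.91 | 1 |
| 2 | alarm | 4.52 | 1 |
| 3 | asia | 1.28 | 1 |
| 4 | census | 98.17 | 1 |
| 5 | census_extended | 4.95 | 1 |
| 6 | child | 3.2 | 1 |
| 7 | covtype | 255.65 | 1 |
| 8 | credit | 68.35 | 1 |
| 9 | expedia_hotel_logs | 0.2 | 1 |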


In this notebook, we use a small and simple dataset, `'fake_hotel_guests'`.

In [3]:
from sdv.datasets.demo import download_demo

real_data, metadata = download_demo(
    modality='single_table',
    dataset_name='fake_hotel_guests'
)

In [4]:
real_data.shape

(500, 9)

In [5]:
real_data.head()

|  | guest_email | has_rewards | room_type | amenities_fee | checkin_date | checkout_date | room_rate | billing_address | credit_card_number |
|---|---|---|---|---|---|---|---|---|---|
| 0 | michaelsanders@shaw.net | False | BASIC | 37.89 | 27 Dec 2020 | 29 Dec 2020 | 131.23 | 49380 Rivers Street\nSpencerville, AK 68265 | 4075084747483975747 |
| 1 | randy49@brown.biz | False | BASIC | 24.37 | 30 Dec 2020 | 02 Jan 2021 | 114.43 | 88394 Boyle Meadows\nConleyberg, TN 22063 | 180072822063468 |
| 2 | webermelissa@neal.com | True | DELUXE | 0.0 | 17 Sep 2020 | 18 Sep 2020 | 368.33 | 0323 Lisa Station Apt. 208\nPort Thomas, LA 82585 | 38983476971380 |
| 3 | gsims@terry.com | False | BASIC |  | 28 Dec 2020 | 31 Dec 2020 | 115.61 | 77 Massachusetts Ave\nCambridge, MA 02139 | 4969551998845740 |
| 4 | misty33@smith.biz | False | BASIC | 16.45 | 05 Apr 2020 |  | 122.41 | 1234 Corporate Drive\nBoston, MA 02116 | 3558512986488983 |


The demo datasets include metadata, a description of the dataset. The metadata lists the primary key as well as the data type of each column (called "sdtypes"), **which are different from pandas dtypes**.

In [6]:
metadata

{
    "primary_key": "guest_email",
    "METADATA_SPEC_VERSION": "SINGLE_TABLE_V1",
    "columns": {
        "guest_email": {
            "sdtype": "email",
            "pii": true
        },
        "has_rewards": {
            "sdtype": "boolean"
        },
        "room_type": {
            "sdtype": "categorical"
        },
        "amenities_fee": {
            "sdtype": "numerical",
            "computer_representation": "Float"
        },
        "checkin_date": {
            "sdtype": "datetime",
            "datetime_format": "%d %b %Y"
        },
        "checkout_date": {
            "sdtype": "datetime",
            "datetime_format": "%d %b %Y"
        },
        "room_rate": {
            "sdtype": "numerical",
            "computer_representation": "Float"
        },
        "billing_address": {
            "sdtype": "address",
            "pii": true
        },
        "credit_card_number": {
            "sdtype": "credit_card_number",
            "pii": true
        }
    }
}
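
The `datetime_format` entries use the standard `strftime`/`strptime` directives. As a small illustration (plain Python, independent of SDV), the format declared for `checkin_date` and `checkout_date` parses the first row of the real data like this; month abbreviations are assumed to be in English, which is Python's default locale behavior:

```python
from datetime import datetime

# Format string as declared in the metadata for both date columns.
DATETIME_FORMAT = "%d %b %Y"

# Values from row 0 of real_data.head().
checkin = datetime.strptime("27 Dec 2020", DATETIME_FORMAT)
checkout = datetime.strptime("29 Dec 2020", DATETIME_FORMAT)

# Length of the stay in nights.
nights = (checkout - checkin).days

# Formatting with the same directive string restores the original text.
round_trip = checkin.strftime(DATETIME_FORMAT)
```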


## 2. Train a synthesizer

An SDV synthesizer is an object that you can use to create synthetic data. It learns patterns from the real data and replicates them to generate synthetic data. In this case, we're using CTGAN as a synthesizer. Note that the **synthesizer requires metadata** for training.

**For larger datasets, this phase may take longer. A drawback of a GAN-based model like CTGAN is training speed.** CTGAN requires PyTorch and can optionally use CUDA as a backend.

In [7]:
%%time

from sdv.single_table import CTGANSynthesizer

synthesizer = CTGANSynthesizer(metadata)
synthesizer.fit(real_data)

CPU times: user 1min 31s, sys: 1min 8s, total: 2min 40s
Wall time: 48.5 s


## 3. Generate Synthetic Data
Use the `sample` method and pass in the number of rows to synthesize. Compared to training, sampling is much faster. The synthesizer generates synthetic guests in the same format as the original data.

In [8]:
%%time
synthetic_data = synthesizer.sample(num_rows=500)

CPU times: user 604 ms, sys: 837 ms, total: 1.44 s
Wall time: 494 ms


In [9]:
synthetic_data.head()

|  | guest_email | has_rewards | room_type | amenities_fee | checkin_date | checkout_date | room_rate | billing_address | credit_card_number |
|---|---|---|---|---|---|---|---|---|---|
| 0 | dsullivan@example.net | False | BASIC | 8.33 | 12 Feb 2020 | 07 Jan 2020 | 160.19 | 90469 Karla Knolls Apt. 781\nSusanberg, CA 70033 | 5161033759518983 |
| 1 | steven59@example.org | False | BASIC | 1.53 | 13 Oct 2020 | 31 Jul 2020 | 219.36 | 6108 Carla Ports Apt. 116\nPort Evan, MI 71694 | 4133047413145475690 |
| 2 | brandon15@example.net | False | DELUXE | 21.4 | 25 Feb 2020 | 01 Apr 2020 | 162.96 | 86709 Jeremy Manors Apt. 786\nPort Garychester... | 4977328103788 |
| 3 | humphreyjennifer@example.net | False | BASIC | 40.63 | 18 May 2020 | 19 Jan 2020 | 89.96 | 8906 Bobby Trail\nEast Sandra, NY 43986 | 3524946844839485 |
| 4 | joshuabrown@example.net | True | SUITE |  | 01 Jan 2021 | 09 Jun 2020 | 139.0 | 732 Dennis Lane\nPort Nicholasstad, DE 49786 | 4446905799576890978 |


## 4. Evaluating Real vs. Synthetic Data

SDV has built-in functions for evaluating the synthetic data and getting more insight. They can be grouped into **Diagnostic** functions and **Data Quality** check functions.

### 4.1. Diagnostic

This step checks basic validity of the synthetic data.

1. Data structure: checks that the real and synthetic data have the same column names
2. Data validity: checks basic validity for each column, such as
    1. Primary keys must always be unique and non-null
    2. Continuous values in the synthetic data must adhere to the min/max range in the real data
    3. Discrete values in the synthetic data must adhere to the same categories as the real data

The scores from a diagnostic check **MUST be 100%**.
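
The three validity rules above can be sketched in a few lines of plain Python. This is a simplified illustration, not the code SDV runs: the function names and the toy rows are hypothetical, and each function returns the fraction of synthetic values that pass the rule.

```python
def key_uniqueness(rows, key):
    """Fraction of rows whose primary key is non-null and unique."""
    keys = [row[key] for row in rows]
    unique_keys = {k for k in keys if k is not None}
    return len(unique_keys) / len(keys)


def boundary_adherence(real, synthetic, column):
    """Fraction of synthetic values inside the real min/max range."""
    real_values = [row[column] for row in real if row[column] is not None]
    low, high = min(real_values), max(real_values)
    values = [row[column] for row in synthetic if row[column] is not None]
    return sum(low <= v <= high for v in values) / len(values)


def category_adherence(real, synthetic, column):
    """Fraction of synthetic values that use a category seen in the real data."""
    allowed = {row[column] for row in real}
    return sum(row[column] in allowed for row in synthetic) / len(synthetic)


# Hypothetical toy rows (not taken from the demo dataset).
real = [
    {"guest_email": "a@example.com", "room_type": "BASIC", "room_rate": 110.0},
    {"guest_email": "b@example.com", "room_type": "DELUXE", "room_rate": 370.0},
    {"guest_email": "c@example.com", "room_type": "SUITE", "room_rate": 250.0},
]
synthetic = [
    {"guest_email": "x@example.net", "room_type": "BASIC", "room_rate": 160.0},
    {"guest_email": "y@example.net", "room_type": "SUITE", "room_rate": 220.0},
]

scores = {
    "guest_email": key_uniqueness(synthetic, "guest_email"),
    "room_rate": boundary_adherence(real, synthetic, "room_rate"),
    "room_type": category_adherence(real, synthetic, "room_type"),
}

# A row with a duplicate key, an unseen category, and an out-of-range rate
# lowers all three scores below 1.0.
bad_row = {"guest_email": "x@example.net", "room_type": "PENTHOUSE", "room_rate": 999.0}
bad_synthetic = synthetic + [bad_row]
```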

In [10]:
from sdv.evaluation.single_table import run_diagnostic

diagnostic = run_diagnostic(
    real_data=real_data,
    synthetic_data=synthetic_data,
    metadata=metadata
)

Generating report ...

(1/2) Evaluating Data Validity: |███████████████████████████████████████████████████| 9/9 [00:00<00:00, 759.93it/s]|
Data Validity Score: 100.0%

(2/2) Evaluating Data Structure: |██████████████████████████████████████████████████| 1/1 [00:00<00:00, 117.55it/s]|
Data Structure Score: 100.0%

Overall Score (Average): 100.0%



### 4.2. Data quality

This step checks the similarity between the real and synthetic data.

1. Column Shapes: The statistical similarity between the real and synthetic data for **single columns** of data. This is often called the marginal distribution of each column.
2. Column Pair Trends: The statistical similarity between the real and synthetic data for pairs of columns. This is often called **the correlation or bivariate distributions of the columns.**

According to the SDV [documentation](https://docs.sdv.dev/sdv/single-table-data/evaluation/data-quality):

- A 100% score means that the patterns are exactly the same. For example, if you compared the real data with itself (identity), the score would be 100%.
- A 0% score means the patterns are as different as can be. This would entail that the synthetic data purposefully contains anti-patterns that are opposite from the real data.
- Any score in the middle can be interpreted along this scale. For example, a score of 80% means that the synthetic data is about 80% similar to the real data — about 80% of the trends are similar.
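
For a categorical column, one way to obtain a score with exactly these endpoints is the complement of the total variation distance between the real and synthetic category frequencies. The sketch below is a plain-Python illustration of that idea under hypothetical toy data, not the library's own code:

```python
from collections import Counter


def tv_complement(real_values, synthetic_values):
    """1 minus the total variation distance between category frequencies."""
    real_freq = Counter(real_values)
    synth_freq = Counter(synthetic_values)
    categories = set(real_freq) | set(synth_freq)
    # Counter returns 0 for categories missing from one side.
    distance = 0.5 * sum(
        abs(real_freq[c] / len(real_values) - synth_freq[c] / len(synthetic_values))
        for c in categories
    )
    return 1.0 - distance


# Hypothetical toy columns.
real_col = ["BASIC", "BASIC", "BASIC", "DELUXE"]
synth_col = ["BASIC", "BASIC", "DELUXE", "DELUXE"]

identical_score = tv_complement(real_col, real_col)    # same frequencies
disjoint_score = tv_complement(["A", "A"], ["B"])      # no shared category
partial_score = tv_complement(real_col, synth_col)     # frequencies differ by 0.25
```

Comparing a column with itself gives the maximum score, two columns with no category in common give the minimum, and everything else lands in between.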

In [11]:
from sdv.evaluation.single_table import evaluate_quality

quality_report = evaluate_quality(
    real_data,
    synthetic_data,
    metadata
)

Generating report ...

(1/2) Evaluating Column Shapes: |███████████████████████████████████████████████████| 9/9 [00:00<00:00, 235.94it/s]|
Column Shapes Score: 76.27%

(2/2) Evaluating Column Pair Trends: |████████████████████████████████████████████| 36/36 [00:00<00:00, 136.47it/s]|
Column Pair Trends Score: 75.79%

Overall Score (Average): 76.03%
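
For a pair of numerical columns, a trend score can be built from the difference between the real and synthetic correlation coefficients, rescaled to the 0 to 1 range. The following is a minimal sketch of that idea with hypothetical toy data; pairs that involve categorical columns need a different comparison (for example, of contingency tables) and are not covered here.

```python
def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def correlation_similarity(real_pair, synthetic_pair):
    """1 - |r_real - r_synthetic| / 2, so the score lies between 0 and 1."""
    r_real = pearson(*real_pair)
    r_synthetic = pearson(*synthetic_pair)
    return 1.0 - abs(r_real - r_synthetic) / 2.0


# Hypothetical toy columns: the real pair is perfectly positively correlated.
real_pair = ([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
same_trend = ([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
opposite_trend = ([1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0])

best = correlation_similarity(real_pair, same_trend)
worst = correlation_similarity(real_pair, opposite_trend)
```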



The SDV documentation also explains why similarity in higher dimensions is not checked:

>Higher order distributions of 3 or more columns are not included in the Quality Report. We have found that very high order similarity may have an adverse effect on the synthetic data. After a certain point, it indicates that the synthetic data is just a copy of the real data. (For more information, see the NewRowSynthesis metric.)
>
>If higher order similarity is a requirement, you likely have a targeted use case for synthetic data (eg. machine learning efficacy). Until we add these reports, you may want to explore other metrics in the SDMetrics library. You may also want to try directly using your synthetic data for the downstream application.

## 5. Visualize the real vs. synthetic data

The library provides several handy functions to compare the distributions of the real and synthetic data. The plots are interactive `plotly` figures.

### 5.1. PMF of a discrete column

In [12]:
from sdv.evaluation.single_table import get_column_plot

fig = get_column_plot(
    real_data=real_data,
    synthetic_data=synthetic_data,
    column_name='room_type',
    metadata=metadata
)

fig.show()

### 5.2. PDF by category using a column-pair plot (a discrete column against a continuous column)

In [14]:
from sdv.evaluation.single_table import get_column_pair_plot

fig = get_column_pair_plot(
    real_data=real_data,
    synthetic_data=synthetic_data,
    column_names=['room_rate', 'room_type'],
    metadata=metadata
)

fig.show()

## 6. CTGAN customization

When using this synthesizer, we can trade off training time against data quality with the `epochs` parameter: more epochs mean that the synthesizer trains for longer and, ideally, produces higher-quality data.

In [15]:
%%time
custom_synthesizer = CTGANSynthesizer(
    metadata,
    epochs=1000)
custom_synthesizer.fit(real_data)

CPU times: user 4min 32s, sys: 3min 33s, total: 8min 5s
Wall time: 1min 29s


In [16]:
synthetic_data_customized = custom_synthesizer.sample(num_rows=500)

quality_report = evaluate_quality(
    real_data,
    synthetic_data_customized,
    metadata
)

Generating report ...

(1/2) Evaluating Column Shapes: |███████████████████████████████████████████████████| 9/9 [00:00<00:00, 486.94it/s]|
Column Shapes Score: 85.43%

(2/2) Evaluating Column Pair Trends: |████████████████████████████████████████████| 36/36 [00:00<00:00, 164.43it/s]|
Column Pair Trends Score: 82.68%

Overall Score (Average): 84.06%



### 6.1. Plot the loss values for both the generator and discriminator

In [17]:
fig = synthesizer.get_loss_values_plot()
fig.show()