# Package installation

First, if you already have PyTorch >= 2.0 installed, you need to uninstall the *torchaudio* and *torchdata* packages. This makes the environment compatible with the *synthcity* package.

The *synthcity* team is already working on upgrading their requirements to PyTorch 2.0. See [here](https://github.com/vanderschaarlab/synthcity/issues/234).

In [None]:
# !yes | pip -q uninstall torchaudio torchdata

Remember to restart your kernel / runtime before proceeding to the next steps.
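
To confirm which of these packages are present before or after the uninstall step, a small check with the standard library can help. This is a minimal sketch; the helper name `installed_versions` is hypothetical and not part of any package used in this notebook.

```python
from importlib import metadata


def installed_versions(package_names):
    """Return a mapping of package name to installed version, or None if absent."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# The PyPI distribution name for PyTorch is "torch".
for name, version in installed_versions(["torch", "torchaudio", "torchdata"]).items():
    print(f"{name}: {version or 'not installed'}")
```

After a successful uninstall and kernel restart, *torchaudio* and *torchdata* should be reported as not installed.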

Second, you need to install the Google gPS *auto_synthetic_data_platform* package.

In [None]:
# !pip install auto_synthetic_data_platform.tar.gz

Remember to restart your kernel / runtime before proceeding to the next steps.

# Imports

In [None]:
import pathlib
from auto_synthetic_data_platform import preprocessing
from auto_synthetic_data_platform import synthetic_data_model_tuning
import pandas as pd
from synthcity import plugins
from synthcity.plugins.core import dataloader

# Global variables

The variables below may need to be changed for your use case.

In [None]:
_REAL_DATAFRAME_PATH = "/content/real_dataset.csv"

# Dataset

For demonstration purposes, we are going to use a modified version of the Acquire Valued Shoppers Challenge dataset, which can be downloaded from [Kaggle](https://www.kaggle.com/competitions/acquire-valued-shoppers-challenge/data).

In [None]:
real_dataframe = pd.read_csv(_REAL_DATAFRAME_PATH)
real_dataframe.head()

Unnamed: 0,productsize,purchasequantity,purchaseamount,dept,category,company,brand,productmeasure
0,128.0,2,5.98,63,6315,101111010,9907,OZ
1,16.0,1,0.88,9,907,101111010,9907,OZ
2,128.0,1,2.39,63,6315,101111010,9907,OZ
3,16.0,1,0.99,9,907,101111010,9907,OZ
4,12.0,1,0.88,9,907,101111010,9907,OZ


# Loading data

## Preprocessor

The *auto_synthetic_data_platform* package has the *Preprocessor* class, which can help preprocess the real dataset according to industry best practices. It also logs all the information about the dataset that can impact synthetic data model training. The logs can be shared externally if you cannot allow external people to access your data and remote / blind debugging of the synthetic data model training is required.



In [None]:
_EXPERIMENT_DIRECTORY = "/content/"
_COLUMN_METADATA = {
    "numerical": [
        "productsize",
        "purchasequantity",
        "purchaseamount",
    ],
    "categorical": [
        "dept",
        "category",
        "company",
        "brand",
        "productmeasure",
    ],
}
_PREPROCESS_METADATA = {
    "remove_numerical_outliers": False,
    "remove_duplicates": True,
    "preprocess_missing_values": True,
    "missing_values_method": "drop",
}
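
Before handing the column metadata to the preprocessor, it can be worth checking that every declared column exists in the dataset and that no column is listed twice. The following is a minimal sketch using only the standard library; `check_column_metadata` is a hypothetical helper, not part of *auto_synthetic_data_platform*, and the sample rows are copied from the dataset preview above with the unnamed index column omitted.

```python
import csv
import io

# First rows of the dataset preview shown earlier (index column omitted).
SAMPLE_CSV = """productsize,purchasequantity,purchaseamount,dept,category,company,brand,productmeasure
128.0,2,5.98,63,6315,101111010,9907,OZ
16.0,1,0.88,9,907,101111010,9907,OZ
"""

COLUMN_METADATA = {
    "numerical": ["productsize", "purchasequantity", "purchaseamount"],
    "categorical": ["dept", "category", "company", "brand", "productmeasure"],
}


def check_column_metadata(header, column_metadata):
    """Return (missing, unassigned, duplicated) column names as sorted lists."""
    declared = [name for names in column_metadata.values() for name in names]
    # Declared in the metadata but absent from the dataset header.
    missing = sorted(set(declared) - set(header))
    # Present in the dataset header but not assigned to any type.
    unassigned = sorted(set(header) - set(declared))
    # Listed more than once across the metadata.
    duplicated = sorted({name for name in declared if declared.count(name) > 1})
    return missing, unassigned, duplicated


header = next(csv.reader(io.StringIO(SAMPLE_CSV)))
missing, unassigned, duplicated = check_column_metadata(header, COLUMN_METADATA)
print("missing:", missing, "unassigned:", unassigned, "duplicated:", duplicated)
```

When run against the header of a real file, any column that is not declared in the metadata (for example an index column) would appear in the `unassigned` list.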

In [None]:
preprocessor = preprocessing.Preprocessor(
    dataframe_path=pathlib.Path(_REAL_DATAFRAME_PATH),
    experiment_directory=pathlib.Path(_EXPERIMENT_DIRECTORY),
    column_metadata=_COLUMN_METADATA,
    preprocess_metadata=_PREPROCESS_METADATA,
)

In [None]:
preprocessed_real_dataset = preprocessor.output_dataframe

INFO:preprocessing:No columns with missing values detected.


## Synthcity data loading

The next step is to use a data loading module from *synthcity*.

In [None]:
data_loader = dataloader.GenericDataLoader(
    data=preprocessed_real_dataset,
    target_column="purchaseamount",
)

# Synthetic data model training

## Synthcity model selection

At this stage you need to load a synthetic data model instance from *synthcity*. The list of available models can be found [here](https://github.com/vanderschaarlab/synthcity#-methods).

In [None]:
_TVAE_MODEL = "tvae"
tvae_synthetic_data_model = plugins.Plugins().get(_TVAE_MODEL)

[2023-09-21T11:55:07.375435+0000][3153][CRITICAL] module disabled: /usr/local/lib/python3.10/dist-packages/synthcity/plugins/generic/plugin_goggle.py


## Objective/s selection

Now you need to define a mapping of evaluation metrics compatible with *synthcity*. Each model in the hyperparameter tuning process will be evaluated against these criteria to identify the best synthetic data model. The list of available metrics can be found [here](https://github.com/vanderschaarlab/synthcity#zap-evaluation-metrics).

In [None]:
_EVALUATION_METRICS = {
    "sanity": ["close_values_probability"],
    "stats": ["inv_kl_divergence"],
    "performance": ["xgb"],
    "privacy": ["k-anonymization"],
}

## Tuner setup

Here, you will need to set up a synthetic data model tuner from the *auto_synthetic_data_platform* package.

In [None]:
_NUMBER_OF_TRIALS = 2
_OPTIMIZATION_DIRECTION = "maximize"
_TASK_TYPE = "regression"

In [None]:
tvae_synthetic_data_model_optimizer = synthetic_data_model_tuning.SyntheticDataModelTuner(
    data_loader=data_loader,
    synthetic_data_model=tvae_synthetic_data_model,
    experiment_directory=pathlib.Path(_EXPERIMENT_DIRECTORY),
    number_of_trials=_NUMBER_OF_TRIALS,
    optimization_direction=_OPTIMIZATION_DIRECTION,
    evaluation_metrics=_EVALUATION_METRICS,
    task_type=_TASK_TYPE,
)

## Hyperparameter tuning

The tuner runs a hyperparameter search in the background and then outputs the best synthetic data model. The processing time depends on the number of trials.

**WARNING:** The process is resource intensive. Sometimes the kernel / runtime needs to be restarted.
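
The source does not say which search strategy the tuner uses, so the following is only a conceptual sketch of what "trials" and "optimization direction" mean, using plain random search. The search space, `toy_objective` and `random_search` are all hypothetical; the candidate values are borrowed from the logs in this notebook, and a real tuner would train and evaluate a model in place of the toy score.

```python
import random

# Hypothetical search space; candidate values are borrowed from the logs in this notebook.
SEARCH_SPACE = {
    "n_iter": [100, 400],
    "lr": [0.0001, 0.001],
    "batch_size": [256, 512],
}


def toy_objective(params):
    """Stand-in score; a real tuner would train and evaluate a model here."""
    return params["n_iter"] / 400 - abs(params["lr"] - 0.001) * 100


def random_search(number_of_trials, direction, seed=0):
    """Sample one configuration per trial and return the best one plus all trials."""
    rng = random.Random(seed)
    pick_best = max if direction == "maximize" else min
    trials = []
    for _ in range(number_of_trials):
        params = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
        trials.append((toy_objective(params), params))
    return pick_best(trials, key=lambda trial: trial[0]), trials


(best_score, best_params), all_trials = random_search(
    number_of_trials=2, direction="maximize"
)
print(best_score, best_params)
```

Each additional trial means one more full training and evaluation run, which is why the processing time grows with the number of trials.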

In [None]:
best_tvae_network_synthetic_data_model = (
    tvae_synthetic_data_model_optimizer.best_synthetic_data_model
)

INFO:hyperparameter_optimization:Specified 'evaluation_metrics': {'sanity': ['close_values_probability'], 'stats': ['inv_kl_divergence'], 'performance': ['xgb'], 'privacy': ['k-anonymization']}. Optimization direction: 'maximize'.
INFO:hyperparameter_optimization:Trial: trial_0. Model: tvae. Selected hyperparameters: {'n_iter': 400, 'lr': 0.0001, 'decoder_n_layers_hidden': 1, 'weight_decay': 0.001, 'batch_size': 512, 'n_units_embedding': 250, 'decoder_n_units_hidden': 250, 'decoder_nonlin': 'elu', 'decoder_dropout': 0.11346204368066007, 'encoder_n_layers_hidden': 5, 'encoder_n_units_hidden': 150, 'encoder_nonlin': 'tanh', 'encoder_dropout': 0.05081501879035502}
[2023-09-21T11:55:08.557163+0000][3153][CRITICAL] module disabled: /usr/local/lib/python3.10/dist-packages/synthcity/plugins/generic/plugin_goggle.py
100%|██████████| 400/400 [01:28<00:00,  4.53it/s]
INFO:hyperparameter_optimization:Trial: trial_1. Model: tvae. Selected hyperparameters: {'n_iter': 100, 'lr': 0.001, 'decoder_n_la

You can access the best hyperparameters.

In [None]:
best_hyperparameters = tvae_synthetic_data_model_optimizer.best_hyperparameters
print(best_hyperparameters)

{'n_iter': 100, 'lr': 0.001, 'decoder_n_layers_hidden': 3, 'weight_decay': 0.0001, 'batch_size': 256, 'n_units_embedding': 350, 'decoder_n_units_hidden': 100, 'decoder_nonlin': 'tanh', 'decoder_dropout': 0.03137473872195671, 'encoder_n_layers_hidden': 5, 'encoder_n_units_hidden': 450, 'encoder_nonlin': 'leaky_relu', 'encoder_dropout': 0.06130705537690631}
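
If you want to keep the best hyperparameters for a later run, one option is to store them as JSON. This is a minimal sketch that assumes the returned object is a plain dictionary of JSON-serializable values, as the printed output suggests; the file name is hypothetical and a temporary directory stands in for the experiment directory.

```python
import json
import pathlib
import tempfile

# Subset of the hyperparameters printed above, copied by hand for illustration.
best_hyperparameters = {
    "n_iter": 100,
    "lr": 0.001,
    "batch_size": 256,
    "decoder_nonlin": "tanh",
}

# A temporary directory stands in for the experiment directory.
with tempfile.TemporaryDirectory() as directory:
    path = pathlib.Path(directory) / "best_hyperparameters.json"
    path.write_text(json.dumps(best_hyperparameters, indent=2))
    restored = json.loads(path.read_text())

# The restored dictionary should match the original one.
print(restored == best_hyperparameters)
```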


You can also plot / display parallel hyperparameter coordinates for later analysis.

In [None]:
tvae_synthetic_data_model_optimizer.display_parallel_hyperparameter_coordinates()

There is also a method to display / plot hyperparameter importances during training.

In [None]:
tvae_synthetic_data_model_optimizer.display_hyperparameter_importances()

The tuner can also provide you with an evaluation report on the best synthetic data model it identified. The model will be assessed using *all* the available evaluation metrics in the *synthcity* package.

In [None]:
best_tvae_model_full_evaluation_report = tvae_synthetic_data_model_optimizer.best_synthetic_data_model_full_evaluation_report
best_tvae_model_full_evaluation_report

[2023-09-21T12:01:16.459147+0000][3153][CRITICAL] module disabled: /usr/local/lib/python3.10/dist-packages/synthcity/plugins/generic/plugin_goggle.py
100%|██████████| 100/100 [01:58<00:00,  1.18s/it]


Unnamed: 0,min,max,mean,stddev,median,iqr,rounds,errors,durations,direction
sanity.data_mismatch.score,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0.0,minimize
sanity.common_rows_proportion.score,0.0,0.0,0.0,0.0,0.0,0.0,1,0,0.01,minimize
sanity.nearest_syn_neighbor_distance.mean,0.073798,0.073798,0.073798,0.0,0.073798,0.0,1,0,0.0,minimize
sanity.close_values_probability.score,0.960212,0.960212,0.960212,0.0,0.960212,0.0,1,0,0.0,maximize
sanity.distant_values_probability.score,0.001326,0.001326,0.001326,0.0,0.001326,0.0,1,0,0.0,minimize
stats.jensenshannon_dist.marginal,0.02595,0.02595,0.02595,0.0,0.02595,0.0,1,0,0.07,minimize
stats.chi_squared_test.marginal,0.432897,0.432897,0.432897,0.0,0.432897,0.0,1,0,0.02,maximize
stats.inv_kl_divergence.marginal,0.800338,0.800338,0.800338,0.0,0.800338,0.0,1,0,0.02,maximize
stats.ks_test.marginal,0.822944,0.822944,0.822944,0.0,0.822944,0.0,1,0,0.01,maximize
stats.max_mean_discrepancy.joint,0.022172,0.022172,0.022172,0.0,0.022172,0.0,1,0,0.07,minimize


Another method will evaluate the model using only the evaluation metrics specified at the class initialization.

In [None]:
best_tvae_model_evaluation_report = tvae_synthetic_data_model_optimizer.best_synthetic_data_model_evaluation_report
best_tvae_model_evaluation_report

Unnamed: 0,min,max,mean,stddev,median,iqr,rounds,errors,durations,direction
sanity.close_values_probability.score,0.960212,0.960212,0.960212,0.0,0.960212,0.0,1,0,0.0,maximize
stats.inv_kl_divergence.marginal,0.800338,0.800338,0.800338,0.0,0.800338,0.0,1,0,0.02,maximize
performance.xgb.gt,0.422934,0.422934,0.422934,0.0,0.422934,0.0,1,0,0.63,maximize
performance.xgb.syn_id,-0.019229,-0.019229,-0.019229,0.0,-0.019229,0.0,1,0,0.63,maximize
performance.xgb.syn_ood,-0.069771,-0.069771,-0.069771,0.0,-0.069771,0.0,1,0,0.63,maximize
performance.xgb_augmentation.gt,0.422934,0.422934,0.422934,0.0,0.422934,0.0,1,0,2.45,maximize
performance.xgb_augmentation.aug_ood,-0.069771,-0.069771,-0.069771,0.0,-0.069771,0.0,1,0,2.45,maximize
privacy.k-anonymization.gt,2.0,2.0,2.0,0.0,2.0,0.0,1,0,0.22,maximize
privacy.k-anonymization.syn,3.0,3.0,3.0,0.0,3.0,0.0,1,0,0.22,maximize


The tuner will also enable you to save the best synthetic data model.

In [None]:
tvae_synthetic_data_model_optimizer.save_best_synthetic_data_model()

# Synthetic data generation

You can create the synthetic data with the tuner or the *generate_synthetic_data_with_synthetic_data_model* function from the *synthetic_data_model_tuning* module.

In [None]:
tvae_synthetic_data_model_optimizer.generate_synthetic_data_with_the_best_synthetic_data_model(
    count=10,
)

Unnamed: 0,productsize,purchasequantity,purchaseamount,dept,category,company,brand,productmeasure
0,18.698397,3,7.220013,9,907,101111010,9907,OZ
1,1.580076,0,0.683314,97,9753,10000,0,CT
2,17.018214,1,1.658242,9,907,101111010,9907,OZ
3,14.124072,3,6.233899,9,907,101111010,9907,OZ
4,1.110688,1,5.533926,97,9753,10000,0,CT
5,128.0,3,3.939598,63,6315,101111010,9907,OZ
6,14.43887,2,1.451398,97,9753,10000,0,CT
7,14.592114,1,3.874689,97,9753,10000,0,CT
8,127.604158,1,4.477282,63,6315,101111010,9907,OZ
9,1.4539,1,6.080414,97,9753,101111010,0,CT


# Comparing multiple tuned synthetic data models

Another likely step is to compare two or more tuned synthetic data models. The *auto_synthetic_data_platform* package can help with that as well.

For demo purposes, we need to tune an alternative synthetic data model following the exact same steps as above.

In [None]:
_CTGAN_MODEL = "ctgan"
ctgan_synthetic_data_model = plugins.Plugins().get(_CTGAN_MODEL)

[2023-09-21T12:11:15.408482+0000][3153][CRITICAL] module disabled: /usr/local/lib/python3.10/dist-packages/synthcity/plugins/generic/plugin_goggle.py


In [None]:
ctgan_synthetic_data_model_optimizer = synthetic_data_model_tuning.SyntheticDataModelTuner(
    data_loader=data_loader,
    synthetic_data_model=ctgan_synthetic_data_model,
    experiment_directory=pathlib.Path(_EXPERIMENT_DIRECTORY),
    number_of_trials=_NUMBER_OF_TRIALS,
    optimization_direction=_OPTIMIZATION_DIRECTION,
    evaluation_metrics=_EVALUATION_METRICS,
    task_type=_TASK_TYPE,
)

In [None]:
best_ctgan_synthetic_data_model = (
    ctgan_synthetic_data_model_optimizer.best_synthetic_data_model
)

INFO:hyperparameter_optimization:Specified 'evaluation_metrics': {'sanity': ['close_values_probability'], 'stats': ['inv_kl_divergence'], 'performance': ['xgb'], 'privacy': ['k-anonymization']}. Optimization direction: 'maximize'.
INFO:hyperparameter_optimization:Trial: trial_0. Model: ctgan. Selected hyperparameters: {'generator_n_layers_hidden': 2, 'generator_n_units_hidden': 100, 'generator_nonlin': 'elu', 'n_iter': 200, 'generator_dropout': 0.13868998986166445, 'discriminator_n_layers_hidden': 2, 'discriminator_n_units_hidden': 150, 'discriminator_nonlin': 'elu', 'discriminator_n_iter': 1, 'discriminator_dropout': 0.07788077013106619, 'lr': 0.0002, 'weight_decay': 0.001, 'batch_size': 100, 'encoder_max_clusters': 6}
[2023-09-21T12:11:15.486900+0000][3153][CRITICAL] module disabled: /usr/local/lib/python3.10/dist-packages/synthcity/plugins/generic/plugin_goggle.py
100%|██████████| 200/200 [02:45<00:00,  1.21it/s]
INFO:hyperparameter_optimization:Trial: trial_1. Model: ctgan. Selecte

In [None]:
best_ctgan_model_evaluation_report = ctgan_synthetic_data_model_optimizer.best_synthetic_data_model_evaluation_report

[2023-09-21T12:22:40.676206+0000][3153][CRITICAL] module disabled: /usr/local/lib/python3.10/dist-packages/synthcity/plugins/generic/plugin_goggle.py
100%|██████████| 200/200 [02:32<00:00,  1.31it/s]


Finally, the last step is to compare two or more of the evaluation reports created above.

In [None]:
evaluation_reports_to_compare = {
    "tvae": best_tvae_model_evaluation_report,
    "ctgan": best_ctgan_model_evaluation_report,
}
synthetic_data_model_tuning.compare_synthetic_data_models_full_evaluation_reports(
    evaluation_results_mapping=evaluation_reports_to_compare
)

Unnamed: 0,tvae,ctgan
sanity.close_values_probability.score,0.960212,0.876658
stats.inv_kl_divergence.marginal,0.800338,0.865951
performance.xgb.gt,0.422934,0.422934
performance.xgb.syn_id,-0.019229,-3.882884
performance.xgb.syn_ood,-0.069771,-4.02509
performance.xgb_augmentation.gt,0.422934,0.422934
performance.xgb_augmentation.aug_ood,-0.069771,-4.02509
privacy.k-anonymization.gt,2.0,2.0
privacy.k-anonymization.syn,3.0,1.0
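
To read the comparison table more quickly, you can determine which model scores highest on each metric. This is a minimal sketch in plain Python; `best_model_per_metric` is a hypothetical helper, the values are copied by hand from the table above, and it relies on every listed metric having the direction "maximize", as shown in the evaluation report.

```python
# Values copied from the comparison table above; every listed metric is maximized.
COMPARISON = {
    "sanity.close_values_probability.score": {"tvae": 0.960212, "ctgan": 0.876658},
    "stats.inv_kl_divergence.marginal": {"tvae": 0.800338, "ctgan": 0.865951},
    "performance.xgb.gt": {"tvae": 0.422934, "ctgan": 0.422934},
    "performance.xgb.syn_id": {"tvae": -0.019229, "ctgan": -3.882884},
    "performance.xgb.syn_ood": {"tvae": -0.069771, "ctgan": -4.02509},
    "performance.xgb_augmentation.gt": {"tvae": 0.422934, "ctgan": 0.422934},
    "performance.xgb_augmentation.aug_ood": {"tvae": -0.069771, "ctgan": -4.02509},
    "privacy.k-anonymization.gt": {"tvae": 2.0, "ctgan": 2.0},
    "privacy.k-anonymization.syn": {"tvae": 3.0, "ctgan": 1.0},
}


def best_model_per_metric(comparison):
    """Map each metric to the model with the highest score, or "tie"."""
    winners = {}
    for metric, scores in comparison.items():
        top = max(scores.values())
        leaders = [model for model, score in scores.items() if score == top]
        winners[metric] = leaders[0] if len(leaders) == 1 else "tie"
    return winners


for metric, winner in best_model_per_metric(COMPARISON).items():
    print(f"{metric}: {winner}")
```

With only two trials per model in this demo, such a summary is a reading aid rather than a reliable ranking of the two model families.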


# License

The *auto_synthetic_data_platform* package is a wrapper around the *synthcity* library. Credits to: Qian, Zhaozhi and Cebere, Bogdan-Constantin and van der Schaar, Mihaela

The *auto_synthetic_data_platform* package is licensed under the Apache License (Version 2.0, January 2004). See the package's LICENSE file for more details.