This notebook generates synthetic data with the CTGAN and GaussianCopula models from the SDV library. The goal is synthetic data that closely resembles the original data while preserving the privacy of the data subjects.

The notebook loads a CSV file containing the original data and passes it to the Generator class from generator.py, together with the lists of categorical and sensitive columns. Generator uses SDV to produce synthetic data with either the CTGAN or the GaussianCopula model. The anonymized_data variable holds the anonymized data produced by the Generator's faker_categorical() method, and df holds the final result: the anonymized data combined with the synthetic data generated by SDV. Finally, the SimilarityCheck class from SimilarityCheck.py compares the generated data with the original data, and the TableEvaluator class from the table_evaluator package evaluates the quality of the synthetic data.

In [None]:
import sys
sys.path.append('..')
import pandas as pd
import numpy as np
from faker import Faker
import random
from collections import OrderedDict
# sdv.tabular is only available in SDV releases before 1.0 (e.g. pip install "sdv<1.0")
from sdv.tabular import CTGAN, GaussianCopula
from sdv.evaluation import evaluate
from table_evaluator import TableEvaluator
from src.utils import *
import re
from src.similarity_check.SimilarityCheck import *
from src.synthetic_data_generation.generator import *

ModuleNotFoundError: No module named 'sdv.tabular'
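
The error above is what this import raises when the installed SDV release no longer ships `sdv.tabular`; from SDV 1.0 onward the single-table models live in `sdv.single_table`. As a minimal sketch, the check below reports which of the two layouts can be imported in the current environment. The helper name `detect_sdv_api` and its return labels are hypothetical, and the sketch assumes that the rest of this notebook (including the project's own Generator class) targets the pre-1.0 layout.

```python
import importlib


def detect_sdv_api():
    """Return 'legacy', 'single_table', or None depending on which SDV layout imports.

    'legacy' means sdv.tabular is importable (SDV before 1.0);
    'single_table' means sdv.single_table is importable (SDV 1.0 and later);
    None means neither module could be imported.
    """
    candidates = (("legacy", "sdv.tabular"), ("single_table", "sdv.single_table"))
    for label, module_name in candidates:
        try:
            importlib.import_module(module_name)
        except ImportError:
            # ModuleNotFoundError is a subclass of ImportError, so it is caught here too
            continue
        return label
    return None


sdv_api = detect_sdv_api()
print("Detected SDV layout:", sdv_api)
```

If the result is not `'legacy'`, the import cell above will keep failing until a pre-1.0 SDV release is installed or the code is ported to the newer layout.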

In [None]:
# define path to the data you want to test
path_test_data = "./Subsample_training.csv"

# uncomment the next line to see the first 10 rows of your data
# print(pd.read_csv(path_test_data).head(10))

# indicate which columns are categorical, and which are sensitive
cat_cols = ['Married/Single', 'House_Ownership', 'Car_Ownership', 'Profession', 'CITY', 'STATE', 'Risk_Flag']
sensitive_cols = ["first_name", "last_name", "email", "gender", "ip_address", "nationality", "city"]

my_metadata = {
    'fields':
        {
            'Income': {'type': 'numerical', 'subtype': 'integer'},
            'Age': {'type': 'numerical', 'subtype': 'integer'},
            'Experience': {'type': 'numerical', 'subtype': 'integer'},
            'CURRENT_JOB_YRS': {'type': 'numerical', 'subtype': 'integer'},
            'CURRENT_HOUSE_YRS': {'type': 'numerical', 'subtype': 'integer'},
            'Married/Single': {'type': 'categorical'},
            'House_Ownership': {'type': 'categorical'},
            'Car_Ownership': {'type': 'categorical'},
            'Profession': {'type': 'categorical'},
            'CITY': {'type': 'categorical'},
            'STATE': {'type': 'categorical'},
            'Risk_Flag': {'type': 'boolean'}
        },
    'constraints': [],
    'model_kwargs': {},
    'name': None,
    'primary_key': None,
    'sequence_index': None,
    'entity_columns': [],
    'context_columns': []
}
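
Because the categorical columns are declared twice, once in `cat_cols` and once in `my_metadata`, the two can drift apart; here `Risk_Flag` appears in `cat_cols` but is typed as `boolean` in the metadata. The sketch below shows one possible consistency check. The function name `categorical_mismatches` is hypothetical, and the small `example_metadata` dictionary is a trimmed stand-in for illustration, not the full metadata above. Whether the boolean type is intentional depends on how Generator and SimilarityCheck treat that column, which this notebook does not show.

```python
def categorical_mismatches(metadata, categorical_columns):
    """Compare a list of categorical columns with a metadata dictionary.

    Returns two lists:
    - columns that are missing from metadata['fields']
    - columns that are present but whose declared type is not 'categorical'
    """
    fields = metadata["fields"]
    missing = [col for col in categorical_columns if col not in fields]
    not_categorical = [
        col for col in categorical_columns
        if col in fields and fields[col]["type"] != "categorical"
    ]
    return missing, not_categorical


# Trimmed illustrative metadata in the same layout as my_metadata
example_metadata = {
    "fields": {
        "Income": {"type": "numerical", "subtype": "integer"},
        "Profession": {"type": "categorical"},
        "Risk_Flag": {"type": "boolean"},
    }
}

missing, not_categorical = categorical_mismatches(
    example_metadata, ["Profession", "Risk_Flag", "CITY"]
)
print("Missing from metadata:", missing)
print("Declared with another type:", not_categorical)
```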

In [None]:
data = get_data(path_test_data)
# inject a NaN to check that the generator can handle missing values
data.iloc[3, 2] = float("nan")
print(data.head())
# create the generator object
generator = Generator(n_epochs=1, n_samples=100, architecture='CTGAN',
                      data=data,
                      categorical_columns=cat_cols,
                      sensitive_columns=sensitive_cols)
print("Generating data")
# keep the synthetic columns from position 2 onward
synth_data = generator.generate().iloc[:, 2:]
# anonymized data produced with Faker
anonymized_data = generator.faker_categorical()
# place the anonymized and synthetic columns side by side
df = pd.concat([anonymized_data, synth_data], axis=1)
print(df.columns)
df = df.drop(columns=['CITY', 'STATE'])
print(df.head())
# uncomment to save the result
# df.to_csv('synth_data.csv')
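
The cell above joins the anonymized and synthetic frames with `pd.concat(..., axis=1)`, which aligns rows by index label rather than by position. The notebook does not show what index the two Generator methods return, so the sketch below uses two small made-up frames to illustrate what happens when the indices differ and how resetting them first avoids it. The column values are invented for illustration only.

```python
import pandas as pd

# Two toy frames with the same number of rows but different index labels
anonymized = pd.DataFrame({"first_name": ["Ana", "Ben"]}, index=[0, 1])
synthetic = pd.DataFrame({"Income": [100, 200]}, index=[5, 6])

# Index labels do not overlap, so concat produces the union of labels with NaN gaps
misaligned = pd.concat([anonymized, synthetic], axis=1)

# Resetting both indices makes the join purely positional
aligned = pd.concat(
    [anonymized.reset_index(drop=True), synthetic.reset_index(drop=True)],
    axis=1,
)

print(misaligned.shape)
print(aligned.shape)
```

If both frames already share a default `RangeIndex` of the same length, the reset is a no-op and the original call behaves the same way.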

In [None]:
similarity_checker = SimilarityCheck(data.iloc[:, 2:], synth_data, cat_cols, my_metadata)