In [None]:
%pip install sdv==0.13.1

# ML Model Development using Synthetic Data Clones

What happens if a machine learning model is trained using synthetic data instead of real data? We ran an experiment to answer this question.

In this notebook, we supply the code we used to run our experiment. If you're interested in a high-level summary of the setup and results, please see our [blog article](https://sdv.dev/blog/synthetic-clones-for-ml).

## Datasets

We'll look at 3 different datasets. For each one, we'll build ML models on the real data and on synthetic data, then compare their performance.

Let's load & inspect the datasets to understand the ML task each one requires.

In [None]:
import pandas as pd

### Income Dataset
The Income Dataset comes from [Kaggle](https://www.kaggle.com/mastmustu/income?select=train.csv). Download `train.csv` and save it as `income.csv`, the file name the code below reads.

The dataset contains information about employees, including personal attributes like gender and education.

In [None]:
income = pd.read_csv("income.csv")

# Convert our prediction target column into a string
income['income_>50K'] = income['income_>50K'].astype(str)

income.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income_>50K
0,67,Private,366425,Doctorate,16,Divorced,Exec-managerial,Not-in-family,White,Male,99999,0,60,United-States,1
1,17,Private,244602,12th,8,Never-married,Other-service,Own-child,White,Male,0,0,15,United-States,0
2,31,Private,174201,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,40,United-States,1
3,58,State-gov,110199,7th-8th,4,Married-civ-spouse,Transport-moving,Husband,White,Male,0,0,40,United-States,0
4,25,State-gov,149248,Some-college,10,Never-married,Other-service,Not-in-family,Black,Male,0,0,40,United-States,0


**ML Prediction Task:** Is the employee's salary >$50K? This is represented by the last, binary column `income_>50K`.

In [None]:
income['income_>50K'].value_counts()

0    33439
1    10518
Name: income_>50K, dtype: int64

### Bank Dataset
The [bank dataset](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing)* comes from the UC Irvine Machine Learning data repository. Download the `bank-additional.zip` folder [from the site](https://archive.ics.uci.edu/ml/machine-learning-databases/00222/) and use the dataset called `bank-additional-full.csv`.

This dataset contains marketing calls from a banking institution and includes data about the call recipient, such as age and marital status.

\**\[Moro et al., 2014\] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014*

In [None]:
bank = pd.read_csv("bank-additional-full.csv", sep = ";")
bank.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,261,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,149,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,226,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,151,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,307,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


**ML Prediction Task:** Did the recipient subscribe to the new product? This is represented by the final, binary column named `y`.

In [None]:
bank['y'].value_counts()

no     36548
yes     4640
Name: y, dtype: int64

### Airline Dataset
The [Airline Dataset](https://www.kaggle.com/teejmahal20/airline-passenger-satisfaction?select=train.csv) comes from Kaggle. Download `train.csv` and save it as `airline.csv`, the file name the code below reads.

The dataset contains the results of a survey given to passengers, including attributes like flight distance and loyalty status.

In [None]:
airline = pd.read_csv("airline.csv")
airline.head()

Unnamed: 0.1,Unnamed: 0,id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,0,70172,Male,Loyal Customer,13,Personal Travel,Eco Plus,460,3,4,3,1,5,3,5,5,4,3,4,4,5,5,25,18.0,neutral or dissatisfied
1,1,5047,Male,disloyal Customer,25,Business travel,Business,235,3,2,3,3,1,3,1,1,1,5,3,1,4,1,1,6.0,neutral or dissatisfied
2,2,110028,Female,Loyal Customer,26,Business travel,Business,1142,2,2,2,2,5,5,5,5,4,3,4,4,4,5,0,0.0,satisfied
3,3,24026,Female,Loyal Customer,25,Business travel,Business,562,2,5,5,5,2,2,2,2,2,5,3,1,4,2,11,9.0,neutral or dissatisfied
4,4,119299,Male,Loyal Customer,61,Business travel,Business,214,3,3,3,3,4,5,5,3,3,4,4,3,3,3,0,0.0,satisfied


**ML Prediction Task:** Was the passenger satisfied with the flight? This is represented by the final column named `satisfaction`.

In [None]:
airline['satisfaction'].value_counts()

neutral or dissatisfied    58879
satisfied                  45025
Name: satisfaction, dtype: int64
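
As a rough point of reference for the scores later in this notebook, the class counts printed above tell us what a model would score by always predicting the most common class. The sketch below assumes the scores are read as plain accuracy; `majority_baseline` is a hypothetical helper written for this illustration, not part of SDV.

```python
# Class counts copied from the value_counts() outputs above.
class_counts = {
    "income": {"0": 33439, "1": 10518},
    "bank": {"no": 36548, "yes": 4640},
    "airline": {"neutral or dissatisfied": 58879, "satisfied": 45025},
}


def majority_baseline(counts):
    """Fraction of rows that belong to the most common class (hypothetical helper)."""
    return max(counts.values()) / sum(counts.values())


baselines = {name: majority_baseline(counts) for name, counts in class_counts.items()}

for name, value in baselines.items():
    print(f"{name}: {value:.3f}")
```

Under that assumption, the bank data is the most imbalanced of the three, so a high raw score there says less than the same score on the airline data.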

## Original Data (Control)

Now that we've loaded our datasets, let's evaluate the general difficulty of performing the ML task for each of our datasets.

First, we'll need to split each of our datasets into separate `train` and `test` chunks. We'll use the train data as input for building the ML model and then evaluate it by predicting values for the test data.

In [None]:
from sklearn.model_selection import train_test_split

income_train, income_test = train_test_split(income, test_size = 0.2, shuffle = True)
bank_train, bank_test = train_test_split(bank, test_size = 0.2, shuffle = True)
airline_train, airline_test = train_test_split(airline, test_size = 0.2, shuffle = True)
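
The split above is random, so each run produces different `train` and `test` sets. If you want a repeatable split, `train_test_split` accepts a `random_state` seed, and its `stratify` argument keeps the class proportions the same in both parts. The sketch below uses a small made-up DataFrame (`toy`) rather than the real datasets, and the seed value is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data standing in for one of the real datasets: 80 "no" rows, 20 "yes" rows.
toy = pd.DataFrame({
    "age": list(range(100)),
    "y": ["no"] * 80 + ["yes"] * 20,
})

# random_state fixes the shuffle; stratify keeps the 80/20 class ratio in both parts.
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, shuffle=True, random_state=0, stratify=toy["y"]
)

print(toy_train["y"].value_counts().to_dict())
print(toy_test["y"].value_counts().to_dict())
```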

We'll build & test our ML models using the [SDMetrics](https://github.com/sdv-dev/SDMetrics) library. In particular, the [ML Efficacy](https://sdv.dev/SDV/user_guides/evaluation/single_table_metrics.html#machine-learning-efficacy-metrics) metrics evaluate the data by building an ML model and evaluating its performance. Let's use the Binary Decision Tree Classifier and Binary Logistic Regression models.

In [None]:
# SDV contains a wrapper for the SDMetrics we need
from sdv.metrics.tabular import BinaryDecisionTreeClassifier, BinaryLogisticRegression

# Format our input & test data
dataset_names = ['income', 'bank', 'airline']
inputs = [income_train, bank_train, airline_train]
tests = [income_test, bank_test, airline_test]

results = {}

# Iterate through each dataset and collect results
for dataset, train, test in zip(dataset_names, inputs, tests):
  # the column we're trying to predict is always the last one
  target_col = train.columns[-1]

  # Create & evaluate the ML models
  tree = BinaryDecisionTreeClassifier.compute(test, train, target=target_col)
  lr = BinaryLogisticRegression.compute(test, train, target=target_col)

  # Save results
  results[dataset] = {'Tree': tree, 'LR': lr}

The results are now available in our dictionary.

In [None]:
print('Income:', results['income'])
print('Bank:', results['bank'])
print('Airline:', results['airline'])

Income: {'Tree': 0.8668974957726064, 'LR': 0.8710565651797657}
Bank:  {'Tree': 0.9241970021413276, 'LR': 0.9108780165774224}
Airline:  {'Tree': 0.9564958537103977, 'LR': 0.8837069225143054}


These results form our **control**. This helps us identify the difficulty of the ML task. After all, prediction is harder on some datasets than on others, just based on the nature of the data.

From looking at these results, it seems like the ML task is hardest for the Income Dataset, as we aren't able to obtain high scores compared to the rest.
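
The pattern behind these scores is to fit a model on one table and then score its predictions on a separate table. The sketch below shows that pattern directly in scikit-learn on generated toy data. It is an illustration only, not the SDMetrics implementation, and it uses plain accuracy as the score, which is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Generated toy data standing in for a real dataset.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Tree": DecisionTreeClassifier(random_state=0),
    "LR": LogisticRegression(max_iter=1000),
}

toy_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)          # learn from the training rows only
    predictions = model.predict(X_test)  # predict the held-out rows
    toy_scores[name] = accuracy_score(y_test, predictions)

print(toy_scores)
```

The same pattern carries over to the experiment case: the synthetic table takes the place of the training data, while the real test set is still used for scoring.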

## Synthetic Data (Experiment)

Let's see what happens when we use synthetic data for our ML modeling rather than real data. This is our **experiment** case. We'll compare the results from these ML models to the ones that were trained on the real data.

### Generating the synthetic data

First, we need to generate the synthetic data. In this experiment, we used the SDV's [CopulaGAN](https://sdv.dev/SDV/user_guides/single_table/copulagan.html) to create the synthetic datasets.

**Note:** Generating the synthetic data using CopulaGAN may take some time, so we've saved synthetic data from a previous run.

In [None]:
income_synthetic = pd.read_csv("income_synthetic.csv")
income_synthetic['income_>50K'] = income_synthetic['income_>50K'].astype(str)

bank_synthetic = pd.read_csv("bank_synthetic.csv")
airline_synthetic = pd.read_csv("airline_synthetic.csv")

The cell below contains the code we used to generate the synthetic datasets. You can uncomment and run it to generate your own synthetic data.

**Warning: This may take a few hours to complete.**

In [None]:
# Uncomment the lines below to run CopulaGAN and regenerate the synthetic data

# from sdv.tabular import CopulaGAN

# # Income Dataset
# income_model = CopulaGAN()
# income_model.fit(income_train)
# income_synthetic = income_model.sample(income_train.shape[0])

# # Bank Dataset
# bank_model = CopulaGAN()
# bank_model.fit(bank_train)
# bank_synthetic = bank_model.sample(bank_train.shape[0])

# # Airline dataset
# airline_model = CopulaGAN()
# airline_model.fit(airline_train)
# airline_synthetic = airline_model.sample(airline_train.shape[0])
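
If you do regenerate the data, you will probably want to save it so that later runs can load it from disk. A minimal sketch with a made-up table follows; `toy_synthetic` and the temporary file path are placeholders. Note that a round trip through CSV turns a string column of `'0'`/`'1'` back into integers, which is presumably why the income target is cast with `astype(str)` after loading.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for a sampled synthetic table.
toy_synthetic = pd.DataFrame({"age": [31, 58, 25], "income_>50K": ["1", "0", "0"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_synthetic.csv")

    # index=False keeps the row index out of the file.
    toy_synthetic.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The target column comes back as integers, so cast it to strings again.
loaded_as_integers = pd.api.types.is_integer_dtype(reloaded["income_>50K"])
reloaded["income_>50K"] = reloaded["income_>50K"].astype(str)

print(loaded_as_integers)
print(reloaded["income_>50K"].tolist())
```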

### Evaluating the synthetic data

Now that we have our synthetic data, we can evaluate how effective it is for developing ML models.

As with the original data (control), we'll use the Binary Decision Tree Classifier and Binary Logistic Regression models.

In [None]:
synth_inputs = [income_synthetic, bank_synthetic, airline_synthetic]

# Iterate through each dataset and collect results
for dataset, train, test in zip(dataset_names, synth_inputs, tests):
  # the column we're trying to predict is always the last one
  target_col = train.columns[-1]

  # Create & evaluate the ML models
  tree = BinaryDecisionTreeClassifier.compute(test, train, target=target_col)
  lr = BinaryLogisticRegression.compute(test, train, target=target_col)

  # Save results
  results[dataset]['Synth_Tree'] = tree
  results[dataset]['Synth_LR'] = lr

## Final Results

Now we can compare the ML efficacy when we use the original data versus synthetic data.

The raw results from our run are printed below.

In [None]:
print('Income Dataset')
print('Original Logistic:', results['income']['LR'])
print('Synthetic Logistic:', results['income']['Synth_LR'])

print('Original Tree:', results['income']['Tree'])
print('Synthetic Tree:', results['income']['Synth_Tree'], '\n')

print('Bank Dataset')
print('Original Logistic:', results['bank']['LR'])
print('Synthetic Logistic:', results['bank']['Synth_LR'])

print('Original Tree:', results['bank']['Tree'])
print('Synthetic Tree:', results['bank']['Synth_Tree'], '\n')

print('Airline Dataset')
print('Original Logistic:', results['airline']['LR'])
print('Synthetic Logistic:', results['airline']['Synth_LR'])

print('Original Tree:', results['airline']['Tree'])
print('Synthetic Tree:', results['airline']['Synth_Tree'])

Income Dataset
Original Logistic: 0.8682084587359418
Synthetic Logistic: 0.8194074710176041
Original Tree: 0.8569550571029497
Synthetic Tree: 0.8423617315032892 

Bank Dataset
Original Logistic: 0.5899490167516388
Synthetic Logistic: 0.5916632942128693
Original Tree: 0.5805637358014304
Synthetic Tree: 0.4659038901601831 

Airline Dataset
Original Logistic: 0.8848334967807956
Synthetic Logistic: 0.8709287257019438
Original Tree: 0.9580192850225273
Synthetic Tree: 0.8928825771857065


We can also compute the accuracy lost in switching from the original data to synthetic data.

In [None]:
def compute_loss(dataset, metric):
  control = results[dataset][metric]
  synth = results[dataset]['Synth_' + metric]

  return 100 - (synth/control * 100)

print('Income Accuracy Loss')
print('Logistic:', compute_loss('income', 'LR'), '%')
print('Tree:', compute_loss('income', 'Tree'), '%\n')

print('Bank Accuracy Loss')
print('Logistic:', compute_loss('bank', 'LR'), '%')
print('Tree:', compute_loss('bank', 'Tree'), '%\n')

print('Airline Accuracy Loss')
print('Logistic:', compute_loss('airline', 'LR'), '%')
print('Tree:', compute_loss('airline', 'Tree'), '%')

Income Accuracy Loss
Logistic: 5.620883697607468 %
Tree: 1.7029277648462937 %

Bank Accuracy Loss
Logistic: -0.29058061163821947 %
Tree: 19.749742977481503 %

Airline Accuracy Loss
Logistic: 1.5714562264471397 %
Tree: 6.799101944517659 %
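
The loss printed above is the relative drop in score, `100 * (1 - synthetic / original)`, so a negative value means the model trained on synthetic data scored higher than the one trained on the original data. The sketch below re-derives two of the printed values from the raw scores; `relative_loss` is a standalone helper written for this illustration.

```python
def relative_loss(control, synth):
    """Percentage of the control score lost when training on synthetic data."""
    return 100 * (1 - synth / control)


# Raw scores copied from the run printed above.
income_lr = relative_loss(control=0.8682084587359418, synth=0.8194074710176041)
bank_lr = relative_loss(control=0.5899490167516388, synth=0.5916632942128693)

print(f"Income LR loss: {income_lr:.2f} %")
print(f"Bank LR loss: {bank_lr:.2f} %")  # negative: synthetic scored slightly higher
```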


**Important note: Several steps in our process are probabilistic.**

*   Splitting data into `test`/`train` sets
*   Generating the synthetic data
*   Developing the Logistic Regression & Tree Classifiers

For reliable results, it's important to re-run the process multiple times and average the scores.
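
One way to organize the repeated runs is to wrap a single pass (split, generate, train, score) in a function that takes a seed, then average over several seeds. In the sketch below, `run_once` is a hypothetical placeholder that returns a made-up score so the example runs on its own; in practice it would contain the steps from this notebook.

```python
import random
import statistics


def run_once(seed):
    """Placeholder for one full pass of the experiment (hypothetical).

    Returns a made-up score near 0.85 so that the sketch is runnable.
    """
    rng = random.Random(seed)
    return 0.85 + rng.uniform(-0.02, 0.02)


def average_runs(run, seeds):
    """Run the experiment once per seed and return the mean and standard deviation."""
    scores = [run(seed) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores)


mean_score, spread = average_runs(run_once, seeds=[0, 1, 2])
print(f"mean={mean_score:.3f}, stdev={spread:.3f}")
```

Reporting the spread alongside the mean shows whether a gap between the original and synthetic scores is larger than the run-to-run variation.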

## Published Results

For our blog article, we re-ran the process 3 times for each dataset and averaged the results. The averages are recorded below.

In [None]:
raw_data = {
        'Dataset': ['Income', 'Income', 'Bank', 'Bank', 'Airline', 'Airline'],
        'Metric': ['LR', 'Tree', 'LR', 'Tree', 'LR', 'Tree'],
        'Original(%)': [87.2, 85.9, 91.3, 92.8, 87.6, 95.4],
        'Synthetic(%)': [83.7, 84.1, 89.8, 89.1, 86.2, 86.9]}

all_results = pd.DataFrame(data=raw_data)

all_results['Diff(%)'] = all_results['Original(%)'] - all_results['Synthetic(%)']
all_results['Norm_Diff(%)'] = (1 - (all_results['Synthetic(%)']/all_results['Original(%)']))*100

In [None]:
all_results

Unnamed: 0,Dataset,Metric,Original(%),Synthetic(%),Diff(%),Norm_Diff(%)
0,Income,LR,87.2,83.7,3.5,4.013761
1,Income,Tree,85.9,84.1,1.8,2.09546
2,Bank,LR,91.3,89.8,1.5,1.642935
3,Bank,Tree,92.8,89.1,3.7,3.987069
4,Airline,LR,87.6,86.2,1.4,1.598174
5,Airline,Tree,95.4,86.9,8.5,8.909853
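
The `Norm_Diff(%)` column can be summarized with a few standard statistics. The sketch below recomputes it in plain Python from the published averages, so it runs without the DataFrame above; the variable names are chosen for this illustration.

```python
import statistics

# Published averages copied from the table above, in row order.
original = [87.2, 85.9, 91.3, 92.8, 87.6, 95.4]
synthetic = [83.7, 84.1, 89.8, 89.1, 86.2, 86.9]

# Same formula as the Norm_Diff(%) column: loss relative to the original score.
norm_diff = [(1 - s / o) * 100 for o, s in zip(original, synthetic)]

summary = {
    "min": min(norm_diff),
    "max": max(norm_diff),
    "median": statistics.median(norm_diff),
}
print({key: round(value, 2) for key, value in summary.items()})
```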


## Conclusions

From looking at the published data, we can draw a few conclusions.

1. The original data quantifies the general difficulty of the ML task. The Income Dataset is the hardest task, as neither of our methods was able to get above 90% accuracy on the original data.
2. Comparing the `Original` and `Synthetic` datasets allows us to quantify the suitability of synthetic data for ML development. Our results show a loss of between roughly 1.6% and 8.9% of the original values, with a median loss of roughly 3%.
3. We expect the actual loss to be even lower. In the real world, the synthetic data quality can be improved by using constraints and tuning hyperparameters. Additionally, improved ML techniques (beyond Binary Tree Classifier & Logistic Regression) can be developed on the synthetic data and re-deployed on the real data.


Considering our results, we assess that **it is reasonable to replace the original data with synthetic data for the purpose of ML development**. So go ahead and try giving your ML development team synthetic data. You might be pleasantly surprised with the results!