# DataCamp Webinar

In this webinar, we demonstrate the process of evaluating the quality of synthetic data based on its utility for a downstream Machine Learning (ML) task. The method is commonly referred to as the Train-Synthetic-Test-Real (TSTR) evaluation [[1](#refs)]. The TSTR evaluation serves as a robust measure of synthetic data quality because ML models rely on the accurate representation of deeper underlying patterns to perform effectively on previously unseen data. As a result, this approach offers a more reliable assessment than simply evaluating higher-level statistics.

See image below for the general setup of TSTR.

<img src='https://raw.githubusercontent.com/mostly-ai/mostly-tutorials/dev/train-synthetic-test-real/TSTR.png' width="600px"/>

We take actual (i.e., real) data and split it into a training dataset and a holdout dataset. Next, we create a synthetic dataset based only on the training data. We then train an ML model twice: once on the synthetic data and once on the actual training data. Finally, we evaluate the performance of each of the two models on the actual holdout data, which was kept aside all along. By comparing the performance of these two models, we can assess how much utility the synthesization method has retained with respect to a specific ML task.

Note that one needs to use a true holdout for the evaluation to properly measure out-of-sample performance, as this is the relevant metric for real-world use cases. If one evaluated on the same training data that was used for the synthesis, one would "leak" information from training into evaluation. This becomes a particular issue for synthesizers that are prone to overfitting and simply memorize the samples they have been exposed to. If one, on the other hand, were to use synthetic data for the evaluation, one would not get meaningful results either, as the synthetic data might not be representative of the real data. For example, consider the degenerate case of a synthesizer that produces the same record over and over again. Any model trained on that data would yield perfect results when evaluated on it again, whereas it would be of no use when applied to real data.
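The degenerate case can be sketched in a few lines. The following is a minimal illustration on toy data generated with scikit-learn, not on the census dataset; the decision tree, the sample sizes, and the "synthesizer" that repeats one record are all assumptions made for the sake of the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the actual data (not the census dataset)
X, y = make_classification(n_samples=2000, n_features=8, n_informative=4, random_state=0)
X_trn, X_hol, y_trn, y_hol = train_test_split(X, y, test_size=0.2, random_state=0)

# Baseline: train on actual training data, test on the actual holdout
model_real = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_trn, y_trn)
acc_real = accuracy_score(y_hol, model_real.predict(X_hol))

# Degenerate "synthesizer": repeats a single training record 500 times
X_deg = np.repeat(X_trn[:1], 500, axis=0)
y_deg = np.repeat(y_trn[:1], 500)
model_deg = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_deg, y_deg)

# Evaluating on the synthetic data itself looks perfect ...
acc_deg_on_syn = accuracy_score(y_deg, model_deg.predict(X_deg))
# ... whereas the actual holdout shows how little the model has learned
acc_deg_on_hol = accuracy_score(y_hol, model_deg.predict(X_hol))

print(f"real -> holdout:         {acc_real:.1%}")
print(f"degenerate -> synthetic: {acc_deg_on_syn:.1%}")
print(f"degenerate -> holdout:   {acc_deg_on_hol:.1%}")
```

Only the comparison on the actual holdout separates the useful model from the useless one, which is why TSTR always tests on real data.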

## Synthesize Data via MOSTLY AI

For this tutorial, we will be using a modified version of the UCI Adult Income [[2](#refs)] dataset, which itself stems from the 1994 American Community Survey [[3](#refs)] by the US Census Bureau. The dataset consists of 48,842 records, 14 mixed-type features, and 1 target variable that indicates whether or not a respondent reported a high level of annual income. We selected this dataset because it is one of the go-to datasets commonly used to showcase machine learning models in action.

1. Download `census-training.csv` via the DataCamp file browser. This is an 80% sample of the full dataset. The remaining 20% sample is contained in `census-holdout.csv`.

2. Synthesize `census-training.csv` via [MOSTLY AI](https://mostly.ai/). You can leave all settings at their defaults and proceed to launch the job.

<img src='https://raw.githubusercontent.com/mostly-ai/mostly-tutorials/dev/train-synthetic-test-real/screen1.png' width="400px"/> <img src='https://raw.githubusercontent.com/mostly-ai/mostly-tutorials/dev/train-synthetic-test-real/screen2.png' width="400px"/><br /><img src='https://raw.githubusercontent.com/mostly-ai/mostly-tutorials/dev/train-synthetic-test-real/screen3.png' width="400px"/> <img src='https://raw.githubusercontent.com/mostly-ai/mostly-tutorials/dev/train-synthetic-test-real/screen4.png' width="400px"/>

3. Once the job has finished, download the generated synthetic data as a CSV file to your computer, unzip it, and rename it to `census-synthetic.csv`.

4. Upload your `census-synthetic.csv` file to the `datacamp` folder via the DataCamp file browser.

Alternatively, you can simply rename the existing `census-synthetic-demo.csv` to `census-synthetic.csv` and proceed with that one. This synthetic dataset was previously generated with MOSTLY AI.

In [None]:
import pandas as pd

# read training data
train = pd.read_csv('./census-training.csv')
print(f'read training data with {train.shape[0]:,} records and {train.shape[1]} attributes')

# read holdout data
holdout = pd.read_csv('./census-holdout.csv')
print(f'read holdout data with {holdout.shape[0]:,} records and {holdout.shape[1]} attributes')

# read synthetic data
syn = pd.read_csv('./census-synthetic.csv')
print(f"read synthetic data with {syn.shape[0]:,} records and {syn.shape[1]:,} attributes")

## Explore Synthetic Data

Show 10 randomly sampled synthetic records. Note that you can execute the following cell multiple times to see different samples.

In [None]:
syn.sample(n=10)

Ask AI: "Show 5 randomly sampled Female Professors of age 30 or younger from the synthetic dataset."

Ask AI: "Plot the average age by marital status and by gender from the synthetic dataset. Sort from lowest to highest. Color by gender. Add average age as label."
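For readers without access to the AI assistant, prompts like these roughly translate into a pandas filter and a group-by. The sketch below runs on a tiny hypothetical stand-in for `syn`; the column names `age`, `sex`, and `marital_status` and all values are assumptions and may differ from the actual file. Filtering by occupation would work the same way as the filter on `sex`.

```python
import pandas as pd

# Tiny hypothetical stand-in for `syn`; column names and values are assumptions
toy = pd.DataFrame({
    'age': [25, 41, 29, 58, 33, 47],
    'sex': ['Female', 'Male', 'Female', 'Female', 'Male', 'Male'],
    'marital_status': ['Never-married', 'Married', 'Never-married',
                       'Widowed', 'Married', 'Divorced'],
})

# Filter: female records aged 30 or younger
young_women = toy[(toy['sex'] == 'Female') & (toy['age'] <= 30)]

# Aggregate: average age by marital status and gender, sorted from lowest to highest
avg_age = (
    toy.groupby(['marital_status', 'sex'])['age']
    .mean()
    .sort_values()
    .reset_index(name='avg_age')
)

print(young_women)
print(avg_age)
```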

## Compare ML Performance

Let's now train a state-of-the-art **LightGBM** classifier on the synthetic data and then check how well it can predict whether an actual person reported an annual income of more than $50K. We will then compare its predictive accuracy to that of a model trained on the actual data, and see whether we were able to achieve similar performance based purely on the synthetic data.

In [None]:
import lightgbm as lgb
from lightgbm import early_stopping
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score
import seaborn as sns
import matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 72

target_col = 'income'
target_val = '>50K'

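def prepare_xy(df):
    # work on a copy, so that the caller's DataFrame is not modified in place
    df = df.copy()
    # binary target: 1 for high income, 0 otherwise
    y = (df[target_col]==target_val).astype(int)
    # convert string columns to categoricals, so that LightGBM can handle them natively
    str_cols = [col for col in df.select_dtypes(['object', 'string']).columns if col != target_col]
    for col in str_cols:
        df[col] = pd.Categorical(df[col])
    cat_cols = [col for col in df.select_dtypes('category').columns if col != target_col]
    # cast numeric columns to float
    num_cols = [col for col in df.select_dtypes('number').columns if col != target_col]
    for col in num_cols:
        df[col] = df[col].astype('float')
    X = df[cat_cols + num_cols]
    return X, y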

def train_model(df):
    X, y = prepare_xy(df)
    cat_cols = list(X.select_dtypes('category').columns)
    X_trn, X_val, y_trn, y_val = train_test_split(X, y, test_size=0.2, random_state=1)
    ds_trn = lgb.Dataset(X_trn, label=y_trn, categorical_feature=cat_cols, free_raw_data=False)
    ds_val = lgb.Dataset(X_val, label=y_val, categorical_feature=cat_cols, free_raw_data=False)
    model = lgb.train(
        params={
            'verbose': -1,
            'metric': 'auc',  
            'objective': 'binary'
        }, 
        train_set=ds_trn,
        valid_sets=[ds_val],
        callbacks=[early_stopping(5)],
    )
    return model

def evaluate_model(model, hol):
    X_hol, y_hol = prepare_xy(hol)
    probs = model.predict(X_hol)
    preds = (probs >= 0.5).astype(int)
    auc = roc_auc_score(y_hol, probs)
    acc = accuracy_score(y_hol, preds)
    probs_df = pd.concat([
        pd.Series(probs, name='probability').reset_index(drop=True),
        pd.Series(y_hol, name=target_col).reset_index(drop=True)
    ], axis=1)
    sns.displot(data=probs_df, x='probability', hue=target_col, bins=20, multiple="stack")
    plt.title(f"Accuracy: {acc:.1%}, AUC: {auc:.1%}", fontsize=20)
    plt.show()
    return auc

import warnings
warnings.filterwarnings('ignore')

### Train a Model on Real Data - Test on Real Data

We train the LightGBM model on the original training data, and then evaluate its performance on the holdout data. We report two performance metrics:
1. **Accuracy**: This is the probability of correctly predicting the `income` class of a randomly selected record.
2. **AUC** (Area Under the Curve): This is the probability that the model assigns the higher score to the high-income record, given two randomly selected records, one of high income and one of low income.

Whereas accuracy reflects the overall ability to assign the correct class, the AUC specifically reflects the ability to properly rank records with respect to their probability of being within the target class. In both cases, the higher the metric, the better the predictive performance of the model.
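To make the two definitions concrete, the following sketch computes both metrics by hand on a handful of made-up labels and scores (illustrative values only, not model output). It checks that the pairwise-ranking view of AUC agrees with `roc_auc_score` from scikit-learn.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up labels and scores, for illustration only
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 0])
scores = np.array([0.10, 0.35, 0.55, 0.40, 0.20, 0.80, 0.65, 0.70])

# Accuracy: share of records whose thresholded score matches the label
acc = accuracy_score(y_true, (scores >= 0.5).astype(int))

# AUC: share of (positive, negative) pairs in which the positive record
# receives the higher score; ties count as half
pos = scores[y_true == 1]
neg = scores[y_true == 0]
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

sk_auc = roc_auc_score(y_true, scores)
print(f"accuracy: {acc:.3f}, pairwise AUC: {pairwise_auc:.3f}, sklearn AUC: {sk_auc:.3f}")
```

Note that accuracy depends on the chosen threshold of 0.5, whereas the AUC depends only on the ordering of the scores.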

The displayed chart shows the distribution of scores that the model assigned to each of the holdout records. A score close to 0 means that the model is very confident that the record is of low income. A score close to 1 means that the model is very confident that it's a high-income record. These scores are further split by their actual outcome, i.e., whether or not they are actually high income. This allows one to visually inspect the model's confidence in assigning the right scores.

In [None]:
# train ML model on original training data
model_trn = train_model(train)

# evaluate trained model on original holdout data
evaluate_model(model_trn, holdout)

### Train a Model on Synthetic Data - Test on Real Data

Let's now compare the results achieved on the original data with those of a model trained on synthetic data. For a very good synthesizer, we expect the predictive performance of the two models to be close to each other.
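One simple way to express "close to each other" as a number is the gap or the ratio between the two holdout AUCs. The helper below is a hypothetical sketch and not part of the original notebook; the AUC values passed in are placeholders, not results. In the notebook, one could pass the values returned by `evaluate_model` instead.

```python
def utility_retained(auc_real: float, auc_syn: float) -> dict:
    """Summarize how close the synthetic-trained model comes to the real-trained one."""
    return {
        'auc_gap': auc_real - auc_syn,    # absolute loss in AUC
        'auc_ratio': auc_syn / auc_real,  # share of the real model's AUC that is retained
    }

# Placeholder values for illustration only, not results from this notebook
summary = utility_retained(auc_real=0.90, auc_syn=0.88)
print(summary)
```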

In [None]:
# train ML model on synthetic data
model_syn = train_model(syn)

# evaluate trained model on original holdout data
evaluate_model(model_syn, holdout)

### Driver Analysis

In [None]:
import shap
explainer = shap.TreeExplainer(model_trn)
X_syn, y_syn = prepare_xy(syn)
shap_values = explainer.shap_values(X_syn)
shap.summary_plot(shap_values, X_syn, plot_size=0.2)

## Close Gaps in Your Data with Smart Imputation

Dealing with datasets that contain missing values can be a challenge, in particular if the remaining non-missing values are not representative and thus provide a distorted, biased picture of the overall population.

In this tutorial we demonstrate how MOSTLY AI can help to close such gaps in your data via "Smart Imputation". By generating a synthetic dataset that does not contain any missing values, it is possible to create a complete and sound representation of the underlying population. With that, it is then straightforward to accurately analyze the population, as if all values had been present in the first place.
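The following sketch illustrates why non-representative missingness is a problem. It uses a simulated population and an assumed missingness mechanism, in which older respondents are more likely to have a missing age; neither is taken from the census data, and the sketch shows only the bias, not how Smart Imputation corrects it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: ages drawn uniformly between 18 and 90
age = rng.integers(18, 91, size=10_000).astype(float)

# Assumed missingness mechanism: the older the respondent,
# the more likely it is that age is missing (up to 80%)
p_missing = (age - 18) / (90 - 18) * 0.8
observed = np.where(rng.random(age.size) < p_missing, np.nan, age)

true_mean = age.mean()
observed_mean = np.nanmean(observed)

print(f"share missing:     {np.isnan(observed).mean():.1%}")
print(f"true mean age:     {true_mean:.1f}")
print(f"observed mean age: {observed_mean:.1f}")
```

Because the missing values are concentrated among older respondents, the mean of the observed values underestimates the true mean.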

1. Synthesize `census-training.csv` once again via [MOSTLY AI](https://mostly.ai/). This time, however, activate **Smart Imputation** for the column **age**. Leave all other settings at their defaults.

<img src='https://raw.githubusercontent.com/mostly-ai/mostly-tutorials/dev/smart-imputation/screen1.png' width="400px"/> <img src='https://raw.githubusercontent.com/mostly-ai/mostly-tutorials/dev/smart-imputation/screen2.png' width="400px"/>

2. Once the job has finished, download the generated synthetic data as a CSV file to your computer, and rename it to `census-synthetic-imputed.csv`.

3. Upload your `census-synthetic-imputed.csv` file to the `datacamp` folder via the DataCamp file browser.

Alternatively, you can simply rename the existing `census-synthetic-imputed-demo.csv` to `census-synthetic-imputed.csv` and proceed with that one. This synthetic dataset was previously generated with MOSTLY AI.

Note that you can already see the impact of Smart Imputation by comparing the age distribution in the Model QA report with the one in the Data QA report.

<img src='https://raw.githubusercontent.com/mostly-ai/mostly-tutorials/dev/smart-imputation/screen3.png' width="400px"/> <img src='https://raw.githubusercontent.com/mostly-ai/mostly-tutorials/dev/smart-imputation/screen4.png' width="400px"/>

In [None]:
# read synthetic data
syn_imp = pd.read_csv('./census-synthetic-imputed.csv')
print(f"read synthetic data with {syn_imp.shape[0]:,} records and {syn_imp.shape[1]:,} attributes")

# the share of missing values is a proportion, so format it as a percentage
print("Share of records with age missing")
print(f"{train['age'].isna().mean():.1%} for original data (with missings)")
print(f"{syn['age'].isna().mean():.1%} for synthetic data (with missings)")
print(f"{syn_imp['age'].isna().mean():.1%} for synthetic data (imputed)")

In [None]:
# plot side-by-side
import matplotlib.pyplot as plt
orig = pd.concat([train, holdout])
orig.age.plot(kind='kde', label = 'Original Data (with missings)', color='black')
syn_imp.age.plot(kind='kde', label = 'Synthetic Data (imputed)', color='green')
_ = plt.title('Age Distribution')
_ = plt.legend(loc='upper right')
_ = plt.xlim(13, 90)
_ = plt.ylim(0, None)

As one can see, the imputed synthetic data no longer contains any missing values. It is also apparent that the synthetic age distribution differs noticeably from the distribution of the non-missing values that were provided.

Let's check whether that new distribution is more representative of the ground truth, i.e., the underlying original age distribution.

In [None]:
raw = pd.read_csv('../smart-imputation/census-ground-truth.csv')

# plot side-by-side
orig.age.plot(kind='kde', label = 'Original Data (with missings)', color='black')
raw.age.plot(kind='kde', label = 'Original Data (ground truth)', color='red')
syn_imp.age.plot(kind='kde', label = 'Synthetic Data (imputed)', color='green')
_ = plt.title('Age Distribution')
_ = plt.legend(loc='upper right')
_ = plt.xlim(13, 90)
_ = plt.ylim(0, None)

print("Average Age")
print(f"{train['age'].mean():.1f} years for original data (with missings)")
print(f"{raw['age'].mean():.1f} years for original ground truth")
print(f"{syn_imp['age'].mean():.1f} years for synthetic data (imputed)")

## References<a class="anchor" name="refs"></a>

1. https://arxiv.org/pdf/1706.02633.pdf §3.1.2
1. https://archive.ics.uci.edu/ml/datasets/adult
1. https://www.census.gov/programs-surveys/acs

## Extras: Data Preparation for this Webinar

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

df = pd.read_csv('../smart-imputation/census-with-missings.csv')
df_trn, df_hol = train_test_split(df, test_size=0.2, random_state=1)
df_trn.to_csv('census-training.csv', index=False)
df_hol.to_csv('census-holdout.csv', index=False)