# Train-Synthetic-Test-Real

In this notebook, we show how to assess the quality of synthetic data in terms of its utility for a downstream ML task. This approach is known as Train-Synthetic-Test-Real (TSTR) evaluation. The image below shows the setup.

<center><img src='./TSTR.png' width="600px"/></center>

We take actual (i.e., real) data and split it into a training dataset and a holdout dataset. Next, we create a synthetic dataset based only on the training data. We then train a Machine Learning (ML) model twice: once on the synthetic data and once on the actual training data. Finally, we evaluate both models on the actual holdout data, which was kept aside all along. By comparing the performance of the two models, we can assess how much utility the synthesization method has retained with respect to a specific ML task.

Note that a true holdout is needed for the evaluation in order to properly measure out-of-sample performance, which is what matters for real-world use cases. If one evaluated on the same training data that was used for the synthesis, one would "leak" information from training into evaluation. This is particularly an issue for synthesizers that are prone to overfitting and simply memorize the samples they have been exposed to. If, on the other hand, one used synthetic data for the evaluation, the results would not be meaningful either whenever the synthetic data is not representative of the real data. For example, consider a synthesizer that generates the same record over and over again. Any model trained on that data would yield perfect results when evaluated on it again, yet it would be of no use when applied to real data.
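
The degenerate-synthesizer argument above can be made concrete with a small sketch. All data here is made up for illustration (the label arrays and the `constant_model` helper are hypothetical and not part of this notebook's pipeline); it only shows why a perfect score on synthetic data says nothing about performance on real data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical real holdout labels with a mix of both classes (about 25% positives).
y_real = (rng.random(1_000) < 0.25).astype(int)
y_real[:2] = [0, 1]  # make sure both classes are present

# Degenerate "synthetic" data: one record repeated, so the label is always 0.
y_syn = np.zeros(1_000, dtype=int)

# The only thing a model can learn from y_syn is to always predict that label.
def constant_model(n_rows: int) -> np.ndarray:
    return np.zeros(n_rows, dtype=int)

# Evaluated on the synthetic data itself, the model looks perfect ...
acc_on_synthetic = (constant_model(len(y_syn)) == y_syn).mean()

# ... but on real data it never detects a single positive case.
preds_real = constant_model(len(y_real))
recall_on_real = preds_real[y_real == 1].mean()

print(f"accuracy on synthetic data: {acc_on_synthetic:.1%}")
print(f"recall on real positives:   {recall_on_real:.1%}")
```

Recall is used for the real data because plain accuracy of a constant predictor can still look acceptable on an imbalanced target.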

## Fetch Actual Training Data and Synthesize it via MOSTLY AI

1. Download `census-training.csv` from [here](https://github.com/mostly-ai/public-demo-data/raw/dev/census/census-training.csv). This is an 80% random sample of the Adult Income dataset. The remaining 20% of records form the holdout, which is loaded further below.
2. Synthesize `census-training.csv` via [MOSTLY AI](https://mostly.ai/). You can leave all default settings as-is.
3. Upload the generated synthetic data to this notebook by executing the next cell.

In [None]:
# upload synthetic dataset
import pandas as pd
try:
    # check whether we are in Google colab
    from google.colab import files
    import io
    uploaded = files.upload()
    syn = pd.read_csv(io.BytesIO(list(uploaded.values())[0]))
    print(f"uploaded synthetic data with {syn.shape[0]:,} records and {syn.shape[1]:,} attributes")
except Exception:
    # not running in Google Colab (or the upload failed): fall back to a local file
    syn = pd.read_csv('census-synthetic.csv')
    print(f"use previously synthesized dataset with {syn.shape[0]:,} records and {syn.shape[1]:,} attributes")


In [None]:
# load training and holdout data (both CSV files are expected in the working directory)
import pandas as pd
train = pd.read_csv('census-training.csv')
print(f'fetched training data with {train.shape[0]:,} records and {train.shape[1]} attributes')
holdout = pd.read_csv('census-holdout.csv')
print(f'fetched holdout data with {holdout.shape[0]:,} records and {holdout.shape[1]} attributes')
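
Before training on the uploaded data, it can help to confirm that the synthetic dataset has the same columns as the actual training data and a comparable target distribution. The helper below is a minimal sketch of such a check; `compare_schema` is a hypothetical name introduced here, and the toy frames only stand in for `train` and `syn`.

```python
import pandas as pd

def compare_schema(real: pd.DataFrame, synthetic: pd.DataFrame, target: str = "income") -> pd.DataFrame:
    """Raise on column mismatches, then return the target class shares side by side."""
    missing = set(real.columns) - set(synthetic.columns)
    extra = set(synthetic.columns) - set(real.columns)
    if missing or extra:
        raise ValueError(f"column mismatch: missing={sorted(missing)}, extra={sorted(extra)}")
    shares = pd.concat(
        [
            real[target].value_counts(normalize=True).rename("real"),
            synthetic[target].value_counts(normalize=True).rename("synthetic"),
        ],
        axis=1,
    ).fillna(0.0)  # a class absent from one dataset gets a share of 0
    return shares

# toy data standing in for the actual and the synthetic training data
toy_real = pd.DataFrame({"age": [30, 40, 50, 60], "income": ["<=50K", "<=50K", "<=50K", ">50K"]})
toy_syn = pd.DataFrame({"age": [31, 41, 51, 61], "income": ["<=50K", "<=50K", ">50K", ">50K"]})
toy_shares = compare_schema(toy_real, toy_syn)
print(toy_shares)
```

In this notebook the call would be `compare_schema(train, syn)`.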

## Compare ML Performance

We use a state-of-the-art LightGBM classifier as our downstream ML model, and train it to predict the `income` column based on the other 14 features. Thus, given the `age`, `education`, `marital-status`, etc. of a person, we intend to predict whether that person reported an annual income of more than $50K or not.

In [None]:
# define ML model training pipeline, including data preparation, model training, and model evaluation

import lightgbm as lgb
import seaborn as sns
import matplotlib.pyplot as plt
from lightgbm import early_stopping
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score

# prepare data, and split into features `X` and target `y`
def prepare_xy(df: pd.DataFrame):
    df = df.copy()  # work on a copy so that the caller's DataFrame is left untouched
    tgt_col = 'income'
    y = (df[tgt_col]=='>50K').astype(int)
    str_cols = [col for col in df.select_dtypes(['object', 'string']).columns if col != tgt_col]
    for col in str_cols:
        df[col] = pd.Categorical(df[col])
    cat_cols = [col for col in df.select_dtypes('category').columns if col != tgt_col]
    num_cols = [col for col in df.select_dtypes('number').columns if col != tgt_col]
    for col in num_cols:
        df[col] = df[col].astype('float')
    X = df[cat_cols + num_cols]
    return X, y

# train ML model with early stopping
def train_model(X, y):
    cat_cols = list(X.select_dtypes('category').columns)
    X_trn, X_val, y_trn, y_val = train_test_split(X, y, test_size=0.2, random_state=1)
    ds_trn = lgb.Dataset(X_trn, label=y_trn, categorical_feature=cat_cols, free_raw_data=False)
    ds_val = lgb.Dataset(X_val, label=y_val, categorical_feature=cat_cols, free_raw_data=False)
    print(f"X_trn: {X_trn.shape[0]:,} rows, {X_trn.shape[1]:,} columns, target: {y_trn.mean():.2%}")
    print(f"X_val: {X_val.shape[0]:,} rows, {X_val.shape[1]:,} columns, target: {y_val.mean():.2%}")
    model = lgb.train(
        num_boost_round=200, 
        params={
            'verbose': -1,
            'metric': 'auc',  
            'objective': 'binary'
        }, 
        train_set=ds_trn,
        valid_sets=[ds_val],
        callbacks=[early_stopping(5)],
    )
    return model

# apply ML Model to some holdout data, report key metrics, and visualize scores
def evaluate_model(model, hol):
    X_hol, y_hol = prepare_xy(hol)
    probs = model.predict(X_hol)
    preds = (probs >= 0.5).astype(int)
    auc = roc_auc_score(y_hol, probs)
    acc = accuracy_score(y_hol, preds)
    print("")
    print(f"Holdout Accuracy {acc:.1%}")
    print("Confusion Matrix")
    # reset the index so that predictions and labels align by position
    print(pd.crosstab(pd.Series(preds, name='predicted'), y_hol.reset_index(drop=True)))
    print("")
    probs_df = pd.concat([
        pd.Series(probs, name='probability').reset_index(drop=True),
        pd.Series(y_hol, name='target').reset_index(drop=True)
    ], axis=1)
    sns.displot(data=probs_df, x='probability', hue='target', bins=20, palette=['#008CFB', '#FF004F'])
    plt.title(f"Holdout AUC: {auc:.3f}", fontsize=20)

import warnings
warnings.filterwarnings('ignore')

### Train ML Model on Real Data: `model_trn`


In [None]:
X_trn, y_trn = prepare_xy(train)
model_trn = train_model(X_trn, y_trn)

### Train ML Model on Synthetic Data: `model_syn`

In [None]:
X_syn, y_syn = prepare_xy(syn)
model_syn = train_model(X_syn, y_syn)

### Evaluate `model_trn` on actual holdout

In [None]:
evaluate_model(model_trn, holdout)

### Evaluate `model_syn` on actual holdout

In [None]:
evaluate_model(model_syn, holdout)
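
To compare the two models at a glance, the holdout metrics can also be collected in a single table. This is a minimal sketch, not part of the original pipeline: `compare_on_holdout` is a hypothetical helper, and the toy probabilities below stand in for the outputs of `model_trn.predict` and `model_syn.predict`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score

def compare_on_holdout(y_true, probs_by_model: dict, threshold: float = 0.5) -> pd.DataFrame:
    """Return one row per model with its AUC and accuracy on the same holdout labels."""
    y_true = np.asarray(y_true)
    rows = {}
    for name, probs in probs_by_model.items():
        probs = np.asarray(probs)
        rows[name] = {
            "auc": roc_auc_score(y_true, probs),
            "accuracy": accuracy_score(y_true, (probs >= threshold).astype(int)),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# toy labels and predicted probabilities, for illustration only
toy_y = [0, 0, 1, 1]
toy_scores = compare_on_holdout(
    toy_y,
    {"real": [0.1, 0.4, 0.35, 0.8], "synthetic": [0.2, 0.3, 0.6, 0.9]},
)
print(toy_scores)
```

In this notebook, one would first run `X_hol, y_hol = prepare_xy(holdout)` and then pass `{"model_trn": model_trn.predict(X_hol), "model_syn": model_syn.predict(X_hol)}` together with `y_hol`.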

## Wrap-Up

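**Summary**: For the provided dataset, the model trained on synthetic data performs nearly on par with the model trained on actual data when both are evaluated on the actual holdout. This indicates that the synthesization has retained most of the utility of the original training data for this downstream ML task.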

**Further exercises**:
* Try running Train-Synthetic-Test-Real on a different dataset, using the same or a different downstream ML model class. For that, you will need to split the actual data into `train` and `holdout` yourself, e.g. via the previously used [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function.
* Try to improve the accuracy of the downstream model by synthetic upsampling, i.e. by generating and then using significantly more synthetic samples than were contained in the actual training data.
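
For the first exercise, the split into training and holdout data could look roughly like the sketch below. The helper name `split_train_holdout`, the 80/20 ratio (mirroring the census files used above), the seed, and the toy DataFrame are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_train_holdout(df: pd.DataFrame, holdout_frac: float = 0.2, seed: int = 42):
    """Randomly split `df` into a training part (to be synthesized) and a holdout part."""
    train_df, holdout_df = train_test_split(df, test_size=holdout_frac, random_state=seed)
    # reset the index so that both parts can be saved and reloaded cleanly
    return train_df.reset_index(drop=True), holdout_df.reset_index(drop=True)

# toy data standing in for your own dataset
toy = pd.DataFrame({"x": range(100), "income": ["<=50K"] * 75 + [">50K"] * 25})
toy_train, toy_holdout = split_train_holdout(toy)
print(f"train: {len(toy_train)} rows, holdout: {len(toy_holdout)} rows")

# only the training part would be written out and synthesized, e.g.:
# toy_train.to_csv("my-training.csv", index=False)
```

The holdout part must stay out of the synthesis step entirely, otherwise the evaluation no longer measures out-of-sample performance.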