## Installation

In [None]:
# %%capture
!pip install numpy pandas matplotlib pycaret
!pip install -U gretel-client

## Log in to Gretel using your API key

In [None]:
import pandas as pd
from gretel_client import configure_session

pd.set_option("max_colwidth", None)

configure_session(api_key="prompt", validate=True, clear=True)

## Load data

We're going to explore using synthetic data as input to a downstream classification task. 

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split

df = pd.read_csv("https://gretel-blueprints-pub.s3.us-west-2.amazonaws.com/rdb/grocery_orders.csv")

In [None]:
df.head()

We will train both a synthetic data model and a downstream classification model, so we first hold out a small validation set that neither model sees. This lets us test how a classifier trained purely on synthetic data performs on unseen real data.

In [None]:
train_df, valid_df = train_test_split(df, test_size=0.05)
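
The data flow behind this hold-out can be sketched without any Gretel or scikit-learn dependency. This is only an illustration, not what `train_test_split` does internally: `holdout_split` is a hypothetical helper, and the seed and the toy row ids are assumptions made for the example. Fixing a seed makes the split reproducible, which `train_test_split` also supports through its `random_state` argument.

```python
import random


def holdout_split(rows, holdout_fraction=0.05, seed=0):
    """Shuffle rows with a fixed seed and set aside a holdout fraction."""
    if not 0 < holdout_fraction < 1:
        raise ValueError("holdout_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, round(len(shuffled) * holdout_fraction))
    # First n_holdout shuffled rows are held out, the rest are for training.
    return shuffled[n_holdout:], shuffled[:n_holdout]


row_ids = list(range(1000))  # toy stand-in for the rows of df (assumption)
train_ids, valid_ids = holdout_split(row_ids)

# The held-out rows must never reach the synthetic model or the classifier.
assert set(train_ids).isdisjoint(valid_ids)
print(len(train_ids), len(valid_ids))
```

Because the held-out rows are removed before anything is trained, any score measured on them reflects performance on real data that neither model has seen.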

## Train a synthetic model and look at the generated data

In [None]:
from gretel_client.projects import create_or_get_unique_project
from gretel_client.helpers import poll
from gretel_client.projects.models import read_model_config


# Create a project and model configuration.
project = create_or_get_unique_project(name="downstream-ML")

# Choose high-dimensionality config since we have 100+ columns
config = read_model_config("synthetics/high-dimensionality")

# Write train_df to a CSV file to use as the training data source.
train_df.to_csv("train.csv", index=False)

model = project.create_model_obj(model_config=config, data_source="train.csv")

# Upload the training data. Train the model.
model.submit_cloud()
poll(model)

synthetic = pd.read_csv(model.get_artifact_link("data_preview"), compression="gzip")
synthetic.head()

In [None]:
from gretel_client.evaluation import QualityReport

In [None]:
synthetic.to_csv("synthetic.csv", index=False)
report = QualityReport(data_source="synthetic.csv", ref_data="train.csv")

In [None]:
report.run()

In [None]:
print(report.peek())

## Downstream use case

One major benefit of synthetic data, beyond privacy preservation, is utility. The data isn't arbitrary noise: it is generated to preserve the statistical correlations of the original data, which means it can be used as input to a machine learning model. We train several classifiers and observe their performance on various folds of the data.
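
One way to make the correlation claim concrete is to compute the same pairwise correlation on the real and the synthetic data and look at the gap. The sketch below is dependency-free and uses made-up toy columns: `make_toy_orders` is a hypothetical generator, and the "synthetic" sample is simply a second draw from the same process rather than output from a Gretel model. With DataFrames, `DataFrame.corr()` would give the full correlation matrix for the same kind of comparison.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def make_toy_orders(n, seed):
    """Toy data (assumption): column b rises with column a, plus noise."""
    rng = random.Random(seed)
    a = [rng.randint(0, 5) for _ in range(n)]
    b = [x + rng.gauss(0, 1) for x in a]
    return a, b


real_a, real_b = make_toy_orders(500, seed=1)
synth_a, synth_b = make_toy_orders(500, seed=2)  # stand-in for generated rows

real_corr = pearson(real_a, real_b)
synth_corr = pearson(synth_a, synth_b)
corr_gap = abs(real_corr - synth_corr)
print(f"real={real_corr:.3f} synthetic={synth_corr:.3f} gap={corr_gap:.3f}")
```

A small gap for the column pairs that matter to the downstream task is a reasonable sign that a model trained on the synthetic data can learn similar relationships.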

In [None]:
from pycaret.classification import setup, compare_models, evaluate_model, predict_model, create_model, plot_model

In [None]:
synthetic_df = synthetic.drop(['order_id'], axis=1)
train_df = train_df.drop(['order_id'], axis=1)
valid_df = valid_df.drop(['order_id'], axis=1)

In [None]:
synthetic_train_data, synthetic_test_data = train_test_split(synthetic_df, test_size=0.2)
original_train_data, original_test_data = train_test_split(train_df, test_size=0.2)

We want to predict whether a customer will buy frozen pizza (and how many), which makes this a multi-class classification problem. We use the PyCaret library to compare many different model types. Fitting them across several cross-validation folds will take a few minutes.
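
Since the target is a count, each distinct value becomes its own class. Before comparing models it can help to look at how the classes are distributed, because the share of the most common class is the accuracy a trivial "always predict the majority" model would reach. The values below are hypothetical and only illustrate the check; on the real data, `train_df['frozen pizza'].value_counts()` gives the same kind of summary.

```python
from collections import Counter

# Hypothetical per-order counts of frozen pizza (not taken from the dataset).
frozen_pizza_counts = [0, 0, 1, 0, 2, 0, 1, 0, 0, 3, 0, 1]

class_counts = Counter(frozen_pizza_counts)
total = len(frozen_pizza_counts)
for label in sorted(class_counts):
    share = class_counts[label] / total
    print(f"class {label}: {class_counts[label]} rows ({share:.0%})")

# Share of the most common class: the accuracy of always predicting it.
majority_baseline = class_counts.most_common(1)[0][1] / total
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any accuracy reported by the model comparison is easier to interpret next to this baseline.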

In [None]:
s = setup(synthetic_train_data, target='frozen pizza')
best = compare_models()

We then see how the best classification model, trained on the synthetic data, performs on the original data.

In [None]:
test_predictions = predict_model(best, data=original_test_data)

In [None]:
valid_predictions = predict_model(best, data=valid_df)

In [None]:
synthetic_predictions = predict_model(best, data=synthetic_test_data)

In [None]:
# Baseline for comparison: repeat the experiment with a classifier trained on the original data.
s = setup(original_train_data, target='frozen pizza')
best = compare_models()

In [None]:
test_predictions = predict_model(best, data=original_test_data)

In [None]:
valid_predictions = predict_model(best, data=valid_df)

In [None]:
synthetic_predictions = predict_model(best, data=synthetic_test_data)
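
To compare the two runs side by side, the predictions from each model on the same held-out real rows can be scored with one metric. The sketch below uses plain lists and hypothetical labels; `accuracy` is an illustrative helper, not a PyCaret function. Note that the cells above reuse the names `best` and `valid_predictions` for both runs, so the first run's results would need to be kept under separate names. The column that holds the predicted label in the DataFrame returned by `predict_model` depends on the PyCaret version, so inspect the returned columns before extracting it.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty set of labels")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels and predictions for the same held-out real rows.
valid_labels = [0, 1, 0, 2, 0, 1, 0, 0]
preds_from_synthetic_model = [0, 1, 0, 0, 0, 1, 0, 1]
preds_from_original_model = [0, 1, 0, 2, 0, 0, 0, 0]

acc_synthetic = accuracy(valid_labels, preds_from_synthetic_model)
acc_original = accuracy(valid_labels, preds_from_original_model)
print(f"trained on synthetic: {acc_synthetic:.3f}")
print(f"trained on original:  {acc_original:.3f}")
```

If the two scores on the real validation set are close, the synthetic data has retained most of the signal needed for this task.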