

# Create synthetic data with the Python SDK

This notebook uses Gretel's Python SDK and APIs to create a synthetic version of the US Adult Income dataset, a popular financial dataset for machine learning.

To run this notebook, you will need an API key from the Gretel console, at https://console.gretel.cloud.


In [None]:
%%capture
!pip install pyyaml smart_open pandas
!pip install -U gretel-client

In [None]:
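# Configure the session with your Gretel API key.
# api_key="prompt" asks for the key interactively, cache="yes" stores the
# session configuration for later runs, and validate=True checks the credentials.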

import pandas as pd
from gretel_client import configure_session

pd.set_option("max_colwidth", None)

configure_session(api_key="prompt", cache="yes", validate=True)


In [None]:
# Create a project

from gretel_client.projects import create_or_get_unique_project

project = create_or_get_unique_project(name="walkthrough-synthetic")


## Create the synthetic data configuration

Load the default configuration template, which works well for most datasets. Other templates are available at https://github.com/gretelai/gretel-blueprints/tree/main/config_templates/gretel/synthetics


In [None]:
import json
from gretel_client.projects.models import read_model_config

config = read_model_config("synthetics/default")

# Set the model epochs to 50
config["models"][0]["synthetics"]["params"]["epochs"] = 50

print(json.dumps(config, indent=2))
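
The override above indexes straight into the nested configuration, so a template with a different layout would fail with a bare `KeyError`. As a minimal sketch, assuming the configuration is a plain nested dict shaped like the one accessed above, a small helper can apply several overrides at once and fail with a clearer message. The `set_params` helper and the stand-in `example_config` dict are hypothetical and not part of gretel-client.

```python
import copy
import json


def set_params(config: dict, model_type: str, **overrides) -> dict:
    """Return a copy of `config` with `overrides` merged into the model params."""
    updated = copy.deepcopy(config)  # leave the caller's config untouched
    try:
        params = updated["models"][0][model_type]["params"]
    except (KeyError, IndexError) as err:
        raise ValueError(f"config has no params for model type {model_type!r}") from err
    params.update(overrides)
    return updated


# Hypothetical stand-in for the dict returned by read_model_config.
example_config = {"models": [{"synthetics": {"params": {"epochs": 100}}}]}

new_config = set_params(example_config, "synthetics", epochs=50)
print(json.dumps(new_config["models"][0]["synthetics"]["params"], indent=2))
```

Working on a copy keeps the loaded template intact, which is convenient when trying several parameter settings in one session.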


## Load and preview the source dataset

Specify a data source to train the model on. This can be a local file, web location, or HDFS file.


In [None]:
# Create a model object that points at the training data, then preview that data.
import pandas as pd

model = project.create_model_obj(
    model_config=config,
    data_source="https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/USAdultIncome5k.csv",
)

pd.read_csv(model.data_source)
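
Before training, it can help to check the shape of the source data and look for empty cells. The following is a minimal sketch that uses only the standard library on a small inline sample. The sample rows, the column names, and the `summarize_csv` helper are hypothetical and stand in for the real dataset.

```python
import csv
import io


def summarize_csv(text: str) -> dict:
    """Count rows, columns, and empty cells per column in CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    columns = list(reader.fieldnames or [])
    empty = {name: 0 for name in columns}
    rows = 0
    for row in reader:
        rows += 1
        for name in columns:
            # A short row yields None for missing fields; treat it as empty.
            if not (row[name] or "").strip():
                empty[name] += 1
    return {"rows": rows, "columns": columns, "empty_cells": empty}


# Made-up sample; the real dataset has different columns and many more rows.
sample = "age,workclass,income\n39,State-gov,<=50K\n50,,<=50K\n38,Private,>50K\n"
summary = summarize_csv(sample)
print(summary)
```

With the data loaded into a pandas DataFrame, `DataFrame.isna().sum()` gives similar per-column counts of missing values.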


## Train the synthetic model

In this step, we submit the model to a worker that trains a synthetic model on the source dataset. This notebook uses a worker running in the Gretel cloud; workers can also run locally.


In [None]:
from gretel_client.helpers import poll

model.submit_cloud()

poll(model)
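
Conceptually, waiting for a remote job is a status-polling loop. The sketch below illustrates that general pattern and is not gretel-client's implementation of `poll`. The `FakeJob` class, the `wait_for` function, and the status strings are assumptions made for illustration.

```python
import time


class FakeJob:
    """Hypothetical stand-in for a remote job; completes after a few checks."""

    def __init__(self, checks_until_done: int):
        self._remaining = checks_until_done

    def refresh_status(self) -> str:
        if self._remaining > 0:
            self._remaining -= 1
            return "active"
        return "completed"


def wait_for(job, interval: float = 0.01, timeout: float = 5.0) -> str:
    """Poll `job` until it leaves the 'active' state or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        status = job.refresh_status()
        if status != "active":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout} seconds")
        time.sleep(interval)


final_status = wait_for(FakeJob(checks_until_done=3))
print(final_status)
```

A timeout keeps the loop from waiting forever if a job stalls, and sleeping between checks avoids sending requests in a tight loop.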


## View the generated synthetic data


In [None]:
# View the synthetic data

synthetic_df = pd.read_csv(model.get_artifact_link("data_preview"), compression="gzip")

synthetic_df.head()


## View the synthetic data quality report


In [None]:
# Display the report that compares the statistical properties of the training and synthetic data

import IPython
from smart_open import open

IPython.display.HTML(data=open(model.get_artifact_link("report")).read())
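
To give a feel for the kind of statistical comparison involved, here is a minimal sketch that computes the total variation distance between the value frequencies of one categorical column in two datasets. This is an illustrative metric chosen for the example, not necessarily one that Gretel's report uses, and the sample values are made up.

```python
from collections import Counter


def total_variation_distance(real: list, synthetic: list) -> float:
    """Half the sum of absolute differences between two frequency distributions."""
    if not real or not synthetic:
        raise ValueError("both samples must be non-empty")
    real_freq = Counter(real)
    synth_freq = Counter(synthetic)
    categories = set(real_freq) | set(synth_freq)
    # Counter returns 0 for categories missing from one of the samples.
    return 0.5 * sum(
        abs(real_freq[c] / len(real) - synth_freq[c] / len(synthetic))
        for c in categories
    )


# Made-up values for a single categorical column.
real_values = ["Private", "Private", "State-gov", "Self-emp"]
synthetic_values = ["Private", "State-gov", "State-gov", "Self-emp"]
distance = total_variation_distance(real_values, synthetic_values)
print(distance)
```

A distance of 0 means the two columns have identical category frequencies, and 1 means they share no categories at all.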


## Generate unlimited synthetic data

You can now use the trained synthetic model to generate as much synthetic data as you like.


In [None]:
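# Generate more records from the model.
# num_records is the number of records to generate; max_invalid is the number of
# invalid records tolerated before generation stops.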

record_handler = model.create_record_handler_obj(
    params={"num_records": 100, "max_invalid": 500}
)

record_handler.submit_cloud()

poll(record_handler)


In [None]:
synthetic_df = pd.read_csv(record_handler.get_artifact_link("data"), compression="gzip")

synthetic_df.head()
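
One way to produce a large number of records is to submit several generation jobs and concatenate their results. The following is a minimal sketch of the batching arithmetic only, assuming a per-job size chosen by the user. The `plan_batches` helper and its `batch_size` argument are hypothetical and not part of gretel-client; the `max_invalid` default is copied from the cell above.

```python
def plan_batches(total_records: int, batch_size: int, max_invalid: int = 500) -> list:
    """Split `total_records` into per-job parameter dicts of at most `batch_size`."""
    if total_records < 0 or batch_size <= 0:
        raise ValueError("total_records must be >= 0 and batch_size must be > 0")
    full_batches, remainder = divmod(total_records, batch_size)
    sizes = [batch_size] * full_batches + ([remainder] if remainder else [])
    return [{"num_records": n, "max_invalid": max_invalid} for n in sizes]


batches = plan_batches(total_records=250, batch_size=100)
for params in batches:
    print(params)
```

Each dict has the same shape as the `params` argument passed to `model.create_record_handler_obj` above, so the jobs could be submitted one at a time and their outputs combined with `pd.concat`.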
