<a href="https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/create_synthetic_data_from_a_dataframe_or_csv.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Create synthetic data with the Python SDK

This notebook walks you through creating synthetic data from a CSV file or a DataFrame of your choosing, using Gretel's Python SDK and the `tabular-actgan` model.

This notebook will take about 5 minutes to run end to end. You will need an API key from the Gretel console, at https://console.gretel.cloud.

In [None]:
!pip install -Uqq gretel-client

To get started with your project, you'll need to set up the following parameters:

- `DATASET_PATH`: The path or URL of the dataset to use for training and generation.
- `GRETEL_PROJECT_NAME`: The name of the Gretel project that will store the trained model and its results. Choose a unique, descriptive name.

In [None]:
import pandas as pd

DATASET_PATH = "https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/USAdultIncome5k.csv" # @param {type:"string"}
GRETEL_PROJECT_NAME = "synthetic-data" # @param {type:"string"}

In [None]:
# Specify your Gretel API key

from gretel_client import configure_session

pd.set_option("max_colwidth", None)

configure_session(api_key="prompt", cache="yes", validate=True)
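
With `api_key="prompt"`, the cell above asks for the key interactively. For unattended runs, one option is to read the key from an environment variable and fall back to the prompt only when it is missing. The sketch below is a minimal illustration of that idea: the helper name `resolve_api_key` is hypothetical, and the variable name `GRETEL_API_KEY` is an assumption of this example, so adjust it to match your setup.

```python
import os


def resolve_api_key(env_var="GRETEL_API_KEY"):
    """Return the key stored in `env_var`, or "prompt" if it is unset or blank.

    Hypothetical helper: the environment variable name is an assumption.
    """
    value = os.environ.get(env_var, "").strip()
    return value if value else "prompt"


api_key = resolve_api_key()

# The result can then be passed to the same call used in this notebook:
# configure_session(api_key=api_key, cache="yes", validate=True)
```

Keeping `"prompt"` as the fallback means the notebook still behaves as before when the variable is not set.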


In [None]:
# Create a project

from gretel_client.projects import create_or_get_unique_project

project = create_or_get_unique_project(name=GRETEL_PROJECT_NAME)


## Create the synthetic data configuration

Load the default configuration template. This template will work well for most datasets. View other templates at https://github.com/gretelai/gretel-blueprints/tree/main/config_templates/gretel/synthetics


In [None]:
import json

from gretel_client.projects.models import read_model_config

config = read_model_config("synthetics/tabular-actgan")

# Adjust model parameters: training epochs and number of records to generate
config["models"][0]["actgan"]["params"]["epochs"] = "auto"
config["models"][0]["actgan"]["generate"]["num_records"] = 5000

print(f"Model configuration:\n{json.dumps(config, indent=2)}")
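
The configuration is a nested dictionary, so overrides like the two above can be wrapped in a small function that leaves the loaded template untouched. The following is a minimal sketch, not part of the Gretel SDK: `with_actgan_overrides` is a hypothetical helper, and `example_config` is a hand-written stand-in that reproduces only the keys this notebook touches, not the full template.

```python
import copy
import json


def with_actgan_overrides(config, epochs="auto", num_records=5000):
    """Return a copy of `config` with the ACTGAN epochs and num_records replaced."""
    updated = copy.deepcopy(config)  # keep the original template unchanged
    actgan = updated["models"][0]["actgan"]
    actgan.setdefault("params", {})["epochs"] = epochs
    actgan.setdefault("generate", {})["num_records"] = num_records
    return updated


# Hypothetical stand-in for the dict returned by read_model_config(...);
# the values here are illustrative, not the template's real defaults.
example_config = {
    "models": [
        {"actgan": {"params": {"epochs": 100}, "generate": {"num_records": 1000}}}
    ]
}

updated_config = with_actgan_overrides(example_config, epochs="auto", num_records=5000)
print(json.dumps(updated_config, indent=2))
```

Working on a deep copy makes it easy to try several settings from the same starting template.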


## Load and preview the source dataset

Specify a data source to train the model on. This can be a local file, web location, or HDFS file.


In [None]:
# Load and preview the DataFrame to train the synthetic model on.

pd.read_csv(DATASET_PATH)
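
Beyond eyeballing the preview, it can help to check column types and missing values before training. The sketch below shows one way to do that with plain pandas. The helper `summarize_columns` is hypothetical, and `sample_df` is a tiny made-up frame so the example runs without downloading anything; in practice you would pass the frame loaded from `DATASET_PATH`.

```python
import pandas as pd


def summarize_columns(df):
    """Return one row per column: dtype, missing-value count, distinct-value count."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "missing": df.isna().sum(),
            "unique": df.nunique(),  # distinct non-missing values
        }
    )


# Made-up data for illustration only.
sample_df = pd.DataFrame(
    {
        "age": [39, 50, None, 53],
        "workclass": ["Private", "Private", "State-gov", None],
    }
)

summary = summarize_columns(sample_df)
print(summary)
```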

## Train the synthetic model

In this step, we submit a training job to a Gretel worker, which trains a synthetic model on the source dataset. The code below uses `submit_cloud()`, so the worker runs in the Gretel cloud; workers can also run locally.


In [None]:
# Train model and view synthetic data

from gretel_client.helpers import poll

model = project.create_model_obj(model_config=config, data_source=DATASET_PATH)
model.submit_cloud()

print(f"Follow along with training in the console: {project.get_console_url()}")
poll(model, verbose=False)

synthetic_df = pd.read_csv(model.get_artifact_link("data_preview"), compression="gzip")
synthetic_df

## View the synthetic data quality report


In [None]:
# Display the report comparing the statistical properties of the training and synthetic data

import IPython
from smart_open import open

IPython.display.HTML(data=open(model.get_artifact_link("report")).read(), metadata=dict(isolated=True))
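
As a rough complement to the report, you can compare a single categorical column by hand. The sketch below is not how the Gretel report is computed; it only puts the relative frequency of each value in the training and synthetic frames side by side. The helper `compare_category_shares` is hypothetical and the two example frames are made up.

```python
import pandas as pd


def compare_category_shares(train_df, synth_df, column):
    """Side-by-side relative frequency of each value of `column` in both frames."""
    train_share = train_df[column].value_counts(normalize=True).rename("train")
    synth_share = synth_df[column].value_counts(normalize=True).rename("synthetic")
    # Values present in only one frame get a share of 0 in the other.
    shares = pd.concat([train_share, synth_share], axis=1).fillna(0.0)
    shares["abs_diff"] = (shares["train"] - shares["synthetic"]).abs()
    return shares.sort_values("abs_diff", ascending=False)


# Made-up data for illustration only.
train_example = pd.DataFrame({"workclass": ["Private"] * 3 + ["State-gov"]})
synth_example = pd.DataFrame({"workclass": ["Private"] * 2 + ["State-gov"] * 2})

shares = compare_category_shares(train_example, synth_example, "workclass")
print(shares)
```

Large values in `abs_diff` point to categories whose share differs most between the two frames.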


## Generate unlimited synthetic data

You can now use the trained synthetic model to generate as much synthetic data as you like.


In [None]:
# Sample additional records from the trained model

record_handler = model.create_record_handler_obj(
    params={"num_records": 10000, "max_invalid": 500}
)
record_handler.submit_cloud()
poll(record_handler, verbose=False)

synthetic_df = pd.read_csv(record_handler.get_artifact_link("data"), compression="gzip")
synthetic_df
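
After generating records, a quick structural check can confirm that the result has the same columns as the training data and the number of rows you asked for. This is a minimal sketch with a hypothetical helper, `check_generated_frame`, run here on made-up frames; with real data you would pass the training DataFrame, `synthetic_df`, and the `num_records` value from the request.

```python
import pandas as pd


def check_generated_frame(train_df, synth_df, expected_records):
    """Report column mismatches and whether the row count matches the request."""
    missing = [c for c in train_df.columns if c not in synth_df.columns]
    extra = [c for c in synth_df.columns if c not in train_df.columns]
    return {
        "missing_columns": missing,
        "extra_columns": extra,
        "record_count_ok": len(synth_df) == expected_records,
    }


# Made-up data for illustration only.
train_example = pd.DataFrame({"age": [39, 50], "workclass": ["Private", "State-gov"]})
synth_example = pd.DataFrame({"age": [41, 47, 35], "workclass": ["Private"] * 3})

result = check_generated_frame(train_example, synth_example, expected_records=3)
print(result)
```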