<a href="https://colab.research.google.com/github/mars137/synthetic-data/blob/main/docs/notebooks/create_synthetic_data_from_a_dataframe_or_csv.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Create synthetic data with the Python SDK

This notebook walks through creating synthetic data from a CSV file or a pandas DataFrame of your choice, using Gretel's Python SDK.

To run this notebook, you need an API key from the Gretel Console at https://console.gretel.cloud.


In [1]:
%%capture
!pip install -U gretel-client

In [2]:
# Configure the session. api_key="prompt" asks for your Gretel API key interactively.

import pandas as pd
from gretel_client import configure_session

pd.set_option("max_colwidth", None)

configure_session(api_key="prompt", cache="yes", validate=True)


Gretel Api Key··········
Caching Gretel config to disk.
Using endpoint https://api.gretel.cloud
Logged in as atif.tahir13@gmail.com ✅
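
For non-interactive runs (scheduled jobs, CI), typing the key at a prompt is not practical. The sketch below shows one way to fall back to the prompt only when no key is available in the environment. The helper name `resolve_api_key` is hypothetical, and the environment variable name `GRETEL_API_KEY` is an assumption; check it against your own setup.

```python
import os


def resolve_api_key(env_var: str = "GRETEL_API_KEY") -> str:
    """Return the key stored in the environment, or "prompt" if none is set.

    The env var name is an assumption; the helper is not part of gretel-client.
    """
    key = os.environ.get(env_var, "").strip()
    return key if key else "prompt"


# Intended use in this notebook (needs gretel-client and network access):
# configure_session(api_key=resolve_api_key(), cache="yes", validate=True)
```

Passing `"prompt"` through unchanged keeps the interactive behavior shown above when the variable is missing.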


In [3]:
# Create a project

from gretel_client.projects import create_or_get_unique_project

project = create_or_get_unique_project(name="synthetic-data")


## Create the synthetic data configuration

Load the default configuration template, which works well for most datasets. Other templates are available at https://github.com/gretelai/gretel-blueprints/tree/main/config_templates/gretel/synthetics


In [4]:
import json

from gretel_client.projects.models import read_model_config

config = read_model_config("synthetics/default")

# Set the model epochs to 50
config["models"][0]["synthetics"]["params"]["epochs"] = 50

print(json.dumps(config, indent=2))




{
  "schema_version": "1.0",
  "name": "default-config",
  "models": [
    {
      "synthetics": {
        "data_source": "__tmp__",
        "params": {
          "epochs": 50,
          "vocab_size": 20000,
          "learning_rate": 0.01,
          "validation_split": false
        },
        "generate": {
          "num_records": 5000
        },
        "privacy_filters": {
          "outliers": "auto",
          "similarity": "auto"
        }
      }
    }
  ]
}
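
Editing the nested dictionary in place, as the cell above does for `epochs`, is easy to mistype. A minimal sketch of an alternative is shown below: a hypothetical helper, `with_params`, that returns a modified copy and rejects keys that are not already present in the template. The `base_config` dictionary is a stand-in trimmed to the keys printed above, not the object returned by `read_model_config`.

```python
import copy
import json


def with_params(config: dict, **overrides) -> dict:
    """Return a copy of `config` with synthetics params overridden.

    Hypothetical helper; it only accepts keys already present in the template.
    """
    updated = copy.deepcopy(config)
    params = updated["models"][0]["synthetics"]["params"]
    unknown = sorted(set(overrides) - set(params))
    if unknown:
        raise KeyError(f"unknown params: {unknown}")
    params.update(overrides)
    return updated


# Stand-in for the config printed above (trimmed).
base_config = {
    "schema_version": "1.0",
    "name": "default-config",
    "models": [
        {
            "synthetics": {
                "data_source": "__tmp__",
                "params": {
                    "epochs": 50,
                    "vocab_size": 20000,
                    "learning_rate": 0.01,
                    "validation_split": False,
                },
                "generate": {"num_records": 5000},
            }
        }
    ],
}

quick_config = with_params(base_config, epochs=10)
print(json.dumps(quick_config["models"][0]["synthetics"]["params"], indent=2))
```

Rejecting unknown keys is a design choice that catches typos such as `epoch`; relax it if you need to set parameters the template does not list.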


## Load and preview the source dataset

Specify a data source to train the model on. This can be a local file, web location, or HDFS file.


In [5]:
# Load and preview the DataFrame to train the synthetic model on.
import pandas as pd

dataset_path = "https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/USAdultIncome5k.csv"
df = pd.read_csv(dataset_path)
df.to_csv("training_data.csv", index=False)
df


Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,42,Private,255847,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,4386,0,48,United-States,>50K
1,34,Private,111567,HS-grad,9,Never-married,Transport-moving,Own-child,White,Male,0,0,40,United-States,<=50K
2,34,Private,263307,Bachelors,13,Never-married,Sales,Unmarried,Black,Male,0,0,45,?,<=50K
3,69,Private,174474,10th,6,Separated,Machine-op-inspct,Not-in-family,White,Female,0,0,28,Peru,<=50K
4,26,Private,260614,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,42,Self-emp-inc,287037,12th,8,Divorced,Craft-repair,Not-in-family,White,Male,0,0,10,United-States,<=50K
4996,48,Private,236858,11th,7,Divorced,Other-service,Not-in-family,White,Female,0,0,31,United-States,<=50K
4997,53,Private,317313,HS-grad,9,Married-civ-spouse,Transport-moving,Husband,White,Male,0,0,60,United-States,>50K
4998,23,Private,113601,Some-college,10,Never-married,Handlers-cleaners,Own-child,White,Male,0,0,30,United-States,<=50K
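
The preview shows `?` used as a placeholder for unknown values (for example in `native_country`). Before training, it can help to know how often such placeholders occur in each column. The sketch below counts them using only the standard library; the sample rows are copied from the preview above and trimmed to four columns, and `count_placeholders` is a hypothetical helper.

```python
import csv
import io
from collections import Counter

# Rows copied from the preview above, trimmed to four columns for brevity.
# In the notebook you could open "training_data.csv" instead.
sample_csv = """age,workclass,occupation,native_country
42,Private,Machine-op-inspct,United-States
34,Private,Transport-moving,United-States
34,Private,Sales,?
69,Private,Machine-op-inspct,Peru
"""


def count_placeholders(handle, placeholder="?"):
    """Count, per column, the cells whose value equals `placeholder`."""
    counts = Counter()
    for row in csv.DictReader(handle):
        for column, value in row.items():
            if isinstance(value, str) and value.strip() == placeholder:
                counts[column] += 1
    return dict(counts)


print(count_placeholders(io.StringIO(sample_csv)))
```

To run it on the full dataset, pass `open("training_data.csv", newline="")` instead of the in-memory sample.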


## Train the synthetic model

In this step, a worker running in the Gretel Cloud (or locally) trains a synthetic model on the source dataset. The cell below submits the job to the cloud with `submit_cloud()` and then polls until it finishes.


In [6]:
from gretel_client.helpers import poll

model = project.create_model_obj(model_config=config, data_source="training_data.csv")
model.submit_cloud()

poll(model)


INFO: Starting poller


{
    "uid": "641ed6f8e54532098fa02eab",
    "guid": "model_2NVIEhn3E8u2eoruahHAdpfse5A",
    "model_name": "default-config",
    "runner_mode": "cloud",
    "user_id": "61779c3ebff62105d3757a71",
    "user_guid": "user_26hlyPRrQXap2t6NhfbC1G7JA0l",
    "billing_domain": null,
    "billing_domain_guid": null,
    "project_id": "641ed6ebe5bc51838c29c8db",
    "project_guid": "proj_2NVID2K4LvnVJCeGfHpYyHTvMdw",
    "status_history": {
        "created": "2023-03-25T11:11:52.232117Z"
    },
    "last_modified": "2023-03-25T11:11:52.410909Z",
    "status": "created",
    "last_active_hb": null,
    "duration_minutes": null,
    "error_msg": null,
    "error_id": null,
    "traceback": null,
    "annotations": null,
    "container_image": "074762682575.dkr.ecr.us-west-2.amazonaws.com/gretelai/synthetics@sha256:0e0d8d352d355d498b9da449f6ffb4bb33e87f530380a98afe157718f66877d1",
    "container_image_version": "2.10.41",
    "model_type": "synthetics",
    "model_type_alias": null,
    "config"

INFO: Status is created. Model creation has been queued.
INFO: Status is pending. A Gretel Cloud worker is being allocated to begin model creation.
INFO: Status is active. A worker has started creating your model!
2023-03-25T11:12:15.828949Z  Analyzing input data and checking for auto-params...
2023-03-25T11:12:39.676011Z  Starting synthetic model training
2023-03-25T11:12:39.678078Z  Loading training data
2023-03-25T11:12:39.689866Z  Running pre-flight data checks on input data
2023-03-25T11:12:42.353367Z  Training data loaded.
{
    "record_count": 5000,
    "field_count": 15,
    "upsample_count": 5000
}
2023-03-25T11:12:42.355468Z  Training fallback model...
2023-03-25T11:12:48.587098Z  Fallback model trained successfully
2023-03-25T11:12:51.350980Z  Creating semantic validators and preparing training data
2023-03-25T11:13:00.451798Z  Beginning ML model training
2023-03-25T11:13:07.489289Z  Running training on 1 batches.
{
    "batch_sizes": "[15]"
}
2023-03-25T11:13:09.818071Z  To
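
The log above shows the job moving through `created`, `pending`, and `active`. The sketch below illustrates the general polling pattern behind that kind of output: check a status repeatedly until it reaches a terminal state. It is not the SDK's implementation of `poll`; the function name `wait_for_status`, the set of terminal statuses, and the fake status source are all assumptions made for illustration.

```python
import time
from typing import Callable, List

# Assumed terminal states; only "created", "pending" and "active" appear in the log above.
TERMINAL_STATUSES = {"completed", "error", "cancelled", "lost"}


def wait_for_status(
    get_status: Callable[[], str],
    interval_s: float = 0.0,
    max_checks: int = 100,
) -> List[str]:
    """Poll `get_status` until a terminal status; return distinct statuses in order."""
    seen: List[str] = []
    for _ in range(max_checks):
        status = get_status()
        if not seen or seen[-1] != status:
            seen.append(status)  # record each status transition once
        if status in TERMINAL_STATUSES:
            return seen
        time.sleep(interval_s)
    raise TimeoutError(f"no terminal status after {max_checks} checks")


# Fake status source standing in for a real job.
fake_job = iter(["created", "pending", "pending", "active", "active", "completed"])
history = wait_for_status(lambda: next(fake_job))
print(history)
```

A cap such as `max_checks` keeps a stuck job from blocking the notebook indefinitely.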

In [7]:
# View the synthetic data

synthetic_df = pd.read_csv(model.get_artifact_link("data_preview"), compression="gzip")

synthetic_df


Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,62,?,168496.0,10th,6,Married-civ-spouse,?,Husband,White,Male,0,0,68,United-States,<=50K
1,36,?,167990.0,12th,8,Married-civ-spouse,?,Husband,White,Male,0,0,20,United-States,>50K
2,47,?,182926.0,11th,7,Married-civ-spouse,?,Husband,White,Male,0,0,40,United-States,<=50K
3,52,Private,229553.0,9th,5,Widowed,?,Not-in-family,Black,Female,0,0,60,?,<=50K
4,71,?,187748.0,HS-grad,9,Divorced,?,Unmarried,White,Female,0,0,40,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,49,Private,169496.0,Masters,14,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,55,United-States,>50K
4996,41,Private,111563.0,Masters,14,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,50,United-States,>50K
4997,42,Private,200973.0,9th,5,Married-civ-spouse,Machine-op-inspct,Wife,White,Female,0,0,35,United-States,>50K
4998,21,Private,296450.0,Assoc-acdm,12,Never-married,Sales,Own-child,White,Female,0,0,40,United-States,<=50K


## View the synthetic data quality report


In [8]:
# Display the report comparing the statistical properties of the training and synthetic data

import IPython
from smart_open import open

IPython.display.HTML(data=open(model.get_artifact_link("report")).read(), metadata=dict(isolated=True))


Synthetic Data Use Cases,Excellent,Good,Moderate,Poor,Very Poor
Significant tuning required to improve model,,,,,
Improve your model using our tips and advice,,,,,
Demo environments or mock data,,,,,
Pre-production testing environments,,,,,
Balance or augment machine learning data sources,,,,,
Machine learning or statistical analysis,,,,,

Data Sharing Use Case,Excellent,Very Good,Good,Normal,Poor
"Internally, within the same team",,,,,
"Internally, across different teams",,,,,
"Externally, with trusted partners",,,,,
"Externally, public availability",,,,,

Unnamed: 0,Training Data,Synthetic Data
Row Count,5000,5000
Column Count,15,15
Training Lines Duplicated,--,0

Default Privacy Protections,Advanced Protections

Field,Unique,Missing,Ave. Length,Type,Distribution Stability
education_num,16,0,1.55,Numeric,Excellent
education,16,0,8.43,Categorical,Excellent
native_country,40,0,12.3,Categorical,Excellent
occupation,15,0,12.18,Categorical,Excellent
hours_per_week,82,0,1.98,Numeric,Excellent
relationship,6,0,9.1,Categorical,Excellent
capital_loss,53,0,1.14,Numeric,Excellent
marital_status,7,0,14.52,Categorical,Excellent
workclass,8,0,7.89,Categorical,Excellent
race,5,0,5.54,Categorical,Excellent


## Generate unlimited synthetic data

You can now use the trained synthetic model to generate as much synthetic data as you like.


In [9]:
# Generate more records from the model

record_handler = model.create_record_handler_obj(
    params={"num_records": 100, "max_invalid": 500}
)
record_handler.submit_cloud()
poll(record_handler)


INFO: Starting poller


{
    "uid": "641ed88cd92404cded7179ee",
    "guid": "model_run_2NVJ3UK7MSlpk4u8jJ80ol8BsB7",
    "model_name": null,
    "runner_mode": "cloud",
    "user_id": "61779c3ebff62105d3757a71",
    "user_guid": "user_26hlyPRrQXap2t6NhfbC1G7JA0l",
    "billing_domain": null,
    "billing_domain_guid": null,
    "project_id": "641ed6ebe5bc51838c29c8db",
    "project_guid": "proj_2NVID2K4LvnVJCeGfHpYyHTvMdw",
    "status_history": {
        "created": "2023-03-25T11:18:36.427000Z"
    },
    "last_modified": "2023-03-25T11:18:36.546000Z",
    "status": "created",
    "last_active_hb": null,
    "duration_minutes": null,
    "error_msg": null,
    "error_id": null,
    "traceback": null,
    "annotations": null,
    "container_image": "074762682575.dkr.ecr.us-west-2.amazonaws.com/gretelai/synthetics@sha256:0e0d8d352d355d498b9da449f6ffb4bb33e87f530380a98afe157718f66877d1",
    "container_image_version": "2.10.41",
    "model_id": "641ed6f8e54532098fa02eab",
    "model_guid": "model_2NVIEhn3E8u2e

INFO: Status is created. A Record generation job has been queued.
INFO: Status is pending. A Gretel Cloud worker is being allocated to begin generating synthetic records.
INFO: Status is active. A worker has started!
2023-03-25T11:18:50.917458Z  Loading model to worker
2023-03-25T11:19:08.977435Z  Checking for synthetic smart seeds
2023-03-25T11:19:08.977830Z  No smart seeds provided, will attempt generation without them
2023-03-25T11:19:08.978106Z  Loading model
2023-03-25T11:19:08.978472Z  Fallback model is available to use if needed.
2023-03-25T11:19:14.123455Z  LSTM model is available for generation.
2023-03-25T11:19:14.123908Z  Generating records
{
    "num_records": 100
}
2023-03-25T11:19:19.129747Z  Generation in progress
{
    "current_valid_count": 0,
    "current_invalid_count": 0,
    "new_valid_count": 0,
    "new_invalid_count": 0,
    "completion_percent": 0.0
}
2023-03-25T11:19:24.135675Z  Generation in progress
{
    "current_valid_count": 0,
    "current_invalid_count"
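
The generation log tracks valid and invalid record counts, and the request sets both `num_records` and `max_invalid`. The sketch below illustrates how such a pair of limits can interact, under the assumption that generation stops once enough valid records exist and fails once the invalid count exceeds the cap. It is a toy loop, not the worker's actual logic; `collect_valid_records`, the candidate lines, and the validity rule are hypothetical.

```python
def collect_valid_records(candidates, is_valid, num_records, max_invalid):
    """Gather valid records until `num_records` is reached or too many are invalid."""
    valid, invalid_count = [], 0
    if num_records <= 0:
        return valid
    for record in candidates:
        if is_valid(record):
            valid.append(record)
            if len(valid) == num_records:
                return valid
        else:
            invalid_count += 1
            if invalid_count > max_invalid:
                raise RuntimeError(f"gave up after {invalid_count} invalid records")
    raise RuntimeError("candidate stream ended before enough valid records were found")


def has_three_fields(line):
    """Toy validity rule: a record must have exactly three comma-separated fields."""
    return len(line.split(",")) == 3


# Hypothetical candidate lines, two of which are malformed.
candidate_lines = ["26,Private,9th", "23,?", "62,?,HS-grad", "bad", "25,Private,Some-college"]

records = collect_valid_records(candidate_lines, has_three_fields, num_records=3, max_invalid=5)
print(records)
```

Under this reading, a generous `max_invalid` (500 for 100 records in the cell above) gives the model room to discard malformed lines without failing the job.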

In [10]:
# Load the newly generated records from the record handler's "data" artifact
synthetic_df = pd.read_csv(record_handler.get_artifact_link("data"), compression="gzip")

synthetic_df


Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,gender,capital_gain,capital_loss,hours_per_week,native_country,income_bracket
0,26,Private,193898,9th,5,Separated,Other-service,Unmarried,White,Female,0,0,40,United-States,<=50K
1,23,?,332379,HS-grad,9,Never-married,?,Unmarried,White,Female,0,0,20,United-States,<=50K
2,62,?,293324,HS-grad,9,Married-civ-spouse,?,Husband,White,Male,0,0,50,United-States,<=50K
3,25,Private,439263,Some-college,10,Never-married,Other-service,Other-relative,Other,Male,0,0,20,Peru,<=50K
4,64,Private,415287,HS-grad,9,Married-civ-spouse,Sales,Husband,White,Male,0,0,35,United-States,>50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,57,Private,210525,HS-grad,9,Divorced,Transport-moving,Not-in-family,White,Male,0,0,35,United-States,<=50K
96,43,Private,101684,HS-grad,9,Married-spouse-absent,Other-service,Other-relative,White,Female,0,0,20,United-States,<=50K
97,46,Private,272780,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
98,57,Private,204033,Some-college,10,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States,>50K
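
Once new records are loaded, a quick sanity check is to count synthetic rows that exactly match a training row, similar in spirit to the "Training Lines Duplicated" row of the quality report. The sketch below does this with plain tuples; the rows are copied from the previews above and trimmed to four columns (age, workclass, education, income_bracket), so the result is illustrative only. `count_training_duplicates` is a hypothetical helper.

```python
def count_training_duplicates(training_rows, synthetic_rows):
    """Count synthetic rows that exactly match some training row."""
    training_set = {tuple(row) for row in training_rows}
    return sum(1 for row in synthetic_rows if tuple(row) in training_set)


# Rows from the previews above, trimmed to four columns (illustration only).
training_rows = [
    (42, "Private", "HS-grad", ">50K"),
    (34, "Private", "HS-grad", "<=50K"),
    (34, "Private", "Bachelors", "<=50K"),
]
synthetic_rows = [
    (26, "Private", "9th", "<=50K"),
    (23, "?", "HS-grad", "<=50K"),
    (62, "?", "HS-grad", "<=50K"),
]

print(count_training_duplicates(training_rows, synthetic_rows))
```

With the real DataFrames, the same check could be run on `df.itertuples(index=False)` and `synthetic_df.itertuples(index=False)`, provided both frames share the same columns and dtypes.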
