<a href="https://colab.research.google.com/github/gretelai/gretel-blueprints/blob/main/docs/notebooks/create_synthetic_data_smart_seeding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Create synthetic data from partial rows

This blueprint uses Gretel's premium SDKs to create a synthetic version of your own data. It relies on a helper model known as a `SeriesModel` and on a feature known as "smart seeding", which generates rows based on partial values from your training data. This is useful when you want to supply unique column values as input to the model and let Gretel synthesize the rest of each row for you.

Use cases for series data synthesis:

- You want synthetic data that has the same number of rows as the training data.
- You want to preserve some of the original row data (primary keys, dates, important categorical data).
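
Both use cases come down to one property of the output: one synthetic row per training row, with the seeded columns carried over unchanged. A minimal sketch of that check on plain Python dictionaries follows. The `seeds_preserved` helper and the `balance` column are hypothetical names used only for illustration, and the check assumes the synthetic rows come back in the same order as the seeds, which this blueprint does not confirm.

```python
def seeds_preserved(training_rows, synthetic_rows, fields):
    """Return True if both datasets have the same number of rows and every
    seed field carries the same value in corresponding rows."""
    if len(training_rows) != len(synthetic_rows):
        return False
    return all(
        train[field] == synth[field]
        for train, synth in zip(training_rows, synthetic_rows)
        for field in fields
    )


# Toy data: "balance" is a hypothetical non-seed column.
training_rows = [
    {"client_id": 1, "age": 34, "gender": "F", "balance": 1200},
    {"client_id": 2, "age": 51, "gender": "M", "balance": 560},
]
# Stand-in for model output: seed columns kept, the rest synthesized.
synthetic_rows = [
    {"client_id": 1, "age": 34, "gender": "F", "balance": 980},
    {"client_id": 2, "age": 51, "gender": "M", "balance": 610},
]

seed_fields = ["client_id", "age", "gender"]
print(seeds_preserved(training_rows, synthetic_rows, seed_fields))
```

With pandas DataFrames, a comparable check would compare the seed columns of the two frames after resetting their indexes.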

In short, this model takes partial rows from the training data and synthesizes the rest of each row for you.

In the example below, we'll use a combination of a primary key and a couple of categorical fields as seed input.
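
Before training, it can help to confirm that every seed field actually exists in the training data. The sketch below builds the same `task` dictionary that this notebook later passes to the model configuration. `build_seed_task` is a hypothetical helper, not part of `gretel-client`, and `balance` is a made-up column used for illustration.

```python
def build_seed_task(fields, columns):
    """Build a smart seeding task dict, failing early on unknown fields."""
    missing = [f for f in fields if f not in columns]
    if missing:
        raise ValueError(f"Seed fields not found in training data: {missing}")
    return {"type": "seed", "attrs": {"fields": list(fields)}}


# With a pandas DataFrame, `columns` would be list(training_df.columns).
columns = ["client_id", "age", "gender", "balance"]
task = build_seed_task(["client_id", "age", "gender"], columns)
print(task)
```

Failing early here is cheaper than discovering a misspelled column name after a training job has been submitted.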

In [None]:
%%capture

!pip install pandas pyyaml smart_open
!pip install -U gretel-client

In [None]:
# Specify your Gretel API key

from getpass import getpass
import pandas as pd
from gretel_client import configure_session, ClientConfig

pd.set_option('max_colwidth', None)

configure_session(ClientConfig(api_key=getpass(prompt="Enter Gretel API key"), 
                               endpoint="https://api.gretel.cloud"))

In [None]:
# Load and preview dataset

import pandas as pd

dataset_path = "https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/customer_finance_data.csv"

# Pull down the training data and drop an ID column, which helps produce a better model.
training_df = pd.read_csv(dataset_path)

# errors="ignore" keeps this cell from failing if the column is absent.
training_df = training_df.drop(columns="disp_id", errors="ignore")

training_df.head()

In [None]:
from smart_open import open
import yaml

from gretel_client import create_project
from gretel_client.helpers import poll

# Create a project and model configuration.
project = create_project(display_name="create-synthetic-data-smart-seeding")

# Pull down the default synthetic config.  We will modify it slightly.
with open("https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/config_templates/gretel/synthetics/default.yml", 'r') as stream:
    config = yaml.safe_load(stream)

# Prepare an object that specifies the smart seeding task.
fields = ["client_id", "age", "gender"]

task = {
    'type': 'seed',
    'attrs': {
        'fields': fields
    }
}

config['models'][0]['synthetics']['task'] = task

config['models'][0]['synthetics']['generate'] = {'max_invalid': 10000}

model = project.create_model_obj(model_config=config)

# Write the training set to a CSV file to use as the model's data source.
training_df.to_csv('train.csv', index=False)
model.data_source = 'train.csv'

# Upload the training data.  Train the model.
model.submit(upload_data_source=True)

poll(model)

# Use the model to generate synthetic data.
record_handler = model.create_record_handler_obj()

record_handler.submit(
    action="generate",
    params={"num_records": 5000, "max_invalid": 5000}
)

poll(record_handler)

synthetic = pd.read_csv(record_handler.get_artifact_link("data"), compression='gzip')

synthetic.head()

In [None]:
# Generate a report that compares the statistical properties of the training and synthetic data

from IPython.display import HTML
from smart_open import open

HTML(data=open(model.get_artifact_link("report")).read())