# Create synthetic data from partial rows

This blueprint uses Gretel's premium SDKs to create a synthetic version of your own data. It relies on a helper model called `SeriesModel`, which uses a Gretel feature known as "smart seeding" to generate rows from partial values in your training data. This is useful when you want to use specific column values (for example, unique identifiers) as input to the model and let Gretel synthesize the rest of each row for you.

Use cases for series data synthesis:

- You want synthetic data with the same number of rows as the training data.
- You want to preserve some of the original row data (primary keys, dates, important categorical data).

In short, this model takes partial rows from the training data and synthesizes the rest of each row for you.

In the example below, we'll use a combination of a primary key and a couple of categorical fields as seed input.
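
To make "partial rows" concrete, here is a minimal pandas sketch. The toy table and its `balance` column are hypothetical stand-ins for your training data; only the seed column names come from the example in this blueprint.

```python
import pandas as pd

# Hypothetical toy table standing in for the training data
toy_df = pd.DataFrame(
    {
        "client_id": [1, 2, 3],
        "age": [34, 51, 27],
        "gender": ["F", "M", "F"],
        "balance": [1200.5, 310.0, 98.2],
    }
)

seed_columns = ["client_id", "age", "gender"]

# The partial rows: one per training row, containing only the seed columns
partial_rows = toy_df[seed_columns]

# Everything else is left for the model to synthesize
columns_to_synthesize = [c for c in toy_df.columns if c not in seed_columns]

print(partial_rows)
print(columns_to_synthesize)
```

Each partial row keeps its original seed values, and the model fills in the remaining columns.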

In [None]:
%%capture

!pip install -U "gretel-client<0.8.0" gretel-synthetics pandas

In [None]:
# Load your Gretel API key. You can get one from the Gretel Console
# at https://console.gretel.cloud

import pandas as pd
from gretel_client import get_cloud_client

pd.set_option('max_colwidth', None)

client = get_cloud_client(prefix="api", api_key="prompt")
client.install_packages()

In [None]:
# Load and preview dataset

import pandas as pd

dataset_path = "https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/customer_finance_data.csv"
training_df = pd.read_csv(dataset_path)
training_df

In [None]:
from pathlib import Path

checkpoint_dir = str(Path.cwd() / "checkpoints-series")

config_template = {
    "checkpoint_dir": checkpoint_dir,
    "vocab_size": 20000,
    "overwrite": True,
    "rnn_units": 1024,
    "embedding_dim": 256,
    "batch_size": 64,
    "learning_rate": 0.001
}
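
Because the template is a plain dictionary, you can derive variants from it without mutating the original. The sketch below is only an illustration: `base_config` mirrors the template above, while `small_config`, its checkpoint directory name, and the reduced `rnn_units` value are arbitrary placeholders, not recommendations.

```python
from pathlib import Path

# Same keys and values as the template in this blueprint
base_config = {
    "checkpoint_dir": str(Path.cwd() / "checkpoints-series"),
    "vocab_size": 20000,
    "overwrite": True,
    "rnn_units": 1024,
    "embedding_dim": 256,
    "batch_size": 64,
    "learning_rate": 0.001,
}

# Derive a variant; keys listed after **base_config override the originals.
# The directory name and the smaller rnn_units are hypothetical choices.
small_config = {
    **base_config,
    "checkpoint_dir": str(Path.cwd() / "checkpoints-series-small"),
    "rnn_units": 256,
}

print(small_config["rnn_units"], base_config["rnn_units"])
```

Using a separate checkpoint directory per variant keeps one run from overwriting another.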

In [None]:
# Work around transient import errors in Google Colab by retrying the import once

try:
    from gretel_helpers.series_models import SeriesModel
except FileNotFoundError:
    from gretel_helpers.series_models import SeriesModel

In [None]:
# When synthesizing series data we will use partial rows from the training data. Before creating the model, we can analyze
# the desired partial rows for any suggestions on how to pre-process the training data.
#
# In this example we want to use the table's primary key, gender, and age as partial rows, and let Gretel synthesize
# the rest of the data.

seed_columns = ["client_id", "age", "gender"]

print(
    SeriesModel.get_suggestions(training_df=training_df, seed_columns=seed_columns)
)

# The output shows that "disp_id" is also fully unique across the table. If this column still needs to be unique
# when synthesized, we recommend adding it to the seed column list. Otherwise, if it is not important for your
# downstream use cases (ML modeling, for example), you can simply remove it from the training data.
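
If you want to check for fully unique columns yourself, plain pandas is enough. This is a minimal sketch on a hypothetical toy table; `fully_unique_columns` is an illustrative helper, not part of the Gretel SDK, and it is not necessarily how `get_suggestions` works internally.

```python
import pandas as pd


def fully_unique_columns(df: pd.DataFrame) -> list:
    """Return the columns whose values are all distinct, with no missing values."""
    n_rows = len(df)
    # nunique() ignores NaN, so a column with missing values can never reach n_rows
    return [col for col in df.columns if df[col].nunique(dropna=True) == n_rows]


# Hypothetical toy table; only the column names echo the dataset used here
toy_df = pd.DataFrame(
    {
        "client_id": [1, 2, 3, 4],
        "disp_id": [10, 11, 12, 13],
        "gender": ["F", "M", "F", "M"],
    }
)

unique_cols = fully_unique_columns(toy_df)
print(unique_cols)
```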

In [None]:
# Create a model, train, and generate a new DataFrame

# Drop "disp_id" if it is still present; errors="ignore" makes this cell safe to re-run
training_df = training_df.drop(columns=["disp_id"], errors="ignore")

# Params for SeriesModel:
# - training_df: Your training DataFrame
# - seed_columns: A list of columns whose original dataset values you want to keep
# - synthetic_config: The usual synthetic data configuration
# - auto_seed_corr: If enabled, automatically update the seed columns with other correlated fields. This will
#                   potentially add new columns to the seed list.

model = SeriesModel(
    training_df=training_df,
    seed_columns=seed_columns,
    synthetic_config=config_template,
    auto_seed_corr=True
)

model.train()
model.generate(max_invalid=1e5)

synthetic_df = model.df
model.generate_report()
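
This blueprint does not say how `auto_seed_corr` decides which fields are correlated with the seed columns. To get a feel for the general idea, the sketch below flags numeric columns that are strongly Pearson-correlated with a numeric seed column. The helper `correlated_with_seeds`, the 0.8 threshold, and the toy data are all assumptions made for illustration, not Gretel's method.

```python
import pandas as pd


def correlated_with_seeds(df, seed_columns, threshold=0.8):
    """Return non-seed numeric columns whose absolute Pearson correlation
    with any numeric seed column is at least `threshold`."""
    corr = df.select_dtypes(include="number").corr().abs()
    numeric_seeds = [c for c in seed_columns if c in corr.columns]
    candidates = []
    for col in corr.columns:
        if col in seed_columns:
            continue
        if any(corr.loc[col, seed] >= threshold for seed in numeric_seeds):
            candidates.append(col)
    return candidates


# Hypothetical toy data: years_employed rises with age, balance does not
toy_df = pd.DataFrame(
    {
        "age": [20, 30, 40, 50, 60],
        "years_employed": [1, 10, 21, 29, 41],
        "balance": [500, 100, 900, 300, 700],
        "gender": ["F", "M", "F", "M", "F"],
    }
)

extra_seed_candidates = correlated_with_seeds(toy_df, ["age", "gender"])
print(extra_seed_candidates)
```

Non-numeric seed columns such as `gender` are skipped here; measuring association for categorical fields would need a different statistic.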

In [None]:
# Take a peek at the synthetic data

synthetic_df
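
Beyond eyeballing the output, you may want to confirm the two properties this blueprint aims for: the same number of rows, and seed values carried over from the training data. The sketch below does that with pandas on toy frames. `seeds_preserved` is a hypothetical helper, and it compares seed values regardless of row order, since the output ordering is not something this blueprint specifies.

```python
import pandas as pd


def seeds_preserved(training_df, synthetic_df, seed_columns):
    """Return True if both frames have the same row count and the same
    seed-column values, ignoring row order."""
    if len(training_df) != len(synthetic_df):
        return False
    # Sort both sides by the seed columns so row order does not matter
    original = training_df[seed_columns].sort_values(seed_columns).reset_index(drop=True)
    synthetic = synthetic_df[seed_columns].sort_values(seed_columns).reset_index(drop=True)
    return bool((original.to_numpy() == synthetic.to_numpy()).all())


# Hypothetical toy frames: same seeds, different synthesized "balance", shuffled rows
train = pd.DataFrame(
    {"client_id": [1, 2, 3], "age": [34, 51, 27], "gender": ["F", "M", "F"], "balance": [10, 20, 30]}
)
synth = pd.DataFrame(
    {"client_id": [3, 1, 2], "age": [27, 34, 51], "gender": ["F", "F", "M"], "balance": [28, 12, 19]}
)

ok = seeds_preserved(train, synth, ["client_id", "age", "gender"])
print(ok)
```

With the real data, the call would be `seeds_preserved(training_df, synthetic_df, seed_columns)`.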

In [None]:
# Display the report that compares the statistical properties of the training and synthetic data
from IPython.display import HTML

report_path = './gretel_report.html'
HTML(filename=report_path)