# Retaining primary keys and field values with conditional data generation

Gretel supports a feature known as model conditioning (seeding) that generates rows based on partial values from your training data. This is useful when you want to specify certain field values in the synthetic data yourself and let Gretel synthesize the rest of each row.

Use cases for conditional data generation with Gretel:

- Create synthetic data that has the same number of rows as the training data.
- Preserve some of the original row data (primary keys, dates, important categorical data).

When using conditional generation with Gretel's "seed" task, the model generates one sample for each row of the seed dataframe, in the same order as the seed rows.

In the example below, we'll use a combination of a primary key `client_id` and categorical fields `age` and `gender` as conditional inputs to the synthetic model, generating a new dataframe with the same primary key and categorical fields, but with the rest of the dataframe containing synthetically generated values.
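
To make the split between seed fields and synthesized fields concrete, here is a minimal sketch on a toy dataframe. The toy values and the `balance` column are hypothetical and are not taken from the dataset used below; only the seed field names come from the example.

```python
import pandas as pd

# Hypothetical toy frame; only the seed field names match the example.
toy_df = pd.DataFrame(
    {
        "client_id": [1, 2, 3],
        "age": [34, 51, 27],
        "gender": ["F", "M", "F"],
        "balance": [1200.5, 310.0, 98.2],  # made-up non-seed column
    }
)

seed_fields = ["client_id", "age", "gender"]

# The seed frame holds the values we want to keep verbatim.
seed_df = toy_df[seed_fields]

# Every other column is left for the model to synthesize.
synthesized_fields = [c for c in toy_df.columns if c not in seed_fields]
```

With conditional generation, each row of `seed_df` would yield one output row that keeps the seed values and fills in the remaining columns.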

In [None]:
%%capture

!pip install pyyaml smart_open pandas
!pip install -U gretel-client

In [None]:
# Specify your Gretel API key

from getpass import getpass
import pandas as pd
from gretel_client import configure_session, ClientConfig

pd.set_option('max_colwidth', None)

configure_session(ClientConfig(api_key=getpass(prompt="Enter Gretel API key"), 
                               endpoint="https://api.gretel.cloud"))

In [None]:
# Load and preview dataset

import pandas as pd

dataset_path = "https://gretel-public-website.s3-us-west-2.amazonaws.com/datasets/customer_finance_data.csv"

# We will pull down the training data to drop an ID column.  This will help give us a better model.
training_df = pd.read_csv(dataset_path)

try:
    training_df.drop("disp_id", axis="columns", inplace=True)
except KeyError:
    pass  # in case we already dropped it

training_df
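
The `try`/`except` above keeps the cell safe to re-run after the column has already been dropped. As a minimal sketch of an alternative, pandas' `DataFrame.drop` accepts `errors="ignore"`, which skips labels that are not present. The toy dataframe below is hypothetical and only stands in for the training data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the training data.
toy_df = pd.DataFrame({"client_id": [1, 2], "disp_id": [10, 20], "age": [34, 51]})

# errors="ignore" makes the drop idempotent: a missing column is skipped, not raised.
once = toy_df.drop(columns=["disp_id"], errors="ignore")
twice = once.drop(columns=["disp_id"], errors="ignore")
```

This form returns a new dataframe instead of modifying the original in place.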

In [None]:
from smart_open import open
import yaml

from gretel_client import projects
from gretel_client.helpers import poll

# Create a project and model configuration.
project = projects.create_or_get_unique_project(name='conditional-data-example')

# Pull down the default synthetic config.  We will modify it slightly.
with open("https://raw.githubusercontent.com/gretelai/gretel-blueprints/main/config_templates/gretel/synthetics/default.yml", 'r') as stream:
    config = yaml.safe_load(stream)

# Here we prepare an object to specify the conditional data generation task.
# In this example, we will retain the values for the seed fields below
# and use them as inputs to the synthetic model.
fields=["client_id", "age", "gender"]
task = {
    'type': 'seed',
    'attrs': {
        'fields': fields
    }
}
config['models'][0]['synthetics']['task'] = task
config['models'][0]['synthetics']['generate'] = {'num_records': len(training_df)}


# Fit the model on the training set
model = project.create_model_obj(model_config=config)
training_df.to_csv('train.csv', index=False)
model.data_source = 'train.csv'
model.submit_cloud()

poll(model)

synthetic = pd.read_csv(model.get_artifact_link("data_preview"), compression='gzip')
synthetic.head()
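
The seed task only makes sense if every seed field exists in the training data. Below is a minimal sketch of a guard that could run before the task is attached to the config. The helper name `build_seed_task` and the stand-in config are hypothetical and not part of `gretel-client`; only the dictionary layout mirrors the cell above.

```python
def build_seed_task(fields, columns):
    """Build a seed task dict after checking each field is a known column.

    Hypothetical helper, not part of gretel-client.
    """
    missing = [f for f in fields if f not in columns]
    if missing:
        raise ValueError(f"Seed fields not found in training data: {missing}")
    return {"type": "seed", "attrs": {"fields": list(fields)}}


# Stand-in config with the same nesting used above; the real template has more keys.
demo_config = {"models": [{"synthetics": {}}]}
demo_columns = ["client_id", "age", "gender", "balance"]  # "balance" is made up

demo_config["models"][0]["synthetics"]["task"] = build_seed_task(
    ["client_id", "age", "gender"], demo_columns
)
```

In the notebook, the same call would take `fields` and `training_df.columns` as its arguments.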

In [None]:
# Generate a report comparing the statistical properties of the training and synthetic data

import IPython
from smart_open import open

IPython.display.HTML(data=open(model.get_artifact_link("report")).read())

In [None]:
# Use the trained model to generate additional synthetic data, conditioned on the seed fields.

seeds = training_df[fields]
seeds.to_csv('seeds.csv', index=False)

rh = model.create_record_handler_obj(data_source="seeds.csv", params={"num_records": len(seeds)})
rh.submit_cloud()

poll(rh)

synthetic_next = pd.read_csv(rh.get_artifact_link("data"), compression='gzip')
synthetic_next
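
Since the seed task is expected to return one row per seed in the same order, a quick sanity check is to compare the seed columns of the output against the seeds. The sketch below uses a hypothetical helper, `seeds_preserved`, and toy data. It compares values as strings, which is an assumption that may need adjusting if a CSV round trip changes number formatting (for example `34` becoming `34.0`).

```python
import pandas as pd


def seeds_preserved(seeds: pd.DataFrame, synthetic: pd.DataFrame, fields: list) -> bool:
    """Return True if `synthetic` repeats the seed values row for row.

    Hypothetical helper; values are compared as strings.
    """
    if len(seeds) != len(synthetic):
        return False
    left = seeds[fields].reset_index(drop=True).astype(str)
    right = synthetic[fields].reset_index(drop=True).astype(str)
    return left.equals(right)


# Toy data standing in for the seeds and the generated output.
demo_fields = ["client_id", "age", "gender"]
demo_seeds = pd.DataFrame({"client_id": [1, 2], "age": [30, 40], "gender": ["F", "M"]})
demo_synth = demo_seeds.assign(balance=[10.0, 20.0])  # "balance" is made up

in_order = seeds_preserved(demo_seeds, demo_synth, demo_fields)
reversed_rows = seeds_preserved(demo_seeds, demo_synth.iloc[::-1], demo_fields)
```

In the notebook, the equivalent check would be `seeds_preserved(seeds, synthetic_next, fields)`.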