# How to create customized Synthetic Training Data with Gretel


## 1. Define the use case
Before creating synthetic data, define the use case and the task the data will support. This example uses the following use case:

> Our goal is to create a wide variety of diverse synthetic examples containing known PII values that can be used to train a language model or NER model to detect and label domain-specific PII.

## 2. Specify requirements
As this model will be used to label PII in a production environment, the dataset needs to cover a wide variety of examples, including:

* Multiple languages
* Standard PII types (e.g. valid credit card numbers)
* Customized PII types
* A wide variety of financial document types and schemas
* At least 10k synthetic examples in total

## Set up development environment

Our first step is to install the Gretel client. You'll need a free API key from https://console.gretel.ai. If you're using Colab, we recommend storing it using Colab Secrets, under the name `gretel_api_key`.

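On Colab, Google Drive is a simple place to store the synthetic data as it is generated. The Drive-specific lines in the cells below are commented out, so by default the results are written to a local file instead.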
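
Outside Colab, one option is to read the API key from an environment variable and fall back to the interactive `"prompt"` value used later in this notebook. The sketch below is only an illustration: the helper `resolve_api_key` and the variable name `GRETEL_API_KEY` are assumptions made for this example, not part of the notebook.

```python
import os


def resolve_api_key(environ=os.environ, variable="GRETEL_API_KEY"):
    """Return the key stored in `variable`, or "prompt" if it is unset or empty.

    Both the helper and the variable name are hypothetical; adjust them to
    match however you store secrets.
    """
    return environ.get(variable) or "prompt"


# With no key in the (simulated) environment, we fall back to "prompt",
# which is the value this notebook passes to Gretel(api_key=...).
print(resolve_api_key({}))
```

The returned value could then be passed as `Gretel(api_key=resolve_api_key())` in place of the hard-coded `"prompt"`.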

In [None]:
!pip install -qq "git+https://github.com/gretelai/gretel-python-client.git@main"
!pip install -qq faker tqdm jsonlines

In [None]:
# Import dependencies

import itertools
import os
import random
import textwrap

import jsonlines
import pandas as pd
from faker import Faker

# from google.colab import drive, userdata
from gretel_client import Gretel
from IPython.display import display
from tqdm.notebook import tqdm

In [None]:
# Mount Google Drive to store our synthetic data

# drive.mount('/content/drive')

# output_file_path = '/content/drive/My Drive/generated_results.jsonl'
output_file_path = "generated_results.jsonl"

In [None]:
# Instantiate the Gretel Client

gretel = Gretel(api_key="prompt")

navigator = gretel.factories.initialize_inference_api(backend_model="gretelai/tabular-v0")

# Create Contextual Tags

## 1: Create domain-specific document types and descriptions

Our goal is to create synthetic examples across a wide variety of document types. Here we can leverage an LLM's inherent knowledge of different industry verticals and data types to generate these document types and schemas. This saves us from having to think of every possibility ourselves and then crawl the web for examples.


In [None]:
NUM_DOCUMENT_TYPES = 10

DOCUMENT_TYPE_PROMPT = f"""
You are a data expert across the financial services, insurance, and banking verticals. Generate a diverse dataset of domains and detailed descriptions for various document types, including specific formats and schemas, as they relate to the customer journey within a Finance, Insurance, Fintech, or Banking company.

Columns:
* document_type: Examples include Email, Customer support conversation, Financial Statement, Insurance Policy, Loan Application, Bill of Lading, Safety Data Sheet, Policyholder's Report, XBRL, EDI, SWIFT Messages, FIX Protocol, FpML, ISDA Definitions, BAI Format, MT940.
* document_description: A one-sentence detailed description of the kind of documents found in this domain, including specifics about format, common fields, and content type where applicable. Describe the schema, structure, and length of the data format that could be used as instructions to create a document from scratch.

Remember to customize fields and formats based on the specific requirements of each domain to accurately reflect the variety and complexity of documents in a SaaS company environment.
"""

if not os.path.exists("document_types.csv"):
    df = navigator.generate(prompt=DOCUMENT_TYPE_PROMPT, num_records=NUM_DOCUMENT_TYPES)
    df.to_csv("document_types.csv", index=False)
else:
    df = pd.read_csv("document_types.csv")

# Display the DataFrame
document_type_dict = dict(zip(df["document_type"], df["document_description"]))
navigator.display_dataframe_in_notebook(df)

# Build PII Generator

We can ask Gretel to synthesize PII types for us, but different kinds of PII often have very specific attributes that we want to detect. For example, a valid credit card number must pass a Luhn check. In this case, rather than synthesizing these values with the LLM, we'll build a wrapper around the popular Python `Faker` library that also lets users provide their own lists of values.
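
To make the Luhn requirement concrete, here is a minimal, standard-library sketch of the checksum. The function name `luhn_is_valid` is our own for illustration; it is not part of Faker or Gretel.

```python
def luhn_is_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Ignore spaces, dashes, and any other non-digit separators
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False

    total = 0
    # Walk from the rightmost digit and double every second digit
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # Same as summing the two digits of the product
        total += digit
    return total % 10 == 0


print(luhn_is_valid("4111 1111 1111 1111"))  # True
print(luhn_is_valid("4111 1111 1111 1112"))  # False
```

A check like this can be used to spot-check generated credit card values before they are placed into prompts.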

In [None]:
from faker import Faker
import itertools
import random


class PIIGenerator:
    def __init__(self, locales=None):
        # Avoid a mutable default argument; fall back to US English
        if locales is None:
            locales = ["en_US"]
        self.faker = Faker(locales)
        self.locales = locales
        self.pii_types = {}

    def add_faker_generator(self, name, method, *args, **kwargs):
        """
        Adds a Faker-based generator for a specific PII type.
        """
        self.pii_types[name] = (
            self._generate_faker_data,
            (method, args, kwargs),
            "generator",
        )

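    def add_custom_list(self, name, custom_list):
        """
        Adds a custom list of values for a specific PII type.
        """
        values = list(custom_list)
        cycler = itertools.cycle(values)
        # The callable returns the next value in the cycle (its argument is unused);
        # the list itself is kept in args so that sample() can draw from it directly.
        self.pii_types[name] = (lambda _values: next(cycler), (values,), "list")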

    def _generate_faker_data(self, method, args, kwargs):
        """
        Internal method to generate data using Faker.
        """
        result = getattr(self.faker, method)(*args, **kwargs)
        if isinstance(result, tuple):
            # Concatenate tuple elements into a single string
            return " ".join(map(str, result))
        else:
            return str(result)

    def get_pii_generator(self, name, count=1):
        """
        Retrieves a generator for the specified PII type.
        """
        if name in self.pii_types:
            func, args, _ = self.pii_types[name]
            for _ in range(count):
                yield func(*args)
        else:
            raise ValueError(f"PII type '{name}' not defined.")

    def sample(self, name, sample_size=1):
        """
        Samples data for the specified PII type without exhausting the generator.
        """
        if name not in self.pii_types:
            raise ValueError(f"PII type '{name}' not defined.")

        _, args, kind = self.pii_types[name]

        if kind == "generator":
            # Generators cannot be sampled directly, so build a pool of values first
            pool_size = max(10, sample_size)  # At least 10, or the requested sample size
            pool = [
                next(self.get_pii_generator(name, 1)).replace("\n", " ")
                for _ in range(pool_size)
            ]
            return random.sample(pool, k=sample_size)
        elif kind == "list":
            # Sample directly from the custom list
            return random.sample(args[0], k=sample_size)

    def get_all_pii_generators(self):
        """
        Returns a dictionary of all PII types with their corresponding generators.
        """
        return {name: self.get_pii_generator(name) for name in self.pii_types}

    def print_examples(self):
        """
        Prints two examples of each PII type.
        """
        print("Current Locales:", self.locales)

        for name, _ in self.pii_types.items():
            examples = list(self.sample(name, sample_size=2))
            print(f"Examples of {name}: {examples}")

## Instantiate the PII generator

Now, we will instantiate the PII generator with the PII types that we wish to interleave into our synthetic data. As many types of PII are locale-specific, we also define the desired languages, character sets, and locales in `locale_list`.

In [None]:
# Specify a list of locales to use with the Faker library
locale_list = ["en_US", "ja_JP"]

# Instantiate the PII generator, and add the data types that we wish to train on.
pii_generator = PIIGenerator(locales=locale_list)
pii_generator.add_faker_generator("Name", "name")
pii_generator.add_faker_generator("Email", "email")
pii_generator.add_faker_generator("Phone number", "phone_number")
pii_generator.add_faker_generator("Full address", "address")
pii_generator.add_faker_generator("Street address", "street_address")
pii_generator.add_faker_generator("Credit card", "credit_card_number")
pii_generator.add_faker_generator("Org or Company Name", "company")
pii_generator.add_faker_generator("Date of birth", "date_of_birth")
pii_generator.add_faker_generator("Zip code", "zipcode")
pii_generator.add_faker_generator("IBAN number", "iban")
pii_generator.add_faker_generator("IPv4 address", "ipv4")
pii_generator.add_faker_generator("IPv6 address", "ipv6")
pii_generator.add_faker_generator("US bank number", "bban")
pii_generator.add_faker_generator("US passport number", "passport_number")
pii_generator.add_faker_generator("US social security number", "ssn")
pii_generator.add_custom_list(
    "GPS latitude and longitude coordinates",
    [
        "40.56754, -89.64066",
        "25.13915, 73.06784",
        "-7.60361, 37.00438",
        "33.35283, -111.78903",
        "17.54907, 82.85749",
    ],
)
pii_generator.add_custom_list(
    "Customer ID", ["ID-001", "ID-002", "ID-003", "ID-004", "ID-005"]
)

# Build a dictionary to store all generators
pii_type_dict = pii_generator.get_all_pii_generators()

# Sample PII types
pii_generator.print_examples()

## Supporting multiple languages

Named entity recognition can be very sensitive to different contexts, schemas, and languages. We want our model to be as adaptable as possible to different languages and dialects that may exist in a production environment for financial data. In the section below, we guide the LLM to create synthetic examples matching the desired language and dialect.

In [None]:
# Define the languages we wish to generate; uncomment entries to add more

language_dict = {
    "english_us": "Content in English as spoken and written in the United States",
    #'spanish_spain': 'Content in Spanish as spoken and written in Spain',
    #'french_france': 'Content in French as spoken and written in France',
    #'german_germany': 'Content in German as spoken and written in Germany',
    #'italian_italy': 'Content in Italian as spoken and written in Italy',
    #'japanese_japan': 'Content in Japanese as spoken and written in Japan',
    #'dutch_netherlands': 'Content in Dutch as spoken and written in the Netherlands',
    #'swedish_sweden': 'Content in Swedish as spoken and written in Sweden',
    #'english_uk': 'Content in English as spoken and written in the United Kingdom',
    #'spanish_mexico': 'Content in Spanish as spoken and written in Mexico',
    #'portuguese_brazil': 'Content in Portuguese as spoken and written in Brazil'
}

# Generate permutations of contextual tags

This is the final stage of data preparation before using Gretel to generate data at scale. In this step, we compile all of the following tags to create a "recipe" that can guide the LLM to generate highly diverse synthetic data at scale.

For this dataset, we will guide each LLM generation with the following properties:
* document type (synthetic)
* document description (synthetic)
* language (chosen from `language_dict`)
* PII types (sampled from the PII generator)
* PII values (sampled from Faker or from the custom lists)
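
To get a feel for how much variety these tags allow, the sketch below counts the distinct combinations of document type, language, and set of PII types, before any PII values are sampled. It assumes the settings used in this notebook: 10 document types, 1 active language, 17 PII types (15 Faker-based plus 2 custom lists), and between 1 and 3 PII types per row. The helper `count_tag_combinations` is hypothetical and treats each set of PII types as unordered.

```python
import math


def count_tag_combinations(n_document_types, n_languages, n_pii_types, max_pii_types):
    """Count distinct (document type, language, PII type set) combinations."""
    # Number of ways to choose between 1 and max_pii_types PII types
    pii_subsets = sum(
        math.comb(n_pii_types, k) for k in range(1, max_pii_types + 1)
    )
    return n_document_types * n_languages * pii_subsets


# 10 * 1 * (17 + 136 + 680)
print(count_tag_combinations(10, 1, 17, 3))  # 8330
```

Because the PII values themselves are also sampled at random, the number of distinct prompts is far larger than this count.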

In [None]:
N_ROWS = 10  # 1000  # Total number of contextual tags to generate
MAX_PII_TYPES = 3
PII_VALUES_COUNT = 3  # Number of PII values to generate for each PII type
MIN_TEXT_LENGTH = 200  # Minimum length

sampled_contextual_tag_data = []
for _ in range(N_ROWS):
    document_type = random.choice(list(document_type_dict.keys()))
    locale = random.choice(list(language_dict.keys()))
    # Select a random number of PII types between 1 and MAX_PII_TYPES
    num_pii_types = random.randint(1, MAX_PII_TYPES)
    selected_pii_types = random.sample(list(pii_type_dict.keys()), num_pii_types)

    # Initialize lists to hold the selected PII types and their corresponding values
    selected_pii_types_list = []
    pii_values_list = []

    for pii_type in selected_pii_types:
        # Sample the PII values for each selected PII type
        pii_values = pii_generator.sample(pii_type, sample_size=PII_VALUES_COUNT)
        selected_pii_types_list.append(pii_type)
        pii_values_list.append(pii_values)

    # Create a single data entry with lists of PII types and their values
    data_entry = (
        document_type,
        document_type_dict[document_type],
        selected_pii_types_list,  # This now contains a list of selected PII types
        locale,
        pii_values_list,  # This now contains a list of lists of PII values
    )
    sampled_contextual_tag_data.append(data_entry)

# Convert sampled data to a DataFrame
contextual_tags_df = pd.DataFrame(
    sampled_contextual_tag_data,
    columns=[
        "document_type",
        "document_description",
        "pii_type",
        "language",
        "pii_values",
    ],
)

print(f"Created {len(contextual_tags_df)} contextual tag permutations")
navigator.display_dataframe_in_notebook(contextual_tags_df.head(10))

# Creating the Synthetic Dataset

With the contextual tags complete, we are ready to generate synthetic data at scale. To do this, we'll prompt Gretel to create a new dataset of synthetic records that match the desired `document_type` and `language` and that contain the PII values sampled from our generator.

With Gretel, there are two ways to do this:

1. Pass the contextual tag dataframe with a prompt to Gretel Navigator using "edit" mode. This is the simplest method.
2. Format the contextual tags using a prompt template, and ask Gretel to create a synthetic dataset. This offers a bit more customization.

For this example, we'll use the second approach (create mode), iterating over each row of the contextual tag dataframe and generating `NUM_DOCUMENTS_PER_CONTEXT` synthetic records for each row.



In [None]:
NUM_DOCUMENTS_PER_CONTEXT = 3


def add_markup_to_text(text, pii_types_dict):
    """
    Wraps every occurrence of a known PII value in the text as {[PII type]value}.
    """
    for pii_type, pii_value_list in pii_types_dict.items():
        for pii_value in pii_value_list:
            marked_up_pii = f"{{[{pii_type}]{pii_value}}}"
            text = text.replace(pii_value, marked_up_pii)
    return text


def generate_text2pii_data(row, verbose=False):
    document_type = row["document_type"]
    document_description = row["document_description"]

    pii_types_dict = {}
    pii_type = row["pii_type"]
    for k in range(len(pii_type)):
        pii_types_dict[pii_type[k]] = row["pii_values"][k]

    pii_values_markdown = ", or ".join(
        [
            f"'{item}'"
            for key, values_list in pii_types_dict.items()
            for item in values_list
        ]
    )
    language = row["language"]

    generated_records = []
    failed_count = 0

    create_prompt = f"""
Create a unique, comprehensive dataset entry as described below. Each entry should differ substantially in content, style, and perspective.

Dataset format: Two columns - 'document_type' and 'document_text'

Entry specifications:

'document_type': "{document_type}"
'document_text': A complete, coherent, and distinct synthetic {document_description} in {language}, formatted as a detailed {document_type}
  * Incorporate varied themes, styles, viewpoints, and structures
  * Use vivid descriptions, examples, and elaborations
  * Avoid repetition; ensure each entry stands out
  * Maintain coherence and logical flow
  * Seamlessly integrate the following {pii_type} values exactly as provided into the text: {pii_values_markdown}
  * Identify appropriate locations within the document to naturally incorporate these values
  * Provide context for each {pii_type}, explaining its relevance to the {document_type}
  * Ensure the {pii_type} values fit grammatically and contextually within the surrounding text
  * Maintain the overall structure and coherence of the {document_type}
Aim to create a rich, detailed, and engaging {document_type} that showcases creativity and diversity while seamlessly incorporating the provided {pii_type} values.
"""

    if verbose:
        print(create_prompt)

    while len(generated_records) < NUM_DOCUMENTS_PER_CONTEXT:
        # Generate initial documents
        results = navigator.generate(
            prompt=create_prompt, num_records=NUM_DOCUMENTS_PER_CONTEXT
        )

        # Add 'markup' column by applying the markup helper function
        results["text_markup"] = results["document_text"].apply(
            lambda text: add_markup_to_text(text, pii_types_dict)
        )

        # Flag rows as failed if no PII value was found in the text (markup unchanged)
        # or if the text is shorter than MIN_TEXT_LENGTH
        failed_results = results[
            (results["text_markup"] == results["document_text"])
            | (results["document_text"].str.len() < MIN_TEXT_LENGTH)
        ]

        # Store the successfully generated records
        generated_records.extend(
            results[~results.index.isin(failed_results.index)][
                ["document_type", "document_text", "text_markup"]
            ].values.tolist()
        )
        failed_count += len(failed_results)

    if verbose:
        # Print status update
        print(
            f"Batch Update: Successfully generated {len(generated_records)} records so far. Failed: {failed_count}."
        )
        # Display an example of the latest successful record
        if generated_records:
            latest_record = generated_records[-1]
            print(f"Latest Example:\n{textwrap.fill(str(latest_record), width=80)}\n")

    return pd.DataFrame(
        generated_records, columns=["document_type", "document_text", "text_markup"]
    )


navigator.display_dataframe_in_notebook(
    generate_text2pii_data(contextual_tags_df.iloc[0], verbose=True)
)
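
The `text_markup` column wraps each PII value as `{[PII type]value}`. Many NER training pipelines expect plain text plus character offsets instead, so the sketch below shows one possible way to convert the markup. The helper `markup_to_spans` is hypothetical, and it assumes that PII values never contain a closing brace and that markup is not nested.

```python
import re

# Matches {[label]value}; the value is captured non-greedily up to the first "}"
MARKUP_PATTERN = re.compile(r"\{\[([^\]]+)\](.*?)\}", re.DOTALL)


def markup_to_spans(text_markup):
    """Return (plain_text, spans), where spans are (start, end, label) tuples."""
    plain_parts = []
    spans = []
    cursor = 0  # Position in the marked-up string
    plain_len = 0  # Length of the plain text built so far

    for match in MARKUP_PATTERN.finditer(text_markup):
        # Copy the unmarked text that precedes this PII value
        before = text_markup[cursor:match.start()]
        plain_parts.append(before)
        plain_len += len(before)

        # Record the offsets of the value within the plain text
        label, value = match.group(1), match.group(2)
        spans.append((plain_len, plain_len + len(value), label))
        plain_parts.append(value)
        plain_len += len(value)
        cursor = match.end()

    plain_parts.append(text_markup[cursor:])
    return "".join(plain_parts), spans


example = "Contact {[Name]Jane Doe} at {[Email]jane@example.com}."
plain_text, spans = markup_to_spans(example)
print(plain_text)
for start, end, label in spans:
    print(label, repr(plain_text[start:end]))
```

Each span indexes into the returned plain text, so `plain_text[start:end]` recovers the original PII value.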

## Generate synthetic data at scale

After prompt tuning, we are ready to start generating synthetic data at scale. The code below iterates over each row in the contextual tags dataframe, creating at least `NUM_DOCUMENTS_PER_CONTEXT` synthetic documents for each combination of contextual tags. So that no results are lost if a run is interrupted, we append the results of each generation to the JSONL file at `output_file_path` (a local file by default, or a Google Drive path if you mounted Drive earlier).
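
Because results are appended as they are produced, an interrupted run can in principle be resumed without regenerating finished rows. The sketch below illustrates one way to do that with the standard library. It assumes each saved record carries a `context_index` field identifying the contextual-tag row it came from; that field and both helper functions are hypothetical and are not written by the notebook code.

```python
import json
import os
import tempfile


def append_records(path, context_index, records):
    """Append records to a JSONL file, tagging each with its context index."""
    with open(path, "a", encoding="utf-8") as handle:
        for record in records:
            tagged = dict(record, context_index=context_index)
            handle.write(json.dumps(tagged, ensure_ascii=False) + "\n")


def completed_context_indices(path):
    """Return the set of context indices already present in the JSONL file."""
    if not os.path.exists(path):
        return set()
    done = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                done.add(json.loads(line)["context_index"])
    return done


# Demonstrate with a temporary file standing in for output_file_path
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "generated_results.jsonl")
    append_records(demo_path, 0, [{"document_text": "first"}, {"document_text": "second"}])
    append_records(demo_path, 2, [{"document_text": "third"}])
    print(sorted(completed_context_indices(demo_path)))  # [0, 2]
```

In a generation loop, rows whose index is already in the returned set could simply be skipped.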



In [None]:
results = []

# Iterate over each row in the DataFrame
for index, row in tqdm(contextual_tags_df.iterrows(), total=len(contextual_tags_df)):
    result_df = generate_text2pii_data(row, verbose=False)
    results.append(result_df)

    # Display the latest record inline as a styled table
    display(
        result_df.tail(1)
        .style.set_table_attributes("style='display:inline'")
        .set_caption("Latest Record")
    )

    # Append the result to the JSONL file
    with jsonlines.open(output_file_path, mode="a") as writer:
        for _, result_row in result_df.iterrows():
            writer.write(result_row.to_dict())

# Concatenate all the DataFrames in the list into a single DataFrame
final_results = pd.concat(results, ignore_index=True)

# Display the final DataFrame with all results
display(final_results)

# Conclusion

Synthetic data generation provides a cost-effective and, most importantly, iterative way to build and customize data for AI projects. Synthetic data can significantly enhance task performance and open new opportunities for innovation and feature development in areas where you lack the data you need. As it becomes more accessible and affordable, there has never been a better time to start working with synthetic data.