# Generating Synthetic Text

This notebook walks you through generating realistic but synthetic text examples with Gretel GPT, which fine-tunes an open-source, GPT-style large language model (here, `gretelai/mpt-7b`).

In this example, we will generate new annotated text utterances that can be used to augment a real-world financial dataset called `banking77`. The augmented dataset will contain additional annotated examples that can help downstream ML models better understand and respond to new customer queries. To run this notebook, you will need an API key from the Gretel console at https://console.gretel.cloud.
<br>

**Limitations and Biases**
Large-scale language models such as Gretel GPT may produce untrue and/or offensive content without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results.

In [None]:
%%capture
!pip install -U gretel-client

## Set up your project
* `DATASET_PATH`: Specify a dataset to run on.
* `INTENT`: Select an intent from the training data to boost examples for.
* `SEPARATOR`: Specify the separator character (default=`,`) used to join each intent and its text into a single column.
* `PROJECT`: Specify a project name.

In [None]:
import json

import pandas as pd
from gretel_client import configure_session
from gretel_client.helpers import poll
from gretel_client.projects import create_or_get_unique_project, get_project
from gretel_client.projects.models import read_model_config


DATASET_PATH = 'https://gretel-public-website.s3.us-west-2.amazonaws.com/datasets/banking77.csv'
INTENT = "card_arrival"
SEPARATOR = ','
PROJECT = 'banking77'

In [None]:
# Log into Gretel and configure project

configure_session(api_key="prompt", cache="yes", endpoint="https://api.gretel.cloud", validate=True, clear=True)

project = create_or_get_unique_project(name=PROJECT)

## Load and preview the training dataset
Create a single-column CSV training set by combining `intent` + `SEPARATOR` + `text`.


In [None]:
def create_dataset(dataset_path: str) -> pd.DataFrame:
    """
    Combine intents and text into a single string to pass to GPT-X.
    """
    df = pd.read_csv(dataset_path)
    # Each training example becomes "<intent><SEPARATOR><text>"
    df['intent_and_text'] = df['intent'] + SEPARATOR + df['text']
    return df


pd.set_option('max_colwidth', None)

df = create_dataset(DATASET_PATH)
df[['intent_and_text']].to_csv('finetune.csv', index=False)
print(f"Total training set length: {df['intent_and_text'].str.len().sum()} characters.")
df
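
Because the intent and the text are joined into one string, it can help to confirm that the two parts can be recovered later. The sketch below uses a small hypothetical DataFrame (the rows are made up for illustration, not taken from `banking77`) and assumes intent labels never contain the separator. It splits on the first separator only, so commas inside an utterance are preserved.

```python
import pandas as pd

SEPARATOR = ","

# Hypothetical rows for illustration only (not from banking77)
toy = pd.DataFrame({
    "intent": ["card_arrival", "card_arrival"],
    "text": ["When will my card arrive?", "I ordered a card, but it never came."],
})

# Same combination step as in the notebook
toy["intent_and_text"] = toy["intent"] + SEPARATOR + toy["text"]

# Split on the first separator only, so commas inside the text survive
parts = toy["intent_and_text"].str.split(SEPARATOR, n=1, expand=True)
parts.columns = ["intent", "text"]
print(parts)
```

If an intent label could itself contain the separator, a character that never appears in the labels would be a safer choice for `SEPARATOR`.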

## Train the synthetic model
In this step, we will task the worker running in the Gretel cloud, or locally, to fine-tune the GPT language model on the source dataset.

In [None]:
%%time 

def calc_epochs(num_rows, minutes=30) -> float:
    """Estimate the number of epochs that can be trained within a time period"""
    rows_per_minute = 102.0
    return (rows_per_minute * minutes) / num_rows

config = read_model_config("synthetics/natural-language")
config['models'][0]['gpt_x']['pretrained_model'] = "gretelai/mpt-7b"
config['models'][0]['gpt_x']['epochs'] = calc_epochs(len(df))
config['models'][0]['gpt_x']['generate'] = {'num_records': 1}
config

# Create and submit model, training only on the combined single column
model = project.create_model_obj(model_config=config, data_source=df[['intent_and_text']])
model.name = f"{PROJECT}-mpt-7b"
model.submit_cloud()

poll(model)
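
To see what `calc_epochs` yields, the arithmetic can be worked through on its own. This sketch repeats the function from the cell above and applies it to a hypothetical row count of 10,000 (an assumption for illustration; use `len(df)` for the real value). A result below 1 means the 30-minute budget covers only part of one pass over the data.

```python
def calc_epochs(num_rows: int, minutes: int = 30) -> float:
    """Estimate how many epochs fit in the time budget."""
    rows_per_minute = 102.0  # throughput figure used in the notebook
    return (rows_per_minute * minutes) / num_rows

# Hypothetical dataset size, chosen only to illustrate the calculation
example_rows = 10_000
epochs = calc_epochs(example_rows)
print(f"{epochs:.3f} epochs in 30 minutes")  # 102 * 30 / 10000 = 0.306
```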

## Generate synthetic text data
The next cells walk through sampling data from the fine-tuned model using a prompt (conditional data generation). 

In [None]:
%%time

# Generate new text examples for a given intent by seeding
# model generation with examples from the class.

# NOTE: We have found that prompting the model with
# ~25 examples of the class you wish to
# generate works well in practice.

def create_prompt(df: pd.DataFrame, intent: str = "", recs: int = 25) -> pd.DataFrame:
    """
    Seed Gretel GPT text generation with an intent from the training data.
    """
    # NOTE: When creating a DataFrame for prompts, it must be a 1-column DataFrame!
    sample = df.query(f'intent == "{intent}"').head(recs)
    prompt = "\n".join(sample['intent_and_text'])

    # NOTE: The column name provided here does not matter; the returned
    # synthetic DataFrame will have the original column name that
    # was used to train the model, in this case "intent_and_text".
    return pd.DataFrame([prompt], columns=["prompt_text"])


prompt_df = create_prompt(df=df, intent=INTENT, recs=25)

record_handler = model.create_record_handler_obj(
    params={"maximum_text_length": 1000},
    data_source=prompt_df
)
record_handler.submit_cloud()
poll(record_handler)
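
The prompt is a single string of newline-separated training examples, wrapped in a one-column DataFrame. A minimal sketch of that shape follows, using hypothetical rows rather than the real dataset and making no Gretel calls. The `lost_card` label and the `toy` names are made up for illustration.

```python
import pandas as pd

# Hypothetical combined examples for illustration only
toy = pd.DataFrame({
    "intent": ["card_arrival", "card_arrival", "lost_card"],
    "intent_and_text": [
        "card_arrival,When will my card arrive?",
        "card_arrival,My card still has not shown up.",
        "lost_card,I cannot find my card.",
    ],
})

# Keep only rows of the target intent, then join them into one prompt string
target = "card_arrival"
sample = toy[toy["intent"] == target].head(25)
prompt = "\n".join(sample["intent_and_text"])

# Wrap the prompt in a single-column DataFrame, as the notebook does
toy_prompt_df = pd.DataFrame([prompt], columns=["prompt_text"])
print(toy_prompt_df.shape)
```

Printing `prompt` before submitting a generation job is a cheap way to check that only examples of the intended class were included.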

## Creating synthetic intents

In the cell below, we process the raw texts generated by Gretel GPT into a structured dataframe format, by splitting each row based on the intent prefix (`card_arrival`) that was used to prompt generation.



In [None]:
def get_intents(df: pd.DataFrame) -> pd.DataFrame:
    """
    Extract new texts for the prompted intent from the raw GPT-X output.
    """
    MIN_LENGTH = 20
    texts = []

    for idx, row in df.iterrows():
        # Each generated row may contain several "<intent><SEPARATOR><text>" examples
        for text in row.iloc[0].split(f"{INTENT}{SEPARATOR}"):
            text = text.strip()
            # Skip fragments too short to be useful utterances
            if len(text) > MIN_LENGTH:
                texts.append([INTENT, text])

    intents = pd.DataFrame(texts, columns=['intent', 'synthetic_text'])
    return intents


gptx_df = pd.read_csv(record_handler.get_artifact_link("data"), compression='gzip')
gptx_df
syn = get_intents(df=gptx_df)
syn.head(15)
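
The splitting logic can be tried offline on a made-up generation. The string below is hypothetical, not real model output. It assumes the model echoes the `card_arrival,` prefix before each new utterance, as the prompt format encourages.

```python
import pandas as pd

INTENT = "card_arrival"
SEPARATOR = ","
MIN_LENGTH = 20

# Hypothetical raw generation for illustration only
raw = (
    "card_arrival,How long does delivery of a new card usually take?\n"
    "card_arrival,ok\n"
    "card_arrival,My replacement card was supposed to be here last week."
)

# Split on the intent prefix, trim whitespace, and drop short fragments
texts = [t.strip() for t in raw.split(f"{INTENT}{SEPARATOR}")]
texts = [t for t in texts if len(t) > MIN_LENGTH]

toy_syn = pd.DataFrame({"intent": INTENT, "synthetic_text": texts})
print(toy_syn)
```

The length threshold removes the empty string left before the first prefix as well as very short fragments such as `ok`. It does not judge quality, so a human review of the remaining rows is still advisable.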
