# Generating Synthetic Text

This notebook walks you through generating realistic but synthetic text examples using an open-source implementation of OpenAI's GPT-3 architecture.

In this example, we will generate new annotated text utterances that can be used to augment a real-world financial dataset called `banking77`. The augmented dataset will contain additional annotated examples that can help downstream ML models better understand and respond to new customer queries. To run this notebook, you will need an API key from the Gretel console at https://console.gretel.cloud.
<br>

**Limitations and Biases**
Large-scale language models such as GPT may produce untrue and/or offensive content without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results. For more information and examples, see the [OpenAI](https://huggingface.co/gpt2#limitations-and-bias) and [EleutherAI](https://huggingface.co/EleutherAI/gpt-neo-125M#limitations-and-biases) docs.

In [None]:
%%capture
!pip install -U gretel-client

## Set up your project
* `DATASET_PATH`: Specify a dataset to run on.
* `INTENT`: Select an intent from the training data to boost examples for.
* `SEPARATOR`: Specify the separator character (default=`,`) used to combine each intent and its text into a single column.
* `PROJECT`: Specify a project name.

In [None]:
import json

import pandas as pd
from gretel_client import configure_session
from gretel_client.helpers import poll
from gretel_client.projects import create_or_get_unique_project, get_project


DATASET_PATH = 'https://gretel-public-website.s3.us-west-2.amazonaws.com/datasets/banking77.csv'
INTENT = "card arrival"
SEPARATOR = ','
PROJECT = 'banking77'

In [None]:
# Log into Gretel and configure project

configure_session(api_key="prompt", cache="yes", endpoint="https://api.gretel.cloud", validate=True, clear=True)

project = create_or_get_unique_project(name=PROJECT)
project

Gretel Api Key··········
Caching Gretel config to disk.
Using endpoint https://api.gretel.cloud
Logged in as alex@gretel.ai ✅


<gretel_client.projects.projects.Project at 0x7f29d194ff10>

## Create the model configuration

In this notebook we will use GPT-Neo, a transformer model from EleutherAI that replicates OpenAI's GPT-3 architecture. The model was pre-trained on the Pile, a large-scale dataset, for 300 billion tokens over 572,300 steps. In this introductory example, we will fine-tune GPT-Neo to generate synthetic text utterances for a given intent that could be used to train a chatbot.

In [None]:
config = {
  "schema_version": 1,
  "models": [
    {
      "gpt_x": {
        "data_source": "__",
        "pretrained_model": "EleutherAI/gpt-neo-125M",
        "batch_size": 4,
        "epochs": 1,
        "weight_decay": 0.1,
        "warmup_steps": 100,
        "lr_scheduler": "cosine",
        "learning_rate": 1e-6
      }
    }
  ]
}
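
The `warmup_steps`, `lr_scheduler`, and `learning_rate` settings together describe how the learning rate changes during fine-tuning. The sketch below illustrates one common shape for such a schedule: linear warmup followed by cosine decay to zero. It is an assumption about the schedule's shape, not Gretel's actual implementation; `lr_at_step` and `total_steps` are hypothetical names used only for illustration.

```python
import math


def lr_at_step(step: int, total_steps: int, base_lr: float = 1e-6,
               warmup_steps: int = 100) -> float:
    """Linear warmup, then cosine decay to zero (assumed schedule shape)."""
    if step < warmup_steps:
        # Ramp linearly from 0 up to base_lr over the warmup period.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup training that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 250  # hypothetical number of optimizer steps in the run
for step in (0, 50, 100, 175, 250):
    print(step, lr_at_step(step, total_steps))
```

Under this assumed shape the learning rate reaches zero on the final step, which would be consistent with a training log that reports `"learning_rate": 0.0` at the end of the last epoch.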

In [None]:
# Helpers and utilities

def create_dataset(dataset_path: str) -> pd.DataFrame:
    """
    Combine intents and text into a single string to pass to GPT.
    """
    df = pd.read_csv(dataset_path)
    df['intent_and_text'] = df['intent'] + SEPARATOR + df['text']
    return df


def create_prompt(df: pd.DataFrame, intent: str = "", recs: int = 25) -> str:
    """
    Seed GPT text generation with an intent from the training data.
    """
    sample = df.query(f'intent == "{intent}"').head(recs)
    prompt = "\n".join([x[0] for x in sample[['intent_and_text']].values])
    return prompt


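def get_intents(df: pd.DataFrame, offset: int = 0) -> pd.DataFrame:
    """
    Extract new intents generated by the GPT model.

    `df` holds the raw generated text in its first column; `offset` skips
    that many leading characters (the prompt) in each record.
    """
    MIN_LENGTH = 20  # drop fragments too short to be useful utterances
    texts = []

    # Iterate over the dataframe passed in, not a global variable.
    for _, row in df.iterrows():
        records = row.iloc[0][offset:]
        for text in records.split(f"{INTENT}{SEPARATOR}"):
            text = text.strip()
            if len(text) > MIN_LENGTH:
                texts.append([INTENT, text])

    intents = pd.DataFrame(texts, columns=['intent', 'synthetic_text'])
    return intents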

        

## Load and preview the training dataset
Create a single-column CSV training set by combining `intent` + `SEPARATOR` + `text`.


In [None]:
pd.set_option('max_colwidth', None)

df = create_dataset(DATASET_PATH)
df[['intent_and_text']].to_csv('finetune.csv', index=False)
df

Unnamed: 0,intent,text,intent_and_text
0,card arrival,I am still waiting on my card?,"card arrival,I am still waiting on my card?"
1,card arrival,What can I do if my card still hasn't arrived after 2 weeks?,"card arrival,What can I do if my card still hasn't arrived after 2 weeks?"
2,card arrival,I have been waiting over a week. Is the card still coming?,"card arrival,I have been waiting over a week. Is the card still coming?"
3,card arrival,Can I track my card while it is in the process of delivery?,"card arrival,Can I track my card while it is in the process of delivery?"
4,card arrival,"How do I know if I will get my card, or if it is lost?","card arrival,How do I know if I will get my card, or if it is lost?"
...,...,...,...
9998,country support,You provide support in what countries?,"country support,You provide support in what countries?"
9999,country support,What countries are you supporting?,"country support,What countries are you supporting?"
10000,country support,What countries are getting support?,"country support,What countries are getting support?"
10001,country support,Are cards available in the EU?,"country support,Are cards available in the EU?"
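
Because the default separator is also the CSV delimiter, it may help to see what the single-column file looks like on disk. The sketch below uses a small in-memory frame (`toy`, a hypothetical stand-in for the real dataset) and relies on pandas' default minimal quoting, which wraps any field containing a comma in double quotes so the combined string survives a round trip.

```python
import io

import pandas as pd

SEPARATOR = ","  # same default as the notebook

# Toy rows in the banking77 layout (an `intent` and a `text` column).
toy = pd.DataFrame({
    "intent": ["card arrival", "card arrival"],
    "text": [
        "I am still waiting on my card?",
        "How do I know if I will get my card, or if it is lost?",
    ],
})
toy["intent_and_text"] = toy["intent"] + SEPARATOR + toy["text"]

# Write only the combined column, as the notebook does for finetune.csv.
buffer = io.StringIO()
toy[["intent_and_text"]].to_csv(buffer, index=False)
print(buffer.getvalue())

# Reading the file back yields one column with the original strings intact.
buffer.seek(0)
restored = pd.read_csv(buffer)
```

The quoting is what keeps each training example in a single column even when the utterance itself contains commas.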


## Train the synthetic model
In this step, we task a worker, running in the Gretel cloud or locally, with fine-tuning the GPT language model on the source dataset.

In [None]:
%%time 

model = project.create_model_obj(model_config=config)
model.data_source = "finetune.csv"
model.name = f"{PROJECT}-gptx"
model.submit_cloud()

poll(model)

INFO: Starting poller


{
    "uid": "628bdc5fe9ad86921c401049",
    "guid": "model_29ZumXJVwnWhcB6Y5AXOrKyG5jc",
    "model_name": "banking77-gptx",
    "runner_mode": "cloud",
    "user_id": "5f3c3afbbff62139634c66ca",
    "user_guid": "user_26hlmNzAeMFXnpOU5odYeV3yEjz",
    "billing_domain": "gretel.ai",
    "billing_domain_guid": "domain_28eujAnf9EFme26oSFok8xCUT4n",
    "project_id": "6287b8d94b153698342870f4",
    "project_guid": "proj_29R30oLpIN6lJwQO4SKH160bVXd",
    "status_history": {
        "created": "2022-05-23T19:11:27.885367Z"
    },
    "last_modified": "2022-05-23T19:11:28.001131Z",
    "status": "created",
    "last_active_hb": null,
    "duration_minutes": null,
    "error_msg": null,
    "error_id": null,
    "traceback": null,
    "container_image": "074762682575.dkr.ecr.us-west-2.amazonaws.com/models/gpt_x@sha256:632589dde375afa7d495376e703e918968c82435f966ddea823dfdd4caa70314",
    "model_type": "gpt_x",
    "config": {
        "schema_version": "1.0",
        "name": "banking77-gptx",

INFO: Status is created. Model creation has been queued.
INFO: Status is pending. A Gretel Cloud worker is being allocated to begin model creation.
INFO: Status is active. A worker has started creating your model!
2022-05-23T19:11:36.069894Z  Training gpt_x!
2022-05-23T19:13:51.222376Z  Training in progress
{
    "loss": 4.4121,
    "learning_rate": 0.0,
    "epoch": 1.0
}
2022-05-23T19:13:55.054631Z  Training in progress
{
    "eval_loss": 4.046089172363281,
    "eval_runtime": 3.8293,
    "eval_samples_per_second": 522.284,
    "eval_steps_per_second": 32.643,
    "epoch": 1.0
}
2022-05-23T19:13:57.846429Z  Training in progress
{
    "train_runtime": 119.6112,
    "train_samples_per_second": 66.908,
    "train_steps_per_second": 2.09,
    "train_loss": 4.4121201171875,
    "epoch": 1.0
}
2022-05-23T19:13:57.863644Z  Training is completed!
2022-05-23T19:13:57.864345Z  Saving model
2022-05-23T19:13:58.766505Z  Model has been created successfully
2022-05-23T19

CPU times: user 883 ms, sys: 124 ms, total: 1.01 s
Wall time: 3min 7s


## Generate synthetic text data
The next cells walk through sampling data from the fine-tuned model using a prompt (conditional data generation). 
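
As a rough illustration of what a conditioning prompt looks like, the sketch below builds one from a toy frame. `build_prompt` and `toy` are hypothetical names standing in for the notebook's helper and dataset; the idea is simply to select the rows for one intent and join their combined strings with newlines.

```python
import pandas as pd

# Toy rows that already carry the combined `intent_and_text` column.
toy = pd.DataFrame({
    "intent": ["card arrival", "country support", "card arrival"],
    "intent_and_text": [
        "card arrival,I am still waiting on my card?",
        "country support,What countries are you supporting?",
        "card arrival,Can I track my card while it is in the process of delivery?",
    ],
})


def build_prompt(frame: pd.DataFrame, intent: str, recs: int = 25) -> str:
    """Join up to `recs` examples of one intent into a newline-separated prompt."""
    sample = frame.loc[frame["intent"] == intent, "intent_and_text"].head(recs)
    return "\n".join(sample)


toy_prompt = build_prompt(toy, "card arrival")
print(toy_prompt)
```

Every line of the resulting prompt starts with the same intent prefix, which is the pattern the fine-tuned model is expected to continue.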

In [None]:
# Generate new text examples for a given intent by seeding
# model generation with examples from the class. Hint: We have found
# prompting the model with ~25 examples for the class you wish to 
# generate to work well in practice.

prompt = create_prompt(df=df, intent=INTENT, recs=25)

record_handler = model.create_record_handler_obj(
    params={"num_records": 1, "maximum_text_length": 1000, "prompt": prompt}
)
record_handler.submit_cloud()
poll(record_handler)

INFO: Starting poller


{
    "uid": "628bde6e941bfefa095ca6b2",
    "guid": "model_run_29ZvqnCBVHgtDrCFiGOd5snflVZ",
    "model_name": null,
    "runner_mode": "cloud",
    "user_id": "5f3c3afbbff62139634c66ca",
    "user_guid": "user_26hlmNzAeMFXnpOU5odYeV3yEjz",
    "billing_domain": "gretel.ai",
    "billing_domain_guid": "domain_28eujAnf9EFme26oSFok8xCUT4n",
    "project_id": "6287b8d94b153698342870f4",
    "project_guid": "proj_29R30oLpIN6lJwQO4SKH160bVXd",
    "status_history": {
        "created": "2022-05-23T19:20:14.337000Z"
    },
    "last_modified": "2022-05-23T19:20:14.467000Z",
    "status": "created",
    "last_active_hb": null,
    "duration_minutes": null,
    "error_msg": null,
    "error_id": null,
    "traceback": null,
    "container_image": "074762682575.dkr.ecr.us-west-2.amazonaws.com/models/gpt_x@sha256:632589dde375afa7d495376e703e918968c82435f966ddea823dfdd4caa70314",
    "model_id": "628bdc5fe9ad86921c401049",
    "model_guid": "model_29ZumXJVwnWhcB6Y5AXOrKyG5jc",
    "action": "gpt

INFO: Status is created. A job has been queued.
INFO: Status is pending. A Gretel Cloud worker is being allocated
INFO: Status is active. A worker has started!
2022-05-23T19:20:39.336036Z  Sampling 1 records conditioned on 'card arrival,I am still waiting on my card?
card arrival,What can I do if my card still hasn't arrived after 2 weeks?
card arrival,I have been waiting over a week. Is the card still coming?
card arrival,Can I track my card while it is in the process of delivery?
card arrival,How do I know if I will get my card, or if it is lost?
card arrival,When did you send me my new card?
card arrival,Do you have info about the card on delivery?
card arrival,What do I do if I still have not received my new card?
card arrival,Does the package with my card have tracking?
card arrival,I ordered my card but it still isn't here
card arrival,Why has my new card still not come?
card arrival,I still haven't received my card after two weeks, is it lost?
card arr

## Create synthetic intents

In the cell below, we process the raw text generated by GPT into a structured dataframe by splitting each row on the intent prefix (`card arrival` followed by the separator) that was used to prompt generation.
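
The splitting step can be sketched on a single toy string. This assumes, as the notebook's use of an offset suggests, that each generated record begins with the prompt itself; the `generated` text below is made up for illustration and is not real model output.

```python
INTENT = "card arrival"
SEPARATOR = ","
MIN_LENGTH = 20  # same threshold as the notebook's helper

toy_prompt = "card arrival,I am still waiting on my card?"

# Hypothetical model output: the prompt echoed back, followed by new text.
generated = (
    toy_prompt
    + "\ncard arrival,Is there a way to track my card?"
    + "\ncard arrival,Where is it?"
    + "\ncard arrival,Do you know if my new card has been sent yet?"
)

# Skip the prompt, split on the intent prefix, and drop short fragments.
continuation = generated[len(toy_prompt):]
pieces = [p.strip() for p in continuation.split(f"{INTENT}{SEPARATOR}")]
kept = [p for p in pieces if len(p) > MIN_LENGTH]
print(kept)
```

Here the empty leading piece and the short utterance "Where is it?" are discarded by the length filter, leaving two synthetic texts.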



In [None]:
gptx_df = pd.read_csv(record_handler.get_artifact_link("data"), compression='gzip')

syn = get_intents(df=gptx_df, offset=len(prompt))
syn


Unnamed: 0,intent,synthetic_text
0,card arrival,I don't have it yet. Are there some shipping details?
1,card arrival,Do you know if my new card has been sent yet?
2,card arrival,How do I know if I will receive the new card?
3,card arrival,Is there a way to do the tracking?
4,card arrival,What is the status of my card's delivery?
5,card arrival,It's been a week since I got my card. Are there any ways to track my card?
6,card arrival,Is there a way to track my card?
7,card arrival,I don't have my card yet. What is going on?
8,card arrival,I don't have my card yet. Are there any shipping details?
9,card arrival,I just got my card now. Are there any other steps to track my card?
