# 📈: Text Augmentation using large-scale LMs and prompt engineering

This notebook looks into the possibility of performing data augmentation on an NLP dataset leveraging the few-shot capabilities of large-scale LMs and prompt engineering 💡

Data augmentation techniques are used to generate additional samples. Data augmentation is already standard practice in computer vision projects 👌, but can also be leveraged in many NLP problems. We'll use a limited training set to simulate a real-world use case, where we are often constrained by the size of the available data 🤦.

## 🛠️ Getting started

The cells below set up everything required to get started with data augmentation and fine-tuning an NLP model with the HuggingFace API.

### Setup

In [None]:
!pip install -q transformers datasets tokenizers openai requests sentencepiece

### Imports

In [None]:
import re
import json
import torch
import random
import requests
import numpy as np
import pandas as pd

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from datasets import load_dataset, concatenate_datasets, load_from_disk, load_metric, Dataset
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, TrainerCallback, AutoModelForCausalLM, AutoModelForSeq2SeqLM

### Download dataset
We'll train and evaluate our models on the [Emotion](https://huggingface.co/datasets/emotion) dataset, which contains English Twitter messages labeled with one of six basic emotions: anger, fear, joy, love, sadness and surprise. To reduce the complexity of the task, we will keep only three labels, namely:
- joy 😂
- anger 😠
- surprise 😯

In [None]:
# load the dataset and filter on samples that have a token count less than 30 to use only short tweets
max_input_len = 30
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
emotion_ds = load_dataset("emotion").filter(lambda e: len(tokenizer.batch_encode_plus([e['text']]).input_ids[0]) < int(max_input_len))

The dataset is already split into 16,000 train and 2,000 test samples. To investigate the effectiveness of the augmentation method, we will use only 10 samples per class as a train set.

In [None]:
# select the first 10 train samples of each of the three emotions
# sadness (0), joy (1), love (2), anger (3), fear (4), surprise (5)
joy_train_samples = emotion_ds['train'].filter(lambda e: e['label'] == 1).select(range(10))
anger_train_samples = emotion_ds['train'].filter(lambda e: e['label'] == 3).select(range(10))
surprise_train_samples = emotion_ds['train'].filter(lambda e: e['label'] == 5).select(range(10))

# map emotions to integers for labeling
# joy (0), anger (1), surprise (2)
def map_emotions(example):
  if example['label'] == 1: # joy
    example['label'] = 0
  elif example['label'] == 3: # anger
    example['label'] = 1
  else: 
    example['label'] = 2 # surprise
  return example

# create a train set that consists of 10 samples per class and filter the test 
# set to contain only the valid labels
emotion_train_ds = concatenate_datasets([joy_train_samples, anger_train_samples, surprise_train_samples]).map(lambda e: map_emotions(e)).shuffle(seed=42)
emotion_test_ds = emotion_ds["test"].filter(lambda e: e['label'] in [1, 3, 5]).map(lambda e: map_emotions(e))

# define the mapping between emotions and labels
idx2label = {0: 'joy', 1: 'anger', 2: 'surprise'}
label2idx = {'joy': 0, 'anger': 1, 'surprise': 2}
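
The cell above takes the first 10 matching rows of each class. If a randomly drawn few-shot subset is preferred, a seeded per-class sample could look like the sketch below. It works on plain Python lists of dicts rather than on a `datasets.Dataset`, and `sample_per_class` and `toy_rows` are hypothetical names introduced only for illustration.

```python
import random
from collections import defaultdict

def sample_per_class(rows, labels_to_keep, k, seed=42):
    """Return k randomly chosen rows for each label in labels_to_keep."""
    rng = random.Random(seed)  # local RNG, the global random state is untouched
    by_label = defaultdict(list)
    for row in rows:
        if row["label"] in labels_to_keep:
            by_label[row["label"]].append(row)
    subset = []
    for label in labels_to_keep:
        # rng.sample draws without replacement and raises if fewer than k rows exist
        subset.extend(rng.sample(by_label[label], k))
    rng.shuffle(subset)
    return subset

# toy data: 20 rows for each of the six original label ids
toy_rows = [{"text": f"tweet {i}", "label": i % 6} for i in range(120)]
few_shot = sample_per_class(toy_rows, labels_to_keep=[1, 3, 5], k=10)
print(len(few_shot))
```

Fixing the seed keeps the subset reproducible across runs, which matters when comparing augmentation strategies on such a small train set.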

Before proceeding with the data augmentation, let's have a look into the baseline dataset 😎!

In [None]:
print("Train set")
print("Total samples: {}\n".format(len(emotion_train_ds)))
print("A random sample")
print("Text: {} \nLabel: {}".format(emotion_train_ds['text'][10], idx2label[emotion_train_ds['label'][10]]))
print("\n")

print("Test set")
print("Total samples: {}\n".format(len(emotion_test_ds)))
print("A random sample")
print("Text: {} \nLabel: {}".format(emotion_test_ds['text'][10], idx2label[emotion_test_ds['label'][10]]))

Train set
Total samples: 30

A random sample
Text: i feel angered and firey 
Label: anger


Test set
Total samples: 797

A random sample
Text: i feel more virtuous than when i eat veggies dipped in hummus 
Label: joy


## Text Augmentation pipeline

We will leverage the few-shot capabilities of large LMs to generate synthetic but hyper-realistic samples from a mixture of real samples. Specifically, we select two real samples from our dataset and embed them in a carefully designed prompt. The LM then takes the prompt as input and generates an augmented, mixed sample influenced by the two example sentences.


Generally, a prompt looks like this:

    Each item in the following list contains a <text type> and the
    respective <label type>. <label type> is one of '<label token 1>',
    ..., or '<label token N>'.
    <text type>: <example text 1> (<label type>: <example label 1>)
    ...
    <text type>: <example text k> (<label type>: <example label k>)
    <text type>:

In our case the prompt looks like this:

    Each item in the following list contains a tweet and the
    respective sentiment. Sentiment is one of 'joy', 'surprise' or 'anger'.
    Tweet: i feel angered and firey (Sentiment: anger)
    Tweet: im feeling very peaceful about our wedding again now after having (Sentiment: joy)
    Tweet:

You can find more information on text augmentation using large LMs in the [GPT3Mix](https://arxiv.org/abs/2104.08826) paper.
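
The generic template above can be filled in programmatically. The following is a minimal sketch, not the GPT3Mix reference implementation; `build_prompt` and its parameters are hypothetical names chosen to mirror the placeholders in the template.

```python
def build_prompt(text_type, label_type, label_tokens, examples):
    """Fill the generic few-shot template with concrete values.

    examples is a list of (text, label) pairs.
    """
    quoted = [f"'{token}'" for token in label_tokens]
    if len(quoted) > 1:
        label_list = ", ".join(quoted[:-1]) + " or " + quoted[-1]
    else:
        label_list = quoted[0]
    header = (f"Each item in the following list contains a {text_type.lower()} and the "
              f"respective {label_type.lower()}. {label_type} is one of {label_list}.")
    lines = [header]
    for text, label in examples:
        lines.append(f"{text_type}: {text} ({label_type}: {label})")
    lines.append(f"{text_type}:")  # left open for the model to complete
    return "\n".join(lines)

prompt = build_prompt(
    "Tweet", "Sentiment", ["joy", "surprise", "anger"],
    [("i feel angered and firey", "anger"),
     ("im feeling very peaceful about our wedding again now after having", "joy")],
)
print(prompt)
```

Keeping the text type, label type and label tokens as parameters makes it easy to reuse the same template for another classification dataset.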

First, we need to extract pairs of samples from the train set. Various extraction strategies can be used to increase the quality of the synthetic samples. We will simply draw the pairs at random, since repeated random sampling already yields a diverse synthetic dataset.

In [None]:
# define a function that returns two random samples from the train set.
def get_random_samples():
  s1 = random.randint(0, len(emotion_train_ds)-1)
  s2 = random.randint(0, len(emotion_train_ds)-1)
  return emotion_train_ds['text'][s1], emotion_train_ds['label'][s1], emotion_train_ds['text'][s2], emotion_train_ds['label'][s2]

# define a function that takes as input two samples and generates the prompt
# that we should pass to the GPT-3 language model for completion.
def get_prompt(text1, label1, text2, label2):
  description = "Each item in the following list contains a tweet and the respective sentiment. Sentiment is one of 'joy', 'surprise' or 'anger'."
  prompt = (f"{description}\n"
            f"Tweet: {text1} (Sentiment: {idx2label[label1]})\n"
            f"Tweet: {text2} (Sentiment: {idx2label[label2]})\n"
            f"Tweet:")
  return prompt
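
Note that `get_random_samples` draws its two indices independently, so both can occasionally point to the same sample. If two distinct examples are wanted in every prompt, sampling without replacement is one option. The sketch below uses plain lists; `get_distinct_pair`, `texts` and `labels` are hypothetical names, not part of the notebook.

```python
import random

def get_distinct_pair(texts, labels, rng=random):
    """Pick two different positions and return (text1, label1, text2, label2)."""
    i, j = rng.sample(range(len(texts)), 2)  # sampling without replacement
    return texts[i], labels[i], texts[j], labels[j]

texts = ["i feel angered and firey", "i am happy to be here", "how dare you?"]
labels = [1, 0, 1]
text1, label1, text2, label2 = get_distinct_pair(texts, labels, random.Random(0))
print(text1 != text2)
```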

### GPT-3

We will use [GPT-3](https://openai.com/blog/openai-api/) as our LM. It is a powerful model developed by OpenAI and an excellent few-shot learner that can be controlled via natural-language prompts. GPT-3 is accessed through an API.

We will generate 10, 50, 100 and 200 synthetic samples using GPT-3 to investigate the effectiveness of text augmentation.

In [None]:
# define the number of synthetic samples to generate
n = 10
new_texts = []
new_labels = []
api_key = ""  # insert your API key for GPT-3
headers = {'Authorization' : 'Bearer ' + api_key ,
              'Content-type':'application/json', 
              'Accept':'application/json'}

iter = 0
while iter < n:
  # select two random samples from training set
  text1, label1, text2, label2 = get_random_samples()
  # create the prompt
  prompt = get_prompt(text1, label1, text2, label2)
  # send a post request to gpt-3 using the prompt
  response = requests.post('https://api.openai.com/v1/engines/davinci/completions', 
                           headers=headers,
                           data = json.dumps({"prompt": prompt, 
                                              "max_tokens": 30,
                                              "temperature": 0.9,
                                              "top_p": 0.95}))

  # get response and extract the generated text and label
  # the generated output will be in the form "<text> (Sentiment: <label>)"
  data = response.json()['choices'][0]['text'].split('\n')[0].split('(Sentiment:')

  if len(data) < 2:
    # the format of the response is invalid
    continue

  text = data[0]
  label = data[1].split(')')[0].strip()

  if label not in ['joy', 'anger', 'surprise']:
    # the format of the response is invalid
    continue

  new_texts.append(text)
  new_labels.append(label2idx[label])
  iter += 1

# define the synthetic dataset and save it to disk so as to prevent sending 
# many api requests
synthetic_ds = Dataset.from_dict({'text': new_texts, 'label': new_labels})
synthetic_ds.save_to_disk('./drive/MyDrive/text_augmentation/gpt-3/10')
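
The parsing step in the loop above expects a completion of the form `<text> (Sentiment: <label>)`. It can be isolated into a small helper that is easy to test without calling the API. This is a sketch under that format assumption; `parse_completion`, `VALID_LABELS` and `PATTERN` are hypothetical names, and unlike the loop above the helper also trims the text and rejects empty texts.

```python
import re

VALID_LABELS = ("joy", "anger", "surprise")
# the text, followed by "(Sentiment: <label>)", on the first line of the completion
PATTERN = re.compile(r"^(?P<text>.*?)\(Sentiment:\s*(?P<label>[^)]*)\)")

def parse_completion(completion):
    """Return (text, label), or None when the completion is malformed."""
    first_line = completion.split("\n")[0]
    match = PATTERN.match(first_line)
    if match is None:
        return None
    text = match.group("text").strip()
    label = match.group("label").strip()
    if not text or label not in VALID_LABELS:
        return None
    return text, label

print(parse_completion(" lol owls (Sentiment: joy)\nTweet: something else"))
print(parse_completion(" no label here"))
print(parse_completion(" i feel odd (Sentiment: fear)"))
```

Returning `None` for every malformed case keeps the calling loop to a single validity check.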

In [None]:
# load the synthetic datasets with 10, 50, 100 and 200 samples
# run this if the dataset has already been saved and set the path in your workspace
synthetic_gpt3_10_ds = load_from_disk('./drive/MyDrive/text_augmentation/gpt-3/10')
synthetic_gpt3_50_ds = load_from_disk('./drive/MyDrive/text_augmentation/gpt-3/50')
synthetic_gpt3_100_ds = load_from_disk('./drive/MyDrive/text_augmentation/gpt-3/100')
synthetic_gpt3_200_ds = load_from_disk('./drive/MyDrive/text_augmentation/gpt-3/200')

Now let's print some synthetic samples to examine their quality!

In [None]:
print("Dataset of 10 synthetic samples:")
print("Text: {} \nLabel: {}\n".format(synthetic_gpt3_10_ds['text'][5], idx2label[synthetic_gpt3_10_ds['label'][5]]))
print("Dataset of 50 synthetic samples:")
print("Text: {} \nLabel: {}\n".format(synthetic_gpt3_50_ds['text'][5], idx2label[synthetic_gpt3_50_ds['label'][5]]))
print("Dataset of 100 synthetic samples:")
print("Text: {} \nLabel: {}\n".format(synthetic_gpt3_100_ds['text'][5], idx2label[synthetic_gpt3_100_ds['label'][5]]))
print("Dataset of 200 synthetic samples:")
print("Text: {} \nLabel: {}\n".format(synthetic_gpt3_200_ds['text'][5], idx2label[synthetic_gpt3_200_ds['label'][5]]))

Dataset of 10 synthetic samples:
Text:  even if ur not into these kind of things u have to admit it's pretty cool  
Label: joy

Dataset of 50 synthetic samples:
Text:  i want to stop running and walk...but the fact that i'm still running is the real miracle  
Label: joy

Dataset of 100 synthetic samples:
Text:  i want a beer right now  
Label: anger

Dataset of 200 synthetic samples:
Text:  lol owls  
Label: joy



We see that GPT-3 has effectively generated very realistic samples. 👏👏👏

### GPT-J

As an open-source alternative to GPT-3, we can use [GPT-J](https://6b.eleuther.ai/), a 6-billion-parameter model released by EleutherAI. The goal of the group is to democratize huge LMs, so they released GPT-J publicly. GPT-3, on the other hand, was released by OpenAI, has 175 billion parameters, and is not openly available at the time of writing.

To use GPT-J, we can simply load it through HuggingFace.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")

In [None]:
# define the number of synthetic samples to generate
n = 10
new_texts = []
new_labels = []

iter = 0
while iter < n:
  # select two random samples from training set
  text1, label1, text2, label2 = get_random_samples()
  # create the prompt
  prompt = get_prompt(text1, label1, text2, label2)

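  # generate text using the GPT-J model
  input_ids = tokenizer(prompt, return_tensors="pt").input_ids
  gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100,)
  gen_text = tokenizer.batch_decode(gen_tokens)[0]
  # the prompt ends with "Tweet:" on line 3, so that line holds the generated
  # sample in the form "Tweet: <text> (Sentiment: <label>)"
  line = gen_text.split('\n')[3]
  # drop the "Tweet:" prefix explicitly; str.strip('Tweet: ') treats its argument
  # as a set of characters and can also remove letters from the text itself
  if line.startswith('Tweet:'):
    line = line[len('Tweet:'):]
  data = line.strip().split('(Sentiment:')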
  if len(data) < 2:
    # the format of the response is invalid
    continue

  text = data[0]
  label = data[1].split(')')[0].strip()
  if label not in ['joy', 'anger', 'surprise']:
    # the format of the response is invalid
    continue

  new_texts.append(text)
  new_labels.append(label2idx[label])
  iter += 1

# define the synthetic dataset and save it to disk 
synthetic_ds = Dataset.from_dict({'text': new_texts, 'label': new_labels})
synthetic_ds.save_to_disk('./drive/MyDrive/text_augmentation/gpt-j/10')
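
When removing the `Tweet:` prefix from a generated line, keep in mind that `str.strip` takes a set of characters rather than a prefix, so it can also eat leading or trailing letters such as `w`, `e` or `t` from the text itself. The small illustration below contrasts the two behaviours; `remove_prefix` is a hypothetical helper (on Python 3.9+ `str.removeprefix` does the same job).

```python
def remove_prefix(line, prefix="Tweet:"):
    """Drop prefix only when the line really starts with it."""
    if line.startswith(prefix):
        line = line[len(prefix):]
    return line.strip()  # plain strip() removes surrounding whitespace only

line = "Tweet: why do i feel bored (Sentiment: anger)"
print(line.strip("Tweet: "))  # hy do i feel bored (Sentiment: anger)
print(remove_prefix(line))    # why do i feel bored (Sentiment: anger)
```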

In [None]:
synthetic_gptj_10_ds = load_from_disk('./drive/MyDrive/text_augmentation/gpt-j/10')
synthetic_gptj_50_ds = load_from_disk('./drive/MyDrive/text_augmentation/gpt-j/50')
synthetic_gptj_100_ds = load_from_disk('./drive/MyDrive/text_augmentation/gpt-j/100')
synthetic_gptj_200_ds = load_from_disk('./drive/MyDrive/text_augmentation/gpt-j/200')

In [None]:
print("Dataset of 10 synthetic samples:")
print("Text: {} \nLabel: {}\n".format(synthetic_gptj_10_ds['text'][5], idx2label[synthetic_gptj_10_ds['label'][5]]))
print("Dataset of 50 synthetic samples:")
print("Text: {} \nLabel: {}\n".format(synthetic_gptj_50_ds['text'][5], idx2label[synthetic_gptj_50_ds['label'][5]]))
print("Dataset of 100 synthetic samples:")
print("Text: {} \nLabel: {}\n".format(synthetic_gptj_100_ds['text'][5], idx2label[synthetic_gptj_100_ds['label'][5]]))
print("Dataset of 200 synthetic samples:")
print("Text: {} \nLabel: {}\n".format(synthetic_gptj_200_ds['text'][5], idx2label[synthetic_gptj_200_ds['label'][5]]))

Dataset of 10 synthetic samples:
Text: he didn't even call me back  
Label: surprise

Dataset of 50 synthetic samples:
Text: i am feeling angry and sad at the same time  
Label: anger

Dataset of 100 synthetic samples:
Text: i have the feeling she was amused and delighted  
Label: joy

Dataset of 200 synthetic samples:
Text: how dare you?  
Label: anger



Even though GPT-J is much smaller than GPT-3, it manages to generate high quality samples 👏.

### GPT-Neo

A third alternative is GPT-Neo, which was also released by EleutherAI. It has 2.7 billion parameters and is also publicly available.

Like GPT-J, we can access GPT-Neo through HuggingFace.

In [None]:
tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neo-2.7B')
model = AutoModelForCausalLM.from_pretrained('EleutherAI/gpt-neo-2.7B')

In [None]:
# define the number of synthetic samples to generate
n = 10
new_texts = []
new_labels = []

iter = 0
while iter < n:
  # select two random samples from training set
  text1, label1, text2, label2 = get_random_samples()
  # create the prompt
  prompt = get_prompt(text1, label1, text2, label2)

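  # generate text using the GPT-Neo model
  input_ids = tokenizer(prompt, return_tensors="pt").input_ids
  gen_tokens = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=100,)
  gen_text = tokenizer.batch_decode(gen_tokens)[0]
  # the prompt ends with "Tweet:" on line 3, so that line holds the generated
  # sample in the form "Tweet: <text> (Sentiment: <label>)"
  line = gen_text.split('\n')[3]
  # drop the "Tweet:" prefix explicitly; str.strip('Tweet: ') treats its argument
  # as a set of characters and can also remove letters from the text itself
  if line.startswith('Tweet:'):
    line = line[len('Tweet:'):]
  data = line.strip().split('(Sentiment:')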
  if len(data) < 2:
    # the format of the response is invalid
    continue
  print(data)
  text = data[0]
  label = data[1].split(')')[0].strip()
  if label not in ['joy', 'anger', 'surprise']:
    # the format of the response is invalid
    continue

  new_texts.append(text)
  new_labels.append(label2idx[label])
  iter += 1


# define the synthetic dataset and save it to disk 
synthetic_ds = Dataset.from_dict({'text': new_texts, 'label': new_labels})
synthetic_ds.save_to_disk('./drive/MyDrive/text_augmentation/gpt-neo/10')
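
The `while` loops used for generation retry until `n` valid samples have been collected, which never terminates if the model keeps producing malformed output. One way to guard against that is to cap the number of attempts. The sketch below assumes a callable that returns a `(text, label)` pair or `None`; `collect_samples`, `generate_fn`, `max_attempts` and the stand-in generator are all hypothetical.

```python
import itertools

def collect_samples(generate_fn, n, max_attempts=None):
    """Call generate_fn until n valid (text, label) pairs are collected.

    generate_fn returns a (text, label) tuple, or None for an invalid output.
    Gives up once max_attempts calls have been made (default: 10 * n).
    """
    if max_attempts is None:
        max_attempts = 10 * n
    samples = []
    for _ in range(max_attempts):
        if len(samples) == n:
            break
        result = generate_fn()
        if result is not None:
            samples.append(result)
    return samples

# stand-in generator: every third call returns an invalid output
outputs = itertools.cycle([("i am happy to be here", "joy"), ("how dare you?", "anger"), None])
samples = collect_samples(lambda: next(outputs), n=4)
print(len(samples))
```

With a paid API, such a cap also bounds the number of requests that a single cell can send.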

In [None]:
synthetic_gptneo_10_ds = load_from_disk('./drive/MyDrive/text_augmentation/gpt-neo/10')
synthetic_gptneo_50_ds = load_from_disk('./drive/MyDrive/text_augmentation/gpt-neo/50')
synthetic_gptneo_100_ds = load_from_disk('./drive/MyDrive/text_augmentation/gpt-neo/100')
synthetic_gptneo_200_ds = load_from_disk('./drive/MyDrive/text_augmentation/gpt-neo/200')

In [None]:
print("Dataset of 10 synthetic samples:")
print("Text: {} \nLabel: {}\n".format(synthetic_gptneo_10_ds['text'][5], idx2label[synthetic_gptneo_10_ds['label'][5]]))
print("Dataset of 50 synthetic samples:")
print("Text: {} \nLabel: {}\n".format(synthetic_gptneo_50_ds['text'][5], idx2label[synthetic_gptneo_50_ds['label'][5]]))
print("Dataset of 100 synthetic samples:")
print("Text: {} \nLabel: {}\n".format(synthetic_gptneo_100_ds['text'][5], idx2label[synthetic_gptneo_100_ds['label'][5]]))
print("Dataset of 200 synthetic samples:")
print("Text: {} \nLabel: {}\n".format(synthetic_gptneo_200_ds['text'][5], idx2label[synthetic_gptneo_200_ds['label'][5]]))

Dataset of 10 synthetic samples:
Text: hy do i feel bored when everyone is talking about something else  
Label: anger

Dataset of 50 synthetic samples:
Text: I am happy to be here  
Label: joy

Dataset of 100 synthetic samples:
Text: I feel disgusted!  
Label: surprise

Dataset of 200 synthetic samples:
Text: i love my job and the people I work with  
Label: joy



GPT-Neo generates realistic samples! However, there are cases where the label is wrong: the third example above, "I feel disgusted!", is labeled as surprise.

## 🚀 Model 

Here we define the model and the training pipeline. We will use [DistilBERT](https://arxiv.org/abs/1910.01108), a lightweight Transformer trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster, while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.

In [None]:
metric = load_metric("accuracy")

batch_size = 8
max_size = 200 # size of the largest augmented dataset
steps = 5*int(max_size/batch_size) # 5 epochs in the large dataset

run_dicts = [] # list of dicts to store both metrics and logs for all the experiment runs 
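
Because the number of training steps is fixed, the smaller train sets are passed over many more times than the largest one. The arithmetic can be sketched as follows, assuming a single device and no gradient accumulation; `effective_epochs` is a hypothetical helper, and the sizes are the 30 real samples plus 0, 10, 50, 100 and 200 synthetic ones.

```python
batch_size = 8
max_size = 200

steps_per_epoch = int(max_size / batch_size)  # 25 steps per pass over 200 samples
steps = 5 * steps_per_epoch                   # 125 steps in total

def effective_epochs(num_samples):
    """Approximate number of passes over num_samples within `steps` steps."""
    return steps * batch_size / num_samples

for size in (30, 40, 80, 130, 230):
    print(size, round(effective_epochs(size), 1))
```

Under these assumptions the baseline run sees each of its 30 samples roughly 33 times, which is worth keeping in mind when comparing the curves.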

In [None]:
def compute_metrics(eval_pred):
    """
        Computes the accuracy of the model's predictions: the number of samples whose
        predicted class (argmax of the logits) equals the reference label, divided by
        the total number of samples.
    """

    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels) 


class LogAccumulatorCallback(TrainerCallback):
    """
    A class that stores both the training and the evaluation loss
    """
    
    def __init__(self):
        self.acc_logs = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        _ = logs.pop("total_flos", None)
        if state.is_local_process_zero and ('loss' in logs or 'eval_loss' in logs):
            self.acc_logs.append(logs.copy())


def train_and_evaluate(train_ds, test_ds, identifier):
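    # use the DistilBERT tokenizer explicitly; the global `tokenizer` may have been
    # reassigned to the GPT-J or GPT-Neo tokenizer in the cells above
    clf_tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

    def tokenize(batch):
        return clf_tokenizer(batch['text'], padding=True, truncation=True)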
    
    train_ds = train_ds.map(tokenize, batched=True, batch_size=len(train_ds), remove_columns=["text"])
    test_ds = test_ds.map(tokenize, batched=True, batch_size=len(test_ds), remove_columns=["text"])
    
    training_args = TrainingArguments(
        identifier,
        evaluation_strategy="steps",
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        logging_strategy="steps",
        weight_decay=0.01,
        learning_rate=1e-4,
        max_steps=steps,
        logging_steps=20
    )
    
    model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=3)

    # Partially freeze the model: the embeddings and the first three transformer layers.
    # With such small datasets this usually reduces overfitting; it also lowers
    # memory usage and speeds up training.
    for param in model.distilbert.embeddings.parameters():
        param.requires_grad = False

    for i in [0, 1, 2]:
        for param in model.distilbert.transformer.layer[i].parameters():
            param.requires_grad = False

            
    logger = LogAccumulatorCallback()
    trainer = Trainer(
        model=model, args=training_args, 
        train_dataset=train_ds, 
        eval_dataset=test_ds,
        compute_metrics=compute_metrics,
        callbacks=[logger],
    )
    trainer.train()
    metrics = trainer.evaluate()
    
    return metrics, logger.acc_logs
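
For a three-class problem, the accuracy reported by `compute_metrics` is simply the share of samples whose highest-scoring class matches the reference label. The plain-Python sketch below illustrates the computation on made-up logits; the notebook itself relies on `np.argmax` and the loaded metric, and `argmax` and `accuracy` here are hypothetical helpers.

```python
def argmax(row):
    """Index of the largest value in row."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(logits, labels):
    """Share of rows whose argmax equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# made-up logits for four samples and three classes
logits = [[2.0, 0.1, -1.0], [0.3, 0.2, 1.5], [0.0, 0.9, 0.1], [1.2, 1.1, 0.0]]
labels = [0, 2, 0, 0]
print(accuracy(logits, labels))  # 0.75
```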

### Model baseline

In [None]:
# train our model on the baseline dataset without augmentation
metrics, logs = train_and_evaluate(emotion_train_ds, emotion_test_ds, "baseline")

run_dicts.append({
    "id": "baseline",
    "metrics": metrics,
    "logs": logs
})

### Model with augmented data of GPT-3

In [None]:
# train our model on the augmented dataset that contains 10 extra synthetic samples.
augmented_gpt3_10_ds = concatenate_datasets([emotion_train_ds, synthetic_gpt3_10_ds])
metrics, logs = train_and_evaluate(augmented_gpt3_10_ds, emotion_test_ds, "augmented_10")

run_dicts.append({
    "id": "augmented_gpt3_10",
    "metrics": metrics,
    "logs": logs
})

In [None]:
# train our model on the augmented dataset that contains 50 extra synthetic samples.
augmented_gpt3_50_ds = concatenate_datasets([emotion_train_ds, synthetic_gpt3_50_ds])
metrics, logs = train_and_evaluate(augmented_gpt3_50_ds, emotion_test_ds, "augmented_50")

run_dicts.append({
    "id": "augmented_gpt3_50",
    "metrics": metrics,
    "logs": logs
})

In [None]:
# train our model on the augmented dataset that contains 100 extra synthetic samples.
augmented_gpt3_100_ds = concatenate_datasets([emotion_train_ds, synthetic_gpt3_100_ds])
metrics, logs = train_and_evaluate(augmented_gpt3_100_ds, emotion_test_ds, "augmented_100")

run_dicts.append({
    "id": "augmented_gpt3_100",
    "metrics": metrics,
    "logs": logs
})

In [None]:
# train our model on the augmented dataset that contains 200 extra synthetic samples.
augmented_gpt3_200_ds = concatenate_datasets([emotion_train_ds, synthetic_gpt3_200_ds])
metrics, logs = train_and_evaluate(augmented_gpt3_200_ds, emotion_test_ds, "augmented_200")

run_dicts.append({
    "id": "augmented_gpt3_200",
    "metrics": metrics,
    "logs": logs
})

### Model with augmented data of GPT-J

In [None]:
# train our model on the dataset augmented by GPT-J that contains 10 extra synthetic samples.
augmented_gptj_10_ds = concatenate_datasets([emotion_train_ds, synthetic_gptj_10_ds])
metrics, logs = train_and_evaluate(augmented_gptj_10_ds, emotion_test_ds, "augmented_gptj_10")

run_dicts.append({
    "id": "augmented_gptj_10",
    "metrics": metrics,
    "logs": logs
})

In [None]:
# train our model on the dataset augmented by GPT-J that contains 50 extra synthetic samples.
augmented_gptj_50_ds = concatenate_datasets([emotion_train_ds, synthetic_gptj_50_ds])
metrics, logs = train_and_evaluate(augmented_gptj_50_ds, emotion_test_ds, "augmented_gptj_50")

run_dicts.append({
    "id": "augmented_gptj_50",
    "metrics": metrics,
    "logs": logs
})

In [None]:
# train our model on the dataset augmented by GPT-J that contains 100 extra synthetic samples.
augmented_gptj_100_ds = concatenate_datasets([emotion_train_ds, synthetic_gptj_100_ds])
metrics, logs = train_and_evaluate(augmented_gptj_100_ds, emotion_test_ds, "augmented_gptj_100")

run_dicts.append({
    "id": "augmented_gptj_100",
    "metrics": metrics,
    "logs": logs
})

In [None]:
# train our model on the dataset augmented by GPT-J that contains 200 extra synthetic samples.
augmented_gptj_200_ds = concatenate_datasets([emotion_train_ds, synthetic_gptj_200_ds])
metrics, logs = train_and_evaluate(augmented_gptj_200_ds, emotion_test_ds, "augmented_gptj_200")

run_dicts.append({
    "id": "augmented_gptj_200",
    "metrics": metrics,
    "logs": logs
})

### Model with augmented data of GPT-Neo

In [None]:
# train our model on the dataset augmented by GPT-Neo that contains 10 extra synthetic samples.
augmented_gptneo_10_ds = concatenate_datasets([emotion_train_ds, synthetic_gptneo_10_ds])
metrics, logs = train_and_evaluate(augmented_gptneo_10_ds, emotion_test_ds, "augmented_gptneo_10")

run_dicts.append({
    "id": "augmented_gptneo_10",
    "metrics": metrics,
    "logs": logs
})

In [None]:
# train our model on the dataset augmented by GPT-Neo that contains 50 extra synthetic samples.
augmented_gptneo_50_ds = concatenate_datasets([emotion_train_ds, synthetic_gptneo_50_ds])
metrics, logs = train_and_evaluate(augmented_gptneo_50_ds, emotion_test_ds, "augmented_gptneo_50")

run_dicts.append({
    "id": "augmented_gptneo_50",
    "metrics": metrics,
    "logs": logs
})

In [None]:
# train our model on the dataset augmented by GPT-Neo that contains 100 extra synthetic samples.
augmented_gptneo_100_ds = concatenate_datasets([emotion_train_ds, synthetic_gptneo_100_ds])
metrics, logs = train_and_evaluate(augmented_gptneo_100_ds, emotion_test_ds, "augmented_gptneo_100")

run_dicts.append({
    "id": "augmented_gptneo_100",
    "metrics": metrics,
    "logs": logs
})

In [None]:
# train our model on the dataset augmented by GPT-Neo that contains 200 extra synthetic samples.
augmented_gptneo_200_ds = concatenate_datasets([emotion_train_ds, synthetic_gptneo_200_ds])
metrics, logs = train_and_evaluate(augmented_gptneo_200_ds, emotion_test_ds, "augmented_gptneo_200")

run_dicts.append({
    "id": "augmented_gptneo_200",
    "metrics": metrics,
    "logs": logs
})

##  📊 Visualize

To evaluate the effectiveness of text augmentation in the performance of the model, we'll visualize the results.

### Model Performance

Now let's compare the performance of the trained models. First, we will examine the accuracy improvements of text augmentation using GPT-3, GPT-J and GPT-Neo and then we will compare the three methods.

In [None]:
df = pd.DataFrame(run_dicts)

gpt3_names = ['augmented_gpt3_10', 'augmented_gpt3_50', 'augmented_gpt3_100', 'augmented_gpt3_200']
gptj_names = ['augmented_gptj_10', 'augmented_gptj_50', 'augmented_gptj_100', 'augmented_gptj_200']
gptneo_names = ['augmented_gptneo_10', 'augmented_gptneo_50', 'augmented_gptneo_100', 'augmented_gptneo_200']

# define a dataframe for each LM
df_baseline = df.loc[df['id'] == 'baseline']
df_gpt3 = df.loc[df['id'].isin(gpt3_names)]
df_gptj = df.loc[df['id'].isin(gptj_names)]
df_gptneo = df.loc[df['id'].isin(gptneo_names)]

In [None]:
# plot accuracy curve of GPT-3

fig = go.Figure()

fig.add_trace(go.Scatter(
                    x=list(range(0, steps, 20)),
                    y=pd.DataFrame(df_baseline.loc[0]['logs']).dropna(subset=['eval_accuracy'])['eval_accuracy'],
                    name='accuracy {}'.format('baseline')))

for index, row in df_gpt3.iterrows():
    
    fig.add_trace(go.Scatter(
                    x=list(range(0, steps, 20)),
                    y=pd.DataFrame(row['logs']).dropna(subset=['eval_accuracy'])['eval_accuracy'],
                    name='accuracy {}'.format(row['id'])))

fig.update_xaxes(title_text='step')
fig.update_yaxes(title_text='accuracy')
fig.update_layout(
    title="Accuracy of the model in different versions of the dataset (augmentation by GPT-3).")

fig.show()

We observe that the accuracy of the model increases as we add more synthetic samples. With 200 extra synthetic samples, the accuracy exceeds 70%, indicating that text augmentation can greatly improve the accuracy of our model.

In [None]:
# plot accuracy curve of GPT-J

fig = go.Figure()

fig.add_trace(go.Scatter(
                    x=list(range(0, steps, 20)),
                    y=pd.DataFrame(df_baseline.loc[0]['logs']).dropna(subset=['eval_accuracy'])['eval_accuracy'],
                    name='accuracy {}'.format('baseline')))

for index, row in df_gptj.iterrows():
    
    fig.add_trace(go.Scatter(
                    x=list(range(0, steps, 20)),
                    y=pd.DataFrame(row['logs']).dropna(subset=['eval_accuracy'])['eval_accuracy'],
                    name='accuracy {}'.format(row['id'])))

fig.update_xaxes(title_text='step')
fig.update_yaxes(title_text='accuracy')
fig.update_layout(
    title="Accuracy of the model in different versions of the dataset (augmentation by GPT-J).")

fig.show()

The performance curve of GPT-J is awesome 🎉. Each time we increase the size of the train dataset with augmentation, there is a consistent increase in the accuracy reaching 85%.

In [None]:
# plot accuracy curve of GPT-Neo

fig = go.Figure()

fig.add_trace(go.Scatter(
                    x=list(range(0, steps, 20)),
                    y=pd.DataFrame(df_baseline.loc[0]['logs']).dropna(subset=['eval_accuracy'])['eval_accuracy'],
                    name='accuracy {}'.format('baseline')))

for index, row in df_gptneo.iterrows():
    
    fig.add_trace(go.Scatter(
                    x=list(range(0, steps, 20)),
                    y=pd.DataFrame(row['logs']).dropna(subset=['eval_accuracy'])['eval_accuracy'],
                    name='accuracy {}'.format(row['id'])))

fig.update_xaxes(title_text='step')
fig.update_yaxes(title_text='accuracy')
fig.update_layout(
    title="Accuracy of the model in different versions of the dataset (augmentation by GPT-neo).")

fig.show()

The results of GPT-Neo are worse 😞. The performance of the model decreases when we generate synthetic data (except in the case of 50 synthetic samples).

The next step is to investigate which language model boosts the final performance more.

In [None]:
# keep the accuracy of the last training step
acc_gpt3 = [df_baseline.iloc[0]['logs'][-1]['eval_accuracy']]
for i in range(4):
  acc_gpt3.append(df_gpt3.iloc[i]['logs'][-1]['eval_accuracy'])

acc_gptj = [df_baseline.iloc[0]['logs'][-1]['eval_accuracy']]
for i in range(4):
  acc_gptj.append(df_gptj.iloc[i]['logs'][-1]['eval_accuracy'])

acc_gptneo = [df_baseline.iloc[0]['logs'][-1]['eval_accuracy']]
for i in range(4):
  acc_gptneo.append(df_gptneo.iloc[i]['logs'][-1]['eval_accuracy'])

fig = go.Figure()

fig.add_trace(go.Scatter(
                x=[0, 10, 50, 100, 200],
                y=acc_gpt3,
                name='GPT-3'))


fig.add_trace(go.Scatter(
                x=[0, 10, 50, 100, 200],
                y=acc_gptj,
                name='GPT-J'))

fig.add_trace(go.Scatter(
                x=[0, 10, 50, 100, 200],
                y=acc_gptneo,
                name='GPT-Neo'))

fig.update_xaxes(title_text='number of synthetic samples')
fig.update_yaxes(title_text='accuracy')
fig.update_layout(
    title="Comparison of GPT-3, GPT-J and GPT-Neo in text augmentation.")

fig.show()
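
The cell above reads `logs[-1]['eval_accuracy']`, which relies on the last accumulated entry being an evaluation log. A slightly more defensive lookup searches backwards for the last entry that actually contains the key. This is only a sketch; `final_eval_accuracy` is a hypothetical helper and the log values are made up to show the shape of the data.

```python
def final_eval_accuracy(logs):
    """Return eval_accuracy from the last log entry that contains it, else None."""
    for entry in reversed(logs):
        if "eval_accuracy" in entry:
            return entry["eval_accuracy"]
    return None

# made-up log entries in the shape collected by LogAccumulatorCallback
toy_logs = [
    {"loss": 1.05, "step": 20},
    {"eval_loss": 0.98, "eval_accuracy": 0.52, "step": 20},
    {"loss": 0.71, "step": 40},
    {"eval_loss": 0.80, "eval_accuracy": 0.61, "step": 40},
    {"loss": 0.55, "step": 60},
]
print(final_eval_accuracy(toy_logs))  # 0.61
```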

GPT-J outperforms GPT-3 in almost all versions of the dataset 👏. These are very exciting results, indicating that we can use the open-source GPT-J instead of GPT-3 for augmentation. GPT-Neo, on the other hand, performs very poorly.

### Label Distribution

It is interesting to examine how the distribution of the labels changes when we generate more and more synthetic samples. Our initial train set is balanced since it consists of 10 samples per class. Let's see how the distribution of the labels changes after text augmentation! 

In [None]:
fig = make_subplots(rows=1, cols=4,
                    subplot_titles=("Baseline", "Augmented-GPT3-200", "Augmented-GPTJ-200", "Augmented-GPTneo-200"))

trace0 = go.Histogram(x=[idx2label[i] for i in emotion_train_ds["label"]],
                   opacity=0.8)

trace1 = go.Histogram(x=[idx2label[i] for i in augmented_gpt3_200_ds["label"]],
                   opacity=0.8)

trace2 = go.Histogram(x=[idx2label[i] for i in augmented_gptj_200_ds["label"]],
                   opacity=0.8)

trace3 = go.Histogram(x=[idx2label[i] for i in augmented_gptneo_200_ds["label"]],
                   opacity=0.8)

fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)
fig.append_trace(trace2, 1, 3)
fig.append_trace(trace3, 1, 4)
fig.update_layout(showlegend=False, title_text="Distribution of labels", 
                  bargap=0.30)

fig.show()

We observe that the distribution changes a lot and the large augmented dataset is highly imbalanced 😯! GPT-3 generated too many samples labeled as 'anger', while GPT-J and GPT-Neo generated more samples labeled as 'joy'. A plausible reason for this behavior is the data these models were pre-trained on: GPT-J and GPT-Neo were trained on the same dataset and shift the label distribution in a similar way.

That's an interesting observation that we should further examine in the future! 
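
To quantify the imbalance rather than read it off the histograms, the label shares can be computed directly. The sketch below uses `collections.Counter` on a list of label ids; `label_shares` is a hypothetical helper and `toy_labels` is made up purely to show the output format, not taken from the experiments.

```python
from collections import Counter

idx2label = {0: "joy", 1: "anger", 2: "surprise"}

def label_shares(label_ids):
    """Map each label name to its share of the dataset (between 0.0 and 1.0)."""
    counts = Counter(idx2label[i] for i in label_ids)
    total = sum(counts.values())
    return {name: counts.get(name, 0) / total for name in idx2label.values()}

# made-up label ids, only to show the output format
toy_labels = [0] * 10 + [1] * 25 + [2] * 5
print(label_shares(toy_labels))
```

In the notebook, the same helper could be applied to a column such as `augmented_gpt3_200_ds["label"]`.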

## ⛏ Generate more samples

Text augmentation increased accuracy by a lot! The next step is to examine how much the accuracy improves when we add even more labeled samples. We expect that at some point the accuracy curve will stop increasing and stabilize.


In [None]:
# load the synthetic datasets with 300, 400 and 500 samples.
# run this if the dataset has already been saved!

max_size = 500
steps = 5*int(max_size/batch_size) # 5 epochs in the large dataset

synthetic_gpt3_300_ds = load_from_disk('./drive/MyDrive/text_augmentation/gpt-3/300')
synthetic_gpt3_400_ds = load_from_disk('./drive/MyDrive/text_augmentation/gpt-3/400')
synthetic_gpt3_500_ds = load_from_disk('./drive/MyDrive/text_augmentation/gpt-3/500')

In [None]:
# train our model on the augmented dataset that contains 300 extra synthetic samples.
augmented_gpt3_300_ds = concatenate_datasets([emotion_train_ds, synthetic_gpt3_300_ds])
metrics, logs = train_and_evaluate(augmented_gpt3_300_ds, emotion_test_ds, "augmented_300")

run_dicts.append({
    "id": "augmented_gpt3_300",
    "metrics": metrics,
    "logs": logs
})

In [None]:
# train our model on the augmented dataset that contains 400 extra synthetic samples.
augmented_gpt3_400_ds = concatenate_datasets([emotion_train_ds, synthetic_gpt3_400_ds])
metrics, logs = train_and_evaluate(augmented_gpt3_400_ds, emotion_test_ds, "augmented_400")

run_dicts.append({
    "id": "augmented_gpt3_400",
    "metrics": metrics,
    "logs": logs
})

In [None]:
# train our model on the augmented dataset that contains 500 extra synthetic samples.
augmented_gpt3_500_ds = concatenate_datasets([emotion_train_ds, synthetic_gpt3_500_ds])
metrics, logs = train_and_evaluate(augmented_gpt3_500_ds, emotion_test_ds, "augmented_500")

run_dicts.append({
    "id": "augmented_gpt3_500",
    "metrics": metrics,
    "logs": logs
})

In [None]:
df = pd.DataFrame(run_dicts)
gpt3_names_more = ['augmented_gpt3_10', 'augmented_gpt3_50', 'augmented_gpt3_100', 'augmented_gpt3_200', 
              'augmented_gpt3_300', 'augmented_gpt3_400', 'augmented_gpt3_500']

df_gpt3_more = df.loc[df['id'].isin(gpt3_names_more)]

In [None]:
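# baseline accuracy first, then one accuracy per GPT-3 run in increasing dataset size
acc_gpt3_more = [df_baseline.iloc[0]['logs'][-1]['eval_accuracy']]
for name in gpt3_names_more:
    # look each run up by id so the order never depends on how run_dicts was filled
    run = df_gpt3_more.loc[df_gpt3_more['id'] == name].iloc[-1]
    acc_gpt3_more.append(run['logs'][-1]['eval_accuracy'])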

fig = go.Figure()

fig.add_trace(go.Scatter(
                x=[0, 10, 50, 100, 200, 300, 400, 500],
                y=acc_gpt3_more,
                name='GPT-3'))

fig.update_xaxes(title_text='number of extra samples')
fig.update_yaxes(title_text='accuracy')
fig.update_layout(showlegend=True, 
    title="Accuracy of the model in different versions of the dataset (augmentation by GPT-3).")

fig.show()

The curve rises steeply at first, but the gains shrink as we generate more samples. In other words, there is a limit to the performance improvement that text augmentation can yield.
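
To make the flattening of the curve more concrete, we could look at the accuracy gained per extra synthetic sample between consecutive dataset sizes. The cell below is only a sketch: `marginal_gains` and `plateau_size` are hypothetical helpers, the threshold is an arbitrary choice, and the accuracy values are placeholders rather than numbers measured in this notebook. With the real runs you would pass `acc_gpt3_more` instead.

```python
def marginal_gains(sizes, accuracies):
    """Accuracy gained per extra sample between consecutive dataset sizes."""
    if len(sizes) != len(accuracies):
        raise ValueError("sizes and accuracies must have the same length")
    return [
        (accuracies[i + 1] - accuracies[i]) / (sizes[i + 1] - sizes[i])
        for i in range(len(sizes) - 1)
    ]


def plateau_size(sizes, accuracies, min_gain):
    """First size after which the gain per sample drops below min_gain, else None."""
    for size, gain in zip(sizes, marginal_gains(sizes, accuracies)):
        if gain < min_gain:
            return size
    return None


sizes = [0, 10, 50, 100, 200, 300, 400, 500]
# Placeholder accuracies for illustration only; NOT the notebook's results.
placeholder_acc = [0.40, 0.50, 0.60, 0.65, 0.68, 0.69, 0.695, 0.695]

gains = marginal_gains(sizes, placeholder_acc)
plateau = plateau_size(sizes, placeholder_acc, min_gain=0.0002)
print(gains, plateau)
```

A per-sample gain close to zero over several consecutive sizes is a reasonable signal that generating more samples is no longer worth the cost.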

## 🏁 Take-aways 


You've reached the finish line! 👏  Let's sum up some of the findings.

* We generated hyper-realistic synthetic samples by leveraging the few-shot capabilities of large LMs and prompt engineering.
* As a baseline, we trained a DistilBERT model on the Emotion dataset using a small subset of 30 samples.
* Then, we augmented the small dataset with 10, 50, 100 and 200 extra samples generated by GPT-3, GPT-J and GPT-Neo.
* We compared the performance of the models in all these settings and showed that data augmentation boosts the performance 🥳.
* We showed that GPT-J performs better than GPT-3 in our task and can be used as an open-source alternative for text augmentation 💪. GPT-Neo performs poorly, most likely because it is a much smaller model.
* However, the augmented datasets are no longer balanced: large LMs tend to favor certain labels, presumably because of the data they were trained on.
* Finally, as we generate more and more synthetic samples and the training set grows, the overall performance increases up to a point. Beyond it, text augmentation no longer helps and we should look into other ways of increasing performance.




