# 📈: Text Augmentation using GPT-3

This notebook looks into the possibility of performing data augmentation on an NLP dataset using the GPT-3 language model.

Data augmentation techniques are used to generate additional samples. Data augmentation is already standard practice in computer vision projects 👌, but can also be leveraged in many NLP problems. We'll use a limited training set to simulate a real-world use case, where we often are constrained by the size of the available data 🤦.

## 🛠️ Getting started

The cells below will set up everything that is required to get started with data augmentation and fine-tuning an NLP model with the HuggingFace API.

### Setup

In [None]:
!!pip install -qq transformers datasets tokenizers openai requests

[]

### Imports

In [None]:
import re
import json
import torch
import random
import requests
import numpy as np
import pandas as pd 
import plotly.express as px
import plotly.graph_objects as go

from plotly.subplots import make_subplots
from datasets import load_dataset, concatenate_datasets, load_from_disk, load_metric, Dataset
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, TrainerCallback

### Download dataset
We'll use [Emotion](https://huggingface.co/datasets/emotion), a dataset of English Twitter messages labeled with one of six basic emotions: anger, fear, joy, love, sadness and surprise. To make our task a bit easier, we will use only three of them, namely:
- joy 😂
- anger 😠
- surprise 😯

In [None]:
# load the dataset and filter on samples that have a token count less than 30 to use only short tweets
max_input_len = 30
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
emotion_ds = load_dataset("emotion").filter(lambda e: len(tokenizer.batch_encode_plus([e['text']]).input_ids[0]) < int(max_input_len))


The dataset is already split into 16,000 train, 2,000 validation and 2,000 test samples. To investigate the effectiveness of the GPT3Mix augmentation method, we will use only 10 samples per class as a train set.
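
The next cell remaps the original label ids with integer division (`label // 2`), which works because the three kept ids are 1, 3 and 5. A minimal sketch of a more explicit alternative, assuming the original id order listed in that cell's comments; `ORIGINAL_TO_NEW` and `remap_label` are illustrative names, not part of the notebook:

```python
# Original Emotion ids kept in this notebook -> new contiguous ids.
# Assumes sadness=0, joy=1, love=2, anger=3, fear=4, surprise=5.
ORIGINAL_TO_NEW = {1: 0, 3: 1, 5: 2}  # joy, anger, surprise


def remap_label(example):
    """Return a copy of the example with its label remapped to 0..2."""
    remapped = dict(example)
    remapped["label"] = ORIGINAL_TO_NEW[example["label"]]
    return remapped


# The explicit table agrees with the integer-division shortcut.
assert all(ORIGINAL_TO_NEW[i] == i // 2 for i in (1, 3, 5))
```

An explicit table raises a `KeyError` on an unexpected label instead of silently mapping it (for example, `4 // 2` also yields 2).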

In [None]:
# select the first 10 train samples of each of the three emotions
# (`select(range(10))` is deterministic, not a random draw)
# original labels: sadness (0), joy (1), love (2), anger (3), fear (4), surprise (5)
joy_train_samples = emotion_ds['train'].filter(lambda e: e['label'] == 1).select(range(10))
anger_train_samples = emotion_ds['train'].filter(lambda e: e['label'] == 3).select(range(10))
surprise_train_samples = emotion_ds['train'].filter(lambda e: e['label'] == 5).select(range(10))

# map emotions to integers for labeling
# joy (0), anger (1), surprise (2)
def map_emotions(example):
  example['label'] = example['label']//2
  return example

# create a train set that consists of 10 samples per class and filter the test 
# set to contain only the valid labels
emotion_train_ds = concatenate_datasets([joy_train_samples, anger_train_samples, surprise_train_samples]).map(lambda e: map_emotions(e)).shuffle(seed=42)
emotion_test_ds = emotion_ds["test"].filter(lambda e: e['label'] in [1, 3, 5]).map(lambda e: map_emotions(e))

# define the mapping between emotions and labels
idx2label = {0: 'joy', 1: 'anger', 2: 'surprise'}
label2idx = {'joy': 0, 'anger': 1, 'surprise': 2}


Before proceeding with the data augmentation, let's have a look into the baseline dataset 😎!

In [None]:
print("Train set")
print("Total samples: {}\n".format(len(emotion_train_ds)))
print("A random sample")
print("Text: {} \nLabel: {}".format(emotion_train_ds['text'][10], idx2label[emotion_train_ds['label'][10]]))
print("\n")

print("Test set")
print("Total samples: {}\n".format(len(emotion_test_ds)))
print("A random sample")
print("Text: {} \nLabel: {}".format(emotion_test_ds['text'][10], idx2label[emotion_test_ds['label'][10]]))

Train set
Total samples: 30

A random sample
Text: i feel angered and firey 
Label: anger


Test set
Total samples: 775

A random sample
Text: im feeling very peaceful about our wedding again now after having 
Label: joy


## GPT3Mix pipeline

We will use the [GPT3Mix](https://arxiv.org/abs/2104.08826) method to generate synthetic but hyper-realistic samples from a mixture of real samples, using the [GPT-3](https://arxiv.org/abs/2005.14165) language model. Specifically, GPT3Mix takes two real samples from our dataset, embeds them in a carefully designed prompt, and generates an augmented, mixed sample influenced by both.

Generally, a GPT3Mix prompt looks like this:

    Each item in the following list contains a <text type> and the
    respective <label type>. <label type> is one of '<label token 1>',
    ..., or '<label token N>'.
    <text type>: <example text 1> (<label type>: <example label 1>)
    ...
    <text type>: <example text k> (<label type>: <example label k>)
    <text type>:

In our case the prompt looks like this:

    Each item in the following list contains a tweet and the
    respective sentiment. Sentiment is one of 'joy', 'surprise' or 'anger'.
    Tweet: i feel angered and firey (Sentiment: anger)
    Tweet: im feeling very peaceful about our wedding again now after having (Sentiment: joy)
    Tweet:

You can find more information on how GPT3Mix augmentation method works in the [paper](https://arxiv.org/abs/2104.08826).
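
The general template above can be written as one small function. This is a sketch under assumptions, not the paper's reference implementation: `build_gpt3mix_prompt` and its parameters are illustrative names, and it accepts any number of example pairs rather than exactly two.

```python
def build_gpt3mix_prompt(examples, label_tokens, text_type="tweet", label_type="sentiment"):
    """Build a GPT3Mix-style prompt from (text, label) pairs."""
    quoted = [f"'{token}'" for token in label_tokens]
    if len(quoted) > 1:
        label_list = ", ".join(quoted[:-1]) + " or " + quoted[-1]
    else:
        label_list = quoted[0]
    header = (f"Each item in the following list contains a {text_type} and the "
              f"respective {label_type}. {label_type.capitalize()} is one of {label_list}.")
    lines = [header]
    for text, label in examples:
        lines.append(f"{text_type.capitalize()}: {text} ({label_type.capitalize()}: {label})")
    # The trailing open item is what the language model is asked to complete.
    lines.append(f"{text_type.capitalize()}:")
    return "\n".join(lines)


prompt = build_gpt3mix_prompt(
    [("i feel angered and firey", "anger"),
     ("im feeling very peaceful about our wedding again now after having", "joy")],
    label_tokens=["joy", "surprise", "anger"],
)
print(prompt)
```

With the two example tweets, this prints the same four-line prompt shown above for our case.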

First, we need to extract pairs of samples from the train set. Various extraction strategies can be used to improve the quality of the synthetic samples. Here we simply draw the pairs at random: repeating the random draw many times yields a diverse synthetic dataset.

In [None]:
# define a function that returns two random samples from the train set.
def get_random_samples():
  s1 = random.randint(0, len(emotion_train_ds)-1)
  s2 = random.randint(0, len(emotion_train_ds)-1)
  return emotion_train_ds['text'][s1], emotion_train_ds['label'][s1], emotion_train_ds['text'][s2], emotion_train_ds['label'][s2]

# define a function that takes as input two samples and generates the prompt
# that we should pass to the GPT-3 language model for completion.
def get_prompt(text1, label1, text2, label2):
  description = "Each item in the following list contains a tweet and the respective sentiment. Sentiment is one of 'joy', 'surprise' or 'anger'."
  prompt = (f"{description}\n"
            f"Tweet: {text1} (Sentiment: {idx2label[label1]})\n"
            f"Tweet: {text2} (Sentiment: {idx2label[label2]})\n"
            f"Tweet:")
  return prompt
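
`get_random_samples` draws its two indices independently, so the same tweet can occasionally fill both slots of a prompt. If that is undesirable, one option is to draw without replacement. The sketch below is only an illustration: `sample_distinct_pair` is a hypothetical helper, and apart from the first tweet the example texts are made up.

```python
import random


def sample_distinct_pair(texts, labels, rng=random):
    """Pick two different examples and return (text1, label1, text2, label2)."""
    if len(texts) < 2:
        raise ValueError("need at least two examples to form a pair")
    # sample() draws without replacement, so the two indices always differ.
    i, j = rng.sample(range(len(texts)), 2)
    return texts[i], labels[i], texts[j], labels[j]


# Made-up mini train set (labels: joy=0, anger=1, surprise=2).
texts = ["i feel angered and firey", "i feel so happy today", "i am amazed by this"]
labels = [1, 0, 2]
rng = random.Random(0)  # seeded only to make the sketch reproducible
text1, label1, text2, label2 = sample_distinct_pair(texts, labels, rng)
```

Reading the `text` and `label` columns into plain lists once also avoids re-materialising them from the dataset on every draw.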

In [None]:
# define the number of synthetic samples to generate
n = 10
new_texts = []
new_labels = []
api_key = "YOUR_API_KEY"  # insert your API key for GPT-3
headers = {'Authorization' : 'Bearer ' + api_key ,
              'Content-type':'application/json', 
              'Accept':'application/json'}

iter = 0
while iter < n:
  # select two random samples from training set
  text1, label1, text2, label2 = get_random_samples()
  # create the prompt
  prompt = get_prompt(text1, label1, text2, label2)
  # send a post request to gpt-3 using the prompt
  response = requests.post('https://api.openai.com/v1/engines/davinci/completions', 
                           headers=headers,
                           data = json.dumps({"prompt": prompt, 
                                              "max_tokens": 30,
                                              "temperature": 0.9,
                                              "top_p": 0.95}))

  # get response and extract the generated text and label
  # the generated output will be in the form "<text> (Sentiment: <label>)"
  data = response.json()['choices'][0]['text'].split('\n')[0].split('(Sentiment:')

  if len(data) < 2:
    # the format of the response is invalid
    continue

  text = data[0]
  label = data[1].split(')')[0].strip()

  if label not in ['joy', 'anger', 'surprise']:
    # the format of the response is invalid
    continue

  new_texts.append(text)
  new_labels.append(label2idx[label])
  iter += 1
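
The parsing step in the loop above can be isolated into a pure function, which makes it easy to test without calling the API. A minimal sketch, assuming completions follow the `<text> (Sentiment: <label>)` format; `parse_completion`, `VALID_LABELS` and `COMPLETION_PATTERN` are illustrative names, not part of the notebook:

```python
import re

VALID_LABELS = ("joy", "anger", "surprise")
# Matches "<text> (Sentiment: <label>)" at the start of a single line.
COMPLETION_PATTERN = re.compile(r"^(?P<text>.+?)\s*\(Sentiment:\s*(?P<label>[a-z]+)\s*\)")


def parse_completion(completion, valid_labels=VALID_LABELS):
    """Return (text, label) from a raw completion, or None if it is malformed."""
    # Only the first generated line is used, as in the loop above.
    first_line = completion.split("\n")[0].strip()
    match = COMPLETION_PATTERN.match(first_line)
    if match is None:
        return None
    text, label = match.group("text").strip(), match.group("label")
    if label not in valid_labels:
        return None
    return text, label


print(parse_completion(" i want a beer right now (Sentiment: anger)\nTweet: ..."))
```

Unlike the loop above, this variant strips the whitespace around the generated text. The loop could call it and `continue` whenever it returns `None`; capping the number of attempts would also keep the loop from running forever if the API keeps returning malformed output.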

We will generate 3 synthetic datasets (10, 50 and 100 extra samples) in order to examine how the size of the dataset influences the model performance.

In [None]:
# define the synthetic dataset and save it to disk so as to prevent sending 
# many api requests
synthetic_ds = Dataset.from_dict({'text': new_texts, 'label': new_labels})

synthetic_ds.save_to_disk('./synthetic_dataset_10')
# synthetic_ds.save_to_disk('./synthetic_dataset_50')
# synthetic_ds.save_to_disk('./synthetic_dataset_100')

Now let's see some synthetic data to examine their quality!

In [None]:
# load the synthetic datasets with 10, 50 and 100 samples
# run this if the datasets have already been saved; set the paths to match your workspace
synthetic_10_ds = load_from_disk('./drive/MyDrive/gpt3mix_synthetic_data/synthetic_dataset_10')
synthetic_50_ds = load_from_disk('./drive/MyDrive/gpt3mix_synthetic_data/synthetic_dataset_50')
synthetic_100_ds = load_from_disk('./drive/MyDrive/gpt3mix_synthetic_data/synthetic_dataset_100')

In [None]:
print("Text: {} \nLabel: {}".format(synthetic_10_ds['text'][5], idx2label[synthetic_10_ds['label'][5]]))

Text:  even if ur not into these kind of things u have to admit it's pretty cool  
Label: joy


In [None]:
print("Text: {} \nLabel: {}".format(synthetic_50_ds['text'][5], idx2label[synthetic_50_ds['label'][5]]))

Text:  i want to stop running and walk...but the fact that i'm still running is the real miracle  
Label: joy


In [None]:
print("Text: {} \nLabel: {}".format(synthetic_100_ds['text'][5], idx2label[synthetic_100_ds['label'][5]]))

Text:  i want a beer right now  
Label: anger


We see that GPT-3 has effectively generated very realistic samples. 👏👏👏

## 🚀 Model 

Here we define the model and the training pipeline. We will use [DistilBERT](https://arxiv.org/abs/1910.01108), a lightweight Transformer trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster, while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.

In [None]:
metric = load_metric("accuracy")

batch_size = 6
epochs = 20

run_dicts = [] # list of dicts to store both metrics and logs for all the experiment runs 

In [None]:
def compute_metrics(eval_pred):
    """
    Computes the accuracy of the model's predictions: the fraction of samples
    whose predicted class (argmax over the logits) matches the reference label.
    """

    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels) 


class LogAccumulatorCallback(TrainerCallback):
    """
    A class that stores both the training and the evaluation loss
    """
    
    def __init__(self):
        self.acc_logs = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        _ = logs.pop("total_flos", None)
        if state.is_local_process_zero and ('loss' in logs or 'eval_loss' in logs):
            self.acc_logs.append(logs.copy())


def train_and_evaluate(train_ds, test_ds, identifier):
    def tokenize(batch):
        return tokenizer(batch['text'], padding=True, truncation=True)
    
    train_ds = train_ds.map(tokenize, batched=True, batch_size=len(train_ds), remove_columns=["text"])
    test_ds = test_ds.map(tokenize, batched=True, batch_size=len(test_ds), remove_columns=["text"])
    
    training_args = TrainingArguments(
        identifier,
        num_train_epochs=epochs,
        evaluation_strategy="epoch",
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        logging_strategy="epoch",
        weight_decay=0.01,
        learning_rate=2e-5,
    )
    
    model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-cased", num_labels=3)

    # Freeze the embeddings and the first three transformer layers.
    # With such a small training set, updating fewer weights usually reduces
    # overfitting; it also lowers memory usage and speeds up training.
    for param in model.distilbert.embeddings.parameters():
        param.requires_grad = False

    for i in [0, 1, 2]:
        for param in model.distilbert.transformer.layer[i].parameters():
            param.requires_grad = False

            
    logger = LogAccumulatorCallback()
    trainer = Trainer(
        model=model, args=training_args, 
        train_dataset=train_ds, 
        eval_dataset=test_ds,
        compute_metrics=compute_metrics,
        callbacks=[logger],
    )
    trainer.train()
    metrics = trainer.evaluate()
    
    return metrics, logger.acc_logs
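
For a three-class problem, the accuracy computed by `compute_metrics` is simply the share of samples whose highest-scoring class equals the reference label. A plain-Python sketch with made-up logits (the helper names are illustrative and are not used by the notebook):

```python
def argmax(values):
    """Index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up logits for four samples and three classes (joy, anger, surprise).
logits = [[2.0, 0.1, -1.0],   # predicts class 0 (joy)
          [0.3, 0.2, 1.5],    # predicts class 2 (surprise)
          [0.0, 0.9, 0.4],    # predicts class 1 (anger)
          [1.2, 0.7, 0.1]]    # predicts class 0 (joy)
labels = [0, 2, 1, 1]
print(accuracy_from_logits(logits, labels))  # 3 of 4 correct -> 0.75
```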

### Model baseline

In [None]:
# train our model on the baseline dataset without augmentation
metrics, logs = train_and_evaluate(emotion_train_ds, emotion_test_ds, "baseline")

run_dicts.append({
    "id": "baseline",
    "metrics": metrics,
    "logs": logs
})

Epoch,Training Loss,Validation Loss,Accuracy
1,1.1031,1.18981,0.063226
2,1.0978,1.172908,0.129032
3,1.0796,1.164873,0.144516
4,1.0623,1.150708,0.150968
5,1.0503,1.131751,0.179355
6,1.0029,1.128905,0.19871
7,1.0212,1.122984,0.255484
8,0.9558,1.127501,0.269677
9,0.9092,1.109547,0.322581
10,0.8745,1.101225,0.347097


***** Running Evaluation *****
  Num examples = 775
  Batch size = 6

### Model with augmented data

In [None]:
# train our model on the augmented dataset that contains 10 extra synthetic samples.
augmented_10_train_ds = concatenate_datasets([emotion_train_ds, synthetic_10_ds])
metrics, logs = train_and_evaluate(augmented_10_train_ds, emotion_test_ds, "augmented_10")

run_dicts.append({
    "id": "augmented_10",
    "metrics": metrics,
    "logs": logs
})

Epoch,Training Loss,Validation Loss,Accuracy
1,1.1027,1.16944,0.063226
2,1.101,1.163339,0.065806
3,1.0915,1.136249,0.183226
4,1.0558,1.120108,0.183226
5,1.0237,1.112106,0.267097
6,0.998,1.098689,0.332903
7,0.9378,1.075746,0.427097
8,0.8897,1.054886,0.454194
9,0.8293,1.042476,0.464516
10,0.7044,1.023852,0.464516


***** Running Evaluation *****
  Num examples = 775
  Batch size = 6

In [None]:
# train our model on the augmented dataset that contains 50 extra synthetic samples.
augmented_50_train_ds = concatenate_datasets([emotion_train_ds, synthetic_50_ds])
metrics, logs = train_and_evaluate(augmented_50_train_ds, emotion_test_ds, "augmented_50")

run_dicts.append({
    "id": "augmented_50",
    "metrics": metrics,
    "logs": logs
})

Epoch,Training Loss,Validation Loss,Accuracy
1,1.1211,1.146652,0.261935
2,1.1053,1.092461,0.261935
3,1.0562,1.042968,0.261935
4,1.0541,1.065799,0.265806
5,0.9722,1.085281,0.357419
6,0.866,1.055105,0.474839
7,0.7382,0.986825,0.565161
8,0.6243,1.053603,0.473548
9,0.4962,0.880032,0.609032
10,0.4112,0.989683,0.549677


***** Running Evaluation *****
  Num examples = 775
  Batch size = 6

In [None]:
# train our model on the augmented dataset that contains 100 extra synthetic samples.
augmented_100_train_ds = concatenate_datasets([emotion_train_ds, synthetic_100_ds])
metrics, logs = train_and_evaluate(augmented_100_train_ds, emotion_test_ds, "augmented_100")

run_dicts.append({
    "id": "augmented_100",
    "metrics": metrics,
    "logs": logs
})

Epoch,Training Loss,Validation Loss,Accuracy
1,1.053,1.119105,0.261935
2,1.0228,1.233681,0.261935
3,0.9781,1.117986,0.261935
4,0.8783,0.878676,0.745806
5,0.6912,0.850739,0.690323
6,0.5116,0.746332,0.75871
7,0.3866,0.657123,0.796129
8,0.3061,0.680348,0.748387
9,0.2638,0.940879,0.656774
10,0.1829,0.825558,0.682581


***** Running Evaluation *****
  Num examples = 775
  Batch size = 6

##  📊 Visualize

In [None]:
df = pd.DataFrame(run_dicts)

In [None]:
fig = go.Figure()

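for index, row in df.iterrows():
    # keep only the log entries that contain an evaluation accuracy
    eval_logs = pd.DataFrame(row['logs']).dropna(subset=['eval_accuracy'])

    fig.add_trace(go.Scatter(
                    # use the logged epoch rather than `n`, which is the number
                    # of synthetic samples and unrelated to the number of epochs
                    x=eval_logs['epoch'],
                    y=eval_logs['eval_accuracy'],
                    name='accuracy {}'.format(row['id'])))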

fig.update_xaxes(title_text='epoch')
fig.update_yaxes(title_text='accuracy')

fig.show()
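
The accuracy curve for one run can also be pulled out of the accumulated logs without pandas. This is a sketch under the assumption that every log entry carries an `epoch` key next to `eval_accuracy`, as `Trainer` logs normally do. `eval_accuracy_curve` is an illustrative helper, and the example entries only mimic the shape of the logs collected by `LogAccumulatorCallback`.

```python
def eval_accuracy_curve(logs):
    """Return (epochs, accuracies) from a list of Trainer-style log dicts."""
    points = [(entry["epoch"], entry["eval_accuracy"])
              for entry in logs
              if "eval_accuracy" in entry and "epoch" in entry]
    epochs = [epoch for epoch, _ in points]
    accuracies = [accuracy for _, accuracy in points]
    return epochs, accuracies


# Hypothetical log entries; the numbers are the first two baseline epochs reported above.
example_logs = [
    {"loss": 1.1031, "epoch": 1.0},
    {"eval_loss": 1.18981, "eval_accuracy": 0.063226, "epoch": 1.0},
    {"loss": 1.0978, "epoch": 2.0},
    {"eval_loss": 1.172908, "eval_accuracy": 0.129032, "epoch": 2.0},
]
epochs_seen, accuracies = eval_accuracy_curve(example_logs)
```

Taking the x-axis from the logs keeps the two series the same length, whatever number of epochs was configured.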

Our initial train set is balanced since it consists of 10 samples per class. Let's see how the distribution of the labels changes after applying random pair sampling and the GPT3Mix augmentation technique!

In [None]:
from plotly.subplots import make_subplots

fig = make_subplots(rows=2, cols=2,
                    subplot_titles=("Baseline", "Augmented-10", "Augmented-50", "Augmented-100"))

trace0 = go.Histogram(x=[idx2label[i] for i in emotion_train_ds["label"]],
                   opacity=0.8)

trace1 = go.Histogram(x=[idx2label[i] for i in augmented_10_train_ds["label"]],
                   opacity=0.8)

trace2 = go.Histogram(x=[idx2label[i] for i in augmented_50_train_ds["label"]],
                   opacity=0.8)

trace3 = go.Histogram(x=[idx2label[i] for i in augmented_100_train_ds["label"]],
                   opacity=0.8)

fig.append_trace(trace0, 1, 1)
fig.append_trace(trace1, 1, 2)
fig.append_trace(trace2, 2, 1)
fig.append_trace(trace3, 2, 2)
fig.update_layout(showlegend=False, title_text="Distribution of labels", 
                  bargap=0.30)

fig.show()

We observe that the distribution changes a lot and the largest augmented dataset is highly imbalanced! The GPT-3 model generated too many samples labeled as 'anger', while ideally we want a balanced train set.

That's an interesting observation that we should further examine in the future 😯.
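
One possible mitigation, not evaluated in this notebook, is to enforce a per-label quota while collecting generated samples. The sketch below is only an illustration of that idea: `label_counts`, `accept_sample`, the quota value and the skewed label stream are all hypothetical.

```python
from collections import Counter


def label_counts(labels, idx2label):
    """Count samples per label name, including labels with zero samples."""
    counts = Counter(idx2label[i] for i in labels)
    return {name: counts.get(name, 0) for name in idx2label.values()}


def accept_sample(label, counts, per_label_quota):
    """Accept a generated sample only while its label is still under quota."""
    if counts[label] >= per_label_quota:
        return False
    counts[label] += 1
    return True


idx2label = {0: "joy", 1: "anger", 2: "surprise"}
generated = [1, 1, 0, 1, 2, 1, 0, 1]  # hypothetical label stream skewed toward anger
counts = Counter()
kept = [label for label in generated if accept_sample(label, counts, per_label_quota=2)]
print(label_counts(kept, idx2label))
```

Rejected samples still cost API requests, so a quota trades extra generation calls for a more balanced train set.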

## 🏁 Take-aways 


You've reached the finish line! 👏  Let's sum up some of the findings.

* We managed to generate hyper-realistic synthetic samples using GPT3Mix, which indicates that it can be used as a text augmentation technique.
* As a baseline, we trained a DistilBERT model on the Emotion dataset using a small subset of 30 samples.
* Then we augmented the small dataset with 10, 50 and 100 extra samples generated by GPT-3.
* We compared the performance of the models in all these settings and found that data augmentation boosts performance.
* As we generate more synthetic samples and the training set grows, overall performance increases too: in the epochs reported above, test accuracy at epoch 10 rises from about 0.35 for the baseline to about 0.46, 0.55 and 0.68 with 10, 50 and 100 synthetic samples, although the curves are noisy with so little data.
* However, the augmented datasets are no longer balanced, because GPT-3 was more prone to generating 'anger' samples.




