# 📈 Transformer-based Data Augmentation in NLP 

The size of the training set heavily affects the performance of a model. This notebook looks into the possibilities of performing data augmentation on an NLP dataset, that is, using techniques that generate additional training samples 🥷. 

Data augmentation is already standard practice in computer vision projects 👌, but it can also be leveraged in multilingual NLP problems. We'll use a limited training set to simulate a real-world use case, where we are often constrained by the size of the available data 🤦. 

We'll focus on two data augmentation techniques: back-translation and contextual word-embedding insertions 🤗.

## 🛠️ Getting started

The cells below set up everything that is required to get started with data augmentation and fine-tuning an NLP model with the HuggingFace API.

### Setup

In [None]:
!pip install -q transformers sentencepiece datasets tokenizers nltk nlpaug 

### Imports

In [None]:
import re
import numpy as np
import pandas as pd 

import nltk
import nlpaug.flow as naf
import nlpaug.augmenter.word as naw
import plotly.graph_objects as go
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, TrainerCallback
from datasets import load_dataset, concatenate_datasets, load_from_disk, load_metric

### Download dataset
Since we're particularly interested in multilingual NLP, we'll use a well-known Dutch dataset, [DBRD](https://github.com/benjaminvdb/DBRD). The dataset contains over 110k book reviews along with associated binary sentiment polarity labels. The downstream task is to assign a sentiment to a book review.  

In [None]:
max_input_len=128
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")
book_review_ds = load_dataset("dbrd").filter(lambda e: len(tokenizer.batch_encode_plus([e['text']]).input_ids[0]) < int(max_input_len))

In [None]:
# Limiting the size of the training dataset to simulate our low-data use case
book_review_train_ds = book_review_ds["train"].shuffle(seed=42).select(range(50))

book_review_test_ds = book_review_ds["test"] 

## Data augmentation pipelines

### ㊗️ Back-translation 
We'll use MarianMT models to perform the back-translations. The translated sentences should be similar in meaning but not structurally identical. The back-translation process is as follows:

1.   Translate a Dutch book review into French
2.   Translate the resulting French text into English
3.   Translate the resulting English text back into Dutch
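
To make the chaining in these steps concrete, here is a minimal sketch that passes a batch of texts through a list of translation hops. The helper names `back_translate` and `make_toy_translator` are hypothetical, and the word-by-word dictionaries are toy stand-ins for real translation models, so this only illustrates the control flow and not the translation quality.

```python
from typing import Callable, Dict, List, Sequence

# A "hop" is any function that translates a batch of texts into another language.
Translator = Callable[[List[str]], List[str]]


def back_translate(texts: Sequence[str], hops: Sequence[Translator]) -> List[str]:
    """Pass the texts through each translation hop in order."""
    current = list(texts)
    for hop in hops:
        current = hop(current)
    return current


def make_toy_translator(vocabulary: Dict[str, str]) -> Translator:
    """Hypothetical stand-in for a translation model: word-by-word lookup."""
    def translate(batch: List[str]) -> List[str]:
        return [
            " ".join(vocabulary.get(word, word) for word in text.split())
            for text in batch
        ]
    return translate


# Toy vocabularies, for illustration only.
nl_to_fr = make_toy_translator({"mooi": "beau", "boek": "livre"})
fr_to_en = make_toy_translator({"beau": "nice", "livre": "book"})
en_to_nl = make_toy_translator({"nice": "leuk", "book": "boek"})

# Dutch -> French -> English -> Dutch
augmented = back_translate(["mooi boek"], [nl_to_fr, fr_to_en, en_to_nl])
print(augmented)  # ['leuk boek']
```

The round trip returns a sentence with the same meaning but a different wording, which is exactly the effect we hope to get from the real models.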

In [None]:
trans_pipeline_en_nl = pipeline(
    task='translation_en_to_nl',
    model='Helsinki-NLP/opus-mt-en-nl',
    tokenizer='Helsinki-NLP/opus-mt-en-nl',
    device=0)
trans_pipeline_nl_fr = pipeline(
    task='translation_nl_to_fr',
    model='Helsinki-NLP/opus-mt-nl-fr',
    tokenizer='Helsinki-NLP/opus-mt-nl-fr',
    device=0)
trans_pipeline_fr_en = pipeline(
    task='translation_fr_to_en',
    model='Helsinki-NLP/opus-mt-fr-en',
    tokenizer='Helsinki-NLP/opus-mt-fr-en',
    device=0)
nltk.download('punkt')

In [None]:
def back_translation_nl_fr_en_nl(texts):
    # Hop 1: Dutch -> French
    fr_texts = trans_pipeline_nl_fr(texts)
    # Hop 2: French -> English
    en_texts = trans_pipeline_fr_en([el['translation_text'] for el in fr_texts])
    # Hop 3: English -> Dutch
    nl_texts = trans_pipeline_en_nl([el['translation_text'] for el in en_texts])
    return [el['translation_text'] for el in nl_texts]

# Apply the back-translation to the 'text' column in batches of 10 reviews
backtranslate_dataset = lambda dataset: dataset.map(lambda x: {'text': back_translation_nl_fr_en_nl(x["text"])}, batch_size=10, batched=True)

In [None]:
# Back-translate the training dataset
book_review_train_ds_back = backtranslate_dataset(book_review_train_ds)

### ✨ Contextual word embedding insertions


The [nlpaug](https://github.com/makcedward/nlpaug) library combines frequently used augmentation techniques into a Python package. We'll use the `ContextualWordEmbsAug` component, which uses contextual word embeddings to find the top-n similar words for augmentation. 

The contextual embeddings are retrieved from RobBERT, a transformer-based pretrained RoBERTa model that was trained on the Dutch section of the [OSCAR](https://oscar-corpus.com/) corpus. The word embeddings depend on the surrounding words, which define the **context** of the embedding.  
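
To build some intuition for what an insertion augmenter does, here is a toy sketch: pick random gaps in the sentence and ask a predictor for a word that fits each gap. The function `insert_words`, the `toy_predictor` and the rounding rule for the number of insertions are assumptions made for illustration; they are not nlpaug's actual implementation, where a masked language model proposes the words.

```python
import random
from typing import Callable, List


def insert_words(
    text: str,
    predict_word: Callable[[List[str], int], str],
    aug_p: float = 0.2,
    seed: int = 0,
) -> str:
    """Insert roughly aug_p * len(words) new words at random positions.

    predict_word(words, position) stands in for a masked language model
    that proposes a word for the gap at `position`.
    """
    rng = random.Random(seed)
    words = text.split()
    # Simplified rule (an assumption): at least one insertion per text.
    n_insertions = max(1, round(aug_p * len(words)))
    for _ in range(n_insertions):
        # A gap can be before, between, or after the existing words.
        position = rng.randint(0, len(words))
        words.insert(position, predict_word(words, position))
    return " ".join(words)


def toy_predictor(words: List[str], position: int) -> str:
    """Hypothetical predictor that always proposes the same filler word."""
    return "echt"


original = "dit boek is heel goed geschreven en spannend"
augmented = insert_words(original, toy_predictor, aug_p=0.2)
print(augmented)
```

The original words keep their order; only new words are added, so the label of the review should still apply to the augmented text.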

In [None]:
aug = naf.Sequential([
    naw.ContextualWordEmbsAug(
        model_path='pdelobelle/robbert-v2-dutch-base',
        model_type='roberta',
        aug_p=0.20,
        action="insert")
])

replace_newline = lambda dataset: dataset.map(lambda x: {'text': x["text"].replace("\n",' ')}, batched=False)
contextual_emb_aug = lambda dataset: dataset.map(lambda x: {'text': aug.augment(x["text"])},  batch_size=10, batched=True)

In [None]:
# Removing newlines in the text and performing word insertions based on contextual word embeddings
book_review_train_ds_newline = replace_newline(book_review_train_ds)
book_review_train_ds_contemb = contextual_emb_aug(book_review_train_ds_newline)

### 🥷 Combination of both techniques
Digging deeper into our bag of tricks 🔥! 

This approach will combine both back-translation and contextual word embedding insertions as follows:

1.   Insert new words using the contextual word embeddings 
2.   Back-translate the augmented text


In [None]:
# Combination of both contextual word embedding insertion and back-translation
book_review_train_ds_contemb_back = backtranslate_dataset(book_review_train_ds_contemb)

## 🚀 Model 

In [None]:
metric = load_metric("accuracy")


batch_size = 8
epochs = 20
max_steps = epochs * int(((len(book_review_train_ds)*3)/batch_size)) 

run_dicts = [] # list of dicts to store both metrics and logs for all the experiment runs 
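
As a quick sanity check of the `max_steps` formula, the arithmetic can be worked out for our 50-sample training set. The factor 3 is taken from the cell as-is; the variable names `n_train` and `steps_per_epoch` are only introduced here for readability.

```python
n_train = 50      # size of the reduced training set
batch_size = 8
epochs = 20

# int() truncates, so 150 / 8 = 18.75 becomes 18 steps per epoch
steps_per_epoch = int((n_train * 3) / batch_size)
max_steps = epochs * steps_per_epoch  # 20 * 18

print(steps_per_epoch, max_steps)  # 18 360
```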

In [None]:
def compute_metrics(eval_pred):
    """
    Calculates the accuracy of the model's predictions as
    (TP + TN) / (TP + TN + FP + FN), where TP = true positives, TN = true negatives,
    FP = false positives and FN = false negatives.
    """

    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels) 


class LogAccumulatorCallback(TrainerCallback):
    '''
    A class that stores both the training and the evaluation loss
    '''
    
    def __init__(self):
        self.acc_logs = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        _ = logs.pop("total_flos", None)
        if state.is_local_process_zero and ('loss' in logs or 'eval_loss' in logs):
            self.acc_logs.append(logs.copy())


def train_and_evaluate(train_ds, test_ds, identifier):
    def tokenize(batch):
        return tokenizer(batch['text'], padding=True, truncation=True)
    
    train_ds = train_ds.map(tokenize, batched=True, batch_size=len(train_ds))
    test_ds = test_ds.map(tokenize, batched=True, batch_size=len(test_ds))
    
    
    training_args = TrainingArguments(
        identifier,
        evaluation_strategy="steps",
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        eval_steps=25,
        logging_steps=25,
        max_steps=max_steps,
        learning_rate=2e-5,
    )
    
    model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-multilingual-cased", num_labels=2)

    # Freeze the embeddings and the first three transformer layers, so that only the higher
    # layers and the classification head are retrained. The frozen layers keep the weights
    # they learned on a large pretraining dataset, which compensates for our small dataset.
    for param in model.distilbert.embeddings.parameters():
        param.requires_grad = False

    for i in [0, 1, 2]:
        for param in model.distilbert.transformer.layer[i].parameters():
            param.requires_grad = False

            
    logger = LogAccumulatorCallback()
    trainer = Trainer(
        model=model, args=training_args, 
        train_dataset=train_ds, 
        eval_dataset=test_ds,
        compute_metrics=compute_metrics,
        callbacks=[logger],
    )
    trainer.train()
    metrics = trainer.evaluate()
    
    return metrics, logger.acc_logs
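
The accuracy computed in `compute_metrics` boils down to taking the argmax of the logits and counting the matches with the labels. Here is a dependency-free sketch of that calculation; the helper names and the toy logits are made up for illustration, and the notebook itself relies on the `accuracy` metric from `datasets`.

```python
from typing import Sequence


def argmax(row: Sequence[float]) -> int:
    """Index of the largest value in a row of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(logits: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Fraction of samples whose predicted class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy example: four reviews, two classes (0 = negative, 1 = positive)
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 0.4], [1.2, 0.9]]
toy_labels = [0, 1, 0, 0]

print(accuracy(toy_logits, toy_labels))  # 0.75
```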

### Model baseline

In [None]:
metrics, logs = train_and_evaluate(book_review_train_ds, book_review_test_ds, "baseline")

run_dicts.append({
    "id": "baseline",
    "metrics": metrics,
    "logs": logs
})

Some weights of the model checkpoint at distilbert-base-multilingual-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['pre_classifier.bias', 'classifie

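| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 25 | 0.6687 | 0.696149 | 0.508914 |
| 50 | 0.5766 | 0.665577 | 0.602917 |
| 75 | 0.3384 | 0.771806 | 0.617504 |
| 100 | 0.0733 | 1.082392 | 0.643436 |
| 125 | 0.0126 | 1.401169 | 0.636953 |
| 150 | 0.0057 | 1.471007 | 0.646677 |
| 175 | 0.0042 | 1.56892 | 0.640194 |
| 200 | 0.0036 | 1.55487 | 0.656402 |
| 225 | 0.0028 | 1.622764 | 0.648298 |
| 250 | 0.0024 | 1.655677 | 0.654781 |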



### Model back-translated

In [None]:
train_ds = concatenate_datasets([book_review_train_ds, book_review_train_ds_back])
metrics, logs = train_and_evaluate(train_ds, book_review_test_ds, "backtranslated")

run_dicts.append({
    "id": "backtranslated",
    "metrics": metrics,
    "logs": logs
})

Loading cached processed dataset at /root/.cache/huggingface/datasets/dbrd/plain_text/3.0.0/a454f53ccf247517cbb44e57f07904d4adefc5837d766f6120ff467ea7a465f7/cache-1f91bf39ff4e7c25.arrow






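| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 25 | 0.6997 | 0.697789 | 0.444084 |
| 50 | 0.6368 | 0.652168 | 0.619125 |
| 75 | 0.4572 | 0.635084 | 0.646677 |
| 100 | 0.2224 | 0.806585 | 0.654781 |
| 125 | 0.0649 | 1.133113 | 0.65316 |
| 150 | 0.0113 | 1.3075 | 0.666126 |
| 175 | 0.0053 | 1.473922 | 0.654781 |
| 200 | 0.0039 | 1.515776 | 0.659643 |
| 225 | 0.0029 | 1.55109 | 0.661264 |
| 250 | 0.0027 | 1.628136 | 0.649919 |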


### Model contextual word embedding insertions



In [None]:
train_ds = concatenate_datasets([book_review_train_ds, book_review_train_ds_contemb])

metrics, logs = train_and_evaluate(train_ds, book_review_test_ds, "contextual_embedding")

run_dicts.append({
    "id": "contextual_embedding",
    "metrics": metrics,
    "logs": logs
})


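| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 25 | 0.6902 | 0.686494 | 0.575365 |
| 50 | 0.6004 | 0.647743 | 0.622366 |
| 75 | 0.3869 | 0.659535 | 0.643436 |
| 100 | 0.1567 | 0.823537 | 0.67423 |
| 125 | 0.0283 | 1.100565 | 0.677472 |
| 150 | 0.0092 | 1.273716 | 0.677472 |
| 175 | 0.0047 | 1.391002 | 0.682334 |
| 200 | 0.0034 | 1.441377 | 0.683955 |
| 225 | 0.0028 | 1.485793 | 0.675851 |
| 250 | 0.0025 | 1.515399 | 0.67423 |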


### Model back-translated & contextual word embedding insertions

In [None]:
train_ds = concatenate_datasets([book_review_train_ds,  book_review_train_ds_contemb_back])

metrics, logs = train_and_evaluate(train_ds, book_review_test_ds, "backtranslated_contextual_embedding")

run_dicts.append({
    "id": "backtranslated_contextual_embedding",
    "metrics": metrics,
    "logs": logs
})


| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 25 | 0.7019 | 0.703699 | 0.444084 |
| 50 | 0.6611 | 0.684089 | 0.560778 |
| 75 | 0.5537 | 0.623973 | 0.641815 |
| 100 | 0.3452 | 0.633191 | 0.669368 |
| 125 | 0.1649 | 0.789313 | 0.683955 |
| 150 | 0.0811 | 0.985083 | 0.682334 |
| 175 | 0.0166 | 1.121521 | 0.685575 |
| 200 | 0.0073 | 1.207161 | 0.687196 |
| 225 | 0.0049 | 1.271577 | 0.685575 |
| 250 | 0.0042 | 1.31436 | 0.685575 |


##  📊 Visualize

In [None]:
df = pd.DataFrame(run_dicts)
df.head()

In [None]:
fig = go.Figure()


for index, row in df.iterrows():
    
    fig.add_trace(go.Scatter(
                    x=list(range(25,max_steps,25)),
                    y=pd.DataFrame(row['logs']).dropna(subset=['eval_accuracy'])['eval_accuracy'],
                    name='accuracy {}'.format(row['id'])))

fig.update_xaxes(title_text='step')
fig.update_yaxes(title_text='accuracy')

fig.show()
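
The plot relies on the fact that the accumulated logs mix training entries (with a `loss` key) and evaluation entries (with an `eval_accuracy` key). Here is a small sketch of how the accuracy curve can be pulled out of such a list. The helper `accuracy_curve` is hypothetical, and it assumes exactly one evaluation every `eval_steps` steps, so an extra evaluation at the end of training would need separate handling. The toy values are the first two rows of the baseline run.

```python
from typing import Dict, List, Tuple


def accuracy_curve(
    logs: List[Dict[str, float]], eval_steps: int = 25
) -> Tuple[List[int], List[float]]:
    """Keep only the evaluation entries and pair them with their step number."""
    accuracies = [entry["eval_accuracy"] for entry in logs if "eval_accuracy" in entry]
    # Assumption: the i-th evaluation happened at step eval_steps * (i + 1).
    steps = [eval_steps * (i + 1) for i in range(len(accuracies))]
    return steps, accuracies


toy_logs = [
    {"loss": 0.6687},
    {"eval_loss": 0.696149, "eval_accuracy": 0.508914},
    {"loss": 0.5766},
    {"eval_loss": 0.665577, "eval_accuracy": 0.602917},
]

steps, accuracies = accuracy_curve(toy_logs)
print(steps, accuracies)  # [25, 50] [0.508914, 0.602917]
```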

## 🏁 Take-aways 


You've reached the finish line! 👏  Let's sum up some of the findings.

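* In the runs logged above, contextual word embedding insertions improved the test accuracy over the baseline (about 0.67 versus 0.65 at step 250), while back-translation on its own ended close to the baseline (about 0.65) 👌 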
* Creativity also helps! 🎨 The combination of both back-translation and contextual word embedding insertions achieved the highest accuracy (about 0.69 at step 250). 
* The goal is to use augmentation techniques that generate structurally different sentences while preserving the meaning.
* The data from the DBRD dataset was well represented by the pretrained model, so training without data augmentation techniques already yielded good results.

We used 3-hop back-translation between Dutch, French and English, but you could also include other languages and more hops to generate even more samples. 

You could also try out other text augmentation techniques such as: Synonym Replacement, Random Insertion, Random Swap, Random Deletion. 🕵️‍♂️
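
To give a feel for two of these techniques, here is a minimal sketch of Random Swap and Random Deletion on a tokenized sentence. The function names, the parameters `n_swaps` and `p`, and the rule of always keeping at least one word are assumptions chosen for this illustration.

```python
import random
from typing import List


def random_swap(words: List[str], n_swaps: int, rng: random.Random) -> List[str]:
    """Swap two randomly chosen positions, n_swaps times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words: List[str], p: float, rng: random.Random) -> List[str]:
    """Drop each word with probability p, but always keep at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


rng = random.Random(42)
review = "dit boek is heel goed geschreven".split()

swapped = random_swap(review, n_swaps=1, rng=rng)
deleted = random_deletion(review, p=0.2, rng=rng)

print(" ".join(swapped))
print(" ".join(deleted))
```

Unlike back-translation, these operations do not look at the meaning of the sentence, so keep the swap count and deletion probability small to avoid changing the sentiment of a review.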




