# Data augmentation techniques for NLP

The size of the training set heavily impacts a model's performance. This notebook discusses how to apply data augmentation to an NLP dataset. Data augmentation techniques generate additional training samples from existing ones. They are already standard practice in computer vision projects, but can also be leveraged in multilingual NLP problems. We'll use a limited training set to simulate a real-world use case, where we are often constrained by the size of the available data. We'll focus on backtranslation and contextual word-embedding insertion as data augmentation techniques.

## 🛠️ Getting started

The cells below set up everything required to get started with data augmentation and fine-tuning an NLP model with the HuggingFace API.

### Setup

In [15]:
!pip install -q transformers sentencepiece datasets tokenizers nltk nlpaug 

### Imports

In [16]:
import re
import numpy as np
import pandas as pd 

import nltk
import nlpaug.flow as naf
import nlpaug.augmenter.word as naw
import plotly.graph_objects as go
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, TrainerCallback
from datasets import load_dataset, concatenate_datasets, load_from_disk, load_metric

### Download translation pipelines and tokenizer

In [17]:
trans_pipeline_en_nl = pipeline(
    task='translation_en_to_nl',
    model='Helsinki-NLP/opus-mt-en-nl',
    tokenizer='Helsinki-NLP/opus-mt-en-nl',
    device=0)
trans_pipeline_nl_fr = pipeline(
    task='translation_nl_to_fr',
    model='Helsinki-NLP/opus-mt-nl-fr',
    tokenizer='Helsinki-NLP/opus-mt-nl-fr',
    device=0)
trans_pipeline_fr_en = pipeline(
    task='translation_fr_to_en',
    model='Helsinki-NLP/opus-mt-fr-en',
    tokenizer='Helsinki-NLP/opus-mt-fr-en',
    device=0)
nltk.download('punkt')

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Download dataset
Since we're particularly interested in multilingual NLP, we'll use a well-known Dutch dataset, [DBRD](https://github.com/benjaminvdb/DBRD). The dataset contains over 110k book reviews along with associated binary sentiment polarity labels. The downstream task is to assign a sentiment to a book review.

In [18]:
max_input_len=128
book_review_ds = load_dataset("dbrd").filter(lambda e: len(tokenizer.batch_encode_plus([e['text']]).input_ids[0]) < int(max_input_len))

Reusing dataset dbrd (/root/.cache/huggingface/datasets/dbrd/plain_text/3.0.0/a454f53ccf247517cbb44e57f07904d4adefc5837d766f6120ff467ea7a465f7)
Loading cached processed dataset at /root/.cache/huggingface/datasets/dbrd/plain_text/3.0.0/a454f53ccf247517cbb44e57f07904d4adefc5837d766f6120ff467ea7a465f7/cache-85d1284acc1f5dc1.arrow



Token indices sequence length is longer than the specified maximum sequence length for this model (640 > 512). Running this sequence through the model will result in indexing errors
Loading cached processed dataset at /root/.cache/huggingface/datasets/dbrd/plain_text/3.0.0/a454f53ccf247517cbb44e57f07904d4adefc5837d766f6120ff467ea7a465f7/cache-4fa500eacb3cf7bd.arrow





## Data augmentation pipelines


### ㊗️ Backtranslation 
We'll use MarianMT models (the Helsinki-NLP OPUS-MT checkpoints loaded above) to perform backtranslation. The translated sentences should be similar in meaning but not structurally identical. The backtranslation process is as follows:

1.   Translate a Dutch book review into French
2.   Translate the resulting French text into English
3.   Translate the resulting English text back into Dutch
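
The chain of hops above can be sketched without downloading any model. This is a minimal sketch, assuming each translator behaves like a translation pipeline that returns one `{'translation_text': ...}` dict per input; `round_trip` and `make_fake_translator` are hypothetical helpers introduced only for illustration.

```python
from typing import Callable, Dict, List, Sequence

# A translator takes a list of texts and returns one dict per text,
# mirroring the [{'translation_text': ...}] shape of a translation pipeline.
Translator = Callable[[List[str]], List[Dict[str, str]]]


def round_trip(texts: List[str], hops: Sequence[Translator]) -> List[str]:
    """Pass texts through each translator in order and return the final texts."""
    current = list(texts)
    for translate in hops:
        current = [out["translation_text"] for out in translate(current)]
    return current


def make_fake_translator(tag: str) -> Translator:
    """Hypothetical stand-in that only appends a tag, so no model is needed."""
    def translate(batch: List[str]) -> List[Dict[str, str]]:
        return [{"translation_text": f"{text}>{tag}"} for text in batch]
    return translate


# Dutch -> French -> English -> Dutch, in the order of the steps above.
hops = [make_fake_translator("fr"), make_fake_translator("en"), make_fake_translator("nl")]
result = round_trip(["een mooi boek"], hops)
print(result)  # ['een mooi boek>fr>en>nl']
```

Passing the hops as a list makes it easy to try other intermediate languages by swapping in different translators.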

### ✨ Contextual word embedding replacements


The [nlpaug](https://github.com/makcedward/nlpaug) library combines frequently used augmentation techniques into a Python package. We'll use `ContextualWordEmbsAug`, which uses contextual word embeddings to find the top n most suitable words for augmentation. Here it is configured to insert new words rather than replace existing ones.
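
To build intuition for insertion-based augmentation, here is a toy sketch. It is not nlpaug's implementation: `suggest` is a hypothetical stand-in for a masked language model, and the rule for how many words to insert (the ceiling of `aug_p` times the token count) is an assumption made for this illustration.

```python
import math
import random
from typing import Callable, List


def insert_words(
    tokens: List[str],
    suggest: Callable[[List[str], int], str],
    aug_p: float = 0.2,
    seed: int = 0,
) -> List[str]:
    """Insert roughly aug_p * len(tokens) new words at random positions.

    `suggest(tokens, position)` is a hypothetical stand-in for a masked
    language model that proposes a word fitting at `position`.
    """
    rng = random.Random(seed)
    n_insert = max(1, math.ceil(aug_p * len(tokens))) if tokens else 0
    out = list(tokens)
    for _ in range(n_insert):
        position = rng.randint(0, len(out))  # any gap, including both ends
        out.insert(position, suggest(out, position))
    return out


def dummy_suggest(tokens: List[str], position: int) -> str:
    """Toy suggester: always proposes the same filler word."""
    return "echt"


original = "dit boek is mooi geschreven".split()
augmented = insert_words(original, dummy_suggest, aug_p=0.2)
print(" ".join(augmented))
```

A real contextual augmenter would choose each inserted word based on the surrounding sentence, which is what keeps the augmented review fluent.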




In [19]:
def back_translation_nl_fr_en_nl(texts):
    # Hop 1: Dutch -> French
    fr_texts = trans_pipeline_nl_fr(texts)
    # Hop 2: French -> English
    en_texts = trans_pipeline_fr_en([el['translation_text'] for el in fr_texts])
    # Hop 3: English -> Dutch
    nl_texts = trans_pipeline_en_nl([el['translation_text'] for el in en_texts])
    return [el['translation_text'] for el in nl_texts]

# Insert contextually fitting words with a Dutch RoBERTa model (aug_p=0.20)
aug = naf.Sequential([
    naw.ContextualWordEmbsAug(
        model_path='pdelobelle/robbert-v2-dutch-base',
        model_type='roberta',
        aug_p=0.20,
        action="insert")
])

# Dataset-level helpers: each returns a new dataset with a transformed 'text' column
replace_newline = lambda dataset: dataset.map(lambda x: {'text': x["text"].replace("\n",' ')}, batched=False)
contextual_emb_aug = lambda dataset: dataset.map(lambda x: {'text': aug.augment(x["text"])},  batch_size=10, batched=True)
backtranslate_dataset = lambda dataset: dataset.map(lambda x: {'text': back_translation_nl_fr_en_nl(x["text"])}, batch_size=10, batched=True)





In [20]:
book_review_train_ds = book_review_ds["train"].shuffle(seed=42).select(range(50))
book_review_train_ds_newline = replace_newline(book_review_train_ds)
book_review_train_ds_contemb = contextual_emb_aug(book_review_train_ds_newline)
book_review_train_ds_back = backtranslate_dataset(book_review_train_ds)
book_review_train_ds_contemb_back = backtranslate_dataset(book_review_train_ds_contemb)
book_review_test_ds = book_review_ds["test"]

Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/dbrd/plain_text/3.0.0/a454f53ccf247517cbb44e57f07904d4adefc5837d766f6120ff467ea7a465f7/cache-3bfb9571b4d14263.arrow






## Model 

In [21]:
metric = load_metric("accuracy")
batch_size = 8
epochs = 20
max_steps = epochs * int(((len(book_review_train_ds)*3)/batch_size)) 

run_dicts = [] # list of dicts to store both metrics and logs for all the experiment runs 
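
To make the step budget concrete, the arithmetic behind `max_steps` can be worked through as follows. The sketch assumes a single device and no gradient accumulation, so that one step consumes exactly `batch_size` samples; `passes_over_data` is a helper introduced here for illustration only.

```python
batch_size = 8
epochs = 20
n_original = 50  # size of the sampled training subset

# Same formula as the cell above: the budget assumes a 3x-sized training set.
max_steps = epochs * int((n_original * 3) / batch_size)
print(max_steps)  # 360


def passes_over_data(n_samples: int) -> float:
    """Approximate number of passes over a dataset of n_samples in max_steps."""
    return max_steps * batch_size / n_samples


print(passes_over_data(50))   # 57.6  (baseline: 50 reviews)
print(passes_over_data(100))  # 28.8  (original + augmented: 100 reviews)
```

Because the step budget is fixed, every run sees the same number of training batches, while the smaller baseline set is simply revisited more often.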

In [22]:
def compute_metrics(eval_pred):
    """
        Calculates the accuracy of the model's predictions, calculated as follows; (TP + TN) / (TP + TN + FP + FN) with TP: True positive TN: True negative FP: False positive FN: False negative
    """

    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels) 


class LogAccumulatorCallback(TrainerCallback):
    '''
    A class that stores both the training and the evaluation loss
    '''
    
    def __init__(self):
        self.acc_logs = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        _ = logs.pop("total_flos", None)
        if state.is_local_process_zero and ('loss' in logs or 'eval_loss' in logs):
            self.acc_logs.append(logs.copy())


def train_and_evaluate(train_ds, test_ds, identifier):
    def tokenize(batch):
        return tokenizer(batch['text'], padding=True, truncation=True)
    
    train_ds = train_ds.map(tokenize, batched=True, batch_size=len(train_ds))
    test_ds = test_ds.map(tokenize, batched=True, batch_size=len(test_ds))
    
    
    training_args = TrainingArguments(
        "trainer_args",
        evaluation_strategy="steps",
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        eval_steps=25,
        logging_steps=25,
        max_steps=max_steps,
        learning_rate=2e-5,
    )
    
    model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-multilingual-cased", num_labels=2)

    for block in model.distilbert.embeddings.modules():
        for param in block.parameters():
            param.requires_grad=False

    for i in [0,1,2]:
        for block in model.distilbert.transformer.layer[i].modules():
            for param in block.parameters():
                param.requires_grad=False

            
    logger = LogAccumulatorCallback()
    trainer = Trainer(
        model=model, args=training_args, 
        train_dataset=train_ds, 
        eval_dataset=test_ds,
        compute_metrics=compute_metrics,
        callbacks=[logger],
    )
    trainer.train()
    metrics = trainer.evaluate()
    
    return metrics, logger.acc_logs
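
The accuracy computed in `compute_metrics` boils down to taking the highest-scoring class per example and counting matches. A minimal pure-Python sketch of that idea, with made-up logits and labels used only for illustration:

```python
from typing import List, Sequence


def argmax(row: Sequence[float]) -> int:
    """Index of the largest value in a row of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(logits: List[Sequence[float]], labels: List[int]) -> float:
    """Fraction of rows whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up logits for four reviews with two classes (negative, positive).
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 0.5], [1.2, 0.9]]
toy_labels = [0, 1, 0, 0]
print(accuracy(toy_logits, toy_labels))  # 0.75
```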

### Model baseline

In [23]:
metrics, logs = train_and_evaluate(book_review_train_ds, book_review_test_ds, "baseline")

run_dicts.append({
    "id": "baseline",
    "metrics": metrics,
    "logs": logs
})

Loading cached processed dataset at /root/.cache/huggingface/datasets/dbrd/plain_text/3.0.0/a454f53ccf247517cbb44e57f07904d4adefc5837d766f6120ff467ea7a465f7/cache-b565fac39eedf3a8.arrow






Some weights of the model checkpoint at distilbert-base-multilingual-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['classifier.weight', 'pre_classif

| Step | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 25 | 0.6561 | 0.684543 | 0.576985 |
| 50 | 0.5269 | 0.684688 | 0.593193 |
| 75 | 0.2782 | 0.818827 | 0.604538 |
| 100 | 0.0664 | 1.157773 | 0.6094 |
| 125 | 0.0133 | 1.459125 | 0.615883 |
| 150 | 0.0057 | 1.546664 | 0.619125 |
| 175 | 0.004 | 1.7081 | 0.612642 |
| 200 | 0.003 | 1.699749 | 0.617504 |
| 225 | 0.0024 | 1.730965 | 0.619125 |
| 250 | 0.0022 | 1.79971 | 0.617504 |



### Model backtranslated

In [24]:
train_ds = concatenate_datasets([book_review_train_ds, book_review_train_ds_back])
metrics, logs = train_and_evaluate(train_ds, book_review_test_ds, "backtranslated")

run_dicts.append({
    "id": "backtranslated",
    "metrics": metrics,
    "logs": logs
})


| Step | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 25 | 0.6997 | 0.697789 | 0.444084 |
| 50 | 0.6368 | 0.652168 | 0.619125 |
| 75 | 0.4572 | 0.635084 | 0.646677 |
| 100 | 0.2224 | 0.806585 | 0.654781 |
| 125 | 0.0649 | 1.133113 | 0.65316 |
| 150 | 0.0113 | 1.3075 | 0.666126 |
| 175 | 0.0053 | 1.473922 | 0.654781 |
| 200 | 0.0039 | 1.515776 | 0.659643 |
| 225 | 0.0029 | 1.55109 | 0.661264 |
| 250 | 0.0027 | 1.628136 | 0.649919 |


### Model contextual word embedding insertions



In [25]:
train_ds = concatenate_datasets([book_review_train_ds, book_review_train_ds_contemb])

metrics, logs = train_and_evaluate(train_ds, book_review_test_ds, "contextual_embedding")

run_dicts.append({
    "id": "contextual_embedding",
    "metrics": metrics,
    "logs": logs
})


| Step | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 25 | 0.6887 | 0.685732 | 0.570502 |
| 50 | 0.6098 | 0.650225 | 0.619125 |
| 75 | 0.4006 | 0.653514 | 0.638574 |
| 100 | 0.1445 | 0.828715 | 0.667747 |
| 125 | 0.0353 | 1.092931 | 0.667747 |
| 150 | 0.009 | 1.261362 | 0.666126 |
| 175 | 0.0052 | 1.347244 | 0.669368 |
| 200 | 0.0037 | 1.410583 | 0.672609 |
| 225 | 0.003 | 1.443906 | 0.67423 |
| 250 | 0.0027 | 1.495462 | 0.670989 |


### Model backtranslated & contextual word embedding insertions

In [26]:
train_ds = concatenate_datasets([book_review_train_ds,  book_review_train_ds_contemb_back])

metrics, logs = train_and_evaluate(train_ds, book_review_test_ds, "backtranslated_contextual_embedding")

run_dicts.append({
    "id": "backtranslated_contextual_embedding",
    "metrics": metrics,
    "logs": logs
})


| Step | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 25 | 0.6894 | 0.687809 | 0.58671 |
| 50 | 0.6348 | 0.656128 | 0.60778 |
| 75 | 0.47 | 0.626431 | 0.638574 |
| 100 | 0.2382 | 0.71059 | 0.682334 |
| 125 | 0.0982 | 0.9314 | 0.670989 |
| 150 | 0.0411 | 1.080824 | 0.701783 |
| 175 | 0.0081 | 1.216855 | 0.683955 |
| 200 | 0.0052 | 1.299461 | 0.675851 |
| 225 | 0.0038 | 1.328563 | 0.690438 |
| 250 | 0.0049 | 1.391091 | 0.682334 |


## Visualize

In [27]:
df = pd.DataFrame(run_dicts)
df.head()

| | id | metrics | logs |
|---|---|---|---|
| 0 | baseline | {'eval_loss': 1.838724970817566, 'eval_accurac... | [{'loss': 0.6561, 'learning_rate': 1.861111111... |
| 1 | backtranslated | {'eval_loss': 1.7218648195266724, 'eval_accura... | [{'loss': 0.6997, 'learning_rate': 1.861111111... |
| 2 | contextual_embedding | {'eval_loss': 1.5411746501922607, 'eval_accura... | [{'loss': 0.6887, 'learning_rate': 1.861111111... |
| 3 | backtranslated_contextual_embedding | {'eval_loss': 1.4115041494369507, 'eval_accura... | [{'loss': 0.6894, 'learning_rate': 1.861111111... |


In [28]:
fig = go.Figure()


for index, row in df.iterrows():
    
    fig.add_trace(go.Scatter(
                    x=list(range(25,max_steps,25)),
                    y=pd.DataFrame(row['logs']).dropna(subset=['eval_accuracy'])['eval_accuracy'],
                    name='accuracy {}'.format(row['id'])))

fig.update_xaxes(title_text='step')
fig.update_yaxes(title_text='accuracy')

fig.show()
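
Besides plotting, the collected runs can be summarised numerically. The sketch below assumes each run dict looks like the entries in `run_dicts`, with a list of log dicts in which only evaluation entries carry an `eval_accuracy` key; `eval_accuracies` and `best_accuracy` are hypothetical helpers, not part of any library.

```python
from typing import Dict, List, Tuple


def eval_accuracies(logs: List[Dict[str, float]]) -> List[float]:
    """Keep only the entries that carry an evaluation accuracy."""
    return [entry["eval_accuracy"] for entry in logs if "eval_accuracy" in entry]


def best_accuracy(runs: List[dict]) -> List[Tuple[str, float]]:
    """Return (run id, highest eval accuracy) pairs, best run first."""
    summary = [(run["id"], max(eval_accuracies(run["logs"]))) for run in runs]
    return sorted(summary, key=lambda pair: pair[1], reverse=True)


# First two evaluation rows of two runs, copied from the tables above.
runs = [
    {"id": "baseline", "logs": [
        {"loss": 0.6561}, {"eval_loss": 0.684543, "eval_accuracy": 0.576985},
        {"loss": 0.5269}, {"eval_loss": 0.684688, "eval_accuracy": 0.593193}]},
    {"id": "backtranslated", "logs": [
        {"loss": 0.6997}, {"eval_loss": 0.697789, "eval_accuracy": 0.444084},
        {"loss": 0.6368}, {"eval_loss": 0.652168, "eval_accuracy": 0.619125}]},
]
print(best_accuracy(runs))
# [('backtranslated', 0.619125), ('baseline', 0.593193)]
```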

## Take-aways

We used backtranslation and contextual word embedding insertions to generate more training data and improve the model's performance. Comparing the runs, we can observe that training on an augmented dataset lets the model converge a bit faster and reach a higher accuracy. After 350 steps, the best-performing augmentation technique yields an accuracy of 69.2%, compared to 61.6% for the baseline.

In this notebook we only considered 3-hop backtranslation between Dutch, French and English, but you could also include other languages and more hops to generate even more samples. Aside from backtranslation, you could also try other text augmentation techniques such as synonym replacement, random insertion, random swap and random deletion.
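
Two of these alternatives, random swap and random deletion, are simple enough to sketch in plain Python. This is an illustrative sketch that assumes whitespace tokenisation and a seeded `random.Random`; the function names and default behaviour are choices made here, not a reference implementation.

```python
import random
from typing import List


def random_swap(tokens: List[str], n_swaps: int, rng: random.Random) -> List[str]:
    """Swap two randomly chosen positions, n_swaps times."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens: List[str], p: float, rng: random.Random) -> List[str]:
    """Drop each token with probability p, but always keep at least one."""
    if not tokens:
        return []
    kept = [tok for tok in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


rng = random.Random(42)
tokens = "het verhaal is spannend en goed geschreven".split()
print(random_swap(tokens, n_swaps=1, rng=rng))
print(random_deletion(tokens, p=0.2, rng=rng))
```

Both operations leave the label untouched, so they are best kept mild: aggressive deletion can remove exactly the words that carry the sentiment.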

One assumption we can make is that the pretrained model already represents data like the DBRD reviews well, such that training without data augmentation already yields good results.



