<img align="right" width="400" src="https://www.fhnw.ch/de/++theme++web16theme/assets/media/img/fachhochschule-nordwestschweiz-fhnw-logo.svg" alt="FHNW Logo">


# Data Augmentation using Back Translation

by Fabian Märki

## Summary
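The aim of this notebook is to show how Hugging Face translation models can be used for back translation: a text is translated into another language and back again, which yields paraphrases that can serve as additional training data.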

### Sources
- [Text Data Augmentation with Back Translation](https://amitness.com/back-translation/)
- [Faster batch translation](https://github.com/huggingface/transformers/issues/9994) with code example

### Libraries/Models
- [Hugging Face](https://huggingface.co)
- [Translation Models](https://huggingface.co/models?language=de&pipeline_tag=translation&sort=downloads&search=Helsinki-NLP) that can be used with this code

### Links
- [Enabling GPU on Google Colab](https://www.tutorialspoint.com/google_colab/google_colab_using_free_gpu.htm)

This notebook contains assignments: <font color='red'>Questions are written in red.</font>

<a href="https://colab.research.google.com/github/markif/2023_HS_DAS_NLP_Notebooks/blob/master/08_c_Transformers_Data_Augmentation_using_Back_Translation.ipynb">
  <img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In [1]:
%%capture

!pip install 'fhnw-nlp-utils>=0.8.0,<0.9.0'

from fhnw.nlp.utils.processing import parallelize_dataframe
from fhnw.nlp.utils.processing import is_iterable
from fhnw.nlp.utils.storage import download
from fhnw.nlp.utils.storage import save_dataframe
from fhnw.nlp.utils.storage import load_dataframe

import numpy as np
import pandas as pd

**Make sure that a GPU is available (see [here](https://www.tutorialspoint.com/google_colab/google_colab_using_free_gpu.htm))!!!**

In [2]:
from fhnw.nlp.utils.system import set_log_level
from fhnw.nlp.utils.system import system_info

set_log_level()
print(system_info())

OS name: posix
Platform name: Linux
Platform release: 5.15.0-48-generic
Python version: 3.6.9
CPU cores: 6
RAM: 31.12GB total and 23.46GB available
Tensorflow version: 2.5.1
GPU is available
GPU is a NVIDIA GeForce RTX 2070 with Max-Q Design with 8192MiB


In [3]:
%%time
download("https://drive.switch.ch/index.php/s/0hE8wO4FbfGIJld/download", "data/german_doctor_reviews_tokenized.parq")
data = load_dataframe("data/german_doctor_reviews_tokenized.parq")
data.shape

CPU times: user 7.69 s, sys: 1.55 s, total: 9.24 s
Wall time: 5.26 s


(350087, 10)

In [4]:
data.head(3)

Unnamed: 0,text_original,rating,text,label,sentiment,token_clean,text_clean,token_lemma,token_stem,token_clean_stopwords
0,Ich bin franzose und bin seit ein paar Wochen ...,2.0,Ich bin franzose und bin seit ein paar Wochen ...,positive,1,"[ich, bin, franzose, und, bin, seit, ein, paar...",ich bin franzose und bin seit ein paar wochen ...,"[franzose, seit, paar, wochen, muenchen, zahn,...","[franzos, seit, paar, woch, muench, ., zahn, s...","[franzose, seit, paar, wochen, muenchen, ., za..."
1,Dieser Arzt ist das unmöglichste was mir in me...,6.0,Dieser Arzt ist das unmöglichste was mir in me...,negative,-1,"[dieser, arzt, ist, das, unmöglichste, was, mi...",dieser arzt ist das unmöglichste was mir in me...,"[arzt, unmöglichste, leben, je, begegnen, unfr...","[arzt, unmog, leb, je, begegnet, unfreund, ,, ...","[arzt, unmöglichste, leben, je, begegnet, unfr..."
2,Hatte akute Beschwerden am Rücken. Herr Magura...,1.0,Hatte akute Beschwerden am Rücken. Herr Magura...,positive,1,"[hatte, akute, beschwerden, am, rücken, ., her...",hatte akute beschwerden am rücken . herr magur...,"[akut, beschwerden, rücken, magura, erste, arz...","[akut, beschwerd, ruck, ., magura, erst, arzt,...","[akute, beschwerden, rücken, ., magura, erste,..."


Drop the precomputed columns (they are derived from the text and will need to be recomputed for the augmented data).

In [5]:
data = data.drop(["token_clean", "token_lemma", "token_stem", "token_clean_stopwords", "text_clean"], axis=1)

In [6]:
data.head(3)

Unnamed: 0,text_original,rating,text,label,sentiment
0,Ich bin franzose und bin seit ein paar Wochen ...,2.0,Ich bin franzose und bin seit ein paar Wochen ...,positive,1
1,Dieser Arzt ist das unmöglichste was mir in me...,6.0,Dieser Arzt ist das unmöglichste was mir in me...,negative,-1
2,Hatte akute Beschwerden am Rücken. Herr Magura...,1.0,Hatte akute Beschwerden am Rücken. Herr Magura...,positive,1


Keep only the negative reviews (the class with fewer samples).

In [7]:
data_augm = data[data["label"] == "negative"]
data_augm.shape

(33022, 5)
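
To put the imbalance in numbers, the share of negative reviews can be derived from the two shapes shown above (350087 rows in total, 33022 of them negative). A minimal sketch that uses only these two counts:

```python
# Row counts taken from the outputs above
n_total = 350087     # data.shape[0]
n_negative = 33022   # data_augm.shape[0]

# Fraction of all reviews that carry the "negative" label
negative_share = n_negative / n_total
print(f"Share of negative reviews: {negative_share:.1%}")
```

Fewer than one in ten reviews is negative, which is why this class is the one being augmented.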

In [8]:
#data_augm = data_augm.reset_index(drop=True)
data_augm.head(3)

Unnamed: 0,text_original,rating,text,label,sentiment
1,Dieser Arzt ist das unmöglichste was mir in me...,6.0,Dieser Arzt ist das unmöglichste was mir in me...,negative,-1
13,1. Termin:<br />\n1 Stunde Wartezimmer + 2 min...,6.0,. Termin Stunde Wartezimmer minütige Behandlu...,negative,-1
19,"Eine sehr unfreundliche Ärztin, so etwas habe ...",6.0,"Eine sehr unfreundliche Ärztin, so etwas habe ...",negative,-1


In [9]:
%%capture

!pip install torch transformers sentencepiece mosestokenizer sacremoses

In [10]:
def get_compute_device():
    """Provides the device for the computation
    
    Returns
    -------
    str
        The GPU device with number (cuda:0) or cpu
    """
    
    import torch

    return "cuda:0" if torch.cuda.is_available() else "cpu"


def gpu_empty_cache():
    """Cleans the GPU cache which seems to fill up after a while
    
    """
    
    import torch
    
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

In [11]:
params = {
    "verbose": True,
    "shuffle": True,
    # modify batch_size in case you experience memory issues
    "batch_size": 8,
    "X_column_name": "text",
    "y_column_name": "label",
    "y_column_name_prediction": "translation",
    "last_stored_batch": -1,
    "store_path": "data/german_doctor_reviews_augmented_translated.parq"
}

### Back Translation

You might repeat the following steps for several languages (see [here](https://huggingface.co/models?language=de&pipeline_tag=translation&sort=downloads&search=Helsinki-NLP) for alternative models).

<font color='red'>**TASK: Try a different language by replacing `lang_to` with another from the [Helsinki-NLP/opus-mt-...](https://huggingface.co/models?language=de&pipeline_tag=translation&sort=downloads&search=Helsinki-NLP) list.**</font>
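
The Helsinki-NLP models follow the naming pattern `Helsinki-NLP/opus-mt-<source>-<target>`, so switching the pivot language means changing two model names at once. The sketch below builds both names for a few pivot languages. The helper `opus_mt_model_names` and the list of pivot codes are illustrative assumptions; whether a model exists for a given direction has to be checked in the model list linked above.

```python
def opus_mt_model_names(lang_from, lang_to):
    """Returns the (forward, backward) model names for a language pair (hypothetical helper)."""
    prefix = "Helsinki-NLP/opus-mt-"
    forward = f"{prefix}{lang_from}-{lang_to}"
    backward = f"{prefix}{lang_to}-{lang_from}"
    return forward, backward


# Example pivot languages (assumption: verify on the hub that both directions exist)
for pivot in ["es", "fr", "en"]:
    forward, backward = opus_mt_model_names("de", pivot)
    print(forward, "/", backward)
```

Back translation needs both directions, so a pivot language is only usable if both models are available.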

In [12]:
# replace values to load different translation models
lang_from = "de"
lang_to = "es"
compute_device = get_compute_device()

from transformers import MarianMTModel, MarianTokenizer
orig2dest_model_name = "Helsinki-NLP/opus-mt-"+lang_from+"-"+lang_to
orig2dest_tokenizer = MarianTokenizer.from_pretrained(orig2dest_model_name)
orig2dest_model = MarianMTModel.from_pretrained(orig2dest_model_name).to(compute_device)
dest2orig_model_name = "Helsinki-NLP/opus-mt-"+lang_to+"-"+lang_from
dest2orig_tokenizer = MarianTokenizer.from_pretrained(dest2orig_model_name)
dest2orig_model = MarianMTModel.from_pretrained(dest2orig_model_name).to(compute_device)

#from transformers import FSMTForConditionalGeneration, FSMTTokenizer
#orig2dest_model_name = "facebook/wmt19-"+lang_from+"-"+lang_to
#orig2dest_tokenizer = FSMTTokenizer.from_pretrained(orig2dest_model_name)
#orig2dest_model = FSMTForConditionalGeneration.from_pretrained(orig2dest_model_name).to(compute_device)
#dest2orig_model_name = "facebook/wmt19-"+lang_to+"-"+lang_from
#dest2orig_tokenizer = FSMTTokenizer.from_pretrained(dest2orig_model_name)
#dest2orig_model = FSMTForConditionalGeneration.from_pretrained(dest2orig_model_name).to(compute_device)


<font color='red'>**TASK: Print the intermediate translations (i.e. decode the `tokenized_dest_texts`) in order to get an understanding of the *creative power* of the back translation (you might want to choose a language you understand).**</font>

In [13]:
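def back_translate(params, texts):
    """Back-translates texts (lang_from -> lang_to -> lang_from) with the models loaded above

    Parameters
    ----------
    params: dict
        The dictionary containing the parameters (currently unused)
    texts: list
        The texts to back-translate

    Returns
    -------
    list
        For each text, a list of distinct lower-cased back translations
    """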
    #tokenized_texts = orig2dest_tokenizer.prepare_seq2seq_batch(texts, return_tensors="pt").to(compute_device)
    tokenized_texts = orig2dest_tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(compute_device)
    back_translations = [set() for _ in range(len(texts))]

    # Translate texts to target language (e.g. Spanish) and back to source language (e.g. German)
    generate_kwargs = {"num_beams": 1, "do_sample": True, "num_return_sequences": 2}
    tokenized_dest_texts = orig2dest_model.generate(tokenized_texts["input_ids"], attention_mask=tokenized_texts["attention_mask"], top_p=0.7, **generate_kwargs)
    tokenized_source_texts = dest2orig_model.generate(tokenized_dest_texts, top_p=0.8, **generate_kwargs)
    
    # TODO: !!! place your code here !!!
    ####################################

    #for i, t in enumerate(tokenized_dest_texts):
    #    print(orig2dest_tokenizer.decode(t, skip_special_tokens=True).lower())
        
    ###################
    # TODO: !!! end !!!

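    # Decode and deduplicate back-translations and assign to original text indices
    # (each text yields num_return_sequences ** 2 candidates, i.e. 4 with the settings above)
    candidates_per_text = len(tokenized_source_texts) // len(texts)
    for i, t in enumerate(tokenized_source_texts):
        back_translations[i // candidates_per_text].add(dest2orig_tokenizer.decode(t, skip_special_tokens=True).lower())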

    # Remove back translations that are empty or equal to the original text
    return [[bt for bt in s if bt and bt != t] for s, t in zip(back_translations, map(str.lower, texts))]
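
Because both `generate` calls use `num_return_sequences=2`, each input text yields 2 × 2 = 4 candidates, and the flat output index is mapped back to the original text with an integer division by the number of candidates per text. The sketch below reproduces just this grouping step on plain strings. It assumes, as `back_translate` does, that the candidates arrive grouped by input text; the function `group_candidates` is a made-up name for illustration.

```python
def group_candidates(flat_candidates, n_texts):
    """Groups a flat list of candidates by the text they originate from and deduplicates them."""
    per_text = len(flat_candidates) // n_texts
    groups = [set() for _ in range(n_texts)]
    for i, candidate in enumerate(flat_candidates):
        # Candidates 0..per_text-1 belong to text 0, the next per_text to text 1, and so on
        groups[i // per_text].add(candidate)
    return groups


# Two texts with four candidates each; duplicates collapse inside the sets
flat = ["a1", "a2", "a3", "a1", "b1", "b2", "b2", "b3"]
groups = group_candidates(flat, n_texts=2)
print(groups)
```

Using sets means a text can end up with fewer than four back translations when the sampler produces the same sentence more than once.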

Give it a try...

In [14]:
back_translate(params, ["Hallo zusammen! Wie geht es euch heute?", "NLP ist grossartig, oder?"])

[['- hey, wie geht es ihnen?',
  'hey, wie geht es euch?',
  "wie geht's heute?",
  'wie geht es ihnen heute?'],
 ['die ntp ist doch super, oder?',
  '- ist das nlp ein großartiger ort?',
  'die nhp ist wirklich cool, oder?',
  '- ist die nlp ein großartiger ort?']]
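
The variety in this output comes from sampling: with `do_sample=True` and `top_p` set, the next token is drawn from the smallest set of most likely tokens whose cumulative probability reaches `top_p` (nucleus sampling). The toy sketch below illustrates that filtering step on a hand-made distribution. The function `top_p_filter` and the probabilities are made up for illustration and are not the library's implementation.

```python
def top_p_filter(probs, top_p):
    """Keeps the most likely tokens until their cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so that the kept probabilities sum to 1 again
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Hypothetical next-token distribution
next_token_probs = {"gut": 0.5, "super": 0.3, "toll": 0.15, "mies": 0.05}

narrow = top_p_filter(next_token_probs, top_p=0.4)
wider = top_p_filter(next_token_probs, top_p=0.7)
print(sorted(narrow), sorted(wider))
```

A larger `top_p` keeps more candidate tokens and therefore produces more diverse (and sometimes less faithful) translations, which is the trade-off behind the values 0.7 and 0.8 used in `back_translate`.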

In [15]:
def compute_predictions(params, data, predict_func):
    """Computes the actual predictions. Allows for recovery in case of a crash...

    Parameters
    ----------
    params: dict
        The dictionary containing the parameters
    data: dataframe
        The data
    predict_func: callable
        The function that computes the prediction

    Returns
    -------
    dataframe
        The data together with the predictions (one row per prediction)
    """
    import os
    from datetime import datetime
    
    from fhnw.nlp.utils.storage import save_dataframe
    from fhnw.nlp.utils.storage import load_dataframe
    
    verbose = params.get("verbose", False)
    batch_size = params.get("batch_size", 8)
    X_column_name = params.get("X_column_name", "text")
    y_column_name = params.get("y_column_name", "label")
    y_column_name_prediction = params.get("y_column_name_prediction", "prediction")
    store_every_n_elements = params.get("store_every_n_elements", 32768)
    store_path = params.get("store_path", "data/predictions.parq")
    last_stored_batch = params.get("last_stored_batch", -1)
    empty_gpu_cache = params.get("empty_gpu_cache", False)
    
    predictions = []
    
    # load stored data for recovery
    if last_stored_batch >= 0 or (last_stored_batch == -1 and os.path.exists(store_path)):
        predictions_loaded = load_dataframe(store_path)
        predictions = [row.to_dict() for index, row in predictions_loaded.iterrows()]
        
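        if last_stored_batch < 0:
            # NOTE: this estimate assumes one prediction per input row; if predict_func returns
            # several predictions per row (as back translation does), set "last_stored_batch" explicitly
            last_stored_batch = len(predictions) // batch_size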
            
        if verbose:
            print(datetime.now().time(), "Loaded batch:", last_stored_batch, " predictions: ", len(predictions))
         
    # do the predictions
    for g, df in data.groupby(np.arange(len(data)) // batch_size):
        if g >= last_stored_batch:
            # prevent OOM on GPU
            if empty_gpu_cache:
                gpu_empty_cache()
                
            predictions_batch = predict_func(params, df[X_column_name].to_list())
            
            # store the predictions together with the data
            i = 0
            for index, row in df.iterrows():
                # e.g. back translation might provide more than one translation per prediction
                if isinstance(predictions_batch[i], list):
                    for prediction in predictions_batch[i]:
                        row_dict = row.to_dict()
                        row_dict[y_column_name_prediction] = prediction
                        predictions.append(row_dict)
                else:
                    row_dict = row.to_dict()
                    row_dict[y_column_name_prediction] = predictions_batch[i]
                    predictions.append(row_dict)

                i += 1

                
            if (g + 1) % (store_every_n_elements // batch_size) == 0:
                if verbose:
                    print(datetime.now().time(), "Save batch:", str(g+1), ", processed elements:", str((g+1)*batch_size), ", total predictions:", len(predictions))

                save_dataframe(pd.DataFrame(predictions), store_path)

    if verbose:
        print(datetime.now().time(), "Prediction done. Batches:", str(data.shape[0] // batch_size), ", processed elements:", str(data.shape[0]), ", total predictions:", len(predictions))
    
    pred_data = pd.DataFrame(predictions)
    save_dataframe(pred_data, store_path)
    
    return pred_data
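
The recovery idea in `compute_predictions` (process in batches, persist intermediate results, skip finished batches after a restart) can be shown in isolation. The following is a simplified sketch under its own assumptions, not the notebook's implementation: it works on a plain list, stores a JSON file instead of a Parquet file, and saves the index of the next batch explicitly instead of deriving it from the number of stored predictions. All names are hypothetical.

```python
import json
import os
import tempfile


def process_in_batches(items, batch_size, predict_func, store_path, store_every_n_batches=2):
    """Applies predict_func batch by batch and resumes from a checkpoint file if one exists."""
    results, first_batch = [], 0
    if os.path.exists(store_path):
        with open(store_path) as f:
            state = json.load(f)
        results, first_batch = state["results"], state["next_batch"]

    n_batches = (len(items) + batch_size - 1) // batch_size
    for b in range(first_batch, n_batches):
        batch = items[b * batch_size:(b + 1) * batch_size]
        results.extend(predict_func(batch))
        # Persist periodically and after the final batch
        if (b + 1) % store_every_n_batches == 0 or b + 1 == n_batches:
            with open(store_path, "w") as f:
                json.dump({"results": results, "next_batch": b + 1}, f)
    return results


store = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
out = process_in_batches(list("abcdefg"), 3, lambda batch: [s.upper() for s in batch], store)
print(out)
```

Calling the function again with the same `store` path returns the stored results without invoking the prediction function, which is what makes a multi-hour run restartable.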

In [16]:
%%time

data_augm_translated = compute_predictions(params, data_augm, back_translate)

01:53:26.098258 Save batch: 4096 , processed elements: 32768 , total predictions: 130728
01:55:59.910435 Prediction done. Batches: 4127 , processed elements: 33022 , total predictions: 131741
CPU times: user 5h 31min 7s, sys: 38 s, total: 5h 31min 45s
Wall time: 5h 22min 41s


In [17]:
data_augm_translated.head(3)

Unnamed: 0,text_original,rating,text,label,sentiment,translation
0,Dieser Arzt ist das unmöglichste was mir in me...,6.0,Dieser Arzt ist das unmöglichste was mir in me...,negative,-1,dieser arzt ist die unmöglichste in meinem leb...
1,Dieser Arzt ist das unmöglichste was mir in me...,6.0,Dieser Arzt ist das unmöglichste was mir in me...,negative,-1,dieser arzt ist der unmöglichste in meinem leb...
2,Dieser Arzt ist das unmöglichste was mir in me...,6.0,Dieser Arzt ist das unmöglichste was mir in me...,negative,-1,"dieser arzt ist das unmöglichste, was sie je i..."


In [18]:
save_dataframe(data_augm_translated, "data/german_doctor_reviews_augmented_translated_"+lang_to+".parq")