<img align="right" width="400" src="https://www.fhnw.ch/de/++theme++web16theme/assets/media/img/fachhochschule-nordwestschweiz-fhnw-logo.svg" alt="FHNW Logo">


# Data Augmentation with Back Translation using Transformers

by Fabian Märki

## Summary
The aim of this notebook is to show how Hugging Face translation models can be used for back translation.

### Sources
- [Text Data Augmentation with Back Translation](https://amitness.com/back-translation/)
- [Faster batch translation](https://github.com/huggingface/transformers/issues/9994) with code example

### Libraries/Models
- [Hugging Face](https://huggingface.co)
- [Translation Models](https://huggingface.co/models?language=de&pipeline_tag=translation&sort=downloads&search=Helsinki-NLP) that can be used with this code

This notebook contains assignments: <font color='red'>Questions are written in red.</font>

<a href="https://colab.research.google.com/github/markif/2021_HS_CAS_NLP_LAB_Notebooks/blob/master/06_b_Augmentation_with_Back_Translation_using_Transformers.ipynb">
  <img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In [1]:
%%capture

!pip install 'fhnw-nlp-utils>=0.2.13,<0.3.0'

from fhnw.nlp.utils.processing import parallelize_dataframe
from fhnw.nlp.utils.processing import is_iterable
from fhnw.nlp.utils.storage import download
from fhnw.nlp.utils.storage import save_dataframe
from fhnw.nlp.utils.storage import load_dataframe

import numpy as np
import pandas as pd

**Make sure that a GPU is available (see [here](https://www.tutorialspoint.com/google_colab/google_colab_using_free_gpu.htm))!!!**

In [2]:
from fhnw.nlp.utils.system import system_info
print(system_info())

OS name: posix
Platform name: Linux
Platform release: 5.11.0-40-generic
Python version: 3.6.9
Tensorflow version: 2.5.1
GPU is available


In [3]:
%%time
download("https://drive.google.com/uc?id=19AFeVnOfX8WXU4_3rM7OFoNTWWog_sb_", "data/german_doctor_reviews_tokenized.parq")
data = load_dataframe("data/german_doctor_reviews_tokenized.parq")
data.shape

CPU times: user 7.63 s, sys: 1.41 s, total: 9.04 s
Wall time: 5.12 s


(350087, 10)

In [4]:
data.head(3)

Unnamed: 0,text_original,rating,text,label,sentiment,token_clean,text_clean,token_lemma,token_stem,token_clean_stopwords
0,Ich bin franzose und bin seit ein paar Wochen ...,2.0,Ich bin franzose und bin seit ein paar Wochen ...,positive,1,"[ich, bin, franzose, und, bin, seit, ein, paar...",ich bin franzose und bin seit ein paar wochen ...,"[franzose, seit, paar, wochen, muenchen, zahn,...","[franzos, seit, paar, woch, muench, ., zahn, s...","[franzose, seit, paar, wochen, muenchen, ., za..."
1,Dieser Arzt ist das unmöglichste was mir in me...,6.0,Dieser Arzt ist das unmöglichste was mir in me...,negative,-1,"[dieser, arzt, ist, das, unmöglichste, was, mi...",dieser arzt ist das unmöglichste was mir in me...,"[arzt, unmöglichste, leben, je, begegnen, unfr...","[arzt, unmog, leb, je, begegnet, unfreund, ,, ...","[arzt, unmöglichste, leben, je, begegnet, unfr..."
2,Hatte akute Beschwerden am Rücken. Herr Magura...,1.0,Hatte akute Beschwerden am Rücken. Herr Magura...,positive,1,"[hatte, akute, beschwerden, am, rücken, ., her...",hatte akute beschwerden am rücken . herr magur...,"[akut, beschwerden, rücken, magura, erste, arz...","[akut, beschwerd, ruck, ., magura, erst, arzt,...","[akute, beschwerden, rücken, ., magura, erste,..."


Drop the computed columns; they need to be re-computed for the augmented texts.

In [5]:
data = data.drop(["token_clean", "token_lemma", "token_stem", "token_clean_stopwords", "text_clean"], axis=1)

In [6]:
data.head(3)

Unnamed: 0,text_original,rating,text,label,sentiment
0,Ich bin franzose und bin seit ein paar Wochen ...,2.0,Ich bin franzose und bin seit ein paar Wochen ...,positive,1
1,Dieser Arzt ist das unmöglichste was mir in me...,6.0,Dieser Arzt ist das unmöglichste was mir in me...,negative,-1
2,Hatte akute Beschwerden am Rücken. Herr Magura...,1.0,Hatte akute Beschwerden am Rücken. Herr Magura...,positive,1


Keep only the negative reviews (the class with fewer samples).

In [8]:
data_augm = data[data["label"] == "negative"]
data_augm.shape

(33022, 5)

In [9]:
#data_augm = data_augm.reset_index(drop=True)
data_augm.head(3)

Unnamed: 0,text_original,rating,text,label,sentiment
1,Dieser Arzt ist das unmöglichste was mir in me...,6.0,Dieser Arzt ist das unmöglichste was mir in me...,negative,-1
13,1. Termin:<br />\n1 Stunde Wartezimmer + 2 min...,6.0,. Termin Stunde Wartezimmer minütige Behandlu...,negative,-1
19,"Eine sehr unfreundliche Ärztin, so etwas habe ...",6.0,"Eine sehr unfreundliche Ärztin, so etwas habe ...",negative,-1


In [10]:
%%capture

!pip install torch transformers sentencepiece mosestokenizer

In [11]:
def gpu_empty_cache():
    """Cleans the GPU cache which seems to fill up after a while
    
    """
        
    import torch
    import tensorflow as tf

    if tf.config.list_physical_devices("GPU"):
        torch.cuda.empty_cache()
    
def get_gpu_device_number():
    """Provides the number of the GPU device
    
    Returns
    -------
    int
        The GPU device number or -1 if none is installed
    """
        
    import tensorflow as tf
    
    return 0 if tf.config.list_physical_devices("GPU") else -1

def get_compute_device():
    """Provides the device for the computation
    
    Returns
    -------
    str
        The GPU device with number (cuda:0) or cpu
    """
        
    import tensorflow as tf
    
    return "cuda:0" if tf.config.list_physical_devices("GPU") else "cpu"
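The helpers above ask TensorFlow whether a GPU is present, although the translation models in this notebook run on PyTorch. As a minimal sketch, the same decision could be made with PyTorch alone. The function name `get_compute_device_torch` is hypothetical and not part of the notebook or of `fhnw-nlp-utils`; the fallback to `"cpu"` when PyTorch is missing is an assumption made for illustration.

```python
def get_compute_device_torch():
    """Return "cuda:0" if PyTorch sees a CUDA GPU, otherwise "cpu".

    Hypothetical helper for illustration, not part of the original notebook.
    """
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch there is nothing to place on a GPU.
        return "cpu"

    return "cuda:0" if torch.cuda.is_available() else "cpu"


print(get_compute_device_torch())
```

Asking the framework that actually runs the model avoids the case where TensorFlow sees the GPU but the installed PyTorch build does not.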

### Back Translation

You can repeat the following steps for several languages (see [here](https://huggingface.co/models?language=de&pipeline_tag=translation&sort=downloads&search=Helsinki-NLP) for alternative models).

<font color='red'>**TASK: Try a different language by replacing `lang_to` with another from the [Helsinki-NLP/opus-mt-...](https://huggingface.co/models?language=de&pipeline_tag=translation&sort=downloads&search=Helsinki-NLP) list.**</font>

In [12]:
# replace values to load different translation models
lang_from = "de"
lang_to = "es"
compute_device = get_compute_device()

from transformers import MarianMTModel, MarianTokenizer
orig2dest_model_name = "Helsinki-NLP/opus-mt-"+lang_from+"-"+lang_to
orig2dest_tokenizer = MarianTokenizer.from_pretrained(orig2dest_model_name)
orig2dest_model = MarianMTModel.from_pretrained(orig2dest_model_name).to(compute_device)
dest2orig_model_name = "Helsinki-NLP/opus-mt-"+lang_to+"-"+lang_from
dest2orig_tokenizer = MarianTokenizer.from_pretrained(dest2orig_model_name)
dest2orig_model = MarianMTModel.from_pretrained(dest2orig_model_name).to(compute_device)

# alternative: Facebook FSMT models (WMT19)
#from transformers import FSMTForConditionalGeneration, FSMTTokenizer
#orig2dest_model_name = "facebook/wmt19-"+lang_from+"-"+lang_to
#orig2dest_tokenizer = FSMTTokenizer.from_pretrained(orig2dest_model_name)
#orig2dest_model = FSMTForConditionalGeneration.from_pretrained(orig2dest_model_name).to(compute_device)
#dest2orig_model_name = "facebook/wmt19-"+lang_to+"-"+lang_from
#dest2orig_tokenizer = FSMTTokenizer.from_pretrained(dest2orig_model_name)
#dest2orig_model = FSMTForConditionalGeneration.from_pretrained(dest2orig_model_name).to(compute_device)

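Back translation needs a model for each direction, so both names follow from the language pair. The sketch below only builds the two model names in the same way as the cell above; the helper name `opus_mt_model_names` and the languages `en` and `fr` are assumptions for illustration. Whether a given pair is published in both directions still has to be checked in the model list.

```python
def opus_mt_model_names(lang_from, lang_to):
    """Build the forward and backward Helsinki-NLP model names for a language pair.

    Hypothetical helper; it only formats strings and does not check
    that the models exist on the Hugging Face hub.
    """
    prefix = "Helsinki-NLP/opus-mt-"
    forward = f"{prefix}{lang_from}-{lang_to}"
    backward = f"{prefix}{lang_to}-{lang_from}"
    return forward, backward


# "es" is the language used in this notebook; "en" and "fr" are example choices.
for lang in ["es", "en", "fr"]:
    print(opus_mt_model_names("de", lang))
```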

<font color='red'>**TASK: Print the intermediate translations (i.e. decode the `tokenized_dest_texts`) in order to get an understanding of the *creative power* of the back translation (you might want to choose a language you understand).**</font>

In [13]:
def back_translate_transformers(texts):
    #tokenized_texts = orig2dest_tokenizer.prepare_seq2seq_batch(texts, return_tensors="pt").to(compute_device)
    tokenized_texts = orig2dest_tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(compute_device)
    back_translations = [set() for _ in range(len(texts))]

    # Translate texts to target language (e.g. Spanish) and back to source language (e.g. German)
    generate_kwargs = {"num_beams": 1, "do_sample": True, "num_return_sequences": 2}
    tokenized_dest_texts = orig2dest_model.generate(tokenized_texts["input_ids"], attention_mask=tokenized_texts["attention_mask"], top_p=0.7, **generate_kwargs)
    tokenized_source_texts = dest2orig_model.generate(tokenized_dest_texts, top_p=0.8, **generate_kwargs)
    
    # TODO: !!! place your code here !!!
    ####################################
    
        
    ###################
    # TODO: !!! end !!!

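    # Decode and deduplicate back-translations and assign to original text indices.
    # Each text yields 2 x 2 = 4 sequences (num_return_sequences=2 in both directions),
    # so sequence i belongs to the original text i // 4.
    for i, t in enumerate(tokenized_source_texts):
        back_translations[i // 4].add(dest2orig_tokenizer.decode(t, skip_special_tokens=True).lower())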

    # Remove back translations that are empty or equal to the original text
    return [[bt for bt in s if bt and bt != t] for s, t in zip(back_translations, map(str.lower, texts))]
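The division by 4 in the function above follows from the two sampling stages: each input produces 2 translations, and each of those produces 2 back translations. The following sketch spells out that index arithmetic without any model. It assumes that the generated sequences are returned grouped by input, in input order; the function name `source_index` is hypothetical.

```python
def source_index(flat_index, n_forward=2, n_backward=2):
    """Map the position of a back-translated sequence to its original text.

    Assumes the outputs are ordered input by input, so every block of
    n_forward * n_backward consecutive sequences belongs to one text.
    """
    return flat_index // (n_forward * n_backward)


n_texts = 2
n_outputs = n_texts * 2 * 2  # 2 texts, 2 translations each, 2 back translations each
mapping = [source_index(i) for i in range(n_outputs)]
print(mapping)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

If `num_return_sequences` is changed in `generate_kwargs`, the divisor has to change with it.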

Give it a try...

In [14]:
back_translate_transformers(["Hallo zusammen! Wie geht es euch heute?", "NLP ist grossartig, oder?"])

[['hallo zusammen. wie geht es euch heute?',
  'hallo, leute. wie geht es ihnen heute?',
  "- hallo, allerseits. wie geht's ihnen heute?",
  "hallo, wie geht's euch heute?"],
 ['die nlp ist ja cool, nicht wahr?',
  'die nlp ist doch toll, oder?',
  'nlp ist cool, oder?',
  'nip ist fantastisch, oder?']]
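The results above are lowercased, deduplicated, and never equal to the lowercased input. A minimal sketch of that filtering step on plain strings is shown below. The function name `filter_back_translations` is hypothetical, and the sorted return value is an addition for reproducible output (the notebook keeps the order of the underlying set).

```python
def filter_back_translations(candidates, original):
    """Lowercase, deduplicate, and drop empty or unchanged candidates."""
    original = original.lower()
    unique = {candidate.lower() for candidate in candidates}
    return sorted(c for c in unique if c and c != original)


original = "Hallo zusammen! Wie geht es euch heute?"
candidates = [
    "Hallo zusammen! Wie geht es euch heute?",  # same as the input: dropped
    "hallo zusammen. wie geht es euch heute?",  # punctuation differs: kept
    "Hallo zusammen. Wie geht es euch heute?",  # duplicate after lowercasing
    "",                                         # empty: dropped
]
print(filter_back_translations(candidates, original))
# ['hallo zusammen. wie geht es euch heute?']
```

Because the comparison is an exact string match, a candidate that differs only in punctuation still counts as a new sample.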

Do the actual back translation. The following code allows for recovery in case of a crash...

In [15]:
%%time
from datetime import datetime

batch_size = 5
save_every_n_elements = 50
translations = []
# to recover after a crash, set this to the batch index printed by the last "save" message
last_stored = -1 #8409

if last_stored >= 0:
    data_trans = load_dataframe("data/german_doctor_reviews_augmented_tmp.parq")
    translations = [row.to_dict() for index, row in data_trans.iterrows()]
    print("Loaded", len(translations))

for g, df in data_augm.groupby(np.arange(len(data_augm)) // batch_size):
    if g > last_stored:
        gpu_empty_cache()
        back_trans = back_translate_transformers(df["text_original"].to_list())
    
        # pair each row of the batch with its back translations
        for (index, row), row_translations in zip(df.iterrows(), back_trans):
            for trans in row_translations:
                row_dict = row.to_dict()
                row_dict["text"] = trans
                translations.append(row_dict)
    
        if (g + 1) % (save_every_n_elements // batch_size) == 0:
            print(datetime.now().time(), "save ", g, len(translations))
        
            save_dataframe(pd.DataFrame(translations), "data/german_doctor_reviews_augmented_tmp.parq")
            
    else:
        print("Skip", g)
              
    
save_dataframe(pd.DataFrame(translations), "data/german_doctor_reviews_augmented_tmp.parq")

13:48:01.349491 save  9 199
13:48:47.050349 save  19 399


KeyboardInterrupt: 
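The save messages at batch 9 and 19 follow from the checkpoint condition in the loop: with `batch_size = 5` and `save_every_n_elements = 50`, every tenth batch is saved. The sketch below reproduces only this bookkeeping without any translation. The helper names `is_checkpoint` and `batches_to_process` are hypothetical.

```python
batch_size = 5
save_every_n_elements = 50
batches_per_save = save_every_n_elements // batch_size  # 10 batches per checkpoint


def is_checkpoint(g):
    """True if the intermediate result is saved after batch g."""
    return (g + 1) % batches_per_save == 0


def batches_to_process(n_batches, last_stored=-1):
    """Batch indices that still need to be translated after a restart."""
    return [g for g in range(n_batches) if g > last_stored]


print([g for g in range(30) if is_checkpoint(g)])  # [9, 19, 29]
print(batches_to_process(25, last_stored=19))      # [20, 21, 22, 23, 24]
```

Setting `last_stored` to a batch index that was not actually saved would either repeat or skip batches, so the value should come from a printed save message.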

In [14]:
save_data = pd.DataFrame(translations)

In [15]:
save_data.head(3)

Unnamed: 0,text_original,rating,text,label,sentiment
0,Dieser Arzt ist das unmöglichste was mir in me...,6.0,"dieser arzt ist das unmöglichste, das ich je i...",negative,-1
1,Dieser Arzt ist das unmöglichste was mir in me...,6.0,"dieser arzt ist das unmöglichste, was ich jema...",negative,-1
2,Dieser Arzt ist das unmöglichste was mir in me...,6.0,dieser arzt ist am wenigsten unmöglich in mein...,negative,-1


In [16]:
save_dataframe(save_data, "data/german_doctor_reviews_augmented_translated_"+lang_to+".parq")

Load all back-translated texts and normalize the augmented data.

In [17]:
import glob
files = glob.glob("data/german_doctor_reviews*augmented_trans*_[a-z][a-z].parq")
print(files)

dataframes = []
for file in files:
    data_aug = load_dataframe(file)
    dataframes.append(data_aug)
    
data_aug = pd.concat(dataframes)

['data/german_doctor_reviews_augmented_translated_es.parq']
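The glob pattern is meant to pick up one file per target language (a two-letter suffix) while ignoring the temporary files. As a rough check, the pattern can be tried against file names with `fnmatch`, which uses the same wildcard syntax as `glob` for a single name. The `_fr` and `_fra` file names are hypothetical examples; the other names appear in this notebook.

```python
from fnmatch import fnmatchcase

pattern = "data/german_doctor_reviews*augmented_trans*_[a-z][a-z].parq"

candidates = [
    "data/german_doctor_reviews_augmented_translated_es.parq",   # written above
    "data/german_doctor_reviews_augmented_translated_fr.parq",   # hypothetical
    "data/german_doctor_reviews_augmented_translated_fra.parq",  # hypothetical
    "data/german_doctor_reviews_augmented_tmp.parq",             # recovery file
    "data/german_doctor_reviews_augmented_tokenized_tmp.parq",   # intermediate file
]

matches = [name for name in candidates if fnmatchcase(name, pattern)]
print(matches)
```

Only the `_es` and `_fr` names match: the suffix must consist of exactly two lowercase letters, and the temporary files do not contain `augmented_trans`.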


In [18]:
data_aug.shape

(131629, 5)
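The row count can be sanity-checked against the sampling setup. The arithmetic below assumes that the loaded file stems from a complete run over all 33022 negative reviews with at most 2 x 2 = 4 back translations per review; the gap to the upper bound is then the number of candidates removed as duplicates, empty strings, or copies of the original.

```python
n_negative = 33022    # rows in data_augm
n_augmented = 131629  # rows in data_aug
max_per_text = 2 * 2  # num_return_sequences in both directions

upper_bound = n_negative * max_per_text
print(upper_bound)                         # 132088
print(upper_bound - n_augmented)           # 459
print(round(n_augmented / n_negative, 2))  # 3.99
```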

In [19]:
data_aug.head(3)

Unnamed: 0,text_original,rating,text,label,sentiment
0,Dieser Arzt ist das unmöglichste was mir in me...,6.0,"dieser arzt ist das unmöglichste, das ich je i...",negative,-1
1,Dieser Arzt ist das unmöglichste was mir in me...,6.0,"dieser arzt ist das unmöglichste, was ich jema...",negative,-1
2,Dieser Arzt ist das unmöglichste was mir in me...,6.0,dieser arzt ist am wenigsten unmöglich in mein...,negative,-1


In [16]:
from fhnw.nlp.utils.normalize import tokenize
from fhnw.nlp.utils.normalize import tokenize_stem
from fhnw.nlp.utils.normalize import tokenize_lemma
from fhnw.nlp.utils.normalize import normalize
from fhnw.nlp.utils.text import clean_text
from fhnw.nlp.utils.text import join_tokens

In [21]:
!pip install 'spacy>=3.0.5'
!pip install nltk

import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
nltk.download('punkt')
nltk.download('stopwords')

import spacy
!python3 -m spacy download de_core_news_md

nlp = spacy.load("de_core_news_md")

stemmer = SnowballStemmer("german")
empty_stopwords = set()
stopwords = set(stopwords.words("german"))
n_cores = 4



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


2021-10-01 00:28:04.728499: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_lg')


In [22]:
%%time

# make sure we did not introduce malformed text
data_aug = data_aug.rename(columns={"text": "text_tmp"})
data_aug = parallelize_dataframe(data_aug, clean_text, n_cores=n_cores, field_read="text_tmp", field_write="text", keep_punctuation=True)
data_aug = data_aug.drop(columns=["text_tmp"], errors="ignore")

CPU times: user 725 ms, sys: 416 ms, total: 1.14 s
Wall time: 2.82 s


In [23]:
%%time
save_dataframe(data_aug, "data/german_doctor_reviews_augmented_tokenized_tmp.parq")

CPU times: user 6.1 s, sys: 112 ms, total: 6.21 s
Wall time: 6.2 s


In [24]:
%%time
data_aug = parallelize_dataframe(data_aug, normalize, n_cores=n_cores, field_read="text", field_write="token_clean", stopwords=empty_stopwords, stemmer=None, lemmanizer=None, lemma_with_ner=False)

CPU times: user 4.29 s, sys: 656 ms, total: 4.95 s
Wall time: 31 s


In [25]:
%%time
save_dataframe(data_aug, "data/german_doctor_reviews_augmented_tokenized_tmp.parq")

CPU times: user 10.3 s, sys: 204 ms, total: 10.5 s
Wall time: 10.5 s


In [26]:
%%time
data_aug = parallelize_dataframe(data_aug, join_tokens, n_cores=n_cores, field_read="token_clean", field_write="text_clean", stopwords=empty_stopwords)

CPU times: user 31.2 s, sys: 1.18 s, total: 32.4 s
Wall time: 33.7 s


In [27]:
%%time
save_dataframe(data_aug, "data/german_doctor_reviews_augmented_tokenized_tmp.parq")

CPU times: user 14.7 s, sys: 312 ms, total: 15 s
Wall time: 14.9 s


In [28]:
%%time
data_aug = parallelize_dataframe(data_aug, normalize, n_cores=n_cores, field_read="token_clean", field_write="token_lemma", stopwords=stopwords, stemmer=None, lemmanizer=nlp, lemma_with_ner=False)

CPU times: user 1min, sys: 6.34 s, total: 1min 6s
Wall time: 3min 47s


In [29]:
%%time
save_dataframe(data_aug, "data/german_doctor_reviews_augmented_tokenized_tmp.parq")

CPU times: user 15.9 s, sys: 308 ms, total: 16.2 s
Wall time: 16.2 s


In [30]:
%%time
data_aug = parallelize_dataframe(data_aug, normalize, n_cores=n_cores, field_read="token_clean", field_write="token_stem", stopwords=stopwords, stemmer=stemmer, lemmanizer=None, lemma_with_ner=False)

CPU times: user 33 s, sys: 1.59 s, total: 34.6 s
Wall time: 47.5 s


In [31]:
%%time
save_dataframe(data_aug, "data/german_doctor_reviews_augmented_tokenized_tmp.parq")

CPU times: user 18.4 s, sys: 384 ms, total: 18.7 s
Wall time: 18.7 s


In [32]:
%%time
data_aug = parallelize_dataframe(data_aug, normalize, n_cores=n_cores, field_read="token_clean", field_write="token_clean_stopwords", stopwords=stopwords, stemmer=None, lemmanizer=None, lemma_with_ner=False)

CPU times: user 34.2 s, sys: 1.45 s, total: 35.6 s
Wall time: 42.1 s


In [33]:
%%time
save_dataframe(data_aug, "data/german_doctor_reviews_augmented_tokenized_tmp.parq")

CPU times: user 20.8 s, sys: 396 ms, total: 21.1 s
Wall time: 21.1 s


In [34]:
# keep only rows with more than one lemma token
data_aug = data_aug[data_aug["token_lemma"].map(len) > 1]
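The filter above drops augmented rows whose lemma list has at most one entry. The same rule on plain Python data looks as follows; the rows are made-up examples and not taken from the dataset.

```python
# Hypothetical rows for illustration only.
rows = [
    {"text": "sehr unfreundlicher arzt", "token_lemma": ["unfreundlich", "arzt"]},
    {"text": "naja", "token_lemma": ["naja"]},
    {"text": "", "token_lemma": []},
]

# Keep rows with at least two lemma tokens.
kept = [row for row in rows if len(row["token_lemma"]) > 1]
print([row["text"] for row in kept])  # ['sehr unfreundlicher arzt']
```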

In [35]:
data_aug.head(3)

Unnamed: 0,text_original,rating,label,sentiment,text,token_clean,text_clean,token_lemma,token_stem,token_clean_stopwords
0,Dieser Arzt ist das unmöglichste was mir in me...,6.0,negative,-1,"dieser arzt ist das unmöglichste, das ich je i...","[dieser, arzt, ist, das, unmöglichste, ,, das,...","dieser arzt ist das unmöglichste , das ich je ...","[arzt, unmöglichste, je, leben, triefen, böswi...","[arzt, unmog, ,, je, leb, getroff, ,, boswill,...","[arzt, unmöglichste, ,, je, leben, getroffen, ..."
1,Dieser Arzt ist das unmöglichste was mir in me...,6.0,negative,-1,"dieser arzt ist das unmöglichste, was ich jema...","[dieser, arzt, ist, das, unmöglichste, ,, was,...","dieser arzt ist das unmöglichste , was ich jem...","[arzt, unmöglichste, jemals, leben, kennen, ve...","[arzt, unmog, ,, jemal, leb, kannt, ,, versaut...","[arzt, unmöglichste, ,, jemals, leben, kannte,..."
2,Dieser Arzt ist das unmöglichste was mir in me...,6.0,negative,-1,dieser arzt ist am wenigsten unmöglich in mein...,"[dieser, arzt, ist, am, wenigsten, unmöglich, ...",dieser arzt ist am wenigsten unmöglich in mein...,"[arzt, wenig, unmöglich, leben, finden, unfreu...","[arzt, wenig, unmog, leb, find, ,, unfreund, ,...","[arzt, wenigsten, unmöglich, leben, finden, ,,..."
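Judging from the table above, `text_clean` is the token list joined with spaces, and `token_clean_stopwords` is the token list without stopwords but with punctuation kept. The toy sketch below mimics these two columns for the first row. It is a stand-in with a tiny hand-picked stopword set, not the implementation of `normalize` or `join_tokens` in `fhnw-nlp-utils`.

```python
# Assumption: a tiny stopword set chosen for this example only.
TOY_STOPWORDS = {"dieser", "ist", "das", "ich"}


def remove_stopwords(tokens, stopwords):
    """Drop tokens contained in the stopword set, keep everything else."""
    return [token for token in tokens if token not in stopwords]


def join(tokens):
    """Join tokens with single spaces."""
    return " ".join(tokens)


token_clean = ["dieser", "arzt", "ist", "das", "unmöglichste", ",", "das", "ich", "je"]
text_clean = join(token_clean)
token_clean_stopwords = remove_stopwords(token_clean, TOY_STOPWORDS)

print(text_clean)             # dieser arzt ist das unmöglichste , das ich je
print(token_clean_stopwords)  # ['arzt', 'unmöglichste', ',', 'je']
```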


In [37]:
#%%time
#save_dataframe(data_aug, "data/german_doctor_reviews_augmented_tokenized.parq")