# Final Project

## TRAC-2 Data Augmentation: Back Translation

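In this notebook we use back translation to augment the TRAC-2 dataset: each English text is translated into a Romance language and then back into English, which yields a paraphrase with the same label.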

## Package imports

In [1]:
!pip install transformers==4.1.1 sentencepiece==0.1.94



In [2]:
!pip install mosestokenizer==1.1.0



In [1]:
# import the MarianMT model and tokenizer.
from transformers import MarianMTModel, MarianTokenizer

# import pandas and numpy
import pandas as pd
import numpy as np

## Models from English to Romance languages and back

In [4]:
# model that can translate from English to Romance languages
# this is a single model that can translate to any of the supported Romance languages

target_tokenizer = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-ROMANCE')
target_model = MarianMTModel.from_pretrained('Helsinki-NLP/opus-mt-en-ROMANCE')

Some weights of MarianMTModel were not initialized from the model checkpoint at Helsinki-NLP/opus-mt-en-ROMANCE and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
# initialize models that can translate Romance languages to English.

en_tokenizer = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-ROMANCE-en')
en_model = MarianMTModel.from_pretrained('Helsinki-NLP/opus-mt-ROMANCE-en')

Some weights of MarianMTModel were not initialized from the model checkpoint at Helsinki-NLP/opus-mt-ROMANCE-en and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Available languages

Below is the full list of available language codes. The most relevant are:

French (fr), Spanish (es), Italian (it), Portuguese (pt), Romanian (ro), Galician (gl), Sardinian (sc), Corsican (co), and many others, including regional variants of Spanish for Spain and Latin America.

In [6]:
# available languages
target_tokenizer.supported_language_codes

['>>fr<<',
 '>>es<<',
 '>>it<<',
 '>>pt<<',
 '>>pt_br<<',
 '>>ro<<',
 '>>ca<<',
 '>>gl<<',
 '>>pt_BR<<',
 '>>la<<',
 '>>wa<<',
 '>>fur<<',
 '>>oc<<',
 '>>fr_CA<<',
 '>>sc<<',
 '>>es_ES<<',
 '>>es_MX<<',
 '>>es_AR<<',
 '>>es_PR<<',
 '>>es_UY<<',
 '>>es_CL<<',
 '>>es_CO<<',
 '>>es_CR<<',
 '>>es_GT<<',
 '>>es_HN<<',
 '>>es_NI<<',
 '>>es_PA<<',
 '>>es_PE<<',
 '>>es_VE<<',
 '>>es_DO<<',
 '>>es_EC<<',
 '>>es_SV<<',
 '>>an<<',
 '>>pt_PT<<',
 '>>frp<<',
 '>>lad<<',
 '>>vec<<',
 '>>fr_FR<<',
 '>>co<<',
 '>>it_IT<<',
 '>>lld<<',
 '>>lij<<',
 '>>lmo<<',
 '>>nap<<',
 '>>rm<<',
 '>>scn<<',
 '>>mwl<<']
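
The codes above come wrapped in `>>` and `<<`, which is the prefix form placed at the start of each source sentence to select the target language. As a minimal sketch (the helper names `strip_code` and `add_language_prefix` are illustrative and not part of `transformers`, and the code list is a hand-copied subset of the output above), the following shows how a bare code such as `es` maps to that prefix and how an unsupported code can be caught before translating:

```python
# Hand-copied subset of target_tokenizer.supported_language_codes (see output above).
SUPPORTED_CODES = ['>>fr<<', '>>es<<', '>>it<<', '>>pt<<', '>>pt_BR<<', '>>ro<<', '>>sc<<', '>>co<<']


def strip_code(code):
    """Turn '>>es<<' into 'es'; leave already-bare codes unchanged."""
    return code[2:-2] if code.startswith('>>') and code.endswith('<<') else code


def add_language_prefix(text, language, supported=SUPPORTED_CODES):
    """Prepend the '>>xx<<' target-language token, failing early on unknown codes."""
    bare_codes = {strip_code(c) for c in supported}
    if language not in bare_codes:
        raise ValueError(f"unsupported target language: {language!r}")
    return f">>{language}<< {text}"


print(add_language_prefix('I would like to have a cup of coffee', 'es'))
```

In the notebook itself, the full list from `target_tokenizer.supported_language_codes` could be passed as `supported` instead of the hand-copied subset.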

## Helper functions to translate in batch

In [7]:
def translate(texts, model, tokenizer, language="fr"):
    # Prepare the text data into appropriate format for the model
    template = lambda text: f"{text}" if language == "en" else f">>{language}<< {text}"
    src_texts = [template(i) for i in texts]

    # Tokenize the texts
    encoded = tokenizer.prepare_seq2seq_batch(src_texts,return_tensors="pt")
    
    # Generate translation using model
    translated = model.generate(**encoded)

    # Convert the generated tokens indices back into text
    translated_texts = tokenizer.batch_decode(translated, skip_special_tokens=True)
    
    return translated_texts


In [8]:
def back_translate(texts, source_lang="en", target_lang="fr"):
    # Translate from the source language (English) to the target language
    target_texts = translate(texts, target_model, target_tokenizer, 
                             language=target_lang)

    # Translate from the target language back to the source language
    # (en_model only produces English, so source_lang is expected to be "en")
    back_translated_texts = translate(target_texts, en_model, en_tokenizer, 
                                      language=source_lang)
    
    return back_translated_texts
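
Both helpers accept a list, so several texts can go through the model in one call. A minimal sketch of chunking a long list into fixed-size batches follows; `chunked`, `back_translate_in_batches`, and `batch_size` are hypothetical names, and `translate_fn` stands in for `back_translate` so the example runs without downloading the models.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def back_translate_in_batches(texts, translate_fn, batch_size=16):
    """Apply `translate_fn` (a list -> list function) batch by batch, keeping order."""
    results = []
    for batch in chunked(list(texts), batch_size):
        results.extend(translate_fn(batch))
    return results


# Stand-in for back_translate so the sketch runs without the MarianMT models.
def fake_back_translate(batch):
    return [text.upper() for text in batch]


print(back_translate_in_batches(['a', 'b', 'c'], fake_back_translate, batch_size=2))
```

With the real models, `translate_fn` could be something like `lambda batch: back_translate(batch, source_lang="en", target_lang="es")`. Whether batching is actually faster depends on text lengths and available memory, so the batch size is worth tuning on a small sample first.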

In [9]:
# note that even a single text needs to be inside a list
single_sentence = ['I would like to have a cup of coffee']

aug_texts = back_translate(single_sentence, source_lang="en", target_lang="es")
print(aug_texts)


["I'd like a cup of coffee."]


## Play a little...

In [10]:
# example using back translation through Spanish

example_texts = ['I like this homework so much', 'this is not a great performance']

aug_texts = back_translate(example_texts, source_lang="en", target_lang="es")
print(aug_texts)

['I really like this assignment.', "It's not a great performance."]


In [11]:
# example using back translation through French

aug_texts = back_translate(example_texts, source_lang="en", target_lang="fr")

print(aug_texts)


['I love homework so much.', "It's not a great performance."]


In [12]:
# example using back translation through Italian

aug_texts = back_translate(example_texts, source_lang="en", target_lang="it")

print(aug_texts)

['I really like homework.', "This isn't a great performance."]


In [13]:
# example using back translation through Portuguese

aug_texts = back_translate(example_texts, source_lang="en", target_lang="pt")

print(aug_texts)

['I really like this housework.', "This isn't a great performance."]
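
In the examples above, the second sentence comes back with the same wording from Spanish and French, and again with the same wording from Italian and Portuguese, so different pivot languages can yield identical augmentations. A minimal sketch of dropping back-translations that repeat the original or each other follows; `unique_augmentations` is a hypothetical helper, and the case- and whitespace-insensitive comparison is an assumption about what counts as a duplicate.

```python
def unique_augmentations(original, candidates):
    """Keep candidates that differ from the original and from earlier candidates."""
    def normalize(text):
        # lower-case and collapse whitespace before comparing
        return ' '.join(text.lower().split())

    seen = {normalize(original)}
    kept = []
    for candidate in candidates:
        key = normalize(candidate)
        if key not in seen:
            seen.add(key)
            kept.append(candidate)
    return kept


# Back-translations of the second example sentence via es, fr, it, and pt
# (copied from the outputs above).
candidates = [
    "It's not a great performance.",
    "It's not a great performance.",
    "This isn't a great performance.",
    "This isn't a great performance.",
]
print(unique_augmentations('this is not a great performance', candidates))
```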


## Load training data

In [14]:
# load all training data
train_data_a = pd.read_csv('../../../data/release-files/eng/trac2_eng_train_oversampled_task_A.csv')
train_data_b = pd.read_csv('../../../data/release-files/eng/trac2_eng_train_oversampled_task_B.csv')

In [15]:
train_data_a.head()

| Unnamed: 0 | ID | Text | Sub-task A | Sub-task B |
|---|---|---|---|---|
| 0 | C45.451 | Next part | NAG | NGEN |
| 1 | C47.11 | Iii8mllllllm\nMdxfvb8o90lplppi0005 | NAG | NGEN |
| 2 | C33.79 | 🤣🤣😂😂🤣🤣🤣😂osm vedio ....keep it up...make more v... | NAG | NGEN |
| 3 | C4.1961 | What the fuck was this? I respect shwetabh and... | NAG | NGEN |
| 4 | C10.153 | Concerned authorities should bring arundathi R... | NAG | NGEN |


In [16]:
train_data_b.head()

| Unnamed: 0 | ID | Text | Sub-task A | Sub-task B |
|---|---|---|---|---|
| 0 | C45.451 | Next part | NAG | NGEN |
| 1 | C47.11 | Iii8mllllllm\nMdxfvb8o90lplppi0005 | NAG | NGEN |
| 2 | C33.79 | 🤣🤣😂😂🤣🤣🤣😂osm vedio ....keep it up...make more v... | NAG | NGEN |
| 3 | C4.1961 | What the fuck was this? I respect shwetabh and... | NAG | NGEN |
| 4 | C10.153 | Concerned authorities should bring arundathi R... | NAG | NGEN |


In [40]:
# create a dataframe for each class
# translation takes a long time; splitting by class lets us save partial results in case the kernel dies
# .copy() makes each subset independent, so adding columns later does not trigger SettingWithCopyWarning
train_data_a_NAG = train_data_a[train_data_a['Sub-task A'] == 'NAG'].copy()
train_data_a_CAG = train_data_a[train_data_a['Sub-task A'] == 'CAG'].copy()
train_data_a_OAG = train_data_a[train_data_a['Sub-task A'] == 'OAG'].copy()

train_data_b_NGEN = train_data_b[train_data_b['Sub-task B'] == 'NGEN'].copy()
train_data_b_GEN = train_data_b[train_data_b['Sub-task B'] == 'GEN'].copy()
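
Splitting by class gives one checkpoint per class and language. A finer-grained variant, sketched below, appends each processed chunk to a CSV file and skips rows already written when the cell is rerun. The names `translate_with_checkpoint` and `chunk_size` are hypothetical, `translate_fn` stands in for `back_translate`, and the sketch assumes the order of the input texts is the same on every run.

```python
import csv
import os
import tempfile


def translate_with_checkpoint(texts, translate_fn, out_path, chunk_size=100):
    """Translate `texts` chunk by chunk, appending to `out_path` and resuming if it exists."""
    done = 0
    if os.path.exists(out_path):
        with open(out_path, newline='', encoding='utf-8') as f:
            done = sum(1 for _ in csv.reader(f))  # rows already translated

    with open(out_path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for start in range(done, len(texts), chunk_size):
            chunk = texts[start:start + chunk_size]
            for original, translated in zip(chunk, translate_fn(chunk)):
                writer.writerow([original, translated])
            f.flush()  # hand the finished chunk to the OS before starting the next one
    return out_path


# Demo with a stand-in translator; a real run would pass back_translate instead.
demo_path = os.path.join(tempfile.mkdtemp(), 'checkpoint.csv')
texts = ['one', 'two', 'three']
translate_with_checkpoint(texts[:2], lambda batch: [t.upper() for t in batch], demo_path)
# Second call simulates a rerun after an interruption: only 'three' is translated.
translate_with_checkpoint(texts, lambda batch: [t.upper() for t in batch], demo_path)

with open(demo_path, newline='', encoding='utf-8') as f:
    print(list(csv.reader(f)))
```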


## Data augmentation for task-A

### Class NAG

In [None]:
# translation to Spanish and back to English
train_data_a_NAG['es_trans'] = train_data_a_NAG['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='es')[0])
train_data_a_NAG.to_csv('../../../data/augm/taskA-NAG-es.csv', index=False)

In [None]:
# translation to French and back to English
train_data_a_NAG['fr_trans'] = train_data_a_NAG['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='fr')[0])
train_data_a_NAG.to_csv('../../../data/augm/taskA-NAG-fr.csv', index=False)

In [None]:
# translation to Italian and back to English
train_data_a_NAG['it_trans'] = train_data_a_NAG['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='it')[0])
train_data_a_NAG.to_csv('../../../data/augm/taskA-NAG-it.csv', index=False)

### Class CAG

In [None]:
# translation to Spanish and back to English
train_data_a_CAG['es_trans'] = train_data_a_CAG['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='es')[0])
train_data_a_CAG.to_csv('../../../data/augm/taskA-CAG-es.csv', index=False)

In [None]:
# translation to French and back to English
train_data_a_CAG['fr_trans'] = train_data_a_CAG['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='fr')[0])
train_data_a_CAG.to_csv('../../../data/augm/taskA-CAG-fr.csv', index=False)

In [None]:
# translation to Italian and back to English
train_data_a_CAG['it_trans'] = train_data_a_CAG['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='it')[0])
train_data_a_CAG.to_csv('../../../data/augm/taskA-CAG-it.csv', index=False)

### Class OAG

In [None]:
# translation to Spanish and back to English
train_data_a_OAG['es_trans'] = train_data_a_OAG['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='es')[0])
train_data_a_OAG.to_csv('../../../data/augm/taskA-OAG-es.csv', index=False)

In [None]:
# translation to French and back to English
train_data_a_OAG['fr_trans'] = train_data_a_OAG['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='fr')[0])
train_data_a_OAG.to_csv('../../../data/augm/taskA-OAG-fr.csv', index=False)

In [None]:
# translation to Italian and back to English
train_data_a_OAG['it_trans'] = train_data_a_OAG['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='it')[0])
train_data_a_OAG.to_csv('../../../data/augm/taskA-OAG-it.csv', index=False)

## Data augmentation for task-B

### Class NGEN

In [None]:
# translation to Spanish and back to English
train_data_b_NGEN['es_trans'] = train_data_b_NGEN['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='es')[0])
train_data_b_NGEN.to_csv('../../../data/augm/taskB-NGEN-es.csv', index=False)

In [None]:
# translation to French and back to English
train_data_b_NGEN['fr_trans'] = train_data_b_NGEN['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='fr')[0])
train_data_b_NGEN.to_csv('../../../data/augm/taskB-NGEN-fr.csv', index=False)

In [None]:
# translation to Italian and back to English
train_data_b_NGEN['it_trans'] = train_data_b_NGEN['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='it')[0])
train_data_b_NGEN.to_csv('../../../data/augm/taskB-NGEN-it.csv', index=False)

### Class GEN

In [None]:
# translation to Spanish and back to English
train_data_b_GEN['es_trans'] = train_data_b_GEN['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='es')[0])
train_data_b_GEN.to_csv('../../../data/augm/taskB-GEN-es.csv', index=False)

In [None]:
# translation to French and back to English
train_data_b_GEN['fr_trans'] = train_data_b_GEN['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='fr')[0])
train_data_b_GEN.to_csv('../../../data/augm/taskB-GEN-fr.csv', index=False)

In [None]:
# translation to Italian and back to English
train_data_b_GEN['it_trans'] = train_data_b_GEN['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='it')[0])
train_data_b_GEN.to_csv('../../../data/augm/taskB-GEN-it.csv', index=False)

## Import data back to generate the augmented dataset

### Task A

In [12]:
df0 = pd.read_csv('../../../data/files_back_translation/taskA-NAG-es.csv')[['Text', 'Sub-task A']]
df1 = pd.read_csv('../../../data/files_back_translation/taskA-NAG-es.csv')[['es_trans', 'Sub-task A']]
df2 = pd.read_csv('../../../data/files_back_translation/taskA-NAG-it.csv')[['it_trans', 'Sub-task A']]
df3 = pd.read_csv('../../../data/files_back_translation/taskA-NAG-fr.csv')[['fr_trans', 'Sub-task A']]

df4 = pd.read_csv('../../../data/files_back_translation/taskA-CAG-es.csv')[['Text', 'Sub-task A']]
df5 = pd.read_csv('../../../data/files_back_translation/taskA-CAG-es.csv')[['es_trans', 'Sub-task A']]
df6 = pd.read_csv('../../../data/files_back_translation/taskA-CAG-it.csv')[['it_trans', 'Sub-task A']]
df7 = pd.read_csv('../../../data/files_back_translation/taskA-CAG-fr.csv')[['fr_trans', 'Sub-task A']]

df8 = pd.read_csv('../../../data/files_back_translation/taskA-OAG-es.csv')[['Text', 'Sub-task A']]
df9 = pd.read_csv('../../../data/files_back_translation/taskA-OAG-es.csv')[['es_trans', 'Sub-task A']]
df10 = pd.read_csv('../../../data/files_back_translation/taskA-OAG-it.csv')[['it_trans', 'Sub-task A']]
df11 = pd.read_csv('../../../data/files_back_translation/taskA-OAG-fr.csv')[['fr_trans', 'Sub-task A']]

In [13]:
# rename columns

df1 = df1.rename(columns={'es_trans':'Text'})
df2 = df2.rename(columns={'it_trans':'Text'})
df3 = df3.rename(columns={'fr_trans':'Text'})

df5 = df5.rename(columns={'es_trans':'Text'})
df6 = df6.rename(columns={'it_trans':'Text'})
df7 = df7.rename(columns={'fr_trans':'Text'})

df9 = df9.rename(columns={'es_trans':'Text'})
df10 = df10.rename(columns={'it_trans':'Text'})
df11 = df11.rename(columns={'fr_trans':'Text'})

In [14]:
# concatenate dataframes
final_a = pd.concat([df0,df1,df2,df3,df4,df5,df6,df7,df8,df9,df10,df11], ignore_index=True)

# shuffle 
final_a = final_a.sample(frac=1)

# save
final_a.to_csv('../../../data/release-files/eng/trac2_eng_train_BT_task_A.csv')

In [15]:
final_a['Sub-task A'].value_counts()

OAG    13500
NAG    13500
CAG    13500
Name: Sub-task A, dtype: int64

In [16]:
final_a.shape

(40500, 2)
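
The load, rename, and concatenate steps above follow the same pattern for every class and language. A compact sketch of that pattern on an in-memory frame, using `pandas.concat`, is shown here. `stack_translations` is a hypothetical helper, the tiny frame is made-up illustration data, holding all translation columns in one frame is an assumption (the notebook reads them from separate files), and `random_state` is added only to make the shuffle repeatable.

```python
import pandas as pd


def stack_translations(df, label_col, text_cols=('Text', 'es_trans', 'it_trans', 'fr_trans')):
    """Stack the original and back-translated columns into one 'Text' column."""
    parts = [
        df[[col, label_col]].rename(columns={col: 'Text'})
        for col in text_cols
    ]
    return pd.concat(parts, ignore_index=True)


# Made-up rows for illustration only (not TRAC-2 data).
toy = pd.DataFrame({
    'Text': ['a', 'b'],
    'es_trans': ['a_es', 'b_es'],
    'it_trans': ['a_it', 'b_it'],
    'fr_trans': ['a_fr', 'b_fr'],
    'Sub-task A': ['NAG', 'CAG'],
})

stacked = stack_translations(toy, 'Sub-task A')
# Shuffle with a fixed seed so the row order can be reproduced.
shuffled = stacked.sample(frac=1, random_state=0).reset_index(drop=True)
print(shuffled.shape)
print(shuffled['Sub-task A'].value_counts().to_dict())
```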

### Task B

In [2]:
df0 = pd.read_csv('../../../data/files_back_translation/taskB-NGEN-es.csv')[['Text', 'Sub-task B']]
df1 = pd.read_csv('../../../data/files_back_translation/taskB-NGEN-es.csv')[['es_trans', 'Sub-task B']]
df2 = pd.read_csv('../../../data/files_back_translation/taskB-NGEN-it.csv')[['it_trans', 'Sub-task B']]
df3 = pd.read_csv('../../../data/files_back_translation/taskB-NGEN-fr.csv')[['fr_trans', 'Sub-task B']]

df4 = pd.read_csv('../../../data/files_back_translation/taskB-GEN-es.csv')[['Text', 'Sub-task B']]
df5 = pd.read_csv('../../../data/files_back_translation/taskB-GEN-es.csv')[['es_trans', 'Sub-task B']]
df6 = pd.read_csv('../../../data/files_back_translation/taskB-GEN-it.csv')[['it_trans', 'Sub-task B']]
df7 = pd.read_csv('../../../data/files_back_translation/taskB-GEN-fr.csv')[['fr_trans', 'Sub-task B']]

In [3]:
# rename columns

df1 = df1.rename(columns={'es_trans':'Text'})
df2 = df2.rename(columns={'it_trans':'Text'})
df3 = df3.rename(columns={'fr_trans':'Text'})

df5 = df5.rename(columns={'es_trans':'Text'})
df6 = df6.rename(columns={'it_trans':'Text'})
df7 = df7.rename(columns={'fr_trans':'Text'})

In [4]:
# concatenate dataframes
final_b = pd.concat([df0,df1,df2,df3,df4,df5,df6,df7], ignore_index=True)

# shuffle 
final_b = final_b.sample(frac=1)

# save
final_b.to_csv('../../../data/release-files/eng/trac2_eng_train_BT_task_B.csv')

In [5]:
final_b['Sub-task B'].value_counts()

NGEN    15816
GEN     15816
Name: Sub-task B, dtype: int64

In [6]:
final_b.shape

(31632, 2)
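
Each class contributes its original texts plus three back-translated copies (Spanish, Italian, French), so every class count should be four times the size of the oversampled class. A quick arithmetic check of the counts reported above, with the per-class sizes derived from those counts rather than read from the data:

```python
VERSIONS = 4  # original + es + it + fr

# Class counts copied from the value_counts() outputs above.
task_a = {'OAG': 13500, 'NAG': 13500, 'CAG': 13500}
task_b = {'NGEN': 15816, 'GEN': 15816}

for name, counts in (('A', task_a), ('B', task_b)):
    total = sum(counts.values())
    # every count must be an exact multiple of the number of versions
    assert all(n % VERSIONS == 0 for n in counts.values())
    per_class_original = {label: n // VERSIONS for label, n in counts.items()}
    print(f"Task {name}: total={total}, original rows per class={per_class_original}")
```

The totals agree with the shapes shown above: 40500 rows for Task A and 31632 rows for Task B.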