# Final Project

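## TRAC-2 Data Augmentation: Back Translation

In this notebook we use back translation to augment our data: each text is translated from English into a Romance language and then back into English, which yields a paraphrase that keeps the original labels.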

## Package imports

In [1]:
!pip install transformers==4.1.1 sentencepiece==0.1.94



In [2]:
!pip install mosestokenizer==1.1.0



In [3]:
# import the MarianMT model and tokenizer.
from transformers import MarianMTModel, MarianTokenizer

# import pandas and numpy
import pandas as pd
import numpy as np

## Models from English to Romance languages and back

In [4]:
# model that can translate from English to Romance languages
# this is a single model that can translate to any of the Romance languages

target_tokenizer = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-ROMANCE')
target_model = MarianMTModel.from_pretrained('Helsinki-NLP/opus-mt-en-ROMANCE')

Some weights of MarianMTModel were not initialized from the model checkpoint at Helsinki-NLP/opus-mt-en-ROMANCE and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [5]:
# initialize the model that translates from Romance languages back to English

en_tokenizer = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-ROMANCE-en')
en_model = MarianMTModel.from_pretrained('Helsinki-NLP/opus-mt-ROMANCE-en')

Some weights of MarianMTModel were not initialized from the model checkpoint at Helsinki-NLP/opus-mt-ROMANCE-en and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Available languages

Below is the full list of available language codes. The most relevant are:

French (fr), Spanish (es), Italian (it), Portuguese (pt), Romanian (ro), Galician (gl), Sardinian (sc), and Corsican (co), along with many others, including regional variants of Spanish from Spain and Latin America.

In [6]:
# available languages
target_tokenizer.supported_language_codes

['>>fr<<',
 '>>es<<',
 '>>it<<',
 '>>pt<<',
 '>>pt_br<<',
 '>>ro<<',
 '>>ca<<',
 '>>gl<<',
 '>>pt_BR<<',
 '>>la<<',
 '>>wa<<',
 '>>fur<<',
 '>>oc<<',
 '>>fr_CA<<',
 '>>sc<<',
 '>>es_ES<<',
 '>>es_MX<<',
 '>>es_AR<<',
 '>>es_PR<<',
 '>>es_UY<<',
 '>>es_CL<<',
 '>>es_CO<<',
 '>>es_CR<<',
 '>>es_GT<<',
 '>>es_HN<<',
 '>>es_NI<<',
 '>>es_PA<<',
 '>>es_PE<<',
 '>>es_VE<<',
 '>>es_DO<<',
 '>>es_EC<<',
 '>>es_SV<<',
 '>>an<<',
 '>>pt_PT<<',
 '>>frp<<',
 '>>lad<<',
 '>>vec<<',
 '>>fr_FR<<',
 '>>co<<',
 '>>it_IT<<',
 '>>lld<<',
 '>>lij<<',
 '>>lmo<<',
 '>>nap<<',
 '>>rm<<',
 '>>scn<<',
 '>>mwl<<']
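
The codes above are wrapped in `>>` and `<<` markers. As a small sketch, the bare codes can be recovered and a requested language checked before translating. The helper names `strip_markers` and `is_supported` are hypothetical, and the list below is only a hand-copied subset of the output above.

```python
# A few entries copied from the list above; in the notebook the full list
# comes from target_tokenizer.supported_language_codes.
supported_codes = ['>>fr<<', '>>es<<', '>>it<<', '>>pt<<', '>>ro<<', '>>sc<<', '>>co<<']


def strip_markers(code):
    """Turn a token such as '>>fr<<' into the bare code 'fr' (hypothetical helper)."""
    return code.replace('>>', '').replace('<<', '')


def is_supported(language, codes):
    """Check whether a bare code such as 'fr' appears in the token list."""
    return f'>>{language}<<' in codes


bare_codes = [strip_markers(code) for code in supported_codes]
print(bare_codes)
print(is_supported('ro', supported_codes), is_supported('sn', supported_codes))
```

A check like this would catch a mistyped code before it reaches the model.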

## Helper functions to translate in batches

In [7]:
def translate(texts, model, tokenizer, language="fr"):
    # Prepare the text data into appropriate format for the model
    template = lambda text: f"{text}" if language == "en" else f">>{language}<< {text}"
    src_texts = [template(i) for i in texts]

    # Tokenize the texts
    encoded = tokenizer.prepare_seq2seq_batch(src_texts,return_tensors="pt")
    
    # Generate translation using model
    translated = model.generate(**encoded)

    # Convert the generated tokens indices back into text
    translated_texts = tokenizer.batch_decode(translated, skip_special_tokens=True)
    
    return translated_texts
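
The only preprocessing in `translate` is the language prefix: the English-to-Romance model reads a `>>xx<<` token at the start of each text to pick the output language, while the Romance-to-English direction needs no prefix. The sketch below isolates that step in a standalone function; `add_language_prefix` is a hypothetical name, not part of the notebook.

```python
def add_language_prefix(texts, language="fr"):
    """Prepend the '>>xx<<' target token unless the target language is English."""
    if language == "en":
        # Translating back to English: the texts are passed through unchanged.
        return list(texts)
    return [f">>{language}<< {text}" for text in texts]


to_spanish = add_language_prefix(["I would like to have a cup of coffee"], language="es")
to_english = add_language_prefix(["Me gustaria un cafe"], language="en")
print(to_spanish)
print(to_english)
```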


In [8]:
def back_translate(texts, source_lang="en", target_lang="fr"):
    # Translate from source to target language
    target_texts = translate(texts, target_model, target_tokenizer, 
                             language=target_lang)

    # Translate from target language back to source language
    back_translated_texts = translate(target_texts, en_model, en_tokenizer, 
                                      language=source_lang)
    
    return back_translated_texts
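
Because both helpers take a list of texts, a long column can be fed in fixed-size chunks instead of one text per call. The following is a minimal sketch under that assumption; `chunked`, `back_translate_in_chunks`, and the stand-in `fake_back_translate` are hypothetical names, and the stand-in only upper-cases text so the example runs without downloading any model.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def back_translate_in_chunks(texts, back_translate_fn, batch_size=8):
    """Apply a list-to-list translation function chunk by chunk, keeping order."""
    results = []
    for batch in chunked(list(texts), batch_size):
        results.extend(back_translate_fn(batch))
    return results


def fake_back_translate(batch):
    """Stand-in for back_translate: returns one output per input text."""
    return [text.upper() for text in batch]


texts = [f"sentence {i}" for i in range(5)]
augmented = back_translate_in_chunks(texts, fake_back_translate, batch_size=2)
print(augmented)
```

With the real helpers, the function passed in could be something like `lambda batch: back_translate(batch, source_lang="en", target_lang="es")`. The batch size is a trade-off between speed and memory.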

In [9]:
# note that even a single text needs to be inside a list
single_sentence = ['I would like to have a cup of coffee']

aug_texts = back_translate(single_sentence, source_lang="en", target_lang="es")
print(aug_texts)

To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  ../aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)


["I'd like a cup of coffee."]


## Play a little...

In [10]:
# example using back translation through Spanish

example_texts = ['I like this homework so much', 'this is not a great performance']

aug_texts = back_translate(example_texts, source_lang="en", target_lang="es")
print(aug_texts)

['I really like this assignment.', "It's not a great performance."]


In [11]:
# example using back translation through French

aug_texts = back_translate(example_texts, source_lang="en", target_lang="fr")

print(aug_texts)


['I love homework so much.', "It's not a great performance."]


In [12]:
# example using back translation through Italian

aug_texts = back_translate(example_texts, source_lang="en", target_lang="it")

print(aug_texts)

['I really like homework.', "This isn't a great performance."]


In [13]:
# example using back translation through Portuguese

aug_texts = back_translate(example_texts, source_lang="en", target_lang="pt")

print(aug_texts)

['I really like this housework.', "This isn't a great performance."]


## Load training data

In [14]:
# load all training data
train_data = pd.read_csv('../../../data/release-files/eng/trac2_eng_train.csv')

In [15]:
train_data.head()

Unnamed: 0,ID,Text,Sub-task A,Sub-task B
0,C45.451,Next part,NAG,NGEN
1,C47.11,Iii8mllllllm\nMdxfvb8o90lplppi0005,NAG,NGEN
2,C33.79,🤣🤣😂😂🤣🤣🤣😂osm vedio ....keep it up...make more v...,NAG,NGEN
3,C4.1961,What the fuck was this? I respect shwetabh and...,NAG,NGEN
4,C10.153,Concerned authorities should bring arundathi R...,NAG,NGEN


In [16]:
## create a column that encodes every combination of the Sub-task A and Sub-task B classes
## NAG-NGEN, NAG-GEN, CAG-NGEN, CAG-GEN, OAG-NGEN, OAG-GEN

# create a list of conditions
conditions = [(train_data['Sub-task A'] == 'NAG') & (train_data['Sub-task B'] == 'NGEN'),
              (train_data['Sub-task A'] == 'NAG') & (train_data['Sub-task B'] == 'GEN'), 
              (train_data['Sub-task A'] == 'CAG') & (train_data['Sub-task B'] == 'NGEN'),
              (train_data['Sub-task A'] == 'CAG') & (train_data['Sub-task B'] == 'GEN'),
              (train_data['Sub-task A'] == 'OAG') & (train_data['Sub-task B'] == 'NGEN'),
              (train_data['Sub-task A'] == 'OAG') & (train_data['Sub-task B'] == 'GEN')
             ]
           
# values for each condition
values = [0, 1, 2, 3, 4, 5]

# create a new column 
train_data['combined'] = np.select(conditions, values)
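
The same encoding can be written as a lookup table keyed by the label pair. This is only an alternative sketch, not the notebook's method; `LABEL_TO_CLASS` and `combine_labels` are hypothetical names. One difference is worth noting: `np.select` without a `default` argument assigns 0 to rows that match no condition, whereas the dictionary lookup raises a `KeyError` on an unexpected label pair.

```python
# Mapping from (Sub-task A, Sub-task B) to the combined class used above.
LABEL_TO_CLASS = {
    ('NAG', 'NGEN'): 0,
    ('NAG', 'GEN'): 1,
    ('CAG', 'NGEN'): 2,
    ('CAG', 'GEN'): 3,
    ('OAG', 'NGEN'): 4,
    ('OAG', 'GEN'): 5,
}


def combine_labels(task_a, task_b):
    """Return the combined class; raises KeyError for an unknown label pair."""
    return LABEL_TO_CLASS[(task_a, task_b)]


rows = [('NAG', 'NGEN'), ('CAG', 'GEN'), ('OAG', 'GEN')]
combined = [combine_labels(a, b) for a, b in rows]
print(combined)
```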

In [17]:
train_data.head()

Unnamed: 0,ID,Text,Sub-task A,Sub-task B,combined
0,C45.451,Next part,NAG,NGEN,0
1,C47.11,Iii8mllllllm\nMdxfvb8o90lplppi0005,NAG,NGEN,0
2,C33.79,🤣🤣😂😂🤣🤣🤣😂osm vedio ....keep it up...make more v...,NAG,NGEN,0
3,C4.1961,What the fuck was this? I respect shwetabh and...,NAG,NGEN,0
4,C10.153,Concerned authorities should bring arundathi R...,NAG,NGEN,0


In [18]:
train_data['combined'].value_counts()

0    3241
2     418
4     295
5     140
1     134
3      35
Name: combined, dtype: int64

In [19]:
# create a dataframe for each class
train_0 = train_data[train_data['combined'] == 0]
train_1 = train_data[train_data['combined'] == 1]
train_2 = train_data[train_data['combined'] == 2]
train_3 = train_data[train_data['combined'] == 3]
train_4 = train_data[train_data['combined'] == 4]
train_5 = train_data[train_data['combined'] == 5]

### Data augmentation for combined class 1

In [21]:
# translate into 5 languages and back to English
train_1['es_trans'] = train_1['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='es')[0])
train_1['fr_trans'] = train_1['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='fr')[0])
train_1['it_trans'] = train_1['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='it')[0])
train_1['pt_trans'] = train_1['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='pt')[0])
train_1['ro_trans'] = train_1['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='ro')[0])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_
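
The warning above is pandas' `SettingWithCopyWarning`: `train_1` was created by filtering `train_data`, so pandas cannot tell whether the new columns are meant for the subset or the original frame. A common way to avoid it, sketched here on a toy frame with a stand-in for the translation step, is to take an explicit `.copy()` of the subset before adding columns.

```python
import pandas as pd

# Toy data standing in for train_data; the column names mirror the notebook.
toy = pd.DataFrame({
    'Text': ['a', 'b', 'c'],
    'combined': [0, 1, 1],
})

# .copy() makes the subset an independent DataFrame, so new columns are
# clearly assigned to the subset and not to a slice of the original.
subset = toy[toy['combined'] == 1].copy()
subset['es_trans'] = subset['Text'].str.upper()  # stand-in for a back-translation

print(subset)
print(toy.columns.tolist())
```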

In [22]:
# save 
train_1.to_csv('../../../data/augm/train_1.csv', index=False)

### Data augmentation for combined class 2

In [23]:
# translate into 5 languages and back to English
train_2['es_trans'] = train_2['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='es')[0])
train_2['fr_trans'] = train_2['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='fr')[0])
train_2['it_trans'] = train_2['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='it')[0])
train_2['pt_trans'] = train_2['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='pt')[0])
train_2['ro_trans'] = train_2['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='ro')[0])

In [24]:
# save 
train_2.to_csv('../../../data/augm/train_2.csv', index=False)

### Data augmentation for combined class 3

In [25]:
# translate into 5 languages and back to English
train_3['es_trans'] = train_3['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='es')[0])
train_3['fr_trans'] = train_3['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='fr')[0])
train_3['it_trans'] = train_3['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='it')[0])
train_3['pt_trans'] = train_3['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='pt')[0])
train_3['ro_trans'] = train_3['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='ro')[0])

In [26]:
# save 
train_3.to_csv('../../../data/augm/train_3.csv', index=False)

### Data augmentation for combined class 4

In [27]:
# translate into 5 languages and back to English
train_4['es_trans'] = train_4['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='es')[0])
train_4['fr_trans'] = train_4['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='fr')[0])
train_4['it_trans'] = train_4['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='it')[0])
train_4['pt_trans'] = train_4['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='pt')[0])
train_4['ro_trans'] = train_4['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='ro')[0])

In [28]:
# save 
train_4.to_csv('../../../data/augm/train_4.csv', index=False)

### Data augmentation for combined class 5

In [29]:
# translate into 5 languages and back to English
train_5['es_trans'] = train_5['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='es')[0])
train_5['fr_trans'] = train_5['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='fr')[0])
train_5['it_trans'] = train_5['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='it')[0])
train_5['pt_trans'] = train_5['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='pt')[0])
train_5['ro_trans'] = train_5['Text'].apply(lambda x: back_translate([x], source_lang='en', target_lang='ro')[0])

In [30]:
# save 
train_5.to_csv('../../../data/augm/train_5.csv', index=False)

## Load all data back and create the new training dataset

In [2]:
import pandas as pd

In [3]:
# combined class 0 is not back-translated (majority class)
train_0 = pd.read_csv('../../../data/release-files/eng/trac2_eng_train.csv')
train_0 = train_0[(train_0['Sub-task A'] == 'NAG') & (train_0['Sub-task B'] == 'NGEN')]
# Load all the other combined classes
train_1 = pd.read_csv('../../../data/augm/train_1.csv')
train_2 = pd.read_csv('../../../data/augm/train_2.csv')
train_3 = pd.read_csv('../../../data/augm/train_3.csv')
train_4 = pd.read_csv('../../../data/augm/train_4.csv')
train_5 = pd.read_csv('../../../data/augm/train_5.csv')

In [4]:
train_0.head()

Unnamed: 0,ID,Text,Sub-task A,Sub-task B
0,C45.451,Next part,NAG,NGEN
1,C47.11,Iii8mllllllm\nMdxfvb8o90lplppi0005,NAG,NGEN
2,C33.79,🤣🤣😂😂🤣🤣🤣😂osm vedio ....keep it up...make more v...,NAG,NGEN
3,C4.1961,What the fuck was this? I respect shwetabh and...,NAG,NGEN
4,C10.153,Concerned authorities should bring arundathi R...,NAG,NGEN


In [11]:
train_5

Unnamed: 0,ID,Text,Sub-task A,Sub-task B,combined,es_trans,fr_trans,it_trans,pt_trans,ro_trans
0,C10.967.5,Ram Raja - she is a female dog - who can’t eve...,OAG,GEN,5,Ram Raja - she's a bitch - who can't even keep...,Ram Raja - she's a bitch - who can't even keep...,Ram Raja - she's a bitch - who can't even hold...,Ram Raja - she's a dog - who can't even keep a...,Ram Raja - she's a female dog - who can't even...
1,C20.110,One word for u bhaad me jaa chudail,OAG,GEN,5,A word for or bhaad me jaa chudail,A word for u bhaad me jaa chudail,A word for bhaad me jaa chudail,A word for the bhaad me jaa chudail,A word for me jaa chudail
2,C9.58,This whore is a member of psuedo intellectual ...,OAG,GEN,5,This whore is a member of a centuries-old inte...,This whore is a member of a psuedo secular int...,This whore is a member of the secular intellec...,This prostitute is a member of the pseudo secu...,This whore is a member of the centuries-old ps...
3,C4.1652,Fuck feminism !,OAG,GEN,5,Fuck feminism!,Damn feminism!,Fuck feminism!,Fuck feminism!,Fuck feminism!
4,C26.171,He should have also killed that bitch,OAG,GEN,5,"I should have killed that bitch, too.","He should have killed that bitch, too.","You should have killed that bitch, too.","He should have killed that bitch, too.","He should have killed that bitch, too."
...,...,...,...,...,...,...,...,...,...,...
135,C20.265,"You,re a bitch",OAG,GEN,5,You're a bitch.,You're a bitch.,You're a bitch.,"You, you're a bitch.",You're a bitch.
136,C43.167.1,@Baraqua Amina Levy-Khan So is prests raping c...,OAG,GEN,5,@Baraqua Amina Levy-Khan So presidents rape ch...,"@Baraqua Amina Levy-Khan So, violent loan chil...",@Baraqua Amina Levy-Khan So it's Prest raping ...,@Baraqua Amina Levy-Khan So Prestes raping chi...,@Baraqua Amina Levy-Khan Asa is ready to rape ...
137,C20.62,Fuck u and your reviews.....,OAG,GEN,5,Fuck you and your criticisms..,Fuck you and your comments..,Fuck you and your reviews..,Fuck you and your criticisms.,Fuck your reviews.
138,C38.448,Open bob and vagane,OAG,GEN,5,Open bob and vag,Open Bob and Vagina,Open the bob and the vag,Open Bob and Vagane,Open Bob and Vagina


In [47]:
def reshape_data(df):
    '''
    Returns a dataframe with all translations as rows and not columns.
    '''
    df0 = df[['ID','Text', 'Sub-task A', 'Sub-task B']]
    df1 = df[['ID','es_trans', 'Sub-task A', 'Sub-task B']].rename(columns={'es_trans':'Text'})
    df2 = df[['ID','fr_trans', 'Sub-task A', 'Sub-task B']].rename(columns={'fr_trans':'Text'})
    df3 = df[['ID','it_trans', 'Sub-task A', 'Sub-task B']].rename(columns={'it_trans':'Text'})
    df4 = df[['ID','pt_trans', 'Sub-task A', 'Sub-task B']].rename(columns={'pt_trans':'Text'})
    df5 = df[['ID','ro_trans', 'Sub-task A', 'Sub-task B']].rename(columns={'ro_trans':'Text'})
    
    # concatenate dataframes
    result = pd.concat([df0,df1,df2,df3,df4,df5])
    
    return result
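
The wide-to-long reshape in `reshape_data` can also be expressed with `DataFrame.melt`. The sketch below runs on a toy frame with two translation columns. `reshape_with_melt` is a hypothetical name, and the extra `source` column (recording which column each text came from) is an addition not present in the notebook's output.

```python
import pandas as pd

# Toy frame standing in for one of the train_N dataframes.
toy = pd.DataFrame({
    'ID': ['C1', 'C2'],
    'Text': ['original one', 'original two'],
    'Sub-task A': ['OAG', 'CAG'],
    'Sub-task B': ['GEN', 'NGEN'],
    'es_trans': ['es one', 'es two'],
    'fr_trans': ['fr one', 'fr two'],
})


def reshape_with_melt(df, text_cols):
    """Stack the given text columns into rows, keeping the ID and label columns."""
    id_cols = ['ID', 'Sub-task A', 'Sub-task B']
    # value_name must not clash with an existing column, so use a temporary
    # name and rename it to 'Text' afterwards.
    long_df = df.melt(id_vars=id_cols, value_vars=text_cols,
                      var_name='source', value_name='text_value')
    return long_df.rename(columns={'text_value': 'Text'})


reshaped = reshape_with_melt(toy, ['Text', 'es_trans', 'fr_trans'])
print(reshaped)
```

Dropping the `source` column afterwards would give the same four columns that `reshape_data` returns.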

In [48]:
# reshape data
train_1_reshaped = reshape_data(train_1)
train_2_reshaped = reshape_data(train_2)
train_3_reshaped = reshape_data(train_3)
train_4_reshaped = reshape_data(train_4)
train_5_reshaped = reshape_data(train_5)

In [53]:
# concatenate final dataframe with back-translations
final_df = pd.concat([train_0, train_1_reshaped, train_2_reshaped, 
                      train_3_reshaped, train_4_reshaped, train_5_reshaped])

In [55]:
final_df.shape

(9373, 4)

In [57]:
# review new distribution of classes for Task-A
final_df['Sub-task A'].value_counts(normalize=True)

NAG    0.431559
CAG    0.289982
OAG    0.278459
Name: Sub-task A, dtype: float64

In [58]:
# review new distribution of classes for Task-B
final_df['Sub-task B'].value_counts(normalize=True)

NGEN    0.802198
GEN     0.197802
Name: Sub-task B, dtype: float64
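
These figures can be cross-checked against the class counts reported earlier: class 0 is kept as is, while every other class contributes the original text plus five back-translations, that is, six rows per original. A short arithmetic sketch, with the counts copied from the `value_counts()` output above:

```python
# Original counts per combined class, copied from the value_counts() output.
counts = {0: 3241, 1: 134, 2: 418, 3: 35, 4: 295, 5: 140}

# Class 0 is not augmented; other classes get the original + 5 translations.
copies = {cls: (1 if cls == 0 else 6) for cls in counts}

# Labels that each combined class stands for.
task_a = {0: 'NAG', 1: 'NAG', 2: 'CAG', 3: 'CAG', 4: 'OAG', 5: 'OAG'}
task_b = {0: 'NGEN', 1: 'GEN', 2: 'NGEN', 3: 'GEN', 4: 'NGEN', 5: 'GEN'}

total = sum(counts[cls] * copies[cls] for cls in counts)


def share(label_of, label):
    """Fraction of augmented rows whose label equals `label`."""
    rows = sum(counts[cls] * copies[cls] for cls in counts if label_of[cls] == label)
    return rows / total


print(total)
print({label: round(share(task_a, label), 6) for label in ('NAG', 'CAG', 'OAG')})
print({label: round(share(task_b, label), 6) for label in ('NGEN', 'GEN')})
```

The total and the shares agree with `final_df.shape` and the two normalized distributions shown above.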

In [59]:
final_df.to_csv('../../../data/release-files/eng/trac2_eng_train_augm_backt.csv', index=False)

## References

- Text data augmentation: https://amitness.com/back-translation/