# Final Project

## TRAC-2 Data Augmentation: Word Embeddings Augmentation

In this notebook we use word embeddings to augment our data.
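
Before the library calls, here is a minimal sketch of the idea behind embedding-based substitution: pick a word, look up its nearest neighbours by cosine similarity, and swap in one of them. The two-dimensional vectors in `TOY_VECTORS` and the helpers `nearest` and `substitute` are made up for illustration; they are not real GloVe vectors and not how `nlpaug` is implemented.

```python
import math
import random

# Toy 2-d "embeddings", invented for this sketch (not real GloVe vectors).
TOY_VECTORS = {
    "great": (0.9, 0.1),
    "good": (0.85, 0.2),
    "excellent": (0.95, 0.05),
    "tool": (0.1, 0.9),
    "method": (0.2, 0.85),
    "night": (-0.7, 0.6),
}

def cosine(u, v):
    """Cosine similarity between two 2-d vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def nearest(word, top_k=2):
    """Return the top_k most similar words to `word`, excluding the word itself."""
    target = TOY_VECTORS[word]
    others = [w for w in TOY_VECTORS if w != word]
    others.sort(key=lambda w: cosine(target, TOY_VECTORS[w]), reverse=True)
    return others[:top_k]

def substitute(sentence, n_swaps=1, top_k=2, seed=0):
    """Replace up to n_swaps known words with one of their top_k neighbours."""
    rng = random.Random(seed)
    tokens = sentence.split()
    candidates = [i for i, t in enumerate(tokens) if t in TOY_VECTORS]
    for i in rng.sample(candidates, min(n_swaps, len(candidates))):
        tokens[i] = rng.choice(nearest(tokens[i], top_k))
    return " ".join(tokens)

print(nearest("great"))
print(substitute("this is a great tool"))
```

A small `top_k` keeps replacements close to the original word; a larger one gives more variety at the cost of more drift in meaning.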

## Import packages

In [13]:
!pip install nlpaug



In [14]:
!pip install transformers



In [15]:
import nlpaug.augmenter.word as naw
import transformers 
import pandas as pd
import numpy as np

In [16]:
# to use WordEmbsAug (word2vec, glove or fasttext), 
# download pre-trained model first 

from nlpaug.util.file.download import DownloadUtil
# DownloadUtil.download_word2vec(dest_dir='.') # Download word2vec model
DownloadUtil.download_glove(model_name='glove.6B', dest_dir='.') # Download GloVe model
# DownloadUtil.download_fasttext(model_name='wiki-news-300d-1M', dest_dir='.') # Download fasttext model

In [17]:
# for the synonym augmenter
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Play with some methods in the NLPAug library

In [18]:
# set a test sentence to experiment with
test_sentence = 'this is a great tool for creating new data. We should use it day and night, and all the time.'

### Word embeddings augmenter

In [19]:
# model_type: glove
# each run generates a different sentence

aug_glove = naw.WordEmbsAug(model_type='glove', 
                          model_path='glove.6B.300d.txt',
                          action="substitute",
                          aug_min=1,
                          aug_max=5,
                          top_k=5
                          )

augmented_text = aug_glove.augment(test_sentence)

print("Original:")
print(test_sentence)

print("Augmented Text:")
print(augmented_text)

Original:
this is a great tool for creating new data. We should use it day and night, and all the time.
Augmented Text:
this is a great tool for created new database. We should use but day and night, and only this time.


In [20]:
# model_type: glove
# each run generates a different sentence

aug_glove = naw.WordEmbsAug(model_type='glove', 
                          model_path='glove.6B.300d.txt',
                          action="substitute",
                          aug_min=1,
                          aug_max=5,
                          top_k=3
                          )

augmented_text = aug_glove.augment(test_sentence)

print("Original:")
print(test_sentence)

print("Augmented Text:")
print(augmented_text)

Original:
this is a great tool for creating new data. We should use it day and night, and all the time.
Augmented Text:
this now a great method for creating same analysis. We should use it day and sunday, and all the time.


In [21]:
# model_type: glove
# set percentage of substitution instead of aug_max
# each run generates a different sentence

aug_glove = naw.WordEmbsAug(model_type='glove', 
                          model_path='glove.6B.300d.txt',
                          action="substitute",
                          aug_min=1,
                          # aug_max=5,
                          aug_p = 0.2,
                          top_k=5
                          )

augmented_text = aug_glove.augment(test_sentence)

print("Original:")
print(test_sentence)

print("Augmented Text:")
print(augmented_text)

Original:
this is a great tool for creating new data. We should use it day and night, and all the time.
Augmented Text:
what is a greatest using for creating new data. We should use it day and night, and all of time.
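
To reason about how `aug_p` interacts with `aug_min` and `aug_max`, the sketch below assumes the number of substituted words is the ceiling of `aug_p` times the token count, clamped to the two bounds. This rule, the helper name `estimated_aug_count`, and the default `aug_max=10` are assumptions for illustration; check the `nlpaug` source for the exact behaviour of your version.

```python
import math

def estimated_aug_count(n_tokens, aug_p, aug_min=1, aug_max=10):
    """Assumed rule: ceil(aug_p * n_tokens), clamped to [aug_min, aug_max]."""
    count = math.ceil(aug_p * n_tokens)
    if aug_min is not None:
        count = max(count, aug_min)  # never fewer than aug_min words
    if aug_max is not None:
        count = min(count, aug_max)  # never more than aug_max words
    return count

# Short, medium and long inputs with the same percentage.
for n in (3, 20, 80):
    print(n, estimated_aug_count(n, aug_p=0.25))
```

Under this assumption a percentage scales the amount of change with the text length, while `aug_min` and `aug_max` still cap very short and very long comments.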


### BERT augmenter

In [22]:
# using bert

aug_bert = naw.ContextualWordEmbsAug(model_path='bert-base-uncased', action="substitute")

augmented_text = aug_bert.augment(test_sentence)

print("Original:")
print(test_sentence)

print("Augmented Text:")
print(augmented_text)


Original:
this is a great tool for creating new data. We should use it day and night, and all the time.
Augmented Text:
this is a great tool describing this global data. we also use charts day and night, and all the stars.


### Synonym Augmenter

In [23]:
aug_synonym = naw.SynonymAug(aug_src='wordnet',
                             aug_min=1,
                             aug_max=3)

augmented_text = aug_synonym.augment(test_sentence)

print("Original:")
print(test_sentence)

print("Augmented Text:")
print(augmented_text)

Original:
this is a great tool for creating new data. We should use it day and night, and all the time.
Augmented Text:
this is a great tool for creating new data. We should use information technology day and night, and all the time.


### Back Translation Augmenter

The MarianMT model seems better.

In [24]:
back_translation_aug = naw.BackTranslationAug(from_model_name='facebook/wmt19-en-de', 
                                              to_model_name='facebook/wmt19-de-en',
                                              device='cuda')
augmented_text = back_translation_aug.augment(test_sentence)

print("Original:")
print(test_sentence)

print("Augmented Text:")
print(augmented_text)

Original:
this is a great tool for creating new data. We should use it day and night, and all the time.
Augmented Text:
This is a great tool for creating new data. We should use it day and night and all the time.
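
To see how much an augmenter actually changed a sentence, a token-level diff helps. The sketch below uses the standard library's `difflib`; the helper name `token_changes` is hypothetical, and splitting on whitespace is a simplification of real tokenization.

```python
from difflib import SequenceMatcher

def token_changes(original, augmented):
    """Return (old, new) pairs for every whitespace-token span that differs."""
    a, b = original.split(), augmented.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes

original = ("this is a great tool for creating new data. "
            "We should use it day and night, and all the time.")
back_translated = ("This is a great tool for creating new data. "
                   "We should use it day and night and all the time.")

for old, new in token_changes(original, back_translated):
    print(repr(old), "->", repr(new))
```

For the back-translated sentence above, only capitalization and one comma differ, so this augmenter adds little variety for this particular input.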


## Import Data

In [25]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [26]:
train_data = pd.read_csv('drive/MyDrive/w266/release-files/eng/trac2_eng_train.csv')

In [27]:
train_data.head()

Unnamed: 0,ID,Text,Sub-task A,Sub-task B
0,C45.451,Next part,NAG,NGEN
1,C47.11,Iii8mllllllm\nMdxfvb8o90lplppi0005,NAG,NGEN
2,C33.79,🤣🤣😂😂🤣🤣🤣😂osm vedio ....keep it up...make more v...,NAG,NGEN
3,C4.1961,What the fuck was this? I respect shwetabh and...,NAG,NGEN
4,C10.153,Concerned authorities should bring arundathi R...,NAG,NGEN


## Data Augmentation with word embeddings 

In [28]:
## create a column that encodes every possible combination of classes for task A and task B
## NAG-NGEN, NAG-GEN, CAG-NGEN, CAG-GEN, OAG-NGEN, OAG-GEN

# create a list of conditions
conditions = [(train_data['Sub-task A'] == 'NAG') & (train_data['Sub-task B'] == 'NGEN'),
              (train_data['Sub-task A'] == 'NAG') & (train_data['Sub-task B'] == 'GEN'), 
              (train_data['Sub-task A'] == 'CAG') & (train_data['Sub-task B'] == 'NGEN'),
              (train_data['Sub-task A'] == 'CAG') & (train_data['Sub-task B'] == 'GEN'),
              (train_data['Sub-task A'] == 'OAG') & (train_data['Sub-task B'] == 'NGEN'),
              (train_data['Sub-task A'] == 'OAG') & (train_data['Sub-task B'] == 'GEN')
             ]
           
# values for each condition
values = [0, 1, 2, 3, 4, 5]

# create a new column 
train_data['combined'] = np.select(conditions, values)
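
Note that `np.select` falls back to its `default` value (0) for rows that match none of the conditions, so an unexpected label pair would be counted as class 0 without any error. A dictionary lookup is one way to make such cases fail loudly. The names `LABEL_CODES` and `combined_label` below are hypothetical, introduced only for this sketch.

```python
# Same encoding as the np.select cell: (Sub-task A, Sub-task B) -> combined class.
LABEL_CODES = {
    ("NAG", "NGEN"): 0, ("NAG", "GEN"): 1,
    ("CAG", "NGEN"): 2, ("CAG", "GEN"): 3,
    ("OAG", "NGEN"): 4, ("OAG", "GEN"): 5,
}

def combined_label(task_a, task_b):
    """Return the combined class code; raise on a label pair that is not expected."""
    try:
        return LABEL_CODES[(task_a, task_b)]
    except KeyError:
        raise ValueError(f"unexpected label pair: {task_a!r}, {task_b!r}") from None

pairs = [("NAG", "NGEN"), ("OAG", "GEN"), ("CAG", "NGEN")]
print([combined_label(a, b) for a, b in pairs])
```

On the real data this could be applied with `[combined_label(a, b) for a, b in zip(train_data['Sub-task A'], train_data['Sub-task B'])]`.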

In [29]:
train_data['combined'].value_counts()

0    3241
2     418
4     295
5     140
1     134
3      35
Name: combined, dtype: int64

In [30]:
# create a dataframe for each class
# .copy() makes each subset independent of train_data, so adding columns
# later does not raise SettingWithCopyWarning
train_0 = train_data[train_data['combined'] == 0].copy()
train_1 = train_data[train_data['combined'] == 1].copy()
train_2 = train_data[train_data['combined'] == 2].copy()
train_3 = train_data[train_data['combined'] == 3].copy()
train_4 = train_data[train_data['combined'] == 4].copy()
train_5 = train_data[train_data['combined'] == 5].copy()

In [31]:
# define a function to apply GloVe embeddings
# To control variability, substitute at most 5 words and at least 1 per text.
# Build the augmenter once so it is not re-created for every row.
aug_glove_train = naw.WordEmbsAug(model_type='glove',
                                  model_path='glove.6B.300d.txt',
                                  action="substitute",
                                  aug_min=1,
                                  aug_max=5,
                                  top_k=3
                                  )

def glove_augment(x):
  return aug_glove_train.augment(x)


### Data augmentation for combined class 1

In [32]:
# run 5 times to generate 5 versions of the sentence
train_1['glove1'] = train_1['Text'].apply(glove_augment)
train_1['glove2'] = train_1['Text'].apply(glove_augment)
train_1['glove3'] = train_1['Text'].apply(glove_augment)
train_1['glove4'] = train_1['Text'].apply(glove_augment)
train_1['glove5'] = train_1['Text'].apply(glove_augment)


In [33]:
train_1.head()

Unnamed: 0,ID,Text,Sub-task A,Sub-task B,combined,glove1,glove2,glove3,glove4,glove5
21,C38.482,its not good i think its all dimaghi keeda bei...,NAG,GEN,1,the not good i think its all dimaghi keeda bei...,the not good i think their all dimaghi keeda b...,which n't good i certainly its all dimaghi kee...,its not good 've think its other dimaghi keeda...,its but always i think its those dimaghi keeda...
24,C10.155,Why can't the Indian government take serious a...,NAG,GEN,1,Why can ' t the Indian government take serious...,Why can ' t the Indian government take problem...,Why can ' t the Indian government take serious...,Why can ' t the Indian government take serious...,Why can ' t the Indian government take serious...
61,C10.427,This mentally ill Lady is a barking Street dog...,NAG,GEN,1,This mentally ill Lady now this barking Street...,This emotionally ill Lady this a barking Stree...,This mentally ill Lady is a barking Street dog...,This mentally ill Lady now a barked Street dog...,This mentally illness Lady is a barking Street...
82,C43.12,"Finally, i can legally call indians gay",NAG,GEN,1,"Finally, i if obliged call indians gay","Finally, i can legally call natives homosexual","Finally, i can legally call tribes gays","Finally, 'd cannot legally call indians gay","Finally, i can illegally call indians lesbian"
85,C10.1402,As per Arundhati she should give her name as K...,NAG,GEN,1,As per Arundhati she should give her name as K...,As per Arundhati she should give she name as K...,As per Arundhati she should giving his name as...,As per Arundhati she should give her name as K...,As per Arundhati she should give his name as K...
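
The five repeated assignments in the cell above can also be written as a loop. The sketch below uses a stand-in function `fake_augment` (hypothetical; it only upper-cases one random word) so that it runs without the GloVe file, and it takes an explicit `.copy()` of the filtered rows before adding columns.

```python
import random
import pandas as pd

def fake_augment(text, rng):
    """Hypothetical stand-in for glove_augment: upper-cases one random word."""
    tokens = text.split()
    i = rng.randrange(len(tokens))
    tokens[i] = tokens[i].upper()
    return " ".join(tokens)

# Made-up rows standing in for the training data.
data = pd.DataFrame({"Text": ["one two three", "four five six", "seven eight"],
                     "combined": [1, 0, 1]})

# .copy() gives an independent dataframe, so new columns can be added safely.
subset = data[data["combined"] == 1].copy()

rng = random.Random(0)
n_versions = 5
for k in range(1, n_versions + 1):
    # Each pass produces one more augmented version of every text.
    subset[f"glove{k}"] = subset["Text"].apply(lambda t: fake_augment(t, rng))

print(subset.columns.tolist())
```

With the real augmenter, `fake_augment` would be replaced by `glove_augment`, and `n_versions` controls how many augmented columns are produced.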


### Data augmentation for combined class 2

In [34]:
# run 5 times to generate 5 versions of the sentence
train_2['glove1'] = train_2['Text'].apply(glove_augment)
train_2['glove2'] = train_2['Text'].apply(glove_augment)
train_2['glove3'] = train_2['Text'].apply(glove_augment)
train_2['glove4'] = train_2['Text'].apply(glove_augment)
train_2['glove5'] = train_2['Text'].apply(glove_augment)


In [35]:
train_2.head()

Unnamed: 0,ID,Text,Sub-task A,Sub-task B,combined,glove1,glove2,glove3,glove4,glove5
7,C7.1642,Even when kabir singh was unaware that Preeti ...,CAG,NGEN,2,Even when kabir singh was unaware that Preeti ...,Even when kabir singh was unclear that Preeti ...,Even when kabir singh was unaware which Preeti...,Even when kabir singh was unaware that Preeti ...,Even when kabir singh was unaware that Preeti ...
14,C7.2097.2,@Dushant Sain EXACTLY!!!!! don't we just go to...,CAG,NGEN,2,@ Dushant Sain EXACTLY! !! !! don ' t we just ...,@ Dushant Sain EXACTLY! !! !! don ' h we just ...,@ Dushant Sain EXACTLY! !! !! antonio ' t we j...,@ Dushant Sain EXACTLY! !! !! quixote ' t we j...,@ Dushant Sain EXACTLY! !! !! don ' t going ju...
19,C59.1085,It Calls Uneducated Person means Unexpected Be...,CAG,NGEN,2,It Calls Uneducated Person meaning Unexpected ...,It Calls Uneducated Person meaning Unexpected ...,It Calls Uneducated Person thus Unexpected Beh...,It Calls Uneducated Person mean Unexpected Beh...,It Calls Uneducated Person meaning Unexpected ...
42,C7.1035,By the way i myself being a 21st century moder...,CAG,NGEN,2,By the way i myself being a 21st century moder...,By the way i myself being a 21st century moder...,By the way i myself being a 21st century moder...,By the way i yourself being a 21st century mod...,By the way i myself being a 21st century moder...
44,C58.442,"Looks like a ghost, real conjuring",CAG,NGEN,2,"Looks even another ghost, real conjuring","Looks like a ghost, just conjure","Looks like a ghost, what conjure","Looks such a ghostly, real conjuring","Looks you a ghost, true conjuring"


### Data augmentation for combined class 3

In [36]:
# run 5 times to generate 5 versions of the sentence
train_3['glove1'] = train_3['Text'].apply(glove_augment)
train_3['glove2'] = train_3['Text'].apply(glove_augment)
train_3['glove3'] = train_3['Text'].apply(glove_augment)
train_3['glove4'] = train_3['Text'].apply(glove_augment)
train_3['glove5'] = train_3['Text'].apply(glove_augment)



In [37]:
train_3.head()

Unnamed: 0,ID,Text,Sub-task A,Sub-task B,combined,glove1,glove2,glove3,glove4,glove5
140,C7.95,bro bro....😂😂😂just try and replace the girl wi...,CAG,GEN,3,sis sis. .. . 😂 😂 😂 just trying and replace th...,bro morgannwg. .. . 😂 😂 😂 just try and replace...,sis morgannwg. .. . 😂 😂 😂 just try and replace...,'cause bro. .. . 😂 😂 😂 just to well replace th...,bro bro. .. . 😂 😂 😂 just try and replace the w...
197,C7.1558,"Fuckboy isn't good , but a bully isn't either....",CAG,GEN,3,"Fuckboy isn ' t good, but a bully isn ' t eith...","Fuckboy isn ' t good, but a bully isn ' t eith...","Fuckboy isn ' t good, but a bullied isn ' t ei...","Fuckboy isn ' t good, but a bully isn ' t eith...","Fuckboy isn ' t good, but a bully isn ' t eith..."
236,C10.883,Let AR use her name as range & billa and fill ...,CAG,GEN,3,Let AR use her name as ranges & billa and fill...,Let AR use her name as ranges & billa and fill...,Let AR use her name also ranging & billa and f...,Let AR use her name as range & billa and fill ...,Let AR use her name as variety & billa and fil...
324,C36.1136.2,Nikita Tiwari hey miss nikita are you one of t...,CAG,GEN,3,Nikita Tiwari hey miss nikita are you one of t...,Nikita Tiwari hey miss nikita are you three of...,Nikita Tiwari hey she nikita are you one of th...,Nikita Tiwari! miss nikita these you three of ...,Nikita Tiwari hey miss nikita are you only of ...
340,C10.419,"What a vacuum minded witch, product of May be ...",CAG,GEN,3,"What a vacuum savvy witch, product of May not ...","What a cleaners pragmatic witch, product of Ma...","What a vacuum pragmatic witches, product of Ma...","What a vacuum pragmatic witch, brand the May n...","What is vacuum minded witch, brand of May be b..."


### Data augmentation for combined class 4

In [38]:
# run 5 times to generate 5 versions of the sentence
train_4['glove1'] = train_4['Text'].apply(glove_augment)
train_4['glove2'] = train_4['Text'].apply(glove_augment)
train_4['glove3'] = train_4['Text'].apply(glove_augment)
train_4['glove4'] = train_4['Text'].apply(glove_augment)
train_4['glove5'] = train_4['Text'].apply(glove_augment)


In [39]:
train_4.head()

Unnamed: 0,ID,Text,Sub-task A,Sub-task B,combined,glove1,glove2,glove3,glove4,glove5
28,C7.240,feminism means equality not discrimination fir...,OAG,NGEN,4,feminism thus equality but discrimination firs...,feminism means equality not discrimination thi...,feminism means equality did discrimination fir...,feminism means equal but racism first you shou...,feminism thus equality n't discrimination firs...
54,C4.1318,I had head ache during this movie and also pos...,OAG,NGEN,4,I had head ache during this film and also post...,"I have head ache during this movie, also post ...",I had head ache during this movie and also pos...,I have head ache during this movie and also po...,I had assistant ache days this film and also p...
64,C4.1321.3,Totally agree with ur points bro nd kabir sing...,OAG,NGEN,4,Totally agreed with ur points bro nd kabir sha...,Totally agree with ur points bro nd kabir sing...,Totally agree with ur points morgannwg nd kabi...,Totally agree with ur points bro nd alamgir si...,Totally agree with shoy rebounds bro nd kabir ...
96,C59.1034,to moto chele a rakom ma der chere dei tar par...,OAG,NGEN,4,able motorcycling jiwamol a rakom ma der chere...,to moto mongkolporn a rakom jeou van chere dei...,help moto mongkolporn a rakom ma der chere deg...,able moto chele is rakom jeou der chere dei ta...,to motogp chele this rakom ma und chere dei he...
113,C68.812,I hat ranu Mondal,OAG,NGEN,4,I trick ranu Mondal,I hat mongkolporn Mondal,I wearing ranu Mondal,I hat mongkolporn Mondal,I hat nordeman Mondal


### Data augmentation for combined class 5

In [40]:
# run 5 times to generate 5 versions of the sentence
train_5['glove1'] = train_5['Text'].apply(glove_augment)
train_5['glove2'] = train_5['Text'].apply(glove_augment)
train_5['glove3'] = train_5['Text'].apply(glove_augment)
train_5['glove4'] = train_5['Text'].apply(glove_augment)
train_5['glove5'] = train_5['Text'].apply(glove_augment)


In [41]:
train_5.head()

Unnamed: 0,ID,Text,Sub-task A,Sub-task B,combined,glove1,glove2,glove3,glove4,glove5
62,C10.967.5,Ram Raja - she is a female dog - who can’t eve...,OAG,GEN,5,Ram Raja - she is a women dog - have can ’ t e...,Ram Raja - she is this female dog - who can ’ ...,Ram Raja - her is a male dog - who can ' t eve...,Ram Raja - she is a female dog - who can ‘ h e...,Ram Raja - she is another female dog - whom ca...
71,C20.110,One word for u bhaad me jaa chudail,OAG,GEN,5,One word for lwin bhaad me jhalak chudail,One word for lwin bhaad me dikhhla chudail,One words. u bhaad me jaa chudail,One word for n bhaad you jaa chudail,One phrase for u bhaad know jaa chudail
189,C9.58,This whore is a member of psuedo intellectual ...,OAG,GEN,5,This whore now a members of psuedo scholarly m...,This whore is a committee some psuedo intellec...,This whore is a committee of psuedo intellectu...,This whore is a committee of psuedo scholarly ...,This whore it a elected of psuedo intellectual...
307,C4.1652,Fuck feminism !,OAG,GEN,5,Fuck feminist!,Fuck feminist!,Fuck feminist!,Fuck environmentalism!,Fuck environmentalism!
339,C26.171,He should have also killed that bitch,OAG,GEN,5,He should have also killing that fucking,He should have both killed which bitch,He should had although killed that bitch,He should they also killed what bitch,He not already also killed that bitch


### Reshape the dataframes

In [42]:
def reshape_data(df):
    '''
    Returns a dataframe with all augmentations as rows and not columns.
    '''
    df0 = df[['ID','Text', 'Sub-task A', 'Sub-task B']]
    df1 = df[['ID','glove1', 'Sub-task A', 'Sub-task B']].rename(columns={'glove1':'Text'})
    df2 = df[['ID','glove2', 'Sub-task A', 'Sub-task B']].rename(columns={'glove2':'Text'})
    df3 = df[['ID','glove3', 'Sub-task A', 'Sub-task B']].rename(columns={'glove3':'Text'})
    df4 = df[['ID','glove4', 'Sub-task A', 'Sub-task B']].rename(columns={'glove4':'Text'})
    df5 = df[['ID','glove5', 'Sub-task A', 'Sub-task B']].rename(columns={'glove5':'Text'})
    
    # concatenate dataframes
    result = pd.concat([df0,df1,df2,df3,df4,df5])
    
    return result
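
The same wide-to-long reshaping can be sketched with `DataFrame.melt`, which avoids one line per augmented column. The dataframe `wide`, the helper `reshape_with_melt` and the extra `source` column (which records where each row came from) are assumptions for this example and not part of the notebook's pipeline.

```python
import pandas as pd

# Tiny stand-in for one of the per-class dataframes; the rows are made up.
wide = pd.DataFrame({
    "ID": ["a1", "a2"],
    "Text": ["first text", "second text"],
    "Sub-task A": ["CAG", "CAG"],
    "Sub-task B": ["GEN", "GEN"],
    "glove1": ["first txt", "second texts"],
    "glove2": ["1st text", "second word"],
})

def reshape_with_melt(df):
    """Stack the original and augmented text columns into a single 'Text' column."""
    text_cols = ["Text"] + [c for c in df.columns if c.startswith("glove")]
    long = df.melt(
        id_vars=["ID", "Sub-task A", "Sub-task B"],
        value_vars=text_cols,
        var_name="source",    # which column the text came from
        value_name="_text",   # temporary name; must not clash with an existing column
    ).rename(columns={"_text": "Text"})
    return long[["ID", "Text", "Sub-task A", "Sub-task B", "source"]]

long = reshape_with_melt(wide)
print(long)
```

Keeping a `source` column makes it possible to separate original from augmented rows later, for example to keep augmented text out of a validation split.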

In [43]:
# reshape data
train_1_reshaped = reshape_data(train_1)
train_2_reshaped = reshape_data(train_2)
train_3_reshaped = reshape_data(train_3)
train_4_reshaped = reshape_data(train_4)
train_5_reshaped = reshape_data(train_5)

In [44]:
# concatenate the untouched class 0 rows with the GloVe-augmented classes
final_df = pd.concat([train_0, train_1_reshaped, train_2_reshaped, 
                      train_3_reshaped, train_4_reshaped, train_5_reshaped])

In [45]:
final_df.shape

(9373, 5)

In [46]:
# review new distribution of classes for Task-A
final_df['Sub-task A'].value_counts(normalize=True)

NAG    0.431559
CAG    0.289982
OAG    0.278459
Name: Sub-task A, dtype: float64

In [47]:
# review new distribution of classes for Task-B
final_df['Sub-task B'].value_counts(normalize=True)

NGEN    0.802198
GEN     0.197802
Name: Sub-task B, dtype: float64

In [48]:
final_df.to_csv('drive/MyDrive/w266/release-files/eng/trac2_eng_train_augm_glove_emb.csv', index=False)