# Final Project

## TRAC 2 - EDA (Easy Data Augmentation Techniques)

In this notebook we augment the TRAC-2 training data using the data augmentation techniques described in the paper "EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks". Only the minority classes are augmented; the majority class (NAG-NGEN) is kept as is.

Paper: https://arxiv.org/abs/1901.11196

GitHub: https://github.com/jasonwei20/eda_nlp


## Package imports

In [1]:
import pandas as pd
import numpy as np

In [2]:
pip install -U nltk

Note: you may need to restart the kernel to use updated packages.


In [3]:
import nltk; nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /Users/isabel/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Load and prepare training data

In [4]:
# Load aggressiveness dataset
train_data = pd.read_csv('../../../data/release-files/eng/trac2_eng_train.csv')

In [5]:
## create a column that encodes every combination of the task A and task B classes
## 0: NAG-NGEN, 1: NAG-GEN, 2: CAG-NGEN, 3: CAG-GEN, 4: OAG-NGEN, 5: OAG-GEN

# create a list of conditions
conditions = [(train_data['Sub-task A'] == 'NAG') & (train_data['Sub-task B'] == 'NGEN'),
              (train_data['Sub-task A'] == 'NAG') & (train_data['Sub-task B'] == 'GEN'), 
              (train_data['Sub-task A'] == 'CAG') & (train_data['Sub-task B'] == 'NGEN'),
              (train_data['Sub-task A'] == 'CAG') & (train_data['Sub-task B'] == 'GEN'),
              (train_data['Sub-task A'] == 'OAG') & (train_data['Sub-task B'] == 'NGEN'),
              (train_data['Sub-task A'] == 'OAG') & (train_data['Sub-task B'] == 'GEN')
             ]
           
# values for each condition
values = [0, 1, 2, 3, 4, 5]

# create a new column 
train_data['combined'] = np.select(conditions, values)
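
The same encoding can also be written as a lookup table. That keeps the inverse mapping available for later and surfaces label pairs that match none of the six combinations (with `np.select`, unmatched rows silently receive the default value 0, the same id as NAG-NGEN). A minimal sketch in plain Python; `PAIR_TO_ID`, `ID_TO_PAIR` and `encode` are illustrative names, not part of the notebook:

```python
# Illustrative lookup table for the six (Sub-task A, Sub-task B) combinations.
PAIR_TO_ID = {
    ("NAG", "NGEN"): 0,
    ("NAG", "GEN"): 1,
    ("CAG", "NGEN"): 2,
    ("CAG", "GEN"): 3,
    ("OAG", "NGEN"): 4,
    ("OAG", "GEN"): 5,
}
# Inverse table: combined id -> (Sub-task A, Sub-task B).
ID_TO_PAIR = {v: k for k, v in PAIR_TO_ID.items()}


def encode(label_a, label_b):
    """Return the combined class id, or raise if the pair is unknown."""
    try:
        return PAIR_TO_ID[(label_a, label_b)]
    except KeyError:
        raise ValueError(f"unexpected label pair: {(label_a, label_b)}") from None


rows = [("NAG", "NGEN"), ("CAG", "GEN"), ("OAG", "GEN")]
combined = [encode(a, b) for a, b in rows]
print(combined)  # [0, 3, 5]
print([ID_TO_PAIR[c] for c in combined])
```

Raising on an unknown pair is a design choice for this sketch; whether the TRAC-2 file actually contains such rows is not checked here.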

In [6]:
train_data['combined'].value_counts()

0    3241
2     418
4     295
5     140
1     134
3      35
Name: combined, dtype: int64

In [7]:
# create a dataframe for each class
train_0 = train_data[train_data['combined'] == 0]
train_1 = train_data[train_data['combined'] == 1]
train_2 = train_data[train_data['combined'] == 2]
train_3 = train_data[train_data['combined'] == 3]
train_4 = train_data[train_data['combined'] == 4]
train_5 = train_data[train_data['combined'] == 5]

In [8]:
# only want to augment the minority classes
# create a dataframe with only the minority classes

df = pd.concat([train_1, train_2, train_3, train_4, train_5], axis=0)
df

Unnamed: 0,ID,Text,Sub-task A,Sub-task B,combined
21,C38.482,its not good i think its all dimaghi keeda bei...,NAG,GEN,1
24,C10.155,Why can't the Indian government take serious a...,NAG,GEN,1
61,C10.427,This mentally ill Lady is a barking Street dog...,NAG,GEN,1
82,C43.12,"Finally, i can legally call indians gay",NAG,GEN,1
85,C10.1402,As per Arundhati she should give her name as K...,NAG,GEN,1
...,...,...,...,...,...
4182,C20.265,"You,re a bitch",OAG,GEN,5
4208,C43.167.1,@Baraqua Amina Levy-Khan So is prests raping c...,OAG,GEN,5
4226,C20.62,Fuck u and your reviews.....,OAG,GEN,5
4254,C38.448,Open bob and vagane,OAG,GEN,5


In [9]:
# eliminate newline characters
df = df.replace('\n',' ', regex=True)

In [10]:
# select columns in the order expected by the EDA script (label first, then text)
df = df[['combined', 'Text']]

In [11]:
df.to_csv('../../../eda_nlp/data/trac-2.txt', sep='\t', index=False, header=False)
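
The EDA script expects one example per line in the form `label<TAB>sentence`, so a tab or newline inside a comment would break the format. The sketch below shows that sanitising and serialising step in plain Python, using made-up sample rows and hypothetical helper names (`to_eda_line`, `parse_eda_line`). It is slightly stricter than the notebook, which only replaces newlines:

```python
def to_eda_line(label, text):
    """Serialise one example as 'label<TAB>sentence' on a single line."""
    # Collapse tabs, newlines and repeated spaces so the text stays in one field.
    clean = " ".join(text.split())
    return f"{label}\t{clean}"


def parse_eda_line(line):
    """Split a 'label<TAB>sentence' line back into (label, sentence)."""
    label, sentence = line.rstrip("\n").split("\t", 1)
    return int(label), sentence


# Made-up rows for illustration only.
samples = [(1, "first line\nsecond line"), (5, "tab\tinside  text")]
lines = [to_eda_line(label, text) for label, text in samples]
for line in lines:
    print(repr(line))
print([parse_eda_line(line) for line in lines])
```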

## Augment Data

In [12]:
# go to the directory where the EDA code is
%cd ../../../eda_nlp

/Users/isabel/SynologyDrive/Data_Science/09-W266_NPL_Deep_Learning/05-Final_project/eda_nlp


### Synonym augmentation with random deletion

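Replace about 10% of the words in each sentence with synonyms (`alpha_sr=0.1`) and delete each word with probability 0.05 (`alpha_rd=0.05`). Random insertion and random swap are disabled (`alpha_ri=0.0`, `alpha_rs=0.0`).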
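
To make the two operations concrete, here is a toy version in plain Python. It is a sketch under stated assumptions, not the `eda_nlp` implementation: the synonym table `TOY_SYNONYMS` is a hypothetical stand-in for WordNet, and the function names and the example sentence are chosen for illustration only.

```python
import random

# Hypothetical toy synonym table; the real EDA code looks synonyms up in WordNet.
TOY_SYNONYMS = {"good": ["fine", "nice"], "think": ["believe"], "movie": ["film"]}


def synonym_replacement(words, alpha, rng):
    """Replace about alpha * len(words) words (at least one) that have a synonym."""
    n = max(1, int(alpha * len(words)))
    candidates = [i for i, w in enumerate(words) if w in TOY_SYNONYMS]
    rng.shuffle(candidates)
    out = list(words)
    for i in candidates[:n]:
        out[i] = rng.choice(TOY_SYNONYMS[out[i]])
    return out


def random_deletion(words, p, rng):
    """Drop each word independently with probability p; never return an empty list."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


rng = random.Random(0)  # seeded so repeated runs give the same result
sentence = "i think this movie is good".split()
replaced = synonym_replacement(sentence, alpha=0.1, rng=rng)
deleted = random_deletion(sentence, p=0.05, rng=rng)
print(" ".join(replaced))
print(" ".join(deleted))
```

With a six-word sentence and `alpha=0.1`, exactly one word is replaced, because the count is floored at one.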


In [13]:
!python code/augment.py \
--input=data/trac-2.txt \
--output=data/trac-2_augmented.txt \
--num_aug=5 \
--alpha_sr=0.1 \
--alpha_rd=0.05 \
--alpha_ri=0.0 \
--alpha_rs=0.0

generated augmented sentences with eda for data/trac-2.txt to data/trac-2_augmented.txt with num_aug=5


In [14]:
# read the augmented data back in
df_augm = pd.read_csv('data/trac-2_augmented.txt', sep='\t', names=['combined', 'Text'])

In [15]:
df_augm

Unnamed: 0,combined,Text
0,1,not good i its all dimaghi keeda being or lesbian
1,1,its not good i think its all dimaghi keeda bei...
2,1,its not good think its all dimaghi keeda being...
3,1,its not good i call up its all dimaghi keeda b...
4,1,its not good i think its all dimaghi keeda bei...
...,...,...
5105,5,abey loudey arnab did u ever see the vedios of...
5106,5,abey loudey arnab did u always see the vedios ...
5107,5,abey loudey arnab did u ever see the vedios of...
5108,5,abey loudey arnab did u ever see the vedios hi...


In [21]:
# now need to create a dataframe with all the data (original + augmented)

# list of conditions to populate labels task A and task B
conditions = [(df_augm['combined'] == 1), 
              (df_augm['combined'] == 2 ),
              (df_augm['combined'] == 3 ),
              (df_augm['combined'] == 4 ),
              (df_augm['combined'] == 5 )
             ]
# values for task A and B
values_a = ['NAG','CAG','CAG','OAG', 'OAG']
values_b = ['GEN','NGEN','GEN','NGEN', 'GEN']

# create columns with labels
df_augm['Sub-task A'] = np.select(conditions, values_a)
df_augm['Sub-task B'] = np.select(conditions, values_b)

df_augm = df_augm[['Text', 'Sub-task A', 'Sub-task B']]


In [24]:
# select the same columns in the majority class dataset
# (df was reduced to ['combined', 'Text'] before export, so it no longer has the
# label columns; the original minority rows are taken from train_1 ... train_5 instead)
train_0 = train_0[['Text', 'Sub-task A', 'Sub-task B']]

In [23]:
df_augm

Unnamed: 0,Text,Sub-task A,Sub-task B
0,not good i its all dimaghi keeda being or lesbian,NAG,GEN
1,its not good i think its all dimaghi keeda bei...,NAG,GEN
2,its not good think its all dimaghi keeda being...,NAG,GEN
3,its not good i call up its all dimaghi keeda b...,NAG,GEN
4,its not good i think its all dimaghi keeda bei...,NAG,GEN
...,...,...,...
5105,abey loudey arnab did u ever see the vedios of...,OAG,GEN
5106,abey loudey arnab did u always see the vedios ...,OAG,GEN
5107,abey loudey arnab did u ever see the vedios of...,OAG,GEN
5108,abey loudey arnab did u ever see the vedios hi...,OAG,GEN


In [25]:
# select same columns in the original data
train_0 = train_0[['Text', 'Sub-task A', 'Sub-task B']]
train_1 = train_1[['Text', 'Sub-task A', 'Sub-task B']]
train_2 = train_2[['Text', 'Sub-task A', 'Sub-task B']]
train_3 = train_3[['Text', 'Sub-task A', 'Sub-task B']]
train_4 = train_4[['Text', 'Sub-task A', 'Sub-task B']]
train_5 = train_5[['Text', 'Sub-task A', 'Sub-task B']]

In [29]:
# concatenate dataframes to produce a final dataframe
final = pd.concat([train_0,train_1,train_2,train_3,train_4,train_5,df_augm], ignore_index=True)

In [30]:
final

Unnamed: 0,Text,Sub-task A,Sub-task B
0,Next part,NAG,NGEN
1,Iii8mllllllm\nMdxfvb8o90lplppi0005,NAG,NGEN
2,🤣🤣😂😂🤣🤣🤣😂osm vedio ....keep it up...make more v...,NAG,NGEN
3,What the fuck was this? I respect shwetabh and...,NAG,NGEN
4,Concerned authorities should bring arundathi R...,NAG,NGEN
...,...,...,...
9368,abey loudey arnab did u ever see the vedios of...,OAG,GEN
9369,abey loudey arnab did u always see the vedios ...,OAG,GEN
9370,abey loudey arnab did u ever see the vedios of...,OAG,GEN
9371,abey loudey arnab did u ever see the vedios hi...,OAG,GEN
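
A quick arithmetic check of the row count, using only numbers printed earlier in this notebook: the concatenation stacks all original rows on top of the augmented file, so the total should match the last index shown (9371, i.e. 9372 rows). The variable names are illustrative.

```python
# Class counts from train_data['combined'].value_counts() above.
class_counts = {0: 3241, 1: 134, 2: 418, 3: 35, 4: 295, 5: 140}

n_original = sum(class_counts.values())   # all rows of the original training set
n_minority = n_original - class_counts[0]  # rows that were sent through EDA
n_augmented = 5109                         # rows in df_augm (last index shown is 5108)

n_final = n_original + n_augmented
print(n_original, n_minority, n_final)  # 4263 1022 9372
```

The augmented file holds roughly five rows per minority example (5109 / 1022).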


In [31]:
# shuffle dataframe
final = final.sample(frac=1)
final

Unnamed: 0,Text,Sub-task A,Sub-task B
5168,true brother bollywood definitely chutiyapa,CAG,NGEN
8340,this man is sick,OAG,NGEN
7743,who is arundhati roy is she above constitution...,OAG,NGEN
4510,she really have randy take care,NAG,GEN
8650,as a citizen i feel nrc is a slipper slap on m...,OAG,NGEN
...,...,...,...
7329,jhand movie is the perfect,OAG,NGEN
2518,Hllo,NAG,NGEN
4580,brother one more question why do u think being...,NAG,GEN
1742,I agree bro you are right,NAG,NGEN
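
`sample(frac=1)` draws a different permutation on every run. If a repeatable order is wanted, `DataFrame.sample` accepts a `random_state` argument. The idea is sketched below in plain Python with a seeded generator; the helper name `shuffled` and the seed value 42 are arbitrary choices, not something the notebook uses:

```python
import random


def shuffled(rows, seed):
    """Return a shuffled copy of rows; the same seed gives the same order."""
    rng = random.Random(seed)
    out = list(rows)
    rng.shuffle(out)
    return out


rows = list(range(10))
first = shuffled(rows, seed=42)
second = shuffled(rows, seed=42)
print(first == second)        # True: same seed, same permutation
print(sorted(first) == rows)  # True: nothing added or lost
```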


In [33]:
final['Sub-task A'].value_counts(normalize=True)

NAG    0.431559
CAG    0.289982
OAG    0.278459
Name: Sub-task A, dtype: float64

In [34]:
final['Sub-task B'].value_counts(normalize=True)

NGEN    0.802198
GEN     0.197802
Name: Sub-task B, dtype: float64
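
For comparison, the label shares before augmentation can be derived from the combined class counts printed earlier. The sketch below does that in plain Python; the dictionary and variable names are illustrative. It gives roughly 79% NAG and 7% GEN before augmentation, against about 43% and 20% in the outputs above.

```python
from collections import Counter

# Class counts from train_data['combined'].value_counts() above.
class_counts = {0: 3241, 1: 134, 2: 418, 3: 35, 4: 295, 5: 140}
# Combined id -> (Sub-task A, Sub-task B), as defined when the column was built.
id_to_labels = {
    0: ("NAG", "NGEN"), 1: ("NAG", "GEN"),
    2: ("CAG", "NGEN"), 3: ("CAG", "GEN"),
    4: ("OAG", "NGEN"), 5: ("OAG", "GEN"),
}

task_a, task_b = Counter(), Counter()
for class_id, n in class_counts.items():
    label_a, label_b = id_to_labels[class_id]
    task_a[label_a] += n
    task_b[label_b] += n

total = sum(class_counts.values())
share_a = {label: n / total for label, n in task_a.items()}
share_b = {label: n / total for label, n in task_b.items()}
print({label: round(s, 4) for label, s in share_a.items()})
print({label: round(s, 4) for label, s in share_b.items()})
```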

In [36]:
!pwd

/Users/isabel/SynologyDrive/Data_Science/09-W266_NPL_Deep_Learning/05-Final_project/eda_nlp


In [37]:
# save file
final.to_csv('../data/release-files/eng/trac2_eng_train_EDA.csv', index=False)