# Final Project

## TRAC 2 - EDA (Easy Data Augmentation Techniques)

In this notebook we augment the TRAC-2 training data using the data augmentation process described in the paper *EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks*.

Paper: https://arxiv.org/abs/1901.11196

GitHub: https://github.com/jasonwei20/eda_nlp


## Package imports

In [1]:
import pandas as pd
import numpy as np
import re

In [2]:
pip install -U nltk

Note: you may need to restart the kernel to use updated packages.


In [3]:
import nltk; nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /Users/isabel/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Load and prepare training data

In [4]:
# Load the oversampled training sets for Sub-task A (aggression) and Sub-task B (gendered aggression)
train_data_a = pd.read_csv('../../../data/release-files/eng/trac2_eng_train_oversampled_task_A.csv')
train_data_b = pd.read_csv('../../../data/release-files/eng/trac2_eng_train_oversampled_task_B.csv')

In [5]:
# eliminate newline characters
train_data_a = train_data_a.replace('\n',' ', regex=True)
train_data_b = train_data_b.replace('\n',' ', regex=True)

In [6]:
# Flag instances that contain no letters and drop them; they cause problems in the EDA algorithm
train_data_a['filter_out'] = train_data_a['Text'].map(lambda x: 0 if bool(re.search('[a-zA-Z]', x)) else 1)
train_data_a = train_data_a[train_data_a['filter_out']==0]

train_data_b['filter_out'] = train_data_b['Text'].map(lambda x: 0 if bool(re.search('[a-zA-Z]', x)) else 1)
train_data_b = train_data_b[train_data_b['filter_out']==0]
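
The same letter filter can be written as a boolean mask, without the helper `filter_out` column. This is a minimal sketch on a toy frame whose rows are invented for illustration; `na=False` is an added assumption so that missing texts are dropped as well.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Text": ["hello world", "12345", "???", "ok 1", None],
    "Sub-task A": ["NAG", "NAG", "CAG", "OAG", "NAG"],
})

# Keep only rows whose text contains at least one ASCII letter.
# na=False treats a missing text as "no letter", so that row is dropped too.
has_letter = toy["Text"].str.contains(r"[a-zA-Z]", regex=True, na=False)
filtered = toy[has_letter]

print(filtered["Text"].tolist())
```

Note that the mask keeps the original index labels, so any later `drop` by label still refers to the same rows.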

In [7]:
# Drop a few additional examples (by index label) that cause problems for the EDA algorithm
train_data_a = train_data_a.drop([517, 1310, 2406])
train_data_b = train_data_b.drop([591, 1550, 1942])

In [8]:
train_data_a.shape

(10104, 5)

In [9]:
train_data_b.shape

(7887, 5)

In [10]:
# Select columns in the order expected by the EDA script: label first, then text
train_data_a = train_data_a[['Sub-task A', 'Text']]
train_data_b = train_data_b[['Sub-task B', 'Text']]

In [11]:
train_data_a.to_csv('../../../eda_nlp/data/trac-2-forEDA-taskA.txt', sep='\t', index=False, header=False)
train_data_b.to_csv('../../../eda_nlp/data/trac-2-forEDA-taskB.txt', sep='\t', index=False, header=False)
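
Before running the augmentation, it can help to check that every exported line has the `label<TAB>text` shape. The sketch below writes a toy frame (rows invented for illustration) to an in-memory buffer instead of a file and counts the tabs per line; it is a sanity check under those assumptions, not part of the EDA code.

```python
import io
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"Sub-task A": ["NAG", "OAG"],
                    "Text": ["next part", "who is she"]})

# Write with the same options as above, but into memory
buffer = io.StringIO()
toy.to_csv(buffer, sep="\t", index=False, header=False)
lines = buffer.getvalue().splitlines()

# Each line should contain exactly one tab separating label and text
bad = [ln for ln in lines if ln.count("\t") != 1]

print(lines)
print(bad)
```

An empty `bad` list means no text contained a stray tab or newline that would break the two-column layout.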

## Augment Data

In [12]:
# go to the directory where the EDA code is
%cd ../../../eda_nlp

/Users/isabel/SynologyDrive/Data_Science/09-W266_NPL_Deep_Learning/05-Final_project/eda_nlp


### Synonym augmentation with random deletion

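Replace roughly 10% of the words in each sentence with synonyms (`alpha_sr=0.1`) and delete each word with probability 0.05 (`alpha_rd=0.05`). Random insertion and random swap are set to 0 (`alpha_ri=0.0`, `alpha_rs=0.0`).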
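
To make the two operations concrete, here is a minimal sketch that loosely follows the paper's description. It is not the code in `augment.py`: the function names and the `TOY_SYNONYMS` table are hypothetical, and the table stands in for the WordNet lookup that EDA uses.

```python
import random

# Hypothetical toy synonym table; EDA looks synonyms up in WordNet instead.
TOY_SYNONYMS = {"next": ["following"], "part": ["division", "section"]}

def random_deletion(words, p, rng):
    """Drop each word independently with probability p; never return an empty list."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]

def synonym_replacement(words, alpha, rng, synonyms=TOY_SYNONYMS):
    """Replace about alpha * len(words) words (at least one) that have a known synonym."""
    n = max(1, int(alpha * len(words)))
    candidates = [i for i, w in enumerate(words) if w in synonyms]
    rng.shuffle(candidates)
    out = list(words)
    for i in candidates[:n]:
        out[i] = rng.choice(synonyms[out[i]])
    return out

rng = random.Random(0)  # seeded so repeated runs give the same result
sentence = "next part of the video".split()
replaced = synonym_replacement(sentence, alpha=0.1, rng=rng)
deleted = random_deletion(sentence, p=0.05, rng=rng)

print(" ".join(replaced))
print(" ".join(deleted))
```

With short sentences, `max(1, ...)` means at least one word is replaced even when 10% of the length rounds down to zero.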


In [13]:
# augment for Task A
!python code/augment.py \
--input=data/trac-2-forEDA-taskA.txt \
--output=data/trac-2_augmented_EDA_taskA.txt \
--num_aug=3 \
--alpha_sr=0.1 \
--alpha_rd=0.05 \
--alpha_ri=0.0 \
--alpha_rs=0.0

generated augmented sentences with eda for data/trac-2-forEDA-taskA.txt to data/trac-2_augmented_EDA_taskA.txt with num_aug=3


In [14]:
# augment for Task B
!python code/augment.py \
--input=data/trac-2-forEDA-taskB.txt \
--output=data/trac-2_augmented_EDA_taskB.txt \
--num_aug=3 \
--alpha_sr=0.1 \
--alpha_rd=0.05 \
--alpha_ri=0.0 \
--alpha_rs=0.0

generated augmented sentences with eda for data/trac-2-forEDA-taskB.txt to data/trac-2_augmented_EDA_taskB.txt with num_aug=3


In [33]:
# read back augmented data
df_augm_a = pd.read_csv('data/trac-2_augmented_EDA_taskA.txt', sep='\t', names=['Sub-task A', 'Text'])
df_augm_b = pd.read_csv('data/trac-2_augmented_EDA_taskB.txt', sep='\t', names=['Sub-task B', 'Text'])

In [34]:
df_augm_a

| | Sub-task A | Text |
|---|---|---|
| 0 | NAG | following part |
| 1 | NAG | next part |
| 2 | NAG | next part |
| 3 | NAG | iii mllllllm o lplppi |
| 4 | NAG | iii mllllllm mdxfvb type o lplppi |
| ... | ... | ... |
| 30307 | OAG | who is arundhati roy is she above constitution... |
| 30308 | OAG | who is arundhati roy is she above constitution... |
| 30309 | OAG | bakwaas baate hai ye talking about liberals an... |
| 30310 | OAG | bakwaas baate hai ye talking about liberals an... |


In [35]:
df_augm_b

| | Sub-task B | Text |
|---|---|---|
| 0 | NGEN | next division |
| 1 | NGEN | next part |
| 2 | NGEN | next part |
| 3 | NGEN | trine mllllllm mdxfvb o lplppi |
| 4 | NGEN | mllllllm mdxfvb o |
| ... | ... | ... |
| 23656 | GEN | vidya harish bhandary she is a failed soft por... |
| 23657 | GEN | vidya harish bhandary she is a failed soft por... |
| 23658 | GEN | i dont know much about homosexuals but i defin... |
| 23659 | GEN | i dont know much about homosexuals but i defin... |


In [36]:
!pwd

/Users/isabel/SynologyDrive/Data_Science/09-W266_NPL_Deep_Learning/05-Final_project/eda_nlp


In [37]:
# create a dataframe with all the data (original + augmented)

train_data_a = pd.read_csv('../data/release-files/eng/trac2_eng_train_oversampled_task_A.csv')
train_data_b = pd.read_csv('../data/release-files/eng/trac2_eng_train_oversampled_task_B.csv')

train_data_a = train_data_a[['Text', 'Sub-task A']]
train_data_b = train_data_b[['Text', 'Sub-task B']]

final_a = pd.concat([train_data_a,df_augm_a], ignore_index=True)
final_b = pd.concat([train_data_b,df_augm_b], ignore_index=True)
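
The original frames list `Text` first while the augmented frames list the label first. `pd.concat` aligns on column names rather than positions, so the differing order is harmless. A minimal sketch with toy rows (invented for illustration):

```python
import pandas as pd

# Same column names, different column order
original = pd.DataFrame({"Text": ["next part"], "Sub-task A": ["NAG"]})
augmented = pd.DataFrame({"Sub-task A": ["NAG"], "Text": ["following part"]})

# Rows are stacked; columns are matched by name
combined = pd.concat([original, augmented], ignore_index=True)

print(list(combined.columns))
print(combined["Text"].tolist())
```

The result takes its column order from the first frame and contains no missing values, because both frames share exactly the same column names.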

In [38]:
final_a.shape

(40437, 2)

In [39]:
final_b.shape

(31569, 2)

In [40]:
# shuffle dataframes
final_a = final_a.sample(frac=1)
final_b = final_b.sample(frac=1)
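
The shuffle above yields a different row order on every run. If a repeatable order is wanted, one option is to pass `random_state` to `sample`. The sketch below uses a toy frame, and the seed value 42 is an arbitrary choice for illustration, not something the notebook uses.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"Text": list("abcdef"),
                    "Sub-task A": ["NAG", "CAG", "OAG"] * 2})

# Same seed, same permutation; reset_index discards the old row labels
shuffled_1 = toy.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_2 = toy.sample(frac=1, random_state=42).reset_index(drop=True)

print(shuffled_1.equals(shuffled_2))
```

Shuffling only reorders rows, so the class proportions reported by `value_counts` are unchanged.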

In [41]:
final_a['Sub-task A'].value_counts(normalize=True)

OAG    0.333853
CAG    0.333853
NAG    0.332295
Name: Sub-task A, dtype: float64

In [42]:
final_b['Sub-task B'].value_counts(normalize=True)

GEN     0.500998
NGEN    0.499002
Name: Sub-task B, dtype: float64

In [43]:
# save file
final_a.to_csv('../data/release-files/eng/trac2_eng_train_EDA_task_A.csv', index=False)
final_b.to_csv('../data/release-files/eng/trac2_eng_train_EDA_task_B.csv', index=False)