# Dataset Creation using Data Augmentation in NLP for Tweets Sentiment Analysis







### Loading and Cleaning the tweets

Download the tweets CSV file from https://canvas.instructure.com/courses/2734517/files/138795503 and upload it when prompted by the cell below.

In [None]:
#Upload tweets csv file
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

In [None]:
import pandas as pd
df_tweets = pd.read_csv('tweets.csv')
df_tweets.head()

Unnamed: 0,tweets,labels
0,Obama has called the GOP budget social Darwini...,1
1,"In his teen years, Obama has been known to use...",0
2,IPA Congratulates President Barack Obama for L...,0
3,RT @Professor_Why: #WhatsRomneyHiding - his co...,0
4,RT @wardollarshome: Obama has approved more ta...,1


In [None]:
print("Number of records in the dataset:", df_tweets.shape[0])
print("Labels and count per label:\n",df_tweets.labels.value_counts())

Number of records in the dataset: 1364
Labels and count per label:
 0    931
1    352
2     81
Name: labels, dtype: int64


#### Cleaning the tweets

While exploring the data (or rather, midway through data augmentation), I found that the tweets need to be cleaned. The following are removed or replaced with a space:

1. HTTP links
2. @ references to other Twitter accounts. Although the handles could be useful as proper nouns once the '@' is removed, they create noise during back translation
3. Other non-alphanumeric characters such as #

In [None]:
import re

#Regex patterns (raw strings so the backslashes reach the regex engine unchanged)
re_twitter_handle = r"\s@[\w]+|^@[\w]+"
re_http_links = r"https?:\/\/.[\w.?&]+"
re_special_chars_removal = r"[^0-9a-zA-Z\.]" #to remove non-alphanumeric characters like hashtags (periods are kept)
combined_pattern = r'|'.join([re_twitter_handle,re_http_links,re_special_chars_removal])

def tweet_cleaner(tweet):
  tweet = re.sub(r"\s[R][T]\b|^([R][T])\b"," ",tweet)  #removing the token RT
  tweet = re.sub(combined_pattern," ",tweet)  #replace handles, links and special characters with a space
  tweet = " ".join([t for t in tweet.split(" ") if t != ""])  #drop empty tokens left by repeated spaces
  return tweet

###Example of tweet cleaning
tweet = tweet_cleaner("RT @deklekl_wdq #ejeida http://t_co.co ")
print(tweet)


ejeida
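
To see what each pattern contributes, the sketch below applies the same regular expressions one after another (rather than as one combined alternation) to a made-up tweet. The sample text, the function name `clean_stepwise`, and `STRICT_LINK_PATTERN` are assumptions introduced for illustration; only the other patterns come from the cell above. It also shows that the notebook's link pattern stops at the first `/`, so the path of a shortened URL can survive as a stray token.

```python
import re

RT_PATTERN = r"\s[R][T]\b|^([R][T])\b"
HANDLE_PATTERN = r"\s@[\w]+|^@[\w]+"
LINK_PATTERN = r"https?:\/\/.[\w.?&]+"   # pattern used in this notebook
STRICT_LINK_PATTERN = r"https?://\S+"    # assumed alternative, not from the notebook
SPECIAL_CHARS_PATTERN = r"[^0-9a-zA-Z\.]"

def clean_stepwise(text, link_pattern):
    """Apply the cleaning patterns one at a time, then collapse whitespace."""
    text = re.sub(RT_PATTERN, " ", text)             # drop the RT token
    text = re.sub(HANDLE_PATTERN, " ", text)         # drop @handles
    text = re.sub(link_pattern, " ", text)           # drop links
    text = re.sub(SPECIAL_CHARS_PATTERN, " ", text)  # drop remaining symbols
    return " ".join(text.split())

sample = "RT @some_user: check #this out http://t.co/abc123 now"  # hypothetical tweet
print(clean_stepwise(sample, LINK_PATTERN))         # check this out abc123 now
print(clean_stepwise(sample, STRICT_LINK_PATTERN))  # check this out now
```

Whether the leftover URL fragments matter depends on the downstream model; the stricter pattern is only one possible way to remove them.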


In [None]:
df_tweets.loc[7,"tweets"]

'RT @ohgirlphrase: American kid "You\'re from the UK? Ohhh cool, So do you have tea with the Queen?". British kid: "Do you like, go to Mcdonalds with Obama?'

In [None]:
print("Pre-cleaning :",df_tweets.loc[7,"tweets"])
df_tweets["tweets"] = df_tweets["tweets"].apply(tweet_cleaner)
print("Post-cleaning : ",df_tweets.loc[7,"tweets"])

Pre-cleaning : RT @ohgirlphrase: American kid "You're from the UK? Ohhh cool, So do you have tea with the Queen?". British kid: "Do you like, go to Mcdonalds with Obama?
Post-cleaning :  American kid You re from the UK Ohhh cool So do you have tea with the Queen . British kid Do you like go to Mcdonalds with Obama


### Data Augmentation:

We will follow the data augmentation techniques from the paper *"EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks"* (https://arxiv.org/pdf/1901.11196.pdf), together with back translation.

The paper describes four operations:

1. **Synonym Replacement (SR)**: Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
2. **Random Insertion (RI)**: Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.
3. **Random Swap (RS)**: Randomly choose two words in the sentence and swap their positions. Do this n times.
4. **Random Deletion (RD)**: Randomly remove each word in the sentence with probability p.
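
Synonym Replacement and Random Insertion are not implemented in this notebook, so here is a minimal sketch of how they could look. It assumes a tiny hand-written synonym table and stop-word set (`SYNONYMS`, `STOP_WORDS`) in place of the WordNet lookup used in the paper; the function names and the sample sentence are also made up for illustration.

```python
import random

# Hypothetical toy synonym table; the paper draws synonyms from WordNet.
SYNONYMS = {
    "happy": ["glad", "cheerful"],
    "quick": ["fast", "rapid"],
    "budget": ["plan"],
}
STOP_WORDS = {"the", "a", "is", "with", "and"}

def synonym_replacement(words, n, rng):
    """Replace up to n non-stop words that have an entry in SYNONYMS."""
    new_words = list(words)
    candidates = [i for i, w in enumerate(new_words)
                  if w not in STOP_WORDS and w in SYNONYMS]
    for i in rng.sample(candidates, min(n, len(candidates))):
        new_words[i] = rng.choice(SYNONYMS[new_words[i]])
    return new_words

def random_insertion(words, n, rng):
    """Insert a synonym of a random eligible word at a random position, n times."""
    new_words = list(words)
    for _ in range(n):
        candidates = [w for w in new_words
                      if w not in STOP_WORDS and w in SYNONYMS]
        if not candidates:
            break
        synonym = rng.choice(SYNONYMS[rng.choice(candidates)])
        new_words.insert(rng.randint(0, len(new_words)), synonym)
    return new_words

rng = random.Random(0)
tokens = "the quick senator is happy with the budget".split()
print(synonym_replacement(tokens, 2, rng))
print(random_insertion(tokens, 2, rng))
```

Both functions return a new list and leave the input tokens untouched, which keeps the original tweet available for the other augmentations.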

In this exercise, we will use three techniques for augmentation:

1. **Random Swap**
2. **Random Deletion**
3. **Back Translation**



In [None]:
#We need to augment only the training dataset and not the validation or test dataset.
#Hence let us do the train/test split to separate the train and validation datasets
#Let us create two separate csv files - one for Train, one for Validation

from sklearn.model_selection import train_test_split

#Stratify on the labels of df_tweets so both splits keep the same class proportions
train, valid = train_test_split(df_tweets, random_state=43, test_size=0.15, stratify=df_tweets["labels"])
train.to_csv("train_tweets.csv")
valid.to_csv("valid_tweets.csv")

In [None]:
#Split the train dataset between the different data augmentation functions
#This can be done in different ways; this is just one way to show how we can augment the data
#Let us set aside 25% of the train set for back translation and use the remaining 75% for random deletion and random swap

df_rd_rs = train.sample(frac=0.75)
df_back_trans = train.drop(df_rd_rs.index)

In [None]:

import random
random.seed(50)

def random_deletion(words, p=0.5): #words: list of words/tokens from a sentence
    if len(words) == 1: # return as-is if there is only a single word
        return words
    remaining = [w for w in words if random.uniform(0,1) > p] # drop each word with probability p
    if len(remaining) == 0: # if nothing is left, return one random word
        return [random.choice(words)]
    else:
        return remaining

def random_swap(sentence, n=5): #sentence: list of words/tokens from a sentence
    sentence = list(sentence) # work on a copy so the caller's list is not modified
    sen_len = len(sentence)
    if sen_len > 1:
        for _ in range(n): # swap two randomly chosen positions, n times
            idx1, idx2 = random.sample(range(sen_len), 2)
            sentence[idx1], sentence[idx2] = sentence[idx2], sentence[idx1]
    return sentence
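
As a quick sanity check of the two operations, the sketch below restates them in compact form and verifies two properties: deletion never adds tokens, and swapping only reorders them. The names `delete_tokens` and `swap_tokens` and the explicit `rng` argument are assumptions made so the example is self-contained and repeatable; they are not the notebook's functions.

```python
import random
from collections import Counter

def delete_tokens(words, p, rng):
    """Restatement of random deletion: keep each token with probability 1 - p."""
    if len(words) == 1:
        return list(words)
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]

def swap_tokens(words, n, rng):
    """Restatement of random swap that works on a copy of the input."""
    words = list(words)
    if len(words) > 1:
        for _ in range(n):
            i, j = rng.sample(range(len(words)), 2)
            words[i], words[j] = words[j], words[i]
    return words

rng = random.Random(50)
tokens = "Obama has called the GOP budget social Darwinism".split()

deleted = delete_tokens(tokens, 0.5, rng)
swapped = swap_tokens(tokens, 5, rng)

print(len(deleted) <= len(tokens))          # True: deletion never adds tokens
print(Counter(swapped) == Counter(tokens))  # True: swapping only reorders tokens
```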

In [None]:
import re

df_aug_rd_rs =  pd.DataFrame(columns=train.columns)
def update_aug_ds(tweet, tweet_rd, tweet_rs, label):
  index = len(df_aug_rd_rs.index)
  df_aug_rd_rs.loc[index] = [tweet,label]
  df_aug_rd_rs.loc[index + 1] = [tweet_rd,label]
  df_aug_rd_rs.loc[index + 2] = [tweet_rs, label]
  #print(df_aug_rd_rs.head(n=20))
  return

for i, row in df_rd_rs.iterrows():
  try:
    tweet = row["tweets"]
    label = row["labels"]
    tweet_tokens = tweet.split(" ")

    #We will call Random deletion and Random Swap once for each tweet from df_rd_rs
    tweet_rd = " ".join(random_deletion(tweet_tokens))
    tweet_rs = " ".join(random_swap(tweet_tokens))
    #print(f"%s\n%s\n%s\n"%(tweet, tweet_rd,tweet_rs))

    #Appending to a new DataFrame which will have both the train set as well as augmented tweets
    #We can use shuffle in Train iterator for better training
    #This is to show how the augmentation has been done
    update_aug_ds(tweet,tweet_rd, tweet_rs, label)
  except Exception as e:
    print(e)  #Python 3 exceptions have no .message attribute
  



In [None]:
for i,row in df_aug_rd_rs.head(n=15).iterrows():
  print(row["tweets"])
  if ((i+1)%3==0):
    print("\n")

#Tweet
#Tweet with Random Deletion
#Tweet with Random Swap


WhatsRomneyHiding The person who refuses to let Obama be clear
person Obama
Obama The WhatsRomneyHiding who clear to let person be refuses


American kid You re from the UK Ohhh cool So do you have tea with the Queen . British kid Do you like go to Mcdonalds with Obama
You from the UK Ohhh cool you have with Queen kid to with Obama
American kid You re from you UK Ohhh cool So do you Mcdonalds tea with . kid the British Do Queen the like go to have with Obama


Sharing ztufxsn1 Czech Press More is know about the Birth of Jesus a millenia ago than about Obama
Sharing ztufxsn1 about the of millenia Obama
More ztufxsn1 Czech is Jesus Press know about the Birth of about Obama millenia ago than Sharing a


American kid You re from the UK Ohhh cool So do you have tea with the Queen . British kid Do you like go to Mcdonalds with Obama
American the So do have tea with Queen kid you like go to Mcdonalds
American cool You So from the UK Ohhh kid re do you . you with the Queen have British kid Do 

In [None]:
print("Number of training + augmented samples: ",len(df_aug_rd_rs))

Number of training + augmented samples:  2607
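
The count above can be reproduced by hand. The sketch below assumes that `train_test_split` rounds the validation size up, that `DataFrame.sample(frac=...)` rounds to the nearest integer, and that no row was skipped by the exception handler.

```python
import math

n_total = 1364                        # records in the dataset
n_valid = math.ceil(0.15 * n_total)   # assumed: validation size is rounded up
n_train = n_total - n_valid
n_rd_rs = round(0.75 * n_train)       # assumed: sample(frac=0.75) rounds to nearest
n_augmented = 3 * n_rd_rs             # original + random deletion + random swap

print(n_valid, n_train, n_rd_rs, n_augmented)  # 205 1159 869 2607
```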




---

### Back Translation

In [None]:
!pip install google_trans_new



In [None]:
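from google_trans_new import google_translator
#googletrans is never imported in this notebook; this assumes the installed google_trans_new ships its language table in constant.py
from google_trans_new.constant import LANGUAGES
available_langs = list(LANGUAGES.keys())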

In [None]:
trans_lang = random.choice(available_langs)  
translator = google_translator()  
translate_text = translator.translate(
      translator.translate('We are okay',lang_tgt=trans_lang),
      lang_src = trans_lang,
      lang_tgt='en')

In [None]:
translate_text


'we are fine '

In [None]:
trans_lang

'fr'
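
Back translation is a round trip: English to a pivot language and back to English. The sketch below separates that round trip from the translation service so it can be tried offline. `back_translate_with`, `fake_translate`, and the lookup table are hypothetical names invented for this illustration; they are not part of google_trans_new.

```python
def back_translate_with(translate, text, pivot_lang):
    """Translate text into pivot_lang and back to English with the given callable."""
    pivot_text = translate(text, src="en", tgt=pivot_lang)
    return translate(pivot_text, src=pivot_lang, tgt="en")

# Hypothetical offline stand-in for a translation service.
FAKE_TABLE = {
    ("en", "fr", "We are okay"): "Nous allons bien",
    ("fr", "en", "Nous allons bien"): "We are fine",
}

def fake_translate(text, src, tgt):
    """Look the text up in FAKE_TABLE; return it unchanged when there is no entry."""
    return FAKE_TABLE.get((src, tgt, text), text)

print(back_translate_with(fake_translate, "We are okay", "fr"))  # We are fine
```

Passing the translator in as an argument also makes it easy to swap services or to add retries without touching the augmentation loop.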

In [None]:
df_aug_back_trans =  pd.DataFrame(columns=train.columns)
def update_aug(tweet, tweet_trans, label):
  index = len(df_aug_back_trans.index)
  df_aug_back_trans.loc[index] = [tweet,label]
  df_aug_back_trans.loc[index + 1] = [tweet_trans,label]
  return

def back_translate(sentences):
  trans_lang = random.choice(available_langs)  
  translator = google_translator()  
  translate_text = translator.translate(
      translator.translate(sentences,lang_tgt=trans_lang),
      lang_src = trans_lang,
      lang_tgt='en')
  return translate_text

In [None]:
for i,row in df_back_trans.iterrows():
  tweet, label = row["tweets"],row["labels"]

  translated_tweet = back_translate(tweet)

  #re.sub takes (pattern, replacement, string); strip special characters from the translation
  tweet_trans = re.sub(re_special_chars_removal," ",str(translated_tweet))

  update_aug(tweet, tweet_trans, label)

  

In [None]:
len(df_aug_back_trans)

590

In [None]:
for i,row in df_aug_back_trans.tail(n=10).iterrows():
  print(row["tweets"])
  if ((i+1)%2==0):
    print("\n")

This just pissed me the hell off. If Obama were white he d be Mitt Romney.
That's just angry. If he was Obama White is D Mitt Romney. 


Obama being all direct BarackObama So what s Romney hiding Tweet to demand he release his tax returns. WhatsRomneyHiding
Obama is all immediate Barackobama, so that the Tweet hides to require the release of his tax returns. Whatsromneyhiding 


Rep. Drop charges against anti Obama Marine Military News News From Afghanistan Iraq And Military Times 4ZV5rueN
Rep. 


WhatsRomneyHiding a bloody plan to put guns in cartel hands to force stricter gun laws on Americans Nope. That s Obama. FastAndFurious
WhatsromneyneyHiding is a bloody plan to keep a blood gun gun to strengthen the strict gun law in a bloody plan of Americans. That Obama. Fast and mad 


Obama has called the GOP budget social Darwinism. Nice try but they believe in social creationism.
Obama called Social Darvinism of GOP. Nice try but believe in social creatism. 




In [None]:
df_aug = pd.concat([df_aug_rd_rs,df_aug_back_trans])
len(df_aug)

3197

Conclusion of Data Augmentation 
TO DO

In [None]:
df_aug.to_csv("augmented_tweets_train.csv")
