## Preprocessing of the tweets
In this notebook, we go through the steps needed to clean the Sentiment140 dataset.
Here are the preprocessing steps we are going to perform:
1. __HTML decoding__: The text field still contains HTML entities such as “&amp;” and “&quot;” that have not been converted to text. We decode them with the Python BeautifulSoup library.
<br>
2. __Usernames__: Users often include Twitter usernames in their tweets, using the @user mention to direct their messages. We believe that these mentions give no insight into the sentiment of the tweet, so we remove them.
<br>
3. __Usage of links__: Users very often include links in their tweets. Like usernames, links do not help sentiment classification, so we remove them using regex patterns.
<br>
4. __Hashtag__: The text of a hashtag can carry useful information about the tweet, so removing it together with the hashtag sign would be risky. We therefore leave the text intact and remove only the ‘#’.
<br>
5. __Handling negation__: We build a dictionary that maps each negative contraction to its expanded form, e.g. “mustn’t” : “must not”, “haven’t” : “have not”.
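
As a minimal illustration of steps 1 and 4, the sketch below decodes HTML entities and strips the ‘#’ sign while keeping the hashtag text. It uses the standard-library `html.unescape` as a stand-in for BeautifulSoup (an assumption made to keep the example dependency-free), and the helper names `decode_html` and `strip_hash_sign` are hypothetical, not part of this notebook.

```python
import html
import re

def decode_html(text):
    # Stand-in for the BeautifulSoup step: turn entities such as &amp; into characters.
    return html.unescape(text)

def strip_hash_sign(text):
    # Drop only the '#' character; the hashtag text itself is kept.
    return re.sub(r"#", "", text)

sample = "Tom &amp; Jerry say &quot;hi&quot; #happy"
decoded = decode_html(sample)
print(decoded)                   # Tom & Jerry say "hi" #happy
print(strip_hash_sign(decoded))  # Tom & Jerry say "hi" happy
```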


In [1]:
#importing the necessary libraries 
import numpy as np
import pandas as pd
import seaborn as sns
import re
from bs4 import BeautifulSoup
from nltk.tokenize import WordPunctTokenizer

In [2]:
#loading the dataset
cols = ["sentiment","id","date","query","user","text"]
data = pd.read_csv("../data/tweets.csv", names = cols,header = None, encoding = "ISO-8859-1")

In [3]:
#dropping useless columns 
columns_to_drop = ["id","date","query","user"]
data = data.drop(columns_to_drop,axis =1)

In [4]:
#since the column "sentiment" is the class label, we move it to the last position
data = data[['text','sentiment']]
data.head(5)

| | text | sentiment |
|---|---|---|
| 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... | 0 |
| 1 | is upset that he can't update his Facebook by ... | 0 |
| 2 | @Kenichan I dived many times for the ball. Man... | 0 |
| 3 | my whole body feels itchy and like its on fire | 0 |
| 4 | @nationwideclass no, it's not behaving at all.... | 0 |


### Definition of regex and core functions
Below we define regex patterns that remove @username mentions, links starting with http or www, and every character that is not a letter (which also removes numbers). To avoid applying these patterns one by one, we combine them with the join method and store the result in combined_regex.<br>

Since this substitution is repeated often, we define a function called remove_pattern that takes the target text and the regex as input.

In [6]:
#@user regex
mention_regex = "@[^ ]*"
#http regex
http_regex = r'https?://[^ ]+'
#www regex (the dot is escaped so that it only matches a literal ".")
www_regex = r'www\.[^ ]+'
#anything that is not a letter
letters_only_regex = "[^a-zA-Z]"
#combining all the regex
combined_regex = r'|'.join((mention_regex, http_regex,www_regex,letters_only_regex))
#"ï¿½" is the U+FFFD replacement character as it appears when read as ISO-8859-1
bom_removing = "ï¿½"
#negation words 
negations_dic = {"i'm":"i am ","it's":"it is","isn't":"is not", "aren't":"are not", "wasn't":"was not", "weren't":"were not",
                "haven't":"have not","hasn't":"has not","hadn't":"had not","won't":"will not",
                "wouldn't":"would not", "don't":"do not", "doesn't":"does not","didn't":"did not",
                "can't":"can not","couldn't":"could not","shouldn't":"should not","mightn't":"might not",
                "mustn't":"must not","shoulda":"should have","ive":"i have ","i've":"i have","that's":"this is"}

neg_pattern = re.compile(r'\b(' + '|'.join(negations_dic.keys()) + r')\b')

#removing patterns 
def remove_pattern(input_text,regex):
    clean = re.sub(regex," ",input_text)
    return clean

#to lowercase
def to_lowercase(text):
    return text.lower()
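
To see what the combined pattern does, here is a small self-contained sketch that rebuilds the four patterns and applies them to a made-up tweet. The sample text and the final whitespace collapsing are assumptions added for illustration; in the notebook, the leftover spaces are handled later by the tokenizer.

```python
import re

# Same idea as the patterns above; the dot in "www." is escaped here.
mention_regex = r"@[^ ]*"
http_regex = r"https?://[^ ]+"
www_regex = r"www\.[^ ]+"
letters_only_regex = r"[^a-zA-Z]"
combined_regex = "|".join((mention_regex, http_regex, www_regex, letters_only_regex))

def remove_pattern(input_text, regex):
    # Every match is replaced by a single space.
    return re.sub(regex, " ", input_text)

sample = "@bob check www.example.com and http://t.co/abc now!"
cleaned = remove_pattern(sample, combined_regex)
# Collapse the runs of spaces left behind by the substitutions.
print(" ".join(cleaned.split()))  # check and now
```

Because alternatives are tried from left to right, a link is consumed as a whole by the http or www pattern before the letters-only pattern gets a chance to split it into pieces.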

The function __clean_tweet__ is the main function of the process: it takes a tweet as input and applies all the preprocessing steps described at the beginning.<br>

First we lowercase the text with the to_lowercase() function and convert it into a soup object with BeautifulSoup, which lets us get rid of the HTML encoding. We also replace the leftover “ï¿½” sequence with “?”.<br>

Next we apply the negation pattern and then the combined_regex.<br>

Finally we tokenize the result with WordPunctTokenizer from the nltk library, which splits a text into a sequence of alphabetic and non-alphabetic characters. We drop the tokens of a single character and join the remaining tokens with spaces.



In [7]:
def clean_tweet(text):
    tok = WordPunctTokenizer()
    text_lowercase = to_lowercase(text)
    soup = BeautifulSoup(text_lowercase,"lxml")
    souped = soup.get_text()
    
    #replacing the leftover "ï¿½" sequence
    bom_removed = souped.replace(bom_removing, "?")
    
    text_neg_handled = neg_pattern.sub(lambda x: negations_dic[x.group()],bom_removed)
    text_clean = remove_pattern(text_neg_handled,combined_regex)
    text_tokenized = [x for x  in tok.tokenize(text_clean) if len(x) > 1]
    
    return (" ".join(text_tokenized)).strip()
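
The order of the last steps matters: contractions have to be expanded before the letters-only pattern runs, because that pattern replaces the apostrophe with a space. The sketch below shows the difference on one sentence. It uses a reduced, hypothetical dictionary `negations` and two hypothetical helpers, and it leaves out the tokenization and the single-character filter of clean_tweet.

```python
import re

# Reduced dictionary, for illustration only.
negations = {"can't": "can not", "isn't": "is not", "don't": "do not"}
neg_pattern = re.compile(r"\b(" + "|".join(map(re.escape, negations)) + r")\b")
letters_only = re.compile(r"[^a-zA-Z]")

def expand_then_strip(text):
    # Expand contractions first, then drop every non-letter character.
    expanded = neg_pattern.sub(lambda m: negations[m.group()], text.lower())
    return " ".join(letters_only.sub(" ", expanded).split())

def strip_then_expand(text):
    # Wrong order: the apostrophe is gone before the dictionary is applied.
    stripped = letters_only.sub(" ", text.lower())
    return " ".join(neg_pattern.sub(lambda m: negations[m.group()], stripped).split())

print(expand_then_strip("I can't sleep"))  # i can not sleep
print(strip_then_expand("I can't sleep"))  # i can t sleep
```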

The __tweet_preprocess__ function simply applies clean_tweet to every tweet in the dataset.

In [8]:
def tweet_preprocess():
    clean=[]
    nums = [0,400000,800000,1200000,1600000]
    print ("Cleaning and parsing the tweets...\n")
    #cleaning the whole dataset
    for i in range(nums[0],nums[4]):
        if( (i+1)%100000 == 0 ):
            print ("Tweets %d of %d has been processed" % ( i+1, nums[4] ) )                                                                   
        clean.append(clean_tweet(data["text"][i]))
    return clean

In [9]:
clean_text = tweet_preprocess()

Cleaning and parsing the tweets...

Tweets 100000 of 1600000 has been processed
Tweets 200000 of 1600000 has been processed
Tweets 300000 of 1600000 has been processed
Tweets 400000 of 1600000 has been processed
Tweets 500000 of 1600000 has been processed
Tweets 600000 of 1600000 has been processed
Tweets 700000 of 1600000 has been processed
Tweets 800000 of 1600000 has been processed
Tweets 900000 of 1600000 has been processed
Tweets 1000000 of 1600000 has been processed
Tweets 1100000 of 1600000 has been processed
Tweets 1200000 of 1600000 has been processed
Tweets 1300000 of 1600000 has been processed
Tweets 1400000 of 1600000 has been processed
Tweets 1500000 of 1600000 has been processed
Tweets 1600000 of 1600000 has been processed


In [10]:
data["clean_text"] = clean_text

In [11]:
data.head(5)

| | text | sentiment | clean_text |
|---|---|---|---|
| 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... | 0 | awww this is bummer you should have got david ... |
| 1 | is upset that he can't update his Facebook by ... | 0 | is upset that he can not update his facebook b... |
| 2 | @Kenichan I dived many times for the ball. Man... | 0 | dived many times for the ball managed to save ... |
| 3 | my whole body feels itchy and like its on fire | 0 | my whole body feels itchy and like its on fire |
| 4 | @nationwideclass no, it's not behaving at all.... | 0 | no it is not behaving at all am mad why am her... |


In [40]:
for i in range(0,400):
    print("Entry %s :" % (i))
    print("original : %s" % (data["text"][i]))
    print("cleaned : %s" % (data["clean_text"][i]))

Entry 0 :
original : @switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D
cleaned : awww this is bummer you should have got david carr of third day to do it
Entry 1 :
original : is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah!
cleaned : is upset that he can not update his facebook by texting it and might cry as result school today also blah
Entry 2 :
original : @Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds
cleaned : dived many times for the ball managed to save the rest go out of bounds
Entry 3 :
original : my whole body feels itchy and like its on fire 
cleaned : my whole body feels itchy and like its on fire
Entry 4 :
original : @nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there. 
cleaned : no it is not behaving at all am mad why am here because can not see y

In the output above, some tweets become empty after processing, for example the tweet in entry 249: its content was only a @username mention, so nothing is left after preprocessing. Let's remove those empty tweets from the dataset.

In [12]:
#dropping empty tweets
data["clean_text"] = data["clean_text"].replace("", np.nan)
data = data.dropna()
#keeping only the cleaned text and the label (this drops the raw "text" column)
data = data[["clean_text","sentiment"]]
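
The replace-then-dropna idiom can be checked on a toy frame. The values below are made up for illustration, and the `reset_index` call is an optional addition that the notebook itself does not perform.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "clean_text": ["good morning", "", "so tired"],
    "sentiment": [4, 0, 0],
})

# Mark empty strings as missing, then drop the rows that contain a missing value.
toy["clean_text"] = toy["clean_text"].replace("", np.nan)
toy = toy.dropna(subset=["clean_text"]).reset_index(drop=True)

print(len(toy))                    # 2
print(toy["clean_text"].tolist())  # ['good morning', 'so tired']
```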

In [13]:
data["sentiment"].value_counts()

0    798193
4    797841
Name: sentiment, dtype: int64

Before the preprocessing we had 800000 positive and 800000 negative tweets. After the preprocessing we have 798193 negative tweets and 797841 positive tweets. <br>

Since we can't use the whole dataset for our experiment for computational reasons, we decided to keep 50% of it. We use the train_test_split function of the sklearn library to do that; because the split is random, the two classes stay approximately balanced.

In [25]:
from sklearn.model_selection import train_test_split
SEED = 2000
df1 , df2 =train_test_split(data, test_size=.5, random_state=SEED)
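
A plain random split keeps the classes balanced only approximately. scikit-learn's `train_test_split` accepts a `stratify` argument for a per-class split. The dependency-free sketch below illustrates the same idea with a hypothetical helper `stratified_half`; it is not the code used in this notebook.

```python
import random
from collections import defaultdict

def stratified_half(labels, seed=2000):
    # Group row positions by class label.
    by_label = defaultdict(list)
    for position, label in enumerate(labels):
        by_label[label].append(position)

    rng = random.Random(seed)
    kept = []
    for positions in by_label.values():
        # Shuffle within each class, then keep half of that class.
        rng.shuffle(positions)
        kept.extend(positions[: len(positions) // 2])
    return sorted(kept)

labels = [0] * 6 + [4] * 4
kept = stratified_half(labels)
print([labels[i] for i in kept])  # [0, 0, 0, 4, 4]
```

Each class contributes exactly half of its rows, so the class proportions of the subset match those of the full data.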

In [21]:
df1["sentiment"].value_counts()

4    399469
0    398548
Name: sentiment, dtype: int64

In [22]:
df1.to_csv("tweets_subset.csv",encoding="utf-8",header= None)

In [26]:
#mapping the labels: 0 stays 0 (negative), 4 becomes 1 (positive)
#working on an explicit copy avoids pandas' SettingWithCopyWarning
df1 = df1.copy()
df1["sentiment"] = df1["sentiment"].replace({4: 1})


In [27]:
df1["sentiment"].value_counts()


1    399469
0    398548
Name: sentiment, dtype: int64

In [28]:
#storing the subset in a csv file
df1.to_csv("tweets_subset.csv",encoding="utf-8",header= None)