## Data Analysis and preprocessing

In this notebook we explore the dataset and preprocess it so that it can be used later to train ML and DL models.

In [1]:
# !pip install numpy pandas matplotlib seaborn scikit-learn nltk

#### Imports

In [22]:
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from time import time

import warnings
warnings.filterwarnings("ignore")

#### Loading train and test dataset

In [23]:
dataset_dir = Path("../dataset")
train_dataset_path = dataset_dir/"train.csv"
test_dataset_path = dataset_dir/"test.csv"

In [24]:
# load train and test dataset
train = pd.read_csv(open(train_dataset_path,"r"), header=None)
test = pd.read_csv(open(test_dataset_path,"r"), header=None)

#### Preprocess TRAIN dataset

In [25]:
# head of the dataset
train.head()

Unnamed: 0,0,1,2,3,4,5
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [26]:
train.tail()

Unnamed: 0,0,1,2,3,4,5
1599995,4,2193601966,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,AmandaMarie1028,Just woke up. Having no school is the best fee...
1599996,4,2193601969,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,TheWDBoards,TheWDB.com - Very cool to hear old Walt interv...
1599997,4,2193601991,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,bpbabe,Are you ready for your MoJo Makeover? Ask me f...
1599998,4,2193602064,Tue Jun 16 08:40:49 PDT 2009,NO_QUERY,tinydiamondz,Happy 38th Birthday to my boo of alll time!!! ...
1599999,4,2193602129,Tue Jun 16 08:40:50 PDT 2009,NO_QUERY,RyanTrevMorris,happy #charitytuesday @theNSPCC @SparksCharity...


<b>For the sentiment classifier we need only two columns: `0` (sentiment_score) and `5` (the actual tweet).</b>

In [27]:
train = train[[0,5]].copy()  # copy so that later assignments do not write to a view of the original frame
train.head()

Unnamed: 0,0,5
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."


In [28]:
# do the same for test dataset
test = test[[0,5]].copy()  # copy so that later assignments do not write to a view of the original frame
test.head()

Unnamed: 0,0,5
0,4,@stellargirl I loooooooovvvvvveee my Kindle2. ...
1,4,Reading my kindle2... Love it... Lee childs i...
2,4,"Ok, first assesment of the #kindle2 ...it fuck..."
3,4,@kenburbary You'll love your Kindle2. I've had...
4,4,@mikefish Fair enough. But i have the Kindle2...


#### Rename the columns to sentiment_score and tweet

In [29]:
# renaming the column
test.columns = ["sentiment_score","tweet"]
train.columns = ["sentiment_score","tweet"]

In [30]:
test.head()

Unnamed: 0,sentiment_score,tweet
0,4,@stellargirl I loooooooovvvvvveee my Kindle2. ...
1,4,Reading my kindle2... Love it... Lee childs i...
2,4,"Ok, first assesment of the #kindle2 ...it fuck..."
3,4,@kenburbary You'll love your Kindle2. I've had...
4,4,@mikefish Fair enough. But i have the Kindle2...


In [31]:
train.head()

Unnamed: 0,sentiment_score,tweet
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."


#### Replace sentiment_score `4` by `1`

In [32]:
# replace values of columns by using DataFrame.loc[] property.
train.loc[train["sentiment_score"]==4, "sentiment_score"] = 1
test.loc[test["sentiment_score"]==4, "sentiment_score"] = 1

#### Distribution of sentiment_score in both train and test dataset

In [33]:
train.sentiment_score.value_counts()

0    800000
1    800000
Name: sentiment_score, dtype: int64

_The train dataset has 0.8M tweets for each of the labels `0` and `1`._

In [34]:
test.sentiment_score.value_counts()

1    182
0    177
2    139
Name: sentiment_score, dtype: int64

_The test dataset has an additional class `2` that does not occur in the train dataset, so we remove this class from the test dataset._

In [35]:
test = test[test.sentiment_score.isin([0,1])].copy()  # copy so the later tweet assignment acts on an independent frame
test.sentiment_score.value_counts()

1    182
0    177
Name: sentiment_score, dtype: int64
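
The relabelling and class filtering above can also be done in a single pass. The following is only a sketch on a toy frame, assuming the raw labels are `0`, `2` and `4`; the names `toy` and `label_map` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy frame with the raw labels (assumption: 0 = negative, 2 = extra class, 4 = positive)
toy = pd.DataFrame({
    "sentiment_score": [0, 4, 2, 4, 0],
    "tweet": ["bad", "good", "meh", "great", "awful"],
})

# Hypothetical mapping: labels missing from the mapping become NaN
label_map = {0: 0, 4: 1}
toy["sentiment_score"] = toy["sentiment_score"].map(label_map)

# Drop the unmapped rows (class 2) and restore an integer dtype
toy = toy.dropna(subset=["sentiment_score"]).astype({"sentiment_score": int})

print(toy)
```

Because unmapped labels turn into `NaN`, any unexpected class is dropped together with class `2` instead of slipping through silently.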

#### Preprocess the tweet

- remove `http`/`https` links
- remove words preceded by `@` (mentions)
- remove the `#` symbol
- remove `RT` (retweet markers)
- remove words of 17 or more characters (words this long are rare in ordinary text)
- drop short words made of at most two distinct characters (such as 'Awww')
- collapse words made of stretched, repeated characters, for example
    - 'aaahhhh' ---> 'ah'
    - 'wwwwaaahhhhh' ---> 'wah'
- split words that were run together unintentionally, using the `wordninja` library, for example
    - 'ilikethis' ---> 'i like this'
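
The word-level rules in the list above can be sketched in isolation with the standard library. This is a simplified illustration, not the `cleanText` function used in this notebook: the helper names and the `MAX_WORD_LEN` constant are made up for the example, and the thresholds are copied from the checks in `cleanText`.

```python
import re

MAX_WORD_LEN = 16  # assumption: mirrors the `len(word) < 17` check in cleanText


def collapse_repeats(word: str) -> str:
    """Collapse every run of one repeated character into a single character."""
    return re.sub(r"(.)\1+", r"\1", word)


def normalise_word(word: str):
    """Return the cleaned word, or None when the word should be dropped."""
    if len(word) > MAX_WORD_LEN:
        return None  # too long to be an ordinary word
    if len(word) > 6 and len(set(word)) <= 3:
        return collapse_repeats(word)  # stretched word such as 'wwwwaaahhhhh'
    if len(word) > 2 and len(set(word)) <= 2:
        return None  # short word with at most two distinct characters
    return word


print(normalise_word("aaahhhh"))       # ah
print(normalise_word("wwwwaaahhhhh"))  # wah
```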

In [36]:
# function to clean text
# !pip install wordninja
import wordninja
import re
import string
from bs4 import BeautifulSoup
def cleanText(text):
    
    # remove html tags (if any)
    text = BeautifulSoup(text, "html.parser").get_text()  # name the parser explicitly to avoid the bs4 parser warning
    
    # strip some punctuation, then expand contractions (order matters: "won't" must come before "n't")
    text = text.replace("′", " ").replace("’", " ").replace(".","").replace("!"," ")\
                           .replace(",", " ").replace("?"," ")\
                           .replace("won't", "will not").replace("cannot", "can not").replace("can't", "can not")\
                           .replace("n't", " not").replace("what's", "what is").replace("it's", "it is")\
                           .replace("'ve", " have").replace("i'm", "i am").replace("'re", " are")\
                           .replace("he's", "he is").replace("that's","that is").replace("she's", "she is").replace("'s", " own")\
                           .replace("'ll", " will").replace("couldn't","could not")
    
    text = re.sub(r"@[A-Za-z0-9]+", "", text) # removing @mentions
    text = re.sub(r"#", "", text) # removing the "#" symbol
    text = re.sub(r"RT[\s]+", "", text) # removing RT
    text = re.sub(r"https?:\/\/\S+", "", text) # removing hyper links
    text = re.sub(r"\s+", " ", text) # substituting multiple spaces into one
    
    # remove punctuations
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = text.strip()
    
    # Further word-level processing.
    # Looking at the data, some tweets contain very long, stretched words such as
    # aaaaaaaaaaaahhhhhhhhhhhhhhh
    # wwwwaaaaaaaaaaaaaahhhhhhhhhhhhh
    # so we collapse their repeated characters.
    sentence = []
    # wordninja (applied after the loop) splits run-together text, e.g. "iamagoodboy"-->["iam","a","good","boy"]
    for word in text.split():
        # drop words of 17 or more characters. Reference: https://arxiv.org/ftp/arxiv/papers/1207/1207.2334.pdf
        if len(word)<17: # keep only words shorter than 17 characters
            if len(word)>6 and len(set(word))<=3:
                # then, do the processing like
                # wwwwaaaahhhh ---> wah
                temp_word = []
                prev_char = word[0]
                temp_word.append(prev_char)
                for character in list(word):
                    if character!=prev_char:
                        temp_word.append(character)
                        prev_char = character
                
                sentence.append(''.join(temp_word))
            elif len(word)>2 and len(set(word))<=2:
                continue
            else:
                sentence.append(word)
    sentence = ' '.join(sentence)
    sentence = ' '.join(wordninja.split(sentence))
    return sentence

In [37]:
cleanText("@switchfoot aaaaahhhhhhh wwaahhwaahhh wwwwwwwwwwwahhhhhhhhhhhhhhhhhhhhhhh http://twitpic.com/2y1zl - <b>Awww, that's a bummer</b>. iamagoodboy You shoulda got David Carr of Third Day to do it. ;D")

'ah wah wah that is a bummer iam a good boy You should a got David Carr of Third Day to do it D'
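
Note that the mention pattern in `cleanText`, `@[A-Za-z0-9]+`, stops at an underscore, and `RT[\s]+` also matches an `RT` at the end of a longer word. The sketch below shows one possible alternative for the Twitter-specific steps only. The function name and the patterns `@\w+` and `\bRT\s+` are assumptions for illustration, and switching to them would change the preprocessed output reported in this notebook.

```python
import re


def strip_twitter_markup(text: str) -> str:
    """Remove links, mentions, retweet markers and '#' symbols (illustrative sketch)."""
    text = re.sub(r"https?://\S+", "", text)  # links
    text = re.sub(r"@\w+", "", text)          # mentions, including handles with underscores
    text = re.sub(r"\bRT\s+", "", text)       # 'RT' only when it starts a word
    text = text.replace("#", "")              # keep the hashtag word, drop the symbol
    return re.sub(r"\s+", " ", text).strip()  # squeeze whitespace


print(strip_twitter_markup("RT @_TheSpecialOne_ SMART move #win http://twitpic.com/2y1zl"))
```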

In [50]:
t0 = time()
train["tweet"] = train["tweet"].apply(cleanText)
print(f"time taken to preprocess train({train.shape[0]} datapoints) dataset: {time()-t0} s")

time taken to preprocess train(1600000 datapoints) dataset: 1962.190800666809 s


In [51]:
t0 = time()
test["tweet"] = test["tweet"].apply(cleanText)
print(f"time taken to preprocess test({test.shape[0]} datapoints) dataset: {time()-t0} s")

time taken to preprocess test(359 datapoints) dataset: 0.16105151176452637 s


#### Removing rows where tweet is NaN after preprocessing

In [52]:
train.isna().sum(), test.isna().sum()

(sentiment_score    0
 tweet              0
 dtype: int64,
 sentiment_score    0
 tweet              0
 dtype: int64)

In [53]:
# remove rows with NaN in the tweet column, for both train and test
train.dropna(inplace=True)
test.dropna(inplace=True)

In [108]:
# remove the rows where the value of tweet is empty string
train.drop(train[train.tweet==''].index, inplace=True)
test.drop(test[test.tweet==''].index, inplace=True)

In [109]:
train.shape, test.shape

((1595743, 2), (359, 2))
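
The empty-tweet filter can also be written as a boolean mask. The sketch below uses a toy frame. Treating whitespace-only tweets as empty is an extra assumption, since `cleanText` already strips its result.

```python
import pandas as pd

# Toy frame: one real tweet, one empty string, one whitespace-only string
toy = pd.DataFrame({
    "sentiment_score": [0, 1, 1],
    "tweet": ["fine day", "", "   "],
})

# Keep only rows whose tweet still has content after stripping whitespace
has_text = toy["tweet"].str.strip().ne("")
toy = toy[has_text].reset_index(drop=True)

print(toy)
```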

#### Save the preprocessed datasets

In [110]:
# pass the path directly so pandas opens and closes the file itself
train.to_csv(dataset_dir/"train_new.csv", index=False)
test.to_csv(dataset_dir/"test_new.csv", index=False)

_These preprocessed datasets can now be used to train ML and DL models._
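
One thing to watch when reloading the saved files: by default `pd.read_csv` treats strings such as `null` or `NA` as missing values, so a tweet consisting only of such a word would come back as `NaN`. The sketch below shows the effect on a toy frame written to an in-memory buffer; whether such tweets actually occur in this dataset is not checked here.

```python
import io

import pandas as pd

# Toy frame with a tweet that looks like a missing-value marker
toy = pd.DataFrame({"sentiment_score": [0, 1], "tweet": ["null", "good day"]})

buf = io.StringIO()
toy.to_csv(buf, index=False)

buf.seek(0)
default_read = pd.read_csv(buf)                         # "null" is parsed as NaN
buf.seek(0)
literal_read = pd.read_csv(buf, keep_default_na=False)  # "null" stays a string

print(default_read["tweet"].isna().sum())
print(literal_read["tweet"].tolist())
```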

__Wordninja__ is a library that splits words which have been run together.

- We used it in our preprocessing steps
- "iamagoodboy" --> ['iam', 'a', 'good', 'boy']

In [58]:
import wordninja

In [59]:
wordninja.split("iamadatascientistwhetheryoubelieveitornotIdon'tcare.")

['iam',
 'a',
 'data',
 'scientist',
 'whether',
 'you',
 'believe',
 'it',
 'or',
 'not',
 'I',
 "don't",
 'care']