## Sentiment Analysis

### Steps:

1) Raw data cleaning and preprocessing

2) Cleaned data exploration and analysis

3) Model training and best model selection

4) Prediction experiment with some random tweets

### Data cleaning and Preprocessing

In [1]:
# importing modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
import string
import nltk
import re

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [2]:
# customizing plot

plt.rcParams['figure.figsize'] = 8, 6
plt.style.use('ggplot')

In [3]:
#ignore warnings

import warnings                                                     
warnings.filterwarnings("ignore", category=DeprecationWarning)


In [4]:
#loading dataset
train=pd.read_csv("./Dataset/train.csv",encoding="latin-1")      #encoding="latin-1" to avoid unicode decode error
test=pd.read_csv("./Dataset/test.csv",encoding="latin-1")


In [5]:
train.shape     # (rows, columns) of the train data

(99989, 3)

In [6]:
test.shape      # (rows, columns) of the test data

(299989, 2)

In [7]:
train.head()     # preview the first five rows of the raw data

|   | ItemID | Sentiment | SentimentText |
|---|--------|-----------|---------------|
| 0 | 1 | 0 | is so sad for my APL frie... |
| 1 | 2 | 0 | I missed the New Moon trail... |
| 2 | 3 | 1 | omg its already 7:30 :O |
| 3 | 4 | 0 | .. Omgaga. Im sooo im gunna CRy. I'... |
| 4 | 5 | 0 | i think mi bf is cheating on me!!! ... |


In [8]:
train[["SentimentText"]].loc[99959]

SentimentText    @CT415 @UCLA_Bruin  it made me sad too!  that ...
Name: 99959, dtype: object

#### 1.1 Removing Twitter Handles (e.g. @CT415, @UCLA_Bruin)

Twitter handles carry little sentiment information, so we remove them from the data.
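
As a minimal, standalone sketch of the idea (not the notebook's own cell), handle removal can be expressed as a single regex substitution. The sample string is abbreviated from the tweet shown above, and the names `HANDLE_PATTERN` and `strip_handles` are hypothetical, introduced only for illustration.

```python
import re

# Sample text abbreviated from the tweet at index 99959 (illustration only).
sample = "@CT415 @UCLA_Bruin it made me sad too!"

# "@" followed by zero or more word characters (letters, digits, underscore).
HANDLE_PATTERN = re.compile(r"@\w*")

def strip_handles(text):
    """Remove every @handle, then trim the whitespace left at the edges."""
    return HANDLE_PATTERN.sub("", text).strip()

cleaned = strip_handles(sample)
print(cleaned)  # it made me sad too!
```

One caveat of this pattern: it also removes the `@...` part of anything that merely looks like a handle, such as the domain prefix in an email address.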

In [9]:
# combine train and test so every cleaning step is applied to both at once
# (DataFrame.append was removed in pandas 2.0; pd.concat is its replacement)

combined = pd.concat([train, test], ignore_index=True, sort=False)

In [10]:
# helper that removes every match of a regex pattern (here: words starting with @)

def remove_pattern(input_txt, pattern):
    # Substitute the pattern directly. Removing each match in turn can leave
    # fragments behind when one handle is a prefix of another (e.g. @a and @ab).
    return re.sub(pattern, '', input_txt)

In [12]:
# create a new column clean_text holding the tweet text with words starting with @ removed

combined['clean_text'] = np.vectorize(remove_pattern)(combined['SentimentText'], r"@[\w]*")

#### 1.2 Removing Punctuations, Numbers, and Special Characters

In [42]:
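# replace everything except letters (a-z, A-Z) with spaces; this also drops digits, punctuation and '#'
# regex=True is required in pandas 2.0+, where str.replace treats the pattern literally by default

combined['clean_text'] = combined['clean_text'].str.replace("[^a-zA-Z]", " ", regex=True)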

#### 1.3 Convert all tweets to lowercase

In [17]:
combined["clean_text"] = combined["clean_text"].str.lower()


#### 1.4 Removing leading and trailing whitespace

In [18]:
combined["clean_text"]=combined['clean_text'].apply(lambda x:x.strip())
combined.head()

|   | ItemID | Sentiment | SentimentText | clean_text |
|---|--------|-----------|---------------|------------|
| 0 | 1 | 0.0 | is so sad for my APL frie... | is so sad for my apl friend |
| 1 | 2 | 0.0 | I missed the New Moon trail... | i missed the new moon trailer |
| 2 | 3 | 1.0 | omg its already 7:30 :O | omg its already o |
| 3 | 4 | 0.0 | .. Omgaga. Im sooo im gunna CRy. I'... | omgaga im sooo im gunna cry i ve been at th... |
| 4 | 5 | 0.0 | i think mi bf is cheating on me!!! ... | i think mi bf is cheating on me t t |


#### 1.5 Tokenization

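Tokenization is the process of splitting text into smaller pieces called tokens (words, numbers, punctuation marks, and so on). Because the cleaned text now contains only lowercase letters and spaces, splitting on whitespace is sufficient here.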
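
To illustrate what "tokens" can mean in practice, here is a small sketch comparing two tokenizers on one of the raw tweets shown earlier. The function names `whitespace_tokenize` and `regex_tokenize` are hypothetical, and the regex variant is only one of many possible token definitions.

```python
import re

text = "omg its already 7:30 :O"  # raw tweet from train.head()

def whitespace_tokenize(s):
    """Split on runs of whitespace; punctuation stays attached to its word."""
    return s.split()

def regex_tokenize(s):
    """Emit runs of word characters and single punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", s)

print(whitespace_tokenize(text))  # ['omg', 'its', 'already', '7:30', ':O']
print(regex_tokenize(text))       # ['omg', 'its', 'already', '7', ':', '30', ':', 'O']
```

On the cleaned text, which contains only letters and spaces, both approaches give the same tokens, so the simpler `split()` is a reasonable choice.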

In [33]:
tokens=combined["clean_text"].apply(lambda x:x.split())
tokens.head()

0                  [is, so, sad, for, my, apl, friend]
1                 [i, missed, the, new, moon, trailer]
2                               [omg, its, already, o]
3    [omgaga, im, sooo, im, gunna, cry, i, ve, been...
4       [i, think, mi, bf, is, cheating, on, me, t, t]
Name: clean_text, dtype: object

#### 1.6 Removing Stop words

Stop words are very common words (such as "is", "so", "for" and "my") that carry little meaning on their own and are usually removed.
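
The filtering itself is a simple membership test. The sketch below uses a tiny hand-picked stop list (an assumption made so the example runs without downloading any corpus; NLTK's English list is much longer), and `remove_stop_words` is a hypothetical helper name.

```python
# A tiny hand-picked stop list for illustration only.
STOP_WORDS = {"i", "is", "so", "for", "my", "the", "am", "not"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop list."""
    return [tok for tok in tokens if tok not in stop_words]

print(remove_stop_words(["is", "so", "sad", "for", "my", "apl", "friend"]))
# ['sad', 'apl', 'friend']

# Negations can flip sentiment, so one option is to keep them out of the stop list.
keep_negations = STOP_WORDS - {"not"}
print(remove_stop_words(["i", "am", "not", "sad"], keep_negations))
# ['not', 'sad']
```

NLTK's English stop list also contains negations such as "not", so for sentiment tasks it may be worth checking whether removing them hurts the model.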

In [41]:
# requires the NLTK stop word corpus; run nltk.download('stopwords') once if it is missing
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
tokens=tokens.apply(lambda x: [item for item in x if item not in stop_words])
tokens.head()

0                                   [sad, apl, friend]
1                         [missed, new, moon, trailer]
2                                       [omg, already]
3    [omgaga, im, sooo, im, gunna, cry, dentist, si...
4                            [think, mi, bf, cheating]
Name: clean_text, dtype: object

#### 1.7 Stemming

Stemming is the process of reducing words to their stem, base, or root form, e.g. books → book, looking → look. The stem need not be a dictionary word.
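
To make the idea of suffix stripping concrete, here is a deliberately crude toy stemmer. It is not the Porter algorithm, which applies many more rules in several phases; the function `toy_stem` and its three-suffix rule list are assumptions made purely for illustration.

```python
def toy_stem(word):
    """Strip one common suffix if at least three characters remain.

    A toy rule set for illustration, far simpler than the Porter stemmer.
    """
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ["books", "looking", "missed", "sad"]:
    print(w, "->", toy_stem(w))
# books -> book
# looking -> look
# missed -> miss
# sad -> sad
```

Rules this naive over-strip easily (for example, "this" becomes "thi"), which is one reason to prefer a well-tested stemmer.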



In [45]:
from nltk.stem import PorterStemmer
stemmer= PorterStemmer()

In [46]:
tokens_stemmed=tokens.apply(lambda x:[stemmer.stem(a) for a in x])

In [47]:
tokens_stemmed.head()

0                                   [sad, apl, friend]
1                           [miss, new, moon, trailer]
2                                       [omg, alreadi]
3    [omgaga, im, sooo, im, gunna, cri, dentist, si...
4                               [think, mi, bf, cheat]
Name: clean_text, dtype: object

#### 1.8 Lemmatization

Lemmatization maps each word to its dictionary form (lemma). When no part-of-speech tag is supplied, WordNetLemmatizer treats every token as a noun, so verb forms such as "missed" and "cheating" are left unchanged.

In [51]:
# requires the WordNet corpus; run nltk.download('wordnet') once if it is missing
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
tokens_normalized=tokens.apply(lambda x:[wordnet_lemmatizer.lemmatize(word) for word in x])
tokens_normalized.head()


0                                   [sad, apl, friend]
1                         [missed, new, moon, trailer]
2                                       [omg, already]
3    [omgaga, im, sooo, im, gunna, cry, dentist, si...
4                            [think, mi, bf, cheating]
Name: clean_text, dtype: object