# Sentiment Analysis

This notebook trains a classifier on a robust, labelled Twitter corpus (obtained from a Kaggle challenge) so that the model can tag scraped tweets as either 0 (negative) or 1 (positive).

## Loading and Splitting Data

In [28]:
'''
Get the dataset.
'''
import os
import numpy as np
import pandas as pd
import pprint
from sklearn.model_selection import train_test_split

data = open("kaggleTweets.csv", "rb")
df = pd.read_csv(data, error_bad_lines=False, usecols=['Sentiment', 'SentimentText'], encoding="utf-8")
df.head()

Unnamed: 0,Sentiment,SentimentText
0,0,is so sad for my APL frie...
1,0,I missed the New Moon trail...
2,1,omg its already 7:30 :O
3,0,.. Omgaga. Im sooo im gunna CRy. I'...
4,0,i think mi bf is cheating on me!!! ...


In [29]:
'''
Split the dataset into a training set and a test set.
We use an 80:20 ratio.
'''

train, test = train_test_split(df, test_size=0.2, random_state=42)
print(len(train))
print(len(test))

1262891
315723
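
The two sizes printed above add up to the full corpus and match an 80:20 split. As far as I know, `train_test_split` rounds a fractional `test_size` up and gives the remaining rows to the training set; the snippet below is only an arithmetic check under that assumption and does not touch the data.

```python
import math

# Row counts printed by the cell above.
n_train_reported = 1262891
n_test_reported = 315723
n_samples = n_train_reported + n_test_reported

# Assumed rounding rule: the test size is rounded up, the rest goes to training.
n_test = math.ceil(0.2 * n_samples)
n_train = n_samples - n_test

print('total=%d train=%d test=%d' % (n_samples, n_train, n_test))
```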


In [30]:
'''
drop the sentiment from test set
'''
test_set = test.drop("Sentiment", axis=1)
test_set.head()

Unnamed: 0,SentimentText
1432703,http://www.popsugar.com/2999655 keep voting fo...
166675,@GamrothTaylor I am starting to worry about yo...
1224710,sunburned...no sunbaked! ow. it hurts to sit.
606184,Celebrating my 50th birthday by doing exactly ...
1487194,Leah and Aiden Gosselin are the cutest kids on...


## Data Pre-processing

In [31]:
import re

def preprocessor(text):
    
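    # Replace HTML tags with a space.
    text = re.sub(r'<[^>]*>', ' ', text)
    # Remove hyperlinks that are followed by a space (a link at the very end of a tweet is kept).
    text = re.sub(r'(http|https)://[^ ]+ ', '', text)
    # Delete runs of two or more whitespace characters; note that this joins the neighbouring words.
    text = re.sub(r'\s\s+', '', text)
    # Find emoticons. [^T_T] is a negated class, so ':', ';' or '=' followed by
    # any character other than 'T' or '_' also counts as a match (e.g. ':3' in '7:30').
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P|[^T_T])', text)
    # Lowercase, replace non-word runs with a space, and append the emoticons (without '-') at the end.
    text = re.sub(r'[\W]+', ' ', text.lower()) + ''.join(emoticons).replace('-', '')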
    return text

In [32]:
preprocessor(df.loc[1432703, "SentimentText"])    #example

u'keep voting for robert pattinson in the popsugar100 as well '

In [33]:
df['SentimentText'] = df['SentimentText'].apply(preprocessor)

In [34]:
df.head()

Unnamed: 0,Sentiment,SentimentText
0,0,is so sad for my apl friend
1,0,i missed the new moon trailer
2,1,omg its already 7 30 o:3:O
3,0,omgaga im soooim gunna cry i ve been at this ...
4,0,i think mi bf is cheating on me t_t


In [35]:
# Processing into tokens
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    return text.split()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [36]:
# example run
tokenizer_porter(df.loc[1432703, "SentimentText"])

[u'keep',
 u'vote',
 u'for',
 u'robert',
 u'pattinson',
 u'in',
 u'the',
 u'popsugar100',
 u'as',
 u'well']

In [37]:
import nltk
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /home/pc/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [38]:
# Store the English stop words in a list.
from nltk.corpus import stopwords
stop = stopwords.words('english')
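
To make the role of `stop` concrete, the sketch below filters a token list against a stop-word collection. The helper `remove_stopwords` and the four-word `stop_subset` are illustrative assumptions (the words do appear in NLTK's English list, which is much longer); the pipeline later in the notebook passes the list to `TfidfVectorizer` rather than calling a helper like this.

```python
def remove_stopwords(tokens, stop_words):
    """Return the tokens that are not in stop_words, preserving order."""
    stop_set = set(stop_words)  # set lookup keeps the filter fast on long lists
    return [tok for tok in tokens if tok not in stop_set]

# Hypothetical small subset standing in for the full NLTK list.
stop_subset = ['for', 'in', 'the', 'as']

# Stemmed tokens from the example run earlier in the notebook.
tokens = ['keep', 'vote', 'for', 'robert', 'pattinson',
          'in', 'the', 'popsugar100', 'as', 'well']

filtered = remove_stopwords(tokens, stop_subset)
print(filtered)
```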

## Training a supervised learning classifier

In [44]:
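# Prepare the dataset to be operated upon by GridSearchCV.
# Note: .loc slices by label and includes both endpoints, so row 50000
# ends up in both the training and the test subset.
X_train = df.loc[:50000, "SentimentText"].values
y_train = df.loc[:50000, "Sentiment"].values
X_test = df.loc[50000:100000, "SentimentText"].values
y_test = df.loc[50000:100000, "Sentiment"].values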
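
Since `.loc` slices by label and includes the end label, the two subsets above share one row. The toy frame below (made-up data, six rows instead of 100001) illustrates the difference between label-based and position-based slicing; switching the notebook to `.iloc` would be one way to get disjoint subsets, but the reported accuracies come from the `.loc` version.

```python
import pandas as pd

# Hypothetical miniature stand-in for df.
toy = pd.DataFrame({'Sentiment': [0, 1, 0, 1, 0, 1],
                    'SentimentText': ['a', 'b', 'c', 'd', 'e', 'f']})

# Label-based slicing: both endpoints are included.
loc_train = toy.loc[:3]
loc_test = toy.loc[3:5]
shared_loc = set(loc_train.index) & set(loc_test.index)

# Position-based slicing: the end position is excluded.
iloc_train = toy.iloc[:3]
iloc_test = toy.iloc[3:6]
shared_iloc = set(iloc_train.index) & set(iloc_test.index)

print('loc:  %d training rows, shared labels %s' % (len(loc_train), sorted(shared_loc)))
print('iloc: %d training rows, shared labels %s' % (len(iloc_train), sorted(shared_iloc)))
```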

In [45]:
# Using GridSearchCV to find the best parameters for the classifier (LogisticRegression)
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)

In [None]:
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
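
The counts in this log line follow directly from `param_grid`: each of the two dictionaries expands to the Cartesian product of its value lists, and every candidate is fitted once per fold. The sketch below reproduces the arithmetic with the standard library only; the strings standing in for `stop` and for the tokenizer functions are placeholders, and `count_candidates` is a hypothetical helper.

```python
from itertools import product

# Value lists of the first dictionary in param_grid (placeholders for objects).
shared = {
    'vect__ngram_range': [(1, 1)],
    'vect__stop_words': ['stop', None],
    'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
    'clf__penalty': ['l1', 'l2'],
    'clf__C': [1.0, 10.0, 100.0],
}
# The second dictionary adds two single-valued settings.
no_idf = dict(shared, vect__use_idf=[False], vect__norm=[None])

def count_candidates(grid):
    """Number of parameter combinations in one grid dictionary."""
    return sum(1 for _ in product(*grid.values()))

n_candidates = count_candidates(shared) + count_candidates(no_idf)
n_fits = n_candidates * 5  # cv=5

print('%d candidates, %d fits' % (n_candidates, n_fits))
```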


In [42]:
print('Best parameter set: %s '% gs_lr_tfidf.best_params_)
print('CV Accuracy: %.3f'%gs_lr_tfidf.best_score_)

Best parameter set: {'vect__ngram_range': (1, 1), 'vect__tokenizer': <function tokenizer_porter at 0x7efe8d2f8d70>, 'clf__penalty': 'l2', 'clf__C': 1.0, 'vect__stop_words': None} 
CV Accuracy: 0.761


In [43]:
clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))

Test Accuracy: 0.715
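
For a classifier pipeline, `score` reports mean accuracy, i.e. the share of test tweets whose predicted label equals the true label. A minimal sketch of that computation with made-up labels (the helper `accuracy` and both lists are illustrative and not taken from the dataset):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / float(len(y_true))

# Hypothetical labels: 0 = negative, 1 = positive.
y_true_toy = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred_toy = [0, 1, 0, 0, 1, 1, 0, 1]

toy_accuracy = accuracy(y_true_toy, y_pred_toy)
print('Toy accuracy: %.3f' % toy_accuracy)
```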
