# Sentiment Analysis

In this Notebook, we dig a little "deeper" into sentiment analysis.

Word2Vec attempts to understand meaning and semantic relationships among words. It works in a way that is similar to deep approaches, such as recurrent neural nets or deep neural nets, but is computationally more efficient. This notebook starts with the simpler Bag of Words representation.

Sentiment analysis is a challenging subject in machine learning. People express their emotions in language that is often obscured by sarcasm, ambiguity, and plays on words, all of which can be very misleading for both humans and computers.


## About the Data

LabeledTrainData - The labeled training set. The file is tab-delimited and has a header row followed by 25,000 rows containing an id, sentiment, and text for each review.  

TestData - The test set. The tab-delimited file has a header row followed by 25,000 rows containing an id and text for each review. Our task is to predict the sentiment for each one. 

DATA FIELDS:

id - Unique ID of each review

sentiment - Sentiment of the review; 1 for positive reviews and 0 for negative reviews

review - Text of the review

The data set is associated with:

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). "Learning Word Vectors for Sentiment Analysis." The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

The data was taken from https://www.kaggle.com/c/word2vec-nlp-tutorial/data

## Preprocessing

Before starting on a machine learning task, we have to preprocess the data.

• Word Stemming: Words are reduced to their stemmed form. For example, “discount”, “discounts”, “discounted” and “discounting” are all replaced with “discount”. Sometimes, the Stemmer actually strips additional characters from the end, so “include”, “includes”, “included”, and “including” are all replaced with “includ”.

• Removal of non-words: Non-words and punctuation are removed. All whitespace (tabs, newlines, spaces) is collapsed to a single space character.
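
As a minimal sketch of the non-word removal step, the snippet below keeps letters only, lowercases the text, and collapses whitespace. The `clean` function and the tiny `STOP_WORDS` set are hypothetical names used for illustration; stemming is left out here because it needs NLTK.

```python
import re

# Hypothetical, tiny stop-word list for illustration only;
# the notebook itself uses NLTK's English stop-word list.
STOP_WORDS = {"the", "a", "an", "is", "it", "this", "and"}

def clean(text):
    """Keep letters only, lowercase, collapse whitespace, drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)   # every non-letter becomes a space
    words = letters_only.lower().split()              # split() also collapses runs of whitespace
    return " ".join(w for w in words if w not in STOP_WORDS)

print(clean("This movie is GREAT!!!\tIt's a must-see, 10/10."))
# movie great s must see
```

Note the stray `s` left over from `It's`: replacing the apostrophe with a space splits the contraction, which is a known side effect of this simple rule.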

## Vocabulary List
After preprocessing the reviews, we have a list of words for
each review. The next step is to choose which words we would like to use in
our classifier and which we would want to leave out.
For this exercise, I have chosen only the most frequently occurring words
as the set of words considered (the vocabulary list).
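
Choosing the most frequent words can be sketched with the standard library alone. `build_vocabulary` and the three sample reviews are hypothetical; the notebook itself delegates this step to scikit-learn.

```python
from collections import Counter

def build_vocabulary(cleaned_reviews, n):
    """Return the n most frequent words across all cleaned reviews."""
    counts = Counter(word for review in cleaned_reviews for word in review.split())
    return [word for word, _ in counts.most_common(n)]

# Hypothetical, already cleaned and stemmed reviews.
reviews = ["good movi good act", "bad movi good", "good plot movi act"]
vocab = build_vocabulary(reviews, 3)
print(vocab)
# ['good', 'movi', 'act']
```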

## Extracting Features
We will now implement the feature extraction that converts each review into
a vector of length n, where n is the number of words in the vocabulary
list. Specifically, the feature x(i) ∈ {0, 1} for a review indicates whether
the i-th word in the vocabulary occurs in that review. That is, x(i) = 1 if the i-th
word is in the review and x(i) = 0 if it is not. Here I have used n = 5000.
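
The binary encoding described above can be sketched as follows. `to_feature_vector` and the three-word vocabulary are hypothetical. Keep in mind that scikit-learn's `CountVectorizer`, used later in this notebook, stores word counts rather than 0/1 flags unless it is created with `binary=True`.

```python
def to_feature_vector(cleaned_review, vocabulary):
    """x[i] = 1 if the i-th vocabulary word occurs in the review, else 0."""
    present = set(cleaned_review.split())
    return [1 if word in present else 0 for word in vocabulary]

vocabulary = ["good", "movi", "act"]     # hypothetical 3-word vocabulary (n = 3)
x = to_feature_vector("bad movi bad act", vocabulary)
print(x)
# [0, 1, 1]
```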

## Importing the Libraries 

In [3]:
import numpy as np
import pandas as pd
import re 

The file has a .tsv extension, so the delimiter is a tab. quoting=3 (csv.QUOTE_NONE) tells pandas not to treat double quotes as special characters, which is why the quotes remain in the loaded data.

In [4]:
#importing the dataset
train=pd.read_csv("labeledTrainData.tsv",delimiter='\t',quoting=3)  
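
The value 3 passed as `quoting` is the constant `csv.QUOTE_NONE` from Python's standard `csv` module. The sketch below uses that module directly on a hypothetical two-line sample in the same layout as the training file, to show that double quotes are kept as ordinary characters.

```python
import csv
import io

# Hypothetical sample with the same columns as labeledTrainData.tsv.
sample = 'id\tsentiment\treview\n"1_8"\t1\t"A "great" film"\n'

print(csv.QUOTE_NONE)   # the integer behind quoting=3

# With QUOTE_NONE the reader splits on tabs only and leaves quotes untouched.
rows = list(csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE))
print(rows[1])
```

This is why the `id` and `review` values shown by `train.head()` still carry their surrounding quotes.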

In [5]:
train.shape

(25000, 3)

In [6]:
train.head()

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."


## Cleaning the Text 

We will now start cleaning the text. Before that, we import a few more libraries. To remove stop words and to stem
the remaining words, I have used the NLTK (Natural Language Toolkit) library.


In [7]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords               # for removing the words like 'the' ,'a','an' from the set
from nltk.stem.porter import PorterStemmer      # for making words like loved as love and loving as love       

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/suchith/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
ps=PorterStemmer()
stop_words=set(stopwords.words('english'))   # build the stop-word set once instead of once per word
corpus=[]                         # this contains our reviews after cleaning
for i in range(len(train)):       # iterate through each review and clean it

    # replace everything that is not a letter (digits, punctuation marks and so on) with a space
    Review=re.sub('[^a-zA-Z]',' ',train['review'][i])

    # convert every letter to lower case
    Review=Review.lower()

    # split the string into a list of words so that each word can be filtered and stemmed
    Review=Review.split()

    # remove stop words such as 'the' and 'an', and stem the remaining words
    Review=[ps.stem(word) for word in Review if word not in stop_words]

    # join the cleaned words back into a single string
    Review=' '.join(Review)

    # append the cleaned review to the corpus
    corpus.append(Review)

We have cleaned our data, and it is now ready for building a Bag of Words model.

## Bag Of Words

We use scikit-learn's CountVectorizer to build the Bag of Words model, and I have kept the 5000 most frequent words in the corpus.

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(max_features=5000)
X=cv.fit_transform(corpus).toarray()
print(X.shape)

(25000, 5000)


In [14]:
y=train.iloc[:,1].values
print(y.shape)

(25000,)


## Building A Classifier

In [15]:
from sklearn.model_selection import train_test_split
X_train,X_cross,y_train,y_cross=train_test_split(X,y,test_size=0.2,random_state=42)
print(X_train.shape,X_cross.shape,y_train.shape,y_cross.shape)

(20000, 5000) (5000, 5000) (20000,) (5000,)


Here I will use a Random Forest classifier and a Naive Bayes classifier as our models.

## 1) Random Forest Classifier

In [28]:
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier(n_estimators=100,random_state=0)
rf.fit(X,y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=1, oob_score=False, random_state=0,
            verbose=0, warm_start=False)

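Here I have fitted the model on X, y and not on X_train, y_train. We know that the model overfits the data, so to increase the accuracy on the test data we want more training data. With this change, I observed that the test accuracy of the submission increased from 84.24% to 84.26%. That's a very small gain, I know :)

Because X_cross is part of X, the scores on the cross set below are computed on reviews the model has already seen, so they do not measure how well it generalizes.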
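
To see why a score on rows that were also used for fitting is optimistic, here is a small sketch with a hypothetical `Memorizer` class. It is a deliberately extreme stand-in for an overfitting model, not the Random Forest used in this notebook, and the data is random noise.

```python
import random

class Memorizer:
    """Hypothetical extreme overfitter: it stores every training row verbatim."""
    def fit(self, X, y):
        self.table = {tuple(row): label for row, label in zip(X, y)}
        return self

    def predict(self, X):
        # Rows never seen during fit fall back to label 0.
        return [self.table.get(tuple(row), 0) for row in X]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

rng = random.Random(42)
X = [[rng.random() for _ in range(5)] for _ in range(200)]
y = [rng.randint(0, 1) for _ in range(200)]      # labels are pure noise
X_train, X_cross = X[:160], X[160:]
y_train, y_cross = y[:160], y[160:]

leaky = Memorizer().fit(X, y)                 # held-out rows were seen during fitting
honest = Memorizer().fit(X_train, y_train)    # held-out rows were not seen

print(accuracy(y_cross, leaky.predict(X_cross)))    # 1.0: every row is simply looked up
print(accuracy(y_cross, honest.predict(X_cross)))   # only as good as always guessing 0
```

The labels carry no signal at all, yet the leaky setup reports a perfect score. A perfect confusion matrix on X_cross after fitting on all of X should be read in the same light.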

In [29]:
# making predictions on Cross data set
y_pred_rf=rf.predict(X_cross)

In [30]:
# We will print the confusion matrix
from sklearn.metrics import confusion_matrix,accuracy_score
cm=confusion_matrix(y_cross,y_pred_rf)   # scikit-learn expects (y_true, y_pred)
print(cm)

[[2481    0]
 [   0 2519]]


In [31]:
acc_rf=accuracy_score(y_pred_rf,y_cross)
print(acc_rf)
print(accuracy_score(rf.predict(X_train),y_train))

1.0
1.0


## 2) Naive-Bayes  

In [32]:
from sklearn.naive_bayes import GaussianNB
nb=GaussianNB()
nb.fit(X_train,y_train)

GaussianNB(priors=None)

In [33]:
y_pred_nb=nb.predict(X_cross)
cm1=confusion_matrix(y_cross,y_pred_nb)   # scikit-learn expects (y_true, y_pred): rows are true labels
print(cm1)
acc_nb=accuracy_score(y_cross,y_pred_nb)
print(acc_nb)

[[2133  348]
 [1129 1390]]
0.7046
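
The reported accuracy can be recomputed from the four Naive Bayes counts above: it is the sum of the diagonal divided by the total. The helper `accuracy_from_confusion` is a hypothetical name. The sketch also shows that transposing the matrix, which is what swapping the two arguments of `confusion_matrix` does, leaves the accuracy unchanged.

```python
def accuracy_from_confusion(cm):
    """Correct predictions sit on the diagonal of a confusion matrix."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

cm_nb = [[2133, 1129],
         [348, 1390]]
transposed = [list(col) for col in zip(*cm_nb)]

print(accuracy_from_confusion(cm_nb))        # 0.7046
print(accuracy_from_confusion(transposed))   # 0.7046
```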


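We will choose the Random Forest classifier. Its score of 1.0 above is inflated because it was fitted on all of X, but its test accuracy of about 84% is still better than the 70.46% that Naive Bayes reaches on the cross set.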

## Test Data

In [34]:
test=pd.read_csv("testData.tsv",delimiter='\t',quoting=3)
print(test.shape)
print(test.head(10))

(25000, 2)
           id                                             review
0  "12311_10"  "Naturally in a film who's main themes are of ...
1    "8348_2"  "This movie is a disaster within a disaster fi...
2    "5828_4"  "All in all, this is a movie for kids. We saw ...
3    "7186_2"  "Afraid of the Dark left me with the impressio...
4   "12128_7"  "A very accurate depiction of small time mob l...
5    "2913_8"  "...as valuable as King Tut's tomb! (OK, maybe...
6    "4396_1"  "This has to be one of the biggest misfires ev...
7     "395_2"  "This is one of those movies I watched, and wo...
8   "10616_1"  "The worst movie i've seen in years (and i've ...
9    "9074_9"  "Five medical students (Kevin Bacon, David Lab...


## Cleaning The Test data 

In [23]:
ps1=PorterStemmer()
stop_words=set(stopwords.words('english'))   # build the stop-word set once instead of once per word
corpus1=[]                        # this contains our test reviews after cleaning
for i in range(len(test)):        # iterate through each review and clean it

    # replace everything that is not a letter (digits, punctuation marks and so on) with a space
    Review=re.sub('[^a-zA-Z]',' ',test['review'][i])

    # convert every letter to lower case
    Review=Review.lower()

    # split the string into a list of words so that each word can be filtered and stemmed
    Review=Review.split()

    # remove stop words such as 'the' and 'an', and stem the remaining words
    Review=[ps1.stem(word) for word in Review if word not in stop_words]

    # join the cleaned words back into a single string
    Review=' '.join(Review)

    # append the cleaned review to the test corpus
    corpus1.append(Review)

In [51]:
X_test=cv.transform(corpus1).toarray()
X_test.shape

(25000, 5000)

In [52]:
y_test=rf.predict(X_test)
print(y_test.shape)
output = pd.DataFrame( data={"id":test["id"], "sentiment":y_test} )
print(output.head(10))

(25000,)
           id  sentiment
0  "12311_10"          1
1    "8348_2"          0
2    "5828_4"          1
3    "7186_2"          1
4   "12128_7"          1
5    "2913_8"          0
6    "4396_1"          0
7     "395_2"          0
8   "10616_1"          0
9    "9074_9"          0


In [53]:
output.to_csv( "Bag_of_Words_model.csv", index=False, quoting=3 )  # this submission gets an accuracy of 84.26% with max_features=5000
                                                                        

## Applying Dimensionality Reduction

I have tried various numbers of components, and we will take a number at which the cumulative explained variance becomes greater than 90%:

500 components: 70.7%

1000 components: 82.6%

2000 components: 92.85%

2500 components: 95.33%

3000 components: 97.03%
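
Picking the smallest number of components that reaches a target can be sketched as below. `components_needed` and the `ratios` list are hypothetical; in the notebook the per-component ratios would come from `pca.explained_variance_ratio_`.

```python
from itertools import accumulate

def components_needed(variance_ratios, target):
    """Smallest k whose first k ratios sum to at least target, or None."""
    for k, cumulative in enumerate(accumulate(variance_ratios), start=1):
        if cumulative >= target:
            return k
    return None

# Hypothetical explained-variance ratios, sorted in descending order.
ratios = [0.5, 0.25, 0.125, 0.0625, 0.0625]
print(components_needed(ratios, 0.90))
# 4
```

With NumPy, the same lookup on a cumulative-sum array `var` could be written as `np.searchsorted(var, 0.90) + 1`, assuming the target is actually reached.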

In [46]:
from sklearn.decomposition import PCA
pca=PCA(n_components=2500)
pca.fit(X)

PCA(copy=True, iterated_power='auto', n_components=2500, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [47]:
var=pca.explained_variance_ratio_.cumsum()
print(var.shape,var[499],var[999],var[1999],var[2499])

(2500,) 0.711072874909 0.82870601549 0.928427705475 0.95232176373


In [48]:
X_red=pca.transform(X)
print(X_red.shape)

(25000, 2500)


In [55]:
X_test_red=pca.transform(X_test)
print(X_test_red.shape)

(25000, 2500)


In [56]:
randomforest=RandomForestClassifier(n_estimators=100,random_state=0)
randomforest.fit(X_red,y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=1, oob_score=False, random_state=0,
            verbose=0, warm_start=False)

In [57]:
y_test=rf.predict(X_test)
print(y_test.shape)
output1 = pd.DataFrame( data={"id":test["id"], "sentiment":y_test} )
print(output1.head(10))

(25000,)
           id  sentiment
0  "12311_10"          1
1    "8348_2"          0
2    "5828_4"          1
3    "7186_2"          1
4   "12128_7"          1
5    "2913_8"          0
6    "4396_1"          0
7     "395_2"          0
8   "10616_1"          0
9    "9074_9"          0


In [59]:
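output1.to_csv( "Bag_of_Words_model1.csv", index=False, quoting=3 )       # got the same accuracy

The accuracy did not change after applying dimensionality reduction. Note, however, that the prediction cell above calls rf.predict(X_test) and not randomforest.predict(X_test_red), so this submission repeats the predictions of the first model and does not evaluate the forest trained on the PCA-reduced features.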

## Trying Some New Methods

In [60]:
from sklearn.feature_selection import VarianceThreshold

In [61]:
vr=VarianceThreshold()
X_new=vr.fit_transform(X)
print(X_new.shape)

(25000, 5000)


I tried to remove the columns whose variance is zero, but as you can see, there are no such columns: all 5000 features are kept.
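
With its default threshold of 0.0, `VarianceThreshold` drops only the columns that hold the same value in every row. A plain-Python sketch of that check, using a hypothetical helper and a small made-up matrix:

```python
def zero_variance_columns(rows):
    """Indices of the columns whose value is identical in every row."""
    return [j for j, column in enumerate(zip(*rows)) if len(set(column)) == 1]

# Hypothetical 3x3 count matrix: columns 1 and 2 never change.
X_small = [[1, 0, 2],
           [0, 0, 2],
           [3, 0, 2]]
print(zero_variance_columns(X_small))
# [1, 2]
```

Since the unchanged shape above shows that no column was removed, every one of the 5000 vocabulary columns must take at least two different values across the reviews.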