# Sentiment Analysis: The ML Way

#You are given short reviews of some movies. A review may speak well or badly of the movie. A human can identify the sentiment of a text by reading the words in each sentence. How can we make a machine understand the sentiment of a text?

#One way is the ML way. A ground truth is created for some corpus, i.e. we have both positive and negative reviews tagged with their respective class. This labeled data forms the base: it is converted to a structured form, the algorithm is trained on it, and classification is then done according to the words used (the machine tries to learn a pattern from the data).

#Another way is the dictionary approach: we create a dictionary of positive and negative words, explicitly stating which words are positive and which are negative. We then count the positive and negative words in a sentence and compute a score (for example, positive count minus negative count). If the score is positive, the sentence is positive; otherwise it is negative.
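The dictionary approach can be sketched in a few lines. This is a minimal illustration, not a real lexicon: the word sets `POSITIVE_WORDS` and `NEGATIVE_WORDS` and the helper names `lexicon_score` and `lexicon_label` are hypothetical, and the score is assumed to be the positive count minus the negative count.

```python
# Hypothetical mini-lexicons for illustration only; a real dictionary would be far larger.
POSITIVE_WORDS = {"good", "great", "gorgeous", "funny", "brilliant"}
NEGATIVE_WORDS = {"bad", "boring", "dull", "awful", "tedious"}


def lexicon_score(text):
    """Return (number of positive words) - (number of negative words)."""
    tokens = text.lower().split()
    pos = sum(token in POSITIVE_WORDS for token in tokens)
    neg = sum(token in NEGATIVE_WORDS for token in tokens)
    return pos - neg


def lexicon_label(text):
    """Label a text 'pos' if its score is positive, otherwise 'neg'."""
    return "pos" if lexicon_score(text) > 0 else "neg"


print(lexicon_label("a great and funny film"))
print(lexicon_label("a boring , tedious mess"))
```

Note that a text with no dictionary words scores 0 and is therefore labeled negative under this rule, which is one reason the quality of the dictionary matters.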

#In either case, there is manual work involved (creating the ground truth in case 1 or creating the dictionary in case 2).

In [28]:
import re
import nltk
import random
from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix, classification_report

In [2]:
with open("short_reviews/positive.txt", "r") as f1:   # "r" is for reading
    short_pos = f1.readlines()   # one review per line

In [4]:
short_pos[1]

'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth . \n'

In [3]:
type(short_pos)

list

In [5]:
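short_pos = [re.sub("\n", "", i) for i in short_pos]   # strip the trailing newline
x_short_pos = short_pos[:1000]                          # keep the first 1000 positive reviews
with open("short_reviews/negative.txt", "r") as f2:
    short_neg = f2.readlines()
short_neg = [re.sub("\n", "", i) for i in short_neg]
x_short_neg = short_neg[:1000]                          # keep the first 1000 negative reviews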

In [6]:
x_short_pos[1]

'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth . '

In [8]:
type(short_neg)

list

In [8]:
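cv = CountVectorizer(stop_words='english', lowercase=True,
                     strip_accents='unicode', decode_error='ignore')
data = x_short_pos + x_short_neg   # positives first, then negatives
tdm = cv.fit_transform(data)       # sparse term-document matrix of word counts
Mat = tdm.toarray()                # dense array: one row per review, one column per word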

In [9]:
Mat.shape

(2000L, 7402L)
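To see what the term-document matrix contains, here is a small sketch on a made-up three-sentence corpus (the names `toy_docs`, `toy_cv`, `toy_tdm` and `toy_vocab` are illustrative and not part of the notebook). Each row is a document, each column is a word, and each cell is the number of times that word occurs in that document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration (not taken from the review files).
toy_docs = ["good movie", "bad movie", "good good plot"]

toy_cv = CountVectorizer()
toy_tdm = toy_cv.fit_transform(toy_docs)   # sparse matrix: rows = documents, columns = words

# vocabulary_ maps each word to its column index; order the words by that index.
toy_vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(toy_vocab)
print(toy_tdm.toarray())
```

The same idea scales up to the 2000 reviews above, where the vocabulary has 7402 columns and most cells are zero.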

In [13]:
import pandas as pd
Mat = pd.DataFrame(Mat)
Mat['type'] = ['pos']*1000 + ['neg']*1000     # first 1000 rows are positive, last 1000 negative
Mat = Mat.sample(frac=1, random_state=1234)   # shuffle the rows reproducibly

In [14]:
Mat.head(2)

      0  1  2  3  4  5  6  7  8  9  ...  7393  7394  7395  7396  7397  7398  7399  7400  7401  type
1748  0  0  0  0  0  0  0  0  0  0  ...     0     0     0     0     0     0     0     0     0   neg
934   0  0  0  0  0  0  0  0  0  0  ...     0     0     0     0     0     0     0     0     0   pos


In [15]:

train = Mat.iloc[:1800]
test = Mat.iloc[1800:]
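Slicing a shuffled frame gives a test set whose class balance depends on the shuffle. As an alternative sketch, scikit-learn's `train_test_split` can hold out the same 10% while keeping the pos/neg ratio equal in both parts. The data below is a made-up stand-in (`features`, `labels`), not the review matrix.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 20 rows, 3 count features, balanced labels (made up for illustration).
rng = np.random.RandomState(1234)
features = rng.randint(0, 3, size=(20, 3))
labels = np.array(["pos"] * 10 + ["neg"] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels,
    test_size=0.1,       # same 90/10 ratio as the 1800/200 split
    stratify=labels,     # keep the pos/neg ratio equal in both parts
    random_state=1234,   # reproducible shuffle
)
print(X_train.shape, X_test.shape)
```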

In [16]:
from sklearn.linear_model import LogisticRegression


In [21]:
logreg = LogisticRegression(C=1e5)   # large C means very weak regularization
# .ix is deprecated; .iloc selects rows and columns by position
X = train.iloc[:, :-1].values   # all columns except the last: the word counts
Y = train.iloc[:, -1].values    # last column: the 'pos'/'neg' label

In [18]:
?LogisticRegression

In [22]:
logreg.fit(X,Y)

LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [24]:
test1 = test.iloc[:, :-1].values   # test features
true = test.iloc[:, -1].values     # true test labels
pred = logreg.predict(test1)       # predicted test labels

In [25]:
confusion_matrix(test.iloc[:, -1], pred)

array([[59, 38],
       [33, 70]])

In [30]:
print(classification_report(true,pred))

             precision    recall  f1-score   support

        neg       0.64      0.61      0.62        97
        pos       0.65      0.68      0.66       103

avg / total       0.64      0.65      0.64       200
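The figures in the report follow directly from the confusion matrix shown earlier. The sketch below recomputes accuracy and the 'pos' precision, recall and F1 by hand; scikit-learn orders rows (true labels) and columns (predictions) by sorted label, so here that is ['neg', 'pos'].

```python
# Confusion matrix reported above: rows are true labels, columns are predictions,
# both in the order ['neg', 'pos'].
cm = [[59, 38],
      [33, 70]]

tn, fp = cm[0]   # true 'neg' reviews predicted as neg / as pos
fn, tp = cm[1]   # true 'pos' reviews predicted as neg / as pos

accuracy = (tn + tp) / float(tn + fp + fn + tp)
precision_pos = tp / float(tp + fp)   # of all predicted 'pos', how many were right
recall_pos = tp / float(tp + fn)      # of all true 'pos', how many were found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(round(accuracy, 3), round(precision_pos, 2), round(recall_pos, 2), round(f1_pos, 2))
# prints: 0.645 0.65 0.68 0.66
```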



In [31]:
# predict_proba is a method and must be called with a feature matrix;
# each row holds the class probabilities in the order of logreg.classes_
logreg.predict_proba(test1)
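`predict_proba` is a method, so it only returns probabilities when called with a feature matrix. The sketch below uses a tiny made-up count matrix (`toy_X`, `toy_y` and `toy_model` are illustrative names) to show the shape of the result and how its columns line up with `classes_`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up count matrix: column 0 ~ a "positive" word, column 1 ~ a "negative" word.
toy_X = np.array([[2, 0], [1, 0], [3, 1], [0, 2], [0, 1], [1, 3]])
toy_y = np.array(["pos", "pos", "pos", "neg", "neg", "neg"])

toy_model = LogisticRegression()
toy_model.fit(toy_X, toy_y)

proba = toy_model.predict_proba(toy_X)   # shape (n_samples, n_classes)
print(toy_model.classes_)                # column order of proba
print(proba.round(2))                    # each row sums to 1
```

Taking the column with the larger probability in each row gives the same labels as `predict`.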

In [27]:
true
pred

array(['pos', 'pos', 'pos', 'pos', 'neg', 'neg', 'pos', 'pos', 'pos',
       'neg', 'neg', 'neg', 'neg', 'pos', 'neg', 'neg', 'pos', 'pos',
       'pos', 'neg', 'pos', 'pos', 'pos', 'pos', 'neg', 'pos', 'neg',
       'pos', 'pos', 'neg', 'pos', 'neg', 'pos', 'pos', 'pos', 'neg',
       'pos', 'pos', 'pos', 'pos', 'neg', 'pos', 'neg', 'pos', 'neg',
       'pos', 'neg', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos',
       'pos', 'pos', 'pos', 'neg', 'neg', 'pos', 'pos', 'neg', 'pos',
       'neg', 'neg', 'pos', 'pos', 'pos', 'pos', 'neg', 'neg', 'neg',
       'neg', 'pos', 'pos', 'neg', 'pos', 'pos', 'neg', 'neg', 'pos',
       'neg', 'pos', 'neg', 'neg', 'pos', 'pos', 'neg', 'pos', 'neg',
       'pos', 'pos', 'neg', 'neg', 'neg', 'pos', 'neg', 'pos', 'pos',
       'pos', 'neg', 'pos', 'pos', 'neg', 'pos', 'neg', 'pos', 'neg',
       'neg', 'neg', 'neg', 'neg', 'pos', 'pos', 'pos', 'pos', 'neg',
       'pos', 'neg', 'neg', 'neg', 'neg', 'pos', 'pos', 'neg', 'neg',
       'neg', 'neg',

In [65]:
pred

array(['pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'neg', 'neg', 'neg',
       'pos', 'pos', 'pos', 'neg', 'neg', 'neg', 'neg', 'neg', 'neg',
       'neg', 'pos', 'pos', 'pos', 'pos', 'pos', 'pos', 'neg', 'neg',
       'neg', 'neg', 'pos', 'neg', 'pos', 'neg', 'neg', 'neg', 'pos',
       'pos', 'pos', 'pos', 'neg', 'neg', 'pos', 'neg', 'neg', 'pos',
       'pos', 'pos', 'neg', 'neg', 'neg', 'pos', 'neg', 'pos', 'neg',
       'pos', 'neg', 'pos', 'pos', 'neg', 'pos', 'pos', 'pos', 'pos',
       'neg', 'pos', 'pos', 'pos', 'pos', 'pos', 'neg', 'pos', 'neg',
       'neg', 'neg', 'neg', 'pos', 'neg', 'pos', 'neg', 'pos', 'pos',
       'pos', 'neg', 'pos', 'pos', 'neg', 'pos', 'neg', 'pos', 'neg',
       'neg', 'pos', 'neg', 'pos', 'neg', 'neg', 'neg', 'pos', 'pos',
       'neg', 'neg', 'pos', 'neg', 'pos', 'neg', 'pos', 'pos', 'neg',
       'neg', 'neg', 'neg', 'pos', 'neg', 'pos', 'neg', 'pos', 'neg',
       'neg', 'pos', 'neg', 'pos', 'neg', 'pos', 'pos', 'neg', 'neg',
       'neg', 'pos',

# Try another classification model and check whether the accuracy improves

In [34]:
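from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
# .iloc replaces the deprecated .ix for positional indexing
classifier.fit(train.iloc[:, :-1], train.iloc[:, -1])
A = classifier.predict(test.iloc[:, :-1])
confusion_matrix(test.iloc[:, -1], A)   # not the last expression in the cell, so it is not displayed

print(classification_report(test.iloc[:, -1], A))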

             precision    recall  f1-score   support

        neg       0.67      0.65      0.66        97
        pos       0.68      0.70      0.69       103

avg / total       0.67      0.68      0.67       200
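As a possible variation, the vectorizer and the classifier can be chained in a scikit-learn pipeline, which passes the sparse counts straight to naive Bayes and avoids building a dense DataFrame. This is a sketch on made-up sentences (`train_texts`, `train_labels` and `nb_pipeline` are illustrative names), not a rerun of the experiment above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training sentences; the real input would be the review lists.
train_texts = ["a gorgeous and moving film", "funny , smart and moving",
               "a dull and tedious mess", "boring plot and dull acting"]
train_labels = ["pos", "pos", "neg", "neg"]

# The pipeline vectorizes the raw text and feeds the sparse counts to naive Bayes.
nb_pipeline = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
nb_pipeline.fit(train_texts, train_labels)

print(nb_pipeline.predict(["a funny and gorgeous film", "tedious and boring"]))
```

Because the pipeline accepts raw strings, new reviews can be classified without repeating the vectorization step by hand.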



In [35]:
?MultinomialNB

In [57]:
A

array(['neg', 'pos', 'neg', 'pos', 'neg', 'pos', 'neg', 'neg', 'neg',
       'pos', 'pos', 'pos', 'neg', 'neg', 'neg', 'pos', 'neg', 'neg',
       'pos', 'pos', 'neg', 'neg', 'pos', 'pos', 'neg', 'pos', 'neg',
       'neg', 'neg', 'neg', 'neg', 'pos', 'neg', 'pos', 'pos', 'pos',
       'neg', 'pos', 'neg', 'pos', 'neg', 'pos', 'neg', 'neg', 'pos',
       'pos', 'pos', 'neg', 'neg', 'neg', 'pos', 'neg', 'pos', 'pos',
       'neg', 'neg', 'pos', 'neg', 'neg', 'neg', 'pos', 'pos', 'pos',
       'pos', 'pos', 'neg', 'pos', 'pos', 'pos', 'pos', 'pos', 'neg',
       'neg', 'pos', 'pos', 'pos', 'pos', 'pos', 'neg', 'neg', 'pos',
       'pos', 'neg', 'neg', 'neg', 'neg', 'neg', 'neg', 'pos', 'neg',
       'neg', 'pos', 'neg', 'pos', 'neg', 'neg', 'pos', 'neg', 'pos',
       'neg', 'pos', 'neg', 'neg', 'neg', 'neg', 'pos', 'pos', 'pos',
       'neg', 'pos', 'neg', 'pos', 'neg', 'neg', 'pos', 'pos', 'neg',
       'neg', 'pos', 'neg', 'pos', 'neg', 'pos', 'neg', 'neg', 'pos',
       'neg', 'pos',

#What else could be done to improve the accuracy?

#Solution: Some words are common to both positive and negative reviews. To reduce their influence, we can keep only the adjectives as features and check whether the accuracy improves.
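A minimal sketch of the adjective filter, assuming each review has already been turned into (word, tag) pairs with Penn Treebank part-of-speech tags. In practice such pairs could come from `nltk.pos_tag(word_tokenize(review))`, which needs NLTK's tokenizer and tagger models to be downloaded; the tags below are written by hand so the example runs on its own, and `keep_adjectives` is a hypothetical helper name.

```python
# Penn Treebank tags for adjectives: base (JJ), comparative (JJR), superlative (JJS).
ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}


def keep_adjectives(tagged_tokens):
    """Return only the words whose POS tag marks them as adjectives."""
    return [word for word, tag in tagged_tokens if tag in ADJECTIVE_TAGS]


# Hand-tagged example (tags written by hand for illustration, not tagger output).
tagged = [("a", "DT"), ("gorgeous", "JJ"), ("but", "CC"),
          ("overlong", "JJ"), ("film", "NN")]

print(" ".join(keep_adjectives(tagged)))
# prints: gorgeous overlong
```

The filtered strings can then be passed to `CountVectorizer` in place of the full reviews, and the classifiers retrained to compare accuracy.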

In [9]:
A=[1,2,3]
B=[4,5]
A+B

[1, 2, 3, 4, 5]