# Sentiment Analysis

# 1. Naive Baye's

### Finding word counts


We have to calculate the probabilities of each classification, and the probabilities of each feature falling into each classification.

We were working with several discrete features in the last example. Here, all we have is one long string. The easiest way to generate features from text is to split the text up into words. Each word in a review will then be a feature that we can then work with. In order to do this, we’ll split the reviews based on whitespace.

We’ll then count up how many times each word occurs in the negative reviews, and how many times each word occurs in the positive reviews. This will allow us to eventually compute the probabilities of a new review belonging to each class.




In [3]:
import os
from collections import Counter
import re
i=0
#print os.getcwd()
posreviews=[]
positive_text = " "
negative_text = " "
i=0
for fileName in os.listdir("mr/train/pos"):
    fo=open("mr/train/pos/%s" % fileName,"r")
    #print fo.name
    #os.rename("mr/train/neg/%s" % fileName,"mr/train/neg/%d.txt" %(i))
    str= fo.read()
    #print "%d. %s" % (i,str)
    posreviews.append(str)
    #print "%d . %s" %(i,positive_text)
    #if i>1000: 
     #   break
    fo.close()
    i=i+1
positive_text= " ".join(posreviews)
#print positive_text
negreviews=[]
i=0
for fileName in os.listdir("mr/train/neg"):
    fo=open("mr/train/neg/%s" % fileName,"r")
    #os.rename("mr/train/neg/%s" % fileName,"mr/train/neg/%d.txt" %(i))
    negreviews.append(fo.read())
    #if i>1000: 
     #   break
    i=i+1
negative_text=" ".join(negreviews)
#print negative_text

def count_text(text):
  # Split text into words based on whitespace.  Simple but effective.
  words = re.split("\s+", text)
  # Count up the occurence of each word.
  return Counter(words)

# Generate word counts for negative tone.
negative_counts = count_text(negative_text)
# Generate word counts for positive tone.
positive_counts = count_text(positive_text)

print("Negative text sample: {0}".format(negative_text[:500]))
print("Positive text sample: {0}".format(positive_text[:500]))


Negative text sample: Story of a man who has unnatural feelings for a pig. Starts out with a opening scene that is a terrific example of absurd comedy. A formal orchestra audience is turned into an insane, violent mob by the crazy chantings of it's singers. Unfortunately it stays absurd the WHOLE time with no general narrative eventually making it just too off putting. Even those from the era should be turned off. The cryptic dialogue would make Shakespeare seem easy to a third grader. On a technical level it's bette
Positive text sample: Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their 

### Making Predictions

Now that we have the word counts, we just have to convert them to probabilities and multiply them out to get the predicted classification. Let’s say we wanted to find the probability that the review didn't like it expresses a negative sentiment. We would find the total number of times the word didn't occured in the negative reviews, and divide it by the total number of words in the negative reviews to get the probability of x given y. We would then do the same for like and it. We would multiply all three probabilities, and then multiply by the probability of any document expressing a negative sentiment to get our final probability that the sentence expresses negative sentiment.

We would do the same for positive sentiment, and then whichever probability is greater would be the class that the review is assigned to.

To do all this, we’ll need to compute the probabilities of each class occuring in the data, and then make a function to compute the classification.


In [4]:

# We need these counts to use for smoothing when computing the prediction.
positive_review_count = len(posreviews)
negative_review_count = len(negreviews)
# These are the class probabilities (we saw them in the formula as P(y)).
prob_positive = positive_review_count / float(len(posreviews) + len(negreviews))
prob_negative = negative_review_count / float(len(posreviews) + len(negreviews))

def make_class_prediction(text, counts, class_prob, class_count):
  prediction = 1.0
  text_counts = Counter(re.split("\s+", text))
  for word in text_counts:
      # For every word in the text, we get the number of times that word occured in the reviews for a given class, add 1 to smooth the value, and divide by the total number of words in the class (plus the class_count to also smooth the denominator).
      # Smoothing ensures that we don't multiply the prediction by 0 if the word didn't exist in the training data.
      # We also smooth the denominator counts to keep things even.
      # print "%d, %d,%d" %(text_counts.get(word), counts.get(word,0), sum(counts.values()))
      prediction *=  text_counts.get(word) * ((counts.get(word,0) + 1) / float(sum(counts.values()) + class_count))
      #print prediction
  # Now we multiply by the probability of the class existing in the documents.
  return prediction * class_prob

# As you can see, we can now generate probabilities for which class a given review is part of.
# The probabilities themselves aren't very useful -- we make our classification decision based on which value is greater.
#print("Review: {0}".format(reviews[0][0]))
#text="Movie was junk, useless, good for nothing, sheer waste of time and money."
text="Good nice Well Done"
#print("Negative prediction: {0}".format(make_class_prediction(text, negative_counts, prob_negative, negative_review_count))
neg=make_class_prediction(text, negative_counts, prob_negative, negative_review_count)
pos=make_class_prediction(text, positive_counts, prob_positive, positive_review_count)
if pos>neg:
   print "%s - Positive (pos score={0}, neg score={1})".format(pos,neg) %(text)
else:
   print "%s - Negative (pos score={0}, neg score={1})".format(pos,neg) %(text)
text="Movie was junk, useless, good for nothing, sheer waste of time and money."
neg=make_class_prediction(text, negative_counts, prob_negative, negative_review_count)
pos=make_class_prediction(text, positive_counts, prob_positive, positive_review_count)
if pos>neg:
   print "%s - Positive (pos score={0}, neg score={1})".format(pos,neg) %(text)
else:
   print "%s - Negative (pos score={0}, neg score={1})".format(pos,neg) %(text)

Good nice Well Done - Positive (pos score=8.06432308243e-18, neg score=1.53669815111e-18)
Movie was junk, useless, good for nothing, sheer waste of time and money. - Negative (pos score=6.35349543481e-48, neg score=2.56654331959e-45)


### Naive Bayes - Faster way to predict using Sklearn

In [5]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics
test=[]
actual=[]
for fileName in os.listdir("mr/test/pos"):
    fo=open("mr/test/pos/%s" % fileName,"r")
    str= fo.read()
    test.append(str)
    actual.append(1)
for fileName in os.listdir("mr/test/neg"):
    fo=open("mr/test/neg/%s" % fileName,"r")
    str= fo.read()
    test.append(str)
    actual.append(-1)
reviews = []
for r in posreviews:
    reviews.append(r)
for r in negreviews:
    reviews.append(r)
# Generate counts from text using a vectorizer.  There are other vectorizers available, and lots of options you can set.
# This performs our step of computing word counts.
vectorizer = CountVectorizer(stop_words='english')
train_features = vectorizer.fit_transform([r for r in reviews])
test_features = vectorizer.transform([r for r in test])

# Fit a naive bayes model to the training data.
# This will train the model using the word counts we computer, and the existing classifications in the training set.
nb = MultinomialNB()
trainRes=[]
for r in posreviews:
    trainRes.append(1)
for r in negreviews:
    trainRes.append(-1)
nb.fit(train_features,trainRes)

# Now we can use the model to predict classifications for our test features.
predictions = nb.predict(test_features)
#print predictions
# Compute the error.  It is slightly different from our model because the internals of this process work differently from our implementation.
fpr, tpr, thresholds = metrics.roc_curve(actual, predictions, pos_label=1)
print("Multinomial naive bayes AUC: {0}".format(metrics.auc(fpr, tpr)))

Multinomial naive bayes AUC: 0.785085965614


## Using SVM
First Sample Kernal is rbf(radial basis fuction) i.e. exp(-gamma |x-x'|^2) where gamma is a specified variable

In [6]:
from sklearn.svm import SVC
clf = SVC()
clf.fit(train_features, trainRes) 
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
predictSvc=clf.predict(test_features)
print predictSvc
# Compute the error.  
fpr, tpr, thresholds = metrics.roc_curve(actual, predictSvc, pos_label=1)
print("SVC Analysis AUC: {0}".format(metrics.auc(fpr, tpr)))

[-1 -1 -1 ..., -1 -1 -1]
SVC Analysis AUC: 0.5


### Kernal as Linear Function 

In [7]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
predictSvc=clf.predict(test_features)
print predictSvc
# Compute the error.  
fpr, tpr, thresholds = metrics.roc_curve(actual, predictSvc, pos_label=1)
print("SVC Analysis AUC: {0}".format(metrics.auc(fpr, tpr)))

[-1 -1 -1 ..., -1 -1 -1]
SVC Analysis AUC: 0.5


### Kernal as Sigmoid function i.e. tanh(gamma* < x, x'> + r) where r is specified by variable coef0

In [21]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=200.0,
    decision_function_shape=None, degree=3, gamma='auto', kernel='sigmoid',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
predictSvc=clf.predict(test_features)
print predictSvc
# Compute the error.  
fpr, tpr, thresholds = metrics.roc_curve(actual, predictSvc, pos_label=1)
print("SVC Analysis AUC: {0}".format(metrics.auc(fpr, tpr)))

[-1 -1 -1 ..., -1 -1 -1]
SVC Analysis AUC: 0.5


### Using Random Forests

#### Maximum Trees 10

In [14]:
from sklearn.ensemble import RandomForestClassifier
RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, 
                                              min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', 
                                              max_leaf_nodes=None,bootstrap=True, oob_score=False, 
                                              n_jobs=1, random_state=None, verbose=0, warm_start=False, class_weight=None)
predictRF=clf.predict(test_features)
print predictSvc
# Compute the error.  
fpr, tpr, thresholds = metrics.roc_curve(actual, predictSvc, pos_label=1)
print("SVC Analysis AUC: {0}".format(metrics.auc(fpr, tpr)))

[-1 -1 -1 ..., -1 -1 -1]
SVC Analysis AUC: 0.5


#### Maximum Trees 20

In [19]:
RandomForestClassifier(n_estimators=50, criterion='gini', max_depth=None, min_samples_split=2, 
                                              min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', 
                                              max_leaf_nodes=None, bootstrap=True, oob_score=False, 
                                              n_jobs=1, random_state=None, verbose=0, warm_start=False, class_weight=None)
predictRF=clf.predict(test_features)
print predictSvc
# Compute the error.  
fpr, tpr, thresholds = metrics.roc_curve(actual, predictSvc, pos_label=1)
print("SVC Analysis AUC: {0}".format(metrics.auc(fpr, tpr)))

[-1 -1 -1 ..., -1 -1 -1]
SVC Analysis AUC: 0.5


### References
Dataset - http://ai.stanford.edu/~amaas/data/sentiment/ [25k(12.5k pos and 12.5k negative reviews) Training Data and 25k Test Movie Reviews] <br/>We took 2.5k pos samples and neg samples from test and tag for the results which are displayed.<br/>Kindly download the dataset from the link provided and make sure it is in the same folder as this ipython notebook to run.<br/>
Naive Bayes Senti Analysis on Movie Reviews Blog https://www.dataquest.io/blog/naive-bayes-tutorial/ <br/>
Developed by Mayank Bhasin, for any queries contact mayankbhasin@gmail.com