# Sentiment Analysis using a Naive Bayes Classifier
by Catherine Pan and Lydia Ding <br>
for CS391: Topics in NLP

## 1. Introduction


The proliferation of social media and the abundance of textual data have led to growing interest in extracting insights from large quantities of text automatically and efficiently. One field with wide-ranging applications in marketing, politics, and clinical medicine is sentiment analysis (SA), which leverages natural language processing, biometrics, and machine learning techniques to categorize the attitude of an individual (or group) with respect to a given subject. Categorization can determine sentiment polarity (i.e., positive or negative) or predict more nuanced emotions such as happiness, sadness, or anger. SA has been used to research brand perception (Ghiassi et al 2013), detect mental health issues (Wang et al 2013), and even predict public voting patterns using data from Twitter (Tumasjan et al 2010). A number of methods exist for sentiment analysis, including statistical models such as point-wise mutual information and latent semantic indexing, as well as machine learning methods such as Naive Bayes, Bayesian networks, and neural networks (Medhat et al 2014). <br>

The current study applies the Naive Bayes model to a set of Rotten Tomatoes film reviews and attempts to categorize reviews as either positive or negative. The dataset, which contains 8,529 sentence-level film reviews (with sentence IDs running up to 8,544), is taken from a sentiment analysis corpus created by Pang and Lee (2002).

### 1.1 Reading the data
Here, we read the .tsv file of Rotten Tomatoes reviews into a dataframe in Python.

In [2]:
import pandas as pd
from collections import Counter
import re

# Read the tsv file into a pandas dataframe
reviews = pd.read_csv('new_train.tsv', sep='\t')
print (reviews.count())
reviews.describe()

PhraseId      8529
SentenceId    8529
Phrase        8529
Sentiment     8529
dtype: int64


Unnamed: 0,PhraseId,SentenceId,Sentiment
count,8529.0,8529.0,8529.0
mean,81492.254543,4269.683433,2.063196
std,44268.957774,2466.705592,1.276636
min,1.0,1.0,0.0
25%,43992.0,2133.0,1.0
50%,82655.0,4268.0,2.0
75%,119774.0,6406.0,3.0
max,156040.0,8544.0,4.0


In [3]:
# randomize the dataset
reviews = reviews.sample(frac = 1)
reviews.head(10)

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
6213,116660,6221,Caruso sometimes descends into sub-Tarantino c...,3
2622,52908,2623,Brosnan is more feral in this film than I 've ...,3
8081,148792,8095,"It is , however , a completely honest , open-h...",4
7275,135009,7287,Has its charming quirks and its dull spots .,2
7313,135609,7326,"But believe it or not , it 's one of the most ...",4
4363,84433,4368,It is different from others in its genre in th...,4
630,14627,631,The film 's darker moments become smoothed ove...,2
2995,59426,2998,What makes the film special is the refreshingl...,3
5670,107424,5676,A meatier deeper beginning and\/or ending woul...,3
2609,52704,2610,"For VeggieTales fans , this is more appetizing...",4


### 1.2 Pre-processing and partitioning
Originally, reviews in this dataset were labeled with scores that indicate various degrees of positive and negative sentiment: 0 (negative), 1 (somewhat negative), 2 (neutral), 3 (somewhat positive), and 4 (positive). Since we are only interested in identifying whether a review is positive or negative, we convert this tiered system into one that simply distinguishes between positive and negative reviews. We treat any review assigned a sentiment label of 0-2 as a negative review, and any review assigned a sentiment label of 3-4 as a positive review.

In [4]:
# convert the 5 sentiment labels into -1 (negative) and 1 (positive) labels
# assign through a single .loc call so that pandas modifies reviews itself rather than a copy
reviews.loc[reviews['Sentiment'] < 3, 'Sentiment'] = -1
reviews.loc[reviews['Sentiment'] > 2, 'Sentiment'] = 1
reviews['Sentiment'].head(10)

6213    1
2622    1
8081    1
7275   -1
7313    1
4363    1
630    -1
2995    1
5670    1
2609    1
Name: Sentiment, dtype: int64

Next, we split our dataset into two sets: 75% for training and 25% for testing.

In [5]:
reviews_train = reviews.loc[reviews["SentenceId"] < reviews['SentenceId'].max()*.75]
reviews_test = reviews.loc[reviews["SentenceId"] >= reviews['SentenceId'].max()*.75]
reviews_train.count() + reviews_test.count()
#reviews_test = reviews[int(len(reviews)*.75):]


PhraseId      8529
SentenceId    8529
Phrase        8529
Sentiment     8529
dtype: int64

## 2. Naive Bayes Classifier
The problem we are trying to solve can be stated as follows: we wish to find the class $c$, positive or negative, that yields the maximum posterior probability given a document $d$. The probability of a class given a data point can be stated as in 1.1. The class that maximizes the probability of (1.1) is calculated by (1.2). <br>

$$P(c|d)$$	(1.1) <br>
$$\hat{c} = \underset{c}{\operatorname{argmax}}\ P(c|d)$$ (1.2) <br>

We apply Bayes' rule to 1.2 and compute the most likely class $\hat{c}$ given some text $\textit{d}$ by choosing the class that yields the highest product of two probabilities: the prior probability of the class $\textit{P(c)}$ and the likelihood of the text $\textit{d}$ given class $\textit{c}$. Thus, our problem can be broken down into two tractable problems: solving for the likelihood $\textit{P(d|c)}$ and the prior $\textit{P(c)}$. Since the probability of the document $\textit{P(d)}$ is the same for every class, it can be omitted from our calculations.<br>

$$\hat{c} = \underset{c}{\operatorname{argmax}}\ P(c|d) = \underset{c}{\operatorname{argmax}}\ \frac{P(d|c)P(c)}{P(d)}$$
$$\hat{c} = \underset{c}{\operatorname{argmax}}\ P(d|c)P(c)$$	(1.3)
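
To see why dropping $P(d)$ is safe, here is a minimal sketch with made-up numbers (the likelihoods and priors below are illustrative assumptions, not values estimated from our data). Dividing every class score by the same $P(d)$ rescales the scores but does not change which class wins.

```python
# Hypothetical likelihoods P(d|c) and priors P(c) for a single document d.
likelihood = {"positive": 0.002, "negative": 0.0005}
prior = {"positive": 0.4, "negative": 0.6}

# Unnormalized scores P(d|c) * P(c), as in (1.3).
scores = {c: likelihood[c] * prior[c] for c in prior}

# P(d) is the sum of the unnormalized scores over all classes.
p_d = sum(scores.values())
posteriors = {c: scores[c] / p_d for c in scores}

# The winning class is the same with or without the division by P(d).
best_unnormalized = max(scores, key=scores.get)
best_posterior = max(posteriors, key=posteriors.get)
print(best_unnormalized, best_posterior)
```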

For the likelihood $\textit{P(d|c)}$, we represent document $\textit{d}$ by the words it contains, $w_{1}, w_{2}, \ldots, w_{n}$:<br>

$$P(w_{1}, w_{2}, w_{3}...w_{n}|c)$$
(1.4)
    
However, 1.4 is not computable without making some assumptions about the data. First, we treat the document as a bag of words for which the position of a word is not relevant and each word is equally important. Second, we make the naive Bayes assumption (hence the name of the method) that the word probabilities $P(w_{i}|c)$ are conditionally independent of one another given the class. Assuming these to be true, we can calculate the likelihood using 1.5: <br>

$$P(w_{1}, w_{2}, w_{3}, \ldots, w_{n}|c) = P(w_{1}|c) \times P(w_{2}|c) \times P(w_{3}|c) \times \ldots \times P(w_{n}|c)$$
(1.5)
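
As a rough illustration of 1.5, the sketch below multiplies per-word probabilities for a short document. The table `word_probs` is a hypothetical stand-in for the values a trained classifier would supply. Because of the bag-of-words assumption, reordering the words leaves the result unchanged.

```python
import math

# Hypothetical values of P(w|c) for one class c.
word_probs = {"a": 0.05, "charming": 0.002, "film": 0.01}

document = "a charming film".split()

# Likelihood P(d|c) as the product of the individual word probabilities.
likelihood = math.prod(word_probs[w] for w in document)

# Word order is ignored: the reversed document has the same likelihood.
reversed_likelihood = math.prod(word_probs[w] for w in reversed(document))
print(likelihood, reversed_likelihood)
```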

Next, we describe the process of training a Naive Bayes classifier to learn values for the prior and for word likelihoods $P(w_{i}|c)$ (the probability of a word given its sentiment category).

### 2.1 Training a Naive Bayes Classifier
For the document prior $\textit{P(c)}$, we can simply find what percentage of the documents in our training set belong to each category $c$ (either positive or negative). Let $N_{c}$ be the number of documents in our training set that belong to sentiment $c$, and $N_{doc}$ be the total number of documents in our training set. <br>

$$P(c) = \frac{N_{c}}{N_{doc}}$$(1.6)

Recall that the likelihood $\textit{P(d|c)}$ is the product of the individual likelihoods $P(w_{i}|c)$ of each word in the document. To compute $P(w_{i}|c)$, we find the fraction of times word $w_{i}$ appears among all words in documents of category $c$. We divide the number of times word $w_{i}$ appears in $c$ by the total number of words in documents of category $c$, as in 1.7, where $V$ is the vocabulary (the set of all distinct words in the training data). <br>

$$P(w_{i}|c) = \frac{count(w_{i},c)}{\sum_{w\in V}count(w,c)}$$(1.7)

However, there is a problem with this: in our test data, we may encounter a word such as “superb” that never occurs in the training documents of class $c$. Using the current formula, its numerator would be zero. Since Naive Bayes multiplies the $P(w_{i}|c)$ values to find the total likelihood $\textit{P(d|c)}$, a single such word would drive the whole likelihood to $0$. To circumvent this issue, we use Laplace (add-one) smoothing: we add one to every word count, which adds one to the numerator and the vocabulary size $|V|$ to the denominator of each $P(w_{i}|c)$:<br>

$$P(w_{i}|c) = \frac{count(w_{i},c)+1}{\sum_{w\in V}(count(w,c)+1)} = \frac{count(w_{i},c)+1}{\left(\sum_{w\in V}count(w,c)\right)+|V|}$$(1.8)
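
The effect of 1.8 can be checked on a toy example. The sketch below is illustrative only: `smoothed_prob`, `toy_counts`, and `toy_vocab` are hypothetical names and values, not part of the pipeline in this notebook. A word that was never seen in the class still receives a small nonzero probability, and the smoothed probabilities over the vocabulary still sum to one.

```python
from collections import Counter

def smoothed_prob(word, counts, vocab):
    """Add-one estimate of P(word | c), following (1.8)."""
    total = sum(counts.values())
    return (counts.get(word, 0) + 1) / (total + len(vocab))

# Toy word counts for one class; the vocabulary covers both classes.
toy_counts = Counter({"good": 3, "film": 2})
toy_vocab = {"good", "film", "bad", "dull"}

p_good = smoothed_prob("good", toy_counts, toy_vocab)  # seen 3 times in this class
p_dull = smoothed_prob("dull", toy_counts, toy_vocab)  # never seen in this class
print(p_good, p_dull)
```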

In [17]:
def get_text(score):
    # join together the phrases of all training reviews with a particular sentiment label
    # and lower-case all words
    phrases = reviews_train.loc[reviews_train["Sentiment"] == score, "Phrase"]
    return " ".join(phrases.str.lower())

In [7]:
def count_text(text):
    # split text into words based on whitespace
    words = re.split(r'\s+', text)
    # count the occurrences of each word
    return Counter(words)

negative_text = get_text(-1)
positive_text = get_text(1)

# Generate words counts for negative reviews
negative_counts = count_text(negative_text)
# Generate words counts for positive reviews
positive_counts = count_text(positive_text)


In [8]:
#len(reviews_train.loc[reviews_train['Sentiment'] == -1])

In [10]:
# calculate P(c), the prior probability of the class c
def get_c_count(score):
    # count the number of training reviews with the given label
    return len(reviews_train.loc[reviews_train["Sentiment"] == score])

# number of training reviews in each class
positive_review_count = get_c_count(1)
negative_review_count = get_c_count(-1)

# prior probabilities of each classification, P(c)
prob_positive = positive_review_count / float(len(reviews_train))
prob_negative = negative_review_count / float(len(reviews_train))

print (positive_review_count)
print (negative_review_count)

2696
3702
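
Plugging these counts into 1.6, with $N_{doc} = 2696 + 3702 = 6398$, gives the two priors:

$$P(\text{positive}) = \frac{2696}{6398} \approx 0.4214 \qquad P(\text{negative}) = \frac{3702}{6398} \approx 0.5786$$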


### 2.2. Making Predictions
Now, given a document $\textit{d}$, we can use the classifier to predict the most probable class of $\textit{d}$: <br> $\hat{c} = argmax_{c} P(c|d)$

In [106]:
#text_counts = Counter(re.split("\s+", negative_text))
#for word in text_counts:
    #print text_counts.get(word)
   # print negative_counts.get(word)


In [11]:
def make_class_prediction(text, counts, class_prob, class_count):
    prediction = 1
    text_counts = Counter(re.split(r"\s+", text))
    # Total number of word tokens seen in this class (computed once, outside the loop).
    total_words = sum(counts.values())

    for word in text_counts:
        # For every distinct word in the text, take the number of times it occurred in the
        # training reviews of this class and add 1 (Laplace smoothing), so that an unseen
        # word does not multiply the prediction by 0.
        # The denominator is the total word count of the class plus class_count.
        # Note that 1.8 adds the vocabulary size |V| here rather than the number of reviews in the class.
        # The factor text_counts[word] weights repeated words; it is identical for both
        # classes, so it does not change which class scores higher.
        prediction *= float(text_counts[word]) * ((counts.get(word, 0) + 1.0) / (total_words + class_count))

    # Finally, multiply by the prior probability of the class, P(c).
    return prediction * class_prob

In [12]:
def predict(text, make_class_prediction):
    # compute the probabilities of a given text being positive and negative
    negative = make_class_prediction(text, negative_counts, prob_negative, negative_review_count)
    positive = make_class_prediction(text, positive_counts, prob_positive, positive_review_count)

    if negative > positive:
        return -1
    return 1

predictions = [predict(review, make_class_prediction) for review in reviews_test['Phrase']] 
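
Multiplying many small probabilities can underflow to zero for long texts, and a common remedy is to sum log probabilities instead. The sketch below is one possible variant and is not the code that produced the results in this notebook: `log_class_score` and `predict_log` are hypothetical names, the toy counts are made up, and the sketch assumes smoothing with the vocabulary size as in 1.8, one factor per word occurrence, and lower-casing of the input to match the lower-cased training text.

```python
import math
from collections import Counter

def log_class_score(text, counts, class_prob, vocab_size):
    """Return log P(c) plus the sum of log P(w|c) over the tokens in text."""
    total = sum(counts.values())
    score = math.log(class_prob)
    for word in text.lower().split():
        score += math.log((counts.get(word, 0) + 1.0) / (total + vocab_size))
    return score

def predict_log(text, class_counts, class_probs):
    """Pick the class label with the highest log score."""
    # Vocabulary = all distinct words seen in any class.
    vocab_size = len(set().union(*class_counts.values()))
    return max(
        class_counts,
        key=lambda c: log_class_score(text, class_counts[c], class_probs[c], vocab_size),
    )

# Toy example with made-up training text for each class.
toy_counts = {
    1: Counter("a charming and funny film".split()),
    -1: Counter("a dull and tedious film".split()),
}
toy_probs = {1: 0.5, -1: 0.5}
print(predict_log("charming film", toy_counts, toy_probs))
```

Because the logarithm is monotonic, the class with the highest log score is also the class with the highest product of probabilities.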




### 2.3 Computing Accuracy

In [13]:
sum(row["Sentiment"] == predict(row["Phrase"], make_class_prediction) for index, row in reviews_test.iterrows())

1518

In [112]:
len(predictions)

2131

In [15]:
print ("accuracy:",1518/2131.0)

accuracy: 0.7123416236508682


Baseline accuracy (assuming we predict negative for all cases):

In [121]:
print("all negative prediction baseline:")
# share of negative reviews in the training split
print(negative_review_count / float(negative_review_count + positive_review_count))

all negative prediction baseline:
0.578618318224
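
Both figures can also be computed with small helper functions, which avoids hard-coding counts. This is a sketch under assumptions: `accuracy` and `majority_baseline` are hypothetical helpers, and the toy labels are made up for illustration.

```python
from collections import Counter

def accuracy(gold, predicted):
    """Fraction of predictions that match the gold labels."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

def majority_baseline(gold):
    """Accuracy obtained by always predicting the most frequent gold label."""
    most_common_count = Counter(gold).most_common(1)[0][1]
    return most_common_count / len(gold)

toy_gold = [-1, -1, 1, 1, -1]
toy_pred = [-1, 1, 1, 1, -1]
print(accuracy(toy_gold, toy_pred), majority_baseline(toy_gold))
```

With the variables defined earlier, `accuracy(list(reviews_test['Sentiment']), predictions)` would give the model's accuracy on the test split, and `majority_baseline(list(reviews_test['Sentiment']))` would give a baseline measured on the same split.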


## 3. Results and Future Steps

Our model performs well above the baseline. Above, we calculate the overall accuracy of our Naive Bayes model by dividing the number of correct predictions (predicting 'negative' for a negative review or 'positive' for a positive review) by the total number of predictions made. Predicting 'negative' for every review yields an accuracy of 57.86% (the share of negative reviews in our training split), while our model predicts polarity with 71.23% accuracy on the test split.

While our model performs better than the baseline, its performance could certainly be improved. We briefly offer suggestions for future steps below. First, it would be useful to train on a larger data set. Our model was trained on 6,398 reviews, each 1-2 sentences long. This means that it learned far fewer words than are commonly used in the English language, and it had only a handful of examples for each word. Second, unlike other models of this type, ours did not implement negation tagging. This means that the phrase "very good" in "not very good" would have been treated positively rather than negatively. In contrast, Pang et al (2002) marked every word in a sentence that followed a negation ("not", "never", "no") with a _NOT_ tag, treating negated constituents differently from their positive counterparts. With the addition of this step, "good" would still be counted as "good", but the "good" in "not good" would be counted as a different word. There is evidence that leaving out negation tagging can be at least mildly detrimental to SA performance (Pang et al 2002).
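
A negation-tagging step of this kind could be sketched as follows. This is a simplified illustration, not a reproduction of the cited implementation: `tag_negation` is a hypothetical helper, the `NOT_` prefix and the rule that tagging stops at the next punctuation token are assumptions, and the input is assumed to be whitespace-tokenized like the phrases in our dataset.

```python
NEGATIONS = {"not", "never", "no"}
PUNCTUATION = {".", ",", ";", ":", "!", "?"}

def tag_negation(text):
    """Prefix every token after a negation word with NOT_ until the next punctuation token."""
    tagged = []
    negated = False
    for token in text.lower().split():
        if token in PUNCTUATION:
            negated = False          # punctuation ends the negated span
            tagged.append(token)
        elif token in NEGATIONS:
            negated = True           # start tagging the tokens that follow
            tagged.append(token)
        elif negated:
            tagged.append("NOT_" + token)
        else:
            tagged.append(token)
    return " ".join(tagged)

print(tag_negation("It is not very good , but the cast is fine ."))
```

Applying such a function to each phrase before counting words would let the classifier learn separate statistics for "good" and "NOT_good".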

Our model makes a number of simplifying assumptions that may also affect performance. Naive Bayes, by definition, assumes that the features of a document are conditionally independent given its class; however, this is highly unlikely given the complex relationships that hold between words in language. A model that does not depend on the assumption of independence, such as the Bayesian networks mentioned in the introduction, may provide a slightly truer representation of sentential structure. Finally, this paper takes document features to be individual words, that is, unigrams with no context. We imagine that breaking up a document into bigrams or trigrams might capture more nuanced phrasal relationships and provide a more accurate estimation of polarity.
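
Extracting n-gram features requires only a small change to the tokenization step. The sketch below is a minimal illustration; `ngrams` is a hypothetical helper and is not used elsewhere in this notebook.

```python
from collections import Counter

def ngrams(text, n):
    """Return the list of n-grams (as space-joined strings) in a whitespace-tokenized text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Bigram counts could replace the unigram counts used by the classifier.
bigram_counts = Counter(ngrams("not very good at all", 2))
print(bigram_counts)
```

With bigram features, "very good" and "not very" become separate features, so some local context around each word is preserved.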

-----
## Sources
Ghiassi, M., Skinner, J., & Zimbra, D. (2013). Twitter brand sentiment analysis: A hybrid system using n-gram analysis and dynamic artificial neural network. Expert Systems with applications, 40(16), 6266-6282. <br> <br>
Jurafsky, D., & Martin, J. H. (2000). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. <br> <br>
Medhat, W., Hassan, A., & Korashy, H. (2014). Sentiment analysis algorithms and applications: A survey. Ain Shams Engineering Journal, 5(4), 1093-1113. <br> <br>
Pang, B., Lee, L., & Vaithyanathan, S. (2002, July). Thumbs up?: sentiment classification using machine learning techniques. In Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10 (pp. 79-86). Association for Computational Linguistics. <br> <br>
Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2), 1-135. <br> <br>
Paruchuri, V. (2015, March 17). Naive Bayes: predicting movie sentiment. Retrieved from https://www.dataquest.io/blog/naive-bayes-tutorial/.<br> <br>
Tumasjan, A., Sprenger, T. O., Sandner, P. G., & Welpe, I. M. (2010). Predicting elections with twitter: What 140 characters reveal about political sentiment. ICWSM, 10(1), 178-185.<br> <br>
Wang, X., Zhang, C., Ji, Y., Sun, L., Wu, L., & Bao, Z. (2013, April). A depression detection model based on sentiment analysis in micro-blog social network. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 201-213). Springer Berlin Heidelberg. <br> <br>
Author unknown. (2014). Sentiment Analysis on Movie Reviews. Retrieved from https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews