# Naive Bayes Sentiment Analysis

## Setup

In this lab we use parts of the scikit-learn (sklearn) machine learning library to carry out a Naive Bayes
sentiment analysis of some Amazon product reviews.

In addition to our usual libraries (numpy for linear algebra and bokeh for plotting) we load three tools
from sklearn.

- ```CountVectorizer``` extracts word counts from documents
- ```train_test_split``` splits our data into a "training set" from which we will derive our probabilities, and 
    a "test set" that we will use to evaluate our classifier
- ```BernoulliNB``` is sklearn's own Bernoulli Naive Bayes classifier, which we use at the end to check our work

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split
from bokeh.plotting import figure
from bokeh.io import show, output_notebook
import numpy as np
output_notebook()

First we read our data from the file.  Each line of the file consists of a review, followed by a tab character,
followed by a "0" or "1".  We build a list of reviews and a corresponding list of labels from the file.

In [None]:
reviews = []
labels = []
with open("amazon.txt") as f:
    for line in f:
        review, label = line.strip().split('\t')
        reviews.append(review)
        labels.append(int(label))
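The loop above assumes that every line contains exactly one tab. As a hedged variation, the sketch below splits on the *last* tab only and skips blank lines, so a stray tab inside a review does not break the parse. The helper name `read_labelled_reviews` and the two sample lines are made up for illustration; they are not part of the lab data.

```python
import io

def read_labelled_reviews(handle):
    """Return (reviews, labels) from lines holding a review, a tab, then 0 or 1."""
    reviews_out, labels_out = [], []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # split on the last tab only, so a tab inside the review text is harmless
        review, label = line.rsplit('\t', 1)
        reviews_out.append(review)
        labels_out.append(int(label))
    return reviews_out, labels_out

# hypothetical in-memory stand-in for a file such as amazon.txt
sample = io.StringIO("Great phone, works well.\t1\nStopped charging after a week.\t0\n\n")
sample_reviews, sample_labels = read_labelled_reviews(sample)
print(sample_reviews, sample_labels)
```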

The ```train_test_split``` function breaks up the reviews and labels lists randomly into two parts; by default, 75% of the data
goes into the train set and 25% into the test set, though this is adjustable. Fixing ```random_state``` makes the split reproducible.
Now we set aside the test data and work only with the training data until the end.

In [None]:
train_reviews, test_reviews, train_labels, test_labels = train_test_split(reviews, labels,random_state=11)
print('Length of train_reviews is ', len(train_reviews))
print('Length of test_reviews is ',len(test_reviews))
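To illustrate how the split is adjustable, here is a small sketch on made-up data (the `toy_` names are hypothetical). The `test_size` argument changes the proportion, and `stratify` asks for the same balance of labels in both parts; neither option is used in the lab itself.

```python
from sklearn.model_selection import train_test_split

# twenty placeholder "reviews" with alternating labels
toy_reviews = ['review %d' % i for i in range(20)]
toy_labels = [i % 2 for i in range(20)]

# test_size=0.2 keeps 80% for training; stratify preserves the 0/1 balance
tr_x, te_x, tr_y, te_y = train_test_split(
    toy_reviews, toy_labels, test_size=0.2, stratify=toy_labels, random_state=11)
print(len(tr_x), len(te_x))
```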

## A simple example

Now we use ```CountVectorizer``` to analyze our reviews.  We create a CountVectorizer object and apply it to our review data.  We set options so that we keep track of only the 100 most common words in the reviews and (using ```binary=True```) so that each word is recorded as a 0 or a 1 according to whether it occurs in the review; without this option the routine counts the number of occurrences of each word.

In [None]:
F = CountVectorizer(max_features=100,binary=True)

To see how this works, let's apply it to a couple of simple sentences. The vectorizer expects a list of sentences, so we'll give a list of three sentences.  It returns a matrix whose rows correspond to the sentences and whose columns are features corresponding to the words it discovered in the data. 

In [None]:
simple = F.fit_transform(['This is a simple sentence that contains the word sentence twice.' ,
                          'This is another sentence.',
                          'A simple sentence has twice.'])
simple

The vectorizer returns a "sparse" array, which is an efficient way to store large matrices which are mostly zero.  We'll
see how to view it in a moment.  

First, we can ask the vectorizer for the vocabulary that it uncovered.  It returns a Python dictionary that associates
words to column indices, which start at zero.  So for example in this case the word ```word``` has index 10, the last of the eleven columns of the data matrix.

In [None]:
F.vocabulary_
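The dictionary is not printed in column order, so it can help to sort it by index. The sketch below rebuilds the same vectorizer under the hypothetical name `demo` and lists the words column by column. Note that the default tokenizer drops one-letter tokens such as "a", and that the columns are assigned alphabetically.

```python
from sklearn.feature_extraction.text import CountVectorizer

demo = CountVectorizer(max_features=100, binary=True)
demo.fit(['This is a simple sentence that contains the word sentence twice.',
          'This is another sentence.',
          'A simple sentence has twice.'])

# sort the (word, column index) pairs by column index
columns = sorted(demo.vocabulary_.items(), key=lambda pair: pair[1])
for word, col in columns:
    print(col, word)
```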

Now the array ```simple```.

In [None]:
simple.toarray()

Each row is the feature vector of a sentence, and each column corresponds to a word.  So the first sentence does *not* contain the first key word (```another```) but the second one does.

Because we set ```binary=True```, the column sums tell us how many of the documents contain each word.

In [None]:
simple.sum(axis=0)
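The effect of ```binary=True``` shows up in the column for "sentence", which appears twice in the first sentence. The following sketch (variable names are illustrative) compares the column sum with and without the flag.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['This is a simple sentence that contains the word sentence twice.',
             'This is another sentence.',
             'A simple sentence has twice.']

binary_vec = CountVectorizer(binary=True)
count_vec = CountVectorizer()
binary_matrix = binary_vec.fit_transform(sentences).toarray()
count_matrix = count_vec.fit_transform(sentences).toarray()

j = count_vec.vocabulary_['sentence']   # same column in both, since the vocabulary is the same
docs_with_word = binary_matrix[:, j].sum()    # number of sentences containing the word
total_occurrences = count_matrix[:, j].sum()  # total number of occurrences
print(docs_with_word, total_occurrences)
```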

Suppose the first sentence is "positive" (labelled with 1) and the other two are negative (labelled with zero). The corresponding target array is Y.

In [None]:
Y = np.array([[1],[0],[0]])
Y

We can compute the frequencies in the positive documents.

In [None]:
Y.transpose() @ simple


and in the negative ones.

In [None]:
(1-Y).transpose() @ simple

The fifth column (index 4) corresponds to 'sentence', which does indeed occur in one of the type 1 sentences and in 2 of the type 0 ones. 

The numbers of sentences of each type are $Y^TY$ and $(1-Y)^T(1-Y)$ although numpy thinks these are two dimensional 1x1 arrays.

In [None]:
Y.transpose()@Y

In [None]:
(1-Y).transpose()@(1-Y)
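If plain numbers are more convenient than 1x1 arrays, one option is to extract the single entry with ```.item()```, or simply to sum the 0/1 labels. This is a small sketch using a copy of the target array under the hypothetical name `Y_toy`.

```python
import numpy as np

Y_toy = np.array([[1], [0], [0]])

# .item() pulls the single entry out of a 1x1 array
n_pos = (Y_toy.T @ Y_toy).item()
# for 0/1 labels the same count is just the sum of the entries
n_neg = int((1 - Y_toy).sum())
print(n_pos, n_neg)
```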

We can compute the conditional probabilities with which each word occurs in the two types.  

In [None]:
Pplus = Y.transpose()@simple/(Y.transpose()@Y)
Pplus

Just as a check, the first word, ```another```, occurs only in the second sentence, which is one of the two sentences labelled zero, so it has a 50% chance of
occurring in a sentence labelled zero.

In [None]:
Pminus = (1-Y).transpose()@simple/((1-Y).transpose()@(1-Y))
Pminus
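As a cross-check on the matrix formulas, the same conditional probabilities can be read off as row averages within each class. The sketch below recomputes them on a dense copy of the toy matrix; the names `X_dense`, `y_flat`, `p_plus` and `p_minus` are illustrative and do not replace the variables above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['This is a simple sentence that contains the word sentence twice.',
             'This is another sentence.',
             'A simple sentence has twice.']
X_dense = CountVectorizer(binary=True).fit_transform(sentences).toarray()
y_flat = np.array([1, 0, 0])

# P(w|+) is the fraction of positive sentences containing w; likewise for P(w|-)
p_plus = X_dense[y_flat == 1].mean(axis=0)
p_minus = X_dense[y_flat == 0].mean(axis=0)

# the matrix formula gives the same numbers
p_plus_matrix = (y_flat @ X_dense) / (y_flat @ y_flat)
print(np.allclose(p_plus, p_plus_matrix))
print(p_minus)
```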

## The product review data

Now let's look at our training data.  First we use the CountVectorizer to compute the feature matrix.  We're going to add an option 
to tell the vectorizer to ignore elements of a list of "stop words" like he, his, at, him, ... to simplify things. 

In [None]:
vectorizer = CountVectorizer(max_features=100,binary=True,stop_words='english')

In [None]:
train_matrix = vectorizer.fit_transform(train_reviews).toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
keywords = vectorizer.get_feature_names_out()

In [None]:
train_matrix.shape

Let's compute the frequencies of the 100 words in each of the two classes.  We convert our labels into a numpy array.

In [None]:
train_y = np.array(train_labels)

In [None]:
train_y.shape

We use our formulae to compute the frequencies and the conditional probability vectors for the two classes.

In [None]:
freq_plus = (train_y.transpose()@train_matrix)
Nplus = train_y.transpose()@train_y
Pplus = freq_plus/Nplus
freq_minus = ((1-train_y).transpose()@train_matrix)
Nminus = ((1-train_y).transpose()@(1-train_y))
Pminus = freq_minus/Nminus
N = Nminus+Nplus

In [None]:
print(Nplus, Nminus)

Here we use a trick called "indirect sort" to find the words with the largest P(w|+) and P(w|-).  Argsort returns the *locations* of
the elements in increasing order.  So indices[0] is the location in the Pplus array with the smallest value, indices[1] the location of the next smallest value, and so on.
We reverse these indices and use the first 20 to extract the keywords with the largest probabilities.

In [None]:
indices = np.argsort(Pplus)
[keywords[i] for i in indices[::-1]][:20]

In [None]:
indices = np.argsort(Pminus)
[keywords[i] for i in indices[::-1]][:20]
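A tiny example may make the indirect sort easier to follow. The probabilities and words below are invented for illustration.

```python
import numpy as np

probs = np.array([0.10, 0.40, 0.05, 0.25])
words = ['good', 'great', 'bad', 'works']

order = np.argsort(probs)      # locations from smallest to largest value: [2, 0, 3, 1]
top2 = [words[i] for i in order[::-1][:2]]   # reverse, then keep the two largest
print(order, top2)
```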

In [None]:
# This is a fancy use of bokeh to add hover labels to the dots
from bokeh.models import ColumnDataSource, HoverTool

source = ColumnDataSource({'+':Pplus,'-':Pminus,'word':keywords})
f=figure()
f.scatter(x='+',y='-',source=source)

f.xaxis.axis_label='P(w|+)'
f.yaxis.axis_label = 'P(w|-)'
# words on the diagonal are equally likely in both classes
f.line(x=[0,.2],y=[0,.2])
f.add_tools(HoverTool(tooltips=[("word","@word")]))
show(f)

To avoid taking the logarithm of zero, we increase all of the frequency counts by 1, and Nplus and Nminus by 2, so that the smoothed
probabilities (count+1)/(N+2) lie strictly between 0 and 1. This is often called "smoothing."


In [None]:
freq_plus = freq_plus+1
freq_minus = freq_minus+1
Nplus = Nplus+2
Nminus = Nminus+2
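To see why the adjustment matters, consider a word that never occurs in a class and a word that occurs in every document of that class. The counts below are invented; the sketch only shows that the unsmoothed logarithms blow up while the smoothed ones stay finite.

```python
import numpy as np

counts = np.array([0, 3, 10])   # hypothetical document counts for three words
n_docs = 10                     # hypothetical number of documents in the class

raw = counts / n_docs                  # 0.0, 0.3, 1.0
smoothed = (counts + 1) / (n_docs + 2) # 1/12, 4/12, 11/12

with np.errstate(divide='ignore'):
    raw_log = np.log(raw)              # -inf for the word that never occurs
    raw_log_not = np.log(1 - raw)      # -inf for the word that always occurs

smooth_log = np.log(smoothed)
smooth_log_not = np.log(1 - smoothed)
print(raw_log, raw_log_not)
print(smooth_log, smooth_log_not)
```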

In [None]:
LPplus = np.log(freq_plus/Nplus)
LPNplus = np.log(1-freq_plus/Nplus)
LPminus = np.log(freq_minus/Nminus)
LPNminus = np.log(1-freq_minus/Nminus)

Using the equation from the notes (which is essentially Bayes rule), we find:

In [None]:
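# absent words contribute log(1-P(w|class)), stored in LPNplus and LPNminus; the log prior is added
posL = train_matrix @ LPplus + (1-train_matrix) @ LPNplus + np.log(Nplus/(Nplus+Nminus))
negL = train_matrix @ LPminus + (1-train_matrix) @ LPNminus + np.log(Nminus/(Nplus+Nminus))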

Recall that posL and negL are the log-likelihoods (up to a common additive constant) that a particular review is positive or negative, and our decision criterion is:
- label 1 if posL-negL>0
- label 0 otherwise

Our decision array has a 1 if posL>negL and a zero otherwise.

In [None]:
decision = (posL > negL).astype(int)
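The decision rule can be tried by hand on a model with only two words. In the sketch below, the function name `bernoulli_scores`, the two words, and all the probabilities are hypothetical; the point is just that a word present in a review contributes the log of P(w|class), a word absent contributes the log of 1-P(w|class), and the log prior is added.

```python
import numpy as np

def bernoulli_scores(X, log_p, log_not_p, log_prior):
    """Log-posterior score (up to a constant) of each row of the 0/1 matrix X."""
    return X @ log_p + (1 - X) @ log_not_p + log_prior

# hypothetical P(w|+) and P(w|-) for the words 'great' and 'broken'
p_plus_toy = np.array([0.8, 0.1])
p_minus_toy = np.array([0.2, 0.7])

X_toy = np.array([[1, 0],    # review containing 'great' only
                  [0, 1]])   # review containing 'broken' only

pos_toy = bernoulli_scores(X_toy, np.log(p_plus_toy), np.log(1 - p_plus_toy), np.log(0.5))
neg_toy = bernoulli_scores(X_toy, np.log(p_minus_toy), np.log(1 - p_minus_toy), np.log(0.5))
toy_decision = (pos_toy > neg_toy).astype(int)
print(toy_decision)
```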

Our check array has a 1 if decision and the original label agree, and zero otherwise.

In [None]:
check = (decision==train_y).astype(int)

In [None]:
np.sum(check)

They agree 554/750 times, or about 74% of the time.  That's much better than guessing, which would only be right 50% of the time.

Finally, we use the test data to see if we can predict labels on "new" data.
We re-use the LPplus and LPminus parameters, as well as the Nplus/N and Nminus/N from the training data.
But we need to compute the data matrix for the test data *based on the features derived from the training data.*

In [None]:
vectorizer.fit(train_reviews)
test_matrix = vectorizer.transform(test_reviews).toarray()
test_y = np.array(test_labels)

In [None]:
test_matrix.shape

In [None]:
# same formula as for the training data: the log prior is added
posL = test_matrix @ LPplus + (1-test_matrix) @ LPNplus + np.log(Nplus/(Nplus+Nminus))
negL = test_matrix @ LPminus + (1-test_matrix) @ LPNminus + np.log(Nminus/(Nplus+Nminus))

In [None]:
test_decision = (posL > negL).astype(int)
check = (test_decision == test_y).astype(int)
np.sum(check)
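The count of agreements can be turned into an accuracy by taking a mean, and the two kinds of mistakes can be counted separately. The arrays below are made up for illustration and are not the lab's predictions.

```python
import numpy as np

pred = np.array([1, 0, 1, 1, 0, 0])    # hypothetical predicted labels
truth = np.array([1, 0, 0, 1, 1, 0])   # hypothetical true labels

accuracy = np.mean(pred == truth)                        # fraction of agreements
false_pos = int(np.sum((pred == 1) & (truth == 0)))      # negative reviews called positive
false_neg = int(np.sum((pred == 0) & (truth == 1)))      # positive reviews called negative
print(accuracy, false_pos, false_neg)
```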

So we have correctly classified 182 of the 250 reviews in the test set, for an accuracy of 182/250 = 73%. Much better than guessing!

## Using the sklearn facilities

The sklearn library can do all of this using built in routines.  We add to the work above one more import.

In [None]:
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

The BernoulliNB class takes the binary feature matrix and does all the computations associated with fitting the Naive Bayes
model.  Let's walk through it. We start with the train_reviews and train_labels lists.  This cell builds the Bernoulli classifier B
using the vectorized train_reviews and the train_labels data.

In [None]:
vectorizer = CountVectorizer(max_features=100,binary=True,stop_words='english')
V = vectorizer.fit(train_reviews)
X = V.transform(train_reviews)
train_y = np.array(train_labels)
B = BernoulliNB().fit(X,train_y)

This cell uses the fitted vectorizer to compute the data matrix for the test reviews.

In [None]:
T = V.transform(test_reviews)
test_y = np.array(test_labels)

Now we find the predictions using the matrix T.

In [None]:
predictions = B.predict(T)

In [None]:
predictions

The score method allows us to tell how well we did.

In [None]:
B.score(T,test_y)

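The logs of the probabilities P(w|+/-) are stored inside B, and they agree with our computations.  Row 0 of ```B.feature_log_prob_``` holds the values for label 0 (compare with LPminus) and row 1 those for label 1 (compare with LPplus).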

In [None]:
B.feature_log_prob_

In [None]:
LPminus

In [None]:
LPplus
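Rather than comparing the printed arrays by eye, the agreement can be checked numerically. The sketch below does this on a small invented 0/1 matrix, assuming the default ```alpha=1.0``` of BernoulliNB, which corresponds to the (count+1)/(N+2) smoothing used above. The names `X_small`, `y_small` and `smoothed_log_prob` are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

X_small = np.array([[1, 0, 1],
                    [1, 1, 0],
                    [0, 1, 0],
                    [0, 0, 1]])
y_small = np.array([1, 1, 0, 0])

model = BernoulliNB().fit(X_small, y_small)   # default alpha=1.0

def smoothed_log_prob(X, mask):
    """log((count+1)/(N+2)) for the rows of X selected by the boolean mask."""
    return np.log((X[mask].sum(axis=0) + 1) / (mask.sum() + 2))

# rows follow model.classes_, i.e. label 0 first and label 1 second
manual = np.vstack([smoothed_log_prob(X_small, y_small == 0),
                    smoothed_log_prob(X_small, y_small == 1)])
print(np.allclose(model.feature_log_prob_, manual))
```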

## Your turn

Carry out the analysis above using the Yelp and IMDb data.  You can use the sklearn facilities to make your life easier if you want.
You can also try the multinomial classifier to see if it works better.  In that case, you need to remove the binary=True flag
from the vectorizer so that it counts frequencies.  Here is an example of how the CountVectorizer can compute term frequencies. You can also experiment with the "stop_words" flag.

In [None]:
vectorizer = CountVectorizer(max_features=10)

In [None]:
V = vectorizer.fit(["Here is a sentence", "Here is another sentence"])

In [None]:
V.vocabulary_

In [None]:
X = V.transform(["What are the frequencies in this here sentence", "I wrote a sentence about this sentences"])

In [None]:
X.toarray()

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
V.get_feature_names_out()
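As a starting point for the multinomial variant, here is a minimal sketch on four invented reviews. It is not a solution to the exercise: the `toy_` data and variable names are hypothetical, and the real analysis would use the train and test lists built from the Yelp or IMDb files.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

toy_train = ["great great product", "works great",
             "terrible product", "broke and terrible"]
toy_train_labels = [1, 1, 0, 0]

# no binary=True here, so the matrix holds word counts rather than 0/1 flags
count_vectorizer = CountVectorizer()
X_counts = count_vectorizer.fit_transform(toy_train)

M = MultinomialNB().fit(X_counts, toy_train_labels)

# new text is transformed with the vocabulary learned from the training data
toy_pred = M.predict(count_vectorizer.transform(["great great", "terrible"]))
print(toy_pred)
```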