<a href="https://colab.research.google.com/github/mlfa19/assignments/blob/master/Module%202/05/Analyzing_Word_Saliency_for_a_Naive_Bayes_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Classification Using Na&iuml;ve Bayes

***Abstract***

In this notebook you'll be interpreting the results of fitting a Na&iuml;ve Bayes model for classifying the sentiment of a movie review.

## Sentiment Analysis

The [Wikipedia Article on Sentiment Analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) provides the following definition for sentiment analysis.

> Sentiment analysis (also known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine.

In this notebook we'll be focusing on predicting the sentiment of an IMDB movie review based on its text.  This dataset was originally used in a Kaggle competition called [Bag of Words meets Bag of Popcorn](https://www.kaggle.com/c/word2vec-nlp-tutorial) (you'll understand that joke by the end of this notebook!).

The [data](https://www.kaggle.com/c/word2vec-nlp-tutorial/data) consists of the following.

> The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of reviews is binary, meaning the IMDB rating < 5 results in a sentiment score of 0, and rating >=7 have a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000 review labeled training set does not include any of the same movies as the 25,000 review test set. In addition, there are another 50,000 IMDB reviews provided without any rating labels.

Our goal will be to see if we can use Na&iuml;ve Bayes to learn a model from a training set that accurately estimates the sentiment of new reviews.

Without further ado, let's download and parse the data into a data frame.

In [36]:
import gdown
import pandas as pd

gdown.download('https://drive.google.com/uc?authuser=0&id=1Z8bwIBa_0gFe9-C2W0goZ72lQfFMbxjS&export=download',
               'labeledTrainData.tsv',
               quiet=False)
df = pd.read_csv('labeledTrainData.tsv', header=0, delimiter='\t')
df

Downloading...
From: https://drive.google.com/uc?authuser=0&id=1Z8bwIBa_0gFe9-C2W0goZ72lQfFMbxjS&export=download
To: /content/labeledTrainData.tsv
33.6MB [00:00, 161MB/s] 


| | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... |
| ... | ... | ... | ... |
| 24995 | 3453_3 | 0 | It seems like more consideration has gone into... |
| 24996 | 5064_1 | 0 | I don't believe they made this film. Completel... |
| 24997 | 10905_3 | 0 | Guy is a loser. Can't get girls, needs to buil... |
| 24998 | 10194_3 | 0 | This 30 minute documentary Buñuel made in the ... |


Let's look at the average sentiment to see what we are dealing with (1 is positive sentiment and 0 is negative).

In [37]:
df['sentiment'].mean()

0.5

Looks like we're dealing with a balanced set of positives and negatives.

Next, let's look at a particular review.  To make the output look nicer, we'll create a [new Pandas series with line wrapping](https://www.geeksforgeeks.org/python-pandas-series-str-wrap/).

In [0]:
# this takes a little while to run
reviews_wrapped = df['review'].str.wrap(80)

In [39]:
print(reviews_wrapped.iloc[20])

\Soylent Green\" is one of the best and most disturbing science fiction movies
of the 70's and still very persuasive even by today's standards. Although flawed
and a little dated, the apocalyptic touch and the environmental premise (typical
for that time) still feel very unsettling and thought-provoking. This film's
quality-level surpasses the majority of contemporary SF flicks because of its
strong cast and some intense sequences that I personally consider classic. The
New York of 2022 is a depressing place to be alive, with over-population,
unemployment, an unhealthy climate and the total scarcity of every vital food
product. The only form of food available is synthetic and distributed by the
Soylent company. Charlton Heston (in a great shape) plays a cop investigating
the murder of one of Soylent's most eminent executives and he stumbles upon
scandals and dark secrets... The script is a little over-sentimental at times
and the climax doesn't really come as a big surprise, still the 

## The Bag of Words Model

We know that in order to apply Na&iuml;ve Bayes we need to convert each of our reviews into a vector of features.  There are lots of different methods to convert text into vectors.  In this notebook we'll be using a pretty basic (but surprisingly powerful) form of vectorization where we construct a feature vector with $k$ entries (where $k$ is the total number of unique words in the dataset) and for any particular review we set the corresponding entry to $1$ if that word appears in the review and $0$ otherwise.  This representation is called ***bag of words*** since the encoding of the review into a vector is independent of where the words occur in the review (you could shuffle the words in the review and still have the same feature vector).  The [Wikipedia article on Bag of Words](https://en.wikipedia.org/wiki/Bag-of-words_model) has more information.

Instead of writing our own code to convert from text to a bag of words representation we're going to use scikit-learn's built-in [count vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).  Before we apply it to the data, let's apply it to a toy dataset to help you better understand the bag of words model.
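
Here is a minimal sketch of what that looks like.  The three short "reviews" are made up for illustration (they are not from the IMDB data), and the names `toy_reviews`, `toy_vectorizer`, `toy_X`, and `toy_vocab` are our own.  Note that the vectorizer's default tokenizer drops single-character tokens such as "a".

```python
from sklearn.feature_extraction.text import CountVectorizer

# a made-up toy corpus (an assumption for illustration, not IMDB data)
toy_reviews = [
    "a great great movie",
    "a terrible movie",
    "great acting but terrible plot",
]

# binary=True records presence/absence rather than word counts
toy_vectorizer = CountVectorizer(binary=True)
toy_X = toy_vectorizer.fit_transform(toy_reviews).toarray()

# vocabulary_ maps each word to its column index; sort the words by that index
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(toy_vocab)
print(toy_X)

# shuffling the words of the first review gives the same feature vector
shuffled = toy_vectorizer.transform(["movie great a great"]).toarray()
print((shuffled[0] == toy_X[0]).all())
```

Each row of `toy_X` is one review and each column is one word in `toy_vocab`.  Because we passed `binary=True`, the repeated "great" in the first review still produces a $1$, not a $2$.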


## Vectorizing the Whole Dataset

Now that you have a general idea what bag of words is all about, let's apply it to our movie reviews.  To make our lives easier we're only going to include words in our feature vector if they occur in at least 100 reviews.  Doing this will help reduce overfitting (although in the next assignment we will learn another technique for dealing with it).  While we're at it we'll also convert the sentiment labels to a numpy array.

In [40]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

vectorizer = CountVectorizer(binary=True, min_df=100)
vectorizer.fit(df['review'])
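# toarray() gives a plain ndarray; recent scikit-learn versions reject the
# np.matrix that todense() returns
X = vectorizer.transform(df['review']).toarray()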
y = np.array(df['sentiment'])
print("X.shape", X.shape)
print("y.shape", y.shape)

X.shape (25000, 3833)
y.shape (25000,)


As a quick intuition builder, let's look at a word we think would probably differ across sentiment values.

In [41]:
# vectorizer.vocabulary_ maps each word to its column index in X
terrible_index = vectorizer.vocabulary_['terrible']
print("terrible occurs in", X[y==1, terrible_index].mean(), "for Y=1")
print("terrible occurs in", X[y==0, terrible_index].mean(), "for Y=0")

terrible occurs in 0.01736 for Y=1
terrible occurs in 0.08944 for Y=0


## Fitting a Model with sklearn

Instead of coding it ourselves, let's use sklearn's built-in algorithm for Na&iuml;ve Bayes.

In [42]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
np.mean(y_pred == y_test)

0.85168

## Understanding the Model

Now that we've fit the model, let's look at a few different ways to understand what the model is doing.

### Which Words Are Most Important

One way to investigate the model is to examine which words contribute the most to reviews being judged as positive versus negative.  We define contribution in this case as a combination of a word being strongly indicative of a particular sentiment as well as being relatively common.

Here is some code for computing this.


In [0]:
# model.feature_log_prob_ gives us the log probability of each of the features
# conditioned on a particular class.
log_probs = model.feature_log_prob_
# model.classes_ tells us which class corresponds to a particular row in
# model.feature_log_prob_.
class_mapping = model.classes_

In [44]:
pos_index = np.where(class_mapping == 1)[0][0]
neg_index = np.where(class_mapping == 0)[0][0]

difference_in_log_probs_by_word = log_probs[pos_index,:] - log_probs[neg_index,:]
print("the word that bumps up the log probability of a positive review as much as possible is",
      vectorizer.get_feature_names_out()[np.argmax(difference_in_log_probs_by_word)])

the word that bumps up the log probability of a positive review as much as possible is flawless


We can do the same thing for negative sentiment.

In [45]:
print("the word that bumps up the log probability of a negative review as much as possible is",
      vectorizer.get_feature_names_out()[np.argmin(difference_in_log_probs_by_word)])

the word that bumps up the log probability of a negative review as much as possible is incoherent


While flawless and incoherent might be the words that bump the prediction up or down as much as possible, they may not be all that likely to occur.  Next, we'll reweight these differences by how commonly each word occurs in the test data.

In [46]:
weighted_differences = np.multiply(X_test.mean(axis=0), difference_in_log_probs_by_word)
print("the word that bumps up the log probability of a positive review as much as possible weighted by prevalence is",
      vectorizer.get_feature_names_out()[np.argmax(weighted_differences)])

the word that bumps up the log probability of a positive review as much as possible weighted by prevalence is great


We can do the same thing for negative sentiment.

In [47]:
print("the word that bumps up the log probability of a negative review as much as possible weighted by prevalence is",
      vectorizer.get_feature_names_out()[np.argmin(weighted_differences)])

the word that bumps up the log probability of a negative review as much as possible weighted by prevalence is bad


We'll leave it to you to modify the code to print out words other than the topmost (e.g., using sort).
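
As a starting point, here is one possible sketch using `np.argsort`.  To keep it self-contained it uses small made-up arrays: `feature_names` and `scores` are hypothetical stand-ins for the vectorizer's feature names and for `difference_in_log_probs_by_word` (or `weighted_differences`), and `top_words` is a helper name of our own.

```python
import numpy as np

# hypothetical stand-ins with made-up numbers, for illustration only
feature_names = np.array(["bad", "boring", "fine", "great", "superb"])
scores = np.array([-1.2, -0.9, 0.1, 0.8, 1.5])

def top_words(scores, feature_names, k=3):
    """Return the k most positive and k most negative (word, score) pairs."""
    order = np.argsort(scores)  # ascending, so the most negative come first
    most_negative = [(str(feature_names[i]), float(scores[i])) for i in order[:k]]
    most_positive = [(str(feature_names[i]), float(scores[i])) for i in order[::-1][:k]]
    return most_positive, most_negative

most_positive, most_negative = top_words(scores, feature_names, k=2)
print("most positive:", most_positive)
print("most negative:", most_negative)
```

If your scores come out as a 2D matrix rather than a flat array (as can happen when `X_test` is a matrix), flatten them first with `np.asarray(scores).ravel()` before passing them in.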

## Analyzing the Model on a Particular Review

Another interesting strategy is to look at a particular review and see how each word contributes to the overall judgment of the model.

We'll do this by showing a running total of the log likelihood ratio of positive versus negative sentiment as the review unfolds.
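
Written out in our own notation (with $V$ denoting the vocabulary, and assuming the binary bag of words features used in this notebook), the quantity being accumulated is:

$$
\mathrm{LLR}(\text{review}) = \log\frac{P(y=1)}{P(y=0)} + \sum_{w \in V,\; w \in \text{review}} \Big[\log P(w \mid y=1) - \log P(w \mid y=0)\Big]
$$

The first term is the ratio of the class priors, and each distinct in-vocabulary word in the review adds one term to the sum.  A positive total favors positive sentiment; under the model, $P(y=1 \mid \text{review}) = 1/(1+e^{-\mathrm{LLR}})$.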

In [82]:
review_to_analyze_index = 20

# we'll work with the review at index 20 (the "Soylent Green" review printed earlier)
review_to_analyze = vectorizer.transform([df['review'].iloc[review_to_analyze_index]]).todense()

# first compute the log likelihood ratio of probability of positive review versus
# negative review (should be about 0 if these are balanced)
running_llr = model.class_log_prior_[pos_index] - model.class_log_prior_[neg_index]
running_llr

0.0034133366473572124

In the notebook where we implemented Na&iuml;ve Bayes, we took into account how the absence of words contributes to the judgment of positive versus negative sentiment.  In most cases, this barely changes the probability of positive sentiment and comes at a substantial computational cost. As a result, some implementations of Na&iuml;ve Bayes (such as sklearn's `MultinomialNB`, which we fit above) ignore this component.  If you're interested in how this code would work, you can set `use_word_absences` to `True` in the cell below.

In [0]:
use_word_absences = False
if use_word_absences:
    # next, we calculate the contribution to the log likelihood ratio from the words
    # that *do not* occur in the review.  These contributions in the case of bag of
    # words will matter much less than the words that do occur in the review

    probs_positive = np.exp(log_probs[pos_index,:])
    probs_negative = np.exp(log_probs[neg_index,:])
    llr_for_word_not_occurring = np.log((1-probs_positive)/(1-probs_negative))

    llr_contribution_for_words_not_occurring = llr_for_word_not_occurring[np.where(review_to_analyze[0,:] == 0)[1]].sum()
    print("llr_contribution_for_words_not_occurring", llr_contribution_for_words_not_occurring)
    # running_llr += llr_contribution_for_words_not_occurring
    print("running_llr", running_llr)
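
To get a feel for why a single absence matters so much less than a presence, here is a small worked example using the word "terrible".  It plugs in the raw fractions of positive and negative reviews containing that word from the output earlier in this notebook.  Treating those unsmoothed fractions as per-class word probabilities is an assumption made for illustration; they are not the values stored in `model.feature_log_prob_`.

```python
import math

# fraction of reviews in each class that contain "terrible" (from the earlier
# output); using them directly as probabilities is an illustrative assumption
p_word_given_pos = 0.01736
p_word_given_neg = 0.08944

# contribution to the log likelihood ratio when the word is present
llr_if_present = math.log(p_word_given_pos / p_word_given_neg)
# contribution to the log likelihood ratio when the word is absent
llr_if_absent = math.log((1 - p_word_given_pos) / (1 - p_word_given_neg))

print("present:", round(llr_if_present, 3))
print("absent: ", round(llr_if_absent, 3))
```

Seeing "terrible" pushes the ratio strongly toward negative sentiment, while not seeing it nudges the ratio only slightly toward positive.  Each absence is small, although a review is missing most of the vocabulary, so the absence terms can still add up.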

Next, we'll examine the actual review text and see how it affects the model's output.  We'll print out the review word by word along with the `running_llr` and the contribution to the llr from the last word.

In [84]:
import re

# vectorizer.vocabulary_ already maps each word to its column index
reverse_word_lookup = vectorizer.vocabulary_
pattern = re.compile(r'(?u)\b\w\w+\b')
processed_words = set()
rows = []

for word in re.findall(pattern, df['review'].iloc[review_to_analyze_index]):
    if word.lower() in reverse_word_lookup and word.lower() not in processed_words:
        # first occurrence of a vocabulary word: add its contribution
        contribution_to_llr = difference_in_log_probs_by_word[reverse_word_lookup[word.lower()]]
        processed_words.add(word.lower())
    else:
        # out-of-vocabulary word, or one we have already counted
        contribution_to_llr = 0

    running_llr += contribution_to_llr
    rows.append({'word': word, 'contribution to llr': contribution_to_llr, 'cumulative llr': running_llr})

# build the data frame once at the end (DataFrame.append was removed in pandas 2.0)
df_llr = pd.DataFrame(rows, columns=['word', 'contribution to llr', 'cumulative llr'])

with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    display(df_llr)

| | word | contribution to llr | cumulative llr |
|---|---|---|---|
| 0 | Soylent | 0.0 | 0.003413 |
| 1 | Green | -0.0958766 | -0.092463 |
| 2 | is | 0.0368863 | -0.055577 |
| 3 | one | 0.0241627 | -0.031414 |
| 4 | of | 0.0072271 | -0.024187 |
| 5 | the | 0.00677751 | -0.01741 |
| 6 | best | 0.617731 | 0.600321 |
| 7 | and | 0.0201221 | 0.620443 |
| 8 | most | 0.15582 | 0.776263 |
| 9 | disturbing | 0.0842356 | 0.860498 |


In [85]:
# as a sanity check, let's see what sklearn gives us
# predict_log_proba accepts the sparse matrix directly, so no dense conversion is needed
probs = model.predict_log_proba(vectorizer.transform([df['review'].iloc[review_to_analyze_index]]))
print(probs[0][pos_index] - probs[0][neg_index])

15.351287983285033
