<a href="https://colab.research.google.com/github/mlfa19/assignments/blob/master/Module%202/03/Assignment_3_Companion_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import gdown

gdown.download('https://drive.google.com/uc?authuser=0&id=1Z8bwIBa_0gFe9-C2W0goZ72lQfFMbxjS&export=download',
               'labeledTrainData.tsv',
               quiet=False)

Downloading...
From: https://drive.google.com/uc?authuser=0&id=1Z8bwIBa_0gFe9-C2W0goZ72lQfFMbxjS&export=download
To: /content/labeledTrainData.tsv
33.6MB [00:00, 89.3MB/s]


'labeledTrainData.tsv'

In [0]:
import pandas as pd

df = pd.read_csv('labeledTrainData.tsv', header=0, delimiter='\t')

Let's look at the average sentiment to see what we are dealing with (1 is positive sentiment and 0 is negative).

In [3]:
df['sentiment'].mean()

0.5

Looks like we're dealing with a balanced set of positives and negatives.

Next, let's look at a particular review.  To make the output look nicer, we'll create a [new Pandas series with line wrapping](https://www.geeksforgeeks.org/python-pandas-series-str-wrap/).

In [0]:
# this takes a little while to run
reviews_wrapped = df['review'].str.wrap(80)

In [5]:
print(reviews_wrapped.iloc[20])

\Soylent Green\" is one of the best and most disturbing science fiction movies
of the 70's and still very persuasive even by today's standards. Although flawed
and a little dated, the apocalyptic touch and the environmental premise (typical
for that time) still feel very unsettling and thought-provoking. This film's
quality-level surpasses the majority of contemporary SF flicks because of its
strong cast and some intense sequences that I personally consider classic. The
New York of 2022 is a depressing place to be alive, with over-population,
unemployment, an unhealthy climate and the total scarcity of every vital food
product. The only form of food available is synthetic and distributed by the
Soylent company. Charlton Heston (in a great shape) plays a cop investigating
the murder of one of Soylent's most eminent executives and he stumbles upon
scandals and dark secrets... The script is a little over-sentimental at times
and the climax doesn't really come as a big surprise, still the 

## Vectorizing the Data

We know that in order to apply Na&iuml;ve Bayes we need to convert each of our reviews into a vector of features.  There are lots of different methods to convert text into vectors.  In this notebook we'll be using a super basic form of this where we construct a feature vector with $k$ entries (where $k$ is the total number of unique words in the dataset) and for any particular review we set the corresponding entry to $1$ if that word appears in the review and $0$ otherwise.  This representation is called the bag of words since the encoding of the text into features is independent of where the words occur in the text (you could shuffle the words in the review and still have the same feature vector).

A more complete description of bag of words is given in TODO.
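To make the encoding concrete, here is a minimal sketch in plain Python on a made-up three-review corpus. The helper `to_binary_vector` and the toy reviews are hypothetical and only illustrate the idea. Splitting on whitespace is a simplification, and a library vectorizer will typically tokenize and normalize text differently.

```python
# Toy corpus; these three "reviews" are made up for illustration.
reviews = [
    "a great great movie",
    "a terrible movie",
    "great acting terrible plot",
]

# Vocabulary: every unique word in the corpus, in sorted order.
vocabulary = sorted({word for review in reviews for word in review.split()})

def to_binary_vector(review, vocabulary):
    """Return 1 for each vocabulary word present in the review, else 0."""
    words_in_review = set(review.split())
    return [1 if word in words_in_review else 0 for word in vocabulary]

vectors = [to_binary_vector(review, vocabulary) for review in reviews]

print(vocabulary)
for review, vector in zip(reviews, vectors):
    print(vector, review)

# Shuffling the words of a review leaves its feature vector unchanged.
print(to_binary_vector("movie great a great", vocabulary) == vectors[0])
```

Note that "great" appears twice in the first review but its entry is still $1$: this binary encoding records presence, not frequency.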

Here we're going to use scikit-learn's built-in [count vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).


In [0]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

vectorizer = CountVectorizer(binary=True)
vectorizer.fit(df['review'])
X = vectorizer.transform(df['review'])
y = np.array(df['sentiment'])

## Fitting the Parameters of the Na&iuml;ve Bayes Model

We know from our work in the main document that the maximum likelihood values of the parameters are given by TODO:
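As a sketch consistent with the code in this notebook (the symbols $N$, $N_c$, and $N_{j,c}$ are assumed notation and may differ from the main document's), the unsmoothed estimates can be written as:

$$
\hat{p}(y = c) = \frac{N_c}{N},
\qquad
\hat{p}(x_j = 1 \mid y = c) = \frac{N_{j,c}}{N_c}
$$

Here $N$ is the number of training reviews, $N_c$ is the number of training reviews with sentiment $c$, and $N_{j,c}$ is the number of training reviews with sentiment $c$ that contain word $j$.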

To understand what these counts mean, we can look at one of the entries.

In [11]:
import numpy as np

def get_xcounts_for_sentiment(X, y):
    # boolean masks select the reviews with each sentiment; summing over
    # rows gives, per word, the number of those reviews containing it
    return X[y == 1, :].sum(axis=0), X[y == 0, :].sum(axis=0)

xcount_for_sentiment_1, xcount_for_sentiment_0 = get_xcounts_for_sentiment(X, y)
print("66150th word count for positive sentiment", xcount_for_sentiment_1[0, 66150])
print("66150th word count for negative sentiment", xcount_for_sentiment_0[0, 66150])
# scikit-learn >= 1.0; older versions use get_feature_names() instead
print("66150th word is", vectorizer.get_feature_names_out()[66150])

66150th word count for positive sentiment 217
66150th word count for negative sentiment 1118
66150th word is terrible


Check your understanding by interpreting this output.  What might be going on with the 217 positive reviews that contain "terrible"?  We leave it to you to write the code to examine them if you want to dig into this further (we're happy to help if you have questions about how to do this).

## Implementing Na&iuml;ve Bayes

First, we'll divide our data into a train and test set.  Then we'll walk you through the implementation.

In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)

Next we're going to fit the parameters of our model.  Recall that we need to compute the probability of each possible sentiment value.

In [13]:
def get_y_counts_for_sentiment(y):
    return y.sum(), (y == 0).sum()

y_count_for_sentiment_1, y_count_for_sentiment_0 = get_y_counts_for_sentiment(y_train)

def get_p_y(y_count_for_sentiment_1, y_count_for_sentiment_0):
    z = y_count_for_sentiment_1 + y_count_for_sentiment_0
    return y_count_for_sentiment_1 / z, y_count_for_sentiment_0 / z

p_sentiment_1, p_sentiment_0 = get_p_y(y_count_for_sentiment_1, y_count_for_sentiment_0)
print(p_sentiment_1, p_sentiment_0)

0.5007466666666667 0.4992533333333333


We also need to compute the probability of each of the features conditioned on a particular sentiment.

In [0]:
def get_p_x_given_y(xcount_for_sentiment_1,
                    xcount_for_sentiment_0,
                    ycount_for_sentiment_1,
                    ycount_for_sentiment_0):
    p_of_word_sentiment_1 = xcount_for_sentiment_1 / ycount_for_sentiment_1
    p_of_word_sentiment_0 = xcount_for_sentiment_0 / ycount_for_sentiment_0
    return p_of_word_sentiment_1, p_of_word_sentiment_0

x_count_for_sentiment_1, x_count_for_sentiment_0 = get_xcounts_for_sentiment(X_train, y_train)

# add smoothing if desired
smoothing = 1

x_count_for_sentiment_1 += smoothing
x_count_for_sentiment_0 += smoothing

p_of_word_sentiment_1, p_of_word_sentiment_0 = get_p_x_given_y(x_count_for_sentiment_1,
                                                               x_count_for_sentiment_0,
                                                               y_count_for_sentiment_1,
                                                               y_count_for_sentiment_0)

In [23]:
# remember that word 66150 is "terrible"
print(p_of_word_sentiment_1[0,66150])
print(p_of_word_sentiment_0[0,66150])

0.01736074129300245
0.0894135241961329


## Classifying New Data

In [0]:
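def get_log_posterior(test_point, p_of_sentiment, p_of_word):
    # Unnormalized log posterior: the log prior plus the log probability
    # of each word present in the review (absent words are skipped).
    log_posterior = np.log(p_of_sentiment)
    for index in test_point.indices:
        log_posterior += np.log(p_of_word[0, index])
    return log_posterior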
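`get_log_posterior` adds logarithms instead of multiplying probabilities. One reason to work in log space is that a product of many small probabilities underflows to zero in floating point, while the sum of their logs stays finite and preserves the ordering between classes. A minimal sketch with made-up numbers follows. The probabilities `1e-3` and `2e-3` and the count of 200 words are assumptions for illustration, and `math.prod` requires Python 3.8 or later.

```python
import math

# Hypothetical per-word probabilities for a review with 200 distinct words.
probs_class_a = [1e-3] * 200
probs_class_b = [2e-3] * 200

# Multiplying directly underflows: both products become exactly 0.0,
# so the two classes can no longer be compared.
product_a = math.prod(probs_class_a)
product_b = math.prod(probs_class_b)
print(product_a, product_b)

# Summing logs keeps the values finite and comparable.
log_sum_a = sum(math.log(p) for p in probs_class_a)
log_sum_b = sum(math.log(p) for p in probs_class_b)
print(log_sum_a, log_sum_b)
print("class b scores higher:", log_sum_b > log_sum_a)
```

Because the logarithm is monotonic, comparing the log scores picks the same class as comparing the original products would if they could be represented exactly.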

In [0]:
num_correct = 0
for i, point in enumerate(X_test):
    log_likelihood_sentiment_1 = get_log_posterior(point, p_sentiment_1, p_of_word_sentiment_1)
    log_likelihood_sentiment_0 = get_log_posterior(point, p_sentiment_0, p_of_word_sentiment_0)
    y_pred = float(log_likelihood_sentiment_1 > log_likelihood_sentiment_0)
    y_actual = y_test[i]
    num_correct += float(y_pred == y_actual)

In [26]:
print("accuracy is", num_correct / len(y_test))

accuracy is 0.85056


## Sanity Check

In [27]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
np.mean(y_pred == y_test)

0.8464