# Testing Implementations
Although I plan to use real-world data to test the naive Bayes classifier I'm building, it will likely be far quicker to experiment with the Multinomial and Gaussian flavors of naive Bayes on toy datasets first, which is the purpose of this notebook.

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

### Multinomial Naive Bayes
For the multinomial version of naive Bayes, I need a dataset that contains discrete features, such as counts. I'll be using a dataset of IMDB reviews labelled either positive or negative. The labelled data is a text file in which each line ends with a 1 or 0, denoting a positive or negative review, respectively.

In [2]:
# I have removed punctuation and excess whitespace to
# prevent certain words from being differentiated, such
# as "very," and "very"

imdb_cols = ["review", "sentiment"]
imdb = pd.read_csv("imdb_labelled.txt", sep="\t", names=imdb_cols)
print(imdb["review"][344])

imdb["review"] = imdb["review"].str.strip()
imdb["review"] = imdb["review"].str.replace(r"[^\w\s-]", "", regex=True)
imdb["review"] = imdb["review"].str.replace(r"\-", " ", regex=True)
imdb["review"] = imdb["review"].str.replace(r"\s{2,}", " ", regex=True)
imdb["review"] = imdb["review"].str.lower()

imdb.head()

While you don't yet hear Mickey speak, there are tons of sound effects and music throughout the film--something we take for granted now but which was a huge crowd pleaser in 1928.  


| | review | sentiment |
|---|---|---|
| 0 | a very very very slow moving aimless movie abo... | 0 |
| 1 | not sure who was more lost the flat characters... | 0 |
| 2 | attempting artiness with black white and cleve... | 0 |
| 3 | very little music or anything to speak of | 0 |
| 4 | the best scene in the movie was when gerardo i... | 1 |


In [3]:
print(imdb["review"][344])

while you dont yet hear mickey speak there are tons of sound effects and music throughout the film something we take for granted now but which was a huge crowd pleaser in 1928
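The same cleaning steps can be checked on a single string outside pandas. This is a minimal sketch using Python's `re` module; the helper name `clean_review` and the sample sentence are made up for illustration and are not part of the dataset.

```python
import re

def clean_review(text):
    """Apply the same cleaning steps as the pandas pipeline to one string."""
    text = text.strip()                     # trim leading/trailing whitespace
    text = re.sub(r"[^\w\s-]", "", text)    # drop punctuation except hyphens
    text = re.sub(r"-", " ", text)          # hyphens become spaces
    text = re.sub(r"\s{2,}", " ", text)     # collapse runs of whitespace
    return text.lower()

sample = "Very,  very slow--and AIMLESS movie!  "
cleaned = clean_review(sample)
print(cleaned)
```

Keeping hyphens in the first pass and turning them into spaces afterwards is what stops "film--something" from being fused into a single token.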


In [4]:
# use sklearn's count vectorizer to create vectors for each review
vectorizer = CountVectorizer()

X = vectorizer.fit_transform(imdb["review"])
y = imdb["sentiment"]

# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.20, random_state=42)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(598, 3114)
(150, 3114)
(598,)
(150,)


The prior probability of a good review in the training set is $\frac{312}{598}$, while the prior probability of a bad review is $\frac{286}{598}$.

In other words:

$P(Good) \approx 0.522$

$P(Bad) \approx 0.478$

In [5]:
# determining the prior probabilities for good and bad reviews
print(y_train.value_counts())
y_train.value_counts(normalize=True)

1    312
0    286
Name: sentiment, dtype: int64


1    0.521739
0    0.478261
Name: sentiment, dtype: float64

Now I want to examine the probability of a particular word appearing in a bad review. In this case, I'll be looking at the word "bad".

In [6]:
# create a matrix with "feature names"
words = vectorizer.get_feature_names_out()
term_matrix = pd.DataFrame(X_train.toarray(), columns=words)
term_matrix.head()

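| | 010 | 10 | 1010 | 110 | 12 | 13 | 15 | 17 | 18th | 1928 | ... | your | youre | yourself | youthful | youtube | youve | yun | zillion | zombie | zombiez |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |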


In [8]:
# add the movie review label for reference
term_matrix["review_of_movie"] = y_train.values
term_matrix.head()

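| | 010 | 10 | 1010 | 110 | 12 | 13 | 15 | 17 | 18th | 1928 | ... | youre | yourself | youthful | youtube | youve | yun | zillion | zombie | zombiez | review_of_movie |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |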


In [9]:
# get all bad and good reviews
bad = term_matrix[term_matrix["review_of_movie"] == 0]
good = term_matrix[term_matrix["review_of_movie"] == 1]

# total number of words in each
print(f"Total words in bad: {bad.sum().sum()}")
print(f"Total words in good: {good.sum().sum()}")

Total words in bad: 6003
Total words in good: 5954


How many times does the word bad occur in good and bad reviews?

In [10]:
print(f"Total times 'bad' appears in bad reviews: {bad.bad.sum()}")
print(f"Total times 'bad' appears in good reviews: {good.bad.sum()}")

Total times 'bad' appears in bad reviews: 50
Total times 'bad' appears in good reviews: 7


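The probability of observing the word "bad" given that the review is bad is $\frac{50}{6003}$. For good reviews, note that the total of 5954 printed above includes the 312 ones in the `review_of_movie` label column, so the actual word total is $5954 - 312 = 5642$ and the probability of observing "bad" given that the review is good is $\frac{7}{5642}$. That is: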

$P(bad|Bad) \approx 0.008$

$P(bad|Good) \approx 0.001$
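
Written generally, with $N_{w,c}$ denoting the number of times word $w$ occurs in training reviews of class $c$ (notation introduced here for illustration, not used elsewhere in the notebook), the unsmoothed estimate is:

$$P(w \mid c) = \frac{N_{w,c}}{\sum_{w'} N_{w',c}}$$

For example, $P(bad \mid Bad) = \frac{50}{6003} \approx 0.008$.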

These numbers will need to be computed for each word and label directly from the count matrix, using numpy rather than pandas.

In [11]:
X_train[(y_train.values == 1)].toarray().sum() + X_train[(y_train.values == 0)].toarray().sum()

11645

In [12]:
# subtract the 312 ones contributed by the review_of_movie label column
term_matrix.sum().sum() - 312

11645

In [13]:
# testing numpy's boolean indexing to see if
# it works the way I think it does
test_x = np.array([[0, 1, 1],
                   [1, 2, 1],
                   [6, 3, 1],
                   [1, 4, 1]])
test_y = np.array([0, 1, 1, 0])

# testing numpy's sum functions
test_x[(test_y == 1)].sum(axis=0)[1]

5

Now I need to work out how to calculate these probabilities without using pandas explicitly. _Note_: `y_train` is a pandas Series, so `y_train.values` is used to get the underlying numpy array.

In [14]:
# column index of the word "bad" (238 for this vocabulary)
bad_idx = vectorizer.vocabulary_["bad"]

# number of times "bad" appears in bad reviews
bad_in_bad = X_train[y_train.values == 0].toarray().sum(axis=0)[bad_idx]

# number of times "bad" appears in good reviews
bad_in_good = X_train[y_train.values == 1].toarray().sum(axis=0)[bad_idx]


total_words_in_good = X_train[y_train.values == 1].sum()
total_words_in_bad = X_train[y_train.values == 0].sum()

print("P(bad|Bad): %.3f" % (bad_in_bad / total_words_in_bad))
print("P(bad|Good): %.3f" % (bad_in_good / total_words_in_good))

P(bad|Bad): 0.008
P(bad|Good): 0.001


The calculation above matches what I arrived at previously using pandas. Now I need to figure out how to get and store these probabilities for each word, for each class.

In [None]:
# first I need to be able to count the classes in the dependent variable
classes, counts = np.unique(y_train.values, return_counts=True)
dict(zip(classes, counts))

# I need to be able to store the individual probabilities for
# each word, given a class
class_probabilities = {c:{} for c in classes}
class_probabilities

for c in class_probabilities:
    total_words = X_train[y_train.values == c].sum()
    for w in range(X_train.shape[1]):
        word_occurrences = X_train[y_train.values == c].toarray().sum(axis=0)[w]
    
        class_probabilities[c][w] = (word_occurrences / total_words)
    
class_probabilities        

The word "bad" sits at column index 238 of the term matrix. This should align with the new `class_probabilities` dictionary, which contains the probabilities for each word, given each class:

In [16]:
print("P(bad|Bad): %.3f" % class_probabilities[0][238])
print("P(bad|Good): %.3f" % class_probabilities[1][238])

P(bad|Bad): 0.008
P(bad|Good): 0.001


Now that the probabilities are reliably stored and indexed, we should be able to use Bayes' Theorem with the naive assumption to classify a fake review. The naive assumption is that each word is independent of all others: no word affects the count or appearance of any other word, so each word contributes to the probability independently.
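
As a sketch of what this looks like in symbols: for a review with word counts $x_1, \dots, x_V$ over a vocabulary of $V$ words (symbols introduced here for illustration), the score for a class $c$ factors into a product:

$$P(c \mid x_1, \dots, x_V) \propto P(c) \prod_{i=1}^{V} P(w_i \mid c)^{x_i}$$

The review is assigned to whichever class gives the larger score. Words that do not appear in the review have $x_i = 0$ and contribute a factor of 1.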

In [17]:
fake_review = ["dont regret seeing this movie it was actually pretty good"]
fake_review_transformed = vectorizer.transform(fake_review)

In [18]:
for i in fake_review_transformed:
    print(i.data)
    print(i.indices)

[1 1 1 1 1 1 1 1 1 1]
[  71  788 1196 1467 1791 2082 2189 2351 2713 2976]


In [19]:
# in order to classify the fake review, we need P(Bad) and P(Good)
p_bad = counts[0] / (counts[0] + counts[1])
p_good = counts[1] / (counts[0] + counts[1])

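Unfortunately, without smoothing, any word that never occurs in a class during training has a conditional probability of zero for that class, and a single zero factor wipes out the whole product. The probability will therefore likely come out to zero: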

In [20]:
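p = p_bad
# caution: `i` here is each entry of the dense count vector (0 or 1), so the
# lookup below uses it as a word index; looping over
# `fake_review_transformed.indices` would pick out the review's actual words
for i in fake_review_transformed.toarray()[0]:
    p *= class_probabilities[0][i]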
    
print(p)

0.0


With that in mind, smoothing will need to be incorporated by default. This will be accomplished by adding 1 to every word count (add-one, or Laplace, smoothing).
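
The effect of adding 1 can be seen on a tiny made-up count table. This is a minimal sketch: the numbers and the names `counts`, `alpha`, and `vocab_size` are assumptions for illustration and are not taken from the IMDB data.

```python
import numpy as np

# toy per-class word counts: rows are classes, columns are words
# (made-up numbers, not taken from the IMDB data)
counts = np.array([[2, 0, 1],
                   [0, 3, 1]])

alpha = 1                      # add-one (Laplace) smoothing
vocab_size = counts.shape[1]   # number of distinct words

# without smoothing, unseen words get a probability of exactly zero
unsmoothed = counts / counts.sum(axis=1, keepdims=True)

# add alpha to every count and alpha * vocab_size to every class total
smoothed = (counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * vocab_size)

print(unsmoothed)
print(smoothed)
```

Each row of `smoothed` still sums to 1, because the denominator grows by exactly the amount added to the numerators.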

In [21]:
# the probabilities need to be recalculated after adding 1 to every word
# and increasing the total words by the number of total additions
class_probabilities_smooth = {c:{} for c in classes}
class_probabilities_smooth

for c in class_probabilities_smooth:
    total_words = X_train[y_train.values == c].sum() + X_train.shape[1]
    for w in range(X_train.shape[1]):
        word_occurrences = X_train[y_train.values == c].toarray().sum(axis=0)[w] + 1
    
        class_probabilities_smooth[c][w] = (word_occurrences / total_words)

In [22]:
p = p_bad
fake_array = zip(fake_review_transformed.data, fake_review_transformed.indices)
for i, w in fake_array:
    if i > 1:
        p *= (class_probabilities_smooth[0][w]**i)
        print(i)
    else:
        p *= class_probabilities_smooth[0][w]
prob_bad = p

In [23]:
p = p_good
fake_array = zip(fake_review_transformed.data, fake_review_transformed.indices)
for i, w in fake_array:
    if i > 1:
        p *= (class_probabilities_smooth[1][w]**i)
        print(i)
    else:
        p *= class_probabilities_smooth[1][w]
prob_good = p

In [24]:
if prob_good > prob_bad:
    print("Good Review")
else:
    print("Bad Review")

Good Review


For all intents and purposes, the algorithm works well enough, but the "fitting" of the data takes far too long. How to solve this issue? Taking a peek at scikit-learn's [implementation of naive Bayes](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/naive_bayes.py) exposes a clever mathematical solution.

In [25]:
from sklearn import preprocessing

# First, the labels are binarized to provide
# a "one-vs-all" method. This will help to 
# gather conditional probabilities in one location
lb = preprocessing.LabelBinarizer()
Y = lb.fit_transform(y_train)
Y = np.concatenate((1-Y, Y), axis=1)
print(f"Classes in the example: {lb.classes_}")
print(f"Binarized Classes: \n{Y[:5]}")

Classes in the example: [0 1]
Binarized Classes: 
[[0 1]
 [1 0]
 [1 0]
 [1 0]
 [1 0]]


That was sklearn's version. The following is my implementation, which is probably simpler:

In [26]:
labs = np.array([1, 2, 4, 3, 5, 5, 4, 3, 2, 1 ,3, 4, 1])
cls_ = np.unique(labs)
lbins = np.zeros((labs.shape[0], np.unique(labs).shape[0]))
for i in range(len(labs)):
    x, = np.where(cls_ == labs[i])
    lbins[i][x] = 1
    
print(f"Unique classes: {cls_}")
lbins[:5]

Unique classes: [1 2 3 4 5]


array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 1.]])
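
The same one-hot matrix can probably be built without the explicit loop by letting numpy broadcast the comparison. This is a sketch of an alternative, not sklearn's approach; `lbins_vec` is a name introduced here for illustration.

```python
import numpy as np

labs = np.array([1, 2, 4, 3, 5, 5, 4, 3, 2, 1, 3, 4, 1])
cls_ = np.unique(labs)

# compare every label against every class at once: the result has
# shape (n_samples, n_classes) with a 1 where label == class
lbins_vec = (labs[:, None] == cls_[None, :]).astype(float)

print(lbins_vec[:5])
```

Every row contains exactly one 1, in the column of the class that row's label belongs to.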

Using the previously binarized `Y` for convenience, the features can be counted per class by taking the dot product of the transposed `Y` and the feature matrix, and the classes can be counted by summing the columns of `Y`.

In [27]:
feature_count = np.dot(Y.T, X_train.toarray())
class_count = Y.sum(axis=0)
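
The dot-product trick can be checked on the small `test_x` and `test_y` arrays from the earlier boolean-indexing test. This is a minimal sketch; `Y_toy`, `counts_by_dot`, and `counts_by_mask` are names introduced here for illustration.

```python
import numpy as np

# the toy arrays from the boolean-indexing test earlier in the notebook
test_x = np.array([[0, 1, 1],
                   [1, 2, 1],
                   [6, 3, 1],
                   [1, 4, 1]])
test_y = np.array([0, 1, 1, 0])

# one-hot labels: column 0 marks class 0, column 1 marks class 1
Y_toy = np.column_stack((1 - test_y, test_y))

# each row of the product sums the feature rows belonging to one class
counts_by_dot = Y_toy.T @ test_x

# the same counts obtained by boolean indexing, one class at a time
counts_by_mask = np.vstack([test_x[test_y == c].sum(axis=0) for c in (0, 1)])

print(counts_by_dot)
```

Because each column of the one-hot matrix is 1 only for rows of its class, multiplying by its transpose adds up exactly those rows, which is why a single matrix product replaces the nested loops.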

In [28]:
# is index 238 of the "bad" reviews (index 0 of Y)
# equal to 50 occurrences of the word bad? What
# about the "good" reviews? Should be 7 occurrences

print(f"Number at index 238 for 'bad' (Bad Reviews): {feature_count[0][238]}")
print(f"Number at index 238 for 'bad' (Good Reviews): {feature_count[1][238]}")

Number at index 238 for 'bad' (Bad Reviews): 50
Number at index 238 for 'bad' (Good Reviews): 7


At this point, the system seems trustworthy, and getting the data into the respective data structures happens far more quickly this way than with the nested-loop solution from before, which recomputed the per-class sums for every word.

Now the conditional probabilities need to be extracted from the new arrays. This necessitates smoothing (adding 1 or a smaller value to all the feature counts) to avoid getting probabilities of zero.

In [29]:
smoothed_feature_count = feature_count + 1

# the smoothed instance count is the total number of word
# occurrences in each class plus the vocabulary size
# (3114 here), since 1 was added to every word count
smoothed_instance_count = smoothed_feature_count.sum(axis=1)

Since underflow is a real possibility when multiplying these probabilities, I'll once again take a page out of the sklearn book and do the arithmetic in log space. This means that instead of dividing the word counts by the total words in the class, I'll subtract the log of the smoothed class total from the log of the smoothed feature count.
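
To see why underflow matters, here is a minimal sketch that multiplies many small probabilities directly and then sums their logs instead. The 400 probabilities of 0.001 are made-up values chosen only to force the underflow.

```python
import math

# 400 made-up word probabilities of 0.001 each
probs = [0.001] * 400

product = 1.0
for p in probs:
    product *= p    # shrinks toward zero and eventually underflows

# summing logs keeps the same information as a finite float
log_sum = sum(math.log(p) for p in probs)

print(product)
print(log_sum)
```

The direct product collapses to zero, while the log sum remains an ordinary negative number that can still be compared between classes.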

In [30]:
log_probs = (np.log(smoothed_feature_count) - 
             np.log(smoothed_instance_count.reshape(-1,1)))

print(log_probs[0][238])  
print(np.log(class_probabilities_smooth[0][238]))

-5.186070448860576
-5.186070448860577
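
In symbols, with $N_{w,c}$ as the count of word $w$ in class $c$ and $V$ as the vocabulary size (notation introduced here for illustration), the quantity stored in `log_probs` is:

$$\log P(w \mid c) = \log(N_{w,c} + 1) - \log\Big(\sum_{w'} N_{w',c} + V\Big)$$

For the word "bad" in bad reviews this gives $\log(50 + 1) - \log(6003 + 3114) = \log(51) - \log(9117) \approx -5.186$, which agrees with the value printed above.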


In [31]:
# setting the log priors involves a similar process
log_priors = np.log(class_count) - np.log(class_count.sum(axis=0).reshape(-1,1))
print(np.log(class_count / class_count.sum(axis=0).reshape(-1,1)))
print(log_priors)

[[-0.73759894 -0.65058757]]
[[-0.73759894 -0.65058757]]


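The argmax function returns the index of the class with the largest score, where the score is the log prior plus the sum of the log probabilities of the words in the review: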

In [32]:
np.argmax(np.dot(fake_review_transformed.toarray(), log_probs.T) + log_priors)

1

What if there are multiple reviews?

In [100]:
multiple_reviews = ["the best movie i have seen this year loved it",
                    "awful just so slow and painful to watch, dont see it",
                    "great movie enjoyed every minute almost cried"]

mr_transformed = vectorizer.transform(multiple_reviews)

# "add" the log probabilities for every word in the review
# by using the dot product approach, and then add the
# log priors for each column respectively
scores = np.dot(mr_transformed.toarray(), log_probs.T) + log_priors
print(scores)

[[-52.57164459 -51.68864253]
 [-66.78447684 -71.93680231]
 [-41.92879856 -41.64429618]]


In [101]:
# create an array of predictions based on largest
# score for each vector
np.argmax(scores, axis=1)

array([1, 0, 1])

This process results in practically the same outcome, but with a quicker "fitting" thanks to numpy math operations. 
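
Pulling the steps together, the whole fit-and-predict flow might look something like the sketch below. This is an illustration under stated assumptions rather than a finished implementation: the class name `TinyMultinomialNB`, its attribute names, and the toy data are all made up here, and it expects dense arrays (a sparse matrix would need `.toarray()` first).

```python
import numpy as np

class TinyMultinomialNB:
    """Minimal multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, X, y):
        X = np.asarray(X)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        # one-hot encode the labels: shape (n_samples, n_classes)
        Y = (y[:, None] == self.classes_[None, :]).astype(float)
        feature_count = Y.T @ X        # per-class word counts
        class_count = Y.sum(axis=0)    # number of samples per class
        smoothed = feature_count + 1   # add-one smoothing
        self.log_probs_ = np.log(smoothed) - np.log(smoothed.sum(axis=1, keepdims=True))
        self.log_priors_ = np.log(class_count) - np.log(class_count.sum())
        return self

    def predict(self, X):
        # log prior plus the count-weighted sum of word log probabilities
        scores = np.asarray(X) @ self.log_probs_.T + self.log_priors_
        return self.classes_[np.argmax(scores, axis=1)]

# toy count data: the columns could stand for the words "good", "bad", "movie"
X_toy = np.array([[3, 0, 1],
                  [2, 0, 2],
                  [0, 3, 1],
                  [0, 2, 2]])
y_toy = np.array([1, 1, 0, 0])

model = TinyMultinomialNB().fit(X_toy, y_toy)
preds = model.predict(np.array([[2, 0, 1], [0, 1, 1]]))
print(preds)
```

Indexing `classes_` with the argmax result returns the original labels rather than column positions, which matters once the labels are something other than 0 and 1.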

### Gaussian Naive Bayes
For the Gaussian Naive Bayes, I'll be using the famed Iris dataset to provide continuous data for classification.

In [None]:
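iris_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]  # iris.data has no header row; these names are my own labels
iris = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", names=iris_cols)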

In [None]:
iris.head()