# Testing Implementations
Although I plan to test the Naive Bayes classifier I'm building on real-world data, 
it will likely be far quicker to experiment with the Multinomial and Gaussian flavors of 
Naive Bayes on toy datasets, which is the purpose of this notebook.

In [None]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

### Multinomial Naive Bayes
For the multinomial version of Naive Bayes, I need a dataset that contains discrete features, such as counts. I'll be using a dataset of IMDB reviews labelled either positive or negative. The labelled data is a text file in which each line ends with a 1 or 0, denoting a positive or negative review, respectively.

In [422]:
# I have removed punctuation and excess whitespace to
# prevent certain words from being differentiated, such
# as "very," and "very"

imdb_cols = ["review", "sentiment"]
imdb = pd.read_csv("imdb_labelled.txt", sep="\t", names=imdb_cols)
print(imdb["review"][344])

imdb["review"] = imdb["review"].str.strip()
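# regex=True is passed explicitly because pandas >= 2.0 treats the pattern as a literal string by default
imdb["review"] = imdb["review"].str.replace(r"[^\w\s-]", "", regex=True)
imdb["review"] = imdb["review"].str.replace(r"\-", " ", regex=True)
imdb["review"] = imdb["review"].str.replace(r"\s{2,}", " ", regex=True)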
imdb["review"] = imdb["review"].str.lower()

imdb.head()

While you don't yet hear Mickey speak, there are tons of sound effects and music throughout the film--something we take for granted now but which was a huge crowd pleaser in 1928.  


| | review | sentiment |
|---|---|---|
| 0 | a very very very slow moving aimless movie abo... | 0 |
| 1 | not sure who was more lost the flat characters... | 0 |
| 2 | attempting artiness with black white and cleve... | 0 |
| 3 | very little music or anything to speak of | 0 |
| 4 | the best scene in the movie was when gerardo i... | 1 |


In [None]:
print(imdb["review"][344])

In [None]:
# use sklearn's count vectorizer to create vectors for each review
vectorizer = CountVectorizer()

X = vectorizer.fit_transform(imdb["review"])
y = imdb["sentiment"]

# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.20, random_state=42)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

In the training set, the prior probability of a review being good is $\frac{312}{598}$, while the prior probability of a review being bad is $\frac{286}{598}$.

In other words:

$P(Good) \approx 0.522$

$P(Bad) \approx 0.478$
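
As a minimal sketch of how these priors fall out of the label counts, the snippet below uses a small made-up label array (`toy_labels` is hypothetical, not the IMDB data) and normalizes the class counts:

```python
import numpy as np

# Hypothetical toy labels: 1 = good review, 0 = bad review
toy_labels = np.array([1, 0, 1, 1, 0])

# Count how often each class occurs, then normalize to get the priors
toy_classes, toy_counts = np.unique(toy_labels, return_counts=True)
toy_priors = dict(zip(toy_classes.tolist(), (toy_counts / toy_counts.sum()).tolist()))

print(toy_priors)
```

The same two steps (count, then divide by the total) should apply unchanged to `y_train.values`.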

In [None]:
# determining the prior probabilities for good and bad reviews
print(y_train.value_counts())
y_train.value_counts(normalize=True)

Now I want to examine the probability of a particular word appearing in a bad review. In this case, I'll be looking at the word "bad".

In [None]:
# build a DataFrame whose columns are the vocabulary words
# (get_feature_names() was removed in scikit-learn 1.2)
words = vectorizer.get_feature_names_out()
term_matrix = pd.DataFrame(X_train.toarray(), columns=words)
term_matrix.head()

In [None]:
# add the movie review label for reference
term_matrix["review_of_movie"] = y_train.values
term_matrix.head()

In [None]:
# get all bad and good reviews
bad = term_matrix[term_matrix["review_of_movie"] == 0]
good = term_matrix[term_matrix["review_of_movie"] == 1]

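# total number of words in each (the label column is dropped so it is not counted)
print(f"Total words in bad: {bad.drop(columns='review_of_movie').sum().sum()}")
print(f"Total words in good: {good.drop(columns='review_of_movie').sum().sum()}")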

How many times does the word "bad" occur in good and bad reviews?

In [None]:
print(f"Total times 'bad' appears in bad reviews: {bad['bad'].sum()}")
print(f"Total times 'bad' appears in good reviews: {good['bad'].sum()}")

The probability of observing the word "bad" given that the review is bad is $\frac{50}{6003}$, while the probability of observing it given that the review is good is $\frac{7}{5954}$. That is:

$P(bad|Bad) \approx 0.008$

$P(bad|Good) \approx 0.001$

These numbers will need to be computed for every word and every label, working directly from the numpy array.
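
One way this could look, sketched on a small dense matrix: the helper name `word_likelihoods` and the toy arrays are made up for illustration, and the estimates are raw relative frequencies with no smoothing.

```python
import numpy as np

# Hypothetical document-term matrix: 4 documents, 3 vocabulary words
toy_X = np.array([[2, 0, 1],
                  [0, 1, 1],
                  [1, 0, 0],
                  [0, 3, 1]])
toy_y = np.array([0, 1, 0, 1])

def word_likelihoods(X, y):
    """Return {class: array of P(word | class)} using raw relative frequencies."""
    likelihoods = {}
    for c in np.unique(y):
        word_counts = X[y == c].sum(axis=0)  # occurrences of each word in class c
        likelihoods[int(c)] = word_counts / word_counts.sum()
    return likelihoods

toy_likelihoods = word_likelihoods(toy_X, toy_y)
print(toy_likelihoods[0])
print(toy_likelihoods[1])
```

With a SciPy sparse matrix such as `X_train`, the column sums come back as a 2-D matrix, so they would likely need flattening first (for example with `np.asarray(...).ravel()`).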

In [None]:
X_train[(y_train.values == 1)].toarray().sum() + X_train[(y_train.values == 0)].toarray().sum()

In [None]:
# subtract the 312 label values (the 1s in review_of_movie) that the sum also picks up
term_matrix.sum().sum() - 312

In [None]:
# testing numpy's boolean indexing to see if
# it works the way I think it does
test_x = np.array([[0, 1, 1],
                   [1, 2, 1],
                   [6, 3, 1],
                   [1, 4, 1]])
test_y = np.array([0, 1, 1, 0])

# testing numpy's sum functions
test_x[(test_y == 1)].sum(axis=0)[1]

Now I need to know how calculating probabilities is going to work without using pandas explicitly. _Note_: `y_train` is a pandas Series, so its underlying numpy array is accessed through `.values`.

In [None]:
# number of times "bad" appears in bad reviews
bad_in_bad = X_train[y_train.values == 0].toarray().sum(axis=0)[238]

# number of times "bad" appears in good reviews
bad_in_good = X_train[y_train.values == 1].toarray().sum(axis=0)[238]


total_words_in_good = X_train[y_train.values == 1].sum()
total_words_in_bad = X_train[y_train.values == 0].sum()

print("P(bad|Bad): %.3f" % (bad_in_bad / total_words_in_bad))
print("P(bad|Good): %.3f" % (bad_in_good / total_words_in_good))

The calculation above gives the same result I arrived at previously using pandas. Now I need to figure out how to compute and store these probabilities for each word, for each class.

In [None]:
# first I need to be able to count the classes in the dependent variable
classes, counts = np.unique(y_train.values, return_counts=True)
dict(zip(classes, counts))

# I need to be able to store the individual probabilities for 
# each word, given a class
class_probabilities = {c:{} for c in classes}

for c in class_probabilities:
    # per-word counts for class c, computed once rather than once per word
    word_counts = X_train[y_train.values == c].toarray().sum(axis=0)
    total_words = word_counts.sum()
    for w in range(X_train.shape[1]):
        class_probabilities[c][w] = (word_counts[w] / total_words)
    
class_probabilities        

The word "bad" sits at column index 238 of the term matrix. This should align with the new `class_probabilities` dictionary, which contains the probabilities for each word, given each class:

In [None]:
print("P(bad|Bad): %.3f" % class_probabilities[0][238])
print("P(bad|Good): %.3f" % class_probabilities[1][238])

Now that the probabilities are reliably stored and indexed, we should be able to use Bayes' Theorem with the naive assumption to classify a fake review. The naive assumption is that each word is independent of all others given the class: no word affects the count or appearance of any other word, so each word contributes to the probability independently.
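
Written out, the rule being applied is roughly the following, where $x_w$ (the number of times word $w$ occurs in the review) and $V$ (the vocabulary) are notation introduced here for illustration:

$$P(c \mid review) \propto P(c) \prod_{w \in V} P(w \mid c)^{x_w}$$

The predicted label is whichever class $c$ makes the right-hand side largest. Words that do not occur in the review have $x_w = 0$ and contribute a factor of 1, so only the words actually present matter.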

In [544]:
fake_review = ["dont regret seeing this movie it was actually pretty good"]
fake_review_transformed = vectorizer.transform(fake_review)

In [511]:
for i in fake_review_transformed:
    print(i.data)
    print(i.indices)

[2 1 1 1 1 1]
[ 226  293  493 1015 1791 1839]


In [None]:
# in order to classify the fake review, we need P(Bad) and P(Good)
p_bad = counts[0] / (counts[0] + counts[1])
p_good = counts[1] / (counts[0] + counts[1])

Unfortunately, because some words in the fake review may never appear in the training reviews of a given class, their estimated probability for that class is zero, so the product will likely come out to zero:

In [None]:
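p = p_bad
# multiply in P(word|Bad) once per occurrence, using only the words present in the review
for count, w in zip(fake_review_transformed.data, fake_review_transformed.indices):
    p *= class_probabilities[0][w] ** count
    
print(p)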

With that in mind, smoothing will need to be incorporated by default. This will be accomplished by adding 1 to every word's count (add-one, or Laplace, smoothing).
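
A minimal sketch of the effect on a made-up count vector: the helper `smoothed_likelihoods` and its `alpha` parameter are hypothetical names, with `alpha=1` corresponding to adding 1 to every word.

```python
import numpy as np

def smoothed_likelihoods(word_counts, alpha=1.0):
    """Add `alpha` to every word count and renormalize (alpha=1 is add-one smoothing)."""
    word_counts = np.asarray(word_counts, dtype=float)
    return (word_counts + alpha) / (word_counts.sum() + alpha * word_counts.size)

# Hypothetical counts for a 3-word vocabulary in one class; the second word never occurs
toy_counts_in_class = np.array([3, 0, 1])

unsmoothed = toy_counts_in_class / toy_counts_in_class.sum()
smoothed = smoothed_likelihoods(toy_counts_in_class)

print(unsmoothed)  # contains a zero
print(smoothed)    # every entry is strictly positive
```

Note that the denominator grows by the vocabulary size, which keeps the smoothed probabilities summing to 1.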

In [None]:
# the probabilities need to be recalculated after adding 1 to every word
# and increasing the total words by the number of total additions
class_probabilities_smooth = {c:{} for c in classes}

for c in class_probabilities_smooth:
    # smoothed per-word counts for class c, computed once rather than once per word
    word_counts = X_train[y_train.values == c].toarray().sum(axis=0) + 1
    # equals the original word total plus the vocabulary size
    total_words = word_counts.sum()
    for w in range(X_train.shape[1]):
        class_probabilities_smooth[c][w] = (word_counts[w] / total_words)
    
class_probabilities_smooth   

In [None]:
np.log((2 / (X_train[y_train.values==0].sum() + X_train.shape[1]))**3)

In [545]:
p = p_bad
fake_array = zip(fake_review_transformed.data, fake_review_transformed.indices)
for i, w in fake_array:
    if i > 1:
        p *= (class_probabilities_smooth[0][w]**i)
        print(i)
    else:
        p *= class_probabilities_smooth[0][w]
prob_bad = p

In [546]:
p = p_good
fake_array = zip(fake_review_transformed.data, fake_review_transformed.indices)
for i, w in fake_array:
    if i > 1:
        p *= (class_probabilities_smooth[1][w]**i)
        print(i)
    else:
        p *= class_probabilities_smooth[1][w]
prob_good = p

In [547]:
if prob_good > prob_bad:
    print("Good Review")
else:
    print("Bad Review")

Good Review
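
Multiplying many small probabilities can underflow for longer reviews, so a common alternative is to compare sums of logarithms instead. The sketch below assumes made-up smoothed likelihoods and a made-up review; `log_posterior` is a hypothetical helper, not part of the notebook above.

```python
import numpy as np

def log_posterior(counts, indices, prior, likelihoods):
    """Unnormalized log posterior: log P(c) + sum of count * log P(word | c)."""
    return np.log(prior) + sum(n * np.log(likelihoods[w]) for n, w in zip(counts, indices))

# Hypothetical smoothed likelihoods for a 3-word vocabulary
toy_like_bad = np.array([4 / 7, 1 / 7, 2 / 7])
toy_like_good = np.array([1 / 9, 5 / 9, 3 / 9])

# Hypothetical review: word 1 appears twice, word 2 once (same layout as .data / .indices)
toy_review_counts = [2, 1]
toy_review_indices = [1, 2]

log_bad = log_posterior(toy_review_counts, toy_review_indices, 0.5, toy_like_bad)
log_good = log_posterior(toy_review_counts, toy_review_indices, 0.5, toy_like_good)

print("Good Review" if log_good > log_bad else "Bad Review")
```

Because the logarithm is monotonic, comparing log scores picks the same class as comparing the raw products.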


In [447]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB(alpha=0)
clf.fit(X_train, y_train)

  'setting alpha = %.1e' % _ALPHA_MIN)


MultinomialNB(alpha=0, class_prior=None, fit_prior=True)

In [450]:
clf.predict(fake_review_transformed)

array([1])

### Gaussian Naive Bayes
For the Gaussian Naive Bayes, I'll be using the famed Iris dataset to provide continuous data for classification.
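
As a rough sketch of what the Gaussian variant involves, the snippet below fits a per-class mean and variance for each feature and scores a new point with Gaussian log densities. The toy measurements and the helper names `fit_gaussian` and `gaussian_log_posterior` are hypothetical, and the sketch assumes every feature has non-zero variance within each class.

```python
import numpy as np

def fit_gaussian(X, y):
    """Per-class prior, feature means, and feature variances."""
    return {int(c): (np.mean(y == c), X[y == c].mean(axis=0), X[y == c].var(axis=0))
            for c in np.unique(y)}

def gaussian_log_posterior(x, prior, mean, var):
    """log P(c) plus the sum of per-feature Gaussian log densities."""
    log_density = -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)
    return np.log(prior) + log_density.sum()

# Hypothetical measurements: two continuous features, two classes
toy_features = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                         [3.0, 0.5], [3.2, 0.7], [2.8, 0.3]])
toy_species = np.array([0, 0, 0, 1, 1, 1])

params = fit_gaussian(toy_features, toy_species)
new_point = np.array([2.9, 0.6])
scores = {c: gaussian_log_posterior(new_point, *p) for c, p in params.items()}
predicted = max(scores, key=scores.get)
print(predicted)
```

The structure mirrors the multinomial case: a prior per class, combined with one independent likelihood term per feature.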

In [None]:
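iris_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
iris = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", header=None, names=iris_cols)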

In [None]:
iris.head()