# Testing Implementations
Although I plan to test the naive Bayes classifier I'm building on real-world data, 
it will likely be much quicker to experiment with the Multinomial and Gaussian flavors of 
naive Bayes on toy datasets first. That is the purpose of this notebook.

In [85]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

### Multinomial Naive Bayes
For the multinomial version of naive Bayes, I need a dataset with discrete features, such as counts. I'll be using a dataset of IMDB reviews labelled either positive or negative. The labelled data is a tab-separated text file in which each line ends with a 1 or 0, denoting a positive or negative review, respectively.

In [79]:
# Remove punctuation and excess whitespace so that tokens
# such as "very," and "very" are not treated as
# different words

imdb_cols = ["review", "sentiment"]
imdb = pd.read_csv("imdb_labelled.txt", sep="\t", names=imdb_cols)
print(imdb["review"][344])

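imdb["review"] = imdb["review"].str.strip()
# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
imdb["review"] = imdb["review"].str.replace(r"[^\w\s-]", "", regex=True)  # drop punctuation except hyphens
imdb["review"] = imdb["review"].str.replace(r"\-", " ", regex=True)       # hyphens become spaces
imdb["review"] = imdb["review"].str.replace(r"\s{2,}", " ", regex=True)   # collapse repeated whitespace
imdb["review"] = imdb["review"].str.lower()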

imdb.head()

While you don't yet hear Mickey speak, there are tons of sound effects and music throughout the film--something we take for granted now but which was a huge crowd pleaser in 1928.  


Unnamed: 0,review,sentiment
0,a very very very slow moving aimless movie abo...,0
1,not sure who was more lost the flat characters...,0
2,attempting artiness with black white and cleve...,0
3,very little music or anything to speak of,0
4,the best scene in the movie was when gerardo i...,1


In [80]:
print(imdb["review"][344])

while you dont yet hear mickey speak there are tons of sound effects and music throughout the film something we take for granted now but which was a huge crowd pleaser in 1928
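To make the effect of each cleaning step easier to see, here is a minimal sketch that applies the same sequence of substitutions to a single string with the standard `re` module. The helper name `clean_review` and the sample sentence are hypothetical and only serve as an illustration.

```python
import re

def clean_review(text):
    """Apply the same cleaning steps as the pandas cell above to one string."""
    text = text.strip()
    text = re.sub(r"[^\w\s-]", "", text)  # drop punctuation, keep hyphens for now
    text = re.sub(r"-", " ", text)        # turn hyphens into spaces
    text = re.sub(r"\s{2,}", " ", text)   # collapse runs of whitespace
    return text.lower()

# Hypothetical sample sentence, not taken from the dataset.
print(clean_review("Very, very slow--but well-made!  "))
# very very slow but well made
```

Hyphens are removed in a separate step so that "film--something" becomes two words instead of being fused into one.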


In [88]:
# use sklearn's count vectorizer to create vectors for each review
vectorizer = CountVectorizer()

X = vectorizer.fit_transform(imdb["review"])
y = imdb["sentiment"]

# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.20, random_state=42)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(598, 3114)
(150, 3114)
(598,)
(150,)
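The shape `(598, 3114)` means each review is represented by 3114 word counts. As a small illustration of what `CountVectorizer` produces, the sketch below fits it on three hypothetical reviews (the `toy_` names and sentences are made up for this example) and prints the vocabulary next to the count matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the IMDB data.
toy_reviews = ["good movie", "bad bad movie", "good plot good cast"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_reviews)

# vocabulary_ maps each word to its column index; sort words by that index.
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)
print(toy_counts.toarray())
```

Each row is one review and each column is one word, so the entry for "bad" in the second row is 2.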


In the training set, the prior probability of a good review is $\frac{312}{598}$, while the prior probability of a bad review is $\frac{286}{598}$.

In other words:

$P(good) \approx 0.522$

$P(bad) \approx 0.478$

In [90]:
# determining the prior probabilities for good and bad reviews
print(y_train.value_counts())
y_train.value_counts(normalize=True)

1    312
0    286
Name: sentiment, dtype: int64


1    0.521739
0    0.478261
Name: sentiment, dtype: float64
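When these priors are later combined with word likelihoods, it is common to work with their logarithms so that many small probabilities can be added rather than multiplied. The following is a minimal sketch of that step using only NumPy; the helper `class_priors` is a hypothetical name, and the label array is rebuilt from the class counts printed above instead of being read from `y_train`.

```python
import numpy as np

def class_priors(labels):
    """Return {class: relative frequency} for a 1-D array of labels."""
    classes, counts = np.unique(labels, return_counts=True)
    return {int(c): n / counts.sum() for c, n in zip(classes, counts)}

# Rebuild a label array with the class counts shown above (312 good, 286 bad).
toy_labels = np.array([1] * 312 + [0] * 286)

priors = class_priors(toy_labels)
log_priors = {c: np.log(p) for c, p in priors.items()}
print(priors)
print(log_priors)
```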

Now I want to examine the probability of a particular word appearing in a bad review. In this case, I'll be looking at the word "bad".

In [97]:
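# build a document-term DataFrame whose columns are the vocabulary words
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
words = vectorizer.get_feature_names_out()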
term_matrix = pd.DataFrame(X_train.toarray(), columns=words)
term_matrix.head()

Unnamed: 0,010,10,1010,110,12,13,15,17,18th,1928,...,your,youre,yourself,youthful,youtube,youve,yun,zillion,zombie,zombiez
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [108]:
# add the movie review label for reference
term_matrix["review_of_movie"] = y_train.values
term_matrix.head()

Unnamed: 0,010,10,1010,110,12,13,15,17,18th,1928,...,youre,yourself,youthful,youtube,youve,yun,zillion,zombie,zombiez,review_of_movie
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [119]:
term_matrix[(term_matrix['review_of_movie'] == 0) & (term_matrix["bad"] >= 1)].shape

(38, 3115)

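Of the 286 bad reviews in the training set, 38 contain the word "bad" at least once, so the estimated probability of observing "bad" in a bad review is $\frac{38}{286} \approx 0.133$. This is a per-review estimate: it counts reviews that contain the word, not how often the word occurs.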
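A multinomial model is usually fit on word occurrences instead of review-level presence, so the two estimates can differ. The sketch below contrasts them on a hypothetical toy count matrix (all `toy_` names and numbers are invented for illustration and are not taken from the IMDB data).

```python
import numpy as np

# Hypothetical toy data: rows are reviews, columns follow toy_words.
toy_words = ["bad", "good", "movie"]
toy_counts = np.array([
    [2, 0, 1],  # bad review
    [0, 0, 3],  # bad review
    [1, 1, 1],  # bad review
    [0, 2, 1],  # good review
])
toy_labels = np.array([0, 0, 0, 1])

bad_rows = toy_counts[toy_labels == 0]
col = toy_words.index("bad")

# Share of bad reviews that contain "bad" at least once.
doc_fraction = (bad_rows[:, col] >= 1).mean()

# Share of all word occurrences in bad reviews that are "bad".
token_fraction = bad_rows[:, col].sum() / bad_rows.sum()

print(doc_fraction, token_fraction)
```

In this toy case two of the three bad reviews contain "bad", but the word accounts for only three of the nine word occurrences in those reviews.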

In [51]:
# add one to every count in the document-term matrix
a = np.ones(shape=X.shape)
plus_one = pd.DataFrame(data=(X + a))
plus_one.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,3104,3105,3106,3107,3108,3109,3110,3111,3112,3113
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
3,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
4,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
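Assuming the cell above is a first experiment with add-one (Laplace) smoothing, the smoothing is more commonly applied to the per-class word totals than to every cell of the document-term matrix, so that no word ends up with zero probability in a class. The following is a minimal sketch under that assumption; the function names, the `alpha` parameter, and the toy counts are hypothetical, and this is not necessarily how the final classifier will be written.

```python
import numpy as np

def fit_multinomial_nb(counts, labels, alpha=1.0):
    """Estimate log priors and add-alpha smoothed log likelihoods per class."""
    classes = np.unique(labels)
    log_priors = np.log(np.array([(labels == c).mean() for c in classes]))
    # Per-class word totals, shape (n_classes, n_words).
    word_totals = np.array([counts[labels == c].sum(axis=0) for c in classes])
    smoothed = word_totals + alpha
    # Dividing by the row sum gives (count + alpha) / (total + alpha * n_words).
    log_likelihoods = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))
    return classes, log_priors, log_likelihoods

def predict_multinomial_nb(counts, classes, log_priors, log_likelihoods):
    """Pick the class with the highest log posterior for each row of counts."""
    scores = counts @ log_likelihoods.T + log_priors
    return classes[np.argmax(scores, axis=1)]

# Hypothetical toy counts: three words, two reviews per class.
toy_counts = np.array([
    [2, 0, 1],
    [1, 0, 2],
    [0, 2, 1],
    [0, 3, 0],
])
toy_labels = np.array([0, 0, 1, 1])

model = fit_multinomial_nb(toy_counts, toy_labels)
preds = predict_multinomial_nb(np.array([[1, 0, 0], [0, 1, 0]]), *model)
print(preds)
```

With `alpha=1.0`, the second word never appears in class 0 but still receives a probability of 1/9 there instead of 0.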


### Gaussian Naive Bayes
For Gaussian naive Bayes, I'll be using the famed Iris dataset to provide continuous data for classification.

In [5]:
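# the file has no header row, so supply column names explicitly
iris_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
iris = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", names=iris_cols)

In [6]:
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa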
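For continuous features like these, Gaussian naive Bayes models each feature within each class as a normal distribution described by a mean and a variance. The following is a minimal sketch of that idea using NumPy only; the function names and the two-feature toy data are hypothetical, and the sketch assumes every class has non-zero variance in every feature.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class priors, feature means, and feature variances."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes])
    return classes, priors, means, variances

def predict_gaussian_nb(X, classes, priors, means, variances):
    """Pick the class with the highest log posterior for each row of X."""
    # diff has shape (n_rows, n_classes, n_features).
    diff = X[:, None, :] - means[None, :, :]
    # Log of the normal density for each feature, evaluated per class.
    log_pdf = -0.5 * (np.log(2 * np.pi * variances) + diff ** 2 / variances)
    scores = log_pdf.sum(axis=2) + np.log(priors)
    return classes[np.argmax(scores, axis=1)]

# Hypothetical toy data: two features, two well-separated classes.
toy_X = np.array([
    [1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
    [3.0, 0.5], [3.2, 0.7], [2.8, 0.3],
])
toy_y = np.array(["a", "a", "a", "b", "b", "b"])

gnb = fit_gaussian_nb(toy_X, toy_y)
preds = predict_gaussian_nb(np.array([[1.1, 2.0], [3.1, 0.4]]), *gnb)
print(preds)
```

If a feature is constant within a class, its variance is zero and the log density is undefined, so a small constant would need to be added to the variances before using this on real data.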
