# Implementing your own Naive Bayes step by step

In [1]:
# import necessary modules
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from evaluation import test
from utils import load_data

https://github.com/randerson112358/Python/blob/master/Email_Spam_Detection/Email_Spam_Detection.ipynb

Data Source: https://www.kaggle.com/balakishan77/spam-or-ham-email-classification/data

Read data as DataFrame

In [2]:
emails = load_data('emails.csv')
emails.head(5)
emails.shape

Unnamed: 0,text,spam
0,Subject: naturally irresistible your corporate...,1
1,Subject: the stock trading gunslinger fanny i...,1
2,Subject: unbelievable new homes made easy im ...,1
3,Subject: 4 color printing special request add...,1
4,"Subject: do not have money , get software cds ...",1


(5728, 2)

In [3]:
# remove duplicates
emails.drop_duplicates(inplace = True)
emails.shape

(5695, 2)

Encode text using `CountVectorizer`

In [4]:
message0 = 'hello world hello hello world play'
message1 = 'test test test test one hello'

#Convert a collection of text documents to a matrix of token counts
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform([message0, message1])
print(bow)
print(type(bow))

  (0, 0)	3
  (0, 4)	2
  (0, 2)	1
  (1, 0)	1
  (1, 3)	4
  (1, 1)	1
<class 'scipy.sparse.csr.csr_matrix'>


As you can see, `CountVectorizer` returns a sparse matrix encoding our texts with the number of times a particular word occurs.
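To make the counting step concrete, here is a minimal pure-Python sketch of the same bag-of-words idea. It is an illustration only, not how `CountVectorizer` is implemented: it assumes plain whitespace tokenization, whereas `CountVectorizer` uses its own token pattern, and the names `tokenized`, `vocab`, and `counts` are made up for this example.

```python
from collections import Counter

messages = ['hello world hello hello world play',
            'test test test test one hello']

# Simplifying assumption: lowercase and split on whitespace
tokenized = [m.lower().split() for m in messages]

# The vocabulary is the sorted set of all tokens; a word's column is its position here
vocab = sorted({tok for toks in tokenized for tok in toks})

# One row per message, one column per vocabulary word
counts = []
for toks in tokenized:
    c = Counter(toks)
    counts.append([c[word] for word in vocab])

print(vocab)   # ['hello', 'one', 'play', 'test', 'world']
print(counts)  # [[3, 0, 1, 0, 2], [1, 1, 0, 4, 0]]
```

On these two messages the whitespace split happens to give the same tokens as the vectorizer; on real text with punctuation or single-character words the two would differ.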

In [5]:
vocabulary = {v: k for k, v in vectorizer.vocabulary_.items()}
[vocabulary[i] for i in sorted([v for k,v in vectorizer.vocabulary_.items()])]
bow.toarray()

['hello', 'one', 'play', 'test', 'world']

array([[3, 0, 1, 0, 2],
       [1, 1, 0, 4, 0]])

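Note how the encoding is stored in the sparse matrix: only the non-zero entries are kept, printed as `(row, column) value`. For `message0` (row 0), the column indices [0,4,2] hold the values [3,2,1], i.e. `hello` occurs three times, `world` twice, and `play` once.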
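The relationship between the printed `(row, column) value` entries and the dense array can be sketched with NumPy alone. The triplets below are copied from the output above; `triplets`, `dense`, and `counts_message0` are names introduced only for this illustration.

```python
import numpy as np

vocab = ['hello', 'one', 'play', 'test', 'world']

# (row, column, count) entries exactly as printed for the sparse matrix
triplets = [(0, 0, 3), (0, 4, 2), (0, 2, 1),
            (1, 0, 1), (1, 3, 4), (1, 1, 1)]

# Every position not listed in the sparse matrix is implicitly zero
dense = np.zeros((2, len(vocab)), dtype=int)
for row, col, value in triplets:
    dense[row, col] = value

# Map the column indices of message0 (row 0) back to words
counts_message0 = {vocab[col]: value for row, col, value in triplets if row == 0}

print(dense)
print(counts_message0)  # {'hello': 3, 'world': 2, 'play': 1}
```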

If we set `binary=True` when encoding messages, the encoder records only whether a word is present, ignoring the number of occurrences. For our first implementation of `NaiveBayes`, we will simply encode the presence of each word.

In [6]:
vectorizer_b = CountVectorizer(binary=True)
bow_b = vectorizer_b.fit_transform([message0, message1])
print(bow_b)
bow_b.toarray()

  (0, 0)	1
  (0, 4)	1
  (0, 2)	1
  (1, 0)	1
  (1, 3)	1
  (1, 1)	1


array([[1, 0, 1, 0, 1],
       [1, 1, 0, 1, 0]])

## Experiment with MultinomialNB from sklearn

Before implementing our own learner, let's check the performance of `MultinomialNB` from sklearn. Here we remove English stop words (`stop_words='english'`) when vectorizing our text.
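The `test` helper imported from `evaluation` is not shown in this notebook. Judging only from what it prints (a classification report, a confusion matrix, and an accuracy for the training split and then the test split, followed by the fitted model), it might look roughly like the sketch below. The name `evaluate_sketch` and its body are assumptions, not the actual helper.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


def evaluate_sketch(model, X_train, X_test, y_train, y_test):
    """Hypothetical stand-in for `evaluation.test`: fit, report, return the model."""
    model.fit(X_train, y_train)
    # Report on the training split first, then on the held-out split
    for X, y in ((X_train, y_train), (X_test, y_test)):
        y_pred = model.predict(X)
        print(classification_report(y, y_pred))
        print("Confusion Matrix: \n", confusion_matrix(y, y_pred))
        print()
        print("Accuracy: ", accuracy_score(y, y_pred))
    return model
```

Any estimator exposing `fit` and `predict` could be passed in, which is what lets the same helper evaluate both sklearn models and a hand-written class.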

In [7]:
messages_bow = CountVectorizer(stop_words='english').fit_transform(emails['text'])

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(messages_bow, emails['spam'], test_size = 0.20,
                                                    random_state = 0,
                                                    stratify = emails['spam'])

messages_bow.shape
from sklearn.naive_bayes import MultinomialNB
test(MultinomialNB(), X_train, X_test, y_train, y_test)

(5695, 36996)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3462
           1       0.99      1.00      0.99      1094

    accuracy                           1.00      4556
   macro avg       0.99      1.00      1.00      4556
weighted avg       1.00      1.00      1.00      4556

Confusion Matrix: 
 [[3450   12]
 [   3 1091]]

Accuracy:  0.9967076382791923
              precision    recall  f1-score   support

           0       1.00      0.99      0.99       865
           1       0.97      1.00      0.98       274

    accuracy                           0.99      1139
   macro avg       0.98      0.99      0.99      1139
weighted avg       0.99      0.99      0.99      1139

Confusion Matrix: 
 [[856   9]
 [  1 273]]

Accuracy:  0.9912203687445127


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

It achieved pretty decent accuracy with the default parameters (about 99.1% on the test split). Now let's check how it performs if we only encode the presence of the words in our text corpus.
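The reported numbers can be read directly off the confusion matrix. As a small worked example, the sketch below recomputes accuracy, and precision and recall for the spam class, from the test-split matrix printed above (rows are true labels, columns are predicted labels). The variable names are chosen only for this illustration.

```python
import numpy as np

# Test-split confusion matrix from the output above
cm = np.array([[856,   9],
               [  1, 273]])

tn, fp = cm[0]  # true ham: predicted ham, predicted spam
fn, tp = cm[1]  # true spam: predicted ham, predicted spam

accuracy = (tp + tn) / cm.sum()   # correct predictions over all examples
precision_spam = tp / (tp + fp)   # of the emails flagged as spam, how many are spam
recall_spam = tp / (tp + fn)      # of the true spam emails, how many were flagged

print(accuracy)                   # about 0.9912
print(round(precision_spam, 2))   # 0.97
print(round(recall_spam, 2))      # 1.0
```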

In [8]:
messages_bow_b = CountVectorizer(stop_words='english', binary=True).fit_transform(emails['text'])

X_train, X_test, y_train, y_test = train_test_split(messages_bow_b, emails['spam'],
                                                    test_size = 0.20, random_state = 0,
                                                    stratify = emails['spam'])
test(MultinomialNB(), X_train, X_test, y_train, y_test)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3462
           1       1.00      0.99      0.99      1094

    accuracy                           1.00      4556
   macro avg       1.00      1.00      1.00      4556
weighted avg       1.00      1.00      1.00      4556

Confusion Matrix: 
 [[3458    4]
 [   8 1086]]

Accuracy:  0.9973661106233538
              precision    recall  f1-score   support

           0       1.00      0.99      0.99       865
           1       0.97      0.99      0.98       274

    accuracy                           0.99      1139
   macro avg       0.98      0.99      0.99      1139
weighted avg       0.99      0.99      0.99      1139

Confusion Matrix: 
 [[856   9]
 [  3 271]]

Accuracy:  0.9894644424934153


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

We didn't observe a large performance drop between these encoding methods: test accuracy moves from about 0.991 to about 0.989. Now let's check how our own implementation performs compared to sklearn.
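The source of `NaiveBayes_v0` lives in the `naive_bayes` module and is not reproduced here. For orientation, a learner over binary features can be sketched as a Bernoulli Naive Bayes with add-one smoothing, as below. This is a textbook formulation under stated assumptions (dense NumPy input, a hypothetical class name `BernoulliNBSketch`), not the notebook's own class.

```python
import numpy as np


class BernoulliNBSketch:
    """Hypothetical Naive Bayes for 0/1 features; expects dense arrays."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength (1.0 is add-one smoothing)

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        # log P(class), estimated from class frequencies
        self.log_prior_ = np.log(np.array([(y == c).mean() for c in self.classes_]))
        # P(word present | class), smoothed so no probability is exactly 0 or 1
        p = np.array([
            (X[y == c].sum(axis=0) + self.alpha) / ((y == c).sum() + 2 * self.alpha)
            for c in self.classes_
        ])
        self.log_p_ = np.log(p)
        self.log_not_p_ = np.log(1.0 - p)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Sum log-probabilities of present and absent words, plus the log prior
        scores = X @ self.log_p_.T + (1.0 - X) @ self.log_not_p_.T + self.log_prior_
        return self.classes_[np.argmax(scores, axis=1)]


# Tiny usage example: feature 0 only ever appears in class 1
X_toy = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]])
y_toy = np.array([1, 1, 0, 0])
toy_model = BernoulliNBSketch().fit(X_toy, y_toy)
toy_pred = toy_model.predict(np.array([[1, 0, 0], [0, 0, 1]]))
print(toy_pred)  # [1 0]
```

Working in log space avoids numerical underflow when tens of thousands of per-word probabilities are multiplied together.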

In [9]:
from naive_bayes import NaiveBayes_v0
test(NaiveBayes_v0(), X_train, X_test, y_train, y_test)

              precision    recall  f1-score   support

           0       1.00      0.99      1.00      3462
           1       0.98      1.00      0.99      1094

    accuracy                           0.99      4556
   macro avg       0.99      1.00      0.99      4556
weighted avg       0.99      0.99      0.99      4556

Confusion Matrix: 
 [[3441   21]
 [   4 1090]]

Accuracy:  0.9945127304653204
              precision    recall  f1-score   support

           0       1.00      0.98      0.99       865
           1       0.94      1.00      0.97       274

    accuracy                           0.98      1139
   macro avg       0.97      0.99      0.98      1139
weighted avg       0.98      0.98      0.98      1139

Confusion Matrix: 
 [[847  18]
 [  1 273]]

Accuracy:  0.9833187006145742


<naive_bayes.NaiveBayes_v0 at 0x109a79750>

Wow, we achieved similar performance with our naive implementation of Naive Bayes (test accuracy of about 0.983 versus 0.989 for `MultinomialNB` on the same binary features). Let's see whether our algorithm generalizes to a dataset other than spam detection.

https://github.com/aishajv/Unfolding-Naive-Bayes-from-Scratch/blob/master/%23%20Unfolding%20Na%C3%AFve%20Bayes%20from%20Scratch!%20Take-2%20%F0%9F%8E%AC.ipynb

In [10]:
training_set = load_data('labeledTrainData.tsv', sep='\t')
testing_set = load_data('testData.tsv', sep='\t')
training_set.head()

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...


In [11]:
#getting training set examples labels
print ("Unique Classes: ",np.unique(training_set['sentiment']))
print ("Total Number of Training Examples: ",training_set['review'].shape)
print ("Total Number of Testing Examples: ",testing_set['review'].shape)

Unique Classes:  [0 1]
Total Number of Training Examples:  (25000,)
Total Number of Testing Examples:  (25000,)


In [12]:
vectorizer = CountVectorizer(stop_words='english', binary=True)
train_bow_b = vectorizer.fit_transform(training_set['review'])
train_bow_b.shape
# Encode the Kaggle test set (already loaded above as `testing_set`)
# using the vocabulary fitted on the training reviews
test_bow_b = vectorizer.transform(testing_set['review'])
test_bow_b.shape

(25000, 74538)

(25000, 74538)

In [13]:
X_train, X_test, y_train, y_test = train_test_split(train_bow_b, training_set['sentiment'], 
                                                    test_size = 0.20, random_state = 0,
                                                    stratify = training_set['sentiment'])

In [14]:
clf = test(NaiveBayes_v0(), X_train, X_test, y_train, y_test)

              precision    recall  f1-score   support

           0       0.93      0.91      0.92     10000
           1       0.91      0.93      0.92     10000

    accuracy                           0.92     20000
   macro avg       0.92      0.92      0.92     20000
weighted avg       0.92      0.92      0.92     20000

Confusion Matrix: 
 [[9072  928]
 [ 658 9342]]

Accuracy:  0.9207
              precision    recall  f1-score   support

           0       0.89      0.83      0.86      2500
           1       0.84      0.89      0.87      2500

    accuracy                           0.86      5000
   macro avg       0.86      0.86      0.86      5000
weighted avg       0.86      0.86      0.86      5000

Confusion Matrix: 
 [[2080  420]
 [ 268 2232]]

Accuracy:  0.8624


Let's see how MultinomialNB performs on movie review data. 

In [15]:
test(MultinomialNB(), X_train, X_test, y_train, y_test)

              precision    recall  f1-score   support

           0       0.90      0.94      0.92     10000
           1       0.94      0.90      0.92     10000

    accuracy                           0.92     20000
   macro avg       0.92      0.92      0.92     20000
weighted avg       0.92      0.92      0.92     20000

Confusion Matrix: 
 [[9420  580]
 [1014 8986]]

Accuracy:  0.9203
              precision    recall  f1-score   support

           0       0.85      0.88      0.86      2500
           1       0.88      0.84      0.86      2500

    accuracy                           0.86      5000
   macro avg       0.86      0.86      0.86      5000
weighted avg       0.86      0.86      0.86      5000

Confusion Matrix: 
 [[2208  292]
 [ 402 2098]]

Accuracy:  0.8612


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Great! It's similar to our own implementation.

In [16]:
test_pred = clf.predict(test_bow_b.toarray())

# write the results to a CSV file for uploading to Kaggle
kaggle_df = pd.DataFrame(data=np.column_stack([testing_set["id"].values, test_pred.astype(int)]),
                         columns=["id", "sentiment"])
#kaggle_df.to_csv("./naive_bayes_model_take1.csv",index=False)
#print ('Predictions generated and saved to naive_bayes_model_take1.csv')

Wow, we can submit our results to Kaggle. Not bad!

However, our first attempt at implementing Naive Bayes is pretty naive in the sense that it only reads binary features and can only predict a binary output. A more general Naive Bayes can read discrete (count) features and predict k categories. Let's see an example from sklearn.

In [17]:
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']

from sklearn.datasets import fetch_20newsgroups

twenty_train = fetch_20newsgroups(subset='train',
     categories=categories, shuffle=True, random_state=42)

twenty_test = fetch_20newsgroups(subset='test',
     categories=categories, shuffle=True, random_state=42)

vectorizer = CountVectorizer(stop_words='english')

twenty_train.target_names
len(twenty_train.data)
len(twenty_train.filenames)
X_train = vectorizer.fit_transform(twenty_train.data)
X_test = vectorizer.transform(twenty_test.data)
y_train = twenty_train.target
y_test = twenty_test.target

test(MultinomialNB(), X_train, X_test, y_train, y_test)

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

2257

2257

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       480
           1       0.99      1.00      0.99       584
           2       1.00      0.99      1.00       594
           3       1.00      0.99      1.00       599

    accuracy                           1.00      2257
   macro avg       1.00      1.00      1.00      2257
weighted avg       1.00      1.00      1.00      2257

Confusion Matrix: 
 [[479   0   0   1]
 [  0 583   1   0]
 [  0   2 591   1]
 [  0   3   0 596]]

Accuracy:  0.9964554718653079
              precision    recall  f1-score   support

           0       0.93      0.91      0.92       319
           1       0.95      0.97      0.96       389
           2       0.96      0.92      0.94       396
           3       0.93      0.96      0.95       398

    accuracy                           0.94      1502
   macro avg       0.94      0.94      0.94      1502
weighted avg       0.94      0.94      0.94      1502

Co

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [18]:
from naive_bayes import NaiveBayes_v1
test(NaiveBayes_v1(), X_train, X_test, y_train, y_test)

              precision    recall  f1-score   support

           0       1.00      0.98      0.99       480
           1       0.99      1.00      1.00       584
           2       0.99      1.00      1.00       594
           3       0.99      0.99      0.99       599

    accuracy                           0.99      2257
   macro avg       0.99      0.99      0.99      2257
weighted avg       0.99      0.99      0.99      2257

Confusion Matrix: 
 [[472   0   0   8]
 [  0 582   2   0]
 [  0   1 593   0]
 [  0   2   1 596]]

Accuracy:  0.9937970757642889
              precision    recall  f1-score   support

           0       0.98      0.77      0.86       319
           1       0.97      0.92      0.94       389
           2       0.91      0.95      0.93       396
           3       0.83      0.98      0.90       398

    accuracy                           0.91      1502
   macro avg       0.92      0.90      0.91      1502
weighted avg       0.92      0.91      0.91      1502

Co

<naive_bayes.NaiveBayes_v1 at 0x11e937f10>

Notice that although the training results are similar, our test accuracy (0.91) is not as good as sklearn's (0.94).
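For comparison, a Naive Bayes that reads count features and predicts k categories can be sketched as the standard multinomial formulation with add-one smoothing. The class below is a minimal illustration under stated assumptions (dense NumPy input, a hypothetical name `MultinomialNBSketch`); it is neither the `NaiveBayes_v1` code, which is not shown here, nor sklearn's implementation.

```python
import numpy as np


class MultinomialNBSketch:
    """Hypothetical k-class Naive Bayes over word counts; expects dense arrays."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength (1.0 is add-one smoothing)

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        # log P(class), estimated from class frequencies
        self.log_prior_ = np.log(np.array([(y == c).mean() for c in self.classes_]))
        # Smoothed total count of each word within each class
        counts = np.array([X[y == c].sum(axis=0) for c in self.classes_]) + self.alpha
        # log P(word | class): each class row is normalized to sum to 1
        self.log_likelihood_ = np.log(counts / counts.sum(axis=1, keepdims=True))
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Each occurrence of a word contributes its log-likelihood once
        scores = X @ self.log_likelihood_.T + self.log_prior_
        return self.classes_[np.argmax(scores, axis=1)]


# Tiny usage example with three classes, each dominated by one word
X_toy3 = np.array([[3, 0, 0], [2, 1, 0], [0, 3, 0],
                   [0, 2, 1], [0, 0, 3], [1, 0, 2]])
y_toy3 = np.array([0, 0, 1, 1, 2, 2])
toy3_model = MultinomialNBSketch().fit(X_toy3, y_toy3)
toy3_pred = toy3_model.predict(np.array([[4, 0, 0], [0, 4, 0], [0, 0, 4]]))
print(toy3_pred)  # [0 1 2]
```

Differences in smoothing, or in how unseen words are handled, are the kind of detail worth checking when a hand-written learner trails a library one on held-out data.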