This notebook tests the implementation of Bernoulli naive Bayes given in ``sparse_matrix_nb_implementation.py``.  The implementation makes a single pass through the training data, building both the dictionaries that map the vocabulary and the classes to indices and a sparse matrix that contains the parameter estimates.  The entries of this matrix must then be modified (to adjust for the total number of words in each class, to apply the smoothing parameter $\alpha$, and to take logarithms).  Care must be taken with the choice of sparse matrix representation so that the needed operations are supported efficiently.  In its logarithmic form, naive Bayes is a linear classifier, so prediction consists of computing the feature vector of the input and then performing a matrix-vector multiplication.  To keep the parameter matrix sparse, the $\alpha$-smoothing for the $0$ entries is not stored in the matrix; instead, it is applied as a correction to the vector produced by the multiplication.  

More details can be found in ``sparse_matrix_nb_implementation.py``. 
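
The correction for the $0$ entries can be written out explicitly. As a sketch, assume the smoothed estimate for word $w$ in class $c$ has the form used in the hand checks of the test case, where $N_{c,w}$ is the count of $w$ in class $c$, $N_c = \sum_w N_{c,w}$, and $|V|$ is the vocabulary size. This notation is introduced here for illustration and is not taken from ``sparse_matrix_nb_implementation.py``.

$$\hat{\theta}_{c,w} = \frac{N_{c,w} + \alpha}{N_c + \alpha |V|}$$

For a feature vector $x$ and class prior $\hat{\pi}_c$, the log-score of class $c$ then splits into a part that uses only the stored (non-zero) entries and a part that depends only on how many words of $x$ are missing from class $c$:

$$s_c(x) = \log \hat{\pi}_c + \sum_{w \,:\, N_{c,w} > 0} x_w \log \hat{\theta}_{c,w} + \Big( \sum_{w \,:\, N_{c,w} = 0} x_w \Big) \log \frac{\alpha}{N_c + \alpha |V|}$$

The middle term is the sparse matrix-vector product, and the last term is the adjustment applied to the resulting vector, so the $0$ entries never need to be filled in.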

Below we test this implementation, first on a synthetic dataset and then on the Huffington Post dataset.  On the latter, the test accuracy agrees exactly with that of the sklearn implementation of naive Bayes.  

In [8]:
from sparse_matrix_nb_implemenation import BNB


In [9]:
# Test case

X_train = ["this is an entry\n", 
          "this is too\n", 
          "so is this\n"]

y_train = ["red",
          "blue",
          "red"]

X_test = ["this is an entry\n",
          "so this is\n",
          "hello\n",
          "this is too\n"]

clf = BNB()
clf.fit(X_train, y_train)

print(f'{clf.vocabulary=}')
print(f'{clf.Theta.toarray()=}')
print(f'{clf.log_priors=}')
print(f'{clf.alpha=}')
print(f'{clf.index_to_class=}')
print(f'{clf.num_classes=}')
print(f'{clf.num_words=}')
print(f'{clf.class_index_to_word_count=}')

# entry for 'this' in 'red'
# math.log((2.0 + 1.0)/(7.0 + 6.0*1.0))
# entry for 'too' in 'blue'
# math.log((1.0 + 1.0)/(3.0 + 6.0*1.0))
clf.predict(X_test)


clf.vocabulary={'this': 0, 'is': 1, 'an': 2, 'entry': 3, 'too': 4, 'so': 5}
clf.Theta.toarray()=array([[-1.46633707, -1.46633707, -1.87180218, -1.87180218,  0.        ,
        -1.87180218],
       [-1.5040774 , -1.5040774 ,  0.        ,  0.        , -1.5040774 ,
         0.        ]])
clf.log_priors=array([-0.40546511, -1.09861229])
clf.alpha=1.0
clf.index_to_class={0: 'red', 1: 'blue'}
clf.num_classes=2
clf.num_words=6
clf.class_index_to_word_count=Counter({0: 7, 1: 3})


['red', 'red', 'red', 'blue']
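
The numbers above can be reproduced independently of `BNB`. The following is a minimal sketch, not the code in the implementation file: it assumes word counts as features and the smoothing formula from the hand checks in the cell above, and the names `counts`, `seen`, `log_unseen`, `featurize`, and `predict_one` are hypothetical. It stores logarithms only for the non-zero counts and handles the $0$ entries by correcting the score vector after the sparse product.

```python
import numpy as np
from scipy.sparse import csr_matrix

alpha = 1.0
vocabulary = {'this': 0, 'is': 1, 'an': 2, 'entry': 3, 'too': 4, 'so': 5}
classes = ['red', 'blue']

# Per-class word counts for the toy training set (rows: red, blue).
counts = csr_matrix(np.array([[2., 2., 1., 1., 0., 1.],
                              [1., 1., 0., 0., 1., 0.]]))
class_totals = np.asarray(counts.sum(axis=1)).ravel()   # 7 and 3
denom = class_totals + alpha * len(vocabulary)

# Smooth and take logs of the stored entries only; zeros stay implicit.
theta = counts.copy()
row_of_entry = np.repeat(np.arange(theta.shape[0]), np.diff(theta.indptr))
theta.data = np.log((theta.data + alpha) / denom[row_of_entry])

# Indicator of the stored (class, word) pairs, used to count unseen words.
seen = counts.copy()
seen.data[:] = 1.0

log_unseen = np.log(alpha / denom)               # log-probability of a 0 entry
log_priors = np.log(np.array([2.0, 1.0]) / 3.0)  # two red documents, one blue

def featurize(text):
    """Count occurrences of vocabulary words; unknown words are ignored."""
    x = np.zeros(len(vocabulary))
    for word in text.split():
        if word in vocabulary:
            x[vocabulary[word]] += 1.0
    return x

def predict_one(text):
    x = featurize(text)
    n_unseen = x.sum() - seen @ x   # words of x with a 0 entry, per class
    scores = log_priors + theta @ x + n_unseen * log_unseen
    return classes[int(np.argmax(scores))]

toy_docs = ["this is an entry\n", "so this is\n", "hello\n", "this is too\n"]
predictions = [predict_one(doc) for doc in toy_docs]
print(predictions)  # ['red', 'red', 'red', 'blue']
```

Under these assumptions, `theta.toarray()` matches the `clf.Theta.toarray()` output above, and the four predictions match the output of `clf.predict(X_test)`.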

In [10]:
from itertools import compress

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

RANDOM = 42
PATH = "data/News_Category_Dataset_v3.json"
cats = ["POLITICS", "WELLNESS", "ENTERTAINMENT", "TRAVEL", "STYLE & BEAUTY"]

df = pd.read_json(PATH, lines=True)
df = df[df.category.isin(cats)]
df.reset_index(drop=True, inplace=True)
df.drop(labels=["link", "authors", "date"], axis=1, inplace=True)
# Concatenate headline and short description into a single text field.
df["combined"] = df["headline"] + " " + df["short_description"]

X_train, X_test = train_test_split(df, train_size=0.8, random_state=RANDOM, stratify=df["category"])
y_train, y_test = X_train["category"], X_test["category"]
X_train.drop(labels=["category"], axis=1, inplace=True)
X_test.drop(labels=["category"], axis=1, inplace=True)


In [7]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

clf = Pipeline([('vect', CountVectorizer()),
                ('nb', BernoulliNB())])

clf.fit(X_train["combined"], y_train)
# Store the pipeline's predictions so the accuracy is computed from them.
predicted = clf.predict(X_test["combined"])
np.mean(predicted == y_test)


0.9019586206896552

In [5]:
my_clf = BNB()
my_clf.fit(X_train["combined"], y_train)


In [6]:
predicted = my_clf.predict(X_test["combined"])
np.mean(predicted == y_test)


0.9019586206896552
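
Equal accuracy does not by itself guarantee that two classifiers make the same prediction on every example, so a stricter comparison checks the predicted labels elementwise. A minimal sketch, using a hypothetical helper `agreement_rate` and toy label lists rather than the actual predictions from this notebook:

```python
import numpy as np

def agreement_rate(pred_a, pred_b):
    """Fraction of positions at which two prediction sequences coincide."""
    a = np.asarray(pred_a)
    b = np.asarray(pred_b)
    if a.shape != b.shape:
        raise ValueError("prediction arrays must have the same shape")
    return float(np.mean(a == b))

# Toy labels for illustration only.
toy_a = ["red", "red", "blue"]
toy_b = ["red", "blue", "blue"]
rate = agreement_rate(toy_a, toy_b)
print(rate)
```

Applied to the outputs of the two classifiers on `X_test["combined"]`, a value of `1.0` would mean that the predictions are identical, not just equally accurate.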