# Implementing Naive Bayes, Step by Step

You've seen the [theory part](https://nickyfoto.github.io/blog/entries/naive-bayes) of Naive Bayes. Now let's see how to implement it in code.

In [1]:
# import necessary modules
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
from IPython.display import HTML

import numpy as np

from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelBinarizer
from learn.naive_bayes import NaiveBayes

In [2]:
np.random.seed(0)
X = np.random.randint(2, size=(4, 4))
y = np.array([1, 0, 1, 0])
clf = MultinomialNB()
clf.fit(X, y)
X
clf.predict(X)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

array([[0, 1, 1, 0],
       [1, 1, 1, 1],
       [1, 1, 1, 0],
       [0, 1, 0, 0]])

array([1, 0, 1, 0])

Given $X$, we want to calculate the probability that each example $x$ belongs to each class, and assign $x$ to the class with the highest probability. To carry out the calculation we need the feature probabilities and the prior class probabilities.
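As a brief recap in symbols (the notation is an assumption of this sketch, not taken from the theory post): write $\pi_c$ for the prior probability of class $c$, $\theta_{ci}$ for the probability of feature $i$ under class $c$, and $x_i$ for the count of feature $i$ in the example. For count features, the multinomial model picks

$$
\hat{y} = \arg\max_{c} \; \pi_c \prod_{i} \theta_{ci}^{\,x_i}
$$

so the two quantities we have to estimate from the training data are $\pi_c$ and $\theta_{ci}$.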

The prior probability of a class is the number of training examples in that class divided by the total number of training examples. We can use [`LabelBinarizer`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html) to binarize our labels, which simplifies the computation that follows.

```python
labelbin = LabelBinarizer()
Y = labelbin.fit_transform(y)
```

In a binary classification problem like the example above, `LabelBinarizer` returns a single column, so we add one extra column to make `Y` a two-column matrix.

In [3]:
labelbin = LabelBinarizer()
Y = labelbin.fit_transform(y)
if Y.shape[1] == 1:
    Y = np.concatenate((1 - Y, Y), axis=1)
Y

array([[0, 1],
       [1, 0],
       [0, 1],
       [1, 0]])
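For comparison, the same indicator matrix can be built with NumPy alone. This is only a sketch of an alternative, not what `LabelBinarizer` does internally, and the names `class_index` and `Y_onehot` are made up for illustration. It needs no special case for two classes.

```python
import numpy as np

y = np.array([1, 0, 1, 0])

# np.unique returns the sorted class labels and, for each example,
# the index of its label within that sorted array.
classes, class_index = np.unique(y, return_inverse=True)

# Row k of the identity matrix is the one-hot vector for class k,
# so indexing by class_index yields one indicator column per class.
Y_onehot = np.eye(len(classes), dtype=int)[class_index]
print(Y_onehot)
```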

After the transformation, the class counts and feature counts are simply:

In [4]:
class_count_ = Y.sum(axis=0)
feature_count_ = np.dot(Y.T, X)
class_count_
feature_count_

array([2, 2])

array([[1, 2, 1, 1],
       [1, 2, 2, 0]])
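To see why the matrix product gives the feature counts: row `c` of `Y.T` is a 0/1 mask that selects the examples of class `c`, so the product adds up the feature rows of each class. A small check, with the `X` and `y` shown above hard-coded so the snippet runs on its own (`per_class_sums` is a name chosen for illustration):

```python
import numpy as np

X = np.array([[0, 1, 1, 0],
              [1, 1, 1, 1],
              [1, 1, 1, 0],
              [0, 1, 0, 0]])
y = np.array([1, 0, 1, 0])

# Sum the feature rows of each class explicitly, in sorted class order.
per_class_sums = np.array([X[y == c].sum(axis=0) for c in np.unique(y)])
print(per_class_sums)
```

The rows printed here should agree with `feature_count_`.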

The probability distributions can then be calculated as:

In [5]:
class_count_/class_count_.sum()
feature_count_ / feature_count_.sum(axis=1).reshape(-1, 1)

array([0.5, 0.5])

array([[0.2, 0.4, 0.2, 0.2],
       [0.2, 0.4, 0.4, 0. ]])

In practice, as with many machine learning algorithms, we work with log probabilities to avoid the [underflow error](https://towardsdatascience.com/unfolding-na%C3%AFve-bayes-from-scratch-2e86dcae4b01) caused by multiplying many small probabilities. The above can therefore be written as:

In [6]:
class_log_prior_ = np.log(class_count_) - np.log(class_count_.sum())
feature_log_prob_ = np.log(feature_count_) - np.log(feature_count_.sum(axis=1)).reshape(-1, 1)
class_log_prior_
feature_log_prob_

  


array([-0.69314718, -0.69314718])

array([[-1.60943791, -0.91629073, -1.60943791, -1.60943791],
       [-1.60943791, -0.91629073, -0.91629073,        -inf]])
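As an aside on why the log transform matters, here is a small illustration of underflow. The 1000 probabilities of 0.01 are made-up values, not taken from the dataset above.

```python
import numpy as np

# 1000 hypothetical per-feature probabilities, each equal to 0.01.
probs = np.full(1000, 0.01)

# The exact product is 1e-2000, far below the smallest positive float64,
# so the floating-point product underflows to 0.0.
direct_product = np.prod(probs)

# The sum of logs is an ordinary number: 1000 * log(0.01).
log_sum = np.sum(np.log(probs))

print(direct_product, log_sum)
```

Once a product has collapsed to zero, every class looks equally impossible, whereas the log sums can still be compared.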

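Note that `feature_log_prob_` contains `-inf`: NumPy takes the log of zero (and warns about a divide by zero) because the training examples of one class never contain that feature, so its count is zero. This is where Laplace smoothing comes in. Before dividing by the total feature count of each class, we add 1 to every feature count.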
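In symbols (again, the notation is an assumption of this sketch): let $N_{ci}$ be the count of feature $i$ in class $c$, $n$ the number of features, $\alpha$ the smoothing strength, and $\hat{\theta}_{ci}$ the estimated probability of feature $i$ under class $c$. Then

$$
\hat{\theta}_{ci} = \frac{N_{ci} + \alpha}{\sum_{j=1}^{n} N_{cj} + \alpha n}
$$

With $\alpha = 1$ this is the add-one rule described above, which matches the `alpha=1.0` default shown in the `MultinomialNB` output earlier. Every estimate is strictly positive, so its log is finite.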

In [7]:
feature_log_prob_ = np.log(feature_count_ + 1) - np.log((feature_count_ + 1).sum(axis=1)).reshape(-1, 1)
feature_log_prob_

array([[-1.5040774 , -1.09861229, -1.5040774 , -1.5040774 ],
       [-1.5040774 , -1.09861229, -1.09861229, -2.19722458]])

These are the parameters our model learns during training. Let's see how to predict new data based on what we have learned so far. Given a dataset X, we calculate the joint log likelihood of each example with `feature_log_prob_` and `class_log_prior_`. The dot product evaluates the log likelihood of each example under every class, and adding `class_log_prior_` brings in the distribution of classes in the training set. Finally, we use `argmax` from NumPy to pick the most likely class and look up its label in `labelbin.classes_`.

In [8]:
X[2:3]
jll = np.dot(X[2:3], feature_log_prob_.T) + class_log_prior_
jll
labelbin.classes_[np.argmax(jll, axis=1)]

array([[1, 1, 1, 0]])

array([[-4.79991426, -4.39444915]])

array([1])
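The steps so far can be collected into two small functions. This is a minimal sketch under the assumptions of this post (count features, Laplace smoothing with `alpha`). The function names `fit_multinomial_nb` and `predict_multinomial_nb` are hypothetical, and the code is neither scikit-learn's implementation nor the `NaiveBayes` class imported at the top.

```python
import numpy as np

def fit_multinomial_nb(X, y, alpha=1.0):
    """Return (classes, class_log_prior, feature_log_prob) for count features."""
    classes, class_index = np.unique(y, return_inverse=True)
    Y = np.eye(len(classes))[class_index]      # one-hot labels, (n_samples, n_classes)
    class_count = Y.sum(axis=0)
    feature_count = Y.T @ X                    # per-class feature totals
    smoothed = feature_count + alpha           # Laplace smoothing
    # Log of the smoothed count minus log of the smoothed per-class total.
    feature_log_prob = np.log(smoothed) - np.log(smoothed.sum(axis=1, keepdims=True))
    class_log_prior = np.log(class_count) - np.log(class_count.sum())
    return classes, class_log_prior, feature_log_prob

def predict_multinomial_nb(X, classes, class_log_prior, feature_log_prob):
    """Pick, for each row of X, the class with the largest joint log likelihood."""
    jll = X @ feature_log_prob.T + class_log_prior
    return classes[np.argmax(jll, axis=1)]

# The toy data from the cells above, hard-coded so the snippet is self-contained.
X = np.array([[0, 1, 1, 0],
              [1, 1, 1, 1],
              [1, 1, 1, 0],
              [0, 1, 0, 0]])
y = np.array([1, 0, 1, 0])

params = fit_multinomial_nb(X, y)
pred = predict_multinomial_nb(X[2:3], *params)
print(pred)
```

Returning the parameters as a plain tuple keeps the sketch short; a class with `fit` and `predict` methods would be the more usual shape for a reusable estimator.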

Using our own implementation gives the same predictions:

In [9]:
NaiveBayes().fit(X, y).predict(X)

array([1, 0, 1, 0])

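The next example is the [sex classification example](https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_classification) from Wikipedia, where the features (height, weight, and foot size) are continuous.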

In [10]:
X = np.array([[6, 180, 12],
              [5.92, 190, 11],
              [5.58, 170, 12],
              [5.92, 165, 10],
              [5, 100, 6],
              [5.5, 150, 8],
              [5.42, 130, 7],
              [5.75, 150, 9]
             ])
X

array([[  6.  , 180.  ,  12.  ],
       [  5.92, 190.  ,  11.  ],
       [  5.58, 170.  ,  12.  ],
       [  5.92, 165.  ,  10.  ],
       [  5.  , 100.  ,   6.  ],
       [  5.5 , 150.  ,   8.  ],
       [  5.42, 130.  ,   7.  ],
       [  5.75, 150.  ,   9.  ]])

In [11]:
# male 1, female 0
y = np.array([1,1,1,1,0,0,0,0])

In [12]:
male = X[y == 1]
male
male.mean(axis=0)
male.var(axis=0, ddof=1)

array([[  6.  , 180.  ,  12.  ],
       [  5.92, 190.  ,  11.  ],
       [  5.58, 170.  ,  12.  ],
       [  5.92, 165.  ,  10.  ]])

array([  5.855, 176.25 ,  11.25 ])

array([3.50333333e-02, 1.22916667e+02, 9.16666667e-01])

In [13]:
female = X[y == 0]
female
female.mean(axis=0)
female.var(axis=0, ddof=1)

array([[  5.  , 100.  ,   6.  ],
       [  5.5 , 150.  ,   8.  ],
       [  5.42, 130.  ,   7.  ],
       [  5.75, 150.  ,   9.  ]])

array([  5.4175, 132.5   ,   7.5   ])

array([9.72250000e-02, 5.58333333e+02, 1.66666667e+00])

In [14]:
x = np.array([[6, 130, 8.]])
x

array([[  6., 130.,   8.]])

In [15]:
clf = NaiveBayes().fit(X, y)
clf.predict(x)

array([0])

In [16]:
clf.predict_proba(x)

array([[0.55248494, 0.44751506]])

In [20]:
from scipy.stats import multivariate_normal as mvn
mvn(mean=male.mean(axis=0), cov=male.var(axis=0, ddof=1)).pdf(x)*0.5

6.197071843878072e-09

In [21]:
mvn(mean=female.mean(axis=0), cov=female.var(axis=0, ddof=1)).pdf(x)*0.5

0.0005377909183630023
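The two numbers above are the unnormalized posteriors for male and female: the class prior (0.5, since each class has four examples) times the Gaussian likelihood of `x`. Because a one-dimensional `cov` is treated as a diagonal covariance, the same quantities can be reproduced with NumPy alone by summing per-feature Gaussian log densities. The helper `log_joint` below is a hypothetical name used for illustration, and this sketch is not the code inside the `NaiveBayes` class.

```python
import numpy as np

X = np.array([[6, 180, 12], [5.92, 190, 11], [5.58, 170, 12], [5.92, 165, 10],
              [5, 100, 6], [5.5, 150, 8], [5.42, 130, 7], [5.75, 150, 9]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])   # male 1, female 0
x = np.array([6, 130, 8.])

def log_joint(x, X_class, prior):
    """log(prior) plus the sum of per-feature Gaussian log densities."""
    mean = X_class.mean(axis=0)
    var = X_class.var(axis=0, ddof=1)     # sample variance, as in the cells above
    log_pdf = -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)
    return np.log(prior) + log_pdf.sum()

log_joint_female = log_joint(x, X[y == 0], 0.5)
log_joint_male = log_joint(x, X[y == 1], 0.5)

# Predict the class with the larger joint log likelihood (1 = male, 0 = female).
predicted = int(log_joint_male > log_joint_female)
print(np.exp(log_joint_female), np.exp(log_joint_male), predicted)
```

Comparing the log values directly avoids exponentiating very small numbers.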