# Explore Naive Bayes classifier
Naive Bayes is a straightforward example to help learn the Bayesian probabilistic perspective.

I will first work through a simple toy example and then examine the implementation in scikit-learn.

[TODO: use scikit learn for more complicated examples]

In [1]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## Toy example: vegetarian buying habit
Imagine you are running an online grocery shopping site and would like to send an email about amazing vegetarian ready-meals. A simple way to choose recipients is to predict who is vegetarian from the items they have bought. You have some training data about vegetarian ($y = 1$) and non-vegetarian ($y = 0$) customers. Let us start simple by including only two features: (1) $x_1$, whether they have bought Quorn before, and (2) $x_2$, whether they have bought steak before.

In [2]:
# Create training data
# 200 Users, first 100 are non-vegetarian
X = np.zeros((200,2))
Y = np.zeros((200))
Y[100:] = 1
# Randomly assign features
rand_idx = np.random.permutation(100)
X[rand_idx[:20],0] = 1
rand_idx = np.random.permutation(100)
X[rand_idx[:70],1] = 1
rand_idx = np.random.permutation(100)+100
X[rand_idx[:80],0] = 1
rand_idx = np.random.permutation(100)+100
X[rand_idx[:10],1] = 1
# Per-class purchase counts for [Quorn, steak]
print(X[:100].sum(axis=0))  # non-vegetarians
print(X[100:].sum(axis=0))  # vegetarians

[ 20.  70.]
[ 80.  10.]


### Bayesian classifier
Given how the data is generated, the Bernoulli parameters are
$$ p(x_1=1|y=0) = 0.2,\ p(x_1=1|y=1) = 0.8, \ p(x_2 = 1|y = 0) = 0.7, \ p(x_2 = 1|y = 1) = 0.1 $$
and the class priors are equal, $p(y=0) = p(y=1) = 0.5$. Thus, given a new example $x = [1,1]$, the prediction is
$$p(y = 0|x = [1,1]) = \frac{p(x|y=0)p(y=0)}{p(x)} = \frac{p(y=0)p(x_1=1|y = 0)p(x_2=1|y=0)}{p(y=0)p(x_1=1|y = 0)p(x_2=1|y=0)+p(y=1)p(x_1=1|y = 1)p(x_2=1|y=1)}$$
where the naive (conditional independence) assumption lets us factor $p(x|y)$ into per-feature terms. The result is
$$p(y = 0|x = [1,1]) = \frac{0.5\times 0.2 \times 0.7}{0.5\times 0.2 \times 0.7 + 0.5\times 0.8 \times 0.1} = 7/11 \approx 0.636 $$
Thus the probability that a user who bought both Quorn and steak is a non-vegetarian is 0.636.
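
The arithmetic above can be checked in a few lines of plain Python. This is only a sketch of Bayes' rule for this toy example; the variable names (`prior`, `likelihood`, `posterior_y0`) are my own and are not part of scikit-learn.

```python
from fractions import Fraction

# Class priors: 100 of the 200 users are in each class
prior = {0: Fraction(1, 2), 1: Fraction(1, 2)}

# p(x_j = 1 | y) for j = 1 (Quorn) and j = 2 (steak), taken from the counts
likelihood = {
    0: [Fraction(20, 100), Fraction(70, 100)],
    1: [Fraction(80, 100), Fraction(10, 100)],
}

# Joint p(y) * p(x_1 = 1 | y) * p(x_2 = 1 | y) for the new example x = [1, 1]
joint = {y: prior[y] * likelihood[y][0] * likelihood[y][1] for y in (0, 1)}

# Normalise by p(x), the sum of the joint over both classes
posterior_y0 = joint[0] / (joint[0] + joint[1])
print(posterior_y0, round(float(posterior_y0), 3))  # 7/11 0.636
```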

### Test with scikit learn

In [3]:
from sklearn.naive_bayes import BernoulliNB
clf = BernoulliNB(alpha=0, binarize=0.0, class_prior=None, fit_prior=True)
clf.fit(X, Y)
# Check some properties
print('Counting the classes: ', clf.class_count_)
print('Counting the features: \n', clf.feature_count_)
# Try predicting x = [1,1] (note: the input must be a 2D array of shape (n_samples, n_features), here a single row)
new_x = np.ones((1,2))
print(clf.predict_proba(new_x))

Counting the classes:  [ 100.  100.]
Counting the features: 
 [[ 20.  70.]
 [ 80.  10.]]
[[ 0.63636364  0.36363636]]


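This probability (0.636) is what we expect! (In the scikit-learn version used here, passing alpha=0 triggers a warning and a very small positive alpha is substituted; the result is unaffected to the digits shown.)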

Let us see how Laplace smoothing is applied.
According to Andrew Ng's notes, the Laplace-smoothed estimate is
$$ p(x_j=1|y=0) = \frac{\sum_{i=1}^m 1\{x_j^{(i)}=1 \wedge y^{(i)}=0 \} + 1 }{\sum_{i=1}^m 1\{y^{(i)}=0 \} + 2}$$
- The indicator $1\{\cdot\}$ equals 1 when the condition inside holds and 0 otherwise, so each sum counts the matching training examples
- The $+2$ at the bottom is the number of values the binary feature $x_j$ can take (0 or 1)

Therefore, in this case the adjusted values are
$$ p(x_1=1|y=0) = 21/102,\ p(x_1=1|y=1) = 81/102, \ p(x_2 = 1|y = 0) = 71/102, \ p(x_2 = 1|y = 1) = 11/102 $$
The new example thus has probability
$$p(y = 0|x = [1,1]) = \frac{0.5\times \frac{21}{102} \times \frac{71}{102}}{0.5\times \frac{21}{102} \times \frac{71}{102} + 0.5\times \frac{81}{102} \times \frac{11}{102}} = \frac{1491}{2382} \approx 0.626 $$
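
The smoothed estimates and the resulting posterior can be reproduced from the raw counts. The helper below is a hypothetical sketch, not scikit-learn code; the name `smoothed_posterior_y0` and its `alpha` argument are assumptions chosen for illustration, with `alpha = 1` corresponding to the add-one rule above.

```python
from fractions import Fraction

# Counts of x_j = 1 per class, as printed for the training data
feature_count = {0: [20, 70], 1: [80, 10]}
class_count = {0: 100, 1: 100}


def smoothed_posterior_y0(alpha):
    """Posterior p(y=0 | x=[1,1]) with add-alpha smoothing and equal priors."""
    joint = {}
    for y in (0, 1):
        prob = Fraction(1, 2)  # prior p(y)
        for count in feature_count[y]:
            # (count + alpha) / (class size + 2 * alpha): a binary feature has 2 values
            prob *= (Fraction(count) + alpha) / (class_count[y] + 2 * alpha)
        joint[y] = prob
    return joint[0] / (joint[0] + joint[1])


p = smoothed_posterior_y0(Fraction(1))
print(p, round(float(p), 4))  # 497/794 0.6259
```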


In [4]:
clf2 = BernoulliNB(alpha=1, binarize=0.0, class_prior=None, fit_prior=True)
clf2.fit(X, Y)
new_x = np.ones((1,2))
print(clf2.predict_proba(new_x))

[[ 0.62594458  0.37405542]]


Now we understand what alpha is! If alpha = $0.5$, we would expect $0.5$ to be added at the top of the Laplace smoothing instead of $1$, with the bottom adjusted accordingly to $+1$ instead of $+2$.
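
As a worked sketch of that guess, assume the bottom grows by $2\alpha$, which is consistent with the alpha = 1 case. The symbols $N_{jc}$ (number of class-$c$ users with $x_j = 1$) and $N_c$ (number of class-$c$ users) are my own notation. The estimate would then be
$$ p(x_j=1|y=c) = \frac{N_{jc} + \alpha}{N_c + 2\alpha} $$
and for alpha = $0.5$ the new example would have probability
$$p(y = 0|x = [1,1]) = \frac{0.5\times \frac{20.5}{101} \times \frac{70.5}{101}}{0.5\times \frac{20.5}{101} \times \frac{70.5}{101} + 0.5\times \frac{80.5}{101} \times \frac{10.5}{101}} = \frac{1445.25}{2290.5} \approx 0.631 $$
which lies between the unsmoothed value (0.636) and the alpha = 1 value (0.626).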

We can also include prior information using the parameters fit_prior and class_prior. Let's assume that, by default, a user is more likely to be non-vegetarian, with $p(y=1) = 0.4$. The result for the previous scenario becomes

In [5]:
clf2 = BernoulliNB(alpha=1, binarize=0.0, class_prior=[0.6,0.4], fit_prior=True)
clf2.fit(X, Y)
new_x = np.ones((1,2))
print(clf2.predict_proba(new_x))

[[ 0.71510791  0.28489209]]


As expected, the above result is the same as
$$p(y = 0|x = [1,1]) = \frac{0.6 \times \frac{21}{102} \times \frac{71}{102}}{0.6\times \frac{21}{102} \times \frac{71}{102} + 0.4\times \frac{81}{102} \times \frac{11}{102}} \approx 0.715 $$
Therefore, it uses the class_prior we provided. If class_prior is provided, fit_prior has no effect. However, if class_prior is not given and fit_prior is set to False, the prior is assumed to be uniform, [0.5,0.5], instead of following the class frequencies in the data.
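
To see how the prior enters, here is a small sketch that works in log space, the usual way to avoid underflow when many per-feature probabilities are multiplied. It is not scikit-learn's code; `theta` and `posterior_y0` are names I made up, and the smoothed estimates are the alpha = 1 values computed earlier.

```python
import math

# Laplace-smoothed (alpha = 1) estimates of p(x_j = 1 | y) from the text
theta = {0: [21 / 102, 71 / 102], 1: [81 / 102, 11 / 102]}


def posterior_y0(prior_y0):
    """p(y=0 | x=[1,1]) for a given prior p(y=0), computed in log space."""
    log_prior = {0: math.log(prior_y0), 1: math.log(1.0 - prior_y0)}
    # Joint log-likelihood: log p(y) + sum_j log p(x_j = 1 | y)
    log_joint = {
        y: log_prior[y] + sum(math.log(t) for t in theta[y]) for y in (0, 1)
    }
    # Normalise with the log-sum-exp trick for numerical stability
    top = max(log_joint.values())
    log_norm = top + math.log(sum(math.exp(v - top) for v in log_joint.values()))
    return math.exp(log_joint[0] - log_norm)


print(round(posterior_y0(0.6), 3))  # 0.715, the class_prior=[0.6, 0.4] case
print(round(posterior_y0(0.5), 3))  # 0.626, a uniform prior
```

Only the `log_prior` term changes between the two calls; the likelihood terms are identical, which is why the prior shifts the posterior without touching the fitted feature probabilities.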

In principle, we can also use Multinomial Naive Bayes. Let us try it now, with the default alpha = 1.

In [6]:
from sklearn.naive_bayes import MultinomialNB
clf_MN = MultinomialNB()
clf_MN.fit(X,Y)
new_x = np.ones((1,2))
print(clf_MN.predict_proba(new_x))

[[ 0.62594458  0.37405542]]
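
The two models agree on $x = [1,1]$ here, but they need not agree in general. The sketch below assumes the textbook formulations: the Bernoulli model multiplies in $1 - p(x_j=1|y)$ for an absent feature, while the multinomial model normalises the smoothed counts over the features and ignores absent ones. The function names are hypothetical and the code does not call scikit-learn.

```python
from fractions import Fraction

feature_count = {0: [20, 70], 1: [80, 10]}  # purchases per class
class_count = {0: 100, 1: 100}              # users per class
ALPHA = 1                                   # smoothing strength


def normalise(joint):
    total = sum(joint.values())
    return {y: p / total for y, p in joint.items()}


def bernoulli_posterior(x):
    joint = {}
    for y in (0, 1):
        p = Fraction(1, 2)
        for x_j, count in zip(x, feature_count[y]):
            theta = Fraction(count + ALPHA, class_count[y] + 2 * ALPHA)
            p *= theta if x_j else 1 - theta  # absent features count too
        joint[y] = p
    return normalise(joint)


def multinomial_posterior(x):
    joint = {}
    for y in (0, 1):
        total = sum(feature_count[y]) + ALPHA * len(feature_count[y])
        p = Fraction(1, 2)
        for x_j, count in zip(x, feature_count[y]):
            p *= Fraction(count + ALPHA, total) ** x_j  # x_j = 0 contributes 1
        joint[y] = p
    return normalise(joint)


for x in ([1, 1], [1, 0]):
    b, m = bernoulli_posterior(x)[0], multinomial_posterior(x)[0]
    print(x, round(float(b), 3), round(float(m), 3))
# [1, 1] 0.626 0.626
# [1, 0] 0.081 0.206
```

The agreement on $[1,1]$ relies on both classes having the same total number of purchases (90) in this data set, so the multinomial normalising constants cancel; for $[1,0]$ the Bernoulli model also uses the fact that steak was not bought, and the two posteriors differ.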


Some other examples to explore are
- [Fit MNIST](https://github.com/bikz05/ipython-notebooks/blob/master/machine-learning/naive-bayes-mnist-sklearn.ipynb)
- [Spam classifier in CS229's problem sheet 3](http://cs229.stanford.edu/syllabus.html)
