# Naive Bayes Classifiers


Naive Bayes classifiers are a family of classifiers that are quite similar to the linear
models discussed in the previous section. However, they tend to be even **faster** in
training. 

The price paid for this efficiency is that naive Bayes models often provide
generalization performance that is slightly **worse than that of linear classifiers** like
LogisticRegression and LinearSVC.

The reason that naive Bayes models are so efficient is that they learn parameters by
looking at each feature individually and collecting simple per-class statistics from each
feature.

There are three kinds of naive Bayes classifiers implemented in scikit-learn:
1. GaussianNB, 
2. BernoulliNB, and
3. MultinomialNB.

GaussianNB can be applied to any continuous data,
while BernoulliNB assumes binary data and
MultinomialNB assumes count data (that is, each feature represents an integer count of
something, like how often a word appears in a sentence).

BernoulliNB and MultinomialNB are mostly used in text data classification.
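
As a minimal sketch of how the three variants line up with the three kinds of data, the following fits each classifier on a small synthetic dataset. The data, the variable names (`X_continuous`, `X_counts`, `X_binary`, `y_toy`), and the threshold used to binarize the counts are all made up for illustration and do not come from a real task.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.RandomState(0)
y_toy = np.array([0, 1] * 10)        # 20 made-up class labels
shift = y_toy[:, np.newaxis]         # column vector, one entry per sample

# Continuous features: Gaussian noise shifted by the class label
X_continuous = rng.normal(size=(20, 3)) + shift
# Count features: Poisson draws with a class-dependent rate
X_counts = rng.poisson(np.tile(2 + 3 * shift, (1, 3)))
# Binary features: whether each count exceeds an arbitrary threshold
X_binary = (X_counts > 2).astype(int)

models = {
    "GaussianNB": (GaussianNB(), X_continuous),
    "MultinomialNB": (MultinomialNB(), X_counts),
    "BernoulliNB": (BernoulliNB(), X_binary),
}

predictions = {}
for name, (model, data) in models.items():
    model.fit(data, y_toy)
    predictions[name] = model.predict(data)
    print(name, "predicted", predictions[name].shape[0], "labels")
```

All three classes share the same `fit` and `predict` interface, so the only thing that changes between variants is the kind of input each one expects.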


The BernoulliNB classifier counts how often every feature of each class is not zero.
This is most easily understood with an example:

In [1]:
import numpy as np

X = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 0, 0, 1],
              [1, 0, 1, 0]])
y = np.array([0, 1, 0, 1])


Here, X is a 4x4 NumPy array representing a dataset with four data points (rows) and four features (columns). y is an array representing the class labels (0 or 1) for each data point.

For class 0 (the first and third data points, where y is 0), the first feature (first column) is zero two times
and nonzero zero times, the second feature is zero one time and nonzero one time,
and so on. These same counts are then calculated for the data points in the second
class.

Counting the nonzero entries per class in essence looks like this:

In [2]:
counts = {}
# iterate over each class
for label in np.unique(y):
    # count (sum) entries of 1 per feature
    counts[label] = X[y == label].sum(axis=0)
print("Feature counts:\n{}".format(counts))

Feature counts:
{0: array([0, 1, 0, 2]), 1: array([2, 0, 2, 1])}



The output shows, for each class, how many times each feature is nonzero. Each class has two data points, so the number of zeros for a feature is 2 minus its nonzero count. Let's break it down:

For Class 0:

Feature 1 has 0 occurrences of 1 and 2 occurrences of 0.

Feature 2 has 1 occurrence of 1 and 1 occurrence of 0.

Feature 3 has 0 occurrences of 1 and 2 occurrences of 0.

Feature 4 has 2 occurrences of 1 and 0 occurrences of 0.

For Class 1:

Feature 1 has 2 occurrences of 1 and 0 occurrences of 0.

Feature 2 has 0 occurrences of 1 and 2 occurrences of 0.

Feature 3 has 2 occurrences of 1 and 0 occurrences of 0.

Feature 4 has 1 occurrence of 1 and 1 occurrence of 0.
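
The breakdown above can be cross-checked against a fitted model. This sketch assumes scikit-learn's `BernoulliNB`, which exposes the per-class nonzero counts as `feature_count_` and the number of samples per class as `class_count_`. The names `nonzero_counts` and `zero_counts` are introduced here only for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

X = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 0, 0, 1],
              [1, 0, 1, 0]])
y = np.array([0, 1, 0, 1])

clf = BernoulliNB().fit(X, y)

# Rows are classes, columns are features
nonzero_counts = clf.feature_count_
# Zeros per feature = samples in the class minus nonzero entries
zero_counts = clf.class_count_[:, np.newaxis] - nonzero_counts

print("Nonzero counts per class:\n{}".format(nonzero_counts))
print("Zero counts per class:\n{}".format(zero_counts))
```

The rows of `nonzero_counts` should match the two arrays in the hand-computed `counts` dictionary, stored as floats.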

=============================================================================================

The other two naive Bayes models, MultinomialNB and GaussianNB, are slightly
different in what kinds of statistics they compute.

***MultinomialNB takes into account the average value of each feature for each class, while GaussianNB stores the average value as well as the standard deviation of each feature for each class.***
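
To make the per-class statistics concrete, here is a small sketch that computes the mean and standard deviation of each feature per class with NumPy and compares the means to what a fitted `GaussianNB` stores in its `theta_` attribute. The toy arrays `X_cont` and `y_cont` are invented for this example.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Made-up continuous data: four points, two features, two classes
X_cont = np.array([[1.0, 2.0],
                   [2.0, 4.0],
                   [3.0, 0.0],
                   [5.0, 2.0]])
y_cont = np.array([0, 0, 1, 1])

classes = np.unique(y_cont)
# One row per class, one column per feature
class_means = np.array([X_cont[y_cont == c].mean(axis=0) for c in classes])
class_stds = np.array([X_cont[y_cont == c].std(axis=0) for c in classes])

gnb = GaussianNB().fit(X_cont, y_cont)

print("Per-class means:\n{}".format(class_means))
print("Per-class standard deviations:\n{}".format(class_stds))
print("Means stored by GaussianNB:\n{}".format(gnb.theta_))
```

The hand-computed means should agree with `gnb.theta_`. The stored spread is not compared here because scikit-learn keeps a variance with a small smoothing term added, so it will not match the raw standard deviation exactly.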

To make a prediction, a data point is compared to the statistics for each of the classes,
and the best matching class is predicted. Interestingly, for both MultinomialNB and
BernoulliNB, this leads to a prediction formula that is of the same form as in the
linear models (see “Linear models for classification” on page 56). Unfortunately, coef_
for the naive Bayes models has a somewhat different meaning than in the linear
models, in that coef_ is not the same as w.
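
The linear form of the prediction can be sketched for MultinomialNB as follows. The example avoids `coef_` and instead uses the fitted attributes `feature_log_prob_` and `class_log_prior_`, computing one score per class as a dot product plus an offset. The count data and the names `scores` and `manual_pred` are assumptions made for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up count data: four documents, three word counts each
X_counts = np.array([[3, 0, 1],
                     [2, 1, 0],
                     [0, 4, 2],
                     [1, 3, 3]])
y_counts = np.array([0, 0, 1, 1])

mnb = MultinomialNB(alpha=1.0).fit(X_counts, y_counts)

# Linear score per class: x . log P(feature | class) + log P(class)
scores = X_counts @ mnb.feature_log_prob_.T + mnb.class_log_prior_
# Predict the class with the highest score
manual_pred = mnb.classes_[np.argmax(scores, axis=1)]

print("Manual predictions: {}".format(manual_pred))
print("mnb.predict:        {}".format(mnb.predict(X_counts)))
```

Each class has its own weight vector (a row of `feature_log_prob_`) and its own offset, which is why the decision rule has the same shape as a linear classifier even though the weights are estimated from counts rather than optimized.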


MultinomialNB and BernoulliNB have a single parameter, **alpha**, which controls
model complexity. The way alpha works is that the algorithm adds to the data alpha
many virtual data points that have positive values for all the features. This results in a
“smoothing” of the statistics. 

***A large alpha means more smoothing, resulting in less complex models.***
 The algorithm’s performance is relatively robust to the setting of
alpha, meaning that setting alpha is not critical for good performance. However,
tuning it usually improves accuracy somewhat.
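
The effect of alpha can be illustrated on the small binary dataset from the counting example. This sketch assumes that BernoulliNB smooths by adding alpha to each per-class nonzero count and 2 * alpha to each class count (one alpha for each of the two possible feature values), and then checks that assumption against the fitted `feature_log_prob_`. The names `fitted_probs`, `manual_probs`, and `spread` are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

X = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 0, 0, 1],
              [1, 0, 1, 0]])
y = np.array([0, 1, 0, 1])

alphas = [0.1, 1.0, 10.0]
fitted_probs = {}
manual_probs = {}
spread = {}
for alpha in alphas:
    clf = BernoulliNB(alpha=alpha).fit(X, y)
    # Estimated P(feature is nonzero | class) from the fitted model
    fitted_probs[alpha] = np.exp(clf.feature_log_prob_)
    # Assumed smoothing rule: (count + alpha) / (class size + 2 * alpha)
    manual_probs[alpha] = (clf.feature_count_ + alpha) / (
        clf.class_count_[:, np.newaxis] + 2 * alpha)
    # Largest distance of any estimate from the uninformative value 0.5
    spread[alpha] = np.abs(fitted_probs[alpha] - 0.5).max()
    print("alpha={}: max distance from 0.5 = {:.3f}".format(alpha, spread[alpha]))
```

As alpha grows, the estimated probabilities are pulled toward 0.5, so the model relies less on the observed counts. That is the sense in which more smoothing gives a less complex model.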

**GaussianNB is mostly used on very high-dimensional data, while the other two variants of naive Bayes are widely used for sparse count data such as text.**
MultinomialNB usually performs better than BernoulliNB, particularly on datasets with a relatively large
number of nonzero features (i.e., large documents).
