# Naive Bayes Classifiers

Naive Bayes models are fast and simple classification algorithms, often suitable for very high-dimensional datasets. Because they are so fast and have few tunable parameters, they are a very useful baseline for a classification problem.

## Bayes' Theorem

To understand how Bayes classifiers work, we first need Bayes' theorem, an equation that relates the conditional probabilities of statistical variables. In Bayesian classification, we are interested in the probability of a label given some observed features, which we can write as $P(L~|~{\rm features})$. Bayes' theorem lets us express this in terms of quantities that we can compute more directly:

$$ P(L~|~{\rm features}) = \frac{P({\rm features}~|~L)P(L)}{P({\rm features})}$$

In a binary classification problem, where each example must be assigned one of two labels (call them $L_1$ and $L_2$), one way to make this decision is to compute the ratio of the posterior probabilities of the two labels. The evidence $P({\rm features})$ is the same for both labels, so it cancels:

$$\frac{P(L_1~|~{\rm features})}{P(L_2~|~{\rm features})} = \frac{P({\rm features}~|~L_1)}{P({\rm features}~|~L_2)}\frac{P(L_1)}{P(L_2)}$$
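As a quick numeric illustration of this ratio, the sketch below plugs in made-up likelihoods and priors and picks the label with the larger posterior. All four numbers are hypothetical, chosen only to make the arithmetic visible.

```python
# Hypothetical numbers, chosen only to illustrate the arithmetic.
likelihood_L1 = 0.02   # P(features | L1)
likelihood_L2 = 0.05   # P(features | L2)
prior_L1 = 0.8         # P(L1)
prior_L2 = 0.2         # P(L2)

# Posterior ratio: likelihood ratio times prior ratio.
posterior_ratio = (likelihood_L1 / likelihood_L2) * (prior_L1 / prior_L2)

# A ratio above 1 favours L1, a ratio below 1 favours L2.
predicted = "L1" if posterior_ratio > 1 else "L2"
print(posterior_ratio, predicted)
```

Here the likelihood alone favours $L_2$, but the much larger prior of $L_1$ tips the decision the other way.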

All we need now is some model by which we can compute $P({\rm features}~|~L_i)$ for each label. Such a model is called a *generative model* because it specifies the hypothetical random process that generates the data. Specifying this generative model for each label is the main part of training such a Bayesian classifier. In general this training step is very difficult, but we can make it tractable by adopting some simplifying assumptions about the form of the model.

This is why the method is named "naive Bayes". If we make "naive" assumptions about the generative model behind our data, we can find a rough approximation of the generative model for each class, and then proceed with the Bayesian classification. Different types of naive Bayes classifiers rest on different naive assumptions about the data.

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

## Gaussian Naive Bayes
In this classifier, the assumption is that the data associated with each label is drawn from a *Gaussian distribution*. Let us start by generating some points with scikit-learn's `make_blobs` function: generate 100 points around 2 centers, with `random_state=2` and `cluster_std=1.5`, and make a scatter plot of the generated points.
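One possible way to do this is sketched below. The `make_blobs` arguments come from the text; the marker size, colormap and axis labels are our own choices.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# 100 two-dimensional points drawn around 2 centers, as the exercise asks.
X, y = make_blobs(n_samples=100, n_features=2, centers=2,
                  random_state=2, cluster_std=1.5)

# Colour each point by its label; the styling is an assumption of this sketch.
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu')
ax.set(xlabel='feature 0', ylabel='feature 1')
```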

If we assume that the points of each class are generated from a Gaussian distribution with no covariance between dimensions (that is, the features are independent of each other within a class), then we can approximate the generative distribution simply by computing the mean and the standard deviation of each feature for the points of each class. The following function plots the Gaussian distributions fitted to the given points.

In [None]:
def plot_gaussian_distributions(X, y):
    fig, ax = plt.subplots()

    ax.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu')
    ax.set_title('Naive Bayes Model', size=14)

    xlim = 1.1*np.array([X[:, 0].min(), X[:, 0].max()]) 
    ylim = 1.1*np.array([X[:, 1].min(), X[:, 1].max()])

    xg = np.linspace(xlim[0], xlim[1], 50)
    yg = np.linspace(ylim[0], ylim[1], 50)
    xx, yy = np.meshgrid(xg, yg)
    Xgrid = np.vstack([xx.ravel(), yy.ravel()]).T

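    for label, color in enumerate(['red', 'blue']):
        mask = (y == label)
        # Per-feature mean and standard deviation of this class
        mu, std = X[mask].mean(0), X[mask].std(0)
        # Unnormalized Gaussian density (peak value 1), product over features
        P = np.exp(-0.5 * (Xgrid - mu) ** 2 / std ** 2).prod(1)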
        ax.pcolorfast(xg, yg, P.reshape(xx.shape), alpha=0.5,
                      cmap=color.title() + 's')
        ax.contour(xx, yy, P.reshape(xx.shape),
                   levels=[0.01, 0.1, 0.5, 0.9],
                   colors=color, alpha=0.2)

    ax.set(xlim=xlim, ylim=ylim);

Plot the Gaussian distributions for the generated points.

The ellipses here represent the Gaussian generative model for each label, with larger probability toward the center of the ellipses. With this generative model in place for each class, we have a simple recipe to compute the likelihood $P({\rm features}~|~L_i)$ for any data point, and thus we can quickly compute the posterior ratio and determine which label is the most probable for a given point.
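To make this recipe concrete, here is a minimal from-scratch sketch in NumPy. It is not how Scikit-Learn implements the estimator, and the helper name `log_posterior_unnormalized` is our own. It fits a mean and standard deviation per class and feature, adds the log prior to the Gaussian log-likelihood, and picks the label with the highest score.

```python
import numpy as np
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, random_state=2, cluster_std=1.5)

classes = np.unique(y)
# Per-class, per-feature mean and standard deviation (the "naive" fit).
mu = np.array([X[y == c].mean(axis=0) for c in classes])
std = np.array([X[y == c].std(axis=0) for c in classes])
# Class priors estimated from the label frequencies.
log_prior = np.log(np.array([(y == c).mean() for c in classes]))

def log_posterior_unnormalized(points):
    """Return log P(features | L_i) + log P(L_i) for each point and class."""
    # Broadcast to shape (n_points, n_classes, n_features).
    z = (points[:, None, :] - mu[None, :, :]) / std[None, :, :]
    log_lik = (-0.5 * z ** 2 - np.log(std[None, :, :])
               - 0.5 * np.log(2 * np.pi)).sum(axis=2)
    return log_lik + log_prior

# The most probable label maximises the (log) posterior.
pred = classes[np.argmax(log_posterior_unnormalized(X), axis=1)]
print("training accuracy:", (pred == y).mean())
```

Working with logarithms turns the product over features into a sum and avoids numerical underflow when there are many features.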

This procedure is implemented in Scikit-Learn's `sklearn.naive_bayes.GaussianNB` estimator. Import this model and fit it to the dataset we have generated.

Now generate some new data and use the model to predict the labels. Generate points uniformly distributed between -6 and 8 on the x axis and between -14 and 4 on the y axis, after setting `np.random.seed(0)`.

Now predict the labels with the Gaussian naive Bayes classifier.
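A possible sketch of these two steps follows. The ranges and the seed come from the text; the number of new points (2000) and the variable names `Xnew` and `ynew` are assumptions of ours.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

X, y = make_blobs(n_samples=100, centers=2, random_state=2, cluster_std=1.5)
model = GaussianNB().fit(X, y)

# 2000 points is an arbitrary choice; the ranges come from the text.
np.random.seed(0)
Xnew = np.random.uniform(low=[-6, -14], high=[8, 4], size=(2000, 2))

# One predicted label per new point.
ynew = model.predict(Xnew)
```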

To get an idea of the decision boundary, plot the predictions with a scatter plot. Plot the points of the training set first and then the newly generated points with `alpha=0.2`. What shape does the decision boundary follow?

Now create a new model with different priors, fit it to the same training set, and plot its predictions over the uniformly distributed data to see the decision boundary. How does the prior affect the decision?
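As a starting point, the sketch below compares the default model with one that uses a deliberately lopsided prior. The values `[0.99, 0.01]` are hypothetical, and so is the choice of 2000 evaluation points. Instead of plotting, it reports the fraction of points assigned to class 0 under each model.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

X, y = make_blobs(n_samples=100, centers=2, random_state=2, cluster_std=1.5)
np.random.seed(0)
Xnew = np.random.uniform(low=[-6, -14], high=[8, 4], size=(2000, 2))

# Default: priors are estimated from the class frequencies in y.
default_model = GaussianNB().fit(X, y)
# Hypothetical, deliberately lopsided prior: 99% of the mass on class 0.
skewed_model = GaussianNB(priors=[0.99, 0.01]).fit(X, y)

# Fraction of the new points assigned to class 0 by each model.
frac_default = (default_model.predict(Xnew) == 0).mean()
frac_skewed = (skewed_model.predict(Xnew) == 0).mean()
print(frac_default, frac_skewed)
```

A larger prior for a class adds a constant to its log posterior everywhere, so the region assigned to that class can only grow.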

An interesting property of this Bayesian formalism is that it naturally allows for probabilistic classification, which we can compute using the `predict_proba` method. Use it to obtain the posterior probabilities of the uniformly distributed data. Use the model without predefined priors.
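A minimal sketch of this step, assuming 2000 uniformly distributed points (the count and the name `Xnew` are our choices):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

X, y = make_blobs(n_samples=100, centers=2, random_state=2, cluster_std=1.5)
model = GaussianNB().fit(X, y)  # no predefined priors

np.random.seed(0)
Xnew = np.random.uniform(low=[-6, -14], high=[8, 4], size=(2000, 2))

# One row per point, one column per class, in the order of model.classes_.
proba = model.predict_proba(Xnew)
print(proba[:5].round(3))
```

Each row sums to 1, so for two classes the second column is simply one minus the first.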

Now plot the original training set over a heatmap of the posterior probabilities. Define a new function `plot_posterior_proba` that receives `X, y, model` and plots the points in `X` together with a heatmap of the posterior probability of class 0. You may reuse part of the code provided above to plot the Gaussian distributions. Use `cmap='jet'` with the `pcolorfast` method.
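One way such a function could look is sketched below. The grid construction mirrors the plotting function provided earlier. The 10% padding, the colorbar and the title are additions of ours, not requirements of the exercise.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

def plot_posterior_proba(X, y, model):
    """Scatter X coloured by y over a heatmap of P(class 0 | features)."""
    fig, ax = plt.subplots()
    ax.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu')
    ax.set_title('Posterior probability of class 0', size=14)

    # Pad the data range by 10% on each side (an assumption of this sketch).
    pad = 0.1 * (X.max(axis=0) - X.min(axis=0))
    xlim = (X[:, 0].min() - pad[0], X[:, 0].max() + pad[0])
    ylim = (X[:, 1].min() - pad[1], X[:, 1].max() + pad[1])

    # Evaluate the model on a regular 50 x 50 grid covering the plot area.
    xg = np.linspace(xlim[0], xlim[1], 50)
    yg = np.linspace(ylim[0], ylim[1], 50)
    xx, yy = np.meshgrid(xg, yg)
    Xgrid = np.vstack([xx.ravel(), yy.ravel()]).T
    # Column 0 of predict_proba corresponds to model.classes_[0].
    P = model.predict_proba(Xgrid)[:, 0].reshape(xx.shape)

    im = ax.pcolorfast(xg, yg, P, cmap='jet', alpha=0.5, vmin=0, vmax=1)
    fig.colorbar(im, ax=ax, label='P(class 0 | features)')
    ax.set(xlim=xlim, ylim=ylim)
    return ax

X, y = make_blobs(n_samples=100, centers=2, random_state=2, cluster_std=1.5)
ax = plot_posterior_proba(X, y, GaussianNB().fit(X, y))
```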

Plot the posterior probability. Is this shape expected? What observations do you have about the "border"?

Naive Bayes classifiers are very fast and, as we have seen, very robust to variations in the prior distributions. But what happens if the data does not exactly follow a Gaussian distribution? Generate a new dataset with `make_circles`: 100 points with `factor=.1, noise=.2, random_state=0`.

Fit a `GaussianNB` model to this dataset and plot the posterior probability. What do you observe? Was this expected?
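Before plotting, it can help to look at a few numbers. The sketch below (the variable names are ours) fits the model to the circles and prints the per-class spread of each feature together with the training accuracy.

```python
from sklearn.datasets import make_circles
from sklearn.naive_bayes import GaussianNB

Xc, yc = make_circles(n_samples=100, factor=.1, noise=.2, random_state=0)
circle_model = GaussianNB().fit(Xc, yc)

# Per-class standard deviation of each feature.
for label in circle_model.classes_:
    print(label, Xc[yc == label].std(axis=0))

# Fraction of training points that the model labels correctly.
print("training accuracy:", circle_model.score(Xc, yc))
```

Both classes are centered near the origin, so the fitted means are almost identical and only the difference in spread can separate the classes.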

Now plot the Gaussian distributions fitted to this data. What shape do they have? Is the "naive" assumption correct?

Now generate a dataset whose classes are not "circular". Use the `make_moons` function from sklearn with `noise=0.2`. Then fit another model and plot the posterior probability and the Gaussian distributions.

## Multinomial Naive Bayes
The Gaussian assumption just described is not the only assumption that could be used about the generative distribution for each label. Another useful example is *multinomial naive Bayes*, where the features are assumed to be generated from a *multinomial distribution*. The multinomial distribution describes the probability of observing counts among a number of categories, and thus multinomial naive Bayes is most appropriate for features that represent counts or count rates.

The idea is precisely the same as before, except that instead of modeling the data distribution with the best-fit Gaussian, we model the data distribution with a best-fit multinomial distribution.
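To see what a "best-fit multinomial" amounts to, here is a small sketch on a made-up count matrix. The data and variable names are hypothetical. It estimates smoothed per-class word probabilities by hand, scores a new document, and compares the result with Scikit-Learn's `MultinomialNB` using the same smoothing.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count matrix: 4 documents, 3 vocabulary words.
X_counts = np.array([[2, 1, 0],
                     [3, 0, 0],
                     [0, 1, 3],
                     [0, 2, 2]])
y_counts = np.array([0, 0, 1, 1])
classes = np.unique(y_counts)
alpha = 1.0  # Laplace smoothing, so unseen words do not get probability 0

# Best-fit multinomial per class: smoothed relative frequency of each word.
word_counts = np.array([X_counts[y_counts == c].sum(axis=0) for c in classes])
log_word_prob = np.log((word_counts + alpha)
                       / (word_counts.sum(axis=1, keepdims=True)
                          + alpha * X_counts.shape[1]))
log_prior = np.log(np.array([(y_counts == c).mean() for c in classes]))

# Log posterior (up to a constant) of a new document: counts times log-probs.
new_doc = np.array([[1, 0, 2]])
scores = new_doc @ log_word_prob.T + log_prior
manual_pred = classes[np.argmax(scores, axis=1)]

# scikit-learn's estimator with the same smoothing, for comparison.
sk_model = MultinomialNB(alpha=alpha).fit(X_counts, y_counts)
print(manual_pred, sk_model.predict(new_doc))
```

The multinomial coefficient is omitted from the score because it depends only on the document, not on the class, and so does not change which class wins.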

### Example: Classifying text
One place where multinomial naive Bayes might be useful is in text classification, where the features are related to word counts or frequencies within the documents to be classified. Here we will use the sparse word count features from the 20 Newsgroups corpus to show how we might classify these short documents into categories. Let's download the data and take a look at the target names:

In [None]:
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups()

Print the dataset description

Print the target names

For simplicity here, we will select just a few of these categories, and download the training and testing sets.

Print the text of example number 6 in the training set (remember that Python starts counting from 0).

In order to use this data for machine learning, we need to convert the content of each string into a vector of numbers. For this we will use the TF-IDF vectorizer. Term frequency-inverse document frequency (TF-IDF) weights each word count by a measure of how often the word appears across the documents, so that words common to most documents count for less. We will compute the TF-IDF features and create a pipeline that feeds them to a multinomial naive Bayes classifier.

Make a pipeline with the `TfidfVectorizer` and `MultinomialNB`.

Now fit this pipeline to the training data and predict labels for the test data.
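The mechanics can be tried on a tiny in-memory corpus first. The documents, labels and the names `toy_docs`, `toy_labels` and `toy_model` below are hypothetical stand-ins, not the newsgroup data. With the real dataset, the same `fit` and `predict` calls would receive the training and test texts and targets instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the newsgroup posts.
toy_docs = [
    "the rocket launched into orbit",
    "nasa plans a mission to the moon",
    "the gpu renders the image quickly",
    "shaders and textures in computer graphics",
]
toy_labels = ["space", "space", "graphics", "graphics"]

# The vectorizer turns strings into TF-IDF features for the classifier.
toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(toy_docs, toy_labels)

# Raw strings go in; the pipeline vectorizes them before predicting.
toy_pred = toy_model.predict(["orbit mission moon", "gpu textures"])
print(toy_pred)
```

Wrapping both steps in one pipeline means the vocabulary learned during `fit` is reused automatically at prediction time.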

Now that we have predicted the labels for the test data, we can evaluate them to learn about the performance of the estimator. Plot the confusion matrix of the predictions. Use the `confusion_matrix` from sklearn and the `sns.heatmap`. What can you say about this confusion matrix? Are all the classes equally well separated?
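As a reminder of how `confusion_matrix` lays out its result, here is a small sketch on made-up labels (both lists are hypothetical):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Rows are true labels, columns are predicted labels.
mat = confusion_matrix(y_true, y_pred)
print(mat)
```

The diagonal holds the correctly classified examples and every off-diagonal entry is a confusion between two classes. The resulting array can be passed to `sns.heatmap`, for example with `annot=True` and `fmt='d'`, and with the category names as `xticklabels` and `yticklabels`.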

An interesting experiment that we can do now is to try to determine the category for any string, using this pipeline. This is a quick function that will return the prediction for a single string:

In [None]:
def predict_category(s, train=train, model=model):
    pred = model.predict([s])
    return train.target_names[pred[0]]

Now try different sentences and see how they are classified.