# Naive Bayes

## Introduction:

In this notebook, we will learn about the naive Bayes classifier, which relates the probability that an instance belongs to a given target class to the prior probability of that class and the joint probability of the observed features given the class. By assuming that the features are conditionally independent given the class, this joint probability simplifies into a product of individual conditional probabilities, one per feature. Combining this with Bayes' theorem gives the final probabilistic classification for a given set of features. The algorithm is quick to apply once the model is trained and often provides satisfactory results, which can be used on their own or as a comparative benchmark for more advanced techniques.

We will first explore the probability concepts behind this classifier. Next, the naive Bayes classification algorithm will be introduced using a synthetic two-class data set and the 20 Newsgroups text data set.

## Classification
* **Input** : Measurements $x^{(1)} ,\ldots, x^{(n)}$ in an input space $\mathcal{X} \subseteq \mathbb{R}^d $. Each measurement $x^{(i)} $ consists of $d$ features.

* **Output** : The discrete output space $\mathcal{Y}$ is composed of $K$ possible classes : 
  * $ \mathcal{Y}=  \{ 0 , 1 \}$ is called binary classification. 
  * $ \mathcal{Y} = \{ 1,\ldots, K \}$ is called multiclass classification.

## The Bayes Classifier:

In Bayesian classification, we're interested in finding the probability of a label $Y=y$ given some observed features $X=x$, which we can write as $P(Y=y~|~X=x)$.
Bayes' theorem tells us how to express this in terms of quantities we can compute more directly:

$$
P(Y=y~|~X=x) = \frac{P(X=x~|~Y=y)P(Y=y)}{P(X=x)}
$$

Under the assumption $( X , Y ) \overset{\text{iid}}{\sim} \mathcal{P}$ , the optimal classifier is
$$f^* ( x ) := \arg \max_{ y \in \mathcal{Y}  }P(Y = y | X = x ) $$

Since the denominator $P(X = x)$ in Bayes' rule does not depend on $y$, we equivalently have
$$ f^* ( x ) = \arg \max_{ y \in \mathcal{Y}  } P(X = x  | Y = y )P(Y = y )$$

where:
* $P ( X = x | Y = y )$ is called the class conditional distribution of $X$.
* $P ( Y = y )$ is called the class prior.
* In practice, we know neither of these, so we estimate them from the training data.
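To make the decision rule concrete, here is a minimal numerical sketch. The priors and likelihoods below are hypothetical numbers chosen for illustration, not values estimated from any data set in this notebook.

```python
import numpy as np

# Hypothetical two-class problem (illustrative numbers only).
prior = np.array([0.7, 0.3])        # P(Y=0), P(Y=1)
likelihood = np.array([0.1, 0.6])   # P(X=x | Y=0), P(X=x | Y=1) for one observed x

# Numerator of Bayes' rule for each class: P(X=x | Y=y) P(Y=y)
joint = likelihood * prior

# Dividing by the evidence P(X=x) rescales the scores but keeps the argmax.
posterior = joint / joint.sum()

print("posterior:", posterior.round(3))
print("MAP label:", int(np.argmax(posterior)))
```

Note that the class with the larger prior does not win here: the likelihood of the observation under class 1 is large enough to outweigh it.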

**Goal:** Find the label $y \in \mathcal{Y}$ with the greatest probability given the observed data $D$. This is the maximum a posteriori (MAP) hypothesis.

All we need now is some model by which we can compute $P(X=x~|~Y=y_i)$ for each label $y_i \in \mathcal{Y}$.
Such a model is called a *generative model* because it specifies the hypothetical random process that generates the data.

Specifying this generative model for each label is the main piece of the training of such a Bayesian classifier.
The general version of such a training step is a very difficult task, but we can make it simpler through the use of some simplifying assumptions about the form of this model.

**Question:** Under what assumption does the MAP hypothesis simplify to the maximum likelihood (ML) hypothesis?






## Naive Bayes

We have to define $P( X = x | Y = y )$. This is where the "naive" in "naive Bayes" comes in: if we make very naive assumptions about the generative model for each label, we can find a rough approximation of the generative model for each class, and then proceed with the Bayesian classification.

**Simplifying assumption:** Naive Bayes is a Bayes classifier that makes the assumption
$P ( X = x | Y = y ) = \prod_{j=1}^d P( X_j = x_j | Y = y )$.

That is, it treats the dimensions of $X$ as conditionally independent given $Y = y$.
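The factorization can be sketched numerically as follows. The per-feature conditional probabilities and the priors are hypothetical values picked for illustration; the log-space variant is a common numerical safeguard and is an addition here, not something the text above requires.

```python
import numpy as np

# Hypothetical conditionals P(X_j = x_j | Y = y) for one observation with d = 3 features.
# Rows are classes y = 0, 1; columns are features j = 1, 2, 3.
cond = np.array([[0.5, 0.2, 0.9],
                 [0.4, 0.7, 0.6]])
prior = np.array([0.6, 0.4])  # hypothetical class priors P(Y = y)

# Naive Bayes score per class: P(Y=y) * prod_j P(X_j = x_j | Y=y)
score = prior * cond.prod(axis=1)

# Same score in log space, which avoids underflow when d is large.
log_score = np.log(prior) + np.log(cond).sum(axis=1)

print("scores:", score)
print("predicted class:", int(np.argmax(log_score)))
```

Both forms select the same class, because the logarithm is monotonic.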

**Question:** For which of the following data sets can we use the Naive Bayes assumption?

<img src="https://matplotlib.org/3.1.1/_images/sphx_glr_confidence_ellipse_001.png" style="width:600px;height:200px;">

One approach to computing the conditional probabilities was demonstrated in the lecture. A problem with using the counts from a training data set is that the entire process becomes completely deterministic, which means that the same result occurs every time we run the algorithm.
In some cases, this would not be a problem, but in general, we want to account for the fact that any data we have collected (or sampled) is merely a subset of the entire parent population that we wish to quantify. If we were to collect a second data set (or sample), we likely would produce different results.


## Naive Bayes: Classification

When applying the naive Bayes algorithm, we must choose a specific classifier that assumes the features are sampled from a relevant distribution. 

In the `scikit-learn` library, we can employ the naive Bayes algorithm by creating one of the following three estimators, which are all in the `naive_bayes` module:

* [Gaussian Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html): the features are assumed to follow a normal distribution.
* [Multinomial Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html): the features are assumed to follow a multinomial distribution.
* [Bernoulli Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html): the features are assumed to be binary-valued and to follow a Bernoulli distribution.
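As a rough sketch of how the choice of estimator follows the feature type, the snippet below fits each of the three estimators to a small synthetic data set of the matching kind. The data generation (class means, Poisson rates, the threshold used to binarize) is entirely made up for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

rng = np.random.RandomState(0)
y = np.repeat([0, 1], 20)  # 20 samples per class

# Continuous features -> GaussianNB
X_cont = rng.normal(loc=3.0 * y[:, None], scale=1.0, size=(40, 2))
# Non-negative counts -> MultinomialNB
X_count = rng.poisson(lam=np.where(y[:, None] == 1, [5, 1, 1], [1, 1, 5]))
# Binary indicators -> BernoulliNB
X_bin = (X_count > 2).astype(int)

fitted = {}
for name, est, X in [("GaussianNB", GaussianNB(), X_cont),
                     ("MultinomialNB", MultinomialNB(), X_count),
                     ("BernoulliNB", BernoulliNB(), X_bin)]:
    fitted[name] = est.fit(X, y)
    print(name, "training accuracy:", est.score(X, y))
```

All three share the same `fit` / `predict` / `predict_proba` interface, so switching between them only requires changing the estimator and making sure the features have the expected form.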


### Import Libraries:
First, we import all the packages required for this exercise.
- [numpy](https://www.numpy.org) is the main package for scientific computing with Python.
- [pandas](https://pandas.pydata.org) provides the DataFrame used to display tabular data.
- [matplotlib](http://matplotlib.org) and [seaborn](https://seaborn.pydata.org/introduction.html) are libraries to plot graphs in Python.
- `np.random.seed(1)` fixes the seed so that calls to NumPy's random functions are reproducible.

In [None]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams["figure.figsize"] = [9,6]
%matplotlib inline

np.random.seed(1)

## Gaussian Naive Bayes
In this classifier, the assumption is that the data from each label is drawn from a simple Gaussian distribution with no covariance between the feature dimensions.

### Example 1:

In [None]:
from sklearn.datasets import make_blobs
X, y = make_blobs(100, 2, centers=2, random_state=2, cluster_std=1.5)

plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu')
plt.title('Training Data ', size=14)
plt.show()

In [None]:
# Per-class Gaussian fit: data generated from two Gaussian distributions.
fig, ax = plt.subplots()

ax.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu')
ax.set_title('Naive Bayes Model', size=14)

xlim = (-8, 8)
ylim = (-15, 5)

xg = np.linspace(xlim[0], xlim[1], 60)
yg = np.linspace(ylim[0], ylim[1], 40)
xx, yy = np.meshgrid(xg, yg)
Xgrid = np.vstack([xx.ravel(), yy.ravel()]).T

for label, color in enumerate(['red', 'blue']):
    mask = (y == label)
    mu, std = X[mask].mean(0), X[mask].std(0)
    P = np.exp(-0.5 * (Xgrid - mu) ** 2 / std ** 2).prod(1)
    Pm = np.ma.masked_array(P, P < 0.03)
    ax.pcolorfast(xg, yg, Pm.reshape(xx.shape), alpha=0.5,
                  cmap=color.title() + 's')
    ax.contour(xx, yy, P.reshape(xx.shape),
               levels=[0.01, 0.1, 0.5, 0.9],
               colors=color, alpha=0.2)
    
ax.set(xlim=xlim, ylim=ylim)
plt.show()

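The Gaussian naive Bayes estimator takes two hyperparameters: `priors`, the prior probabilities of the different classes (estimated from the class frequencies when not given), and `var_smoothing`, a small value added to the variances for numerical stability.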
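The effect of `priors` can be sketched as follows. The data is regenerated with the same `make_blobs` call as in Example 1 so that the snippet runs on its own; the skewed prior `[0.9, 0.1]` and the query point are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

X, y = make_blobs(100, 2, centers=2, random_state=2, cluster_std=1.5)

# Default: priors are estimated from the class frequencies in y.
default_model = GaussianNB().fit(X, y)
# Explicit priors: class 0 is assumed to be much more common.
skewed_model = GaussianNB(priors=[0.9, 0.1]).fit(X, y)

print("default priors:", default_model.class_prior_)
print("skewed priors: ", skewed_model.class_prior_)

# Compare the posteriors at the overall mean of the data (an arbitrary query point).
point = X.mean(axis=0, keepdims=True)
p_default = default_model.predict_proba(point)
p_skewed = skewed_model.predict_proba(point)
print("default posterior:", p_default.round(3))
print("skewed posterior: ", p_skewed.round(3))
```

Raising the prior of a class can only raise its posterior at any given point, which shifts the decision boundary toward the other class.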

### Task 1:
Define a [Gaussian Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) model and fit it to `X`, `y`.

In [None]:
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X, y);

#### Naive Bayes: Decision Surface

We now compute and display the decision surface for the naive Bayes classifier. 
* First, we create a mesh (a grid of points in two dimensions) that spans the feature space.
* Next, we apply the naive Bayes classifier fitted above to every point of the mesh and color each point by its predicted label.

The resulting figure shows the non-linear nature of this classifier when using the `GaussianNB` estimator.

In [None]:
# Construct mesh grid data
x_1lim = (-8, 8)
x_2lim = (-15, 5)
x_1g = np.linspace(x_1lim[0], x_1lim[1], 60)
x_2g = np.linspace(x_2lim[0], x_2lim[1], 40)
xx, yy = np.meshgrid(x_1g, x_2g)
Xgrid = np.vstack([xx.ravel(), yy.ravel()]).T

# Predict for mesh grid
z = model.predict(Xgrid)

# Plot
fig, ax = plt.subplots()
ax.scatter(Xgrid[:, 0], Xgrid[:, 1], c=z, s=40, cmap='RdBu', alpha=0.2)
ax.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu')
ax.set_title('Naive Bayes Model', size=14)
ax.set(xlim=x_1lim, ylim=x_2lim)
plt.show()

We see a slightly curved boundary in the classifications—in general, the boundary in Gaussian naive Bayes is quadratic.


A nice aspect of this Bayesian formalism is that it naturally allows for probabilistic classification, which we can compute using the `predict_proba` method. Let's generate some new data and predict both the labels and their probabilities:

In [None]:
rng = np.random.RandomState(0)
Xnew = [-6, -14] + [14, 18] * rng.rand(80, 2)

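# Hard label predictions for the last 10 new points
ynew = model.predict(Xnew)
print("ynew", ynew[-10:])

# Posterior probability of each class for the same points
yprob = model.predict_proba(Xnew)
print("yprob", yprob[-10:].round(2))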

The columns give the posterior probabilities of the first and second label, respectively.
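The relationship between `predict_proba` and `predict` can be checked with a short sketch. The model and the new points are rebuilt exactly as in the cells above so that the snippet is self-contained.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

X, y = make_blobs(100, 2, centers=2, random_state=2, cluster_std=1.5)
model = GaussianNB().fit(X, y)

rng = np.random.RandomState(0)
Xnew = [-6, -14] + [14, 18] * rng.rand(80, 2)

yprob = model.predict_proba(Xnew)
ynew = model.predict(Xnew)

# Column k of yprob corresponds to the label model.classes_[k].
print("column order:", model.classes_)

# Each row is a probability distribution over the classes.
rows_sum_to_one = np.allclose(yprob.sum(axis=1), 1.0)

# The hard prediction is the class with the largest posterior.
labels_from_proba = model.classes_[yprob.argmax(axis=1)]
predictions_agree = np.array_equal(labels_from_proba, ynew)

print(rows_sum_to_one, predictions_agree)
```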

Of course, the final classification will only be as good as the model assumptions that lead to it, which is why Gaussian naive Bayes often does not produce very good results. 

Still, in many cases—especially as the number of features becomes large—this assumption is not detrimental enough to prevent Gaussian naive Bayes from being a useful method.

## Multinomial Naive Bayes:

Another useful example is multinomial naive Bayes, where the features are assumed to be generated from a simple multinomial distribution. 

The multinomial distribution describes the probability of observing counts among a number of categories, and thus multinomial naive Bayes is most appropriate for features that represent counts or count rates.

The idea is precisely the same as before, except that instead of modeling the data distribution with the best-fit Gaussian, we model the data distribution with a best-fit multinomial distribution.
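As a small sketch of what "best-fit multinomial" means in practice, the snippet below fits `MultinomialNB` to a made-up count matrix and reproduces the fitted per-class feature probabilities by hand. The count matrix is hypothetical, and the manual formula assumes scikit-learn's additive (Laplace) smoothing with `alpha=1.0`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count features: 4 samples, 3 categories (e.g. word counts).
X = np.array([[3, 0, 1],
              [2, 1, 0],
              [0, 4, 1],
              [1, 3, 2]])
y = np.array([0, 0, 1, 1])

model = MultinomialNB(alpha=1.0).fit(X, y)
theta = np.exp(model.feature_log_prob_)  # P(category i | class y)

# Manual estimate: smoothed relative frequency of each category within a class.
counts = np.array([X[y == c].sum(axis=0) for c in model.classes_])
manual = (counts + 1.0) / (counts.sum(axis=1, keepdims=True) + 1.0 * X.shape[1])

print(theta.round(3))
print(manual.round(3))
```

Each row of `theta` sums to one: it is the parameter vector of the multinomial distribution fitted to that class.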



### Example 2: Classifying Text
One place where multinomial naive Bayes is often used is in text classification, where the features are related to word counts or frequencies within the documents to be classified.

In [None]:
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups()
data.target_names

For simplicity, we take only a subset of the categories and download the corresponding training and test sets:

In [None]:
categories = ['talk.religion.misc', 'soc.religion.christian',
              'sci.space', 'comp.graphics']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

In [None]:
print(train.data[5])


### Text Features

Another common need in feature engineering is to convert text to a set of representative numerical values. One of the simplest methods of encoding data is by word counts: you take each snippet of text, count the occurrences of each word within it, and put the results in a table.

For example, consider the following set of five phrases:


In [None]:
sample = ['problem of evil',
          'evil queen',
          'horizon problem',
          'of and a',
          'of the a']

While doing this by hand would be possible, the tedium can be avoided by using Scikit-Learn's `CountVectorizer`:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
X = vec.fit_transform(sample)
X

It is easier to inspect if we convert this to a DataFrame with labeled columns:

In [None]:
import pandas as pd
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())

Raw word counts put too much weight on words that appear very frequently, which can be sub-optimal for some classification algorithms. One approach to fix this is term frequency-inverse document frequency (TF-IDF), which weights each word count by a measure of how rare the word is across the documents: a word that occurs in many documents receives a lower weight. This offset reduces the importance of really common words like "the" or "a". The syntax for computing these features is similar to the previous example:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer()
X = vec.fit_transform(sample)
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
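For readers who want to see the weighting spelled out, the following sketch recomputes TF-IDF by hand and compares it with `TfidfVectorizer`. It assumes the vectorizer's default settings (smoothed inverse document frequency followed by L2 normalization of each row) and uses only the first three phrases from the sample.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['problem of evil', 'evil queen', 'horizon problem']

# Term frequencies: raw word counts per document.
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Document frequency: number of documents containing each word.
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (the default variant assumed here).
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts, then scale each document vector to unit L2 norm.
tfidf = counts * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(tfidf, reference))
```

Words that occur in several documents (here "problem" and "evil") get a smaller `idf` than words that occur in only one, which is the down-weighting described above.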

### Example 2 (Continued)
We now build a pipeline that attaches the TF-IDF vectorizer to a multinomial naive Bayes classifier. First, import the required classes:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

### Task 2:
Create the pipeline and fit it to the training data.

In [None]:
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train.data, train.target)

Next, use the model to predict labels for the test data:

In [None]:
labels = model.predict(test.data)

Now that we have predicted the labels for the test data, we can evaluate them to learn about the performance of the estimator. For example, here is the confusion matrix between the true and predicted labels for the test data:

In [None]:
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(test.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=train.target_names, yticklabels=train.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');

Evidently, even this very simple classifier can successfully separate space talk from computer talk, but it gets confused between talk about religion and talk about Christianity. This is perhaps an expected area of confusion!
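To make the reading of a confusion matrix explicit without downloading the newsgroups data, here is a small sketch on made-up labels. The label arrays are hypothetical; only the scikit-learn calls mirror the cell above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical true and predicted labels for a three-class problem.
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 0, 2])

# Entry [i, j] counts samples whose true class is i and predicted class is j.
mat = confusion_matrix(y_true, y_pred)
print(mat)

# Overall accuracy is the diagonal mass of the matrix.
accuracy = accuracy_score(y_true, y_pred)
print("accuracy:", accuracy)

# Per-class recall: correct predictions divided by the number of true samples.
recall = mat.diagonal() / mat.sum(axis=1)
print("recall per class:", recall.round(3))
```

The heatmap above plots `mat.T`, so there the true labels run along the horizontal axis and the predicted labels along the vertical axis. Off-diagonal entries show which pairs of classes are confused.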

The very cool thing here is that we now have the tools to determine the category for any string, using the `predict()` method of this pipeline. Here's a quick utility function that will return the prediction for a single string:


In [None]:
def predict_category(s, train=train, model=model):
    pred = model.predict([s])
    return train.target_names[pred[0]]

In [None]:
predict_category('sending a payload to the ISS')

In [None]:
predict_category('determining the screen resolution')

## Conclusion:
This algorithm is quick to apply once the model is trained and often provides reasonable results, which can be used on their own or as a comparative benchmark for more advanced techniques. It has very few (if any) tunable parameters.

As the dimension of a dataset grows, it is much less likely for any two points to be found close together (after all, they must be close in every single dimension to be close overall). This means that clusters in high dimensions tend to be more separated, on average, than clusters in low dimensions, assuming the new dimensions actually add information. For this reason, simplistic classifiers like naive Bayes tend to work as well as or better than more complicated classifiers as the dimensionality grows: once you have enough data, even a simple model can be very powerful.

## Bayesian Network

A Bayesian network is a probabilistic model represented by a directed acyclic graph $G = (V, E)$, where the vertices are random variables $X_i$ and the edges encode conditional dependencies among them.
<img src="https://imgur.com/eL7d94r.png" style="width:300px;height:300px;">

In [None]:
import numpy as np

def X1_sample(p=0.35):
    return np.random.binomial(1, p)

def X2_sample(p=0.65):
    return np.random.binomial(1, p)

def X3_sample(x1, x2, p1=0.75, p2=0.4):
    if x1 == 1 and x2 == 1:
        return np.random.binomial(1, p1)
    else:
        return np.random.binomial(1, p2)

def X4_sample(x3, p1=0.65, p2=0.5):
    if x3 == 1:
        return np.random.binomial(1, p1)
    else:
        return np.random.binomial(1, p2)

In [None]:
Nsamples = 5000

# Frequency of each sampled configuration (x1, x2, x3, x4)
Fsamples = {}

for t in range(Nsamples):
    # Sample each variable after its parents, following the graph
    x1 = X1_sample()
    x2 = X2_sample()
    x3 = X3_sample(x1, x2)
    x4 = X4_sample(x3)

    sample = (x1, x2, x3, x4)

    # Count how often this configuration occurs
    Fsamples[sample] = Fsamples.get(sample, 0) + 1

When the sampling is complete, we can estimate the full joint probability from the relative frequencies:

In [None]:
samples = np.array(list(Fsamples.keys()), dtype=np.bool_)
probabilities = np.array(list(Fsamples.values()), dtype=np.float64) / Nsamples

for i in range(len(samples)):
    print('P{} = {}'.format(samples[i], probabilities[i]))

We can also query the model. For example, we could be interested in $P(X_4=True)$. We can estimate it by selecting all sampled configurations where $X_4=True$ and summing their estimated probabilities:

In [None]:
# Boolean mask selecting the configurations where X4 is True
p4t = samples[:, 3]
print(np.sum(probabilities[p4t]))
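Because the network is small, the sampled estimate can be compared with the exact value. The sketch below assumes the graph structure implied by the sampling functions above ($X_1$ and $X_2$ are parents of $X_3$, and $X_3$ is the parent of $X_4$) and reuses their default probabilities; the helper names `bernoulli` and `joint` are introduced here for illustration.

```python
from itertools import product

def bernoulli(p, x):
    """Probability that a Bernoulli(p) variable takes the value x (0 or 1)."""
    return p if x == 1 else 1.0 - p

def joint(x1, x2, x3, x4):
    """P(x1, x2, x3, x4) = P(x1) P(x2) P(x3 | x1, x2) P(x4 | x3)."""
    p3 = 0.75 if (x1 == 1 and x2 == 1) else 0.4
    p4 = 0.65 if x3 == 1 else 0.5
    return (bernoulli(0.35, x1) * bernoulli(0.65, x2)
            * bernoulli(p3, x3) * bernoulli(p4, x4))

# Exact joint distribution over all 16 configurations.
table = {s: joint(*s) for s in product([0, 1], repeat=4)}
total = sum(table.values())

# Marginalize: sum over every configuration with X4 = 1.
p_x4_true = sum(p for s, p in table.items() if s[3] == 1)

print("total probability:", round(total, 6))
print("exact P(X4 = True):", round(p_x4_true, 6))
```

With 5000 samples, the estimate printed by the previous cell should land close to this exact value, with the gap shrinking as `Nsamples` grows.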