# Naive Bayesian Classifier

The naive Bayesian classifier is a simple probabilistic classifier that applies Bayes' theorem under the "naive" assumption that the features are independent of one another given the class.

## Bayesian Inference 

**Bayes Theorem**

At the core of this classifier is Bayes' theorem, which holds for any two events $A$ and $B$ with $P(B) \neq 0$ (the events do not need to be independent):

\begin{align}
P(A|B) = \frac{P(B|A) P(A)}{P(B)}
\end{align}

- $P(A|B)$ is the conditional probability of event $A$ given that event $B$ is true.
- $P(B|A)$ is the conditional probability of event $B$ given that event $A$ is true.
- $P(A)$ and $P(B)$ are the marginal probabilities of events $A$ and $B$, each considered without reference to the other.
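
As a quick numeric check of the formula above, here is a minimal sketch in Python. The function name `bayes_posterior` and the three probabilities are made up for illustration; they do not come from any dataset.

```python
def bayes_posterior(p_b_given_a, p_a, p_b):
    """Return P(A|B) computed from P(B|A), P(A) and P(B)."""
    if p_b == 0:
        raise ValueError("P(B) must be non-zero")
    return p_b_given_a * p_a / p_b


# Hypothetical values, chosen only for illustration.
p_a = 0.3          # P(A)
p_b = 0.5          # P(B)
p_b_given_a = 0.8  # P(B|A)

p_a_given_b = bayes_posterior(p_b_given_a, p_a, p_b)
print(f"P(A|B) = {p_a_given_b:.2f}")
```

With these numbers, learning that $B$ is true raises the probability of $A$ from 0.3 to 0.48.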

**Bayesian Interpretation**

For our naive Bayes algorithm we'll use the Bayesian interpretation of the theorem, in which a probability expresses a "degree of belief". Given a proposition $A$ and evidence $B$:

- $P(A)$, the *prior*, is the initial degree of belief in $A$.
- $P(A|B)$, the *posterior*, is the degree of belief in $A$ after accounting for the evidence $B$.
- $\frac{P(B|A)}{P(B)}$ represents the support $B$ provides for $A$.

**Alternative Form**

\begin{align}
P(A|B) = \frac{P(B|A) P(A)}{P(B|A) P(A) + P(B|\neg A) P(\neg A)}
\end{align}

- $P(A)$ is the prior, the initial degree of belief in proposition $A$.
- $P(\neg A) = 1 - P(A)$ is the degree of belief *against* proposition $A$.
- $P(B|A)$ is the degree of belief in evidence $B$ given $A$ is true.
- $P(B|\neg A)$ is the degree of belief in evidence $B$ given $A$ is false.
- $P(A|B)$ is the degree of belief in proposition $A$ given the evidence $B$.
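
The alternative form is convenient when $P(B)$ is not known directly, because the denominator can be built from the two conditional probabilities and the prior. The sketch below illustrates this with invented numbers (a rare proposition and fairly reliable evidence); the function name `posterior_from_likelihoods` is an assumption made for this example.

```python
def posterior_from_likelihoods(p_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B), expanding P(B) over the cases 'A true' and 'A false'."""
    p_not_a = 1.0 - p_a
    # Denominator: total probability of observing the evidence B.
    evidence = p_b_given_a * p_a + p_b_given_not_a * p_not_a
    if evidence == 0:
        raise ValueError("the evidence has zero probability")
    return p_b_given_a * p_a / evidence


# Hypothetical values, chosen only for illustration.
prior = 0.01            # P(A): the proposition is rare
p_b_given_a = 0.95      # P(B | A is true)
p_b_given_not_a = 0.05  # P(B | A is false)

posterior = posterior_from_likelihoods(prior, p_b_given_a, p_b_given_not_a)
print(f"P(A|B) = {posterior:.3f}")
```

Even with strong evidence the posterior stays well below one half here, because the small prior keeps the numerator small relative to the evidence produced when $A$ is false.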

**Bayesian Inference**

We can use Bayes' theorem to update the probability (our degree of belief in a proposition) as more evidence becomes available.
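
One way to picture this updating is to feed each posterior back in as the next prior. The following is a minimal sketch, assuming that the pieces of evidence are independent of one another once we know whether $A$ is true; the function name `update` and all probabilities are hypothetical.

```python
def update(prior, p_e_given_a, p_e_given_not_a):
    """One Bayesian update: return P(A | evidence) from the current prior."""
    numerator = p_e_given_a * prior
    return numerator / (numerator + p_e_given_not_a * (1.0 - prior))


belief = 0.5  # initial degree of belief in A (hypothetical)

# Each pair is (P(evidence | A true), P(evidence | A false)); values are invented.
observations = [(0.8, 0.4), (0.7, 0.3), (0.9, 0.6)]

history = [belief]
for p_e_a, p_e_not_a in observations:
    # The posterior after this observation becomes the prior for the next one.
    belief = update(belief, p_e_a, p_e_not_a)
    history.append(belief)

print([round(b, 3) for b in history])
```

Every observation in this example is more likely when $A$ is true than when it is false, so the belief in $A$ rises at each step.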

Suppose we have $n$ input features $x_1,...,x_n$ and a class label $y$ that we want to predict. We want to compute the posterior:

\begin{align}
P(y|x_1,...,x_n) = \frac{P(y) P(x_1,...,x_n|y)}{P(x_1,...,x_n)}
\end{align}

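Recall that if two events $A$ and $B$ are independent then $P(B|A) = P(B)$, so their joint probability is the product of the marginals: $P(A,B) = P(A)P(B|A) = P(A)P(B)$. The naive assumption applies the same idea conditionally on the class: once $y$ is known, the other features tell us nothing more about $x_i$.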

Thus we can expand the numerator with the chain rule and then simplify each factor:

\begin{align}
P(y|x_1,...,x_n) & = \frac{P(y) P(x_1,...,x_n|y)}{P(x_1,...,x_n)} \\
& = \frac{P(x_1,...,x_n, y)}{P(x_1,...,x_n)} \\
& = \frac{P(x_1|x_2,...,x_n, y)P(x_2,...,x_n, y)}{P(x_1,...,x_n)} \quad \text{[note 1]} \\
& = \frac{P(x_1|x_2,...,x_n, y)P(x_2|x_3,...,x_n, y)P(x_3,...,x_n, y)}{P(x_1,...,x_n)} \\
& = \frac{P(x_1|x_2,...,x_n, y)P(x_2|x_3,...,x_n, y)...P(x_{n-1}|x_n, y)P(x_n|y)P(y)}{P(x_1,...,x_n)} \\
& = \frac{P(y) \prod_{i=1}^n P(x_i | y)}{P(x_1,...,x_n)} \quad \text{[note 2]}
\end{align}

- [note 1] Chain rule.
- [note 2] Naive (conditional independence) assumption: $P(x_i|x_{i+1},...,x_n, y) = P(x_i|y)$.

The denominator $P(x_1,...,x_n)$ does not depend on $y$, so it is a constant for any given input. Keeping only the numerator gives the following relationship:

\begin{align}
P(y|x_1,...,x_n) \propto P(y) \prod_{i=1}^n P(x_i | y)
\end{align}

We can then finalize the formal classifier as follows:

\begin{align}
\hat{y} = \underset{y}{\operatorname{arg\,max}} \; P(y) \prod_{i=1}^n P(x_i | y)
\end{align}

This is the [maximum a posteriori (MAP)](https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation) decision rule: we predict the class with the highest posterior probability.
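
The decision rule can be sketched in a few lines of Python. The example below assumes a toy model with two classes and two binary features whose probability tables are invented for illustration; the names `priors`, `likelihoods`, `log_score`, `predict` and `posterior` are likewise hypothetical. It sums logarithms rather than multiplying probabilities, a common way to avoid numerical underflow when there are many features.

```python
import math

# Hypothetical model: P(y) for two classes.
priors = {"pos": 0.4, "neg": 0.6}

# likelihoods[y][i][value] = P(x_i = value | y) for two binary features.
likelihoods = {
    "pos": [{0: 0.2, 1: 0.8}, {0: 0.7, 1: 0.3}],
    "neg": [{0: 0.9, 1: 0.1}, {0: 0.4, 1: 0.6}],
}


def log_score(x, y):
    """log P(y) + sum_i log P(x_i | y): the log of the unnormalised posterior."""
    return math.log(priors[y]) + sum(
        math.log(likelihoods[y][i][value]) for i, value in enumerate(x)
    )


def predict(x):
    """MAP rule: return the class with the highest score."""
    return max(priors, key=lambda y: log_score(x, y))


def posterior(x):
    """Normalise the scores so that they sum to one over the classes."""
    scores = {y: math.exp(log_score(x, y)) for y in priors}
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}


x = (1, 0)
print(predict(x), posterior(x))
```

Normalising is only needed when the posterior probabilities themselves are of interest; the predicted class is the same with or without it.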

## Naive Bayes

We can identify three types of Naive Bayes models we'll be building in this notebook:

1. **Gaussian**. Used for continuous features, which are assumed to follow a normal distribution within each class.
2. **Multinomial**. Used for discrete counts, e.g. word counts in text classification.
3. **Bernoulli**. Also known as the binomial model. Useful if the features are binary.


### Gaussian

For this variation we assume the likelihood to be Gaussian:

\begin{align}
P(x_i | y) = \frac{1}{\sqrt{2 \pi \sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2 \sigma_y^2}\right)
\end{align}
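
To make the formula concrete, the sketch below evaluates it for a single feature using only the standard library. It assumes the mean and variance are estimated separately for each class by maximum likelihood; the helper names `gaussian_likelihood` and `fit_feature` and the sample values are invented for this example. Library implementations may differ in detail (scikit-learn's `GaussianNB`, for instance, has a `var_smoothing` parameter that this sketch omits).

```python
import math


def gaussian_likelihood(x, mu, var):
    """Density of a normal distribution with mean mu and variance var at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_feature(values):
    """Maximum-likelihood mean and variance of one feature within one class."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, var


# Hypothetical measurements of one feature for two classes.
samples = {"a": [4.9, 5.1, 5.0, 5.2], "b": [6.3, 6.5, 6.1, 6.7]}
params = {y: fit_feature(values) for y, values in samples.items()}

x_new = 5.3
likelihood = {y: gaussian_likelihood(x_new, mu, var) for y, (mu, var) in params.items()}
print(likelihood)
```

The new value lies much closer to the mean of class `a`, so its likelihood under `a` is far larger than under `b`.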

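    from sklearn import datasets
    from sklearn.naive_bayes import GaussianNB

    # Load the iris dataset: 150 samples, 4 continuous features, 3 classes.
    iris = datasets.load_iris()

    # Fit on the full dataset and predict on the same data,
    # so the count below is a training error, not a held-out estimate.
    gnb = GaussianNB()
    y_pred = gnb.fit(iris.data, iris.target).predict(iris.data)
    print("Number of mislabeled points out of a total %d points : %d"
          % (iris.data.shape[0], (iris.target != y_pred).sum()))

Output:

    Number of mislabeled points out of a total 150 points : 6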


### Multinomial

### Bernoulli