# Naive Bayes Multiple Features

### Introduction

In the last lesson, we saw how we can use Bayes' theorem to predict whether observations fall into one class or another.  We did so using the formula:

$P(H|E) = \frac{P(H)P(E|H)}{P(EH) + P(EH^c)} = \frac{P(H)P(E|H)}{P(H)P(E|H) + P(H^c)P(E|H^c)}$

Here, the key components in making our predictions were the prior, $P(H)$, and the likelihood, $P(E|H)$.  We also saw how to view this as a machine learning problem: given a certain feature value, what is the probability of a target?

In this lesson, we'll explore this further, and consider what it means to have multiple features in predicting a target.

### Review

Before moving on to multiple features, let's review what our formula is saying in its simplest form:

$P(H|E) = \frac{P(HE)}{P(E)}$

The above formula states that the probability of the hypothesis given the evidence, equals the probability of the hypothesis and evidence being true divided by the probability that just the evidence is true.  And then we expand our numerator using the chain rule:

$P(H|E) = \frac{P(HE)}{P(E)} = \frac{P(H)P(E|H)}{P(E)}$

So focusing on the numerator, we found that the probability of the hypothesis and the evidence being true equals the probability of the hypothesis, times the probability of the evidence given the hypothesis.
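As a quick numeric check of the formula above, here is a minimal sketch in Python. The prior and likelihood values are hypothetical, chosen only for illustration, and the function name `posterior` is our own rather than anything from a library.

```python
def posterior(prior, likelihood, likelihood_complement):
    """Return P(H|E) given P(H), P(E|H) and P(E|H^c)."""
    numerator = prior * likelihood  # P(H) * P(E|H), which equals P(HE)
    evidence = numerator + (1 - prior) * likelihood_complement  # P(E)
    return numerator / evidence

# Hypothetical values, not taken from the iris data
p_h_given_e = posterior(prior=0.3, likelihood=0.8, likelihood_complement=0.2)
print(round(p_h_given_e, 3))  # 0.632
```

The denominator is just the numerator plus the matching term for the complement, which is the same pattern we rely on when we add more features.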

### Adding another feature

Now let's say that we use two features to find the probability of an outcome.  For example, let's return to the example of our iris flowers.  We'll use both sepal length **and sepal width** to predict if a flower is a Setosa or not.

<img src="./iris.png" width="40%">

We'll find the probability a flower is a Setosa by finding $P(H|E,F)$, where $E$ represents a sepal length value, and $F$ represents a sepal width.  So, to calculate the probability $P(H|E, F)$, the probability of a setosa given both features, we use the following formula:  

$P(H|E, F) = \frac{P(HEF)}{P(EF)} = \frac{P(H)P(E|H)P(F|H,E)}{P(EF)}$

So focusing on the numerator, we see that the probability of a setosa occurring together with the features $E$ and $F$ is $P(H)$ times $P(E|H)$ times $P(F|H, E)$.

For example, if $E$ represents a sepal length of 5, and $F$ a sepal width of 3, then we would be calculating: 

$P(H|E, F) = \frac{P(HEF)}{P(EF)} = \frac{P(H)P(E|H)P(F|H,E)}{P(EF)} = \frac{P(setosa)P(l = 5|setosa)P(w = 3|setosa, l = 5)}{P(EF)}$

> Focusing on the last term in the numerator, notice that it lets the probability of $width = 3$ depend both on the flower being a setosa and on the length being 5.  This is just the chain rule all over again. 
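To see why that last term conditions on the length, we can check how strongly sepal length and sepal width move together among setosas. The sketch below is one way to do that, using the scikit-learn copy of the iris dataset that we load later in this lesson; the variable name `length_width_corr` is our own.

```python
from sklearn import datasets
import pandas as pd

iris = datasets.load_iris()
X = pd.DataFrame(iris.data[:, :2], columns=iris['feature_names'][:2])
y = pd.Series(iris.target)

# Keep only the setosa rows (class 0), then correlate length with width
setosa_only = X[y == 0]
length_width_corr = setosa_only.corr().iloc[0, 1]
print(length_width_corr)
```

A clearly positive correlation means that, for a setosa, knowing the length does change what widths we should expect.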

**But**, in practice, with *naive* Bayes we assume that the feature values are independent of one another once we know the class, changing our formula to: 

$P(H|E, F) = \frac{P(HEF)}{P(EF)} = \frac{P(H)P(E|H)P(F|H)}{P(EF)}$

As we add more and more features, assuming the features are independent of one another makes things easier: each feature needs only its own distribution per class, rather than a joint distribution over every combination of feature values.  And in terms of prediction, it works well in practice.  

> This assumption of independence is what makes naive Bayes naive.
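With the independence assumption, the numerator becomes the prior times one likelihood per feature, however many features we have. A minimal sketch of that idea follows. The function names and the likelihood values are hypothetical, used only to illustrate the pattern.

```python
import math

def naive_numerator(prior, likelihoods):
    """P(H) times the product of P(feature_i | H) over all features."""
    return prior * math.prod(likelihoods)

def naive_posterior(prior, likelihoods, likelihoods_complement):
    """P(H | all features), assuming independence given the class."""
    numerator = naive_numerator(prior, likelihoods)
    complement = naive_numerator(1 - prior, likelihoods_complement)
    return numerator / (numerator + complement)

# Hypothetical likelihoods for three features under H and under H^c
p = naive_posterior(0.5, [0.9, 0.8, 0.7], [0.2, 0.3, 0.4])
print(p)
```

Adding a fourth feature only means appending one more value to each list, which is why the assumption scales so easily.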

### Seeing an Example

Now let's see an example of Naive Bayes with multiple features working with our iris dataset.

In [21]:
from sklearn import datasets
import pandas as pd
iris = datasets.load_iris()

X = pd.DataFrame(iris.data[:, :2], columns = iris['feature_names'][:2])  # we only take the first two features.
y = pd.Series(iris.target)

setosa_df = X.loc[y[y == 0].index]
not_setosa_df = X.loc[y[y != 0].index]

Above, we loaded our data and separated it into setosas and non-setosas.  We selected the features of sepal length and sepal width for both groups.

In [6]:
setosa_df[:2], not_setosa_df[:2]

(   sepal length (cm)  sepal width (cm)
 0                5.1               3.5
 1                4.9               3.0,
     sepal length (cm)  sepal width (cm)
 50                7.0               3.2
 51                6.4               3.2)

Now given an observation, we want to calculate something like the following:

$P(S|L, W) = \frac{P(setosa)P(l = 5.1|setosa)P(w = 3.5|setosa, l = 5.1)}{P(LW)}$

We know that $P(setosa) = .33$, since 50 of the 150 flowers are setosas.  Next, to fill out the numerator, we need the probability of a specific length and width assuming we have a setosa.  To do that, we model the setosa lengths and widths as normal distributions and find the parameters (mean and standard deviation) of each.

In [8]:
import scipy.stats  # import the stats submodule explicitly so scipy.stats.norm is available
setosa_length_mean, setosa_width_mean = setosa_df.mean()
setosa_length_std, setosa_width_std = setosa_df.std()

Then we can initialize the distributions.

In [9]:
rvs_setosa_length = scipy.stats.norm(setosa_length_mean, setosa_length_std)
rvs_setosa_width = scipy.stats.norm(setosa_width_mean, setosa_width_std)

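So now we can solve the numerator in our equation:

$P(S|L, W) = \frac{P(setosa)P(l = 5.1|setosa)P(w = 3.5|setosa, l = 5.1)}{P(LW)}$

Applying the naive assumption, we replace $P(w = 3.5|setosa, l = 5.1)$ with $P(w = 3.5|setosa)$, which gives the following: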

In [11]:
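p_setosa = .33  # prior: 50 of the 150 flowers are setosas
p_set_len_five = rvs_setosa_length.pdf(5.1)  # density of length 5.1 among setosas
p_set_width_three = rvs_setosa_width.pdf(3.5)  # density of width 3.5 among setosas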

In [12]:
p_setosa*p_set_len_five*p_set_width_three

0.37256154850415774

And then we find the denominator by expanding it to:

$P(S|L, W) = \frac{P(setosa)P(l = 5.1|setosa)P(w = 3.5|setosa, l = 5.1)}{P(l, w, setosa) + P(l, w, setosa^c)}$

The first term in the denominator is the numerator itself, which we already found to be about .372.  So now we just need to find the second term in the denominator.

> This second term expands to: $P(setosa^c)P(l = 5.1|setosa^c)P(w = 3.5|setosa^c, l = 5.1)$, which the naive assumption simplifies to $P(setosa^c)P(l = 5.1|setosa^c)P(w = 3.5|setosa^c)$.

So we can essentially repeat the steps, this time finding the probabilities assuming we do not have a setosa. 

In [14]:
not_setosa_length_mean, not_setosa_width_mean = not_setosa_df.mean()
not_setosa_length_std, not_setosa_width_std = not_setosa_df.std()

rvs_not_setosa_length = scipy.stats.norm(not_setosa_length_mean, not_setosa_length_std)
rvs_not_setosa_width = scipy.stats.norm(not_setosa_width_mean, not_setosa_width_std)

In [15]:
p_not_setosa = .67
p_not_set_len_five = rvs_not_setosa_length.pdf(5.1)
p_not_set_width_three = rvs_not_setosa_width.pdf(3.5)

p_not_setosa*p_not_set_len_five*p_not_set_width_three

0.017521108897192

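Finally, we divide the numerator by the sum of the two terms, following our naive Bayes formula:

$P(H|E, F) = \frac{P(HEF)}{P(EF)} = \frac{P(H)P(E|H)P(F|H)}{P(EF)}$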

In [18]:
p_set_l_w = p_setosa*p_set_len_five*p_set_width_three
p_l_w = (p_setosa*p_set_len_five*p_set_width_three + p_not_setosa*p_not_set_len_five*p_not_set_width_three)

p_set_l_w/p_l_w

0.9550835994250193
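The steps above can be collected into one helper so that we can score any observation. This is a sketch under the same assumptions as the walkthrough (normal likelihoods, features independent given the class); the function name `setosa_posterior` is our own, and the priors are computed from the class counts rather than rounded to .33 and .67.

```python
from sklearn import datasets
import pandas as pd
from scipy import stats

iris = datasets.load_iris()
X = pd.DataFrame(iris.data[:, :2], columns=iris['feature_names'][:2])
is_setosa = pd.Series(iris.target == 0)

def setosa_posterior(length, width):
    """Naive Bayes estimate of P(setosa | length, width)."""
    joint = {}
    for label in (True, False):
        group = X[is_setosa == label]
        prior = len(group) / len(X)
        # One normal distribution per feature, fit on this group only
        dist = stats.norm(group.mean().to_numpy(), group.std().to_numpy())
        densities = dist.pdf([length, width])
        joint[label] = prior * densities.prod()
    return joint[True] / (joint[True] + joint[False])

print(setosa_posterior(5.1, 3.5))  # the first setosa in the dataset
print(setosa_posterior(7.0, 3.2))  # the first non-setosa in the dataset
```

The first call should land close to the value we computed by hand, and the second should be close to zero.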

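And we can see that we get similar results if we fit scikit-learn's Gaussian naive Bayes classifier with the same two features.  The values are not identical, in part because the classifier below is fit on all three species in `y` rather than on setosa versus not setosa, and because it estimates each variance with the population formula, while the pandas `std` method we used divides by $n - 1$.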

In [28]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X, y)
gnb.predict_proba(X[:1])[:, 0]

array([0.9753393])
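To check that the manual calculation and the library follow the same logic, we can put them on equal footing. The sketch below is one way to do that, and it makes two assumptions that differ from the walkthrough: the classifier is fit on a two-class target (setosa or not), and the manual version uses NumPy's default population standard deviation. The variable names are our own.

```python
from scipy import stats
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
X = iris.data[:, :2]                       # sepal length, sepal width
y_binary = (iris.target == 0).astype(int)  # 1 = setosa, 0 = not setosa

# scikit-learn, fit on the two-class target
gnb = GaussianNB().fit(X, y_binary)
sklearn_prob = gnb.predict_proba(X[:1])[0, 1]

# Manual calculation; NumPy's std defaults to the population formula
joint = []
for label in (0, 1):
    group = X[y_binary == label]
    prior = len(group) / len(X)
    dist = stats.norm(group.mean(axis=0), group.std(axis=0))
    joint.append(prior * dist.pdf(X[0]).prod())
manual_prob = joint[1] / (joint[0] + joint[1])

print(sklearn_prob, manual_prob)
```

Under these assumptions the two numbers should agree to several decimal places, apart from the tiny variance smoothing term that `GaussianNB` adds.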

### Summary

In this lesson, we learned about what makes the naive bayes formula naive.  The central idea is:

1. Statistically (and logically), we should not consider our features as being independent of one another.  After all, knowing the value of one feature like sepal length changes the probability of observing another feature like sepal width.

2. But, in practice we can more easily train our classifier if we assume the features are independent given the class.  And assuming independence still leads to a fairly accurate classifier.

So this assumption of independence leads to the following formula:

$P(H|E, F) = \frac{P(HEF)}{P(EF)} = \frac{P(H)P(E|H)P(F|H)}{P(EF)}$

And applied to our example of the setosa, it is the following:

$P(S|L, W) = \frac{P(setosa)P(l = 5.1|setosa)P(w = 3.5|setosa)}{P(l, w, setosa) + P(l, w, setosa^c)}$