# Bayes Classifier

### Introduction

We've now learned about Bayes' theorem and how it lets us calculate the probability of an event occurring.  The key idea is to start from our "prior", the base probability of the event, and update it as we incorporate new evidence.

$P(H|E)= \frac{P(H)*P(E|H)}{P(EH) + P(EH^c)} = \frac{P(H)*P(E|H)}{P(H)*P(E|H) + P(H^c)*P(E|H^c)} $
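As a quick illustration, the formula can be written as a small function. This is only a sketch: the name `bayes_posterior` and the input probabilities are hypothetical, chosen to show how the prior changes the result while the likelihoods stay fixed.

```python
def bayes_posterior(prior, likelihood, likelihood_given_not_h):
    """Return P(H|E) for a hypothesis H and its complement H^c."""
    numerator = prior * likelihood                      # P(H) * P(E|H)
    complement = (1 - prior) * likelihood_given_not_h   # P(H^c) * P(E|H^c)
    return numerator / (numerator + complement)

# Hypothetical inputs: same likelihoods, two different priors
even_prior = bayes_posterior(0.5, 0.8, 0.2)
low_prior = bayes_posterior(0.2, 0.8, 0.2)
print(round(even_prior, 3))  # 0.8
print(round(low_prior, 3))   # 0.5
```

With the same evidence, a lower prior pulls the posterior down from 0.8 to 0.5.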

In this lesson, we'll see how we can use Bayes' theorem to classify different observations.

### Considering the Evidence

For this lesson, we'll work with the iris dataset.  The iris dataset consists of observations regarding different types of iris plants.  An iris, apparently, is a kind of flower.

<img src="./iris.png" width="40%">

Our task will be to use Bayes' theorem to classify a flower as Setosa or not Setosa.

In [5]:
from sklearn import datasets
import pandas as pd
iris = datasets.load_iris()

X = pd.DataFrame(iris.data[:, :1], columns = iris['feature_names'][:1])  # we only take the first feature, sepal length
y = pd.Series(iris.target)

The target is represented by numbers ranging from 0 to 2, and we'll start by only using the feature of sepal length to characterize our flowers.  

> Refer to the image above to see what a sepal is.  It's like a flower petal, perhaps.

In [6]:
X[:4]

|   | sepal length (cm) |
|---|---|
| 0 | 5.1 |
| 1 | 4.9 |
| 2 | 4.7 |
| 3 | 4.6 |


In [7]:
iris['target_names']

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

So every time we see a target of 0, this is a setosa.
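Since our task is only Setosa versus not Setosa, the three-way target can be collapsed into a yes/no label. A minimal sketch in plain Python, where `sample_targets` is a made-up list of target values used purely for illustration:

```python
target_names = ['setosa', 'versicolor', 'virginica']  # same order as iris['target_names']
sample_targets = [0, 2, 1, 0]  # hypothetical target values, for illustration only

# Look up the species name for each numeric target
labels = [target_names[t] for t in sample_targets]
# A target of 0 means setosa; anything else is "not setosa"
is_setosa = [t == 0 for t in sample_targets]

print(labels)     # ['setosa', 'virginica', 'versicolor', 'setosa']
print(is_setosa)  # [True, False, False, True]
```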

### Applying Naive Bayes

Now let's think about how we can apply Bayes' theorem to classify our first observation as Setosa or not.

In [8]:
X[:1]

|   | sepal length (cm) |
|---|---|
| 0 | 5.1 |


Remember that our formula is the following:

$P(H|E)= \frac{P(H)*P(E|H)}{P(EH) + P(EH^c)} = \frac{P(H)*P(E|H)}{P(H)*P(E|H) + P(H^c)*P(E|H^c)} $

And remember that this allows us to calculate the probability of a hypothesis given some evidence, or here, the probability of a setosa given a sepal length value.

The numerator of our formula is the probability that the hypothesis is true in general, multiplied by the probability that we see the evidence when the hypothesis is true.

So here, we want to calculate the probability that a flower is of type Setosa (represented by 0), given its sepal length.  Let's start by filling in the values for the numerator, which are:

* The prior: $P(H)$
* The likelihood: $P(E|H)$

1. The **prior** is just the probability of Setosas in general:

In [9]:
(y == 0).mean()

0.3333333333333333

> Let's ignore that our data is not a random sample.

2. The **likelihood**, $P(E|H)$: assuming the hypothesis is true, what is the probability of seeing the evidence?

So to calculate the likelihood, we calculate the probability of seeing this evidence given that we are looking at Setosas.

> First, we reduce our observations to only the setosas.

In [10]:
setosa_df = X.loc[y[y == 0].index]

Then we calculate the probability of a setosa having a sepal length of around $5.1$.
> For now, we'll count the setosas with a sepal length around 5.1 (strictly between 4.9 and 5.2) and divide this by the number of setosas in general.

In [17]:
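# Count the setosas with a sepal length strictly between 4.9 and 5.2
near_5_1 = ((setosa_df.iloc[:, 0] < 5.2) & (setosa_df.iloc[:, 0] > 4.9)).sum()  # 16
n_setosa = len(setosa_df)  # 50

near_5_1 / n_setosa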

0.32
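The counting step above can be sketched as a small helper in plain Python. The name `fraction_in_window` is hypothetical, and the toy input is just the first four sepal lengths displayed earlier, so the result is not the full likelihood of 0.32.

```python
def fraction_in_window(values, low, high):
    """Share of values strictly between low and high."""
    hits = sum(1 for v in values if low < v < high)
    return hits / len(values)

# Toy sample: the first four sepal lengths shown earlier in the lesson
first_four = [5.1, 4.9, 4.7, 4.6]
toy_fraction = fraction_in_window(first_four, 4.9, 5.2)
print(toy_fraction)  # 0.25, since only 5.1 falls inside the window
```

Because the comparison is strict, a value of exactly 4.9 is excluded, which matches the `>` and `<` used in the cell above.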

Now going back to our formula, we see that we now have enough information to calculate the numerator.

$P(H|E)= \frac{P(H)*P(E|H)}{P(EH) + P(EH^c)} = \frac{P(H)*P(E|H)}{P(H)*P(E|H) + P(H^c)*P(E|H^c)} $

In [27]:
P_h = .33 
P_e_h = .32

P_h*P_e_h

0.10560000000000001

Finally, let's calculate the denominator.  Notice that we just calculated the first term in the denominator as $0.1056$. Our missing piece is to calculate $P(H^c)*P(E|H^c)$.

> So breaking this down, we see that:

1. $P(H^c) = 1 - P(H) = 1 - .33 = .67$

2. And we can use the same technique we saw above to find $P(E|H^c)$:

In [28]:
not_setosa_df = X.loc[y[y != 0].index]
((not_setosa_df.iloc[:, 0] < 5.2) & (not_setosa_df.iloc[:, 0] > 4.9)).mean()

0.03

So now we have:

$P(H|E)= \frac{P(H)*P(E|H)}{P(EH) + P(EH^c)} = \frac{P(H)*P(E|H)}{P(H)*P(E|H) + P(H^c)*P(E|H^c)} $

In [29]:
P_hc = .67
P_e_hc = .03

In [30]:
P_h*P_e_h/(P_h*P_e_h + P_hc*P_e_hc)

0.8400954653937948

So we find that the probability that the flower is a Setosa is about .84.
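Rounding the prior to .33 shifts the result slightly. As a rough check, the same calculation can be redone from raw counts. This sketch assumes 50 setosas and 100 other flowers, with 16 and 3 of them inside the window; the 3 is inferred from the 0.03 share computed above rather than printed directly in the lesson.

```python
n_setosa, n_other = 50, 100      # assumed class sizes (50 setosas, 100 others)
hits_setosa, hits_other = 16, 3  # flowers with sepal length between 4.9 and 5.2

prior = n_setosa / (n_setosa + n_other)    # P(H)
likelihood = hits_setosa / n_setosa        # P(E|H)
likelihood_not = hits_other / n_other      # P(E|H^c)

posterior = prior * likelihood / (prior * likelihood + (1 - prior) * likelihood_not)
share_of_window = hits_setosa / (hits_setosa + hits_other)

print(round(posterior, 4))        # 0.8421
print(round(share_of_window, 4))  # 0.8421
```

With unrounded inputs, the posterior is simply the share of flowers inside the window that are setosas: 16 out of 19.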

Let's see how well this lines up with sklearn's naive Bayes classifier.

In [18]:
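from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)  # X and y are now NumPy arrays
gnb = GaussianNB()
# Fit on sepal length only, then predict class probabilities for the first flower
y_pred = gnb.fit(X[:, :1], y).predict_proba(X[:1, :1])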

In [19]:
y_pred

array([[0.8190698 , 0.15212273, 0.02880746]])

So we see that our estimate of .84 is pretty close to sklearn's predicted probability of about .82 for Setosa.
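One likely reason for the small gap is that `GaussianNB` does not count observations in a window. It models the feature within each class as normally distributed and uses that density as the likelihood. A minimal sketch of such a density follows; the helper `gaussian_pdf` and the means and standard deviations are hypothetical placeholders, not values fitted to the iris data.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

# Hypothetical class parameters, for illustration only
density_a = gaussian_pdf(5.1, mean=5.0, std=0.4)  # class centered near the observation
density_b = gaussian_pdf(5.1, mean=6.0, std=0.6)  # class centered farther away

print(density_a > density_b)  # True
```

A class whose mean sits closer to the observed value assigns it a higher density, and that density plays the role of $P(E|H)$ in the formula.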

### Summary

In this lesson, we saw how we can use Bayes' theorem to calculate the probability of an event occurring given some evidence.  To do so, we used the formula:

$P(H|E)= \frac{P(H)*P(E|H)}{P(EH) + P(EH^c)} = \frac{P(H)*P(E|H)}{P(H)*P(E|H) + P(H^c)*P(E|H^c)} $

The key components were the prior and the likelihood.  We calculated our prior, $P(H)$, by finding the percentage of positive events in our training data.  We then calculated our likelihood, $P(E|H)$, by finding the percentage of Setosa observations with a feature value close to that of our observation.

In the next lesson, we'll review how to use Bayes' theorem to classify our data, and we'll update our procedure for finding the likelihood of our evidence.

### Resources

[Machine learning mastery](https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/)