# Classification using Bayes' Theorem

---

In this workbook we will use the Iris dataset to investigate how Bayesian methods can be used to predict the species of a flower.

First let's import the libraries we will need.

In [1]:
import numpy as np
import pandas as pd

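Then we load the data into a DataFrame and use the ceiling function to round the feature values up to the nearest integer, so that each measurement takes one of a small number of discrete values that we can count.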

In [2]:
# The CSV file has no header row, so we supply the column names ourselves.
column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
iris = pd.read_csv('data/iris.csv', header=None, names=column_names)
iris.head(3)

   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa


In [3]:
iris.loc[:, 'sepal_length':'petal_width'] = iris.loc[:, 'sepal_length':'petal_width'].apply(np.ceil)
iris.head(3)

   sepal_length  sepal_width  petal_length  petal_width      species
0           6.0          4.0           2.0          1.0  Iris-setosa
1           5.0          3.0           2.0          1.0  Iris-setosa
2           5.0          4.0           2.0          1.0  Iris-setosa
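
As a quick check of what the ceiling function does, here is a minimal sketch that applies np.ceil to the first raw row shown earlier (5.1, 3.5, 1.4, 0.2). The variable names row and rounded are only illustrative.

```python
import numpy as np

# First row of the raw data: sepal length, sepal width, petal length, petal width
row = np.array([5.1, 3.5, 1.4, 0.2])

# np.ceil rounds every value up to the next whole number
rounded = np.ceil(row)
print(rounded)  # [6. 4. 2. 1.]

# Values that are already whole numbers are left unchanged
print(np.ceil(3.0))  # 3.0
```

Note that rounding up this coarsely merges many distinct flowers into the same bucket, which is what makes counting exact matches possible.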


Let's now say I have a flower with the following dimensions.

* Sepal length = 7
* Sepal width = 3
* Petal length = 5
* Petal width = 2

How can I predict the species? Well, I can filter the DataFrame down to the rows with exactly these measurements and apply the value_counts() method to the species column. This tells me how many flowers of each species match.

In [4]:
iris[(iris.sepal_length == 7)
   & (iris.sepal_width == 3)
   & (iris.petal_length == 5)
   & (iris.petal_width == 2)].species.value_counts()

Iris-versicolor    13
Iris-virginica      4
Name: species, dtype: int64
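
The same filter can be wrapped in a small helper so that other measurements can be tried without rewriting the mask. This is only a sketch: count_by_species is a hypothetical name, and the toy DataFrame below is made up for illustration and is not the Iris data.

```python
import pandas as pd

def count_by_species(df, **measurements):
    """Count rows per species whose columns equal the given measurements."""
    # Start with every row selected, then narrow down one column at a time
    mask = pd.Series(True, index=df.index)
    for column, value in measurements.items():
        mask &= df[column] == value
    return df.loc[mask, 'species'].value_counts()

# Toy rows for illustration only (NOT the Iris data)
toy = pd.DataFrame({
    'sepal_length': [7, 7, 6, 7],
    'petal_width': [2, 2, 2, 1],
    'species': ['a', 'b', 'a', 'a'],
})

counts = count_by_species(toy, sepal_length=7, petal_width=2)
print(counts)
```

With the real data, the call would presumably look like count_by_species(iris, sepal_length=7, sepal_width=3, petal_length=5, petal_width=2).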

We can get the total number from each species.

In [5]:
iris.species.value_counts()

Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: species, dtype: int64

Given that we are using Bayes' Theorem, let's define this as a conditional probability problem. That is, what is the probability of a particular species given that a sample has measurements of 7, 3, 5 and 2?

$$P(Species|7352)$$

Just to remind ourselves, Bayes' Theorem states that

$$P(Prediction|Observation) = \frac{P(Observation|Prediction) \cdot P(Prediction)}{P(Observation)}$$

where
* $P(Prediction|Observation)$ is the probability of the Prediction after we see the Observation - this is the posterior, or conditional, probability.
* $P(Observation|Prediction)$ is the probability of the Observation given the Prediction - this is the likelihood.
* $P(Prediction)$ is the probability of the Prediction before we make the Observation - this is the prior.
* $P(Observation)$ is the probability of the Observation under any Prediction - this is the marginal, or non-conditional, probability, also known as the normalising constant.
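
The four terms map directly onto a one-line function. The sketch below is a minimal illustration rather than a library routine: posterior is a hypothetical name, and the numbers in the example call are arbitrary.

```python
def posterior(likelihood, prior, evidence):
    """Return P(Prediction|Observation) from the three terms of Bayes' Theorem."""
    if evidence == 0:
        # The observation never occurs, so the posterior is undefined
        raise ValueError("P(Observation) is zero; the posterior is undefined")
    return likelihood * prior / evidence

# Arbitrary numbers chosen only to illustrate the call
example = posterior(likelihood=0.5, prior=0.2, evidence=0.25)
print(example)
```

The guard on the evidence term matters: the normalising constant is the probability of the Observation itself, so it is the same for every Prediction and must be non-zero for the posterior to exist.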

We'll start with the Versicolor species.

$$P(Versicolor|7352) = \frac{P(7352|Versicolor) \cdot P(Versicolor)}{P(7352)}$$

$$P(7352|Versicolor) = \frac{13}{50} = 0.26$$

$$P(Versicolor) = \frac{50}{150} \approx 0.33$$

$$P(7352) = \frac{13+4}{150} \approx 0.11$$
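
The same denominator follows from the law of total probability, which sums the likelihood times the prior over all three species. As a short check using the counts above (Setosa has no matching flowers, so its term is zero):

$$P(7352) = \sum_{s} P(7352|s) \cdot P(s) = \frac{13}{50} \cdot \frac{50}{150} + \frac{4}{50} \cdot \frac{50}{150} + \frac{0}{50} \cdot \frac{50}{150} = \frac{17}{150} \approx 0.11$$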

Therefore the probability of a Versicolor flower given measurements of 7, 3, 5 and 2 is

$$P(Versicolor|7352) = \frac{0.26 \times 0.33}{0.11}$$

In [6]:
print('P(Versicolor|7352) =', round(0.26*0.33/0.11, 2))

P(Versicolor|7352) = 0.78
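
The figure of 0.78 is computed from inputs that were already rounded to two decimal places. As a cross-check, the sketch below repeats the calculation with exact fractions from Python's standard fractions module, using the counts shown earlier (13 matching Versicolor, 4 matching Virginica, 50 flowers per species, 150 in total).

```python
from fractions import Fraction

likelihood = Fraction(13, 50)      # P(7352|Versicolor)
prior = Fraction(50, 150)          # P(Versicolor)
evidence = Fraction(13 + 4, 150)   # P(7352)

exact = likelihood * prior / evidence
print(exact)                    # 13/17
print(round(float(exact), 2))   # 0.76
```

The exact posterior is 13/17, or roughly 0.76, so the rounded inputs overstate it slightly. The ranking of the species is unaffected.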


Let's now repeat for Setosa and Virginica.

$$P(Setosa|7352) = \frac{0 \times 0.33}{0.11} = 0$$

$$P(Virginica|7352) = \frac{0.08 \times 0.33}{0.11}$$

In [7]:
print('P(Setosa|7352) =', 0)
print('P(Virginica|7352) =', round(0.08*0.33/0.11, 2))

P(Setosa|7352) = 0
P(Virginica|7352) = 0.24
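
Because every species has the same prior here, each posterior reduces to that species' share of the matching flowers. The sketch below computes all three at once from the matching counts and checks that they sum to one. The Setosa count of 0 is inferred from its absence in the value_counts() output, and the names matches, posteriors and best are only illustrative.

```python
from fractions import Fraction

# Matching counts per species for measurements 7, 3, 5, 2
matches = {'Iris-setosa': 0, 'Iris-versicolor': 13, 'Iris-virginica': 4}
total_matches = sum(matches.values())

# With equal priors, P(species|7352) = matching count / total matching count
posteriors = {s: Fraction(n, total_matches) for s, n in matches.items()}
for species, p in posteriors.items():
    print(species, p, round(float(p), 2))

# The predicted species is the one with the largest posterior
best = max(posteriors, key=posteriors.get)
print('Prediction:', best)
```

Working with exact fractions gives 13/17 and 4/17, which sum to exactly one, whereas the rounded figures 0.78 and 0.24 add up to slightly more.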


---

We conclude that the Iris is most likely a Versicolor, since this species has the highest posterior probability.