# Applying Bayes' theorem to iris classification

Let's see if **Bayes' theorem** might be able to help us solve a **classification task**, namely predicting the species of an iris!

## Preparing the data

We'll load the iris data into a DataFrame, and **round up** all of the measurements to the next integer:

In [4]:
from sklearn.datasets import load_iris
import numpy as np
import pandas as pd

In [5]:
# load the iris data
iris = load_iris()

# round up the measurements
X = np.ceil(iris.data)

# clean up column names
col_names = [name[:-5].replace(' ', '_') for name in iris.feature_names]

# read into pandas
df = pd.DataFrame(X, columns=col_names)

# create a list of species using iris.target and iris.target_names
species = [iris.target_names[num] for num in iris.target]

# add the species list as a new DataFrame column
df['species'] = species

In [6]:
# print the head
df.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 6.0 | 4.0 | 2.0 | 1.0 | setosa |
| 1 | 5.0 | 3.0 | 2.0 | 1.0 | setosa |
| 2 | 5.0 | 4.0 | 2.0 | 1.0 | setosa |
| 3 | 5.0 | 4.0 | 2.0 | 1.0 | setosa |
| 4 | 5.0 | 4.0 | 2.0 | 1.0 | setosa |


## Deciding how to make a prediction

Let's say that I had an **out-of-sample observation** with the following measurements: 
+ **sepal_length=7**
+ **sepal_width=3**
+ **petal_length=5**
+ **petal_width=2**

I want to predict the species of this iris. How might I do that?

We'll first examine all observations in the **training data** with those measurements:

In [None]:
# For a first step, how would you display all the rows in the data frame
# where the sepal_length is 7?

In [None]:
# How would you display all the rows in the data frame
# where the sepal_length is 7 and the sepal_width is 3?
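
One possible way to approach the two filtering exercises above is boolean masking. The sketch below rebuilds the rounded DataFrame so that it runs on its own; the names `length_7` and `length_7_width_3` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the rounded-up DataFrame from "Preparing the data"
iris = load_iris()
col_names = [name[:-5].replace(' ', '_') for name in iris.feature_names]
df = pd.DataFrame(np.ceil(iris.data), columns=col_names)
df['species'] = [iris.target_names[num] for num in iris.target]

# One condition: the comparison yields a boolean Series used as a row mask
length_7 = df[df.sepal_length == 7]

# Two conditions: combine the masks with & and parenthesize each comparison
length_7_width_3 = df[(df.sepal_length == 7) & (df.sepal_width == 3)]

print(length_7.head())
print(length_7_width_3.head())
```

The parentheses matter because `&` binds more tightly than `==` in Python.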

In [7]:
# OK, now show all the observations with features: 7, 3, 5, 2
df[(df.sepal_length==7) & (df.sepal_width==3) & (df.petal_length==5) & (df.petal_width==2)]

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 54 | 7.0 | 3.0 | 5.0 | 2.0 | versicolor |
| 58 | 7.0 | 3.0 | 5.0 | 2.0 | versicolor |
| 63 | 7.0 | 3.0 | 5.0 | 2.0 | versicolor |
| 68 | 7.0 | 3.0 | 5.0 | 2.0 | versicolor |
| 72 | 7.0 | 3.0 | 5.0 | 2.0 | versicolor |
| 73 | 7.0 | 3.0 | 5.0 | 2.0 | versicolor |
| 74 | 7.0 | 3.0 | 5.0 | 2.0 | versicolor |
| 75 | 7.0 | 3.0 | 5.0 | 2.0 | versicolor |
| 76 | 7.0 | 3.0 | 5.0 | 2.0 | versicolor |
| 77 | 7.0 | 3.0 | 5.0 | 2.0 | versicolor |


In [8]:
# count the species for these observations
df[(df.sepal_length==7) & (df.sepal_width==3) & (df.petal_length==5) & (df.petal_width==2)].species.value_counts()

versicolor    13
virginica      4
Name: species, dtype: int64

In [None]:
# count the species for all observations
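
A minimal sketch of one way to count the species across the whole dataset, using the same `value_counts()` call as the filtered count. The DataFrame is rebuilt here so the block is self-contained; `species_counts` is an illustrative name.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the rounded-up DataFrame from "Preparing the data"
iris = load_iris()
col_names = [name[:-5].replace(' ', '_') for name in iris.feature_names]
df = pd.DataFrame(np.ceil(iris.data), columns=col_names)
df['species'] = [iris.target_names[num] for num in iris.target]

# Number of rows per species over all 150 observations
species_counts = df.species.value_counts()
print(species_counts)
```

The iris dataset is balanced, so each species should appear 50 times.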


Okay, so how might **Bayes' theorem** help us here?

Let's frame this as a **conditional probability**: What is the probability of some particular class, given the measurements 7, 3, 5, 2 (abbreviated as 7352 below)?

$$P(class | 7352)$$

We could calculate this conditional probability for **each of the three classes**, and then predict the class with the **highest probability**:

$$P(setosa | 7352)$$
$$P(versicolor | 7352)$$
$$P(virginica | 7352)$$

## Calculating the probability of each class

Let's start with **versicolor**:

$$P(versicolor | 7352) = \frac {P(7352 | versicolor) \times P(versicolor)} {P(7352)}$$

In [None]:
# Create a variable P_7352_versicolor, equal to the number of times we saw versicolor in our
# 7-3-5-2 observations, divided by the number of times we saw versicolor in the whole dataset.
# Remember that in Python 2, dividing two integers truncates the result, so use floats
# (in Python 3, / already returns a float).

In [None]:
# Create another variable P_versicolor, equal to the number of times we saw versicolor in the
# whole dataset, divided by the size of the whole dataset

In [None]:
# And a third variable P_7352, equal to the number of times we saw 7-3-5-2 
# divided by the size of the whole dataset

In [None]:
# What is the probability that a 7-3-5-2 iris is a versicolor?
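
As a rough worked example for versicolor, the sketch below plugs the counts printed earlier (13 versicolor and 4 virginica among the 7-3-5-2 rows) into Bayes' theorem. The class size of 50 and the dataset size of 150 are the standard iris figures and are assumed here rather than computed; the `n_...` names are illustrative.

```python
# Counts taken from the value_counts() output shown earlier (assumed correct)
n_7352_versicolor = 13   # versicolor rows with measurements 7-3-5-2
n_7352 = 13 + 4          # all rows with measurements 7-3-5-2

# Standard iris dataset sizes (assumed, not computed here)
n_versicolor = 50
n_total = 150

# float() keeps the division exact under Python 2 as well as Python 3
P_7352_versicolor = n_7352_versicolor / float(n_versicolor)  # likelihood
P_versicolor = n_versicolor / float(n_total)                 # prior
P_7352 = n_7352 / float(n_total)                             # evidence

# Bayes' theorem: posterior = likelihood * prior / evidence
P_versicolor_7352 = P_7352_versicolor * P_versicolor / P_7352
print(round(P_versicolor_7352, 4))  # 0.7647
```

The class size and the dataset size cancel, so the result reduces to 13/17: the fraction of 7-3-5-2 rows that are versicolor.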

In [None]:
# Do the same with virginica and setosa
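
To repeat the calculation for every class without copying code, one option is a small helper that counts exact matches in the DataFrame. This is only a sketch: `class_posteriors` and `observation` are hypothetical names, and the helper assumes at least one row matches the observation (otherwise the evidence term is zero).

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the rounded-up DataFrame from "Preparing the data"
iris = load_iris()
col_names = [name[:-5].replace(' ', '_') for name in iris.feature_names]
df = pd.DataFrame(np.ceil(iris.data), columns=col_names)
df['species'] = [iris.target_names[num] for num in iris.target]


def class_posteriors(df, observation):
    """Estimate P(species | observation) from exact-match counts."""
    # Rows whose features all equal the observed values
    mask = pd.Series(True, index=df.index)
    for col, value in observation.items():
        mask &= df[col] == value

    n_total = float(len(df))
    p_obs = mask.sum() / n_total  # P(observation), the evidence

    posteriors = {}
    for species in df.species.unique():
        is_class = df.species == species
        likelihood = (mask & is_class).sum() / float(is_class.sum())
        prior = is_class.sum() / n_total
        posteriors[species] = likelihood * prior / p_obs
    return posteriors


observation = {'sepal_length': 7, 'sepal_width': 3,
               'petal_length': 5, 'petal_width': 2}
posteriors = class_posteriors(df, observation)
for species, p in posteriors.items():
    print(species, round(p, 4))
```

The three posteriors sum to 1, and the predicted class is the one with the largest value.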

What conclusion do you reach?

In summary, we framed a **classification problem** as three conditional probability equations, we used **Bayes' theorem** to solve those equations, and then we made a **prediction** by choosing the class with the highest conditional probability.

## Using sklearn

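The sklearn implementation is a *naive* Bayes classifier: it assumes that the features
are independent of one another within each class, so it treats, for example, the petal
length and petal width of a given species as completely unrelated. When the features are
actually correlated, this can make it overly confident of its predictions.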
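
One way to check how far the data departs from that independence assumption is to measure the correlation between two features within each species. The sketch below uses the raw (unrounded) measurements, because rounding up makes some columns constant within a class; `raw` and `within_class_corr` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Raw, unrounded measurements (an assumption of this sketch; the rest of
# the notebook works with rounded-up values)
iris = load_iris()
col_names = [name[:-5].replace(' ', '_') for name in iris.feature_names]
raw = pd.DataFrame(iris.data, columns=col_names)
raw['species'] = [iris.target_names[num] for num in iris.target]

# Pearson correlation between petal length and petal width, per species
within_class_corr = {}
for species, group in raw.groupby('species'):
    within_class_corr[species] = group.petal_length.corr(group.petal_width)

for species, r in within_class_corr.items():
    print(species, round(r, 2))
```

A correlation well above zero within a class suggests that the two features carry overlapping evidence, which naive Bayes counts twice.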

In [9]:
# import the sklearn.naive_bayes library

In [10]:
# Create a sklearn.naive_bayes.GaussianNB object

In [19]:
# Use this object to .fit() the sepal length & width, and petal length & width to the species


array(['setosa', 'versicolor', 'virginica'], 
      dtype='|S10')

In [None]:
# Confirm that the .classes_ and .class_count_ attributes say something sensible

In [None]:
# Use the .predict() method to predict the species of [7,3,5,2]

In [None]:
# Use the .predict_proba() method: does it match what you calculated above? (It probably won't)
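
The sklearn steps above could be sketched as follows. This is one possible solution rather than the notebook's own; `nb` and `new_obs` are illustrative names, and the model is fit on the same rounded-up measurements used in the rest of the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Rounded-up features and species names as labels
iris = load_iris()
X = np.ceil(iris.data)
y = iris.target_names[iris.target]

# Fit a Gaussian naive Bayes model
nb = GaussianNB()
nb.fit(X, y)

# Sanity checks: class labels and number of training rows per class
print(nb.classes_)
print(nb.class_count_)

# Predict the species of the out-of-sample observation 7-3-5-2
new_obs = [[7, 3, 5, 2]]
print(nb.predict(new_obs))

# Class probabilities, in the same order as nb.classes_
print(nb.predict_proba(new_obs))
```

`GaussianNB` models each feature with a per-class normal distribution instead of counting exact matches, so its probabilities should not be expected to equal the hand-calculated ones.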