# Introduction

In machine learning, the Iris dataset is one of the best-known datasets for comparing classification algorithms. It was introduced in a 1936 paper by Ronald Fisher.

We will compare four algorithms:

* Gaussian Naive Bayes
* Logistic Regression
* Decision Tree
* Support Vector Machines

## Reading in the data

We will read the data from `iris.data`, a comma-separated file with no header row.

In [1]:
import pandas as pd
import numpy as np

columns = ['sepal length','sepal width','petal length','petal width','class']

# iris.data has no header row. read_csv skips the trailing blank lines
# and parses the four measurements as floats rather than strings.
d = pd.read_csv("iris.data", header=None, names=columns)

d.head()

X = d[['sepal length','sepal width','petal length','petal width']]
Y = d['class']

## Gaussian Naive Bayes

We'll first fit Gaussian Naive Bayes and measure how accurately it classifies the data it was trained on.

In [2]:
from sklearn.naive_bayes import GaussianNB

nbModel = GaussianNB()
nbModel.fit(X,Y)

(nbModel.predict(X) == Y).sum()/len(Y)

0.95999999999999996

Not bad.

## Logistic Regression

Now, we'll use Logistic Regression, which models the probability of a class with the logistic function:

$$ g(x) = \frac{1}{1+e^{-h(x)}}$$

where

$$ h(x) = b_0 + b_1x_1 + \cdots + b_nx_n$$
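As a rough illustration of the two formulas above, the sketch below evaluates $h(x)$ and $g(x)$ by hand. The coefficient values `b0` and `b` are hypothetical, chosen only so the arithmetic is easy to follow. They are not the values scikit-learn learns from the Iris data.

```python
import math

def h(x, b0, b):
    """Linear part: b0 + b1*x1 + ... + bn*xn."""
    return b0 + sum(b_i * x_i for b_i, x_i in zip(b, x))

def g(x, b0, b):
    """Logistic function applied to h(x); the result lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-h(x, b0, b)))

# Hypothetical coefficients for a two-feature example (not fitted values)
b0 = -1.0
b = [0.5, -0.25]

# h = -1.0 + 0.5*2.0 - 0.25*0.0 = 0.0, so g = 1 / (1 + e^0) = 0.5
print(g([2.0, 0.0], b0, b))
```

A value of $g(x)$ above 0.5 corresponds to $h(x) > 0$, which is where a binary logistic classifier would predict the positive class.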

In [3]:
from sklearn.linear_model import LogisticRegression

lrModel = LogisticRegression()
lrModel.fit(X,Y)

(lrModel.predict(X) == Y).sum()/len(Y)

0.95999999999999996

## Decision Tree

Next up, we'll use a decision tree to classify the data.

In [4]:
from sklearn import tree

dtModel = tree.DecisionTreeClassifier()
dtModel.fit(X,Y)

(dtModel.predict(X) == Y).sum()/len(Y)

1.0

## Support Vector Machine (SVM)

Finally, we'll use an SVM to determine how accurately it can classify the plants. We'll be using scikit-learn's `SVC` (support vector classification) with its default kernel, the radial basis function (RBF).

In [5]:
from sklearn.svm import SVC

svmModel = SVC()
svmModel.fit(X,Y)

(svmModel.predict(X) == Y).sum()/len(Y)

0.98666666666666669

## Ranking

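Ranked by accuracy on the training data:

1. Decision Tree
2. Support Vector Machine
3. Tie between Gaussian Naive Bayes and Logistic Regression

Each model was scored on the same data it was fit to, so these numbers show how well each model fits the training set rather than how well it generalizes. An unpruned decision tree can memorize its training data, which explains its perfect score.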
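The accuracies above are measured on the same rows used for fitting. One possible way to compare the four models on held-out data is k-fold cross-validation, sketched below. Several things here are assumptions rather than part of the original notebook: the sketch loads scikit-learn's bundled copy of Iris with `load_iris` instead of reading `iris.data`, it uses 5 folds, it fixes `random_state=0` for the tree, and it raises `max_iter` for logistic regression. The names `X_cv`, `y_cv`, `models`, and `cv_scores` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: use the copy of Iris bundled with scikit-learn
X_cv, y_cv = load_iris(return_X_y=True)

models = {
    "Gaussian Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Support Vector Machine": SVC(),
}

cv_scores = {}
for name, model in models.items():
    # Fit on 4 folds and score on the remaining fold, 5 times over
    scores = cross_val_score(model, X_cv, y_cv, cv=5)
    cv_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

The ordering from a run like this may differ from the ranking above, since a model that fits the training rows perfectly does not necessarily do best on rows it has not seen.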