# The Iris Dataset
## About
The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper _The use of multiple measurements in taxonomic problems_ as an example of linear discriminant analysis. It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other. -[Wikipedia](https://en.wikipedia.org/wiki/Iris_flower_data_set)
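
As a rough illustration of the linear discriminant idea mentioned above, the sketch below fits scikit-learn's `LinearDiscriminantAnalysis` to the four measurements. It is a modern stand-in, not a reproduction of Fisher's 1936 calculation, and it scores the model on the same data it was fitted on, so the accuracy it prints is optimistic. The variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()

# Fit a linear discriminant model on all four features
lda = LinearDiscriminantAnalysis()
lda.fit(iris.data, iris.target)

# Accuracy on the training data itself (optimistic, for illustration only)
train_accuracy = lda.score(iris.data, iris.target)
print("LDA training accuracy: {:.2f}".format(train_accuracy))
```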


# Exploratory Data Analysis
## Previewing the dataset

In [1]:
from sklearn.datasets import load_iris
data = load_iris()

In [2]:
# display what each species is encoded as
print(data.target)
shape = data.data.shape
print("{} datapoints for the {} available features".format(shape[0], shape[1]))

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
150 datapoints for the 4 available features


In [3]:
features = data.feature_names
species = data.target_names
print("Features available:")
for name in features:
    print("\t-", name)
print("Species to be identified:")
for name in species:
    print("\t-", name)

Features available:
	- sepal length (cm)
	- sepal width (cm)
	- petal length (cm)
	- petal width (cm)
Species to be identified:
	- setosa
	- versicolor
	- virginica


As the output above shows, the Iris dataset contains 150 datapoints, each with 4 features: sepal length, sepal width, petal length and petal width, all in centimeters. The datapoints belong to three iris species: _setosa_, _versicolor_ and _virginica_. In `data.target`, each species is encoded as an integer:

- _Setosa_ is denoted by a __0__
- _Versicolor_ is denoted by a __1__
- _Virginica_ is denoted by a __2__
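
The integer codes can be turned back into species names by indexing `target_names` with `target`. A minimal sketch, which reloads the dataset so that it runs on its own (the names `iris`, `species_per_sample` and `counts` are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# target_names is an array, so indexing it with the integer labels
# yields one species name per sample
species_per_sample = iris.target_names[iris.target]

# Count how many samples carry each code
counts = np.bincount(iris.target)
for code, (name, n) in enumerate(zip(iris.target_names, counts)):
    print("{} -> {} ({} samples)".format(code, name, n))
```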

# Making Predictions
## K-Nearest Neighbors

In [13]:
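# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split lives in model_selection
from sklearn.model_selection import train_test_split
X = data.data    # feature matrix, shape (150, 4)
y = data.target  # integer species labels, shape (150,)

# hold out 40% of the samples (60 of 150) for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)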
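
Because `train_test_split` shuffles at random, the scores reported in this notebook will not reproduce exactly from run to run. A sketch of a repeatable variant, assuming a fixed seed (`random_state=42` is an arbitrary choice, not the one behind the results shown here) and `stratify=y` so that each species keeps the same share in both subsets:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Fixed seed makes the split repeatable; stratify keeps class proportions equal
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)

print("train:", X_train.shape, "test:", X_test.shape)
print("test samples per species:", np.bincount(y_test))
```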

In [5]:
from sklearn.neighbors import KNeighborsClassifier

In [18]:
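# classify each test flower by majority vote among its 5 nearest training samples
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
# score() returns the mean accuracy on the held-out test set
knn.score(X_test, y_test)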

0.95

## Logistic Regression

In [10]:
from sklearn.linear_model import LogisticRegression

In [19]:
# logistic regression with default settings, scored on the same test set
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

0.9

## Decision Trees

In [20]:
from sklearn.tree import DecisionTreeClassifier

In [22]:
# decision tree with default settings (no depth limit), scored on the same test set
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
dt.score(X_test, y_test)

0.9166666666666666

# Conclusion about classifiers

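Of the three classifiers above, KNN scored highest on this train/test split (0.95), ahead of the decision tree (about 0.92) and logistic regression (0.90). With 60 test samples, those gaps amount to two or three flowers, and the split is random, so the ranking can shift between runs. Logistic Regression and Decision Trees were included only for comparison, __not__ because anything in the data suggested them.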
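
A single split gives one estimate per model. One common way to get a steadier comparison is k-fold cross-validation. The sketch below is an assumption about how that could look here, not part of the original analysis. The fold count, the seeds and `max_iter=1000` (raised to help the solver converge on unscaled features) are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# 5 stratified folds, shuffled with a fixed seed so every model sees the same splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

mean_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv)  # one accuracy per fold
    mean_scores[name] = scores.mean()
    print("{}: {:.3f} +/- {:.3f}".format(name, scores.mean(), scores.std()))
```

Comparing the mean accuracies together with their spread across folds shows whether one model is consistently ahead or whether the differences fall within fold-to-fold variation.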