# Getting started with the Iris Dataset

* What is the famous iris dataset, and how does it relate to machine learning?
* How do we load the iris dataset into scikit-learn?
* How do we describe a dataset using machine learning terminology?
* What are scikit-learn's four key requirements for working with data?

## Introducing the Iris dataset

![Iris](images/iris.png)

* 50 samples of each of 3 species of iris (150 samples total)
* Measurements: sepal length, sepal width, petal length, petal width
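
A copy of the dataset also ships with scikit-learn, so the counts above can be checked without a CSV file. The following is a minimal sketch, assuming scikit-learn and NumPy are installed; `bunch` and `counts` are illustrative names, and the class names in this bundled copy may be spelled differently from those in `iris.csv`.

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the copy of the iris data bundled with scikit-learn
bunch = load_iris()

print(bunch.data.shape)       # (number of samples, number of measurements)
print(bunch.feature_names)    # names of the four measurements
print(bunch.target_names)     # names of the three species

# Count how many samples belong to each species (targets are coded 0, 1, 2)
counts = np.bincount(bunch.target)
print(counts)
```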

## Objective
**Predict the species of an iris using the measurements**

In [None]:
# First, we'll import pandas, a data processing and CSV file I/O library
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import shuffle
from sklearn.metrics import accuracy_score, confusion_matrix

# We'll also import seaborn, a Python graphing library
import warnings # current version of seaborn generates a bunch of warnings that we'll ignore
warnings.filterwarnings("ignore")
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="white", color_codes=True)

# Next, we'll load the Iris flower dataset from "iris.csv" in the working directory (the file has no header row)
iris = pd.read_csv("iris.csv", header=None) # the iris dataset is now a Pandas DataFrame

# Render matplotlib figures inline in the notebook
%matplotlib inline

# Let's see what's in the iris data - Jupyter notebooks print the result of the last expression in a cell
# Press shift+enter to execute this cell
iris.head()

The CSV file has no header row, so we'll label the columns of the dataset ourselves to make them easier to work with.

## Attribute Information:
   1. sepal length in cm
   2. sepal width in cm
   3. petal length in cm
   4. petal width in cm
   5. species (the class label)

In [None]:
iris.columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm', 'Species']

iris = shuffle(iris)

iris.head()

In [None]:
# Let's see how many examples we have of each species
iris["Species"].value_counts()

In [None]:
# We'll use seaborn's FacetGrid to color the scatterplot by species
sns.FacetGrid(iris, hue="Species", height=5) \
   .map(plt.scatter, "SepalLengthCm", "SepalWidthCm") \
   .add_legend()

## Label Encoding

The Species column is not numerical: its values are text, e.g. "iris-setosa". We need to convert them into integers so that the model can work with them. This type of data is called **categorical data**.

There are several methods of encoding the categorical data into a format that the model can easily process:
* Scikit-learn LabelEncoder
* Scikit-learn OneHotEncoder
* Pandas get_dummies
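
To make the difference between the three options concrete, here is a small sketch that applies each one to a toy column. The four-row `species` frame is hypothetical and only stands in for `iris["Species"]`; the other variable names are illustrative too.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy column standing in for iris["Species"]
species = pd.DataFrame(
    {"Species": ["Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-setosa"]}
)

# LabelEncoder: one integer per class, assigned in sorted order of the class names
label_codes = LabelEncoder().fit_transform(species["Species"])
print(label_codes)

# OneHotEncoder: one 0/1 column per class; it expects 2-D input and
# returns a sparse matrix by default, so we convert it to a dense array
one_hot = OneHotEncoder().fit_transform(species[["Species"]]).toarray()
print(one_hot)

# pandas get_dummies: one indicator column per class, returned as a DataFrame
dummies = pd.get_dummies(species["Species"])
print(dummies)
```

LabelEncoder is intended for the response vector y, which is how this notebook uses it. For categorical input features, a one-hot representation is usually the safer choice, because integer codes suggest an ordering that the categories do not have.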

## Preparing X and y using pandas
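* scikit-learn expects X (feature matrix) and y (response vector) to be NumPy arrays.
* pandas is built on top of NumPy, so its objects convert cleanly to arrays.
* Thus, X can be a pandas DataFrame and y can be a pandas Series. (In this notebook y ends up as a NumPy array, because that is what `LabelEncoder.transform` returns.)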
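
A quick way to see why this works is to convert a DataFrame and a Series to arrays by hand. This is only a sketch: the two-row `df` is hypothetical, and `X_demo`, `y_demo`, `X_array` and `y_array` are illustrative names chosen so they don't clash with the notebook's own variables.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame using the column names from this notebook
df = pd.DataFrame({
    "SepalLengthCm": [5.1, 7.0],
    "SepalWidthCm": [3.5, 3.2],
    "Species": ["Iris-setosa", "Iris-versicolor"],
})

X_demo = df.drop("Species", axis=1)   # DataFrame: 2-D, one row per sample
y_demo = df["Species"]                # Series: 1-D, one value per sample

# np.asarray exposes the underlying data as plain NumPy arrays
X_array = np.asarray(X_demo)
y_array = np.asarray(y_demo)
print(type(X_array), X_array.shape)
print(type(y_array), y_array.shape)
```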

In [None]:
le = LabelEncoder()
le.fit(iris['Species'])
y = le.transform(iris['Species'])
y[:10]

In [None]:
# list the classes
le.classes_
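
The fitted encoder can also map integer codes back to the original names, which is handy when reading predictions. A minimal sketch on a hypothetical three-item list (`demo_le`, `demo_codes` and `demo_names` are illustrative names):

```python
from sklearn.preprocessing import LabelEncoder

# Fit a separate encoder on a hypothetical toy list of species names
demo_le = LabelEncoder()
demo_codes = demo_le.fit_transform(["Iris-virginica", "Iris-setosa", "Iris-setosa"])
print(demo_codes)          # integer codes, assigned in sorted order of the names

# inverse_transform maps the integer codes back to the original strings
demo_names = demo_le.inverse_transform(demo_codes)
print(demo_names)
```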

In [None]:
# get the features
X = iris.drop('Species', axis=1)
X.head()

## Splitting X and y into training and testing sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
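
With 150 rows and the default settings, `train_test_split` holds out 25% of the samples for testing, which rounds up to 38 test rows and leaves 112 for training. The sketch below reproduces this on scikit-learn's bundled copy of the data and also shows the optional `stratify` argument. The names ending in `_s` and the 20% test size are illustrative choices, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_all, y_all = load_iris(return_X_y=True)

# Defaults: test_size is 0.25 when neither test_size nor train_size is given
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, random_state=1)
print(X_tr.shape, X_te.shape)

# Optional: stratify=y keeps the class proportions the same in both splits
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_all, y_all, test_size=0.2, stratify=y_all, random_state=1
)
print(np.bincount(y_te_s))   # number of test samples per class
```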

## Training the model


In [None]:
model = LogisticRegression()

In [None]:
model.fit(X_train, y_train)

## Predicting the species of the test split

In [None]:
y_predict = model.predict(X_test)
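
Besides hard class labels, a fitted `LogisticRegression` can report a probability for each class through `predict_proba`. The following is a self-contained sketch on scikit-learn's bundled copy of the data. The name `clf` and the setting `max_iter=1000` are assumptions made here (the higher iteration limit is only a precaution against convergence warnings), not choices taken from the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, random_state=1)

# max_iter is raised above the default of 100 as a precaution (assumption)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

labels = clf.predict(X_te)        # one predicted class per test sample
proba = clf.predict_proba(X_te)   # one row per sample, one column per class
print(labels[:5])
print(proba[:5].round(3))
```

Each row of `proba` sums to 1, and the predicted label is the class with the highest probability in that row.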

## Accuracy Metrics

In [None]:
print(accuracy_score(y_test, y_predict))
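
Accuracy is a single number. The `confusion_matrix` function imported at the top of the notebook shows which classes are being mixed up. Here is a minimal sketch on hypothetical labels (`y_true_demo` and `y_pred_demo` are made up for illustration); the same two calls work on `y_test` and `y_predict`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted class codes for four samples
y_true_demo = [0, 1, 2, 2]
y_pred_demo = [0, 1, 1, 2]

# Fraction of predictions that match the true labels
acc = accuracy_score(y_true_demo, y_pred_demo)
print(acc)   # 0.75

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)
```

In this toy case the only off-diagonal entry is in row 2, column 1: one sample of class 2 was predicted as class 1.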