# Getting started in scikit-learn with the iris dataset

## Introducing the iris dataset

![Iris](images/02_iris.png)

About this dataset:
The iris dataset was used in R.A. Fisher's classic 1936 paper, *The Use of Multiple Measurements in Taxonomic Problems*, and is also available from the UCI Machine Learning Repository.

It contains three iris species with 50 samples each, along with four measurements for each flower. One species is linearly separable from the other two, but the other two are not linearly separable from each other.
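The separability claim can be checked for the first species with a few lines of code. The sketch below is an illustration only: it assumes the copy of the data bundled with scikit-learn (`sklearn.datasets.load_iris`), where class `0` is setosa and the third column is petal length. It shows that a single threshold on petal length is enough to split setosa from the other two species.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled copy of the iris data.
# Column order: sepal length, sepal width, petal length, petal width.
iris_bunch = load_iris()
X, y = iris_bunch.data, iris_bunch.target

petal_length = X[:, 2]

# Class 0 is setosa in this copy of the dataset.
setosa_max = petal_length[y == 0].max()
others_min = petal_length[y != 0].min()

# If the largest setosa petal is shorter than the smallest petal of the
# other species, any threshold between the two values separates them.
is_separable = setosa_max < others_min
print(f"setosa max: {setosa_max}, others min: {others_min}, separable: {is_separable}")
```

A gap on one feature is sufficient to show linear separability for setosa. The reverse does not hold: overlap on a single feature would not by itself prove that versicolor and virginica cannot be separated.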

The columns in this dataset are:

* Sepal Length

* Sepal Width

* Petal Length

* Petal Width

* Species
    * Iris Setosa
    * Iris Versicolour
    * Iris Virginica


## Machine learning on the iris dataset

- Framed as a **supervised learning** problem: Predict the species of an iris using the measurements
- Famous dataset for machine learning because prediction is **easy**
- Learn more about the iris dataset: [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Iris)
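To make the supervised learning framing concrete, here is a minimal sketch of the whole loop: fit a model on labelled measurements, then predict the species of flowers it has not seen. The choice of `KNeighborsClassifier`, the 70/30 split, and `random_state=0` are assumptions made for illustration and are not prescribed by this tutorial.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Measurements (features) and species codes (response).
X, y = load_iris(return_X_y=True)

# Hold out 30% of the flowers so the model is scored on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Illustrative model choice: k-nearest neighbours with k=5.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Fraction of held-out flowers whose species was predicted correctly.
accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {accuracy:.2f}")
```

Even with default settings, a simple model like this tends to score well on iris, which is what makes the dataset a convenient first example.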

## Loading the iris dataset into scikit-learn

# We will use the pandas module because it allows us to work with R-like dataframes
import pandas as pd

# We will need some functions from numpy as well
import numpy as np

# The next two lines make Jupyter display every result in a cell (by default only the last one is shown)
# Ending a line with a semicolon (;) suppresses the output of that line
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Read iris csv file
iris = pd.read_csv("https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv")

# we can assign (new) column names
iris.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

# See the data
iris.head()

# Data size
iris.shape

# We can ask for a specific statistic of a column
iris.sepal_length.mean()

# Or of all numeric columns (numeric_only=True skips the text column "species")
iris.mean(numeric_only=True)

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


(150, 5)

5.843333333333334

sepal_length    5.843333
sepal_width     3.054000
petal_length    3.758667
petal_width     1.198667
dtype: float64
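The cell above reads the data with pandas from a CSV file. As an alternative sketch, scikit-learn also bundles its own copy of the dataset, available through `sklearn.datasets.load_iris`. The variable names `bunch` and `df` are illustrative, and the `as_frame=True` option assumes a reasonably recent scikit-learn release. The bundled copy may differ from the CSV in a few entries, so summary statistics are not guaranteed to match the output above digit for digit.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of NumPy arrays.
bunch = load_iris(as_frame=True)

# bunch.frame holds the four measurements plus an integer "target" column.
df = bunch.frame.copy()

# Map the integer codes (0, 1, 2) to readable species names.
code_to_name = dict(enumerate(bunch.target_names))
df["species"] = bunch.target.map(code_to_name)

# Feature table and response column, ready to pass to an estimator.
X = bunch.data
y = bunch.target

print(df.head())
print(X.shape, y.shape)
```

Note that the column names in this copy include units (for example `sepal length (cm)`), unlike the underscore-style names used in the CSV.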

## Machine learning terminology

- Each row is an **observation** (also known as: sample, example, instance, record)
- Each column is a **feature** (also known as: predictor, attribute, independent variable, input, regressor, covariate)

- Each value we are predicting is the **response** (also known as: target, outcome, label, dependent variable)
- A **classification** problem is a supervised learning problem in which the response is categorical
- A **regression** problem is a supervised learning problem in which the response is ordered and continuous

![Classification / Regression](images/02_classification_regression.png)
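The terminology above maps directly onto array shapes. The sketch below, which assumes scikit-learn's bundled copy of the data, shows observations as rows, features as columns, and the response as a separate one-dimensional array. The regression setup at the end (predicting petal width from the other three measurements) is a hypothetical example added only to contrast the two problem types.

```python
from sklearn.datasets import load_iris

iris_bunch = load_iris()

# Feature matrix: one row per observation, one column per feature.
X = iris_bunch.data
# Response vector: one value per observation.
y = iris_bunch.target

n_observations, n_features = X.shape
print(f"{n_observations} observations, {n_features} features")

# Classification: the response is categorical (species codes 0, 1, 2).
species_codes = sorted(set(y.tolist()))
print("classification response values:", species_codes)

# Regression (hypothetical setup): the response is continuous, here
# petal width predicted from the other three measurements.
X_reg = X[:, :3]
y_reg = X[:, 3]
print("regression response dtype:", y_reg.dtype)
```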