# Machine learning on the iris dataset
 - Framed as a supervised learning problem: Predict the species of an iris using the measurements
 - Famous dataset for machine learning because prediction is easy
 - Learn more about the iris dataset: [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Iris)

In [1]:
from IPython.display import IFrame
IFrame('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', width=300, height=150)

The dataset contains measurements from different species of iris. This is framed as __supervised learning__ because we are trying to learn the relationship between the __data__ (iris measurements) and the __outcome__, which is the species of iris.

If the data were unlabelled, that is, if we had only the measurements but not the species, the problem would be unsupervised learning, for example attempting to group the observations into meaningful clusters.
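To make the contrast concrete, here is a minimal sketch of the unsupervised case. It assumes k-means as the clustering method and three clusters; both are illustrative choices made for this example, not something the dataset prescribes. The species labels are never passed to the model.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()

# Use only the measurements; the species labels are deliberately ignored.
# n_clusters=3 and random_state=0 are assumptions made for this illustration.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(iris.data)

# One cluster id per observation
print(cluster_ids.shape)
```

The cluster ids are arbitrary integers, so they do not necessarily line up with the species encoding.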

## Loading the iris dataset into scikit-learn

In scikit-learn, we import individual modules, classes, and functions rather than importing the library as a whole.

In [2]:
# import load_iris function from datasets module
from sklearn.datasets import load_iris

In [3]:
# save "bunch" object containing iris dataset and its attributes
# 'bunch' is sklearn's special object type to store datasets and their attributes
iris = load_iris()
type(iris)

sklearn.utils.Bunch
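A `Bunch` behaves like a dictionary whose keys can also be read as attributes. The short sketch below lists the available keys and shows that both access styles return the same array; the exact set of keys may vary between scikit-learn versions.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# List the keys stored in the Bunch
print(sorted(iris.keys()))

# Key access and attribute access refer to the same object
print(iris["data"] is iris.data)
```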

In [4]:
# print the iris data
# 'data' is an attribute of the dataset
print(iris.data)

[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]
 [ 5.4  3.9  1.7  0.4]
 [ 4.6  3.4  1.4  0.3]
 [ 5.   3.4  1.5  0.2]
 [ 4.4  2.9  1.4  0.2]
 [ 4.9  3.1  1.5  0.1]
 [ 5.4  3.7  1.5  0.2]
 [ 4.8  3.4  1.6  0.2]
 [ 4.8  3.   1.4  0.1]
 [ 4.3  3.   1.1  0.1]
 [ 5.8  4.   1.2  0.2]
 [ 5.7  4.4  1.5  0.4]
 [ 5.4  3.9  1.3  0.4]
 [ 5.1  3.5  1.4  0.3]
 [ 5.7  3.8  1.7  0.3]
 [ 5.1  3.8  1.5  0.3]
 [ 5.4  3.4  1.7  0.2]
 [ 5.1  3.7  1.5  0.4]
 [ 4.6  3.6  1.   0.2]
 [ 5.1  3.3  1.7  0.5]
 [ 4.8  3.4  1.9  0.2]
 [ 5.   3.   1.6  0.2]
 [ 5.   3.4  1.6  0.4]
 [ 5.2  3.5  1.5  0.2]
 [ 5.2  3.4  1.4  0.2]
 [ 4.7  3.2  1.6  0.2]
 [ 4.8  3.1  1.6  0.2]
 [ 5.4  3.4  1.5  0.4]
 [ 5.2  4.1  1.5  0.1]
 [ 5.5  4.2  1.4  0.2]
 [ 4.9  3.1  1.5  0.1]
 [ 5.   3.2  1.2  0.2]
 [ 5.5  3.5  1.3  0.2]
 [ 4.9  3.1  1.5  0.1]
 [ 4.4  3.   1.3  0.2]
 [ 5.1  3.4  1.5  0.2]
 [ 5.   3.5  1.3  0.3]
 [ 4.5  2.3  1.3  0.3]
 [ 4.4  3.2  1.3  0.2]
 [ 5.   3.5


## Machine learning terminology
- Each row is an __observation__ (also known as: sample, example, instance, record)
- Each column is a __feature__ (also known as: predictor, attribute, independent variable, input, regressor, covariate)
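As a small illustration of the two terms, the sketch below pulls one row (an observation) and one column (a feature) out of `iris.data` with NumPy indexing. The variable names are chosen for this example only.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# One row: the four measurements of a single flower
first_observation = iris.data[0]

# One column: a single measurement across all flowers
first_feature = iris.data[:, 0]

print(first_observation.shape)
print(first_feature.shape)
```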

In [5]:
# print the names of the four features
# These can be thought of as column headers (names) for the data.
# Using the attribute feature_names
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [6]:
# Using attribute 'target'
# print integers representing the species of each observation
# This is what we are going to predict
print(iris.target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


In [7]:
# print the encoding scheme for species: 0 = setosa, 1 = versicolor, 2 = virginica
print(iris.target_names)

['setosa' 'versicolor' 'virginica']
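Because `iris.target_names` is a NumPy array, the integer codes can be turned back into species names by using `iris.target` as an index. This is a minimal sketch; the name `species` is introduced here for illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Index the array of names with the integer codes to decode every observation
species = iris.target_names[iris.target]

# One observation from each block of 50
print(species[0], species[50], species[100])
```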


Each value we are predicting is the response (also known as: target, outcome, label, dependent variable).

### Types of Supervised learning
- __Classification__ is supervised learning in which the response is categorical, that is, its values come from a finite, unordered set. Examples: predicting the species of an iris, or whether an e-mail is spam or ham.
- __Regression__ is supervised learning in which the response is ordered and continuous. Examples: the price of a house, or the height of a person.

As an ML practitioner, you have to understand how the given data is encoded and decide whether the response variable is suited to regression or classification.

Here we know that 0, 1, and 2 represent unordered categories, so we use a classification technique and not regression.
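One quick way to inspect how the response is encoded is to list its distinct values and how often each occurs. The sketch below uses `np.unique` for that; a small set of repeated integer codes is a hint, though not proof, that the response is categorical.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Distinct response values and the number of observations for each
classes, counts = np.unique(iris.target, return_counts=True)

print(classes)
print(counts)
```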


## Requirements for working with data in scikit-learn

1. Features and response are __separate objects.__
2. Features and response should be __numeric__ (regardless of whether the task is classification or regression).
3. Features and response should be __NumPy arrays__.
4. Features and response should have __specific shapes__.

Both `iris.data` and `iris.target` are already stored as NumPy arrays (`numpy.ndarray`).

In [8]:
# check the types of the features and response
print(type(iris.data))
print(type(iris.target))

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


In [9]:
# check the shape of the features (first dimension = number of observations, second dimension = number of features)
print(iris.data.shape)

(150, 4)


In [10]:
# check the shape of the response (single dimension matching the number of observations)
print(iris.target.shape)

(150,)
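The requirements can be bundled into a single check. The helper below, `meets_requirements`, is a hypothetical function written for this example and is not part of scikit-learn. It assumes the shapes shown above: a two-dimensional feature array and a one-dimensional response with one entry per observation.

```python
import numpy as np
from sklearn.datasets import load_iris


def meets_requirements(features, response):
    """Hypothetical helper: check the four requirements for a features/response pair."""
    # 1. Separate objects
    separate = features is not response
    # 3. NumPy arrays (checked before dtype and shape, which rely on it)
    if not (isinstance(features, np.ndarray) and isinstance(response, np.ndarray)):
        return False
    # 2. Numeric values
    numeric = (np.issubdtype(features.dtype, np.number)
               and np.issubdtype(response.dtype, np.number))
    # 4. Shapes: 2-D features, 1-D response, same number of observations
    shapes = (features.ndim == 2 and response.ndim == 1
              and features.shape[0] == response.shape[0])
    return bool(separate and numeric and shapes)


iris = load_iris()
print(meets_requirements(iris.data, iris.target))
```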


In [11]:
# store feature matrix in "X" (capital letter, as it denotes a matrix)
X = iris.data

# store response vector in "y" (lowercase letter, as it denotes a vector)
y = iris.target
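As an alternative sketch, `load_iris` accepts a `return_X_y` parameter that returns the feature matrix and response vector directly, skipping the `Bunch`. This is a shortcut rather than the approach taken in the cells above, and it drops the feature and target names.

```python
from sklearn.datasets import load_iris

# return_X_y=True returns the (data, target) pair instead of a Bunch
X, y = load_iris(return_X_y=True)

print(X.shape)
print(y.shape)
```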