# Machine learning on the iris dataset
 - Framed as a supervised learning problem: Predict the species of an iris using the measurements
 - Famous dataset for machine learning because prediction is easy
 - Learn more about the iris dataset: [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Iris)

The dataset contains measurements from different species of iris. It is framed as a __supervised learning__ problem because we are trying to learn the relationship between the __data__ (iris measurements) and the __outcome__, which is the species of iris.

If the data were unlabelled, that is, if we had only the measurements but not the species, it would be an unsupervised learning problem, in which we would attempt to group the observations into meaningful clusters.
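To make the contrast concrete, here is a minimal sketch of the unsupervised framing: the species labels are ignored and the measurements are grouped with k-means. The choice of `KMeans`, the three clusters, and the `random_state` value are assumptions made for illustration and are not part of the original material.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# use only the measurements; the species labels are deliberately ignored
measurements = load_iris().data

# assumption: ask for 3 clusters; a real unsupervised problem would not know this number
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(measurements)

# one cluster id per observation; the ids are arbitrary and need not match the species encoding
print(cluster_ids.shape)
```

The cluster ids only say which observations were grouped together. Nothing in the output names a species, which is the practical difference from the supervised setting.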

## Loading the iris dataset into scikit-learn

In scikit-learn we import the individual modules, classes and functions we need rather than importing the library as a whole.

In [2]:
# import load_iris function from datasets module
from sklearn.datasets import load_iris

In [3]:
# save "bunch" object containing iris dataset and its attributes
# 'bunch' is sklearn's special object type to store datasets and their attributes
iris = load_iris()
type(iris)

sklearn.utils.Bunch

In [4]:
# print the iris data
# 'data' is attribute of dataset
print(iris.data)

[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]
 [ 5.4  3.9  1.7  0.4]
 [ 4.6  3.4  1.4  0.3]
 [ 5.   3.4  1.5  0.2]
 [ 4.4  2.9  1.4  0.2]
 [ 4.9  3.1  1.5  0.1]
 [ 5.4  3.7  1.5  0.2]
 [ 4.8  3.4  1.6  0.2]
 [ 4.8  3.   1.4  0.1]
 [ 4.3  3.   1.1  0.1]
 [ 5.8  4.   1.2  0.2]
 [ 5.7  4.4  1.5  0.4]
 [ 5.4  3.9  1.3  0.4]
 [ 5.1  3.5  1.4  0.3]
 [ 5.7  3.8  1.7  0.3]
 [ 5.1  3.8  1.5  0.3]
 [ 5.4  3.4  1.7  0.2]
 [ 5.1  3.7  1.5  0.4]
 [ 4.6  3.6  1.   0.2]
 [ 5.1  3.3  1.7  0.5]
 [ 4.8  3.4  1.9  0.2]
 [ 5.   3.   1.6  0.2]
 [ 5.   3.4  1.6  0.4]
 [ 5.2  3.5  1.5  0.2]
 [ 5.2  3.4  1.4  0.2]
 [ 4.7  3.2  1.6  0.2]
 [ 4.8  3.1  1.6  0.2]
 [ 5.4  3.4  1.5  0.4]
 [ 5.2  4.1  1.5  0.1]
 [ 5.5  4.2  1.4  0.2]
 [ 4.9  3.1  1.5  0.1]
 [ 5.   3.2  1.2  0.2]
 [ 5.5  3.5  1.3  0.2]
 [ 4.9  3.1  1.5  0.1]
 [ 4.4  3.   1.3  0.2]
 [ 5.1  3.4  1.5  0.2]
 [ 5.   3.5  1.3  0.3]
 [ 4.5  2.3  1.3  0.3]
 [ 4.4  3.2  1.3  0.2]
 [ 5.   3.5


# Machine learning terminology
- Each row is an __observation__ (also known as: sample, example, instance, record)
- Each column is a __feature__ (also known as: predictor, attribute, independent variable, input, regressor, covariate)

In [5]:
# print the names of the four features
# These can be thought of as column headers(names) for the data.
# Using the attribute feature_names
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [6]:
# Using attribute 'target'
# print integers representing the species of each observation
# This is what we are going to predict
print(iris.target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


In [7]:
# print the encoding scheme for species: 0 = setosa, 1 = versicolor, 2 = virginica
print(iris.target_names)

['setosa' 'versicolor' 'virginica']


Each value we are predicting is the response (also known as: target, outcome, label, dependent variable).
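As a small illustration of this encoding, the integer responses can be translated back into species names by indexing `target_names` with `target`. The variable names `species`, `names` and `counts` are introduced here only for the example.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# index the array of names with the integer codes to get one name per observation
species = iris.target_names[iris.target]
print(species[0], species[-1])

# count how many observations belong to each species
names, counts = np.unique(species, return_counts=True)
print(dict(zip(names, counts)))
```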

### Types of Supervised learning
- __Classification__ is supervised learning in which the response is categorical, that is, its values come from a finite, unordered set. E.g.: predicting the species of an iris, or whether an e-mail is spam or ham.
- __Regression__ is supervised learning in which the response is ordered and continuous. E.g.: the price of a house, or the height of a person.

As ML practitioners, we have to understand how the given data is encoded and decide whether the response variable is suited to regression or classification.

Here we know that 0, 1 and 2 represent unordered categories, so we use a classification technique and not regression.


## Requirements for working with data in scikit-learn

1. Features and response are __separate objects.__
2. Features and response should be __numeric__ (regardless of whether the problem is classification or regression).
3. Features and response should be __NumPy arrays__.
4. Features and response should have specific shapes

Both `iris.data` and `iris.target` are stored by default as NumPy arrays (`numpy.ndarray`).

In [8]:
# check the types of the features and response
print(type(iris.data))
print(type(iris.target))

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


In [9]:
# check the shape of the features (first dimension = number of observations, second dimension = number of features)
print(iris.data.shape)

(150, 4)


In [10]:
# check the shape of the response (single dimension matching the number of observations)
print(iris.target.shape)

(150,)


In [11]:
# store feature matrix in "X"(capital letter) as it denotes a matrix 
X = iris.data

# store response vector in "y"(small letter) as it denotes a vector.
y = iris.target
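The four requirements can also be bundled into a single sanity check. This is only a sketch: `check_sklearn_ready` is a hypothetical helper written for this example, not a scikit-learn function.

```python
import numpy as np
from sklearn.datasets import load_iris

def check_sklearn_ready(X, y):
    """Hypothetical helper: assert the four requirements listed above."""
    # requirement 1: features and response are separate objects
    assert X is not y
    # requirement 3: both are NumPy arrays (checked before dtype, which lists lack)
    assert isinstance(X, np.ndarray) and isinstance(y, np.ndarray)
    # requirement 2: both are numeric
    assert np.issubdtype(X.dtype, np.number) and np.issubdtype(y.dtype, np.number)
    # requirement 4: X is 2-D (observations x features), y is 1-D with one value per observation
    assert X.ndim == 2 and y.ndim == 1 and X.shape[0] == y.shape[0]
    return True

iris = load_iris()
X, y = iris.data, iris.target
print(check_sklearn_ready(X, y))
```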

************
## [Training a machine learning model with scikit-learn](https://www.youtube.com/watch?v=RlQuVL6-qe8&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=4)

### Agenda
- What is the K-nearest neighbors classification model?
- What are the four steps for model training and prediction in scikit-learn?
- How can I apply this pattern to other machine learning models?

Because the response variable used here is categorical, this is a classification problem.

## K-nearest neighbors (KNN) classification
1. Pick a value for K.
2. Search for the K observations in the training data that are "nearest" to the measurements of the unknown iris.
3. Use the most popular response value from the K nearest neighbors as the predicted response value for the unknown iris.
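The three steps above can be sketched in plain Python. This is an illustrative toy version, not how scikit-learn implements KNN: the function name `knn_predict`, the Euclidean distance, and the tiny two-class dataset are all assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Toy KNN: return the most common label among the k nearest training rows."""
    # step 2: Euclidean distance from the query to every training observation
    distances = [(math.dist(row, query), label) for row, label in zip(train_X, train_y)]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # step 3: majority vote; on a tie, the label seen first among the neighbours wins
    return Counter(nearest_labels).most_common(1)[0][0]

# made-up training data: two features per observation, classes "R" and "B"
train_X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [4.0, 4.0], [4.2, 3.9], [3.8, 4.1]]
train_y = ["R", "R", "R", "B", "B", "B"]

# step 1: pick K
print(knn_predict(train_X, train_y, [1.1, 1.0], k=3))  # prints R
print(knn_predict(train_X, train_y, [4.0, 3.8], k=3))  # prints B
```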

<img src = '1.PNG'>

Example dataset with two numerical features, plotted on the x- and y-axes. Each point represents an observation, and the color of the point represents its response class: red (R), blue (B) or green (G).

<img src = '2.PNG'>

KNN classification map for K=1. The background of the diagram is coloured red wherever the nearest neighbour is red (R), blue wherever the nearest neighbour is blue (B), and green wherever the nearest neighbour is green (G).

In other words, the background colour tells us what the predicted response value will be for a new observation, depending on its x and y features.

<img src = '3.PNG'>

Here K=5 is used. Observe how the decision boundaries change (and arguably improve). White areas show where KNN can't make a clear prediction, typically because the vote among the nearest neighbours is tied. KNN is simple, but it can make highly accurate predictions if the different classes in the dataset have highly dissimilar feature values.

### Going back to iris dataset

Let's verify that X and y have the appropriate shapes.

In [12]:
print(X.shape)
print(y.shape)

(150, 4)
(150,)


## scikit-learn 4-step modeling pattern

### Step 1: Import the class you plan to use

In [13]:
from sklearn.neighbors import KNeighborsClassifier

### Step 2: "Instantiate" the "estimator"
- "Estimator" is scikit-learn's term for model
- "Instantiate" means "make an instance of"

In [14]:
# create an instance of the KNeighborsClassifier class and call it knn
# this object knows how to do KNN classification, but it still needs data to work with
# n_neighbors=1 tells the classifier to look for the single nearest neighbour
# n_neighbors is a tuning parameter (hyperparameter)
knn = KNeighborsClassifier(n_neighbors=1)

- Name of the object does not matter
- Can specify tuning parameters (aka "hyperparameters") during this step
- All parameters not specified are set to their defaults.
- Printing the object shows its parameters, including those left at their defaults

In [15]:
print(knn)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')


### Step 3: Fit the model with data (aka "model training")
- Model is learning the relationship between X and y
- Occurs in-place, so there is no need to assign the result to another object

In [16]:
knn.fit(X, y)  # call the fit method on the knn object with two arguments: feature matrix X and response vector y

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')
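A quick way to see that fitting happens in place: `fit` returns the estimator itself, so the returned value and `knn` are the same object. The name `returned` is used only for this sketch.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

knn = KNeighborsClassifier(n_neighbors=1)

# fit modifies knn and hands the same object back
returned = knn.fit(X, y)
print(returned is knn)  # prints True
```

This is also why the notebook cell above displays the estimator after `knn.fit(X, y)`: the notebook is echoing the return value, not a new model.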

### Step 4: Predict the response for a new observation
- New observations are called "out-of-sample" data
- Uses the information it learned during the model training process
- Call the `predict()` method on the knn object and pass it the features of the unknown iris as a list.
- `predict()` expects a NumPy array, but it also works with a list because scikit-learn converts the list to a NumPy array of the appropriate shape.

In [17]:
knn.predict([[3,5,4,2]])

array([2])

- The input is given as a list of lists
- `[[3,5,4,2]]` is used instead of `[3,5,4,2]`; a one-dimensional input would raise a `ValueError`
- Returns a NumPy array with the predicted response value
- Can predict for multiple observations at once
- `array([2])` is the encoded species of the iris; here 2 corresponds to virginica.
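The points above can be tried out in a short sketch: the new observation is reshaped into one row with four columns, the prediction is translated into a species name, and a flat list is passed to show the shape error. The variable names `new_obs` and `pred` are assumptions for this example, and the exact error message depends on the scikit-learn version.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier(n_neighbors=1).fit(iris.data, iris.target)

# reshape(1, -1) turns 4 values into a 2-D array with 1 row and 4 columns
new_obs = np.array([3, 5, 4, 2]).reshape(1, -1)
pred = knn.predict(new_obs)

# translate the integer code into a species name
print(pred, iris.target_names[pred])

# a flat, one-dimensional input is rejected by recent scikit-learn versions
try:
    knn.predict([3, 5, 4, 2])
except ValueError as err:
    print("ValueError:", str(err).splitlines()[0])
```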

In [18]:
X_new = [[3,5,4,2],[5,4,3,2]]
knn.predict(X_new)

array([2, 1])

## Model Tuning - Using a different value for K

In [19]:
# instantiate the model (using the value K=5)
knn = KNeighborsClassifier(n_neighbors=5)

# fit the model with data
knn.fit(X, y)

# predict the response for new observations
knn.predict(X_new)

array([1, 1])

With K=5, the model predicts class 1 (versicolor) for both new observations.
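To compare several values of K side by side, the same instantiate, fit and predict steps can be put in a loop. The K values tried here are arbitrary choices for illustration, and `predictions` is a name introduced only for this sketch.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]

# assumption: these K values are arbitrary and chosen only to show the pattern
predictions = {}
for k in (1, 5, 15):
    knn = KNeighborsClassifier(n_neighbors=k)  # instantiate with this K
    knn.fit(X, y)                              # fit on the full dataset
    predictions[k] = knn.predict(X_new)        # predict for the two new observations
    print(k, predictions[k])
```

Seeing the predictions change with K does not say which K is better; that question needs a model evaluation procedure.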


## Using a different classification model

In [20]:
# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
logreg = LogisticRegression()

# fit the model with data
logreg.fit(X, y)

# predict the response for new observations
logreg.predict(X_new)

array([2, 0])

For unknown iris 1 ---> 2

For unknown iris 2 ---> 0
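Because every scikit-learn estimator exposes `fit` and `predict`, the modeling pattern can be wrapped in a small function and reused across models. The helper `fit_and_predict` is hypothetical, and `max_iter=1000` is an assumption added so the solver has room to converge. Predictions from logistic regression may differ from the output shown above, depending on the scikit-learn version and its default solver.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

def fit_and_predict(estimator, X, y, X_new):
    """Hypothetical helper: steps 3 and 4 of the pattern for any estimator."""
    estimator.fit(X, y)
    return estimator.predict(X_new)

iris = load_iris()
X, y = iris.data, iris.target
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]

# step 2 for each model; max_iter=1000 is an assumption, not a value from the original notebook
models = {
    "knn (K=5)": KNeighborsClassifier(n_neighbors=5),
    "logistic regression": LogisticRegression(max_iter=1000),
}
results = {name: fit_and_predict(model, X, y, X_new) for name, model in models.items()}
for name, pred in results.items():
    print(name, pred)
```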

__Model evaluation__ helps us determine which model performs best: which value of K works well, and whether to use KNN or logistic regression.