# Supervised Learning with Scikit-learn On The Iris Flower Dataset

In this notebook, I will apply the concepts learned in DataCamp to the famous Iris Flower Dataset collected by botanist E. S. Anderson and popularized by statistician and biologist Ronald Fisher.

Track: Machine Learning Scientist With Python

Course: Supervised Learning With Scikit-learn

In [1]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
iris = datasets.load_iris()
type(iris)

sklearn.utils.Bunch

A Bunch is similar to a dictionary, in that it contains key-value pairs. Its keys can also be accessed as attributes, for example `iris.data`.

In [2]:
print(iris.keys())

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])


- 'DESCR' = Description of the dataset
- 'feature_names' = a list of four elements, which serve as column names for the four columns in 'data': 'sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'
- 'data' = ndarray with 4 columns, containing the numeric values for each of the feature_names.
- 'target_names' = an array of just three values. Index 0 is 'setosa', 1 is 'versicolor' and 2 is 'virginica'.
- 'target' = an array of values 0, 1 or 2, one per sample, each an index into 'target_names'
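To see how these keys fit together, the Bunch can be assembled into a single pandas DataFrame. This is only a minimal sketch for inspection; the names `df` and the `species` column are my own choices and do not come from the dataset.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# One row per flower, one column per entry in feature_names
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Map each integer target (0, 1, 2) to its name via target_names
# ("species" is an assumed column name, chosen for illustration)
df["species"] = iris.target_names[iris.target]

print(df.head())
print(df["species"].value_counts())
```

Indexing `target_names` with `target` works because both are NumPy arrays, so each integer label selects the matching name.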

In [3]:
X = iris.data
y = iris.target

In [4]:
# Create train and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)
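The `stratify=y` argument asks `train_test_split` to keep the class proportions the same in both subsets. A quick way to check this, sketched here with variable names of my own (`train_counts`, `test_counts`), is to count the labels on each side of the split:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=21, stratify=y
)

# np.bincount counts how often each integer label (0, 1, 2) occurs
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)

print("Train samples per class:", train_counts)
print("Test samples per class: ", test_counts)
```

Since the dataset has 50 samples of each species, a stratified 30% test set should hold 15 samples per class, leaving 35 per class for training.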

In order to implement K-Nearest Neighbors with scikit-learn, we need to take into account the following requirements:

- Features must take numeric, continuous values rather than categories.
- There can't be missing values.
- Data has to be stored in NumPy arrays or pandas DataFrames.
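These requirements can be checked directly on the feature matrix before fitting anything. The snippet below is a rough sketch; the flag names (`is_array`, `is_numeric`, `has_missing`) are illustrative and not part of scikit-learn.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data

# Requirement: data is stored in a NumPy array
is_array = isinstance(X, np.ndarray)

# Requirement: features are numeric
is_numeric = np.issubdtype(X.dtype, np.number)

# Requirement: no missing values (NaN) anywhere in the matrix
has_missing = bool(np.isnan(X).any())

print("NumPy array:", is_array)
print("Numeric dtype:", is_numeric, "({})".format(X.dtype))
print("Missing values:", has_missing)
```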

# How to use scikit-learn to fit a classifier

In [6]:
from sklearn.neighbors import KNeighborsClassifier

# we instantiate the classifier
knn = KNeighborsClassifier(n_neighbors=6)

# we fit the model to the data
knn.fit(X_train, y_train)

# and now we use our model to label the test data
prediction = knn.predict(X_test)

for i,j in zip(prediction, y_test):
    print("Prediction: {}".format(i))
    print("Actual value: {}".format(j))

Prediction: 2
Actual value: 2
Prediction: 1
Actual value: 2
Prediction: 2
Actual value: 2
Prediction: 2
Actual value: 2
Prediction: 1
Actual value: 1
Prediction: 0
Actual value: 0
Prediction: 1
Actual value: 1
Prediction: 0
Actual value: 0
Prediction: 0
Actual value: 0
Prediction: 1
Actual value: 1
Prediction: 0
Actual value: 0
Prediction: 2
Actual value: 2
Prediction: 0
Actual value: 0
Prediction: 2
Actual value: 1
Prediction: 2
Actual value: 2
Prediction: 0
Actual value: 0
Prediction: 0
Actual value: 0
Prediction: 0
Actual value: 0
Prediction: 1
Actual value: 1
Prediction: 0
Actual value: 0
Prediction: 2
Actual value: 2
Prediction: 2
Actual value: 2
Prediction: 2
Actual value: 2
Prediction: 0
Actual value: 0
Prediction: 1
Actual value: 1
Prediction: 1
Actual value: 1
Prediction: 1
Actual value: 1
Prediction: 0
Actual value: 0
Prediction: 0
Actual value: 0
Prediction: 1
Actual value: 1
Prediction: 2
Actual value: 2
Prediction: 2
Actual value: 2
Prediction: 0
Actual value: 0
...
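To make the `predict` step less of a black box, here is a minimal sketch of the idea behind k-nearest neighbors: for each new sample, find the k closest training points and take a majority vote over their labels. The helper `knn_predict_one` is hypothetical and written only for illustration. It assumes plain Euclidean distance and may break ties differently from `KNeighborsClassifier`, so its output is not guaranteed to match scikit-learn's on every sample.

```python
from collections import Counter

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=21, stratify=iris.target
)


def knn_predict_one(x, X_ref, y_ref, k=6):
    """Hypothetical helper: majority vote among the k closest reference points."""
    # Euclidean distance from x to every reference sample
    distances = np.linalg.norm(X_ref - x, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most frequent label among those k neighbors
    votes = Counter(y_ref[nearest].tolist())
    return votes.most_common(1)[0][0]


manual_prediction = np.array(
    [knn_predict_one(x, X_train, y_train) for x in X_test]
)
print(manual_prediction)
```

Note that "fitting" in this sketch amounts to keeping the training data around; all the work happens at prediction time.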

It worked pretty well! But how well, exactly? How much can we trust our model?

# How to measure model performance

The accuracy of a model is defined as the fraction of correct predictions, that is, the number of correct predictions divided by the total number of predictions.

In [7]:
knn.score(X_test, y_test)

0.9555555555555556
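The definition of accuracy can be verified by hand: compare the predictions with the true labels and take the mean of the matches. The sketch below rebuilds the same split and classifier so that it runs on its own; `manual_accuracy` and `sklearn_accuracy` are names I introduced for the comparison.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=21, stratify=iris.target
)

knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)
prediction = knn.predict(X_test)

# Fraction of correct predictions: True counts as 1, False as 0
manual_accuracy = np.mean(prediction == y_test)

# For classifiers, score() reports the same quantity
sklearn_accuracy = knn.score(X_test, y_test)

print("Manual accuracy: {}".format(manual_accuracy))
print("knn.score:       {}".format(sklearn_accuracy))
```

Both numbers should agree, since `score` for a classifier returns the mean accuracy on the given test data.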

Note: This is a work in progress