<a href="https://colab.research.google.com/github/ianomunga/IrisKNNClassifier/blob/main/Iris_KNN_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## K-Nearest Neighbour Iris Classification using Scikit-Learn
This is a simple model that demonstrates fundamental concepts such as:
* Loading a dataset that ships with scikit-learn
* Splitting it into a training set and a test set, so the model is evaluated on data it has not seen and overfitting becomes detectable
* Working with class labels encoded as integers and features stored as numerical NumPy multi-dimensional arrays
* Evaluating the trained model's predictions against the test data

While the K-Nearest Neighbours approach is often outperformed by more sophisticated classification algorithms, it's a good place to start.

In [1]:
#Import all the dependencies
import numpy as np
import pandas as pd
import sklearn 
import matplotlib.pyplot as plt

#Render plots inline in the notebook instead of
#opening them in a separate window
%matplotlib inline

In [2]:
#Load the iris dataset that's bundled with sklearn
from sklearn.datasets import load_iris
iris_dataset = load_iris()
print("Keys of iris_dataset: \n{}".format(iris_dataset.keys()))

Keys of iris_dataset: 
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])


In [3]:
#The dataset description is stored under the 'DESCR' key; print its first 193 characters
print(iris_dataset['DESCR'][:193] + "\n...")

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, pre
...


In [7]:
#the labels we'll be trying to predict
print("Target Names We're Predicting: \n{}".format(iris_dataset['target_names']))

Target Names We're Predicting: 
['setosa' 'versicolor' 'virginica']


In [8]:
#the features of the sample flowers that will influence the predictions
print("Feature names: \n{}".format(iris_dataset['feature_names']))

Feature names: 
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [9]:
#Here we can see that the data has already been preprocessed into a suitable
#ndarray format
print("Type of data: {}".format(type(iris_dataset['data'])))

Type of data: <class 'numpy.ndarray'>


In [10]:
#And that there are 150 samples in the dataset, 
#represented by the 150 rows
print("Shape of data: {}".format(iris_dataset['data'].shape))

Shape of data: (150, 4)


In [11]:
#You can slice from the beginning of the dataset to check out the first rows
print("First five rows of data:\n{}".format(iris_dataset['data'][:5]))

First five rows of data:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
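The raw array above carries no column labels. As a minimal sketch, assuming pandas is available (it is imported at the top of the notebook), the same data can be wrapped in a DataFrame so each column shows its feature name. The variable name `iris_df` is introduced here for illustration only.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Wrap the raw ndarray in a DataFrame so each column carries its feature name
# (iris_df is a name chosen for this sketch, not part of the original notebook)
iris_df = pd.DataFrame(
    iris_dataset['data'], columns=iris_dataset['feature_names']
)

print(iris_df.head())      # first five rows, labelled by feature
print(iris_df.describe())  # per-feature summary statistics
```

Each row is one flower and each column is one measurement in centimetres, which matches the `(150, 4)` shape printed earlier.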


In [12]:
#And confirm that the labels are also stored in a NumPy ndarray
#(as integer class codes, not one-hot vectors)
print("Type of target: {}".format(type(iris_dataset['target'])))

Type of target: <class 'numpy.ndarray'>


In [13]:
#And that there are also 150 of them
print("Shape of target: {}".format(iris_dataset['target'].shape))

Shape of target: (150,)


In [14]:
#0 means setosa, 1 means versicolor, and 2 means virginica.
print("Target:\n{}".format(iris_dataset['target']))

Target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
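The printout shows that the labels are sorted by class, which is why the data needs to be shuffled before it is split (`train_test_split` shuffles by default). A short sketch of two quick checks on the label array follows: counting the samples per class, and translating the integer codes back into names. The variable names `class_counts` and `target_as_names` are chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()
target = iris_dataset['target']

# Count how many samples carry each integer label (0, 1, 2)
class_counts = np.bincount(target)
print("Samples per class: {}".format(class_counts))

# Index the names array with the integer labels to get one name per sample
target_as_names = iris_dataset['target_names'][target]
print("First label: {}".format(target_as_names[0]))
print("Last label: {}".format(target_as_names[-1]))
```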


In [15]:
#Split the dataset into training and test data so the model can be
#evaluated on samples it was not trained on
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0
)

In [16]:
#Confirm that the splits have been done
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))

X_train shape: (112, 4)
y_train shape: (112,)
X_test shape: (38, 4)
y_test shape: (38,)
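The 112/38 split follows from `train_test_split`'s default `test_size` of 0.25 (150 × 0.25 = 37.5, rounded up to 38). The sketch below writes that default out explicitly and, as an addition that is not part of the original notebook, passes `stratify` so that each class appears in roughly equal proportion in both sets. The names `X_tr`, `X_te`, `y_tr` and `y_te` are used to avoid overwriting the notebook's own variables.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()

# test_size=0.25 is spelled out here; it matches the default used when omitted.
# stratify=... is an illustrative addition, not part of the original split.
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_dataset['data'], iris_dataset['target'],
    test_size=0.25, stratify=iris_dataset['target'], random_state=0
)

print("Train class counts: {}".format(np.bincount(y_tr)))
print("Test class counts:  {}".format(np.bincount(y_te)))
```

Stratifying matters more on small or imbalanced datasets, where a purely random split can leave one class under-represented in the test set.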


In [18]:
#Instantiate the estimator class to get the KNN model
from sklearn.neighbors import KNeighborsClassifier
#Set the n_neighbors parameter to 1 to use
#the model in its simplest form
knn = KNeighborsClassifier(n_neighbors=1)

In [19]:
#Build the model on the training data only
knn.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=1)
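For a nearest-neighbour model, fitting mostly amounts to storing the training data; the distance computations happen at prediction time. To make that concrete, here is a minimal from-scratch sketch of 1-nearest-neighbour prediction using Euclidean distance (scikit-learn's default metric for this estimator). The helper `predict_1nn` is hypothetical and written for illustration; it is not how scikit-learn implements the search internally.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def predict_1nn(X_train, y_train, X_query):
    """Hypothetical helper: give each query row the label of its closest training row."""
    # Pairwise differences via broadcasting: shape (n_query, n_train, n_features)
    diffs = X_query[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    # Squared Euclidean distances: shape (n_query, n_train)
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Index of the closest training sample for each query row
    nearest = np.argmin(sq_dists, axis=1)
    return y_train[nearest]


iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0
)

# A single made-up flower measurement, shaped (1, 4)
X_query = np.array([[5, 2.9, 1, 0.2]])
manual_prediction = predict_1nn(X_train, y_train, X_query)
print("Manual 1-NN prediction: {}".format(manual_prediction))
```

Squared distances are enough here because taking the square root does not change which training sample is closest.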

In [20]:
#Make a prediction using the learned model on new, never-before-seen data
X_new = np.array([[5, 2.9, 1, 0.2]])
print("X_new.shape: {}".format(X_new.shape))

X_new.shape: (1, 4)


In [21]:
prediction = knn.predict(X_new)
print("Prediction: {}".format(prediction))
print("Predicted target name: {}".format(
    iris_dataset['target_names'][prediction]))

Prediction: [0]
Predicted target name: ['setosa']
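To see why the model chose this class, the fitted estimator's `kneighbors` method can report which training sample was closest and how far away it was. The sketch below rebuilds the model so that it runs on its own; `neighbour_row` is a name introduced for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0
)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

X_new = np.array([[5, 2.9, 1, 0.2]])

# kneighbors returns the distances to, and row indices (into X_train) of,
# the n_neighbors closest training samples for each query row
distances, indices = knn.kneighbors(X_new)
neighbour_row = indices[0, 0]

print("Nearest training sample: {}".format(X_train[neighbour_row]))
print("Distance to it: {:.3f}".format(distances[0, 0]))
print("Its label: {}".format(iris_dataset['target_names'][y_train[neighbour_row]]))
```

With `n_neighbors=1`, the predicted class is simply the label of that single closest training sample.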


In [22]:
#Evaluate the model on the test data
y_pred = knn.predict(X_test)
print("Test set predictions:\n {}".format(y_pred))

Test set predictions:
 [2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 2]
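A list of predicted labels is hard to judge by eye. One way to compare the predictions with the true labels class by class is a confusion matrix, sketched below with `sklearn.metrics.confusion_matrix`. This step is an addition for illustration and is not part of the original notebook; `cm` and `wrong` are names chosen here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0
)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Rows are true classes, columns are predicted classes;
# correct predictions sit on the diagonal
cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix:\n{}".format(cm))

# Positions in the test set where the prediction disagrees with the true label
wrong = np.flatnonzero(y_pred != y_test)
print("Misclassified test indices: {}".format(wrong))
```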


In [23]:
#Compute the accuracy manually as the fraction of test flowers
#the model classified correctly
print("Test set score: {:.2f}".format(np.mean(y_pred == y_test)))

Test set score: 0.97


In [24]:
#Or use knn's score method
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))

Test set score: 0.97
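The notebook fixes `n_neighbors=1` for simplicity. As a hedged sketch of how other values could be compared without touching the test set, the example below runs 5-fold cross-validation on the training data for a few candidate values of k. The candidate list and the names `candidate_ks` and `mean_scores` are assumptions made for illustration, and no particular value is claimed to be best.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0
)

# Hypothetical candidate values for n_neighbors, chosen for illustration
candidate_ks = [1, 3, 5, 7, 9]
mean_scores = {}

for k in candidate_ks:
    # 5-fold cross-validation on the training data only
    scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5
    )
    mean_scores[k] = scores.mean()
    print("k={}: mean CV accuracy {:.3f}".format(k, mean_scores[k]))
```

Keeping the test set out of this comparison means the final test score still reflects data the model selection never saw.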
