We start by importing the pandas and NumPy libraries.

In [1]:
import pandas as pd
import numpy as np

The dataset used here is the Iris dataset, which is included in scikit-learn in the datasets module.

In [2]:
from sklearn.datasets import load_iris
iris_dataset = load_iris()


Next, we print the keys of the object returned by load_iris.



In [3]:
print("Keys of iris_dataset: \n{}".format(iris_dataset.keys()))

Keys of iris_dataset: 
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
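
The object returned by load_iris is a Bunch, a dictionary subclass whose keys can also be read as attributes. The following is a minimal sketch of the two equivalent access styles; the variable name `iris` is chosen only for illustration, and the exact set of keys may differ between scikit-learn versions.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Dictionary-style and attribute-style access return the same array object.
data_by_key = iris["data"]
data_by_attr = iris.data
same_object = data_by_key is data_by_attr
print("Same array:", same_object)

# The keys can be listed like those of any dictionary.
print(sorted(iris.keys()))
```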


The value of the key 'DESCR' is a short description of the dataset. Here we print its first 200 characters.

In [5]:
val = iris_dataset['DESCR']
start_val = val[:200]
print(start_val + "\n")

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive
...


The value of the key 'target_names' is an array of strings containing the species of flower that we want to predict: 'setosa', 'versicolor', and 'virginica'.



In [6]:
print("Target names: {}".format(iris_dataset['target_names']))


Target names: ['setosa' 'versicolor' 'virginica']


The value of 'feature_names' is a list of strings describing each feature: sepal length, sepal width, petal length, and petal width, all measured in centimeters.

In [7]:
print("Feature names: \n{}".format(iris_dataset['feature_names']))

Feature names: 
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


The data itself is stored under the key 'data' as a NumPy array holding the numeric measurements of sepal length, sepal width, petal length, and petal width.

In [8]:
print("Type of data: {}".format(type(iris_dataset['data'])))

Type of data: <class 'numpy.ndarray'>


The rows in this data array correspond to the flowers, while the columns represent the four measurements that were taken for each flower.

In [9]:
print("Shape of data: {}".format(iris_dataset['data'].shape))

Shape of data: (150, 4)


Here, the feature values of the first five samples are printed.

In [11]:
print("First five rows of data:\n{}".format(iris_dataset['data'][:5]))

First five rows of data:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
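
Because the raw array carries no column labels, it can be easier to inspect the rows as a pandas DataFrame. The sketch below is one possible way to do this; the names `iris`, `iris_df`, and the extra `species` column are illustrative choices, not part of the dataset.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Label each column with its feature name for easier inspection.
iris_df = pd.DataFrame(iris["data"], columns=iris["feature_names"])

# Add the species name of each flower as an extra (illustrative) column.
iris_df["species"] = iris["target_names"][iris["target"]]

print(iris_df.head())
```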


The last key we look at is 'target', which contains the species of each of the flowers that were measured, also stored as a NumPy array.

In [12]:
print("Type of target: {}".format(type(iris_dataset['target'])))

Type of target: <class 'numpy.ndarray'>


The target is a one-dimensional array with one entry per flower.

In [14]:
print("Shape of target: {}".format(iris_dataset['target'].shape))

Shape of target: (150,)


The species are encoded as integers from 0 to 2: 0 means setosa, 1 means versicolor, and 2 means virginica.

In [15]:
print("Target:\n{}".format(iris_dataset['target']))

Target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
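
The integer codes can be turned back into species names by indexing the target_names array with the target array. The sketch below also counts the flowers per class with np.bincount; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Indexing the array of names with the integer codes decodes every label.
species = iris["target_names"][iris["target"]]
print(species[:3])

# Count how many flowers belong to each encoded class (0, 1, 2).
counts = np.bincount(iris["target"])
for name, count in zip(iris["target_names"], counts):
    print(name, count)
```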


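In scikit-learn, the data is usually denoted with a capital X, while the labels are denoted by a lowercase y. Before making the split, the train_test_split function shuffles the dataset using a pseudorandom number generator; passing random_state=0 fixes the seed so that the split is the same on every run. By default, 75% of the rows go to the training set and the remaining 25% to the test set.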

In [19]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)


The output of the train_test_split function is X_train, X_test, y_train, and y_test, which are all NumPy arrays.

In [20]:
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))

X_train shape: (112, 4)
y_train shape: (112,)
X_test shape: (38, 4)
y_test shape: (38,)
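
The 112 and 38 rows shown above follow from the default test fraction of 25% of the 150 flowers. The sketch below, with illustrative variable names, checks two properties of the split: the same random_state reproduces the same split, and setting test_size=0.25 explicitly matches the default.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris["data"], iris["target"]

# Two calls with the same seed produce identical splits.
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(X, y, random_state=0)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X, y, random_state=0)
same_split = bool(np.array_equal(X_test_a, X_test_b)
                  and np.array_equal(y_test_a, y_test_b))
print("Identical splits:", same_split)

# By default 25% of the rows are held out; test_size makes this explicit.
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X, y, test_size=0.25, random_state=0)
print(X_train_c.shape, X_test_c.shape)
```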


Here, we import KNeighborsClassifier and instantiate it with n_neighbors=1, so each prediction uses only the single closest training point. The knn object will also hold the information that the algorithm extracts from the training data.

In [21]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)

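To build the model on the training set, we call the fit method of the knn object, which takes as arguments the NumPy array X_train containing the training data and the NumPy array y_train of the corresponding training labels. The fit method returns the knn object itself, so the notebook displays its string representation.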

In [23]:
knn.fit(X_train, y_train)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')

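We can now make predictions using the model on new data. The new sample is a single flower with a sepal length of 5 cm, a sepal width of 2.9 cm, a petal length of 1 cm, and a petal width of 0.2 cm. scikit-learn expects a two-dimensional array, so the measurements are placed in a single row of a 2D NumPy array.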

In [24]:
X_new = np.array([[5, 2.9, 1, 0.2]])
print("X_new.shape: {}".format(X_new.shape))

X_new.shape: (1, 4)
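
The shape (1, 4) means one sample with four features. If the measurements start out as a flat, one-dimensional array, reshape(1, -1) gives the same two-dimensional layout. A small sketch with illustrative variable names:

```python
import numpy as np

# A single flower as a flat, one-dimensional array of four measurements.
flower = np.array([5, 2.9, 1, 0.2])
print(flower.shape)

# reshape(1, -1) turns it into one row with as many columns as needed.
X_single = flower.reshape(1, -1)
print(X_single.shape)
```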


In [25]:
prediction = knn.predict(X_new)

print("Prediction: {}".format(prediction))
print("Predicted target name: {}".format(
    iris_dataset['target_names'][prediction]))

Prediction: [0]
Predicted target name: ['setosa']
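
With n_neighbors=1, the prediction is the label of the single closest training point. The following is a rough NumPy sketch of that rule, assuming Euclidean distance; the helper `predict_one_nearest_neighbor` is hypothetical and is not how scikit-learn implements the search internally.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris["data"], iris["target"], random_state=0)


def predict_one_nearest_neighbor(X_train, y_train, X_query):
    """Hypothetical helper: label each query row like its closest training row."""
    # Pairwise Euclidean distances, shape (n_query, n_train).
    diffs = X_query[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    # Index of the closest training point for every query row.
    nearest = distances.argmin(axis=1)
    return y_train[nearest]


X_query = np.array([[5, 2.9, 1, 0.2]])
manual_prediction = predict_one_nearest_neighbor(X_train, y_train, X_query)
print(iris["target_names"][manual_prediction])
```

When several training points are equally close, this sketch picks the first one, which may differ from the choice scikit-learn makes.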


Here, we make a prediction for each iris in the test data and compare it against its label (the known species). We can measure how well the model works by computing the accuracy, which is the fraction of flowers for which the right species was predicted.

In [27]:
y_pred = knn.predict(X_test)
print("Test set predictions:\n {}".format(y_pred))

Test set predictions:
 [2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 2]


In [28]:
print("Test set score: {}".format(np.mean(y_pred == y_test)))

Test set score: 0.9736842105263158
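
The same accuracy can also be obtained from the score method of the classifier or from accuracy_score in sklearn.metrics. The sketch below rebuilds the model so that it runs on its own and compares the three computations; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris["data"], iris["target"], random_state=0)

# fit returns the estimator itself, so the calls can be chained.
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Three ways of computing the fraction of correctly predicted flowers.
manual_accuracy = np.mean(y_pred == y_test)
score_accuracy = knn.score(X_test, y_test)
metric_accuracy = accuracy_score(y_test, y_pred)

print(manual_accuracy, score_accuracy, metric_accuracy)
```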
