Importing libraries


In [2]:
import pandas as pd
import numpy as np

Fetching and loading the dataset.
The load_iris function returns a Bunch object, which behaves much like a dictionary: it stores the data as key-value pairs.

In [3]:
from sklearn.datasets import load_iris
iris_dataset = load_iris()
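
Because the returned object is dictionary-like, each value can be read with the usual square-bracket syntax. As a small sketch (the variable name same_array is only for illustration), the same value can also be reached as an attribute:

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# The object returned by load_iris is a scikit-learn Bunch (a dict subclass).
print(type(iris_dataset).__name__)

# Key access and attribute access return the very same array object.
same_array = iris_dataset['data'] is iris_dataset.data
print(same_array)
```

The rest of this notebook sticks to the square-bracket form.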

Let's look at the names of the keys in iris_dataset.



In [5]:
print("Keys of iris_dataset: \n{}".format(iris_dataset.keys()))

Keys of iris_dataset: 
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])


The key 'DESCR' is short for "description". It holds a text description of the dataset. The value is a string, so we can slice it to print only the first 200 characters.

In [8]:
print(iris_dataset['DESCR'][:200] + "\n....")

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive
....


The key 'target_names' holds an array of strings: the species of flower that we want to predict, namely 'setosa', 'versicolor' and 'virginica'.

In [9]:
print("Target names: {}".format(iris_dataset['target_names']))

Target names: ['setosa' 'versicolor' 'virginica']


Next we have the key 'feature_names', which is a list of strings naming each feature.

In [10]:
print("Feature names: \n{}".format(iris_dataset['feature_names']))

Feature names: 
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


Next is the key 'data', a 2-dimensional array of 150 rows and 4 columns. Each row is one flower and each column corresponds to one entry of feature_names (sepal length, sepal width, petal length, petal width). The following code confirms this.

In [11]:
print("Type of data: {}".format(type(iris_dataset['data'])))

Type of data: <class 'numpy.ndarray'>


In [12]:
print("Shape of data: {}".format(iris_dataset['data'].shape))

Shape of data: (150, 4)


Now let's look at the data itself. I will print only the first five rows.

In [13]:
print("First five rows of data:\n{}".format(iris_dataset['data'][:5]))

First five rows of data:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]


As you can see, we have 5 flower instances, one per row.
For the first flower:
5.1 = sepal length, 3.5 = sepal width, 1.4 = petal length and 0.2 = petal width, all in cm.
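
To see the values next to their column names, one option is to wrap the array in a pandas DataFrame. This is only a sketch for inspection; the name iris_df is made up here and is not used in the rest of the notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Label each of the 4 columns with the matching entry of feature_names.
iris_df = pd.DataFrame(iris_dataset['data'],
                       columns=iris_dataset['feature_names'])

# Show the first five flowers with their column headers.
print(iris_df.head())
```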

---



Finally, we have the key 'target', which is also an ndarray. It is a 1-dimensional array with one entry per flower. The following code confirms this.

In [14]:
print("Type of target: {}".format(type(iris_dataset['target'])))

Type of target: <class 'numpy.ndarray'>


In [15]:
print("Shape of target: {}".format(iris_dataset['target'].shape))

Shape of target: (150,)


Let's look at the data itself. The values are integers from 0 to 2, which are indices into target_names: 0 = setosa, 1 = versicolor, 2 = virginica.

In [16]:
print("Target:\n{}".format(iris_dataset['target']))

Target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
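
As a small sketch of how the integer codes relate to the species names, the target array can be used to index target_names directly. The names species and counts are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Use each integer label as an index into target_names.
species = iris_dataset['target_names'][iris_dataset['target']]
print(species[:3])

# Count how many flowers carry each label (0, 1 and 2).
counts = np.bincount(iris_dataset['target'])
print(counts)
```

The counts agree with the dataset description above: 50 flowers in each of the three classes.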


Preparing data for training and testing

In [18]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris_dataset['data'],iris_dataset['target'],random_state=0)

Above, we split the data 75% : 25%, with 75% for training and 25% for testing. This is the default ratio of train_test_split when no test_size is given.

After we train our model on the 75%, we can test it on the remaining 25%.

train_test_split shuffles the rows before splitting. random_state=0 fixes the seed of that shuffle, so running the code several times gives the same shuffled split.
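
The following sketch checks both points: that a fixed random_state reproduces the split, and that writing test_size=0.25 explicitly gives the same sizes as the default. The names split_a, split_b and same_split are made up for this check.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X, y = iris_dataset['data'], iris_dataset['target']

# Two calls with the same seed should produce identical arrays.
split_a = train_test_split(X, y, random_state=0)
split_b = train_test_split(X, y, random_state=0)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)

# Spelling out the default ratio explicitly.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
print(X_train.shape, X_test.shape)
```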

---



The train_test_split function returns X_train, X_test, y_train and y_test, which are all NumPy arrays.

In [20]:
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))

X_train shape: (112, 4)
y_train shape: (112,)
X_test shape: (38, 4)
y_test shape: (38,)


Creating the k-nearest neighbors ML model. Here k is 1, so each prediction copies the label of the single closest training point.

In [21]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)

Train the model

In [22]:
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')
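
To give an idea of what a 1-nearest-neighbor prediction does, here is a minimal NumPy sketch on toy data. It is not scikit-learn's implementation; the function nearest_neighbor_predict and the toy arrays are hypothetical, and the Euclidean distance is assumed (matching the metric='minkowski', p=2 settings shown above).

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, X_new):
    """Label each row of X_new with the label of its closest training row."""
    # Pairwise Euclidean distances, shape (n_new, n_train).
    differences = X_new[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    distances = np.linalg.norm(differences, axis=2)
    # Index of the closest training point for every new point.
    nearest = np.argmin(distances, axis=1)
    return y_train[nearest]

# Toy data: two points of class 0 near the origin, one point of class 1.
toy_X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
toy_y = np.array([0, 0, 1])

toy_pred = nearest_neighbor_predict(toy_X, toy_y,
                                    np.array([[4.0, 4.5], [0.2, 0.1]]))
print(toy_pred)  # [1 0]
```

With k=1 there is no voting: the single closest training point decides the label.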

Now suppose we found a new flower in the forest and want to know which species it belongs to.

Before that, we have to put its measurements into a form that our ML model can understand.

In this case our ML model expects the data as a 2-dimensional array with one row per flower and one column per feature. The following code does that.

In [23]:
X_new = np.array([[5, 2.9, 1, 0.2]])  # the four features, in the same order as feature_names

print("X_new.shape: {}".format(X_new.shape))

X_new.shape: (1, 4)
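
If the measurements start out as a flat 1-dimensional array, reshape can turn them into the (1, 4) form. This is a sketch; the name single_flower is made up here.

```python
import numpy as np

# A flat array of four measurements has shape (4,).
single_flower = np.array([5, 2.9, 1, 0.2])
print(single_flower.shape)

# reshape(1, -1) means "one row, as many columns as needed".
X_new = single_flower.reshape(1, -1)
print(X_new.shape)
```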


Now predict the species of the flower

In [24]:
prediction = knn.predict(X_new)

print("Prediction: {}".format(prediction))
print("Prediction target name: {}".format(iris_dataset['target_names'][prediction]))

Prediction: [0]
Prediction target name: ['setosa']


We have got an answer, but how can we be sure that the prediction is correct?

To check how accurate our model's predictions are, we will use the test data that we set aside earlier (whose answers we already know).

We feed this test data into the model and get its predictions.

Then we compare the predictions with the actual answers stored in the variable "y_test" and compute the accuracy.

In [28]:
y_pred = knn.predict(X_test)
print("Test set predictions:\n {}".format(y_pred))

Test set predictions:
 [2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 2]


To measure the accuracy we use the mean function from the NumPy library. The example below shows what the calculation does.

For example -

prediction result (p) = [0,0,1,2,2]

actual result (a) = [0,0,0,2,2]

The == operator compares the two arrays element by element at the same index, like p[0] == a[0], p[1] == a[1] etc. Each match counts as 1 (True) and each mismatch as 0 (False).

comparison array = [1,1,0,1,1]

mean formula = sum of elements / number of elements
             = 4 / 5 = 0.8, which means the accuracy is 80%
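
The same worked example can be run directly. This sketch uses the arrays p and a from above; the names matches and accuracy are introduced only for illustration.

```python
import numpy as np

p = np.array([0, 0, 1, 2, 2])  # prediction result
a = np.array([0, 0, 0, 2, 2])  # actual result

# Element-wise comparison gives a boolean array.
matches = p == a
print(matches)

# np.mean treats True as 1 and False as 0, so this is 4 / 5.
accuracy = np.mean(matches)
print(accuracy)
```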

Now let's apply this to our test set.

In [29]:
print("Test set score: {}".format(np.mean(y_pred == y_test)))

Test set score: 0.9736842105263158
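
As a cross-check, scikit-learn classifiers also provide a score method that returns the mean accuracy on the given data. The sketch below rebuilds the model so that it runs on its own; the names manual_score and builtin_score are made up here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

# Accuracy computed by hand, as in the cell above.
manual_score = np.mean(knn.predict(X_test) == y_test)

# Accuracy computed by the classifier's own score method.
builtin_score = knn.score(X_test, y_test)

print(manual_score, builtin_score)
```

Both numbers should agree, since they measure the same fraction of correct predictions.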




---



If anything is unclear at any stage, contact me on
Instagram - https://www.instagram.com/pushpend3r/

Anyway, thanks for the bootcamp 🙏