Importing the necessary libraries

In [74]:

import pandas as pd
import matplotlib.pyplot as plt

Reading the data from a csv file into a pandas dataframe

In [75]:
flowers = pd.read_csv('https://raw.githubusercontent.com/avinashjairam/avinashjairam.github.io/master/Iris.csv')

Dropping the ID Column - We don't need it for this exercise.

In [76]:
flowers.drop(columns=['Id'], inplace=True)

Taking a peek at our dataset


In [77]:
flowers.head()

   SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0            5.1           3.5            1.4           0.2  Iris-setosa
1            4.9           3.0            1.4           0.2  Iris-setosa
2            4.7           3.2            1.3           0.2  Iris-setosa
3            4.6           3.1            1.5           0.2  Iris-setosa
4            5.0           3.6            1.4           0.2  Iris-setosa


Looking at the shape of our dataset - It has 150 rows and 5 columns (4 features plus the Species label)




In [78]:
flowers.shape

(150, 5)

Inspecting what data types are stored in the dataframe. It looks like we have 4 columns of floats and 1 column of strings.

In [79]:
flowers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   SepalLengthCm  150 non-null    float64
 1   SepalWidthCm   150 non-null    float64
 2   PetalLengthCm  150 non-null    float64
 3   PetalWidthCm   150 non-null    float64
 4   Species        150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


Separating the features from the labels

In [80]:
#Note the use of .values
#.values is used to extract the dataframe values to a NumPy array
X = flowers[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']].values

In [81]:
y = flowers[['Species']].values
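Selecting with double brackets, as in the cell above, returns a 2-D array with one column, which is why the later cells call `.ravel()` before fitting. The sketch below shows the difference on a tiny stand-in dataframe; the frame `df` and its values are made up for illustration and are not rows from the Iris file.

```python
import pandas as pd

# Tiny stand-in frame; the values are illustrative only.
df = pd.DataFrame({
    "PetalLengthCm": [1.4, 4.7, 6.0],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

y_2d = df[["Species"]].values   # list of columns -> DataFrame -> 2-D array
y_1d = df["Species"].values     # single column   -> Series    -> 1-D array

print(y_2d.shape)           # (3, 1)
print(y_1d.shape)           # (3,)
print(y_2d.ravel().shape)   # (3,) - ravel() flattens the column into 1-D
```

Either form works here as long as the labels reach the classifier as a 1-D array.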

Splitting the Dataset into Training and Test Sets


In [82]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 0)

Let's take another look at the dataset. We can see that the sepal length and petal width values differ considerably in magnitude (5.1 cm versus 0.2 cm in the first row, for example). As a result, the larger values may have more of an impact on our model, especially a distance-based one such as KNN. Hence, we need to rescale our data.

Feature Scaling - One method to rescale data is to standardize it. 

To standardize a dataset means to scale the values of each feature (column) such that its mean is 0 and its standard deviation is 1.


In [83]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
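To make the definition of standardization concrete, here is a minimal sketch that compares `StandardScaler` with the same calculation done by hand. The array `X_demo` is a made-up example with two features on different scales, not data from the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative values only: two features on different scales.
X_demo = np.array([[5.1, 0.2],
                   [4.9, 0.2],
                   [6.3, 1.8],
                   [5.8, 2.2]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)

# The same thing by hand: subtract each column's mean and divide by
# its standard deviation (population std, i.e. ddof=0).
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(X_scaled, X_manual))          # the two results agree
print(np.allclose(X_scaled.mean(axis=0), 0.0))  # each column has mean 0
print(np.allclose(X_scaled.std(axis=0), 1.0))   # each column has std 1
```

Note that the test set is only transformed, never fitted, so it is scaled with the mean and standard deviation learned from the training set.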

Fitting (Training) the model to the training dataset

In [85]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.model_selection import KFold 
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

#Note: here we are calling our KNN model knn. In the Titanic example, we called our model lr. The name is your choice.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train.ravel())

#We are using a Gaussian distribution for the Naive Bayes classifier hence we chose the GaussianNB model. 
#For a list of other distributions, see here: https://scikit-learn.org/stable/modules/naive_bayes.html
gnb = GaussianNB()
gnb.fit(X_train, y_train.ravel())

GaussianNB()

Performing 5-Fold Cross Validation for KNN

In [86]:
k = 5
kf = KFold(n_splits=k)

#for other scoring parameters see here https://scikit-learn.org/stable/modules/model_evaluation.html
result = cross_val_score(knn, X_train, y_train.ravel(), cv = kf, scoring='accuracy')
 
print(f' Avg accuracy:{result.mean()}')

 Avg accuracy:0.9523809523809523


Performing 5-Fold Cross Validation for GNB

In [87]:
#for other scoring parameters see here https://scikit-learn.org/stable/modules/model_evaluation.html
result = cross_val_score(gnb, X_train, y_train.ravel(), cv = kf, scoring='accuracy')
 
print(f' Avg accuracy:{result.mean()}')

 Avg accuracy:0.9428571428571428
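One caveat about the two cross-validation cells above: the scaler was fitted on all of X_train before the folds were drawn, so each validation fold had already contributed to the scaling statistics. A common way to avoid this is to put the scaler and the model in a pipeline, so the scaler is refitted on the training part of every fold. The sketch below assumes scikit-learn's bundled copy of Iris (`load_iris`) instead of the CSV used in this notebook, and the names `knn_pipe`, `X_raw` and `y_raw` are illustrative; the scores it prints may differ from the ones shown here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled Iris data stands in for the CSV file.
X_raw, y_raw = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.30, random_state=0
)

# The pipeline scales and then classifies; cross_val_score clones and
# refits the whole pipeline (scaler included) on each training fold.
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

scores = cross_val_score(knn_pipe, X_tr, y_tr, cv=KFold(n_splits=5), scoring="accuracy")
print(f"Avg accuracy: {scores.mean()}")
```

On a dataset as small and clean as Iris the difference is usually minor, but the pipeline form gives a more honest estimate in general.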


Performing Stratified 10-Fold Cross Validation for KNN

In [88]:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=32)
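To see what stratification changes, the sketch below counts how many samples of each class land in every validation fold for plain `KFold` and for `StratifiedKFold`. The labels `y_demo` are a toy example (three classes of ten samples each, sorted by class), and the helper `validation_class_counts` is hypothetical, written only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels sorted by class: 10 samples of each of three classes.
y_demo = np.repeat([0, 1, 2], 10)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features; only the labels matter here

def validation_class_counts(splitter):
    """Return the per-class sample counts of each validation fold."""
    return [
        np.bincount(y_demo[val_idx], minlength=3).tolist()
        for _, val_idx in splitter.split(X_demo, y_demo)
    ]

plain_counts = validation_class_counts(KFold(n_splits=5))
strat_counts = validation_class_counts(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=32)
)

print(plain_counts)  # unshuffled KFold on sorted labels: folds dominated by one class
print(strat_counts)  # stratified: every fold holds 2 samples of each class
```

Stratified folds keep the class proportions of the full training set in every fold, which matters most when classes are imbalanced or the data is ordered by label.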


In [89]:
result = cross_val_score(knn, X_train, y_train.ravel(), cv = skf, scoring='accuracy')
 
print(f' Avg accuracy:{result.mean()}')

 Avg accuracy:0.9636363636363636


Performing Stratified 10-Fold Cross Validation for GNB

In [90]:
#Note: we don't need to import the StratifiedKFold class again as it was done previously
result = cross_val_score(gnb, X_train, y_train.ravel(), cv = skf, scoring='accuracy')
 
print(f' Avg accuracy:{result.mean()}')

 Avg accuracy:0.9354545454545455


KNN - Using the model to make predictions on the test dataset

In [91]:
# Predicting the Test set results
y_pred_knn = knn.predict(X_test)



GNB - Using the model to make predictions on the test dataset

In [92]:
# Predicting the Test set results
y_pred_gnb = gnb.predict(X_test)



KNN - Classification Report

In [93]:

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred_knn))

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        16
Iris-versicolor       1.00      0.94      0.97        18
 Iris-virginica       0.92      1.00      0.96        11

       accuracy                           0.98        45
      macro avg       0.97      0.98      0.98        45
   weighted avg       0.98      0.98      0.98        45
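The numbers in the report above can be reproduced by hand from a confusion matrix. The matrix below is not printed anywhere in this notebook; it is inferred from the report (one Iris-versicolor flower predicted as Iris-virginica, everything else correct), so treat it as an assumption. The sketch uses only NumPy.

```python
import numpy as np

labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Assumed confusion matrix inferred from the report
# (rows = true class, columns = predicted class).
cm = np.array([
    [16,  0,  0],
    [ 0, 17,  1],
    [ 0,  0, 11],
])

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)  # correct / everything predicted as that class
recall = true_pos / cm.sum(axis=1)     # correct / everything truly in that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()

for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name:>15}  precision={p:.2f}  recall={r:.2f}  f1-score={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

For example, Iris-virginica precision is 11 / 12 and Iris-versicolor recall is 17 / 18, which round to the 0.92 and 0.94 shown in the report.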



GNB - Classification Report

In [95]:
#Note - no need to import the classification_report function again as it was done in the cell above
print(classification_report(y_test, y_pred_gnb))

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        16
Iris-versicolor       1.00      1.00      1.00        18
 Iris-virginica       1.00      1.00      1.00        11

       accuracy                           1.00        45
      macro avg       1.00      1.00      1.00        45
   weighted avg       1.00      1.00      1.00        45

