<center><h2> Iris Dataset Classification </h2></center>
<center><h4> by Nickhil Tekwani </h4></center>

An introduction to machine-learning classification using the popular Iris dataset bundled with scikit-learn. The following classifiers are compared:
k-nearest neighbors (kNN), LinearSVC, Gaussian Naive Bayes, and Decision Tree

### Data

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# load the Iris data into a DataFrame: four feature columns plus a numeric target column
def iris_dataset():
    iris = load_iris()
    data1 = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                         columns=iris['feature_names'] + ['target'])
    return data1
df = iris_dataset()
df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0.0
1,4.9,3.0,1.4,0.2,0.0
2,4.7,3.2,1.3,0.2,0.0
3,4.6,3.1,1.5,0.2,0.0
4,5.0,3.6,1.4,0.2,0.0


### Exploration and Wrangling

In [2]:
# function to convert target value to corresponding iris plant
target_dict = {0:"Iris-setosa", 1: "Iris-versicolor", 2: "Iris-virginica"}
def lookup_label(target_value):
    return target_dict[target_value]
lookup_label(0)

'Iris-setosa'

In [3]:
# function to return the number of samples of each class in the dataset
# (adds an "iris_type" column to df, reusing lookup_label from the previous cell)
def case_distribution(df):
    df["iris_type"] = df["target"].apply(lookup_label)
    return df["iris_type"].value_counts()
case_distribution(df)

Iris-virginica     50
Iris-versicolor    50
Iris-setosa        50
Name: iris_type, dtype: int64

In [4]:
# convert the float target column to integers before splitting off the features

df["target"] = pd.to_numeric(df["target"], downcast="integer")

# function that extracts and returns a (features, target) tuple from the df
def features_and_target(df):
    feature_columns = ["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"]
    return (df[feature_columns], df["target"])
features, target = features_and_target(df)

### Application and Evaluation of Classifiers
kNN, LinearSVC, Naive Bayes, and Decision Tree

In [5]:
# estimators
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

estimators = {"k-Nearest Neighbor": KNeighborsClassifier(), 
              "Support Vector Machine": LinearSVC(), 
              "Gaussian Naive Bayes": GaussianNB(), 
              "Decision Tree": DecisionTreeClassifier()}

In [6]:
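# fit each of the four classifiers on a single train/test split
# (train_test_split holds out 25% of the samples for testing by default) and report test accuracy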
def classifiers_percentage_split():
    X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=3000)
    for name, clf in estimators.items():
        accuracy = clf.fit(X=X_train, y=y_train).score(X_test, y_test)
        print(name + "\n \t Prediction accuracy on the test data: " + format(accuracy*100, ".2f") + "%" + "\n")
classifiers_percentage_split()

k-Nearest Neighbor
 	 Prediction accuracy on the test data: 94.74%

Support Vector Machine
 	 Prediction accuracy on the test data: 89.47%

Gaussian Naive Bayes
 	 Prediction accuracy on the test data: 89.47%

Decision Tree
 	 Prediction accuracy on the test data: 84.21%
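
The accuracies above are easier to read once the size of the test set is known. The sketch below is a back-of-the-envelope check, not part of the original notebook. It assumes that the default 25% hold-out is rounded up to a whole number of samples, which agrees with the 112 training rows reported later in the notebook. The correct-prediction counts (36, 34, 32) are inferred from the printed percentages rather than taken from the notebook's output.

```python
import math

n_samples = 150                          # size of the Iris dataset
n_test = math.ceil(0.25 * n_samples)     # assumed rounding of the default 25% hold-out
n_train = n_samples - n_test

# Inferred counts of correct predictions that reproduce the printed accuracies.
inferred_correct = {
    "k-Nearest Neighbor": 36,
    "Support Vector Machine": 34,
    "Gaussian Naive Bayes": 34,
    "Decision Tree": 32,
}
accuracies = {name: f"{c / n_test:.2%}" for name, c in inferred_correct.items()}

print(n_train, n_test)
for name, acc in accuracies.items():
    print(f"{name}: {acc}")
```

With only 38 test samples, each misclassification moves the accuracy by about 2.6 percentage points, so the gaps between classifiers on this single split correspond to just one or two samples.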

In [7]:
# evaluates the same four classifiers with shuffled 10-fold cross-validation
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

def classifiers_cross_validation():
    kf = KFold(n_splits = 10, shuffle=True, random_state=3000)
    for name, clf in estimators.items():
        scores = cross_val_score(estimator=clf, X=features, y=target, cv=kf)
        print(f"{name}: \n\t mean accuracy: {scores.mean():.2%}, standard deviation={scores.std():.2%} \n")
classifiers_cross_validation()

k-Nearest Neighbor: 
	 mean accuracy: 96.67%, standard deviation=4.47% 

Support Vector Machine: 
	 mean accuracy: 94.67%, standard deviation=5.81% 

Gaussian Naive Bayes: 
	 mean accuracy: 95.33%, standard deviation=6.00% 

Decision Tree: 
	 mean accuracy: 95.33%, standard deviation=6.70% 
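
To make the cross-validation procedure above concrete, here is a minimal standard-library sketch of how a shuffled 10-fold split partitions 150 samples. `k_fold_indices` is a hypothetical helper written for illustration. It does not reproduce scikit-learn's shuffling, so its folds will differ from those produced by `KFold(random_state=3000)`.

```python
import random

def k_fold_indices(n_samples, n_splits, seed):
    """Shuffle 0..n_samples-1 and deal the indices into n_splits test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding through the shuffled list gives folds whose sizes differ by at most one.
    return [indices[i::n_splits] for i in range(n_splits)]

folds = k_fold_indices(n_samples=150, n_splits=10, seed=3000)

train_sizes = []
for test_idx in folds:
    held_out = set(test_idx)
    # Every sample not in the current test fold is used for training in that round.
    train_idx = [i for i in range(150) if i not in held_out]
    train_sizes.append(len(train_idx))

fold_sizes = [len(fold) for fold in folds]
print(fold_sizes)
print(train_sizes)
```

Each sample is held out exactly once, so every reported mean is an average over ten 15-sample test folds. That small fold size is one reason the standard deviations above are several percentage points wide.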



### kNN from Scratch

In [8]:
# split the data
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=3000)

# calculate euclidean distance between 2 vectors
import math
def euclidean_distance(row1, row2):
    dist = 0
    for x, y in zip(row1, row2):
        dist += (x-y) ** 2
    return math.sqrt(dist)
    
# function to find the distance between each row in the training set and a given test row
def calculate_distances(test_row, X_train):
    output = []
    for i in range(len(X_train)):
        # select the i-th training row by position
        row = X_train.iloc[i]
        distance = euclidean_distance(test_row, row)
        output.append(distance)
    return output
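
As a quick sanity check on the distance function, the sketch below applies the same loop to two training rows that appear in this notebook's tables (indices 89 and 94) and compares the result with `math.dist` from the standard library (Python 3.8+). The choice of these two rows is only for illustration.

```python
import math

def euclidean_distance(row1, row2):
    # Sum the squared differences feature by feature, then take the square root.
    dist = 0
    for x, y in zip(row1, row2):
        dist += (x - y) ** 2
    return math.sqrt(dist)

row_89 = (5.5, 2.5, 4.0, 1.3)   # sepal length, sepal width, petal length, petal width
row_94 = (5.6, 2.7, 4.2, 1.3)

d = euclidean_distance(row_89, row_94)
# Squared differences: 0.01 + 0.04 + 0.04 + 0.00 = 0.09, so the distance is about 0.3.
print(round(d, 6))
print(math.isclose(d, math.dist(row_89, row_94)))
```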

In [9]:
# test calculate distances

test_row = X_test.iloc[0,:]
test_row_label = y_test.iloc[0]
distances = calculate_distances(test_row, X_train)
len(distances)

112

In [10]:
# add X_train to a new dataframe along with the distances calculated above, and the actual target labels from the original data

# assign() method adds a new column, assigns distances to this column and returns a new df
df_distances = X_train.assign(distance = distances) 

# add the actual labels as well
df_distances["label"] = y_train

df_distances.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),distance,label
143,6.8,3.2,5.9,2.3,2.760435,2
82,5.8,2.7,3.9,1.2,0.632456,1
12,4.8,3.0,1.4,0.1,2.861818,0
137,6.4,3.1,5.5,1.8,2.078461,2
100,6.3,3.3,6.0,2.5,2.681418,2


In [11]:
# function that sorts the above df_distances by the distance column in ascending order
def sort_by_distance(df_distances, column_name):
    return df_distances.sort_values(column_name)
df_distances = sort_by_distance(df_distances, "distance")
df_distances

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),distance,label
89,5.5,2.5,4.0,1.3,0.387298,1
94,5.6,2.7,4.2,1.3,0.509902,1
53,5.5,2.3,4.0,1.3,0.519615,1
80,5.5,2.4,3.8,1.1,0.529150,1
69,5.6,2.5,3.9,1.1,0.538516,1
...,...,...,...,...,...,...
105,7.6,3.0,6.6,2.1,3.691883,2
122,7.7,2.8,6.7,2.0,3.802631,2
131,7.9,3.8,6.4,2.0,3.887158,2
117,7.7,3.8,6.7,2.2,3.992493,2


In [12]:
# function to find the k-closest samples in the training set to a given test row instance
def find_neighbors(df_distances, k):
    return df_distances.iloc[:k]
df_knn = find_neighbors(df_distances, 3)
df_knn

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),distance,label
89,5.5,2.5,4.0,1.3,0.387298,1
94,5.6,2.7,4.2,1.3,0.509902,1
53,5.5,2.3,4.0,1.3,0.519615,1


In [13]:
# function that performs a majority vote to determine the predicted label associated with a test sample
def majority_vote(df_knn, column_name):
    return df_knn[column_name].value_counts().idxmax()
prediction = majority_vote(df_knn, "label")
# look up the label 
label = lookup_label(prediction)
label

'Iris-versicolor'

In [14]:
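# compare our own knn prediction to that of sklearn
# (KNeighborsClassifier defaults to n_neighbors=5, so set k=3 to match the from-scratch version)
model = KNeighborsClassifier(n_neighbors=3).fit(X=X_train, y=y_train)
# pass the test row as a one-row DataFrame so the feature names match the training data
predicted = model.predict(X_test.iloc[[0]])
predicted[0]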

1
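
The steps in this section (compute distances, sort, keep the k nearest, take a majority vote) can be combined into a single function. The following is a minimal standard-library sketch under stated assumptions, not the notebook's own implementation. `knn_predict` is a hypothetical name, the training rows are a small subset copied from the tables above, and the query point is made up for illustration.

```python
import math
from collections import Counter

def euclidean_distance(row1, row2):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(row1, row2)))

def knn_predict(train_rows, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training rows."""
    # Pair each distance with its label and sort so the nearest rows come first.
    ranked = sorted(
        (euclidean_distance(query, row), label)
        for row, label in zip(train_rows, train_labels)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# A few training rows from the tables above (original row indices in comments).
train_rows = [
    (6.8, 3.2, 5.9, 2.3),  # 143
    (5.8, 2.7, 3.9, 1.2),  # 82
    (4.8, 3.0, 1.4, 0.1),  # 12
    (6.4, 3.1, 5.5, 1.8),  # 137
    (6.3, 3.3, 6.0, 2.5),  # 100
    (5.5, 2.5, 4.0, 1.3),  # 89
    (5.6, 2.7, 4.2, 1.3),  # 94
]
train_labels = [2, 1, 0, 2, 2, 1, 1]

query = (5.7, 2.6, 4.1, 1.2)  # hypothetical measurement, not a row from the dataset
predicted_label = knn_predict(train_rows, train_labels, query, k=3)
print(predicted_label)
```

The three nearest rows to this query are 94, 82, and 89, all of class 1 (Iris-versicolor). Note that a majority vote can tie when k is even or when more than two classes are among the neighbors. This sketch resolves ties by whichever label `Counter` encountered first, which is one of several reasonable conventions.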