This document introduces how to apply two classifiers, **kNN** and **Naive Bayes**, as implemented in [scikit-learn](https://scikit-learn.org/stable/). We use the modified Iris dataset introduced in previous weeks and assume that imputation, normalization and one-hot encoding have already been completed.

## 1. Data Preparation

We first import the packages that will be used in this document.

1. [Pandas](https://pandas.pydata.org/): Pandas is an open-source Python library widely used for data manipulation, analysis, and cleaning tasks. The central data structure in Pandas is the [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) which provides methods to facilitate the preliminary examination of essential properties, statistical summaries, and a select number of rows for a cursory exploration of the data.

2. [Numpy](https://numpy.org/): Numpy is a powerful Python library for numerical and array-based computing. It provides support for large, multi-dimensional arrays and matrices, along with a wide range of mathematical functions to operate on these arrays efficiently. 

3. [sklearn.neighbors.KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html): KNeighborsClassifier is a class provided by scikit-learn, used to create a k-NN classification model.

4. [sklearn.naive_bayes.GaussianNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html): GaussianNB is a class provided by scikit-learn, which implements the Gaussian Naive Bayes algorithm.

5. [sklearn.metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics): sklearn.metrics includes performance metrics functions used to evaluate a classifier's performance.

6. [sklearn.model_selection.cross_val_score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html): cross_val_score is a function provided by scikit-learn that performs cross-validation on a given model to evaluate its performance.

7. [sklearn.model_selection.GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html): GridSearchCV is a class provided by scikit-learn that facilitates hyperparameter tuning for models using a technique called grid search with cross-validation.

These packages will be used in the following tasks for data loading, classification and evaluation.

In [1]:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

First, we load the training and test data.

In [2]:
df = pd.read_csv('iris_modified_train.csv')
df_test = pd.read_csv('iris_modified_test.csv')

In [3]:
X_train = df.iloc[:,:-1].values
y_train = df.iloc[:,-1].values
X_test = df_test.iloc[:,:-1].values
y_test = df_test.iloc[:,-1].values

## 2. Classifiers

### 2.1 kNN

For conducting kNN classification, we use the [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) provided by scikit-learn.

The parameter `n_neighbors` is the key hyperparameter of kNN: the number of neighbours k that vote on each prediction. By default, it is set to `5`. Changing its value changes how the classifier behaves, and we will show how to tune it later.

The parameter `metric` specifies how distances are computed. By default, it is set to `minkowski`, which gives the standard Euclidean distance when the parameter `p` is set to 2.

The parameter `p` is the power parameter for the Minkowski metric. By default, it is set to `2`. With p = 1 the metric is equivalent to the Manhattan distance (l1), and with p = 2 it is equivalent to the Euclidean distance (l2).
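To make the role of `p` concrete, here is a minimal sketch of the Minkowski distance in plain Python. The function name `minkowski_distance` and the two points are hypothetical, chosen only so the results are easy to check by hand; this is not scikit-learn's internal implementation.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance between two equal-length sequences of numbers."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


# Two hypothetical 2-D points (illustration only).
a = [0.0, 0.0]
b = [3.0, 4.0]

manhattan = minkowski_distance(a, b, p=1)  # |3| + |4|
euclidean = minkowski_distance(a, b, p=2)  # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```

For these points the Manhattan distance is the sum of the coordinate differences, while the Euclidean distance is the length of the straight line between them.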

We will use the standard Euclidean distance to build the model in this document. Feel free to explore other distance metrics to better understand their effect on kNN classification.

First, let's have a look at how the kNN classifier performs with the default value of `n_neighbors`.

In [4]:
kNN_5 = KNeighborsClassifier()

The [fit()](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.fit) method is a fundamental function in scikit-learn's machine learning models used for training the model on the provided training data.

However, the `fit()` method for kNN differs from those discussed last week. Because kNN is a lazy learning approach, as introduced in the lecture, `fit()` does not build a generalisable model; it simply stores the training data so that it can be consulted when making predictions.

In [5]:
kNN_5.fit(X_train, y_train)

KNeighborsClassifier()
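The lazy-learning idea can be sketched with a tiny hand-written classifier. The class `TinyKNN` and the toy data below are hypothetical and only illustrate the principle; scikit-learn's `KNeighborsClassifier` is more sophisticated (for example, it can build index structures to speed up neighbour search).

```python
from collections import Counter


class TinyKNN:
    """Illustrative k-NN classifier; not scikit-learn's implementation."""

    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        # "Training" only memorises the data; no parameters are learned.
        self.X_ = [list(row) for row in X]
        self.y_ = list(y)
        return self

    def predict(self, X):
        predictions = []
        for query in X:
            # Squared Euclidean distance to every stored training point.
            dists = [
                (sum((q - t) ** 2 for q, t in zip(query, row)), label)
                for row, label in zip(self.X_, self.y_)
            ]
            # Keep the k closest points and take a majority vote.
            nearest = sorted(dists, key=lambda d: d[0])[: self.n_neighbors]
            votes = Counter(label for _, label in nearest)
            predictions.append(votes.most_common(1)[0][0])
        return predictions


# Hypothetical toy data: two well-separated clusters.
X_toy = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
y_toy = ["a", "a", "a", "b", "b", "b"]

model = TinyKNN(n_neighbors=3).fit(X_toy, y_toy)
toy_predictions = model.predict([[0.05, 0.05], [1.05, 1.0]])
print(toy_predictions)
```

All of the work happens in `predict()`, which is why kNN is cheap to fit but can be slow to query on large training sets.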

Then, we classify the test data with [predict()](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.predict).

In [6]:
y_pred_kNN = kNN_5.predict(X_test)

Using the method introduced before, we can evaluate the test performance.

In [7]:
acc_kNN = metrics.accuracy_score(y_test, y_pred_kNN)
print("The test accuracy of kNN on the dataset is: ", acc_kNN)

The test accuracy of kNN on the dataset is:  0.868421052631579


In [8]:
f1_kNN = metrics.f1_score(y_test, y_pred_kNN, average='macro')
print("The test macro f1-score of kNN on the dataset is: ", f1_kNN)

The test macro f1-score of kNN on the dataset is:  0.8758620689655173
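To show what `average='macro'` means, the following sketch computes accuracy and macro F1 by hand on a small set of labels. The helper `macro_f1` and the toy labels are hypothetical; the sketch is intended to mirror the definition (the unweighted mean of the per-class F1 scores), not to replace `metrics.f1_score`.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for three classes (illustration only).
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 0, 1, 2, 2, 2]

accuracy_toy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
macro_f1_toy = macro_f1(y_true_toy, y_pred_toy)
print(accuracy_toy, macro_f1_toy)
```

Because every class contributes equally to the mean, macro F1 penalises poor performance on a small class more than accuracy does.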


#### 2.1.1 Cross Validation for Evaluation

In practice, however, ground-truth labels for the test data are often unavailable. To overcome this, we employ cross-validation, which estimates how well the model generalises using only the training data. We will use [cross_val_score()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) to evaluate a classifier by cross-validation.
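The splitting step behind k-fold cross-validation can be sketched as follows. The helper `k_fold_indices` is hypothetical and uses simple contiguous folds; when `cv` is an integer and the model is a classifier, `cross_val_score` uses stratified folds instead, so that each fold keeps roughly the same class proportions.

```python
def k_fold_indices(n_samples, n_splits=5):
    """Yield (train_indices, validation_indices) for contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


# Hypothetical example: 12 samples split into 5 folds.
folds = list(k_fold_indices(n_samples=12, n_splits=5))
for train_idx, val_idx in folds:
    print("validate on", val_idx, "and train on", len(train_idx), "samples")
```

Each sample is used for validation exactly once, and the model is trained on the remaining folds each time; the reported score is the mean over the folds.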

First, we employ 5-fold cross-validation on the model with k = 5 to evaluate its performance.

In [9]:
acc_kNN_5 = cross_val_score(kNN_5, X_train, y_train, cv=5, scoring=('accuracy'))
f1_kNN_5 = cross_val_score(kNN_5, X_train, y_train, cv=5, scoring=('f1_macro'))
print("The cross-validation accuracy is: {:}\nThe cross-validation f1-score is: {:}".format(acc_kNN_5.mean(), f1_kNN_5.mean()))

The cross-validation accuracy is: 0.9019762845849802
The cross-validation f1-score is: 0.8960523028170087


Then let's have a look at the performance of the model with k = 3 by 5-fold cross-validation.

In [10]:
kNN_3 = KNeighborsClassifier(n_neighbors = 3)
acc_kNN_3 = cross_val_score(kNN_3, X_train, y_train, cv=5, scoring=('accuracy'))
f1_kNN_3 = cross_val_score(kNN_3, X_train, y_train, cv=5, scoring=('f1_macro'))
print("The cross-validation accuracy is: {:}\nThe cross-validation f1-score is: {:}".format(acc_kNN_3.mean(), f1_kNN_3.mean()))

The cross-validation accuracy is: 0.9023715415019762
The cross-validation f1-score is: 0.8950534759358287


The `kNN` classifier's cross-validation scores change with the value of `k`, although the difference between k = 5 and k = 3 is small here.

To improve the score, we can tune this hyperparameter.

#### 2.1.2 Cross Validation for Tuning Hyperparameters

[Scikit-learn](https://scikit-learn.org/stable/) provides a class called [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), which performs a grid search across the parameter space you define.

The parameter space of `GridSearchCV` is set by its `param_grid` parameter, which should be a dictionary or a list of dictionaries.

We define the search space for `k` as the odd numbers from 1 to 21 and put it into the required form.

In [11]:
parameters = [{'n_neighbors': [int(x) for x in np.arange(1, 22, 2)]}]

Then we create the kNN models, evaluate each of them using 5-fold cross-validation, and print the hyperparameter value that gives the best macro-F1.

In [12]:
kNN = KNeighborsClassifier()
clf_best_kNN = GridSearchCV(kNN, parameters, cv=5, scoring='f1_macro')
clf_best_kNN.fit(X_train, y_train)
print(clf_best_kNN.best_params_)

{'n_neighbors': 11}
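Beyond `best_params_`, the fitted search object also records the score of every candidate. The sketch below shows one way to inspect them. It uses scikit-learn's built-in Iris data as a stand-in (an assumption made so that the example is self-contained), so its scores and selected `k` will not match the modified dataset used in this document; the names `X_demo`, `y_demo` and `search` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: scikit-learn's built-in Iris, NOT the modified course dataset.
X_demo, y_demo = load_iris(return_X_y=True)

param_grid = {"n_neighbors": list(range(1, 22, 2))}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="f1_macro")
search.fit(X_demo, y_demo)

# Mean cross-validated macro-F1 for every candidate k.
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(params["n_neighbors"], round(score, 4))

# With the default refit=True, the best model is refit on all the data passed
# to fit(), so the search object can be used for prediction directly.
demo_predictions = search.predict(X_demo[:5])
print(search.best_params_, demo_predictions)
```

Looking at the full table of scores helps judge whether the best `k` is clearly better than its neighbours or only marginally so.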


Employing the kNN classifier with the selected hyperparameter, we can examine its cross-validation accuracy and F1 score.

In [13]:
kNN_11 = KNeighborsClassifier(n_neighbors = 11)
acc_kNN_11 = cross_val_score(kNN_11, X_train, y_train, cv=5, scoring=('accuracy'))
f1_kNN_11 = cross_val_score(kNN_11, X_train, y_train, cv=5, scoring=('f1_macro'))
print("The cross-validation accuracy is: {:}\nThe cross-validation f1-score is: {:}".format(acc_kNN_11.mean(), f1_kNN_11.mean()))

The cross-validation accuracy is: 0.9197628458498024
The cross-validation f1-score is: 0.9160363819187347


Next, we fit the kNN classifier on the training dataset.

In [14]:
kNN_11.fit(X_train, y_train)   

KNeighborsClassifier(n_neighbors=11)

Then, we classify the test data with [predict()](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.predict).

In [15]:
y_pred_kNN_11 = kNN_11.predict(X_test)

Using the method introduced before, we can evaluate the test performance.

In [16]:
acc_kNN_11 = metrics.accuracy_score(y_test, y_pred_kNN_11)
print("The test accuracy of kNN on the dataset is: ", acc_kNN_11)

The test accuracy of kNN on the dataset is:  0.9473684210526315


In [17]:
f1_kNN_11 = metrics.f1_score(y_test, y_pred_kNN_11, average='macro')
print("The test macro f1-score of kNN on the dataset is: ", f1_kNN_11)

The test macro f1-score of kNN on the dataset is:  0.9488636363636364


Compared with `n_neighbors=5`, the test performance has improved with `n_neighbors=11`, the value selected by cross-validation: accuracy rises from about 0.868 to 0.947 and macro F1 from about 0.876 to 0.949.

### 2.2 Naive Bayes

We use the [GaussianNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) class provided by scikit-learn for Naive Bayes classification.

In [18]:
gnb = GaussianNB()

Since we do not tune any hyperparameters for this model, we fit it directly on the training dataset with [fit()](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB.fit).

In [19]:
gnb.fit(X_train, y_train)

GaussianNB()
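Unlike kNN, fitting Gaussian Naive Bayes does estimate parameters: a prior for each class plus a mean and variance for each feature within each class. The sketch below illustrates the idea in plain Python. The functions `fit_gaussian_nb` and `predict_gaussian_nb`, the toy data, and the small constant added to the variances are assumptions of this sketch, not scikit-learn's code (`GaussianNB` handles variance stability through its `var_smoothing` parameter).

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Small constant (an assumption of this sketch) avoids zero variance.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior (up to a constant)."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for x, m, v in zip(row, means, variances):
            # Log of the Gaussian density; features are treated as independent.
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: two well-separated classes.
X_toy = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 0.5], [3.2, 0.4], [2.8, 0.6]]
y_toy = ["a", "a", "a", "b", "b", "b"]

nb_model = fit_gaussian_nb(X_toy, y_toy)
nb_predictions = [predict_gaussian_nb(nb_model, r) for r in [[1.1, 2.1], [3.1, 0.5]]]
print(nb_predictions)
```

Summing log densities over the features is the "naive" independence assumption: each feature contributes to the class score separately.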

We employ 5-fold cross-validation on the model to evaluate its performance.

In [20]:
acc_gnb = cross_val_score(gnb, X_train, y_train, cv=5, scoring=('accuracy'))
f1_gnb = cross_val_score(gnb, X_train, y_train, cv=5, scoring=('f1_macro'))
print("The cross-validation accuracy is: {:}\nThe cross-validation f1-score is: {:}".format(acc_gnb.mean(), f1_gnb.mean()))

The cross-validation accuracy is: 0.8747035573122529
The cross-validation f1-score is: 0.8679511664805781


Next, we classify the test data with [predict()](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB.predict).

In [21]:
y_pred_gnb = gnb.predict(X_test)

We then evaluate the test performance.

In [22]:
acc_gnb = metrics.accuracy_score(y_test, y_pred_gnb)
print("The test accuracy of Gaussian Naive Bayes on the dataset is: ", acc_gnb)

The test accuracy of Gaussian Naive Bayes on the dataset is:  0.9473684210526315


In [23]:
f1_gnb = metrics.f1_score(y_test, y_pred_gnb, average='macro')
print("The test macro f1-score of Gaussian Naive Bayes on the dataset is: ", f1_gnb)

The test macro f1-score of Gaussian Naive Bayes on the dataset is:  0.9488636363636364


Author: *Kaki Zhou* 8/8/2024 