<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Python-Notebook-Banners/Exercise.png"  style="display: block; margin-left: auto; margin-right: auto;"/>
</div>

# Exercise: KNN and Naive Bayes
© ExploreAI Academy

In this exercise, we train a k-nearest neighbours model and experiment with various parameters.

## Learning objectives

By the end of this exercise, you should be able to:
* Train a k-nearest neighbours model.
* Compare KNN models trained with different parameters. 

## Overview

K-nearest neighbours (KNN) is a simple, instance-based learning algorithm: an observation is assigned the class that wins the majority vote among its K nearest neighbours in the training data. In this exercise, we will explore how sensitive the KNN model is to the number of neighbours (K) and to the distance metric used to decide which points count as "nearest".
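
To make the majority-vote idea concrete, here is a minimal sketch in plain Python. It is not how `scikit-learn` implements KNN; the function name `knn_predict` and the toy points are made up for illustration, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count how often each label occurs
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two small, well-separated clusters
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

print(knn_predict(points, labels, query=(2.0, 2.0), k=3))  # benign
print(knn_predict(points, labels, query=(6.0, 6.5), k=3))  # malignant
```

Nothing is "learned" ahead of time here: all the work happens at prediction time, which is why KNN is called instance-based.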

## Import libraries 

In [5]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import seaborn as sns

## Load and prepare the dataset

The dataset used in this exercise is the `Breast Cancer` dataset provided by `scikit-learn`. It contains 569 tumour samples, each described by 30 numeric features computed from digitised images of cell nuclei and labelled as malignant (cancerous, encoded as `0`) or benign (non-cancerous, encoded as `1`).

We will apply KNN models to this data to classify tumours as malignant or benign based on their cell characteristics. Accurate classification of this kind can aid early diagnosis, guide treatment decisions, and potentially improve patient outcomes.

In [8]:
# Load the dataset
X, y = load_breast_cancer(return_X_y=True)

We then split the data to prepare the training and testing datasets.

In [10]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Exercises

### Exercise 1

Let's experiment with different values of k to understand how it affects the accuracy and generalisation of the model.

Use a for loop to train different k-nearest neighbours models, each with a different number of neighbours as follows: `1, 3, 5, 7, 9`.
Evaluate each model's performance on the test set and print its accuracy score.

In [13]:
# List of different values of k to evaluate
k_values = [1, 3, 5, 7, 9]

# List to store accuracy scores of each KNN model for comparison
scores = []

# Loop over each value of k 
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    scores.append(accuracy)
    print(f"Accuracy for K={k}: {accuracy:.4f}")

Accuracy for K=1: 0.9298
Accuracy for K=3: 0.9298
Accuracy for K=5: 0.9561
Accuracy for K=7: 0.9561
Accuracy for K=9: 0.9561
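
Several values of k tie for the highest accuracy here, so "the best k" needs a tie-breaking rule. The sketch below picks the smallest k that reaches the top score; that rule is an assumption made for illustration, not something the exercise prescribes. The scores are copied from the output above.

```python
# Values of k and the test accuracies reported above
k_values = [1, 3, 5, 7, 9]
scores = [0.9298, 0.9298, 0.9561, 0.9561, 0.9561]

# Highest accuracy reached by any model
best_score = max(scores)

# Assumed tie-break: prefer the smallest k that reaches the best score
best_k = min(k for k, score in zip(k_values, scores) if score == best_score)

print(f"Best k: {best_k} (accuracy {best_score:.4f})")  # Best k: 5 (accuracy 0.9561)
```

Keep in mind that these accuracies come from a single train/test split, so the ranking of k values may change with a different split.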


### Exercise 2

Let's also compare the impact that different distance metrics will have on the accuracy of the KNN model.

Again, use a for loop to train different k-nearest neighbours models, each with a different distance metric as follows: `Euclidean, Manhattan, Chebyshev`. Evaluate each model's performance on the test set and print its accuracy score.

**Note:** You can use what seems to be the optimal number of neighbours based on the results from Exercise 1.
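
The three metrics differ in how they combine the per-feature differences between two points. A minimal sketch in plain Python, with hypothetical helper functions and a made-up pair of points:

```python
def euclidean(a, b):
    # Straight-line distance: square root of the summed squared differences
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    # Sum of the absolute differences along each feature
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    # Largest absolute difference along any single feature
    return max(abs(x - y) for x, y in zip(a, b))

# Two example points whose features differ by 3 and 4
p, q = (1.0, 2.0), (4.0, 6.0)

print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
print(chebyshev(p, q))  # 4.0
```

Because each metric measures "closeness" differently, the same query point can end up with a different set of nearest neighbours, and therefore a different prediction.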

In [15]:
# List of distance metrics to evaluate
metrics = ['euclidean', 'manhattan', 'chebyshev']

# Loop over each distance metric
for metric in metrics:
    # k=5 is the smallest k that reached the highest accuracy in Exercise 1
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy with {metric} distance: {accuracy:.4f}")

Accuracy with euclidean distance: 0.9561
Accuracy with manhattan distance: 0.9474
Accuracy with chebyshev distance: 0.9561


### Exercise 3

**a)** 

We want a function that trains a KNN model with a given k and distance metric, then plots the decision boundary so that we can see how the model classifies different regions of the feature space.

The function below contains the code for visualising the decision boundary. Complete it by adding the appropriate parameters (the feature dataset, the target dataset, the number of neighbours, and the distance metric) and the lines of code needed to train the model.

**Note:** We train the model using only two features for ease of plotting.

In [17]:
# Function to plot decision boundaries 
def plot_decision_boundary(X, y, k, metric):
    # Train a KNN model using the specified number of neighbours and distance metric
    knn = KNeighborsClassifier(n_neighbors=k, metric=metric)
    knn.fit(X[:, :2], y)  # Use only two features for simplicity

    # Create a mesh grid based on feature ranges
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))

    # Predict class for each point on the grid
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Plot the contours
    plt.figure(figsize=(8, 6))
    plt.contourf(xx, yy, Z, alpha=0.8)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k')  # Plot the training points
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title(f"Decision Boundary for K={k} with {metric} metric")
    plt.show()
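
The grid step inside the function can be traced on a tiny example. The sketch below uses a 3-by-2 grid instead of 100-by-100, and a made-up thresholding rule stands in for `knn.predict`, so only the array shapes carry over to the real function.

```python
import numpy as np

# Tiny grid: 3 values along the first feature, 2 along the second
xx, yy = np.meshgrid(np.linspace(0.0, 1.0, 3), np.linspace(0.0, 2.0, 2))
print(xx.shape)  # (2, 3)

# Flatten both coordinate arrays and pair them up: one row per grid point
grid_points = np.c_[xx.ravel(), yy.ravel()]
print(grid_points.shape)  # (6, 2)

# Stand-in for knn.predict (hypothetical rule): one label per grid point
fake_labels = (grid_points[:, 0] > 0.4).astype(int)

# Reshape the labels back onto the grid so contourf can colour each region
Z = fake_labels.reshape(xx.shape)
print(Z)
```

The model only ever sees a flat list of (feature 1, feature 2) pairs; reshaping the predictions to `xx.shape` is what turns them back into a picture.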

**b)** 

Use the function created in **section a** to plot the decision boundary of a KNN model trained using `5` nearest neighbours and the `Euclidean` distance metric. 

In [19]:
# Plot the decision boundary (scikit-learn expects lowercase metric names)
plot_decision_boundary(X_train, y_train, 5, 'euclidean')

<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px"/>
</div>