# K Nearest Neighbor Classifier

### Predict the label of a data point by

   * Looking at the $k$ closest labeled data points

   * Taking a majority vote

<center><img src="knn_classifier.png" width="200"/></center>
<br />
<center>Example of k-NN classification.</center>
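
The two steps above can be sketched in plain NumPy. This is only a minimal illustration, assuming Euclidean distance; the function name `knn_predict` and the toy points are made up for this example and are not part of the notebook or of scikit-learn.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.15, 0.15]), k=3))
print(knn_predict(X_toy, y_toy, np.array([0.95, 1.00]), k=3))
```

`KNeighborsClassifier`, used in the rest of the notebook, follows the same idea but adds efficient neighbor search and configurable metrics and weights.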

In [None]:
import numpy as np
from sklearn.datasets import load_wine
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
import pandas as pd
from sklearn.model_selection import cross_val_score

## Loading Data

In [2]:
# Load wine dataset from sklearn
wine = load_wine()

# Get the detailed description of wine dataset
print(wine.DESCR)

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0.98  3.88    2.29  0.63
    Fl

In [1]:
############################################ Task 1 ############################################
# Display the three classes in a 2D plot using only the two features 'alcohol' and 'proline'
# ----------------------------------------- start here -----------------------------------------

# Take 'Alcohol' and 'Proline' as features
X = wine.data[:,[0,-1]]
y = wine.target

# Visualize the data
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], marker='o', label='0')
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], marker='^', label='1')
plt.scatter(X[y == 2][:, 0], X[y == 2][:, 1], marker='s', label='2')
plt.xlabel('Alcohol')
plt.ylabel('Proline')
plt.legend(loc = 'best')
plt.grid()
plt.show()

## Feature Scaling

#### Why is feature scaling important?

K-nearest neighbors measures similarity by the distance between data points. Features with larger numeric ranges therefore contribute more to that distance, so without scaling the model can be dominated by a single feature. In the wine data, for example, alcohol ranges from 11.0 to 14.8, while magnesium ranges from 70 to 162.
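
A small numeric sketch of this effect, using made-up wines described by (alcohol, proline). The values and the feature ranges in `ranges` are assumptions chosen for illustration, not numbers read from the dataset.

```python
import numpy as np

# Hypothetical wines described by (alcohol, proline); the values are made up
a = np.array([13.0, 1000.0])
b = np.array([11.0, 1005.0])  # very different alcohol, almost the same proline
c = np.array([13.0, 1050.0])  # same alcohol, somewhat different proline

# Raw Euclidean distances: the proline axis dominates
raw_ab = np.linalg.norm(a - b)
raw_ac = np.linalg.norm(a - c)
print(f"raw:    d(a, b) = {raw_ab:.2f}, d(a, c) = {raw_ac:.2f}")

# Assumed feature ranges, used only to put both axes on a comparable scale
ranges = np.array([3.8, 1400.0])
scaled_ab = np.linalg.norm((a - b) / ranges)
scaled_ac = np.linalg.norm((a - c) / ranges)
print(f"scaled: d(a, b) = {scaled_ab:.3f}, d(a, c) = {scaled_ac:.3f}")
```

On the raw values, `b` looks like the nearer neighbor of `a`; once both axes are on a comparable scale, `c` is nearer.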

#### MinMaxScaler:

$$x_{\text{normalized}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

#### StandardScaler:

$$x_{\text{standardized}} = \frac{x - \operatorname{mean}(x)}{\operatorname{std}(x)}$$
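
Both formulas can be checked by hand against the scikit-learn scalers. The sketch below uses a small made-up matrix `X_demo` (an assumption for illustration) and applies each formula column by column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Small made-up feature matrix: rows are samples, columns are features
X_demo = np.array([[11.0, 300.0],
                   [13.0, 750.0],
                   [14.8, 1680.0]])

# Min-max normalization, applied per column
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
X_minmax = (X_demo - col_min) / (col_max - col_min)

# Standardization, applied per column; np.std defaults to the population
# standard deviation (ddof=0), which is also what StandardScaler uses
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Compare with the scikit-learn implementations
print(np.allclose(X_minmax, MinMaxScaler().fit_transform(X_demo)))
print(np.allclose(X_std, StandardScaler().fit_transform(X_demo)))
```

After min-max scaling every column lies in [0, 1]; after standardization every column has mean 0 and standard deviation 1.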

In [3]:
############################################ Task 2 ############################################
# Display the different scales of the features by plotting parallel box plots
# ----------------------------------------- start here -----------------------------------------

# Plot all features in Boxplots
fig = plt.figure(figsize =(6, 3))
plt.boxplot(wine.data)
plt.show()

In [4]:
# Use MinMaxScaler to normalize the data to the range [0, 1]
# ----------------------------------------- start here -----------------------------------------

# Scale the features using MinMaxScaler
scaler = MinMaxScaler()
wine_minmax = scaler.fit_transform(wine.data)

# Plot all scaled features in Boxplots
fig = plt.figure(figsize =(6, 3))
plt.boxplot(wine_minmax)
plt.show()

In [None]:
# Use StandardScaler to standardize the data
# ----------------------------------------- start here -----------------------------------------

# Scale the features using StandardScaler
scaler = StandardScaler()
wine_std = scaler.fit_transform(wine.data)

# Plot all scaled features in Boxplots
fig = plt.figure(figsize =(6, 3))
plt.boxplot(wine_std)
plt.show()

## Splitting the Data & Fitting Model

In [5]:
############################################ Task 3 ############################################
# Split the scaled data into training and test sets (80-20 split)
# ----------------------------------------- start here -----------------------------------------

# Scale the features using StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
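
The cell above fits the scaler on all samples before splitting, so the test samples influence the mean and standard deviation used for scaling. A possible variant, sketched here under the assumption that one prefers to keep the test set fully unseen, splits first and lets a pipeline fit the scaler on the training data only. The names `X_raw`, `X_train_raw`, `X_test_raw`, and `model`, and the choice `n_neighbors=5`, are placeholders for this example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_raw = wine.data[:, [0, -1]]  # 'Alcohol' and 'Proline', unscaled
y = wine.target

# Split first, so the test samples stay unseen during preprocessing
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X_raw, y, test_size=0.2, random_state=1)

# The pipeline fits the scaler on the training data only and reuses the
# learned mean and standard deviation when scoring the test data
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train_raw, y_train)

print("Training accuracy:", model.score(X_train_raw, y_train))
print("Test accuracy:", model.score(X_test_raw, y_test))
```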


In [6]:
# Train k nearest neighbors with k ∈ {1, ..., 10} using KNeighborsClassifier from sklearn.neighbors
# Compute the training and test error w.r.t. 0-1 loss
# ----------------------------------------- start here -----------------------------------------

# Candidate values of k
neighbors = np.arange(1, 11)
train_accuracies = {}
test_accuracies = {}

for neighbor in neighbors:
    # Set up a KNN Classifier
    knnc = KNeighborsClassifier(n_neighbors=neighbor)
    # Fit the model
    knnc.fit(X_train, y_train)
    # Compute accuracy (equal to 1 minus the error under 0-1 loss)
    train_accuracies[neighbor] = knnc.score(X_train, y_train)
    test_accuracies[neighbor] = knnc.score(X_test, y_test)
    
# Add a title
plt.title("KNN: Varying Number of Neighbors")
# Plot training accuracies
plt.plot(neighbors, list(train_accuracies.values()), label="Training Accuracy")
# Plot test accuracies
plt.plot(neighbors, list(test_accuracies.values()), label="Testing Accuracy")
plt.legend()
plt.xlabel("Number of Neighbors")
plt.ylabel("Accuracy")
plt.grid()
# Display the plot
plt.show()

In [15]:
# Display the decision boundaries for k = 1, k = 10, and your best choice of k in the 2D plot
# ----------------------------------------- start here -----------------------------------------

fig, axs = plt.subplots(1, 3, figsize=(15, 5))
# k = 1, the chosen best k (here 4), and k = 10
k_values = [1, 4, 10]
for i in range(3):
    # Set up a KNN Classifier
    knnc = KNeighborsClassifier(n_neighbors=k_values[i])
    # Fit the model
    knnc.fit(X_train, y_train)
    # Plot the decision boundary of the classifier
    disp = DecisionBoundaryDisplay.from_estimator(knnc, X, response_method="predict", plot_method="pcolormesh",
                                                  xlabel='Alcohol', ylabel='Proline', ax=axs[i], alpha=0.5)
    scatter = disp.ax_.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")
    disp.ax_.legend(scatter.legend_elements()[0], wine.target_names, loc="lower left", title="Classes",)
    _ = disp.ax_.set_title(f"k={knnc.n_neighbors}")
    

In [None]:
############################################ Task 4 ############################################
# Choose k as before (optimal choice from previous task) and vary between Minkowski, Manhattan, and cosine distance
# Again plot the decision boundaries and report the training and test accuracies
# ----------------------------------------- start here -----------------------------------------

fig, axs = plt.subplots(1, 3, figsize=(15, 5))
metrics = ...
train_accuracies = {}
test_accuracies = {}
for i in range(3):
    # Set up a KNN Classifier
    knnc = ...
    # Fit the model
    ...
    # plot the decision boundary of the classifier
    disp = DecisionBoundaryDisplay.from_estimator(..., ..., response_method="predict", plot_method="pcolormesh",
                                                  xlabel='Alcohol', ylabel='Proline', ax=axs[i], alpha=0.5)
    scatter = disp.ax_.scatter(..., ..., c=..., edgecolors="k")
    disp.ax_.legend(scatter.legend_elements()[0], wine.target_names, loc="lower left", title="Classes",)
    _ = disp.ax_.set_title(f"metric={metrics[i]}")
    
    # Compute the accuracies on the training set and the test set
    train_accuracies[i] = ...
    test_accuracies[i] = ...

# Print the accuracies on the training set and the test set
...
...
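
One way the accuracy part of Task 4 might be approached is sketched below; it is not the reference solution. It assumes `k = 4` (the value used for the middle panel in the previous cell), recreates the scaled two-feature data so that the block runs on its own, and leaves out the plotting. With scikit-learn's default `p=2`, the `"minkowski"` metric is the Euclidean distance.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Recreate the scaled 'Alcohol'/'Proline' data and the 80-20 split
wine = load_wine()
X = StandardScaler().fit_transform(wine.data[:, [0, -1]])
y = wine.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

metrics = ["minkowski", "manhattan", "cosine"]
train_accuracies = {}
test_accuracies = {}

for metric in metrics:
    # k = 4 is an assumption carried over from the previous cell
    knnc = KNeighborsClassifier(n_neighbors=4, metric=metric)
    knnc.fit(X_train, y_train)
    train_accuracies[metric] = knnc.score(X_train, y_train)
    test_accuracies[metric] = knnc.score(X_test, y_test)

for metric in metrics:
    print(f"{metric:>10}: train={train_accuracies[metric]:.3f}, "
          f"test={test_accuracies[metric]:.3f}")
```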

## Hyperparameter tuning

#### k-fold cross-validation


<img src="kfold_validation.png" width="800"/>
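
The procedure in the figure can be written out as an explicit loop. The sketch below assumes 5 stratified folds and `n_neighbors=5` (both placeholders), and checks that the manual loop agrees with `cross_val_score`, which uses stratified folds for classifiers when `cv` is an integer.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X = StandardScaler().fit_transform(wine.data[:, [0, -1]])
y = wine.target

knn = KNeighborsClassifier(n_neighbors=5)
folds = StratifiedKFold(n_splits=5)

fold_scores = []
for train_idx, val_idx in folds.split(X, y):
    # Train on four folds, validate on the held-out fold
    knn.fit(X[train_idx], y[train_idx])
    fold_scores.append(knn.score(X[val_idx], y[val_idx]))

print("Per-fold accuracy:", np.round(fold_scores, 3))
print("Mean accuracy:", np.mean(fold_scores))

# The helper function should reproduce the same per-fold scores
print(np.allclose(fold_scores, cross_val_score(knn, X, y, cv=5)))
```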

#### Grid search


<img src="gridsearch.png" width="500"/>
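
Conceptually, grid search evaluates every combination of candidate hyperparameters with cross-validation and keeps the best one. The sketch below spells this out with `ParameterGrid` on a deliberately small, illustrative grid (not the grid asked for in Task 5); `GridSearchCV` automates the same loop.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X = StandardScaler().fit_transform(wine.data[:, [0, -1]])
y = wine.target

# Small illustrative grid: 2 x 2 = 4 combinations
grid = ParameterGrid({"n_neighbors": [3, 5], "weights": ["uniform", "distance"]})

results = []
for params in grid:
    knn = KNeighborsClassifier(**params)
    # Mean 5-fold cross-validation accuracy for this combination
    mean_score = cross_val_score(knn, X, y, cv=5).mean()
    results.append((mean_score, params))
    print(params, round(mean_score, 3))

# Keep the combination with the highest mean accuracy
best_score, best_params = max(results, key=lambda r: r[0])
print("Best:", best_params, round(best_score, 3))
```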

In [7]:
############################################ Task 5 ############################################
# Perform a grid search to find the best combination of k ∈ {2, 3, 4, 5, 6, 7},
# metric (Minkowski, Manhattan, cosine), and weights (uniform, distance)
# ----------------------------------------- start here -----------------------------------------

# Create a KNN Classifier
knn_gs = ...

# Setup GridSearchCV
params = ...
gs = GridSearchCV(estimator=..., param_grid=..., scoring='accuracy', cv=5)

# Fit GridSearchCV
gs.fit(X, y)

# Print the best combination of parameters



In [8]:
# Estimate the generalization error (w.r.t. 0-1 loss) for the chosen best set of hyperparameters
# ----------------------------------------- start here -----------------------------------------

# Set up knn with the best combination of parameters
knn = ...

# Fit knn model
...

# Predict test data with knn model
y_pred = ...

# Calculate the accuracy of predictions
accuracy = ...

# Print the accuracy
print("Accuracy:", accuracy)
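
A possible way to combine Task 5 with the generalization estimate is sketched below; it is not the reference solution. It differs from the template in one labeled assumption: the grid search is fitted on the training split only, so that the test split remains untouched for the final estimate.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X = StandardScaler().fit_transform(wine.data[:, [0, -1]])
y = wine.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

params = {
    "n_neighbors": [2, 3, 4, 5, 6, 7],
    "metric": ["minkowski", "manhattan", "cosine"],
    "weights": ["uniform", "distance"],
}
gs = GridSearchCV(estimator=KNeighborsClassifier(), param_grid=params,
                  scoring="accuracy", cv=5)

# Assumption: search on the training split only (the template fits on X, y)
gs.fit(X_train, y_train)
print("Best parameters:", gs.best_params_)
print("Best CV accuracy:", gs.best_score_)

# With refit=True (the default), best_estimator_ is refitted on X_train
y_pred = gs.best_estimator_.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Test accuracy:", accuracy)
print("Test error (0-1 loss):", 1 - accuracy)
```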

## Add more features

In [25]:
############################################ Task 6 ############################################
# ----------------------------------------- start here -----------------------------------------

# Build feature addition list
feature_order = ...

scores = []
scores_scaler = []

for i in range(2, len(feature_order) + 1):
    
    # Increase features gradually
    selected = feature_order[:i]
    X_cv = wine.data[:, selected]
    y_cv = wine.target
    
    # Scale the features using StandardScaler
    scaler = ...
    X_cv_scaler = ...
    
    # Create a KNN classifier with the best combination of parameters
    knn_cv = ...
    
    # Calculate cross-validation accuracy for raw data
    score_cv = cross_val_score(..., ..., ..., cv=5)
    
    # Calculate cross-validation accuracy for standardized data
    score_cv_scaler = cross_val_score(..., ..., ..., cv=5)
    
    # Add cross-validation accuracy for raw data to a list
    scores.append(...)
    
    # Add cross-validation accuracy for standardized data to a list
    scores_scaler.append(...)

In [None]:
# Compare cross-validation accuracy between raw data and standardized data
plt.title("KNN: Varying Number of Features")
plt.plot(..., ..., label='Raw data')
plt.plot(..., ..., label='Standardized data')
plt.xlabel("Number of Features")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
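
A sketch of how the Task 6 comparison might look, without the plot. The feature order (`'Alcohol'` and `'Proline'` first, then the remaining columns in dataset order) and `n_neighbors=5` are assumptions for illustration; the latter is a placeholder, not the result of the grid search.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

wine = load_wine()
y_cv = wine.target

# Assumed order: 'Alcohol' (0) and 'Proline' (12) first, then columns 1..11
feature_order = [0, 12] + list(range(1, 12))

scores = []
scores_scaler = []
for i in range(2, len(feature_order) + 1):
    # Increase the number of features gradually
    X_cv = wine.data[:, feature_order[:i]]
    X_cv_scaler = StandardScaler().fit_transform(X_cv)

    # n_neighbors=5 is a placeholder, not a tuned value
    knn_cv = KNeighborsClassifier(n_neighbors=5)

    # Mean 5-fold cross-validation accuracy on raw and on standardized data
    scores.append(cross_val_score(knn_cv, X_cv, y_cv, cv=5).mean())
    scores_scaler.append(cross_val_score(knn_cv, X_cv_scaler, y_cv, cv=5).mean())

n_features = list(range(2, len(feature_order) + 1))
for n, raw, std in zip(n_features, scores, scores_scaler):
    print(f"{n:2d} features: raw={raw:.3f}, standardized={std:.3f}")
```

The lists `n_features`, `scores`, and `scores_scaler` have the shape expected by the plotting template in the cell above.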