# Wine classification using the k-Nearest Neighbors algorithm

## Overview

k-nearest neighbors (kNN) is a simple supervised machine learning algorithm that can be used for both classification and regression. It finds the k training points closest to a given test point and uses their labels (for classification) or values (for regression) to make a prediction for that point. In this project, we load and prepare the wine dataset, explain how the algorithm works, fit it to the data, make predictions, and finally evaluate the model's performance.

## Import of libraries

In [65]:
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

## Import and preprocessing of the data

In this section, we load the dataset, standardize its features, and remove outliers. Scaling matters because kNN relies on distances between points: without it, features with larger numeric ranges dominate the distance computation, which biases the model towards those features and can reduce its performance. Outliers are removed because they can end up among the k nearest neighbors of a test point even though they are not representative of the typical values in the dataset, which can lead to incorrect predictions.
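To make the scaling argument concrete, here is a small sketch on hypothetical toy data (the values below are invented for illustration and are not taken from the wine dataset). It shows how the nearest neighbor of a query point can change once each column is standardized.

```python
import numpy as np

# Hypothetical toy data: column 0 varies by about 1 unit,
# column 1 varies by hundreds of units.
points = np.array([
    [13.0, 1000.0],  # query point
    [14.5, 1005.0],  # A: far in column 0, close in column 1
    [13.1, 1040.0],  # B: close in column 0, moderately far in column 1
    [11.5,  500.0],
    [14.8, 1500.0],
])

def nearest_to_query(data):
    """Return the row index of the point closest to row 0 (the query)."""
    distances = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(distances)) + 1

# Standardize each column to zero mean and unit variance
standardized = (points - points.mean(axis=0)) / points.std(axis=0)

raw_nearest = nearest_to_query(points)           # 1 (point A)
scaled_nearest = nearest_to_query(standardized)  # 2 (point B)
print(raw_nearest, scaled_nearest)
```

On the raw values, column 1 decides almost everything, so A is the nearest point. After standardization both columns carry comparable weight and B becomes the nearest.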

In [66]:
wine = load_wine()
X = wine.data
y = wine.target

In [67]:
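# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Absolute z-score of every value, computed per feature
# (equivalent to np.abs(X_scaled), since both use the population standard deviation)
z_scores = np.abs(stats.zscore(X))

# A value more than 3 standard deviations from its feature mean counts as an outlier
threshold = 3

# Rows that contain at least one outlying value
outlier_rows = np.unique(np.where(z_scores > threshold)[0])

# Remove those rows from both the features and the labels
X_filtered = np.delete(X_scaled, outlier_rows, axis=0)
y_filtered = np.delete(y, outlier_rows, axis=0)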

# Split the data into train and test set
x_train, x_test, y_train, y_test = train_test_split(X_filtered, y_filtered, test_size=0.25, random_state=42)
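
The cell above fits the scaler on the full dataset before splitting. A common alternative is to split first and fit the scaler on the training rows only, so that test-set statistics do not influence preprocessing. The sketch below shows one way this could look with `sklearn.pipeline.make_pipeline`. It is not the approach used in this notebook: the variable names are hypothetical, outlier filtering is omitted for brevity, and its scores would not necessarily match the ones reported here.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_raw, y_raw = load_wine(return_X_y=True)

# Split first, so the test rows play no part in fitting the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.25, random_state=42
)

# The pipeline fits StandardScaler on X_tr only, then reuses those
# statistics to transform X_te at prediction time
alt_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
alt_model.fit(X_tr, y_tr)
alt_pred = alt_model.predict(X_te)
```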

## k-Nearest Neighbors classification

The kNN classifier assigns a class to a new instance based on its similarity to the existing training data. It computes the distance from the new instance to every instance in the training set using a distance metric (Euclidean distance by default in scikit-learn's `KNeighborsClassifier`), then selects the k instances with the smallest distances. The predicted class is determined by a majority vote among these k nearest neighbors. This process is repeated for each new instance to be classified.
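
The steps described above can be sketched from scratch in a few lines of NumPy. This is a minimal illustration and not how scikit-learn implements it internally; the function name `knn_predict` and the toy arrays are hypothetical, and ties in the vote are resolved here in favor of the smallest label.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=5):
    """Predict labels for X_new by majority vote among the k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    X_new = np.asarray(X_new, dtype=float)
    predictions = []
    for x in X_new:
        # Euclidean distance from x to every training instance
        distances = np.linalg.norm(X_train - x, axis=1)
        # Indices of the k smallest distances
        nearest = np.argsort(distances)[:k]
        # Majority vote; np.unique sorts labels, so ties go to the smallest label
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)

# Toy usage with two well-separated hypothetical clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
toy_pred = knn_predict(X_toy, y_toy, np.array([[0.1, 0.1], [5.0, 5.1]]), k=3)
print(toy_pred)  # [0 1]
```

Every prediction requires a distance to each training point, which is acceptable for a dataset of this size but becomes costly as the training set grows.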

In [68]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train, y_train)

In [69]:
y_pred = knn.predict(x_test)

In [70]:
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
precision = precision_score(y_test, y_pred, average='weighted')

print('Evaluation report\n')
print(f'Accuracy: {accuracy * 100:.3f} %')
print(f'F1 score: {f1 * 100:.3f} %')
print(f'Recall: {recall * 100:.3f} %')
print(f'Precision: {precision * 100:.3f} %')

Evaluation report

Accuracy: 97.619 %
F1 score: 97.637 %
Recall: 97.619 %
Precision: 97.835 %
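
The weighted scores above summarize all classes in a single number. `confusion_matrix` is imported at the top of the notebook but not used; the sketch below shows how it could break results down per class. The label arrays are hypothetical and only illustrate the call; in the notebook they would be `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only
y_true_demo = np.array([0, 0, 1, 1, 1, 2, 2, 2])
y_pred_demo = np.array([0, 0, 1, 2, 1, 2, 2, 1])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Per-class recall: correct predictions (diagonal) divided by row totals
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print(per_class_recall)
```

Off-diagonal entries show which classes are confused with each other, which a single weighted score cannot reveal.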
