<div style="text-align: center;">
 <h1>IMPLEMENTING K-NEAREST-NEIGHBOUR (KNN) ALGORITHM FOR IMAGE CLASSIFICATION ON CIFAR-10 DATASET</h1>
</div>

<div style="text-align: center;">
    <img width = 500 src="./KNN.jpg">
</div>

<h4>The KNN algorithm is a simple and effective method for classifying data based on the majority class of its nearest neighbors.</h4>

## DRIVER

In [1]:
# driver libraries
import numpy as np
import torch
import torchvision
import torchvision.transforms as transforms
from sklearn import datasets
from scipy.spatial.distance import cdist
from collections import Counter
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle 


## USER DEFINED FUNCTIONS

In [2]:
# Function to Define transformations to be applied to the images
def imageTransform():
    tform = transforms.Compose([
        transforms.ToTensor(),  # Convert images to PyTorch tensors
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))  # Normalize the images
    ])
    return tform

# Function to Download CIFAR-10 dataset
def downloadDataset():
    transform = imageTransform()
    train_dataset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
    test_dataset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
    return train_dataset, test_dataset

# Function to create data loaders for training and testing
def traintestLoaders(train_dataset, test_dataset):
    # Returns flattened feature tensors and label tensors for the train and test sets
    
    # Create data loaders for training and testing
    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True, num_workers=2)
    test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=64, shuffle=False, num_workers=2)
    
    # Extract features and labels for training set
    train_features = []
    train_labels = []
    
    for images, labels in train_loader:
        # Flatten the images into vectors
        images = images.view(images.size(0), -1)
        train_features.append(images)
        train_labels.append(labels)

    train_features = torch.cat(train_features, dim=0)
    train_labels = torch.cat(train_labels, dim=0)

    # Extract features and labels for test set
    test_features = []
    test_labels = []
    
    for images, labels in test_loader:
        # Flatten the images into vectors
        images = images.view(images.size(0), -1)
        test_features.append(images)
        test_labels.append(labels)
    
    test_features = torch.cat(test_features, dim=0)
    test_labels = torch.cat(test_labels, dim=0)

    return train_features, test_features, train_labels, test_labels

# Function to convert tensors to numpy arrays
# Note: takes no arguments and reads the notebook-level tensors created in the MAIN cell
def convertTensorsToNumpy():
    return train_features.numpy(), test_features.numpy(), train_labels.numpy(), test_labels.numpy()

# Function to reduce the sample size for custom KNN Classifier
def reduceSampleSize(sample_size, features, labels):
    # Shuffle your data
    f, l = shuffle(features, labels)
    
    # Sample indices without replacement (np.random.randint would allow duplicates)
    indices = np.random.choice(len(f), size=sample_size, replace=False)

    # Use indices to select subset
    reduced_features = f[indices]
    reduced_labels = l[indices]

    return reduced_features, reduced_labels

## MAIN

In [3]:
# Download the dataset
train_dataset, test_dataset = downloadDataset()

# Extracting flattened features and labels from the train and test sets
train_features, test_features, train_labels, test_labels = traintestLoaders(train_dataset, test_dataset)

# Converting to NumPy type
train_features, test_features, train_labels, test_labels = convertTensorsToNumpy()

#confirmation
print('Transformation and Loading is completed. Ready for training the model')

Files already downloaded and verified
Files already downloaded and verified
Transformation and Loading is completed. Ready for training the model


In [4]:
# Initialize and train KNN classifier
knn_classifier = KNeighborsClassifier(n_neighbors=3)
knn_classifier.fit(train_features, train_labels)

# Make predictions on the test set
predictions = knn_classifier.predict(test_features)

# Calculate accuracy
accuracy = accuracy_score(test_labels, predictions)
print(f'Accuracy of KNN classifier on test set: {accuracy * 100:.2f}%')

Accuracy of KNN classifier on test set: 33.03%


# BUILDING CUSTOM KNN-CLASSIFIER TO WORK ON CIFAR-10 DATASET

The custom KNN classifier is implemented in Python as a class named `KNN_Classifier`. The key components of the implementation are:

<b>Initialization:</b><br>
The class is initialized with a parameter `k`, which represents the number of neighbors to consider during classification. The default value is set to `5`, a common choice in practice.

<b>Model Fitting:</b><br>
The `fitModel` method is responsible for fitting the model with training data. It takes training features `X` and corresponding labels `y` as input and stores them internally.

<b>Prediction:</b><br>
The `predict` method is used to make predictions for new data points. Given a set of test points `X`, it iterates through each test point, computes the distances to all training points, identifies the indices of the k-nearest neighbors, and predicts the label based on the majority class.
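
The prediction steps described above can be traced by hand on a tiny example. The sketch below is only an illustration: the four 2-D training points, their labels, and the test point are made-up values, not CIFAR-10 data.

```python
from collections import Counter

import numpy as np

# Toy training set (hypothetical values): four 2-D points with labels 0 or 1.
X_train = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]])
y_train = np.array([0, 0, 1, 1])

test_point = np.array([0.2, 0.1])
k = 3

# Step 1: Euclidean distance from the test point to every training point.
distances = np.sqrt(((X_train - test_point) ** 2).sum(axis=1))

# Step 2: indices of the k smallest distances.
k_indices = np.argsort(distances)[:k]

# Step 3: majority vote among the labels of those neighbours.
neighbor_labels = y_train[k_indices].tolist()
predicted_label = Counter(neighbor_labels).most_common(1)[0][0]

print(k_indices, neighbor_labels, predicted_label)
```

Two of the three nearest neighbours carry label 0, so the vote returns 0 for this test point.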

<b>Distance Computation:</b><br>
The `compute_distances` method calculates the distances between a test point and all training points using the Euclidean distance metric. The `cdist` function from the `scipy.spatial.distance` module is utilized for efficiency.

<b>Performance Analysis:</b><br>
The performance of the custom KNN classifier can be assessed based on metrics such as accuracy, precision, recall, and F1 score. These metrics can be calculated by comparing the predicted labels to the true labels of the test set.
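
As a rough sketch of how the metrics named above could be computed, the function below builds a confusion matrix with NumPy and derives accuracy and macro-averaged precision, recall, and F1 from it. The name `macro_metrics` and the macro-averaging choice are assumptions made for this illustration; the notebook itself reports accuracy only.

```python
import numpy as np

def macro_metrics(y_true, y_pred, num_classes):
    """Return accuracy and macro-averaged precision, recall and F1."""
    # Cast to int because the custom predict method returns a float array.
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)

    # confusion[t, p] counts samples of true class t predicted as class p.
    confusion = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(confusion, (y_true, y_pred), 1)

    true_pos = np.diag(confusion).astype(float)
    predicted_per_class = confusion.sum(axis=0)
    actual_per_class = confusion.sum(axis=1)

    # Classes that are never predicted (or never occur) get a score of 0.
    precision = np.divide(true_pos, predicted_per_class,
                          out=np.zeros(num_classes), where=predicted_per_class > 0)
    recall = np.divide(true_pos, actual_per_class,
                       out=np.zeros(num_classes), where=actual_per_class > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros(num_classes), where=denom > 0)

    accuracy = true_pos.sum() / confusion.sum()
    return accuracy, precision.mean(), recall.mean(), f1.mean()

# Hypothetical labels for three classes, only to show the call.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0.0, 1.0, 1.0, 1.0, 2.0, 0.0]
accuracy, precision, recall, f1 = macro_metrics(y_true, y_pred, num_classes=3)
print(accuracy, precision, recall, f1)
```

For CIFAR-10, `num_classes` would be 10.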

In [11]:
# Custom KNN Classifier class
class KNN_Classifier:

    def __init__(self, k=5):
        self.k = k

    def fitModel(self, X, y):
        self.X_train = X 
        self.y_train = y

    def predict(self, X):
        y_pred = np.zeros(len(X))
        for i, test_point in enumerate(X):
            distances = self.compute_distances(test_point)
            k_indices = np.argsort(distances)[:self.k]
            # Use j here so the outer loop index i is not visually shadowed
            k_neighbor_labels = [self.y_train[j] for j in k_indices]
            most_common = Counter(k_neighbor_labels).most_common(1)
            y_pred[i] = most_common[0][0]
        return y_pred

    def compute_distances(self, test_point):
        return cdist(np.array([test_point]), self.X_train).flatten()


In [12]:
# Reducing the sample size for training and testing to suit the custom KNN Classifier
reduced_train_features, reduced_train_labels = reduceSampleSize(20000, train_features, train_labels)
reduced_test_features, reduced_test_labels = reduceSampleSize(250, test_features, test_labels)


In [13]:
# Using custom KNN classifier
custom_knn_classifier = KNN_Classifier()
custom_knn_classifier.fitModel(reduced_train_features, reduced_train_labels)

y_pred = custom_knn_classifier.predict(reduced_test_features)

num_correct = np.sum(y_pred == reduced_test_labels)
accuracy = (num_correct) / (reduced_test_features.shape[0])

print('Got %d / %d correct => accuracy: %f' % (num_correct, reduced_test_features.shape[0], accuracy*100))

Got 83 / 250 correct => accuracy: 33.200000


The custom KNN classifier that I have written reaches an accuracy of 33.2%. This figure depends on the sampling, because the classifier was fitted on only 20,000 training images and evaluated on only 250 test images.
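
To get a feel for how much an accuracy measured on 250 samples can vary, one rough option is the binomial standard error, sqrt(p(1 - p) / n). The sketch below assumes the test samples are independent draws; the helper name `accuracy_standard_error` is made up for this illustration.

```python
import math

def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy estimated from n_samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)

# 83 correct out of 250, as reported above.
se = accuracy_standard_error(83 / 250, 250)
print(f"standard error: {se * 100:.2f} percentage points")
```

Under that assumption the standard error is roughly three percentage points, so differences of one or two points between runs on 250 samples are hard to interpret.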

Nevertheless, let's try to increase the accuracy by tuning the hyperparameters and adding distance weights.

# OPTIMISING THE CUSTOM KNN CLASSIFIER

<h4>
    To optimize the custom KNN class, we need to experiment with different aspects of the algorithm:
</h4>


<b>Distance Weights:</b><br>
Introduce distance weights when computing predictions. You can use the inverse of the distances as weights, giving more importance to closer neighbors.
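
A minimal sketch of inverse-distance voting is shown below. The helper name `weighted_vote` and the example labels are hypothetical; the key point is that the weights of neighbours sharing a label are summed before the winner is picked.

```python
from collections import defaultdict

import numpy as np

def weighted_vote(neighbor_labels, neighbor_distances, eps=1e-5):
    """Return the label with the largest summed inverse-distance weight."""
    votes = defaultdict(float)
    for label, distance in zip(neighbor_labels, neighbor_distances):
        # eps avoids division by zero when a neighbour coincides with the query
        votes[label] += 1.0 / (distance + eps)
    return max(votes, key=votes.get)

# Hypothetical neighbours: one very close point with label 3,
# two farther points with label 5.
labels = [3, 5, 5]
distances = np.array([0.5, 1.5, 2.0])
print(weighted_vote(labels, distances))
```

Here a plain majority vote would pick label 5, while the weighted vote picks label 3 because the single close neighbour outweighs the two distant ones.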

<b>Use a Different Distance Metric:</b><br>
Experiment with different distance metrics. You might want to try Manhattan distance `(p=1)` or Minkowski distance with different `p` values.
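
The effect of `p` can be seen in a small NumPy sketch of the Minkowski distance. The function name `minkowski_distances` and the sample points are assumptions made for this illustration.

```python
import numpy as np

def minkowski_distances(test_point, X_train, p=2):
    """Minkowski distances from one test point to every row of X_train."""
    diff = np.abs(X_train - test_point)
    return (diff ** p).sum(axis=1) ** (1.0 / p)

X_train = np.array([[0.0, 0.0], [3.0, 4.0]])
test_point = np.array([0.0, 0.0])

manhattan = minkowski_distances(test_point, X_train, p=1)  # |3| + |4| = 7
euclidean = minkowski_distances(test_point, X_train, p=2)  # sqrt(9 + 16) = 5
print(manhattan, euclidean)
```

With SciPy's `cdist`, Manhattan distance is selected with `metric='cityblock'`, and `metric='minkowski'` accepts a `p` argument.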

<b>Kernel Density Estimation:</b><br>
Implement kernel density estimation for a more robust estimation of the underlying density function.

<b>Use a Larger Dataset:</b><br>
If possible, use a larger dataset for training. More data can often lead to better generalization.

<b>Implement Parallelization:</b><br>
Depending on the size of your dataset, you might benefit from parallelizing the distance calculations. This can be achieved using parallel computing libraries such as `joblib` or `concurrent.futures`.
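
One possible shape for this, sketched with `concurrent.futures` from the standard library, is to split the test points into chunks and search each chunk in its own thread. The function names, chunk count, and random data are assumptions for illustration, and whether threads actually speed things up depends on how much of the work NumPy performs outside the interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def nearest_indices(chunk, X_train, k):
    """Indices of the k nearest training rows for each row of chunk."""
    result = np.empty((len(chunk), k), dtype=int)
    for i, point in enumerate(chunk):
        distances = np.sqrt(((X_train - point) ** 2).sum(axis=1))
        result[i] = np.argsort(distances)[:k]
    return result

def parallel_nearest_indices(X_test, X_train, k, n_chunks=4, max_workers=4):
    """Split the test points into chunks and process the chunks in threads."""
    chunks = np.array_split(X_test, n_chunks)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order, so the chunks stack back in order.
        results = list(pool.map(lambda c: nearest_indices(c, X_train, k), chunks))
    return np.vstack(results)

# Random data (hypothetical) to check that both paths agree.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 16))
X_test = rng.normal(size=(20, 16))
serial = nearest_indices(X_test, X_train, k=5)
parallel = parallel_nearest_indices(X_test, X_train, k=5)
print(np.array_equal(serial, parallel))
```

The serial and chunked results should match, since each test point is processed by the same per-row computation in both cases.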

<b>Optimize Code for Efficiency:</b><br>
Ensure that your code is optimized for efficiency. Use vectorized operations wherever possible to speed up computations.
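
As one way to vectorize, the sketch below computes all test-to-train squared distances at once with the identity ||a - b||² = ||a||² + ||b||² - 2a·b, then counts votes without a Python loop. The function names are hypothetical, labels are assumed to be integers in `[0, num_classes)`, and ties are broken toward the lowest class index, which can differ from `Counter.most_common`.

```python
import numpy as np

def pairwise_sq_euclidean(A, B):
    """Squared Euclidean distances between every row of A and every row of B."""
    a2 = (A ** 2).sum(axis=1)[:, None]
    b2 = (B ** 2).sum(axis=1)[None, :]
    d2 = a2 + b2 - 2.0 * A @ B.T
    return np.maximum(d2, 0.0)  # clip tiny negatives caused by rounding

def predict_batch(X_test, X_train, y_train, k, num_classes):
    """Predict labels for all test rows with a vectorized majority vote."""
    d2 = pairwise_sq_euclidean(X_test, X_train)
    # argpartition finds the k smallest entries per row without a full sort.
    k_idx = np.argpartition(d2, k - 1, axis=1)[:, :k]
    neighbor_labels = y_train[k_idx]
    # Count the votes per class for every test row.
    votes = np.zeros((len(X_test), num_classes), dtype=int)
    rows = np.arange(len(X_test))[:, None]
    np.add.at(votes, (rows, neighbor_labels), 1)
    return votes.argmax(axis=1)

# Toy data (hypothetical): two well-separated groups.
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 1, 1, 1])
X_test = np.array([[0.1, 0.2], [5.2, 5.1]])
predictions = predict_batch(X_test, X_train, y_train, k=3, num_classes=2)
print(predictions)
```

The full distance matrix has one entry per test-train pair, so for large inputs it may need to be computed in batches of test rows to fit in memory.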

<b>Optimize `k` Value:</b><br>
Experiment with different values of `k` to find the one that gives the best performance. You can use techniques like grid search or random search for hyperparameter tuning.
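
A simple form of grid search is to score each candidate `k` on a validation split held out from the training data, so that the test set is not used for tuning. The sketch below uses synthetic two-cluster data and made-up helper names (`knn_predict`, `best_k`) purely to show the loop; it is not tuned for CIFAR-10.

```python
import numpy as np

def knn_predict(X_query, X_train, y_train, k):
    """Plain KNN prediction with Euclidean distance and majority vote."""
    preds = np.empty(len(X_query), dtype=int)
    for i, point in enumerate(X_query):
        distances = np.sqrt(((X_train - point) ** 2).sum(axis=1))
        nearest = np.argsort(distances)[:k]
        preds[i] = np.bincount(y_train[nearest]).argmax()
    return preds

def best_k(X_train, y_train, X_val, y_val, candidates=(1, 3, 5, 7, 9)):
    """Return the candidate k with the highest validation accuracy."""
    scores = {}
    for k in candidates:
        preds = knn_predict(X_val, X_train, y_train, k)
        scores[k] = float((preds == y_val).mean())
    return max(scores, key=scores.get), scores

# Synthetic two-cluster data (hypothetical), only to show that the loop runs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(4, 1, (60, 2))])
y = np.array([0] * 60 + [1] * 60)
order = rng.permutation(len(X))
X, y = X[order], y[order]
X_tr, y_tr, X_val, y_val = X[:90], y[:90], X[90:], y[90:]

k_star, scores = best_k(X_tr, y_tr, X_val, y_val)
print(k_star, scores)
```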


In [22]:
class KNN_Classifier_Optimized:

    def __init__(self, k=4, weights='uniform', distance_metric='euclidean'):
        self.k = k
        self.weights = weights
        self.distance_metric = distance_metric
        self.scaler = MinMaxScaler()

    def fitModel(self, X, y):
        # Scale the features
        self.X_train = self.scaler.fit_transform(X)
        self.y_train = y

    def predict(self, X):
        # Scale the test features
        X_scaled = self.scaler.transform(X)
        
        y_pred = np.zeros(len(X))
        for i, test_point in enumerate(X_scaled):
            distances = self.compute_distances(test_point)
            k_indices = np.argsort(distances)[:self.k]
            # Use j here so the outer loop index i is not visually shadowed
            k_neighbor_labels = [self.y_train[j] for j in k_indices]

            if self.weights == 'uniform':
                most_common = Counter(k_neighbor_labels).most_common(1)
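            else:
                # Inverse-distance weights; the small epsilon avoids division by zero
                weights = 1 / (distances[k_indices] + 1e-5)
                # Sum the weights per label so that repeated labels accumulate their votes
                # (dict(zip(...)) would keep only the last weight seen for each label)
                weighted_votes = Counter()
                for label, weight in zip(k_neighbor_labels, weights):
                    weighted_votes[label] += weight
                most_common = weighted_votes.most_common(1)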

            y_pred[i] = most_common[0][0]

        return y_pred

    def compute_distances(self, test_point):
        return cdist(np.array([test_point]), self.X_train, metric=self.distance_metric).flatten()


In [23]:
# Using optimized custom KNN classifier
custom_knn_classifier_o = KNN_Classifier_Optimized()
custom_knn_classifier_o.fitModel(reduced_train_features, reduced_train_labels)

y_pred_o = custom_knn_classifier_o.predict(reduced_test_features)

num_correct = np.sum(y_pred_o == reduced_test_labels)
accuracy = (num_correct) / (reduced_test_features.shape[0])

print('Got %d / %d correct => accuracy: %f' % (num_correct, reduced_test_features.shape[0], accuracy*100))

Got 86 / 250 correct => accuracy: 34.400000


# RESULT

            Classifier Name                           Accuracy
    Scikit-Learn In-Built KNN Classifier               33.03%
    Custom KNN Classifier                              33.20%
    Custom KNN Classifier Optimized                    34.40%

With the optimized custom KNN classifier we reached an accuracy of 34.40%, slightly above the 33.03% of the built-in Scikit-Learn KNN classifier. The difference is small, and the figures are not measured on the same data: the Scikit-Learn classifier was evaluated on the full test set, while the custom classifiers were fitted on 20,000 training images and evaluated on 250 test images, so part of the gap may be sampling noise.