# K-Nearest Neighbours

In this notebook, I will work through an implementation of the KNN algorithm. This implementation will cover both regression and classification use cases. I will use the breast cancer and diabetes datasets, available from scikit-learn, to test this code.

In [1]:
## imports ##
import numpy as np
from scipy import stats
from typing import Dict, Any
from abc import ABC,abstractmethod
from sklearn.datasets import load_diabetes, load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.metrics import mean_squared_error,\
                            mean_absolute_error,\
                            accuracy_score,\
                            precision_score,\
                            recall_score,\
                            f1_score,\
                            make_scorer
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

Now let's proceed to develop our implementation of KNN:

In [2]:
class KNN(ABC):
    """
    Base class for KNN implementations
    """
    
    def __init__(self, K : int = 3, metric : str = 'minkowski', p : int = 2) -> None:
        """
        Initializer function. Ensure that input parameters are compatible.
        Inputs:
            K      -> integer specifying number of neighbours to consider
            metric -> string to indicate the distance metric to use (valid entries are 'minkowski' or 'cosine')
            p      -> order of the minkowski metric (valid only when metric == 'minkowski')
        """
        # check metric is a valid entry
        valid_distance = ['minkowski','cosine']
        if metric not in valid_distance:
            msg = "Entered value for metric is not valid. Pick one of {}".format(valid_distance)
            raise ValueError(msg)
        # check minkowski p parameter (must be at least 1, as the error message states)
        if (metric == 'minkowski') and (p < 1):
            msg = "Entered value for p is not valid. For metric = 'minkowski', p >= 1"
            raise ValueError(msg)
        # store/initialise input parameters
        self.K       = K
        self.metric  = metric
        self.p       = p
        self.X_train = np.array([])
        self.y_train = np.array([])
        
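    def __del__(self) -> None:
        """
        Destructor function. Attributes are removed only if they exist,
        since __init__ may have raised before they were set.
        """
        for name in ('K', 'metric', 'p', 'X_train', 'y_train'):
            self.__dict__.pop(name, None)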
      
    def __minkowski(self, x : np.array) -> np.array:
        """
        Private function to compute the minkowski distance between point x and the training data X
        Inputs:
            x -> numpy data point of predictors to consider
        Outputs:
            np.array -> numpy array of the computed distances
        """
        return np.power(np.sum(np.power(np.abs(self.X_train - x),self.p),axis=1),1/self.p)
    
    def __cosine(self, x : np.array) -> np.array:
        """
        Private function to compute the cosine distance between point x and the training data X
        Inputs:
            x -> numpy data point of predictors to consider
        Outputs:
            np.array -> numpy array of the computed distances
        """
        return (1 - (np.dot(self.X_train,x)/(np.linalg.norm(x)*np.linalg.norm(self.X_train,axis=1))))
    
    def __distances(self, X : np.array) -> np.array:
        """
        Private function to compute distances from each point x in X to every point in the training set
        Inputs:
            X -> numpy array of points, one point x per row
        Outputs:
            D -> numpy array of shape (number of points in X, number of training points) containing the distances
        """
        # cover distance calculation
        if self.metric == 'minkowski':
            D = np.apply_along_axis(self.__minkowski,1,X)
        elif self.metric == 'cosine':
            D = np.apply_along_axis(self.__cosine,1,X)
        # return computed distances
        return D
    
    @abstractmethod
    def _generate_predictions(self, idx_neighbours : np.array) -> np.array:
        """
        Protected function to compute predictions from the K nearest neighbours
        """
        pass
        
    def fit(self, X : np.array, y : np.array) -> None:
        """
        Public training function for the class. It is assumed input X has been normalised.
        Inputs:
            X -> numpy array containing the predictor features
            y -> numpy array containing the labels associated with each value in X
        """
        # store training data
        self.X_train = np.copy(X)
        self.y_train = np.copy(y)
        
    def predict(self, X : np.array) -> np.array:
        """
        Public prediction function for the class. 
        It is assumed input X has been normalised in the same fashion as the input to the training function
        Inputs:
            X -> numpy array containing the predictor features
        Outputs:
           y_pred -> numpy array containing the predicted labels
        """
        # ensure we have already trained the instance
        if (self.X_train.size == 0) or (self.y_train.size == 0):
            raise Exception('Model is not trained. Call fit before calling predict.')
        # compute distances
        D = self.__distances(X)
        # obtain indices for the K nearest neighbours
        idx_neighbours = D.argsort()[:,:self.K]
        # compute predictions
        y_pred = self._generate_predictions(idx_neighbours)
        # return results
        return y_pred
    
    def get_params(self, deep : bool = False) -> Dict:
        """
        Public function to return model parameters
        Inputs:
            deep -> boolean input parameter
        Outputs:
            Dict -> dictionary of stored class input parameters
        """
        return {'K':self.K,
                'metric':self.metric,
                'p':self.p}
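
To make the two distance calculations concrete, here is a minimal standalone sketch that mirrors the expressions used in `__minkowski` and `__cosine`. The function names and the toy arrays are my own assumptions for illustration and are not part of the class above.

```python
import numpy as np

def minkowski_distances(X_train: np.ndarray, x: np.ndarray, p: float = 2) -> np.ndarray:
    """Minkowski distance of order p from point x to every row of X_train."""
    return np.power(np.sum(np.power(np.abs(X_train - x), p), axis=1), 1 / p)

def cosine_distances(X_train: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Cosine distance (1 - cosine similarity) from point x to every row of X_train."""
    similarity = X_train @ x / (np.linalg.norm(X_train, axis=1) * np.linalg.norm(x))
    return 1 - similarity

# toy training set and query point (made-up values, for illustration only)
X_toy = np.array([[3.0, 4.0],
                  [1.0, 0.0],
                  [0.0, 2.0]])
x_query = np.array([1.0, 0.0])

d_manhattan = minkowski_distances(X_toy, x_query, p=1)  # p = 1 gives the Manhattan distance
d_euclidean = minkowski_distances(X_toy, x_query, p=2)  # p = 2 gives the Euclidean distance
d_cosine = cosine_distances(X_toy, x_query)

print(d_manhattan)
print(d_euclidean)
print(d_cosine)
```

Note that the cosine distance is undefined for a zero vector, since the norm in the denominator vanishes. Neither this sketch nor the class above guards against that case.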

In [3]:
class KNNClassifier(KNN):
    """
    Class for KNN classification implementation
    """
    
    def __init__(self, K : int = 3, metric : str = 'minkowski', p : int = 2) -> None:
        """
        Initializer function. Ensure that input parameters are compatible.
        Inputs:
            K       -> integer specifying number of neighbours to consider
            metric  -> string to indicate the distance metric to use (valid entries are 'minkowski' or 'cosine')
            p       -> order of the minkowski metric (valid only when metric == 'minkowski')
        """
        # call base class initialiser
        super().__init__(K,metric,p)
        
    def _generate_predictions(self, idx_neighbours : np.array) -> np.array:
        """
        Protected function to compute predictions from the K nearest neighbours
        Inputs:
            idx_neighbours -> indices of nearest neighbours
        Outputs:
            y_pred -> numpy array of prediction results
        """        
        # compute the mode label for each submitted sample
        y_pred = stats.mode(self.y_train[idx_neighbours],axis=1).mode.flatten()   
        # return result
        return y_pred

In [4]:
class KNNRegressor(KNN):
    """
    Class for KNN regression implementation
    """
    
    def __init__(self, K : int = 3, metric : str = 'minkowski', p : int = 2) -> None:
        """
        Initializer function. Ensure that input parameters are compatible.
        Inputs:
            K       -> integer specifying number of neighbours to consider
            metric  -> string to indicate the distance metric to use (valid entries are 'minkowski' or 'cosine')
            p       -> order of the minkowski metric (valid only when metric == 'minkowski')
        """
        # call base class initialiser
        super().__init__(K,metric,p)
        
    def _generate_predictions(self, idx_neighbours : np.array) -> np.array:
        """
        Protected function to compute predictions from the K nearest neighbours
        Inputs:
            idx_neighbours -> indices of nearest neighbours
        Outputs:
            y_pred -> numpy array of prediction results
        """
        # compute the mean label for each submitted sample
        y_pred = np.mean(self.y_train[idx_neighbours],axis=1)         
        # return result
        return y_pred
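
The prediction step shared by both subclasses can be sketched on a toy problem: sort the distances, keep the indices of the K nearest training points, then take the mode (classification) or the mean (regression) of the neighbours' labels. The data below are made up, and the `majority_vote` helper is a hypothetical stand-in for `stats.mode`. In this sketch, ties go to the smallest label.

```python
import numpy as np

# toy 1-D training set (made-up values, for illustration only)
X_train = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
y_class = np.array([-1, -1, -1, 1, 1])            # class labels
y_value = np.array([1.0, 2.0, 3.0, 20.0, 22.0])   # regression targets
X_query = np.array([[1.5], [9.0]])
K = 3

# Euclidean distance from every query point to every training point, shape (2, 5)
D = np.sqrt(((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2))

# indices of the K closest training points for each query point, shape (2, K)
idx_neighbours = np.argsort(D, axis=1)[:, :K]

def majority_vote(labels: np.ndarray) -> int:
    """Most frequent label among the neighbours (ties go to the smallest label)."""
    values, counts = np.unique(labels, return_counts=True)
    return int(values[np.argmax(counts)])

# classification: mode of the neighbours' labels
y_pred_class = np.array([majority_vote(row) for row in y_class[idx_neighbours]])
# regression: mean of the neighbours' targets
y_pred_value = y_value[idx_neighbours].mean(axis=1)

print(y_pred_class)
print(y_pred_value)
```

The second query point sits between the two clusters. Its three nearest neighbours are two points from the right-hand cluster and one from the left, so the vote picks the label 1 while the regression mean is pulled towards the left-hand targets.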

## KNN Classification

### Load Classification Dataset

Here I'll load the breast cancer dataset. A full description of these data can be found at: https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-dataset. 

Note I already analysed these data in Notebook II - Logistic Regression. As such, I won't repeat that work here.

In [5]:
## load classification dataset ##
data = load_breast_cancer()
X    = data.data
y    = data.target

In [6]:
# recode labels from {0, 1} to {-1, +1}
y = np.where(y==0,-1,1)

### Investigate Performance

Here I will use 10-fold cross-validation to measure the performance of the KNN classifier. I will also try several values of K and several distance metrics:

In [7]:
## define the scoring metrics ##
scoring_metrics = {'accuracy' : make_scorer(accuracy_score), 
                   'precision': make_scorer(precision_score),
                   'recall'   : make_scorer(recall_score),
                   'f1'       : make_scorer(f1_score)}
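
As a reminder of what these four scores measure, here is a minimal NumPy sketch that computes them from the counts of true positives, false positives and false negatives. The `binary_scores` helper and the toy labels are assumptions made for illustration. The scikit-learn functions used above treat the label 1 as the positive class by default, which after the recoding in In [6] corresponds to the original target value 1.

```python
import numpy as np

def binary_scores(y_true: np.ndarray, y_pred: np.ndarray, pos_label: int = 1) -> dict:
    """Accuracy, precision, recall and F1 for a binary problem (no zero-division handling)."""
    tp = np.sum((y_pred == pos_label) & (y_true == pos_label))  # true positives
    fp = np.sum((y_pred == pos_label) & (y_true != pos_label))  # false positives
    fn = np.sum((y_pred != pos_label) & (y_true == pos_label))  # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {'accuracy': float(np.mean(y_true == y_pred)),
            'precision': float(precision),
            'recall': float(recall),
            'f1': float(2 * precision * recall / (precision + recall))}

# toy labels in the {-1, +1} encoding (made-up values, for illustration only)
y_true_toy = np.array([1, 1, 1, -1, -1, 1])
y_pred_toy = np.array([1, 1, -1, 1, -1, 1])

scores = binary_scores(y_true_toy, y_pred_toy)
print(scores)
```

Here there are 3 true positives, 1 false positive and 1 false negative, so precision, recall and F1 are all 0.75 while accuracy is 4/6.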

In [8]:
## define a helper function for our analysis ##
def cv_classifier_analysis(pipe : Any, 
                           X : np.array, 
                           y : np.array, 
                           k : int, 
                           scoring_metrics : Dict,
                           metric : str) -> None:
    """
    Function to carry out cross-validation analysis for input KNN classifier
    Inputs:
        pipe            -> input pipeline containing preprocessing and KNN classifier
        X               -> numpy array of predictors
        y               -> numpy array of labels
        k               -> integer value for number of nearest neighbours to consider
        scoring_metrics -> dictionary of scoring metrics to consider 
        metric          -> string indicating distance metric used
    """
    # print hyperparameter configuration
    print('RESULTS FOR K = {0}, {1}'.format(k,metric))
    # run cross validation
    dcScores = cross_validate(pipe,X,y,cv=StratifiedKFold(10),scoring=scoring_metrics)
    # report results
    print('Mean Accuracy: %.2f' % np.mean(dcScores['test_accuracy']))
    print('Mean Precision: %.2f' % np.mean(dcScores['test_precision']))
    print('Mean Recall: %.2f' % np.mean(dcScores['test_recall']))
    print('Mean F1: %.2f' % np.mean(dcScores['test_f1']))

In [9]:
## perform cross-validation for a range of model hyperparameters for the Custom model ##
K = [3,6,9]
for k in K:
    # define the pipeline for Manhattan distance
    p_manhat = Pipeline([('scaler', StandardScaler()), ('knn', KNNClassifier(k, metric = 'minkowski', p = 1))])
    # define the pipeline for Euclidean distance
    p_euclid = Pipeline([('scaler', StandardScaler()), ('knn', KNNClassifier(k, metric = 'minkowski', p = 2))])
    # define the pipeline for cosine distance
    p_cosine = Pipeline([('scaler', StandardScaler()), ('knn', KNNClassifier(k, metric = 'cosine'))])
    # cross validate for p_manhat
    cv_classifier_analysis(p_manhat, X, y, k, scoring_metrics, 'MANHATTAN DISTANCE')
    # cross validate for p_euclid
    cv_classifier_analysis(p_euclid, X, y, k, scoring_metrics, 'EUCLIDEAN DISTANCE')
    # cross validate for p_cosine
    cv_classifier_analysis(p_cosine, X, y, k, scoring_metrics, 'COSINE DISTANCE')

RESULTS FOR K = 3, MANHATTAN DISTANCE
Mean Accuracy: 0.97
Mean Precision: 0.97
Mean Recall: 0.99
Mean F1: 0.98
RESULTS FOR K = 3, EUCLIDEAN DISTANCE
Mean Accuracy: 0.96
Mean Precision: 0.96
Mean Recall: 0.99
Mean F1: 0.97
RESULTS FOR K = 3, COSINE DISTANCE
Mean Accuracy: 0.96
Mean Precision: 0.96
Mean Recall: 0.97
Mean F1: 0.97
RESULTS FOR K = 6, MANHATTAN DISTANCE
Mean Accuracy: 0.96
Mean Precision: 0.96
Mean Recall: 0.98
Mean F1: 0.97
RESULTS FOR K = 6, EUCLIDEAN DISTANCE
Mean Accuracy: 0.96
Mean Precision: 0.96
Mean Recall: 0.98
Mean F1: 0.97
RESULTS FOR K = 6, COSINE DISTANCE
Mean Accuracy: 0.96
Mean Precision: 0.97
Mean Recall: 0.96
Mean F1: 0.96
RESULTS FOR K = 9, MANHATTAN DISTANCE
Mean Accuracy: 0.96
Mean Precision: 0.96
Mean Recall: 0.99
Mean F1: 0.97
RESULTS FOR K = 9, EUCLIDEAN DISTANCE
Mean Accuracy: 0.97
Mean Precision: 0.96
Mean Recall: 0.99
Mean F1: 0.97
RESULTS FOR K = 9, COSINE DISTANCE
Mean Accuracy: 0.96
Mean Precision: 0.96
Mean Recall: 0.98
Mean F1: 0.97


In [10]:
## perform cross-validation for a range of model hyperparameters for the Scikit-learn model ##
K = [3,6,9]
for k in K:
    # define the pipeline for Manhattan distance
    p_manhat = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier(k, metric = 'minkowski', p = 1))])
    # define the pipeline for Euclidean distance
    p_euclid = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier(k, metric = 'minkowski', p = 2))])
    # define the pipeline for cosine distance
    p_cosine = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsClassifier(k, metric = 'cosine'))])
    # cross validate for p_manhat
    cv_classifier_analysis(p_manhat, X, y, k, scoring_metrics, 'MANHATTAN DISTANCE')
    # cross validate for p_euclid
    cv_classifier_analysis(p_euclid, X, y, k, scoring_metrics, 'EUCLIDEAN DISTANCE')
    # cross validate for p_cosine
    cv_classifier_analysis(p_cosine, X, y, k, scoring_metrics, 'COSINE DISTANCE')

RESULTS FOR K = 3, MANHATTAN DISTANCE
Mean Accuracy: 0.97
Mean Precision: 0.97
Mean Recall: 0.99
Mean F1: 0.98
RESULTS FOR K = 3, EUCLIDEAN DISTANCE
Mean Accuracy: 0.96
Mean Precision: 0.96
Mean Recall: 0.99
Mean F1: 0.97
RESULTS FOR K = 3, COSINE DISTANCE
Mean Accuracy: 0.96
Mean Precision: 0.96
Mean Recall: 0.97
Mean F1: 0.97
RESULTS FOR K = 6, MANHATTAN DISTANCE
Mean Accuracy: 0.96
Mean Precision: 0.96
Mean Recall: 0.98
Mean F1: 0.97
RESULTS FOR K = 6, EUCLIDEAN DISTANCE
Mean Accuracy: 0.96
Mean Precision: 0.96
Mean Recall: 0.98
Mean F1: 0.97
RESULTS FOR K = 6, COSINE DISTANCE
Mean Accuracy: 0.96
Mean Precision: 0.97
Mean Recall: 0.96
Mean F1: 0.96
RESULTS FOR K = 9, MANHATTAN DISTANCE
Mean Accuracy: 0.96
Mean Precision: 0.96
Mean Recall: 0.99
Mean F1: 0.97
RESULTS FOR K = 9, EUCLIDEAN DISTANCE
Mean Accuracy: 0.97
Mean Precision: 0.96
Mean Recall: 0.99
Mean F1: 0.97
RESULTS FOR K = 9, COSINE DISTANCE
Mean Accuracy: 0.96
Mean Precision: 0.96
Mean Recall: 0.98
Mean F1: 0.97


We can better summarise these results in a table:

K | Distance | Custom Accuracy | Sklearn Accuracy | Custom Precision | Sklearn Precision | Custom Recall | Sklearn Recall | Custom F1 | Sklearn F1
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
3 | Manhattan | 0.97 | 0.97 | 0.97 | 0.97 | 0.99 | 0.99 | 0.98 | 0.98
3 | Euclidean | 0.96 | 0.96 | 0.96 | 0.96 | 0.99 | 0.99 | 0.97 | 0.97
3 | Cosine | 0.96 | 0.96 | 0.96 | 0.96 | 0.97 | 0.97 | 0.97 | 0.97
6 | Manhattan | 0.96 | 0.96 | 0.96 | 0.96 | 0.98 | 0.98 | 0.97 | 0.97
6 | Euclidean | 0.96 | 0.96 | 0.96 | 0.96 | 0.98 | 0.98 | 0.97 | 0.97
6 | Cosine | 0.96 | 0.96 | 0.97 | 0.97 | 0.96 | 0.96 | 0.96 | 0.96
9 | Manhattan | 0.96 | 0.96 | 0.96 | 0.96 | 0.99 | 0.99 | 0.97 | 0.97
9 | Euclidean | 0.97 | 0.97 | 0.96 | 0.96 | 0.99 | 0.99 | 0.97 | 0.97
9 | Cosine | 0.96 | 0.96 | 0.96 | 0.96 | 0.98 | 0.98 | 0.97 | 0.97

First, our custom KNN classifier yields results that are identical to the scikit-learn implementation, at least to the two decimal places reported. Looking at the tabulated statistics, the Manhattan distance with $K = 3$ produces the best results. However, performance varies little across the hyperparameters analysed here: every metric stays within 0.03 of its best value.

## KNN Regression

### Load Regression Dataset

Here I'll load the diabetes dataset, available from scikit-learn. A full description of this dataset is available here: https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset.

Note I have already explored these data in Notebook XI - Adaboost Regression. As such, I won't repeat it here.

In [11]:
## load regression dataset ##
X,y = load_diabetes(return_X_y=True,as_frame=False)

### Investigate Performance

Here I will use 10-fold cross-validation to measure the performance of the KNN regressor. I will also try several values of K and several distance metrics:

In [12]:
## define the scoring metrics ##
scoring_metrics = {'mse' : make_scorer(mean_squared_error), 
                   'mae': make_scorer(mean_absolute_error)}
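
For reference, the two regression scores can be sketched directly in NumPy. The toy targets and predictions below are made up for illustration. Because `make_scorer` is called here with its default settings, the scores reported by `cross_validate` are the plain (positive) error values, where lower is better.

```python
import numpy as np

# toy targets and predictions (made-up values, for illustration only)
y_true_toy = np.array([100.0, 150.0, 200.0, 250.0])
y_pred_toy = np.array([110.0, 140.0, 230.0, 250.0])

errors = y_pred_toy - y_true_toy

# mean squared error: penalises large errors more heavily, in squared target units
mse = float(np.mean(errors ** 2))
# mean absolute error: average error magnitude, in the same units as the target
mae = float(np.mean(np.abs(errors)))
# root mean squared error: brings MSE back to the units of the target
rmse = float(np.sqrt(mse))

print('MSE: %.2f' % mse)
print('MAE: %.2f' % mae)
print('RMSE: %.2f' % rmse)
```

Since MSE is expressed in squared units, its values are much larger than the MAE values for the same predictions. Taking the square root makes the two directly comparable.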

In [13]:
## define a helper function for our analysis ##
def cv_regressor_analysis(pipe : Any, 
                          X : np.array, 
                          y : np.array, 
                          k : int, 
                          scoring_metrics : Dict,
                          metric : str) -> None:
    """
    Function to carry out cross-validation analysis for input KNN regressor
    Inputs:
        pipe            -> input pipeline containing preprocessing and KNN regressor
        X               -> numpy array of predictors
        y               -> numpy array of labels
        k               -> integer value for number of nearest neighbours to consider
        scoring_metrics -> dictionary of scoring metrics to consider 
        metric          -> string indicating distance metric used
    """
    # print hyperparameter configuration
    print('RESULTS FOR K = {0}, {1}'.format(k,metric))
    # run cross validation
    dcScores = cross_validate(pipe,X,y,cv=10,scoring=scoring_metrics)
    # report results
    print('Mean MSE: %.2f' % np.mean(dcScores['test_mse']))
    print('Mean MAE: %.2f' % np.mean(dcScores['test_mae']))

In [14]:
## perform cross-validation for a range of model hyperparameters for the Custom model ##
K = [3,6,9]
for k in K:       
    # define the pipeline for Manhattan distance
    p_manhat = Pipeline([('scaler', StandardScaler()), ('knn', KNNRegressor(k, metric = 'minkowski', p = 1))])
    # define the pipeline for Euclidean distance
    p_euclid = Pipeline([('scaler', StandardScaler()), ('knn', KNNRegressor(k, metric = 'minkowski', p = 2))])
    # define the pipeline for cosine distance
    p_cosine = Pipeline([('scaler', StandardScaler()), ('knn', KNNRegressor(k, metric = 'cosine'))])
    # cross validate for p_manhat
    cv_regressor_analysis(p_manhat, X, y, k, scoring_metrics, 'MANHATTAN DISTANCE')
    # cross validate for p_euclid
    cv_regressor_analysis(p_euclid, X, y, k, scoring_metrics, 'EUCLIDEAN DISTANCE')
    # cross validate for p_cosine
    cv_regressor_analysis(p_cosine, X, y, k, scoring_metrics, 'COSINE DISTANCE')

RESULTS FOR K = 3, MANHATTAN DISTANCE
Mean MSE: 3934.75
Mean MAE: 48.82
RESULTS FOR K = 3, EUCLIDEAN DISTANCE
Mean MSE: 4087.34
Mean MAE: 49.00
RESULTS FOR K = 3, COSINE DISTANCE
Mean MSE: 3814.75
Mean MAE: 47.08
RESULTS FOR K = 6, MANHATTAN DISTANCE
Mean MSE: 3598.75
Mean MAE: 48.01
RESULTS FOR K = 6, EUCLIDEAN DISTANCE
Mean MSE: 3640.96
Mean MAE: 47.55
RESULTS FOR K = 6, COSINE DISTANCE
Mean MSE: 3483.80
Mean MAE: 45.79
RESULTS FOR K = 9, MANHATTAN DISTANCE
Mean MSE: 3504.69
Mean MAE: 47.05
RESULTS FOR K = 9, EUCLIDEAN DISTANCE
Mean MSE: 3451.70
Mean MAE: 46.59
RESULTS FOR K = 9, COSINE DISTANCE
Mean MSE: 3403.23
Mean MAE: 46.06


In [15]:
## perform cross-validation for a range of model hyperparameters for the Scikit-learn model ##
K = [3,6,9]    
for k in K:       
    # define the pipeline for Manhattan distance
    p_manhat = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsRegressor(k, metric = 'minkowski', p = 1))])
    # define the pipeline for Euclidean distance
    p_euclid = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsRegressor(k, metric = 'minkowski', p = 2))])
    # define the pipeline for cosine distance
    p_cosine = Pipeline([('scaler', StandardScaler()), ('knn', KNeighborsRegressor(k, metric = 'cosine'))])
    # cross validate for p_manhat
    cv_regressor_analysis(p_manhat, X, y, k, scoring_metrics, 'MANHATTAN DISTANCE')
    # cross validate for p_euclid
    cv_regressor_analysis(p_euclid, X, y, k, scoring_metrics, 'EUCLIDEAN DISTANCE')
    # cross validate for p_cosine
    cv_regressor_analysis(p_cosine, X, y, k, scoring_metrics, 'COSINE DISTANCE')

RESULTS FOR K = 3, MANHATTAN DISTANCE
Mean MSE: 3934.75
Mean MAE: 48.82
RESULTS FOR K = 3, EUCLIDEAN DISTANCE
Mean MSE: 4087.34
Mean MAE: 49.00
RESULTS FOR K = 3, COSINE DISTANCE
Mean MSE: 3814.75
Mean MAE: 47.08
RESULTS FOR K = 6, MANHATTAN DISTANCE
Mean MSE: 3598.75
Mean MAE: 48.01
RESULTS FOR K = 6, EUCLIDEAN DISTANCE
Mean MSE: 3640.96
Mean MAE: 47.55
RESULTS FOR K = 6, COSINE DISTANCE
Mean MSE: 3483.80
Mean MAE: 45.79
RESULTS FOR K = 9, MANHATTAN DISTANCE
Mean MSE: 3504.69
Mean MAE: 47.05
RESULTS FOR K = 9, EUCLIDEAN DISTANCE
Mean MSE: 3451.70
Mean MAE: 46.59
RESULTS FOR K = 9, COSINE DISTANCE
Mean MSE: 3403.23
Mean MAE: 46.06


We can better summarise these results in a table:

K | Distance | Custom MSE | Sklearn MSE | Custom MAE | Sklearn MAE
--- | --- | --- | --- | --- | ---
3 | Manhattan | 3934.75 | 3934.75 | 48.82 | 48.82
3 | Euclidean | 4087.34 | 4087.34 | 49.00 | 49.00
3 | Cosine | 3814.75 | 3814.75 | 47.08 | 47.08
6 | Manhattan | 3598.75 | 3598.75 | 48.01 | 48.01
6 | Euclidean | 3640.96 | 3640.96 | 47.55 | 47.55
6 | Cosine | 3483.80 | 3483.80 | 45.79 | 45.79
9 | Manhattan | 3504.69 | 3504.69 | 47.05 | 47.05
9 | Euclidean | 3451.70 | 3451.70 | 46.59 | 46.59
9 | Cosine | 3403.23 | 3403.23 | 46.06 | 46.06

As with the KNN classifier, our custom KNN regressor yields results that are identical to the scikit-learn implementation. Looking at the tabulated statistics, **MSE** improves as $K$ increases for every distance metric, with the lowest values seen at $K = 9$. **MAE** follows the same trend for the Manhattan and Euclidean distances, but for the Cosine distance it is lowest at $K = 6$ (45.79) and rises slightly at $K = 9$ (46.06). For every value of $K$, the Cosine distance yields the best results for both **MSE** and **MAE**. At the other extreme, the Euclidean distance with $K = 3$ produces the worst results for these data.
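
The best and worst configurations can also be read off programmatically. The sketch below copies the values from the table above into a dictionary (the variable names are my own) and looks up the extremes for each score.

```python
# (K, distance) -> (mean MSE, mean MAE), copied from the results table
results = {
    (3, 'Manhattan'): (3934.75, 48.82),
    (3, 'Euclidean'): (4087.34, 49.00),
    (3, 'Cosine'):    (3814.75, 47.08),
    (6, 'Manhattan'): (3598.75, 48.01),
    (6, 'Euclidean'): (3640.96, 47.55),
    (6, 'Cosine'):    (3483.80, 45.79),
    (9, 'Manhattan'): (3504.69, 47.05),
    (9, 'Euclidean'): (3451.70, 46.59),
    (9, 'Cosine'):    (3403.23, 46.06),
}

# lower is better for both scores
best_mse = min(results, key=lambda cfg: results[cfg][0])
best_mae = min(results, key=lambda cfg: results[cfg][1])
worst_mse = max(results, key=lambda cfg: results[cfg][0])
worst_mae = max(results, key=lambda cfg: results[cfg][1])

print('Lowest MSE :', best_mse, results[best_mse][0])
print('Lowest MAE :', best_mae, results[best_mae][1])
print('Highest MSE:', worst_mse, results[worst_mse][0])
print('Highest MAE:', worst_mae, results[worst_mae][1])
```

The lowest MSE belongs to the Cosine distance with $K = 9$, while the lowest MAE belongs to the Cosine distance with $K = 6$. The Euclidean distance with $K = 3$ is the worst on both scores.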