# K Nearest Neighbour

The K nearest neighbour (KNN) algorithm is a supervised learning algorithm that is commonly used for classification tasks.

The KNN algorithm is among the simplest supervised machine learning algorithms. It calculates the distance from a new data point to every training data point. Any distance measure can be used, e.g. Euclidean or Manhattan. It then selects the K nearest training points, where K is a positive integer. Finally, it assigns the new data point to the class to which the majority of those K points belong.
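The steps described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used by scikit-learn: the function name `knn_predict` and the toy data are made up for this example, and Euclidean distance is assumed.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points.

    Hypothetical helper for illustration only.
    """
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those k neighbours
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data: two small clusters labelled 0 and 1
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y_toy = [0, 0, 0, 1, 1, 1]

pred_low = knn_predict(X_toy, y_toy, [1.1, 1.0], k=3)
pred_high = knn_predict(X_toy, y_toy, [5.1, 5.0], k=3)
print(pred_low, pred_high)
```

Each query point lies inside one of the two clusters, so all three of its nearest neighbours share the same label and the vote is unanimous.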

Let's import the necessary libraries:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_curve, auc
from matplotlib.legend_handler import HandlerLine2D
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

To understand the KNN classifier, we will use the Titanic dataset, which is used to predict whether a passenger survived the tragedy based on certain information.
Load the dataset using the pandas library. The data is stored in .csv format under the folder titanic/.

In [None]:
train = pd.read_csv("titanic/train.csv")

Let's check the shape, inspect the column types, and peek into the first five entries of the dataset.

In [None]:
train.shape

In [None]:
train.info()

In [None]:
train.head(5)

Also check the statistics of the various fields in the dataset using the 'describe' function below:

In [None]:
train.describe(include="all")

We pass include="all" to include non-numeric columns in the summary. The result gives a lot of useful information about the distribution of our data (minimum, maximum, mean, etc.).

From the data, we can notice that plenty of entries contain NaN (Not a Number) values, which cannot be processed by our algorithm. To fix this, we first count the missing values per column, then fill the missing entries with the mean (for 'Age') or the mode (for 'Embarked') of that field, as shown below:

In [None]:
# Count missing values per column and show only the columns that have any
NAs = pd.concat([train.isnull().sum()], axis=1, keys=['Train'])
NAs[NAs.sum(axis=1) > 0]

In [None]:
train['Age'] = train['Age'].fillna(train['Age'].mean())
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode()[0])

Drop columns which may not be useful for our learning, like ‘Cabin’, ‘Name’ and ‘Ticket’.

In [None]:
# Drop all three columns in a single call
train.drop(["Cabin", "Name", "Ticket"], axis=1, inplace=True)

‘Pclass’ is a categorical feature, so we convert its values to strings.

In [None]:
train['Pclass'] = train['Pclass'].apply(str)

Let’s perform a basic one-hot encoding of the categorical features.

In [None]:
for col in train.dtypes[train.dtypes == 'object'].index:
    for_dummy = train.pop(col)
    train = pd.concat([train, pd.get_dummies(for_dummy, prefix=col)], axis=1)
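To see what one-hot encoding does to a single column, here is a small sketch on a made-up frame. The `toy` data is invented for illustration, and it uses the `columns=` argument of `pd.get_dummies` rather than the pop-and-concat loop above.

```python
import pandas as pd

# Made-up three-row frame with one categorical and one numeric column
toy = pd.DataFrame({"Sex": ["male", "female", "female"],
                    "Fare": [7.25, 71.28, 8.05]})

# Replace 'Sex' with one indicator column per category
encoded = pd.get_dummies(toy, columns=["Sex"], prefix="Sex")
print(encoded)
```

The single 'Sex' column becomes two indicator columns, 'Sex_female' and 'Sex_male', while the numeric 'Fare' column is left unchanged.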

As a final step, let's remove the labels from the data, as our objective is to predict them.

In [None]:
labels = train.pop('Survived')

Our data is now ready for training. Split the data into training and test sets, with 75% of the data used for training and the remaining 25% used for testing.

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(train, labels, test_size=0.25)
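Because the split is random, the scores reported later will vary from run to run. The sketch below, on synthetic data, shows two optional `train_test_split` arguments that may help: `random_state` makes the split repeatable and `stratify` keeps the class proportions equal in both parts. The array names and the seed value are arbitrary choices for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 40 samples, 2 features, balanced binary labels
X_demo = np.arange(80).reshape(40, 2)
y_demo = np.array([0] * 20 + [1] * 20)

# Fixed seed for repeatability; stratify preserves the 50/50 class balance
Xa, Xb, ya, yb = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)
print(Xa.shape, Xb.shape, int(yb.sum()))
```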

First, we will run our K Nearest Neighbours algorithm using the default hyperparameters to check the performance.

Train the model:

In [None]:
model = KNeighborsClassifier()
model.fit(X_train, Y_train)

Now, using the trained classifier, let us predict the test labels from the test data and compare them with the actual labels.

In [None]:
y_pred = model.predict(X_test)

We can check the quality of our predictions using the function 'classification_report', which reports precision, recall and F1-score per class along with the overall accuracy.

In [None]:
# classification_report returns a string, so print it to get a readable table
print(classification_report(Y_test, y_pred))
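To make the numbers in the report easier to read, here is a small sketch on hand-made labels (the `y_true_demo` and `y_pred_demo` lists are invented for illustration). It also uses `output_dict=True` to read individual values programmatically.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hand-made labels: 4 of the 6 predictions are correct
y_true_demo = [0, 0, 0, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1, 0]

print(classification_report(y_true_demo, y_pred_demo))

# The same report as a nested dict, keyed by the class label as a string
report = classification_report(y_true_demo, y_pred_demo, output_dict=True)
acc = accuracy_score(y_true_demo, y_pred_demo)
print(acc, report["1"]["precision"], report["1"]["recall"])
```

For class 1, three points are predicted as 1 and two of them are truly 1, so precision is 2/3. Three points are truly 1 and two of them are found, so recall is also 2/3.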

Let's see if we can improve this score by tuning our hyper-parameters.

In [None]:
# Define hyperparameters and their ranges
n_neighbors = list(range(1,30))
weights = ['uniform', 'distance']
metric = ['euclidean', 'manhattan', 'chebyshev', 'minkowski']

Here, we are varying three hyperparameters associated with the K Nearest Neighbours algorithm to check which combination would fit best with our dataset.
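The metric names in the grid correspond to different ways of measuring distance between two points. The sketch below computes them by hand for two example points chosen for this illustration. The `minkowski` helper is hypothetical, written only to show the formula.

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])
diff = np.abs(a - b)  # per-coordinate differences: [3, 4]

euclidean = np.sqrt((diff ** 2).sum())  # straight-line distance
manhattan = diff.sum()                  # sum of coordinate differences
chebyshev = diff.max()                  # largest coordinate difference


def minkowski(p):
    """Minkowski distance of order p between a and b (illustrative helper)."""
    return (diff ** p).sum() ** (1.0 / p)


print(euclidean, manhattan, chebyshev, minkowski(2))
```

Minkowski distance with p=2 equals Euclidean distance, and with p=1 it equals Manhattan distance. Since `KNeighborsClassifier` uses p=2 by default, 'minkowski' and 'euclidean' in the grid above evaluate the same model unless p is also varied.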

In this example, we will use GridSearchCV to perform cross-validation while tuning our parameters. Once fitted, the GridSearchCV object exposes the parameter combination that produced the best mean cross-validated score, and it can be used directly as a classifier with those parameters.

In [None]:
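# Set the parameters by cross-validation
tuned_parameters = {'n_neighbors': n_neighbors, 'metric': metric, 'weights': weights}
# NOTE: the source is cut off after "tun"; the call is completed here with
# scikit-learn's default settings, and the fit call below is assumed.
clf = GridSearchCV(model, tuned_parameters)
clf.fit(X_train, Y_train)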
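As a self-contained illustration of how a grid search over KNN hyperparameters runs end to end, here is a sketch on synthetic data. The dataset, the reduced grid, and the choices `cv=5` and `scoring="accuracy"` are assumptions for this example, not settings taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification problem (stand-in for the Titanic data)
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)

# A deliberately small grid: 3 x 2 = 6 candidate combinations
param_grid = {"n_neighbors": [1, 3, 5], "weights": ["uniform", "distance"]}

# 5-fold cross-validation, scored by accuracy (both assumed for this sketch)
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_syn, y_syn)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `best_params_` holds the winning combination and `best_score_` its mean cross-validated score. By default the search object also refits the best model on all the data passed to `fit`, so it can be used for prediction directly.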

Best parameters set found on development set:

In [None]:
clf.best_params_

Best score found on development set:

In [None]:
clf.best_score_

Grid scores on development set:

In [None]:
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))

Now our classifier is ready. Test the classifier using the test data to verify the accuracy.

In [None]:
y_true, y_pred = Y_test, clf.predict(X_test)
print(classification_report(y_true, y_pred))