# KNN (K-Nearest-Neighbors)

Problem Statement:

Build a KNN classifier on the Breast Cancer data to predict whether a patient's tumor is benign or malignant.

I am going to examine the Breast Cancer dataset and model the K-nearest neighbor algorithm with Python's sklearn library. Once the KNN classifier is trained, I will use it to predict whether a patient's tumor is benign or malignant.

The principle behind the KNN (K-Nearest Neighbor) classifier is to find the K training samples closest in distance to a new point, where K is a predefined number, and to predict the new point's label from theirs, typically by majority vote. The distance measure is most commonly Euclidean distance.
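The idea can be sketched in a few lines of NumPy. This is only an illustration, not how sklearn implements it: the helper name `knn_predict` and the toy data are made up for this example, and the labels 2 and 4 simply mirror the class codes that appear in the dataset output further down.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training sample
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbours
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated clusters labelled 2 and 4
X_toy = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
                  [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
y_toy = np.array([2, 2, 2, 4, 4, 4])

print(knn_predict(X_toy, y_toy, np.array([1.5, 1.5]), k=3))  # point near the first cluster
print(knn_predict(X_toy, y_toy, np.array([8.5, 8.5]), k=3))  # point near the second cluster
```

With uniform weights, as in this sketch, every neighbour's vote counts equally.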

In [13]:
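# Import the required libraries, including KNeighborsClassifier and accuracy_score

import numpy as np
# SimpleImputer replaces the Imputer class, which was removed in scikit-learn 0.22
from sklearn.impute import SimpleImputer
# train_test_split lives in model_selection; sklearn.cross_validation was removed in 0.20
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score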



The dataset was downloaded from https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data

To import and manipulate the data, I am going to use NumPy arrays. The genfromtxt() function reads the text file into a 2D NumPy array. All values in the dataset are numeric, but some are missing and marked with "?". Reading the file with dtype float turns those entries into NaN, which lets us impute them later.


In [17]:
# Data Import

cancer_data = np.genfromtxt(fname ='breast_cancer_data.txt', delimiter= ',', dtype= float)
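To check how genfromtxt() treats the "?" markers, here is a small sketch that parses two illustrative rows from an in-memory string instead of the real file. The rows follow the same 11-column comma-separated layout; the variable names `sample` and `rows` are introduced only for this example.

```python
import io
import numpy as np

# Two illustrative rows in the dataset's layout; the second has a "?" in place of a value
sample = io.StringIO("1000025,5,1,1,1,2,1,3,1,1,2\n"
                     "1057013,8,4,5,1,2,?,7,3,1,4\n")
rows = np.genfromtxt(sample, delimiter=',', dtype=float)

print(rows.shape)                   # rows and columns parsed
print(np.isnan(rows).sum())         # how many entries became NaN
print(np.argwhere(np.isnan(rows)))  # (row, column) position of each NaN
```

The same np.isnan() check can be run on cancer_data to count the missing entries before imputation.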

In [18]:
# Let's see how the data looks once imported into the 2D NumPy array

print("Dataset Length:: ", len(cancer_data))
print("Dataset:: ", str(cancer_data))
print("Dataset Shape:: ", cancer_data.shape)

Dataset Length::  699
Dataset::  [[  1.00002500e+06   5.00000000e+00   1.00000000e+00 ...,   1.00000000e+00
    1.00000000e+00   2.00000000e+00]
 [  1.00294500e+06   5.00000000e+00   4.00000000e+00 ...,   2.00000000e+00
    1.00000000e+00   2.00000000e+00]
 [  1.01542500e+06   3.00000000e+00   1.00000000e+00 ...,   1.00000000e+00
    1.00000000e+00   2.00000000e+00]
 ..., 
 [  8.88820000e+05   5.00000000e+00   1.00000000e+01 ...,   1.00000000e+01
    2.00000000e+00   4.00000000e+00]
 [  8.97471000e+05   4.00000000e+00   8.00000000e+00 ...,   6.00000000e+00
    1.00000000e+00   4.00000000e+00]
 [  8.97471000e+05   4.00000000e+00   8.00000000e+00 ...,   4.00000000e+00
    1.00000000e+00   4.00000000e+00]]
Dataset Shape::  (699, 11)


In [19]:
# The first column holds the patient ID, which carries no predictive information, so we remove it with np.delete

cancer_data = np.delete(arr = cancer_data, obj= 0, axis = 1)

Divide the dataset into features and labels. The features are the predictor variables that help us predict the label (the criterion variable). After the ID column is removed, the first 9 columns hold the numeric features (each scored from 1 to 10) used to predict whether a tumor is benign or malignant, and the last column holds the class label (2 for benign, 4 for malignant).

In [20]:
X = cancer_data[:, :9]   # the 9 feature columns
Y = cancer_data[:, 9]    # the class label column

Missing values: a common technique is to replace them with the column mean, median, mode, or a fixed value, using the imputer class from scikit-learn. Here each missing entry is replaced by the median of its column.


In [21]:
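# SimpleImputer replaces the removed Imputer class; it always works column-wise, so it takes no axis argument
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='median')
X = imp.fit_transform(X)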
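To see median imputation on a small scale, here is a minimal sketch using scikit-learn's SimpleImputer (the replacement for Imputer in current releases) on a made-up array. The names `X_demo`, `imputer`, and `X_filled` exist only in this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up 4x2 array with one missing value in each column
X_demo = np.array([[1.0, 10.0],
                   [np.nan, 20.0],
                   [3.0, np.nan],
                   [8.0, 60.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy='median')
X_filled = imputer.fit_transform(X_demo)

# statistics_ holds the per-column medians computed from the non-missing values
print(imputer.statistics_)
print(X_filled)
```

Each NaN is replaced by the median of the observed values in its own column, so the other rows are left unchanged.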

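Splitting the data into training and test sets:
X_train and y_train form the training set; X_test and y_test form the test set. With test_size = 0.3, 30% of the samples are held out for testing.
Because Y was sliced out as a single column, y_train and y_test are already 1D arrays, so the ravel() calls are a no-op here. They only matter when the labels arrive as a 2D array with 1 column.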

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 100)
y_train = y_train.ravel()
y_test = y_test.ravel()
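As a quick check of what the split produces, the sketch below runs train_test_split with the same test_size and random_state on placeholder arrays of the same shape as the data (699 rows, 9 feature columns). The arrays `X_demo` and `y_demo` are stand-ins, not the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the feature matrix and label vector
X_demo = np.arange(699 * 9).reshape(699, 9)
y_demo = np.arange(699)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=100)

print(X_tr.shape, X_te.shape)  # number of training and test rows
print(y_tr.ndim, y_te.ndim)    # label arrays stay one-dimensional
```

The test set size is rounded up, so 30% of 699 samples gives 210 test rows and 489 training rows.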

Now we fit the KNN classifier on the training data, predict the labels of the test data, and print the accuracy of the model for each value of K from 1 to 25.
The accuracy_score function computes the accuracy of each model.

In [23]:
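for K_value in range(1, 26):
    # Train a KNN classifier with K_value neighbours and uniform vote weights
    neigh = KNeighborsClassifier(n_neighbors=K_value, weights='uniform', algorithm='auto')
    neigh.fit(X_train, y_train)
    # Predict on the held-out test set and report the accuracy as a percentage
    y_pred = neigh.predict(X_test)
    print("Accuracy is ", accuracy_score(y_test, y_pred)*100, "% for K-Value:", K_value)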

Accuracy is  95.2380952381 % for K-Value: 1
Accuracy is  93.3333333333 % for K-Value: 2
Accuracy is  95.7142857143 % for K-Value: 3
Accuracy is  95.2380952381 % for K-Value: 4
Accuracy is  95.7142857143 % for K-Value: 5
Accuracy is  94.7619047619 % for K-Value: 6
Accuracy is  94.7619047619 % for K-Value: 7
Accuracy is  94.2857142857 % for K-Value: 8
Accuracy is  94.7619047619 % for K-Value: 9
Accuracy is  94.2857142857 % for K-Value: 10
Accuracy is  94.2857142857 % for K-Value: 11
Accuracy is  94.7619047619 % for K-Value: 12
Accuracy is  94.7619047619 % for K-Value: 13
Accuracy is  93.8095238095 % for K-Value: 14
Accuracy is  93.8095238095 % for K-Value: 15
Accuracy is  93.8095238095 % for K-Value: 16
Accuracy is  93.8095238095 % for K-Value: 17
Accuracy is  93.8095238095 % for K-Value: 18
Accuracy is  93.8095238095 % for K-Value: 19
Accuracy is  93.8095238095 % for K-Value: 20
Accuracy is  93.8095238095 % for K-Value: 21
Accuracy is  93.8095238095 % for K-Value: 22
Accuracy is  93.809

Conclusion

As we can see, the highest accuracy, 95.71%, is obtained at K = 3 and K = 5.
Choosing a large value of K increases execution time and tends toward underfitting, while a very small value of K tends toward overfitting. There is no guaranteed way to find the best value of K in advance. Since K = 3 ties for the best accuracy and is the cheaper of the two to run, we use K = 3 for this classification.
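Since a single train/test split can favor one K by chance, one common way to compare values of K is k-fold cross-validation. This is not what was done above; the sketch below only illustrates the approach, and it uses synthetic data from make_classification as a stand-in for the real features. The names `X_syn`, `y_syn`, `mean_scores`, and `best_k` are introduced for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 300 samples, 9 features, 2 classes (not the cancer data)
X_syn, y_syn = make_classification(n_samples=300, n_features=9,
                                   n_informative=5, random_state=0)

mean_scores = {}
for k in range(1, 26):
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds for this value of K
    scores = cross_val_score(knn, X_syn, y_syn, cv=5, scoring='accuracy')
    mean_scores[k] = scores.mean()

# Pick the K with the highest mean cross-validated accuracy
best_k = max(mean_scores, key=mean_scores.get)
print("Best K by 5-fold CV:", best_k, "with mean accuracy", round(mean_scores[best_k], 4))
```

To try it on the real data, X_syn and y_syn would be replaced by the imputed X and Y from the cells above.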