## Implementing KNN on the Breast Cancer Dataset

#### Calling all the required libraries and loading the data set

In [1]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

In [2]:
data = load_breast_cancer()

#### Defining the model 

We will split the data into three parts: Test and Train. The train data will be further divided into train and validation dataset. This is done by calling the train_test_split function twice, once for the whole dataset and once for the Train set. This reduces overfitting and gives better generalization results. 

In [3]:
X, y = data.data, data.target

In [4]:
X_trainval, X_test, y_trainval, y_test = train_test_split(X,y)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval)

In [5]:
val_scores = []    # an empty array to store the scores of all iterations of the knn algorithm 
neighbors = np.arange(1, 15, 2)    # the algorithm checks for all odd number of neighbors between 1 and 15

#### Building the Model

In [6]:
from sklearn.neighbors import KNeighborsClassifier

for i in neighbors:
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    val_scores.append(knn.score(X_val, y_val))

print("Best validation score: {:.3f}".format(np.max(val_scores)))
best_n_neighbors = neighbors[np.argmax(val_scores)]  # perfomance of the model on the validation set
print("Best n_neighbors: {}".format(best_n_neighbors))  # the best accuracy was achieved when considering 13 nearest neighbors

Best validation score: 0.907
Best n_neighbors: 9


In [7]:
knn.fit(X_trainval, y_trainval)
print("Test-set score: {:.3f}".format(knn.score(X_test, y_test)))

Test-set score: 0.930
