# K-Nearest Neighbors Classification

In this notebook I use k-nearest neighbors (KNN) classification to automatically diagnose breast cancer, using the Breast Cancer Wisconsin (Diagnostic) dataset from the UCI Machine Learning Repository. You can find the dataset here: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

This is my first KNN model, so I decided to keep it simple. I will come back to it over time and improve the model based on what I learn in the meantime.


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import math 

from sklearn.neighbors import KNeighborsClassifier

from scipy import stats


In [6]:
# Column names: for each feature, the mean, the standard error ("std"), then the mean of the 3 worst cells ("w") of that biopsy.
col_names = ["ID" , "Diagnosis" , "radius", "texture", "perimeter","area", "smoothness", "compactness", "concavity", "concave points", "symmetry", "fractial dimension" , "radius std", "texture std", "perimeter std","area std", "smoothness std", "compactness std", "concavity std", "concave points std", "symmetry std", "fractial dimension std" , "radius w", "texture w", "perimeter w","area w", "smoothness w", "compactness w", "concavity w", "concave points w", "symmetry w", "fractial dimension w" ]
# Build DataFrame
df = pd.read_csv("data_cancer.csv", header = None, names = col_names , index_col = "ID") 

# Keep only the mean of each feature; drop the standard-error ("std") and worst-cell ("w") columns.
df = df.drop([ "radius std", "texture std", "perimeter std","area std", "smoothness std", "compactness std", "concavity std", "concave points std", "symmetry std", "fractial dimension std" , "radius w", "texture w", "perimeter w","area w", "smoothness w", "compactness w", "concavity w", "concave points w", "symmetry w", "fractial dimension w" ], axis = 1)

# Print head() to see the first 5 rows of your data

df.head()



| ID | Diagnosis | radius | texture | perimeter | area | smoothness | compactness | concavity | concave points | symmetry | fractial dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 |
| 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 |
| 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 |
| 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 |
| 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 |


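KNN is a "lazy learning" method: it stores the training data and does its work at prediction time. It can be used for both classification and regression.

1. Calculate the Euclidean distance between the new row and every training row: $d(a, b) = \sqrt{\sum_{i=1}^{N} (a_i - b_i)^2}$
2. Get the k nearest neighbors (the k training rows with the smallest distance).
3. Make a prediction: for classification, the most common class among those neighbors.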
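The three steps above can be sketched from scratch in a few lines. This is only an illustration of the idea, not how scikit-learn implements `KNeighborsClassifier`; the function names (`euclidean_distance`, `get_neighbors`, `predict`) and the toy data are made up for this example.

```python
from collections import Counter

import numpy as np


def euclidean_distance(row1, row2):
    """Step 1: straight-line distance between two feature vectors."""
    return np.sqrt(np.sum((np.asarray(row1) - np.asarray(row2)) ** 2))


def get_neighbors(X_train, y_train, new_row, k):
    """Step 2: labels of the k training rows closest to new_row."""
    distances = [euclidean_distance(row, new_row) for row in X_train]
    nearest = np.argsort(distances)[:k]
    return [y_train[i] for i in nearest]


def predict(X_train, y_train, new_row, k=3):
    """Step 3: the most common label among the k nearest neighbors."""
    labels = get_neighbors(X_train, y_train, new_row, k)
    return Counter(labels).most_common(1)[0][0]


# Toy data, invented for illustration only (not taken from the dataset).
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                  [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])
y_toy = ["B", "B", "B", "M", "M", "M"]

print(predict(X_toy, y_toy, [2.0, 2.0], k=3))
print(predict(X_toy, y_toy, [8.0, 9.0], k=3))
```

Because every prediction loops over all training rows, this version is slow on large datasets, which is one reason to use the library implementation in practice.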

KNN is sensitive to outliers and to uninformative features, so the data should be cleaned and reduced to the useful features first (which I have not done yet).
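One possible way to clean the data is to drop rows with extreme values, for example using z-scores from `scipy.stats`. The sketch below is an assumption about how this could be done, not a step this notebook performs: it runs on synthetic data, and the cutoff of 3 standard deviations is a common rule of thumb rather than a value tuned for this dataset.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a feature table: 200 ordinary rows, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=0.0, scale=1.0, size=(200, 3))
X_demo[0] = [25.0, 0.0, 0.0]  # deliberately planted outlier

# Absolute z-score of every value, computed column by column.
z_scores = np.abs(stats.zscore(X_demo, axis=0))

# Keep rows where every feature lies within 3 standard deviations of its mean.
# The threshold of 3 is an assumption for this sketch.
keep = (z_scores < 3).all(axis=1)
X_clean = X_demo[keep]

print(X_demo.shape, "->", X_clean.shape)
```

On real data the same mask would also have to be applied to the labels (for example `y[keep]`) so that features and diagnoses stay aligned.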




Training the model:

In [4]:
# Method : Train Test Split

# 1. Import the needed library
from sklearn.model_selection import train_test_split

# 2. Create your variables: X holds all the features used, y holds the diagnosis, benign (B) or malignant (M).
X = df.drop("Diagnosis", axis=1)
y = np.array(df["Diagnosis"])

# 3. Split the data into a training set (80%) and a test set (20%).
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 1)

# 4. Choose n_neighbors and fit the model.
knn = KNeighborsClassifier(n_neighbors = 8)
knn.fit(X_train,y_train)

# 5. Predict
y_pred = knn.predict(X_test)

print("Test set predictions: {}".format(y_pred))



Test set predictions: ['B' 'B' 'B' 'M' 'M' 'M' 'M' 'M' 'B' 'B' 'B' 'B' 'M' 'B' 'B' 'B' 'B' 'B'
 'B' 'M' 'B' 'B' 'M' 'B' 'B' 'B' 'B' 'M' 'M' 'M' 'M' 'B' 'M' 'B' 'B' 'B'
 'M' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'M' 'B' 'B' 'B' 'M' 'M' 'M' 'B' 'B'
 'B' 'B' 'B' 'M' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'M' 'B' 'B' 'B' 'B'
 'B' 'M' 'B' 'M' 'M' 'B' 'B' 'M' 'B' 'M' 'B' 'M' 'B' 'B' 'B' 'B' 'B' 'B'
 'B' 'B' 'B' 'B' 'M' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'B' 'M' 'B' 'B' 'B'
 'M' 'M' 'B' 'B' 'B' 'B']
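The value `n_neighbors = 8` above was picked by hand. A rough sketch of one way to choose it more systematically is to compare several values with cross-validation on the training split. Since `data_cancer.csv` may not be at hand, this example assumes scikit-learn's bundled copy of the same dataset (`load_breast_cancer`) as a stand-in; there the diagnosis is encoded as integers rather than 'B'/'M', and the candidate range 1 to 15 is arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for data_cancer.csv: scikit-learn's bundled copy of the dataset.
data = load_breast_cancer()
X_all = data.data[:, :10]  # the first ten columns are the "mean" features
y_all = data.target        # integer-encoded diagnosis

X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=1
)

# Score each candidate k with 5-fold cross-validation on the training split only,
# so the test set stays untouched for the final evaluation.
cv_scores = {}
for k in range(1, 16):
    model = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(model, X_tr, y_tr, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print("best k:", best_k, "mean CV accuracy:", round(cv_scores[best_k], 3))
```

The selected k can then be used to fit a final model on the full training split before scoring it once on the test set.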


In [5]:
from sklearn.metrics import classification_report, confusion_matrix

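# 6. Evaluate the model.
# Accuracy on the test set
print(knn.score(X_test, y_test))

# Confusion matrix: rows are true labels, columns are predicted labels (order: B, M)
print(confusion_matrix(y_test, y_pred))

# Precision, recall and F1-score per class
print(classification_report(y_test, y_pred))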

0.8771929824561403
[[71  1]
 [13 29]]
              precision    recall  f1-score   support

           B       0.85      0.99      0.91        72
           M       0.97      0.69      0.81        42

   micro avg       0.88      0.88      0.88       114
   macro avg       0.91      0.84      0.86       114
weighted avg       0.89      0.88      0.87       114
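To see where the numbers in the report come from, the short check below recomputes accuracy, precision and recall for the malignant class directly from the confusion matrix printed above. The variable names are chosen for this example only.

```python
import numpy as np

# Confusion matrix from the output above: rows are true labels,
# columns are predicted labels, both in the order B, M.
cm = np.array([[71, 1],
               [13, 29]])

true_b, false_m = cm[0]  # benign cases predicted B / predicted M
false_b, true_m = cm[1]  # malignant cases predicted B / predicted M

accuracy = (true_b + true_m) / cm.sum()
precision_m = true_m / (true_m + false_m)
recall_m = true_m / (true_m + false_b)

print(f"accuracy    = {accuracy:.2f}")
print(f"precision M = {precision_m:.2f}")
print(f"recall M    = {recall_m:.2f}")
```

The recall of 0.69 for M means that 13 of the 42 malignant cases in the test set were predicted as benign, which is the most costly kind of error for a diagnostic model.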

