In [1]:
###### Config #####
import sys, os, platform
if os.path.isdir("ds-assets"):
  !cd ds-assets && git pull
else:
  !git clone https://github.com/lutzhamel/ds-assets.git
colab = 'google.colab' in sys.modules
system = platform.system() # "Windows", "Linux", "Darwin"
home = "ds-assets/assets/"
sys.path.append(home)  

Already up to date.


In [2]:
# notebook level imports
import pandas as pd
import dsutils                        # classification_confint
from sklearn import neighbors         # KNeighborsClassifier
from sklearn import tree              # DecisionTreeClassifier
from sklearn import metrics           # accuracy_score, confusion_matrix
from sklearn import model_selection   # train_test_split, GridSearchCV

# k-NN Classification

k-NN: **k** **N**earest **N**eighbors

In k-NN classification the label of an **unknown instance** is computed from a simple majority vote of the **nearest neighbors of that point**: a query point is assigned the label that has the most representatives among its nearest neighbors.

k-NN classification is a type of **instance-based learning**: we do not attempt to construct an internal model, but simply view the **instances of the training data as the model**.
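The majority vote described above can be sketched in a few lines. This is a minimal illustration, not the `sklearn` implementation: the function name `knn_predict`, the Euclidean distance, and the toy data are all assumptions made for the example.

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, query, k):
    """Label `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training instance
    distances = np.linalg.norm(X_train - query, axis=1)
    # indices of the k closest instances
    nearest = np.argsort(distances)[:k]
    # majority vote over the labels of those neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# made-up toy data: two well separated groups
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_train = np.array(["square", "square", "square",
                    "triangle", "triangle", "triangle"])

label_near_squares = knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3)
label_near_triangles = knn_predict(X_train, y_train, np.array([5.0, 4.8]), k=3)
print(label_near_squares, label_near_triangles)
```

Note that there is no training step: the "model" is just the stored arrays `X_train` and `y_train`, and all the work happens at prediction time.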



## An Illustration

Consider the following figure,

<!-- ![knn](assets/knn.png) -->
<center>
<img src="https://raw.githubusercontent.com/lutzhamel/ds-assets/main/assets/knn.png" height="256" width="280">
</center>

We want to assign the unknown point either to the class of blue squares or to the class of red triangles,

* If k = 3 (solid line circle) it is assigned to the class of red triangles because there are 2 triangles and only 1 square inside the inner circle.

* If k = 5 (dashed line circle) it is assigned to the class of blue squares (3 squares vs. 2 triangles inside the dashed circle).

**Note**: The value k is a model parameter and model accuracy depends on this parameter.
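To see this dependence on k in code, here is a small sketch that mimics the figure. The coordinates are hypothetical (they are not taken from the image); they are only arranged so that the three nearest neighbors of the unknown point contain two triangles, while the five nearest contain three squares.

```python
import numpy as np
from sklearn import neighbors

# hypothetical coordinates; the unknown point sits at the origin
X = np.array([[ 1.0,  0.0],   # triangle, distance 1.0
              [-1.1,  0.0],   # square,   distance 1.1
              [ 0.0,  1.2],   # triangle, distance 1.2
              [ 0.0, -1.8],   # square,   distance 1.8
              [ 1.9,  0.0],   # square,   distance 1.9
              [ 3.0,  3.0],   # triangle, far away
              [-3.0, -3.0]])  # square,   far away
y = np.array(["triangle", "square", "triangle",
              "square", "square", "triangle", "square"])
unknown = np.array([[0.0, 0.0]])

predictions = {}
for k in (3, 5):
    model = neighbors.KNeighborsClassifier(n_neighbors=k).fit(X, y)
    predictions[k] = model.predict(unknown)[0]
    print(f"k = {k}: {predictions[k]}")
```

With these points the predicted class flips from triangle to square when k goes from 3 to 5, which is the behavior shown by the two circles in the figure.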

## A Worked Example

Let's build a k-NN classifier for the iris dataset.  

**NOTE**: we are not searching for the optimal model, we just want to build a classifier and pick a value for k that seems appropriate.

In [3]:
# get data
df = pd.read_csv(home+"iris.csv")
X  = df.drop(columns=['id','Species'])
y = df['Species']

In [4]:
# train a model with default settings
model = neighbors.KNeighborsClassifier().fit(X, y)

In [5]:
# evaluate the model
dsutils.acc_score(model,X,y,as_string=True)

'Accuracy: 0.97 (0.94, 1.00)'

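The performance is not bad for a default model (`sklearn` uses k = 5 by default), although the accuracy is measured on the same data the model was trained on.  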
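Because the score above is computed on the training data, it can be useful to compare it with a hold-out estimate. The following is a sketch under two assumptions: it loads the copy of the iris data bundled with `sklearn` instead of `iris.csv`, and the 70/30 split with `random_state=1` is an arbitrary choice.

```python
from sklearn import datasets, metrics, model_selection, neighbors

# assumption: sklearn's bundled iris data stands in for the ds-assets iris.csv
X, y = datasets.load_iris(return_X_y=True)

# hold out 30% of the instances; stratify keeps the class proportions equal
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

model = neighbors.KNeighborsClassifier().fit(X_train, y_train)
train_acc = metrics.accuracy_score(y_train, model.predict(X_train))
test_acc = metrics.accuracy_score(y_test, model.predict(X_test))
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The test accuracy is the more honest estimate of how the classifier behaves on unseen flowers.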

# Model Comparison

Here we are a little bit more careful with our model construction and do a cross-validated grid search for the optimal value of k.
Furthermore, we want to see how the performance of our optimal k-NN classifier stacks up against that of an optimal decision tree model in a statistically valid manner.


Let’s work our way through this comparison using the `wdbc` dataset:

* Build optimal k-NN and tree models using grid search
* Compute the accuracy for the classifiers
* Print out the confusion matrix for each classifier
* Print out the confidence interval for each classifier
* Decide if the difference between classifiers is statistically significant or not.

## Set Up

Get our training data and format it the way `sklearn` expects.

In [6]:
# get data
df = pd.read_csv(home+"wdbc.csv").drop(columns=['ID'])

# format training data for sklearn
X  = df.drop(columns=['Diagnosis'])
y = df['Diagnosis']

## k-NN Classifier

First up is the k-NN classifier.  In order to find the optimal model we set up a grid search over the number of neighbors.  In this case we search the values from 1 to 25.

In [7]:
# KNN
model = neighbors.KNeighborsClassifier()
param_grid = {'n_neighbors': list(range(1,26))}   # k = 1..25
best_model = model_selection\
   .GridSearchCV(model, param_grid)\
   .fit(X, y)
dsutils.acc_score(best_model,X,y,as_string=True)

'Accuracy: 0.94 (0.92, 0.96)'
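The grid search also records which value of k won and how it scored under cross-validation. The sketch below shows how to read these off a fitted `GridSearchCV` object. As an assumption, it uses the copy of the WDBC data bundled with `sklearn` (labels coded 0/1 rather than 'M'/'B'), so the selected k need not match the one found for `wdbc.csv`.

```python
from sklearn import datasets, model_selection, neighbors

# assumption: sklearn's bundled breast cancer data stands in for wdbc.csv
X, y = datasets.load_breast_cancer(return_X_y=True)

search = model_selection.GridSearchCV(
    neighbors.KNeighborsClassifier(),
    {'n_neighbors': list(range(1, 26))},   # k = 1..25
    cv=5)                                   # 5-fold cross-validation (the sklearn default)
search.fit(X, y)

best_k = search.best_params_['n_neighbors']
print("best k:", best_k)
print(f"mean cross-validated accuracy: {search.best_score_:.2f}")
```

By default `GridSearchCV` refits the winning configuration on all of the data, which is why the fitted search object can be used directly for prediction.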

In [8]:
# build the confusion matrix for more detailed error analysis
predict_y = best_model.predict(X)
labels = ['M', 'B']
cm = metrics.confusion_matrix(y, predict_y, labels=labels)
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
cm_df

|   | M | B |
|---|---|---|
| M | 186 | 26 |
| B | 8 | 349 |


Let's take a look at the performance data.  In terms of accuracy, the best k-NN model reaches 94% with a confidence interval of (92%, 96%).  From a medical application perspective, however, the confusion matrix is worrisome.  Of the 212 malignant samples, the model misclassifies 26 as benign.  This kind of error is called a 'false negative', and here it means that 12% of the malignant cases remain undetected. We also see that of the 357 benign samples, 8 are misclassified as malignant.  This is called a 'false positive'. From a medical point of view this is less worrisome because additional tests will identify these cases correctly as benign.
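The percentages quoted in this paragraph follow directly from the confusion matrix. Here is a short sketch of the arithmetic, with the k-NN counts typed in by hand; the variable names are chosen for the example.

```python
import numpy as np

# k-NN confusion matrix: rows = actual, columns = predicted, label order ['M', 'B']
cm = np.array([[186,  26],
               [  8, 349]])

false_negatives = cm[0, 1]   # malignant cases predicted as benign
false_positives = cm[1, 0]   # benign cases predicted as malignant

fn_rate = false_negatives / cm[0].sum()   # share of the 212 malignant cases missed
fp_rate = false_positives / cm[1].sum()   # share of the 357 benign cases flagged
accuracy = np.trace(cm) / cm.sum()        # correct predictions over all 569 cases

print(f"false negative rate: {fn_rate:.1%}")
print(f"false positive rate: {fp_rate:.1%}")
print(f"accuracy:            {accuracy:.1%}")
```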

## Decision Trees

For decision trees we set up a grid search over the tree depth, from 1 to 20, and over the split criterion, `entropy` or `gini`.

In [9]:
# decision trees
model = tree.DecisionTreeClassifier(random_state=1)
param_grid = {'max_depth': list(range(1,21)), 'criterion': ['entropy','gini'] }
best_model = model_selection\
   .GridSearchCV(model, param_grid)\
   .fit(X, y)
dsutils.acc_score(best_model,X,y,as_string=True)

'Accuracy: 0.98 (0.97, 0.99)'

In [10]:

# build the confusion matrix for a more detailed error analysis
predict_y = best_model.predict(X)
labels = ['M', 'B']
cm = metrics.confusion_matrix(y, predict_y, labels=labels)
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
cm_df

|   | M | B |
|---|---|---|
| M | 210 | 2 |
| B | 7 | 350 |


The performance of the decision tree model is much better overall and, specifically, from a medical point of view.  Less than 1% of the malignant cases (2 of 212) are classified as 'false negatives', giving much more confidence in its applicability in a medical setting. The accuracy of the model is 98% with a confidence interval of (97%, 99%).

## Performance Comparison and Model Selection

When we compare models we have to look beyond the raw performance numbers, in this case 94% and 98% for the k-NN and the decision tree model, respectively. We have to ask whether the difference in performance between these two models is statistically significant.  Consider the performance of the k-NN model with an accuracy and confidence interval of,
```
94% (92%, 96%)
```
Also consider the performance of the decision tree model with an accuracy and confidence interval of,
```
98% (97%, 99%)
```
Here we see that the confidence intervals for the decision tree and the k-NN classifier **do not overlap**.  That means the decision tree is truly the better model here, and the performance difference between the two models is **statistically significant**.  

**Observation**: Therefore we will select the **decision tree as a model** for our breast cancer data.
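The overlap check used in this comparison can be sketched as follows. The helper names `accuracy_confint` and `intervals_overlap` are hypothetical, and the 95% normal-approximation interval is an assumption; `dsutils` may compute its interval differently, although this formula yields the same rounded intervals as those reported above.

```python
from math import sqrt

def accuracy_confint(n_correct, n_total, z=1.96):
    """Accuracy with a normal-approximation confidence interval (z = 1.96 for ~95%)."""
    acc = n_correct / n_total
    half_width = z * sqrt(acc * (1 - acc) / n_total)
    return acc, max(0.0, acc - half_width), min(1.0, acc + half_width)

def intervals_overlap(a, b):
    """True if the (lower, upper) intervals a and b share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]

# correct predictions are read off the diagonals of the two confusion matrices
knn_acc, knn_lb, knn_ub = accuracy_confint(186 + 349, 569)
tree_acc, tree_lb, tree_ub = accuracy_confint(210 + 350, 569)

print(f"k-NN: {knn_acc:.2f} ({knn_lb:.2f}, {knn_ub:.2f})")
print(f"tree: {tree_acc:.2f} ({tree_lb:.2f}, {tree_ub:.2f})")
print("overlap:", intervals_overlap((knn_lb, knn_ub), (tree_lb, tree_ub)))
```

Since the upper bound of the k-NN interval lies below the lower bound of the tree interval, the check reports no overlap.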
