In [6]:
from IPython.display import HTML
css_file = './custom.css'
HTML(open(css_file, "r").read())

# K-Nearest Neighbors (KNN)

© 2018 Daniel Voigt Godoy

## 1. Definition

KNN is a ***non-parametric*** method used for both classification and regression. It relies on ***instance-based*** learning rather than model-based learning: it stores the training instances, approximates the target function locally, and defers all computation to prediction time.

### 1.1 Algorithm

Given a new instance, KNN uses the ***k closest instances*** to it to make the prediction:
- for ***classification*** tasks, it uses a ***voting*** rule to assign the ***most frequent label*** among the k nearest neighbors to the new instance (this can be a problem for ***imbalanced*** datasets, since the majority class tends to dominate the vote)
- for ***regression*** tasks, it outputs the ***average*** of the target values of the k nearest neighbors
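
The two rules above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation behind the interactive plot: the function names (`k_nearest`, `knn_classify`, `knn_regress`) and the toy data are made up for this example, Euclidean distance is assumed, and ties in the vote are resolved arbitrarily.

```python
from collections import Counter
import math


def k_nearest(X_train, x_new, k):
    """Return the indices of the k training points closest to x_new (Euclidean)."""
    distances = [math.dist(x, x_new) for x in X_train]
    return sorted(range(len(X_train)), key=lambda i: distances[i])[:k]


def knn_classify(X_train, y_train, x_new, k):
    """Classification: majority vote among the labels of the k nearest neighbors."""
    neighbors = k_nearest(X_train, x_new, k)
    votes = Counter(y_train[i] for i in neighbors)
    return votes.most_common(1)[0][0]


def knn_regress(X_train, y_train, x_new, k):
    """Regression: average of the target values of the k nearest neighbors."""
    neighbors = k_nearest(X_train, x_new, k)
    return sum(y_train[i] for i in neighbors) / k


# Hypothetical toy data: two well-separated groups in two dimensions
X_toy = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.0), (6.5, 5.5)]
labels = ["red", "red", "red", "blue", "blue", "blue"]
targets = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

predicted_label = knn_classify(X_toy, labels, (1.2, 1.4), k=3)
predicted_value = knn_regress(X_toy, targets, (1.2, 1.4), k=3)
```

Note that there is no training step: all the work happens inside the prediction functions, which is what "instance-based" means in practice.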

### 1.2 Finding Neighbors

The definition of ***closest instances*** depends on the ***distance*** used. The most common choices are the ***Euclidean Distance*** (for continuous features) and the [Hamming Distance](https://en.wikipedia.org/wiki/Hamming_distance) (for categorical features), although other distances (like correlation metrics) can be used depending on the problem.
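
As a rough sketch of the two distances mentioned above (the function names and sample values are illustrative only): the Euclidean distance is the square root of the summed squared differences, while the Hamming distance counts the positions at which two equal-length sequences differ.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two points with continuous features."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(ai != bi for ai, bi in zip(a, b))


# Continuous features: the classic 3-4-5 right triangle
d_continuous = euclidean_distance((0.0, 0.0), (3.0, 4.0))

# Categorical features: only the second attribute differs
d_categorical = hamming_distance(("red", "small", "round"),
                                 ("red", "large", "round"))
```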

### 1.3 Defining k

The ***hyper-parameter k*** can be tuned like any other hyper-parameter. Once again, we face the ***bias-variance tradeoff***:
- a ***small k*** tends to ***overfit*** (low bias, high variance), as it only sees a small local region and is more sensitive to noise - making the boundaries more jagged
- a ***large k*** tends to ***underfit*** (high bias, low variance), as it covers a larger region - smoothing the boundaries
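
One way to see this tradeoff is to score several values of k with leave-one-out validation: each point is predicted from all the others. The sketch below assumes a made-up one-dimensional dataset with a single noisy label (the "b" at 2.1, inside the "a" group); the helper names are hypothetical.

```python
from collections import Counter
import math


def knn_predict(X_train, y_train, x_new, k):
    """Majority vote among the k training points closest to x_new."""
    order = sorted(range(len(X_train)),
                   key=lambda i: math.dist(X_train[i], x_new))
    votes = Counter(y_train[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(X, y, k):
    """Predict each point from all the others and return the hit rate."""
    hits = 0
    for i in range(len(X)):
        X_rest = X[:i] + X[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        hits += knn_predict(X_rest, y_rest, X[i], k) == y[i]
    return hits / len(X)


# Hypothetical toy data: group "a" on the left (with one noisy "b" at 2.1),
# group "b" on the right
X_toy = [(0.0,), (1.0,), (2.1,), (3.3,), (4.6,),
         (10.0,), (11.1,), (12.3,), (13.6,), (15.0,)]
y_toy = ["a", "a", "b", "a", "a", "b", "b", "b", "b", "b"]

scores = {k: leave_one_out_accuracy(X_toy, y_toy, k) for k in (1, 3, 9)}
best_k = max(scores, key=scores.get)
```

On this toy data, k = 1 lets the noisy label mislead its neighbor, k = 9 uses every remaining point and simply predicts the overall majority class, and the intermediate k = 3 scores highest.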

## 2. Experiment

Time to try it yourself!

There are 35 points in two dimensions belonging to three different classes: red, green and blue.

A new data point is assigned the class held by the majority of its k nearest neighbors.

The ***dotted lines*** show the ***k*** nearest neighbors (using Euclidean distance).

The controls below allow you to:
- define ***coordinates*** x1 and x2 of the new data point
- define ***how many (k) neighbors*** to use for classification
- show ***boundaries*** corresponding to the chosen k

Use the controls to play with different configurations and answer the ***questions*** below.

In [7]:
from intuitiveml.supervised.classification.KNN import *

In [8]:
X, y = data()
myk = plotKNN(X, y)
vb = VBox(build_figure(myk), layout={'align_items': 'center'})

In [9]:
vb


#### Questions

1. Change the ***coordinates*** x1 and x2 and make the new data point change colors. Can you pinpoint the boundaries like this?


2. Change the number of neighbors ***k*** and repeat the exercise.


3. Set the new data point to the coordinates (3.8, 0.6) and change ***k***:
    - what do you observe?
    - set ***k*** to 3 and zoom into that region - what do you see? Do you agree with the assignment?

4. Leave the data point at the same coordinates, set ***k*** to 1 and check ***show boundaries***:
    - how many points would be ***misclassified*** using the boundaries?
    - increase ***k*** gradually - are there any ***misclassified*** points now? Why? 

## 3. Scikit-Learn

[Nearest Neighbors](https://scikit-learn.org/stable/modules/neighbors.html)
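
A minimal usage sketch with Scikit-Learn's `KNeighborsClassifier` and `KNeighborsRegressor`, assuming `scikit-learn` is installed. The toy data below is made up for illustration and is not the dataset used in the experiment above.

```python
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Hypothetical toy data: two well-separated groups in two dimensions
X_toy = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [6.0, 6.0], [7.0, 7.0], [6.5, 5.5]]
labels = ["red", "red", "red", "blue", "blue", "blue"]
targets = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

# Classification: majority vote among the 3 nearest neighbors
clf = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
clf.fit(X_toy, labels)
predicted_label = clf.predict([[1.2, 1.4]])[0]

# Regression: average target value of the 3 nearest neighbors
reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X_toy, targets)
predicted_value = reg.predict([[1.2, 1.4]])[0]
```

Here `fit` only stores the training data (possibly in an index structure for faster lookups); the neighbor search itself runs when `predict` is called.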

## 4. More Resources

[A Complete Guide to K-Nearest-Neighbors with Applications in Python and R](https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/)

[The use of KNN for missing values](https://towardsdatascience.com/the-use-of-knn-for-missing-values-cf33d935c637)

#### This material is copyright Daniel Voigt Godoy and made available under the Creative Commons Attribution (CC-BY) license ([link](https://creativecommons.org/licenses/by/4.0/)). 

#### Code is also made available under the MIT License ([link](https://opensource.org/licenses/MIT)).
