# Supervised learning
**If you can't implement it, you don't understand it**

- topics covered: naive Bayes, k-nearest neighbors, perceptron, decision trees

## Dataset
- in this course we use the MNIST dataset available here: https://www.kaggle.com/c/digit-recognizer/overview
- it consists of images of handwritten digits
- each input is a flattened vector of 28x28 = 784 pixels; the images are grayscale (no RGB channels), so every value is a single intensity from 0 to 255
- input will be scaled to 0-1
- targets are digits from 0 to 9
- there are 42 000 images in the training set; the test set on Kaggle doesn't come with labels, so we will only be using the training set
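
The loading and scaling steps above could look roughly like the sketch below. `load_digits_csv` is a hypothetical helper, not the course's own `get_data`, and it assumes the Kaggle `train.csv` layout: one header row, the label in the first column, and the pixel values in the remaining columns. A tiny in-memory CSV stands in for the real file.

```python
import io

import numpy as np


def load_digits_csv(source, limit=None):
    """Hypothetical loader: returns (X, Y) with pixels scaled to 0-1.

    Assumes a header row, then rows of `label,pixel0,pixel1,...`.
    """
    data = np.loadtxt(source, delimiter=",", skiprows=1, ndmin=2)
    if limit is not None:
        data = data[:limit]          # optionally keep only the first rows
    X = data[:, 1:] / 255.0          # scale 0-255 intensities to 0-1
    Y = data[:, 0].astype(int)       # digit labels 0-9
    return X, Y


# Tiny stand-in for train.csv (3 pixels per image instead of 784)
demo_csv = io.StringIO(
    "label,pixel0,pixel1,pixel2\n"
    "3,0,128,255\n"
    "7,255,0,51\n"
)
X_demo, Y_demo = load_digits_csv(demo_csv)
print(X_demo.shape, Y_demo)
```

With the real file, `source` would be the path to the downloaded `train.csv`.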

<img src="./assets/mnist.png"
     alt="Sample digits from the MNIST dataset"
     style="float: left;" />

## K-Nearest neighbors

### Intuition
- both simple conceptually and easy to implement in code  
- sample problem:  
    - Will you pass a course given that I know how many hours you studied?
    - I have data about students from past semesters
    - I can find the students who are "closest to you" in the number of hours they studied for their exam
    - by finding the most similar students, I can estimate your performance based on their result

| Name  | Hours studied  | Passed   |
|-------|---|---|
| Alice | 1 | N |
| Bob   | 3 | N |
| Carol | 6 | Y |
| David | 7 | Y |
| Eric  | 8 | Y |  



<img src="./assets/knn1.png"
     alt="K-nearest neighbours example on the hours-studied data"
     style="float: left; width: 400px" />

- using the 2 nearest neighbours yields (Alice, Bob), who both failed, so the prediction for your result would also be fail
- using the 3 nearest neighbours yields (Alice, Bob, Carol), which now includes 1 pass; in this case we would still predict the majority result, i.e. fail
- another possibility is to weight the votes by distance
- or to have a heuristic to break ties
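
As a rough sketch, the vote on the table above can be reproduced in a few lines. The query value of 2 hours is an assumption (the notes do not state it), chosen because it gives the neighbour sets listed above; `knn_vote` and its `weighted` option are hypothetical names used only for illustration.

```python
from collections import Counter

# (name, hours studied, passed) taken from the table above
students = [
    ("Alice", 1, "N"),
    ("Bob", 3, "N"),
    ("Carol", 6, "Y"),
    ("David", 7, "Y"),
    ("Eric", 8, "Y"),
]


def knn_vote(hours, k, weighted=False):
    """Return (predicted outcome, names of the k nearest students)."""
    # Sort by absolute difference in hours; sorted() is stable, so
    # equally distant students keep their table order.
    ranked = sorted(students, key=lambda s: abs(s[1] - hours))
    neighbours = ranked[:k]
    votes = Counter()
    for _, h, passed in neighbours:
        d = abs(h - hours)
        # Optional distance weighting: closer neighbours count more.
        votes[passed] += 1.0 / (d + 1e-9) if weighted else 1.0
    names = [name for name, _, _ in neighbours]
    return votes.most_common(1)[0][0], names


print(knn_vote(2, k=2))
print(knn_vote(2, k=3))
print(knn_vote(2, k=3, weighted=True))
```

Ties here are broken implicitly by insertion order in the `Counter`, which is one simple heuristic among many.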


### Concepts and implementation
- given a $k$, find the $k$ nearest neighbours and use them to determine the prediction
- finding the 1 nearest neighbour is simple:

In [1]:
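import numpy as np

def predict(x_0, training_data):
    '''
    Given a single instance of input data, output the prediction.
    training_data is an iterable of (x, y) pairs: input vector and label.
    '''
    closest_distance = float('inf')
    closest_class = -1
    for x, y in training_data:
        # Euclidean distance between the stored point and the query
        d = np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(x_0, dtype=float))
        if d < closest_distance:
            # New best match so far: remember its distance and label
            closest_distance = d
            closest_class = y
    return closest_class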
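
The same 1-nearest-neighbour search can also be written without an explicit loop. The sketch below is one possible vectorised variant, assuming the training inputs are rows of a NumPy array and that distance means Euclidean distance; `predict_1nn` and the toy arrays are hypothetical.

```python
import numpy as np


def predict_1nn(X_train, y_train, x_0):
    """Label of the training row closest to x_0 (Euclidean distance)."""
    # Broadcasting subtracts x_0 from every row; the norm is taken per row.
    distances = np.linalg.norm(X_train - x_0, axis=1)
    return y_train[np.argmin(distances)]


# Toy data: three 2-D points with labels 0, 1, 2
X_toy = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 0.2]])
y_toy = np.array([0, 1, 2])
toy_prediction = predict_1nn(X_toy, y_toy, np.array([0.8, 0.9]))
print(toy_prediction)
```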

- keeping track of an arbitrary number of closest neighbours, however, is not so simple
- for every point we want to classify, we need to find its $k$ nearest neighbours among the training data
- example:
    - $k = 3$ and the stored distances are [1, 2, 3]
    - we see a point with distance 1.5, so it should replace the 3, giving [1, 1.5, 2]
- complexity of a single prediction:
    - with $n$ training datapoints in total, we need to look at all of them $\implies O(n)$
    - for every training datapoint, the list of the $k$ closest neighbours so far has to be iterated over to see whether it should be updated $\implies O(k)$
    - in total then $\implies O(kn)$
    - the naive approach can be improved by keeping the $k$ nearest neighbours in a sorted list: the current worst neighbour is always the last element, and the insertion position can be found by binary search in $O(\log k)$ instead of an $O(k)$ scan
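
A minimal sketch of the sorted-list idea follows. The notebook imports `SortedList` from `sortedcontainers`; this version swaps in the standard library's `bisect.insort` on a plain list so that it runs without extra packages. `knn_predict` and the toy data are hypothetical, and the distance is assumed to be Euclidean.

```python
import bisect
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_0, k):
    """Return (majority label, sorted list of (distance, label)) for x_0."""
    nearest = []  # kept sorted by distance, never longer than k
    for x, y in zip(X_train, y_train):
        d = float(np.linalg.norm(x - x_0))
        if len(nearest) < k:
            bisect.insort(nearest, (d, int(y)))
        elif d < nearest[-1][0]:
            # Closer than the current worst neighbour: replace it.
            nearest.pop()
            bisect.insort(nearest, (d, int(y)))
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0], nearest


# Mirrors the example above: distances 1, 2, 3 are seen first, then 1.5
X_toy = np.array([[1.0], [2.0], [3.0], [1.5]])
y_toy = np.array([0, 1, 1, 0])
label, nearest = knn_predict(X_toy, y_toy, np.array([0.0]), k=3)
print(label, nearest)
```

Only the last element is ever compared against, so most training points are rejected with a single comparison.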
    


In [4]:
import numpy as np
from sortedcontainers import SortedList  # third-party package: pip install sortedcontainers
from utils import get_data

X, Y = get_data()


## Naive Bayes and Bayes classifiers

## Decision trees

## Perceptrons

## Practical machine learning

## Building a web service