In [1]:
%run Latex_macros.ipynb
%run beautify_plots.py


In [2]:
# My standard magic !  You will see this in almost all my notebooks.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1

%matplotlib inline

In [3]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Common imports
import os


# Summary: Supervised Machine Learning

## Learning from *labeled* examples
- each example is a vector of features $\x$ and a target/label $\y$
    - $n$ denotes length of vector $\x$
    - superscript to distinguish between examples
        $\x^\ip, \y^\ip$
    
## Prediction: creating a model $h$

- Given training example $\x^\ip$, we construct a function $h$ to predict its label

$$\hat{\y}^\ip = h(\x^\ip; \Theta)$$
- We use a "hat" to denote predictions: $\hat{\y}^\ip$
- The behavior of $h$ is determined by parameters $\Theta$
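
As a minimal sketch of the notation above, assume (purely for illustration) that $h$ is a linear model, so the prediction is the dot product of the parameters with the feature vector. The names `h`, `x_i`, `theta` and the numeric values are hypothetical; the section itself does not fix a particular form for $h$.

```python
import numpy as np

def h(x, theta):
    """Hypothetical linear model: prediction = theta . x (an assumption, not the only choice)."""
    return float(np.dot(theta, x))

x_i = np.array([1.0, 2.0, 3.0])      # one feature vector, n = 3
theta = np.array([0.5, -1.0, 2.0])   # illustrative parameter values
y_hat_i = h(x_i, theta)              # the "hat" value: the prediction for example i
print(y_hat_i)                       # 4.5
```

Changing `theta` changes the prediction for the same `x_i`, which is the sense in which $\Theta$ determines the behavior of $h$.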

## Fitting a model $h$: Finding optimal values for $\Theta$

The collection of examples used for fitting (training) a model is called the *training set*:

$$ \langle \X, \y \rangle= [ \x^\ip, \y^\ip | 1 \le i \le m ]$$

where $m$ is the size of training set and each $\x^\ip$ is a feature vector of length $n$.

$
  \X = \begin{pmatrix}
  (\x^{(1)})^T \\
  (\x^{(2)})^T\\
  \vdots \\
  (\x^{(m)})^T \\
  \end{pmatrix} = \begin{pmatrix}
 \x^{(1)}_1 \ldots\x^{(1)}_n \\ 
  \x^{(2)}_1 \ldots\x^{(2)}_n \\ 
   \vdots \\
  \x^{(m)}_1 \ldots\x^{(m)}_n \\
  \end{pmatrix}
$
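
The matrix above can be built by stacking the feature vectors as rows. The sketch below uses three made-up examples with $n = 2$ features each; the variable names and values are illustrative assumptions.

```python
import numpy as np

# Three hypothetical feature vectors, each of length n = 2
examples = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
y = np.array([0, 1, 1])        # one target per example (made-up labels)

X = np.vstack(examples)        # row i of X is the transposed feature vector of example i
m, n = X.shape                 # m examples, n features
print(m, n)                    # 3 2
```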

### Fitting a model: Loss/Cost, Utility

Ideal: for each example $i$ in the training dataset:
- the prediction $\hat\y^\ip = h(\x^\ip; \Theta)$ is exactly equal to the target $\y^\ip$
$$\hat\y^\ip = \y^\ip$$

Reality: prediction often has some "error"
- error measured by a *distance* function: smaller (closer to target) is better
- Call the distance between $\hat{\y}^\ip, \y^\ip$ the *Loss* (or *Cost*) for example $i$:

*Per-example* loss
$$
\loss^\ip_\Theta =  L( \;  h(\x^\ip; \Theta),  \y^\ip \;) = L( \hat{\y}^\ip , \y^\ip) 
$$

where $L(a,b)$ is a function that is $0$ when $a = b$ and increasing as $a$ increasingly differs from $b$.

Two common forms of $L$ are Squared Error (for Regression; its average across examples is the Mean Squared Error) and Cross Entropy Loss (for Classification).
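
The two losses named above can be sketched as per-example functions. This is an illustrative implementation under stated assumptions: the classification target is one-hot encoded, the prediction is a vector of class probabilities, and the helper names and the `eps` clipping constant are my own choices.

```python
import numpy as np

def squared_error(y_hat, y):
    """Per-example loss for Regression: 0 when y_hat == y, growing with the gap."""
    return (y_hat - y) ** 2

def cross_entropy(p_hat, y_onehot, eps=1e-12):
    """Per-example loss for Classification.

    p_hat: predicted class probabilities; y_onehot: one-hot encoded target.
    Probabilities are clipped at eps (an assumed constant) to avoid log(0).
    """
    p_hat = np.clip(np.asarray(p_hat, dtype=float), eps, 1.0)
    return float(-np.sum(np.asarray(y_onehot) * np.log(p_hat)))

print(squared_error(5.0, 3.0))                    # 4.0
print(cross_entropy([0.7, 0.2, 0.1], [1, 0, 0]))  # -log(0.7)
```

Both functions return $0$ for a perfect prediction and a larger value as the prediction moves away from the target.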

### Optimal $\Theta$

The Loss for the entire training set is simply the average (across examples) of the per-example Losses

$$
\loss_\Theta  = { 1\over{m} } \sum_{i=1}^m \loss^\ip_\Theta
$$

The best (optimal) $\Theta$ is the one that minimizes the Average (across training examples) Loss

$$
\Theta^* = \argmin{\Theta} { \loss_\Theta }
$$
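
To make the $\argmin{}$ concrete, the sketch below evaluates the average loss for a grid of candidate parameters and keeps the best one. It assumes a one-parameter linear model and squared error, with made-up data; grid search is used only because it mirrors the definition directly, and practical fitting usually relies on a closed-form solution or gradient-based optimization.

```python
import numpy as np

# Made-up training set generated by y = 2 * x
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

def average_loss(theta, X, y):
    """Average (across examples) of the per-example squared error."""
    y_hat = X @ theta                      # predictions for all m examples
    return float(np.mean((y_hat - y) ** 2))

# Candidate parameter values on a coarse grid (an illustrative choice)
candidates = [np.array([t]) for t in np.linspace(0.0, 4.0, 41)]
losses = [average_loss(theta, X, y) for theta in candidates]

theta_star = candidates[int(np.argmin(losses))]   # the candidate with minimal average loss
print(theta_star, min(losses))
```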

## Pattern matching

The "dot product" (a special case of the inner product) is one function
that often appears in template matching

- It measures the similarity of two vectors

$$
\mathbf{v} \cdot \mathbf{v}' = \sum_{i=1}^n \mathbf{v}_i \mathbf{v}'_i
$$

- As a similarity measure (rather than as a distance) high dot product means "more similar".
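
A small sketch of the dot product as a similarity measure, using made-up vectors: the vector that shares the pattern's non-zero positions scores higher than the one that does not.

```python
import numpy as np

pattern = np.array([1.0, 0.0, 1.0])
similar = np.array([1.0, 0.0, 1.0])     # same non-zero positions as the pattern
different = np.array([0.0, 1.0, 0.0])   # no overlap with the pattern

sim_same = np.dot(pattern, similar)     # sum of element-wise products
sim_diff = np.dot(pattern, different)
print(sim_same, sim_diff)               # 2.0 0.0
```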

In Machine Learning it is *often* (but not always) the case that
- we match a feature vector $\x^\ip$
- to all/some of the parameters $\Theta$

# KNN: a simple model for the Classification task

Parameters $\Theta$ **are** the training examples
- training examples are **not** discarded after training/fitting: they are needed at prediction time

$$\langle \Theta_\x, \Theta_\y \rangle = \langle \X, \y \rangle$$

KNN
- measures the *similarity* of an out-of-sample feature vector $\x$ to the feature vector of each training example $i$
- the **dot product** matches $\x$ against a row of $\Theta_\x$
$$
\text{similarity}(\x, \Theta_\x^\ip) = \x \cdot \Theta_\x^\ip = \x \cdot \X^\ip
$$
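
The matching step can be sketched as follows. This follows the section's use of the dot product as the similarity; many KNN implementations use a distance such as Euclidean instead. The function name `knn_predict`, the majority-vote rule, and the data are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def knn_predict(x, Theta_x, Theta_y, k=3):
    """Predict the label of x by majority vote among the k most similar training examples."""
    similarities = Theta_x @ x                        # dot product of x with every row of Theta_x
    nearest = np.argsort(similarities)[::-1][:k]      # indices of the k largest similarities
    votes = Counter(Theta_y[nearest].tolist())
    return votes.most_common(1)[0][0]

# "Fitting" just stores the training set: the parameters are the examples
Theta_x = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
Theta_y = np.array([0, 0, 1, 1])

pred = knn_predict(np.array([1.0, 0.2]), Theta_x, Theta_y, k=3)
print(pred)                                           # 0
```

Note that prediction touches every stored example, which is why the training set must be retained.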

KNN uses *lots* of parameters

$$
\begin{array}{lllll}
\| \Theta \| & = & \| \Theta_\x \| & + & \| \Theta_\y \| \\
& = & m * n & + & m \\
\end{array}
$$

Perhaps *exact matching* against a large set of examples is not necessary?
- Digit classification
    - A "generic" pattern for each digit
        - pattern for a "1" is a vertical column of dark pixels in the center
        - pattern for an "8" is two "donut holes" stacked atop one another, with a "pinched waist"
    - Parameter size: $10 * n$
        - 10 patterns * $n$ pixel intensities per pattern
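
To get a feel for the difference in parameter counts, the sketch below plugs in assumed sizes: $m = 60{,}000$ training images of $28 \times 28$ pixels. These numbers are illustrative assumptions, not taken from this notebook.

```python
m = 60_000        # assumed number of training examples
n = 28 * 28       # assumed number of pixel intensities per image

knn_params = m * n + m        # KNN stores every feature vector plus every label
pattern_params = 10 * n       # one n-pixel pattern per digit

print(knn_params)             # 47100000
print(pattern_params)         # 7840
```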

We will learn *other* models for Classification that essentially learn these *per-digit* patterns

In [4]:
print("Done")

Done
