In [1]:
%run Latex_macros.ipynb
%run beautify_plots.py


# Notation

- Supervised Learning involves supplying a number ($m$) of examples.
- Each example is a pair consisting of
    - vector $\x$ consisting of $n \ge 1$ *features* (attributes)
    - scalar (sometimes a vector) $\y$
        - referred to as the *target* value or *label* associated with $\x$

- We use **bold face** to indicate a vector (e.g., $\x$)

- We use superscript $\ip$ to index examples, when we have more than one
    - $\x^\ip, \x^{(i')}, i \ne i'$ are two distinct examples
    - $\x^\ip$ denotes element $i$ of a collection of $m$ examples
- We use subscript $j$ to index element $j$ of a vector, e.g., $\x^\ip_j$

- So $\x^\ip$ is

$
  \x^\ip = \begin{pmatrix}
 \x^\ip_1 \\
 \x^\ip_2 \\
  \vdots  \\
 \x^\ip_n
  \end{pmatrix}
$

Each  element of $\x^\ip$ is a "feature"
- $\x^\ip_j$ is the $j^{th}$ feature of example $i$

## Training set

- The collection of examples used for fitting (training) a model is called the *training set*:

$$ \langle \X, \y \rangle = [ \, (\x^\ip, \y^\ip) \mid 1 \le i \le m \, ]$$

where $m$ is the size of training set and each $\x^\ip$ is a feature vector of length $n$.

- By seeing many ($m$) pairs of feature vectors and associated labels
we will try to infer the correct label $\y^\ip$ from the features in $\x^\ip$

- $\X$ is an $(m \times n)$ matrix and $\y$ is an $(m \times 1)$ vector of targets.


$
  \X = \begin{pmatrix}
  (\x^{(1)})^T \\
  (\x^{(2)})^T\\
  \vdots \\
  (\x^{(m)})^T \\
  \end{pmatrix} = \begin{pmatrix}
 \x^{(1)}_1 & \ldots & \x^{(1)}_n \\ 
  \x^{(2)}_1 & \ldots & \x^{(2)}_n \\ 
   \vdots & \ldots & \vdots \\
  \x^{(m)}_1 & \ldots & \x^{(m)}_n \\
  \end{pmatrix}
$

<table>
    <tr>
        <th><center>Training set</center></th>
    </tr>
    <tr>
        <td><img src="images/mnist_small_train.png"></td>
    </tr>
</table>
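
As a minimal sketch of this notation in code, the training set can be held in NumPy arrays. The data values below are hypothetical, and storing $\y$ as a 1-D array of length $m$ (rather than an $(m \times 1)$ column) is an assumption made for convenience.

```python
import numpy as np

# Hypothetical toy training set: m = 4 examples, n = 3 features each.
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 2.9, 4.3],
    [5.9, 3.0, 5.1],
])
y = np.array([0.0, 0.0, 1.0, 1.0])  # one target per example

m, n = X.shape          # m examples (rows), n features (columns)

# The text indexes from 1; NumPy indexes from 0, hence the "- 1" below.
i, j = 2, 3             # example i = 2, feature j = 3 in the text's notation
x_i = X[i - 1]          # feature vector x^(i), of length n
x_ij = X[i - 1, j - 1]  # j-th feature of example i
y_i = y[i - 1]          # target y^(i)
```

Each row of `X` is one transposed feature vector, matching the matrix layout above.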

- We will sometimes add a "constant" feature by setting
$\x^\ip_0 = 1,  1 \le i \le m$
so that the first column of $\X$ is all $1$'s:

$
\X =
\begin{pmatrix}
  1  &\x^{(1)}_1  & \ldots &\x^{(1)}_n \\ 
   1 &\x^{(2)}_1  &\ldots  &\x^{(2)}_n \\ 
   \vdots & \vdots & \ldots &  \vdots \\
   1 &\x^{(m)}_1  &\ldots  &\x^{(m)}_n \\
  \end{pmatrix}
$

- So each of the $m$ rows is an example and each column is a feature ($n$ columns, or $n+1$ once the constant feature is added).
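
One way to add the constant feature, sketched with NumPy on hypothetical data (the name `X_aug` is introduced here for illustration only):

```python
import numpy as np

X = np.array([[2.0, 3.0],
              [4.0, 5.0],
              [6.0, 7.0]])      # hypothetical data: m = 3, n = 2
m = X.shape[0]

# Prepend a column of ones: feature 0 is the constant 1 for every example.
X_aug = np.hstack([np.ones((m, 1)), X])   # shape (m, n + 1)
```

The original features are unchanged; they simply shift one column to the right.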

## Not just numbers !

The features *aren't restricted to being numeric*!

Data comes in many forms, most of which we will deal with in this course:
- numeric
- categorical
- text
- image
- sound (not this course)

Of course, you'll have to encode this data as numbers in order for numerical algorithms to handle it.
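
As one illustration of such an encoding, a categorical feature can be turned into numbers with a "one-hot" scheme: one column per category, with a $1$ marking the category of each example. This is only a sketch on made-up data; one-hot is one of several possible encodings and is not prescribed by the text above.

```python
import numpy as np

colors = ["red", "green", "blue", "green"]   # hypothetical categorical feature

categories = sorted(set(colors))             # ['blue', 'green', 'red']
index = {c: k for k, c in enumerate(categories)}

# One row per example, one column per category.
one_hot = np.zeros((len(colors), len(categories)))
for row, c in enumerate(colors):
    one_hot[row, index[c]] = 1.0
```

Each row contains exactly one $1$, so no artificial ordering is imposed on the categories.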

# Prediction

- Given training example $\x^\ip$, we construct a function $h$ to predict its label

$$\hat{\y}^\ip = h(\x^\ip)$$
- We use a "hat" to denote predictions: $\hat{\y}^\ip$
- The function $h$ will often be parameterized (by $\Theta$) so, for clarity, we should write

$$\hat{\y}^\ip = h(\x^\ip; \Theta)$$
- We will often drop $\Theta$ for ease of reading.
- Since $h$ is a function, it should also be possible to make a prediction for a vector $\mathbf{x}$ that is **not** part of the training set.
- That is, we are able to *generalize* to non-training examples: to make out-of-sample predictions
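
To make the idea of a parameterized $h$ concrete, here is a minimal sketch that assumes a linear form for $h$ (the text does not commit to any particular form) and hand-picked, unfitted values for $\Theta$.

```python
import numpy as np

def h(x, Theta):
    """Hypothetical linear predictor: Theta[0] is an intercept, Theta[1:] are weights."""
    return Theta[0] + np.dot(Theta[1:], x)

Theta = np.array([1.0, 2.0, -1.0])     # assumed parameter values, not fitted

x_train = np.array([3.0, 4.0])         # an example from the training set
x_new = np.array([0.5, 0.25])          # a vector that is not in the training set

y_hat_train = h(x_train, Theta)        # 1 + 2*3 - 1*4 = 3.0
y_hat_new = h(x_new, Theta)            # 1 + 2*0.5 - 1*0.25 = 1.75
```

Because `h` is an ordinary function of `x`, the out-of-sample prediction needs no special handling.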


<table>
    <tr>
        <th><center>Training</center></th>
    </tr>
    <tr>
        <td><img src="images/W1_L4_S9_Intro_training_2.png" width="60%"/></td>
    </tr>
</table>


The key task of Machine Learning is finding the "best" values for parameters $\Theta$.

The process of using the training examples $\langle \X, \y \rangle$ to find $\Theta$
- is called *fitting* the model
- is solved as an optimization problem (to be described)

<table>
    <tr>
        <th><center>Fitting a Linear Regression model</center></th>
    </tr>
    <tr>
        <td><img src="images/W1_L4_S11_Terminology_training_linear_regr.png" width="50%"></td>
    </tr>
</table>

**Summary**
- a training example is a pair $(\x^\ip,\y^\ip)$ drawn from training set $\langle \X, \y \rangle$ consisting of 
    - a feature vector $\x^\ip$ of length $n$
    - the associated label (target) $\y^\ip$
    - $\X$ is of dimension $m \times n$
    - $\y$ is of dimension $m \times 1$, i.e., the target is a single value per example
- predictions are indicated with a "hat":
    - $\hat{\y}^\ip$ is the prediction made given $\x^\ip$ as input
  

# Loss/Cost, Utility

- The prediction $\hat{\y}^{(i)}$ for example $\x^\ip$ is perfect if it matches the true label $\y^\ip$

$$ \hat{\y}^\ip = \y^\ip$$

- Perfection is hard (at least at first) so we need a measure for "how far off" the prediction is.

- We will call the distance between $\hat{\y}^\ip$ and $\y^\ip$ the *Loss* (or *Cost*) for example $i$:

$$
\loss^\ip_\Theta =  L( \;  h(\x^\ip; \Theta),  \y^\ip \;) = L( \hat{\y}^\ip , \y^\ip) 
$$

where $L(a,b)$ is a function that is $0$ when $a = b$ and increasing as $a$ increasingly differs from $b$.

Two common forms of $L$ are Squared Error (for Regression; its average across examples is the Mean Squared Error) and Cross Entropy Loss (for Classification).


The Loss for the entire training set is simply the average (across examples) of the per-example Losses

$$
\loss_\Theta  = { 1\over{m} } \sum_{i=1}^m \loss^\ip_\Theta
$$
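
The two formulas above can be sketched as follows, assuming squared error as the per-example $L$ and using hypothetical targets and predictions:

```python
import numpy as np

def squared_error(y_hat, y):
    """Per-example loss L(y_hat, y): zero when equal, growing as they differ."""
    return (y_hat - y) ** 2

y = np.array([1.0, 2.0, 3.0])        # hypothetical true targets
y_hat = np.array([1.0, 2.5, 2.0])    # hypothetical predictions

per_example_loss = squared_error(y_hat, y)   # [0.0, 0.25, 1.0]
average_loss = per_example_loss.mean()       # (0 + 0.25 + 1) / 3
```

The first example is predicted perfectly and contributes zero loss; the third is furthest off and dominates the average.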

<table>
    <tr>
        <th><center>Training Example</center></th>
    </tr>
    <tr>
        <td><img src="images/W1_L4_s15_Intro_training.jpg"/></td>
    </tr>
</table>

Whereas Loss describes how "bad" our prediction is, we sometimes refer to the converse -- how "good" the prediction is.

We call the "goodness" of the prediction  the *Utility* $U_\Theta$.

So we could state the optimization objective either as
"minimize Cost" or "maximize Utility".

By convention, the Deep Learning (DL) optimization problem is usually framed as one of minimization (of cost or loss) 
rather than maximization of utility.

Since Cost is inversely related to Utility, you will sometimes see
the minimization objective written as
"minimize -1 times Utility".

So be forewarned that you will often see Loss functions with a leading negative sign.
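
A small sketch of this equivalence, assuming the simplest relationship (Utility equal to $-1$ times Loss) and a hypothetical loss evaluated at a few candidate parameter values:

```python
import numpy as np

# Hypothetical loss evaluated at five candidate parameter values.
candidates = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
loss = (candidates - 1.0) ** 2
utility = -loss                        # assumed relationship: Utility = -Loss

best_by_loss = candidates[np.argmin(loss)]        # minimize Cost
best_by_utility = candidates[np.argmax(utility)]  # maximize Utility
```

Both objectives select the same candidate, which is why the two framings are interchangeable.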

## Creating Loss functions is a key part of Deep Learning

As you will come to see, particularly for Deep Learning, the essence of many problems is in creating a Loss Function that captures the objective of your problem.

This is  far from a trivial part of the process.

# Fitting/Training a Model

The best (optimal) $\Theta$ is the one that minimizes the Average (across training examples) Loss

$$
\Theta^* = \argmin{\Theta} { \loss_\Theta }
$$



- The goal of fitting/training is to solve for the $\Theta$ that minimizes the training set loss 
$\loss_\Theta$ 
- The method for finding $\Theta$ is called optimization.
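
As a hedged illustration of fitting, the sketch below assumes a linear model with squared-error loss and uses NumPy's least-squares solver in place of a general-purpose optimizer. Least squares minimizes the sum of squared residuals, which has the same minimizer as the average. The data is synthetic and the helper `average_loss` is introduced only for this example.

```python
import numpy as np

# Hypothetical data generated exactly by y = 1 + 2*x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# Design matrix with a constant feature in column 0.
X = np.column_stack([np.ones_like(x), x])

def average_loss(Theta):
    """Average squared-error loss of the linear model X @ Theta."""
    return np.mean((X @ Theta - y) ** 2)

# Solve for the Theta that minimizes the squared residuals.
Theta_star, *_ = np.linalg.lstsq(X, y, rcond=None)
```

On this noise-free data `Theta_star` recovers the intercept and slope used to generate `y`, and any other `Theta` gives a higher `average_loss`.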



# The dot product: Template matching

- The "dot product" (special case of inner product) is one function
that often appears in template matching

- It measures the
similarity of two vectors

$$
\mathbf{v} \cdot \mathbf{v}' = \sum_{i=1}^n \mathbf{v}_i \mathbf{v}'_i
$$

- As a similarity measure (rather than as a distance), a high dot product means "more similar".

- There are several intuitions for the dot product

- The dot product is maximized  when large (resp., small) values appear in similar positions in both vectors
  - this becomes even more obvious if we $0$-center both vectors such that "small" values become negative
  - this looks like the statistical formula for covariance
    - if we normalize both vectors to unit length, then this looks like correlation
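
The covariance and correlation intuitions can be checked numerically. The sketch below uses two hypothetical vectors; dividing by the vector length to obtain the (population) covariance is a detail added here for the comparison.

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical vectors
w = np.array([2.0, 4.0, 6.0, 9.0])

raw = np.dot(v, w)                   # plain dot product

# 0-center both vectors: "small" values become negative.
vc = v - v.mean()
wc = w - w.mean()
cov_like = np.dot(vc, wc) / len(v)   # population covariance of v and w

# Additionally scale the centered vectors to unit length.
vu = vc / np.linalg.norm(vc)
wu = wc / np.linalg.norm(wc)
corr_like = np.dot(vu, wu)           # Pearson correlation of v and w
```

Here `cov_like` agrees with `np.cov(v, w, bias=True)[0, 1]` and `corr_like` with `np.corrcoef(v, w)[0, 1]`.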

We can generalize the dot product to higher-dimensional objects (e.g., matrices)
- Compute pair-wise product of corresponding entries
- Reduce to a scalar by summing
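
A minimal sketch of this generalization for 2-D arrays, using a hypothetical $3 \times 3$ template and two hypothetical patches (the helper name `dot2d` is introduced here for illustration):

```python
import numpy as np

template = np.array([[0.0, 1.0, 0.0],
                     [1.0, 1.0, 1.0],
                     [0.0, 1.0, 0.0]])   # hypothetical "plus" pattern

patch_match = template.copy()            # a patch identical to the template
patch_other = np.array([[1.0, 0.0, 1.0],
                        [0.0, 0.0, 0.0],
                        [1.0, 0.0, 1.0]])  # a patch with no overlap

def dot2d(a, b):
    """Pair-wise product of corresponding entries, reduced to a scalar by summing."""
    return np.sum(a * b)

score_match = dot2d(template, patch_match)   # 5.0
score_other = dot2d(template, patch_other)   # 0.0
```

The matching patch scores higher than the non-overlapping one, which is the template-matching behavior described above. The result is the same as flattening both arrays and taking the ordinary dot product.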


