In [1]:
%run Latex_macros.ipynb
%run beautify_plots.py

$
\newcommand{\likeli}{\mathbb{L}}
$

In [1]:
# My standard magic !  You will see this in almost all my notebooks.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1

%matplotlib inline

# Loss functions

Our treatment of Loss functions thus far has been somewhat superficial.

It's time to dive deeper.

To review
- The per example Loss is a measure of the success of a prediction on a **training** example
- Predictions are a function of parameters $\Theta$
- "Fitting" or "training" a model is the process of finding the best $\Theta$
    - The optimal $\Theta$ is the one that minimizes average (across training examples) Loss
    
$$
\begin{array}[lll]\\
\loss^\ip_\Theta & =  & L( \;  h(\x^\ip; \Theta),  \y^\ip \;) = L( \hat{\y}^\ip , \y^\ip)  \\
\loss_\Theta  & = & { 1\over{m} } \sum_{i=1}^m \loss^\ip_\Theta \\ \\
\Theta^* & = & \argmin{\Theta} { \loss_\Theta }
\end{array}
$$
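The definitions above can be sketched in a few lines of NumPy. This is only an illustration: it assumes a linear hypothesis $h(\x; \Theta) = \Theta^T \cdot \x$ and a squared-error per-example Loss, and the function names and the tiny data set are made up for the example.

```python
import numpy as np

def per_example_loss(y_hat, y):
    """Per-example Loss L(y_hat, y); squared error is assumed here."""
    return (y_hat - y) ** 2

def average_loss(theta, X, y):
    """Average of the per-example Loss over the m training examples."""
    y_hat = X @ theta          # predictions h(x; theta) for a linear hypothesis
    return per_example_loss(y_hat, y).mean()

# Hypothetical training set: first column is a constant 1 (intercept feature)
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])  # generated by y = 1 + 2x

loss_good = average_loss(np.array([1.0, 2.0]), X, y)  # theta matches the data
loss_bad = average_loss(np.array([0.0, 1.0]), X, y)   # theta does not

print(loss_good, loss_bad)
```

The $\Theta$ that matches the data gives zero average Loss; any other choice gives a larger value.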

The purpose of this module is to present a mathematical basis behind
some of the common Loss functions in  Machine Learning
- Mean Squared Error (MSE), used in Regression tasks
- Cross Entropy, used in Classification tasks
- Kullback Leibler (KL) Divergence

# Likelihood

The statistical method known as Likelihood Maximization is the fundamental tool we will use.

Let us conceptualize our set of training examples
$$\langle \X, \y \rangle= [ \x^\ip, \y^\ip | 1 \le i \le m ]$$

as being *samples* from an unknown distribution 
$$\pdata(\y \; | \; \x)$$
which we call the *true* or *actual*  distribution
- *Conditional* distribution:  target, conditional on feature

The distribution of the sample $\langle \X, \y \rangle$ is called the *empirical* distribution.

Given the actual distribution $\pdata(\y \; | \; \x)$,
- The likelihood (i.e., probability) of drawing the $m$ particular examples in the training data
- Assuming the examples are drawn independently, is

$$
\likeli_{\text{data}} = \prod_{i=1}^m { \pdata(\y^\ip \; | \; \x^\ip) }
$$


By taking the logarithm, we turn this product into a sum

$$
\log(\likeli_{\text{data}}) = \sum_{i=1}^m { \log \left( \pdata(\y^\ip \; | \; \x^\ip) \right) }
$$

called the *Log Likelihood* of the data.
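Besides being easier to manipulate, the sum of logs is also easier to compute. A minimal sketch, using made-up per-example probabilities rather than a real model, shows that the raw product underflows in floating point while the Log Likelihood stays finite.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-example probabilities for m = 1000 independent examples
probs = rng.uniform(0.01, 0.1, size=1000)

# Likelihood: a product of many numbers less than 1 underflows to 0.0
likelihood = np.prod(probs)

# Log Likelihood: the same information, as a sum of logs
log_likelihood = np.sum(np.log(probs))

print(likelihood, log_likelihood)
```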

Note that the true process that generates examples is unknown to us; all we have is a sample (the empirical distribution) from the actual distribution.

We can *hypothesize* the existence of some process that generates the training data
- A Linear Model generates examples $\hat{\y}^\ip = \Theta^T \cdot \x^\ip$
- There may be more complex models we can suggest

But, given our limitations, our hypothetical model does not exactly match our examples
$$
\y^\ip = \hat{\y}^\ip + \epsilon^\ip
$$

where $\epsilon^\ip$ is the error between our hypothesized $\hat{\y}^\ip$ and the $\y^\ip$ we observe.


This means that the conditional distribution of targets is
centered around the model (predicted) value $\hat{\y}^\ip$.


Our limitations
- Maybe we can only *measure* $\y^\ip$ with error
- There is a missing feature that caused the error  
    - Had this feature been included, there would be no error
        - Example: from everything I know about you, I observe that you *never* buy coffee at night
        - One night you buy coffee -- to bring to a friend, which is not a feature we capture
        


Putting this all together
- The observed error is defined as
$$
\epsilon^\ip = \y^\ip - \hat{\y}^\ip
$$

- A linear hypothesis for the true distribution is
$$
\hat{\y} = \Theta^T \cdot \x
$$

- The observed targets differ from our hypothesis by the observed error


$$
\y^\ip = \Theta^T \cdot \x^\ip + \epsilon^\ip
$$

Do these equations look familiar ?
- These are exactly the equations for Linear Regression
- Only the story we told is different
    - We didn't start with the goal of approximating $\y$
    - We adopt the standpoint that $\y$ differs from the hypothesized $\hat{\y}$ because of some error
    

Our hypothesis (now referred to as the *model*) is parameterized by $\Theta$.

It implies a conditional distribution of targets
$$\pmodel(\y \; | \; \x; \Theta)$$
called the *model* or *predicted* distribution

Predicted distribution $\pmodel(\y \; | \; \x ;\Theta)$ is an *approximation* of actual distribution $\pdata( \y \; | \; \x )$.

We now refer to the model values $\hat{\y}^\ip$ as *predictions*

# Maximum Likelihood Estimation (MLE)

We introduce the statistical concept known as Maximum Likelihood Estimation (MLE)

We will subsequently relate our Loss functions to MLE.

Suppose the true process $\pdata(\y \; | \; \x)$ is
$$\y^\ip = 2 * \x^\ip$$

If our model $\pmodel(\y \; | \; \x ;\Theta)$ is
$$\y^\ip = 1 + 3 * \x^\ip + \epsilon^\ip$$

then the errors $\epsilon^\ip$ are systematically incorrect.
- Mean error won't be 0
- $\sigma$, the standard deviation of errors, won't be small

What this means is that
- Under the model's poor assumptions for $\Theta$
- It is *less likely* for us to draw the $m$ particular examples in the training data
    - Compared to an assumption for $\Theta$ that is closer to the actual

The likelihood under the model is written
$$
\likeli_{\text{model}} = \prod_{i=1}^m { \pmodel(\y^\ip \; | \; \x^\ip; \Theta) }
$$

- With a poorly chosen $\Theta$, errors are large, and the likelihood is small.


Under this framework, the best model
- Is the choice of $\Theta$
- That maximizes the likelihood of drawing the $m$ particular examples in the training data

This is called *Maximum Likelihood Estimation*

$$
\Theta^* = \argmax{\Theta}{\sum_{i=1}^m { \log(\pmodel(\y^\ip \; | \; \x^\ip; \Theta)) } }
$$

(Note that maximizing the *log* likelihood is the same as maximizing the likelihood).
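Here is a small numerical sketch of MLE for the example above, where the true process is $\y = 2 * \x$. To evaluate a likelihood we need a distribution for the errors; the sketch assumes normal errors with a fixed $\sigma = 1$ (the assumption made in the Regression section below). The grid of candidate slopes and the function name are choices made only for this illustration.

```python
import numpy as np

# Training examples drawn from the true process y = 2 * x
x = np.linspace(-1.0, 1.0, 21)
y = 2.0 * x

def log_likelihood(intercept, slope, x, y, sigma=1.0):
    """Log Likelihood of the data under y = intercept + slope * x + eps,
    with eps assumed to be Normal(0, sigma)."""
    resid = y - (intercept + slope * x)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2))

# The poorly chosen model from the text versus the true parameters
ll_poor = log_likelihood(1.0, 3.0, x, y)
ll_true = log_likelihood(0.0, 2.0, x, y)

# MLE by brute force: try a grid of slopes and keep the most likely one
slopes = np.linspace(0.0, 4.0, 41)
lls = np.array([log_likelihood(0.0, s, x, y) for s in slopes])
best_slope = slopes[np.argmax(lls)]

print(ll_poor, ll_true, best_slope)
```

The poorly chosen $\Theta$ has a lower Log Likelihood than the true one, and the grid search recovers a slope of 2.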

[Deep Learning Book 5.5](https://www.deeplearningbook.org/contents/ml.html)

# Loss functions for Machine Learning

We now show that our choice of Loss functions
- MSE for Regression
- Cross Entropy for Classification

can be justified in terms of
**maximization of the log likelihood**.

## Regression: Log Likelihood of Linear models with normal errors

Under the hypothesis of a Linear Model, we have
$$
\begin{array}[lll]\\
\y^\ip & = & \hat{\y}^\ip + \epsilon^\ip \\
\hat{\y}^\ip & = & \Theta^T \cdot \x^\ip \\
\epsilon^\ip & = & \y^\ip - \hat{\y}^\ip
\end{array}
$$

We have not yet made any assumption about the distribution of $\epsilon^\ip$.

Suppose it is normally distributed, with mean $0$ and standard deviation $\sigma$
$$\epsilon^\ip \sim \mathcal{N}(0,\sigma)$$

This means $\prc{\y^\ip}{\x^\ip ; \Theta}$ is $\mathcal{N}(\hat{\y}^\ip,\sigma)$

Substituting the formula for the Normal density, the conditional probability of $\y^\ip$ given $\x^\ip$ is

$$
\begin{array}[llll] \\
\prc{\y^\ip}{\x^\ip ; \Theta} & = & \frac{1}{\sigma \sqrt{2\pi}} \exp \left( - \frac{(\y^\ip - \hat{\y}^\ip)^2}{2\sigma^2} \right) &   \prc{\y^\ip}{\x^\ip ; \Theta} \text{ is }\mathcal{N}(\hat{\y}^\ip,\sigma); \\ & & & \text{def. of Normal} \\
& = & \frac{1}{\sigma \sqrt{2\pi}} \exp \left( - \frac{(\epsilon^\ip)^2}{2\sigma^2} \right) &  \epsilon^\ip = \y^\ip - \hat{\y}^\ip\\
    & \propto &\exp \left( - \frac{(\epsilon^\ip)^2}{2 \sigma^2} \right)  \\  
\end{array}
$$

The Likelihood of the training set, given this model of the conditional probability,
is just the product over the training set of $\prc{\y^\ip}{\x^\ip ; \Theta}$:
$$
\mathbb{L}_{\text{model}} = \prod_{i=1}^m { \prc{\y^\ip}{\x^\ip ; \Theta} }
$$
and the Log Likelihood is
$$
\begin{array}[llll] \\
\mathbb{l}_{\text{model}} & = & \log \left( \prod_{i=1}^m { \prc{\y^\ip}{\x^\ip ; \Theta} } \right) \\
& = &  \sum_{i=1}^m { \log \left( \prc{\y^\ip}{\x^\ip ; \Theta} \right) } \\
& = &  \sum_{i=1}^m { \log \left( \frac{1}{\sigma \sqrt{2\pi}} \exp \left( - \frac{(\epsilon^\ip)^2}{2 \sigma^2} \right) \right) } \\
& = &  - m \log( \sigma \sqrt{2\pi} ) - \sum_{i=1}^m { \frac{(\epsilon^\ip)^2}{2 \sigma^2}} \\
& =  &  - m \log( \sigma \sqrt{2\pi} ) - \frac{1}{2 \sigma^2} \sum_{i=1}^m { {(\y^\ip - \Theta^T \cdot \x^\ip)^2}} \\
\end{array}
$$

The first term does not depend on $\Theta$ and $\frac{1}{2 \sigma^2}$ is a positive constant, so maximizing the Log Likelihood over $\Theta$ is the same as maximizing
$$
- \sum_{i=1}^m { {(\y^\ip - \Theta^T \cdot \x^\ip)^2}}
$$

You should recognize the negative of this expression as $m$ times the Mean Squared Error (MSE).

Thus minimizing the MSE Loss function, which we originally presented without justification
- Is equivalent to finding the model that maximizes the likelihood of the training data, under the assumption of normally distributed errors

Stated another way
- The MSE Loss
- Gives rise to the $\Theta$ obtained by MLE
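The equivalence can be checked numerically. The sketch below uses synthetic data (the true parameters, noise level, and sample size are arbitrary choices for illustration) and treats $\sigma$ as known. It fits $\Theta$ by ordinary least squares with `np.linalg.lstsq`, then confirms that the Gaussian Log Likelihood is a constant minus a positive multiple of the MSE, so the least-squares $\Theta$ is also the most likely one.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data from a linear model with normal errors (all values hypothetical)
m = 200
sigma = 0.3
theta_true = np.array([0.5, 2.0])
X = np.column_stack([np.ones(m), rng.normal(size=m)])  # constant + one feature
y = X @ theta_true + rng.normal(scale=sigma, size=m)

def mse(theta):
    return np.mean((y - X @ theta) ** 2)

def gaussian_log_likelihood(theta):
    resid = y - X @ theta
    return np.sum(-np.log(sigma * np.sqrt(2 * np.pi)) - resid**2 / (2 * sigma**2))

# Theta that minimizes MSE (ordinary least squares)
theta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Log Likelihood = constant - m * MSE / (2 * sigma^2)
ll_at_ols = gaussian_log_likelihood(theta_ols)
ll_from_mse = -m * np.log(sigma * np.sqrt(2 * np.pi)) - m * mse(theta_ols) / (2 * sigma**2)

# Moving away from theta_ols raises MSE and lowers the Log Likelihood
perturbed = [theta_ols + d for d in (np.array([0.1, 0.0]),
                                     np.array([0.0, -0.1]),
                                     np.array([0.05, 0.05]))]
for t in perturbed:
    print(mse(t) - mse(theta_ols), gaussian_log_likelihood(t) - ll_at_ols)
```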

## Classification: Log Likelihood of Binary classification

Review of Binary Classification:
- We encode the Positive labels $\y^\ip$ with the number 1 and Negative labels with the number 0
- For example $i$ we compute $\hat{p}^\ip = \pr{\y^\ip = \text{Positive} \; | \; \x^\ip}$

The conditional probability for a target $\y^\ip$ given the features $\x^\ip$ can be written as
$$
\begin{array}[lll]\\
\prc{\y^\ip}{\x^\ip; \Theta} & = & \y^\ip * \prc{\y^\ip = \text{Positive}}{\x^\ip; \Theta} + (1 - \y^\ip) * \prc{\y^\ip = \text{Negative}}{\x^\ip; \Theta} & \text{One of } \y^\ip, (1-\y^\ip) \text{ is } 0, \text{ other is } 1 \\
& = & \prc{\y^\ip = \text{Positive}}{\x^\ip; \Theta}^{\y^\ip} * \prc{\y^\ip = \text{Negative}}{\x^\ip; \Theta}^{(1 - \y^\ip)} & \text{A factor raised to the power } 0 \text{ is } 1 \\
\end{array}
$$

Substituting the above probability into the 
Likelihood
$$
\likeli_{\text{model}} = \prod_{i=1}^m { \pmodel(\y^\ip \; | \; \x^\ip; \Theta) }
$$

and taking the log
$$
\begin{array}[lllll]\\
\mathcal{l} & = & \log(\likeli_{\text{model}}) \\
\mathcal{l} & = & \sum_{i=1}^m { \y^\ip * \log \left( \prc{\y^\ip = \text{Positive}}{\x^\ip; \Theta}\right) 
+ (1-\y^\ip ) * \log \left( \prc{\y^\ip = \text{Negative}}{\x^\ip; \Theta}\right)} \\
& = & \sum_{i=1}^m { \y^\ip * \log(\hat{p}^\ip)  + (1-\y^\ip ) * \log( 1 - \hat{p}^\ip )} \\
\end{array}
$$

Recalling  the per-example Loss function for Binary Classification
$$
\begin{array}[lll]\\
\loss^\ip_\Theta & = & - \left( \y^\ip*\log(\hat{p}^\ip) + (1-\y^\ip) * \log(1-\hat{p}^\ip) \right) \\
\loss_\Theta  & = &{ 1\over{m} } \sum_{i=1}^m \loss^\ip_\Theta
\end{array}
$$

You see that $\frac{1}{m}$ times the negative of the Log Likelihood is equal to the Binary Cross Entropy Loss.

Thus minimizing Binary Cross Entropy loss, which we originally presented without justification
- Is equivalent to finding the model that maximizes the likelihood of the training data

Stated another way
- The Binary Cross Entropy Loss
- Gives rise to the $\Theta$ obtained by MLE
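A quick numerical check of this identity, using a handful of made-up labels and predicted probabilities (not the output of any fitted model):

```python
import numpy as np

# Hypothetical labels (1 = Positive, 0 = Negative) and predicted P(Positive)
y = np.array([1, 0, 1, 1, 0])
p_hat = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Probability the model assigns to each observed label:
# p_hat when y = 1, (1 - p_hat) when y = 0
per_example_prob = p_hat**y * (1 - p_hat) ** (1 - y)

# Likelihood and Log Likelihood of the training set
likelihood = np.prod(per_example_prob)
log_likelihood = np.log(likelihood)

# Binary Cross Entropy Loss, averaged over the examples
bce = -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

print(bce, -log_likelihood / len(y))
```

The two printed numbers agree, up to floating point rounding.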

## KL divergence

We can now motivate the KL divergence:
- The difference between the log likelihood
of the empirical and model distributions.

The definition of conditional probability (the product rule, from which Bayes' Theorem follows) relates joint and conditional probabilities

$$
\begin{array}[lll]\\
\prc{\y}{\x } & = & \frac{\pr{\x,\y}} {\pr{\x}} \\
\pr{\x,\y} & = & \prc{\y}{\x} \; \pr{\x}  & \text{re-arrange the terms} \\\\
\end{array}
$$

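So the log of a joint probability splits into a conditional term and a term that does not involve $\Theta$
$$
\log \left( \pmodel(\x^\ip, \y^\ip; \Theta) \right) = \log \left( \pmodel(\y^\ip | \x^\ip ; \Theta) \right) + \log \left( \pr{\x^\ip} \right)
$$

Only the conditional term matters when choosing $\Theta$.

Dividing the Log Likelihood by $m$ turns the sum over training examples into an average.
This average is an expectation over the empirical distribution, in which each training example has probability $\frac{1}{m}$
$$
\begin{array}[lll]\\
\frac{1}{m} \log(\likeli_{\text{model}}) & = & \frac{1}{m} \sum_{i=1}^m { \log \left( \pmodel(\y^\ip | \x^\ip ; \Theta) \right) } \\
                             & = & \E_{\x \sim \pdata}  {\log \left( \pmodel(\y | \x ; \Theta) \right)} 
\end{array}
$$
and similarly for $\log(\likeli_{\text{data}})$.

The difference between the (per-example average) log likelihoods of the two distributions is
$$
\begin{array}[lll]\\
\frac{1}{m} \left( \log(\likeli_{\text{data}}) - \log(\likeli_{\text{model}}) \right) & = &
    \E_{\x \sim \pdata} { \left( \log(\pdata(\y | \x))  \right) }  - \E_{\x \sim \pdata} { \left( \log(\pmodel(\y | \x; \Theta)) \right)} \\
\end{array}
$$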

The above difference is called the *KL Divergence* between the distributions.

It is a measure of the "closeness" of two distributions.

This means
- Minimizing KL Divergence between the actual and predicted distributions
- Is equivalent to minimizing the difference between the log likelihoods of the distributions

The optimal $\pmodel(\y | \x; \Theta)$ is the one with smallest KL Divergence

Since only $\pmodel(\y | \x; \Theta)$ is a function of $\Theta$, the $\Theta$ that minimizes KL Divergence
is found by minimizing
$$
 - \E_{\x \sim \pdata} { \left( \log(\pmodel(\y \; | \; \x; \Theta)) \right) }
 $$
which is the expression for the Cross Entropy Loss.

Thus minimizing Cross Entropy Loss,  which we originally presented without justification
- Is equivalent to minimizing the KL Divergence between the actual and predicted distributions
- Is equivalent to minimizing the difference between the log likelihoods of the distributions
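For discrete distributions the relationship is easy to verify directly. In the sketch below, `p_data`, `q_close` and `q_far` are made-up distributions over three outcomes, with `p_data` playing the role of the actual distribution and the other two playing the role of candidate models. The KL Divergence equals the Cross Entropy minus the entropy of `p_data`. Since the entropy does not depend on the model, ranking models by Cross Entropy is the same as ranking them by KL Divergence.

```python
import numpy as np

def entropy(p):
    """-E_p[log p]: does not depend on the model."""
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """-E_p[log q]: the part that depends on the model q."""
    return -np.sum(p * np.log(q))

def kl_divergence(p, q):
    """E_p[log p] - E_p[log q]."""
    return np.sum(p * (np.log(p) - np.log(q)))

# Hypothetical actual distribution and two candidate model distributions
p_data = np.array([0.7, 0.2, 0.1])
q_close = np.array([0.6, 0.3, 0.1])
q_far = np.array([0.1, 0.2, 0.7])

for q in (q_close, q_far):
    print(kl_divergence(p_data, q), cross_entropy(p_data, q) - entropy(p_data))
```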


# Unsupervised Learning

Although we have not yet covered Unsupervised Learning, we observe that
Likelihood Maximization can be applied there as well
- Unsupervised Learning has no targets
- So training examples 
$$\langle \X \rangle= [ \x^\ip | 1 \le i \le m ]$$
- The distribution is over features, not of targets conditional on features
$$\pdata(\x)$$
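As a small illustration of Likelihood Maximization without targets, suppose we hypothesize that a single feature is normally distributed. This is an assumption made only for this sketch, and the data are synthetic. The MLE for the mean and standard deviation are then the sample mean and the (population) sample standard deviation, and any other choice gives a lower Log Likelihood.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, unlabeled examples of a single feature (hypothetical data)
x = rng.normal(loc=3.0, scale=2.0, size=500)

def log_likelihood(mu, sigma, x):
    """Log Likelihood of x under a hypothesized Normal(mu, sigma) distribution."""
    return np.sum(-np.log(sigma * np.sqrt(2 * np.pi)) - (x - mu) ** 2 / (2 * sigma**2))

# Closed-form MLE for a Normal distribution
mu_mle = x.mean()
sigma_mle = x.std()   # ddof=0: divides by m, which is the MLE

ll_mle = log_likelihood(mu_mle, sigma_mle, x)

# Other parameter choices are less likely
ll_shifted = log_likelihood(mu_mle + 0.2, sigma_mle, x)
ll_wider = log_likelihood(mu_mle, sigma_mle * 1.2, x)
ll_narrower = log_likelihood(mu_mle, sigma_mle * 0.8, x)

print(ll_mle, ll_shifted, ll_wider, ll_narrower)
```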

# Loss functions for Deep Learning: Preview

The Loss functions for Classical Machine Learning were perhaps motivated by the desire for closed form solutions.

In Deep Learning, the optimization is typically solved via search.

This opens the possibilities of complex loss functions that don't require closed form solution.

As we will see in the Deep Learning part of this course, the key part of solving a task
is in *defining* a loss function that mirrors the task's objective.

Thus, many loss functions are problem specific and often quite creative.

## Cool loss functions: Neural Style Transfer

Given 
- a "Content" Image that you want to transform
- a "Style" Image (e.g., Van Gogh "Starry Night")

Generate a New image that is the Content image redrawn in the style of the Style Image
- [Gatys: A Neural Algorithm for Style](https://arxiv.org/abs/1508.06576)
- [Fast Neural Style Transfer](https://github.com/jcjohnson/fast-neural-style)
 

### Content image
<img src=images/chicago.jpg width=500> 

### Style image
<img src=images/starry_night_crop.jpg width=500>

### Generated image
<img src=images/chicago_starry_night.jpg width=500> 

### Loss function

Definitions:
- Style image, represented as a vector of pixels $\vec{a}$
- Content image, represented as a vector of pixels $\vec{p}$
- Generated image, represented as a vector of pixels $\vec{x}$

The Loss function (which we want to minimize by varying $\vec{x}$) has two parts

$$
\text{L} = \text{L}_{\text{content}}(\vec{p}, \vec{x}) + \text{L}_{\text{style}}(\vec{a}, \vec{x})
$$

- a Content Loss
    - measure of how different the generated image $\vec{x}$ is from Content image  $\vec{p}$
- a Style Loss
    - measure of how different the "style" of generated $\vec{x}$ is from style of Style image $\vec{a}$
    

Key: defining what is "style" and similarity of style
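One possible answer, sketched below, follows the idea in the Gatys paper linked above: compare images through the feature maps of a neural network rather than through raw pixels, and summarize "style" by the correlations between feature channels (a Gram matrix). The sketch is illustrative only. It assumes the feature maps have already been computed by some network and uses random arrays as stand-ins, and the function names are made up for the example.

```python
import numpy as np

def gram_matrix(features):
    """Correlations between channels of a (channels, positions) feature map.
    Spatial layout is summed out, so only 'texture' statistics remain."""
    return features @ features.T

def content_loss(feat_content, feat_generated):
    """Squared difference between feature maps: sensitive to spatial layout."""
    return 0.5 * np.sum((feat_generated - feat_content) ** 2)

def style_loss(feat_style, feat_generated):
    """Squared difference between Gram matrices: insensitive to spatial layout."""
    n_channels, n_positions = feat_generated.shape
    diff = gram_matrix(feat_generated) - gram_matrix(feat_style)
    return np.sum(diff**2) / (4.0 * n_channels**2 * n_positions**2)

# Stand-in feature maps (8 channels, 64 spatial positions); a real system
# would obtain these by passing the images p, a, x through a network
rng = np.random.default_rng(0)
feat_p = rng.normal(size=(8, 64))   # features of Content image
feat_a = rng.normal(size=(8, 64))   # features of Style image
feat_x = feat_p.copy()              # generated image starts as the Content image

total_loss = content_loss(feat_p, feat_x) + style_loss(feat_a, feat_x)

# Shuffling spatial positions changes content but not style
perm = rng.permutation(64)
style_after_shuffle = style_loss(feat_a, feat_a[:, perm])

print(total_loss, style_after_shuffle)
```

In this sketch, minimizing `total_loss` by varying the generated image trades off staying close to the Content features against matching the Style image's channel correlations.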

In [3]:
print("Done")

Done
