In [1]:
%run Latex_macros.ipynb

<IPython.core.display.Latex object>

Macro `_latex_std_` created. To execute, type its name (without quotes).
=== Macro contents: ===
get_ipython().run_line_magic('run', 'Latex_macros.ipynb')
 

$
\newcommand{\pdata}{\prob_\text{data}}
\newcommand{\pmodel}{\prob_\text{model}}
\newcommand{\likeli}{\mathbb{L}}
$

In [2]:
# My standard magic !  You will see this in almost all my notebooks.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1

%matplotlib inline

# Cost functions: Classical Machine Learning

We have thus far presented cost functions for Regression and Classification
- Mean Squared Error (Regression)
- Cross Entropy (Classification)

with little more than an "intuitive" justification.

We now explain them more mathematically.

In Classical Machine Learning, a technique called *Maximum Likelihood* is the basis
for many algorithms.

We explain this idea and show how the cost functions encountered thus far can be explained in terms
of likelihood maximization.

## Supervised prediction as likelihood

In the Classification task, we are predicting a distribution of values
rather than a single value.

The Regression task, at first glance, appears to predict a single value rather than a distribution of values.

There is an interpretation of the Regression task that views it as also predicting a distribution of values.

This will be very useful in explaining where the Cost function comes from.

It's possible to have two training examples $i, i'$ with identical features but different targets
- $\x^\ip = \x^{(i')}$ but
- $\y^\ip \ne \y^{(i')}$

This means that the mapping from features to targets in the training data is not a function: the same input is paired with more than one output.

A simple explanation is one of measurement error
- Imprecision in measuring the targets $\y^\ip, \y^{(i')}$
    - There is a "true" (in terms of a function mapping) $\tilde{\y}^\ip$ such that
        - $\y^\ip   =  \tilde{\y}^\ip + \epsilon^\ip$
        - $\y^{(i')} =  \tilde{\y}^\ip + \epsilon^{(i')}$
    - i.e., the two targets are really the same $\tilde{\y}^\ip$, but have been mis-measured
- Imprecision in measuring features $\x^\ip, \x^{(i')}$
    - the "true" feature values are *different* $\tilde{\x}^\ip \ne \tilde{\x}^{(i')}$ but mis-measured as equal  
        - $\x^\ip      =  \tilde{\x}^\ip + \epsilon^\ip$
        - $\x^{(i')}   =  \tilde{\x}^{(i')} + \epsilon^{(i')}$

So rather than our model (estimator) predicting a single value, it predicts a distribution of values.

We can frame the Supervised Learning task as creating a model
to predict
$$
\pr{\y^\ip | \x^\ip}
$$
the *conditional probability* of $\y^\ip$ given input $\x^\ip$.
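
The measurement-error view above can be illustrated with a small simulation. This is only a sketch: the "true" mapping `true_target` and the noise scale are hypothetical choices made for illustration, not taken from any dataset in this course.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" functional mapping from feature to target (assumption)
def true_target(x):
    return 2.0 * x + 1.0

x = 3.0                                             # one fixed feature value
noise = rng.normal(loc=0.0, scale=0.5, size=1000)   # measurement error epsilon
y_observed = true_target(x) + noise                 # many measured targets for the same x

# Identical features, yet the measured targets differ
print(y_observed[:3])

# The measured targets form a distribution centered near the "true" target
print(y_observed.mean(), y_observed.std())
```

Each repeated measurement at the same $\x$ gives a different target, so what a model can sensibly predict is the distribution of targets given $\x$.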

## Log likelihood
[Deep Learning Book 5.5](https://www.deeplearningbook.org/contents/ml.html)

The training data $\{ (\x^\ip, \y^\ip) | i=1, \dots, m \}$
is a sample from an unknown joint distribution $\pdata(\x, \y)$.

So the training data defines an empirical distribution that approximates the true but unknown underlying $\pdata(\x, \y)$.

A model (parameterized by $\Theta$) creates an *approximation* $\pmodel(\x, \y;\Theta)$ of $\pdata(\x, \y)$.

Note that $\pdata(\x, \y)$ is not parameterized by $\Theta$.

We can motivate the choice of $\Theta$ by the principle of Likelihood Maximization.

Given the training set, and the true joint distribution $\pdata(\x, \y)$,
we can compute the likelihood (i.e., the probability) of drawing the $m$ samples in the training set, assuming the samples are drawn independently, as

$$
\likeli_{\text{data}} = \prod_{i=1}^m { \pdata(\x^\ip, \y^\ip) }
$$

Similarly, we can compute the same likelihood, using the probabilities from the model

$$
\likeli_{\text{model}} = \prod_{i=1}^m { \pmodel(\x^\ip, \y^\ip; \Theta) }
$$

We can turn this product into a sum by taking the log of both sides

$$
\log(\likeli_{\text{model}}) = \sum_{i=1}^m { \log(\pmodel(\x^\ip, \y^\ip; \Theta)) }
$$

This is called the Log Likelihood.

How do we choose $\Theta$ ?

Let us choose $\Theta$ such that the choice
*maximizes* the likelihood of seeing the training set.

$$
\hat{\Theta} = \argmax{\Theta}{\sum_{i=1}^m { \log(\pmodel(\x^\ip, \y^\ip; \Theta)) } }
$$

That is, the choice of $\Theta$ that results in a model which best approximates the empirical (training) data.

Finding the best $\Theta$ means maximizing the log likelihood of the model.
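
To make likelihood maximization concrete, here is a minimal sketch using a toy model. It assumes a Bernoulli model with a single parameter `theta` and no features, and it searches a grid of candidate values; both are simplifications chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "training set": 500 binary outcomes (hypothetical data)
y = rng.binomial(n=1, p=0.3, size=500)

def log_likelihood(theta, y):
    # Sum over examples of log p_model(y; theta) for a Bernoulli model
    return np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

# Evaluate the log likelihood on a grid of candidate parameters
thetas = np.linspace(0.01, 0.99, 99)
lls = np.array([log_likelihood(t, y) for t in thetas])

# The maximum likelihood estimate is the candidate with the largest log likelihood
theta_hat = thetas[np.argmax(lls)]
print(theta_hat, y.mean())
```

For this toy model the maximizer lands (up to the grid spacing) on the fraction of positive examples, which is the known closed form answer for a Bernoulli model.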

## KL divergence

We can now motivate the KL divergence:
- the difference between the log likelihoods of the empirical and model distributions.

The definition of conditional probability relates joint and conditional probabilities

$$
\begin{array}{llll}
\pr{\y | \x } & = & \frac{\pr{\x,\y}} {\pr{\x}} \\
\pr{\x,\y} & = & \pr{\y | \x} \; \pr{\x}  & \text{re-arrange the terms} \\
\end{array}
$$

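So, dividing by $m$ (which does not change the maximizing $\Theta$), we can re-write
$$
\begin{array}{lll}
\frac{1}{m} \log(\likeli_{\text{model}}) & = & \frac{1}{m} \sum_{i=1}^m { \log(\pmodel(\x^\ip, \y^\ip; \Theta)) } \\
                             & = & \frac{1}{m} \sum_{i=1}^m { \left( \log(\pmodel(\y^\ip | \x^\ip ; \Theta)) + \log(\pr{\x^\ip}) \right) } \\
                             & = & \E_{\x, \y \sim \pdata}  {\log(\pmodel(\y | \x ; \Theta))} + \E_{\x \sim \pdata} { \log(\pr{\x}) }
\end{array}
$$
where the expectation is the average over the training examples, and similarly for $\log(\likeli_{\text{data}})$.

The term involving $\pr{\x}$ does not depend on $\Theta$ and is the same for both likelihoods, so it cancels in the difference.

The difference between the log likelihoods of the two distributions is
$$
\begin{array}{lll}
\frac{1}{m} \left( \log(\likeli_{\text{data}}) - \log(\likeli_{\text{model}}) \right) & = &
    \E_{\x, \y \sim \pdata} { \left( \log(\pdata(\y | \x))  \right) }  - \E_{\x, \y \sim \pdata} { \left( \log(\pmodel(\y | \x; \Theta)) \right)} \\
 & = & \E_{\x, \y \sim \pdata} { \left( \log \frac{\pdata(\y | \x)}{\pmodel(\y | \x; \Theta)} \right) } \\
\end{array}
$$

You hopefully recognize the difference as being equal
to the definition of the KL Divergence between the data and model distributions.

Thus, the KL divergence is explained as the (per example) difference between the log likelihoods
of the empirical and model distributions.

Since the first term does not depend on $\Theta$, maximizing the log likelihood of the model minimizes the KL divergence.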
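
As a quick numerical check of this interpretation, the sketch below uses two small discrete distributions. The probability vectors `p_data` and `p_model` are made-up values for illustration.

```python
import numpy as np

# Hypothetical discrete distributions over three outcomes
p_data = np.array([0.5, 0.3, 0.2])
p_model = np.array([0.4, 0.4, 0.2])

# Expected log likelihood of each distribution, with outcomes drawn from p_data
expected_ll_data = np.sum(p_data * np.log(p_data))
expected_ll_model = np.sum(p_data * np.log(p_model))

# KL divergence of p_model from p_data, computed from its definition
kl = np.sum(p_data * np.log(p_data / p_model))

print(kl, expected_ll_data - expected_ll_model)
```

The two printed numbers agree: the KL divergence is the gap between the expected log likelihood under the data distribution and under the model.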

## Cost functions for Classical Machine Learning
We now show that our choice of cost functions
- MSE for Regression
- Binary Cross Entropy for (binary) Classification

can be justified in terms of
maximization of the log likelihood.

### Log likelihood of Binary classification

For binary classification (where $\hat{y}^\ip \in \{0,1\}$)
we compute a score as a linear function of $\x$
$$
s(\x^\ip) = \Theta^T \cdot \x^\ip
$$

and we convert the linear score into a probability via the logistic function

$$ \hat{p}^\ip  = \sigma(s(\x^\ip))$$

A Positive prediction (i.e., prediction of value $1$) is the conditional probability
$$
p(\hat{y}^\ip  = 1 \, |\, \x^\ip) = \hat{p}^\ip
$$
And a Negative prediction (i.e., prediction of value $0$) is
$$
p(\hat{y}^\ip  = 0 \,|\, \x^\ip) = 1 - \hat{p}^\ip
$$

We can combine the equations for the two cases into the single equation
$$
p(\hat{y}^\ip | \x^\ip) = p(\hat{y}^\ip  = 1  | \x^\ip)^{\y^\ip} \cdot p(\hat{y}^\ip  = 0 | \x^\ip)^{(1- \y^\ip)}
$$

(because $\y \in \{0,1\}$, one term in the product  always has exponent $0$)

Again, the likelihood is the product (over $i$) of these terms and the log likelihood is
$$
\begin{array}{lll}
\log(\likeli_{\text{model}}) & = & \sum_{i=1}^m { \y^\ip \log(p(\hat{y}^\ip  = 1  | \x^\ip))  + (1-\y^\ip )\log(p(\hat{y}^\ip  = 0  | \x^\ip) )} \\
   & = & \sum_{i=1}^m { \y^\ip \log(p(\hat{y}^\ip  = 1  | \x^\ip))  + (1-\y^\ip )\log( 1 - p(\hat{y}^\ip  = 1  | \x^\ip) )} \\
\end{array}
$$

You should recognize the negative of the Log Likelihood as the Binary Cross Entropy loss.

So maximizing the log likelihood minimizes the Binary Cross Entropy Loss.
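
The relationship can be checked numerically. The following is a minimal sketch: the feature matrix `X`, labels `y`, and parameter vector `theta` are made-up values, and the Binary Cross Entropy is taken as the average (over examples) of the negative log likelihood.

```python
import numpy as np

def sigmoid(s):
    # Logistic function: maps a score to a probability
    return 1.0 / (1.0 + np.exp(-s))

# Hypothetical training set: first column is the constant feature
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.1]])
y = np.array([1, 0, 1, 0])
theta = np.array([0.2, 1.0])    # hypothetical parameters

p_hat = sigmoid(X @ theta)      # predicted probability of the Positive class

# Likelihood: product over examples of the combined single equation
likelihood = np.prod(p_hat**y * (1 - p_hat) ** (1 - y))

# Log likelihood: the product becomes a sum
log_likelihood = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

# Binary Cross Entropy: negative log likelihood, averaged over examples
bce = -log_likelihood / len(y)
print(np.log(likelihood), log_likelihood, bce)
```

Any change to `theta` that raises `log_likelihood` lowers `bce` by the same amount (scaled by the number of examples).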

### Log Likelihood of Linear models with normal errors

Our Linear models are of the form

$$
\hat{\y}^\ip = \Theta^T \cdot \x^\ip + \epsilon
$$
where $\epsilon \sim \mathcal{N}(0,\sigma^2)$, i.e., $\epsilon$ is Normally distributed with mean $0$ and standard deviation $\sigma$.

The $\epsilon$ can be interpreted in either of two ways
- as an approximation error (inability to fit data exactly)
- measurement error, as explained above
   
So our prediction, viewed as the conditional distribution $\pr{\hat{\y}^\ip | \x^\ip}$,
$$
\hat{\y}^\ip = \Theta^T \cdot \x^\ip + \epsilon
$$
becomes a Normal random variable with
mean $\mu = \Theta^T \cdot \x^\ip$ and standard deviation $\sigma$.

Substituting the formula for the Normal distribution, the conditional probability of $\hat{\y}^\ip$ given $\x^\ip$ is

$$
\begin{array}{llll}
p(\hat{\y}^\ip | \x^\ip) & = & \frac{1}{\sigma \sqrt{2\pi}} \exp\left(- \frac{(\hat{\y}^\ip - \mu)^2}{2\sigma^2}\right) & \text{def. of Normal} \\
    & \propto &\exp\left(- \frac{(\hat{\y}^\ip - \Theta^T \cdot \x^\ip)^2}{2 \sigma^2}\right) & \text{def. of }\mu \\
\end{array}
$$

The Likelihood of the training set, given this model of the conditional probability,
is just the product over the training set of $p(\hat{\y}^\ip | \x^\ip)$:
$$
\mathbb{L}_{\text{model}} = \prod_{i=1}^m {p(\hat{\y}^\ip | \x^\ip)}
$$
and the Log Likelihood is
$$
\begin{array}{llll}
\log(\mathbb{L}_{\text{model}}) & = & \log( \prod_{i=1}^m {p(\hat{\y}^\ip | \x^\ip)}) \\
& = &  \sum_{i=1}^m {\log( p(\hat{\y}^\ip | \x^\ip) )} \\
& = &  \sum_{i=1}^m { - \frac{(\hat{\y}^\ip - \Theta^T \cdot \x^\ip)^2}{2 \sigma^2}} + \text{constant} \\
& = &   - \frac{1}{2 \sigma^2} \sum_{i=1}^m { {(\hat{\y}^\ip - \Theta^T \cdot \x^\ip)^2}} + \text{constant}
\end{array}
$$
where the constant $-m \log(\sigma \sqrt{2\pi})$ does not depend on $\Theta$.

You should recognize the negative of the Log Likelihood as being, up to this constant and a positive scale factor, the Mean Squared Error (MSE).

So maximizing the log likelihood (with respect to $\Theta$) minimizes the MSE.
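
A small numerical sketch of this equivalence follows. It uses synthetic data with made-up parameters (`theta_true`, `sigma`), the standard Normal density with variance $\sigma^2$, and NumPy's least squares solver to find the parameters that minimize the squared error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set (hypothetical): constant feature plus one real feature
m = 200
X = np.column_stack([np.ones(m), rng.uniform(-1, 1, size=m)])
theta_true = np.array([1.0, 3.0])
sigma = 0.5
y = X @ theta_true + rng.normal(0.0, sigma, size=m)

def gaussian_log_likelihood(theta, X, y, sigma):
    # Sum over examples of the log of the Normal density with mean X @ theta
    resid = y - X @ theta
    return -len(y) * np.log(sigma * np.sqrt(2 * np.pi)) - np.sum(resid**2) / (2 * sigma**2)

def mse(theta, X, y):
    return np.mean((y - X @ theta) ** 2)

# Parameters minimizing the squared error
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Moving away from the least squares solution raises MSE and lowers the log likelihood
theta_other = theta_ls + np.array([0.05, -0.05])
ll_at_ls = gaussian_log_likelihood(theta_ls, X, y, sigma)
ll_other = gaussian_log_likelihood(theta_other, X, y, sigma)
print(mse(theta_ls, X, y), mse(theta_other, X, y))
print(ll_at_ls, ll_other)
```

The value of `sigma` only rescales and shifts the log likelihood, so it does not affect which $\Theta$ is best.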


## Complex loss functions: multiple objectives

### Regularization objectives

# Cost functions for Deep Learning: Preview

The Cost functions for Classical Machine Learning were perhaps motivated by the desire for closed form solutions.

In Deep Learning, the optimization is typically solved via search.

This opens up the possibility of complex cost functions that do not have a closed form solution.

As we will see in the Deep Learning part of this course, the key part of solving a task
is in *defining* a cost function that mirrors the task's objective.

Thus, many cost functions are problem specific and often quite creative.

## Cool cost functions: Neural Style Transfer
Given
- a "Content" Image that you want to transform
- a "Style" Image (e.g., Van Gogh's "Starry Night")

generate a new image that is the Content image redrawn in the style of the Style Image.

References
- [Gatys: A Neural Algorithm of Artistic Style](https://arxiv.org/abs/1508.06576)
- [Fast Neural Style Transfer](https://github.com/jcjohnson/fast-neural-style)
 

### Content image
<img src=images/chicago.jpg width=500> 

### Style image
<img src=images/starry_night_crop.jpg width=500>

### Generated image
<img src=images/chicago_starry_night.jpg width=500> 

### Cost function

Definitions:
- Style image, represented as a vector of pixels $\vec{a}$
- Content image, represented as a vector of pixels $\vec{p}$
- Generated image, represented as a vector of pixels $\vec{x}$

The Loss function (which we want to minimize by varying $\vec{x}$) has two parts

$$
\text{L} = \text{L}_{\text{content}}(\vec{p}, \vec{x}) + \text{L}_{\text{style}}(\vec{a}, \vec{x})
$$

- a Content Loss
    - measure of how different the generated image $\vec{x}$ is from Content image  $\vec{p}$
- a Style Loss
    - measure of how different the "style" of generated $\vec{x}$ is from style of Style image $\vec{a}$
    

The key is defining what "style" is and how to measure similarity of style.
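
The two-part loss can be sketched as follows. This is an illustration under stated assumptions, not the implementation used to produce the images above: it follows the idea in the Gatys paper of comparing Gram matrices of feature maps to measure style, and it stands in small random arrays for the feature maps that a real implementation would take from the layers of a convolutional network. The function names are hypothetical.

```python
import numpy as np

def gram_matrix(features):
    # features: array of shape (channels, positions)
    # The Gram matrix records how strongly each pair of channels co-occurs
    return features @ features.T

def content_loss(f_content, f_generated):
    # Squared difference between the feature maps themselves
    return 0.5 * np.sum((f_generated - f_content) ** 2)

def style_loss(f_style, f_generated):
    # Squared difference between Gram matrices (assumed measure of style)
    c, n = f_style.shape
    g_diff = gram_matrix(f_generated) - gram_matrix(f_style)
    return np.sum(g_diff**2) / (4.0 * c**2 * n**2)

def total_loss(f_content, f_style, f_generated):
    # L = L_content + L_style
    return content_loss(f_content, f_generated) + style_loss(f_style, f_generated)

rng = np.random.default_rng(0)
f_p = rng.normal(size=(3, 16))   # stand-in features of the Content image
f_a = rng.normal(size=(3, 16))   # stand-in features of the Style image
f_x = f_p.copy()                 # start the Generated image at the Content image

print(content_loss(f_p, f_x), style_loss(f_a, f_x), total_loss(f_p, f_a, f_x))
```

Starting from the Content image, the Content Loss is zero and the Style Loss is positive; minimizing the total by varying the generated image trades one off against the other.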

In [3]:
print("Done")

Done
