In [1]:
%run Latex_macros.ipynb


In [2]:
from IPython.display import Image

# AutoEncoder (AE): High Level

<div class="alert alert-block alert-warning">
    <b>TL;DR</b> 
    <br>
    <ul>
        <li>The Deep Learning analog of Principal Components (PCA)</li>
        <ul>
            <li>Most of the lessons of AE apply equally to PCA</li>
        </ul>
        <li>Unsupervised: no labels (really semi-supervised)</li>
        <li>Create "synthetic features" from the original set of features</li>
        <li>May be able to use reduced set of synthetic features (dimensionality reduction)</li>
        <li><b>Generative</b> (vs Discriminative)</li>
    </ul>
</div>

<table>
    <tr>
        <th><center>Autoencoder</center></th>
    </tr>
    <tr>
        <td><img src="images/Autoencoder_vanilla.jpg" width=1200></td>
    </tr>
</table>

An Autoencoder is
- A NN consisting of two halves
- An *Encoder* sequence of layers
    - transforms inputs $\x^\ip$ to synthetic features $\z^\ip$
    - *latent representation*
- A *Decoder* sequence of layers
    - "inverts" latent representation $\z^\ip$ to recover $\x^\ip$

The Encoder and Decoder are *jointly trained*, not trained separately !
- No obvious target for training the Encoder
- Semi-supervised
    - No labels
    - But we use $\x^\ip$ as the "label" associated with example $\x^\ip$ in training
    

This should all be reminiscent of the definition of Principal Components.

Just as for PCA, we can perform dimension-reduction
- size of latent representation $| \z | \le n$

When $| \z | \lt n$ we say 
- that the input has been passed through a *bottleneck*
- that the AE is *undercomplete*

The *main difference* from PCA
- PCA uses a *linear* transformation
- NN can use *non-linear* transformations too
    - PCA as a special case of AE
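
The bottleneck idea can be illustrated with a minimal NumPy sketch (not the Keras code used later in the course)
- a *linear* Encoder and Decoder, trained jointly by gradient descent, with $\x$ serving as its own target
- the synthetic data, the sizes, the learning rate and the names `W_enc`, `W_dec` are all illustrative assumptions

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): m examples with n=3 features that lie close
# to a 2-dimensional subspace, so a bottleneck of size 2 should suffice.
m, n, latent_dim = 200, 3, 2
basis = rng.normal(size=(latent_dim, n))
X = rng.normal(size=(m, latent_dim)) @ basis + 0.01 * rng.normal(size=(m, n))

# Linear Encoder and Decoder: no activation functions, as in PCA.
W_enc = 0.1 * rng.normal(size=(n, latent_dim))
W_dec = 0.1 * rng.normal(size=(latent_dim, n))

def reconstruction_mse(X, W_enc, W_dec):
    Z = X @ W_enc            # latent representation z = E(x)
    X_hat = Z @ W_dec        # reconstruction D(z)
    return np.mean((X - X_hat) ** 2)

initial_loss = reconstruction_mse(X, W_enc, W_dec)

learning_rate = 0.01
for _ in range(2000):
    Z = X @ W_enc
    error = Z @ W_dec - X                  # the "label" is X itself
    # Gradients of the squared error w.r.t. both halves (joint training).
    grad_dec = Z.T @ error / m
    grad_enc = X.T @ (error @ W_dec.T) / m
    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc

final_loss = reconstruction_mse(X, W_enc, W_dec)
print(initial_loss, final_loss)
```

Replacing the linear maps with layers that have non-linear activations is what distinguishes a general AE from this PCA-like special case.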
   

# Autoencoders: Uses

## Dimension reduction

After training
- we can discard the Decoder
- use the Encoder output (synthetic features) as reduced dimension inputs to a *new* task
    - Encoder weights are **frozen**: non-learnable when training the new task
    - It may be easier to solve the new task given $\z$ rather than $\x$
        - the Encoder has already discovered "structure" of $\x$
    - *Transfer Learning*
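
A rough sketch of "frozen Encoder + new head", in NumPy
- the Encoder here is a stand-in (a fixed random linear map), not a trained one
- the toy labels and the logistic-regression head (`w_head`, `b_head`) are assumptions made for illustration

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a trained Encoder (assumption): a fixed linear map from
# n=4 original features to 2 synthetic features.
W_enc = rng.normal(size=(4, 2))
W_enc_before = W_enc.copy()

def encode(X):
    return X @ W_enc          # frozen: never updated below

# Toy labeled data for the *new* task (assumption): the label depends on
# the first synthetic feature.
X = rng.normal(size=(300, 4))
y = (encode(X)[:, 0] > 0).astype(float)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# New head: logistic regression on z; only w_head and b_head are learnable.
w_head = np.zeros(2)
b_head = 0.0

Z = encode(X)                 # computed once, since the Encoder is frozen
for _ in range(500):
    p = sigmoid(Z @ w_head + b_head)
    w_head -= 0.1 * Z.T @ (p - y) / len(y)
    b_head -= 0.1 * np.mean(p - y)

accuracy = np.mean((sigmoid(Z @ w_head + b_head) > 0.5) == (y > 0.5))
print(accuracy)
```

Because the Encoder is never updated, the synthetic features can be computed once and reused for every training step of the head.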

<table>
    <tr>
        <th><center>Autoencoder: Encoder + New head</center></th>
    </tr>
    <tr>
        <td><img src="images/Autoencoder_encoder_new_head.jpg" width=1200></td>
    </tr>
</table>

In PCA, we eliminated original features that were "less important"
- i.e., explained variation among only a small fraction of the training set
    - recall how we redenominated explained variance in terms of "number of features"
    
There is no direct similar concept of feature importance in AE
- other than minimizing a Cost function, which *may* wind up focussing on "important" features

## Layer-wise pre-training with Autoencoders

Autoencoders played a vital role in the development of Deep Learning:
- They made it possible to train otherwise untrainable NN's.
- Other innovations supplanted the need for AE's to assist training
    - better initialization
    - better activation functions
    - normalization
 
Although they are no longer needed for that purpose, it is interesting to see how (and why) they were used.


For a NN with $L$ layers that solves a Supervised Learning Problem
- Training attempts to learn the weights of all layers simultaneously
- *Layer wise pre-training* was an attempt
    - to *initialize* the weights of each layer
    - in succession
    - so that the task of simultaneously solving for optimal weights had a better chance of succeeding

The idea was to learn an initialization of $\W_\llp$, the weights of layer $l$.
- After having learned the weights $\W_{(l')}$ for all layers $l' \lt l$.


To initialize $\W_\llp$:
- Train an AE that takes $\x^\ip$ as input
- Using initialized weights $\W_{(l')}$ for all layers $l' \lt l$
- Produces $\tx^\ip$ at layer $l$'s output $\y_\llp$

So weight initializations were learned layer by layer.

Note that the labels $\y^\ip$ *were not used* !
- wouldn't be useful for the shallow NN

It was thought
- to be easier to learn the structure of the input $\x$ independent of the labels
- to be easier to learn $\W_\llp$ incrementally

Once the weights $\W_\llp$ were initialized via AE's
- training of the Supervised Learning task had a better chance of succeeding
- compared to any other initialization

## Autoencoders and Transfer Learning
Today, autoencoders are useful for another purpose: Transfer Learning.

If we can train an AE network to create features that are useful for reconstruction
- it is possible that these features are also useful for solving more complicated tasks.

This was in essence what
- Our dimension reduction example (replace the head) was doing
- Layerwise Pre-training was attempting.

So it is not uncommon to approach a complicated problem
- by first constructing an autoencoder to come up with an alternate (and smaller) representation of the input space.

Note that Autoencoders are *unsupervised*: they don't take labels.  

So the encodings they produce
stress syntactic similarity, rather than semantic similarity.

Their use in Transfer Learning depends on the hope that inputs that are syntactically similar also
have the same labels.

# Denoising

Very much like dimension reduction but with the assumption that
- "less important" features are just random noise that is added to the true example

# Generative Artificial Intelligence

A less obvious use of AE (using the Decoder rather than the Encoder) is to *generate* examples.

Most of the Machine Learning we have studied thus far is *discriminative*
- $\pr{\hat{\y}^\ip | \x^\ip}$
    - e.g., classifier: discriminate among the possible classes $\y^\ip$, given example $\x^\ip$

We can use the Decoder on *arbitrary* $\z$ to *generate* a completely  new $\x$:
- $\pr{ \x^{(i')} | \z^{(i')} }$ for some $i'$ not in training
- *generate* a new example $i'$, in the domain of $\x$, that was not encountered during training

<table>
    <tr>
        <th><center>Generator</center></th>
    </tr>
    <tr>
        <td><img src="images/Autoencoder_decoder.jpg" width=800></td>
    </tr>
</table>

# Autoencoder (AE): Details

The *task* that trains an Autoencoder
- Given input $\x^\ip$
- Output of Encoder: $\z^\ip = E(\x^\ip)$
- Output of Decoder: $\tx^\ip = D(\z^\ip)$
- "Target": $\x^\ip$

Both the Encoder and Decoder are parameterized (learnable parameters)
- Goal: find the parameters such that 
$$
\begin{array}{lll}
 \tx^\ip = D(E(\x^\ip))  & \approx & \x^\ip \\
\end{array}
$$

$\z^\ip = E(\x^\ip)$ is the latent representation of $\x^\ip$.

## Cost/Loss function

There are two obvious candidates for per-example loss

## Mean Squared Error (MSE)
$$
\loss^\ip = \sum_{j=1}^{n} { (\x^\ip_j - \tx^\ip_j)^2 }
$$

## Binary Cross Entropy

For the special case where *each* original feature is in the range $[0,1]$
- e.g., pixel is on/off
- we can treat each original feature as a probability and use Binary Cross Entropy

$$
\loss^\ip = - \sum_{j=1}^{n} {  \left( \x^\ip_j    \log(\tx^\ip_j) + ( 1 - \x^\ip_j ) \log(1 - \tx^\ip_j) \right) }
$$
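
The two per-example losses can be sketched in plain Python
- both sum over the $n$ original features of a single example
- the Binary Cross Entropy carries the usual leading minus sign, so that lower is better
- the clipping constant `eps` and the toy vectors are assumptions, added only to make the example runnable

```python
import math

def squared_error_loss(x, x_hat):
    """Sum of squared differences over the n features of one example."""
    return sum((xj - xhj) ** 2 for xj, xhj in zip(x, x_hat))

def binary_cross_entropy_loss(x, x_hat, eps=1e-12):
    """Binary cross entropy summed over the n features of one example.

    eps (an assumption) keeps log() away from 0.
    """
    total = 0.0
    for xj, xhj in zip(x, x_hat):
        xhj = min(max(xhj, eps), 1.0 - eps)
        total += xj * math.log(xhj) + (1.0 - xj) * math.log(1.0 - xhj)
    return -total

x = [1.0, 0.0, 1.0]          # e.g., three pixels that are on/off
good = [0.9, 0.1, 0.8]       # close reconstruction
bad = [0.2, 0.7, 0.4]        # poor reconstruction

print(squared_error_loss(x, good), squared_error_loss(x, bad))
print(binary_cross_entropy_loss(x, good), binary_cross_entropy_loss(x, bad))
```

Under either loss, the closer reconstruction receives the smaller value.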

# Autoencoder: extensions

There are some extensions of the "vanilla" Autoencoder we have described thus far.

## De-noising Autoencoder

<table>
    <tr>
        <th><center>Denoising Autoencoder</center></th>
    </tr>
    <tr>
        <td><img src="images/Autoencoder_denoising.jpg" width=1200></td>
    </tr>
</table>
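
A denoising Autoencoder is trained on a *corrupted* input but is asked to reproduce the *clean* example.

A minimal sketch of how such training pairs might be built
- additive Gaussian noise and the name `make_denoising_pairs` are assumptions; other corruptions (e.g., masking) are possible

```python
import numpy as np

rng = np.random.default_rng(2)

def make_denoising_pairs(X_clean, noise_std=0.1):
    """Return (noisy inputs, clean targets) for training a denoising AE."""
    noise = rng.normal(scale=noise_std, size=X_clean.shape)
    X_noisy = X_clean + noise      # what the Encoder sees
    return X_noisy, X_clean        # the target stays the clean example

X_clean = rng.uniform(size=(5, 8))   # 5 toy examples with 8 features
X_noisy, targets = make_denoising_pairs(X_clean)
print(np.abs(X_noisy - targets).mean())
```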

# Variational Autoencoder (VAE): Generative ML

Observe that the Decoder part of the "vanilla" AE $D( \z^\ip )$
- has been trained to produce "realistic" $\tx^\ip$ *only* for a $\z^\ip = E(\x^\ip)$
    - i.e., "realistic": appears to come from the distribution of training $\X$
- there is no guarantee that $D( \z^{(i')} )$ for some $i'$ not in training is realistic

That is: the AE has not been trained to *extrapolate* beyond the training inputs.

A VAE is able to generate outputs 
- that *could have* come from the training distribution from a latent representation $\z^{(i')}$ 
- but that *did not* come from $\X$.

Our goal is constructing a **Decoder** that can extrapolate.


<table>
    <tr>
        <th><center>Variational Autoencoder (VAE)</center></th>
    </tr>
    <tr>
        <td><img src="images/Autoencoder_VAE.jpg" width=1200></td>
    </tr>
</table>

The Decoder will take a *latent vector* $\z$ and produce $D(\z)$, just as in a vanilla AE.


The difference is that $\z$ will be sampled from a *distribution* rather than being a 
unique mapping of a training example.

This will be done by modifying the Encoder
- It will *indirectly* create $\z^\ip$
- It will compute *variables* $\mu^\ip$ and $\sigma^\ip$
    - $\z^\ip$ will be *sampled* from a distribution with mean $\mu^\ip$ and standard deviation $\sigma^\ip$
    
As long as $\z$ is sampled from this distribution, the decoder will produce a "realistic" output.

**Note**

$\mu$ and $\sigma$ are computed values (and hence, functions of $\x$) and **not** parameters
- so training learns a *function* from $\x^\ip$ to $\mu^\ip$ and $\sigma^\ip$


To train a VAE:
- pass input $\x^\ip$ through the encoder, producing $\mu^\ip, \sigma^\ip$
    - use $\mu^\ip, \sigma^\ip$ to sample a latent representation $\z^\ip$ from the distribution
- pass the sampled $\z^\ip$ through the decoder, producing $D(\z^\ip)$
- measure the reconstruction error $\x^\ip - D(\z^\ip)$, just as in a vanilla AE
- backpropagate the error, updating the weights of the Encoder and Decoder (and hence the functions that compute $\mu, \sigma$)

Essentially, each input $\x^\ip$ has *many* latent representations (with different probabilities):
any sample from the distribution.
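
The forward pass of these training steps can be sketched in NumPy. This is an illustration under assumptions, not the notebook's Keras implementation
- the weights are random and untrained
- the latent distribution is taken to be a normal
- the names `W_mu`, `W_log_sigma`, `W_dec` are hypothetical

```python
import numpy as np

rng = np.random.default_rng(3)
n, latent_dim = 6, 2

# Untrained stand-in weights (assumption): linear maps, for illustration only.
W_mu = rng.normal(size=(n, latent_dim))
W_log_sigma = 0.1 * rng.normal(size=(n, latent_dim))
W_dec = rng.normal(size=(latent_dim, n))

def encoder(x):
    """Map x to the mean and standard deviation of its latent distribution."""
    mu = x @ W_mu
    sigma = np.exp(x @ W_log_sigma)   # exp keeps sigma positive
    return mu, sigma

def sample_latent(mu, sigma):
    """Draw z from a normal with the given mean and standard deviation."""
    return rng.normal(loc=mu, scale=sigma)

def decoder(z):
    return z @ W_dec

x = rng.normal(size=n)
mu, sigma = encoder(x)
z_first = sample_latent(mu, sigma)     # one latent for x ...
z_second = sample_latent(mu, sigma)    # ... and a different one for the same x
reconstruction_error = x - decoder(z_first)
print(z_first, z_second)
```

The two draws differ even though $\x$ is the same, which is the sense in which one input has many latent representations.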


**Training**

Encoder produces

$$
\begin{array}{lllll}
E(\x) & = &  q_\phi(\z|\x) & \approx & p_\theta(\z|\x) \\
\end{array}
$$

We sample from
$$
\begin{array}{lll}
\hat{\z} & \sim & q_\phi(\z|\x) \\
\end{array}
$$

Decoder produces

$$
\begin{array}{lll}
D(\hat{\z})  & = & p_\theta (\x|\z)\\
\end{array}
$$

Each time (epoch) that we encounter the same training example, we select another random element from the distribution.

So the VAE learns to represent the same example from multiple latents.


**Generative**
- sample $\hat{\z} \sim \hat{p}(\z)$
- use decoder to produce output $p_\theta (\x|\z)$
 
This means we can feed in a $\z$ 
- that doesn't correspond to any training example
- and perhaps get an output that *resembles* something from the training set, rather than noise.

# Conditional VAE

Once a VAE is trained, $D(\z)$ should produce a realistic output, for any $\z$ from the distribution.

However, if the distribution of $\X$ includes examples from many classes 
- Assuming we have labels as auxiliary information (not used in training)
    - e.g., the 10 digits
- The VAE can't control *which class* the output will come from

A *Conditional VAE* allows our generator (Decoder) to control the class $c$ of the output $\tx$.

<table>
    <tr>
        <th><center>Conditional VAE (CVAE)</center></th>
    </tr>
    <tr>
        <td><img src="images/Autoencoder_CVAE.jpg" width=800></td>
    </tr>
</table>

The class label $c$
- is given as part of *training*
    - So the Encoder produces a distribution that is conditioned on *both* $\x$ and $c$.
- is an *additional input* to the Decoder
    - So the output class can be controlled
$$
\tx^\ip = D(\z^\ip, c)
$$



So now we 
- create a latent $\z$
- append a class label $c$
- and presumably have the decoder produce an output from the desired class.

- The encoding distribution is now conditional on class label $c$: $q_\phi(\z|\x,c)$ 
- So is the decoding distribution $p_\theta(\x|\z,c)$ 
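
One simple way to "append a class label" can be sketched as follows
- the one-hot encoding of $c$ and the helper names `one_hot`, `decoder_input` are assumptions; the text above only says that $c$ is appended

```python
import numpy as np

def one_hot(c, num_classes):
    """Vector of zeros with a 1 in position c."""
    v = np.zeros(num_classes)
    v[c] = 1.0
    return v

def decoder_input(z, c, num_classes=10):
    """Append a one-hot class label c to the latent z."""
    return np.concatenate([z, one_hot(c, num_classes)])

z = np.array([0.3, -1.2])             # a latent of size 2 (illustrative)
d_in = decoder_input(z, c=7)          # request an output of class "7"
print(d_in)
```

The Decoder then receives a vector of length $|\z|$ plus the number of classes, so the same $\z$ can be decoded under different requested classes.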

Again, by restricting the functional form of the prior distribution $\hat{p}$ we can simplify the math.

# Detour: Autoencoder notebook on Colab

Let's examine some Keras code that implements several types of Autoencoders
- Vanilla
- Denoising
- VAE

We will write our AE's using the Keras *Functional* API rather than the *Sequential* model
- We *could* write the complete AE using the Sequential API
- **But**
    - we want to extract the Encoder and Decoder parts as *separate models*
    - we can do this with the Functional API

We will now switch to a notebook running on Google Colab
 [Autoencoder example from github](https://colab.research.google.com/github/kenperry-public/ML_Fall_2019/blob/master/Autoencoder_example.ipynb)

# VAE: Probabilistic formulation

**Note**: Advanced material

The mathematical derivation of a VAE is quite detailed
- it is interesting but not absolutely necessary to understand

The interested reader is referred to a highly recommended [VAE tutorial](https://arxiv.org/pdf/1606.05908.pdf).

We will try to give the essence in the following slides.

<div class="alert alert-block alert-warning">
    <b>TL;DR</b> 
    <br>
    <ul>
        <li>The VAE has a very interesting <b>two part</b> Loss Function</li>
        <ul>
            <li>Reconstruction Loss, as in the Vanilla AE</li>
            <li>Divergence Loss</li>
        </ul>
        <li>The Reconstruction Loss is not sufficient</li>
        <ul>
            <li>Issues of intractability arise</li>
            <li>The Divergence Loss skirts intractability</li>
            <ul>
                <li>By constraining the Encoder to produce a tractable distribution</li>
            </ul>
        </ul>
    </ul>
</div>

From the description of the VAE, observe that we are now dealing with distributions rather
than deterministic values for
- the encoding (latent representation) $\z$
- the output

So we will need to describe these distributions

We begin with a distribution $p(\z)$ of the latent variables, and a joint probability distribution $p(\x, \z)$ of examples and latents.

We will approximate $p$, as usual, with a NN that we will parameterize with $\theta$.
So henceforth $p$ will be subscripted with $\theta$.

From this joint distribution we can obtain
- $p_\theta(\x|\z) = \frac{p_\theta(\x,\z)}{p_\theta(\z)}$
    - the conditional distribution of $\x$ given $\z$
    - this represents the output distribution of the decoder

- $p_\theta(\z|\x) = \frac{p_\theta(\x|\z) p_\theta(\z)}{p_\theta(\x)}$ (by Bayes rule)
    - this represents the distribution for the encoder

- $p_\theta(\x) = \int_\z p_\theta(\x|\z) p_\theta(\z) \, d\z$
    - the unconditional distribution of $\x$, the input space, by marginalizing $\z$.

We motivate these distributions as they relate to the VAE:
- the encoder produces $p_\theta(\z|\x)$
- the decoder produces $p_\theta(\x|\z)$


## The loss function: first attempt

Let's try to create a loss function, given that we are dealing with probabilities.

Our first attempt at reconstruction loss is $\loss_R$:

$$
\begin{array}{lll}
\loss_R(\theta, \x ) & = & - \E_{\z \sim p_\theta(\z|\x)}\left( \log( p_\theta(\x | \z) ) \right) \\
\end{array}
$$

That is, we want to maximize the probability of the decoder producing $\x$ when the VAE input is $\x$:
$$
\E_{\z \sim p_\theta(\z|\x)}\left( \log( p_\theta(\x | \z) ) \right)
$$

Note the sampling of encoder output $\z \sim p_\theta(\z|\x)$ given input $\x$
and the probability of the decoder producing the same $\x$: $p_\theta(\x | \z)$.
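
To connect this to the reconstruction error of the vanilla AE, consider one particular modeling choice (an assumption made here for illustration, not something fixed by the VAE): the decoder's output distribution is a normal with mean $D(\z)$ and unit variance in each of the $n$ features. Then

$$
- \log( p_\theta(\x | \z) ) = \frac{1}{2} \sum_{j=1}^{n} { \left( \x_j - D(\z)_j \right)^2 } + \frac{n}{2} \log( 2 \pi )
$$

so, up to a scale factor and an additive constant, maximizing the log probability is the same as minimizing the squared reconstruction error.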

## Intractability

It turns out that things are not so simple: 
- Some of the distributions we need to deal with may not be *tractable*
    - They have no closed form, just empirical distributions
    - Higher dimensional distributions may pose computational issues


Can you spot the problem ?

Recall that
$$p_\theta(\x) = \int_\z p_\theta(\x|\z) p_\theta(\z) \, d\z$$ 



But this integral is problematic.

$\z$ is multi-dimensional and to calculate the integral with respect to $\z$ we have to
integrate over the full range of each dimension.

As the dimension of $\z$ becomes large, it is no longer computationally tractable to numerically
evaluate the integral.

For the same reason $p_\theta(\z|\x)$ is problematic since $p_\theta(\x)$ appears in the denominator.
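
To get a feel for the scale of the problem, here is a back-of-the-envelope sketch
- it assumes a simple grid-based numerical integration with (arbitrarily) 100 points per dimension of $\z$
- the helper name `grid_evaluations` is hypothetical

```python
def grid_evaluations(points_per_dim, latent_dim):
    """Number of integrand evaluations for a full grid over z."""
    return points_per_dim ** latent_dim

# Cost of one evaluation of the integral, for several latent sizes.
counts = {d: grid_evaluations(100, d) for d in (1, 2, 10, 32)}
for d, count in counts.items():
    print(d, count)
```

The count grows exponentially in the dimension of $\z$, and this cost would be paid for every example at every training step.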

### Avoiding intractability

The solution is to change the objective of the encoder
- from producing the intractable $p_\theta(\z|\x)$ 
- to producing an *approximation* $q_\phi(\z|\x)$
- that is both tractable and "close" (in distribution) to the intractable $p_\theta(\z|\x)$.

As usual, we use the KL divergence as a measure of similarity of two distributions:

$$
\KL( q_\phi(\z|\x) \; ||\; p_\theta(\z | \x))
$$

$q_\phi(\z|\x)$ will be implemented via a NN parameterized by $\phi$.

It turns out that if we re-write the divergence between 
$q_\phi(\z|\x)$ and $p_\theta(\z | \x)$
- We obtain a term $\loss$ that has a very intuitive interpretation
- and that will serve as our modified Loss function.

$$
\begin{array}{lll}
\loss & = & \loss_R + \loss_D \\
\loss_R(\phi, \theta, \x ) & = & - \E_{\z \sim q_\phi(\z|\x)}\left( \log( p_\theta(\x | \z) ) \right) \\
\loss_D(\phi, \x) & = & \KL \left(  q_\phi(\z|\x) \; || \; p_\theta(\z) \right) \\
\end{array}
$$

where $\KL(f \; || \; g)$ denotes the KL divergence between distributions $f$ and $g$.

That is, our new loss function $\loss$ has two components
- $\loss_R$
    - the reconstruction loss, as before, but using the encoder $q_\phi(\z|\x)$ instead of $p_\theta(\z|\x)$

- $\loss_D$
    - the "KL divergence" loss which constrains the approximate $q_\phi(\z|\x)$ to stay close to the prior $p_\theta(\z)$

### Advanced: Obtain $\loss$ by rewriting $\KL( q_\phi(\z|\x) \; ||\; p_\theta(\z | \x))$

You might be puzzled that 
$$
\loss_D = \KL \left(  q_\phi(\z|\x) \; || \; p_\theta(\z) \right)
$$
rather than
$$
\loss_D = \KL \left(  q_\phi(\z|\x) \; || \; p_\theta(\z | \x) \right)
$$

Let's examine the discrepancy between the approximation $q_\phi(\z|\x)$ and $p_\theta(\z | \x)$

$$
\begin{array}{llll}
\KL( q_\phi(\z|\x) \; ||\; p_\theta(\z | \x)) & = & \sum_\z { q_\phi(\z|\x) \left( \log(q_\phi(\z|\x)) - \log(p_\theta(\z | \x)) \right) } & \text{def. of KL} \\
& = & \E_\z \left( \log(q_\phi(\z|\x)) - \log(p_\theta(\z | \x)) \right) & \text{def. of } \E, \text{ with } \z \sim q_\phi(\z|\x) \\
& = & \E_\z \left( \log(q_\phi(\z|\x)) - \left( \log( p_\theta(\x | \z)) + \log(p_\theta(\z)) - \log(p_\theta(\x)) \right) \right) & \text{Bayes theorem on } \log(p_\theta(\z | \x)) \\
\KL( q_\phi(\z|\x) \; ||\; p_\theta(\z | \x)) - \log(p_\theta(\x)) & = & \E_\z \left( \log(q_\phi(\z|\x)) - \left( \log( p_\theta(\x | \z) ) + \log( p_\theta(\z) ) \right) \right) & \text{move } \log(p_\theta(\x)) \text{ to LHS} \\
& = & \E_\z \left( \left( \log(q_\phi(\z|\x)) - \log( p_\theta(\z) ) \right) - \log( p_\theta(\x | \z) ) \right) & \text{regroup} \\
& = & - \E_\z \left( \log( p_\theta(\x | \z) ) \right) + \KL(q_\phi(\z|\x) \; ||\; p_\theta(\z) ) & \text{def. of KL} \\
\end{array}
$$

Observe that the LHS would seem to be a reasonable loss function
- maximizing the log likelihood of the training set $p_\theta(\x)$
- keeping the approximation $q_\phi(\z|\x)$ close to $p_\theta(\z | \x)$

So we maximize the "fit" to the training set $\X$ (maximizing likelihood) while keeping
the approximation error small.


The LHS cannot be optimized via SGD (recall the tractability issue with $p_\theta(\x)$ and  $p_\theta(\z|\x)$).

**But the RHS can be made tractable** given a proper choice of $p_\theta(\z)$.

So the RHS is a tractable form that is equivalent to the LHS and will serve as the loss function
for the VAE.

So it may be fair to say that the idea for the VAE is obtained from the Loss function,
rather than vice-versa.
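
The equality of the LHS and RHS can be checked numerically on a toy example
- a sketch, assuming a *discrete* latent $\z$ with only 3 values so that every sum is tractable
- all the probabilities below are made-up numbers

```python
import math

# Toy discrete model (assumption): z takes 3 values, x is one fixed observation.
p_z = [0.5, 0.3, 0.2]             # prior p(z)
p_x_given_z = [0.10, 0.40, 0.70]  # likelihood p(x|z) of the observed x
q = [0.2, 0.3, 0.5]               # approximation q(z|x)

# Marginal p(x) and true posterior p(z|x) by Bayes rule.
p_x = sum(pz * px for pz, px in zip(p_z, p_x_given_z))
p_z_given_x = [pz * px / p_x for pz, px in zip(p_z, p_x_given_z)]

def kl(f, g):
    """KL divergence between two discrete distributions."""
    return sum(fi * math.log(fi / gi) for fi, gi in zip(f, g))

# LHS: KL(q || p(z|x)) - log p(x)
lhs = kl(q, p_z_given_x) - math.log(p_x)

# RHS: reconstruction term + KL(q || p(z))
reconstruction = -sum(qi * math.log(px) for qi, px in zip(q, p_x_given_z))
rhs = reconstruction + kl(q, p_z)

print(lhs, rhs)
```

The RHS never touches $p_\theta(\x)$ or $p_\theta(\z|\x)$; it only needs the prior, the decoder likelihood and $q$.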


### Choosing $p_\theta(\z)$

So what distribution should we use for the prior $p_\theta(\z)$ ?

One important consideration is that, since we learn by SGD, we need to be able to differentiate.

Another consideration is the functional form of the distribution: a closed form (e.g., a normal) may simplify the math, whereas an empirical distribution has no closed functional form.

To force the tractability of $q_\phi(\z|\x)$ we
will define the *prior distribution* $p_\theta(\z)$ to have a tractable, closed form (often a normal).

# Loss function: discussion


The reconstruction loss should be familiar: it tries to force the decoded output to be "close" to
the input.

What would happen if we omitted the KL divergence constraint $\loss_D$ from $\loss$ ?

Without it, the model could theoretically learn encodings $q_\phi(\z|\x)$ whose
distribution had near zero variance.

This would collapse the VAE into the vanilla AE.

So by choosing $p_\theta(\z)$ with a non-zero variance, we force the encoder to be probabilistic.
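
A small numerical illustration of this pressure
- assumptions: a one-dimensional latent, an encoder distribution that is a normal with mean $\mu$ and standard deviation $\sigma$, and a standard normal prior
- under these assumptions the divergence has the closed form $\frac{1}{2} ( \mu^2 + \sigma^2 - 1 - \log(\sigma^2) )$

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ) for a one-dimensional latent."""
    return 0.5 * (mu ** 2 + sigma ** 2 - 1.0 - math.log(sigma ** 2))

# Divergence penalty as the encoder's standard deviation shrinks toward 0.
penalties = {sigma: kl_to_standard_normal(0.0, sigma)
             for sigma in (1.0, 0.1, 0.01, 0.001)}
for sigma, penalty in penalties.items():
    print(sigma, penalty)
```

The penalty is zero when the encoder matches the prior and grows without bound as $\sigma$ approaches $0$, which is what discourages the collapse to a deterministic encoding.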


## Variational inference


The cost/loss needs to be simplified quite a bit.

This is beyond the scope of this talk but we refer the reader to
[VAE tutorial](https://arxiv.org/pdf/1606.05908.pdf) (Also recommended by Geron in footnote 7, Chapt 15).

To summarize
- we still have an intractable term (it appears as another $\KL$ divergence after re-writing the cost/loss)
    - this term appears as an additive term
    - by definition of $\KL$, it is non-negative
- so we can't evaluate the full expression
    - but, ignoring the intractable non-negative part, the remainder is a *lower bound* on the log likelihood $\log(p_\theta(\x))$
        - so we maximize the lower bound (equivalently, minimize its negative, the loss $\loss$)
        - called the *ELBO* (Evidence Lower BOund)
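
Written out, using the identity obtained by rewriting $\KL( q_\phi(\z|\x) \; ||\; p_\theta(\z | \x))$:

$$
\log(p_\theta(\x)) = \underbrace{ \E_{\z \sim q_\phi(\z|\x)} \left( \log( p_\theta(\x | \z) ) \right) - \KL( q_\phi(\z|\x) \; ||\; p_\theta(\z) ) }_{\text{ELBO} \; = \; - \loss} + \underbrace{ \KL( q_\phi(\z|\x) \; ||\; p_\theta(\z | \x) ) }_{\ge 0}
$$

Since the last term is non-negative, $\log(p_\theta(\x)) \ge \text{ELBO}$: minimizing $\loss$ pushes up a lower bound on the log likelihood.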
        

# Re-parameterization trick

There is still one more problem for training:
- sampling $\hat{\z}  \sim  q_\phi(\z|\x)$

This is not a problem in the forward pass.

Optimization via backpropagation requires the ability to take derivatives of the loss wrt the trainable parameters.

How do we take the derivative of a node involving a random choice ?

The trick is to re-express $\z$:
$$
\begin{array}{lll}
\z  & = & \mu + \sigma \epsilon \\
\epsilon & \sim & \hat{p}(\z) \\
\end{array}
$$

That is, we obtain $\z$ 
- By sampling an $\epsilon$ from the constraining distribution $\hat{p}(\z)$
- Scaling the random $\epsilon$ by $\sigma$ and adding $\mu$ (both $\mu$ and $\sigma$ are variables: functions of $\x$)

We still can't take derivatives of $\loss_R$ with respect to $\epsilon$, but we don't need to !

We only need to take derivatives of $\loss_R$ with respect to $\phi, \theta, \mu, \sigma$, which we can do.

In evaluating derivatives, the $\epsilon$ that appears in the result (e.g., derivative wrt $\sigma$) can be treated as a constant.

- For a particular example, the $\epsilon$ drawn in the forward pass is remembered and reused in the backward pass
  - over a batch, the average of the drawn $\epsilon$ should be close to the mean of $\hat{p}(\z)$ (which is $0$ if we constrain $\hat{p}$ to be $0$ centered)
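
A minimal sketch of why the re-expressed sample is differentiable
- assumptions: a one-dimensional latent and a standard normal for the distribution of $\epsilon$
- the function name `sample_z` and the particular values of `mu`, `sigma` are illustrative

```python
import random

def sample_z(mu, sigma, eps):
    """Reparameterized latent: deterministic in mu and sigma, given eps."""
    return mu + sigma * eps

random.seed(0)
eps = random.gauss(0.0, 1.0)     # drawn once, in the forward pass
mu, sigma = 0.5, 2.0

# Analytic derivatives, treating eps as a constant.
dz_dmu = 1.0
dz_dsigma = eps

# Finite-difference check, reusing the *same* eps.
h = 1e-6
fd_mu = (sample_z(mu + h, sigma, eps) - sample_z(mu - h, sigma, eps)) / (2 * h)
fd_sigma = (sample_z(mu, sigma + h, eps) - sample_z(mu, sigma - h, eps)) / (2 * h)

print(dz_dmu, fd_mu)
print(dz_dsigma, fd_sigma)
```

All the randomness is isolated in `eps`; once it is drawn, the sample is an ordinary differentiable function of `mu` and `sigma`.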
  
$$
\begin{array}{lll}
\loss_D(\phi, \mu, \sigma, \x) & = & \KL \left(  q_\phi(\z|\x) \; || \; \hat{p}_{\mu, \sigma}(\z) \right) \\
\end{array}
$$

<table>
    <tr>
        <th><center>Reparameterization trick</center></th>
    </tr>
    <tr>
        <td><img src="images/Reparameterization_trick.jpg" width=800></td>
    </tr>
</table>

In [3]:
print("Done")

Done
