In [1]:
%run Latex_macros.ipynb


In [4]:
from IPython.display import Image

# AutoEncoders



We revisit Unsupervised Learning, this time in a Deep Learning setting, via Autoencoders
- understand representations
- unsupervised pretraining
    - manifold hypothesis: for a supervised problem, two near-by inputs have the same output
    - an AE learns groups of inputs that are near-by (in input space)
    


# Introduction

- Dimension reduction
    - use one-layer FC network w/o activation to replicate PCA
    - can use *any* NN, single or multi-layer, with activations
        - so more general
        

# Autoencoder (AE)

An AE takes an input of volume $V_i$, passes it through an "encoder" NN $E$ whose final layer has volume $V_e$,
and then passes the encoder's output into a "decoder" NN $D$ which produces an output of volume $V_d = V_i$.
The goal is for the output of $D$ to be as close as possible to the input to $E$.

$$
\begin{array}{lll}
 D(E(\x))  & \approx & \x \\
 \\
\mathbf{z} & = & E(\x) \\
D(\mathbf{z}) & \approx & \x
\end{array}
$$

If the volume of $\z$ is strictly smaller than $V_i$, then the AE is *undercomplete* and at best
$$ D(E(\x)) \approx \x$$
Essentially, $\z$ is a reduced dimension representation of $\x$.
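
As a minimal sketch of an undercomplete AE, assume a purely linear encoder and decoder that share one hypothetical unit direction `U` (the names `U`, `encode`, `decode` are illustrative, not from the original notebook). The 2-D input is squeezed into a 1-D latent, so only inputs that lie along `U` can be reconstructed without error.

```python
import math

# Hypothetical unit direction defining a 1-D latent space inside 2-D input space.
U = (3 / 5, 4 / 5)  # has length 1

def encode(x):
    """E: project the 2-D input onto U, giving a 1-D latent z."""
    return x[0] * U[0] + x[1] * U[1]

def decode(z):
    """D: map the 1-D latent back into the 2-D input space."""
    return (z * U[0], z * U[1])

def reconstruction_error(x):
    """Euclidean distance between x and D(E(x))."""
    return math.dist(x, decode(encode(x)))

on_line = (3.0, 4.0)    # lies along U: reconstructed (almost) exactly
off_line = (4.0, -3.0)  # orthogonal to U: all information is lost

print(reconstruction_error(on_line))
print(reconstruction_error(off_line))
```

The first error is numerically zero and the second is about the full length of the input, which illustrates why the reconstruction is only approximate once the latent is smaller than the input.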

<table>
    <tr>
        <th><center>Autoencoder</center></th>
    </tr>
    <tr>
        <td><img src="images/Autoencoder_vanilla.jpg" width=1200></td>
    </tr>
</table>

## Reconstruction Loss
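Minimize the distance between $\x$ and $D(E(\x))$, measured by, for example,
- root mean squared (RMS) error
- cross entropy (for inputs that can be read as probabilities, e.g., pixel intensities in $[0,1]$)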
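
The two distances can be sketched in plain Python. This is an illustration under assumptions: the function names are hypothetical, mean squared error is used in place of RMS (it has the same minimizer), and the cross entropy is the per-element binary form, which assumes every input value lies in $[0,1]$.

```python
import math

def mse(x, x_hat):
    """Mean squared error between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def binary_cross_entropy(x, x_hat, eps=1e-12):
    """Mean per-element binary cross entropy; x and x_hat must lie in [0, 1]."""
    total = 0.0
    for a, b in zip(x, x_hat):
        b = min(max(b, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(a * math.log(b) + (1.0 - a) * math.log(1.0 - b))
    return total / len(x)

x = [0.0, 1.0, 1.0, 0.0]     # e.g., four binary pixels
good = [0.1, 0.9, 0.8, 0.2]  # a close reconstruction
bad = [0.9, 0.1, 0.2, 0.8]   # a poor reconstruction

print(mse(x, good), mse(x, bad))
print(binary_cross_entropy(x, good), binary_cross_entropy(x, bad))
```

Both losses are smaller for the close reconstruction than for the poor one, which is all that the training objective requires of them.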



## De-noising Autoencoder

<table>
    <tr>
        <th><center>Denoising Autoencoder</center></th>
    </tr>
    <tr>
        <td><img src="images/Autoencoder_denoising.jpg" width=1200></td>
    </tr>
</table>

## Playing games with the Encoder representation

What would happen if we fed the Decoder some $\z$ that did not correspond to any example in the training set?

- If we were lucky (and the Manifold Hypothesis were true) we would get something that looked similar to a training example
  - that is, if the Encoder encoded syntactic "concepts"
  - if it did encode "concepts", can we view the latent of each input as a weighted combination of concept latents?
    - we could then manipulate any latent $\z$ by adding another latent built as a long/short portfolio of latents: long the latents that exhibit the concept, short those that do not
      - "top"
- There is no guarantee of being lucky !  It was not a training objective.  It is quite possible that we will output something that does not resemble a training example.


# Variational Autoencoder (VAE): Generative ML

Our objective in creating a VAE is to be able to generate outputs that could have come from the training distribution.

Thus, our end goal is constructing a **decoder** that can generate (synthesize) not only outputs that match the training data, but also completely new outputs that are "close" to the training set.

The encoder (which can be used for dimension reduction and transfer learning) is not the primary objective when building a VAE.

In a vanilla AE, the encoder produces a deterministic latent representation $\z^\ip$ given
example $\x^\ip$.

The decoder is only required to produce a valid output given a latent representation $\z^\ip$
that corresponds to a concrete training input $\x^\ip$.

Our goal is to be able to produce a "realistic" output given any latent $\z$, even one that doesn't correspond
to any training $\x^\ip$.

The decoder will take a *latent vector* $\z$ and produce $D(\z)$, just as in a vanilla AE.


The difference is that $\z$ will be sampled from a *distribution* rather than being a concrete member of the training set.

The encoder will take an input $\x^\ip$ and compute two
values: $\mu^\ip$ and $\sigma^\ip$, which will serve as the mean and standard deviation of
the distribution of $\z^\ip$, the encoding of $\x^\ip$.

As long as $\z$ is sampled from this distribution, the decoder will produce a "realistic" output.

**Note**

$\mu$ and $\sigma$ are computed values (and hence, functions of $\x$) and **not** parameters
that are estimated (and become fixed after training).  So $\mu$ and $\sigma$ depend on the input --
they are not constants.

**CAN WE REFER to a cell in the Notebook ?  Chollet has $\mu$ and $\sigma$ as nodes**

To train a VAE:
- pass input $\x^\ip$ through the encoder, producing $\mu^\ip, \sigma^\ip$
    - use $\mu^\ip, \sigma^\ip$ to sample a latent representation $\z^\ip$ from the distribution
- pass the sampled $\z^\ip$ through the decoder, producing $D(\z^\ip)$
- measure the reconstruction error between $\x^\ip$ and $D(\z^\ip)$, just as in a vanilla AE
- backpropagate the error, updating all weights, including those that compute $\mu^\ip, \sigma^\ip$ from $\x^\ip$

Essentially, each input $\x^\ip$ has *many* latent representations (with different probabilities):
any sample from the distribution.
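
The forward half of these training steps can be sketched with scalars. Everything here is an assumption made for illustration: the weights are hypothetical and untrained, the latent distribution is taken to be normal, and the backward pass is omitted because it needs an automatic differentiation library.

```python
import math
import random

# Hypothetical, untrained scalar weights; a real VAE would learn these.
ENC_MU_W = 0.5
ENC_LOGSIG_W, ENC_LOGSIG_B = 0.1, -1.0
DEC_W = 2.0

def encoder(x):
    """Compute mu and sigma as functions of the input x."""
    mu = ENC_MU_W * x
    sigma = math.exp(ENC_LOGSIG_W * x + ENC_LOGSIG_B)  # exp keeps sigma > 0
    return mu, sigma

def decoder(z):
    """Map a latent z back to input space."""
    return DEC_W * z

def forward(x, rng):
    """Encode, sample a latent, decode, and measure squared reconstruction error."""
    mu, sigma = encoder(x)
    z = rng.gauss(mu, sigma)  # a different z on every call
    x_hat = decoder(z)
    return z, x_hat, (x - x_hat) ** 2

rng = random.Random(0)
x = 1.5
samples = [forward(x, rng) for _ in range(3)]
for z, x_hat, err in samples:
    print(f"z={z:.3f}  x_hat={x_hat:.3f}  squared_error={err:.3f}")
```

The same `x` yields a different latent on each pass, which is the sense in which one input has many latent representations.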

Observe that, once trained, $D(\z)$ should produce a realistic output, for any $\z$ from the distribution.
We can't control *which class* the output will come from though.

A later innovation, called a  CVAE, will allow us to specify the output class too.

<table>
    <tr>
        <th><center>Variational Autoencoder (VAE)</center></th>
    </tr>
    <tr>
        <td><img src="images/Autoencoder_VAE.jpg" width=1200></td>
    </tr>
</table>

## Probabilistic formulation

From the description of the VAE, observe that we are now dealing with distributions rather
than deterministic values for
- the encoding (latent representation) $\z$
- the output

So we will need to describe these distributions

We begin with a distribution $p(\z)$ of the latent variables, and a joint probability distribution $p(\x, \z)$ of examples and latents.

We will approximate $p$, as usual, with a NN that we will parameterize with $\theta$.
So henceforth $p$ will be subscripted with $\theta$.

From this joint distribution we can obtain
- $p_\theta(\x|\z) = \frac{p_\theta(\x,\z)}{p_\theta(\z)}$
    - the conditional distribution of $\x$ given $\z$
    - this represents the output distribution of the decoder

- $p_\theta(\z|\x) = \frac{p_\theta(\x|\z) p(\z)}{p_\theta(\x)}$ (by Bayes rule)
    - this represents the distribution for the encoder

- $p_\theta(\x) = \int_\z p_\theta(\x|\z) p(\z) \, d\z$
    - the unconditional distribution of $\x$, the input space, by marginalizing $\z$.

We motivate these distributions as they relate to the VAE:
- the encoder produces $p_\theta(\z|\x)$
- the decoder produces $p_\theta(\x|\z)$


## The loss function: first attempt

Let's try to create a loss function, given that we are dealing with probabilities.

Our first attempt at reconstruction loss is $\loss_R$:

$$
\begin{array}{lll}
\loss_R(\theta, \x ) & = & - \E_{\z \sim p_\theta(\z|\x)}\left( \log( p_\theta(\x | \z) ) \right) \\
\end{array}
$$

That is, we want to maximize the expected log probability of the decoder producing $\x$ when the VAE input is $\x$:
$$
\E_{\z \sim p_\theta(\z|\x)}\left( \log( p_\theta(\x | \z) ) \right)
$$

Note the sampling of encoder output $\z \sim p_\theta(\z|\x)$ given input $\x$
and the probability of the decoder producing the same $\x$: $p_\theta(\x | \z)$.

## Intractability

It turns out that things are not so simple: Some of the distributions we need to deal with 
may not be *tractable*

- they have no closed form, just empirical distributions
- higher dimensional distributions may  pose computational issues


Can you spot the problem ?

Recall that
$$p_\theta(\x) = \int_\z p_\theta(\x|\z) p(\z) \, d\z$$



But this integral is problematic.

$\z$ is multi-dimensional and to calculate the integral with respect to $\z$ we have to
integrate over the full range of each dimension.

As the dimension of $\z$ becomes large, it is no longer computationally tractable to numerically
evaluate the integral.

For the same reason $p_\theta(\z|\x)$ is problematic since $p_\theta(\x)$ appears in the denominator.
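
To make the cost concrete, here is a sketch that evaluates the integral on a grid for a 1-D latent. The model is an assumption chosen because its answer is known: a standard normal prior and a decoder that adds normal noise with hypothetical standard deviation `DECODER_STD`, so the exact marginal is normal with variance $1 + \text{DECODER\_STD}^2$.

```python
import math

def normal_pdf(v, mean, std):
    """Density of a 1-D normal distribution."""
    return math.exp(-0.5 * ((v - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

DECODER_STD = 0.5  # hypothetical decoder noise level

def marginal_by_grid(x, n_points=2001, lo=-10.0, hi=10.0):
    """Approximate p(x) = integral of p(x|z) p(z) dz by a Riemann sum over z."""
    step = (hi - lo) / (n_points - 1)
    total = 0.0
    for i in range(n_points):
        z = lo + i * step
        total += normal_pdf(x, z, DECODER_STD) * normal_pdf(z, 0.0, 1.0) * step
    return total

def grid_evaluations(n_points, latent_dim):
    """Number of integrand evaluations for a full grid over latent_dim dimensions."""
    return n_points ** latent_dim

x = 0.7
approx = marginal_by_grid(x)
exact = normal_pdf(x, 0.0, math.sqrt(1.0 + DECODER_STD ** 2))
print(approx, exact)
print(grid_evaluations(2001, 1), grid_evaluations(2001, 32))
```

In one dimension the grid matches the closed form closely, but the number of evaluations is `n_points ** latent_dim`, which is already far out of reach for a 32-dimensional latent.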

### Avoiding intractability

The solution is to change the objective of the encoder
from producing the intractable $p_\theta(\z|\x)$ to producing an *approximation* $q_\phi(\z|\x)$
that is both tractable and "close" (in distribution) to the intractable $p_\theta(\z|\x)$.

As usual, we use the KL divergence as a measure of similarity of two distributions:

$$
\KL( q_\phi(\z|\x) \; ||\; p_\theta(\z | \x))
$$

$q_\phi(\z|\x)$ will be implemented via a NN parameterized by $\phi$.

It turns out that if we re-write the divergence between 
$q_\phi(\z|\x)$ and $p_\theta(\z | \x)$
we obtain a term $\loss$ that has a very intuitive interpretation and 
will serve as our modified Loss function.

$$
\begin{array}{lll}
\loss & = & \loss_R + \loss_D \\
\loss_R(\phi, \theta, \x ) & = & - \E_{\z \sim q_\phi(\z|\x)}\left( \log( p_\theta(\x | \z) ) \right) \\
\loss_D(\phi, \x) & = & \KL \left(  q_\phi(\z|\x) \; || \; p_\theta(\z) \right) \\
\end{array}
$$

where $\KL(f \; || \; g)$ denotes the KL divergence between distributions $f$ and $g$.

That is, our  new loss $\loss$ function has two components
- $\loss_R$
    - the reconstruction loss, as before, but using the encoder $q_\phi(\z|\x)$ instead of $p_\theta(\z|\x)$

- $\loss_D$
    - the "KL divergence" loss, which constrains the approximate $q_\phi(\z|\x)$ to stay close to the prior $p_\theta(\z)$


### Advanced: Obtain $\loss$ by rewriting $\KL( q_\phi(\z|\x) \; ||\; p_\theta(\z | \x))$

Kumar lecture 22, @9:00 - @11:00

You might be puzzled that 
$$
\loss_D = \KL \left(  q_\phi(\z|\x) \; || \; p_\theta(\z) \right)
$$
rather than
$$
\loss_D = \KL \left(  q_\phi(\z|\x) \; || \; p_\theta(\z | \x) \right)
$$

Let's examine the discrepancy between the approximation $q_\phi(\z|\x)$ and $p_\theta(\z | \x)$

$$
\begin{array}{llll}
\KL( q_\phi(\z|\x) \; ||\; p_\theta(\z | \x)) & = & \sum_\z q_\phi(\z|\x) \left( \log(q_\phi(\z|\x)) - \log(p_\theta(\z | \x)) \right) & \text{def. of KL} \\
& = & \E_\z \left( \log(q_\phi(\z|\x)) - \log(p_\theta(\z | \x)) \right) & \text{def. of } \E \text{, with } \z \sim q_\phi(\z|\x) \\
& = & \E_\z \left( \log(q_\phi(\z|\x)) - \left( \log( p_\theta(\x | \z)) + \log(p_\theta(\z)) - \log(p_\theta(\x)) \right) \right) & \text{Bayes theorem on } \log(p_\theta(\z | \x)) \\
\KL( q_\phi(\z|\x) \; ||\; p_\theta(\z | \x)) - \log(p_\theta(\x)) & = & \E_\z \left( \log(q_\phi(\z|\x))  - \left( \log( p_\theta(\x | \z) ) + \log( p_\theta(\z) ) \right) \right) & \text{move } \log(p_\theta(\x)) \text{ to LHS} \\
 & = & \E_\z \left( \left( \log(q_\phi(\z|\x))  - \log( p_\theta(\z) ) \right)  - \log( p_\theta(\x | \z) ) \right) & \\
 & = & - \E_\z \left( \log( p_\theta(\x | \z) ) \right) + \KL(q_\phi(\z|\x) \; ||\;  p_\theta(\z) ) & \text{def. of KL} \\
\end{array}
$$

Observe that the LHS would seem to be a reasonable loss function
- maximizing the log likelihood of the training set $p_\theta(\x)$
- keeping the approximation $q_\phi(\z|\x)$ close to $p_\theta(\z | \x)$

So we maximize the "fit" to the training set $\X$ (maximizing likelihood) while keeping
the approximation error small.


The LHS cannot be optimized via SGD (recall the tractability issue with $p_\theta(\x)$ and  $p_\theta(\z|\x)$).

**But the RHS can be made tractable** given a proper choice of $p_\theta(\z)$.

So the RHS is a tractable form that is equivalent to the LHS and will serve as the loss function
for the VAE.

[VAE tutorial](https://arxiv.org/pdf/1606.05908.pdf) page 8 observes that the RHS has
the form of an autoencoder ($q_\phi(\z|\x)$ looks like an encoder; $p_\theta(\x|\z)$ looks like a decoder).

So it may be fair to say that the idea for the VAE is obtained from the Loss function,
rather than vice-versa.


### Choosing $p_\theta(\z)$

So what distribution should we use for the prior $p_\theta(\z)$ ?

One important consideration is that, since we learn by SGD, we need to be able to differentiate.

Another consideration is the functional form of the distribution: a closed form (e.g., normal) may simplify the math, whereas an empirical distribution has no closed functional form.

To force the tractability of $q_\phi(\z|\x)$ we
will define the *prior distribution* $p_\theta(\z)$ to have a tractable, closed form (often a normal).
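
As an illustration of how a normal prior simplifies the math, assume (beyond what the text fixes) that the prior is a standard normal and that $q_\phi(\z|\x)$ is normal with diagonal covariance, mean $\mu$ and standard deviation $\sigma$. The KL term then has the well-known closed form $\frac{1}{2} \sum_j ( \sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2 )$, so no sampling or integration is needed. The function name below is hypothetical.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    # 0.5 * (s^2 + m^2 - 1 - log(s^2)) written with log(s^2) = 2 * log(s)
    return sum(0.5 * (s * s + m * m - 1.0) - math.log(s) for m, s in zip(mu, sigma))

print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))   # q equals the prior
print(kl_to_standard_normal([1.0, -2.0], [0.5, 1.5]))  # q differs from the prior
print(kl_to_standard_normal([0.0], [1e-6]))            # nearly deterministic q
```

The divergence is zero only when $q$ matches the prior, and it grows without bound as $\sigma$ shrinks toward zero, so a nearly deterministic encoding is heavily penalized.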

### Loss function: discussion

**Expand, as per Kumar**

The reconstruction loss should be familiar: it tries to force the decoded output to be "close" to
the input.

What would happen if we omitted the KL divergence constraint $\loss_D$ from $\loss$ ?

Without it, the model could theoretically learn encodings $q_\phi(\z|\x)$ whose
distribution had near zero variance.
This would collapse the VAE into the vanilla AE.
So by choosing $p_\theta(\z)$ with a non-zero variance, we force the encoder to be probabilistic.


## Variational inference

**NEEDS WORK, or omit.  Really only need to discuss ELBO, if anything**

The cost/loss needs to be simplified quite a bit.
This is beyond the scope of this talk but we refer the reader to
[VAE tutorial](https://arxiv.org/pdf/1606.05908.pdf) (Also recommended by Geron in footnote 7, Chapt 15).

To summarize
- we still have an intractable term (it appears as another $\KL$ divergence after re-writing the cost/loss)
    - this term appears as an additive term
    - by definition of $\KL$, it is non-negative
- so we can't evaluate the full cost/loss function
    - but, ignoring the intractable non-negative part, the remainder is a *lower bound* on the log likelihood $\log p_\theta(\x)$
        - so we optimize the lower bound
        - called the *ELBO* (Evidence Lower BOund)
        
 See page 6 in particular. The probability $Prob(x)$ is the same as the x as from $Prob(z|x)$ via Bayes.
 **But** issue for us is $q(z|x)$ ??
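
One way to state the bound, using only the rearrangement of $\KL( q_\phi(\z|\x) \; ||\; p_\theta(\z | \x))$ from the loss function section (a sketch in that notation, not a full treatment of variational inference):

$$
\begin{array}{lll}
\log(p_\theta(\x)) & = & \underbrace{\E_{\z \sim q_\phi(\z|\x)} \left( \log( p_\theta(\x | \z) ) \right) - \KL( q_\phi(\z|\x) \; ||\; p_\theta(\z) )}_{\text{ELBO} \; = \; -\loss} + \underbrace{\KL( q_\phi(\z|\x) \; ||\; p_\theta(\z | \x))}_{\ge 0,\; \text{intractable}} \\
 & \ge & \text{ELBO}
\end{array}
$$

Since the dropped term is non-negative, minimizing $\loss$ is the same as maximizing a lower bound on the log likelihood.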

## Re-parameterization trick

Re-parameterization trick moves the sampling to the side. Figure 4 from https://arxiv.org/pdf/1606.05908.pdf

There is still one more problem for training:
- sampling $\hat{\z}  \sim  q_\phi(\z|\x)$

This is not a problem in the forward pass.
Optimization via backpropagation requires the ability to take derivatives of the loss wrt the trainable parameters.

How do we take the derivative of a node involving a random choice ?

The trick is to re-express $\z$:
$$
\begin{array}{lll}
\mathbf{z}  & = & \mathbf{\mu} + \mathbf{\sigma} \mathbf{\epsilon} \\
\mathbf{\epsilon} & \sim & \hat{p}(\mathbf{z}) \\
\end{array}
$$

That is, we obtain $\z$ by sampling an $\epsilon$ from the constraining distribution $\hat{p}(\z)$, scaling the random $\epsilon$ by $\sigma$ and adding $\mu$, where $\mu$ and $\sigma$ are computed by the encoder (and so depend on its learnable weights).

We still can't take derivatives of $\loss_R$ with respect to $\epsilon$, but we don't need to!

We only need to take derivatives of $\loss_R$ with respect to $\phi, \theta, \mu, \sigma$, which we can do.

In evaluating derivatives, the $\epsilon$ that appears in the result (e.g., derivative wrt $\sigma$) can be treated as a constant.

- For a particular example, we can remember the drawn $\epsilon$ in the forward pass and use it in the backward pass
  - but over a batch, the average of the drawn $\epsilon$ should be close to the mean of $\hat{p}(\z)$ (which is $0$ if we constrain $\hat{p}$ to be $0$ centered)
  
$$
\begin{array}{lll}
\loss_D(\phi, \mathbf{\mu}, \mathbf{\sigma}, \mathbf{x}) & = & \KL \left(  q_\phi(\z|\x) \; || \; \hat{p}_{\mathbf{\mu}, \mathbf{\sigma}}(\z) \right) \\
\end{array}
$$

<table>
    <tr>
        <th><center>Reparameterization trick</center></th>
    </tr>
    <tr>
        <td><img src="images/Reparameterization_trick.jpg" width=800></td>
    </tr>
</table>
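
The trick can be checked numerically with scalars. This is a sketch under assumptions: $\epsilon$ is drawn from a standard normal, and `toy_loss` is a hypothetical stand-in for the reconstruction loss. Once $\epsilon$ is drawn, $\z = \mu + \sigma \epsilon$ is an ordinary differentiable function of $\mu$ and $\sigma$, with $\partial \z / \partial \mu = 1$ and $\partial \z / \partial \sigma = \epsilon$.

```python
import random

def sample_z(mu, sigma, eps):
    """Reparameterized sample: deterministic in (mu, sigma) once eps is drawn."""
    return mu + sigma * eps

def toy_loss(z, target):
    """Hypothetical stand-in for the reconstruction loss."""
    return (z - target) ** 2

def gradients(mu, sigma, eps, target):
    """Chain rule through z = mu + sigma * eps, treating eps as a constant."""
    z = sample_z(mu, sigma, eps)
    dloss_dz = 2.0 * (z - target)
    return dloss_dz * 1.0, dloss_dz * eps  # (d loss / d mu, d loss / d sigma)

rng = random.Random(0)
eps = rng.gauss(0.0, 1.0)  # drawn once in the forward pass, reused below
mu, sigma, target = 0.3, 0.8, 1.0
g_mu, g_sigma = gradients(mu, sigma, eps, target)

# Finite-difference check that reuses the same eps.
h = 1e-6
fd_mu = (toy_loss(sample_z(mu + h, sigma, eps), target)
         - toy_loss(sample_z(mu - h, sigma, eps), target)) / (2 * h)
fd_sigma = (toy_loss(sample_z(mu, sigma + h, eps), target)
            - toy_loss(sample_z(mu, sigma - h, eps), target)) / (2 * h)
print(g_mu, fd_mu)
print(g_sigma, fd_sigma)
```

The analytic and finite-difference gradients agree, which shows that the randomness has been moved out of the path that backpropagation must traverse.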

### Summary

**Training**

Encoder produces

$$
\begin{array}{lllll}
E(\x) & = &  q_\phi(\z|\x) & \approx & p_\theta(\z|\x) \\
\end{array}
$$

We sample from
$$
\begin{array}{lll}
\hat{\z} & \sim & q_\phi(\z|\x) \\
\end{array}
$$

Decoder produces

$$
\begin{array}{lll}
D(\hat{\z})  & = & p_\theta (\x|\hat{\z})\\
\end{array}
$$

Each time (epoch) that we encounter the same training example, we select another random element from the distribution.

So the VAE learns to represent the same example from multiple latents.


**Generative**
- sample $\hat{\z} \sim \hat{p}(\z)$
- use decoder to produce output $p_\theta (\x|\hat{\z})$
 
This means we can feed in a $\z$ that doesn't correspond to any training example and perhaps get an output that *resembles* something from the training set, rather than noise.

# Conditional Variational Autoencoder (CVAE)



So far, a VAE will allow us to input an arbitrary $z$ and get a "reasonable" output, that is, something that resembles a training example.

However, we don't have any control as to *which* class the output will come from.

But the training examples sometimes come with labels that partition the training set into classes (e.g., each particular digit in MNIST).

How about controlling *which* digit is generated by passing in the desired label as an input ?

This is remedied in CVAE by
- modifying the input of the Encoder to also take a label (e.g., class id)
- modifying the input of the Decoder to also take a label
 
So the Encoder output is a distribution that is conditional both on the input image **and** the class label.

Similarly the Decoder output is conditional on the latent **and** the class label.
 

So now we can create a latent, append a class label, and presumably have the decoder produce an output from the desired class.

Thus, the encoding distribution is now conditional on class label $c$: $q_\phi(z|x,c)$
and so is the decoding distribution $p_\theta(x|z,c)$.

Again, by restricting the functional form of the prior distribution $\hat{p}$ we can simplify the math.
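
Appending a class label to a latent can be sketched as follows. The one-hot encoding and simple concatenation are assumptions (one common way to feed a label to a network), and the helper names are hypothetical.

```python
def one_hot(label, num_classes):
    """Encode an integer class id as a one-hot list."""
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def condition(vector, label, num_classes):
    """Append the one-hot label to an input or latent vector."""
    return list(vector) + one_hot(label, num_classes)

NUM_CLASSES = 10   # e.g., the ten MNIST digits
z = [0.25, -1.3]   # a latent that need not come from any training example
decoder_input = condition(z, 7, NUM_CLASSES)  # ask the decoder for a "7"

print(len(decoder_input))
print(decoder_input)
```

The same `condition` helper would be applied to the Encoder's input during training, so that both networks see the label.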

<table>
    <tr>
        <th><center>Conditional VAE (CVAE)</center></th>
    </tr>
    <tr>
        <td><img src="images/Autoencoder_CVAE.jpg" width=800></td>
    </tr>
</table>

# Layer-wise pre-training with Autoencoders

Autoencoders these days can be used for dimension reduction.

Moreover, they were an important transitional step (albeit no longer relevant)
in the history of DL. 

Before we learned all the tricks (better initialization, better activation functions, normalization)
that enable us to train deep networks today, autoencoders played a vital role.

For each layer $\ll$ in sequence, starting from the input layer: train layer $l$'s weights using an
autoencoding objective (to have the output of the layer replicate its input).

This is, in effect, an initialization of the layer's weights that is better than random.

(Presumably, by being able to encode the syntactic structure of the training set, the weights have
learned something useful.)

For a brief time in history, this was the solution to the difficulty of not learning because of poor initialization.

## Autoencoders and Transfer Learning
Today, autoencoders are useful for another purpose: Transfer Learning.

If we can train an AE network to create features that are useful for reconstruction, it is possible
that these features are useful for solving more complicated tasks.

So it is not uncommon to approach a complicated problem by first constructing an autoencoder to
come up with an alternate (and smaller) representation of the input space.

Note that Autoencoders are *unsupervised*: they don't take labels.  

So the encodings they produce
stress syntactic similarity, rather than semantic similarity.

Their use in Transfer Learning depends on the hope that inputs that are syntactically similar also
have the same labels.

In [2]:
print("Done")

Done
