In [1]:
%run Latex_macros.ipynb


# Variational Autoencoder (VAE)

Like the plain Autoencoder that we have already encountered,
a *Variational Autoencoder (VAE)* consists of an Encoder and a Decoder.

In both cases
- the Encoder produces a latent representation $\z^\ip$ of its input $\x^\ip$
- the Decoder attempts to reconstruct $\x^\ip$ from $\z^\ip$

However, in a plain Autoencoder, the behavior of the Decoder is undefined when presented with a latent $\z$ that did
not arise from a training example.
- We can only hope that the output is reasonable

As we saw, this has implications as to our ability to use the AE as a means of generating synthetic examples.

The Decoder part of a VAE is identical to that of the plain Autoencoder.

But the Encoder part of a VAE is different in an important way.  Given input $\x$
- It creates a *distribution* for the latent representation $\z$
- Rather than creating a unique $\z$

The Encoder part of a VAE, given input $\x$
- Produces the parameters (e.g., $\mu^\ip, \sigma^\ip$) of a distributional form
- Draws a sample from the distribution as its output $\z^\ip$

Thus, the latent representation $\z$ of a given $\x$ is a probability distribution $\qr{\z | \x}$.

This may address one of the concerns we raised with using a plain Autoencoder in a generative manner
- that a slight perturbation of the latent $\z^\ip$ obtained from input $\x^\ip$
- might have the Decoder produce $\tilde{\x}^\ip$ that is not similar to $\x^\ip$

<table>
    <tr>
        <th><center>Variational Autoencoder (VAE)</center></th>
    </tr>
    <tr>
        <td><img src="images/Autoencoder_VAE.png"></td>
    </tr>
</table>

**Note**

$\mu^\ip$ and $\sigma^\ip$ are 
- vectors
- computed values (and hence, functions of $\x^\ip$) and **not** parameters
- so training learns a *function* from $\x^\ip$ to $\mu^\ip$ and $\sigma^\ip$


This is not just straightforward engineering.

In fact: the architecture of the VAE was obtained from the math rather than vice-versa !

We provide a brief overview of the mathematics.

The interested reader is referred to a highly recommended [VAE tutorial](https://arxiv.org/pdf/1606.05908.pdf)
for a detailed presentation.


# Details

**Notation summary**

term &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;| dimension &nbsp; &nbsp;  &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | meaning 
:---|:---|:---
$\x$ | $n$ | Random variable for Input
$\tilde\x$ | $n$ | Output: reconstructed $\x$
$\z$ | $n' \ll n$ | Random variable for Latent representation
$E$  | $\mathbb{R}^n \rightarrow \mathbb{R}^{n'}$ | Encoder
            | | $E(\x) = \z $
$D$  | $\mathbb{R}^{n'} \rightarrow \mathbb{R}^n$ | Decoder
            | | $\tilde\x = D(\z) $
            | | $\tilde\x = D( E(\x) )$
            | | $\tilde\x \approx \x$
$\pr{\x}$ | prob. distribution | *prior* distribution of Inputs
          | | intractable.  Only have empirical.
$\qr{\z}$ | prob. distribution | *prior* distribution of Latents
$\qr{\z \mid \x}$ | prob. distribution| *posterior* (conditional) distribution of Latents given Input
                  | | intractable
$\qrs{\z \mid \x}{\Phi}$ | Neural Network| NN to approximate $\qr{\z \mid \x}$ 
                         | | Encoder
$\prs{\x \mid \z}{\Theta}$ | Neural Network| NN to approximate $\pr{\x \mid \z}$ 
                  | | Decoder

Let's pretend that we don't already know the architecture of a VAE
- that the latent $\z^\ip$ is generated as a probability function of $\mu^\ip$ and $\sigma^\ip$ given input $\x^\ip$.

Instead let
- $\x$ denote a random variable representing an Input
    - the random variable is from the (unknown) distribution $\pr{\x}$
- $\z$ denote a random variable representing a Latent
    - the random variable is from the (unknown) distribution $\qr{\z}$
    

Because $\x$ and $\z$ are related, there is also a joint distribution of $(\x,\z)$ from which we
can define the conditional distributions
- $\qr{\z | \x}$: the conditional distribution of Latent, given an Input
- $\pr{\x | \z}$: the conditional distribution of Input, given a Latent



But there's a problem !
- The distribution $\pr{\x}$ is *intractable*
    - e.g., Who can say what the distribution of images is in the physical world ?
    - At best: we have an empirical distribution (our training dataset)

We will side-step the intractability issues by defining Neural Networks to learn
an approximation.
- $\qrs{\z | \x}{\Phi}$: Neural Network, with weights $\Phi$, to approximate $\qr{\z \mid \x}$
    - The Encoder
- $\prs{\x | \z}{\Theta}$: Neural Network, with weights $\Theta$, to approximate $\pr{\x \mid \z}$
    - The Decoder

The mapping from Latent to reconstructed Input is not necessarily unique,
thus we obtain $\pr{\x}$ by marginalizing over $\z$
$$
\pr{\x} = \int_{\z \in \qr{\z}}{ \pr{\x | \z} \; \qr{\z} \; d\z }
$$



Some obvious concerns about the integral
- It may be very expensive to draw many samples of $\z$ from $\qr{\z}$ for each training example $\x^\ip$
- There are many random choices of $\z$ from $\qr{\z}$ which can't reconstruct $\x^\ip$
    - i.e., $\prs{\x^\ip | \z'}{\Theta} = 0$ for many $\z'$
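Both concerns can be seen in a small numerical sketch. Everything in it is an assumption made for illustration: a scalar latent with prior Normal(0, 1), and a toy "Decoder" under which $\x$ given $\z$ is Normal($\z$, $0.1^2$), so that the exact marginal is known and can be compared with the sampled estimate.

```python
import math
import random

def normal_pdf(v, mean, std):
    """Density of a 1-D Normal(mean, std^2) evaluated at v."""
    return math.exp(-0.5 * ((v - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

rng = random.Random(0)
x = 2.0            # one "training example" (a scalar, for illustration only)
decoder_std = 0.1  # assumed noise of the toy decoder p(x | z) = Normal(z, 0.1^2)

# Draw latents from the prior q(z) = Normal(0, 1) and average p(x | z).
zs = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
likelihoods = [normal_pdf(x, z, decoder_std) for z in zs]
p_x_estimate = sum(likelihoods) / len(likelihoods)

# For this toy model the marginal is known exactly: x ~ Normal(0, 1 + 0.1^2).
p_x_exact = normal_pdf(x, 0.0, math.sqrt(1.0 + decoder_std ** 2))

# Fraction of prior samples that contribute almost nothing to the average.
negligible = sum(lik < 1e-3 for lik in likelihoods) / len(likelihoods)

print(f"estimate={p_x_estimate:.4f}  exact={p_x_exact:.4f}  negligible={negligible:.1%}")
```

The estimate needs many samples to settle, and the large majority of prior samples have a likelihood close to zero for this particular $\x$.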


We can improve our sampling by considering only those choices of $\z$ that could generate $\x$
and re-write the objective as

$$
\pr{\x} = \int_{ \z \in \qr{\z | \x} } { \pr{\x | \z} \; \qr{\z} \; d\z }
$$


Using the Decoder Neural Network $\prs{\x | \z}{\Theta}$ to approximate $ \pr{\x | \z}$
gives rise to the following architecture

<table>
    <tr>
        <th><center>VAE derivation: 1</center></th>
    </tr>
    <tr>
        <td><img src="images/Autoencoder_VAE_derivation_B_0.png" width=80%></td>
    </tr>
</table>

We still can't train $\prs{\x | \z}{\Theta}$ because we don't know $\qr{ \z | \x }$.

Let's use the Neural Network (Encoder) $\qrs{ \z | \x }{\Phi}$ to approximate $\qr{ \z | \x }$.

But we must constrain the Encoder to produce a distribution that approximates the true $\qr{ \z | \x }$.

We do so by including this as a constraint (part of the Loss function used in training) using KL divergence
as a measure of dissimilarity of two distributions

$$
\KL( \qrs{\z | \x}{\Phi} \; ||\; \qr{\z | \x})
$$


The resulting architecture:


<table>
    <tr>
        <th><center>VAE derivation: 2</center></th>
    </tr>
    <tr>
        <td><img src="images/Autoencoder_VAE_derivation_B.png"></td>
    </tr>
</table>


We can train the Encoder/Decoder pair with the objective that the
reconstructed $\tilde{\x}^\ip$ approximates the true $\x^\ip$ from the training set, across all examples $i$.

One way of stating this is as a Maximum Likelihood problem:
- Solve for the weights $\Phi, \Theta$
- That maximize the (log) Likelihood of the set of reconstructions $\tilde{\X}$ reproducing the training set $\X$

Since our practice is to minimize Loss (rather than maximize an objective function)
we write our loss as (negative of log) likelihood
$$
\begin{array}{lll}
\loss & = & - \log( \pr{\X} )
\end{array}
$$

Minimizing $\loss$ is equivalent to maximizing likelihood.



Adding the KL divergence constraint to our Likelihood objective gives the loss function
$$
\begin{array}{lll}
\loss  & = & - \log(\prs{\x}{\Theta}) + \KL( \qrs{\z|\x}{\Phi} \; ||\; \qr{\z | \x}) \\
& = & \loss_R + \loss_D
\end{array}
$$

which now has two objectives
- Reconstruction loss $\loss_R$: maximize the likelihood (by minimizing the negative log-likelihood)
- Divergence constraint $\loss_D$: $\qrs{\z|\x}{\Phi}$ must be close to $\qr{\z | \x}$

$$
\begin{array}{lll}
\loss_R & = & - \log( \prs{\x}{\Theta} ) \\
\loss_D & = & \KL \left(  \qrs{\z|\x}{\Phi} \; || \; \qr{\z | \x} \right) \\
\end{array}
$$

We will show (in the next section: lots of algebra !) that the loss can be re-written as
$$
\loss = - \E_{\z \sim \qrs{\z|\x}{\Phi}}\left( \log( \prs{\x|\z}{\Theta} ) \right) + \KL(\qrs{\z|\x}{\Phi}  \; ||\;  \qr{\z} )
$$

This is *almost* identical to our original expression for $\loss$ except
- the term $\log(\prs{\x}{\Theta})$ is replaced by
$$
\E_{\z \sim \qrs{\z|\x}{\Phi}}\left( \log( \prs{\x|\z}{\Theta} ) \right)
$$
- the KL term becomes
$$
 \KL \left(  \qrs{\z|\x}{\Phi} \; || \; \qr{\z} \right)
$$
rather than the original
$$
\KL \left(  \qrs{\z|\x}{\Phi} \; || \; \qr{\z | \x} \right)
$$

**The purpose of re-writing**: replace intractable $\qr{\z|\x}$ with a tractable $\qr{\z}$ !
- So we can have a Loss function with which we can train !

<div class="alert alert-block alert-warning">
    <b>TL;DR</b> 
    <br>
    <ul>
        <li>The VAE has a very interesting <b>two part</b> Loss Function</li>
        <ul>
            <li>Reconstruction Loss, as in the Vanilla AE</li>
            <li>Divergence Loss</li>
        </ul>
        <li>The Reconstruction Loss is not sufficient</li>
        <ul>
            <li>Issues of intractability arise</li>
            <li>The Divergence Loss skirts intractability</li>
            <ul>
                <li>By constraining the Encoder to produce a tractable distribution</li>
            </ul>
        </ul>
    </ul>
</div>

## Advanced: Obtain $\loss$ by rewriting $\KL( \qrs{\z|\x}{\Phi} \; ||\; \qr{\z|\x} )$

Let's derive a simpler expression for $\loss$ by manipulating $\KL( \qrs{\z|\x}{\Phi} \; ||\; \qr{\z|\x})$:

$
\begin{array}{llll}
\KL( \qrs{\z|\x}{\Phi} \; ||\; \qr{\z | \x}) &  = & \sum_{\z}{ \qrs{\z|\x}{\Phi} \left( \log(\qrs{\z|\x}{\Phi}) - \log(\qr{\z | \x}) \right) } & \text{def. of KL} \\
&  = & \E_{\z \sim \qrs{\z|\x}{\Phi} } \left( \log(\qrs{\z|\x}{\Phi}) - \log(\qr{\z | \x}) \right) & \text{def. of }\E \\
&  = & \E_{\z \sim \qrs{\z|\x}{\Phi} } \left( \log(\qrs{\z|\x}{\Phi}) - \left( \log( \pr{\x | \z}) + \log(\qr{\z}) - \log(\pr{\x}) \right) \right) & \text{Bayes theorem on } \log(\qr{\z|\x}) \\
\KL( \qrs{\z|\x}{\Phi} \; ||\; \qr{\z | \x}) \\
- \log(\pr{\x}) & = & \E_{\z \sim \qrs{\z|\x}{\Phi} } \left( \log(\qrs{\z|\x}{\Phi})  - \left( \log( \pr{\x | \z} ) + \log( \qr{\z} ) \right) \right) & \text{move } \log(\pr{\x}) \text{ to LHS} \\
 & = & \E_{\z \sim \qrs{\z|\x}{\Phi} } \left( - \log( \pr{\x | \z} ) + \left( \log(\qrs{\z|\x}{\Phi})  - \log( \qr{\z} ) \right) \right) & \text{re-arrange terms} \\
 & = & - \E_{\z \sim \qrs{\z|\x}{\Phi} } \left( \log( \pr{\x | \z} ) \right) + \KL(\qrs{\z|\x}{\Phi} \; ||\;  \qr{\z} ) & \text{def. of KL} \\
 \loss & = & - \E_{\z \sim \qrs{\z|\x}{\Phi} } \left( \log( \pr{\x | \z} ) \right) + \KL(\qrs{\z|\x}{\Phi} \; ||\;  \qr{\z} ) & \text{since LHS} = \loss \\
\end{array}
$

**The key step**:
- Using Bayes Theorem to re-write
$$\log(\qr{\z|\x}) $$
as
$$
\log( \pr{\x | \z}) + \log(\qr{\z}) - \log(\pr{\x} )
$$
- This allows us to do away with the intractable conditional probability $\qr{\z|\x}$
- In favor of unconditional probability $\qr{\z}$
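The identity obtained by this re-writing can be checked numerically on a case small enough that nothing is intractable. The sketch below assumes a discrete latent with three values and made-up probabilities (the numbers carry no meaning beyond illustration); the "Encoder" distribution is arbitrary and deliberately differs from the true posterior.

```python
import math

prior = [0.5, 0.3, 0.2]     # q(z): prior over three latent values
lik = [0.10, 0.60, 0.05]    # p(x | z) for one fixed input x
q_phi = [0.2, 0.7, 0.1]     # an arbitrary q_phi(z | x)

# Marginal p(x), then the true posterior q(z | x) via Bayes theorem.
p_x = sum(l * p for l, p in zip(lik, prior))
posterior = [l * p / p_x for l, p in zip(lik, prior)]

def kl(a, b):
    """KL(a || b) for two discrete distributions given as lists."""
    return sum(ai * (math.log(ai) - math.log(bi)) for ai, bi in zip(a, b))

# Original form of the loss: needs the posterior q(z | x).
lhs = -math.log(p_x) + kl(q_phi, posterior)

# Re-written form: needs only p(x | z) and the prior q(z).
rhs = -sum(q * math.log(l) for q, l in zip(q_phi, lik)) + kl(q_phi, prior)

print(lhs, rhs)
```

The two values agree up to floating point error, even though the second form never touches the posterior.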

The LHS cannot be optimized via SGD (recall the tractability issue with  $\qr{\z|\x}$).

**But the RHS can be made tractable** by
- Approximating $\pr{\x | \z}$ with $\prs{\x | \z}{\Theta}$
- Assuming the distributions $\qr{\z}$ (and $\pr{\x}$) to be Normal




## Choosing $\qr{\z}$

So what distribution should we use for the prior $\qr{\z}$ ?
- It should be differentiable, since we use Gradient Descent for optimization
- It should be tractable with a closed form (such as a normal)

A *convenient* (but **not necessary**) choice for $\qr{\z}$ is normal
- If we choose $\qr{\z}$ as normal, we can require $\qrs{\z|\x}{\Phi}$ to be normal too
- The KL divergence between two normals is an easy to compute function of their means and standard deviations.
    - the "easy to compute" part is the "convenience"
    - See [VAE tutorial](https://arxiv.org/pdf/1606.05908.pdf) Section 2.2
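As a sketch of the "easy to compute" part: when the Encoder distribution is a Normal with diagonal covariance and the prior is a standard Normal, the KL divergence has the closed form $\frac{1}{2} \sum_j ( \sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2 )$. The function name and the example values below are illustrative assumptions; the Monte Carlo estimate is included only as a sanity check.

```python
import math
import random

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return sum(0.5 * (s * s + m * m - 1.0 - math.log(s * s))
               for m, s in zip(mu, sigma))

def log_normal_pdf(v, mean, std):
    """Log density of a 1-D Normal(mean, std^2) evaluated at v."""
    return -0.5 * ((v - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)

mu, sigma = [0.5, -1.0], [0.8, 1.5]   # illustrative Encoder outputs
kl_closed = kl_to_standard_normal(mu, sigma)

# Sanity check: KL = E_{z ~ q_phi}[ log q_phi(z) - log q(z) ], estimated by sampling.
rng = random.Random(0)
n_samples = 100_000
total = 0.0
for _ in range(n_samples):
    for m, s in zip(mu, sigma):
        z = rng.gauss(m, s)
        total += log_normal_pdf(z, m, s) - log_normal_pdf(z, 0.0, 1.0)
kl_mc = total / n_samples

print(f"closed form={kl_closed:.4f}  Monte Carlo={kl_mc:.4f}")
```

The closed form needs no sampling at all, and is zero exactly when $\mu = 0$ and $\sigma = 1$.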


## Re-parameterization trick

There is still one impediment to training.

It involves the random choice of $\z \sim \qrs{\z|\x}{\Phi}$ in

$$
\loss_R = - \E_{\z \sim \qrs{\z|\x}{\Phi} } \left( \log( \prs{\x | \z}{\Theta} ) \right)
$$

This is not a problem in the forward pass.

But in the backward pass we need to compute the gradient with respect to the Encoder weights $\Phi$, which determine the distribution from which $\z$ is sampled
$$
\frac{\partial \loss_R}{\partial \Phi}
$$

How do we back propagate through a random choice ?


The "reparameterization trick" redefines the random choice $\z$ as

$$
\begin{array}{lll}
\mathbf{z}  & = & \mathbf{\mu}_\Phi(\x) + \mathbf{\sigma}_\Phi(\x) * \mathbf{\epsilon} \\
\mathbf{\epsilon} & \sim & N(0,1) \\
\end{array}
$$

That is
- Once we have required $\qrs{\z|\x}{\Phi}$ to be a Normal distribution
- We can re-write a sample from the distribution
    - as the distribution's mean plus a random standardized Normal $\epsilon$ times the distribution's standard deviation
    
In this formulation, the random variable $\epsilon$ appears in a product term
- We can differentiate the product with respect to $\Phi$
- $\epsilon$ does not depend on $\Phi$, so it can be treated as a constant when differentiating: $\frac{\partial \epsilon}{\partial \Phi} = 0$

The Encoder design is now to produce
the computed values $\mu_\Phi(\x), \sigma_\Phi(\x)$ (functions of $\x$ whose weights $\Phi$ are trained)
- And $\z$ indirectly
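A minimal sketch of why the trick works, under assumptions chosen for illustration: a scalar latent, fixed values standing in for the Encoder outputs $\mu$ and $\sigma$, and the function $f(\z) = \z^2$ standing in for the loss. Because $\z$ is a deterministic function of $\mu$, $\sigma$ and $\epsilon$, the chain rule gives per-sample gradients that can be averaged, and for this $f$ the exact answer is known.

```python
import random

rng = random.Random(0)
mu, sigma = 1.5, 0.5      # stand-ins for the Encoder outputs for one input
n_samples = 200_000

grad_mu = 0.0
grad_sigma = 0.0
for _ in range(n_samples):
    eps = rng.gauss(0.0, 1.0)    # randomness, independent of mu and sigma
    z = mu + sigma * eps         # deterministic function of (mu, sigma, eps)
    df_dz = 2.0 * z              # derivative of the stand-in loss f(z) = z^2
    grad_mu += df_dz * 1.0       # chain rule: dz/dmu = 1
    grad_sigma += df_dz * eps    # chain rule: dz/dsigma = eps
grad_mu /= n_samples
grad_sigma /= n_samples

# Analytic check: E[z^2] = mu^2 + sigma^2, so the exact gradients
# are 2*mu with respect to mu and 2*sigma with respect to sigma.
print(f"d/dmu:    estimate={grad_mu:.3f}  exact={2 * mu:.3f}")
print(f"d/dsigma: estimate={grad_sigma:.3f}  exact={2 * sigma:.3f}")
```

In a real VAE the same chain rule continues from $\mu$ and $\sigma$ back into the Encoder weights, which an automatic differentiation framework handles.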


<table>
    <tr>
        <th><center>Reparameterization trick</center></th>
    </tr>
    <tr>
        <td><img src="images/Reparameterization_trick.png"></td>
    </tr>
</table>

This gets us to the  final picture of the VAE:

<table>
    <tr>
        <th><center>Variational Autoencoder (VAE)</center></th>
    </tr>
    <tr>
        <td><img src="images/Autoencoder_VAE.png"></td>
    </tr>
</table>

To train a VAE:
- pass input $\x^\ip$ through the Encoder, producing $\mu^\ip, \sigma^\ip$
    - use $\mu^\ip, \sigma^\ip$ to sample a latent representation $\z^\ip$ from the distribution
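- pass the sampled $\z^\ip$ through the Decoder, producing $D(\z^\ip)$
- measure the reconstruction loss $\loss_R$ between $\x^\ip$ and $D(\z^\ip)$, just as in a plain AE
- add the divergence loss $\loss_D$, computed from $\mu^\ip, \sigma^\ip$
- back propagate the total loss, updating the weights of the Encoder and Decoder (and hence the functions that compute $\mu, \sigma$)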
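The forward half of these steps can be sketched in a few lines. This is a toy under stated assumptions, not the model used in this notebook: the Encoder and Decoder are single random linear maps, the Encoder outputs a log variance (from which $\sigma$ is recovered), squared error stands in for the reconstruction term, and the divergence term is the closed-form KL to a standard Normal prior. The backward pass is left to a framework.

```python
import math
import random

rng = random.Random(0)
n, n_latent = 4, 2   # toy input and latent dimensions

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(W, v):
    return [sum(w * vi for w, vi in zip(row, v)) for row in W]

# Hypothetical weights: (W_mu, W_logvar) play the role of Phi, W_dec the role of Theta.
W_mu, W_logvar = rand_matrix(n_latent, n), rand_matrix(n_latent, n)
W_dec = rand_matrix(n, n_latent)

def vae_loss(x):
    mu = matvec(W_mu, x)                                     # Encoder output: mean
    log_var = matvec(W_logvar, x)                            # Encoder output: log variance
    sigma = [math.exp(0.5 * lv) for lv in log_var]
    eps = [rng.gauss(0.0, 1.0) for _ in range(n_latent)]
    z = [m + s * e for m, s, e in zip(mu, sigma, eps)]       # sampled latent
    x_tilde = matvec(W_dec, z)                               # Decoder output
    loss_r = sum((a - b) ** 2 for a, b in zip(x, x_tilde))   # reconstruction term
    loss_d = sum(0.5 * (s * s + m * m - 1.0 - lv)            # KL to N(0, I)
                 for m, s, lv in zip(mu, sigma, log_var))
    return loss_r, loss_d

x = [0.2, -0.4, 1.0, 0.5]
loss_r, loss_d = vae_loss(x)
total_loss = loss_r + loss_d
print(f"loss_R={loss_r:.4f}  loss_D={loss_d:.4f}  total={total_loss:.4f}")
```

Calling `vae_loss(x)` twice on the same `x` gives the same divergence term but a different reconstruction term, because a new $\epsilon$ is drawn each time.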

Each time that we encounter the same training example (e.g., in different epochs), we select another random element from the distribution.

Thus the VAE learns to represent the same example from multiple latents.


We can examine how the latent representations produced by the VAE form clusters:

<table>
   <tr>
       <th><center>MNIST examples: clustering of latent $\z$</center></th>
    </tr>
    <tr>
        <td><img src="images/vae_latents.png" width=80%></td>
    </tr>

</table>

In comparing the clusterings produced by the VAE against our previous example of a plain Autoencoder, be aware
- The two models are displaying results on different data: MNIST digits versus Fashion MNIST
- The architectures of the Encoder and Decoder are different in the two models
    - The plain Autoencoder used extremely simple architectures
        - Could the more complex architecture of the VAE Encoder/Decoder be the cause of tighter clustering ?
        
Certainly room for experimentation !

# Using a VAE to produce synthetic examples

To give you an idea of the generative nature of the VAE, consider
- Creating latent vectors $\z$ from scratch
    - **not** as the output of the Encoder
- Varying these latent vectors systematically and examining the output created by the Decoder
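The procedure can be sketched as follows. The grid range, the number of steps, the row and column orientation, and above all `toy_decoder` are assumptions for illustration: `toy_decoder` is a hypothetical stand-in that returns a 4-value "image", where a trained Decoder would return a digit image.

```python
import math

def latent_grid(n_steps=5, low=-2.0, high=2.0):
    """Return an n_steps x n_steps grid of 2-D latent vectors (z_0, z_1)."""
    step = (high - low) / (n_steps - 1)
    values = [low + i * step for i in range(n_steps)]
    # Rows vary z_1 from high to low (top to bottom);
    # columns vary z_0 from low to high (left to right).
    return [[(z0, z1) for z0 in values] for z1 in reversed(values)]

def toy_decoder(z):
    """Hypothetical stand-in for a trained Decoder: 2-D latent -> 4-pixel 'image'."""
    z0, z1 = z
    return [math.tanh(z0), math.tanh(z1), math.tanh(z0 + z1), math.tanh(z0 - z1)]

grid = latent_grid()
# The latents are created from scratch; no Encoder is involved.
images = [[toy_decoder(z) for z in row] for row in grid]

print(len(images), "x", len(images[0]), "synthetic outputs")
```

Plotting `images` in the same row and column layout gives a picture in which each axis corresponds to one component of $\z$.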

<table>
   <tr>
       <th><center>Synthetic MNIST examples from a VAE: vary the 2 components of a 2D latent $\z$</center></th>
    </tr>
    <tr>
        <td><img src="images/vae_outputgrid.png" width=90%></td>
    </tr>

</table>

Note that the outputs
- are **not** instances of any training examples
- There was no guarantee that a random $\z$ would produce something that looked like a digit !

We may even be able to interpret the elements of $\z$
- $\z_0$: control slant ?
    - See the bottom row of $0$'s
- $\z_1$: control "verticality" ?
    - See right-most column

## ELBo (Evidence Lower Bound)

By re-writing the Loss, we removed the intractable term $\qr{\z|\x}$.

It turns out that even this may not be necessary.

For the truly interested reader:
- The derivation uses a method known as *Variational Inference*.  See this 
[blog](https://mbernste.github.io/posts/variational_inference/) for a summary.
- One can show that loss $\loss$ is equal to $-1$ times the *ELBo* (Evidence Lower Bound)

So if one knows how to maximize the [ELBo](https://mbernste.github.io/posts/elbo/), one can minimize the loss.
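As a sketch of where the name comes from, re-arranging the identity derived in the Advanced section (same symbols, nothing new assumed beyond the fact that a KL divergence is never negative) gives

$$
\begin{array}{lll}
\log(\pr{\x}) & = & \underbrace{ \E_{\z \sim \qrs{\z|\x}{\Phi}}\left( \log( \pr{\x|\z} ) \right) - \KL( \qrs{\z|\x}{\Phi} \; ||\; \qr{\z} ) }_{\text{ELBo}} + \KL( \qrs{\z|\x}{\Phi} \; ||\; \qr{\z|\x} ) \\
& \ge & \text{ELBo}
\end{array}
$$

The "evidence" is $\log(\pr{\x})$, and the ELBo is a lower bound on it. The gap between the two is the intractable term $\KL( \qrs{\z|\x}{\Phi} \; ||\; \qr{\z|\x} )$, which shrinks as the Encoder distribution approaches the true posterior.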
     

## Loss function: discussion

Let's examine the role of $\loss_R$ and $\loss_D$ in the loss function $\loss$.

- What would happen if we dropped $\loss_D$ ?
    - We would wind up with a deterministic $\z$ and collapse to a vanilla Autoencoder
    
- What would happen if we dropped $\loss_R$ ?
    - the encoding approximation $\qrs{\z|\x}{\Phi}$ would be close to the target $\qr{\z | \x}$ *in distribution*
    - but two variables with the same distribution are not necessarily the same
        - e.g., get a distribution $p$ by flipping a coin
            - let distribution $q$ be a relabelling of $p$ by changing Heads to Tails and vice-versa
            - $p$ and $q$ are equal in distribution but clearly different !
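The coin example is easy to simulate. The sketch below assumes a fair coin and a fixed random seed; it compares the two sets of flips first as distributions and then flip by flip.

```python
import random

rng = random.Random(0)

# Samples of p: fair coin flips.
flips_p = [rng.choice("HT") for _ in range(10_000)]
# Samples of q: the same flips with Heads and Tails relabelled.
flips_q = ["T" if f == "H" else "H" for f in flips_p]

# Equal in distribution: both have roughly the same frequency of Heads.
freq_heads_p = flips_p.count("H") / len(flips_p)
freq_heads_q = flips_q.count("H") / len(flips_q)

# But as variables they never agree on any individual flip.
agreement = sum(a == b for a, b in zip(flips_p, flips_q)) / len(flips_p)

print(f"P(H) under p={freq_heads_p:.3f}  under q={freq_heads_q:.3f}  agreement={agreement:.3f}")
```

Matching a distribution says nothing about individual examples, which is the role the reconstruction loss plays: it ties each latent to the particular input that produced it.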
    


# Conditional VAE

The VAE would seem to offer a solution to the problem of creating synthetic data.

But there is a problem
- an *unlabeled* example is created
- we have no way of knowing the label
- nor do we have a way of *controlling* the output so as to be an example with a specified label

We can modify the VAE so as to *conditionally* generate an example with a specified label.
- [Conditional VAE](Cond_VAE_Generative.ipynb)

# Code: VAE, CVAE

We can learn much more about the properties and use of a VAE through examples.

A secondary objective is to look at the code which involves some advanced features of Keras.

The architecture of the VAE will be more complex compared to the vanilla Autoencoder.

<table>
   <tr>
            <th>VAE: Components</th>
    </tr>
    <tr>
        <td><center><strong>Encoder</strong></center></td>
        <td><center><strong>Decoder</strong></center></td>
    </tr>
    <tr>
        <td><img src="images/vae_conv_encoder.png" width=90%></td>
        <td><img src="images/vae_conv_decoder.png" width=90%></td>
    </tr>

</table>

Encoder
- Note the two branches to nodes `z_mean` and `z_log_var`
    - The output of their common parent is used to generate two separate values ($\mu$ and $\log \sigma^2$, from which $\sigma$ is recovered)
    - $\mu$ and $\sigma$ are both vectors of length $2$
        - Thus, the sampled $\z$ is also of length $2$
        - In our grid illustration of generating synthetic examples, we vary each of the $2$ components of $\z$
    - Latent is *much* shorter than in the plain Autoencoder
        - does the random nature of sampling facilitate shorter representation ?

Let's explore the code through this [notebook](VAE_code.ipynb)
- VAE code
- CVAE code

In [2]:
print("Done")

Done
