# Variational Autoencoders
This paper (actually titled *Auto-Encoding Variational Bayes*) introduces a fairly general inference method for directed graphical models with continuous latent variables. It is called *Stochastic Gradient Variational Bayes* (SGVB) and is based on approximating the evidence lower bound (ELBO) with differentiable Monte Carlo estimates of expectations. This differs from classical variational Bayes (variational inference), which derives the iterative update steps for optimizing the ELBO analytically.

Since generative models are popular now, they also apply this technique for generative purposes, which results in the Variational Autoencoder (VAE). The basic generative model assumes that data is generated conditioned on some latent value $z$. In the VAE we put a prior on $z$, and training regularizes the latent codes towards that prior, which makes it easier to feed in new $z$ that generate reasonable $x$.


## The problem
We have i.i.d. discrete or continuous data $X$ that is assumed to come from a generative process where
1. a latent variable $z^{(i)}$ is drawn from a prior $p_\theta(z)$
2. an observation $x^{(i)}$ is generated from a conditional distribution $p_\theta(x\ |\ z^{(i)})$.

The goal is then often to approximate the posterior, since the true posterior is often intractable to compute. Sometimes standard mean field variational inference is enough, but even a reasonable mean field approach can lead to intractable integrals. This happens already for moderately complicated likelihood functions, for example a neural network with a nonlinear hidden layer.

This is the main problem that they want to solve with the proposed inference method. It also matters for large datasets, where updates can only be computed from minibatches. The method can also be trained on streaming data, which is useful.

## The solution
They introduce the recognition model $q_\phi(z\ |\ x)$ which is an approximation of the true posterior $p_\theta(z\ |\ x)$. This can also be seen as a probabilistic encoder that, given an $x$, gives a distribution over possible latent codes $z$. The generative part of the model $p_\theta(x\ |\ z)$ can be seen as a probabilistic decoder that, given a latent code $z$, gives a distribution over possible $x$.

This solution does not make the same assumption as mean field variational inference, where $q$ is factorized over the latent variables and the variational parameters $\phi$ are computed from a closed form expectation expression.

The following method then learns the recognition model parameters (variational parameters) $\phi$ and the generative model's parameter $\theta$ jointly.

**Variational bound:** From basic variational inference we know

\begin{align*}
\log p_\theta(x^{(i)}) &= KL(q_\phi(z\ |\ x^{(i)})\ ||\ p_\theta(z\ |\ x^{(i)})) + \mathcal{L}(\theta, \phi, x^{(i)}) && \text{KL is non negative} \\
\\
\log p_\theta(x^{(i)}) &\geq \mathcal{L}(\theta, \phi, x^{(i)}) = \mathbb{E}_{q_\phi(z\ |\ x^{(i)})} \left[ \log p_\theta(x^{(i)},z) - \log q_\phi(z\ |\ x^{(i)}) \right] && \text{Variational lower bound} \\
\\
\mathcal{L}(\theta, \phi, x^{(i)}) &= -KL(q_\phi(z\ |\ x^{(i)})\ ||\ p_\theta(z)) + \mathbb{E}_{q_\phi(z\ |\ x^{(i)})} \left[ \log p_\theta(x^{(i)}\ |\ z) \right] && \text{using $p_\theta(x, z) = p_\theta(x\ |\ z)\, p_\theta(z)$}
\end{align*}
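
A small numerical check can make the decomposition concrete. The sketch below assumes a toy model that is not from the paper: $z \sim \mathcal{N}(0, 1)$ and $x\ |\ z \sim \mathcal{N}(z, 1)$, for which the evidence $p(x) = \mathcal{N}(x; 0, 2)$ and the true posterior $p(z\ |\ x) = \mathcal{N}(z; x/2, 1/2)$ are known in closed form. The helper names and the values of `x`, `m`, `v` are arbitrary choices for illustration.

```python
import math

def log_normal_pdf(x, mean, var):
    """Log density of a univariate Gaussian N(mean, var)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def kl_gauss(m1, v1, m2, v2):
    """KL(N(m1, v1) || N(m2, v2)) for univariate Gaussians (v = variance)."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def elbo_closed_form(x, m, v):
    """ELBO = -KL(q || prior) + E_q[log p(x | z)] for q(z | x) = N(m, v)."""
    expected_log_lik = -0.5 * (math.log(2 * math.pi) + (x - m) ** 2 + v)
    return -kl_gauss(m, v, 0.0, 1.0) + expected_log_lik

# Arbitrary data point and (deliberately suboptimal) approximate posterior.
x, m, v = 1.3, 0.2, 0.8

log_evidence = log_normal_pdf(x, 0.0, 2.0)   # log p(x), since p(x) = N(0, 2)
gap = kl_gauss(m, v, x / 2, 0.5)             # KL(q || true posterior)
elbo = elbo_closed_form(x, m, v)

print(log_evidence, gap + elbo)              # the two numbers should agree
```

The gap between the evidence and the bound is exactly the KL divergence to the true posterior, so the bound is tight only when $q$ matches the posterior (here $m = x/2$, $v = 1/2$).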

The idea is then to use a **Monte Carlo estimator** for this variational bound.

\begin{align*}
\mathcal{L}(\theta, \phi, x^{(i)}) &= \mathbb{E}_{q_\phi(z\ |\ x^{(i)})} \left[ \log p_\theta(x^{(i)},z) - \log q_\phi(z\ |\ x^{(i)}) \right] \\
&\approx \frac{1}{L} \sum^L_{l=1} \left[ \log p_\theta(x^{(i)},z_l) - \log q_\phi(z_l\ |\ x^{(i)}) \right] \\
z_l &\sim q_\phi(z\ |\ x^{(i)}) && \text{samples}
\end{align*}

We would then like to differentiate this estimate with respect to $\theta$ and $\phi$ to optimize it. The naive Monte Carlo gradient estimator with respect to $\phi$ exhibits very high variance (according to an earlier paper), and it is also not possible to backprop directly through a sampling operation, since $z$ is drawn from a distribution that itself depends on $\phi$. Instead they use the **reparametrization trick**, which is a way to construct samples $z \sim q_\phi(z\ |\ x)$ deterministically through a differentiable transformation $g_\phi$ of another random variable.

\begin{align*}
\epsilon &\sim p(\epsilon) && \text{random seed independent of $\phi$} \\
z &= g_\phi(\epsilon, x) && \text{differentiable perturbation}
\end{align*}
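
To get a feeling for the variance issue, here is a minimal sketch comparing two unbiased estimators of $\nabla_\mu \mathbb{E}_{z \sim \mathcal{N}(\mu, 1)}[z^2] = 2\mu$. The test function $f(z) = z^2$ and all variable names are assumptions chosen for illustration, not something taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 1.0
eps = rng.standard_normal(100_000)   # epsilon ~ N(0, 1), independent of mu
z = mu + eps                         # reparametrized samples z ~ N(mu, 1)

# Naive (score function) estimator: f(z) * d/dmu log N(z; mu, 1) = z^2 * (z - mu)
score_grads = z**2 * (z - mu)

# Reparametrization estimator: d/dmu f(mu + eps) = 2 * (mu + eps)
reparam_grads = 2.0 * z

print("true gradient:   ", 2 * mu)
print("score function:  ", score_grads.mean(), "variance", score_grads.var())
print("reparametrized:  ", reparam_grads.mean(), "variance", reparam_grads.var())
```

Both estimators average to the same gradient, but for this particular $f$ the per-sample variance of the reparametrized one is several times smaller, so fewer samples are needed for a usable gradient.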

Using all this, we can now form Monte Carlo estimates of the expectation from the lower bound, which we then differentiate to get update steps for $\theta$ and $\phi$. This is the generic **Stochastic Gradient Variational Bayes estimator** (SGVB).

\begin{align*}
\mathbb{E}_{q_\phi(z\ |\ x^{(i)})} \left[ \log p_\theta(x^{(i)},z) - \log q_\phi(z\ |\ x^{(i)}) \right] &\approx \frac{1}{L} \sum^L_{l=1} \left[ \log p_\theta(x^{(i)}, z^{(i, l)}) - \log q_\phi(z^{(i, l)}\ |\ x^{(i)}) \right] \\
z^{(i, l)} &= g_\phi(\epsilon^{(i, l)}, x^{(i)}) \\
\epsilon^{(i, l)} &\sim p(\epsilon)
\end{align*}
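
As a sanity check of the generic estimator, the sketch below evaluates it on an assumed toy model (not from the paper) with $z \sim \mathcal{N}(0, 1)$, $x\ |\ z \sim \mathcal{N}(z, 1)$ and $q_\phi(z\ |\ x) = \mathcal{N}(m, s^2)$, where the lower bound is also available in closed form. The function names and numbers are illustrative only.

```python
import numpy as np

def log_normal_pdf(x, mean, var):
    """Log density of a univariate Gaussian N(mean, var)."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def sgvb_elbo_estimate(x, m, s, num_samples, rng):
    """Generic SGVB estimate: average of log p(x, z) - log q(z | x) over samples."""
    eps = rng.standard_normal(num_samples)   # eps ~ p(eps) = N(0, 1)
    z = m + s * eps                          # z = g_phi(eps, x)
    log_joint = log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)
    log_q = log_normal_pdf(z, m, s**2)
    return np.mean(log_joint - log_q)

rng = np.random.default_rng(0)
x, m, s = 1.3, 0.2, 0.9

estimate = sgvb_elbo_estimate(x, m, s, 200_000, rng)
# Closed form for this toy model: -KL(q || prior) + E_q[log p(x | z)]
exact = (0.5 * (1 + np.log(s**2) - m**2 - s**2)
         - 0.5 * (np.log(2 * np.pi) + (x - m) ** 2 + s**2))
print(estimate, exact)
```

Because `z` is a deterministic function of `m`, `s` and `eps`, an automatic differentiation framework could differentiate `estimate` with respect to the variational parameters directly.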

So in practice, we just sample some $\epsilon$ for each $x^{(i)}$ and pass both through $g_\phi$ to get samples $z^{(i, l)}$ from the approximate posterior. The $z^{(i, l)}$ are then put through the reconstruction part $p_\theta(x\ |\ z)$.

All in all, this gives the following **training algorithm**
```
theta, phi = init_parameters()
while not done:
    X_b = next_minibatch()            # random minibatch drawn from the full dataset
    epsilon = sample_noise_inputs()   # epsilon ~ p(epsilon), independent of theta and phi
    g_theta, g_phi = compute_gradients_of_estimator(theta, phi, X_b, epsilon)
    theta, phi = update_parameters(theta, phi, g_theta, g_phi)  # SGD or another gradient optimizer
```
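
The loop above can be made runnable on a toy problem. The following is only a sketch under strong assumptions that are not from the paper: the generative model is fixed to $z \sim \mathcal{N}(0, 1)$, $x\ |\ z \sim \mathcal{N}(z, 1)$ (so there is no $\theta$ to learn), the recognition model is $q_\phi(z\ |\ x) = \mathcal{N}(ax + b, s^2)$, and the gradients of the single-sample estimator (with the KL term in closed form) are written out by hand instead of using automatic differentiation. For this model the true posterior is $\mathcal{N}(x/2, 1/2)$, so we expect $a \to 0.5$, $b \to 0$ and $s^2 \to 0.5$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Variational parameters phi of q(z | x) = N(a * x + b, exp(log_s)^2).
phi = {"a": 0.0, "b": 0.0, "log_s": 0.0}
learning_rate = 0.02

def next_minibatch(batch_size=128):
    """Draw x from the assumed generative process: z ~ N(0, 1), x = z + noise."""
    z = rng.standard_normal(batch_size)
    return z + rng.standard_normal(batch_size)

def compute_gradients_of_estimator(phi, x_b, epsilon):
    """Hand-derived gradients of -KL(q || prior) + log p(x | z), z = m + s * eps."""
    s = np.exp(phi["log_s"])
    m = phi["a"] * x_b + phi["b"]
    z = m + s * epsilon                                # reparametrized sample
    d_m = -m + (x_b - z)                               # gradient w.r.t. the mean m
    d_log_s = 1.0 - s**2 + (x_b - z) * s * epsilon     # gradient w.r.t. log_s
    return {"a": np.mean(d_m * x_b), "b": np.mean(d_m), "log_s": np.mean(d_log_s)}

for _ in range(3000):
    x_b = next_minibatch()
    epsilon = rng.standard_normal(x_b.shape)           # one noise draw per data point
    grads = compute_gradients_of_estimator(phi, x_b, epsilon)
    for name in phi:
        phi[name] += learning_rate * grads[name]       # gradient ascent on the bound

posterior_var = float(np.exp(2 * phi["log_s"]))
print(phi["a"], phi["b"], posterior_var)
```

In a real VAE both `theta` and `phi` would be neural network weights and the gradients would come from backpropagation, but the structure of the loop is the same.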

The authors point out that the KL divergence term of the lower bound can often be integrated analytically, leaving only the expected reconstruction error $\mathbb{E}_{q_\phi(z\ |\ x^{(i)})} \left[ \log p_\theta(x^{(i)}\ |\ z) \right]$ to be estimated by Monte Carlo sampling.

The KL divergence term can then be interpreted as regularizing $\phi$ to encourage the approximate posterior $q_\phi(z\ |\ x)$ to be close to the prior $p_\theta(z)$.

### Notes on reparametrization
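Basically: rewrite a continuous random variable from some complex distribution as a deterministic function of another, simpler random variable. (Does it have to be continuous? The trick as presented here needs $g_\phi$ to be differentiable, and the paper restricts itself to continuous latent variables.)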

This is useful since an expectation with respect to the complicated distribution can be rewritten as an expectation with respect to the newly introduced simple variable, so that the Monte Carlo estimate of it is differentiable with respect to the parameters we are interested in (those of the encoding network in the case of VAE).

To choose a reparametrization of $q_\phi(z\ |\ x)$ three approaches are
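1. A "location-scale" type distribution, where the $\epsilon$ distribution has location = 0 and scale = 1. Then $g_\phi = \text{location} + \text{scale} \cdot \epsilon$, where location and scale are outputs of the recognition model. E.g. Gaussian, uniform, etc.
2. Tractable inverse CDF: $\epsilon$ is uniform on $(0, 1)$ and $g_\phi$ is the inverse CDF of $q_\phi(z\ |\ x)$. E.g. exponential, Cauchy, logistic.
3. Composition: express the random variable as a transformation of auxiliary variables. E.g. log-normal (exponentiation of a Gaussian variable), gamma (a sum over exponentially distributed variables).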

TODO: Don't fully understand 2 and 3 here, what is the intuition?
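
One way to build intuition for the three approaches is to write each of them as "draw parameter-free noise, then apply a deterministic function of the parameters". The sketch below uses one example distribution per approach; the specific distributions and parameter values are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# 1. Location-scale: Gaussian. Noise eps ~ N(0, 1), z = mu + sigma * eps.
mu, sigma = 2.0, 0.5
z_gauss = mu + sigma * rng.standard_normal(n)

# 2. Inverse CDF: Exponential(rate). Noise u ~ Uniform(0, 1), and the inverse
#    CDF is F^{-1}(u) = -log(1 - u) / rate, which is differentiable in rate.
rate = 3.0
u = rng.uniform(0.0, 1.0, n)
z_exp = -np.log1p(-u) / rate

# 3. Composition: log-normal, built by exponentiating a location-scale Gaussian.
z_lognormal = np.exp(mu + sigma * rng.standard_normal(n))

print(z_gauss.mean(), "expected", mu)
print(z_exp.mean(), "expected", 1 / rate)
print(z_lognormal.mean(), "expected", np.exp(mu + sigma**2 / 2))
```

In all three cases the randomness comes from a fixed distribution, and the parameters (`mu`, `sigma`, `rate`) only enter through a differentiable function, which is what makes gradients with respect to them well defined.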

## Variational Autoencoder
The VAE is a special but useful case where the probabilistic encoder $q_\phi(z\ |\ x)$ and decoder $p_\theta(x\ |\ z)$ are based on neural networks. 

Here they pick the prior $p_\theta(z) = \mathcal{N}(z; \mathbf{0}, \mathbf{I})$ (i.e. the prior has no parameters in this case). $p_\theta(x\ |\ z)$ can be a Gaussian or a Bernoulli depending on the type of data being modeled (TODO: other examples? is it about conjugate distributions?).

The decoding MLP outputs the parameters of the output distribution. The true posterior is intractable, but they let the encoding distribution be represented by an MLP which outputs the parameters of the approximate posterior $q_\phi(z\ |\ x)$. They assume that the approximate posterior is Gaussian with a diagonal covariance.

\begin{align*}
\log q_\phi(z\ |\ x^{(i)}) &= \log \mathcal{N}(z; \mu^{(i)}, \sigma^{2(i)}\mathbf{I}) \\
\\
\mu^{(i)},\ \sigma^{2(i)} & \quad \text{outputs of encoding MLP}
\end{align*}

Samples from approximate posterior are obtained by using the reparametrization trick in the following way

\begin{align*}
z^{(i, l)} &= g_\phi(\epsilon^{(l)}, x^{(i)}) = \mu^{(i)} + \sigma^{(i)} \odot \epsilon^{(l)} \\
\\
\epsilon^{(l)} &\sim p(\epsilon) = \mathcal{N}(\mathbf{0}, \mathbf{I})
\end{align*}

In this case, since both the prior and the approximate posterior are Gaussian, the KL divergence part of the lower bound can be integrated and differentiated analytically, which means that the following estimator can be used.

\begin{align*}
\mathcal{L}(\theta, \phi, x^{(i)}) &\approx \frac{1}{2} \sum^D_{d=1} \left( 1 + \log \left( (\sigma_d^{(i)})^2 \right) - (\mu_d^{(i)})^2 - (\sigma_d^{(i)})^2 \right) + \frac{1}{L} \sum^L_{l=1} \log p_\theta(x^{(i)}\ |\ z^{(i, l)})
\end{align*}
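
The estimator above translates almost line by line into code. This is a minimal NumPy sketch for a single data point with a Bernoulli output distribution; the tiny linear-plus-sigmoid `decode` function, its random weights `W`, and the values of `x`, `mu` and `log_var` are hypothetical stand-ins for the decoding and encoding MLPs, which are not specified here.

```python
import numpy as np

def negative_kl_to_standard_normal(mu, log_var):
    """-KL(N(mu, diag(exp(log_var))) || N(0, I)), first term of the estimator."""
    return 0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

def bernoulli_log_likelihood(x, probs):
    """log p(x | z) for binary x when the decoder outputs Bernoulli means."""
    return np.sum(x * np.log(probs) + (1.0 - x) * np.log(1.0 - probs))

def vae_elbo_estimate(x, mu, log_var, decode, num_samples, rng):
    """Analytic KL term plus a Monte Carlo estimate of the reconstruction term."""
    sigma = np.exp(0.5 * log_var)
    total = 0.0
    for _ in range(num_samples):
        eps = rng.standard_normal(mu.shape)   # eps ~ N(0, I)
        z = mu + sigma * eps                  # reparametrized sample
        total += bernoulli_log_likelihood(x, decode(z))
    return negative_kl_to_standard_normal(mu, log_var) + total / num_samples

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 2))               # hypothetical decoder weights

def decode(z):
    """Hypothetical decoder: linear map followed by a sigmoid."""
    return 1.0 / (1.0 + np.exp(-(W @ z)))

x = np.array([1.0, 0.0, 1.0, 1.0, 0.0])       # one binary data point
mu = np.array([0.3, -0.1])                    # would come from the encoding MLP
log_var = np.array([-0.5, 0.2])               # log of sigma^2, also from the encoder
elbo = vae_elbo_estimate(x, mu, log_var, decode, num_samples=10, rng=rng)
print(elbo)
```

Parametrizing the encoder output as `log_var` rather than the variance itself is a common design choice, assumed here, that keeps the variance positive without constraints.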

### Example applications
In the paper they use a VAE to generate samples resembling the MNIST and Frey Face datasets.

<img src="figs/vae/vae-mnist-frey.png" width="65%" height="65%">

Other things it could be used for 

* **Learning representations**
* **Compression?**
* **Can probably have deeper models too**
* **Using other output distributions**

## Discussion and thoughts
The main thing here is the inference method which can be applied to many directed graphical models with continuous latent variables. 

A nice thing with this method is that the number of variational parameters does not increase with the size of the data, which is usually the case with variational inference that keeps local parameters per data point (?). This is simply because the encoder network's parameters are shared across all inputs when computing the approximate distribution over the latent variables.
TODO: Source for this

"
The neural network used in the encoder (variational distribution) does not lead to any richer approximating distribution. It is a way to amortize inference such that the number of parameters does not grow with the size of the data (an incredible feat, but not one for expressivity!) (Dustin Tran's blog)
The inference network takes data as input and outputs the local variational parameters relevant to each data point. The optimal inference network outputs the set of Gaussian parameters which maximizes the variational objective. This means a variational auto-encoder with a perfect inference network can only do as well as a fully factorized Gaussian with no inference network.
"

As a generative model, it is interesting to compare the VAE with Generative Adversarial Networks (GAN), because there are many pairs of similar papers where one uses a VAE and the other uses a GAN.

* At least on image data, GANs seem to generate sharper images because of the different training objective
* No control of latent space in standard GAN


There are also attempts where the two are combined (TODO: add reference).

Are VAEs better at compression? Better for representation learning? How would one use a GAN for this though?

Standard GAN: no control of latent space?

GAN and Jensen-Shannon divergence? Write about it in the GAN review?

Can we use supervision in any way to control the latent space further? Like in adversarial autoencoders.

TODO: Hmm, did I misunderstand something? When generating new data, do you sample from the prior or the approximate posterior? It should be OK to sample from the prior. But I guess we could also sample from the approximate posterior of an existing input to get something similar?
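
The two options in this question can be sketched side by side. Under the generative process from the problem statement, new data comes from ancestral sampling: draw $z$ from the prior, then decode. Sampling $z$ from $q_\phi(z\ |\ x)$ for an existing $x$ instead gives something close to that input. The linear `decode_mean` and `encode` functions below are hypothetical stand-ins for trained networks, used only to make the sketch runnable.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, X_DIM = 2, 4
W_dec = rng.standard_normal((X_DIM, LATENT_DIM))   # hypothetical decoder weights

def decode_mean(z):
    """Hypothetical decoder: mean of p(x | z)."""
    return W_dec @ z

def encode(x):
    """Hypothetical encoder: mean and standard deviation of q(z | x)."""
    mu = 0.5 * (W_dec.T @ x)
    sigma = np.full(LATENT_DIM, 0.1)
    return mu, sigma

# Option 1: generate something new by ancestral sampling from the prior.
z_prior = rng.standard_normal(LATENT_DIM)          # z ~ p(z) = N(0, I)
x_new = decode_mean(z_prior)

# Option 2: generate a variation of an existing input via q(z | x).
x_existing = decode_mean(np.array([1.0, -1.0]))
mu, sigma = encode(x_existing)
z_post = mu + sigma * rng.standard_normal(LATENT_DIM)
x_similar = decode_mean(z_post)

print(x_new, x_similar)
```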

### Problems with VAE
In the case of image data, samples from a VAE tend to be blurry (TODO: because of what?).

Generative Adversarial Networks (GAN) usually give sharper images because of the different training objective. There has been some work where VAEs and GANs are combined such that the reconstruction loss of the VAE is replaced by a loss coming from a discriminator network. The discriminator network in this case tries to distinguish samples generated by the VAE from the corresponding real sample that was the input to the VAE.


## TODO

Check the appendix as well, where they do the fully variational inferring of posterior over the parameters

read https://arxiv.org/abs/1406.5298 semisupervised vae
read https://arxiv.org/abs/1605.06197 stick breaking vae (complex prior)

actually few similarities with normal autoencoders, just resembles them

ancestral sampling?

We assume that the prior $p_{\theta^*}(z)$ and likelihood $p_{\theta^*}(x\ |\ z)$ come from parametric families of distributions $p_\theta(z)$ and $p_\theta(x\ |\ z)$, and that their PDFs are differentiable almost everywhere w.r.t. both $\theta$ and $z$.

Very importantly, we do not make the common simplifying assumptions about the marginal or posterior probabilities. Conversely, we are here interested in a general algorithm that even works efficiently in the case of: intractability and large datasets

when is it useful

SGVB is distinguished from classical variational Bayes by its use of differentiable Monte Carlo (MC) expectations.

https://openreview.net/pdf?id=S1jmAotxg nice for some other insights about vae

SGVB allows for a broad class of non-conjugate approximate posteriors and thus has the potential to expand Bayesian nonparametric models beyond the exponential family distributions to which they are usually confined.

discussion, why high variance with naive mc estimator?

can show something (TODO: what?) about normal autoencoders, but this is not enough to learn a useful representation

robustness to high dimensional latent space, ie will not overfit