# Variational Autoencoder

## Resources

- [Lecture: Variational Inference](https://www.youtube.com/watch?v=UTMpM4orS30)
- [Auto-Encoding Variational Bayes](https://arxiv.org/pdf/1312.6114.pdf)
- [An Introduction to Variational Autoencoders](https://arxiv.org/pdf/1906.02691.pdf)
- [Understanding Variational Autoencoders (VAEs) from two perspectives: deep learning and graphical models](https://jaan.io/what-is-variational-autoencoder-vae-tutorial/)

## Information Theory

### Self-information
Claude Shannon's definition of self-information was chosen to meet several axioms:

- An event with probability 100% is perfectly unsurprising and yields no information.
- The less probable an event is, the more surprising it is and the more information it yields.
- If two independent events are measured separately, the total amount of information is the sum of the self-informations of the individual events.

It can be shown that there is a unique function of probability that meets these three axioms, up to a multiplicative scaling factor (equivalently, the choice of logarithm base). Broadly, given an event $x$ with probability $p(x)$, the information content is defined as follows: $I(x) = -\log p(x)$

Formally, given a random variable $X$ with probability mass function $p$, the self-information of measuring $X$ as outcome $x$ is defined as: $I_X(x) = -\log p(x)$
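
The three axioms can be checked numerically. The sketch below is only an illustration; the function name `self_information` and the choice of base 2 (bits) are assumptions, not part of the definition above.

```python
import math

def self_information(p, base=2.0):
    """Information content -log_base(p) of an event with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log(p) / math.log(base)

# Axiom 1: a certain event yields no information.
certain = self_information(1.0)

# Axiom 2: less probable events yield more information.
likely = self_information(0.9)
unlikely = self_information(0.1)

# Axiom 3: for independent events the probabilities multiply,
# so the self-informations add.
joint = self_information(0.5 * 0.25)
separate = self_information(0.5) + self_information(0.25)
```

Changing `base` only rescales every value by the same constant, which is the multiplicative factor mentioned above.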

### Entropy
The Shannon entropy of the random variable $X$ is defined as:
$$H(X) = \mathbb{E}(I_X(X)) = -\sum_x p(x)\log p(x)$$

or for a distribution $p$:
$$H(p) = -\sum_x p(x)\log p(x)$$
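
As a small illustration, the entropy of a discrete distribution can be computed directly from the sum. The helper name `entropy` and the use of natural logarithms (nats) are assumptions made for this sketch.

```python
import math

def entropy(p):
    """Shannon entropy in nats of a discrete distribution given as a list of probabilities."""
    # Outcomes with p(x) = 0 are skipped, following the convention 0 * log 0 = 0.
    return -sum(px * math.log(px) for px in p if px > 0.0)

fair = entropy([0.5, 0.5])      # the most uncertain two-outcome distribution
biased = entropy([0.9, 0.1])    # less uncertain
certain = entropy([1.0, 0.0])   # no uncertainty at all
```

The fair coin attains the largest value, $\log 2$ nats, and the entropy shrinks to zero as the distribution concentrates on a single outcome.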

### Cross entropy
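The cross-entropy of the distribution $p$ relative to a distribution $q$ is defined as:
$$H_p(q) = -\mathbb{E}_{q}(\log p(x)) = -\sum_x q(x) \log p(x)$$

In this notation the subscript $p$ is the distribution inside the logarithm and the argument $q$ is the distribution the expectation is taken over.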
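
A minimal sketch of this definition, keeping the same argument order as $H_p(q)$. The function name `cross_entropy` and the example distributions are made up for illustration.

```python
import math

def cross_entropy(p, q):
    """H_p(q) = -sum_x q(x) log p(x): the average surprise under q when p assigns the probabilities."""
    return -sum(qx * math.log(px) for px, qx in zip(p, q) if qx > 0.0)

q = [0.7, 0.2, 0.1]            # distribution the expectation is taken over
good_model = [0.6, 0.3, 0.1]   # close to q
poor_model = [0.1, 0.2, 0.7]   # far from q

h_self = cross_entropy(q, q)           # reduces to the entropy H(q)
h_good = cross_entropy(good_model, q)
h_poor = cross_entropy(poor_model, q)
```

The cross-entropy is smallest when $p = q$ and grows as $p$ assigns less probability to the outcomes that $q$ considers likely.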

### Kullback-Leibler divergence
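The Kullback-Leibler divergence is defined as:
$$D_p(q) = H_p(q) - H(q) = -\sum_x q(x) \log \frac{p(x)}{q(x)}$$

In the more common notation this is $D_{\mathrm{KL}}(q \,\|\, p)$. It is non-negative and equals zero only when $q = p$.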
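
The identity $D_p(q) = H_p(q) - H(q)$ can be verified on a small example. All helper names and the two distributions below are assumptions chosen for this sketch.

```python
import math

def entropy(q):
    """H(q) = -sum_x q(x) log q(x)."""
    return -sum(qx * math.log(qx) for qx in q if qx > 0.0)

def cross_entropy(p, q):
    """H_p(q) = -sum_x q(x) log p(x)."""
    return -sum(qx * math.log(px) for px, qx in zip(p, q) if qx > 0.0)

def kl_divergence(p, q):
    """D_p(q) = sum_x q(x) log(q(x) / p(x))."""
    return sum(qx * math.log(qx / px) for px, qx in zip(p, q) if qx > 0.0)

p = [0.5, 0.3, 0.2]
q = [0.7, 0.2, 0.1]

direct = kl_divergence(p, q)
via_entropies = cross_entropy(p, q) - entropy(q)   # same quantity, computed from the identity
reverse = kl_divergence(q, p)                      # roles of p and q swapped
same = kl_divergence(q, q)                         # divergence of a distribution from itself
```

Swapping the two arguments gives a different value, so the divergence is not symmetric and is not a distance in the metric sense.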

## Variational bound
Let us consider some dataset $X = \{x^{(i)}\}_{i=1}^N$ consisting of $N$ i.i.d. samples of some continuous or discrete variable $x$. We assume that the data are generated by some random process involving an unobserved continuous random variable $z$. The process consists of two steps: (1) a value $z^{(i)}$ is generated from some prior distribution $p_{\theta^*}(z)$; (2) a value $x^{(i)}$ is generated from some conditional distribution $p_{\theta^*}(x|z)$. We assume that the prior $p_{\theta^*}(z)$ and likelihood $p_{\theta^*}(x|z)$ come from parametric families of distributions $p_{\theta}(z)$ and $p_{\theta}(x|z)$, and that their PDFs are differentiable almost everywhere w.r.t. both $\theta$ and $z$. Unfortunately, much of this process is hidden from our view: the true parameters $\theta^*$ as well as the values of the latent variables $z^{(i)}$ are unknown to us.

Because the true posterior $p_{\theta}(z|x)$ is generally intractable, we introduce an approximation $q_{\phi}(z|x)$ (the encoder) with its own parameters $\phi$. Its KL divergence from the true posterior at datapoint $x^{(i)}$ is:

$$
\begin{aligned}
D_{p_{\theta}(z|x^{(i)})}(q_{\phi}(z|x^{(i)})) & = -\mathbb{E}_{q_{\phi}(z|x^{(i)})}\left(\log \frac{p_{\theta}(z|x^{(i)})}{q_{\phi}(z|x^{(i)})}\right) \\
& = \mathbb{E}_{q_{\phi}(z|x^{(i)})}\left(\log \frac{q_{\phi}(z|x^{(i)}) \, p_{\theta}(x^{(i)})}{p_{\theta}(x^{(i)}|z) \, p_{\theta}(z)}\right) \\
& = \mathbb{E}_{q_{\phi}(z|x^{(i)})}\left(\log \frac{q_{\phi}(z|x^{(i)}) \, p_{\theta}(x^{(i)})}{p_{\theta}(x^{(i)}, z)}\right) \\
& = -\mathbb{E}_{q_{\phi}(z|x^{(i)})}(\log p_{\theta}(x^{(i)}, z)) - H(q_{\phi}(z|x^{(i)})) + \log p_{\theta}(x^{(i)}) \\
\log p_{\theta}(x^{(i)}) & = D_{p_{\theta}(z|x^{(i)})}(q_{\phi}(z|x^{(i)})) + \mathcal{L}(\theta, \phi; x^{(i)})
\end{aligned}
$$
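
The decomposition of $\log p_{\theta}(x^{(i)})$ into a KL term plus $\mathcal{L}$ can be checked exactly on a toy model. The sketch below assumes a binary latent $z$ (the text assumes a continuous one) so that every expectation is a finite sum; all numbers are made up for illustration.

```python
import math

# Hypothetical toy model with a binary latent z and one fixed observation x.
prior = [0.6, 0.4]         # p(z)
likelihood = [0.2, 0.9]    # p(x | z) evaluated at the observed x
q = [0.3, 0.7]             # an arbitrary approximate posterior q(z | x)

joint = [pz * lik for pz, lik in zip(prior, likelihood)]   # p(x, z)
evidence = sum(joint)                                       # p(x)
posterior = [j / evidence for j in joint]                   # true p(z | x)

# L = E_q[log p(x, z)] + H(q)
expected_log_joint = sum(qz * math.log(j) for qz, j in zip(q, joint))
entropy_q = -sum(qz * math.log(qz) for qz in q)
elbo = expected_log_joint + entropy_q

# KL divergence of q from the true posterior: sum_z q(z) log(q(z) / p(z | x))
kl = sum(qz * math.log(qz / pz) for qz, pz in zip(q, posterior))

log_evidence = math.log(evidence)
```

Here `log_evidence` equals `kl + elbo` up to floating-point error, whatever `q` is chosen.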

Since the KL divergence is non-negative, $\mathcal{L}(\theta, \phi; x^{(i)})$ is a lower bound on the log-likelihood of the datapoint:

$$
\begin{aligned}
\log p_{\theta}(x^{(i)}) \geq \mathcal{L}(\theta, \phi; x^{(i)}) & = \mathbb{E}_{q_{\phi}(z|x^{(i)})}(\log p_{\theta}(x^{(i)}, z)) + H(q_{\phi}(z|x^{(i)})) \\
& = \mathbb{E}_{q_{\phi}(z|x^{(i)})}(\log p_{\theta}(x^{(i)} | z)) + \mathbb{E}_{q_{\phi}(z|x^{(i)})}(\log p_{\theta}(z)) + H(q_{\phi}(z|x^{(i)})) \\
& = \mathbb{E}_{q_{\phi}(z|x^{(i)})}(\log p_{\theta}(x^{(i)} | z)) - \left[H_{p_{\theta}(z)}(q_{\phi}(z|x^{(i)})) - H(q_{\phi}(z|x^{(i)}))\right] \\
& = \mathbb{E}_{q_{\phi}(z|x^{(i)})}(\log p_{\theta}(x^{(i)} | z)) - D_{p_{\theta}(z)}(q_{\phi}(z|x^{(i)}))
\end{aligned}
$$
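
To see the bound in action, the final form (expected log-likelihood minus KL divergence from the prior) can be evaluated for several choices of $q$. This is a sketch on a hypothetical binary-latent toy model, not the continuous setting of the text; the function name `elbo` and all numbers are assumptions.

```python
import math

# Hypothetical toy model with a binary latent z and one fixed observation x.
prior = [0.6, 0.4]         # p(z)
likelihood = [0.2, 0.9]    # p(x | z) evaluated at the observed x

def elbo(q):
    """E_q[log p(x | z)] minus the KL divergence of q from the prior."""
    expected_log_lik = sum(qz * math.log(lik) for qz, lik in zip(q, likelihood) if qz > 0.0)
    kl_from_prior = sum(qz * math.log(qz / pz) for qz, pz in zip(q, prior) if qz > 0.0)
    return expected_log_lik - kl_from_prior

evidence = sum(pz * lik for pz, lik in zip(prior, likelihood))
log_evidence = math.log(evidence)
posterior = [pz * lik / evidence for pz, lik in zip(prior, likelihood)]

# A few arbitrary candidates, with the true posterior last.
candidates = [[0.5, 0.5], [0.1, 0.9], [0.9, 0.1], posterior]
values = [elbo(q) for q in candidates]
```

Every entry of `values` stays at or below `log_evidence`, and the bound is tight only for the last candidate, where $q$ equals the true posterior.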

The first term is the expected log-likelihood of the $i$-th datapoint; its negative is the reconstruction loss. The expectation is taken with respect to the encoder’s distribution over the representations. This term encourages the decoder to learn to reconstruct the data. If the decoder’s output does not reconstruct the data well, the decoder parameterizes a likelihood distribution that places little probability mass on the true data. For example, if our goal is to model black and white images and the model places high probability on black pixels where the image is actually white, this yields the worst possible reconstruction. Poor reconstruction makes this term small, which corresponds to a large cost when the negative bound is used as the loss function.

The second term acts as a regularizer. It is the Kullback-Leibler divergence between the encoder’s distribution $q_{\phi}(z|x^{(i)})$ and the prior $p_{\theta}(z)$. This divergence is zero when $q$ matches $p$ and grows as $q$ moves away from it, so it is one measure of how close $q$ is to $p$. Maximizing the bound therefore keeps the encoder’s distribution close to the prior.

By approximating the expectation with a single sample $z \sim q_{\phi}(z|x^{(i)})$ we get:

$$
\begin{aligned}
\mathcal{L}(\theta, \phi; x^{(i)}) & = \mathbb{E}_{q_{\phi}(z|x^{(i)})}(\log p_{\theta}(x^{(i)} | z)) - D_{p_{\theta}(z)}(q_{\phi}(z|x^{(i)})) \\
& \approx \log p_{\theta}(x^{(i)} | z) - D_{p_{\theta}(z)}(q_{\phi}(z|x^{(i)}))
\end{aligned}
$$
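
The single-sample approximation of the expected log-likelihood can be illustrated in one dimension. The sketch assumes a Gaussian encoder $q_{\phi}(z|x) = \mathcal{N}(\mu, \sigma^2)$ and a hypothetical linear-Gaussian decoder $p_{\theta}(x|z) = \mathcal{N}(wz + b, 1)$, chosen only because the exact expectation is then known; none of these names or values come from the text.

```python
import math
import random

def log_normal_pdf(x, mean, std):
    """Log-density of N(mean, std^2) at x."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - 0.5 * ((x - mean) / std) ** 2

# Hypothetical values: one observation, encoder outputs, decoder weights.
x = 1.0
mu, sigma = 0.8, 0.5
w, b = 2.0, -0.1

def log_likelihood_sample(rng):
    """Draw z ~ q(z | x) and evaluate log p(x | z) at that sample."""
    z = rng.gauss(mu, sigma)
    return log_normal_pdf(x, w * z + b, 1.0)

rng = random.Random(0)
single = log_likelihood_sample(rng)   # the single-sample estimate
average = sum(log_likelihood_sample(rng) for _ in range(20000)) / 20000

# Exact E_q[log p(x | z)] for this linear-Gaussian decoder.
exact = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - w * mu - b) ** 2 + (w * sigma) ** 2)
```

A single draw is noisy, but the estimate is unbiased: averaging many draws approaches `exact`.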

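When the encoder $q_{\phi}(z|x^{(i)})$ and the prior $p_{\theta}(z)$ are both Gaussian, the KL term has a closed form, and the log-likelihood term is evaluated at the sampled $z$. The whole expression can therefore be computed and used for training.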
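
As an example of such a closed form, assume a one-dimensional encoder $q = \mathcal{N}(\mu, \sigma^2)$ and a standard normal prior $\mathcal{N}(0, 1)$ (both are assumptions for this sketch). The KL term is then $\frac{1}{2}(\mu^2 + \sigma^2 - 1 - \log \sigma^2)$, which the code below compares against a numerical integral of $q(z) \log \frac{q(z)}{p(z)}$.

```python
import math

def kl_from_standard_normal(mu, sigma):
    """Closed-form KL divergence of q = N(mu, sigma^2) from the prior N(0, 1)."""
    return 0.5 * (mu * mu + sigma * sigma - 1.0 - math.log(sigma * sigma))

def kl_numeric(mu, sigma, lo=-12.0, hi=12.0, steps=20000):
    """Midpoint-rule integral of q(z) * log(q(z) / p(z)), used only as a sanity check."""
    def log_pdf(z, mean, std):
        return -0.5 * math.log(2.0 * math.pi * std * std) - 0.5 * ((z - mean) / std) ** 2

    dz = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        z = lo + (i + 0.5) * dz
        log_q = log_pdf(z, mu, sigma)
        total += math.exp(log_q) * (log_q - log_pdf(z, 0.0, 1.0)) * dz
    return total

closed = kl_from_standard_normal(0.8, 0.5)
numeric = kl_numeric(0.8, 0.5)
at_prior = kl_from_standard_normal(0.0, 1.0)   # encoder equal to the prior
```

The two values agree closely, and the divergence vanishes when the encoder's distribution coincides with the prior.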