### Variational autoencoder (VAE)

Based on Agustinus Kristiadi's blog [link](https://wiseodd.github.io/techblog/2016/12/10/variational-autoencoder/)

A VAE generates data from a `latent variable` that it learns to infer from the data. This differs from a vanilla `GAN`, which generates data "blindly" from random noise.

### A game
Before we start let's play a little game. Source [link](https://towardsdatascience.com/probability-concepts-explained-marginalisation-2296846344fc).

Suppose we have four dice: a 4-sided die, a 6-sided die, an 8-sided die and a 10-sided die (as shown below).

The game:

- I put a six-sided and an eight-sided die in a red box and a four-sided and ten-sided die in a blue box.
- I select a die from each of the red and blue boxes at random and put them in a yellow box.
- I select a die at random from the yellow box, roll the die and tell you the result.

After playing the game we are told the result is 3. The question that we want to answer is: Did the die most likely come from the red box or the blue box originally?


Solution:

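$$P(\text{result}=3 \mid \text{box}=\text{red}) = P(\text{result}=3, \text{die}=\text{6-sided} \mid \text{box}=\text{red}) + P(\text{result}=3, \text{die}=\text{8-sided} \mid \text{box}=\text{red}) = 0.5 \times \frac{1}{6} + 0.5 \times \frac{1}{8} \approx 0.146$$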


Same for blue ...
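
The calculation for both boxes can be checked with a few lines of Python. This is a minimal sketch using exact fractions; the helper name `prob_of_roll` and the `boxes` dictionary are illustrative choices, not part of the original game description. Because each box contributes exactly one die to the yellow box, both boxes have the same prior probability of 1/2, so comparing the two conditional probabilities is enough to decide which box is more likely.

```python
from fractions import Fraction

# Dice in each box, described by their number of sides (from the game setup).
boxes = {"red": [6, 8], "blue": [4, 10]}


def prob_of_roll(result, sides_in_box):
    """P(roll = result | box), marginalizing over the unobserved die."""
    p_die = Fraction(1, len(sides_in_box))  # each die in the box is equally likely
    total = Fraction(0)
    for sides in sides_in_box:
        # A die with `sides` faces shows `result` with probability 1/sides,
        # or 0 if the result is larger than the number of faces.
        p_result = Fraction(1, sides) if 1 <= result <= sides else Fraction(0)
        total += p_die * p_result
    return total


p_red = prob_of_roll(3, boxes["red"])    # 1/2 * 1/6 + 1/2 * 1/8 = 7/48
p_blue = prob_of_roll(3, boxes["blue"])  # 1/2 * 1/4 + 1/2 * 1/10 = 7/40

print(float(p_red), float(p_blue))
print("more likely box:", "red" if p_red > p_blue else "blue")
```

With these numbers the blue box comes out ahead (7/40 versus 7/48).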

Notice that we never observed which die was picked. Summing over the unobserved die like this is called marginalization, and we will use it later in the VAE derivation.

The equation is:

$$P(X) = \sum_y  P(X, Y=y)$$

then using the rule $P(A|B) = \frac{P(A\cap B)}{P(B)}$ gives

$$P(X) = \sum_y  P(X|Y=y)\times P(Y=y)$$


### VAE

Consider a joint probability distribution $P(X, z)$, where $X$ and $z$ represent the data and the latent variables respectively. Marginalizing the joint distribution over $z$ gives:

$$P(X) = \int_z P(X|z) P(z) dz$$
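
To make the integral concrete, here is a hedged sketch that estimates it by Monte Carlo sampling: draw $z$ from $P(z)$ and average $P(X|z)$. The toy model is an assumption chosen for illustration and is not from the source: $z \sim N(0, 1)$ and $X|z \sim N(z, 1)$, for which the marginal is known to be $X \sim N(0, 2)$. The function names `normal_pdf` and `marginal_mc` are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    u = (x - mean) / std
    return math.exp(-0.5 * u * u) / (std * math.sqrt(2.0 * math.pi))


def marginal_mc(x, num_samples=200_000, seed=0):
    """Monte Carlo estimate of P(x) = integral of P(x|z) P(z) dz.

    Toy model (an assumption for illustration): z ~ N(0, 1), x | z ~ N(z, 1).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)         # sample z from the prior P(z)
        total += normal_pdf(x, z, 1.0)  # evaluate the likelihood P(x | z)
    return total / num_samples


x = 1.3
estimate = marginal_mc(x)
exact = normal_pdf(x, 0.0, math.sqrt(2.0))  # for this toy model, x ~ N(0, 2)
print(estimate, exact)
```

The two printed numbers should agree to a few decimal places. In a real VAE the likelihood $P(X|z)$ is a neural network and $z$ is high-dimensional, so this naive estimate becomes impractical, which is one motivation for the approximation that follows.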

The idea of a VAE is to infer $P(z)$ using $P(z|X)$: we want to know what $z$ is likely to be given an $X$, or in other words, the latent variable needs to understand our data. However, we don't know $P(z|X)$, so we approximate it with $Q(z|X)$ and measure how close the two are using the KL divergence.

$$D_{KL}[Q(z|X) \| P(z|X)] = E[\log Q(z|X) - \log P(z|X)]$$

Here the expectation is taken with respect to $z \sim Q(z|X)$.
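
The definition can be checked numerically when both distributions are univariate Gaussians, since the KL divergence then has a well-known closed form. The sketch below is an illustration under that assumption; the specific means and standard deviations are arbitrary, and the function names are hypothetical. It samples $z$ from the first distribution and averages the difference of log densities, exactly as the expectation prescribes.

```python
import math
import random


def log_normal_pdf(x, mean, std):
    """Log density of a univariate normal distribution."""
    u = (x - mean) / std
    return -0.5 * u * u - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def kl_gaussians(mean_q, std_q, mean_p, std_p):
    """Closed-form KL[N(mean_q, std_q^2) || N(mean_p, std_p^2)]."""
    return (math.log(std_p / std_q)
            + (std_q ** 2 + (mean_q - mean_p) ** 2) / (2.0 * std_p ** 2)
            - 0.5)


def kl_monte_carlo(mean_q, std_q, mean_p, std_p, num_samples=200_000, seed=0):
    """Estimate E_{z ~ Q}[log Q(z) - log P(z)] by sampling z from Q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(mean_q, std_q)
        total += log_normal_pdf(z, mean_q, std_q) - log_normal_pdf(z, mean_p, std_p)
    return total / num_samples


# Arbitrary example values: Q = N(0.5, 0.8^2), P = N(0, 1).
closed_form = kl_gaussians(0.5, 0.8, 0.0, 1.0)
estimate = kl_monte_carlo(0.5, 0.8, 0.0, 1.0)
print(closed_form, estimate)
```

Note that the KL divergence is not symmetric: swapping the two distributions in `kl_gaussians` generally gives a different value.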


Recall Bayes' theorem: $P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}$

$$D_{KL}[Q(z|X) \| P(z|X)] = E\left[\log Q(z|X) - \log\frac{P(X|z) \times P(z)}{P(X)}\right]$$

$$D_{KL}[Q(z|X) \| P(z|X)] = E[\log Q(z|X) - \log P(X|z) - \log P(z) + \log P(X)]$$

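Since $\log P(X)$ does not depend on $z$, it can be taken out of the expectation and moved to the left-hand side:

$$D_{KL}[Q(z|X) \| P(z|X)] - \log P(X) = E[\log Q(z|X) - \log P(X|z) - \log P(z)]$$

Multiplying both sides by $-1$ and grouping $\log Q(z|X) - \log P(z)$ into a second KL divergence gives:

$$\log P(X) - D_{KL}[Q(z|X) \| P(z|X)] = E[\log P(X|z)] - D_{KL}[Q(z|X) \| P(z)]$$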
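
As a sanity check, the final identity can be verified on a toy model where every term has a closed form. The model below is an assumption for illustration only: $z \sim N(0, 1)$ and $X|z \sim N(z, 1)$, which gives $P(X) = N(0, 2)$ and $P(z|X) = N(X/2, 1/2)$. The approximation $Q(z|X) = N(m, s^2)$ uses arbitrary values of $m$ and $s$, and the function names are hypothetical.

```python
import math


def kl_gaussians(mean_q, std_q, mean_p, std_p):
    """Closed-form KL[N(mean_q, std_q^2) || N(mean_p, std_p^2)]."""
    return (math.log(std_p / std_q)
            + (std_q ** 2 + (mean_q - mean_p) ** 2) / (2.0 * std_p ** 2)
            - 0.5)


def check_identity(x, mean_q, std_q):
    """Return both sides of log P(X) - KL[Q || P(z|X)] = E[log P(X|z)] - KL[Q || P(z)].

    Toy model (an assumption): z ~ N(0, 1), x | z ~ N(z, 1).
    """
    # Left-hand side: log marginal likelihood minus KL to the true posterior.
    log_px = -0.5 * math.log(2.0 * math.pi * 2.0) - x ** 2 / 4.0  # x ~ N(0, 2)
    kl_to_posterior = kl_gaussians(mean_q, std_q, x / 2.0, math.sqrt(0.5))
    lhs = log_px - kl_to_posterior

    # Right-hand side: expected log-likelihood under Q minus KL to the prior.
    # For z ~ N(m, s^2): E[(x - z)^2] = (x - m)^2 + s^2.
    expected_log_lik = (-0.5 * math.log(2.0 * math.pi)
                        - 0.5 * ((x - mean_q) ** 2 + std_q ** 2))
    kl_to_prior = kl_gaussians(mean_q, std_q, 0.0, 1.0)
    rhs = expected_log_lik - kl_to_prior
    return lhs, rhs, log_px


lhs, rhs, log_px = check_identity(x=1.3, mean_q=0.4, std_q=0.9)
print(lhs, rhs, log_px)
```

Both sides should match up to floating-point error, and both stay below $\log P(X)$ because the KL divergence is never negative. For this reason the right-hand side is commonly called the evidence lower bound (ELBO).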


So we have $Q(z|X)$, which projects the data $X$ into the latent variable space (the encoder), and $P(X|z)$, which generates data given a latent variable (the decoder).