# Variational Autoencoder (VAE)
A VAE is a type of generative neural network architecture that consists of an encoder and a decoder.

* Traditional autoencoder
<img src="https://miro.medium.com/v2/resize:fit:1234/format:webp/1*qVx8wDUpqIqpEWy6-fW3Cg.png" width=400>

* VAE
<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*kXiln_TbF15oVg7AjcUEkQ.png" width=400>

* Encoder: the encoder takes an input and compresses it into a latent representation. A traditional autoencoder outputs a single point (vector) in the latent space, whereas a VAE outputs a probability distribution over the latent space

* Decoder: the decoder takes a sample from the distribution that the encoder produced in the latent space and tries to reconstruct the input as accurately as possible by upsampling the encoding

After training, we discard the encoder, sample latent vectors from the prior distribution over the latent space, and pass them through the decoder to generate new outputs

### Issue with traditional autoencoder and advantages of VAE
A traditional autoencoder maps each input to a single point in the latent space, so the training data occupies only a discrete set of points. During generation, we can only reliably decode those points, which means we are essentially replicating the same set of images. If we decode a point that lies in the gaps of this discontinuous latent space, the output will be unrealistic because the decoder has never been trained on such inputs.

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*-i8cp3ry4XS-05OWPAJLPg.png" width=300>

Compared to a traditional autoencoder, a VAE has the following advantages:
1. A VAE learns the distribution of the input data and can generate samples similar to the original inputs
2. Since the VAE encoder outputs a distribution, its latent space is smooth and continuous, meaning we can interpolate between 2 points and generate outputs with a smooth transition
<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*96ho7qSyW0nKrLvSoZHOtA.png">
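The interpolation in point 2 can be sketched in a few lines. This is a minimal illustration rather than a full pipeline: `interpolate_latents`, `z_a`, and `z_b` are hypothetical names, and the two latent vectors are made-up stand-ins for the encodings of two images. Each row of the result would be passed to the decoder to produce one frame of the transition.

```python
import numpy as np

def interpolate_latents(z_start, z_end, num_steps):
    """Return num_steps latent vectors evenly spaced on the line from z_start to z_end."""
    z_start = np.asarray(z_start, dtype=float)
    z_end = np.asarray(z_end, dtype=float)
    # t runs from 0 (z_start) to 1 (z_end), both endpoints included.
    t = np.linspace(0.0, 1.0, num_steps)[:, None]
    return (1.0 - t) * z_start + t * z_end

# Two hypothetical 2-D latent codes, e.g. the encodings of two different images.
z_a = np.array([-1.0, 0.5])
z_b = np.array([1.0, -0.5])

path = interpolate_latents(z_a, z_b, num_steps=5)
print(path)  # one latent vector per row, ready to be decoded
```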

### Encoder
To output a distribution, the encoder outputs 2 vectors: one for the means, $\mu$, and one for the standard deviations, $\sigma$. Each pair of $\mu_i$ and $\sigma_i$ describes the distribution of 1 feature in the embedding of the input. With these 2 vectors, we can randomly sample each feature to form an encoding and pass it on to the decoder

<img src="https://miro.medium.com/v2/resize:fit:1314/format:webp/1*CiVcrrPmpcB1YGMkTF7hzA.png" width=500>
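As a rough sketch of the two-vector output, the snippet below uses two untrained linear heads in NumPy. The sizes, the weight names (`W_mu`, `W_log_sigma`), and the choice to predict $\log\sigma$ and exponentiate it are all assumptions made for illustration; the notes above only say that the encoder outputs $\mu$ and $\sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)

input_dim, latent_dim = 8, 2  # hypothetical sizes

# Two linear heads that share the same input: one for the means, one for log(sigma).
W_mu = rng.normal(scale=0.1, size=(input_dim, latent_dim))
W_log_sigma = rng.normal(scale=0.1, size=(input_dim, latent_dim))

def encode(x):
    """Map a batch of inputs to per-feature means and standard deviations."""
    mu = x @ W_mu
    # Predicting log(sigma) and exponentiating keeps sigma strictly positive.
    sigma = np.exp(x @ W_log_sigma)
    return mu, sigma

x = rng.normal(size=(4, input_dim))   # batch of 4 stand-in inputs
mu, sigma = encode(x)

# Sample each latent feature independently from N(mu_i, sigma_i^2).
z = rng.normal(loc=mu, scale=sigma)
print(mu.shape, sigma.shape, z.shape)
```

A real encoder would have convolutional or fully connected layers before these two heads; only the final split into $\mu$ and $\sigma$ is shown here.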

VAEs support (approximate) density estimation, provide an encoder that maps data back into the latent space, and train more stably than GANs, but they also tend to generate lower-quality results

## Loss function
### Loss for traditional autoencoder
The loss function for a traditional autoencoder is simply the binary cross-entropy (BCE) loss or the mean squared error (MSE) loss. It tells the model how different the generated image is from the input image, and is called the reconstruction loss

<img src="https://miro.medium.com/v2/resize:fit:1400/0*zLOeTMUQ67OrqjLp.png" width=600>
<img src="https://miro.medium.com/v2/resize:fit:808/1*-e1QGatrODWpJkEwqP4Jyg.png" width=300>

$y_i$: the $i$th pixel of the input image, or the $i$th element of its feature map

$\hat{y_i}$: the $i$th pixel of the image generated by the decoder, or the $i$th element of its feature map

$N$: number of pixels in the image, or number of elements in the extracted feature map

The reconstruction loss can be calculated either on the pixel distance between the input and generated images or on the distance between their feature maps. The feature map of an image can be obtained by passing it through a pre-trained network and extracting the activations from a middle layer. In general, the feature-map distance is more reliable than the pixel distance
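Both reconstruction losses can be written in a few lines of NumPy. This sketch assumes the pictured formulas are the usual mean-over-$N$ definitions of BCE and MSE; the function names and the example values are made up for illustration. The same functions apply whether `y` and `y_hat` hold pixels or flattened feature-map activations (BCE additionally assumes values in $[0, 1]$).

```python
import numpy as np

def mse_loss(y, y_hat):
    """Mean squared error between the input y and the reconstruction y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.mean((y - y_hat) ** 2)

def bce_loss(y, y_hat, eps=1e-12):
    """Binary cross-entropy; expects values in [0, 1]."""
    y = np.asarray(y, dtype=float)
    # Clip predictions so that log(0) never occurs.
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

# A made-up 4-pixel "image" and its reconstruction.
y = [0.0, 1.0, 1.0, 0.0]
y_hat = [0.1, 0.8, 0.6, 0.3]

print("MSE:", mse_loss(y, y_hat))
print("BCE:", bce_loss(y, y_hat))
```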


### KL divergence term 
One issue with training the encoder on the reconstruction loss alone is that the ranges of $\mu$ and $\sigma$ can differ greatly from class to class, which leaves gaps between the distributions. However, we want the means of the classes to be reasonably close to each other so the model is able to generate outputs in between classes

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*xCjoga9IPyNUSiz9E7ao7A.png" width=500>

To resolve this problem, we introduce a KL divergence term into the loss function. This term measures the difference between the learned probability distribution over the latent space and a predefined prior distribution (usually a standard normal distribution). Minimizing this term keeps the learned distribution over the latent space close to the prior distribution, so the means and standard deviations of all classes take on similar values.

Note: the choice of the predefined prior distribution is arbitrary; it only tells the model where we want all the distributions to be centered

<img src="https://miro.medium.com/v2/resize:fit:1040/format:webp/1*uEAxCmyVKxzZOJG6afkCCg.png" width=300>

$n$: number of elements in the $\mu$ or $\sigma$ vector (they have the same size)

$\mu_i$: the $i$th element in the $\mu$ vector

$\sigma_i$: the $i$th element in the $\sigma$ vector

In this case, we choose the predefined distribution to be the normal distribution with a mean of 0 and a standard deviation of 1. Therefore, this loss term pushes every element of $\mu$ toward 0 and every element of $\sigma$ toward 1, so the $\mu$ and $\sigma$ vectors of different inputs take on similar values
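For a diagonal Gaussian with parameters $\mu$ and $\sigma$ and a standard normal prior, the KL divergence has the closed form $\frac{1}{2}\sum_{i=1}^{n}(\sigma_i^2 + \mu_i^2 - 1 - \log\sigma_i^2)$. The sketch below assumes this is the formula in the figure above; the function name is hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over the n latent features."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2))

# The term is zero exactly when the encoder output matches the prior ...
kl_at_prior = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])

# ... and grows as mu moves away from 0 or sigma moves away from 1.
kl_shifted_mean = kl_to_standard_normal([1.0, 0.0], [1.0, 1.0])
kl_narrow = kl_to_standard_normal([0.0, 0.0], [0.1, 0.1])

print(kl_at_prior, kl_shifted_mean, kl_narrow)
```

Note that a very small $\sigma$ is penalized as well, which discourages the encoder from collapsing each input back to a single point.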

However, if we only use the KL divergence loss without the reconstruction loss, the distributions for all classes are packed on top of each other and the decoder cannot decode any useful information from them. The reconstruction term and the KL divergence term should therefore be used together to make up the loss function

* Distribution by only using KL divergence term as loss function (results in posterior collapse issue)
<img src="https://miro.medium.com/v2/resize:fit:1294/format:webp/1*XdSPoB3rcb7LymviDJBJUg.png" width=350>

* Distribution by using both reconstruction cost and KL divergence term as loss function (results in dense, but distinguishable distribution)
<img src="https://miro.medium.com/v2/resize:fit:1286/format:webp/1*BIDBG8MQ9-Kc-knUUrkT3A.png" width=350>

### Full VAE loss function
The loss function used for a VAE is the sum of the reconstruction loss and the KL divergence term; we want to minimize both terms at the same time

<img src="https://miro.medium.com/v2/resize:fit:1400/format:webp/0*DJeT99qHwbF7iewk.png" width=700>

$L$: ELBO (Evidence Lower BOund). Maximizing the ELBO is equivalent to minimizing its negative, which is the total loss (reconstruction loss plus KL divergence term) that we try to minimize

$x^{(i)}$: the input image

$\theta, \phi$: parameters of the decoder and the encoder respectively

$z$: sampled latent variable

$p_{\theta}(x^{(i)}|z)$: the probability of the data given the latent variable $z$, as modeled by the decoder. The expectation of its log measures how well the decoder can reconstruct the input image, $x^{(i)}$, from a given latent vector $z$

$p_{\theta}(z)$: the predefined prior distribution (normal distribution is most commonly used)

$q_{\phi}(z|x^{(i)})$: the distribution output by the encoder given an input image $x^{(i)}$
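Using the symbols defined above, and taking $L$ to be the ELBO (the quantity to be maximized), the loss that is minimized during training can be written as the negative ELBO. This is a restatement under that sign convention, not a new result:

$$
-L(\theta, \phi; x^{(i)}) = \underbrace{-\mathbb{E}_{q_{\phi}(z|x^{(i)})}\left[\log p_{\theta}(x^{(i)}|z)\right]}_{\text{reconstruction loss}} + \underbrace{D_{KL}\left(q_{\phi}(z|x^{(i)}) \,\|\, p_{\theta}(z)\right)}_{\text{KL divergence term}}
$$

The first term is small when the decoder assigns high probability to the input image, and the second is small when the encoder's distribution stays close to the prior.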

## Training process
1. Feed an image to the encoder, which outputs parameters $\mu$ and $\sigma$ of the latent distribution
2. Sample a latent variable $z$ based on $\mu$ and $\sigma$ using the reparametrization trick
3. Feed the latent variable to the decoder to generate an output image $\hat{x}$
4. Compute the loss (the negative ELBO): the reconstruction loss plus the KL divergence term
5. Use gradient descent to minimize this loss and update the model's parameters
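Steps 1 to 4 can be traced in a single forward pass. The sketch below is a toy under several assumptions: the encoder and decoder are untrained linear maps with made-up names and sizes, the reconstruction loss is MSE, and the prior is the standard normal. Step 5 is not shown because it needs gradients, which are normally obtained from an automatic differentiation framework.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, latent_dim = 6, 2  # hypothetical sizes

# Stand-in linear encoder and decoder weights (randomly initialized, untrained).
W_mu = rng.normal(scale=0.1, size=(input_dim, latent_dim))
W_log_sigma = rng.normal(scale=0.1, size=(input_dim, latent_dim))
W_dec = rng.normal(scale=0.1, size=(latent_dim, input_dim))

def forward_and_loss(x):
    # Step 1: the encoder outputs mu and sigma.
    mu = x @ W_mu
    sigma = np.exp(x @ W_log_sigma)
    # Step 2: sample z with the reparametrization trick.
    eps = rng.standard_normal(mu.shape)
    z = mu + eps * sigma
    # Step 3: the decoder maps z to a reconstruction x_hat.
    x_hat = z @ W_dec
    # Step 4: reconstruction loss (MSE here) plus KL divergence to N(0, 1).
    recon = np.mean((x - x_hat) ** 2)
    kl = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2))
    return x_hat, recon + kl, recon, kl

x = rng.normal(size=input_dim)  # stand-in for a flattened input image
x_hat, loss, recon, kl = forward_and_loss(x)
print(f"reconstruction={recon:.4f}  kl={kl:.4f}  total={loss:.4f}")
```

In practice the two terms are often weighted against each other, since a per-pixel mean and a per-feature sum live on different scales; the unweighted sum here is only the simplest choice.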

### Reparametrization trick
To generate a latent variable, we sample a random variable from the distribution defined by $\mu$ and $\sigma$. However, backpropagation requires the gradient of the loss to flow through $z$ back to $\mu$ and $\sigma$ (and from there to the encoder's parameters), and a random sampling operation is not differentiable with respect to $\mu$ and $\sigma$. The reparametrization trick solves this problem

Instead of directly sampling $z$ from its distribution, we decompose $z$ into a deterministic component and a stochastic component by introducing a new random variable $\epsilon$, where $\epsilon$ is randomly sampled from the standard normal distribution. So the latent variable is now calculated as

$$z = \mu + \epsilon * \sigma$$

Now, $z$ still follows the original distribution defined by $\mu$ and $\sigma$, but it is a deterministic, differentiable function of $\mu$ and $\sigma$, so backpropagation works properly, while $\epsilon$ introduces the randomness into the sampling process
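A small numerical check can make both claims concrete. The sketch below, with made-up values for $\mu$ and $\sigma$ and a hypothetical `reparametrize` helper, uses finite differences to show that $\partial z / \partial \mu = 1$ and $\partial z / \partial \sigma = \epsilon$ once $\epsilon$ is fixed, and then confirms that many draws still have mean $\mu$ and standard deviation $\sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparametrize(mu, sigma, eps):
    """z = mu + eps * sigma: deterministic in mu and sigma once eps is drawn."""
    return mu + eps * sigma

mu = np.array([0.5, -1.0])
sigma = np.array([1.5, 0.2])

# With eps held fixed, z is an ordinary differentiable function of mu and sigma:
# dz/dmu = 1 and dz/dsigma = eps. Check this with finite differences.
eps = rng.standard_normal(mu.shape)
h = 1e-6
dz_dmu = (reparametrize(mu + h, sigma, eps) - reparametrize(mu, sigma, eps)) / h
dz_dsigma = (reparametrize(mu, sigma + h, eps) - reparametrize(mu, sigma, eps)) / h

# Many draws of eps still give samples distributed as N(mu, sigma^2).
samples = reparametrize(mu, sigma, rng.standard_normal((100_000, 2)))

print("dz/dmu    :", dz_dmu)
print("dz/dsigma :", dz_dsigma, "(eps =", eps, ")")
print("sample mean:", samples.mean(axis=0), " sample std:", samples.std(axis=0))
```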

## Flow model
