# Variational (Bayesian) Inference 

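Probabilistic model: $p(z, \theta)=p(z|\theta)p(\theta)$, where $z$ denotes the observed data (e.g. $z = (X_{tr}, Y_{tr})$) and $\theta$ the unknown parameters.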

Training: $p(\theta | X_{tr}, Y_{tr}) = \frac{p(Y_{tr}|X_{tr}, \theta)p(\theta)}{\int{p(Y_{tr} | X_{tr}, \theta)p(\theta)d\theta}}$  ===> mathematically intractable in general (the integral in the denominator rarely has a closed form)

Testing: $p(y|x, X_{tr}, Y_{tr})=\int{p(y|x, \theta)p(\theta | X_{tr}, Y_{tr})d\theta}$ ===> mathematically intractable in general


## MCMC
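Draws samples from the posterior $p(\theta|z)$ using only its unnormalized density $p(z|\theta)p(\theta)$
- Unbiased (asymptotically, as the number of samples grows)
- Needs a very large number of samples, and the number grows with the dimension of the state space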
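
As a minimal sketch of the idea, the following random-walk Metropolis sampler draws from a posterior using only its unnormalized log-density. The toy model (prior $\theta \sim \mathcal{N}(0, 1)$, likelihood $z_i \sim \mathcal{N}(\theta, 1)$), the data values, the step size and the function names are all assumptions chosen for illustration; this model is conjugate, so the exact posterior mean $\sum_i z_i / (n+1)$ is available for comparison.

```python
import math
import random


def log_unnormalized_posterior(theta, data):
    """log p(z | theta) + log p(theta), dropping all additive constants."""
    log_prior = -0.5 * theta ** 2                                  # theta ~ N(0, 1)
    log_likelihood = sum(-0.5 * (z - theta) ** 2 for z in data)    # z_i ~ N(theta, 1)
    return log_prior + log_likelihood


def metropolis_hastings(data, n_samples=20000, step=0.5, seed=0):
    """Random-walk Metropolis sampler; it never needs the normalizer p(z)."""
    rng = random.Random(seed)
    theta = 0.0
    current = log_unnormalized_posterior(theta, data)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_unnormalized_posterior(proposal, data)
        # Accept with probability min(1, p(proposal | z) / p(theta | z)).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples


data = [1.2, 0.8, 1.5, 0.9, 1.1]             # hypothetical observations
samples = metropolis_hastings(data)[2000:]   # discard burn-in samples
mcmc_mean = sum(samples) / len(samples)
exact_mean = sum(data) / (len(data) + 1)     # closed form for this conjugate model
print(mcmc_mean, exact_mean)
```

Even in this one-dimensional example thousands of samples are used to pin down a single number, which hints at why the cost becomes a problem in high-dimensional state spaces.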

## Variational Inference
Instead of sampling from the posterior $p(\theta|z)$, approximate it directly by a simpler distribution, $p(\theta|z) \approx q(\theta)$. $q(\theta)$ can be understood as a kind of compact representation of the model, living in the so-called latent space.
- Biased
- Faster and more scalable


Latent space: refers to an abstract multi-dimensional space containing feature values that we cannot interpret directly, but which encodes a meaningful internal representation of externally observed events.

Main idea: find a posterior approximation $p(\theta|z) \approx q(\theta) \in \mathcal{Q}$ using the relative entropy (the Kullback-Leibler divergence) as the criterion function:

$L(q):= KL(q(\theta) || p(\theta | z))$, where $KL(q || p) \geq 0$

Hint:
- Entropy of a distribution: $H(p) = - \sum_{i=1}^N p(x_i)\log p(x_i)$
- Kullback-Leibler divergence: $KL(q || p) = \sum_{i=1}^N q(x_i)\log \frac{q(x_i)}{p(x_i)}$ (not symmetric: in general $KL(q || p) \neq KL(p || q)$)
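
A small sketch of the standard discrete definitions, $H(p) = -\sum_i p(x_i)\log p(x_i)$ and $KL(q || p) = \sum_i q(x_i)\log\frac{q(x_i)}{p(x_i)}$. The two example distributions and the function names are made up for illustration.

```python
import math


def entropy(p):
    """Shannon entropy H(p) in nats; terms with p_i = 0 contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def kl_divergence(q, p):
    """KL(q || p) for discrete distributions defined on the same support."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


p = [0.5, 0.25, 0.25]   # hypothetical distributions
q = [0.4, 0.4, 0.2]

print(entropy(p))
print(kl_divergence(q, p), kl_divergence(p, q))   # non-negative and not symmetric
print(kl_divergence(p, p))                        # zero when both arguments agree
```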


Solution: $L(q):= KL(q(\theta) || p(\theta | z)) \rightarrow \min_{q(\theta) \in \mathcal{Q}}$

Two problems:
1. The posterior inside the KL still cannot be computed
2. How to perform an optimization w.r.t. a distribution?


## Problem 1: The posterior inside the KL still cannot be computed

### Magic:

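The first equality uses $\int q(\theta)d\theta = 1$, the second the product rule $p(z, \theta) = p(\theta|z)p(z)$:

$\log p(z) = \int q(\theta)\log p(z)d\theta = \int q(\theta)\log \frac{p(z, \theta)}{p(\theta|z)}d\theta$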

$= \int q(\theta) \log \frac{p(z, \theta)q(\theta)}{p(\theta | z)q(\theta)}d\theta=$

$= \int q(\theta) \log \frac{p(z,\theta)}{q(\theta)}d\theta + \int q(\theta)\log \frac{q(\theta)}{p(\theta | z)}d\theta =$

$=\mathcal{L}(q(\theta))+KL(q(\theta)||p(\theta | z))$

Here:
- $\mathcal{L}(q(\theta))$ is called the evidence lower bound (ELBO) or variational lower bound
- $KL(q(\theta)||p(\theta | z))$ is the KL divergence needed for VI, which is still intractable


### ELBO

Evidence: $p(z) = \int p(z|\theta)p(\theta)d\theta$, the total (marginal) probability of observing the data.

$\log p(z)=\mathcal{L}(q(\theta))+KL(q(\theta)||p(\theta | z))$


Note that the KL divergence is intractable, but $KL(q || p) \geq 0$, so the ELBO is a lower bound on the log evidence:


$\log p(z) \geq \mathcal{L}(q(\theta))$
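
The decomposition $\log p(z)=\mathcal{L}(q)+KL(q||p(\theta|z))$ can be checked numerically on a toy model where $\theta$ takes only three values, so every quantity is a finite sum. The prior, the likelihood values and the candidate $q$ below are invented for illustration.

```python
import math

prior = [0.5, 0.3, 0.2]             # p(theta) for three hypothetical values of theta
likelihood = [0.10, 0.40, 0.70]     # p(z | theta) for the one observed z

joint = [l * p for l, p in zip(likelihood, prior)]   # p(z, theta)
evidence = sum(joint)                                # p(z)
posterior = [j / evidence for j in joint]            # p(theta | z)


def elbo(q):
    """L(q) = sum_theta q(theta) * log( p(z, theta) / q(theta) )."""
    return sum(qi * math.log(ji / qi) for qi, ji in zip(q, joint))


def kl_to_posterior(q):
    """KL(q || p(theta | z))."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, posterior))


q = [0.2, 0.5, 0.3]                 # an arbitrary approximation
print(math.log(evidence), elbo(q) + kl_to_posterior(q))   # the two values agree
print(elbo(q) <= math.log(evidence))                      # the ELBO is a lower bound
print(elbo(posterior), math.log(evidence))                # tight when q is the posterior
```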

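#### Now we can formulate an optimization problem despite the intractable posterior!

$L(q):= KL(q(\theta) || p(\theta | z)) \rightarrow \min_{q(\theta) \in \mathcal{Q}}$

Since $\log p(z)$ does not depend on $q$, minimizing this KL divergence is equivalent to maximizing the ELBO $\mathcal{L}(q(\theta))$, which only involves the tractable joint $p(z, \theta)$: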

$\mathcal{L}(q(\theta)) = \int q(\theta) \log \frac{p(z, \theta)}{q(\theta)}d\theta = \int q(\theta)\log \frac{p(z|\theta)p(\theta)}{q(\theta)}d\theta=$

$= \int q(\theta) \log p(z |\theta)d\theta + \int q(\theta)\log \frac{p(\theta)}{q(\theta)}d\theta =$

$= \mathbb{E}_{q(\theta)} \log p(z | \theta) - KL(q(\theta)||p(\theta))$

Where:

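- $\mathbb{E}_{q(\theta)} \log p(z | \theta)$ is called the data (measurement) term.
- $KL(q(\theta)||p(\theta))$ is called the regularizer.

Because $KL \geq 0$, the regularizer can only lower the ELBO, so maximizing the ELBO balances two goals:

- Make the data term $\mathbb{E}_{q(\theta)} \log p(z | \theta)$ large ($q$ puts its mass on parameters that explain the data well)
- Keep $KL(q(\theta)||p(\theta))$ small ($q$ stays close to the prior)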

The problem is now formulated as:

$\mathcal{L}(q(\theta)) = \int q(\theta) \log \frac{p(z, \theta)}{q(\theta)}d\theta \rightarrow \max_{q(\theta) \in \mathcal{Q}}$
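
The integral defining the ELBO is an expectation under $q$, so it can be estimated by averaging $\log p(z,\theta) - \log q(\theta)$ over samples $\theta \sim q$. The sketch below assumes a toy model (prior $\theta \sim \mathcal{N}(0,1)$, likelihood $z_i \sim \mathcal{N}(\theta,1)$, Gaussian $q$) and hypothetical data, chosen because the evidence and the exact posterior are known in closed form for comparison.

```python
import math
import random


def log_normal(x, mean, var):
    """Log-density of N(mean, var) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var


def log_joint(theta, data):
    """log p(z, theta) = log p(theta) + sum_i log p(z_i | theta)."""
    return log_normal(theta, 0.0, 1.0) + sum(log_normal(z, theta, 1.0) for z in data)


def elbo_estimate(m, s, data, n_samples=5000, seed=0):
    """Monte Carlo estimate of E_q[log p(z, theta) - log q(theta)] for q = N(m, s^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        theta = rng.gauss(m, s)
        total += log_joint(theta, data) - log_normal(theta, m, s * s)
    return total / n_samples


def log_evidence(data):
    """Closed-form log p(z) for this conjugate Gaussian toy model."""
    n, s1, s2 = len(data), sum(data), sum(z * z for z in data)
    return (-0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(n + 1)
            - 0.5 * (s2 - s1 * s1 / (n + 1)))


data = [1.2, 0.8, 1.5, 0.9, 1.1]                 # hypothetical observations
n = len(data)
post_mean, post_std = sum(data) / (n + 1), math.sqrt(1.0 / (n + 1))

elbo_at_posterior = elbo_estimate(post_mean, post_std, data)
elbo_at_prior = elbo_estimate(0.0, 1.0, data)    # a poor choice of q
print(log_evidence(data), elbo_at_posterior, elbo_at_prior)
```

When $q$ equals the exact posterior the integrand is constant and the estimate matches $\log p(z)$; any other $q$ gives a smaller value, which is what the maximization exploits.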






## Problem 2: How to perform an optimization w.r.t. a distribution?

Two options in general:

### 1. Mean Field Approximation (Factorized family)

$q(\theta)=\prod_{j=1}^mq_j(\theta_j)$, $\theta=[\theta_1, ..., \theta_m]$

Examples: 
- Mixture Model with Expectation Maximization
- Mixture Model with Expectation Propagation
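
A factorized family is typically optimized one factor at a time while the others are held fixed (coordinate ascent). The sketch below assumes a hypothetical target: a two-dimensional Gaussian posterior over $\theta = [\theta_1, \theta_2]$ with correlation 0.8, for which the optimal factor updates are known in closed form (each $q_j$ is Gaussian with variance $1/\Lambda_{jj}$, where $\Lambda$ is the precision matrix). The function name and the numbers are illustrative.

```python
def mean_field_bivariate_gaussian(mu, precision, n_iters=50):
    """Coordinate-ascent updates for q(theta) = q1(theta1) * q2(theta2)."""
    (l11, l12), (l21, l22) = precision
    m1, m2 = 0.0, 0.0                            # initial means of the two factors
    for _ in range(n_iters):
        m1 = mu[0] - (l12 / l11) * (m2 - mu[1])  # update q1 with q2 held fixed
        m2 = mu[1] - (l21 / l22) * (m1 - mu[0])  # update q2 with q1 held fixed
    # Each factor is Gaussian with variance 1 / Lambda_jj.
    return (m1, 1.0 / l11), (m2, 1.0 / l22)


# Hypothetical target: mean (1, -1), unit marginal variances, correlation 0.8.
rho = 0.8
scale = 1.0 / (1.0 - rho ** 2)
precision = [[scale, -rho * scale], [-rho * scale, scale]]
(m1, var1), (m2, var2) = mean_field_bivariate_gaussian((1.0, -1.0), precision)
print(m1, m2)        # the factor means converge to the true means
print(var1, var2)    # smaller than the true marginal variance of 1.0
```

The means are recovered, but the factor variances are too small: the factorized family cannot represent the correlation between $\theta_1$ and $\theta_2$, which is one source of the bias of VI.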


### 2. Parametric Approximation (closely related to amortized inference)

$q(\theta) = q(\theta | \lambda)$

The optimization over distributions then becomes an ordinary optimization over the parameters $\lambda$.
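
As a sketch of the parametric approach, take $q(\theta|\lambda) = \mathcal{N}(m, s^2)$ with $\lambda = (m, \rho)$ and $s = e^{\rho}$, and maximize the ELBO by plain gradient ascent. The model (prior $\theta \sim \mathcal{N}(0,1)$, likelihood $z_i \sim \mathcal{N}(\theta,1)$), the data, the learning rate and the function names are assumptions made for illustration; for this model the ELBO and its gradient have closed forms, and the exact posterior is $\mathcal{N}(\sum_i z_i/(n+1),\ 1/(n+1))$.

```python
import math


def elbo(m, rho, data):
    """Closed-form ELBO for q = N(m, exp(rho)^2) in the toy Gaussian model."""
    s2 = math.exp(2.0 * rho)
    data_term = sum(-0.5 * math.log(2 * math.pi) - 0.5 * ((z - m) ** 2 + s2)
                    for z in data)
    kl_to_prior = 0.5 * (s2 + m * m - 1.0 - 2.0 * rho)   # KL(q || N(0, 1))
    return data_term - kl_to_prior


def elbo_gradient(m, rho, data):
    """Partial derivatives of the ELBO with respect to m and rho."""
    n = len(data)
    s2 = math.exp(2.0 * rho)
    return sum(data) - (n + 1) * m, 1.0 - (n + 1) * s2


def fit(data, lr=0.05, n_steps=500):
    m, rho = 0.0, 0.0                    # start from q = N(0, 1)
    for _ in range(n_steps):
        grad_m, grad_rho = elbo_gradient(m, rho, data)
        m += lr * grad_m                 # gradient ascent: the ELBO is maximized
        rho += lr * grad_rho
    return m, rho


data = [1.2, 0.8, 1.5, 0.9, 1.1]         # hypothetical observations
m, rho = fit(data)
n = len(data)
print(m, sum(data) / (n + 1))            # fitted vs exact posterior mean
print(math.exp(2.0 * rho), 1.0 / (n + 1))  # fitted vs exact posterior variance
```

Here the family $\mathcal{Q}$ contains the exact posterior, so the optimum recovers it; in general only the closest member of the family is found.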




## Example: Variational Auto-Encoder
***
### Auto Encoder

([1] https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73)

Auto-encoders originate from dimensionality reduction, which is used for (but not limited to) data visualisation, data storage and heavy computation. In that sense they are related to Singular Value Decomposition (SVD) and Principal Component Analysis (PCA).

Idea: compress the original data into its most important / informative features, such that the original data can be reconstructed from them (the original data serves as its own label). Such features are very hard or impossible to handcraft.

Latent space (bottleneck): representation (model) of the most informative features

![alt text](figures/AE.png "Idea of Auto Encoder")


Within an AE, for a given set of possible encoders $E$ and decoders $D$, we are looking for the pair that keeps the maximum of information when encoding and therefore has the minimum reconstruction error when decoding.

$(enc^*, dnc^*) = \arg\min_{(enc,dnc) \in E \times D}~\epsilon (z, dnc(enc(z)))$
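
A minimal sketch of this search, assuming the simplest possible families: a linear encoder from 2-D data to a 1-D code and a linear decoder back, trained by gradient descent on the mean squared reconstruction error. The toy data (points on the line $x_2 = 2x_1$), the initial weights, the learning rate and all names are hypothetical.

```python
def encode(x, w):
    """Linear encoder: 2-D input -> 1-D latent code."""
    return w[0] * x[0] + w[1] * x[1]


def decode(code, v):
    """Linear decoder: 1-D latent code -> 2-D reconstruction."""
    return (v[0] * code, v[1] * code)


def reconstruction_error(data, w, v):
    """Mean squared error between each point and its reconstruction."""
    total = 0.0
    for x in data:
        x_hat = decode(encode(x, w), v)
        total += (x[0] - x_hat[0]) ** 2 + (x[1] - x_hat[1]) ** 2
    return total / len(data)


def train(data, w, v, lr=0.05, n_steps=2000):
    """Gradient descent on the reconstruction error w.r.t. encoder and decoder."""
    w, v = list(w), list(v)
    n = len(data)
    for _ in range(n_steps):
        grad_w, grad_v = [0.0, 0.0], [0.0, 0.0]
        for x in data:
            code = encode(x, w)
            x_hat = decode(code, v)
            residual = (x_hat[0] - x[0], x_hat[1] - x[1])
            back = residual[0] * v[0] + residual[1] * v[1]   # error passed to the encoder
            for k in range(2):
                grad_v[k] += 2.0 * residual[k] * code / n
                grad_w[k] += 2.0 * back * x[k] / n
        for k in range(2):
            w[k] -= lr * grad_w[k]
            v[k] -= lr * grad_v[k]
    return w, v


# Hypothetical data lying on a line, so one latent dimension is enough.
data = [(-1.0, -2.0), (-0.5, -1.0), (0.0, 0.0), (0.5, 1.0), (1.0, 2.0)]
w0, v0 = (0.5, 0.5), (0.5, 0.5)
initial_error = reconstruction_error(data, w0, v0)
w, v = train(data, w0, v0)
final_error = reconstruction_error(data, w, v)
print(initial_error, final_error)
```

With linear maps this amounts to finding the same subspace as PCA; practical auto-encoders use non-linear networks for both parts.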

![alt text](figures/AE_loss.png "Idea of Auto Encoder with loss terms")

#### Limitation: The latent space is not, or at least not enough, regularized. 

After training, the decoder cannot produce meaningful new content from arbitrary points of the latent space => the model is strongly overfitted, which goes against the idea of generative modelling.





### Variational Auto-Encoder [1]

Idea: instead of learning a deterministic latent code, learn a distribution over the latent space (typically a normal distribution)...

![alt text](figures/VAE_idea.png "Idea of Variational Auto Encoder")

![alt text](figures/VAE_loss.png "Variational Auto Encoder with loss")

### Recap: the loss of VBI in terms of the ELBO

$\mathcal{L}(q(\theta)) = \int q(\theta) \log \frac{p(z, \theta)}{q(\theta)}d\theta = \int q(\theta)\log \frac{p(z|\theta)p(\theta)}{q(\theta)}d\theta=$

$= \mathbb{E}_{q(\theta)} \log p(z | \theta) - KL(q(\theta)||p(\theta))$

Where:

- $\mathbb{E}_{q(\theta)} \log p(z | \theta)$ is called data / measurements term.
- $KL(q(\theta)||p(\theta))$ is called regularizer


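The loss function of the VAE contains the same two parts. Note the change of notation: in the VAE figures $x$ denotes the data and $z$ the latent variable, whereas above $z$ denoted the data and $\theta$ the unknown quantity.
- reconstruction loss: $|| x - d(z)||^2$, which plays the role of the data term in the loss of VBI (up to sign and scaling, assuming a Gaussian decoder)
- KL-regularizer: $KL(\mathcal{N}(\mu_x, \sigma_x) || \mathcal{N}\mathbf{(0, I)})$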
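
The two terms can be sketched for a single data point as follows. This assumes a diagonal Gaussian $\mathcal{N}(\mu_x, \mathrm{diag}(\sigma_x^2))$ produced by the encoder, for which the KL divergence to $\mathcal{N}(0, I)$ has the closed form $\frac{1}{2}\sum_j(\sigma_j^2 + \mu_j^2 - 1 - \log\sigma_j^2)$, and a latent sample drawn as $z = \mu_x + \sigma_x \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. The decoder here is a stand-in identity function and all values are hypothetical; a real VAE would use neural networks for both encoder and decoder.

```python
import math
import random


def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form."""
    return 0.5 * sum(s * s + m * m - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))


def sample_latent(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def squared_error(x, x_hat):
    """Reconstruction loss || x - x_hat ||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def vae_loss(x, mu, sigma, decoder, rng):
    """Reconstruction loss plus KL regularizer for a single data point."""
    z = sample_latent(mu, sigma, rng)
    return squared_error(x, decoder(z)) + kl_to_standard_normal(mu, sigma)


rng = random.Random(0)
x = [0.5, -0.5]                        # hypothetical data point
mu, sigma = [0.5, -0.5], [1.0, 1.0]    # hypothetical encoder outputs for x


def identity_decoder(z):
    """Placeholder decoder d(z); stands in for a trained network."""
    return z


loss = vae_loss(x, mu, sigma, identity_decoder, rng)
print(kl_to_standard_normal(mu, sigma), loss)
```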



### Intuition about the regularisation 

![alt text](figures/VAE_Intuition.png "Variational Auto Encoder: Intuition")

