# Variational Bayesian Inference
## What
Bayesian inference can address many inference problems. The techniques for solving them fall into two categories: exact methods and approximate methods.

$$p(\theta|y) = \frac{p(y|\theta) p(\theta)}{\int p(y,\theta) \,d\theta}$$

Most of the time, analytical solutions (exact methods) are not available, and numerical methods (e.g. MCMC) can be too computationally expensive.
Variational inference is an approximate method that is generally considered more efficient than MCMC. The word "variational" comes from the calculus of variations, which works on functionals.

In order to approximate the true posterior $P(x|D)$, we take a set of candidate distributions $\{Q_i(x; \theta)\}$ and select the one that minimizes the KL divergence:
$$D_{KL}(Q(x)||P(x|D)) = \int Q(x) \log \frac{Q(x)}{P(x|D)}\,dx $$
where $D$ is the data set and $x$ denotes the unobserved variables.
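
As a rough illustration of selecting a candidate by minimizing the KL divergence, the sketch below evaluates $D_{KL}(Q||P)$ on a grid. The target (a two-component Gaussian mixture standing in for $P(x|D)$), the Gaussian candidate family, and the helper names `normal_pdf` and `kl_divergence` are all assumptions made for this example, not part of the method itself.

```python
import numpy as np

# Grid over a 1-D unobserved variable x (an assumption for illustration).
x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]

def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def kl_divergence(q, p, dx):
    """Riemann-sum approximation of D_KL(Q || P) on the grid."""
    mask = q > 0  # terms with Q(x) = 0 contribute nothing
    return np.sum(q[mask] * np.log(q[mask] / p[mask])) * dx

# Hypothetical target standing in for P(x|D): a two-component mixture.
p = 0.7 * normal_pdf(x, -1.0, 0.8) + 0.3 * normal_pdf(x, 2.0, 1.0)

# Candidate family {Q_i}: Gaussians indexed by (mean, std).
means = np.linspace(-3.0, 3.0, 61)
stds = np.linspace(0.3, 3.0, 28)
best_kl, best_mean, best_std = min(
    ((kl_divergence(normal_pdf(x, m, s), p, dx), m, s)
     for m in means for s in stds),
    key=lambda t: t[0],
)
print(f"best candidate: mean={best_mean:.2f}, std={best_std:.2f}, KL={best_kl:.4f}")
```

A grid search is only practical for a toy problem like this one; it is used here to keep the selection step explicit.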

## Transformation
Then we rewrite the KL formula:

\begin{equation}
\begin{split}
D_{KL}(Q(x)||P(x|D)) &= \int Q(x) \log \frac{Q(x)}{P(x|D)}\,dx\\
&= -\int Q(x) \log \frac{P(x|D)}{Q(x)}\,dx\\
&= -\int Q(x) \big[\log \frac{P(x,D)}{Q(x)} - \log P(D) \big]\,dx\\
&= -\int Q(x) \log \frac{P(x,D)}{Q(x)} \,dx + \log P(D)\\
\end{split}
\end{equation}

In the third line, the conditional $P(x|D)$ is replaced by the joint $P(x, D)$ and the evidence $P(D)$, using $P(x|D) = P(x, D)/P(D)$; the last line then uses $\int Q(x)\,dx = 1$ to pull $\log P(D)$ out of the integral. The reason for this rewrite is that for Bayesian networks with exponential-family nodes, the $\log P(x, D)$ term is a very simple sum of node energy terms, whereas $\log P(x|D)$ is more complicated. This simplifies later computations.

Then we have:
$$ 
\log P(D) = D_{KL}(Q(x)||P(x|D)) + L
$$
where 
$$ 
L = \int Q(x) \log \frac{P(x,D)}{Q(x)} \,dx 
$$ 
is called the evidence lower bound (ELBO). Because the KL divergence is non-negative, $L \le \log P(D)$.

Because $\log P(D)$ does not depend on $Q$, minimizing $D_{KL}(Q(x)||P(x|D))$ is equivalent to maximizing the ELBO $L$.
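
The decomposition $\log P(D) = D_{KL} + L$ can be checked numerically. The sketch below assumes a toy conjugate model (prior $x \sim N(0, 1)$, likelihood $D|x \sim N(x, 1)$, one hypothetical observation `d = 1.5`) and an arbitrary Gaussian $Q$; none of these choices come from the text above, and the integrals are approximated by Riemann sums on a grid.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]
d = 1.5  # hypothetical single observation

# Joint P(x, D) = prior(x) * likelihood(D | x), evaluated on the grid.
joint = normal_pdf(x, 0.0, 1.0) * normal_pdf(d, x, 1.0)
evidence = np.sum(joint) * dx   # P(D), by numerical integration
posterior = joint / evidence    # P(x | D)
log_evidence = np.log(evidence)

# An arbitrary (deliberately imperfect) approximating distribution Q.
q = normal_pdf(x, 0.3, 0.9)

elbo = np.sum(q * np.log(joint / q)) * dx    # L
kl = np.sum(q * np.log(q / posterior)) * dx  # D_KL(Q || P(x|D))

print(f"log P(D) = {log_evidence:.6f}")
print(f"KL + L   = {kl + elbo:.6f}")
```

Changing the mean or standard deviation of `q` moves `kl` and `elbo` in opposite directions while their sum stays fixed, which is the reason maximizing $L$ and minimizing the KL divergence select the same $Q$.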

The approximation amounts to minimizing the *variational free energy F*, which can be interpreted as a

## How
Classical variational inference is not fully satisfactory because: (1) the candidates used for the approximation are limited to the exponential family; (2) the mean-field assumption, that the components of the vector variable $x$ are independent, is too strong.
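
For reference, the mean-field assumption mentioned in (2) is usually written as a fully factorized approximating distribution. The number of components $M$ and the per-component factors $Q_i$ are notation introduced here for illustration:

$$Q(x) = \prod_{i=1}^{M} Q_i(x_i)$$

Under this form, any posterior dependence between the components $x_i$ cannot be represented by $Q$.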

In a variational autoencoder, a neural network is employed for the approximation because of its ability to fit data.

## Monte Carlo estimator

A Monte Carlo estimator of the gradient of an expectation of a general function $f(z)$:

\begin{equation}
\begin{split}
\nabla_{\phi} E_{q_{\phi}(z)}\big[f(z)\big] &= \nabla_{\phi} \int q_{\phi}(z)f(z) \,dz \\
&= \int \big[\nabla_{\phi}q_{\phi}(z)\big] f(z) \,dz \\
&= \int \big[\nabla_{\phi}q_{\phi}(z)\big] \frac{q_{\phi}(z)}{q_{\phi}(z)}f(z) \,dz \\
&= \int \big[\nabla_{\phi}\log q_{\phi}(z)\big] q_{\phi}(z)f(z) \,dz \\
&= E_{q_{\phi}(z)}\big[f(z) \nabla_{\phi}\log q_{\phi}(z)\big]\\
&\simeq \frac{1}{L}\sum_{l=1}^{L} f(z^{(l)})\nabla_{\phi} \log q_{\phi}(z^{(l)}) \\
\end{split}
\end{equation}

where $z^{(l)} \sim q_{\phi}(z)$ are $L$ independent samples, and the fourth line uses the identity $\nabla_{\phi}\log q_{\phi}(z) = \nabla_{\phi}q_{\phi}(z) / q_{\phi}(z)$.
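
The estimator in the last line can be sketched in a few lines of NumPy. The example assumes $q_{\phi}(z) = N(\mu, 1)$ with $\phi = \mu$ and $f(z) = z^2$, chosen only because the exact gradient $2\mu$ is known for comparison; the function name `score_function_gradient` is hypothetical.

```python
import numpy as np

def f(z):
    """Illustrative test function; any function of z could be used."""
    return z ** 2

def score_function_gradient(mu, num_samples, rng):
    """Estimate d/dmu E_{z ~ N(mu, 1)}[f(z)] from samples of q."""
    z = rng.normal(loc=mu, scale=1.0, size=num_samples)  # z^(l) ~ q_phi(z)
    score = z - mu  # d/dmu log N(z; mu, 1)
    return np.mean(f(z) * score)

rng = np.random.default_rng(0)
mu = 1.5
estimate = score_function_gradient(mu, 200_000, rng)
exact = 2.0 * mu  # because E[z^2] = mu^2 + 1

print(f"estimate = {estimate:.3f}, exact = {exact:.3f}")
```

The estimate is unbiased but noisy, so a large number of samples is used here to make the comparison with the exact value meaningful.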