
# 1. Intro

The [variational Bayesian](concept_Bayesian.ipynb)
 approach fits a variational model to the posterior, which is often intractable.
- The Stochastic Gradient Variational Bayes (SGVB) estimator (an estimator of the [variational lower bound](concept_Bayesian.ipynb) obtained with a *reparameterization trick*) enables efficient approximate posterior inference.


- For i.i.d. datasets with continuous latent variables, the paper proposes the Auto-Encoding VB (AEVB, a.k.a. **VAE**) algorithm.
   - The learned model can be used for recognition, denoising, and representation purposes.

# 2. Method

### 2.1 Problem Scenario


Assume the following data generation process:
1. sample latent variable $z$ from some prior distribution $p_{\theta^*}(z)$ 
2. observable $x$ is generated from conditional distribution $p_{\theta^*}(x|z)$

where $\theta$ denotes the model parameters and $\theta^*$ the true model parameters; both $\theta^*$ and the latent variable $z$ are unknown to us.
<img src="2014_VAE_Fig1.png" width="200"/>
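
The two-step generation process above can be sketched as ancestral sampling. This is only an illustration: the standard normal prior, the linear-Gaussian conditional, the weight matrix `W`, the noise scale `noise_std`, and the function name `sample_dataset` are all assumptions standing in for the unknown $\theta^*$.

```python
import numpy as np

def sample_dataset(n, W, noise_std, rng):
    """Ancestral sampling: z ~ p(z), then x ~ p(x|z)."""
    latent_dim = W.shape[1]
    # Step 1: draw latent variables from the (assumed) standard normal prior.
    z = rng.standard_normal((n, latent_dim))
    # Step 2: draw observations from the (assumed) conditional N(W z, noise_std^2 I).
    x = z @ W.T + noise_std * rng.standard_normal((n, W.shape[0]))
    return x, z

rng = np.random.default_rng(0)
W_true = rng.standard_normal((5, 2))  # hypothetical "true" parameter theta*
x, z = sample_dataset(1000, W_true, noise_std=0.1, rng=rng)
print(x.shape, z.shape)
```

In practice only `x` is observed; `z` and `W_true` are exactly the quantities that are hidden from us.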


In general,
1. marginal likelihood $p_{\theta}(x) = \int p_{\theta}(x|z)p_{\theta}(z) dz$ is intractable.
2. true posterior $p_{\theta}(z|x) = p_{\theta}(x|z)p_{\theta}(z)/p_{\theta}(x)$  is intractable


To deal with this, use a variational method: introduce an approximation $q_\phi(z|x)$ to the true posterior $p_\theta(z|x)$.


Note that $z$ can be interpreted as a latent **representation**, $q_\phi(z|x)$ as a recognition model serving as a probabilistic **encoder**, and $p_\theta(x|z)$ 
as a probabilistic **decoder**.

### 2.2 Variational Bound

The log marginal likelihood of a data point decomposes as

$$
\log p_\theta(x) = D_{KL}\left(q_\phi(z|x)\,||\,p_\theta(z|x)\right) + \mathcal{L}(\theta,\phi;x)
$$ 

where $D_{KL}$ is the [KL divergence](concept_KLdiv.ipynb). Because $D_{KL}\ge 0$, $\mathcal{L}(\theta,\phi;x)$ is a lower bound on $\log p_\theta(x)$ (the variational lower bound), defined by

$$\mathcal{L}(\theta,\phi;x) = \mathbb{E}_{q_\phi(z|x)}\log\frac{p_\theta(x,z)}{q_\phi(z|x)}$$

<font size="1">Since $p_\theta(x,z)=p_\theta(z|x)p_\theta(x)$, we have $\mathcal{L}(\theta,\phi;x) = \log p_\theta(x) - D_{KL}\left(q_\phi(z|x)\,||\,p_\theta(z|x)\right)$, so adding the KL term cancels everything in $\mathcal{L}(\theta,\phi;x)$ except $\log p_\theta(x)$.</font>
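
The decomposition can be checked numerically on a model where every term has a closed form. The sketch below assumes a toy model that is not part of the paper: $p(z)=\mathcal{N}(0,1)$, $p(x|z)=\mathcal{N}(z,1)$, and $q(z|x)=\mathcal{N}(m,s^2)$, for which $p(x)=\mathcal{N}(0,2)$ and $p(z|x)=\mathcal{N}(x/2,1/2)$. The function names are hypothetical.

```python
import numpy as np

def log_marginal(x):
    # log p(x) with p(x) = N(x; 0, 2) for the toy model.
    return -0.5 * np.log(2 * np.pi * 2.0) - x**2 / 4.0

def elbo(x, m, s):
    # E_q[log p(x|z)] - KL(q || p(z)) with q = N(m, s^2), both in closed form.
    expected_loglik = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s**2)
    kl_to_prior = 0.5 * (m**2 + s**2 - 1.0 - np.log(s**2))
    return expected_loglik - kl_to_prior

def kl_to_posterior(x, m, s):
    # KL(N(m, s^2) || N(x/2, 1/2)); the second argument is the true posterior.
    post_mean, post_var = x / 2.0, 0.5
    return 0.5 * (np.log(post_var / s**2)
                  + (s**2 + (m - post_mean) ** 2) / post_var - 1.0)

x, m, s = 1.3, 0.2, 0.8
gap = log_marginal(x) - (kl_to_posterior(x, m, s) + elbo(x, m, s))
print(abs(gap) < 1e-12)         # the identity holds up to floating-point error
print(elbo(x, m, s) <= log_marginal(x))  # the bound is indeed a lower bound
```

Setting `m = x / 2` and `s = np.sqrt(0.5)` makes the KL term vanish, so the bound becomes tight in this toy case.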

It can be rewritten as
$$
\begin{align}
\mathcal{L}(\theta,\phi;x) = &\,\mathbb{E}_{q_\phi(z|x)}\log\frac{p_\theta(x|z)p_\theta(z)}{q_\phi(z|x)}\\
= &\,\mathbb{E}_{q_\phi(z|x)}\log p_\theta(x|z)-D_{KL}\left(q_\phi(z|x)\,||\,p_\theta(z)\right)
\end{align}
$$

To optimize the lower bound, we need its gradient w.r.t. $\phi$. This is problematic because the expectation is taken over $q_\phi(z|x)$, which itself depends on $\phi$, so $\nabla_\phi$ and $\mathbb{E}_{q_\phi(z|x)}$ do not commute.

### 2.3 SGVB estimator and VAE

**reparameterization trick**
: with an auxiliary random variable $\epsilon$ and a differentiable transformation $g_\phi(\epsilon,x)$, a latent variable sample $z\sim q_\phi(z|x)$ can be reparameterized as

$$z = g_\phi(\epsilon,x) \quad \mathrm{with} \quad \epsilon \sim p(\epsilon)$$

<font size="1"> 
- (e.g.) if $z \sim q_\phi(z|x)=\mathcal{N}(\mu,\sigma^2)$, one can use  $z = g_\phi(\epsilon,x) =\mu + \sigma \epsilon$ where $\epsilon \sim \mathcal{N}(0,1)$
- (e.g.) when the inverse CDF is tractable, one can use $\epsilon \sim \mathcal{U}(0,1)$ with $g_\phi(\epsilon,x)$ being the inverse CDF of $q_\phi(z|x)$
</font>
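
As a small illustration of why the trick helps, the sketch below estimates the gradient of $\mathbb{E}_{z\sim\mathcal{N}(\mu,\sigma^2)}[z^2]$ w.r.t. $\mu$ and $\sigma$ by differentiating through $z=\mu+\sigma\epsilon$. The test function $f(z)=z^2$ and the name `reparam_grad` are assumptions chosen because the exact answer is known: $\mathbb{E}[z^2]=\mu^2+\sigma^2$, so the gradients are $2\mu$ and $2\sigma$.

```python
import numpy as np

def reparam_grad(mu, sigma, n_samples, rng):
    """Pathwise gradient of E_{z~N(mu, sigma^2)}[z^2] w.r.t. mu and sigma."""
    eps = rng.standard_normal(n_samples)  # eps ~ N(0, 1), does not depend on (mu, sigma)
    z = mu + sigma * eps                  # z = g(eps) = mu + sigma * eps
    df_dz = 2.0 * z                       # derivative of the test function f(z) = z^2
    grad_mu = np.mean(df_dz * 1.0)        # chain rule with dz/dmu = 1
    grad_sigma = np.mean(df_dz * eps)     # chain rule with dz/dsigma = eps
    return grad_mu, grad_sigma

rng = np.random.default_rng(0)
g_mu, g_sigma = reparam_grad(mu=1.5, sigma=0.7, n_samples=200_000, rng=rng)
# Exact gradients for comparison: 2 * mu = 3.0 and 2 * sigma = 1.4.
print(g_mu, g_sigma)
```

Because $\epsilon$ is sampled from a fixed distribution, the expectation no longer depends on the parameters being differentiated, and the gradient can be moved inside the average.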

Then an SGVB estimator $\tilde{\mathcal{L}}^A(\theta,\phi;x) \simeq \mathcal{L}(\theta,\phi;x)$ is

$$
\tilde{\mathcal{L}}^A(\theta,\phi;x) = \frac{1}{L}\sum_{l=1}^L \left[\log p_\theta (x,z^{(l)}) - \log q_\phi (z^{(l)}|x)\right]
\quad \mathrm{where} \quad z^{(l)} = g_\phi(\epsilon^{(l)},x),\ \epsilon^{(l)} \sim p(\epsilon)
$$

Alternatively, when $D_{KL}\left(q_\phi(z|x)\,||\,p_\theta(z)\right)$ is analytically available, only the expected log-likelihood term needs to be estimated by sampling. The second version of the SGVB estimator $\tilde{\mathcal{L}}^B(\theta,\phi;x) \simeq \mathcal{L}(\theta,\phi;x)$ is

$$
\tilde{\mathcal{L}}^B(\theta,\phi;x) = \frac{1}{L}\sum_{l=1}^L \log p_\theta (x|z^{(l)}) - D_{KL}\left(q_\phi(z|x)\,||\,p_\theta(z)\right)
$$

<font size="1">when the minibatch is large enough, a single sample per data point ($L=1$) can be used </font>
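
Both estimators can be compared against an exact bound on a toy model. The sketch assumes $p(z)=\mathcal{N}(0,1)$, $p(x|z)=\mathcal{N}(z,1)$, and $q(z|x)=\mathcal{N}(m,s^2)$, none of which come from the paper; the function names are hypothetical. Version B here samples only the $\log p_\theta(x|z^{(l)})$ term and subtracts the analytic KL, matching the second form of the bound in section 2.2.

```python
import numpy as np

def log_normal(v, mean, std):
    # Log density of a univariate Gaussian N(v; mean, std^2).
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((v - mean) / std) ** 2

def sgvb_estimates(x, m, s, n_samples, rng):
    eps = rng.standard_normal(n_samples)
    z = m + s * eps                                # reparameterized samples from q
    log_lik = log_normal(x, z, 1.0)                # log p(x|z)
    log_prior = log_normal(z, 0.0, 1.0)            # log p(z)
    log_q = log_normal(z, m, s)                    # log q(z|x)
    est_a = np.mean(log_lik + log_prior - log_q)   # version A: log p(x,z) - log q(z|x)
    kl = 0.5 * (m**2 + s**2 - 1.0 - np.log(s**2))  # analytic KL(q || p(z))
    est_b = np.mean(log_lik) - kl                  # version B: sampled likelihood, analytic KL
    return est_a, est_b

def exact_elbo(x, m, s):
    # Closed-form lower bound for the toy model, used only as a reference value.
    expected_loglik = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s**2)
    kl = 0.5 * (m**2 + s**2 - 1.0 - np.log(s**2))
    return expected_loglik - kl

rng = np.random.default_rng(0)
est_a, est_b = sgvb_estimates(x=1.3, m=0.2, s=0.8, n_samples=100_000, rng=rng)
print(est_a, est_b, exact_elbo(1.3, 0.2, 0.8))
```

With many samples both estimates should land close to the exact value; version B typically has lower variance because the KL term is not sampled.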

# 3. VAE example

<img src="https://miro.medium.com/max/3374/1*22cSCfmktNIwH5m__u2ffA.png" width="700"/>

Let the prior be an isotropic multivariate Gaussian $p_\theta(z) = \mathcal{N}(z;0,I)$ 
<font size="1"> (note that the prior does not contain model parameters in this case).</font> 
  Also let the **decoder** $p_\theta(x|z)$ be an isotropic multivariate Gaussian, and let the **encoder** $q_\phi(z|x)$ be a multivariate Gaussian with diagonal covariance:

$$
\log q_\phi (z|x) = \log \mathcal{N}(z;\mu(x),\sigma^2(x)I)
$$

Then,
$$
\begin{align}
\tilde{\mathcal{L}}^B(\theta,\phi;x) = &\frac{1}{L}\sum_{l=1}^L \log p_\theta (x|z^{(l)}) - D_{KL}\left(q_\phi(z|x)\,||\,p_\theta(z)\right) \\
= &\frac{1}{L}\sum_{l=1}^L \log p_\theta (x|z^{(l)}) + \frac{1}{2}\sum_{d=1}^D \left( 1 + \log \sigma_d^2 - \mu_d^2 - \sigma_d^2\right) 
\end{align}
$$

where $D$ is the dimensionality of the latent space, and the following identities were used
$$
\begin{align}
\int \mathcal{N}(z;\mu,\sigma^2 I)\log \mathcal{N}(z;0,I) dz = &-\frac{D}{2}\log 2\pi - \frac{1}{2}\sum_{d=1}^D \left(\mu_d^2 + \sigma_d^2\right) \\
\int \mathcal{N}(z;\mu,\sigma^2 I)\log \mathcal{N}(z;\mu,\sigma^2 I) dz = &-\frac{D}{2}\log 2\pi - \frac{1}{2}\sum_{d=1}^D \left(1 + \log \sigma_d^2\right) 
\end{align}
$$

Recall that $\mu$ and $\sigma$ depend on $x$ and $\phi$, as does $g_\phi(\epsilon,x)$.
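
A minimal NumPy sketch of this objective for a single data point is given below. It is not the paper's architecture: the linear encoder heads (`W_mu`, `W_lv`), the linear decoder (`W`, `b`), the unit-variance Gaussian decoder, and the function names are all assumptions made for illustration. The encoder outputs $\log\sigma^2$ rather than $\sigma$, a common choice that keeps the variance positive.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # KL(N(mu, diag(sigma^2)) || N(0, I)) = -0.5 * sum(1 + log sigma^2 - mu^2 - sigma^2)
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

def sgvb_b(x, enc, dec, rng, n_samples=1):
    """Lower-bound estimate for one data point x (a 1-D array)."""
    mu = enc["W_mu"] @ x + enc["b_mu"]       # mu(x): hypothetical linear encoder head
    log_var = enc["W_lv"] @ x + enc["b_lv"]  # log sigma^2(x): second encoder head
    sigma = np.exp(0.5 * log_var)
    recon = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(mu.shape)
        z = mu + sigma * eps                 # reparameterization: z = g(eps, x)
        x_mean = dec["W"] @ z + dec["b"]     # mean of the assumed Gaussian decoder
        # log N(x; x_mean, I): unit-variance decoder, an assumption of this sketch
        recon += -0.5 * np.sum((x - x_mean) ** 2) - 0.5 * x.size * np.log(2 * np.pi)
    # Sampled reconstruction term minus the analytic KL term.
    return recon / n_samples - kl_to_standard_normal(mu, log_var)

rng = np.random.default_rng(0)
x_dim, z_dim = 4, 2
enc = {"W_mu": 0.1 * rng.standard_normal((z_dim, x_dim)), "b_mu": np.zeros(z_dim),
       "W_lv": 0.1 * rng.standard_normal((z_dim, x_dim)), "b_lv": np.zeros(z_dim)}
dec = {"W": 0.1 * rng.standard_normal((x_dim, z_dim)), "b": np.zeros(x_dim)}
x = rng.standard_normal(x_dim)
print(sgvb_b(x, enc, dec, rng))
```

Training would maximize this quantity over the encoder and decoder parameters with a gradient-based optimizer, which in practice is done with an automatic differentiation library rather than plain NumPy.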