# Auto-Encoding Variational Bayes

- [Generative model](generative_model.ipynb)

- [Variational Inference](Bayesian.ipynb)


This note is a section-by-section summary of the [paper](https://arxiv.org/abs/1312.6114), based on my personal understanding. Some personal remarks are collected at the end of the note.

## 1. Introduction
$\,$ The [variational Bayesian](concept_Bayesian.ipynb) (VB) approach optimizes an approximation to the intractable posterior by maximizing a variational *lower bound*.
A *reparameterization trick* yields a differentiable, unbiased estimator of this *lower bound*, so that efficient learning and approximate posterior inference (over the latent variables) become possible with standard stochastic gradient methods.

## 2. Method
(assumption: i.i.d. dataset with latent variables per datapoint)


#### 2.1. Problem scenario
- prior over the latent space $p_\theta (z)$, where $\theta$ denotes the model parameters
- conditional distribution over the data space $p_\theta(x|z)$ (the likelihood, which can be interpreted as a probabilistic *decoder*)
- problem: the marginal likelihood $p_\theta(x)$ and the posterior $p_\theta(z|x)$ are intractable


#### 2.2. Variational bound
$\,$ Let $q_\phi(z|x)$ be a recognition model (where $\phi$ denotes the variational parameters), i.e. a variational approximation to the true posterior $p_\theta(z|x)$. It can be interpreted as a probabilistic *encoder*.
Then the variational [lower bound](concept_Bayesian.ipynb) can be written as follows (the derivation is at the end of this notebook):

$$
\begin{equation}
\mathcal{L}(\theta,\phi;x) = \mathbb{E}_{q_\phi(z|x)}\log p_\theta(x|z)-D_{KL}\left(q_\phi(z|x)\,||\,p_\theta(z)\right)
\end{equation}
$$
where $D_{KL}$ is the KL-divergence. 

$\,$ How can we optimize $\mathcal{L}(\theta,\phi;x)$ efficiently?


#### 2.3. SGVB estimator and AEVB algorithm
- SGVB (Stochastic Gradient Variational Bayes)
- AEVB (Auto-Encoding VB)

>**reparameterization trick**
: with an auxiliary random variable $\epsilon$ and a differentiable transformation $g_\phi(\epsilon,x)$, a latent variable sample $z\sim q_\phi(z|x)$ can be reparameterized by
$$z = g_\phi(\epsilon,x) \quad \mathrm{with} \quad \epsilon \sim p(\epsilon)$$
<br>
<font size="2"> 
$\qquad\qquad$ (e.g.)  if $z \sim q_\phi(z|x)=\mathcal{N}(\mu(x),\sigma^2(x))$, one can use  $z = g_\phi(\epsilon,x) =\mu + \sigma \epsilon$ where $\epsilon \sim \mathcal{N}(0,1)$
</font>
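
The Gaussian example above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the function name `sample_z` and the values of `mu` and `sigma` are hypothetical stand-ins for the outputs an encoder would produce for one datapoint.

```python
import random
import statistics

def sample_z(mu, sigma, rng):
    """Draw z ~ N(mu, sigma^2) as a deterministic function of noise eps ~ N(0, 1)."""
    eps = rng.gauss(0.0, 1.0)   # auxiliary noise, independent of mu and sigma
    return mu + sigma * eps     # z = g(eps, x) = mu + sigma * eps

rng = random.Random(0)          # fixed seed, only for reproducibility
mu, sigma = 1.5, 0.5            # hypothetical encoder outputs for one datapoint
samples = [sample_z(mu, sigma, rng) for _ in range(100_000)]

sample_mean = statistics.fmean(samples)
sample_std = statistics.pstdev(samples)
print(sample_mean, sample_std)  # should be close to mu and sigma
```

The samples have the same distribution as direct draws from $\mathcal{N}(\mu,\sigma^2)$, but all randomness now sits in $\epsilon$, which does not depend on $\mu$ or $\sigma$.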


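$\,$ Using the reparameterization trick with $L$ samples $z^{(l)} = g_\phi(\epsilon^{(l)},x)$, $\epsilon^{(l)} \sim p(\epsilon)$, the SGVB estimator of the lower bound is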
$$
\mathcal{L}(\theta,\phi;x)  \simeq \frac{1}{L}\sum_{l=1}^L \left\{\, \log p_\theta (x,z^{(l)}) - \log q_\phi (z^{(l)}|x) \,\right\}
$$

$\,$ Often the KL divergence $D_{KL}\left(q_\phi(z|x)\,||\,p_\theta(z)\right)$ can be integrated analytically, so that only the expected reconstruction term has to be estimated by sampling. Then,
$$
\mathcal{L}(\theta,\phi;x)  \simeq \frac{1}{L}\sum_{l=1}^L \log p_\theta (x|z^{(l)}) - D_{KL}\left(q_\phi(z|x)\,||\,p_\theta(z)\right)
$$
Note that $\log p_\theta (x|z^{(l)})$ can be interpreted as a negative *reconstruction error* in auto-encoder terms, while the KL term acts as a regularizer.
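
To make the two estimators concrete, here is a minimal sketch on a toy model that is an assumption of this example, not part of the paper: a 1-D prior $p(z)=\mathcal{N}(0,1)$, likelihood $p(x|z)=\mathcal{N}(x;z,1)$, and $q(z|x)=\mathcal{N}(m,s^2)$. For this model the marginal $p(x)=\mathcal{N}(x;0,2)$ and the posterior $\mathcal{N}(x/2,1/2)$ are known in closed form, which allows a sanity check. All function names are hypothetical.

```python
import math
import random

def log_normal(v, mean, std):
    """Log density of a univariate Gaussian N(v; mean, std^2)."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((v - mean) / std) ** 2

def kl_to_std_normal(m, s):
    """Analytic KL( N(m, s^2) || N(0, 1) )."""
    return 0.5 * (m * m + s * s - 1.0) - math.log(s)

def sgvb_joint(x, m, s, eps_list):
    """First estimator: average of log p(x, z) - log q(z|x)."""
    total = 0.0
    for eps in eps_list:
        z = m + s * eps  # reparameterized sample
        log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
        total += log_joint - log_normal(z, m, s)
    return total / len(eps_list)

def sgvb_analytic_kl(x, m, s, eps_list):
    """Second estimator: average of log p(x|z) minus the analytic KL term."""
    recon = sum(log_normal(x, m + s * eps, 1.0) for eps in eps_list) / len(eps_list)
    return recon - kl_to_std_normal(m, s)

rng = random.Random(0)  # fixed seed, only for reproducibility
eps_list = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

x = 1.2
log_px = log_normal(x, 0.0, math.sqrt(2.0))  # exact log marginal of the toy model

# With q equal to the exact posterior N(x / 2, 1 / 2), the bound is tight.
m_post, s_post = x / 2.0, math.sqrt(0.5)
elbo_joint = sgvb_joint(x, m_post, s_post, eps_list)
elbo_kl = sgvb_analytic_kl(x, m_post, s_post, eps_list)

# A deliberately poor q (the prior itself) gives a looser bound.
elbo_poor = sgvb_analytic_kl(x, 0.0, 1.0, eps_list)
print(log_px, elbo_joint, elbo_kl, elbo_poor)
```

When $q$ equals the exact posterior, every term of the first estimator equals $\log p(x)$, so it has zero variance in this toy case; the second estimator still fluctuates through the reconstruction term.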

#### 2.4.  Reparameterization trick
$\,$ Recall that $z$ is obtained as $z = g_\phi(\epsilon,x)$ with $\epsilon \sim p(\epsilon)$, rather than by sampling from $q_\phi(z|x)$ directly; that is, $g_\phi(\epsilon,x)$ reparameterizes $q_\phi(z|x)$, which enables SGVB. In other words, the source of randomness no longer depends on $\phi$, so the Monte Carlo estimate of the expectation (over $q_\phi(z|x)$) is differentiable w.r.t. $\phi$.
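
The differentiability claim can be checked numerically on a small example. The sketch below assumes a hypothetical test function $f(z)=z^2$ and $q=\mathcal{N}(\mu,\sigma^2)$, for which $\mathbb{E}_q[f(z)]=\mu^2+\sigma^2$ and hence the exact gradients are $2\mu$ and $2\sigma$. The gradient is pushed through $z=\mu+\sigma\epsilon$ by the chain rule.

```python
import random

def pathwise_grads(mu, sigma, eps_list):
    """Monte Carlo gradient of E_q[z^2] w.r.t. mu and sigma via z = mu + sigma * eps."""
    g_mu = 0.0
    g_sigma = 0.0
    for eps in eps_list:
        z = mu + sigma * eps
        df_dz = 2.0 * z          # derivative of f(z) = z^2
        g_mu += df_dz * 1.0      # chain rule: dz/dmu = 1
        g_sigma += df_dz * eps   # chain rule: dz/dsigma = eps
    n = len(eps_list)
    return g_mu / n, g_sigma / n

rng = random.Random(0)  # fixed seed, only for reproducibility
eps_list = [rng.gauss(0.0, 1.0) for _ in range(200_000)]

mu, sigma = 1.5, 0.5    # hypothetical variational parameters
grad_mu, grad_sigma = pathwise_grads(mu, sigma, eps_list)
print(grad_mu, grad_sigma)  # should be close to 2 * mu and 2 * sigma
```

In practice an automatic differentiation framework performs this chain rule; the manual version here only makes the mechanism visible.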

## 3. Example: Variational Auto-Encoder

<img src="https://miro.medium.com/max/3374/1*22cSCfmktNIwH5m__u2ffA.png" width="600"/>

image source is [here](https://www.topbots.com/intuitively-understanding-variational-autoencoders/)

- Let $p_\theta(z) = \mathcal{N}(z;0,I)$ 
<font size="1"> (note that the prior does not contain model parameters in this case).</font> 
- Let also $p_\theta(x|z)$ be a multivariate Gaussian

The true posterior $p_\theta(z|x)$ is intractable in this case. We assume that it is approximately Gaussian with an approximately diagonal covariance, and choose the variational posterior accordingly:
$$
\log q_\phi (z|x) = \log \mathcal{N}(z;\mu(x),\sigma^2(x)I)
$$

Then,
$$
\mathcal{L}(\theta,\phi;x)  \simeq \frac{1}{L}\sum_{l=1}^L \log p_\theta (x|z^{(l)}) + \frac{1}{2}\sum_{d=1}^D \left( 1 + \log \sigma_d^2 - \mu_d^2 - \sigma_d^2\right) 
$$

where $D$ is the dimension of the latent space, and we used the following identities (their difference gives $-D_{KL}\left(q_\phi(z|x)\,||\,p_\theta(z)\right)$):
$$
\begin{align}
\int \mathcal{N}(z;\mu,\sigma^2 I)\log \mathcal{N}(z;0,I) dz = &-\frac{D}{2}\log 2\pi - \frac{1}{2}\sum_{d=1}^D \left(\mu_d^2 + \sigma_d^2\right) \\
\int \mathcal{N}(z;\mu,\sigma^2 I)\log \mathcal{N}(z;\mu,\sigma^2 I) dz = &-\frac{D}{2}\log 2\pi - \frac{1}{2}\sum_{d=1}^D \left(1 + \log \sigma_d^2\right) 
\end{align}
$$
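
As a quick consistency check, the two integrals above can be coded directly and compared with the closed-form KL term in the estimator. This is a minimal standard-library sketch; the function names are hypothetical, and `mu` and `sigma` are plain lists holding the per-dimension mean and standard deviation.

```python
import math

def neg_kl_diag_gaussian(mu, sigma):
    """-KL( N(mu, diag(sigma^2)) || N(0, I) ) = 0.5 * sum(1 + log sigma^2 - mu^2 - sigma^2)."""
    return 0.5 * sum(1.0 + math.log(s * s) - m * m - s * s for m, s in zip(mu, sigma))

def cross_term(mu, sigma):
    """Integral of q(z) log N(z; 0, I) dz for q = N(mu, diag(sigma^2))."""
    d = len(mu)
    return -0.5 * d * math.log(2 * math.pi) - 0.5 * sum(m * m + s * s for m, s in zip(mu, sigma))

def neg_entropy_term(mu, sigma):
    """Integral of q(z) log q(z) dz for q = N(mu, diag(sigma^2))."""
    d = len(mu)
    return -0.5 * d * math.log(2 * math.pi) - 0.5 * sum(1.0 + math.log(s * s) for s in sigma)

mu = [0.3, -1.2, 0.0]      # hypothetical encoder means
sigma = [0.8, 1.5, 1.0]    # hypothetical encoder standard deviations

closed_form = neg_kl_diag_gaussian(mu, sigma)
from_integrals = cross_term(mu, sigma) - neg_entropy_term(mu, sigma)
print(closed_form, from_integrals)  # the two values should agree
```

The $\frac{D}{2}\log 2\pi$ terms cancel in the difference, which is why they do not appear in the final estimator.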

## 4. Related work
Refer to the paper directly.

## 5. Experiments
> "Interestingly enough, more latent variables
does not result in more overfitting, which is explained by the regularizing effect of the lower bound."

## Personal remarks



#### Derivation sketch of the lower bound


We want the variational distribution $q_\phi(z|x)$ to be as close as possible to the true posterior $p_\theta(z|x)$. The [KL divergence](concept_KLdiv.ipynb) between the two can be written as
$$
\begin{eqnarray}
D_{KL}\left(q_{\phi}(z|x)\,||\,p_{\theta}(z|x)\right)&=&\mathbb{E}_{q_{\phi}(z|x)}\log\frac{q_{\phi}(z|x)}{p_{\theta}(z|x)}\\
&=&\mathbb{E}_{q_{\phi}(z|x)}\log\frac{q_{\phi}(z|x)}{p_{\theta}(z,x)/p_{\theta}(x)}\\
&=&\mathbb{E}_{q_{\phi}(z|x)}\log\frac{q_{\phi}(z|x)}{p_{\theta}(z,x)}+\mathbb{E}_{q_{\phi}(z|x)}\log p_{\theta}(x)\\
&=&\mathbb{E}_{q_{\phi}(z|x)}\log\frac{q_{\phi}(z|x)}{p_{\theta}(z,x)}+\log p_{\theta}(x)
\end{eqnarray}
$$ 

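Rearranging gives $\log p_\theta(x) = D_{KL}\left(q_\phi(z|x)\,||\,p_\theta(z|x)\right) + \mathbb{E}_{q_\phi(z|x)}\log\frac{p_\theta(x,z)}{q_\phi(z|x)}$. Because $D_{KL}\ge 0$, the second term is a lower bound on $\log p_\theta(x)$, so the lower bound can be defined as
$$\mathcal{L}(\theta,\phi;x) = \mathbb{E}_{q_\phi(z|x)}\log\frac{p_\theta(x,z)}{q_\phi(z|x)}$$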
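
As a sketch of the remaining step, using only the factorization $p_\theta(x,z)=p_\theta(x|z)\,p_\theta(z)$, this expression can be brought into the form used in Section 2.2:

$$
\begin{eqnarray}
\mathcal{L}(\theta,\phi;x)&=&\mathbb{E}_{q_{\phi}(z|x)}\log\frac{p_{\theta}(x|z)\,p_{\theta}(z)}{q_{\phi}(z|x)}\\
&=&\mathbb{E}_{q_{\phi}(z|x)}\log p_{\theta}(x|z)-\mathbb{E}_{q_{\phi}(z|x)}\log\frac{q_{\phi}(z|x)}{p_{\theta}(z)}\\
&=&\mathbb{E}_{q_{\phi}(z|x)}\log p_{\theta}(x|z)-D_{KL}\left(q_{\phi}(z|x)\,||\,p_{\theta}(z)\right)
\end{eqnarray}
$$

Since $\log p_\theta(x)$ does not depend on $\phi$, maximizing $\mathcal{L}$ over $\phi$ is equivalent to minimizing $D_{KL}\left(q_\phi(z|x)\,||\,p_\theta(z|x)\right)$.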




#### VAE for nonlinear ICA (Independent Component Analysis)

Under the assumption
$$
\log q_\phi (z|x) = \log \mathcal{N}(z;\mu(x),\sigma^2(x)I)
$$
the latent variables are independent of each other given $x$. Therefore, a VAE can be used as a nonlinear ICA method.

In this [example]() I used a VAE to find the number of independent sources in time series data.


