# Key derivations for VAEs

1) Maximize Likelihood </br></br>

\begin{aligned}
p_{\theta}(y_n, z_n) &= p_{\theta}(y_n|z_n)\,p_{\theta}(z_n)
\end{aligned}

2) Maximize the log-likelihood of the observed $y_n$

\begin{aligned}
\theta_{\text{MLE}} &= \mathrm{argmax}_{\theta}\; \mathrm{log} \prod_n \overbrace{\int p_{\theta}(y_n, z_n)\, dz_n}^{p_{\theta}(y_n)}\\
&= \mathrm{argmax}_{\theta}\; \sum_n \mathrm{log}\; \mathbb{E}_{p_{\theta}(z_n)} [p_{\theta}(y_n|z_n)]\\
\end{aligned}
However, optimizing this by gradient ascent is a problem: $\nabla_{\theta}$ has to pass through a $\log$ that sits outside an expectation, and that expectation is taken under $p_{\theta}(z_n)$, which itself depends on $\theta$. So we introduce an auxiliary distribution $q(z_n)$ and use EM steps:
</br>  
</br>  
\begin{aligned}
&= \mathrm{argmax}_{\theta}\; \sum_n\; \mathrm{log} \int p_{\theta}(y_n|z_n)\frac{p_{\theta}(z_n)}{q(z_n)} q(z_n)d z_n\\
&= \mathrm{argmax}_{\theta}\; \sum_n\; \mathrm{log}\; \mathbb{E}_{q(z_n)}\; \left[\frac{p_{\theta}(y_n|z_n)p_{\theta}(z_n)}{q(z_n)}\right]\\
\end{aligned}
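
The rewrite above is an importance-sampling identity: any $q(z_n)$ with enough coverage gives the same value for $p_{\theta}(y_n)$. The sketch below checks this numerically on a toy model that is an assumption of this example, not part of the notes: $z \sim \mathcal{N}(0,1)$ and $y|z \sim \mathcal{N}(\theta z, 1)$ with scalar $z$, so that $p_{\theta}(y) = \mathcal{N}(0, 1+\theta^2)$ is known exactly. The function names and the chosen $q$ are illustrative.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def marginal_via_q(y, theta, q_mean, q_var, num_samples=100_000, seed=0):
    """Monte Carlo estimate of E_q[ p(y|z) p(z) / q(z) ] for the assumed toy
    model z ~ N(0, 1), y | z ~ N(theta * z, 1), with q(z) = N(q_mean, q_var)."""
    rng = random.Random(seed)
    q_std = math.sqrt(q_var)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(q_mean, q_std)  # sample from q, not from the prior
        joint = normal_pdf(y, theta * z, 1.0) * normal_pdf(z, 0.0, 1.0)
        total += joint / normal_pdf(z, q_mean, q_var)
    return total / num_samples

y, theta = 1.5, 2.0
estimate = marginal_via_q(y, theta, q_mean=0.5, q_var=0.5)
exact = normal_pdf(y, 0.0, 1.0 + theta ** 2)  # closed-form p(y) for the toy model
print(estimate, exact)
```

The estimate should land close to the exact marginal. A $q$ that is narrower than the posterior would still be unbiased but could have very high variance, which is one reason the choice of $q$ matters.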
Apply Jensen's inequality ($\log$ is concave, so $\log \mathbb{E}[X] \ge \mathbb{E}[\log X]$) to get a lower bound on the log-likelihood:</br>  
</br>  
\begin{aligned}
\sum_n\; \mathrm{log}\; p_{\theta}(y_n) &\ge \underbrace{\sum_n\; \mathbb{E}_{q(z_n)}\; \left[\mathrm{log}\left(\frac{p_{\theta}(y_n,z_n)}{q(z_n)}\right)\right]}_{\text{ELBO}(\theta, q)}\\
\end{aligned}
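
As a sanity check on the bound, here is a minimal sketch that evaluates both sides in closed form. It assumes a scalar toy model chosen for this example ($z \sim \mathcal{N}(0,1)$, $y|z \sim \mathcal{N}(\theta z, 1)$) and a Gaussian $q(z) = \mathcal{N}(m, s^2)$; the names `log_marginal` and `elbo` are illustrative.

```python
import math

def log_marginal(y, theta):
    """log p(y) for the assumed toy model: y ~ N(0, 1 + theta^2)."""
    var = 1.0 + theta ** 2
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * y ** 2 / var

def elbo(y, theta, m, s2):
    """ELBO for q(z) = N(m, s2), written as E_q[log p(y|z)] - KL[q || p(z)]."""
    expected_log_lik = (-0.5 * math.log(2.0 * math.pi)
                        - 0.5 * ((y - theta * m) ** 2 + theta ** 2 * s2))
    kl_to_prior = 0.5 * (m ** 2 + s2 - 1.0 - math.log(s2))
    return expected_log_lik - kl_to_prior

y, theta = 1.5, 2.0

# An arbitrary q (here, the prior itself) gives a strict lower bound.
bound_loose = elbo(y, theta, m=0.0, s2=1.0)

# The exact posterior of this toy model is Gaussian, and it closes the gap.
post_var = 1.0 / (1.0 + theta ** 2)
post_mean = theta * y * post_var
bound_tight = elbo(y, theta, m=post_mean, s2=post_var)

print(bound_loose, bound_tight, log_marginal(y, theta))
```

The first value sits below $\log p_{\theta}(y)$, while the second matches it up to floating-point error.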

**Maximize the ELBO**
\begin{aligned} 
\theta^*, q^* &= \underset{\theta, q}{\text{argmax}}\; \text{ELBO}(\theta, q)\\
\end{aligned}
</br>  
*M Step* (hold $q^*$ fixed and follow $\nabla_{\theta}\, \text{ELBO}$):
\begin{aligned} 
\theta^* &= \underset{\theta}{\text{argmax}}\; \text{ELBO}(\theta, q^*)\\
\end{aligned}
*E Step* (hold $\theta^*$ fixed):
\begin{aligned} 
q^* &= \underset{q}{\text{argmax}}\; \text{ELBO}(\theta^*, q)\\
\end{aligned}

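Maximizing the ELBO over $q$ is the same as minimizing the KL divergence from $q$ to the true posterior, which is zero when:
\begin{aligned} 
q^*(z_n) &= p_{\theta^*}(z_n | y_n)\\
\end{aligned}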
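
To see where the KL divergence comes from, the log-likelihood of one data point can be split as follows (a short worked derivation; $\text{ELBO}_n$ is a label introduced here for the $n$-th summand of the ELBO):
\begin{aligned}
\mathrm{log}\; p_{\theta}(y_n) &= \mathbb{E}_{q(z_n)}\left[\mathrm{log}\frac{p_{\theta}(y_n, z_n)}{p_{\theta}(z_n|y_n)}\right]\\
&= \mathbb{E}_{q(z_n)}\left[\mathrm{log}\frac{p_{\theta}(y_n, z_n)}{q(z_n)}\right] + \mathbb{E}_{q(z_n)}\left[\mathrm{log}\frac{q(z_n)}{p_{\theta}(z_n|y_n)}\right]\\
&= \text{ELBO}_n(\theta, q) + \text{D}_{\text{KL}}[q(z_n)||p_{\theta}(z_n|y_n)]
\end{aligned}
The left-hand side does not depend on $q$, so raising the ELBO over $q$ lowers the KL term by the same amount, and the KL term reaches zero exactly when $q$ equals the posterior.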

But the posterior $p_{\theta^*}(z_n | y_n)$ rarely has a closed form, so we rely on variational inference. We restrict $q$ to a mean-field Gaussian family (one Gaussian per data point, with diagonal $\Sigma_n$):

\begin{aligned} 
q(z_n) & = \mathcal{N}(z_n; \mu_n, \Sigma_n)\\
\theta^* &= \underset{\theta}{\text{argmax}}\; \text{ELBO}(\theta, q^*)\\
\end{aligned}
</br>  
*A) Variational Inference*  
Step E:
\begin{aligned}
\mu_n^*, \Sigma_n^* &= \underset{\mu_n, \Sigma_n}{\text{argmin}}\; \text{D}_{\text{KL}} [q_{\mu_n, \Sigma_n}(z_n)||p_{\theta^*}(z_n|y_n)]\\
&\equiv \underbrace{\underset{\mu_n, \Sigma_n}{\text{argmax}}\; \mathbb{E}_{q_{\mu_n, \Sigma_n}(z_n)} \left[\mathrm{log}\frac{p_{\theta^*}(z_n, y_n)}{q_{\mu_n, \Sigma_n}(z_n)}\right]}_{\text{ELBO}(\theta^*,q_{\mu_n, \Sigma_n} )}\\
\end{aligned}
</br>
The two problems are equivalent because the KL divergence and the ELBO differ only by $\log p_{\theta^*}(y_n)$, which does not depend on $\mu_n, \Sigma_n$; the second line is the ELBO of $(\theta^*,q_{\mu_n, \Sigma_n})$.
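
The E step for a single data point can be sketched as plain gradient ascent on the ELBO. This is a minimal illustration under assumptions not in the notes: a scalar toy model ($z \sim \mathcal{N}(0,1)$, $y|z \sim \mathcal{N}(\theta z, 1)$), a variance parameterized as $s^2 = e^{2\rho}$ to keep it positive, and hand-derived gradients. The function name `fit_q` and the step size are illustrative.

```python
import math

def fit_q(y, theta, lr=0.1, steps=500):
    """Gradient ascent on the ELBO over (m, rho) for q(z) = N(m, exp(2*rho)),
    under the assumed toy model z ~ N(0, 1), y | z ~ N(theta * z, 1).

    For this model the ELBO is, up to a constant,
      -0.5*((y - theta*m)^2 + theta^2*s2) - 0.5*(m^2 + s2 - 1 - log(s2)).
    """
    m, rho = 0.0, 0.0
    for _ in range(steps):
        s2 = math.exp(2.0 * rho)
        grad_m = theta * (y - theta * m) - m      # d ELBO / d m
        grad_rho = 1.0 - (1.0 + theta ** 2) * s2  # d ELBO / d rho (chain rule via s2)
        m += lr * grad_m
        rho += lr * grad_rho
    return m, math.exp(2.0 * rho)

y, theta = 1.5, 2.0
m_star, s2_star = fit_q(y, theta)

# Exact posterior of the toy model, for comparison.
post_var = 1.0 / (1.0 + theta ** 2)
post_mean = theta * y * post_var
print(m_star, s2_star, post_mean, post_var)
```

Because the true posterior of this toy model is itself Gaussian, the optimized $q$ recovers it; in a real VAE the Gaussian family only approximates the posterior.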

*B) Amortization*  
\begin{aligned}
g_{\phi}(y_n) &= \left(\mu_{\phi}(y_n), \Sigma_{\phi}(y_n)\right)\\
\phi^* &= \underset{\phi}{\text{argmin}}\; \sum_n\; \text{D}_{\text{KL}}\left[\mathcal{N}(\mu_{\phi}(y_n), \Sigma_{\phi}(y_n))||p_{\theta^*}(z_n|y_n)\right]\\
&= \underset{\phi}{\text{argmax}} \underbrace{ \sum_n\; \mathbb{E}_{q_{\phi}(z_n)}\left[ \mathrm{log} \frac{p_{\theta^*}(z_n, y_n)}{q_{\phi}(z_n)} \right]}_{\text{ELBO}(\theta^*,q_{\phi} )}\\
\end{aligned}
</br>
Here $\phi$ are the parameters of the inference function $g_{\phi}$, shared across all data points, and $q_{\phi}(z_n) = \mathcal{N}(z_n; \mu_{\phi}(y_n), \Sigma_{\phi}(y_n))$. Instead of optimizing a separate $(\mu_n, \Sigma_n)$ for every $y_n$, we optimize a single $\phi$.
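
A small sketch of what amortization buys: one set of parameters $\phi$ is fitted on a dataset and then reused on an unseen $y$ with no further optimization. Everything specific here is an assumption for illustration: the scalar toy model ($z \sim \mathcal{N}(0,1)$, $y|z \sim \mathcal{N}(\theta z, 1)$), a hypothetical linear encoder $\mu_{\phi}(y) = a\,y$ with a shared variance $e^{2\rho}$, and the names `fit_encoder` and `encode`.

```python
import math

def fit_encoder(ys, theta, lr=0.1, steps=500):
    """Fit phi = (a, rho) for a hypothetical linear encoder
    mu_phi(y) = a * y, var_phi = exp(2*rho), by gradient ascent on the
    dataset-averaged ELBO of the assumed toy model."""
    a, rho = 0.0, 0.0
    n = len(ys)
    for _ in range(steps):
        s2 = math.exp(2.0 * rho)
        # Chain rule through m_n = a * y_n: dELBO_n/da = (dELBO_n/dm_n) * y_n.
        grad_a = sum((theta * (y - theta * a * y) - a * y) * y for y in ys) / n
        grad_rho = 1.0 - (1.0 + theta ** 2) * s2
        a += lr * grad_a
        rho += lr * grad_rho
    return a, math.exp(2.0 * rho)

def encode(y, a, s2):
    """Amortized inference: map an observation directly to (mean, variance)."""
    return a * y, s2

theta = 2.0
ys = [-1.0, 0.5, 2.0]                   # made-up training observations
a, s2 = fit_encoder(ys, theta)
mu_new, var_new = encode(1.5, a, s2)    # unseen y, no per-point optimization
print(a, mu_new, var_new)
```

In this toy model the exact posterior mean is linear in $y$, so the linear encoder can match it; a VAE uses a neural network for $g_{\phi}$ because the true mapping is generally nonlinear.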

*C) Joint Training -- do the E & M Steps together!*  
</br>  
\begin{aligned} 
\theta^*, \phi^* &= \underset{\theta, \phi}{\text{argmax}}\;\text{ELBO}(\theta, q_\phi)
\end{aligned}
Use gradient ascent on the ELBO (equivalently, gradient descent on the negative ELBO):  
</br>  
\begin{aligned}
\nabla_{\theta, \phi} \; \text{ELBO}(\theta, q_\phi) &= \sum_n\; \nabla_{\theta, \phi}\; \mathbb{E}_{q_\phi(z_n)}\;  \left[ \mathrm{log} \frac{p_{\theta}(z_n, y_n)}{q_\phi(z_n)}\right]\\
\end{aligned}
<hr/>

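**Reparameterization trick**

The expectation above is taken under $q_\phi$, which depends on $\phi$, so $\nabla_\phi$ cannot simply be moved inside it. Writing $z_n = \mu_\phi + \Sigma_\phi^{1/2}\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ moves the dependence on $\phi$ into the integrand:

\begin{aligned}
\nabla_{\theta, \phi} \; \text{ELBO}(\theta, q_\phi) &= \sum_n\; \nabla_{\theta, \phi}\; \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\;  \left[ \mathrm{log} \frac{p_{\theta}(\mu_\phi + \Sigma_\phi^{1/2}\epsilon, y_n)}{q_\phi(\mu_\phi + \Sigma_\phi^{1/2}\epsilon)}\right]\\
\end{aligned}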
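
The effect of sampling $\epsilon$ instead of $z_n$ can be seen on a deliberately simple stand-in. The sketch below does not compute the ELBO gradient; it assumes a scalar integrand $f(z) = z^2$ (chosen because the exact answer is known) and estimates the gradients of $\mathbb{E}_{z \sim \mathcal{N}(\mu, \sigma^2)}[f(z)]$ with respect to $\mu$ and $\sigma$ by differentiating through $z = \mu + \sigma\epsilon$. The function name is illustrative.

```python
import random

def reparam_gradients(mu, sigma, num_samples=200_000, seed=0):
    """Estimate d/dmu and d/dsigma of E_{z ~ N(mu, sigma^2)}[z^2]
    using the reparameterization z = mu + sigma * eps, eps ~ N(0, 1)."""
    rng = random.Random(seed)
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)  # noise that does not depend on (mu, sigma)
        z = mu + sigma * eps
        df_dz = 2.0 * z            # derivative of the stand-in f(z) = z^2
        grad_mu += df_dz           # dz/dmu = 1
        grad_sigma += df_dz * eps  # dz/dsigma = eps
    return grad_mu / num_samples, grad_sigma / num_samples

g_mu, g_sigma = reparam_gradients(mu=1.5, sigma=0.5)
# Exact values: E[z^2] = mu^2 + sigma^2, so the gradients are 2*mu and 2*sigma.
print(g_mu, g_sigma)
```

The same pattern carries over when $f$ is the log-ratio inside the ELBO and the derivatives are supplied by automatic differentiation.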
<hr/>

*Added during Lab*
</br>
</br>
\begin{aligned}
\nabla_{\theta, \phi} \; \text{ELBO}(\theta, q_\phi) &= \sum_n\; \left( \underbrace{ \nabla_{\theta, \phi}\; \mathbb{E}_{q_\phi(z_n)}\;   \mathrm{log}\; p_{\theta}(y_n|z_n)}_{\text{(1)}}- \underbrace{\nabla_{\theta, \phi}\text{D}_{\text{KL}}\; \left[q_{\phi}(z_n)||\overbrace{p_{\theta}(z_n)}^{\text{Normal prior}\; \mathcal{N}(0,I)}\right]}_{\text{(2)}} \right)\\
\end{aligned}
</br>  
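Consider term (1), rewritten with the reparameterization $z_n = \mu_n + \Sigma_n^{\frac{1}{2}}\epsilon$:
</br>  
\begin{aligned}
\text{(1)}\;  &= \nabla_{\theta, \phi}\; \underset{\epsilon\sim\mathcal{N}(0,I)}{\mathbb{E}}\; \mathrm{log}\; p_\theta(y_n|\mu_n+\Sigma_n^{\frac{1}{2}}\epsilon)\\
\end{aligned}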
</br>  
Where we know that:
\begin{aligned}
z_n &\in \mathbb{R}^J\\
\epsilon &\sim \mathcal{N}(0,I)\\
z_n &\sim \mathcal{N}(\mu_n, \Sigma_n)\\
y_n = \theta_1 z_n &\sim \mathcal{N}(\theta_1 \mu_n, \theta_1^2 \Sigma_n)\\
\end{aligned}
So, for this linear example (treating $\theta_1$ as a scalar, as the $\theta_1^2$ above suggests), we can write the Gaussian log-density out explicitly:
</br>
\begin{aligned}
\text{(1)} &= \nabla_{\theta, \phi} \underset{\epsilon\sim\mathcal{N}(0,I)}{\mathbb{E}}\; \left(-\frac{J}{2}\mathrm{log}\,2\pi-\frac{1}{2}\mathrm{log}|\theta_1^2\Sigma_n| - \frac{1}{2} (y_n-\theta_1\mu_n)^\text{T}\; \frac{1}{\theta_1^2} \Sigma_n^{-1}(y_n - \theta_1\mu_n) \right)\\
\end{aligned}
</br>  
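And term (2), which has a closed form when $q_\phi(z_n)$ is a diagonal Gaussian with means $\mu_j$ and variances $\sigma_j^2$ and the prior is $\mathcal{N}(0, I)$:
</br>  
\begin{aligned}
\text{(2)}\; &= \nabla_{\theta, \phi}\left[-\frac{1}{2}\sum_{j=1}^{J}\; \left(1+\mathrm{log}\, \sigma_{j}^2 - \mu_j^2 - \sigma_j^2\right)\right] \\
\end{aligned}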
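
The KL divergence between a diagonal Gaussian and a standard normal prior is cheap to evaluate directly. A minimal sketch, assuming the means and variances are passed as plain Python lists (the function name is illustrative):

```python
import math

def kl_diag_gaussian_to_standard_normal(mu, sigma2):
    """KL[ N(mu, diag(sigma2)) || N(0, I) ]
    = -0.5 * sum_j (1 + log(sigma2_j) - mu_j^2 - sigma2_j)."""
    return -0.5 * sum(1.0 + math.log(s2) - m ** 2 - s2
                      for m, s2 in zip(mu, sigma2))

# q equal to the prior: the divergence vanishes.
kl_zero = kl_diag_gaussian_to_standard_normal([0.0, 0.0], [1.0, 1.0])

# Shifting one mean by 1 while keeping unit variance costs 0.5 * 1^2.
kl_shifted = kl_diag_gaussian_to_standard_normal([1.0, 0.0], [1.0, 1.0])
print(kl_zero, kl_shifted)
```

Since this term is an explicit function of $\mu_j$ and $\sigma_j^2$, its gradient with respect to $\phi$ needs no sampling; only term (1) requires the Monte Carlo estimate.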
</br>  
