## General Settings and Goal
Let $p\in\mathcal P$ be a target density of the form $p(x) = \frac{1}{z}\tilde p(x)$, where the normalizer $z$ is intractable to compute and $\tilde p$ is a nonnegative function of $x$ that is not necessarily a probability density. To approximate $p$ efficiently, we choose some $q_\phi\in \mathcal Q$ from a tractable family $\mathcal Q$; $p$ itself may or may not lie in $\mathcal Q$. 

The goal is to minimize a __loss__ between $p$ and $q$. This __loss__ need not be the Euclidean distance between parameters. 

## Kullback–Leibler divergence

To compare two distributions $p, q$, define
$$D_{KL}(q\parallel p) = E_{x\sim q}\log \frac{q(x)}{p(x)} = \int \log \frac{q(x)}{p(x)} q(x)dx$$

### Properties
$D_{KL}(p\parallel q) \geq 0$  
$D_{KL}(p\parallel q) = 0 \Leftrightarrow p=q$  
$D_{KL}(p\parallel q)\neq D_{KL}(q\parallel p)$ in general, i.e. $D_{KL}$ is not symmetric

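Some intuition for how $D_{KL}$ behaves:  
$p\approx q\Rightarrow D_{KL}$ small in both directions  
On a region where $p$ is large and $q$ is small $\Rightarrow$ the contribution to ${D_{KL}(q\parallel p)}$ is small, while the contribution to ${D_{KL}(p\parallel q)}$ is large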
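The asymmetry can be checked numerically. The following is a minimal sketch for discrete distributions; the helper `kl` and the two-state distributions `p` and `q` are hypothetical choices made only for illustration.

```python
import numpy as np

def kl(a, b):
    """D_KL(a || b) = sum_x a(x) * log(a(x) / b(x)) for discrete distributions."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    support = a > 0  # terms with a(x) = 0 contribute 0 by convention
    return float(np.sum(a[support] * np.log(a[support] / b[support])))

p = np.array([0.5, 0.5])    # hypothetical target with two equally likely states
q = np.array([0.01, 0.99])  # q nearly ignores the first state

print("D_KL(q || p) =", kl(q, p))  # weighted by q, so the ignored state barely counts
print("D_KL(p || q) =", kl(p, q))  # weighted by p, so the ignored state is penalized
print("D_KL(p || p) =", kl(p, p))
```

In this example the first state has $p$ large and $q$ small, so $D_{KL}(p\parallel q)$ comes out larger than $D_{KL}(q\parallel p)$.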

The two directions therefore lead to different $q_{\phi}$ when we optimize over $\mathcal Q$.

Minimizing $D_{KL}(q\parallel p)$ is called __information projection__. It is barely penalized for missing regions where $p$ has mass, so it tends to capture only part (typically the dominant part) of the pattern of $p$.  
Therefore, if we sample from such a $q$, the samples may be biased toward part of the data. 

Minimizing $D_{KL}(p\parallel q)$ is called __moment projection__, since its objective is to match the moments of $p$.  
Therefore, if we sample from such a $q$, we may introduce new features that do not exist in $p$. 

Most of the time, we prefer the information projection, since it avoids placing mass where $p$ has little and captures the shape of $p$ (although possibly only part of it). 
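The difference between the two projections can be seen by fitting a single Gaussian to a bimodal target. The sketch below uses a brute-force grid search and a Riemann sum. The mixture weights, the grids, and the names `log_normal` and `project` are all assumptions made for illustration, not a recommended optimization method.

```python
import numpy as np

def log_normal(x, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)

x = np.linspace(-12.0, 12.0, 2401)  # integration grid
dx = x[1] - x[0]

# Hypothetical bimodal target p = 0.7 N(-3, 1) + 0.3 N(3, 1), kept in log space
log_p = np.logaddexp(np.log(0.7) + log_normal(x, -3.0, 1.0),
                     np.log(0.3) + log_normal(x, 3.0, 1.0))
p = np.exp(log_p)

mus = np.linspace(-5.0, 5.0, 41)    # candidate means
sigmas = np.linspace(0.5, 5.0, 46)  # candidate standard deviations

def project(direction):
    """Grid search for the Gaussian q minimizing the chosen KL direction."""
    best = (np.inf, None, None)
    for mu in mus:
        for sigma in sigmas:
            log_q = log_normal(x, mu, sigma)
            if direction == "q||p":   # information projection
                value = np.sum(np.exp(log_q) * (log_q - log_p)) * dx
            else:                     # moment projection, D_KL(p || q)
                value = np.sum(p * (log_p - log_q)) * dx
            if value < best[0]:
                best = (value, mu, sigma)
    return best

_, mu_i, sigma_i = project("q||p")
_, mu_m, sigma_m = project("p||q")
print("information projection: mu =", mu_i, "sigma =", sigma_i)
print("moment projection:      mu =", mu_m, "sigma =", sigma_m)
```

Under these assumptions, the information projection settles on the heavier mode near $-3$ with a narrow width, while the moment projection lands near the overall mean of $p$ with a wide width, placing mass between the modes where $p$ has very little.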

### Evidence lower bound (ELBO)

\begin{align*}
D_{KL}(q\parallel p) &= E_{x\sim q} \log(\frac{q(x)}{p(x)}) \\
&= E_q \left[\log q(x) - \left(\log \frac{1}{z} + \log \tilde p(x)\right)\right]\\
&= E_q [\log q(x) - \log(\tilde p(x))] + \log z\\
&= D_{KL}(q\parallel \tilde p) + \log z
\end{align*}
Therefore, since $\log z$ does not depend on $q$, minimizing the information projection is the same as minimizing the unnormalized information projection. 

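Moreover, since $D_{KL}(q\parallel p)\geq 0$, 
$$D_{KL}(q\parallel \tilde p) = D_{KL}(q\parallel p) - \log z \geq -\log z = - \log p(\mathcal D)$$
where the last equality holds when $p$ is a posterior given data $\mathcal D$, so that $z = p(\mathcal D)$. Hence $D_{KL}(q\parallel \tilde p)$ is an upper bound on the negative log-likelihood (NLL) of the data. Define the __Evidence lower bound (ELBO)__ as $-D_{KL}(q\parallel \tilde p) \leq \log p(\mathcal D)$; maximizing it over $q$ turns an intractable integration into an optimization problem. 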
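The ELBO identity can be verified on a small discrete example where $z$ is still computable. This is a minimal sketch: the values in `p_tilde` and `q` are hypothetical, and `kl_unnorm` is an illustrative helper that applies the KL formula to a possibly unnormalized second argument.

```python
import numpy as np

def kl_unnorm(q, f):
    """sum_x q(x) * log(q(x) / f(x)); f may be unnormalized."""
    return float(np.sum(q * np.log(q / f)))

p_tilde = np.array([0.02, 0.10, 0.08])  # hypothetical unnormalized target values
z = p_tilde.sum()                       # normalizer (tractable only in this toy case)
p = p_tilde / z

q = np.array([0.2, 0.5, 0.3])           # hypothetical approximating distribution

elbo = -kl_unnorm(q, p_tilde)           # ELBO = -D_KL(q || p_tilde)
gap = np.log(z) - elbo                  # should equal D_KL(q || p) >= 0

print("log z         =", np.log(z))
print("ELBO          =", elbo)
print("log z - ELBO  =", gap)
print("D_KL(q || p)  =", kl_unnorm(q, p))
print("ELBO at q = p =", -kl_unnorm(p, p_tilde))  # the bound is tight here
```

The gap between $\log z$ and the ELBO is exactly $D_{KL}(q\parallel p)$, so raising the ELBO over $q$ lowers the information projection without ever computing $z$.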