- Deep GP: https://arxiv.org/pdf/1602.04133.pdf, https://arxiv.org/abs/1211.0358, https://github.com/otokonoko8/deep-Bayesian-nonparametrics-papers/blob/master/README.md
- OPVI: https://arxiv.org/abs/1610.09033
- neural AR flows: http://proceedings.mlr.press/v80/huang18d.html
- Deep ensembles: https://proceedings.neurips.cc/paper/2017/file/9ef2ed4b7fd2c810847ffa5fa85bce38-Paper.pdf
- https://csc2541-f17.github.io/
- BNN discussion: https://jacobbuckman.com/2020-01-22-bayesian-neural-networks-need-not-concentrate/, https://cims.nyu.edu/~andrewgw/caseforbdl/
- https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html
- Interesting comparison at the end: https://wjmaddox.github.io/assets/BNN_tutorial_CILVR.pdf


# [Advances in Variational Inference](https://arxiv.org/pdf/1711.05597.pdf)

## Recap from class 3

We are interested in a posterior 
$$
p(\theta|\mathcal{D}) = \frac{p(\mathcal{D}|\theta) p(\theta)}{p(\mathcal{D})}
$$
which may be intractable

In that case we do approximate inference either through sampling (MCMC) or optimization (VI). 

In the latter we select a (simple) approximate posterior $q_\nu(\theta)$ and we optimize the parameters $\nu$ by maximizing the evidence lower bound (ELBO)

$$
\begin{align}
\log p(\mathcal{D}) \geq  \mathcal{L}(\nu) &= - \int q_\nu(\theta) \log \frac{q_\nu(\theta)}{p(\mathcal{D}|\theta) p (\theta)} d\theta  \\
&= \mathbb{E}_{\theta \sim q_\nu(\theta)} \left[\log \frac{p(\mathcal{D}|\theta) p(\theta)}{q_\nu(\theta)}\right ]  \nonumber \\
&= \mathbb{E}_{\theta \sim q_\nu(\theta)} \left[\log p(\mathcal{D}|\theta)\right]- D_{KL}[q_\nu(\theta) || p(\theta)]  \nonumber 
\end{align}
$$

$$
\hat \nu = \text{arg}\max_\nu \mathbb{E}_{\theta \sim q_\nu(\theta)} \left[\log p(\mathcal{D}|\theta)\right]- D_{KL}[q_\nu(\theta) || p(\theta)] 
$$
which makes $q_\nu(\theta)$ close to $p(\theta|\mathcal{D})$
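
To make the bound concrete, here is a minimal sketch on a toy conjugate model where every term is available in closed form. The model (Gaussian likelihood with known `sigma`, Gaussian prior with scale `tau`), the synthetic data and all function names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conjugate model (assumed for illustration):
# x_i ~ N(theta, sigma^2) with known sigma, prior theta ~ N(0, tau^2)
sigma, tau = 1.0, 2.0
x = rng.normal(1.5, sigma, size=20)
n = x.size

def log_evidence():
    """Exact log p(D): marginally x ~ N(0, sigma^2 I + tau^2 11^T)."""
    cov = sigma**2 * np.eye(n) + tau**2 * np.ones((n, n))
    _, logdet = np.linalg.slogdet(cov)
    quad = x @ np.linalg.solve(cov, x)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

def kl_gauss(m1, s1, m2, s2):
    """KL[N(m1, s1^2) || N(m2, s2^2)] in closed form."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

def elbo(m, s):
    """ELBO for q(theta) = N(m, s^2): E_q[log p(D|theta)] - KL[q || prior]."""
    # E_q[(x_i - theta)^2] = (x_i - m)^2 + s^2
    exp_loglik = np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                        - ((x - m)**2 + s**2) / (2 * sigma**2))
    return exp_loglik - kl_gauss(m, s, 0.0, tau)

# Exact posterior of the conjugate model
post_var = 1.0 / (1.0 / tau**2 + n / sigma**2)
post_mean = post_var * x.sum() / sigma**2
post_std = np.sqrt(post_var)

log_ev = log_evidence()
elbo_exact = elbo(post_mean, post_std)   # q equal to the true posterior
elbo_poor = elbo(0.0, 1.0)               # a deliberately poor q
print("log evidence:", log_ev)
print("ELBO with q = posterior:", elbo_exact)
print("ELBO with a poor q:", elbo_poor)
```

The gap between the log evidence and the ELBO is $D_{KL}[q_\nu(\theta) || p(\theta|\mathcal{D})]$, so it vanishes only when $q_\nu$ is the exact posterior.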

> There is a trade-off between how flexible/expressive the approximate posterior is and how easy it is to evaluate and optimize this expression

We have seen in this course how VI is coupled with **stochastic gradient descent** and **parameter amortization through neural networks**, making it scalable to large datasets. We have also seen different estimators that reduce variance and make VI applicable to more general models.

In what follows we review different ways to improve VI beyond these

*Disclaimer:* This is an active area of research and I may have missed something in this review

## 1. More flexible approximate posteriors for VI


#### Normalizing flows

One way to obtain a more flexible posterior that is still tractable is to start with a simple distribution and apply a sequence of invertible transformations

This is the key idea behind [normalizing flows](https://arxiv.org/abs/1505.05770)

Let's say that $z\sim q(z)$ where $q$ is simple, *e.g.* a standard Gaussian

and that there is a smooth and invertible transformation $f$, *i.e.* $f^{-1}(f(z)) = z$

Then $z' = f(z)$ is a random variable too, and its distribution is

$$
q_{z'}(z') = q(z) \left| \det \frac{\partial f^{-1}}{\partial z'} \right| = q(z) \left| \det \frac{\partial f}{\partial z} \right|^{-1}
$$

which is the original density times the inverse of the absolute value of the Jacobian determinant of the transformation

And we can apply a chain of transformations $f_1, f_2, \ldots, f_K$ obtaining

$$
q_K(z_K) = q_0(z_0) \prod_{k=1}^K \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|^{-1}
$$

With this we can go from a simple Gaussian to more expressive/complex/multi-modal distributions 
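
As a minimal sketch of the change of variables, the snippet below chains two planar flows, $f(z) = z + u \tanh(w^\top z + b)$, and tracks the log-density of the samples. The parameter values are arbitrary assumptions (chosen so that $w^\top u > -1$, which keeps each flow invertible), and the finite-difference helper is only there to check the analytic log-determinant.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 2

def planar_flow(z, u, w, b):
    """f(z) = z + u * tanh(w.z + b); returns f(z) and log|det df/dz|."""
    a = z @ w + b                                  # shape (N,)
    f = z + np.outer(np.tanh(a), u)                # shape (N, D)
    psi = (1.0 - np.tanh(a)**2)[:, None] * w       # h'(a) * w, shape (N, D)
    # Jacobian is I + u psi^T, so its determinant is 1 + psi.u
    log_det = np.log(np.abs(1.0 + psi @ u))
    return f, log_det

def log_standard_normal(z):
    return -0.5 * np.sum(z**2, axis=1) - 0.5 * z.shape[1] * np.log(2 * np.pi)

def numerical_log_det(z_point, u, w, b, eps=1e-6):
    """Central finite-difference Jacobian, used only as a sanity check."""
    J = np.zeros((D, D))
    for j in range(D):
        dz = np.zeros(D)
        dz[j] = eps
        f_plus, _ = planar_flow((z_point + dz)[None, :], u, w, b)
        f_minus, _ = planar_flow((z_point - dz)[None, :], u, w, b)
        J[:, j] = (f_plus[0] - f_minus[0]) / (2 * eps)
    return np.log(np.abs(np.linalg.det(J)))

# Hypothetical flow parameters (u, w, b)
flows = [
    (np.array([1.0, 0.5]), np.array([2.0, -1.0]), 0.0),
    (np.array([-0.5, 1.0]), np.array([1.0, 1.0]), 0.5),
]

z0 = rng.standard_normal((1000, D))
z, log_q = z0, log_standard_normal(z0)
for u, w, b in flows:
    z, log_det = planar_flow(z, u, w, b)
    log_q = log_q - log_det        # log q_k = log q_{k-1} - log|det J_k|
zK, log_qK = z, log_q
print("mean log q_K over samples:", log_qK.mean())
```

Each flow only costs a dot product to get its determinant, which is why this family is cheap to stack inside the ELBO.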

Nowadays several types of flows exist in the literature, *e.g.* planar, radial, autoregressive

[Normalizing flows have been used to make the approximate posterior in VAE more expressive](https://arxiv.org/abs/1809.05861)

Three excellent blog posts covering normalizing flows:
- https://blog.evjang.com/2018/01/nf1.html
- http://akosiorek.github.io/ml/2018/04/03/norm_flows.html
- https://lilianweng.github.io/lil-log/2018/10/13/flow-based-deep-generative-models.html

[Normalizing flows in Pyro](https://bmazoure.github.io/posts/nf-in-pyro/)

#### Adding more structure

Another way to improve the variational approximation is including auxiliary variables. For example in [Auxiliary Deep Generative Models](https://arxiv.org/abs/1602.05473) the VAE was extended by introducing a variable $a$ that does not modify the generative process but makes the approximate posterior more expressive

In this case the graphical model of the approximate posterior is $q(a, z |x) = q(z|a,x)q(a|x)$, so that the marginal $q(z|x)$ can fit more complicated posteriors. The graphical model of the generative process is $p(a,x,z) = p(a|x,z)p(x,z)$, *i.e.* under marginalization of $a$, $p(x,z)$ is recovered

The ELBO in this case is 

$$
\log p(x) = \log \int_z \int_a p(a, x, z) \,da \,dz \geq \mathbb{E}_{a, z \sim q_\nu(a, z|x)} \left[\log \frac{p_\theta(a|x,z)p_\theta(x|z)p(z)}{q_\nu(a|x)q_\nu(z|a,x)}\right ]
$$



## 2. Tighter bounds on the evidence

#### Importance weighting

Idea based on importance sampling: bounds tighter than the standard ELBO can be obtained by sampling several $z$ for a given $x$. This was explored for autoencoders in [Importance Weighted Autoencoders](https://arxiv.org/abs/1509.00519)

Let's say we independently sample $K$ latent variables from the approximate posterior. This yields a lower bound on the evidence that gets progressively tighter as $K$ grows:

$$
\mathcal{L}_K = \mathbb{E}_{z_K, \ldots, z_2, z_1 \sim q_\phi(z|x)} \log \frac{1}{K}\sum_{k=1}^K \frac{p_\theta(x, z_k)}{q_\phi(z_k|x)}
$$

where $w_k = \frac{p_\theta(x, z_k)}{q_\phi(z_k|x)}$ are called the importance weights. Note that for $K=1$ we recover the VAE bound.
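
A small numerical sketch of how the bound tightens with $K$. The toy model ($z \sim \mathcal{N}(0,1)$, $x|z \sim \mathcal{N}(z,1)$, so that $p(x) = \mathcal{N}(0, 2)$ is known exactly) and the crude proposal $q(z|x)$ equal to the prior are assumptions chosen so the result can be compared against the true evidence.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(v, mean, std):
    return -0.5 * np.log(2 * np.pi * std**2) - (v - mean)**2 / (2 * std**2)

x_obs = 1.0
q_mean, q_std = 0.0, 1.0                       # crude proposal (assumed)
log_px = log_normal(x_obs, 0.0, np.sqrt(2.0))  # exact log evidence

def iw_bound(K, n_rep=20000):
    """Monte Carlo estimate of L_K, averaged over n_rep independent repetitions."""
    z = rng.normal(q_mean, q_std, size=(n_rep, K))
    # log importance weights: log p(x, z_k) - log q(z_k | x)
    log_w = (log_normal(z, 0.0, 1.0) + log_normal(x_obs, z, 1.0)
             - log_normal(z, q_mean, q_std))
    # numerically stable log of the mean weight over the K samples
    m = log_w.max(axis=1, keepdims=True)
    log_mean_w = m[:, 0] + np.log(np.mean(np.exp(log_w - m), axis=1))
    return log_mean_w.mean()

bounds = {K: iw_bound(K) for K in (1, 5, 50)}
for K, value in bounds.items():
    print(f"L_{K} = {value:.4f}")
print(f"log p(x) = {log_px:.4f}")
```

With $K=1$ the estimate is the usual ELBO. Larger $K$ moves it towards $\log p(x)$ without changing $q_\phi$ itself.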

This tighter bound [has been shown to be equivalent to using the regular bound with a more complex posterior](https://arxiv.org/pdf/1808.09034.pdf) 

An interesting [blog post](http://artem.sobolev.name/posts/2016-07-14-neural-variational-importance-weighted-autoencoders.html) and recent follow-up: [Debiasing Evidence Approximations: On Importance-weighted Autoencoders and Jackknife Variational Inference](https://openreview.net/forum?id=HyZoi-WRb) and discussion [Tighter Variational Bounds are Not Necessarily Better](https://arxiv.org/abs/1802.04537)

[Support for importance sampling in Pyro?](http://docs.pyro.ai/en/stable/inference_algos.html#module-pyro.infer.importance)

## 3. Other divergence measures

#### $\alpha$ divergence 

The KL divergence is computationally convenient but there are other options to measure how far apart two distributions are

For example the family of $\alpha$ divergences (Rényi's formulation)

$$
D_\alpha(p||q) = \frac{1}{\alpha -1} \log \int p(x)^\alpha q(x)^{1-\alpha} \,dx
$$

which is a generalization of the KL divergence: for $\alpha \to 1$ the KL is recovered

$\alpha$ represents a trade-off between the mass-covering and zero-forcing effects
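
A quick numerical check of the limit $\alpha \to 1$, as a sketch: the Rényi divergence between two 1D Gaussians is integrated on a grid and compared with the closed-form KL. The two Gaussians and the grid are arbitrary assumptions.

```python
import numpy as np

def log_normal(v, mean, std):
    return -0.5 * np.log(2 * np.pi * std**2) - (v - mean)**2 / (2 * std**2)

# p = N(0, 1) and q = N(1, 1.5^2), evaluated on a fine grid
grid = np.linspace(-20.0, 20.0, 200001)
dx = grid[1] - grid[0]
log_p = log_normal(grid, 0.0, 1.0)
log_q = log_normal(grid, 1.0, 1.5)

def renyi_divergence(alpha):
    """D_alpha(p||q) = log( int p^alpha q^(1-alpha) dx ) / (alpha - 1)."""
    integrand = np.exp(alpha * log_p + (1.0 - alpha) * log_q)
    return np.log(np.sum(integrand) * dx) / (alpha - 1.0)

# KL(p||q): numerically on the grid and in closed form for two Gaussians
kl_numeric = np.sum(np.exp(log_p) * (log_p - log_q)) * dx
kl_exact = np.log(1.5) + (1.0 + 1.0) / (2 * 1.5**2) - 0.5

for alpha in (0.5, 0.999, 2.0):
    print(f"alpha = {alpha}: D_alpha = {renyi_divergence(alpha):.4f}")
print(f"KL = {kl_exact:.4f}")
```

The divergence grows with $\alpha$, so moving $\alpha$ away from 1 changes how strongly mismatches between $p$ and $q$ are penalized.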

The $\alpha$ divergence has been explored for [VI recently](https://arxiv.org/pdf/1511.03243.pdf) and is [implemented in Pyro](http://docs.pyro.ai/en/stable/inference_algos.html#pyro.infer.renyi_elbo.RenyiELBO)

#### f divergence

The $\alpha$ divergence is a particular case of the f-divergence

$$
D_f(p||q) = \int q(x) f \left ( \frac{p(x)}{q(x)} \right) \,dx
$$

where $f$ is a convex function with $f(1) = 0$. The KL is recovered for $f(z) = z \log(z)$

For VI, $f$ should be such that the resulting bound does not depend on the marginal likelihood
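
The definition is easy to check on discrete distributions, where the integral becomes a sum. This is only a sketch: the two probability vectors are made up, and the three choices of $f$ are standard textbook cases.

```python
import numpy as np

# Two hypothetical discrete distributions over three outcomes
p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])

def f_divergence(p, q, f):
    """D_f(p||q) = sum_x q(x) f(p(x) / q(x)) for discrete distributions."""
    return np.sum(q * f(p / q))

kl = f_divergence(p, q, lambda t: t * np.log(t))              # KL(p||q)
reverse_kl = f_divergence(p, q, lambda t: -np.log(t))         # KL(q||p)
total_variation = f_divergence(p, q, lambda t: 0.5 * np.abs(t - 1.0))

print("KL(p||q):", kl)
print("KL(q||p):", reverse_kl)
print("total variation:", total_variation)
```

All three choices satisfy $f(1) = 0$, which is what makes the divergence zero when $p = q$.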

[VI with f-divergence](https://papers.nips.cc/paper/7816-variational-inference-with-tail-adaptive-f-divergence.pdf) and its [Pyro implementation](http://docs.pyro.ai/en/stable/inference_algos.html#pyro.infer.trace_tail_adaptive_elbo.TraceTailAdaptive_ELBO)

#### Stein variational gradient descent (SVGD)

Another, totally different approach is based on the **Stein operator**

$$
\mathcal{A}_p \phi(x) = \phi(x) \nabla_x \log p(x)  + \nabla_x \phi(x)
$$

where $p(x)$ is a distribution and $\phi(x) = [\phi_1(x), \phi_2(x), \ldots, \phi_d(x)]$ a smooth vector function

For this operator the following, known as the **Stein identity**, holds under mild conditions on $\phi$

$$
\mathbb{E}_{x\sim p(x)} \left [  \mathcal{A}_p \phi(x)  \right] = 0,
$$


Now, for another distribution $q(x)$ with the same support as $p$, we can write 

$$
\mathbb{E}_{x\sim q(x)} \left [ \mathcal{A}_p \phi(x) \right] - \mathbb{E}_{x\sim q(x)} \left [ \mathcal{A}_q \phi(x) \right]= \mathbb{E}_{x\sim q(x)} \left [ \phi(x) ( \nabla_x \log p(x) - \nabla_x \log q(x)) \right]
$$ 

from which the **Stein discrepancy** between two distributions is defined

$$
\sqrt{S(q, p)} = \max_{\phi\in \mathcal{F}} \mathbb{E}_{x\sim q(x)} \left [ \text{trace} \left( \mathcal{A}_p \phi(x) \right) \right]
$$

For this to work in practice, $\mathcal{F}$ needs to be broad enough

This is where kernels can be used. By taking an infinite number of basis functions $\phi(x)$ (the unit ball of a reproducing kernel Hilbert space) in the Stein discrepancy, it can be shown that the optimization has the closed-form solution

$$
\textbf{S}(q, p) = \mathbb{E}_{x, x' \sim q(x)} \left [ \mathcal{A}_p^x \mathcal{A}_p^{x'} \kappa(x, x')\right]
$$

where $\kappa$ is a kernel function, *e.g.* RBF or rational quadratic

From this one can derive gradient-based updates for a set of particles that represent $q$, which is what SVGD does
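
A minimal NumPy sketch of the particle update, following the update rule of the SVGD paper as I understand it: each particle moves along a kernel-weighted average of the scores (attraction) plus the kernel gradient (repulsion). The 1D Gaussian target, the RBF kernel with median-heuristic bandwidth, the step size and the number of particles are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

target_mean, target_std = 2.0, 1.0   # hypothetical target p(x) = N(2, 1)

def score(x):
    """grad_x log p(x) for the Gaussian target."""
    return -(x - target_mean) / target_std**2

def rbf_kernel(x):
    """Kernel matrix k[j, i] = k(x_j, x_i) and its gradient w.r.t. x_j."""
    diff = x[:, None] - x[None, :]                    # diff[j, i] = x_j - x_i
    sq = diff**2
    h = np.median(sq) / np.log(x.size + 1) + 1e-8     # median heuristic
    k = np.exp(-sq / h)
    grad_k = -2.0 * diff / h * k                      # d k(x_j, x_i) / d x_j
    return k, grad_k

def svgd_step(x, step):
    k, grad_k = rbf_kernel(x)
    # phi(x_i) = mean_j [ k(x_j, x_i) * score(x_j) + d k(x_j, x_i) / d x_j ]
    phi = (k.T @ score(x) + grad_k.sum(axis=0)) / x.size
    return x + step * phi

# Start far from the target and with too little spread
particles = rng.normal(-2.0, 0.5, size=100)
for _ in range(1000):
    particles = svgd_step(particles, step=0.1)

print("particle mean:", particles.mean())
print("particle std:", particles.std())
```

Without the repulsive term all particles would collapse onto the mode. With it, the particle cloud spreads out to approximate the whole target.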


- [List of papers related to SVGD](https://www.cs.dartmouth.edu/~qliu/stein.html)
- [Pyro implementation](http://docs.pyro.ai/en/stable/inference_algos.html#module-pyro.infer.svgd)
- [Operator Variational Inference (OPVI)](https://papers.nips.cc/paper/6091-operator-variational-inference.pd) also employs the Stein operator

# More on Bayesian Neural Networks

As we have seen in this course, training a fully Bayesian neural network is still not a solved problem and several issues remain: slow convergence, high variance, overly simple posteriors, etc

Training a Bayesian neural network using VI amounts to maximizing
$$
\mathcal{L}(\nu) = \mathbb{E}_{\theta \sim q_\nu(\theta)} \left[\log p(\mathcal{D}|\theta)\right]- D_{KL}[q_\nu(\theta) || p(\theta)]
$$
where $\nu$ are the parameters of the approximate posterior


The strategy proposed in [Blundell et al. 2015, called Bayes by Backprop](https://arxiv.org/pdf/1505.05424.pdf), consists of replacing the expectation with Monte Carlo estimates

$$
\mathcal{L}(\nu) \approx \frac{1}{K} \sum_{k=1}^K \left[ \sum_{i=1}^N \log p(x_i|\theta_k)-  \log q_\nu(\theta_k)  + \log p(\theta_k) \right], \quad \theta_k \sim q_\nu(\theta)
$$

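where $N$ is the number of data samples in the minibatch and $K$ is the number of times we sample the parameters $\theta$ from $q_\nu(\theta)$. When $N$ is a minibatch rather than the full dataset, the $\log q_\nu$ and $\log p(\theta)$ terms have to be down-weighted accordingly so that they are counted once per epoch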

- Freedom in the choice of the variational posterior (we don't require a closed form for the KL divergence)
- The reparameterization trick is used to reduce variance
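
The recipe can be sketched on a model with a single weight, where the exact posterior is known and can be used as a check. Everything here is an assumption for illustration: the toy regression data, the Gaussian $q_\nu(w) = \mathcal{N}(\mu, e^{2\rho})$, the learning rate, and the hand-written gradients that stand in for backprop.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: y = w_true * x + noise, with prior w ~ N(0, prior_std^2)
w_true, noise_std, prior_std = 1.5, 0.5, 1.0
x = rng.uniform(-1, 1, size=50)
y = w_true * x + noise_std * rng.standard_normal(50)

def grad_log_joint(w):
    """d/dw [log p(D|w) + log p(w)], vectorized over an array of samples w."""
    resid = y[None, :] - w[:, None] * x[None, :]
    return (resid * x[None, :]).sum(axis=1) / noise_std**2 - w / prior_std**2

mu, rho = 0.0, np.log(0.5)       # variational parameters, std = exp(rho)
lr, K = 2e-3, 10
for _ in range(5000):
    eps = rng.standard_normal(K)
    s = np.exp(rho)
    w = mu + s * eps             # reparameterization trick
    g = grad_log_joint(w)
    # Gradients of the sampled objective log p(D|w) + log p(w) - log q(w).
    # With eps held fixed, -log q(w) = rho + eps^2 / 2 + const, which
    # contributes 0 to the mu gradient and +1 to the rho gradient.
    grad_mu = g.mean()
    grad_rho = (g * eps * s).mean() + 1.0
    mu += lr * grad_mu           # gradient ascent on the ELBO estimate
    rho += lr * grad_rho

# Exact posterior of this conjugate model, for comparison
post_prec = 1.0 / prior_std**2 + np.sum(x**2) / noise_std**2
post_mean = np.sum(x * y) / noise_std**2 / post_prec
post_std = 1.0 / np.sqrt(post_prec)
print("VI:    mean =", mu, " std =", np.exp(rho))
print("exact: mean =", post_mean, " std =", post_std)
```

In a real network the two gradient lines are what automatic differentiation provides. Nothing else in the loop changes.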

##### Prior in Bayes by backprop

As no closed-form is needed more complex priors can be used. In the original Bayes-by-backprop paper the following is considered

$$
p(\theta) = \pi_1 \mathcal{N}(0, \sigma_1^2) + \pi_2 \mathcal{N}(0, \sigma_2^2)
$$

with $\sigma_1 \ll \sigma_2$ and $\pi_1 + \pi_2 = 1$

The term with smaller variance allows for automatic "shut-down" (pruning) of weights: **sparsification**
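
Since only $\log p(\theta)$ is needed in the Monte Carlo objective, the mixture prior takes a few lines. This is a sketch and the values of `pi1`, `sigma1` and `sigma2` are placeholders, not the ones used in the paper.

```python
import numpy as np

def log_normal(theta, std):
    return -0.5 * np.log(2 * np.pi * std**2) - theta**2 / (2 * std**2)

def log_scale_mixture_prior(theta, pi1=0.5, sigma1=0.05, sigma2=1.0):
    """log[ pi1 * N(theta; 0, sigma1^2) + (1 - pi1) * N(theta; 0, sigma2^2) ]."""
    # logaddexp avoids underflow when one component is vanishingly small
    return np.logaddexp(np.log(pi1) + log_normal(theta, sigma1),
                        np.log(1.0 - pi1) + log_normal(theta, sigma2))

# Sanity check: the density should integrate to one
grid = np.linspace(-10.0, 10.0, 400001)
dx = grid[1] - grid[0]
total_mass = np.sum(np.exp(log_scale_mixture_prior(grid))) * dx
print("total mass:", total_mass)
print("log p(0) vs log p(1):",
      log_scale_mixture_prior(0.0), log_scale_mixture_prior(1.0))
```

The narrow component puts a sharp spike at zero, which is what rewards weights for shrinking, while the wide component keeps large weights plausible.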

[Gaussian scale mixtures are implemented in Pyro](http://docs.pyro.ai/en/stable/distributions.html#pyro.distributions.GaussianScaleMixture) but are a bit tricky to use

Other implementations of Bayes by backprop
- https://www.nitarshan.com/bayes-by-backprop/
- http://krasserm.github.io/2019/03/14/bayesian-neural-networks/
- https://gluon.mxnet.io/chapter18_variational-methods-and-uncertainty/bayes-by-backprop.html

##### [Local reparametrization trick to reduce noise](http://papers.nips.cc/paper/5666-variational-dropout-and-the-local-reparameterization-trick)

In a BNN we sample every weight as

$$
w_{ji}\sim \mathcal{N}(\mu_{ji}, \sigma_{ji}^2)
$$

using the reparameterization trick


$$
w_{ji} = \mu_{ji} +\epsilon_{ji} \cdot\sigma_{ji}, \quad \epsilon_{ji} \sim \mathcal{N}(0, 1)
$$

but this requires one random number per weight, which is still a lot. The idea behind local reparameterization is that, instead of sampling every weight, we sample the pre-activations directly

$$
Z = WX + B
$$

then

$$
z_i = \nu_i + \eta_i  \cdot \epsilon_{i}
$$

where $\epsilon$ is still a standard normal and $\nu_i = \sum_j x_j \mu_{ji}$ and $\eta_i = \sqrt{\sum_j x_j^2 \sigma_{ji}^2}$

This reduces the number of samples we draw by orders of magnitude and also reduces the variance of the estimator
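
A minimal sketch comparing the two sampling schemes for a single input and a single linear layer (bias omitted). The layer sizes and the random values of `mu` and `sigma` are assumptions. The point is only that both schemes give pre-activations with the same mean and variance.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_out = 5, 3
x = rng.standard_normal(n_in)
mu = rng.standard_normal((n_in, n_out))          # mu[j, i]
sigma = 0.1 + 0.5 * rng.random((n_in, n_out))    # sigma[j, i] > 0

# Moments of the pre-activations implied by the Gaussian weights
nu = x @ mu                                      # nu_i = sum_j x_j mu_ji
eta = np.sqrt((x**2) @ (sigma**2))               # eta_i = sqrt(sum_j x_j^2 sigma_ji^2)

def sample_via_weights(n_samples):
    """One Gaussian draw per weight: n_in * n_out random numbers per sample."""
    eps = rng.standard_normal((n_samples, n_in, n_out))
    w = mu + sigma * eps
    return np.einsum("j,sji->si", x, w)

def sample_via_preactivations(n_samples):
    """Local reparameterization: only n_out random numbers per sample."""
    eps = rng.standard_normal((n_samples, n_out))
    return nu + eta * eps

z_w = sample_via_weights(200000)
z_local = sample_via_preactivations(200000)
print("mean (weights / local):", z_w.mean(axis=0), z_local.mean(axis=0))
print("std  (weights / local):", z_w.std(axis=0), z_local.std(axis=0))
```

Here the saving is a factor of `n_in` random numbers per sample, and it grows with the width of the layer.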

[Implementation in pyro](http://docs.pyro.ai/en/stable/contrib.bnn.html#pyro.contrib.bnn.hidden_layer.HiddenLayer) with [demonstration](https://alsibahi.xyz/snippets/2019/06/15/pyro_mnist_bnn_kl.html)

##### Bayes-by-backprop for convolutional neural networks

Local reparameterization trick for convolutional layers

- Blog post: https://medium.com/neuralspace/bayesian-convolutional-neural-networks-with-bayes-by-backprop-c84dcaaf086e
- [A Comprehensive guide to Bayesian Convolutional Neural Network with Variational Inference](https://arxiv.org/pdf/1901.02731.pdf)
- [Uncertainty Estimations by Softplus normalization in Bayesian Convolutional Neural Networks with Variational Inference](https://arxiv.org/pdf/1806.05978.pdf)

##### [Dropout as a Bayesian approximation](https://arxiv.org/abs/1506.02142)

Alternative take on Bayesian neural networks based on the dropout technique for regularization

Dropout turns off neurons at random following a certain distribution. The authors argue that this is like having an ensemble of neural networks and hence uncertainties can be computed

TLDR: Use dropout not only during training but also at prediction time to estimate uncertainty

[How good is this?](http://bayesiandeeplearning.org/2016/papers/BDL_4.pdf): Uncertainty with this approach (fixed dropout probability) does not decrease as new data points arrive. [A solution to this?](https://papers.nips.cc/paper/6949-concrete-dropout)
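
In code, the recipe boils down to keeping the dropout mask active at prediction time and summarizing several stochastic forward passes. The sketch below uses a small, untrained two-layer network with random weights, so the numbers mean nothing. The architecture, `p_drop` and `T` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical, untrained two-layer MLP (in practice: the trained weights)
W1 = rng.standard_normal((1, 32))
b1 = np.zeros(32)
W2 = rng.standard_normal((32, 1)) / np.sqrt(32)
b2 = np.zeros(1)

def forward(x, p_drop):
    """One stochastic forward pass with dropout on the hidden layer."""
    h = np.maximum(x @ W1 + b1, 0.0)             # ReLU hidden layer
    mask = rng.random(h.shape) >= p_drop         # keep units with prob 1 - p_drop
    h = h * mask / (1.0 - p_drop)                # inverted dropout scaling
    return h @ W2 + b2

def mc_dropout_predict(x, p_drop=0.5, T=200):
    """Mean and std of the prediction over T dropout masks."""
    preds = np.stack([forward(x, p_drop) for _ in range(T)])
    return preds.mean(axis=0), preds.std(axis=0)

x_test = np.array([[-2.0], [-0.5], [1.0], [3.0]])
mean, std = mc_dropout_predict(x_test)
mean_det, std_det = mc_dropout_predict(x_test, p_drop=0.0)
print("predictive mean:", mean.ravel())
print("predictive std: ", std.ravel())
```

With `p_drop=0.0` every pass is identical and the spread collapses to zero, which makes explicit that all the uncertainty here comes from the dropout masks.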


##### [FLIPOUT](https://arxiv.org/abs/1803.04386)

Flipout decorrelates the gradients within a minibatch by giving each example its own pseudo-independent weight perturbation, which speeds up the training of Bayesian neural networks with Gaussian perturbations
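
A sketch of the trick for one linear layer, as I understand it from the paper: a single perturbation `delta_W` is shared by the minibatch, and each example multiplies it by its own rank-one matrix of random signs. This is valid because the Gaussian perturbation is symmetric around zero. Shapes and values below are arbitrary assumptions. The explicit loop is only there to check the vectorized form.

```python
import numpy as np

rng = np.random.default_rng(0)

batch, n_in, n_out = 4, 6, 3
X = rng.standard_normal((batch, n_in))
W_mean = rng.standard_normal((n_in, n_out))
W_std = 0.1 * np.ones((n_in, n_out))

delta_W = W_std * rng.standard_normal((n_in, n_out))   # one shared perturbation
S = rng.choice([-1.0, 1.0], size=(batch, n_in))        # input-side random signs
R = rng.choice([-1.0, 1.0], size=(batch, n_out))       # output-side random signs

# Flipout forward pass, written with matrix products only
Y_flipout = X @ W_mean + ((X * S) @ delta_W) * R

# Reference: build the perturbed weight matrix of every example explicitly
Y_explicit = np.stack([
    X[n] @ (W_mean + delta_W * np.outer(S[n], R[n])) for n in range(batch)
])
print("max abs difference:", np.abs(Y_flipout - Y_explicit).max())
```

The vectorized form never materializes a weight matrix per example, so the cost stays close to that of two ordinary matrix products.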

##### [Natural gradient VI](https://papers.nips.cc/paper/8681-practical-deep-learning-with-bayesian-principles.pdf)

# Advances in MCMC

If MCMC were faster we would probably not need VI

- https://github.com/pyro-ppl/numpyro
- [Approximate MCMC](https://arxiv.org/abs/1908.03491)


Bayesian optimization

http://pyro.ai/examples/bo.html