# Lecture #17: Black-box Variational Inference
## AM 207: Advanced Scientific Computing
### Stochastic Methods for Data Analysis, Inference and Optimization
### Fall, 2021

<img src="fig/logos.jpg" style="height:150px;">

## Summary of Lecture #17

1. **(Width vs Depth)** In a NN architecture, we control the width and the depth of the network. What is the effect of increasing each (independently) on the function that we learn?
<table>
    <tr>
        <td>
            <img src="./fig/wide.jpg" style="height: 350px;" align="center"/>
        </td>
        <td>
            <img src="./fig/deep.jpg" style="height: 350px;" align="center"/>
        </td>
    </tr>
</table>

  - Extremely wide NNs have fundamentally different behaviors than wide NNs
    
    [Reconciling modern machine learning practice and the bias-variance trade-off](https://arxiv.org/pdf/1812.11118.pdf)<br><br>

  - Deep networks behave fundamentally differently than shallow but wide NNs 
    
    [Exponential expressivity in deep neural networks through transient chaos](https://proceedings.neurips.cc/paper/2016/file/148510031349642de5ca0c544f31b2ef-Paper.pdf)<br><br>

  - Deep networks can have unexpected pathologies 
    
    [COLLAPSE OF DEEP AND NARROW NEURAL NETS](https://openreview.net/pdf?id=r1MSBjA9Ym)
    
    [DEEP DOUBLE DESCENT: WHERE BIGGER MODELS AND MORE DATA HURT](https://arxiv.org/pdf/1912.02292.pdf)
<br><br>

2. **(Linear vs Neural Network Models)** While both simple and complex models can be interpreted and successfully trained, there are trade-offs between using a simple (linear) model vs a complex (neural network) model:

  - Easily interpreted vs interpretations are harder to come by
  - Interpretations are more faithful vs interpretations can be more misleading
  - Training is easier (because objective function is convex) vs training is much harder (because objective function is extremely non-convex)<br><br>
  
3. **(Bayesian Neural Networks)** Deep Bayes refers to the subfield of ML wherein we try to combine deep learning with Bayesian learning. A Bayesian Neural Network is one example of a deep Bayesian model.

  - Why do we want to make our NNs Bayesian? 
  - How do we make NNs Bayesian?
  - How do we sample from the posterior?
    - Is sampling from the posterior using HMC going to work?
    - Can we do variational inference? Now we can - with 4 pieces of technology:
      - algebra tricks to get the gradient past the expectation in the ELBO
      - Monte Carlo estimate of the expectation in the ELBO
      - automatic differentiation
      - gradient descent
      
      \begin{aligned}
\nabla_{\mu, \Sigma}\,\underbrace{\mathbb{E}_{\mathbf{W} \sim q(\mathbf{W} | \mu, \Sigma)} \left[ \log \left( \frac{p(\mathbf{W}) \prod_{n=1}^N p(Y^{(n)} | \mathbf{X}^{(n)}, \mathbf{W})}{q(\mathbf{W} | \mu, \Sigma)} \right) \right]}_{ELBO(\mu, \Sigma)}.
\end{aligned}
      
---
### Variational Inference for BNNs
The ***Black-box Variational Inference (BBVI)*** algorithm for BNNs:
0. **Initialization:** pick an initial value $\mu^{(0)}, \Sigma^{(0)}$
1. **Gradient Ascent:** repeat:

   1. Approximate the gradient 
   \begin{align}
   \nabla_{\mu, \Sigma} \, ELBO(\mu, \Sigma) &= \nabla_{\mu, \Sigma}\,\underbrace{\mathbb{E}_{\mathbf{W} \sim q(\mathbf{W} | \mu, \Sigma)} \left[ \log \left( \frac{p(\mathbf{W}) \prod_{n=1}^N p(Y^{(n)} | \mathbf{X}^{(n)}, \mathbf{W})}{q(\mathbf{W} | \mu, \Sigma)} \right) \right]}_{ELBO(\mu, \Sigma)}\\
   &= \underbrace{\mathbb{E}_{\mathbf{W} \sim q(\mathbf{W} | \mu, \Sigma)}\left[ \nabla_{\mu, \Sigma}\, \log q(\mathbf{W} | \mu, \Sigma) * \log \left( \frac{p(\mathbf{W}) \prod_{n=1}^N p(Y^{(n)} | \mathbf{X}^{(n)}, \mathbf{W})}{q(\mathbf{W} | \mu, \Sigma)} \right) \right]}_{\text{score function gradient}}\\
   &\approx\underbrace{\frac{1}{S}\sum_{s=1}^S \nabla_{\mu, \Sigma}\, \log q(\mathbf{W}^s | \mu, \Sigma) * \log \left( \frac{p(\mathbf{W}^s) \prod_{n=1}^N p(Y^{(n)} | \mathbf{X}^{(n)}, \mathbf{W}^s)}{q(\mathbf{W}^s | \mu, \Sigma)} \right)}_{{\text{MC approx. of score function gradient}}},
   \end{align}
   where $\mathbf{W}^s\sim q(\mathbf{W} | \mu^{\text{current}}, \Sigma^{\text{current}})$.<br><br>
   2. Update parameters $(\mu^{\text{current}}, \Sigma^{\text{current}}) \leftarrow (\mu^{\text{current}}, \Sigma^{\text{current}}) + \eta * {\text{(MC approx. of score function gradient)}}$<br><br>
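The two steps above can be sketched in NumPy on a deliberately tiny problem. This is only an illustration under stated assumptions, not the course implementation: the "network" is a single weight $w$ in the model $y = wx + \text{noise}$, the prior is $\mathcal{N}(0, 1)$, the variational family is a one-dimensional Gaussian parametrized by $(\mu, \log\sigma)$, and the data, step size, and function names are all hypothetical. The score $\nabla \log q$ is written out by hand for this Gaussian.

```python
import numpy as np

# Hypothetical toy data for the model y = w * x + noise (an assumption for illustration)
X = np.array([0.5, 1.0, 1.5])
Y = np.array([1.0, 2.1, 2.9])
NOISE_STD = 1.0


def log_joint(w):
    """log p(w) + sum_n log p(y_n | x_n, w) for weight samples w of shape (S,)."""
    log_prior = -0.5 * np.log(2 * np.pi) - 0.5 * w**2  # prior: w ~ N(0, 1)
    resid = Y[None, :] - w[:, None] * X[None, :]
    log_lik = np.sum(
        -0.5 * np.log(2 * np.pi * NOISE_STD**2) - 0.5 * resid**2 / NOISE_STD**2,
        axis=1,
    )
    return log_prior + log_lik


def log_q(w, mu, log_sigma):
    """log N(w; mu, sigma^2) with sigma = exp(log_sigma)."""
    sigma = np.exp(log_sigma)
    return -0.5 * np.log(2 * np.pi) - log_sigma - 0.5 * ((w - mu) / sigma) ** 2


def score_function_gradient(mu, log_sigma, num_samples, rng):
    """MC estimate of the ELBO gradient with respect to (mu, log_sigma)."""
    sigma = np.exp(log_sigma)
    w = rng.normal(mu, sigma, size=num_samples)       # W^s ~ q(W | current parameters)
    log_ratio = log_joint(w) - log_q(w, mu, log_sigma)
    z = (w - mu) / sigma
    score_mu = z / sigma                              # d/d mu of log q
    score_log_sigma = z**2 - 1.0                      # d/d log_sigma of log q
    return np.array([
        np.mean(score_mu * log_ratio),
        np.mean(score_log_sigma * log_ratio),
    ])


def bbvi_score_function(num_steps=2000, num_samples=500, step_size=0.01, seed=0):
    """Plain gradient ascent on the ELBO using the score function estimator."""
    rng = np.random.default_rng(seed)
    params = np.array([0.0, 0.0])                     # initial (mu, log_sigma)
    for _ in range(num_steps):
        grad = score_function_gradient(params[0], params[1], num_samples, rng)
        params = params + step_size * grad
    return params[0], np.exp(params[1])


mu_hat, sigma_hat = bbvi_score_function()
print("variational mean:", mu_hat, "variational std:", sigma_hat)
```

Because this toy model is conjugate, the exact posterior is Gaussian with precision $1 + \sum_n x_n^2/\sigma_y^2$, so the fitted $(\mu, \sigma)$ can be checked against it. Note how many samples per step the estimator needs even in one dimension; its high variance is the usual motivation for the alternative estimator.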

The ***reparametrization gradient*** is another way to push the gradient past the expectation in the gradient of the ELBO.

We note that since $q(\mathbf{W} | \mu, \Sigma) = \mathcal{N}(\mathbf{W};\mu, \Sigma )$, sampling $\mathbf{W}\sim q(\mathbf{W} | \mu, \Sigma)$ is equivalent to sampling $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ and then transforming the sample via $\mathbf{W} = \epsilon^\top \Sigma^{1/2} + \mu$, where $\mathbf{I}$ and $\Sigma$ have the same dimensions. The randomness now comes from $\epsilon$, whose distribution does not depend on $\mu$ or $\Sigma$.
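The equivalence can be checked numerically. The sketch below is an illustration only: the values of `mu` and `Sigma` are made up, and it assumes the Cholesky factor as the matrix square root (any $A$ with $A^\top A = \Sigma$ in this row-vector convention would do). It draws $\epsilon \sim \mathcal{N}(0, \mathbf{I})$, applies the transformation, and compares the empirical mean and covariance with $\mu$ and $\Sigma$.

```python
import numpy as np


def sample_reparametrized(mu, Sigma, num_samples, rng):
    """Draw samples from N(mu, Sigma) by transforming standard normal noise."""
    L = np.linalg.cholesky(Sigma)                          # Sigma = L @ L.T
    eps = rng.standard_normal((num_samples, mu.shape[0]))  # rows are eps^T ~ N(0, I)
    return eps @ L.T + mu                                  # rows have covariance L @ L.T


# Hypothetical parameters, chosen only for the demonstration
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

samples = sample_reparametrized(mu, Sigma, 200_000, np.random.default_rng(0))
print("empirical mean:", samples.mean(axis=0))
print("empirical covariance:\n", np.cov(samples, rowvar=False))
```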

Thus, we can rewrite the ELBO:
<img src="./fig/reparametrized_grad.png" style='height:300px;'>

The ***Black-box Variational Inference (BBVI) with the reparametrization trick*** algorithm for BNNs:
0. **Initialization:** pick an initial value $\mu^{(0)}, \Sigma^{(0)}$
1. **Gradient Ascent:** repeat:

   1. Approximate the gradient 
   \begin{align}
   \nabla_{\mu, \Sigma} \, ELBO(\mu, \Sigma) \approx& \small\frac{1}{S} \sum_{s=1}^S \nabla_{\mu, \Sigma} \log \left[p(\epsilon_s^\top \Sigma^{1/2} + \mu) \prod_{n=1}^N p(Y^{(n)} | \mathbf{X}^{(n)}, \epsilon_s^\top \Sigma^{1/2} + \mu)\right] \\
   &+ \nabla_{\mu, \Sigma}\underbrace{-\mathbb{E}_{\mathbf{W} \sim \mathcal{N}(\mu, \Sigma )}\left[\log \mathcal{N}(\mathbf{W};\mu, \Sigma ) \right]}_{\text{Gaussian entropy: has closed form}},
   \end{align}
   where $\epsilon_s \sim \mathcal{N}(0, \mathbf{I})$.
   2. Update parameters $(\mu^{\text{current}}, \Sigma^{\text{current}}) \leftarrow (\mu^{\text{current}}, \Sigma^{\text{current}}) + \eta * {\text{reparametrization gradient}}$
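A minimal sketch of this algorithm, under the same kind of toy assumptions rather than a real BNN: a single weight $w$ in $y = wx + \text{noise}$, prior $\mathcal{N}(0, 1)$, and a one-dimensional Gaussian $q$ parametrized by $(\mu, \log\sigma)$. The data and function names are hypothetical. In practice the gradient of the log joint would come from automatic differentiation; here it is derived by hand so that the example depends only on NumPy. For this parametrization the Gaussian entropy is $\log\sigma + \text{const}$, so its gradient with respect to $\log\sigma$ is exactly $1$.

```python
import numpy as np

# Hypothetical toy data for the model y = w * x + noise (an assumption for illustration)
X = np.array([0.5, 1.0, 1.5])
Y = np.array([1.0, 2.1, 2.9])
NOISE_STD = 1.0


def grad_log_joint(w):
    """d/dw of [log p(w) + sum_n log p(y_n | x_n, w)], hand-derived (stand-in for autodiff)."""
    resid = Y[None, :] - w[:, None] * X[None, :]
    return -w + np.sum(X[None, :] * resid, axis=1) / NOISE_STD**2


def reparametrization_gradient(mu, log_sigma, num_samples, rng):
    """MC estimate of the ELBO gradient with respect to (mu, log_sigma)."""
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal(num_samples)   # eps_s ~ N(0, 1)
    w = mu + sigma * eps                     # reparametrized weight samples
    dw = grad_log_joint(w)
    grad_mu = np.mean(dw)                    # chain rule: dw/dmu = 1
    # chain rule: dw/dlog_sigma = sigma * eps; the entropy term contributes +1
    grad_log_sigma = np.mean(dw * sigma * eps) + 1.0
    return np.array([grad_mu, grad_log_sigma])


def bbvi_reparametrized(num_steps=2000, num_samples=10, step_size=0.01, seed=0):
    """Plain gradient ascent on the ELBO using the reparametrization gradient."""
    rng = np.random.default_rng(seed)
    params = np.array([0.0, 0.0])            # initial (mu, log_sigma)
    for _ in range(num_steps):
        grad = reparametrization_gradient(params[0], params[1], num_samples, rng)
        params = params + step_size * grad
    return params[0], np.exp(params[1])


mu_hat, sigma_hat = bbvi_reparametrized()
print("variational mean:", mu_hat, "variational std:", sigma_hat)
```

In this sketch only 10 samples per step are used, far fewer than one would typically want with the score function estimator on the same problem, which reflects the lower variance of the reparametrization gradient.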