# Lecture #19: Variational Inference in Context
## AM 207: Advanced Scientific Computing
### Stochastic Methods for Data Analysis, Inference and Optimization
### Fall, 2021

<img src="fig/logos.jpg" style="height:150px;">

## Lecture #19 Summary

What did we observe in HW#8? We set up a BNN with 1 hidden layer and 5 hidden nodes.
1. Did your HMC converge? Do you feel confident that you can get it to converge?
    - Did the posterior samples give you desirable predictive uncertainty?<br><br>
    
2. Did your BBVI optimize the ELBO fully? How do you know that you've optimized your ELBO fully?
    - Did the variational posterior give you desirable predictive uncertainty?<br><br>
    
3. When the network is large, do you believe that either HMC or mean-field Gaussian VI will give you a good approximation of the posterior?
    - Do we necessarily need a good approximation of the posterior to get desirable uncertainty?<br><br>
    
4. **The Training Wheels are Off!** When doing machine learning in the wild, no one knows the "right answer" and your metrics are not absolute. The only thing you can rely on is your intuition (which is you internalizing the theory you've learned as well as your hands-on experience working with these models).
    - How can we know that our HMC has no bugs in it?
    - How can we know that our BBVI has no bugs in it?
    - How can we know that we've fully optimized anything (e.g. the ELBO) through gradient descent?
    - What are the impediments to BBVI optimizing fully? I.e. what do we tune to improve performance?
    - What are the impediments to HMC mixing? I.e. what do we tune to improve performance? <br><br>
    
5. **What this class is not:** This class is not a one-stop shop for all the tools you need to be a fluent practitioner of in-demand, state-of-the-art machine learning. It's normal that:
    - You still don't feel comfortable implementing all the models from scratch and deploying them in real-life conditions. 
      You should **not** be comfortable now, but you will be with practice in the future (you're getting some of that practice with the project)!
    - You don't remember everything from the class, and as time passes, you seem to forget more.
      You'll forget many if not most facts, but when you re-learn these facts, understanding will come more easily and with more depth.<br><br>

6. **What this class is:** This class is a map, or atlas, that paints a broad landscape of modern-day probabilistic machine learning. Using the structure of this class, you can better organize and process new ML knowledge. 

  This class is a foundation for you to build on. Future classes in ML **will** reference these models, techniques, and concepts:
    - Reinforcement learning (Harvard)
    - Causal inference (Harvard)
    - ML for Healthcare (MIT)
    - Neural Processing (Harvard)
    - Special Topics in ML (Harvard)
    - Interpretable ML (Harvard)
    - AI for Social Good (Harvard)

# New Developments in Deep Bayes

## What's Wrong with Posterior Approximations for BNNs
***(joint work with Jiayu Yao, Soumya Ghosh, Finale Doshi-Velez)***

**The motivation:** Frequently, we introduce a new approximate inference method for BNNs based on our intuition of what properties of the true posterior are important to include in our approximate posterior. We then measure the ***log-likelihood*** of our learned model on test data.

**The idea:** We evaluate a number of state-of-the-art approximate inference methods for BNNs on synthetic data that can be visualized, to see whether various custom approximations of the posterior translate to desirable posterior predictives.

We also test whether commonly used evaluation metrics, like test log-likelihood, can distinguish high-quality posterior predictives from poor-quality ones.

***From:*** [Quality of Uncertainty Quantification for Bayesian Neural Network Inference](https://arxiv.org/pdf/1906.09686.pdf)

## What Did We Learn?

**Take-away:** Metrics like test log-likelihood measure how well the approximate posterior predictive aligns with the data, not how well it approximates the true posterior predictive.

<img src="fig/cubic_compare.png" style="height:170px;" align="center"/>
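
For a Gaussian likelihood with a known noise level, test log-likelihood is usually estimated by averaging the likelihood of each test point over posterior samples. The sketch below is a minimal illustration of that estimator, not the evaluation code from the paper; the function name `avg_test_log_likelihood`, the toy cubic data, and the two hand-made sets of "posterior samples" are all assumptions made for this example.

```python
import numpy as np

def avg_test_log_likelihood(pred_means, y_test, noise_std):
    """Monte Carlo estimate of the average log posterior predictive density.

    pred_means: array of shape (S, N), network outputs at the N test inputs
                for S samples from the (approximate) posterior.
    y_test:     array of shape (N,), observed test targets.
    noise_std:  known standard deviation of the Gaussian observation noise.
    """
    pred_means = np.asarray(pred_means, dtype=float)
    y_test = np.asarray(y_test, dtype=float)
    # log N(y_n; f_s(x_n), noise_std^2) for every sample s and test point n
    log_lik = (-0.5 * np.log(2.0 * np.pi * noise_std**2)
               - 0.5 * ((y_test[None, :] - pred_means) / noise_std) ** 2)
    # log of the sample average over s, computed stably (log-sum-exp trick)
    m = log_lik.max(axis=0)
    log_pred = m + np.log(np.mean(np.exp(log_lik - m), axis=0))
    return float(log_pred.mean())

# Toy comparison (hypothetical data): a tight and a loose predictive distribution.
rng = np.random.default_rng(0)
noise_std = 0.1
x_test = np.linspace(-1.0, 1.0, 50)
y_test = x_test**3 + noise_std * rng.normal(size=x_test.shape)
tight = x_test[None, :] ** 3 + 0.05 * rng.normal(size=(200, 1))
loose = x_test[None, :] ** 3 + 1.0 * rng.normal(size=(200, 1))
ll_tight = avg_test_log_likelihood(tight, y_test, noise_std)
ll_loose = avg_test_log_likelihood(loose, y_test, noise_std)
print(ll_tight, ll_loose)
```

Note that the estimator only ever sees predictions at the test inputs. Two approximate posteriors that agree at those inputs receive the same score, however differently they behave elsewhere.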

**Take-away:** Training with data gaps and evaluating test likelihood can only catch problems if the true function is complex in these gaps.

<img src="fig/complex_compare.png" style="height:110px;" align="center"/>

**Take-away:** Inference methods that specialize in approximating specific properties in the posterior do not necessarily produce desirable posterior predictives.

## What's Wrong with the Predictions from Mean-Field Variational Posteriors of BNNs?
***(joint work with Beau Coker, Finale Doshi-Velez)***

The problem with mean-field VI for BNNs isn't just with the variance. In [Wide Mean-Field Variational Bayesian Neural Networks Ignore the Data](https://arxiv.org/pdf/2106.07052.pdf), we show that as the network becomes wider, the posterior predictive mean ignores the data completely and learns a constant (flat) function!

<img src="fig/widebnn.png" style="height:300px;" align="center"/>

The fact that the posterior predictive mean of mean-field variational posteriors of wide BNNs ***underfits*** the data has been observed empirically in [Overpruning in Variational Bayesian Neural Networks](https://arxiv.org/abs/1801.06230).

## Relationships Between Deep Ensembles and Bayesian Neural Networks
Although we've been thinking about ensembles and Bayesian models as belonging to a dichotomy (frequentist vs Bayesian statistics), in papers like [Conservative Uncertainty Estimation By Fitting Prior Networks](https://openreview.net/pdf?id=BJlahxHYDS) and [Repulsive Deep Ensembles are Bayesian](https://arxiv.org/abs/2106.11642), we see that ensembles and Bayesian models have a deep relationship.

**Take-away:** In fact, some ensembles can be interpreted as collections of samples from a corresponding Bayesian model!
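
One setting where this correspondence can be checked exactly is Bayesian linear regression. The sketch below is an illustration under that simplifying assumption, not the construction from either paper above: each ensemble member solves a regularized least-squares problem on noise-perturbed targets, with the regularizer centered at a draw from the prior. For a Gaussian prior and Gaussian noise, each such member is an exact sample from the posterior. The data, the dimensions, and the helper name `train_member` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear-Gaussian model (all values are assumptions for illustration).
N, D = 40, 3
prior_var, noise_var = 1.0, 0.25
X = rng.normal(size=(N, D))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + np.sqrt(noise_var) * rng.normal(size=N)

# Exact posterior N(post_mean, post_cov) under the prior w ~ N(0, prior_var * I).
precision = X.T @ X / noise_var + np.eye(D) / prior_var
post_cov = np.linalg.inv(precision)
post_mean = post_cov @ X.T @ y / noise_var

def train_member(rng):
    """Fit one ensemble member.

    Minimizes ||y_pert - X w||^2 / noise_var + ||w - w_anchor||^2 / prior_var,
    where w_anchor is drawn from the prior and y_pert adds fresh noise to y.
    The minimizer is available in closed form for this model.
    """
    w_anchor = np.sqrt(prior_var) * rng.normal(size=D)
    y_pert = y + np.sqrt(noise_var) * rng.normal(size=N)
    rhs = X.T @ y_pert / noise_var + w_anchor / prior_var
    return np.linalg.solve(precision, rhs)

ensemble = np.stack([train_member(rng) for _ in range(5000)])

# The ensemble's empirical moments should match the exact posterior.
print(np.abs(ensemble.mean(axis=0) - post_mean).max())
print(np.abs(np.cov(ensemble, rowvar=False) - post_cov).max())
```

For nonlinear networks this equivalence no longer holds exactly, which is part of why the relationship between deep ensembles and BNNs is an active research question.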

## Alternative Models for Uncertainty Quantification

Although mean-field VI is touted as a fast way of performing inference on BNNs, it is still hard to scale to truly large networks. For this reason, alternative deep Bayesian models have been gaining popularity. A **Neural Linear Model** (NLM) learns deterministic weights for every layer of an NN except the last, where we place priors on the weights and perform exact Bayesian inference:

<img src="fig/Unknown-20.png" style="height:280px;" align="center"/>

***From:*** [Learned Uncertainty-Aware (LUNA) Bases for Bayesian Regression using Multi-Headed Auxiliary Networks](https://arxiv.org/pdf/2006.11695.pdf)
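
The last-layer inference in a Neural Linear Model is ordinary Bayesian linear regression on the features produced by the earlier layers. The sketch below is a minimal illustration, assuming a zero-mean isotropic Gaussian prior and Gaussian noise with known variance. A fixed random tanh layer stands in for the trained deterministic layers, and the names `feature_map`, `nlm_posterior`, and `nlm_predict` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the trained deterministic layers (hypothetical fixed weights).
W_hidden = rng.normal(size=(1, 20))
b_hidden = rng.normal(size=20)

def feature_map(x):
    """Deterministic features phi(x): one tanh hidden layer plus a bias column."""
    h = np.tanh(x[:, None] @ W_hidden + b_hidden)
    return np.hstack([h, np.ones((x.shape[0], 1))])

def nlm_posterior(Phi, y, prior_var, noise_var):
    """Exact Gaussian posterior over last-layer weights, prior N(0, prior_var * I)."""
    D = Phi.shape[1]
    precision = Phi.T @ Phi / noise_var + np.eye(D) / prior_var
    cov = np.linalg.inv(precision)
    mean = cov @ Phi.T @ y / noise_var
    return mean, cov

def nlm_predict(Phi_new, mean, cov, noise_var):
    """Posterior predictive mean and variance at new feature vectors."""
    pred_mean = Phi_new @ mean
    # Diagonal of Phi_new @ cov @ Phi_new.T, plus observation noise
    pred_var = np.einsum("nd,de,ne->n", Phi_new, cov, Phi_new) + noise_var
    return pred_mean, pred_var

# Toy data with a gap in the inputs between -1 and 1.
x_train = np.concatenate([rng.uniform(-2, -1, 20), rng.uniform(1, 2, 20)])
y_train = x_train**3 + 0.1 * rng.normal(size=x_train.shape)
noise_var, prior_var = 0.01, 1.0

mean, cov = nlm_posterior(feature_map(x_train), y_train, prior_var, noise_var)
x_grid = np.linspace(-3, 3, 7)
pred_mean, pred_var = nlm_predict(feature_map(x_grid), mean, cov, noise_var)
print(np.sqrt(pred_var))
```

Because the posterior is available in closed form, no sampling or variational optimization is needed for the last layer. The quality of the resulting uncertainty depends entirely on the feature map.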

## Predictive Uncertainties of Alternative Deep Bayesian Models May Not be Better
***(joint work with Sujay Thakur, Cooper Lorsung, Yaniv Yacoby, Finale Doshi-Velez)***

In [Learned Uncertainty-Aware (LUNA) Bases for Bayesian Regression using Multi-Headed Auxiliary Networks](https://arxiv.org/pdf/2006.11695.pdf), we show that the posterior predictive uncertainties of NLMs are not what we might want. The problem has a lot to do with how we typically train neural networks.

<img src="fig/Unknown-18.png" style="height:270px;" align="center"/>



## What Uncertainties Do We Need in Deep Learning?
More and more folks in the Deep Bayes community are coming to the conclusion that the quality of uncertainty estimation can only be assessed in reference to a specific downstream task, e.g. active learning, continual learning, Bayesian optimization, out-of-distribution detection, learning to defer (rejection learning), model interpretation, etc.

<img src="fig/bayesopt.png" style="height:350px;" align="center"/>

**Take-away:** There is no universal definition of, or metric for, "good" uncertainty: "good" uncertainty is uncertainty that is useful for the task.

**Take-away:** Uncertainty estimation that is good for one task might be bad for another!