Scott Schwartz
Research/Reading | Auxiliary/Deferred |
---|---|
Yichen Ji | Neal Jin |
Eric Jiang | Jianing Zhang |
Haining Tan | Xin Peng |
Ryan Wang | Eric Zhu |
Following the rapid advancements in deep learning using neural networks, an extremely interesting area of research addresses the quantification of uncertainty in neural network models. This topic has seen exciting (and sometimes contentiously debated) success under the label of Bayesian Deep Learning, with examples such as "Bayes by Backprop" (Blundell et al.) and "Gaussian Process Approximation using Dropout" (Gal et al.) demonstrating the great potential of applying Bayesian considerations in the context of neural networks. This research project will briefly review the existing Bayesian Deep Learning ecosystem and the key methodologies upon which it is based, e.g., Variational Inference. Then, in a meeting of "classic" with "new", the research will consider how a Bayesian analysis might proceed, via Importance Sampling, for the Normalizing Flow (i.e., bijection) class of neural network models.
Normalizing Flows are a widely used methodology which can approximate an arbitrarily complex data distribution by applying a sequence of invertible smooth (change of variables) transformations to a multivariate normal distribution. Bayesian posterior analysis can proceed with Importance Sampling using the prior as the proposal function and the likelihood evaluation as the importance weights. By employing a Bayesian posterior analysis in this context we intend to capture epistemic uncertainty inherent in model fitting, and by using a Normalizing Flow as a likelihood we intend to capture aleatoric uncertainty inherent in data generating mechanisms (as well as produce the required importance weights). The crux of the issue then is the obligatory (and sometimes adversarial) question of "What is your prior?" that Bayesians everywhere are more than familiar with; however, here the answer also crucially drives the computational cost of the proposed Importance Sampling method. It is expected that the research will begin by looking into the suitability of the SWAG posterior approximation methodology (Wilson et al.) as an "empirical Bayes prior" which can then be fine-tuned on the basis of Importance Sampling. The work will be done in Python using TensorFlow and TensorFlow Probability neural networks.
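A minimal sketch of the intended importance-sampling mechanics (toy code only, not the project implementation; `sample_prior`, `log_likelihood`, and `g` are hypothetical stand-ins for the SWAG-style prior, the Normalizing Flow likelihood, and any posterior functional of interest):

```python
import numpy as np

# Hypothetical pieces: sample_prior() draws a parameter vector from the
# (e.g., SWAG-style empirical Bayes) prior, and log_likelihood(theta, data)
# evaluates the Normalizing Flow log-likelihood at those parameters.
def posterior_expectation(g, data, sample_prior, log_likelihood, S=1000):
    thetas = [sample_prior() for _ in range(S)]            # prior as proposal
    log_w = np.array([log_likelihood(th, data) for th in thetas])
    w = np.exp(log_w - log_w.max())                        # stabilize before exponentiating
    w /= w.sum()                                           # self-normalize the weights
    return sum(wi * g(th) for wi, th in zip(w, thetas))    # approx. E[g(theta) | data]
```

Because the weights are self-normalized, the (generally intractable) marginal likelihood never needs to be computed.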
Students must have experience working in TensorFlow (or PyTorch, etc.) as evidenced by course or portfolio work. Students must also have a solid understanding of Bayesian Analysis and familiarity with applied computation, as should generally be the case for students who have taken the appropriate advanced coursework. Strong interest in Bayesian, neural network, and computational methodology is of course preferred, and enthusiasm for and comfort with working through challenging problems in new areas is very beneficial for research work.
- Importance Sampling (IS) (Givens/Hoeting Chapter 6.4.1)
Some initial orienting questions and partial responses for your review
Scott's summary of the 4PM conversation on Monday, May 9.
The second half of the meeting was recorded (and has passcode Sc#1wsPr9#).
In the first half of the meeting Ryan addressed the final question of the notebook regarding distribution quantile estimation, and we discussed empirical CDFs and Gentle's view of rank order statistics as the fundamental information contained in a data sample. Thanks to Haining's considerations of what it would mean to integrate an inverse quantile function, it became clear to me that quantile estimation cannot be formulated as an integration problem, and so quantile estimation (i.e., analyzing rank order statistics) is something different from MC-integration (which seems quite interesting, but I think for now we'll have to put a pin in this topic for a later time).
I do not believe the middle two questions addressed in the notebook (regarding helpful attributes of proposal distributions and the computational distinction between unnormalized and normalized importance weights) were systematically addressed in our conversation, but they were each tangentially touched upon to some degree. I.e., respectively, see (b) below, and note that normalized IS weights do not require a (generally very hard) marginal likelihood computation to find the normalizing constant of the posterior, but can instead just be likelihood evaluations normalized to sum to one.
The first question of the notebook was addressed on the basis of Eric's questions, from which we were able to discuss (a) Monte Carlo Integration as simply being expectation (an integral) estimation done on the basis of standard statistical (CLT) analysis, (b) that IS affords estimator variance reduction for well-chosen proposal distributions which produce large IS weights only for very small g(θ), and, following up on that, (c) that IS allows us control over which sampling distribution to use, which can allow us to avoid "expensive to sample" distributions.
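To make (a) and (b) concrete, here is a small toy illustration (not from the meeting itself): plain Monte Carlo estimates an expectation with a sample mean and CLT standard error, and Importance Sampling re-expresses the same expectation under a proposal via f/q weights, with a heavier-tailed proposal keeping the weights stable:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(410)
g = lambda theta: theta**2                       # toy integrand; E[g] = 1 under N(0, 1)

# (a) Plain Monte Carlo: E_f[g(theta)] under f = N(0, 1), with CLT standard error.
theta = rng.normal(size=10_000)
mc_est, mc_se = g(theta).mean(), g(theta).std(ddof=1) / np.sqrt(theta.size)

# (b)/(c) Importance Sampling: sample from a proposal q = t_5 instead and
# reweight by f(theta)/q(theta); the heavier-tailed q keeps the weights stable.
theta_q = rng.standard_t(df=5, size=10_000)
w = stats.norm.pdf(theta_q) / stats.t.pdf(theta_q, df=5)
is_est = np.mean(w * g(theta_q))                 # unnormalized-weight estimator
is_sn  = np.sum(w * g(theta_q)) / np.sum(w)      # self-normalized version
```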
Other discussion focussed on (1) the "big picture" of the proposed Normalizing Flow BDL method and why it was cute, I mean, chosen. And (2) the history of Bayesian Deep Learning (BDL) and why it exists. If I've missed anything else of note please let me know and I will add it here!
Empirical Bayes (EB) (Introduction and Commentary from Haining Tan)
- OPEN QUESTION: what impact (if any) do Empirical Bayes prior specifications have on estimation based on importance sampling?
Variational Inference (VI) (Schwartz STA410 3.0.2)
To make the link above work, remove the (annoyingly) appended "=" at the end of the address and you'll link directly to the intended section.
Recording of the 10:00AM-12:15PM conversation on Thursday, May 12 (passcode +Cg2&AgvP6).
Introduction to VI in TensorFlow (based on this TensorFlow "article")
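For orientation, a minimal TensorFlow Probability sketch of the VI workflow that article describes, on a toy target rather than a neural network (a sketch only; exact API details may differ by TFP version):

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd, tfb = tfp.distributions, tfp.bijectors

# Toy (unnormalized) target: a Normal(3, 2) "posterior" we pretend we can
# only evaluate pointwise.
target_log_prob = lambda z: tfd.Normal(3.0, 2.0).log_prob(z)

# Mean-field Gaussian surrogate q(z) with trainable loc and (positive) scale.
surrogate = tfd.Normal(
    loc=tf.Variable(0.0),
    scale=tfp.util.TransformedVariable(1.0, bijector=tfb.Softplus()),
)

# Maximize the ELBO (equivalently, minimize KL(q || p) up to a constant)
# by stochastic gradient steps.
losses = tfp.vi.fit_surrogate_posterior(
    target_log_prob_fn=target_log_prob,
    surrogate_posterior=surrogate,
    optimizer=tf.optimizers.Adam(learning_rate=0.05),
    num_steps=500,
)
# surrogate's loc and scale should now be near 3 and 2, respectively.
```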
Bayesian Neural Networks (BNN) / Bayes by Backprop (BBB) relative to Bayesian (posterior) analysis and Markov chain Monte Carlo (MCMC) with PyMC3
- Landmark paper: Weight Uncertainty in Neural Networks (and perhaps see also The Local Reparameterization Trick)
Review Paper: Variational Inference: A Review for Statisticians
Hopefully the preceding materials have been sufficient and this is just a reference at this point.
Landmark paper: Auto-Encoding Variational Bayes
Autoencoders are a seminal application of VI; however, they are not focussed on uncertainty characterization. Thus, while they serve as a "proof of understanding" exercise for VI, they are tangential to our own efforts. So skip this for now, but if you wish to return to it later see the Keras Documentation, and other open source resources, e.g., for MNIST and Fashion MNIST
Gaussian Processes (Introductory Lecture)
The Gaussian Process (GP) and Stochastic Processes
MC-Dropout Approximates a GP, including
MC-Dropout is Bayes (when it approximates a posterior which is a GP)
From the tour de force Thesis and resulting landmark Manuscript and Appendix
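As an operational aside (not part of the readings above), MC-Dropout uncertainty is obtained by keeping dropout active at prediction time and averaging stochastic forward passes; a minimal Keras sketch with an arbitrary toy network:

```python
import numpy as np
import tensorflow as tf

# Hypothetical small regression network; any Keras model containing Dropout works.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(1),
])

x_new = np.random.randn(5, 10).astype("float32")  # placeholder inputs

# MC-Dropout: keep dropout "on" at prediction time via training=True
# and average T stochastic forward passes.
T = 200
draws = np.stack([model(x_new, training=True).numpy() for _ in range(T)])
pred_mean = draws.mean(axis=0)   # approximate predictive mean
pred_std = draws.std(axis=0)     # spread across passes as an uncertainty proxy
```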
Concerns regarding MC-Dropout [from Ian Osband (1, 2, 3, 4) and HMG2018]
Dropout isn't Bayesian (and MCMC using "Metropolis within Gibbs", Hamiltonian MC, etc.)
Hopefully the preceding materials have been sufficient and this is just a reference at this point.
At this stage...
We've seen BNN/BBB and MC-Dropout as characterizing uncertainty in the NN context.
And we've also seen more traditional Bayesian analysis with MCMC using PyMC.
Can we add something to the Bayesian Deep Learning (BDL) domain?
Normalizing Flows (NF), covering:
- Change of Variables, Jacobians, Determinants, and Eigenthings
- Computation versus Transformation Tradeoff Motivating NF
- MADE autoregressive structure, conditional parameter outputs, and the chain rule
- Masked/Inverse Autoregressive Flows (MAF/IAF), but not RealNVP or Hamiltonian Flows (a minimal TFP sketch follows after these bullets)
- May 26th 4PM Onboarding Material Presentations [MCMC/VI+BDL+NF]
- May 30th 4PM Onboarding Material Presentations: GP+MC-Dropout and SWAG
- Proposed Manuscript Outline and Writing Assignments [PDF, tex]
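For orientation on the MAF item above, a minimal TensorFlow Probability sketch of a masked autoregressive flow on 2-D data (toy settings only; the architecture for the project is still to be decided):

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd, tfb = tfp.distributions, tfp.bijectors

# Base distribution: standard 2-D Normal; the flow transforms it toward the data.
base = tfd.Sample(tfd.Normal(loc=0.0, scale=1.0), sample_shape=[2])

# Masked Autoregressive Flow: the AutoregressiveNetwork (MADE) outputs a shift
# and log-scale for each dimension while respecting the autoregressive masking.
maf = tfd.TransformedDistribution(
    distribution=base,
    bijector=tfb.MaskedAutoregressiveFlow(
        shift_and_log_scale_fn=tfb.AutoregressiveNetwork(
            params=2, hidden_units=[32, 32], activation="relu")),
)

# maf.log_prob(x) applies the change-of-variables formula (base log-density plus
# the log|det Jacobian| terms) and is what gets maximized over data to fit the flow.
x = maf.sample(5)
log_density = maf.log_prob(x)
```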
Feedback | Topics |
---|---|
Haining | Importance Sampling, MCMC (MH, SGLD, HMC, Gibbs), Variational Inference (VAE, BBB), SWAG |
Eric | Gaussian Processes (GP) and MC-Dropout as posterior GP approximation |
Yichen | Generative Models, Normalizing Flows, BDL (BMA and Deep Ensembling, etc.) and its critiques |
Ryan | NF Determinant classes, Autoregressive NNs, conditioners/transformers (MAF, IAF, RealNVP), Linear Flows and Permutations, Residual Flows, and ODE/SDE Continuous Infinitesimal Flows |
Week of | Days | Topics | Deliverable | Target |
---|---|---|---|---|
May 9 | 4 | IS, EB, VI, BBB | ||
May 16 | 4 | GP, MC-Dropout, NF | ||
May 23 | 4 | SWAG, SNF | Slides Presentation I: IS through NF | May 26 |
May 30 | 4 | SWAG, SNF | Slides Presentation II: SWAG and SNF | June 2 |
June 6 | 4 | Coding / Writing | Outline Planned Paper | June 9 |
June 13 | 3 | Coding / Writing | Finalize Analysis Examples | June 16 |
June 20 | 3 | Coding / Writing | Draft Manuscript Version I | June 23 |
June 27 | 3 | Coding / Writing | Draft Manuscript Version II | June 30 |
July 4 | 3 | | Final Manuscript Submission | ASAP |
July 11 | 3 | | Final Poster or Slides Presentation | TBA |
Deadline | Event | Special Note | Dates |
---|---|---|---|
*Will ask Radu about poster crash | ISBA, Montreal | First time ever hosted in Canada | June 26 - July 1 |
Paper May 16/19 Poster October 12 | NeurIPS, New Orleans | Bayesian Deep Learning Workshop | Nov 28 - Dec 9 |
Intermediate Objective: special presentation targeting David Duvenaud, Murat Erdogdu, Rohan Alexander (Assistant Director of CANSSI Ontario, etc.), Nathan Taback (outgoing DoSS Director of DS, incoming DoSS Associate Chair of UG Studies), Scott Schwartz (incoming DoSS Director of DS), and Radu Craiu (outgoing DoSS Chair).
- The intention of this meeting is to raise awareness of and increase interest in undergraduate research within the DoSS by showcasing the success of our own research efforts.
- Additionally, attracting the immediate and long-term engagement of potential collaborators David Duvenaud and Murat Erdogdu would likely raise the profile of our work beyond the DoSS.
- Other potential invites include incoming interim DoSS Chair Michael Evans, Dan Roy, and Jeff Rosenthal; however, while all of the aforementioned individuals are computationally oriented with interests in theoretical MCMC, it remains to be determined whether our topic aligns well with their research interests.
Parallelize
- Haining/Eric will create a presentation of the SWAG manuscript including all of its introductory and contextual material.
- It seems the SGLD citation [59] may be a key reference (highlighted also in manuscript footnote 3) - Is SGD just the "cheap version" of Langevin dynamics?
- References [45] and [39] appear of possible interest
- Covariance should be single pass computable (see the sketch just after these bullets)
- Wilson et al. (along with Goodfellow et al. in Chapter 8.2.2 (Local Minima)) believe local modes are likely interpretable as "weight space symmetry", and feel they are often not pathological mis-fitting or problematic model non-identifiabilities; thus, Wilson et al. indeed exactly propose to characterize posterior uncertainty as the uncertainty within a local mode of the loss function, treated as exchangeable with any other local mode.
- Fun fact: "Neal [49]"
- Etc., where "Etc." means identifying and gathering together relevant BDL literature that might be helpful to us
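Regarding the single-pass covariance point above, a sketch of SWAG-style running moments (diagonal variant only; this is an illustration, not the authors' code, and `w` stands in for the flattened network weights at an SGD checkpoint):

```python
import numpy as np

# SWAG-style moments: maintain running first and second moments of the SGD
# iterates in a single pass, plus a small deviation buffer for the low-rank part.
class SWAGMoments:
    def __init__(self, dim, max_rank=20):
        self.n = 0
        self.mean = np.zeros(dim)
        self.sq_mean = np.zeros(dim)
        self.deviations = []            # columns for the low-rank covariance factor
        self.max_rank = max_rank

    def update(self, w):                # w = flattened weights at an SGD checkpoint
        self.n += 1
        self.mean += (w - self.mean) / self.n
        self.sq_mean += (w**2 - self.sq_mean) / self.n
        self.deviations.append(w - self.mean)
        self.deviations = self.deviations[-self.max_rank:]

    def sample(self, scale=0.5):
        # Diagonal SWAG draw: N(mean, scale * diag(sq_mean - mean^2));
        # the low-rank term from `deviations` is omitted in this sketch.
        var = np.clip(self.sq_mean - self.mean**2, 1e-30, None)
        return self.mean + np.sqrt(scale * var) * np.random.randn(*self.mean.shape)
```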
- Ryan/Yichen will create a presentation on Stochastic Normalizing Flows which will include general introductions to the following topics and then explain their specific applications in the manuscript.
- Optimizing NF and SNF
Concepts
$J_{\mathrm{KL}} = \mathbb{E}_{\mu_Z(z)\,P_f(z \rightarrow x)}\big[-\log w(z \rightarrow x)\big] = \mathrm{KL}\big(\mu_Z(z)\,P_f(z \rightarrow x) \,\big\|\, \mu_X(x)\,P_b(x \rightarrow z)\big) + \text{constant}$
$J_{\mathrm{ML}} = \mathbb{E}_{\mu_X(x)\,P_b(x \rightarrow z)}\big[-\log w(x \rightarrow z)\big] = \mathrm{KL}\big(\mu_X(x)\,P_b(x \rightarrow z) \,\big\|\, \mu_Z(z)\,P_f(z \rightarrow x)\big) + \text{constant}$
$\mathrm{KL}\big(p_X(x) \,\big\|\, \mu_X(x)\big) \leq \mathrm{KL}\big(\mu_Z(z)\,P_f(z \rightarrow x) \,\big\|\, \mu_X(x)\,P_b(x \rightarrow z)\big)$
- Langevin dynamics: is "t" in this section different than "t" in other parts of the paper?
- Simulated annealing: what is the basic idea and how does this manifest in the SNF architecture?
- Neural MCMC: you will need to see what this is through the references
Questions
Is "t" in the MCMC subsection is different than "t" in other parts of the of paper?
How does (log) path probability in the MCMC subsection (Langevin dynamics subsections) fit into things?
Is the basic idea just to LD + MCMC perturb intermittently between flows to create the noise bypasses topological constraints?
If the prior is the base distribution and SNF fit on data is interpreted as "transforming the prior into the posterior", then repeated stochastic realizations of the SNF are samples with importance weights which are the ratio of the SNF "output distirubtion" relative to the base distribution, so repeated stochastic realizations are importance weighted representations of the postrior. How are realizations created and stored? How are uncertainty characterizations presented?
On page 7 the paper says: "Note that neural spline flows perform better than RealNVP without reweighting, but significantly worse with reweighting - presumably because the sharper features representable by splines can be detrimental for reweighting weights." I think this is saying that the NSF is not sufficiently heavy tailed to be a good importance sampling proposal. What do you think?
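One way to probe that reweighting question numerically (a toy diagnostic, not from the paper): the effective sample size of the self-normalized importance weights collapses when the flow's tails are too sharp/light relative to the target:

```python
import numpy as np

def effective_sample_size(log_w):
    """Kish effective sample size of self-normalized importance weights."""
    log_w = np.asarray(log_w) - np.max(log_w)   # stabilize before exponentiating
    w = np.exp(log_w)
    w /= w.sum()
    return 1.0 / np.sum(w**2)

# Here log_w would be log p_target(x) - log p_flow(x) evaluated at flow samples x;
# an ESS far below the number of samples signals weight degeneracy, i.e., a flow
# that is a poor importance sampling proposal for the target.
```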
Optional Foundations Material
- Graph construction using "symbol-to-number" (Torch/Caffe) versus "extended graph" derivative representations (Theano/TensorFlow) is discussed in Chapter 6.5.5 (Symbol-to-Symbol Derivatives) [in Chapter 6.5 (Back-Propagation and Other Differentiation Algorithms)] of the Goodfellow et al. textbook; however, our language preferences will really just come down to ease of implementation of SWAG and (Ryan and Haining's) proposed "dilution" treatment of NF parameterization.
- The "universal approximation" character of NN methodology (also a hallmark of GP methodology) is suggested in this cool visual and addressed in Chapter 6.4.1 (Universal Approximation) of the Goodfellow et al. textbook.
- Speaking of the GP (for which many reference resources abound), here's a cool visual of the GP, and a discussion of how a NN can be shown to (in the limit) be equivalent to a GP. [Subsequently, a sparse spectrum GP approximates a GP, and an MC-dropout NN can be shown (in the limit) to be equivalent to (i.e., have the same objective function as) a sparse spectrum GP].
- The PyMC3 documentation is a good place to start for MCMC (a minimal example follows at the end of this list). For the underlying HMC methodology this cool visual provides some initial intuition, and for the details see Radford Neal's seminal paper.
- By the way, there's something called Hamiltonian Flows. I'm just not sure what it really is yet.
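As referenced above, a minimal PyMC3 example (a toy Normal model, not project code; details such as `return_inferencedata` vary a bit across PyMC3 versions):

```python
import numpy as np
import pymc3 as pm

y = np.random.normal(loc=1.0, scale=2.0, size=50)     # toy data

with pm.Model():
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)           # prior on the mean
    sigma = pm.HalfNormal("sigma", sigma=5.0)          # prior on the scale
    pm.Normal("obs", mu=mu, sigma=sigma, observed=y)   # likelihood
    # NUTS (an adaptive HMC variant) is the default sampler here.
    trace = pm.sample(1000, tune=1000, return_inferencedata=True)
```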