This is the ICML 2020 tutorial on Bayesian deep learning.
We'll start with a pattern recognition problem where we have airline passenger numbers indexed by time. We'll consider three different modeling choices:

1- a linear function

2- a cubic polynomial

3- a ten-thousandth-order polynomial

<img src="images/bayesdl/bayesdlicml2020-002.png">

Which choice would you make in order to provide a good description of the data? Most people go with choice one or two. In this tutorial we'll argue for choice three, because the real world is a complicated place, and there will be some setting of the coefficients in choice three, the $w_j$s, which provides a better description of reality than could possibly be managed by choices one or two, which are just special cases of choice three.

In practice we're often making something like choice three: we use neural nets that have tens of millions of parameters to solve problems with just tens of thousands of data points, and we find we often get very good generalization. In part two of this talk we'll actually consider models with an infinite number of parameters that at the same time have very simple inductive biases and provide great generalization even on problems with a small number of data points.



To begin understanding the motivation for choice three, it's helpful to think about modeling from a function space perspective. Let's consider choice one, the linear function.

<img src="images/bayesdl/bayesdlicml2020-003.png">

$f(x)=w_0 + w_1 x$, and we'll just put a standard normal distribution over $w_0$ and $w_1$. This will induce a distribution over functions, which we can visualize by sampling from this distribution over the parameters and looking at the different straight lines, with different slopes and intercepts, that we get. Different $w_0$ and $w_1$ give different straight lines. The gray shade here shows a 95% credible set containing 95% of these functions, and the solid blue curve shows the expectation of this distribution over functions.
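
The induced distribution over straight lines can be sketched in a few lines of Python. This is only an illustration, not the code behind the figure: the seed, the number of samples, the query point `x = 2.0`, and the reading of the gray shade as a pointwise 95% interval are all assumptions.

```python
import math
import random

def sample_line(rng):
    """Draw (w0, w1) from independent standard normals."""
    return rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)

def prior_mean_and_band(x, z=1.96):
    """Analytic mean and pointwise 95% interval of f(x) = w0 + w1 * x.

    f(x) is a sum of independent Gaussians, so f(x) ~ N(0, 1 + x^2).
    """
    std = math.sqrt(1.0 + x * x)
    return 0.0, (-z * std, z * std)

rng = random.Random(0)  # arbitrary seed, for reproducibility only
lines = [sample_line(rng) for _ in range(2000)]

x = 2.0  # hypothetical query point
mean, (lo, hi) = prior_mean_and_band(x)
values = [w0 + w1 * x for w0, w1 in lines]
inside = sum(lo <= v <= hi for v in values) / len(values)
print(f"mean={mean:.2f}, band=({lo:.2f}, {hi:.2f}), fraction inside={inside:.3f}")
```

The fraction of sampled lines falling inside the band at `x` should come out close to 0.95, and the band widens as `|x|` grows because the slope uncertainty is multiplied by `x`.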




<img src="images/bayesdl/bayesdlicml2020-004.png">

In this diagram we have a conceptualization of all possible data sets on the horizontal axis and the marginal likelihood, or evidence, on the vertical axis. The evidence is the probability that we would generate a data set if we were to randomly sample from the parameters of our model.

The model we were considering on the last slide can just generate straight lines with different slopes and intercepts, so not very many data sets. But because the marginal likelihood is a proper normalized probability density, it's going to give a lot of mass to those data sets.

We could alternatively consider a model like a multi-layer perceptron, which might have a lot of hidden units and layers. With a broad distribution over the parameters it can generate a wide range of different data sets, but it won't give any one of those data sets very much probability.

We could alternatively consider a third type of model, like a convolutional neural net, which is very flexible: it can describe a wide array of different data sets. At the same time it has very particular inductive biases, like translation equivariance, which says that if we translate an image we don't want to change the class label. This means that these types of models are going to give a reasonable amount of mass to structured image data sets and provide good generalization on those problems.




In order to construct models with good generalization, we'll argue that we want large support, the support being which solutions are a priori possible, and reasonably calibrated inductive biases, the inductive biases being the distribution of support: which solutions are a priori likely. So we can consider the support to be the flexibility, and we want a lot of flexibility. At the same time, we should be careful not to conflate flexibility and complexity. In fact, in part two, as we've mentioned, we'll be considering models with infinitely many parameters that are extraordinarily flexible and at the same time provide very good generalization even given very small data sets. By the same token, we should not treat parameter counting as a proxy for complexity.

In this figure, on the left panel, we have the same diagram we had on the previous slide. In the other panels we see what happens as these different models are exposed to a given data set. In green, in the second panel, the model with large support is able to contain a good ground truth description of reality, and it also has well-calibrated inductive biases, so it efficiently collapses down onto that good description of reality. In the next panel, in blue, the model can hardly describe many data sets at all, just straight lines with different slopes and intercepts. So while the model becomes quickly constrained by the available data, it becomes erroneously constrained and starts to collapse down onto a bad solution. In the last panel we have a flexible model which casts a wide enough net to provide a good description of reality, but it spreads its support too thinly to have good inductive biases and reasonable contraction around that good description of reality.

In this tutorial we'll argue that the key distinguishing property of a Bayesian approach is marginalization rather than optimization. That is, instead of using a single setting of parameters $w$, we want to use all possible settings of parameters and weight them by their posterior probabilities, in what's called a Bayesian model average. We'll argue that this Bayesian model average will be especially relevant in deep learning.

I first became interested in Bayesian deep learning after listening to a talk about optimization, which is perhaps a bit ironic given what I said on the previous slide. We might characterize Bayesian methods as trying to avoid optimization at all costs: don't just bet everything on a single setting of parameters, use all possible settings of parameters. The argument was being made that mini-batch SGD would converge to flat regions of the loss, which would provide better generalization in deep learning than full-batch gradient methods.

In this diagram we have a conceptualization of parameters on the horizontal axis and the value of the loss on the vertical axis, with training loss in black and testing loss in red. We can see that a flat solution in train has reasonably low loss in test, whereas a sharp solution has pretty high loss after this horizontal shift between the training and the test loss. That shift will typically happen because our model won't be completely determined by a finite sample: when we evaluate the loss on different sets of points, even if they're drawn from the same distribution, we should get a different optimal setting of parameters. The shapes of the losses should be relatively similar, because the training loss is still not a horrible proxy for generalization.

Now if this argument were true, then for it to mean something, the different parameters in the flat region would have to correspond to different functions which provide compelling and complementary explanations for the data. Otherwise we could just contrive flatness through reparametrization and it wouldn't really mean anything. So this was an extraordinary argument for following a Bayesian approach and doing marginalization, or integration: basically integrating a flipped version of this curve, where we want to consider all of those good solutions and weight them by their posterior probabilities. In a sense it might just be a bit arbitrary to bet everything on just one good solution when we know that there are many.

And so there are many reasons to be excited about Bayesian deep learning. Neural nets can represent a variety of complementary explanations for the data, and we'll be seeing this particularly in part three. This will lead to better uncertainty representation, which is crucial for decision making. From a practical perspective, if our model could never conceivably influence a decision, then it might not have much of a practical impact. But marginalization will also have a big effect on the accuracy of our point predictions, which is perhaps an underappreciated aspect of the benefits of Bayesian marginalization in deep learning in particular. Because there are all these different complementary solutions, we can form a rich ensemble of high-performing and diverse solutions, and by doing that we'll often get much better accuracy, if we can do this marginalization effectively.

Bayesian neural nets were also a gold standard for a wide variety of problems in the second wave of neural nets, led in many ways by Radford Neal's Hamiltonian Monte Carlo approaches, which don't scale to modern architectures. But we know that, in a sense, there's treasure buried in some direction, and we just need to build the right tools as a community to extract that treasure.

As we've started to see, neural nets are also much less mysterious when viewed through the lens of probability theory. Overparametrization, double descent, model construction, and many other properties, like being able to fit random labels, become very understandable when we think about things from a probabilistic perspective.

Why not? These models can be computationally intractable and can involve a lot of moving parts: design decisions, approximate inference procedures, and so on. But they don't have to, and in the last year there's been really extraordinary empirical progress for Bayesian deep learning, where we now have several methods often providing better practical results than classical training, without significant overhead, on quite a wide variety of problems.

This tutorial will have four parts. This first part is about foundations of Bayesian machine learning, particularly with respect to how Bayesian methods could impact deep learning. In part two we'll be considering a function space perspective of machine learning. In part three we'll consider several practical methods for modern Bayesian deep learning. The goal of this part isn't just to enumerate all the state-of-the-art methods, but rather to exemplify many of the foundational concepts that we introduced in other parts of the tutorial with several modern approaches. And in part four we'll be considering Bayesian model construction and generalization, including deep ensembles and their connection with Bayesian marginalization, how we can build on those connections for the MultiSWAG approach, which marginalizes within multiple basins of attraction, tempering, prior specification, rethinking generalization, double descent, width-depth trade-offs, and a variety of other topics.

Now I'll add a brief disclaimer, which is similar to a nice disclaimer I saw in a NeurIPS tutorial on deep learning with Bayesian principles. This tutorial is not meant to be a review of all things Bayesian deep learning. That may have actually been possible three years ago, but I'm excited to say it's not anymore: we're having workshops now on Bayesian deep learning with hundreds of paper submissions. Rather, this tutorial is meant to provide a complementary perspective which is largely based on my own experiences and expertise. A decent portion will be based on my own work. In a sense, it's what I would tell myself if I could build a time machine and go back in time. That said, if you feel I should have included something, please send me an email and I'll try to include it next time.

So let's go back to this airline passenger number example, really forget everything that we know about statistics and machine learning, and think about the foundations. We'll also use this example to set up a lot of notation that we'll use throughout the tutorial. We have $n$ training points, or targets, or observations, $y$, and they're indexed by inputs $x_1$ up to $x_n$. Generally the $x$'s could be things like times, spatial locations, or images. We want to make a prediction at some arbitrary test input $x_*$, which in this case could be the airline passenger numbers in 1961.

Now just pause the video for a moment and think about a step-by-step procedure that you might have followed, just knowing what you knew in high school, in order to solve this problem. If it were me in high school, I would start by thinking about the functional forms that I'm familiar with, like sines, cosines, exponentials, and polynomials. Then I would create a functional form that I thought would be a reasonable description of what I'm seeing, and it might have some free parameters. I would specify an error function, which could be the squared distance between the outputs of my function and the training observations, and I would minimize that error function with respect to my parameters to learn those parameters. But this approach would involve a lot of ad hoc design decisions, like why squared error and not absolute error, for instance.

We can instead follow a probabilistic approach, where we suppose that our observations are drawn from a noise-free function $f(x, w)$ plus, for example, additive Gaussian noise with noise variance $\sigma^2$. We can then use this observation model to form a likelihood, maximize that likelihood with respect to our parameters to learn those parameters, and then use our conditional predictive distribution given those parameters to make our predictions. We can see by taking logs of the likelihood that if we follow this approach we'll get exactly the same point predictions as we had using the approach on the previous slide, where we just specified the squared error function.
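
The equivalence between maximizing a Gaussian likelihood and minimizing squared error can be checked numerically. The sketch below is an illustration under assumptions: the toy data, the one-parameter model `f(x, w) = w * x`, the fixed noise level, and the grid search are all made up for the example and are not from the talk.

```python
import math

# Toy data, made up for illustration.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.2, 2.8]

def f(x, w):
    """Hypothetical one-parameter model f(x, w) = w * x."""
    return w * x

def squared_error(w):
    return sum((y - f(x, w)) ** 2 for x, y in zip(xs, ys))

def gaussian_log_likelihood(w, sigma=0.5):
    """log p(y | x, w) under y = f(x, w) + N(0, sigma^2) noise."""
    n = len(xs)
    constant = -0.5 * n * math.log(2 * math.pi * sigma ** 2)
    return constant - squared_error(w) / (2 * sigma ** 2)

grid = [i / 1000 for i in range(0, 2001)]  # candidate values of w in [0, 2]
w_least_squares = min(grid, key=squared_error)
w_max_likelihood = max(grid, key=gaussian_log_likelihood)
print(w_least_squares, w_max_likelihood)
```

Because the log likelihood is a constant minus a positive multiple of the squared error, both searches pick the same `w` regardless of the value chosen for `sigma`.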

However, in this approach the design decisions are a bit more interpretable. We probably have some intuitions, for example, about whether we want to use Gaussian noise. Perhaps if we thought there were outliers in our problem we might use a heavy-tailed noise model like a Laplace distribution, and that would lead to an absolute value error function. So we can make different design decisions here and derive different loss functions. If we believe our model $f(x, w)$ to some extent, we can also get an estimate of the noise variance in the data.

Now, remembering what we know about statistics, you may be familiar with the idea that either of those approaches could lead to what's called overfitting, where we get very low training loss but very bad testing error. In order to combat overfitting, it's quite popular to introduce what's called a regularizer, where we add some kind of complexity penalty; for example, we might penalize the magnitude of the weights in our model. But this also involves all sorts of heuristic design decisions. How do we know whether we want large weights or small weights? It would totally depend on the parametrization of our model. What is complexity? How much should we penalize it? We could use some kind of $\lambda$ parameter, maybe determined through cross-validation, but what would be our validation sets? And if we had several $\lambda$ parameters, then we'd have a curse of dimensionality in estimating those parameters.

We can gain some interpretability by thinking about maximizing a log posterior, which would equal a log likelihood plus a log prior (up to an additive constant), and the log prior could be interpreted as a regularizer. But this really isn't a Bayesian approach.

There isn't much you need to know in order to use a Bayesian approach for your own research. We have Bayes' rule here, which is often expressed as a posterior being proportional to a likelihood times a prior. The normalization constant is the marginal likelihood, which is what we were considering on the vertical axis of that plot where we had all possible data sets on the horizontal axis. The sum rule says the marginal distribution $p(x)$ is equal to the sum over the joint distribution $p(x, y)$, summing out $y$. The product rule says that the joint distribution over $x$ and $y$ is equal to the conditional distribution $p(x \mid y)$ times $p(y)$, or the conditional distribution $p(y \mid x)$ times $p(x)$. And we can derive Bayes' rule from the product rule.
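
The sum rule, the product rule, and Bayes' rule can all be checked on a small discrete example. The joint table below is hypothetical, chosen only so the arithmetic is easy to follow.

```python
# Hypothetical joint distribution p(x, y) over two binary variables.
joint = {
    (0, 0): 0.10, (0, 1): 0.30,
    (1, 0): 0.20, (1, 1): 0.40,
}

def p_x(x):
    """Sum rule: p(x) = sum over y of p(x, y)."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def p_y(y):
    """Sum rule: p(y) = sum over x of p(x, y)."""
    return sum(p for (_, yi), p in joint.items() if yi == y)

def p_x_given_y(x, y):
    """Product rule rearranged: p(x | y) = p(x, y) / p(y)."""
    return joint[(x, y)] / p_y(y)

def p_y_given_x(y, x):
    """Product rule rearranged: p(y | x) = p(x, y) / p(x)."""
    return joint[(x, y)] / p_x(x)

def bayes(x, y):
    """Bayes' rule: p(x | y) = p(y | x) p(x) / p(y)."""
    return p_y_given_x(y, x) * p_x(x) / p_y(y)

print(p_x_given_y(1, 1), bayes(1, 1))
```

Both printed values agree, since Bayes' rule is just the two forms of the product rule set equal to each other.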

Now, ultimately we want to compute the unconditional predictive distribution $p(y \mid \mathbf{y})$, given our data $\mathbf{y}$ (bold $y$) but not given parameters. The sum and product rules give us the integral in equation 11, $p(y \mid \mathbf{y}) = \int p(y \mid w)\, p(w \mid \mathbf{y})\, dw$: the integral of $p(y)$ given the parameters times the posterior over those parameters given the data $\mathbf{y}$. This is called marginalization, because $w$ doesn't appear on the left side but does on the right. In words, this integral is saying: let's not just use one setting of parameters, let's use all possible settings of parameters, weighted by their posterior probabilities. And this isn't a controversial expression. It's a direct consequence of the sum and product rules of probability.

This model average represents what's called epistemic uncertainty over which function fits the data. There are many different functions corresponding to different settings of the parameters, and we're not sure, given a finite sample, which is the right description of the data. By representing epistemic uncertainty we can have some robustness against overfitting.

We can also view classical training as a special case of this approach, where we have an approximate posterior $q(w \mid \mathbf{y})$ equal to just a point mass, a delta function centered on the MAP (regularized maximum likelihood) solution for the parameters. If we substitute this in, we're just going to get our conditional predictive distribution given those maximum likelihood or MAP parameters. We can also see, then, that Bayesian and classical approaches will be similar when the posterior is highly concentrated around a setting of parameters, which of course is exactly not the case in deep learning, where we have neural nets whose posteriors are very diffuse and capture a variety of different models corresponding to complementary explanations of the data. So we're going to especially want to do this integral in deep learning.

We can also see that we can probably do a lot better than classical training in terms of estimating this integral by using some fairly simple approximate posteriors, which might not be good descriptions of the exact posterior but are still a lot better than a point mass. So we can definitely improve our estimates without needing an exact representation of this integral.

Now, there are fundamental differences between Bayesian model averaging and some types of model combination. In particular, the Bayesian model average is meant to represent a statistical inability to distinguish between hypotheses given limited information, but the assumption is that one of those hypotheses, one setting of those parameters, is the correct setting of parameters. As we get more and more data, our posterior over our hypotheses, our parameters, will collapse onto a particular setting, and we'll recover the maximum likelihood solution. This is different from some approaches to ensembling and model combination, which work by enriching the hypothesis space and assuming, for example, that a combination of models might be a correct description of reality.

Now let's exemplify some of these ideas with a few applications. Suppose we flip a biased coin with probability $\lambda$ of landing tails. I'd like you to pause the video and answer these three questions. One: what is the likelihood of a set of data $y_1$ up to $y_n$? So we're just doing $n$ flips, and maybe we see $m$ tails. Two: what is the maximum likelihood solution for $\lambda$? Three: suppose the first flip is tails. What is the probability that the next flip will be tails, using our maximum likelihood estimate for $\lambda$? You can assume $m$ tails in $n$ flips.

So the likelihood of our data is just a product of Bernoulli distributions, with two possible outcomes: here we have $y_i = 1$ if flip $i$ is tails and $y_i = 0$ if flip $i$ is heads. If we don't care about ordering and we observe $m$ tails, then our likelihood is a binomial distribution with $m$ tails, where the probability of getting tails is $\lambda$. We can easily maximize this likelihood: you could try taking logs of this expression, then derivatives with respect to $\lambda$, setting those derivatives equal to zero, and so on. We'll get the solution that the maximum likelihood setting of $\lambda$ is $m/n$, where we have $m$ tails and $n$ total flips, which in a sense is kind of intuitive, but on the other hand is kind of problematic.
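
As a quick numerical check of the $m/n$ result, the sketch below maximizes the binomial log likelihood over a grid rather than by calculus. The counts (7 tails in 10 flips) and the grid resolution are hypothetical choices for the example.

```python
import math

def binomial_log_likelihood(lam, m, n):
    """log p(m tails in n flips | lam), where lam is the probability of tails."""
    return (math.log(math.comb(n, m))
            + m * math.log(lam)
            + (n - m) * math.log(1.0 - lam))

m, n = 7, 10  # hypothetical counts: 7 tails in 10 flips

# Grid search over the open interval (0, 1) to avoid log(0).
grid = [i / 1000 for i in range(1, 1000)]
lam_ml = max(grid, key=lambda lam: binomial_log_likelihood(lam, m, n))
print(lam_ml, m / n)
```

The grid maximizer coincides with `m / n`, matching what you get by setting the derivative of the log likelihood to zero.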

Why do you think this estimate might be problematic? In considering this question, think about the third part of the problem: what's the probability that we would get tails on the next flip, assuming we've done one flip and we've just observed tails, using this estimator? Pause the video and think about the problem for a moment. If we substitute in $m = 1$ and $n = 1$, we're saying there's a one hundred percent chance that the next flip will be tails. Do you believe that? Of course not. And when we arrive at a clearly unbelievable prediction, it's usually because some part of our modeling procedure has not honestly represented our beliefs.

Let's think about a Bayesian approach to this problem. If we choose a prior $p(\lambda)$ proportional to $\lambda^{\alpha}(1-\lambda)^{\beta}$, then the posterior, after we multiply the likelihood against the prior, will have the same functional form as the prior. This is called a conjugate prior. A beta distribution has this functional form, with the gamma functions here for normalization. We can analytically compute the moments of the beta distribution.

Here we have visualizations of the beta distribution corresponding to different settings of its parameters $a$ and $b$. We can see in the top right panel that if we want to express the belief that we don't know what the bias is, then we can use a uniform distribution, setting $a$ and $b$ equal to one. This means $\lambda$ is equally probable for any value between zero and one a priori. So we can express even the idea that we really just don't know using a prior distribution; we don't need to have an informative prior. However, we might want to consider a prior that says we think that the bias is probably close to a half, but we're not going to say it's definitely a half. Just choose whichever prior is an honest reflection of your beliefs, even if your belief is "I don't know".

So we can multiply our prior with our likelihood to get our unnormalized posterior. This is a beta distribution, so we can compute its moments and use the posterior expectation over $\lambda$ for our predictions. We can see in equation 27 that it's $(m + a)/(n + a + b)$.

Now let's consider a few questions; it's good to do some sanity checks here. What's the probability the next flip is tails? Let's suppose that $a$ and $b$ are one. We can see that it's not going to be a hundred percent if $m$ is one and $n$ is one, so that's good. What happens when we make $a$ and $b$ really large? Well, the prior starts to dominate; that gives us a strong prior. If we make the data really large, then both $n$ and $m$ will be large, $m/n$ will dominate in this estimate, and we'll recover the maximum likelihood solution, which is what we want. That's a good sanity check.
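
The three sanity checks can be run directly from the posterior mean formula $(m + a)/(n + a + b)$. The specific counts and prior strengths below are hypothetical values picked to make each regime visible.

```python
def posterior_mean(m, n, a, b):
    """E[lam | data] for a Beta(a, b) prior and m tails in n flips.

    The posterior is Beta(a + m, b + n - m), whose mean is (m + a) / (n + a + b).
    """
    return (m + a) / (n + a + b)

# One flip, one tails, uniform prior: no longer a 100% prediction.
one_flip = posterior_mean(1, 1, 1, 1)

# A strong prior centered at one half dominates a single observation.
strong_prior = posterior_mean(1, 1, 1000, 1000)

# Lots of data: the estimate approaches the maximum likelihood value m / n.
lots_of_data = posterior_mean(7000, 10000, 1, 1)

print(one_flip, strong_prior, lots_of_data)
```

With a uniform prior and a single observed tails the prediction is 2/3 rather than 1, the strong prior pulls the answer toward 0.5, and the large data set pulls it toward 0.7.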

And now I'd like you to consider this fourth question, which is conceptually very important. Does the MAP estimate, what we get when we take the arg max of the log posterior over $\lambda$, which is equal to the arg max of the log likelihood plus the log prior, with a uniform prior over $\lambda$, give the same answer as Bayesian marginalization to find the probability that the next flip is tails? Pause the video and think about this question for a minute.

So if we have a uniform prior over $\lambda$, then $\log p(\lambda)$ won't affect our optimization in equation 31, and we'll just get the maximum likelihood solution. And we just saw that when we do marginalization we get a different answer than the maximum likelihood solution, even with a uniform prior. So I can't emphasize enough that we should not interpret Bayesian methods as regularizers in optimization. There is a conceptually very important difference between marginalization and regularized maximum likelihood optimization, and that difference will be practically crucial when we're thinking about Bayesian methods in deep learning.
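
The difference between the two answers can be seen numerically for one flip that lands tails. The sketch below is an illustration only: it approximates both the arg max and the integral on a grid of midpoints, and the grid size is an arbitrary choice.

```python
def unnormalized_posterior(lam, m, n):
    """Likelihood lam^m (1 - lam)^(n - m) times a uniform prior (constant 1)."""
    return lam ** m * (1.0 - lam) ** (n - m)

m, n = 1, 1  # one flip, one tails
k = 10000
grid = [(i + 0.5) / k for i in range(k)]  # midpoints of k equal-width bins on [0, 1]

# MAP: the single most probable value of lam on the grid.
lam_map = max(grid, key=lambda lam: unnormalized_posterior(lam, m, n))

# Marginalization: p(tails next | data) = E[lam | data], via midpoint-rule sums.
weights = [unnormalized_posterior(lam, m, n) for lam in grid]
p_tails_bma = sum(lam * w for lam, w in zip(grid, weights)) / sum(weights)

print(lam_map, p_tails_bma)
```

The MAP value sits at the edge of the grid next to 1, the same as maximum likelihood, while the marginalized prediction is close to 2/3.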

Let's consider one more example. Suppose we have observations $y_1$ up to $y_n$ drawn from an unknown density $p(y)$. We'll start by specifying an observation model: we'll suppose that the points are drawn from a mixture of Gaussians, and in order to estimate this unknown density we'll learn the parameters of this mixture of two Gaussians. The parameters here are the weights, the means, and the variances. We can use the observation model to form a likelihood, and I've just written it down here. I'd like you to pause again and think about choosing a setting of parameters which will provide a lot of likelihood, without having to take derivatives and things like that. You can just look at this expression, play with a few settings, and find something. Just think about the means and the variances; don't worry about the weights.

So if we make, for example, the mean of the first component equal to one of the data points, then this exponential will disappear, it'll just be 1, and we get this normalization constant $w_1/\sqrt{2\pi\sigma_1^2}$. Then we can make $\sigma_1$, the standard deviation, very small for the first component, and this term will blow up. The other term we can use to assign density to all the other points, so we're not multiplying against zeros, and the likelihood goes to infinity.
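
This blow-up is easy to reproduce. The sketch below uses made-up observations and made-up settings for the second component; it pins the first mean to a data point and shrinks the first standard deviation.

```python
import math

def normal_pdf(y, mu, sigma):
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / math.sqrt(2 * math.pi * sigma ** 2)

def mixture_log_likelihood(ys, w1, mu1, sigma1, mu2, sigma2):
    """Log likelihood of a two-component Gaussian mixture with weights w1 and 1 - w1."""
    return sum(
        math.log(w1 * normal_pdf(y, mu1, sigma1) + (1.0 - w1) * normal_pdf(y, mu2, sigma2))
        for y in ys
    )

ys = [-1.2, 0.3, 0.8, 2.5]  # toy observations, made up for illustration

# Pin the first component's mean to a data point and shrink its standard deviation.
# The second component stays broad so every point keeps nonzero density.
sigmas = [1.0, 1e-2, 1e-4, 1e-8]
lls = [
    mixture_log_likelihood(ys, w1=0.5, mu1=ys[0], sigma1=s, mu2=0.5, sigma2=2.0)
    for s in sigmas
]
for s, ll in zip(sigmas, lls):
    print(f"sigma1={s:g}  log likelihood={ll:.2f}")
```

The log likelihood keeps growing as `sigma1` shrinks, with no finite maximum, which is the degenerate solution described above.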

Now, do we believe this solution? Of course not. We typically wouldn't believe our data are comprised of point masses, and when we reach an unbelievable solution it's typically because we haven't fully represented our beliefs in our modeling procedure. We could introduce a regularizer, or a prior, which would go to zero faster than the likelihood goes to infinity as the variance parameters go to zero. But we might want to include the point mass solution, as long as it's one of an uncountably infinite number of solutions, which we can do through full Bayesian marginalization. In that case we can use extremely flexible models, even infinite mixtures of Gaussians, corresponding to Dirichlet process mixture models, and achieve good generalization even with a small number of points.

Now, ultimately, as we've been saying, we wish to compute a Bayesian model average corresponding to our unconditional predictive distribution, $p(y \mid \mathcal{D})$ for data $\mathcal{D}$, rather than our conditional predictive distribution, $p(y \mid w)$, given parameters $w$. This unconditional predictive distribution is equal to the integral of our conditional predictive distribution times our posterior $p(w \mid \mathcal{D})$. This is just an expression of the sum and product rules of probability. Rather than use a single setting of parameters, we want to use all possible settings of parameters weighted by their posterior probabilities, which is going to be especially impactful in deep learning, where we have highly diffuse posteriors containing different settings of parameters that correspond to a variety of compelling and different solutions to a given problem.

For most models, including Bayesian neural nets, this integral is not analytic. It's common to use what's called a simple Monte Carlo approximation, where we take an average of the conditional predictive distributions for different settings of parameters sampled from an approximate posterior $q(w \mid \mathcal{D})$ given data $\mathcal{D}$.
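
A simple Monte Carlo estimate can be sketched on the coin example, where the exact answer is known for comparison. Here the "approximate posterior" is simply the exact beta posterior, which is an assumption made to keep the example checkable; the seed and sample count are arbitrary.

```python
import random

def simple_monte_carlo_tails(m, n, a, b, num_samples, rng):
    """Estimate p(next flip is tails | data) as an average over posterior samples.

    Each sample lam_j ~ Beta(a + m, b + n - m) gives a conditional prediction
    p(tails | lam_j) = lam_j; the model average is the mean of those predictions.
    """
    samples = [rng.betavariate(a + m, b + n - m) for _ in range(num_samples)]
    return sum(samples) / num_samples

rng = random.Random(0)  # arbitrary seed, for reproducibility only
estimate = simple_monte_carlo_tails(m=1, n=1, a=1, b=1, num_samples=20000, rng=rng)
exact = (1 + 1) / (1 + 1 + 1)  # (m + a) / (n + a + b)
print(estimate, exact)
```

The estimate approaches the exact value as the number of samples grows; in a neural net the same average would be taken over predictions from sampled weight settings.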

We typically find these samples through one of two approaches. Deterministic methods approximate the posterior distribution with some convenient distribution $q$. Although the integral can't be computed in closed form, typically we can represent the unnormalized posterior analytically: it's just the likelihood times the prior. Our approximate posterior $q$ is chosen for convenience, often so that it's easy to sample from, like a Gaussian distribution, in which case its parameters would be its mean vector and its covariance matrix, which we typically choose to make $q$ close to $p$ in some sense. For example, variational methods find these parameters by minimizing the KL divergence between $q$ and $p$. As we mentioned earlier, classical training is a special case of approximate inference, where our approximate posterior is just a point mass centered at the maximum likelihood or regularized maximum likelihood (MAP) setting of parameters. The Laplace approximation is another popular deterministic method, which we'll discuss further in part three. Expectation propagation is another popular approach, and there are several others.

We could alternatively consider Markov chain Monte Carlo (MCMC), which forms a Markov chain of approximate but asymptotically exact samples from our posterior. Metropolis-Hastings is a popular MCMC approach.
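
To make the idea concrete, here is a minimal random-walk Metropolis-Hastings sketch targeting the coin-flip posterior from earlier. It is an illustration under assumptions, not a method from the talk: the Gaussian proposal, step size, chain length, burn-in, and seed are all arbitrary choices.

```python
import math
import random

def log_unnormalized_posterior(lam, m, n, a, b):
    """log of likelihood times Beta(a, b) prior, up to an additive constant."""
    if lam <= 0.0 or lam >= 1.0:
        return -math.inf
    return (m + a - 1) * math.log(lam) + (n - m + b - 1) * math.log(1.0 - lam)

def metropolis_hastings(log_target, start, step, num_steps, rng):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    current, current_lp = start, log_target(start)
    chain = []
    for _ in range(num_steps):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(current)).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        chain.append(current)
    return chain

rng = random.Random(0)  # arbitrary seed, for reproducibility only
chain = metropolis_hastings(
    lambda lam: log_unnormalized_posterior(lam, m=1, n=1, a=1, b=1),
    start=0.5, step=0.3, num_steps=20000, rng=rng,
)
kept = chain[2000:]  # discard early samples as burn-in
estimate = sum(kept) / len(kept)
print(estimate)
```

Only the unnormalized posterior is needed, since the normalization constant cancels in the acceptance ratio. The sample mean should land near the exact posterior mean of 2/3 for this target.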

Hamiltonian Monte Carlo uses gradient information and was very successfully developed by Radford Neal in the mid-90s for Bayesian neural nets. Recently, stochastic gradient MCMC methods have been very up-and-coming and exciting approaches in Bayesian deep learning, because they algorithmically resemble SGD, which means they can be applied in a wide variety of applications where you might otherwise use classical training, but often with better results. Stochastic gradient Langevin dynamics and stochastic gradient Hamiltonian Monte Carlo are stochastic gradient MCMC approaches we'll discuss in part three.

Later, in part four, we'll also argue that we may sometimes want to avoid the simple Monte Carlo perspective in equation 33. Really, what we're most interested in ultimately is estimating this unconditional predictive distribution in equation 32 under computational constraints. From this perspective, it's helpful to think of estimating that integral as an active learning problem under constraints, in which case the deep ensembles method can be very compelling as an approximate Bayesian method, which we'll discuss in part four.

That's the end of part one. In part two we'll be considering a function space perspective.

Hi, I'm Andrew Wilson, and welcome back to part two of the ICML 2020 tutorial on Bayesian deep learning. In this part we'll be considering a function space perspective of machine learning. We'll cover Gaussian processes, infinite neural nets, how training a neural net is like learning a kernel, Bayesian non-parametric deep learning, and several other topics.

From the function space perspective, the parameters in isolation are entirely divorced from the statistical properties of a model. In part one we considered how, in regularization, whether we want large weights or small weights entirely depends on the parametrization of the function that we're using. Yet we focus most of our modeling efforts on learning parameters $w$. What we really care about is how those parameters $w$ combine with the functional form $f(x, w)$. Ideally we want to perform inference directly in function space.

Let's return to the example that we considered in part one, where we had a distribution over functions induced by a distribution over parameters in a linear model. We'll consider $f(x, w) = w_0 + w_1 x$, and we'll place a standard normal distribution over $w_0$ and $w_1$. We can sample from that distribution to get different values of $w_0$ and $w_1$, corresponding to different straight lines with different slopes and intercepts. The gray shade here corresponds to a 95% credible set: 95% of these functions are in that shade. The solid blue line corresponds to the expectation of this induced distribution over functions.

We'll now consider the more general model class where we have an inner product of a vector of weights $w$ with a vector of basis functions $\phi(x)$. The entries of $\phi$, for example, could be $1, x, x^2, x^3$ if we're considering a polynomial basis; we could alternatively consider a Fourier basis. This model class is quite general. We'll place a Gaussian distribution over $w$ with mean zero, for convenience, and covariance matrix $\Sigma_w$.

Now I'd like you to pause the video for a second and see if you can derive the moments of the induced distribution over functions, given by equations 38 and 39.
Now, all of the randomness in this model comes from the distribution over parameters $w$, so we can pull $\phi$ out of the expectation, since it's deterministic, and evaluate the expectation over $w$, which by definition in this case is just zero. So the mean of this induced distribution over functions is zero.

Now, $f(x_i)$ and $f(x_j)$ are two different random variables, corresponding to querying our random function at two different input points, $x_i$ and $x_j$. The inputs could be time, spatial locations, whatever is indexing our random function. We can use the definition of covariance and derive the covariance function, also known as the kernel function, in this case as the inner product of $\phi(x_i)$ with $\phi(x_j)$ under $\Sigma_w$: $k(x_i, x_j) = \phi(x_i)^\top \Sigma_w \phi(x_j)$. Now I'd like you to consider whether there are any higher moments in this induced distribution over functions. In this case there are no higher moments to specify, because we just have a sum of Gaussian random variables in equation 40, and Gaussians are closed under addition, so the distribution is fully determined by its mean and covariance.

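For reference, the two moments can be written out step by step. This is a sketch of the standard derivation under the assumptions stated in the talk ($w \sim \mathcal{N}(0, \Sigma_w)$ and deterministic $\phi$); the numbering of the corresponding slide equations is not reproduced here.

$$
\begin{aligned}
\mathbb{E}[f(x)] &= \mathbb{E}[w]^\top \phi(x) = 0, \\
\operatorname{cov}\big(f(x_i), f(x_j)\big) &= \mathbb{E}[f(x_i) f(x_j)] - \mathbb{E}[f(x_i)]\,\mathbb{E}[f(x_j)] \\
&= \phi(x_i)^\top \mathbb{E}[w w^\top]\, \phi(x_j) - 0 \\
&= \phi(x_i)^\top \Sigma_w\, \phi(x_j).
\end{aligned}
$$
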
In fact, $f(x)$ is a Gaussian process with a mean function $m(x)$ and a covariance function, or kernel, $k(x, x')$ for two arbitrary inputs $x$ and $x'$. In this case we can actually do inference directly over $f(x)$ instead of over parameters $w$.

Formally, a Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. We can use a Gaussian process to define a prior over functions. This notation here, $f(x) \sim \mathcal{GP}(m, k)$, means that any collection of function values queried at any collection of inputs $x_1, \dots, x_N$ (these could be time, spatial locations, images, etc.) has a joint multivariate normal distribution, with a mean vector $\mu$ defined by the mean function of the Gaussian process, and a covariance matrix $K$ defined by the covariance function, or kernel, of the Gaussian process evaluated at all possible pairs of inputs $x_1, \dots, x_N$.

In the bottom left panel here we have samples from our induced distribution over functions. To create these samples, I chose a bunch of input points, a bunch of $x$'s, and then I formed my multivariate normal distribution by building up my mean vector $\mu$, evaluating my mean function (in this case zero) at all the input points that I've chosen, and forming my covariance matrix by querying my kernel at all possible pairs of the inputs. I then sampled from that multivariate normal distribution to get the black dots, the random function values, and then I sampled from exactly that same distribution two more times to get the purple curve and the green curve. Here I've just joined the dots together. The gray shade again corresponds to our 95% credible set, and the blue curve here corresponds to our prior mean function, in this case zero for convenience.

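The sampling recipe just described can be sketched as follows. The RBF kernel, the input grid, and the small diagonal "jitter" added for numerical stability are assumptions made for this illustration; the helper name `rbf_kernel` is hypothetical and not taken from any library.

```python
import numpy as np

def rbf_kernel(x1, x2, amplitude=1.0, length_scale=1.0):
    """RBF kernel a^2 * exp(-(x - x')^2 / (2 l^2)) for 1D inputs."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return amplitude**2 * np.exp(-0.5 * sq_dist / length_scale**2)

rng = np.random.default_rng(0)

# Choose a bunch of input points.
x = np.linspace(-5.0, 5.0, 100)

# Mean vector from the (zero) mean function, covariance from all input pairs.
mu = np.zeros_like(x)
K = rbf_kernel(x, x) + 1e-6 * np.eye(len(x))   # jitter keeps Cholesky stable

# Sample three functions from N(mu, K) via the Cholesky factor of K.
L = np.linalg.cholesky(K)
samples = mu + (L @ rng.standard_normal((len(x), 3))).T   # shape (3, 100)

# Pointwise 95% band of the prior: mean +/- 1.96 standard deviations.
std = np.sqrt(np.diag(K))
band_lower, band_upper = mu - 1.96 * std, mu + 1.96 * std
```

Each row of `samples` plays the role of one of the colored curves: a set of random function values at the chosen inputs, which can be joined together for plotting.
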
In the right panel I've conditioned on some data, denoted by crosses, and I've sampled several posterior functions in black, purple, and green. We also have the posterior mean function in solid blue, which we can use to make point predictions, and a 95% credible set in gray shade.

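Conditioning on data has a closed form for a Gaussian process with Gaussian observation noise. The sketch below uses the standard posterior equations with a zero mean function; the toy data, the noise variance, and the helper names `rbf_kernel` and `gp_posterior` are assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(x1, x2, amplitude=1.0, length_scale=1.0):
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return amplitude**2 * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior(x_train, y_train, x_test, noise_var=1e-2):
    """Posterior mean and covariance of a zero-mean GP at x_test."""
    K = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)
    K_ss = rbf_kernel(x_test, x_test)
    L = np.linalg.cholesky(K)
    # alpha = K^{-1} y, computed with two triangular solves.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    v = np.linalg.solve(L, K_s)
    mean = K_s.T @ alpha
    cov = K_ss - v.T @ v
    return mean, cov

# Toy observations (the "crosses") and a grid of test inputs.
x_train = np.array([-4.0, -2.5, 0.0, 1.5, 3.0])
y_train = np.sin(x_train)
x_test = np.linspace(-5.0, 5.0, 50)

mean, cov = gp_posterior(x_train, y_train, x_test)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
band_lower, band_upper = mean - 1.96 * std, mean + 1.96 * std
```

The posterior mean gives point predictions, and the band shrinks near the observations and widens back towards the prior away from them.
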
Very importantly, all of the statistical properties of these functions are controlled by our kernel, and in this case we're using what's called an RBF kernel, also known as a squared exponential or a Gaussian kernel. This kernel has the functional form $k(x, x') = a^2 \exp\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)$: that is, $a^2$ times the exponential of the negative squared Euclidean distance between our two inputs $x$ and $x'$, over $2\ell^2$, where $\ell$ is our length scale.

It incorporates the very intuitive inductive bias that function values at inputs that are close together in the input space should be more correlated than function values at inputs that are far apart. For example, airline passenger numbers in 1951 and 1952 should be more similar than airline passenger numbers in 1951 and 1960. So this is a very natural inductive bias.

The extent of the correlations is controlled by the length scale hyperparameter $\ell$. If $\ell$ is very small, the function values become uncorrelated: we just have a white noise process. If $\ell$ is very large, then the function values all become very correlated. $a$ is the amplitude hyperparameter: if $a$ is large, then the amplitude of these functions will be large.

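To make the role of the length scale concrete, the sketch below evaluates the RBF kernel for the two pairs of years mentioned above at a few length scales. The specific length-scale values and the function name `rbf_kernel_value` are illustrative assumptions.

```python
import numpy as np

def rbf_kernel_value(x, x_prime, amplitude=1.0, length_scale=1.0):
    """k(x, x') = a^2 * exp(-(x - x')^2 / (2 l^2)) for scalar inputs."""
    return amplitude**2 * np.exp(-0.5 * (x - x_prime) ** 2 / length_scale**2)

for length_scale in (0.1, 1.0, 10.0):
    near = rbf_kernel_value(1951.0, 1952.0, length_scale=length_scale)
    far = rbf_kernel_value(1951.0, 1960.0, length_scale=length_scale)
    print(f"l = {length_scale:5.1f}   k(1951, 1952) = {near:.3f}   "
          f"k(1951, 1960) = {far:.3f}")
```

With a tiny length scale both covariances are essentially zero (the white noise regime), while with a large length scale both approach $a^2$ (everything highly correlated).
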
In the top panel here we have a visualization of the entries of our covariance matrix when we have 1D ordered inputs. We can see that there's this strong diagonal band where we have the highest covariance, and then the covariances decrease away from that band as we increase distance in the input space.

So let's consider the distribution over functions that we get when we have particular settings of our length scale and amplitude parameters, $\ell_0$ and $a_0$. Here we have our sample prior functions and our sample posterior functions. Pause the video for a moment and think about whether these sample functions look strange to you in some way. It might be a little bit subtle. To me, these functions look a bit too wiggly: they're varying a little bit too quickly, they're a bit too complex. The mean shoots back down to the prior mean a bit too fast. So to me it seems like the length scale is too small. Now let's see what happens as we increase the length scale. Perhaps here we've increased the length scale a bit too much: we see now that the functions look too slowly varying, too simple. They're oversmoothing the data.

Let's consider again the marginal likelihood that we considered in part one of this tutorial. We can form the marginal likelihood in this case by integrating away our Gaussian process $p(f)$. The marginal likelihood, again, is the probability that we would generate a data set if we were to sample from this distribution over functions.

Now, if we have a very small length scale, we just have a white noise process, so we're going to be able to generate all sorts of different data sets, but they're all going to be pretty different from one another: if we keep sampling from that distribution over functions, we won't see the same thing again. If we have a very large length scale, then this induced distribution over functions is pretty simple. As we saw, everything looks sort of like a straight line and everything is pretty similar, so we're not generating many data sets with very much probability using that large length scale, in red.

For a given data set, the marginal likelihood will have an Occam's razor property, where it will automatically favor a model of appropriate complexity. This is described very eloquently by David MacKay, both in his PhD thesis and in his book on information theory. It's also described in the book Gaussian Processes for Machine Learning by Rasmussen and Williams.

If we optimize our marginal likelihood with respect to the length scale, we get the green fit here, which intuitively has an appropriate level of complexity. If we had to choose between one of these curves, we'd probably choose the green curve.

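Choosing the length scale by maximizing the marginal likelihood can be sketched with a simple grid search. This is a minimal illustration under assumptions made here: toy sinusoidal data, a fixed amplitude and noise variance, and a grid search instead of the gradient-based optimization one would normally use. The log marginal likelihood used is the standard Gaussian expression $-\tfrac{1}{2} y^\top K^{-1} y - \tfrac{1}{2}\log|K| - \tfrac{n}{2}\log 2\pi$.

```python
import numpy as np

def rbf_kernel(x1, x2, amplitude=1.0, length_scale=1.0):
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return amplitude**2 * np.exp(-0.5 * sq_dist / length_scale**2)

def log_marginal_likelihood(x, y, length_scale, amplitude=1.0, noise_var=0.01):
    n = len(x)
    K = rbf_kernel(x, x, amplitude, length_scale) + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha
    complexity = -np.sum(np.log(np.diag(L)))      # equals -0.5 * log det K
    return data_fit + complexity - 0.5 * n * np.log(2.0 * np.pi)

# Toy data: a noisy sinusoid.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 40)
y = np.sin(x) + 0.1 * rng.standard_normal(len(x))

# Evaluate the marginal likelihood over a grid of length scales.
length_scales = np.logspace(-2, 2, 50)
lml = np.array([log_marginal_likelihood(x, y, l) for l in length_scales])
best_length_scale = length_scales[np.argmax(lml)]
print(f"length scale with highest marginal likelihood: {best_length_scale:.3g}")
```

Very small length scales pay a penalty through the log determinant term, very large ones through the data-fit term, so the maximum lands at an intermediate value.
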
Now, we opened this part of the tutorial by deriving covariance functions from classical weight space models, and then we said, well, let's just use a Gaussian process with an RBF kernel. Now let's derive where that RBF kernel comes from.

We'll again consider an inner product of weights and basis functions, in this case not expressed using vector notation. We'll put a Gaussian distribution over our weights, and we'll use as our basis functions Gaussian bumps centered at points $c_i$, as drawn in this diagram. We can grind through the algebra, revert to the definition of our covariance function, and get our covariance function in equation 55.

Now let's consider what happens as we let the number of basis functions $J$ go to infinity. We'll want these basis functions to cover the whole real line, so we'll let $c_J$, the center of the $J$-th basis function, be $\log J$, and $c_1$, the center of the first basis function, be $-\log J$. If these basis functions are equally spaced, then the difference between the centers of the $(i+1)$-th basis function and the $i$-th basis function will be $2 \log J / J$. We see that this goes to 0 as $J$ goes to infinity, which is why I used $\log J$ for the endpoints. I could have used $\sqrt{J}$, for instance, but not $J$. The expression for the kernel in equation 57 becomes a Riemann sum, an integral with limits $c_0$ and $c_\infty$, in this case $-\infty$ and $\infty$. We can substitute in our expression for the basis functions, evaluate this integral in closed form, and get something that's proportional to the RBF kernel.

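The limiting argument can be checked numerically. The sketch below assumes Gaussian bumps $\phi_i(x) = \exp\left(-(x - c_i)^2 / (2\ell^2)\right)$ and independent weights whose variance equals the spacing between centers, so that the finite sum is a Riemann sum; these scaling choices are assumptions made for the illustration and may differ in constants from the slide equations. Under them, the integral evaluates to $\sqrt{\pi}\,\ell \exp\left(-(x - x')^2 / (4\ell^2)\right)$, which is an RBF kernel with length scale $\sqrt{2}\,\ell$.

```python
import numpy as np

def finite_basis_kernel(x, x_prime, num_basis, length_scale=1.0):
    """Kernel induced by num_basis Gaussian bumps with centers in [-log J, log J]."""
    centers = np.linspace(-np.log(num_basis), np.log(num_basis), num_basis)
    spacing = centers[1] - centers[0]
    phi_x = np.exp(-0.5 * (x - centers) ** 2 / length_scale**2)
    phi_xp = np.exp(-0.5 * (x_prime - centers) ** 2 / length_scale**2)
    # Weight variance equal to the spacing turns the sum into a Riemann sum.
    return spacing * np.sum(phi_x * phi_xp)

def limiting_kernel(x, x_prime, length_scale=1.0):
    """Closed-form value of the integral of phi_c(x) * phi_c(x') over all centers c."""
    return (np.sqrt(np.pi) * length_scale
            * np.exp(-(x - x_prime) ** 2 / (4.0 * length_scale**2)))

x, x_prime = 0.3, -0.7
for num_basis in (10, 100, 5000):
    approx = finite_basis_kernel(x, x_prime, num_basis)
    print(f"J = {num_basis:5d}   finite-basis kernel = {approx:.6f}")
print(f"limit              = {limiting_kernel(x, x_prime):.6f}")
```

The infinite-basis model is therefore evaluated with a single closed-form kernel call, with no need to store or sum over the basis functions.
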
Now, this is a very extraordinary result. I'd like you to pause the video and just consider what we've done for a moment. We've actually shown that, by thinking from a function space perspective, we can use a model with an infinite number of basis functions using a finite amount of computational resources.

So in part one of the talk I opened with this question about how we might model airline passenger numbers, and I said I'd like to use the ten-thousandth order polynomial from the three choices that we were considering. You might have wondered: carrying this argument to its limit, are we even going to be able to store these models in memory, assuming that we want to use them? In fact, what we're seeing here is that we can use something like an infinite order polynomial. We can use a model that is extraordinarily flexible. In fact, this is a universal approximator, and we can see that by looking at this derivation: if we collapse these basis functions onto point masses, we can have densely dispersed point masses with different heights. But we've also seen that this model can generalize very well. It has very intuitive inductive biases.

So we should be very careful not to conflate flexibility with the complexity of our model class, and we should also be very careful not to do parameter counting as a proxy for complexity. Here we're seeing that we can actually use models with an infinite number of parameters that in some sense have fairly simple inductive biases and provide very good generalization.

In this diagram, where we have all possible data sets on the horizontal axis, we're saying we want to have heavy tails: we want to support all sorts of different data sets. But we also want to have mass in reasonable places, and we can get that in this case through the inductive biases of the kernel function.

I'd like to add a brief note about the mean function. In this tutorial so far, and in many texts about Gaussian processes, the mean function $m(x)$ is often taken to be zero for notational convenience. But in fact we can use any deterministic mean function without fundamentally changing the modeling procedure. It's pretty standard to subtract off the empirical mean and then use a zero-mean GP, or to subtract some kind of deterministic mean function and then add it back later.

Also, the covariance function, or kernel, is typically the key object of interest. There are often degeneracies between specifying the mean and the covariance function, in which case it's typically preferred to do the modeling in the covariance function. For example, the kernel function shows up in the Occam factor term we get in the marginal likelihood, a log determinant, but the mean function does not. That said, a mean function can be a great way of incorporating scientific inductive biases into a model. There's sometimes a tension between using an interpretable, scientifically motivated model, for example a model with physically interpretable parameters, and some black box function approximator. In actuality this is often a false choice: we can often use both. We can use as our mean function that scientifically motivated model, that parametric approach which has interpretable parameters, and at the same time use our Gaussian process with an RBF kernel to allow for a non-parametric relaxation around that mean function, to account for inevitable model misspecification.

So we can really have both. We can use the mean function of whatever other model we would have used to calibrate the inductive biases of our approach, and we can concentrate our distribution over functions as closely or as loosely as we like around that scientific mean function.

Interest in Gaussian processes in the machine learning community was triggered by work that Radford Neal was doing on Bayesian neural nets. He really embraced this idea that we should build very flexible models, and accordingly was pursuing the limits of very large Bayesian neural nets. He showed that as we let the number of hidden units in a Bayesian neural net go to infinity, we get a Gaussian process.

Here, let's consider the simple neural net where $f(x)$ equals a bias plus a sum of $J$ hidden units with hidden unit weights $u_i$. With fairly general assumptions, one can show using a central limit theorem argument, and not even requiring, for example, that the parameters have Gaussian distributions, that we'll get a Gaussian process in the limit as $J$ goes to infinity. We can derive the moments of the Gaussian process just as we already have been doing in other examples.

We can make particular design choices: we can let our activation functions be error functions, and we can use a Gaussian prior over our hidden unit weights, and derive a neural network kernel in equation 69.

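A kernel of this kind can be checked by simulation. The sketch below assumes hidden units of the form $h(x; u) = \operatorname{erf}(u_0 + u_1 x)$ with $u \sim \mathcal{N}(0, \Sigma)$, and compares a Monte Carlo estimate of $\mathbb{E}_u[h(x; u)\,h(x'; u)]$ with the closed-form arcsine expression given for this case in Rasmussen and Williams' book. Whether this matches the slide's equation 69 term for term (for example in how bias and output-weight variances enter) is an assumption; the covariance `Sigma` and the inputs are arbitrary illustrative values.

```python
import math
import numpy as np

erf = np.vectorize(math.erf)

def neural_net_kernel(x, x_prime, Sigma):
    """Closed-form E[erf(u . x~) erf(u . x~')] for u ~ N(0, Sigma), x~ = (1, x)."""
    xa = np.concatenate(([1.0], np.atleast_1d(x)))        # augment with bias input
    xb = np.concatenate(([1.0], np.atleast_1d(x_prime)))
    numerator = 2.0 * xa @ Sigma @ xb
    denominator = np.sqrt((1.0 + 2.0 * xa @ Sigma @ xa)
                          * (1.0 + 2.0 * xb @ Sigma @ xb))
    return (2.0 / np.pi) * np.arcsin(numerator / denominator)

rng = np.random.default_rng(0)
Sigma = np.diag([1.0, 4.0])                 # prior covariance of (u0, u1)
u = rng.multivariate_normal(np.zeros(2), Sigma, size=200_000)

x, x_prime = 0.5, -1.0
h_x = erf(u[:, 0] + u[:, 1] * x)            # hidden unit outputs at x
h_xp = erf(u[:, 0] + u[:, 1] * x_prime)     # hidden unit outputs at x'

monte_carlo = np.mean(h_x * h_xp)
closed_form = neural_net_kernel(x, x_prime, Sigma)
print(f"Monte Carlo estimate: {monte_carlo:.4f}   closed form: {closed_form:.4f}")
```

Unlike the RBF kernel, this expression depends on $x$ and $x'$ individually and not only on their difference, which is what makes the kernel non-stationary.
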
Here we have samples from a GP with this neural net kernel. We can see that they're non-stationary: at a high level, the statistical properties in the center are different than in other regions, unlike, say, the RBF kernel, which is translation invariant. The RBF kernel gives a very flexible model, and its functions could look different in different parts of space, but at a high level the statistical properties of the functions under those types of kernels will be similar in different regions of space.

So this result, that an infinite neural net converges to a Gaussian process, was very exciting in the machine learning community, and in some cases it drove the idea that we ought to just use Gaussian processes instead of neural nets.

At the time, people were becoming very frustrated by all sorts of design decisions associated with working with neural nets, like: how many hidden units do we want? How many hidden layers? What are the activation functions? What optimization procedure are we using? What learning rate schedule are we using with that optimization procedure? They were also frustrated by the lack of a principled framework to answer these types of questions. Gaussian processes, by contrast, were very simple, interpretable, and flexible, and you could write the same 10 lines of code and get the same answer anywhere in the world.

Now, David MacKay was a big supporter of Gaussian process research, but at the same time he was a bit of a contrarian. He wrote this essay in an edited book on neural nets where he has this quote: "How can Gaussian processes possibly replace neural nets? Have we thrown the baby out with the bathwater?" What he meant was that when neural nets were being developed, they were envisaged as becoming intelligent agents that could discover interesting hidden representations in data. And while Gaussian processes have all these nice statistical properties, they're also basically just smoothing devices, in which case, are we throwing the baby out with the bathwater in treating GPs as replacements for neural nets?

The answer to this question is to build more expressive kernel functions, which can discover interesting hidden representations in data. Although kernel methods, Gaussian processes, and neural nets are sometimes treated as competing approaches, they're often actually quite complementary. What we get with the neural net is a way to create highly adaptive basis functions, which have very good inductive biases for a number of different application domains, such as image recognition. What we get through a kernel method in a Gaussian process is a way to have an infinite number of basis functions using a finite amount of computation.

We can combine both of these properties together into approaches like deep kernel learning, where we have a neural net transforming the inputs of a base kernel to create a deep kernel, which is then jointly trained through the marginal likelihood of the Gaussian process as one model. So this gives us an infinite number of highly adaptive basis functions. Importantly, this is quite different from what we would get if we were to just use a neural net as a feature extractor, train the neural net, and then apply a GP to the result. Rather, deep kernel learning here is a one-step, end-to-end procedure. It's a single model, and it's trained through the marginal likelihood objective.

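The construction of a deep kernel can be sketched as an RBF base kernel applied to the outputs of a small network, $k_{\text{deep}}(x, x') = k_{\text{RBF}}(g(x; w), g(x'; w))$. The sketch below only builds the kernel matrix with randomly initialized network weights; the architecture, the helper names, and the random inputs are illustrative assumptions. In deep kernel learning the network weights and the kernel hyperparameters would be trained jointly by maximizing the GP marginal likelihood, which is not shown here, and this is not the API of any particular library.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Random weights for a small fully connected network (illustrative only)."""
    return [(rng.standard_normal((m, n)) / np.sqrt(m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_features(params, x):
    """Network g(x; w): tanh hidden layers followed by a linear output layer."""
    h = x
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.tanh(h)
    return h

def rbf_kernel(z1, z2, amplitude=1.0, length_scale=1.0):
    sq_dist = np.sum((z1[:, None, :] - z2[None, :, :]) ** 2, axis=-1)
    return amplitude**2 * np.exp(-0.5 * sq_dist / length_scale**2)

def deep_kernel(params, x1, x2, **kernel_hypers):
    """Base RBF kernel applied to inputs transformed by the network."""
    return rbf_kernel(mlp_features(params, x1), mlp_features(params, x2),
                      **kernel_hypers)

rng = np.random.default_rng(0)
params = init_mlp(rng, sizes=[10, 32, 2])     # 10-dim inputs mapped to 2-dim features
x = rng.standard_normal((20, 10))
K = deep_kernel(params, x, x, length_scale=1.0)
```

Because the base kernel is a valid kernel and the network is a deterministic map, the resulting matrix is still a valid covariance matrix, so all of the usual GP machinery applies unchanged.
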
We can use Gaussian processes with deep kernels to do representation learning. In this example we have a Gaussian process with a deep kernel applied to the Olivetti faces problem, where we're considering faces with different orientation angles, and we're trying to predict the orientation angle for new faces. We can see, projecting what the model learns into two dimensions, that it discovers that faces with similar orientation angles are clustered together, are similar in some way. Here each face is given by a line segment, and the orientation angle is given by the slope of that line segment.

We can also see this by visualizing the learned covariance matrices that we get, ordering the faces by orientation angle. In the left two panels, where we're using deep kernels, we see a pronounced diagonal band, which shows that the model learns that faces with similar orientation angles should be more correlated. This is non-Euclidean metric learning. In a sense, we can describe what a neural net is doing as learning a flexible non-Euclidean similarity metric for the data. In this sense, training a neural net is a lot like learning a kernel.

In the far right panel we have the entries of the trained RBF kernel matrix, which we see are quite diffuse. Euclidean distance is just not a good proxy for similarity in this application. Looking at Euclidean distance between vectors of pixel intensities is not going to describe what we need to describe in order to solve this kind of representation learning problem very effectively.

In this example we're considering a discontinuous step function, and we see that a GP with a deep kernel, in green here (the green shows the 95% credible set), describes that data very effectively. A GP with a spectral mixture kernel, a different type of kernel, is shown in red. It still fits the data all right, but it's sort of oversmoothed. We see a GP with an RBF kernel fit in blue. From the perspective of a GP with an RBF kernel, even though this is a flexible model, this kind of data is a very unlikely draw from the prior: we would have to sample for a very, very long time to see anything like the step function. And so, since we can learn the noise here, the model is just saying that it's quite probable, if we believe the GP with the RBF kernel, that this data is basically just noise, and we'll have a very simple fit and a lot of uncertainty. So this example shows that Gaussian processes don't necessarily oversmooth the data or have trouble fitting discontinuous data. It totally depends on the kernel, and with the deep kernel we can learn this kind of discontinuous step function data.

Here we've applied a deep kernel structured with an LSTM network to an autonomous driving application, where decision making with predictive distributions is very important. We don't want to just know where lane boundaries might be; we want error bars on where these lane boundaries could be when making decisions. We can see in the bottom row here that the predictive distributions do a good job of capturing the ground truth, and the point predictions are also better than the point predictions that we get if we use the classic LSTM neural net.

Conventionally, Gaussian processes have suffered from computational constraints. We need to solve a linear system involving our $n \times n$ covariance matrix for $n$ training points, and also compute log determinants and derivatives of log determinants, which naively incurs an $O(n^3)$ computational cost, and exact Gaussian processes have often been intractable for problems with more than a few thousand points. Recently, by embracing advances in hardware, Krylov subspace methods have been developed that enable very scalable exact Gaussian processes through GPU acceleration and parallelization.

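To give a flavor of what a Krylov subspace method looks like, here is a textbook conjugate gradients solver for the linear system $K \alpha = y$. It touches $K$ only through matrix-vector products, which is the operation that parallelizes well on GPUs. This is a plain NumPy sketch under assumptions made here (toy data, no preconditioning, a dense kernel matrix); it is not the implementation used in any GP library, which would typically add preconditioning and batching.

```python
import numpy as np

def conjugate_gradients(matvec, b, tol=1e-8, max_iters=1000):
    """Solve A x = b for symmetric positive definite A, given only v -> A v."""
    x = np.zeros_like(b)
    r = b - matvec(x)              # residual
    p = r.copy()                   # search direction
    rs_old = r @ r
    for _ in range(max_iters):
        if np.sqrt(rs_old) < tol:
            break
        Ap = matvec(p)
        step = rs_old / (p @ Ap)
        x += step * p
        r -= step * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy GP regression system: RBF kernel matrix plus noise variance on the diagonal.
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 10.0, 200)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(200)
sq_dist = (x_train[:, None] - x_train[None, :]) ** 2
K = np.exp(-0.5 * sq_dist) + 0.1 * np.eye(200)

alpha = conjugate_gradients(lambda v: K @ v, y_train)
```

Each iteration costs one matrix-vector product, $O(n^2)$, and in practice far fewer than $n$ iterations are often enough, compared with the $O(n^3)$ cost of a Cholesky factorization.
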
In a sense, deep learning has progressed not only through advances in methodological design but also by building algorithms that can really benefit from hardware acceleration, by thinking about how to benefit from systems. GPyTorch, this package here, is really inspired by that approach, in order to scale Gaussian processes with exact inference to very large problems. When we do that, we can see that GPs with deep kernels often outperform standalone deep neural nets.

Also, we can see that when we scale an exact Gaussian process to large problems, we can really realize the benefit of a non-parametric representation. As we add more and more data points, we can see that the error decreases very substantially. With a non-parametric method that has an infinite number of basis functions, the capacity of the model scales automatically with the amount of available information, unlike a parametric model, which is entirely determined by a finite number of parameters.

Many approximate scalable inference techniques for Gaussian processes introduce low-rank kernel matrices. This would apply to inducing point methods and stochastic variational inducing point methods, as well as random feature expansions, which basically turn the model into a parametric method. We can see that on really large data sets, the gap between the scalable exact GPs and the approximate GPs is especially large. There are several other popular Gaussian process libraries, such as GPflow and GPy, as well as BoTorch for Bayesian optimization (that's black box optimization, automatic ML, hyperparameter tuning, A/B testing), and these libraries have different algorithmic foundations and use cases.

There's also a lot of additional work on combining Gaussian processes and neural nets in various different ways. For example, Gaussian process regression networks and deep Gaussian processes build hierarchical models, replacing neurons in neural nets with Gaussian processes.

Several other recent works have extended Radford Neal's limits to multi-layer neural nets and other architectures like convolutional neural nets. Those limits are also very related to neural tangent kernels, which have been a very hot topic in deep learning recently, where we take infinite neural net limits and derive kernels, which have recently achieved fairly promising empirical results. Now, most of these kernels from infinite neural net limits have fixed covariance structure, and we described how training a neural net is in many ways like learning a kernel: that's what's doing the representation learning. So bridging this gap, and developing these infinite limits so that we can also do kernel learning, is I think a very exciting direction for future work in this space.

Now, there are many ways to realize Bayesian principles in deep learning, in addition to the standard approach of marginalizing distributions over parameters in a neural net. In this part we've really been considering how we can use neural nets to provide inductive biases for Gaussian processes, for a Bayesian non-parametric approach to deep learning. In general, combining principles of Bayesian non-parametrics with deep learning is a very exciting area for future work.

That's the end of part two. In part three we'll be considering practical methods for Bayesian deep learning.

Hi, I'm Andrew Wilson at New York University, and welcome back to the ICML 2020 tutorial on Bayesian deep learning. This is part three: methods for practical Bayesian deep learning. In this part we'll be using several modern approaches for highly practical Bayesian deep learning to exemplify many of the foundational concepts that we've introduced in other parts of the tutorial.

While I've long been interested in Bayesian machine learning, I ironically became quite interested in Bayesian deep learning after listening to a talk about optimization. The irony is that I might consider Bayesian approaches as trying to avoid optimization at all costs: rather than use one setting of the parameters, use all settings of the parameters and weight them by their posterior probabilities.

The argument being made was that mini-batch SGD would converge to flatter regions of the loss than full-batch gradient methods, and therefore provide better generalization. In this figure, the loss is on the vertical axis, train in black and test in red, and there's a conceptualization of parameters on the horizontal axis. We can see that a flat solution still has reasonably good test loss, but a sharp solution has pretty bad test loss once there's this horizontal shift between train and test. This shift happens because our model won't be entirely specified by a finite amount of data, and in fact we would want the optimal solutions to change as we query our loss on different sets of data, even if they're from the same distribution. At the same time, the shape of the loss is unchanged, because we imagine that the loss is still a reasonable proxy for generalization. In actuality it will change a bit, but not necessarily that much.

Now, this was actually a really great argument for Bayesian integration, because in order for this to be true, the flat regions of the loss would have to have parameters corresponding to very different functions, which would provide complementary explanations of the data. Otherwise we could just contrive flatness, for example through re-parametrization, and it wouldn't really mean that much. And if that's the case, that we have all of these low-loss parameter settings that are providing complementary explanations of the data, then we really, really want to do that integral. We want to use all these parameters, and it's kind of arbitrary in a sense to say, let's just put everything on one hypothesis. We want to integrate a flipped version of this curve, and that's going to give a very different answer than just using one solution.

In particular, recall that the Bayesian model average, here given by this integral, says that we want to use a conditional predictive distribution given $w$, weighted by the posterior distribution $p(w \mid \mathcal{D})$, integrating over all possible values of $w$. That's why it's called marginalization: we're not depending on $w$ anymore, we've marginalized it out when we find this predictive distribution. The posterior $p(w \mid \mathcal{D})$ (often the loss is taken as the negative log posterior) is extremely complex for neural nets, containing many complementary solutions, which is why Bayesian model averaging is especially significant in deep learning, more significant for neural nets than for most other model classes.

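Written out, the Bayesian model average and the simple Monte Carlo approximation that is typically used for it take the following form. The symbols $S$ (number of posterior samples) and $w_s$ (an individual sample) are notation introduced here for illustration.

$$
p(y \mid x, \mathcal{D}) = \int p(y \mid x, w)\, p(w \mid \mathcal{D})\, dw
\;\approx\; \frac{1}{S} \sum_{s=1}^{S} p(y \mid x, w_s),
\qquad w_s \sim p(w \mid \mathcal{D}).
$$

Classical training corresponds to replacing the posterior with a point mass at a single setting of $w$, so the sum collapses to one term.
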
In order to come up with a really good approximation of this integral, however, we're going to need to carefully understand the structure of neural net loss landscapes.

Towards that end, we showed that if you trained a neural network twice and, across different architectures, got different solutions in different basins of attraction, you could actually find subspaces along which these different solutions were connected, meaning that you could walk from one point to another along a curve in this subspace and the training loss wouldn't change very much as we traverse this curve. This is a surprising result, because if we take a direct linear path between the different solutions found by training with SGD from different initializations, we often incur incredibly high loss, at almost 100% training error. In a sense, the intuition is that the points were isolated, but in fact there are subspaces along which they aren't. This actually means that we're not even finding local optima when we train these neural nets with SGD. We're just finding these basins of attraction, and there are some directions along which they're extremely flat, and we can walk from one solution to another. We can find these paths by minimizing the loss that we used for SGD training, like cross-entropy, in expectation uniformly over the curve. This corresponds to something like a line integral of cross-entropy normalized by arc length. In this visualization we have a two-dimensional slice through a very high-dimensional loss surface. Each point in this plane corresponds to an affine combination of three high-dimensional weight vectors, and the height, the color, is the value of the loss for that combination. These visualizations are in collaboration with Javier Ideami at losslandscape.com, a very cool website.

Here we have several other visualizations of mode connectivity. We can see even in the two-dimensional slices that the structure of the loss surface is extremely complex and multimodal. Why would we choose just one of these points?

So the next question to ask is whether we can actually recycle geometric information in the SGD trajectory for scalable posterior approximations centered on flat regions of the loss. This mode connectivity result is very significant in the sense that it empirically verifies this diagram that was used to motivate flat optima. It shows that in fact there are these extraordinarily flat regions of the loss surface, and in fact, as we'll see, they contain a variety of different solutions. We really want to do this integral, and we want to see now how we can practically learn something about the shape of the loss surface just through something that resembles basic SGD training.

We found that if we used a learning rate schedule that decayed to a relatively high constant learning rate, and took an equal average of the weights traversed by SGD with that high constant learning rate, we were spinning around a region of the loss surface that contained a bunch of different models which all had very low training loss. It would normally be very hard to penetrate inside that region, but when we're spinning around we can take an equal average of the iterates (it's very important that this is an equal average and a constant learning rate schedule) and move inside to a fairly centered region. The solutions from this procedure, called stochastic weight averaging (SWA), often lead to much better generalization without a lot of additional computational overhead. This is importantly different from, say, Polyak-Ruppert averaging, where often we're using exponential moving averages with decaying rates, and the idea is to get better convergence, say in convex optimization.

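The averaging step itself is very simple. The sketch below runs noisy gradient descent on a toy quadratic loss with a constant learning rate and keeps an equal running average of the iterates. The toy loss, the noise model, the step counts, and the burn-in phase standing in for conventional training are all assumptions made for illustration; in practice the average is typically updated once per epoch on a real network, and batch normalization statistics are recomputed for the averaged weights.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 20

def noisy_grad(w):
    # Gradient of the toy loss 0.5 * ||w||^2 plus minibatch-style noise.
    return w + rng.standard_normal(dim)

w = 3.0 * np.ones(dim)
lr = 0.1

# Burn-in phase, standing in for conventional training.
for _ in range(200):
    w = w - lr * noisy_grad(w)

# SWA phase: constant learning rate, equal running average of the iterates.
w_swa = np.zeros(dim)
n_averaged = 0
for _ in range(2000):
    w = w - lr * noisy_grad(w)
    w_swa = (w_swa * n_averaged + w) / (n_averaged + 1)
    n_averaged += 1

dist_last = np.linalg.norm(w)       # final SGD iterate, distance from the optimum at 0
dist_swa = np.linalg.norm(w_swa)    # averaged solution, distance from the optimum at 0
print(f"last iterate: {dist_last:.3f}   SWA average: {dist_swa:.3f}")
```

With a constant learning rate the individual iterates keep bouncing around the optimum, while their equal average sits much closer to the center of the region they explore.
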
Now, in addition to taking the mean of these estimates, we can also come up with a low-rank plus diagonal approximation to the second moment, the covariance matrix, and then use that to come up with a Gaussian approximate posterior for our parameters. We can then sample from it to get our simple Monte Carlo estimate of our predictive distribution. The whole method here, called SWAG (SWA-Gaussian, because it's a Gaussian approximate posterior using ideas motivated by SWA), can be written on just this one slide, and that's why we call it a simple baseline for Bayesian uncertainty in deep learning.

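A rough sketch of this kind of Gaussian approximation is given below. It fits a mean, a diagonal variance, and a low-rank deviation matrix from a set of iterates, then samples weights and averages predictions. Several things here are assumptions: the iterates are synthetic stand-ins for SGD iterates, the deviations are taken from the final mean (a simplification of using the running mean), the scaling of the two covariance terms follows the form given in the SWAG paper as I understand it, and the predictor is a toy logistic model. The function names are hypothetical.

```python
import numpy as np

def fit_swag(iterates, rank):
    """Gaussian moments from iterates (rows): mean, diagonal variance, deviations."""
    iterates = np.asarray(iterates)
    mean = iterates.mean(axis=0)
    second_moment = (iterates**2).mean(axis=0)
    diag_var = np.clip(second_moment - mean**2, 1e-12, None)
    deviations = (iterates[-rank:] - mean).T       # shape (dim, rank), low-rank part
    return mean, diag_var, deviations

def sample_swag(rng, mean, diag_var, deviations):
    """Draw one weight vector from the low-rank plus diagonal Gaussian."""
    dim, rank = deviations.shape
    z1 = rng.standard_normal(dim)
    z2 = rng.standard_normal(rank)
    return (mean
            + np.sqrt(diag_var) * z1 / np.sqrt(2.0)
            + deviations @ z2 / np.sqrt(2.0 * (rank - 1)))

rng = np.random.default_rng(0)
iterates = 1.0 + 0.1 * rng.standard_normal((50, 5))   # stand-in for SGD iterates
mean, diag_var, deviations = fit_swag(iterates, rank=10)

# Simple Monte Carlo model average for a toy predictor sigmoid(w . x).
weight_samples = np.stack([sample_swag(rng, mean, diag_var, deviations)
                           for _ in range(30)])
x_new = np.ones(5)
probs = 1.0 / (1.0 + np.exp(-(weight_samples @ x_new)))
bma_prediction = probs.mean()
```

The only extra state kept during training is the running first and second moments and the last few deviation vectors, which is why the overhead on top of SGD is small.
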
Now, there's theory suggesting that if we run SGD with modified learning rate schedules, asymptotically we're sampling from a Gaussian distribution which is a reasonable description of the posterior. However, this theory has assumptions which are violated in deep learning. So as a sanity check, we also visualized the loss surface in the subspaces spanned by the principal components of the SGD iterates with these modified learning rate schedules, and showed that the SWAG procedure came up with a pretty good posterior approximation of the loss surface in these subspaces.

Here we're considering the calibration of various models. On the vertical axis of each of these plots we have confidence minus accuracy. Confidence corresponds to the highest softmax output of our convnet, and accuracy is the accuracy of using the associated class label. So a horizontal curve at zero corresponds to a perfectly calibrated model, a curve above that horizontal line corresponds to overconfidence, and a curve below it corresponds to underconfidence. On the horizontal axis we just have bins for different confidence levels.

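
The quantity plotted on the vertical axis can be sketched as follows. This is a minimal illustration assuming equal-width confidence bins; the function name and the binning scheme are assumptions here, and the plots in the talk may use a different binning.

```python
import numpy as np

def confidence_minus_accuracy(probs, labels, n_bins=10):
    """Per-bin (mean confidence - accuracy); NaN for empty bins.

    probs: (N, C) softmax outputs, labels: (N,) integer class labels.
    Positive values indicate overconfidence, negative values underconfidence.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    confidence = probs.max(axis=1)                       # highest softmax output
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidence, edges[1:-1])       # values in 0 .. n_bins - 1
    gaps = np.full(n_bins, np.nan)
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gaps[b] = confidence[mask].mean() - correct[mask].mean()
    return gaps
```
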
Green is SGD training. We see that it's almost always overconfident, and in a way that isn't surprising, because those models are ignoring epistemic uncertainty. In classical training we're betting everything on a single model, whether or not it's regularized, and this means that we're ignoring all the other possible functions that are consistent with our observations. So of course, in a sense, we're overconfident. If we try to account for epistemic uncertainty through model averaging, as we get through SWAG for example, then we see that in many cases we end up with quite good calibration.

We're also considering a number of other approaches. SWAG here is in blue, there's a dropout-like approach in gold, KFAC-Laplace is in pink, and a temperature scaling procedure is in brown. Notably, in the bottom row we have ImageNet results with huge neural nets like a DenseNet-161 or a ResNet-152, so this is a very scalable procedure: it can be applied virtually wherever you would do standard classical SGD training.

In the top right we have a transfer learning problem, which is important because a lot of non-Bayesian alternatives require a validation set that's really representative of the target distribution, and that's hard if we have covariate shift. One such alternative is temperature scaling, a special case of Platt scaling, where we just scale our logits by a particular parameter just before we pass them through the softmax. So capturing epistemic uncertainty is going to be especially robust in these kinds of settings compared to the alternatives. But still, we see that all the different approaches here are overconfident to some extent.

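
To make the temperature scaling baseline concrete, here is a minimal sketch that picks a single temperature by minimizing negative log likelihood on a validation set. The grid search and the function names are assumptions of this sketch (gradient-based fitting is also common); it illustrates why the result depends on the validation set being representative of the target distribution.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def nll(logits, labels, temperature):
    """Mean negative log likelihood after dividing the logits by `temperature`."""
    probs = softmax(logits / temperature)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.25, 5.0, 96)):
    """Return the grid temperature with the lowest validation NLL."""
    losses = [nll(val_logits, val_labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])
```

A fitted temperature above 1 softens the softmax outputs, which is the typical correction for an overconfident network.
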
We can very naturally visualize epistemic uncertainty in regression problems. Here we see that with the SWAG method, as we move away from the data points we get a wider predictive distribution. This is because there are many different types of curves that are consistent with our observations away from the data, whereas towards the data they become constrained. That's really what epistemic uncertainty is trying to capture, compared to, say, aleatoric uncertainty, which would correspond in this case to noise in the measurements themselves.

On the other hand, we see that with full-space VI, using a standard variational approach for approximate marginalization of neural net parameters, we get a reasonably good mean fit for point predictions, but the uncertainty is pretty homogeneous and doesn't grow that much as we move away from the data. So it's not doing a great job of representing epistemic uncertainty.

In short, this SWA-Gaussian (SWAG) procedure provides a simple and scalable method for Bayesian deep learning. The idea is just to fit the SGD iterates to a low-rank plus diagonal Gaussian distribution. We can capture geometric properties of the posterior in the subspace of SGD, and we can improve predictions and uncertainty at the ImageNet scale.

Now, in order to motivate this SWAG procedure, we created a visualization of the subspace spanned by the principal components of the SGD iterates, and this led to the question of whether it might make sense just to try to do Bayesian marginalization in a very low-dimensional subspace directly. So here we're saying: let's construct a subspace of a network that has a very high-dimensional parameter space, perform inference in that subspace, sample from the approximate posterior, and then use those samples for Bayesian model averaging. What we found is that we could approximate the posterior of a WideResNet with 36 million parameters in a 5D subspace and achieve state-of-the-art results.

The contention here is that even though the parameter space is very high dimensional, a lot of the functional variability can be captured in a very low-dimensional subspace. Once we've reduced the integral that we're trying to compute to a very low-dimensional integral, it's going to be a lot easier to estimate. A lot of the challenges with Bayesian neural nets are around the fact that we typically have to compute this very high-dimensional integral.

In particular, in this approach we collected all the weights of the neural net into a vector $w$ and said they were equal to an offset, like the SWA solution, plus a linear projection $P$ of a very low-dimensional vector $z$. Then we did inference in $z$ space. We could even use slice sampling if we're in a low enough dimensional space, which is a really great MCMC method but suffers from a curse of dimensionality. Then we can use this equation to go back into $w$ space and do the Bayesian model average.

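
The reparameterization described here can be sketched as follows, writing $\hat{w}$ for the offset, $d$ for the number of weights, and $k$ for the subspace dimension (these symbols are assumptions chosen for illustration):

$$
w = \hat{w} + P z, \qquad w, \hat{w} \in \mathbb{R}^{d}, \quad P \in \mathbb{R}^{d \times k}, \quad z \in \mathbb{R}^{k}, \quad k \ll d .
$$

Given approximate posterior samples $z_1, \dots, z_J$ in the subspace, the Bayesian model average would then be estimated by mapping each sample back to weight space:

$$
p(y \mid x, \mathcal{D}) \approx \frac{1}{J} \sum_{j=1}^{J} p\big(y \mid x,\, \hat{w} + P z_j\big), \qquad z_j \sim p(z \mid \mathcal{D}) .
$$
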
Let's first consider the mode-connecting subspace. In the right panel we're traversing through parameter space in this mode-connecting subspace, and in the left panel we're looking at the corresponding functions in purple. We can see first that there's a lot of functional variability in this very flat region of the posterior. If I were to ask you which curve you preferred, this purple curve or that purple curve, you'd probably have a hard time saying. This is why it's quite arbitrary just to bet everything on one solution: we want to consider all of these solutions in forming our predictive distribution. When we do that, we see we get a good representation of epistemic uncertainty. The spread of the predictive distribution, here showing the 95% credible set, increases as we move away from the data and decreases as we move towards the data.

Here we consider two other subspaces, the random subspace and the PCA subspace. The random subspace is what we get when we just use independent samples from a standard Gaussian distribution for our projection matrix. We can see that the curve here actually fits the data pretty well in terms of point predictions, but the uncertainty is very homogeneous, so we're not doing a good job of representing epistemic uncertainty. In the bottom row we see the subspaces in parameter space: each point is a different configuration of parameters, and the colors correspond to the values of the loss at those points.

In the middle column we have the PCA subspace, which is what we get when the projection matrix is found by looking at the principal components of the SGD iterates. In particular, as we're traversing the loss surface with this high constant learning rate schedule, we can store a matrix which has as its columns the weights that we're passing through minus the average of those weights. Then we can do a partial SVD, and that gives us our projection matrix. It's super efficient, it's just a single computation. Once we've got that subspace we perform inference, and we can see that we get a pretty good predictive distribution, which is doing an intuitively good job of representing epistemic uncertainty.

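
The PCA subspace construction can be sketched in a few lines of NumPy. This is an illustration under assumptions: the function names are hypothetical, and a full thin SVD stands in for the partial SVD mentioned in the talk, which would be preferable for networks with millions of weights.

```python
import numpy as np

def pca_subspace(iterates, k):
    """Build an offset and a rank-k projection matrix from SGD iterates.

    iterates: (T, d) array of weight vectors collected along the SGD path.
    Returns the mean of the iterates (shape (d,)) and P (shape (d, k)).
    """
    iterates = np.asarray(iterates, dtype=float)
    w_hat = iterates.mean(axis=0)                  # offset: average of the iterates
    deviations = (iterates - w_hat).T              # columns are weights minus average
    u, _, _ = np.linalg.svd(deviations, full_matrices=False)
    P = u[:, :k]                                   # top-k principal directions
    return w_hat, P

def to_weights(w_hat, P, z):
    """Map a low-dimensional vector z back to weight space."""
    return w_hat + P @ z
```

Inference (for example slice sampling) would then run over `z` only, calling `to_weights` to evaluate the network for each candidate `z`.
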
Now, in part one I emphasized that because there's all this functional variability inside the loss surface of a neural net, we're really going to benefit not just in terms of uncertainty representation, which can be measured by, say, negative log likelihood (NLL), but also in terms of the accuracy of our point predictions. We can really realize that here with the subspace approach. Here we have a ResNet on CIFAR-100. It gets 78.5% accuracy with classical SGD training, 80.17% with the random projection, 80.5% with the PCA subspace, and 81.28% with the curve subspace. These are really non-trivial gains in accuracy, and the PCA subspace, for example, is not really much more expensive than classical training. So we can empirically realize the benefits of Bayesian marginalization not just in terms of uncertainty representation measured by NLL, which we see is also decreasing, but also in terms of accuracy.

Historically, the Laplace approximation has been very important in performing inference with Bayesian neural nets. David MacKay, who did a lot of the first work in this space, was considering Laplace approximations, and recently they've come back with Kronecker factorizations. The basic idea is that we approximate the posterior with a Gaussian whose parameters $\theta$ are its mean $\mu$ and its covariance matrix $A^{-1}$. These parameters are determined by a second-order Taylor approximation of the log of the (unnormalized) true posterior $p(w \mid \mathcal{D})$. When we do that Taylor expansion, we see that we set the mean equal to the MAP setting of the parameters, $w_{\text{MAP}}$, which is what we get when we maximize the posterior with respect to $w$, and we set $A$ equal to the negative Hessian of the log posterior evaluated at $w_{\text{MAP}}$.

For a high-dimensional parameter space, for example if $w$ is 10 million dimensional, this $A$ matrix is way too large to store. So typically $A$ is taken to be diagonal, as in David MacKay's work, or more recently it has been expressed as a Kronecker product of much smaller matrices, which is more expressive than the diagonal approximation and leads to better results, but is still scalable.

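
In symbols, the Laplace approximation described above can be sketched as follows (the notation $q$ and $w_{\text{MAP}}$ is assumed here rather than taken from the slides):

$$
q(w) = \mathcal{N}\big(w;\, \mu,\, A^{-1}\big), \qquad \mu = w_{\text{MAP}} = \arg\max_{w} \log p(w \mid \mathcal{D}), \qquad A = -\nabla_w \nabla_w \log p(w \mid \mathcal{D}) \,\Big|_{w = w_{\text{MAP}}} .
$$

This follows from the second-order expansion

$$
\log p(w \mid \mathcal{D}) \approx \log p(w_{\text{MAP}} \mid \mathcal{D}) - \tfrac{1}{2} (w - w_{\text{MAP}})^{\top} A \,(w - w_{\text{MAP}}),
$$

where the first-order term vanishes because the gradient of the log posterior is zero at $w_{\text{MAP}}$. The normalizing constant of the posterior does not depend on $w$, so it does not affect $\mu$ or $A$.
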
Once we have our Gaussian approximate posterior, we can sample from it to form a simple Monte Carlo estimate of our unconditional predictive distribution.

Overall, the Laplace approach is compelling because it's simple and can be relatively efficient. Perhaps as a drawback, it's constrained to unimodal approximations, since it's a Gaussian approximate posterior, and it's fairly local in its description of the loss, even for a Gaussian. The curvature is entirely defined by the Hessian evaluated at the MAP setting of the parameters. This means that if, for example, there's a little kink in the loss surface, the approximation will be very concentrated around a very small region, giving a very compact representation of the posterior. SWAG, by contrast, even though it also provides a Gaussian approximation, is more global, because we have SGD with this fairly high constant learning rate bouncing around within some basin of attraction. So if there are little kinks, we'll still get a fairly global Gaussian approximation that won't be trapped in those regions, at least for that basin.

MC dropout is another very popular approach for Bayesian neural nets, and it really catalyzed a lot of renewed interest in Bayesian deep learning in 2016. The idea is to run dropout during both train and test. We randomly drop out each hidden unit with probability $r$ at each input, and this creates a mask. In regression, each network can be trained to output both a mean $\mu$ and a variance $\sigma^2$ by maximizing a Gaussian likelihood. We then create an equally weighted ensemble of the corresponding subnetworks, each with a different dropout mask, at test time. This is a very compelling approach: it's very easy to apply, it's very scalable, and it's had great empirical results representing epistemic uncertainty on a number of problems.

We note that in this case the ensemble doesn't collapse as we get more data, unlike a standard Bayesian model average. In part one we considered how Bayesian model averaging isn't a form of model combination, in the sense that it's just representing a statistical inability to distinguish between hypotheses given limited information. Here, by contrast, we're still going to be sampling from this Bernoulli distribution to get the dropout mask, so this isn't something that collapses as we get more data. Figuring out how to modify it so that it does would be an interesting direction for future work.

Another historically important approach for Bayesian deep learning is called Bayes by Backprop. This is where we introduce a (typically Gaussian) approximate posterior $q$ for our parameters, and then learn the parameters of $q$ using a variational method which minimizes the KL divergence between our approximate posterior and the true posterior. Even though we can't integrate with respect to the true posterior, we can usually write down the unnormalized true posterior exactly: we just multiply our likelihood and our prior. We can then manipulate the KL divergence to get an evidence lower bound, which can be optimized with SGD and backpropagation. Even though in principle we can choose different distributions for $q$, it is often a lot easier to use Gaussians.

Stochastic MCMC is a very up-and-coming direction for Bayesian deep learning. Radford Neal's MCMC methods in the mid 90s were achieving state-of-the-art results on all sorts of interesting problems, but these standard Hamiltonian Monte Carlo methods wouldn't really scale to big architectures. Stochastic gradient Langevin dynamics and stochastic gradient HMC, on the other hand, are very broadly applicable. Algorithmically they resemble noisy SGD pretty closely, where $U(w)$ here is the (negative) log posterior, so we can apply these wherever we apply SGD, and often achieve better results through the Bayesian model average.

Now, when we're doing optimization, even though we explore the loss surface to try to find a good solution, in the end all we care about in principle is finding a solution with low loss. Whereas when we're sampling, and when we're doing Bayesian model averaging more generally, we really care about exploring this surface, so in this case the learning rate schedule is particularly crucial for good performance. It's recently been found that a cyclical learning rate schedule can greatly enhance the ability of the stochastic MCMC procedure to explore complex multimodal loss surfaces and achieve much better practical performance. Here we increase the learning rate and we're exploring, and then we decrease it and explore within a mode, and use those samples.

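
To show how closely stochastic gradient Langevin dynamics resembles noisy SGD, here is a minimal sketch of one update together with a cosine-shaped cyclical step size. The function names and the cosine form of the cycle are assumptions of this sketch; `grad_U` is taken to be the gradient of the negative log posterior, which in practice would be a rescaled minibatch estimate.

```python
import numpy as np

def sgld_step(w, grad_U, step_size, rng):
    """One SGLD update: a gradient step on U plus Gaussian noise.

    grad_U(w) is assumed to return the gradient of the negative log posterior.
    """
    noise = rng.standard_normal(w.shape)
    return w - 0.5 * step_size * grad_U(w) + np.sqrt(step_size) * noise

def cyclical_step_size(t, cycle_length, max_step):
    """Cosine cycle: large steps at the start of each cycle, small at the end."""
    position = (t % cycle_length) / cycle_length
    return 0.5 * max_step * (np.cos(np.pi * position) + 1.0)
```

With this kind of schedule, the large steps at the start of a cycle move the chain to a new region, and samples would typically be collected during the small-step portion of each cycle.
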
Deep ensembles have recently gained a lot of popularity and achieved a lot of success. The idea is to specify a neural net architecture and retrain the neural net a bunch of times to get a bunch of different SGD solutions, typically starting from different initializations, so that we get a bunch of different weights. Again, in regression each model can be specified to output a mean $\mu$ and a variance $\sigma^2$, trained with, say, a Gaussian likelihood. Once we have all these different networks, corresponding to the different weights that we found through SGD retraining, we take an equally weighted ensemble.

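
For the regression case, one common way to summarize an equally weighted ensemble of Gaussian predictions is to match the mean and variance of the resulting mixture. The sketch below assumes that summary (the talk does not specify how the ensemble is summarized), and the function name is hypothetical.

```python
import numpy as np

def ensemble_predictive(means, variances):
    """Mean and variance of an equally weighted mixture of Gaussians.

    means, variances: (M, N) arrays, one row per ensemble member and one
    column per test input.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    mean = means.mean(axis=0)
    # Total variance = average member variance + spread of the member means.
    variance = (variances + means ** 2).mean(axis=0) - mean ** 2
    return mean, variance
```

The second term in the variance is where disagreement between ensemble members shows up as additional predictive uncertainty.
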
You might be wondering why deep ensembles appear in a section about practical methods for Bayesian neural nets. Are deep ensembles Bayesian? In fact, aren't they often explicitly treated as a competing approach to Bayesian methods? In the next part, part 4, we'll argue that deep ensembles in fact provide a better approximation to Bayesian model averaging than many of the methods we've described so far, which provide single-basin marginalization. We'll also introduce an approach, MultiSWAG, which generalizes deep ensembles for even more accurate Bayesian model averaging.

We'll note briefly now, though, that the average here, unlike with MC dropout, actually does collapse in the same way as a Bayesian model average. We're not enriching the hypothesis space, we're just using the same architecture retrained a bunch of times to find different maximum likelihood or MAP solutions, so as the likelihood collapses, so will the Bayesian model average, and so will this ensemble. Also, by looking at different basins of attraction we're capturing functional variability, which intuitively is going to be very important for estimating the Bayesian predictive distribution. When we're doing marginalization we're trying to estimate that integral, and given a finite amount of computation, being able to spread where we query the loss surface across different basins of attraction is going to be very important for coming up with a good approximation to that integral. So in a sense, deep ensembles do very well at approximating the Bayesian predictive distribution given a finite amount of resources.

Next we'll have part 4, on Bayesian model construction and generalization.

Hi, I'm Andrew Wilson at New York University, and welcome back to the ICML 2020 tutorial on Bayesian deep learning. This is part 4, on Bayesian model construction and generalization. In this part we'll consider deep ensembles and their connection with Bayesian marginalization. We'll use this connection to describe MultiSWAG, a procedure which marginalizes within multiple basins of attraction. We'll also consider tempering, prior specification, rethinking generalization, double descent, and width-depth trade-offs.

Here we'll return to the function space perspective that we considered in depth in part two. We have a straight-line function with a distribution over its parameters $w_0$ and $w_1$. We can visualize the induced distribution over functions by sampling from this distribution over parameters and looking at the different straight lines we get, with different slopes and intercepts. The gray shade here corresponds to a 95% credible set, and the solid blue line corresponds to the expectation of this distribution over functions.

In this diagram we have a conceptualization of all possible data sets on the horizontal axis, and the marginal likelihood, or evidence, on the vertical axis, expressed in equation 74. The marginal likelihood is the probability that we would generate a data set if we were to randomly sample from the distribution over the parameters of our model. We can see in equation 74 that the marginal likelihood is what we get when we integrate away these parameters.

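
Written out in standard notation, the marginal likelihood takes the following form. The symbols $\mathcal{D}$ for the data set, $\mathcal{M}$ for the model, and $w$ for its parameters are assumptions here, since the slide itself is not shown:

$$
p(\mathcal{D} \mid \mathcal{M}) = \int p(\mathcal{D} \mid w, \mathcal{M})\, p(w \mid \mathcal{M})\, dw .
$$

Because this is a probability density over data sets, it must integrate to one over all possible $\mathcal{D}$, which is what forces a trade-off between how many data sets a model supports and how much mass it can give each one.
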
In the example from the last slide we're just generating straight lines with different slopes and intercepts, so we're not able to generate many data sets at all. But because the marginal likelihood is a proper normalizable probability density, it'll have to assign a lot of mass to those data sets.

Alternatively, we can consider a model like a huge multi-layer perceptron with a broad distribution over its parameters, which can generate lots of different data sets, but each with not very much probability. We could alternatively consider a model like a convolutional neural net, which is very flexible, so it can generate all sorts of different data sets, but has very particular inductive biases like translation equivariance, this idea that if we translate an image its class label remains unchanged. So it will give a lot of mass to structured image data sets.

In order for a model to generalize, it needs to have large support, meaning it needs to be very flexible, and it should also have good inductive biases, a good distribution of that support. Models like convolutional nets can generalize very effectively on image recognition problems because they support all sorts of different data sets in this application domain, but have a good distribution of support through biases like translation equivariance. We should be very careful not to conflate flexibility and complexity. As we described in part two of the talk, we can have models like Gaussian processes which are extraordinarily flexible, even universal approximators, but in a sense have very simple inductive biases and can generalize very well on problems with even a small number of data points.

In this figure, in addition to the panel that we considered on the last slide, we look at what happens as these three models are exposed to a given data set. We can see that the model with large support and good inductive biases both contains the ground truth description and is able to contract efficiently around it. The model with truncated support contracts quickly, since it's quickly constrained by the available data, but it contracts around an erroneous solution. The model with large support but poor inductive biases, which spreads its support too thinly across too many data sets, contains the truth but doesn't contract very efficiently.

At the end of the last part, part three, we were considering deep ensembles, and we started to hint at their potential connection with Bayesian marginalization. Recall that the predictive distribution we want to compute is this integral of our conditional predictive distribution given parameters, weighted by the posterior probabilities of the parameters given data. So, in words, we want to consider every possible setting of the parameters and weight them by their posterior probabilities.

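
In symbols, with $x$ a test input, $y$ its output, and $\mathcal{D}$ the data (notation assumed here), the Bayesian model average and its simple Monte Carlo estimate are:

$$
p(y \mid x, \mathcal{D}) = \int p(y \mid x, w)\, p(w \mid \mathcal{D})\, dw \;\approx\; \frac{1}{J} \sum_{j=1}^{J} p(y \mid x, w_j), \qquad w_j \sim p(w \mid \mathcal{D}) .
$$

Classical training corresponds to replacing the posterior with a point mass at a single setting $\hat{w}$, which gives $p(y \mid x, \hat{w})$.
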
Now, this Bayesian approach and the classical approach will be similar when the posterior is very concentrated around a small number of parameter settings, or when the conditional predictive distribution $p(y \mid w)$ is not varying a lot where the posterior distribution $p(w \mid \mathcal{D})$ has most of its mass.

In the top panel we have a conceptualization of the posterior distribution for a neural net: multimodal, with lots of global optima. In the second row we have the conditional distribution. With a deep neural net there isn't a lot of functional variability within a single basin of attraction compared to between different basins of attraction. We can see that conceptualized in this second row, where $p(y \mid w)$ doesn't change very much within a basin but changes a lot between the basins.

In the bottom row we're thinking about estimating this integral as an active learning problem. We've observed a single solution, say an SGD solution, in one of the basins, and we're seeing where we need to move in weight space to decrease the distance between our approximate predictive distribution and the exact predictive distribution, the exact answer to this integral. We can see that we would benefit a lot more by moving to a new basin than by continuing to query points within the same basin.

Indeed, a lot of approaches to Bayesian marginalization focus their efforts on just a single basin, using Gaussian approximate posteriors. But in deep learning, if we view this integration problem as really an active learning problem rather than through the lens of simple Monte Carlo, then it's very advantageous, given finite computational resources, to do something like deep ensembles and just select points from different basins. We'll get a better approximation to this integral, and in fact a more Bayesian approach than the Bayesian approaches which marginalize within a single basin. In practice, in the real world, we're always handed computational constraints and we try to do our best given those constraints, and deep ensembles are actually a very good heuristic for achieving a good approximation to this integral under those constraints.

Here we empirically test how close the deep ensembles predictive distribution is to a near-exact predictive distribution, which we can see in panel (a). In panel (b) we have deep ensembles, and we can see visually that it's fairly similar to the exact distribution. In panel (c) we have a variational procedure doing single-basin marginalization with a Gaussian approximate posterior. We can see that it doesn't capture epistemic uncertainty so well: it doesn't have a lot of uncertainty between different clusters of data points, and the uncertainty certainly doesn't grow very consistently away from the data.

In the bottom right panel we look at what happens to the distance between the exact predictive distribution and the approximate predictive distributions given by both deep ensembles and stochastic variational inference, as we increase the number of samples we have access to. We can see that this distance decreases very quickly for deep ensembles, but hardly at all as we increase the number of samples from the variational procedure. So continuing to draw samples within a single basin with this variational approach is not really improving our estimate of the Bayesian model average corresponding to the integral that we want to compute, but training more models in the deep ensemble is pretty dramatically decreasing the distance between our approximation and the exact answer.

In addition to just selecting different points in different basins, we can also try to marginalize within basins of attraction. We do that in a procedure called MultiSWAG, where we train multiple independent SWAG models, the method discussed in part three, to create a mixture-of-Gaussians approximation to the posterior. So we're marginalizing within multiple basins.

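
Sampling from such a mixture can be sketched as follows. For brevity this illustration assumes each SWAG component is summarized by a mean and a diagonal variance only (dropping the low-rank term), with equal mixture weights; the function name is hypothetical.

```python
import numpy as np

def sample_multiswag(components, n_samples, rng):
    """Sample weights from an equally weighted mixture of diagonal Gaussians.

    components: list of (mean, diag_var) pairs, one per independent SWAG run.
    Returns an array of shape (n_samples, d).
    """
    samples = []
    for _ in range(n_samples):
        # Pick a basin uniformly at random, then sample within that basin.
        mean, diag_var = components[rng.integers(len(components))]
        samples.append(mean + np.sqrt(diag_var) * rng.standard_normal(mean.shape))
    return np.stack(samples)
```

Averaging the network's predictions over these samples marginalizes both across basins (the choice of component) and within each basin (the Gaussian around each SWA solution).
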
When we apply MultiSWAG to several of the applications in reference 1, on evaluating predictive uncertainty under dataset shift, which showed that deep ensembles often outperform single-basin marginalization procedures, we see two key trends. One is that MultiSWAG tends to significantly outperform deep ensembles in cases where we're not training very many independent models, or when there's a lot of data corruption. In purple we have an additional model, MultiSWA, which takes the means of the different Gaussian components in MultiSWAG. Those will be in flatter regions of the different basins of attraction than, say, SGD solutions, and so we see a nice compromise here.

Recently a phenomenon called double descent has been of great interest. Belkin et al., a couple of years ago, showed that we could find these so-called double descent curves on a wide range of problems. We have our loss on the vertical axis, and as we increase model flexibility, initially the loss decreases, then it increases, corresponding to overfitting, and then it turns out it decreases again. The first regime is called the classical regime, which is in line with classical intuitions about generalization, and the next regime is referred to as the modern interpolating regime. The training loss just keeps going down, while the test loss has this really non-monotonic behavior. The question is: why don't we just keep overfitting? Why does generalization suddenly start getting better, especially once we move past the point where the training loss is zero?

I'd like to ask you, and feel free to pause the video for a moment, whether you think a Bayesian model should experience double descent, given what we've discussed so far. In considering this problem, especially think back to that first example at the beginning of the talk, where we were considering airline passenger numbers and which model we would want to fit that data with using a Bayesian approach.

In that airline passenger example I was saying that if we have a reasonable prior and we're doing exhaustive marginalization, then we would expect a Bayesian method to improve monotonically with increases in flexibility. We should embrace flexibility and large support. Indeed, that is actually what we see: if we apply this MultiSWAG procedure, performance is essentially monotonic, whereas SGD, in this example where we have CIFAR-100 with 20% label corruption, has a prominent double descent behavior. SWAG, where we're doing single-basin marginalization, has a less prominent double descent, but it's still clearly there.

Another really important feature to note in this plot is that MultiSWAG is actually just a lot more accurate than classical training. If we have a ResNet with layers of width 20, we see just below 30% test error on CIFAR-100 with MultiSWAG, whereas with SGD we see about 45% test error. So it's a really massive discrepancy, and an empirical realization of this idea that if we're doing multimodal marginalization with a neural net, which can capture all sorts of complementary explanations of the data, then we're really going to benefit a lot empirically, not just in terms of uncertainty representation but also in terms of accuracy.

Now we'll consider the prior in function space that's induced by having a Gaussian prior in weight space, where we're just varying the signal variance of the prior for each of the parameters, $\alpha^2$. There are two key results that suggest this actually induces a pretty reasonable prior in function space. The first is the deep image prior, which showed that randomly initialized convnets without training provide excellent performance for image denoising, super-resolution, and inpainting. In other words, a sample function from this induced distribution over functions captures low-level image statistics before any training. Another result, from the paper by Zhang et al. on rethinking generalization, shows that if we pre-process CIFAR-10 with a randomly initialized, untrained CNN, we get dramatically improved test performance relative to using a Gaussian kernel directly on pixels: it goes from 54% accuracy to 71% accuracy. We then get just another 2% from L2 regularization, which would be like tuning this $\alpha$ parameter.

So this is really saying that this induced distribution over functions is fairly reasonable. In a way it's not surprising, because the properties of the induced distribution over functions come largely from the functional form of the model. That's really a big part of the prior: translation equivariance, the similarity metric that we get for different images, and so on are going to come mostly from the functional form of the model.

Now, in Bayesian deep learning it's typical to consider what's called a tempered posterior, where we raise the likelihood to a power $1/T$, where $T$ is a temperature parameter. $T < 1$ corresponds to a cold posterior, where the posterior is more concentrated around solutions with high likelihood. $T = 1$ corresponds to the standard Bayesian posterior. $T > 1$ corresponds to warm posteriors, where the prior effect is stronger and posterior collapse is slower.

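
In symbols (the subscript notation $p_T$ is an assumption here), the tempered posterior described above is:

$$
p_T(w \mid \mathcal{D}) \;\propto\; p(\mathcal{D} \mid w)^{1/T}\, p(w) .
$$

Setting $T = 1$ recovers the standard posterior $p(w \mid \mathcal{D}) \propto p(\mathcal{D} \mid w)\, p(w)$. For $T < 1$ the exponent $1/T$ exceeds one, so the likelihood term is sharpened relative to the prior, and for $T > 1$ it is flattened.
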
In a paper at this ICML by Wenzel et al., a highlighted result is that with a standard Gaussian prior over parameters, cold posteriors often provide improved performance, whereas if we just use a temperature of $T = 1$ we can get even worse performance than classical training. In this paper they suggest the result is due to prior misspecification, and show that sample functions from the induced distribution over functions with a standard normal prior seem to assign one label to most of the data on CIFAR-10, even though we know the classes are balanced. Here we just sample parameters from the standard normal to get a function from our neural net, and we see that it's giving most of the data points class 6, while another sample function is giving most of the data points class 9.

we examine this behavior as we vary the
scale
of the prior variance on the parameters
so for with for alpha equals root 10
we reproduce this results and we see
that one sample function is assigning
most of the data one class
another sample function was the date of
the other class however if we reduce
alpha we see that the samples are very
quickly
assigning about the same amount of data
at each of the classes
and in fact of course in practice we
would specify alpha through cross
validation or through
uh for example just you know what we
normally use for l2 regularization which
is roughly near 0.1 in this case
We also see that even for the misspecified $\alpha$, the unconditional predictive distribution is actually quite reasonable: it's basically uniform over the different classes. So this prior bias doesn't hurt the predictions.
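
The distinction between individual prior samples and the unconditional predictive distribution can be sketched on a toy problem. The code below is only an illustration under stated assumptions: a small one-hidden-layer ReLU network on random stand-in inputs, with all weights and biases drawn from $\mathcal{N}(0, \alpha^2)$. It does not reproduce the ResNet and CIFAR-10 experiment, and the function names are hypothetical. How strongly individual samples concentrate on one class depends on the architecture and the inputs; what the sketch checks is that the average over prior samples is close to uniform.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def prior_predictive_summary(alpha, n_samples=2000, n_inputs=200, dim=20,
                             hidden=64, n_classes=10, seed=0):
    """Summarize prior samples of a toy MLP with N(0, alpha^2) parameters.

    Returns (a) the average, over prior samples, of the fraction of inputs
    assigned to each sample's most popular class, and (b) the predictive
    distribution averaged over prior samples and inputs.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_inputs, dim))  # stand-in inputs, not real images
    top_fractions = []
    mean_probs = np.zeros(n_classes)
    for _ in range(n_samples):
        W1 = rng.normal(0.0, alpha, size=(dim, hidden))
        b1 = rng.normal(0.0, alpha, size=hidden)
        W2 = rng.normal(0.0, alpha, size=(hidden, n_classes))
        b2 = rng.normal(0.0, alpha, size=n_classes)
        logits = np.maximum(x @ W1 + b1, 0.0) @ W2 + b2
        probs = softmax(logits)
        counts = np.bincount(probs.argmax(axis=1), minlength=n_classes)
        top_fractions.append(counts.max() / n_inputs)
        mean_probs += probs.mean(axis=0)
    return float(np.mean(top_fractions)), mean_probs / n_samples

for alpha in (np.sqrt(10.0), 0.1):
    top_fraction, mean_probs = prior_predictive_summary(alpha)
    print(f"alpha={alpha:.3f}: mean top-class fraction={top_fraction:.2f}, "
          f"averaged predictive={np.round(mean_probs, 3)}")
```
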

We can also examine the effect of data on the posterior. Here we start with the misspecified prior, where sample functions are assigning most of the data to one class. We then observe just ten data points with this huge ResNet, and we see that already the samples are giving almost uniform class assignments across the different data points. With 100 data points they're even closer to uniform. So this is a prior bias that is very quickly modulated by data. We can imagine, with a Gaussian process for instance, that if we multiplied our kernel by some factor the amplitude would be off, but if we observed just a few data points the posterior would quickly collapse in that region. So this is the kind of prior bias which isn't really going to affect generalization that much in practice. What's much more important, for example, is the induced covariance function that we get in this distribution over functions. We saw this a bit in part two, where things like the covariance function really affected generalization.
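
The Gaussian process analogy above can be checked numerically. This is a minimal sketch assuming an RBF kernel, a small observation-noise variance, and five evenly spaced inputs; these choices and the function names are assumptions for the example. It compares the posterior standard deviation at the observed inputs for a well-scaled and a badly scaled kernel amplitude.

```python
import numpy as np

def rbf_kernel(xa, xb, amplitude=1.0, lengthscale=1.0):
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return amplitude**2 * np.exp(-0.5 * sq_dist / lengthscale**2)

def gp_posterior_std(x_train, x_test, amplitude, noise_var=0.01):
    """Posterior standard deviation of the latent GP function at x_test."""
    K = rbf_kernel(x_train, x_train, amplitude) + noise_var * np.eye(len(x_train))
    K_s = rbf_kernel(x_test, x_train, amplitude)
    prior_var = np.full(len(x_test), amplitude**2)
    # Posterior variance: k(x*, x*) - k(x*, X) K^{-1} k(X, x*)
    solved = np.linalg.solve(K, K_s.T)
    var = prior_var - np.sum(K_s * solved.T, axis=1)
    return np.sqrt(np.maximum(var, 0.0))

x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
x_far = np.array([50.0])
for amplitude in (1.0, 10.0):
    at_data = gp_posterior_std(x_train, x_train, amplitude)
    far_away = gp_posterior_std(x_train, x_far, amplitude)
    print(f"amplitude={amplitude}: max std at data={at_data.max():.4f}, "
          f"std far from data={far_away[0]:.4f}")
```

Far from the data the posterior keeps the prior amplitude, but at the observed inputs the posterior standard deviation is bounded by the noise level regardless of the amplitude, which is the sense in which this kind of prior bias is quickly modulated by data.
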

Here we have the induced correlation function for different classes of MNIST digits. We can see that the correlations generally decrease as we increase $\alpha$. We also see that, pretty consistently, images of the same class are most correlated, and classes that are visually similar are more correlated than other classes. This is a good sign. It's also suggestive, in addition to the deep image prior and the random network features, that we're inducing a prior that is actually pretty reasonable. And which covariance function you use in a Gaussian process, for example, is going to affect generalization much, much more than whether you have a signal variance parameter that's a bit misspecified.

So, some thoughts on tempering. I think it would be very surprising if $T = 1$ just happened to be the best setting of this hyperparameter. In fact, I think we should always be doing tempering: it's basically just an acknowledgement that our model is misspecified. That's not to say that we should just say, "Well, who cares about the model specification, we'll correct it with tempering." We should do our very best to specify our prior as honestly as we can. But it's still going to be misspecified, and we should be honest about that too, and try to correct it by learning the temperature. In fact, this isn't really too different from learning other properties of the likelihood, like noise.

Now, while the prior $p(f(x))$ is certainly misspecified in most cases in Bayesian deep learning, the result of assigning one class to most data is really a soft prior bias which (1) doesn't hurt the predictive distribution, (2) is easily corrected by appropriately setting the prior parameter variance $\alpha^2$, and (3) is quickly modulated by data. What's much more important is this induced covariance function over images. In addition to not tuning $\alpha$, the results in this cold posteriors paper could have been exacerbated by a lack of multimodal marginalization, which we've shown is extremely important for generalization.

It's also interesting to note that there are cases when a cold posterior, $T < 1$, will be helpful in coming up with estimates even if we believe the prior and the likelihood, if we only have access to a finite number of samples. We can imagine, for instance, estimating the mean of a standard normal distribution in high dimensions. In this case the samples will be very concentrated around a norm of $\sqrt{d}$, far from the mean, and so if we decrease the temperature we'll come up with a better estimate for the mean. I encourage you to try this: just sample from a high-dimensional distribution and look at a histogram of the norms.
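
The suggested experiment takes only a few lines. This is a minimal sketch assuming $d = 1000$ and a standard normal; the function name `sample_norms` and the choice to model the cold version as $\mathcal{N}(0, T I)$ (the standard normal density raised to the power $1/T$ and renormalized) are assumptions for the example.

```python
import numpy as np

def sample_norms(d, n_samples=2000, temperature=1.0, seed=0):
    """Euclidean norms of samples from N(0, temperature * I_d)."""
    rng = np.random.default_rng(seed)
    samples = np.sqrt(temperature) * rng.normal(size=(n_samples, d))
    return np.linalg.norm(samples, axis=1)

d = 1000
norms = sample_norms(d)
counts, edges = np.histogram(norms, bins=10)
print(f"sqrt(d) = {np.sqrt(d):.2f}")
print(f"mean norm = {norms.mean():.2f}, smallest norm = {norms.min():.2f}")
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"[{left:6.2f}, {right:6.2f}): {count}")

cold_norms = sample_norms(d, temperature=0.1)
print(f"mean norm at T=0.1: {cold_norms.mean():.2f}")
```

Even though the density is highest at the origin, essentially no sample lands near it: the norms cluster tightly around $\sqrt{d}$. Lowering the temperature moves the samples toward the mean, with norms near $\sqrt{T d}$.
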

There was a paper a few years ago called "Understanding deep learning requires rethinking generalization". They showed that convnets could fit CIFAR with random labels, and this was presented as if it flew in the face of everything we know about generalization, because it showed that the convnets could fit noise, they could greatly overfit, and yet they were generalizing on all sorts of different problems. We can understand this result from the probabilistic function-space perspective.
In the top row here we have samples from a Gaussian process prior in panel (a), and in panel (b) we see some structured data and a Gaussian process predictive distribution which looks reasonable. In panel (c) we see a lot of corrupted data in red, and an updated GP predictive distribution which achieves zero training error. Now this red curve, unlike the green curve in panel (b), looks nothing like the sample prior functions. However, the GP with the RBF kernel is very flexible: it contains this red curve somewhere in its support, even though we might have to sit there sampling for a very, very long time to see anything like it. If there's a strong enough likelihood signal (we're saying there's no noise), then it's going to run through that data perfectly and produce this predictive distribution. So it doesn't want to fit that data, but it can. And we can quantify how much a model wants to fit data with something called the marginal likelihood.
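
To make "wants to fit" concrete, here is a minimal sketch of the exact GP log marginal likelihood on a one-dimensional toy problem, comparing structured targets with the same targets randomly shuffled. The toy data, the small noise variance (kept nonzero for numerical stability), and the function names are assumptions for this example; this is not the CIFAR-10 experiment from the talk.

```python
import numpy as np

def rbf_kernel(xa, xb, lengthscale=1.0):
    sq_dist = (xa[:, None] - xb[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / lengthscale**2)

def gp_log_marginal_likelihood(x, y, noise_var=1e-2, lengthscale=1.0):
    """log p(y | x) for a zero-mean GP with an RBF kernel and Gaussian noise."""
    n = len(x)
    K = rbf_kernel(x, x, lengthscale) + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)
    K_inv_y = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ K_inv_y
            - np.sum(np.log(np.diag(L)))      # equals -0.5 * log det K
            - 0.5 * n * np.log(2.0 * np.pi))

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y_structured = np.sin(x)
y_shuffled = rng.permutation(y_structured)  # same values, structure destroyed

print("structured:", gp_log_marginal_likelihood(x, y_structured))
print("shuffled:  ", gp_log_marginal_likelihood(x, y_shuffled))
```

With a small enough noise variance the GP can fit either dataset closely, but the marginal likelihood of the shuffled targets is far lower, which is the quantitative version of "it doesn't want to fit that data, but it can".
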

We can actually fit a Gaussian process on CIFAR-10 with altered labels and show that we get zero percent training error. That means this result in the rethinking generalization paper wasn't unique to deep neural nets: you can reproduce it with Gaussian processes. We can also compute the approximate marginal likelihood and show that it gets very, very low as we increase the number of altered labels. So that noisy CIFAR dataset is somewhere in the tails of that distribution over all possible datasets, with the marginal likelihood on the vertical axis, when we're considering this GP model with an RBF kernel. We can also compute exactly the same thing using a Bayesian neural net, finding the approximate marginal likelihood using a Laplace approximation, and we see that it decreases in the same way as we increase the number of altered labels. So this Bayesian neural net is able to fit that kind of dataset, but it really doesn't want to, and we can quantify that with the marginal likelihood. This is in accordance with what we would want when we're thinking about model construction: we want large support, but we want to distribute that support carefully, and that means not giving a lot of mass to things like noisy CIFAR. But if we see noisy CIFAR, there's a strong likelihood signal, and we can fit it.

Having said all that, the priors in function space aren't going to be perfect when we just combine generic Gaussian distributions over parameters with the functional forms of neural nets. We can certainly do better, and I certainly embrace the function-space perspective in constructing compelling priors for these neural nets. At the same time, we should be careful not to contrive priors over parameters $w$ to induce distributions over functions that resemble familiar models, such as Gaussian processes with RBF kernels. We could be throwing the baby out with the bathwater in doing that. Indeed, neural nets are useful as their own model class precisely because they have different inductive biases from other models. We already have Gaussian processes with RBF kernels. We can try to gain insights by thinking in function space, but this is in a sense what we're already doing when we're doing architecture design: thinking in function space, trying to encode properties such as translation equivariance, rotation equivariance, color and scale invariance, and other interesting properties that would induce a compelling similarity metric across our data instances. These properties really imbue the associated distribution over functions with desirable properties for generalization.
Really, all of the heavy lifting for the prior over functions that we get is from the architecture. Even if we're thinking broadly, the functional form of a model is like a strong prior. It's a very strong assumption. We can't escape assumptions, and we shouldn't try to. We should embrace assumptions: we need to make assumptions to have good generalization.

PAC-Bayes has been a very exciting approach for deriving explicit generalization error bounds for stochastic networks with posterior $Q$, prior $P$, and $n$ training points. The PAC-Bayes generalization error bounds are based on this term in equation 79. Recently, people have derived non-vacuous bounds, exploiting flatness in $Q$, with at least 80% generalization accuracy predicted on binary MNIST. This is very exciting. It's a promising framework, but it tends not to be prescriptive about model construction, or informative for understanding why a model generalizes. So in a sense it's very complementary to what we've been describing so far in this tutorial. In fact, if we were to try to treat it as a prescription, what we would get is almost contrary to a lot of what we've discussed. The bounds are improved by compact priors, for example, whereas we've been saying that we want priors with support for all sorts of different datasets. They're also improved by a low-dimensional parameter space, in the sense that with the KL divergence it's often hard to get a lot of overlap between two high-dimensional distributions, and so we can very often achieve better bounds by doing model compression, etc. But here we've really been trying to embrace having as many parameters as possible for good generalization.

Also, generalization, as we've shown, can be significantly improved by a multimodal posterior, especially in Bayesian deep learning. This is really one of the key practical takeaways: in order to realize a good approximation to the unconditional predictive distribution, we really need to do multimodal marginalization. But the PAC-Bayes generalization bounds aren't typically too influenced by multimodal posteriors. You end up with a log factor, and it doesn't really change things that much.
So the PAC-Bayes bounds are quite complementary, and they're very exciting in the sense that they help us make quantitative, explicit statements about generalization error bounds. But they don't tend to be too prescriptive. This could partly be because, although we in some cases get non-vacuous bounds, they're also typically fairly loose, so what improves the bounds won't necessarily improve generalization.
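
For reference, one commonly quoted form of a PAC-Bayes bound (a McAllester-style statement, not necessarily the exact expression shown on the slide) says that, with probability at least $1 - \delta$ over the draw of $n$ training points, for every posterior $Q$ and any prior $P$ fixed before seeing the data,

$$
\mathbb{E}_{w \sim Q}\big[R(w)\big] \;\le\; \mathbb{E}_{w \sim Q}\big[\hat{R}(w)\big] + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \log(n/\delta)}{2(n-1)}}
$$

Here $R(w)$ is the expected loss (assumed bounded in $[0, 1]$), $\hat{R}(w)$ is the empirical loss on the training points, and $\delta$ is the confidence parameter; these three symbols are notation introduced for this aside. The bound tightens as $\mathrm{KL}(Q \,\|\, P)$ shrinks, which is why compact priors and low-dimensional parameter spaces tend to improve it.
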

We can relate posterior contraction to a quantity called effective dimension, and actually gain insights into behavior such as double descent. The effective dimension of the Hessian here is $\sum_i \frac{\lambda_i}{\lambda_i + \alpha}$, where $\alpha$ is a regularization parameter and the $\lambda_i$ are the eigenvalues of the Hessian. When we compute this quantity for networks, ResNets of varying width here, we can see that it tracks the test loss very closely, as well as the test error.
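
The effective dimension is straightforward to compute once the Hessian eigenvalues are available. The sketch below is an illustration under stated assumptions: it uses a small dense symmetric matrix, assumes the Hessian is positive semi-definite (eigenvalues are clipped at zero only to remove round-off), and the function name `effective_dimension` is chosen for this example. For a large network the full spectrum would not be computed this way.

```python
import numpy as np

def effective_dimension(hessian, alpha=1.0):
    """Sum over eigenvalues of lambda_i / (lambda_i + alpha)."""
    eigvals = np.linalg.eigvalsh(hessian)    # the Hessian is symmetric
    eigvals = np.clip(eigvals, 0.0, None)    # assumes PSD; removes round-off only
    return float(np.sum(eigvals / (eigvals + alpha)))

# Toy example: a 50-parameter linear regression where only 3 input
# directions carry much signal, so only 3 directions are sharply determined.
rng = np.random.default_rng(0)
scales = np.array([10.0] * 3 + [0.01] * 47)
X = rng.normal(size=(200, 50)) * scales
hessian = X.T @ X / len(X)                   # Hessian of the mean squared error / 2

print("number of parameters:", hessian.shape[0])
print("effective dimension: ", effective_dimension(hessian, alpha=1.0))
```

In this toy case the effective dimension is close to the number of well-determined directions rather than the raw parameter count.
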
In fact, the effective dimension is going down as we increase the size of the models, and this suggests that we're actually getting simpler models even though they have more parameters. We should be very careful not to treat parameter counting as a proxy for model complexity. We can also relate effective dimension to model compression. In a sense, this counts the number of sharp directions in the space given by the eigenvectors of the Hessian. We can see that in this regime we have basically zero training loss, so all the different models are providing lossless compressions of the data. The best compression will capture the most regularity: it'll have this Occam's razor property and tend to provide the best generalization.

Now, the reason SGD finds these kinds of solutions when we make the model size bigger is largely because flat regions of the loss will occupy a much greater volume in high dimensions, and so SGD will be able to find them more easily. We also look at width-depth trade-offs, looking at convnets of different widths and depths, and we can see that above the green partition, where we have zero training loss, effective dimension is a very good proxy for generalization performance.
The yellow curves here show level curves of constant numbers of parameters. What's especially interesting here is to consider different depths. In recent years we've typically been looking at infinite widths of neural nets, for neural tangent kernels and things like this. But in a sense, depth is what gives deep learning a lot of useful hierarchical inductive biases that provide good generalization and representation learning. So we can see that this quantity associated with posterior contraction in Bayesian deep learning is a reasonably good proxy for generalization.

We can also look at the properties in function space as we move in directions given by eigenvectors of the Hessian with the smallest eigenvalues, and show that the decision boundaries are essentially unchanged: we get a lot of functional homogeneity. This provides a mechanism for why things like subspace inference, discussed in part three of the talk, work so well: why we can have a model with tens of millions of parameters, then basically just do marginalization in a five-dimensional subspace, and see a big difference from that. So we can really understand things like model compression by looking at these quantities.

In conclusion, what we've really been reiterating through this tutorial is that the key defining feature of Bayesian methods is marginalization, a.k.a. Bayesian model averaging. That's going to be especially relevant in deep learning, because neural nets are very underspecified by the data and contain all sorts of complementary and exciting solutions to a given problem. It really makes a lot of sense just to use the sum and product rules of probability to marginalize over those solutions.

In trying to do this marginalization as best we can, we shouldn't think of the integration purely through the lens of simple Monte Carlo integration. We should probably think of it more as an active learning problem. By doing that we can gain insights into methods like deep ensembles, and also propose other Bayesian methods that provide even better marginalization, like MultiSWAG. I'm very excited to say that in the last year or so a lot of Bayesian methods are now providing better results than classical training, both in terms of accuracy and uncertainty representation, without a lot of additional overhead.

This really emphasizes that there's a big difference between marginalization and just regularization, as we saw in that coin toss example. We should be careful not to conflate flexibility and complexity. Gaussian processes, for instance, can be extremely flexible, with an infinite number of parameters, and yet have simple inductive biases and good generalization on a small number of points. We should also be careful not to use parameter counting as a proxy for complexity: sometimes models with many more parameters are in a sense helping us find simpler solutions. We can really resolve a lot of mysterious results in deep learning by thinking about model construction and generalization from a probabilistic perspective.

So that's everything, and I'd really like to thank you for attending this tutorial.