In [None]:
%matplotlib notebook
import matplotlib.pyplot as plt
import numpy as np
from tqdm.notebook import tqdm
import corner
import torch
import pyro
display(pyro.__version__)

# Bayesian Neural Networks

Deep Neural Networks are non-linear function approximators which represent the state of the art in pattern recognition

But they do have limitations

- Very deep models require lots of data to train
- Selecting an architecture requires a lot of experimentation
- [Easily ](https://arxiv.org/abs/1412.1897) [fooled](https://openai.com/blog/adversarial-example-research/)
- Poor at representing uncertainty 

> We can address some of these limitations by going Bayesian

A Bayesian neural network (BNN) places a prior distribution on its parameters. Training the BNN is equivalent to learning the posterior distribution of the parameters given the data. Most importantly the **uncertainty on the data and the parameters** can be propagated to estimate the **uncertainty on our predictions**

- Uncertainty on the data is called **aleatoric uncertainty** and it is related to irreducible noise
- Uncertainty on the model (parameters and structure) is called **epistemic uncertainty**

> BNNs (and other Bayesian models) know what they don't know

We can use this "new knowledge" to

- Choose when to use a more simple/complex model (complexity-control)
- Make critical decisions, e.g. [autonomous cars](https://en.wikipedia.org/wiki/Tesla_Autopilot#Non-fatal_crashes), cancer diagnosis


## A bit of history

- 1980's: Bayes theorem is applied to Neural Networks (John Hopfield and Naftali Tishby)
- 1990's: Monte Carlo and VI for Bayesian neural networks were studied extensively by [David MacKay](http://www.inference.org.uk/mackay/BayesNets.html) and [Radford Neal](https://www.cs.toronto.edu/~radford/res-neural.html) (also Bishop, Barber, Hinton, Ghahramani and many others). Neal showed that a one-hidden-layer Bayesian neural network converges to a Gaussian process as the number of hidden units tends to infinity
- 2011: [Alex Graves' VI for neural networks](https://papers.nips.cc/paper/4329-practical-variational-inference-for-neural-networks). Explosion of practical deep bayesian networks 
    - [Charles Blundell's Bayes by backprop](https://arxiv.org/abs/1505.05424)
    - [Yarin Gal's many contributions](http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf)
    - Durk Kingma, Danilo Jimenez Rezende, Shakir Mohamed, José Miguel Hernandez-Lobato
- [Hot topic nowadays](http://bayesiandeeplearning.org/)

History in video by [Zoubin Ghahramani](http://mlg.eng.cam.ac.uk/zoubin/) at [NIPS 2016](https://www.youtube.com/watch?v=FD8l2vPU5FY) and an [interesting panel discussion](https://www.youtube.com/watch?v=HumFmLu3CJ8) at the same conference

## Formalism recap

Assuming

- $N$ *iid* samples $\mathcal{D} =\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(N)}, y^{(N)}) \}$ 
- $x$ is a $D$ dimensional vector, $y$ is a scalar
- Fully-connected neural network with one hidden layer ($H$ neurons) for regression
- $\text{tanh}(\cdot)$ as non-linear activation function

$$
\begin{align}
f_\theta(x) &=   \hat b + \sum_{j=1}^H \hat w_{j} h_j  \nonumber \\
&=  \hat b + \sum_{j=1}^H \hat w_{j} \text{tanh} \left( b_j + \sum_{d=1}^D w_{jd} x_d  \right) \nonumber
\end{align}
$$

The parameter vector $\theta = (b, w, \hat b, \hat w)$ contains all the weights and biases of the model
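
The formula above can be sketched in a few lines of NumPy. This is only an illustration of the forward pass $f_\theta(x)$; the function name `mlp_forward`, the array layout and the sizes are assumptions made for this example and are not part of the lecture code.

```python
import numpy as np

def mlp_forward(x, w, b, w_hat, b_hat):
    """One-hidden-layer tanh network, following the formula above.

    x: (N, D) inputs, w: (H, D) hidden weights, b: (H,) hidden biases,
    w_hat: (H,) output weights, b_hat: scalar output bias.
    """
    h = np.tanh(x @ w.T + b)   # hidden activations h_j, shape (N, H)
    return b_hat + h @ w_hat   # f_theta(x), shape (N,)

rng = np.random.default_rng(0)
N, D, H = 5, 3, 4  # illustrative sizes (assumption)
x = rng.normal(size=(N, D))
w, b = rng.normal(size=(H, D)), rng.normal(size=H)
w_hat, b_hat = rng.normal(size=H), rng.normal()

f = mlp_forward(x, w, b, w_hat, b_hat)
print(f.shape)  # (5,)
```

In this layout the parameter vector $\theta$ has $H(D+2)+1$ entries, which is the dimensionality of the posterior we will try to approximate.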

**Prior:** We propose a prior for $\theta$, typically

$$
\theta \sim \mathcal{N}(\theta | 0, \Sigma_\theta)
$$

**Likelihood:** We propose a likelihood depending on our task, typically Gaussian for regression and Bernoulli/Categorical for binary/multiclass classification 

**Posterior:** We use Bayes theorem to write the posterior

$$
p(\theta | \mathcal{D}) = \frac{p(\mathcal{D}|\theta) p(\theta)}{p(\mathcal{D})} = \frac{1}{{p(\mathcal{D})}} \prod_n \mathcal{N}(y^{(n)} | f_\theta(x^{(n)}), \sigma^2) \mathcal{N}(\theta | 0, \Sigma_\theta)
$$

Even though the likelihood and prior are normal **the posterior in this case is not normal** because of the nested nonlinearity 

In general:

> We cannot obtain an analytical posterior for a bayesian neural network

We resort to sampling-based (MCMC) or deterministic (VI) approximate inference
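
Both families of methods only need the numerator of Bayes' theorem, i.e. the unnormalized log posterior $\log p(\mathcal{D}|\theta) + \log p(\theta)$. The following is a minimal sketch of that quantity for a one-dimensional input, assuming an isotropic prior $\Sigma_\theta = \text{prior\_std}^2 I$ and a fixed noise level `sigma`. The helper names and the flat packing of $\theta$ are hypothetical choices for this example.

```python
import numpy as np

def log_normal(z, mean, std):
    """Elementwise log density of N(z | mean, std^2)."""
    return -0.5 * np.log(2 * np.pi * std**2) - 0.5 * ((z - mean) / std) ** 2

def log_joint(theta, x, y, n_hidden, sigma=0.1, prior_std=1.0):
    """log p(D | theta) + log p(theta) for a tanh network with scalar input.

    theta packs (w, b, w_hat, b_hat) into a flat vector of size 3*H + 1.
    """
    H = n_hidden
    w, b, w_hat, b_hat = theta[:H], theta[H:2*H], theta[2*H:3*H], theta[3*H]
    f = b_hat + np.tanh(np.outer(x, w) + b) @ w_hat   # network output, shape (N,)
    log_lik = log_normal(y, f, sigma).sum()           # Gaussian likelihood, iid data
    log_prior = log_normal(theta, 0.0, prior_std).sum()
    return log_lik + log_prior

rng = np.random.default_rng(0)
x_demo = np.linspace(0, 1, 6)
y_demo = x_demo * np.sin(10 * x_demo)
theta_demo = rng.normal(size=3 * 4 + 1)
print(log_joint(theta_demo, x_demo, y_demo, n_hidden=4))
```

The `tanh` inside `f` is what makes this function non-quadratic in `theta`, which is why the posterior is not Gaussian.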

## My first Bayesian Neural Network using `pyro`


We will use the same synthetic data from the linear regression lecture

In [None]:
# Synthetic data
se = 0.1
np.random.seed(0)
x = np.linspace(0, 1, num=20) 
x_test = np.linspace(-0.05, 1.05, num=200)
f = lambda x : x*np.sin(10*x)

x = np.delete(x, slice(9, 14))
y = f(x) + se*np.random.randn(len(x))
fig, ax = plt.subplots(figsize=(7, 3), tight_layout=True)
ax.scatter(x, y);

x_torch = torch.from_numpy(x.astype('float32')).unsqueeze(1)
x_test = torch.from_numpy(x_test.astype('float32')).unsqueeze(1)
y_torch = torch.from_numpy(y.astype('float32'))

**Coding the bayesian neural net**

Neural networks in `pyro` are classes that inherit from [`pyro.nn.PyroModule`](https://docs.pyro.ai/en/stable/nn.html#pyro.nn.module.PyroModule) which is a subclass of `torch.nn.Module`

Within the `PyroModule` defined model we use

- `PyroSample` to declare random variables, e.g. weights and biases
- `PyroParam` to declare deterministic parameters, e.g. the parameters of the priors
- `PyroModule` to declare torch modules which accept random parameters

In the following example we lift `torch.nn.Linear` using `PyroModule`, and add priors to its parameters using `PyroSample`

In this regression problem we assume that the output is Gaussian distributed. The likelihood is declared with its corresponding plate in the `forward` function

In [None]:
from pyro.nn import PyroSample, PyroModule
import pyro.distributions as dists 

class BayesianMLPRegression(PyroModule):
    def __init__(self, n_hidden=10, prior_scale=1.):
        super().__init__()
        prior = dists.Normal(0, prior_scale)
        # Hidden layer
        self.hidden = PyroModule[torch.nn.Linear](1, n_hidden)
        self.hidden.weight = PyroSample(prior.expand([n_hidden, 1]).to_event(2))
        self.hidden.bias = PyroSample(prior.expand([n_hidden]).to_event(1))
        # Output layer
        self.output = PyroModule[torch.nn.Linear](n_hidden, 1)
        self.output.weight = PyroSample(prior.expand([1, n_hidden]).to_event(2))
        self.output.bias = PyroSample(prior.expand([1]).to_event(1))
        # activation function
        self.activation = torch.nn.Tanh()
        
    def forward(self, x, y=None):
        z = self.activation(self.hidden(x))
        f = self.output(z).squeeze(-1)            
        #sigma = pyro.sample("sigma", dists.Uniform(0.0, 0.1))
        with pyro.plate("data", x.shape[0]):
            loc = pyro.deterministic("mean", f, event_dim=0)   
            obs = pyro.sample("obs", dists.Normal(loc, 0.1), obs=y) #likelihood
        return f

Once the network is coded we can use `pyro.poutine.trace` with pyro validation activated to make sure that the shapes are correct

- Batch dimension is 15 (number of samples)
- Event dimension is equal to the number of neurons for each layer

Independent dimensions (the data points in the likelihood) should be on the left (batch shape), while dependent dimensions (the entries of the weights and biases) should be on the right (event shape)

This is controlled using plates and the `to_event()` method
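
The effect of moving dimensions from the batch shape to the event shape can be checked without running the model. The sketch below uses plain `torch.distributions`; to my understanding `to_event(2)` in pyro plays the role of wrapping the distribution in `Independent(..., 2)`, so treat that equivalence as an assumption of this example.

```python
import torch
from torch.distributions import Normal, Independent

n_hidden = 10

# Prior over a (n_hidden, 1) weight matrix: 10 x 1 independent univariate normals
base = Normal(0.0, 1.0).expand([n_hidden, 1])
print(base.batch_shape, base.event_shape)    # torch.Size([10, 1]) torch.Size([])

# Reinterpret the two rightmost batch dimensions as a single matrix-valued event
joint = Independent(base, 2)
print(joint.batch_shape, joint.event_shape)  # torch.Size([]) torch.Size([10, 1])

w = joint.sample()
print(base.log_prob(w).shape)   # torch.Size([10, 1]): one log density per entry
print(joint.log_prob(w).shape)  # torch.Size([]): entries summed into one log density
```

With the event shape in place, the whole weight matrix counts as one random variable and its log density is a single number.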

In [None]:
pyro.enable_validation(True)

model = BayesianMLPRegression()

print(pyro.poutine.trace(model).get_trace(x_torch, y_torch).format_shapes())

**Training the BNN: MCMC** 

We could train this model using MCMC as seen before

```python
from pyro.infer import MCMC, NUTS

pyro.clear_param_store() 
model = BayesianMLPRegression(n_hidden=10, prior_scale=1.) # Declare the neural network

nuts_kernel = NUTS(model, adapt_step_size=True)
sampler = MCMC(nuts_kernel, num_chains=2, num_samples=1000, warmup_steps=100)
sampler.run(x_torch, y_torch)
```

But even for an extremely simple BNN and using the most advanced samplers, MCMC can be impractical. Note that this may change in the future with projects such as [NumPyro](https://github.com/pyro-ppl/numpyro)

### Recap of VI

We propose an approximate (simple) posterior $q_\nu(\theta)$ and optimize so that it looks similar to the actual posterior

We do this by maximizing a lower bound on the evidence

$$
\mathcal{L}(\nu) = \mathbb{E}_{q_\nu(\theta)}[ \log p(\mathcal{D}|\theta)] - \text{KL}[q_\nu(\theta) \,\|\, p(\theta)]
$$
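
When both $q_\nu(\theta)$ and the prior are Gaussian, the KL term of the bound has a closed form, so only the expected log likelihood has to be estimated by sampling. The sketch below assumes a diagonal normal $q_\nu$ and a zero-mean isotropic prior, and compares the closed form against a Monte Carlo estimate of $\mathbb{E}_q[\log q(\theta) - \log p(\theta)]$. The function name and the numbers are illustrative.

```python
import numpy as np

def kl_diag_normal_to_prior(mu, sigma, prior_std=1.0):
    """KL[ N(mu, diag(sigma^2)) || N(0, prior_std^2 I) ], summed over dimensions."""
    return np.sum(np.log(prior_std / sigma)
                  + (sigma**2 + mu**2) / (2 * prior_std**2) - 0.5)

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 0.0])      # variational means (illustrative values)
sigma = np.array([0.3, 1.0, 2.0])    # variational standard deviations

# Monte Carlo check with reparameterized samples theta = mu + sigma * eps
theta = mu + sigma * rng.normal(size=(200_000, 3))
log_q = -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((theta - mu) / sigma) ** 2
log_p = -0.5 * np.log(2 * np.pi) - 0.5 * theta**2
kl_mc = (log_q - log_p).sum(axis=1).mean()

kl_exact = kl_diag_normal_to_prior(mu, sigma)
print(kl_exact, kl_mc)  # the two values should agree closely
```

The KL term is zero only when $q_\nu$ equals the prior, so it acts as a regularizer that pulls the approximate posterior towards the prior.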

Then we use $q_\nu(\theta)$ as our replacement for $p(\theta|\mathcal{D})$ to calculate the **posterior predictive distribution**

$$
p(\mathbf{y}|\mathbf{x}, \mathcal{D}) = \int p(\mathbf{y}|\mathbf{x}, \theta) p(\theta| \mathcal{D}) \,d\theta
$$
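
In practice the integral is approximated by Monte Carlo: draw $\theta_s \sim q_\nu(\theta)$, evaluate the model for each draw and summarize the results. The sketch below uses a deliberately tiny stand-in model $f_\theta(x) = \theta x$ with a hypothetical $q(\theta) = \mathcal{N}(m, s^2)$, so that the answer can be checked analytically. It is not the BNN of this lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
m, s = 1.5, 0.2        # hypothetical variational posterior q(theta) = N(m, s^2)
sigma = 0.1            # likelihood noise (aleatoric)
x_new = np.array([0.0, 1.0, 3.0])

theta = rng.normal(m, s, size=(50_000, 1))    # theta_s ~ q(theta)
f = theta * x_new                             # (S, 3) model outputs f_theta_s(x)
y = f + sigma * rng.normal(size=f.shape)      # y_s ~ p(y | x, theta_s)

pred_mean = y.mean(axis=0)              # Monte Carlo predictive mean
epistemic_var = f.var(axis=0)           # spread across sampled models
total_var = epistemic_var + sigma**2    # law of total variance

print(pred_mean)
print(total_var, y.var(axis=0))  # both estimate the predictive variance
```

For this toy model the exact predictive is $\mathcal{N}(m x, \sigma^2 + s^2 x^2)$: the epistemic part grows with $|x|$ while the aleatoric part stays constant.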

In what follows we will see how to do this for the BNN using pyro

**Training the BNN: VI**

Once the model is specified we need to write a guide (approximate posterior). This can be done manually or using the automatic guides in `pyro.infer.autoguide`. Typically we would start with the simplest diagonal normal guide that assumes no correlation between the parameters of the BNN

Then we create an SVI object and call its `step` method iteratively. We can evaluate the posteriors of the parameters and the predictive posterior using `pyro.infer.Predictive`

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(10, 3), tight_layout=True, dpi=80)

def update_plot(k, epoch_loss, samples):
    ax[0].cla()
    ax[0].plot(range(k), epoch_loss[:k])
    ax[0].set_yscale('log')
    ax[0].set_ylabel('Negative ELBO per sample')  # svi.step returns the loss, i.e. minus the ELBO
    ax[1].cla()
    ax[1].plot(x, y, 'k.');
    med = np.median(samples, axis=0)
    qua = np.quantile(samples, (0.05, 0.95), axis=0)
    ax[1].plot(x_test.numpy()[:, 0], med)
    ax[1].fill_between(x_test.numpy()[:, 0], qua[0], qua[1], alpha=0.5)
    fig.canvas.draw()

In what follows the neural network is trained for 1000 epochs and every 10 epochs the predictive posterior is plotted. The `mean` site is plotted (model uncertainty). To observe model plus data uncertainty plot the `obs` site

Note that the results are sensitive to the scale of the prior, the scale of the likelihood and the initial scale of the approximate posterior

This is of course in addition to the number of hidden units and the learning rate

In [None]:
# Turn this on for additional debugging
pyro.enable_validation(True) 
pyro.set_rng_seed(123)
pyro.clear_param_store() 
# Declare the neural network
model = BayesianMLPRegression(n_hidden=10, prior_scale=10) 

# Create a guide
from pyro.infer.autoguide import AutoDiagonalNormal
guide = AutoDiagonalNormal(model, init_scale=1e-2)

# Create SVI object
svi = pyro.infer.SVI(model, guide, 
                     optim=pyro.optim.ClippedAdam({'lr':1e-2, 'clip_norm': 10.0}), # Optimizer
                     loss=pyro.infer.TraceMeanField_ELBO(num_particles=1)) # Loss function 

epoch_loss = np.zeros(shape=(1000,))
for k in tqdm(range(len(epoch_loss))):
    loss = svi.step(x=x_torch, y=y_torch) # Actual training step
    epoch_loss[k] = loss / len(x_torch)
        
    if k % 10 == 0:
        # Compute predictive posterior
        predictive = pyro.infer.Predictive(model, guide=guide, num_samples=100)
        samples = predictive(x_test, None)['mean'].detach().numpy()
        # Plot it
        update_plot(k, epoch_loss, samples)        

After training is complete we can use the guide as our replacement to the posterior

The trained parameters of the guide are kept in pyro's parameter store

In [None]:
for name, value in pyro.get_param_store().items():
    print(name, pyro.param(name))

As before we can use `pyro.infer.Predictive` to get samples from our bayesian neural network when evaluated on new inputs 

Here we sample "100 neural networks" and evaluate them on `x_test` 

This returns the sampled parameters (weights and biases) and outputs (`mean` and `obs`)

In [None]:
predictive = pyro.infer.Predictive(model, guide=guide, num_samples=100)
for k, v in predictive(x_test, None).items():
    print(k, v.shape)
    
fig, ax = plt.subplots(figsize=(5, 3), tight_layout=True, dpi=80)
ax.plot(x_test, predictive(x_test)['mean'].detach().numpy().T, c='b', alpha=0.1);
ax.plot(x, y, 'k.');

## Bayesian neural network for multi-class classification with `pyro`

Let's create synthetic 2D data with 3 classes

In [None]:
N = 100 # number of points per class
D = 2 # dimensionality
K = 3 # number of classes
X = np.zeros((N*K,D)) # data matrix (each row = single example)
y = np.zeros(N*K, dtype='int') # class labels

for j in range(K):
    ix = range(N*j,N*(j+1))
    r = np.linspace(0.0, 0.5, N) # radius
    t = np.linspace(j*4, (j+1)*4, N) + np.random.randn(N)*0.2 # theta
    X[ix] = np.c_[r*np.sin(t), r*np.cos(t)]
    y[ix] = j

#X, y = sklearn.datasets.make_moons(200, noise=0.2)
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))

fig, ax = plt.subplots(figsize=(4, 3))
ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1, alpha=0.5);

idx = np.random.permutation(N*K)
train_prop = 0.7
x_train = torch.from_numpy(X[idx[:int(N*K*train_prop)]].astype('float32'))
y_train = torch.from_numpy(y[idx[:int(N*K*train_prop)]])
x_test = torch.from_numpy(X[idx[int(N*K*train_prop):]].astype('float32'))
y_test = torch.from_numpy(y[idx[int(N*K*train_prop):]])

The following is an implementation of a Bayesian neural network with one hidden layer (a second hidden layer is left commented out) and a normal prior on all weights and biases

For the likelihood we use the Categorical distribution (Multinomial with $n=1$). Here it is parameterized with unnormalized log-probabilities (logits), in this case the un-activated output of the last layer
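
The link between logits and class probabilities is the softmax function. The following NumPy sketch (the helper name `softmax` is chosen for this example) shows that logits are only defined up to an additive constant, which is why the last layer needs no activation.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the max keeps exp() from overflowing."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[2.0, 0.0, -1.0],    # confident in class 0
                   [0.0, 0.0, 0.0]])    # no preference between classes
probs = softmax(logits)
print(probs)

# Adding the same constant to every logit of a row leaves the probabilities unchanged
shifted = softmax(logits + 100.0)
print(np.allclose(shifted, probs))  # True
```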

In [None]:
from pyro.nn import PyroSample, PyroModule
from pyro.distributions import Normal, Categorical

class BayesianMLPClassifier(PyroModule):
    def __init__(self, num_hidden=10, prior_std=1.):
        super().__init__()
        prior = Normal(0, prior_std)
        self.layer1 = PyroModule[torch.nn.Linear](2, num_hidden)
        self.layer1.weight = PyroSample(prior.expand([num_hidden, 2]).to_event(2))
        self.layer1.bias = PyroSample(prior.expand([num_hidden]).to_event(1))
        
        #self.layer2 = PyroModule[torch.nn.Linear](num_hidden, num_hidden)
        #self.layer2.weight = PyroSample(prior.expand([num_hidden, num_hidden]).to_event(2))
        #self.layer2.bias = PyroSample(prior.expand([num_hidden]).to_event(1))
        
        self.layer3 = PyroModule[torch.nn.Linear](num_hidden, 3)
        self.layer3.weight = PyroSample(prior.expand([3, num_hidden]).to_event(2))
        self.layer3.bias = PyroSample(prior.expand([3]).to_event(1))        
        
        self.activation = torch.nn.Tanh()

    def forward(self, x, y=None):
        h = self.activation(self.layer1(x))
        #h = self.activation(self.layer2(h))
        f = self.layer3(h).squeeze(1)
        with pyro.plate("data", size=x.shape[0]):
            logp = pyro.deterministic("logp", f, event_dim=1)
            obs = pyro.sample("obs", Categorical(logits=logp), obs=y) # Multiclass
            #obs = pyro.sample("obs", dist.Bernoulli(logits=p), obs=y) # Binary
        return f
    
    
#pyro.enable_validation(True)
#model = BayesianMLPClassifier()
#print(pyro.poutine.trace(model).get_trace(x_train, y_train).format_shapes())

Again we use an automatic diagonal normal guide (no covariance) and train using `TraceMeanField_ELBO`

We plot the mean of the predictive posterior every 100 epochs

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(7, 3), tight_layout=True)
line2 = ax[1].plot([], [])

def update_plot(k, samples):
    ax[0].cla()
    p = torch.nn.functional.one_hot(samples["obs"], num_classes=3).sum(dim=0)
    zz = p.argmax(dim=1).reshape(xx.shape).detach().numpy()
    ax[0].pcolormesh(xx, yy, zz, shading='auto', cmap=plt.cm.Set1, alpha=0.75)
    for i, m in enumerate(['o', 'x', 'd']):
        ax[0].scatter(X[y==i, 0], X[y==i, 1], c='k', marker=m, s=20, alpha=0.25)    

    line2[0].set_xdata(range(k))
    line2[0].set_ydata(epoch_loss[:k])
    ax[1].relim()
    ax[1].autoscale_view()
    fig.canvas.draw()

In [None]:
pyro.enable_validation(True)
pyro.set_rng_seed(123)
pyro.clear_param_store()
model = BayesianMLPClassifier(num_hidden=100, prior_std=10.)

from pyro.infer.autoguide import AutoDiagonalNormal
guide = AutoDiagonalNormal(model, init_scale=1e-1)

svi = pyro.infer.SVI(model, guide, 
                     optim=pyro.optim.ClippedAdam({'lr':1e-2}),
                     loss=pyro.infer.TraceMeanField_ELBO())

epoch_loss = np.zeros(shape=(3000,))
for k in tqdm(range(len(epoch_loss))):
    epoch_loss[k] = svi.step(x_train, y_train)
    if k % 100 == 0:
        predictive = pyro.infer.Predictive(model, guide=guide, num_samples=10)
        samples = predictive(torch.from_numpy(np.c_[xx.ravel(), yy.ravel()].astype('float32')))
        update_plot(k, samples)

We sample 100 neural networks and plot four individual results

In [None]:
predictive = pyro.infer.Predictive(model, guide=guide, num_samples=100)
samples = predictive(torch.from_numpy(np.c_[xx.ravel(), yy.ravel()].astype('float32')))

fig, ax = plt.subplots(1, 4, figsize=(9, 2), tight_layout=True)
for k in range(4):
    zz = samples["obs"][k].reshape(xx.shape).detach().numpy()
    ax[k].pcolormesh(xx, yy, zz, shading='auto', cmap=plt.cm.Set1)
    for i, m in enumerate(['o', 'x', 'd']):
        ax[k].scatter(X[y==i, 0], X[y==i, 1], c='k', marker=m, s=20, alpha=0.25)    

From these categorical samples we can compute statistics

On the left we plot the mode (most frequent class) and on the right the entropy.

The higher the entropy, the more the sampled neural networks disagree (high uncertainty)
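
As a small worked example of this statistic, the sketch below computes the entropy of the class frequencies for two inputs: one where all sampled networks agree and one where they disagree completely. The function name `vote_entropy` and the toy votes are assumptions for illustration.

```python
import numpy as np

def vote_entropy(class_samples, num_classes):
    """Entropy (in nats) of the empirical class frequencies.

    class_samples: (S, M) integer array, S sampled networks and M inputs.
    Returns an array of shape (M,).
    """
    one_hot = np.eye(num_classes)[class_samples]   # (S, M, K)
    p = one_hot.mean(axis=0)                       # (M, K) vote frequencies
    return -(p * np.log(p + 1e-32)).sum(axis=1)    # small constant avoids log(0)

# Three sampled networks, two inputs: unanimous vote vs. total disagreement
votes = np.array([[0, 0],
                  [0, 1],
                  [0, 2]])
h = vote_entropy(votes, num_classes=3)
print(h, np.log(3))
```

The entropy ranges from 0 (all networks agree) to $\log K$ (votes spread uniformly over the $K$ classes).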

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(7, 3), tight_layout=True)

zz = torch.mode(samples["obs"], dim=0)[0].reshape(xx.shape).detach().numpy()
ax[0].pcolormesh(xx, yy, zz, shading='auto', cmap=plt.cm.Set1, alpha=0.75)
for i, m in enumerate(['o', 'x', 'd']):
    ax[0].scatter(X[y==i, 0], X[y==i, 1], c='k', marker=m, s=20, alpha=0.25)

p = torch.nn.functional.one_hot(samples["obs"], num_classes=3).sum(dim=0)/100.
entropy = lambda p: -(p*(p+1e-32).log()).sum(dim=1)

zz = entropy(p).reshape(xx.shape).detach().numpy()
cf = ax[1].contourf(xx, yy, zz, cmap=plt.cm.Blues, alpha=0.75)
fig.colorbar(cf, ax=ax[1])
for i, m in enumerate(['o', 'x', 'd']):
    ax[1].scatter(X[y==i, 0], X[y==i, 1], c='k', marker=m, s=20, alpha=0.25)

#### Result using a non-bayesian neural network

In [None]:
class MLPClassifier(torch.nn.Module):    
    def __init__(self, num_hidden=10):
        super(MLPClassifier, self).__init__()
        self.layer1 = torch.nn.Linear(2, num_hidden) 
        #self.layer2 = torch.nn.Linear(num_hidden, num_hidden)
        self.layer3 = torch.nn.Linear(num_hidden, 3)
        self.activation = torch.nn.ReLU()
        
    def forward(self, x): 
        z = self.activation(self.layer1(x))
        #z = self.activation(self.layer2(z))
        return self.layer3(z)     
    
fig, ax = plt.subplots(1, 2, figsize=(7, 3), tight_layout=True)
line2 = ax[1].plot([], [])

def update_plot(k, model):
    ax[0].cla()
    Z = model.forward(torch.from_numpy(np.c_[xx.ravel(), yy.ravel()].astype('float32')))
    zz = torch.nn.Softmax(dim=1)(Z).argmax(dim=1).detach().numpy().reshape(xx.shape[0], xx.shape[1])
    ax[0].pcolormesh(xx, yy, zz, shading='auto', cmap=plt.cm.Set1, alpha=0.75)
    for i, m in enumerate(['o', 'x', 'd']):
        ax[0].scatter(X[y==i, 0], X[y==i, 1], c='k', marker=m, s=20, alpha=0.25)
    
    line2[0].set_xdata(range(k))
    line2[0].set_ydata(epoch_loss[:k])
    ax[1].relim()
    ax[1].autoscale_view()
    fig.canvas.draw()

In [None]:
model = MLPClassifier(num_hidden=100)
display(model)
criterion = torch.nn.CrossEntropyLoss(reduction='sum')
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_one_epoch(x, y, phase='train'):
    haty = model.forward(x) # Evaluate the model
    loss = criterion(haty, y) # Calculate errors
    if phase == 'train':
        optimizer.zero_grad()
        loss.backward() # Compute derivatives
        optimizer.step() # Update parameters 
    return loss.item()

x_train = torch.from_numpy(X.astype('float32'))#.reshape(-1, 1)
y_train = torch.from_numpy(y)#.reshape(-1, 1)
epoch_loss = np.zeros(shape=(3000,)) 

for k in tqdm(range(len(epoch_loss))):
    epoch_loss[k] = train_one_epoch(x_train, y_train)
    if k % 100 == 0: 
        update_plot(k, model)

If we consider the softmax output as probabilities we can also compute its entropy

Is it the same as before?

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(7, 3), tight_layout=True)

Z = torch.nn.Softmax(dim=1)(model.forward(torch.from_numpy(np.c_[xx.ravel(), yy.ravel()].astype('float32'))))
zz = Z.argmax(dim=1).detach().numpy().reshape(xx.shape[0], xx.shape[1])
ax[0].pcolormesh(xx, yy, zz, shading='auto', cmap=plt.cm.Set1, alpha=0.75)
for i, m in enumerate(['o', 'x', 'd']):
    ax[0].scatter(X[y==i, 0], X[y==i, 1], c='k', marker=m, s=20, alpha=0.25)
    
zz = -(Z*(Z+1e-32).log()).sum(dim=1).reshape(xx.shape).detach().numpy()
cf = ax[1].contourf(xx, yy, zz, cmap=plt.cm.Blues, alpha=0.75, vmin=0., vmax=np.log(3))
fig.colorbar(cf, ax=ax[1])
for i, m in enumerate(['o', 'x', 'd']):
    ax[1].scatter(X[y==i, 0], X[y==i, 1], c='k', marker=m, s=20, alpha=0.25)

This is related to the phenomenon of **uncertainty miscalibration in neural networks**, i.e. the predictions tend to be overconfident (very low uncertainty) even far from the training data

> "after (almost) all training samples are correctly classified, crossentropy (neg log likelihood) can be further minimized by increasing the confidence of the predictions", *i.e.* reducing the entropy of softmax output

The uncertainty obtained from model averaging (bayesian) and the one derived from the softmax output should not be confused
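
The mechanism described in the quote can be illustrated numerically: multiplying the logits by a growing factor (for instance because the weights of the last layer keep growing during training) does not change the predicted class, but it drives the softmax entropy towards zero. The logits and scale factors below are made up for this sketch.

```python
import numpy as np

def softmax_entropy(logits):
    """Entropy (in nats) of softmax(logits) for a single input."""
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return -(p * np.log(p + 1e-32)).sum()

logits = np.array([1.0, 0.5, -0.5])   # illustrative logits for one input
scales = (1, 5, 25)

entropies = [softmax_entropy(c * logits) for c in scales]
predictions = [int(np.argmax(c * logits)) for c in scales]

for c, h, k in zip(scales, entropies, predictions):
    print(f"scale={c:>2}  predicted class={k}  entropy={h:.4f}")
```

The predicted class is the same at every scale, so the shrinking entropy says nothing about whether the prediction is actually more reliable.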

Further reading and references on this topic:

- [On Calibration of Modern Neural Networks](https://arxiv.org/pdf/1706.04599.pdf)
- [Being Bayesian, Even Just a Bit, Fixes Overconfidence in ReLU Networks](https://arxiv.org/pdf/2002.10118v1.pdf)
- [Evidential Deep Learning to Quantify Classification Uncertainty](https://arxiv.org/pdf/1806.01768.pdf)

## Final remarks on Bayesian Neural Networks Training

- An active research area nowadays
- Delicate: bad initializations, local minima, appropriate priors
- Variance control and reparameterization (more on this next class)