# Deep Learning
## Formative assessment
### Week 11: Bayesian neural networks

#### Instructions

In this notebook, you will write code to implement and train a Bayesian neural network model.

Some code cells are provided for you in the notebook. You should avoid editing provided code, and make sure to execute the cells in order to avoid unexpected errors. Some cells begin with the line: 

`#### GRADED CELL ####`

These cells require you to write your own code to complete them.

#### Let's get started!

We'll start by running some imports, and loading the dataset.

In [None]:
#### PACKAGE IMPORTS ####

# Run this cell to import all required packages. 

import keras
from keras import ops
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Input, Flatten
from keras.optimizers import RMSprop
import tensorflow_datasets as tfds
import numpy as np
import matplotlib.pyplot as plt

<center><img src="figures/mnist_mnist_c.png" title="MNIST & MNIST-C" style="width: 650px;"/></center>
  
#### The MNIST and MNIST-C datasets

In this assignment, you will use the [MNIST](http://yann.lecun.com/exdb/mnist) and [MNIST-C](https://github.com/google-research/mnist-c) (MNIST-Corrupted) datasets. The MNIST-C dataset is a corruption benchmark dataset for out-of-distribution evaluation, where handwritten digits from the MNIST dataset are corrupted with various types of noise. Both datasets contain 60,000 examples for training and 10,000 examples for testing.

* LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. (1998), "Gradient-based learning applied to document recognition." _Proceedings of the IEEE_, **86** (11), 2278-2324.
* LeCun, Y., Cortes, C. & Burges, C.J. (2010), "MNIST handwritten digit database", ATT Labs [Online](http://yann.lecun.com/exdb/mnist), **2**, 2010.
* Mu, N. & Gilmer, J. (2019), "MNIST-C: A Robustness Benchmark for Computer Vision", arXiv preprint, abs/1906.02337.

Your goal is to build and train a Bayesian neural network on the MNIST dataset, and test the robustness of the model on the MNIST-C dataset.

#### Load and prepare the data
For this assignment, you will load the MNIST and MNIST-C datasets from the TensorFlow Datasets library:

In [None]:
# Run this cell to load the MNIST data and print the element_spec

mnist_train_data = tfds.load("mnist_corrupted", split="train", 
                             read_config=tfds.ReadConfig(try_autocache=False))
mnist_test_data = tfds.load("mnist_corrupted", split="test", 
                            read_config=tfds.ReadConfig(try_autocache=False))

mnist_train_data.element_spec

In [None]:
# View some samples

fig, axes = plt.subplots(1, 6, figsize=(10, 3))
for i, example in enumerate(mnist_train_data.shuffle(200).take(6)):
    axes[i].imshow(ops.convert_to_numpy(example['image']), cmap='gray_r')
    axes[i].set_axis_off()
    axes[i].set_title(f'Label: {ops.convert_to_numpy(example["label"])}', fontsize=8)

In [None]:
# Run this cell to load the MNIST-C data and print the element_spec

mnist_c_train_data = tfds.load("mnist_corrupted", split="train", builder_kwargs={"config": 'spatter'}, 
                               read_config=tfds.ReadConfig(try_autocache=False))
mnist_c_test_data = tfds.load("mnist_corrupted", split="test", builder_kwargs={"config": 'spatter'},
                              read_config=tfds.ReadConfig(try_autocache=False))

mnist_c_train_data.element_spec

In [None]:
# View some samples

fig, axes = plt.subplots(1, 6, figsize=(10, 3))
for i, example in enumerate(mnist_c_train_data.shuffle(200).take(6)):
    axes[i].imshow(ops.convert_to_numpy(example['image']), cmap='gray_r')
    axes[i].set_axis_off()
    axes[i].set_title(f'Label: {ops.convert_to_numpy(example["label"])}', fontsize=8)

This version of the MNIST-C dataset adds 'spatters' to the images from the MNIST dataset.

First, you should write a `process_dataset` function to preprocess the data ready for training. 

* The function takes the arguments `dataset`, `batch_size`, and `shuffle_buffer`
  * `dataset` is a `tf.data.Dataset` object as loaded above
  * `batch_size` is a positive integer
  * `shuffle_buffer` is a positive integer, or `None`
* The `dataset` should be processed as follows:
  * It should return tuples of Tensors `(image, label)`
  * The image values should be scaled to the range $[0, 1]$, with dtype `tf.float32`
  * The labels should be converted to one-hot vectors, with dtype `tf.float32`
* If `shuffle_buffer` is not `None`, `dataset` should be shuffled with buffer size equal to `shuffle_buffer`
* `dataset` should be batched using `batch_size`
* Your function should end with a call to `prefetch` (using the argument `tf.data.AUTOTUNE`) and return the processed Dataset
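
As an illustration of the two per-example transformations listed above (scaling and one-hot encoding), here is a minimal NumPy sketch on a toy batch. The helper name `scale_and_one_hot` and the toy arrays are hypothetical and not part of the assignment; in the graded function the same operations would be applied to the `tf.data.Dataset` elements with TensorFlow ops (for example `tf.cast` and `tf.one_hot` inside `Dataset.map`).

```python
import numpy as np

def scale_and_one_hot(images, labels, num_classes=10):
    """Hypothetical helper: scale uint8 images to [0, 1] and one-hot encode labels."""
    images = images.astype(np.float32) / 255.0
    one_hot = np.eye(num_classes, dtype=np.float32)[labels]
    return images, one_hot

# Toy batch: an all-black and an all-white 28x28x1 "image", with integer labels
toy_images = np.stack([np.zeros((28, 28, 1), dtype=np.uint8),
                       np.full((28, 28, 1), 255, dtype=np.uint8)])
toy_labels = np.array([3, 7])

scaled, encoded = scale_and_one_hot(toy_images, toy_labels)
print(scaled.dtype, scaled.min(), scaled.max())
print(encoded)
```

Dividing by 255 maps the `uint8` pixel range onto $[0, 1]$, and each row of `encoded` has a single 1 at the index of the label.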

In [None]:
#### GRADED CELL ####

# Complete the following function.
# Make sure not to change the function name or arguments.

def process_dataset(dataset, batch_size, shuffle_buffer=None):
    """
    This function takes a tf.data.Dataset, shuffle_buffer and batch_size arguments. 
    It should preprocess and return the Dataset as specified above.
    """
    
    

In [None]:
# Use your function to process the Datasets

mnist_train_data = process_dataset(mnist_train_data, 128, shuffle_buffer=500)
mnist_test_data = process_dataset(mnist_test_data, 128)
mnist_c_train_data = process_dataset(mnist_c_train_data, 128, shuffle_buffer=500)
mnist_c_test_data = process_dataset(mnist_c_test_data, 128)

print(mnist_train_data.element_spec)

#### Build and train the Bayesian MLP model

We will now build and train a Bayesian MLP on the MNIST dataset. This model will learn distributions over the model weights, and will be able to quantify its own uncertainty on OOD (out-of-distribution) data.

First you should complete the following `DenseVariational` custom layer. The layer is similar to the one from the lecture notes. It should also use an independent standard Normal distribution $N(0, 1)$ for each weight and bias parameter in the prior, and diagonal Gaussians with learnable means and variances for the posterior. 

* The initializer takes `units` as a required argument, and `activation`, `kl_weight`, `num_kl_mc_samples` as optional arguments
* The initializer and `build` method are completed for you
* You should complete the `call` method. The outputs computation has been completed, but you should add the KL loss term using the `add_loss` method
  * The KL loss should be computed using the first form of the SGVB estimator. That is, the KL term should be approximated as
$$\frac{1}{K}\sum_{j=1}^K  \left[ \log q_\phi(\theta^{(j)}) - \log p(\theta^{(j)}) \right],$$
where $\theta^{(j)} = g_\phi(\epsilon^{(j)}),~\epsilon^{(j)}\sim p(\epsilon)$ and $g_\phi$ is the scale-and-shift transformation using the mean and standard deviation Variables for the kernel and bias parameters.
  * $K$ in the above expression is defined by `num_kl_mc_samples`
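
The estimator above can be sanity-checked outside of Keras. The following NumPy sketch is an illustration only, not the layer implementation: the function names and toy parameter shapes are assumptions. It draws $K$ reparameterised samples, averages $\log q_\phi(\theta^{(j)}) - \log p(\theta^{(j)})$, and compares the result with the closed-form KL divergence between a diagonal Gaussian and a standard Normal.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal_pdf(x, mean, logstd):
    """Elementwise log-density of N(mean, exp(logstd)^2)."""
    return (-0.5 * np.log(2.0 * np.pi) - logstd
            - 0.5 * ((x - mean) / np.exp(logstd)) ** 2)

def mc_kl_estimate(mean, logstd, num_samples):
    """Monte Carlo estimate of KL(q || p), q = N(mean, std^2), p = N(0, 1)."""
    total = 0.0
    for _ in range(num_samples):
        eps = rng.standard_normal(mean.shape)
        theta = mean + np.exp(logstd) * eps      # theta = g_phi(eps)
        log_q = log_normal_pdf(theta, mean, logstd).sum()
        log_p = log_normal_pdf(theta, 0.0, 0.0).sum()
        total += log_q - log_p
    return total / num_samples

def exact_kl(mean, logstd):
    """Closed-form KL between a diagonal Gaussian and a standard Normal."""
    var = np.exp(2.0 * logstd)
    return 0.5 * np.sum(var + mean ** 2 - 1.0 - 2.0 * logstd)

# Toy "kernel" parameters (shapes chosen arbitrarily for the illustration)
mean = rng.normal(size=(4, 3))
logstd = rng.normal(scale=0.1, size=(4, 3)) - 1.0

print("MC estimate:", mc_kl_estimate(mean, logstd, num_samples=2000))
print("Closed form:", exact_kl(mean, logstd))
```

With a large number of samples the two values should be close; with small $K$ (such as `num_kl_mc_samples=1`) the estimate is noisy but still unbiased. In the layer, the corresponding quantity would be summed over both the kernel and bias parameters, weighted by `kl_weight`, and registered with `add_loss`.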

In [None]:
from keras.layers import Layer, Activation

class DenseVariational(Layer):

    def __init__(self, units, activation=None, kl_weight=None, num_kl_mc_samples=1, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.activation = activation
        self.kl_weight = kl_weight
        self.num_kl_mc_samples = num_kl_mc_samples
        self.pi = ops.array(np.pi)

    def build(self, inputs_shape):
        in_units = inputs_shape[-1]
        self.kernel_mean = self.add_weight(
            name='kernel_mean',
            shape=(in_units, self.units),
            initializer='glorot_uniform'
        )
        self.kernel_logstd = self.add_weight(
            name='kernel_logstd',
            shape=(in_units, self.units),
            initializer='zeros'
        )
        self.bias_mean = self.add_weight(
            name='bias_mean',
            shape=(self.units,),
            initializer='zeros'
        )
        self.bias_logstd = self.add_weight(
            name='bias_logstd',
            shape=(self.units,),
            initializer='zeros'
        )
        self.activation_fn = Activation(self.activation)

    def call(self, inputs):
        """
        Add the KL loss using the add_loss method, using the kl_weight and num_kl_mc_samples attributes.
        """
        kernel = self.kernel_mean + (keras.random.normal(self.kernel_mean.shape) * ops.exp(self.kernel_logstd))
        bias = self.bias_mean + (keras.random.normal(self.bias_mean.shape) * ops.exp(self.bias_logstd))
        outputs = self.activation_fn(inputs @ kernel + bias)

        return outputs

You should now define the BNN model with the following `get_bayesian_model` function.

* The function takes the arguments `input_shape`, `hidden_units`, `kl_weight` and `num_kl_mc_samples`
  * `input_shape` is a tuple of integers
  * `hidden_units` is a list of integers, for the width of the Dense layers
  * `kl_weight` is a float to use to weight the KL-term in the objective
  * `num_kl_mc_samples` is an integer to define the number of MC samples to approximate the KL loss term
* The first layer of the model should use the `input_shape` argument
* The model should first flatten the input to a batch of 1-D Tensors
* The `hidden_units` argument is a list of integers (of any length), containing the number of units to use in subsequent `DenseVariational` layers
  * Each of these `DenseVariational` layers should use a ReLU activation
  * The KL-divergence term should be weighted using `kl_weight`
  * The KL loss should be evaluated using `num_kl_mc_samples`
* There should then be one more `DenseVariational` layer with 10 units that outputs the logits of a categorical distribution with 10 categories
  * This `DenseVariational` layer should also use `kl_weight` and `num_kl_mc_samples` as above
* The function should then return the model

In [None]:
#### GRADED CELL ####

# Complete the following function. 
# Make sure to not change the function name or arguments.

def get_bayesian_model(input_shape, hidden_units, kl_weight, num_kl_mc_samples):
    """
    This function should define the BNN model as described above. 
    Your function should return this model.
    """
    
    

In [None]:
# Use your function to define the model

model = get_bayesian_model(input_shape=(28, 28, 1), hidden_units=[200, 100], kl_weight=1/60000, num_kl_mc_samples=5)
model.summary()

In [None]:
# Compile the model with the negative log-likelihood loss

def nll(y_true, y_pred):
    return keras.losses.categorical_crossentropy(y_true, y_pred, from_logits=True)

model.compile(loss=nll, optimizer=RMSprop(learning_rate=1e-2), metrics=['accuracy'])

The Bayes by Backprop algorithm can take a while to converge. We will use a learning rate schedule that uses the initial learning rate for 50 epochs, and then multiplies the learning rate by 0.1 every 25 epochs. We can use a [`LearningRateScheduler`](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/LearningRateScheduler) callback to do this.

In [None]:
# Define a learning rate schedule

def scheduler(epoch, lr):
    if epoch < 50:
        return lr
    elif (epoch % 25 == 0):
        return lr * 0.1
    else:
        return lr
    
learning_rate_scheduler = keras.callbacks.LearningRateScheduler(scheduler)

In [None]:
# Train the model on the MNIST data

history = model.fit(mnist_train_data, epochs=100, callbacks=[learning_rate_scheduler])

In [None]:
# Plot the training loss and accuracy

plt.figure(figsize=(10, 3))
plt.subplot(1, 2, 1)
plt.plot(history.history['loss'])
plt.ylim(0, 5)
plt.title("Loss vs epoch")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.subplot(1, 2, 2)
plt.plot(history.history['accuracy'])
plt.title("Accuracy vs epoch")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.show()

In [None]:
# Evaluate the model on the MNIST test data

mnist_results = model.evaluate(mnist_test_data, return_dict=True, verbose=False)
print(f"MNIST test loss: {mnist_results['loss']:.4f}, MNIST test accuracy: {mnist_results['accuracy']:.4f}")

In [None]:
# Evaluate the model on the MNIST-C test data

mnist_c_results = model.evaluate(mnist_c_test_data, return_dict=True, verbose=False)
print(f"MNIST-C test loss: {mnist_c_results['loss']:.4f}, MNIST-C test accuracy: {mnist_c_results['accuracy']:.4f}")

#### Predictive distribution

We can use our posterior approximation $q_\phi(\theta)$ to approximate the predictive distribution, by marginalising out the parameters of the model. We can estimate this by drawing $K$ samples from our variational distribution $q_\phi$, and computing the Monte Carlo estimate

$$
p(y^* \mid x^*, \mathcal{D}) \approx \frac{1}{K} \sum_{k=1}^K p(y^* \mid x^*,\theta_k),\quad \theta_k \sim q_\phi(\theta).
$$

You should now complete the following `predictive_distribution` function to approximate the predictive distribution for a given batch of inputs.

* The function takes the arguments `bayesian_model`, `inputs` and `num_samples`
  * `bayesian_model` will be the trained Bayesian neural network from above
  * `inputs` is a batch of images of shape `(batch_size, 28, 28, 1)`
  * `num_samples` is an integer, for the number of Monte Carlo samples ($K$ in the above equation)
* The function should compute the above approximation to the predictive distribution
  * It should use `num_samples` Monte Carlo samples in the approximation
* The function should then return a Tensor of shape `(batch_size, 10)` for the probabilities of the predictive distribution

_Hint: the model returns the logits of the categorical distribution as a Tensor, which you can convert to probabilities with `ops.softmax` (or `tf.nn.softmax`). You should not sample from the categorical distribution._
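
To make the Monte Carlo average concrete, here is a small NumPy sketch. It is illustrative only: `sampled_logits` is a random stand-in for the logits that $K$ stochastic forward passes of the model would produce, and the `softmax` helper is written out by hand.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)

# Stand-in for K = 10 forward passes: shape (K, batch_size, num_classes)
sampled_logits = rng.normal(size=(10, 4, 10))

probs_per_sample = softmax(sampled_logits)          # p(y | x, theta_k) for each k
predictive_probs = probs_per_sample.mean(axis=0)    # average over the K samples

print(predictive_probs.shape)
print(predictive_probs.sum(axis=-1))
```

Note that the average is taken over probabilities rather than logits, matching the formula above; each row of `predictive_probs` still sums to 1 because it is a mean of valid probability vectors.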

In [None]:
#### GRADED CELL ####

# Complete the following function. 
# Make sure to not change the function name or arguments.

def predictive_distribution(bayesian_model, inputs, num_samples):
    """
    This function should compute the predictive distribution approximation as described
    above. Your function should return the Tensor of probabilities.
    """
    
    

In [None]:
# Test your function on a batch of inputs

for batch in mnist_train_data.take(1):
    images, _ = batch

y_pred = predictive_distribution(model, images, num_samples=10)

You should now complete the following `pd_loss_and_accuracy` function to compute the categorical cross-entropy loss and categorical accuracy on a dataset, using the predictive distribution approximation of the model.

* The function takes the arguments `bayesian_model`, `dataset`, `num_samples` and `predictive_distribution_fn`
  * `bayesian_model` will be the trained Bayesian neural network from above
  * `dataset` is a tf.data.Dataset object, as used above for training or testing
  * `num_samples` is an integer, for the number of Monte Carlo samples
  * `predictive_distribution_fn` is a function used to compute the predictive distribution approximation. It has the signature defined above in the `predictive_distribution` function
* The `pd_loss_and_accuracy` function should define two metric objects to compute the categorical cross-entropy loss and categorical accuracy
* The function should iterate over `dataset`, compute the predictive distribution for the inputs using `predictive_distribution_fn`, and update the metrics
* The function should then return a tuple of floats `(loss, accuracy)`
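
The role of the metric objects is to accumulate results across batches. The NumPy sketch below shows the underlying bookkeeping on two toy batches (all names and values here are hypothetical); in the graded function, Keras metric objects such as `keras.metrics.CategoricalCrossentropy` and `keras.metrics.CategoricalAccuracy` would handle this through `update_state` and `result`.

```python
import numpy as np

def cross_entropy_per_example(y_true, probs, eps=1e-7):
    """Categorical cross-entropy for one-hot labels and predicted probabilities."""
    return -np.sum(y_true * np.log(np.clip(probs, eps, 1.0)), axis=-1)

def correct_per_example(y_true, probs):
    """1.0 where the arg-max prediction matches the one-hot label, else 0.0."""
    return (probs.argmax(axis=-1) == y_true.argmax(axis=-1)).astype(np.float64)

# Two toy batches of unequal size: (one-hot labels, predictive probabilities)
batches = [
    (np.eye(3)[[0, 1]], np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])),
    (np.eye(3)[[2]], np.array([[0.5, 0.3, 0.2]])),
]

loss_sum, correct_sum, count = 0.0, 0.0, 0
for y_true, probs in batches:
    loss_sum += cross_entropy_per_example(y_true, probs).sum()
    correct_sum += correct_per_example(y_true, probs).sum()
    count += y_true.shape[0]

loss, accuracy = loss_sum / count, correct_sum / count
print(f"loss: {loss:.4f}, accuracy: {accuracy:.4f}")
```

Accumulating per-example sums and dividing by the total count, rather than averaging per-batch means, keeps the result correct when the final batch is smaller than the others (as happens with 10,000 test examples and a batch size of 128).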

In [None]:
#### GRADED CELL ####

# Complete the following function. 
# Make sure to not change the function name or arguments.

def pd_loss_and_accuracy(bayesian_model, dataset, num_samples, 
                         predictive_distribution_fn=predictive_distribution):
    """
    This function should compute the categorical cross-entropy loss and categorical accuracy
    on a dataset as described above. Your function should return the loss and accuracy values.
    """
    
    

In [None]:
# Use your function to evaluate your model on the MNIST dataset using the predictive distribution

final_loss, final_accuracy = pd_loss_and_accuracy(model, mnist_test_data, num_samples=5, 
                                                  predictive_distribution_fn=predictive_distribution)
print(f"MNIST test loss: {final_loss:.4f}, MNIST test accuracy: {final_accuracy:.4f}")

In [None]:
# Use your function to evaluate your model on the MNIST-C dataset using the predictive distribution

final_loss, final_accuracy = pd_loss_and_accuracy(model, mnist_c_test_data, num_samples=5, 
                                                  predictive_distribution_fn=predictive_distribution)
print(f"MNIST-C test loss: {final_loss:.4f}, MNIST-C test accuracy: {final_accuracy:.4f}")

You should see an improvement in these scores, compared with the evaluation of your model above (using `model.evaluate`) which uses a single Monte Carlo sample.
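
One way to see why an improvement in the loss is expected, offered here as a sketch rather than a guarantee for any particular run, is Jensen's inequality applied to the convex function $-\log$. For the true label $y^*$,

$$
-\log\left(\frac{1}{K}\sum_{k=1}^K p(y^* \mid x^*, \theta_k)\right) \le \frac{1}{K}\sum_{k=1}^K \left[-\log p(y^* \mid x^*, \theta_k)\right],\qquad \theta_k \sim q_\phi(\theta).
$$

The left-hand side is the cross-entropy of the averaged prediction, and the right-hand side is the average of the $K$ single-sample cross-entropies. The bound concerns the average single-sample loss, so an individual single-sample evaluation can still score better by chance.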

#### Uncertainty quantification

One of the key advantages of using Bayesian neural networks is the ability to quantify the uncertainty in the predictions. Recall that the predictive entropy is calculated by computing the entropy of the predictive distribution, estimated using Monte Carlo samples:

$$
H(p(Y \mid x,\mathcal{D})) \approx H\left(\frac{1}{K} \sum_{k=1}^K p(y \mid x, \theta^{(k)}) \right),\qquad \theta^{(k)}\sim q_\phi(\theta),\label{predictive_entropy}\tag{1}
$$

where the entropy of a random variable $Y$ that can assume one of $n$ discrete states is given by

$$
H(Y) = \mathbb{E}_p[-\log p(y)] = -\sum_{i=1}^n p(y_i) \log p(y_i).\label{entropy}\tag{2}
$$

You should now complete the following `predictive_entropy` function, to compute the quantity given in equation \eqref{predictive_entropy}.

* The function takes the arguments `bayesian_model`, `inputs`, `num_samples` and `predictive_distribution_fn`
  * `bayesian_model` will be the trained Bayesian neural network from above
  * `inputs` is a batch of images of shape `(batch_size, 28, 28, 1)`
  * `num_samples` is an integer, for the number of Monte Carlo samples ($K$ in the above equation)
  * `predictive_distribution_fn` is a function used to compute the predictive distribution approximation. It has the signature defined above in the `predictive_distribution` function
* The `predictive_entropy` function should use the `predictive_distribution_fn` to compute the entropy of the predictive distribution
  * The predictive distribution should use `num_samples` Monte Carlo samples in the approximation
  * The entropy should be computed according to equation \eqref{entropy}
* The function should then return a Tensor of shape `(batch_size,)` for the predictive entropy values for each input in the batch
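
As a small worked example of equation (2), the NumPy sketch below computes the entropy of two hand-picked categorical distributions. The helper name `categorical_entropy` and the clipping constant are assumptions made for this illustration.

```python
import numpy as np

def categorical_entropy(probs, eps=1e-12):
    """Entropy in nats along the last axis, guarding against log(0)."""
    return -np.sum(probs * np.log(np.clip(probs, eps, 1.0)), axis=-1)

probs = np.array([
    [1.0, 0.0, 0.0, 0.0],        # fully confident prediction
    [0.25, 0.25, 0.25, 0.25],    # maximally uncertain prediction
])

entropies = categorical_entropy(probs)
print(entropies)
print(np.log(4.0))
```

The confident distribution has entropy 0 and the uniform one attains the maximum value $\log 4$ for four classes. Clipping the probabilities before taking the logarithm avoids the `0 * log(0)` case, which would otherwise evaluate to NaN.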

In [None]:
#### GRADED CELL ####

# Complete the following function. 
# Make sure to not change the function name or arguments.

def predictive_entropy(bayesian_model, inputs, num_samples, 
                       predictive_distribution_fn=predictive_distribution):
    """
    This function should compute the predictive entropy for a batch of inputs as described above.
    Your function should return the Tensor of predictive entropy values.
    """
    
    

In [None]:
# Test your function on a batch of inputs

for batch in mnist_train_data.take(1):
    images, _ = batch

pred_ent = predictive_entropy(model, images, num_samples=10, predictive_distribution_fn=predictive_distribution)

We have seen how the predictive entropy, conditional entropy and mutual information can be used to decompose the overall uncertainty into aleatoric and epistemic uncertainty.

$$
\underbrace{H(p(Y \mid x, \mathcal{D}))}_{\text{Predictive entropy}} = \underbrace{I(\theta; Y \mid \mathcal{D}, x)}_{\substack{\text{Mutual information/} \\ \text{Epistemic uncertainty}}} + \underbrace{\mathbb{E}_{q_\phi(\theta)} H(p(Y \mid x, \theta))}_{\substack{\text{Expected entropy/} \\ \text{Aleatoric uncertainty}}}.\label{uncertainty_decomposition}\tag{3}
$$

The aleatoric uncertainty is given by the expected entropy, which is calculated by computing an average entropy over output distributions:

$$
\mathbb{E}_{q_\phi(\theta)} H(p(Y \mid x, \theta)) \approx \frac{1}{K} \sum_{k=1}^K H\left(p(Y \mid x, \theta^{(k)}) \right),\qquad \theta^{(k)} \sim q_\phi(\theta)\label{expected_entropy}\tag{4}
$$

You should now complete the following `expected_entropy` function, to compute the quantity given in equation \eqref{expected_entropy}.

* The function takes the arguments `bayesian_model`, `inputs` and `num_samples`
  * `bayesian_model` will be the trained Bayesian neural network from above
  * `inputs` is a batch of images of shape `(batch_size, 28, 28, 1)`
  * `num_samples` is an integer, for the number of Monte Carlo samples ($K$ in the above equation)
* The `expected_entropy` function should use `num_samples` Monte Carlo samples to compute the quantity in equation \eqref{expected_entropy}
* The function should then return a Tensor of shape `(batch_size,)` for the expected entropy values for each input in the batch
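
The difference between equations (1) and (4) is the order of averaging and taking the entropy. The toy NumPy sketch below, in which the hand-written array `probs_per_sample` stands in for $K=2$ sampled output distributions, contrasts an input on which the samples agree with one on which they confidently disagree.

```python
import numpy as np

def categorical_entropy(probs, eps=1e-12):
    """Entropy in nats along the last axis, guarding against log(0)."""
    return -np.sum(probs * np.log(np.clip(probs, eps, 1.0)), axis=-1)

# Shape (K, batch_size, num_classes) = (2, 2, 2)
probs_per_sample = np.array([
    [[0.5, 0.5], [1.0, 0.0]],   # sample 1
    [[0.5, 0.5], [0.0, 1.0]],   # sample 2
])

pe = categorical_entropy(probs_per_sample.mean(axis=0))   # entropy of the average
ee = categorical_entropy(probs_per_sample).mean(axis=0)   # average of the entropies
mi = pe - ee                                              # mutual information

print("predictive entropy:", pe)
print("expected entropy:  ", ee)
print("mutual information:", mi)
```

Both inputs have the same predictive entropy, $\log 2$. For the first input every sample is itself uncertain, so the uncertainty is entirely aleatoric (mutual information 0). For the second input each sample is confident but the samples disagree, so the uncertainty is entirely epistemic (expected entropy 0).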

In [None]:
#### GRADED CELL ####

# Complete the following function. 
# Make sure to not change the function name or arguments.

def expected_entropy(bayesian_model, inputs, num_samples):
    """
    This function should compute the expected entropy for a batch of inputs as described above.
    Your function should return the Tensor of expected entropy values.
    """
    
    

In [None]:
# Test your function on a batch of inputs

for batch in mnist_train_data.take(1):
    images, _ = batch

exp_ent = expected_entropy(model, images, num_samples=10)

The following cells analyse the model's uncertainty on a batch of inputs from the MNIST and the MNIST-C dataset.

Each cell plots histograms of the predictive entropy, expected entropy and mutual information values on the batch, and displays the images with the highest uncertainty values for each of these quantities.

In [None]:
# Analyse model predictions on a batch of inputs from the MNIST dataset

for batch in mnist_train_data.take(1):
    images, labels = batch
    pe = predictive_entropy(model, images, num_samples=10,
                            predictive_distribution_fn=predictive_distribution)
    ee = expected_entropy(model, images, num_samples=10)
    mi = pe - ee
    pe, ee, mi = ops.convert_to_numpy(pe), ops.convert_to_numpy(ee), ops.convert_to_numpy(mi)
    images = ops.convert_to_numpy(images)
    # Keep only the examples for which none of the three quantities is NaN
    valid_inx = ~(np.isnan(pe) | np.isnan(ee) | np.isnan(mi))
    pe, ee, mi, images = pe[valid_inx], ee[valid_inx], mi[valid_inx], images[valid_inx]

plt.figure(figsize=(14, 4))
plt.subplot(1, 3, 1)
plt.hist(pe)
plt.title("Predictive entropy")
plt.subplot(1, 3, 2)
plt.hist(ee)
plt.title("Expected entropy / Aleatoric uncertainty")
plt.subplot(1, 3, 3)
plt.hist(mi)
plt.title("Mutual information / Epistemic uncertainty")
plt.show()

highest_pe = np.argsort(pe)[-1]
highest_ee = np.argsort(ee)[-1]
highest_mi = np.argsort(mi)[-1]

plt.figure(figsize=(14, 6))
plt.subplot(1, 3, 1)
plt.imshow(images[highest_pe])
plt.gca().set_axis_off()
plt.title(f"Highest predictive entropy: {pe[highest_pe]:.4f}")
plt.subplot(1, 3, 2)
plt.imshow(images[highest_ee])
plt.gca().set_axis_off()
plt.title(f"Highest aleatoric uncertainty: {ee[highest_ee]:.4f}")
plt.subplot(1, 3, 3)
plt.imshow(images[highest_mi])
plt.gca().set_axis_off()
plt.title(f"Highest epistemic uncertainty: {mi[highest_mi]:.4f}")
plt.show()

In [None]:
# Analyse model predictions on a batch of inputs from the MNIST-C dataset

for batch in mnist_c_train_data.take(1):
    images, labels = batch
    pe = predictive_entropy(model, images, num_samples=10,
                            predictive_distribution_fn=predictive_distribution)
    ee = expected_entropy(model, images, num_samples=10)
    mi = pe - ee
    pe, ee, mi = ops.convert_to_numpy(pe), ops.convert_to_numpy(ee), ops.convert_to_numpy(mi)
    images = ops.convert_to_numpy(images)
    # Keep only the examples for which none of the three quantities is NaN
    valid_inx = ~(np.isnan(pe) | np.isnan(ee) | np.isnan(mi))
    pe, ee, mi, images = pe[valid_inx], ee[valid_inx], mi[valid_inx], images[valid_inx]
    
plt.figure(figsize=(14, 4))
plt.subplot(1, 3, 1)
plt.hist(pe)
plt.title("Predictive entropy")
plt.subplot(1, 3, 2)
plt.hist(ee)
plt.title("Expected entropy / Aleatoric uncertainty")
plt.subplot(1, 3, 3)
plt.hist(mi)
plt.title("Mutual information / Epistemic uncertainty")
plt.show()

highest_pe = np.argsort(pe)[-1]
highest_ee = np.argsort(ee)[-1]
highest_mi = np.argsort(mi)[-1]

plt.figure(figsize=(14, 6))
plt.subplot(1, 3, 1)
plt.imshow(images[highest_pe])
plt.gca().set_axis_off()
plt.title(f"Highest predictive entropy: {pe[highest_pe]:.4f}")
plt.subplot(1, 3, 2)
plt.imshow(images[highest_ee])
plt.gca().set_axis_off()
plt.title(f"Highest aleatoric uncertainty: {ee[highest_ee]:.4f}")
plt.subplot(1, 3, 3)
plt.imshow(images[highest_mi])
plt.gca().set_axis_off()
plt.title(f"Highest epistemic uncertainty: {mi[highest_mi]:.4f}")
plt.show()

Congratulations on completing this week's assignment! In this assignment you have developed a Bayesian neural network classifier for the MNIST and MNIST-C datasets, and compared overall performance and uncertainty quantification on both of these datasets.