# Bayesian Deep Learning

## Introduction

Neural networks are typically trained to find a single parameter setting, which yields a point estimate for a given input. Large models, such as deep neural networks, can represent the same function with many different parameter settings, and independently trained models generally arrive at different ones. If we train many models on the same data, we can treat the spread of their predictions (or parameter settings) as a distribution and quantify uncertainty with a measure of variation. This approach of training multiple models is called a deep ensemble, and it is the motivation for this notebook.

Reference: Kevin Murphy, *Probabilistic Machine Learning: Advanced Topics*, Chapter 17.1

### The Posterior Predictive Distribution

$$
p(y|x, \mathcal{D}) = \int p(y|x, \theta) p(\theta | \mathcal{D}) \, d\theta
$$

The posterior predictive distribution is a concept in Bayesian statistics that combines information from the observed data with the uncertainty in the model parameters. It provides a way to make predictions for new or future data points, taking into account what we have learned from the data already observed.

$p(y|x, \mathcal{D})$: This is the conditional probability of *y* given *x* and $\mathcal{D}$. It represents the probability of an outcome *y* given both the input *x* and the observed data $\mathcal{D}$.

$p(y|x,\theta)$: This is the likelihood function. It represents the probability of observing the outcome *y* given the input *x* and a specific value of the parameter $\theta$.

$p(\theta|\mathcal{D})$: This is the posterior distribution of the parameter $\theta$ given the data $\mathcal{D}$. It represents the updated probability distribution of $\theta$ after taking the observed data into account.

#### Posterior Distribution

*Note:* $p(\theta|\mathcal{D}) \propto p(\theta)\,p(\mathcal{D}|\theta)$

This expression is **Bayes' theorem**, written up to a normalizing constant.

$p(\theta|\mathcal{D})$: This is the posterior probability, which represents the probability of the parameter $\theta$ given the data $\mathcal{D}$.

$\propto$: Means "proportional to". In this context, the left-hand side equals the right-hand side up to a constant of proportionality, which ensures that the posterior integrates to 1.

$p(\theta)$: This is the **prior** probability, representing the initial belief about the parameter $\theta$ before observing any data.

$p(\mathcal{D}|\theta)$: This is the **likelihood**, which represents the probability of observing the data $\mathcal{D}$ given a specific value of the parameter $\theta$. It describes how well the model with parameter $\theta$ explains the observed data.

Putting this all together, Bayes' theorem states that the posterior probability $p(\theta|\mathcal{D})$ is proportional to the product of the prior probability of the parameter, $p(\theta)$, and the likelihood of observing the data given that parameter, $p(\mathcal{D}|\theta)$. This gives the following mathematical expression:

$p(\theta|\mathcal{D}) \propto p(\theta)\,p(\mathcal{D}|\theta)$
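
As a concrete, low-dimensional illustration of this proportionality, the sketch below computes a posterior on a grid for a coin-flip model. The coin-flip setup, the uniform prior, the made-up flips, and the function name `grid_posterior` are assumptions chosen for illustration and are not part of this notebook's pipeline. Dividing by the sum over the grid plays the role of the normalizing constant hidden by the "proportional to" sign.

```python
# Grid approximation of the posterior: prior times likelihood, then normalize.
# theta is the unknown probability of heads; flips is the observed data (1 = heads).

def grid_posterior(flips, num_points=101):
    """Return (grid, posterior), with posterior normalized to sum to 1 over the grid."""
    grid = [i / (num_points - 1) for i in range(num_points)]
    heads = sum(flips)
    tails = len(flips) - heads
    prior = [1.0 for _ in grid]                                   # uniform prior p(theta)
    likelihood = [t ** heads * (1.0 - t) ** tails for t in grid]  # p(D | theta)
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(unnormalized)                                  # normalizing constant
    posterior = [u / evidence for u in unnormalized]
    return grid, posterior


flips = [1, 1, 0, 1, 1, 0, 1, 1]  # made-up data: 6 heads, 2 tails
grid, posterior = grid_posterior(flips)
theta_map = grid[posterior.index(max(posterior))]
print(f"posterior mode: {theta_map:.2f}")
```

With a uniform prior, the posterior mode coincides with the maximum-likelihood estimate, here 6/8 = 0.75. A non-uniform prior would pull the mode toward the region the prior favors.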

$$
p(y|x, \mathcal{D}) = \int p(y|x, \theta) p(\theta | \mathcal{D}) \, d\theta
$$
This expression states that the conditional probability of *y* given *x* and $\mathcal{D}$ can be calculated by integrating the product of the likelihood $p(y|x, \theta)$ and the posterior distribution $p(\theta|\mathcal{D})$ over all possible values of $\theta$. This is what we term the posterior predictive distribution.
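
For neural networks this integral is generally intractable. A common approximation replaces it with an average over a finite set of parameter settings $\theta_1, \dots, \theta_M$, that is, $p(y|x, \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^{M} p(y|x, \theta_m)$. A deep ensemble uses its independently trained members as those settings. The following is a minimal sketch of that average for a classifier. The logits are made up, and the helper names (`softmax`, `ensemble_predictive`, `entropy`) are hypothetical and not taken from Push.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predictive(member_logits):
    """Average per-member class probabilities: (1/M) * sum_m p(y | x, theta_m)."""
    member_probs = [softmax(logits) for logits in member_logits]
    num_members = len(member_probs)
    num_classes = len(member_probs[0])
    return [sum(p[c] for p in member_probs) / num_members for c in range(num_classes)]


def entropy(probs):
    """Shannon entropy in nats; higher means a more uncertain prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical logits from three ensemble members for a single input x (3 classes).
agreeing = [[4.0, 0.0, 0.0], [3.5, 0.2, 0.1], [4.2, -0.1, 0.3]]
disagreeing = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]

p_agree = ensemble_predictive(agreeing)
p_disagree = ensemble_predictive(disagreeing)
print("agreeing:   ", [round(p, 3) for p in p_agree], "entropy", round(entropy(p_agree), 3))
print("disagreeing:", [round(p, 3) for p in p_disagree], "entropy", round(entropy(p_disagree), 3))
```

When the members agree, the averaged distribution stays peaked and its entropy is low. When they disagree, the average flattens and the entropy rises, which is one simple way to read uncertainty off an ensemble.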

### Prior


Choosing a prior can be done explicitly or implicitly. In the deep ensemble case, we can treat the architecture of the model itself as a prior. The number of parameters and hidden layers can be varied to create more distinct models, while different initial parameter values and random seeds introduce variation between otherwise identical training runs.
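
To make the role of the random seed concrete, here is a small sketch in which the only difference between ensemble members is the seed used for initialization. It is an assumption-laden toy: a cubic polynomial with four parameters stands in for a neural network, there are only two training points, and the names `features`, `predict`, and `train_member` are hypothetical. Because the model has more parameters than data points, each seed converges to a different parameter setting that fits the training data equally well.

```python
import random
import statistics


def features(x):
    """Cubic polynomial features; 4 parameters for only 2 training points."""
    return [1.0, x, x ** 2, x ** 3]


def predict(weights, x):
    return sum(w * f for w, f in zip(weights, features(x)))


def train_member(seed, data, steps=500, lr=0.05):
    """Fit one member by gradient descent on mean squared error from a seeded start."""
    rng = random.Random(seed)
    weights = [rng.gauss(0.0, 1.0) for _ in range(4)]
    for _ in range(steps):
        grads = [0.0] * 4
        for x, y in data:
            error = predict(weights, x) - y
            for i, f in enumerate(features(x)):
                grads[i] += 2.0 * error * f / len(data)
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


data = [(-1.0, -1.0), (1.0, 1.0)]  # made-up training set
members = [train_member(seed, data) for seed in range(5)]

for x in (1.0, 2.0):  # a training input, then an input outside the training range
    preds = [predict(w, x) for w in members]
    print(f"x = {x}: mean {statistics.mean(preds):.3f}, "
          f"std across members {statistics.pstdev(preds):.3f}")
```

The members agree at the training input and spread out at the unseen input. That spread is the quantity a deep ensemble uses as its uncertainty signal.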

## Running a Deep Ensemble in Push

In [5]:
import torch
import torchvision
from torch import nn, optim, autograd
from torch.nn import functional as F
from torchvision import transforms, datasets
import numpy as np
from sklearn.metrics import roc_auc_score
import scipy
from utils.LB_utils import * 
import utils.LB_utils_special as LB_utils_special
from utils.load_not_MNIST import notMNIST
import os
import time
import matplotlib.pyplot as plt
import laplace

s = 1
np.random.seed(s)
torch.manual_seed(s)
torch.cuda.manual_seed(s)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

In [6]:
# Load ImageNet
# imagenet_train_root = os.path.abspath('your imagenet path')
# imagenet_val_root = os.path.abspath('your imagenet path')

# Absolute path to the ImageNet directory; adjust this for your machine.
imagenet_directory = os.path.abspath("/usr/data1/imagenet")

# print(imagenet_directory)

# Training and validation splits live in subdirectories of the ImageNet root.
imagenet_train_root = os.path.join(imagenet_directory, 'train')
imagenet_val_root = os.path.join(imagenet_directory, 'val')


normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])

transform_imagenet_train = transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            normalize,
        ])

transform_imagenet_val =  transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            normalize,
        ])

imagenet_train = datasets.ImageFolder(imagenet_train_root, transform=transform_imagenet_train)
indices_small = np.random.choice(np.arange(0, len(imagenet_train)), size=(20000,), replace=False)
imagenet_train_small = torch.utils.data.Subset(imagenet_train, torch.tensor(indices_small))
imagenet_val = datasets.ImageFolder(imagenet_val_root, transform=transform_imagenet_val)

train_loader = torch.utils.data.DataLoader(
        imagenet_train_small,
        batch_size=16,
        shuffle=True)

train_loader_16 = torch.utils.data.DataLoader(
        imagenet_train_small,
        batch_size=16,
        shuffle=True)

train_loader_64 = torch.utils.data.DataLoader(
        imagenet_train_small,
        batch_size=64,
        shuffle=True)

train_loader_128 = torch.utils.data.DataLoader(
        imagenet_train_small,
        batch_size=128,
        shuffle=True)

val_loader = torch.utils.data.DataLoader(
        imagenet_val,
        batch_size=64)

In [7]:
import matplotlib
import tueplots
from tueplots import bundles
plt.rcParams.update(tueplots.bundles.icml2022())

print(plt.rcParams['figure.figsize'])
figwidth = plt.rcParams['figure.figsize'][0]
figheight = plt.rcParams['figure.figsize'][1]
matplotlib.rcParams['font.family'] = "serif"
matplotlib.rcParams['font.serif'] = 'Times New Roman'
matplotlib.rcParams['text.usetex'] = True
print(figheight)
print(figwidth)

[3.25, 2.0086104634371584]
2.0086104634371584
3.25


In [8]:
import models
from models.densenet import densenet121
import torch
from torch.utils.data import DataLoader
import push.bayes.ensemble

#load densenet

# densenet = densenet121(pretrained=False).cuda()
# densenet.eval()

epochs = 1

two_particle_params = push.bayes.ensemble.train_deep_ensemble(
        train_loader_128,
        torch.nn.CrossEntropyLoss(),
        epochs,
        densenet121, False,
        num_devices=2,
        num_ensembles=2,
    )

100%|██████████| 1/1 [01:44<00:00, 104.80s/it]


AttributeError: 'ReceiveFuncAckPDMSG' object has no attribute 'result'