# Push Bayesian Deep Learning Tutorial

## Introduction

In this notebook, we introduce Bayesian Deep Learning (BDL) and demonstrate how to use Push to perform BDL by running a deep ensemble.

## Background: The Posterior Predictive Distribution

The goal of BDL methods is to compute the *posterior predictive distribution*
$$
p(y|x, \mathcal{D}) = \int p(y|x, \theta) p(\theta | \mathcal{D}) \, d\theta
$$
where $y$ is an output, $x$ is an input, $\theta$ are parameters, and $\mathcal{D} = (x_i, y_i)_{i}$ is a dataset. In general, this integral is intractable and must be approximated.

### Monte Carlo Approximation

We can approximate the posterior predictive distribution with a Monte Carlo approximation [1]: 
$$
p(y|x, \mathcal{D}) \approx \frac{1}{J} \sum_{j=1}^{J} p(y|x, \theta_j), \quad \theta_j \sim p(\theta | \mathcal{D})
$$
where each $\theta_j$ is a parameter setting drawn from the posterior and $J$ is the number of samples.
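
As a concrete check of this estimator, the sketch below uses a toy Beta-Bernoulli model, chosen because its posterior predictive has a closed form to compare against. The model, the prior, and the names `exact_predictive` and `monte_carlo_predictive` are illustrative assumptions and are not part of Push.

```python
import random


def exact_predictive(alpha, beta):
    """Closed-form p(y=1 | D) when the posterior over theta is Beta(alpha, beta)."""
    return alpha / (alpha + beta)


def monte_carlo_predictive(alpha, beta, num_samples, seed=0):
    """Estimate p(y=1 | D) by averaging p(y=1 | theta_j) = theta_j over posterior samples."""
    rng = random.Random(seed)
    samples = (rng.betavariate(alpha, beta) for _ in range(num_samples))
    return sum(samples) / num_samples


# Posterior after 7 successes and 3 failures under a uniform Beta(1, 1) prior.
alpha, beta = 1 + 7, 1 + 3
exact = exact_predictive(alpha, beta)
estimate = monte_carlo_predictive(alpha, beta, num_samples=10_000)
print(f"exact={exact:.4f}  monte_carlo={estimate:.4f}")
```

Increasing `num_samples` (the $J$ in the formula) tightens the estimate around the exact value; for neural networks no closed form is available, which is why the sampled average is used on its own.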

### Deep Ensembles

*Deep ensembles* train $J$ copies of the same neural network (NN), each from a different random initialization, which yields $J$ parameter settings $\theta_j$. Averaging $p(y | x, \theta_j)$ over these settings has the same form as the Monte Carlo estimate of the posterior predictive distribution, with the trained networks standing in for posterior samples. Deep ensembles are therefore a simple BDL method.
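
For regression, this average takes a particularly simple form. As a sketch, assume each member defines a Gaussian likelihood $p(y | x, \theta_j) = \mathcal{N}(y; f_{\theta_j}(x), \sigma^2)$ with a shared noise variance $\sigma^2$; this likelihood and the symbols $f_{\theta_j}$, $\mu_*$, and $\sigma_*^2$ are assumptions introduced here for illustration. The ensemble is then an equally weighted mixture of $J$ Gaussians with mean and variance
$$
\mu_*(x) = \frac{1}{J} \sum_{j=1}^{J} f_{\theta_j}(x), \qquad
\sigma_*^2(x) = \sigma^2 + \frac{1}{J} \sum_{j=1}^{J} \left( f_{\theta_j}(x) - \mu_*(x) \right)^2
$$
so the disagreement between members' predictions adds to the noise variance.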

## Deep Ensembles in Push

We now show how to train a deep ensemble with Push. Push interoperates with PyTorch, so datasets and models can be built from standard PyTorch components. Push also has concurrent execution semantics.

### Dataset

For illustrative purposes, we create a random dataset of `batch_size * N` points (that is, $N$ batches of `batch_size` points), each of dimension $D$, with scalar targets.

```py
from bdl import RandDataset

D = 1
batch_size = 128
N = 1
dataset = RandDataset(batch_size, N, D)
```

`RandDataset` is defined as follows:

```py
import torch
from torch.utils.data import Dataset


class RandDataset(Dataset):
    def __init__(self, batch_size, N, D):
        # N batches' worth of random inputs (dimension D) and scalar targets
        self.xs = torch.randn(batch_size * N, D)
        self.ys = torch.randn(batch_size * N, 1)

    def __len__(self):
        return len(self.xs)

    def __getitem__(self, idx):
        # Return one (input, target) pair
        return self.xs[idx], self.ys[idx]
```
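
To see how this dataset feeds a training loop, the sketch below wraps it in a standard PyTorch `DataLoader` and inspects the batches. It repeats the `RandDataset` definition so that it runs on its own; the values `N = 3` and `D = 2` are arbitrary choices for illustration.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class RandDataset(Dataset):
    """Same definition as above, repeated so this sketch is self-contained."""

    def __init__(self, batch_size, N, D):
        self.xs = torch.randn(batch_size * N, D)
        self.ys = torch.randn(batch_size * N, 1)

    def __len__(self):
        return len(self.xs)

    def __getitem__(self, idx):
        return self.xs[idx], self.ys[idx]


batch_size, N, D = 128, 3, 2  # illustrative values
dataset = RandDataset(batch_size, N, D)
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# One epoch yields N batches; each batch pairs inputs with targets.
batch_shapes = [(tuple(xs.shape), tuple(ys.shape)) for xs, ys in loader]
print(len(dataset), batch_shapes)
```

With the tutorial's setting of `N = 1`, the same loader would yield a single batch of 128 points per epoch.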

### Neural Network

We create a simple neural network (NN). It consists of two fully connected layers, each mapping $D$ inputs to $D$ outputs, with a ReLU activation between them.

```py
from bdl import MiniNN
```

`MiniNN` is defined as follows:

```py
import torch
from torch import nn


class MiniNN(nn.Module):
    def __init__(self, D):
        super().__init__()
        self.fc1 = nn.Linear(D, D)
        self.fc2 = nn.Linear(D, D)

    def forward(self, x):
        x = self.fc1(x)
        x = torch.relu(x)  # nonlinearity between the two layers
        x = self.fc2(x)
        return x
```

### Running a Deep Ensemble


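We train an ensemble of `num_ensembles` networks for `epochs` epochs on a single device, using a mean squared error loss:

```py
import torch
from torch.utils.data import DataLoader
import push.bayes.ensemble

dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

epochs = 10
num_ensembles = 4
push.bayes.ensemble.train_deep_ensemble(
    dataloader,
    torch.nn.MSELoss(),
    epochs,
    MiniNN, D,  # model class and its constructor argument
    num_devices=1,
    num_ensembles=num_ensembles
)
```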


The run prints an average loss for each of the 10 epochs. The loss barely changes over the run:

```
Average loss 0 tensor(1.6703)
Average loss 0 tensor(1.6702)
Average loss 0 tensor(1.6702)
Average loss 0 tensor(1.6702)
Average loss 0 tensor(1.6701)
Average loss 0 tensor(1.6701)
Average loss 0 tensor(1.6700)
Average loss 0 tensor(1.6700)
Average loss 0 tensor(1.6699)
Average loss 0 tensor(1.6699)
```

The call returns one list of tensors per ensemble member. With $D = 1$, each list holds four tensors whose shapes are consistent with the weight and bias of `fc1` followed by the weight and bias of `fc2`:

```
[[tensor([[-0.8776]]),
  tensor([-0.0283]),
  tensor([[-0.9058]]),
  tensor([-0.4190])],
 [tensor([[-0.1410]]), tensor([0.1787]), tensor([[0.9782]]), tensor([0.3471])],
 [tensor([[-0.7838]]), tensor([0.2805]), tensor([[0.0788]]), tensor([0.4403])],
 [tensor([[0.0092]]),
  tensor([-0.4970]),
  tensor([[-0.9888]]),
  tensor([0.5064])]]
```
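
The returned parameters can be turned back into networks to form an ensemble prediction. The sketch below assumes, based on the shapes printed above, that each inner list is ordered like `MiniNN.parameters()` (weight and bias of `fc1`, then of `fc2`); this ordering is an inference from the output, not documented Push behavior. It copies the printed values into fresh `MiniNN` instances and averages their outputs, mirroring the Monte Carlo average over $\theta_j$. The helper `build_member` and the test input are hypothetical.

```python
import torch
from torch import nn


class MiniNN(nn.Module):
    """Same architecture as in the tutorial, repeated so this sketch runs on its own."""

    def __init__(self, D):
        super().__init__()
        self.fc1 = nn.Linear(D, D)
        self.fc2 = nn.Linear(D, D)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))


# Parameter values copied from the output above (D = 1, four ensemble members).
ensemble_params = [
    [[[-0.8776]], [-0.0283], [[-0.9058]], [-0.4190]],
    [[[-0.1410]], [0.1787], [[0.9782]], [0.3471]],
    [[[-0.7838]], [0.2805], [[0.0788]], [0.4403]],
    [[[0.0092]], [-0.4970], [[-0.9888]], [0.5064]],
]


def build_member(values, D=1):
    """Hypothetical helper: copy one member's values into a fresh MiniNN.

    Assumes `values` follows the order of MiniNN.parameters().
    """
    model = MiniNN(D)
    with torch.no_grad():
        for param, value in zip(model.parameters(), values):
            param.copy_(torch.tensor(value))
    return model


members = [build_member(values) for values in ensemble_params]
x = torch.tensor([[1.0]])  # arbitrary test input

with torch.no_grad():
    predictions = torch.stack([member(x) for member in members])  # shape (J, 1, 1)

ensemble_mean = predictions.mean(dim=0)  # equal-weight average over members
ensemble_std = predictions.std(dim=0)    # spread across members
print(ensemble_mean, ensemble_std)
```

The mean plays the role of the ensemble's point prediction, while the standard deviation across members gives a rough indication of how much the parameter settings disagree at `x`.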

## References

[1] A.G. Wilson, P. Izmailov. Bayesian Deep Learning and a Probabilistic Perspective of Generalization. Advances in Neural Information Processing Systems, 2020.