# Tutorial: Introduction to Bayesian Neural Networks

This tutorial introduces Bayesian neural networks, which combine two lines of work: the extensive research on neural networks, and Bayesian analysis, the study of how our preconceptions change when confronted with new data. In a Bayesian neural network, hyperparameters and "prior" knowledge are encoded in the setup of the weights. In its simplest form, one takes distributions from a favorite package and puts them over the weights of the neural network.

<img src="presentation/imgs/BayesianNN.png" width="500" height="170" />

As in the probabilistic programming tutorial, the search for a posterior relies on sampling-heavy routines. Embedding a Bayesian neural network inside a probabilistic programming language (PPL) such as Turing allows Turing's advanced inference algorithms to sample from the probabilistic model and obtain a posterior. We also face the same dilemma as in the amortized inference example: we must weigh our desire for accuracy against the available compute resources and the desire to reduce the influence of the priors on the training outcome.

### Main advantages of Bayesian Neural Networks:

- Ability to include and quantify uncertainties
- Improved robustness against adversarial examples

### Downside of Bayesian Neural Networks:

- Computational cost

For a brief walk through the mathematical formalism, please have a look at [this](https://davidstutz.de/a-short-introduction-to-bayesian-neural-networks/) blog entry. For a much deeper look into the topic, see Radford M. Neal's PhD thesis [Bayesian Learning for Neural Networks](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.446.9306&rep=rep1&type=pdf).
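
As a compact reminder of the two quantities this tutorial works with, here they are in standard notation (the symbols are chosen for this summary and are not taken from the linked references): the posterior over the weights $\theta$ given the data $\mathcal{D}$, and the posterior predictive distribution for a new input $x^\ast$.

```latex
p(\theta \mid \mathcal{D})
  = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}
         {\int p(\mathcal{D} \mid \theta')\, p(\theta')\, d\theta'},
\qquad
p(t^\ast \mid x^\ast, \mathcal{D})
  = \int p(t^\ast \mid x^\ast, \theta)\, p(\theta \mid \mathcal{D})\, d\theta
```

Neither integral is tractable for a neural network, which is why the sections below approximate them with samples.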


## Outline:

**Section 1.** [A First Bayesian Neural Network](#first)

**Section 2.** [Generic Bayesian Neural Networks](#generic)

**Section 3.** [Exercise - Different types of Bayesian Neural Networks](#ex)



In [None]:
using Turing, Flux, Plots, Random

In [None]:
# Hide sampling progress
Turing.turnprogress(false);

# Use reverse_diff due to the number of parameters in the neural network
Turing.setadbackend(:reverse_diff);

# 1: A First Bayesian Neural Network <a name="first"></a>

Generate an artificial dataset with its points arranged in a box-like pattern

In [None]:
# Number of points
N = 80
M = round(Int, N / 4)
Random.seed!(1234)

# Generate artificial data
x1s = rand(M) * 4.5; x2s = rand(M) * 4.5;
xt1s = Array([[x1s[i] + 0.5; x2s[i] + 0.5] for i = 1:M])
x1s = rand(M) * 4.5; x2s = rand(M) * 4.5;
append!(xt1s, Array([[x1s[i] - 5; x2s[i] - 5] for i = 1:M]))

x1s = rand(M) * 4.5; x2s = rand(M) * 4.5;
xt0s = Array([[x1s[i] + 0.5; x2s[i] - 5] for i = 1:M])
x1s = rand(M) * 4.5; x2s = rand(M) * 4.5;
append!(xt0s, Array([[x1s[i] - 5; x2s[i] + 0.5] for i = 1:M]))

# Store all data for later use
xs = [xt1s; xt0s]
ts = [ones(2*M); zeros(2*M)];
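
The points are stored as a vector of 2-element vectors, while the model below receives them as a matrix built with `hcat(xs...)`. A minimal sketch of that conversion on a hypothetical three-point stand-in (`pts` and `X` are illustrative names, not part of the tutorial):

```julia
# Three 2-element points, stored like `xs` as a vector of vectors
pts = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Splatting into hcat places each point in its own column
X = hcat(pts...)   # 2×3 matrix: rows are coordinates, columns are points
```

With the full dataset this should give a matrix with 2 rows and one column per data point, which is the layout a dense layer with two inputs expects.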

Visualize the artificial dataset

In [None]:
# Plot data points
function plot_data()
    x1 = map(e -> e[1], xt1s)
    y1 = map(e -> e[2], xt1s)
    x2 = map(e -> e[1], xt0s)
    y2 = map(e -> e[2], xt0s)
    
    Plots.scatter(x1, y1, color="red", clim = (0, 1))
    Plots.scatter!(x2, y2, color="blue", clim = (0, 1))
end

plot_data()

### 1.1: Create a neural network with two hidden layers and one output layer

Define a helper function for the training of neural networks and subsequently construct the neural network in Flux

In [None]:
# Turn a vector into a set of weights and biases
function unpack(nn_params::AbstractVector)
    W1 = reshape(nn_params[1:6], 3, 2);
    b1 = reshape(nn_params[7:9], 3)
    
    W2 = reshape(nn_params[10:15], 2, 3);
    b2 = reshape(nn_params[16:17], 2)
    
    W0 = reshape(nn_params[18:19], 1, 2);
    b0 = reshape(nn_params[20:20], 1)
    return W1, b1, W2, b2, W0, b0
end

# Construct the neural network with Flux and predict its output
function nn_forward(xs, nn_params::AbstractVector)
    W1, b1, W2, b2, W0, b0 = unpack(nn_params)
    nn = Chain(Dense(W1, b1, tanh),
               Dense(W2, b2, tanh),
               Dense(W0, b0, σ))
    return nn(xs)
end;
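
To make explicit what the `Chain` of `Dense` layers computes, here is a hand-written sketch of the same forward pass in plain Julia, without Flux. The names `sigmoid` and `forward_plain` are illustrative assumptions; the index ranges mirror `unpack` above.

```julia
# Logistic function, playing the role of Flux's σ
sigmoid(z) = 1 / (1 + exp(-z))

# Hand-written equivalent of the three dense layers
function forward_plain(x::AbstractVector, p::AbstractVector)
    W1 = reshape(p[1:6], 3, 2);   b1 = p[7:9]
    W2 = reshape(p[10:15], 2, 3); b2 = p[16:17]
    W0 = reshape(p[18:19], 1, 2); b0 = p[20:20]
    h1 = tanh.(W1 * x .+ b1)        # first hidden layer, 3 units
    h2 = tanh.(W2 * h1 .+ b2)       # second hidden layer, 2 units
    return sigmoid.(W0 * h2 .+ b0)  # output: probability of class 1
end

p0 = zeros(20)                      # all-zero weights and biases
y0 = forward_plain([1.0, -2.0], p0)
```

With all parameters set to zero, every hidden activation is `tanh(0) = 0`, so the output is `sigmoid(0) = 0.5` regardless of the input.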

### 1.2: Build the probabilistic model

Create the probabilistic model encapsulating the Bayesian neural network, where the prior comes from a multivariate normal distribution

In [None]:
# Create a regularization term and the corresponding Gaussian prior standard deviation
alpha = 0.09
sig = sqrt(1.0 / alpha)

# Specify the probabilistic model
@model bayes_nn(xs, ts) = begin
    # Create the weight and bias vector
    nn_params ~ MvNormal(zeros(20), sig .* ones(20))
    
    # Calculate predictions for the inputs given the weights and biases in nn_params
    preds = nn_forward(xs, nn_params)
    
    # Observe each prediction
    for i = 1:length(ts)
        ts[i] ~ Bernoulli(preds[i])
    end
end;
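
Written out, the model above corresponds to the following unnormalized log posterior. This is a sketch under the assumption that the vector passed to `MvNormal` is interpreted as per-component standard deviations; here $t_n$ is `ts[n]`, $y_n$ is `preds[n]`, $\sigma$ is `sig`, and $f$ denotes the network.

```latex
\log p(\theta \mid \mathcal{D})
  = \sum_{n=1}^{N} \Big[ t_n \log y_n + (1 - t_n) \log (1 - y_n) \Big]
    - \frac{\alpha}{2} \lVert \theta \rVert^2 + \text{const},
\qquad
y_n = f(x_n; \theta), \quad \sigma^2 = \frac{1}{\alpha}
```

The first term is the Bernoulli log-likelihood and the second is the log of the Gaussian prior, which has the form of an L2 penalty on the weights. This is why `alpha` is described as a regularization term.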

Perform inference using the Hamiltonian Monte Carlo (HMC) algorithm, drawing 5000 samples

In [None]:
# Perform inference on the Bayesian neural network
N = 5000
ch = sample(bayes_nn(hcat(xs...), ts), HMC(0.05, 4), N);
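
The two arguments of `HMC(0.05, 4)` are the leapfrog step size and the number of leapfrog steps per proposal. The sketch below illustrates only that integrator on a toy standard normal target; it is not Turing's implementation and omits momentum resampling and the accept/reject step. All names (`leapfrog`, `grad_logp`, `hamiltonian`) are illustrative.

```julia
# One HMC trajectory: `n_leapfrog` leapfrog steps of size `ϵ`
function leapfrog(θ, r, grad_logp, ϵ, n_leapfrog)
    θ, r = copy(θ), copy(r)
    for _ in 1:n_leapfrog
        r .+= (ϵ / 2) .* grad_logp(θ)   # half step for the momentum
        θ .+= ϵ .* r                    # full step for the position
        r .+= (ϵ / 2) .* grad_logp(θ)   # second half step for the momentum
    end
    return θ, r
end

# Toy target: standard normal, log p(θ) = -θ'θ / 2, so the gradient is -θ
grad_logp(θ) = -θ
hamiltonian(θ, r) = (sum(abs2, θ) + sum(abs2, r)) / 2

θ0, r0 = [1.0, -0.5], [0.3, 0.8]
θ1, r1 = leapfrog(θ0, r0, grad_logp, 0.05, 4)   # same settings as HMC(0.05, 4)
energy_error = abs(hamiltonian(θ1, r1) - hamiltonian(θ0, r0))

# Leapfrog is time-reversible: flipping the momentum retraces the path
θ_back, r_back = leapfrog(θ1, -r1, grad_logp, 0.05, 4)
```

A small step size keeps the energy error, and hence the rejection rate, low, while more leapfrog steps move each proposal further at the cost of more gradient evaluations.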

Retrieve the posterior values for the weights and biases from the sampled chain

In [None]:
# Extract weights and biases
theta = ch[:nn_params].value.data;

### 1.3: Maximum a posteriori (MAP) estimation

Find the set of weights with the highest log posterior in the chain, then plot its predictions on top of the dataset

In [None]:
plot_data()

# Find index with highest log posterior in the chain
_, i = findmax(ch[:lp].value.data)

# Extract max row value
i = i.I[1]

# Plot the predictions of the network with the MAP weights
x_range = collect(range(-6, stop=6, length=25))
y_range = collect(range(-6, stop=6, length=25))
Z = [nn_forward([x, y], theta[i, :])[1] for x=x_range, y=y_range]
contour!(x_range, y_range, Z)
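
The index handling above can be sketched on a small stand-in array. This assumes that `ch[:lp].value.data` is a three-dimensional array laid out as iterations × parameters × chains, so that `findmax` returns a `CartesianIndex` whose first component is the iteration. The values in `lp` are made up for illustration.

```julia
# Stand-in for `ch[:lp].value.data`: 4 iterations × 1 parameter × 1 chain
lp = reshape([-12.5, -3.1, -7.4, -3.9], 4, 1, 1)

best_lp, idx = findmax(lp)   # idx is a CartesianIndex{3}
row = idx.I[1]               # first component: iteration with the highest lp
```

Note that the best sample in a finite chain only approximates the true MAP estimate, since the sampler is not an optimizer.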

Instead of a single set of weights, average the network's predictions over weights drawn from the MCMC chain

In [None]:
# Return the average predicted value across every 10th weight sample up to `num`
function nn_predict(x, theta, num)
    mean([nn_forward(x, theta[i, :])[1] for i in 1:10:num])
end;
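
The same thinned samples can also be used to quantify uncertainty, not just to average. The sketch below replaces the network with a hypothetical one-weight logistic model (`predict_one`) and uses made-up "posterior samples", purely to show the pattern of thinning, averaging, and measuring the spread.

```julia
using Statistics

# Stand-in for nn_forward(x, theta[i, :])[1]: a logistic model with one weight
predict_one(x, w) = 1 / (1 + exp(-w * x))

# Hypothetical posterior samples of the single weight (rows of theta in the tutorial)
samples = collect(range(0.5, stop=1.5, length=100))

thinned = samples[1:10:end]                      # keep every 10th sample
preds = [predict_one(2.0, w) for w in thinned]   # one prediction per kept sample

pred_mean = mean(preds)   # posterior predictive mean, as in nn_predict
pred_std = std(preds)     # spread across samples: a simple uncertainty measure
```

A large spread at a given input indicates that the sampled networks disagree there, which is typically the case far from the training data.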

In [None]:
# Plot average prediction
plot_data()

n_end = 1500
x_range = collect(range(-6, stop=6, length=25))
y_range = collect(range(-6, stop=6, length=25))
Z = [nn_predict([x, y], theta, n_end)[1] for x=x_range, y=y_range]
contour!(x_range, y_range, Z)

Using the ability to animate the sampling process, we visualize how the predictions change over the course of sampling

In [None]:
# Plot the evolution of the network's predictive power
n_end = 500

anim = @animate for i=1:n_end
    plot_data()
    Z = [nn_forward([x, y], theta[i, :])[1] for x=x_range, y=y_range]
    contour!(x_range, y_range, Z, title="Iteration $i", clim=(0, 1))
end every 5

gif(anim, "/tmp/jl_ozeq2f.gif", fps=15)

# 2: Generic Bayesian Neural Networks <a name="generic"></a>

We introduce a more general setup of the neural network, which will allow us to change subparts of the network later on in the exercises. The main constraint is that we are still limited to purely `Dense` layers and have to refrain from using more advanced cells in this framework.

### 2.1: More generalized framework

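We govern the shape of the entire network through `network_shape`, where each entry gives a layer's number of outputs, number of inputs, and activation function. The unpacking helper, the forward pass, and the probabilistic model are then defined in terms of this shape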

In [None]:
# Specify the network architecture made up of 'dense' layers
network_shape = [
    (3, 2, :tanh),
    (2, 3, :tanh),
    (1, 2, :σ)
]

# Regularization, prior standard deviation & total number of parameters
alpha = 0.09
sig = sqrt(1.0 / alpha)
# Each layer has rows * cols weights plus rows biases
num_params = sum([rows * cols + rows for (rows, cols, _) in network_shape])

# Generate a series of vectors given the network shape
function unpack(θ::AbstractVector, network_shape::AbstractVector)
    index = 1
    weights = []
    biases = []
    for layer in network_shape
        rows, cols, _ = layer
        size = rows * cols
        last_index_w = size + index - 1
        last_index_b = last_index_w + rows
        push!(weights, reshape(θ[index:last_index_w], rows, cols))
        push!(biases, reshape(θ[last_index_w + 1:last_index_b], rows))
        index = last_index_b + 1
    end
    return weights, biases
end

# Generate the neural network given a shape and return a prediction
function nn_forward(x, θ::AbstractVector, network_shape::AbstractVector)
    weights, biases = unpack(θ, network_shape)
    layers = []
    for i in eachindex(network_shape)
        push!(layers, Dense(weights[i],
                biases[i],
                eval(network_shape[i][3])))
    end
    nn = Chain(layers...)
    return nn(x)
end

# General Turing specification for a BNN
@model bayes_nn(xs, ts, network_shape, num_params) = begin
    θ ~ MvNormal(zeros(num_params), sig .* ones(num_params))
    preds = nn_forward(xs, θ, network_shape)
    for i = 1:length(ts)
        ts[i] ~ Bernoulli(preds[i])
    end
end

# Set the backend
Turing.setadbackend(:reverse_diff);
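
As a sanity check on the bookkeeping, the sketch below recomputes the parameter count and the index ranges that the generic `unpack` reads for each layer. The names `shape`, `per_layer`, and `layer_ranges` are illustrative and not part of the tutorial.

```julia
# Each layer is (rows, cols, activation): rows = outputs, cols = inputs
shape = [(3, 2, :tanh), (2, 3, :tanh), (1, 2, :σ)]

# Weights (rows * cols) plus biases (rows) per layer
per_layer = [rows * cols + rows for (rows, cols, _) in shape]
total = sum(per_layer)

# Index ranges read for each layer's weights and biases
function layer_ranges(shape)
    ranges = Tuple{UnitRange{Int},UnitRange{Int}}[]
    index = 1
    for (rows, cols, _) in shape
        w = index:(index + rows * cols - 1)   # weight entries
        b = (last(w) + 1):(last(w) + rows)    # bias entries follow the weights
        push!(ranges, (w, b))
        index = last(b) + 1
    end
    return ranges
end

ranges = layer_ranges(shape)
```

For this shape the ranges coincide with the hard-coded slices in the first `unpack` (`1:6`, `7:9`, `10:15`, `16:17`, `18:19`, `20:20`), so both versions describe the same 20-parameter network.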

Use Hamiltonian Monte Carlo to sample from the probabilistic neural network model and obtain samples from the posterior

In [None]:
# Perform inference
num_samples = 10000
ch2 = sample(bayes_nn(hcat(xs...), ts, network_shape, num_params), HMC(0.05, 4), num_samples);

Make predictions based on the network shape

In [None]:
# Make predictions based on network shape
function nn_predict(x, theta, num, network_shape)
    mean([nn_forward(x, theta[i, :], network_shape)[1] for i in 1:10:num])
end;

# Extract θ from the sampled chain
params2 = ch2[:θ].value.data;

Plot the results

In [None]:
# Plot the prediction
plot_data()

x_range = collect(range(-6, stop=6, length=25))
y_range = collect(range(-6, stop=6, length=25))
Z = [nn_predict([x, y], params2, length(ch2), network_shape)[1] for x=x_range, y=y_range]
contour!(x_range, y_range, Z)

# 3: Exercise - Different types of Bayesian Neural Networks <a name="ex"></a>

- Experiment with deep Bayesian neural networks
    - How does the inference engine scale with changes to the neural network?
- Experiment with different kinds of noise by rewriting the probabilistic model using the above examples and Turing's [library](https://github.com/TuringLang/Turing.jl/blob/master/src/stdlib/distributions.jl).
- How many samples do I need across networks to arrive at an expressive posterior?
- Change the inference algorithm employed by Turing to a variational inference based one
- Transfer the linear regression and logistic regression frameworks from the previous tutorials into the Bayesian machine learning setting