# Bayesian Neural Networks

## Useful literature
* [Financial forecasting with probabilistic programming and pyro (Medium)](https://alexrachnog.medium.com/financial-forecasting-with-probabilistic-programming-and-pyro-db68ab1a1dba)
    * Author has several tutorials for financial forecasting with neural nets
* [Bayesian time series forecasting (Medium)](https://medium.com/geekculture/bayesian-time-series-forecasting-c8e1928d34d4)
* [Hands-on Bayesian for deep learning (Youtube)](https://www.youtube.com/watch?v=T5TPaI5H4q8)
* https://www.coursera.org/learn/probabilistic-deep-learning-with-tensorflow2
* https://www.coursera.org/lecture/probabilistic-deep-learning-with-tensorflow2/probabilistic-layers-xBWQh
* [What uncertainties tell you in bayesian NNs (Towardsdatascience)](https://towardsdatascience.com/what-uncertainties-tell-you-in-bayesian-neural-networks-6fbd5f85648e)

### Relevant libraries
* TensorFlow Probability
* TensorFlow BNN
* PyTorch
* Pyro (builds on PyTorch)
* Edward
* PyMC3
* InferPy

## [1] Introduction to Bayesian Inference

#### Contents:
1. Motivation
2. Uncertainty quantification<br>
    2.1 Aleatoric and epistemic uncertainty

### [1.1] Motivation
* Quantify the uncertainty of a model's predictions for more sound decision making
* Better at avoiding overfitting, which is a common problem with regular NNs
* Explainability
* Allows separation of aleatoric and epistemic uncertainty
* Financial time series are very noisy, which makes regular NNs more prone to overfitting, e.g., finding spurious trends in the data


### [1.2] Uncertainty quantification
Model (epistemic) vs data uncertainty (aleatoric):
* Aleatoric: Inherent uncertainty in the data, i.e., statistical uncertainty; irreducible
* Epistemic: Knowledge uncertainty, i.e., uncertainty about the model itself; reducible with more data

Predictive uncertainty is the sum of the aleatoric and epistemic uncertainty, where the latter can be viewed as a distribution over the model parameters.
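
When uncertainty is measured as variance, this sum corresponds to the law of total variance. The sketch below is only an illustration under assumed numbers: the ensemble of 200 members, their predicted means, and the fixed noise variance of $0.5^2$ are all hypothetical stand-ins for draws from a distribution over model parameters.

```python
import random
import statistics

random.seed(0)

# Hypothetical ensemble: each member stands for one draw of the weights.
# For a given input x, member i predicts a Gaussian N(mean_i, var_i).
members = [(random.gauss(2.0, 0.3), 0.5 ** 2) for _ in range(200)]

means = [m for m, _ in members]
variances = [v for _, v in members]

# Aleatoric part: average noise variance predicted by the members.
aleatoric = statistics.fmean(variances)
# Epistemic part: spread of the predicted means across members.
epistemic = statistics.pvariance(means)
# Law of total variance: the predictive variance is the sum of both parts.
predictive = aleatoric + epistemic

print(f"aleatoric={aleatoric:.3f} epistemic={epistemic:.3f} total={predictive:.3f}")
```

Collecting more training data would be expected to shrink the epistemic term (the members agree more), while the aleatoric term stays put.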

## [2] Bayesian Neural Nets

### Note about priors and posterior
<b>Priors</b> are the initial distribution of the parameters, e.g., the network weights: $P(w)$<br> 
<b>Posteriors</b> are the distribution of the parameters given the evidence/data; $P(w|D)$
___
General Bayesian:
$ P(H|D) = \frac{P(D|H)P(H)}{P(D)} = \frac{P(D,H)}{\int_{H}P(D,H^{'})dH^{'}} $ <br>
Notations:
* H: Hypothesis (network weights)
* D: Evidence/Data 
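
As a small worked example of the formula, the sketch below applies it to a discrete set of hypotheses, so the integral in the denominator becomes a sum. The coin-flip setup (three candidate biases, 8 heads in 10 flips, uniform prior) is an assumption made purely for illustration.

```python
# Hypothetical setup: three candidate coin biases and 8 heads in 10 flips.
hypotheses = [0.3, 0.5, 0.7]
prior = [1 / 3, 1 / 3, 1 / 3]              # P(H)
heads, flips = 8, 10

# Likelihood P(D|H) up to the binomial coefficient, which cancels out.
likelihood = [h ** heads * (1 - h) ** (flips - heads) for h in hypotheses]

# Joint P(D, H) = P(D|H) P(H); the evidence P(D) is its sum over H.
joint = [l * p for l, p in zip(likelihood, prior)]
evidence = sum(joint)
posterior = [j / evidence for j in joint]  # P(H|D)

for h, p in zip(hypotheses, posterior):
    print(f"P(H={h} | D) = {p:.3f}")
```

For network weights the hypothesis space is continuous and high-dimensional, which is why the evidence can no longer be computed by direct summation.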

Predictions:
* Marginal probability distribution $P(y|x, D)$ (cond. probability of labels given data inputs <b>x</b> and training data <b>D</b>) quantifies model's uncertainty on prediction
* Using a Monte Carlo approach, the final prediction can be obtained by sampling and averaging the marginal prediction distribution

Problems:
* Need for prior belief for weights, i.e., $P(H)$

### [2.1] Stochastic Neural Nets
___
<b> Stochastic neural nets as ensemble models </b>

Introducing randomness into neural net can be done in two ways:
* Stochastic activation function
* Stochastic weights<br>

This achieves a network capable of simulating multiple different models $\theta$ with probability distribution $p(\theta)$. Thus, they can be considered as a special case of ensemble learning.
<br>
By comparing predictions from multiple different models it is possible to obtain a measure of uncertainty. If the models agree on the results, the uncertainty is low; if they disagree, it is high.
___
<b> Predictions </b>

The marginal distribution of a prediction, $p(y|x,D)$, quantifies the uncertainty of a given prediction. The posterior distribution of the model parameters, $p(\theta|D)$, allows for computing the marginal prediction as $$p(y|x,D) = \int_{\theta}p(y|x,\theta^{'})p(\theta^{'}|D)d\theta^{'}$$
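
In practice this integral is usually approximated by Monte Carlo: draw $\theta_i \sim p(\theta|D)$ and average the resulting predictions. The following is a minimal sketch under loud assumptions: the toy model `predict`, the Gaussian stand-in for the posterior samples, and the input `x_new` are all hypothetical and not taken from any particular library.

```python
import random
import statistics

random.seed(1)

def predict(x, theta):
    """Toy model: the mean of p(y | x, theta) is theta * x."""
    return theta * x

# Stand-in for samples theta' ~ p(theta | D); assumed Gaussian here.
posterior_samples = [random.gauss(1.5, 0.2) for _ in range(5000)]

x_new = 2.0
preds = [predict(x_new, t) for t in posterior_samples]

# Monte Carlo estimate of the predictive mean and of the spread
# caused by uncertainty in theta.
pred_mean = statistics.fmean(preds)
pred_std = statistics.pstdev(preds)
print(f"prediction at x={x_new}: {pred_mean:.2f} +/- {pred_std:.2f}")
```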
___

<b>Setting priors</b>
1. Normal prior with zero mean and diagonal covariance $\sigma I$<br>
    * Good default prior due to mathematical properties of normal distribution, but no theoretical foundation for its use.
___

## [3] Algorithms

The posterior probability $P(H|D)$ is often intractable due to the integral $\int_{H}P(D|H^{'})P(H^{'})dH^{'}$. Thus, approximations and sampling methods are used to learn the posterior. The two most prevalent families of methods are <i>Monte Carlo methods</i> and <i>Variational Inference</i>.
___

### [3.1] Monte Carlo
Key points:
* Generate i.i.d. samples from a distribution to estimate an expected value by their average
* Law of large numbers: given i.i.d. samples with a finite mean, the sample average converges to the expected value
* Central limit theorem: for a sufficiently large number of samples with finite variance, the distribution of the sample average is approximately normal
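
These points can be illustrated with a short sketch that estimates an expectation and its standard error. The helper `mc_estimate` and the choice of test function ($E[X^2]=1$ for $X \sim N(0,1)$) are assumptions for illustration only.

```python
import math
import random
import statistics

random.seed(2)

def mc_estimate(f, sampler, n):
    """Return the Monte Carlo mean of f(X) and its standard error."""
    values = [f(sampler()) for _ in range(n)]
    mean = statistics.fmean(values)
    std_err = statistics.stdev(values) / math.sqrt(n)
    return mean, std_err

# E[X^2] for X ~ N(0, 1) is exactly 1.
for n in (100, 10_000):
    mean, std_err = mc_estimate(lambda x: x * x, lambda: random.gauss(0.0, 1.0), n)
    print(f"n={n:>6}: estimate={mean:.3f}, standard error={std_err:.3f}")
```

The standard error shrinks roughly as $1/\sqrt{n}$, which is the practical consequence of the central limit theorem for these estimates.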

Further methods
* Explain and demonstrate dropout within a Monte Carlo framework
___

<b>Method</b>

Monte Carlo methods try to draw samples from the posterior distribution.

### [3.2] Markov Chain Monte Carlo (MCMC)

<b>Method and drawbacks</b>

MCMC methods try to sample the exact posterior distribution by constructing a Markov chain. This is achieved by drawing random samples, each of which depends solely on the previously drawn sample, such that the chain eventually follows the desired distribution.

There are some issues with these methods:
* There is often the need for an initial burn-in time before the chain converges to the desired distribution. 
* Autocorrelation between samples may be present, thus demanding a large number of samples to approximate independent sampling.
* The drawn samples have to be stored after training, which is expensive for most deep learning models.
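
The first two issues are commonly handled by discarding the burn-in and thinning the chain. The sketch below uses an AR(1) process as a hypothetical stand-in for a real MCMC chain; the burn-in length of 500 and the thinning interval of 20 are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(3)

def lag1_autocorr(xs):
    """Sample lag-1 autocorrelation of a sequence."""
    mean = statistics.fmean(xs)
    num = sum((a - mean) * (b - mean) for a, b in zip(xs, xs[1:]))
    den = sum((a - mean) ** 2 for a in xs)
    return num / den

# Stand-in for an MCMC chain: an AR(1) process started far from its
# stationary distribution N(0, 1), so the first draws are unrepresentative.
rho, x = 0.9, 10.0
chain = []
for _ in range(20_000):
    x = rho * x + (1 - rho ** 2) ** 0.5 * random.gauss(0.0, 1.0)
    chain.append(x)

kept = chain[500:]      # discard burn-in
thinned = kept[::20]    # keep every 20th draw

print(f"lag-1 autocorrelation, raw:     {lag1_autocorr(kept):.2f}")
print(f"lag-1 autocorrelation, thinned: {lag1_autocorr(thinned):.2f}")
```

Thinning reduces storage and correlation between the kept samples, at the price of throwing away most of the draws.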
___

<b>Metropolis-Hastings algorithm</b>

The algorithm starts with an initial guess $\theta_0$. At each step, a candidate $\theta^{'}$ is drawn from a proposal distribution $Q(\theta^{'}|\theta_n)$ conditioned on the current sample $\theta_n$. The candidate is then accepted or rejected by comparing it with the current sample through a function $f$ that is proportional to the target density. If it is rejected, the chain stays at $\theta_n$ and a new candidate is drawn. Otherwise, the algorithm continues with $\theta^{'}$ as the new benchmark, until the desired number of steps is reached.
The acceptance probability is computed as 
$$p=\min{\left(1, \frac{Q(\theta_{n}|\theta^{'})}{Q(\theta^{'}|\theta_{n})}\frac{f(\theta^{'})}{f(\theta_n)}\right)}$$
If $Q$ is chosen to be symmetric, the ratio of the $Q$ terms cancels, the acceptance probability is easier to compute, and the algorithm is simply called the <i>Metropolis method</i>. Examples of symmetric proposals include normal and uniform distributions around $\theta_n$.
In the case of bounded domains, a non-symmetric proposal distribution generally has to be used.
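
The symmetric case can be sketched in a few lines. This is a minimal random-walk Metropolis example, not a reference implementation: the target `f` (an unnormalized standard normal), the Gaussian proposal, the step size, and the burn-in length are all assumptions chosen for illustration.

```python
import math
import random
import statistics

random.seed(4)

def f(theta):
    """Unnormalized target density: a standard normal without its constant."""
    return math.exp(-0.5 * theta * theta)

def metropolis(n_steps, step_size=1.0, theta0=0.0):
    """Random-walk Metropolis with a symmetric Gaussian proposal Q."""
    theta, samples = theta0, []
    for _ in range(n_steps):
        proposal = random.gauss(theta, step_size)   # theta' ~ Q(. | theta_n)
        # Q is symmetric, so its ratio cancels: p = min(1, f(theta') / f(theta_n)).
        accept_prob = min(1.0, f(proposal) / f(theta))
        if random.random() < accept_prob:
            theta = proposal                        # accept the proposal
        samples.append(theta)                       # on rejection, theta_n is repeated
    return samples

samples = metropolis(50_000)[1_000:]                # drop a short burn-in
print(f"mean={statistics.fmean(samples):.2f}, variance={statistics.pvariance(samples):.2f}")
```

Note that only ratios of `f` appear, so the normalizing constant of the target (the intractable evidence in the Bayesian setting) is never needed.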
___

<b>Hamiltonian Monte Carlo algorithm</b> 

This method builds upon the Metropolis method. It proposes new candidates $\theta^{'}$ in a way that aims to reject as few of them as possible, and additionally attempts to avoid correlations between samples.
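
A minimal one-dimensional sketch of the idea follows, assuming a standard normal target and a leapfrog integrator with hand-picked step size and trajectory length. The function names and all tuning values are hypothetical; practical libraries tune them automatically.

```python
import math
import random
import statistics

random.seed(5)

def neg_log_f(theta):
    """Potential energy U(theta) = -log f(theta) for a standard normal target."""
    return 0.5 * theta * theta

def grad_neg_log_f(theta):
    return theta

def hmc_step(theta, step_size=0.2, n_leapfrog=10):
    """One HMC transition: leapfrog trajectory plus a Metropolis accept test."""
    momentum = random.gauss(0.0, 1.0)
    new_theta, new_momentum = theta, momentum
    # Leapfrog integration of the Hamiltonian dynamics.
    new_momentum -= 0.5 * step_size * grad_neg_log_f(new_theta)
    for i in range(n_leapfrog):
        new_theta += step_size * new_momentum
        if i < n_leapfrog - 1:
            new_momentum -= step_size * grad_neg_log_f(new_theta)
    new_momentum -= 0.5 * step_size * grad_neg_log_f(new_theta)
    # Accept with probability min(1, exp(H_old - H_new)).
    h_old = neg_log_f(theta) + 0.5 * momentum ** 2
    h_new = neg_log_f(new_theta) + 0.5 * new_momentum ** 2
    accepted = random.random() < math.exp(min(0.0, h_old - h_new))
    return (new_theta if accepted else theta), accepted

theta, samples, n_accepted = 0.0, [], 0
for _ in range(5_000):
    theta, accepted = hmc_step(theta)
    samples.append(theta)
    n_accepted += accepted

print(f"acceptance rate={n_accepted / len(samples):.2f}, "
      f"mean={statistics.fmean(samples):.2f}, variance={statistics.pvariance(samples):.2f}")
```

Because the leapfrog trajectory nearly conserves the Hamiltonian, distant proposals are still accepted with high probability, which is what keeps both the rejection rate and the correlation between samples low.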

### [3.3] Variational Inference
Key points:
* Approximate inference
* Uses the Kullback-Leibler divergence to derive the evidence lower bound (ELBO), i.e., a lower bound on the log marginal likelihood (evidence)
    * Converts into optimization problem
___

<b>Method and drawbacks</b>

Variational inference approximates the posterior by optimization. A variational distribution $q_{\phi}(H)$ parameterized by $\phi$ is chosen. Then, the parameters are learned to get the distribution as close as possible to the exact posterior. Closeness is often measured by the Kullback-Leibler divergence. As the exact KL-divergence requires computing the posterior, the evidence lower bound (ELBO) is used instead: maximizing the ELBO, or equivalently minimizing its negative as a loss, minimizes the KL-divergence.
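
A short derivation in the notation used here may make the role of the ELBO clearer. The symbol $\mathcal{L}(\phi)$ is introduced only for this sketch:

$$\log P(D) = \underbrace{\int_{H} q_{\phi}(H^{'})\log\frac{P(D,H^{'})}{q_{\phi}(H^{'})}dH^{'}}_{\mathcal{L}(\phi)\ \text{(ELBO)}} + \underbrace{\int_{H} q_{\phi}(H^{'})\log\frac{q_{\phi}(H^{'})}{P(H^{'}|D)}dH^{'}}_{D_{KL}\left(q_{\phi}\,\|\,P(\cdot|D)\right)}$$

Since the KL term is non-negative, $\mathcal{L}(\phi) \leq \log P(D)$. Since $\log P(D)$ does not depend on $\phi$, increasing $\mathcal{L}(\phi)$ decreases the KL-divergence by the same amount. The ELBO only involves the joint $P(D,H)=P(D|H)P(H)$, so it can be evaluated without knowing the posterior.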

A popular method for optimization is <i>Stochastic Variational Inference</i> (SVI); the stochastic gradient descent algorithm applied to variational inference.

Typical distributions for $q_{\phi}(H)$ are constructed from the exponential family of distributions, e.g., multivariate normal, Gamma and Dirichlet.

One problem with this method concerns deep learning: sampling is not a differentiable operation, so the stochasticity stops backpropagation from functioning within the internal nodes of the network.
___

<b>Bayes-by-backprop</b>

This method is a practical implementation of SVI, where the problems surrounding backpropagation are overcome by a reparametrization trick. The general idea is to use a random variable $\epsilon$ in the transformation $\theta=t(\epsilon,\phi)$ to obtain the parameters. This allows backpropagation to work as usual for the variational parameters $\phi$. 

As the objective function is a single sample of the ELBO, it will be noisy. As a countermeasure, the loss could be averaged over multiple epochs. 
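
To make the procedure concrete, here is a minimal sketch on a problem small enough to differentiate by hand. Everything in it is an assumption for illustration: a single weight $w$ with data $y_i \sim N(w, 1)$, a prior $w \sim N(0, 1)$, a Gaussian $q_{\phi}$ with $\phi=(\mu, s)$ and $\sigma=e^{s}$, and plain gradient descent with a fixed learning rate. A real network would rely on automatic differentiation instead of the hand-written gradients.

```python
import math
import random

random.seed(6)

# Hypothetical toy problem: y_i ~ N(w, 1) with prior w ~ N(0, 1).
# The exact posterior is N(sum(y) / (n + 1), 1 / (n + 1)), which lets us check the fit.
ys = [1.2, 0.8, 1.1, 0.9]
n = len(ys)

# Variational parameters phi = (mu, s) of q(w) = N(mu, sigma^2), sigma = exp(s).
mu, s = 0.0, 0.0
learning_rate = 0.002

for _ in range(5_000):
    sigma = math.exp(s)
    eps = random.gauss(0.0, 1.0)
    w = mu + sigma * eps                      # reparametrization w = t(eps, phi)
    # Single-sample negative ELBO, up to constants:
    #   0.5 * sum((y - w)^2) + 0.5 * w^2 - log(sigma)
    dloss_dw = (n + 1) * w - sum(ys)
    grad_mu = dloss_dw                        # dw/dmu = 1
    grad_s = dloss_dw * sigma * eps - 1.0     # dw/ds = sigma * eps, d(-log sigma)/ds = -1
    mu -= learning_rate * grad_mu
    s -= learning_rate * grad_s

print(f"learned q: mean={mu:.2f}, std={math.exp(s):.2f}")
print(f"exact posterior: mean={sum(ys) / (n + 1):.2f}, std={(1 / (n + 1)) ** 0.5:.2f}")
```

Because each step uses a single draw of $\epsilon$, the learned values keep fluctuating around the exact posterior parameters; a smaller learning rate or averaging over several draws would reduce that noise.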