---
title: "Approximate inference"
format:
    html:
        code-fold: true
jupyter: python3
fontsize: 1.2em
linestretch: 1.5
toc: true
notebook-view: true
---

### Overview

This lecture covers methods for approximate inference. These are needed when the likelihood (noise model) is non-Gaussian, in which case the posterior has no closed-form solution, or when there are simply too many data points for exact inference. In both scenarios, Markov chain Monte Carlo (MCMC) is the gold standard for hyperparameter inference. However, as noted before, care must be taken when choosing the MCMC step (Gibbs, Metropolis, Hamiltonian Monte Carlo), as the choice affects how well the sample space is explored.

Broadly, the learning strategy in Gaussian process models is to obtain point estimates of the hyperparameters by maximizing the log marginal likelihood. When the marginal likelihood is intractable, the methods below optimize a tractable surrogate for it instead.

- Laplace’s Approximation (see RW page 41)
- Expectation Propagation
- Variational Inference


### Laplace's Approximation

This is a local second-order Taylor expansion of the log posterior. From before, we note the marginal likelihood is given by

$$
p \left( \mathbf{t} | \mathbf{X}, \sigma^2 \right) = \int p \left( \mathbf{t} | \mathbf{f}, \mathbf{X}, \sigma^2 \right) p \left( \mathbf{f} \right) d \mathbf{f}
$$

We now set $\zeta \left( \mathbf{f} \right) = \log \left(  p \left( \mathbf{t} | \mathbf{f}, \mathbf{X}, \sigma^2 \right) p \left( \mathbf{f} \right)\right)$. The approximate posterior $q \left( \mathbf{f} \right)$ is obtained through a second-order Taylor series expansion of $\zeta \left( \mathbf{f} \right)$ around its maximizer $\hat{\mathbf{f}}$, given by

$$
\hat{\mathbf{f}} = \underset{\mathbf{f}}{\operatorname{argmax}} \; \;  \zeta \left( \mathbf{f} \right).
$$

To see this, consider the following steps. The first-order term of the expansion vanishes because the gradient of $\zeta$ is zero at $\hat{\mathbf{f}}$.

$$
\begin{aligned}
p \left( \mathbf{f} | \mathbf{t}, \mathbf{X}, \sigma^2 \right) & \propto p \left( \mathbf{t} | \mathbf{f}, \mathbf{X}, \sigma^2 \right) p \left( \mathbf{f} \right) \\
& = \exp \left( \zeta \left( \mathbf{f} \right) \right) \\
& \approx \exp \left( \zeta \left( \hat{\mathbf{f}} \right) + \frac{1}{2} \left( \mathbf{f} - \hat{\mathbf{f}}\right)^{T} \nabla^2 \zeta \left( \mathbf{f} \right) |_{\mathbf{f} = \hat{\mathbf{f}}} \left( \mathbf{f} - \hat{\mathbf{f}} \right) \right) \\
& \propto \mathcal{N} \left( \mathbf{f} | \hat{\mathbf{f}}, \mathbf{A}^{-1} \right) \\
& = q \left( \mathbf{f} \right)
\end{aligned}
$$

Here $\mathbf{A} = - \nabla^2 \zeta \left( \mathbf{f} \right)|_{\mathbf{f} = \hat{\mathbf{f}}}$ is the Hessian matrix of the negative log posterior at $\hat{\mathbf{f}}$. Following the above, we can now approximate the log marginal likelihood as

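$$
\begin{aligned}
\log p \left( \mathbf{t} | \mathbf{X}, \sigma^2 \right) & = \log \int \exp \left( \zeta \left( \mathbf{f} \right) \right) d \mathbf{f} \\
& \approx \log \int \exp \left(\zeta \left( \hat{\mathbf{f}} \right) - \frac{1}{2}  \left( \mathbf{f} - \hat{\mathbf{f}} \right)^{T} \mathbf{A} \left( \mathbf{f} - \hat{\mathbf{f}} \right)\right) d \mathbf{f} \\
& = \zeta \left( \hat{\mathbf{f}} \right) + \frac{N}{2} \log 2 \pi - \frac{1}{2} \log \left| \mathbf{A} \right|
\end{aligned}
$$

Here $N$ denotes the number of latent function values in $\mathbf{f}$, and the last line follows from the normalizing constant of a Gaussian with precision matrix $\mathbf{A}$.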
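The steps above can be sketched numerically. The example below is a minimal illustration, not a reference implementation: it assumes a logistic likelihood $p(t_i | f_i) = 1 / (1 + \exp(-t_i f_i))$ with labels $t_i \in \{-1, +1\}$, a zero-mean prior with a squared-exponential kernel, and toy one-dimensional data. None of these choices come from the lecture, and the names `rbf_kernel` and `laplace_approximation` are hypothetical. Newton's method finds $\hat{\mathbf{f}}$, after which $\mathbf{A}^{-1}$ and the approximate log marginal likelihood follow from the Gaussian integral.

```python
import numpy as np


def rbf_kernel(x, lengthscale=1.0, variance=1.0, jitter=1e-6):
    """Squared-exponential kernel matrix (assumed prior covariance) plus jitter."""
    diff = x[:, None] - x[None, :]
    K = variance * np.exp(-0.5 * (diff / lengthscale) ** 2)
    return K + jitter * np.eye(len(x))


def laplace_approximation(K, t, max_iter=100, tol=1e-10):
    """Laplace approximation under an assumed logistic likelihood.

    Returns the posterior mode f_hat, the covariance A^{-1} of q(f),
    and the approximate log marginal likelihood.
    """
    n = len(t)
    f = np.zeros(n)
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-f))
        grad = 0.5 * (t + 1.0) - pi      # gradient of log p(t | f)
        W = pi * (1.0 - pi)              # diagonal of the negative Hessian of log p(t | f)
        # Newton step: f_new = (K^{-1} + W)^{-1} (W f + grad) = K (I + W K)^{-1} (W f + grad)
        a = np.linalg.solve(np.eye(n) + W[:, None] * K, W * f + grad)
        f_new = K @ a
        step = np.max(np.abs(f_new - f))
        f = f_new
        if step < tol:
            break

    pi = 1.0 / (1.0 + np.exp(-f))
    W = pi * (1.0 - pi)
    B = np.eye(n) + K * W[None, :]       # I + K W
    cov = np.linalg.solve(B, K)          # A^{-1} = (K^{-1} + W)^{-1}
    log_lik = -np.sum(np.logaddexp(0.0, -t * f))
    _, logdet_B = np.linalg.slogdet(B)
    # zeta(f_hat) + (N/2) log(2 pi) - (1/2) log|A| simplifies to the line below,
    # using f = K a so that f^T K^{-1} f = f^T a, and |K| |K^{-1} + W| = |I + K W|.
    log_Z = log_lik - 0.5 * f @ a - 0.5 * logdet_B
    return f, cov, log_Z


# Toy data (an assumption for illustration): the label is the sign of the input.
x = np.linspace(-3.0, 3.0, 20)
t = np.where(x > 0, 1.0, -1.0)
K = rbf_kernel(x)
f_hat, cov, log_Z = laplace_approximation(K, t)
print("posterior mode:", np.round(f_hat, 2))
print("approximate log marginal likelihood:", log_Z)
```

The Newton step is written as $\mathbf{K} (\mathbf{I} + \mathbf{W} \mathbf{K})^{-1}$ rather than through $\mathbf{K}^{-1}$ because kernel matrices are often poorly conditioned. Other likelihoods would only change the `grad` and `W` lines.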

### Expectation propagation 

Reference: [Minka 2001](https://arxiv.org/pdf/1301.2294.pdf)

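In expectation propagation, we factorize the posterior as a product of a normalization term, the prior, and the likelihood. The likelihood itself factorizes over the data points, and each of its factors is approximated by an unnormalized Gaussian.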
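As a sketch of this factorization, assume the likelihood factorizes over the $N$ data points and suppress the hyperparameters. The normalizers $Z$ and $Z_{\mathrm{EP}}$ and the site parameters $\tilde{Z}_i$, $\tilde{\mu}_i$, $\tilde{\sigma}_i^2$ are notation introduced here for illustration, following RW. The exact posterior and its approximation can then be written as

$$
\begin{aligned}
p \left( \mathbf{f} | \mathbf{t}, \mathbf{X} \right) & = \frac{1}{Z} \, p \left( \mathbf{f} \right) \prod_{i=1}^{N} p \left( t_i | f_i \right) \\
q \left( \mathbf{f} \right) & = \frac{1}{Z_{\mathrm{EP}}} \, p \left( \mathbf{f} \right) \prod_{i=1}^{N} \tilde{Z}_i \, \mathcal{N} \left( f_i | \tilde{\mu}_i, \tilde{\sigma}_i^2 \right)
\end{aligned}
$$

Each Gaussian site stands in for one likelihood factor, and the site parameters are refined one at a time by moment matching.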

### Variational inference

Variational Bayes (VB) approximates the intractable posterior distribution (the distribution of the latent variables given the data) with a simpler distribution from a chosen family. It does this by minimizing the Kullback-Leibler (KL) divergence from the variational approximation to the true posterior.
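To make the objective concrete, here is a sketch in the notation of the earlier sections, with hyperparameters suppressed. The symbols $q$ for the variational distribution and $\mathcal{L}$ for the bound are introduced here for illustration. The divergence being minimized is

$$
\mathrm{KL} \left( q \, \| \, p \right) = \int q \left( \mathbf{f} \right) \log \frac{q \left( \mathbf{f} \right)}{p \left( \mathbf{f} | \mathbf{t}, \mathbf{X} \right)} d \mathbf{f}
$$

and it is tied to the log marginal likelihood through

$$
\log p \left( \mathbf{t} | \mathbf{X} \right) = \mathcal{L} \left( q \right) + \mathrm{KL} \left( q \, \| \, p \right), \qquad \mathcal{L} \left( q \right) = \int q \left( \mathbf{f} \right) \log \frac{p \left( \mathbf{t} | \mathbf{f}, \mathbf{X} \right) p \left( \mathbf{f} \right)}{q \left( \mathbf{f} \right)} d \mathbf{f}
$$

Because the KL divergence is non-negative, $\mathcal{L} \left( q \right)$ never exceeds the log marginal likelihood, and raising it is equivalent to lowering the divergence.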

VB is applied to Gaussian processes (GPs) in three steps:

- Variational family: We choose a family of simpler distributions to approximate the posterior distribution of the latent function in a GP. A common choice is a fully factorized Gaussian distribution with variational parameters (a mean and a variance) for each function value.
- Tractable lower bound: We define a lower bound on the log marginal likelihood (evidence). The gap between the log evidence and this bound is the KL divergence from the variational approximation to the true posterior, so a tighter bound means a better approximation.
- Optimization: We optimize the variational parameters by maximizing this lower bound, typically through iterative updates until convergence.
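The three steps can be checked numerically in a setting where everything is available in closed form. The sketch below assumes a Gaussian likelihood with noise variance `noise_var`, a squared-exponential kernel, and toy data, none of which come from the lecture; the function names are hypothetical. With a Gaussian likelihood the exact posterior is itself Gaussian, so the lower bound should match the log evidence for a full-covariance Gaussian and fall below it for a fully factorized one. The optimal variational parameters are written in closed form here instead of being found by iterative updates.

```python
import numpy as np


def rbf_kernel(x, lengthscale=0.7, variance=1.0, jitter=1e-6):
    """Squared-exponential kernel matrix (assumed prior covariance) plus jitter."""
    diff = x[:, None] - x[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2) + jitter * np.eye(len(x))


def log_evidence(K, t, noise_var):
    """Exact log marginal likelihood: log N(t | 0, K + noise_var * I)."""
    n = len(t)
    C = K + noise_var * np.eye(n)
    _, logdet_C = np.linalg.slogdet(C)
    return -0.5 * (t @ np.linalg.solve(C, t) + logdet_C + n * np.log(2.0 * np.pi))


def elbo(m, S, K, t, noise_var):
    """Lower bound for q(f) = N(m, S): E_q[log p(t | f)] - KL(q || prior)."""
    n = len(t)
    expected_log_lik = -0.5 * (
        n * np.log(2.0 * np.pi * noise_var)
        + (np.sum((t - m) ** 2) + np.trace(S)) / noise_var
    )
    K_inv = np.linalg.inv(K)
    _, logdet_K = np.linalg.slogdet(K)
    _, logdet_S = np.linalg.slogdet(S)
    kl_to_prior = 0.5 * (np.trace(K_inv @ S) + m @ K_inv @ m - n + logdet_K - logdet_S)
    return expected_log_lik - kl_to_prior


# Toy regression data (an assumption for illustration).
x = np.linspace(0.0, 5.0, 8)
t = np.sin(x)
noise_var = 0.1
K = rbf_kernel(x)
K_inv = np.linalg.inv(K)

# Full-covariance family: the optimum is the exact Gaussian posterior.
S_full = np.linalg.inv(K_inv + np.eye(len(x)) / noise_var)
m_opt = S_full @ t / noise_var

# Fully factorized family: same optimal mean, with variances that maximize
# the bound over diagonal covariances.
S_mf = np.diag(1.0 / (np.diag(K_inv) + 1.0 / noise_var))

log_Z = log_evidence(K, t, noise_var)
elbo_full = elbo(m_opt, S_full, K, t, noise_var)
elbo_mf = elbo(m_opt, S_mf, K, t, noise_var)
print("log evidence:       ", log_Z)
print("ELBO, full Gaussian:", elbo_full)
print("ELBO, mean-field:   ", elbo_mf)
```

The difference `log_Z - elbo_mf` is the KL divergence from the factorized approximation to the true posterior, which is the price paid for ignoring correlations between function values.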

Benefits of VB for GPs:

- Scalability: VB allows approximate inference in GPs with large datasets by avoiding exact computation of the posterior.
- Intractability: VB handles cases where exact posterior inference is intractable because of the complexity of the model or the data, for example with a non-Gaussian likelihood.

Challenges of VB for GPs:

- Accuracy: The accuracy of VB for GPs depends on the chosen variational family. Simpler families may not capture the full complexity of the true posterior, which leads to inaccurate approximations.
- Optimization: Finding the optimal variational parameters can be difficult, and the optimization may get stuck in local optima.
Further exploration:

- Paper: "The Variational Gaussian Process" (arXiv:1511.06499). This paper introduces the Variational Gaussian Process (VGP), a variational family built on Gaussian processes with strong representational power.
- Applications: Research papers that apply VB for GPs to specific tasks, such as time series forecasting or image classification, give a sense of the practical implementation and benefits.