# What is a Gaussian Process?

In this notebook we're going to introduce Gaussian processes, describe what they are and how they can be used in machine learning. 

First of all, we'll consider the question, what is machine learning? By my definition Machine Learning is a combination of

$$\text{data} + \text{model} \rightarrow \text{prediction}$$

and the reason that machine learning has become a mainstay of artificial intelligence is the importance of predictions in artificial intelligence. 

So how do Gaussian processes come in? Well in practice we normally perform machine learning using two functions. To combine data with a model we typically make use of:

**a prediction function** a function which is used to make the predictions. It includes our assumptions about how the world works, e.g. smoothness, spatial similarities, temporal similarities.

**an objective function** a function which defines the cost of misprediction. Typically it includes knowledge about the world's generating processes (probabilistic objectives) or the costs we pay for mispredictions (empirical risk minimization).

In practice, we normally also have uncertainty associated with these functions. Uncertainty in the prediction function arises from 

1. scarcity of training data and 
2. mismatch between the set of prediction functions we choose and all possible prediction functions.

There are also challenges around specification of the objective function, but we will save those for another day. For the moment, let us focus on the prediction function. 

## Neural Networks and Prediction Functions

Neural networks are adaptive non-linear function models. Originally, they were studied (by McCulloch and Pitts) as simple models for neurons, but over the last decade they have become popular because of their ability to model data. A particular characteristic of neural network models is that they can be composed to form highly complex functions which encode many of our expectations of the real world. They allow us to encode our assumptions about how the world works.

We will return to composition later, but for the moment, let's focus on a one hidden layer neural network. We are interested in the prediction function, so we'll ignore the objective function (which is often called an error function) for the moment, and just describe the mathematical object of interest

$$
\mappingFunction(\inputVector) = \mappingMatrix^\top \activationVector(\mappingMatrixTwo, \inputVector)
$$

Here $\mappingFunction(\cdot)$ is a scalar function with vector inputs, and $\activationVector(\cdot)$ is a vector function with vector inputs. The dimensionality of the vector function is known as the number of hidden units, or the number of neurons. The elements of this vector function are known as the *activation functions* of the neural network, and $\mappingMatrixTwo$ contains the parameters of the activation functions.

In statistics, activation functions are traditionally known as *basis functions*, and we would think of this as a *linear model*. It doesn't make linear predictions, but it's linear because in statistics estimation focuses on the parameters, $\mappingMatrix$, not the parameters, $\mappingMatrixTwo$. The linear model terminology refers to the fact that the model is *linear in the parameters*, but it is *not* linear in the data unless the activation functions are chosen to be linear.
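As a minimal sketch of the prediction function above, the following NumPy snippet evaluates $\mappingFunction(\inputVector) = \mappingMatrix^\top \activationVector(\mappingMatrixTwo, \inputVector)$ and checks numerically that it is linear in the output weights when the activation parameters are held fixed. The names `activations` and `prediction`, the choice of `tanh` as the nonlinearity, and the sizes used are all assumptions made for illustration; they are not taken from the text.

```python
import numpy as np

def activations(V, x):
    """Hidden-unit activations phi(V, x) for a single input vector x.

    V has shape (num_hidden, input_dim). tanh is an assumed nonlinearity.
    """
    return np.tanh(V @ x)

def prediction(w, V, x):
    """Scalar prediction f(x) = w^T phi(V, x)."""
    return w @ activations(V, x)

rng = np.random.default_rng(0)
num_hidden, input_dim = 5, 2
V = rng.standard_normal((num_hidden, input_dim))  # parameters inside the activations
w = rng.standard_normal(num_hidden)               # output weights
w2 = rng.standard_normal(num_hidden)              # a second set of output weights
x = rng.standard_normal(input_dim)

# Linearity in w with V fixed: f(2 w + w2) equals 2 f(w) + f(w2).
lhs = prediction(2.0 * w + w2, V, x)
rhs = 2.0 * prediction(w, V, x) + prediction(w2, V, x)
print(np.isclose(lhs, rhs))
```

The same check applied to the input `x` instead of `w` would generally fail for a non-linear activation, which is the distinction the "linear in the parameters" terminology is drawing.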

The first difference between the (early) neural network literature and the classical statistical literature is the decision to optimize these parameters, $\mappingMatrixTwo$, as well as the parameters, $\mappingMatrix$ (which would normally be denoted in statistics by $\boldsymbol{\beta}$).

In this tutorial, we're going to revisit that decision and follow the path of Radford Neal who, inspired by the work of David MacKay and others, did his PhD thesis on Bayesian neural networks. If we take a Bayesian approach to parameter inference (note I am using inference here in the classical sense, not in the sense of prediction of test data, which seems to be a newer usage), then we don't wish to fit parameters at all; rather, we wish to integrate them away and understand the family of functions that the model describes.

## Probabilistic Modelling

This Bayesian approach is designed to deal with uncertainty arising from fitting our prediction function to the data we have, a reduced data set.

The Bayesian approach can be derived from a broader understanding of what our objective is. If we accept that we can jointly represent all things that happen in the world with a probability distribution, then we can interrogate that distribution to make predictions. So, if we are interested in predictions, $\dataScalar_*$, at future input locations of interest, $\inputVector_*$, given previously observed training observations, $\dataVector$, and their corresponding inputs, $\inputMatrix$, then we are really interrogating the following probability density,
$$
p(\dataScalar_*|\dataVector, \inputMatrix, \inputVector_*),
$$
and there is nothing controversial here. As long as you accept that you have a good joint model of the world around you that relates test data to training data, $p(\dataScalar_*, \dataVector, \inputMatrix, \inputVector_*)$, this conditional distribution can be recovered through standard rules of probability ($\text{data} + \text{model} \rightarrow \text{prediction}$). 

We can construct this predictive density through the use of the following decomposition:
$$
p(\dataScalar_*|\dataVector, \inputMatrix, \inputVector_*) = \int p(\dataScalar_*|\inputVector_*, \parameterVector) p(\parameterVector | \dataVector, \inputMatrix) \text{d} \parameterVector
$$
where, for convenience, we are assuming *all* the parameters of the model are now represented by $\parameterVector$ (which contains $\mappingMatrix$ and $\mappingMatrixTwo$), $p(\parameterVector | \dataVector, \inputMatrix)$ is recognised as the posterior density of the parameters given data, and $p(\dataScalar_*|\inputVector_*, \parameterVector)$ is the *likelihood* of an individual test data point given the parameters. The likelihood of the data is normally assumed to be independent across data points given the parameters,
$$
p(\dataVector|\inputMatrix, \parameterVector) = \prod_{i=1}^\numData p(\dataScalar_i|\inputVector_i, \parameterVector),
$$
and if that is so, it is easy to extend our predictions across all future, potential, locations,
$$
p(\dataVector_*|\dataVector, \inputMatrix, \inputMatrix_*) = \int p(\dataVector_*|\inputMatrix_*, \parameterVector) p(\parameterVector | \dataVector, \inputMatrix) \text{d} \parameterVector.
$$
The likelihood is also where the *prediction function* is incorporated. For example in the regression case, we consider an objective based around the Gaussian density,
$$
p(\dataScalar_i | \mappingFunction(\inputVector_i)) = \frac{1}{\sqrt{2\pi \dataStd^2}} \exp\left(-\frac{\left(\dataScalar_i - \mappingFunction(\inputVector_i)\right)^2}{2\dataStd^2}\right)
$$
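To make the Gaussian likelihood concrete, here is a small sketch that evaluates the log likelihood of a handful of observations given function values, using the independence across data points assumed above. The function name `gaussian_log_likelihood` and the toy numbers are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def gaussian_log_likelihood(y, f, sigma2):
    """Sum over data points of log N(y_i | f_i, sigma2).

    Summing the per-point terms relies on the data being independent given f.
    """
    y = np.asarray(y, dtype=float)
    f = np.asarray(f, dtype=float)
    per_point = -0.5 * np.log(2.0 * np.pi * sigma2) - (y - f) ** 2 / (2.0 * sigma2)
    return np.sum(per_point)

# Toy observations and function values (made up for illustration).
y = np.array([0.1, -0.4, 1.2])
f = np.array([0.0, -0.5, 1.0])
sigma2 = 0.25

log_lik = gaussian_log_likelihood(y, f, sigma2)

# The same quantity as a product of the individual Gaussian densities.
densities = np.exp(-(y - f) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)
print(np.isclose(np.exp(log_lik), np.prod(densities)))
```

Working with the log of the likelihood turns the product over data points into a sum, which is usually the numerically safer form.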

In short, that is the classical approach to probabilistic inference, and all approaches to Bayesian neural networks fall within this path. For a deep probabilistic model, we can simply take this one stage further and place a probability distribution over the input locations,
$$
p(\dataVector_*|\dataVector) = \int p(\dataVector_*|\inputMatrix_*, \parameterVector) p(\parameterVector | \dataVector, \inputMatrix) p(\inputMatrix) p(\inputMatrix_*) \text{d} \parameterVector \text{d} \inputMatrix \text{d}\inputMatrix_*
$$
and we have *unsupervised learning*, and families of deep probabilistic models. 

## Performing Inference

Of course, the devil is in the detail, and while everything is easy to write in terms of probability densities, as we move from $\text{data}$ and $\text{model}$ to $\text{prediction}$ there is that simple $\rightarrow$ sign, which is now burying a wealth of difficulties. Each integral sign above is a high dimensional integral which will typically need approximation. Approximations also come with computational demands. As we consider more complex classes of functions, the challenges around the integrals become harder and prediction of future test data given our model and the data becomes so involved as to be impractical or impossible. 

Statisticians realized these challenges early on, indeed so early that they were actually physicists. Both Laplace and Gauss worked on models such as this; Gauss made his career on predicting the location of the lost planet Ceres (later reclassified as an asteroid, then a dwarf planet). Gauss and Laplace made use of maximum a posteriori estimates to simplify their computations, and Laplace developed Laplace's method (and invented the Gaussian density) to expand around that mode. But classical statistics needs better guarantees around model performance and interpretation, and as a result has focussed more on the *linear* model implied by 

$$
\mappingFunction(\inputVector) = \mappingMatrix^\top \activationVector(\parameterVector, \inputVector)
$$

by holding $\parameterVector$ fixed for any given analysis. In this case, a Gaussian prior is formulated over the parameters $\mappingMatrix$,
$$
\mappingMatrix \sim \gaussianSamp{\zerosVector}{\covarianceMatrix}.
$$

The Gaussian likelihood given above implies that the data observation is related to the function by noise corruption so we have,
$$
\dataScalar_i = \mappingFunction(\inputVector_i) + \noiseScalar_i,
$$
where 
$$
\noiseScalar_i \sim \gaussianSamp{0}{\dataStd^2}
$$
and while normally integrating over high dimensional parameter vectors is highly complex, here it is *trivial*. That is because of a property of the multivariate Gaussian.

## Multivariate Gaussian Properties

Gaussian processes are initially of interest because

1. Linear Gaussian models are easier to deal with.
2. Even the parameters *within* the process can be handled, by considering a particular limit.

Let's first of all review the properties of the multivariate Gaussian that make linear Gaussian models easier to deal with. We'll return to the, perhaps surprising, result on the parameters within the nonlinearity, $\parameterVector$, shortly.

To work with linear Gaussian models and find the marginal likelihood, all you need to know are the following rules. If
$$
\dataVector = \mappingMatrix \inputVector + \errorVector,
$$
where $\dataVector$, $\inputVector$ and $\errorVector$ are vectors, and we assume that $\inputVector$ and $\errorVector$ are drawn from multivariate Gaussians,
\begin{align}
\inputVector & \sim \gaussianSamp{\meanVector}{\covarianceMatrix}\\
\errorVector & \sim \gaussianSamp{\zerosVector}{\covarianceTwoMatrix}
\end{align}
then we know that $\dataVector$ is also drawn from a multivariate Gaussian with,
$$
\dataVector \sim \gaussianSamp{\mappingMatrix\meanVector}{\mappingMatrix\covarianceMatrix\mappingMatrix^\top + \covarianceTwoMatrix}.
$$
In the case given above, we can write all the training data as 
$$
\dataVector = \basisMatrix\mappingVector + \errorVector.
$$
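The rule for linear Gaussian models stated above can be checked numerically. The sketch below samples $\inputVector$ and $\errorVector$, pushes them through $\dataVector = \mappingMatrix \inputVector + \errorVector$, and compares the empirical mean and covariance with $\mappingMatrix\meanVector$ and $\mappingMatrix\covarianceMatrix\mappingMatrix^\top + \covarianceTwoMatrix$. The particular matrices, the sample size and the variable names are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary illustrative values (assumptions, not from the text).
W = np.array([[1.0, 2.0], [0.5, -1.0], [0.0, 3.0]])  # maps a 2-d x to a 3-d y
mu = np.array([1.0, -2.0])                             # mean of x
C = np.array([[2.0, 0.3], [0.3, 1.0]])                 # covariance of x
Sigma = 0.1 * np.eye(3)                                # covariance of the noise

# Analytic mean and covariance of y = W x + eps.
mean_y = W @ mu
cov_y = W @ C @ W.T + Sigma

# Empirical check: sample x and eps, then form y row by row.
num_samples = 200_000
x = rng.multivariate_normal(mu, C, size=num_samples)
eps = rng.multivariate_normal(np.zeros(3), Sigma, size=num_samples)
y = x @ W.T + eps

emp_mean = y.mean(axis=0)
emp_cov = np.cov(y, rowvar=False)
print(np.abs(emp_mean - mean_y).max(), np.abs(emp_cov - cov_y).max())
```

The two printed discrepancies shrink as `num_samples` grows. The point of the analytic rule is that no sampling is needed at all.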

With appropriately defined covariance, $\covarianceTwoMatrix$, this is actually the marginal likelihood for Factor Analysis, or Probabilistic Principal Component Analysis [@], because we integrated out the inputs (or *latent* variables, as they would be called in that case). 

However, we are focussing on what happens in models which are non-linear in the inputs, whereas the above would be *linear* in the inputs. To consider these, we introduce a matrix, called the design matrix. We set each activation function computed at each data point to be
$$
\activationScalar_{i,j} = \activationScalar(\mappingVectorTwo_{j}, \inputVector_{i})
$$
and define the matrix of activations (known as the *design matrix* in statistics) to be,
$$
\activationMatrix = 
\begin{bmatrix}
\activationScalar_{1, 1} & \activationScalar_{1, 2} & \dots & \activationScalar_{1, \numHidden} \\
\activationScalar_{2, 1} & \activationScalar_{2, 2} & \dots & \activationScalar_{2, \numHidden} \\
\vdots & \vdots & \ddots & \vdots \\
\activationScalar_{\numData, 1} & \activationScalar_{\numData, 2} & \dots & \activationScalar_{\numData, \numHidden}
\end{bmatrix}.
$$
By convention this matrix always has $\numData$ rows and $\numHidden$ columns. Now we define the vector of all noise corruptions, $\errorVector = \left[\errorScalar_1, \dots \errorScalar_\numData\right]^\top$, so that the training data can be written as $\dataVector = \activationMatrix\mappingVector + \errorVector$. If we define the prior distribution over the vector $\mappingVector$ to be Gaussian,
$$
\mappingVector \sim \gaussianSamp{\zerosVector}{\alpha\eye},
$$
then we can use rules of multivariate Gaussians to see that,
$$
\dataVector \sim \gaussianSamp{\zerosVector}{\alpha \activationMatrix \activationMatrix^\top + \dataStd^2 \eye}.
$$
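A minimal sketch of this marginalisation, assuming `tanh` activations and arbitrary sizes: it builds the design matrix, forms the covariance $\alpha \activationMatrix \activationMatrix^\top + \dataStd^2 \eye$, and compares it with the empirical covariance obtained by sampling the weights and noise explicitly. The helper name `design_matrix` and all numerical values are assumptions made for illustration.

```python
import numpy as np

def design_matrix(X, V):
    """Matrix of activations: one row per data point, one column per hidden unit.

    X: (num_data, input_dim), V: (num_hidden, input_dim). tanh is assumed.
    """
    return np.tanh(X @ V.T)

rng = np.random.default_rng(2)
num_data, num_hidden, input_dim = 4, 10, 1
alpha, sigma2 = 0.5, 0.01

X = np.linspace(-1.0, 1.0, num_data)[:, None]
V = rng.standard_normal((num_hidden, input_dim))
Phi = design_matrix(X, V)

# Covariance of y once w ~ N(0, alpha I) has been integrated out.
K = alpha * Phi @ Phi.T + sigma2 * np.eye(num_data)

# Monte Carlo check: sample w and the noise, then form y = Phi w + eps.
num_samples = 100_000
w = np.sqrt(alpha) * rng.standard_normal((num_samples, num_hidden))
eps = np.sqrt(sigma2) * rng.standard_normal((num_samples, num_data))
Y = w @ Phi.T + eps
emp_cov = np.cov(Y, rowvar=False)
print(np.abs(emp_cov - K).max())
```

The sampled weights never appear in `K`, which is the practical meaning of integrating them out.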

In other words, our training data is distributed as a multivariate Gaussian, with zero mean and a covariance given by 
$$
\kernelMatrix = \alpha \activationMatrix \activationMatrix^\top + \dataStd^2 \eye.
$$
This is an $\numData \times \numData$ matrix. Its elements are given by a function. The maths shows that, setting aside the noise term (which only adds $\dataStd^2$ to the diagonal), any element, indexed by $i$ and $j$, is a function *only* of the inputs associated with data points $i$ and $j$, $\inputVector_i$ and $\inputVector_j$. So in other words we have,
$$
\kernel_{i,j} = \kernel\left(\inputVector_i, \inputVector_j\right) = \alpha \activationVector\left(\mappingMatrixTwo, \inputVector_i\right)^\top \activationVector\left(\mappingMatrixTwo, \inputVector_j\right)
$$
so the elements of the covariance or *kernel* matrix are formed by inner products of the rows of the *design matrix*.  
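The following sketch illustrates that each element of the kernel matrix depends only on the pair of inputs involved. It builds the noise-free matrix one element at a time from a function of two inputs, then compares it with the inner products of the rows of the design matrix. The names `activation_vector` and `kernel`, the `tanh` nonlinearity and the sizes are assumptions made for illustration.

```python
import numpy as np

def activation_vector(V, x):
    """phi(V, x) for a single input x. tanh is an assumed nonlinearity."""
    return np.tanh(V @ x)

def kernel(x_i, x_j, V, alpha):
    """k(x_i, x_j) = alpha * phi(V, x_i)^T phi(V, x_j), without the noise term."""
    return alpha * activation_vector(V, x_i) @ activation_vector(V, x_j)

rng = np.random.default_rng(3)
num_data, num_hidden, input_dim = 5, 8, 2
alpha = 0.7
X = rng.standard_normal((num_data, input_dim))
V = rng.standard_normal((num_hidden, input_dim))

# Build the kernel matrix one element at a time from pairs of inputs.
K_pairwise = np.array([[kernel(X[i], X[j], V, alpha) for j in range(num_data)]
                       for i in range(num_data)])

# The same matrix from the design matrix: inner products of its rows.
Phi = np.tanh(X @ V.T)
K_matrix = alpha * Phi @ Phi.T
print(np.allclose(K_pairwise, K_matrix))
```

Because the matrix is built from inner products, it is symmetric and positive semi-definite by construction, which is what makes it a valid covariance.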

This is the essence of a Gaussian process. Instead of assuming that each data point, $\dataScalar_i$, is i.i.d., we make a joint Gaussian assumption over our data. The covariance matrix is now a function of both the parameters of the activation function, $\mappingMatrixTwo$, and the input variables, $\inputMatrix$. This comes about through integrating out the parameters of the model, $\mappingVector$. 
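To see what the joint Gaussian assumption means in practice, the sketch below draws a few sets of observations directly from the multivariate Gaussian with covariance $\alpha \activationMatrix \activationMatrix^\top + \dataStd^2 \eye$, without ever sampling the weights. The `tanh` activations, the grid of inputs and the scaling `alpha = 1 / num_hidden` (chosen so the variance does not grow with the number of hidden units) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
num_data, num_hidden = 100, 20
alpha = 1.0 / num_hidden  # assumed scaling, keeps the function variance bounded
sigma2 = 0.01             # assumed noise variance

# Inputs on a grid and fixed, randomly chosen activation parameters.
X = np.linspace(-3.0, 3.0, num_data)[:, None]
V = rng.standard_normal((num_hidden, 1))
Phi = np.tanh(X @ V.T)

# Joint covariance over all observations.
K = alpha * Phi @ Phi.T + sigma2 * np.eye(num_data)

# Draw from N(0, K): if z ~ N(0, I) and K = L L^T then L z ~ N(0, K).
L = np.linalg.cholesky(K)
num_draws = 3
samples = L @ rng.standard_normal((num_data, num_draws))
print(samples.shape)
```

Each column of `samples` is one joint draw of all observations. Plotting the columns against `X` would show correlated, noisy curves rather than independent scatter.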

We can basically put anything inside the basis functions, and many people do. These can be deep kernels [@saul] or we can learn the parameters of a convolutional neural network inside there.

Viewing a neural network in this way is also what allows us to perform sensible *batch* normalizations [@cite].

However, we have a covariance function that is not just a function of our 

Impressively, it doesn't stop there.




functions alongsidethe parametProbabilistic inference allows us to consider what kinds of ac
