# What is a Gaussian Process?

In this notebook we're going to introduce Gaussian processes, describe what they are and how they can be used in machine learning. 

First of all, we'll consider the question, what is machine learning? By my definition Machine Learning is a combination of

$$\text{data} + \text{model} \rightarrow \text{prediction}$$

and the reason that machine learning has become a mainstay of artificial intelligence is the importance of predictions in artificial intelligence. 

So how do Gaussian processes come in? Well in practice we normally perform machine learning using two functions. To combine data with a model we typically make use of:

**a prediction function** a function which is used to make the predictions. It includes our assumptions about how the world works, e.g. smoothness, spatial similarities, temporal similarities.

**an objective function** a function which defines the cost of misprediction. Typically it includes knowledge about the world's generating processes (probabilistic objectives) or the costs we pay for mispredictions (empirical risk minimization).

In practice, we normally also have uncertainty associated with these functions. Uncertainty in the prediction function arises from 

1. scarcity of training data and 
2. mismatch between the set of prediction functions we choose and all possible prediction functions.

There are also challenges around specification of the objective function, but we will save those for another day. For the moment, let us focus on the prediction function.

## Neural Networks and Prediction Functions

Neural networks are adaptive non-linear function models. Originally, they were studied (by McCulloch and Pitts) as simple models for neurons, but over the last decade they have become popular because of their ability to model data. A particular characteristic of neural network models is that they can be composed to form highly complex functions which encode many of our expectations of the real world. They allow us to encode our assumptions about how the world works.

We will return to composition later, but for the moment, let's focus on a one hidden layer neural network. We are interested in the prediction function, so we'll ignore the objective function (which is often called an error function) for the moment, and just describe the mathematical object of interest

$$
\mappingFunction(\inputVector) = \weightMatrix^\top \activationFunctionVector(\parameterVector, \inputVector)
$$

Where in this case $\mappingFunction(\cdot)$ is a scalar function with vector inputs, and $\activationFunctionVector(\cdot)$ is a vector function with vector inputs. The dimensionality of the vector function is known as the number of hidden units, or the number of neurons. The elements of this vector function are known as the *activation functions* of the neural network and $\parameterVector$ are the parameters of the activation functions.

In statistics, activation functions are traditionally known as *basis functions*, and we would think of this as a *linear model*. It doesn't make linear predictions, but it's linear because in statistics estimation focuses on the parameters, $\weightMatrix$, not the parameters, $\parameterVector$. The linear model terminology refers to the fact that the model is *linear in the parameters*, but it is *not* linear in the data unless the activation functions are chosen to be linear.
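The distinction between "linear in the parameters" and "linear in the data" can be checked numerically. The following is a minimal sketch, not a reference implementation: the `tanh` activation, the names `activations` and `prediction`, and the layout of `theta` (a dictionary holding a hypothetical input-to-hidden matrix `V` and bias `b`) are all assumptions made for illustration.

```python
import numpy as np

def activations(theta, X):
    """Hidden-unit activations for each row of X.

    tanh is an assumed choice; theta holds the (hypothetical)
    input-to-hidden weights 'V' and biases 'b'.
    """
    return np.tanh(X @ theta["V"] + theta["b"])

def prediction(w, theta, X):
    """One hidden layer prediction function: activations times output weights."""
    return activations(theta, X) @ w

rng = np.random.default_rng(0)
num_hidden, input_dim = 5, 2
theta = {
    "V": rng.normal(size=(input_dim, num_hidden)),
    "b": rng.normal(size=num_hidden),
}
w1 = rng.normal(size=num_hidden)
w2 = rng.normal(size=num_hidden)
X = rng.normal(size=(4, input_dim))

# Linear in the output weights: f(2*w1 + 3*w2) equals 2*f(w1) + 3*f(w2).
lhs = prediction(2.0 * w1 + 3.0 * w2, theta, X)
rhs = 2.0 * prediction(w1, theta, X) + 3.0 * prediction(w2, theta, X)
linear_in_weights = np.allclose(lhs, rhs)

# Not linear in the inputs: scaling X does not scale the output.
nonlinear_in_inputs = not np.allclose(
    prediction(w1, theta, 2.0 * X), 2.0 * prediction(w1, theta, X)
)
```

With the activation parameters held fixed, the model behaves like a classical basis function regression in the output weights.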

The first difference between the (early) neural network literature and the classical statistical literature is the decision to optimize these parameters, $\parameterVector$, as well as the parameters, $\weightMatrix$ (which would normally be denoted in statistics by $\boldsymbol{\beta}$).

In this tutorial, we're going to revisit that decision, and follow the path of Radford Neal who, inspired by work of David MacKay and others, did his PhD thesis on Bayesian Neural Networks. If we take a Bayesian approach to parameter inference (note I am using inference here in the classical sense, not in the sense of prediction of test data, which seems to be a newer usage), then we don't wish to fit parameters at all, rather we wish to integrate them away and understand the family of functions that the model describes.

This Bayesian approach is designed to deal with uncertainty arising from fitting our prediction function to the data we have, a reduced data set.
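One way to get a feel for the family of functions a model describes is to draw the output weights from a prior and look at the functions that result. The sketch below is illustrative only. It assumes a zero-mean Gaussian prior with variance `alpha` on the output weights, `tanh` activations, and activation parameters (`V`, `b`) that are fixed at random values; none of these choices come from the text above.

```python
import numpy as np

rng = np.random.default_rng(1)
num_hidden = 20
x = np.linspace(-3, 3, 50)[:, None]      # 50 one-dimensional input locations

# Assumed: activation parameters are fixed, not inferred.
V = rng.normal(size=(1, num_hidden))
b = rng.normal(size=num_hidden)
Phi = np.tanh(x @ V + b)                  # activations, shape (50, num_hidden)

# Assumed prior: each output weight ~ N(0, alpha).
alpha = 1.0 / num_hidden
num_samples = 2000
W = rng.normal(scale=np.sqrt(alpha), size=(num_hidden, num_samples))

# Each column of F is one function drawn from the prior, evaluated at x.
F = Phi @ W

# Because F is linear in Gaussian weights, the function values are jointly
# Gaussian with covariance alpha * Phi @ Phi.T.
implied_cov = alpha * Phi @ Phi.T
empirical_cov = np.cov(F)                 # rows are input locations
```

Plotting a few columns of `F` against `x` shows the kinds of function the prior favours, and `empirical_cov` should approach `implied_cov` as `num_samples` grows.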

The Bayesian approach can be derived from a broader understanding of what our objective is. If we accept that we can jointly represent all things that happen in the world with a probability distribution, then we can interrogate that probability to make predictions. So, if we are interested in predictions, $\dataScalar_*$, at future input locations of interest, $\inputVector_*$, given previously observed training observations, $\dataVector$, and their corresponding inputs, $\inputMatrix$, then we are really interrogating the following probability density,
$$
p(\dataScalar_*|\dataVector, \inputMatrix, \inputVector_*).
$$
There is nothing controversial here: as long as you accept that you have a good joint model of the world around you that relates test data to training data, $p(\dataScalar_*, \dataVector, \inputMatrix, \inputVector_*)$, then this conditional distribution can be recovered through standard rules of probability ($\text{data} + \text{model} \rightarrow \text{prediction}$).
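Written out, the rule of probability being used here is the definition of a conditional density: the joint is divided by its marginal over the quantity being predicted,
$$
p(\dataScalar_*|\dataVector, \inputMatrix, \inputVector_*) = \frac{p(\dataScalar_*, \dataVector, \inputMatrix, \inputVector_*)}{\int p(\dataScalar_*, \dataVector, \inputMatrix, \inputVector_*) \text{d} \dataScalar_*}.
$$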

We can construct this predictive density through the use of the following decomposition:
$$
p(\dataScalar_*|\dataVector, \inputMatrix, \inputVector_*) = \int p(\dataScalar_*|\inputVector_*, \parameterVector) p(\parameterVector | \dataVector, \inputMatrix) \text{d} \parameterVector
$$
where, for convenience, we are assuming *all* the parameters of the model are now represented by $\parameterVector$, $p(\parameterVector | \dataVector, \inputMatrix)$ is recognised as the posterior density of the parameters given data, and $p(\dataScalar_*|\inputVector_*, \parameterVector)$ is the *likelihood* of an individual test data point given the parameters. The likelihood of the data is normally assumed to be independent across data points given the parameters,
$$
p(\dataVector|\inputMatrix, \parameterVector) = \prod_{i=1}^\numData p(\dataScalar_i|\inputVector_i, \parameterVector),$$
and if that is so, it is easy to extend our predictions across all future, potential, locations,
$$
p(\dataVector_*|\dataVector, \inputMatrix, \inputMatrix_*) = \int p(\dataVector_*|\inputMatrix_*, \parameterVector) p(\parameterVector | \dataVector, \inputMatrix) \text{d} \parameterVector.
$$
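When posterior samples of the parameters are available, the predictive integral can be approximated by simple Monte Carlo: draw parameters from the posterior, push each draw through the likelihood, and average. The sketch below makes this concrete under strong assumptions that are not part of the text: fixed radial basis functions (the hypothetical `basis` helper), a Gaussian prior with variance `alpha` on the output weights only, and Gaussian noise with known variance `sigma2`. In that special case the posterior and predictive are available in closed form, which gives something to compare the Monte Carlo estimate against.

```python
import numpy as np

rng = np.random.default_rng(2)

def basis(x, centres, width=1.0):
    """Assumed fixed radial basis functions evaluated at 1-D inputs x."""
    return np.exp(-0.5 * (x[:, None] - centres[None, :]) ** 2 / width**2)

centres = np.linspace(-2, 2, 6)
sigma2 = 0.01          # assumed known noise variance
alpha = 1.0            # assumed prior variance of each weight

# Synthetic training data, for illustration only.
x_train = rng.uniform(-2, 2, size=15)
y_train = np.sin(2 * x_train) + rng.normal(scale=np.sqrt(sigma2), size=15)
x_test = np.array([0.5])

Phi = basis(x_train, centres)
Phi_star = basis(x_test, centres)

# Closed-form Gaussian posterior over the weights.
A = Phi.T @ Phi / sigma2 + np.eye(len(centres)) / alpha
post_cov = np.linalg.inv(A)
post_cov = 0.5 * (post_cov + post_cov.T)   # enforce exact symmetry
post_mean = post_cov @ Phi.T @ y_train / sigma2

# Monte Carlo approximation of the predictive integral:
# sample weights from the posterior, then sample y_* from the likelihood.
S = 5000
w_samples = rng.multivariate_normal(post_mean, post_cov, size=S)
f_samples = w_samples @ Phi_star.T
y_samples = f_samples + rng.normal(scale=np.sqrt(sigma2), size=f_samples.shape)
mc_mean = y_samples.mean(axis=0)
mc_var = y_samples.var(axis=0)

# Closed-form predictive moments for comparison.
exact_mean = Phi_star @ post_mean
exact_var = np.sum(Phi_star @ post_cov * Phi_star, axis=1) + sigma2
```

For models where the activation parameters are also uncertain, no such closed form exists and drawing posterior samples is itself the hard part.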
The likelihood is also where the *prediction function* is incorporated. For example in the regression case, we consider an objective based around the Gaussian density,
$$
p(\dataScalar_i | \mappingFunction(\inputVector_i)) = \frac{1}{\sqrt{2\pi \dataStd^2}} \exp\left(-\frac{\left(\dataScalar_i - \mappingFunction(\inputVector_i)\right)^2}{2\dataStd^2}\right)
$$
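In code, the Gaussian density and the independence assumption combine into a sum of per-point log densities. This is a small sketch with hypothetical function names and made-up numbers; working in logs is a common choice for numerical stability rather than something the text prescribes.

```python
import numpy as np

def gaussian_log_density(y_i, f_i, sigma2):
    """Log of the Gaussian density of y_i with mean f_i and variance sigma2."""
    return -0.5 * np.log(2 * np.pi * sigma2) - (y_i - f_i) ** 2 / (2 * sigma2)

def log_likelihood(y, f, sigma2):
    """Independence across data points turns the product into a sum of logs."""
    return np.sum(gaussian_log_density(y, f, sigma2))

# Illustrative values only.
y = np.array([0.9, 2.1, 2.9])     # observations
f = np.array([1.0, 2.0, 3.0])     # prediction function evaluated at the inputs
sigma2 = 0.25

total = log_likelihood(y, f, sigma2)
product_of_densities = np.prod(np.exp(gaussian_log_density(y, f, sigma2)))
```

Exponentiating `total` recovers the product of the individual densities.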

In short, that is the classical approach to probabilistic inference, and all approaches to Bayesian neural networks fall within this path. For a deep probabilistic model, we can simply take this one stage further and place a probability distribution over the input locations,
$$
p(\dataVector_*|\dataVector) = \int p(\dataVector_*|\inputMatrix_*, \parameterVector) p(\parameterVector | \dataVector, \inputMatrix) p(\inputMatrix) p(\inputMatrix_*) \text{d} \parameterVector \text{d} \inputMatrix \text{d}\inputMatrix_*
$$
and we have *unsupervised learning*, and families of deep probabilistic models. 

Of course, the devil is in the detail, and while everything is easy to write in terms of probability densities, as we move from $\text{data}$ and $\text{model}$ to $\text{prediction}$ there is that simple $\rightarrow$ sign, which is now burying a wealth of difficulties. Each integral sign above is a high dimensional integral which will typically need approximation. Approximations also come with computational demands. As we consider more complex classes of functions, the challenges around the integrals become harder and prediction of 


functions alongsidethe parametProbabilistic inference allows us to consider what kinds of ac

$$