# R250: Gaussian Processes Introduction

### [Neil D. Lawrence](http://inverseprobability.com), University of Cambridge

### 2020-01-24

**Abstract**: In this talk we give an introduction to Gaussian processes
for students who are interested in working with GPs for the R250
module.


### Pierre-Simon Laplace


<img class="" src="https://mlatcl.github.io/deepnn/./slides/diagrams//ml/Pierre-Simon_Laplace.png" style="width:30%">

Figure: <i>Pierre-Simon Laplace 1749-1827.</i>

In [None]:
import notutils as nu
nu.display_google_book(id='1YQPAAAAQAAJ', page='PR17-IA2')

Famously, Laplace considered the idea of a deterministic Universe, one
in which the model is *known*, or as the translation below puts it,
“an intelligence which could comprehend all the forces by which nature
is animated”. He speculates on an “intelligence” that can submit this
vast data to analysis and proposes that such an entity would be able to
predict the future.

> Given for one instant an intelligence which could comprehend all the
> forces by which nature is animated and the respective situation of the
> beings who compose it—an intelligence sufficiently vast to submit
> these data to analysis—it would embrace in the same formula the
> movements of the greatest bodies of the universe and those of the
> lightest atom; for it, nothing would be uncertain and the future, as
> the past, would be present in its eyes.

This notion is known as *Laplace’s demon* or *Laplace’s superman*.

<img class="" src="https://mlatcl.github.io/deepnn/./slides/diagrams//physics/laplacesDeterminismEnglish.png" style="width:60%">

Figure: <i>Laplace’s determinism in English translation.</i>

## Laplace’s Gremlin


Unfortunately, most analyses of his ideas stop at that point, whereas
his real point is that such a notion is unreachable. Not so much
*superman* as *strawman*. Just three pages later in the “Philosophical
Essay on Probabilities” (Laplace, 1814), Laplace goes on to observe:

> The curve described by a simple molecule of air or vapor is regulated
> in a manner just as certain as the planetary orbits; the only
> difference between them is that which comes from our ignorance.
>
> Probability is relative, in part to this ignorance, in part to our
> knowledge.

In [None]:
import notutils as nu
nu.display_google_book(id='1YQPAAAAQAAJ', page='PR17-IA4')

<img class="" src="https://mlatcl.github.io/deepnn/./slides/diagrams//physics/philosophicaless00lapliala.png" style="width:60%">

Figure: <i>To Laplace, determinism is a strawman. Ignorance of mechanism
and data leads to uncertainty which should be dealt with through
probability.</i>

In other words, we can never make use of the idealistic deterministic
Universe due to our ignorance about the world. Laplace’s suggestion, and
the focus of his essay, is that we turn to probability to deal with this
uncertainty. This is also our inspiration for using probability in
machine learning. The true message of Laplace’s essay is not
determinism, but the gremlin of uncertainty that emerges from our
ignorance.

The “forces by which nature is animated” is our *model*, the “situation
of the beings who compose it” is our *data* and the “intelligence
sufficiently vast to submit these data to analysis” is our
*compute*. The fly in the ointment is our *ignorance* about these aspects.
And *probability* is the tool we use to incorporate this ignorance,
leading to uncertainty or *doubt* in our predictions.

## Bayesian Inference by Rejection Sampling


One view of Bayesian inference is to assume we are given a mechanism for
generating samples, where we assume that mechanism is representing an
accurate view on the way we believe the world works.

This mechanism is known as our *prior* belief.

We combine our prior belief with our observations of the real world by
discarding all those prior samples that are inconsistent with our
observations. The *likelihood* defines mathematically what we mean by
inconsistent with the observations. The higher the noise level in the
likelihood, the looser the notion of consistent.

The samples that remain are samples from the *posterior*.

This approach to Bayesian inference is closely related to two sampling
techniques known as *rejection sampling* and *importance sampling*. It
is realized in practice in an approach known as *approximate Bayesian
computation* (ABC) or likelihood-free inference.

In practice, the algorithm is often too slow to be useful, because
most samples will be inconsistent with the observations and as a result
the mechanism must be operated many times to obtain a few posterior
samples.

However, in the Gaussian process case, when the likelihood also assumes
Gaussian noise, we can operate this mechanism mathematically, and obtain
the posterior density *analytically*. This is the benefit of Gaussian
processes.
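The rejection-sampling view described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the routine used by `mlai.plot.rejection_samples`: the grid, the two observations, the tolerance and the helper name `eq_cov_matrix` are all assumptions made for the example, and a hard tolerance stands in for the noise level of the likelihood.

```python
import numpy as np

def eq_cov_matrix(x, x2, variance=1.0, lengthscale=0.25):
    """Exponentiated quadratic covariance between two sets of 1-D inputs."""
    sqdist = (x[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

rng = np.random.default_rng(0)

# Prior: a zero-mean Gaussian over function values on a grid.
x = np.linspace(0.0, 1.0, 25)
K = eq_cov_matrix(x, x) + 1e-6 * np.eye(x.size)  # small jitter for stability
L = np.linalg.cholesky(K)
prior_samples = rng.standard_normal((20000, x.size)) @ L.T

# Hypothetical observations at two grid points.
obs_idx = np.array([5, 18])
obs_y = np.array([0.5, -0.3])
tolerance = 0.1  # looser tolerance = noisier likelihood

# Discard every prior sample that misses an observation by more than the tolerance.
consistent = np.all(np.abs(prior_samples[:, obs_idx] - obs_y) < tolerance, axis=1)
posterior_samples = prior_samples[consistent]

print(f"kept {posterior_samples.shape[0]} of {prior_samples.shape[0]} prior samples")
```

Only a small fraction of the prior samples survives, which is the inefficiency the text refers to. Tightening `tolerance` or adding observations makes the fraction smaller still.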

First, we will load in a Python class and a function for computing the
covariance function.

In [None]:
import mlai

In [None]:
%load -n mlai.Kernel

In [None]:
# %load -n mlai.Kernel
import numpy as np  # used by K() and diag() below


class Kernel():
    """Covariance function

    :param function: covariance function
    :type function: function
    :param name: name of covariance function
    :type name: string
    :param shortname: abbreviated name of covariance function
    :type shortname: string
    :param formula: latex formula of covariance function
    :type formula: string
    :param kwargs: parameters of the covariance function (for example
        variance, lengthscale), passed to ``function`` on every call
    """

    def __init__(self, function, name=None, shortname=None, formula=None, **kwargs):        
        self.function=function
        self.formula = formula
        self.name = name
        self.shortname = shortname
        self.parameters=kwargs
        
    def K(self, X, X2=None):
        """Compute the full covariance function given a kernel function for two data points."""
        if X2 is None:
            X2 = X
        K = np.zeros((X.shape[0], X2.shape[0]))
        for i in np.arange(X.shape[0]):
            for j in np.arange(X2.shape[0]):
                K[i, j] = self.function(X[i, :], X2[j, :], **self.parameters)

        return K

    def diag(self, X):
        """Compute the diagonal of the covariance function"""
        diagK = np.zeros((X.shape[0], 1))
        for i in range(X.shape[0]):            
            diagK[i] = self.function(X[i, :], X[i, :], **self.parameters)
        return diagK

    def _repr_html_(self):
        raise NotImplementedError

In [None]:
%load -n mlai.eq_cov

In [None]:
# %load -n mlai.eq_cov
def eq_cov(x, x_prime, variance=1., lengthscale=1.):
    """Exponentiated quadratic covariance function."""
    diffx = x - x_prime
    return variance*np.exp(-0.5*np.dot(diffx, diffx)/lengthscale**2)

In [None]:
kernel = Kernel(function=eq_cov,
                     name='Exponentiated Quadratic',
                     shortname='eq',                     
                     lengthscale=0.25)

Next, we sample from a multivariate normal density (a multivariate
Gaussian), using the covariance function to compute the covariance
matrix.

In [None]:
import numpy as np
np.random.seed(10)
import mlai.plot as plot

In [None]:
plot.rejection_samples(kernel=kernel, 
    diagrams='./gp')

In [None]:
import notutils as nu
from ipywidgets import IntSlider

In [None]:
nu.display_plots('gp_rejection_sample{sample:0>3}.png', 
                 directory='./gp', 
                 sample=IntSlider(1,1,5,1))

<img class="" src="https://mlatcl.github.io/deepnn/./slides/diagrams//gp/gp_rejection_sample003.png" style="width:100%">
<img class="" src="https://mlatcl.github.io/deepnn/./slides/diagrams//gp/gp_rejection_sample004.png" style="width:100%">
<img class="" src="https://mlatcl.github.io/deepnn/./slides/diagrams//gp/gp_rejection_sample005.png" style="width:100%">

Figure: <i>One view of Bayesian inference is we have a machine for
generating samples (the *prior*), and we discard all samples
inconsistent with our data, leaving the samples of interest (the
*posterior*). This is a rejection sampling view of Bayesian inference.
The Gaussian process allows us to do this analytically by multiplying
the *prior* by the *likelihood*.</i>
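The analytic alternative mentioned in the caption can be sketched with the standard formulas for conditioning a joint Gaussian: the posterior mean is $\mathbf{K}_{*f}(\mathbf{K}_{ff} + \sigma^2\mathbf{I})^{-1}\mathbf{ y}$ and the posterior covariance is $\mathbf{K}_{**} - \mathbf{K}_{*f}(\mathbf{K}_{ff} + \sigma^2\mathbf{I})^{-1}\mathbf{K}_{f*}$. The training points, the noise variance and the helper names `eq_cov_matrix` and `gp_posterior` are assumptions made for this illustration, not part of `mlai`.

```python
import numpy as np

def eq_cov_matrix(x, x2, variance=1.0, lengthscale=0.25):
    """Exponentiated quadratic covariance between two sets of 1-D inputs."""
    sqdist = (x[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

def gp_posterior(x_train, y_train, x_test, noise_var):
    """Posterior mean and covariance of f at x_test under Gaussian noise."""
    K_ff = eq_cov_matrix(x_train, x_train) + noise_var * np.eye(x_train.size)
    K_sf = eq_cov_matrix(x_test, x_train)
    K_ss = eq_cov_matrix(x_test, x_test)
    mean = K_sf @ np.linalg.solve(K_ff, y_train)
    cov = K_ss - K_sf @ np.linalg.solve(K_ff, K_sf.T)
    return mean, cov

# Hypothetical observations and noise level.
x_train = np.array([0.2, 0.75])
y_train = np.array([0.5, -0.3])
noise_var = 0.01

x_test = np.linspace(0.0, 1.0, 50)
post_mean, post_cov = gp_posterior(x_train, y_train, x_test, noise_var)
post_var = np.diag(post_cov)
print(post_mean[:3], post_var[:3])
```

No samples are discarded here: a single linear solve replaces the many prior draws that rejection sampling needs.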

# What is Machine Learning?


What is machine learning? At its most basic level machine learning is a
combination of

$$\text{data} + \text{model} \stackrel{\text{compute}}{\rightarrow} \text{prediction}$$

where *data* is our observations. They can be actively or passively
acquired (meta-data). The *model* contains our assumptions, based on
previous experience. That experience can be other data, it can come from
transfer learning, or it can merely be our beliefs about the
regularities of the universe. In humans our models include our inductive
biases. The *prediction* is an action to be taken or a categorization or
a quality score. The reason that machine learning has become a mainstay
of artificial intelligence is the importance of predictions in
artificial intelligence. The data and the model are combined through
computation.

In practice we normally perform machine learning using two functions. To
combine data with a model we typically make use of:

**a prediction function**, which is used to make the predictions. It
includes our beliefs about the regularities of the universe, our
assumptions about how the world works, e.g., smoothness, spatial
similarities, temporal similarities.

**an objective function**, which defines the ‘cost’ of misprediction.
Typically, it includes knowledge about the world’s generating processes
(probabilistic objectives) or the costs we pay for mispredictions
(empirical risk minimization).

The combination of data and model through the prediction function and
the objective function leads to a *learning algorithm*. The class of
prediction functions and objective functions we can make use of is
restricted by the algorithms they lead to. If the prediction function or
the objective function are too complex, then it can be difficult to find
an appropriate learning algorithm. Much of the academic field of machine
learning is the quest for new learning algorithms that allow us to bring
different types of models and data together.
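As a concrete, deliberately simple instance of this recipe, the sketch below pairs a linear prediction function with a squared-error objective, and the learning algorithm that results is gradient descent. The synthetic data, the learning rate and the function names are assumptions chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data from a hypothetical linear relationship plus noise.
x = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * x - 0.5 + 0.1 * rng.standard_normal(100)

def prediction(x, m, c):
    """Prediction function: encodes the assumption that y is linear in x."""
    return m * x + c

def objective(y, f):
    """Objective function: mean squared cost of misprediction."""
    return np.mean((y - f) ** 2)

# Learning algorithm: gradient descent on the objective.
m, c = 0.0, 0.0
learning_rate = 0.1
for _ in range(2000):
    residual = y - prediction(x, m, c)
    m += learning_rate * 2.0 * np.mean(residual * x)  # minus the gradient w.r.t. m
    c += learning_rate * 2.0 * np.mean(residual)      # minus the gradient w.r.t. c

print(f"m = {m:.2f}, c = {c:.2f}, objective = {objective(y, prediction(x, m, c)):.4f}")
```

Changing either function changes the algorithm: a non-differentiable objective, for instance, would rule out this particular update.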

A useful reference for state of the art in machine learning is the UK
Royal Society Report, [Machine Learning: Power and Promise of Computers
that Learn by
Example](https://royalsociety.org/~/media/policy/projects/machine-learning/publications/machine-learning-report.pdf).

You can also check my blog post on [What is Machine
Learning?](http://inverseprobability.com/2017/07/17/what-is-machine-learning).

In practice, we normally also have uncertainty associated with these
functions. Uncertainty in the prediction function arises from

1.  scarcity of training data and
2.  mismatch between the set of prediction functions we choose and all
    possible prediction functions.

There are also challenges around specification of the objective
function, but we will save those for another day. For the moment,
let us focus on the prediction function.

## Neural Networks and Prediction Functions


Neural networks are adaptive non-linear function models. Originally,
they were studied (McCulloch and Pitts, 1943) as simple models for
neurons, but over the last decade they have become popular because they
are a flexible approach to modelling complex data.
A particular characteristic of neural network models is that they can be
composed to form highly complex functions which encode many of our
expectations of the real world. They allow us to encode our assumptions
about how the world works.

We will return to composition later, but for the moment, let’s focus on
a one hidden layer neural network. We are interested in the prediction
function, so we’ll ignore the objective function (which is often called
an error function) for the moment, and just describe the mathematical
object of interest

$$
f(\mathbf{ x}) = \mathbf{W}^\top \boldsymbol{ \phi}(\mathbf{V}, \mathbf{ x})
$$

where in this case $f(\cdot)$ is a scalar function with vector inputs,
and $\boldsymbol{ \phi}(\cdot)$ is a vector function with vector inputs.
The dimensionality of the vector function is known as the number of
hidden units, or the number of neurons. The elements of this vector
function are known as the *activation* functions of the neural network
and $\mathbf{V}$ are the parameters of the activation functions.
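A minimal sketch of this prediction function follows. The choice of `tanh` as the activation, the layout of `V` as one row per hidden unit, and the use of a weight vector `w` for the scalar output are assumptions made for the illustration.

```python
import numpy as np

def phi(V, x):
    """Vector of h activations; row k of V parameterises activation k."""
    return np.tanh(V @ x)

def f(x, w, V):
    """Scalar prediction: weighted sum of the h activations."""
    return w @ phi(V, x)

rng = np.random.default_rng(2)
h, p = 5, 3                      # number of hidden units, input dimension
V = rng.standard_normal((h, p))  # parameters of the activation functions
w = rng.standard_normal(h)       # output weights
x = rng.standard_normal(p)       # a single input vector

print(phi(V, x).shape, f(x, w, V))
```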

## Relations with Classical Statistics

In statistics activation functions are traditionally known as *basis
functions*, and we would think of this as a *linear model*. It doesn’t
make linear predictions, but it’s linear because in statistics
estimation focuses on the parameters $\mathbf{W}$, not the parameters
$\mathbf{V}$. The linear model terminology refers to the fact that the
model is *linear in the parameters*, but it is *not* linear in the data
unless the activation functions are chosen to be linear.
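The distinction can be checked numerically. In the sketch below the basis-function parameters are held fixed, so estimating the output weights is ordinary least squares, and superposition holds in the weights but not in the inputs. The `tanh` basis, the toy target and all names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
h = 4
V = rng.standard_normal((h, 1))          # fixed, not estimated

def basis(X):
    """n x h matrix of tanh basis functions evaluated at the rows of X."""
    return np.tanh(X @ V.T)

X = np.linspace(-2.0, 2.0, 30)[:, None]
y = np.sin(2.0 * X[:, 0])

# Linear in the weights: estimation is ordinary least squares.
Phi = basis(X)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Superposition holds in the parameters ...
w1, w2 = rng.standard_normal(h), rng.standard_normal(h)
linear_in_w = np.allclose(Phi @ (2.0 * w1 + 3.0 * w2),
                          2.0 * (Phi @ w1) + 3.0 * (Phi @ w2))

# ... but scaling the inputs does not scale the predictions.
linear_in_x = np.allclose(basis(2.0 * X) @ w1, 2.0 * (basis(X) @ w1))

print(linear_in_w, linear_in_x)
```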

## Adaptive Basis Functions

The first difference in the (early) neural network literature to the
classical statistical literature is the decision to optimize these
parameters, $\mathbf{V}$, as well as the parameters, $\mathbf{W}$ (which
would normally be denoted in statistics by $\boldsymbol{\beta}$)[1].

[1] In classical statistics we often interpret these parameters,
$\boldsymbol{\beta}$, whereas in machine learning we are normally more
interested in the result of the prediction, and less in the parameters.
This is changing, though, with more need for accountability. In honour
of this I normally use $\boldsymbol{\beta}$ when I care about the value
of these parameters, and $\mathbf{ w}$ when I care more about the
quality of the prediction.

## Integrated Basis Functions

We’re going to revisit that decision, and follow the path of Radford
Neal (Neal, 1994) who, inspired by work of David MacKay (MacKay, 1992)
and others, did his PhD thesis on Bayesian Neural Networks. If we take a
Bayesian approach to parameter inference (note I am using inference here
in the classical sense, not in the sense of prediction of test data,
which seems to be a newer usage), then we don’t wish to fit parameters
at all, rather we wish to integrate them away and understand the family
of functions that the model describes.

## Probabilistic Modelling


This Bayesian approach is designed to deal with uncertainty arising from
fitting our prediction function to the data we have, a reduced data set.

The Bayesian approach can be derived from a broader understanding of
what our objective is. If we accept that we can jointly represent all
things that happen in the world with a probability distribution, then we
can interrogate that probability to make predictions. So, if we are
interested in predictions, $y_*$, at future input locations of
interest, $\mathbf{ x}_*$, given previously observed training data,
$\mathbf{ y}$, and corresponding inputs, $\mathbf{X}$, then we are
really interrogating the following probability density, $$
p(y_*|\mathbf{ y}, \mathbf{X}, \mathbf{ x}_*).
$$ There is nothing controversial here. As long as you accept that you
have a good joint model of the world around you that relates test data
to training data, $p(y_*, \mathbf{ y}, \mathbf{X}, \mathbf{ x}_*)$, then
this conditional distribution can be recovered through standard rules of
probability
($\text{data} + \text{model} \rightarrow \text{prediction}$).

We can construct this predictive density through the use of the
following decomposition: $$
p(y_*|\mathbf{ y}, \mathbf{X}, \mathbf{ x}_*) = \int p(y_*|\mathbf{ x}_*, \boldsymbol{ \theta}) p(\boldsymbol{ \theta}| \mathbf{ y}, \mathbf{X}) \text{d} \boldsymbol{ \theta}
$$

where, for convenience, we are assuming *all* the parameters of the
model are now represented by $\boldsymbol{ \theta}$ (which contains
$\mathbf{W}$ and $\mathbf{V}$),
$p(\boldsymbol{ \theta}| \mathbf{ y}, \mathbf{X})$ is recognised as the
posterior density of the parameters given data, and
$p(y_*|\mathbf{ x}_*, \boldsymbol{ \theta})$ is the *likelihood* of an
individual test data point given the parameters.

The likelihood of the data is normally assumed to be independent across
data points given the parameters, $$
p(\mathbf{ y}|\mathbf{X}, \boldsymbol{ \theta}) = \prod_{i=1}^np(y_i|\mathbf{ x}_i, \boldsymbol{ \theta}),$$

and if that is so, it is easy to extend our predictions across all
future, potential, locations, $$
p(\mathbf{ y}_*|\mathbf{ y}, \mathbf{X}, \mathbf{X}_*) = \int p(\mathbf{ y}_*|\mathbf{X}_*, \boldsymbol{ \theta}) p(\boldsymbol{ \theta}| \mathbf{ y}, \mathbf{X}) \text{d} \boldsymbol{ \theta}.
$$

The likelihood is also where the *prediction function* is incorporated.
For example in the regression case, we consider an objective based
around the Gaussian density, $$
p(y_i | f(\mathbf{ x}_i)) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{\left(y_i - f(\mathbf{ x}_i)\right)^2}{2\sigma^2}\right)
$$
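Because the data points are assumed independent given the function values, the log-likelihood of the whole data set is a sum of per-point Gaussian log densities. The short check below confirms that the summed form agrees with the log of the product of individual densities; the toy values and the function name `gaussian_log_likelihood` are assumptions made for the example.

```python
import numpy as np

def gaussian_log_likelihood(y, f, sigma2):
    """log p(y | f) for independent Gaussian noise with variance sigma2."""
    n = y.size
    return (-0.5 * n * np.log(2.0 * np.pi * sigma2)
            - 0.5 * np.sum((y - f) ** 2) / sigma2)

# Hypothetical observations, function values and noise variance.
y = np.array([0.9, 2.1, 2.9])
f = np.array([1.0, 2.0, 3.0])
sigma2 = 0.04

# Density of each data point, written exactly as in the equation above.
per_point = np.exp(-0.5 * (y - f) ** 2 / sigma2) / np.sqrt(2.0 * np.pi * sigma2)
log_of_product = np.log(np.prod(per_point))

print(log_of_product, gaussian_log_likelihood(y, f, sigma2))
```

Working with the sum of logs rather than the product avoids numerical underflow once the number of data points grows.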

In short, that is the classical approach to probabilistic inference, and
all approaches to Bayesian neural networks fall within this path. For a
deep probabilistic model, we can simply take this one stage further and
place a probability distribution over the input locations, $$
p(\mathbf{ y}_*|\mathbf{ y}) = \int p(\mathbf{ y}_*|\mathbf{X}_*, \boldsymbol{ \theta}) p(\boldsymbol{ \theta}| \mathbf{ y}, \mathbf{X}) p(\mathbf{X}) p(\mathbf{X}_*) \text{d} \boldsymbol{ \theta}\text{d} \mathbf{X}\text{d}\mathbf{X}_*
$$ and we have *unsupervised learning* (from where we can get deep
generative models).

## Graphical Models


One way of representing a joint distribution is to consider conditional
dependencies between data. Conditional dependencies allow us to
factorize the distribution. For example, a Markov chain is a
factorization of a distribution into components that represent the
conditional relationships between points that are neighboring, often in
time or space. It can be decomposed in the following form.
$$p(\mathbf{ y}) = p(y_n| y_{n-1}) p(y_{n-1}|y_{n-2}) \dots p(y_{2} | y_{1}) p(y_1)$$
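The factorization can be made concrete with a two-state chain. The sketch below multiplies an initial distribution for the first variable by one conditional per subsequent variable, then checks that the resulting joint sums to one over all sequences. The probabilities are made-up values for illustration.

```python
import itertools
import numpy as np

initial = np.array([0.6, 0.4])           # p(y_1)
transition = np.array([[0.9, 0.1],       # transition[a, b] = p(y_i = b | y_{i-1} = a)
                       [0.3, 0.7]])

def joint_probability(sequence):
    """Joint probability of a state sequence under the Markov factorization."""
    prob = initial[sequence[0]]
    for prev, curr in zip(sequence[:-1], sequence[1:]):
        prob *= transition[prev, curr]
    return prob

n = 4
total = sum(joint_probability(seq)
            for seq in itertools.product([0, 1], repeat=n))

print(joint_probability((0, 0, 1, 1)), total)
```

The full joint over four binary variables would need 15 free parameters; the chain needs only three (one for the initial distribution and two for the transition table).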

In [None]:
import daft
from matplotlib import rc

rc("font", **{'family':'sans-serif','sans-serif':['Helvetica']}, size=30)
rc("text", usetex=True)

In [None]:
pgm = daft.PGM(shape=[3, 1],
               origin=[0, 0], 
               grid_unit=5, 
               node_unit=1.9, 
               observed_style='shaded',
              line_width=3)


pgm.add_node(daft.Node("y_1", r"$y_1$", 0.5, 0.5, fixed=False))
pgm.add_node(daft.Node("y_2", r"$y_2$", 1.5, 0.5, fixed=False))
pgm.add_node(daft.Node("y_3", r"$y_3$", 2.5, 0.5, fixed=False))
pgm.add_edge("y_1", "y_2")
pgm.add_edge("y_2", "y_3")

pgm.render().figure.savefig("./ml/markov.svg", transparent=True)

<img src="https://mlatcl.github.io/deepnn/./slides/diagrams//ml/markov.svg" class="" width="50%" style="vertical-align:middle;">

Figure: <i>A Markov chain is a simple form of probabilistic graphical
model providing a particular decomposition of the joint density.</i>

By specifying conditional independencies we can reduce the
parameterization required for our data. Instead of directly specifying
the parameters of the joint distribution, we can specify each set of
parameters of the conditionals independently. This can also give an
advantage in terms of interpretability. Understanding a conditional
independence structure gives a structured understanding of data. If
developed correctly, according to causal methodology, it can even inform
how we should intervene in the system to drive a desired result (Pearl,
1995).

However, a challenge arises when the data becomes more complex. Consider
the graphical model shown below, used to predict the perioperative risk
of *C. difficile* infection following colon surgery (Steele et al.,
2012).

<img class="negate" src="https://mlatcl.github.io/deepnn/./slides/diagrams//bayes-net-diagnosis.png" style="width:60%">

Figure: <i>A probabilistic directed graph used to predict the
perioperative risk of *C. difficile* infection following colon surgery.
When these models have good predictive performance they are often
difficult to interpret. This may be due to the limited representation
capability of the conditional densities in the model.</i>

To capture the complexity in the interrelationship between the data, the
graph itself becomes more complex, and less interpretable.

## Performing Inference

As far as combining our data and our model to form our prediction, the
devil is in the detail. While everything is easy to write in terms of
probability densities, as we move from $\text{data}$ and $\text{model}$
to $\text{prediction}$ there is that simple
$\stackrel{\text{compute}}{\rightarrow}$ sign, which is now burying a
wealth of difficulties. Each integral sign above is a high dimensional
integral which will typically need approximation. Approximations also
come with computational demands. As we consider more complex classes of
functions, the challenges around the integrals become harder and
prediction of future test data given our model and the data becomes so
involved as to be impractical or impossible.

Statisticians realized these challenges early on, indeed so early that
they were actually physicists: both Laplace and Gauss worked on models
such as this. Gauss made his career on predicting the location of the
lost planet Ceres (later reclassified as an asteroid, then a dwarf
planet). Gauss and Laplace made use of maximum a posteriori
estimates to simplify their computations, and Laplace developed
Laplace’s method (and invented the Gaussian density) to expand around
that mode. But classical statistics needs better guarantees around model
performance and interpretation, and as a result has focussed more on the
*linear* model implied by $$
  f(\mathbf{ x}) = \left.\mathbf{ w}^{(2)}\right.^\top \boldsymbol{ \phi}(\mathbf{W}_1, \mathbf{ x})
  $$

$$
  \mathbf{ w}^{(2)} \sim \mathcal{N}\left(\mathbf{0},\mathbf{C}\right).
  $$

The Gaussian likelihood given above implies that the data observation is
related to the function by noise corruption so we have, $$
  y_i = f(\mathbf{ x}_i) + \epsilon_i,
  $$ where $$
  \epsilon_i \sim \mathcal{N}\left(0,\sigma^2\right)
  $$

and while normally integrating over high dimensional parameter vectors
is highly complex, here it is *trivial*. That is because of a property
of the multivariate Gaussian.

Gaussian processes are initially of interest because

1.  linear Gaussian models are easier to deal with, and
2.  even the parameters *within* the process can be handled, by
    considering a particular limit.

Let’s first of all review the properties of the multivariate Gaussian
distribution that make linear Gaussian models easier to deal with. We’ll
return to the, perhaps surprising, result on the parameters within the
nonlinearity, $\boldsymbol{ \theta}$, shortly.

To find the marginal likelihood of a linear Gaussian model, all you need
to know is the following rule. If $$
\mathbf{ y}= \mathbf{W}\mathbf{ x}+ \boldsymbol{ \epsilon},
$$ where $\mathbf{ y}$, $\mathbf{ x}$ and $\boldsymbol{ \epsilon}$ are
vectors and we assume that $\mathbf{ x}$ and $\boldsymbol{ \epsilon}$
are drawn from multivariate Gaussians, $$
\begin{align}
\mathbf{ x}& \sim \mathcal{N}\left(\boldsymbol{ \mu},\mathbf{C}\right)\\
\boldsymbol{ \epsilon}& \sim \mathcal{N}\left(\mathbf{0},\boldsymbol{ \Sigma}\right)
\end{align}
$$ then we know that $\mathbf{ y}$ is also drawn from a multivariate
Gaussian with, $$
\mathbf{ y}\sim \mathcal{N}\left(\mathbf{W}\boldsymbol{ \mu},\mathbf{W}\mathbf{C}\mathbf{W}^\top + \boldsymbol{ \Sigma}\right).
$$
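This rule is easy to verify by simulation. The sketch below draws samples of $\mathbf{ x}$ and $\boldsymbol{ \epsilon}$, pushes them through the linear map, and compares the empirical mean and covariance of $\mathbf{ y}$ with the analytic expressions. The particular values of `W`, `mu`, `C` and `Sigma` are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

W = np.array([[1.0, 2.0], [0.5, -1.0], [0.0, 1.0]])   # 3 x 2 linear map
mu = np.array([1.0, -1.0])
C = np.array([[1.0, 0.3], [0.3, 0.5]])
Sigma = 0.1 * np.eye(3)

# Sample x and epsilon, then form y = W x + epsilon for every sample.
n_samples = 200000
x = rng.multivariate_normal(mu, C, size=n_samples)
eps = rng.multivariate_normal(np.zeros(3), Sigma, size=n_samples)
y = x @ W.T + eps

analytic_mean = W @ mu
analytic_cov = W @ C @ W.T + Sigma
empirical_mean = y.mean(axis=0)
empirical_cov = np.cov(y, rowvar=False)

print(np.abs(empirical_mean - analytic_mean).max())
print(np.abs(empirical_cov - analytic_cov).max())
```

Both discrepancies shrink as `n_samples` grows, while the analytic result requires no sampling at all.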

With appropriately defined covariance, $\boldsymbol{ \Sigma}$, this is
actually the marginal likelihood for Factor Analysis, or Probabilistic
Principal Component Analysis (Tipping and Bishop, 1999), because we
integrated out the inputs (or *latent* variables they would be called in
that case).

## Linear Model Overview


However, we are focussing on what happens in models which are non-linear
in the inputs, whereas the above would be *linear* in the inputs. To
consider these, we introduce a matrix, called the design matrix. We set
each activation function computed at each data point to be $$
\phi_{i,j} = \phi(\mathbf{ w}^{(1)}_{j}, \mathbf{ x}_{i})
$$ and define the matrix of activations (known as the *design matrix* in
statistics) to be, $$
\boldsymbol{ \Phi}= 
\begin{bmatrix}
\phi_{1, 1} & \phi_{1, 2} & \dots & \phi_{1, h} \\
\phi_{2, 1} & \phi_{2, 2} & \dots & \phi_{2, h} \\
\vdots & \vdots & \ddots & \vdots \\
\phi_{n, 1} & \phi_{n, 2} & \dots & \phi_{n, h}
\end{bmatrix}.
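$$ By convention this matrix always has $n$ rows and $h$ columns. If we
now define the vector of all noise corruptions,
$\boldsymbol{ \epsilon}= \left[\epsilon_1, \dots \epsilon_n\right]^\top$,
then the model for the full data set can be written as
$\mathbf{ y}= \boldsymbol{ \Phi}\mathbf{ w}+ \boldsymbol{ \epsilon}$.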

If we define the prior distribution over the vector $\mathbf{ w}$ to be
Gaussian, $$
\mathbf{ w}\sim \mathcal{N}\left(\mathbf{0},\alpha\mathbf{I}\right),
$$ then we can use rules of multivariate Gaussians to see that, $$
\mathbf{ y}\sim \mathcal{N}\left(\mathbf{0},\alpha \boldsymbol{ \Phi}\boldsymbol{ \Phi}^\top + \sigma^2 \mathbf{I}\right).
$$

In other words, our training data is distributed as a multivariate
Gaussian, with zero mean and a covariance given by $$
\mathbf{K}= \alpha \boldsymbol{ \Phi}\boldsymbol{ \Phi}^\top + \sigma^2 \mathbf{I}.
$$

This is an $n\times n$ size matrix. Its elements are in the form of a
function. The maths shows that any element, indexed by $i$ and $j$, is a
function *only* of the inputs associated with data points $i$ and $j$,
$\mathbf{ x}_i$ and $\mathbf{ x}_j$, so that
$k_{i,j} = k\left(\mathbf{ x}_i, \mathbf{ x}_j\right)$.

If we look at the portion of this function associated only with
$f(\cdot)$, i.e. we remove the noise, then we can write down the
covariance associated with our neural network, $$
k_f\left(\mathbf{ x}_i, \mathbf{ x}_j\right) = \alpha \boldsymbol{ \phi}\left(\mathbf{W}_1, \mathbf{ x}_i\right)^\top \boldsymbol{ \phi}\left(\mathbf{W}_1, \mathbf{ x}_j\right)
$$ so the elements of the covariance or *kernel* matrix are formed by
inner products of the rows of the *design matrix*.
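The construction can be written out directly. The sketch below builds a design matrix from `tanh` activations, forms the covariance of the data, and checks that a single element of the noise-free covariance depends only on the two inputs involved. The activation, the dimensions and the values of `alpha` and `sigma2` are assumptions for the illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n, h, p = 6, 3, 2                      # data points, hidden units, input dimension
alpha, sigma2 = 1.5, 0.1

X = rng.standard_normal((n, p))
W1 = rng.standard_normal((h, p))       # row j holds the parameters of activation j

Phi = np.tanh(X @ W1.T)                # design matrix, Phi[i, j] = phi(w_j, x_i)
K_f = alpha * Phi @ Phi.T              # covariance of the function values
K = K_f + sigma2 * np.eye(n)           # covariance of the noisy observations

# Element (i, j) is an inner product of rows i and j of the design matrix,
# so it depends only on x_i and x_j.
i, j = 1, 4
k_ij = alpha * np.tanh(W1 @ X[i]) @ np.tanh(W1 @ X[j])

print(Phi.shape, K.shape, np.isclose(K_f[i, j], k_ij))
```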

## Gaussian Process

This is the essence of a Gaussian process. Instead of assuming that
each data point, $y_i$, is drawn i.i.d., we make a joint
Gaussian assumption over our data. The covariance matrix is now a
function of both the parameters of the activation function,
$\mathbf{V}$, and the input variables, $\mathbf{X}$. This comes about
through integrating out the parameters of the model, $\mathbf{ w}$.

## Basis Functions

We can basically put anything inside the basis functions, and many
people do. These can be deep kernels (Cho and Saul, 2009) or we can
learn the parameters of a convolutional neural network inside there.

Viewing a neural network in this way is also what allows us to perform
sensible *batch* normalizations (Ioffe and Szegedy, 2015).

## Non-degenerate Gaussian Processes


The process described above is degenerate. The covariance function is of
rank at most $h$, and since the amount of data could in principle always
increase, $n\rightarrow \infty$, the covariance matrix is not full
rank once $n > h$. This means that as we increase the amount of data,
there will come a point where we can’t normalize the process because the
multivariate Gaussian has the form, $$
\mathcal{N}\left(\mathbf{ f}|\mathbf{0},\mathbf{K}\right) = \frac{1}{\left(2\pi\right)^{\frac{n}{2}}\det{\mathbf{K}}^\frac{1}{2}} \exp\left(-\frac{\mathbf{ f}^\top\mathbf{K}^{-1}\mathbf{ f}}{2}\right)
$$ and a degenerate kernel matrix leads to $\det{\mathbf{K}} = 0$,
defeating the normalization (it’s equivalent to finding a projection in
the high dimensional Gaussian where the variance of the resulting
univariate Gaussian is zero, i.e. there is a null space on the
covariance, or alternatively you can imagine there are one or more
directions where the Gaussian has become the delta function).
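The degeneracy is easy to observe numerically. In the sketch below the covariance is built from only three basis functions but evaluated at eight inputs, so its rank cannot exceed three and its determinant vanishes up to rounding error. The `tanh` basis and the sizes are assumptions chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n, h = 8, 3                            # more data points than basis functions
X = rng.standard_normal((n, 1))
W1 = rng.standard_normal((h, 1))

Phi = np.tanh(X @ W1.T)                # n x h design matrix
K_f = Phi @ Phi.T                      # n x n, but built from only h columns

rank = np.linalg.matrix_rank(K_f)
eigenvalues = np.linalg.eigvalsh(K_f)
determinant = np.linalg.det(K_f)

print(rank, eigenvalues.min(), determinant)
```

Adding observation noise, $\sigma^2\mathbf{I}$, restores full rank for the data covariance, but the underlying process over $\mathbf{ f}$ remains degenerate.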

<svg viewBox="0 0 200 200" style="width:15%">

<defs> <clipPath id="clip0">

<style>
circle {
  fill: black;
}
</style>

<circle cx="100" cy="100" r="100"/> </clipPath> </defs>

<title>

Radford Neal

</title>

<image preserveAspectRatio="xMinYMin slice" width="100%" xlink:href="https://mlatcl.github.io/deepnn/./slides/diagrams//people/radford-neal.jpg" clip-path="url(#clip0)"/>

</svg>

In the machine learning field, it was Radford Neal (Neal, 1994) who
realized the potential of the next step. In his 1994 thesis, he was
considering Bayesian neural networks, of the type we described above,
and considered what would happen if you took the number of hidden
nodes, or neurons, to infinity, i.e. $h\rightarrow \infty$.

<img class="" src="https://mlatcl.github.io/deepnn/./slides/diagrams//neal-infinite-priors.png" style="width:80%">

Figure: <i>Page 37 of [Radford Neal’s 1994
thesis](http://www.cs.toronto.edu/~radford/ftp/thesis.pdf)</i>

In loose terms, what Radford considers is what happens to the elements
of the covariance function, $$
  \begin{align*}
  k_f\left(\mathbf{ x}_i, \mathbf{ x}_j\right) & = \alpha \boldsymbol{ \phi}\left(\mathbf{W}_1, \mathbf{ x}_i\right)^\top \boldsymbol{ \phi}\left(\mathbf{W}_1, \mathbf{ x}_j\right)\\
  & = \alpha \sum_k \phi\left(\mathbf{ w}^{(1)}_k, \mathbf{ x}_i\right) \phi\left(\mathbf{ w}^{(1)}_k, \mathbf{ x}_j\right)
  \end{align*}
  $$ if, instead of considering a finite number, you sample infinitely
many of these activation functions, sampling the parameters of each one
from a prior density, $p(\mathbf{ w}^{(1)})$, $$
k_f\left(\mathbf{ x}_i, \mathbf{ x}_j\right) = \alpha \int \phi\left(\mathbf{ w}^{(1)}, \mathbf{ x}_i\right) \phi\left(\mathbf{ w}^{(1)}, \mathbf{ x}_j\right) p(\mathbf{ w}^{(1)}) \text{d}\mathbf{ w}^{(1)}
$$ And that’s not *only* for Gaussian $p(\mathbf{ w}^{(1)})$. In fact this
result holds for a range of activations, and a range of prior densities,
because of the *central limit theorem*.

To put it in the form of a probabilistic program: writing $\mathbf{ v}$
for the parameters of a single hidden unit, as long as the distribution
for $\phi_i$ implied by this short probabilistic program,
$$
  \begin{align*}
  \mathbf{ v}& \sim p(\cdot)\\
  \phi_i & = \phi\left(\mathbf{ v}, \mathbf{ x}_i\right), 
  \end{align*}
  $$ has finite variance, the result of taking the number of hidden
units to infinity, with appropriate scaling, is also a Gaussian process.
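
The limit can be illustrated with a Monte Carlo estimate. This is a minimal sketch rather than Neal's construction: it assumes scalar inputs, `tanh` activations and a standard normal prior over each unit's weight and bias, and the helper name `mc_covariance` is hypothetical. Averaging over $h$ sampled units applies the $1/h$ scaling that keeps the sum finite.

```python
import numpy as np

def mc_covariance(x_i, x_j, h, rng):
    """Monte Carlo estimate of k(x_i, x_j) from h sampled hidden units."""
    # Assumed prior over each unit's parameters: standard normal weight and bias
    w = rng.standard_normal(h)
    b = rng.standard_normal(h)
    phi_i = np.tanh(w * x_i + b)
    phi_j = np.tanh(w * x_j + b)
    # The mean is the sum over units scaled by 1/h
    return np.mean(phi_i * phi_j)

rng = np.random.default_rng(0)
x_i, x_j = 0.3, -0.5

# The estimate settles down as the number of hidden units grows
for h in (10, 1000, 100000):
    k = mc_covariance(x_i, x_j, h, rng)
    print(f"h = {h:>6d}: k(x_i, x_j) approx {k:.4f}")
```

As $h$ grows the estimate fluctuates less and approaches the value of the integral for this choice of activation and prior.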

## Further Reading

To understand this argument in more detail, I highly recommend reading
chapter 2 of Neal’s thesis (Neal, 1994), which remains easy to read and
clear today. Indeed, for readers interested in Bayesian neural networks,
both Radford Neal’s and David MacKay’s PhD theses (MacKay, 1992) remain
essential reading. Both theses embody a clarity of thought, and an
ability to weave together threads from different fields that was the
business of machine learning in the 1990s. Radford and David were also
pioneers in making their software widely available and publishing
material on the web.

<!-- ### Two Dimensional Gaussian Distribution -->
<!-- include{_ml/includes/two-d-gaussian.md} -->

In [None]:
import numpy as np
np.random.seed(4949)

## Sampling a Function


We will consider a Gaussian distribution with a particular structure of
covariance matrix. We will generate *one* sample from a 25-dimensional
Gaussian density, $$
\mathbf{ f}=\left[f_{1},f_{2}\dots f_{25}\right].
$$ In the figure below we plot these data on the $y$-axis against their
*indices* on the $x$-axis.

In [None]:
import mlai

In [None]:
%load -n mlai.Kernel

In [None]:
%load -n mlai.polynomial_cov

In [None]:
%load -n mlai.exponentiated_quadratic

In [None]:
import mlai.plot as plot
from mlai import Kernel, exponentiated_quadratic

In [None]:
kernel=Kernel(function=exponentiated_quadratic, lengthscale=0.5)
plot.two_point_sample(kernel.K, diagrams='./gp')

In [None]:
import notutils as nu
from ipywidgets import IntSlider

In [None]:
nu.display_plots('two_point_sample{sample:0>3}.svg', './gp', sample=IntSlider(0, 0, 8, 1))

<img src="https://mlatcl.github.io/deepnn/./slides/diagrams//gp/two_point_sample008.svg" class="" width="80%" style="vertical-align:middle;">

Figure: <i>A 25 dimensional correlated random variable (values plotted
against index)</i>
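
For readers without `mlai` installed, a sample like the one in the figure can be sketched with NumPy alone. The input locations (25 evenly spaced points on $[-1, 1]$), the jitter value and the helper name `eq_cov_matrix` are assumptions made for illustration; only the lengthscale of 0.5 is taken from the cell above.

```python
import numpy as np

def eq_cov_matrix(x, variance=1.0, lengthscale=0.5):
    """Exponentiated quadratic covariance between all pairs of 1-D inputs."""
    sq_dist = (x[:, np.newaxis] - x[np.newaxis, :]) ** 2
    return variance * np.exp(-sq_dist / (2 * lengthscale ** 2))

rng = np.random.default_rng(4949)

x = np.linspace(-1, 1, 25)          # assumed input locations
K = eq_cov_matrix(x)

# A small diagonal term keeps the Cholesky factorization numerically stable
jitter = 1e-6 * np.eye(25)
L = np.linalg.cholesky(K + jitter)

# One sample f ~ N(0, K): transform independent standard normals by L
f = L @ rng.standard_normal(25)

for index, value in enumerate(f, start=1):
    print(f"f_{index} = {value: .3f}")
```

Neighbouring entries of `f` take similar values because the covariance between nearby indices is close to one.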

### Sampling a Function from a Gaussian


In [None]:
import notutils as nu
from ipywidgets import IntSlider

In [None]:
nu.display_plots('two_point_sample{sample:0>3}.svg', 
                            './gp', 
                            sample=IntSlider(0, 0, 8, 1))

<img src="https://mlatcl.github.io/deepnn/./slides/diagrams//gp/two_point_sample001.svg" class="" width="80%" style="vertical-align:middle;">

Figure: <i>The joint Gaussian over $f_1$ and $f_2$ along with the
conditional distribution of $f_2$ given $f_1$</i>

### Joint Density of $f_1$ and $f_2$

In [None]:
import notutils as nu
from ipywidgets import IntSlider

In [None]:
nu.display_plots('two_point_sample{sample:0>3}.svg', 
                            './gp', 
                            sample=IntSlider(9, 9, 12, 1))

<img src="https://mlatcl.github.io/deepnn/./slides/diagrams//gp/two_point_sample012.svg" class="" width="80%" style="vertical-align:middle;">

Figure: <i>The joint Gaussian over $f_1$ and $f_2$ along with the
conditional distribution of $f_2$ given $f_1$</i>

## Uluru

<img class="" src="https://mlatcl.github.io/deepnn/./slides/diagrams//gp/799px-Uluru_Panorama.jpg" style="width:">

Figure: <i>Uluru, the sacred rock in Australia. If we think of it as a
probability density, viewing it from this side gives us one *marginal*
from the density. Figuratively speaking, slicing through the rock would
give a conditional density.</i>

When viewing these contour plots, I sometimes find it helpful to think
of Uluru, the prominent rock formation in Australia. The rock rises
above the surface of the plain, just like a probability density rising
above the zero line. The rock is three dimensional, but when we view
Uluru from the classical position, we are looking at one side of it.
This is equivalent to viewing the marginal density.

The joint density can be viewed from above, using contours. The
conditional density is equivalent to *slicing* the rock. Uluru is a holy
rock, so this has to be an imaginary slice. Imagine we cut down a
vertical plane orthogonal to our line of sight (i.e. running across our
view). This would give a profile of the rock which, when renormalized,
would give us the conditional distribution. The value we condition on is
the location of the slice along the direction we are facing.

## Prediction with Correlated Gaussians

Of course in practice, rather than manipulating mountains physically,
the advantage of the Gaussian density is that we can perform these
manipulations mathematically.

Prediction of $f_2$ given $f_1$ requires the *conditional density*,
$p(f_2|f_1)$. Another remarkable property of the Gaussian density is that
this conditional distribution is *also* guaranteed to be a Gaussian
density. It has the form, $$
p(f_2|f_1) = \mathcal{N}\left(f_2|\frac{k_{1, 2}}{k_{1, 1}}f_1, k_{2, 2} - \frac{k_{1,2}^2}{k_{1,1}}\right)
$$ where we have assumed that the covariance of the original joint
density was given by $$
\mathbf{K}= \begin{bmatrix} k_{1, 1} & k_{1, 2}\\ k_{2, 1} & k_{2, 2}\end{bmatrix}.
$$

Using these formulae we can determine the conditional density for any of
the elements of our vector $\mathbf{ f}$. For example, the variable
$f_8$ is less correlated with $f_1$ than $f_2$. If we consider this
variable we see the conditional density is more diffuse.
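
The formula can be evaluated directly. This is a minimal sketch under assumed settings (25 inputs on $[-1, 1]$, an exponentiated quadratic covariance with lengthscale 0.5, and a hypothetical observed value for $f_1$); the function name `conditional` is also an illustration, not part of `mlai`.

```python
import numpy as np

def conditional(k11, k12, k22, f1):
    """Mean and variance of p(f2 | f1) for a zero-mean bivariate Gaussian."""
    mean = k12 / k11 * f1
    var = k22 - k12 ** 2 / k11
    return mean, var

# Assumed setup: 25 inputs on [-1, 1], exponentiated quadratic covariance
x = np.linspace(-1, 1, 25)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.5 ** 2))

f1 = -0.3  # hypothetical observed value of f_1

# Python indices are zero-based: f_1, f_2 and f_8 are rows 0, 1 and 7
mean_2, var_2 = conditional(K[0, 0], K[0, 1], K[1, 1], f1)
mean_8, var_8 = conditional(K[0, 0], K[0, 7], K[7, 7], f1)

print(f"p(f_2 | f_1): mean {mean_2:.3f}, variance {var_2:.3f}")
print(f"p(f_8 | f_1): mean {mean_8:.3f}, variance {var_8:.3f}")
```

Under these assumptions the conditional variance for $f_8$ is much larger than for $f_2$, and its mean is pulled less strongly towards the observed $f_1$.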

### Joint Density of $f_1$ and $f_8$


In [None]:
import notutils as nu
from ipywidgets import IntSlider

In [None]:
nu.display_plots('two_point_sample{sample:0>3}.svg', 
                            './gp', 
                            sample=IntSlider(13, 13, 17, 1))

<img src="https://mlatcl.github.io/deepnn/./slides/diagrams//gp/two_point_sample013.svg" class="" width="80%" style="vertical-align:middle;">

Figure: <i>Sample from the joint Gaussian model, points indexed by 1 and
8 highlighted.</i>

### Prediction of $f_{8}$ from $f_{1}$

<img src="https://mlatcl.github.io/deepnn/./slides/diagrams//gp/two_point_sample017.svg" class="" width="80%" style="vertical-align:middle;">

Figure: <i>The joint Gaussian over $f_1$ and $f_8$ along with the
conditional distribution of $f_8$ given $f_1$</i>

-   The single contour of the Gaussian density represents the
    <font color="blue">joint distribution, $p(f_1, f_8)$</font>

-   We observe a value for <font color="green">$f_1=-?$</font>

-   Conditional density: <font color="red">$p(f_8|f_1=?)$</font>.

-   Prediction of $\mathbf{ f}_*$ from $\mathbf{ f}$ requires the
    multivariate *conditional density*.

-   The multivariate conditional density is *also* Gaussian, <large> $$
    p(\mathbf{ f}_*|\mathbf{ f}) = {\mathcal{N}\left(\mathbf{ f}_*|\boldsymbol{ \mu},\boldsymbol{ \Sigma}\right)}
    $$ with mean $$
    \boldsymbol{ \mu}= \mathbf{K}_{*,\mathbf{ f}}\mathbf{K}_{\mathbf{ f},\mathbf{ f}}^{-1}\mathbf{ f}
    $$ and covariance $$
    \boldsymbol{ \Sigma}= \mathbf{K}_{*,*}-\mathbf{K}_{*,\mathbf{ f}} \mathbf{K}_{\mathbf{ f},\mathbf{ f}}^{-1}\mathbf{K}_{\mathbf{ f},*}
    $$ </large>

-   Here the covariance of the joint density over $\mathbf{ f}$ and
    $\mathbf{ f}_*$ is given by $$
    \mathbf{K}= \begin{bmatrix} \mathbf{K}_{\mathbf{ f}, \mathbf{ f}} & \mathbf{K}_{\mathbf{ f}, *}\\ \mathbf{K}_{*, \mathbf{ f}} & \mathbf{K}_{*, *}\end{bmatrix}
    $$

-   Covariance function, $\mathbf{K}$

-   Determines properties of samples.

-   Function of $\mathbf{X}$,
    $$k_{i,j} = k(\mathbf{ x}_i, \mathbf{ x}_j)$$

-   Posterior mean
    $$f_D(\mathbf{ x}_*) = \mathbf{ k}(\mathbf{ x}_*, \mathbf{X}) \mathbf{K}^{-1}
    \mathbf{ y} = \mathbf{ k}(\mathbf{ x}_*, \mathbf{X}) \boldsymbol{\alpha}, \quad \boldsymbol{\alpha} = \mathbf{K}^{-1}\mathbf{ y}$$

-   Posterior covariance
    $$\mathbf{C}_* = \mathbf{K}_{*,*} - \mathbf{K}_{*,\mathbf{ f}}
    \mathbf{K}^{-1} \mathbf{K}_{\mathbf{ f}, *}$$
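
The posterior mean and covariance above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the GPy implementation: the helper names, the toy data, the lengthscale and the small `jitter` added to the diagonal for numerical stability are all hypothetical choices.

```python
import numpy as np

def eq_cov(x1, x2, variance=1.0, lengthscale=0.5):
    """Exponentiated quadratic covariance between two sets of 1-D inputs."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-sq_dist / (2 * lengthscale ** 2))

def gp_posterior(x, y, x_star, jitter=1e-6):
    """Posterior mean and covariance at x_star given observations (x, y)."""
    K = eq_cov(x, x) + jitter * np.eye(len(x))   # K_{f,f} plus assumed jitter
    K_star = eq_cov(x_star, x)                   # K_{*,f}
    L = np.linalg.cholesky(K)                    # K = L L^T
    # alpha = K^{-1} y via two triangular solves instead of an explicit inverse
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_star @ alpha
    V = np.linalg.solve(L, K_star.T)             # V^T V = K_{*,f} K^{-1} K_{f,*}
    cov = eq_cov(x_star, x_star) - V.T @ V
    return mean, cov

# Toy data, assumed for illustration
x = np.array([-1.0, -0.2, 0.5, 1.0])
y = np.sin(3 * x)
x_star = np.linspace(-1.5, 1.5, 7)

mean, cov = gp_posterior(x, y, x_star)
print("posterior mean:", np.round(mean, 3))
print("posterior sd:  ", np.round(np.sqrt(np.diag(cov)), 3))
```

Solving with the Cholesky factor rather than forming $\mathbf{K}^{-1}$ explicitly is a common design choice because it is cheaper and better conditioned.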

## Exponentiated Quadratic Covariance


In [None]:
import mlai

In [None]:
%load -n mlai.Kernel

In [None]:
%load -n mlai.eq_cov

In [None]:
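from mlai import Kernel, eq_cov

# The formula is a raw string so that LaTeX backslashes (e.g. \alpha, \frac)
# are not interpreted as Python escape sequences.
kernel = Kernel(function=eq_cov,
                name='Exponentiated Quadratic',
                shortname='eq',
                formula=r'\kernelScalar(\inputVector, \inputVector^\prime) = \alpha \exp\left(-\frac{\ltwoNorm{\inputVector-\inputVector^\prime}^2}{2\lengthScale^2}\right)',
                lengthscale=0.2)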

In [None]:
import mlai.plot as plot

In [None]:
plot.covariance_func(kernel=kernel, diagrams='./kern/')

The exponentiated quadratic covariance is also known as the Gaussian
covariance, the RBF covariance or the squared exponential. The covariance
between two points is related to the negative exponential of the squared
distance between those points. This covariance function can be derived
in a few different ways: as the infinite limit of a radial basis
function neural network, as diffusion in the heat equation, as a
Gaussian filter in *Fourier space* or as the composition of a series of
linear filters applied to a base function.

The covariance takes the following form, $$
k(\mathbf{ x}, \mathbf{ x}^\prime) = \alpha \exp\left(-\frac{\left\Vert \mathbf{ x}-\mathbf{ x}^\prime \right\Vert_2^2}{2\ell^2}\right)
$$ where $\ell$ is the *length scale* or *time scale* of the process and
$\alpha$ represents the overall process variance.
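
To get a feel for the role of $\ell$, the formula can be evaluated at a few separations. The function below is a stand-alone sketch written for this illustration (it is not the `mlai.eq_cov` source), with $\alpha=1$ and $\ell=0.2$ assumed.

```python
import numpy as np

def eq_cov(x, x_prime, alpha=1.0, lengthscale=0.2):
    """Exponentiated quadratic covariance between two input vectors."""
    diff = np.atleast_1d(x) - np.atleast_1d(x_prime)
    sq_dist = np.dot(diff, diff)          # squared Euclidean distance
    return alpha * np.exp(-sq_dist / (2 * lengthscale ** 2))

ell = 0.2
for multiple in (0, 1, 2, 5):
    distance = multiple * ell
    k = eq_cov(np.array([0.0]), np.array([distance]), lengthscale=ell)
    print(f"separation {multiple} x lengthscale: k = {k:.4f}")
```

At zero separation the covariance equals $\alpha$, at one lengthscale it has fallen to $\alpha e^{-1/2}$, and a few lengthscales apart the two function values are effectively uncorrelated.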

<table>
<tr>
<td width="45%">

<img src="../slides/diagrams/kern/eq_covariance.svg" class="" width="100%" style="vertical-align:middle;">

</td>
<td width="45%">

<img class="negate" src="../slides/diagrams/kern/eq_covariance.gif" style="width:100%">

</td>
</tr>
</table>

Figure: <i>The exponentiated quadratic covariance function.</i>

## Olympic Marathon Data


<table>
<tr>
<td width="70%">

-   Gold medal times for Olympic Marathon since 1896.
-   Marathons before 1924 didn’t have a standardized distance.
-   Present results using pace per km.
-   The 1904 Marathon was badly organized, leading to very slow times.

</td>
<td width="30%">

<img class="" src="https://mlatcl.github.io/deepnn/./slides/diagrams//Stephen_Kiprotich.jpg" style="width:100%">
<small>Image from Wikimedia Commons <http://bit.ly/16kMKHQ></small>

</td>
</tr>
</table>

The first thing we will do is load a standard data set for regression
modelling. The data consists of the pace of Olympic Gold Medal Marathon
winners for the Olympics from 1896 to present. Let’s load in the data
and plot.

In [None]:
%pip install pods

In [None]:
import numpy as np
import pods

In [None]:
data = pods.datasets.olympic_marathon_men()
x = data['X']
y = data['Y']

offset = y.mean()
scale = np.sqrt(y.var())
yhat = (y - offset)/scale

In [None]:
import matplotlib.pyplot as plt
import mlai.plot as plot
import mlai

In [None]:

xlim = (1875,2030)
ylim = (2.5, 6.5)

fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
_ = ax.plot(x, y, 'r.',markersize=10)
ax.set_xlabel('year', fontsize=20)
ax.set_ylabel('pace min/km', fontsize=20)
ax.set_xlim(xlim)
ax.set_ylim(ylim)

mlai.write_figure(filename='olympic-marathon.svg', 
                  directory='./datasets')

<img src="https://mlatcl.github.io/deepnn/./slides/diagrams//datasets/olympic-marathon.svg" class="" width="80%" style="vertical-align:middle;">

Figure: <i>Olympic marathon pace times since 1896.</i>

Things to notice about the data include the outlier in 1904. In that
year the Olympics was in St Louis, USA. Organizational problems and
challenges with dust kicked up by the cars following the race meant that
participants got lost, and only very few participants completed the
course. More recent years see more consistently quick marathons.

## Alan Turing


<table>
<tr>
<td width="50%">

<img class="" src="https://mlatcl.github.io/deepnn/./slides/diagrams//turing-times.gif" style="width:100%">

</td>
<td width="50%">

<img class="" src="https://mlatcl.github.io/deepnn/./slides/diagrams//turing-run.jpg" style="width:50%">

</td>
</tr>
</table>

Figure: <i>Alan Turing. In 1946 his marathon time was only 11 minutes
slower than that of the winner of the 1948 games. Would he have won a
hypothetical games held in 1946? Source:
<a href="http://www.turing.org.uk/scrapbook/run.html" target="_blank">Alan
Turing Internet Scrapbook</a>.</i>

If we had to summarise the objectives of machine learning in one word, a
very good candidate for that word would be *generalization*. What is
generalization? From a human perspective it might be summarised as the
ability to take lessons learned in one domain and apply them to another
domain. If we accept the definition given in the first session for
machine learning, $$
\text{data} + \text{model} \stackrel{\text{compute}}{\rightarrow} \text{prediction}
$$ then we see that without a model we can’t generalise: we only have
data. Data is fine for answering very specific questions, like “Who won
the Olympic Marathon in 2012?”, because we have that answer stored.
However, we are not given the answer to many other questions. For
example, Alan Turing was a formidable marathon runner: in 1946 he ran a
time of 2 hours 46 minutes (just under four minutes per kilometer, faster
than I and most of the other [Endcliffe Park
Run](http://www.parkrun.org.uk/sheffieldhallam/) runners can do 5 km).
What is the probability he would have won an Olympics if one had been
held in 1946?
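
The pace quoted for Turing follows from simple arithmetic. The sketch below assumes the standard marathon distance of 42.195 km, which is not stated in the notes, and uses the rounded time of 2 hours 46 minutes given above.

```python
# Assumed: standard marathon distance in kilometres
marathon_km = 42.195

# Turing's 1946 time as quoted in the text: 2 hours 46 minutes
turing_minutes = 2 * 60 + 46

pace = turing_minutes / marathon_km   # minutes per kilometre
print(f"Turing's pace: {pace:.2f} min/km")   # 3.93 min/km
```

This puts Turing on the same pace-per-km scale used for the Olympic data, so his run can be compared directly with the plotted gold medal times.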

To answer this question we need to generalize, but before we formalize
the concept of generalization let’s introduce some formal representation
of what it means to generalize in machine learning.

## Gaussian Process Fit


Our first objective will be to perform a Gaussian process fit to the
data. We’ll do this using the [GPy
software](https://github.com/SheffieldML/GPy).

In [None]:
import GPy

In [None]:
m_full = GPy.models.GPRegression(x,yhat)
_ = m_full.optimize() # Optimize parameters of covariance function

The first command sets up the model, then `m_full.optimize()` optimizes
the parameters of the covariance function and the noise level of the
model. Once the fit is complete, we’ll create some test points and
compute the output of the GP model in terms of the mean and standard
deviation of the posterior functions between 1870 and 2030. We evaluate
the mean function and the standard deviation at 200 locations. We can
obtain the predictions using `yt_mean, yt_var = m_full.predict(xt)`.

In [None]:
xt = np.linspace(1870,2030,200)[:,np.newaxis]
yt_mean, yt_var = m_full.predict(xt)
yt_sd=np.sqrt(yt_var)

Now we plot the results using the helper function in `mlai.plot`.

In [None]:
import matplotlib.pyplot as plt
import mlai.plot as plot
import mlai

In [None]:
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot.model_output(m_full, scale=scale, offset=offset, ax=ax, xlabel="year", ylabel="pace min/km", fontsize=20, portion=0.2)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure=fig,
                  filename="olympic-marathon-gp.svg", 
                  directory = "./gp",
                  transparent=True, frameon=True)

<img src="https://mlatcl.github.io/deepnn/./slides/diagrams//gp/olympic-marathon-gp.svg" class="" width="80%" style="vertical-align:middle;">

Figure: <i>Gaussian process fit to the Olympic Marathon data. The error
bars are too large, perhaps due to the outlier from 1904.</i>

## Fit Quality

In the fit we see that the error bars (coming mainly from the noise
variance) are quite large. This is likely due to the outlier point in
1904. Ignoring that point, we can see that a tighter fit is obtained. To
see this, we make a version of the model, `m_clean`, where that point is
removed.

In [None]:
x_clean=np.vstack((x[0:2, :], x[3:, :]))
y_clean=np.vstack((yhat[0:2, :], yhat[3:, :]))

m_clean = GPy.models.GPRegression(x_clean,y_clean)
_ = m_clean.optimize()

In [None]:
import matplotlib.pyplot as plt
import mlai.plot as plot
import mlai

In [None]:
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot.model_output(m_clean, scale=scale, offset=offset, ax=ax, xlabel='year', ylabel='pace min/km', fontsize=20, portion=0.2)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure=fig,
                  filename='./gp/olympic-marathon-gp.svg', 
                  transparent=True, frameon=True)

In [None]:
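% MATLAB helper used by the plotting script in the Quadratic Data Fit section.
function rotateObject(rotationMatrix, handle)
% Rotate text and line graphics objects about the origin.
for i = 1:numel(handle)
    type = get(handle(i), 'type');
    if strcmp(type, 'text')
        % Text objects store their location in 'position'
        xy = get(handle(i), 'position');
        xy(1:2) = rotationMatrix*xy(1:2)';
        set(handle(i), 'position', xy);
    else
        % Line objects store their coordinates in 'xdata' and 'ydata'
        xd = get(handle(i), 'xdata');
        yd = get(handle(i), 'ydata');
        new = rotationMatrix*[xd(:)'; yd(:)'];
        set(handle(i), 'xdata', new(1, :));
        set(handle(i), 'ydata', new(2, :));
    end
end
end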

## Learning Covariance Parameters


Can we determine covariance parameters from the data? The likelihood of
the observations under the Gaussian process is

$$
\begin{aligned}
    \mathcal{N}\left(\mathbf{ y}|\mathbf{0},\mathbf{K}\right)=\frac{1}{(2\pi)^\frac{n}{2}\color{blue}{\det{\mathbf{K}}^{\frac{1}{2}}}}\color{red}{\exp\left(-\frac{\mathbf{ y}^{\top}\mathbf{K}^{-1}\mathbf{ y}}{2}\right)}
\end{aligned}
$$

Taking the logarithm separates this into a
<font color="blue">capacity term</font>, a
<font color="red">data fit term</font> and a constant,

$$
\begin{aligned}
    \log \mathcal{N}\left(\mathbf{ y}|\mathbf{0},\mathbf{K}\right)=&\color{blue}{-\frac{1}{2}\log\det{\mathbf{K}}}\color{red}{-\frac{\mathbf{ y}^{\top}\mathbf{K}^{-1}\mathbf{ y}}{2}} \\ &-\frac{n}{2}\log2\pi
\end{aligned}
$$

Negating and dropping the constant gives the objective we minimize with
respect to the covariance parameters, $\boldsymbol{ \theta}$,

$$
E(\boldsymbol{ \theta}) = \color{blue}{\frac{1}{2}\log\det{\mathbf{K}}} + \color{red}{\frac{\mathbf{ y}^{\top}\mathbf{K}^{-1}\mathbf{ y}}{2}}
$$
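
The two terms of $E(\boldsymbol{ \theta})$ can be computed separately from a Cholesky factorization. The sketch below assumes toy data, an exponentiated quadratic covariance and a fixed noise variance of 0.01; the helper name `objective_terms` is hypothetical and is not GPy's implementation.

```python
import numpy as np

def objective_terms(K, y):
    """Return (capacity, data_fit) so that E(theta) = capacity + data_fit."""
    L = np.linalg.cholesky(K)                   # K = L L^T
    capacity = np.sum(np.log(np.diag(L)))       # equals 0.5 * log det K
    K_inv_y = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = 0.5 * y @ K_inv_y                # 0.5 * y^T K^{-1} y
    return capacity, data_fit

# Toy data, assumed for illustration: six inputs with fixed targets
x = np.linspace(-1, 1, 6)
y = np.array([0.2, 0.5, 0.4, -0.1, -0.6, -0.3])
sq_dist = (x[:, None] - x[None, :]) ** 2

for lengthscale in (0.1, 1.0, 10.0):
    # Exponentiated quadratic covariance plus assumed noise variance 0.01
    K = np.exp(-sq_dist / (2 * lengthscale ** 2)) + 0.01 * np.eye(6)
    capacity, data_fit = objective_terms(K, y)
    print(f"lengthscale={lengthscale:5.1f}  capacity={capacity:8.3f}  "
          f"data fit={data_fit:8.3f}  E={capacity + data_fit:8.3f}")
```

For these data the capacity term falls as the lengthscale grows while the data fit term rises, so the minimum of $E(\boldsymbol{ \theta})$ balances the two.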

## Capacity Control through the Determinant


The parameters are *inside* the covariance function (matrix).
$$k_{i, j} = k(\mathbf{ x}_i, \mathbf{ x}_j; \boldsymbol{ \theta})$$

$$\mathbf{K}= \mathbf{R}\boldsymbol{ \Lambda}^2 \mathbf{R}^\top$$

<table>
<tr>
<td width="50%">

<img class="negate" src="https://mlatcl.github.io/deepnn/./slides/diagrams//gp/gp-optimize-eigen.png" style="width:100%">

</td>
<td width="50%">

$\boldsymbol{ \Lambda}$ represents distance on axes. $\mathbf{R}$ gives
rotation.

</td>
</tr>
</table>

-   $\boldsymbol{ \Lambda}$ is *diagonal*,
    $\mathbf{R}^\top\mathbf{R}= \mathbf{I}$.
-   Useful representation since
    $\det{\mathbf{K}} = \det{\boldsymbol{ \Lambda}^2} = \det{\boldsymbol{ \Lambda}}^2$.
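
The determinant identity is easy to verify numerically. This sketch assumes a two dimensional example with a rotation of $\pi/4$ and axis lengths 0.5 and 0.3, the same values passed to the plotting call below.

```python
import numpy as np

theta = np.pi / 4
lambda1, lambda2 = 0.5, 0.3

# Rotation matrix R, which satisfies R^T R = I
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Diagonal matrix of axis lengths
Lambda = np.diag([lambda1, lambda2])

# Covariance built from its eigendecomposition, K = R Lambda^2 R^T
K = R @ Lambda @ Lambda @ R.T

print("det K        :", np.linalg.det(K))
print("det(Lambda)^2:", np.linalg.det(Lambda) ** 2)
```

Changing `theta` rotates the contours of the Gaussian but leaves the determinant, and so the capacity term, unchanged.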

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import mlai
import mlai.plot as plot

In [None]:
plot.covariance_capacity(rotate_angle=np.pi/4, lambda1 = 0.5, lambda2 = 0.3, diagrams = './gp/')

In [None]:
import notutils as nu
from ipywidgets import IntSlider

In [None]:
nu.display_plots('gp-optimise-determinant{sample:0>3}.svg', 
                                          directory='./gp', 
                              sample=IntSlider(0, 0, 9, 1))

<img src="https://mlatcl.github.io/deepnn/./slides/diagrams//gp/gp-optimise-determinant009.svg" class="" width="80%" style="vertical-align:middle;">

Figure: <i>The determinant of the covariance is dependent only on the
eigenvalues. It represents the ‘footprint’ of the Gaussian.</i>

## Quadratic Data Fit


In [None]:
    clf
    includeText = [];
    counter = 0;
    plotWidth = 0.6*textWidth;
    lambda1 = 3;
    lambda2 = 1;
    t = linspace(-pi, pi, 200);
    R = [sqrt(2)/2 -sqrt(2)/2; sqrt(2)/2 sqrt(2)/2];
    xy = [lambda1*sin(t); lambda2*cos(t)];
    contourHand = line(xy(1, :), xy(2, :), 'color', blackColor);
    xy = [lambda1*sin(t); lambda2*cos(t)]*2;
    lim = [-1 1]*max([lambda1 lambda2])*2.2;
    set(gca, 'xlim', lim, 'ylim', lim)
    axis equal


    contourHand = [contourHand line(xy(1, :), xy(2, :), 'color', blackColor)];
    set(contourHand, 'linewidth', 2, 'color', redColor)
    arrowHand = arrow([0 lambda1], [0 0]);
    arrowHand = [arrowHand arrow([0 0], [0 lambda2])];
    set(arrowHand, 'linewidth', 3, 'color', blackColor);
    xlim = get(gca, 'xlim');
    xspan = xlim(2) - xlim(1);
    ylim = get(gca, 'ylim');
    yspan = ylim(2) - ylim(1);
    eigLabel = text(lambda1*0.5, -yspan*0.05, '$\eigenvalue_1$', 'horizontalalignment', 'center');
    eigLabel = [eigLabel text(-0.05*xspan, lambda2*0.5, '$\eigenvalue_2$', 'horizontalalignment', 'center')];
    xlabel('$\dataScalar_1$')
    ylabel('$\dataScalar_2$')
    
    box off
    xlim = get(gca, 'xlim');
    ylim = get(gca, 'ylim');
    line([xlim(1) xlim(1)], ylim, 'color', blackColor)
    line(xlim, [ylim(1) ylim(1)], 'color', blackColor)
    
    fileName = ['gpOptimiseQuadratic' num2str(counter)];
    printLatexPlot(fileName, directory, plotWidth);
    includeText = [includeText '\only<' num2str(counter) '>{\input{' directory fileName '.svg}}'];
    counter = counter + 1;

    y = [1.2 1.4];
    dataHand = line(y(1), y(2), 'marker', 'x', 'markersize', markerSize, 'linewidth', markerWidth, 'color', blackColor);
    
    fileName = ['gpOptimiseQuadratic' num2str(counter)];
    printLatexPlot(fileName, directory, plotWidth);
    includeText = [includeText '\only<' num2str(counter) '>{\input{' directory fileName '.svg}}'];
    counter = counter + 1;

    
    % Rotate the plotted objects using the rotation matrix R defined above
    rotateObject(R, arrowHand);
    rotateObject(R, contourHand);
    rotateObject(R, eigLabel);
    
    fileName = ['gpOptimiseQuadratic' num2str(counter)];
    printLatexPlot(fileName, directory, plotWidth);
    includeText = [includeText '\only<' num2str(counter) '>{\input{' directory fileName '.svg}}'];
    counter = counter + 1;
    
    printLatexText(includeText, 'gpOptimiseQuadraticIncludeText.tex', directory)

<img src="https://mlatcl.github.io/deepnn/./slides/diagrams//gp/gp-optimise-quadratic002.svg" class="" width="80%" style="vertical-align:middle;">

Figure: <i>The data fit term of the Gaussian process is a quadratic loss
centered around zero. This has elliptical contours, the principal axes of
which are given by the covariance matrix.</i>

## Data Fit Term


In [None]:
import matplotlib.pyplot as plt
import numpy as np
import os

In [None]:
import GPy
import mlai.plot as plot
import mlai
import gp_tutorial

In [None]:
np.random.seed(125)
diagrams = './gp'

black_color=[0., 0., 0.]
red_color=[1., 0., 0.]
blue_color=[0., 0., 1.]
magenta_color=[1., 0., 1.]
fontsize=18

In [None]:
y_lim = [-2.2, 2.2]
y_ticks = [-2, -1, 0, 1, 2]
x_lim = [-2, 2]
x_ticks = [-2, -1, 0, 1, 2]
err_y_lim = [-12, 20]

linewidth=3
markersize=15
markertype='.'

In [None]:
x = np.linspace(-1, 1, 6)[:, np.newaxis]
xtest = np.linspace(x_lim[0], x_lim[1], 200)[:, np.newaxis]

# True data
true_kern = GPy.kern.RBF(1) + GPy.kern.White(1)
true_kern.rbf.lengthscale = 1.0
true_kern.white.variance = 0.01
K = true_kern.K(x) 
y = np.random.multivariate_normal(np.zeros((6,)), K, 1).T

In [None]:

# Fitted model
kern = GPy.kern.RBF(1) + GPy.kern.White(1)
kern.rbf.lengthscale = 1.0
kern.white.variance = 0.01

lengthscales = np.asarray([0.01, 0.05, 0.1, 0.25, 0.5, 1, 2, 4, 8, 16, 100])

fig1, ax1 = plt.subplots(figsize=plot.one_figsize)    
fig2, ax2 = plt.subplots(figsize=plot.one_figsize)    
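# np.nan replaces the np.NaN alias, which was removed in NumPy 2.0
line = ax2.semilogx(np.nan, np.nan, 'x-', 
                    color=black_color)
# Configure the error plot axes (ax2), not a stale `ax` from an earlier cell
ax2.set_ylim(err_y_lim)
ax2.set_xlim([0.025, 32])
ax2.grid(True)
ax2.set_xticks([0.01, 0.1, 1, 10, 100])
ax2.set_xticklabels(['$10^{-2}$', '$10^{-1}$', '$10^0$', '$10^1$', '$10^2$'])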


err = np.zeros_like(lengthscales)
err_log_det = np.zeros_like(lengthscales)
err_fit = np.zeros_like(lengthscales)

counter = 0
for i, ls in enumerate(lengthscales):
        kern.rbf.lengthscale=ls
        K = kern.K(x) 
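        invK, L, Li, log_det_K = GPy.util.linalg.pdinv(K)
        # Capacity term: half the log determinant of K
        err_log_det[i] = 0.5*log_det_K
        # Data fit term: .item() turns the 1x1 array into a scalar
        err_fit[i] = 0.5*np.dot(np.dot(y.T,invK), y).item()
        # The objective is the sum of the two terms
        err[i] = err_log_det[i] + err_fit[i]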
        Kx = kern.K(x, xtest)
        ypred_mean = np.dot(np.dot(Kx.T, invK), y)
        ypred_var = kern.Kdiag(xtest) - np.sum((np.dot(Kx.T,invK))*Kx.T, 1)
        ypred_sd = np.sqrt(ypred_var)
        ax1.clear()
        _ = gp_tutorial.gpplot(xtest.flatten(),
                               ypred_mean.flatten(),
                               ypred_mean.flatten()-2*ypred_sd.flatten(),
                               ypred_mean.flatten()+2*ypred_sd.flatten(), 
                               ax=ax1)
        x_lim = ax1.get_xlim()
        ax1.set_ylabel('$f(x)$', fontsize=fontsize)
        ax1.set_xlabel('$x$', fontsize=fontsize)

        p = ax1.plot(x, y, markertype, color=black_color, markersize=markersize, linewidth=linewidth)
        ax1.set_ylim(y_lim)
        ax1.set_xlim(x_lim)                                    
        ax1.set_xticks(x_ticks)
        #ax.set(box=False)
           
        ax1.plot([x_lim[0], x_lim[0]], y_lim, color=black_color)
        ax1.plot(x_lim, [y_lim[0], y_lim[0]], color=black_color)

        file_name = 'gp-optimise{counter:0>3}.svg'.format(counter=counter)
        mlai.write_figure(os.path.join(diagrams, file_name),
                          figure=fig1,
                          transparent=True)
        counter += 1

        ax2.clear()
        t = ax2.semilogx(lengthscales[0:i+1], err[0:i+1], 'x-', 
                        color=magenta_color, 
                        markersize=markersize,
                        linewidth=linewidth)
        t2 = ax2.semilogx(lengthscales[0:i+1], err_log_det[0:i+1], 'x-', 
                         color=blue_color, 
                        markersize=markersize,
                        linewidth=linewidth)
        t3 = ax2.semilogx(lengthscales[0:i+1], err_fit[0:i+1], 'x-', 
                         color=red_color, 
                        markersize=markersize,
                        linewidth=linewidth)
        ax2.set_ylim(err_y_lim)
        ax2.set_xlim([0.025, 32])
        ax2.set_xticks([0.01, 0.1, 1, 10, 100])
        ax2.set_xticklabels(['$10^{-2}$', '$10^{-1}$', '$10^0$', '$10^1$', '$10^2$'])

        ax2.grid(True)

        ax2.set_ylabel('negative log likelihood', fontsize=fontsize)
        ax2.set_xlabel(r'length scale, $\ell$', fontsize=fontsize)
        file_name = 'gp-optimise{counter:0>3}.svg'.format(counter=counter)
        mlai.write_figure(os.path.join(diagrams, file_name),
                          figure=fig2,
                          transparent=True)
        counter += 1
        #ax.set_box(False)
        xlim = ax2.get_xlim()
        ax2.plot([xlim[0], xlim[0]], err_y_lim, color=black_color)
        ax2.plot(xlim, [err_y_lim[0], err_y_lim[0]], color=black_color)

<table>
<tr>
<td width="50%">

<img src="https://mlatcl.github.io/deepnn/./slides/diagrams//gp/gp-optimise006.svg" class="" width="100%" style="vertical-align:middle;">

</td>
<td width="50%">

<img src="https://mlatcl.github.io/deepnn/./slides/diagrams//gp/gp-optimise010.svg" class="" width="100%" style="vertical-align:middle;">

</td>
</tr>
</table>
<table>
<tr>
<td width="50%">

<img src="https://mlatcl.github.io/deepnn/./slides/diagrams//gp/gp-optimise016.svg" class="" width="100%" style="vertical-align:middle;">

</td>
<td width="50%">

<img src="https://mlatcl.github.io/deepnn/./slides/diagrams//gp/gp-optimise021.svg" class="" width="100%" style="vertical-align:middle;">

</td>
</tr>
</table>

Figure: <i>Variation in the data fit term, the capacity term and the
negative log likelihood for different lengthscales.</i>
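The decomposition plotted above can be reproduced without GPy. The following is a minimal sketch, assuming an exponentiated quadratic covariance with a fixed noise variance of 0.01 and a small synthetic data set; the helper names `eq_cov` and `nll_terms` and the data are illustrative assumptions, not part of the original notebook. As in the loop above, the constant $\frac{n}{2}\log 2\pi$ is omitted from the objective.

```python
import numpy as np

def eq_cov(x, x2, variance=1.0, lengthscale=1.0):
    """Exponentiated quadratic covariance between column vectors x and x2."""
    sqdist = (x - x2.T) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

def nll_terms(x, y, lengthscale, noise_variance=0.01):
    """Return (capacity, data_fit); their sum is the objective up to a constant."""
    K = eq_cov(x, x, lengthscale=lengthscale) + noise_variance * np.eye(len(x))
    L = np.linalg.cholesky(K)
    log_det_K = 2.0 * np.sum(np.log(np.diag(L)))
    alpha = np.linalg.solve(K, y)              # K^{-1} y
    capacity = 0.5 * log_det_K                 # complexity penalty
    data_fit = 0.5 * (y.T @ alpha).item()      # quadratic data fit term
    return capacity, data_fit

# Illustrative data only (an assumption, not the data used in the figures).
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 6)[:, np.newaxis]
y = np.sin(3 * x) + 0.1 * rng.standard_normal(x.shape)

for ls in [0.01, 0.1, 1.0, 10.0, 100.0]:
    capacity, data_fit = nll_terms(x, y, ls)
    print(f"lengthscale={ls:7.2f}  capacity={capacity:8.3f}  "
          f"data fit={data_fit:8.3f}  total={capacity + data_fit:8.3f}")
```

Short length scales make the covariance close to diagonal, so the capacity term is large; long length scales shrink the log determinant but tend to increase the data fit term.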

## Gene Expression Example


We now consider an example in gene expression. Gene expression is
measured through the mRNA levels present in cells. These mRNA levels show
which genes are ‘switched on’ and being transcribed. In this example we
will use a Gaussian process to determine whether a given gene is active,
or whether we are merely observing noise.

## Della Gatta Gene Data


-   Given gene expression levels in the form of a time series from
    Della Gatta et al. (2008).

In [None]:
import numpy as np
import pods

In [None]:
data = pods.datasets.della_gatta_TRP63_gene_expression(data_set='della_gatta',gene_number=937)

x = data['X']
y = data['Y']

offset = y.mean()
scale = np.sqrt(y.var())

In [None]:
import matplotlib.pyplot as plt
import mlai.plot as plot
import mlai

In [None]:

xlim = (-20,260)
ylim = (5, 7.5)
yhat = (y-offset)/scale

fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
_ = ax.plot(x, y, 'r.',markersize=10)
ax.set_xlabel('time/min', fontsize=20)
ax.set_ylabel('expression', fontsize=20)
ax.set_xlim(xlim)
ax.set_ylim(ylim)

mlai.write_figure(figure=fig, 
                  filename='./datasets/della-gatta-gene.svg', 
                  transparent=True, 
                  frameon=True)

<img src="https://mlatcl.github.io/deepnn/./slides/diagrams//datasets/della-gatta-gene.svg" class="" width="80%" style="vertical-align:middle;">

Figure: <i>Gene expression levels over time for a gene from data
provided by Della Gatta et al. (2008). We would like to understand
whether there is signal in the data, or we are only observing noise.</i>

-   We want to detect whether a gene is expressed or not, so we fit a GP
    to each gene (Kalaitzis and Lawrence, 2011).

<svg viewBox="0 0 200 200" style="width:15%">

<defs> <clipPath id="clip1">

<style>
circle {
  fill: black;
}
</style>

<circle cx="100" cy="100" r="100"/> </clipPath> </defs>

<title>

Freddie Kalaitzis

</title>

<image preserveAspectRatio="xMinYMin slice" width="100%" xlink:href="https://mlatcl.github.io/deepnn/./slides/diagrams//people/freddie-kalaitzis.jpg" clip-path="url(#clip1)"/>

</svg>

<img class="" src="https://mlatcl.github.io/deepnn/./slides/diagrams//health/1471-2105-12-180_1.png" style="width:80%">

Figure: <i>The example is taken from the paper “A Simple Approach to
Ranking Differentially Expressed Gene Expression Time Courses through
Gaussian Process Regression.” Kalaitzis and Lawrence (2011).</i>

<center>

<http://www.biomedcentral.com/1471-2105/12/180>

</center>
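One way to turn "signal or noise?" into a number is to compare the log marginal likelihood of a GP that includes a smooth signal component against a model that explains the data as noise alone. The sketch below is a simplified illustration of that idea with fixed hyperparameters and synthetic, standardized data. The function names, the hyperparameter values and the data are all assumptions made for illustration, and the published approach optimizes the hyperparameters rather than fixing them.

```python
import numpy as np

def eq_cov(x, x2, variance=1.0, lengthscale=1.0):
    """Exponentiated quadratic covariance between column vectors."""
    return variance * np.exp(-0.5 * (x - x2.T) ** 2 / lengthscale**2)

def log_marginal_likelihood(K, y):
    """log N(y | 0, K) for a column vector y."""
    n = len(y)
    L = np.linalg.cholesky(K)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    alpha = np.linalg.solve(K, y)
    return -0.5 * (n * np.log(2 * np.pi) + log_det + (y.T @ alpha).item())

def log_ratio(x, y, lengthscale=1.0, noise_variance=0.01):
    """Log likelihood of a signal-plus-noise GP minus that of a noise-only model."""
    n = len(y)
    K_signal = eq_cov(x, x, lengthscale=lengthscale) + noise_variance * np.eye(n)
    K_noise = np.eye(n)  # unit-variance white noise, matching standardized data
    return log_marginal_likelihood(K_signal, y) - log_marginal_likelihood(K_noise, y)

rng = np.random.default_rng(1)
x = np.linspace(0, 6, 20)[:, np.newaxis]
y_active = np.sin(x) + 0.1 * rng.standard_normal(x.shape)  # smooth time course
y_quiet = rng.standard_normal(x.shape)                     # pure noise

print("active gene log ratio:", log_ratio(x, y_active))
print("quiet gene log ratio: ", log_ratio(x, y_quiet))
```

A positive log ratio favours the model with a smooth signal; a negative one favours the noise-only explanation.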

Our first objective will be to perform a Gaussian process fit to the
data. We’ll do this using the [GPy
software](https://github.com/SheffieldML/GPy).

In [None]:
import GPy

In [None]:
m_full = GPy.models.GPRegression(x,yhat)
m_full.kern.lengthscale=50
_ = m_full.optimize() # Optimize parameters of covariance function

We initialize the length scale parameter (which here actually represents
a *time scale* of the covariance function) to a reasonable value. The
default would be 1, but here we set it to 50 minutes, given that the
observations are spread across zero to 250 minutes.

In [None]:
xt = np.linspace(-20,260,200)[:,np.newaxis]
yt_mean, yt_var = m_full.predict(xt)
yt_sd=np.sqrt(yt_var)

Now we plot the results using the helper function in `mlai.plot`.

In [None]:
import mlai.plot as plot

In [None]:
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot.model_output(m_full, scale=scale, offset=offset, ax=ax, xlabel='time/min', ylabel='expression', fontsize=20, portion=0.2)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
ax.set_title('log likelihood: {ll:.3}'.format(ll=m_full.log_likelihood()), fontsize=20)
mlai.write_figure(figure=fig,
                  filename='./gp/della-gatta-gene-gp.svg', 
                  transparent=True, frameon=True)

<img src="https://mlatcl.github.io/deepnn/./slides/diagrams//gp/della-gatta-gene-gp.svg" class="" width="80%" style="vertical-align:middle;">

Figure: <i>Result of the fit of the Gaussian process model with the time
scale parameter initialized to 50 minutes.</i>

Now we try a model initialized with a longer length scale.

In [None]:
m_full2 = GPy.models.GPRegression(x,yhat)
m_full2.kern.lengthscale=2000
_ = m_full2.optimize() # Optimize parameters of covariance function

In [None]:
import mlai.plot as plot

In [None]:
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot.model_output(m_full2, scale=scale, offset=offset, ax=ax, xlabel='time/min', ylabel='expression', fontsize=20, portion=0.2)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
ax.set_title('log likelihood: {ll:.3}'.format(ll=m_full2.log_likelihood()), fontsize=20)
mlai.write_figure(figure=fig,
                  filename='./gp/della-gatta-gene-gp2.svg', 
                  transparent=True, frameon=True)

<img src="https://mlatcl.github.io/deepnn/./slides/diagrams//gp/della-gatta-gene-gp2.svg" class="" width="80%" style="vertical-align:middle;">

Figure: <i>Result of the fit of the Gaussian process model with the time
scale parameter initialized to 2000 minutes.</i>

Now we try a model initialized with a lower noise.

In [None]:
m_full3 = GPy.models.GPRegression(x,yhat)
m_full3.kern.lengthscale=20
m_full3.likelihood.variance=0.001
_ = m_full3.optimize() # Optimize parameters of covariance function

In [None]:
import mlai.plot as plot

In [None]:
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot.model_output(m_full3, scale=scale, offset=offset, ax=ax, xlabel='time/min', ylabel='expression', fontsize=20, portion=0.2)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
ax.set_title('log likelihood: {ll:.3}'.format(ll=m_full3.log_likelihood()), fontsize=20)
mlai.write_figure(figure=fig,
                  filename='./gp/della-gatta-gene-gp3.svg', 
                  transparent=True, frameon=True)

<img src="https://mlatcl.github.io/deepnn/./slides/diagrams//gp/della-gatta-gene-gp3.svg" class="" width="80%" style="vertical-align:middle;">

Figure: <i>Result of the fit of the Gaussian process model with the
noise initialized low (variance 0.001, i.e. standard deviation of about
0.03) and the time scale parameter initialized to 20 minutes.</i>

In [None]:
import mlai.plot as plot

In [None]:
plot.multiple_optima(diagrams='./gp')

<img src="https://mlatcl.github.io/deepnn/./slides/diagrams//gp/multiple-optima000.svg" class="" width="50%" style="vertical-align:middle;">

Figure: <i></i>
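The three fits above start from different initializations because the likelihood surface can have more than one optimum. A minimal sketch of how one might inspect that surface is given below: it evaluates the negative log marginal likelihood on a grid of length scales and noise variances. The synthetic time course, the grid ranges and the function name are assumptions for illustration; they are not the gene data or the `mlai` plotting code.

```python
import numpy as np

def neg_log_likelihood(x, y, lengthscale, noise_variance, signal_variance=1.0):
    """Negative log marginal likelihood of a zero-mean GP with EQ covariance."""
    n = len(y)
    K = signal_variance * np.exp(-0.5 * (x - x.T) ** 2 / lengthscale**2)
    K = K + noise_variance * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(K, y)
    half_log_det = np.sum(np.log(np.diag(L)))
    return 0.5 * n * np.log(2 * np.pi) + half_log_det + 0.5 * (y.T @ alpha).item()

# Synthetic stand-in for a standardized expression time course (assumption).
rng = np.random.default_rng(2)
x = np.linspace(0, 240, 13)[:, np.newaxis]
y = np.sin(x / 40.0) + 0.3 * rng.standard_normal(x.shape)
y = (y - y.mean()) / y.std()

lengthscales = np.logspace(0, 4, 30)       # 1 to 10,000 minutes
noise_variances = np.logspace(-3, 1, 30)
surface = np.array([[neg_log_likelihood(x, y, ls, nv) for ls in lengthscales]
                    for nv in noise_variances])

i, j = np.unravel_index(np.argmin(surface), surface.shape)
print("grid minimum at lengthscale", lengthscales[j],
      "and noise variance", noise_variances[i])
```

Plotting `surface` as a contour plot over the two log-scaled axes shows where a gradient-based optimizer could settle, depending on where it starts.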

<!--

<img src="https://mlatcl.github.io/deepnn/./slides/diagrams//gp/multiple-optima001.svg" class="" width="" style="vertical-align:middle;">-->

## Example: Prediction of Malaria Incidence in Uganda


<svg viewBox="0 0 200 200" style="width:15%">

<defs> <clipPath id="clip2">

<style>
circle {
  fill: black;
}
</style>

<circle cx="100" cy="100" r="100"/> </clipPath> </defs>

<title>

Martin Mubangizi

</title>

<image preserveAspectRatio="xMinYMin slice" width="100%" xlink:href="https://mlatcl.github.io/deepnn/./slides/diagrams//people/martin-mubangizi.png" clip-path="url(#clip2)"/>

</svg>
<svg viewBox="0 0 200 200" style="width:15%">

<defs> <clipPath id="clip3">

<style>
circle {
  fill: black;
}
</style>

<circle cx="100" cy="100" r="100"/> </clipPath> </defs>

<title>

Ricardo Andrade Pacheco

</title>

<image preserveAspectRatio="xMinYMin slice" width="100%" xlink:href="https://mlatcl.github.io/deepnn/./slides/diagrams//people/ricardo-andrade-pacheco.png" clip-path="url(#clip3)"/>

</svg>
<svg viewBox="0 0 200 200" style="width:15%">

<defs> <clipPath id="clip4">

<style>
circle {
  fill: black;
}
</style>

<circle cx="100" cy="100" r="100"/> </clipPath> </defs>

<title>

John Quinn

</title>

<image preserveAspectRatio="xMinYMin slice" width="100%" xlink:href="https://mlatcl.github.io/deepnn/./slides/diagrams//people/john-quinn.jpg" clip-path="url(#clip4)"/>

</svg>

As an example of using Gaussian process models within the full pipeline
from data to decision, we’ll consider the prediction of malaria incidence
in Uganda. For the purposes of this study malaria reports come in two
forms, HMIS reports from health centres and Sentinel data, which is
curated by the WHO. There are limited sentinel sites and many HMIS
sites.

The work is from Ricardo Andrade Pacheco’s PhD thesis, completed in
collaboration with John Quinn and Martin Mubangizi (Andrade-Pacheco et
al., 2014; Mubangizi et al., 2014). John and Martin were initially from
the AI-DEV group from the University of Makerere in Kampala and more
latterly they were based at UN Global Pulse in Kampala. You can see the
work summarized on the UN Global Pulse [disease outbreaks project site
here](https://diseaseoutbreaks.unglobalpulse.net/uganda/).

-   See [UN Global Pulse Disease Outbreaks
    Site](https://diseaseoutbreaks.unglobalpulse.net/uganda/)

Malaria data is spatial data. Uganda is split into districts, and health
reports can be found for each district. This suggests that models such
as conditional random fields could be used for spatial modelling, but
there are two complexities with this. First of all, occasionally
districts split into two. Secondly, sentinel sites are a specific
location within a district, such as Nagongera which is a sentinel site
based in the Tororo district.

<img class="" src="https://mlatcl.github.io/deepnn/./slides/diagrams//health/uganda-districts-2006.png" style="width:50%">

Figure: <i>Ugandan districts. Data SRTM/NASA from
<https://dds.cr.usgs.gov/srtm/version2_1>.</i>

(Andrade-Pacheco et al., 2014; Mubangizi et al., 2014)

<img src="https://mlatcl.github.io/deepnn/./slides/diagrams//health/Kapchorwa_District_in_Uganda.svg" class="" width="50%" style="vertical-align:middle;">

Figure: <i>The Kapchorwa District, home district of Stephen
Kiprotich.</i>

Stephen Kiprotich, the 2012 gold medal winner from the London Olympics,
comes from Kapchorwa district, in eastern Uganda, near the border with
Kenya.

The common standard for collecting health data on the African continent
is from the Health management information systems (HMIS). However, this
data suffers from missing values (Gething et al., 2006) and diagnosis of
diseases like typhoid and malaria may be confounded.

<img src="https://mlatcl.github.io/deepnn/./slides/diagrams//health/Tororo_District_in_Uganda.svg" class="" width="50%" style="vertical-align:middle;">

Figure: <i>The Tororo district, where the sentinel site, Nagongera, is
located.</i>

[World Health Organization Sentinel Surveillance
systems](https://www.who.int/immunization/monitoring_surveillance/burden/vpd/surveillance_type/sentinel/en/)
are set up “when high-quality data are needed about a particular disease
that cannot be obtained through a passive system”. Several sentinel
sites give accurate assessment of malaria disease levels in Uganda,
including a site in Nagongera.

<img class="negate" src="https://mlatcl.github.io/deepnn/./slides/diagrams//health/sentinel_nagongera.png" style="width:100%">

Figure: <i>Sentinel and HMIS data along with rainfall and temperature
for the Nagongera sentinel station in the Tororo district.</i>

In collaboration with the AI Research Group at Makerere we chose to
investigate whether Gaussian process models could be used to assimilate
information from these two different sources of disease information.
Further, we were interested in whether local information on rainfall and
temperature could be used to improve malaria estimates.

The aim of the project was to use WHO Sentinel sites, alongside rainfall
and temperature, to improve predictions from HMIS data of levels of
malaria.

<img src="https://mlatcl.github.io/deepnn/./slides/diagrams//health/Mubende_District_in_Uganda.svg" class="" width="50%" style="vertical-align:middle;">

Figure: <i>The Mubende District.</i>

<img class="" src="https://mlatcl.github.io/deepnn/./slides/diagrams//health/mubende.png" style="width:80%">

Figure: <i>Prediction of malaria incidence in Mubende.</i>

<img class="" src="https://mlatcl.github.io/deepnn/./slides/diagrams//gpss/1157497_513423392066576_1845599035_n.jpg" style="width:80%">

Figure: <i>The project arose out of the Gaussian process summer school
held at Makerere in Kampala in 2013. The school led, in turn, to the
Data Science Africa initiative.</i>

## Early Warning Systems

<img src="https://mlatcl.github.io/deepnn/./slides/diagrams//health/Kabarole_District_in_Uganda.svg" class="" width="50%" style="vertical-align:middle;">

Figure: <i>The Kabarole district in Uganda.</i>

<img class="" src="https://mlatcl.github.io/deepnn/./slides/diagrams//health/kabarole.gif" style="width:100%">

Figure: <i>Estimate of the current disease situation in the Kabarole
district over time. Estimate is constructed with a Gaussian process with
an additive covariance function.</i>

This is a health monitoring system for the Kabarole district. Here we
have fitted the reports with a Gaussian process with an additive
covariance function. It has two components: one is a long time scale
component (in red above), the other is a short time scale component (in
blue).

Monitoring proceeds by considering two aspects of the curve. Is the blue
line (the short term report signal) above the red (which represents the
long term trend)? If so we have higher than expected reports. If this is
the case *and* the gradient is still positive (i.e. reports are going
up) we encode this with a *red* color. If it is the case and the
gradient of the blue line is negative (i.e. reports are going down) we
encode this with an *amber* color. Conversely, if the blue line is below
the red *and* decreasing, we color *green*. On the other hand if it is
below red but increasing, we color *yellow*.

This gives us an early warning system for disease. Red is a bad
situation getting worse; amber is bad, but improving. Green is good and
getting better, and yellow is good but degrading.

Finally, there is a gray region which represents when the scale of the
effect is small.
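The color rules above can be written as a small decision function. This is a sketch of the logic as described in the text, not the code used in the deployed system. The function name and the `min_effect` threshold for the gray region are hypothetical, and a gradient of exactly zero is treated as "not increasing".

```python
def warning_color(short_term, long_term, gradient, min_effect=0.1):
    """Map the short term signal, long term trend and gradient to a color.

    short_term: value of the short time scale component (blue line)
    long_term:  value of the long time scale component (red line)
    gradient:   slope of the short term component
    min_effect: hypothetical threshold below which the effect counts as small
    """
    gap = short_term - long_term
    if abs(gap) < min_effect:
        return "gray"                                # effect too small to call
    if gap > 0:                                      # more reports than expected
        return "red" if gradient > 0 else "amber"
    return "yellow" if gradient > 0 else "green"     # fewer reports than expected

print(warning_color(2.0, 1.0, 0.5))    # above trend and rising
print(warning_color(2.0, 1.0, -0.5))   # above trend but falling
print(warning_color(0.5, 1.0, -0.2))   # below trend and falling
print(warning_color(0.5, 1.0, 0.2))    # below trend but rising
```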

<img class="" src="https://mlatcl.github.io/deepnn/./slides/diagrams//health/monitor.gif" style="width:50%">

Figure: <i>The map of Ugandan districts with an overview of the malaria
situation in each district.</i>

These colors can now be observed directly on a spatial map of the
districts to give an immediate impression of the current status of the
disease across the country.

## Additive Covariance


In [None]:
import mlai

In [None]:
%load -n mlai.Kernel

In [None]:
import mlai

In [None]:
%load -n mlai.linear_cov

In [None]:
import mlai

In [None]:
%load -n mlai.eq_cov

In [None]:
import mlai

In [None]:
%load -n mlai.add_cov

In [None]:
# Raw string so that the LaTeX backslashes are not read as escape sequences.
kernel = Kernel(function=add_cov,
                name='Additive',
                shortname='add',
                formula=r'\kernelScalar_f(\inputVector, \inputVector^\prime) = \kernelScalar_g(\inputVector, \inputVector^\prime) + \kernelScalar_h(\inputVector, \inputVector^\prime)',
                kerns=[linear_cov, eq_cov],
                kern_args=[{'variance': 25}, {'lengthscale' : 0.2}])

In [None]:
import mlai.plot as plot

In [None]:
plot.covariance_func(kernel=kernel, diagrams='./kern/')

An additive covariance function is derived from considering the result
of summing two Gaussian processes together. If the first Gaussian
process is $g(\cdot)$, governed by covariance $k_g(\cdot, \cdot)$ and
the second process is $h(\cdot)$, governed by covariance
$k_h(\cdot, \cdot)$ then the combined process
$f(\cdot) = g(\cdot) + h(\cdot)$ is governed by a covariance function,
$$
k_f(\mathbf{ x}, \mathbf{ x}^\prime) = k_g(\mathbf{ x}, \mathbf{ x}^\prime) + k_h(\mathbf{ x}, \mathbf{ x}^\prime)
$$

<center>

$$k_f(\mathbf{ x}, \mathbf{ x}^\prime) = k_g(\mathbf{ x}, \mathbf{ x}^\prime) + k_h(\mathbf{ x}, \mathbf{ x}^\prime)$$

</center>
<table>
<tr>
<td width="45%">

<img src="../slides/diagrams/kern/add_covariance.svg" class="" width="100%" style="vertical-align:middle;">

</td>
<td width="45%">

<img class="negate" src="../slides/diagrams/kern/add_covariance.gif" style="width:100%">

</td>
</tr>
</table>

Figure: <i>An additive covariance function formed by combining a linear
and an exponentiated quadratic covariance function.</i>
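The additive result can be checked numerically. The sketch below, which assumes independent processes $g$ and $h$, samples a linear-covariance process (a line with a random slope) and an exponentiated quadratic process, adds them, and compares the empirical covariance of the sum with $k_g + k_h$. The helper names `linear_K` and `eq_K` are stand-ins written for this example and are not the `mlai` functions loaded above; the variance of 25 and length scale of 0.2 mirror the settings in the cell above.

```python
import numpy as np

def linear_K(x, x2, variance=1.0):
    """Linear covariance matrix between column vectors."""
    return variance * x @ x2.T

def eq_K(x, x2, variance=1.0, lengthscale=1.0):
    """Exponentiated quadratic covariance matrix between column vectors."""
    return variance * np.exp(-0.5 * (x - x2.T) ** 2 / lengthscale**2)

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 25)[:, np.newaxis]
K_g = linear_K(x, x, variance=25.0)
K_h = eq_K(x, x, lengthscale=0.2)
K_f = K_g + K_h                      # covariance of f = g + h

n_samples = 5000
# g: a linear GP with variance 25 is a line through the origin with slope ~ N(0, 25).
slopes = rng.normal(0.0, 5.0, size=(n_samples, 1))
g = slopes * x.T
# h: draw from N(0, K_h) via a Cholesky factor (small jitter for stability).
L_h = np.linalg.cholesky(K_h + 1e-6 * np.eye(len(x)))
h = rng.standard_normal((n_samples, len(x))) @ L_h.T

f = g + h
empirical = f.T @ f / n_samples
max_dev = np.max(np.abs(empirical - K_f))
print("largest entry of K_f:", K_f.max())
print("largest deviation of the empirical covariance:", max_dev)
```

The deviation shrinks as `n_samples` grows, which is what the additive formula predicts when the two processes are independent.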

## Analysis of US Birth Rates


<svg viewBox="0 0 200 200" style="width:15%">

<defs> <clipPath id="clip5">

<style>
circle {
  fill: black;
}
</style>

<circle cx="100" cy="100" r="100"/> </clipPath> </defs>

<title>

Aki Vehtari

</title>

<image preserveAspectRatio="xMinYMin slice" width="100%" xlink:href="https://mlatcl.github.io/deepnn/./slides/diagrams//people/aki-vehtari.jpg" clip-path="url(#clip5)"/>

</svg>

<img class="" src="https://mlatcl.github.io/deepnn/./slides/diagrams//ml/bialik-fridaythe13th-1.png" style="width:70%">

Figure: <i>This is a retrospective analysis of US births by Aki Vehtari.
The challenges of forecasting. Even with seasonal and weekly effects
removed there are significant effects on holidays, weekends, etc.</i>

There’s a nice analysis of US birth rates by Gaussian processes with
additive covariances in Gelman et al. (2013). A combination of
covariance functions is used to take account of weekly and yearly
trends. The analysis is summarized on the cover of the book.

<table>
<tr>
<td width="50%">

<img class="" src="https://mlatcl.github.io/deepnn/./slides/diagrams//ml/bda_cover_1.png" style="width:80%">

</td>
<td width="50%">

<img class="" src="https://mlatcl.github.io/deepnn/./slides/diagrams//ml/bda_cover.png" style="width:80%">

</td>
</tr>
</table>

Figure: <i>Two different editions of Bayesian Data Analysis (Gelman et
al., 2013).</i>

## Basis Function Covariance


The fixed basis function covariance just comes from the properties of a
multivariate Gaussian. If we decide $$
\mathbf{ f}=\boldsymbol{ \Phi}\mathbf{ w}
$$ and then we assume $$
\mathbf{ w}\sim \mathcal{N}\left(\mathbf{0},\alpha\mathbf{I}\right)
$$ then it follows from the properties of a multivariate Gaussian that
$$
\mathbf{ f}\sim \mathcal{N}\left(\mathbf{0},\alpha\boldsymbol{ \Phi}\boldsymbol{ \Phi}^\top\right)
$$ meaning that the vector of observations from the function is jointly
distributed as a Gaussian process with covariance matrix
$\mathbf{K}= \alpha\boldsymbol{ \Phi}\boldsymbol{ \Phi}^\top$. Each
element of the covariance matrix can then be found as the inner product
between two rows of the basis function matrix, scaled by $\alpha$.
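A short numerical check of this construction is sketched below, assuming three radial basis functions with centres spread over $[-0.5, 0.5]$ and a width of 0.125, in the spirit of the cell that follows. The helper `radial_basis` is written for this example and is not the `mlai.radial` function, whose exact signature is not assumed here. The sketch builds $\mathbf{K} = \alpha\boldsymbol{\Phi}\boldsymbol{\Phi}^\top$ and compares it with the empirical covariance of samples $\mathbf{f} = \boldsymbol{\Phi}\mathbf{w}$.

```python
import numpy as np

def radial_basis(x, centres, width):
    """Basis matrix Phi with one Gaussian bump per centre (one column each)."""
    return np.exp(-0.5 * (x - centres.T) ** 2 / width**2)

alpha = 1.0
x = np.linspace(-1, 1, 30)[:, np.newaxis]
centres = np.linspace(-0.5, 0.5, 3)[:, np.newaxis]
Phi = radial_basis(x, centres, width=0.125)        # shape (30, 3)

K = alpha * Phi @ Phi.T                            # implied covariance matrix

# Sample weights w ~ N(0, alpha I) and push them through the basis.
rng = np.random.default_rng(0)
n_samples = 10000
W = rng.normal(0.0, np.sqrt(alpha), size=(3, n_samples))
F = Phi @ W                                        # each column is one sample of f
empirical = F @ F.T / n_samples

n_nonzero = int(np.sum(np.linalg.eigvalsh(K) > 1e-10))
max_dev = np.max(np.abs(empirical - K))
print("non-zero eigenvalues of K:", n_nonzero)
print("largest deviation of the empirical covariance:", max_dev)
```

Because there are only three basis functions, the covariance matrix has at most three non-zero eigenvalues: the process is degenerate, and every sample lives in a three-dimensional subspace of function values.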

In [None]:
import mlai

In [None]:
%load -n mlai.basis_cov

In [None]:
import mlai

In [None]:
%load -n mlai.radial

In [None]:
import mlai.plot as plot
import mlai
import numpy as np

In [None]:

basis = mlai.Basis(function=radial, 
                   number=3,
                   data_limits=[-0.5, 0.5], 
                   width=0.125)
kernel = mlai.Kernel(function=basis_cov,
                     name='Basis',
                     shortname='basis',                  
                     formula=r'\kernel(\inputVector, \inputVector^\prime) = \basisVector(\inputVector)^\top \basisVector(\inputVector^\prime)',
                     basis=basis)
                     
plot.covariance_func(kernel, diagrams='./kern/')

<center>

$$k(\mathbf{ x}, \mathbf{ x}^\prime) = \boldsymbol{ \phi}(\mathbf{ x})^\top \boldsymbol{ \phi}(\mathbf{ x}^\prime)$$

</center>
<table>
<tr>
<td width="45%">

<img src="../slides/diagrams/kern/basis_covariance.svg" class="" width="100%" style="vertical-align:middle;">

</td>
<td width="45%">

<img class="negate" src="../slides/diagrams/kern/basis_covariance.gif" style="width:100%">

</td>
</tr>
</table>

Figure: <i>A covariance function based on a non-linear basis given by
$\boldsymbol{ \phi}(\mathbf{ x})$.</i>

## Brownian Covariance


In [None]:
import mlai

In [None]:
%load -n mlai.brownian_cov

In [None]:
import mlai.plot as plot
import mlai
import numpy as np

In [None]:
t=np.linspace(0, 2, 200)[:, np.newaxis]
kernel = mlai.Kernel(function=brownian_cov,
                     name='Brownian',
                     formula=r'\kernelScalar(t, t^\prime)=\alpha \min(t, t^\prime)',
                     shortname='brownian')
plot.covariance_func(kernel, t, diagrams='./kern/')

Brownian motion is also a Gaussian process. It follows a Gaussian random
walk, with diffusion occurring at each time point driven by a Gaussian
input. This implies it is both Markov and Gaussian. The covariance
function for Brownian motion has the form $$
k(t, t^\prime)=\alpha \min(t, t^\prime)
$$
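The link between the random walk and the $\min$ covariance can be checked by simulation. The sketch below accumulates independent Gaussian increments with variance $\alpha\,\Delta t$ and compares the empirical covariance of the resulting paths with $\alpha \min(t, t^\prime)$. The helper name `brownian_K`, the step count and the number of paths are choices made for this example, not taken from `mlai`.

```python
import numpy as np

def brownian_K(t, t2, variance=1.0):
    """Brownian motion covariance matrix: variance * min(t, t')."""
    return variance * np.minimum(t, t2.T)

alpha = 1.0
n_steps, n_paths = 50, 20000
dt = 2.0 / n_steps
t = dt * np.arange(1, n_steps + 1)[:, np.newaxis]   # times in (0, 2]

rng = np.random.default_rng(0)
# Each increment is an independent Gaussian with variance alpha * dt.
increments = rng.normal(0.0, np.sqrt(alpha * dt), size=(n_paths, n_steps))
paths = np.cumsum(increments, axis=1)               # Gaussian random walks

empirical = paths.T @ paths / n_paths
K = brownian_K(t, t, variance=alpha)
max_dev = np.max(np.abs(empirical - K))
print("largest deviation from alpha * min(t, t'):", max_dev)
```

The variance grows linearly with time, and the covariance between two times depends only on the earlier one, which reflects the Markov property mentioned above.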

<center>

$$k(t, t^\prime)=\alpha \min(t, t^\prime)$$

</center>
<table>
<tr>
<td width="45%">

<img src="../slides/diagrams/kern/brownian_covariance.svg" class="" width="100%" style="vertical-align:middle;">

</td>
<td width="45%">

<img class="negate" src="../slides/diagrams/kern/brownian_covariance.gif" style="width:100%">

</td>
</tr>
</table>

Figure: <i>Brownian motion covariance function.</i>

## MLP Covariance


In [None]:
import mlai

In [None]:
%load -n mlai.mlp_cov

In [None]:
import mlai.plot as plot
import mlai
import numpy as np

In [None]:
kernel = mlai.Kernel(function=mlp_cov,
                     name='Multilayer Perceptron',
                     shortname='mlp',                    
                     formula=r'\kernelScalar(\inputVector, \inputVector^\prime) = \alpha \arcsin\left(\frac{w \inputVector^\top \inputVector^\prime + b}{\sqrt{\left(w \inputVector^\top \inputVector + b + 1\right)\left(w \left.\inputVector^\prime\right.^\top \inputVector^\prime + b + 1\right)}}\right)',
                     w=5, b=0.5)
                     
plot.covariance_func(kernel, diagrams='./kern/')

The multi-layer perceptron (MLP) covariance, also known as the neural
network covariance or the arcsin covariance, is derived by considering
the limit of a neural network with infinitely many hidden units.

<center>

$$k(\mathbf{ x}, \mathbf{ x}^\prime) = \alpha \arcsin\left(\frac{w \mathbf{ x}^\top \mathbf{ x}^\prime + b}{\sqrt{\left(w \mathbf{ x}^\top \mathbf{ x}+ b + 1\right)\left(w \left.\mathbf{ x}^\prime\right.^\top \mathbf{ x}^\prime + b + 1\right)}}\right)$$

</center>
<table>
<tr>
<td width="45%">

<img src="../slides/diagrams/kern/mlp_covariance.svg" class="" width="100%" style="vertical-align:middle;">

</td>
<td width="45%">

<img class="negate" src="../slides/diagrams/kern/mlp_covariance.gif" style="width:100%">

</td>
</tr>
</table>

Figure: <i>The multi-layer perceptron covariance function. This is
derived by considering the infinite limit of a neural network with
probit activation functions.</i>
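The formula above translates directly into a few lines of NumPy. This is a sketch written from the displayed equation, with $\alpha = 1$, $w = 5$ and $b = 0.5$; the function name `mlp_K` is an assumption for this example and it is not the `mlai.mlp_cov` implementation.

```python
import numpy as np

def mlp_K(X, X2, variance=1.0, w=5.0, b=0.5):
    """Arcsin (MLP) covariance matrix between the rows of X and X2."""
    inner = w * X @ X2.T + b
    norm1 = w * np.sum(X**2, axis=1) + b + 1.0
    norm2 = w * np.sum(X2**2, axis=1) + b + 1.0
    return variance * np.arcsin(inner / np.sqrt(np.outer(norm1, norm2)))

x = np.linspace(-1, 1, 40)[:, np.newaxis]
K = mlp_K(x, x)

print("symmetric:", np.allclose(K, K.T))
print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())
print("largest diagonal entry:", np.diag(K).max())
```

The `+ 1` in each normalizer keeps the argument of the arcsin strictly inside $(-1, 1)$, so the diagonal stays below $\alpha\pi/2$. Unlike the exponentiated quadratic, this covariance is not stationary: it depends on where the inputs are, not only on their separation.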

## RELU Covariance


In [None]:
import mlai

In [None]:
%load -n mlai.relu_cov

In [None]:
import mlai.plot as plot
import mlai
import numpy as np

In [None]:
kernel = mlai.Kernel(function=relu_cov,
                     name='RELU',
                     shortname='relu',                   
                     formula=r'\kernelScalar(\inputVector, \inputVector^\prime) = \alpha \arcsin\left(\frac{w \inputVector^\top \inputVector^\prime + b}{\sqrt{\left(w \inputVector^\top \inputVector + b + 1\right)\left(w \left.\inputVector^\prime\right.^\top \inputVector^\prime + b + 1\right)}}\right)',
                     w=5, b=0.5)
                     
plot.covariance_func(kernel, diagrams='./kern/')

<center>

$$k(\mathbf{ x}, \mathbf{ x}^\prime) = 
\alpha \arcsin\left(\frac{w \mathbf{ x}^\top \mathbf{ x}^\prime + b}
{\sqrt{\left(w \mathbf{ x}^\top \mathbf{ x}+ b + 1\right)
\left(w \left.\mathbf{ x}^\prime\right.^\top \mathbf{ x}^\prime + b + 1\right)}}\right)$$

</center>
<table>
<tr>
<td width="45%">

<img src="../slides/diagrams/kern/relu_covariance.svg" class="" width="100%" style="vertical-align:middle;">

</td>
<td width="45%">

<img class="negate" src="../slides/diagrams/kern/relu_covariance.gif" style="width:100%">

</td>
</tr>
</table>

Figure: <i>Rectified linear unit covariance function.</i>

## Sinc Covariance


Another approach to developing covariance functions exploits Bochner’s
theorem (Bochner, 1959). Bochner’s theorem tells us that any positive
filter in Fourier space has an associated Gaussian process with
a stationary covariance function. The covariance function is the
*inverse Fourier transform* of the filter applied in Fourier space.

For example, in signal processing, *band limitations* are commonly
applied as an assumption. We may believe, for instance, that no frequency
above $w=2$ exists in the signal. This is equivalent to a rectangle
function being applied as the filter in Fourier space.

The inverse Fourier transform of the rectangle function is the
$\text{sinc}(\cdot)$ function. So the sinc is a valid covariance
function, and it represents *band limited* signals.

Note that other covariance functions we’ve introduced can also be
interpreted in this way. For example, the exponentiated quadratic
covariance function can be Fourier transformed to see what the implied
filter in Fourier space is. The Fourier transform of the exponentiated
quadratic is an exponentiated quadratic, so the standard EQ-covariance
implies an EQ filter in Fourier space.
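A sinc covariance is easy to write down and to check numerically. The sketch below assumes the form $\alpha\,\text{sinc}(\pi w r)$ with the unnormalized convention $\text{sinc}(z) = \sin(z)/z$ and $r = |x - x^\prime|$. Since `np.sinc(z)` computes $\sin(\pi z)/(\pi z)$, this corresponds to `np.sinc(w * r)`. The function name `sinc_K` is an assumption for this example, and the exact relationship between $w$ and the frequency cutoff depends on the Fourier convention in use.

```python
import numpy as np

def sinc_K(x, x2, variance=1.0, w=2.0):
    """Sinc covariance matrix: variance * sin(pi w r) / (pi w r), r = |x - x'|."""
    r = np.abs(x - x2.T)
    return variance * np.sinc(w * r)   # np.sinc(z) = sin(pi z) / (pi z)

x = np.linspace(-2, 2, 60)[:, np.newaxis]
K = sinc_K(x, x)

# A valid covariance must give positive semi-definite matrices
# (up to floating point round-off).
print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())

# The covariance is stationary: it equals the variance at r = 0
# and crosses zero wherever w * r is a non-zero integer.
origin = np.array([[0.0]])
print("k(0)   =", sinc_K(origin, origin)[0, 0])
print("k(0.5) =", sinc_K(origin, np.array([[0.5]]))[0, 0])
```

The oscillating tails of the sinc mean that distant points can be negatively correlated, which the exponentiated quadratic never allows.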

In [None]:
import mlai

In [None]:
%load -n mlai.sinc_cov

In [None]:
import mlai.plot as plot
import mlai
import numpy as np

In [None]:
kernel = mlai.Kernel(function=sinc_cov,
                     name='Sinc',
                     shortname='sinc',                   
                     formula=r'\kernelScalar(\inputVector, \inputVector^\prime) = \alpha \text{sinc}\left(\pi w r\right)',
                     w=2)
                     
plot.covariance_func(kernel, diagrams='./kern/')

<center>

$$k(\mathbf{ x}, \mathbf{ x}^\prime) = \alpha \text{sinc}\left(\pi w r\right)$$

</center>
<table>
<tr>
<td width="45%">

<img src="../slides/diagrams/kern/sinc_covariance.svg" class="" width="100%" style="vertical-align:middle;">

</td>
<td width="45%">

<img class="negate" src="../slides/diagrams/kern/sinc_covariance.gif" style="width:100%">

</td>
</tr>
</table>

Figure: <i>Sinc covariance function.</i>

## Polynomial Covariance

In [None]:
import mlai

In [None]:
%load -n mlai.polynomial_cov

In [None]:
import mlai.plot as plot
import mlai
import numpy as np

In [None]:
# The formula is a raw string so that LaTeX commands such as \alpha and
# \top are not read as the Python escapes \a and \t.
kernel = mlai.Kernel(function=polynomial_cov,
                     name='Polynomial',
                     shortname='polynomial',
                     formula=r'\kernelScalar(\inputVector, \inputVector^\prime) = \alpha(w \inputVector^\top\inputVector^\prime + b)^d',
                     degree=5)

plot.covariance_func(kernel, diagrams='./kern/')

<center>

$$k(\mathbf{ x}, \mathbf{ x}^\prime) = \alpha(w \mathbf{ x}^\top\mathbf{ x}^\prime + b)^d$$

</center>
<table>
<tr>
<td width="45%">

<img src="../slides/diagrams/kern/polynomial_covariance.svg" class="" width="100%" style="vertical-align:middle;">

</td>
<td width="45%">

<img class="negate" src="../slides/diagrams/kern/polynomial_covariance.gif" style="width:100%">

</td>
</tr>
</table>

Figure: <i>Polynomial covariance function.</i>
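
To make the formula concrete, here is a minimal sketch of a polynomial covariance written directly from the expression above. The function `polynomial_cov_sketch` and its default values are assumptions made for illustration and may differ from `mlai.polynomial_cov`. The example also shows that, unlike a stationary covariance, the diagonal of the matrix varies with the input, because the value depends on the inner product rather than on the distance between inputs.

```python
import numpy as np

def polynomial_cov_sketch(x, x_prime, variance=1.0, degree=2, w=1.0, b=1.0):
    """Polynomial covariance alpha * (w x^T x' + b)^d.

    Hypothetical helper for illustration; defaults are assumptions.
    """
    return variance * (w * np.dot(x, x_prime) + b) ** degree

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(20, 2))  # 20 two-dimensional inputs
K = np.array([[polynomial_cov_sketch(a, c, degree=5) for c in X] for a in X])

# The prior variance k(x, x) changes from point to point (non-stationary).
print(np.ptp(np.diag(K)) > 0)

# With alpha > 0, w > 0 and b >= 0 the matrix is positive semi-definite,
# up to floating-point round-off.
eigenvalues = np.linalg.eigvalsh(K)
print(eigenvalues.min() >= -1e-8 * eigenvalues.max())
```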

## Periodic Covariance

In [None]:
import mlai

In [None]:
%load -n mlai.periodic_cov

In [None]:
import mlai.plot as plot
import mlai
import numpy as np

In [None]:
# The formula is a raw string so that LaTeX commands such as \alpha and
# \frac are not read as the Python escapes \a and \f.
kernel = mlai.Kernel(function=periodic_cov,
                     name='Periodic',
                     shortname='periodic',
                     formula=r'\kernelScalar(\inputVector, \inputVector^\prime) = \alpha\exp\left(\frac{-2\sin(\pi rw)^2}{\lengthScale^2}\right)',
                     lengthscale=1.0)

plot.covariance_func(kernel, diagrams='./kern/')

<center>

$$k(\mathbf{ x}, \mathbf{ x}^\prime) = \alpha\exp\left(\frac{-2\sin(\pi rw)^2}{\ell^2}\right)$$

</center>
<table>
<tr>
<td width="45%">

<img src="../slides/diagrams/kern/periodic_covariance.svg" class="" width="100%" style="vertical-align:middle;">

</td>
<td width="45%">

<img class="negate" src="../slides/diagrams/kern/periodic_covariance.gif" style="width:100%">

</td>
</tr>
</table>

Figure: <i>Periodic covariance function.</i>
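
The periodicity can be read off the formula: the covariance depends on $r$ only through $\sin^2(\pi r w)$, which repeats every $1/w$. The sketch below checks this numerically. The function `periodic_cov_sketch` is a hypothetical helper written from the formula above, not necessarily the same as `mlai.periodic_cov`.

```python
import numpy as np

def periodic_cov_sketch(x, x_prime, variance=1.0, lengthscale=1.0, w=1.0):
    """Periodic covariance alpha * exp(-2 sin^2(pi r w) / ell^2).

    Hypothetical helper for illustration; not the mlai implementation.
    """
    r = np.linalg.norm(np.atleast_1d(x) - np.atleast_1d(x_prime))
    return variance * np.exp(-2.0 * np.sin(np.pi * r * w) ** 2 / lengthscale ** 2)

w = 2.0
x = 0.3

# Inputs separated by a whole period 1/w are perfectly correlated.
k_same = periodic_cov_sketch(x, x, w=w)
k_period = periodic_cov_sketch(x, x + 1.0 / w, w=w)
print(np.isclose(k_same, k_period))

# Half a period apart the covariance reaches its minimum, alpha * exp(-2 / ell^2).
k_half = periodic_cov_sketch(x, x + 0.5 / w, w=w)
print(np.isclose(k_half, np.exp(-2.0)))
```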

## Linear Model of Coregionalization Covariance

In [None]:
%load -s lmc_cov mlai.py

In [None]:
import mlai.plot as plot
import mlai
import numpy as np

In [None]:
K, anim=plot.animate_covariance_function(mlai.compute_kernel, 
                                         kernel=lmc_cov, subkernel=eq_cov,
                                         B = np.asarray([[1, 0.5],[0.5, 1.5]]))

In [None]:
from IPython.display import HTML

In [None]:
HTML(anim.to_jshtml())

In [None]:
plot.save_animation(anim, 
                    diagrams='./kern', 
                    filename='lmc_covariance.html')

<center>

$$k(i, j, \mathbf{ x}, \mathbf{ x}^\prime) = b_{i,j} k(\mathbf{ x}, \mathbf{ x}^\prime)$$

</center>
<table>
<tr>
<td width="45%">

<img src="../slides/diagrams/kern/lmc_covariance.svg" class="" width="100%" style="vertical-align:middle;">

</td>
<td width="45%">

<img class="negate" src="../slides/diagrams/kern/lmc_covariance.gif" style="width:100%">

</td>
</tr>
</table>

Figure: <i>Linear model of coregionalization covariance function.</i>

## Intrinsic Coregionalization Model Covariance

In [None]:
import mlai

In [None]:
%load -n mlai.icm_cov

In [None]:
import mlai.plot as plot
import mlai
import numpy as np

In [None]:
K, anim=plot.animate_covariance_function(mlai.compute_kernel, 
                                         kernel=icm_cov, subkernel=eq_cov,
                                         B = np.asarray([[1, 0.5],[0.5, 1.5]]))

In [None]:
from IPython.display import HTML

In [None]:
HTML(anim.to_jshtml())

In [None]:
plot.save_animation(anim, 
                    diagrams='./kern', 
                    filename='icm_covariance.html')

<center>

$$k(i, j, \mathbf{ x}, \mathbf{ x}^\prime) = b_{i,j} k(\mathbf{ x}, \mathbf{ x}^\prime)$$

</center>
<table>
<tr>
<td width="45%">

<img src="../slides/diagrams/kern/icm_covariance.svg" class="" width="100%" style="vertical-align:middle;">

</td>
<td width="45%">

<img class="negate" src="../slides/diagrams/kern/icm_covariance.gif" style="width:100%">

</td>
</tr>
</table>

Figure: <i>Intrinsic coregionalization model covariance function.</i>
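
The formula says that the covariance between output $i$ at $\mathbf{ x}$ and output $j$ at $\mathbf{ x}^\prime$ is the input covariance scaled by $b_{i,j}$. When every output is observed at the same inputs, the full covariance matrix is therefore the Kronecker product of the coregionalization matrix with the input covariance matrix. The sketch below illustrates this with the matrix `B` used above. It assumes, purely for illustration, that the output index is stored as the first element of each input vector, and the helper names `eq_cov_sketch` and `icm_cov_sketch` are hypothetical rather than the `mlai` implementations.

```python
import numpy as np

def eq_cov_sketch(x, x_prime, variance=1.0, lengthscale=1.0):
    """Exponentiated quadratic covariance (hypothetical helper)."""
    r = np.linalg.norm(x - x_prime)
    return variance * np.exp(-0.5 * (r / lengthscale) ** 2)

def icm_cov_sketch(x, x_prime, B, subkernel, **kwargs):
    """ICM covariance b_{i,j} k(x, x').

    Assumes the first element of each input holds the output index.
    """
    i = int(x[0])
    j = int(x_prime[0])
    return B[i, j] * subkernel(x[1:], x_prime[1:], **kwargs)

B = np.asarray([[1, 0.5], [0.5, 1.5]])  # coregionalization matrix
t = np.linspace(-1.0, 1.0, 5)           # shared inputs for both outputs

# Stack the inputs output by output: (0, t_1), ..., (0, t_5), (1, t_1), ...
X = np.array([[i, ti] for i in range(2) for ti in t])
K = np.array([[icm_cov_sketch(a, c, B, eq_cov_sketch) for c in X] for a in X])

# Covariance over the inputs alone.
K_t = np.array([[eq_cov_sketch(np.array([a]), np.array([c])) for c in t] for a in t])

# The joint covariance equals the Kronecker product of B with K_t.
print(np.allclose(K, np.kron(B, K_t)))
```

Because `B` and the input covariance are both positive semi-definite, so is their Kronecker product, which is why a valid `B` always yields a valid joint covariance.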

## Extensions

We’ll cover extensions to Gaussian processes including approximate
inference in non-Gaussian models, large data (Bui et al., 2017; Hensman
et al., n.d.), multiple output GPs (Álvarez et al., 2012), Bayesian
optimisation (Snoek et al., 2012) and Deep GPs (Damianou and Lawrence,
2013).

## Thanks!

For more information on these subjects and more you might want to check
the following resources.

-   twitter: [@lawrennd](https://twitter.com/lawrennd)
-   podcast: [The Talking Machines](http://thetalkingmachines.com)
-   newspaper: [Guardian Profile
    Page](http://www.theguardian.com/profile/neil-lawrence)
-   blog:
    [http://inverseprobability.com](http://inverseprobability.com/blog.html)

## References

Álvarez, M.A., Rosasco, L., Lawrence, N.D., 2012. Kernels for
vector-valued functions: A review. Foundations and Trends in Machine
Learning 4, 195–266. <https://doi.org/10.1561/2200000036>

Andrade-Pacheco, R., Mubangizi, M., Quinn, J., Lawrence, N.D., 2014.
Consistent mapping of government malaria records across a changing
territory delimitation. Malaria Journal 13.
<https://doi.org/10.1186/1475-2875-13-S1-P5>

Bochner, S., 1959. [Lectures on Fourier
integrals](http://books.google.co.uk/books?id=-vU02QewWK8C). Princeton
University Press.

Bui, T.D., Yan, J., Turner, R.E., 2017. [A unifying framework for
Gaussian process pseudo-point approximations using power expectation
propagation](http://jmlr.org/papers/v18/16-603.html). Journal of Machine
Learning Research 18, 1–72.

Cho, Y., Saul, L.K., 2009. [Kernel methods for deep
learning](http://papers.nips.cc/paper/3628-kernel-methods-for-deep-learning.pdf),
in: Bengio, Y., Schuurmans, D., Lafferty, J.D., Williams, C.K.I.,
Culotta, A. (Eds.), Advances in Neural Information Processing Systems
22. Curran Associates, Inc., pp. 342–350.

Damianou, A., Lawrence, N.D., 2013. Deep Gaussian processes. pp.
207–215.

Della Gatta, G., Bansal, M., Ambesi-Impiombato, A., Antonini, D.,
Missero, C., Bernardo, D. di, 2008. Direct targets of the TRP63
transcription factor revealed by a combination of gene expression
profiling and reverse engineering. Genome Research 18, 939–948.
<https://doi.org/10.1101/gr.073601.107>

Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin,
D.B., 2013. Bayesian data analysis, 3rd ed. Chapman and Hall.

Gething, P.W., Noor, A.M., Gikandi, P.W., Ogara, E.A.A., Hay, S.I.,
Nixon, M.S., Snow, R.W., Atkinson, P.M., 2006. Improving imperfect data
from health management information systems in Africa using space–time
geostatistics. PLoS Medicine 3.
<https://doi.org/10.1371/journal.pmed.0030271>

Hensman, J., Fusi, N., Lawrence, N.D., n.d. Gaussian processes for big
data.

Ioffe, S., Szegedy, C., 2015. [Batch normalization: Accelerating deep
network training by reducing internal covariate
shift](http://proceedings.mlr.press/v37/ioffe15.html), in: Bach, F.,
Blei, D. (Eds.), Proceedings of the 32nd International Conference on
Machine Learning, Proceedings of Machine Learning Research. PMLR, Lille,
France, pp. 448–456.

Kalaitzis, A.A., Lawrence, N.D., 2011. A simple approach to ranking
differentially expressed gene expression time courses through Gaussian
process regression. BMC Bioinformatics 12.
<https://doi.org/10.1186/1471-2105-12-180>

Laplace, P.S., 1814. Essai philosophique sur les probabilités, 2nd ed.
Courcier, Paris.

MacKay, D.J.C., 1992. Bayesian methods for adaptive models (PhD thesis).
California Institute of Technology.

McCulloch, W.S., Pitts, W., 1943. A logical calculus of the ideas
immanent in nervous activity. Bulletin of Mathematical Biophysics 5,
115–133. <https://doi.org/10.1007/BF02478259>

Mubangizi, M., Andrade-Pacheco, R., Smith, M.T., Quinn, J., Lawrence,
N.D., 2014. Malaria surveillance with multiple data sources using
Gaussian process models, in: 1st International Conference on the Use of
Mobile ICT in Africa.

Neal, R.M., 1994. Bayesian learning for neural networks (PhD thesis).
Dept. of Computer Science, University of Toronto.

Pearl, J., 1995. From Bayesian networks to causal networks, in:
Gammerman, A. (Ed.), Probabilistic Reasoning and Bayesian Belief
Networks. Alfred Waller, pp. 1–31.

Snoek, J., Larochelle, H., Adams, R.P., 2012. [Practical Bayesian
optimization of machine learning
algorithms](http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms.pdf),
in: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (Eds.),
Advances in Neural Information Processing Systems 25. Curran Associates,
Inc., pp. 2951–2959.

Steele, S., Bilchik, A., Eberhardt, J., Kalina, P., Nissan, A., Johnson,
E., Avital, I., Stojadinovic, A., 2012. Using machine-learned Bayesian
belief networks to predict perioperative risk of clostridium difficile
infection following colon surgery. Interact J Med Res 1, e6.
<https://doi.org/10.2196/ijmr.2131>

Tipping, M.E., Bishop, C.M., 1999. Probabilistic principal component
analysis. Journal of the Royal Statistical Society, B 61, 611–622.
<https://doi.org/10.1111/1467-9868.00196>