# Gaussian Processes
### [Neil D. Lawrence](http://inverseprobability.com), Amazon Cambridge and University of Sheffield
### 2019-01-09

**Abstract**: Classical machine learning and statistical approaches to learning, such
as neural networks and linear regression, assume a parametric form for
functions. Gaussian process models are an alternative approach that
assumes a probabilistic prior over functions. This brings benefits, in
that uncertainty of function estimation is sustained throughout
inference, and some challenges: algorithms for fitting Gaussian
processes tend to be more complex than parametric models. In these
sessions I will introduce Gaussian processes and explain why sustaining
uncertainty is important. We’ll then look at some extensions of Gaussian
process models, in particular composition of Gaussian processes, or deep
Gaussian processes.

$$
\newcommand{\Amatrix}{\mathbf{A}}
\newcommand{\KL}[2]{\text{KL}\left( #1\,\|\,#2 \right)}
\newcommand{\Kaast}{\kernelMatrix_{\mathbf{ \ast}\mathbf{ \ast}}}
\newcommand{\Kastu}{\kernelMatrix_{\mathbf{ \ast} \inducingVector}}
\newcommand{\Kff}{\kernelMatrix_{\mappingFunctionVector \mappingFunctionVector}}
\newcommand{\Kfu}{\kernelMatrix_{\mappingFunctionVector \inducingVector}}
\newcommand{\Kuast}{\kernelMatrix_{\inducingVector \bf\ast}}
\newcommand{\Kuf}{\kernelMatrix_{\inducingVector \mappingFunctionVector}}
\newcommand{\Kuu}{\kernelMatrix_{\inducingVector \inducingVector}}
\newcommand{\Kuui}{\Kuu^{-1}}
\newcommand{\Qaast}{\mathbf{Q}_{\bf \ast \ast}}
\newcommand{\Qastf}{\mathbf{Q}_{\ast \mappingFunction}}
\newcommand{\Qfast}{\mathbf{Q}_{\mappingFunctionVector \bf \ast}}
\newcommand{\Qff}{\mathbf{Q}_{\mappingFunctionVector \mappingFunctionVector}}
\newcommand{\aMatrix}{\mathbf{A}}
\newcommand{\aScalar}{a}
\newcommand{\aVector}{\mathbf{a}}
\newcommand{\acceleration}{a}
\newcommand{\bMatrix}{\mathbf{B}}
\newcommand{\bScalar}{b}
\newcommand{\bVector}{\mathbf{b}}
\newcommand{\basisFunc}{\phi}
\newcommand{\basisFuncVector}{\boldsymbol{ \basisFunc}}
\newcommand{\basisFunction}{\phi}
\newcommand{\basisLocation}{\mu}
\newcommand{\basisMatrix}{\boldsymbol{ \Phi}}
\newcommand{\basisScalar}{\basisFunction}
\newcommand{\basisVector}{\boldsymbol{ \basisFunction}}
\newcommand{\activationFunction}{\phi}
\newcommand{\activationMatrix}{\boldsymbol{ \Phi}}
\newcommand{\activationScalar}{\basisFunction}
\newcommand{\activationVector}{\boldsymbol{ \basisFunction}}
\newcommand{\bigO}{\mathcal{O}}
\newcommand{\binomProb}{\pi}
\newcommand{\cMatrix}{\mathbf{C}}
\newcommand{\cbasisMatrix}{\hat{\boldsymbol{ \Phi}}}
\newcommand{\cdataMatrix}{\hat{\dataMatrix}}
\newcommand{\cdataScalar}{\hat{\dataScalar}}
\newcommand{\cdataVector}{\hat{\dataVector}}
\newcommand{\centeredKernelMatrix}{\mathbf{ \MakeUppercase{\centeredKernelScalar}}}
\newcommand{\centeredKernelScalar}{b}
\newcommand{\centeredKernelVector}{\centeredKernelScalar}
\newcommand{\centeringMatrix}{\mathbf{H}}
\newcommand{\chiSquaredDist}[2]{\chi_{#1}^{2}\left(#2\right)}
\newcommand{\chiSquaredSamp}[1]{\chi_{#1}^{2}}
\newcommand{\conditionalCovariance}{\boldsymbol{ \Sigma}}
\newcommand{\coregionalizationMatrix}{\mathbf{B}}
\newcommand{\coregionalizationScalar}{b}
\newcommand{\coregionalizationVector}{\mathbf{ \coregionalizationScalar}}
\newcommand{\covDist}[2]{\text{cov}_{#2}\left(#1\right)}
\newcommand{\covSamp}[1]{\text{cov}\left(#1\right)}
\newcommand{\covarianceScalar}{c}
\newcommand{\covarianceVector}{\mathbf{ \covarianceScalar}}
\newcommand{\covarianceMatrix}{\mathbf{C}}
\newcommand{\covarianceMatrixTwo}{\boldsymbol{ \Sigma}}
\newcommand{\croupierScalar}{s}
\newcommand{\croupierVector}{\mathbf{ \croupierScalar}}
\newcommand{\croupierMatrix}{\mathbf{ \MakeUppercase{\croupierScalar}}}
\newcommand{\dataDim}{p}
\newcommand{\dataIndex}{i}
\newcommand{\dataIndexTwo}{j}
\newcommand{\dataMatrix}{\mathbf{Y}}
\newcommand{\dataScalar}{y}
\newcommand{\dataSet}{\mathcal{D}}
\newcommand{\dataStd}{\sigma}
\newcommand{\dataVector}{\mathbf{ \dataScalar}}
\newcommand{\decayRate}{d}
\newcommand{\degreeMatrix}{\mathbf{ \MakeUppercase{\degreeScalar}}}
\newcommand{\degreeScalar}{d}
\newcommand{\degreeVector}{\mathbf{ \degreeScalar}}
% Already defined by latex
%\newcommand{\det}[1]{\left|#1\right|}
\newcommand{\diag}[1]{\text{diag}\left(#1\right)}
\newcommand{\diagonalMatrix}{\mathbf{D}}
\newcommand{\diff}[2]{\frac{\text{d}#1}{\text{d}#2}}
\newcommand{\diffTwo}[2]{\frac{\text{d}^2#1}{\text{d}#2^2}}
\newcommand{\displacement}{x}
\newcommand{\displacementVector}{\textbf{\displacement}}
\newcommand{\distanceMatrix}{\mathbf{ \MakeUppercase{\distanceScalar}}}
\newcommand{\distanceScalar}{d}
\newcommand{\distanceVector}{\mathbf{ \distanceScalar}}
\newcommand{\eigenvaltwo}{\ell}
\newcommand{\eigenvaltwoMatrix}{\mathbf{L}}
\newcommand{\eigenvaltwoVector}{\mathbf{l}}
\newcommand{\eigenvalue}{\lambda}
\newcommand{\eigenvalueMatrix}{\boldsymbol{ \Lambda}}
\newcommand{\eigenvalueVector}{\boldsymbol{ \lambda}}
\newcommand{\eigenvector}{\mathbf{ \eigenvectorScalar}}
\newcommand{\eigenvectorMatrix}{\mathbf{U}}
\newcommand{\eigenvectorScalar}{u}
\newcommand{\eigenvectwo}{\mathbf{v}}
\newcommand{\eigenvectwoMatrix}{\mathbf{V}}
\newcommand{\eigenvectwoScalar}{v}
\newcommand{\entropy}[1]{\mathcal{H}\left(#1\right)}
\newcommand{\errorFunction}{E}
\newcommand{\expDist}[2]{\left<#1\right>_{#2}}
\newcommand{\expSamp}[1]{\left<#1\right>}
\newcommand{\expectation}[1]{\left\langle #1 \right\rangle }
\newcommand{\expectationDist}[2]{\left\langle #1 \right\rangle _{#2}}
\newcommand{\expectedDistanceMatrix}{\mathcal{D}}
\newcommand{\eye}{\mathbf{I}}
\newcommand{\fantasyDim}{r}
\newcommand{\fantasyMatrix}{\mathbf{ \MakeUppercase{\fantasyScalar}}}
\newcommand{\fantasyScalar}{z}
\newcommand{\fantasyVector}{\mathbf{ \fantasyScalar}}
\newcommand{\featureStd}{\varsigma}
\newcommand{\gammaCdf}[3]{\mathcal{GAMMA CDF}\left(#1|#2,#3\right)}
\newcommand{\gammaDist}[3]{\mathcal{G}\left(#1|#2,#3\right)}
\newcommand{\gammaSamp}[2]{\mathcal{G}\left(#1,#2\right)}
\newcommand{\gaussianDist}[3]{\mathcal{N}\left(#1|#2,#3\right)}
\newcommand{\gaussianSamp}[2]{\mathcal{N}\left(#1,#2\right)}
\newcommand{\given}{|}
\newcommand{\half}{\frac{1}{2}}
\newcommand{\heaviside}{H}
\newcommand{\hiddenMatrix}{\mathbf{ \MakeUppercase{\hiddenScalar}}}
\newcommand{\hiddenScalar}{h}
\newcommand{\hiddenVector}{\mathbf{ \hiddenScalar}}
\newcommand{\identityMatrix}{\eye}
\newcommand{\inducingInputScalar}{z}
\newcommand{\inducingInputVector}{\mathbf{ \inducingInputScalar}}
\newcommand{\inducingInputMatrix}{\mathbf{Z}}
\newcommand{\inducingScalar}{u}
\newcommand{\inducingVector}{\mathbf{ \inducingScalar}}
\newcommand{\inducingMatrix}{\mathbf{U}}
\newcommand{\inlineDiff}[2]{\text{d}#1/\text{d}#2}
\newcommand{\inputDim}{q}
\newcommand{\inputMatrix}{\mathbf{X}}
\newcommand{\inputScalar}{x}
\newcommand{\inputSpace}{\mathcal{X}}
\newcommand{\inputVals}{\inputVector}
\newcommand{\inputVector}{\mathbf{ \inputScalar}}
\newcommand{\iterNum}{k}
\newcommand{\kernel}{\kernelScalar}
\newcommand{\kernelMatrix}{\mathbf{K}}
\newcommand{\kernelScalar}{k}
\newcommand{\kernelVector}{\mathbf{ \kernelScalar}}
\newcommand{\kff}{\kernelScalar_{\mappingFunction \mappingFunction}}
\newcommand{\kfu}{\kernelVector_{\mappingFunction \inducingScalar}}
\newcommand{\kuf}{\kernelVector_{\inducingScalar \mappingFunction}}
\newcommand{\kuu}{\kernelVector_{\inducingScalar \inducingScalar}}
\newcommand{\lagrangeMultiplier}{\lambda}
\newcommand{\lagrangeMultiplierMatrix}{\boldsymbol{ \Lambda}}
\newcommand{\lagrangian}{L}
\newcommand{\laplacianFactor}{\mathbf{ \MakeUppercase{\laplacianFactorScalar}}}
\newcommand{\laplacianFactorScalar}{m}
\newcommand{\laplacianFactorVector}{\mathbf{ \laplacianFactorScalar}}
\newcommand{\laplacianMatrix}{\mathbf{L}}
\newcommand{\laplacianScalar}{\ell}
\newcommand{\laplacianVector}{\mathbf{ \ell}}
\newcommand{\latentDim}{q}
\newcommand{\latentDistanceMatrix}{\boldsymbol{ \Delta}}
\newcommand{\latentDistanceScalar}{\delta}
\newcommand{\latentDistanceVector}{\boldsymbol{ \delta}}
\newcommand{\latentForce}{f}
\newcommand{\latentFunction}{u}
\newcommand{\latentFunctionVector}{\mathbf{ \latentFunction}}
\newcommand{\latentFunctionMatrix}{\mathbf{ \MakeUppercase{\latentFunction}}}
\newcommand{\latentIndex}{j}
\newcommand{\latentScalar}{z}
\newcommand{\latentVector}{\mathbf{ \latentScalar}}
\newcommand{\latentMatrix}{\mathbf{Z}}
\newcommand{\learnRate}{\eta}
\newcommand{\lengthScale}{\ell}
\newcommand{\rbfWidth}{\ell}
\newcommand{\likelihoodBound}{\mathcal{L}}
\newcommand{\likelihoodFunction}{L}
\newcommand{\locationScalar}{\mu}
\newcommand{\locationVector}{\boldsymbol{ \locationScalar}}
\newcommand{\locationMatrix}{\mathbf{M}}
\newcommand{\variance}[1]{\text{var}\left( #1 \right)}
\newcommand{\mappingFunction}{f}
\newcommand{\mappingFunctionMatrix}{\mathbf{F}}
\newcommand{\mappingFunctionTwo}{g}
\newcommand{\mappingFunctionTwoMatrix}{\mathbf{G}}
\newcommand{\mappingFunctionTwoVector}{\mathbf{ \mappingFunctionTwo}}
\newcommand{\mappingFunctionVector}{\mathbf{ \mappingFunction}}
\newcommand{\scaleScalar}{s}
\newcommand{\mappingScalar}{w}
\newcommand{\mappingVector}{\mathbf{ \mappingScalar}}
\newcommand{\mappingMatrix}{\mathbf{W}}
\newcommand{\mappingScalarTwo}{v}
\newcommand{\mappingVectorTwo}{\mathbf{ \mappingScalarTwo}}
\newcommand{\mappingMatrixTwo}{\mathbf{V}}
\newcommand{\maxIters}{K}
\newcommand{\meanMatrix}{\mathbf{M}}
\newcommand{\meanScalar}{\mu}
\newcommand{\meanTwoMatrix}{\mathbf{M}}
\newcommand{\meanTwoScalar}{m}
\newcommand{\meanTwoVector}{\mathbf{ \meanTwoScalar}}
\newcommand{\meanVector}{\boldsymbol{ \meanScalar}}
\newcommand{\mrnaConcentration}{m}
\newcommand{\naturalFrequency}{\omega}
\newcommand{\neighborhood}[1]{\mathcal{N}\left( #1 \right)}
\newcommand{\neilurl}{http://inverseprobability.com/}
\newcommand{\noiseMatrix}{\boldsymbol{ E}}
\newcommand{\noiseScalar}{\epsilon}
\newcommand{\noiseVector}{\boldsymbol{ \epsilon}}
\newcommand{\norm}[1]{\left\Vert #1 \right\Vert}
\newcommand{\normalizedLaplacianMatrix}{\hat{\mathbf{L}}}
\newcommand{\normalizedLaplacianScalar}{\hat{\ell}}
\newcommand{\normalizedLaplacianVector}{\hat{\mathbf{ \ell}}}
\newcommand{\numActive}{m}
\newcommand{\numBasisFunc}{m}
\newcommand{\numComponents}{m}
\newcommand{\numComps}{K}
\newcommand{\numData}{n}
\newcommand{\numFeatures}{K}
\newcommand{\numHidden}{h}
\newcommand{\numInducing}{m}
\newcommand{\numLayers}{\ell}
\newcommand{\numNeighbors}{K}
\newcommand{\numSequences}{s}
\newcommand{\numSuccess}{s}
\newcommand{\numTasks}{m}
\newcommand{\numTime}{T}
\newcommand{\numTrials}{S}
\newcommand{\outputIndex}{j}
\newcommand{\paramVector}{\boldsymbol{ \theta}}
\newcommand{\parameterMatrix}{\boldsymbol{ \Theta}}
\newcommand{\parameterScalar}{\theta}
\newcommand{\parameterVector}{\boldsymbol{ \parameterScalar}}
\newcommand{\partDiff}[2]{\frac{\partial#1}{\partial#2}}
\newcommand{\precisionScalar}{j}
\newcommand{\precisionVector}{\mathbf{ \precisionScalar}}
\newcommand{\precisionMatrix}{\mathbf{J}}
\newcommand{\pseudotargetScalar}{\widetilde{y}}
\newcommand{\pseudotargetVector}{\mathbf{ \pseudotargetScalar}}
\newcommand{\pseudotargetMatrix}{\mathbf{ \widetilde{Y}}}
\newcommand{\rank}[1]{\text{rank}\left(#1\right)}
\newcommand{\rayleighDist}[2]{\mathcal{R}\left(#1|#2\right)}
\newcommand{\rayleighSamp}[1]{\mathcal{R}\left(#1\right)}
\newcommand{\responsibility}{r}
\newcommand{\rotationScalar}{r}
\newcommand{\rotationVector}{\mathbf{ \rotationScalar}}
\newcommand{\rotationMatrix}{\mathbf{R}}
\newcommand{\sampleCovScalar}{s}
\newcommand{\sampleCovVector}{\mathbf{ \sampleCovScalar}}
\newcommand{\sampleCovMatrix}{\mathbf{s}}
\newcommand{\scalarProduct}[2]{\left\langle{#1},{#2}\right\rangle}
\newcommand{\sign}[1]{\text{sign}\left(#1\right)}
\newcommand{\sigmoid}[1]{\sigma\left(#1\right)}
\newcommand{\singularvalue}{\ell}
\newcommand{\singularvalueMatrix}{\mathbf{L}}
\newcommand{\singularvalueVector}{\mathbf{l}}
\newcommand{\sorth}{\mathbf{u}}
\newcommand{\spar}{\lambda}
\newcommand{\trace}[1]{\text{tr}\left(#1\right)}
\newcommand{\BasalRate}{B}
\newcommand{\DampingCoefficient}{C}
\newcommand{\DecayRate}{D}
\newcommand{\Displacement}{X}
\newcommand{\LatentForce}{F}
\newcommand{\Mass}{M}
\newcommand{\Sensitivity}{S}
\newcommand{\basalRate}{b}
\newcommand{\dampingCoefficient}{c}
\newcommand{\mass}{m}
\newcommand{\sensitivity}{s}
\newcommand{\springScalar}{\kappa}
\newcommand{\springVector}{\boldsymbol{ \kappa}}
\newcommand{\springMatrix}{\boldsymbol{ \mathcal{K}}}
\newcommand{\tfConcentration}{p}
\newcommand{\tfDecayRate}{\delta}
\newcommand{\tfMrnaConcentration}{f}
\newcommand{\tfVector}{\mathbf{ \tfConcentration}}
\newcommand{\velocity}{v}
\newcommand{\sufficientStatsScalar}{g}
\newcommand{\sufficientStatsVector}{\mathbf{ \sufficientStatsScalar}}
\newcommand{\sufficientStatsMatrix}{\mathbf{G}}
\newcommand{\switchScalar}{s}
\newcommand{\switchVector}{\mathbf{ \switchScalar}}
\newcommand{\switchMatrix}{\mathbf{S}}
\newcommand{\tr}[1]{\text{tr}\left(#1\right)}
\newcommand{\loneNorm}[1]{\left\Vert #1 \right\Vert_1}
\newcommand{\ltwoNorm}[1]{\left\Vert #1 \right\Vert_2}
\newcommand{\onenorm}[1]{\left\vert#1\right\vert_1}
\newcommand{\twonorm}[1]{\left\Vert #1 \right\Vert}
\newcommand{\vScalar}{v}
\newcommand{\vVector}{\mathbf{v}}
\newcommand{\vMatrix}{\mathbf{V}}
\newcommand{\varianceDist}[2]{\text{var}_{#2}\left( #1 \right)}
% Already defined by latex
%\newcommand{\vec}{#1:}
\newcommand{\vecb}[1]{\left(#1\right):}
\newcommand{\weightScalar}{w}
\newcommand{\weightVector}{\mathbf{ \weightScalar}}
\newcommand{\weightMatrix}{\mathbf{W}}
\newcommand{\weightedAdjacencyMatrix}{\mathbf{A}}
\newcommand{\weightedAdjacencyScalar}{a}
\newcommand{\weightedAdjacencyVector}{\mathbf{ \weightedAdjacencyScalar}}
\newcommand{\onesVector}{\mathbf{1}}
\newcommand{\zerosVector}{\mathbf{0}}
$$


## What is Machine Learning?

### What is Machine Learning?

. . .

$$ \text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}$$

. . .

-   **data** : observations, could be actively or passively acquired
    (meta-data).

. . .

-   **model** : assumptions, based on previous experience (other data!
    transfer learning etc), or beliefs about the regularities of the
    universe. Inductive bias.

. . .

-   **prediction** : an action to be taken or a categorization or a
    quality score.

. . .

-   Royal Society Report: [Machine Learning: Power and Promise of
    Computers that Learn by
    Example](https://royalsociety.org/~/media/policy/projects/machine-learning/publications/machine-learning-report.pdf)

### What is Machine Learning?

$$\text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction}$$

> -   To combine data with a model we need:
> -   **a prediction function** $\mappingFunction(\cdot)$ includes our
>     beliefs about the regularities of the universe
> -   **an objective function** $\errorFunction(\cdot)$ defines the cost
>     of misprediction.

### Artificial Intelligence

-   Machine learning is a mainstay because of the importance of prediction.

### Uncertainty

-   Uncertainty in prediction arises from:
    -   scarcity of training data and
    -   mismatch between the set of prediction functions we choose and
        all possible prediction functions.
-   There are also uncertainties in the objective; we leave those for
    another day.

### Neural Networks and Prediction Functions

-   adaptive non-linear function models inspired by simple neuron models
    [@McCulloch:neuron43]

-   have become popular because of their ability to model data.

-   can be composed to form highly complex functions

-   start by focussing on one hidden layer

### Prediction Function of One Hidden Layer

$$
\mappingFunction(\inputVector) = \left.\mappingVector^{(2)}\right.^\top \activationVector(\mappingMatrix_{1}, \inputVector)
$$

$\mappingFunction(\cdot)$ is a scalar function with vector inputs,

$\activationVector(\cdot)$ is a vector function with vector inputs.

-   dimensionality of the vector function is known as the number of
    hidden units, or the number of neurons.

-   elements of $\activationVector(\cdot)$ are the *activation* function
    of the neural network

-   elements of $\mappingMatrix_{1}$ are the parameters of the
    activation functions.
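
A minimal sketch of the one hidden layer prediction function, assuming a `tanh` activation and arbitrary sizes (neither is specified on the slide). The names `activation` and `prediction` are illustrative only.

```python
import numpy as np

def activation(W1, x):
    """Vector of hidden-unit activations, one per row of W1 (tanh is an assumed choice)."""
    return np.tanh(W1 @ x)

def prediction(w2, W1, x):
    """Scalar prediction f(x) = w2^T phi(W1, x)."""
    return w2 @ activation(W1, x)

rng = np.random.default_rng(0)
num_hidden, input_dim = 5, 3                   # assumed sizes
W1 = rng.normal(size=(num_hidden, input_dim))  # parameters of the activation functions
w2 = rng.normal(size=num_hidden)               # output-layer weights
x = rng.normal(size=input_dim)                 # a single input vector

phi = activation(W1, x)    # vector function: one entry per hidden unit
f = prediction(w2, W1, x)  # scalar function of a vector input
```

The length of `phi` is the number of hidden units, and `f` is a single number.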

### Relations with Classical Statistics

-   In statistics activation functions are known as *basis functions*.

-   A statistician would think of this as a *linear model*: the
    predictions are not linear in the inputs, but the model is linear in
    the parameters.

-   $\mappingMatrix_{1}$ are *static* parameters.

### Adaptive Basis Functions

-   In machine learning we optimize $\mappingMatrix_{1}$ as well as
    $\mappingMatrix_{2}$ (which would normally be denoted in statistics
    by $\boldsymbol{\beta}$).

-   This tutorial: revisit that decision: follow the path of
    @Neal:bayesian94 and @MacKay:bayesian92.

-   Consider the probabilistic approach.

### Probabilistic Modelling

-   Probabilistically we want, $$
    p(\dataScalar_*|\dataVector, \inputMatrix, \inputVector_*),
    $$ where $\dataScalar_*$ is a test output, $\inputVector_*$ is a test
    input, $\inputMatrix$ is a training input matrix and $\dataVector$
    is the vector of training outputs.

### Joint Model of World

$$
p(\dataScalar_*|\dataVector, \inputMatrix, \inputVector_*) = \int p(\dataScalar_*|\inputVector_*, \mappingMatrix) p(\mappingMatrix | \dataVector, \inputMatrix) \text{d} \mappingMatrix
$$

. . .

$\mappingMatrix$ contains $\mappingMatrix_1$ and $\mappingMatrix_2$

$p(\mappingMatrix | \dataVector, \inputMatrix)$ is the posterior density

### Likelihood

$p(\dataScalar|\inputVector, \mappingMatrix)$ is the *likelihood* of a
data point

. . .

Normally assume independence: $$
p(\dataVector|\inputMatrix, \mappingMatrix) = \prod_{i=1}^\numData p(\dataScalar_i|\inputVector_i, \mappingMatrix).$$

### Likelihood and Prediction Function

$$
p(\dataScalar_i | \mappingFunction(\inputVector_i)) = \frac{1}{\sqrt{2\pi \dataStd^2}} \exp\left(-\frac{\left(\dataScalar_i - \mappingFunction(\inputVector_i)\right)^2}{2\dataStd^2}\right)
$$
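
As a sketch of how the independence assumption and the Gaussian likelihood combine, the log likelihood of a data set is a sum of per-point Gaussian log densities. The function name and the example values are assumptions for illustration.

```python
import numpy as np

def gaussian_log_likelihood(y, f, sigma2):
    """Sum of independent Gaussian log densities log N(y_i | f_i, sigma2)."""
    y, f = np.asarray(y, dtype=float), np.asarray(f, dtype=float)
    n = y.size
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * np.sum((y - f) ** 2) / sigma2

y = np.array([0.1, -0.4, 0.7])   # observed outputs (made-up values)
f = np.array([0.0, -0.5, 1.0])   # prediction function evaluated at the inputs
log_lik = gaussian_log_likelihood(y, f, sigma2=0.25)
```

Working with the log avoids numerical underflow when the product runs over many data points.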

### Unsupervised Learning

-   Can also consider priors over latents $$
    p(\dataVector_*|\dataVector) = \int p(\dataVector_*|\inputMatrix_*, \mappingMatrix) p(\mappingMatrix | \dataVector, \inputMatrix) p(\inputMatrix) p(\inputMatrix_*) \text{d} \mappingMatrix \text{d} \inputMatrix \text{d}\inputMatrix_*
    $$

-   This gives *unsupervised learning*.

### Probabilistic Inference

-   Data: $\dataVector$

-   Model: $p(\dataVector, \dataVector_*)$

-   Prediction: $p(\dataVector_*| \dataVector)$

### Graphical Models

-   Represent joint distribution through *conditional dependencies*.
-   E.g. Markov chain

$$p(\dataVector) = p(\dataScalar_\numData | \dataScalar_{\numData-1}) p(\dataScalar_{\numData-1}|\dataScalar_{\numData-2}) \dots p(\dataScalar_{2} | \dataScalar_{1}) p(\dataScalar_{1})$$

In [None]:
import daft
from matplotlib import rc

rc("font", **{'family':'sans-serif','sans-serif':['Helvetica']}, size=30)
rc("text", usetex=True)

In [None]:
pgm = daft.PGM(shape=[3, 1],
               origin=[0, 0],
               grid_unit=5,
               node_unit=1.9,
               observed_style='shaded',
               line_width=3)


pgm.add_node(daft.Node("y_1", r"$y_1$", 0.5, 0.5, fixed=False))
pgm.add_node(daft.Node("y_2", r"$y_2$", 1.5, 0.5, fixed=False))
pgm.add_node(daft.Node("y_3", r"$y_3$", 2.5, 0.5, fixed=False))
pgm.add_edge("y_1", "y_2")
pgm.add_edge("y_2", "y_3")

pgm.render().figure.savefig("../slides/diagrams/ml/markov.svg", transparent=True)

<img src="../slides/diagrams/ml/markov.svg" class="" align="" style="">

### 

Predict Perioperative Risk of Clostridium Difficile Infection Following
Colon Surgery [@Steele:predictive12]

<div style="text-align:center">

<img class="negate" src="../slides/diagrams/bayes-net-diagnosis.png" width="50%" height="auto" align="center" style="background:none; border:none; box-shadow:none;">

</div>

### Performing Inference

-   Easy to write in probabilities

-   But underlying this is a wealth of computational challenges.

-   High dimensional integrals typically require approximation.

### Linear Models

-   Statistics has focussed more on the *linear* model implied by $$
      \mappingFunction(\inputVector) = \left.\mappingVector^{(2)}\right.^\top \activationVector(\mappingMatrix_1, \inputVector)
      $$

-   Hold $\mappingMatrix_1$ fixed for a given analysis.

-   Gaussian prior for $\mappingVector^{(2)}$, $$
      \mappingVector^{(2)} \sim \gaussianSamp{\zerosVector}{\covarianceMatrix},
      $$ with observations $$
      \dataScalar_i = \mappingFunction(\inputVector_i) + \noiseScalar_i,
      $$ where $$
      \noiseScalar_i \sim \gaussianSamp{0}{\dataStd^2}.
      $$

### Linear Gaussian Models

-   Normally integrals are complex but for this Gaussian linear case
    they are trivial.

### Multivariate Gaussian Properties

### Recall Univariate Gaussian Properties

. . .

1.  Sum of independent Gaussian variables is also Gaussian.

$$\dataScalar_i \sim \gaussianSamp{\meanScalar_i}{\dataStd_i^2}$$

. . .

$$\sum_{i=1}^{\numData} \dataScalar_i \sim \gaussianSamp{\sum_{i=1}^\numData \meanScalar_i}{\sum_{i=1}^\numData\dataStd_i^2}$$

### Recall Univariate Gaussian Properties

2.  Scaling a Gaussian leads to a Gaussian.

. . .

$$\dataScalar \sim \gaussianSamp{\meanScalar}{\dataStd^2}$$

. . .

$$\mappingScalar\dataScalar\sim \gaussianSamp{\mappingScalar\meanScalar}{\mappingScalar^2 \dataStd^2}$$
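
Both univariate properties can be checked empirically. The sketch below draws independent Gaussian samples with made-up means and standard deviations and compares sample moments with the formulas above. It is a Monte Carlo illustration, not a proof.

```python
import numpy as np

rng = np.random.default_rng(1)
means = np.array([1.0, -2.0, 0.5])   # assumed mu_i
stds = np.array([0.5, 1.0, 2.0])     # assumed sigma_i
num_samples = 200_000

# Each column holds samples of one independent Gaussian y_i.
samples = rng.normal(means, stds, size=(num_samples, 3))

# Property 1: the sum has mean sum(mu_i) and variance sum(sigma_i^2).
total = samples.sum(axis=1)
sum_mean, sum_var = total.mean(), total.var()

# Property 2: scaling by w gives mean w * mu and variance w^2 * sigma^2.
w = 3.0
scaled = w * samples[:, 0]
scaled_mean, scaled_var = scaled.mean(), scaled.var()
```

Here the formulas give a mean of $-0.5$ and variance $5.25$ for the sum, and a mean of $3$ and variance $2.25$ for the scaled variable. The sample estimates should be close to these values.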

### Multivariate Consequence

[If]{style="text-align:left"}
$$\inputVector \sim \gaussianSamp{\meanVector}{\covarianceMatrix}$$

. . .

[And]{style="text-align:left"}
$$\dataVector= \mappingMatrix\inputVector$$

. . .

[Then]{style="text-align:left"}
$$\dataVector \sim \gaussianSamp{\mappingMatrix\meanVector}{\mappingMatrix\covarianceMatrix\mappingMatrix^\top}$$

### Linear Gaussian Models

1.  Linear Gaussian models are easier to deal with.
2.  Even the parameters *within* the process can be handled, by
    considering a particular limit.

### Multivariate Gaussian Properties

-   If $$
    \dataVector = \mappingMatrix \inputVector + \noiseVector,
    $$

-   Assume $$\begin{align}
    \inputVector & \sim \gaussianSamp{\meanVector}{\covarianceMatrix}\\
    \noiseVector & \sim \gaussianSamp{\zerosVector}{\covarianceMatrixTwo}
    \end{align}$$

-   Then $$
    \dataVector \sim \gaussianSamp{\mappingMatrix\meanVector}{\mappingMatrix\covarianceMatrix\mappingMatrix^\top + \covarianceMatrixTwo}.
    $$ If $\covarianceMatrixTwo=\dataStd^2\eye$, this is Probabilistic
    Principal Component Analysis [@Tipping:probpca99], because we
    integrated out the inputs (which would be called *latent* variables
    in that case).
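
The multivariate result can be checked by simulation. This sketch uses arbitrary, assumed values for the mean, the two covariances and the mapping matrix, and compares the sample mean and covariance of $\dataVector$ with the closed form.

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([1.0, -1.0])                            # mean of x (assumed)
C = np.array([[1.0, 0.3], [0.3, 0.5]])                # covariance of x (assumed)
W = np.array([[1.0, 2.0], [0.0, 1.0], [-1.0, 0.5]])   # mapping matrix (assumed)
Sigma = 0.1 * np.eye(3)                               # noise covariance (assumed)

num_samples = 200_000
x = rng.multivariate_normal(mu, C, size=num_samples)
eps = rng.multivariate_normal(np.zeros(3), Sigma, size=num_samples)
y = x @ W.T + eps                                     # y = W x + eps, one sample per row

predicted_mean = W @ mu
predicted_cov = W @ C @ W.T + Sigma
empirical_mean = y.mean(axis=0)
empirical_cov = np.cov(y, rowvar=False)
```

No integral is ever computed numerically: the marginal over $\inputVector$ is read off directly from the linear-Gaussian rule.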

### Non-linear in the Inputs

-   Set each activation function computed at each data point to be

$$
\activationScalar_{i,j} = \activationScalar(\mappingVector^{(1)}_{j}, \inputVector_{i})
$$

-   Define the *design matrix*

$$
\activationMatrix = 
\begin{bmatrix}
\activationScalar_{1, 1} & \activationScalar_{1, 2} & \dots & \activationScalar_{1, \numHidden} \\
\activationScalar_{2, 1} & \activationScalar_{2, 2} & \dots & \activationScalar_{2, \numHidden} \\
\vdots & \vdots & \ddots & \vdots \\
\activationScalar_{\numData, 1} & \activationScalar_{\numData, 2} & \dots & \activationScalar_{\numData, \numHidden}
\end{bmatrix}.
$$
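
A sketch of building the design matrix, again assuming `tanh` activations of the form $\tanh(\mappingVector_j^{(1)\top}\inputVector_i)$ (the slides leave the activation unspecified). Rows index data points and columns index hidden units.

```python
import numpy as np

def design_matrix(X, W1, activation=np.tanh):
    """Return Phi with Phi[i, j] = activation(w_j^T x_i); shape (n, h)."""
    return activation(X @ W1.T)

rng = np.random.default_rng(3)
n, q, h = 6, 2, 4                 # data points, input dimension, hidden units (assumed)
X = rng.normal(size=(n, q))       # training inputs, one per row
W1 = rng.normal(size=(h, q))      # one parameter vector per hidden unit
Phi = design_matrix(X, W1)
```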

### Matrix Representation of a Neural Network

$$\dataScalar\left(\inputVector\right) = \activationVector\left(\inputVector\right)^\top \mappingVector + \noiseScalar$$

. . .

$$\dataVector = \activationMatrix\mappingVector + \noiseVector$$

. . .

$$\noiseVector \sim \gaussianSamp{\zerosVector}{\dataStd^2\eye}$$

### Prior Density

-   Define

$$
\mappingVector \sim \gaussianSamp{\zerosVector}{\alpha\eye},
$$

-   Use the rules of multivariate Gaussians to see that,

$$
\dataVector \sim \gaussianSamp{\zerosVector}{\alpha \activationMatrix \activationMatrix^\top + \dataStd^2 \eye}.
$$

$$
\kernelMatrix = \alpha \activationMatrix \activationMatrix^\top + \dataStd^2 \eye.
$$

### Joint Gaussian Density

-   Elements are a function
    $\kernel_{i,j} = \kernel\left(\inputVector_i, \inputVector_j\right)$

$$
\kernelMatrix = \alpha \activationMatrix \activationMatrix^\top + \dataStd^2 \eye.
$$

### Covariance Function

$$
\kernel_\mappingFunction\left(\inputVector_i, \inputVector_j\right) = \alpha \activationVector\left(\mappingMatrix_1, \inputVector_i\right)^\top \activationVector\left(\mappingMatrix_1, \inputVector_j\right)
$$

-   formed by inner products of the rows of the *design matrix*.
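
The sketch below builds $\kernelMatrix$ from a design matrix and checks it against samples of $\dataVector = \activationMatrix\mappingVector + \noiseVector$. The `tanh` basis and the values of $\alpha$ and $\dataStd^2$ are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
n, h = 8, 3
alpha, sigma2 = 2.0, 0.01             # prior variance and noise variance (assumed)
X = np.linspace(-1, 1, n)[:, None]
W1 = rng.normal(size=(h, 1))
Phi = np.tanh(X @ W1.T)               # design matrix, shape (n, h)

K_f = alpha * Phi @ Phi.T             # covariance function evaluated at all input pairs
K = K_f + sigma2 * np.eye(n)          # covariance of the observations y

# Monte Carlo check: draw w ~ N(0, alpha I) and noise ~ N(0, sigma2 I).
num_samples = 100_000
w = rng.normal(scale=np.sqrt(alpha), size=(num_samples, h))
noise = rng.normal(scale=np.sqrt(sigma2), size=(num_samples, n))
Y = w @ Phi.T + noise                 # each row is one sample of y
empirical_K = Y.T @ Y / num_samples   # sample second moment (the mean is zero)
```

Each entry `K_f[i, j]` is $\alpha$ times the inner product of rows `i` and `j` of `Phi`.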

### Gaussian Process

-   Instead of making assumptions about our density over each data
    point, $\dataScalar_i$ as i.i.d.

-   make a joint Gaussian assumption over our data.

-   covariance matrix is now a function of both the parameters of the
    activation function, $\mappingMatrix_1$, and the input variables,
    $\inputMatrix$.

-   Arises from integrating out $\mappingVector^{(2)}$.

### Basis Functions

-   Can be very complex, such as deep kernels, [@Cho:deep09] or could
    even put a convolutional neural network inside.
-   Viewing a neural network in this way is also what allows us to
    perform sensible *batch* normalizations [@Ioffe:batch15].

### Non-degenerate Gaussian Processes

-   This process is *degenerate*.

-   Covariance function is of rank at most $\numHidden$.

-   Once $\numData > \numHidden$ (and so as
    $\numData \rightarrow \infty$), the covariance matrix is not full
    rank.

-   Leading to $\det{\kernelMatrix} = 0$
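
The degeneracy is easy to see numerically. In this sketch (assumed sizes, `tanh` basis) there are more data points than hidden units, so the noise-free covariance $\alpha\activationMatrix\activationMatrix^\top$ has rank at most $\numHidden$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, h, alpha = 10, 3, 1.0          # more data points than hidden units (assumed sizes)
X = rng.normal(size=(n, 2))
W1 = rng.normal(size=(h, 2))
Phi = np.tanh(X @ W1.T)
K_f = alpha * Phi @ Phi.T         # n x n covariance without the noise term

rank = np.linalg.matrix_rank(K_f)
eigenvalues = np.linalg.eigvalsh(K_f)
determinant = np.linalg.det(K_f)
```

At most `h` eigenvalues are non-zero up to rounding error, and the determinant is numerically zero.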

### Infinite Networks

-   In ML Radford Neal [@Neal:bayesian94] asked "what would happen if
    you took $\numHidden \rightarrow \infty$?"

[<img class="" src="../slides/diagrams/neal-infinite-priors.png" width="80%" height="auto" align="" style="background:none; border:none; box-shadow:none;">](http://www.cs.toronto.edu/~radford/ftp/thesis.pdf)

*Page 37 of Radford Neal's 1994 thesis*

### Roughly Speaking

-   Instead of

$$
  \begin{align*}
  \kernel_\mappingFunction\left(\inputVector_i, \inputVector_j\right) & = \alpha \activationVector\left(\mappingMatrix_1, \inputVector_i\right)^\top \activationVector\left(\mappingMatrix_1, \inputVector_j\right)\\
  & = \alpha \sum_k \activationScalar\left(\mappingVector^{(1)}_k, \inputVector_i\right) \activationScalar\left(\mappingVector^{(1)}_k, \inputVector_j\right)
  \end{align*}
  $$

-   Sample infinitely many hidden units from a prior density,
    $p(\mappingVector^{(1)})$,

$$
\kernel_\mappingFunction\left(\inputVector_i, \inputVector_j\right) = \alpha \int \activationScalar\left(\mappingVector^{(1)}, \inputVector_i\right) \activationScalar\left(\mappingVector^{(1)}, \inputVector_j\right) p(\mappingVector^{(1)}) \text{d}\mappingVector^{(1)}
$$

-   Also applies for non-Gaussian $p(\mappingVector^{(1)})$ because of
    the *central limit theorem*.
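
A rough Monte Carlo illustration of the integral: sample many hidden units from a prior and average their products. The standard normal prior, the `tanh` activation and the division by the number of hidden units (so that the sum becomes an average with a finite limit) are all assumptions of this sketch. The slide leaves the scaling implicit.

```python
import numpy as np

def mc_kernel(X, num_hidden, alpha=1.0, seed=0):
    """Monte Carlo estimate of alpha * E_w[phi(w, x_i) phi(w, x_j)] with w ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(size=(num_hidden, X.shape[1]))  # one sampled parameter vector per unit
    Phi = np.tanh(X @ W1.T)
    # Dividing by num_hidden turns the sum over units into an average.
    return alpha * Phi @ Phi.T / num_hidden

X = np.linspace(-2, 2, 5)[:, None]
K_small = mc_kernel(X, num_hidden=10, seed=0)          # noisy estimate
K_large_a = mc_kernel(X, num_hidden=200_000, seed=1)   # two independent large estimates
K_large_b = mc_kernel(X, num_hidden=200_000, seed=2)
```

With many hidden units the two independent estimates agree closely, which is what we expect if both are converging to the same covariance function.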

### Simple Probabilistic Program

-   If $$
      \begin{align*} 
      \mappingVector^{(1)} & \sim p(\cdot)\\ \activationScalar_i & = \activationScalar\left(\mappingVector^{(1)}, \inputVector_i\right)
      \end{align*}
      $$ has finite variance,

-   then, taking the number of hidden units to infinity, the result is
    also a Gaussian process.

### Further Reading

-   Chapter 2 of Neal's thesis [@Neal:bayesian94]

-   Rest of Neal's thesis. [@Neal:bayesian94]

-   David MacKay's PhD thesis [@MacKay:bayesian92]

In [None]:
import numpy as np
np.random.seed(10)
import teaching_plots as plot
from mlai import Kernel, exponentiated_quadratic

In [None]:
kernel = Kernel(function=exponentiated_quadratic, lengthscale=0.25)
plot.rejection_samples(kernel.K,
                       lengthscale=0.25, 
                       diagrams='../slides/diagrams/gp')

In [None]:
import pods
from ipywidgets import IntSlider

In [None]:
pods.notebook.display_plots('gp_rejection_sample{sample:0>3}.svg', 
                            directory='../slides/diagrams/gp', 
                            sample=IntSlider(1,1,5,1))

###  {#section-1 data-transition="none"}

<img src="../slides/diagrams/gp/gp_rejection_sample001.svg" class="" align="" style="">

###  {#section-2 data-transition="none"}

<img src="../slides/diagrams/gp/gp_rejection_sample002.svg" class="" align="" style="">

###  {#section-3 data-transition="none"}

<img src="../slides/diagrams/gp/gp_rejection_sample003.svg" class="" align="" style="">

###  {#section-4 data-transition="none"}

<img src="../slides/diagrams/gp/gp_rejection_sample004.svg" class="" align="" style="">

###  {#section-5 data-transition="none"}

<img src="../slides/diagrams/gp/gp_rejection_sample005.svg" class="" align="" style="">

### Distributions over Functions

In [None]:
import numpy as np
np.random.seed(4949)

### Sampling a Function

**Multi-variate Gaussians**

-   We will consider a Gaussian with a particular structure of
    covariance matrix.
-   Generate a single sample from this 25-dimensional Gaussian density,
    $$
    \mappingFunctionVector=\left[\mappingFunction_{1},\mappingFunction_{2},\dots, \mappingFunction_{25}\right].
    $$
-   We will plot these points against their index.
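
For readers without the `mlai` package, the sampling step can be sketched in plain NumPy. The covariance function below is a hand-written exponentiated quadratic and may differ in detail from `mlai.exponentiated_quadratic`. The input grid and the jitter value are assumptions.

```python
import numpy as np

def exponentiated_quadratic(X, X2, variance=1.0, lengthscale=0.5):
    """k(x, x') = variance * exp(-(x - x')^2 / (2 * lengthscale^2)) for 1-D inputs."""
    sq_dist = (X[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / lengthscale**2)

rng = np.random.default_rng(4949)
x = np.linspace(-1, 1, 25)            # 25 input locations (assumed grid)
K = exponentiated_quadratic(x, x)     # 25 x 25 covariance matrix

# A small diagonal "jitter" (an assumed value) keeps the Cholesky factorisation stable.
L = np.linalg.cholesky(K + 1e-6 * np.eye(25))
f = L @ rng.standard_normal(25)       # one sample f ~ N(0, K)
```

Neighbouring entries of `f` are strongly correlated, so plotting `f` against its index gives a smooth curve.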

In [None]:
import teaching_plots as plot
from mlai import Kernel, exponentiated_quadratic

In [None]:
kernel=Kernel(function=exponentiated_quadratic, lengthscale=0.5)
plot.two_point_sample(kernel.K, diagrams='../slides/diagrams/gp')

In [None]:
import pods
from ipywidgets import IntSlider

In [None]:
pods.notebook.display_plots('two_point_sample{sample:0>3}.svg', '../slides/diagrams/gp', sample=IntSlider(0, 0, 8, 1))

### Gaussian Distribution Sample

<img src="../slides/diagrams/gp/two_point_sample000.svg" class="" align="" style="">

<img src="../slides/diagrams/gp/two_point_sample001.svg" class="" align="" style="">

<img src="../slides/diagrams/gp/two_point_sample002.svg" class="" align="" style="">

<img src="../slides/diagrams/gp/two_point_sample003.svg" class="" align="" style="">

<img src="../slides/diagrams/gp/two_point_sample004.svg" class="" align="" style="">

<img src="../slides/diagrams/gp/two_point_sample005.svg" class="" align="" style="">

<img src="../slides/diagrams/gp/two_point_sample006.svg" class="" align="" style="">

<img src="../slides/diagrams/gp/two_point_sample007.svg" class="" align="" style="">

<img src="../slides/diagrams/gp/two_point_sample008.svg" class="" align="" style="">

In [None]:
import pods
from ipywidgets import IntSlider

In [None]:
pods.notebook.display_plots('two_point_sample{sample:0>3}.svg', '../slides/diagrams/gp', sample=IntSlider(9, 9, 12, 1))

### Prediction of $\mappingFunction_{2}$ from $\mappingFunction_{1}$

<img src="../slides/diagrams/gp/two_point_sample009.svg" class="" align="" style="">

<img src="../slides/diagrams/gp/two_point_sample010.svg" class="" align="" style="">

<img src="../slides/diagrams/gp/two_point_sample011.svg" class="" align="" style="">

<img src="../slides/diagrams/gp/two_point_sample012.svg" class="" align="" style="">

### Uluru

<div style="text-align:center">

<img class="" src="../slides/diagrams/gp/799px-Uluru_Panorama.jpg" width="" height="auto" align="center" style="background:none; border:none; box-shadow:none;">

</div>

### Prediction with Correlated Gaussians

-   Prediction of $\mappingFunction_2$ from $\mappingFunction_1$
    requires *conditional density*.
-   Conditional density is *also* Gaussian. $$
    p(\mappingFunction_2|\mappingFunction_1) = \gaussianDist{\mappingFunction_2}{\frac{\kernelScalar_{1, 2}}{\kernelScalar_{1, 1}}\mappingFunction_1}{ \kernelScalar_{2, 2} - \frac{\kernelScalar_{1,2}^2}{\kernelScalar_{1,1}}}
    $$ where covariance of joint density is given by $$
    \kernelMatrix = \begin{bmatrix} \kernelScalar_{1, 1} & \kernelScalar_{1, 2}\\ \kernelScalar_{2, 1} & \kernelScalar_{2, 2}.\end{bmatrix}
    $$
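
As a concrete check of the conditional formula, here is a minimal sketch in plain Python. The covariance entries and the observed value of $\mappingFunction_1$ are made-up numbers chosen only for illustration, and `conditional_gaussian` is a hypothetical helper name, not part of `mlai` or `pods`.

```python
def conditional_gaussian(f1, k11, k12, k22):
    """Mean and variance of p(f2 | f1) for a zero-mean joint Gaussian."""
    mean = k12 / k11 * f1
    variance = k22 - k12 ** 2 / k11
    return mean, variance

# Illustrative (assumed) covariance entries for two strongly correlated points.
k11, k12, k22 = 1.0, 0.9, 1.0
mean, variance = conditional_gaussian(f1=-0.5, k11=k11, k12=k12, k22=k22)
print(mean, variance)
```

The stronger the correlation $\kernelScalar_{1,2}$, the closer the conditional mean follows the observed value and the smaller the conditional variance becomes.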

In [None]:
pods.notebook.display_plots('two_point_sample{sample:0>3}.svg', '../slides/diagrams/gp', sample=IntSlider(13, 13, 17, 1))

### Prediction of $\mappingFunction_{8}$ from $\mappingFunction_{1}$

\startanimation{two_point_sample3}{13}{17}
\newframe{<img src="../slides/diagrams/gp/two_point_sample013.svg" class="" align="" style="">}{two_point_sample3}
\newframe{<img src="../slides/diagrams/gp/two_point_sample014.svg" class="" align="" style="">}{two_point_sample3}
\newframe{<img src="../slides/diagrams/gp/two_point_sample015.svg" class="" align="" style="">}{two_point_sample3}
\newframe{<img src="../slides/diagrams/gp/two_point_sample016.svg" class="" align="" style="">}{two_point_sample3}
\newframe{<img src="../slides/diagrams/gp/two_point_sample017.svg" class="" align="" style="">}{two_point_sample3}
\endanimation

\newslides{Key Object}

-   Covariance function, $\kernelMatrix$

-   Determines properties of samples.

-   Function of $\inputMatrix$,
    $$\kernelScalar_{i,j} = \kernelScalar(\inputVector_i, \inputVector_j)$$
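
To make the role of the covariance function concrete, the sketch below builds $\kernelMatrix$ element by element from a covariance function and draws samples from the resulting zero-mean Gaussian. The exponentiated quadratic form, the lengthscale, the input grid and the helper names `eq_kernel` and `covariance_matrix` are assumptions made for illustration; they are not the `mlai` implementations.

```python
import numpy as np

def eq_kernel(x, x_prime, variance=1.0, lengthscale=0.2):
    """Exponentiated quadratic covariance between two input vectors."""
    sq_dist = np.sum((x - x_prime) ** 2)
    return variance * np.exp(-0.5 * sq_dist / lengthscale ** 2)

def covariance_matrix(X, kernel):
    """Evaluate k_ij = kernel(x_i, x_j) for every pair of rows in X."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

X = np.linspace(-1, 1, 25)[:, np.newaxis]
K = covariance_matrix(X, eq_kernel)

# Draw three sample functions from the zero-mean prior N(0, K).
# A small jitter on the diagonal keeps the Cholesky factorisation stable.
rng = np.random.default_rng(0)
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(X)))
samples = L @ rng.standard_normal((len(X), 3))
```

Changing only `eq_kernel` (for example its lengthscale) changes how smooth the sampled functions are, which is the sense in which the covariance function determines the properties of the samples.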

\newslides{Linear Algebra}

-   Posterior mean
    $$\mappingFunction_D(\inputVector_*) = \kernelVector(\inputVector_*, \inputMatrix) \kernelMatrix^{-1}
    \mathbf{y}$$

-   Posterior covariance
    $$\mathbf{C}_* = \kernelMatrix_{*,*} - \kernelMatrix_{*,\mappingFunctionVector}
    \kernelMatrix^{-1} \kernelMatrix_{\mappingFunctionVector, *}$$

\newslides{Linear Algebra}

-   Posterior mean, with $\boldsymbol{\alpha} = \kernelMatrix^{-1}\mathbf{y}$
    computed once and reused for every test input,
    $$\mappingFunction_D(\inputVector_*) = \kernelVector(\inputVector_*, \inputMatrix) \boldsymbol{\alpha}$$

-   Posterior covariance
    $$\covarianceMatrix_* = \kernelMatrix_{*,*} - \kernelMatrix_{*,\mappingFunctionVector}
    \kernelMatrix^{-1} \kernelMatrix_{\mappingFunctionVector, *}$$
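
The two posterior expressions can be sketched in NumPy as follows. This is a minimal illustration, not the code used to produce the figures: the exponentiated quadratic kernel, the toy sine data, and the `noise_variance` added to the diagonal of $\kernelMatrix$ are all assumptions, and `gp_posterior` is a hypothetical helper name.

```python
import numpy as np

def eq_kernel_matrix(X, X2, variance=1.0, lengthscale=1.0):
    """Exponentiated quadratic covariance between all rows of X and X2."""
    sq_dist = np.sum((X[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq_dist / lengthscale ** 2)

def gp_posterior(X, y, X_star, noise_variance=1e-2):
    """Posterior mean and covariance of f at X_star given (X, y)."""
    K = eq_kernel_matrix(X, X) + noise_variance * np.eye(len(X))
    K_star_f = eq_kernel_matrix(X_star, X)
    K_star_star = eq_kernel_matrix(X_star, X_star)
    # Solve with a Cholesky factor rather than forming K^{-1} explicitly.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # alpha = K^{-1} y
    V = np.linalg.solve(L, K_star_f.T)                   # V = L^{-1} K_{f,*}
    mean = K_star_f @ alpha
    cov = K_star_star - V.T @ V
    return mean, cov

# Toy data (assumed): noisy observations of a sine function.
rng = np.random.default_rng(1)
X = np.linspace(0, 5, 8)[:, None]
y = np.sin(X) + 0.1 * rng.standard_normal(X.shape)
X_star = np.linspace(0, 5, 50)[:, None]
mean, cov = gp_posterior(X, y, X_star)
```

The vector `alpha` depends only on the training data, so it is computed once; each new test input then costs one kernel evaluation against the training set for the mean.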

\newslides{}

<img src="../slides/diagrams/gp_prior_samples_data.svg" class="" align="" style="">

\newslides{}

<img src="../slides/diagrams/gp_rejection_samples.svg" class="" align="" style="">

\newslides{}

<img src="../slides/diagrams/gp_prediction.svg" class="" align="" style="">

\loadplotcode{eq_cov}{mlai}

In [None]:
import teaching_plots as plot
import mlai
import numpy as np

In [None]:
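kernel = mlai.Kernel(function=eq_cov,
                     name='Exponentiated Quadratic',
                     shortname='eq',
                     # Raw string so LaTeX backslashes are not read as Python escapes.
                     formula=r'\kernelScalar(\inputVector, \inputVector^\prime) = \alpha \exp\left(-\frac{\left\Vert\inputVector-\inputVector^\prime\right\Vert_2^2}{2\ell^2}\right)',
                     lengthscale=0.2)
plot.covariance_func(kernel, diagrams='../slides/diagrams/kern/')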

### Exponentiated Quadratic Covariance

<center>
$$\kernelScalar(\inputVector, \inputVector^\prime) = \alpha \exp\left(-\frac{\left\Vert\inputVector-\inputVector^\prime\right\Vert_2^2}{2\ell^2}\right)$$
</center>
<br>
<table>
<tr>
<td width="45%">
<img src="../slides/diagrams/kern/eq_covariance.svg" class="" align="" >
</td>
<td width="45%">
<img class="negate" src="../slides/diagrams/kern/eq_covariance.gif" width="100%" height="auto" align="center" style="background:none; border:none; box-shadow:none;">
</td>
</tr>
</table>

In [None]:
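import matplotlib.pyplot as plt
import pods

# Gold medal marathon results: x is the year, y the pace in min/km.
data = pods.datasets.olympic_marathon_men()
x = data['X']
y = data['Y']

# Standardise the targets to zero mean and unit variance before modelling.
offset = y.mean()
scale = np.sqrt(y.var())

xlim = (1875,2030)
ylim = (2.5, 6.5)
yhat = (y-offset)/scale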

fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
_ = ax.plot(x, y, 'r.',markersize=10)
ax.set_xlabel('year', fontsize=20)
ax.set_ylabel('pace min/km', fontsize=20)
ax.set_xlim(xlim)
ax.set_ylim(ylim)

mlai.write_figure(figure=fig, filename='../slides/diagrams/datasets/olympic-marathon.svg', transparent=True, frameon=True)

### Olympic Marathon Data

<table>
<tr>
<td width="70%">
-   Gold medal times for Olympic Marathon since 1896.

-   Marathons before 1924 didn’t have a standardised distance.

-   Present results using pace per km.

-   In 1904 Marathon was badly organised leading to very slow times.

</td>
<td width="30%">
![image](../slides/diagrams/Stephen_Kiprotich.jpg) <small>Image from
Wikimedia Commons <http://bit.ly/16kMKHQ></small>
</td>
</tr>
</table>

### Olympic Marathon Data

<img src="../slides/diagrams/datasets/olympic-marathon.svg" class="" align="" style="">

In [None]:
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot.model_output(m_full, scale=scale, offset=offset, ax=ax, xlabel='year', ylabel='pace min/km', fontsize=20, portion=0.2)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure=fig,
                  filename='../slides/diagrams/gp/olympic-marathon-gp.svg', 
                  transparent=True, frameon=True)

### Olympic Marathon Data GP

<img src="../slides/diagrams/gp/olympic-marathon-gp.svg" class="" align="" style="">

### 

<table>
<tr>
<td width="40%">
<img class="" src="../slides/diagrams/turing-run.jpg" width="40%" height="auto" align="" style="background:none; border:none; box-shadow:none;">
</td>
<td width="50%">
<img class="" src="../slides/diagrams/turing-times.gif" width="50%" height="auto" align="" style="background:none; border:none; box-shadow:none;">
</td>
</tr>
</table>
<center>
<i>In 1946 Alan Turing was only 11 minutes slower than the winner of
the 1948 games. Would he have won a hypothetical games held in 1946?
Source: [Alan Turing Internet
Scrapbook](http://www.turing.org.uk/scrapbook/run.html)</i>
</center>

### Basis Function Covariance

In [None]:
import teaching_plots as plot
import mlai
import numpy as np

In [None]:

basis = mlai.Basis(function=radial, 
                   number=3,
                   data_limits=[-0.5, 0.5], 
                   width=0.125)
kernel = mlai.Kernel(function=basis_cov,
                     name='Basis',
                     shortname='basis',                  
                     formula=r'\kernel(\inputVector, \inputVector^\prime) = \basisVector(\inputVector)^\top \basisVector(\inputVector^\prime)',
                     basis=basis)
                     
plot.covariance_func(kernel, diagrams='../slides/diagrams/kern/')

<center>
$$\kernel(\inputVector, \inputVector^\prime) = \basisVector(\inputVector)^\top \basisVector(\inputVector^\prime)$$
</center>
<br>
<table>
<tr>
<td width="45%">
<img src="../slides/diagrams/kern/basis_covariance.svg" class="" align="" >
</td>
<td width="45%">
<img class="negate" src="../slides/diagrams/kern/basis_covariance.gif" width="100%" height="auto" align="center" style="background:none; border:none; box-shadow:none;">
</td>
</tr>
</table>
<center>
<i>A covariance function based on a non-linear basis given by
$\basisVector(\inputVector)$.</i>
</center>
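
A basis function covariance can be sketched directly from its definition as an inner product of basis vectors. The radial basis form and the helper name `radial_basis` below are assumptions for illustration (the exact form of `mlai`'s `radial` may differ); the centres and width mirror the values used in the cell above.

```python
import numpy as np

def radial_basis(x, centres, width):
    """Radial basis functions evaluated at x, one column per centre."""
    return np.exp(-0.5 * ((x - centres.T) / width) ** 2)

# Three centres spread over the data limits [-0.5, 0.5].
centres = np.linspace(-0.5, 0.5, 3)[:, np.newaxis]
x = np.linspace(-1, 1, 50)[:, np.newaxis]

Phi = radial_basis(x, centres, width=0.125)   # shape (50, 3)
K = Phi @ Phi.T                               # k(x, x') = phi(x)^T phi(x')
```

Because `K` is built from only three basis functions, its rank is at most three, however many inputs are evaluated. This is the degenerate case that the other covariance functions in this section avoid.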

### Brownian Covariance

In [None]:
import teaching_plots as plot
import mlai
import numpy as np

In [None]:
t=np.linspace(0, 2, 200)[:, np.newaxis]
kernel = mlai.Kernel(function=brownian_cov,
                     name='Brownian',
                     formula=r'\kernelScalar(t, t^\prime)=\alpha \min(t, t^\prime)',
                     shortname='brownian')
plot.covariance_func(kernel, t, diagrams='../slides/diagrams/kern/')

$$
\kernelScalar(t, t^\prime) = \alpha \min(t, t^\prime)
$$
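
A minimal sketch of this covariance and of sampling from it follows. The helper name `brownian_kernel_matrix`, the time grid and the number of paths are assumptions for illustration; the grid starts just above zero because the variance at $t=0$ is zero, which would make the matrix singular.

```python
import numpy as np

def brownian_kernel_matrix(t, alpha=1.0):
    """Covariance k(t, t') = alpha * min(t, t') for a vector of positive times."""
    return alpha * np.minimum(t[:, None], t[None, :])

t = np.linspace(0.01, 2, 200)   # start just above zero so K is positive definite
K = brownian_kernel_matrix(t)

# Sample paths via the Cholesky factor of K.
rng = np.random.default_rng(0)
L = np.linalg.cholesky(K)
paths = L @ rng.standard_normal((len(t), 5))
```

The diagonal of `K` equals `alpha * t`, so the marginal variance of each path grows linearly with time.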

<table>
<tr>
<td width="45%">
<img src="../slides/diagrams/kern/brownian_covariance.svg" class="" align="" >
</td>
<td width="45%">
<img class="negate" src="../slides/diagrams/kern/brownian_covariance.gif" width="100%" height="auto" align="center" style="background:none; border:none; box-shadow:none;">
</td>
</tr>
</table>

### MLP Covariance

In [None]:
import teaching_plots as plot
import mlai
import numpy as np

In [None]:
kernel = mlai.Kernel(function=mlp_cov,
                     name='Multilayer Perceptron',
                     shortname='mlp',                    
                     formula=r'\kernelScalar(\inputVector, \inputVector^\prime) = \alpha \arcsin\left(\frac{w \inputVector^\top \inputVector^\prime + b}{\sqrt{\left(w \inputVector^\top \inputVector + b + 1\right)\left(w \left.\inputVector^\prime\right.^\top \inputVector^\prime + b + 1\right)}}\right)',
                     w=5, b=0.5)
                     
plot.covariance_func(kernel, diagrams='../slides/diagrams/kern/')

<center>
$$\kernelScalar(\inputVector, \inputVector^\prime) = \alpha \arcsin\left(\frac{w \inputVector^\top \inputVector^\prime + b}{\sqrt{\left(w \inputVector^\top \inputVector + b + 1\right)\left(w \left.\inputVector^\prime\right.^\top \inputVector^\prime + b + 1\right)}}\right)$$
</center>
<br>
<table>
<tr>
<td width="45%">
<img src="../slides/diagrams/kern/mlp_covariance.svg" class="" align="" >
</td>
<td width="45%">
<img class="negate" src="../slides/diagrams/kern/mlp_covariance.gif" width="100%" height="auto" align="center" style="background:none; border:none; box-shadow:none;">
</td>
</tr>
</table>
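
The MLP covariance formula can be evaluated directly for a pair of inputs. The function name `mlp_kernel` and the example inputs are assumptions for illustration; the defaults `w=5` and `b=0.5` follow the values passed to `mlai.Kernel` above.

```python
import numpy as np

def mlp_kernel(x, x_prime, variance=1.0, w=5.0, b=0.5):
    """MLP (arcsine) covariance between two input vectors."""
    inner = w * np.dot(x, x_prime) + b
    norm = np.sqrt((w * np.dot(x, x) + b + 1.0)
                   * (w * np.dot(x_prime, x_prime) + b + 1.0))
    return variance * np.arcsin(inner / norm)

x = np.array([0.3])
x_prime = np.array([-0.7])
print(mlp_kernel(x, x_prime))
```

The `+ 1` terms in the denominator keep the argument of `arcsin` strictly inside $(-1, 1)$, so the covariance is always well defined.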

### 

<img class="" src="../slides/diagrams/Planck_CMB.png" width="70%" height="auto" align="" style="background:none; border:none; box-shadow:none;">

### 

<div style="font-size:120px;vertical-align:middle;">

<img src="../slides/diagrams/earth_PNG37.png" width="20%" style="display:inline-block;background:none;vertical-align:middle;border:none;box-shadow:none;">$=f\Bigg($
<img src="../slides/diagrams/Planck_CMB.png"  width="50%" style="display:inline-block;background:none;vertical-align:middle;border:none;box-shadow:none;">$\Bigg)$

</div>

### Deep Gaussian Processes

### Approximations

<div style="text-align:center">

<img class="" src="../slides/diagrams/sparse-gps-1.png" width="90%" height="auto" align="center" style="background:none; border:none; box-shadow:none;">

</div>

<center>
<i>Image credit: Kai Arulkumaran</i>
</center>

### Approximations

<div style="text-align:center">

<img class="" src="../slides/diagrams/sparse-gps-2.png" width="90%" height="auto" align="center" style="background:none; border:none; box-shadow:none;">

</div>

<center>
<i>Image credit: Kai Arulkumaran</i>
</center>

### Approximations

<div style="text-align:center">

<img class="" src="../slides/diagrams/sparse-gps-3.png" width="45%" height="auto" align="center" style="background:none; border:none; box-shadow:none;">

</div>

<center>
<i>Image credit: Kai Arulkumaran</i>
</center>

### Approximations

<div style="text-align:center">

<img class="" src="../slides/diagrams/sparse-gps-4.png" width="45%" height="auto" align="center" style="background:none; border:none; box-shadow:none;">

</div>

<center>
<i>Image credit: Kai Arulkumaran</i>
</center>

In [None]:
fig, ax = plt.subplots(figsize=plot.wide_figsize)
plot.model_output(m_full, ax=ax, xlabel='$x$', ylabel='$y$', fontsize=20, portion=0.2)
xlim = ax.get_xlim()
ylim = ax.get_ylim()
mlai.write_figure(figure=fig,
                  filename='../slides/diagrams/gp/sparse-demo-full-gp.svg', 
                  transparent=True, frameon=True)

### Full Gaussian Process Fit

<img src="../slides/diagrams/gp/sparse-demo-full-gp.svg" class="" align="" style="">

In [None]:
fig, ax = plt.subplots(figsize=plot.wide_figsize)
plot.model_output(m, ax=ax, xlabel='$x$', ylabel='$y$', fontsize=20, portion=0.2, xlim=xlim, ylim=ylim)
mlai.write_figure(figure=fig,
                  filename='../slides/diagrams/gp/sparse-demo-constrained-inducing-6-unlearned-gp.svg', 
                  transparent=True, frameon=True)

### Inducing Variable Fit

<img src="../slides/diagrams/gp/sparse-demo-constrained-inducing-6-unlearned-gp.svg" class="" align="" style="">

In [None]:
fig, ax = plt.subplots(figsize=plot.wide_figsize)
plot.model_output(m, ax=ax, xlabel='$x$', ylabel='$y$', fontsize=20, portion=0.2, xlim=xlim, ylim=ylim)
mlai.write_figure(figure=fig,
                  filename='../slides/diagrams/gp/sparse-demo-constrained-inducing-6-learned-gp.svg', 
                  transparent=True, frameon=True)

### Inducing Variable Param Optimize

<img src="../slides/diagrams/gp/sparse-demo-constrained-inducing-6-learned-gp.svg" class="" align="" style="">

In [None]:
fig, ax = plt.subplots(figsize=plot.wide_figsize)
plot.model_output(m, ax=ax, xlabel='$x$', ylabel='$y$', fontsize=20, portion=0.2,xlim=xlim, ylim=ylim)
mlai.write_figure(figure=fig,
                  filename='../slides/diagrams/gp/sparse-demo-unconstrained-inducing-6-gp.svg', 
                  transparent=True, frameon=True)

### Inducing Variable Full Optimize

<img src="../slides/diagrams/gp/sparse-demo-unconstrained-inducing-6-gp.svg" class="" align="" style="">

In [None]:
fig, ax = plt.subplots(figsize=plot.wide_figsize)
plot.model_output(m, ax=ax, xlabel='$x$', ylabel='$y$', fontsize=20, portion=0.2, xlim=xlim, ylim=ylim)
mlai.write_figure(figure=fig,
                  filename='../slides/diagrams/gp/sparse-demo-sparse-inducing-8-gp.svg', 
                  transparent=True, frameon=True)

### Eight Optimized Inducing Variables

<img src="../slides/diagrams/gp/sparse-demo-sparse-inducing-8-gp.svg" class="" align="" style="">

### Full Gaussian Process Fit

<img src="../slides/diagrams/gp/sparse-demo-full-gp.svg" class="" align="" style="">

### Modern Review

-   *A Unifying Framework for Gaussian Process Pseudo-Point
    Approximations using Power Expectation Propagation*
    @Thang:unifying17

-   *Deep Gaussian Processes and Variational Propagation of Uncertainty*
    @Damianou:thesis2015

In [None]:
plot.deep_nn(diagrams='../slides/diagrams/deepgp/')

### Deep Neural Network

<img src="../slides/diagrams/deepgp/deep-nn1.svg" class="" align="" style="">

### Deep Neural Network

<img src="../slides/diagrams/deepgp/deep-nn2.svg" class="" align="" style="">

### Mathematically

$$
\begin{align}
    \hiddenVector_{1} &= \basisFunction\left(\mappingMatrix_1 \inputVector\right)\\
    \hiddenVector_{2} &=  \basisFunction\left(\mappingMatrix_2\hiddenVector_{1}\right)\\
    \hiddenVector_{3} &= \basisFunction\left(\mappingMatrix_3 \hiddenVector_{2}\right)\\
    \dataVector &= \mappingVector_4 ^\top\hiddenVector_{3}
\end{align}
$$
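
The four equations above amount to the following forward pass. This is a minimal sketch: the layer sizes, the random weights and the choice of `np.tanh` as the basis function $\basisFunction$ are assumptions for illustration.

```python
import numpy as np

def forward(x, W1, W2, W3, w4, phi=np.tanh):
    """Forward pass through three hidden layers and a linear output."""
    h1 = phi(W1 @ x)
    h2 = phi(W2 @ h1)
    h3 = phi(W3 @ h2)
    return w4 @ h3

rng = np.random.default_rng(0)
d, k = 4, 16                      # assumed input and hidden sizes
W1 = rng.standard_normal((k, d))
W2 = rng.standard_normal((k, k))
W3 = rng.standard_normal((k, k))
w4 = rng.standard_normal(k)
y = forward(rng.standard_normal(d), W1, W2, W3, w4)
```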

### Overfitting

-   Potential problem: if two adjacent layers have $k_1$ and $k_2$ nodes,
    the corresponding $\mappingMatrix$ has $k_1 \times k_2$ entries, so
    wide layers bring many parameters and the potential to overfit.

-   Proposed solution: “dropout”.

-   Alternative solution: parameterize $\mappingMatrix$ with its SVD. $$
      \mappingMatrix = \eigenvectorMatrix\eigenvalueMatrix\eigenvectwoMatrix^\top
      $$ or $$
      \mappingMatrix = \eigenvectorMatrix\eigenvectwoMatrix^\top
      $$ where if $\mappingMatrix \in \Re^{k_1\times k_2}$ then
    $\eigenvectorMatrix\in \Re^{k_1\times q}$ and
    $\eigenvectwoMatrix \in \Re^{k_2\times q}$, i.e. we have a low rank
    matrix factorization for the weights.
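
The saving from the low rank form is easy to quantify. In the sketch below the layer widths `k1`, `k2` and the rank `q` are assumed values chosen for illustration.

```python
import numpy as np

k1, k2, q = 200, 300, 10          # assumed layer widths and rank

rng = np.random.default_rng(0)
U = rng.standard_normal((k1, q))
V = rng.standard_normal((k2, q))
W = U @ V.T                       # k1 x k2 weight matrix of rank at most q

full_params = k1 * k2             # parameters in an unconstrained W
low_rank_params = q * (k1 + k2)   # parameters in U and V together
print(full_params, low_rank_params)
```

With these sizes the unconstrained matrix has 60000 entries while the factorised form needs 5000, and the gap widens as the layers get wider.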

In [None]:
plot.low_rank_approximation(diagrams='../slides/diagrams')

### Low Rank Approximation

<img src="../slides/diagrams/wisuvt.svg" class="" align="" style="">
<center>
<i>Pictorial representation of the low rank form of the matrix
$\mappingMatrix$</i>
</center>

In [None]:
plot.deep_nn_bottleneck(diagrams='../slides/diagrams/deepgp')

### Deep Neural Network

<img src="../slides/diagrams/deepgp/deep-nn-bottleneck1.svg" class="" align="" style="">

### Deep Neural Network

<img src="../slides/diagrams/deepgp/deep-nn-bottleneck2.svg" class="" align="" style="">

### Mathematically

The network can now be written mathematically as $$
\begin{align}
  \latentVector_{1} &= \eigenvectwoMatrix^\top_1 \inputVector\\
  \hiddenVector_{1} &= \basisFunction\left(\eigenvectorMatrix_1 \latentVector_{1}\right)\\
  \latentVector_{2} &= \eigenvectwoMatrix^\top_2 \hiddenVector_{1}\\
  \hiddenVector_{2} &= \basisFunction\left(\eigenvectorMatrix_2 \latentVector_{2}\right)\\
  \latentVector_{3} &= \eigenvectwoMatrix^\top_3 \hiddenVector_{2}\\
  \hiddenVector_{3} &= \basisFunction\left(\eigenvectorMatrix_3 \latentVector_{3}\right)\\
  \dataVector &= \mappingVector_4^\top\hiddenVector_{3}.
\end{align}
$$

### A Cascade of Neural Networks

$$
\begin{align}
  \latentVector_{1} &= \eigenvectwoMatrix^\top_1 \inputVector\\
  \latentVector_{2} &= \eigenvectwoMatrix^\top_2 \basisFunction\left(\eigenvectorMatrix_1 \latentVector_{1}\right)\\
  \latentVector_{3} &= \eigenvectwoMatrix^\top_3 \basisFunction\left(\eigenvectorMatrix_2 \latentVector_{2}\right)\\
  \dataVector &= \mappingVector_4 ^\top \basisFunction\left(\eigenvectorMatrix_3 \latentVector_{3}\right)
\end{align}
$$
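
The layered view with low rank weights and the cascade view describe the same computation, which can be checked numerically. The sizes, the random matrices and the use of `np.tanh` for $\basisFunction$ are assumptions for illustration, and only two stages are shown.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, q = 5, 50, 3                      # assumed input, hidden and latent sizes
phi = np.tanh                           # assumed basis function

U1, V1 = rng.standard_normal((k, q)), rng.standard_normal((d, q))
U2, V2 = rng.standard_normal((k, q)), rng.standard_normal((k, q))
x = rng.standard_normal(d)

# Layered view: full-width hidden layers with low rank weights W_i = U_i V_i^T.
h1 = phi((U1 @ V1.T) @ x)
h2 = phi((U2 @ V2.T) @ h1)

# Cascade view: each stage maps one q-dimensional latent vector to the next.
z1 = V1.T @ x
z2 = V2.T @ phi(U1 @ z1)
h2_cascade = phi(U2 @ z2)

print(np.allclose(h2, h2_cascade))
```

Each stage of the cascade is itself a small neural network from a `q`-dimensional input to a `q`-dimensional output, which is what makes the replacement by Gaussian processes natural.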

### Cascade of Gaussian Processes

-   Replace each neural network with a Gaussian process $$
    \begin{align}
      \latentVector_{1} &= \mappingFunctionVector_1\left(\inputVector\right)\\
      \latentVector_{2} &= \mappingFunctionVector_2\left(\latentVector_{1}\right)\\
      \latentVector_{3} &= \mappingFunctionVector_3\left(\latentVector_{2}\right)\\
      \dataVector &= \mappingFunctionVector_4\left(\latentVector_{3}\right)
    \end{align}
    $$

-   Equivalent to placing a prior over the parameters and taking the
    width of each layer to infinity.

### Mathematically

-   Composite *multivariate* function

$$
  \mathbf{g}(\inputVector)=\mappingFunctionVector_5(\mappingFunctionVector_4(\mappingFunctionVector_3(\mappingFunctionVector_2(\mappingFunctionVector_1(\inputVector))))).
  $$
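
A sample from such a composite function can be sketched by drawing each layer from a Gaussian process evaluated at the outputs of the previous layer. This is a prior-sampling illustration only, under stated assumptions: every layer is one-dimensional, uses an exponentiated quadratic covariance with unit lengthscale, and the helper names are hypothetical. It is not the variational inference scheme used to fit deep GPs.

```python
import numpy as np

def eq_kernel_matrix(X, variance=1.0, lengthscale=1.0):
    """Exponentiated quadratic covariance between all rows of X."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq_dist / lengthscale ** 2)

def sample_gp_layer(inputs, rng, jitter=1e-6):
    """Draw one function sample f ~ GP(0, k), evaluated at the given inputs."""
    K = eq_kernel_matrix(inputs) + jitter * np.eye(len(inputs))
    L = np.linalg.cholesky(K)
    return L @ rng.standard_normal((len(inputs), 1))

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 100)[:, None]

# g(x) = f_5(f_4(f_3(f_2(f_1(x))))), each f_i an independent GP sample.
layer = x
for _ in range(5):
    layer = sample_gp_layer(layer, rng)
g = layer
```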

In [None]:
pgm = plot.horizontal_chain(depth=5)
pgm.render().figure.savefig("../slides/diagrams/deepgp/deep-markov.svg", transparent=True)

### Equivalent to Markov Chain

-   The composition factorises as a Markov chain $$
      p(\dataVector|\inputVector)= p(\dataVector|\mappingFunctionVector_5)p(\mappingFunctionVector_5|\mappingFunctionVector_4)p(\mappingFunctionVector_4|\mappingFunctionVector_3)p(\mappingFunctionVector_3|\mappingFunctionVector_2)p(\mappingFunctionVector_2|\mappingFunctionVector_1)p(\mappingFunctionVector_1|\inputVector)
      $$

<img src="../slides/diagrams/deepgp/deep-markov.svg" class="" align="" style="">

In [None]:
pgm = plot.vertical_chain(depth=5)
pgm.render().figure.savefig("../slides/diagrams/deepgp/deep-markov-vertical.svg", transparent=True)

### 

<img src="../slides/diagrams/deepgp/deep-markov-vertical.svg" class="" align="" style="">

### Why Deep?

-   Gaussian processes give priors over functions.

-   Elegant properties:
    -   e.g. *Derivatives* of process are also Gaussian distributed (if they
        exist).

-   For particular covariance functions they are ‘universal
    approximators’, i.e. all functions can have support under the prior.

-   Gaussian derivatives might ring alarm bells.

-   E.g. a priori they don’t believe in function ‘jumps’.

### Stochastic Process Composition

-   From a process perspective: *process composition*.

-   A (new?) way of constructing more complex *processes* based on
    simpler components.

### 

<img src="../slides/diagrams/deepgp/deep-markov-vertical.svg" class="" align="" style="">

In [None]:
pgm = plot.vertical_chain(depth=5, shape=[2, 7])
pgm.add_node(daft.Node('y_2', r'$\mathbf{y}_2$', 1.5, 3.5, observed=True))
pgm.add_edge('f_2', 'y_2')
pgm.render().figure.savefig("../slides/diagrams/deepgp/deep-markov-vertical-side.svg", transparent=True)

### 

<img src="../slides/diagrams/deepgp/deep-markov-vertical-side.svg" class="" align="" style="">

In [None]:
plot.non_linear_difficulty_plot_3(diagrams='../../slides/diagrams/dimred/')

### Difficulty for Probabilistic Approaches {#difficulty-for-probabilistic-approaches data-transition="None"}

-   Propagate a probability distribution through a non-linear mapping.

-   Normalisation of distribution becomes intractable.

<img src="../slides/diagrams/dimred/nonlinear-mapping-3d-plot.svg" class="" align="center" style="">

In [None]:
plot.non_linear_difficulty_plot_2(diagrams='../../slides/diagrams/dimred/')

### Difficulty for Probabilistic Approaches {#difficulty-for-probabilistic-approaches-1 data-transition="None"}

-   Propagate a probability distribution through a non-linear mapping.

-   Normalisation of distribution becomes intractable.

<img src="../slides/diagrams/dimred/nonlinear-mapping-2d-plot.svg" class="" align="center" style="">

In [None]:
plot.non_linear_difficulty_plot_1(diagrams='../../slides/diagrams/dimred')

### Difficulty for Probabilistic Approaches {#difficulty-for-probabilistic-approaches-2 data-transition="None"}

-   Propagate a probability distribution through a non-linear mapping.

-   Normalisation of distribution becomes intractable.

<img src="../slides/diagrams/dimred/gaussian-through-nonlinear.svg" class="" align="center" style="">
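
The difficulty can be seen with a simple Monte Carlo experiment: a linear map of a Gaussian variable stays Gaussian, but a non-linear map does not. The sketch below uses sample skewness as a rough indicator of non-Gaussianity; the choice of `np.exp` as the non-linear mapping, the sample size and the `skewness` helper are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)          # samples from N(0, 1)

def skewness(v):
    """Sample skewness; zero in expectation for any Gaussian."""
    c = v - v.mean()
    return np.mean(c ** 3) / np.mean(c ** 2) ** 1.5

linear = 2.0 * z + 1.0                    # linear map: still Gaussian
nonlinear = np.exp(z)                     # non-linear map: log-normal, not Gaussian

print(skewness(linear), skewness(nonlinear))
```

For this particular mapping the output density happens to have a closed form, but for a general non-linear function (such as a sample from a Gaussian process) it does not, which is why approximations are needed when stacking layers.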

### Deep Gaussian Processes

-   Deep architectures allow abstraction of features
    [@Bengio:deep09; @Hinton:fast06; @Salakhutdinov:quantitative08]

-   We use a variational approach to stack GP models.

In [None]:
import GPy

plot.stack_gp_sample(kernel=GPy.kern.Linear,
                     diagrams="../../slides/diagrams/deepgp")

In [None]:
pods.notebook.display_plots('stack-gp-sample-Linear-{sample:0>1}.svg', 
                            directory='../../slides/diagrams/deepgp', sample=(0,4))

### Stacked PCA {#stacked-pca data-transition="None"}

<img src="../slides/diagrams/stack-pca-sample-0.svg" class="" align="" style="">

### Stacked PCA {#stacked-pca-1 data-transition="None"}

<img src="../slides/diagrams/stack-pca-sample-1.svg" class="" align="" style="">

### Stacked PCA {#stacked-pca-2 data-transition="None"}

<img src="../slides/diagrams/stack-pca-sample-2.svg" class="" align="" style="">

### Stacked PCA {#stacked-pca-3 data-transition="None"}

<img src="../slides/diagrams/stack-pca-sample-3.svg" class="" align="" style="">

### Stacked PCA {#stacked-pca-4 data-transition="None"}

<img src="../slides/diagrams/stack-pca-sample-4.svg" class="" align="" style="">

In [None]:
plot.stack_gp_sample(kernel=GPy.kern.RBF,
                     diagrams="../../slides/diagrams/deepgp")

In [None]:
pods.notebook.display_plots('stack-gp-sample-RBF-{sample:0>1}.svg', 
                            directory='../../slides/diagrams/deepgp', sample=(0,4))

### Stacked GP {#stacked-gp data-transition="None"}

<img src="../slides/diagrams/stack-gp-sample-0.svg" class="" align="" style="">

### Stacked GP {#stacked-gp-1 data-transition="None"}

<img src="../slides/diagrams/stack-gp-sample-1.svg" class="" align="" style="">

### Stacked GP {#stacked-gp-2 data-transition="None"}

<img src="../slides/diagrams/stack-gp-sample-2.svg" class="" align="" style="">

### Stacked GP {#stacked-gp-3 data-transition="None"}

<img src="../slides/diagrams/stack-gp-sample-3.svg" class="" align="" style="">

### Stacked GP {#stacked-gp-4 data-transition="None"}

<img src="../slides/diagrams/stack-gp-sample-4.svg" class="" align="" style="">

### Analysis of Deep GPs

-   *Avoiding pathologies in very deep networks* @Duvenaud:pathologies14
    show that the derivative distribution of the process becomes more
    *heavy tailed* as the number of layers increases.

-   *How Deep Are Deep Gaussian Processes?* @Dunlop:deep2017 perform a
    theoretical analysis made possible by the conditional Gaussian
    Markov property.

###

In [None]:
from IPython.lib.display import YouTubeVideo
YouTubeVideo('XhIvygQYFFQ')

In [None]:
data = pods.datasets.olympic_marathon_men()
x = data['X']
y = data['Y']

offset = y.mean()
scale = np.sqrt(y.var())

xlim = (1875,2030)
ylim = (2.5, 6.5)
yhat = (y-offset)/scale

fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
_ = ax.plot(x, y, 'r.',markersize=10)
ax.set_xlabel('year', fontsize=20)
ax.set_ylabel('pace min/km', fontsize=20)
ax.set_xlim(xlim)
ax.set_ylim(ylim)

mlai.write_figure(figure=fig, filename='../slides/diagrams/datasets/olympic-marathon.svg', transparent=True, frameon=True)

### Deep GP Fit

-   Can a Deep Gaussian process help?

-   Deep GP is one GP feeding into another.

### Olympic Marathon Data Deep GP

<img src="../slides/diagrams/deepgp/olympic-marathon-deep-gp.svg" class="" align="" style="">

In [None]:
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
plot.model_sample(m, scale=scale, offset=offset, samps=10, ax=ax, 
                  xlabel='year', ylabel='pace min/km', portion = 0.225)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure=fig, filename='../slides/diagrams/deepgp/olympic-marathon-deep-gp-samples.svg', 
                  transparent=True, frameon=True)

### Olympic Marathon Data Deep GP {#olympic-marathon-data-deep-gp-1 data-transition="None"}

<img src="../slides/diagrams/deepgp/olympic-marathon-deep-gp-samples.svg" class="" align="" style="">

In [None]:
m.visualize(scale=scale, offset=offset, xlabel='year',
            ylabel='pace min/km',xlim=xlim, ylim=ylim,
            dataset='olympic-marathon',
            diagrams='../slides/diagrams/deepgp')

In [None]:
import pods

In [None]:
pods.notebook.display_plots('olympic-marathon-deep-gp-layer-{sample:0>1}.svg', 
                            '../slides/diagrams/deepgp', sample=(0,1))

### Olympic Marathon Data Latent 1 {#olympic-marathon-data-latent-1 data-transition="None"}

<img src="../slides/diagrams/deepgp/olympic-marathon-deep-gp-layer-0.svg" class="" align="" style="">

### Olympic Marathon Data Latent 2 {#olympic-marathon-data-latent-2 data-transition="None"}

<img src="../slides/diagrams/deepgp/olympic-marathon-deep-gp-layer-1.svg" class="" align="" style="">

In [None]:
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
m.visualize_pinball(ax=ax, scale=scale, offset=offset, points=30, portion=0.1,
                    xlabel='year', ylabel='pace min/km', vertical=True)
mlai.write_figure(figure=fig, filename='../slides/diagrams/deepgp/olympic-marathon-deep-gp-pinball.svg', 
                  transparent=True, frameon=True)

### Olympic Marathon Pinball Plot

<img src="../slides/diagrams/deepgp/olympic-marathon-deep-gp-pinball.svg" class="" align="" style="">

In [None]:
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
_ = ax.plot(x, y, 'r.',markersize=10)
_ = ax.set_xlabel('$x$', fontsize=20)
_ = ax.set_ylabel('$y$', fontsize=20)
xlim = (-2, 2)
ylim = (-0.6, 1.6)
ax.set_ylim(ylim)
ax.set_xlim(xlim)
mlai.write_figure(figure=fig, filename='../../slides/diagrams/datasets/step-function.svg', 
            transparent=True, frameon=True)

### Step Function Data {#step-function-data data-transition="None"}

<img src="../slides/diagrams/datasets/step-function.svg" class="" align="" style="">

In [None]:
fig, ax=plt.subplots(figsize=plot.big_wide_figsize)
plot.model_output(m_full, scale=scale, offset=offset, ax=ax, fontsize=20, portion=0.5)
ax.set_ylim(ylim)
ax.set_xlim(xlim)

mlai.write_figure(figure=fig,filename='../../slides/diagrams/gp/step-function-gp.svg', 
            transparent=True, frameon=True)

### Step Function Data GP {#step-function-data-gp data-transition="None"}

<img src="../slides/diagrams/gp/step-function-gp.svg" class="" align="" style="">

In [None]:
fig, ax=plt.subplots(figsize=plot.big_wide_figsize)
plot.model_output(m, scale=scale, offset=offset, ax=ax, fontsize=20, portion=0.5)
ax.set_ylim(ylim)
ax.set_xlim(xlim)
mlai.write_figure(figure=fig, filename='../../slides/diagrams/deepgp/step-function-deep-gp.svg', 
            transparent=True, frameon=True)

### Step Function Data Deep GP {#step-function-data-deep-gp data-transition="None"}

<img src="../slides/diagrams/deepgp/step-function-deep-gp.svg" class="" align="" style="">

In [None]:
fig, ax=plt.subplots(figsize=plot.big_wide_figsize)
plot.model_sample(m, scale=scale, offset=offset, samps=10, ax=ax, portion = 0.5)
ax.set_ylim(ylim)
ax.set_xlim(xlim)
mlai.write_figure(figure=fig, filename='../../slides/diagrams/deepgp/step-function-deep-gp-samples.svg', 
                  transparent=True, frameon=True)

### Step Function Data Deep GP {#step-function-data-deep-gp-1 data-transition="None"}

<img src="../slides/diagrams/deepgp/step-function-deep-gp-samples.svg" class="" align="" style="">

In [None]:
m.visualize(offset=offset, scale=scale, xlim=xlim, ylim=ylim,
            dataset='step-function',
            diagrams='../../slides/diagrams/deepgp')

### Step Function Data Latent 1 {#step-function-data-latent-1 data-transition="None"}

<img src="../slides/diagrams/deepgp/step-function-deep-gp-layer-0.svg" class="" align="" style="">

### Step Function Data Latent 2 {#step-function-data-latent-2 data-transition="None"}

<img src="../slides/diagrams/deepgp/step-function-deep-gp-layer-1.svg" class="" align="" style="">

### Step Function Data Latent 3 {#step-function-data-latent-3 data-transition="None"}

<img src="../slides/diagrams/deepgp/step-function-deep-gp-layer-2.svg" class="" align="" style="">

### Step Function Data Latent 4 {#step-function-data-latent-4 data-transition="None"}

<img src="../slides/diagrams/deepgp/step-function-deep-gp-layer-3.svg" class="" align="" style="">

In [None]:
fig, ax=plt.subplots(figsize=plot.big_wide_figsize)
m.visualize_pinball(offset=offset, ax=ax, scale=scale, xlim=xlim, ylim=ylim, portion=0.1, points=50)
mlai.write_figure(figure=fig, filename='../../slides/diagrams/deepgp/step-function-deep-gp-pinball.svg', 
                  transparent=True, frameon=True, ax=ax)

### Step Function Pinball Plot {#step-function-pinball-plot data-transition="None"}

<img src="../slides/diagrams/deepgp/step-function-deep-gp-pinball.svg" class="" align="" style="">

In [None]:
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
_ = ax.plot(x, y, 'r.',markersize=10)
_ = ax.set_xlabel('time', fontsize=20)
_ = ax.set_ylabel('acceleration', fontsize=20)
# Set the axis limits for the motorcycle data before they are first used.
xlim = (-20, 80)
ylim = (-180, 120)
ax.set_xlim(xlim)
ax.set_ylim(ylim)
mlai.write_figure(figure=fig, filename='../../slides/diagrams/datasets/motorcycle-helmet.svg', 
            transparent=True, frameon=True)

### Motorcycle Helmet Data {#motorcycle-helmet-data data-transition="None"}

<img src="../slides/diagrams/datasets/motorcycle-helmet.svg" class="" align="" style="">

In [None]:
fig, ax=plt.subplots(figsize=plot.big_wide_figsize)
plot.model_output(m_full, scale=scale, offset=offset, ax=ax, xlabel='time', ylabel='acceleration/$g$', fontsize=20, portion=0.5)
ax.set_ylim(ylim)
ax.set_xlim(xlim)
mlai.write_figure(figure=fig,filename='../../slides/diagrams/gp/motorcycle-helmet-gp.svg', 
            transparent=True, frameon=True)

### Motorcycle Helmet Data GP {#motorcycle-helmet-data-gp data-transition="None"}

<img src="../slides/diagrams/gp/motorcycle-helmet-gp.svg" class="" align="" style="">

In [None]:
fig, ax=plt.subplots(figsize=plot.big_wide_figsize)
plot.model_output(m, scale=scale, offset=offset, ax=ax, xlabel='time', ylabel='acceleration/$g$', fontsize=20, portion=0.5)
ax.set_ylim(ylim)
ax.set_xlim(xlim)
mlai.write_figure(figure=fig, filename='../../slides/diagrams/deepgp/motorcycle-helmet-deep-gp.svg', 
            transparent=True, frameon=True)

### Motorcycle Helmet Data Deep GP {#motorcycle-helmet-data-deep-gp data-transition="None"}

<img src="../slides/diagrams/deepgp/motorcycle-helmet-deep-gp.svg" class="" align="" style="">

In [None]:
fig, ax=plt.subplots(figsize=plot.big_wide_figsize)
plot_model_sample(m, scale=scale, offset=offset, samps=10, ax=ax, xlabel='time', ylabel='acceleration/$g$', portion = 0.5)
ax.set_ylim(ylim)
ax.set_xlim(xlim)

mlai.write_figure(figure=fig, filename='../../slides/diagrams/deepgp/motorcycle-helmet-deep-gp-samples.svg', 
                  transparent=True, frameon=True)

### Motorcycle Helmet Data Deep GP {#motorcycle-helmet-data-deep-gp-1 data-transition="None"}

<img src="../slides/diagrams/deepgp/motorcycle-helmet-deep-gp-samples.svg" class="" align="" style="">

In [None]:
m.visualize(xlim=xlim, ylim=ylim, scale=scale,offset=offset, 
            xlabel="time", ylabel="acceleration/$g$", portion=0.5,
            dataset='motorcycle-helmet',
            diagrams='../../slides/diagrams/deepgp')

### Motorcycle Helmet Data Latent 1 {#motorcycle-helmet-data-latent-1 data-transition="None"}

<img src="../slides/diagrams/deepgp/motorcycle-helmet-deep-gp-layer-0.svg" class="" align="" style="">

### Motorcycle Helmet Data Latent 2 {#motorcycle-helmet-data-latent-2 data-transition="None"}

<img src="../slides/diagrams/deepgp/motorcycle-helmet-deep-gp-layer-1.svg" class="" align="" style="">

In [None]:
fig, ax=plt.subplots(figsize=plot.big_wide_figsize)
m.visualize_pinball(ax=ax, xlabel='time', ylabel='acceleration/$g$', 
                    points=50, scale=scale, offset=offset, portion=0.1)
mlai.write_figure(figure=fig, filename='../../slides/diagrams/deepgp/motorcycle-helmet-deep-gp-pinball.svg', 
                  transparent=True, frameon=True)

### Motorcycle Helmet Pinball Plot {#motorcycle-helmet-pinball-plot data-transition="None"}

<img src="../slides/diagrams/deepgp/motorcycle-helmet-deep-gp-pinball.svg" class="" align="" style="">

In [None]:
fig, ax = plt.subplots(figsize=plot.big_figsize)
ax.plot(data['X'][:, 1], data['X'][:, 2], 'r.', markersize=5)
ax.set_xlabel('x position', fontsize=20)
ax.set_ylabel('y position', fontsize=20)
mlai.write_figure(figure=fig, filename='../../slides/diagrams/datasets/robot-wireless-ground-truth.svg', transparent=True, frameon=True)

### Robot Wireless Ground Truth {#robot-wireless-ground-truth data-transition="None"}

<img src="../slides/diagrams/datasets/robot-wireless-ground-truth.svg" class="" align="" style="">

In [None]:
output_dim=1
fig, ax = plt.subplots(figsize=plot.big_wide_figsize)
_ = ax.plot(x.flatten(), y[:, output_dim], 
            'r.', markersize=5)

ax.set_xlabel('time', fontsize=20)
ax.set_ylabel('signal strength', fontsize=20)
xlim = (-0.2, 1.2)
ylim = (-0.6, 2.0)
ax.set_xlim(xlim)
ax.set_ylim(ylim)

mlai.write_figure(figure=fig, filename='../../slides/diagrams/datasets/robot-wireless-dim-' + str(output_dim) + '.svg', 
            transparent=True, frameon=True)

### Robot WiFi Data {#robot-wifi-data data-transition="None"}

<img src="../slides/diagrams/datasets/robot-wireless-dim-1.svg" class="" align="" style="">

In [None]:
fig, ax=plt.subplots(figsize=plot.big_wide_figsize)
plot_model_output(m_full, output_dim=output_dim, scale=scale, offset=offset, ax=ax, 
                  xlabel='time', ylabel='signal strength', fontsize=20, portion=0.5)
ax.set_ylim(ylim)
ax.set_xlim(xlim)
mlai.write_figure(filename='../../slides/diagrams/gp/robot-wireless-gp-dim-' + str(output_dim)+ '.svg', 
            transparent=True, frameon=True)

### Robot WiFi Data GP {#robot-wifi-data-gp data-transition="None"}

<img src="../slides/diagrams/gp/robot-wireless-gp-dim-1.svg" class="" align="" style="">

In [None]:
fig, ax=plt.subplots(figsize=plot.big_wide_figsize)
plot_model_output(m, output_dim=output_dim, scale=scale, offset=offset, ax=ax, 
                  xlabel='time', ylabel='signal strength', fontsize=20, portion=0.5)
ax.set_ylim(ylim)
ax.set_xlim(xlim)
mlai.write_figure(figure=fig, filename='../../slides/diagrams/deepgp/robot-wireless-deep-gp-dim-' + str(output_dim)+ '.svg', 
                  transparent=True, frameon=True)

### Robot WiFi Data Deep GP {#robot-wifi-data-deep-gp data-transition="None"}

<img src="../slides/diagrams/deepgp/robot-wireless-deep-gp-dim-1.svg" class="" align="" style="">

In [None]:
fig, ax=plt.subplots(figsize=plot.big_wide_figsize)
plot_model_sample(m, output_dim=output_dim, scale=scale, offset=offset, samps=10, ax=ax,
                  xlabel='time', ylabel='signal strength', fontsize=20, portion=0.5)
ax.set_ylim(ylim)
ax.set_xlim(xlim)
mlai.write_figure(figure=fig, filename='../../slides/diagrams/deepgp/robot-wireless-deep-gp-samples-dim-' + str(output_dim)+ '.svg', 
                  transparent=True, frameon=True)

### Robot WiFi Data Deep GP {#robot-wifi-data-deep-gp-1 data-transition="None"}

<img src="../slides/diagrams/deepgp/robot-wireless-deep-gp-samples-dim-1.svg" class="" align="" style="">

### Robot WiFi Data Latent Space {#robot-wifi-data-latent-space data-transition="None"}

<img src="../slides/diagrams/datasets/robot-wireless-ground-truth.svg" class="" align="" style="">

In [None]:
fig, ax = plt.subplots(figsize=plot.big_figsize)
ax.plot(m.layers[-2].latent_space.mean[:, 0], 
        m.layers[-2].latent_space.mean[:, 1], 
        'r.-', markersize=5)

ax.set_xlabel('latent dimension 1', fontsize=20)
ax.set_ylabel('latent dimension 2', fontsize=20)

mlai.write_figure(figure=fig, filename='../../slides/diagrams/deepgp/robot-wireless-latent-space.svg', 
            transparent=True, frameon=True)

### Robot WiFi Data Latent Space {#robot-wifi-data-latent-space-1 data-transition="None"}

<img src="../slides/diagrams/deepgp/robot-wireless-latent-space.svg" class="" align="" style="">

### Motion Capture {#motion-capture data-transition="none"}

-   ‘High five’ data.

-   Model learns structure between two interacting subjects.

### Shared LVM {#shared-lvm data-transition="none"}

<img src="../slides/diagrams/shared.svg" class="" align="" style="">

###  {#section-14 data-transition="none"}

<img class="negate" src="../slides/diagrams/deep-gp-high-five2.png" width="100%" height="auto" align="" style="background:none; border:none; box-shadow:none;">

<small>[Thanks to: Zhenwen Dai and Neil D.
Lawrence]{style="text-align:right"}</small>

In [None]:
rc("font", **{'family':'sans-serif','sans-serif':['Helvetica'],'size':20})
fig, ax = plt.subplots(figsize=plot.big_figsize)
for d in digits:
    ax.plot(m.layer_1.X.mean[labels==d,0],m.layer_1.X.mean[labels==d,1],'.',label=str(d))
_ = plt.legend()
mlai.write_figure(figure=fig, filename="../../slides/diagrams/deepgp/usps-digits-latent.svg", transparent=True)

###  {#section-15 data-transition="none"}

<img src="../slides/diagrams/usps-digits-latent.svg" class="" align="" style="">

In [None]:
fig, ax = plt.subplots(figsize=plot.big_figsize)
for i in range(5):
    for j in range(i):
        dims=[i, j]
        ax.cla()
        for d in digits:
            ax.plot(m.obslayer.X.mean[labels==d,dims[0]],
                 m.obslayer.X.mean[labels==d,dims[1]],
                 '.', label=str(d))
        plt.legend()
        plt.xlabel('dimension ' + str(dims[0]))
        plt.ylabel('dimension ' + str(dims[1]))
        mlai.write_figure(figure=fig, filename="../../slides/diagrams/deepgp/usps-digits-hidden-" + str(dims[0]) + '-' + str(dims[1]) + '.svg', transparent=True)

###  {#section-16 data-transition="none"}

<img src="../slides/diagrams/usps-digits-hidden-1-0.svg" class="" align="" style="">

###  {#section-17 data-transition="none"}

<img src="../slides/diagrams/usps-digits-hidden-2-0.svg" class="" align="" style="">

###  {#section-18 data-transition="none"}

<img src="../slides/diagrams/usps-digits-hidden-3-0.svg" class="" align="" style="">

###  {#section-19 data-transition="none"}

<img src="../slides/diagrams/usps-digits-hidden-4-0.svg" class="" align="" style="">

In [None]:
yt = m.predict(x)
fig, axs = plt.subplots(rows,cols,figsize=(10,6))
for i in range(rows):
    for j in range(cols):
        #v = np.random.normal(loc=yt[0][i*cols+j, :], scale=np.sqrt(yt[1][i*cols+j, :]))
        v = yt[0][i*cols+j, :]
        axs[i,j].imshow(v.reshape(28,28), 
                        cmap='gray', interpolation='none',
                        aspect='equal')
        axs[i,j].set_axis_off()
mlai.write_figure(figure=fig, filename="../../slides/diagrams/deepgp/digit-samples-deep-gp.svg", transparent=True)

###  {#section-20 data-transition="none"}

<img src="../slides/diagrams/digit-samples-deep-gp.svg" class="" align="" style="">

### Deep Health

<div style="text-align:center">

<img src="../slides/diagrams/deep-health.svg" class="" align="" style="">

</div>

### From NIPS 2017

-   *Gaussian process based nonlinear latent structure discovery in
    multivariate spike train data* @Anqi:gpspike2017
-   *Doubly Stochastic Variational Inference for Deep Gaussian
    Processes* @Salimbeni:doubly2017
-   *Deep Multi-task Gaussian Processes for Survival Analysis with
    Competing Risks* @Alaa:deep2017
-   *Counterfactual Gaussian Processes for Reliable Decision-making and
    What-if Reasoning* @Schulam:counterfactual17

### Some Other Works

-   *Deep Survival Analysis* @Ranganath-survival16
-   *Recurrent Gaussian Processes* @Mattos:recurrent15
-   *Gaussian Process Based Approaches for Survival Analysis*
    @Saul:thesis2016

### Uncertainty Quantification

-   Deep nets are a powerful approach to images, speech, language.
-   Proposal: Deep GPs may also be a great approach, but better to
    deploy according to natural strengths.

### Uncertainty Quantification

-   Probabilistic numerics, surrogate modelling, emulation, and UQ.
-   Not a fan of AI as a term.
-   But we are faced with increasing amounts of *algorithmic decision
    making*.

### ML and Decision Making

-   When trading off decisions: compute or acquire data?
-   There is a critical need for uncertainty.

### Uncertainty Quantification

> Uncertainty quantification (UQ) is the science of quantitative
> characterization and reduction of uncertainties in both computational
> and real world applications. It tries to determine how likely certain
> outcomes are if some aspects of the system are not exactly known.

-   Interaction between physical and virtual worlds of major interest.
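
One simple way to read the definition above is Monte Carlo propagation: sample the quantity that is not exactly known, push each sample through the system, and count how often an outcome occurs. The sketch below is only an illustration; the toy `simulator`, the Gaussian uncertainty on `k`, and the threshold are all assumptions made up for the example.

```python
import random


def simulator(x, k):
    """Toy deterministic system whose output depends on an uncertain k."""
    return k * x ** 2


# Assumed uncertainty: k is only known to be roughly 1.0 +/- 0.1.
rng = random.Random(0)
samples = [simulator(2.0, rng.gauss(1.0, 0.1)) for _ in range(10000)]

# Estimated probability that the output exceeds an (arbitrary) threshold.
threshold = 4.4
prob = sum(s > threshold for s in samples) / len(samples)
print(prob)
```

A Gaussian process emulator plays the role of `simulator` when each call to the real system is expensive.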

### Contrast

-   Simulation in *reinforcement learning*.
-   Known as *data augmentation*.
-   Newer, similar in spirit, but typically ignores uncertainty.

### Example: Formula One Racing

-   Designing an F1 Car requires CFD, Wind Tunnel, Track Testing etc.

-   How to combine them?

### Mountain Car Simulator

<div style="text-align:center">

<img class="negate" src="../slides/diagrams/uq/mountaincar.png" width="" height="auto" align="center" style="background:none; border:none; box-shadow:none;">

</div>

### Car Dynamics

$$\inputVector_{t+1} = \mappingFunction(\inputVector_{t},\textbf{u}_{t})$$

where $\textbf{u}_t$ is the action force and $\inputVector_t = (p_t, v_t)$
is the vehicle state (position and velocity).
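
As a rough illustration of this state update, the sketch below implements one step of a mountain-car-style simulator in plain Python. The constants (`POWER`, `GRAVITY`, the position and speed limits) are assumptions modelled on the commonly used continuous mountain car benchmark, not values taken from these slides, and `step` is a hypothetical helper name.

```python
import math

# Assumed constants, modelled on the usual continuous mountain car benchmark.
POWER = 0.0015        # scales the action force
GRAVITY = 0.0025      # scales the slope term
MIN_POSITION, MAX_POSITION = -1.2, 0.6
MAX_SPEED = 0.07


def step(state, u):
    """Map the state (p_t, v_t) and action u_t to the next state."""
    p, v = state
    u = max(-1.0, min(1.0, u))                     # clip the action force
    v = v + POWER * u - GRAVITY * math.cos(3 * p)  # velocity update
    v = max(-MAX_SPEED, min(MAX_SPEED, v))
    p = p + v                                      # position update
    p = max(MIN_POSITION, min(MAX_POSITION, p))
    if p == MIN_POSITION and v < 0:                # car stops at the left wall
        v = 0.0
    return (p, v)


state = (-0.5, 0.0)
for _ in range(3):
    state = step(state, u=1.0)
print(state)
```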

### Policy

-   Assume the policy is linear with parameters $\boldsymbol{\theta}$

$$\pi(\inputVector,\theta)= \theta_0 + \theta_p p + \theta_v v.$$
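
The linear policy can be written in a couple of lines. This is a minimal sketch: `linear_policy` is a hypothetical name and the parameter values are purely illustrative.

```python
def linear_policy(state, theta):
    """Return the action theta_0 + theta_p * p + theta_v * v."""
    p, v = state
    theta_0, theta_p, theta_v = theta
    return theta_0 + theta_p * p + theta_v * v


theta = (0.1, -0.5, 20.0)   # illustrative values only
action = linear_policy((-0.5, 0.01), theta)
print(action)
```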

### Emulate the Mountain Car

-   Goal is to find $\theta^*$ such that

$$\theta^* = \arg \max_{\theta} R_T(\theta).$$

-   Reward is computed as 100 for reaching the target, minus the squared
    sum of actions.
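
A small sketch of the reward and the arg max, under two assumptions: the action penalty is read as the sum of squared actions, and the search is a brute-force comparison over a handful of candidates (Bayesian optimization would replace this step in practice). The `rollouts` dictionary is hypothetical data invented for the example.

```python
def total_reward(actions, reached_target):
    """100 if the target is reached, minus a penalty on the actions.

    Assumption: the penalty is the sum of squared actions.
    """
    bonus = 100.0 if reached_target else 0.0
    return bonus - sum(a ** 2 for a in actions)


# Hypothetical rollouts: theta -> (actions taken, whether target was reached).
rollouts = {
    (0.0, 0.0, 0.0): ([0.0] * 5, False),
    (0.0, 0.0, 30.0): ([1.0] * 5, True),
    (1.0, 0.0, 0.0): ([0.5] * 4, True),
}

# Brute-force arg max over the candidate parameter vectors.
theta_star = max(rollouts, key=lambda th: total_reward(*rollouts[th]))
print(theta_star, total_reward(*rollouts[theta_star]))
```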

In [None]:
HTML(anim.to_jshtml())

In [None]:
mc.save_frames(frames, 
                  diagrams='../slides/diagrams/uq', 
                  filename='mountain_car_random.html')

### Random Linear Controller

<iframe src="../slides/diagrams/uq/mountain_car_random.html" width="1024" height="768" allowtransparency="true" frameborder="0">
</iframe>

In [None]:
HTML(anim.to_jshtml())

In [None]:
mc.save_frames(frames, 
                  diagrams='../slides/diagrams/uq', 
                  filename='mountain_car_simulated.html')

### Best Controller after 50 Iterations of Bayesian Optimization

<iframe src="../slides/diagrams/uq/mountain_car_simulated.html" width="1024" height="768" allowtransparency="true" frameborder="0">
</iframe>

### Data Efficient Emulation

-   Standard Bayesian optimization ignored the *dynamics* of the car.

-   For more data efficiency, first *emulate* the dynamics.

-   Then do Bayesian optimization of the *emulator*.

-   Use a Gaussian process to model $$\Delta v_{t+1} = v_{t+1} - v_{t}$$
    and $$\Delta p_{t+1} = p_{t+1} - p_{t}$$

-   Two processes, one with mean $v_{t}$ and one with mean $p_{t}$.
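
The data preparation for such an emulator can be sketched as follows. The helper names are hypothetical, and `predict_delta_p` and `predict_delta_v` stand in for the posterior means of two fitted Gaussian process models, which are not shown here.

```python
def make_delta_dataset(states, actions):
    """Build inputs (p_t, v_t, u_t) and targets (delta p, delta v)."""
    inputs, delta_p, delta_v = [], [], []
    for t, u in enumerate(actions):
        p, v = states[t]
        p_next, v_next = states[t + 1]
        inputs.append((p, v, u))
        delta_p.append(p_next - p)
        delta_v.append(v_next - v)
    return inputs, delta_p, delta_v


def emulate_step(state, u, predict_delta_p, predict_delta_v):
    """Next state = current state (the mean) + predicted change."""
    p, v = state
    x = (p, v, u)
    return (p + predict_delta_p(x), v + predict_delta_v(x))


# Toy trajectory, purely illustrative.
states = [(0.0, 0.0), (0.1, 0.1), (0.3, 0.2)]
actions = [1.0, 1.0]
inputs, delta_p, delta_v = make_delta_dataset(states, actions)

# Constant predictors stand in for the two GP posterior means.
next_state = emulate_step((0.3, 0.2), 1.0, lambda x: 0.2, lambda x: 0.1)
print(inputs, delta_p, delta_v, next_state)
```

Modelling the change rather than the next state directly is what gives each process the current value as its mean.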

### Emulator Training

-   Used 500 randomly selected points to train emulators.

-   Can make the process more efficient through *experimental design*.

In [None]:
control = mc.plot_control(velocity_model)
interact(control.plot_slices, control=(-1, 1, 0.05))

In [None]:
mc.emu_sim_comparison(env, controller_gains, [position_model, velocity_model], 
                      max_steps=500, diagrams='../slides/diagrams/uq')

### Comparison of Emulation and Simulation

<img src="../slides/diagrams/uq/emu_sim_comparison.svg" class="" align="" style="">

In [None]:
HTML(anim.to_jshtml())

In [None]:
mc.save_frames(frames, 
                  diagrams='../slides/diagrams/uq', 
                  filename='mountain_car_emulated.html')

### Data Efficiency

-   Our emulator used only 500 calls to the simulator.

-   Optimizing the simulator directly required 37,500 calls to the
    simulator.

### Best Controller using Emulator of Dynamics

<iframe src="../slides/diagrams/uq/mountain_car_emulated.html" width="1024" height="768" allowtransparency="true" frameborder="0">
</iframe>

500 calls to the simulator vs 37,500 calls to the simulator.

$$\mappingFunction_i\left(\inputVector\right) = \rho\mappingFunction_{i-1}\left(\inputVector\right) + \delta_i\left(\inputVector \right)$$

### Multi-Fidelity Emulation

$$\mappingFunction_i\left(\inputVector\right) = \mappingFunctionTwo_{i}\left(\mappingFunction_{i-1}\left(\inputVector\right)\right) + \delta_i\left(\inputVector \right).$$
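
The difference between the linear form (scaling the lower fidelity by $\rho$) and the nonlinear form (passing it through another function) can be sketched with plain Python closures. Everything here is illustrative: `g` denotes the outer mapping, and `f_low`, `rho`, `g` and `delta` are toy choices, whereas in the emulator each would be a Gaussian process or a learned quantity.

```python
import math


def linear_fidelity(f_low, rho, delta):
    """f_i(x) = rho * f_{i-1}(x) + delta_i(x)."""
    return lambda x: rho * f_low(x) + delta(x)


def nonlinear_fidelity(f_low, g, delta):
    """f_i(x) = g_i(f_{i-1}(x)) + delta_i(x)."""
    return lambda x: g(f_low(x)) + delta(x)


f_low = math.sin                      # toy low-fidelity function


def delta(x):
    return 0.1 * x                    # toy correction term


f_lin = linear_fidelity(f_low, rho=2.0, delta=delta)
f_nl = nonlinear_fidelity(f_low, g=lambda f: f ** 2, delta=delta)

x = math.pi / 2
print(f_lin(x), f_nl(x))
```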

In [None]:
HTML(anim.to_jshtml())

In [None]:
mc.save_frames(frames, 
                  diagrams='../slides/diagrams/uq', 
                  filename='mountain_car_multi_fidelity.html')

### Best Controller with Multi-Fidelity Emulator

<iframe src="../slides/diagrams/uq/mountain_car_multi_fidelity.html" width="1024" height="768" allowtransparency="true" frameborder="0">
</iframe>

250 observations of high fidelity simulator and 250 of the low fidelity
simulator.

### Acknowledgments

Stefanos Eleftheriadis, John Bronskill, Hugh Salimbeni, Rich Turner,
Zhenwen Dai, Javier Gonzalez, Andreas Damianou, Mark Pullin.

### Ongoing Code

-   Powerful framework but
-   Software isn't there yet.
-   Our focus: Gaussian Processes driven by MXNet
-   Composition of GPs, Neural Networks, Other Models

### Thanks!

-   twitter: @lawrennd
-   blog:
    [http://inverseprobability.com](http://inverseprobability.com/blog.html)

### References {.allowframebreaks}