# Lecture 10 - Gaussian Process Regression

$
% DEFINITIONS
\newcommand{\bff}{\mathbf{f}}
\newcommand{\bm}{\mathbf{m}}
\newcommand{\bk}{\mathbf{k}}
\newcommand{\bx}{\mathbf{x}}
\newcommand{\by}{\mathbf{y}}
\newcommand{\bz}{\mathbf{z}}
\newcommand{\bA}{\mathbf{A}}
\newcommand{\bB}{\mathbf{B}}
\newcommand{\bC}{\mathbf{C}}
\newcommand{\bD}{\mathbf{D}}
\newcommand{\bI}{\mathbf{I}}
\newcommand{\bK}{\mathbf{K}}
\newcommand{\bL}{\mathbf{L}}
\newcommand{\bM}{\mathbf{M}}
\newcommand{\bX}{\mathbf{X}}
\newcommand{\bY}{\mathbf{Y}}
\newcommand{\bTheta}{\mathbf{\Theta}}
\newcommand{\calX}{\mathcal{X}}
\newcommand{\bLambda}{\boldsymbol{\Lambda}}
\newcommand{\bSigma}{\boldsymbol{\Sigma}}
\newcommand{\bmu}{\boldsymbol{\mu}}
\newcommand{\calN}{\mathcal{N}}
\newcommand{\calD}{\mathcal{D}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\C}{\mathbb{C}}
\newcommand{\Rd}{\R^d}
\newcommand{\Rdd}{\R^{d\times d}}
\newcommand{\bzero}{\mathbf{0}}
\newcommand{\GP}{\mbox{GP}}
% END OF DEFINITIONS
$ 





## Objectives

+ Quick recap of Gaussian processes (GPs).
+ Conditioning a GP on observed measurements.
+ Training a GP by maximizing the likelihood.

## Reading

+ Please read [this](http://www.kyb.mpg.de/fileadmin/user_upload/files/publications/pdfs/pdf2903.pdf) OR watch [this video lecture](http://videolectures.net/mlss03_rasmussen_gp/?q=MLSS).



Recall that in the previous lecture we discussed the following:

+ What is prior knowledge?

+ What is a Gaussian process (GP)?

+ What are the properties of the mean and covariance functions of a Gaussian process, and what kind of priors can we encode into a GP through the mean and the covariance kernel?

+ How do we sample from a GP?

In this lecture, we discuss how to develop response functions that approximate a generic black-box computer code (say $f(\cdot)$) in a manner that is compatible with our prior beliefs about the model. We do so by using Bayes' rule and the Gaussian process regression method. Remember that our goal is to be able to propagate uncertainty in the inputs.

We saw in the previous lecture that one's prior knowledge about the response can be modeled in terms of a generic GP. Let that prior state of knowledge be represented as follows: 
\begin{equation}
f(\cdot) | m(\cdot), k(\cdot, \cdot) \sim \GP\left(f(\cdot) | m(\cdot), k(\cdot, \cdot) \right),
\end{equation}

where the terms have their usual meaning, i.e., $f(\cdot)$ is a generic response surface, $m(\cdot)$ is the prior mean function, and $k(\cdot, \cdot)$ is the covariance kernel parameterized by a set of hyperparameters $\bTheta$. Specifically, in the case of the squared exponential covariance kernel, $\bTheta = \{s, l_1, l_2, \cdots, l_d\}$, where $d$ is the dimension of the input.
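
As a concrete illustration of the hyperparameters $\bTheta$, here is a minimal NumPy sketch of a squared exponential kernel with one length scale per input dimension. The function name `squared_exponential`, the argument names, and the convention that $s$ is a signal standard deviation (so that $k(\bx, \bx) = s^2$) are assumptions made for this example; other parameterizations are common.

```python
import numpy as np


def squared_exponential(X1, X2, s=1.0, ell=1.0):
    """Squared exponential kernel (hypothetical helper for illustration).

    Assumed form: k(x, x') = s^2 * exp(-0.5 * sum_j ((x_j - x'_j) / l_j)^2).
    X1 has shape (n1, d), X2 has shape (n2, d); ell is a scalar or shape (d,).
    Returns the (n1, n2) matrix of kernel evaluations.
    """
    # Rescale every input dimension by its length scale.
    X1 = np.atleast_2d(X1) / ell
    X2 = np.atleast_2d(X2) / ell
    # Pairwise squared distances via |a - b|^2 = |a|^2 + |b|^2 - 2 a.b
    sq_dist = (
        np.sum(X1**2, axis=1)[:, None]
        + np.sum(X2**2, axis=1)[None, :]
        - 2.0 * X1 @ X2.T
    )
    # Clip tiny negative values caused by round-off.
    return s**2 * np.exp(-0.5 * np.maximum(sq_dist, 0.0))


# Three inputs in R^2, with a different length scale per dimension.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = squared_exponential(X, X, s=2.0, ell=np.array([1.0, 2.0]))
```

With this convention, the diagonal of `K` equals $s^2$, and a larger length scale $l_j$ makes the response vary more slowly along input dimension $j$.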


Now, assume that we make $n$ _measurements_ or _simulations_ at input locations $\bx_1, \bx_2, \cdots, \bx_n$ such that $\bx_i \in \R^{d}$. The corresponding observed outputs are $y_1, y_2, \cdots, y_n$, such that $y_i \in \R$. We write $\bX = \{\bx_1, \bx_2, \cdots, \bx_n\}$ and $\bY=\{y_1, y_2, \cdots, y_n\}$. Abusing mathematical notation slightly, we use the symbol $\calD$ to denote $\bX$ and $\bY$ collectively. We refer to $\calD$ as the _observed data_. How does the observed data $\calD$ affect our state of knowledge about the response surface? 

The answer lies in a straightforward application of Bayes' rule and Kolmogorov's theorem on the existence of random fields. 

Our new state of knowledge about the response function, conditioned upon the observed data, is another GP which can be expressed as follows: 

$$
f(\cdot)|\calD \sim \GP\left(f(\cdot) | m^{*}(\cdot;\calD), k^{*}(\cdot, \cdot; \calD)\right)
$$

with mean function: 
$$
m^{*}(\bx) = m(\bx) + \bk(\bx, \bX) \bK(\bX, \bX)^{-1}(\bY-m(\bX))
$$

and covariance function:
$$
k^{*}(\bx, \bx') = k(\bx, \bx') - \bk(\bx, \bX)\bK(\bX, \bX)^{-1}\bk(\bX, \bx')
$$
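
The posterior mean and covariance formulas can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not a reference implementation: the names `se_kernel` and `gp_posterior` are hypothetical, the prior mean is taken to be zero in the usage example, and a Cholesky factorization with a small `jitter` on the diagonal replaces the explicit inverse $\bK(\bX, \bX)^{-1}$ for numerical stability.

```python
import numpy as np


def se_kernel(X1, X2, s=1.0, ell=1.0):
    """Squared exponential kernel (assumed form, s is a standard deviation)."""
    diff = (X1[:, None, :] - X2[None, :, :]) / ell
    return s**2 * np.exp(-0.5 * np.sum(diff**2, axis=-1))


def gp_posterior(X_new, X, Y, mean_fn, kernel, jitter=1e-10):
    """Posterior mean m*(x) and covariance k*(x, x') at the rows of X_new."""
    # K(X, X); the jitter is an assumed numerical safeguard, not part of the model.
    K = kernel(X, X) + jitter * np.eye(X.shape[0])
    k_cross = kernel(X_new, X)  # k(x, X), shape (n_new, n)
    L = np.linalg.cholesky(K)  # K = L L^T
    # alpha = K^{-1} (Y - m(X)), computed with two triangular-system solves.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y - mean_fn(X)))
    m_star = mean_fn(X_new) + k_cross @ alpha
    # V = L^{-1} k(X, x'), so V^T V = k(x, X) K^{-1} k(X, x').
    V = np.linalg.solve(L, k_cross.T)
    k_post = kernel(X_new, X_new) - V.T @ V
    return m_star, k_post


# Usage: three noise-free observations of a toy function on [0, 1].
X = np.array([[0.0], [0.5], [1.0]])
Y = np.sin(3.0 * X[:, 0])
zero_mean = lambda Z: np.zeros(Z.shape[0])  # assumed prior mean m(x) = 0
kernel = lambda A, B: se_kernel(A, B, s=1.0, ell=0.3)
X_new = np.linspace(0.0, 1.0, 5)[:, None]
m_star, k_post = gp_posterior(X_new, X, Y, zero_mean, kernel)
```

At the observed inputs the posterior mean reproduces the data and the posterior variance collapses to (numerically) zero, which is the expected behavior when the measurements are noise-free.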


The posterior distribution of the hyperparameters is given by:
$$
p(\bTheta|\calD) = \frac{p(\calD|\bTheta)p(\bTheta)}{p(\calD)}
$$


where
$$
p(\calD|\bTheta) = p(\bY|\bX, \bTheta) = \calN\left(\bY | m(\bX; \bTheta), \bK(\bX, \bX; \bTheta)\right)
$$
is called the _likelihood_ of the observed data $\calD$, and
$$
p(\calD) = \int p(\calD|\bTheta) p(\bTheta) \, d\bTheta
$$
is called the _evidence_.
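
To make the likelihood concrete, the sketch below evaluates $\log p(\bY|\bX, \bTheta)$ for a zero prior mean and a squared exponential kernel, then picks the best length scale from a small grid as a crude stand-in for likelihood maximization. The function name `log_likelihood`, the zero mean, the `jitter` term, the toy data, and the grid values are all assumptions made for illustration; in practice one would typically use a gradient-based optimizer.

```python
import numpy as np


def log_likelihood(X, Y, s, ell, jitter=1e-6):
    """log N(Y | 0, K(X, X; Theta)) for an assumed squared exponential kernel."""
    n = X.shape[0]
    diff = (X[:, None, :] - X[None, :, :]) / ell
    # Jitter on the diagonal is an assumed numerical safeguard.
    K = s**2 * np.exp(-0.5 * np.sum(diff**2, axis=-1)) + jitter * np.eye(n)
    L = np.linalg.cholesky(K)  # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))  # K^{-1} Y
    # log N = -0.5 Y^T K^{-1} Y - 0.5 log|K| - 0.5 n log(2 pi),
    # where 0.5 log|K| = sum(log(diag(L))).
    return (
        -0.5 * Y @ alpha
        - np.sum(np.log(np.diag(L)))
        - 0.5 * n * np.log(2.0 * np.pi)
    )


# Toy data: six noise-free evaluations of a smooth function on [0, 2].
X = np.linspace(0.0, 2.0, 6)[:, None]
Y = np.sin(3.0 * X[:, 0])

# Crude "training": evaluate the log-likelihood on a grid of length scales.
ells = np.array([0.05, 0.1, 0.3, 1.0])
values = np.array([log_likelihood(X, Y, s=1.0, ell=ell) for ell in ells])
best_ell = ells[np.argmax(values)]
```

Working with the logarithm avoids numerical underflow, and the Cholesky factor supplies both the linear solve and the log-determinant.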


