In [1]:
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
        'width': 1920,
        'height': 1080,
        'scroll': True,
})
%config InlineBackend.figure_format = 'retina'
from IPython.lib.display import YouTubeVideo

# Week 09, ASTR 596: Fundamentals of Data Science


## Gaussian Processes contd.

### Gautham Narayan 
##### <gsn@illinois.edu>

## Let's step back and go over GPs a bit more slowly.

#### First, conceptually why even bother with GPs

<img src="GP_as_priors_on_functions.png">


## You can’t get anywhere without making some assumptions

### You did this in the first half of the semester as well - you assumed a model (likelihood, priors...) 

# Where we were interpreting priors as probability distributions from which the observations could conceivably be drawn, here we are treating priors as probability distributions on functions that could conceivably be consistent with the observations

#### GPs are a nice way of expressing this ‘prior on functions’ idea

Under certain assumptions GPs can answer the following questions:
- Here’s where the function will **most likely be** (expected function)
- Here are some **examples** of what it might look like (sampling from the posterior distribution)
- Here is a prediction of what you’ll see if you evaluate your function at x’, **with confidence**

### Recap:

In the first half of the semester, you got familiar with observations drawn from a distribution, e.g.

$y_1$, drawn from a Gaussian distribution with mean $\mu$ and variance $\sigma^2$:

\begin{align}
p(y_1 | \mu, \sigma) = \frac{1}{\sqrt{2 \pi} \sigma} \exp \left[ - \frac{(y_1-\mu)^2}{2 \sigma^2} \right] 
\end{align}

i.e.

### $$y_1 \sim \mathcal{N}(\mu,\sigma^2)$$

<img src="1DGauss.png">

Now consider a pair of variables $y_1$ and $y_2$, drawn from a *bivariate Gaussian distribution*. The *joint probability density* for $y_1$ and $y_2$ is:

### $$
\left[ \begin{array}{l} y_1 \\ y_2 \end{array} \right] \sim \mathcal{N} \left(
\left[ \begin{array}{l} \mu_1 \\ \mu_2 \end{array}  \right] , 
\left[ \begin{array}{ll} 
\sigma_1^2 & C \\
C & \sigma_2^2 
\end{array}  \right] 
\right),
$$

where 

### $$C = {\rm cov}(y_1,y_2)$$ 

is the *covariance* between $y_1$ and $y_2$.

<img src="2DGauss.png">

In the first half of the semester, we dealt with independent variables i.e.

### $$P(y_1 \cap y_2) = P(y_1) \cdot P(y_2) $$

and consequently


\begin{align}
P(y_2|y_1) = \frac{P(y_1 \cap y_2)}{P(y_1)} = P(y_2)
\end{align}

If two variables are independent, then $C = 0$ (remember that the converse isn't true in general). 

The observations are *uncorrelated* so measuring $y_1$ doesn't teach us anything about $y_2$.

(If in addition $\mu_1 = \mu_2$ and $\sigma_1 = \sigma_2$ the variables are i.i.d)

# With time-series, $C \ne 0$ 

If we know the value of $y_1$, the probability density for $y_2$ collapses to the *conditional distribution* of $y_2$ given $y_1$:

### $$
p(y_2 \mid y_1) = \mathcal{N} \left( \mu_2 + C (y_1-\mu_1)/\sigma_1^2, \sigma_2^2-C^2/\sigma_1^2 \right).
$$
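A quick numerical check of the two-variable conditional, as a minimal sketch with made-up numbers (the values of $\mu_1$, $\mu_2$, $\sigma_1$, $\sigma_2$ and $C$ below are illustrative assumptions, and `conditional_bivariate` is a hypothetical helper). The conditional mean is $\mu_2 + C(y_1-\mu_1)/\sigma_1^2$ and the conditional variance is $\sigma_2^2 - C^2/\sigma_1^2$. The Monte Carlo part uses the fact that, for jointly Gaussian variables, a straight-line regression of $y_2$ on $y_1$ has slope $C/\sigma_1^2$ and residual variance equal to the conditional variance.

```python
import numpy as np

def conditional_bivariate(y1, mu1, mu2, sigma1, sigma2, C):
    """Mean and variance of p(y2 | y1) for a bivariate Gaussian."""
    mean = mu2 + C * (y1 - mu1) / sigma1**2
    var = sigma2**2 - C**2 / sigma1**2
    return mean, var

# Illustrative values only; |C| < sigma1 * sigma2 keeps the covariance valid.
mu1, mu2, sigma1, sigma2, C = 0.0, 1.0, 2.0, 1.5, 2.4
cond_mean, cond_var = conditional_bivariate(1.0, mu1, mu2, sigma1, sigma2, C)

# Monte Carlo cross-check: draw correlated pairs and regress y2 on y1.
rng = np.random.default_rng(0)
cov = np.array([[sigma1**2, C], [C, sigma2**2]])
draws = rng.multivariate_normal([mu1, mu2], cov, size=200_000)
slope, intercept = np.polyfit(draws[:, 0], draws[:, 1], 1)
resid_var = np.var(draws[:, 1] - (slope * draws[:, 0] + intercept))

print("analytic mean, variance:", cond_mean, cond_var)
print("regression slope, residual variance:", slope, resid_var)
```

With these numbers the analytic values are approximately 1.6 and 0.81, and the sampled slope and residual variance should land close to $C/\sigma_1^2 = 0.6$ and 0.81. Knowing $y_1$ shrinks the variance of $y_2$ from $\sigma_2^2 = 2.25$ to about 0.81.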

### Consider $N$ variables drawn from a multivariate Gaussian distribution:

### $$
\boldsymbol{y} \sim \mathcal{N} (\boldsymbol{\mu},\boldsymbol{\Sigma})
$$

where 

### $$\boldsymbol{y} = (y_1,y_2,\ldots,y_N)^T$$

### $$\boldsymbol{\mu} = (\mu_1,\mu_2,\ldots,\mu_N)^T$$ 


is the *mean vector*, and $\boldsymbol{\Sigma}$ is an $N \times N$ positive semi-definite *covariance matrix*, with elements 

### $$\Sigma_{ij}={\rm cov}(y_i,y_j)$$



### And then the likelihood generalizes from 1D:

\begin{align}
p(y_1 | \mu, \sigma) = \frac{1}{\sqrt{2 \pi} \sigma} \exp \left[ - \frac{(y_1-\mu)^2}{2 \sigma^2} \right] 
\end{align}

### to ND:


\begin{align}
p(\boldsymbol{y} | \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2 \pi)^N |\Sigma|} } \exp \left[ -\frac{1}{2} (\boldsymbol{y} - \boldsymbol{\mu})^T \Sigma^{-1} (\boldsymbol{y} - \boldsymbol{\mu}) \right] 
\end{align}
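The $N$-dimensional density can be sketched in a few lines of NumPy. This is a minimal illustration, not a replacement for a library routine: `mvn_logpdf` is a hypothetical helper name, it works in log space, and it assumes $\Sigma$ is strictly positive definite so that a Cholesky factor exists. The two checks confirm that $N = 1$ recovers the 1D formula and that a diagonal $\Sigma$ factorizes into independent 1D Gaussians.

```python
import numpy as np

def mvn_logpdf(y, mu, Sigma):
    """Log of the N-dimensional Gaussian density, using a Cholesky factor."""
    y = np.atleast_1d(np.asarray(y, dtype=float))
    mu = np.atleast_1d(np.asarray(mu, dtype=float))
    Sigma = np.atleast_2d(np.asarray(Sigma, dtype=float))
    n = y.size
    L = np.linalg.cholesky(Sigma)               # Sigma = L L^T
    z = np.linalg.solve(L, y - mu)              # z.z = (y-mu)^T Sigma^{-1} (y-mu)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))  # log |Sigma|
    return -0.5 * (z @ z + log_det + n * np.log(2.0 * np.pi))

# Check 1: N = 1 reduces to the familiar 1D Gaussian.
y1, mu, sigma = 0.3, 1.0, 2.0
logp_1d = -0.5 * ((y1 - mu) / sigma) ** 2 - np.log(np.sqrt(2.0 * np.pi) * sigma)
logp_nd = mvn_logpdf([y1], [mu], [[sigma**2]])

# Check 2: a diagonal covariance means the joint density factorizes.
logp_diag = mvn_logpdf([0.3, -0.5], [1.0, 0.0], np.diag([4.0, 1.0]))
logp_sum = mvn_logpdf([0.3], [1.0], [[4.0]]) + mvn_logpdf([-0.5], [0.0], [[1.0]])

print(np.isclose(logp_1d, logp_nd), np.isclose(logp_diag, logp_sum))
```

Solving against the Cholesky factor avoids forming $\Sigma^{-1}$ explicitly, which matters once the off-diagonal terms are non-zero and $N$ gets large.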

This works because:

<img src="gaussians_all_the_way_down.jpg">

# A Gaussian process is an extension of this concept to infinite $N$.

# This gives rise to a probability distribution over functions, rather than finite $N$ samples. 

<img src="gp.png">

# Informally - infinitely long vector ~ function

Again, for a finite number of $y$ values drawn from a multivariate normal distribution:

### $$
\boldsymbol{y} \sim \mathcal{N} (\boldsymbol{\mu},\boldsymbol{\Sigma})
$$

This clearly doesn't make sense for infinite $N$, but the essential feature remains the same:
### A Gaussian process is completely specified by its *mean function* and *covariance function*.  

## Incorporating observational error is similar to what you did in the past as well:

### $$ y \sim f(t) + \epsilon$$ 

### with deviations from the truth related to the observational uncertainties

### $$ \epsilon \sim \mathcal{N}(0, \sigma_y^2) $$


### except now, $f(t)$ is not a function of some parameters, but is itself drawn from a distribution over functions:

###  $$ f(t) \sim \mathcal{GP}(m(t), k(t,t'))$$

### where I'm switching from $\mu$ to $m(t)$ and $\Sigma$ to $k(t, t')$ just to make explicit that these are not vectors.

I'm using $k$ because this function that describes the covariance between time $t$ and $t'$ is called a **kernel** function. 

# Probabilistic Graphical Model for a GP


### Recall:

A **probabilistic graphical model** (PGM) is a very useful way of visualizing a generative model.
* They sketch out the procedure for how one would generate mock data in practice.
* They illustrate the interdependence of model parameters, and the dependence of data on parameters.
* _They also (therefore) represent a conditional factorization of the PDF for all the data and model parameters._

Ingredients of a PGM:
* **Nodes** represent PDFs for parameters
* **Edges** represent conditional relationships
* **Plates** represent repeated model components whose contents are **conditionally independent**

Types of nodes:
* **Circles** represent a PDF. This parameter is a *stochastic* function of the parameters feeding into it.
* **Points** represent a delta-function PDF. This parameter is a *deterministic* function of the parameters feeding into it.
* **Double circles** (or shading) indicate measured data. They are stochastic in the context of generating mock data, but fixed in the context of parameter inference.

# If we were dealing with i.i.d. data

### $$ y \sim f(t) + \epsilon$$ 

### with deviations from the truth related to the observational uncertainties

### $$ \epsilon \sim \mathcal{N}(0, \sigma_y^2) $$


<img src="pgm_conditionally_independent.png">

# With time-series the data are not conditionally independent

i.e. you don't have a nice plate:

<img src="gp_pgm.png">

From [Rasmussen & Williams (aka the GP bible)](http://www.gaussianprocess.org/gpml/chapters/RW.pdf)

### We don't actually observe the function  $f$

### As you've seen, there isn't one single function, but infinitely many for a specific choice of $m$, $k$

### We marginalize over them to find the posterior mean - $f$ is behaving like a parameter 

### The parameters of $m$ and $k$, which actually specify $f$, are called "hyperparameters"

### The interesting bit here is the covariance function/kernel, $k$ (we can always recenter the data to have mean = 0)

# Thankfully, in the real world, we only have a finite number of observations

Previously, we saw:

### $$\Sigma_{ij}={\rm cov}(y_i,y_j)$$

We don't have a parametric model for $y$ anymore, but that's OK, we can write down a parametric model for the covariance itself, i.e.:

### $$
\mathrm{cov}(y(t),y(t'))=k(t,t') $$


That's helpful to do, because with finite observations:

### $$
\mathrm{cov}(y_i,y_j)=k(t_i,t_j)
$$

# So we don't have a parametrized model, but we do have a parametrized covariance - what can we do with this thing?

## The prior

Now consider a finite set of observations: inputs $\boldsymbol{t}$, with corresponding outputs $\boldsymbol{y}$.

The *joint distribution* of $\boldsymbol{y}$ given $\boldsymbol{t}$, $m$ and $k$ is

### $$
\mathrm{p}(\boldsymbol{y} \mid \boldsymbol{t},m,k) = \mathcal{N}( \boldsymbol{m},K),
$$

where again, $\boldsymbol{m}=m(\boldsymbol{t})$ is the *mean vector* 

and $K$ is the *covariance matrix*, with elements $K_{ij} = k(t_i,t_j)$.
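To make the prior concrete, here is a minimal sketch that builds $K$ from a kernel and draws a few functions from $\mathcal{N}(\boldsymbol{0}, K)$. The squared-exponential kernel, its amplitude `amp` and length scale `ell`, and the helper name `sq_exp_kernel` are all assumptions chosen for illustration; nothing above fixes a particular kernel.

```python
import numpy as np

def sq_exp_kernel(t1, t2, amp=1.0, ell=1.0):
    """Squared-exponential kernel k(t, t') = amp^2 exp(-(t - t')^2 / (2 ell^2))."""
    t1 = np.asarray(t1, dtype=float)[:, None]
    t2 = np.asarray(t2, dtype=float)[None, :]
    return amp**2 * np.exp(-0.5 * (t1 - t2) ** 2 / ell**2)

rng = np.random.default_rng(42)
t = np.linspace(0.0, 10.0, 50)

# Covariance matrix with elements K_ij = k(t_i, t_j).
K = sq_exp_kernel(t, t, amp=1.0, ell=1.5)

# A tiny diagonal "jitter" guards against round-off making K slightly indefinite.
jitter = 1e-8 * np.eye(t.size)

# Each row is one draw from the zero-mean prior, evaluated at the inputs t.
prior_draws = rng.multivariate_normal(np.zeros(t.size), K + jitter, size=3)
print(K.shape, prior_draws.shape)
```

Plotting each row of `prior_draws` against `t` gives smooth curves that wiggle on a scale set by `ell`; shrinking `ell` makes the draws rougher, and `amp` sets their typical height.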

## Test and training sets

Suppose we have an (observed) *training set* $(\boldsymbol{t},\boldsymbol{y})$. 

We are interested in some other *test set* of inputs $\boldsymbol{t}_*$.

The joint distribution over the training and test sets is
### $$
\mathrm{p} \left( \left[ \begin{array}{l} \boldsymbol{y} \\ \boldsymbol{y}_* \end{array} \right] \right) 
= \mathcal{N} \left( \left[ \begin{array}{l} \boldsymbol{m} \\ \boldsymbol{m}_* \end{array} \right], 
\left[ \begin{array}{ll} K & K_* \\ K_*^T & K_{**} \end{array} \right] \right),
$$

where $\boldsymbol{m}_* = m(\boldsymbol{t}_*)$, $K_{**,ij} = k(t_{*,i},t_{*,j})$ and $K_{*,ij} = k(t_i,t_{*,j})$. 

This is not really any different from when we just had two observations:

### $$
\left[ \begin{array}{l} y_1 \\ y_2 \end{array} \right] \sim \mathcal{N} \left(
\left[ \begin{array}{l} \mu_1 \\ \mu_2 \end{array}  \right] , 
\left[ \begin{array}{ll} 
\sigma_1^2 & C \\
C & \sigma_2^2 
\end{array}  \right] 
\right),
$$

# For notational brevity I'm going to set the mean to 0.

## The conditional distribution

The *conditional distribution* for the test set given the training set is:

### $$ 
\mathrm{p} ( \boldsymbol{y}_* \mid \boldsymbol{y},k) = \mathcal{N} ( 
K_*^T K^{-1} \boldsymbol{y}, K_{**} - K_*^T K^{-1} K_* ).
$$

This is also just a straightforward generalization from what we had with just two points:

### $$
p(y_2 \mid y_1) = \mathcal{N} \left( \mu_2 + C (y_1-\mu_1)/\sigma_1^2, \sigma_2^2-C^2/\sigma_1^2 \right).
$$

This is called the **predictive distribution**, because it can be used to predict future (or past) observations. 

More generally, it can be used for *interpolating* the observations to any desired set of inputs. 

This is one of the most widespread applications of GPs in some fields (e.g. kriging in geology, economic forecasting, ...)

## Real observations always contain a component of *white noise*

We need to account for this, but don't necessarily want to include it in the predictions. 


If the white noise variance $\sigma^2$ is constant, we can write 

### $$
\mathrm{cov}(y_i,y_j)=k(t_i,t_j)+\delta_{ij} \sigma^2,
$$


and the conditional distribution becomes


### $$ 
\mathrm{p} ( \boldsymbol{y}_* \mid \boldsymbol{y},k) = \mathcal{N} ( 
K_*^T (K + \sigma^2 \mathbb{I})^{-1} \boldsymbol{y}, K_{**} - K_*^T (K + \sigma^2 \mathbb{I})^{-1} K_* ).
$$


We assumed constant white noise, but it's trivial to allow for different $\sigma$ for each data point.

You could also add some intrinsic dispersion as you often have to do. 

In real life, we may need to learn $\sigma_{\text{int}}$ from the data, alongside the other contribution to the covariance matrix.
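The noisy conditional distribution can be sketched as follows, allowing a different $\sigma_i$ per data point by replacing $\sigma^2 \mathbb{I}$ with $\mathrm{diag}(\sigma_i^2)$. The squared-exponential kernel, the toy data values, and the helper names `sq_exp_kernel` and `gp_predict` are assumptions made for illustration. The returned covariance is for the noise-free function at $\boldsymbol{t}_*$, i.e. the white noise is accounted for in the fit but not added to the predictions.

```python
import numpy as np

def sq_exp_kernel(t1, t2, amp=1.0, ell=1.0):
    """Squared-exponential kernel (an illustrative choice of k)."""
    t1 = np.asarray(t1, dtype=float)[:, None]
    t2 = np.asarray(t2, dtype=float)[None, :]
    return amp**2 * np.exp(-0.5 * (t1 - t2) ** 2 / ell**2)

def gp_predict(t, y, sigma, t_star, amp=1.0, ell=1.0):
    """Zero-mean GP predictive mean and covariance at t_star.

    sigma may be a scalar or one value per training point.
    """
    sigma = np.broadcast_to(np.asarray(sigma, dtype=float), np.shape(t))
    K = sq_exp_kernel(t, t, amp, ell) + np.diag(sigma**2)  # K + diag(sigma_i^2)
    K_s = sq_exp_kernel(t, t_star, amp, ell)               # shape (N, N_*)
    K_ss = sq_exp_kernel(t_star, t_star, amp, ell)
    mean = K_s.T @ np.linalg.solve(K, y)
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, cov

# Toy training set (made-up numbers) with per-point uncertainties.
t = np.array([0.0, 1.0, 2.5, 4.0])
y = np.array([0.2, 0.9, -0.3, -1.1])
sigma = np.array([0.1, 0.1, 0.3, 0.2])

t_star = np.linspace(-1.0, 5.0, 7)
mean, cov = gp_predict(t, y, sigma, t_star)
err = np.sqrt(np.diag(cov))   # 1-sigma predictive uncertainty on f(t_*)

# Sanity check: with negligible noise the GP interpolates the training points.
mean_at_train, _ = gp_predict(t, y, 1e-6, t)
print(np.allclose(mean_at_train, y, atol=1e-6))
```

Using `np.linalg.solve` rather than an explicit inverse is the usual choice here; an intrinsic dispersion term $\sigma_{\text{int}}^2$ would enter the same way, as an extra contribution on the diagonal.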

## Single-point prediction

Let us look more closely at the predictive distribution for a single test point $t_*$.

It is a Gaussian with mean:
### $$
\overline{y}_* = \boldsymbol{k}_*^T (K + \sigma^2 \mathbb{I})^{-1} \boldsymbol{y}
$$

and variance
### $$
\mathbb{V}[y_*] = k(t_*,t_*) - \boldsymbol{k}_*^T (K + \sigma^2 \mathbb{I})^{-1} \boldsymbol{k}_*,
$$
where $\boldsymbol{k}_*$ is the vector of covariances between the test point and the training points.

Notice the mean is a linear combination of the observations: the GP is a *linear predictor*.

It is also a linear combination of covariance functions, each centred on a training point:

### $$
\overline{y}_* = \sum_{i=1}^N \alpha_i k(t_i,t_*),
$$
where $\alpha_i$ is the $i$-th element of $\boldsymbol{\alpha} = (K + \sigma^2 \mathbb{I})^{-1} \boldsymbol{y}$.
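The two readings of the predictive mean (a weighted sum of observations, or a weighted sum of kernels centred on the training points) can be checked against each other numerically. This is a minimal sketch with made-up data; the squared-exponential kernel and the helper name `sq_exp_kernel` are assumptions for illustration, and $\boldsymbol{\alpha}$ is taken to be the vector $(K + \sigma^2 \mathbb{I})^{-1} \boldsymbol{y}$.

```python
import numpy as np

def sq_exp_kernel(t1, t2, amp=1.0, ell=1.0):
    """Squared-exponential kernel (an illustrative choice of k)."""
    t1 = np.asarray(t1, dtype=float)[:, None]
    t2 = np.asarray(t2, dtype=float)[None, :]
    return amp**2 * np.exp(-0.5 * (t1 - t2) ** 2 / ell**2)

# Toy training set and a single test input (made-up numbers).
t = np.array([0.0, 1.0, 2.5, 4.0])
y = np.array([0.2, 0.9, -0.3, -1.1])
sigma = 0.1
t_star = 1.7

K_noisy = sq_exp_kernel(t, t) + sigma**2 * np.eye(t.size)
k_star = sq_exp_kernel(t, [t_star])[:, 0]   # covariances between t_* and each t_i

# View 1: linear combination of the observations, mean = w . y
w = np.linalg.solve(K_noisy, k_star)
mean_from_obs = w @ y

# View 2: linear combination of kernels, mean = sum_i alpha_i k(t_i, t_*)
alpha = np.linalg.solve(K_noisy, y)
mean_from_kernels = np.sum(alpha * k_star)

# Predictive variance at the single test point.
var_star = sq_exp_kernel([t_star], [t_star])[0, 0] - k_star @ w

print(np.isclose(mean_from_obs, mean_from_kernels), var_star)
```

The second view is the practical one: $\boldsymbol{\alpha}$ depends only on the training set, so it can be computed once and reused for every new $t_*$.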

# So how do you choose the kernel $k$?

Common choices: http://www.cs.toronto.edu/~duvenaud/cookbook/index.html

Let's fiddle:

# In-class exercise:

[Click this](https://distill.pub/2019/visual-exploration-gaussian-processes/)

## The likelihood

The *likelihood* of the data under the GP model is simply:

### $$
\mathrm{p}(\boldsymbol{y} \,|\, \boldsymbol{t}) = \mathcal{N}(\boldsymbol{y} \, | \, \boldsymbol{0},K + \sigma^2 \mathbb{I}).
$$

This is a measure of how well the model explains, or predicts, the training set.

i.e. **The observed $\boldsymbol{y}$ are noisy realisations of a latent (unobserved) Gaussian process $\boldsymbol{f}$.**

We are marginalizing over the function values $\boldsymbol{f}$:
### $$
\mathrm{p}(\boldsymbol{y} \,|\, \boldsymbol{t}) = \int \mathrm{p}(\boldsymbol{y} \,|\, \boldsymbol{f},\boldsymbol{t}) \, \mathrm{p}(\boldsymbol{f} \,|\, \boldsymbol{t}) \, \mathrm{d}\boldsymbol{f},
$$

where 


### $$
\mathrm{p}(\boldsymbol{f} \,|\, \boldsymbol{t}) = \mathcal{N}(\boldsymbol{f} \, | \, \boldsymbol{0},K)
$$


is the *prior*, and 


### $$
\mathrm{p}(\boldsymbol{y} \,|\, \boldsymbol{f},\boldsymbol{t}) = \mathcal{N}(\boldsymbol{y} \, | \, \boldsymbol{f},\sigma^2 \mathbb{I})
$$
is the *likelihood*.
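Because both factors in the integral are Gaussian, the marginal likelihood is available in closed form as $\mathcal{N}(\boldsymbol{y} \,|\, \boldsymbol{0}, K + \sigma^2 \mathbb{I})$, and its log can be evaluated with a Cholesky factorization. The sketch below assumes a squared-exponential kernel and toy data; `sq_exp_kernel` and `gp_log_likelihood` are hypothetical helper names. It cross-checks the Cholesky route against a direct evaluation with an explicit inverse and determinant.

```python
import numpy as np

def sq_exp_kernel(t1, t2, amp=1.0, ell=1.0):
    """Squared-exponential kernel (an illustrative choice of k)."""
    t1 = np.asarray(t1, dtype=float)[:, None]
    t2 = np.asarray(t2, dtype=float)[None, :]
    return amp**2 * np.exp(-0.5 * (t1 - t2) ** 2 / ell**2)

def gp_log_likelihood(t, y, sigma, amp, ell):
    """log N(y | 0, K + sigma^2 I) for a zero-mean GP."""
    n = len(t)
    C = sq_exp_kernel(t, t, amp, ell) + sigma**2 * np.eye(n)
    L = np.linalg.cholesky(C)                            # C = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # C^{-1} y
    return (-0.5 * (y @ alpha)              # data-fit term
            - np.sum(np.log(np.diag(L)))    # -0.5 log|C|, the complexity penalty
            - 0.5 * n * np.log(2.0 * np.pi))

# Toy data (made-up numbers).
t = np.array([0.0, 1.0, 2.5, 4.0])
y = np.array([0.2, 0.9, -0.3, -1.1])
logL = gp_log_likelihood(t, y, sigma=0.1, amp=1.0, ell=1.0)

# Direct evaluation, for comparison only (slower and less stable for large N).
C = sq_exp_kernel(t, t) + 0.1**2 * np.eye(t.size)
_, logdet = np.linalg.slogdet(C)
logL_direct = -0.5 * (y @ np.linalg.inv(C) @ y + logdet + t.size * np.log(2.0 * np.pi))

print(np.isclose(logL, logL_direct))
```

The first term rewards fitting the data and the second penalizes overly flexible covariance matrices, which is why this quantity is a sensible objective for choosing hyperparameters.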

# You "condition" the GP on some observed data

i.e. evaluate the conditional (or predictive) distribution for a given covariance matrix (i.e. covariance function and hyper-parameters), and training set.

## *Training* the GP...

...means maximising the *likelihood* of the model with respect to the hyperparameters.
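As a minimal sketch of what training looks like, the example below scans a grid of length scales and keeps the one with the highest log likelihood. Everything specific here is an assumption for illustration: the synthetic sine-wave data, the squared-exponential kernel, the fixed amplitude and noise level, and the helper names. A brute-force grid stands in for the gradient-based optimizer or sampler you would normally use.

```python
import numpy as np

def sq_exp_kernel(t1, t2, amp=1.0, ell=1.0):
    """Squared-exponential kernel (an illustrative choice of k)."""
    t1 = np.asarray(t1, dtype=float)[:, None]
    t2 = np.asarray(t2, dtype=float)[None, :]
    return amp**2 * np.exp(-0.5 * (t1 - t2) ** 2 / ell**2)

def gp_log_likelihood(t, y, sigma, amp, ell):
    """log N(y | 0, K + sigma^2 I) for a zero-mean GP."""
    n = len(t)
    C = sq_exp_kernel(t, t, amp, ell) + sigma**2 * np.eye(n)
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * (y @ alpha) - np.sum(np.log(np.diag(L))) - 0.5 * n * np.log(2.0 * np.pi)

# Synthetic training set: a sine wave plus white noise of known sigma.
rng = np.random.default_rng(1)
sigma = 0.1
t = np.linspace(0.0, 10.0, 25)
y = np.sin(t) + sigma * rng.standard_normal(t.size)

# Evaluate the log likelihood on a log-spaced grid of length scales.
ell_grid = np.logspace(-1, 1, 41)
log_like = np.array([gp_log_likelihood(t, y, sigma, 1.0, ell) for ell in ell_grid])

best_ell = ell_grid[np.argmax(log_like)]
print(f"best length scale on the grid: {best_ell:.2f}")
```

Very short length scales treat the points as nearly independent and very long ones cannot follow the oscillation, so the likelihood should peak somewhere in between. With more than one or two hyperparameters a grid becomes impractical, and you would optimize or sample instead.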

# In-class Exercise: 

[Click this](http://chifeng.scripts.mit.edu/stuff/gp-demo/)

#### Spitzer exoplanet transits and eclipses (Evans et al. 2015)

<img src="Evans_Spitzer.png" width="800">

#### GPs to deal with correlated noise in fitting spectra (Narayan et al., 2019)

<img src="GP_spectra.jpg" width="800">

## Example: Mauna Loa CO$_2$ dataset

(From the Rasmussen & Williams textbook)

<img height="700" src="RW_mauna_kea.png">

#### GPz photometric redshifts (Almosallam, Jarvis & Roberts 2016)

<img src="Almosallam_GPz.png" width="600">