# Regression: Advanced Topics

K. Leighly 2017

This lecture was drawn from the following sources:
 - Ivezic chapter 8
 - Bishop Chapter 1, 3 & 6
 - ["Gaussian Processes For Machine Learning", C. E. Rasmussen & C. K. I. Williams](http://www.gaussianprocess.org/gpml/chapters/)


## Introduction and Motivation

Last time we started our discussion of regression from a frequentist approach.  We found that sometimes the resulting model parameters had to be managed / controlled using regularization.  

This time we continue with a complementary approach: Bayesian linear regression. It does not require explicit regularization, because control over the model parameters can be attained through the use of a prior (sort of).

We will also address an advanced topic, kernel methods.  In linear regression, the specific function is chosen, and the model parameters are fit.  Using kernel methods, we can also optimize over sets of functions.   We will look at a frequentist method (kernel ridge regression) and a Bayesian method (Gaussian process regression).

Finally, we will set ourselves up for classification by talking about decision theory.

### Outline
- Bayesian model fitting and regression
- Kernel methods and the kernel trick
- Gaussian Processes and Gaussian Process Regression
- Kernel Ridge Regression
- Decision theory


## Curve Fitting Revisited

We have already discussed least-squares fitting.  Let's quickly review that from a bit of a different point of view in order to set ourselves up for Bayesian linear regression.

Consider our training data comprised of $N$ input variables $\mathbb{X} = (x_1,\ldots,x_N)^T$ and their target values $\mathbb{T} = (t_1,\ldots,t_N)^T$. Now, however, let's assume that given the value of $x$, the value of $t$ has a Gaussian distribution with mean equal to $y(x,\mathbf{w})$, where $\mathbf{w}$ are fit parameters from linear regression. 

Then the probability of observing $t$, given $x$, $\mathbf{w}$, and $\beta$ is

$$p(t|x,\mathbf{w},\beta) = \mathcal{N}(t|y(x,\mathbf{w}),\beta^{-1}),$$

where $\beta$ is a precision parameter defined to be the inverse variance of the noise distribution. 

The above probability holds for a single data point; we will generalize to the whole data set by writing the likelihood function:

$$p(\mathbb{T} | \mathbb{X},\mathbf{w},\beta) = \prod_{n=1}^N \mathcal{N} (t_n | y(x_n,\mathbf{w}),\beta^{-1}).$$

We will again maximize the logarithm of the likelihood function, which is:

$$\ln p(\mathbb{T} | \mathbb{X},\mathbf{w},\beta) = -\frac{\beta}{2} \sum_{n=1}^N \{ y(x_n,\mathbf{w}) - t_n\}^2 +\frac{N}{2} \ln \beta -\frac{N}{2}\ln (2\pi).$$

The maximum likelihood coefficients, $\mathbf{w}_{ML}$, are determined by maximizing the above function with respect to $\mathbf{w}$. Noting that the last two terms don't depend on $\mathbf{w}$, and that multiplying by the positive constant $\beta$ (constant with respect to the differentiation) won't change the location of the maximum, maximizing the log likelihood is equivalent to minimizing the sum-of-squares error function.

Likewise, the maximum likelihood estimate of the precision parameter can be obtained by maximizing the log likelihood with respect to $\beta$:

$$\frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^N \{y(x_n,\mathbf{w}_{ML}) - t_n\}^2.$$

So $1/\beta_{ML}$, the maximum likelihood estimate of the noise variance, is the average squared difference between the target values $t_n$ and the maximum likelihood predictions $y(x_n,\mathbf{w}_{ML})$.
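The two maximum likelihood estimates can be sketched numerically as follows. The data set (a noisy sine wave), the cubic polynomial, and all variable names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): noisy sine, noise sigma = 0.15
N = 200
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.15, N)

# Design matrix for a cubic polynomial: columns are x^0, x^1, x^2, x^3
M = 3
Phi = np.vander(x, M + 1, increasing=True)

# w_ML minimizes the sum-of-squares error (ordinary least squares)
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# 1 / beta_ML is the mean squared residual about the fitted curve
resid = Phi @ w_ml - t
beta_ml = 1.0 / np.mean(resid ** 2)

print("w_ML =", w_ml)
print("noise sigma implied by beta_ML =", beta_ml ** -0.5)
```

The implied noise level should come out somewhat above the input value of 0.15, because the residuals also contain the mismatch between a cubic and a sine.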

Now, instead of having a function, like we had above, we have a generative model that can predict a new value of $t$ given a new input value of $x$, i.e.,

$$p(t|x,\mathbf{w}_{ML},\beta_{ML}) = \mathcal{N} (t | y(x,\mathbf{w}_{ML}),\beta_{ML}^{-1}).$$

_Next, let's introduce a prior distribution over the polynomial coefficients $\mathbf{w}$_. Consider, for example, a Gaussian distribution of the form

$$p(\mathbf{w} | \alpha) = \mathcal{N}(\mathbf{w}|\mathbf{0},\alpha^{-1} \mathbf{I}) = \left ( \frac{\alpha}{2\pi} \right )^{(M+1)/2} \exp \left \{ -\frac{\alpha}{2} \mathbf{w}^T \mathbf{w} \right \}$$

where $\alpha$ is the precision of the distribution for the polynomial coefficients, and $M+1$ is the total number of elements in the vector $\mathbf{w}$ for a $M^{th}$ order polynomial. Variables such as $\alpha$, which control the distribution of model parameters, are called _hyperparameters_.



Bayes' theorem then gives us:

$$p(\mathbf{w}|\mathbb{X},\mathbb{T},\alpha,\beta) \propto
p(\mathbb{T} | \mathbb{X}, \mathbf{w},\beta) p(\mathbf{w} | \alpha).$$

Now we determine $\mathbf{w}$ by finding the maximum of the posterior distribution. This technique is called maximum a posteriori estimation, or MAP. Taking the negative logarithm of the above and substituting in where necessary, we find that the maximum of the posterior is given by the minimum of

$$\frac{\beta}{2} \sum_{n=1}^N \{y(x_n,\mathbf{w})-t_n \}^2 + \frac{\alpha}{2} \mathbf{w}^T \mathbf{w}.$$

So this shows that maximizing the posterior is equivalent to minimizing the regularized sum-of-squares error function, with regularization parameter $\lambda = \alpha/\beta$, i.e., the ratio of the precision of the prior on the fitting coefficients (the hyperparameter $\alpha$) to the precision of the noise in the data ($\beta$).
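A minimal numerical check of this equivalence, assuming a ninth-order polynomial and illustrative values of $\alpha$ and $\beta$ (none of which come from the text): the ridge solution with $\lambda = \alpha/\beta$ should make the gradient of the negative log posterior vanish.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data and hyperparameters (all values are illustrative assumptions)
N, M = 15, 9
alpha, beta = 5e-3, 1.0 / 0.15 ** 2
x = np.sort(rng.uniform(0, 1, N))
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.15, N)
Phi = np.vander(x, M + 1, increasing=True)

# Minimizer of the regularized sum of squares with lambda = alpha / beta
lam = alpha / beta
w_ridge = np.linalg.solve(lam * np.eye(M + 1) + Phi.T @ Phi, Phi.T @ t)

# The gradient of the negative log posterior should vanish at w_ridge
grad = beta * Phi.T @ (Phi @ w_ridge - t) + alpha * w_ridge
print("max |gradient| at w_ridge:", np.abs(grad).max())

# Unregularized least squares for comparison: its coefficients are typically larger
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print("norm of w_ML :", np.linalg.norm(w_ml))
print("norm of w_MAP:", np.linalg.norm(w_ridge))
```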



## Bayesian Curve Fitting

One of the reasons why we do curve fitting (and linear regression) is to be able to _predict_ values of the dependent variable $t$ given values of $x$ not included in the training set.  The frequentist (maximum likelihood) point of view gives us a single point estimate of $t$ given any new value of $x$.  But it does not directly give you the uncertainty on that value of $t$.  You may be able to back that information out using bootstrap or other methods, but it is not available to you directly.

As we know, Bayesian techniques provide posterior probability distributions, rather than point estimates.

So let's look at Bayesian curve fitting, where we're not only going to get the new predicted $t$, but also uncertainties on that value. First we will  assume that the hyperparameters $\alpha$ and $\beta$ are known and fixed; that requirement will be relaxed later. We will then move on to Bayesian Regression.

It is important to note in the following that Bishop takes advantage of the fixedness of these parameters and suppresses them in the notation, e.g., for $p(\mathbf{w} | \mathbb{X}, \mathbb{T},\alpha, \beta)$ he uses $p(\mathbf{w} | \mathbb{X}, \mathbb{T})$.

We start by writing:

$$p(t|x,\mathbb{X},\mathbb{T}) = \int p(t|x,\mathbf{w}) p(\mathbf{w} | \mathbb{X}, \mathbb{T})\, d\mathbf{w}$$

I.e., we marginalize over the possible coefficients $\mathbf{w}$. 

Here, $x$ is the new value of the independent variable, and the training data comprise the $N$ input values $\mathbb{X} = (x_1,\ldots,x_N)^T$ and their target values $\mathbb{T} = (t_1,\ldots,t_N)^T$.  So we are asking: what is the probability distribution of $t$ given the training set and a particular value of $x$?

$p(\mathbf{w} | \mathbb{X} , \mathbb{T})$ is the posterior distribution over parameters, and can be found by normalizing 

$$p(\mathbf{w}|\mathbb{X},\mathbb{T},\alpha,\beta) \propto
p(\mathbb{T} | \mathbb{X}, \mathbf{w},\beta) p(\mathbf{w} | \alpha).$$

It will be shown later that this posterior distribution is a Gaussian, and can be evaluated analytically. 

The likelihood is also a Gaussian, and the product of two Gaussians is also a Gaussian, so this integral can be done analytically, resulting in:

$$p(t|x,\mathbb{X}, \mathbb{T}) = \mathcal{N}(t|m(x),s^2(x))$$

where the mean and variance are given by

$$m(x) = \beta \boldsymbol{\phi}(x)^T \mathbf{S} \sum_{n=1}^N \boldsymbol{\phi}(x_n) t_n$$

$$s^2(x) = \beta^{-1} + \boldsymbol{\phi}(x)^T \mathbf{S} \boldsymbol{\phi}(x), $$

where

$$\mathbf{S}^{-1} = \alpha \mathbf{I} + \beta \sum_{n=1}^N \boldsymbol{\phi} (x_n) \boldsymbol{\phi}(x_n)^T$$

and $\mathbf{I}$ is the unit matrix, while $\boldsymbol{\phi}(x)$ is defined to be a vector with $\phi_i(x) = x^i$ for $i=0,\ldots,M$.

Ok, so what is this?

- We've already generated a model using our training set $\mathbb{X}$, $\mathbb{T}$, and that model is characterized by $\mathbf{w}$.
- But those values of $\mathbf{w}$ are themselves probabilistic, and actually don't matter because what we are after is the new predicted value $t$ for the new value $x$. So we marginalize over possible values of $\mathbf{w}$ by the integral over $d \mathbf{w}$.
- At the same time, keep in mind that we aren't actually getting a new value of $t$, but rather a probability distribution for values of $t$, given input values of $x$ and the training set. That distribution will be a Gaussian with parameters that depend on the training set (through $\boldsymbol{\phi}$ and $t_n$), as well as the noise in the data (through $\beta$), and the prior constraints on $\mathbf{w}$ (through $\alpha$).
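The formulas for $m(x)$ and $s^2(x)$ can be evaluated directly. The following is a minimal sketch, assuming a ninth-order polynomial basis and fixed, illustrative values of $\alpha$ and $\beta$; the helper names `phi` and `predictive` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

def phi(x, M):
    """Polynomial basis vector(s): phi_i(x) = x**i for i = 0..M."""
    return np.vander(np.atleast_1d(x), M + 1, increasing=True)

# Illustrative training set and fixed hyperparameters (assumed values)
N, M = 10, 9
alpha, beta = 5e-3, 11.1
x_train = np.linspace(0, 1, N)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, N)

Phi = phi(x_train, M)                                 # shape (N, M+1)
S = np.linalg.inv(alpha * np.eye(M + 1) + beta * Phi.T @ Phi)

def predictive(x_new):
    """Return the mean m(x) and variance s^2(x) of p(t | x, X, T)."""
    p = phi(x_new, M)                                 # shape (n_new, M+1)
    mean = beta * p @ S @ (Phi.T @ t_train)
    var = 1.0 / beta + np.einsum("ij,jk,ik->i", p, S, p)
    return mean, var

m, s2 = predictive(np.array([0.25, 0.5, 1.5]))
print("mean     :", m)
print("std. dev.:", np.sqrt(s2))
```

The variance never drops below the noise floor $\beta^{-1}$, and it grows quickly at $x = 1.5$, which lies outside the range covered by the training set.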

## Bayesian Linear Regression

In the frequentist approach, we allowed the regression parameters to take on any value.  That led to problems of variance, because the regression parameters could become uncontrollably large, a problem that we could then address by regularization.  

Another option would be to place a prior on the model parameters.  The prior will control them and not let them become extremely large.  

We are going to assume a Gaussian distribution for the likelihood, producing a quadratic function of $\mathbf{w}$ for the log likelihood function.

A _conjugate prior_ has the property that the posterior distribution has the same functional form as the prior probability distribution. So, if our likelihood probability distribution is a Gaussian, then the prior distribution should be a Gaussian so that the posterior distribution is a Gaussian. So we use a Gaussian for the prior distribution of the model parameters $\mathbf{w}$

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} | \mathbf{m}_0,\mathbf{S}_0)$$

with mean $\mathbf{m}_0$ and covariance $\mathbf{S}_0$.

Next, we compute the posterior distribution, which is proportional to the product of the likelihood and the prior. The product of two Gaussians is a Gaussian, so this can be done by first completing the square in the exponential.  For those of us who don't remember, completing the square is a technique by which a quadratic polynomial $ax^2+bx+c$ can be rewritten in the form $a(x-h)^2+k$.   Then we will find the normalization coefficient. 

But Bishop facilitates this computation by deriving, in his Chapter 2, some useful results for marginal and conditional Gaussians, as follows:

Consider a marginal Gaussian distribution for $\mathbf{x}$ and a conditional distribution for $\mathbf{y}$ given $\mathbf{x}$, in the form:

$$p(\mathbf{x}) = \mathcal{N} (\mathbf{x} | \boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1})$$
$$p(\mathbf{y} | \mathbf{x}) = \mathcal{N} (\mathbf{y} |
\mathbf{A}\mathbf{x}+\mathbf{b}, \mathbf{L}^{-1}),$$

where $\boldsymbol{\Lambda} \equiv \boldsymbol{\Sigma}^{-1}$ is the _precision matrix_, for the covariance matrix $\boldsymbol{\Sigma}$. 

$\mathbf{A}$, $\mathbf{b}$, and $\boldsymbol{\mu}$ are parameters controlling the means, and $\mathbf{L}$ is another precision matrix.



Then, the marginal distribution of $\mathbf{y}$ and the conditional distribution of $\mathbf{x}$ given $\mathbf{y}$ are given by:

$$p(\mathbf{y}) = \mathcal{N}(\mathbf{y} | \mathbf{A} \boldsymbol{\mu}
+ \mathbf{b}, \mathbf{L}^{-1} + \mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^T)$$

$$p(\mathbf{x} | \mathbf{y}) = \mathcal{N} (\mathbf{x}|\boldsymbol{\Sigma}\{ \mathbf{A}^T \mathbf{L}(\mathbf{y} - \mathbf{b}) + \boldsymbol{\Lambda} \boldsymbol{\mu} \},
\boldsymbol{\Sigma}),$$

where

$$\boldsymbol{\Sigma} = (\boldsymbol{\Lambda} + \mathbf{A}^T \mathbf{L} \mathbf{A})^{-1}.$$

Using these already derived results, we can write down the general posterior distribution

$$p(\mathbf{w} | \mathfrak{t}) = \mathcal{N} (\mathbf{w} |
\mathbf{m}_N, \mathbf{S}_N)$$

where

$$\mathbf{m}_N = \mathbf{S}_N (\mathbf{S}_0^{-1} \mathbf{m}_0 + \beta
\boldsymbol{\Phi}^T \mathfrak{t})$$
$$\mathbf{S}^{-1}_N = \mathbf{S}_0^{-1} + \beta \boldsymbol{\Phi}^T  \boldsymbol{\Phi}.$$

Here $\boldsymbol{\Phi}$ is the design matrix that we talked about last time in our discussion of linear regression, and $\beta$ is the noise precision, i.e., the inverse of the noise variance.
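To see where these expressions come from, the Gaussian identities can be applied with the following correspondence. This is a sketch that assumes the likelihood for the whole training set is written as $p(\mathfrak{t}|\mathbf{w}) = \mathcal{N}(\mathfrak{t}|\boldsymbol{\Phi}\mathbf{w},\beta^{-1}\mathbf{I})$:

$$\mathbf{x} \rightarrow \mathbf{w}, \quad \mathbf{y} \rightarrow \mathfrak{t}, \quad \boldsymbol{\mu} \rightarrow \mathbf{m}_0, \quad \boldsymbol{\Lambda} \rightarrow \mathbf{S}_0^{-1}, \quad \mathbf{A} \rightarrow \boldsymbol{\Phi}, \quad \mathbf{b} \rightarrow \mathbf{0}, \quad \mathbf{L} \rightarrow \beta \mathbf{I}.$$

Substituting into the expression for $\boldsymbol{\Sigma}$ gives the posterior covariance,

$$\boldsymbol{\Sigma} = (\mathbf{S}_0^{-1} + \beta \boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} = \mathbf{S}_N,$$

and substituting into the mean of $p(\mathbf{x}|\mathbf{y})$ gives the posterior mean,

$$\boldsymbol{\Sigma}\{ \mathbf{A}^T \mathbf{L}(\mathbf{y} - \mathbf{b}) + \boldsymbol{\Lambda} \boldsymbol{\mu} \} = \mathbf{S}_N (\beta \boldsymbol{\Phi}^T \mathfrak{t} + \mathbf{S}_0^{-1} \mathbf{m}_0) = \mathbf{m}_N.$$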


Let's examine this solution.

- because the posterior is Gaussian, its mean and mode are equal. So the maximum posterior weight vector is $\mathbf{w}_{MAP} = \mathbf{m}_N$.
- if we consider the infinitely broad prior $\mathbf{S}_0 = \alpha^{-1} \mathbf{I}$ with $\alpha \rightarrow 0$, $\mathbf{m}_N$ becomes the maximum likelihood value $\mathbf{w}_{ML}=(\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1}
\boldsymbol{\Phi}^T \mathfrak{t}$ which we derived last time.  That is, the values of $\mathbf{w}$ are uncontrolled.
- if $N=0$, then the posterior reverts to the prior, as it should.
- And one can do sequential updating as discussed last time.
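The sequential-updating property can be checked with a short sketch. It assumes a straight-line model with known noise precision; the parameter values and the helper `posterior` are illustrative choices, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(3)

def posterior(m0, S0, Phi, t, beta):
    """Gaussian posterior N(w | mN, SN) from prior N(w | m0, S0)."""
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)
    return mN, SN

# Straight-line model t = w0 + w1 * x with known noise precision (assumed values)
w_true, beta, alpha = np.array([-0.3, 0.5]), 25.0, 2.0
x = rng.uniform(-1, 1, 20)
Phi = np.column_stack([np.ones_like(x), x])
t = Phi @ w_true + rng.normal(0, beta ** -0.5, x.size)

m0, S0 = np.zeros(2), np.eye(2) / alpha

# Batch update: use all the points at once
m_batch, S_batch = posterior(m0, S0, Phi, t, beta)

# Sequential update: the posterior after each point is the prior for the next
m_seq, S_seq = m0, S0
for phi_n, t_n in zip(Phi, t):
    m_seq, S_seq = posterior(m_seq, S_seq, phi_n[None, :], np.array([t_n]), beta)

print("batch mean     :", m_batch)
print("sequential mean:", m_seq)
```

Both routes should give the same posterior up to rounding, since each single-point update simply adds one term $\beta \boldsymbol{\phi}_n \boldsymbol{\phi}_n^T$ to the precision matrix.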


As we did above with Bayesian curve fitting, we can derive a useful result for a simplified Gaussian prior of the form

$$p(\mathbf{w} | \alpha) = \mathcal{N}(\mathbf{w} | 0,\alpha^{-1} \mathbf{I}).$$

Note the presence of the hyperparameter $\alpha$. As before, this produces the Gaussian posterior distribution above, with

$$\mathbf{m}_N = \beta \mathbf{S}_N \boldsymbol{\Phi}^T \mathfrak{t}$$
$$\mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta \boldsymbol{\Phi}^T \boldsymbol{\Phi}.$$

We go through the same routine as last time, and find that maximization of the posterior distribution with respect to $\mathbf{w}$ is equivalent to minimizing the sum-of-squares error function above, including the quadratic regularization function, with $\lambda = \alpha/\beta$, i.e.,

$$\frac{1}{2} \sum_{n=1}^N \{t_n - \mathbf{w}^T
\boldsymbol{\phi}(\mathbf{x}_n) \}^2+ \frac{\alpha}{2\beta} \mathbf{w}^T
\mathbf{w}.$$

So the Bayesian solution has the same form as the frequentist one, but the regularization parameter $\lambda$ is replaced by the ratio $\alpha/\beta$.


To promote understanding of this result, Bishop presents a lengthy sequential Bayesian learning example in his Figure 3.7. A somewhat similar plot is shown in Ivezic Figure 8.1, below.  First, the four points to be fit are shown:

![Ivezic Figure 8.1 top](http://www.astroml.org/_images/fig_linreg_inline_1.png)

Then the sequential determination of the best-fitting parameters is shown as subsequent points are added.

![Ivezic Figure 8.1 bottom](http://www.astroml.org/_images/fig_linreg_inline_2.png)




### Predictive Distribution

As noted above, one may not be interested in the values of $\mathbf{w}$ itself, but rather in making predictions of $t$ for new values of $\mathbf{x}$. The predictive distribution is defined by:

$$p(t|\mathfrak{t},\alpha,\beta) = \int p(t|\mathbf{w},\beta) p(\mathbf{w} | \mathfrak{t},\alpha,\beta)\, d\mathbf{w},$$

where $\mathfrak{t}$ is the vector of target values from the training set. Note that the input vectors $\mathbf{x}$ have been left out of the conditioning statements to simplify notation.

We have already found that:

$$p(t|\mathbf{x},\mathbf{w},\beta) = \mathcal{N}(t | y(\mathbf{x},\mathbf{w}),\beta^{-1}).$$

and

$$p(\mathbf{w} | \mathfrak{t}) = \mathcal{N} (\mathbf{w} | \mathbf{m}_N, \mathbf{S}_N).$$

Making use of the Gaussian identities above, we find

$$p(t|\mathbf{x},\mathfrak{t},\alpha,\beta) = \mathcal{N}(t|\mathbf{m}_N^T \boldsymbol{\phi}(\mathbf{x}),\sigma_N^2(\mathbf{x})) $$

where the variance $\sigma_N^2(\mathbf{x})$ of the predictive distribution is given by:

$$\sigma_N^2(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol{\phi}(\mathbf{x})^T \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x}).$$

The first term is the noise on the data, while the second term is the uncertainty associated with the parameters $\mathbf{w}$.



### Example

Bayesian regression is implemented as Bayesian Ridge Regression in scikit-learn.  Let's see how it performs on the same problem that we looked at with ridge regression last time.


In [None]:
import time
import numpy as np
import matplotlib.pyplot as plt
%pylab inline


# Set up an x-vector
x=np.sort(np.random.uniform(0,1,11))

# Set up an x-vector with the same range, but sampled more finely
x2=np.linspace(0,1,101)

# y is a sine wave
y=np.sin(2*np.pi*x)

#give y some uncertainty, i.e., apply a Gaussian scatter around 
#the theoretically determined values.
ynew=y+np.random.normal(0,0.15,11)

#assign some errors.  These errors don't have any physical interpretation, 
#but are reasonably appropriate, as they are scattered around the 
#sigma of the uncertainty gaussian (=0.15)
err=np.random.uniform(0.1,0.2,11)

In [None]:
#plot ideal y, and data.
plt.plot(x,y)
plt.errorbar(x, ynew, yerr=err, fmt='o')

In [None]:
from sklearn.linear_model import BayesianRidge, LinearRegression

def gaussian_basis(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)

centers = np.linspace(0, 1.0, 11)
widths = 0.05
X = gaussian_basis(x[:, None], centers, widths)




In [None]:
## Apply a Gaussian Basis Regression with no regularization.  
## Might have to run it a couple of times to get something truly pathological.

clf = LinearRegression(fit_intercept=True)
clf.fit(X, y)
w = clf.coef_
fit = clf.predict(gaussian_basis(x[:, None], centers, widths))
fit2 = clf.predict(gaussian_basis(x2[:,None],centers,widths))



In [None]:
plt.plot(x,y)
plt.errorbar(x, ynew, yerr=err, fmt='o')
plt.plot(x,fit)
plt.plot(x2,fit2)
print(fit.shape)

In [None]:
clf_Bayes = BayesianRidge(compute_score=True)
clf_Bayes.fit(X, y)

fit_bayes = clf_Bayes.predict(gaussian_basis(x[:, None], centers, widths))
fit2_bayes = clf_Bayes.predict(gaussian_basis(x2[:,None],centers,widths))



In [None]:
plt.plot(x,y)
plt.errorbar(x, ynew, yerr=err, fmt='o')
plt.plot(x,fit_bayes)
plt.plot(x2,fit2_bayes)
print(fit_bayes.shape)

In [None]:
print(clf_Bayes.coef_)
print(clf.coef_)
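The predictive variance $\sigma_N^2(\mathbf{x})$ derived in the Predictive Distribution section is also available from scikit-learn. The sketch below regenerates a toy data set of the same kind (the seed and variable names are assumptions) and asks `BayesianRidge.predict` for the predictive standard deviation through its `return_std` argument.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(4)

def gaussian_basis(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)

# Same kind of toy problem as above: noisy samples of a sine wave
x = np.sort(rng.uniform(0, 1, 11))
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.15, 11)
x_fine = np.linspace(0, 1, 101)

centers, width = np.linspace(0, 1, 11), 0.05
X = gaussian_basis(x[:, None], centers, width)
X_fine = gaussian_basis(x_fine[:, None], centers, width)

model = BayesianRidge()
model.fit(X, t)

# return_std=True also returns the standard deviation of the predictive distribution
t_mean, t_std = model.predict(X_fine, return_std=True)

# Note the naming: in scikit-learn alpha_ is the noise precision (beta in these notes)
# and lambda_ is the precision of the weights (alpha in these notes).
print("noise precision  (beta here) :", model.alpha_)
print("weight precision (alpha here):", model.lambda_)
print("largest predictive std. dev. :", t_std.max())
```

Unlike the derivation above, `BayesianRidge` does not hold the two precisions fixed; it estimates them from the data, so the numbers printed will vary with the data set.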

## Kernel Methods

Next we will discuss Gaussian process regression, which is a kernel method.  

Kernel methods represent a dramatic generalization of our approaches so far.  So far, we have specified the function, and then have solved for the best-fitting parameters of that function.  We may modify the function (e.g., use a third-order polynomial rather than a second-order one), but generally speaking, the flexibility of the function is fixed.

The generalization that we will use here is that instead of considering a single function, we will consider a class of functions. A _Gaussian Process_ is a generalization of the Gaussian probability distribution: a probability distribution describes random variables (scalars or vectors), whereas a stochastic process describes the properties of functions.  

How can this be done, since there must be an infinite number of functions?  As you will see, the problem becomes much simpler when the functions have certain properties.

### Gaussian Process Regression

Rasmussen & Williams (2006) explain Gaussian process regression from two viewpoints, and we will follow their development in order to develop some intuition about the process.

#### Weight-space View

This discussion starts with Bayesian regression as discussed above, and generalizes it to functions.

We start with a training set $\mathcal{D}$ of $n$ observations, $\mathcal{D} = \{(\mathbf{x}_i,y_i) | i=1,\dots,n\}$, where $\mathbf{x}$ is an input vector (independent variable) of dimension $D$ and $y$ is the scalar output, also called the _target_ (dependent variable).  Recall that the inputs can be expressed as a $D \times n$ design matrix $X$, with the targets collected in $\mathbf{y}$ so that $\mathcal{D} = (X,\mathbf{y})$ (note that this is the transpose of the design matrix we used before).  As before, we are interested in representing the relationship between the inputs and outputs, i.e., the conditional distribution of the targets given the inputs.

To motivate the properties of the Gaussian process, we start with the standard linear regression model with Gaussian noise 

$$f(\mathbf{x})= \mathbf{x}^T \mathbf{w}$$
$$y=f(\mathbf{x})+\epsilon$$

where $\mathbf{w}$ are the weights of the linear model, $f$ is the function value, and $y$ is the target value, while $\epsilon$ is noise specified by a Gaussian distribution with zero mean and variance $\sigma_n^2$.  

The probability density of the observations given the parameters is:

$$p(\mathbf{y}|X,\mathbf{w})=\prod_{i=1}^n p(y_i|\mathbf{x}_i,\mathbf{w})$$

$$=\prod_{i=1}^n\frac{1}{\sqrt{2\pi}\sigma_n} \exp\left( - \frac{(y_i-\mathbf{x}_i^T\mathbf{w})^2}{2\sigma_n^2}\right)$$

$$= \frac{1}{(2\pi\sigma_n^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2_n} \lvert \mathbf{y}-X^T \mathbf{w}\rvert^2\right)$$

$$=\mathcal{N}(X^T\mathbf{w},\sigma_n^2 I),$$

where $\lvert \mathbf{z} \rvert$ denotes the Euclidean length of vector $\mathbf{z}$.  

We need to specify a prior over the parameters $\mathbf{w}$; we take a zero-mean Gaussian,
$$\mathbf{w}\sim \mathcal{N}(\mathbf{0},\Sigma_p)$$

where $\Sigma_p$ is the covariance matrix for the prior.

We have the likelihood and we have the prior, so we can write the equation for the posterior probability distribution 

$$p(\mathbf{w} |X,\mathbf{y}) \propto \exp \left( -\frac{1}{2\sigma_n^2} (\mathbf{y}-X^T\mathbf{w})^T(\mathbf{y}-X^T \mathbf{w})\right) \exp \left( -\frac{1}{2}\mathbf{w}^T\Sigma_p^{-1}\mathbf{w}\right) $$

$$\propto \exp \left ( -\frac{1}{2}(\mathbf{w}-\mathbf{\bar{w}})^T  (\frac{1}{\sigma_n^2} X X^T + \Sigma_p^{-1}) (\mathbf{w}- \mathbf{\bar{w}})  \right),$$

where

$$\mathbf{\bar{w}} = \sigma_n^{-2} (\sigma_n^{-2} XX^T+\Sigma_p^{-1})^{-1}X\mathbf{y}.$$

This was accomplished again by "completing the square".

So, the form of the posterior distribution is a Gaussian with mean $\mathbf{\bar{w}}$ and covariance matrix $A^{-1}$ 

$$p(\mathbf{w}|X,\mathbf{y}) \sim \mathcal{N}(\mathbf{\bar{w}}, A^{-1})$$

where
$$\mathbf{\bar{w}}= \frac{1}{\sigma_n^2} A^{-1} X \mathbf{y}$$
and 
$$ A=\sigma_n^{-2} XX^T+\Sigma_p^{-1}.$$

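Compare this with the notation above (recall that $X$ here is the transpose of the earlier design matrix, so $X \leftrightarrow \boldsymbol{\Phi}^T$):

$$\mathbf{m}_N = \beta \mathbf{S}_N \boldsymbol{\Phi}^T \mathfrak{t}$$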
$$\mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta \boldsymbol{\Phi}^T \boldsymbol{\Phi}.$$



As before, we want to make predictions for a test case.  We'll average over all possible parameter values, weighted by their posterior probability.  

So, for test set values $\mathbf{x}_*$, the predictive distribution for $f_*$ is 

$$p(f_*|\mathbf{x}_*,X,\mathbf{y})=\int p(f_*|\mathbf{x}_*,\mathbf{w})p(\mathbf{w}|X,\mathbf{y})d\mathbf{w}$$
$$=\mathcal{N}\left (\frac{1}{\sigma_n^2} \mathbf{x}_*^T A^{-1} X \mathbf{y}, \mathbf{x}_*^T A^{-1} \mathbf{x}_*\right ).$$

(Note that $^*$ denotes a test value.)
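The weight-space posterior and predictive distribution can be written out in a few lines. This is a minimal sketch in the Rasmussen & Williams convention (one column of $X$ per training input); the toy data, the identity prior covariance, and the test input are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Rasmussen & Williams convention: X is D x n, one column per training input
D, n, sigma_n = 2, 30, 0.1
w_true = np.array([1.0, -2.0])           # assumed "true" weights for the toy data
X = rng.normal(size=(D, n))
y = X.T @ w_true + rng.normal(0, sigma_n, n)

Sigma_p = np.eye(D)                       # prior covariance of the weights

# Posterior over the weights: N(w_bar, A^{-1})
A = X @ X.T / sigma_n ** 2 + np.linalg.inv(Sigma_p)
A_inv = np.linalg.inv(A)
w_bar = A_inv @ X @ y / sigma_n ** 2

# Predictive distribution of f_* at a test input x_*
x_star = np.array([0.5, 0.5])
f_mean = x_star @ w_bar
f_var = x_star @ A_inv @ x_star

print("posterior mean weights:", w_bar)
print("f_* mean, std. dev.   :", f_mean, np.sqrt(f_var))
```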

#### Projection of the Inputs into Feature Space

Now we have the Bayesian linear model as before, in terms of Rasmussen and Williams (2006) parameters.  The issue with this model that we want to address is the fact that it has "limited expressiveness".  

The simple idea proposed is that the input values could be projected into some high-dimensional space ("feature space") using a set of basis functions, and then the linear regression could be performed in this space rather than directly on the inputs themselves.

For example, a scalar input $x$ could be projected into the space of powers of $x$: $\phi(x) = (1,x,x^2,x^3,\ldots)^T$ for polynomial regression.  

As an example, see below.  The idea is that in the input space, a plane cannot separate the points, but in the feature space, it can.

![wikipedia](https://upload.wikimedia.org/wikipedia/commons/c/cc/Kernel_trick_idea.svg)



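Introduce the function $\phi(\mathbf{x})$ which maps a $D$-dimensional input $\mathbf{x}$ into an $N$-dimensional feature space.  (Note that $N$ here is the number of basis functions, not the number of data points as in the Bishop notation used earlier.)

Let the matrix $\Phi(X)$ be the aggregation of columns $\phi(\mathbf{x})$ for all inputs in the training set.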

Now the model is 
$$f(\mathbf{x}) = \phi (\mathbf{x})^T \mathbf{w},$$

where the vector of parameters has length $N$.  

This is the same as for the standard linear model, except now $\Phi(X)$ is substituted in for the design matrix $X$.  So the predictive distribution becomes:

$$f_*|\mathbf{x}_*,X,\mathbf{y} \sim \mathcal{N}\left (\frac{1}{\sigma_n^2} \phi(\mathbf{x}_*)^T A^{-1} \Phi \mathbf{y}, \phi(\mathbf{x}_*)^T A^{-1} \phi(\mathbf{x}_*) \right)$$

where $\Phi=\Phi(X)$ and $A=\sigma_n^{-2} \Phi \Phi^T+\Sigma_p^{-1}$.



Making predictions using this equation requires inverting the $A$ matrix of size $N\times N$ which may not be convenient if the size of the feature space is large.  So the equation can be rewritten as follows:

$$f_*|\mathbf{x}_*,X,\mathbf{y} \sim \mathcal{N}\left (\phi_*^T \Sigma_p \Phi(K+\sigma_n^2 I)^{-1} \mathbf{y}, \phi_*^T \Sigma_p \phi_* - \phi_*^T \Sigma_p \Phi(K+\sigma_n^2 I)^{-1} \Phi^T \Sigma_p \phi_* \right ),$$

where $\phi_*=\phi(\mathbf{x_*})$ and $K = \Phi^T \Sigma_p\Phi$.  The derivation of this form is outlined in Rasmussen & Williams (2006).
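The two forms of the predictive distribution can be compared numerically. The sketch below assumes a small polynomial feature map and an identity prior covariance (both illustrative choices); the first form solves an $N \times N$ system, the second an $n \times n$ system.

```python
import numpy as np

rng = np.random.default_rng(6)

def phi(x, N=6):
    """Polynomial feature map (an illustrative choice): (1, x, ..., x^(N-1))."""
    return np.vander(np.atleast_1d(x), N, increasing=True).T   # shape (N, n_points)

n, sigma_n = 8, 0.2
x = rng.uniform(-1, 1, n)
y = np.sin(3 * x) + rng.normal(0, sigma_n, n)

Phi = phi(x)                              # N x n, one column per training input
N_feat = Phi.shape[0]
Sigma_p = np.eye(N_feat)
phi_s = phi(0.3)[:, 0]                    # feature vector of the test input

# Form 1: solve with the N x N matrix A
A = Phi @ Phi.T / sigma_n ** 2 + np.linalg.inv(Sigma_p)
mean_1 = phi_s @ np.linalg.solve(A, Phi @ y) / sigma_n ** 2
var_1 = phi_s @ np.linalg.solve(A, phi_s)

# Form 2: solve with the n x n matrix K + sigma_n^2 I
K = Phi.T @ Sigma_p @ Phi
k_s = Phi.T @ Sigma_p @ phi_s             # covariances between training and test inputs
G = K + sigma_n ** 2 * np.eye(n)
mean_2 = k_s @ np.linalg.solve(G, y)
var_2 = phi_s @ Sigma_p @ phi_s - k_s @ np.linalg.solve(G, k_s)

print("means    :", mean_1, mean_2)
print("variances:", var_1, var_2)
```

Which form is cheaper depends on whether the number of features $N$ or the number of training points $n$ is smaller.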



Now we get to the good part.

_Notice that in the above equation, the feature space always enters in the form_ $\Phi^T \Sigma_p \Phi$, $\phi_*^T \Sigma_p \Phi$, or $\phi_*^T \Sigma_p \phi_*$.  

Thus the entries are of the form $\phi(\mathbf{x})^T \Sigma_p \phi(\mathbf{x}^\prime)$ where $\mathbf{x}$ and $\mathbf{x}^\prime$ are _either in the training set or the test set_.  

We define $k(\mathbf{x},\mathbf{x}^\prime) = \phi(\mathbf{x})^T \Sigma_p \phi(\mathbf{x}^\prime)$ where $k(\cdot,\cdot)$ is called the _covariance function_ or _kernel_.

Note that $\phi(\mathbf{x})^T \Sigma_p \phi(\mathbf{x}^\prime)$ is an inner product (with respect to $\Sigma_p$).  $\Sigma_p$ is a covariance matrix, so it will be positive definite.  So we can define $\Sigma_p^{1/2}$ so that $(\Sigma_p^{1/2})^2 = \Sigma_p$.  

Then, defining $\psi(\mathbf{x}) = \Sigma_p^{1/2} \phi(\mathbf{x})$, we have $k(\mathbf{x},\mathbf{x}^\prime) = \psi(\mathbf{x})\centerdot \psi(\mathbf{x}^\prime)$.
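A quick numerical check that the kernel is an ordinary inner product of the $\psi$ vectors. The feature map and the randomly generated $\Sigma_p$ below are assumptions for illustration; the symmetric square root is built from an eigendecomposition, which is one of several possible choices.

```python
import numpy as np

rng = np.random.default_rng(7)

def phi(x):
    """Illustrative feature map for a scalar input: (1, x, x^2)."""
    return np.array([1.0, x, x ** 2])

# A random symmetric positive definite prior covariance (assumed for the demo)
B = rng.normal(size=(3, 3))
Sigma_p = B @ B.T + 3 * np.eye(3)

# Symmetric square root from the eigendecomposition Sigma_p = U diag(s) U^T
s, U = np.linalg.eigh(Sigma_p)
Sigma_half = U @ np.diag(np.sqrt(s)) @ U.T

def k(x, xp):
    return phi(x) @ Sigma_p @ phi(xp)

def psi(x):
    return Sigma_half @ phi(x)

x, xp = 0.7, -1.2
print("k(x, x')         :", k(x, xp))
print("psi(x) . psi(x') :", psi(x) @ psi(xp))
```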

#### The Kernel Trick

If an algorithm can be defined in terms of inner products in input space, then it can be lifted into feature space by replacing occurrences of those inner products by $k(\mathbf{x},\mathbf{x}^\prime)$.  

The kernel trick is also known as _kernel substitution_.  The idea is that if the inputs enter an algorithm only through inner products, then you can substitute a different kernel for that inner product, as above, where the inner product of the $\phi(\mathbf{x})$ became an inner product of the $\psi(\mathbf{x})$.  

This technique is valuable in situations where it is more convenient to compute the kernel than the feature vectors themselves.   So the kernel is of primary interest, and the feature space has secondary importance.  It can also arise naturally; the form of the error function will suggest an appropriate different kernel function.  In this case it is called a _dual representation_.  See Bishop 6.1 for an example.

Another interesting point is that the training sample doesn't go away.  In regular linear regression, we used the training set to obtain estimates of $\mathbf{w}$.  Then the training set was thrown away, and the derived $\mathbf{w}$ values were used on the test set.  Here, the training set persists, through  $k(\mathbf{x},\mathbf{x}^\prime) = \psi(\mathbf{x})\centerdot \psi(\mathbf{x}^\prime)$.
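As a concrete instance of computing the kernel without forming the feature vectors, consider the quadratic kernel $k(\mathbf{x},\mathbf{z}) = (\mathbf{x}^T\mathbf{z})^2$ for two-dimensional inputs. This is a standard textbook illustration rather than something defined earlier in these notes, and the function names are hypothetical.

```python
import numpy as np

def phi(x):
    """Explicit feature map for a 2-D input."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def k(x, z):
    """Kernel evaluated directly in input space."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(phi(x) @ phi(z))   # inner product in feature space
print(k(x, z))           # same value (up to rounding), without ever forming phi
```

For higher polynomial degrees or higher-dimensional inputs the explicit feature vector grows rapidly, while the cost of evaluating the kernel stays that of a single inner product in input space.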



#### Function-space View

As a complementary development, let's consider inference directly in function space.  

Define:  A _Gaussian process_ is a collection of random variables, any finite number of which have a joint Gaussian distribution.

A Gaussian process is completely specified by its mean and covariance functions, as you might expect.  Define mean function $m(\mathbf{x})$ and covariance function $k(\mathbf{x},\mathbf{x}^\prime)$ of a real process $f(\mathbf{x})$ as 
$$m(\mathbf{x}) = \mathbb{E} [f(\mathbf{x})],$$
$$k(\mathbf{x},\mathbf{x}^\prime) = \mathbb{E}[(f(\mathbf{x})-m(\mathbf{x}))(f(\mathbf{x}^\prime)-m(\mathbf{x}^\prime))].$$

The Gaussian process is then written as 

$$f(\mathbf{x}) \sim \mathcal{GP} (m(\mathbf{x}),k(\mathbf{x},\mathbf{x}^\prime)).$$

Generally, the mean will be taken to be zero, but that does not need to be the case.




A simple example of a Gaussian process is obtained from the linear regression model above, with $f(\mathbf{x}) = \phi(\mathbf{x})^T\mathbf{w}$ with prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0},\Sigma_p)$.  

In this case, the mean and covariance are:

$$\mathbb{E}[f(\mathbf{x})]=\phi(\mathbf{x})^T \mathbb{E}[\mathbf{w}] = 0,$$

$$\mathbb{E}[f(\mathbf{x})f(\mathbf{x}^\prime)] = \phi(\mathbf{x})^T \mathbb{E}[\mathbf{w}\mathbf{w}^T]\phi(\mathbf{x}^\prime) = \phi(\mathbf{x})^T \Sigma_p \phi(\mathbf{x}^\prime).$$

So $f(\mathbf{x})$ and $f(\mathbf{x}^\prime)$ are jointly Gaussian with zero mean and covariance given by $\phi(\mathbf{x})^T \Sigma_p \phi(\mathbf{x}^\prime)$.  

Let's consider a particular example of a covariance function: the **Squared Exponential (SE)**.  This covariance function is also called Gaussian, or the **Radial Basis Function (RBF)** (in Bishop and in scikit-learn).  

$$\mathsf{cov}(f(\mathbf{x}_p),f(\mathbf{x}_q)) = k(\mathbf{x}_p,\mathbf{x}_q) = \exp\left (-\frac{1}{2}|\mathbf{x}_p - \mathbf{x}_q|^2 \right).$$

This is a rather strange function.  The covariance between outputs is a function of the inputs.  It can be seen that the covariance will be close to unity if the corresponding inputs are close together.  

This is an important point for understanding and engineering kernels.  You may think about "what do I want the relationship between two $y$ values to be, given the known relationship between two $x$ values."

It turns out that the squared exponential covariance function corresponds to a Bayesian linear regression model with an infinite number of basis functions. 

The specification of the covariance function implies a _distribution over functions_.  So we can draw samples from the distribution evaluated at any number of points $xplot$, i.e., not points used to constrain the problem, but points for testing the result.  Let's do that below.

In [None]:
# define a squared exponential covariance function
def squared_exponential(x1, x2, h):
    return np.exp(-0.5 * (x1 - x2) ** 2 / h ** 2)

In [None]:
# draw samples from the unconstrained covariance
#np.random.seed(1)
x = np.linspace(0, 10, 100)
#h = 1.0
h = 0.5

mu = np.zeros(len(x))
C = squared_exponential(x, x[:, None], h)
draws = np.random.multivariate_normal(mu, C, 3)

plt.plot(x, draws.T, '-k')


In [None]:
pylab.rcParams['figure.figsize'] = (15, 6)

plt.subplot(1,2,1)

plt.contourf(x,x,C)

plt.subplot(1,2,2)

plt.plot(C[50,:])

Notes:
- $x$ is a 100 element vector spanning 0 to 10.
- $C$ is a $100 \times 100$ matrix created from $x$.  In cross section (either $x$ or $y$ direction) it looks like a Gaussian.
- "The multivariate normal, multinormal or Gaussian distribution is a generalization of the one-dimensional normal distribution to higher dimensions. Such a distribution is specified by its mean and covariance matrix. These parameters are analogous to the mean (average or “center”) and variance (standard deviation, or “width,” squared) of the one-dimensional normal distribution." - wikipedia

Also the functions do seem to have a characteristic scale length, given in the above example by $h$.  It can be thought of as the distance you have to move in input space before the function value changes significantly.

Next, let's incorporate the training data.  _This will constrain the function choices_.  For the first example, the training data will have no uncertainty.  The joint distribution of the training outputs $\mathbf{f}$ and the test outputs $\mathbf{f}_*$ according to the prior is:

$$\begin{bmatrix}
\mathbf{f} \\
\mathbf{f}_* 
\end{bmatrix} \sim \mathcal{N} \left(\mathbf{0},
\begin{bmatrix} 
K(X,X) &  K(X,X_*) \\
K(X_*,X) & K(X_*,X_*) 
\end{bmatrix}
\right ).$$



So the chosen functions will have to go through the points in the training set.  One way to think about it is to consider that you are generating a large number of functions, as above, from the prior, then rejecting the ones that disagree with the observation.  That would not be a very efficient method, so instead, the joint Gaussian prior distribution is conditioned on the observations.

$$\mathbf{f}_* | X_*, X, \mathbf{f} \sim \mathcal{N}\left(K(X_*,X) K(X,X)^{-1} \mathbf{f},\; K(X_*,X_*)-K(X_*,X)K(X,X)^{-1} K(X,X_*)\right).$$

So, function values $\mathbf{f}_*$ corresponding to test inputs $X_*$ can be sampled from the joint posterior distribution by evaluating the mean and covariance matrix from the equation above, and then generating samples.
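The conditioning step can be sketched directly in numpy. This is only an illustration, not the scikit-learn implementation: the helper name `gp_posterior`, the choice $h=1$, and the small diagonal jitter are assumptions made here, and the training points are the same two used in the cell below.

```python
import numpy as np

def squared_exponential(x1, x2, h):
    # same functional form as the covariance function used in this lecture
    return np.exp(-0.5 * (x1 - x2) ** 2 / h ** 2)

def gp_posterior(X_train, f_train, X_test, h):
    """Mean and covariance of f_* given noise-free observations f_train."""
    K = squared_exponential(X_train[:, None], X_train[None, :], h)     # K(X, X)
    K_s = squared_exponential(X_train[:, None], X_test[None, :], h)    # K(X, X_*)
    K_ss = squared_exponential(X_test[:, None], X_test[None, :], h)    # K(X_*, X_*)
    # solve() applies K(X, X)^{-1} without forming the inverse explicitly
    mean = K_s.T.dot(np.linalg.solve(K, f_train))
    cov = K_ss - K_s.T.dot(np.linalg.solve(K, K_s))
    return mean, cov

X_train = np.array([2.5, 7.0])
f_train = np.cos(X_train)
X_test = np.linspace(0, 10, 100)

post_mean, post_cov = gp_posterior(X_train, f_train, X_test, h=1.0)

# a tiny jitter on the diagonal (an assumption of this sketch) guards against
# round-off making the covariance slightly non-positive-semidefinite
rng = np.random.RandomState(0)
jitter = 1e-10 * np.eye(len(X_test))
post_draws = rng.multivariate_normal(post_mean, post_cov + jitter, 3)
```

Evaluating `gp_posterior` at the training inputs themselves returns the observed values with zero variance, which is the statement that the sampled functions must pass through the training points.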

In [None]:
from sklearn.gaussian_process import GaussianProcess


# Constrain the mean and covariance with two points
x1 = np.array([2.5, 7])
y1 = np.cos(x1)
gp1 = GaussianProcess(corr='squared_exponential', theta0=0.5,
                      random_state=0)
gp1.fit(x1[:, None], y1)
f1, MSE1 = gp1.predict(x[:, None], eval_MSE=True)
f1_err = np.sqrt(MSE1)




In [None]:
plt.plot(x, f1, '-', color='gray')
plt.fill_between(x, f1 - 2 * f1_err, f1 + 2 * f1_err, color='gray', alpha=0.3)
plt.plot(x1, y1, '.k', ms=6)

Let's look at the Gaussian Process model class GaussianProcess.

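(Keep in mind, however, that this class was deprecated in scikit-learn 0.18 and removed in 0.20 in favor of GaussianProcessRegressor, so the cells that use GaussianProcess require an older scikit-learn.  GaussianProcessRegressor has a different API (Application Programming Interface) than GaussianProcess.)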

The inputs:
- corr is the correlation model.  The default is 'squared_exponential', and the other built-in models are: 'absolute_exponential', 'generalized_exponential', 'cubic', 'linear'.
- theta0 is an array with shape (n_features,); in this case it is a scalar.  It holds the parameters of the autocorrelation model.  The default is an isotropic autocorrelation model with theta0=0.1.
- random_state=0 - This specifies properties of the random number generator.  Specifically, if it is set equal to a fixed integer such as zero, it fixes the seed of the random number generator so that exactly the same plot is obtained.  It defaults to the numpy random number generator.

The methods:
- fit: The Gaussian Process model fitting method.
- predict: This method evaluates the Gaussian Process model at x.  With eval_MSE=True, it returns the predictions $y$, and $MSE$, an array with shape (n_eval,), or (n_eval,n_targets) with the mean squared error at x.




### Predictions using Noisy Observations

Here, we assume that we do not have $f(\mathbf{x})$, but rather $y=f(\mathbf{x})+\epsilon$.  Then the prior becomes:

$$cov(y_p,y_q) = k(\mathbf{x}_p, \mathbf{x}_q) + \sigma_n^2\delta_{pq}$$
or
$$cov(\mathbf{y})=K(X,X)+\sigma_n^2 I$$

where $\delta_{pq}$ is the Kronecker delta.  So the joint distribution of the observed target values and the function values at the test locations under the prior is:

$$\begin{bmatrix}
\mathbf{y} \\
\mathbf{f}_* 
\end{bmatrix} \sim \mathcal{N} \left(\mathbf{0},
\begin{bmatrix} 
K(X,X)+\sigma_n^2I &  K(X,X_*) \\
K(X_*,X) & K(X_*,X_*) 
\end{bmatrix}
\right ).$$

The conditional distribution then becomes:

$$\mathbf{f}_* | X, \mathbf{y}, X_* \sim \mathcal{N}(\bar{\mathbf{f}}_*,cov(\mathbf{f}_*))$$

where

$$\bar{\mathbf{f}}_* \equiv \mathbb{E}[\mathbf{f}_* | X, \mathbf{y}, X_*] = K(X_*,X)[K(X,X)+\sigma_n^2 I]^{-1}\mathbf{y},$$
$$cov(\mathbf{f}_*)= K(X_*,X_*)-K(X_*,X)[K(X,X)+\sigma_n^2 I]^{-1} K(X,X_*).$$



To make these equations less unwieldy, introduce some shorthand notation:
- $K=K(X,X)$
- $K_*=K(X,X_*)$
- $\mathbf{k}(\mathbf{x}_*)=\mathbf{k}_*$ (for a single test point $\mathbf{x}_*$)

Then
$$\bar{f}_*=\mathbf{k}_*^T (K+\sigma_n^2 I)^{-1} \mathbf{y},$$
$$\mathbb{V}[f_*]=k(\mathbf{x}_*,\mathbf{x}_*) - \mathbf{k}_*^T (K+\sigma_n^2 I)^{-1}\mathbf{k}_*.$$

Examining these two equations, we notice a few things. First, the mean prediction $\bar{f}_*$ is a linear combination of the observations $\mathbf{y}$; for this reason it is sometimes referred to as a linear predictor.  Second, the same equation can be read as a linear combination of $n$ kernel functions, each one centered on a training point:

$$\bar{f}(\mathbf{x}_*) = \sum_{i=1}^{n} \alpha_i k(\mathbf{x}_i,\mathbf{x}_*)$$ 

where $\boldsymbol{\alpha}=(K+\sigma_n^2I)^{-1}\mathbf{y}$.  

So despite the fact that the GP can be represented in terms of a possibly infinite number of basis functions, the estimate at a single point is a finite sum.  
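The two predictive equations, and the equivalence between the matrix form of the mean and the finite sum over kernel functions, can be checked with a few lines of numpy. This is a sketch under assumptions: the function name `gp_predict_noisy`, the length scale $h=1$, and $\sigma_n=0.2$ are choices made here for illustration, and $k(\mathbf{x}_*,\mathbf{x}_*)=1$ because the squared-exponential kernel with unit signal variance is used.

```python
import numpy as np

def squared_exponential(x1, x2, h):
    return np.exp(-0.5 * (x1 - x2) ** 2 / h ** 2)

def gp_predict_noisy(X_train, y_train, X_test, h, sigma_n):
    """Predictive mean and variance of f_* given noisy observations y."""
    K = squared_exponential(X_train[:, None], X_train[None, :], h)
    K_s = squared_exponential(X_train[:, None], X_test[None, :], h)  # columns are k_*
    A = K + sigma_n ** 2 * np.eye(len(X_train))                      # K + sigma_n^2 I
    alpha = np.linalg.solve(A, y_train)
    mean = K_s.T.dot(alpha)
    # V[f_*] = k(x_*, x_*) - k_*^T (K + sigma_n^2 I)^{-1} k_*, with k(x_*, x_*) = 1
    var = 1.0 - np.sum(K_s * np.linalg.solve(A, K_s), axis=0)
    return mean, var, alpha

X_train = np.array([2.5, 7.0])
y_train = np.cos(X_train)
X_test = np.linspace(0, 10, 100)

mean, var, alpha = gp_predict_noisy(X_train, y_train, X_test, h=1.0, sigma_n=0.2)

# the same mean written as an explicit sum of kernels centered on training points
mean_sum = sum(a * squared_exponential(xi, X_test, 1.0)
               for a, xi in zip(alpha, X_train))
print("max difference between the two forms: %.2e" % np.max(np.abs(mean - mean_sum)))
```

Unlike the noise-free case, the predictive variance no longer drops to zero at the training inputs.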

Include some noise in the data above.  We will use homoscedastic errors for this example.


In [None]:
#------------------------------------------------------------
# Constrain the mean and covariance with two noisy points
#  scikit-learn gaussian process uses nomenclature from the geophysics
#  community, where a "nugget" can be specified.  The nugget is added to
#  the diagonal of the assumed covariance matrix.  This is how the
#  uncertainty on the observed values is incorporated into the calculation
dy2 = 0.2
gp2 = GaussianProcess(corr='squared_exponential', theta0=0.5,
                      nugget=(dy2 / y1) ** 2, random_state=0)
gp2.fit(x1[:, None], y1)
f2, MSE2 = gp2.predict(x[:, None], eval_MSE=True)
f2_err = np.sqrt(MSE2)

print(gp2.theta_)

In [None]:
plt.plot(x, f2, '-', color='gray')
plt.fill_between(x, f2 - 2 * f2_err, f2 + 2 * f2_err, color='gray', alpha=0.3)
plt.errorbar(x1, y1, dy2, fmt='.k', ms=6)


In [None]:
# Constrain the mean and covariance with many noisy points
x3 = np.linspace(0, 10, 20)
y3 = np.cos(x3)
dy3 = 0.2
y3 = np.random.normal(y3, dy3)
gp3 = GaussianProcess(corr='squared_exponential', theta0=0.5,
                      thetaL=0.01, thetaU=10.0,
                      nugget=(dy3 / y3) ** 2,
                      random_state=0)
gp3.fit(x3[:, None], y3)
f3, MSE3 = gp3.predict(x[:, None], eval_MSE=True)
f3_err = np.sqrt(MSE3)

# we have fit for the `h` parameter: print the result here:
print("best-fit theta = {}".format(gp3.theta_))


In [None]:
plt.plot(x, f3, '-', color='gray')
plt.fill_between(x, f3 - 2 * f3_err, f3 + 2 * f3_err, color='gray', alpha=0.3)
plt.errorbar(x3, y3, dy3, fmt='.k', ms=6)


Note in the above that when the bounds thetaL and thetaU are supplied, the value of theta is fit (by maximum likelihood) rather than held fixed at theta0.  The fitted value may be useful information for other purposes.

### Varying the Hyperparameters

Typically the covariance functions will have some free parameters (_hyperparameters_).  For example, the squared-exponential covariance function in one dimension has the following form:

$$k_y(x_p,x_q) = \sigma_f^2 \exp \left (-\frac{1}{2l^2}(x_p-x_q)^2\right )+\sigma_n^2\delta_{pq}.$$

In this equation, the length scale $l$, the signal variance $\sigma_f^2$ and the noise variance $\sigma_n^2$ can be varied.  

Generally speaking, these will be determined from the training data.  But in order to understand how to interpret them, let's see what they do.  We'll generate some data with one set of hyperparameters, and fit it with some others.  We'll generate with $(l,\sigma_f,\sigma_n)=(1,1,0.1)$.

In [None]:
# define a squared exponential covariance function
def squared_exponential(x1, x2, h):
    return np.exp(-0.5 * (x1 - x2) ** 2 / h ** 2)

# Set up an x-vector
x=np.sort(np.random.uniform(0,10,11))
xplot = np.linspace(0, 10, 100)

#referring to the definition above, l=h, and sigma_f=1 already
h = 1.0

#dy is sigma_n
dy = 0.1

muplot=np.zeros(len(xplot))
Cplot=squared_exponential(xplot, xplot[:, None], h)
drawsplot = np.random.multivariate_normal(muplot, Cplot)

#now interpolate some y values at the random x values.

# note: no noise is added to y here; dy only sets the size of the error bars
y=np.interp(x,xplot,drawsplot.T)

plt.errorbar(x, y, dy, fmt='.k', ms=6)
plt.plot(xplot, drawsplot.T, '-k')



Now model with different values of $l$.  A problem is that this parameter in GaussianProcess is theta0, and it is not $l$ exactly, and in fact has an inverse relationship with $l$.  

First we'll try a large value of theta0, which corresponds to a small value of $l$, and therefore a short length scale.

In [None]:
gp4 = GaussianProcess(corr='squared_exponential', theta0=50.0,
                      nugget=(dy / y) ** 2, random_state=0)
gp4.fit(x[:, None], y)
f4, MSE4 = gp4.predict(xplot[:, None], eval_MSE=True)
f4_err = np.sqrt(MSE4)

print(gp4.theta_)

In [None]:
plt.plot(xplot, f4, '-', color='gray')
plt.fill_between(xplot, f4 - 2 * f4_err, f4 + 2 * f4_err, color='gray', alpha=0.3)
plt.errorbar(x, y, dy, fmt='.k', ms=6)


The function is seen to be too wiggly.  

Next try a small value of theta0.



In [None]:
gp5 = GaussianProcess(corr='squared_exponential', theta0=0.10,
                      nugget=(dy / y) ** 2, random_state=0)
gp5.fit(x[:, None], y)
f5, MSE5 = gp5.predict(xplot[:, None], eval_MSE=True)
f5_err = np.sqrt(MSE5)

print(gp5.theta_)

In [None]:
plt.plot(xplot, f5, '-', color='gray')
plt.fill_between(xplot, f5 - 2 * f5_err, f5 + 2 * f5_err, color='gray', alpha=0.3)
plt.errorbar(x, y, dy, fmt='.k', ms=6)


The function is seen to be too smooth.

Next, let us have it fit for theta0 and see what it gets.

In [None]:
gp6 = GaussianProcess(corr='squared_exponential', theta0=1.0,
                      thetaL=0.01, thetaU=10.0,
                      nugget=(dy / y) ** 2,
                      random_state=0)
gp6.fit(x[:, None], y)
f6, MSE6 = gp6.predict(xplot[:, None], eval_MSE=True)
f6_err = np.sqrt(MSE6)

# we have fit for the `h` parameter: print the result here:
print("best-fit theta = {}".format(gp6.theta_))


In [None]:
plt.plot(xplot, f6, '-', color='gray')
plt.fill_between(xplot, f6 - 2 * f6_err, f6 + 2 * f6_err, color='gray', alpha=0.3)
plt.errorbar(x, y, dy, fmt='.k', ms=6)


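This looks pretty good for the sparseness of the data.  As for how $l$ relates to theta0: the legacy squared-exponential correlation model is $\exp(-\theta d^2)$, so comparing with the covariance function above gives $\theta = 1/(2l^2)$, i.e., $\theta = 0.5$ for $l=1$.  With only 11 points the best-fit value need not land exactly there; more experimentation would confirm / deny this.

At any rate, the fitted value of theta may be a useful parameter to describe the relative complexity of the data.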

## Kernels

The bottom line seems to be that engineering kernels may be where the strength and flexibility lies in this method.  But for many cases, the exponential-squared kernel (also known as radial basis function) discussed above may be useful.

A necessary and sufficient condition for a function $k(x,x^\prime)$ to be a valid kernel is that the Gram matrix $\mathbf{K}$, whose elements are given by $k(x_n,x_m)$, should be _positive semidefinite_ for all choices of the set $\{x_n\}$.  

What is positive semidefinite?  A matrix $\mathbf{A}$ is positive semidefinite if $\mathbf{w}^T \mathbf{A} \mathbf{w} \ge 0$ for all choices of $\mathbf{w}$.  For a symmetric matrix, this is equivalent to all eigenvalues $\lambda_i$ of $\mathbf{A}$ being greater than or equal to zero.  So for any particular Gram matrix you can just diagonalize it to find out if it is acceptable; keep in mind, though, that a valid kernel must pass this test for every choice of the set $\{x_n\}$.
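A minimal numerical version of this eigenvalue test is sketched below. The helper `is_psd` and its tolerance are assumptions of this sketch (round-off can push a zero eigenvalue slightly negative), and the squared distance $(x-x^\prime)^2$ is used as a hypothetical example of a function that is *not* a valid kernel: its Gram matrix has zero trace, so it must have a negative eigenvalue.

```python
import numpy as np

def is_psd(A, tol=1e-8):
    """True if the symmetric matrix A has no eigenvalue below -tol."""
    A = np.asarray(A, dtype=float)
    # eigvalsh is intended for symmetric matrices and returns real eigenvalues
    return bool(np.all(np.linalg.eigvalsh(A) >= -tol))

x = np.linspace(0, 10, 30)
d = x[:, None] - x[None, :]

gram_rbf = np.exp(-0.5 * d ** 2)   # squared-exponential kernel with l = 1
gram_bad = d ** 2                  # hypothetical non-kernel: squared distance

print("squared exponential is PSD: %s" % is_psd(gram_rbf))
print("squared distance is PSD:    %s" % is_psd(gram_bad))
```

Passing the test for one set of points does not prove a function is a valid kernel, but failing it for any set of points is enough to rule it out.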

One method for constructing new kernels is to build them out of simpler kernels.  Given valid kernels $k_1(x,x^\prime)$ and $k_2(x,x^\prime)$, then the following new kernels will also be valid.

$$k(x,x^\prime)=c k_1(x,x^\prime)$$
$$k(x,x^\prime)=f(x) k_1(x,x^\prime) f(x^\prime)$$
$$k(x,x^\prime)=q(k_1(x,x^\prime))$$
$$k(x,x^\prime)=\exp(k_1(x,x^\prime))$$
$$k(x,x^\prime)=k_1(x,x^\prime)+k_2(x,x^\prime)$$
$$k(x,x^\prime)=k_1(x,x^\prime)k_2(x,x^\prime)$$
$$k(x,x^\prime)=k_3(\phi(x),\phi(x^\prime))$$
$$k(x,x^\prime)=x^T \mathbf{A} x^\prime$$
$$k(x,x^\prime)=k_a(x_a,x_a^\prime)+k_b(x_b,x_b^\prime)$$
$$k(x,x^\prime)=k_a(x_a,x_a^\prime)k_b(x_b,x_b^\prime)$$

where 
- $c>0$ is a constant, 
- $f(\cdot)$ is any function
- $q(\cdot)$ is a polynomial with nonnegative coefficients 
- $\phi(x)$ is a function from $x$ to $\mathbb{R}^M$
- $k_3(\cdot,\cdot)$ is a valid kernel in $\mathbb{R}^M$
- $\mathbf{A}$ is a symmetric positive semidefinite matrix
- $x_a$ and $x_b$ are variables (not necessarily disjoint) with $x=(x_a,x_b)$
- $k_a$ and $k_b$ are valid kernel functions over their respective spaces.

So there is a lot of flexibility.
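Several of these construction rules can be spot-checked numerically by building Gram matrices for a random set of points and looking at the smallest eigenvalue. The sketch below is an illustration only: the choice of $k_1$ (squared exponential), $k_2$ (linear), $f(x)=\sin x$, and $c=3$ are assumptions made here, and products of kernels correspond to elementwise products of Gram matrices.

```python
import numpy as np

def min_eig(K):
    # smallest eigenvalue of a symmetric matrix
    return np.linalg.eigvalsh(K).min()

rng = np.random.RandomState(1)
x = rng.uniform(-2, 2, 25)
d = x[:, None] - x[None, :]

K1 = np.exp(-0.5 * d ** 2)   # k_1: squared exponential
K2 = np.outer(x, x)          # k_2: linear kernel x * x'
fx = np.sin(x)               # an arbitrary function f(x)

candidates = {
    "c k1": 3.0 * K1,
    "f(x) k1 f(x')": fx[:, None] * K1 * fx[None, :],
    "exp(k1)": np.exp(K1),
    "k1 + k2": K1 + K2,
    "k1 k2": K1 * K2,
}

for name in sorted(candidates):
    print("%-15s min eigenvalue = %.2e" % (name, min_eig(candidates[name])))
```

All of the smallest eigenvalues should come out non-negative to within round-off, consistent with the rules above.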

###  Specific examples:

**Radial-basis function (RBF) kernel**

This is the default kernel in the scikit-learn GaussianProcessRegressor.  The kernel is given by:

$$k(x_i,x_j)= \exp\left (-\frac{1}{2} d(x_i/l,x_j/l )^2 \right)$$

where $d$ is the Euclidean distance.  With $l=1$, this is the same kernel as written above:

$$\mathsf{cov}(f(\mathbf{x}_p),f(\mathbf{x}_q)) = k(\mathbf{x}_p,\mathbf{x}_q) = \exp\left (-\frac{1}{2}|\mathbf{x}_p - \mathbf{x}_q|^2 \right).$$

An example is shown below:

![scikit-learn RBF illustration](http://scikit-learn.org/stable/_images/sphx_glr_plot_gpr_prior_posterior_000.png)


    


**The Matern kernel**

This is a generalization of the RBF, which has an additional parameter $\nu$ which controls the smoothness of the resulting function, as well as the length scale $l>0$.  The kernel is given by

$$k(x_i,x_j)=\sigma^2 \frac{1}{\Gamma(\nu)2^{\nu-1}} \left ( \gamma \sqrt{2\nu} d(x_i/l,x_j/l)\right ) ^\nu K_{\nu} \left ( \gamma \sqrt{2\nu} d(x_i/l,x_j/l)\right ).$$

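Here $\Gamma$ is the gamma function and $K_\nu$ is a modified Bessel function.  The scikit-learn webpage does not say what $\gamma$ is; the standard form of the Matern kernel (e.g., Rasmussen & Williams, Eq. 4.14) has no such factor, which corresponds to $\gamma=1$.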

- If $\nu \rightarrow \infty$, then the Matern kernel converges to the RBF kernel.  
- When $\nu=1/2$, the Matern kernel becomes identical to the absolute exponential kernel (i.e., it uses the l1 distance rather than the squared l2 distance of the squared exponential discussed above).
- $\nu=3/2$ and $\nu=5/2$ are also popular.  Basically, $\nu$ controls the smoothness of the function, as seen in the example below.

![scikit-learn matern kernel](http://scikit-learn.org/stable/_images/sphx_glr_plot_gpr_prior_posterior_004.png)


**Rational Quadratic Kernel**

The RationalQuadratic kernel can be seen as a scale mixture of RBF kernels with different characteristic length scales.  It is parameterized by a scale parameter $l>0$ and a scale mixture parameter $\alpha > 0$.  In scikit-learn, this kernel is only available for scalar $l$ at the moment.

$$k(x_i,x_j)=\left ( 1+ \frac{d(x_i,x_j)^2}{2\alpha l^2} \right )^{-\alpha}$$

The example of the rational quadratic is seen below.

![scikit learn example of Rational Quadratic](http://scikit-learn.org/stable/_images/sphx_glr_plot_gpr_prior_posterior_001.png)
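One way to build intuition for the scale mixture parameter is to note that, for large $\alpha$, the rational quadratic kernel approaches the RBF kernel with the same $l$, while for small $\alpha$ it has much heavier tails. The sketch below checks this numerically; the function names and the grid of $\alpha$ values are choices made here for illustration.

```python
import numpy as np

def rational_quadratic(d, l, alpha):
    return (1.0 + d ** 2 / (2.0 * alpha * l ** 2)) ** (-alpha)

def rbf(d, l):
    return np.exp(-0.5 * (d / l) ** 2)

d = np.linspace(0, 5, 200)   # distances at which to compare the kernels
l = 1.0

gaps = []
for alpha in [0.5, 5.0, 50.0, 5000.0]:
    gap = np.max(np.abs(rational_quadratic(d, l, alpha) - rbf(d, l)))
    gaps.append(gap)
    print("alpha = %7.1f   max |RQ - RBF| = %.2e" % (alpha, gap))
```

The maximum difference shrinks steadily as $\alpha$ grows.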

**Exp-Sine-Squared kernel**

The ExpSineSquared kernel allows modeling of periodic functions.  It is parameterized by length scale $l>0$ and periodicity parameter $p>0$.  In scikit-learn, only the scalar $l$ is supported.  The kernel is given by:

$$k(x_i,x_j)=\exp\left(-\frac{2 \sin^2 (\pi\, d(x_i,x_j)/p)}{l^2}\right).$$

An example of an ExpSineSquared kernel is seen below:

![scikit-learn example of ExpSineSquared kernel](http://scikit-learn.org/stable/_images/sphx_glr_plot_gpr_prior_posterior_002.png)

**Dot-Product Kernel**

The DotProduct Kernel is non-stationary. That is, unlike the kernels above, it does not depend only on the distance $d(x_i,x_j)$.  

The dot product kernel is invariant to a rotation of the coordinates about the origin, but not translations.  It is parameterized by $\sigma_0^2$.  For $\sigma_0^2=0$, the kernel is called the homogeneous linear kernel.  Otherwise it is inhomogeneous.  The kernel is given by:
$$k(x_i,x_j)=\sigma_0^2+x_i\cdot x_j.$$

The DotProduct kernel is commonly combined with exponentiation.  An example with exponent 2 is shown in the figure below.

![scikit-learn dotproduct kernel](http://scikit-learn.org/stable/_images/sphx_glr_plot_gpr_prior_posterior_003.png)
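The rotation and translation behavior of the dot product kernel is easy to verify on a pair of points. In this sketch the two points, the rotation angle, and the shift vector are arbitrary values chosen for illustration.

```python
import numpy as np

def dot_product_kernel(xi, xj, sigma0=1.0):
    return sigma0 ** 2 + np.dot(xi, xj)

xi = np.array([1.0, 2.0])
xj = np.array([-0.5, 3.0])

# rotation about the origin by an arbitrary angle
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# translation of both points by the same vector
shift = np.array([10.0, -4.0])

k_orig = dot_product_kernel(xi, xj)
k_rot = dot_product_kernel(R.dot(xi), R.dot(xj))
k_shift = dot_product_kernel(xi + shift, xj + shift)

print("original: %.3f  rotated: %.3f  translated: %.3f" % (k_orig, k_rot, k_shift))
```

The rotated value matches the original, while the translated value does not; a stationary kernel such as the RBF would be unchanged by both operations.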

scikit-learn has a few other kernels including:

- [ConstantKernel](http://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.kernels.ConstantKernel.html#sklearn.gaussian_process.kernels.ConstantKernel)
- [WhiteKernel](http://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.kernels.WhiteKernel.html#sklearn.gaussian_process.kernels.WhiteKernel)

And kernel operators including:
- [Sum](http://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.kernels.Sum.html#sklearn.gaussian_process.kernels.Sum)
- [Product](http://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.kernels.Product.html#sklearn.gaussian_process.kernels.Product)
- [Exponentiation](http://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.kernels.Exponentiation.html#sklearn.gaussian_process.kernels.Exponentiation)

The Gaussian Process Kernel API is given [here](http://scikit-learn.org/stable/modules/gaussian_process.html#kernels-for-gaussian-processes)


### Learning Hyperparameters

Many of the kernels discussed above have hyperparameters such as the length scale.

These can be fixed, and Gaussian Process Regression can be used simply for prediction, as discussed above.

But another possibility is to _learn the hyperparameters_, i.e., find the best-fitting values, e.g., using gradient methods.   Then, it might be interesting to use the hyperparameters themselves as data.  For example, if I fit a collection of data sets, deriving the hyperparameters for each one, it is possible that the hyperparameters may correlate interestingly with other properties. 
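As a rough illustration of what "learning" a hyperparameter means, the sketch below scans the length scale over a grid and keeps the value that maximizes the log marginal likelihood, $\log p(\mathbf{y}|X) = -\frac{1}{2}\mathbf{y}^T K_y^{-1}\mathbf{y} - \frac{1}{2}\log|K_y| - \frac{n}{2}\log 2\pi$ with $K_y = K + \sigma_n^2 I$. Note the assumptions: this objective is taken from Rasmussen & Williams and is not derived in this lecture, a simple grid search stands in for the gradient methods mentioned above, and the synthetic cosine data, the noise level, and the function names are invented for the example.

```python
import numpy as np

def squared_exponential(x1, x2, l):
    return np.exp(-0.5 * (x1 - x2) ** 2 / l ** 2)

def log_marginal_likelihood(x, y, l, sigma_n):
    n = len(x)
    Ky = squared_exponential(x[:, None], x[None, :], l) + sigma_n ** 2 * np.eye(n)
    L = np.linalg.cholesky(Ky)                            # Ky = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # Ky^{-1} y
    # -0.5 log|Ky| equals -sum(log(diag(L)))
    return (-0.5 * y.dot(alpha) - np.sum(np.log(np.diag(L)))
            - 0.5 * n * np.log(2 * np.pi))

# synthetic data (an assumption of this sketch)
rng = np.random.RandomState(0)
x = np.sort(rng.uniform(0, 10, 40))
sigma_n = 0.1
y = np.cos(x) + sigma_n * rng.randn(len(x))

# brute-force scan over the length scale
l_grid = np.logspace(-1, 1, 50)
lml = np.array([log_marginal_likelihood(x, y, l, sigma_n) for l in l_grid])
l_best = l_grid[np.argmax(lml)]
print("best-fit length scale on the grid: %.2f" % l_best)
```

Repeating such a fit for many data sets would yield one best-fit length scale per data set, which is the kind of derived quantity that could then be compared with other properties.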

## Kernel Ridge Regression

Gaussian Process Regression is a Bayesian method. There exists an analogous frequentist method that uses the same idea (and the same kernel trick).  It is called Kernel Ridge Regression.


Kernel ridge regression (KRR) combines Ridge Regression (linear least squares with l2-norm regularization) with the kernel trick. It thus learns a linear function in the space induced by the respective kernel and the data. For non-linear kernels, this corresponds to a non-linear function in the original space.



The form of the model learned by KernelRidge is identical to support vector regression (SVR). However, different loss functions are used: KRR uses squared error loss while support vector regression uses $\epsilon$-insensitive loss, both combined with l2 regularization. In contrast to SVR, fitting KernelRidge can be done in closed-form and is typically faster for medium-sized datasets. On the other hand, the learned model is non-sparse and thus slower than SVR, which learns a sparse model for $\epsilon > 0$, at prediction-time.

Let's start with the solution to regularized linear regression:

$$\mathbf{w} = \left (\lambda \mathbf{I} + \sum_i \mathbf{x}_i \mathbf{x}_i^T \right )^{-1} \left ( \sum_j y_j \mathbf{x}_j \right ).$$

Now we replace all the data with their feature vector: $\mathbf{x}_i \rightarrow \Phi_i = \Phi(\mathbf{x}_i)$.

Writing $\Phi$ for the $d \times n$ matrix whose columns are the feature vectors $\Phi_i$, algebra plus the kernel trick leads to:

$$\mathbf{w}=(\lambda \mathbf{I}_d +\Phi \Phi^T)^{-1} \Phi \mathbf{y} = \Phi (\Phi^T\Phi +\lambda \mathbf{I}_n)^{-1}\mathbf{y}$$

which can be rewritten as 

$$\mathbf{w}=\sum_i \alpha_i \Phi(\mathbf{x_i})$$
with 
$$\boldsymbol{\alpha}=(\Phi^T \Phi +\lambda \mathbf{I}_n)^{-1}\mathbf{y}.$$

This equation, however, still retains the feature vectors, which we don't want to deal with; we want everything in terms of the kernel.  To do this, determine the value at a test point by projecting the test point $\mathbf{x}$ onto the solution: 

$$y=\mathbf{w}^T \Phi(\mathbf{x})$$
$$= \mathbf{y}^T(\Phi^T \Phi+\lambda \mathbf{I}_n)^{-1} \Phi^T\Phi(\mathbf{x})$$
$$=\mathbf{y}^T (K+\lambda \mathbf{I}_n)^{-1} \boldsymbol{\kappa}(\mathbf{x})$$

where $K(\mathbf{x}_i,\mathbf{x}_j)= \Phi(\mathbf{x}_i)^T \Phi(\mathbf{x}_j)$ and $\boldsymbol{\kappa}(\mathbf{x})$ is the vector with elements $K(\mathbf{x}_i,\mathbf{x}).$  That is, the feature vectors have been eliminated and the result is in terms of the kernel.
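The kernelized solution can be checked against the ordinary ridge solution in a few lines of numpy. This is a sketch, not scikit-learn's KernelRidge: the function names `krr_fit` and `krr_predict` and the synthetic data are assumptions made here. With a linear kernel the feature map is the identity, so the dual (kernel) prediction should agree with the primal ridge prediction from the first equation of this section.

```python
import numpy as np

def linear_kernel(A, B):
    # A: (n, d), B: (m, d) -> (n, m) matrix of dot products
    return A.dot(B.T)

def rbf_kernel(A, B, l=1.0):
    sq = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A.dot(B.T)
    return np.exp(-0.5 * sq / l ** 2)

def krr_fit(X, y, kernel, lam):
    # alpha = (K + lambda I_n)^{-1} y
    K = kernel(X, X)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, kernel):
    # y(x) = kappa(x)^T alpha, with kappa_i(x) = K(x_i, x)
    return kernel(X_new, X_train).dot(alpha)

# synthetic data; rows of X are the data vectors x_i
rng = np.random.RandomState(0)
X = rng.randn(30, 3)
y = X.dot(np.array([1.0, -2.0, 0.5])) + 0.1 * rng.randn(30)
X_new = rng.randn(5, 3)
lam = 0.5

# dual (kernel) solution with the linear kernel
alpha = krr_fit(X, y, linear_kernel, lam)
y_dual = krr_predict(X, alpha, X_new, linear_kernel)

# primal ridge solution: w = (lambda I + sum_i x_i x_i^T)^{-1} sum_j y_j x_j
w = np.linalg.solve(lam * np.eye(3) + X.T.dot(X), X.T.dot(y))
y_primal = X_new.dot(w)

print("max |dual - primal| = %.2e" % np.max(np.abs(y_dual - y_primal)))

# swapping in a non-linear kernel changes nothing else in the code
y_rbf = krr_predict(X, krr_fit(X, y, rbf_kernel, lam), X_new, rbf_kernel)
```

The design point is that only Gram matrices ever appear, so the feature vectors never need to be computed explicitly.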

The following example illustrates the similarities and differences.

In [None]:
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import WhiteKernel, ExpSineSquared



In [None]:
rng = np.random.RandomState(0)

# Generate sample data
X = 15 * rng.rand(100, 1)
y = np.sin(X).ravel()
y += 3 * (0.5 - rng.rand(X.shape[0]))  # add noise



In [None]:
plt.plot(X,y,'o')

The below (param_grid) is a dictionary specifying a range of alpha values (for the regularization) and ExpSineSquared input parameters length scale and periodicity.


In [None]:
# Fit KernelRidge with parameter selection based on 5-fold cross validation
param_grid = {"alpha": [1e0, 1e-1, 1e-2, 1e-3],
              "kernel": [ExpSineSquared(l, p)
                         for l in np.logspace(-2, 2, 10)
                         for p in np.logspace(0, 2, 10)]}

print(type(param_grid))

print(param_grid['alpha'])
print(len(param_grid['kernel']))
print(param_grid['kernel'][90])

**GridSearchCV** is an optimizer for hyperparameters.  Read the details [here](http://scikit-learn.org/stable/modules/grid_search.html#grid-search).  The idea is that you can search the hyper-parameter space for the best cross validation score.  So, while you can do this by hand, apparently GridSearchCV is also there to do it for you. (!!)

To run this, you need:
- an estimator (regressor or classifier such as sklearn.svm.SVC()), in this case it will be KernelRidge
- a parameter space, in this case set up in param_grid
- a method for searching or sampling candidates, in this case GridSearchCV, which performs an exhaustive or brute force grid search.
- a cross-validation scheme, here specified as cv=5, i.e., a 5-fold cross validation splitting strategy
- a score function; not specified in this case

Running GridSearchCV sets up kr, which then is used to fit the data $(X,y)$.  You can also inspect kr.best_estimator_ to see the best fit parameters.  The length_scale is around 5 and the periodicity is around 10.



In [None]:
import time


kr = GridSearchCV(KernelRidge(), cv=5, param_grid=param_grid)

stime = time.time()
kr.fit(X, y)
print("Time for KRR fitting: %.3f" % (time.time() - stime))

print(type(param_grid))

In [None]:
kr.best_estimator_

Next we run the GaussianProcessRegressor. For this, we need to specify a kernel. The kernel chosen uses ExpSineSquared again, but adds WhiteKernel and specifies the noise level at 0.1. The length scale may well have been eyeballed from the graph, or the one from the GridSearchCV might have been used.

In [None]:
gp_kernel = ExpSineSquared(1.0, 5.0, periodicity_bounds=(1e-2, 1e1)) \
    + WhiteKernel(1e-1)
gpr = GaussianProcessRegressor(kernel=gp_kernel)
stime = time.time()
gpr.fit(X, y)
print("Time for GPR fitting: %.3f" % (time.time() - stime))



In [None]:
# Predict using kernel ridge
X_plot = np.linspace(0, 20, 10000)[:, None]
stime = time.time()
y_kr = kr.predict(X_plot)
print("Time for KRR prediction: %.3f" % (time.time() - stime))



In [None]:
# Predict using gaussian process regressor
stime = time.time()
y_gpr = gpr.predict(X_plot, return_std=False)
print("Time for GPR prediction: %.3f" % (time.time() - stime))

stime = time.time()
y_gpr, y_std = gpr.predict(X_plot, return_std=True)
print("Time for GPR prediction with standard-deviation: %.3f"
      % (time.time() - stime))



In [None]:
# Plot results
plt.figure(figsize=(10, 5))
lw = 2
plt.scatter(X, y, c='k', label='data')
plt.plot(X_plot, np.sin(X_plot), color='navy', lw=lw, label='True')
plt.plot(X_plot, y_kr, color='turquoise', lw=lw,
         label='KRR (%s)' % kr.best_params_)
plt.plot(X_plot, y_gpr, color='darkorange', lw=lw,
         label='GPR (%s)' % gpr.kernel_)
plt.fill_between(X_plot[:, 0], y_gpr - y_std, y_gpr + y_std, color='darkorange',
                 alpha=0.2)
plt.xlabel('data')
plt.ylabel('target')
plt.xlim(0, 20)
plt.ylim(-4, 4)
plt.title('GPR versus Kernel Ridge')
plt.legend(loc="best",  scatterpoints=1, prop={'size': 8})
plt.show()

Another example of Gaussian process regression can be seen [here](http://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gpr_co2.html).

## Decision Theory

The goal of this discussion of decision theory is to develop a framework so that it can be combined with probability theory to be able to make optimal decisions in cases where there is uncertainty.  This will lead into Classification, which we will tackle next time.

For example, suppose there is an input vector $\mathbf{x}$ and a corresponding vector $\mathbf{t}$ of target variables, and the goal is to be able to predict $\mathbf{t}$ given a new value of $\mathbf{x}$.

- For regression problems, $\mathbf{t}$ will comprise continuous variables, whereas for classification, $\mathbf{t}$ will represent class labels.
- The joint probability distribution $p(\mathbf{x},\mathbf{t})$ provides a complete summary of the uncertainty associated with these variables.
- Determination of $p(\mathbf{x},\mathbf{t})$ from a set of training data is an example of inference. This is generally the question we are trying to address in these kinds of problems.
- In practice, we may want to make a specific prediction of a value of $\mathbf{t}$, or take a specific action based on our understanding of the values $\mathbf{t}$ is likely to take.


The classic example for this kind of problem is the medical diagnosis problem. Consider that a certain test has been done on a patient, and we wish to determine whether or not that patient has cancer, because then we will make the decision on whether to treat or not. 

In this case, $\mathbf{x}$ is the set of test results (which can of course have multiple parts), and the output variable is the class: either $\mathit{C}_1$ (say, $t=0$), where the patient has the cancer, or $\mathit{C}_2$ ($t=1$), where the patient doesn't have the cancer.

Determination of $p(\mathbf{x},C_k)$ will give the most complete probabilistic description of the problem. But the real utility of this exercise is to determine whether the patient should be treated or not. We would like this decision step to be optimal, and it is the subject of decision theory to tell us how to make optimal decisions given the appropriate probabilities.



Note that this discussion can be related to Bayes' theorem as follows. We are interested in $p(C_k|\mathbf{x})$ - the probability that the patient has the cancer given the results of the test. So we have:

$$p(C_k|\mathbf{x}) = \frac{p(\mathbf{x}|C_k)
p(C_k)}{p(\mathbf{x})}.$$

We can interpret these quantities as follows. 
- $p(C_k)$ is the prior probability for the class $\mathit{C}_k$. In the example, $p(C_1)$ would be the probability that the person has the cancer before the test is done. 
- $p(C_1|\mathbf{x})$ is the probability that the person has the cancer in the light of the test results. 
- $p(\mathbf{x}|C_1)$ is the probability that the cancer would produce the observed test results.  

Our aim in this experiment would be to avoid assigning the person to the wrong class, so we would choose the class that has the highest posterior probability.
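To make the roles of the prior, likelihood, and posterior concrete, here is a small sketch with made-up numbers: the 1% prevalence and the 90% / 5% positive-test rates are assumptions for illustration, not values from Bishop.

```python
# hypothetical prior class probabilities p(C_k)
prior = {"cancer": 0.01, "normal": 0.99}

# hypothetical likelihoods p(x = positive test | C_k)
likelihood_positive = {"cancer": 0.90, "normal": 0.05}

# evidence p(x) from the sum rule
evidence = sum(likelihood_positive[c] * prior[c] for c in prior)

# posterior p(C_k | x) from Bayes' theorem
posterior = {c: likelihood_positive[c] * prior[c] / evidence for c in prior}

# the class with the highest posterior probability
best_class = max(posterior, key=posterior.get)

print("p(cancer | positive) = %.3f" % posterior["cancer"])
print("p(normal | positive) = %.3f" % posterior["normal"])
print("class with highest posterior: %s" % best_class)
```

With these numbers the posterior probability of cancer after a positive test is about 0.15, so choosing the class with the highest posterior still assigns the patient to the normal class.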

More rigorously, we divide the input space into regions $\mathit{R}_k$ called decision regions, one for each class, so that all points in $\mathit{R}_k$ are assigned to class $\mathit{C}_k$. Boundaries between these decision regions are called _decision boundaries_. 

A mistake is made if $\mathbf{x}$ belonging to $\mathit{C}_1$ is assigned to $\mathit{C}_2$, or vice versa. The probability of that can be written: 

$$p(\text{mistake}) = p(\mathbf{x} \in \mathit{R}_1,\mathit{C}_2)+p(\mathbf{x} \in
\mathit{R}_2,\mathit{C}_1)$$
$$ = \int_{\mathit{R}_1} p(\mathbf{x},\mathit{C}_2) d\mathbf{x} +\int_{\mathit{R}_2} p(\mathbf{x},\mathit{C}_1) d\mathbf{x} .$$

Clearly, we would like to choose the decision regions so that $p(\text{mistake})$ is minimized. To do so, each $\mathbf{x}$ should be assigned to whichever class gives the smaller integrand in the equation above: if $p(\mathbf{x},\mathit{C}_1) > p(\mathbf{x},\mathit{C}_2)$, then assigning $\mathbf{x}$ to $\mathit{C}_1$ contributes the smaller term, $p(\mathbf{x},\mathit{C}_2)$, to the probability of a mistake.

Equivalently, we should assign $\mathbf{x}$ to the class for which $p(\mathbf{x},C_k)$ is the largest. From the product rule, $p(\mathbf{x},C_k)= p(\mathit{C}_k|\mathbf{x}) p(\mathbf{x})$, and the factor $p(\mathbf{x})$ is common to all classes. Thus, the probability of making a mistake is minimized if $\mathbf{x}$ is assigned to the class where the posterior probability $p(\mathit{C}_k | \mathbf{x})$ is the largest.

This can be generalized as

$$p(\text{correct}) = \sum_{k=1}^K p(\mathbf{x} \in \mathcal{R}_k,\mathcal{C}_k) = \sum_{k=1}^K \int_{\mathit{R}_k} p(\mathbf{x},\mathit{C}_k) d\mathbf{x},$$

which will be maximized if each $\mathbf{x}$ is assigned to the class for which $p(\mathbf{x},\mathit{C}_k)$ is the largest. Again, by the reasoning above, this is equivalent to saying that $\mathbf{x}$ should be assigned to the class having the largest posterior probability $p(\mathit{C}_k|\mathbf{x})$.

### Minimizing the Loss Function

The problem with the development above is that it implicitly assumes that assigning $\mathbf{x}$ to $\mathit{C}_1$ when it belongs to $\mathit{C}_2$ is just as bad as assigning it to $\mathit{C}_2$ when it belongs to $\mathit{C}_1$. In the case of cancer, these would correspond to determining the person does have cancer when, in fact, he or she doesn't (false positive), and determining the person does not have cancer when he or she does (false negative). Both outcomes are unpleasant for the patient, but clearly the second is worse.

So we define a loss (or cost) function that penalizes the two outcomes differently. An example is seen in Figure 1.25 reproduced from Bishop.


| | Cancer | Normal |
|------|-------|--------|
|**Cancer** | 0 | 1000|
|**Normal** | 1 |  0  |




Caption: An example of a loss matrix with elements $L_{kj}$ for the cancer problem. The rows correspond to the true class, and the columns correspond to the assignments of class made by the decision criterion. Thus, predicting a person is normal when he or she actually has cancer is penalized heavily.

The loss function combines with the probability to yield the expected loss:

$$\mathbb{E}[L] = \sum_k \sum_j \int_{\mathit{R}_j} L_{kj}\,
p(\mathbf{x},\mathit{C}_k) d\mathbf{x}.$$

Each $\mathbf{x}$ is assigned to one of the regions $\mathit{R}_j$. We want to choose the regions in order to minimize the expected loss, which implies that for each $\mathbf{x}$ we should minimize $\sum_k L_{kj} p(\mathbf{x},\mathit{C}_k)$ over $j$. Using the product rule $p(\mathbf{x},\mathit{C}_k) = p(\mathit{C}_k|\mathbf{x})p(\mathbf{x})$, the common factor $p(\mathbf{x})$ can be dropped, so we assign $\mathbf{x}$ to the class $\mathit{C}_j$ for which the following quantity is minimized:

$$\sum_k L_{kj} p(\mathit{C}_k | \mathbf{x}).$$
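The effect of the loss matrix on the decision can be seen with a short sketch that uses the loss matrix from the table above. The posterior probabilities passed in are hypothetical values chosen for illustration, and the function names are inventions of this sketch.

```python
# loss matrix L[k][j]: rows = true class, columns = assigned class,
# in the order (cancer, normal), as in the table above
L = [[0, 1000],
     [1, 0]]
classes = ["cancer", "normal"]

def expected_losses(posterior):
    """Return sum_k L_kj p(C_k | x) for each candidate assignment j."""
    return [sum(L[k][j] * posterior[k] for k in range(2)) for j in range(2)]

def decide(posterior):
    """Assign x to the class with the smallest expected loss."""
    losses = expected_losses(posterior)
    return classes[losses.index(min(losses))]

# hypothetical posterior p(C_k | x): cancer is quite unlikely
posterior = [0.01, 0.99]
print("expected losses (cancer, normal): %s" % expected_losses(posterior))
print("minimum expected loss decision:   %s" % decide(posterior))
```

Even though the posterior probability of cancer is only 1% in this example, the expected loss of declaring the patient normal is about ten times larger than that of declaring cancer, so the decision flips relative to simply picking the most probable class.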

(Bishop goes on to discuss the case where the boundaries between regions are not so well defined by developing the reject option, i.e., a region between $\mathit{R}_1$ and $\mathit{R}_2$, where, for example, additional criteria are required to distinguish. Please see Bishop for more details.)



### Approaches to Inference and Decision Problems

Bishop notes that there are three different ways to approach this kind of supervised learning problem. I review them principally because it may not be obvious from the outset that these options are possible, and also because the discussion introduces some useful nomenclature.

**Method 1:**

This method solves the problem in the way implicit to the discussion above: First, the inference problem is solved for each class $\mathit{C}_k$ individually, to yield the class-conditional densities $p(\mathbf{x}| \mathit{C}_k)$. The prior class probabilities $p(\mathit{C}_k)$ are also determined separately. Then combine using Bayes' Theorem in the form

$$p(\mathit{C}_k | \mathbf{x}) = \frac{p(\mathbf{x}|\mathit{C}_k)
p(\mathit{C}_k)}{p(\mathbf{x})}$$

to find the posterior class probabilities $p(\mathit{C}_k |\mathbf{x})$. Then, having found the posterior probabilities, decision theory is used to determine class membership for every new input $\mathbf{x}$. 

The name "generative model" is applied when the distribution of the inputs is known, as well as that of the outputs (from the posterior), because when both of these are known, synthetic data can be generated by sampling from them.

- This method may be difficult and time consuming as it requires estimation of the class-conditional densities for each class. 
- On the other hand, some pieces are simple to obtain; for example, the prior of each class might be simply the fraction of the training set in that class. 
- In addition, having the probabilities might come in handy, as they allow detection of new data points that have low probability under the model - e.g., outliers.

**Method 2:**

In this method, the inference problem of determining the posterior class probabilities $p(\mathit{C}_k|\mathbf{x})$ is solved directly, instead of modeling each class separately, and then decision theory is used to assign each new $\mathbf{x}$ to one of the classes. Approaches that model the posterior probabilities directly are called _discriminative_ models.

This method would be useful if the goal is only to make classification decisions. Indeed, the densities for each class might have lots of structure that doesn't matter for the bottom line.

**Method 3:**

This method does it all in one step. Find a function $f(\mathbf{x})$, called the discriminant function, that maps each input $\mathbf{x}$ directly to a class label. This method might be not just an option but actually required if, e.g., the input and the output both have just two values, and a probability distribution might be difficult to define.

This method is the most direct; however, there is no access to the posterior probability distributions, which may be useful. See Bishop for discussion.

### Loss Functions for Regression

The above discusses the classification problem; what do we do for the regression problem, where a smooth functional result is desired? For this case, the decision stage consists of choosing a specific estimate $y(\mathbf{x})$ of the value of $t$ for each input $\mathbf{x}$. Suppose that in doing so we incur a loss $L(t,y(\mathbf{x}))$. The expected loss is then:

$$\mathbb{E}[L] = \int\int L(t,y(\mathbf{x}))\, p(\mathbf{x},t)\, d\mathbf{x}\, dt.$$

A sensible choice for a regression problem is the squared loss, i.e., $L(t,y(\mathbf{x})) = \{y(\mathbf{x}) -t\}^2$, so that the expected loss is

$$\mathbb{E}[L] = \int\int \{y(\mathbf{x}) - t\}^2 p(\mathbf{x},t) \, d\mathbf{x}\, dt.$$

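Let's choose $y(\mathbf{x})$ to minimize $\mathbb{E}[L]$. Do this by taking the (functional) derivative with respect to $y(\mathbf{x})$ and setting it equal to zero, i.e.,

$$\frac{\partial \mathbb{E}[L]}{\partial y(\mathbf{x})} = 2 \int \{ y(\mathbf{x}) - t \} p(\mathbf{x},t) \, dt = 0.$$

Solving for $y(\mathbf{x})$, and using $\int p(\mathbf{x},t)\, dt = p(\mathbf{x})$ and $p(\mathbf{x},t) = p(t|\mathbf{x})\,p(\mathbf{x})$, gives

$$y(\mathbf{x}) = \frac{\int t p(\mathbf{x},t) \,dt}{p(\mathbf{x})} = \int t p(t|\mathbf{x}) dt = \mathbb{E}_t[t|\mathbf{x}],$$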

which is the conditional average of $t$ conditioned on $\mathbf{x}$ and is known as the regression function. 
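
This result can be checked numerically. The following is a minimal sketch, assuming a small made-up discrete joint distribution $p(\mathbf{x},t)$ (three values of $x$, four values of $t$) so that the integrals become exact sums. The table values and the helper `expected_loss` are hypothetical and serve only as illustration.

```python
import numpy as np

# Hypothetical discrete joint distribution p(x, t): rows index x, columns index t.
t_values = np.array([0.0, 1.0, 2.0, 3.0])
p_xt = np.array([[0.10, 0.15, 0.05, 0.00],
                 [0.05, 0.05, 0.20, 0.10],
                 [0.00, 0.05, 0.05, 0.20]])

p_x = p_xt.sum(axis=1)                 # marginal p(x)
p_t_given_x = p_xt / p_x[:, None]      # conditional p(t|x)
y_star = p_t_given_x @ t_values        # regression function E[t|x], one value per x

def expected_loss(y):
    """E[L] = sum_x sum_t (y(x) - t)^2 p(x, t) for one prediction y(x) per x value."""
    return np.sum((y[:, None] - t_values[None, :]) ** 2 * p_xt)

# Randomly perturbing y_star should never decrease the expected squared loss.
rng = np.random.default_rng(1)
perturbed = [expected_loss(y_star + rng.normal(0.0, 0.5, size=3)) for _ in range(1000)]

print(y_star)
print(expected_loss(y_star), min(perturbed))
```

Every perturbed predictor has an expected loss at least as large as that of the conditional mean, as the derivation predicts.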

Bishop includes a different derivation that yields a complementary form for $\mathbb{E}[L]$:

$$\mathbb{E}[L] = \int \{y(\mathbf{x}) - \mathbb{E}[t|\mathbf{x}]\}^2 p(\mathbf{x})\, d\mathbf{x} +\int \text{var}[t|\mathbf{x}] p(\mathbf{x})\, d\mathbf{x}.$$

In this formulation, the function to be determined, $y(\mathbf{x})$, occurs only in the first term, which is minimized when $y(\mathbf{x})$ is equal to $\mathbb{E}[t|\mathbf{x}]$, in which case that term vanishes. The second term is the conditional variance of $t$, averaged over $\mathbf{x}$, which represents the intrinsic variability of the target data and can be considered to be noise. Because it is independent of $y(\mathbf{x})$, it represents the irreducible minimum value of the expected loss.
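
The two-term decomposition can be illustrated with a Monte Carlo estimate. This is a sketch under assumptions that are not in the text: a toy model with $\mathbb{E}[t|x] = \sin x$, constant Gaussian noise of standard deviation `sigma`, $x$ drawn uniformly on $[0, 2\pi]$, and a deliberately poor linear predictor `y_linear`.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
sigma = 0.3                                   # assumed noise standard deviation

x = rng.uniform(0.0, 2.0 * np.pi, n)
cond_mean = np.sin(x)                         # E[t|x] in this toy model
t = cond_mean + rng.normal(0.0, sigma, n)     # targets scattered about E[t|x]

def y_linear(x):
    """A deliberately imperfect predictor."""
    return 0.5 - 0.2 * x

total = np.mean((y_linear(x) - t) ** 2)                 # estimate of E[L]
model_term = np.mean((y_linear(x) - cond_mean) ** 2)    # first term of the decomposition
noise_term = sigma ** 2                                 # var[t|x], constant in this model

best = np.mean((cond_mean - t) ** 2)          # loss when y(x) = E[t|x]

print(total, model_term + noise_term)         # these should agree up to sampling error
print(best, noise_term)                       # the floor set by the noise
```

Up to sampling error, the total loss equals the sum of the two terms, and choosing $y(x) = \mathbb{E}[t|x]$ leaves only the noise term.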

As with the classification problem, there are three possible approaches for the regression problem, which are, in order of decreasing complexity:

- First solve the inference problem of determining the joint density $p(\mathbf{x},t)$. Then normalize to find the conditional density $p(t|\mathbf{x})$, and finally calculate the conditional mean using the equation above for $y(\mathbf{x})$.
- First solve the inference problem of determining the conditional density $p(t|\mathbf{x})$, then compute $y(\mathbf{x})$.
- Find a regression function $y(\mathbf{x})$ directly from the training data.

Finally, Bishop notes that the squared loss is not the only possible form. A generalization of the squared loss is the Minkowski loss:

$$\mathbb{E}[L_q]=\int\int |y(\mathbf{x}) - t|^q p(\mathbf{x},t)\, d\mathbf{x}\, dt$$

which reduces to the expected squared loss for $q=2$. 
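
To see how the choice of $q$ changes the optimal prediction, the sketch below minimizes the sample Minkowski loss by brute force for $q=2$ and $q=1$. It assumes a skewed (exponential) stand-in for $p(t|\mathbf{x})$ at a single fixed $\mathbf{x}$, and the helper `minkowski_minimizer` is hypothetical. Bishop states that the minimizer is the conditional mean for $q=2$ and the conditional median for $q=1$; the second fact is not derived in these notes.

```python
import numpy as np

rng = np.random.default_rng(3)
t = rng.exponential(scale=1.0, size=20_000)   # skewed stand-in for p(t|x) at one x

y_grid = np.linspace(0.0, 3.0, 601)           # candidate predictions, spacing 0.005

def minkowski_minimizer(q):
    """Grid search for the y that minimizes the sample mean of |y - t|^q."""
    losses = np.array([np.mean(np.abs(y - t) ** q) for y in y_grid])
    return y_grid[np.argmin(losses)]

y_q2 = minkowski_minimizer(2)
y_q1 = minkowski_minimizer(1)

print(y_q2, t.mean())        # q = 2: minimizer sits at the sample mean
print(y_q1, np.median(t))    # q = 1: minimizer sits at the sample median
```

Because the distribution is skewed, the mean and median differ, so the two losses lead to visibly different predictions.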