# Gaussian Process Regression in Python

<a href="https://en.wikipedia.org/wiki/Gaussian_process" target="_blank">Gaussian process</a> models are less well known than popular machine learning algorithms such as tree-based or perceptron-based models. This is unfortunate, as Gaussian process models are relatively straightforward to understand yet can model fairly complex systems. In this post we will explore how to perform Gaussian process regression using Python.

## Gaussian Processes

Gaussian process models are built on the assumption that observed data points are drawn from a Gaussian distribution. Observed data is assumed to have the form:

_y_<sub>n</sub> = _f_(__x__<sub>n</sub>) + _e_<sub>n</sub>, 

where _f_(__x__<sub>n</sub>) is the true function which we want to observe, __x__<sub>n</sub> is the set of input features, and _e_<sub>n</sub> is Gaussian noise. The conditional probability of observing an array of observations __y__ due to some _f_(__x__) is simply the distribution: 

p(__y__|_f_(__x__)) = _N_(__y__|_f_(__x__), __σ__), 

where __σ__ is the standard deviation of the Gaussian noise.
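The observation model above can be simulated in a few lines. This is only a sketch: the "true" function `f`, the noise level `sigma`, and the input grid `x_obs` are hypothetical choices made for illustration and do not come from any real data set.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

def f(x):
    # Hypothetical "true" function, chosen only for illustration
    return np.sin(x)

sigma = 0.1                                  # assumed noise standard deviation
x_obs = np.linspace(0.0, 5.0, 20)            # assumed input features
e = rng.normal(loc=0.0, scale=sigma, size=x_obs.shape)  # Gaussian noise e_n
y_obs = f(x_obs) + e                         # y_n = f(x_n) + e_n

# The residuals y_n - f(x_n) are exactly the noise draws
residual_std = np.std(y_obs - f(x_obs))
print(y_obs[:3], residual_std)
```

In practice only `x_obs` and `y_obs` are available, and the goal of the regression is to recover `f` from them.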

In order to use Gaussian processes to make predictions on new observations of __y__, we need to determine the marginal probability p(__y__). This probability is obtained by marginalizing the conditional distribution p(__y__|_f_(__x__)) over the function values _f_(__x__):

p(__y__) = ∫p(__y__|_f_(__x__))p(_f_(__x__))d<i>f</i>(__x__).

The marginal distribution of observing the true function p(_f_(__x__)) is defined to be a Gaussian distribution with zero mean and covariance kernel matrix __K__ of size _N_ × _N_:

p(_f_(__x__)) = _N_(_f_(__x__)|__0__, __K__), 

although this definition can be extended to involve some arbitrary mean __m__. Each element of the covariance kernel matrix __K__ measures the similarity between a pair of observed input points, which encodes the assumption that similar inputs should give rise to similar observed values. Various kernels can be used, such as the squared exponential kernel (also known as the quadratic exponential or radial basis function kernel) and the periodic kernel. The type of kernel used depends on prior assumptions made on both the feature and observation space.
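To get a feel for what this zero-mean prior implies, one can draw a few sample functions from it. The sketch below assumes one-dimensional inputs and uses the radial basis function kernel that is defined in the next section; the helper name `rbf_kernel_1d`, the grid `x_grid`, and the small diagonal "jitter" added for numerical stability are illustrative assumptions.

```python
import numpy as np

def rbf_kernel_1d(x, length_scale=1.0):
    # Pairwise squared distances between 1-D inputs, then the RBF formula
    sq_dist = (x[:, None] - x[None, :]) ** 2
    return np.exp(-sq_dist / (2.0 * length_scale**2))

rng = np.random.default_rng(0)
x_grid = np.linspace(0.0, 5.0, 50)
K = rbf_kernel_1d(x_grid)

# A tiny value on the diagonal keeps the covariance numerically well behaved
jitter = 1e-8 * np.eye(len(x_grid))

# Each row is one function drawn from N(0, K), evaluated on x_grid
prior_samples = rng.multivariate_normal(np.zeros(len(x_grid)), K + jitter, size=3)
print(prior_samples.shape)  # (3, 50)
```

Because nearby inputs have covariance close to 1, neighbouring values in each sampled row are strongly correlated, which is what makes the samples look like smooth curves.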

## Gaussian Process Kernels

In its simplest form, the quadratic exponential kernel matrix elements can be calculated from a set of input features __x__: 

__K__(__x__<sub>n</sub>, __x__<sub>m</sub>) = exp(-||__x__<sub>n</sub> - __x__<sub>m</sub>||<sup>2</sup>/2<i>L</i><sup>2</sup>), 

where _L_ is a kernel hyperparameter (the length scale), which we will set to 1 for convenience.
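The effect of _L_ can be seen by evaluating the kernel for a few values. The following is a minimal sketch, assuming the inputs are stored as rows of a two-dimensional array; the function name `rbf_kernel_matrix` and the demo inputs are hypothetical.

```python
import numpy as np

def rbf_kernel_matrix(X, length_scale=1.0):
    """Kernel matrix for the rows of X (shape N x D), computed without loops."""
    diff = X[:, None, :] - X[None, :, :]   # pairwise differences, shape (N, N, D)
    sq_dist = np.sum(diff**2, axis=-1)     # ||x_n - x_m||^2
    return np.exp(-sq_dist / (2.0 * length_scale**2))

X_demo = np.array([[0.0], [1.0], [2.0]])

# Covariance between the neighbouring points x = 0 and x = 1 for several L
neighbour_cov = {L: rbf_kernel_matrix(X_demo, L)[0, 1] for L in (0.5, 1.0, 2.0)}
for L, value in neighbour_cov.items():
    print(f"L = {L}: K[0, 1] = {value:.4f}")
```

For two points one unit apart the kernel value is exp(-1/2<i>L</i><sup>2</sup>), which gives roughly 0.1353, 0.6065 and 0.8825 for _L_ = 0.5, 1 and 2. A larger _L_ therefore makes distant points more strongly correlated and the modelled function smoother.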

Putting everything together, for _N_ observed points in __y__ corresponding to input features __x__, the probability of observing a new point __y’__ corresponding to a new set of input features __x’__ is given by the expression (neglecting the noise term in the final covariance for simplicity):

p(__y’__) = ∫p(__y’__|_f_(__x’__))p(_f_(__x’__))d<i>f</i>(__x’__) = ∫<i>N</i>(__y’__|<i>f</i>(__x’__), __σ__)<i>N</i>(<i>f</i>(__x’__)|<b>0</b>, __K’__)d<i>f</i>(__x’__) = <i>N</i>(__y’__|<b>0</b>, __K’__).

The new covariance kernel matrix, which has a size of (_N_ + 1) × (_N_ + 1):

__K’__(__x__<sub>n</sub>, __x__<sub>m</sub>) = exp(-||__x__<sub>n</sub> - __x__<sub>m</sub>||<sup>2</sup>/2<i>L</i><sup>2</sup>)

is essentially the original matrix __K__ extended by an additional row and column holding the kernel values of the new point. The last row (or column, given that covariance matrices are symmetric) of __K’__ is composed of a vector of length _N_ containing the kernel values between the new input feature and all current input features:

__k__(__x__<sub>n</sub>, __x’__) = exp(-||__x__<sub>n</sub> - __x’__||<sup>2</sup>/2<i>L</i><sup>2</sup>), 

and a single element at index (_N_ + 1, _N_ + 1) containing the covariance of __x’__ with itself: 

_k_ = exp(-||__x’__ - __x’__||<sup>2</sup>/2<i>L</i><sup>2</sup>) = 1.
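The block structure of __K’__ can be checked numerically. The sketch below assembles it from __K__, the vector __k__ and the scalar _k_ for three training inputs and one new input; the helper `rbf` and the variable names are hypothetical and only serve this illustration.

```python
import numpy as np

def rbf(a, b, length_scale=1.0):
    # Kernel value for a single pair of points
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * length_scale**2))

X_train = np.array([[0.0], [1.0], [2.0]])   # N = 3 training inputs
x_new = np.array([0.5])                     # new input x'

N = len(X_train)
K = np.array([[rbf(a, b) for b in X_train] for a in X_train])   # N x N block
k_vec = np.array([rbf(a, x_new) for a in X_train])              # length-N vector
k_scalar = rbf(x_new, x_new)                                    # equals 1

# Assemble K' from its blocks: [[K, k], [k^T, k_scalar]]
K_aug = np.block([[K, k_vec[:, None]],
                  [k_vec[None, :], np.array([[k_scalar]])]])
print(K_aug.shape)  # (4, 4)
```

The top-left block of `K_aug` is the original matrix, and its last column holds `k_vec` followed by the value 1 in the bottom-right corner.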

## Gaussian Process Predictions

From the rules for conditioning Gaussian distributions, using the kernel matrices from above, the predicted expected value of __y’__ is:

μ(__x’__) = __k__<sup>T</sup>__K__<sup>-1</sup>__y__,

where __y__ is the vector of the _N_ observed values, while the corresponding variance is given by:

σ<sup>2</sup> = _k_ - __k__<sup>T</sup>__K__<sup>-1</sup>__k__.

All of this can be written in the following Python code. First, we create a function which calculates the covariance kernel matrix for some set of input features __x__: 

__K__(__x__<sub>n</sub>, __x__<sub>m</sub>) = exp(-||__x__<sub>n</sub> - __x__<sub>m</sub>||<sup>2</sup>/2<i>L</i><sup>2</sup>).

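```python
import numpy as np

def RBF_kernel(x, y, l = 1.0):
    # Kernel value exp(-||x - y||^2 / (2 l^2)) for a single pair of points
    K = np.exp(-np.linalg.norm(x - y)**2 / (2 * l**2))
    return K

def make_RBF_kernel(X, l = 1.0):
    # Fill the N x N kernel matrix one pair of rows of X at a time
    K = np.zeros([len(X), len(X)])
    for i in range(len(X)):
        for j in range(len(X)):
            K[i, j] = RBF_kernel(X[i], X[j], l)
    return K
```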

The prediction result μ(__x’__) = __k__<sup>T</sup>__K__<sup>-1</sup>__y__, where __y__ holds the observed training values, is calculated using the following function.

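```python
def gaussian_process_predict_mean(X_data, y_data, X_pred):
    # Kernel matrix over the training inputs plus the new input (last row/column)
    rbf_kernel = make_RBF_kernel(np.vstack([X_data, X_pred]))
    K = rbf_kernel[:len(X_data), :len(X_data)]  # training block K
    k = rbf_kernel[:len(X_data), -1]            # kernel values between training inputs and the new input
    y_pred = np.dot(np.dot(k, np.linalg.inv(K)), y_data)  # k^T K^-1 y
    return y_pred
```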
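The variance formula σ<sup>2</sup> = _k_ - __k__<sup>T</sup>__K__<sup>-1</sup>__k__ can be computed in the same way. The following is a self-contained sketch rather than part of the original code: the function name `gp_predict_mean_and_variance` is hypothetical, the optional `noise_var` argument is an assumed extension that adds a noise variance to the diagonal, and `np.linalg.solve` is swapped in for the explicit inverse because it is generally the more numerically stable choice.

```python
import numpy as np

def rbf(a, b, length_scale=1.0):
    # Kernel value for a single pair of points
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * length_scale**2))

def gp_predict_mean_and_variance(X_data, y_data, x_new, noise_var=0.0):
    # noise_var is an assumed extra parameter; 0.0 reproduces the noise-free formulas
    N = len(X_data)
    K = np.array([[rbf(a, b) for b in X_data] for a in X_data])
    K = K + noise_var * np.eye(N)
    k_vec = np.array([rbf(a, x_new) for a in X_data])
    k_scalar = rbf(x_new, x_new) + noise_var
    # Solve K w = k once, then reuse w for both the mean and the variance
    w = np.linalg.solve(K, k_vec)
    mean = w @ y_data                 # k^T K^-1 y
    variance = k_scalar - w @ k_vec   # k - k^T K^-1 k
    return mean, variance

X_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.pi * X_train.ravel() + np.exp(1)
mean, variance = gp_predict_mean_and_variance(X_train, y_train, np.array([0.5]))
print(mean, variance)
```

The variance is small when the new input lies close to the training inputs and approaches the prior value of 1 far away from them.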

Now all we need to do is to create some test data to try these functions out:

```python
# Training data x and y:
x = np.array([0, 1, 2])
y = np.pi * x + np.exp(1)
X = x.reshape(-1, 1)

# New input feature to predict:
X_new = np.array([0.5])

# Calculate and print the predicted value of new y:
mean_pred = gaussian_process_predict_mean(X, y, X_new)
print("mean predict :{}".format(mean_pred))
```

```
mean predict :3.9328785636239916
```


As a sanity check, we can use `GaussianProcessRegressor` from `sklearn` to check our results!

```python
from sklearn.gaussian_process import GaussianProcessRegressor
gpr = GaussianProcessRegressor()
gpr.fit(X, y)
print("sklearn pred: {}".format(gpr.predict(X_new.reshape(-1, 1))))
```

```
sklearn pred: [3.93287856]
```


This ends our demonstration of how Gaussian process models work.

***

## References
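1. <a href="https://www.springer.com/gp/book/9780387310732" target="_blank">Pattern Recognition and Machine Learning</a>, C. M. Bishop, pp. 303-313.
2. <a href="https://scikit-learn.org/stable/modules/gaussian_process.html" target="_blank">Gaussian Processes</a>, scikit-learn user guide.
3. <a href="https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessRegressor.html" target="_blank">GaussianProcessRegressor</a>, scikit-learn API reference.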