GaussianProcessRegression

A short introduction to Gaussian Process Regression (GPR) and its implementation in Python.

GPR is a non-parametric Bayesian regression method. It is very flexible and can be applied to any type of data (strings, graphs, etc.), as long as a suitable kernel function can be defined for it. Typically the posterior is calculated in closed form using Bayes’ rule, but optimization methods such as gradient descent can also play a role, for example when tuning kernel hyperparameters.

The main difference (and benefit) of GPR is that it predicts a normal distribution (a mean and a variance) for each data point instead of a single value. This shows how confident a prediction is, which helps, for example, when deciding the size of an investment.
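
As a small illustration of why a predicted distribution is more useful than a single value, the sketch below turns a predicted mean and standard deviation into an approximate 95% interval. The function name `normal_interval` and the numbers are made up for this example and are not part of this repository's code.

```python
def normal_interval(mean, std, z=1.96):
    """Return (lower, upper) bounds of the central interval of N(mean, std^2).

    z=1.96 covers roughly 95% of a normal distribution.
    """
    return mean - z * std, mean + z * std


# Two hypothetical predictions with the same mean but different uncertainty.
confident = normal_interval(100.0, 2.0)
uncertain = normal_interval(100.0, 20.0)

print("confident prediction:", confident)
print("uncertain prediction:", uncertain)
```

Both predictions have the same mean, but the second interval is ten times wider, which a point estimate alone would hide.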

The main drawbacks of GPR are its large computational cost and kernel functions that can be hard to design. Calculating the posterior distribution using Bayes’ rule requires inverting the kernel (similarity/covariance) matrix of the training data, which is an O(n^3) operation for n training points.
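
The posterior calculation can be sketched with NumPy as follows. This is a minimal example under stated assumptions, not the code in `GaussianProcess.py`: the names `rbf_kernel` and `gp_posterior`, the squared-exponential kernel, the zero prior mean, and the `noise` value are all choices made for illustration. A Cholesky factorization stands in for the explicit matrix inverse; it is still O(n^3), but numerically more stable.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two sets of 1-D inputs."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_train, y_train, x_test, noise=1e-6):
    """Posterior mean and covariance of a zero-mean GP at x_test."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)
    K_ss = rbf_kernel(x_test, x_test)

    # The factorization of the n x n training kernel matrix is the O(n^3) step.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    v = np.linalg.solve(L, K_s)

    mean = K_s.T @ alpha
    cov = K_ss - v.T @ v
    return mean, cov


# Predicting at the training inputs should reproduce the training targets.
x_train = np.array([-2.0, 0.0, 2.0])
y_train = np.sin(x_train)
mean, cov = gp_posterior(x_train, y_train, x_train)
print(mean)
print(np.diag(cov))
```

With almost no observation noise, the posterior mean at the training inputs matches the targets and the posterior variance there is close to zero.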

Demos

Prior distribution

Here is the prior, which is a normal distribution. GPR assumes that the observed data points are jointly drawn from a multivariate normal distribution, so a single sample from that distribution corresponds to the values of all data points at once.
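
Drawing from such a prior can be sketched as below. The squared-exponential kernel, the input grid, and the `jitter` value are assumptions made for this example; the repository's own kernels live in `Kernels.py` and may differ.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two sets of 1-D inputs."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


rng = np.random.default_rng(0)
x = np.linspace(-5.0, 5.0, 30)

# Small diagonal jitter keeps the covariance matrix numerically positive definite.
jitter = 1e-6
K = rbf_kernel(x, x) + jitter * np.eye(len(x))

# If z ~ N(0, I) and K = L L^T, then L z ~ N(0, K).
L = np.linalg.cholesky(K)
prior_samples = (L @ rng.standard_normal((len(x), 3))).T

# Each row is one function drawn from the prior, evaluated at every point in x.
print(prior_samples.shape)
```

Each row of `prior_samples` is one draw that assigns a value to all 30 inputs simultaneously.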

Posterior distribution

Here is the posterior distribution, which is the prior distribution conditioned on the observed data points. The posterior is much more constrained than the prior and is likely to give good guesses for the actual data.

2D: Random samples taken from a prior distribution

2D: Random samples taken from a posterior distribution

The posterior distribution is clearly much more constrained than the prior distribution, and it follows a sine wave, which is the true function.
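
A comparable experiment can be sketched in a few lines. The kernel, the training grid, and the number of samples are assumptions chosen for this example and are not taken from `usage_demo.py`.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two sets of 1-D inputs."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


rng = np.random.default_rng(1)
x_train = np.linspace(-4.0, 4.0, 9)
y_train = np.sin(x_train)  # noise-free observations of the true function
x_test = np.linspace(-4.0, 4.0, 41)

jitter = 1e-6
K = rbf_kernel(x_train, x_train) + jitter * np.eye(len(x_train))
K_s = rbf_kernel(x_train, x_test)
K_ss = rbf_kernel(x_test, x_test)

# Condition the zero-mean prior on the observations.
post_mean = K_s.T @ np.linalg.solve(K, y_train)
post_cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
post_std = np.sqrt(np.clip(np.diag(post_cov), 0.0, None))

# Draw three functions from the posterior (the prior standard deviation is 1).
post_samples = rng.multivariate_normal(
    post_mean, post_cov, size=3, check_valid="ignore"
)

mean_error = float(np.max(np.abs(post_mean - np.sin(x_test))))
print("largest posterior std:", post_std.max())
print("largest error of the posterior mean:", mean_error)
```

The posterior standard deviation stays well below the prior value of 1, and the posterior mean tracks the sine wave between the training points.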

3D: Training data

3D: Random samples taken from a prior distribution

3D: Random samples taken from a posterior distribution

Fit on artificial data

This data was generated by creating a random kernel function and then sampling from the corresponding prior. To cheat, the same random kernel was used to calculate both the prior and the posterior, which guarantees a good fit.
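
The idea can be sketched as follows. Here the "random kernel" is assumed to be a squared-exponential kernel with a randomly drawn length scale, and every other point is held out for testing; `artificial_data_fitter.py` may construct its kernel and split quite differently.

```python
import numpy as np


def rbf_kernel(a, b, length_scale):
    """Squared-exponential kernel between two sets of 1-D inputs."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


rng = np.random.default_rng(42)
length_scale = rng.uniform(1.0, 2.0)  # the assumed "random kernel" parameter
noise = 1e-4  # small observation noise variance, used for sampling and fitting

# Sample one function from the prior defined by the random kernel.
x = np.linspace(0.0, 10.0, 51)
K_full = rbf_kernel(x, x, length_scale) + noise * np.eye(len(x))
y = np.linalg.cholesky(K_full) @ rng.standard_normal(len(x))

# Hold out every other point, then predict it with the same kernel.
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

K = rbf_kernel(x_train, x_train, length_scale) + noise * np.eye(len(x_train))
K_s = rbf_kernel(x_train, x_test, length_scale)
pred = K_s.T @ np.linalg.solve(K, y_train)

max_error = float(np.max(np.abs(pred - y_test)))
print("largest held-out error:", max_error)
```

Because the model used for fitting is exactly the one that generated the data, the held-out error is small compared with the prior standard deviation of 1.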

Electricity production forecasting

To see how GPR works on real data (daily electricity production 1985-2018), I used a dataset from Kaggle: https://www.kaggle.com/datasets/kandij/electric-production. The task is to forecast electricity production.

Electricity production data

Training data (66%)

Electricity production forecast

The forecast looks like a plausible fit, and it comes with estimated confidence intervals.

Electricity production forecast vs test

I also calculated the mean absolute percentage error (MAPE), which came out to 3.5%.
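
For reference, the metric can be computed as in the sketch below. The function name and the toy numbers are illustrative only and are not taken from `electricity_prod_forecasting.py`.

```python
import numpy as np


def mean_absolute_percentage_error(y_true, y_pred):
    """Mean of |y_true - y_pred| / |y_true|, expressed in percent.

    Assumes y_true contains no zeros.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)


# Toy values: errors of 10% and 5% average to 7.5%.
mape = mean_absolute_percentage_error([100.0, 200.0], [110.0, 190.0])
print(mape)
```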
