## Gaussian Process Regression

The purpose of this notebook is to introduce **Gaussian Process Regression**. It proceeds in order through the statistical explanation, the implementation, and hyperparameter tuning. Much of the content builds on the prerequisite concepts explained in the *fundamental.ipynb* file.
- **Mathematical Derivation, Definition, and Properties**
    1. Definition of Gaussian Process (GP)
    2. Definition of Gaussian Process Regression (GPR)
    3. Pros/Cons of GPR
    4. Origination of GPR
    5. Derivation of GPR
    6. Simple Illustration of GPR
- **Gaussian Process Regression Visualization**
    1. GP From Scratch Code
    2. GP interactive visualization
    3. The role of hyperparameters
    4. GP Using Libraries
- **Gaussian Process Hyperparameter Tuning**



### **Mathematical Derivation, Definition, and Properties**

#### 1. Definition of Gaussian Process (GP)
A **Gaussian Process (GP)** is a stochastic process that defines a distribution over **functions** rather than just over points. Formally, a GP is a collection of **random variables** where any finite subset follows a **multivariate Gaussian distribution**. A Gaussian process is represented as:
$$f\sim \mathcal{GP}(\mu(x),k(x,x'))$$
- $f$ is the function that follows the GP (which is also our target when using GP to make predictions)
- A Gaussian process, like a Gaussian distribution, is fully defined by two terms: the **mean function** $\mu(x)$ and the **kernel (covariance) function** $k(x,x')$.
- The mean function $\mu(x)$ represents the **expected value** of the function $f$ at any input $x$.
- The **kernel (covariance) function** $k(x,x')$ captures how the function values at different inputs $x$ and $x'$ are related.
- A GP assumes that function values at nearby inputs are correlated in a way dictated by the covariance function (this is key to generating smooth predictions).
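
The definition above can be made concrete with a minimal sketch: on any finite set of inputs, a GP reduces to a multivariate Gaussian with mean vector $\mu(x)$ and covariance matrix $k(x,x')$. The zero mean function, the squared-exponential kernel, and its `length_scale` and `variance` values are assumptions chosen for illustration only.

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """Assumed kernel: k(x, x') = variance * exp(-(x - x')^2 / (2 * length_scale^2))."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)

def zero_mean(x):
    """Assumed mean function: mu(x) = 0 for every input."""
    return np.zeros_like(x)

rng = np.random.default_rng(0)
x = np.linspace(-5.0, 5.0, 50)   # a finite set of inputs
mu = zero_mean(x)                # mean vector, shape (50,)
K = rbf_kernel(x, x)             # covariance matrix, shape (50, 50)

# A small jitter on the diagonal keeps the Cholesky factorization numerically stable.
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))

# Each column is one function drawn from the GP, evaluated at the 50 inputs.
samples = mu[:, None] + L @ rng.standard_normal((len(x), 3))

print(samples.shape)
print(K[0, 1], K[0, -1])         # neighbouring inputs vs. the two most distant inputs
```

Under this kernel, neighbouring inputs have a covariance close to 1 while distant inputs have a covariance close to 0, which is why each sampled function varies smoothly.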

The terminology may seem intimidating, but a Gaussian process is easy to understand from the plot below:
<div style="text-align: center"> <img src="https://www.lancaster.ac.uk/stor-i-student-sites/thomas-newman/wp-content/uploads/sites/37/2022/05/Gaussian-process-with-noise.svg" alt="Drawing" width="500"/> </div>

The plot shows the training data points ($\times$), several functions, and a shaded area. With a Gaussian distribution over a random variable (explained in the *fundamental* notebook), we model the distribution of data points. With a GP, we model **the distribution of functions that pass through (or near, as controlled by a hyperparameter explained later) the training points**. The shaded area represents the spread of that distribution over **functions**, and the curves shown are individual samples drawn from it. The **bold** curve is the **mean function $\mu(x)$**, the most probable function value at each input. The shaded area is dictated by the **kernel (covariance) function**.

#### 2. Definition of Gaussian Process Regression (GPR)
Note that a Gaussian process by itself is **NOT** a machine learning model for prediction. It only describes the **distribution of a function**. A Gaussian process can be used to perform multiple tasks, including regression and classification. Here, we focus on **Gaussian Process Regression**, a supervised machine learning model that uses a Gaussian process to predict a function regardless of its form, while also providing the uncertainty of the prediction.

Recall that regression models the relationship between input variables (features $X$) and a **continuous** output variable (label $Y$):
$$Y=f(X)+\epsilon,\,\epsilon\sim\mathcal{N}(0,\sigma^2)$$

When the function has a fixed form (as in linear regression), we use true Bayesian prediction to obtain a predictive distribution over the label $\hat{Y}$:
$$P(\hat{Y}|X,Y,\hat{X})=\int_{\theta} P(\hat{Y}|\theta,\hat{X})P(\theta|X,Y)d\theta$$
The predictive distribution involves a prior $P(\theta)$ over the parameter $\theta$, which we assumed to be Gaussian in the previous example:
$$\theta\sim \mathcal{N}(0,\tau^2I)$$

In GPR, the goal is to predict the function $f$ directly, regardless of its form. Thus, placing a prior over a parameter $\theta$ is **unrealistic**, since the form of $\theta$ is unknown. Instead, we place a **GP prior** over the function we are trying to predict. (Where we previously made an assumption about $\theta$, we now make an assumption about $f$. Since the distribution over functions is a Gaussian process, we use the notation $\mathcal{GP}$ instead of $P$.)
$$f\sim \mathcal{GP}(\mu,k)$$
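
One way to connect the two views, offered as a sketch rather than as part of the derivation: for the fixed-form linear model $f(x)=\theta x$ with $\theta\sim\mathcal{N}(0,\tau^2)$, the prior on $\theta$ already induces a GP prior on $f$ with zero mean and the linear kernel $k(x,x')=\tau^2 x x'$. A general GP prior drops the fixed form and chooses $k$ directly. The value of `tau` and the three inputs below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
tau = 2.0                                 # prior std of theta, as in N(0, tau^2 I)
X = np.array([[-1.0], [0.5], [2.0]])      # three 1-D inputs (illustrative values)

# Weight-space view: draw theta from its prior, then evaluate f(x) = theta * x.
theta = rng.normal(0.0, tau, size=(1, 200_000))
f_samples = X @ theta                     # shape (3, 200000): f at the three inputs

# Function-space view: the same prior on f is a GP with zero mean
# and linear kernel k(x, x') = tau^2 * x * x'.
K_linear = tau**2 * (X @ X.T)

# np.cov treats rows as variables and columns as draws.
empirical_cov = np.cov(f_samples)
print(np.round(empirical_cov, 2))
print(K_linear)
```

The empirical covariance of the sampled function values should be close to the kernel matrix, which shows that both descriptions define the same distribution over $f$ at these inputs.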


#### 3. Pros/Cons of GPR
##### Advantages of GP
1. A Gaussian process is very powerful at predicting functions of **any form**. Unlike models that commit to one particular functional form (such as linear regression for linear relationships), a GP learns the function from the data.
2. A Gaussian process uses the **Bayesian approach**: it incorporates prior knowledge and allows for principled Bayesian inference, which makes it adaptable to different domains (for reference, see MAP and true Bayesian prediction in the *fundamental* notebook).
3. A GP is a probabilistic model whose output provides not only the most probable value (the mean function) but also the **uncertainty** of that value.
4. A GP does **not require a large dataset**, which suits situations where data collection is time-consuming and expensive.
5. The choice of **kernel function** allows GPs to capture various data patterns, such as periodicity, smoothness, or sudden changes.
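
To illustrate the last point, here is a minimal sketch comparing two commonly used kernels. The function names, the `period`, and the `length_scale` values are assumptions for illustration and are not taken from this notebook's later code.

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0):
    """Squared-exponential kernel: correlation decays smoothly with distance."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def periodic_kernel(x1, x2, period=2.0, length_scale=1.0):
    """Exp-sine-squared kernel: correlation repeats every `period`."""
    d = x1[:, None] - x2[None, :]
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / length_scale**2)

x0 = np.array([0.0])
others = np.array([0.0, 1.0, 2.0, 4.0])

k_rbf = rbf_kernel(x0, others)
k_per = periodic_kernel(x0, others)

print(k_rbf)   # decreases as the distance from x0 grows
print(k_per)   # returns to 1 whenever the distance is a multiple of the period
```

With the smooth kernel, the input at distance 4 is almost uncorrelated with `x0`. With the periodic kernel, it is perfectly correlated, which encodes the assumption that the function repeats.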
##### Disadvantages of GP
1. A GP is computationally **expensive and complex**. This will be explained in later parts.
2. A GP does not work well with **large datasets ($n>10^6$)**, primarily because of its computational complexity.
3. A GP does not work well in **high-dimensional settings**.
4. Choosing a kernel function can be tricky, and the choice dictates the overall performance of the GP.
5. Hyperparameter tuning is challenging.
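
A rough back-of-the-envelope sketch of the cost issue, assuming exact inference that stores a dense $n\times n$ kernel matrix in 64-bit floats and factorizes it with a Cholesky decomposition (about $n^3/3$ floating-point operations). The helper names are hypothetical.

```python
def kernel_matrix_gib(n, bytes_per_entry=8):
    """Memory for a dense n x n float64 kernel matrix, in GiB."""
    return n * n * bytes_per_entry / 2**30

def cholesky_flops(n):
    """Leading-order operation count of a Cholesky factorization, about n^3 / 3."""
    return n**3 / 3

for n in (10**3, 10**4, 10**5, 10**6):
    print(f"n={n:>8}: {kernel_matrix_gib(n):12.3f} GiB, {cholesky_flops(n):.2e} flops")
```

Memory grows quadratically and time grows cubically with the number of training points, so multiplying the dataset size by 10 multiplies the memory by 100 and the factorization work by 1000.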

#### 4. Origination of GPR
The origin of GP as a powerful model for predicting a function $f$ of arbitrary shape, together with its uncertainty, can be traced to several ideas.
##### Bayesian Approach
As introduced and explained in the *fundamental* notebook, the **Bayesian approach** is very powerful for estimating model parameters (see the discussion of MLE and MAP there). In addition, a **true Bayesian prediction** provides uncertainty on top of the prediction itself, making it very useful in the many use cases where an uncertainty estimate is required. A Gaussian process therefore emerges naturally as a model that uses the Bayesian approach.
##### Universality of Gaussian Distribution
The Bayesian approach relies on previous assumptions or knowledge about the distribution (the **prior**) to infer the posterior. This means choosing a **reliable** prior is critical. As shown in the *fundamental* notebook, a **Gaussian prior** is a popular choice because of the universality of the Gaussian distribution and, most importantly, the **central limit theorem (CLT)**.

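In addition, common **operations** on Gaussian distributions (summing independent Gaussian variables, multiplying Gaussian densities, marginalizing, conditioning, etc.) yield **another Gaussian**, up to a normalizing constant in the case of multiplication.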
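
Two of these closure properties can be checked numerically. The sketch below uses arbitrary illustrative means and variances: it verifies that the product of two Gaussian densities is a scaled Gaussian density, and that the sum of two independent Gaussian variables has the summed mean and the summed variance.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

m1, v1 = 0.0, 4.0   # first Gaussian (illustrative values)
m2, v2 = 3.0, 1.0   # second Gaussian (illustrative values)

# Product of two Gaussian densities: precisions add, means are precision-weighted.
v_prod = 1.0 / (1.0 / v1 + 1.0 / v2)
m_prod = v_prod * (m1 / v1 + m2 / v2)

x = np.linspace(-2.0, 6.0, 9)
ratio = gaussian_pdf(x, m1, v1) * gaussian_pdf(x, m2, v2) / gaussian_pdf(x, m_prod, v_prod)
print(m_prod, v_prod)
print(ratio)        # the same constant at every x: the product is a scaled Gaussian

# Sum of two independent Gaussian variables: means add and variances add.
rng = np.random.default_rng(2)
s = rng.normal(m1, np.sqrt(v1), 100_000) + rng.normal(m2, np.sqrt(v2), 100_000)
print(s.mean(), s.var())   # close to m1 + m2 and v1 + v2
```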

These two reasons make a Gaussian prior a reasonable choice. A Gaussian process therefore also places a Gaussian prior in the model (hence the name "Gaussian" process).
##### Closed Form of Predictive Distribution
In true Bayesian prediction, the **predictive distribution** is calculated by integrating over all values of the parameter $\theta$:
$$P(\hat{Y}|X,Y,\hat{X})=\int_{\theta} P(\hat{Y}|\theta,\hat{X})P(\theta|X,Y)d\theta$$
This integral is in general **computationally expensive** to evaluate and rarely has a **closed form**. When the prior and the likelihood are both Gaussian, however, a **closed form is available**. This is the key reason a GP can use true Bayesian prediction as its modeling approach.
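
As a sketch of what "closed form" means here, the example below uses Bayesian linear regression with the Gaussian prior $\theta\sim\mathcal{N}(0,\tau^2I)$ and Gaussian noise $\epsilon\sim\mathcal{N}(0,\sigma^2)$ from earlier. It computes the predictive mean and variance analytically and compares them with a Monte Carlo approximation of the same integral. The data, the true slope, and the values of `sigma` and `tau` are assumptions made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, tau = 0.5, 1.0   # noise std and prior std (assumed values)

# Synthetic data from y = 2x + noise; the slope 2 is an arbitrary choice.
X = rng.uniform(-1.0, 1.0, size=(20, 1))
Y = 2.0 * X[:, 0] + rng.normal(0.0, sigma, size=20)
x_new = np.array([0.8])

# Gaussian posterior over theta (closed form for a Gaussian prior and likelihood).
post_cov = np.linalg.inv(X.T @ X / sigma**2 + np.eye(1) / tau**2)
post_mean = post_cov @ X.T @ Y / sigma**2

# Closed-form predictive distribution: the integral over theta is again Gaussian.
pred_mean = x_new @ post_mean
pred_var = x_new @ post_cov @ x_new + sigma**2

# Monte Carlo approximation of the same integral, for comparison only.
theta_draws = rng.multivariate_normal(post_mean, post_cov, size=200_000)
y_draws = theta_draws @ x_new + rng.normal(0.0, sigma, size=200_000)

print(pred_mean, y_draws.mean())
print(pred_var, y_draws.var())
```

The closed-form result needs only a few matrix operations, while the sampling approach needs many draws to reach comparable accuracy. The predictive variance is the sum of the parameter uncertainty and the noise variance $\sigma^2$.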

#### 5. Derivation of GPR