# Gaussian Processes - Regression

### - Solving linear & non-linear regressions
### - Choosing prior over functions
### - Modeling without having explicit coefficients(non-parametric models):


### $$y = \sum_{j=0}^{\infty} w_j \phi_j(x)$$ 
### *This model has infinitely many coefficients
### *GPs remove the explicit coefficients and simplify the modeling



## Advantages :
### - Well suited to small and medium-sized data sets, unlike deep learning
### - Supports active learning: what has been computed so far guides which new data points to collect
### - Handles multiple time series sampled at different times
### (GPs automatically predict the unknown data points at irregular $\Delta t$ intervals)
### - In Bayesian optimization: 

 #####  ![bb.png](attachment:bb.png)    

### - We can model this function with a GP to locate its minimum or maximum (GPs are especially well suited as surrogate models)







## Two-dimensional Gaussian distribution

### $$\begin{bmatrix} f_1 \\ f_2 \end{bmatrix} \sim  \mathscr{N}\bigg(\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix},\begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2\end{bmatrix}\bigg) $$
### Bivariate Gaussian with two marginals $f_1, f_2$  ![img.png](attachment:img.png)
                

## Having first observation of $f_2$


### - Sampling a point from $f_2$
### - Obtaining the updated prediction of $f_1$ as $p(f_1 | f_2)$ using the Gaussian conditioning property (theorem 2): for $X \sim \mathscr{N}(\mu,\Sigma)$ with









$$X=\begin{pmatrix} X_a \\ X_b \end{pmatrix},\mu=\begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix},\Sigma= \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}, p(X_a |X_b)=N\bigg(X_a;\mu_{a|b}, \Sigma_{a|b}\bigg)$$

### $$\mu_{a|b} = \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(X_b - \mu_b)$$

### $$\Sigma_{a|b} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba} $$ ![Untitlfed.png](attachment:Untitlfed.png)


### So the new mean and variance of $f_1$ are:
### $$f_1 | f_2 \sim \mathscr{N}\bigg(\mu_1 +\frac{\sigma_{12}}{\sigma_{2}^2}(f_2-\mu_2),\ \sigma_{1}^2-\frac{\sigma_{12}\sigma_{21}}{\sigma_{2}^2}\bigg)$$ 
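The conditioning step can be sketched numerically. This is a minimal illustration, assuming a zero-mean bivariate Gaussian with unit variances and covariance 0.8; the function name `condition_bivariate` and all numbers are hypothetical and not taken from the notes.

```python
import numpy as np

def condition_bivariate(mu, Sigma, f2):
    """Mean and variance of f1 | f2 for a two-dimensional Gaussian."""
    mu1, mu2 = mu
    s11, s12 = Sigma[0, 0], Sigma[0, 1]
    s21, s22 = Sigma[1, 0], Sigma[1, 1]
    cond_mean = mu1 + s12 / s22 * (f2 - mu2)   # mu_1 + sigma_12 / sigma_2^2 * (f_2 - mu_2)
    cond_var = s11 - s12 * s21 / s22           # sigma_1^2 - sigma_12 * sigma_21 / sigma_2^2
    return cond_mean, cond_var

# Assumed toy parameters (not from the notes)
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])

cond_mean, cond_var = condition_bivariate(mu, Sigma, f2=1.0)
print(cond_mean, cond_var)
```

With a strong positive covariance, observing $f_2$ pulls the mean of $f_1$ toward the observed value and shrinks its variance.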

## Plotting $f_1$, $f_2$ on the x-axis

##### ![ob.png](attachment:ob.png)

## Generalizing to 6D distribution
### $$f\sim \mathscr{N}(\mu,\Sigma)$$


### ![Untitled2.png](attachment:Untitled2.png)

### $$f=[f_1, f_2, f_3, f_4, f_5, f_6]^T$$

## Observing 1 data point 
### When the value of one component ($f_4$) is observed:
## $$f_1, f_2, f_3, f_5, f_6 \mid f_4$$


#### ![Untitledas.png](attachment:Untitledas.png)

### *The observed point has a stronger effect on nearby points 

$$
\Sigma = 
 \begin{bmatrix}
  \sigma_{1}^2&              & \cdots &    &   & \sigma_{16} \\
  \sigma_{21} & \sigma_{2}^2 &        &     &    &  \sigma_{26} \\
  \sigma_{31} &        & \sigma_{3}^2 &     &   &\sigma_{36} \\
  \sigma_{41} &        &        & \sigma_{4}^2&   &\sigma_{46}  \\
  \sigma_{51} &        &        &      & \sigma_{5}^2&\sigma_{56} \\
  \sigma_{61} &        &    \cdots    &      &             &\sigma_{6}^2
 \end{bmatrix} $$
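A small sketch of the 6D case, assuming a covariance matrix in which neighbouring components are more strongly correlated (the squared exponential form of `Sigma`, the positions `idx`, and the observed value `f4` are illustrative assumptions). It applies the same conditioning formulas with $f_4$ as the observed block.

```python
import numpy as np

idx = np.arange(1, 7, dtype=float)                  # positions of f_1 ... f_6
# Assumed covariance: correlation decays with distance between indices
Sigma = np.exp(-0.5 * (idx[:, None] - idx[None, :]) ** 2)
mu = np.zeros(6)

obs = 3                                             # zero-based index of f_4
rest = [i for i in range(6) if i != obs]            # f_1, f_2, f_3, f_5, f_6
f4 = 1.5                                            # hypothetical observed value

S_ab = Sigma[np.ix_(rest, [obs])]                   # 5 x 1 cross-covariance
S_bb = Sigma[obs, obs]                              # variance of f_4

cond_mean = mu[rest] + S_ab[:, 0] / S_bb * (f4 - mu[obs])
cond_cov = Sigma[np.ix_(rest, rest)] - S_ab @ S_ab.T / S_bb
cond_var = np.diag(cond_cov)

for name, m, v in zip(["f1", "f2", "f3", "f5", "f6"], cond_mean, cond_var):
    print(f"{name}: mean={m:.3f}, var={v:.3f}")
```

Under this assumed covariance, $f_3$ and $f_5$ move the most and lose the most variance, while $f_1$ is almost unaffected.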

## Generalizing to continuous data
### How to generalize to continuous input data? $\implies$ Gaussian processes 
### *A Gaussian process is a process from which we sample points, and these points represent the functions we would like to estimate. 

### - For a finite set of input values $x \in \{1,2,...,n\}$ we use the multivariate Gaussian as the model of $f(x)$

### A continuous stochastic process $\{X_t;t \in T\}$ is a Gaussian process if and only if for every finite set of indices $t_1,...,t_k$ :
### $$X_{t_1,...,t_k}= (X_{t_1},...,X_{t_k})$$

### is a multivariate Gaussian random variable.

## Covariance Function

### *Instead of describing the Gaussian distribution with an explicit $\mu$ and $\Sigma$, we describe it with a covariance function $k(X, X')$, called the kernel, which can be evaluated at any pair of inputs.

#### $$\begin{bmatrix} f(X_1) \\ f(X_2) \\ \vdots \\ f(X_n) \end{bmatrix} \sim  \mathscr{N}\Bigg(\begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix},\begin{bmatrix} k( X_1, X_1) &k( X_1, X_2) & \cdots & k( X_1, X_n) \\ k( X_2, X_1) &    k( X_2, X_2)& \cdots &k( X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ k( X_n, X_1)& k( X_n, X_2) & \cdots & k( X_n, X_n) \end{bmatrix}\Bigg) $$




#### $$K(X,X') = \bigg(1 + \frac{|X-X'|^2}{2\alpha l^2}\bigg)^{-\alpha}$$ 
## ![k.png](attachment:k.png)
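As a sketch of how a kernel turns a set of inputs into a covariance matrix, the snippet below evaluates the rational quadratic kernel in the form listed under the common kernel functions (squared distance, $l^2$, exponent $-\alpha$). The function name, the input grid and the hyperparameter values are assumptions for illustration.

```python
import numpy as np

def rational_quadratic(x1, x2, alpha=1.0, length=1.0):
    """Rational quadratic kernel evaluated on all pairs of 1-D inputs."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return (1.0 + d2 / (2.0 * alpha * length**2)) ** (-alpha)

X = np.linspace(0.0, 5.0, 6)       # assumed inputs 0, 1, ..., 5
K = rational_quadratic(X, X)       # 6 x 6 covariance matrix

print(np.round(K, 3))
```

Entries close to the diagonal are near 1 (nearby inputs are strongly correlated) and decay as the inputs move apart.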


## First observation in continuous data 


#### ![Untitledasas.png](attachment:Untitledasas.png)



## Computing $f(x_*)$

#### Given data $ X=\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n\end{bmatrix} , f(X)=\begin{bmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_n) \end{bmatrix}$ and a test input $x_*$, we want $f(x_*) | f(X)$

#### $K(x_*,x_*)= k(x_*,x_*)$ and $K(X,x_*) =\begin{bmatrix} k(x_1,x_*) \\ \vdots \\  k(x_N,x_*) \end{bmatrix}= K(x_*,X)^T $ and $ K(X,X) = \begin{bmatrix} k(x_1,x_1) & \cdots & k(x_1,x_N) \\ \vdots& & \vdots \\  k(x_N,x_1) & \cdots & k(x_N,x_N)\end{bmatrix}$

### So we have   
### $$\begin{bmatrix} f(X) \\ f(x_*) \end{bmatrix} \sim  \mathscr{N}\bigg(\begin{bmatrix}  0 \\  0 \end{bmatrix},\begin{bmatrix} K(X,X) & K(X,x_*) \\ K(x_*,X) & K(x_*,x_*)\end{bmatrix}\bigg)$$ 

 
###  And most importantly<font color='red'>  $$f(x_*)|f(X)\sim \mathscr{N}\bigg(K(x_*,X)K(X,X)^{-1} f(X),K(x_*,x_*)-K(x_*,X)K(X,X)^{-1} K(X,x_*)\bigg) $$ </font> 

## The Gaussian process as a regression model (Data without noise)

### ![Untitledasaa.png](attachment:Untitledasaa.png) 


## Observation with noise

### We observe $y = f(x) + ε$ with $ε ∼ \mathscr{N}(0,\sigma_{n}^2)$

### Incorporating noise: add $\sigma_{n}^2 I$ to $K(X,X)$:


### $$ K(X,X) + \sigma_{n}^2 I = \begin{bmatrix} k(x_1,x_1) & \cdots & k(x_1,x_N) \\ \vdots& & \vdots \\  k(x_N,x_1) & \cdots & k(x_N,x_N)\end{bmatrix}+\begin{bmatrix} \sigma_{n}^2 & \cdots & 0 \\ \vdots& \ddots & \vdots \\  0 & \cdots & \sigma_{n}^2\end{bmatrix}$$ 
### so we have 
### $$f(x_*)|y\sim \mathscr{N}\bigg(K(x_*,X)(K(X,X)+\sigma_{n}^2I)^{-1} y,K(x_*,x_*)-K(x_*,X)(K(X,X)+\sigma_{n}^2I)^{-1} K(X,x_*)\bigg) $$![Untsasitled.png](attachment:Untsasitled.png)
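A minimal NumPy sketch of the noisy predictive equations, assuming a squared exponential kernel and toy data (the names `sq_exp` and `gp_predict`, the data and the hyperparameters are illustrative, not from the notes). It solves linear systems through a Cholesky factor instead of forming the inverse explicitly, which is a common choice for numerical stability.

```python
import numpy as np

def sq_exp(x1, x2, length=1.0, sigma_f=1.0):
    """Squared exponential kernel on all pairs of 1-D inputs."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return sigma_f**2 * np.exp(-0.5 * d2 / length**2)

def gp_predict(X, y, X_star, noise_var=0.1, length=1.0):
    """Posterior mean and covariance of f(X_star) given noisy observations y."""
    K = sq_exp(X, X, length) + noise_var * np.eye(len(X))   # K(X,X) + sigma_n^2 I
    K_s = sq_exp(X_star, X, length)                          # K(x_*, X)
    K_ss = sq_exp(X_star, X_star, length)                    # K(x_*, x_*)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))      # (K + sigma_n^2 I)^{-1} y
    mean = K_s @ alpha
    v = np.linalg.solve(L, K_s.T)
    cov = K_ss - v.T @ v
    return mean, cov

# Assumed toy data
X = np.array([-2.0, -1.0, 0.5, 2.0])
y = np.sin(X)
X_star = np.linspace(-3.0, 3.0, 7)

mean, cov = gp_predict(X, y, X_star, noise_var=1e-2)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
print(np.round(mean, 3))
print(np.round(std, 3))
```

Setting `noise_var` close to zero approximates the noise-free case, where the posterior mean passes through the observations.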

## Samples from the Gaussian process
### ![Untitlessssd.png](attachment:Untitlessssd.png)
### *The Gaussian process can be understood as a distribution over functions
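To make the "distribution over functions" view concrete, here is a sketch that draws a few sample functions from a zero-mean GP prior. The squared exponential kernel with unit length scale, the input grid, the seed and the jitter value are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.linspace(-5.0, 5.0, 50)                        # evaluation grid
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)     # assumed squared exponential kernel

# A small jitter on the diagonal keeps the Cholesky factorization numerically stable
jitter = 1e-6 * np.eye(len(x))
L = np.linalg.cholesky(K + jitter)

# Each column is one sampled function evaluated on the grid: f = L z, z ~ N(0, I)
samples = L @ rng.standard_normal((len(x), 3))
print(samples.shape)
```

Plotting each column of `samples` against `x` gives curves like the ones in the figure; a different kernel produces functions with a different character.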




### *Visualized GP kernels  http://smlbook.org/GP/ 





## Transforming $f(x)$ from parametric to non-parametric (removing the weights)



### - Bayesian linear regression has limited expressiveness
### - The goal is to handle large data sets, for which modeling with a standard GP can be slow

### Let’s start with a standard linear model: $y=XW +ε$, $ε ∼ \mathscr{N}(0,\sigma_{n}^2 I)$, where $f(x) = x^T W$ and the prior is $W \sim \mathscr{N}(0, I)$ 



### -Posterior and predictive distribution :

### $$p(W |y) =  \mathscr{N}\bigg(W;(I+\frac{1}{\sigma_{n}^2} X^T X)^{-1}(\frac{1}{\sigma_{n}^2} X^T y),(I + \frac{1}{\sigma_{n}^2} X^T X)^{-1}\bigg)$$

### $$ p(f(x_*) |y) = \mathscr{N} \bigg(f(x_*);x_{*}^T(I+\frac{1}{\sigma_{n}^2} X^T X)^{-1}(\frac{1}{\sigma_{n}^2} X^T y),x_{*}^T(I + \frac{1}{\sigma_{n}^2} X^T X)^{-1} x_*\bigg)$$
#### ![Untitledsd.png](attachment:Untitledsd.png)
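The posterior and predictive formulas can be sketched as follows, assuming a unit Gaussian prior on the weights and a design matrix `X` whose rows are the inputs (here with a constant column for the intercept). The function name and the toy data are hypothetical.

```python
import numpy as np

def blr_posterior(X, y, noise_var):
    """Posterior mean and covariance of W for y = X W + eps with prior W ~ N(0, I)."""
    D = X.shape[1]
    A = np.eye(D) + X.T @ X / noise_var
    cov = np.linalg.inv(A)
    mean = cov @ (X.T @ y) / noise_var
    return mean, cov

# Assumed toy data: y = 0.5 + 2 x, observed without noise
x = np.linspace(-1.0, 1.0, 50)
X = np.column_stack([np.ones_like(x), x])    # rows are (1, x_i)
y = 0.5 + 2.0 * x
noise_var = 1e-2

w_mean, w_cov = blr_posterior(X, y, noise_var)

# Predictive distribution of f(x_*) at an assumed test input
x_star = np.array([1.0, 0.3])
pred_mean = x_star @ w_mean
pred_var = x_star @ w_cov @ x_star
print(w_mean, pred_mean, pred_var)
```

The posterior mean is shrunk slightly toward zero by the prior, and the shrinkage fades as more data arrive.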

## Function-space View
### - Considering inference directly in function space
### Using a Gaussian process (GP) to describe a distribution over functions

### A Gaussian process is completely specified by its mean function and covariance function. We define the mean function $m(x)$ and the covariance function $k(x,x')$ of a real process $f(x)$ as:

### $$m(x)= E[f(x)]$$
### $$k(x,x') = E[(f(x)- m(x))(f(x')-m(x'))]$$
### and the gaussian process as :
### $$f(x)\sim \mathscr{GP}(m(x),k(x,x'))$$

## A linear function to covariance function

### consider the class of linear functions:

### $f(x)= ax + b$  where $a\sim \mathscr{N}(0,\alpha)$ and $b\sim \mathscr{N}(0,\beta)$

### we can compute the mean function :

### $$\mu(x)= E[f(x)]= \iint f(x)p(a)p(b)dadb=\int axp(a)da + \int bp(b)db=0$$
### and covariance function:

### $$k(x,x') = E[(f(x)- 0)(f(x')-0)] = \iint (ax+b)(ax'+b)p(a)p(b)dadb$$
### $$= \int a^2xx'p(a)da + \int b^2p(b)db+(x+x')\iint ab\,p(a)p(b)\,da\,db= \alpha xx' + \beta$$
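The result $k(x,x') = \alpha xx' + \beta$ can be checked with a quick Monte Carlo experiment. This is only a sanity check under assumed values of $\alpha$, $\beta$, $x$ and $x'$; note that $\alpha$ and $\beta$ are variances, so the samples use their square roots as standard deviations.

```python
import numpy as np

rng = np.random.default_rng(0)

alpha, beta = 2.0, 0.5            # assumed prior variances of a and b
x, x_prime = 1.5, 2.0             # assumed pair of inputs
n_samples = 200_000

a = rng.normal(0.0, np.sqrt(alpha), n_samples)
b = rng.normal(0.0, np.sqrt(beta), n_samples)

f_x = a * x + b
f_xp = a * x_prime + b

empirical = np.mean(f_x * f_xp)   # the mean function is zero, so this estimates k(x, x')
exact = alpha * x * x_prime + beta
print(empirical, exact)
```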

## A nonlinear function to covariance function
### Consider the class of nonlinear functions (sums of squared exponentials):
 
### $f(x)=\lim\limits_{n \to \infty}\frac{1}{n} \sum_{i=0}^{n} \gamma_i \exp(-(x-\frac{i}{n})^2), $    where $\gamma_i \sim \mathscr{N}(0,1)$
###            $= \int_{-\infty}^{\infty} \gamma (u)\exp(-(x-u)^2)du,   $ where $  \gamma(u) \sim \mathscr{N}(0,1)$

### the mean function is:
### $$\mu(x)= E[f(x)]=\int_{-\infty}^{\infty} exp(-(x-u)^2)\int_{-\infty}^{\infty} \gamma p(\gamma) d\gamma du=0 $$

### and covariance function :
### $$ E[f(x)f(x')]=\int \exp(-(x-u)^2-(x'-u)^2)du$$
### $$= \int \exp\bigg(-2\Big(u-\frac{x+x'}{2}\Big)^2+\frac{(x+x')^2}{2}-x^2-x'^2\bigg)du \propto \exp\bigg(-\frac{(x-x')^2}{2}\bigg)  $$
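The proportionality can be verified numerically. Completing the square leaves a Gaussian integral $\int \exp(-2(u-m)^2)\,du = \sqrt{\pi/2}$, so the covariance should equal $\sqrt{\pi/2}\,\exp(-(x-x')^2/2)$. The sketch below compares this with a simple Riemann sum; the integration range, grid size and test inputs are assumptions.

```python
import numpy as np

def basis_overlap(x, x_prime, half_width=20.0, n=200_001):
    """Riemann-sum approximation of the integral of exp(-(x-u)^2 - (x'-u)^2) over u."""
    u = np.linspace(-half_width, half_width, n)
    du = u[1] - u[0]
    integrand = np.exp(-(x - u) ** 2 - (x_prime - u) ** 2)
    return np.sum(integrand) * du

x, x_prime = 0.3, 1.7             # assumed pair of inputs
numeric = basis_overlap(x, x_prime)
closed_form = np.sqrt(np.pi / 2.0) * np.exp(-0.5 * (x - x_prime) ** 2)
print(numeric, closed_form)
```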

###  *Thus the squared exponential covariance function is equivalent to regression using infinitely many Gaussian-shaped basis functions placed everywhere (not just at the training points)

## Non-linear input transformation





### Project the inputs into a higher-dimensional space using a set of basis functions, then apply the linear model in this space instead of directly on the inputs.

### For example: $\phi(x) = (1,x,x^2,x^3,...)^T$

### $\phi(x)$ : A set of functions that maps a D-dimensional input vector $x$ into an N-dimensional feature space
### $\Phi$ : The matrix that stacks the rows $\phi(x)^T$ for all cases in the training set. Now the model is:

### $$f(x) = \phi(x)^T w$$ so:

### $$p(f(x_*) |y)= \mathscr{N} \bigg(f(x_*);\phi(x_{*})^T(I+\frac{1}{\sigma_{n}^2} \Phi^T \Phi)^{-1}(\frac{1}{\sigma_{n}^2} \Phi^T y),\phi(x_{*})^T(I + \frac{1}{\sigma_{n}^2} \Phi^T \Phi)^{-1} \phi(x_*)\bigg)$$  



## Kernel trick
### Lift any algorithm that is defined in terms of inner products in input space into feature space by replacing occurrences of those inner products with $k(x,x′ )$:




### For any matrix $A$, $(I + A^TA)^{-1}A^T= A^T(I +AA^T)^{-1}$. Hence:
### $$\phi(x_*)^T(\sigma_{n}^2I+\Phi^T \Phi)^{-1}\Phi^Ty= \phi(x_*)^T \Phi^T(\sigma_{n}^2I +\Phi\Phi^T )^{-1}y$$

### The matrix inversion lemma $(I+UV)^{-1} = I-U(I+VU)^{-1}V$ gives:

### $$\phi(x_*)^T\bigg(I+\frac{1}{\sigma_{n}^2} \Phi^T \Phi \bigg)^{-1} \phi(x_*)= \phi(x_*)^T \phi(x_*)-\phi(x_*)^T \Phi^T(\sigma_{n}^2I+\Phi \Phi^T)^{-1} \Phi\phi(x_*) $$

### Let $k(x,x')= \phi(x)^T\phi(x')$; we refer to $k(\cdot,\cdot)$ as a kernel:

### $$K(x_*,x_*)=k(x_*,x_*)=\phi(x_*)^T\phi(x_*)$$

 ### $$f(x_*)|y\sim \mathscr{N}\bigg(f(x_*);K(x_*,X)(K(X,X)+\sigma_{n}^2I)^{-1} y,K(x_*,x_*)-K(x_*,X)(K(X,X)+\sigma_{n}^2I)^{-1} K(X,x_*)\bigg) $$
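The equivalence between the weight-space and kernel forms can be checked numerically. The sketch below assumes cubic polynomial features, a feature matrix `Phi` whose rows are $\phi(x_i)^T$, and toy data; all names and values are illustrative.

```python
import numpy as np

def phi(x, degree=3):
    """Polynomial features (1, x, x^2, x^3), one row per input."""
    return np.vander(x, degree + 1, increasing=True)

# Assumed toy data
X = np.array([-1.0, -0.3, 0.4, 1.2, 2.0])
y = X**2
x_star = np.array([0.7])
noise_var = 0.25

Phi = phi(X)                 # n x D, rows are phi(x_i)^T
phi_s = phi(x_star)[0]       # feature vector of the test input

# Weight-space view: invert a D x D matrix
A = np.eye(Phi.shape[1]) + Phi.T @ Phi / noise_var
mean_w = phi_s @ np.linalg.solve(A, Phi.T @ y / noise_var)
var_w = phi_s @ np.linalg.solve(A, phi_s)

# Function-space view: only inner products k(x, x') = phi(x)^T phi(x') appear
K = Phi @ Phi.T                              # K(X, X)
k_s = Phi @ phi_s                            # K(X, x_*)
G = K + noise_var * np.eye(len(X))
mean_f = k_s @ np.linalg.solve(G, y)
var_f = phi_s @ phi_s - k_s @ np.linalg.solve(G, k_s)

print(mean_w, mean_f)
print(var_w, var_f)
```

Both views give the same predictive mean and variance; the kernel form never needs the feature vectors themselves once `K` and `k_s` are available.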
 
 ###  *Do not compute (or even choose!) the nonlinear transformations $\phi(x)$. Work directly with $K(x, x′ )$ instead.
 
 ### *Only requirement on $K(x,x′)$: $K(X,X)$ has to be positive semidefinite for all possible $X$.
 
 ### *One possible choice out of many kernels:

### $$ K(x,x′)= \bigg(1+ \frac{|x-x'|^2}{2\alpha l^2}\bigg)^{-\alpha}$$


## Choosing kernel and its hyperparameters

### In practice, it is up to the ML engineer to choose the best kernel for the model and data

### - The kernel $k(x , x′ )$ encodes assumptions about how much correlation there is between $f(x)$ and $f(x′)$
### - The kernel tells how the model should generalize from the training data.
### - For a kernel to be valid for Gaussian processes, the matrix $K(X,X)$ must be positive semidefinite for all possible $X$.

###  You can invent completely new kernels, as long as they fulfill this criterion. In particular, products and sums of valid kernels are again valid kernels:

### $$k_\times(x,x')= k_1(x,x')k_2(x,x')$$
### $$k_+(x,x')= k_1(x,x')+k_2(x,x')$$
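A quick numerical illustration of these two composition rules, assuming a squared exponential and a periodic kernel as the two building blocks (function names, grid and hyperparameters are hypothetical). The smallest eigenvalue of each combined matrix should be non-negative up to rounding error.

```python
import numpy as np

def sq_exp(x1, x2, length=1.0):
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return np.exp(-0.5 * d2 / length**2)

def periodic(x1, x2, length=1.0, period=1.0):
    d = np.abs(x1[:, None] - x2[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / length**2)

x = np.linspace(0.0, 4.0, 30)      # assumed input grid
K1 = sq_exp(x, x, length=2.0)
K2 = periodic(x, x)

K_sum = K1 + K2                    # k_+ : elementwise sum
K_prod = K1 * K2                   # k_x : elementwise (Hadamard) product

min_eig_sum = np.linalg.eigvalsh(K_sum).min()
min_eig_prod = np.linalg.eigvalsh(K_prod).min()
print(min_eig_sum, min_eig_prod)
```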

## Common kernel functions 

### Squared exponential/RBF

### $$k(x,x')= \sigma^2exp(-\frac{1}{2l^2}(x-x')^2)$$    
### ![11.png](attachment:11.png)

### Rational quadratic
### $$k(x,x')= \sigma^2(1+\frac{|x-x'|^2}{2\alpha l^2})^{-\alpha}$$
### ![Usssd.png](attachment:Usssd.png)
### Matérn ($\nu = 1/2$, exponential)
### $$k(x,x')= \sigma^2\exp(-\frac{1}{l}|x-x'|)$$
### ![2s2.png](attachment:2s2.png)
### Periodic kernel
### $$k(x,x')= \sigma^2exp(-\frac{2}{l^2}\sin^2(\pi\frac{|x-x'|}{p}))$$
### ![Untisatled.png](attachment:Untisatled.png)

## Combining different kernels

### - Long-term smooth trend (squared exponential)

### $$k_1(x,x')= \theta_1^2exp(-(x-x')^2/\theta_2^2)$$

### - Seasonal trend(quasi-periodic smooth)

### $$k_2(x,x')= \theta_3^2exp(-2\sin^2(\pi(x-x'))/\theta_5^2)\times exp(-\frac{1}{2}(x-x')^2/\theta_4^2)$$

### - Short and medium term anomaly(rational quadratic)

### $$ k_3(x,x')= \theta_6^2\bigg(1+\frac{(x-x')^2}{2\theta_8\theta_7^2}\bigg)^{-\theta_8}$$

### -Noise(independent gaussian)

### $$k_4(x,x')= \theta_9^2 exp(-\frac{(x-x')^2}{2\theta_{10}^2})+ \theta_{11}^2\delta_{xx'} $$

### $$k(x,x')= k_1(x,x')+ k_2(x,x')+ k_3(x,x')+ k_4(x,x')$$
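A sketch of the combined kernel in code, with the rational quadratic term written in its standard form $(1 + \dots)^{-\theta_8}$. The $\theta$ values below are placeholders chosen only so the example runs; they are not fitted hyperparameters and do not come from the notes.

```python
import numpy as np

# Hypothetical hyperparameters theta_1 ... theta_11 (illustrative values only)
theta = {1: 2.0, 2: 10.0, 3: 1.0, 4: 20.0, 5: 1.0, 6: 0.5,
         7: 1.0, 8: 1.0, 9: 0.2, 10: 0.1, 11: 0.1}

def combined_kernel(x1, x2, t=theta):
    d = x1[:, None] - x2[None, :]
    k1 = t[1]**2 * np.exp(-d**2 / t[2]**2)                              # long-term trend
    k2 = (t[3]**2 * np.exp(-2.0 * np.sin(np.pi * d)**2 / t[5]**2)
          * np.exp(-0.5 * d**2 / t[4]**2))                              # seasonal trend
    k3 = t[6]**2 * (1.0 + d**2 / (2.0 * t[8] * t[7]**2)) ** (-t[8])     # medium-term anomaly
    delta = (x1[:, None] == x2[None, :]).astype(float)                  # Kronecker delta
    k4 = t[9]**2 * np.exp(-d**2 / (2.0 * t[10]**2)) + t[11]**2 * delta  # noise
    return k1 + k2 + k3 + k4

x = np.linspace(0.0, 10.0, 40)     # assumed inputs, e.g. time in years
K = combined_kernel(x, x)
print(K.shape, np.linalg.eigvalsh(K).min())
```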

## Graphical model for Gaussian process

### - All pairs of latent variables are connected
### - Each prediction $y_*$ depends only on the corresponding single latent $f_*$
### - Adding a triplet $x_*$, $y_*$, $f_*$ doesn’t affect the distribution, due to the marginalization property of GPs       


### ![Untitasdled.png](attachment:Untitasdled.png)

## Finding hyperparameters by maximizing the marginal likelihood

### The marginal likelihood/evidence $p_\xi(D)$ says how probable the data is under the hyperparameters $\xi$
### With a similar argument as for maximum likelihood, we can select $\xi$ as:
### $$ \hat{\xi}= \arg\max_{\xi}\, p_\xi(D) $$
### The kernel function depends on $\xi$ hence we write $k_\xi(x,x'), K_\xi(X,X)$, etc

### The gaussian process model says :
### $$p(f(X))= \mathscr{N}(f(X);0, K_\xi(X,X))$$
### and since $y=f(X)+ε, ε \sim \mathscr{N}(0,\sigma_n^2 I)$
### $$p(y)=\mathscr{N}(y;0, K_\xi(X,X)+\sigma_n^2 I) $$

### $$\implies \log{p(y)}= -\frac{1}{2}y^T(K_\xi(X,X)+\sigma_n^2 I)^{-1}y -\frac{1}{2}\log det(K_\xi(X,X)+\sigma_n^2 I )-\frac{N}{2} \log 2\pi$$

### - Here we have chosen $\xi = \{\sigma_n^2,l\}$ by maximizing the marginal likelihood.
### (In this example, $k(x,x')= \exp(-\frac{1}{2l^2}(x-x')^2)$)
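A sketch of hyperparameter selection by maximizing the log marginal likelihood over a coarse grid. The toy data, the grid ranges and the use of a grid search (rather than a gradient-based optimizer) are assumptions made to keep the example short.

```python
import numpy as np

def log_marginal_likelihood(X, y, length, noise_var):
    """log p(y) for a zero-mean GP with kernel exp(-(x-x')^2 / (2 l^2))."""
    K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2 / length**2)
    L = np.linalg.cholesky(K + noise_var * np.eye(len(X)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))          # equals -1/2 log det(K + sigma_n^2 I)
            - 0.5 * len(X) * np.log(2.0 * np.pi))

# Assumed toy data: noisy samples of a sine
rng = np.random.default_rng(0)
X = np.linspace(0.0, 5.0, 25)
y = np.sin(X) + 0.1 * rng.standard_normal(25)

lengths = np.logspace(-1, 1, 15)                  # candidate values of l
noise_vars = np.logspace(-3, 0, 15)               # candidate values of sigma_n^2

scores = np.array([[log_marginal_likelihood(X, y, l, s) for s in noise_vars]
                   for l in lengths])
i, j = np.unravel_index(np.argmax(scores), scores.shape)
best_length, best_noise_var = lengths[i], noise_vars[j]
print(best_length, best_noise_var, scores[i, j])
```

In practice the same objective is usually handed to a gradient-based optimizer, but the grid makes it easy to see that the marginal likelihood trades off data fit against model complexity.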

### ![Untitlsasaed.png](attachment:Untitlsasaed.png)