# Kernel Ridge Regression  
- Objective function  
$$\text{min}_{w \in R^{n}} \frac{\lambda}{N} \|w\|^2 + \frac{1}{N}\sum_{i = 1}^N (w^T \psi(x_i) - y_i)^2 $$  
Using the representer theorem:  
if $J(w)$ has a minimizer, then it has the form $w^* = \sum_{i = 1}^N \alpha_i\psi(x_i)$.  
With the kernel matrix $K_{ij} = \psi(x_i)^T \psi(x_j)$, the objective function becomes  
$$\text{min}_{\alpha \in R^{N}} \frac{\lambda}{N} \sum_{i,j} \alpha_i\alpha_j K_{ij} + \frac{1}{N} \sum_{i = 1}^N (\sum_{j = 1}^N \alpha_j K_{ij} - y_i)^2   $$  
In matrix form:  
$$\text{min}_{\alpha \in R^{N}} \frac{\lambda}{N} \alpha^T K \alpha + \frac{1}{N} ((K\alpha)^T(K\alpha) - 2y^TK\alpha + y^Ty) $$  
We can solve this by setting its gradient to 0  

$$\nabla J(\alpha) = \frac{2 \lambda}{N}(K^T \alpha) + \frac{2}{N}(K^T K \alpha - K^T y)$$  
Since $K$ is symmetric, this equals $\frac{2}{N} K \left((\lambda I + K)\alpha - y\right)$, which is 0 for  
$$\alpha = (\lambda I + K)^{-1}y$$  
Note: $K$ is positive semi-definite and $\lambda > 0$, so $\lambda I + K$ is positive definite and its inverse exists.  
The time complexity of inverting this $N \times N$ matrix is $O(N^3)$.
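
The closed form above can be checked numerically. The following is a minimal sketch, assuming NumPy, a 1-D toy data set, and a Gaussian kernel; the helper name `krr_fit` and all parameter values are illustrative choices, not part of the notes. It solves the linear system instead of forming the inverse explicitly, and evaluates the gradient from the derivation at the solution.

```python
import numpy as np

def krr_fit(K, y, lam):
    """Solve (lam * I + K) alpha = y; avoids forming the inverse explicitly."""
    n = K.shape[0]
    return np.linalg.solve(lam * np.eye(n) + K, y)

# Toy 1-D regression data (assumed for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=30)
y = np.sin(x)

sigma, lam = 1.0, 0.1
# Gaussian kernel matrix K_ij = exp(-(x_i - x_j)^2 / (2 sigma^2)).
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * sigma**2))
alpha = krr_fit(K, y, lam)

# Gradient of the objective from the derivation; it should vanish at alpha.
N = len(y)
grad = (2.0 * lam / N) * (K @ alpha) + (2.0 / N) * (K @ K @ alpha - K @ y)

# Prediction at new points: f(x) = sum_j alpha_j k(x, x_j).
x_new = np.array([0.0, 1.5])
K_new = np.exp(-(x_new[:, None] - x[None, :]) ** 2 / (2.0 * sigma**2))
y_new = K_new @ alpha
```

Each prediction needs one kernel evaluation per training point, which is where the $O(N)$ prediction cost comes from.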
- Compare this with linear ridge regression  
linear ridge regression:  
$$\mathbf{w}=\left(\lambda \mathbf{I}+\mathbf{X}^{T} \mathbf{X}\right)^{-1} \mathbf{X}^{T} \mathbf{y}$$  
   - More restricted  
   - $O(d^3 + d^2 N)$ training  
   - $O(d)$ prediction

Kernel ridge regression:  
$$\boldsymbol{\alpha}=(\lambda \mathrm{I}+\mathrm{K})^{-1} \mathbf{y}$$  
   - More **flexible** with $K$  
   - $O(N^3)$ training  
   - $O(N)$ prediction
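
To make the comparison concrete, here is a small sketch (assuming NumPy and synthetic data; all sizes and values are illustrative) showing that kernel ridge regression with the linear kernel $K = XX^T$ gives the same predictions as linear ridge regression. Only the size of the linear system changes: $d \times d$ versus $N \times N$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 50, 3
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)
lam = 0.5

# Linear ridge regression: solve a d x d system.
w = np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)

# Kernel ridge regression with the linear kernel: solve an N x N system.
K = X @ X.T
alpha = np.linalg.solve(lam * np.eye(N) + K, y)

X_new = rng.normal(size=(5, d))
pred_linear = X_new @ w               # O(d) work per new point
pred_kernel = (X_new @ X.T) @ alpha   # N kernel evaluations per new point
```

In this special case the weights are related by $w = X^T \alpha$, which is the representer theorem written in matrix form.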

# Kernel ridge regression for classification  
- Least-squares SVM (LSSVM)  
<div align="center"><img src = "./LSSVM.jpg" width = '500' height = '100' align = center /></div>  
The boundaries are similar, but the right-hand plots have more support vectors, which means $\alpha$ is denser and prediction becomes slower.    

- **Tube Regression**  
<div align="center"><img src = "./tube regression.jpg" width = '500' height = '100' align = center /></div>   

We define a neutral zone (the tube) around the prediction: points inside the zone incur no error, and points outside it are penalized by their distance to the zone.  
  - The error measure ($\epsilon$-insensitive error)  
    $$l(y,\hat{y}) = \text{max}(0, |y - \hat{y}| - \epsilon)$$  
- **Tube error vs Squared error**  
<div align="center"><img src = "./Tube and Square.jpg" width = '500' height = '100' align = center /></div>   

The graphs show that the two errors are close when $|y - \hat{y}|$ is small. For large $|y - \hat{y}|$ the tube error grows only linearly while the squared error grows quadratically, so the tube error is less influenced by outliers.  
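
A small sketch of the two error measures, in plain Python. The function names, the value $\epsilon = 0.5$, and the residuals are assumptions chosen for illustration.

```python
def tube_error(y, y_hat, eps):
    """Epsilon-insensitive error: zero inside the tube, linear outside it."""
    return max(0.0, abs(y - y_hat) - eps)

def squared_error(y, y_hat):
    return (y - y_hat) ** 2

eps = 0.5
targets = [0.2, 1.0, 5.0]   # the prediction is fixed at 0, so these are the residuals
tube = [tube_error(t, 0.0, eps) for t in targets]
square = [squared_error(t, 0.0) for t in targets]

print(tube)    # [0.0, 0.5, 4.5]
print(square)  # the last entry is 25.0, far larger than the tube error 4.5
```

The residual of 5.0 plays the role of an outlier: it contributes 4.5 to the tube error but 25.0 to the squared error.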

- L2 regularization with tube regression  
$$\min \frac{1}{2}w^Tw + C \sum_{i = 1}^N \max (0, |y_i - w^Tx_i - b| - \epsilon)$$  
We solve it in a way similar to the SVM.  


Transform it into a constrained optimization problem:  
$$\begin{array}{l}
\min_{\mathbf{w}, b, \xi^{\vee}, \xi^{\wedge}} \frac{1}{2} \mathbf{w}^{T} \mathbf{w}+C \sum_{i=1}^{N}\left(\xi_{i}^{\vee}+\xi_{i}^{\wedge}\right) \\
\text{s.t. } -\epsilon-\xi_{i}^{\vee} \leq y_{i}-\mathbf{w}^{T} \mathbf{x}_{i}-b \leq \epsilon+\xi_{i}^{\wedge} \\
\xi_{i}^{\vee} \geq 0, \xi_{i}^{\wedge} \geq 0
\end{array}$$  
As in the SVM, we need a way to eliminate the $|\cdot|$: $\xi_{i}^{\wedge}$ measures the violation above the tube and $\xi_{i}^{\vee}$ the violation below it.   
This is the standard **Support Vector Regression (SVR) primal problem**.  
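
The role of the two slack variables can be illustrated with a short sketch. For a fixed prediction $f = w^T x + b$, the smallest slacks that satisfy the constraints are the positive parts of the upper and lower tube violations, and their sum equals the $\epsilon$-insensitive error. The function name `optimal_slacks` and the sample values are hypothetical.

```python
def optimal_slacks(y, f, eps):
    """Smallest slacks satisfying the SVR constraints for a fixed prediction f."""
    xi_up = max(0.0, (y - f) - eps)    # xi^wedge: target lies above the tube
    xi_down = max(0.0, (f - y) - eps)  # xi^vee: target lies below the tube
    return xi_up, xi_down

eps = 0.5
samples = [(2.0, 1.0), (1.0, 1.2), (0.0, 2.0)]  # (y, f) pairs, assumed values
for y, f in samples:
    xi_up, xi_down = optimal_slacks(y, f, eps)
    # At most one slack is positive, and the sum is the tube error.
    assert xi_up + xi_down == max(0.0, abs(y - f) - eps)
    print(y, f, xi_up, xi_down)
```

This is why minimizing over the slacks in the constrained problem recovers the unconstrained tube-regression objective.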

- Parameters  
$C$: trade-off between regularization and tube violations  
$\epsilon$: width of the tube, i.e. how much error we tolerate before it is penalized  


- SVR Dual  
We first write the Lagrangian:  
$$L(w,b,\xi^{\wedge}, \xi^{\vee}, \alpha^{\wedge}, \alpha^{\vee}, \lambda^{\wedge}, \lambda^{\vee}) = \frac{1}{2} w^Tw + C\sum_{i =1}^N (\xi_{i}^{\wedge}+ \xi_{i}^{\vee}) + \sum_{i =1}^N \alpha_{i}^{\wedge} (y_i -w^T x_i -b -\epsilon - \xi_{i}^{\wedge}) + \sum_{i = 1}^N \alpha_{i}^{\vee} (-y_i + w^Tx_i +b -\epsilon -\xi_{i}^{\vee}) + \sum_{i = 1}^N \left(\lambda_i^{\wedge} (-\xi^{\wedge}_{i}) + \lambda_i^{\vee} (-\xi^{\vee}_{i})\right)$$  

Then we can write the primal and dual problems, with $\lambda = (\lambda^{\wedge}, \lambda^{\vee})$:  
$$\begin{aligned}
p^{*} &=\inf _{w, b, \xi^{\wedge}, \xi^{\vee}} \sup _{\alpha^{\wedge}, \alpha^{\vee}, \lambda \succeq 0} L(w,b,\xi^{\wedge}, \xi^{\vee}, \alpha^{\wedge}, \alpha^{\vee}, \lambda) \\
& \geqslant \sup _{\alpha^{\wedge}, \alpha^{\vee}, \lambda \succeq 0} \inf _{w, b, \xi^{\wedge}, \xi^{\vee}} L(w,b,\xi^{\wedge}, \xi^{\vee}, \alpha^{\wedge}, \alpha^{\vee}, \lambda)=d^{*}
\end{aligned}$$  

We define the dual function, which depends only on the multipliers:  
$$
g(\alpha^{\wedge}, \alpha^{\vee}, \lambda) = \inf _{w, b, \xi^{\wedge}, \xi^{\vee}} L(w,b,\xi^{\wedge}, \xi^{\vee}, \alpha^{\wedge}, \alpha^{\vee}, \lambda)
$$
and compute the derivatives:  
$$\begin{array}{l}
\partial_{w} L=0 \quad \Longleftrightarrow \quad w-\sum_{i=1}^{N} (\alpha_{i}^{\wedge} - \alpha_{i}^{\vee}) x_{i}=0 \\
\partial_{b} L=0 \quad \Longleftrightarrow \quad-\sum_{i=1}^{N} (\alpha_{i}^{\wedge} - \alpha_i^{\vee}) =0 \quad \\
\partial_{\xi_{i}^{\wedge}} L=0 \quad \Longleftrightarrow \quad C - \alpha_{i}^{\wedge} -\lambda_{i}^{\wedge}=0  \\
\partial_{\xi_{i}^{\vee}} L=0 \quad \Longleftrightarrow \quad C - \alpha_{i}^{\vee} -\lambda_{i}^{\vee}=0
\end{array}$$  
The solution is quite similar to that of the SVM.  
We also have complementary slackness for the **optimal solution**:  
$$\begin{array}{l}
\alpha_{i}^{\wedge}\left(\epsilon+\xi_{i}^{\wedge}-y_{i}+\mathbf{w}^{T} \mathbf{x}_{i}+b\right)=0 \\
\alpha_{i}^{\vee}\left(\epsilon+\xi_{i}^{\vee}+y_{i}-\mathbf{w}^{T} \mathbf{x}_{i}-b\right)=0 \\  
\lambda_{i}^{\wedge}\xi_{i}^{\wedge} = 0 \\  
\lambda_{i}^{\vee}\xi_{i}^{\vee} = 0
\end{array}$$  
for all $i \in [1,N]$  

Substituting the stationarity conditions back into $L$, we then have:  
$$\begin{array}{l}
g(\alpha^{\wedge}, \alpha^{\vee}, \lambda) = -\frac{1}{2} \sum_{i, j=1}^{N}\left(\alpha_{i}^{\wedge}-\alpha_{i}^{\vee}\right)\left(\alpha_{j}^{\wedge}-\alpha_{j}^{\vee}\right) \underbrace{\Phi\left(\mathbf{x}_{i}\right)^{T} \Phi\left(\mathbf{x}_{j}\right)}_{k\left(\mathbf{x}_{i}, \mathbf{x}_{j}\right)} \\
-\epsilon \sum_{i=1}^{N}\left(\alpha_{i}^{\wedge}+\alpha_{i}^{\vee}\right)+\sum_{i=1}^{N} y_{i}\left(\alpha_{i}^{\wedge}-\alpha_{i}^{\vee}\right)
\end{array}
$$  
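
As a sketch of the step behind the expression above, written with the plain dot product $x_i^T x_j$ (which the kernel trick then replaces by $k(x_i, x_j)$): the condition $\sum_i (\alpha_i^{\wedge} - \alpha_i^{\vee}) = 0$ removes the terms in $b$, the conditions $C - \alpha_i^{\wedge} - \lambda_i^{\wedge} = 0$ and $C - \alpha_i^{\vee} - \lambda_i^{\vee} = 0$ remove the terms in $\xi_i^{\wedge}$ and $\xi_i^{\vee}$, and substituting $w = \sum_i (\alpha_i^{\wedge} - \alpha_i^{\vee}) x_i$ into what remains gives

$$\begin{aligned}
L &= \frac{1}{2} w^T w - w^T \sum_{i=1}^{N} (\alpha_i^{\wedge} - \alpha_i^{\vee}) x_i + \sum_{i=1}^{N} y_i (\alpha_i^{\wedge} - \alpha_i^{\vee}) - \epsilon \sum_{i=1}^{N} (\alpha_i^{\wedge} + \alpha_i^{\vee}) \\
&= -\frac{1}{2} \sum_{i,j=1}^{N} (\alpha_i^{\wedge} - \alpha_i^{\vee})(\alpha_j^{\wedge} - \alpha_j^{\vee}) x_i^T x_j + \sum_{i=1}^{N} y_i (\alpha_i^{\wedge} - \alpha_i^{\vee}) - \epsilon \sum_{i=1}^{N} (\alpha_i^{\wedge} + \alpha_i^{\vee})
\end{aligned}$$
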
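Now the problem becomes:  
$$\begin{array}{ll}
\sup _{\alpha^{\wedge}, \alpha^{\vee}, \lambda} g(\alpha^{\wedge}, \alpha^{\vee}, \lambda) \\
\text { s.t. }  \sum_{i=1}^{N}\left(\alpha_{i}^{\wedge}-\alpha_{i}^{\vee}\right)=0 \quad \alpha_{i}^{\vee}, \alpha_{i}^{\wedge} \in[0, C] \quad \forall i \in[1, N]
\end{array}$$  
The box constraints follow from $C - \alpha_i^{\wedge} - \lambda_i^{\wedge} = 0$ and $C - \alpha_i^{\vee} - \lambda_i^{\vee} = 0$ together with $\lambda_i^{\wedge}, \lambda_i^{\vee} \geq 0$.  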

Therefore, if a sample lies strictly inside the tube, both of its constraints are slack, so complementary slackness forces $\alpha_i^{\wedge} = \alpha_i^{\vee} = 0$ (and $\xi_i^{\wedge} = \xi_i^{\vee} = 0$). Hence $w$ is determined only by the samples on or outside the tube boundary.   

- Using the kernel trick, the prediction function becomes:  
$$f(x) = \sum_{i=1}^{N} (\alpha_{i}^{\wedge} - \alpha_{i}^{\vee}) \Phi(x_{i})^T \Phi(x) + b$$
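
The prediction function can be sketched in plain Python as follows. The dual coefficients below are hand-picked, hypothetical values (not the output of a solver), chosen only so that $\sum_i (\alpha_i^{\wedge} - \alpha_i^{\vee}) = 0$ holds; the helper names and the Gaussian kernel are likewise assumptions for illustration.

```python
import math

def gaussian_k(x, z, sigma=1.0):
    """Gaussian kernel between two points given as tuples."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2.0 * sigma ** 2))

def svr_predict(x, X_train, alpha_up, alpha_down, b, kernel):
    """f(x) = sum_i (alpha_up[i] - alpha_down[i]) * k(x_i, x) + b."""
    total = b
    for x_i, a_up, a_down in zip(X_train, alpha_up, alpha_down):
        coef = a_up - a_down
        if coef != 0.0:  # samples with zero coefficient do not contribute
            total += coef * kernel(x_i, x)
    return total

# Hypothetical dual solution: only the first and last samples are support vectors.
X_train = [(0.0,), (1.0,), (2.0,)]
alpha_up = [0.5, 0.0, 0.0]
alpha_down = [0.0, 0.0, 0.5]
b = 1.0

f_mid = svr_predict((1.0,), X_train, alpha_up, alpha_down, b, gaussian_k)
f_left = svr_predict((0.0,), X_train, alpha_up, alpha_down, b, gaussian_k)
```

Only samples with a nonzero coefficient are visited in the sum, so the prediction cost scales with the number of support vectors rather than with $N$.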

## Some ideas of Gaussian Kernel  
$$k_{\sigma}\left(\mathbf{x}, \mathbf{x}^{\prime}\right)=\exp \left(-\frac{\left\|\mathbf{x}-\mathbf{x}^{\prime}\right\|^{2}}{2 \sigma^{2}}\right)=\exp \left(-\frac{\mathbf{x}^{T} \mathbf{x}-2 \mathbf{x}^{T} \mathbf{x}^{\prime}+\mathbf{x}^{\prime T} \mathbf{x}^{\prime}}{2 \sigma^{2}}\right)$$  
- Depends on a width parameter $\sigma$  
- The smaller the width, the more the prediction at a point depends only on its nearest neighbours   
-  Example of a *universal* kernel: it can uniformly approximate any continuous target function on a compact domain (in practice, limited by the number of training examples and the choice of $\sigma$)
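
The effect of the width can be seen in a short sketch, assuming NumPy. The function name `gaussian_gram`, the three sample points, and the two values of $\sigma$ are illustrative. The squared distances are computed with the expansion $\mathbf{x}^T\mathbf{x} - 2\mathbf{x}^T\mathbf{x}' + \mathbf{x}'^T\mathbf{x}'$ from the formula above.

```python
import numpy as np

def gaussian_gram(X, Z, sigma):
    """Gaussian kernel matrix between the rows of X and the rows of Z."""
    sq = (
        np.sum(X**2, axis=1)[:, None]
        - 2.0 * X @ Z.T
        + np.sum(Z**2, axis=1)[None, :]
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

X = np.array([[0.0], [1.0], [3.0]])
K_narrow = gaussian_gram(X, X, sigma=0.3)
K_wide = gaussian_gram(X, X, sigma=3.0)

# With a small sigma the off-diagonal entries are close to 0, so each point
# is similar only to itself; with a large sigma distant points still interact.
print(np.round(K_narrow, 3))
print(np.round(K_wide, 3))
```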

## Kernels on structured data  
- Kernels are a generalization of dot products to arbitrary domains   
- It is possible to design kernels over structured objects like sequences, trees or graphs   
- The idea is to design a pairwise function measuring the similarity of two objects   
- This measure has to satisfy the positive definiteness (p.d.) conditions to be a valid kernel

### E.g. string kernel: 3-gram spectrum kernel  
<div align="center"><img src = "./3-gram.jpg" width = '500' height = '100' align = center /></div>  
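
A minimal sketch of an $n$-gram spectrum kernel in plain Python: each string is mapped to the counts of its contiguous substrings of length $n$, and the kernel is the dot product of the two count vectors. The function names and the example strings are assumptions for illustration and are not taken from the figure.

```python
from collections import Counter

def ngram_counts(s, n=3):
    """Count every contiguous substring of length n in s."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def spectrum_kernel(s, t, n=3):
    """Dot product of the n-gram count vectors of s and t."""
    cs, ct = ngram_counts(s, n), ngram_counts(t, n)
    return sum(count * ct[gram] for gram, count in cs.items())

# "abcabc" has 3-grams abc (x2), bca, cab; "bcabca" has bca (x2), cab, abc.
print(spectrum_kernel("abcabc", "bcabca"))  # 2*1 + 1*2 + 1*1 = 5
print(spectrum_kernel("abcabc", "abcabc"))  # 2*2 + 1*1 + 1*1 = 6
```

Because the kernel is an explicit dot product of feature vectors, it is positive semi-definite by construction.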

## Convolution Kernels  
- Decomposition kernels define a kernel as the convolution of kernels on its parts:   
$$\left(k_{1} \star \cdots \star k_{D}\right)\left(x, x^{\prime}\right)=\sum_{\left(x_{1}, \ldots, x_{D}\right) \in R(x)} \; \sum_{\left(x_{1}^{\prime}, \ldots, x_{D}^{\prime}\right) \in R\left(x^{\prime}\right)} \prod_{d=1}^{D} k_{d}\left(x_{d}, x_{d}^{\prime}\right)$$  
- where the sums run over all possible decompositions $R(x)$ and $R(x')$ of $x$ and $x'$.
 
### Set Kernels  
- Let $R(x)$ be the set membership relationship (written as $\in$)  
- Let $k_{\text{member}}(\xi,\xi^{\prime})$ be a kernel defined over set elements 
- The set kernel is defined as:
$$k_{\text{set}}\left(X, X^{\prime}\right)=\sum_{\xi \in X} \sum_{\xi^{\prime} \in X^{\prime}} k_{\text{member}}\left(\xi, \xi^{\prime}\right)$$  
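
The double sum translates directly into code. The sketch below is in plain Python; the function names and the choice of a delta kernel (1 if the two elements are equal, 0 otherwise) as the member kernel are assumptions for illustration.

```python
def set_kernel(X, X_prime, k_member):
    """Sum the member kernel over all pairs of elements from the two sets."""
    return sum(k_member(a, b) for a in X for b in X_prime)

def delta_kernel(a, b):
    """Member kernel: 1.0 if the elements are equal, else 0.0."""
    return 1.0 if a == b else 0.0

A = {"a", "b", "c"}
B = {"b", "c", "d"}
# With the delta kernel, the set kernel counts the shared elements.
print(set_kernel(A, B, delta_kernel))  # 2.0
```

Note that the value grows with the size of the sets, which is the effect that kernel normalization is meant to reduce.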

## Kernel normalization  
-  Kernel values can often be influenced by the size of the objects   
- E.g. a longer string has more substrings → higher kernel value  
- This effect can be reduced by *normalizing* the kernel

### Cosine normalization
Cosine normalization computes the cosine of the angle between the two objects in feature space:
$$\hat{k}\left(x, x^{\prime}\right)=\frac{k\left(x, x^{\prime}\right)}{\sqrt{k(x, x) k\left(x^{\prime}, x^{\prime}\right)}}$$
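
A minimal sketch of cosine normalization as a wrapper around an arbitrary kernel function, in plain Python. The names `cosine_normalize` and `dot_kernel` and the example vectors are assumptions for illustration. The wrapper assumes $k(x, x) > 0$ for every input; an object with $k(x, x) = 0$ would cause a division by zero.

```python
import math

def cosine_normalize(k):
    """Return the cosine-normalized version of the kernel function k."""
    def k_hat(x, x_prime):
        # Assumes k(x, x) > 0 and k(x', x') > 0.
        return k(x, x_prime) / math.sqrt(k(x, x) * k(x_prime, x_prime))
    return k_hat

def dot_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

k_hat = cosine_normalize(dot_kernel)

small, large = (1.0, 1.0), (10.0, 10.0)
print(dot_kernel(small, large))  # 20.0: the raw value grows with the scale
print(k_hat(small, large))       # 1.0: same direction after normalization
```

After normalization $\hat{k}(x, x) = 1$ for every object, so objects of different size become comparable.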