# Support Vector Regression

How can we combine an L2-regularized linear model with a kernel to obtain an analytic solution, i.e. kernel ridge regression?

For any L2-regularized linear model

$$
\min_w \frac{\lambda}{N} w^T w + \frac{1}{N} \sum_{n=1}^N err \big( y_n, w^T z_n \big)
$$

the optimal solution has the form $ w_{*} = \sum_{n=1}^N \beta_n z_n $ (representer theorem).

for Regression with Squared Error:

$$
err \big( y_n, w^T z_n \big) = \big( y_n - w^T z_n \big)^2
$$

## Kernel Ridge Regression Problem

$
\min_w \frac{\lambda}{N} w^T w + \frac{1}{N} \sum_{n=1}^N \big( y_n - w^T z_n \big)^2
$

optimal $ w_{*} = \sum_{n=1}^N \beta_n z_n $

Substituting $ w = \sum_{n=1}^N \beta_n z_n $ turns the problem into an optimization over $ \beta $.

$$
\min_{\beta} \frac{\lambda}{N} \sum_{n=1}^N \sum_{m=1}^N \beta_n \beta_m K(x_n, x_m) 
+ \frac{1}{N} \sum_{n=1}^N \Big( y_n - \sum_{m=1}^N \beta_m  K(x_n,x_m) \Big)^2 \\
= \frac{\lambda}{N} \beta^T K \beta + \frac{1}{N} \big( \beta^T K ^T K \beta - 2 \beta^T K^T y + y^T y \big)
$$
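The two substitutions behind this rewrite can be spelled out as follows. This is a worked step added for clarity, assuming the kernel computes the inner product $ K(x_n, x_m) = z_n^T z_m $ and that $ K $ also denotes the $ N \times N $ matrix of those values:

$$
\begin{align}
w^T w &= \Big( \sum_{n=1}^N \beta_n z_n \Big)^T \Big( \sum_{m=1}^N \beta_m z_m \Big) = \sum_{n=1}^N \sum_{m=1}^N \beta_n \beta_m K(x_n, x_m) = \beta^T K \beta \\
w^T z_n &= \sum_{m=1}^N \beta_m z_m^T z_n = \sum_{m=1}^N \beta_m K(x_n, x_m) = (K \beta)_n
\end{align}
$$

The squared-error sum is therefore $ \| y - K \beta \|^2 $, which expands to the three terms inside the parentheses.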

### Solving Kernel Ridge Regression

$
E_{aug}(\beta) = \frac{\lambda}{N} \beta^T K \beta + \frac{1}{N} \big( \beta^T K ^T K \beta - 2 \beta^T K^T y + y^T y \big)
$

$
\nabla E_{aug}(\beta) = \frac{2}{N} \big( \lambda K^T I \beta + K^T K \beta - K^T y \big) \\
= \frac{2}{N} K^T \big( ( \lambda I + K ) \beta - y \big)
$

Want $ \nabla E_{aug} (\beta) = 0 $. One analytic solution is:

$$
\beta = \big( \lambda I + K \big)^{-1} y
$$

- $ ( \lambda I + K )^{-1} $ always exists for $ \lambda \gt 0 $, because K is positive semi-definite (Mercer's condition), so $ \lambda I + K $ is positive definite.
- time complexity: $ O(N^3) $ with simple dense matrix inversion

Can now do non-linear regression 'easily'.
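The closed-form solution $ \beta = ( \lambda I + K )^{-1} y $ can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names, the Gaussian kernel with `gamma=1.0`, and the toy data are all assumptions made for the example, and the linear solve is a simple dense Gaussian elimination, matching the $ O(N^3) $ cost noted above.

```python
import math

def gaussian_kernel(x1, x2, gamma=1.0):
    """K(x, x') = exp(-gamma * ||x - x'||^2) for two equal-length sequences."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x1, x2)))

def solve_linear_system(A, y):
    """Solve A x = y by Gaussian elimination with partial pivoting (O(N^3))."""
    n = len(A)
    M = [list(row) + [y_i] for row, y_i in zip(A, y)]  # augmented matrix [A | y]
    for col in range(n):
        # Move the row with the largest pivot candidate into place.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_kernel_ridge(X, y, lam, kernel):
    """Return beta solving (lam * I + K) beta = y."""
    n = len(X)
    A = [[kernel(X[i], X[j]) + (lam if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return solve_linear_system(A, y)

def predict(x, X, beta, kernel):
    """g(x) = sum_n beta_n * K(x_n, x)."""
    return sum(b_n * kernel(x_n, x) for b_n, x_n in zip(beta, X))

# Toy 1-D data (made up for illustration): y = x^2 sampled at four points.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 1.0, 4.0, 9.0]
beta = fit_kernel_ridge(X, y, lam=0.1, kernel=gaussian_kernel)
g = predict([1.5], X, beta, gaussian_kernel)
```

Note that prediction sums over every training point: nothing in this solution encourages any $ \beta_n $ to be zero.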

![img](imgs/c206-linear-kernel-ridge-reg.jpg)

### Soft-Margin SVM versus Least-Squares SVM

least-squares SVM (LSSVM) = kernel ridge regression for classification

LSSVM and soft-margin Gaussian SVM find similar boundaries, but LSSVM has many more support vectors.

- dense $ \beta $ : LSSVM, kernel LogReg
- sparse $ \alpha $ : standard SVM

As a result, LSSVM is slower at prediction time.

Can we find a sparse $ \beta $, like standard SVM does?

![img](imgs/c206-tube.png)

Ignore errors that fall inside the tube (the purple region). With the height of the purple region set to $ 2 \epsilon $, the error measure becomes:

$$
err(y, s) = \max \big( 0, | s - y | -\epsilon \big)
$$

This error measure is usually called the $ \epsilon $-insensitive error, with $ \epsilon \gt 0 $.
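As a small sketch of this error measure, the function below evaluates it next to the squared error. The function names and sample values are assumptions chosen for illustration.

```python
def tube_error(y, s, epsilon):
    """Epsilon-insensitive error: max(0, |s - y| - epsilon)."""
    return max(0.0, abs(s - y) - epsilon)

def squared_error(y, s):
    """Squared error, shown for comparison."""
    return (s - y) ** 2

epsilon = 0.5
target = 1.0
for score in (1.0, 1.3, 2.0, 4.0):
    # Inside the tube (|s - y| <= epsilon) the tube error is zero;
    # outside, it grows linearly while the squared error grows quadratically.
    print(score, tube_error(target, score, epsilon), squared_error(target, score))
```

Far from the target the tube error grows only linearly, so a single outlier contributes less to it than to the squared error.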

Next: apply L2-regularized tube regression to get a sparse $ \beta $.

### L2-Regularized Tube Regression

$$
\min_w \ \ \frac{\lambda}{N} w^T w + \frac{1}{N} \sum_{n=1}^N \max \big( 0, |w^T z_n - y_n| - \epsilon \big)
$$

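We want to find a sparse solution along the same route as standard SVM (primal, then QP, then dual). To do so, we adjust the formulation above: separate out b and change the coefficients.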

$$
\min_{b,w} \ \ \frac{1}{2} w^T w + C \sum_{n=1}^N \max \big( 0, |w^T z_n + b - y_n| - \epsilon \big)
$$

Rewrite the error measure using slack variables $ \xi_n $:

$$
\begin{align}
\min_{b,w,\xi} \ \ & \frac{1}{2} w^T w + C \sum_{n=1}^N \xi_n \\
s.t. \ \ & \big| w^T z_n + b - y_n \big| \le \epsilon + \xi_n \\
& \xi_n \ge 0
\end{align}
$$

To make the constraints linear, remove the absolute value by splitting $ \xi $ into two:  
$ \xi^{\wedge} $ : the violation above the tube  
$ \xi^{\vee} $ : the violation below the tube

$$
\begin{align}
\min_{b,w,\xi^{\vee},\xi^{\wedge}} \ \ & \frac{1}{2} w^T w + C \sum_{n=1}^N \big( \xi_n^{\vee} + \xi_n^{\wedge} \big) \\
s.t. \ \ & - \epsilon - \xi_n^{\vee} \le y_n - w^T z_n - b \le \epsilon + \xi_n^{\wedge} \\
& \xi_n^{\wedge} \ge 0 \\
& \xi_n^{\vee} \ge 0
\end{align}
$$

#### Support Vector Regression (SVR) primal:

- minimize regularizer: $ w^T w $
- upper tube violations $ \xi_n^{\wedge} $
- lower tube violations $ \xi_n^{\vee} $
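For a fixed $ (w, b) $, the smallest slacks that satisfy the constraints are $ \xi_n^{\wedge} = \max(0, y_n - w^T z_n - b - \epsilon) $ and $ \xi_n^{\vee} = \max(0, w^T z_n + b - y_n - \epsilon) $. The sketch below computes them and the resulting primal objective; the helper names and the toy data are assumptions for illustration only.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def optimal_slacks(w, b, Z, y, epsilon):
    """Smallest (xi_upper, xi_lower) satisfying the SVR primal constraints for fixed (w, b)."""
    xi_upper, xi_lower = [], []
    for z_n, y_n in zip(Z, y):
        residual = y_n - dot(w, z_n) - b
        xi_upper.append(max(0.0, residual - epsilon))   # y_n lies above the tube
        xi_lower.append(max(0.0, -residual - epsilon))  # y_n lies below the tube
    return xi_upper, xi_lower

def svr_primal_objective(w, b, Z, y, C, epsilon):
    """(1/2) w^T w + C * sum of upper and lower tube violations."""
    xi_upper, xi_lower = optimal_slacks(w, b, Z, y, epsilon)
    return 0.5 * dot(w, w) + C * (sum(xi_upper) + sum(xi_lower))

# Toy data (made up): the line w = [1], b = 0 with tube half-width 0.5.
Z = [[0.0], [1.0], [2.0], [3.0]]
y = [0.2, 2.0, 1.9, 1.5]
w, b = [1.0], 0.0
xi_upper, xi_lower = optimal_slacks(w, b, Z, y, epsilon=0.5)
objective = svr_primal_objective(w, b, Z, y, C=1.0, epsilon=0.5)
```

For each point at most one of the two slacks is positive, and their sum equals the $ \epsilon $-insensitive error of that point.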


### Quadratic Programming for SVR

The formulation above is the primal QP of SVR. Its parameters and size are:

- Parameter C : trade-off between regularization and tube violation
- Parameter $ \epsilon $ : vertical tube width (the tube extends $ \epsilon $ above and below the prediction)
- QP of $ \tilde{d} + 1 + 2N $ variables, 2N + 2N constraints.

Next, we derive the dual of the SVR primal, which removes the dependence on $ \tilde{d} $.

### Lagrange Multipliers $ \alpha^{\vee}, \alpha^{\wedge} $


$$
\begin{align}
\text{objective function} \ \ & \frac{1}{2} w^T w + C \sum_{n=1}^N \big( \xi_n^{\vee} + \xi_n^{\wedge} \big) \\
\text{Lagrange multiplier } \alpha_n^{\wedge} \ \ & & y_n - w^T z_n - b & \le \epsilon + \xi_n^{\wedge} \\
\text{Lagrange multiplier } \alpha_n^{\vee} \ \ & - \epsilon - \xi_n^{\vee} \le & y_n - w^T z_n - b & &  \\
& \xi_n^{\wedge} \ge 0 & & \\
& \xi_n^{\vee} \ge 0   & &
\end{align}
$$
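Written out in full, the Lagrangian implied by these pairings would look like the following. The multipliers $ \gamma_n^{\wedge}, \gamma_n^{\vee} $ for the $ \xi \ge 0 $ constraints are names assumed here for illustration; the notes do not name them.

$$
\begin{align}
\mathcal{L} = \ & \frac{1}{2} w^T w + C \sum_{n=1}^N \big( \xi_n^{\vee} + \xi_n^{\wedge} \big) \\
& + \sum_{n=1}^N \alpha_n^{\wedge} \big( y_n - w^T z_n - b - \epsilon - \xi_n^{\wedge} \big) \\
& + \sum_{n=1}^N \alpha_n^{\vee} \big( w^T z_n + b - y_n - \epsilon - \xi_n^{\vee} \big) \\
& - \sum_{n=1}^N \gamma_n^{\wedge} \xi_n^{\wedge} - \sum_{n=1}^N \gamma_n^{\vee} \xi_n^{\vee}
\end{align}
$$

All multipliers are non-negative. Setting $ \frac{\partial \mathcal{L}}{\partial \xi_n^{\wedge}} = 0 $ gives $ C - \alpha_n^{\wedge} - \gamma_n^{\wedge} = 0 $, hence $ 0 \le \alpha_n^{\wedge} \le C $, and likewise $ 0 \le \alpha_n^{\vee} \le C $.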

### Some of the KKT conditions

To get w:

$
\frac{\partial \mathcal{L}}{\partial w_i} = 0 \to \\
w = \sum_{n=1}^N \underbrace{\big( \alpha_n^{\wedge} - \alpha_n^{\vee} \big)}_{\beta_n} z_n
$

To get b:

$
\frac{\partial \mathcal{L}}{\partial b} = 0 \to \\
\sum_{n=1}^N \big( \alpha_n^{\wedge} - \alpha_n^{\vee} \big) = 0
$

Complementary Slackness:

$
\alpha_n^{\wedge} \big( \epsilon + \xi_n^{\wedge} - y_n + w^T z_n + b \big) = 0 \\
\alpha_n^{\vee}   \big( \epsilon + \xi_n^{\vee}   + y_n - w^T z_n - b \big) = 0
$

## SVM Dual and SVR Dual

![img](imgs/c206-svr-dual.png)
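For reference, the SVR dual can be written in the notation of these notes as below. This is the standard form that follows from the KKT conditions, with $ k_{n,m} = K(x_n, x_m) $; it is given as a sketch and is not transcribed from the figure.

$$
\begin{align}
\min_{\alpha^{\wedge}, \alpha^{\vee}} \ \ & \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N \big( \alpha_n^{\wedge} - \alpha_n^{\vee} \big) \big( \alpha_m^{\wedge} - \alpha_m^{\vee} \big) k_{n,m} + \sum_{n=1}^N \Big( ( \epsilon - y_n ) \alpha_n^{\wedge} + ( \epsilon + y_n ) \alpha_n^{\vee} \Big) \\
s.t. \ \ & \sum_{n=1}^N \big( \alpha_n^{\wedge} - \alpha_n^{\vee} \big) = 0 \\
& 0 \le \alpha_n^{\wedge} \le C, \ \ 0 \le \alpha_n^{\vee} \le C
\end{align}
$$

Like the SVM dual, this is a QP over $ 2N $ variables whose size no longer depends on $ \tilde{d} $.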

### Sparsity of SVR solution

$
\frac{\partial \mathcal{L}}{\partial w_i} = 0 \to \\
w = \sum_{n=1}^N \underbrace{\big( \alpha_n^{\wedge} - \alpha_n^{\vee} \big)}_{\beta_n} z_n
$

Complementary Slackness:

$
\alpha_n^{\wedge} \big( \epsilon + \xi_n^{\wedge} - y_n + w^T z_n + b \big) = 0 \\
\alpha_n^{\vee}   \big( \epsilon + \xi_n^{\vee}   + y_n - w^T z_n - b \big) = 0
$

Sparsity : when is $ \beta_n $ equal to 0 ?

For a point strictly inside the tube, $ \big| w^T z_n + b - y_n \big| \lt \epsilon $  
$ \to \xi_n^{\wedge} = 0, \ \  \xi_n^{\vee} = 0 $  
$ \to ( \epsilon + \xi_n^{\wedge} - y_n + w^T z_n + b ) \ne 0, \ \ ( \epsilon + \xi_n^{\vee} + y_n - w^T z_n - b ) \ne 0 $  
$ \to \alpha_n^{\wedge} = 0, \ \  \alpha_n^{\vee} = 0 $  
$ \to \beta_n = 0 $

SVs $ ( \beta_n \ne 0 ) $ : on or outside tube.

SVR: allows sparse $ \beta $
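The practical payoff of a sparse $ \beta $ is cheaper prediction: only support vectors need a kernel evaluation. The sketch below illustrates this with $ g(x) = \sum_n \beta_n K(x_n, x) + b $. The function names, the Gaussian kernel, and especially the multiplier values are made up for illustration; they are not the output of an actual QP solver.

```python
import math

def gaussian_kernel(x1, x2, gamma=1.0):
    """K(x, x') = exp(-gamma * ||x - x'||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x1, x2)))

def predict_svr(x, X, alpha_upper, alpha_lower, b, kernel, tol=1e-12):
    """Return (g(x), number of kernel evaluations), skipping points with beta_n = 0."""
    total = b
    n_kernel_evals = 0
    for x_n, a_up, a_lo in zip(X, alpha_upper, alpha_lower):
        beta_n = a_up - a_lo  # beta_n = alpha_upper_n - alpha_lower_n
        if abs(beta_n) <= tol:
            continue  # strictly inside the tube: not a support vector
        total += beta_n * kernel(x_n, x)
        n_kernel_evals += 1
    return total, n_kernel_evals

# Hypothetical multipliers: only the first and fourth points are support vectors,
# and sum_n (alpha_upper_n - alpha_lower_n) = 0 as the KKT condition for b requires.
X = [[0.0], [1.0], [2.0], [3.0], [4.0]]
alpha_upper = [0.7, 0.0, 0.0, 0.0, 0.0]
alpha_lower = [0.0, 0.0, 0.0, 0.7, 0.0]
b = 1.0
value, evals = predict_svr([1.5], X, alpha_upper, alpha_lower, b, gaussian_kernel)
```

Here only two of the five training points contribute to the prediction, whereas a dense $ \beta $ would require a kernel evaluation for every training point.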

### Map of Linear Models

![img](imgs/c206-map-linear-models.png)

### Map of Linear / Kernel Models

![img](imgs/c206-map-lineark-models.png)

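- First row: PLA/pocket and linear SVR are rarely used in practice because of their worse performance.
- Third row: kernel ridge regression and kernel logistic regression are rarely used in practice because of their dense $ \beta $. The corresponding methods in the fourth row are usually used instead.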