# Primal-dual regularized LS-SVM

Least Squares Support Vector Machines (LS-SVM), Kernel Ridge Regression (KRR), Radial Basis Function Networks (RBFN), Gaussian Processes (GP) and Kriging are regression models that are based on the same idea of decomposing the target $y$ in the Radial Basis Function (RBF) basis generated by all observations $x_i \in \mathbb{R}^d$, $i=1,\ldots,N$. More specifically,

$$y(x) = K(x) \cdot \alpha + b$$

where $K(x)$ is defined as

$$K(x) := \begin{bmatrix}\exp(-\gamma\|x_1 - x\|^2) & \cdots & \exp(-\gamma\|x_N - x\|^2)\end{bmatrix}\mbox{.}$$

In other words, the value $y(x)$ is determined as a linear combination of kernel functions (in this case Gaussian kernels) centered at the observations $x_i$.
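
As a concrete illustration, the prediction rule above can be written in a few lines of NumPy. This is a minimal sketch, assuming a Gaussian kernel with width parameter `gamma`; the function names are illustrative and not part of the original text.

```python
import numpy as np

def rbf_row(x, X, gamma):
    """K(x): row of Gaussian kernel evaluations centered at the observations x_i (rows of X)."""
    return np.exp(-gamma * np.sum((X - x) ** 2, axis=1))

def predict(x, X, alpha, b, gamma):
    """y(x) = K(x) . alpha + b"""
    return rbf_row(x, X, gamma) @ alpha + b
```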

Given observed targets $y_i$ corresponding to the observations $x_i$, the dual weights $\alpha$ and bias $b$ can be learned from the least squares problem

$$\arg \min_{\alpha,b} \|K \cdot \alpha + b - y\|^2$$

where $K$ is called the kernel matrix and is defined as

$$K := \begin{bmatrix} \exp(-\gamma\|x_i-x_j\|^2)\end{bmatrix}_{i,j}\mbox{.}$$
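
For reference, a minimal NumPy sketch of this unregularized fit, solving for $\alpha$ and $b$ jointly from the augmented matrix $[K \;\; 1]$; the helper names are illustrative:

```python
import numpy as np

def kernel_matrix(X, gamma):
    """K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

def fit_unregularized(X, y, gamma):
    """Least squares fit of alpha and b from min ||K . alpha + b - y||^2."""
    n = len(y)
    A = np.hstack([kernel_matrix(X, gamma), np.ones((n, 1))])
    sol, *_ = np.linalg.lstsq(A, y, rcond=None)
    return sol[:-1], sol[-1]  # alpha, b
```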

However, learning $\alpha$ this way leads to low generalization performance because $y(x)$ interpolates the $y_i$. In the case of a classification task, we want the solution to strike a good balance between the following three properties:

  1. Low bias. This goal is captured by the least squares objective above.
  2. Maximum margin. Among all classifiers, we are more interested in classifiers whose decision boundary has a maximal margin or distance to the observed classes. Maximizing the margin is equivalent to adding a primal regularization term $\|w\|^2 \equiv \alpha^T\cdot K \cdot \alpha$.
  3. Minimal complexity. Among all classifiers, we are more interested in classifiers whose decision boundary is as smooth as possible. This can be captured elegantly in a dual regularization term of the form $\alpha^T \cdot \begin{bmatrix} \sqrt{k(x_i,x_j)} \left( 2 - \gamma\|x_i - x_j\|^2\right) \end{bmatrix}_{i,j} \cdot \alpha := \alpha^T \cdot L \cdot \alpha$ (a construction sketch follows this list).
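
Since $\sqrt{k(x_i,x_j)} = \exp(-\tfrac{\gamma}{2}\|x_i - x_j\|^2)$ for the Gaussian kernel, the matrix $L$ from property (3) can be assembled directly. A minimal sketch, with illustrative names:

```python
import numpy as np

def dual_regularizer(X, gamma):
    """L[i, j] = sqrt(k(x_i, x_j)) * (2 - gamma * ||x_i - x_j||^2)
    for the Gaussian kernel k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * gamma * sq) * (2.0 - gamma * sq)
```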

Combining these into a single objective leads to the least squares problem

$$\arg \min_{\alpha,b} \frac{1-\rho}{2} \|K \cdot \alpha + b - y\|^2 + \frac{\rho}{2}\alpha^T\cdot ((1-\delta) K + \delta L)\cdot \alpha$$

where the hyperparameter $\rho \in (0,1)$ controls the amount of regularization and the hyperparameter $\delta \in (0,1)$ controls the balance between primal and dual regularization.

We can rewrite the least squares problem as

$$ \arg \min_{\alpha,b} \frac{1}{2} \left\|P \begin{bmatrix}K & 1_{n\times 1} \\ R & 0_{n\times 1} \end{bmatrix} \cdot \begin{bmatrix}\alpha \\ b\end{bmatrix} - P \begin{bmatrix}y \\ 0\end{bmatrix}\right\|^2 $$

where $R$ is the matrix square root of the regularization term $(1-\delta)K + \delta L$ and $P$ is the diagonal matrix $\mathrm{diag}(\sqrt{1-\rho}\,\mathbb{I}_n,\sqrt{\rho}\,\mathbb{I}_n)$.
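
Putting the pieces together, the stacked least squares problem can be solved directly. The sketch below assumes the kernel and regularizer constructions from the previous sketches and uses `scipy.linalg.sqrtm` for the matrix square root $R$; all names are illustrative:

```python
import numpy as np
from scipy.linalg import sqrtm

def fit_primal_dual(X, y, gamma, rho, delta):
    """Solve min 0.5 * || P [[K, 1], [R, 0]] [alpha; b] - P [y; 0] ||^2."""
    n = len(y)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq)                              # kernel matrix
    L = np.exp(-0.5 * gamma * sq) * (2.0 - gamma * sq)   # dual regularizer
    # matrix square root of the regularizer (real part taken to discard round-off imaginary parts)
    R = np.real(sqrtm((1.0 - delta) * K + delta * L))
    A = np.block([[K, np.ones((n, 1))],
                  [R, np.zeros((n, 1))]])
    p = np.concatenate([np.full(n, np.sqrt(1.0 - rho)),  # diagonal of P
                        np.full(n, np.sqrt(rho))])
    rhs = np.concatenate([y, np.zeros(n)])
    sol, *_ = np.linalg.lstsq(p[:, None] * A, p * rhs, rcond=None)
    return sol[:-1], sol[-1]  # alpha, b
```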


Setting the hyperparameter $\rho$ aside for a moment (i.e., taking $P$ to be the identity), the normal equations associated with this least squares problem are given by

$$ \begin{bmatrix}K^T K + \left((1-\delta)K + \delta L\right) & K^T\cdot 1_{n\times 1} \\ 1_{1\times n}\cdot K & 1_{1\times n}\cdot 1_{n \times 1}\end{bmatrix}\cdot\begin{bmatrix}\alpha \\ b\end{bmatrix} = \begin{bmatrix}K^T \\ 1_{1\times n}\end{bmatrix} \cdot y $$

which is similar to, but not the same as, the LS-SVM system (which combines only properties (1) and (2))

$$ \begin{bmatrix}K + \lambda \mathbb{I} & 1_{n\times 1} \\ 1_{1\times n} & 0\end{bmatrix}\cdot \begin{bmatrix}\alpha \\ b\end{bmatrix} = \begin{bmatrix}y \\ 0\end{bmatrix} $$
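
For comparison, the standard LS-SVM system above is a single square linear solve. A minimal sketch, with illustrative names:

```python
import numpy as np

def fit_lssvm(K, y, lam):
    """Solve [[K + lam*I, 1], [1^T, 0]] [alpha; b] = [y; 0]."""
    n = len(y)
    A = np.block([[K + lam * np.eye(n), np.ones((n, 1))],
                  [np.ones((1, n)), np.zeros((1, 1))]])
    rhs = np.concatenate([y, [0.0]])
    sol = np.linalg.solve(A, rhs)
    return sol[:-1], sol[-1]  # alpha, b
```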

Can we go from our system to a simpler system similar to LS-SVM? Why is the last equation so different?

Marc VB: If you multiply the first block equation of the first system from the left by $(K^T)^{-1}$, you obtain the first block equation of the second system. If you then add up all the equations of the first block row and subtract the last (block) equation from the result, you obtain the second (block) equation of the second system (provided $\mu$ differs from 0).