# Kernel Methods

main idea: data represented as pairwise inner products rather than as points in Euclidean space

**def** kernel function

$K(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$

where $\phi$ is a feature map from the input space into some feature space

we do not need to know $\phi$ or the space explicitly, as long as they exist (Mercer's theorem below gives the criterion: every kernel matrix must be symmetric and positive semidefinite)

**def** kernel method

learning algorithm that uses only the pairwise evaluations of a kernel function rather than the data directly

**e.g.** linear kernel

$K_1(x_i, x_j) = x_i^\top x_j$

$\phi(x) = x$

**e.g.** quadratic kernel

$K_2(x_i, x_j) = (x_i^\top x_j + 1)^2$

to show it is a valid kernel, we need to find $\phi$ s.t. $\phi(x) \in \mathbb{R}^q$ for some $q \in \mathbb{N}$ and $K_2(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$

$K_2(x_i, x_j) = (\sum_k x_{ik} x_{jk} + 1) (\sum_l x_{il} x_{jl} + 1)$  
$= \sum_k \sum_l x_{ik} x_{il} x_{jk} x_{jl} + \sum_k x_{ik} x_{jk} + \sum_l x_{il} x_{jl} + 1$

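then we can see that $\phi(x_i)$ consists of $[x_{ik} x_{il}, \cdots, \sqrt{2} x_{ik}, \cdots, 1]$ (the two linear sums are equal, so each linear entry carries a factor $\sqrt{2}$), giving $q = d^2 + d + 1$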
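A minimal numerical check of this feature map, assuming NumPy; `quadratic_features` is an illustrative helper name, not something defined in these notes. The linear entries are scaled by $\sqrt{2}$ because the two linear sums in the expansion add up to $2 x_i^\top x_j$.

```python
import numpy as np

def quadratic_features(x):
    """Explicit feature map for K_2(x, z) = (x^T z + 1)^2 (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    pairwise = np.outer(x, x).ravel()   # all d^2 products x_k x_l
    linear = np.sqrt(2.0) * x           # the two linear sums merge into 2 * x^T z
    return np.concatenate([pairwise, linear, [1.0]])

rng = np.random.default_rng(0)
x, z = rng.normal(size=3), rng.normal(size=3)

k_direct = (x @ z + 1.0) ** 2                                  # kernel evaluation
k_features = quadratic_features(x) @ quadratic_features(z)     # inner product in feature space
print(np.isclose(k_direct, k_features))
```

For $d = 3$ the feature vector has $d^2 + d + 1 = 13$ entries, while the kernel evaluation only needs one inner product in $\mathbb{R}^3$.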

**e.g.** polynomial kernel

$K_p(x_i, x_j) = (x_i^\top x_j + 1)^p$

**e.g.** polynomial kernel of infinite dimension

$K(x_i, x_j) = e^{x_i^\top x_j}$

can use taylor series to show that this is a kernel

$K(x_i, x_j) = \sum_{k=0}^\infty \frac{1}{k!} (x_i^\top x_j)^k$

So $\phi : \mathbb{R}^d \to \mathbb{R}^\infty$

**theorem**

let $K(\cdot, \cdot)$ be a kernel function with feature representation $\phi(\cdot)$  
let $c \in \mathbb{R}_+$

then $c K(\cdot, \cdot)$ is also a kernel function and has representation function $\sqrt{c} \phi(\cdot)$

**e.g.** RBF/gaussian/squared exponential/heat kernel

$K(x_i, x_j) = e^{-\frac{|x_i - x_j|^2}{s^2}}$

to show that it is a kernel, we can expand:

$K(x_i, x_j) = e^{-|x_i|^2 / s^2} e^{-|x_j|^2 / s^2} e^{2 x_i^\top x_j / s^2}$

the last factor is the infinite-dimensional polynomial kernel $e^{x_i^\top x_j}$ with its argument scaled by $2 / s^2$, and it is multiplied by factors that depend only on $x_i$ or only on $x_j$, so the representation function can be written as

$\phi(x) = e^{-|x|^2 / s^2} \phi_p(\sqrt{2} x / s)$  
where $\phi_p$ is the representation function of the infinite-dimensional polynomial kernel $e^{x_i^\top x_j}$
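The factorization can be checked numerically. The sketch below assumes NumPy and truncates the Taylor series of the exponential factor after a fixed number of terms; the function names and the test points are illustrative choices only.

```python
import math
import numpy as np

def rbf(x, z, s):
    """RBF kernel exp(-|x - z|^2 / s^2), in the parameterization used above."""
    return np.exp(-np.sum((x - z) ** 2) / s**2)

def rbf_truncated(x, z, s, order=20):
    """Same kernel via the factorization, with the exponential cut off after `order` terms."""
    u = 2.0 * (x @ z) / s**2
    series = sum(u**k / math.factorial(k) for k in range(order + 1))
    return np.exp(-(x @ x) / s**2) * np.exp(-(z @ z) / s**2) * series

x = np.array([0.5, -1.0])
z = np.array([1.5, 0.25])
s = 1.5

exact = rbf(x, z, s)
approx = rbf_truncated(x, z, s)
print(exact, approx)
```

Each retained term of the series corresponds to one polynomial degree in the feature map, so a low `order` gives a finite-dimensional approximation of the RBF feature space.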

**def** online logistic regression

for logistic regression, $\partial_w \ell = \Phi^\top (t - y)$  
or for just one observation, $(t_i - y_i) \phi(x_i)$

batch gradient ascent on the log-likelihood $\ell$ (unit step size) has update step $w^{(k+1)} = w^{(k)} + \Phi^\top (t - y^{(k)})$

we can instead update on one observation at a time and weight each observation by a step size $\gamma_i$:

$w^{(k+1)} = w^{(k)} + \gamma_i (t_i - y_i^{(k)}) \phi(x_i)$
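A minimal sketch of the one-observation-at-a-time update, assuming NumPy, targets $t_i \in \{0, 1\}$, $y_i = \sigma(w^\top \phi(x_i))$, and a single constant weight `gamma` in place of per-observation $\gamma_i$. The toy data and the number of passes (`epochs`) are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def online_logistic_regression(Phi, t, gamma=0.1, epochs=20):
    """Repeated updates w <- w + gamma * (t_i - y_i) * phi(x_i), one row at a time."""
    w = np.zeros(Phi.shape[1])
    for _ in range(epochs):
        for phi_i, t_i in zip(Phi, t):
            y_i = sigmoid(phi_i @ w)              # current prediction for observation i
            w = w + gamma * (t_i - y_i) * phi_i   # single-observation gradient step
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
t = (X[:, 0] + X[:, 1] > 0).astype(float)    # toy labels in {0, 1}
Phi = np.hstack([np.ones((200, 1)), X])      # phi(x) = [1, x]

w = online_logistic_regression(Phi, t)
accuracy = np.mean((sigmoid(Phi @ w) > 0.5) == (t == 1))
print(accuracy)
```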

**def** perceptron

redefine $t_i \in \{-1, +1\}$

initialize $w = 0$

for $i \in 1 .. n$:

1. predict $\hat{t}_i = \text{sign}(w^\top \phi(x_i))$
2. if $\hat{t}_i \neq t_i$, then update $w \leftarrow w + t_i \phi(x_i)$
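The loop above can be sketched as follows, assuming NumPy. The `epochs` argument (repeated passes over the data) is an addition for the toy example; `epochs=1` matches the single pass written in the notes. The data are constructed to be separable through the origin with a margin, so no bias feature is used.

```python
import numpy as np

def perceptron(Phi, t, epochs=1):
    """Primal perceptron with targets in {-1, +1}."""
    w = np.zeros(Phi.shape[1])
    for _ in range(epochs):
        for phi_i, t_i in zip(Phi, t):
            t_hat = np.sign(w @ phi_i)    # np.sign(0) is 0, which counts as a mistake
            if t_hat != t_i:
                w = w + t_i * phi_i       # update only on misclassified observations
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X = X[np.abs(X[:, 0] + X[:, 1]) > 1.0]    # keep a margin around the true boundary
t = np.sign(X[:, 0] + X[:, 1])

w = perceptron(X, t, epochs=100)
print(np.all(np.sign(X @ w) == t))
```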

**def** dual perceptron

let $w = \sum_i \alpha_i t_i \phi(x_i)$  
where $\alpha_i$ is the number of times updates were made on observation $i$

then the algorithm becomes:

initialize $\alpha_i = 0$ $\forall i \leq n$ (so $w = 0$ implicitly; $w$ itself is never formed)

for $i \in 1 ..n$:

1. $\hat{t}_i = \text{sign}((\sum_k \alpha_k t_k \phi(x_k))^\top \phi(x_i))$  
$= \text{sign} \bigg(\sum_k \alpha_k t_k K(x_k, x_i) \bigg)$
2. if $\hat{t}_i \neq t_i$, then update $\alpha_i \leftarrow \alpha_i + 1$
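A small sketch of the dual form, assuming NumPy and a precomputed kernel matrix. The XOR-style data, the quadratic kernel, and the `epochs` argument are illustrative choices: the four points are not linearly separable in $\mathbb{R}^2$, but they are in the quadratic feature space, which the algorithm only touches through $K$.

```python
import numpy as np

def kernel_perceptron(K, t, epochs=1):
    """Dual perceptron: alpha[i] counts the updates made on observation i."""
    alpha = np.zeros(len(t))
    for _ in range(epochs):
        for i in range(len(t)):
            t_hat = np.sign((alpha * t) @ K[:, i])   # sign(sum_k alpha_k t_k K(x_k, x_i))
            if t_hat != t[i]:
                alpha[i] += 1
    return alpha

# XOR-style labels: the sign of the product of the two coordinates.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
K = (X @ X.T + 1.0) ** 2                 # quadratic kernel matrix

alpha = kernel_perceptron(K, t, epochs=10)
predictions = np.sign(K @ (alpha * t))
print(alpha, predictions)
```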

**def** nearest neighbors classifier

given a training set in $\mathbb{R}^d$, for a new point $z \in \mathbb{R}^d$, classify according to nearest point to $z$ in the training set

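note that $|x_i - z|^2 = |x_i|^2 + |z|^2 - 2 x_i^\top z$  
$= K(x_i, x_i) + K(z, z) - 2 K(x_i, z)$  
for the linear kernel

so the distances depend on the data only through kernel evaluations, and this is a kernel method; replacing the linear kernel with another kernel gives nearest neighbors in that kernel's feature space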
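A minimal sketch of nearest neighbors written purely in terms of kernel evaluations, assuming NumPy. The function names and the three training points are illustrative.

```python
import numpy as np

def kernel_nearest_neighbor(kernel, X_train, t_train, z):
    """Label of the training point closest to z in the kernel's feature space."""
    d2 = np.array([kernel(x, x) + kernel(z, z) - 2.0 * kernel(x, z) for x in X_train])
    return t_train[np.argmin(d2)]

def linear_kernel(x, z):
    return x @ z

def rbf_kernel(x, z, s=1.0):
    return np.exp(-np.sum((x - z) ** 2) / s**2)

X_train = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 0.0]])
t_train = np.array([0, 1, 2])
z = np.array([1.8, 1.5])

label_linear = kernel_nearest_neighbor(linear_kernel, X_train, t_train, z)
label_rbf = kernel_nearest_neighbor(rbf_kernel, X_train, t_train, z)
print(label_linear, label_rbf)
```

With the RBF kernel the feature-space distance is $2 - 2 K(x_i, z)$, which is increasing in the Euclidean distance, so both kernels pick the same neighbor here.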

**e.g.** regularized linear regression

$\arg\min_w \frac{1}{2} \sum_i (w^\top \phi(x_i) - t_i)^2 + \frac{1}{2} \lambda |w|^2$

we can solve this by taking the gradient w.r.t. $w$ and setting to 0, and we get an estimate for $w$:

$0 = \sum_i (w^\top \phi(x_i) - t_i) \phi(x_i) + \lambda w$  
$\implies w = -\sum_i \lambda^{-1} (w^\top \phi(x_i) - t_i) \phi(x_i)$  
$= \Phi^\top a$  
where $a = - \lambda^{-1} (\Phi w - t)$

then we get $a = -\frac{1}{\lambda} (\Phi \Phi^\top a - t)$  
$\implies a = (\lambda I + \Phi \Phi^\top)^{-1} t$

then $w = \Phi^\top (\lambda I + \Phi \Phi^\top)^{-1} t$

and for a new point $x$, $\hat{t} = \phi(x)^\top w = \phi(x)^\top \Phi^\top (\lambda I + \Phi \Phi^\top)^{-1} t$  
$= k(x)^\top (\lambda I + K)^{-1} t$  
where $k(x)$ is a vector with entries $K(x, x_i)$ and $x_i$ are observations from the training set

so predictions can be computed from kernel values alone, without ever forming $\phi$ or $w$
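A quick numerical check that the kernel form agrees with solving for $w$ directly, assuming NumPy and random toy data. The primal closed form $w = (\lambda I + \Phi^\top \Phi)^{-1} \Phi^\top t$ used for comparison follows from the same stationarity condition but is not written out in the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, lam = 20, 3, 0.5
Phi = rng.normal(size=(n, q))     # rows are phi(x_i)
t = rng.normal(size=n)
phi_new = rng.normal(size=q)      # phi(x) for a new point

# Primal: solve (lam I + Phi^T Phi) w = Phi^T t, then predict phi(x)^T w.
w = np.linalg.solve(lam * np.eye(q) + Phi.T @ Phi, Phi.T @ t)
pred_primal = phi_new @ w

# Dual: only kernel values are used.
K = Phi @ Phi.T                   # K_ij = K(x_i, x_j)
k_new = Phi @ phi_new             # entries K(x, x_i)
a = np.linalg.solve(lam * np.eye(n) + K, t)
pred_dual = k_new @ a

print(pred_primal, pred_dual)
```

The primal solve is $q \times q$ and the dual solve is $n \times n$, so the dual form is the useful one when the feature space is large or infinite-dimensional.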

**theorem** Mercer's theorem

$K(\cdot, \cdot)$ is a kernel function iff for every finite sample $X_1, ..., X_n$, the kernel matrix $K \in \mathbb{R}^{n \times n}$ with $K_{ij} = K(X_i, X_j)$ is symmetric and positive semidefinite

*proof*

left to right (kernel function $\implies$ symmetric positive semidefinite kernel matrix):

we know that $K(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$, so $K_{ij} = K_{ji}$  
let $c$ be any vector in $\mathbb{R}^n$  
$c^\top K c = \sum_i \sum_j c_i c_j K(x_i, x_j)$  
$= \sum_i \sum_j c_i c_j \phi(x_i)^\top \phi(x_j)$  
$= (\sum_i c_i \phi(x_i)^\top) (\sum_j c_j \phi(x_j))$  
$= |\sum_i c_i \phi(x_i)|^2 \geq 0$

right to left:

TBD
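The proved direction can be illustrated numerically: a kernel matrix built from a valid kernel should be symmetric with no negative eigenvalues, up to floating-point error. This is a sketch assuming NumPy, an RBF kernel, and random sample points.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
s = 1.0

# RBF kernel matrix K_ij = exp(-|x_i - x_j|^2 / s^2).
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / s**2)

eigenvalues = np.linalg.eigvalsh(K)   # eigenvalues of a symmetric matrix
c = rng.normal(size=30)
quadratic_form = c @ K @ c            # c^T K c for one arbitrary c

print(eigenvalues.min(), quadratic_form)
```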

**theorem** representer theorem

given an optimization problem $\arg\min_w \sum_i l(w^\top \phi(x_i), t_i) + \lambda R(|w|)$ where $R$ is monotonically nondecreasing

the solution has the form $\hat{w} = \sum_k \gamma_k \phi(x_k)$

## Ordinal regression

labels are still called $\{1, 2, ..., k\}$, but we think of them as levels that can be compared

parameters: scale $\alpha > 0$ and thresholds $\phi_1, ..., \phi_{k-1}$ (scalars, not to be confused with the feature map $\phi(x)$)  
$-\infty = \phi_0 < \phi_1 < \cdots < \phi_{k-1} < \phi_k = \infty$

$P(t_i = j) = \sigma(\alpha (\phi_j - a_i)) - \sigma(\alpha(\phi_{j-1} - a_i))$  
if we denote $y_{ij} = \sigma(\alpha(\phi_j - a_i))$, then  
$P(t_i = j) = y_{ij} - y_{i, j-1}$

$a_i = w^\top \phi(x_i)$ is a shifting parameter and $\alpha$ is a scaling parameter

### Likelihood estimation

$L = \prod_{i=1}^n \prod_{j=1}^k (y_{ij} - y_{i, j-1})^{t_{ij}}$, where $t_{ij} = 1$ if $t_i = j$ and $0$ otherwise

$\ell = \sum_i \sum_j t_{ij} (\log(y_{ij} - y_{i, j-1}))$

$\nabla_w \ell = -\sum_i \sum_j t_{ij} \frac{y_{ij} (1 - y_{ij})\alpha \phi(x_i) - y_{i, j-1} (1 - y_{i, j-1}) \alpha \phi(x_i)}{y_{ij} - y_{i, j-1}}$  
$= \sum_i \sum_j t_{ij} \phi(x_i) \alpha (y_{ij} + y_{i, j-1} - 1)$  
$= \Phi^\top d$  
where $d_i = \sum_j \alpha t_{ij} (y_{ij} + y_{i, j-1} - 1)$  
$= \alpha( y_{i, t_i} + y_{i, t_i - 1} - 1)$

$H = \nabla_w \nabla_w^\top \ell = -\alpha \sum_i \sum_j t_{ij} \phi(x_i) (y_{ij} (1 - y_{ij}) \alpha \phi(x_i)^\top + y_{i, j-1} (1 - y_{i, j-1}) \alpha \phi(x_i)^\top)$  
$= -\alpha^2 \sum_i \sum_j t_{ij} (y_{ij} (1 - y_{ij}) + y_{i, j-1} (1 - y_{i, j-1})) \phi(x_i) \phi(x_i)^\top$

letting $r_i = \alpha^2 \sum_j t_{ij} (y_{ij} (1 - y_{ij}) + y_{i, j-1} (1 - y_{i, j-1}))$  
$= \alpha^2 (y_{i, t_i} (1 - y_{i, t_i}) + y_{i, t_i - 1} (1 - y_{i, t_i - 1}))$  
and $R = \text{diag}(r_1, ..., r_n)$, we can rewrite  
$H = -\Phi^\top R \Phi$
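The gradient and Hessian can be sanity-checked against finite differences. The sketch below assumes NumPy, $a_i = w^\top \phi(x_i)$, and the convention $r_i = \alpha^2 (y_{i, t_i} (1 - y_{i, t_i}) + y_{i, t_i - 1} (1 - y_{i, t_i - 1})) \geq 0$ with $H = -\Phi^\top R \Phi$. The thresholds are called `cuts` in the code to avoid clashing with the feature map, and all data are random toy values.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cumulative(w, Phi, cuts, alpha):
    """y[i, j] = sigma(alpha * (phi_j - a_i)) for j = 0..k, assuming a_i = w^T phi(x_i)."""
    padded = np.concatenate([[-np.inf], cuts, [np.inf]])   # phi_0 = -inf, phi_k = +inf
    return sigmoid(alpha * (padded[None, :] - (Phi @ w)[:, None]))

def log_likelihood(w, Phi, labels, cuts, alpha):
    y = cumulative(w, Phi, cuts, alpha)
    rows = np.arange(len(labels))
    return np.sum(np.log(y[rows, labels] - y[rows, labels - 1]))

def gradient(w, Phi, labels, cuts, alpha):
    y = cumulative(w, Phi, cuts, alpha)
    rows = np.arange(len(labels))
    d = alpha * (y[rows, labels] + y[rows, labels - 1] - 1.0)   # d_i from the notes
    return Phi.T @ d

def hessian(w, Phi, labels, cuts, alpha):
    y = cumulative(w, Phi, cuts, alpha)
    rows = np.arange(len(labels))
    hi, lo = y[rows, labels], y[rows, labels - 1]
    r = alpha**2 * (hi * (1.0 - hi) + lo * (1.0 - lo))          # r_i >= 0
    return -Phi.T @ (r[:, None] * Phi)                          # H = -Phi^T R Phi

rng = np.random.default_rng(0)
Phi = rng.normal(size=(15, 2))
labels = rng.integers(1, 5, size=15)     # levels 1..4
cuts = np.array([-1.0, 0.0, 1.0])
alpha, w = 1.5, rng.normal(size=2)

# Central finite differences of the log-likelihood and of the gradient.
eps = 1e-6
numeric_grad = np.array([
    (log_likelihood(w + eps * e, Phi, labels, cuts, alpha)
     - log_likelihood(w - eps * e, Phi, labels, cuts, alpha)) / (2 * eps)
    for e in np.eye(2)
])
numeric_hess = np.array([
    (gradient(w + eps * e, Phi, labels, cuts, alpha)
     - gradient(w - eps * e, Phi, labels, cuts, alpha)) / (2 * eps)
    for e in np.eye(2)
])
print(np.allclose(gradient(w, Phi, labels, cuts, alpha), numeric_grad, atol=1e-5))
```

Because every $r_i \geq 0$, $H$ is negative semidefinite, so the log-likelihood is concave in $w$ for fixed thresholds and scale.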

## Gaussian processes

**def** gaussian process

a distribution over functions $f$ on a common domain in $\mathbb{R}$, specified by:

* mean function $m(x)$, $m : \mathbb{R} \to \mathbb{R}$
* covariance function $c(x_1, x_2)$, $c : \mathbb{R}^2 \to \mathbb{R}$ (symmetric positive semidefinite; individual values may be negative)

for any finite set of points in the domain $\mathbb{X} = \{x_1, ..., x_n\}$,  
$f(\mathbb{X}) = \begin{bmatrix} f(x_1) \\ \vdots \\ f(x_n) \end{bmatrix} \sim \mathcal{N}_n \Bigg(\begin{bmatrix} m(x_1) \\ \vdots \\ m(x_n) \end{bmatrix}, C \Bigg)$  
where $C_{ij} = c(x_i, x_j)$

**e.g.** bayesian linear regression is a special case of gaussian processes

$w \sim \mathcal{N}(0, \alpha^{-1} I)$  
$t \mid w \sim \mathcal{N}(\Phi w, \beta^{-1} I)$  
$t_i \mid w \sim \mathcal{N}(w^\top \phi(x_i), \beta^{-1})$

then the marginal $t \sim \mathcal{N}(0, \beta^{-1} I + \alpha^{-1} K)$  
where $K = \Phi \Phi^\top$

for linear regression, we use the linear kernel
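The marginal can be verified in a few lines. This is a short derivation, writing $\epsilon$ for the observation noise (a symbol introduced here, not used elsewhere in the notes):

$t = \Phi w + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \beta^{-1} I)$ independent of $w$  
$E[t] = \Phi E[w] + E[\epsilon] = 0$  
$\text{Cov}(t) = \Phi \, \text{Cov}(w) \, \Phi^\top + \text{Cov}(\epsilon) = \alpha^{-1} \Phi \Phi^\top + \beta^{-1} I$

and $t$ is gaussian because it is a linear function of jointly gaussian variables.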

**e.g.** prediction via gaussian processes

let $t$ be a vector of observed responses $\begin{bmatrix} t_1 \\ \vdots \\ t_n \end{bmatrix}$

let $\hat{t}$ be $t$ extended with a new entry $t_{n+1}$, the response to be predicted at a new input $x_{n+1}$

$\hat{t} = \begin{bmatrix} t_{n+1} \\ t_1 \\ \vdots \\ t_n \end{bmatrix} \sim \mathcal{N} \Bigg(0, \begin{bmatrix} c & v^\top \\ v & C \end{bmatrix} \Bigg)$

where $C$ is the covariance matrix of the original sample,  
$c = c(x_{n+1}, x_{n+1})$  
$v = \begin{bmatrix} c(x_1, x_{n+1}) \\ \vdots \\ c(x_n, x_{n+1}) \end{bmatrix}$

then $t_{n+1} \mid t \sim \mathcal{N}(v^\top C^{-1} t, c - v^\top C^{-1} v)$

note that 
$v^\top C^{-1} t = t^\top C^{-1} v = \sum_i \gamma_i c(x_i, x_{n+1})$  
where $\gamma_i$ is the $i^{th}$ entry of $C^{-1} t$

we can replace the covariance with any kernel function
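A minimal sketch of the predictive mean and variance, assuming NumPy, one-dimensional inputs, an RBF kernel, and a noisy covariance of the form $C = \beta^{-1} I + \alpha^{-1} K$. The default hyperparameter values and the toy data are arbitrary illustrative choices.

```python
import numpy as np

def rbf_kernel(a, b, s=1.0):
    """K(x_i, x_j) = exp(-(x_i - x_j)^2 / (2 s^2)) for 1-D inputs."""
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2.0 * s**2))

def gp_predict(x_train, t, x_new, alpha=1.0, beta=25.0, s=1.0):
    """Mean and variance of t_{n+1} | t, assuming C = I / beta + K / alpha."""
    C = np.eye(len(x_train)) / beta + rbf_kernel(x_train, x_train, s) / alpha
    v = rbf_kernel(x_train, np.array([x_new]), s)[:, 0] / alpha
    c = 1.0 / beta + 1.0 / alpha           # K(x, x) = 1 for the RBF kernel
    gamma = np.linalg.solve(C, t)          # entries gamma_i of C^{-1} t
    mean = v @ gamma                       # v^T C^{-1} t
    variance = c - v @ np.linalg.solve(C, v)
    return mean, variance

x_train = np.linspace(0.0, 5.0, 10)
t = np.sin(x_train)                        # toy responses

mean, variance = gp_predict(x_train, t, x_new=2.5)
prior_variance = 1.0 / 25.0 + 1.0          # c for the default hyperparameters
print(mean, variance)
```

Using `np.linalg.solve` avoids forming $C^{-1}$ explicitly; the predictive variance is always smaller than the prior variance $c$ because $v^\top C^{-1} v > 0$.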

### Model selection

evidence has closed form $t \sim \mathcal{N}(0, C)$  
can write as a function of hyperparameters  
then to find estimates for hyperparameters, maximize the evidence (or log evidence): take its derivative w.r.t. the hyperparameters and set to 0, or follow the gradient numerically when there is no closed-form solution

**e.g.** RBF kernel

* $K(x_i, x_j) = e^{-\frac{1}{2s^2} (x_i - x_j)^2}$
* let $C = \beta^{-1} I + \alpha^{-1} K$

then take the derivatives w.r.t. $\alpha$, $\beta$, and $s$
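For reference, these derivatives can be written out using the standard matrix identities for $\log |C|$ and $C^{-1}$. Here $\theta$ stands for any one hyperparameter (a symbol introduced for this sketch):

$\log p(t) = -\frac{1}{2} \log |C| - \frac{1}{2} t^\top C^{-1} t - \frac{n}{2} \log 2\pi$  
$\frac{\partial}{\partial \theta} \log p(t) = -\frac{1}{2} \text{tr} \Big(C^{-1} \frac{\partial C}{\partial \theta} \Big) + \frac{1}{2} t^\top C^{-1} \frac{\partial C}{\partial \theta} C^{-1} t$

with, for $C = \beta^{-1} I + \alpha^{-1} K$:

* $\frac{\partial C}{\partial \alpha} = -\alpha^{-2} K$
* $\frac{\partial C}{\partial \beta} = -\beta^{-2} I$
* $\frac{\partial C}{\partial s} = \alpha^{-1} \frac{\partial K}{\partial s}$, where $\Big(\frac{\partial K}{\partial s}\Big)_{ij} = K_{ij} \frac{(x_i - x_j)^2}{s^3}$

Setting these to 0 generally has no closed-form solution in $s$, so in practice the gradient is usually passed to a numerical optimizer.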