# Motivation  
## The Input Space $X$  
- Generally, no assumptions on $X$  
- But the specific methods we have developed so far assume $X = R^d$:  
  - Ridge Regression
  - Lasso Regression
  - Linear SVM  
- Our hypothesis space is $H = \{x \mapsto w^Tx + b \mid w \in R^d, b\in R\}$  
- What if the input space is not $R^d$?  

# Feature Extraction  
- Definition  
Mapping an input space $X$ to $R^d$ is called feature extraction or featurization.  

- Geometric Example: Two class problem, nonlinear boundary
<div align="center"><img src = "./nolinear_boundary.jpg" width = '500' height = '100' align = center /></div>  

  - With the linear feature map $\psi(x) = (x_1,x_2)$ and linear models, we can't separate the regions  
  - With an appropriate nonlinearity, $\psi(x) = (x_1, x_2, x_1^2 + x_2 ^2)$, a linear model can separate them  
  - Consider a linear hypothesis space with a feature map $\phi: X \to R^d$  
  $$F = \{f(x) = w^T\phi(x) \}$$  
<div align="center"><img src = "./predictor.jpg" width = '500' height = '100' align = center /></div>   
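
A minimal sketch of the geometric example, assuming (this is not stated in the notes) that the nonlinear boundary is the unit circle. The toy points, labels, and the choice of `w` and `b` are hypothetical; the only thing taken from the text is the feature map $\psi(x) = (x_1, x_2, x_1^2 + x_2^2)$.

```python
import numpy as np

def psi(x):
    """Feature map psi(x) = (x1, x2, x1^2 + x2^2) from the example above."""
    x = np.asarray(x, dtype=float)
    return np.array([x[0], x[1], x[0] ** 2 + x[1] ** 2])

# Hypothetical toy data: class -1 lies inside the unit circle, class +1 outside.
points = np.array([[0.1, 0.2], [-0.5, 0.3], [0.0, -0.6],
                   [1.5, 0.0], [-1.2, 1.1], [0.4, -2.0]])
labels = np.array([-1, -1, -1, 1, 1, 1])

# In feature space, the circle x1^2 + x2^2 = 1 becomes the hyperplane
# w^T psi(x) + b = 0 with w = (0, 0, 1) and b = -1.
w = np.array([0.0, 0.0, 1.0])
b = -1.0

scores = np.array([w @ psi(x) + b for x in points])
predictions = np.sign(scores).astype(int)
print(predictions)
```

The classifier is linear in $\psi(x)$ even though its decision boundary is a circle in the original coordinates.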

## Linear Models Need Big Feature Spaces
- To get expressive hypothesis spaces using linear models, need high-dimensional feature spaces   
- Suppose we start with $x = (1,x_1,...,x_d)\in R^{d+1} = X$.  
- We want to add all monomials of degree $M$: $x_{1}^{p_{1}} \cdots x_{d}^{p_{d}}$, with $p_{1}+\cdots+p_{d}=M$  
- How many features will we end up with?   
- $\left(\begin{array}{c}M+d-1 \\ M\end{array}\right)$
- For $d = 40$, $M = 8$, we get 314457495 features  
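
The count above can be checked numerically. The sketch below evaluates the binomial formula and, for a small case, compares it with a brute-force enumeration; the function names are hypothetical. The figure 314457495 corresponds to $d = 40$, $M = 8$.

```python
from itertools import combinations_with_replacement
from math import comb

def num_monomials(d, M):
    """Number of monomials x_1^{p_1} ... x_d^{p_d} with p_1 + ... + p_d = M."""
    return comb(M + d - 1, M)

def num_monomials_brute_force(d, M):
    """Count the same monomials by enumerating multisets of M variable indices."""
    return sum(1 for _ in combinations_with_replacement(range(d), M))

# Small case: formula and enumeration agree.
print(num_monomials(3, 4), num_monomials_brute_force(3, 4))

# The large case from the text (far too many to enumerate comfortably).
print(num_monomials(40, 8))
```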

## Big Feature Spaces
Very large feature spaces have two problems:  
- Overfitting  
- Memory and computational cost  

Kernel methods can help with the memory and computational cost problem.  

# Kernel Methods  

## Motivation  
The featurized SVM prediction function is the solution to
$$\min _{w \in \mathbf{R}^{d}, b \in \mathbf{R}} \frac{1}{2}\|w\|^{2}+\frac{c}{n} \sum_{i=1}^{n}\left(1-y_{i}\left[w^{T} \psi\left(x_{i}\right)+b\right]\right)_{+}$$  

It is equivalent to solve the dual problem to get $\alpha^*$:

$$\begin{array}{ll}
\sup _{\alpha} \sum_{i=1}^{n} \alpha_{i}-\frac{1}{2} \sum_{i, j=1}^{n} \alpha_{i} \alpha_{j} y_{i} y_{j} \psi\left(x_{j}\right)^{T} \psi\left(x_{i}\right) \\
\text { s.t. }  \sum_{i=1}^{n} \alpha_{i} y_{i}=0 \\
 \alpha_{i} \in\left[0, \frac{c}{n}\right] i=1, \ldots, n
\end{array}$$  

## Some Methods Can Be “Kernelized”
- Definition  
A method is **kernelized** if inputs only appear inside inner products: $\left\langle\psi(x), \psi\left(x^{\prime}\right)\right\rangle$ for $x, x^{\prime} \in X$  
- The kernel function corresponding to $\psi$ and inner product is  
$$k\left(x, x^{\prime}\right)=\left\langle\psi(x), \psi\left(x^{\prime}\right)\right\rangle$$  
- Why introduce this new notation $k(x,x')$?  
- We can often evaluate $k(x,x')$ directly, without explicitly computing $\psi(x)$ and $\psi(x')$
- For large feature spaces, this can be much faster

## Kernel Evaluation Can Be Fast
- Example:  
Quadratic feature map for $x = (x_1, ... , x_d) \in R^d$  
$$\phi(x)=\left(x_{1}, \ldots, x_{d}, x_{1}^{2}, \ldots, x_{d}^{2}, \sqrt{2} x_{1} x_{2}, \ldots, \sqrt{2} x_{i} x_{j}, \ldots \sqrt{2} x_{d-1} x_{d}\right)^{T}$$  
has dimension $O(d^2)$, but for any $x, x' \in R^d$,  
$$k\left(x, x^{\prime}\right)=\left\langle\phi(x), \phi\left(x^{\prime}\right)\right\rangle=\left\langle x, x^{\prime}\right\rangle+\left\langle x, x^{\prime}\right\rangle^{2}$$  
Naive explicit computation of $k\left(x, x^{\prime}\right)$: $O\left(d^{2}\right)$   

Implicit computation of $k\left(x, x^{\prime}\right): O\left(d\right)$   
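
As an illustration, the sketch below builds the quadratic feature map explicitly and compares the $O(d^2)$ inner product with the $O(d)$ kernel evaluation. The function names `phi` and `quadratic_kernel` and the random test vectors are assumptions made for this example.

```python
import numpy as np
from itertools import combinations

def phi(x):
    """Explicit quadratic feature map: linear terms, squares, sqrt(2)-scaled cross terms."""
    x = np.asarray(x, dtype=float)
    cross = [np.sqrt(2.0) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([x, x ** 2, np.array(cross)])

def quadratic_kernel(x, xp):
    """Implicit evaluation: <x, x'> + <x, x'>^2, using a single inner product in R^d."""
    s = float(np.dot(x, xp))
    return s + s ** 2

rng = np.random.default_rng(0)
x, xp = rng.normal(size=5), rng.normal(size=5)

explicit = float(phi(x) @ phi(xp))   # works in the O(d^2)-dimensional feature space
implicit = quadratic_kernel(x, xp)   # works in the original d-dimensional space
print(len(phi(x)), explicit, implicit)
```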

## Kernels as Similarity Scores
- Often useful to think of the kernel function as a **similarity score**  
- But this is not a mathematically precise statement  
- There are many ways to design a similarity score.  
  - We will use Mercer kernels, which correspond to inner products in some feature space. 
  - This has many mathematical benefits.

## What are the Benefits of Kernelization?
- Computational (e.g. when the feature space dimension $d$ is larger than the sample size $n$).  
- Access to infinite-dimensional feature spaces  
- Allows thinking in terms of “similarity” rather than features.

## Example: SVM  
Recall the SVM dual optimization problem for training set $(x_1,y_1),...,(x_n,y_n)$:  
$$\begin{array}{ll}
\sup _{\alpha} \sum_{i=1}^{n} \alpha_{i}-\frac{1}{2} \sum_{i, j=1}^{n} \alpha_{i} \alpha_{j} y_{i} y_{j} x_{j}^{T} x_{i} \\
\text { s.t. }  \sum_{i=1}^{n} \alpha_{i} y_{i}=0 \\
 \alpha_{i} \in\left[0, \frac{c}{n}\right] i=1, \ldots, n
\end{array}$$    
We can replace $x_j^Tx_i$ with $k(x_j, x_i)$.

## Linear Kernel  
- Input Space: $X = R^d$  
- Feature Space: $H = R^d$  
- Feature map: $\psi(x) = x$  
- kernel: $k(x,x') = x^Tx'$  

### The kernel matrix (Gram Matrix)  
- Definition
For points $x_1, ... ,x_n \in X$ and an inner product on $X$, the kernel matrix is defined as:  
$$K=\left(\left\langle x_{i}, x_{j}\right\rangle\right)_{i, j}=\left(\begin{array}{ccc}
\left\langle x_{1}, x_{1}\right\rangle & \cdots & \left\langle x_{1}, x_{n}\right\rangle \\
\vdots & \ddots & \vdots \\
\left\langle x_{n}, x_{1}\right\rangle & \cdots & \left\langle x_{n}, x_{n}\right\rangle
\end{array}\right)$$  
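Then for the standard Euclidean inner product $\left\langle x_{i}, x_{j}\right\rangle=x_{i}^{T} x_{j}$, writing $X \in R^{n \times d}$ for the design matrix whose $i$-th row is $x_i^T$ (not the input space), we have  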
$$K = XX^T$$  
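
A small sketch of this identity, assuming a randomly generated design matrix `X` whose rows are the data points (the data and variable names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 3
X = rng.normal(size=(n, d))  # design matrix: row i is x_i^T

# Kernel (Gram) matrix for the linear kernel, computed in one matrix product.
K = X @ X.T

# The same matrix built entry by entry from the definition K_ij = <x_i, x_j>.
K_loop = np.array([[X[i] @ X[j] for j in range(n)] for i in range(n)])

print(K.shape)
```

Note that `K` is $n \times n$ regardless of the dimension $d$.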

### SVM Dual with Kernel Matrix
$$\begin{array}{ll}
\sup _{\alpha} \sum_{i=1}^{n} \alpha_{i}-\frac{1}{2} \sum_{i, j=1}^{n} \alpha_{i} \alpha_{j} y_{i} y_{j} K_{j i} \\
\text { s.t. }  \sum_{i=1}^{n} \alpha_{i} y_{i}=0 \\
 \alpha_{i} \in\left[0, \frac{c}{n}\right] i=1, \ldots, n
\end{array}$$  

- Once our algorithm works with kernel matrices, we can change kernel just by changing the matrix  
- Size of matrix: $n×n$, where $n$ is the number of data points  
- Recall with ridge regression, we worked with $X^TX$, which is $d×d$, where $d$ is feature space dimension. 

## Some Nonlinear Kernels  

### Quadratic Kernel in $R^d$  
- Input space: $X = R^d$
- Feature space: $H = R^D$, where $D = 2d + \left(\begin{array}{l}d \\ 2\end{array}\right) \approx d^{2} / 2$ ($d$ linear terms, $d$ squares, and $\left(\begin{array}{l}d \\ 2\end{array}\right)$ cross terms)  
- Feature map:  
$$\phi(x)=\left(x_{1}, \ldots, x_{d}, x_{1}^{2}, \ldots, x_{d}^{2}, \sqrt{2} x_{1} x_{2}, \ldots, \sqrt{2} x_{i} x_{j}, \ldots \sqrt{2} x_{d-1} x_{d}\right)^{T}$$  

- Then for all $x, x' \in R^d$, 
$$\begin{aligned}
k\left(x, x^{\prime}\right) &=\left\langle\phi(x), \phi\left(x^{\prime}\right)\right\rangle \\
&=\left\langle x, x^{\prime}\right\rangle+\left\langle x, x^{\prime}\right\rangle^{2}
\end{aligned}$$  

### Polynomial Kernel in $R^d$  
- Kernel function  
$$k\left(x, x^{\prime}\right)=\left(1+\left\langle x, x^{\prime}\right\rangle\right)^{M}$$
- For any degree $M$, evaluating the kernel costs about the same: one inner product in $R^d$ followed by a scalar power   

### Radial Basis Function (RBF) Kernel / Gaussian Kernel
  
- Input space $X= R^d$, for all $x,x' \in R^d$,  
$$k(x, x')=\exp \left(-\frac{\left\|x-x^{\prime}\right\|^{2}}{2 \sigma^{2}}\right)$$  
where $\sigma^2$ is known as the bandwidth parameter
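
The following sketch evaluates the RBF kernel on two sets of points at once. The function name `rbf_kernel_matrix` and the default `sigma=1.0` are assumptions for illustration; the formula is the one given above.

```python
import numpy as np

def rbf_kernel_matrix(A, B, sigma=1.0):
    """Return the matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    A = np.atleast_2d(np.asarray(A, dtype=float))
    B = np.atleast_2d(np.asarray(B, dtype=float))
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 <a, b>, computed for all pairs.
    sq_dists = (np.sum(A ** 2, axis=1)[:, None]
                + np.sum(B ** 2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, -1.0]])
K = rbf_kernel_matrix(X, X, sigma=1.0)
print(K)
```

Every point has kernel value 1 with itself, and the value decays toward 0 as points move apart, which matches the similarity-score intuition.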

# Kernel Tricks  

## The “Kernel Trick”
-  Given a kernelized ML algorithm  
- Can swap out the inner product for a new kernel function  
- New kernel may correspond to a high dimensional feature space.  
- Once the kernel matrix is computed, the computational cost depends on the number of data points rather than the dimension of the feature space.  

Swapping out a linear kernel for a new kernel is called the **kernel trick**.  

# Inner Product Spaces and Projections (Hilbert Space)  

## Inner Product Space (or “Pre-Hilbert” Spaces)
- An inner product space (over reals) is a vector space $V$ and an inner product, which is a mapping  
$$\langle\cdot, \cdot\rangle: \mathcal{V} \times \mathcal{V} \rightarrow \mathbf{R}$$  
that has the following properties $\forall x,y,z \in \mathcal{V}$ and $a,b \in \mathbf{R}$:  
- Symmetry: $\langle x,y\rangle = \langle y,x\rangle$  
- Linearity: $\langle ax+by, z\rangle = a\langle x,z\rangle + b\langle y,z\rangle$  
- Positive-definiteness: $\langle x,x\rangle \geqslant 0$, and $\langle x,x\rangle = 0 \Leftrightarrow x = 0$  

## Norm from Inner Product
For an inner product space, we define a norm as   
$$\|x\|=\sqrt{\langle x, x\rangle}$$  

- Theorem (Parallelogram Law)   
A norm $\|\cdot\|$ can be written in terms of an inner product on $V$ iff $\forall x,x' \in V$,  
$$2\|x\|^{2}+2\left\|x^{\prime}\right\|^{2}=\left\|x+x^{\prime}\right\|^{2}+\left\|x-x^{\prime}\right\|^{2}$$  
and if it can, the inner product is given by the **polarization identity**  
$$\left\langle x, x^{\prime}\right\rangle=\frac{\|x\|^{2}+\left\|x^{\prime}\right\|^{2}-\left\|x-x^{\prime}\right\|^{2}}{2}$$  
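
A quick numerical check of both identities for the Euclidean norm, together with a counterexample showing that the $l_1$ norm violates the parallelogram law (so it cannot come from an inner product). The vectors are arbitrary choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
x, xp = rng.normal(size=4), rng.normal(size=4)
norm = np.linalg.norm  # Euclidean norm by default

# Parallelogram law for the Euclidean norm.
lhs = 2 * norm(x) ** 2 + 2 * norm(xp) ** 2
rhs = norm(x + xp) ** 2 + norm(x - xp) ** 2

# Polarization identity recovers the inner product from norms alone.
polar = (norm(x) ** 2 + norm(xp) ** 2 - norm(x - xp) ** 2) / 2

# The l1 norm fails the parallelogram law on the standard basis vectors of R^2.
def l1(v):
    return float(np.sum(np.abs(v)))

e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
l1_lhs = 2 * l1(e1) ** 2 + 2 * l1(e2) ** 2
l1_rhs = l1(e1 + e2) ** 2 + l1(e1 - e2) ** 2
print(lhs, rhs, polar, x @ xp, l1_lhs, l1_rhs)
```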

## Pythagorean Theorem
- Definition  
Two vectors $x, x'$ are orthogonal if $\langle x,x'\rangle = 0$. We denote this by $x ⊥ x'$  
- Definition  
$x$ is orthogonal to a set $S$, i.e. $x ⊥ S$, if $x ⊥ s$ for all $s \in S$  
- Theorem (Pythagorean Theorem)   
If $x ⊥ x'$, then $\left\|x+x^{\prime}\right\|^{2}=\|x\|^{2}+\left\|x^{\prime}\right\|^{2}$  

## Projection onto a Plane  
- Choose some $x \in V$.  
- Let $M$ be a subspace of inner product space $V$. 
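- Then $m_0$ is the projection of $x$ onto $M$ if $m_0 \in M$ and $m_0$ is the closest point to $x$ in $M$, i.e. $\|x - m_0\| \leqslant \|x - m\|$ for all $m \in M$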

## Hilbert Space
- Projections exist for all finite-dimensional inner product spaces. 
- We want to allow infinite-dimensional spaces. 
- Need an extra condition called completeness. 
- A space is complete if all Cauchy sequences in the space converge  
- Definition  
A Hilbert space is a complete inner product space.  
Any finite-dimensional inner product space is a Hilbert space.  

## The Projection Theorem
- Theorem (Classical Projection Theorem) 
   - $H$ a Hilbert space 
   - $M$ a closed subspace of $H$ (picture a hyperplane through the origin) 
   - For any $x \in H$, there exists a unique $m_0 \in M$ for which   
   $$\left\|x-m_{0}\right\| \leqslant\|x-m\| \forall m \in M$$  
   - Then $m_0$ is called the **orthogonal projection** of $x$ onto $M$.  
   - Furthermore, $m_0 \in M$ is the projection of $x$ onto $M$ iff   
   $$x-m_{0} \perp M$$  
   
## Projection Reduces Norm
- Theorem  
Let $M$ be a closed subspace of $H$. For any $x \in H$, let $m_0 =Proj_{M}x$ be the projection of $x$ onto $M$. Then   
$$\left\|m_{0}\right\| \leqslant\|x\|$$  
with equality only when $m_0 = x$.
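
In finite dimensions the projection can be computed by least squares. The sketch below, with a hypothetical random subspace $M$ spanned by the columns of `B`, checks the orthogonality condition of the projection theorem and the norm inequality above.

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.normal(size=(5, 2))   # columns span a 2-dimensional subspace M of R^5
x = rng.normal(size=5)

# m0 = B c minimizes ||x - B c|| over c, so it is the closest point to x in M.
coef, *_ = np.linalg.lstsq(B, x, rcond=None)
m0 = B @ coef
residual = x - m0

# Projection theorem: the residual is orthogonal to every spanning vector of M.
print(B.T @ residual)

# Projection reduces norm: ||m0|| <= ||x||.
print(np.linalg.norm(m0), np.linalg.norm(x))
```

The Pythagorean theorem explains the inequality: $\|x\|^2 = \|m_0\|^2 + \|x - m_0\|^2$ because $m_0 \perp x - m_0$.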


# Representer Theorem  
## Generalize from SVM Objective  
- Featurized SVM objective:
$$\min _{w \in \mathbf{R}^{d}} \frac{1}{2}\|w\|^{2}+\frac{c}{n} \sum_{i=1}^{n} \max \left(0,1-y_{i}\left[\left\langle w, \psi\left(x_{i}\right)\right\rangle\right]\right)$$
- **Generalized objective**  
$$\min _{w \in \mathcal{H}} R(\|w\|)+L\left(\left\langle w, \psi\left(x_{1}\right)\right\rangle, \ldots,\left\langle w, \psi\left(x_{n}\right)\right\rangle\right)$$  
where  
  - $R : R_{\geq 0} \to R$ is nondecreasing (Regularization term)  
  - $L : R^n \to R$ is arbitrary (Loss term)  
  - $w, \psi\left(x_{1}\right), \ldots, \psi\left(x_{n}\right) \in \mathcal{H}$ for some Hilbert space $\mathcal{H}$ (we typically have $\mathcal{H} = R^d$)
  - $\|w\|=\sqrt{\langle w, w\rangle}$
- What is linear?  
  - The prediction/score function $x \mapsto \langle w, \psi(x)\rangle$ is linear in the parameter vector $w$ and in the feature vector $\psi(x)$  
  - The important part is the **linearity in the parameter w**.
  - When we discuss neural networks, we’ll mention a “linear network” in which prediction functions are linear in the feature vector $\psi(x)$, but nonlinear in the parameter vector $w$. In other words, we have something like 
 $$\min _{w \in \mathcal{H}} R(\|w\|)+L\left(\left\langle f(w), \psi\left(x_{1}\right)\right\rangle, \ldots,\left\langle f(w), \psi\left(x_{n}\right)\right\rangle\right)$$  
  - Does the generalized objective still apply if we penalize with $\lambda \|w\|_2$ instead of $\lambda \|w\|_2^2$? Yes! $R(t) = \lambda t$ is still nondecreasing.  
  - Does it apply to lasso regression? No! The $l_1$ norm does not correspond to an inner product

## The Representer Theorem
Let  
$$J(w)=R(\|w\|)+L\left(\left\langle w, \psi\left(x_{1}\right)\right\rangle, \ldots,\left\langle w, \psi\left(x_{n}\right)\right\rangle\right)$$  
If $J(w)$ has a minimizer, then it has a minimizer of the form $w^* = \sum_{i = 1}^n \alpha_i\psi(x_i)$.  
If $R$ is strictly increasing, then all minimizers have this form.  

**Proof**  
The idea: if a minimizer $w^*$ does not lie in the span of $\{\psi(x_i)\}$, then its projection onto that span is also a minimizer.  

Let $w^*$ be a minimizer, and let $M = span(\psi(x_1),..,\psi(x_n))$ be the span of the data. Let $w = Proj_Mw^*$, so there exist $\alpha_1, \ldots, \alpha_n$ with $w = \sum_{i = 1}^n \alpha_i \psi(x_i)$, and $w^{\perp}:=w^{*}-w$ is orthogonal to $M$.  
Projections reduce norms, so $\|w\| \leqslant \|w^*\|$, and since $R$ is nondecreasing, we have $R(\|w\|) \leqslant R(\|w^*\|)$.  
Since $w^{\perp}$ is orthogonal to every $\psi(x_i)$, we have $\left\langle w^{*}, \psi\left(x_{i}\right)\right\rangle=\left\langle w+w^{\perp}, \psi\left(x_{i}\right)\right\rangle=\left\langle w, \psi\left(x_{i}\right)\right\rangle$ for each $i$, and therefore  
$$L\left(\left\langle w^{*}, \psi\left(x_{1}\right)\right\rangle, \ldots,\left\langle w^{*}, \psi\left(x_{n}\right)\right\rangle\right)=L\left(\left\langle w, \psi\left(x_{1}\right)\right\rangle, \ldots,\left\langle w, \psi\left(x_{n}\right)\right\rangle\right)$$  
Thus $J(w) \leqslant J(w^*)$, so $w$ is also a minimizer, and it lies in the span of $\{\psi(x_i)\}$.  
If $R$ is strictly increasing and $w^{\perp} \neq 0$, then $\|w\| < \|w^*\|$ gives $J(w) < J(w^*)$, contradicting the minimality of $w^*$. Hence in that case every minimizer lies in the span of $\{\psi(x_i)\}$.

## Kernelized Prediction  
- Consider $w = \sum_{i = 1}^n \alpha_i \psi(x_i)$  
- How do we make a prediction for a given $x \in X$?  
    
$$\begin{aligned}
f(x) = <w, \psi(x)>  &=  <\sum_{i = 1}^n \alpha_i \psi(x_i), \psi(x)>\\
&= \sum_{i = 1}^n \alpha_i <\psi(x_i), \psi(x)>  \\  
&= \sum_{i = 1}^n \alpha_i k(x_i, x)
\end{aligned}$$   

Note: $f(x)$ is a linear combination of $k(x_1,x),.., k(x_n, x)$, all considered as functions of $x$.   

- Write $f_{\alpha}(x) = \sum_{i = 1}^n \alpha_i k(x_i, x)$  


## Kernelized Regularization  
- What does $R(\|w\|)$ look like?  
$$
\begin{aligned}
\|w\|^2 &= <w,w> \\  
&= <\sum_{i = 1}^n \alpha_i \psi(x_i), \sum_{i = 1}^n \alpha_i \psi(x_i)> \\  
&= \sum_{i,j = 1}^n \alpha_i \alpha_j <\psi(x_i), \psi(x_j)>  \\  
&= \sum_{i,j = 1}^n \alpha_i \alpha_j k(x_i,x_j) \\  
&= \alpha^T K \alpha
\end{aligned}
$$  
where $K$ is the Kernel Matrix  
- So $R(\|w\|) = R\left(\sqrt{\alpha^T K \alpha}\right)$  
- The vector of predictions on the training points is  

$$\begin{aligned}
\left(\begin{array}{c}
f_{\alpha}\left(x_{1}\right) \\
\vdots \\
f_{\alpha}\left(x_{n}\right)
\end{array}\right) &=\left(\begin{array}{c}
\alpha_{1} k\left(x_{1}, x_{1}\right)+\cdots+\alpha_{n} k\left(x_{1}, x_{n}\right) \\
\vdots \\
\alpha_{1} k\left(x_{n}, x_{1}\right)+\cdots+\alpha_{n} k\left(x_{n}, x_{n}\right)
\end{array}\right) \\
&=\left(\begin{array}{ccc}
k\left(x_{1}, x_{1}\right) & \cdots & k\left(x_{1}, x_{n}\right) \\
\vdots & \ddots & \vdots \\
k\left(x_{n}, x_{1}\right) & \cdots & k\left(x_{n}, x_{n}\right)
\end{array}\right)\left(\begin{array}{c}
\alpha_{1} \\
\vdots \\
\alpha_{n}
\end{array}\right) \\
&=K \alpha
\end{aligned}$$  
Now, our goal is to find the best coefficient $\alpha$
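
Both identities, $\|w\|^2 = \alpha^T K \alpha$ and training predictions $K\alpha$, can be checked with an explicit finite-dimensional feature matrix. In this sketch `Phi` is a hypothetical matrix whose $i$-th row plays the role of $\psi(x_i)^T$, and `alpha` is an arbitrary coefficient vector.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 5, 3
Phi = rng.normal(size=(n, d))   # row i stands in for psi(x_i)^T
alpha = rng.normal(size=n)

K = Phi @ Phi.T                 # kernel matrix: K_ij = <psi(x_i), psi(x_j)>
w = Phi.T @ alpha               # w = sum_i alpha_i psi(x_i)

# Squared norm of w, computed in feature space and via the kernel matrix.
sq_norm_explicit = w @ w
sq_norm_kernel = alpha @ K @ alpha

# Predictions on the training points, computed both ways.
train_preds_explicit = Phi @ w
train_preds_kernel = K @ alpha
print(sq_norm_explicit, sq_norm_kernel)
```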

## Kernelized Objective  
- Substituting $w = \sum_{i = 1}^n \alpha_i \psi(x_i)$ into the generalized objective, we get:  
$$\min _{\alpha \in \mathbf{R}^{n}} R(\sqrt{\alpha^{T} K \alpha})+L(K \alpha)$$  
- All references are via kernel matrix $K$  
- This is called kernelized objective function.  



### Kernelized SVM  
- SVM objective
$$\min _{w \in \mathcal{H}} \frac{1}{2}\|w\|^{2}+\frac{c}{n} \sum_{i=1}^{n}\left(1-y_{i}\left[\left\langle w, \psi\left(x_{i}\right)\right\rangle\right]\right)_{+}$$  

- Kernelization yields  
$$\min _{\alpha \in \mathrm{R}^{n}} \frac{1}{2} \alpha^{T} K \alpha+\frac{c}{n} \sum_{i=1}^{n}\left(1-y_{i}(K \alpha)_{i}\right)_{+}$$  
### Kernelized Ridge Regression
- Ridge Regression:  
$$\min _{w \in \mathbf{R}^{d}} \frac{1}{n} \sum_{i=1}^{n}\left(w^{T} x_{i}-y_{i}\right)^{2}+\lambda\|w\|^{2}$$  
- Featurized Ridge Regression  
$$\min _{w \in \mathbf{R}^{d}} \frac{1}{n} \sum_{i=1}^{n}\left(w^{T} \psi(x_{i})-y_{i}\right)^{2}+\lambda\|w\|^{2}$$  
- Kernelized Ridge Regression  
$$\min _{\alpha \in \mathbf{R}^{n}} \frac{1}{n}\|K \alpha-y\|^{2}+\lambda \alpha^{T} K \alpha$$  
where $y = (y_1, ... , y_n)^T$
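
As a sketch of how the kernelized and featurized problems relate, the code below solves both for the linear kernel and checks that they give the same predictor. The closed forms used here come from setting the gradients to zero and are not stated in the notes: $w = (X^TX + n\lambda I)^{-1}X^Ty$ for the primal, and $\alpha = (K + n\lambda I)^{-1}y$ as one solution of the kernelized problem. The synthetic data is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, lam = 20, 3, 0.1
X = rng.normal(size=(n, d))                      # rows are x_i^T
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

# Primal ridge regression: a d x d linear system.
w_primal = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

# Kernelized ridge regression with the linear kernel: an n x n linear system.
K = X @ X.T
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)

# Representer theorem: w = sum_i alpha_i x_i recovers the primal solution.
w_from_alpha = X.T @ alpha

# Prediction at a new point, via w and via kernel evaluations k(x_i, x_new).
x_new = rng.normal(size=d)
pred_primal = w_primal @ x_new
pred_kernel = alpha @ (X @ x_new)
print(pred_primal, pred_kernel)
```

Replacing `X @ X.T` and `X @ x_new` with evaluations of another kernel changes the model without touching the rest of the code.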


# Prediction Functions with RBF Kernel  

## RBF Basis  
- Input Space: $X = R$  
- Output Space: $Y = R$  
- RBF kernel: $k(w,x) = e^{-(w - x)^2}$  
- Suppose we have 6 training examples: $x_i \in \{-6,-4,-3,0,2,4\}$  
- If Representer Theorem applies, then  
$$f(x) = \sum_{i=1}^6 \alpha_i k(x,x_i)$$  
- $f$ is a linear combination of 6 basis functions of form $k(x_i, \cdot)$  
<div align="center"><img src = "./RBF.jpg" width = '500' height = '100' align = center /></div>  

- Predictions of the form $f(x) = \sum_{i=1}^6 \alpha_i k(x,x_i)$  
<div align="center"><img src = "./rbf_sum.jpg" width = '500' height = '100' align = center /></div>  

When kernelizing with RBF kernel, prediction functions always look this way  
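
A minimal sketch of such a prediction function, using the six training inputs and the kernel $k(w,x) = e^{-(w-x)^2}$ from above. The coefficients `alpha` are hypothetical values chosen only for illustration; in practice they come from solving the kernelized objective.

```python
import numpy as np

x_train = np.array([-6.0, -4.0, -3.0, 0.0, 2.0, 4.0])
alpha = np.array([1.0, -0.5, 0.8, 1.5, -1.0, 0.3])  # hypothetical coefficients

def k(w, x):
    """RBF kernel k(w, x) = exp(-(w - x)^2) on the real line."""
    return np.exp(-(w - x) ** 2)

def f(x):
    """Prediction function f(x) = sum_i alpha_i k(x, x_i)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    basis = k(x[:, None], x_train[None, :])  # shape (len(x), 6): one bump per training point
    return basis @ alpha

grid = np.linspace(-8.0, 6.0, 8)
values = f(grid)
print(values)
```

Each basis function is a bump centered at a training input, so `f` decays to 0 far away from the training data.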

## RBF Feature Space: The Sequence Space $l_2$  
- To work with infinite-dimensional feature vectors, we need a space with certain properties  
  - An inner product  
  - a norm related to the inner product  
  - projection theorem: $x=x_{\perp}+x_{\|}$, where $x_{\|} \in S = \text{span}(w_1,...,w_n)$ and $\langle x_{\perp}, s\rangle = 0, \forall s \in S$

Basically, we need a Hilbert space.  
$l_2$ is the space of all real-valued sequences $(A_n)$ with $\sum_n A_n^2 < \infty$   

Theorem  
With the inner product $\left\langle x, x^{\prime}\right\rangle=\sum_{i=0}^{\infty} x_{i} x_{i}^{\prime}$, the space $l_2$ is a Hilbert space  

## The Infinite Dimensional Feature Vector for RBF   
- Consider RBF kernel (1-dim): $k(x,x') = e^{-\frac{(x - x')^2}{2}}$  
- We claim that $\psi: R \to l_2$  defined by  
$$[\psi(x)]_{n}=\frac{1}{\sqrt{n !}} e^{-x^{2} / 2} x^{n}$$  
gives the **“infinite-dimensional feature vector” corresponding to the RBF kernel**.  
- Is $\psi(x)$ even an element of $l_2$?   
Yes:  
$$\sum_{n=0}^{\infty} \frac{1}{n !} e^{-x^{2}} x^{2 n}=e^{-x^{2}} \sum_{n=0}^{\infty} \frac{\left(x^{2}\right)^{n}}{n !}=1<\infty$$  

- Does the feature vector $[\psi(x)]_{n}=\frac{1}{\sqrt{n !}} e^{-x^{2} / 2} x^{n}$ actually correspond to the RBF kernel?  
Yes:  
$$\begin{aligned}
\left\langle\psi(x), \psi\left(x^{\prime}\right)\right\rangle &=\sum_{n=0}^{\infty} \frac{1}{n !} e^{-\left(x^{2}+\left(x^{\prime}\right)^{2}\right) / 2} x^{n}\left(x^{\prime}\right)^{n} \\
&=e^{-\left(x^{2}+\left(x^{\prime}\right)^{2}\right) / 2} \sum_{n=0}^{\infty} \frac{\left(x x^{\prime}\right)^{n}}{n !} \\
&=\exp \left(-\left[x^{2}+\left(x^{\prime}\right)^{2}\right] / 2\right) \exp \left(x x^{\prime}\right) \\
&=\exp \left(-\left[\left(x-x^{\prime}\right)^{2} / 2\right]\right)
\end{aligned}$$
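
The derivation can be checked numerically by truncating the infinite feature vector. In this sketch the truncation length `n_terms=40` and the test inputs are arbitrary choices; the series converges fast enough that the truncated inner product matches the kernel to high precision for moderate inputs.

```python
import numpy as np
from math import factorial, sqrt

def psi_truncated(x, n_terms=40):
    """First n_terms coordinates of psi(x): exp(-x^2/2) * x^n / sqrt(n!)."""
    return np.array([np.exp(-x ** 2 / 2) * x ** n / sqrt(factorial(n))
                     for n in range(n_terms)])

def rbf(x, xp):
    """One-dimensional RBF kernel exp(-(x - x')^2 / 2)."""
    return np.exp(-(x - xp) ** 2 / 2)

x, xp = 0.7, -1.2
approx = psi_truncated(x) @ psi_truncated(xp)
exact = rbf(x, xp)
print(approx, exact)
```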

# When is $k(x,x')$ a kernel function? (Mercer's Theorem)  
## How to get kernel?  
- Explicitly construct $\psi(x): X \to R^d$, and define $k(x,x') = \psi(x)^T\psi(x')$  
- Directly define the kernel function $k(x,x')$, and verify that it corresponds to $\langle\psi(x), \psi(x')\rangle$ for some $\psi$   

## Positive Semidefinite Matrices
- Definition  
A real, symmetric matrix $M \in R^{n \times n}$ is positive semidefinite (psd) if for any $x \in R^n$  
$$x^TMx \geq 0$$  
- Theorem 
The following conditions are each necessary and sufficient for $M$ to be positive semidefinite  
   - $M$ has a “square root”, i.e. there exists $R$ s.t. $M = R^TR$  
   - All eigenvalues of $M$ are greater than or equal to 0.

A symmetric kernel function $k : X\times X \to R$ is positive semidefinite (psd) if for any finite set $\{x_1,...,x_n\}\subset X$, the kernel matrix on this set  
$$K=\left(k\left(x_{i}, x_{j}\right)\right)_{i, j}=\left(\begin{array}{ccc}
k\left(x_{1}, x_{1}\right) & \cdots & k\left(x_{1}, x_{n}\right) \\
\vdots & \ddots & \vdots \\
k\left(x_{n}, x_{1}\right) & \cdots & k\left(x_{n}, x_{n}\right)
\end{array}\right)$$  
is positive semidefinite  

## Mercer’s Theorem
A symmetric function $k(x,x')$ can be expressed as an inner product  
$$k(x,x') = \langle\psi(x), \psi(x')\rangle$$  
for some $\psi$ if and only if $k(x,x')$ is positive semidefinite.  
In other words, a matrix can be a kernel matrix only if it is positive semidefinite.  
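
The eigenvalue condition gives a simple numerical test on any one kernel matrix. The sketch below uses a hypothetical helper `is_psd` with a small tolerance for round-off, and compares the RBF kernel with the function $-|x - x'|$, which is symmetric but not psd. Passing the test on one finite set does not prove a function is a psd kernel; failing it on any set does disprove it.

```python
import numpy as np

def is_psd(K, tol=1e-10):
    """Check symmetry and that all eigenvalues are >= -tol."""
    K = np.asarray(K, dtype=float)
    if not np.allclose(K, K.T):
        return False
    return bool(np.linalg.eigvalsh(K).min() >= -tol)

points = np.array([0.0, 1.0, 2.5])
diff = points[:, None] - points[None, :]

K_rbf = np.exp(-diff ** 2 / 2)   # RBF kernel matrix on these points
K_bad = -np.abs(diff)            # symmetric "similarity", zero diagonal

print(is_psd(K_rbf), is_psd(K_bad))
```

`K_bad` has zero trace and is not the zero matrix, so it must have a negative eigenvalue.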

## Generating New Kernels from Old  
Suppose $k,k_1,k_2 : X\times X \to R$ are psd kernels. Then so are the following  
$k_{\text {new }}\left(x, x^{\prime}\right)=k_{1}\left(x, x^{\prime}\right)+k_{2}\left(x, x^{\prime}\right)$  

$k_{\text {new }}\left(x, x^{\prime}\right)=\alpha k\left(x, x^{\prime}\right)$ for $\alpha \geq 0$  

$k_{\text {new }}\left(x, x^{\prime}\right)=f(x) f\left(x^{\prime}\right)$ for any function $f(\cdot)$  

$k_{\text {new }}\left(x, x^{\prime}\right)=k_{1}\left(x, x^{\prime}\right) k_{2}\left(x, x^{\prime}\right)$

# Details on New Kernels from Old  

## Additive Closure  
Suppose $k_1$ and $k_2$ are psd kernels with feature maps $\psi_1$ and $\psi_2$, respectively, then  
$$k_{1}\left(x, x^{\prime}\right)+k_{2}\left(x, x^{\prime}\right)$$  
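is a psd kernel.  
Proof: Concatenate the feature vectors, $\psi(x) = (\psi_1(x), \psi_2(x))$. Then $\langle\psi(x), \psi(x')\rangle = \langle\psi_1(x), \psi_1(x')\rangle + \langle\psi_2(x), \psi_2(x')\rangle = k_1(x,x') + k_2(x,x')$.  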

## Closure under Positive Scaling
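- Suppose $k$ is a psd kernel with feature map $\psi$.  
Then for any $\alpha > 0$,  
$$\alpha k $$  
is a psd kernel.  
Proof: Rescale the feature vector, $\psi_{\text{new}}(x) = \sqrt{\alpha}\,\psi(x)$. Then $\langle\psi_{\text{new}}(x), \psi_{\text{new}}(x')\rangle = \alpha k(x,x')$.  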

## Scalar Function Gives a Kernel
For any function $f (x)$,
$$k\left(x, x^{\prime}\right)=f(x) f\left(x^{\prime}\right)$$  
is a kernel  
proof:  
Let $f (x)$ be the feature mapping. (It maps into a 1-dimensional feature space.)  
$$\left\langle f(x), f\left(x^{\prime}\right)\right\rangle=f(x) f\left(x^{\prime}\right)=k\left(x, x^{\prime}\right)$$  

## Closure under Hadamard Products
- Suppose $k_1$ and $k_2$ are psd kernels with feature maps $\phi_1$ and $\phi_2$, respectively, then  
$$k_{1}\left(x, x^{\prime}\right) k_{2}\left(x, x^{\prime}\right)$$  
is a psd kernel  
Proof: Take the outer product of the feature vectors  
$$\phi(x)=\phi_{1}(x)\left[\phi_{2}(x)\right]^{T}$$  
Note that $\phi(x)$ is a matrix, and the inner product of two matrices is the sum of the products of their corresponding entries.   
Then  
$$\begin{aligned}
\left\langle\phi(x), \phi\left(x^{\prime}\right)\right\rangle &=\sum_{i, j} \left[\phi(x)\right]_{i j} \left[\phi\left(x^{\prime}\right)\right]_{i j} \\
&=\sum_{i, j}\left[\phi_{1}(x)\left[\phi_{2}(x)\right]^{T}\right]_{i j}\left[\phi_{1}\left(x^{\prime}\right)\left[\phi_{2}\left(x^{\prime}\right)\right]^{T}\right]_{i j} \\
&=\sum_{i, j}\left[\phi_{1}(x)\right]_{i}\left[\phi_{2}(x)\right]_{j}\left[\phi_{1}\left(x^{\prime}\right)\right]_{i}\left[\phi_{2}\left(x^{\prime}\right)\right]_{j} \\
&=\left(\sum_{i}\left[\phi_{1}(x)\right]_{i}\left[\phi_{1}\left(x^{\prime}\right)\right]_{i}\right)\left(\sum_{j}\left[\phi_{2}(x)\right]_{j}\left[\phi_{2}\left(x^{\prime}\right)\right]_{j}\right) \\
&=k_{1}\left(x, x^{\prime}\right) k_{2}\left(x, x^{\prime}\right)
\end{aligned}$$
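
The outer-product construction can be verified numerically. In this sketch the two feature maps are hypothetical examples: `phi1` is the identity (linear kernel) and `phi2` prepends a constant 1 (so $k_2(x,x') = 1 + \langle x, x'\rangle$).

```python
import numpy as np

def phi1(x):
    """Feature map of the linear kernel k1(x, x') = <x, x'>."""
    return np.asarray(x, dtype=float)

def phi2(x):
    """Feature map of k2(x, x') = 1 + <x, x'>."""
    return np.concatenate([[1.0], np.asarray(x, dtype=float)])

def phi(x):
    """Matrix-valued feature map for the product kernel: outer product of phi1 and phi2."""
    return np.outer(phi1(x), phi2(x))

rng = np.random.default_rng(5)
x, xp = rng.normal(size=3), rng.normal(size=3)

# Inner product of matrices: sum of products of corresponding entries.
product_via_features = np.sum(phi(x) * phi(xp))

k1 = phi1(x) @ phi1(xp)
k2 = phi2(x) @ phi2(xp)
print(product_via_features, k1 * k2)
```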