# Kernel Method

## ***Vocabulary***

**kernel function**
- a two variable function that is symmetric: $K(x, x') = K(x', x)$

**basis function**
- A basis function is a mathematical function used to transform input data into a new representation, typically in a higher-dimensional space, to help capture complex patterns or relationships

**interpolate**
- the ability to exactly reproduce the observed values of the dataset at the training points

**positive semi-definite**
- a square matrix $K$ is positive semi-definite if
1. it is symmetric ($K = K^T$)
2. for any non-zero vector $z$, the quadratic form $z^TKz \ge 0$
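
The definition above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper name `is_positive_semidefinite` and the tolerance `tol` are illustrative choices, not part of the lecture. It relies on the standard fact that a symmetric matrix is positive semi-definite exactly when all of its eigenvalues are non-negative.

```python
import numpy as np

def is_positive_semidefinite(K, tol=1e-10):
    """Check symmetry, then check that no eigenvalue is below -tol."""
    K = np.asarray(K, dtype=float)
    if not np.allclose(K, K.T):
        return False
    # eigvalsh is intended for symmetric matrices and returns real eigenvalues.
    return bool(np.linalg.eigvalsh(K).min() >= -tol)

print(is_positive_semidefinite([[2.0, 1.0], [1.0, 2.0]]))  # True: eigenvalues are 1 and 3
print(is_positive_semidefinite([[1.0, 2.0], [2.0, 1.0]]))  # False: eigenvalues are -1 and 3
```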

**nonparametric method**
- a model whose complexity grows with the amount of data, as opposed to being fixed by a predefined number of parameters (e.g. the kernel method has parameters $\theta_1, \dots, \theta_n$, where $n$ is the size of the dataset)

**regularization**
- a technique used to prevent overfitting by discouraging overly complex models. It achieves this by adding a penalty term to the loss function, which reduces the magnitude of the model's parameters or limits their flexibility

# Lecture Notes 

## ***The Kernel Method***

### **Introduction**

#### **What is a Kernel Method**

Kernel methods are an important class of algorithms for constructing very flexible, nonlinear function approximations in machine learning. They achieve this by mapping data into higher-dimensional spaces using the kernel trick.

#### **Reviewing Supervised Learning**

<br>
<center>
    <img width="60%" src="images/2.6.1.png" alt="Professor Notes" />
</center>
<br>

We can see that in supervised learning, the three main steps are:
1. Decide a function class $\mathcal {F}$, which cannot be too large (overfitting), but must be large enough to consider the complexities of the data.
2. Define a loss function for $f \in \mathcal{F}$.
3. Solve the optimization problem of finding the $f \in \mathcal{F}$ with the lowest loss.

Most supervised learning techniques fall into this exact framework, and many of them use a linear function class, such as:

$$ \mathcal{F} \triangleq \{ f_\theta (x) = \sum_{l=1}^d \theta_l x_l + \theta_0 \mid \theta_0, \theta_1, \dots, \theta_d \in \mathbb{R} \}$$

Linear function classes work well in many settings, but they struggle to model more complex patterns. This is where kernel methods (and later neural networks) come in.

### **The Foundation of Kernel Functions**

#### **Basis Functions**

A basis function $\phi(x)$ can be viewed as mapping the input $x$ to a new feature. An example is the polynomial basis function:

$$ \phi_l(x) = x^l, \quad l = 0, 1, 2, \dots \implies 1, x, x^2, \dots $$

This example illustrates how input features are transformed into powers of $x$ to capture polynomial relationships.
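
As a small illustration of the polynomial basis, the sketch below maps a scalar input to its first few powers. The function name `polynomial_basis` and the `degree` cutoff are illustrative assumptions; the notes only define $\phi_l(x) = x^l$.

```python
def polynomial_basis(x, degree):
    """Return [x**0, x**1, ..., x**degree] for a scalar input x."""
    return [x ** l for l in range(degree + 1)]

features = polynomial_basis(2.0, 3)
print(features)  # [1.0, 2.0, 4.0, 8.0]
```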

#### **A Basic and Naive Approach to Building Nonlinear Approximation**

A very basic and naive way to incorporate nonlinear functions into $\mathcal{F}$ is to replace the features with a set of nonlinear basis functions. So

$$ f(x, \theta) = \sum_{l=1}^d \theta_l x_l = \theta^\intercal x $$

becomes

$$ f(x, \theta) = \sum_{l=1}^m \theta_l \phi_l(x) = \theta^\intercal \phi(x) $$

Where $\phi(x) = [\phi_1(x), \dots, \phi_m(x)]^\intercal$ is a set of nonlinear basis functions believed to capture important nonlinear patterns in the input, and $\theta = [\theta_1, \dots, \theta_m]^\intercal$ is the coefficient vector. For a fixed $\phi(x)$, the coefficient vector $\theta$ is estimated the same way as in linear regression, so this has a simple closed-form solution. (It is important to note that a high-dimensional $\phi(x)$ can still make the computation challenging.)
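
To make the "fixed basis, then linear regression" idea concrete, here is a minimal sketch assuming NumPy and a hand-picked polynomial basis with $m = 2$. The toy data, the helper name `polynomial_features`, and the use of `np.linalg.lstsq` as the least-squares solver are all illustrative choices rather than anything prescribed in the lecture.

```python
import numpy as np

def polynomial_features(x, m):
    """Map each scalar in x to [x^1, ..., x^m], matching the sum from l = 1 to m."""
    x = np.asarray(x, dtype=float)
    return np.stack([x ** l for l in range(1, m + 1)], axis=1)

# Toy data generated from y = 2x - 3x^2 (no noise, no intercept).
x_train = np.linspace(-1.0, 1.0, 20)
y_train = 2.0 * x_train - 3.0 * x_train ** 2

Phi = polynomial_features(x_train, m=2)  # shape (20, 2): one row per data point
# Ordinary least squares on the transformed features.
theta, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)
print(theta)  # approximately [ 2. -3.]
```

The model is nonlinear in $x$ but still linear in $\theta$, which is why the ordinary least-squares machinery applies unchanged.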

The problem with this approach is that you must manually decide what type of basis function to use. This can be very problem dependent, and hand crafting different functions each time can be problematic.

This is a very simple implementation, but the **key idea of converting the input space using a nonlinear transform into another input space, then performing linear regression on it** is fundamental to nonlinear approximations.

#### **Building Flexible and Adaptive Nonlinear Approximation**

Our goal is to have an algorithm that can **automatically construct a set of basis functions based on the data**, which we can then do linear regression on. This is called an **adaptive basis function**. Both the kernel method and neural networks are methods to find adaptive basis functions.

### **Putting it all Together**

#### **The Kernel Method in Practice**

Let's say we have a set of data points, $\mathcal{D}$, that is two-dimensional. The idea of kernel methods is that we can represent each of the data points using its similarity with all the other data points. In the example below, the data point "star" can be characterized/identified by its similarity with all the other points. Further, the more data we have, the more dimensions we can use to represent this data point, and the more flexible the representation becomes.

<br>
<center>
    <img width="60%" src="images/2.6.2.png" alt="Professor Notes" />
</center>
<br>

#### **Generating a Similarity Score**

In order to do this, we need to define a **similarity function**. This function takes two points and calculates a similarity score between them, a real number:

$$ K : X \times X \to \mathbb{R} $$

Often we use Gaussian similarity functions, for example the Gaussian radial basis function (RBF) kernel:

$$ K(x, x') = \exp\left(-\frac{1}{2h^2}||x-x'||^2_2\right) $$

Where $h$ is a hyperparameter called the **bandwidth**, which controls how quickly similarity decays with distance, and $||x-x'||^2_2$ is the squared Euclidean distance between $x$ and $x'$. The bandwidth must be chosen by the practitioner and significantly affects the flexibility of the model. It is sometimes set from the spread of the observed data ($h$ plays the role of a standard deviation, so $h^2$ corresponds to a variance), but it is typically tuned through cross-validation.
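
The effect of the bandwidth is easy to see numerically. The sketch below, assuming NumPy, implements the Gaussian RBF formula above; the function name `rbf_kernel` and the example points are illustrative.

```python
import numpy as np

def rbf_kernel(x, x_prime, h=1.0):
    """Gaussian RBF similarity exp(-||x - x'||^2 / (2 h^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * h ** 2)))

a, b = [0.0, 0.0], [1.0, 1.0]   # squared Euclidean distance between a and b is 2
print(rbf_kernel(a, a))          # 1.0: a point is maximally similar to itself
print(rbf_kernel(a, b, h=1.0))   # exp(-1), about 0.368
print(rbf_kernel(a, b, h=5.0))   # exp(-1/25), about 0.961: larger h, slower decay
```

A small $h$ makes the similarity drop off quickly, so each training point only influences its immediate neighborhood; a large $h$ makes distant points look similar.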

#### **Generating a Feature Map**

We can define a feature map, a set of basis functions, that uses the similarity function we created:

$$ 
\phi(x) =
\begin{bmatrix}
K(x, x_1) \\
K(x, x_2) \\
\vdots \\
K(x, x_n)
\end{bmatrix} 
$$

Note that $\phi(x)$ grows with the size of the dataset. Now we can use this representation to construct the linear regression:

$$ f_\theta (x) = \theta^\intercal \phi (x) = \sum_{i=1}^n \theta_i \phi_i (x) = \sum_{i=1}^n \theta_i K(x,x_i) $$

Where $\theta$ is a vector of the same size as the training set:

$$ 
\theta =
\begin{bmatrix}
\theta_1 \\
\theta_2 \\
\vdots \\
\theta_n
\end{bmatrix} 
$$

There is a $\theta_i$ associated with each point in the dataset, which measures the importance of that data point. These weights are estimated from the data by minimizing the loss function.

This formula, $f_\theta (x) = \sum_{i=1}^n \theta_i K(x,x_i)$, takes a sum over all the data points, measuring the similarity of $x$ to each data point, weighted by the parameter $\theta_i$. This allows us to construct a flexible class of functions.
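
The feature map and the resulting prediction can be sketched in a few lines, assuming NumPy and the Gaussian RBF kernel. The names `feature_map` and `predict`, the three training points, and the weights in `theta` are hypothetical values chosen only for illustration; in practice $\theta$ is estimated from data.

```python
import numpy as np

def rbf_kernel(x, x_prime, h=1.0):
    """Gaussian RBF similarity exp(-||x - x'||^2 / (2 h^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * h ** 2)))

def feature_map(x, X_train, h=1.0):
    """phi(x) = [K(x, x_1), ..., K(x, x_n)]: similarity of x to every training point."""
    return np.array([rbf_kernel(x, x_i, h) for x_i in X_train])

def predict(x, X_train, theta, h=1.0):
    """f_theta(x) = sum_i theta_i K(x, x_i) = theta^T phi(x)."""
    return float(theta @ feature_map(x, X_train, h))

X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])  # n = 3 points in 2-D
theta = np.array([1.0, -0.5, 2.0])                        # hypothetical weights

phi = feature_map([0.0, 0.0], X_train)
print(phi.shape)  # (3,): one feature per training point
print(predict([0.0, 0.0], X_train, theta))
```

Note that the length of `phi` is set by the number of training points, not by the dimension of the input.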

#### **Visualizing the Concept**

<br>
<center>
    <img width="100%" src="images/2.6.3.png" alt="Professor Notes" />
</center>
<br>

#### **Key Properties**

There are two nice properties to this approach:
1. the construction of the basis functions adapts to the training data
2. if we have more data, the function class $\mathcal{F}$ will be even more flexible.

Note that $\mathcal{F}$ becomes more flexible as $n$ increases because the number of parameters ($\theta_i$) is equal to the size of the training set, $n$. This makes the kernel method a **nonparametric method**. However, the dimension of $\mathcal{F}$ is also equal to $n$, so larger datasets can lead to computational challenges and cost.

### **Estimating Theta**

Once the model is set up, we can work on estimating $\theta$. That is, we want to find the optimal $\theta$ to minimize the loss function. In this case, the loss function is:

$$ \underset{\theta}{\min} \;L(\theta) = \sum_{j=1}^n \left( y_j - \sum_{i=1}^n \theta_i K(x_j, x_i) \right)^2 $$

So for each data point $x_j$, we measure the squared difference between $y_j$ and the model's prediction at $x_j$, which is $\sum_{i=1}^n \theta_i K(x_j, x_i)$.

This loss function is equivalent to:

$$ \underset{\theta}{\min} \;L(\theta) = || Y - K\theta||_2^2 $$

Where $Y = [y_1, \dots, y_n]^\intercal$ is the vector of labels, and $K$ is the $n \times n$ matrix of pairwise similarities, with entries $K_{ji} = K(x_j, x_i)$:

<br>
<center>
    <img width="100%" src="images/2.6.4.png" alt="Professor Notes">
</center>
<br>
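
The equivalence between the double-sum form and the matrix form of the loss can be checked numerically. The sketch below assumes NumPy and the Gaussian RBF kernel; the helper `gram_matrix` and the randomly generated `X`, `Y`, and `theta` are illustrative stand-ins, not data from the lecture.

```python
import numpy as np

def gram_matrix(X, h=1.0):
    """K[j, i] = exp(-||x_j - x_i||^2 / (2 h^2)) for all pairs of rows of X."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * h ** 2))

rng = np.random.default_rng(0)
n = 5
X = rng.normal(size=(n, 2))   # five 2-D inputs
Y = rng.normal(size=n)        # arbitrary labels
theta = rng.normal(size=n)    # arbitrary weights, not yet optimized

K = gram_matrix(X)

# Loss written as a double sum over data points.
loss_sum = sum(
    (Y[j] - sum(theta[i] * K[j, i] for i in range(n))) ** 2 for j in range(n)
)
# The same loss in matrix form, ||Y - K theta||^2.
loss_matrix = np.sum((Y - K @ theta) ** 2)

print(np.isclose(loss_sum, loss_matrix))  # True
```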

So we take the difference, take the norm and square it, and minimize the result. This is a standard linear regression problem, which we solve by taking the gradient and setting it to zero. The gradient is:

$$ \nabla L(\theta) = 2K^\intercal (K\theta - Y) $$

Setting the gradient to zero and solving, making the important **assumption that $K$ is invertible**:

$$ \nabla L(\theta) = 2K^\intercal (K\theta - Y) = 0 $$
$$ \implies K^\intercal K\theta = K^\intercal Y $$
$$ \implies \theta = K^{-1}Y $$
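
A minimal sketch of this closed-form solution follows, assuming NumPy, the Gaussian RBF kernel, and a toy one-dimensional dataset (samples of $\sin(2x)$) chosen only for illustration. It uses `np.linalg.solve` rather than forming $K^{-1}$ explicitly, which is the usual numerically safer way to solve $K\theta = Y$.

```python
import numpy as np

def gram_matrix(XA, XB, h=1.0):
    """Pairwise Gaussian RBF similarities between rows of XA and rows of XB."""
    sq_dists = (
        np.sum(XA ** 2, axis=1)[:, None]
        + np.sum(XB ** 2, axis=1)[None, :]
        - 2.0 * XA @ XB.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * h ** 2))

# Toy 1-D training set: six samples of sin(2x).
x_train = np.linspace(0.0, 3.0, 6).reshape(-1, 1)
y_train = np.sin(2.0 * x_train[:, 0])

K = gram_matrix(x_train, x_train, h=0.5)
theta = np.linalg.solve(K, y_train)  # theta = K^{-1} Y, assuming K is invertible

# With the unregularized solution, the model reproduces the training labels.
print(np.allclose(K @ theta, y_train))  # True

# Prediction at a new point: f(x) = sum_i theta_i K(x, x_i).
x_new = np.array([[1.5]])
y_new = gram_matrix(x_new, x_train, h=0.5) @ theta
print(y_new)
```

Because $K\theta = Y$ holds exactly, the fitted function interpolates the training data, which is also why overfitting becomes a concern.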

### **Preventing Overfitting**

#### **Regularization**

Since the number of parameters is equal to the number of observed data points, overfitting is a concern for this model. However, we can use something called **regularization** to prevent this.

Instead of doing the exact linear regression, which can overfit the training data, we can solve a regularized version of the loss function. 

The original loss function:

$$ \underset{\theta}{\min} \;L(\theta) = || Y - K\theta||_2^2 $$

Regularized version:

$$ \underset{\theta}{\min} \;L(\theta) = || Y - K\theta||_2^2 + \alpha \Phi(\theta)$$

Where $\Phi(\theta)$ is some regularization term and $\alpha$ is the regularization coefficient.

#### **Aside: Understanding Phi and alpha**

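$\Phi(\theta)$ is the **regularization term**: a function of $\theta$ that quantifies the model's complexity. The coefficient $\alpha \ge 0$ controls the strength of the penalty: $\alpha = 0$ recovers the original loss, and larger values favor simpler models. Different choices for $\Phi(\theta)$ lead to different types of regularization: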

**L2 Regularization (Ridge)**
$$ \Phi(\theta) = ||\theta||_2^2 = \sum_{i=1}^n \theta_i^2 $$
- Penalizes the sum of squared coefficients
- Encourages small but nonzero coefficients

**L1 Regularization (Lasso)**
$$ \Phi(\theta) = ||\theta||_1 = \sum_{i=1}^n |\theta_i| $$
- Penalizes the sum of absolute values of coefficients
- Encourages sparsity (some $\theta_i$ become exactly 0)
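
For the L2 (ridge) penalty, the regularized problem still has a closed-form solution. Setting the gradient $2K^\intercal(K\theta - Y) + 2\alpha\theta$ to zero gives $(K^\intercal K + \alpha I)\theta = K^\intercal Y$. The sketch below, assuming NumPy, the Gaussian RBF kernel, and a noisy toy dataset, solves this system for a few values of $\alpha$; the names `fit_ridge` and `gram_matrix` and the data are illustrative. The L1 penalty has no such closed form and is not covered here.

```python
import numpy as np

def gram_matrix(X, h=1.0):
    """Pairwise Gaussian RBF similarities between the rows of X."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * h ** 2))

def fit_ridge(K, Y, alpha):
    """Minimize ||Y - K theta||^2 + alpha ||theta||^2 via (K^T K + alpha I) theta = K^T Y."""
    n = K.shape[0]
    return np.linalg.solve(K.T @ K + alpha * np.eye(n), K.T @ Y)

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 3.0, size=(15, 1))
Y = np.sin(2.0 * X[:, 0]) + 0.1 * rng.normal(size=15)  # noisy toy labels
K = gram_matrix(X, h=0.5)

for alpha in (1e-3, 1e-1, 10.0):
    theta = fit_ridge(K, Y, alpha)
    # Larger alpha shrinks theta and lets the training residual grow.
    print(alpha, np.linalg.norm(theta), np.linalg.norm(Y - K @ theta))
```

As $\alpha$ grows, the coefficient norm shrinks while the training error increases, which is the trade-off the regularization coefficient controls.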

# Personal Notes #

#### **Kernel Representation Benefits**

- Flexibility: Kernel functions can map data into spaces where complex relationships become linear.
- Nonlinearity: Even if the pattern in the data is nonlinear in the input space (for example, classes that are not linearly separable), it may become linear in the kernel space.
- Avoiding Explicit Basis Functions: The kernel trick allows computations to be performed directly with $K(x,x')$ without explicitly defining or computing the high-dimensional $\phi(x)$.

#### **How to Choose x to Represent in Kernel Space**

- In prediction tasks, you will represent the query point (the input for which you would like an output) as $x$ in $\phi(x)$.
- If you want to analyze the relationships among the data points, you can represent each point $x_i \in \mathcal{D}$ in kernel space to explore its similarity to other points in $\mathcal{D}$.

#### **What is this "Flexible Class of Functions"**

**Flexible class of functions** refers to the family of functions of the form $f_\theta (x) = \sum_{i=1}^n \theta_i K(x,x_i)$, which are defined by:
- the kernel used to measure similarity (e.g., Gaussian RBF, Laplace)
- the weights $\theta_i$, which control the influence of each similarity score

This class is "flexible" because by choosing different kernels and different values for $\theta_i$, we can generate many different functions.

#### **Kernel Choice**
The kernel $K(x,x')$ chosen defines the shape and properties of the functions in the generated class, for example:
- a Laplace kernel produces a class of functions that are less smooth and more sensitive to small changes in $x$
- a Gaussian RBF kernel produces a smoother, more globally stable class of functions
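
The difference in smoothness can be seen by comparing how fast each kernel drops right next to a point. This sketch assumes scalar inputs and the common Laplace form $K(x, x') = \exp(-|x - x'| / h)$, which is an assumption since the notes do not give the Laplace formula; the function names are illustrative.

```python
import math

def gaussian_kernel(x, x_prime, h=1.0):
    """Gaussian RBF kernel for scalar inputs."""
    return math.exp(-((x - x_prime) ** 2) / (2.0 * h ** 2))

def laplace_kernel(x, x_prime, h=1.0):
    """Laplace kernel for scalar inputs (assumed form: exp(-|x - x'| / h))."""
    return math.exp(-abs(x - x_prime) / h)

eps = 1e-3  # a tiny step away from the point x = 0
gaussian_drop = 1.0 - gaussian_kernel(0.0, eps)  # roughly eps**2 / 2
laplace_drop = 1.0 - laplace_kernel(0.0, eps)    # roughly eps

# The Laplace kernel falls off linearly near zero (a kink at x = x'),
# while the Gaussian kernel is flat there, so it changes far less.
print(gaussian_drop, laplace_drop)
```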