# Kernel Method

## ***Vocabulary***

**kernel function**
- a two variable function that is symmetric: $K(x, x') = K(x', x)$

**basis function**
- A basis function is a mathematical function used to transform input data into a new representation, typically in a higher-dimensional space, to help capture complex patterns or relationships.

**interpolate**
- to exactly reproduce the observed values of the dataset at the training points

**positive semi-definite**
- a square matrix $K$ is positive semi-definite if
1. it is symmetric ($K = K^T$)
2. for any non-zero vector $z$, the quadratic form satisfies $z^TKz \ge 0$
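
The definition above can be checked numerically. The following is a minimal sketch, not a definitive implementation: the helper name `is_psd` and the tolerance `tol` are assumptions made for illustration. It relies on the fact that a symmetric matrix satisfies $z^TKz \ge 0$ for all $z$ exactly when all of its eigenvalues are non-negative.

```python
import numpy as np

def is_psd(K, tol=1e-10):
    """Return True if K is symmetric and all eigenvalues are >= -tol.

    `is_psd` and `tol` are illustrative names, not from the lecture.
    """
    K = np.asarray(K, dtype=float)
    # Condition 0: K must be a square matrix.
    if K.ndim != 2 or K.shape[0] != K.shape[1]:
        return False
    # Condition 1: symmetry, K = K^T.
    if not np.allclose(K, K.T):
        return False
    # Condition 2: for symmetric K, z^T K z >= 0 for all z is equivalent
    # to every eigenvalue being non-negative (up to round-off).
    eigenvalues = np.linalg.eigvalsh(K)
    return bool(eigenvalues.min() >= -tol)

A = np.array([[2.0, 1.0], [1.0, 2.0]])  # eigenvalues 1 and 3
B = np.array([[1.0, 2.0], [2.0, 1.0]])  # eigenvalues -1 and 3
print(is_psd(A), is_psd(B))  # prints: True False
```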

# Lecture Notes 

## ***The Kernel Method***

### **Introduction**

#### **What is a Kernel Method**

Kernel methods are an important class of algorithms for constructing flexible, nonlinear function approximations in machine learning. They achieve this by mapping data into higher-dimensional spaces.

#### **Reviewing Supervised Learning**

<br>
<center>
    <img width="60%" src="images/2.6.1.png" alt="Professor Notes" />
</center>
<br>

We can see that in supervised learning, the three main steps are:
1. Decide on a function class $\mathcal{F}$, which cannot be too large (it would overfit), but must be large enough to capture the complexity of the data.
2. Define a loss function for $f \in \mathcal{F}$.
3. Solve the optimization problem of finding the $f \in \mathcal{F}$ with the lowest loss.

Most supervised learning techniques fall into this framework, and many of them use a linear function class, such as:

$$ \mathcal{F} \triangleq \{ f_\theta (x) = \sum_{l=1}^d \theta_l x_l + \theta_0 \mid \theta_0, \theta_1, \dots, \theta_d \in \mathbb{R} \}$$

Linear function classes work well in many settings, but they struggle to model more complex patterns. This is where the kernel method (and later neural networks) come in.
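
The three steps can be sketched for the linear function class as follows. This is a minimal illustration under stated assumptions: the squared loss, the synthetic noise-free data, and the names `X_aug`, `squared_loss`, and `theta_hat` are all chosen here for the example and are not from the lecture.

```python
import numpy as np

# Synthetic, noise-free data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.7  # theta_0 = 0.7

# Step 1: function class f_theta(x) = sum_l theta_l x_l + theta_0.
# Prepending a column of ones lets theta_0 be handled like any other weight.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])

# Step 2: loss function (mean squared error, assumed here).
def squared_loss(theta, X_aug, y):
    residuals = X_aug @ theta - y
    return float(np.mean(residuals ** 2))

# Step 3: optimization. For squared loss, least squares has a direct solver.
theta_hat, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

print(theta_hat)                       # ordered as [theta_0, theta_1, theta_2, theta_3]
print(squared_loss(theta_hat, X_aug, y))
```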

#### **Basis Functions**

A basis function $\phi(x)$ can be viewed as a mapping from the input $x$ to a new feature. An example is the polynomial basis function:

$$ \phi_l(x) = x^l \implies 1, x, x^2, \dots $$
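
As a small sketch of the polynomial basis for scalar inputs (the function name `polynomial_basis` and its `degree` parameter are assumptions for illustration):

```python
import numpy as np

def polynomial_basis(x, degree):
    """Map scalar inputs to the features [1, x, x^2, ..., x^degree].

    Returns an array of shape (len(x), degree + 1), one row per input.
    """
    x = np.asarray(x, dtype=float).reshape(-1)
    # increasing=True orders the columns as x^0, x^1, ..., x^degree.
    return np.vander(x, N=degree + 1, increasing=True)

Phi = polynomial_basis([2.0, 3.0], degree=3)
print(Phi)  # rows are [1, 2, 4, 8] and [1, 3, 9, 27]
```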

#### **A Basic and Naive Approach to Building Nonlinear Approximation**

A very basic and naive way to incorporate nonlinearity into $f$ is to replace the features with a set of nonlinear basis functions. So

$$ f(x, \theta) = \sum_{l=1}^d \theta_l x_l = \theta^\intercal x $$

becomes

$$ f(x, \theta) = \sum_{l=1}^m \theta_l \phi_l(x) = \theta^\intercal \phi(x) $$

where $\phi(x) = [\phi_1(x), \dots, \phi_m(x)]^\intercal$ is a set of nonlinear basis functions believed to capture important nonlinear patterns in the input, and $\theta = [\theta_1, \dots, \theta_m]^\intercal$ is the coefficient vector. For a fixed $\phi(x)$, the model is still linear in $\theta$, so the coefficients are estimated the same way as in linear regression, and this has a simple closed-form solution.

The problem with this approach is that you must manually decide what type of basis function to use. This can be very problem dependent, and hand crafting different functions each time can be problematic.

This is a very simple implementation, but the **key idea of converting the input space using a nonlinear transform into another input space, then performing linear regression on it** is fundamental to nonlinear approximations.
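
The transform-then-regress idea can be sketched with a hand-picked polynomial basis. This is an illustration under assumptions, not the lecture's implementation: the cubic target, the choice of $m = 4$ basis functions $1, x, x^2, x^3$, and the names `phi`, `theta`, and `f` are all chosen here. The coefficients come from the least-squares normal equations $\Phi^\intercal \Phi \, \theta = \Phi^\intercal y$.

```python
import numpy as np

def phi(x, m=4):
    """Hand-picked basis: columns 1, x, x^2, ..., x^(m-1)."""
    x = np.asarray(x, dtype=float).reshape(-1)
    return np.vander(x, N=m, increasing=True)

# Noise-free data from a cubic (assumed for illustration).
x_train = np.linspace(-1.0, 1.0, 20)
y_train = 1.0 - 2.0 * x_train + 0.5 * x_train ** 3

# Nonlinear transform of the inputs, then ordinary linear regression on it.
Phi = phi(x_train)
theta = np.linalg.solve(Phi.T @ Phi, Phi.T @ y_train)

def f(x, theta):
    """f(x, theta) = theta^T phi(x), evaluated for each input in x."""
    return phi(x, m=len(theta)) @ theta

print(theta)          # close to [1, -2, 0, 0.5]
print(f([0.5], theta))
```

The model is nonlinear in $x$ but linear in $\theta$, which is why the linear regression machinery still applies.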

#### **Building Flexible and Adaptive Nonlinear Approximation**

Our goal is to have an algorithm that can automatically construct a set of basis functions, on which we can then perform linear regression. This provides a very flexible nonlinear approximation, and the more basis functions we can construct, the more flexible the approximation becomes. Basis functions constructed this way are called **adaptive basis functions**. Both the kernel method and neural networks are methods to find adaptive basis functions.

#### **The Kernel Method in Practice**

Let's say we have a set of two-dimensional data points, $\mathcal{D}$. The idea of kernel methods is that we can represent each data point by its similarity to all the other data points. In the example below, the data point "star" can be characterized by its similarity to every other point. Further, the more data we have, the more dimensions we can use to represent this data point, and the more flexible the representation becomes.

<br>
<center>
    <img width="60%" src="images/2.6.2.png" alt="Professor Notes" />
</center>
<br>

#### **Generating a Similarity Score**

In order to do this, we need to define a **similarity function**. This function takes two points and calculates a similarity score between them, a real number:

$$ K : X \times X \to \mathbb{R}, \quad (x, x') \mapsto K(x, x') $$

Often we use Gaussian similarity functions, for example the Gaussian radial basis function (RBF) kernel:

$$ K(x, x') = \exp\left(-\frac{1}{2h^2}\|x-x'\|^2_2\right) $$

where $h$ is a hyperparameter called the **bandwidth**, which controls how quickly similarity decays with distance, and $||x-x'||^2_2$ is the squared Euclidean distance between $x$ and $x'$.
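
A minimal sketch of the Gaussian RBF kernel, plus the matrix of all pairwise similarities for a small dataset. The function names `rbf_kernel` and `gram_matrix` and the three example points are assumptions for illustration.

```python
import numpy as np

def rbf_kernel(x, x_prime, h=1.0):
    """K(x, x') = exp(-||x - x'||^2 / (2 h^2)) for two points."""
    diff = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * h ** 2)))

def gram_matrix(X, h=1.0):
    """Pairwise similarities: entry (i, j) is K(x_i, x_j)."""
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X ** 2, axis=1)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, computed for all pairs at once.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dists / (2.0 * h ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = gram_matrix(X, h=1.0)
print(K)
```

Each point has similarity 1 with itself, and the score shrinks toward 0 as points move apart. A smaller `h` makes that decay faster.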

# Personal Notes #

#### **Kernel Representation Benefits**

- Flexibility: Kernel functions can map data into spaces where complex relationships become linear.
- Nonlinearity: Even if the data are not linearly separable in the input space, they may become linearly separable in the kernel space.
- Avoiding Explicit Basis Functions: The kernel trick allows computations to be performed directly with $K(x, x')$ without explicitly defining or computing the high-dimensional $\phi(x)$.
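
The last point can be illustrated with a kernel whose feature map is small enough to write out. The quadratic kernel $K(x, x') = (x^\intercal x')^2$ is not from the lecture; it is a standard textbook example assumed here because, for two-dimensional inputs, it equals the inner product of the explicit features $\phi(x) = [x_1^2, \sqrt{2}\,x_1 x_2, x_2^2]^\intercal$. The function names are also illustrative.

```python
import numpy as np

def explicit_features(x):
    """phi(x) = [x1^2, sqrt(2) x1 x2, x2^2] for a 2D input."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def quadratic_kernel(x, x_prime):
    """K(x, x') = (x . x')^2, computed without ever forming phi."""
    return float(np.dot(x, x_prime) ** 2)

x = np.array([1.0, 2.0])
x_prime = np.array([3.0, -1.0])

via_features = float(explicit_features(x) @ explicit_features(x_prime))
via_kernel = quadratic_kernel(x, x_prime)
print(via_features, via_kernel)  # both equal 1 up to floating-point round-off
```

For the RBF kernel the corresponding feature space is infinite-dimensional, so evaluating the kernel directly is the only practical route.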

#### **How to Choose x to Represent in Kernel Space**

- In prediction tasks, you will represent the query point (the input for which you would like an output) as $x$ in $\phi(x)$.
- If you want to analyze the relationships amongst the data points, you can represent each point $x_i \in \mathcal{D}$ in kernel space to explore its similarity to other points in $\mathcal{D}$.
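
Both uses can be sketched together: each training point, and then a query point, is represented by its vector of RBF similarities to the training set, and a linear model is fit on those features. This is a sketch under assumptions, not the lecture's algorithm. The toy data, the bandwidth `h=0.5`, the small `ridge` term added for numerical stability, and the names `kernel_features`, `theta`, and `prediction` are all chosen here for illustration.

```python
import numpy as np

def kernel_features(X_query, X_train, h=1.0):
    """Row i is [K(x_i, x_1), ..., K(x_i, x_N)] using the RBF kernel."""
    Xq = np.atleast_2d(np.asarray(X_query, dtype=float))
    Xt = np.atleast_2d(np.asarray(X_train, dtype=float))
    # Squared distances between every query point and every training point.
    sq_dists = np.sum((Xq[:, None, :] - Xt[None, :, :]) ** 2, axis=2)
    return np.exp(-sq_dists / (2.0 * h ** 2))

# Toy 1D dataset (assumed for illustration).
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 0.0, -1.0])

# Analysis use: every training point described by its similarity to the others.
Phi_train = kernel_features(X_train, X_train, h=0.5)

# Linear model on the similarity features. With N points and N features the
# system is square; the tiny ridge term is an assumption for stability.
ridge = 1e-10
theta = np.linalg.solve(Phi_train + ridge * np.eye(len(X_train)), y_train)

# Prediction use: the query point is represented the same way.
x_query = np.array([[1.5]])
prediction = kernel_features(x_query, X_train, h=0.5) @ theta
print(prediction)
```

With a ridge term this small, the fitted model reproduces the training targets almost exactly, which matches the vocabulary entry for "interpolate".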