# Kernel Trick: A Mathematical Explanation

The kernel trick lets us compute inner products in a high-dimensional feature space implicitly, without ever mapping the data into that space. This is particularly useful when the explicit mapping is computationally expensive, or when the feature space is infinite-dimensional and the mapping cannot be written out at all.

## General Idea

Suppose we have a mapping

$$
\phi: \mathbb{R}^d \to \mathcal{F},
$$

which takes an input vector $\mathbf{x} \in \mathbb{R}^d$ and maps it to a (possibly high-dimensional) feature space $\mathcal{F}$. Many algorithms, such as Support Vector Machines, can be formulated so that they use the data only through inner products between mapped vectors $\phi(\mathbf{x})$ and $\phi(\mathbf{y})$:

$$
\langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle.
$$

If we can define a kernel function $k$ such that

$$
k(\mathbf{x}, \mathbf{y}) = \langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle,
$$

then we can work directly with $k(\mathbf{x}, \mathbf{y})$ without ever computing $\phi(\mathbf{x})$ and $\phi(\mathbf{y})$ explicitly.
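As one illustration of working with $k$ alone, the squared distance between two mapped points expands as $\|\phi(\mathbf{x}) - \phi(\mathbf{y})\|^2 = k(\mathbf{x}, \mathbf{x}) - 2\,k(\mathbf{x}, \mathbf{y}) + k(\mathbf{y}, \mathbf{y})$, so it needs only kernel evaluations. The following is a minimal sketch of that identity; the function names are hypothetical, and the plain dot product is used as a stand-in kernel (it corresponds to $\phi(\mathbf{x}) = \mathbf{x}$) so the result can be checked against the ordinary Euclidean distance.

```python
def dot(x, y):
    """Plain inner product; as a kernel it corresponds to phi(x) = x."""
    return sum(a * b for a, b in zip(x, y))


def feature_space_sq_distance(kernel, x, y):
    """||phi(x) - phi(y)||^2 written purely in terms of kernel evaluations."""
    return kernel(x, x) - 2 * kernel(x, y) + kernel(y, y)


x, y = (1.0, 2.0), (4.0, 6.0)
sq_dist = feature_space_sq_distance(dot, x, y)

# With the identity feature map this must agree with the Euclidean distance.
euclidean_sq = sum((a - b) ** 2 for a, b in zip(x, y))
print(sq_dist, euclidean_sq)
```

Any other valid kernel can be passed in place of `dot`; the function never needs to know what $\phi$ looks like.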

## The Second-Degree Polynomial Example

Consider a 2-dimensional input vector:

$$
\mathbf{x} = (x_1, x_2).
$$

A common choice for a second-degree polynomial mapping is:

$$
\phi(x_1, x_2) = \begin{pmatrix}
1 \\
\sqrt{2}\, x_1 \\
\sqrt{2}\, x_2 \\
x_1^2 \\
\sqrt{2}\, x_1 x_2 \\
x_2^2
\end{pmatrix}.
$$

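*Note:* The $\sqrt{2}$ scaling factors are chosen so that each linear and cross term picks up a coefficient of 2 in the feature-space inner product, which is exactly the coefficient those terms have in the expansion of $(1 + \mathbf{x}^\top \mathbf{y})^2$.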

For two points $\mathbf{x} = (x_1, x_2)$ and $\mathbf{y} = (y_1, y_2)$, the inner product in the feature space is:

$$
\begin{aligned}
\langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle &= 1\cdot1 + (\sqrt{2}\, x_1)(\sqrt{2}\, y_1) + (\sqrt{2}\, x_2)(\sqrt{2}\, y_2) \\
&\quad + x_1^2 y_1^2 + (\sqrt{2}\, x_1 x_2)(\sqrt{2}\, y_1 y_2) + x_2^2 y_2^2.
\end{aligned}
$$

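Simplify the terms that carry scaling factors (the squared terms $x_1^2 y_1^2$ and $x_2^2 y_2^2$ are already in final form):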

- $1 \cdot 1 = 1$
- $(\sqrt{2}\, x_1)(\sqrt{2}\, y_1) = 2\, x_1 y_1$
- $(\sqrt{2}\, x_2)(\sqrt{2}\, y_2) = 2\, x_2 y_2$
- $(\sqrt{2}\, x_1 x_2)(\sqrt{2}\, y_1 y_2) = 2\, x_1 x_2\, y_1 y_2$

Thus, the dot product becomes:

$$
\langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle = 1 + 2\,x_1 y_1 + 2\,x_2 y_2 + x_1^2 y_1^2 + 2\,x_1 x_2\, y_1 y_2 + x_2^2 y_2^2.
$$

Now consider the second-degree polynomial kernel:

$$
k(\mathbf{x}, \mathbf{y}) = \big(1 + \mathbf{x}^\top \mathbf{y}\big)^2 = \big(1 + x_1 y_1 + x_2 y_2\big)^2.
$$

Expanding this, we have:

$$
\begin{aligned}
(1 + x_1 y_1 + x_2 y_2)^2 &= 1 + 2\,x_1 y_1 + 2\,x_2 y_2 \\
&\quad + (x_1 y_1)^2 + 2\,x_1 y_1\,x_2 y_2 + (x_2 y_2)^2.
\end{aligned}
$$

Recognize that:

$$
(x_1 y_1)^2 = x_1^2 y_1^2 \quad \text{and} \quad (x_2 y_2)^2 = x_2^2 y_2^2.
$$

Thus, we have:

$$
(1 + x_1 y_1 + x_2 y_2)^2 = 1 + 2\,x_1 y_1 + 2\,x_2 y_2 + x_1^2 y_1^2 + 2\,x_1 x_2\, y_1 y_2 + x_2^2 y_2^2.
$$

This expression is identical to the dot product computed in the feature space:

$$
\langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle.
$$
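The identity can also be checked numerically. The sketch below is one possible way to do so, using only the Python standard library; the names `phi`, `dot`, and `poly_kernel` are chosen here for illustration, and the random test points are arbitrary.

```python
import math
import random


def phi(x):
    """Explicit 6-dimensional feature map from the derivation above."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, r2 * x1 * x2, x2 * x2)


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x, y):
    """k(x, y) = (1 + x . y)^2, computed in the 2-dimensional input space."""
    return (1.0 + dot(x, y)) ** 2


rng = random.Random(0)  # fixed seed so the check is repeatable
pairs = [
    ((rng.uniform(-3, 3), rng.uniform(-3, 3)),
     (rng.uniform(-3, 3), rng.uniform(-3, 3)))
    for _ in range(5)
]

# Largest absolute difference between the explicit and implicit routes.
max_gap = max(abs(dot(phi(x), phi(y)) - poly_kernel(x, y)) for x, y in pairs)
print(max_gap)
```

The two routes should agree up to floating-point rounding, since $\sqrt{2}$ is not exactly representable.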

## The Kernel Trick

The key observation is that instead of explicitly computing $\phi(\mathbf{x})$ (which maps $\mathbf{x}$ into a 6-dimensional space), we can compute the kernel function directly:

$$
k(\mathbf{x}, \mathbf{y}) = \big(1 + \mathbf{x}^\top \mathbf{y}\big)^2.
$$

This takes one 2-dimensional dot product, one addition, and one multiplication, whereas the explicit route builds two 6-dimensional vectors before taking their inner product.
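The gap widens as the input dimension and polynomial degree grow. As a rough illustration, assume the degree-$p$ generalization $k(\mathbf{x}, \mathbf{y}) = (1 + \mathbf{x}^\top \mathbf{y})^p$ on inputs in $\mathbb{R}^d$: its explicit feature map needs one component per monomial of degree at most $p$, which is $\binom{d+p}{p}$ components, while the kernel still needs only a length-$d$ dot product. The helper name `explicit_dim` and the sample values of $d$ and $p$ below are hypothetical choices for this sketch.

```python
import math


def explicit_dim(d, p):
    """Number of monomials of degree <= p in d variables: C(d + p, p)."""
    return math.comb(d + p, p)


for d, p in [(2, 2), (10, 3), (100, 3), (1000, 4)]:
    print(f"d={d:5d}  p={p}  explicit features={explicit_dim(d, p):,}  "
          f"kernel dot-product length={d}")
```

The first row reproduces the 6-dimensional space from the example above.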

## Summary

1. **Mapping:**

   $$
   \phi(x_1, x_2) = \begin{pmatrix}
   1 \\
   \sqrt{2}\, x_1 \\
   \sqrt{2}\, x_2 \\
   x_1^2 \\
   \sqrt{2}\, x_1 x_2 \\
   x_2^2
   \end{pmatrix}
   $$

2. **Inner Product in Feature Space:**

   $$
   \langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle = 1 + 2\,x_1 y_1 + 2\,x_2 y_2 + x_1^2 y_1^2 + 2\,x_1 x_2\, y_1 y_2 + x_2^2 y_2^2.
   $$

3. **Kernel Function:**

   $$
   k(\mathbf{x}, \mathbf{y}) = \big(1 + \mathbf{x}^\top \mathbf{y}\big)^2.
   $$

4. **Kernel Trick:**

   Instead of computing $\phi(\mathbf{x})$ and $\phi(\mathbf{y})$ and then taking their dot product, we compute $k(\mathbf{x}, \mathbf{y})$ directly, at the cost of a single dot product in the original input space.

By using the kernel trick, algorithms such as Support Vector Machines (SVMs) can work in a high-dimensional feature space without explicitly mapping the data, enabling efficient computation and the ability to capture complex, non-linear relationships.
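
To show a learning algorithm that uses the data only through $k$, the sketch below trains a kernel perceptron on the XOR pattern, which is not linearly separable in the original 2-dimensional space. This is not an SVM: the kernel perceptron is swapped in here as a much simpler algorithm that is kernelized in the same way. The function names, the toy data, and the `max_epochs` cap are assumptions made for this illustration.

```python
def poly_kernel(x, y):
    """k(x, y) = (1 + x . y)^2, the second-degree polynomial kernel."""
    return (1 + sum(a * b for a, b in zip(x, y))) ** 2


def decision_score(x, points, labels, alpha, kernel):
    """Score of x as a weighted sum of kernel values; phi is never evaluated."""
    return sum(a * yi * kernel(xi, x) for a, xi, yi in zip(alpha, points, labels))


def train_kernel_perceptron(points, labels, kernel, max_epochs=100):
    """Return alpha, the number of mistakes made on each training point."""
    alpha = [0] * len(points)
    for _ in range(max_epochs):
        mistakes = 0
        for j, (xj, yj) in enumerate(zip(points, labels)):
            # A non-positive margin counts as a mistake and triggers an update.
            if yj * decision_score(xj, points, labels, alpha, kernel) <= 0:
                alpha[j] += 1
                mistakes += 1
        if mistakes == 0:  # a full pass with no mistakes: training has converged
            break
    return alpha


def predict(x, points, labels, alpha, kernel):
    """Predicted label in {-1, +1}."""
    return 1 if decision_score(x, points, labels, alpha, kernel) > 0 else -1


# XOR-style labels: opposite corners of the unit square share a class.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [-1, 1, 1, -1]

alpha = train_kernel_perceptron(points, labels, poly_kernel)
predictions = [predict(x, points, labels, alpha, poly_kernel) for x in points]
print(predictions == labels)
```

The data become separable because the implicit feature space contains the cross term $x_1 x_2$, yet the code only ever calls `poly_kernel` on the original 2-dimensional points.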
