# Understanding Kernels and Gaussian Processes: A Deeper Dive

Welcome back to our **Gaussian Processes (GPs)** blog series! So far, we've introduced GPs as distributions over functions, explored various kernel functions, and seen how to apply them to real-world data like the Mauna Loa CO₂ measurements.

In this post, we'll peel back another layer of abstraction to truly understand what Gaussian Processes and kernels are. We'll address some fundamental questions that often arise when first learning about GPs:

---

## Key Questions

- **Are Gaussian Processes truly probability distributions over functions?**  
    What kind of functions do they define?

- **What exactly are kernels?**  
    Can we think of them as "infinitely large matrices"?

- **What's the connection between GPs and "kernel machines"?**  
    How do concepts from other machine learning courses relate to GPs?

- **If GPs can use infinitely many features, can they learn any function?**

---

This lecture will provide deeper theoretical insights, connecting GPs to concepts from functional analysis and linear algebra. We'll break down each question, build intuition with examples, and use mathematical tools to clarify these powerful ideas.

Stay tuned as we explore the mathematical foundations and practical implications of kernels and Gaussian Processes!

## Recap: What We've Learned So Far & Key Questions

Let's briefly summarize the main ideas from our previous discussions:

- **Inference with Gaussians:**  
    Inference in models involving linear relationships between Gaussian random variables can be performed using linear algebra. This enables us to compute posterior distributions analytically, making Gaussian models both elegant and practical.

- **Features for Non-linearity:**  
    By applying feature mappings $\phi(x)$, we can extend linear models to capture complex, non-linear relationships. This allows us to model a wide variety of real-valued functions across different domains.

- **Feature Learning (Type-II Maximum Likelihood):**  
    We can optimize feature representations—or kernel hyperparameters—by maximizing the marginal likelihood. This approach, also known as Type-II Maximum Likelihood or empirical Bayes, helps us learn the most suitable features for our data.

- **Gaussian Process Models:**  
    Gaussian Processes (GPs) enable us to work with models that effectively use infinitely many features, all within finite computation time. The core components of a GP are its mean function and a positive definite covariance function (also called a Mercer kernel).

---

These foundational ideas naturally lead to several important questions:

- **Are Gaussian Processes a probability measure?**  
    Over what space? Do they truly define "random functions"?

- **What are kernels?**  
    Can we think of them as "infinitely large matrices"?

- **How do kernel machines relate to GPs?**  
    In statistical machine learning, kernel machines are a common concept. What is their connection to Gaussian Processes?

- **If GPs or kernel machines use infinitely many features, can they learn every function?**

---

Understanding these questions will deepen our intuition for Gaussian Processes and kernels, and clarify their role in modern machine learning.

## Goals for Today: Three Key Insights

Today's lecture will focus on three crucial insights into Gaussian Processes (GPs) and kernels:

1. **GPs as Probability Distributions over Functions**  
    - We'll confirm that Gaussian Processes truly define a *probability distribution over functions*.
    - We'll discuss how the associated function space is very general, and why understanding its structure requires a closer look at the kernel.

2. **Connection to Frequentist Kernel Methods**  
    - The covariance function in a GP is a *kernel*, which opens a direct connection to frequentist kernel methods.
    - By comparing these approaches, we can uncover both the relationships and the differences between frequentist and Bayesian perspectives.

3. **Bayesian-Frequentist Interplay & Limitations**  
    - Bayesian analysis can be supported by frequentist tools, and vice versa.
    - However, it's important not to assume these approaches are always equivalent.
    - Nonparametric models, like GPs, are extremely powerful—but not omnipotent. The size of a model class does not necessarily guarantee fast convergence or perfect learning.

---

By the end of this lecture, you'll have a deeper understanding of:

- How GPs define distributions over functions,
- The mathematical and conceptual connections between Bayesian and frequentist kernel methods,
- And the practical limitations and subtleties of nonparametric models in machine learning.

## What is a Gaussian Process? A More Careful Definition

Let's start with a more formal and careful definition of a **Gaussian Process (GP)** as a stochastic process.

---

### **Definition: Gaussian Process**

A **Gaussian Process** $f$ with index set $X$ is a family of $\mathbb{R}$-valued random variables $\omega \mapsto f(x, \omega)$ for $x \in X$ on a common probability space $(\Omega, \mathcal{F}, P)$ such that **every finite collection** $f(x_1, \cdot), \ldots, f(x_n, \cdot)$ of these random variables follows a multivariate Gaussian distribution.

We often simplify the notation to $f(x) := f(x, \cdot)$.

**Key points:**
- For each input $x$, $f(x)$ is a random variable.
- A GP is a collection of such random variables, indexed by $x \in X$.
- The defining property: **any finite subset** of these random variables is jointly Gaussian.

---

### **Definition: Mean and Covariance Function of a GP**

Let $f$ be a Gaussian process.

- The **mean function** $\mu: X \to \mathbb{R}$ is defined as:
    $$
    \mu(x) = \mathbb{E}[f(x)]
    $$

- The **covariance function** (or kernel) $k: X \times X \to \mathbb{R}$ is defined as:
    $$
    k(x_1, x_2) = \operatorname{Cov}[f(x_1), f(x_2)] = \mathbb{E}\left[(f(x_1) - \mu(x_1))(f(x_2) - \mu(x_2))\right]
    $$

Every Gaussian process has a unique mean and covariance function. Conversely, a mean function and a **positive definite kernel** define a unique Gaussian process. This is often written as:
$$
f \sim \mathcal{GP}(\mu, k)
$$

---

### **Intuition: What Does a GP Sample Look Like?**

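Restricted to a finite set of inputs, a **sample path** $f(\cdot, \omega)$ from a GP can be thought of as:
$$
f_X(\omega) = \mu_X + \text{Cholesky}(k_{XX}) \, \omega
$$
where:
- $X = \{x_1, \ldots, x_n\}$ denotes, in this formula, a finite set of points from the index set, and $f_X$ and $\mu_X$ are the vectors of function values and mean values at those points,
- $k_{XX}$ is the covariance matrix evaluated at these points, and $\text{Cholesky}(k_{XX})$ is its lower-triangular Cholesky factor,
- $\omega$ is a vector of $n$ independent standard normal random variables.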

This means that, for any finite set of input points, drawing a sample from a GP is equivalent to drawing from a multivariate normal distribution with the specified mean and covariance.

---

**In summary:**
- A Gaussian Process is a distribution over functions, defined by its mean and covariance functions.
- It generalizes the multivariate normal distribution to infinite (possibly uncountable) index sets.
- GPs are a powerful tool for modeling distributions over functions in a principled, probabilistic way.

## Covariance Functions and Kernels: Every Covariance Function is a Kernel

We've defined a **kernel** as a function that produces symmetric, positive semidefinite matrices. Now, let's formally prove that **every covariance function is indeed a positive definite kernel**.

---

### **Lemma:**  
Every covariance function $k$ is a positive-definite kernel.

---

### **Proof**

Let $v \in \mathbb{R}^n$ be an arbitrary vector, and let $X = \{x_i\}_{i=1}^n \subset \mathcal{X}$ be any finite set of points from the index set.

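We need to show that the matrix $K_{XX}$ with entries  
$$
[K_{XX}]_{ij} = k(x_i, x_j) = \operatorname{Cov}[f(x_i), f(x_j)]
$$  
is symmetric and positive semidefinite. Symmetry is immediate, since $\operatorname{Cov}[f(x_i), f(x_j)] = \operatorname{Cov}[f(x_j), f(x_i)]$. It remains to show that  
$$
v^\top K_{XX} v \geq 0.
$$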

Let $m(x_i)$ denote the mean function evaluated at $x_i$.

Now, consider:
$$
v^\top K_{XX} v = \sum_{i=1}^n \sum_{j=1}^n v_i v_j \, \mathbb{E}\left[(f(x_i) - m(x_i))(f(x_j) - m(x_j))\right]
$$

Since expectation is a linear operator, we can move it outside the sums:
$$
= \mathbb{E}\left[\sum_{i=1}^n \sum_{j=1}^n v_i v_j (f(x_i) - m(x_i))(f(x_j) - m(x_j))\right]
$$

Observe that the terms inside the expectation can be rewritten as a square:
$$
= \mathbb{E}\left[\left(\sum_{i=1}^n v_i (f(x_i) - m(x_i))\right)^2\right]
$$

Let $Z = \sum_{i=1}^n v_i (f(x_i) - m(x_i))$. $Z$ is a random variable.

Therefore,
$$
v^\top K_{XX} v = \mathbb{E}[Z^2]
$$

Since $Z^2$ is always non-negative, its expectation $\mathbb{E}[Z^2]$ must also be non-negative:
$$
v^\top K_{XX} v \geq 0
$$

---

**Conclusion:**  
This proves that every covariance function is a positive-definite kernel. $\square$
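
The inequality in the proof can also be checked numerically. The following is a minimal sketch, assuming a Squared Exponential kernel on eight randomly placed 1-D points (the helper `rbf_kernel` and all variable names are illustrative, not part of the derivation above). It evaluates $v^\top K_{XX} v$ for many random vectors $v$ and compares the result with the smallest eigenvalue of $K_{XX}$:

```python
import jax.numpy as jnp
import jax.random as random


def rbf_kernel(x1, x2, lengthscale=1.0):
    """Squared Exponential kernel matrix for 1-D inputs of shape (N,) and (M,)."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return jnp.exp(-0.5 * sq_dist / lengthscale**2)


key_x, key_v = random.split(random.PRNGKey(0))

# An arbitrary finite set of points from the index set (illustrative choice)
x = random.uniform(key_x, (8,), minval=-3.0, maxval=3.0)
K = rbf_kernel(x, x)

# Quadratic forms v^T K v for 1000 random vectors v
V = random.normal(key_v, (1000, 8))
quad_forms = jnp.einsum("ni,ij,nj->n", V, K, V)

min_quad_form = float(quad_forms.min())
min_eigenvalue = float(jnp.linalg.eigvalsh(K).min())

print(f"smallest v^T K v over 1000 random v: {min_quad_form:.3e}")
print(f"smallest eigenvalue of K:            {min_eigenvalue:.3e}")
```

Up to floating-point round-off, neither quantity should be negative. A random search like this cannot replace the proof, but it is a quick sanity check when implementing a new kernel.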


## Covariance Functions and Kernels: Every Kernel Can Be a Covariance Function of a GP

We've just shown that **every covariance function is a kernel**. But the converse is also true and equally important:

> **Every positive-definite kernel can serve as the covariance function of some Gaussian Process.**

---

### **Lemma**

For every function $m: X \to \mathbb{R}$ and every positive-definite kernel $k: X \times X \to \mathbb{R}$, there exists a Gaussian process $f$ with mean function $m$ and covariance function $k$.

---

### **Why Is This True?**

The proof of this lemma is more advanced and relies on a fundamental result from probability theory called the **Kolmogorov Extension Theorem**.

---

#### **Kolmogorov Extension Theorem (Simplified Statement)**

- Let $I$ be a non-empty index set.
- Suppose we are given a collection of consistent finite-dimensional probability distributions.
- Then, these finite-dimensional distributions uniquely define a probability measure on the infinite-dimensional product space.

---

#### **How Does This Apply to Gaussian Processes?**

- The **index set** $I$ is our input space $X$.
- The **finite-dimensional distributions** are the multivariate Gaussian distributions
    $$
    p(f_X) = \mathcal{N}(f_X; m_X, K_{XX})
    $$
    for any finite set of points $X$.
- **Consistency** means that these distributions "fit together" properly: for example, if you marginalize $p(f_{x_1}, f_{x_2})$ over $f_{x_2}$, you get $p(f_{x_1})$. The properties of Gaussian distributions and positive-definite kernels guarantee this consistency.
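
For Gaussians, marginalizing out variables simply deletes the corresponding entries of the mean vector and the corresponding rows and columns of the covariance matrix. Consistency therefore reduces to a statement about kernel matrices, which can be sketched as follows (assuming a Squared Exponential kernel; `rbf_kernel`, `x_all`, and `keep` are illustrative names):

```python
import jax.numpy as jnp


def rbf_kernel(x1, x2, lengthscale=1.0):
    """Squared Exponential kernel matrix for 1-D inputs of shape (N,) and (M,)."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return jnp.exp(-0.5 * sq_dist / lengthscale**2)


x_all = jnp.array([-2.0, -0.5, 0.3, 1.7, 2.4])
keep = jnp.array([0, 2, 3])  # indices of the points that remain after marginalizing

# Covariance of the joint Gaussian on all five points
K_all = rbf_kernel(x_all, x_all)

# Marginalizing a Gaussian drops the rows and columns of the removed variables
K_marginal = K_all[jnp.ix_(keep, keep)]

# Covariance built directly on the smaller point set
K_direct = rbf_kernel(x_all[keep], x_all[keep])

max_gap = float(jnp.max(jnp.abs(K_marginal - K_direct)))
print(f"max |K_marginal - K_direct| = {max_gap:.2e}")
```

Because each entry of the covariance matrix depends only on the pair $(x_i, x_j)$, the two matrices agree for any kernel, which is the consistency that the Kolmogorov Extension Theorem requires.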

---

### **What Does This Mean in Practice?**

- If you specify **any** mean function $m$ and **any** positive-definite kernel $k$, you are guaranteed that a Gaussian process $f \sim \mathcal{GP}(m, k)$ exists, and its finite-dimensional distributions are uniquely determined by $m$ and $k$.
- This GP will generate finite-dimensional distributions that are exactly the multivariate Gaussians you expect, for any finite collection of input points.

---

### **Summary**

- **Every covariance function is a kernel** (previous lemma).
- **Every positive-definite kernel is a valid covariance function for some GP** (this lemma).
- The Kolmogorov Extension Theorem ensures that our finite-dimensional Gaussian distributions "glue together" to define a true probability distribution over functions.

This is why we can confidently write:
$$
f \sim \mathcal{GP}(m, k)
$$
and know that this uniquely specifies a probability distribution over functions, fully determined by $m$ and $k$.

## Are GPs Probability Distributions over Functions?  
### Sample Paths and Random Functions

Yes, **Gaussian Processes (GPs)** are truly probability distributions over functions. Let's break down what this means and why it's important.

---

### **What Is a Sample Path?**

- For a fixed outcome $\omega \in \Omega$ from the underlying probability space, the function $f(\cdot, \omega): X \to \mathbb{R}$ is called a **sample path** (or realization) of the GP $f$.
- When we plot samples from a GP, we are visualizing these sample paths—each one is a possible function that could be drawn from the GP.

---

### **Random Functions: The Mapping $\omega \mapsto f(\cdot, \omega)$**

- The mapping $\omega \mapsto f(\cdot, \omega)$ is itself a **random variable**, but its values are functions rather than numbers or vectors.
- More precisely, this mapping transforms an outcome $\omega$ into a function, and it maps into a measurable space of functions (for example, $\mathbb{R}^X$, the space of all real-valued functions on $X$).
- This is a direct consequence of the **Kolmogorov Extension Theorem**, which ensures that the collection of finite-dimensional distributions from a GP "glue together" to define a probability measure over functions.

---

### **What Does This Mean?**

- **GPs can truly be interpreted as random functions.**  
    Each draw from a GP is a function, not just a finite-dimensional vector.
- The space $\mathbb{R}^X$ (all real-valued functions on $X$) is extremely large and includes many "wild" or "ill-behaved" functions (e.g., functions that are nowhere continuous).
- However, most GPs used in practice (such as those with Squared Exponential, Matérn, or Wiener kernels) produce sample paths that are at least continuous, and often much smoother.

---

### **Why Does the Kernel Matter?**

- The **kernel** (covariance function) of a GP determines the properties of its sample paths:
    - **Smoothness:** Does the GP produce smooth, continuous, or even differentiable functions?
    - **Roughness:** Can the GP model abrupt changes or only gradual variations?
- For example:
    - The **Squared Exponential kernel** produces sample paths that are infinitely differentiable (very smooth).
    - The **Matérn kernel** can be tuned to produce sample paths with different degrees of smoothness.
    - The **Wiener process** (Brownian motion) kernel produces continuous but nowhere differentiable sample paths.
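
The difference in smoothness shows up in how sample-path increments scale with the grid spacing $h$. Here is a rough sketch, assuming the Wiener kernel $k(s, t) = \min(s, t)$ on $(0, 1]$ and a Squared Exponential kernel with lengthscale $0.2$ (all helper names and parameter values are illustrative). For the Squared Exponential kernel the mean squared increment is of order $h^2$, while for the Wiener kernel it is of order $h$:

```python
import jax
import jax.numpy as jnp
import jax.random as random

# float64 keeps the Cholesky factorization of the RBF matrix numerically stable
jax.config.update("jax_enable_x64", True)


def rbf_kernel(x1, x2, lengthscale=0.2):
    return jnp.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale**2)


def wiener_kernel(x1, x2):
    return jnp.minimum(x1[:, None], x2[None, :])


def sample_paths(kernel, x, key, num_samples=5, jitter=1e-6):
    """Draw zero-mean GP samples at the points x; returns shape (num_samples, N)."""
    K = kernel(x, x) + jitter * jnp.eye(x.shape[0])
    L = jnp.linalg.cholesky(K)  # lower-triangular factor, K = L @ L.T
    z = random.normal(key, (x.shape[0], num_samples))
    return (L @ z).T


def mean_squared_increment(paths):
    """Average of (f(x_{i+1}) - f(x_i))^2 over all grid steps and samples."""
    return float(jnp.mean(jnp.diff(paths, axis=1) ** 2))


x = jnp.linspace(0.005, 1.0, 200)  # grid spacing h = 0.005
key_rbf, key_wiener = random.split(random.PRNGKey(0))

rbf_paths = sample_paths(rbf_kernel, x, key_rbf)
wiener_paths = sample_paths(wiener_kernel, x, key_wiener)

rbf_roughness = mean_squared_increment(rbf_paths)
wiener_roughness = mean_squared_increment(wiener_paths)

print(f"mean squared increment, RBF kernel:    {rbf_roughness:.2e}")
print(f"mean squared increment, Wiener kernel: {wiener_roughness:.2e}")
```

On this grid the Wiener paths should show clearly larger increments than the Squared Exponential paths, and the gap widens as the grid is refined. This is the numerical footprint of "continuous but nowhere differentiable" versus "infinitely differentiable".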

---

### **Summary**

- **Gaussian Processes are probability distributions over functions.**
- Each sample from a GP is a function (a sample path), not just a vector.
- The **kernel** controls the "type" of functions you get—its properties dictate the smoothness, continuity, and other characteristics of the sample paths.
- To understand what kinds of functions a GP can model, always look at the kernel!

---

**In practice:**  
When you use a GP for regression or modeling, you are placing a prior over the space of functions, and the kernel encodes your assumptions about what those functions should look like.

```python
import jax
import jax.numpy as jnp
import jax.random as random
import matplotlib.pyplot as plt
from typing import Callable

# JAX defaults to float32, where a jitter of 1e-6 can be too small for a stable
# Cholesky factorization of an RBF kernel matrix. Enabling float64 avoids NaNs.
jax.config.update("jax_enable_x64", True)


# --- Re-using kernel definitions from previous lectures for completeness ---
# Squared Exponential (RBF) Kernel
def squared_exponential_kernel(
    x1: jnp.ndarray, x2: jnp.ndarray, sigma: float = 1.0, lengthscale: float = 1.0
) -> jnp.ndarray:
    """
    Computes the Squared Exponential (RBF) kernel matrix.
    """
    x1 = jnp.atleast_2d(x1)
    x2 = jnp.atleast_2d(x2)
    sq_dist = jnp.sum((x1[:, None, :] - x2[None, :, :]) ** 2, axis=-1)
    K = sigma**2 * jnp.exp(-0.5 * sq_dist / lengthscale**2)
    return K


# --- Function to Sample from a GP (assuming zero mean for simplicity) ---
def sample_from_gp(
    X: jnp.ndarray,
    kernel_func: Callable,
    num_samples: int = 1,
    key: jax.Array = random.PRNGKey(0),
) -> jnp.ndarray:
    """
    Draws samples from a Gaussian Process with zero mean.

    Args:
        X: Input points to sample at. Shape (N, D).
        kernel_func: Covariance function (kernel).
        num_samples: Number of samples to draw.
        key: JAX PRNG key for reproducibility.

    Returns:
        An array of samples. Shape (num_samples, N).
    """
    # Ensure X is at least 2D
    X = jnp.atleast_2d(X)

    # Compute the covariance matrix
    K_XX = kernel_func(X, X)

    # Add a small jitter for numerical stability in Cholesky decomposition
    jitter = 1e-6 * jnp.eye(X.shape[0])
    K_XX = K_XX + jitter

    # Compute the Cholesky decomposition of the covariance matrix.
    # jnp.linalg.cholesky returns the lower-triangular factor L with K = L @ L.T.
    L = jnp.linalg.cholesky(K_XX)

    # Generate random standard normal variables
    z = random.normal(key, shape=(num_samples, X.shape[0]))

    # Compute the samples: mean + L @ z^T (mean is zero here)
    # L is (N, N), z.T is (N, num_samples). Result is (N, num_samples).
    # Transpose to get (num_samples, N)
    samples = jnp.dot(L, z.T).T

    return samples


# --- Example: Sampling from a GP with RBF Kernel ---
key = random.PRNGKey(123)

# Generate input points
X_test = jnp.linspace(-5.0, 5.0, 100)[:, None]  # Test inputs (needs to be 2D)

# Define the RBF kernel with specific hyperparameters
rbf_sigma = 1.0
rbf_lengthscale = 1.0
rbf_k = lambda x1, x2: squared_exponential_kernel(
    x1, x2, sigma=rbf_sigma, lengthscale=rbf_lengthscale
)

# Sample functions
num_samples = 5
gp_samples = sample_from_gp(X_test, rbf_k, num_samples=num_samples, key=key)

# Plot the samples
plt.figure(figsize=(10, 6))
for i in range(num_samples):
    plt.plot(X_test[:, 0], gp_samples[i, :], alpha=0.7)

# Add the mean function (which is zero here)
plt.plot(
    X_test[:, 0],
    jnp.zeros_like(X_test[:, 0]),
    color="black",
    linestyle="--",
    label="Mean Function",
)

plt.xlabel("x")
plt.ylabel("f(x)")
plt.title("Samples from a Gaussian Process (RBF Kernel)")
plt.legend()  # show the "Mean Function" label
plt.grid(True)
plt.show()
```


## Kernels as "Infinite Matrices": Introduction

So far, we've established several foundational facts:

- **GPs are probability distributions on function spaces.**  
    However, the probability space defined by a GP is only weakly specified by its general construction and lacks much useful structure. To truly understand the nature of the sample space (the set of possible functions), we must study the kernel in detail.

- **Every covariance function is a kernel, and every kernel is a covariance function.**  
    This deep equivalence means that the mathematical object we call a "kernel" is not just a computational trick—it's the very heart of how GPs define distributions over functions.

---

Now, let's tackle a key conceptual question:

> **What are kernels? Can we think of them as "infinitely large matrices"?**

In the next sections, we'll explore this idea, building intuition for how kernels generalize the notion of covariance matrices to infinite-dimensional spaces, and why this perspective is so powerful for understanding Gaussian Processes.

## Quick Linear-Algebra Refresher: Positive Definite Matrices

Before we dive into kernels as "infinite matrices," let's briefly review some key concepts from linear algebra, focusing on symmetric positive-definite matrices.

---

### **Definition: Eigenvalue and Eigenvector**

Let $A \in \mathbb{R}^{n \times n}$ be a matrix.  
A scalar $\lambda \in \mathbb{C}$ and a vector $v \in \mathbb{C}^n$ are called an **eigenvalue** and corresponding **eigenvector** of $A$ if:

$$
A v = \lambda v
$$

Or, written in components:

$$
[Av]_i = \sum_{j=1}^n [A]_{ij} [v]_j = \lambda [v]_i
$$

---

### **Theorem: Spectral Theorem for Symmetric Positive-Definite Matrices**

- If $A$ is a **symmetric matrix** ($A = A^\top$), then:
    - All eigenvalues $\lambda_a$ are **real**.
    - The eigenvectors $\{v_a\}$ form an **orthonormal basis** for $\mathbb{R}^n$.
- If $A$ is also **positive definite** (i.e., $x^\top A x > 0$ for all $x \neq 0$), then **all eigenvalues are positive**: $\lambda_a > 0$ for all $a = 1, \ldots, n$.

Any symmetric positive-definite matrix $A$ can be written as a weighted sum of outer products of its eigenvectors (its eigendecomposition):

$$
[A]_{ij} = \sum_{a=1}^n \lambda_a [v_a]_i [v_a]_j
$$

where:
- $\lambda_a$ are the eigenvalues,
- $v_a$ are the corresponding orthonormal eigenvectors.

---

### **Why Is This Important?**

- This decomposition is fundamental for understanding the structure of covariance matrices and kernels.
- It allows us to interpret a positive-definite matrix as a weighted sum of rank-one matrices (outer products of eigenvectors).
- This perspective will help us generalize to the infinite-dimensional case, which is essential for understanding kernels in Gaussian Processes.

---

## Kernels as Inner Products: Mercer's Theorem

To understand kernels at a deeper level, let's extend the familiar concept of eigenvalues and eigenvectors from finite matrices to the world of functions and kernels. This leads us to the powerful idea of **eigenfunctions** and **Mercer's Theorem**.

---

### **Eigenfunctions and Eigenvalues of a Kernel**

Given a kernel function $k : X \times X \to \mathbb{R}$ and a measure $\nu$ on $X$, an **eigenfunction** $\phi : X \to \mathbb{R}$ and an **eigenvalue** $\lambda \in \mathbb{C}$ satisfy the following integral equation:

$$
\int k(x, \tilde{x}) \, \phi(\tilde{x}) \, d\nu(\tilde{x}) = \lambda \, \phi(x)
$$

- This is the functional analogue of the matrix equation $A v = \lambda v$.
- Here, the kernel $k$ plays the role of the matrix $A$, and the function $\phi$ is analogous to the eigenvector $v$.

---

### **Mercer's Theorem (1909)**

Let $(X, \nu)$ be a finite measure space, and let $k : X \times X \to \mathbb{R}$ be a continuous (Mercer) kernel. Then:

- There exists a countable set of eigenvalues and eigenfunctions $\{ (\lambda_i, \phi_i) \}_{i \in I}$ with respect to $\nu$ such that:
    - $I$ is countable.
    - All $\lambda_i$ are real and non-negative.
    - The eigenfunctions $\phi_i$ can be chosen to be orthonormal in $L^2(X, \nu)$:
      $$
      \int \phi_i(x) \phi_j(x) d\nu(x) = \delta_{ij}
      $$
    - The kernel admits the following **spectral decomposition** (series converges absolutely and uniformly $\nu^2$-almost everywhere):
      $$
      k(a, b) = \sum_{i \in I} \lambda_i \, \phi_i(a) \, \phi_i(b) \qquad \forall a, b \in X
      $$

---

### **Why Is Mercer's Theorem Important?**

- **Analogy to Finite Matrices:**  
  Just as any symmetric positive-definite matrix can be decomposed into a sum of outer products of its eigenvectors, a Mercer kernel can be decomposed into an (infinite) sum of products of its eigenfunctions, weighted by their eigenvalues.
- **Feature Space Interpretation:**  
  This decomposition shows that kernels implicitly define an infinite-dimensional feature space, where each eigenfunction acts as a "feature" and the eigenvalues determine their importance.
- **Foundation for Kernel Methods:**  
  Mercer's Theorem provides the mathematical foundation for kernel methods in machine learning, including Support Vector Machines and Gaussian Processes.

---

### **Summary**

- **Mercer's Theorem** bridges the gap between finite-dimensional linear algebra and infinite-dimensional function spaces.
- It tells us that every continuous, positive-definite kernel can be viewed as an "infinite Gram matrix"—an inner product in a (possibly infinite-dimensional) feature space.
- This perspective is crucial for understanding the power and flexibility of kernel-based models, especially Gaussian Processes.

---

**In essence:**  
Mercer's Theorem reveals that kernels are not just similarity measures—they are inner products in a rich, often infinite-dimensional, space of features defined by the kernel's eigenfunctions.

```python
import jax.numpy as jnp
import matplotlib.pyplot as plt


# --- Re-using kernel definition ---
# Squared Exponential (RBF) Kernel
def squared_exponential_kernel(
    x1: jnp.ndarray, x2: jnp.ndarray, sigma: float = 1.0, lengthscale: float = 1.0
) -> jnp.ndarray:
    """
    Computes the Squared Exponential (RBF) kernel matrix.
    """
    x1 = jnp.atleast_2d(x1)
    x2 = jnp.atleast_2d(x2)
    sq_dist = jnp.sum((x1[:, None, :] - x2[None, :, :]) ** 2, axis=-1)
    K = sigma**2 * jnp.exp(-0.5 * sq_dist / lengthscale**2)
    return K


# --- Example: Eigenvalues and Eigenvectors of a Finite Kernel Matrix ---
# This is an analogy to Mercer's Theorem for infinite-dimensional kernels.
# For a finite set of points, the kernel matrix is symmetric positive semidefinite,
# and thus has real, non-negative eigenvalues and orthogonal eigenvectors.

# Define a finite set of input points
X_finite = jnp.linspace(-3.0, 3.0, 20)[:, None]  # 20 points

# Compute the kernel matrix for these points
K_finite = squared_exponential_kernel(X_finite, X_finite, sigma=1.0, lengthscale=1.0)

# Compute eigenvalues and eigenvectors
# jnp.linalg.eigh returns eigenvalues in ascending order
eigenvalues, eigenvectors = jnp.linalg.eigh(K_finite)

# Sort eigenvalues in descending order and reorder eigenvectors accordingly
idx = eigenvalues.argsort()[::-1]  # Get indices for descending order
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]  # Reorder columns of eigenvectors matrix

print(f"Top 5 Eigenvalues:\n{eigenvalues[:5]}")
# The smallest eigenvalues should be positive or close to zero
# (round-off can make them slightly negative)
print(f"Smallest 5 Eigenvalues:\n{eigenvalues[-5:]}")

# Plot the first few eigenvectors (analogous to eigenfunctions)
plt.figure(figsize=(12, 8))
for i in range(min(5, len(eigenvalues))):  # Plot up to 5 eigenvectors
    # Scale eigenvectors by sqrt of eigenvalues for visualization,
    # as per Karhunen-Loève expansion form (lambda_i^1/2 * phi_i(x))
    plt.plot(
        X_finite[:, 0],
        eigenvectors[:, i] * jnp.sqrt(eigenvalues[i]),
        label=f"Eigenvector {i + 1} (scaled by $\\sqrt{{\\lambda_{i + 1}}}$)",
    )

plt.xlabel("x")
plt.ylabel("Value")
plt.title(
    "First Few Eigenvectors of a Finite RBF Kernel Matrix (Analogy to Eigenfunctions)"
)
plt.legend()
plt.grid(True)
plt.show()

# --- Verify Reconstruction (Mercer's Theorem Analogy) ---
# We can reconstruct the kernel matrix from its eigenvalues and eigenvectors
# K_reconstructed = Sum_{i} lambda_i * v_i * v_i^T
K_reconstructed = jnp.dot(eigenvectors * eigenvalues, eigenvectors.T)

# Check if the reconstruction is close to the original matrix
print(
    f"\nMax absolute difference between original and reconstructed K: {jnp.max(jnp.abs(K_finite - K_reconstructed)):.2e}"
)
```


## Are Kernels Infinitely Large Positive Definite Matrices? Kind of...

In the sense of **Mercer's theorem**, you can *loosely* think of a kernel $k : X \times X \to \mathbb{R}$, evaluated at $a, b \in X$, as the entry $k_{ab}$ of an **infinitely large** matrix. The spectral decomposition of the kernel is:

$$
k(a, b) = \sum_{i \in I} \lambda_i \, \phi_i(a) \, \phi_i(b)
$$

This can also be written more compactly using a **feature map**:

$$
\Phi(x) = [\sqrt{\lambda_1} \, \phi_1(x), \sqrt{\lambda_2} \, \phi_2(x), \ldots]^T
$$

so that

$$
k(a, b) = \langle \Phi(a), \Phi(b) \rangle
$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product in this (possibly infinite-dimensional) feature space.

> **Note:**  
> This interpretation depends on the measure $\nu$ on $X$ used in the integral equation for the eigenfunctions. In practice, it is often *not* straightforward to find the eigenfunctions analytically for arbitrary kernels.
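
A finite-dimensional analogue of this feature map can be built from the eigendecomposition of a kernel matrix. This is a minimal sketch, assuming a Squared Exponential kernel on a grid of 20 points (the names `rbf_kernel`, `Phi`, and `top` are illustrative). Row $i$ of `Phi` plays the role of $\Phi(x_i)$, with entries $\sqrt{\lambda_a}\,[v_a]_i$, and inner products of rows recover the kernel values:

```python
import jax.numpy as jnp


def rbf_kernel(x1, x2, lengthscale=1.0):
    """Squared Exponential kernel matrix for 1-D inputs of shape (N,) and (M,)."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return jnp.exp(-0.5 * sq_dist / lengthscale**2)


x = jnp.linspace(-3.0, 3.0, 20)
K = rbf_kernel(x, x)

# Eigendecomposition of the symmetric matrix K (eigenvalues in ascending order)
eigenvalues, eigenvectors = jnp.linalg.eigh(K)
# Round-off can make tiny eigenvalues slightly negative; clamp before the square root
eigenvalues = jnp.maximum(eigenvalues, 0.0)

# Finite feature map: Phi[i, a] = sqrt(lambda_a) * [v_a]_i
Phi = eigenvectors * jnp.sqrt(eigenvalues)

# Inner products of feature vectors reproduce the kernel matrix
K_from_features = Phi @ Phi.T
max_error = float(jnp.max(jnp.abs(K - K_from_features)))
print(f"max |K - Phi Phi^T| with all 20 features: {max_error:.2e}")

# Keeping only the features with the largest eigenvalues gives an approximation
top = 6
Phi_top = Phi[:, -top:]
max_error_top = float(jnp.max(jnp.abs(K - Phi_top @ Phi_top.T)))
print(f"max |K - Phi Phi^T| with top {top} features: {max_error_top:.2e}")
```

This matrix version is only an analogy. The eigenvectors depend on the chosen grid, just as the eigenfunctions in Mercer's theorem depend on the measure $\nu$.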

---

### Better Questions to Ask

- **Do the eigenfunctions span a space like the eigenvectors of a matrix?**
- **What is that space? Is it the sample space of a GP?**

These questions naturally lead us to the concept of **Reproducing Kernel Hilbert Spaces (RKHS)**, which provide a rigorous mathematical framework for understanding the function spaces associated with kernels.

---

**Summary:**

- Kernels can be viewed as "infinite matrices" via their spectral decomposition.
- The feature map $\Phi(x)$ embeds inputs into a (possibly infinite-dimensional) space where the kernel acts as an inner product.
- Understanding the space spanned by the eigenfunctions leads to the powerful theory of RKHS, which we will explore next.

## Transition to RKHS and Kernel Machines

So far, we've established several key insights:

- **Gaussian Processes (GPs) are probability distributions over function spaces.**
- **Every covariance function is a kernel, and every kernel is a covariance function.**
- **Kernels have eigenfunctions, just as matrices have eigenvectors.**  
    This means we can think of kernels as a kind of "infinite matrix" that spans a space of functions.

---

Now, let's explore two fundamental questions:

1. **What is the space spanned by the eigenfunctions of a kernel?**  
     What does it mean for a kernel to define a "space of functions," and how is this space constructed?

2. **What is the connection to "kernel machines"?**  
     How do these ideas relate to kernel-based algorithms in machine learning, such as Support Vector Machines (SVMs) and Gaussian Processes?

---

In the next sections, we'll introduce the concept of the **Reproducing Kernel Hilbert Space (RKHS)**, which provides a rigorous mathematical framework for understanding the function spaces associated with kernels. We'll also see how this connects to practical machine learning methods known as kernel machines.

## Gaussian Processes, By Any Other Name

Gaussian Process (GP) regression is one of the most deeply studied and widely used models in statistics and machine learning. Interestingly, it has appeared under various names across different fields, reflecting its fundamental nature and broad applicability.

### Equivalent and Closely Related Names

- **Kriging**  
    - Originates from geostatistics and spatial analysis.
    - Used for spatial interpolation and prediction of unknown values based on observed data.

- **Kernel Ridge Regression**  
    - A regularized regression method in machine learning.
    - Closely connected to GP regression through the use of kernels and feature spaces (we will explore this connection in detail).

- **Wiener–Kolmogorov Prediction**  
    - Emerges from time series analysis and signal processing.
    - Focuses on optimal linear prediction of future values in a stochastic process.

- **Linear Least-Squares Regression (in a generalized feature space)**  
    - The classical regression method, extended to infinite-dimensional feature spaces via kernels.

---

These different names reflect the various historical contexts and research communities that independently discovered or developed aspects of what we now unify under the umbrella of **Gaussian Processes**.

> **Key Takeaway:**  
> Whether you encounter the term Kriging, kernel ridge regression, or Wiener–Kolmogorov prediction, you are often dealing with the same underlying mathematical ideas as Gaussian Process regression—just viewed through different disciplinary lenses.

## The Gaussian Posterior Mean as a Least-Squares Estimate (Kernel Ridge Regression)

One of the most profound connections in Gaussian Process (GP) theory is the equivalence between the **GP posterior mean** and the solution to a regularized least-squares problem, known as **Kernel Ridge Regression (KRR)**.

---

### **Setup: GP Regression with Gaussian Likelihood**

- **Prior:**  
    $$ p(f) = \mathcal{GP}(f; 0, k) $$
    where $k$ is the kernel (covariance function).

- **Likelihood:**  
    $$ p(y \mid f) = \mathcal{N}(y; f_X, \sigma^2 I) $$
    where $f_X$ denotes the vector of function values at the observed inputs $X$.

- **Posterior:**  
    The posterior over function values at $X$ is:
    $$
    p(f_X \mid y) = \frac{p(y \mid f_X) p(f_X)}{p(y)}
    $$

---

### **Posterior Mean as a Solution to Regularized Least Squares**

The **posterior mean** at a new input $x$ is:
$$
m(x) = \mathbb{E}_{p(f \mid y)}[f(x)]
$$

It turns out that the posterior mean function $m$ is the element of the **Reproducing Kernel Hilbert Space (RKHS)** $\mathcal{H}_k$ that minimizes the following regularized $\ell_2$ loss:
$$
L(f) = \frac{1}{2\sigma^2} \| y - f_X \|^2 + \frac{1}{2} \| f \|_{\mathcal{H}_k}^2
$$

- $\| f \|_{\mathcal{H}_k}^2$ is the RKHS norm of $f$, acting as a regularizer.
- Minimizing $L(f)$ is exactly the objective of **Kernel Ridge Regression**.

---

### **Explicit Formula for the Posterior Mean**

The solution (posterior mean) is given by:
$$
m(x) = k_{xX} (k_{XX} + \sigma^2 I)^{-1} y
$$

- $k_{xX}$: kernel vector between test point $x$ and training points $X$
- $k_{XX}$: kernel matrix on training points
- $\sigma^2$: noise variance
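
The equivalence can be checked numerically. The sketch below makes two assumptions that are not spelled out above: it uses a toy data set (noisy samples of $\sin$), and it restricts the minimization to functions of the form $f = \sum_i \alpha_i k(\cdot, x_i)$, as justified by the representer theorem. For such functions $f_X = k_{XX}\alpha$ and $\|f\|_{\mathcal{H}_k}^2 = \alpha^\top k_{XX} \alpha$, so the loss becomes a function of $\alpha$ whose gradient should vanish at $\alpha = (k_{XX} + \sigma^2 I)^{-1} y$:

```python
import jax
import jax.numpy as jnp
import jax.random as random

jax.config.update("jax_enable_x64", True)  # float64 for a clean gradient check


def rbf_kernel(x1, x2, lengthscale=1.0):
    """Squared Exponential kernel matrix for 1-D inputs of shape (N,) and (M,)."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return jnp.exp(-0.5 * sq_dist / lengthscale**2)


# Toy data (illustrative): noisy observations of sin(x)
key_x, key_noise = random.split(random.PRNGKey(0))
X_train = jnp.sort(random.uniform(key_x, (15,), minval=-3.0, maxval=3.0))
noise_std = 0.1
y = jnp.sin(X_train) + noise_std * random.normal(key_noise, (15,))

K = rbf_kernel(X_train, X_train)

# GP posterior mean weights: alpha = (k_XX + sigma^2 I)^{-1} y
alpha = jnp.linalg.solve(K + noise_std**2 * jnp.eye(15), y)


def posterior_mean(x_new):
    """m(x) = k_xX (k_XX + sigma^2 I)^{-1} y for an array of test inputs."""
    return rbf_kernel(x_new, X_train) @ alpha


def krr_loss(a):
    """L(f) for f = sum_i a_i k(., x_i): f_X = K a and ||f||^2 = a^T K a."""
    residual = y - K @ a
    return 0.5 / noise_std**2 * (residual @ residual) + 0.5 * (a @ K @ a)


# The gradient of the kernel ridge regression loss should vanish at the GP weights
grad_norm = float(jnp.linalg.norm(jax.grad(krr_loss)(alpha)))
print(f"gradient norm of the KRR loss at the GP weights: {grad_norm:.2e}")
print("posterior mean at x = 0 and x = 1:", posterior_mean(jnp.array([0.0, 1.0])))
```

Since the loss is a convex quadratic in $\alpha$, a vanishing gradient confirms that the GP posterior mean weights are a global minimizer of the regularized least-squares objective.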

---

### **Key Insights**

- The **Bayesian posterior mean** of a GP is **identical** to the solution of a specific regularized least-squares problem in the frequentist framework.
- This highlights a deep connection between Bayesian and frequentist approaches:  
    **Many seemingly different models are fundamentally linked.**

---

### **Historical Context**

The connection to least-squares and regularization dates back over 200 years to mathematicians like **Adrien-Marie Legendre** and **Carl-Friedrich Gauss**, who developed the method of least squares. Today, this classical idea underpins modern kernel methods and Gaussian Process regression.

---

**In summary:**  
The GP posterior mean is not just a Bayesian prediction—it is also the optimal solution to a regularized least-squares problem in an infinite-dimensional feature space defined by the kernel.

# Reproducing Kernel Hilbert Spaces (RKHS): Two Definitions

The **eigenfunctions** of a kernel span a very important space of functions called the **Reproducing Kernel Hilbert Space (RKHS)**. This space is central to understanding what a Gaussian Process (GP) can "learn" or represent.

---

## Definition: Reproducing Kernel Hilbert Space (RKHS)

Let $\mathcal{H} = (X, \langle \cdot, \cdot \rangle_{\mathcal{H}})$ be a Hilbert space of functions $f : X \to \mathbb{R}$.  
$\mathcal{H}$ is called a **reproducing kernel Hilbert space** if there exists a kernel $k : X \times X \to \mathbb{R}$ such that:

1. **Kernel Functions Belong to the Space:**  
    For all $x \in X$, the function $k(\cdot, x) \in \mathcal{H}$.

2. **Reproducing Property:**  
    For all $f \in \mathcal{H}$ and all $x \in X$,
    $$
    \langle f(\cdot), k(\cdot, x) \rangle_{\mathcal{H}} = f(x)
    $$
    This means that evaluating a function at $x$ is equivalent to taking its inner product with the kernel function $k(\cdot, x)$.

---

## Theorem (Aronszajn, 1950)

> For every positive definite kernel $k$ on $X$, there exists a **unique** RKHS $\mathcal{H}_k$ associated with $k$.

---

## Intuition

- The kernel $k(\cdot, x)$ acts like a **generalized identity function** or a "delta function" in the RKHS.
- It allows us to "reproduce" the value of any function in the space by taking an inner product.
- The RKHS is the smallest Hilbert space of functions containing all finite linear combinations of kernel functions $k(\cdot, x)$, with the inner product defined so that the reproducing property holds.
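
For finite kernel expansions the reproducing property can be verified directly. The sketch below assumes the standard construction in which $\langle k(\cdot, x_i), k(\cdot, x_j) \rangle_{\mathcal{H}} = k(x_i, x_j)$ is extended bilinearly, so the inner product of two expansions is a quadratic form in the kernel matrix. The points, coefficients, and helper names (`rbf`, `f`, `inner`) are hypothetical.

```python
import jax.numpy as jnp


def rbf(x1, x2, lengthscale=1.0):
    """Squared-exponential kernel matrix for 1-D inputs."""
    return jnp.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale**2)


centers = jnp.array([-1.0, 0.0, 2.0])  # hypothetical points x_i
a = jnp.array([0.5, -1.0, 2.0])        # coefficients of f


def f(x):
    """f(x) = sum_i a_i k(x, x_i)."""
    return rbf(jnp.atleast_1d(x), centers) @ a


def inner(a_coef, a_pts, b_coef, b_pts):
    """<sum_i a_i k(., p_i), sum_j b_j k(., q_j)> = a^T k(P, Q) b."""
    return a_coef @ rbf(a_pts, b_pts) @ b_coef


x_query = jnp.array([0.7])
# k(., x_query) is itself a kernel expansion with a single coefficient 1
lhs = inner(a, centers, jnp.array([1.0]), x_query)
rhs = f(x_query)[0]
print(f"<f, k(., x)> = {float(lhs):.6f},  f(x) = {float(rhs):.6f}")

# The same quadratic form gives the squared RKHS norm of f
norm_sq = inner(a, centers, a, centers)
print(f"||f||^2 = {float(norm_sq):.6f}")
```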

---

## Why Is RKHS Important?

- The RKHS associated with a kernel $k$ describes the set of functions that can be "expressed" or "learned" by kernel methods (including GPs and SVMs).
- The **smoothness** and **complexity** of functions in the RKHS are controlled by the choice of kernel.
- In practice, the RKHS provides a rigorous mathematical framework for understanding the power and limitations of kernel-based learning algorithms.

---

## Example: RKHS for the RBF Kernel

- For the popular **RBF (Squared Exponential) kernel**, the associated RKHS consists of very smooth functions.
- The more "wiggly" the kernel (e.g., Matérn kernels with low smoothness), the larger and rougher the RKHS.

---

> **Summary:**  
> The RKHS is the function space "spanned" by the kernel's eigenfunctions. It is the natural habitat for kernel methods and provides deep insight into what kinds of functions a GP or kernel machine can represent.

## What is the RKHS? (1) The Space of Possible Posterior Mean Functions

A key insight in kernel methods and Gaussian Processes (GPs) is that the **Reproducing Kernel Hilbert Space (RKHS)** associated with a kernel $k$ is precisely the space of all possible posterior mean functions that can be obtained from a GP with that kernel.

---

### Theorem: Reproducing Kernel Map Representation

Let $X$, $\nu$, and $\{(\phi_i, \lambda_i)\}_{i \in I}$ be as defined in Mercer's Theorem. Let $\{x_i\}_{i \in I} \subset X$ be a countable collection of points in $X$. Then, the RKHS $\mathcal{H}_k$ can be written as the space of linear combinations of kernel functions:

$$
\mathcal{H}_k = \left\{ f(x) := \sum_{i \in I} \tilde{\alpha}_i\, k(x_i, x) \right\}
$$

with the inner product

$$
\langle f, g \rangle_{\mathcal{H}_k} := \sum_{i \in I} \sum_{j \in I} \tilde{\alpha}_i \tilde{\beta}_j\, k(x_i, x_j)
$$

where $f(x) = \sum_{i} \tilde{\alpha}_i k(x_i, x)$ and $g(x) = \sum_{i} \tilde{\beta}_i k(x_i, x)$.

---

### GP Posterior Mean Functions Live in the RKHS

Consider a Gaussian process prior $p(f) = \mathcal{GP}(0, k)$ with likelihood $p(y \mid f, X) = \mathcal{N}(y; f_X, \sigma^2 I)$. The posterior mean function is given by:

$$
\mu(x) = k_{xX} (k_{XX} + \sigma^2 I)^{-1} y
$$

Let $w = (k_{XX} + \sigma^2 I)^{-1} y$. Then,

$$
\mu(x) = k_{xX} w = \sum_{i=1}^n w_i\, k(x, x_i)
$$

This shows that the posterior mean function is a **finite linear combination of kernel functions** centered at the training data points.

---

### Why Does This Matter?

- **The RKHS is the space of all possible posterior mean functions of GP regression.**
- To understand what a GP can "learn," we must analyze the RKHS associated with its kernel.
- The **posterior mean** (the GP's best estimate of the function) always lives in this space, regardless of the data.

---

### Key Takeaways

- The RKHS provides a rigorous mathematical framework for understanding the expressiveness and limitations of kernel-based learning algorithms.
- The choice of kernel determines the "shape" and "smoothness" of functions in the RKHS, and thus what the GP can represent.
- This connection is fundamental to the statistical learning theory of RKHSs and underpins much of modern machine learning with kernels.

---

> **Summary:**  
> The RKHS associated with a kernel $k$ is the natural habitat for the posterior mean functions of GP regression. Understanding the RKHS is crucial for understanding what your GP model can and cannot learn.

```python
import jax.numpy as jnp
import jax.random as random
import matplotlib.pyplot as plt
from jax.scipy.linalg import solve  # Solve linear systems instead of forming an explicit inverse
from typing import Callable


# --- Re-using kernel definitions from previous lectures for completeness ---
def squared_exponential_kernel(
    x1: jnp.ndarray, x2: jnp.ndarray, sigma: float = 1.0, lengthscale: float = 1.0
) -> jnp.ndarray:
    """
    Computes the Squared Exponential (RBF) kernel matrix.
    """
    x1 = jnp.atleast_2d(x1)
    x2 = jnp.atleast_2d(x2)
    sq_dist = jnp.sum((x1[:, None, :] - x2[None, :, :]) ** 2, axis=-1)
    K = sigma**2 * jnp.exp(-0.5 * sq_dist / lengthscale**2)
    return K


# GP Prediction Function (as used in previous lectures)
def gp_predict(
    X_train: jnp.ndarray,
    y_train: jnp.ndarray,
    X_test: jnp.ndarray,
    mean_func: Callable[[jnp.ndarray], jnp.ndarray],
    kernel_func: Callable[[jnp.ndarray, jnp.ndarray], jnp.ndarray],
    noise_variance: float = 1e-6,
) -> tuple[jnp.ndarray, jnp.ndarray]:
    """
    Performs Gaussian Process regression prediction.
    """
    X_train = jnp.atleast_2d(X_train)
    X_test = jnp.atleast_2d(X_test)
    # Flatten targets so that (n, 1) inputs do not broadcast against the (n,) mean
    y_train = jnp.ravel(y_train)

    K_train_train = kernel_func(X_train, X_train) + noise_variance * jnp.eye(
        X_train.shape[0]
    )
    K_test_train = kernel_func(X_test, X_train)
    K_test_test = kernel_func(X_test, X_test)

    K_train_train_inv_y_diff = solve(K_train_train, y_train - mean_func(X_train))
    mu_pred = mean_func(X_test) + jnp.dot(K_test_train, K_train_train_inv_y_diff)

    K_train_train_inv_K_test_train_T = solve(K_train_train, K_test_train.T)
    Sigma_pred = K_test_test - jnp.dot(K_test_train, K_train_train_inv_K_test_train_T)

    return mu_pred, Sigma_pred


# --- Example: GP Regression to illustrate Posterior Mean (which lies in RKHS) ---
key = random.PRNGKey(456)
# Use independent keys for the inputs and the noise
key_x, key_noise = random.split(key)

# Generate synthetic data
X_train = jnp.sort(
    random.uniform(key_x, shape=(10, 1), minval=-4, maxval=4), axis=0
)  # sort along the sample axis
noise = random.normal(key_noise, shape=(X_train.shape[0],)) * 0.3
y_train = jnp.sin(X_train[:, 0]) + noise  # 1-D targets of shape (10,)

# Define mean and kernel functions
zero_mean_func = lambda x: jnp.zeros(x.shape[0])
rbf_kernel_for_pred = lambda x1, x2: squared_exponential_kernel(
    x1, x2, sigma=1.0, lengthscale=1.0
)
obs_noise_variance = 0.1**2

# Generate test points
X_test = jnp.linspace(-5, 5, 100)[:, None]

# Perform GP prediction
mu_pred, Sigma_pred = gp_predict(
    X_train,
    y_train,
    X_test,
    zero_mean_func,
    rbf_kernel_for_pred,
    noise_variance=obs_noise_variance,
)

# Guard against tiny negative variances caused by round-off
predictive_std = jnp.sqrt(jnp.maximum(jnp.diag(Sigma_pred), 0.0))

# Plot the results
plt.figure(figsize=(10, 6))
plt.scatter(X_train[:, 0], y_train, label="Training Data", zorder=2)
plt.plot(X_test[:, 0], mu_pred, label="GP Posterior Mean (in RKHS)", color="red")
plt.fill_between(
    X_test[:, 0],
    mu_pred - 2 * predictive_std,
    mu_pred + 2 * predictive_std,
    color="red",
    alpha=0.2,
    label="Posterior mean ± 2 std (approx. 95% credible interval)",
)
plt.xlabel("X")
plt.ylabel("y")
plt.title("Gaussian Process Regression: Posterior Mean in RKHS")
plt.legend()
plt.grid(True)
plt.show()

# The posterior mean can be expressed as a linear combination of kernel functions
# centered at the training data points.
# This explicitly shows its membership in the RKHS.
# The weights 'alpha' are computed during the GP prediction:
# alpha = (K_train_train + noise_variance * I)^-1 * (y_train - mean_func(X_train))
K_train_train_noisy = rbf_kernel_for_pred(
    X_train, X_train
) + obs_noise_variance * jnp.eye(X_train.shape[0])
alpha_weights = solve(K_train_train_noisy, y_train - zero_mean_func(X_train))

# Reconstruct posterior mean using alpha_weights and kernel functions
mu_pred_reconstructed = jnp.dot(
    rbf_kernel_for_pred(X_test, X_train), alpha_weights
) + zero_mean_func(X_test)

print(
    f"\nMax absolute difference between direct mu_pred and reconstructed mu_pred: {jnp.max(jnp.abs(mu_pred - mu_pred_reconstructed)):.2e}"
)
print(
    "This confirms the posterior mean is a linear combination of kernel functions, residing in the RKHS."
)
```


# What is the Meaning of Uncertainty?  
## Frequentist Interpretation of Posterior Variance

In **Bayesian Gaussian Processes (GPs)**, the posterior variance  
$$
v(x) = \operatorname{Var}[f(x) \mid y]
$$  
quantifies our uncertainty about the function value at a point $x$ given the observed data $y$. Interestingly, this Bayesian measure of uncertainty has a strong **Frequentist interpretation**.

---

### Theorem: Posterior Variance as a Worst-Case Error in the RKHS

Assume a GP prior $p(f) = \mathcal{GP}(f; 0, k)$ and **noise-free observations** $p(y \mid f) = \delta(y - f_X)$ (where $\delta$ is the Dirac delta function, implying exact observations). The GP posterior variance (the expected squared error) is:
$$
v(x) := \mathbb{E}_{p(f \mid y)}\left[(f(x) - m(x))^2\right] = k_{xx} - k_{xX} K_{XX}^{-1} k_{Xx}
$$

This variance is also the **worst-case squared error** of the posterior mean over the unit ball of the RKHS. Here $m$ is the posterior mean computed from the noise-free observations $y = f_X$ of the same function $f$ over which the supremum is taken:
$$
v(x) = \sup_{f \in \mathcal{H}_k,\, \|f\|_{\mathcal{H}_k} \leq 1} (m(x) - f(x))^2
$$

---

### Proof Sketch

1. **Start with the supremum:**
    $$
    \sup_{f \in \mathcal{H},\, \|f\| \leq 1} (m(x) - f(x))^2
    $$
    where $m(x) = \sum_i f(x_i) [K_{XX}^{-1} k(X, x)]_i$.

2. **Use the reproducing property:**  
    For any $f \in \mathcal{H}_k$,
    $$
    f(x) = \langle f(\cdot), k(\cdot, x) \rangle_{\mathcal{H}}
    $$

3. **Rewrite the difference as an inner product:**
    $$
    m(x) - f(x) = \left\langle \sum_i [K_{XX}^{-1} k(X, x)]_i k(\cdot, x_i) - k(\cdot, x),\, f(\cdot) \right\rangle_{\mathcal{H}}
    $$

4. **Apply Cauchy-Schwarz inequality:**  
    $$
    | \langle a, b \rangle | \leq \|a\| \|b\| \implies \sup_{\|f\| \leq 1} \langle a, f \rangle^2 = \|a\|^2
    $$

5. **Compute the RKHS norm:**
    $$
    \left\| \sum_i [K_{XX}^{-1} k(X, x)]_i k(\cdot, x_i) - k(\cdot, x) \right\|_{\mathcal{H}}^2
    $$

6. **Expand using the RKHS inner product:**
    $$
    = \sum_{ij} [K_{XX}^{-1} k(X, x)]_i [K_{XX}^{-1} k(X, x)]_j k(x_i, x_j)
    - 2 \sum_i [K_{XX}^{-1} k(X, x)]_i k(x, x_i)
    + k(x, x)
    $$

    This simplifies to:
    $$
    = k_{xx} - k_{xX} K_{XX}^{-1} k_{Xx}
    $$
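
The steps of the proof sketch can be replayed numerically. The following is a minimal sketch under assumed inputs (four hypothetical training points, one test point, an RBF kernel with unit lengthscale). It computes $v(x)$ from the closed form, computes the squared RKHS norm from step 5 as a quadratic form, and then evaluates the error for the unit-norm function that attains the supremum.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # double precision for a clean check


def rbf(x1, x2, lengthscale=1.0):
    """Squared-exponential kernel matrix for 1-D inputs."""
    return jnp.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale**2)


X = jnp.array([-2.0, -1.0, 0.5, 1.5])  # hypothetical training inputs
x = jnp.array([0.0])                   # hypothetical test point

K = rbf(X, X)
k_Xx = rbf(X, x)[:, 0]
w = jnp.linalg.solve(K, k_Xx)          # K_XX^{-1} k(X, x)

# Posterior variance from the closed form k_xx - k_xX K_XX^{-1} k_Xx
v = rbf(x, x)[0, 0] - k_Xx @ w

# a = sum_i w_i k(., x_i) - k(., x) is a kernel expansion over the points
# (X, x) with coefficients (w, -1); its squared norm is coef^T K_all coef.
pts = jnp.concatenate([X, x])
coef = jnp.concatenate([w, jnp.array([-1.0])])
a_norm_sq = coef @ rbf(pts, pts) @ coef

# Worst-case function f* = a / ||a||, which has unit RKHS norm
f_coef = coef / jnp.sqrt(a_norm_sq)
f_at_X = rbf(X, pts) @ f_coef          # noise-free observations of f*
f_at_x = (rbf(x, pts) @ f_coef)[0]
m_at_x = w @ f_at_X                    # posterior mean built from those observations
worst_case = (m_at_x - f_at_x) ** 2

print(f"v(x)            = {float(v):.8f}")
print(f"||a||^2         = {float(a_norm_sq):.8f}")
print(f"(m - f*)^2 at x = {float(worst_case):.8f}")
```

In this construction the worst-case function vanishes at every training input, so the data carry no information about it and the posterior mean at $x$ is zero, while the function itself is as large at $x$ as the unit-norm constraint allows.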

---

### Interpretation

- The **Bayesian posterior variance** at $x$ is equal to the **maximum possible squared difference** between the posterior mean $m(x)$ and any function $f$ in the RKHS with $\|f\|_{\mathcal{H}_k} \leq 1$.
- In essence, it tells us *how far off* our posterior mean could be from any "well-behaved" function in the RKHS, even in the best-case scenario of noise-free observations.

---

### Why Is This Powerful?

- The GP's **expected squared error** (a Bayesian quantity) is equivalent to the RKHS's **worst-case squared error** for functions of bounded norm (a Frequentist quantity).
- This suggests that Bayesians, in a sense, "expect the worst" when it comes to uncertainty, as their posterior variance provides a robust bound.

> **Note:** The posterior variance $v(x)$ itself is generally **not** an element of $\mathcal{H}_k$.

---

### Summary

- **Bayesian posterior variance** quantifies uncertainty about $f(x)$ after observing data.
- **Frequentist interpretation:** It is the worst-case squared error between the posterior mean and any function in the RKHS of bounded norm.
- This duality bridges Bayesian and Frequentist perspectives, deepening our understanding of uncertainty in kernel methods and GPs.

# Are Bayesian and Frequentist Analysis the Same Thing?

Let's recap the key connections we've uncovered so far:

- **GPs are probability distributions on function spaces.**
- **Every covariance function is a kernel, and every kernel is a covariance function.**
- **Kernels have eigenfunctions, just as matrices have eigenvectors.**  
    This means we can think of kernels as a kind of "infinite matrix" that spans a space of functions.
- **That space is the Reproducing Kernel Hilbert Space (RKHS).**  
    The RKHS is identical to the space of all possible posterior mean functions of GP regression.
- **The posterior covariance function (the Bayesian average squared error) is a worst-case squared error in the RKHS.**

---

These deep connections might make it tempting to conclude that **Bayesian and Frequentist analyses are essentially the same**—at least for Gaussian Processes. However, it's important to be cautious:

- While there are strong equivalences for specific quantities (like the posterior mean and variance in GPs), the **philosophical foundations**, **interpretation of probability**, and the **treatment of uncertainty** often differ significantly between the two approaches.
- For example, the Bayesian framework interprets probability as a degree of belief, while the Frequentist approach treats probability as a long-run frequency of events.
- The way uncertainty is quantified and interpreted can also diverge, especially outside the special case of GPs.

---

> **Key Point:**  
> The mathematical overlap between Bayesian and Frequentist perspectives in GPs is profound, but the two frameworks are not identical in general. Their interpretations and practical implications can differ, especially when we move beyond the quantities where their answers coincide.

---

In the next sections, we'll highlight a crucial difference regarding the **sample paths** of a Gaussian Process, and see where the Bayesian and Frequentist perspectives start to diverge.

## What is the RKHS? (2) Representation in Terms of Eigenfunctions

The **Reproducing Kernel Hilbert Space (RKHS)** can also be characterized using the eigenfunctions of its kernel, providing a powerful and intuitive perspective on the structure of this function space.

---

### **Theorem (Mercer Representation)**

Let $X$ be a compact metric space, $k$ a continuous kernel on $X$, and $\nu$ a finite Borel measure whose support is $X$. Let $\{ (\phi_i, \lambda_i) \}_{i \in I}$ be the eigenfunctions and eigenvalues of $k$ with respect to $\nu$. Then, the RKHS $\mathcal{H}_k$ is given by:

$$
\mathcal{H}_k = \left\{ f(x) := \sum_{i \in I} \alpha_i \lambda_i^{1/2} \phi_i(x) \;\; \Bigg| \;\; \|f\|^2_{\mathcal{H}_k} := \sum_{i \in I} \alpha_i^2 < \infty \right\}
$$

with the inner product

$$
\langle f, g \rangle_{\mathcal{H}_k} := \sum_{i \in I} \alpha_i \beta_i
$$

where $f(x) = \sum_{i \in I} \alpha_i \lambda_i^{1/2} \phi_i(x)$ and $g(x) = \sum_{i \in I} \beta_i \lambda_i^{1/2} \phi_i(x)$.

---

### **Interpretation**

- **Infinite Linear Combinations:**  
    Every function in the RKHS can be written as an (infinite) linear combination of the kernel's eigenfunctions, weighted by $\lambda_i^{1/2}$ and coefficients $\alpha_i$.
- **Square-Summable Coefficients:**  
    The coefficients $\alpha_i$ must satisfy $\sum_{i \in I} \alpha_i^2 < \infty$. This ensures that the function has finite "complexity" or "smoothness" as measured by the RKHS norm.
- **Inner Product Structure:**  
    The inner product in the RKHS is simply the dot product of the coefficient sequences $(\alpha_i)$ and $(\beta_i)$.

---

### **Why Is This Useful?**

- **Feature Space View:**  
    This representation makes it clear that the RKHS is a space of functions built from the kernel's eigenfunctions, analogous to how a vector space is built from its basis vectors.
- **Smoothness Control:**  
    The decay of the eigenvalues $\lambda_i$ controls the smoothness and richness of the RKHS. Faster decay means smoother functions.
- **Practical Computation:**  
    In practice, for finite datasets, we often approximate functions in the RKHS using only the leading eigenfunctions (those with largest eigenvalues).
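
A discrete analogue of this spectral view is easy to compute. The sketch below is an illustration under assumptions (a uniform grid on $[0, 1]$ standing in for the measure $\nu$, an RBF kernel with lengthscale 0.2, and a truncation rank `r = 10`). It uses the eigenvectors of the Gram matrix as stand-ins for the eigenfunctions and measures how well the leading ones reconstruct the kernel.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)


def rbf(x1, x2, lengthscale=1.0):
    """Squared-exponential kernel matrix for 1-D inputs."""
    return jnp.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale**2)


n = 200
grid = jnp.linspace(0.0, 1.0, n)
K = rbf(grid, grid, lengthscale=0.2)

# K / n approximates the kernel integral operator under the uniform measure
evals, evecs = jnp.linalg.eigh(K / n)
evals, evecs = evals[::-1], evecs[:, ::-1]  # sort in descending order

# Rank-r reconstruction from the leading eigenpairs
r = 10
K_r = n * (evecs[:, :r] * evals[:r]) @ evecs[:, :r].T
rel_err = jnp.linalg.norm(K - K_r) / jnp.linalg.norm(K)

print("Leading eigenvalues:", evals[:5])
print(f"Relative error of the rank-{r} reconstruction: {float(rel_err):.2e}")
```

For a smooth kernel such as the RBF the eigenvalues decay very quickly, so a handful of eigenpairs already captures almost all of the kernel matrix.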

---

### **Summary**

- The RKHS associated with a kernel $k$ consists of all functions that can be expressed as square-summable linear combinations of the kernel's eigenfunctions.
- This spectral view provides deep insight into the expressiveness and limitations of kernel methods, including Gaussian Processes and Support Vector Machines.

---

> **Key Takeaway:**  
> The RKHS is the "natural habitat" for kernel-based learning algorithms. Understanding its structure through eigenfunctions and eigenvalues is essential for grasping what kinds of functions your model can represent and learn.

# What About the Samples? GP Samples Are *Not* in the RKHS!

This is a subtle but important point that often surprises beginners:  
**While the posterior mean of a Gaussian Process (GP) always lives in the RKHS, the sample paths (i.e., individual functions drawn from the GP) almost never do—especially when the RKHS is infinite-dimensional.**

---

## Why Aren't GP Samples in the RKHS?

To understand this, let's introduce the **Karhunen–Loève Expansion**, which provides a spectral decomposition for stochastic processes:

### **Theorem (Karhunen–Loève Expansion, Simplified)**

Let $X$ be a compact metric space, $k : X \times X \to \mathbb{R}$ a continuous kernel, $\nu$ a finite Borel measure whose support is $X$, and $\{ (\phi_i, \lambda_i) \}_{i \in I}$ the eigenfunctions and eigenvalues of $k$ (from Mercer's theorem).  
Let $\{ z_i \}_{i \in I}$ be i.i.d. standard normal random variables ($z_i \sim \mathcal{N}(0, 1)$).

Then, a sample path $f$ from the GP can be written as:
$$
f(x) = \sum_{i \in I} z_i \lambda_i^{1/2} \phi_i(x) \sim \mathcal{GP}(0, k)
$$

- Each sample path is an **infinite sum** of eigenfunctions, weighted by random Gaussian coefficients and the square roots of the eigenvalues.
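
On a finite grid the expansion becomes a concrete sampling recipe. The sketch below is a discretized stand-in for the theorem, not the theorem itself: the eigenvectors of the Gram matrix replace the eigenfunctions, and the grid size, lengthscale, and number of samples are assumptions. It draws samples as $V \Lambda^{1/2} z$ and checks that their empirical covariance approaches the kernel matrix.

```python
import jax.numpy as jnp
import jax.random as random


def rbf(x1, x2, lengthscale=1.0):
    """Squared-exponential kernel matrix for 1-D inputs."""
    return jnp.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale**2)


n, num_samples = 50, 20_000
grid = jnp.linspace(0.0, 1.0, n)
K = rbf(grid, grid, lengthscale=0.3)

evals, evecs = jnp.linalg.eigh(K)
evals = jnp.maximum(evals, 0.0)  # remove tiny negative values from round-off

# f = sum_i z_i sqrt(lambda_i) v_i with z_i ~ N(0, 1), one column per sample
z = random.normal(random.PRNGKey(0), (n, num_samples))
samples = evecs @ (jnp.sqrt(evals)[:, None] * z)

emp_cov = samples @ samples.T / num_samples
max_err = jnp.max(jnp.abs(emp_cov - K))
print(f"Max deviation between empirical covariance and K: {float(max_err):.3f}")
```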

---

## **Why Don't These Sample Paths Belong to the RKHS?**

Recall that the RKHS norm of a function $f$ (with expansion $f(x) = \sum_{i \in I} \alpha_i \lambda_i^{1/2} \phi_i(x)$) is:
$$
\| f \|_{\mathcal{H}_k}^2 = \sum_{i \in I} \alpha_i^2
$$

For a GP sample path, the coefficients are random: $\alpha_i = z_i$.  
So, the expected RKHS norm squared is:
$$
\mathbb{E}\left[ \| f \|_{\mathcal{H}_k}^2 \right] = \mathbb{E}\left[ \sum_{i \in I} z_i^2 \right] = \sum_{i \in I} \mathbb{E}[z_i^2] = \sum_{i \in I} 1
$$

- **If $I$ is infinite, this sum diverges:** $\sum_{i \in I} 1 = \infty$.
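- An infinite expectation alone does not settle the question, but the strong law of large numbers does: $\frac{1}{n} \sum_{i=1}^{n} z_i^2 \to 1$ almost surely, so $\sum_{i \in I} z_i^2 = \infty$ with probability 1. A GP sample path therefore has **infinite RKHS norm** and does **not** belong to the RKHS.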
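
The divergence is easy to observe for a single draw. In the sketch below (an illustration with an assumed truncation at 100,000 terms), the partial sums of $z_i^2$ play the role of the truncated RKHS norm of one sample path, and they grow roughly linearly in the number of terms.

```python
import jax.numpy as jnp
import jax.random as random

key = random.PRNGKey(0)
z = random.normal(key, shape=(100_000,))  # coefficients of one sample path

# Truncated squared RKHS norm after the first n terms of the expansion
partial_norm_sq = jnp.cumsum(z**2)

for n in (10, 1_000, 100_000):
    value = float(partial_norm_sq[n - 1])
    print(f"n = {n:>7d}: sum of z_i^2 = {value:10.1f}, ratio to n = {value / n:.3f}")
```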

---

## **Corollary (Wahba, 1990; see also Kanagawa et al., Thm. 4.9)**

> If the set of eigenfunctions is infinite, then for $f \sim \mathcal{GP}(0, k)$, almost surely $f \notin \mathcal{H}_k$.

---

## **Intuitive Summary**

- **Posterior mean functions** of a GP always live in the RKHS.
- **Sample paths** from a GP are "rougher" or "more complex" than any function in the RKHS, and almost never belong to it when the RKHS is infinite-dimensional.
- This distinction is crucial for understanding the expressiveness and limitations of GPs and kernel methods.

---

> **Key Takeaway:**  
> The RKHS is the space of "well-behaved" functions that kernel methods can learn as mean predictions, but the random functions sampled from a GP prior are almost always much wilder!

# What About the Samples? Sample Spaces of Gaussian Processes

Understanding the **sample space** of a Gaussian Process (GP) is subtle and foundational for kernel methods and probabilistic modeling. Here, we clarify what kinds of functions GP samples are, and how this relates to the kernel and the RKHS.

---

## Function Spaces Containing GP Samples

- The **precise sample space** of a GP is often difficult to characterize exactly.
- Instead, we typically identify *function spaces* that are large enough to contain (almost all) GP sample paths.

### Why Does This Matter?

- If you know your target function has certain properties (e.g., continuity, differentiability), you should **choose a kernel** whose GP sample paths have matching properties. This can dramatically improve learning efficiency.
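
The effect of the kernel on sample regularity can be made visible with a small experiment. The sketch below is an illustration under assumptions: it compares the RBF kernel with the exponential (Matérn-1/2) kernel, both with unit lengthscale, and uses the mean squared increment between neighboring grid points as a crude roughness measure. The helper names `sample_paths` and `mean_sq_increment` are hypothetical.

```python
import jax.numpy as jnp
import jax.random as random


def rbf(x1, x2, lengthscale=1.0):
    """Squared-exponential kernel: very smooth sample paths."""
    return jnp.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / lengthscale**2)


def matern12(x1, x2, lengthscale=1.0):
    """Exponential (Matern-1/2) kernel: continuous but non-differentiable paths."""
    return jnp.exp(-jnp.abs(x1[:, None] - x2[None, :]) / lengthscale)


def sample_paths(key, K, num_samples):
    """Draw samples from N(0, K) via an eigendecomposition (robust to round-off)."""
    evals, evecs = jnp.linalg.eigh(K)
    root = evecs * jnp.sqrt(jnp.maximum(evals, 0.0))  # K ~= root @ root.T
    z = random.normal(key, (K.shape[0], num_samples))
    return root @ z  # shape (n, num_samples)


def mean_sq_increment(paths):
    """Average squared difference between neighboring grid values."""
    return jnp.mean(jnp.diff(paths, axis=0) ** 2)


grid = jnp.linspace(0.0, 5.0, 200)
key_smooth, key_rough = random.split(random.PRNGKey(1))
smooth = sample_paths(key_smooth, rbf(grid, grid), 50)
rough = sample_paths(key_rough, matern12(grid, grid), 50)

print(f"RBF increments:        {float(mean_sq_increment(smooth)):.2e}")
print(f"Matern-1/2 increments: {float(mean_sq_increment(rough)):.2e}")
```

If the target function is known to be smooth, the much larger increments of the Matérn-1/2 samples indicate a prior that spends most of its mass on functions that are rougher than needed.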

### Common Function Spaces for GP Samples

- $\mathbb{R}^X$: The space of all real-valued functions on $X$ (too large for practical use).
- **Banach space $C(X)$**: The space of continuous functions on $X$.
- **Banach space $C^k(X)$**: The space of $k$-times continuously differentiable functions (useful for modeling derivatives, e.g., in Bayesian optimization).
- **Sobolev spaces $W_2^k(X)$**: Spaces of functions with square-integrable derivatives up to order $k$ (important in PDE inference and functional analysis).

---

## GP Samples Are *Not* in the RKHS! But Almost...

- **Key fact:** While the **posterior mean** of a GP always lies in the RKHS associated with the kernel, **sample paths drawn from the GP almost never do** (when the RKHS is infinite-dimensional).
- However, GP samples do belong to a slightly larger space: a *power* of the RKHS, defined in the theorem below.

---

## Theorem: GP Samples in Interpolated RKHS Spaces

**(Kanagawa, 2018; restricted from Steinwart, 2017; generalizing Driscoll, 1973)**

Let $\mathcal{H}_k$ be the RKHS of kernel $k$, and $0 < \theta \leq 1$. Define the $\theta$-power of the RKHS as:

$$
\mathcal{H}_k^\theta = \left\{ f(x) := \sum_{i \in I} \alpha_i \lambda_i^{\theta/2} \phi_i(x) \;\; \Bigg| \;\; \|f\|^2_{\mathcal{H}_k^\theta} := \sum_{i \in I} \alpha_i^2 < \infty \right\}
$$

with inner product

$$
\langle f, g \rangle_{\mathcal{H}_k^\theta} := \sum_{i \in I} \alpha_i \beta_i
$$

where $f(x) = \sum_{i} \alpha_i \lambda_i^{\theta/2} \phi_i(x)$ and $g(x) = \sum_{i} \beta_i \lambda_i^{\theta/2} \phi_i(x)$.

If

$$
\sum_{i \in I} \lambda_i^{1-\theta} < \infty,
$$

then for $f \sim \mathcal{GP}(0, k)$, we have $f \in \mathcal{H}_k^\theta$ with probability 1.

---

### **Interpretation**

- **GP samples are "almost" in the RKHS**: They live in a slightly larger space, $\mathcal{H}_k^\theta$, for any $\theta < 1$ (provided the eigenvalues decay fast enough).
- The faster the eigenvalues $\lambda_i$ decay, the closer the GP samples are to being in the RKHS.
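
The summability condition follows from rewriting a Karhunen–Loève sample in the basis of $\mathcal{H}_k^\theta$. The short derivation below is added for illustration, and the polynomial decay rate used afterwards is an assumed example, not a property of any particular kernel:

$$
f(x) = \sum_{i \in I} z_i \lambda_i^{1/2} \phi_i(x)
= \sum_{i \in I} \underbrace{z_i \lambda_i^{(1-\theta)/2}}_{\alpha_i} \lambda_i^{\theta/2} \phi_i(x)
\quad \Longrightarrow \quad
\mathbb{E}\left[ \|f\|^2_{\mathcal{H}_k^\theta} \right]
= \sum_{i \in I} \lambda_i^{1-\theta}\, \mathbb{E}[z_i^2]
= \sum_{i \in I} \lambda_i^{1-\theta}
$$

A finite expectation implies that the norm is finite with probability 1. For instance, if $\lambda_i = i^{-p}$ with $p > 1$, then $\sum_i i^{-p(1-\theta)}$ converges exactly when $p(1-\theta) > 1$, that is, when $\theta < 1 - 1/p$. For $\theta = 1$ the sum reduces to $\sum_{i \in I} 1$, which diverges, in agreement with the earlier result that samples do not lie in $\mathcal{H}_k$ itself.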

---

## **Summary**

- The choice of kernel determines not only the RKHS (the space of possible posterior means), but also the broader function space where GP samples live.
- **GP samples are typically rougher and more complex than any function in the RKHS**, but they are still "well-behaved" in a precise mathematical sense.
- Understanding these spaces helps you select kernels that encode the right assumptions for your modeling task, leading to better learning and generalization.

---

# Summary: Understanding Kernels and Gaussian Processes (GPs)

This lecture provided a deeper, more theoretical understanding of **Gaussian Processes (GPs)** and their underlying **kernels**. Here are the key takeaways, organized for clarity and accessibility:

---

## 1. GPs as Distributions over Functions

- **Gaussian Processes** are truly probability distributions over spaces of functions.
- However, the probability space defined by a GP is only *weakly specified* by its general construction and lacks much useful structure.
- To understand what kinds of functions a GP can generate, we must study the **kernel** in detail.

---

## 2. Kernel–Covariance Equivalence

- **Every covariance function is a kernel**, and **every kernel can serve as the covariance function of a GP**.
- This fundamental equivalence means that specifying a valid kernel is sufficient to define a GP.

---

## 3. Kernels as "Infinite Matrices"

- Kernels have **eigenfunctions**, just as matrices have eigenvectors.
- You can think of a kernel as a kind of *infinite matrix* that spans a space of functions.
- This perspective is formalized by **Mercer's Theorem**, which connects kernels to infinite-dimensional feature spaces.

---

## 4. Reproducing Kernel Hilbert Space (RKHS)

- The space spanned by the kernel's eigenfunctions is called the **Reproducing Kernel Hilbert Space (RKHS)**.
- The RKHS is **identical to the space of all possible posterior mean functions** in GP regression.
- This directly connects GPs to frequentist kernel methods, such as **Kernel Ridge Regression**.

---

## 5. Posterior Variance as Worst-Case Error

- The **posterior covariance function** (the Bayesian average squared error) is also a **worst-case squared error in the RKHS**.
- This reveals a deep connection between Bayesian uncertainty and frequentist error bounds.

---

## 6. GP Samples vs. RKHS

- **GP samples do not lie in the RKHS**. While the posterior mean always lives in the RKHS, the functions sampled from a GP are generally "rougher" and more complex.
- To study GP samples, we often use a *power* of the RKHS (a slightly larger function space), but not the RKHS itself.
- This highlights a key distinction:  
    - **What a GP can generate (samples) is generally rougher than what it can learn or represent (posterior means).**

---

## Final Thoughts

A thorough understanding of the mathematical properties of kernels and RKHSs is essential for the effective application and interpretation of Gaussian Processes.  
While GPs are powerful and flexible tools, their true capabilities and limitations are best understood through the lens of kernel theory and functional analysis.

---

> **Key Takeaway:**  
> Mastering kernels and RKHSs unlocks a deeper intuition for GPs, bridging Bayesian and frequentist perspectives, and empowering you to design better models for real-world data.