Understanding Kernels and GPs: A Deeper Dive

Welcome back to our Gaussian Processes (GPs) blog series! So far, we've introduced GPs as distributions over functions, explored various kernel functions, and seen how to apply them to real-world data like the Mauna Loa CO2 measurements.

In this lecture, we'll peel back another layer of abstraction to truly understand what Gaussian Processes and kernels are. We'll address some fundamental questions that often arise when first learning about GPs:

    Are Gaussian Processes truly probability distributions over functions? What kind of functions do they define?

    What exactly are kernels? Can we think of them as "infinitely large matrices"?

    What's the connection between GPs and "kernel machines" you might have heard about in other machine learning courses?

    If GPs can use infinitely many features, can they learn any function?

This lecture will provide deeper theoretical insights, connecting GPs to concepts from functional analysis and linear algebra.

Recap: What We've Seen So Far

Let's quickly recap the key points from our previous discussions:

    Inference with Gaussians: We've seen that inference in models involving linear relationships between Gaussian random variables primarily relies on linear algebra operations. This allows for analytical solutions for posterior distributions.

    Features for Non-linearity: Feature mappings ϕ(x) enable linear models to learn complex, non-linear (real-valued) functions across various domains.

    Feature Learning (Type-II Maximum Likelihood): We can learn optimal feature representations (or kernel hyperparameters) by maximizing the marginal likelihood (also known as Type-II Maximum Likelihood or empirical Bayes).

    Gaussian Process Models: GPs allow us to work with models that effectively utilize infinitely many features in finite time. Their core components are a mean function and a positive definite covariance function, which is also called a Mercer kernel.

These points set the stage for our deeper exploration today.

Goals for Today: Three Key Insights

Today's lecture aims to provide three crucial insights into Gaussian Processes and kernels:

    GPs as Probability Distributions over Functions: We will confirm that Gaussian processes indeed define a "probability distribution over functions." We'll also discuss that the associated function space is very general, and understanding its structure requires a closer look at the kernel.

    Connection to Frequentist Kernel Methods: The fact that the covariance function is a kernel establishes a strong connection to Frequentist kernel methods (like Kernel Ridge Regression or Support Vector Machines). By comparing Bayesian and Frequentist analyses, we can uncover relationships and differences in their philosophical approaches.

    Bayesian-Frequentist Interplay & Limitations: We'll see that Bayesian analysis can be supported by Frequentist tools, and vice versa. However, it's crucial to be careful not to jump to conclusions about their equivalence. We'll also touch upon the idea that while non-parametric models are powerful, they are not omnipotent; the size of a model class doesn't necessarily translate into superior convergence rates.

What is a Gaussian Process? A More Careful Definition

Let's start with a more formal and careful definition of a Gaussian Process as a stochastic process.

Definition (Gaussian Process)
A Gaussian process $f$ with index set $\mathbb{X}$ is a family $(\omega \mapsto f(x,\omega))_{x \in \mathbb{X}}$ of $\mathbb{R}$-valued random variables on a common probability space $(\Omega, \mathcal{F}, P)$ such that every finite collection $f(x_1,\cdot), \dots, f(x_n,\cdot)$ of these random variables follows a multivariate Gaussian distribution. We often simplify the notation to $f(x) := f(x,\cdot)$.

This definition emphasizes that for each input point x, f(x) is a random variable. A Gaussian Process is a collection of such random variables, indexed by x∈X. The key is that any finite subset of these random variables jointly follows a Gaussian distribution.

Definition (Mean and Covariance Function of a GP)
Let f be a Gaussian process.

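    The function $\mu: \mathbb{X} \to \mathbb{R}$, $x \mapsto E[f(x)]$ is called the mean function of $f$.

    The function $k: \mathbb{X} \times \mathbb{X} \to \mathbb{R}$, $(x_1, x_2) \mapsto \operatorname{Cov}[f(x_1), f(x_2)] = E[(f(x_1) - \mu(x_1))(f(x_2) - \mu(x_2))]$ is called the covariance function of $f$.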

Every Gaussian process has a unique mean and covariance function. Conversely, as the next sections show, a mean function and a positive definite kernel determine a Gaussian process uniquely in distribution. This allows us to use the notation f∼GP(μ,k).

Conceptually, a sample path f(⋅,ω) from a GP can be thought of as:
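$f_X(\omega) = \mu_X + \operatorname{chol}(K_{XX})\,\omega$ (simplified to a finite set of inputs $X$)
where $\omega$ represents a realization from a standard Gaussian distribution and $\operatorname{chol}(K_{XX})$ is the lower Cholesky factor of the covariance matrix.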

Covariance Functions and Kernels: Every Covariance Function is a Kernel

We've defined a kernel as a function that produces symmetric positive semidefinite matrices. Now, let's formally prove that every covariance function is indeed a positive definite kernel.

Lemma: Every covariance function k is a positive-definite kernel.

Proof:
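Let $v \in \mathbb{R}^n$ be an arbitrary vector and $X = \{x_i\}_{i=1}^n \subset \mathbb{X}$ be any finite set of points from the index set $\mathbb{X}$.
We need to show that the matrix $K_{XX}$ with entries $[K_{XX}]_{ij} = k(x_i, x_j) = \operatorname{Cov}[f(x_i), f(x_j)]$ is symmetric positive semidefinite. Symmetry is immediate because $\operatorname{Cov}[f(x_i), f(x_j)] = \operatorname{Cov}[f(x_j), f(x_i)]$, so it remains to show $v^\top K_{XX} v \ge 0$.

Let $m(x_i) := E[f(x_i)]$ be the mean function evaluated at $x_i$.
$v^\top K_{XX} v = \sum_{i=1}^n \sum_{j=1}^n v_i v_j \operatorname{Cov}[f(x_i), f(x_j)]$
By the definition of covariance, $\operatorname{Cov}[f(x_i), f(x_j)] = E[(f(x_i) - m(x_i))(f(x_j) - m(x_j))]$.
So,
$v^\top K_{XX} v = \sum_{i=1}^n \sum_{j=1}^n v_i v_j E[(f(x_i) - m(x_i))(f(x_j) - m(x_j))]$

Since expectation is a linear operator, we can move the expectation outside the sums:
$v^\top K_{XX} v = E\left[\sum_{i=1}^n \sum_{j=1}^n v_i v_j (f(x_i) - m(x_i))(f(x_j) - m(x_j))\right]$

Now, observe the terms inside the expectation. The double sum factorizes into a product of two identical sums, i.e., a square:
$\left(\sum_{i=1}^n v_i (f(x_i) - m(x_i))\right)\left(\sum_{j=1}^n v_j (f(x_j) - m(x_j))\right) = \left(\sum_{i=1}^n v_i (f(x_i) - m(x_i))\right)^2$

Let $Z = \sum_{i=1}^n v_i (f(x_i) - m(x_i))$. $Z$ is a random variable.
Then, $v^\top K_{XX} v = E[Z^2]$.

Since $Z^2$ is always non-negative, its expectation $E[Z^2]$ must also be non-negative.
Therefore, $v^\top K_{XX} v \ge 0$.

This proves that every covariance function is a positive-definite kernel.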
□
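
The key identity in the proof, $v^\top K_{XX} v = E[Z^2]$, can be checked numerically. The following is a minimal NumPy sketch rather than part of the lecture's own code: the helper `rbf_kernel`, the input grid, the mean values, and the sample size are all illustrative assumptions.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel matrix for 1-D inputs a (n,) and b (m,)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * diff**2 / lengthscale**2)

rng = np.random.default_rng(0)

x = np.linspace(-2.0, 2.0, 5)       # finite set of inputs (assumed for illustration)
K = rbf_kernel(x, x)                # covariance matrix K_XX
m = np.sin(x)                       # arbitrary mean values m(x_i)
v = rng.standard_normal(x.size)     # arbitrary vector v

# Draw joint samples of (f(x_1), ..., f(x_n)) and form Z = sum_i v_i (f(x_i) - m(x_i))
f = rng.multivariate_normal(m, K, size=200_000)
Z = (f - m) @ v

mc_estimate = np.mean(Z**2)         # Monte Carlo estimate of E[Z^2]
exact = v @ K @ v                   # quadratic form v^T K_XX v

print(f"v^T K v = {exact:.4f}, Monte Carlo E[Z^2] = {mc_estimate:.4f}")
```

Both numbers should agree up to Monte Carlo error, and the quadratic form is non-negative because it is the expectation of a square.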


Covariance Functions and Kernels: Every Kernel Can Be a Covariance Function

We've just shown that every covariance function is a kernel. The converse is also true: every positive-definite kernel can serve as the covariance function of some Gaussian Process.

Lemma: For every function m:X→R and every positive-definite kernel k:X×X→R, there exists a Gaussian process f with mean function m and covariance function k.

The proof for this lemma is more involved and typically relies on advanced probability theory, specifically the Kolmogorov Extension Theorem.

Theorem (Kolmogorov Extension Theorem - Simplified):
Let I be a non-empty index set. Assume we are given consistent finite-dimensional probability distributions. Then these finite-dimensional distributions uniquely define a probability measure on the infinite-dimensional product space.

In the context of GPs:

    The "index set" is X.

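    The "finite-dimensional probability distributions" are the multivariate Gaussian distributions $p(f_X) = \mathcal{N}(f_X; m_X, K_{XX})$ for any finite set of points $X \subset \mathbb{X}$.

    The "consistency" requirement means that these finite-dimensional distributions must be compatible: for example, marginalizing $p(f_{x_1}, f_{x_2})$ over $f_{x_2}$ must give $p(f_{x_1})$. Gaussian distributions satisfy this because marginalizing a Gaussian simply drops the corresponding entries of the mean vector and the corresponding rows and columns of the covariance matrix, and the remaining entries $m(x_i)$ and $k(x_i, x_j)$ do not depend on which other points were included.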

This theorem ensures that if we have a valid mean function and a valid kernel, we can indeed construct a unique Gaussian process that generates these finite-dimensional Gaussian distributions. This is why we can confidently use the notation f∼GP(m,k) to uniquely identify a Gaussian process by its mean and covariance functions.
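
The consistency requirement can be illustrated on a small example. In this sketch (NumPy, with a hypothetical mean function `mean_fn` and kernel helper `rbf_kernel` chosen only for illustration), marginalizing the joint Gaussian on four points down to two of them gives exactly the Gaussian one would build directly on those two points.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel matrix for 1-D inputs a (n,) and b (m,)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * diff**2 / lengthscale**2)

def mean_fn(x):
    """Hypothetical mean function, used only for this illustration."""
    return np.sin(x)

x_all = np.array([-1.0, 0.0, 0.5, 2.0])   # larger finite set of inputs
keep = [0, 2]                              # indices of the subset we marginalize to

# Joint Gaussian on all four points
m_all = mean_fn(x_all)
K_all = rbf_kernel(x_all, x_all)

# Marginalizing a Gaussian: keep the matching entries / rows / columns
m_marginal = m_all[keep]
K_marginal = K_all[np.ix_(keep, keep)]

# Gaussian built directly on the subset
m_direct = mean_fn(x_all[keep])
K_direct = rbf_kernel(x_all[keep], x_all[keep])

consistent = np.allclose(m_marginal, m_direct) and np.allclose(K_marginal, K_direct)
print("finite-dimensional distributions consistent:", consistent)
```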

Are GPs Probability Distributions over Functions? Sample Paths and Random Functions

Yes, Gaussian Processes are indeed probability distributions over functions. Let's clarify what this means.

For a fixed outcome ω∈Ω from the underlying probability space, the function f(⋅,ω):X→R is called a sample path or a realization of the GP f. When we plot samples from a GP, we are visualizing these sample paths.

The mapping ω↦f(⋅,ω) itself is a well-defined random variable, whose values are the sample paths of f. More precisely, this mapping transforms an outcome ω into a function, and it maps into a measurable space of functions (e.g., $\mathbb{R}^{\mathbb{X}}$, the space of all real-valued functions on the index set $\mathbb{X}$). This is another consequence of the Kolmogorov Extension Theorem.

This means GPs can truly be interpreted as random functions.

However, the space of all real-valued functions $\mathbb{R}^{\mathbb{X}}$ is extremely large and contains many "ill-behaved" functions (e.g., functions that are nowhere continuous). Most of the GPs we've encountered (like those with Squared Exponential, Matérn, or Wiener kernels) produce sample paths that are at least continuous, or even smooth.

In many applications, we might want to reason about properties like differentiability of functions, or even their derivatives. To understand the specific characteristics of the functions that a GP can generate (i.e., the "sample space" of the GP), we need to delve deeper into the properties of its kernel. The kernel's properties dictate the smoothness and other characteristics of the sample paths.

Kernels as 'Infinite Matrices': Eigenvalues and Eigenfunctions

Can we think of kernels as "infinitely large matrices"? In a sense, yes, and this analogy is made precise through the concept of eigenfunctions and eigenvalues.

First, a quick refresher on finite-dimensional matrices:

Definition (Eigenvalue and Eigenvector): Let $A \in \mathbb{R}^{n \times n}$ be a matrix. A scalar $\lambda \in \mathbb{C}$ and a nonzero vector $v \in \mathbb{C}^n$ are called an eigenvalue and corresponding eigenvector of $A$ if $Av = \lambda v$.

Theorem (Spectral Theorem for Symmetric Positive-Definite Matrices):
The eigenvalues of a symmetric matrix $A = A^\top$ are real, and its eigenvectors can be chosen real and orthonormal; those belonging to nonzero eigenvalues form an orthonormal basis for the image of $A$. A symmetric positive definite matrix $A$ can be written as a sum over its eigenvalues and eigenvectors:
$[A]_{ij} = \sum_{a=1}^n \lambda_a [v_a]_i [v_a]_j$
where $\lambda_a > 0$ for all $a = 1, \dots, n$. This is also known as a Gramian representation.

Now, extending this to functions and kernels:

Definition (Eigenfunction and Eigenvalue of a Kernel):
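A function $\phi: \mathbb{X} \to \mathbb{R}$ and a scalar $\lambda \in \mathbb{C}$ that obey the integral equation
$\int k(x, \tilde{x})\, \phi(\tilde{x})\, d\nu(\tilde{x}) = \lambda\, \phi(x)$
are called an eigenfunction and eigenvalue of the kernel $k$ with respect to a measure $\nu$. This is the functional analogue of $Av = \lambda v$.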

Theorem (Mercer's Theorem, 1909):
Let $(\mathbb{X}, \nu)$ be a finite measure space and $k: \mathbb{X} \times \mathbb{X} \to \mathbb{R}$ be a continuous (Mercer) kernel. Then there exist eigenvalues and eigenfunctions $(\lambda_i, \phi_i)_{i \in I}$ with respect to $\nu$ such that:

    I is countable.

    All λi​ are real and non-negative.

    The eigenfunctions ϕi​ can be made orthonormal.

    The following series converges absolutely and uniformly $\nu^2$-almost-everywhere:
    $k(a, b) = \sum_{i \in I} \lambda_i\, \phi_i(a)\, \phi_i(b) \quad \forall a, b \in \mathbb{X}$.

This theorem is profound because it shows that a continuous Mercer kernel can be decomposed into an infinite sum of products of its eigenfunctions, weighted by its eigenvalues. This is the direct analogue of the spectral decomposition for finite matrices.

So, yes, in the sense of Mercer's theorem, we can vaguely think of a kernel k(a,b) as an "element" of an "infinitely large" matrix, where the "rows" and "columns" are indexed by points in X, and the "basis vectors" are the eigenfunctions. The eigenvalues λi​ tell us the "importance" of each eigenfunction in constructing the kernel.
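
One way to make the "infinite matrix" picture concrete is to discretise the integral equation on a grid. Under a uniform probability measure on an interval, the eigenvalues of $K_{XX}/n$ approximate the leading Mercer eigenvalues, and they should stabilise as the grid is refined. A rough NumPy sketch, assuming an RBF kernel on $[-3, 3]$ (the helper names and grid sizes are illustrative choices, not taken from the lecture):

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel matrix for 1-D inputs a (n,) and b (m,)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * diff**2 / lengthscale**2)

def leading_operator_eigenvalues(n, num=3):
    """Approximate the largest eigenvalues of the kernel integral operator on
    [-3, 3] (uniform probability measure) using an n-point grid."""
    x = np.linspace(-3.0, 3.0, n)
    K = rbf_kernel(x, x)
    # (1/n) * K is a simple quadrature rule for  phi -> integral of k(., y) phi(y) dnu(y)
    eigvals = np.linalg.eigvalsh(K / n)   # returned in ascending order
    return eigvals[::-1][:num]            # largest first

coarse = leading_operator_eigenvalues(100)
fine = leading_operator_eigenvalues(400)

print("n = 100:", coarse)
print("n = 400:", fine)
```

The matrix gets larger, but its scaled leading eigenvalues barely move: they are converging to properties of the kernel itself rather than of any particular grid.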

Reproducing Kernel Hilbert Spaces (RKHS)

The eigenfunctions of a kernel span a very important space of functions called the Reproducing Kernel Hilbert Space (RKHS). This space is central to understanding what a GP can "learn" or represent.

Definition (Reproducing Kernel Hilbert Space (RKHS)):
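Let $\mathcal{H} = (\mathbb{X}, \langle \cdot, \cdot \rangle_{\mathcal{H}})$ be a Hilbert space of functions $f: \mathbb{X} \to \mathbb{R}$. Then $\mathcal{H}$ is called a reproducing kernel Hilbert space if there exists a kernel $k: \mathbb{X} \times \mathbb{X} \to \mathbb{R}$ such that:

    For all $x \in \mathbb{X}$: $k(\cdot, x) \in \mathcal{H}$ (the kernel function, when one argument is fixed, is itself a member of the space).

    For all $f \in \mathcal{H}$: $\langle f(\cdot), k(\cdot, x) \rangle_{\mathcal{H}} = f(x)$ (this is the reproducing property; evaluating a function at $x$ is equivalent to taking its inner product with the kernel function $k(\cdot, x)$).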

Theorem (Aronszajn, 1950): For every positive definite kernel k on X, there exists a unique RKHS.

Intuition: The kernel k(⋅,x) acts like a "generalized identity function" or a "delta function" in the RKHS. It allows us to "reproduce" the value of any function in the space by taking an inner product.

What is the RKHS? (1) The Space of Possible Posterior Mean Functions

A key insight is that the RKHS associated with a kernel k is precisely the space of all possible posterior mean functions that can be obtained from a GP with that kernel.

Theorem (Reproducing Kernel Map Representation):
The RKHS Hk​ can be written as the space of linear combinations of kernel functions:
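$\mathcal{H}_k = \left\{ f(x) := \sum_{i \in I} \tilde{\alpha}_i\, k(x_i, x) \right\}$
with the inner product $\langle f, g \rangle_{\mathcal{H}_k} := \sum_{i \in I} \sum_{j \in I} \tilde{\alpha}_i \tilde{\beta}_j\, k(x_i, x_j)$ for $g(x) := \sum_{j \in I} \tilde{\beta}_j\, k(x_j, x)$.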
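
For functions that are finite linear combinations of kernel functions, this inner product is just a quadratic form in the kernel matrix, and the reproducing property can be verified directly. The sketch below uses NumPy; the centres, coefficients, and helper names (`rbf_kernel`, `rkhs_inner`) are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel matrix for 1-D inputs a (n,) and b (m,)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * diff**2 / lengthscale**2)

def rkhs_inner(alpha, beta, centers):
    """<f, g> for f = sum_i alpha_i k(x_i, .) and g = sum_j beta_j k(x_j, .)
    expressed on a shared set of centres."""
    return alpha @ rbf_kernel(centers, centers) @ beta

centers = np.array([-1.5, 0.0, 0.7, 2.0])   # x_i
alpha = np.array([0.5, -1.0, 2.0, 0.3])     # coefficients of f
x_star = 1.1                                # point at which we evaluate f

# Direct evaluation: f(x_star) = sum_i alpha_i k(x_i, x_star)
f_at_x_star = alpha @ rbf_kernel(centers, np.array([x_star]))[:, 0]

# Reproducing property: f(x_star) = <f, k(., x_star)>.
# Put both functions on the same centres by appending x_star.
centers_ext = np.append(centers, x_star)
alpha_ext = np.append(alpha, 0.0)           # f does not use the new centre
beta_ext = np.zeros(centers_ext.size)
beta_ext[-1] = 1.0                          # g = k(., x_star)
inner = rkhs_inner(alpha_ext, beta_ext, centers_ext)

squared_norm = rkhs_inner(alpha, alpha, centers)   # ||f||^2 = alpha^T K alpha

print(f"f(x*) = {f_at_x_star:.6f}, <f, k(., x*)> = {inner:.6f}, ||f||^2 = {squared_norm:.6f}")
```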

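Consider a Gaussian process $p(f) = \mathcal{GP}(0, k)$ with likelihood $p(y \mid f, X) = \mathcal{N}(y; f_X, \sigma^2 I)$. The posterior mean function is given by:
$\mu(x) = k_{xX} (K_{XX} + \sigma^2 I)^{-1} y$

Let $w = (K_{XX} + \sigma^2 I)^{-1} y$. Then $\mu(x) = k_{xX} w = \sum_{i=1}^n w_i\, k(x, x_i)$.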
This shows that the posterior mean function is indeed a linear combination of kernel functions centered at the training data points. Thus, the RKHS is the space of all possible posterior mean functions of the GP regression method.

The Gaussian Posterior Mean is a Least-Squares Estimate (Kernel Ridge Regression)

The connection between Bayesian GP regression and Frequentist methods becomes even clearer when we look at the posterior mean.

Theorem (The Kernel Ridge Estimate):
Consider the model $p(f) = \mathcal{GP}(f; 0, k)$ and likelihood $p(y \mid f) = \mathcal{N}(y; f_X, \sigma^2 I)$.
The posterior mean:
$m(x) = k_{xX} (K_{XX} + \sigma^2 I)^{-1} y$
is the element of the RKHS $\mathcal{H}_k$ that minimizes the regularized $\ell_2$ loss:
$L(f) = \frac{1}{\sigma^2} \sum_i (f(x_i) - y_i)^2 + \|f\|_{\mathcal{H}_k}^2$

The term $\|f\|_{\mathcal{H}_k}^2$ is the squared RKHS norm, which acts as a regularizer, penalizing overly complex or "rough" functions. This loss function is precisely the objective minimized in Kernel Ridge Regression.

This theorem reveals a profound equivalence: the Bayesian posterior mean of a Gaussian Process is identical to the solution of a specific regularized least-squares problem in the Frequentist framework. This highlights that many seemingly different models are deeply connected.
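
The minimisation claim can be sanity-checked numerically. The sketch below restricts attention to functions of the form $f = \sum_i w_i k(\cdot, x_i)$, for which $f(X) = K_{XX} w$ and $\|f\|_{\mathcal{H}_k}^2 = w^\top K_{XX} w$, and confirms that the GP weights $w^* = (K_{XX} + \sigma^2 I)^{-1} y$ make the gradient of the loss vanish. The synthetic data, noise level, and helper names are assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel matrix for 1-D inputs a (n,) and b (m,)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * diff**2 / lengthscale**2)

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(-4.0, 4.0, size=10))      # synthetic training inputs
y = np.sin(X) + 0.1 * rng.standard_normal(10)     # synthetic noisy targets
sigma2 = 0.1**2                                   # observation noise variance
K = rbf_kernel(X, X)

def loss(w):
    """Regularised least-squares loss for f = sum_i w_i k(., x_i)."""
    residual = K @ w - y
    return residual @ residual / sigma2 + w @ K @ w

# Weights of the GP posterior mean
w_star = np.linalg.solve(K + sigma2 * np.eye(X.size), y)

# Gradient of the loss with respect to w, evaluated at w_star
grad = 2.0 * K @ (K @ w_star - y) / sigma2 + 2.0 * K @ w_star

# Random perturbations of w_star should never decrease the (convex) loss
perturbed = [loss(w_star + 0.1 * rng.standard_normal(X.size)) for _ in range(100)]

print(f"|gradient| at w*: {np.linalg.norm(grad):.2e}")
print(f"loss(w*) = {loss(w_star):.4f}, smallest perturbed loss = {min(perturbed):.4f}")
```

Restricting to the span of the kernel functions at the training inputs is itself an assumption of this sketch; the theorem states that the minimiser over the whole RKHS has this form.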

This connection to least-squares and regularization has a long history, dating back to figures like Adrien-Marie Legendre and Carl-Friedrich Gauss, who developed the method of least squares over 200 years ago.

What is the Meaning of Uncertainty? Frequentist Interpretation of Posterior Variance

In Bayesian GPs, the posterior variance $v(x) = \operatorname{Var}[f(x) \mid y]$ quantifies our uncertainty about the function value at a point x given the observed data y. Interestingly, this Bayesian measure of uncertainty has a strong Frequentist interpretation.

Theorem: Assume $p(f) = \mathcal{GP}(f; 0, k)$ and noise-free observations $p(y \mid f) = \delta(y - f_X)$ (where $\delta$ is the Dirac delta function, implying exact observations). The GP posterior variance (the expected square error):
$v(x) := E_{p(f \mid y)}[(f(x) - m(x))^2] = k_{xx} - k_{xX} K_{XX}^{-1} k_{Xx}$
is a worst-case bound on the divergence between $m(x)$ and an RKHS element of bounded norm:
$v(x) = \sup_{f \in \mathcal{H}_k,\ \|f\|_{\mathcal{H}_k} \le 1} (m(x) - f(x))^2$
where, inside the supremum, $m$ is the posterior mean computed from the exact observations $y = f_X$ of that same $f$.

This theorem states that the Bayesian posterior variance at a point x is equal to the maximum possible squared difference between the posterior mean m(x) and the function f that generated the data, taken over all functions in the RKHS with a norm less than or equal to 1. In essence, it tells us "how far off" our posterior mean could be from any "well-behaved" function in the RKHS, even in the best-case scenario of noise-free observations.

This is a powerful result: the GP's expected square error (a Bayesian quantity) is equivalent to the RKHS's worst-case square error for functions of bounded norm (a Frequentist quantity). It suggests that Bayesians, in a sense, "expect the worst" when it comes to uncertainty, as their posterior variance provides a robust bound.

Note: The posterior variance v(x) itself is generally not an element of the RKHS Hk​.
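
The worst-case statement can be probed numerically. In the sketch below, the function $f^\star \propto k(\cdot, x) - \sum_i w_i k(\cdot, x_i)$ with $w = K_{XX}^{-1} k_{Xx}$ is used as a candidate maximiser (it vanishes at all training inputs, so its posterior mean is zero), and randomly drawn unit-norm RKHS functions are checked against the bound. The inputs, query point, and helper names are illustrative assumptions; the code relies on $k(x, x) = 1$ for this unit-variance RBF kernel.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel matrix for 1-D inputs a (n,) and b (m,)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * diff**2 / lengthscale**2)

X = np.array([-2.0, -0.5, 1.0, 2.5])   # noise-free observation locations
x = np.array([0.3])                    # query point
K = rbf_kernel(X, X)
k_xX = rbf_kernel(x, X)[0]
w = np.linalg.solve(K, k_xX)           # K_XX^{-1} k_Xx
v = 1.0 - k_xX @ w                     # posterior variance; k(x, x) = 1 here

def squared_error_unit_norm(centers, coeffs):
    """(f(x) - m_f(x))^2 for f = sum_j coeffs_j k(., centers_j),
    rescaled so that its RKHS norm equals 1."""
    norm = np.sqrt(coeffs @ rbf_kernel(centers, centers) @ coeffs)
    c = coeffs / norm
    f_X = rbf_kernel(X, centers) @ c        # exact observations of f
    f_x = rbf_kernel(x, centers)[0] @ c     # true value at the query point
    m_x = w @ f_X                           # posterior mean at x given f_X
    return (f_x - m_x) ** 2

# Candidate worst case: k(., x) - sum_i w_i k(., x_i)
worst = squared_error_unit_norm(np.append(x, X), np.append(1.0, -w))

# Randomly generated unit-norm RKHS functions
rng = np.random.default_rng(2)
random_errors = [
    squared_error_unit_norm(rng.uniform(-3.0, 3.0, size=6), rng.standard_normal(6))
    for _ in range(500)
]

print(f"v(x) = {v:.6f}, error of candidate worst case = {worst:.6f}")
print(f"largest error among random unit-norm functions = {max(random_errors):.6f}")
```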

What About the Samples? GP Samples Are Not in the RKHS!

This is a subtle but important point that often surprises beginners: while the posterior mean of a GP lives in the RKHS, the sample paths (individual functions drawn from the GP) generally do not lie in the RKHS, especially when the RKHS is infinite-dimensional.

To understand this, we use the Karhunen-Loève Expansion, which provides a spectral decomposition for stochastic processes:

Theorem (Karhunen-Loève Expansion - Simplified):
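Let $\mathbb{X}$ be a compact metric space, $k: \mathbb{X} \times \mathbb{X} \to \mathbb{R}$ be a continuous kernel, and $(\phi_i, \lambda_i)_{i \in I}$ be its eigenfunctions and eigenvalues. Let $(z_i)_{i \in I}$ be a collection of i.i.d. standard Gaussian random variables ($z_i \sim \mathcal{N}(0, 1)$).
Then (simplified!):
$f(x) = \sum_{i \in I} z_i\, \lambda_i^{1/2}\, \phi_i(x) \sim \mathcal{GP}(0, k)$.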

This theorem shows that any sample path from a GP can be represented as an infinite sum of its eigenfunctions, weighted by Gaussian random variables and the square root of the eigenvalues.
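
A finite-dimensional analogue of this expansion replaces the eigenfunctions by the eigenvectors of a kernel matrix on a grid. The NumPy sketch below (grid, kernel, and variable names are assumptions for illustration) builds samples as $\sum_i z_i \lambda_i^{1/2} v_i$ and checks that the implied covariance is the original kernel matrix.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Squared-exponential kernel matrix for 1-D inputs a (n,) and b (m,)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * diff**2 / lengthscale**2)

x = np.linspace(-3.0, 3.0, 30)
K = rbf_kernel(x, x)

# Eigendecomposition of the kernel matrix (discrete stand-in for Mercer's theorem)
eigvals, eigvecs = np.linalg.eigh(K)
eigvals = np.clip(eigvals, 0.0, None)   # remove tiny negative values from round-off

# Columns of A are lambda_i^{1/2} v_i
A = eigvecs * np.sqrt(eigvals)

# Each column of `samples` is one draw  sum_i z_i lambda_i^{1/2} v_i
rng = np.random.default_rng(3)
z = rng.standard_normal((x.size, 5))
samples = A @ z

# Covariance implied by the construction: A A^T = V diag(lambda) V^T
implied_cov = A @ A.T
print("max |A A^T - K| =", np.max(np.abs(implied_cov - K)))
```

Unlike a Cholesky factor, this construction does not require a strictly positive definite matrix, which is why no jitter is added here.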

Corollary (Wahba, 1990):
If $I$ (the set of indices for eigenfunctions) is infinite, then $f \sim \mathcal{GP}(0, k)$ implies almost surely $f \notin \mathcal{H}_k$.

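Why? The norm of a function $f$ in the RKHS is defined as $\|f\|_{\mathcal{H}_k}^2 := \sum_{i \in I} \alpha_i^2 < \infty$, where $f(x) = \sum_{i \in I} \alpha_i\, \lambda_i^{1/2}\, \phi_i(x)$.
For a sample path $f(x) = \sum_{i \in I} z_i\, \lambda_i^{1/2}\, \phi_i(x)$, the corresponding $\alpha_i$ values are $z_i$.
So, for a sample path, its squared "norm" in the RKHS would be $\sum_{i \in I} z_i^2$.
The expected value of this squared norm is $E\left[\sum_{i \in I} z_i^2\right] = \sum_{i \in I} E[z_i^2] = \sum_{i \in I} 1$.
If $I$ is infinite, this sum diverges ($\sum_{i \in I} 1 = \infty$). A diverging expectation alone does not rule out finite values, but because the $z_i^2$ are i.i.d. with mean 1, the strong law of large numbers gives $\frac{1}{n} \sum_{i=1}^n z_i^2 \to 1$ almost surely, so the partial sums grow without bound almost surely. This means that sample paths from an infinite-dimensional GP almost surely have infinite RKHS norm, and thus do not belong to the RKHS.

Sample Spaces of Gaussian Processes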

This distinction is important. While the RKHS defines the space of mean functions (what the GP can "learn" or approximate well), the sample paths (what the GP can "generate") are generally "rougher" and exist in a larger space.
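
The divergence of the RKHS norm for sample paths is easy to observe numerically. In a truncated Karhunen-Loève expansion with $n$ terms, the squared RKHS norm of a sample is $\sum_{i \le n} z_i^2$ with i.i.d. standard Gaussian $z_i$. The short NumPy sketch below (seed and truncation levels are arbitrary choices) shows this quantity growing roughly linearly in $n$ instead of converging.

```python
import numpy as np

rng = np.random.default_rng(4)
z = rng.standard_normal(100_000)      # KL coefficients z_i of one sample path

# Squared RKHS norm of the expansion truncated after n terms, for every n
partial_norms = np.cumsum(z**2)

for n in (10, 1_000, 100_000):
    value = partial_norms[n - 1]
    print(f"n = {n:>7}: truncated squared norm = {value:10.1f}, ratio to n = {value / n:.3f}")

ratio = partial_norms[-1] / z.size    # should be close to 1 by the law of large numbers
```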

To study the properties of GP samples, one often identifies other function spaces (e.g., Banach spaces of continuous or differentiable functions, Sobolev spaces) that contain the samples as a subset. The goal is to choose a kernel such that its sample space matches our prior knowledge about the target function's properties (e.g., if we expect a smooth function, we choose a kernel like RBF or Matérn with large ν).

In some cases, GP samples belong to a "power" of the RKHS, denoted $\mathcal{H}_k^{\theta}$ with $0 < \theta < 1$, which is a slightly larger space than $\mathcal{H}_k$; this happens when the sum $\sum_{i \in I} \lambda_i^{1-\theta}$ converges. This means GP samples are "almost" in the RKHS, but not quite.

Summary of Understanding Kernels and GPs

This lecture has provided a deeper, more theoretical understanding of Gaussian Processes and their underlying kernels. Here are the key takeaways:

    GPs as Function Distributions: Gaussian Processes are indeed probability distributions over function spaces. However, the general construction only yields a measure on the very large space of all real-valued functions on the index set, which says little about regularity. To understand the specific characteristics of functions sampled from a GP, we must study its kernel.

    Kernel-Covariance Equivalence: Every covariance function is a kernel, and conversely, every kernel can be the covariance function of a GP. This fundamental equivalence means that specifying a valid kernel is sufficient to define a GP.

    Kernels as "Infinite Matrices": Kernels have eigenfunctions and eigenvalues, much like matrices have eigenvectors and eigenvalues. This allows us to conceptualize them as a kind of "infinite matrix" that spans a space of functions.

    Reproducing Kernel Hilbert Space (RKHS): The RKHS is the specific function space associated with a kernel. Crucially, it is identical to the space of all possible posterior mean functions that can be obtained from GP regression. This connects GPs directly to Frequentist kernel methods like Kernel Ridge Regression.

    Posterior Variance as Worst-Case Error: The GP's posterior variance (which represents the Bayesian expected square error) has a strong Frequentist interpretation: it is a worst-case square error bound within the RKHS for functions of bounded norm. This suggests a deep connection between Bayesian uncertainty and Frequentist error bounds.

    GP Samples vs. RKHS: A critical distinction is that sample paths drawn from a GP generally do not lie in the RKHS. They tend to be "rougher" and reside in a larger function space (e.g., a "power" of the RKHS). This means what a GP can generate (samples) is different from what it can learn or represent (posterior means).

This lecture reinforces that while GPs are powerful, a thorough understanding of their underlying mathematical properties, particularly those of kernels and RKHSs, is essential for effective application and interpretation.

In [None]:
import jax.numpy as jnp
import jax.random as random
import matplotlib.pyplot as plt
from typing import Callable

# --- Re-using kernel definitions from previous lectures for completeness ---
# Squared Exponential (RBF) Kernel
def squared_exponential_kernel(x1: jnp.ndarray, x2: jnp.ndarray, sigma: float = 1.0, lengthscale: float = 1.0) -> jnp.ndarray:
    """
    Computes the Squared Exponential (RBF) kernel matrix.
    """
    x1 = jnp.atleast_2d(x1)
    x2 = jnp.atleast_2d(x2)
    sq_dist = jnp.sum((x1[:, None, :] - x2[None, :, :])**2, axis=-1)
    K = sigma**2 * jnp.exp(-0.5 * sq_dist / lengthscale**2)
    return K

# --- Function to Sample from a GP (assuming zero mean for simplicity) ---
def sample_from_gp(X: jnp.ndarray, kernel_func: Callable, num_samples: int = 1, key: random.PRNGKey = random.PRNGKey(0)) -> jnp.ndarray:
    """
    Draws samples from a Gaussian Process with zero mean.

    Args:
        X: Input points to sample at. Shape (N, D).
        kernel_func: Covariance function (kernel).
        num_samples: Number of samples to draw.
        key: JAX PRNG key for reproducibility.

    Returns:
        An array of samples. Shape (num_samples, N).
    """
    # Ensure X is at least 2D
    X = jnp.atleast_2d(X)

    # Compute the covariance matrix
    K_XX = kernel_func(X, X)

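    # Add jitter for numerical stability in the Cholesky decomposition.
    # JAX defaults to float32, where 1e-6 can be too small for smooth kernels
    # on dense grids (the factorization may then silently return NaNs).
    jitter = 1e-4 * jnp.eye(X.shape[0])
    K_XX = K_XX + jitter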

    # Compute the lower-triangular Cholesky factor of the covariance matrix.
    # jnp.linalg.cholesky returns the lower factor by default and has no `lower` argument.
    L = jnp.linalg.cholesky(K_XX)

    # Generate random standard normal variables
    z = random.normal(key, shape=(num_samples, X.shape[0]))

    # Compute the samples: mean + L @ z^T (mean is zero here)
    # L is (N, N), z.T is (N, num_samples). Result is (N, num_samples).
    # Transpose to get (num_samples, N)
    samples = jnp.dot(L, z.T).T

    return samples

# --- Example: Sampling from a GP with RBF Kernel ---
key = random.PRNGKey(123)

# Generate input points
X_test = jnp.linspace(-5.0, 5.0, 100)[:, None] # Test inputs (needs to be 2D)

# Define the RBF kernel with specific hyperparameters
rbf_sigma = 1.0
rbf_lengthscale = 1.0
rbf_k = lambda x1, x2: squared_exponential_kernel(x1, x2, sigma=rbf_sigma, lengthscale=rbf_lengthscale)

# Sample functions
num_samples = 5
gp_samples = sample_from_gp(X_test, rbf_k, num_samples=num_samples, key=key)

# Plot the samples
plt.figure(figsize=(10, 6))
for i in range(num_samples):
    plt.plot(X_test[:, 0], gp_samples[i, :], alpha=0.7)

# Add the mean function (which is zero here)
plt.plot(X_test[:, 0], jnp.zeros_like(X_test[:, 0]), color='black', linestyle='--', label='Mean Function')

plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('Samples from a Gaussian Process (RBF Kernel)')
plt.grid(True)
plt.show()


In [None]:
import jax.numpy as jnp
import matplotlib.pyplot as plt

# --- Re-using kernel definition ---
# Squared Exponential (RBF) Kernel
def squared_exponential_kernel(x1: jnp.ndarray, x2: jnp.ndarray, sigma: float = 1.0, lengthscale: float = 1.0) -> jnp.ndarray:
    """
    Computes the Squared Exponential (RBF) kernel matrix.
    """
    x1 = jnp.atleast_2d(x1)
    x2 = jnp.atleast_2d(x2)
    sq_dist = jnp.sum((x1[:, None, :] - x2[None, :, :])**2, axis=-1)
    K = sigma**2 * jnp.exp(-0.5 * sq_dist / lengthscale**2)
    return K

# --- Example: Eigenvalues and Eigenvectors of a Finite Kernel Matrix ---
# This is an analogy to Mercer's Theorem for infinite-dimensional kernels.
# For a finite set of points, the kernel matrix is symmetric positive semidefinite,
# and thus has real, non-negative eigenvalues and orthogonal eigenvectors
# (tiny negative values can still appear from floating-point round-off).

# Define a finite set of input points
X_finite = jnp.linspace(-3.0, 3.0, 20)[:, None] # 20 points

# Compute the kernel matrix for these points
K_finite = squared_exponential_kernel(X_finite, X_finite, sigma=1.0, lengthscale=1.0)

# Compute eigenvalues and eigenvectors
# jnp.linalg.eigh returns eigenvalues in ascending order
eigenvalues, eigenvectors = jnp.linalg.eigh(K_finite)

# Sort eigenvalues in descending order and reorder eigenvectors accordingly
idx = eigenvalues.argsort()[::-1] # Get indices for descending order
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx] # Reorder columns of eigenvectors matrix

print(f"Top 5 Eigenvalues:\n{eigenvalues[:5]}")
print(f"Smallest 5 Eigenvalues:\n{eigenvalues[-5:]}") # Should be positive/close to zero

# Plot the first few eigenvectors (analogous to eigenfunctions)
plt.figure(figsize=(12, 8))
for i in range(min(5, len(eigenvalues))): # Plot up to 5 eigenvectors
    # Scale eigenvectors by sqrt of eigenvalues for visualization,
    # as per Karhunen-Loève expansion form (lambda_i^1/2 * phi_i(x))
    plt.plot(X_finite[:, 0], eigenvectors[:, i] * jnp.sqrt(eigenvalues[i]),
             label=f'Eigenvector {i+1} (scaled by $\\sqrt{{\\lambda_{i+1}}}$)')

plt.xlabel('x')
plt.ylabel('Value')
plt.title('First Few Eigenvectors of a Finite RBF Kernel Matrix (Analogy to Eigenfunctions)')
plt.legend()
plt.grid(True)
plt.show()

# --- Verify Reconstruction (Mercer's Theorem Analogy) ---
# We can reconstruct the kernel matrix from its eigenvalues and eigenvectors
# K_reconstructed = Sum_{i} lambda_i * v_i * v_i^T
K_reconstructed = jnp.dot(eigenvectors * eigenvalues, eigenvectors.T)

# Check if the reconstruction is close to the original matrix
print(f"\nMax absolute difference between original and reconstructed K: {jnp.max(jnp.abs(K_finite - K_reconstructed)):.2e}")


In [None]:
import jax.numpy as jnp
import jax.random as random
import matplotlib.pyplot as plt
from jax.scipy.linalg import solve # Use JAX's solve for numerical stability
from typing import Callable

# --- Re-using kernel definitions from previous lectures for completeness ---
def squared_exponential_kernel(x1: jnp.ndarray, x2: jnp.ndarray, sigma: float = 1.0, lengthscale: float = 1.0) -> jnp.ndarray:
    """
    Computes the Squared Exponential (RBF) kernel matrix.
    """
    x1 = jnp.atleast_2d(x1)
    x2 = jnp.atleast_2d(x2)
    sq_dist = jnp.sum((x1[:, None, :] - x2[None, :, :])**2, axis=-1)
    K = sigma**2 * jnp.exp(-0.5 * sq_dist / lengthscale**2)
    return K

# GP Prediction Function (as used in previous lectures)
def gp_predict(
    X_train: jnp.ndarray,
    y_train: jnp.ndarray,
    X_test: jnp.ndarray,
    mean_func: Callable[[jnp.ndarray], jnp.ndarray],
    kernel_func: Callable[[jnp.ndarray, jnp.ndarray], jnp.ndarray],
    noise_variance: float = 1e-6
) -> tuple[jnp.ndarray, jnp.ndarray]:
    """
    Performs Gaussian Process regression prediction.
    """
    X_train = jnp.atleast_2d(X_train)
    X_test = jnp.atleast_2d(X_test)
    y_train = jnp.atleast_1d(y_train)

    K_train_train = kernel_func(X_train, X_train) + noise_variance * jnp.eye(X_train.shape[0])
    K_test_train = kernel_func(X_test, X_train)
    K_test_test = kernel_func(X_test, X_test)

    K_train_train_inv_y_diff = solve(K_train_train, y_train - mean_func(X_train))
    mu_pred = mean_func(X_test) + jnp.dot(K_test_train, K_train_train_inv_y_diff)

    K_train_train_inv_K_test_train_T = solve(K_train_train, K_test_train.T)
    Sigma_pred = K_test_test - jnp.dot(K_test_train, K_train_train_inv_K_test_train_T)

    return mu_pred, Sigma_pred

# --- Example: GP Regression to illustrate Posterior Mean (which lies in RKHS) ---
key = random.PRNGKey(456)

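# Generate synthetic data (separate PRNG keys for the inputs and the noise)
key_x, key_noise = random.split(key)
X_train = jnp.sort(random.uniform(key_x, shape=(10, 1), minval=-4, maxval=4), axis=0)
noise = random.normal(key_noise, shape=(X_train.shape[0],)) * 0.3
# Keep y_train one-dimensional, shape (10,), so it matches mean_func's output
# and does not broadcast to a (10, 10) array inside gp_predict.
y_train = jnp.sin(X_train[:, 0]) + noise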

# Define mean and kernel functions
zero_mean_func = lambda x: jnp.zeros(x.shape[0])
rbf_kernel_for_pred = lambda x1, x2: squared_exponential_kernel(x1, x2, sigma=1.0, lengthscale=1.0)
obs_noise_variance = 0.3**2  # matches the noise standard deviation (0.3) used to generate y_train

# Generate test points
X_test = jnp.linspace(-5, 5, 100)[:, None]

# Perform GP prediction
mu_pred, Sigma_pred = gp_predict(
    X_train, y_train, X_test, zero_mean_func, rbf_kernel_for_pred, noise_variance=obs_noise_variance
)

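# Clamp tiny negative variances caused by round-off before taking the square root
predictive_std = jnp.sqrt(jnp.maximum(jnp.diag(Sigma_pred), 0.0))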

# Plot the results
plt.figure(figsize=(10, 6))
plt.scatter(X_train[:, 0], y_train, label='Training Data', zorder=2)
plt.plot(X_test[:, 0], mu_pred, label='GP Posterior Mean (in RKHS)', color='red')
plt.fill_between(X_test[:, 0], mu_pred - 2 * predictive_std, mu_pred + 2 * predictive_std, color='red', alpha=0.2, label='95% Credible Interval (+/- 2 std)')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Gaussian Process Regression: Posterior Mean in RKHS')
plt.legend()
plt.grid(True)
plt.show()

# The posterior mean can be expressed as a linear combination of kernel functions
# centered at the training data points.
# This explicitly shows its membership in the RKHS.
# The weights 'alpha' are computed during the GP prediction:
# alpha = (K_train_train + noise_variance * I)^-1 * (y_train - mean_func(X_train))
K_train_train_noisy = rbf_kernel_for_pred(X_train, X_train) + obs_noise_variance * jnp.eye(X_train.shape[0])
alpha_weights = solve(K_train_train_noisy, y_train - zero_mean_func(X_train))

# Reconstruct posterior mean using alpha_weights and kernel functions
mu_pred_reconstructed = jnp.dot(rbf_kernel_for_pred(X_test, X_train), alpha_weights) + zero_mean_func(X_test)

print(f"\nMax absolute difference between direct mu_pred and reconstructed mu_pred: {jnp.max(jnp.abs(mu_pred - mu_pred_reconstructed)):.2e}")
print("This confirms the posterior mean is a linear combination of kernel functions, residing in the RKHS.")
