# Probabilistic Machine Learning: Lecture 16 - Deep Learning

#### Introduction

Welcome to Lecture 16 of Probabilistic Machine Learning, where we delve into the world of **Deep Learning**! This lecture connects the concepts we've learned about probabilistic models, Gaussian Processes, and inference to the highly influential field of deep neural networks. We'll explore the fundamental definitions, training paradigms, and compare the strengths and weaknesses of deep learning against our established probabilistic frameworks.

This notebook will guide you through the key ideas from the lecture slides, providing explanations and setting the stage for understanding deep learning from a probabilistic perspective. We'll maintain our use of **JAX** for any underlying numerical concepts and **Plotly** for visualizations, ensuring consistency with previous lectures.

#### 1. Recap: The Course So Far

Before diving into deep learning, let's quickly recap the core themes and models we've covered in this course (as per Slide 2):

* **Learning is Inference**: We've consistently framed learning as a process of re-weighing a space of hypotheses using likelihood functions, fundamentally rooted in Bayes' theorem. Probability theory provides the mathematical framework for tracking volume/measure correctly.

* **Exponential Families**: These are parametric probability distributions that allow for tractable inference. They simplify calculations and provide a foundation for many statistical models.

* **Gaussian Distributions**: A particularly important exponential family where inference often reduces to linear algebra. They are central to:
    * **Linear Regression**: Learning general linear functions $f: X \to \mathbb{R}$, $f(x) = \phi(x)^T w$.
    * **Gaussian Process (GP) Regression**: Abstracting linear functions to a functional form that doesn't require explicit features, using inner products (kernels, covariance functions) $k(\bullet,\circ) = \phi(\bullet)^T \Sigma \phi(\circ)$. This leads to a nonparametric model.

* **Classification with Sigmoid Likelihood**: For classification problems (functions $f: X \to [0,1]^C$), we adapted this framework by considering a sigmoid likelihood (logistic regression). This necessitated approximate inference, often realized through **Laplace approximations**, as exact posteriors are non-Gaussian.

#### 2. Deep Learning: A Vague Definition

Deep learning, while widely used, can be defined in various ways. For our purposes in this lecture, we will consider a deep neural network as (as per Slide 3):

A function $f(x, \theta): \mathbb{X} \times \mathbb{R}^D \to \mathbb{R}^F$, parametrized by parameters $\theta \in \mathbb{R}^D$ and mapping inputs $x \in \mathbb{X}$ to outputs $f(x, \theta) \in \mathbb{R}^F$.

Such functions are often realized in a **hierarchical fashion**:

$$f(x, \theta) = b_L + w_L \sigma(b_{L-1} + w_{L-1} \sigma(\cdot\cdot\cdot\sigma(b_0 + w_0 x)))$$

This structure is parametrized by weights and biases $\theta = [b_i, w_i]_{i=0,...,L}$ and uses nonlinearities $\sigma$ (e.g., ReLU, tanh, sigmoid). We will ignore the intricate details of specific architectures here, as they are typically covered in dedicated deep learning courses.

The core idea is that deep networks learn complex, hierarchical representations of data through multiple layers of non-linear transformations.

Let's illustrate a simple multi-layer perceptron (MLP) using JAX.

In [None]:
import jax.numpy as jnp
import jax
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Set JAX to use 64-bit floats for numerical stability
jax.config.update("jax_enable_x64", True)


# --- Utility Functions (re-defined for clarity and JAX compatibility) ---
def sigmoid(f):
    """Logistic sigmoid function."""
    return 1 / (1 + jnp.exp(-f))


def relu(x):
    """ReLU activation function."""
    return jnp.maximum(0, x)


def rbf_kernel(X1, X2, length_scale=1.0):
    """Radial Basis Function (RBF) kernel."""
    sqdist = jnp.sum(X1**2, 1)[:, None] + jnp.sum(X2**2, 1) - 2 * jnp.dot(X1, X2.T)
    return jnp.exp(-0.5 * (1 / length_scale**2) * sqdist)


def generate_data(type="separable", n_samples=100):
    """Generates synthetic 2D classification data (using numpy for random generation)."""
    np.random.seed(42)  # Ensure reproducibility for data generation
    if type == "separable":
        mean1 = [-1, 0.5]
        cov1 = [[0.5, 0.2], [0.2, 0.5]]
        data1 = np.random.multivariate_normal(mean1, cov1, n_samples // 2)
        labels1 = np.ones(n_samples // 2) * -1
        mean2 = [1, -0.5]
        cov2 = [[0.5, 0.2], [0.2, 0.5]]
        data2 = np.random.multivariate_normal(mean2, cov2, n_samples // 2)
        labels2 = np.ones(n_samples // 2)
    elif type == "overlapping":
        mean1 = [-0.5, 0.5]
        cov1 = [[1.0, 0.5], [0.5, 1.0]]
        data1 = np.random.multivariate_normal(mean1, cov1, n_samples // 2)
        labels1 = np.ones(n_samples // 2) * -1
        mean2 = [0.5, -0.5]
        cov2 = [[1.0, 0.5], [0.5, 1.0]]
        data2 = np.random.multivariate_normal(mean2, cov2, n_samples // 2)
        labels2 = np.ones(n_samples // 2)
    elif type == "intermingled":
        r1 = np.random.rand(n_samples // 2) * 2
        theta1 = np.random.rand(n_samples // 2) * 2 * np.pi
        data1 = np.array(
            [
                r1 * np.cos(theta1) + np.random.randn(n_samples // 2) * 0.2,
                r1 * np.sin(theta1) + np.random.randn(n_samples // 2) * 0.2,
            ]
        ).T
        labels1 = np.ones(n_samples // 2) * -1
        r2 = np.random.rand(n_samples // 2) * 2 + 1.5
        theta2 = np.random.rand(n_samples // 2) * 2 * np.pi
        data2 = np.array(
            [
                r2 * np.cos(theta2) + np.random.randn(n_samples // 2) * 0.2,
                r2 * np.sin(theta2) + np.random.randn(n_samples // 2) * 0.2,
            ]
        ).T
        labels2 = np.ones(n_samples // 2)
    else:
        raise ValueError(f"Unknown data type: {type!r}")
    X = np.vstack((data1, data2))
    y = np.hstack((labels1, labels2))
    return X, y


def plot_data_plotly(X, y, title="", fig=None, row=None, col=None):
    """Plots 2D classification data using Plotly."""
    if fig is None:
        fig = go.Figure()

    # Convert JAX arrays to NumPy for Plotly
    X_np = np.asarray(X)
    y_np = np.asarray(y)

    fig.add_trace(
        go.Scatter(
            x=X_np[y_np == -1, 0],
            y=X_np[y_np == -1, 1],
            mode="markers",
            marker=dict(color="maroon", symbol="circle"),
            name="Class -1",
            showlegend=True,
        ),
        row=row,
        col=col,
    )

    fig.add_trace(
        go.Scatter(
            x=X_np[y_np == 1, 0],
            y=X_np[y_np == 1, 1],
            mode="markers",
            marker=dict(
                color="skyblue", symbol="circle", line=dict(width=1, color="skyblue")
            ),
            name="Class +1",
            showlegend=True,
        ),
        row=row,
        col=col,
    )

    fig.update_layout(title_text=title, title_x=0.5)
    fig.update_xaxes(title_text="$x_1$", range=[-4, 4], row=row, col=col)
    fig.update_yaxes(
        title_text="$x_2$",
        range=[-4, 4],
        scaleanchor="x",
        scaleratio=1,
        row=row,
        col=col,
    )

    return fig


# --- Simple MLP Implementation in JAX ---
def init_mlp_params(key, layer_sizes):
    """Initializes parameters for a simple MLP."""
    params = []
    for i in range(len(layer_sizes) - 1):
        key, subkey = jax.random.split(key)
        in_dim = layer_sizes[i]
        out_dim = layer_sizes[i + 1]
        # Glorot initialization for weights
        limit = jnp.sqrt(6 / (in_dim + out_dim))
        weights = jax.random.uniform(
            subkey, (in_dim, out_dim), minval=-limit, maxval=limit
        )
        biases = jnp.zeros(out_dim)
        params.append({"weights": weights, "biases": biases})
    return params


def mlp_forward(params, x):
    """Forward pass through the MLP."""
    hidden_layers = params[:-1]
    output_layer = params[-1]

    h = x
    for layer in hidden_layers:
        h = jnp.dot(h, layer["weights"]) + layer["biases"]
        h = relu(h)  # Using ReLU as nonlinearity

    # Output layer (no activation for regression, or sigmoid/softmax for classification)
    output = jnp.dot(h, output_layer["weights"]) + output_layer["biases"]
    return output


# Example Usage:
key = jax.random.PRNGKey(0)
input_dim = 2
hidden_dim = 10
output_dim = 1
layer_sizes = [input_dim, hidden_dim, output_dim]
mlp_params = init_mlp_params(key, layer_sizes)

dummy_input = jnp.array([[1.0, 2.0], [0.5, -1.0]])
dummy_output = mlp_forward(mlp_params, dummy_input)

print("\n--- Simple MLP Example ---")
print("MLP Parameters (first layer weights and biases):\n", mlp_params[0])
print("Dummy Input:\n", dummy_input)
print("Dummy Output (before final activation if any):\n", dummy_output)


#### 3. How are Deep Architectures Trained? Empirical Risk Minimization (ERM)

Deep learning models are typically trained using **Empirical Risk Minimization (ERM)**. The goal is to find parameters $\theta_*$ that minimize a loss function on a given training dataset $\mathcal{D} = [(x_i, y_i)]_{i=1,...,N}$ (as per Slide 5):

$$\theta_* = \arg \min_{\theta} \mathcal{L}(\theta) = \arg \min_{\theta} \left( \frac{1}{N} \sum_{i=1}^N \ell(y_i, f(x_i, \theta)) + r(\theta) \right)$$

Here:
* $\ell(y_i, f(x_i, \theta))$ is the **loss function** for a single data point, measuring how well the model's output $f(x_i, \theta)$ matches the true label $y_i$.
* $r(\theta)$ is a **regularization term**, which penalizes complex models to prevent overfitting.

Typical choices for loss functions include:
* **Cross-entropy loss (a.k.a. log loss)** for classification:
    * Binary: $\ell_{\text{logistic}}(y_i, \hat{y}_i) = -y_i \log \hat{y}_i - (1-y_i) \log(1-\hat{y}_i)$ (for $y_i \in \{0,1\}$)
    * Multi-class: $\ell_{\text{CE}}(y_i, \hat{y}_i) = -\sum_{c=1}^C \mathbb{I}_{y_i=c} \log(\hat{y}_{ic})$
* **Mean Squared Error (MSE)** for regression:
    * $\ell_{\text{MSE}}(y_i, \hat{y}_i) = \frac{1}{2}||y_i - \hat{y}_i||^2$
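
As a minimal sketch, the two cross-entropy losses listed above could be written in JAX as follows. The function names, the clipping constant `eps`, and the toy inputs are assumptions made for illustration; class labels are 0-based integers here, whereas the formula counts classes from $1$ to $C$.

```python
import jax.numpy as jnp


def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross-entropy for labels y in {0, 1} and predicted probabilities y_hat."""
    y_hat = jnp.clip(y_hat, eps, 1.0 - eps)  # guard against log(0)
    return -y * jnp.log(y_hat) - (1.0 - y) * jnp.log(1.0 - y_hat)


def categorical_cross_entropy(y, y_hat, eps=1e-12):
    """Multi-class cross-entropy for integer labels y in {0, ..., C-1}.

    y_hat has shape (N, C) and holds predicted class probabilities.
    """
    y_hat = jnp.clip(y_hat, eps, 1.0)
    # Selecting y_hat[i, y[i]] implements the indicator sum over classes.
    picked = jnp.take_along_axis(y_hat, y[:, None], axis=1)[:, 0]
    return -jnp.log(picked)


# Hypothetical toy predictions
y_bin = jnp.array([1.0, 0.0])
p_bin = jnp.array([0.9, 0.2])
bce = binary_cross_entropy(y_bin, p_bin)

y_cls = jnp.array([0, 2])
p_cls = jnp.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
cce = categorical_cross_entropy(y_cls, p_cls)

print("Binary cross-entropy per example:", bce)
print("Multi-class cross-entropy per example:", cce)
```

Both functions return one loss value per datapoint; averaging them over the dataset gives the data term of the empirical risk.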

A typical choice of regularizer $r(\theta)$ is **weight decay (L2-regularization)** (as per Slide 6):

$$r_{L2}(\theta) = \frac{\lambda}{2} \sum_{j=1}^D \theta_j^2 = \frac{\lambda}{2} ||\theta||_2^2$$

### ERM as Maximum A Posteriori (MAP) Estimation

Crucially, the ERM objective can be directly related to a **Maximum A Posteriori (MAP) estimate** (as per Slide 7):

$$\theta_* = \arg \min_{\theta \in \mathbb{R}^D} \mathcal{L}(\theta) = \arg \min_{\theta \in \mathbb{R}^D} -\log p(\mathcal{D} | \theta) - \log p(\theta)$$
$$= \arg \max_{\theta \in \mathbb{R}^D} \underbrace{\log p(\mathcal{D} | \theta)}_{\text{likelihood}} + \underbrace{\log p(\theta)}_{\text{prior}} = \arg \max_{\theta \in \mathbb{R}^D} p(\theta | \mathcal{D})$$

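This shows that training a deep neural network via ERM amounts to finding the mode of the posterior distribution over the parameters, given the data, once the loss is read as a negative log-likelihood and the regularizer as a negative log-prior. The correspondence holds up to additive constants and a rescaling: multiplying $\mathcal{L}$ by $N$ leaves the minimizer unchanged and turns $r(\theta)$ into $N r(\theta)$, so the $\frac{1}{N}$ in front of the data term can be absorbed into the regularization strength.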

For example:
* **MSE loss** corresponds to a Gaussian likelihood $p(y_i | \hat{y}_i) = \mathcal{N}(y_i; \hat{y}_i, I)$.
* **Cross-entropy loss** corresponds to a Bernoulli (for binary) or Categorical (for multi-class) likelihood.
* **L2-regularization** corresponds to a Gaussian prior on the parameters $p(\theta) = \mathcal{N}(\theta; 0, \lambda^{-1}I)$.

This probabilistic interpretation of deep learning is fundamental to connecting it with the GP framework.
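
The correspondences above can be checked numerically. The following sketch (with hypothetical values for `lam`, the targets, and the parameter vectors) compares the negative Gaussian log-densities against the squared-error loss and the L2 penalty. If the two differ only by an offset that does not depend on the prediction or on $\theta$, they share the same minimizer.

```python
import jax
import jax.numpy as jnp
from jax.scipy.stats import norm

jax.config.update("jax_enable_x64", True)

lam = 0.5  # hypothetical regularization strength / prior precision


def neg_log_gauss_lik(y, y_hat):
    """-log N(y; y_hat, I), summed over output dimensions."""
    return -jnp.sum(norm.logpdf(y, loc=y_hat, scale=1.0))


def neg_log_gauss_prior(theta):
    """-log N(theta; 0, lam^{-1} I)."""
    return -jnp.sum(norm.logpdf(theta, loc=0.0, scale=lam**-0.5))


def half_squared_error(y, y_hat):
    return 0.5 * jnp.sum((y - y_hat) ** 2)


def l2_penalty(theta):
    return 0.5 * lam * jnp.sum(theta**2)


y = jnp.array([0.3, -1.2])
pred_a, pred_b = jnp.array([0.0, 1.0]), jnp.array([2.0, -3.0])
theta_a, theta_b = jnp.array([0.1, 2.0, -0.7]), jnp.array([1.5, -0.4, 0.0])

# Offsets between negative log-density and loss / penalty
lik_offset_a = neg_log_gauss_lik(y, pred_a) - half_squared_error(y, pred_a)
lik_offset_b = neg_log_gauss_lik(y, pred_b) - half_squared_error(y, pred_b)
prior_offset_a = neg_log_gauss_prior(theta_a) - l2_penalty(theta_a)
prior_offset_b = neg_log_gauss_prior(theta_b) - l2_penalty(theta_b)

print("Likelihood offsets:", lik_offset_a, lik_offset_b)
print("Prior offsets:", prior_offset_a, prior_offset_b)
```

The offsets are the Gaussian normalization constants, which is why they drop out of the $\arg\min$.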

Let's define some common loss and regularization functions in JAX and see how gradients can be computed.

In [None]:
# --- Loss and Regularization Functions in JAX ---


def mse_loss(predictions, targets):
    """Squared-error loss 0.5 * ||y - y_hat||^2, averaged over the batch."""
    return 0.5 * jnp.mean(jnp.sum(jnp.square(predictions - targets), axis=-1))


def l2_regularization(params, lambda_reg):
    """L2 regularization (weight decay)."""
    l2_norm = 0.0
    for layer in params:
        l2_norm += jnp.sum(jnp.square(layer["weights"]))
        # Biases are typically not regularized, but can be added if desired
    return 0.5 * lambda_reg * l2_norm


def total_loss(params, x_batch, y_batch, lambda_reg):
    """Combines forward pass, MSE loss, and L2 regularization."""
    predictions = mlp_forward(params, x_batch)
    data_loss = mse_loss(predictions, y_batch)
    reg_loss = l2_regularization(params, lambda_reg)
    return data_loss + reg_loss


# Example Usage:
dummy_x_batch = jnp.array([[1.0, 2.0], [0.5, -1.0]])
dummy_y_batch = jnp.array([[3.0], [0.0]])
lambda_reg = 0.01

# Compute the total loss
current_loss = total_loss(mlp_params, dummy_x_batch, dummy_y_batch, lambda_reg)
print("\n--- Loss and Regularization Example ---")
print(f"Current Total Loss: {current_loss:.4f}")

# Compute gradients of the total loss with respect to parameters
grad_loss = jax.grad(total_loss)(mlp_params, dummy_x_batch, dummy_y_batch, lambda_reg)
print("Gradients of parameters (first layer weights and biases):\n", grad_loss[0])


#### 4. Context: Parametric Regression

Let's connect deep learning to our previous understanding of **parametric regression** (as per Slide 9). For Gaussian / parametric / least-squares regression, we posited functions of the linear form:

$$f(x, \theta): \mathbb{X} \times \mathbb{R}^D \to \mathbb{R}, \quad f(x, \theta) = \phi(x)^T \theta$$

Here, $\phi: \mathbb{X} \to \mathbb{R}^D$ can be a feature map of almost any form (including discontinuities, point-masses, etc.), as long as it can be evaluated in a numerically stable way. With a single weight vector $\theta \in \mathbb{R}^D$, the output is scalar. $\mathbb{X}$ can also be a diverse set, not just $\mathbb{R}^M$. Examples include:
* Strings (Natural Language Processing)
* Graphs (molecules, genes, proteins)
* Functions (operators, simulations)
* Gödel numbers, etc.

This flexibility arises because $\phi(x)$ "masks" the input $x$, transforming it into a feature representation that the linear model then operates on.

The diagram on Slide 9 illustrates this: input $X$ is transformed into features $\phi_X$, which are then weighted by parameters $W$ (or $\theta$) to produce the output $y$.

### Probabilistic Inference in Parametric Regression

For parametric regression with Gaussian generative models (as per Slide 10):

Prior: $p(\theta) = \mathcal{N}(\theta; \mu, \Sigma)$
Likelihood: $p(y | f_X) = \mathcal{N}(y; f_X, \sigma^2 I)$, where $f_X = \phi_X^T \theta$

The posterior $p(\theta | y)$ is then also Gaussian:

$$p(\theta | y) = \frac{p(y | \theta) p(\theta)}{p(y)} = \mathcal{N}(\theta; \mu_{\text{post}}, \Sigma_{\text{post}})$$

with:
$$\mu_{\text{post}} = (\Sigma^{-1} + \sigma^{-2} \phi_X \phi_X^T)^{-1}(\Sigma^{-1}\mu + \sigma^{-2}\phi_X y)$$
$$\Sigma_{\text{post}} = (\Sigma^{-1} + \sigma^{-2} \phi_X \phi_X^T)^{-1}$$

Observation: $\mu_{\text{post}}$ is the mode of $\log p(\theta | y)$, and $-\Sigma_{\text{post}}^{-1}$ is the Hessian of $\log p(\theta | y)$. This directly links back to our MAP estimation and Laplace Approximation concepts.

As shown on Slide 11, this Gaussian inference can be described as L2-regularized empirical risk minimization:

$$\mu_{\text{post}} = \arg \min_{\theta \in \mathbb{R}^D} -\log p(\theta | y) = \arg \min_{\theta \in \mathbb{R}^D} -\log p(y | \theta) - \log p(\theta)$$
$$= \arg \min_{\theta \in \mathbb{R}^D} \frac{1}{2\sigma^2} ||y - \phi_X^T \theta||^2 + \frac{1}{2}(\theta - \mu)^T \Sigma^{-1} (\theta - \mu)$$
If we assume $\mu=0$ and $\Sigma = \lambda^{-1}I$, this becomes:
$$= \arg \min_{\theta \in \mathbb{R}^D} \frac{1}{2} \sum_{i=1}^N (y_i - \phi_{x_i}^T \theta)^2 + \frac{\sigma^2 \lambda}{2} ||\theta||^2$$
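
The three statements above (closed-form posterior, mode and Hessian, ridge equivalence) can be verified on a toy problem. This is a sketch under stated assumptions: the quadratic features, the deterministic perturbation standing in for noise, and the values of `sigma` and `lam` are all hypothetical choices, and `phi_X` is stored with shape $(D, N)$ to match the formulas.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)


def posterior(phi_X, y, mu, Sigma, sigma):
    """Gaussian posterior over theta; phi_X has shape (D, N)."""
    precision = jnp.linalg.inv(Sigma) + phi_X @ phi_X.T / sigma**2
    Sigma_post = jnp.linalg.inv(precision)
    mu_post = Sigma_post @ (jnp.linalg.solve(Sigma, mu) + phi_X @ y / sigma**2)
    return mu_post, Sigma_post


def neg_log_posterior(theta, phi_X, y, mu, Sigma, sigma):
    """Negative log posterior, up to an additive constant."""
    resid = y - phi_X.T @ theta
    diff = theta - mu
    return 0.5 * resid @ resid / sigma**2 + 0.5 * diff @ jnp.linalg.solve(Sigma, diff)


# Hypothetical toy problem with features [1, x, x^2]
x = jnp.linspace(-2.0, 2.0, 20)
phi_X = jnp.stack([jnp.ones_like(x), x, x**2])  # shape (3, 20)
y = 0.5 - 1.0 * x + 0.3 * x**2 + 0.1 * jnp.sin(5.0 * x)
sigma, lam = 0.2, 2.0
mu, Sigma = jnp.zeros(3), jnp.eye(3) / lam

mu_post, Sigma_post = posterior(phi_X, y, mu, Sigma, sigma)

# The gradient vanishes at the mode; the Hessian of the negative log
# posterior equals the posterior precision Sigma_post^{-1}.
grad_at_mode = jax.grad(neg_log_posterior)(mu_post, phi_X, y, mu, Sigma, sigma)
hess = jax.hessian(neg_log_posterior)(mu_post, phi_X, y, mu, Sigma, sigma)

# Ridge regression with regularization strength sigma^2 * lam
theta_ridge = jnp.linalg.solve(
    phi_X @ phi_X.T + sigma**2 * lam * jnp.eye(3), phi_X @ y
)

print("Posterior mean:", mu_post)
print("Ridge solution:", theta_ridge)
print("Max |gradient| at the mode:", jnp.max(jnp.abs(grad_at_mode)))
```

Because the model is linear in $\theta$, the Laplace approximation is exact here: the Hessian is constant and the posterior is exactly Gaussian.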

Let's illustrate a simple parametric regression model using a polynomial feature map.

In [None]:
# --- Parametric Regression Example ---


def polynomial_features(x, degree):
    """Generates polynomial features for input x."""
    return jnp.power(x[:, None], jnp.arange(degree + 1))


def linear_model_predict(features, weights):
    """Predicts output using a linear model: f(x) = phi(x)^T @ weights."""
    return features @ weights


# Generate some dummy 1D data
np.random.seed(1)
X_1d_np = np.linspace(-2, 2, 50)
true_weights = jnp.array([0.5, -1.0, 0.3])  # For a quadratic function
true_features = polynomial_features(jnp.array(X_1d_np), degree=2)
y_1d_np = np.asarray(
    linear_model_predict(true_features, true_weights) + 0.2 * np.random.randn(50)
)

X_1d_jax = jnp.array(X_1d_np)
y_1d_jax = jnp.array(y_1d_np)

# Initialize random weights for the linear model
key, subkey = jax.random.split(jax.random.PRNGKey(1))
initial_weights = jax.random.normal(subkey, (3,))

# Compute predictions with initial weights
features_at_x = polynomial_features(X_1d_jax, degree=2)
initial_predictions = linear_model_predict(features_at_x, initial_weights)

print("\n--- Parametric Regression Example ---")
print("Initial Weights:\n", initial_weights)
print("First 5 True Y values:\n", y_1d_jax[:5])
print("First 5 Initial Predictions:\n", initial_predictions[:5])

# Plotting the initial fit
fig_param_reg = go.Figure()
fig_param_reg.add_trace(
    go.Scatter(x=X_1d_np, y=y_1d_np, mode="markers", name="True Data")
)
fig_param_reg.add_trace(
    go.Scatter(
        x=X_1d_np,
        y=np.asarray(initial_predictions),
        mode="lines",
        name="Initial Model Fit",
    )
)
fig_param_reg.update_layout(
    title_text="Parametric Regression with Polynomial Features (Initial Fit)",
    title_x=0.5,
    xaxis_title="X",
    yaxis_title="Y",
    height=500,
    width=700,
)
fig_param_reg.show()


#### 5. Context: Nonparametric Regression (GPs)

In contrast to parametric models, **nonparametric regression** with Gaussian Processes (GPs) operates without explicit weights in the traditional sense (as per Slide 12):

Prior: $p(f) = \mathcal{GP}(f; m, k)$
Likelihood: $p(y | f_X) = \mathcal{N}(y; f_X, \sigma^2 I)$

The posterior $p(f | y)$ is also a GP, with updated mean and covariance functions:

$$m_{\text{post}}(\bullet) = m_{\bullet} + K_{\bullet X}(K_{XX} + \sigma^2 I)^{-1}(y - m_X)$$
$$k_{\text{post}}(\bullet, \circ) = k_{\bullet \circ} - K_{\bullet X}(K_{XX} + \sigma^2 I)^{-1}K_{X \circ}$$
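
A minimal sketch of these two update equations in JAX, conditioning on the observations $y$. It assumes 1D inputs, a square-exponential kernel with unit length scale, and a zero prior mean by default; the function names and the toy data are hypothetical. A Cholesky factorization is used in place of an explicit inverse.

```python
import jax
import jax.numpy as jnp
from jax.scipy.linalg import cho_solve

jax.config.update("jax_enable_x64", True)


def rbf(a, b, length_scale=1.0):
    """Square-exponential kernel between 1D input arrays a and b."""
    return jnp.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale**2)


def gp_posterior(x_train, y_train, x_test, sigma, mean_fn=lambda x: jnp.zeros_like(x)):
    """Posterior mean vector and covariance matrix at x_test."""
    G = rbf(x_train, x_train) + sigma**2 * jnp.eye(x_train.shape[0])
    K_sX = rbf(x_test, x_train)
    chol = jnp.linalg.cholesky(G)  # lower-triangular factor of K_XX + sigma^2 I
    alpha = cho_solve((chol, True), y_train - mean_fn(x_train))
    m_post = mean_fn(x_test) + K_sX @ alpha
    k_post = rbf(x_test, x_test) - K_sX @ cho_solve((chol, True), K_sX.T)
    return m_post, k_post


# Hypothetical toy data
x_train = jnp.array([-1.5, -0.5, 0.3, 1.2])
y_train = jnp.sin(x_train)
x_test = jnp.linspace(-3.0, 3.0, 7)
sigma = 0.1

m_post, k_post = gp_posterior(x_train, y_train, x_test, sigma)
print("Posterior mean:", m_post)
print("Posterior std:", jnp.sqrt(jnp.clip(jnp.diag(k_post), 0.0)))
```

The vector `alpha` holds the coefficients of the posterior mean in the basis of kernel functions centered at the training inputs, which is the form in which kernel ridge regression writes the same estimate.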

Although a GP model is nonparametric ("there are no weights"), there are interesting connections:

* $m_{\text{post}}(\bullet)$ is the minimizer of the **ridge loss in the Reproducing Kernel Hilbert Space (RKHS)** $\mathcal{H}_k$:
    $$m_{\text{post}}(\bullet) = \arg \min_{f \in \mathcal{H}_k} \left( \frac{1}{2} \sum_{i=1}^N (y_i - f(x_i))^2 + \frac{\sigma^2}{2} ||f||_{\mathcal{H}_k}^2 \right)$$
* $k_{\text{post}}(\bullet, \bullet)$ provides a worst-case error estimate for RKHS functions of bounded norm.

Every RKHS function can be expanded in the (countably many) eigenfunctions $\phi_i$ of the kernel, with associated eigenvalues $\lambda_i$:
$$\mathcal{H}_k = \left\{ f(x) := \sum_{i \in I} \alpha_i \lambda_i^{1/2} \phi_i(x) \text{ such that } ||f||_{\mathcal{H}_k}^2 := \sum_{i \in I} \alpha_i^2 < \infty \right\}$$
with inner product $\langle f, g \rangle_{\mathcal{H}_k} := \sum_{i \in I} \alpha_i \beta_i$, where $\beta_i$ are the expansion coefficients of $g(x) = \sum_{i \in I} \beta_i \lambda_i^{1/2} \phi_i(x)$.

This perspective highlights that even nonparametric models implicitly operate within a structured function space defined by the kernel.

#### 6. Context: Logistic Regression / GP Classification

To model binary classification problems with labels $y \in \{-1, +1\}$, we change the likelihood function (as per Slide 13):

$$p(y | f(x)) = \sigma(yf(x)) \quad \text{with } \sigma(a) = \frac{1}{1 + \exp(-a)}$$

The negative log-likelihood (which becomes the loss function) is then:

$$-\log p(y | f(x)) = \log(1 + \exp(-yf(x)))$$

For multi-class classification, the likelihood uses the softmax function:

$$p(y | f(x)) = \text{softmax}(f(x))_y \quad \text{with } \text{softmax}(a)_i = \frac{\exp(a_i)}{\sum_{j=1}^C \exp(a_j)}$$

The corresponding negative log-likelihood (cross-entropy loss) is:

$$-\log p(y | f(x)) = -f(x)_y + \log \sum_{j=1}^C \exp(f(x)_j)$$
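
Both negative log-likelihoods can be evaluated directly from the latent values $f(x)$ without forming probabilities first. The sketch below uses `jnp.logaddexp` and `logsumexp` so that large latent values do not overflow; the function names and the toy values are assumptions for illustration, and class labels are 0-based integers.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp


def logistic_nll(y, f):
    """-log sigma(y f) = log(1 + exp(-y f)) for labels y in {-1, +1}."""
    return jnp.logaddexp(0.0, -y * f)


def softmax_nll(y, f):
    """-f_y + log sum_j exp(f_j) for integer labels y and latent values f of shape (N, C)."""
    f_y = jnp.take_along_axis(f, y[:, None], axis=1)[:, 0]
    return -f_y + logsumexp(f, axis=1)


# Hypothetical latent values; the third binary example would overflow
# a naive log(1 + exp(100)) in 32-bit floats.
f_bin = jnp.array([2.0, -0.5, 100.0])
y_bin = jnp.array([1.0, 1.0, -1.0])
nll_bin = logistic_nll(y_bin, f_bin)

f_multi = jnp.array([[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]])
y_multi = jnp.array([0, 2])
nll_multi = softmax_nll(y_multi, f_multi)

print("Logistic NLL:", nll_bin)
print("Softmax NLL:", nll_multi)
```

Note the label convention: the logistic form above uses $y \in \{-1, +1\}$, while the binary cross-entropy in Section 3 uses $y \in \{0, 1\}$. The two describe the same Bernoulli likelihood.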

Implementing multi-class GP classification requires **multi-output GPs**, which involve a joint covariance function between different outputs (as discussed on Slide 14):

$$k(f_c(a), f_d(b)) = k((a, c), (b, d))$$

While possible, this often leads to more complex covariance structures. Simple cases factorize covariance between inputs and outputs, leading to Kronecker structure, or assume independent outputs. For this course, we've focused on binary classification to keep the code structure manageable.
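
To make the Kronecker structure concrete, here is a small sketch of a factorized multi-output covariance $k((a, c), (b, d)) = B_{cd} \, k(a, b)$. The output covariance `B`, the inputs, and the kernel are hypothetical choices for illustration.

```python
import jax.numpy as jnp


def rbf(a, b):
    """Square-exponential kernel between 1D input arrays a and b."""
    return jnp.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)


x = jnp.array([-1.0, 0.0, 2.0])  # N = 3 inputs
B = jnp.array([[1.0, 0.5], [0.5, 2.0]])  # hypothetical covariance between C = 2 outputs

K_x = rbf(x, x)

# With outputs as the outer index, entry (c * N + a, d * N + b) is B[c, d] * K_x[a, b].
K_joint = jnp.kron(B, K_x)  # shape (C * N, C * N)

# Independent outputs are the special case B = I, giving a block-diagonal matrix.
K_indep = jnp.kron(jnp.eye(2), K_x)

print("Joint Gram matrix shape:", K_joint.shape)
```

The factorized form is convenient because the eigendecomposition of a Kronecker product is built from those of its factors, so linear algebra on `K_joint` can often be done on `B` and `K_x` separately.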

#### 7. Are Infinitely Many Weights Enough? (Universality of Kernels)

A fascinating question arises: If GPs are nonparametric and can be seen as "infinitely wide single-layer neural networks," does this mean they can learn anything? (as per Slide 15)

For some kernels, known as **universal kernels**, the RKHS is dense in the space of all continuous functions. A prime example is the **Square-Exponential / Gaussian / RBF kernel**:

$$k(a, b) = \exp\left(-\frac{1}{2}(a - b)^2\right)$$

When using such kernels for GP or kernel-ridge regression, for any continuous function $f$ and any $\epsilon > 0$, there exists an RKHS element $\hat{f} \in \mathcal{H}_k$ such that $||f - \hat{f}|| < \epsilon$ (where $||\cdot||$ is the maximum norm on a compact subset of $X$).

**This implies that, given enough data, the GP posterior mean can approximate any continuous function on a compact domain arbitrarily well!** In this sense GPs are "infinitely flexible". Fast convergence, however, is only guaranteed if the true function is sufficiently smooth and the RKHS covers its function space well (theorem by van der Vaart & van Zanten, 2011, Slide 27).

### The Bad News: If f is not in the RKHS

However, this theoretical flexibility comes with a caveat: if the true function $f$ is **not well-covered by the RKHS**, the convergence rates can be severely impacted (as illustrated on Slides 16-24). The number of data points required to achieve a certain error $\epsilon$ can become **exponential in $1/\epsilon$**. Outside the observation range, there are no guarantees at all.

This is analogous to representing an irrational number like $\pi$ using rational numbers (Slide 26). While rational numbers are dense in real numbers, some sequences (like decimal expansion) converge quickly, while others (like Gregory-Leibniz series) converge very slowly, requiring many "datapoints" (terms) to achieve high precision.
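
The analogy can be tried out directly. The sketch below compares two rational approximations of $\pi$: truncated decimal expansions and partial sums of the Gregory-Leibniz series $4 \sum_k (-1)^k / (2k+1)$. The helper names are hypothetical.

```python
import math


def leibniz_pi(n_terms):
    """Partial sum of the Gregory-Leibniz series with n_terms terms."""
    return 4.0 * sum((-1) ** k / (2 * k + 1) for k in range(n_terms))


def decimal_pi(n_digits):
    """pi truncated to n_digits decimal places (a rational number)."""
    scale = 10**n_digits
    return math.floor(math.pi * scale) / scale


for n in (1, 3, 5):
    print(
        f"n = {n}: decimal error = {abs(math.pi - decimal_pi(n)):.2e}, "
        f"Leibniz error = {abs(math.pi - leibniz_pi(n)):.2e}"
    )

# Five decimal digits versus one thousand series terms
err_decimal = abs(math.pi - decimal_pi(5))
err_leibniz = abs(math.pi - leibniz_pi(1000))
print(f"5 digits: {err_decimal:.2e}, 1000 Leibniz terms: {err_leibniz:.2e}")
```

Each extra decimal digit shrinks the error by a factor of ten, while the Leibniz error only decays roughly like one over the number of terms. Both sequences converge, but at very different rates.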

The takeaway is that while GPs are theoretically powerful, practical convergence can depend heavily on the alignment between the true function's properties and the chosen kernel's RKHS.

#### 8. Deep or Wide? An Assessment

This brings us to a crucial question: why is deep learning so popular if nonparametric models like GPs offer such theoretical flexibility? (as per Slide 28)

While nonparametric models provide a strong theoretical foundation for supervised learning, "having infinite flexibility" does not automatically translate to fast learning in practice. There are applications where carefully designing the RKHS or sample space matters (e.g., in simulation methods).

The real reasons for deep learning's popularity are often practical, rather than purely theoretical. It's important to separate fact from misconception.

### Deep Learning: An Assessment (Pros and Cons)

**What people like about deep learning (as per Slide 29):**
* **Efficient Training**: Training with optimizers like SGD, Adam, etc., is often described as "O(1)" per step: the stochastic gradient is computed on a mini-batch of size $B \ll N$, so the cost of one step does not grow with the dataset size $N$. This makes training scalable to large datasets.
    $$\nabla \mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^N \nabla \ell(y_i, f(x_i, \theta)) + \nabla r(\theta) \approx \frac{1}{B} \sum_{j=1}^B \nabla \ell(y_{i(j)}, f(x_{i(j)}, \theta)) + \nabla r(\theta)$$
* **Parallelization**: The parametric loss has an array structure, allowing for efficient sharding and parallelization across multiple processors or GPUs.
* **Intuitive Metaphors**: Concepts like neural networks, skip connections, attention, pooling, and compression resonate well and provide useful abstractions.
* **Model Deployment**: Once trained, the model (parameters) can be deployed independently of the training data, often hidden behind an API, which is convenient for commercial applications.
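
The mini-batch approximation behind the efficient-training point can be illustrated with a small sketch. It assumes a linear model with squared-error loss and omits the regularizer for brevity; the data, sizes, and function names are hypothetical. Each mini-batch gradient touches only $B$ datapoints, and averaging the gradients over a full partition of the data recovers the full-batch gradient.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)


def per_example_loss(theta, x, y):
    """Squared-error loss of a linear model on a single datapoint."""
    return 0.5 * (y - x @ theta) ** 2


def batch_loss(theta, X, Y):
    """Average loss over a batch, using vmap over datapoints."""
    return jnp.mean(jax.vmap(per_example_loss, in_axes=(None, 0, 0))(theta, X, Y))


key_x, key_y, key_b = jax.random.split(jax.random.PRNGKey(0), 3)
N, D, B = 1000, 3, 50
X = jax.random.normal(key_x, (N, D))
Y = X @ jnp.array([1.0, -2.0, 0.5]) + 0.1 * jax.random.normal(key_y, (N,))
theta = jnp.zeros(D)

# Full-batch gradient: touches all N datapoints
full_grad = jax.grad(batch_loss)(theta, X, Y)

# Shuffle once and partition the data into N / B mini-batches of size B
perm = jax.random.permutation(key_b, N)
batches = perm.reshape(N // B, B)
minibatch_grads = jnp.stack(
    [jax.grad(batch_loss)(theta, X[idx], Y[idx]) for idx in batches]
)

print("Full gradient:          ", full_grad)
print("One mini-batch gradient:", minibatch_grads[0])
print("Mean over mini-batches: ", minibatch_grads.mean(axis=0))
```

A single mini-batch gradient is a noisy estimate of the full gradient, which is the price paid for a per-step cost that is independent of $N$.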

**What people don't like about deep learning (as per Slide 30):**
* **Fiddly Training**: Training a deep net is often an art, requiring many choices and hyperparameter tuning:
    * Initialization strategy for weights.
    * Learning rate and learning-rate schedules.
    * Other optimizer parameters.
    * Regularization techniques (dropout, batch normalization, weight decay).
    * Deciding when to stop training and monitoring optimizer stability.
    * Overall architecture design.
* **Model Updates**: It's often unclear how to efficiently update a trained deep learning model when new data arrives (requiring retraining or complex fine-tuning).
* **Conceptual Pathologies**: Deep learning models can exhibit certain brittleness, leading to issues like adversarial examples and poor generalization to out-of-distribution data.

### GPs and Kernels: An Assessment (Pros and Cons)

**What people like about GPs (as per Slide 31):**
* **Automatic Training**: Training primarily involves linear algebra, with no explicit parameters to tune in the same way as deep networks.
* **Full Probabilistic Model**: Provides a complete probabilistic framework, allowing for interpretability (e.g., drawing samples from the prior) and crucial uncertainty quantification.
* **Easy Updates**: Updating the model with new data is straightforward using techniques like the Schur complement.
* **Elegant Mathematical Theory**: Supported by a rich and elegant mathematical theory.
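
As a sketch of the "easy updates" point, the inverse of the regularized Gram matrix can be extended by one datapoint through the Schur complement, at $O(N^2)$ cost instead of a fresh $O(N^3)$ inversion. The helper `extend_inverse`, the kernel, and the toy inputs are hypothetical; a practical implementation would more likely update a Cholesky factor than an explicit inverse.

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)


def rbf(a, b):
    """Square-exponential kernel between 1D input arrays a and b."""
    return jnp.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)


def extend_inverse(A_inv, b, c):
    """Inverse of [[A, b], [b^T, c]] given A^{-1}, via the Schur complement of A."""
    v = A_inv @ b
    s = c - b @ v  # Schur complement, a scalar for a single new datapoint
    top_left = A_inv + jnp.outer(v, v) / s
    return jnp.block(
        [
            [top_left, -v[:, None] / s],
            [-v[None, :] / s, jnp.reshape(1.0 / s, (1, 1))],
        ]
    )


sigma = 0.1
x_old = jnp.array([-1.0, 0.0, 1.5])
x_new = jnp.array([0.7])

# Inverse of K_XX + sigma^2 I for the old data
G_old_inv = jnp.linalg.inv(rbf(x_old, x_old) + sigma**2 * jnp.eye(3))

# Cross-covariance and (noisy) variance of the new datapoint
b = rbf(x_old, x_new)[:, 0]
c = rbf(x_new, x_new)[0, 0] + sigma**2
G_new_inv = extend_inverse(G_old_inv, b, c)

# Reference: invert the full 4 x 4 matrix from scratch
x_all = jnp.concatenate([x_old, x_new])
G_direct_inv = jnp.linalg.inv(rbf(x_all, x_all) + sigma**2 * jnp.eye(4))

print("Max deviation from direct inverse:", jnp.max(jnp.abs(G_new_inv - G_direct_inv)))
```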

**What people don't like about GPs (as per Slide 31):**
* **Computational Cost**: Training is typically $O(N^3)$ with respect to the number of data points $N$, making it computationally expensive for large datasets.
* **"The data is the model"**: Releasing a GP model often means releasing the entire training data, which can be a privacy or intellectual property concern.
* **Limited Parallelization**: Because all data points interact directly (not via a compact weight-space), sharding and parallelization are not as straightforward as in deep learning.

#### 9. Summary

To summarize (as per Slide 32):

* **Deep Learning and GP regression/classification are closely related** from a probabilistic perspective, often representing different approaches to function approximation and inference.
* The **central difference** lies in their structure: Deep models are hierarchical and inherently nonlinear, while basic GP models (with standard kernels) are often seen as shallow or linear in a high-dimensional feature space.
* The **nonlinear nature of deep models** can be advantageous for performance but necessitates complex nonlinear optimization for training.
* Both **shallow (GP) and deep models have distinct advantages and disadvantages**, making each suitable for different problem settings and priorities.

This lecture serves as a bridge, highlighting the connections and contrasts between these two powerful paradigms in machine learning. The next lecture will delve into combining GPs and deep learning, exploring the best of both worlds.

#### Exercises

Since this lecture is more conceptual, the exercises will focus on understanding the implications of the discussed concepts.

**Exercise 1: ERM and MAP Equivalence**
Consider a simple linear regression problem with a Gaussian likelihood and a Gaussian prior on the weights. Write down the explicit form of the ERM objective function and show how it maps directly to the negative log-posterior, thus demonstrating the MAP equivalence.

**Exercise 2: Deep vs. Shallow Model Strengths**
Based on the pros and cons discussed, identify a real-world problem where a GP model would likely be preferred over a deep learning model, and explain why. Then, identify a problem where a deep learning model would be more suitable, and explain your reasoning.

**Exercise 3: The "Data is the Model" Implication**
Discuss the practical and ethical implications of the statement "The data is the model" for GPs, especially in scenarios involving sensitive data or intellectual property. How do deep learning models mitigate (or exacerbate) these issues?

**Exercise 4: Challenges in Deep Learning Training**
Choose one of the "fiddly" aspects of deep learning training (e.g., learning rate schedules, regularization). Briefly research common strategies used to address this challenge and explain why it's a non-trivial problem.