# Probabilistic Machine Learning: Lecture 25 - A Historical Perspective

#### Introduction

Welcome to the final lecture of Probabilistic Machine Learning! Having explored the theoretical foundations, key models, and powerful algorithms of probabilistic machine learning, we now take a step back to appreciate the **historical context** that shaped this fascinating field. This lecture will trace some of the intellectual lineages and pivotal moments, highlighting how core ideas emerged, evolved, and converged across different scientific disciplines and challenging historical periods.



#### 1. The Probabilistic Machine Learning Toolbox (Recap)

Before diving into history, let's briefly revisit the comprehensive **toolbox** we've assembled throughout this course (Slide 3):

* **Framework**: The bedrock of probabilistic reasoning – sum rule, product rule, and Bayes' Theorem – enabling us to describe all inference tasks by assigning probabilities.
* **Modeling**: Diverse tools for constructing models:
    * **Directed Graphical Models**: For representing conditional independencies and joint distributions.
    * **Exponential Families**: A class of distributions allowing for tractable inference and conjugate priors.
    * **Gaussian Distributions**: The workhorse of linear algebra-based inference.
    * **Kernels**: For abstracting inner products and enabling nonparametric models like Gaussian Processes.
    * **Markov Chains**: For modeling time series with local memory.
    * **Deep Networks**: Flexible function approximators, which we've learned can be viewed through a probabilistic lens (e.g., as GPs via Laplace approximation).
* **Computation**: Algorithms for performing inference and learning:
    * **Autodiff**: For efficient gradient computation in optimization.
    * **MAP with Laplace approximations**: For approximating intractable posteriors with Gaussians.
    * **Linear algebra as a computational primitive**: The cornerstone of Gaussian inference.
    * **Variational Inference**: For approximating intractable posteriors by optimizing a lower bound (ELBO).
    * **Monte Carlo**: For approximating integrals and expectations through sampling.

These tools, developed over decades by brilliant minds, form the foundation of modern probabilistic machine learning.

#### 2. A Historical Story: Lemberg / Lwów / Львів (Slides 4-9)

The history of mathematics and science is often intertwined with geopolitical events. One poignant story comes from **Lemberg (Lwów / Львів)**, a city with a rich intellectual heritage that endured immense suffering in the 20th century. This city was home to the **Lwów School of Mathematics**, a vibrant community of Polish mathematicians in the interwar period.

Central to this school was the **Scottish Cafe**, where mathematicians like **Stefan Banach, Stanisław Ulam, and Hugo Steinhaus** would meet, discuss problems, and even write them down in a notebook known as the **Scottish Book**. These informal gatherings fostered an environment of intense collaboration and groundbreaking discoveries.

**Hugo Steinhaus (1887-1972)**, a key figure of the Lwów School and a PhD student of David Hilbert, is particularly relevant to our course. He was a pioneer in functional analysis and probability theory. During the "Third Reich" and the horrors of World War II, many Jewish scientists, including some from Lwów, were forced into hiding or fled. Tragically, the **Lemberger Professorenmorde (Massacre of Lwów Professors)** in July 1941 saw the execution of many Polish intellectuals, including mathematicians, by Nazi forces.

Despite these dark times, some of the mathematicians in this story made it out alive: the Hungarian-born **John von Neumann (1903-1957)** and **Stanisław Ulam (1909-1984)**, a member of the Lwów School, had both emigrated to the United States in the 1930s. Their contributions, particularly at institutions like Los Alamos, would profoundly impact science and technology, including the development of the atomic bomb and the birth of modern computing.

This historical backdrop reminds us that scientific progress is a human endeavor, often pursued amidst challenging circumstances, and that the abstract ideas we study have deep roots in the lives and experiences of their creators.

#### 3. k-Means Clustering: An Early Method and its Convergence (Slides 10, 13-14)

One of the oldest and most widely used clustering methods, **k-Means**, was proposed by **Hugo Steinhaus** in 1957. It's a simple, iterative algorithm for partitioning $N$ data points into $K$ clusters.

**Algorithm:**
1.  **Initialization**: Randomly choose $K$ initial cluster means (centroids) $\{m_k\}_{k=1}^K$.
2.  **Assignment (E-step-like)**: Assign each data point $x_n$ to the cluster whose mean $m_k$ is closest to it (e.g., using Euclidean distance). This can be represented by binary responsibilities $r_{nk} = 1$ if $x_n$ is assigned to cluster $k$, and $0$ otherwise.
    $$k_n = \arg \min_k ||x_n - m_k||^2$$
3.  **Update (M-step-like)**: Update each cluster mean $m_k$ to be the sample mean of all data points assigned to that cluster.
    $$m_k \leftarrow \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}$$
4.  **Repeat**: Iterate steps 2 and 3 until the cluster assignments no longer change.

A remarkable property of k-Means is that it **always converges**. This is due to the existence of a **Lyapunov Function** (Slide 13). A Lyapunov function $J$ for an iterative algorithm is a function of the algorithm's state variables that is bounded below and does not increase in any step. The existence of such a function guarantees that the values of $J$ converge. For k-Means there are only finitely many possible assignments, so (given consistent tie-breaking) the algorithm stops after finitely many iterations at a local (not necessarily global) minimum of $J$.

For k-Means, the objective function is the sum of squared distances of each point to its assigned cluster mean:
$$J(r, m) := \sum_n \sum_k r_{nk} \frac{1}{2} ||x_n - m_k||^2$$
Both the assignment step (by definition, each point is assigned to its *nearest* mean) and the update step (by setting the mean to the sample mean, which minimizes the sum of squared errors for a fixed set of points) guarantee that $J(r, m)$ is non-increasing. Since $J$ is bounded below (it's a sum of squared distances), it must converge.
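Before the full implementation, the monotone behaviour of $J$ can be checked numerically. The following is a minimal sketch under stated assumptions: the toy data, the seed, and the helper names `kmeans_objective`, `assign` and `update` are chosen here for illustration and are not part of the lecture code. It records $J$ after every assignment step and every update step.

```python
import jax
import jax.numpy as jnp

def kmeans_objective(X, means, assignments):
    """J(r, m) = sum_n 1/2 ||x_n - m_{k_n}||^2 for hard assignments k_n."""
    return 0.5 * jnp.sum((X - means[assignments]) ** 2)

def assign(X, means):
    """Index of the nearest mean for every data point."""
    d2 = jnp.sum((X[:, None, :] - means[None, :, :]) ** 2, axis=-1)
    return jnp.argmin(d2, axis=1)

def update(X, assignments, means):
    """Sample mean per cluster; an empty cluster keeps its previous mean."""
    r = jax.nn.one_hot(assignments, means.shape[0], dtype=X.dtype)
    counts = r.sum(axis=0)
    new_means = (r.T @ X) / jnp.maximum(counts, 1.0)[:, None]
    return jnp.where(counts[:, None] > 0, new_means, means)

# Toy data (an assumption for illustration): 200 standard-normal points in 2D
key_data, key_init = jax.random.split(jax.random.PRNGKey(0))
X = jax.random.normal(key_data, (200, 2))
means = X[jax.random.choice(key_init, 200, (3,), replace=False)]

J_trace = []
for _ in range(20):
    assignments = assign(X, means)
    J_trace.append(float(kmeans_objective(X, means, assignments)))  # after assignment
    means = update(X, assignments, means)
    J_trace.append(float(kmeans_objective(X, means, assignments)))  # after update

print(J_trace[:6])
```

Each recorded value should be no larger than the one before it (up to floating-point rounding), which is the Lyapunov property in action.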

Let's implement a simple k-Means algorithm using JAX.

In [ ]:
import jax.numpy as jnp
import jax
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from jax.scipy.stats import multivariate_normal as mvn
from jax.scipy.special import digamma, gammaln # For Dirichlet and Wishart expectations

# Set JAX to use 64-bit floats for numerical stability
jax.config.update("jax_enable_x64", True)

# --- Utility Functions (from previous lectures, adapted) ---
def plot_gmm_plotly(X, means, covariances, responsibilities=None, title="Gaussian Mixture Model", colors=['red', 'blue', 'green', 'purple', 'orange']):
    """Plots 2D data, GMM components, and optionally responsibilities."""
    fig = go.Figure()

    # Plot data points, optionally colored by responsibility
    if responsibilities is not None:
        dominant_component = jnp.argmax(responsibilities, axis=1)
        for k in range(means.shape[0]):
            mask = dominant_component == k
            fig.add_trace(go.Scatter(
                x=X[mask, 0],
                y=X[mask, 1],
                mode='markers',
                marker=dict(color=colors[k % len(colors)], size=5, opacity=0.7),
                name=f'Data (Component {k+1})',
                showlegend=True
            ))
    else:
        fig.add_trace(go.Scatter(
            x=X[:, 0],
            y=X[:, 1],
            mode='markers',
            marker=dict(color='gray', size=5, opacity=0.7),
            name='Data Points',
            showlegend=True
        ))

    # Plot Gaussian components (mean and covariance ellipses)
    for k in range(means.shape[0]):
        mean = means[k]
        cov = covariances[k]
        
        # Draw the 95% probability contour of the Gaussian as a rotated ellipse.
        # (A plotly 'circle' shape is always axis-aligned and would ignore the
        # orientation given by the eigenvectors of the covariance.)
        vals, vecs = jnp.linalg.eigh(cov)
        # 5.991 is the 95% quantile of the chi-squared distribution with 2 DOF
        radii = jnp.sqrt(5.991 * vals)
        t = jnp.linspace(0.0, 2.0 * jnp.pi, 100)
        unit_circle = jnp.stack([jnp.cos(t), jnp.sin(t)])  # shape (2, 100)
        ellipse = mean[:, None] + vecs @ (radii[:, None] * unit_circle)

        fig.add_trace(go.Scatter(
            x=np.asarray(ellipse[0]),
            y=np.asarray(ellipse[1]),
            mode='lines',
            line=dict(color=colors[k % len(colors)], width=2),
            opacity=0.8,
            name=f'Component {k+1}',
            showlegend=False
        ))
        # Add mean point
        fig.add_trace(go.Scatter(
            x=[mean[0]],
            y=[mean[1]],
            mode='markers',
            marker=dict(symbol='x', size=10, color=colors[k % len(colors)], line=dict(width=2, color='black')),
            name=f'Mean {k+1}',
            showlegend=False
        ))

    fig.update_layout(title_text=title, title_x=0.5,
                      xaxis_title='Feature 1',
                      yaxis_title='Feature 2',
                      autosize=False, width=700, height=600)
    fig.update_yaxes(scaleanchor="x", scaleratio=1) # Keep aspect ratio square
    fig.show()

def generate_gmm_data(num_samples=300, num_components=3, random_seed=42):
    """Generates synthetic data from a Gaussian Mixture Model."""
    np.random.seed(random_seed)
    
    # True parameters for 3 components
    true_weights = np.array([0.3, 0.4, 0.3])
    true_means = np.array([
        [0, 0],
        [3, 3],
        [0, 4]
    ])
    true_covariances = np.array([
        [[0.5, 0.2], [0.2, 0.5]],
        [[0.8, -0.1], [-0.1, 0.8]],
        [[0.6, 0.3], [0.3, 0.6]]
    ])

    X = []
    for _ in range(num_samples):
        # Choose a component based on weights
        k = np.random.choice(num_components, p=true_weights)
        # Sample from the chosen Gaussian
        sample = np.random.multivariate_normal(true_means[k], true_covariances[k])
        X.append(sample)
    
    return jnp.array(X), true_weights, true_means, true_covariances

# --- K-Means Implementation ---
@jax.jit
def assign_to_clusters(X, means):
    """Assigns each data point to the closest mean."""
    # Compute squared Euclidean distance from each point to each mean
    diff = X[:, jnp.newaxis, :] - means[jnp.newaxis, :, :]
    distances_sq = jnp.sum(diff**2, axis=-1)
    # Get the index of the closest mean for each point
    assignments = jnp.argmin(distances_sq, axis=1)
    return assignments

from functools import partial

@partial(jax.jit, static_argnames="num_components")
def update_means(X, assignments, prev_means, num_components):
    """Updates the means based on current assignments."""
    # One-hot responsibilities r[n, k]. Boolean-mask indexing such as
    # X[assignments == k] has a data-dependent shape and cannot be traced by jax.jit.
    r = jax.nn.one_hot(assignments, num_components, dtype=X.dtype)
    counts = jnp.sum(r, axis=0)  # number of points per cluster, shape (K,)
    sums = r.T @ X               # per-cluster coordinate sums, shape (K, D)
    new_means = sums / jnp.maximum(counts, 1.0)[:, jnp.newaxis]
    # If a cluster becomes empty, keep its previous mean
    return jnp.where(counts[:, jnp.newaxis] > 0, new_means, prev_means)

def run_kmeans(X, num_components, max_iter=100, key=None):
    """Runs the K-Means clustering algorithm."""
    if key is None:
        key = jax.random.PRNGKey(0)

    num_samples, data_dim = X.shape

    # Initialize means randomly from data points
    key, subkey_init = jax.random.split(key)
    random_indices = jax.random.choice(subkey_init, num_samples, (num_components,), replace=False)
    means = X[random_indices]

    prev_assignments = None
    history_means = [means]

    print(f"Starting K-Means with {num_components} clusters...")
    for i in range(max_iter):
        # E-step-like: Assign points to clusters
        assignments = assign_to_clusters(X, means)

        # M-step-like: Update cluster means (empty clusters keep their previous mean)
        new_means = update_means(X, assignments, means, num_components)

        # Check for convergence
        if prev_assignments is not None and jnp.all(assignments == prev_assignments):
            print(f"K-Means converged in {i+1} iterations.")
            break

        means = new_means
        prev_assignments = assignments
        history_means.append(means)
    else:
        print(f"K-Means did not converge after {max_iter} iterations.")

    # Compute final responsibilities for plotting
    final_assignments = assign_to_clusters(X, means)
    final_responsibilities = jax.nn.one_hot(final_assignments, num_classes=num_components)

    return means, final_responsibilities, history_means


### Example: Running k-Means on Synthetic Data

Let's generate some data from a GMM and then apply k-Means to it.

In [ ]:
# --- Main Execution: K-Means Example ---

# 1. Generate synthetic GMM data (similar to previous lectures)
num_samples_kmeans = 500
num_components_kmeans = 3
X_kmeans, _, _, _ = generate_gmm_data(num_samples_kmeans, num_components_kmeans, random_seed=42)

print("Generated Data Shape for K-Means:", X_kmeans.shape)

# 2. Run the K-Means algorithm
kmeans_key = jax.random.PRNGKey(100)
estimated_means_kmeans, final_responsibilities_kmeans, history_means_kmeans = \
    run_kmeans(X_kmeans, num_components_kmeans, max_iter=100, key=kmeans_key)

print("\nFinal K-Means Estimated Means:\n", estimated_means_kmeans)

# 3. Plot the final K-Means fit
# For plotting with plot_gmm_plotly, we need 'covariances'. K-Means implicitly assumes spherical/identity covariance.
# We can approximate this by computing the sample covariance of each cluster.
estimated_covariances_kmeans = jnp.zeros((num_components_kmeans, X_kmeans.shape[1], X_kmeans.shape[1]))
for k in range(num_components_kmeans):
    cluster_points = X_kmeans[jnp.argmax(final_responsibilities_kmeans, axis=1) == k]
    if cluster_points.shape[0] > 1:
        estimated_covariances_kmeans = estimated_covariances_kmeans.at[k].set(jnp.cov(cluster_points, rowvar=False) + jnp.eye(X_kmeans.shape[1]) * 1e-6)
    else:
        estimated_covariances_kmeans = estimated_covariances_kmeans.at[k].set(jnp.eye(X_kmeans.shape[1]) * 0.1)

plot_gmm_plotly(
    X_kmeans,
    estimated_means_kmeans,
    estimated_covariances_kmeans,
    responsibilities=final_responsibilities_kmeans,
    title='K-Means Clustering Result'
)

# Optional: Visualize mean movement over iterations
fig_means_path = go.Figure()
fig_means_path.add_trace(go.Scatter(
    x=X_kmeans[:, 0],
    y=X_kmeans[:, 1],
    mode='markers',
    marker=dict(color='gray', size=3, opacity=0.5),
    name='Data Points'
))
# Same palette as the default `colors` argument of plot_gmm_plotly
path_colors = ['red', 'blue', 'green', 'purple', 'orange']
for k in range(num_components_kmeans):
    means_path = jnp.array([m[k] for m in history_means_kmeans])
    fig_means_path.add_trace(go.Scatter(
        x=means_path[:, 0],
        y=means_path[:, 1],
        mode='lines+markers',
        name=f'Mean {k+1} Path',
        line=dict(color=path_colors[k % len(path_colors)], width=2),
        marker=dict(size=8, symbol='circle')
    ))
fig_means_path.update_layout(title_text='K-Means: Path of Cluster Means', title_x=0.5,
                             xaxis_title='Feature 1',
                             yaxis_title='Feature 2',
                             autosize=False, width=700, height=600)
fig_means_path.show()


#### 4. k-Means as a "Hard" EM Algorithm (Slide 15)

Interestingly, k-Means can be viewed as a special case of the EM algorithm for a Gaussian Mixture Model where:
1.  The covariance matrices for all components are fixed to be identity matrices (or spherical, $I\sigma^2$).
2.  The mixing coefficients are uniform and fixed.
3.  Crucially, the E-step performs a **"hard" assignment** of data points to clusters (i.e., each point belongs *exclusively* to one cluster, $r_{nk} \in \{0, 1\}$), rather than the "soft" probabilistic assignments ($r_{nk} \in [0, 1]$) of the standard GMM EM.

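With hard assignments, the likelihood of the data given the means $m$ and the responsibilities $r$ becomes:
$$p(x | m, r) = \prod_{n=1}^N \sum_{k=1}^K r_{nk} \mathcal{N}(x_n; m_k, I)$$

Because exactly one $r_{nk}$ per data point is non-zero, the negative log-likelihood is $-\log p(x | m, r) = J(r, m) + \frac{ND}{2} \log(2\pi)$, where $D$ is the dimension of the data. By iterating the assignment (E-step) and mean update (M-step), k-Means therefore never decreases this likelihood and stops at a local maximum of it. This connection highlights the unifying power of the EM framework, showing how even simple algorithms can be understood as optimizing a probabilistic objective.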
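One common way to see the link between soft and hard assignments is to shrink a shared isotropic covariance $\sigma^2 I$ towards zero: the soft E-step responsibilities then approach the one-hot k-Means assignments. This is a sketch under assumptions that go slightly beyond the text, which fixes the covariance and hardens the E-step directly. The toy points, the means, and the function names below are hypothetical.

```python
import jax
import jax.numpy as jnp

def soft_responsibilities(X, means, sigma2):
    """E-step of a GMM with uniform weights and shared covariance sigma2 * I."""
    d2 = jnp.sum((X[:, None, :] - means[None, :, :]) ** 2, axis=-1)
    # With equal weights and covariances, all normalizing constants cancel
    return jax.nn.softmax(-0.5 * d2 / sigma2, axis=1)

def hard_responsibilities(X, means):
    """k-Means assignment step written as one-hot responsibilities."""
    d2 = jnp.sum((X[:, None, :] - means[None, :, :]) ** 2, axis=-1)
    return jax.nn.one_hot(jnp.argmin(d2, axis=1), means.shape[0])

# Hypothetical toy inputs, chosen only for illustration
X = jnp.array([[0.0, 0.0], [0.9, 1.0], [3.0, 3.1], [1.2, 1.3]])
means = jnp.array([[0.0, 0.0], [3.0, 3.0]])

r_hard = hard_responsibilities(X, means)
gaps = []
for sigma2 in (10.0, 1.0, 0.01):
    r_soft = soft_responsibilities(X, means, sigma2)
    gap = float(jnp.max(jnp.abs(r_soft - r_hard)))
    gaps.append(gap)
    print(f"sigma^2 = {sigma2:5.2f}  max |r_soft - r_hard| = {gap:.4f}")
```

The printed gap shrinks as $\sigma^2$ decreases, so the soft responsibilities become indistinguishable from the hard ones.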

#### 5. Free Energy: The Connection to Physics (Slides 16-17)

The concept of the **Evidence Lower Bound (ELBO)**, which we maximized in Variational Inference, has a profound connection to **Free Energy** in statistical physics. This is not a mere analogy; it's a direct mathematical equivalence.

Recall the decomposition of the log evidence:
$$\log p(x) = \mathcal{L}(q(z)) + D_{KL}(q(z) || p(z | x))$$ 

where $\mathcal{L}(q(z)) = \int q(z) \log \left( \frac{p(x, z)}{q(z)} \right) dz$.

In statistical physics, probability distributions are often expressed as **Gibbs measures**: $p(x, z) = \exp(-E(x, z))$, where $E(x, z)$ is the energy of a state $(x, z)$. If we substitute this into the ELBO, we find:
$$\mathcal{L}(q) = \int q(z) (\log p(x, z) - \log q(z)) dz = \int q(z) (-E(x, z) - \log q(z)) dz$$ 
$$\mathcal{L}(q) = -\mathbb{E}_q[E(x, z)] - \mathbb{E}_q[\log q(z)] = -\mathbb{E}_q[E(x, z)] + H(q)$$ 

where $H(q) = -\mathbb{E}_q[\log q(z)]$ is the **entropy** of the variational distribution $q(z)$.

The negative ELBO, $-\mathcal{L}(q)$, is precisely the **Variational Free Energy** (often denoted $F$ or $A$ in physics, related to the Helmholtz free energy): 
$$F = \mathbb{E}_q[E(x, z)] - H(q)$$ 
This has the familiar form of the Helmholtz free energy, internal energy minus temperature times entropy ($F = U - TS$, where $U$ is internal energy and $T$ is temperature), with the temperature fixed at $T = 1$. In the context of VI, $\mathbb{E}_q[E(x, z)]$ plays the role of the internal energy $U$, and $H(q)$ that of the entropy $S$. Maximizing the ELBO is equivalent to minimizing this variational free energy.
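These identities are easy to verify numerically for a discrete latent variable. The sketch below assumes a hypothetical joint $p(x, z)$ for one fixed observation $x$ and three latent states, plus an arbitrary $q(z)$. None of these numbers come from the lecture.

```python
import jax.numpy as jnp

# Hypothetical joint probabilities p(x, z) for one fixed x and z in {0, 1, 2}
p_xz = jnp.array([0.10, 0.25, 0.05])
# An arbitrary variational distribution q(z)
q = jnp.array([0.2, 0.5, 0.3])

energy = -jnp.log(p_xz)                        # E(x, z) = -log p(x, z)
entropy = -jnp.sum(q * jnp.log(q))             # H(q)
free_energy = jnp.sum(q * energy) - entropy    # F = E_q[E] - H(q)

elbo = jnp.sum(q * (jnp.log(p_xz) - jnp.log(q)))
log_evidence = jnp.log(jnp.sum(p_xz))          # log p(x)
posterior = p_xz / jnp.sum(p_xz)               # p(z | x)
kl = jnp.sum(q * (jnp.log(q) - jnp.log(posterior)))

# Free energy evaluated at the exact posterior q = p(z | x)
F_at_posterior = jnp.sum(posterior * energy) + jnp.sum(posterior * jnp.log(posterior))

print("ELBO       :", float(elbo))
print("-F         :", float(-free_energy))
print("ELBO + KL  :", float(elbo + kl))
print("log p(x)   :", float(log_evidence))
print("min F      :", float(F_at_posterior))
```

The first two printed values agree, as do the third and fourth, and the free energy at the exact posterior equals $-\log p(x)$, its smallest possible value.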

This deep connection highlights how concepts from physics, particularly statistical mechanics, have provided fundamental insights and mathematical tools for machine learning. Pioneers like **Hermann von Helmholtz, Josiah Willard Gibbs, and Ludwig Boltzmann** laid the groundwork for these ideas, which are now being applied to modern machine learning by researchers like **David M. Blei**.

#### 6. The Calculus of Variations (Slide 18)

At the heart of Variational Inference lies the **Calculus of Variations**, a branch of mathematics concerned with finding functions that optimize certain functionals (integrals that depend on functions). While not typically taught in introductory calculus, it's a powerful idea that underpins many advanced scientific and engineering disciplines.

Pioneers like **Leonhard Euler** and **Joseph-Louis Lagrange** developed the foundational concepts of the calculus of variations in the 18th century. Later, physicists like **Richard Feynman** (Nobel Prize 1965) used variational principles extensively in quantum mechanics (e.g., Feynman path integrals).

In VI, we are essentially using the principles of the calculus of variations to find the optimal distribution $q(z)$ that maximizes the ELBO functional. The mean-field updates we derived are direct consequences of applying variational calculus to the ELBO under the factorization assumption.
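As a worked example of the variational argument, consider maximizing the ELBO over all normalized distributions $q(z)$, without any factorization constraint. This is a standard derivation sketched here for illustration. The Lagrange multiplier $\lambda$ and the functional name $\mathcal{G}$ are notation introduced for this example, not taken from the slides.

$$\mathcal{G}[q] = \int q(z) \log \frac{p(x, z)}{q(z)} dz + \lambda \left( \int q(z) dz - 1 \right)$$

Setting the functional derivative to zero gives

$$\frac{\delta \mathcal{G}}{\delta q(z)} = \log p(x, z) - \log q(z) - 1 + \lambda = 0 \quad \Rightarrow \quad q^*(z) = e^{\lambda - 1} p(x, z)$$

and the normalization constraint fixes $e^{\lambda - 1} = 1 / p(x)$, so that $q^*(z) = p(z | x)$: the unconstrained optimum is the exact posterior. Applying the same argument to a single factor $q_i(z_i)$ of a mean-field distribution $q(z) = \prod_i q_i(z_i)$, with all other factors held fixed, yields the familiar update

$$\log q_i^*(z_i) = \mathbb{E}_{q_{j \neq i}}[\log p(x, z)] + \text{const}$$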

#### 7. Randomized Computations: Monte Carlo (Slides 19-24)

Another critical computational paradigm in probabilistic machine learning is **Monte Carlo methods**. These techniques use random sampling to approximate integrals or expectations that are otherwise intractable. While the core idea of using randomness for computation has older roots, its modern form took shape at **Los Alamos** in the 1940s, growing out of the weapons research begun under the **Manhattan Project**.

**John von Neumann** and **Stanisław Ulam**, the latter a former member of the Lwów School, were instrumental in developing and popularizing Monte Carlo methods. Ulam, while recovering from an illness, conceived the idea of using random sampling to solve complex physics problems, such as neutron diffusion in fissionable material, which were too difficult for deterministic calculations. Von Neumann then formalized the approach, and their colleague Nicholas Metropolis is usually credited with suggesting the name "Monte Carlo" (an allusion to the gambling habits of Ulam's uncle).

These methods were crucial for the design of nuclear weapons, including the **Teller-Ulam design** for thermonuclear weapons, which involved complex simulations that could only be tackled with probabilistic sampling. The development of Monte Carlo methods at Los Alamos, often involving physical simulations with random numbers, marked a pivotal moment in the history of scientific computing and laid the groundwork for modern sampling-based inference techniques like Markov Chain Monte Carlo (MCMC).
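The core idea, approximating an expectation by an average over random samples, fits in a few lines. The sketch below is illustrative only: the helper name `mc_expectation`, the seed, and the test integrand $f(z) = z^2$ with $z \sim \mathcal{N}(0, 1)$ (whose exact expectation is 1) are assumptions made for this example.

```python
import jax
import jax.numpy as jnp

def mc_expectation(f, samples):
    """Monte Carlo estimate of E[f(z)] and its standard error from i.i.d. samples."""
    values = f(samples)
    estimate = jnp.mean(values)
    std_error = jnp.std(values, ddof=1) / jnp.sqrt(values.shape[0])
    return estimate, std_error

key = jax.random.PRNGKey(0)
for num_samples in (100, 10_000, 1_000_000):
    key, subkey = jax.random.split(key)
    z = jax.random.normal(subkey, (num_samples,))
    # E[z^2] = 1 for z ~ N(0, 1), so the exact answer is known here
    estimate, std_error = mc_expectation(lambda z: z ** 2, z)
    print(f"S = {num_samples:>9,d}  estimate = {float(estimate):.4f}  +/- {float(std_error):.4f}")
```

The standard error shrinks roughly like $1/\sqrt{S}$ in the number of samples $S$, independently of the dimension of $z$, which is what makes the approach attractive for otherwise intractable integrals.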

#### 8. And then the Machines Changed Everything. Everything? (Slides 25, 29-30)

The advent of **digital computers** fundamentally transformed what could be computed and modeled. The history of AI and machine learning has seen a fascinating oscillation between advances driven by sheer **computing power** (brute force) and those driven by **algorithmic improvements** (efficiency).

Key eras and developments include (Slide 25):
* **Symbolic Logic (1930s onwards)**: Pioneered by Alan Turing, focusing on logical reasoning and rule-based systems.
* **Connectionism (1950s onwards)**: The rise of neural networks, exemplified by **Frank Rosenblatt's Perceptron**, which sought to model intelligence through interconnected nodes.
* **Operations Research (1970s onwards)**: Development of powerful optimization algorithms like BFGS.
* **Kernels, Probability, Graphs (1990s)**: A resurgence of probabilistic methods, graphical models (Judea Pearl), and kernel methods (Vladimir Vapnik), emphasizing statistical rigor and structured representations.
* **GPUs and Deep Learning (2010s)**: The explosion of deep learning, heavily reliant on massive datasets and parallel computation on GPUs, leading to unprecedented performance in areas like image recognition and natural language processing.

Today, we stand at another inflection point. While computing power allows us to be "lazy" with algorithms, training contemporary large models is incredibly expensive, both computationally and environmentally. This suggests that **algorithmic and modeling knowledge will remain crucial** (Slide 30). Efficient algorithms can make the difference between a workstation and an edge device, between training on millions and billions of data points, and ultimately, between profitability and financial drain.

The future of machine learning likely lies in a synergistic combination of powerful hardware, vast datasets, and ever more sophisticated and efficient probabilistic algorithms.

#### 9. Stochastic Variational Inference: Leveraging Deep Networks (Slides 26-28)

Bridging the gap between the algorithmic elegance of Variational Inference and the power of deep learning is **Stochastic Variational Inference (SVI)**, notably popularized by **Kingma & Welling's "Auto-Encoding Variational Bayes" (2013)**. SVI allows us to perform VI on very large datasets and with complex, non-conjugate models by leveraging deep neural networks and Monte Carlo sampling.

Instead of carefully deriving analytic updates for each variational factor (which can be tedious or impossible for complex models), SVI proposes to:
1.  **Parameterize the variational distribution $q(z|\phi, x)$ (encoder) and the generative model $p(x, z|\theta)$, in particular its likelihood $p(x|z, \theta)$ (decoder), using deep neural networks.** The parameters of these networks are $\phi$ and $\theta$, respectively.
2.  **Optimize the ELBO by stochastic gradient ascent (equivalently, stochastic gradient descent on the negative ELBO).** The challenge is computing gradients of expectations with respect to the parameters of the distribution defining the expectation. This is where the **reparameterization trick** comes in.

**The Reparameterization Trick** (Slide 28):
For a variational distribution like $q(z | \phi, x) = \mathcal{N}(z; \mu_\phi(x), \Sigma_\phi(x))$, we can reparameterize the random variable $z$ as a deterministic function of a simpler random variable $u$ (e.g., $u \sim \mathcal{N}(0, I)$) and the variational parameters $\phi$:
$$z = \mu_\phi(x) + L_\phi(x) u$$ 
where $\Sigma_\phi(x) = L_\phi(x)L_\phi(x)^T$ (Cholesky decomposition).

This allows us to move the gradient inside the expectation, making it computable via backpropagation:
$$\nabla_\phi \mathbb{E}_{q_\phi(z|x)} [f(z)] = \mathbb{E}_{u \sim \mathcal{N}(0, I)} [\nabla_\phi f(\mu_\phi(x) + L_\phi(x) u)]$$
where $f$ stands for the term inside the expectation.

The expectation can then be approximated by Monte Carlo sampling (taking a few samples of $u$). This combination of deep networks, variational inference, and the reparameterization trick forms the basis of powerful generative models like **Variational Autoencoders (VAEs)**, allowing for efficient inference and generation in high-dimensional spaces.

SVI represents a significant step towards scalable and flexible probabilistic modeling, enabling us to combine the strengths of deep learning with the principled framework of Bayesian inference.
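The reparameterization trick can be tried out on a one-dimensional Gaussian with a test function whose expectation is known in closed form. In the sketch below, the integrand $f(z) = z^2$, the values of $\mu$ and $\sigma$, the log-scale parameterization of $\sigma$, and the name `reparam_objective` are all assumptions made for illustration. Since $\mathbb{E}_q[z^2] = \mu^2 + \sigma^2$, the exact gradients are $2\mu$ with respect to $\mu$ and $2\sigma^2$ with respect to $\log \sigma$.

```python
import jax
import jax.numpy as jnp

def f(z):
    """Test integrand with a known answer: E[z^2] = mu^2 + sigma^2."""
    return z ** 2

def reparam_objective(params, u):
    """Monte Carlo estimate of E_q[f(z)] using z = mu + sigma * u, u ~ N(0, 1)."""
    mu, log_sigma = params
    z = mu + jnp.exp(log_sigma) * u  # deterministic in (mu, log_sigma) given u
    return jnp.mean(f(z))

mu, sigma = 1.5, 0.7
params = (jnp.array(mu), jnp.log(jnp.array(sigma)))
u = jax.random.normal(jax.random.PRNGKey(0), (100_000,))

# Because the randomness lives in u, autodiff can differentiate through the samples
grad_mu, grad_log_sigma = jax.grad(reparam_objective)(params, u)

print("d/d mu        estimate:", float(grad_mu), " exact:", 2 * mu)
print("d/d log sigma estimate:", float(grad_log_sigma), " exact:", 2 * sigma ** 2)
```

In a VAE the same pattern is used with $\mu$ and $\sigma$ produced by the encoder network, usually with far fewer samples of $u$ per gradient step.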

#### Conclusion

This course has taken you on a journey through the fascinating world of Probabilistic Machine Learning. We've seen how fundamental ideas from probability theory, statistics, and even physics have converged to create powerful tools for understanding data and making informed decisions under uncertainty. From the rigorous foundations of Bayes' theorem to the practicalities of EM and Variational Inference, and the modern advancements in deep probabilistic models, you now possess a robust framework for tackling complex machine learning problems.

The historical perspective reminds us that scientific progress is often iterative, collaborative, and deeply human. The challenges that motivated past generations of researchers continue to inspire new innovations, and the pursuit of efficient, interpretable, and robust algorithms remains a vital frontier in machine learning.

#### Exercises

**Exercise 1: K-Means Initialization Sensitivity**
Similar to EM, k-Means is sensitive to initialization. Call `run_kmeans` with different seeds for its `key` argument (e.g., `jax.random.PRNGKey(100)`, `jax.random.PRNGKey(200)`, `jax.random.PRNGKey(300)`). Observe how the final cluster assignments and means change. How might you address this sensitivity in a real-world application?

**Exercise 2: K-Means vs. GMM EM**
Compare the clustering results of k-Means (from this lecture) with the GMM EM algorithm (from Lecture 23) on the same `generate_gmm_data` dataset. Discuss the visual differences in the cluster shapes and assignments. Explain why GMM EM provides a "softer" and more flexible clustering compared to k-Means, relating it to their underlying probabilistic assumptions.

**Exercise 3: The ELBO and Free Energy (Conceptual)**
Revisit the ELBO decomposition: $\log p(x) = \mathcal{L}(q(z)) + D_{KL}(q(z) || p(z | x))$. Explain, in your own words, how minimizing the KL divergence is equivalent to maximizing the ELBO. Then, explain how the term $-\mathcal{L}(q(z))$ relates to the concept of Free Energy in statistical physics. What does this connection imply about the nature of variational inference?

**Exercise 4 (Advanced): Reparameterization Trick (Conceptual)**
For a simple 1D Gaussian $q(z | \mu, \sigma^2) = \mathcal{N}(z; \mu, \sigma^2)$, write down the reparameterization trick for $z$. If you wanted to compute $\nabla_\mu \mathbb{E}_{q(z)}[f(z)]$ for some function $f(z)$, show how the reparameterization trick allows you to move the gradient inside the expectation, making it amenable to Monte Carlo estimation and automatic differentiation. Discuss why this is a crucial innovation for training VAEs and other deep generative models.