In [1]:
%load_ext autoreload
%autoreload 2

In [21]:
import jax
import jax.numpy as jnp
import jax.random as jrandom
import jax.scipy.stats as jss  # For standard distributions like norm
import numpy as np  # For plotting
import plotly.graph_objects as go
import plotly.io as pio
import ipywidgets as widgets  # For interactive controls
from IPython.display import display  # To display widgets and output

In [7]:
pio.templates.default = "plotly_white"


# Lecture 06: Gaussian Probability Distributions

Based on the lecture slides by Philipp Hennig (SS 2023).

Welcome to Lecture 06! This session focuses on the Gaussian distribution, which is arguably the most important distribution in probabilistic machine learning. We will explore its key properties, how it relates to linear algebra, and why this makes inference particularly tractable. We'll use JAX to implement the concepts and work through realistic examples.

## Review: Exponential Families

As we saw in Lecture 05, many common probability distributions belong to the **Exponential Family**. A distribution is in this family if its probability density/mass function can be written as:
$$p_w(x) = h(x) \exp[\phi(x)^T w - \log Z(w)]$$
where $h(x)$ is the base measure, $\phi(x)$ are the sufficient statistics, $w$ are the natural parameters, and $Z(w)$ is the partition function. This structure is powerful because it simplifies things like finding conjugate priors and performing maximum likelihood estimation.

## The Univariate Gaussian Distribution

Let's start with the familiar **univariate Gaussian** (or Normal) distribution, denoted as $N(x; \mu, \sigma^2)$. Its probability density function (PDF) is:
$$N(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
Here, $\mu$ is the mean and $\sigma^2$ is the variance.

In Lecture 05, we showed that the univariate Gaussian is an Exponential Family member. Its sufficient statistics are $\phi(x) = \begin{bmatrix} x \\ -x^2/2 \end{bmatrix}$ and its natural parameters are $w = \begin{bmatrix} \mu/\sigma^2 \\ 1/\sigma^2 \end{bmatrix}$.
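
As a quick sanity check, the exponential-family form can be evaluated numerically and compared with the standard log PDF. This is a minimal sketch: the function name `gaussian_log_pdf_expfam` is made up for illustration, and it assumes the convention $h(x) = 1/\sqrt{2\pi}$, which gives $\log Z(w) = \frac{w_1^2}{2 w_2} - \frac{1}{2}\log w_2$ (other conventions move the $2\pi$ factor into $Z$).

```python
import jax.numpy as jnp
import jax.scipy.stats as jss


def gaussian_log_pdf_expfam(x, mu, sigma_sq):
    """Log PDF of N(x; mu, sigma_sq) written in exponential-family form."""
    # Sufficient statistics and natural parameters as defined above
    phi = jnp.array([x, -0.5 * x**2])
    w = jnp.array([mu / sigma_sq, 1.0 / sigma_sq])
    # Assumed convention: h(x) = 1/sqrt(2*pi), so that
    # log Z(w) = w_1^2 / (2 w_2) - 0.5 * log(w_2)
    log_h = -0.5 * jnp.log(2.0 * jnp.pi)
    log_Z = w[0] ** 2 / (2.0 * w[1]) - 0.5 * jnp.log(w[1])
    return log_h + phi @ w - log_Z


# Illustrative values
x, mu, sigma_sq = 1.3, 3.0, 0.8
expfam_value = gaussian_log_pdf_expfam(x, mu, sigma_sq)
reference_value = jss.norm.logpdf(x, loc=mu, scale=jnp.sqrt(sigma_sq))
print(expfam_value, reference_value)  # the two values should agree
```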

Let's use JAX to work with the univariate Gaussian. JAX's `jax.scipy.stats` module provides convenient functions for standard distributions.

In [15]:
# Set a random seed for reproducibility
key = jrandom.PRNGKey(42)


# Define the standard log PDF of a univariate Gaussian
def univariate_gaussian_log_pdf(x, mu, sigma_sq):
    """
    Log PDF of a univariate Gaussian N(x; mu, sigma_sq).
    """
    # jss.norm.logpdf handles the calculation directly
    return jss.norm.logpdf(x, loc=mu, scale=jnp.sqrt(sigma_sq))  # scale is std dev


# Example: Plotting a Gaussian PDF
mu_example = 3.0
sigma_sq_example = 0.8
x_values = jnp.linspace(0, 6, 100)
pdf_values = jnp.exp(
    univariate_gaussian_log_pdf(x_values, mu_example, sigma_sq_example)
)

fig = go.Figure()
fig.add_trace(
    go.Scatter(x=np.array(x_values), y=np.array(pdf_values), mode="lines", name="PDF")
)
fig.update_layout(
    title=f"Univariate Gaussian PDF (mu={mu_example}, sigma^2={sigma_sq_example})",
    xaxis_title="x",
    yaxis_title="p(x)",
    template="plotly_white",
)
fig.show()

# Example: Sampling from a univariate Gaussian
key, subkey = jrandom.split(key)  # Split the key for new randomness
num_samples = 500
samples = (
    jrandom.normal(subkey, (num_samples,)) * jnp.sqrt(sigma_sq_example) + mu_example
)

# Plot histogram of samples and true PDF using Plotly
hist = go.Histogram(
    x=np.array(samples),
    nbinsx=30,
    histnorm="probability density",
    opacity=0.6,
    marker_color="green",
    name="Samples",
)
pdf_line = go.Scatter(
    x=np.array(x_values),
    y=np.array(pdf_values),
    mode="lines",
    line=dict(color="red", width=2),
    name="True PDF",
)
fig_hist = go.Figure([hist, pdf_line])
fig_hist.update_layout(
    title="Histogram of Samples vs. True PDF",
    xaxis_title="x",
    yaxis_title="Density",
    template="plotly_white",
)
fig_hist.show()

## Closure Properties: The Magic of Gaussians

One of the most powerful aspects of Gaussian distributions is their **closure properties** under linear operations. This means that if you start with Gaussian random variables and perform certain operations (like adding them, applying linear transformations, looking at subsets, or conditioning on some values), the resulting distributions are **still Gaussian**. This makes exact inference possible using linear algebra.

### 1. Products of Gaussians are Gaussians

The product of two Gaussian probability densities is proportional to another Gaussian density. This is crucial for Bayesian inference because the posterior is proportional to the prior times the likelihood. If both are Gaussian, the posterior is also Gaussian.



If 
$p_1(x) = \mathcal{N}(x; \mu_1, \sigma_1^2)$ 
and 
$p_2(x) = \mathcal{N}(x; \mu_2, \sigma_2^2)$, 
their product is:

$$
p_1(x) p_2(x) \propto \mathcal{N}(x; \mu_{\text{post}}, \sigma_{\text{post}}^2)
$$

where the parameters of the resulting Gaussian are:

$$
\sigma_{\text{post}}^2 = \left( \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2} \right)^{-1}
$$

$$
\mu_{\text{post}} = \sigma_{\text{post}}^2 \left( \frac{\mu_1}{\sigma_1^2} + \frac{\mu_2}{\sigma_2^2} \right)
$$

Notice that the **precisions** ($1/\sigma^2$) add up, and the new mean is a precision-weighted average of the original means.
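
The word "proportional" can be checked numerically: if the product really is a scaled Gaussian, then the log of the product minus the log of $\mathcal{N}(x; \mu_{\text{post}}, \sigma_{\text{post}}^2)$ must be the same constant at every $x$. The sketch below uses illustrative parameter values that are not from the lecture; the constant it compares against, $\mathcal{N}(\mu_1; \mu_2, \sigma_1^2 + \sigma_2^2)$, is the standard normalization constant of a product of two Gaussians.

```python
import jax.numpy as jnp
import jax.scipy.stats as jss

# Illustrative parameters (not taken from the lecture)
mu1, var1 = 1.0, 2.0
mu2, var2 = 3.0, 0.5

# Product rule: precisions add, the mean is precision-weighted
var_post = 1.0 / (1.0 / var1 + 1.0 / var2)
mu_post = var_post * (mu1 / var1 + mu2 / var2)

xs = jnp.linspace(-3.0, 6.0, 7)
log_product = jss.norm.logpdf(xs, mu1, jnp.sqrt(var1)) + jss.norm.logpdf(
    xs, mu2, jnp.sqrt(var2)
)
log_post = jss.norm.logpdf(xs, mu_post, jnp.sqrt(var_post))

# If the product is proportional to N(x; mu_post, var_post),
# this difference is the same constant for every x.
log_ratio = log_product - log_post
# That constant is log N(mu1; mu2, var1 + var2).
log_const = jss.norm.logpdf(mu1, mu2, jnp.sqrt(var1 + var2))
print(log_ratio)
print(log_const)
```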

#### Realistic Example: Combining Measurements

Imagine you are trying to determine the precise location of an object. You have two independent sensors, each providing a measurement that is a noisy estimate of the object's position. You can model each sensor's measurement uncertainty as a Gaussian distribution centered at the true position.

- **Sensor A:** Measurement $y_A$, modeled as $\mathcal{N}(y_A; \text{true\_pos}, \sigma_A^2)$.
- **Sensor B:** Measurement $y_B$, modeled as $\mathcal{N}(y_B; \text{true\_pos}, \sigma_B^2)$.

If you have a prior belief about the object's position, say $p(\text{true\_pos}) = \mathcal{N}(\text{true\_pos}; \mu_{\text{prior}}, \sigma_{\text{prior}}^2)$, the posterior distribution of the true position after observing both measurements is proportional to the prior times the likelihoods:

$$
p(\text{true\_pos} \mid y_A, y_B) \propto p(\text{true\_pos}) \, p(y_A \mid \text{true\_pos}) \, p(y_B \mid \text{true\_pos})
$$

The likelihood $p(y \mid \text{true\_pos}) = \mathcal{N}(y; \text{true\_pos}, \sigma^2)$ depends on its arguments only through $(y - \text{true\_pos})^2$, so it is symmetric in $y$ and $\text{true\_pos}$ and can be read as a Gaussian in the variable $\text{true\_pos}$ with mean $y$ and variance $\sigma^2$. The posterior is therefore a product of three Gaussian densities (one from the prior, two from the likelihoods), and hence itself Gaussian in $\text{true\_pos}$.

Let's implement the combination of two univariate Gaussians in JAX.

In [16]:
# Function to combine (multiply) two univariate Gaussian densities
def combine_univariate_gaussians(mu1, sigma_sq1, mu2, sigma_sq2):
    """
    Combines two univariate Gaussian densities N(x; mu1, sigma_sq1) * N(x; mu2, sigma_sq2).
    Returns the parameters (mu_post, sigma_sq_post) of the resulting Gaussian (up to a normalization constant).
    """
    # Calculate precisions (inverse variances)
    lambda1 = 1.0 / sigma_sq1
    lambda2 = 1.0 / sigma_sq2

    # Calculate the precision and variance of the resulting Gaussian
    lambda_post = lambda1 + lambda2
    sigma_sq_post = 1.0 / lambda_post

    # Calculate the mean of the resulting Gaussian
    mu_post = sigma_sq_post * (lambda1 * mu1 + lambda2 * mu2)

    return mu_post, sigma_sq_post


# Example: Combining a prior belief with a sensor reading
prior_mu = 50.0  # Prior belief about object position (e.g., meters)
prior_sigma_sq = 5.0  # Prior uncertainty

sensor_reading = 52.0
sensor_noise_sq = 2.0  # Variance of sensor noise
# Combine prior and sensor reading
# The likelihood N(sensor_reading | true_pos) is treated as a Gaussian in true_pos with mean sensor_reading
mu_posterior, sigma_sq_posterior = combine_univariate_gaussians(
    prior_mu, prior_sigma_sq, sensor_reading, sensor_noise_sq
)

print(f"Prior: N(mu={prior_mu}, sigma^2={prior_sigma_sq})")
print(f"Sensor Reading: {sensor_reading} (with variance {sensor_noise_sq})")
print(f"Posterior: N(mu={mu_posterior:.2f}, sigma^2={sigma_sq_posterior:.2f})")

# Notice the posterior mean is a value between the prior mean and the sensor reading,
# weighted by their respective precisions. The posterior variance is smaller than both,
# indicating increased certainty after incorporating the measurement.

# Let's add a second sensor reading
sensor_reading_B = 51.0
sensor_noise_sq_B = 1.5  # Variance of sensor B noise

# Combine the current posterior (after Sensor A) with Sensor B reading
mu_posterior_final, sigma_sq_posterior_final = combine_univariate_gaussians(
    mu_posterior, sigma_sq_posterior, sensor_reading_B, sensor_noise_sq_B
)

print(f"\nSensor B Reading: {sensor_reading_B} (with variance {sensor_noise_sq_B})")
print(
    f"Posterior after Sensor B: N(mu={mu_posterior_final:.2f}, sigma^2={sigma_sq_posterior_final:.2f})"
)

Prior: N(mu=50.0, sigma^2=5.0)
Sensor Reading: 52.0 (with variance 2.0)
Posterior: N(mu=51.43, sigma^2=1.43)

Sensor B Reading: 51.0 (with variance 1.5)
Posterior after Sensor B: N(mu=51.22, sigma^2=0.73)
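
Because precisions simply add, the order in which the factors are combined should not matter. The following sketch fuses the prior and both sensor readings in a single step and should reproduce the sequential result printed above. The variable names (`means`, `variances`, `mu_batch`, `sigma_sq_batch`) are chosen here for illustration.

```python
import jax.numpy as jnp

# Means and variances of the three Gaussian factors from the example above:
# prior, sensor A, sensor B
means = jnp.array([50.0, 52.0, 51.0])
variances = jnp.array([5.0, 2.0, 1.5])

# Sum the precisions and the precision-weighted means in one go
precisions = 1.0 / variances
sigma_sq_batch = 1.0 / jnp.sum(precisions)
mu_batch = sigma_sq_batch * jnp.sum(precisions * means)

print(
    f"Batch posterior: N(mu={float(mu_batch):.2f}, sigma^2={float(sigma_sq_batch):.2f})"
)
# Batch posterior: N(mu=51.22, sigma^2=0.73)
```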


## The Multivariate Gaussian Distribution

The multivariate Gaussian distribution extends the Gaussian to multiple dimensions. For a random vector $x \in \mathbb{R}^n$, the PDF is:

$$
N(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^\top \Sigma^{-1} (x - \mu) \right)
$$

Here, $\mu \in \mathbb{R}^n$ is the mean vector, and $\Sigma \in \mathbb{R}^{n \times n}$ is the covariance matrix. The covariance matrix must be symmetric positive definite (SPD).

- **Mean Vector ($\mu$):** A vector where each element is the mean of the corresponding dimension.
- **Covariance Matrix ($\Sigma$):** A matrix where the diagonal elements are the variances of each dimension, and the off-diagonal elements are the covariances between pairs of dimensions. $\Sigma_{ij} = \mathrm{Cov}(x_i, x_j)$.
- **Symmetric Positive Definite (SPD):** A matrix $\Sigma$ is SPD if $\Sigma = \Sigma^\top$ and for any non-zero vector $v \in \mathbb{R}^n$, $v^\top \Sigma v > 0$. This guarantees that $\Sigma^{-1}$ exists and is itself SPD, so the quadratic form $(x - \mu)^\top \Sigma^{-1} (x - \mu)$ in the exponent is non-negative, $|\Sigma| > 0$, and the density is well defined and normalizable.
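
The SPD condition can be tested numerically through the eigenvalues: a symmetric matrix is positive definite exactly when all its eigenvalues are positive. The helper `is_spd` and its tolerance `tol` below are illustrative assumptions, not part of the lecture code.

```python
import jax.numpy as jnp


def is_spd(Sigma, tol=1e-6):
    """Return True if Sigma is symmetric with all eigenvalues > tol."""
    symmetric = bool(jnp.allclose(Sigma, Sigma.T))
    # eigvalsh assumes a symmetric input and returns real eigenvalues
    eigenvalues = jnp.linalg.eigvalsh(Sigma)
    return symmetric and bool(jnp.all(eigenvalues > tol))


# 2x2 covariance built from two variances and a correlation coefficient rho
s11, s22, rho = 1.0, 2.0, 0.5
s12 = rho * (s11 * s22) ** 0.5
Sigma_valid = jnp.array([[s11, s12], [s12, s22]])

# A "correlation" of 1.5 is not valid and yields an indefinite matrix
s12_bad = 1.5 * (s11 * s22) ** 0.5
Sigma_invalid = jnp.array([[s11, s12_bad], [s12_bad, s22]])

print(is_spd(Sigma_valid), is_spd(Sigma_invalid))  # True False
```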

Let's implement the log PDF and sampling for a multivariate Gaussian in JAX.
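
For the log PDF, one common approach (sketched here, not necessarily the one used in the lecture) is to avoid forming $\Sigma^{-1}$ explicitly and work with a Cholesky factor instead. The function name `multivariate_gaussian_log_pdf` and the test values are assumptions made for illustration; the result is compared against `jax.scipy.stats.multivariate_normal.logpdf`.

```python
import jax.numpy as jnp
import jax.scipy.linalg as jsl
import jax.scipy.stats as jss


def multivariate_gaussian_log_pdf(x, mu, Sigma):
    """Log PDF of N(x; mu, Sigma) computed via the Cholesky factor of Sigma."""
    n = mu.shape[0]
    L = jnp.linalg.cholesky(Sigma)  # Sigma = L @ L.T
    # Solve L a = (x - mu); then (x - mu)^T Sigma^{-1} (x - mu) = a^T a
    a = jsl.solve_triangular(L, x - mu, lower=True)
    # log|Sigma| = 2 * sum(log(diag(L)))
    log_det = 2.0 * jnp.sum(jnp.log(jnp.diag(L)))
    return -0.5 * (n * jnp.log(2.0 * jnp.pi) + log_det + a @ a)


# Illustrative values
mu = jnp.array([1.0, -1.0])
Sigma = jnp.array([[2.0, 0.6], [0.6, 1.0]])
x = jnp.array([0.5, 0.0])

log_p = multivariate_gaussian_log_pdf(x, mu, Sigma)
log_p_reference = jss.multivariate_normal.logpdf(x, mu, Sigma)
print(log_p, log_p_reference)  # the two values should agree
```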

The core idea behind sampling from a general multivariate Gaussian $\mathcal{N}(\mu, \Sigma)$ is to start with samples from a standard normal distribution $\mathcal{N}(0, I)$, where $0$ is the zero vector and $I$ is the identity matrix. Samples from $\mathcal{N}(0, I)$ are simply vectors where each element is drawn independently from a univariate standard normal distribution $\mathcal{N}(0, 1)$.

If $z \sim \mathcal{N}(0, I)$, we can transform $z$ to get a sample $x \sim \mathcal{N}(\mu, \Sigma)$ using a linear transformation:

$$
x = L z + \mu
$$

where $L$ is a matrix such that $\Sigma = L L^T$. This is because the covariance of $x$ will be $\text{Cov}(Lz + \mu) = L \text{Cov}(z) L^T = L I L^T = L L^T = \Sigma$. The mean of $x$ is $E[Lz + \mu] = L E[z] + \mu = L \cdot 0 + \mu = \mu$. Since linear transformations of Gaussian variables are Gaussian, $x$ will follow the desired distribution.

The matrix $L$ can be obtained in a couple of ways, leading to different sampling methods: Cholesky Decomposition or Eigenvalue Decomposition.

In [22]:
# Optional: Configure JAX for 64-bit precision, which can be useful for numerical stability
jax.config.update("jax_enable_x64", True)

# Set a random seed for reproducibility
# We'll generate a new key inside the interactive function to get fresh samples each time parameters change
initial_key = jrandom.PRNGKey(0)

In [37]:
# --- Interactive Widgets (Equivalent to Streamlit Controls) ---

# Sliders for the mean vector mu = [mu_0, mu_1]
mu0_slider = widgets.FloatSlider(
    min=-5.0, max=5.0, value=0.0, step=0.1, description="Mu_0:"
)
mu1_slider = widgets.FloatSlider(
    min=-5.0, max=5.0, value=0.0, step=0.1, description="Mu_1:"
)

# Sliders for the covariance matrix parameters (variances and correlation)
# Sigma = [[S11, rho*sqrt(S11*S22)], [rho*sqrt(S11*S22), S22]]
s11_slider = widgets.FloatSlider(
    min=0.1, max=5.0, value=1.0, step=0.1, description="Sigma_11:"
)  # Start variance above 0
s22_slider = widgets.FloatSlider(
    min=0.1, max=5.0, value=1.0, step=0.1, description="Sigma_22:"
)  # Start variance above 0
rho_slider = widgets.FloatSlider(
    min=-0.99, max=0.99, value=0.0, step=0.01, description="rho:"
)  # Correlation coefficient

# Slider for the number of samples
n_samples_slider = widgets.IntSlider(
    min=10,
    max=1000,
    value=200,
    step=10,
    description="Num Samples:",
    style={"description_width": "initial"},  # Show full description
)

# Radio buttons for sampling method selection
method_radio = widgets.RadioButtons(
    options=["Cholesky", "Eigenvalue"], description="Sampling Method:", disabled=False
)

controls = widgets.VBox(
    [
        widgets.Label("Gaussian Parameters:"),
        widgets.HBox([mu0_slider, mu1_slider]),
        widgets.HBox([s11_slider, s22_slider]),
        widgets.HBox([rho_slider, n_samples_slider]),
        method_radio,
    ]
)
display(controls)

### Method 1: Cholesky Decomposition

The Cholesky decomposition is a way to factor a symmetric positive definite (SPD) matrix $\Sigma$ into the product of a lower triangular matrix $L$ and its transpose: $\Sigma = L L^T$. This is a standard and numerically stable decomposition for SPD matrices. Here’s what this means and why it’s useful:

**What is Cholesky Decomposition?**

- A matrix is *symmetric* if it equals its transpose ($\Sigma = \Sigma^T$), and *positive definite* if $v^T \Sigma v > 0$ for any nonzero vector $v$.
- Every SPD matrix has a unique lower triangular factor $L$ with positive diagonal entries such that $\Sigma = L L^T$; this factorization is the Cholesky decomposition.
- The matrix $L$ is easy to compute using efficient numerical algorithms, and it is always well-defined for SPD matrices.

**How Does the Cholesky Decomposition Algorithm Work?**

Given a symmetric positive definite matrix $\Sigma$, the Cholesky decomposition finds a unique lower triangular matrix $L$ such that $\Sigma = L L^T$. The algorithm proceeds as follows:

Suppose $\Sigma$ is an $n \times n$ matrix with entries $\Sigma_{ij}$. The entries of $L$ are computed one row at a time using:

- For each diagonal entry ($i = j$):
    $$
    L_{ii} = \sqrt{\Sigma_{ii} - \sum_{k=1}^{i-1} L_{ik}^2}
    $$
- For each off-diagonal entry ($i > j$):
    $$
    L_{ij} = \frac{1}{L_{jj}} \left( \Sigma_{ij} - \sum_{k=1}^{j-1} L_{ik} L_{jk} \right)
    $$
- For $i < j$, $L_{ij} = 0$ (since $L$ is lower triangular).

**Step-by-step:**
1. Start with the first row/column, compute $L_{11} = \sqrt{\Sigma_{11}}$.
2. For each subsequent row $i$, compute $L_{ij}$ for $j < i$ using the formula above, then compute $L_{ii}$.
3. Continue until all entries are filled.
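
The row-by-row recipe above can be written out directly. This is a didactic sketch with plain NumPy loops, using a made-up function name `cholesky_by_hand` and an illustrative $3 \times 3$ matrix; it is compared against `jnp.linalg.cholesky`, which is what you would use in practice.

```python
import numpy as np
import jax.numpy as jnp


def cholesky_by_hand(Sigma):
    """Row-by-row Cholesky factorization following the formulas above."""
    Sigma = np.asarray(Sigma, dtype=float)
    n = Sigma.shape[0]
    L = np.zeros_like(Sigma)
    for i in range(n):
        for j in range(i):
            # Off-diagonal entry (i > j)
            L[i, j] = (Sigma[i, j] - np.dot(L[i, :j], L[j, :j])) / L[j, j]
        # Diagonal entry, computed after the rest of row i
        L[i, i] = np.sqrt(Sigma[i, i] - np.dot(L[i, :i], L[i, :i]))
    return L


# Illustrative SPD matrix
Sigma = np.array([[4.0, 2.0, 0.6], [2.0, 2.0, 0.5], [0.6, 0.5, 1.0]])
L_hand = cholesky_by_hand(Sigma)
L_jax = np.asarray(jnp.linalg.cholesky(jnp.asarray(Sigma)))

print(L_hand)
print(np.max(np.abs(L_hand - L_jax)))  # should be close to zero
```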

**Why does this work?**  
At each step, the algorithm ensures that the part of $\Sigma$ explained so far matches $L L^T$ up to the current row/column. The square root and division steps are valid because $\Sigma$ is positive definite: every quantity under a square root is strictly positive, so each $L_{ii} > 0$ and the divisions by $L_{jj}$ are well defined.

**Numerical Stability:**  
For SPD matrices, Cholesky needs roughly $n^3/3$ floating-point operations, about half the cost of a general LU decomposition, and it is numerically stable without pivoting. This is why it is widely used in scientific computing and machine learning.

**In Practice:**  
You rarely need to implement this by hand—libraries like NumPy, SciPy, and JAX provide efficient, optimized routines (e.g., `jnp.linalg.cholesky(Sigma)`). But understanding the algorithm helps explain why it only works for SPD matrices and why it is so efficient.

**Why is Cholesky Decomposition Helpful for Sampling?**

When we want to sample from a multivariate Gaussian $\mathcal{N}(\mu, \Sigma)$, we need to generate random vectors $x$ such that their covariance matches $\Sigma$. The trick is to start with samples $z$ from a standard normal distribution $\mathcal{N}(0, I)$ (which is easy to generate), and then transform them so that they have the desired covariance.

- If $z \sim \mathcal{N}(0, I)$, then $x = L z + \mu$ will have mean $\mu$ and covariance $\Sigma$.
- This works because:
    - $\mathbb{E}[x] = L \mathbb{E}[z] + \mu = \mu$
    - $\text{Cov}(x) = L \text{Cov}(z) L^T = L I L^T = L L^T = \Sigma$
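
The two identities above can be checked empirically by drawing many samples and comparing their sample mean and covariance with $\mu$ and $\Sigma$. The values of `mu` and `Sigma`, the seed, and the sample count are arbitrary choices for this sketch.

```python
import jax.numpy as jnp
import jax.random as jrandom

# Illustrative target distribution
mu = jnp.array([1.0, -2.0])
Sigma = jnp.array([[2.0, 0.8], [0.8, 1.0]])

L = jnp.linalg.cholesky(Sigma)
z = jrandom.normal(jrandom.PRNGKey(0), (100_000, 2))  # each row is z ~ N(0, I)
x = z @ L.T + mu  # row-wise version of x = L z + mu

empirical_mean = jnp.mean(x, axis=0)
empirical_cov = jnp.cov(x, rowvar=False)  # rows are observations
print(empirical_mean)  # should be close to mu
print(empirical_cov)  # should be close to Sigma
```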

**Intuition:**  
The Cholesky factor $L$ “shapes” the spherical cloud of standard normal samples into the elliptical cloud described by $\Sigma$. Each sample $z$ is first scaled and sheared by the lower triangular matrix $L$, then shifted by the mean $\mu$.

**Summary:**  
Cholesky decomposition is a fast, stable way to generate samples from any multivariate Gaussian, by transforming standard normal samples. It’s widely used in probabilistic modeling, Bayesian inference, and machine learning whenever we need to work with multivariate Gaussians.

In [46]:
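def cholesky_sample(raw_samples, mu, Sigma, circle_pts):
    """Transform N(0, I) samples and the unit circle via the Cholesky factor.

    raw_samples has shape (N, 2), one sample per row; circle_pts has
    shape (2, M), one point per column.
    """
    # Cholesky decomposition: Sigma = L @ L.T
    L = jnp.linalg.cholesky(Sigma)
    # Row-wise form of x = L @ z + mu: each row z becomes z @ L.T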
    transformed_samples = jnp.dot(raw_samples, L.T)
    shifted_samples = transformed_samples + mu
    transformed_circle_pts = jnp.dot(L, circle_pts)
    shifted_circle_pts = transformed_circle_pts + mu[:, None]

    return (
        transformed_samples,
        shifted_samples,
        transformed_circle_pts,
        shifted_circle_pts,
    )

### Method 2: Eigenvalue Decomposition

Eigenvalue decomposition is a fundamental concept in linear algebra, especially useful for understanding the structure of symmetric matrices like covariance matrices in Gaussian distributions.

#### What is Eigenvalue Decomposition?

Given a symmetric matrix $\Sigma \in \mathbb{R}^{n \times n}$ (such as a covariance matrix), eigenvalue decomposition expresses $\Sigma$ as:
$$
\Sigma = V D V^\top
$$
where:
- $V$ is an $n \times n$ orthogonal matrix whose columns are the eigenvectors of $\Sigma$.
- $D$ is a diagonal matrix whose diagonal entries are the eigenvalues of $\Sigma$.

The eigenvectors represent directions in space where the transformation $\Sigma$ acts as simple scaling, and the eigenvalues tell us how much scaling occurs along each direction.

#### How Does the Algorithm Work?

1. **Find Eigenvalues and Eigenvectors:**  
    For a symmetric matrix $\Sigma$, solve the equation
    $$
    \Sigma v = \lambda v
    $$
    for each eigenvector $v$ and its corresponding eigenvalue $\lambda$.

2. **Form the Matrices:**  
    - Stack the normalized eigenvectors as columns to form $V$.
    - Place the eigenvalues on the diagonal of $D$.

3. **Reconstruct the Matrix:**  
    The original matrix can be reconstructed as $V D V^\top$.

#### How Does Eigenvalue Decomposition Help with Sampling?

To sample from a multivariate Gaussian $\mathcal{N}(\mu, \Sigma)$:
1. **Sample $z$ from $\mathcal{N}(0, I)$:**  
    $z$ is a vector of independent standard normal variables.

2. **Transform the Samples:**  
    Compute $L = V \sqrt{D}$, where $\sqrt{D}$ is a diagonal matrix with $\sqrt{\lambda_i}$ on the diagonal.  
    The sample is then:
    $$
    x = L z + \mu = V \sqrt{D} z + \mu
    $$
    This transformation stretches and rotates the standard normal samples to match the covariance structure of $\Sigma$.
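
A small numerical check makes the relationship between the two factors concrete: both $V\sqrt{D}$ and the Cholesky factor reproduce $\Sigma$, and they differ only by an orthogonal matrix. The matrix `Sigma` and the names `L_eig`, `L_chol`, and `Q` are illustrative choices for this sketch.

```python
import jax.numpy as jnp

# Illustrative covariance matrix
Sigma = jnp.array([[2.0, 0.8], [0.8, 1.0]])

eigenvalues, V = jnp.linalg.eigh(Sigma)  # Sigma = V @ diag(eigenvalues) @ V.T
# Broadcasting scales column i of V by sqrt(lambda_i), i.e. V @ sqrt(D)
L_eig = V * jnp.sqrt(eigenvalues)
L_chol = jnp.linalg.cholesky(Sigma)

# Both factors reproduce Sigma ...
reconstruction_eig = L_eig @ L_eig.T
reconstruction_chol = L_chol @ L_chol.T

# ... and they are related by an orthogonal matrix Q with L_eig = L_chol @ Q
Q = jnp.linalg.solve(L_chol, L_eig)
print(jnp.round(Q @ Q.T, 4))  # should be close to the identity
```

Since $Qz$ is again distributed as $\mathcal{N}(0, I)$ for orthogonal $Q$, both factors produce samples from the same distribution, even though individual samples differ for the same $z$.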

#### Comparison: Eigenvalue vs. Cholesky Decomposition

- **Cholesky Decomposition:**  
  - Only works for symmetric positive definite matrices.
  - Decomposes $\Sigma$ as $L L^\top$, where $L$ is lower triangular.
  - Fast and numerically stable.
  - Commonly used for sampling because of its efficiency.

- **Eigenvalue Decomposition:**  
  - Exists for every symmetric matrix; for sampling, $\Sigma$ only needs to be positive semi-definite.
  - Decomposes $\Sigma$ as $V D V^\top$.
  - Provides insight into the principal directions (eigenvectors) and variances (eigenvalues).
  - Can be more numerically sensitive if eigenvalues are close to zero or negative due to rounding errors.

#### Which is Better for Sampling?

- **Cholesky** is generally preferred for sampling from multivariate Gaussians because it is faster and more numerically stable for positive definite matrices (which every non-degenerate covariance matrix is).
- **Eigenvalue decomposition** is useful for understanding the geometry of the distribution (principal axes and variances), and can be used for sampling, especially when the covariance matrix is only positive semi-definite (some eigenvalues may be zero).

**Summary Table:**

| Method      | Speed      | Stability   | Insight into Geometry | Handles Semi-definite |
|-------------|------------|-------------|----------------------|----------------------|
| Cholesky    | Fast       | Very good   | No                   | No                   |
| Eigenvalue  | Slower     | Can be less | Yes                  | Yes                  |

In practice, use Cholesky for efficient sampling, and eigenvalue decomposition when you want to analyze or visualize the principal directions and variances of your Gaussian.
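
To illustrate the semi-definite case, consider a rank-1 covariance where the two coordinates are perfectly correlated. The sketch below builds the factor $V\sqrt{D}$ with clipped eigenvalues and draws a few samples; the matrix `Sigma_psd`, the seed, and the sample count are arbitrary illustrative choices.

```python
import jax.numpy as jnp
import jax.random as jrandom

# Rank-1 covariance: all probability mass lies on the line x_2 = x_1
Sigma_psd = jnp.array([[1.0, 1.0], [1.0, 1.0]])

eigenvalues, V = jnp.linalg.eigh(Sigma_psd)
# Clip tiny negative eigenvalues caused by rounding before taking the square root
L_eig = V * jnp.sqrt(jnp.maximum(eigenvalues, 0.0))

z = jrandom.normal(jrandom.PRNGKey(1), (5, 2))
samples = z @ L_eig.T  # zero-mean samples with covariance Sigma_psd
print(samples)  # both columns should be (nearly) identical
```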

In [47]:
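def eigenvalue_sample(raw_samples, mu, Sigma, circle_pts):
    """Transform N(0, I) samples and the unit circle via L = V @ sqrt(D).

    raw_samples has shape (N, 2), one sample per row; circle_pts has
    shape (2, M), one point per column.
    """
    # Eigenvalue decomposition: Sigma = V @ D @ V.T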
    D_diag, V = jnp.linalg.eigh(Sigma)  # Eigenvalues and eigenvectors
    # Ensure eigenvalues are non-negative for sqrt
    D_sqrt = jnp.diag(jnp.sqrt(jnp.maximum(0.0, D_diag)))
    L_eigen = jnp.dot(V, D_sqrt)  # L = V @ sqrt(D)
    # Transform samples and circle: x = L_eigen @ z + mu
    transformed_samples = jnp.dot(raw_samples, L_eigen.T)
    shifted_samples = transformed_samples + mu
    transformed_circle_pts = jnp.dot(L_eigen, circle_pts)
    shifted_circle_pts = transformed_circle_pts + mu[:, None]

    return (
        transformed_samples,
        shifted_samples,
        transformed_circle_pts,
        shifted_circle_pts,
    )

In [68]:
def update_plot(mu_0, mu_1, S11, S22, rho, N_samples, sampling_method):
    """
    Updates the plot based on the selected parameters and sampling method.
    """
    # Construct the mean vector and covariance matrix from widget values
    mu = jnp.asarray([mu_0, mu_1])
    # Ensure S11 and S22 are positive for sqrt
    s11_safe = jnp.maximum(S11, 1e-6)
    s22_safe = jnp.maximum(S22, 1e-6)
    # Keep rho strictly inside (-1, 1) so that Sigma stays positive definite
    rho_safe = jnp.clip(rho, -0.999, 0.999)

    S12 = rho_safe * jnp.sqrt(s11_safe * s22_safe)
    Sigma = jnp.asarray([[s11_safe, S12], [S12, s22_safe]])

    # Regenerate key for fresh samples each time
    global initial_key
    initial_key, subkey = jrandom.split(initial_key)

    raw_samples = jrandom.normal(
        subkey, shape=(N_samples, 2)
    )  # Raw samples from N(0, I)

    transformed_samples = None
    shifted_samples = None
    transformed_circle_pts = None
    shifted_circle_pts = None
    method_title = ""
    error_message = None

    # Circles for visualization (unit circle)
    theta = jnp.linspace(0, 2 * jnp.pi, 100)
    circle_pts = jnp.stack([jnp.cos(theta), jnp.sin(theta)])  # Unit circle

    # Perform sampling based on the selected method
    try:
        if sampling_method == "Cholesky":
            (
                transformed_samples,
                shifted_samples,
                transformed_circle_pts,
                shifted_circle_pts,
            ) = cholesky_sample(raw_samples, mu, Sigma, circle_pts)
            method_title = "Cholesky Decomposition"

        elif sampling_method == "Eigenvalue":
            (
                transformed_samples,
                shifted_samples,
                transformed_circle_pts,
                shifted_circle_pts,
            ) = eigenvalue_sample(raw_samples, mu, Sigma, circle_pts)
            method_title = "Eigenvalue Decomposition"

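    except Exception as err:  # jnp.linalg does not provide a LinAlgError class
        error_message = f"Error during sampling: {err}"
        print(error_message)  # Print to console for debugging

    # JAX's Cholesky returns NaNs for a non-positive-definite input instead of raising
    if shifted_samples is not None and not bool(
        jnp.all(jnp.isfinite(shifted_samples))
    ):
        error_message = (
            "Error: Covariance matrix is not positive definite. Adjust parameters."
        )
        print(error_message)  # Print to console for debugging
        shifted_samples = None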

    # --- Plotting using Plotly ---
    fig = go.Figure()

    if shifted_samples is not None:
        # Add traces for samples
        fig.add_trace(
            go.Scattergl(
                x=np.array(raw_samples[:, 0]),
                y=np.array(raw_samples[:, 1]),
                mode="markers",
                name="Raw Samples (N(0, I))",
                marker=dict(size=5, opacity=0.6, color="gray"),
            )
        )
        fig.add_trace(
            go.Scattergl(
                x=np.array(transformed_samples[:, 0]),
                y=np.array(transformed_samples[:, 1]),
                mode="markers",
                name="Transformed Samples (L @ z)",
                marker=dict(size=5, opacity=0.6, color="blue"),
            )
        )
        fig.add_trace(
            go.Scattergl(
                x=np.array(shifted_samples[:, 0]),
                y=np.array(shifted_samples[:, 1]),
                mode="markers",
                name="Shifted Samples (L @ z + mu)",
                marker=dict(size=5, opacity=0.6, color="red"),
            )
        )

        # Add traces for circles
        fig.add_trace(
            go.Scattergl(
                x=np.array(circle_pts[0, :]),
                y=np.array(circle_pts[1, :]),
                mode="lines",
                name="Unit Circle",
                line=dict(dash="dash", color="gray"),
            )
        )
        fig.add_trace(
            go.Scattergl(
                x=np.array(transformed_circle_pts[0, :]),
                y=np.array(transformed_circle_pts[1, :]),
                mode="lines",
                name="Transformed Circle",
                line=dict(dash="dash", color="blue"),
            )
        )
        fig.add_trace(
            go.Scattergl(
                x=np.array(shifted_circle_pts[0, :]),
                y=np.array(shifted_circle_pts[1, :]),
                mode="lines",
                name="Shifted Circle",
                line=dict(dash="dash", color="red"),
            )
        )

        # Update layout
        fig.update_layout(
            title=f"Gaussian Sampling via {method_title}",
            xaxis_title="x<sub>1</sub>",
            yaxis_title="x<sub>2</sub>",
            xaxis=dict(scaleanchor="y", scaleratio=1),  # Ensure equal aspect ratio
            yaxis=dict(scaleanchor="x", scaleratio=1),
            hovermode="closest",
            showlegend=True,
            width=1200,  # Set a fixed width for the plot
            height=600,  # Set a fixed height for the plot
        )

        # Set limits dynamically based on the shifted samples to ensure they are visible
        min_vals = jnp.min(shifted_samples, axis=0)
        max_vals = jnp.max(shifted_samples, axis=0)
        range_vals = jnp.max(
            max_vals - min_vals
        )  # Find the largest range in either dimension
        center = jnp.mean(shifted_samples, axis=0)

        # Set limits to be centered around the mean, covering a bit more than the sample range
        padding = range_vals * 0.3  # Add 30% padding
        x_range = [
            float(center[0] - range_vals / 2 - padding),
            float(center[0] + range_vals / 2 + padding),
        ]
        y_range = [
            float(center[1] - range_vals / 2 - padding),
            float(center[1] + range_vals / 2 + padding),
        ]

        fig.update_layout(xaxis_range=x_range, yaxis_range=y_range)

    else:
        # Display an empty plot or an error message if sampling failed
        fig.update_layout(title=error_message if error_message else "Plotting Error")

    fig.show()

In [69]:
# --- Link widgets to the update function ---

# Use interactive_output to link widget values to the function's arguments
interactive_plot = widgets.interactive_output(
    update_plot,
    {
        "mu_0": mu0_slider,
        "mu_1": mu1_slider,
        "S11": s11_slider,
        "S22": s22_slider,
        "rho": rho_slider,
        "N_samples": n_samples_slider,
        "sampling_method": method_radio,
    },
)

# Arrange widgets and the plot output
controls = widgets.VBox(
    [
        widgets.Label("Gaussian Parameters:"),
        widgets.HBox([mu0_slider, mu1_slider]),
        widgets.HBox([s11_slider, s22_slider]),
        widgets.HBox([rho_slider, n_samples_slider]),
        method_radio,
    ]
)

# Display the controls and the plot
display(controls)
display(interactive_plot)

### Visualizing the Sampling Process

The plot above shows the transformation steps:

* Raw Samples (Gray): These are the initial samples drawn from the standard normal distribution $\mathcal{N}(0, I)$. They form a spherical cloud centered at the origin.

* Transformed Samples (Blue): These samples are obtained by multiplying the raw samples by the matrix $L$ (from either Cholesky or Eigenvalue decomposition). This step scales and rotates the spherical cloud according to the covariance structure.

* Shifted Samples (Red): These are the final samples, obtained by adding the mean vector $\mu$ to the transformed samples. This shifts the center of the elliptical cloud to the mean $\mu$.

The circles similarly show how the unit circle (gray) is transformed by $L$ (blue) and then shifted by $\mu$ (red), illustrating the shape and location of the resulting Gaussian distribution.

**Conclusion**

Sampling from a multivariate Gaussian distribution is a fundamental technique. We learned that it can be done by transforming samples from a standard normal distribution using a matrix derived from the covariance matrix (via Cholesky or Eigenvalue decomposition) and shifting by the mean $\mu$. This interactive visualization helps demonstrate how the mean and covariance matrix completely define the location and shape of the Gaussian distribution.

Understanding these sampling methods provides a deeper insight into the structure of Gaussian distributions and their role in probabilistic modeling.

### Closure Properties: The Magic of Gaussians

#### 1. Products of Gaussians are Gaussians

**Multivariate Case:**

If $p_1(x) = \mathcal{N}(x; \mu_1, \Sigma_1)$ and $p_2(x) = \mathcal{N}(x; \mu_2, \Sigma_2)$ are two multivariate Gaussian densities over the same variable $x \in \mathbb{R}^n$, their product is proportional to another multivariate Gaussian density:

$$
p_1(x) p_2(x) \propto \mathcal{N}(x; \mu_{\text{post}}, \Sigma_{\text{post}})
$$

The parameters of the resulting Gaussian are given by:

$$
\Sigma_{\text{post}}^{-1} = \Sigma_1^{-1} + \Sigma_2^{-1}
$$

$$
\Sigma_{\text{post}}^{-1} \mu_{\text{post}} = \Sigma_1^{-1} \mu_1 + \Sigma_2^{-1} \mu_2
$$

These formulas generalize the univariate case: the precision matrices (inverse covariance matrices) add, and the precision-weighted means add.

**Realistic Example: Combining Sensor Measurements for 2D Position**

Let's extend the sensor fusion example to 2D. Imagine you're tracking an object's position in a 2D plane $(x, y)$. You have a prior belief about its true position, modeled as a 2D Gaussian $p(\text{true\_pos}) = \mathcal{N}(\text{true\_pos}; \mu_{\text{prior}}, \Sigma_{\text{prior}})$. You receive measurements from two sensors, each providing a noisy 2D position estimate.

- **Sensor A**: Measurement $\mathbf{y}_A \in \mathbb{R}^2$ with likelihood $p(\mathbf{y}_A|\text{true\_pos}) = \mathcal{N}(\mathbf{y}_A; \text{true\_pos}, \Sigma_A)$, where $\Sigma_A$ is the covariance of sensor A's noise.

- **Sensor B**: Measurement $\mathbf{y}_B \in \mathbb{R}^2$ with likelihood $p(\mathbf{y}_B|\text{true\_pos}) = \mathcal{N}(\mathbf{y}_B; \text{true\_pos}, \Sigma_B)$, where $\Sigma_B$ is the covariance of sensor B's noise.

The posterior distribution of the true position after observing both measurements is proportional to the prior times the likelihoods:

$$
p(\text{true\_pos} | \mathbf{y}_A, \mathbf{y}_B) \propto p(\text{true\_pos}) p(\mathbf{y}_A|\text{true\_pos}) p(\mathbf{y}_B|\text{true\_pos}).
$$

Similar to the univariate case, $p(\mathbf{y}|\text{true\_pos}) = \mathcal{N}(\mathbf{y}; \text{true\_pos}, \Sigma)$ can be viewed as a Gaussian in the variable $\text{true\_pos}$ with mean $\mathbf{y}$ and covariance $\Sigma$. The posterior is a product of three multivariate Gaussian densities (prior and two likelihoods), resulting in a new multivariate Gaussian for $\text{true\_pos}$.


In [10]:
# Function to combine (multiply) two multivariate Gaussian densities
def combine_multivariate_gaussians(mu1, Sigma1, mu2, Sigma2):
    """
    Combines two multivariate Gaussian densities N(x; mu1, Sigma1) * N(x; mu2, Sigma2).
    Returns the parameters (mu_post, Sigma_post) of the resulting Gaussian (up to a normalization constant).
    """
    # Calculate precision matrices (inverse covariances)
    # Use jnp.linalg.inv for matrix inversion
    Sigma1_inv = jnp.linalg.inv(Sigma1)
    Sigma2_inv = jnp.linalg.inv(Sigma2)

    # Calculate the precision matrix and covariance matrix of the resulting Gaussian
    Sigma_post_inv = Sigma1_inv + Sigma2_inv
    Sigma_post = jnp.linalg.inv(Sigma_post_inv)

    # Calculate the mean of the resulting Gaussian
    mu_post = jnp.dot(Sigma_post, jnp.dot(Sigma1_inv, mu1) + jnp.dot(Sigma2_inv, mu2))

    return mu_post, Sigma_post


# Example: Combining a prior belief with a 2D sensor reading
prior_mu_2d = jnp.array([10.0, 20.0])  # Prior belief about 2D position
prior_Sigma_2d = jnp.array([[5.0, 1.0], [1.0, 5.0]])  # Prior uncertainty (covariance)

sensor_reading_2d = jnp.array([11.0, 21.0])  # Sensor A reading
sensor_noise_Sigma_2d = jnp.array(
    [[1.0, 0.2], [0.2, 1.0]]
)  # Covariance of sensor A noise

# Combine prior and sensor reading A
# The likelihood N(sensor_reading_2d | true_pos) is treated as a Gaussian in true_pos with mean sensor_reading_2d
mu_prior_sA, Sigma_prior_sA = combine_multivariate_gaussians(
    prior_mu_2d, prior_Sigma_2d, sensor_reading_2d, sensor_noise_Sigma_2d
)

print(f"Prior: N(mu={prior_mu_2d}, Sigma=\n{prior_Sigma_2d})")
print(f"Sensor A Reading: {sensor_reading_2d} (with Sigma=\n{sensor_noise_Sigma_2d})")
print(f"Posterior after Sensor A: N(mu={mu_prior_sA}, Sigma=\n{Sigma_prior_sA})")

# Let's add a second sensor reading
sensor_reading_2d_B = jnp.array([10.5, 20.8])  # Sensor B reading
sensor_noise_Sigma_2d_B = jnp.array(
    [[1.2, 0.1], [0.1, 1.2]]
)  # Covariance of sensor B noise

# Combine the current posterior (after Sensor A) with Sensor B reading
mu_posterior_final, Sigma_posterior_final = combine_multivariate_gaussians(
    mu_prior_sA, Sigma_prior_sA, sensor_reading_2d_B, sensor_noise_Sigma_2d_B
)

print(
    f"\nSensor B Reading: {sensor_reading_2d_B} (with Sigma=\n{sensor_noise_Sigma_2d_B})"
)
print(
    f"Posterior after Sensor B: N(mu={mu_posterior_final}, Sigma=\n{Sigma_posterior_final})"
)

# Notice how the posterior mean moves towards the sensor readings, weighted by their precision.
# The diagonal elements of the posterior covariance matrix (variances) decrease, indicating reduced uncertainty in both dimensions.
# The off-diagonal elements (covariances) also update based on the relationships in the prior and likelihood covariances.


Prior: N(mu=[10. 20.], Sigma=
[[5. 1.]
 [1. 5.]])
Sensor A Reading: [11. 21.] (with Sigma=
[[1.  0.2]
 [0.2 1. ]])
Posterior after Sensor A: N(mu=[10.833333 20.833334], Sigma=
[[0.8333333  0.16666669]
 [0.16666669 0.8333333 ]])

Sensor B Reading: [10.5 20.8] (with Sigma=
[[1.2 0.1]
 [0.1 1.2]])
Posterior after Sensor B: N(mu=[10.697018 20.810226], Sigma=
[[0.49015585 0.07506154]
 [0.07506154 0.49015588]])
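
Because precisions simply add, the order in which the factors are multiplied should not matter, and fusing everything in one pass should reproduce the sequential result above. The sketch below checks this. The helper name `fuse_gaussians` is hypothetical, and using `jnp.linalg.solve` for the precision-weighted means is a design choice of this sketch rather than something prescribed by the lecture.

```python
import jax.numpy as jnp


def fuse_gaussians(means, covariances):
    """Fuse several Gaussian factors over the same variable.

    Accumulates the precision matrices and the precision-weighted means,
    then converts back to (mean, covariance).
    """
    dim = means[0].shape[0]
    precision = jnp.zeros((dim, dim))
    precision_mean = jnp.zeros(dim)
    for mu, Sigma in zip(means, covariances):
        precision = precision + jnp.linalg.inv(Sigma)
        # Sigma^{-1} mu via a linear solve instead of an explicit inverse
        precision_mean = precision_mean + jnp.linalg.solve(Sigma, mu)
    Sigma_post = jnp.linalg.inv(precision)
    mu_post = jnp.linalg.solve(precision, precision_mean)
    return mu_post, Sigma_post


# Same numbers as in the cell above: prior, sensor A, sensor B
means = [jnp.array([10.0, 20.0]), jnp.array([11.0, 21.0]), jnp.array([10.5, 20.8])]
covs = [
    jnp.array([[5.0, 1.0], [1.0, 5.0]]),
    jnp.array([[1.0, 0.2], [0.2, 1.0]]),
    jnp.array([[1.2, 0.1], [0.1, 1.2]]),
]

mu_all, Sigma_all = fuse_gaussians(means, covs)
# Reversed order: sensor B, sensor A, prior
mu_rev, Sigma_rev = fuse_gaussians(means[::-1], covs[::-1])

print("Batch fusion mean:", mu_all)
print("Batch fusion covariance:\n", Sigma_all)
print("Same mean in reversed order:", bool(jnp.allclose(mu_all, mu_rev, atol=1e-3)))
```

Up to floating-point error, the batch result should agree with the "Posterior after Sensor B" printed above.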


#### 2. Linear Transformations of Gaussians are Gaussians

If $z \sim \mathcal{N}(\mu, \Sigma)$ is an $n$-dimensional Gaussian vector and $y = Az + b$ is a linear transformation, where $A$ is an $m \times n$ matrix and $b$ is an $m$-dimensional vector, then the resulting $m$-dimensional vector $y$ is also Gaussian:

$$
y \sim \mathcal{N}(A\mu + b,\, A\Sigma A^\top)
$$

- The mean is the transformed mean: $A\mu + b$
- The covariance is transformed by $A$ and $A^\top$: $A\Sigma A^\top$

**Realistic Example: Predicting Related Variables**

Suppose you have a Gaussian model of a person's height and weight ($z = [\text{height},\, \text{weight}]^\top \sim \mathcal{N}(\mu_z, \Sigma_z)$). You are interested in predicting a new variable, say, their estimated calorie intake based on a simple linear model:

$$
\text{calories} = a_1 \cdot \text{height} + a_2 \cdot \text{weight} + c
$$

This is a linear transformation $y = Az + b$ with $A = [a_1,\, a_2]$ (a $1 \times 2$ matrix) and $b = [c]$, so the distribution of estimated calorie intake is itself Gaussian.

In [11]:
# Example: Linear transformation of a 2D Gaussian
mu_z = jnp.array([175.0, 75.0])  # Mean height (cm), mean weight (kg)
Sigma_z = jnp.array([[60.0, 25.0], [25.0, 40.0]])  # Covariance matrix

# Define a linear transformation for estimated calorie intake
# Estimated Calories = 5 * height + 10 * weight - 500
A = jnp.array([[5.0, 10.0]])  # Matrix A (1x2)
b = jnp.array([-500.0])  # Vector b (1,)

# Calculate the mean and covariance of the transformed variable y (estimated calories)
mu_y = jnp.dot(A, mu_z) + b
Sigma_y = jnp.dot(jnp.dot(A, Sigma_z), A.T)

print(f"\nOriginal Gaussian (Height, Weight): mu = {mu_z}, Sigma = {Sigma_z}")
print(f"Linear Transformation for Calories: A = {A}, b = {b}")
print(f"Transformed Gaussian (Estimated Calories): mu_y = {mu_y}, Sigma_y = {Sigma_y}")

# The distribution of estimated calorie intake is Gaussian N(mu_y, Sigma_y). Sigma_y will be a 1x1 matrix (a scalar variance).


Original Gaussian (Height, Weight): mu = [175.  75.], Sigma = [[60. 25.]
 [25. 40.]]
Linear Transformation for Calories: A = [[ 5. 10.]], b = [-500.]
Transformed Gaussian (Estimated Calories): mu_y = [1125.], Sigma_y = [[8000.]]
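
The analytic result can be sanity-checked by sampling. The sketch below redefines the same parameters so it runs on its own, draws samples of $z$, pushes each through $y = Az + b$, and compares the empirical mean and variance with $A\mu + b$ and $A\Sigma A^\top$. The sample size and the PRNG seed are arbitrary choices made for this illustration.

```python
import jax.numpy as jnp
import jax.random as jrandom

# Same parameters as in the cell above
mu_z = jnp.array([175.0, 75.0])
Sigma_z = jnp.array([[60.0, 25.0], [25.0, 40.0]])
A = jnp.array([[5.0, 10.0]])
b = jnp.array([-500.0])

# Draw samples z ~ N(mu_z, Sigma_z) and transform each one: y = A z + b
key = jrandom.PRNGKey(0)
num_samples = 200_000
z_samples = jrandom.multivariate_normal(key, mu_z, Sigma_z, shape=(num_samples,))
y_samples = z_samples @ A.T + b  # shape (num_samples, 1)

empirical_mean = jnp.mean(y_samples, axis=0)
empirical_var = jnp.var(y_samples, axis=0)

# Closed-form parameters from the closure property
analytic_mean = A @ mu_z + b
analytic_cov = A @ Sigma_z @ A.T

print("Empirical mean:", empirical_mean, "| analytic mean:", analytic_mean)
print("Empirical variance:", empirical_var, "| analytic variance:", analytic_cov)
```

The empirical values only match the analytic ones up to Monte Carlo error, which shrinks as `num_samples` grows.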


#### 3. Marginals of Gaussians are Gaussians

If a Gaussian random vector $z$ is partitioned into two sub-vectors, $z=\begin{bmatrix}x \\ y\end{bmatrix}$, where $x \in \mathbb{R}^m$ and $y \in \mathbb{R}^k$ (so $n=m+k$), and its distribution is $z \sim \mathcal{N}\left(\begin{bmatrix}\mu_x \\ \mu_y\end{bmatrix}, \begin{bmatrix}\Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy}\end{bmatrix}\right)$, then the marginal distribution of $x$ (obtained by integrating out $y$) is also Gaussian:

$$
p(x) = \int p(x, y) \, dy = \mathcal{N}(x; \mu_x, \Sigma_{xx})
$$

The marginal distribution of any subset of variables in a multivariate Gaussian is simply a Gaussian with the corresponding sub-vector of the mean and the corresponding submatrix of the covariance matrix.

**Realistic Example: Focusing on a Subset of Health Metrics**

Suppose you have collected various health metrics from patients and modeled their joint distribution as a multivariate Gaussian. These metrics might include blood pressure, heart rate, cholesterol level, and blood sugar. If you only want to analyze the distribution of blood pressure and cholesterol, their marginal joint distribution is a 2D Gaussian. Its mean vector consists of the blood pressure and cholesterol entries of the full mean vector, and its covariance matrix is the $2 \times 2$ submatrix of the full covariance matrix formed by the corresponding rows and columns.


In [74]:
# Example: Marginalizing a 4D Gaussian
# Let z = [BP, HR, Cholesterol, BloodSugar]^T be a 4D Gaussian
mu_z_full = jnp.array([120.0, 70.0, 180.0, 90.0])  # Example mean vector
Sigma_z_full = jnp.array(
    [
        [10.0, 3.0, 1.0, 0.5],
        [3.0, 5.0, 0.8, 0.3],
        [1.0, 0.8, 20.0, 5.0],
        [0.5, 0.3, 5.0, 8.0],
    ]
)  # Example 4x4 SPD covariance matrix

# We want the marginal distribution of Blood Pressure (index 0) and Cholesterol (index 2).
# This corresponds to selecting dimensions 0 and 2.
selected_indices = jnp.array([0, 2])

# The marginal mean is the subset of the mean vector
mu_marginal = mu_z_full[selected_indices]

# The marginal covariance is the submatrix corresponding to the selected indices
Sigma_marginal = Sigma_z_full[jnp.ix_(selected_indices, selected_indices)]

print(f"\nOriginal 4D Gaussian (Health Metrics): \n mu = {mu_z_full}")
print("Original Sigma:\n", Sigma_z_full)
print(
    f"Marginal distribution of Blood Pressure and Cholesterol (indices \n {selected_indices}):"
)
print(f"  Marginal Mean:\n {mu_marginal}")
print("  Marginal Sigma:\n", Sigma_marginal)

# The marginal distribution p([BP, Cholesterol]) is Gaussian N(mu_marginal, Sigma_marginal).


Original 4D Gaussian (Health Metrics): 
 mu = [120.  70. 180.  90.]
Original Sigma:
 [[10.   3.   1.   0.5]
 [ 3.   5.   0.8  0.3]
 [ 1.   0.8 20.   5. ]
 [ 0.5  0.3  5.   8. ]]
Marginal distribution of Blood Pressure and Cholesterol (indices 
 [0 2]):
  Marginal Mean:
 [120. 180.]
  Marginal Sigma:
 [[10.  1.]
 [ 1. 20.]]
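
Marginalization can also be read as a special case of the linear transformation property: picking out a subset of coordinates is the linear map $x = Sz$, where $S$ consists of the matching rows of the identity matrix. The sketch below (the name `S` is introduced here for illustration) confirms that $S\mu$ and $S\Sigma S^\top$ agree with the indexing approach used above.

```python
import jax.numpy as jnp

# Same 4D Gaussian as in the cell above: [BP, HR, Cholesterol, BloodSugar]
mu_z_full = jnp.array([120.0, 70.0, 180.0, 90.0])
Sigma_z_full = jnp.array(
    [
        [10.0, 3.0, 1.0, 0.5],
        [3.0, 5.0, 0.8, 0.3],
        [1.0, 0.8, 20.0, 5.0],
        [0.5, 0.3, 5.0, 8.0],
    ]
)
selected_indices = jnp.array([0, 2])

# Selection matrix: rows of the 4x4 identity for the kept dimensions, shape (2, 4)
S = jnp.eye(4)[selected_indices]

# Marginal via the linear transformation rule y = S z
mu_via_projection = S @ mu_z_full
Sigma_via_projection = S @ Sigma_z_full @ S.T

# Marginal via direct indexing, as in the cell above
mu_via_indexing = mu_z_full[selected_indices]
Sigma_via_indexing = Sigma_z_full[jnp.ix_(selected_indices, selected_indices)]

print("Means agree:", bool(jnp.allclose(mu_via_projection, mu_via_indexing)))
print("Covariances agree:", bool(jnp.allclose(Sigma_via_projection, Sigma_via_indexing)))
```

In practice indexing is the cheaper route. The selection-matrix view is mainly useful for seeing why the marginal has to be Gaussian.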


#### 4. Conditionals of Gaussians are Gaussians

If a Gaussian random vector $z=\begin{bmatrix}x \\ y\end{bmatrix}$ is partitioned as above, the conditional distribution of $x$ given that $y$ takes a specific observed value is also Gaussian:

$$
p(x \mid y) = \mathcal{N}(x; \mu_{x \mid y}, \Sigma_{x \mid y})
$$

The parameters of the conditional distribution are given by:

$$
\mu_{x \mid y} = \mu_x + \Sigma_{xy} \Sigma_{yy}^{-1} (y - \mu_y)
$$

$$
\Sigma_{x \mid y} = \Sigma_{xx} - \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx}
$$

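These formulas follow from completing the square in $x$ in the exponent of the joint density. $\Sigma_{yy}^{-1}$ is the inverse of the covariance matrix of the observed variables $y$, and $\Sigma_{x \mid y}$ is the Schur complement of $\Sigma_{yy}$ in the joint covariance. Note that only the conditional mean depends on the observed value of $y$; the conditional covariance does not.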

**Realistic Example: Predicting Unobserved Variables Based on Observations**

This property is fundamental for prediction and state estimation.

- **Predicting missing data**: If you have a joint Gaussian model of several variables and you've only observed a subset of them ($y$), you can predict the distribution of the unobserved variables ($x$) using the conditional distribution $p(x \mid y)$. The conditional mean $\mu_{x \mid y}$ is the minimum mean-squared-error prediction of $x$ given $y$ (and it is an affine function of $y$), while the conditional covariance $\Sigma_{x \mid y}$ quantifies the uncertainty in this prediction.

- **Kalman Filter**: The Kalman filter, a widely used algorithm for state estimation in dynamic systems, relies heavily on the fact that if the system dynamics are linear and noise is Gaussian, the posterior distribution of the system state remains Gaussian and can be updated analytically using these conditional formulas.


In [13]:
# Example: Conditioning a 2D Gaussian
# Let z = [x, y]^T be a 2D Gaussian
mu_z_cond = jnp.array([1.0, 2.0])  # mu_x, mu_y
Sigma_z_cond = jnp.array(
    [[1.0, 0.5], [0.5, 1.0]]
)  # Sigma_xx, Sigma_xy, Sigma_yx, Sigma_yy

# Partition the mean and covariance for scalar x and scalar y
mu_x = mu_z_cond[0]
mu_y = mu_z_cond[1]
Sigma_xx = Sigma_z_cond[0, 0]
Sigma_xy = Sigma_z_cond[0, 1]
Sigma_yx = Sigma_z_cond[1, 0]  # For scalars, Sigma_yx = Sigma_xy
Sigma_yy = Sigma_z_cond[1, 1]

# Assume we observe y = 2.5
observed_y = 2.5

# Calculate the conditional mean and variance of x given y
# Need Sigma_yy_inv (which is just 1/Sigma_yy for a scalar)
Sigma_yy_inv = 1.0 / Sigma_yy

mu_x_given_y = mu_x + Sigma_xy * Sigma_yy_inv * (observed_y - mu_y)
Sigma_x_given_y = Sigma_xx - Sigma_xy * Sigma_yy_inv * Sigma_yx

print(f"\nOriginal 2D Gaussian: mu = {mu_z_cond}, Sigma = {Sigma_z_cond}")
print(f"Observed y = {observed_y}")
print(f"Conditional distribution p(x | y={observed_y}):")
print(f"  Conditional Mean (mu_x|y): {mu_x_given_y}")
print(f"  Conditional Variance (Sigma_x|y): {Sigma_x_given_y}")

# The conditional distribution p(x | y=observed_y) is Gaussian N(mu_x_given_y, Sigma_x_given_y).
# The conditional mean is adjusted from the marginal mean mu_x based on how far the observation y is from its mean mu_y, scaled by the covariance Sigma_xy and the inverse variance of y.
# The conditional variance Sigma_x|y is less than or equal to the marginal variance Sigma_xx, reflecting reduced uncertainty about x after observing y.


Original 2D Gaussian: mu = [1. 2.], Sigma = [[1.  0.5]
 [0.5 1. ]]
Observed y = 2.5
Conditional distribution p(x | y=2.5):
  Conditional Mean (mu_x|y): 1.25
  Conditional Variance (Sigma_x|y): 0.75
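
The scalar example generalizes directly to vector-valued $x$ and $y$. The sketch below is one possible implementation: the function name `condition_gaussian`, the index-based partitioning, and the observed values in the second example are assumptions made for illustration. It uses `jnp.linalg.solve` instead of forming $\Sigma_{yy}^{-1}$ explicitly.

```python
import jax.numpy as jnp


def condition_gaussian(mu, Sigma, idx_x, idx_y, y_obs):
    """Return mean and covariance of p(x | y = y_obs) for jointly Gaussian (x, y)."""
    mu_x, mu_y = mu[idx_x], mu[idx_y]
    Sigma_xx = Sigma[jnp.ix_(idx_x, idx_x)]
    Sigma_xy = Sigma[jnp.ix_(idx_x, idx_y)]
    Sigma_yy = Sigma[jnp.ix_(idx_y, idx_y)]
    # gain = Sigma_xy Sigma_yy^{-1}, obtained from a linear solve (Sigma_yy is symmetric)
    gain = jnp.linalg.solve(Sigma_yy, Sigma_xy.T).T
    mu_cond = mu_x + gain @ (y_obs - mu_y)
    Sigma_cond = Sigma_xx - gain @ Sigma_xy.T
    return mu_cond, Sigma_cond


# 1) Reproduce the scalar example above: x = index 0, y = index 1, observed y = 2.5
mu_2d = jnp.array([1.0, 2.0])
Sigma_2d = jnp.array([[1.0, 0.5], [0.5, 1.0]])
mu_c, Sigma_c = condition_gaussian(
    mu_2d, Sigma_2d, jnp.array([0]), jnp.array([1]), jnp.array([2.5])
)
print("Scalar case:", mu_c, Sigma_c)

# 2) Health metrics: condition [BP, HR] on observed [Cholesterol, BloodSugar]
mu_4d = jnp.array([120.0, 70.0, 180.0, 90.0])
Sigma_4d = jnp.array(
    [
        [10.0, 3.0, 1.0, 0.5],
        [3.0, 5.0, 0.8, 0.3],
        [1.0, 0.8, 20.0, 5.0],
        [0.5, 0.3, 5.0, 8.0],
    ]
)
y_obs = jnp.array([200.0, 100.0])  # assumed observation, for illustration only
mu_bp_hr, Sigma_bp_hr = condition_gaussian(
    mu_4d, Sigma_4d, jnp.array([0, 1]), jnp.array([2, 3]), y_obs
)
print("p([BP, HR] | Cholesterol, BloodSugar): mean =", mu_bp_hr)
print("covariance =\n", Sigma_bp_hr)
```

The conditional variances on the diagonal of `Sigma_bp_hr` cannot exceed the corresponding marginal variances, since the subtracted term is positive semi-definite.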


### Gaussian Inference is Linear Algebra

The core insight from these closure properties is that Gaussian inference is fundamentally linear algebra. If all random variables in your model are jointly Gaussian and their relationships are linear, then any marginal or conditional distribution you want to compute will also be Gaussian, and its mean and covariance can be found by applying linear algebra operations (matrix multiplication, inversion, addition) to the means and covariances of the original joint distribution.

This makes Gaussian models highly tractable for inference, even in high dimensions, provided you can handle the necessary matrix computations.

The slides summarize this with the theorem:

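If $p(x) = \mathcal{N}(x; \mu, \Sigma)$ and $p(y \mid x) = \mathcal{N}(y; Ax + b, \Lambda)$, then both the marginal of $y$ and the posterior over $x$ are Gaussian:

$$
p(y) = \mathcal{N}(y; A\mu + b,\, A\Sigma A^\top + \Lambda)
$$

$$
p(x \mid y) = \mathcal{N}\left(x;\ \mu + \Sigma A^\top (A\Sigma A^\top + \Lambda)^{-1}(y - A\mu - b),\ \Sigma - \Sigma A^\top (A\Sigma A^\top + \Lambda)^{-1} A\Sigma\right)
$$

These are the conditioning formulas from above applied to the joint Gaussian over $(x, y)$, for which $\Sigma_{xy} = \Sigma A^\top$ and $\Sigma_{yy} = A\Sigma A^\top + \Lambda$. The whole update consists of matrix manipulations of the parameters $\mu, \Sigma, A, b, \Lambda$.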
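
A minimal sketch of this theorem in code follows. The function name `linear_gaussian_posterior` and the variable names are assumptions of this sketch, and the update is written in the "gain" form obtained by conditioning the joint Gaussian over $(x, y)$; the slides may arrange the terms differently. The first example sets $A = I$ and $b = 0$, which should reproduce the posterior after Sensor A from the fusion example. The second observes only the first coordinate, with an assumed noise variance of 1.

```python
import jax.numpy as jnp


def linear_gaussian_posterior(mu, Sigma, A, b, Lambda, y):
    """Posterior p(x | y) for prior N(x; mu, Sigma) and likelihood N(y; A x + b, Lambda)."""
    S = A @ Sigma @ A.T + Lambda  # covariance of the marginal p(y)
    residual = y - (A @ mu + b)  # observation minus its predicted mean
    # gain = Sigma A^T S^{-1}, computed via a linear solve (S is symmetric)
    gain = jnp.linalg.solve(S, A @ Sigma).T
    mu_post = mu + gain @ residual
    Sigma_post = Sigma - gain @ A @ Sigma
    return mu_post, Sigma_post


mu_prior = jnp.array([10.0, 20.0])
Sigma_prior = jnp.array([[5.0, 1.0], [1.0, 5.0]])

# Example 1: direct noisy observation of the full state (A = I, b = 0)
mu_full, Sigma_full = linear_gaussian_posterior(
    mu_prior,
    Sigma_prior,
    A=jnp.eye(2),
    b=jnp.zeros(2),
    Lambda=jnp.array([[1.0, 0.2], [0.2, 1.0]]),
    y=jnp.array([11.0, 21.0]),
)
print("Full observation: mean =", mu_full)
print("covariance =\n", Sigma_full)

# Example 2: observe only the first coordinate (assumed noise variance 1.0)
mu_part, Sigma_part = linear_gaussian_posterior(
    mu_prior,
    Sigma_prior,
    A=jnp.array([[1.0, 0.0]]),
    b=jnp.zeros(1),
    Lambda=jnp.array([[1.0]]),
    y=jnp.array([11.0]),
)
print("Partial observation: mean =", mu_part)
print("covariance =\n", Sigma_part)
```

In the second example the unobserved coordinate still shifts slightly, because the prior covariance couples it to the observed one.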

#### Summary

Lecture 06 established the Gaussian distribution as a cornerstone of probabilistic machine learning due to its remarkable closure properties under linear operations:

- Products of Gaussian densities are proportional to Gaussian densities.
- Linear transformations of Gaussian variables result in Gaussian variables.
- Marginal distributions of Gaussian variables are Gaussian.
- Conditional distributions of jointly Gaussian variables are Gaussian.

These properties imply that for linear Gaussian models, Bayesian inference is analytically tractable and can be performed using linear algebra. This is a powerful tool for modeling and inference in many real-world applications.
