# Diffusion Models
---
Gabrijel Boduljak, James Thornton
<br>


In this tutorial, we’ll build the theoretical foundation for understanding **continuous-time diffusion models**. The key idea is simple but powerful: we gradually corrupt real data with noise using a forward process, and then learn to reverse this corruption step by step to generate new data samples from pure noise. We’ll start by introducing stochastic differential equations (SDEs) and how they let us formalize this forward noising process. Then we’ll see how the reverse process can recover the data, and why we need to approximate it with a neural network trained via **denoising score matching (DSM)**. Finally, we’ll explain one of the most remarkable features of diffusion models: their **guidance ability**, which allows us to steer generation toward desired outcomes. We’ll explore and explain state-of-the-art **classifier-free guidance** and demonstrate how it can be used to generate specific digits from the MNIST dataset. We will also show how to incorporate classifier-free guidance into the state-of-the-art diffusion architecture, DiT. By the end, you’ll have a clear picture of:
  1. What diffusion models actually learn?
  2. How to train diffusion models?
  3. How to sample from diffusion models, potentially guiding the sampling process towards the desired outcome?


Tutorial outline:
- Theory recap
- Setup (framework info and python imports)
- Practical 1. Fundamentals of Score-Based Diffusion Models (basic level)
- Practical 2. Controllable (Image) Generation (intermediate/advanced level)
- [Optional] Practical 3.: Flow Matching vs Diffusion Models (advanced level)

---


## Introduction

Diffusion / flow models entail traversing a stochastic process $(\mathbf{X}_t)_{t=0}^{T}$ between some data distribution, $p_\textrm{data}=p_0$, at $t=0$ and some easy-to-sample distribution, typically Gaussian, $p_\textrm{prior}=p_T$, at $t=T$, where each marginal satisfies $\mathbf{X}_t \sim p_t$.

Without loss of generality we set $T=1$, and denote the reference process as $\mathbf{X}_t = \alpha_t\mathbf{X}_0 + \sigma_t\mathbf{X}_1$, equivalently written as $\mathbf{X}_t = \alpha_t\mathbf{X}_0 + \sigma_t \epsilon$, where $\mathbf{X}_1=\epsilon \sim N(\mathbf{0}, \mathbb{I})$.

Remarkably, training a network to estimate $\mathbb{E}[\mathbf{X}_0|x_t]$ is sufficient to use as a generative model, and indeed learning any linear combination of $\mathbb{E}[\mathbf{X}_0|x_t]$ and $\mathbb{E}[\mathbf{X}_1|x_t]$ would also be sufficient.

Approaches to choose $\alpha_t$, $\sigma_t$:

1) **Traditional Diffusion Models:** $\alpha_t$, $\sigma_t$ are derived from the perturbation kernel of a forward stochastic differential equation (SDE).

Transition through time using reverse SDE.


2) **Flow Matching:** Set $\alpha_t=1-t$, $\sigma_t=t$ directly, or use any other interpolation coefficients between coupled points, sampled jointly from a coupling between $p_\textrm{data}$ and $p_\textrm{prior}$.

Transition through time using velocity $\partial_t \mathbf{X}_t$, and ordinary differential equation (ODE).

**Spoiler**:
The generative stochastic samplers common for diffusion models can be converted to an ODE with the same marginal distributions, and the generative ODE of flow matching can be converted to an SDE.

Furthermore, the exact flow matching training and sampling regime can be recovered from the diffusion perspective when $p_\textrm{prior}$ is Gaussian by choosing an appropriate forward SDE (see later).

## Theory recap
---


> "Creating noise from data is easy; creating data from noise is generative modeling."
Yang Song


### Very Quick Intro to SDE
A **stochastic differential equation (SDE)** describes the evolution of a continuous-time random process by combining a **deterministic trend** with **random perturbations**. Formally, it is written as

$$
\mathrm{d}\mathbf{x}_t = f(\mathbf{x}_t, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}_t,
$$

where $\mathbf{x}_t$ is the state at time $t$, and $\mathbf{w}_t$ is a Wiener process (Brownian motion). The first term, the **drift** $f(\mathbf{x}_t, t)$, controls the average or deterministic direction in which the process evolves, like the "force" driving the system. The second term, the **diffusion** coefficient $g(t)$, scales the random noise and determines how much randomness or variability is added at each (infinitesimal) step. In this tutorial the diffusion coefficient depends only on time.

When $g(t)=0$, the dynamics reduce to an ODE, $\mathrm{d}\mathbf{x}_t = f(\mathbf{x}_t, t)\,\mathrm{d}t$. Thus, a stochastic differential equation can be seen as an ordinary differential equation in which each infinitesimal step also receives a random perturbation driven by Brownian motion. A rigorous construction of SDEs is outside the scope of this notebook, but we provide more references in the Advanced Part. In practice, we work with discretizations of these anyway.

## Discretisation
All you need to know about SDE notation is captured by its discretisation (the Euler-Maruyama scheme):
$$
\mathbf{x}_{t_\textrm{next}} = \mathbf{x}_t + f(\mathbf{x}_t, t) (t_\textrm{next}-t) + g(t)\sqrt{t_\textrm{next}-t}\,\epsilon
$$

$$ϵ \sim N(0,\mathbb{I})$$
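As a minimal sketch of this discretisation (the Euler-Maruyama scheme), the loop below adds a drift increment and a scaled Gaussian increment to the current state at every step. The function name `euler_maruyama`, the time grid, and the Brownian-motion test case are illustrative assumptions, not part of the tutorial's code.

```python
import numpy as np

def euler_maruyama(x0, drift, diffusion, ts, rng):
    """Simulate dx = drift(x, t) dt + diffusion(t) dw on the increasing time grid ts."""
    x = np.asarray(x0, dtype=float).copy()
    for t, t_next in zip(ts[:-1], ts[1:]):
        dt = t_next - t
        eps = rng.standard_normal(x.shape)  # eps ~ N(0, I)
        # x_next = x + f(x, t) * dt + g(t) * sqrt(dt) * eps
        x = x + drift(x, t) * dt + diffusion(t) * np.sqrt(dt) * eps
    return x

rng = np.random.default_rng(0)
ts = np.linspace(0.0, 1.0, 201)
# Sanity check with pure Brownian motion (zero drift, g = 1): x_1 should be ~ N(0, 1).
x1 = euler_maruyama(np.zeros(20_000), lambda x, t: 0.0 * x, lambda t: 1.0, ts, rng)
print(x1.mean(), x1.var())
```

Setting `diffusion` to return zero turns the same loop into the explicit Euler method for the corresponding ODE.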



## Time Reversal: from data to noise, then noise to data

### Forward process
Recall
$$
\mathrm{d}\mathbf{x}_t = f(\mathbf{x}_t, t)\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}_t,
$$

$$\mathbf{x}_0 \sim p_\textrm{data}$$


#### Particular example: **variance-preserving SDE**
(Or Ornstein-Uhlenbeck process, in traditional SDE terminology)

The forward process describes how clean data $\mathbf{x}_0$ is gradually perturbed with noise over (continuous) time. It is formally specified as a stochastic differential equation, chosen such that $\mathbf{X}_T$ is close to Gaussian.
There are many possible choices, but we focus on the so-called **variance-preserving (VP) SDE**:

$$
\mathrm{d}\mathbf{x} = -\tfrac{1}{2} \beta(t) \mathbf{x} \, \mathrm{d}t + \sqrt{\beta(t)} \, \mathrm{d}\mathbf{w},
$$

where $\beta(t)$ controls the noise rate and $\mathrm{d}\mathbf{w}$ is standard Brownian motion.



The forward process plays the role of **progressively destroying structure in the data** by injecting Gaussian noise in a controlled manner.

* At **early times** ($t \approx 0$), $\mathbf{x}_t$ is close to the original data.
* At **later times** ($t \to 1$), $\mathbf{x}_t$ becomes nearly pure Gaussian noise.


This gradual corruption provides an *interpolation* between the complex data distribution and a simple Gaussian prior. The reverse process then learns to recover the clean part from the interpolation during generation.




## Reverse Process

From noise to data, the generative model.

It can be shown that the time-reversed process satisfies the following SDE, which is integrated backward in time from $t=T$ to $t=0$:

$$
\mathrm{d}\mathbf{x}_t = [f(\mathbf{x}_t, t) - g^2(t)\nabla_{x_t} \log p_t(\mathbf{x}_t)]\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}_t,
$$

$$\mathbf{x}_T \sim \mathbb{N}(0, \mathbb{I})$$

For the VP SDE:

$$
\mathrm{d}\mathbf{x} = \Big[-\frac12 \beta(t)\mathbf{x} - \beta(t)\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\Big] \mathrm{d}t + \sqrt{\beta(t)}\,\mathrm{d}\bar{\mathbf{w}}.
$$
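To make the reverse VP SDE concrete, here is a hedged sketch on a toy problem where the score is known in closed form: if $x_0 \sim N(m, s^2)$ in one dimension, then $p_t = N(\alpha(t)m,\ \alpha(t)^2 s^2 + 1 - \alpha(t)^2)$. The linear schedule $\beta(t)$ with endpoints 0.1 and 20, the toy data distribution, and all function names are assumptions made for illustration; in practice the exact score is replaced by a learned network.

```python
import numpy as np

BETA_MIN, BETA_MAX = 0.1, 20.0   # assumed linear noise schedule beta(t)
M, S = 2.0, 0.5                  # toy data distribution: x_0 ~ N(M, S^2)

def beta(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)

def alpha(t):
    # alpha(t) = exp(-0.5 * int_0^t beta(s) ds) for the linear schedule
    return np.exp(-0.5 * (BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t**2))

def score(x, t):
    # Exact score of p_t = N(alpha * M, alpha^2 * S^2 + 1 - alpha^2)
    a = alpha(t)
    var_t = a**2 * S**2 + 1.0 - a**2
    return -(x - a * M) / var_t

def reverse_sde_sample(n_samples, n_steps, rng):
    ts = np.linspace(1.0, 0.0, n_steps + 1)   # integrate from t=1 down to t=0
    x = rng.standard_normal(n_samples)        # x_1 ~ N(0, 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        h = t - t_next                        # positive step size
        drift = -0.5 * beta(t) * x - beta(t) * score(x, t)
        # Backward step: subtract the drift, add fresh noise of variance beta * h
        x = x - drift * h + np.sqrt(beta(t) * h) * rng.standard_normal(n_samples)
    return x

samples = reverse_sde_sample(20_000, 1_000, np.random.default_rng(0))
print(samples.mean(), samples.std())   # should be close to M and S
```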

## Training via denoising score matching
We learn the score $\nabla_{x_t} \log p_t(\mathbf{x}_t)$ via denoising score matching.

By Fisher's identity (see the proof sketch at the bottom of the page):
$$\nabla_{x_t} \log p_t(\mathbf{x}_t) = \mathbb{E}_{p(x_0|x_t)}[\nabla_{x_t} \log p_t(\mathbf{x}_t|x_0)]$$

Since a conditional expectation is the minimiser of the mean-squared error over all functions $s(x_t, t)$:
$$\nabla_{x_t} \log p_t(\mathbf{x}_t) = \operatorname*{arg\,min}_s \mathbb{E}_{p(x_0,x_t)}[\|\nabla_{x_t} \log p_t(\mathbf{x}_t|x_0)-s(x_t,t)\|^2]$$

One can instead restrict the minimisation to a neural network $s_\theta$:
$$\nabla_{x_t} \log p_t(\mathbf{x}_t) \approx \operatorname*{arg\,min}_{s_\theta} \mathbb{E}_{p(x_0,x_t)}[\|\nabla_{x_t} \log p_t(\mathbf{x}_t|x_0)-s_\theta(x_t,t)\|^2]$$

Here the target conditional score $\nabla_{x_t} \log p_t(\mathbf{x}_t|x_0)$ is tractable due to the Gaussian perturbation kernel (see below)

$$\nabla_{x_t} \log p_t(\mathbf{x}_t|x_0)  = -\frac{x_t-\alpha(t)x_0}{\sigma(t)^2}$$


Expanding Fisher's identity: $\nabla_{x_t} \log p_t(\mathbf{x}_t) = \mathbb{E}_{p(x_0|x_t)}[\nabla_{x_t} \log p_t(\mathbf{x}_t|x_0)]$

Gives the Tweedie formula:
$$\nabla_{x_t} \log p_t(\mathbf{x}_t) = \mathbb{E}_{p(x_0|x_t)}[-\frac{x_t-\alpha(t)x_0}{\sigma(t)^2}]$$
$$\nabla_{x_t} \log p_t(\mathbf{x}_t) = \frac{\alpha(t)\mathbb{E}_{p(x_0|x_t)}[\mathbf{x}_0|x_t]-x_t}{\sigma(t)^2}$$

Hence rather than learning the score, one could also approximate the denoiser $\mathbb{E}_{p(x_0|x_t)}[\mathbf{x}_0|x_t]$ via regression.
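The following sketch checks both statements numerically on a toy problem, under stated assumptions: the data are one-dimensional Gaussian, $x_0 \sim N(m, s^2)$, so the marginal score is linear in $x_t$, and a least-squares line stands in for the neural network $s_\theta$. The values of $\alpha(t)$ and $\sigma(t)$ are fixed at arbitrary illustrative numbers for a single time $t$.

```python
import numpy as np

rng = np.random.default_rng(0)
M, S = 2.0, 0.5        # toy data: x_0 ~ N(M, S^2)
a, sig = 0.8, 0.6      # illustrative alpha(t), sigma(t) at one fixed time t

x0 = M + S * rng.standard_normal(200_000)
eps = rng.standard_normal(x0.shape)
xt = a * x0 + sig * eps                      # sample from the perturbation kernel

# DSM regression target: the tractable conditional score
target = -(xt - a * x0) / sig**2

# Stand-in for s_theta: a linear model s(x) = w * x + b fit by least squares
w, b = np.polyfit(xt, target, 1)

# Exact marginal score of p_t = N(a * M, a^2 S^2 + sig^2) is w_true * x + b_true
var_t = a**2 * S**2 + sig**2
w_true, b_true = -1.0 / var_t, a * M / var_t

def denoiser_from_score(x, score_value):
    # Tweedie: E[x_0 | x_t] = (x_t + sigma^2 * score) / alpha
    return (x + sig**2 * score_value) / a

x_query = np.array([0.5, 1.5, 2.5])
x0_hat = denoiser_from_score(x_query, w * x_query + b)
x0_exact = M + a * S**2 / var_t * (x_query - a * M)   # Gaussian posterior mean
print(w, w_true, b, b_true)
```

Regressing on the conditional score recovers the marginal score, even though the regression never sees $p_t$ directly.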

## Perturbation Kernel

A key ingredient to the scalability of denoising score matching is choosing the forward process so that it can be sampled without simulation.

By choosing the drift to be linear $f(\mathbf{x}_t, t)=f(t)\mathbf{x}_t$:

$$
\mathrm{d}\mathbf{x}_t = f(t)\mathbf{x}_t\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}_t,
$$

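the solution to the (forward) SDE, conditional on $\mathbf{x}_0$, is Gaussian: $\mathbf{X}_t \mid \mathbf{x}_0 \sim \mathbb{N}(\alpha(t)\mathbf{x}_0, \, \sigma^2(t)\mathbf{I})$

Or equivalently

$$\mathbf{X}_t  = \alpha(t)\mathbf{X}_0 + \sigma(t) \epsilon, \qquad \epsilon \sim \mathbb{N}(\mathbf{0}, \mathbf{I})$$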

For the VP-SDE,

$$
\mathrm{d}\mathbf{x} = -\tfrac{1}{2} \beta(t) \mathbf{x} \, \mathrm{d}t + \sqrt{\beta(t)} \, \mathrm{d}\mathbf{w},
$$

the coefficients are

1. $\alpha(t) = \exp\!\Big(-\tfrac{1}{2} \int_0^t \beta(s) \, \mathrm{d}s\Big),$
2. $\sigma^2(t) = 1 - \alpha(t)^2,$

and $\mathbf{I}$ is the identity matrix.
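As an illustration, the sketch below evaluates the VP-SDE coefficients for a linear schedule $\beta(t) = \beta_\text{min} + t(\beta_\text{max} - \beta_\text{min})$, for which the integral is available in closed form, and draws $\mathbf{x}_t$ in one shot. The schedule endpoints and the helper names `alpha`, `sigma`, `sample_xt` are assumptions chosen for this example.

```python
import numpy as np

BETA_MIN, BETA_MAX = 0.1, 20.0   # assumed schedule endpoints

def alpha(t):
    # int_0^t beta(s) ds = BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t^2
    integral = BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t**2
    return np.exp(-0.5 * integral)

def sigma(t):
    return np.sqrt(1.0 - alpha(t) ** 2)

def sample_xt(x0, t, rng):
    """Draw x_t ~ N(alpha(t) x0, sigma(t)^2 I) directly, without simulating the SDE."""
    eps = rng.standard_normal(x0.shape)
    return alpha(t) * x0 + sigma(t) * eps, eps

rng = np.random.default_rng(0)
x0 = np.full(50_000, 3.0)
xt, _ = sample_xt(x0, 0.5, rng)
print(xt.mean(), 3.0 * alpha(0.5), xt.std(), sigma(0.5))
```

Returning `eps` alongside `xt` is convenient because the conditional score target equals `-eps / sigma(t)`.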


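In general, for a linear drift $f(t)\mathbf{x}_t$, the coefficients are:
\begin{align}
    \alpha(t) &= e^{\int_0^t f(s)\,\mathrm{d}s}, &
    \sigma^2(t) &= \alpha(t)^2 \int_0^t \frac{g(s)^2}{\alpha(s)^2}\,\mathrm{d}s.
\end{align}

When $f(t) \neq 0$ and the ratio $g(t)^2 / f(t)$ is constant in time (as for the VP-SDE, where it equals $-2$), this reduces to:
\begin{align}
    \mathbf{X}_t | x_0 &\sim \mathbb{N}\left(e^{ \int^t_0 f(s) \mathrm{d} s}x_0, \frac{\int_0^t g(s)^2 \mathrm{d} s}{2\int_0^t f(s)\mathrm{d} s} \left[e^{2\int_0^t f(s) \mathrm{d} s}-1\right] \right)
\end{align}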

See proof [here](https://www.diva-portal.org/smash/get/diva2:699061/FULLTEXT01.pdf)





## Classifier-Free Guidance


Suppose we want to **condition** generation on some label or feature $\mathbf{y}$. Ideally, we’d like to follow the **conditional score** $\nabla_{\mathbf{x}} \log p_t(\mathbf{x} \mid \mathbf{y})$.

By **Bayes’ theorem**,

$$
p_t(\mathbf{x}\mid \mathbf{y}) = \frac{p_t(\mathbf{y}\mid \mathbf{x})\, p_t(\mathbf{x})}{p_t(\mathbf{y})}.
$$

Taking the gradient of the logarithm w.r.t. $\mathbf{x}$ gives (the term $\log p_t(\mathbf{y})$ does not depend on $\mathbf{x}$ and drops out):

$$
\nabla_{\mathbf{x}} \log p_t(\mathbf{x}\mid \mathbf{y}) = \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) + \nabla_{\mathbf{x}} \log p_t(\mathbf{y}\mid \mathbf{x}).
$$

So the conditional score is just the **unconditional score** plus a "classifier term", indicating how likely the condition $\mathbf{y}$ is given the data point $\mathbf{x}$.

**Classifier-free guidance (CFG)** amplifies this conditional term with a weight $w\ge 0$. We define the **guided score**:

$$
\begin{aligned}
\nabla_{\mathbf{x}} \log p_t(\mathbf{x})_\text{guided}
&= \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) + w \big(\nabla_{\mathbf{x}} \log p_t(\mathbf{x}\mid \mathbf{y}) - \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \big) \\
&= (1-w)\,\nabla_{\mathbf{x}} \log p_t(\mathbf{x}) + w\, \nabla_{\mathbf{x}} \log p_t(\mathbf{x}\mid \mathbf{y}).
\end{aligned}
$$

* $w=0$ → unconditional generation
* $w=1$ → standard conditional
* $w>1$ → stronger push toward $\mathbf{y}$

Finally, to apply CFG to a **reverse-time SDE**, we plug the guided score directly into the drift term:

$$
\mathrm{d}\mathbf{x} = \Big[-\frac12 \beta(t)\mathbf{x} - \beta(t) \big( (1-w)\,\nabla_{\mathbf{x}} \log p_t(\mathbf{x}) + w\, \nabla_{\mathbf{x}} \log p_t(\mathbf{x}\mid \mathbf{y}) \big) \Big] \mathrm{d}t + \sqrt{\beta(t)}\,\mathrm{d}\bar{\mathbf{w}}.
$$

And that’s it! We now have a **conditionally-guided reverse process**, derived in a principled way using Bayes.


How do we train for CFG?

1. During training, **randomly drop the conditioning** $\mathbf{y}$ with some probability (e.g., 10–20%).
2. When $\mathbf{y}$ is dropped, the model learns the **unconditional score** $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$.
3. When $\mathbf{y}$ is provided, the model learns the **conditional score** $\nabla_{\mathbf{x}} \log p_t(\mathbf{x}\mid \mathbf{y})$.

At sampling time, we can then interpolate or amplify the conditional signal using the guided score formula. This **simple trick allows a single model to flexibly do CFG** without needing separate unconditional and conditional models.
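The two ingredients above can be sketched in a few lines. The helper names `guided_score` and `drop_labels`, and the convention of using an extra class index (here 10, one past the MNIST digits 0..9) as the "null" label for dropped conditioning, are assumptions made for illustration.

```python
import numpy as np

def guided_score(score_uncond, score_cond, w):
    """Guided score: (1 - w) * unconditional + w * conditional."""
    return (1.0 - w) * score_uncond + w * score_cond

def drop_labels(labels, p_drop, null_label, rng):
    """Training-time trick: replace each label by null_label with probability p_drop."""
    mask = rng.random(labels.shape) < p_drop
    return np.where(mask, null_label, labels)

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=10_000)            # e.g. digit classes 0..9
train_labels = drop_labels(labels, p_drop=0.1, null_label=10, rng=rng)

# Toy score values standing in for two forward passes of the same network,
# one with the null label and one with the real label.
s_u = np.array([0.0, 1.0])
s_c = np.array([1.0, 3.0])
print(guided_score(s_u, s_c, 2.0))
```

At sampling time each step therefore costs two network evaluations, which are often batched together.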


## Probability Flow ODE: Converting Diffusion SDE to ODE at Sampling Time


Perhaps surprisingly, there exists an ODE whose solution has the same marginal distributions as the reverse diffusion:

$$ d\mathbf{x}_t = \left[f(t)\mathbf{x}_t - \frac{g^2(t)}{2}\nabla_x \log p_t(\mathbf{x}_t)\right]dt $$
(here with negative time increments, $\mathrm{d}t < 0$)

For the VP-SDE:

$$ d\mathbf{x}_t = \left[-\frac{1}{2}\beta(t)\mathbf{x}_t - \frac{1}{2}\beta(t)s_\theta(\mathbf{x}_t,t)\right]dt $$

This ODE, also known as Probability Flow ODE, defines a continuous-time process where the state $x$ evolves backward from noise to data without any stochastic (random) component.
Unlike the SDE, this path is deterministic, meaning that starting from the same initial noisy state will always lead to the same final data sample.

We integrate the ODE with the **Euler method**:

$$x_{t-\Delta t} = x_t - \left[-\frac{1}{2}\beta(t)x_t - \frac{1}{2}\beta(t)s_\theta(x_t,t)\right]\Delta t$$

where:
* $t$ is the current time step.
* $\Delta t = t_i - t_{i+1}$ is the positive step size.
* $x_{t-\Delta t}$ is the state at the next (earlier) time step.
* $s_\theta(x_t,t)$ is the estimated score function
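A minimal sketch of this Euler sampler follows, again on a one-dimensional Gaussian toy problem where the exact score can stand in for $s_\theta$. The linear schedule endpoints, the toy data distribution $N(m, s^2)$, and the function names are assumptions for illustration only.

```python
import numpy as np

BETA_MIN, BETA_MAX = 0.1, 20.0   # assumed linear noise schedule beta(t)
M, S = 2.0, 0.5                  # toy data distribution: x_0 ~ N(M, S^2)

def beta(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)

def alpha(t):
    return np.exp(-0.5 * (BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t**2))

def score(x, t):
    # Exact score of p_t for Gaussian data; a trained s_theta would go here.
    a = alpha(t)
    return -(x - a * M) / (a**2 * S**2 + 1.0 - a**2)

def pf_ode_sample(x1, n_steps):
    ts = np.linspace(1.0, 0.0, n_steps + 1)   # from noise (t=1) to data (t=0)
    x = np.array(x1, dtype=float)
    for t, t_next in zip(ts[:-1], ts[1:]):
        dt = t - t_next                       # positive step size
        drift = -0.5 * beta(t) * x - 0.5 * beta(t) * score(x, t)
        x = x - drift * dt                    # Euler step, no noise added
    return x

x1 = np.random.default_rng(0).standard_normal(20_000)
samples = pf_ode_sample(x1, 1_000)
print(samples.mean(), samples.std())   # should be close to M and S
```

Because no noise is injected, calling `pf_ode_sample` twice on the same `x1` returns identical samples.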




# Flow Matching

Given $\mathbf{X}_t = \alpha_t\mathbf{X}_0 + \sigma_t\mathbf{X}_1$, one can construct a velocity field by taking the partial derivative with respect to time $t$:

$$v_t = \partial_{t}\mathbf{X}_t = \partial_{t}\alpha_t\mathbf{X}_0 + \partial_{t}\sigma_t\mathbf{X}_1$$

Or equivalently in physics notation:
$v_t = \dot{\mathbf{X}}_t = \dot{\alpha_t}\mathbf{X}_0 + \dot{\sigma_t}\mathbf{X}_1$

Whilst there are many choices of $\alpha_t, \sigma_t$, it has become standard to choose $\alpha_t = 1-t, \sigma_t = t$, and hence $\dot{\alpha_t}=-1, \dot{\sigma_t}=1$ leading to $v_t = \mathbf{X}_1-\mathbf{X}_0$

If we had access to this velocity $v_t$, we could simply sample $\mathbf{X}_1$ from the Gaussian prior and traverse the ODE:
 $$\mathrm{d}\mathbf{X}_t = -v_t \mathrm{d}t$$

This idealised $v_t$ depends on both $\mathbf{X}_0$ and $\mathbf{X}_1$, however, so it is not available at sampling time and cannot be used directly for generative modelling.

Remarkably, we can instead use $\mathbb{E}[v_t|x_t]$ (known as the Markovian projection) as a drop-in replacement and recover an ODE with the correct marginal distributions. See e.g. Theorem 2 [here](https://arxiv.org/abs/2210.02747) or Theorem 3.3 [here](https://arxiv.org/abs/2209.03003) for a proof.

This projection may be approximated via a neural network,
$$v_\theta(x_t,t) \approx \mathbb{E}[v_t|x_t],$$
using a regression loss
$$\|v_\theta(x_t,t) - v_t\|^2,$$
or, for the linear schedule,
$$\|v_\theta(x_t,t) - (\mathbf{X}_1-\mathbf{X}_0)\|^2.$$

And then sample along:
$\mathrm{d}\mathbf{X}_t = -v_\theta(x_t,t) \mathrm{d}t$

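Note: the negative sign arises because we take $\mathbf{X}_0 \sim p_\textrm{data}$, consistent with the diffusion literature, so sampling runs from $t=1$ down to $t=0$ and $\mathrm{d}t$ here denotes a positive step in that reversed direction. In some papers the order is reversed ($\mathbf{X}_1 \sim p_\textrm{data}$), in which case the ODE would be $\mathrm{d}\mathbf{X}_t = v_\theta(x_t,t) \mathrm{d}t$.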
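The training loss and the sampler can be sketched as follows, assuming the linear schedule $\alpha_t = 1-t$, $\sigma_t = t$ and a one-dimensional Gaussian toy problem, $\mathbf{X}_0 \sim N(m, s^2)$ independent of $\mathbf{X}_1 \sim N(0, 1)$, for which $\mathbb{E}[v_t|x_t]$ has a closed form. That closed form stands in for a trained $v_\theta$; all names are illustrative assumptions.

```python
import numpy as np

M, S = 2.0, 0.5   # toy data: X_0 ~ N(M, S^2); prior: X_1 ~ N(0, 1)

def fm_loss(v_model, x0, x1, t):
    """Monte Carlo estimate of E || v_model(x_t, t) - (x1 - x0) ||^2."""
    xt = (1.0 - t) * x0 + t * x1
    return np.mean((v_model(xt, t) - (x1 - x0)) ** 2)

def exact_velocity(x, t):
    """E[X_1 - X_0 | X_t = x] for the Gaussian toy problem (closed form)."""
    var_t = (1.0 - t) ** 2 * S**2 + t**2
    gain = (t - (1.0 - t) * S**2) / var_t
    return -M + gain * (x - (1.0 - t) * M)

def sample(x1, n_steps):
    ts = np.linspace(1.0, 0.0, n_steps + 1)    # from prior (t=1) to data (t=0)
    x = np.array(x1, dtype=float)
    for t, t_next in zip(ts[:-1], ts[1:]):
        x = x - exact_velocity(x, t) * (t - t_next)   # dX = -v dt, positive step
    return x

rng = np.random.default_rng(0)
x0 = M + S * rng.standard_normal(50_000)
x1 = rng.standard_normal(50_000)
t = rng.random(50_000)

loss_exact = fm_loss(exact_velocity, x0, x1, t)
loss_zero = fm_loss(lambda x, t: np.zeros_like(x), x0, x1, t)
samples = sample(x1, 500)
print(loss_exact, loss_zero, samples.mean(), samples.std())
```

The loss of the exact projection is not zero: the remaining error is the conditional variance of $v_t$ given $x_t$, which no function of $x_t$ can remove.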


## From ODE to SDE: Flow to Diffusion Model

Detour with Langevin dynamics SDE:

$$\mathrm{d}\mathbf{Y}_t = \frac{\lambda_t^2}{2}\nabla_{y_t}\log p_t({\mathbf{Y}_t})\mathrm{d}t + \lambda_t \mathrm{d}\mathbf{w}_t$$

This forms a Markov chain targeting $p_t$.

Combining Langevin with flow:

$$\mathrm{d}\mathbf{X}_t = \left[v_t + \frac{\lambda_t^2}{2}\nabla_{x_t}\log p_t({\mathbf{X}_t})\right]\mathrm{d}t + \lambda_t \mathrm{d}\mathbf{w}_t$$

This leads to an SDE with the same marginal distributions as the ODE, recovering diffusion model sampling. Here $\lambda_t \geq 0$ is a free parameter; choosing it to match the diffusion coefficient of the diffusion SDE gives a time-reversal interpretation.


## Prediction target: flow vs v-prediction in diffusion models

Prior to the rise in popularity of flow matching it was common to choose $\alpha_t = \cos( \frac{\pi}{2}t), \sigma_t=\sin( \frac{\pi}{2}t)$.

Following the above, one can similarly compute the idealised velocity
$$
\begin{align}
v_t &= \dot{\mathbf{X}}_t = \dot{\alpha_t}\mathbf{X}_0 + \dot{\sigma_t}\mathbf{X}_1 \\
&=  -\sin( \frac{\pi}{2}t)\frac{\pi}{2}\mathbf{X}_0  + \cos( \frac{\pi}{2}t)\frac{\pi}{2}\mathbf{X}_1 \\
&= \frac{\pi}{2}[\alpha_t\mathbf{X}_1 -\sigma_t\mathbf{X}_0]
\end{align}
$$

This has historically been called v-prediction (introduced [here](https://arxiv.org/abs/2202.00512)) in the diffusion model literature. So flow matching with this schedule and v-prediction are equivalent, up to the constant factor $\frac{\pi}{2}$.
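As a quick numerical check of this derivation, the sketch below compares the closed-form velocity for the cosine schedule against a central finite difference of the interpolation, and shows that the clean sample can be read off from $x_t$ and the rescaled velocity via $\mathbf{X}_0 = \alpha_t x_t - \sigma_t \tfrac{2}{\pi} v_t$ (which follows from $\alpha_t^2 + \sigma_t^2 = 1$). The function names and test values are illustrative assumptions.

```python
import numpy as np

def alpha(t):
    return np.cos(0.5 * np.pi * t)

def sigma(t):
    return np.sin(0.5 * np.pi * t)

def interpolate(x0, x1, t):
    return alpha(t) * x0 + sigma(t) * x1

def velocity(x0, x1, t):
    # v_t = (pi / 2) * (alpha_t * X_1 - sigma_t * X_0)
    return 0.5 * np.pi * (alpha(t) * x1 - sigma(t) * x0)

x0, x1, t, h = 1.3, -0.7, 0.4, 1e-5

# Central finite difference of X_t with respect to t
fd = (interpolate(x0, x1, t + h) - interpolate(x0, x1, t - h)) / (2 * h)

# Recover x0 from x_t and the v-prediction target (velocity without the pi/2 factor)
xt = interpolate(x0, x1, t)
v_pred = velocity(x0, x1, t) / (0.5 * np.pi)
x0_rec = alpha(t) * xt - sigma(t) * v_pred
print(fd, velocity(x0, x1, t), x0_rec)
```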

## Flow Matching as a Diffusion Model

Does there exist a forward SDE that gives a perturbation kernel with $\alpha_t = 1-t, \sigma_t = t$?

**The answer is yes!**

Given linear SDE:
$$
\mathrm{d}\mathbf{X}_t = f(t)\mathbf{X}_t\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}_t,
$$

We wish to find $f,g$ such that $\alpha_t = 1-t, \sigma_t=t$.


Inverting the general diffusion perturbation kernel (given above):
$$f(t) = \partial_t \log \alpha_t$$

$$g^2(t) = 2 \alpha(t)\sigma(t) \partial_t\frac{\sigma(t)}{\alpha(t)}$$

**Source**:
Conversion formula between flow and diffusion given in: [https://diffusionflow.github.io/](https://diffusionflow.github.io/).

For the linear schedule $\alpha(t) = 1-t, \sigma(t) = t$, we obtain

$$f(x_t,t) = -\frac{1}{1-t}x_t$$

$$g^2(t) = 2 (1-t)\,t\, \partial_t\frac{t}{1-t} = 2 \frac{(1-t)\,t}{(1-t)^2} = \frac{2t}{1-t}$$

So using traditional diffusion models with the forward SDE below gives the flow interpolation as its perturbation kernel.

$$
\mathrm{d}\mathbf{x}_t = -\frac{1}{1-t}x_t\,\mathrm{d}t + \sqrt{2\frac{t}{(1-t)}}\,\mathrm{d}\mathbf{w}_t,
$$
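This claim can be checked by simulation. The sketch below integrates the forward SDE above with a simple Euler-Maruyama loop from a fixed $x_0$ and compares the empirical mean and standard deviation of $\mathbf{x}_t$ at $t = 0.5$ with the flow interpolation values $(1-t)x_0$ and $t$. The function name, step count, and stopping time are assumptions; the simulation stops before $t=1$, where the drift coefficient diverges.

```python
import numpy as np

def simulate_forward(x0, t_end, n_steps, rng):
    """Euler-Maruyama for dx = -x / (1 - t) dt + sqrt(2 t / (1 - t)) dw, with t_end < 1."""
    ts = np.linspace(0.0, t_end, n_steps + 1)
    x = np.array(x0, dtype=float)
    for t, t_next in zip(ts[:-1], ts[1:]):
        dt = t_next - t
        f = -1.0 / (1.0 - t)          # drift coefficient f(t)
        g2 = 2.0 * t / (1.0 - t)      # squared diffusion coefficient g(t)^2
        x = x + f * x * dt + np.sqrt(g2 * dt) * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(0)
x0 = np.full(20_000, 3.0)
t_end = 0.5
xt = simulate_forward(x0, t_end, n_steps=1_000, rng=rng)

# Perturbation kernel predicted by the flow interpolation: N((1 - t) x0, t^2)
print(xt.mean(), (1.0 - t_end) * 3.0, xt.std(), t_end)
```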









Indeed, the probability flow ODE of the reverse generative process gives the same velocity field as the learnt flow matching vector field!

# Proof Sketches

## Fisher Identity for Score Matching
$$
\begin{align}
\nabla_{x_t} \log p(x_t) &= \frac{\nabla_{x_t} p(x_t)}{p(x_t)} = \frac{\nabla_{x_t} \int p(x_t, x_s)dx_s}{p(x_t)} = \frac{\int \nabla_{x_t} p(x_t, x_s)dx_s}{p(x_t)} \notag \\
&= \frac{\int p(x_t, x_s)\nabla_{x_t} \log p(x_t, x_s)dx_s}{p(x_t)} = \int p(x_s|x_t)\nabla_{x_t} \log p(x_t|x_s)dx_s \notag \\
&= \mathbb{E}_{s|t}[\nabla_{x_t} \log p(x_t|\mathbf{X}_s)]
\end{align}
$$

## Time Reversal
For simplicity we ignore the drift; here $p(x_t|x_{t-\Delta t})$ is Gaussian
with mean $x_{t-\Delta t}$ and variance $\sigma_q^2\Delta t$.

Using Bayes' rule
$$
\begin{align*}
\log p(x_{t-\Delta t} | x_t) &= \log p(x_t|x_{t-\Delta t}) + \log p_{t-\Delta t}(x_{t-\Delta t}) - {\log p_t(x_t)} \\
&= \log p(x_t|x_{t-\Delta t}) + \log p_t(x_{t-\Delta t}) + {O}(\Delta t) \\
&= -\frac{1}{2\sigma_q^2\Delta t} \|x_{t-\Delta t} - x_t\|_2^2 + \log p_t(x_{t-\Delta t}) \\
&= -\frac{1}{2\sigma_q^2\Delta t} \|x_{t-\Delta t} - x_t\|_2^2 + {\log p_t(x_t)} + \langle \nabla_x \log p_t(x_t), (x_{t-\Delta t} - x_t) \rangle + {O}(\Delta t) \\
&= -\frac{1}{2\sigma_q^2\Delta t} \left( \|x_{t-\Delta t} - x_t\|^2 - 2\sigma_q^2\Delta t \langle \nabla_x \log p_t(x_t), (x_{t-\Delta t} - x_t) \rangle \right) \\
&= -\frac{1}{2\sigma_q^2\Delta t} \|x_{t-\Delta t} - x_t - \sigma_q^2\Delta t \nabla_x \log p_t(x_t)\|^2 + C \\
&= -\frac{1}{2\sigma_q^2\Delta t} \|x_{t-\Delta t} - \mu\|^2
\end{align*}
$$
This is identical, up to additive constants, to the log-density of a Normal distribution with mean $\mu = x_t + \sigma_q^2\Delta t\, \nabla_x \log p_t(x_t)$ and variance $\sigma_q^2\Delta t$. Therefore,
$$
p(x_{t-\Delta t} | x_t) \approx \mathbb{N}(x_{t-\Delta t}; \mu, \sigma_q^2\Delta t).
$$

## Continuous version of DDPM

Consider the forward noising process as a discrete Markov chain: $p(x_{0:N}) = p(x_0) \prod_{k=0}^{N-1} p(x_{k+1}|x_k)$, where $p_0 = p_{\text{data}}$ and $x_k|x_{k-1} \sim \mathbb{N}(\sqrt{1-\beta_k}x_{k-1}, \beta_k\mathbf{I})$ for some positive schedule $(\beta_k)_k$. This forward process is a discrete approximation to the Ornstein-Uhlenbeck process, hence $p_N \approx \mathbb{N}(\mathbf{0}, \mathbf{I})$ by construction.



The DDPM forward discretized SDE is applied independently per dimension, so here we just consider the univariate case. Consider the moments of an Ornstein-Uhlenbeck (OU) process with stationary mean $\mu=0$ and standard deviation $\sigma=1$, and piecewise constant $\beta$, $\beta(t') = \beta_t$ for $t' \in (t, t+1)$, i.e. $\int_t^{t+1} \beta(t')dt' = \beta_t$.

Step wise: by a Taylor approximation:
-  $\mathbb{E}[\mathbf{X}_{t+1}|x_t] = e^{-\frac{1}{2}\int_t^{t+1} \beta(t')dt'}x_t = \sqrt{e^{-\beta_t}}x_t \approx \sqrt{1-\beta_t}x_t$
- Variance $\mathbb{V}[\mathbf{X}_{t+1}|x_t] = 1 - e^{-\beta_t} \approx \beta_t$. Therefore, $x_{t+1}|x_t \sim \mathbb{N}(\sqrt{1-\beta_t}x_t, \beta_t)$.


Similarly, the closed-form perturbation kernel may be derived from the OU process.
-  $\mathbb{E}[\mathbf{X}_t|x_0] = e^{-\frac{1}{2}\int_0^t \beta(t')dt'}x_0 = e^{-\frac{1}{2}\sum_{k=1}^t \int_{k-1}^k \beta(t')dt'}x_0 = e^{-\frac{1}{2}\sum_{t'=1}^t \beta_{t'}}x_0 = \sqrt{\prod_{t'=1}^t e^{-\beta_{t'}}}x_0 \approx \sqrt{\prod_{t'=1}^t (1-\beta_{t'})}x_0 = \sqrt{\bar{\alpha}_t}x_0$

- Variance $\mathbb{V}[\mathbf{X}_t|x_0] \approx 1-\bar{\alpha}_t$ where $\bar{\alpha}_t = \prod_{t'=1}^t(1-\beta_{t'})$.
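To see how tight the Taylor approximation is in practice, the sketch below compares the discrete product $\bar{\alpha}_t = \prod_{t'}(1-\beta_{t'})$ with its continuous OU counterpart $e^{-\sum_{t'} \beta_{t'}}$. The linear schedule from $10^{-4}$ to $0.02$ over $N = 1000$ steps is an assumed DDPM-style choice used only for illustration.

```python
import numpy as np

N = 1000
betas = np.linspace(1e-4, 0.02, N)            # assumed DDPM-style linear schedule

alpha_bar = np.cumprod(1.0 - betas)           # discrete chain: prod (1 - beta_k)
alpha_bar_cont = np.exp(-np.cumsum(betas))    # continuous OU: exp(-sum beta_k)

# Largest gap between the two mean coefficients sqrt(alpha_bar) over all steps
max_gap = np.max(np.abs(np.sqrt(alpha_bar) - np.sqrt(alpha_bar_cont)))

# At the final step almost no signal is left, so p_N is close to N(0, 1)
print(max_gap, np.sqrt(alpha_bar[-1]), 1.0 - alpha_bar[-1])
```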
