# Diffusion Models
![diff](../resources/images/diff.png)

We start with our original data, a conspicuously llama-shaped cloud of samples $x_0$ drawn from a true underlying distribution, $q$:
$$x_0 \sim q_\text{data}(x)$$

The image above shows how the distribution of data at each step depends on the one before it: the llama-shaped cloud at $t+1$ is a slightly distorted version of the llama-shaped cloud at time $t$. In other words, the state at $t+1$ depends only on the state at time $t$, so that $q(x_{t+1}|x_{0:t})=q(x_{t+1}|x_t)$. Under this assumption, the process is **Markovian**. Each perturbation step is governed by a **Markov transition kernel**, $K(x_{t+1}|x_t)$, which specifies a conditional distribution over possible next states given the current state. The marginal distribution at the next step is then obtained by marginalising over all possible current states:
$$q(x_{t+1})=\int K(x_{t+1}|x_t)\cdot q(x_t)\text{ }dx_t$$
When this stochastic transition is applied across the data distribution, the uncertainty it introduces leads to a gradual smoothing of the density function, destroying the fine structure in the data and causing the distribution to become increasingly diffuse. We therefore define the **forward process** as 
$$x_{t+1}\sim K(x_{t+1}|x_t).$$
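As a minimal sketch of this marginal update, assume a finite state space, so that the integral becomes a matrix-vector product. The lazy random walk on a ring, the number of states, and the initial distribution below are illustrative choices rather than anything fixed by the text above.

```python
import numpy as np

def ring_kernel(n_states, stay=0.5):
    """Column-stochastic matrix with K[i, j] = P(x_{t+1} = i | x_t = j).

    Hypothetical kernel for illustration: stay put with probability `stay`,
    otherwise hop to the left or right neighbour on a ring.
    """
    K = np.zeros((n_states, n_states))
    move = (1.0 - stay) / 2.0
    for j in range(n_states):
        K[j, j] = stay
        K[(j - 1) % n_states, j] += move
        K[(j + 1) % n_states, j] += move
    return K

def entropy(q):
    """Shannon entropy in nats, ignoring zero-probability states."""
    nz = q[q > 0]
    return float(-(nz * np.log(nz)).sum())

n_states = 16
K = ring_kernel(n_states)

# A sharply structured starting distribution (stand-in for q(x_0)).
q = np.zeros(n_states)
q[[3, 4, 11]] = [0.5, 0.3, 0.2]

# Discrete analogue of q(x_{t+1}) = \int K(x_{t+1}|x_t) q(x_t) dx_t.
history = [q]
for _ in range(200):
    q = K @ q
    history.append(q)

for t in (0, 10, 200):
    print(f"t={t:3d}  max prob={history[t].max():.4f}  entropy={entropy(history[t]):.4f}")
```

Under these assumptions the peaks flatten and the entropy rises as the kernel is applied repeatedly, which is the discrete counterpart of the smoothing described above.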

So far, we have described the forward dynamics of a diffusion process: a Markovian evolution that progressively and irreversibly destroys fine-scale structure in the data. Nevertheless, one might observe that if the corresponding **reverse-time dynamics** could be modelled, then the diffusion process could be repurposed as a generative model: starting from a simple, highly diffused distribution, samples could be progressively transformed back into realistic data by simulating this reverse process. This requires three ingredients:
1. A simple, known distribution that approximates the terminal state of the forward process;
2. A reverse-time transition kernel that describes how samples evolve backwards through time;
3. A learnable parameterisation of the reverse-time dynamics.

In practice, we choose a simple reference distribution, $p_T(x)$, that is easy to sample from. Most commonly this is an isotropic Gaussian, $p_T(x)=\mathcal{N}(0, I)$. The forward process is defined, through the choice of kernel and its hyperparameters, so that after enough steps, $x_T$ is approximately distributed according to this reference distribution:
$$q(x_T)\approx p_T(x_T)$$
Put simply, we progressively corrupt the data until it resembles Gaussian noise.
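The following sketch illustrates this convergence on samples. It assumes a specific Gaussian kernel, $K(x_{t+1}|x_t)=\mathcal{N}(\sqrt{1-\beta}\,x_t, \beta I)$, of the variance-preserving kind used in DDPM-style models; the constant $\beta$, the number of steps, and the two-cluster toy data are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_step(x, beta, rng):
    """One draw from the assumed kernel N(sqrt(1 - beta) * x, beta * I)."""
    noise = rng.standard_normal(x.shape)
    return np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise

# Toy "data": two tight 2D clusters, clearly not Gaussian noise.
n_samples = 20_000
centres = np.array([[4.0, 4.0], [-4.0, -2.0]])
labels = rng.integers(0, 2, size=n_samples)
x0 = centres[labels] + 0.1 * rng.standard_normal((n_samples, 2))

# Repeatedly apply the forward kernel (hypothetical schedule: constant beta).
beta, n_steps = 0.02, 500
x = x0.copy()
for _ in range(n_steps):
    x = forward_step(x, beta, rng)

print("x_0 mean:", x0.mean(axis=0), " std:", x0.std(axis=0))
print("x_T mean:", x.mean(axis=0), " std:", x.std(axis=0))
```

With these settings the per-coordinate mean and standard deviation of $x_T$ should land close to 0 and 1, consistent with $q(x_T)$ approaching the isotropic Gaussian reference. A smaller $\beta$ or fewer steps would leave visible traces of the original clusters.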

Next, we need a reverse-time transition kernel, $T(x_{t-1}|x_t)$, from which we sample $x_{t-1}\sim T(x_{t-1}|x_t)$ and which evolves our reference distribution $p_T$ backwards through time:
$$p_{t-1}(x_{t-1})=\int T(x_{t-1}|x_t)\cdot p_t(x_t)\text{ }dx_t$$
This is easier said than done. To see why, we can derive it explicitly from the forward process. 

If the reverse process is to exactly invert the forward process in distribution, then its transitions need to match the conditional distribution induced by the forward dynamics. Therefore, the reverse-time transition kernel must be equal to the reverse conditional of the forward process:
$$T(x_{t-1}|x_t)=q(x_{t-1}|x_t)$$

To obtain that, we can use the definition of a conditional distribution:
$$q(x_{t-1}|x_t)=\frac{q(x_{t-1}, x_t)}{q(x_t)}$$

To obtain the joint distribution $q(x_{t-1}, x_t)$, we first re-visit our forward process. The joint distribution across an entire forward trajectory $x_{0:T}$ factorises as
$$q(x_{0:T})=q(x_0)\prod_{t=1}^T K(x_t|x_{t-1})$$
Marginalising out every variable except $(x_{t-1}, x_t)$ gives:
$$q(x_{t-1}, x_t) = q(x_{t-1}) \cdot K(x_t|x_{t-1})$$
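For completeness, the marginalisation can be written out step by step. The shorthand $dx_{a:b}$ for integrating over $x_a,\dots,x_b$ is notation introduced here for brevity. The later variables integrate to one because each kernel is a normalised conditional density, while the earlier variables integrate to the marginal $q(x_{t-1})$ by definition:

$$\begin{aligned}
q(x_{t-1}, x_t) &= \int q(x_{0:T})\, dx_{0:t-2}\, dx_{t+1:T} \\
&= \left[\int q(x_0)\prod_{s=1}^{t-1} K(x_s|x_{s-1})\, dx_{0:t-2}\right] \cdot K(x_t|x_{t-1}) \cdot \left[\int \prod_{s=t+1}^{T} K(x_s|x_{s-1})\, dx_{t+1:T}\right] \\
&= q(x_{t-1}) \cdot K(x_t|x_{t-1}) \cdot 1
\end{aligned}$$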
Substituting back into our earlier expression, we obtain an expression for the reverse-time transition kernel:
$$q(x_{t-1}|x_t)=\frac{q(x_{t-1}) \cdot K(x_t|x_{t-1})}{q(x_t)}$$
While the forward kernel $K$ is known (we define it), the marginals $q(x_{t-1})$ and $q(x_t)$ are only defined through repeated application of the forward process to real data. Since the data distribution $q(x_0)$ is unknown and accessible only through samples, these intermediate marginals cannot be evaluated in closed form, making the true reverse-time kernel intractable.
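The dependence on the data distribution can be made concrete in a toy setting where the marginals are computable. The sketch below assumes a three-state space, a hand-picked forward kernel, and two made-up data distributions; it applies Bayes' rule exactly as in the expression above and shows that the same forward kernel yields different reverse kernels for different data.

```python
import numpy as np

def reverse_kernel(K, q_prev):
    """Exact reverse kernel via Bayes' rule on a finite state space.

    K[j, i]  = P(x_t = j | x_{t-1} = i)   (column-stochastic forward kernel)
    R[i, j]  = q(x_{t-1} = i | x_t = j) = q_prev[i] * K[j, i] / q_next[j]
    """
    q_next = K @ q_prev                       # marginal q(x_t)
    R = (K.T * q_prev[:, None]) / q_next[None, :]
    return R, q_next

# Hypothetical forward kernel: mostly stay, occasionally jump elsewhere.
K = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])

# Two different (made-up) data distributions sharing the same forward kernel.
q_a = np.array([0.7, 0.2, 0.1])
q_b = np.array([0.1, 0.2, 0.7])

R_a, q_a_next = reverse_kernel(K, q_a)
R_b, q_b_next = reverse_kernel(K, q_b)

# Each reverse kernel undoes the forward step for its own marginal...
print(np.allclose(R_a @ q_a_next, q_a), np.allclose(R_b @ q_b_next, q_b))
# ...but the two kernels differ, because they depend on the marginals.
print(np.round(R_a, 3))
print(np.round(R_b, 3))
```

Here the marginals can be tabulated because the state space is tiny and the data distribution is given. For real data neither holds, which is why the reverse-time dynamics need a learnable parameterisation.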

## Denoising Diffusion Probabilistic Models

### References
- Ho, J. et al. (2020). Denoising Diffusion Probabilistic Models ([arxiv](https://arxiv.org/abs/2006.11239))
- Murphy, K. P. (2023). Probabilistic Machine Learning: Advanced Topics, Chapter 25: Diffusion Models ([MIT Press](https://probml.github.io/pml-book/book2.html))
- Sohl-Dickstein, J. et al. (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics ([arxiv](https://arxiv.org/abs/1503.03585))
