# Diffusion model
A diffusion model is a type of generative model that uses a deep neural network to turn noisy input into something meaningful by denoising it. The diffusion process has two parts: the forward and the reverse diffusion process. The forward process destroys the training data by repeatedly adding Gaussian noise; the reverse process lets the model learn to predict the added noise and remove it to recover the original data. After training, we can pass randomly sampled noise to the model and apply the denoising process to generate new data

## Forward diffusion process
The forward diffusion process, $q$, will iteratively add Gaussian noise to the signal at each time step, $t$, until the last time step, $T$, where the image becomes completely noisy. In essence, this process is a linear combination of the original signal and noise

This process is represented by
$$x_t = \sqrt{\alpha_t}\, x_{t-1} + \sigma_t \epsilon, \qquad \sigma_t = \sqrt{1 - \alpha_t}$$

$x_t$: the signal at the current time step, $t$

$x_{t-1}$: the signal at the previous time step, $t-1$

$\alpha_t$: a coefficient between 0 and 1. It sets the portion of the signal kept from the previous time step at timestep $t$ (the signal itself is scaled by $\sqrt{\alpha_t}$)

$\sigma_t$: a coefficient that sets the portion of noise added to the signal at timestep $t$. Its value depends on how the noise schedule is defined; choosing $\sigma_t = \sqrt{1 - \alpha_t}$ keeps the variance of $x_t$ from growing as steps accumulate

$\epsilon$: a noise vector sampled from Gaussian distribution (typically $\mathcal{N}(0, I)$)

Note: $x_0$ represents the original input image and $x_T$ represents the completely noisy signal

This represents one step in the forward diffusion process, which combines a portion of the signal from the previous timestep and a portion of newly sampled noise

<img src="https://www.assemblyai.com/blog/content/images/2022/05/image.png">

In general, we only want to add a very small portion of noise at each timestep. This helps preserve the original signal and stabilizes training for the reverse diffusion process. If we added all the noise at once, the model could not learn to predict what the original image looks like
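To make the "small steps" point concrete, here is a minimal NumPy sketch. It assumes the DDPM parametrization, where the signal is scaled by $\sqrt{1 - \beta_t}$ and the noise by $\sqrt{\beta_t}$; the function name `forward_step`, the step sizes, and the random 1-D stand-in for an image are all illustrative choices, not part of the original notes.

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    """One forward step: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps."""
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * eps

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=10_000)  # stand-in for a flattened image

# Many small steps: the correlation with x0 fades gradually.
x = x0.copy()
corr_small = []
for _ in range(10):
    x = forward_step(x, beta_t=0.01, rng=rng)
    corr_small.append(np.corrcoef(x0, x)[0, 1])

# One huge step: almost nothing of x0 survives.
x_big = forward_step(x0, beta_t=0.999, rng=rng)
corr_big = np.corrcoef(x0, x_big)[0, 1]

print(corr_small[0], corr_small[-1], corr_big)
```

After ten small steps the noisy signal is still strongly correlated with $x_0$, whereas a single large step leaves a correlation close to zero, which is the situation the model cannot learn from.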

Rewrite of the forward process

$$x_t = \sqrt{\bar{\alpha_t}} x_0 + \sqrt{(1 - \bar{\alpha_t})} \epsilon$$

$\bar{\alpha_t} = \prod_{s=1}^{t} \alpha_s$: a constant weighting applied to the original image, where $\alpha_s = 1 - \beta_s$

$\beta_t$: the noise variance at timestep $t$, set by the noise schedule. It typically increases monotonically with $t$ (for example, linearly), meaning $\bar{\alpha_t}$ gradually becomes smaller and $1 - \bar{\alpha_t}$ gradually becomes larger

$x_0$: original input image

$\epsilon$: a noise vector sampled from Gaussian distribution (typically $\mathcal{N}(0, I)$)

* At $t=0$, $\bar{\alpha_t} = 1$, so $x_t = x_0$ (no noise added)

* At $t=T$, $\bar{\alpha_t} \approx 0$, so $x_t \approx \epsilon$ (almost pure noise)

This equation represents the cumulative result of gradual noise addition, so we can compute $x_t$ at any time step $t$ directly in one step, which is efficient. However, this is not the same as adding all the noise at once, because we can still retrieve the intermediate signal $x_t$ at any time step
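The closed form can be sketched in a few lines of NumPy. The linear schedule values (`1e-4` to `0.02` over `T = 1000` steps) and the helper name `q_sample` are assumptions chosen for illustration; timesteps are 1-indexed to match the text.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)      # alpha_bars[t-1] = product of alpha_s for s <= t

def q_sample(x0, t, eps):
    """Jump straight to step t (1-indexed) using the closed form."""
    a_bar = alpha_bars[t - 1]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 8, 8))  # toy batch of "images"
eps = rng.standard_normal(x0.shape)

x_early = q_sample(x0, 10, eps)  # still close to x0
x_late = q_sample(x0, T, eps)    # essentially the noise itself
```

Because any $x_t$ is available in one call, training can pick a random timestep per example without simulating the whole chain.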

## Reverse diffusion process
The reverse diffusion process, $p$, repeatedly removes noise to recover the original image; this is called the sampling or denoising process.

<img src="https://www.assemblyai.com/blog/content/images/2022/05/image-1.png">

The process starts with $p(x_T) = \mathcal{N}(x_T; 0, I)$, a completely noisy sample at the final timestep $T$, and we want to undo each noising step, which means knowing the true reverse conditional $q(x_{t-1}|x_t)$. However, this is not tractable because, by Bayes' rule,

$$q(x_{t-1} | x_t) = \frac{q(x_t | x_{t-1}) q(x_{t-1})}{q(x_t)}$$

This requires

1. $q(x_t | x_{t-1})$: this is known because it’s part of the forward process. It’s a simple Gaussian distribution defined by the noise schedule.

2. $q(x_{t-1})$: this is the marginal distribution of $x_{t-1}$, which is intractable to compute directly since it involves integrating over all possible prior states:
     $q(x_{t-1}) = \int q(x_{t-1} | x_0) p(x_0) \, dx_0$

3. $q(x_t)$: Similarly, the marginal $q(x_t)$ requires integrating over $q(x_t | x_0) p(x_0)$, which is computationally infeasible due to the high-dimensional nature of the data.

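Therefore, we approximate $q(x_{t-1}|x_t)$ with a neural network, $p_{\theta}(x_{t-1}|x_t)$, parametrized by the model parameters $\theta$. Chaining these steps gives the joint distribution of the whole reverse trajectory

$$p_{\theta}(x_{0:T}) = p(x_T)\prod_{t=1}^{T} p_{\theta}(x_{t-1}|x_t)$$

$p_{\theta}(x_{0:T})$: the probability of the full reverse trajectory $x_T, \dots, x_0$. Integrating out $x_1, \dots, x_T$ gives $p_{\theta}(x_0)$, the probability of generating $x_0$ using the reverse process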

$p(x_T)$: the final distribution at $T$ (pure Gaussian noise)

$\Pi_{t=1}^{T} p_{\theta}(x_{t-1}|x_t)$: the product of all conditional probabilities at each time step, representing the full reverse diffusion process

$p_{\theta}(x_{t-1}|x_t)$: the reverse conditional probability at timestep $t$ (given $x_t$, predicts $x_{t-1}$). This represents one step of the denoising process. It can also be written as
$$p_{\theta}(x_{t-1}|x_t) = \mathcal{N}(x_{t-1};\mu_\theta(x_t,t),\Sigma_\theta(x_t,t))$$

This means that at timestep $t$, given $x_t$, the neural network learns to predict the mean, $\mu_\theta(x_t,t)$, and covariance, $\Sigma_\theta(x_t,t)$, of the Gaussian distribution over the less noisy signal $x_{t-1}$

This can be reparametrized as
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha_t}}} \epsilon_\theta(x_t,t)) + \sigma_t z$$

$\epsilon_\theta(x_t,t)$: predicted noise at timestep $t$, which is the output of the neural network. It represents the noise added during the forward process, which the model has learned to estimate

$\frac{\beta_t}{\sqrt{1 - \bar{\alpha_t}}}$: the portion of the noise component to be subtracted from $x_t$ to denoise it

$\bar{\alpha_t}$: the cumulative product of $\alpha_s = 1 - \beta_s$ up to timestep $t$

$\beta_t$: the noise variance for timestep $t$

Note: $\bar{\alpha_t}$ and $\beta_t$ represent the same things as in the forward process. Essentially, $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha_t}}} \epsilon_\theta(x_t,t))$ is a rearrangement of the forward equations: it subtracts the scaled predicted noise from $x_t$ to obtain the mean of $x_{t-1}$
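As a worked check of that rearrangement (a sketch that assumes the standard DDPM posterior mean of $q(x_{t-1}|x_t, x_0)$, which these notes do not state explicitly), start from the posterior mean and substitute the $x_0$ implied by the closed-form forward equation:

$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t, \qquad x_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon}{\sqrt{\bar{\alpha}_t}}$$

$$\tilde{\mu}_t = \frac{\beta_t + \alpha_t (1 - \bar{\alpha}_{t-1})}{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_t)}\, x_t - \frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1 - \bar{\alpha}_t}}\, \epsilon = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon\right)$$

The last step uses $\sqrt{\bar{\alpha}_{t-1}} / \sqrt{\bar{\alpha}_t} = 1/\sqrt{\alpha_t}$ and $\beta_t + \alpha_t - \bar{\alpha}_t = 1 - \bar{\alpha}_t$. Replacing the unknown $\epsilon$ with the network prediction $\epsilon_\theta(x_t,t)$ gives the update used for sampling.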

$\sigma_t = \sqrt{\beta_t}$: the standard deviation of the noise to be added during the reverse process at timestep $t$

$z$: random Gaussian noise sample

The $\sigma_t z$ term introduces randomness to the reverse process, which prevents it from collapsing to a deterministic process (outputting the average of the dataset). Keeping this term is important for sample diversity
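One reverse step can be sketched as follows. This is a minimal illustration, not a full sampler: the predicted noise is passed in as `eps_pred` rather than computed by a network, $\sigma_t = \sqrt{\beta_t}$ as defined above, and skipping the noise on the very last step (`t == 1`) is a common convention that is assumed here. The function name `p_sample_step` and the schedule values are illustrative.

```python
import numpy as np

def p_sample_step(x_t, t, eps_pred, betas, alpha_bars, rng):
    """One DDPM reverse step x_t -> x_{t-1}; t is 1-indexed."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    a_bar_t = alpha_bars[t - 1]
    mean = (x_t - beta_t / np.sqrt(1.0 - a_bar_t) * eps_pred) / np.sqrt(alpha_t)
    if t == 1:
        return mean  # assumed convention: no fresh noise on the final step
    z = rng.standard_normal(x_t.shape)
    return mean + np.sqrt(beta_t) * z  # sigma_t = sqrt(beta_t)

betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)

# Sanity check with an "oracle" that knows the true noise at t = 1.
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))
eps = rng.standard_normal(x0.shape)
x1 = np.sqrt(alpha_bars[0]) * x0 + np.sqrt(1.0 - alpha_bars[0]) * eps
x0_rec = p_sample_step(x1, 1, eps, betas, alpha_bars, rng)

# At t > 1 the step is stochastic: two calls give different results.
step_a = p_sample_step(x1, 500, eps, betas, alpha_bars, rng)
step_b = p_sample_step(x1, 500, eps, betas, alpha_bars, rng)
```

With a perfect noise prediction the step at $t = 1$ recovers $x_0$ exactly, which is a quick way to check the coefficients.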

## Noise scheduler
A noise scheduler controls the noise addition and removal process across multiple time steps to facilitate effective learning and high-quality generation.

###  DDPM scheduler
The DDPM (Denoising Diffusion Probabilistic Models) scheduler is a noise scheduler that controls $\beta_t$, which in turn determines $\alpha_t$ ($\alpha_t = 1 - \beta_t$) and therefore $\bar{\alpha_t}$ ($\bar{\alpha_t} = \prod_{s=1}^{t} \alpha_s$), to control the addition and removal of noise

The noise scheduler controls $\beta_t$ by setting its
1. Initial value: the value of $\beta_t$ when $t=1$
2. Increments: how $\beta_t$ increases as $t$ increases (linearly, quadratically, or according to a cosine function)
3. Final value: the value of $\beta_t$ when $t=T$

### Types of noise schedule
* Linear Schedule: $\beta_t$ increases linearly from a small value to a larger one over $T$ steps

* Cosine Schedule: uses a cosine function to vary the noise level, providing smoother transitions

* Quadratic and Other Non-Linear Schedules: variations that adjust the rate of noise addition in non-linear ways to optimize performance
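The three schedule families can be sketched in NumPy as below. The default values (`1e-4`, `0.02`, `s=0.008`, clipping at `0.999`) are commonly used settings assumed here for illustration, and the function names are hypothetical. The cosine variant defines $\bar{\alpha}_t$ directly and derives $\beta_t$ from consecutive ratios.

```python
import numpy as np

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """beta_t grows linearly from beta_start to beta_end."""
    return np.linspace(beta_start, beta_end, T)

def quadratic_betas(T, beta_start=1e-4, beta_end=0.02):
    """sqrt(beta_t) grows linearly, so beta_t grows quadratically."""
    return np.linspace(np.sqrt(beta_start), np.sqrt(beta_end), T) ** 2

def cosine_betas(T, s=0.008, max_beta=0.999):
    """Define alpha_bar with a squared cosine, then recover beta_t from ratios."""
    steps = np.arange(T + 1) / T
    f = np.cos((steps + s) / (1.0 + s) * np.pi / 2.0) ** 2
    alpha_bars = f / f[0]
    betas = 1.0 - alpha_bars[1:] / alpha_bars[:-1]
    return np.clip(betas, 0.0, max_beta)

T = 1000
schedules = {
    "linear": linear_betas(T),
    "quadratic": quadratic_betas(T),
    "cosine": cosine_betas(T),
}
# Cumulative signal level alpha_bar_t for each schedule.
alpha_bars = {name: np.cumprod(1.0 - b) for name, b in schedules.items()}
```

Comparing `alpha_bars["cosine"]` with `alpha_bars["linear"]` at the midpoint shows that, under these settings, the cosine schedule keeps noticeably more signal in the middle of the process.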

<img src="https://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs10462-023-10504-5/MediaObjects/10462_2023_10504_Fig6_HTML.png">

## Model architecture
The model is the neural network used to predict the noise. It takes in the noisy data $x_t$ and the time step $t$, and outputs the predicted noise $\epsilon_\theta$. The input and output have exactly the same size

<img src="https://learnopencv.com/wp-content/uploads/2023/02/denoising-diffusion-probabilistic-models_UNet_model_architecture.png" width=500>

* Time embedding: during both training and sampling, the current timestep $t$ is encoded into a high-dimensional representation and added to the network. This tells the model how much noise has been added to the data, which helps it make appropriate denoising predictions. $x_t$ at one timestep differs from $x_t$ at other timesteps because of the cumulative noise level $1 - \bar{\alpha_t}$, so the time embedding gives the model context on how far along the reverse process it is and how much noise remains to be removed. Without this context, the model would treat inputs from all timesteps the same, which makes it much harder to reconstruct $x_0$. Common time embeddings include sinusoidal, learned, or hybrid embeddings

* Context embedding: encodes an optional conditioning signal, such as a class label or text, that is used to control the generation
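A sinusoidal time embedding, one of the options mentioned in the time embedding bullet, can be sketched as follows. The function name, the `max_period` value of 10,000, and the sin-then-cos layout are assumptions borrowed from common Transformer-style positional encodings; `dim` is assumed to be even.

```python
import numpy as np

def sinusoidal_time_embedding(t, dim, max_period=10_000.0):
    """Map timesteps of shape (B,) to embeddings of shape (B, dim); dim must be even."""
    t = np.asarray(t, dtype=np.float64)
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to roughly 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=1)

emb = sinusoidal_time_embedding([0, 1, 500, 999], dim=64)
print(emb.shape)
```

Each timestep gets a distinct, bounded vector, which a network would typically pass through a small MLP before adding it to its feature maps.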

## Training
Training aims to let the model learn to predict how much noise to remove at different timesteps

Steps:
1. Sample a batch of clean training images
2. Sample a different timestep for each image in the batch (covering different timesteps stabilizes the training)
3. Sample random noise for each clean training image
4. Add noise to each clean image based on its corresponding timestep (each image may have a different timestep) to obtain the noisy images (forward diffusion process)
5. Input the noisy images and their timesteps to the neural network, which outputs the predicted total noise added to the original data (reverse diffusion process)
6. Compute the loss between the predicted noise and the sampled noise using the MSE loss function
7. Backpropagate the loss and update the model parameters

### MSE formula
$$L = \mathbb{E}_{x_0,t,\epsilon_t}\left[\|\epsilon_t - \epsilon_\theta(x_t,t)\|^2\right]$$

$\epsilon_t$: the sampled noise

$\epsilon_\theta$: the predicted noise

Essentially, we want to minimize the loss function to minimize the distance between the sampled and predicted noise
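Steps 1 to 6 of the training procedure can be sketched with NumPy as below. This is only an illustration of how the loss is assembled: `toy_model` is a hypothetical stand-in for $\epsilon_\theta$ (a real model would be a U-Net in an autodiff framework, which would also handle step 7), the schedule is an assumed linear one, and the loss is averaged over elements, which differs from the squared norm only by a constant factor.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)  # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)

def toy_model(x_t, t):
    """Hypothetical stand-in for eps_theta(x_t, t); always predicts zero noise."""
    return np.zeros_like(x_t)

def diffusion_loss(model, x0, rng):
    batch = x0.shape[0]
    t = rng.integers(1, T + 1, size=batch)             # step 2: one timestep per image
    eps = rng.standard_normal(x0.shape)                # step 3: one noise sample per image
    a_bar = alpha_bars[t - 1].reshape(batch, 1, 1, 1)  # broadcast over (C, H, W)
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps  # step 4: forward diffusion
    eps_pred = model(x_t, t)                           # step 5: predict the noise
    return np.mean((eps - eps_pred) ** 2)              # step 6: MSE loss

x0 = rng.uniform(-1.0, 1.0, size=(16, 3, 8, 8))        # step 1: a toy batch of "images"
loss = diffusion_loss(toy_model, x0, rng)
print(loss)
```

A model that always predicts zero scores a loss close to 1, the variance of the sampled noise, so a trained model should fall well below that baseline.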


## Controllable generation
To make the process controllable, an additional input signal is introduced to guide the model during generation. This signal is the context, and it can take the form of a class label or text embedding that gives the model additional information for the denoising process

Therefore, the model takes in noise, the timestep, and the context (label or text) when sampling. It converts the context into context embeddings (using a few linear layers or a more complex architecture such as attention) and injects them into the signal at different layers with cross-attention or FiLM, producing a signal that carries the context

The training process is the same as the normal training process, except that the context is also fed to the model. The model still learns to predict the added noise, but with both the timestep and the context as input. Similarly, the sampling process remains the same with the additional context input. If the model is well trained, it will generate samples that match the given context
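As an illustration of the FiLM-style injection mentioned above, the sketch below modulates a feature map with a class-label context. Everything here is hypothetical: the embedding table `class_table`, the projection `W_film`, and the sizes are random placeholders for learned parameters, and the `(1 + gamma)` form is one common variant that leaves features unchanged when the modulation is zero.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, emb_dim, channels = 10, 32, 16

# Hypothetical parameters; in a real model these are learned.
class_table = rng.standard_normal((num_classes, emb_dim)) * 0.1
W_film = rng.standard_normal((emb_dim, 2 * channels)) * 0.1

def film(h, labels):
    """Modulate feature maps h of shape (B, C, H, W) with a class-label context."""
    ctx = class_table[labels]                        # (B, emb_dim) context embedding
    gamma, beta = np.split(ctx @ W_film, 2, axis=1)  # per-channel scale and shift, (B, C) each
    return (1.0 + gamma)[:, :, None, None] * h + beta[:, :, None, None]

h = rng.standard_normal((4, channels, 8, 8))
out_a = film(h, np.array([0, 1, 2, 3]))
out_b = film(h, np.array([3, 2, 1, 0]))
```

The same features produce different outputs under different labels, which is how the context steers the denoising network.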

<img src="https://arxiv.org/html/2403.01108v2/extracted/5629454/fig/pipeline/IP-adapter.png" width=600>

## DDIM
DDPM sampling is slow because it follows a Markov chain with many timesteps: each signal $x_{t-1}$ depends on the previous one, $x_t$, so no step of the reverse diffusion process can be skipped. Alternatively, DDIM (Denoising Diffusion Implicit Model) modifies the process to be non-Markovian and, optionally, deterministic, so it requires fewer timesteps for the reverse process

Both methods share the forward diffusion process. Compared to DDPM, DDIM enables much faster sampling (roughly 10-100 times), but the generations may have slightly lower diversity and quality as a trade-off. A common practice is to train the model with DDPM (the two share the forward process, training does not require sampling, and DDPM training is easy to implement), then load the trained model and sample using DDIM for faster generation

<img src="https://miro.medium.com/v2/resize:fit:1400/0*gSM9LfAuZA6f914Y.png">

### Stochastic DDIM
The key to the faster sampling process in DDIM is its non-Markovian nature, which allows it to skip timesteps without requiring all the intermediate states. The reverse process for stochastic DDIM is described by

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \cdot \frac{x_t - \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon^{(t)}_\theta(x_t)}{\sqrt{\bar{\alpha}_t}} + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2} \cdot \epsilon^{(t)}_\theta(x_t) + \sigma_t \epsilon_t$$

$x_{t-1}$: the signal at timestep $t-1$

$\bar{\alpha}_{t-1}, \bar{\alpha}_{t}$: the cumulative noise schedule at timesteps $t-1$ and $t$. They determine how much of the original signal, $x_0$, is preserved at steps $t-1$ and $t$

$\epsilon^{(t)}_\theta(x_t)$: the noise predicted by the neural network at timestep $t$

$\frac{x_t - \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon^{(t)}_\theta(x_t)}{\sqrt{\bar{\alpha}_t}}$: this term is a rearrangement of the forward process and predicts $x_0$ by subtracting the noise predicted by the neural network from $x_t$. This is the prediction made at timestep $t$ based on $x_t$; with smaller $t$, $x_t$ is less noisy and the prediction of $x_0$ is better

$\sigma_t^2$: the variance of the noise added at timestep $t$

$\sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2} \cdot \epsilon^{(t)}_\theta(x_t)$: the correction term; it adds the proper amount of predicted noise back to the predicted $x_0$ term to obtain $x_{t-1}$. The intensity of the noise is smaller at $t-1$ than at $t$ since $\bar{\alpha}_{t - 1}$ is greater than $\bar{\alpha}_{t}$, meaning we obtain a clearer image closer to $x_0$

$\sigma_t \epsilon_t$: same as in DDPM, this adds random Gaussian noise scaled by $\sigma_t$ to the signal to introduce diversity ($\epsilon_t$ is randomly sampled Gaussian noise). $\sigma_t$ is the standard deviation of the noise injected at timestep $t$. It is given by

$$\sigma_t = \eta \cdot \sqrt{\frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}} \cdot \sqrt{1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}}$$

$\eta$: a hyperparameter between 0 and 1 that determines the amount of stochastic noise added during sampling. If $\eta = 0$, no noise is added, which corresponds to the deterministic DDIM process; if $\eta = 1$, the full amount of noise is added, which is the same as the DDPM process
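To see why $\eta = 1$ lines up with DDPM, read the $\alpha$ symbols in the $\sigma_t$ formula as the cumulative products $\bar{\alpha}$ and use $\bar{\alpha}_t / \bar{\alpha}_{t-1} = \alpha_t = 1 - \beta_t$. The following is a short worked check, not a full proof:

$$\sigma_t^2\Big|_{\eta = 1} = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\left(1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}\right) = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\,(1 - \alpha_t) = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\,\beta_t$$

This is the variance of the DDPM posterior $q(x_{t-1}|x_t, x_0)$, one of the two variance choices commonly used in DDPM sampling (the other being $\beta_t$ itself, as in $\sigma_t = \sqrt{\beta_t}$ above). At $\eta = 0$ the expression is zero, so no noise is injected.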

Therefore, DDIM is non-Markovian: it predicts $x_0$ directly from $x_t$ without the need for intermediate timesteps. We can also choose how much noise to add back to the predicted $x_0$ to obtain $x_{t-1}$, meaning we can add less noise and skip timesteps
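The stochastic DDIM update can be sketched as a single NumPy function. The function name `ddim_step` and the linear schedule are assumptions; the predicted noise is passed in directly, and the "previous" step may be any earlier timestep, which is what allows skipping.

```python
import numpy as np

def ddim_step(x_t, eps_pred, a_bar_t, a_bar_prev, eta, rng):
    """One DDIM update from cumulative level a_bar_t to a_bar_prev."""
    sigma = eta * np.sqrt((1 - a_bar_prev) / (1 - a_bar_t)) * np.sqrt(1 - a_bar_t / a_bar_prev)
    x0_pred = (x_t - np.sqrt(1 - a_bar_t) * eps_pred) / np.sqrt(a_bar_t)  # predicted x_0
    direction = np.sqrt(1 - a_bar_prev - sigma**2) * eps_pred            # correction term
    noise = sigma * rng.standard_normal(x_t.shape)                        # zero when eta = 0
    return np.sqrt(a_bar_prev) * x0_pred + direction + noise

alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed linear schedule
rng = np.random.default_rng(0)

# Oracle check: with the true noise and eta = 0, jumping from t = 800 to t = 600
# lands exactly on the forward-process sample at t = 600.
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))
eps = rng.standard_normal(x0.shape)
a_t, a_prev = alpha_bars[799], alpha_bars[599]
x_800 = np.sqrt(a_t) * x0 + np.sqrt(1 - a_t) * eps
x_600 = ddim_step(x_800, eps, a_t, a_prev, eta=0.0, rng=rng)
expected_600 = np.sqrt(a_prev) * x0 + np.sqrt(1 - a_prev) * eps

# With eta = 1 the update is stochastic.
s1 = ddim_step(x_800, eps, a_t, a_prev, eta=1.0, rng=rng)
s2 = ddim_step(x_800, eps, a_t, a_prev, eta=1.0, rng=rng)
```

The oracle check skips 200 timesteps in one update, which illustrates why DDIM does not need every intermediate state.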

### Deterministic DDIM
Although stochastic DDIM is non-Markovian, it still introduces stochasticity by adding controlled randomness, $\sigma_t \epsilon_t$. Deterministic DDIM is the special case of stochastic DDIM where the standard deviation $\sigma_t$ is zero, with the formula

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \cdot \frac{x_t - \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon^{(t)}_\theta(x_t)}{\sqrt{\bar{\alpha}_t}} + \sqrt{1 - \bar{\alpha}_{t-1}} \cdot \epsilon^{(t)}_\theta(x_t)$$

This process is essentially the forward diffusion process with the predicted $x_0$ term, where $x_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t} \cdot \epsilon^{(t)}_\theta(x_t)}{\sqrt{\bar{\alpha}_t}}$  

Notice that there is no randomness in the reverse process, so it is completely deterministic, meaning given the same $x_T$, the output $x_0$ will always be the same. Thus, deterministic DDIM will have lower diversity compared to stochastic DDIM and DDPM
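The determinism is easy to demonstrate with a toy sampling loop. In this sketch `toy_eps_model` is a hypothetical placeholder for a trained network (any deterministic function works for the demonstration), the schedule is an assumed linear one, and only 50 of the 1000 training timesteps are visited.

```python
import numpy as np

T = 1000
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))  # assumed linear schedule

def toy_eps_model(x_t, t):
    """Hypothetical stand-in for a trained eps_theta; not a real denoiser."""
    return np.tanh(x_t)

def ddim_sample(x_T, num_steps=50):
    # Visit only num_steps of the T training timesteps, from t = T down to t = 1.
    ts = np.linspace(T, 1, num_steps).round().astype(int)
    x = x_T
    for i, t in enumerate(ts):
        a_bar_t = alpha_bars[t - 1]
        # The step after the last visited timestep is the clean sample (alpha_bar = 1).
        a_bar_prev = alpha_bars[ts[i + 1] - 1] if i + 1 < len(ts) else 1.0
        eps = toy_eps_model(x, t)
        x0_pred = (x - np.sqrt(1 - a_bar_t) * eps) / np.sqrt(a_bar_t)
        x = np.sqrt(a_bar_prev) * x0_pred + np.sqrt(1 - a_bar_prev) * eps
    return x

rng = np.random.default_rng(0)
x_T = rng.standard_normal((2, 8, 8))
sample_a = ddim_sample(x_T)
sample_b = ddim_sample(x_T)
```

Running the loop twice from the same $x_T$ yields identical arrays, since no random numbers are drawn inside the loop.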

### Summary

| Feature              | **DDPM**                               | **Deterministic DDIM**                | **Stochastic DDIM**                   |
|----------------------|----------------------------------------|---------------------------------------|---------------------------------------|
| **Reverse Process**  | Markovian, stochastic                 | Non-Markovian, deterministic          | Non-Markovian, partially stochastic (matches DDPM at $\eta = 1$) |
| **Stochasticity**    | Random noise added at each step       | No randomness; fully deterministic    | Controlled randomness ( $\sigma_t \epsilon_t$ ) |
| **Sampling Speed**   | Slow, many timesteps ( $T = 1000$ ) | Fast, fewer timesteps ( $T = 50$ )  | Fast, fewer timesteps ( $T = 50$ )  |
| **Sample Diversity** | High (stochastic sampling)            | Low (deterministic outputs)           | Moderate (some stochasticity added)   |
| **Mathematical Basis** | Probabilistic reverse diffusion       | Deterministic reparameterization      | Hybrid: deterministic + stochastic    |
| **Use Case**         | Creative tasks needing high diversity | Speed-critical or controlled tasks    | Balanced diversity and speed          |


## Classifier guidance
Classifier guidance is a technique that enables conditional generation in a diffusion model by using a pre-trained classifier. This approach requires training an extra classifier, but the base diffusion model can be trained unconditionally, without explicit labels on the data

### Train classifier
The goal of the classifier is to predict which class, $y$, the noisy input sample at timestep $t$, $x_t$, belongs to. Therefore, we first apply the forward diffusion process to the labeled data to create noisy samples. Then, we input the noisy samples, $x_t$, and the timestep, $t$, to the classifier (for example, a CNN) to predict $p(y | x_t, t)$.

Steps:
1. Sample a batch of $(x, y)$ pairs, where $x$ are the data samples and $y$ are the corresponding labels
2. Sample a different timestep $t$ for each data-label pair in the batch (the number of sampled timesteps equals the batch size)
3. Sample random noise for each data-label pair
4. Apply the forward diffusion process by adding noise to the data based on its corresponding timestep to obtain $x_t$
5. Input the noisy data $x_t$ and the timestep $t$ to the classifier to predict the label $\hat y$
6. Use the cross-entropy loss function to calculate the loss
7. Backpropagate to update the parameters
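The loss computation for the noisy classifier can be sketched as follows. This is a toy illustration under several assumptions: the classifier is a hypothetical linear softmax model on the noisy vector plus a scaled timestep (a real one would be a CNN with a time embedding), the data are random vectors, and the schedule is an assumed linear one. The parameter update itself is left to an autodiff framework.

```python
import numpy as np

rng = np.random.default_rng(0)
T, num_classes, dim = 1000, 3, 16
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))  # assumed linear schedule

# Hypothetical linear classifier weights acting on [x_t, t / T].
W = rng.standard_normal((dim + 1, num_classes)) * 0.01

def classifier_loss(W, x, y, rng):
    batch = x.shape[0]
    t = rng.integers(1, T + 1, size=batch)                 # one timestep per pair
    eps = rng.standard_normal(x.shape)                     # one noise sample per pair
    a_bar = alpha_bars[t - 1][:, None]
    x_t = np.sqrt(a_bar) * x + np.sqrt(1.0 - a_bar) * eps  # forward diffusion
    feats = np.concatenate([x_t, (t / T)[:, None]], axis=1)
    logits = feats @ W
    logits -= logits.max(axis=1, keepdims=True)            # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(batch), y].mean()          # cross-entropy on true labels

x = rng.standard_normal((32, dim))         # toy data
y = rng.integers(0, num_classes, size=32)  # toy labels
loss = classifier_loss(W, x, y, rng)
print(loss)
```

With near-zero weights the loss sits close to $\log 3$, the value for a uniform guess over three classes, which gives a baseline to improve on during training.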

### Sampling with classifier guidance
After training the classifier and the unconditional diffusion model, we can guide the model using the classifier to perform conditional generation

First, we sample random Gaussian noise and let the model predict the noise added to it (like a normal diffusion model) to obtain

$$\hat{\epsilon}_{uncond}(x_t, t)$$

This noise prediction does not contain any conditional information. Next, we input the sample $x_t$ and timestep $t$ into the classifier and compute the gradient of the log-likelihood of the desired class $y$ with respect to the noisy sample $x_t$, which is

$$\nabla_{x_t} \log {p(y | x_t, t)}$$

Because maximizing the log-likelihood means making $x_t$ more likely to belong to class $y$, its gradient points in the direction where the log-likelihood increases most rapidly. Essentially, this gradient encodes the changes needed to make $x_t$ closer to $y$
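The gradient can be written out by hand for a very simple classifier. The sketch below assumes a hypothetical linear softmax classifier with random weights that ignores the timestep; a real guidance setup would obtain the same gradient from a trained network via automatic differentiation.

```python
import numpy as np

def log_prob_and_grad(W, b, x_t, y):
    """log p(y | x_t) and its gradient w.r.t. x_t for a linear softmax classifier."""
    logits = x_t @ W + b
    logits = logits - logits.max()  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    grad = W[:, y] - W @ probs      # d/dx [logit_y - logsumexp(logits)]
    return np.log(probs[y]), grad

rng = np.random.default_rng(0)
dim, num_classes = 16, 3
W = rng.standard_normal((dim, num_classes))  # hypothetical classifier weights
b = np.zeros(num_classes)
x_t = rng.standard_normal(dim)

logp_before, grad = log_prob_and_grad(W, b, x_t, y=1)
# A small step along the gradient makes x_t more likely to be classified as y.
logp_after, _ = log_prob_and_grad(W, b, x_t + 0.01 * grad, y=1)
print(logp_before, logp_after)
```

Nudging $x_t$ along this gradient raises $\log p(y | x_t)$, which is the sense in which the gradient encodes the change needed to move $x_t$ toward class $y$.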


<img src="https://ffighting.net/wp-content/uploads/2023/08/image-2.png" width=500>