# Intro

## Generative models

A (certainly not complete) list:

- Latent Variable models (incl. VAEs)
- Autoregressive models (incl. GPT-style Language Models)
- GANs
- Flow-based models (incl. Normalizing Flows)
- Energy-Based Models (incl. Score-based models)
- **Diffusion models** (a kind of mix of all the above)
- Combinations

## Image generation

:::: {.columns}

::: {.column width="50%" align=center}
![](images/DDPM_celebA.png){width=60% fig-align="center" fig-alt="DDPM CelebA"}
\ifwindows 
\vspace{-14pt} 
\fi
\begin{center}
\tiny Source: Ho et al. [2020]
\end{center}
\normalsize
:::

::: {.column width="50%" align=center}
\small \textit{"A photo of a Corgi dog riding a bike in Times Square. It is wearing sunglasses and a beach hat."}

\tiny &nbsp;

![](images/corgi_imagen.jpg){width=100% fig-alt="Imagen corgi"}
\ifwindows 
\vspace{-12pt} 
\fi
\begin{center}
\tiny Source: Saharia et al. [2022]
\end{center}
\normalsize
:::

::::

## Video generation

:::: {.columns align=center}
::: {.column align=center width=100%}
[![](images/sora_1.png){width=100% fig-alt="OpenAI Sora" fig-align="center"}](https://player.vimeo.com/video/913331489?h=d6b3d4c2bd)
\ifwindows 
\vspace{-12pt} 
\fi
\begin{center}
\tiny Source: Brooks et al. [2024]
\end{center}
\normalsize
:::
::::

# Denoising Diffusion Probabilistic Models

## Denoising Diffusion Probabilistic Models

\large Outline

\normalsize

- The forward process
- The Nice™ property
- The reverse process
- Loss function
- Training algorithm
- The model
- Sampling algorithm

## Denoising Diffusion Probabilistic Models (DDPMs)

:::: {.columns align=center}
::: {.column align=center width=100%}
[![](images/ddpm_paper.png){height=80% fig-alt="DDPM paper" fig-align="center"}](https://arxiv.org/abs/2006.11239)
:::
::::

## Denoising Diffusion Probabilistic Models (DDPMs)

DDPMs work through a sequence of steps $t = 0, 1, \ldots, T$

![](images/ddpm_paper_process.png){height=7em}
```{=latex}
\ifwindows 
\vspace{-10pt} 
\fi
\begin{center}
\tiny Source: Ho et al. [2020]
\end{center}
\normalsize
\ifwindows 
\vspace{-10pt} 
\fi
```

- $\mathbf{x}_0$ is the original image
- $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$ is the **forward** diffusion process
- $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_{t})$ will be the **reverse** diffusion process (learned by our model with weights $\theta$)

. . .

During forward diffusion we add Gaussian (Normal) noise to the image at every step $t$, producing noisy images $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T$

As $t$ increases, the image becomes noisier and noisier

## The forward process {.t}

```{=latex}
\ifwindows 
\vspace{-8pt}
\fi
```

![](images/ddpm_paper_forward_process.png){height=4.5em fig-align="center"}

```{=latex}
\ifwindows 
\vspace{-16pt}
\fi
```

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) \coloneqq \sqrt{1 - \beta_t} \cdot \mathbf{x}_{t-1} + \mathcal{N}(0, \beta_t \mathbf{I})$$

. . .

```{=latex}
\ifwindows 
\vspace{-8pt}
\fi
```

- Take an image at some point $t-1$ as $\mathbf{x}_{t-1}$
- Generate Gaussian noise from an isotropic [\color{SkyBlue}{multivariate Normal}](https://julioasotodv.github.io/interactive-demos/mvn/multivariate_normal.html) with the same shape as $\mathbf{x}_t$, mean $0$ and **variance** $\beta_t$
- Scale $\mathbf{x}_{t-1}$ values by $\sqrt{1 - \beta_t}$ (so data scale does not grow as we add noise)
- Add the noise to the scaled image

. . .

Written as a proper distribution: $q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) \coloneqq \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t} \mathbf{x}_{t-1}, \beta_t \mathbf{I})$

. . .

The sequence $\beta_1, \ldots, \beta_T$ is called the *variance schedule*; it controls how much noise is added at each step $t$ \footnote{In the paper it grows linearly from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$ for $T=1000$}
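The steps above can be sketched in a few lines of NumPy. This is only an illustration: the function names `linear_beta_schedule` and `forward_step` are made up for this example, and a random array stands in for a real image.

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Variance schedule growing linearly from beta_1 to beta_T."""
    return np.linspace(beta_start, beta_end, T)

def forward_step(x_prev, beta_t, rng):
    """Sample x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)        # isotropic N(0, I) noise
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

rng = np.random.default_rng(0)
betas = linear_beta_schedule()
x0 = rng.uniform(-1.0, 1.0, size=(3, 32, 32))        # stand-in for an image scaled to [-1, 1]
x1 = forward_step(x0, betas[0], rng)                 # betas[0] holds beta_1
```

Note that the noise is multiplied by $\sqrt{\beta_t}$ (a standard deviation), since $\beta_t$ is a variance.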

## The forward process {.t}

```{=latex}
\ifwindows 
\vspace{-8pt}
\fi
```

![](images/ddpm_paper_forward_process.png){height=4.5em fig-align="center"}

The full forward process is therefore:

$$q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) \coloneqq \prod^T_{t=1} q(\mathbf{x}_{t} \mid \mathbf{x}_{t-1})$$

For a large $T$, the final image is basically only noise (all original image info is essentially lost), so it becomes roughly $\mathbf{x}_T \sim \mathcal{N}(0, \mathbf{I})$

Demo [\color{SkyBlue}{here}](https://julioasotodv.github.io/ie-c4-466671-diffusion-models/forward_diffusion_demo/Forward_diffusion_demo.html)!
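As a rough numerical check of the claim about $\mathbf{x}_T$, the sketch below runs the whole chain on a random stand-in image, assuming the linear schedule from the paper ($T = 1000$). Variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)                   # linear schedule from the paper
x = rng.uniform(-1.0, 1.0, size=(3, 32, 32))         # stand-in for x_0

for beta_t in betas:                                 # t = 1, ..., T
    x = np.sqrt(1.0 - beta_t) * x + np.sqrt(beta_t) * rng.standard_normal(x.shape)

x_T = x
signal_left = np.sqrt(np.prod(1.0 - betas))          # factor multiplying x_0 inside x_T
```

`signal_left` comes out well below 0.01, so almost nothing of $\mathbf{x}_0$ survives in $\mathbf{x}_T$.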

## The Nice™ property {.t}

```{=latex}
\ifwindows 
\vspace{-8pt}
\fi
```

![](images/ddpm_paper_nice_property.png){height=5.9em fig-align="center"}

```{=latex}
\ifwindows 
\vspace{-4pt}
\fi
```

Trick to get any $\mathbf{x}_t$ from $\mathbf{x}_0$ without having to compute the intermediate steps. Let's define $\alpha_t \coloneqq 1 - \beta_t$ and $\bar\alpha_t \coloneqq \prod^t_{s=1} \alpha_s$. We can use the [\color{SkyBlue}{reparametrization trick for the Normal distribution}](https://sassafras13.github.io/ReparamTrick/#the-math-behind-the-curtain) to get: \unfootnote{In the paper this is described as \textit{"A notable property"}. I believe the first to call it a nice property was \href{https://lilianweng.github.io/posts/2021-07-11-diffusion-models/}{\color{SkyBlue}{Weng [2021]}}. We will call it the Nice™ property}

:::: {.columns}

::: {.column width="85%" align=center}
```{=latex}
\ifwindows
\vspace{0pt}
\fi
```
$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar\alpha_t}\mathbf{x}_0, (1 - \bar\alpha_t)\mathbf{I})$$ 

```{=latex}
\onslide<+-> % like pause, but it does not add duplicate slides when tables and footnotes and pauses co-appear (see https://github.com/jgm/pandoc/issues/4390)
```

- Easier, faster computation
- Any image state $\mathbf{x}_t$ comes from a (Normal) probability distribution, drastically simplifying derivations

:::

::: {.column width="15%" align=center}
###
\small Details in [\color{SkyBlue}{Appendix A}](https://julioasotodv.github.io/ie-c4-466671-diffusion-models/Appendices%20for%20lectures%20on%20diffusion%20models.html)!
:::

::::

Demo [\color{SkyBlue}{here}](https://julioasotodv.github.io/ie-c4-466671-diffusion-models/the_nice_property_demo/The_Nice_property_demo.html)!

```{=latex}
\label<2>{np}
```
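A minimal sketch of the Nice™ property in NumPy, assuming the linear schedule from the paper. The helper name `q_sample` is a common convention in diffusion code, not something defined in these slides.

```python
import numpy as np

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t directly from q(x_t | x_0); t is 1-indexed as in the slides."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bar[t - 1]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return x_t, eps

betas = np.linspace(1e-4, 0.02, 1000)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)                       # alpha_bar[t-1] = prod_{s<=t} alpha_s

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 32, 32))        # stand-in for an image
x_500, eps = q_sample(x0, 500, alpha_bar, rng)       # one shot, no intermediate steps
```

Returning `eps` together with `x_t` is convenient later, because the training target is exactly that noise.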

## The reverse process {.t}

```{=latex}
\ifwindows 
\vspace{-8pt}
\fi
```

![](images/ddpm_paper_reverse_process.png){height=4.5em fig-align=center}

```{=latex}
\ifwindows 
\vspace{-2pt}
\fi
```

We will train a model $p_\theta$ to learn to perform the reverse process

Starting from $p(\mathbf{x}_T) = \mathcal{N}(0, \mathbf{I})$, it will try to recreate the image!

. . .

```{=latex}
\ifwindows 
\vspace{-2pt}
\fi
```

```{=latex}
\renewcommand{\eqnhighlightshade}{100}
\renewcommand{\eqnhighlightheight}{\vphantom{\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)}\mathstrut}

\begin{equation*}
p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) \coloneqq \mathcal{N}(\mathbf{x}_{t-1}; \eqnmark[Green]{mean}{\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)}, \eqnmark[Thistle]{variance}{\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)})
\end{equation*}
\annotate[yshift=-0.5em]{below,left}{mean}{Will be a neural network prediction}
\annotate[yshift=-2em]{below,left}{variance}{Will be set to a value \begin{math}\sigma_t^2 \mathbf{I}\end{math} based on \begin{math}\beta_t\end{math}}
\renewcommand{\eqnhighlightshade}{17}
\label<2>{rp}
```

```{=latex}
\ifwindows 
\vspace{8pt}
\fi
```

. . .

And

```{=latex}
\ifwindows 
\vspace{-16pt}
\fi
```

$$p_{\theta}(\mathbf{x}_{0:T}) \coloneqq p(\mathbf{x}_T) \prod_{t=1}^T p_{\theta}(\mathbf{x}_{t-1} \mid \mathbf{x}_{t})$$

## Summary

![](images/ddpm_process_and_equations.png){width=100% fig-align="center"}

. . .

The *forward process posterior* is the ground-truth reverse diffusion process that the model will learn to approximate!

## Loss function {.t}

Just like in VAEs, the loss function is based on the Evidence Lower Bound (ELBO):

$$\text{ELBO} = \mathbb{E}_{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)} \left[ \log \frac{p_\theta(\mathbf{x}_{0 : T})}{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)} \right]$$

. . .

Negating it (to get a loss we can minimize) and rearranging gives:

```{=latex}
\ifwindows 
\vspace{-4pt}
\fi
```

\footnotesize $$\mathbb{E}_q \left[ \underbrace {\mathcal{D}_{\text{KL}}(q(\mathbf{x}_T \mid \mathbf{x}_0) \mid \mid p(\mathbf{x}_T))}_{L_T} + \sum_{t=2}^T{\underbrace{\mathcal{D}_{\text{KL}}(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \mid \mid p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t))}_{L_{t-1}}} \underbrace{ - \log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)}_{L_0} \right]$$

:::: {.columns}

::: {.column width="83%" align=center}

\normalsize

. . .

```{=latex}
\begin{itemize}
\tightlist
\item
  \(L_T\) → prior matching term. Has no learnable parameters, so we
  ignore it
\item
  \(L_{t-1}\) → denoising term
\item
  \(L_0\) → reconstruction term. Only learning how to go from
  \(\mathbf{x}_1\) to \(\mathbf{x}_0\), so authors ended up ignoring it
  (simpler and better results)
\end{itemize}
```

:::

::: {.column width="17%" align=center}

```{=latex}
\phantom{\includegraphics[width=\linewidth]{images/confused_nick_young_gif/frame_0.png}}

\ifwindows 
\vspace{-4pt}
\fi
```

```{=latex}
\begin{block}{}
\small Details in \href{https://julioasotodv.github.io/ie-c4-466671-diffusion-models/Appendices\%20for\%20lectures\%20on\%20diffusion\%20models.html\#b.-diffusion-loss-function-elbo-derivation}{\color{SkyBlue}{Appendix B}}!
\end{block}
\normalsize
```
:::

::::

## Loss function

Loss therefore focuses on $L_{t-1}$:

$$\mathcal{D}_{\text{KL}}(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \mid \mid p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t))$$
Where:

- $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ is the *forward process posterior* (i.e. what would be the *perfect*, ground-truth reverse process) conditioned on $\mathbf{x}_0$
- $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ will be our learned reverse process as in slide 13



:::: {.columns}

::: {.column width="83%" align=center}

. . .

The forward process posterior is tractable and can be computed as:

$$q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1};{\color{BlueMean}\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_{t},\mathbf{x}_{0})},{\color{red}\tilde{\beta}_t \mathbf{I}}) \label<2>{fpp}$$

Where
\small $${\color{BlueMean}\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_{t},\mathbf{x}_{0})} \coloneqq {\color{LightBlueMean} \frac{1}{\sqrt{\alpha_t}} \Big(\mathbf{x}_t -  \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\epsilon} \Big)} \quad \text{and} \quad {\color{red}\tilde{\beta}_t} \coloneqq {\color{OrangeVar}\frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \cdot \beta_t}$$ \normalsize

:::

::: {.column width="17%" align=center}

. . .

```{=latex}
\normalsize
\ifwindows 
\vspace{36pt}
\fi
\phantom{\includegraphics[width=\linewidth]{images/confused_math_lady_gif/frame_0.png}}
\ifwindows 
\vspace{-18pt}
\fi
\begin{block}{}
\small Details in \href{https://julioasotodv.github.io/ie-c4-466671-diffusion-models/Appendices\%20for\%20lectures\%20on\%20diffusion\%20models.html\#c.-the-forward-process-posterior}{\color{SkyBlue}{Appendix C}}!
\end{block}
\normalsize
```

:::

::::
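The posterior mean above is written in terms of the noise $\boldsymbol{\epsilon}$. The paper also gives an equivalent form in terms of $\mathbf{x}_0$ and $\mathbf{x}_t$. The sketch below (linear schedule assumed, variable names illustrative) computes both, along with $\tilde{\beta}_t$, so they can be compared numerically.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

t = 500                                              # any step with 2 <= t <= T
a_t, b_t = alphas[t - 1], betas[t - 1]
ab_t, ab_prev = alpha_bar[t - 1], alpha_bar[t - 2]

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 32, 32))        # stand-in for an image
eps = rng.standard_normal(x0.shape)
x_t = np.sqrt(ab_t) * x0 + np.sqrt(1.0 - ab_t) * eps # Nice property

# Posterior mean written with x_0 and x_t ...
mean_x0 = (np.sqrt(ab_prev) * b_t / (1.0 - ab_t)) * x0 \
        + (np.sqrt(a_t) * (1.0 - ab_prev) / (1.0 - ab_t)) * x_t
# ... and written with the noise, as on the slide
mean_eps = (x_t - b_t / np.sqrt(1.0 - ab_t) * eps) / np.sqrt(a_t)

beta_tilde = (1.0 - ab_prev) / (1.0 - ab_t) * b_t    # posterior variance
```

Both means agree up to floating-point error, and $\tilde{\beta}_t$ is always slightly smaller than $\beta_t$.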

## Loss function {.t}

Loss is therefore the KL divergence between two Normals: the forward process posterior $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$ and the reverse process that our model will learn $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$

. . .


Since both are Normal distributions, this KL divergence is:

```{=latex}
\ifwindows 
\vspace{-7pt}
\fi
\renewcommand{\eqnhighlightshade}{100}
\renewcommand{\eqnhighlightheight}{}
\begin{equation*}
\mathbb{E}_{q} \left[ \frac{1}{2 \sigma_t^2} \left \lVert \eqnmark[BlueMean]{postmean}{\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_{t},\mathbf{x}_{0})} - \eqnmark[Green]{mean}{ \boldsymbol{\mu}_\theta(\mathbf{x}_t, t)} \right \rVert^2_2 \right]
\end{equation*}
\annotate[yshift=-0.5em]{below,left}{postmean}{Forward process posterior mean \hyperlink{fpp}{({\color{SkyBlue}slide 17})}}
\annotate[yshift=-2em]{below,left}{mean}{From model's prediction \hyperlink{rp}{({\color{SkyBlue}slide 14})}}
\renewcommand{\eqnhighlightshade}{17}
\ifwindows 
\vspace{22pt}
\fi
```

. . .


However, the authors instead chose to **predict the noise** added during the forward process. Reformulating:

:::: {.columns}

:::{.column width="83%" align=center}

```{=latex}
\ifwindows 
\vspace{-24pt}
\fi
\renewcommand{\eqnhighlightshade}{100}
\renewcommand{\eqnhighlightheight}{\vphantom{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}\mathstrut}
\begin{equation*}
L_\text{simple} \coloneqq \mathbb{E}_{q} \left[ \left \lVert \eqnmark[LightBlueMean]{eps}{\boldsymbol{\epsilon}} - \eqnmark[GreenYellow]{epstheta}{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)} \right \rVert^2_2 \right]
\end{equation*}
\annotate[yshift=-0.5em]{below,left}{eps}{Added noise in forward pass}
\annotate[yshift=-2em]{below,left}{epstheta}{Model predicting the noise using \begin{math} \mathbf{x}_t \end{math} and \begin{math} t \end{math} as features}
\renewcommand{\eqnhighlightshade}{17}
```

:::

:::{.column width="17%" align=center}

. . .

```{=latex}
\ifwindows 
\vspace{-10pt} 
\fi 
\phantom{\includegraphics[width=\linewidth]{images/spongebob_magic_gif/frame_8.png}}
\ifwindows 
\vspace{-18pt}
\fi
\begin{block}{}
\small Details in \href{https://julioasotodv.github.io/ie-c4-466671-diffusion-models/Appendices\%20for\%20lectures\%20on\%20diffusion\%20models.html\#d.-objective-the-training-procedure}{\color{SkyBlue}{Appendix D}}!
\end{block}
\normalsize
```

:::

::::

## Training algorithm {.t}

![](images/ddpm_training_algo.png){height=50% fig-align="center"}
```{=latex}
\ifwindows 
\vspace{-18pt} 
\fi
\begin{center}
\tiny Source: Ho et al. [2020]
\end{center}
\normalsize 
```

. . .

Where $\sqrt{\bar\alpha_t}\mathbf{x}_0 + \sqrt{1 - \bar\alpha_t}\boldsymbol\epsilon$ is just $\mathbf{x}_t$ computed through \hyperlink{np}{{\color{SkyBlue}the Nice™ property}}!
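One way to picture a single training iteration is the NumPy sketch below. It only evaluates the loss: `eps_model` is a hypothetical placeholder for the U-Net (it predicts zeros), and the gradient step is left out because it needs an autodiff framework.

```python
import numpy as np

def eps_model(x_t, t):
    """Hypothetical placeholder for the U-Net eps_theta(x_t, t); it just predicts zeros."""
    return np.zeros_like(x_t)

def simple_loss(x0_batch, alpha_bar, rng):
    """One Monte Carlo estimate of L_simple for a batch of clean images."""
    batch = x0_batch.shape[0]
    T = len(alpha_bar)
    t = rng.integers(1, T + 1, size=batch)            # t ~ Uniform({1, ..., T})
    eps = rng.standard_normal(x0_batch.shape)         # eps ~ N(0, I)
    a_bar = alpha_bar[t - 1].reshape(batch, 1, 1, 1)  # broadcast over C, H, W
    x_t = np.sqrt(a_bar) * x0_batch + np.sqrt(1.0 - a_bar) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)    # mean instead of sum: same minimizer

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
x0_batch = rng.uniform(-1.0, 1.0, size=(8, 3, 32, 32))
loss = simple_loss(x0_batch, alpha_bar, rng)
```

With the zero predictor the loss is close to 1 (the variance of the noise); a trained model should do better.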

## The model

Proposed model is a U-Net architecture ([\color{SkyBlue}{Ronneberger et al. [2015]}](https://arxiv.org/abs/1505.04597)) that includes self-attention blocks

\ifwindows 
\vspace{-6pt}
\fi

![](images/ddpm_unet.drawio.png){height=55% fig-align="center"}

. . .

\ifwindows 
\vspace{-12pt}
\fi

They also include GroupNorm ([\color{SkyBlue}{Wu and He [2018]}](https://arxiv.org/abs/1803.08494v3)) in ResNet and self-attention blocks

$t$ is added to every ResNet block through a sinusoidal positional encoding ([\color{SkyBlue}{Vaswani et al. [2017]}](https://arxiv.org/abs/1706.03762))
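For intuition, here is one common variant of a sinusoidal timestep encoding in NumPy. The function name, the sine-then-cosine layout and the base of 10000 are assumptions made for this sketch; actual implementations differ in small details.

```python
import numpy as np

def timestep_embedding(t, dim):
    """Sinusoidal encoding of integer timesteps t (shape [batch]) into [batch, dim]; dim must be even."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)   # geometric frequency ladder
    args = np.asarray(t, dtype=np.float64)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=1)

emb = timestep_embedding(np.array([1, 500, 1000]), dim=128)
```

Each timestep becomes a fixed vector that the network can then project and add inside its ResNet blocks.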

## Sampling algorithm {.t}

Once the model is trained, we can generate new images by:

![](images/ddpm_sampling_algo.png){height=50% fig-align="center"}
```{=latex}
\ifwindows
\vspace{-18pt}
\fi
\begin{center}
\tiny Source: Ho et al. [2020]
\end{center}
\normalsize
\label{sampling}
```

. . .

Step 4 just applies the reparametrization trick to the \hyperlink{rp}{{\color{SkyBlue}learned reverse process}} $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; {\color{Green} \boldsymbol{\mu}_\theta(\mathbf{x}_t, t)}, {\color{Thistle} \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)})$
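The sampling loop can be sketched as follows. `eps_model` is again a hypothetical placeholder (so the output is not a meaningful image), and $\sigma_t^2 = \beta_t$ is used, which is one of the choices discussed in the paper.

```python
import numpy as np

def eps_model(x_t, t):
    """Hypothetical placeholder for a trained eps_theta(x_t, t)."""
    return np.zeros_like(x_t)

def ddpm_sample(shape, betas, rng):
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    T = len(betas)
    x = rng.standard_normal(shape)                    # x_T ~ N(0, I)
    for t in range(T, 0, -1):                         # t = T, ..., 1
        z = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        b_t, a_t, ab_t = betas[t - 1], alphas[t - 1], alpha_bar[t - 1]
        mean = (x - b_t / np.sqrt(1.0 - ab_t) * eps_model(x, t)) / np.sqrt(a_t)
        sigma_t = np.sqrt(b_t)                        # assumed choice: sigma_t^2 = beta_t
        x = mean + sigma_t * z                        # reparametrized draw from p_theta(x_{t-1} | x_t)
    return x

rng = np.random.default_rng(0)
sample = ddpm_sample((3, 8, 8), np.linspace(1e-4, 0.02, 1000), rng)
```

No noise is added at the last step ($t = 1$), matching the algorithm in the paper.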

## Sampling algorithm {.t}

Sampling is an iterative process: we progressively remove predicted noise

![](images/ddpm_sampling_algo_highlight.png){height=50% fig-align="center"}

:::: {.columns}

:::{.column width="83%" align=center}

\ifwindows
\vspace{-8pt}
\fi

. . .

You may wonder:

> *If at any single step we are predicting the full added noise ${\color{LightBlueMean}\boldsymbol{\epsilon}}$, why don't we remove it completely in a single step?*

:::

:::{.column width="17%" align=center}

\ifwindows
\vspace{0pt}
\fi

. . .

Answer: 
```{=latex}
\ifwindows
\vspace{-4pt}
\fi
\begin{block}{}
\small Details in \href{https://julioasotodv.github.io/ie-c4-466671-diffusion-models/Appendices\%20for\%20lectures\%20on\%20diffusion\%20models.html\#e.-the-sampling-procedure}{\color{SkyBlue}{Appendix E}}!
\end{block}
\normalsize
```
:::

::::

# Advancements and improvements

## Advancements and improvements

\large Outline

\normalsize

- Variance/noise schedulers
- Learning the reverse process variance
- Faster sampling: DDIMs
- Conditional generation
  - Classifier Guidance
  - Classifier-Free Guidance
  - Conditioning on images
  - ControlNet
  - Conditioning on text

## Variance/noise schedulers {.t}

[\color{SkyBlue}{Nichol and Dhariwal [2021]}](https://arxiv.org/abs/2102.09672) propose a different way to set $\beta_t$:

$$\bar{\alpha}_t = \frac{f(t)}{f(0)}, \qquad \text{where } f(t) = \cos\left( \frac{t / T + s}{1+s} \cdot \frac{\pi}{2}  \right)^2$$

And then $\beta_t = \min\left( 1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}} ,\, 0.999\right )$. $s$ is a small offset to prevent $\beta_t$ from being tiny when $t$ is close to 0 (they set it to $s=0.008$)

![](images/cosine_schedule.png){height=11em fig-align="center"}
```{=latex}
\ifwindows
\vspace{-10pt}
\fi
\begin{center}
\tiny Comparison between scheduler in DDPM and Nichol \& Dhariwal's \\ cosine scheduler proposal. Source: Nichol \& Dhariwal [2021]
\end{center}
\normalsize
```
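A minimal NumPy sketch of the cosine schedule defined above (the function name is illustrative):

```python
import numpy as np

def cosine_beta_schedule(T, s=0.008, max_beta=0.999):
    """Betas derived from alpha_bar_t = f(t) / f(0) with the squared-cosine f."""
    t = np.arange(T + 1)                              # t = 0, ..., T
    f = np.cos((t / T + s) / (1.0 + s) * np.pi / 2.0) ** 2
    alpha_bar = f / f[0]
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]      # beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}
    return np.minimum(betas, max_beta)                # clip to avoid beta_t -> 1 near t = T

betas = cosine_beta_schedule(1000)
```

The clipping only matters at the very end of the schedule, where $\bar{\alpha}_t$ approaches 0.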

## Variance/noise schedulers

![](images/cosine_scheduler_dogs.png){height=11em fig-align="center"}
```{=latex}
\ifwindows
\vspace{-18pt}
\fi
\begin{center}
\tiny Source: Nichol \& Dhariwal [2021]
\end{center}
\normalsize
```

More progressive forward diffusion process, especially for large values of $t$

```{=latex}
\, \\
```

Demo [\color{SkyBlue}{here}](https://julioasotodv.github.io/ie-c4-466671-diffusion-models/the_nice_property_demo/The_Nice_property_demo.html)!

## Learning the reverse process variance

Nichol \& Dhariwal [2021] also proposed to learn the \hyperlink{rp}{{\color{SkyBlue}reverse process variance}} ${\color{Thistle} \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)}$ instead of setting it upfront based on $\beta_t$:

$$ {\color{Thistle} \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)} = \exp(v \log \beta_t + (1- v) \log {\color{red}\tilde{\beta}_t})$$

So the learned variance is an interpolation (in the log domain) between $\beta_t$ and ${\color{red}\tilde{\beta}_t}$ controlled by $v$, which is a mixing vector predicted by the model

```{=latex}
\, \\
```

. . .

Loss function is changed accordingly to include this (called now $L_{\text{hybrid}}$ in the paper)
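The interpolation can be checked numerically with a short sketch. Here `v` is a hand-picked stand-in for the model output, and the linear schedule is assumed only to get concrete values of $\beta_t$ and $\tilde{\beta}_t$.

```python
import numpy as np

def interpolated_variance(v, beta_t, beta_tilde_t):
    """Sigma = exp(v * log(beta_t) + (1 - v) * log(beta_tilde_t)), elementwise in v."""
    return np.exp(v * np.log(beta_t) + (1.0 - v) * np.log(beta_tilde_t))

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
t = 500
beta_t = betas[t - 1]
beta_tilde_t = (1.0 - alpha_bar[t - 2]) / (1.0 - alpha_bar[t - 1]) * beta_t

v = np.array([0.0, 0.5, 1.0])                         # stand-in for the model's per-pixel output
var = interpolated_variance(v, beta_t, beta_tilde_t)
```

`v = 0` recovers $\tilde{\beta}_t$, `v = 1` recovers $\beta_t$, and `v = 0.5` gives their geometric mean.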

## Faster sampling: DDIMs

The sampling algorithm in DDPMs requires $T$ sequential iterations, so image generation is slow
```{=latex}
\, \\
```
In *Denoising Diffusion Implicit Models* or **DDIMs**, [\color{SkyBlue}{Song et al. [2020]}](https://arxiv.org/abs/2010.02502) made the diffusion process non-Markovian, so each step $t$ no longer depends only on the previous step

. . .
```{=latex}
\, \\
```
This allows us to "skip" steps during the sampling process

. . .
```{=latex}
\, \\
```
How? By predicting how to get from $\mathbf{x}_t$ to $\mathbf{x}_0$, and then "going back" from $\mathbf{x}_0$ to, for instance, $\mathbf{x}_{t-2}$. That iteration therefore takes us from $\mathbf{x}_t$ directly to $\mathbf{x}_{t-2}$ (skipping 1 sampling step)

## Faster sampling: DDIMs

![](images/ddim.png){height=9em fig-align="center"}
```{=latex}
\ifwindows
\vspace{-18pt}
\fi
\begin{center}
\tiny DDIM for accelerated sampling (in this diagram it is used to jump from \begin{math}\mathbf{x}_t\end{math} directly to \begin{math}\mathbf{x}_{t-2}\end{math})
\end{center}
\normalsize
```
\small $$\mathbf{x}_{t-2} = \sqrt{\bar\alpha_{t-2}} \underbrace{\left( \frac{\mathbf{x}_t - \sqrt{1 - \bar\alpha_t} {\color{GreenYellow} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}}{\sqrt{\bar\alpha_t}} \right)}_{\text{predicted } \mathbf{x}_0} + \underbrace{\sqrt{1 - \bar\alpha_{t-2} - \sigma^2_t} \cdot {\color{GreenYellow} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}}_{\text{direction pointing to } \mathbf{x}_t} + \underbrace{\sigma_t \mathbf{z}}_{\text{random noise}}$$

. . .

We can choose freely how many steps to skip per iteration (but quality can decrease if we skip too many!)

## Faster sampling: DDIMs

Most models can generate good quality images with few steps (such as 20 or 50)

![](images/ddim_samples.png){height=12em fig-align="center"}
```{=latex}
\ifwindows
\vspace{-18pt}
\fi
\begin{center}
\tiny Source: Song et al. [2020]
\end{center}
\normalsize
```
. . .

Many other sampling algorithm variants have been proposed after DDIM, especially [\color{SkyBlue}{for Stable Diffusion models}](https://stable-diffusion-art.com/samplers/)!

## Faster sampling: DDIMs

\small $$\mathbf{x}_{t-2} = \sqrt{\bar\alpha_{t-2}} \underbrace{\left( \frac{\mathbf{x}_t - \sqrt{1 - \bar\alpha_t} {\color{GreenYellow} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}}{\sqrt{\bar\alpha_t}} \right)}_{\text{predicted } \mathbf{x}_0} + \underbrace{\sqrt{1 - \bar\alpha_{t-2} - \sigma^2_t} \cdot {\color{GreenYellow} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}}_{\text{direction pointing to } \mathbf{x}_t} + \underbrace{\sigma_t \mathbf{z}}_{\text{random noise}}$$

\normalsize Authors also state that $\sigma_t$ can be set to $0$, making the sampling algorithm deterministic
```{=latex}
\, \\
```
. . .

If done, the model is an *implicit probabilistic* one, hence the "I" in the name DDIM
```{=latex}
\, \\
```
. . .

Training does not change (same as in regular DDPMs)
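A sketch of the DDIM update in NumPy. The function name `ddim_step` is illustrative, and the true noise is used in place of a trained model's prediction so that the result can be checked exactly.

```python
import numpy as np

def ddim_step(x_t, eps_pred, ab_t, ab_target, sigma=0.0, rng=None):
    """Jump from x_t to an earlier step whose cumulative product is ab_target.

    rng is only needed when sigma > 0.
    """
    x0_pred = (x_t - np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(ab_t)      # predicted x_0
    direction = np.sqrt(1.0 - ab_target - sigma ** 2) * eps_pred          # direction pointing to x_t
    noise = sigma * rng.standard_normal(x_t.shape) if sigma > 0 else 0.0  # random noise term
    return np.sqrt(ab_target) * x0_pred + direction + noise

alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))
eps = rng.standard_normal(x0.shape)

t = 500
x_t = np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1.0 - alpha_bar[t - 1]) * eps
# With a perfect noise prediction and sigma = 0, the jump lands exactly on the
# Nice-property sample at t - 2 built from the same x_0 and eps.
x_tm2 = ddim_step(x_t, eps, alpha_bar[t - 1], alpha_bar[t - 3])
```

Skipping more steps only changes which $\bar\alpha$ is passed as `ab_target`.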

## Conditional generation

So far the diffusion models we have seen generate *any* image from pure noise; we don't have control over it

```{=latex}
\, \\
```

But it is much more useful to tell the model in some way what kind of image we want to generate (e.g. dog, house, "a polar bear with sunglasses surfing in space"...)

. . .

```{=latex}
\, \\
```

**Conditional generation**: allows us to "inject" additional data to the model to obtain a specific kind of image

. . .

```{=latex}
\, \\
```

That additional data can be a class label (e.g. 0→dog, 1→cat, 2→house), a text prompt (a polar bear with sunglasses surfing in space), another image...

## Conditional generation: Classifier Guidance

To perform conditional generation, [\color{SkyBlue}{Dhariwal and Nichol [2021]}](https://arxiv.org/abs/2105.05233) trained a regular image classifier $p_\phi(y \mid \mathbf{x}_t, t)$ using partially noisy images

```{=latex}
\, \\
```

They used the classifier gradients $\nabla_{\mathbf{x}_t} \log{p_\phi(y \mid \mathbf{x}_t, t)}$ to guide the sampling algorithm towards $y$ (the class label e.g. 0→dog, 1→cat, 2→house)

. . .

```{=latex}
\, \\
```

To do so, for the \hyperlink{sampling}{{\color{SkyBlue}sampling algorithm}} they replace ${\color{GreenYellow}\boldsymbol{\epsilon}_\theta}$ with $\hat{\boldsymbol\epsilon}$, which is:

$$ \hat{\boldsymbol\epsilon}(\mathbf{x}_t, t) = {\color{GreenYellow}\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)} - \sqrt{1 - \bar\alpha_t} \cdot s \cdot \nabla_{\mathbf{x}_t} \log{p_\phi(y \mid \mathbf{x}_t, t)}$$

Where $s$ is a weight/scale factor that controls the guidance strength. Higher $s$→higher fidelity (but less diverse) images
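To make the formula concrete, the sketch below uses a toy linear classifier whose gradient is known in closed form. The classifier, its weights `w` and the values of `ab_t` and `s` are all hypothetical; a real setup would get the gradient from a trained network via autodiff.

```python
import numpy as np

def toy_log_prob_grad(x_t, w):
    """Gradient of log sigmoid(<w, x_t>) w.r.t. x_t for a toy linear classifier."""
    logit = np.sum(w * x_t)
    return (1.0 - 1.0 / (1.0 + np.exp(-logit))) * w

def guided_eps(eps_pred, grad_log_p, ab_t, s):
    """eps_hat = eps_theta - sqrt(1 - alpha_bar_t) * s * grad log p(y | x_t, t)."""
    return eps_pred - np.sqrt(1.0 - ab_t) * s * grad_log_p

rng = np.random.default_rng(0)
x_t = rng.standard_normal((3, 8, 8))
w = rng.standard_normal(x_t.shape) * 0.1              # hypothetical classifier weights
eps_pred = rng.standard_normal(x_t.shape)             # stand-in for eps_theta(x_t, t)

grad = toy_log_prob_grad(x_t, w)
eps_hat = guided_eps(eps_pred, grad, ab_t=0.5, s=2.0)
```

Setting `s = 0` gives back the unguided prediction.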

## Conditional generation: Classifier-Free Guidance

Cons of Classifier Guidance → having to train a separate classifier. Also: most information in the image $\mathbf{x}_t$ is not relevant for predicting $y$, and therefore taking its gradient w.r.t. $\mathbf{x}_t$ can yield somewhat arbitrary guidance

. . .

```{=latex}
\, \\
```

[\color{SkyBlue}{Ho and Salimans [2022]}](https://arxiv.org/abs/2207.12598) proposed Classifier-Free Guidance, which avoids training a separate classifier

. . .

```{=latex}
\, \\
```

Instead, we can get an "implicit classifier" by jointly training a conditional and unconditional diffusion model. Applying Bayes' theorem:

$$\underbrace{p(y \mid \mathbf{x}_t, t)}_\text{classifier} \propto \underbrace{p_\theta(\mathbf{x}_t \mid y, t)}_{\substack{\text{conditional} \\ \text{diffusion model}}} \,\, / \underbrace{p_\theta(\mathbf{x}_t \mid t)}_{\substack{\text{unconditional} \\ \text{diffusion model}}}$$

Both conditional and unconditional models can be the same one by training a conditional model $p_\theta(\mathbf{x}_t \mid y, t)$ for which the class $y$ gets dropped at random during training with some probability (similar to what happens in dropout)

## Conditional generation: Classifier-Free Guidance

Specifically, we can say that our model becomes ${\color{GreenYellow}\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t}, y{\color{GreenYellow})}$ with some probability of $y = \varnothing$ (that is, we sometimes feed the model with a special null class identifier instead of the real one during training) \unfootnote{The paper uses slightly different notation: they use $\mathbf{c}$ instead of $y$ and $t$ for the class and timestep feature respectively (they group them into $\mathbf{c}$), and use $\mathbf{z}_\lambda$ instead of $\mathbf{x}_t$}

```{=latex}
\onslide<+->
```

```{=latex}
\, \\
```

In the \hyperlink{sampling}{{\color{SkyBlue}sampling algorithm}} they replace ${\color{GreenYellow}\boldsymbol{\epsilon}_\theta}$ with $\tilde{\boldsymbol{\epsilon}}_\theta$, which is a combination of the unconditional and conditional predictions:

$$\tilde{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t, t, y) = (1 + w) \cdot {\color{GreenYellow}\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t}, y{\color{GreenYellow})} - w \cdot {\color{GreenYellow}\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t}, y=\varnothing{\color{GreenYellow})}$$

Where $w$ controls the guidance strength
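Both halves of Classifier-Free Guidance fit in a few lines. In this sketch `NULL_CLASS`, the drop probability of 0.1 and the random arrays standing in for the two model outputs are assumptions made for illustration.

```python
import numpy as np

NULL_CLASS = -1                                       # hypothetical id for the null class

def drop_labels(y, p_uncond, rng):
    """Training time: replace each label by the null class with probability p_uncond."""
    drop = rng.random(y.shape) < p_uncond
    return np.where(drop, NULL_CLASS, y)

def cfg_eps(eps_cond, eps_uncond, w):
    """Sampling time: (1 + w) * eps(x_t, t, y) - w * eps(x_t, t, null)."""
    return (1.0 + w) * eps_cond - w * eps_uncond

rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=1000)                     # labels such as 0=dog, 1=cat, 2=house
y_train = drop_labels(y, p_uncond=0.1, rng=rng)

eps_cond = rng.standard_normal((3, 8, 8))             # stand-ins for the two model outputs
eps_uncond = rng.standard_normal((3, 8, 8))
eps_tilde = cfg_eps(eps_cond, eps_uncond, w=3.0)
```

The combination can be read as the conditional prediction pushed further away from the unconditional one: `eps_cond + w * (eps_cond - eps_uncond)`.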

## Conditional generation: Classifier-Free Guidance

```{=latex}

\begin{center}
\tiny Source: Ho \& Salimans [2022]
\end{center}
\normalsize
\ifwindows
\vspace{-14pt}
\fi
```

:::: {.columns}

::: {.column width="33%" align=center}
![](images/classifier_guidance_1.png){width=100% fig-align="center"}
\begin{center}No guidance\end{center}
:::

::: {.column width="33%" align=center}
![](images/classifier_guidance_2.png){width=100% fig-align="center"}
\begin{center}$w=1$\end{center}
:::

::: {.column width="33%" align=center}
![](images/classifier_guidance_3.png){width=100% fig-align="center"}
\begin{center}$w=3$\end{center}
:::

::::

## Conditional generation: Conditioning on images

We can condition a diffusion model on something other than a class label, for instance on other images

Palette by [\color{SkyBlue}{Saharia et al. [2022]}](https://arxiv.org/abs/2111.05826) performs image-to-image translation by training $p_\theta(\mathbf{x}_t \mid y, t)$ where $y$ is an image we use to condition generation (given as input data to the model both during training and sampling)

. . .

They use it to perform image colorization, where $y$ is a black and white image that we want to colorize. Also for inpainting, image restoration...

:::: {.columns}

::: {.column width="50%" align=center}
![](images/palette_1.png){width=100% fig-align="center"}
:::

::: {.column width="50%" align=center}
![](images/palette_2.png){width=100% fig-align="center"}
:::

::::
\ifwindows
\vspace{-10pt}
\fi
\begin{center}
\tiny Source: Saharia et al. [2022]
\end{center}
\normalsize

## Conditional generation: ControlNet {.t}

[\color{SkyBlue}{Zhang et al. [2023]}](https://arxiv.org/abs/2302.05543) proposed ControlNet, where they fine-tuned a diffusion model to include extra conditioning such as sketch images, depth maps, human pose data, image segmentation data and others

. . .

They made architectural changes to the diffusion model to accommodate the additional conditioning info, and then fine-tuned the model with datasets that include that extra conditioning data:

```{=latex}
\ifwindows
\vspace{-4pt}
\fi
```

![](images/controlnet_arch.png){height=10em fig-align="center"}

```{=latex}
\ifwindows
\vspace{-18pt}
\fi
\begin{center}
\tiny ControlNet block. They freeze the original pre-trained model layer, and make a trainable copy adding zero convolutions (1x1 conv layers with weights initialized to zeros) before and after the copied layer. $x$ is what we know as $\mathbf{x}_t$ and $\mathbf{c}$ is the extra conditioning image/data. Frozen and ControlNet block outputs are added together. Source: Zhang et al. [2023]
\end{center}
\normalsize
```

## Conditional generation: ControlNet

![](images/controlnet_examples.png){width=100% fig-align="center"}
\ifwindows
\vspace{-20pt}
\fi
\begin{center}
\tiny Source: Zhang et al. [2023]
\end{center}
\normalsize

## Conditional generation: Conditioning on text

Diffusion models can also be conditioned on natural language through a *prompt* e.g. "a polar bear with sunglasses surfing in space"

```{=latex}
\, \\
```

This can be done with Classifier-Free Guidance, but making $y$ be text (a list of tokens) instead of a class label, and adding some text-processing specific layers to the diffusion model (such as Transformer-like layers)

. . .

```{=latex}
\, \\
```

Another option is to use a pre-trained text model and use its outputs to condition the diffusion model

```{=latex}
\, \\
```

A way to do this is to use *CLIP guidance*

## Conditional generation: Conditioning on text {.t}

CLIP by [\color{SkyBlue}{Radford et al. [2021]}](https://arxiv.org/abs/2103.00020) is a discriminative (non-generative) model that uses contrastive learning to match images with their natural language descriptions

Made of a text encoder (Transformer) and an image encoder (Vision Transformer), it embeds both text and images into a common embedding space

. . .


The model is trained to maximize the (cosine) similarity between matching (image, text) pairs in this embedding space

```{=latex}
\ifwindows
\vspace{-8pt}
\fi
```

![](images/clip.png){height=12em fig-align="center"}
\ifwindows
\vspace{-12pt}
\fi
\begin{center}
\tiny CLIP learns to maximize the cosine similarity (which can be thought of as correlation) between images and texts that go together. It therefore learns a sort of "correlation matrix" between training images $\text{I}_1, \text{I}_2, \ldots, \text{I}_N$ and training textual descriptions $\text{T}_1, \text{T}_2, \ldots, \text{T}_N$. Source: Radford et al. [2021]
\end{center}
\normalsize
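The "correlation matrix" idea can be sketched with random vectors standing in for the encoder outputs. The function name, the fixed temperature of 0.07 and the symmetric cross-entropy form are assumptions for this illustration, not a reproduction of the CLIP training code.

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy over the cosine-similarity matrix of N (image, text) pairs."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sim = img @ txt.T                                 # sim[i, j] = cos(I_i, T_j)
    logits = sim / temperature

    def cross_entropy_diag(l):
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))           # matching pairs sit on the diagonal

    loss = 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
    return sim, loss

rng = np.random.default_rng(0)
img_emb = rng.standard_normal((4, 16))                # stand-ins for encoder outputs
txt_emb = img_emb + 0.01 * rng.standard_normal((4, 16))   # nearly aligned pairs
sim, loss = clip_style_loss(img_emb, txt_emb)
```

Minimizing this loss pushes the diagonal of `sim` (matching pairs) up and the off-diagonal entries down.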

## Conditional generation: Conditioning on text

Once the CLIP model is trained, we can feed a text prompt to the text encoder and use the output as the conditioning for a diffusion model

```{=latex}
\, \\
```

CLIP guidance is applied in large diffusion models such as GLIDE and DALL-E 2, which we will cover next

# Large diffusion models

## Large diffusion models

\large Outline

\normalsize

- GLIDE
- DALL-E
- Imagen
- Stable Diffusion
- Flux
- Sana

## GLIDE

GLIDE (**G**uided **L**anguage to **I**mage **D**iffusion for generation and **E**diting) by [\color{SkyBlue}{Nichol et al. [2021]}](https://arxiv.org/abs/2112.10741) is a text-to-image model made of three sub-models:

- A base diffusion model that generates 64x64 images
- An upsampling diffusion model that increases the resolution to 256x256
- A Transformer-based text encoder to condition generation on text prompts (Classifier-Free Guidance)

```{=latex}
\, \\
```

They also tried to replace the Transformer with a pre-trained CLIP, but the Classifier-Free Guidance method worked better

## GLIDE

![](images/glide.png){width=90% fig-align="center"}
\ifwindows
\vspace{-20pt}
\fi
\begin{center}
\tiny Source: Nichol et al. [2021]
\end{center}
\normalsize

## DALL-E

DALL-E 1 by [\color{SkyBlue}{Ramesh et al. [2021]}](https://arxiv.org/abs/2102.12092) was not a diffusion model, but DALL-E 2 and 3 are

DALL-E 2 (aka unCLIP) by [\color{SkyBlue}{Ramesh et al. [2022]}](https://arxiv.org/abs/2204.06125) uses a pre-trained CLIP model, and the text-to-image model itself is made of two main components:

- A *prior* model which produces image embeddings based on the text encoded by CLIP
- A *decoder* model that generates the actual image based on the prior's output and the original prompt

. . .

![](images/dalle2_annotated_1.png){width=80% fig-align="center"}
\ifwindows
\vspace{-12pt}
\fi
\begin{center}
\tiny Source: adapted from Ramesh et al. [2022]
\end{center}
\normalsize

## DALL-E {.t}

![](images/dalle2_annotated_2.png){width=80% fig-align="center"}

\ifwindows
\vspace{-12pt}
\fi

Authors tried two variants for the *prior* model:

- An autoregressive one (top one inside the red rectangle), where the image embeddings are generated as a sequence of discrete tokens (like in LLMs)
- A diffusion one (bottom one inside the red rectangle)

The diffusion prior worked better for them

## DALL-E {.t}

![](images/dalle2_annotated_3.png){width=80% fig-align="center"}

\ifwindows
\vspace{-12pt}
\fi

The decoder model is actually 3 diffusion models:

- One to generate 64x64 images
- Another one to upscale to 256x256
- A third one to upscale to 1024x1024

The intermediate representation produced by the prior (the image embedding) can be tweaked, which allows for text-guided image manipulation tasks
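
## DALL-E

The data flow through the three decoder stages can be traced with a small sketch. It only tracks resolutions (taken from the previous slide) and does no image processing at all. The stage names and the `Stage` / `trace_cascade` helpers are hypothetical labels introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str      # illustrative label, not an official component name
    out_size: int  # output height = width, in pixels


# Resolutions as listed on the slide
CASCADE = [
    Stage("base decoder", 64),
    Stage("upsampler 1", 256),
    Stage("upsampler 2", 1024),
]


def trace_cascade(stages):
    """Return (name, input size, output size, upscale factor) per stage."""
    rows, prev = [], None
    for stage in stages:
        factor = None if prev is None else stage.out_size // prev
        rows.append((stage.name, prev, stage.out_size, factor))
        prev = stage.out_size
    return rows


for name, size_in, size_out, factor in trace_cascade(CASCADE):
    source = "image embedding + prompt" if size_in is None else f"{size_in}x{size_in} image"
    scale = "" if factor is None else f" (x{factor} per side)"
    print(f"{name}: {source} -> {size_out}x{size_out}{scale}")
```

Each upsampler only has to add detail to an image that already has the right content, which is the main appeal of cascaded generation.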

## DALL-E

![](images/dalle2_examples_1.png){width=100% fig-align="center"}
\ifwindows
\vspace{-20pt}
\fi
\begin{center}
\tiny Source: Ramesh et al. [2022]
\end{center}
\normalsize

## DALL-E

![](images/dalle2_examples_2.png){height=85% fig-align="center"}
\ifwindows
\vspace{-14pt}
\fi
\begin{center}
\tiny Source: Ramesh et al. [2022]
\end{center}
\normalsize

## DALL-E

DALL-E 3 by [\color{SkyBlue}{Betker et al. [2023]}](https://cdn.openai.com/papers/dall-e-3.pdf) improves on DALL-E 2 mostly by curating a better dataset

```{=latex}
\, \\
```

They train an image captioner to generate better textual descriptions of images

```{=latex}
\, \\
```


They also use GPT-4 to "upsample" (extend) text prompts before generating the image

. . .

```{=latex}
\, \\
```

Model architecture is undisclosed

## Imagen

Imagen by [\color{SkyBlue}{Saharia et al. [2022]}](https://arxiv.org/abs/2205.11487) also uses "cascaded" diffusion models


It uses a pre-trained LLM as text encoder (T5-XXL by [\color{SkyBlue}{Raffel et al. [2019]}](https://arxiv.org/abs/1910.10683)) to generate text embeddings → much more effective than CLIP, and quality improves with the LLM size (scaling the text encoder helps more than scaling the diffusion models)

. . .

The diffusion models are *Efficient U-Nets*, which slightly alter the original U-Net

\ifwindows
\vspace{-4pt}
\fi

![](images/imagen_1.png){height=55% fig-align="center"}
\ifwindows
\vspace{-13pt}
\fi
\begin{center}
\tiny Source: Saharia et al. [2022]
\end{center}
\normalsize

## Imagen

![](images/imagen_examples_1.png){height=80% fig-align="center"}
\ifwindows
\vspace{-20pt}
\fi
\begin{center}
\tiny Source: Saharia et al. [2022]
\end{center}
\normalsize

## Imagen

Imagen 2 (2024) and 3 ([\color{SkyBlue}{Imagen 3 team [2024]}](https://arxiv.org/abs/2408.07009)) are more recent, but their architectures remain largely undisclosed

![](images/imagen_examples_2.png){height=65% fig-align="center"}
\ifwindows
\vspace{-20pt}
\fi
\begin{center}
\tiny Source: Imagen 3 team [2024]
\end{center}
\normalsize

## Stable Diffusion

[\color{SkyBlue}{Rombach et al. [2022]}](https://arxiv.org/abs/2112.10752) published a **Latent Diffusion Model** (LDM), which would serve as the basis for Stable Diffusion

LDM makes image generation more efficient by running the diffusion process in a compressed, latent space instead of in the original image space

. . .

It uses an Autoencoder to project images into that latent space, and runs the diffusion there

![](images/latent_diffusion.png){height=40% fig-align="center"}
\ifwindows
\vspace{-12pt}
\fi
\begin{center}
\tiny (Read in order: top-left→top-right→bottom-right→bottom-left): An encoder $\mathcal{E}$ projects an image $x$ into a latent representation $z$, which then goes through the forward diffusion process (called Diffusion Process in the image) to get the noisy latent $z_T$. Then, the U-Net performs the reverse diffusion process to try to recover $z$, which is projected back into a generated image $\tilde{x}$ by the decoder $\mathcal{D}$. Different conditionings (text, other images, etc.) can be used to guide the generation process thanks to the condition encoders $\tau_{\theta}$ and the cross-attention layers. Source: Rombach et al. [2022]
\end{center}
\normalsize
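
## Stable Diffusion

To get a feeling for how much cheaper the latent space is, the following sketch compares the number of values in an image with the number of values in its latent. The downsampling factor `f=8` and the 4 latent channels are assumptions matching the autoencoder commonly used with Stable Diffusion 1.X; other LDM configurations use different values. The helper names are illustrative.

```python
def latent_shape(height, width, f=8, latent_channels=4):
    """Shape (C, H, W) of the latent for a spatial downsampling factor f."""
    assert height % f == 0 and width % f == 0, "image size must be divisible by f"
    return (latent_channels, height // f, width // f)


def numel(shape):
    """Total number of values in a tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n


image = (3, 512, 512)            # RGB image
latent = latent_shape(512, 512)  # what the U-Net actually denoises

print(latent, numel(image) / numel(latent))  # (4, 64, 64) 48.0
```

Under these assumptions the diffusion model works on 48 times fewer values per sample than a pixel-space model at the same resolution.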

## Stable Diffusion

In August 2022, [\color{SkyBlue}{Stability AI}](https://stability.ai/) and [\color{SkyBlue}{Runway}](https://runwayml.com/) released an open LDM which they called Stable Diffusion

[\color{SkyBlue}{Versions 1.X}](https://github.com/CompVis/stable-diffusion) (Aug - Oct 2022) used OpenAI's pretrained CLIP for text conditioning and were capable of generating 512x512 images

. . .

```{=latex}
\, \\
```

[\color{SkyBlue}{Versions 2.X}](https://github.com/Stability-AI/stablediffusion) (Nov - Dec 2022) used a CLIP model they trained themselves, which was open-sourced as [\color{SkyBlue}{OpenCLIP}](https://github.com/mlfoundations/open_clip). Resolution was also increased to 768x768

. . .

```{=latex}
\, \\
```

XL versions (July - Nov 2023, [\color{SkyBlue}{Podell et al. [2023]}](https://arxiv.org/abs/2307.01952)) made the U-Net 3x larger and introduced a two-stage process: a base model that generates an image, and a *refiner* model that adds high-quality details to it. Resolution was increased to 1024x1024

. . .

```{=latex}
\, \\
```

Version 3 ([\color{SkyBlue}{Esser et al. [2024]}](https://arxiv.org/abs/2403.03206)) made many architectural changes, such as adding T5-XXL ([\color{SkyBlue}{Raffel et al. [2020]}](https://arxiv.org/abs/1910.10683)) as a text encoder alongside CLIP. Its release was controversial, though (both because of its unclear licensing and [\color{SkyBlue}{generation issues}](https://www.bentoml.com/blog/stable-diffusion-3-text-master-prone-problems))

## Stable Diffusion

![](images/sd_generations.jpeg){height=80% fig-align="center"}
\ifwindows
\vspace{-12pt}
\fi
\begin{center}
\tiny Sources: https://github.com/CompVis/stable-diffusion, https://github.com/Stability-AI/stablediffusion, https://github.com/huggingface/diffusers/blob/main/docs/source/en/using-diffusers/sdxl.md, https://huggingface.co/stabilityai/stable-diffusion-3-medium [2024]
\end{center}
\normalsize

## Flux

Authors of the LDM model founded [\color{SkyBlue}{Black Forest Labs}](https://blackforestlabs.ai/announcing-black-forest-labs/) in Aug 2024 and released inference code for a model called [\color{SkyBlue}{Flux}](https://github.com/black-forest-labs/flux)

. . .

```{=latex}
\, \\
```

Architecturally very similar to Stable Diffusion 3, it replaces the U-Net with a **Diffusion Transformer (DiT)** ([\color{SkyBlue}{Peebles and Xie [2023]}](https://arxiv.org/abs/2212.09748))

![](images/DiT.png){height=50% fig-align="center"}
\ifwindows
\vspace{-12pt}
\fi
\begin{center}
\tiny Just like LDM, DiT operates in the latent space. Like Vision Transformers (ViTs), it transforms the data into a sequence of "patches", which go through $N$ DiT blocks. The authors tried three different block architectures, and adaLN-Zero (adaLN: adaptive Layer Norm) was the one that gave the best results, as depicted on the right. Source: Peebles and Xie [2023]
\end{center}
\normalsize
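
## Flux

The "patchify" step mentioned in the DiT figure can be sketched in plain Python. The example assumes a latent of shape 4x32x32 (the latent of a 256x256 image with an 8x autoencoder) and a patch size of 2. The function name `patchify` and the nested-list representation are illustrative; real implementations work on tensors and follow this step with a linear projection of each patch.

```python
def patchify(latent, p):
    """Split a latent stored as [C][H][W] into (H/p)*(W/p) flattened patches."""
    C, H, W = len(latent), len(latent[0]), len(latent[0][0])
    assert H % p == 0 and W % p == 0, "patch size must divide height and width"
    tokens = []
    for top in range(0, H, p):          # scan patches row by row
        for left in range(0, W, p):
            tokens.append([
                latent[c][top + i][left + j]
                for i in range(p) for j in range(p) for c in range(C)
            ])
    return tokens


C, H, W, p = 4, 32, 32, 2
# Dummy latent: every position holds a distinct number
latent = [[[float(c * H * W + y * W + x) for x in range(W)]
           for y in range(H)] for c in range(C)]

tokens = patchify(latent, p)
print(len(tokens), len(tokens[0]))  # 256 16
```

Halving the patch size quadruples the sequence length, and therefore the cost of the Transformer blocks.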

## Flux

Like Stable Diffusion 3, Flux uses **Flow Matching** ([\color{SkyBlue}{Lipman et al. [2023]}](https://arxiv.org/abs/2210.02747)) to formulate, train and sample from the model

```{=latex}
\, \\
```

Flow Matching is based on Continuous Normalizing Flows, and performs generative modeling by progressively morphing one distribution into another

```{=latex}
\, \\
```

. . .

Diffusion is a specific case of Flow Matching, where the morphing is done by adding random Normal noise to the data

```{=latex}
\, \\
```

Other Flow Matching formulations (like when using *Conditional Optimal Transport*) can be more efficient to train and sample from, at the expense of a bit of quality (which can be offset by more data)

. . .

```{=latex}
\, \\
```

We refer to the paper above and [\color{SkyBlue}{this video}](https://www.youtube.com/watch?v=5ZSwYogAxYg) by Lipman for further details
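
## Flux

A minimal sketch of the Conditional Optimal Transport formulation may help. It assumes the simplified case $\sigma_{min} = 0$, where a noise sample $x_0$ and a data sample $x_1$ are joined by a straight line $x_t = (1-t)\,x_0 + t\,x_1$ and the network is trained to regress the constant velocity $x_1 - x_0$. All function names are illustrative, and the "model" below is an oracle that already knows the answer for a single pair, standing in for a trained network.

```python
import random


def ot_path(x0, x1, t):
    """Point on the straight line from noise x0 (t=0) to data x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def ot_target_velocity(x0, x1):
    """Regression target for the velocity network: constant along the path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


random.seed(0)
x1 = [2.0, -1.0]                           # a toy "data" point
x0 = [random.gauss(0.0, 1.0) for _ in x1]  # a noise sample


def oracle(x, t):
    """Stand-in for a trained network: exact velocity for this single pair."""
    return ot_target_velocity(x0, x1)


sample = euler_sample(x0, oracle, steps=4)
print(max(abs(s - d) for s, d in zip(sample, x1)))  # numerically close to 0
```

Because the target paths are straight, very few integration steps are needed in this toy case. A learned velocity field is only approximately straight, so real samplers still use more steps.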

## Flux

![](images/flux_examples.png){height=80% fig-align="center"}
\ifwindows
\vspace{-20pt}
\fi
\begin{center}
\tiny Source: https://blackforestlabs.ai/announcing-black-forest-labs/ [2024]
\end{center}
\normalsize

## Sana

Sana by [\color{SkyBlue}{Xie et al. [2024]}](https://arxiv.org/abs/2410.10629) includes innovations such as:

- A *Deep Compression Autoencoder* (DC-AE), which aggressively compresses/decompresses input images by up to 32x → efficient generation of 4K images

- A linear DiT that performs efficient sub-quadratic attention

- Google's pre-trained Gemma-2 ([\color{SkyBlue}{Gemma Team [2024]}](https://arxiv.org/abs/2408.00118)) decoder-only LLM for text conditioning → they extract the last layer's output as textual embeddings

\ifwindows
\vspace{-8pt}
\fi
![](images/sana_diagram.png){height=40% fig-align="center"}
\ifwindows
\vspace{-16pt}
\fi
\begin{center}
\tiny Figure (a) shows the model architecture, including a sample prompt for Gemma-2. Positional embedding is not required in the linear DiT. Figure (b) shows details on the linear DiT block. Source: Xie et al. [2024]
\end{center}
\normalsize
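
## Sana

Why linear attention is sub-quadratic can be shown with a toy example. The sketch assumes a ReLU feature map in place of the softmax, so that the product can be re-associated as $\phi(Q)\,(\phi(K)^\top V)$ and the $N \times N$ similarity matrix is never built. The function names and toy matrices are illustrative, and the code assumes the normalizer is non-zero (real implementations usually add a small epsilon).

```python
def relu(row):
    return [max(0.0, v) for v in row]


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def quadratic_form(Q, K, V):
    """Builds the full N x N similarity matrix first: cost grows with N^2."""
    Qp, Kp = [relu(q) for q in Q], [relu(k) for k in K]
    S = matmul(Qp, transpose(Kp))                  # N x N
    return [[sum(s * v[j] for s, v in zip(row, V)) / sum(row)
             for j in range(len(V[0]))] for row in S]


def linear_form(Q, K, V):
    """Same output using only d x d intermediates: cost grows linearly with N."""
    Qp, Kp = [relu(q) for q in Q], [relu(k) for k in K]
    KV = matmul(transpose(Kp), V)                  # d x d_v, shared by all queries
    k_sum = [sum(col) for col in zip(*Kp)]         # length d
    out = []
    for q in Qp:
        norm = sum(a * b for a, b in zip(q, k_sum))
        out.append([n / norm for n in matmul([q], KV)[0]])
    return out


# Toy example: N = 3 tokens, d = 2 features
Q = [[1.0, 0.5], [0.2, 1.0], [0.7, 0.3]]
K = [[0.5, 1.0], [1.0, 0.1], [0.3, 0.8]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

a, b = quadratic_form(Q, K, V), linear_form(Q, K, V)
print(max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)))
```

Both functions return the same values up to floating-point error; only the order of the matrix products changes. This matters for 4K images, where the number of tokens $N$ is very large.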

## Sana

![](images/sana_examples.png){height=80% fig-align="center"}
\ifwindows
\vspace{-20pt}
\fi
\begin{center}
\tiny Source: Xie et al. [2024]
\end{center}
\normalsize

# Beyond image generation

## Diffusion for video

Image-generating diffusion models work with 4D data in the form of $(\textit{Batch, Channels, Height, Width})$

```{=latex}
\, \\
```

Diffusion-based video generation models mostly work by adapting the model, in various ways, to handle 5D data $(\textit{Batch, Time, Channels, Height, Width})$

```{=latex}
\, \\
```

. . .

Some of the most prominent publications are:

- [\color{SkyBlue}{Video Diffusion Models}](https://arxiv.org/abs/2204.03458) by Ho et al. [2022]
  
- [\color{SkyBlue}{Imagen Video: High Definition Video Generation with Diffusion Models}](https://arxiv.org/abs/2210.02303) by Ho et al. [2022] and [\color{SkyBlue}{Veo}](https://deepmind.google/technologies/veo/) by Veo team [2024]

- [\color{SkyBlue}{Photorealistic Video Generation with Diffusion Models}](https://arxiv.org/abs/2312.06662) by Gupta et al. [2023]
  
- [\color{SkyBlue}{Sora}](https://openai.com/research/video-generation-models-as-world-simulators) by Brooks et al. [2024]
  
- [\color{SkyBlue}{Movie Gen}](https://ai.meta.com/static-resource/movie-gen-research-paper) by The Movie Gen Team @ Meta [2024]

- [\color{SkyBlue}{CogVideoX}](https://arxiv.org/abs/2408.06072) by Yang et al. [2024]

- [\color{SkyBlue}{Pyramid Flow}](https://arxiv.org/abs/2410.05954) by Jin et al. [2024]

- [\color{SkyBlue}{Mochi 1}](https://www.genmo.ai/blog) by Genmo Team [2024]
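
## Diffusion for video

One common way to handle the extra time dimension is to alternate layers that look at each frame independently with layers that look across frames. The sketch below only shows the shape bookkeeping this requires. It is an illustrative pattern, not the method of any specific paper in the list, and the function names are hypothetical.

```python
def fold_for_spatial(shape):
    """(B, T, C, H, W) -> (B*T, C, H, W): every frame is treated as an image."""
    B, T, C, H, W = shape
    return (B * T, C, H, W)


def fold_for_temporal(shape):
    """(B, T, C, H, W) -> (B*H*W, T, C): every location is a sequence over time."""
    B, T, C, H, W = shape
    return (B * H * W, T, C)


video = (2, 16, 4, 32, 32)  # 2 clips, 16 frames, 4 channels, 32x32 each

print(fold_for_spatial(video))   # (32, 4, 32, 32)
print(fold_for_temporal(video))  # (2048, 16, 4)
```

With the first layout, the layers of an image model can be reused unchanged. The second layout lets a 1D attention or convolution layer mix information between frames.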

## Diffusion for video

:::: {.columns align=center}
::: {.column align=center width=100%}
[![](images/movie_gen_example_1.png){width=100% fig-alt="Movie Gen" fig-align="center"}](https://imgur.com/7yw8MSc)
\ifwindows 
\vspace{-14pt} 
\fi
\begin{center}
\tiny Source: The Movie Gen Team @ Meta [2024]
\end{center}
\normalsize
:::
::::

## Diffusion for other applications

Some examples (by no means a thorough list):

3D rendering and reconstruction:

- [\color{SkyBlue}{DreamFusion: Text-to-3D using 2D Diffusion}](https://arxiv.org/abs/2209.14988) by Poole et al. [2022]

- [\color{SkyBlue}{CAT3D: Create Anything in 3D with Multi-View Diffusion Models}](https://arxiv.org/abs/2405.10314) by Gao et al. [2024]

Text generation:

- [\color{SkyBlue}{Diffusion-LM Improves Controllable Text Generation}](https://arxiv.org/abs/2205.14217) by Li et al. [2022]

Music and audio generation:

- [\color{SkyBlue}{Noise2Music: Text-conditioned Music Generation with Diffusion Models}](https://arxiv.org/abs/2302.03917) by Huang et al. [2023]
  
- [\color{SkyBlue}{Fast Timing-Conditioned Latent Audio Diffusion}](https://arxiv.org/abs/2402.04825) by Evans et al. [2024]
  
- [\color{SkyBlue}{QA-MDT: Quality-aware Masked Diffusion Transformer for Enhanced Music Generation}](https://arxiv.org/abs/2405.15863) by Li et al. [2024]

## Diffusion for other applications

Text-to-speech:

- [\color{SkyBlue}{Better speech synthesis through scaling}](https://arxiv.org/abs/2305.07243) by Betker [2023]
  
- [\color{SkyBlue}{NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models}](https://arxiv.org/abs/2403.03100) by Ju et al. [2024]

- [\color{SkyBlue}{F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching}](https://arxiv.org/abs/2410.06885) by Chen et al. [2024]


Life sciences:

- [\color{SkyBlue}{DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking}](https://arxiv.org/abs/2210.01776) by Corso et al. [2022]

- [\color{SkyBlue}{Protein generation with evolutionary diffusion: sequence is all you need}](https://www.biorxiv.org/content/10.1101/2023.09.11.556673v1) by Alamdari et al. [2023]

- [\color{SkyBlue}{Accurate structure prediction of biomolecular interactions with AlphaFold 3}](https://www.nature.com/articles/s41586-024-07487-w) by Abramson et al. [2024]

## Diffusion for other applications

Robotics:

- [\color{SkyBlue}{Diffusion Policy: Visuomotor Policy Learning via Action Diffusion}](https://arxiv.org/abs/2303.04137v5) by Chi et al. [2023]

Videogame engine creation:

- [\color{SkyBlue}{Diffusion Models Are Real-Time Game Engines}](https://arxiv.org/abs/2408.14837) by Valevski et al. [2024]

```{=latex}
\, \\
```

And many more (as [\color{SkyBlue}{there are more than 4,000 papers on arXiv}](https://vsehwag.github.io/blog/2023/2/all_papers_on_diffusion.html) at the time of writing!)

```{=latex}
\, \\
```

Recent survey papers like [\color{SkyBlue}{A Survey on Generative Diffusion Models}](https://arxiv.org/abs/2209.02646) by Cao et al. [2023] can be useful to stay up to date

## How to use diffusion models?

Do-It-Yourself:

- PyTorch

Training and fine-tuning of existing architectures:

- [\color{SkyBlue}{\raisebox{-2pt}{\includegraphics[height=12pt]{images/hugging-face-emoji.png}} Diffusers}](https://huggingface.co/docs/diffusers)
- [\color{SkyBlue}{Ostris AI toolkit}](https://github.com/ostris/ai-toolkit)
- [\color{SkyBlue}{Bghira's SimpleTuner}](https://github.com/bghira/SimpleTuner)
- [\color{SkyBlue}{Kohya}](https://github.com/bmaltais/kohya_ss)

Inference and image generation GUIs (no-code or low-code), from simpler to more complex:

- [\color{SkyBlue}{Automatic1111}](https://github.com/AUTOMATIC1111/stable-diffusion-webui)
- [\color{SkyBlue}{ForgeUI}](https://github.com/lllyasviel/stable-diffusion-webui-forge) → very similar to A1111, but faster and more modern
- [\color{SkyBlue}{SwarmUI}](https://github.com/mcmonkeyprojects/SwarmUI)
- [\color{SkyBlue}{ComfyUI}](https://github.com/comfyanonymous/ComfyUI) → tons of features; professional tool

\center \large This list gets outdated very quickly! Check for community updates on [\color{SkyBlue}{r/StableDiffusion}](https://www.reddit.com/r/StableDiffusion/) and [\color{SkyBlue}{Civitai}](https://civitai.com/)

## Some interesting topics we skipped

\small Connection to score-based generative models (what if $t$ was continuous instead of discrete?). Diffusion expressed as Ordinary and Stochastic Differential Equations:

- [\color{SkyBlue}{Generative Modeling by Estimating Gradients of the Data Distribution}](https://arxiv.org/abs/1907.05600) by Song and Ermon [2019]
- [\color{SkyBlue}{Improved Techniques for Training Score-Based Generative Models}](https://arxiv.org/abs/2006.09011) by Song and Ermon [2020]
- [\color{SkyBlue}{Score-Based Generative Modeling through Stochastic Differential Equations}](https://arxiv.org/abs/2011.13456) by Song et al. [2020]
- [\color{SkyBlue}{Elucidating the Design Space of Diffusion-Based Generative Models}](https://arxiv.org/abs/2206.00364) by Karras et al. [2022]

Perform sampling in even fewer steps:

- [\color{SkyBlue}{Progressive Distillation for Fast Sampling of Diffusion Models}](https://arxiv.org/abs/2202.00512) by Salimans and Ho [2022]
- [\color{SkyBlue}{Consistency Models}](https://arxiv.org/abs/2303.01469) by Song et al. [2023]
- [\color{SkyBlue}{Simple and Fast Distillation of Diffusion Models}](https://www.arxiv.org/abs/2409.19681) by Zhou et al. [2024]

Zero-terminal Signal-to-Noise Ratio (SNR):

- [\color{SkyBlue}{Common Diffusion Noise Schedules and Sample Steps are Flawed}](https://arxiv.org/abs/2305.08891) by Lin et al. [2024]

## Some interesting topics we skipped

\small Improve resolution through cascaded generation, used in e.g. DALL-E 2 and Imagen:

- [\color{SkyBlue}{Cascaded Diffusion Models for High Fidelity Image Generation}](https://arxiv.org/abs/2106.15282) by Ho et al. [2022]

Subject-driven image generation through fine-tuning (add specific characters to images):

- [\color{SkyBlue}{An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion}](https://arxiv.org/abs/2208.01618) by Gal et al. [2022]
- [\color{SkyBlue}{DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation}](https://arxiv.org/abs/2208.12242) by Ruiz et al. [2022]

Image and video editing:

- [\color{SkyBlue}{SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations}](https://arxiv.org/abs/2108.01073) by Meng et al. [2021]
- [\color{SkyBlue}{Prompt-to-Prompt Image Editing with Cross Attention Control}](https://arxiv.org/abs/2208.01626) by Hertz et al. [2022]
- [\color{SkyBlue}{RePaint: Inpainting using Denoising Diffusion Probabilistic Models}](https://arxiv.org/abs/2201.09865) by Lugmayr et al. [2022]
- [\color{SkyBlue}{Differential Diffusion: Giving Each Pixel Its Strength}](https://arxiv.org/abs/2306.00950) by Levin et al. [2023]
- [\color{SkyBlue}{Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation}](https://arxiv.org/abs/2212.11565) by Wu et al. [2023]
- [\color{SkyBlue}{Dreamix: Video Diffusion Models are General Video Editors}](https://arxiv.org/abs/2302.01329) by Molad et al. [2024]

## Some interesting topics we skipped

\small Diffusion Transformer variants and improvements:

- [\color{SkyBlue}{PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis}](https://arxiv.org/abs/2310.00426) by Chen et al. [2024]
- [\color{SkyBlue}{Dynamic Diffusion Transformer}](https://arxiv.org/abs/2410.03456) by Zhao et al. [2024]

ControlNet variants:

- [\color{SkyBlue}{Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models}](https://arxiv.org/abs/2305.16322) by Zhao et al. [2023]
- [\color{SkyBlue}{ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback}](https://arxiv.org/abs/2404.07987) by Li et al. [2024]
- [\color{SkyBlue}{CtrLoRA: An Extensible and Efficient Framework for Controllable Image Generation}](https://arxiv.org/abs/2410.09400) by Xu et al. [2024]

Flow Matching variants and improvements:

- [\color{SkyBlue}{InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation}](https://arxiv.org/abs/2309.06380) by Liu et al. [2023]
- [\color{SkyBlue}{Rectified Diffusion: Straightness Is Not Your Need in Rectified Flow}](https://arxiv.org/abs/2410.07303) by Wang et al. [2024]

## Some interesting topics we skipped {.t}

\small Regional generation:

- [\color{SkyBlue}{Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs}](https://arxiv.org/abs/2401.11708) by Yang et al. [2024]
- [\color{SkyBlue}{Training-free Regional Prompting for Diffusion Transformers}](https://arxiv.org/abs/2411.02395) by Chen et al. [2024]

General multi-task diffusion models:

- [\color{SkyBlue}{OmniGen: Unified Image Generation}](https://arxiv.org/abs/2409.11340) by Xiao et al. [2024]

## References {.allowframebreaks}

---
nocite: |
  @*
---
\tiny
::: {#refs}
:::