# Diffusion-DPO

Written by [Junkun Yuan](https://junkunyuan.github.io/) (yuanjk0921@outlook.com).

See paper reading list and notes [here](https://junkunyuan.github.io/paper_reading_list/paper_reading_list.html).

Last updated on Jul 12, 2025; &nbsp; First committed on Mar 30, 2025.

**References**
- [**Diffusion Model Alignment Using Direct Preference Optimization** *(CVPR 2024)*](https://arxiv.org/pdf/2311.12908)

**Contents**
- Diffusion-DPO
- PyTorch Implementations

## Diffusion-DPO

**Direct Preference Optimization (DPO)** is designed to directly optimize a model based on human preferences, <font color=red>without a reward model</font> or complex reinforcement learning algorithms.

**Why use DPO for diffusion models?** (1) Existing RL-based methods are limited to small prompt sets; (2) training with feedback from a reward model suffers from *mode collapse or reward hacking* and supports only limited feedback types.

DPO learns human preference from a preference dataset $\mathcal{D}$ consisting of data pairs of **winning samples** $\boldsymbol{x}_0^w$ & **losing samples** $\boldsymbol{x}_0^l$ associated with the same **condition/prompt** $\boldsymbol{c}$.

The **[Bradley-Terry (reward) model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model)** $r$ parameterized by $\phi$ learns human preferences by maximizing the likelihood of a binary classification with the sigmoid function $\sigma$:

$$
L_{\text{BT}}(\phi) = -\mathbb{E}_{\boldsymbol{c},\boldsymbol{x}_0^w,\boldsymbol{x}_0^l}[\log\frac{e^{r_{\phi}(\boldsymbol{c},\boldsymbol{x}_0^w)}}{e^{r_{\phi}(\boldsymbol{c},\boldsymbol{x}_0^w)} + e^{r_{\phi}(\boldsymbol{c},\boldsymbol{x}_0^l)}}] = -\mathbb{E}_{\boldsymbol{c},\boldsymbol{x}_0^w,\boldsymbol{x}_0^l}[\log\sigma(r_{\phi}(\boldsymbol{c},\boldsymbol{x}_0^w)-r_{\phi}(\boldsymbol{c},\boldsymbol{x}_0^l))]. \ \ \ \ \text{[Eq. (4) of the Diffusion-DPO paper]}
$$
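
As a quick check that the ratio form and the sigmoid form of the Bradley-Terry loss are the same quantity, here is a minimal PyTorch sketch. The function name `bradley_terry_loss` and the toy reward values are illustrative assumptions, not part of the paper.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_w: torch.Tensor, reward_l: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood that each winning sample beats its losing sample."""
    # -log sigma(r_w - r_l), averaged over the preference pairs.
    return -F.logsigmoid(reward_w - reward_l).mean()

# Hypothetical reward-model outputs for three preference pairs.
reward_w = torch.tensor([2.0, 0.5, -1.0])
reward_l = torch.tensor([1.0, 0.5, 0.0])

sigmoid_form = bradley_terry_loss(reward_w, reward_l)

# The same loss written as a two-way softmax over the rewards.
ratio_form = -torch.log(
    torch.exp(reward_w) / (torch.exp(reward_w) + torch.exp(reward_l))
).mean()
```

The sigmoid form is the one to use in practice because `F.logsigmoid` is numerically stable for large reward gaps.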

Given a sample $\boldsymbol{x}_0$ with condition $\boldsymbol{c}$ generated by a model $p_{\theta}(\boldsymbol{x}_0|\boldsymbol{c})$, Reinforcement Learning from Human Feedback (**RLHF**) optimizes $p_{\theta}(\boldsymbol{x}_0|\boldsymbol{c})$ to maximize the reward $r(\boldsymbol{c},\boldsymbol{x}_0)$, while a KL penalty weighted by the hyper-parameter $\beta$ keeps it close to a reference model $p_{\text{ref}}$:

$$
\max_{p_{\theta}}\mathbb{E}_{\boldsymbol{c}\sim\mathcal{D},\boldsymbol{x}_0\sim p_{\theta}(\boldsymbol{x}_0|\boldsymbol{c})}[r(\boldsymbol{c},\boldsymbol{x}_0)]-\beta\cdot\mathbb{D}_{\text{KL}}[p_{\theta}(\boldsymbol{x}_0|\boldsymbol{c})||p_{\text{ref}}(\boldsymbol{x}_0|\boldsymbol{c})]. \ \ \ \ \text{[Eq. (5) of the Diffusion-DPO paper]}
$$

Here, we use **diffusion latents** $\boldsymbol{x}_{0:T}$ to derive a solution:

\begin{equation}
\begin{aligned}
&\min_{p_{\theta}}\mathbb{E}_{p_{\theta}(\boldsymbol{x}_0|\boldsymbol{c})}[-r(\boldsymbol{c},\boldsymbol{x}_0)/\beta] + \mathbb{D}_{\text{KL}}(p_{\theta}(\boldsymbol{x}_0|\boldsymbol{c})||p_{\text{ref}}(\boldsymbol{x}_0|\boldsymbol{c})) \\
\le& \min_{p_{\theta}}\mathbb{E}_{p_{\theta}(\boldsymbol{x}_0|\boldsymbol{c})}[-r(\boldsymbol{c},\boldsymbol{x}_0)/\beta] + \mathbb{D}_{\text{KL}}(p_{\theta}(\boldsymbol{x}_{0:T}|\boldsymbol{c})||p_{\text{ref}}(\boldsymbol{x}_{0:T}|\boldsymbol{c})) \\
=& \min_{p_{\theta}}\mathbb{E}_{p_{\theta}(\boldsymbol{x}_{0:T}|\boldsymbol{c})}[-R(\boldsymbol{c},\boldsymbol{x}_{0:T})/\beta] + \mathbb{D}_{\text{KL}}(p_{\theta}(\boldsymbol{x}_{0:T}|\boldsymbol{c})||p_{\text{ref}}(\boldsymbol{x}_{0:T}|\boldsymbol{c})) \\
=&\min_{p_{\theta}}\mathbb{E}_{p_{\theta}(\boldsymbol{x}_{0:T}|\boldsymbol{c})}(\log\frac{p_{\theta}(\boldsymbol{x}_{0:T}|\boldsymbol{c})}{p_{\text{ref}}(\boldsymbol{x}_{0:T}|\boldsymbol{c})\exp(R(\boldsymbol{c},\boldsymbol{x}_{0:T})/\beta)/Z(\boldsymbol{c})}-\log Z(\boldsymbol{c}))\\
=&\min_{p_{\theta}}\mathbb{D}_{\text{KL}}(p_{\theta}(\boldsymbol{x}_{0:T}|\boldsymbol{c})||p_{\text{ref}}(\boldsymbol{x}_{0:T}|\boldsymbol{c})\exp(R(\boldsymbol{c},\boldsymbol{x}_{0:T})/\beta)/Z(\boldsymbol{c})), \ \ \ \ \text{[Eq. (17) of the Diffusion-DPO paper]}
\end{aligned}
\end{equation}

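where $R(\boldsymbol{c},\boldsymbol{x}_{0:T})$ is the reward defined on the whole diffusion chain, such that $r(\boldsymbol{c},\boldsymbol{x}_0)=\mathbb{E}_{p_{\theta}(\boldsymbol{x}_{1:T}|\boldsymbol{x}_0,\boldsymbol{c})}[R(\boldsymbol{c},\boldsymbol{x}_{0:T})]$, and $Z(\boldsymbol{c})=\sum_{\boldsymbol{x}_{0:T}}p_{\text{ref}}(\boldsymbol{x}_{0:T}|\boldsymbol{c})\exp(R(\boldsymbol{c},\boldsymbol{x}_{0:T})/\beta)$ is the partition function. The last line is minimized when the KL divergence is zero, which leads to the unique global optimal solution $p_{\theta}^*$: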

$$
p_{\theta}^*(\boldsymbol{x}_{0:T}|\boldsymbol{c})=p_{\text{ref}}(\boldsymbol{x}_{0:T}|\boldsymbol{c})\exp(R(\boldsymbol{c},\boldsymbol{x}_{0:T})/\beta)/Z(\boldsymbol{c}) \ \ \ \ \text{[Eq. (6) of the Diffusion-DPO paper]}.
$$
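
The closed-form optimum can be checked numerically on a toy discrete distribution, where the sum in $Z$ is tractable. The sketch below is only an illustration: the six-outcome space, the random rewards, and the value of `beta` are made-up assumptions, and the variable names are hypothetical.

```python
import torch

torch.manual_seed(0)
beta = 0.5

# Toy setting: six possible outcomes, a reference distribution, and a reward per outcome.
p_ref = torch.softmax(torch.randn(6), dim=0)
reward = torch.randn(6)

def objective(p: torch.Tensor) -> torch.Tensor:
    """KL-regularized reward: E_p[r] - beta * KL(p || p_ref)."""
    kl = torch.sum(p * (torch.log(p) - torch.log(p_ref)))
    return torch.sum(p * reward) - beta * kl

# Closed-form optimum: p* = p_ref * exp(r / beta) / Z.
unnormalized = p_ref * torch.exp(reward / beta)
Z = unnormalized.sum()
p_star = unnormalized / Z

best = objective(p_star)

# Randomly drawn alternative distributions should never score higher.
others = [objective(torch.softmax(torch.randn(6), dim=0)) for _ in range(1000)]
```

At the optimum the objective equals $\beta\log Z$, which is why the partition function appears as an additive constant in the reward expression that follows.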

Based on the equation above, the reward function is derived as
$$
R(\boldsymbol{c},\boldsymbol{x}_{0:T})=\beta\cdot\log\frac{p_{\theta}^*(\boldsymbol{x}_{0:T}|\boldsymbol{c})}{p_{\text{ref}}(\boldsymbol{x}_{0:T}|\boldsymbol{c})} + \beta\cdot\log Z(\boldsymbol{c}). \ \ \text{[Eq. (7) of the Diffusion-DPO paper]}
$$

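Substituting this reward into Eq. (4) of the Diffusion-DPO paper, the $\beta\cdot\log Z(\boldsymbol{c})$ terms cancel in the reward difference, and we obtain the following objective (omitting the condition $\boldsymbol{c}$ here):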

$$
L(\theta)=-\mathbb{E}_{\boldsymbol{x}_{1:T}^w\sim q(\boldsymbol{x}_{1:T}|\boldsymbol{x}_0^w), \boldsymbol{x}_{1:T}^l\sim q(\boldsymbol{x}_{1:T}|\boldsymbol{x}_0^l)}[\log\sigma(\beta\log\frac{p_{\theta}(\boldsymbol{x}_{0:T}^w)}{p_{\text{ref}}(\boldsymbol{x}^w_{0:T})} - \beta\log\frac{p_{\theta}(\boldsymbol{x}_{0:T}^l)}{p_{\text{ref}}(\boldsymbol{x}_{0:T}^l)})]. \ \ \text{[Eq. (8) of the Diffusion-DPO paper]} \\
$$
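
To illustrate why the partition function drops out, the sketch below evaluates a DPO-style loss from log-probabilities and adds an arbitrary $\log Z$ to both implicit rewards. It assumes the log-probabilities are available as plain numbers, which is not the case for diffusion chains (that is what the derivation below works around); the function names and toy values are hypothetical.

```python
import torch
import torch.nn.functional as F

def implicit_reward(logp_theta, logp_ref, beta, log_z=0.0):
    """Reward implied by the optimal policy: beta * log(p_theta / p_ref) + beta * log Z."""
    return beta * (logp_theta - logp_ref) + beta * log_z

def dpo_loss(logp_theta_w, logp_ref_w, logp_theta_l, logp_ref_l, beta, log_z=0.0):
    """Bradley-Terry loss applied to the implicit rewards of winning and losing samples."""
    r_w = implicit_reward(logp_theta_w, logp_ref_w, beta, log_z)
    r_l = implicit_reward(logp_theta_l, logp_ref_l, beta, log_z)
    return -F.logsigmoid(r_w - r_l).mean()

# Made-up log-probabilities for three preference pairs.
logp_theta_w = torch.tensor([-1.0, -2.0, -0.5])
logp_ref_w = torch.tensor([-1.5, -2.0, -1.0])
logp_theta_l = torch.tensor([-2.0, -1.0, -3.0])
logp_ref_l = torch.tensor([-1.5, -1.0, -2.0])

loss_without_z = dpo_loss(logp_theta_w, logp_ref_w, logp_theta_l, logp_ref_l, beta=0.1)
loss_with_z = dpo_loss(logp_theta_w, logp_ref_w, logp_theta_l, logp_ref_l, beta=0.1, log_z=3.7)
```

Both losses agree, so the intractable $Z(\boldsymbol{c})$ never has to be computed.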

We then derive the first formulation of the Diffusion-DPO optimization objective:

\begin{equation}
\begin{aligned}
&\min L_{1}(\theta)\\
=&-\mathbb{E}_{\boldsymbol{x}_{1:T}^w\sim q(\boldsymbol{x}_{1:T}|\boldsymbol{x}_0^w), \boldsymbol{x}_{1:T}^l\sim q(\boldsymbol{x}_{1:T}|\boldsymbol{x}_0^l)}[\log\sigma(\beta\log\frac{p_{\theta}(\boldsymbol{x}_{0:T}^w)}{p_{\text{ref}}(\boldsymbol{x}^w_{0:T})} - \beta\log\frac{p_{\theta}(\boldsymbol{x}_{0:T}^l)}{p_{\text{ref}}(\boldsymbol{x}_{0:T}^l)})] \\
=&-\log\sigma(\beta\mathbb{E}_{\boldsymbol{x}_{1:T}^w\sim q(\boldsymbol{x}_{1:T}|\boldsymbol{x}_0^w), \boldsymbol{x}_{1:T}^l\sim q(\boldsymbol{x}_{1:T}|\boldsymbol{x}_0^l)}[\log\frac{p_{\theta}(\boldsymbol{x}_{0:T}^w)}{p_{\text{ref}}(\boldsymbol{x}^w_{0:T})} - \log\frac{p_{\theta}(\boldsymbol{x}_{0:T}^l)}{p_{\text{ref}}(\boldsymbol{x}_{0:T}^l)}]) \\
=&-\log\sigma(\beta\mathbb{E}_{\boldsymbol{x}_{1:T}^w\sim q(\boldsymbol{x}_{1:T}|\boldsymbol{x}_0^w), \boldsymbol{x}_{1:T}^l\sim q(\boldsymbol{x}_{1:T}|\boldsymbol{x}_0^l)}[\sum_{t=1}^T(\log\frac{p_{\theta}(\boldsymbol{x}_{t-1}^w|\boldsymbol{x}_{t}^w)}{p_{\text{ref}}(\boldsymbol{x}_{t-1}^w|\boldsymbol{x}_{t}^w)} - \log\frac{p_{\theta}(\boldsymbol{x}_{t-1}^l|\boldsymbol{x}_{t}^l)}{p_{\text{ref}}(\boldsymbol{x}_{t-1}^l|\boldsymbol{x}_{t}^l)})]) \\
=&-\log\sigma(\beta T\mathbb{E}_t\mathbb{E}_{\boldsymbol{x}_{t-1,t}^w\sim q(\boldsymbol{x}_{t-1,t}|\boldsymbol{x}_0^w), \boldsymbol{x}_{t-1, t}^l\sim q(\boldsymbol{x}_{t-1, t}|\boldsymbol{x}_0^l)}[\log\frac{p_{\theta}(\boldsymbol{x}_{t-1}^w|\boldsymbol{x}_{t}^w)}{p_{\text{ref}}(\boldsymbol{x}_{t-1}^w|\boldsymbol{x}_{t}^w)} - \log\frac{p_{\theta}(\boldsymbol{x}_{t-1}^l|\boldsymbol{x}_{t}^l)}{p_{\text{ref}}(\boldsymbol{x}_{t-1}^l|\boldsymbol{x}_{t}^l)}]) \\
\le&-\mathbb{E}_{t,\boldsymbol{x}_{t}^w\sim q(\boldsymbol{x}_{t}|\boldsymbol{x}_0^w), \boldsymbol{x}_{t}^l\sim q(\boldsymbol{x}_{t}|\boldsymbol{x}_0^l)}\log\sigma(\beta T\mathbb{E}_{\boldsymbol{x}_{t-1}^w\sim q(\boldsymbol{x}_{t-1}|\boldsymbol{x}_{0,t}^w), \boldsymbol{x}_{t-1}^l\sim q(\boldsymbol{x}_{t-1}|\boldsymbol{x}_{0,t}^l)}[\log\frac{p_{\theta}(\boldsymbol{x}_{t-1}^w|\boldsymbol{x}_{t}^w)}{p_{\text{ref}}(\boldsymbol{x}_{t-1}^w|\boldsymbol{x}_{t}^w)} - \log\frac{p_{\theta}(\boldsymbol{x}_{t-1}^l|\boldsymbol{x}_{t}^l)}{p_{\text{ref}}(\boldsymbol{x}_{t-1}^l|\boldsymbol{x}_{t}^l)}]) \\
=& -\mathbb{E}_{(\boldsymbol{x}_0^w, \boldsymbol{x}_0^l)\sim\mathcal{D},t\sim\mathcal{U}(0,T),\boldsymbol{x}_{t}^w\sim q(\boldsymbol{x}_{t}^w|\boldsymbol{x}_0^w),\boldsymbol{x}_{t}^l\sim q(\boldsymbol{x}_{t}^l|\boldsymbol{x}_0^l)}\log\sigma(-\beta T \\
&(\mathbb{D}_{\text{KL}}(q(\boldsymbol{x}_{t-1}^{w}|\boldsymbol{x}_{0,t}^{w})||p_{\theta}(\boldsymbol{x}_{t-1}^{w}|\boldsymbol{x}_{t}^{w}))-\mathbb{D}_{\text{KL}}(q(\boldsymbol{x}_{t-1}^{w}|\boldsymbol{x}_{0,t}^{w})||p_{\text{ref}}(\boldsymbol{x}_{t-1}^{w}|\boldsymbol{x}_{t}^{w}))-\\
&(\mathbb{D}_{\text{KL}}(q(\boldsymbol{x}_{t-1}^{l}|\boldsymbol{x}_{0,t}^{l})||p_{\theta}(\boldsymbol{x}_{t-1}^{l}|\boldsymbol{x}_{t}^{l}))-\mathbb{D}_{\text{KL}}(q(\boldsymbol{x}_{t-1}^{l}|\boldsymbol{x}_{0,t}^{l})||p_{\text{ref}}(\boldsymbol{x}_{t-1}^{l}|\boldsymbol{x}_{t}^{l}))))). \\
\end{aligned}
\end{equation}
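
The KL terms in the last lines compare Gaussian transition distributions. Assuming the posterior $q(\boldsymbol{x}_{t-1}|\boldsymbol{x}_{0,t})$ and the model $p_{\theta}(\boldsymbol{x}_{t-1}|\boldsymbol{x}_{t})$ share the same fixed isotropic variance (the usual DDPM parameterization, stated here as an assumption), each KL reduces to a scaled squared distance between means, which is in turn proportional to a noise-prediction error. A minimal numerical check of that Gaussian identity, with made-up means and `sigma`:

```python
import torch
from torch.distributions import Normal, kl_divergence

torch.manual_seed(0)
sigma = 0.3                    # shared standard deviation (assumed fixed)
mu_q = torch.randn(4, 8)       # stand-in for the posterior means
mu_p = torch.randn(4, 8)       # stand-in for the model means

# Element-wise KL between factorized Gaussians, summed over the feature dimension.
kl = kl_divergence(Normal(mu_q, sigma), Normal(mu_p, sigma)).sum(dim=1)

# Closed form for equal variances: ||mu_q - mu_p||^2 / (2 * sigma^2).
closed_form = (mu_q - mu_p).pow(2).sum(dim=1) / (2 * sigma**2)
```

Because the variance terms cancel, only the squared difference of means survives, which is what allows the KL differences to be rewritten with squared noise-prediction errors.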

Thus, we have

<font color=red>

\begin{equation}
\begin{aligned}
\min L_{1}(\theta)=&-\mathbb{E}_{t,\boldsymbol{\epsilon}^{w},\boldsymbol{\epsilon}^{l}}\log\sigma[-\beta T\omega(\lambda_t)\\
&(||\boldsymbol{\epsilon}^w-\boldsymbol{\epsilon}_{\theta}(\boldsymbol{x}_t^{w},t)||_2^2 - ||\boldsymbol{\epsilon}^w-\boldsymbol{\epsilon}_{\text{ref}}(\boldsymbol{x}_t^{w},t)||_2^2 -\\
&(||\boldsymbol{\epsilon}^l-\boldsymbol{\epsilon}_{\theta}(\boldsymbol{x}_t^{l},t)||_2^2 - ||\boldsymbol{\epsilon}^l-\boldsymbol{\epsilon}_{\text{ref}}(\boldsymbol{x}_t^{l},t)||_2^2))]. \ \ \ \ \text{[Eq. (14) of the Diffusion-DPO paper]} \\
\end{aligned}
\end{equation}

</font>

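Here $\boldsymbol{x}_t^{*}=\alpha_t\boldsymbol{x}_0^{*}+\sigma_t\boldsymbol{\epsilon}^{*}$ with $\boldsymbol{\epsilon}^{*}\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{I})$ for $*\in\{w,l\}$, $\lambda_t=\alpha_t^2/\sigma_t^2$ is the signal-to-noise ratio, and $\omega(\lambda_t)$ is a weighting function.

<font color=red>Intuitively, it encourages the online model to predict the noise more accurately than the reference model on winning samples, and less accurately than the reference model on losing samples.</font>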
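
To make the sampling of $t$ and $\boldsymbol{x}_t$ concrete, here is a hedged sketch of how a batch of noisy winning and losing samples might be built. Everything in it is an assumption for illustration: the helper name `make_noisy_pairs`, the linear beta schedule, and in particular the choice to share the same timestep and noise within each pair (the objective above draws $\boldsymbol{\epsilon}^w$ and $\boldsymbol{\epsilon}^l$ separately).

```python
import torch

def make_noisy_pairs(x0_w, x0_l, alphas_cumprod, generator=None):
    """Noise winning/losing samples and stack them as [winning; losing] along the batch."""
    batch = x0_w.shape[0]
    num_steps = alphas_cumprod.shape[0]
    # One timestep and one noise draw per pair (shared by winner and loser: an assumption).
    t = torch.randint(0, num_steps, (batch,), generator=generator)
    eps = torch.randn(x0_w.shape, generator=generator)
    alpha_t = alphas_cumprod[t].sqrt().view(-1, 1, 1, 1)
    sigma_t = (1 - alphas_cumprod[t]).sqrt().view(-1, 1, 1, 1)
    xt_w = alpha_t * x0_w + sigma_t * eps
    xt_l = alpha_t * x0_l + sigma_t * eps
    return torch.cat([xt_w, xt_l]), t.repeat(2), eps.repeat(2, 1, 1, 1)

# Hypothetical DDPM-style schedule and toy "images".
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1 - betas, dim=0)
g = torch.Generator().manual_seed(0)
x0_w = torch.randn(2, 3, 8, 8, generator=g)
x0_l = torch.randn(2, 3, 8, 8, generator=g)

x_t, t, eps = make_noisy_pairs(x0_w, x0_l, alphas_cumprod, generator=g)
```

The stacked layout (winning half first, losing half second) lets both models process all samples in one forward pass, with `eps` serving as the prediction target.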

We then derive the second formulation. Replacing $q(\boldsymbol{x}_{t-1,t}|\boldsymbol{x}_0)$ by $p_{\theta}(\boldsymbol{x}_{t-1,t}|\boldsymbol{x}_0)$ in the fourth equality of the derivation of $\min L_1$, we have

\begin{equation}
\begin{aligned}
\min L_{2}(\theta)=&-\log\sigma(\beta T\mathbb{E}_t\mathbb{E}_{\boldsymbol{x}_{t-1,t}^w\sim p_{\theta}(\boldsymbol{x}_{t-1,t}|\boldsymbol{x}_0^w), \boldsymbol{x}_{t-1, t}^l\sim p_{\theta}(\boldsymbol{x}_{t-1, t}|\boldsymbol{x}_0^l)}[\log\frac{p_{\theta}(\boldsymbol{x}_{t-1}^w|\boldsymbol{x}_{t}^w)}{p_{\text{ref}}(\boldsymbol{x}_{t-1}^w|\boldsymbol{x}_{t}^w)} - \log\frac{p_{\theta}(\boldsymbol{x}_{t-1}^l|\boldsymbol{x}_{t}^l)}{p_{\text{ref}}(\boldsymbol{x}_{t-1}^l|\boldsymbol{x}_{t}^l)}]). \\
\end{aligned}
\end{equation}

We find that using $q(\boldsymbol{x}_{t}|\boldsymbol{x}_0)p_{\theta}(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t)$ to approximate $p_{\theta}(\boldsymbol{x}_{t-1,t}|\boldsymbol{x}_0)$ yields lower error because

$$
\mathbb{D}_{\text{KL}}(q(\boldsymbol{x}_{t}|\boldsymbol{x}_0)p_{\theta}(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t)||p_{\theta}(\boldsymbol{x}_{t-1,t}|\boldsymbol{x}_0)) = \mathbb{D}_{\text{KL}}(q(\boldsymbol{x}_{t}|\boldsymbol{x}_0)||p_{\theta}(\boldsymbol{x}_{t}|\boldsymbol{x}_0)) < \mathbb{D}_{\text{KL}}(q(\boldsymbol{x}_{t-1,t}|\boldsymbol{x}_0)||p_{\theta}(\boldsymbol{x}_{t-1,t}|\boldsymbol{x}_0)).
$$
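
The comparison above follows from the chain rule of the KL divergence. A short derivation, assuming (as the equality above does) that $p_{\theta}(\boldsymbol{x}_{t-1,t}|\boldsymbol{x}_0)=p_{\theta}(\boldsymbol{x}_{t}|\boldsymbol{x}_0)p_{\theta}(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t)$:

$$
\begin{aligned}
&\mathbb{D}_{\text{KL}}(q(\boldsymbol{x}_{t-1,t}|\boldsymbol{x}_0)||p_{\theta}(\boldsymbol{x}_{t-1,t}|\boldsymbol{x}_0)) \\
=&\ \mathbb{D}_{\text{KL}}(q(\boldsymbol{x}_{t}|\boldsymbol{x}_0)||p_{\theta}(\boldsymbol{x}_{t}|\boldsymbol{x}_0)) + \mathbb{E}_{q(\boldsymbol{x}_{t}|\boldsymbol{x}_0)}[\mathbb{D}_{\text{KL}}(q(\boldsymbol{x}_{t-1}|\boldsymbol{x}_{0,t})||p_{\theta}(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t))] \\
\ge&\ \mathbb{D}_{\text{KL}}(q(\boldsymbol{x}_{t}|\boldsymbol{x}_0)||p_{\theta}(\boldsymbol{x}_{t}|\boldsymbol{x}_0)).
\end{aligned}
$$

Replacing $q(\boldsymbol{x}_{t-1}|\boldsymbol{x}_{0,t})$ with $p_{\theta}(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t)$ sets the expected conditional KL term to zero, which gives the equality; the inequality is strict whenever the model's reverse step differs from the true posterior.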

This time, we use $q(\boldsymbol{x}_{t}|\boldsymbol{x}_0)p_{\theta}(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t)$ to approximate $p_{\theta}(\boldsymbol{x}_{t-1,t}|\boldsymbol{x}_0)$ and rewrite the formulation:

\begin{equation}
\begin{aligned}
&\min L_{2}(\theta)\\
=&-\log\sigma(\beta T\mathbb{E}_t\mathbb{E}_{\boldsymbol{x}_{t-1,t}^w\sim q(\boldsymbol{x}_{t}|\boldsymbol{x}_0^w)p_{\theta}(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t^w), \boldsymbol{x}_{t-1, t}^l\sim q(\boldsymbol{x}_{t}|\boldsymbol{x}_0^l)p_{\theta}(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t^l)}[\log\frac{p_{\theta}(\boldsymbol{x}_{t-1}^w|\boldsymbol{x}_{t}^w)}{p_{\text{ref}}(\boldsymbol{x}_{t-1}^w|\boldsymbol{x}_{t}^w)} - \log\frac{p_{\theta}(\boldsymbol{x}_{t-1}^l|\boldsymbol{x}_{t}^l)}{p_{\text{ref}}(\boldsymbol{x}_{t-1}^l|\boldsymbol{x}_{t}^l)}]) \\
=&-\log\sigma(\beta T\mathbb{E}_{t,\boldsymbol{x}_{t}^w\sim q(\boldsymbol{x}_{t}|\boldsymbol{x}_0^w), \boldsymbol{x}_{t}^l\sim q(\boldsymbol{x}_{t}|\boldsymbol{x}_0^l)}\mathbb{E}_{\boldsymbol{x}_{t-1}^w\sim p_{\theta}(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t^w), \boldsymbol{x}_{t-1}^l\sim p_{\theta}(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t^l)}[\log\frac{p_{\theta}(\boldsymbol{x}_{t-1}^w|\boldsymbol{x}_{t}^w)}{p_{\text{ref}}(\boldsymbol{x}_{t-1}^w|\boldsymbol{x}_{t}^w)} - \log\frac{p_{\theta}(\boldsymbol{x}_{t-1}^l|\boldsymbol{x}_{t}^l)}{p_{\text{ref}}(\boldsymbol{x}_{t-1}^l|\boldsymbol{x}_{t}^l)}]) \\
\le&-\mathbb{E}_{t,\boldsymbol{x}_{t}^w\sim q(\boldsymbol{x}_{t}|\boldsymbol{x}_0^w), \boldsymbol{x}_{t}^l\sim q(\boldsymbol{x}_{t}|\boldsymbol{x}_0^l)}\log\sigma(\beta T\mathbb{E}_{\boldsymbol{x}_{t-1}^w\sim p_{\theta}(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t^w), \boldsymbol{x}_{t-1}^l\sim p_{\theta}(\boldsymbol{x}_{t-1}|\boldsymbol{x}_t^l)}[\log\frac{p_{\theta}(\boldsymbol{x}_{t-1}^w|\boldsymbol{x}_{t}^w)}{p_{\text{ref}}(\boldsymbol{x}_{t-1}^w|\boldsymbol{x}_{t}^w)} - \log\frac{p_{\theta}(\boldsymbol{x}_{t-1}^l|\boldsymbol{x}_{t}^l)}{p_{\text{ref}}(\boldsymbol{x}_{t-1}^l|\boldsymbol{x}_{t}^l)}]). \\
\end{aligned}
\end{equation}

Similarly, we have

<font color=red>

\begin{equation}
\begin{aligned}
\min L_{2}(\theta)=&-\mathbb{E}_{t,\boldsymbol{\epsilon}^w,\boldsymbol{\epsilon}^l}\log\sigma(-\beta T \omega(\lambda_t)(||\boldsymbol{\epsilon}_{\theta}(\boldsymbol{x}_t^w,t)-\boldsymbol{\epsilon}_{\text{ref}}(\boldsymbol{x}_t^w,t)||^2_2 - ||\boldsymbol{\epsilon}_{\theta}(\boldsymbol{x}_t^l,t)-\boldsymbol{\epsilon}_{\text{ref}}(\boldsymbol{x}_t^l,t)||^2_2)). \\
\end{aligned}
\end{equation}

</font>

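<font color=red>Intuitively, it aligns the outputs of the online model and the reference model on winning samples, and pushes them apart on losing samples.</font> However, since the online model and the reference model are initialized from the same model, $\boldsymbol{\epsilon}_{\theta}=\boldsymbol{\epsilon}_{\text{ref}}$ at the start of training, so both squared distances and their gradients with respect to $\theta$ are zero, and this loss cannot be optimized directly.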
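
The vanishing gradient at initialization can be seen with a few lines of autograd. In this sketch `l2_loss` is a hypothetical helper written for illustration, and the single scalar `beta_t` stands in for the product $\beta T\omega(\lambda_t)$; random tensors play the role of model outputs.

```python
import torch
import torch.nn.functional as F

def l2_loss(model_pred, ref_pred, beta_t):
    """Second formulation: compares online and reference predictions directly."""
    dist = (model_pred - ref_pred).pow(2).mean(dim=[1, 2, 3])
    dist_w, dist_l = dist.chunk(2)  # winning half first, losing half second
    return -F.logsigmoid(-beta_t * (dist_w - dist_l)).mean()

torch.manual_seed(0)
ref_pred = torch.randn(4, 3, 8, 8)

# At initialization the online model reproduces the reference model exactly.
model_pred = ref_pred.clone().requires_grad_(True)

loss = l2_loss(model_pred, ref_pred, beta_t=1.0)
loss.backward()
grad_norm = model_pred.grad.norm()
```

The loss sits at $\log 2$ and the gradient is exactly zero, so gradient descent has no direction to move in from this starting point.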

**Experiments**

- Use 851K sample pairs with 59K unique prompts from the Pick-a-Pic dataset.
- A <font color=red>learning rate of $\frac{2000}{\beta}2.048\cdot10^{-8}$</font> is used with 25% linear warmup.
- <font color=red>$\beta=2000$ for SD1.5 and $\beta=5000$ for SDXL</font>.
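
As a worked example of the learning-rate rule, the sketch below evaluates it for the two $\beta$ values listed and adds a simple linear warmup factor. The helper names are hypothetical, and reading "25% linear warmup" as a ramp over the first 25% of training steps is an assumption of this sketch.

```python
def diffusion_dpo_lr(beta: float, base: float = 2.048e-8, ref_beta: float = 2000.0) -> float:
    """Learning rate scaled inversely with beta: (2000 / beta) * 2.048e-8."""
    return ref_beta / beta * base

def warmup_factor(step: int, total_steps: int, warmup_frac: float = 0.25) -> float:
    """Linear ramp from near 0 to 1 over the first `warmup_frac` of training (assumed)."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    return min(1.0, (step + 1) / warmup_steps)

lr_sd15 = diffusion_dpo_lr(beta=2000.0)  # 2.048e-8
lr_sdxl = diffusion_dpo_lr(beta=5000.0)  # about 8.192e-9

# Effective learning rate at a given step of a hypothetical 1000-step run.
lr_at_step_100 = lr_sd15 * warmup_factor(step=100, total_steps=1000)
```

Scaling the learning rate by $1/\beta$ keeps the size of the updates comparable when $\beta$ changes, since $\beta$ multiplies the term inside the sigmoid.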

## PyTorch Implementations

## --------------------------------------------------------------------------------
## Simple implementation of Diffusion-DPO loss
## Modified from https://github.com/SalesforceAIResearch/DiffusionDPO
## --------------------------------------------------------------------------------
import torch.nn.functional as F

def get_dpo_loss(model_pred, ref_pred, target, beta_dpo):
    """
    Calculate the Diffusion-DPO loss of the first formulation (L_1 above; not an L1-norm loss).

    The batch dimension stacks winning samples first and losing samples second,
    so every tensor has shape (2 * B, C, H, W).

    model_pred (`torch.Tensor`): online model prediction on both winning and losing samples.
    ref_pred (`torch.Tensor`): reference model prediction on both winning and losing samples.
    target (`torch.Tensor`): ground-truth prediction target (e.g., the added noise) for both winning and losing samples.
    beta_dpo (`float`): beta hyper-parameter.
    """
    ## Per-sample target prediction loss of the online model on winning & losing samples.
    model_losses = (model_pred - target).pow(2).mean(dim=[1, 2, 3])
    model_losses_w, model_losses_l = model_losses.chunk(2)

    ## Per-sample target prediction loss of the reference model on winning & losing samples.
    ref_losses = (ref_pred - target).pow(2).mean(dim=[1, 2, 3])
    ref_losses_w, ref_losses_l = ref_losses.chunk(2)

    ## Negative `term`: the online model improves over the reference more on winning than on losing samples.
    term = model_losses_w - ref_losses_w - (model_losses_l - ref_losses_l)
    ## `-0.5 * beta_dpo` is the scale applied inside the sigmoid.
    loss = -F.logsigmoid(-0.5 * beta_dpo * term).mean()

    return loss
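
A small usage sketch with random tensors shows how the loss behaves. To keep the snippet self-contained it restates `get_dpo_loss` in compact form; the tensor shapes, the `beta_dpo` value, and the way the "improved" prediction is constructed are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def get_dpo_loss(model_pred, ref_pred, target, beta_dpo):
    # Compact restatement of the function above, repeated so this snippet runs on its own.
    model_w, model_l = (model_pred - target).pow(2).mean(dim=[1, 2, 3]).chunk(2)
    ref_w, ref_l = (ref_pred - target).pow(2).mean(dim=[1, 2, 3]).chunk(2)
    term = model_w - ref_w - (model_l - ref_l)
    return -F.logsigmoid(-0.5 * beta_dpo * term).mean()

torch.manual_seed(0)
target = torch.randn(4, 4, 8, 8)    # noise targets: 2 winning samples, then 2 losing samples
ref_pred = torch.randn(4, 4, 8, 8)  # stand-in for the frozen reference model's output

# At initialization the online model equals the reference model, so the loss is log(2).
loss_init = get_dpo_loss(ref_pred.clone(), ref_pred, target, beta_dpo=5000.0)

# Move the online prediction halfway toward the target on the winning half only.
model_pred = ref_pred.clone()
model_pred[:2] = 0.5 * (ref_pred[:2] + target[:2])
loss_better = get_dpo_loss(model_pred, ref_pred, target, beta_dpo=5000.0)
```

Unlike the second formulation, this loss has a non-zero gradient at initialization because each error is measured against the target rather than against the reference prediction.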