# DPOP

```{note}
In this work, first we show theoretically that the standard DPO loss can lead to a reduction
of the model’s likelihood of the preferred examples, as long as the relative probability between the
preferred and dispreferred classes increases.<br>
We design DPO-Positive (DPOP), a new loss function and
training procedure which avoids this failure mode.
```

## Failure Mode of DPO

The DPO loss

$$\mathcal{L}_{\text{DPO}}(\pi_{\theta};\pi_{\text{ref}}) = -\mathbb{E}_{(x, y_{w},y_{l})\sim\mathcal{D}}\left[\log\sigma\left(\beta\log\frac{\pi_{\theta}(y_{w}|x)}{\pi_{\text{ref}}(y_{w}|x)} -  \beta\log\frac{\pi_{\theta}(y_{l}|x)}{\pi_{\text{ref}}(y_{l}|x)}\right)\right]$$

is a function only of the difference
in the log-ratios. Writing $\pi_{\text{ratio}}(y|x) = \pi_{\theta}(y|x)/\pi_{\text{ref}}(y|x)$, this means that we can achieve a low loss value even if $\pi_{\text{ratio}}(y_w|x)$ is lowered below
1, as long as $\pi_{\text{ratio}}(y_l|x)$ is lowered even further.
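To make this concrete, here is a minimal sketch of the per-example DPO loss computed from sequence log-probabilities. The function names, the value of `beta`, and all the numbers are illustrative assumptions, not values from the paper. The two cases have the same difference in log-ratios, so they receive the same loss, even though in the second case the preferred completion has become far less likely than under the reference model.

```python
import math

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities (hypothetical helper)."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)

# Reference log-probabilities (made-up numbers for illustration).
ref_w, ref_l = -10.0, -12.0

# Case A: preferred goes up by 1 nat, dispreferred goes down by 1 nat.
loss_a = dpo_loss(-9.0, -13.0, ref_w, ref_l)

# Case B: preferred goes DOWN by 5 nats, dispreferred goes down by 7 nats.
loss_b = dpo_loss(-15.0, -19.0, ref_w, ref_l)

print(loss_a, loss_b)  # identical: only the difference of log-ratios matters
```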

**Edit Distance 1** Now we provide a specific case in
which DPO may cause a decrease in the probability of the better completion. Consider the case of trying
to improve a model’s math or reasoning abilities by comparing the completion “2+2=4” with “2+2=5”. These have an edit (Hamming) distance
of 1, i.e., all tokens in the two completions are the same except for one.

Consider two completions with an edit distance of 1 which differ at token $m$, i.e. consider $y_{w}=(t_{1},\dots,t_{K})$ and $y_{l}=(t_{1},\dots,t_{m-1},t_{m}',t_{m+1},\dots, t_{K})$. Denote $y^{<r} = (t_{1},\dots,t_{r-1})$ and $y^{\ge r} = (t_{r},\dots, t_{K})$. Let $s_{i}^{\{x\}}$ represent the probability of the $i$-th token in the model’s vocabulary given the input $x$.
While the LLM parameters $\theta$ are numerous, we restrict our attention to the logits $\theta_{j}$, where $1\le j\le \text{vocab length}$. The gradient of the DPO loss with respect to $\theta$ is proportional to the following:

$$\nabla_{\theta}\mathcal{L}_{\text{DPO}}(\pi_{\theta};\pi_{\text{ref}}) \propto -\left[\nabla_{\theta}\log\pi_{\theta}(y_{w}|x) - \nabla_{\theta}\log\pi_{\theta}(y_{l}|x)\right]$$

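We note first that the tokens from 1 to $m-1$ are identical in both completions and so have no effect on the gradient. Therefore, without loss of generality, assume $m=1$, i.e. $y_{w}$ and $y_{l}$ differ only at the first token. Without loss of generality, we also assume that $t_{k}$ takes vocabulary position 1. Using the softmax identity $\nabla_{\theta_{j}}\log s_{1} = \mathbb{1}[j=1] - s_{j}$, the indicator terms cancel, and for each $k>1$: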

$$\nabla_{\theta_{j}}\log\pi_{\theta}(t_{k}|y_{w}^{<k},x) - \nabla_{\theta_{j}}\log\pi_{\theta}(t_{k}|y_{l}^{<k},x) = s_{j}^{\{y_{l}^{<k},x\}} - s_{j}^{\{y_{w}^{<k},x\}}$$

As we typically run DPO after SFT, the model is likely to be reasonably well optimised, so we have $s_{j}^{\{y_{w}^{<k},x\}} \le s_{j}^{\{y_{l}^{<k},x\}}$ for $j\ne 1$ and $s_{1}^{\{y_{w}^{<k},x\}} \ge s_{1}^{\{y_{l}^{<k},x\}}$. Hence $\nabla_{\theta_{j}}\mathcal{L}_{\text{DPO}}(\pi_{\theta};\pi_{\text{ref}}) \le 0$ for $j\ne 1$ and $\nabla_{\theta_{1}}\mathcal{L}_{\text{DPO}}(\pi_{\theta};\pi_{\text{ref}}) \ge 0$. A gradient-descent step therefore decreases the correct logit
and increases the incorrect logits. Surprisingly, this suggests that under DPO, all tokens that
follow a mismatched token should have reduced probability of emitting the correct token when compared to $\pi_{\text{ref}}$.
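The sign argument above can be checked numerically. The sketch below is illustrative only: the two logit vectors are made-up stand-ins for the model's next-token logits after the preferred and dispreferred prefixes, and index 0 plays the role of vocabulary position 1 (the correct token $t_k$). Following the text, both sets of logits are treated as the same parameters $\theta_j$.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(logits, target):
    """Gradient of log softmax(logits)[target] w.r.t. each logit: 1[j == target] - s_j."""
    s = softmax(logits)
    return [(1.0 if j == target else 0.0) - s_j for j, s_j in enumerate(s)]

# Hypothetical logits: after the correct prefix the model is confident in token 0,
# after the mismatched prefix it is much less sure.
logits_w = [3.0, 0.5, 0.0]
logits_l = [1.0, 0.8, 0.5]
s_w, s_l = softmax(logits_w), softmax(logits_l)

# Bracketed term of the DPO gradient: grad log pi(t_k | y_w) - grad log pi(t_k | y_l).
diff = [gw - gl for gw, gl in zip(grad_log_prob(logits_w, 0), grad_log_prob(logits_l, 0))]

# The DPO loss gradient is proportional to the negative of that term.
loss_grad = [-d for d in diff]

print(diff)       # equals s_l - s_w component-wise
print(loss_grad)  # positive at index 0, negative elsewhere
```

With these numbers the loss gradient is positive for the correct token and negative for the others, so a descent step lowers the correct logit, which is the failure mode described above.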

## DPOP

The DPOP loss function:

$$
\begin{aligned}
\mathcal{L}_{\text{DPOP}}(\pi_{\theta};\pi_{\text{ref}}) = -\mathbb{E}_{(x, y_{w},y_{l})\sim\mathcal{D}}\Bigg[\log\sigma\Bigg(\beta\Bigg(&\log\frac{\pi_{\theta}(y_{w}|x)}{\pi_{\text{ref}}(y_{w}|x)} - \log\frac{\pi_{\theta}(y_{l}|x)}{\pi_{\text{ref}}(y_{l}|x)} \\
&- \lambda\cdot\max\left(0, \log\frac{\pi_{\text{ref}}(y_{w}|x)}{\pi_{\theta}(y_{w}|x)}\right)\Bigg)\Bigg)\Bigg]
\end{aligned}
$$

where $\lambda > 0$ is a hyperparameter that can be tuned. The model can no longer minimise the loss by significantly reducing
the log-likelihood of the dispreferred examples more than it reduces the log-likelihood of the preferred
examples; it must also ensure that the log-likelihood of the preferred examples remains high relative to the
log-likelihood under the reference model.
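As a rough illustration, the sketch below computes a per-example DPOP loss from sequence log-probabilities. It assumes the penalty term sits inside the sigmoid argument and is scaled by $\beta$ together with the log-ratio difference; the helper names, `beta`, `lam`, and all numbers are hypothetical choices for this example rather than recommended settings.

```python
import math

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpop_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, lam=5.0):
    """Per-example DPOP loss (sketch; penalty assumed inside the sigmoid)."""
    # Penalty is active only when the policy assigns the preferred completion
    # a lower log-probability than the reference model does.
    penalty = max(0.0, ref_logp_w - logp_w)
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l) - lam * penalty
    return -log_sigmoid(beta * margin)

ref_w, ref_l = -10.0, -12.0  # made-up reference log-probabilities

# Case A: preferred likelihood rises, so the penalty is zero.
loss_a = dpop_loss(-9.0, -13.0, ref_w, ref_l)

# Case B: same log-ratio difference, but the preferred likelihood has dropped.
loss_b = dpop_loss(-15.0, -19.0, ref_w, ref_l)

print(loss_a, loss_b)  # case B is now penalised and has the larger loss
```

Both cases have the same difference in log-ratios, so plain DPO would score them identically; under this sketch only case B pays the penalty.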

For our example, if $\pi_{\text{ratio}}(y_{w}|x) < 1$, so that the penalty term is active, the DPOP gradients become:

$$
\begin{aligned}
\nabla_{\theta_{j}}&\left[\log\pi_{\theta}(t_{k}|y_{w}^{<k},x) - \log\pi_{\theta}(t_{k}|y_{l}^{<k},x) +\lambda\cdot\log\pi_{\theta}(t_{k}|y_{w}^{<k},x)\right]\\
=&
\begin{cases}
\lambda(1 - s_{j}^{\{y_{w}^{<k},x\}}) + s_{j}^{\{y_{l}^{<k},x\}} - s_{j}^{\{y_{w}^{<k},x\}}& j=1\\
-(\lambda+1)s_{j}^{\{y_{w}^{<k},x\}} + s_{j}^{\{y_{l}^{<k},x\}}&j\ne 1\\
\end{cases}
\end{aligned}
$$

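These are gradients of the quantity inside the sigmoid, which training pushes up. Since $s_{j}^{\{y_{w}^{<k},x\}} < 1$, for the case $j=1$, the gradient is guaranteed
to be positive for a large enough choice of $\lambda$, so the correct logit is increased. Similarly, for the case $j\ne 1$, since $s_{j}^{\{y_{w}^{<k},x\}} > 0$, the gradient is guaranteed to be
negative for a large enough $\lambda$, so the incorrect logits are decreased.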
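How large is "large enough" depends on the token probabilities. Rearranging the two cases above gives $\lambda > (s_{1}^{w} - s_{1}^{l})/(1 - s_{1}^{w})$ for $j=1$ and $\lambda > s_{j}^{l}/s_{j}^{w} - 1$ for $j\ne 1$, where $s^{w}$ and $s^{l}$ abbreviate the probabilities after the preferred and dispreferred prefixes. The sketch below evaluates both cases for made-up probability vectors; the function names and numbers are assumptions for illustration, and index 0 stands for vocabulary position 1.

```python
def dpop_direction(s_w, s_l, lam, target=0):
    """Per-logit gradient from the case formula above (sketch)."""
    grads = []
    for j, (sw, sl) in enumerate(zip(s_w, s_l)):
        if j == target:
            grads.append(lam * (1.0 - sw) + sl - sw)
        else:
            grads.append(-(lam + 1.0) * sw + sl)
    return grads

def min_lambda(s_w, s_l, target=0):
    """Smallest lambda above which every component has the desired sign."""
    bounds = []
    for j, (sw, sl) in enumerate(zip(s_w, s_l)):
        if j == target:
            bounds.append((sw - sl) / (1.0 - sw))
        else:
            bounds.append(sl / sw - 1.0)
    return max(bounds)

# Hypothetical next-token probabilities after the two prefixes.
s_w = [0.88, 0.07, 0.05]
s_l = [0.41, 0.34, 0.25]

dpo_like = dpop_direction(s_w, s_l, lam=0.0)   # lambda = 0 recovers the DPO case
with_penalty = dpop_direction(s_w, s_l, lam=5.0)
threshold = min_lambda(s_w, s_l)

print(dpo_like)      # negative for the correct token, positive for the others
print(with_penalty)  # signs flipped: correct token up, others down
print(threshold)
```

For these numbers a value of $\lambda$ around 5 is enough to flip every sign, but the threshold grows as $s_{1}^{w}$ approaches 1 or as any $s_{j}^{w}$ approaches 0, so it should be read as a property of this toy example only.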