# Secrets of RLHF in Large Language Models: Reward Modeling

```{note}
While reward models are often considered central to achieving high
performance, they face the following challenges in practice:

1. Incorrect and ambiguous preference pairs in the dataset may hinder the reward
   model from accurately capturing human intent.
2. Reward models trained on data from a specific distribution often struggle to
   generalize to examples outside that distribution and are not suitable for
   iterative RLHF training.
```

## Preliminaries

In the reward modeling stage, the SFT model $\pi^{\text{SFT}}$ is prompted
with a user query $x$ to produce two distinct outputs $(y_1,y_2)\sim\pi^{\text{SFT}}(y|x)$. Human labelers
are instructed to choose their preferred output, giving $y_c\succ y_r$, where $y_c$ and $y_r$ denote the
chosen and rejected outputs, respectively, from the pair $(y_1,y_2)$. Following the Bradley-Terry
model, we express the preference distribution through the reward function $r_{\psi}(x,y)$:

$$
p_{\psi}(y_c\succ y_r|x) = \sigma(r_{\psi}(x, y_c) - r_{\psi}(x, y_r))
$$

where $\sigma$ is the logistic function. Treating the problem as a binary classification task yields the negative
log-likelihood loss function:

$$
L(r_{\psi}) = -\mathbb{E}_{(x,y)\sim\mathcal{D}}[\log \sigma(r_{\psi}(x, y_c) - r_{\psi}(x, y_r))]
$$
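
As a minimal sketch of the loss above, the following computes the mean negative log-likelihood from scalar rewards. The function names `log_sigmoid` and `bt_loss` and the example reward values are illustrative assumptions; in practice the rewards would come from the reward model $r_{\psi}$.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # log(sigmoid(z)) = -log(1 + exp(-z)); branch on the sign to avoid overflow.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def bt_loss(chosen_rewards, rejected_rewards):
    """Mean of -log sigma(r(x, y_c) - r(x, y_r)) over all pairs."""
    losses = [
        -log_sigmoid(rc - rr)
        for rc, rr in zip(chosen_rewards, rejected_rewards)
    ]
    return sum(losses) / len(losses)

# Hypothetical rewards for three comparison pairs.
chosen = [1.2, 0.3, -0.5]
rejected = [0.4, 0.9, -0.5]
print(bt_loss(chosen, rejected))
```

When the two rewards of a pair are equal, that pair contributes $\log 2$ to the loss; the contribution shrinks as the chosen reward pulls ahead of the rejected one.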

## Measuring the Strength of Preferences

The **preference strength (difference)** between chosen and rejected responses can be quantified using

$$
d_{i,\psi} = r_{\psi}(x^{(i)}, y_{\text{c}}^{(i)}) - r_{\psi}(x^{(i)}, y_{\text{r}}^{(i)})
$$

We train $M$ reward models on the same preference data, randomizing the training order for each. From the ensemble of reward scores produced by these $M$
models, we compute the mean and standard deviation (std) of the preference strength for each
comparison pair:

$$
\hat{u}_{i} = \frac{1}{M}\sum_{m=1}^{M}d_{i,\psi_{m}},\quad\hat{\sigma}_{i} = \sqrt{\frac{\sum_{m=1}^{M}(d_{i,\psi_{m}} - \hat{u}_{i})^{2}}{M}}
$$
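
The two statistics can be sketched with the standard library. Here `diffs_per_model[m][i]` is assumed to hold $d_{i,\psi_m}$, and the function name and sample values are hypothetical. `statistics.pstdev` is used because the formula divides by $M$ (population standard deviation) rather than $M-1$.

```python
from statistics import fmean, pstdev

def preference_strength_stats(diffs_per_model):
    """Return [(mean_i, std_i), ...] for each comparison pair i.

    diffs_per_model[m][i] is the preference difference d_{i, psi_m}
    assigned to pair i by the m-th reward model.
    """
    # zip(*...) regroups the values so each tuple holds one pair's M differences.
    per_pair = zip(*diffs_per_model)
    return [(fmean(d), pstdev(d)) for d in per_pair]

# Hypothetical differences from M = 3 models on 2 comparison pairs.
diffs = [
    [1.0, -0.5],
    [2.0, -1.0],
    [3.0, 0.0],
]
for mean, std in preference_strength_stats(diffs):
    print(f"mean={mean:.3f} std={std:.3f}")
```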

We observe that the mean preference difference is less than 0 for approximately 25% of the data.
For these pairs, the ensemble on average scores the rejected response above the chosen one.

![](../images/secret1.png)

![](../images/secret2x.png)

## Impacts of Different Data on RM Performance

We can use preference strength to partition the training data into groups
and examine how much each group contributes to preference modeling. For each group,
whose size is 10% of the original training data, we train a reward model from scratch
and then evaluate its performance on the validation set.
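
A minimal sketch of such a partition, assuming the groups are formed by sorting pairs on their mean preference strength and cutting the sorted order into equal-size slices. The function name `split_by_strength` is hypothetical.

```python
def split_by_strength(mean_strengths, num_groups=10):
    """Split pair indices into num_groups slices, lowest strength first."""
    # Indices of the pairs, sorted by ascending mean preference strength.
    order = sorted(range(len(mean_strengths)), key=lambda i: mean_strengths[i])
    n = len(order)
    # Slice boundaries; each slice covers roughly n / num_groups pairs.
    bounds = [round(g * n / num_groups) for g in range(num_groups + 1)]
    return [order[bounds[g]:bounds[g + 1]] for g in range(num_groups)]

# Hypothetical mean strengths for four pairs, split into two groups.
print(split_by_strength([0.5, -1.0, 2.0, 0.1], num_groups=2))
```

Each returned list holds the indices of the pairs used to train one reward model.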

![](../images/secret3.png)

The results show that:

1. The 20% of data with the lowest preference
strength has a negative impact on the model’s performance on the validation set.

2. For data ranked between 20% and 40%, the trained
model’s prediction accuracy on the validation set is approximately 0.5.

3. The remaining data significantly improves the model’s
performance. However, the 10% of data with the highest preference strength does not achieve
the best performance when used for training alone.

Based on the above results, we can roughly categorize
preference data into three types: incorrect data, ambiguous data (almost no difference), and normal
data (clear differences).
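
The three categories can be sketched as a rank-based labeling. The 20% and 40% cut-offs below are taken from the observations above and are an assumption rather than fixed thresholds; the function name `categorize` is hypothetical.

```python
def categorize(mean_strengths, incorrect_frac=0.2, ambiguous_frac=0.4):
    """Label each pair by its rank in ascending mean preference strength."""
    n = len(mean_strengths)
    order = sorted(range(n), key=lambda i: mean_strengths[i])
    labels = [""] * n
    for rank, i in enumerate(order):
        frac = rank / n  # fraction of pairs ranked strictly below this one
        if frac < incorrect_frac:
            labels[i] = "incorrect"
        elif frac < ambiguous_frac:
            labels[i] = "ambiguous"
        else:
            labels[i] = "normal"
    return labels

# Hypothetical mean strengths for five pairs.
print(categorize([-2.0, -0.1, 0.0, 1.0, 3.0]))
```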

## Analyze and Leverage Diverse Data to Its Fullest Potential

### Flipping the Labels

Flipping the labels of the 20% of data with the lowest preference strength lets the model learn preference information more effectively, as demonstrated below.

![](../images/secret4.png)
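
A minimal sketch of label flipping, assuming each example is stored as a `(prompt, chosen, rejected)` tuple and that the mean preference strength per pair has already been computed. The function name `flip_lowest` and the data layout are hypothetical.

```python
def flip_lowest(pairs, mean_strengths, frac=0.2):
    """Swap chosen and rejected for the frac of pairs with the lowest strength."""
    n = len(pairs)
    order = sorted(range(n), key=lambda i: mean_strengths[i])
    to_flip = set(order[:int(frac * n)])
    flipped = []
    for i, (prompt, chosen, rejected) in enumerate(pairs):
        if i in to_flip:
            # The former rejected response becomes the chosen one.
            flipped.append((prompt, rejected, chosen))
        else:
            flipped.append((prompt, chosen, rejected))
    return flipped

# Hypothetical data: five pairs, the first has the lowest strength.
pairs = [(f"q{i}", f"a{i}", f"b{i}") for i in range(5)]
strengths = [-2.0, 0.5, 1.0, 2.0, 3.0]
print(flip_lowest(pairs, strengths))
```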

### Label Smoothing

Label smoothing is another widely used technique that mitigates overfitting by penalizing
overconfident model outputs:

$$
L_{\text{LS}} = -\mathbb{E}_{(x,y)\sim\mathcal{D}}[(1-\alpha)\log(p_{\psi}(y_{c}\succ y_r|x)) + \alpha\log(1-p_{\psi}(y_{c}\succ y_r|x))]
$$

where $\alpha$ is the smoothing parameter.
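
The smoothed loss can be sketched as follows, using the identity $1-\sigma(d)=\sigma(-d)$ so that both terms are computed through a stable log-sigmoid. The function names and the default `alpha=0.1` are illustrative assumptions, not values from the source.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def label_smoothing_loss(chosen_rewards, rejected_rewards, alpha=0.1):
    """Mean of -[(1 - alpha) * log p + alpha * log(1 - p)], p = sigma(r_c - r_r)."""
    total = 0.0
    count = 0
    for rc, rr in zip(chosen_rewards, rejected_rewards):
        d = rc - rr
        # log p = log sigma(d) and log(1 - p) = log sigma(-d).
        total += -((1 - alpha) * log_sigmoid(d) + alpha * log_sigmoid(-d))
        count += 1
    return total / count

# Hypothetical rewards for two pairs.
print(label_smoothing_loss([1.2, 0.3], [0.4, 0.9], alpha=0.1))
```

With `alpha=0` this reduces to the Bradley-Terry loss. For `alpha > 0` the loss grows again once the reward gap becomes very large, which is how overconfident outputs are penalized.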

![](../images/secret-margin-2.png)

### Adaptive Margin

Using preference
strength information, we can guide the reward model to assign more widely separated scores to response pairs
with higher preference strength, which has been shown to benefit preference modeling. We therefore add an adaptive margin component to the loss of the reward model:

$$
L(r_{\psi}) = -\mathbb{E}_{(x,y)\sim\mathcal{D}}[\log \sigma(r_{\psi}(x, y_c) - r_{\psi}(x, y_r) - \hat{u}(x,y))]
$$

where the margin function $\hat{u}(x,y)$ serves as a continuous measure of preference strength. Adding a margin to all the data effectively enhances the performance of
preference modeling:

![](../images/secret-margin-3.png)
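
A minimal sketch of the margin loss, assuming the per-pair margin $\hat{u}$ is supplied as a precomputed list (for example, the ensemble mean preference strength). The function names and sample values are hypothetical.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def adaptive_margin_loss(chosen_rewards, rejected_rewards, margins):
    """Mean of -log sigma(r_c - r_r - margin) over all pairs."""
    losses = [
        -log_sigmoid(rc - rr - m)
        for rc, rr, m in zip(chosen_rewards, rejected_rewards, margins)
    ]
    return sum(losses) / len(losses)

# Hypothetical rewards and margins for two pairs.
print(adaptive_margin_loss([1.2, 0.3], [0.4, 0.9], margins=[0.5, 0.1]))
```

A positive margin means the reward gap must exceed the margin before the loss drops below $\log 2$, so pairs with higher preference strength are pushed further apart.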

### Takeaways

* **Label Flipping** and **Label Smoothing** can effectively mitigate the impact of noisy preferences
and improve performance, provided that you can accurately identify noisy preference data.

* When learning from data with high preference strength, the reward model may be prone to
overfitting, which can be mitigated by using **Label Smoothing**.

* **Adaptive margin** almost always benefits all preference data and can be widely applied to
reward modeling.

## How to Better Model Human Preference?

In this report, we mainly consider four methods to improve reward modeling. In our
practical experiments, these methods show improvements over the original reward modeling method:

* **Flip**: Flip the labels of the noisy data in the preference data.

* **Margin**: Add an adaptive margin to the loss function for all preference pairs.

* **Flip + Margin**: Flip the labels of the noisy data in the preference data and add an adaptive margin
to the loss function for all preference pairs.

* **Soft Label + Margin**: Apply label smoothing to data with preference strength less than
0 and add an adaptive margin to the loss function for all preference pairs.

![](../images/secret-how.png)
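
The notes do not spell out how label smoothing and the margin are combined in **Soft Label + Margin**. One possible reading, given purely as an assumption, applies the margin to every pair and uses a nonzero smoothing parameter only where the mean preference strength is below 0. The function names, the default `alpha=0.1`, and the choice to leave labels unflipped are all hypothetical.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def soft_label_margin_loss(chosen_rewards, rejected_rewards, mean_strengths, alpha=0.1):
    """Margin on all pairs; label smoothing only where mean strength < 0.

    Assumption: labels are used as given (no flipping), and the margin is
    the mean preference strength itself.
    """
    total = 0.0
    count = 0
    for rc, rr, u in zip(chosen_rewards, rejected_rewards, mean_strengths):
        z = rc - rr - u               # margin-adjusted reward gap
        a = alpha if u < 0 else 0.0   # smooth only the suspected noisy pairs
        total += -((1 - a) * log_sigmoid(z) + a * log_sigmoid(-z))
        count += 1
    return total / count

# Hypothetical rewards and mean strengths for two pairs.
print(soft_label_margin_loss([1.2, 0.3], [0.4, 0.9], [0.5, -0.3]))
```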

```{caution}
Does applying label smoothing to data with preference strength less than
0 require flipping the labels first?
```