# ArmoRM-MoE

```{note}
Interpretable preferences via multi-objective reward modeling and mixture-of-experts.
```

## The Need for Interpretable Reward Models

![](../images/armo1.png)

Reward hacking is widely observed when RLHF is applied to LLM alignment. A notable example is verbosity bias: aligned LLMs produce longer-than-necessary responses because the reward model (RM) favors length regardless of quality.

![](../images/armo2.png)

How can we mitigate the reward hacking issue? We believe one solution is to make the reward model more interpretable and debuggable. Let’s continue considering the verbosity bias example. Suppose the RM’s output is interpretable, explaining that it assigns a high score to a response due to two factors: 40% for its helpfulness and 60% for its length. We could adjust its decision-making process to base its scoring 100% on helpfulness, regardless of response length, thus mitigating the verbosity bias.

To build RMs with interpretable preferences, we propose a two-stage approach:

1. Train an Absolute-Rating Multi-Objective Reward Model (ArmoRM) with multi-dimensional absolute-rating data;

2. Employ a Mixture-of-Experts (MoE) strategy with a gating network that automatically selects the most suitable reward objectives based on the context.

## Stage-1: Multi-Objective Reward Modeling

Some recent high-quality datasets have multi-objective absolute ratings. For instance, the UltraFeedback dataset is curated with 5-objective absolute ratings: Overall Score, Instruction Following, Truthfulness, Honesty, and Helpfulness.

![](../images/armo3.png)

We consider each example to consist of a prompt $x$, a response $y$, and a $k$-dimensional rating vector $r\in\mathbb{R}^{k}$, where each dimension corresponds to a reward objective such as helpfulness or truthfulness. We take a pre-trained decoder-only LLM without its original output linear layer as the feature extractor $f_{\theta}$, pass $(x,y)$ through the decoder layers, and take the hidden state of the final decoder layer on the last token as a $d$-dimensional feature. On top of $f_{\theta}$ we attach a new linear regression layer $w\in\mathbb{R}^{d\times k}$, which outputs a $k$-dimensional rating prediction. The model can be trained straightforwardly with a regression loss:

$$
\underset{\theta,w}{\min}\mathbb{E}_{x,y,r}\|w^{\intercal}f_{\theta}(x,y)-r\|^{2}
$$
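A minimal sketch of the regression head, assuming the backbone is frozen and its $d$-dimensional features are already computed (the objective above also updates $\theta$, which is omitted here). The features, ratings, dimensions, learning rate, and helper names are synthetic and hypothetical, chosen only to show the squared-error update on $w$.

```python
import random

def predict(w, feature):
    """Return the k-dimensional prediction w^T f for one d-dimensional feature."""
    d, k = len(w), len(w[0])
    return [sum(w[i][j] * feature[i] for i in range(d)) for j in range(k)]

def mse_loss(w, features, ratings):
    """Squared error ||w^T f - r||^2 averaged over all examples."""
    total = 0.0
    for f, r in zip(features, ratings):
        p = predict(w, f)
        total += sum((pj - rj) ** 2 for pj, rj in zip(p, r))
    return total / len(features)

def sgd_step(w, f, r, lr):
    """One in-place gradient step on ||w^T f - r||^2 for a single example."""
    p = predict(w, f)
    for i in range(len(w)):
        for j in range(len(w[0])):
            # d/dw[i][j] of the squared error is 2 * (p_j - r_j) * f_i
            w[i][j] -= lr * 2.0 * (p[j] - r[j]) * f[i]

random.seed(0)
d, k, n = 4, 3, 200  # assumed toy sizes: feature dim, objectives, examples

# Synthetic stand-ins for backbone features and absolute ratings.
true_w = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]
features = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
ratings = [predict(true_w, f) for f in features]

w = [[0.0] * k for _ in range(d)]
initial_loss = mse_loss(w, features, ratings)
for _ in range(20):
    for f, r in zip(features, ratings):
        sgd_step(w, f, r, lr=0.02)
final_loss = mse_loss(w, features, ratings)
```

Because the synthetic ratings are an exact linear function of the features, `final_loss` should fall far below `initial_loss`; real ratings are noisy, so the loss would plateau above zero.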

## Stage-2: Mixture-of-Experts Aggregation of Reward Objectives

We follow the common MoE practice to add a gating layer:

$$
g_{\phi}:\mathbb{R}^{d}\to\left\{v\in\mathbb{R}^{k}\;\middle|\;v_{i}\ge 0\text{ and }\sum_{i=1}^{k}v_{i}=1\right\}
$$

based on the feature extracted from the prompt $f_{\theta}(x)\in\mathbb{R}^{d}$, i.e. the hidden state on the last token of $x$. The gating layer $g_{\phi}$ can simply be a shallow MLP.
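As an illustration, a gating layer of this kind can be sketched as a one-hidden-layer MLP followed by a softmax. The softmax is one natural way to satisfy the non-negativity and sum-to-one constraints; the hidden size, ReLU activation, random weights, and the prompt feature below are assumptions made for the sketch, not details taken from the text.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax: non-negative outputs that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, bias, x):
    """Affine map: one output per row of `weights`."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def gating(prompt_feature, w1, b1, w2, b2):
    """Shallow MLP (ReLU hidden layer) mapping a d-dim feature to k coefficients."""
    hidden = [max(0.0, h) for h in linear(w1, b1, prompt_feature)]
    return softmax(linear(w2, b2, hidden))

random.seed(0)
d, hidden_size, k = 4, 8, 3  # assumed toy sizes

# Hypothetical, randomly initialised parameters phi = (w1, b1, w2, b2).
w1 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(hidden_size)]
b1 = [0.0] * hidden_size
w2 = [[random.gauss(0, 1) for _ in range(hidden_size)] for _ in range(k)]
b2 = [0.0] * k

prompt_feature = [random.gauss(0, 1) for _ in range(d)]  # stand-in for f_theta(x)
v = gating(prompt_feature, w1, b1, w2, b2)
```

The coefficients `v` depend only on the prompt feature, so every candidate response to the same prompt is scored with the same weighting of objectives.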

![](../images/armo4.png)

However, most reward objectives are highly correlated with `verbosity`, which indicates a strong verbosity bias. Because the gating coefficients are non-negative, a weighted sum of these objectives cannot cancel the correlation, so the final output would inherit the bias. To resolve the issue, we adjust each reward objective, $r_{i}$, with a penalty using the verbosity reward objective,

$$
r_{i}'\gets r_{i} - \lambda_{i}r_{\text{verbose}}
$$

where the penalty coefficient $\lambda_{i}$ is chosen such that, for a suitable correlation metric (e.g., the Pearson or Spearman correlation coefficient) and a reference data distribution $\mathcal{D}$:

$$
\mathbb{E}_{\mathcal{D}}\text{Corr}(r_{i}',r_{\text{verbose}}) = 0.
$$

The adjusted reward vector is denoted as $r'\in\mathbb{R}^{k}$.
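For the Pearson case the penalty coefficient has a closed form: the correlation of $r_{i}-\lambda_{i}r_{\text{verbose}}$ with $r_{\text{verbose}}$ vanishes exactly when their covariance does, which gives $\lambda_{i}=\text{Cov}(r_{i},r_{\text{verbose}})/\text{Var}(r_{\text{verbose}})$. The sketch below checks this on synthetic data standing in for samples from $\mathcal{D}$; the data and function names are hypothetical, and a rank-based metric such as Spearman would instead need a numerical search for $\lambda_{i}$.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Population covariance of two equally long samples."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def pearson(xs, ys):
    """Pearson correlation coefficient."""
    return covariance(xs, ys) / (covariance(xs, xs) * covariance(ys, ys)) ** 0.5

def verbosity_penalty(r_i, r_verbose):
    """lambda_i that makes r_i - lambda_i * r_verbose uncorrelated with r_verbose."""
    return covariance(r_i, r_verbose) / covariance(r_verbose, r_verbose)

random.seed(0)
# Synthetic reference sample: an objective that partly tracks verbosity.
r_verbose = [random.gauss(0, 1) for _ in range(500)]
r_help = [0.6 * v + random.gauss(0, 1) for v in r_verbose]

lam = verbosity_penalty(r_help, r_verbose)
r_help_adj = [r - lam * v for r, v in zip(r_help, r_verbose)]

corr_before = pearson(r_help, r_verbose)
corr_after = pearson(r_help_adj, r_verbose)  # zero up to floating-point error
```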

Finally, we take the inner product of the gating coefficients and the adjusted multi-objective rewards to obtain a scalar score for the response $y$ given prompt $x$,

$$
R = g_{\phi}(f_{\theta}(x))^{\intercal}r'
$$
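A small worked example of this aggregation, using made-up gating coefficients and adjusted rewards for $k=3$ objectives. The per-objective terms are kept alongside the total because they are what makes the score inspectable; the numbers and function name are illustrative assumptions.

```python
def scalar_reward(gate, adjusted_rewards):
    """Return (R, contributions) where R = gate^T r' and contributions[i] = gate[i] * r'[i]."""
    if len(gate) != len(adjusted_rewards):
        raise ValueError("gate and adjusted_rewards must have the same length")
    contributions = [g * r for g, r in zip(gate, adjusted_rewards)]
    return sum(contributions), contributions

gate = [0.7, 0.2, 0.1]              # hypothetical g_phi(f_theta(x)), sums to 1
adjusted_rewards = [0.5, -0.2, 1.0]  # hypothetical r' for one response

R, contributions = scalar_reward(gate, adjusted_rewards)
# R = 0.7*0.5 + 0.2*(-0.2) + 0.1*1.0 = 0.41
```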

To train the gating layer, we freeze the parameters of the backbone and the regression layer, and only train the gating layer with the Bradley-Terry loss,

$$
\underset{\phi}{\min}\mathbb{E}\left[-\log\frac{\exp(R_{\text{chosen}})}{\exp(R_{\text{chosen}})+\exp(R_{\text{rejected}})}\right]
$$

where $R_{\text{chosen}}$ and $R_{\text{rejected}}$ are the preference scores for the chosen and rejected responses in each pairwise example, $(x,y_{\text{chosen}},y_{\text{rejected}})$.
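The Bradley-Terry objective is usually written as the negative log-likelihood that the chosen response wins, which simplifies to $-\log\sigma(R_{\text{chosen}}-R_{\text{rejected}})$ with $\sigma$ the logistic sigmoid. The sketch below evaluates that per-pair loss in a numerically stable way and averages it over a batch; the function names and example scores are hypothetical, and the gradient step on $\phi$ is left out.

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected), computed without overflow."""
    margin = r_chosen - r_rejected
    if margin >= 0:
        # log(1 + exp(-margin)); exp argument is non-positive
        return math.log1p(math.exp(-margin))
    # Same quantity rewritten so the exp argument stays non-positive
    return -margin + math.log1p(math.exp(margin))

def batch_loss(pairs):
    """Mean Bradley-Terry loss over (R_chosen, R_rejected) pairs."""
    return sum(bradley_terry_loss(c, r) for c, r in pairs) / len(pairs)

# Hypothetical scalar scores for three preference pairs.
pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]
loss = batch_loss(pairs)
```

The loss equals $\log 2$ when the two scores tie and shrinks toward zero as the chosen response is scored increasingly higher than the rejected one.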