# Self-Rewarding Language Models

```{note}
Our approach first assumes access to a base pretrained language model, and a small amount
of human-annotated seed data. We then build a model that aims to possess two skills
simultaneously:

1. **Instruction following**: given a prompt that describes a user request, the ability to
generate a high quality, helpful (and harmless) response.

2. **Self-Instruction creation**: the ability to generate and evaluate new instruction-following
examples to add to its own training set.

These skills are used so that the model can perform self-alignment, i.e., they are the
components used to iteratively train itself using AI Feedback (AIF).
```

![](../images/self-reward.png)

## Initialization

**Seed instruction following data:** We are given a seed set of human-authored (instruction
prompt, response) general instruction following examples that we use for training in a
supervised fine-tuning (SFT) manner, starting from a pretrained base language model.
Subsequently this will be referred to as Instruction Fine-Tuning (IFT) data.

**Seed LLM-as-a-Judge instruction following data:** We also assume we are provided
a seed set of (evaluation instruction prompt, evaluation result response) examples which
can also be used for training. While this is not strictly necessary, as a model trained on IFT
data alone can already be used as an LLM-as-a-Judge, we show that such training
data can give improved results. Subsequently this will be referred to as Evaluation
Fine-Tuning (EFT) data.

![](../images/self-reward2.png)

## Self-Instruction Creation

Using the model we have trained, we can have it modify its own training set. Specifically,
we generate additional training data for the next iteration of training:

1. **Generate a new prompt**: We generate a new prompt $x_i$ using few-shot prompting, sampling prompts from the original seed IFT data.

2. **Generate candidate responses**: We then generate $N$ diverse candidate responses $\{y_{i}^{1},\dots,y_{i}^{N}\}$ for the given prompt $x_i$ from our model using sampling.

3. **Evaluate candidate responses**: Finally, we use the LLM-as-a-Judge ability of our same model to evaluate its own candidate responses with scores $r_{i}^{n}\in[0,5]$.
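
The three steps above can be sketched as a small loop. This is only an illustration under stated assumptions: `generate` and `judge` are hypothetical callables standing in for sampling from the model and for its LLM-as-a-Judge scoring, and the `Task:` few-shot template and the default counts are made up for the example rather than taken from the paper.

```python
import random
from typing import Callable, Dict, List, Optional


def create_self_instructions(
    seed_prompts: List[str],
    generate: Callable[[str], str],      # hypothetical: samples text from the model
    judge: Callable[[str, str], float],  # hypothetical: returns a score in [0, 5]
    num_new_prompts: int = 2,
    num_candidates: int = 4,             # N in the text
    num_shots: int = 8,                  # assumed number of few-shot examples
    rng: Optional[random.Random] = None,
) -> List[Dict]:
    """Run one round of self-instruction creation and return scored candidates."""
    rng = rng or random.Random(0)
    records = []
    for _ in range(num_new_prompts):
        # Step 1: build a few-shot prompt from sampled seed IFT prompts
        # and ask the model to write a new one (x_i).
        shots = rng.sample(seed_prompts, k=min(num_shots, len(seed_prompts)))
        few_shot = "\n".join(f"Task: {s}" for s in shots) + "\nTask:"
        new_prompt = generate(few_shot).strip()

        # Step 2: sample N candidate responses for the new prompt.
        candidates = [generate(new_prompt) for _ in range(num_candidates)]

        # Step 3: score every candidate with the same model acting as judge.
        scores = [judge(new_prompt, y) for y in candidates]

        records.append(
            {"prompt": new_prompt, "candidates": candidates, "scores": scores}
        )
    return records
```

Each returned record holds one new prompt together with its $N$ candidates and their scores, which is the input needed to build training data for the next iteration.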

## Instruction Following Training

After performing the self-instruction creation procedure, we can
then augment the seed data with additional examples for training, which we refer to as AI
Feedback Training (AIFT) data. We try two variants of such feedback:

* **Preference pairs**: We construct training data of the form (instruction prompt $x_i$,
winning response $y_{i}^{w}$, losing response $y_{i}^{l}$). To form the winning and losing pair
we take the highest and lowest scoring responses from the $N$ evaluated candidate
responses.

* **Positive examples only**: We add additional examples of (instruction
prompt, response) to the seed set for supervised fine-tuning. In this setup we only add examples
where the candidate response was evaluated to give a perfect score of $r_{i}^{n}=5$.

We find that learning from preference pairs gives superior performance, and hence recommend that approach.
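
A minimal sketch of both variants, given the scored candidates for one prompt. The function names and dictionary keys are hypothetical. Returning `None` when the highest and lowest scores are equal is an assumption made here (such a pair carries no preference signal); the text above does not say how ties are handled.

```python
from typing import Dict, List, Optional, Sequence


def build_preference_pair(
    prompt: str, candidates: Sequence[str], scores: Sequence[float]
) -> Optional[Dict[str, str]]:
    """Pick the highest- and lowest-scoring candidates as (winner, loser)."""
    if not candidates or len(candidates) != len(scores):
        raise ValueError("candidates and scores must be non-empty and aligned")
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    if scores[best] == scores[worst]:
        return None  # assumption: skip prompts where all scores tie
    return {
        "prompt": prompt,
        "chosen": candidates[best],     # y_i^w
        "rejected": candidates[worst],  # y_i^l
    }


def select_positive_examples(
    prompt: str,
    candidates: Sequence[str],
    scores: Sequence[float],
    perfect_score: float = 5.0,
) -> List[Dict[str, str]]:
    """Keep only candidates judged to have the perfect score r = 5."""
    return [
        {"prompt": prompt, "response": y}
        for y, r in zip(candidates, scores)
        if r == perfect_score
    ]
```

The preference-pair output feeds preference optimization, while the positive-only output is simply appended to the supervised fine-tuning set.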

## Overall Self-Alignment Algorithm

Our overall procedure trains a series of models $M_1, \dots, M_T$ where
each successive model $M_t$ uses augmented training data created by the $(t-1)^{\text{th}}$ model. We thus
define $\text{AIFT}(M_t)$ to mean AI Feedback Training data created using model $M_t$.

**Model Sequence**

* $M_0$: Base pretrained LLM with no fine-tuning.

* $M_1$: Initialized with $M_0$, then fine-tuned on the IFT+EFT seed data using SFT.

* $M_2$: Initialized with $M_1$, then trained with $\text{AIFT}(M_1)$ data using DPO.

* $M_3$: Initialized with $M_2$, then trained with $\text{AIFT}(M_2)$ data using DPO.
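
For reference, the DPO objective used for $M_2$ and $M_3$ can be sketched for a single preference pair as below. This is the standard DPO loss, not code from the paper: the inputs are summed log-probabilities of the winning and losing responses under the model being trained and under a frozen reference model, and the default `beta=0.1` is an assumed value, not one stated above.

```python
import math


def dpo_loss(
    policy_logp_w: float,  # log pi_theta(y_w | x)
    policy_logp_l: float,  # log pi_theta(y_l | x)
    ref_logp_w: float,     # log pi_ref(y_w | x)
    ref_logp_l: float,     # log pi_ref(y_l | x)
    beta: float = 0.1,     # assumed value; a tunable hyperparameter
) -> float:
    """Return -log(sigmoid(beta * (winner log-ratio - loser log-ratio)))."""
    margin = beta * (
        (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    )
    # -log(sigmoid(m)) = log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the trained model and the reference agree, the loss equals $\log 2$; it falls as the trained model raises the winning response's likelihood relative to the losing one.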

## Experiments

### Experimental Setup

**Base Model:** We use Llama 2 70B as our base pretrained model.

**IFT Seed Data:** We use the human-authored examples provided in the Open Assistant
dataset for instruction fine-tuning.

**EFT Seed Data:** The Open Assistant data also provides multiple ranked human responses
per prompt from which we can construct evaluation fine-tuning data.

**Instruction Following Evaluation:** We evaluate head-to-head performance between various models
using GPT-4 as an evaluator over 256 test prompts using the AlpacaEval evaluation prompt. For each pairwise comparison, we run the evaluation prompt with the two responses in both orders.
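
Evaluating in both orders guards against position bias in the evaluator. A hedged sketch of how such a head-to-head tally might be computed: `judge` is a hypothetical callable wrapping the GPT-4 evaluation prompt that returns `"first"` or `"second"`, and counting disagreements between the two orders as ties is an assumption of this example, since the text above does not specify the aggregation rule.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


def head_to_head(
    examples: Iterable[Tuple[str, str, str]],  # (prompt, response_a, response_b)
    judge: Callable[[str, str, str], str],     # hypothetical: "first" or "second"
) -> Counter:
    """Tally wins for model A, wins for model B, and ties over all prompts."""
    tally = Counter()
    for prompt, resp_a, resp_b in examples:
        a_wins_when_first = judge(prompt, resp_a, resp_b) == "first"
        a_wins_when_second = judge(prompt, resp_b, resp_a) == "second"
        if a_wins_when_first and a_wins_when_second:
            tally["a"] += 1
        elif not a_wins_when_first and not a_wins_when_second:
            tally["b"] += 1
        else:
            tally["tie"] += 1  # assumption: inconsistent verdicts count as a tie
    return tally
```

With this rule, an evaluator that always prefers whichever response is shown first produces only ties rather than spurious wins.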

**Reward Modeling Evaluation:** We evaluate the correlation with human rankings on the evaluation set
we derived from the Open Assistant dataset.
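
One simple way to quantify agreement with human rankings is pairwise accuracy: the fraction of response pairs that the model's scores order the same way as the human ranks. The sketch below is an illustrative choice of measure, not necessarily the one reported in the paper; the function name and the conventions (lower human rank is better, higher model score is better, tied model scores count as disagreement) are assumptions.

```python
from itertools import combinations
from typing import Sequence


def pairwise_accuracy(
    human_ranks: Sequence[int],     # 1 = best, per human annotators
    model_scores: Sequence[float],  # higher = better, per LLM-as-a-Judge
) -> float:
    """Fraction of human-ordered pairs that the model scores in the same order."""
    if len(human_ranks) != len(model_scores):
        raise ValueError("human_ranks and model_scores must be aligned")
    agree = 0
    total = 0
    for i, j in combinations(range(len(human_ranks)), 2):
        if human_ranks[i] == human_ranks[j]:
            continue  # humans expressed no preference for this pair
        total += 1
        if model_scores[i] == model_scores[j]:
            continue  # assumption: a model tie counts as a disagreement
        human_prefers_i = human_ranks[i] < human_ranks[j]
        model_prefers_i = model_scores[i] > model_scores[j]
        if human_prefers_i == model_prefers_i:
            agree += 1
    return agree / total if total else float("nan")
```

Rank correlation coefficients such as Spearman's or Kendall's could be substituted where a correlation in the strict sense is wanted.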

### Instruction Following Ability Results

* EFT+IFT seed training performs similarly to IFT alone

* Iteration 2 ($M_2$) improves over Iteration 1 ($M_1$) and SFT Baseline

* Iteration 3 ($M_3$) improves over Iteration 2 ($M_2$)

![](../images/self-reward-3.png)

### Reward Modeling Ability Results

* EFT augmentation improves over SFT baseline

* Reward Modeling ability improves with Self-Training

* Importance of the LLM-as-a-Judge Prompt: we also tried various
other prompts to decide which is most effective. Our prompt, which describes
the points as additive and covers various aspects of quality, performs best.

![](../images/self-reward-5.png)