# InstructGPT

```{note}
The language modeling objective used for many recent large LMs (predicting the next token on a webpage from the internet) is different from the objective “follow the user’s instructions helpfully and safely”.
```

## Methodology

We start with a pretrained language
model, a distribution of prompts on which we want our model to produce aligned outputs, and a team
of trained human labelers. We then apply the following three steps.

![](../images/instruct-1.png)

Steps 2 and 3 can be iterated continuously; more comparison data is collected on the current best
policy, which is used to train a new RM and then a new policy.

## Dataset

Our prompt dataset consists primarily of text prompts submitted to the OpenAI API. To train the very first InstructGPT models, we asked labelers to write prompts themselves.

From these prompts, we produce three different datasets used in our fine-tuning procedure:

1. our SFT dataset, with labeler demonstrations used to train our SFT models.

2. our RM dataset, with labeler rankings of model outputs used to train our RMs.

3. our PPO dataset, without any human labels, which is used as input for RLHF fine-tuning.

![](../images/instruct-2.png)

## Supervised fine-tuning (SFT)

We fine-tune GPT-3 on our labeler demonstrations using supervised
learning, training for 16 epochs.

```{caution}
We do our final SFT model selection based on the RM score on the validation set?
```

We find that our SFT models overfit on validation loss after 1 epoch; however, we find
that training for more epochs helps both the RM score and human preference ratings, despite this
overfitting.

## Reward modeling (RM)

Starting from the SFT model with the final unembedding layer removed,
we trained a model to take in a prompt and response, and output a scalar reward.

In order to speed up comparison collection, we present labelers with anywhere between $K=4$ and $K=9$ responses to rank. This produces $\binom{K}{2}$ comparisons for each prompt shown to a labeler. We train on all $\binom{K}{2}$ comparisons from each prompt as a single batch element.
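As a rough illustration of how one ranking yields $\binom{K}{2}$ comparisons, the sketch below expands a best-to-worst list into (preferred, rejected) pairs. The function name `ranking_to_comparisons` is hypothetical, and the sketch assumes a strict ranking with no ties.

```python
from itertools import combinations


def ranking_to_comparisons(ranked_responses):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs.

    Assumption: the ranking is strict (no ties). `combinations` keeps the
    input order, so the first item of every pair outranks the second.
    """
    return list(combinations(ranked_responses, 2))


# K = 4 responses ranked by one labeler give C(4, 2) = 6 comparisons.
pairs = ranking_to_comparisons(["a", "b", "c", "d"])
print(len(pairs))  # 6
```

All pairs from one prompt stay together, which matches the idea of treating them as a single batch element rather than as independent training examples.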

The loss function for the reward model is:

$$
\text{loss}(\theta) = -\frac{1}{\binom{K}{2}}\mathbb{E}_{(x, y_{w}, y_{l})\sim D}\left[\log(\sigma(r_{\theta}(x, y_{w}) - r_{\theta}(x, y_{l})))\right]
$$

where $r_{\theta}(x, y)$ is the scalar output of the reward model for prompt $x$ and completion $y$ with parameters $\theta$, $y_{w}$ is the preferred completion out of the pair $y_{w}$ and $y_{l}$, and $D$ is the dataset of human
comparisons.
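To make the pairwise loss concrete, here is a minimal sketch that evaluates it for a single prompt, given scalar rewards already listed from most to least preferred. It uses plain Python floats instead of a real reward model, and the helper names `log_sigmoid` and `rm_loss_for_prompt` are assumptions made for illustration.

```python
import math
from itertools import combinations


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def rm_loss_for_prompt(rewards_best_to_worst):
    """Average negative log-sigmoid of reward gaps over all C(K, 2) pairs.

    Assumption: rewards are ordered by the labeler's ranking, so in each
    pair the first reward belongs to the preferred completion y_w.
    """
    pairs = list(combinations(rewards_best_to_worst, 2))
    total = sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs)
    return -total / len(pairs)


# Rewards that agree with the ranking give a lower loss than reversed ones.
print(rm_loss_for_prompt([2.0, 1.0, 0.0]) < rm_loss_for_prompt([0.0, 1.0, 2.0]))  # True
```

When every completion gets the same reward, each pair contributes $\log 2$, so the loss equals $\log 2$ regardless of $K$.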

## Reinforcement learning (RL)

We fine-tuned the SFT model in a bandit environment using PPO: the environment presents a random customer prompt and expects a response. Given
the prompt and response, it produces a reward determined by the reward model and ends the episode. In addition, we add a per-token KL penalty from the SFT model at each token to mitigate over-optimization
of the reward model. The value function is initialized from the RM. We call these
models “PPO.”
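One common way to combine the episode-level reward with a per-token KL penalty is sketched below. The exact placement is an assumption here rather than something stated above: each token is penalized by $\beta$ times its log-probability ratio, and the reward model score is added on the final token. The function name `shaped_rewards` and the example values are hypothetical.

```python
def shaped_rewards(rm_score, logp_rl, logp_sft, beta):
    """Per-token rewards: KL penalty at every token, RM score at the end.

    logp_rl / logp_sft: log-probabilities of the sampled response tokens
    under the RL policy and the SFT model (same length, non-empty).
    Assumption: the RM score is credited to the last token only.
    """
    if not logp_rl or len(logp_rl) != len(logp_sft):
        raise ValueError("need two non-empty sequences of equal length")
    rewards = [-beta * (p - q) for p, q in zip(logp_rl, logp_sft)]
    rewards[-1] += rm_score
    return rewards


# Two-token response; the second token has drifted away from the SFT model.
print(shaped_rewards(1.0, [-1.0, -2.0], [-1.0, -1.0], beta=0.5))
```

Summing the shaped rewards recovers the sequence-level quantity $r_{\theta}(x, y) - \beta\log\left(\pi_{\phi}^{\text{RL}}(y|x) / \pi^{\text{SFT}}(y|x)\right)$, since the sequence log-ratio is the sum of the per-token log-ratios.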

We also experiment with mixing the pretraining gradients into the PPO gradients, in order to fix the
performance regressions on public NLP datasets. We call these models “PPO-ptx.” We maximize the
following combined objective function in RL training:

$$
\begin{aligned}
\text{objective} = &\mathbb{E}_{(x, y)\sim D_{\pi_{\phi}^{\text{RL}}}}\left[r_{\theta}(x, y) - \beta\log\left(\pi_{\phi}^{\text{RL}}(y|x) / \pi^{\text{SFT}}(y|x)\right)\right] + \\
&\gamma\mathbb{E}_{x\sim D_{\text{pretrain}}}\left[\log(\pi_{\phi}^{\text{RL}}(x))\right]
\end{aligned}
$$

where $\pi_{\phi}^{\text{RL}}$ is the learned RL policy, $\pi^{\text{SFT}}$ is the supervised trained model, and $D_{\text{pretrain}}$ is the
pretraining distribution. The KL reward coefficient, $\beta$, and the pretraining loss coefficient, $\gamma$, control
the strength of the KL penalty and pretraining gradients respectively.
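The combined objective can be estimated from samples, as in the minimal sketch below. It assumes that sequence-level rewards and log-probabilities have already been computed; the function name `ppo_ptx_objective` and the input layout are hypothetical, and a real implementation would optimize this with PPO rather than just evaluate it.

```python
def ppo_ptx_objective(rl_samples, pretrain_logps, beta, gamma):
    """Monte Carlo estimate of the combined PPO-ptx objective.

    rl_samples: (reward, logp_rl, logp_sft) per response sampled from the
        RL policy, with sequence-level log-probabilities.
    pretrain_logps: log pi_RL(x) for sequences from the pretraining data.
    """
    rl_term = sum(
        reward - beta * (logp_rl - logp_sft)
        for reward, logp_rl, logp_sft in rl_samples
    ) / len(rl_samples)
    ptx_term = sum(pretrain_logps) / len(pretrain_logps)
    return rl_term + gamma * ptx_term


# Setting gamma = 0 drops the pretraining term and leaves the plain "PPO" objective.
print(ppo_ptx_objective([(1.0, -2.0, -3.0)], [-4.0], beta=0.5, gamma=0.0))  # 0.5
```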

![](../images/trl.png)