
Clarification on reward/value heads in PPOV2 #1783

Open
SalmanMohammadi opened this issue Jun 27, 2024 · 3 comments

@SalmanMohammadi

First, thank you for your efforts in helping to bring accurate and performant RLHF techniques to the open-source community.
I'm raising this issue hoping to get some clarification on a couple of implementation details in PPOV2:

--- 1 ---
The default AutoModelForSequenceClassification implementation in Transformers uses bias=False for the classification nn.Linear. In a recent fork for training reward models, and in line with the suggestion in The N Implementation Details, the bias is correctly initialised prior to reward model training.

However, when I run the snippet from examples/scripts/ppo/ppo.py for an exemplar RM:

# Load the reward model directly
from transformers import AutoModelForSequenceClassification

reward_model = AutoModelForSequenceClassification.from_pretrained("trl-internal-testing/rm_descriptiveness_1b")

""" output:
Some weights of the model checkpoint at trl-internal-testing/rm_descriptiveness_1b were not used when initializing GPTNeoXForSequenceClassification: ['score.bias']
- This IS expected if you are initializing GPTNeoXForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing GPTNeoXForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
"""

# Check whether the loaded model kept the trained bias
"score.bias" in reward_model.state_dict()
""" output:
False
"""

Is it expected behaviour to not use the bias during PPO training?

--- 2 ---

In the previous PPO implementation, the value head is simply another head that shares the base model backbone. In PPOV2, however, it seems the value model is instantiated separately. Is my understanding correct here? If so, I'm curious about the reasoning behind this, since a separate value model requires an additional reward-model-sized allocation of memory. Do you see an improvement in algorithm performance here?
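To make sure I'm reading it right, here's a rough sketch of the two setups as I understand them (the model name is just illustrative, and I may be misreading the PPOV2 example script):

from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification
from trl import AutoModelForCausalLMWithValueHead

# Previous PPO: a single backbone with an extra scalar value head hanging off it,
# so the policy and value function share weights.
policy_with_value_head = AutoModelForCausalLMWithValueHead.from_pretrained(
    "EleutherAI/pythia-1b-deduped"
)

# PPOV2 (as I read examples/scripts/ppo/ppo.py): the value function is a separate,
# full-size network, e.g. a sequence-classification model with a single output.
policy = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1b-deduped")
value_model = AutoModelForSequenceClassification.from_pretrained(
    "EleutherAI/pythia-1b-deduped", num_labels=1
)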

Many thanks!

P.S. For context, I've been working on a PPO implementation in parallel in Torchtune pytorch/torchtune#1005, and I've found all the empirical work and implementation details invaluable so far.

@vwxyzjn (Collaborator) commented Jun 27, 2024

Thanks for the issue. Regarding the model:

The default AutoModelForSequenceClassification implementation in Transformers uses bias=False for the classification nn.Linear

That is expected, because the bias cancels out in the RM loss (there is a short sketch of this further down). Here is a script that trains using the said RM:

examples/scripts/ppo/ppo.py --output_dir models/minimal/ppo1 --num_ppo_epochs 4 --num_mini_batches 1 --learning_rate 3e-6 --per_device_train_batch_size 32 --gradient_accumulation_steps 16 --local_rollout_forward_batch_size 32 --total_episodes 100000 --model_name_or_path EleutherAI/pythia-1b-deduped --sft_model_path EleutherAI/pythia-1b-deduped --reward_model_path trl-internal-testing/rm_sentiment_1b --kl_coef 0.1 --stop_token period --non_eos_penalty --min_response_length 13 --penalty_reward_value -3

wandb here: https://wandb.ai/costa-huang/huggingface/runs/fmof4oxq/workspace?nw=nwusercostahuang

You can see the model's completion works as intended: the output text becomes more positive.

[video attachment: Screen.Recording.2024-06-27.at.1.40.49.PM.mov]
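For completeness, a minimal sketch of why the bias cancels, assuming the standard pairwise (Bradley-Terry style) RM loss; the layer sizes and tensors are just illustrative:

import torch
import torch.nn.functional as F

hidden_size = 8
score = torch.nn.Linear(hidden_size, 1, bias=True)  # score head with a bias term

# final hidden states for a batch of chosen/rejected responses (random, for illustration)
h_chosen = torch.randn(4, hidden_size)
h_rejected = torch.randn(4, hidden_size)

r_chosen = score(h_chosen)      # w·h_chosen + b
r_rejected = score(h_rejected)  # w·h_rejected + b

# pairwise loss: -log sigmoid(r_chosen - r_rejected); the +b terms cancel in the
# difference, so the bias receives zero gradient and stays at its initial value.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
print(score.bias.grad)  # zero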

It seems the value model is instantiated separately. Is my understanding correct here?

Yes. The separate value network follows OpenAI's setting in Summarize from Feedback and InstructGPT.

P.S. For context, I've been working on a PPO implementation in parallel in Torchtune pytorch/torchtune#1005, and I've found all the empirical work and implementation details invaluable so far.

that's amazing 💪👍!

@SalmanMohammadi (Author)

Thanks so much for the reply!

Here is a script that trains using the said RM

I'd been hunting for this while doing some replication work against PPOV2, so this is really helpful, thanks :)

The separate value network follows OpenAI's setting in Summarize from Feedback and InstructGPT.

I'd be really interested to hear any thoughts you have on reducing the memory footprint of PPO. I noticed you were trying out some PEFT support, similar to PPOV1; did you end up scaling the PEFT experiments for comparison?

@vwxyzjn (Collaborator) commented Jun 27, 2024

Yes, PEFT absolutely helps with the memory. In the N+ implementation details work, @mnoukhov did some PEFT experiments and they perform pretty well, too. See the screenshot below (missing the 6.9B LoRA checkpoint results, but it's pretty promising).

[screenshot of results]
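As a rough illustration (not the exact config from the N+ work), attaching LoRA adapters with peft looks something like this; only the low-rank adapter weights are trained, and in the v1-style setup the frozen base can also double as the reference policy by disabling the adapters, which is where a lot of the savings come from:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1b-deduped")

lora_config = LoraConfig(
    r=16,                                # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # attention projection in GPT-NeoX / pythia models
    task_type="CAUSAL_LM",
)
policy = get_peft_model(base, lora_config)
policy.print_trainable_parameters()      # only a small fraction of parameters are trainable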
