
Clarification on reward/value heads in PPOV2 #1783

Open
SalmanMohammadi opened this issue Jun 27, 2024 · 3 comments

@SalmanMohammadi

First, thank you for your efforts in helping to bring accurate and performant RLHF techniques to the open-source community.
I'm raising this issue hoping to get some clarification on a couple of implementation details in PPOV2:

--- 1 ---
The default AutoModelForSequenceClassification implementation in Transformers uses bias=False for the classification nn.Linear. In a recent fork for training reward models, and in line with the suggestion in The N Implementation Details, the bias is correctly initialised prior to reward model training.

However, when I run the snippet from examples/scripts/ppo/ppo.py for an exemplar RM:

# Load the reward model directly
from transformers import AutoModelForSequenceClassification

reward_model = AutoModelForSequenceClassification.from_pretrained("trl-internal-testing/rm_descriptiveness_1b")

""" output:
Some weights of the model checkpoint at trl-internal-testing/rm_descriptiveness_1b were not used when initializing GPTNeoXForSequenceClassification: ['score.bias']
- This IS expected if you are initializing GPTNeoXForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing GPTNeoXForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
"""

# Check whether the loaded model kept the trained bias
"score.bias" in reward_model.state_dict()
""" output:
False
"""

Is it expected behaviour to not use the bias during PPO training?

--- 2 ---

In the previous PPO implementation, the value head is simply another head that shares the base model backbone. In PPOV2, however, it seems the value model is instantiated separately. Is my understanding correct here? If so, I'm curious about the reasoning behind this, since a separate value model requires an additional reward-model-sized allocation of memory. Do you see an improvement in algorithm performance here?
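To make sure I'm reading it right, here's a rough sketch of the two setups as I understand them (the model name is just illustrative, and I may be misreading the PPOV2 example script):

from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification
from trl import AutoModelForCausalLMWithValueHead

# Previous PPO: a single backbone with an extra scalar value head hanging off it,
# so the policy and value function share weights.
policy_with_value_head = AutoModelForCausalLMWithValueHead.from_pretrained(
    "EleutherAI/pythia-1b-deduped"
)

# PPOV2 (as I read examples/scripts/ppo/ppo.py): the value function is a separate,
# full-size network, e.g. a sequence-classification model with a single output.
policy = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1b-deduped")
value_model = AutoModelForSequenceClassification.from_pretrained(
    "EleutherAI/pythia-1b-deduped", num_labels=1
)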

Many thanks!

P.S. For context, I've been working on a PPO implementation in parallel in Torchtune pytorch/torchtune#1005, and I've found all the empirical work and implementation details invaluable so far.

@vwxyzjn (Collaborator) commented Jun 27, 2024

Thanks for the issue. Regarding the model:

The default AutoModelForSequenceClassification implementation in Transformers uses bias=False for the classification nn.Linear

That is expected, because the bias cancels out in the RM loss (there is a short sketch of this further down). Here is a script that trains using the said RM:

examples/scripts/ppo/ppo.py --output_dir models/minimal/ppo1 --num_ppo_epochs 4 --num_mini_batches 1 --learning_rate 3e-6 --per_device_train_batch_size 32 --gradient_accumulation_steps 16 --local_rollout_forward_batch_size 32 --total_episodes 100000 --model_name_or_path EleutherAI/pythia-1b-deduped --sft_model_path EleutherAI/pythia-1b-deduped --reward_model_path trl-internal-testing/rm_sentiment_1b --kl_coef 0.1 --stop_token period --non_eos_penalty --min_response_length 13 --penalty_reward_value -3

wandb here: https://wandb.ai/costa-huang/huggingface/runs/fmof4oxq/workspace?nw=nwusercostahuang

You can see the model's completion works as intended: the output text becomes more positive.

[video attachment: Screen.Recording.2024-06-27.at.1.40.49.PM.mov]
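For completeness, a minimal sketch of why the bias cancels, assuming the standard pairwise (Bradley-Terry style) RM loss; the layer sizes and tensors are just illustrative:

import torch
import torch.nn.functional as F

hidden_size = 8
score = torch.nn.Linear(hidden_size, 1, bias=True)  # score head with a bias term

# final hidden states for a batch of chosen/rejected responses (random, for illustration)
h_chosen = torch.randn(4, hidden_size)
h_rejected = torch.randn(4, hidden_size)

r_chosen = score(h_chosen)      # w·h_chosen + b
r_rejected = score(h_rejected)  # w·h_rejected + b

# pairwise loss: -log sigmoid(r_chosen - r_rejected); the +b terms cancel in the
# difference, so the bias receives zero gradient and stays at its initial value.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
print(score.bias.grad)  # zero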

It seems the value model is instantiated separately. Is my understanding correct here?

Yes. The separate value network follows OpenAI's setting in Summarize from Feedback and InstructGPT.

P.S. For context, I've been working on a PPO implementation in parallel in Torchtune pytorch/torchtune#1005, and I've found all the empirical work and implementation details invaluable so far.

that's amazing 💪👍!

@SalmanMohammadi (Author)

Thanks so much for the reply!

Here is a script that trains using the said RM

I'd been hunting for this while doing some replication work against PPOV2, so this is really helpful, thanks :)

The separate value network follows OpenAI's setting in Summarize from Feedback and InstructGPT.

I'd be really interested to hear any thoughts you have on reducing the memory footprint of PPO. I noticed you were trying out some PEFT support, similar to PPOV1; did you end up scaling the PEFT experiments for comparison?

@vwxyzjn (Collaborator) commented Jun 27, 2024

Yes, PEFT absolutely helps with the memory. In the N+ implementation details work, @mnoukhov did some PEFT experiments and they perform pretty well, too. See the screenshot below (missing the 6.9B LoRA checkpoint results, but it's pretty promising).

[screenshot of results]
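As a rough illustration (not the exact config from the N+ work), attaching LoRA adapters with peft looks something like this; only the low-rank adapter weights are trained, and in the v1-style setup the frozen base can also double as the reference policy by disabling the adapters, which is where a lot of the savings come from:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1b-deduped")

lora_config = LoraConfig(
    r=16,                                # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # attention projection in GPT-NeoX / pythia models
    task_type="CAUSAL_LM",
)
policy = get_peft_model(base, lora_config)
policy.print_trainable_parameters()      # only a small fraction of parameters are trainable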
