Clarification on reward/value heads in PPOV2 #1783
Comments
Thanks for the issue. Regarding the model, that behaviour is expected. See the wandb run here: https://wandb.ai/costa-huang/huggingface/runs/fmof4oxq/workspace?nw=nwusercostahuang. You can see the model's completions work as intended: the output text becomes more positive. (Screen recording attached: Screen.Recording.2024-06-27.at.1.40.49.PM.mov)
Yes, the separate value network follows OpenAI's setting in Summarize from Feedback and InstructGPT.
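A minimal sketch of what that separate critic looks like in practice (checkpoint names are placeholders; this assumes the general pattern used in examples/scripts/ppo/ppo.py):

```python
# Sketch only: policy/ref-policy plus a reward model and a *separate* value
# model, mirroring the OpenAI-style setup mentioned above.
# "base-model" and "trained-reward-model" are placeholder checkpoint names.
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification

policy = AutoModelForCausalLM.from_pretrained("base-model")
ref_policy = AutoModelForCausalLM.from_pretrained("base-model")

reward_model = AutoModelForSequenceClassification.from_pretrained(
    "trained-reward-model", num_labels=1
)
# The critic is its own backbone plus a scalar head, not a head shared with
# the policy; it can be initialised from the reward-model checkpoint.
value_model = AutoModelForSequenceClassification.from_pretrained(
    "trained-reward-model", num_labels=1
)
```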
That's amazing 💪👍!
Thanks so much for the reply!
I'd been hunting for this while doing some replication work against PPOV2, so this is really helpful, thanks :)
I'd be really interested to hear if you have any thoughts on reducing the memory footprint of PPO. I noticed you were trying out some PEFT approaches similar to PPOV1; did you end up scaling the PEFT experiments to compare?
Yes, PEFT absolutely helps with memory. In the N+ implementation details work, @mnoukhov did some PEFT experiments and they perform pretty well, too. See the screenshot below (missing the 6.9B LoRA checkpoint results, but it's pretty promising). (screenshot attached)
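For completeness, a rough sketch of the kind of PEFT setup being referred to (not the exact configuration from those experiments; module names and hyperparameters are illustrative):

```python
# Illustrative only: wrap the policy in a LoRA adapter so that only a small
# number of parameters (and their gradients/optimizer states) are trained,
# which is where most of the PPO memory savings come from.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

policy = AutoModelForCausalLM.from_pretrained("base-model")  # placeholder name

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # projection names vary by architecture
    task_type="CAUSAL_LM",
)
policy = get_peft_model(policy, lora_config)
policy.print_trainable_parameters()  # typically well under 1% of parameters
```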
First, thank you for your efforts in helping to bring accurate and performant RLHF techniques to the open-source community.
I'm raising this issue hoping to get clarification on a couple of implementation details in PPOV2:
--- 1 ---
The default AutoModelForSequenceClassification implementation in Transformers uses bias=False for the classification nn.Linear. In a recent fork for training reward models, and in line with the suggestion in The N Implementation Details, the bias is correctly initialised prior to reward model training. However, when I run the snippet from examples/scripts/ppo/ppo.py for an exemplar RM, the classification head ends up without a bias term. Is this expected behaviour, i.e. not using the bias during PPO training?
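For illustration, here is a minimal check of the head's bias (not the ppo.py snippet itself; the attribute name assumes a GPT-2/GPT-NeoX/Llama-style backbone where the head is exposed as score, and the checkpoint name is a placeholder):

```python
# Minimal check (sketch): the default sequence-classification head in
# Transformers is nn.Linear(hidden_size, num_labels, bias=False), so the
# reward model's score layer has no bias unless it was added explicitly.
from transformers import AutoModelForSequenceClassification

rm = AutoModelForSequenceClassification.from_pretrained(
    "trained-reward-model", num_labels=1  # placeholder checkpoint name
)
print(rm.score)       # e.g. Linear(in_features=..., out_features=1, bias=False)
print(rm.score.bias)  # None when the head was created with bias=False
```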
--- 2 ---
In the previous PPO implementation, the value head is simply another head that shares the base model backbone. In PPOV2, however, it seems the value model is instantiated separately. Is my understanding correct here? If so, I'm curious about the reasoning behind this, since a separate value model requires roughly an additional reward-model-sized amount of memory. Do you see an improvement in algorithm performance here? A sketch of the contrast I have in mind follows below.
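To spell out that contrast (a sketch under my own assumptions, not the exact TRL internals; checkpoint names are placeholders):

```python
# Sketch of the two setups as I understand them.
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification
from trl import AutoModelForCausalLMWithValueHead

# PPO (v1): a single backbone with a scalar value head on top, so the policy
# and the critic share almost all of their parameters.
policy_with_value_head = AutoModelForCausalLMWithValueHead.from_pretrained("base-model")

# PPOV2: the critic is a separate network with its own backbone and scalar head,
# which costs roughly one extra reward-model-sized copy of weights in memory.
policy = AutoModelForCausalLM.from_pretrained("base-model")
value_model = AutoModelForSequenceClassification.from_pretrained("base-model", num_labels=1)
```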
Many thanks!
P.S. For context, I've been working on a PPO implementation in parallel in Torchtune pytorch/torchtune#1005, and I've found all the empirical work and implementation details invaluable so far.