PPO Questions #119
Comments
It could be good to make things like this configurable in a branch and learn how these implementation details transfer to RLHF.
IMO, residual clipping seems beneficial to prevent the policy loss spikes reported in #101, which are probably coming from instability in the value estimates.
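For reference, a rough sketch of value-side residual clipping as it's commonly done in PPO implementations (function name and `clip_range` are illustrative, not from this repo): the new value prediction is clipped to stay within a band around the rollout-time prediction, and the pessimistic max of the two squared errors is taken, mirroring the policy-side clip.

```python
import torch

def clipped_value_loss(values, old_values, returns, clip_range=0.2):
    # Clip the residual (values - old_values) so the new value prediction
    # stays within +/- clip_range of the rollout-time prediction.
    values_clipped = old_values + torch.clamp(values - old_values, -clip_range, clip_range)
    loss_unclipped = (values - returns) ** 2
    loss_clipped = (values_clipped - returns) ** 2
    # Pessimistic elementwise max, analogous to the policy-side PPO clip.
    return 0.5 * torch.max(loss_unclipped, loss_clipped).mean()
```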
Yeah, I'm running residual clipping example(s), we'll see. At least it'll be good to have the option to try both.
The other approx KL formulation isn't a big help either. W&B here. Though it is slightly more stable?
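For context, a sketch of the two approx KL estimators that usually get compared (the exact variant used in the run above isn't shown here, so this is an assumption about which formulations are meant):

```python
import torch

def approx_kl_estimates(logprobs_new, logprobs_old):
    # Log ratio of new to old policy on the sampled actions.
    log_ratio = logprobs_new - logprobs_old
    # Naive estimator (k1): mean of -log ratio. Unbiased for KL(old || new),
    # but high variance and can go negative on a finite batch.
    k1 = (-log_ratio).mean()
    # Alternative estimator (k3) from http://joschu.net/blog/kl-approx.html:
    # pointwise non-negative and lower variance.
    k3 = (torch.exp(log_ratio) - 1 - log_ratio).mean()
    return k1, k3
```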
Closing this for now, feel free to reopen if there's an update. |
I'm comparing the PPO implementation to the OpenAI one and the implementation-details blog post that goes through it, wondering whether some of these things improve performance. Even if not, it's good for understanding.
I'm guessing the discrepancy comes from the original work vs. the Learning to Summarize work, which is interesting.
Some things to confirm:
`returns = advantages + values` (l693), instead of `adv = returns - values`. Why did it end up like that?
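For what it's worth, here's a rough sketch of the usual GAE bookkeeping that leads to that ordering (names are illustrative, not taken from l693): the advantage is computed first via the backward recursion, and the return used as the value target is then *defined* as `advantages + values` (a TD(lambda)-style return), rather than the advantage being derived from an independently computed return.

```python
import torch

def gae_advantages_and_returns(rewards, values, gamma=0.99, lam=0.95):
    # rewards, values: tensors of shape [T] for one trajectory.
    # Assumes the trajectory terminates at step T (no bootstrap value).
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    # The value target is defined from the advantage estimate, so
    # returns = advantages + values rather than adv = returns - values.
    returns = advantages + values
    return advantages, returns
```

Under this reading, `returns` is a definition built on top of the GAE advantages, so the two lines aren't interchangeable: `adv = returns - values` would only make sense if the returns were computed independently of the advantages.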