PPO Questions #119

Closed
2 of 3 tasks
natolambert opened this issue Jan 28, 2023 · 6 comments
Labels
question Further information is requested

Comments

@natolambert
Contributor

natolambert commented Jan 28, 2023

I'm comparing the PPO implementation here to the OpenAI one and to the implementation details blog post that goes through it. I'm wondering whether some of these differences improve performance. If not, going through them is still good for understanding.

I'm guessing the discrepancies come from the original implementation vs. the "learning to summarize" work, which is interesting.

Some things to confirm:

  • PPO update question: I was a little confused to see returns = advantages + values (l693) instead of adv = returns - values. Why did it end up like that?
  • Some implementations use a residual value prediction in clipping; compare with TRL.
  • Consider the approximate KL used in TRLX and discussed on John Schulman's blog.
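On the first question, one likely reading is that advantages are computed first with GAE, and the value target is then recovered as advantages + values, which is algebraically the same as adv = returns - values. Below is a minimal sketch of that ordering, assuming a single trajectory whose value after the last step is 0. The function names `gae` and `discounted_returns` are hypothetical and are not taken from TRL or the OpenAI code.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimation over one trajectory.

    Assumption: the episode ends after the last step, so the value of
    the state following the final reward is taken to be 0.
    """
    advantages = [0.0] * len(rewards)
    last_adv = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
    # The value target (lambda-return) is recovered after the fact.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


def discounted_returns(rewards, gamma=1.0):
    """Plain Monte Carlo discounted returns, for comparison."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]
```

With lam=1.0 the recovered returns match the plain discounted returns, so the two orderings only differ in which quantity is computed first.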
@natolambert added the question label (Further information is requested) on Jan 28, 2023
@natolambert
Contributor Author

It could be good to make things like this configurable in a branch and learn how these implementation details transfer to RLHF.

@DaehanKim

IMO, residual clipping seems beneficial for preventing the policy loss spiking reported in #101. The spiking is probably coming from instability in value estimation.
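For reference, here is a minimal sketch of one common reading of "residual" value clipping: the change in the value prediction relative to the rollout-time value is clipped, and the loss takes the larger of the clipped and unclipped squared errors. The function names and the default `eps` are hypothetical, and this is an assumption about what is meant in this thread, not a copy of any library's code. Under this reading, the residual form is algebraically the same as clamping the prediction into [v_old - eps, v_old + eps], which the second function shows.

```python
def clip(x, lo, hi):
    """Clamp a scalar into the closed interval [lo, hi]."""
    return max(lo, min(x, hi))


def value_loss_residual_clip(v_pred, v_old, ret, eps=0.2):
    # Clip the residual (v_pred - v_old), then add it back to v_old.
    v_clipped = v_old + clip(v_pred - v_old, -eps, eps)
    return 0.5 * max((v_pred - ret) ** 2, (v_clipped - ret) ** 2)


def value_loss_range_clip(v_pred, v_old, ret, eps=0.2):
    # Clamp the prediction directly into a band around v_old.
    v_clipped = clip(v_pred, v_old - eps, v_old + eps)
    return 0.5 * max((v_pred - ret) ** 2, (v_clipped - ret) ** 2)
```

Because the two forms agree up to floating-point error, switching between them would not be expected to change training behavior by itself.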

@natolambert
Contributor Author

Yeah, I'm running residual clipping example(s), we'll see. At least it'll be good to have the option to try both.

@natolambert
Contributor Author

natolambert commented Jan 30, 2023

Residual value prediction didn't help with stability (it's the crimson-wish run).
[Screenshot, 2023-01-30 at 1:41:28 PM]

@natolambert
Contributor Author

The other approx KL formulation is also not a big help (W&B here), though it's slightly more stable?

We'll see how this run finishes converging.
[Screenshot, 2023-01-30 at 2:33:23 PM]
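For context on the approximate KL formulations, here is a minimal sketch of the three sample-based estimators usually labeled k1, k2, and k3, written for per-token log-probabilities of actions sampled under the old policy. The function name `kl_estimators` and its arguments are hypothetical, and which of these forms each library uses is not confirmed in this thread.

```python
import math


def kl_estimators(logp_new, logp_old):
    """Estimate KL(old || new) from samples drawn under the old policy.

    k1 is unbiased but can come out negative on a batch; k2 and k3 are
    non-negative for every sample, and k3 is also unbiased.
    """
    log_ratio = [n - o for n, o in zip(logp_new, logp_old)]
    count = len(log_ratio)
    k1 = sum(-lr for lr in log_ratio) / count
    k2 = sum(0.5 * lr ** 2 for lr in log_ratio) / count
    k3 = sum((math.exp(lr) - 1.0) - lr for lr in log_ratio) / count
    return k1, k2, k3


# Example batch of two sampled tokens.
k1, k2, k3 = kl_estimators([-1.0, -2.0], [-1.2, -1.7])
```

The practical difference is mostly variance and sign: a batch-level k1 that dips below zero can confuse early-stopping or adaptive-KL logic, while k3 cannot go negative.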

@lvwerra
Member

lvwerra commented Jun 1, 2023

Closing this for now, feel free to reopen if there's an update.

@lvwerra closed this as completed on Jun 1, 2023