PPO Questions #119
Comments
It could be good to make things like this configurable in a branch and learn how these implementation details transfer to RLHF.
IMO, residual clipping seems beneficial to prevent the policy loss spikes reported in #101, which are probably coming from instability in the value estimates.
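For reference, a rough sketch of value-side residual clipping as it's commonly done in PPO implementations (function name and `clip_range` are illustrative, not from this repo): the new value prediction is clipped to stay within a band around the rollout-time prediction, and the pessimistic max of the two squared errors is taken, mirroring the policy-side clip.

```python
import torch

def clipped_value_loss(values, old_values, returns, clip_range=0.2):
    # Clip the residual (values - old_values) so the new value prediction
    # stays within +/- clip_range of the rollout-time prediction.
    values_clipped = old_values + torch.clamp(values - old_values, -clip_range, clip_range)
    loss_unclipped = (values - returns) ** 2
    loss_clipped = (values_clipped - returns) ** 2
    # Pessimistic elementwise max, analogous to the policy-side PPO clip.
    return 0.5 * torch.max(loss_unclipped, loss_clipped).mean()
```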
Yeah, I'm running residual clipping example(s), we'll see. At least it'll be good to have the option to try both.
The other approx KL formulation isn't a big help either. W&B here. Though it is slightly more stable?
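For context, a sketch of the two approx KL estimators that usually get compared (the exact variant used in the run above isn't shown here, so this is an assumption about which formulations are meant):

```python
import torch

def approx_kl_estimates(logprobs_new, logprobs_old):
    # Log ratio of new to old policy on the sampled actions.
    log_ratio = logprobs_new - logprobs_old
    # Naive estimator (k1): mean of -log ratio. Unbiased for KL(old || new),
    # but high variance and can go negative on a finite batch.
    k1 = (-log_ratio).mean()
    # Alternative estimator (k3) from http://joschu.net/blog/kl-approx.html:
    # pointwise non-negative and lower variance.
    k3 = (torch.exp(log_ratio) - 1 - log_ratio).mean()
    return k1, k3
```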
Closing this for now, feel free to reopen if there's an update. |
I'm comparing the PPO implementation to the OpenAI one and the implementation-details blog post that goes through it, wondering whether some of these things improve performance. Even if not, it's good for understanding.
I'm guessing the discrepancy comes from the original work vs. the Learning to Summarize work, which is interesting.
Some things to confirm:
`returns = advantages + values` (l693), instead of `adv = returns - values`. Why did it end up like that?
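For what it's worth, here's a rough sketch of the usual GAE bookkeeping that leads to that ordering (names are illustrative, not taken from l693): the advantage is computed first via the backward recursion, and the return used as the value target is then *defined* as `advantages + values` (a TD(lambda)-style return), rather than the advantage being derived from an independently computed return.

```python
import torch

def gae_advantages_and_returns(rewards, values, gamma=0.99, lam=0.95):
    # rewards, values: tensors of shape [T] for one trajectory.
    # Assumes the trajectory terminates at step T (no bootstrap value).
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    # The value target is defined from the advantage estimate, so
    # returns = advantages + values rather than adv = returns - values.
    returns = advantages + values
    return advantages, returns
```

Under this reading, `returns` is a definition built on top of the GAE advantages, so the two lines aren't interchangeable: `adv = returns - values` would only make sense if the returns were computed independently of the advantages.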