Generalized PPO loss (& improve current GRPO loss) #626

@qingquansong

Description

🚀 The feature, motivation and pitch

  • 1. The current GRPO loss assumes that the KL term is always added and that the advantage is computed inside the loss. We want to make both configurable by the user.
  • 2. Clipping is not used in the current GRPO loss and needs to be added (also as an option for the generalized PPO case).
  • 3. As an option, reference model logits/probabilities can be provided directly, without providing the LM head + hidden states.
  • 4. The old policy's probabilities (or its LM head/hidden states) are not currently provided to the GRPO loss, so they cannot be used for importance sampling.
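The four points above can be illustrated with a minimal, framework-free sketch. It is not this repository's API: the function names (`group_advantages`, `generalized_ppo_loss`) and parameters (`clip_eps`, `beta`, `old_logp`, `ref_logp`) are hypothetical, and it assumes per-token log-probabilities have already been gathered for the sampled tokens. The KL penalty uses the low-variance estimator `exp(d) - d - 1` with `d = ref_logp - logp`, which is an assumption about the intended form.

```python
import math

def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize rewards within one group of samples.
    Hypothetical helper; callers may instead pass precomputed advantages."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

def generalized_ppo_loss(logp, advantages, old_logp=None, ref_logp=None,
                         clip_eps=0.2, beta=0.0):
    """Mean per-token loss over a batch of sequences (illustrative only).

    logp, old_logp, ref_logp: lists of per-sequence lists of token log-probs.
    advantages: one scalar per sequence, supplied by the caller (point 1).
    clip_eps:   PPO clip range, or None to disable clipping (point 2).
    ref_logp:   reference-model log-probs given directly (point 3).
    old_logp:   old-policy log-probs for importance sampling (point 4).
    beta:       KL coefficient; 0.0 disables the KL term (point 1).
    """
    total, count = 0.0, 0
    for i, seq in enumerate(logp):
        adv = advantages[i]
        for t, lp in enumerate(seq):
            # Without old-policy log-probs the ratio is 1 in value; in an
            # autograd framework the old value would be a detached copy.
            lp_old = lp if old_logp is None else old_logp[i][t]
            ratio = math.exp(lp - lp_old)
            surrogate = ratio * adv
            if clip_eps is not None:
                clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
                surrogate = min(surrogate, clipped * adv)
            token_loss = -surrogate
            if ref_logp is not None and beta > 0.0:
                d = ref_logp[i][t] - lp
                token_loss += beta * (math.exp(d) - d - 1.0)
            total += token_loss
            count += 1
    return total / count
```

With this shape, the current GRPO behavior would correspond to calling `group_advantages` on the group rewards, passing `ref_logp` with `beta > 0`, and leaving `old_logp=None`. A plain PPO-style call would pass externally computed advantages and `old_logp`, with `beta=0.0`.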

Alternatives

No response

Additional context

No response
