Generalized PPO loss (& improve current GRPO loss)

### 🚀 The feature, motivation and pitch

- [ ] 1. The current GRPO loss assumes the KL term is always added and computes the advantage inside the loss; we want to make both configurable by the user.
- [ ] 2. Clipping is not used in the current GRPO loss and needs to be added (also as an option for the generalized PPO case).
- [ ] 3. As an option, reference model logits/probs can be provided directly, without providing the LM head + hidden states.
- [ ] 4. The GRPO loss currently takes no old-policy probs (or LM head + hidden states), which are needed for importance sampling.
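
To make the four items concrete, here is a minimal, framework-free sketch of the per-token math such a generalized loss could compute. It is not the existing implementation: `PolicyLossConfig`, `group_advantages`, `policy_loss`, and all parameter names are hypothetical, and the KL term assumes the `exp(d) - d - 1` estimator from the GRPO paper. A real kernel would work on tensors and treat `old_logp` and `ref_logp` as constants with no gradient.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class PolicyLossConfig:
    """Hypothetical config object; the names are illustrative only."""
    clip_eps: Optional[float] = 0.2  # None disables ratio clipping (item 2)
    kl_beta: float = 0.0             # 0.0 disables the KL penalty (item 1)


def group_advantages(rewards: Sequence[float]) -> list:
    """GRPO-style advantages: standardize rewards within one prompt's group.

    Computed outside the loss so callers can swap in another estimator (item 1).
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + 1e-8) for r in rewards]


def policy_loss(
    logp: Sequence[float],        # log-probs of sampled tokens, current policy
    advantages: Sequence[float],  # per-token advantages, precomputed by caller
    cfg: PolicyLossConfig,
    old_logp: Optional[Sequence[float]] = None,  # sampling policy (item 4)
    ref_logp: Optional[Sequence[float]] = None,  # reference model (item 3)
) -> float:
    """Mean per-token loss for a single sequence."""
    if len(logp) != len(advantages):
        raise ValueError("logp and advantages must have the same length")
    if cfg.kl_beta > 0.0 and ref_logp is None:
        raise ValueError("ref_logp is required when kl_beta > 0")
    if old_logp is None:
        # Without old-policy log-probs the ratio is 1 (purely on-policy).
        old_logp = logp

    total = 0.0
    for t in range(len(logp)):
        ratio = math.exp(logp[t] - old_logp[t])  # importance-sampling ratio
        surrogate = ratio * advantages[t]
        if cfg.clip_eps is not None:
            clipped_ratio = min(max(ratio, 1.0 - cfg.clip_eps), 1.0 + cfg.clip_eps)
            # PPO pessimistic bound: take the smaller of the two surrogates.
            surrogate = min(surrogate, clipped_ratio * advantages[t])
        loss_t = -surrogate
        if cfg.kl_beta > 0.0:
            d = ref_logp[t] - logp[t]
            # Non-negative KL estimator used in the GRPO paper.
            loss_t += cfg.kl_beta * (math.exp(d) - d - 1.0)
        total += loss_t
    return total / len(logp)
```

With `clip_eps=None` and `kl_beta=0.0` this reduces to a plain importance-weighted policy gradient; turning on both options gives the clipped, KL-regularized objective. Passing `ref_logp` and `old_logp` as plain per-token values is what lets callers precompute them instead of handing over an LM head and hidden states.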

### Alternatives

_No response_

### Additional context

_No response_

Generalized PPO loss (& improve current GRPO loss) #626
