### 🚀 The feature, motivation and pitch

- [ ] 1. The current GRPO loss assumes that the KL term is always added and that the advantage is computed inside the loss. We want to make both configurable by the user.
- [ ] 2. Clipping is not used in GRPO and needs to be added (also as an option for the generalized PPO case).
- [ ] 3. As an option, reference-model logits/probabilities can be provided directly, without providing the LM head and hidden states.
- [ ] 4. Old-policy probabilities (or the LM head and hidden states) are not currently provided for importance sampling in GRPO; this needs to be supported.

### Alternatives

_No response_

### Additional context

_No response_
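
To make the four requested options concrete, here is a minimal pure-Python sketch of what a configurable GRPO-style loss could look like. It is not the library's API: the function names `group_advantages` and `grpo_loss`, and the parameters `beta`, `clip_eps`, `advantages`, `ref_logps`, and `old_logps`, are all hypothetical. It also assumes per-token log-probabilities are already available as plain floats, a token-level mean reduction, and the commonly used `exp(d) - d - 1` KL estimator.

```python
import math
from typing import List, Optional, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def grpo_loss(
    logps: Sequence[Sequence[float]],            # per-token log-probs, one list per completion
    rewards: Optional[Sequence[float]] = None,   # used only if advantages is None (item 1)
    advantages: Optional[Sequence[float]] = None,  # precomputed advantages (item 1)
    ref_logps: Optional[Sequence[Sequence[float]]] = None,  # reference log-probs passed directly (item 3)
    old_logps: Optional[Sequence[Sequence[float]]] = None,  # old-policy log-probs (item 4)
    beta: float = 0.0,                           # KL coefficient; 0.0 disables the KL term (item 1)
    clip_eps: Optional[float] = None,            # PPO-style clip range; None disables clipping (item 2)
) -> float:
    if advantages is None:
        if rewards is None:
            raise ValueError("provide either `advantages` or `rewards`")
        advantages = group_advantages(rewards)
    if beta != 0.0 and ref_logps is None:
        raise ValueError("`ref_logps` is required when beta != 0")

    total, n_tokens = 0.0, 0
    for i, seq in enumerate(logps):
        adv = advantages[i]
        for t, lp in enumerate(seq):
            # Without old-policy log-probs the ratio is 1 (on-policy case).
            # In an autograd framework the fallback would be a detached copy of lp.
            old = old_logps[i][t] if old_logps is not None else lp
            ratio = math.exp(lp - old)
            surrogate = ratio * adv
            if clip_eps is not None:
                clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
                surrogate = min(surrogate, clipped * adv)
            kl = 0.0
            if beta != 0.0:
                d = ref_logps[i][t] - lp
                kl = math.exp(d) - d - 1.0  # non-negative KL estimator
            total += -(surrogate - beta * kl)
            n_tokens += 1
    return total / n_tokens  # assumption: mean over all tokens in the group
```

With the defaults (`beta=0.0`, `clip_eps=None`, no `old_logps`), the sketch reduces to a plain advantage-weighted objective, so each of the four items maps to one optional argument that a caller can switch on independently.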