Description
Is your feature request related to a problem? Please describe.
I'm working on the Tunix hackathon and running into stability issues training on rubric-based rewards (for the "show your work" objective). The current GRPO works well when you have binary correct/wrong signals from a verifier, but my setup uses continuous scores from a reward model evaluating reasoning quality.
The problem is GRPO's std normalization amplifies noise when rewards are subjective. I'm also seeing reward hacking since there's no KL mechanism to keep the policy from drifting too far to exploit the reward model.
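For intuition (the numbers below are purely illustrative): when a reward model returns tightly clustered continuous scores, dividing by the group std stretches tiny, possibly-noisy differences into full-scale advantages, whereas a plain mean baseline keeps them on the original reward scale.

```python
import jax.numpy as jnp

# Hypothetical reward-model scores for one prompt's group of 4 completions;
# the ~0.01 gaps may be little more than scoring noise.
rewards = jnp.array([0.71, 0.72, 0.73, 0.74])

grpo_adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # GRPO-style
mean_adv = rewards - rewards.mean()                             # mean baseline only

print(grpo_adv)  # ~[-1.34, -0.45, 0.45, 1.34]    noise amplified to O(1)
print(mean_adv)  # ~[-0.015, -0.005, 0.005, 0.015] stays on the reward scale
```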
Describe the solution you'd like
Add RLOO (REINFORCE Leave-One-Out) as an alternative advantage estimator. The main difference from GRPO:
# GRPO (current)
A_i = (R_i - mean(R)) / std(R)

# DrGRPO (removes std, already exists)
A_i = R_i - mean(R)

# RLOO (leave-one-out baseline, KL folded into the reward)
R'_i = R_i - β * KL_i
A_i  = R'_i - mean(R'_j for j != i)

RLOO uses a leave-one-out baseline instead of the group mean (and no division by std), which is more numerically stable when rewards are continuous and noisy. Critically, it folds the KL penalty directly into the reward, R'_i = R_i - β * KL_i, before advantage computation, rather than adding a KL term to the loss function afterward.
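For concreteness, here's a minimal standalone sketch of that computation in JAX-style Python (not Tunix code; the group size, the β value, and the choice to take the leave-one-out baseline over the KL-shaped rewards R' are my reading of the description above):

```python
import jax.numpy as jnp

def rloo_advantages(rewards, kl, beta=0.04):
    """RLOO advantages for one group of G completions of the same prompt.

    rewards: (G,) reward-model scores R_i
    kl:      (G,) per-completion KL(policy || reference)
    beta:    illustrative KL coefficient
    """
    shaped = rewards - beta * kl                      # R'_i = R_i - beta * KL_i
    g = shaped.shape[0]
    # Leave-one-out baseline: mean of the other G-1 shaped rewards.
    baseline = (shaped.sum() - shaped) / (g - 1)
    # Equivalent to (G / (G - 1)) * (R'_i - mean(R')); advantages sum to zero,
    # but nothing is divided by the group std.
    return shaped - baseline

adv = rloo_advantages(
    rewards=jnp.array([0.9, 0.4, 0.6, 0.7]),
    kl=jnp.array([0.02, 0.01, 0.03, 0.02]),
)
print(adv)
```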
Ahmadian et al. 2024 showed RLOO is more robust to noisy rewards than PPO/GRPO and outperforms DPO on preference tasks. It's also what PRIME uses under the hood.
Implementation approach
Verified that Tunix already has the infrastructure for this (function_registry.py):
- Add an `advantage_estimator='rloo'` option in `GRPOConfig`
- Add a `kl_in_reward: bool = False` parameter to control where the KL penalty is applied
- Register a new advantage estimator function with `@function_registry.register_advantage_estimator("rloo")` (following the `drgrpo_learner.py` pattern)
- Modify the reward computation in `_generate_and_compute_advantage()` to optionally fold KL into the rewards when `kl_in_reward=True`
This follows the existing pluggable advantage estimator pattern and doesn't require a separate learner class (rough sketch below).
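Here is what the registered estimator could look like, following the steps above. The decorator name and config fields come from this proposal; the import path and the signature the registry passes (a per-group reward array) are assumptions on my part, not taken from the Tunix source:

```python
import jax.numpy as jnp
from tunix import function_registry  # assumed import path; the issue only names function_registry.py

@function_registry.register_advantage_estimator("rloo")
def rloo_advantage(rewards: jnp.ndarray) -> jnp.ndarray:
    """Leave-one-out advantages per prompt group.

    rewards: (num_prompts, group_size); already KL-shaped upstream by
    _generate_and_compute_advantage() when kl_in_reward=True.
    """
    g = rewards.shape[-1]
    baseline = (rewards.sum(axis=-1, keepdims=True) - rewards) / (g - 1)
    return rewards - baseline
```

Selection would then presumably just be `GRPOConfig(advantage_estimator='rloo', kl_in_reward=True)`, matching the two config fields listed above.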
Additional context
This would help hackathon participants working on non-verifiable tasks. Right now GRPO is tuned for math/code with binary rewards; RLOO would extend Tunix to creative/subjective domains.
Reference implementations:
Checklist
- I have searched the existing issues for similar feature requests.
- I have verified the codebase already supports pluggable advantage estimators via function_registry.
- This is not a support question (please use the "bug template" for that).