
Feature Request: RLOO (REINFORCE Leave-One-Out) Advantage Estimator #953

@42euge

Description

Is your feature request related to a problem? Please describe.

I'm working on the Tunix hackathon and running into stability issues training on rubric-based rewards (for the "show your work" objective). The current GRPO works well when you have binary correct/wrong signals from a verifier, but my setup uses continuous scores from a reward model evaluating reasoning quality.

The problem is that GRPO's std normalization amplifies noise when rewards are subjective. I'm also seeing reward hacking, since there's no KL-in-reward mechanism to keep the policy from drifting too far to exploit the reward model.

Describe the solution you'd like

Add RLOO (REINFORCE Leave-One-Out) as an alternative advantage estimator. The main difference from GRPO:

# GRPO (current)
A_i = (R_i - mean(R)) / std(R)

# DrGRPO (removes std, already exists)
A_i = R_i - mean(R)

# RLOO (KL folded into reward, then leave-one-out baseline)
R'_i = R_i - β * KL_i
A_i  = R'_i - mean(R'_j for j != i)

RLOO uses a leave-one-out baseline instead of the group mean, so each sample's reward is excluded from its own baseline, and it drops the std division entirely. Critically, it folds the KL penalty directly into the reward (R'_i = R_i - β * KL_i) before advantage computation, rather than adding it to the loss function afterward.
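For concreteness, here is a minimal, framework-free sketch of the two estimators on one group of rollouts. This is plain NumPy, not Tunix code; beta and the example numbers are made-up placeholders:

import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    # GRPO: center by the group mean, then divide by the group std.
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def rloo_advantages(rewards, kls, beta=0.04):
    # RLOO: fold the KL penalty into the reward first...
    r = np.asarray(rewards, dtype=np.float64) - beta * np.asarray(kls, dtype=np.float64)
    n = r.size
    # ...then baseline each sample with the mean of the other n-1 samples.
    return r - (r.sum() - r) / (n - 1)

rewards = [0.81, 0.76, 0.79, 0.80]    # continuous rubric scores, nearly tied
kls     = [0.10, 0.35, 0.12, 0.15]    # per-rollout KL(policy || reference)
print(grpo_advantages(rewards))       # std division blows tiny gaps up to O(1)
print(rloo_advantages(rewards, kls))  # differences stay on the reward scale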

Ahmadian et al. 2024 showed RLOO is more robust to noisy rewards than PPO/GRPO and outperforms DPO on preference tasks. It's also what PRIME uses under the hood.

Implementation approach

I verified that Tunix already has the infrastructure for this (function_registry.py):

  1. Add advantage_estimator='rloo' option in GRPOConfig
  2. Add kl_in_reward: bool = False parameter to control where KL is applied
  3. Register new advantage estimator function with @function_registry.register_advantage_estimator("rloo") (following drgrpo_learner.py pattern)
  4. Modify reward computation in _generate_and_compute_advantage() to optionally fold KL into rewards when kl_in_reward=True

This follows the existing pluggable advantage estimator pattern and doesn't require a separate learner class.
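As a rough sketch of what the registered estimator could look like, following the drgrpo_learner.py pattern mentioned above: the decorator is the one named in this issue, but the import path, the callback signature (a [num_prompts, group_size] rewards array), and the step-4 wiring are assumptions about the Tunix API, not verified code:

import jax.numpy as jnp
from tunix.rl import function_registry  # exact import path assumed

@function_registry.register_advantage_estimator("rloo")
def rloo_advantages(rewards):
  # `rewards` assumed shaped [num_prompts, group_size], already KL-adjusted
  # upstream when kl_in_reward=True (step 4 above).
  g = rewards.shape[-1]  # rollouts per prompt; needs g >= 2
  # Leave-one-out baseline: mean of the other g - 1 rollouts in the group.
  loo_baseline = (jnp.sum(rewards, axis=-1, keepdims=True) - rewards) / (g - 1)
  return rewards - loo_baseline

# Step 4, inside _generate_and_compute_advantage() (pseudocode, names assumed):
#   if config.kl_in_reward:
#       rewards = rewards - config.beta * per_sample_kl

With that wiring, setting advantage_estimator='rloo' and kl_in_reward=True in GRPOConfig would cover steps 1 and 2 without a separate learner class.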

Additional context

This would help hackathon participants working on non-verifiable tasks. Right now GRPO is tuned for math/code tasks with binary rewards; RLOO would extend Tunix to creative and subjective domains.

Reference implementations:

Checklist

  • I have searched the existing issues for similar feature requests.
  • I have verified the codebase already supports pluggable advantage estimators via function_registry.
  • This is not a support question (please use the "bug template" for that).
