
Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR

Paper · GitHub · Hugging Face

Overview

We introduce Length-Unbiased Sequence Policy Optimization (LUSPO), a novel reinforcement learning algorithm for training large language models. By applying a length-aware adjustment to sequence-level optimization, LUSPO removes the response-length bias inherent in GSPO, improving training stability and performance on both text-only and multimodal tasks.
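
To make the length bias concrete, here is a toy sketch, not the paper's implementation (the actual loss lives in core_algos.py). GSPO's sequence-level importance ratio is the length-normalized (geometric-mean) token ratio; the second function below illustrates one way a length-aware adjustment could avoid the per-response 1/|y| rescaling. The function names and the fixed reference length are assumptions for illustration only.

```python
import math

def gspo_seq_ratio(logp_new, logp_old):
    """GSPO-style sequence ratio: the length-normalized (geometric-mean)
    per-token importance ratio, i.e. exp(mean per-token log-ratio)."""
    n = len(logp_new)
    return math.exp(sum(a - b for a, b in zip(logp_new, logp_old)) / n)

def length_adjusted_ratio(logp_new, logp_old, ref_len):
    """Hypothetical length-aware adjustment (illustrative only): normalize
    by a fixed reference length instead of each response's own length, so
    the weight is not rescaled by 1/|y| per response.  The actual LUSPO
    rule is in core_algos.py."""
    return math.exp(sum(a - b for a, b in zip(logp_new, logp_old)) / ref_len)

# Same per-token drift (0.01 nats), short vs. long response:
short_new, short_old = [0.01] * 10, [0.0] * 10
long_new, long_old = [0.01] * 100, [0.0] * 100

# GSPO's 1/|y| normalization makes both ratios (nearly) identical,
# erasing the effect of response length:
print(gspo_seq_ratio(short_new, short_old))  # ≈ exp(0.01)
print(gspo_seq_ratio(long_new, long_old))    # ≈ exp(0.01)
```

With the fixed-reference normalization, the two responses receive different weights again, so length variation is no longer hidden by the normalizer.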

Table: Evaluation results on text-only and multimodal benchmarks.

Figure: Response length of GSPO and LUSPO during training.

Installation

  1. The codebase is built on verl. Install the verl environment first; for detailed instructions, please refer to the official documentation.

  2. Replace verl's core_algos.py with our core_algos.py:

    cp core_algos.py verl/verl/trainer/ppo/core_algos.py

Training

  1. Download the model weights and dataset from Hugging Face. The directory structure should be:

    LUSPO/
    ├── verl/
    ├── models/
    │   └── Qwen2.5-7B/
    ├── datasets/
    │   ├── dapo-math-17k.parquet
    │   └── aime-2024.parquet
    ├── outputs/
    │   └── checkpoints/
    └── ...
  2. Run the training script. We provide a script, train.sh, as an example. If you customize the training parameters, make sure loss_mode is set to luspo:

    loss_mode=luspo
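
In a verl-style train.sh this is typically passed as a Hydra override. The config key below is an assumption based on recent verl versions; check the provided train.sh for the authoritative key and the full set of overrides.

```shell
# Assumed Hydra-style override inside train.sh (the key path is a guess
# from recent verl versions; verify against the provided train.sh):
loss_mode=luspo
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.actor.policy_loss.loss_mode="${loss_mode}" \
    data.train_files=datasets/dapo-math-17k.parquet \
    data.val_files=datasets/aime-2024.parquet
```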

Evaluation

  1. If you want to perform evaluation during training, set data.val_files="$test_files" in the script.

  2. If you want to use another evaluation framework, convert the trained checkpoints to the Hugging Face format:

    python -m verl.model_merger merge \
        --backend fsdp \
        --local_dir checkpoints/.../actor \
        --target_dir /path/to/merged_hf_model

Acknowledgement

  • verl: the codebase we built upon.

Citation

@misc{liu2026lengthunbiasedsequencepolicyoptimization,
      title={Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR}, 
      author={Fanfan Liu and Youyang Yin and Peng Shi and Siqi Yang and Zhixiong Zeng and Haibo Qiu},
      year={2026},
      eprint={2602.05261},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.05261}, 
}

About

Official code implementation of Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR
