We introduce Length-Unbiased Sequence Policy Optimization (LUSPO), a novel reinforcement learning algorithm for training large language models. By applying a length-aware adjustment to sequence-level optimization, LUSPO addresses the response length bias inherent in GSPO, resulting in improved training stability and performance on both text-only and multimodal tasks.
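For orientation: GSPO optimizes a sequence-level importance ratio that is normalized by response length,

$$
s_i(\theta) = \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)}\right)^{1/|y_i|},
$$

so the effective update magnitude, and hence how often a sequence is clipped, varies with $|y_i|$. One way to read LUSPO's length-aware adjustment is as removing this length dependence; the exact correction is defined in the paper and implemented in `core_algos.py`.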
- The codebase is based on verl. You need to install the verl environment first. For detailed instructions, please refer to the official documentation.
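  As a minimal sketch, installing verl from source is one common route (the official documentation is authoritative; GPU and inference-backend requirements vary):

  ```bash
  # One common setup path: clone verl and install in editable mode.
  # See the verl docs for version pins and vLLM/SGLang requirements.
  git clone https://github.com/volcengine/verl.git
  cd verl
  pip install -e .
  ```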
- Replace verl's `core_algos.py` with our `core_algos.py`:

  ```bash
  cp core_algos.py /verl/verl/trainer/ppo/core_algos.py
  ```
- Download the model weights and dataset from Huggingface. The directory structure should be:

  ```
  LUSPO/
  ├── verl/
  ├── models/
  │   └── Qwen2.5-7B/
  ├── datasets/
  │   ├── dapo-math-17k.parquet
  │   └── aime-2024.parquet
  ├── outputs/
  │   └── checkpoints/
  └── ...
  ```
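  For example, with `huggingface-cli` (the model repo id is the public Qwen release; `<dataset-repo>` is a placeholder for whichever Hugging Face dataset repo hosts the parquet files):

  ```bash
  # Model weights from the public Qwen repo.
  huggingface-cli download Qwen/Qwen2.5-7B --local-dir models/Qwen2.5-7B
  # Datasets: <dataset-repo> is a placeholder for the actual HF dataset repo.
  huggingface-cli download <dataset-repo> dapo-math-17k.parquet aime-2024.parquet \
      --repo-type dataset --local-dir datasets/
  ```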
- Run the training script. We provide a script `train.sh` as an example. If you want to customize the training parameters, make sure to set `loss_mode` to `luspo`:

  ```bash
  loss_mode=luspo
  ```
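  A hypothetical excerpt showing how this might be wired into verl's Hydra-style overrides (the exact config key can differ across verl versions, so treat `train.sh` as the reference):

  ```bash
  # Hypothetical override block; adjust key paths to your verl version.
  python3 -m verl.trainer.main_ppo \
      data.train_files=datasets/dapo-math-17k.parquet \
      actor_rollout_ref.actor.policy_loss.loss_mode=luspo \
      trainer.default_local_dir=outputs/checkpoints
  ```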
- If you want to perform evaluation during training, you can set `data.val_files="$test_files"` in the script.
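  For instance (a sketch following the directory layout above; `trainer.test_freq` is verl's knob for how often validation runs):

  ```bash
  # Hypothetical wiring: point val_files at the held-out set.
  test_files=datasets/aime-2024.parquet
  # Then, inside the training command in train.sh:
  #   data.val_files="$test_files" \
  #   trainer.test_freq=10 \
  ```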
- If you want to use another evaluation framework, you can convert the trained checkpoints to the HuggingFace format:

  ```bash
  python -m verl.model_merger merge \
      --backend fsdp \
      --local_dir checkpoints/.../actor \
      --target_dir /path/to/merged_hf_model
  ```
- verl: the codebase we built upon.
```bibtex
@misc{liu2026lengthunbiasedsequencepolicyoptimization,
      title={Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR},
      author={Fanfan Liu and Youyang Yin and Peng Shi and Siqi Yang and Zhixiong Zeng and Haibo Qiu},
      year={2026},
      eprint={2602.05261},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.05261},
}
```

