We introduce Length-Unbiased Sequence Policy Optimization (LUSPO), a novel reinforcement learning algorithm for training large language models. By applying a length-aware adjustment to sequence-level optimization, LUSPO addresses the response length bias inherent in GSPO, resulting in improved training stability and performance on both text-only and multimodal tasks.
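For orientation: GSPO optimizes a sequence-level importance ratio that is normalized by response length,

$$
s_i(\theta) = \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)}\right)^{1/|y_i|},
$$

so the effective update magnitude, and hence how often a sequence is clipped, varies with $|y_i|$. One way to read LUSPO's length-aware adjustment is as removing this length dependence; the exact correction is defined in the paper and implemented in `core_algos.py`.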
- The codebase is based on verl. You need to install the verl environment first. For detailed instructions, please refer to the official documentation.
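  As a minimal sketch, installing verl from source is one common route (the official documentation is authoritative; GPU and inference-backend requirements vary):

  ```bash
  # One common setup path: clone verl and install in editable mode.
  # See the verl docs for version pins and vLLM/SGLang requirements.
  git clone https://github.com/volcengine/verl.git
  cd verl
  pip install -e .
  ```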
- Replace verl's `core_algos.py` with our `core_algos.py`:

  ```bash
  cp core_algos.py /verl/verl/trainer/ppo/core_algos.py
  ```
- Download the model weights and dataset from Huggingface. The directory structure should be:

  ```
  LUSPO/
  ├── verl/
  ├── models/
  │   └── Qwen2.5-7B/
  ├── datasets/
  │   ├── dapo-math-17k.parquet
  │   └── aime-2024.parquet
  ├── outputs/
  │   └── checkpoints/
  └── ...
  ```
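  For example, with `huggingface-cli` (the model repo id is the public Qwen release; `<dataset-repo>` is a placeholder for whichever Hugging Face dataset repo hosts the parquet files):

  ```bash
  # Model weights from the public Qwen repo.
  huggingface-cli download Qwen/Qwen2.5-7B --local-dir models/Qwen2.5-7B
  # Datasets: <dataset-repo> is a placeholder for the actual HF dataset repo.
  huggingface-cli download <dataset-repo> dapo-math-17k.parquet aime-2024.parquet \
      --repo-type dataset --local-dir datasets/
  ```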
- Run the training script. We provide a script `train.sh` as an example. If you want to customize the training parameters, make sure to set `loss_mode` to `luspo`:

  ```bash
  loss_mode=luspo
  ```
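  A hypothetical excerpt showing how this might be wired into verl's Hydra-style overrides (the exact config key can differ across verl versions, so treat `train.sh` as the reference):

  ```bash
  # Hypothetical override block; adjust key paths to your verl version.
  python3 -m verl.trainer.main_ppo \
      data.train_files=datasets/dapo-math-17k.parquet \
      actor_rollout_ref.actor.policy_loss.loss_mode=luspo \
      trainer.default_local_dir=outputs/checkpoints
  ```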
- If you want to perform evaluation during training, you can set `data.val_files="$test_files"` in the script.
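  For instance (a sketch following the directory layout above; `trainer.test_freq` is verl's knob for how often validation runs):

  ```bash
  # Hypothetical wiring: point val_files at the held-out set.
  test_files=datasets/aime-2024.parquet
  # Then, inside the training command in train.sh:
  #   data.val_files="$test_files" \
  #   trainer.test_freq=10 \
  ```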
- If you want to use another evaluation framework, you can convert the trained checkpoints to the HuggingFace format:

  ```bash
  python -m verl.model_merger merge \
      --backend fsdp \
      --local_dir checkpoints/.../actor \
      --target_dir /path/to/merged_hf_model
  ```
- verl: the codebase we built upon.
```bibtex
@misc{liu2026lengthunbiasedsequencepolicyoptimization,
      title={Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR},
      author={Fanfan Liu and Youyang Yin and Peng Shi and Siqi Yang and Zhixiong Zeng and Haibo Qiu},
      year={2026},
      eprint={2602.05261},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2602.05261},
}
```

