millioniron/async_fixed

 
 


Missing Old Logits in Asynchronous Agentic RL:
Semantic Mismatch and Repair Methods for Off-Policy Correction




News

  • [2026-05-13] The project README is updated with the paper overview, released links, and implementation entry points.
  • [2026-05] Paper released: arXiv and Hugging Face Papers.

Overview

[Figure: Synchronous versus asynchronous RL]

Asynchronous reinforcement learning improves rollout throughput for large language model agents by decoupling sample generation from policy optimization. This decoupling also creates a failure mode for PPO-style off-policy correction: the historical training-side logits required for a clean correction may no longer be available when a trajectory reaches the actor.

The total importance ratio should ideally separate two effects:

  • Training-inference discrepancy: the numerical gap between inference-side rollout probabilities and training-side forward probabilities at the same behavior-policy version.
  • Policy staleness: the update gap between the historical behavior policy and the current train policy.
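The separation can be written per token. The notation here is an assumption made for illustration, not taken from the paper: $\mu_k$ is the inference-side rollout distribution at behavior version $k$, $\pi_k$ is the training-side forward distribution at the same version, and $\pi_\theta$ is the current train policy.

```latex
\[
\frac{\pi_\theta(a_t \mid s_t)}{\mu_k(a_t \mid s_t)}
=
\underbrace{\frac{\pi_k(a_t \mid s_t)}{\mu_k(a_t \mid s_t)}}_{\text{training-inference discrepancy}}
\times
\underbrace{\frac{\pi_\theta(a_t \mid s_t)}{\pi_k(a_t \mid s_t)}}_{\text{policy staleness}}
\]
```

The intermediate term $\pi_k$ is what the old training-side logits supply. Without it, only the product on the left-hand side is observable.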

When old training-side logits are missing, these two effects become entangled. Discrepancy masks and PPO clipping then operate on a semantically mixed ratio, causing threshold choices to interfere with each other and making asynchronous training harder to tune.
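The entanglement can be seen on a single token. The sketch below is illustrative only: the function names, the probabilities, and the mask and clip thresholds are all hypothetical, not values or APIs from this repository.

```python
import math


def split_ratios(logp_current, logp_rollout, logp_old_train=None):
    """Return (discrepancy_ratio, staleness_ratio) for one token.

    logp_old_train is the training-side log-prob at the behavior version.
    When it is missing, this sketch falls back to the current train
    log-prob, so the whole gap lands in the discrepancy ratio.
    """
    if logp_old_train is None:
        logp_old_train = logp_current  # fallback: recompute from current actor
    discrepancy = math.exp(logp_old_train - logp_rollout)
    staleness = math.exp(logp_current - logp_old_train)
    return discrepancy, staleness


def passes_mask(discrepancy, low=0.5, high=2.0):
    """Discrepancy mask with hypothetical thresholds."""
    return low <= discrepancy <= high


def clip_ratio(staleness, eps=0.2):
    """PPO-style clipping of the staleness ratio (hypothetical eps)."""
    return min(max(staleness, 1.0 - eps), 1.0 + eps)


# Hypothetical probabilities of the sampled token under each distribution.
logp_rollout = math.log(0.20)    # inference engine, behavior version
logp_old_train = math.log(0.21)  # training forward pass, behavior version
logp_current = math.log(0.50)    # training forward pass, current version

exact = split_ratios(logp_current, logp_rollout, logp_old_train)
mixed = split_ratios(logp_current, logp_rollout)  # old logits missing

print("exact:", exact, passes_mask(exact[0]), clip_ratio(exact[1]))
print("mixed:", mixed, passes_mask(mixed[0]), clip_ratio(mixed[1]))
```

With the old training-side log-prob, the discrepancy ratio is about 1.05 and the staleness ratio about 2.38, so the token passes the mask and the clip handles the staleness. Without it, the same token shows a discrepancy ratio of 2.5 and a staleness ratio of 1.0: the mask rejects the token for what is really staleness, and the clip never activates.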

This repository contains the ROLL-based implementation used to study and repair this missing-old-logit problem. It keeps the async rollout/training paths, train-infer correction utilities, snapshot-based old-logprob recomputation, and EWMA-style proximal reference policies used by the paper.

Method

[Figure: Exact old-logit acquisition]

We study two complementary directions.

  • Exact old-logit acquisition: recover semantically correct old logits through system support. The paper analyzes snapshot-based version tracking, a dedicated old-logit model, and synchronization through partial rollout interruption.
  • Low-cost approximate correction: when exact old logits are too expensive, use a revised PPO-EWMA reference policy to preserve most of the benefit of decoupled correction without introducing heavy system overhead.

The implementation in this branch focuses on the practical routes below.

| Route | Purpose |
| --- | --- |
| snapshot_old | Exact correction through actor snapshots and version-aware old log-prob recomputation. |
| current_recompute | A simple baseline that recomputes old log-probs from the current train actor. |
| interp_prox_loglinear | A log-linear proximal reference used as an approximation baseline. |
| ewma_prox | An EWMA proximal reference for lower-overhead off-policy correction. |
| ppo_ewma_async | An EWMA-based reference-policy variant for asynchronous training. |
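As a rough illustration of what a log-linear proximal reference can look like, the sketch below interpolates per-token log-probs between the behavior policy and the current policy. This is one plausible reading of the route name, not the repository's code: the function `loglinear_prox`, the weight `alpha`, and the way `alpha` would be chosen are all assumptions.

```python
def loglinear_prox(logp_behavior, logp_current, alpha):
    """Interpolate two per-token log-prob sequences in log space.

    alpha = 0.0 returns the behavior log-probs, alpha = 1.0 returns the
    current log-probs. The result is a per-token reference value for the
    sampled tokens; it is not renormalized over the vocabulary.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(logp_behavior) != len(logp_current):
        raise ValueError("sequences must have the same length")
    return [
        (1.0 - alpha) * b + alpha * c
        for b, c in zip(logp_behavior, logp_current)
    ]


# Hypothetical log-probs for a two-token response.
behavior = [-1.0, -2.0]
current = [-3.0, -2.0]
midpoint = loglinear_prox(behavior, current, alpha=0.5)
```

A reference of this kind needs no stored snapshot, only the two log-prob sequences already present in the batch, which fits its role as an approximation baseline.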

Results

PPO-EWMA is the strongest practical method in the main benchmark table: on every metric it matches or exceeds Decoupled PPO and Linear_prox. Snapshot (marked †) is included as an idealized exact-recovery reference because it assumes exact old logits are available.

| Backbone | Method | Retail avg@4 | Retail pass@4 | Airline avg@4 | Airline pass@4 | Telecom avg@2 | Telecom pass@2 | Vita In-store avg@2 | Vita In-store pass@2 | Vita Delivery avg@2 | Vita Delivery pass@2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-4B | Decoupled PPO | 63.96 | 88.60 | 53.5 | 72 | 40 | 50 | 19.83 | 37 | 19.56 | 33 |
| Qwen3-4B | Linear_prox | 64.40 | 86.84 | 54 | 72 | 37.5 | 50 | 22.37 | 40 | 19.10 | 28 |
| Qwen3-4B | PPO-EWMA | 65.72 | 90.35 | 54 | 74 | 42.5 | 52.5 | 25 | 50 | 25.88 | 39 |
| Qwen3-4B | Snapshot† | 66.23 | 89.47 | 56 | 76 | 42.5 | 52.5 | 28.89 | 47 | 27.33 | 42 |
| Qwen3-30B-A3B | Decoupled PPO | 65.43 | 89.47 | 57 | 76 | 44.75 | 55 | 18.28 | 32 | 25.88 | 39 |
| Qwen3-30B-A3B | Linear_prox | 65.8 | 87.7 | 53.5 | 74 | 44 | 55 | 31.47 | 47 | 20.74 | 33 |
| Qwen3-30B-A3B | PPO-EWMA | 67.82 | 92.1 | 60 | 82 | 45 | 57.5 | 33.41 | 48 | 28.49 | 43 |
| Qwen3-30B-A3B | Snapshot† | 69.70 | 92.1 | 59 | 80 | 45 | 57.5 | 34.62 | 50 | 30.74 | 45 |

On the dense 4B model, PPO-EWMA obtains the best retail pass@4 (90.35) and VitaBench in-store pass@2 (50) of any method, Snapshot included, and ties Snapshot on both telecom metrics (42.5 avg@2, 52.5 pass@2). On the 30B MoE model, it is strongest on airline (60 avg@4, 82 pass@4) and ties Snapshot on retail pass@4 (92.1) and on both telecom metrics (45 avg@2, 57.5 pass@2).

The cost comparison also explains why exact recovery is not always the practical choice.

| Method | 4B CPU storage | 30B CPU storage | 4B extra time | 30B extra time |
| --- | --- | --- | --- | --- |
| Snapshot | 40 GB | 76.4 GB | 25 s | 150 s |
| PPO-EWMA | 7.9 GB | 15.2 GB | 8 s | 34 s |
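To make the gap concrete, the short script below computes the Snapshot to PPO-EWMA cost ratios from the table. The numbers are copied from the table; the dictionary layout and the helper name `cost_ratios` are just for illustration.

```python
# (CPU storage in GB, extra time in seconds), copied from the cost table.
COSTS = {
    "Snapshot": {"4B": (40.0, 25.0), "30B": (76.4, 150.0)},
    "PPO-EWMA": {"4B": (7.9, 8.0), "30B": (15.2, 34.0)},
}


def cost_ratios(size):
    """Return (storage_ratio, time_ratio) of Snapshot relative to PPO-EWMA."""
    snap_gb, snap_sec = COSTS["Snapshot"][size]
    ewma_gb, ewma_sec = COSTS["PPO-EWMA"][size]
    return snap_gb / ewma_gb, snap_sec / ewma_sec


for size in ("4B", "30B"):
    storage_ratio, time_ratio = cost_ratios(size)
    print(f"{size}: storage x{storage_ratio:.2f}, extra time x{time_ratio:.2f}")
```

At both model sizes Snapshot needs roughly five times the CPU storage of PPO-EWMA, and between about three and four and a half times the extra time.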

[Figure: Threshold trade-off]

The experiments show a consistent trade-off: stricter discrepancy and staleness thresholds filter more mismatched tokens and stabilize training, while looser settings improve early optimization speed but can accumulate unstable updates later.

[Figure: PPO-EWMA beta comparison]

PPO-EWMA provides a practical middle ground. It constructs a smoother reference from historical policies, and the asynchronous variant uses staleness-aware decay selection and auto-reset to avoid excessive stale-policy accumulation.
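A minimal sketch of the idea, assuming the reference is an exponentially weighted moving average of policy parameters with decay `beta`. The class `EwmaReference`, the scalar parameters, and the step-count reset trigger `reset_after` are hypothetical stand-ins. The repository's actual reset criterion and its staleness-aware choice of decay are not described here and are not reproduced.

```python
class EwmaReference:
    """EWMA reference over named scalar parameters (illustrative only)."""

    def __init__(self, params, beta=0.9, reset_after=8):
        self.params = dict(params)      # reference starts at the initial policy
        self.beta = beta                # weight kept on the old reference
        self.reset_after = reset_after  # hypothetical auto-reset trigger
        self.steps_since_reset = 0

    def update(self, current):
        """Fold the current policy into the reference, or reset to it."""
        if self.steps_since_reset >= self.reset_after:
            # Auto-reset: drop accumulated history and restart from the
            # current policy so stale versions stop contributing.
            self.params = dict(current)
            self.steps_since_reset = 0
            return
        for name, value in current.items():
            old = self.params[name]
            self.params[name] = self.beta * old + (1.0 - self.beta) * value
        self.steps_since_reset += 1


ref = EwmaReference({"w": 0.0}, beta=0.5, reset_after=2)
ref.update({"w": 1.0})  # w becomes 0.5
ref.update({"w": 1.0})  # w becomes 0.75
ref.update({"w": 2.0})  # reset: w becomes 2.0
```

Only one extra copy of the parameters is kept regardless of how many policy versions have passed, which is the main reason an EWMA reference is cheaper than storing per-version snapshots.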

[Figure: PPO-EWMA auto-reset]

Getting Started

Pull the prepared Docker image:

docker pull roll-registry.cn-hangzhou.cr.aliyuncs.com/roll/pytorch:nvcr-25.06-py3-torch280-vllm0110

Start the container. Replace <path-to-ROLL> with the absolute path to this repository on your machine.

docker run -itd --name ROLL \
  --network host \
  --ipc host \
  -p 127.0.0.1:5007:5007 \
  --gpus all \
  -e PYTHONPATH=/ROLL \
  -v <path-to-ROLL>:/ROLL \
  --entrypoint /bin/bash \
  roll-registry.cn-hangzhou.cr.aliyuncs.com/roll/pytorch:nvcr-25.06-py3-torch280-vllm0110

Install the local adapter and SGLang dependencies inside the container:

docker exec -it ROLL bash
cd /ROLL/mcore_adapter
pip install -e .
pip install "sglang[srt,torch-memory-saver]==0.5.2"

Experiment configs are collected under examples/fixed_async_offpolicy. They cover snapshot-based recovery, current-policy recomputation, log-linear proximal references, and EWMA-based asynchronous variants.

Key implementation paths:

This branch does not carry the dedicated actor_old_logit / old_logit_model implementation path.

Citation

@article{guan2026missing,
  title={Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction},
  author={Guan, Zhong and Guo, Yongjian and Sun, Haoran and Huang, Wen and Di, Shuai and Wu, Xiong Jun and Wu, Likang and Zhao, Hongke},
  journal={arXiv preprint arXiv:2605.12070},
  year={2026}
}

License

This repository continues to use the upstream Apache 2.0 license. See LICENSE.
