# SWE-Bench RL with AGS

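AGS is a sandbox service from Tencent Cloud that creates an isolated Linux container environment on demand for each SWE-Bench task.
This cookbook shows how to use AGS sandboxes as the execution backend for agent RL training, covering the full workflow from data preparation to PPO training.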

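**Tech stack:**
- [rLLM](https://github.com/rllm-org/rllm): agent RL training framework
- [verl](https://github.com/volcengine/verl): distributed PPO training engine
- [R2E-Gym](https://github.com/R2E-Gym/R2E-Gym): SWE-Bench environments and datasets

**In this notebook:**
1. Install dependencies
2. Configure AGS credentials and the runtime environment
3. Prepare the SWE-Bench training and validation datasets
4. Configure and launch PPO training (AGS backend)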

## Architecture Overview

In the training loop, AGS provides an isolated sandbox environment for each SWE-Bench task:

```
┌─────────────────────────────────────────────────────────────┐
│                      PPO Training Loop                      │
│                                                             │
│  ┌──────────┐    ┌──────────────┐    ┌──────────────────┐   │
│  │ SWE-Bench│───>│  vLLM Rollout│───>│     SWEAgent     │   │
│  │   Data   │    │ (code edits) │    │(parse tool calls)│   │
│  └──────────┘    └──────────────┘    └────────┬─────────┘   │
│                                               │             │
│                                               ▼             │
│                                      ┌────────────────┐     │
│                                      │ AGS sandbox    │     │
│                                      │ (apply edits,  │     │
│                                      │  run tests)    │     │
│                                      └────────┬───────┘     │
│                                               │             │
│                                               ▼             │
│  ┌──────────┐    ┌──────────────┐    ┌────────────────┐     │
│  │PPO update│<───│    Reward    │<───│ Tests pass/fail│     │
│  │ (weights)│    │    Signal    │    │ = reward 1/0   │     │
│  └──────────┘    └──────────────┘    └────────────────┘     │
└─────────────────────────────────────────────────────────────┘
```

Key setting: `+rllm.env.env_args.backend=ags` selects the AGS cloud sandbox in place of local Docker or Kubernetes.

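## Prerequisites

| Item | Requirement |
|------|------|
| **GPU** | 8 x H20 (96 GB of memory per GPU) |
| **AGS credentials** | `E2B_API_KEY`, `TENCENTCLOUD_SECRET_ID`, `TENCENTCLOUD_SECRET_KEY` |
| **AGS sandbox tools** | Sandbox tools already created for every `docker_image` in the dataset (see the [sandbox tool creation demo](ags-tool/example/swe_bench_ags_tool.ipynb)) |
| **Model weights** | This demo uses Qwen3-8B, which must be downloaded to a local path in advance |
| **Network** | Access to HuggingFace (or an HF mirror) to download the datasets |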

## Step 1: Install Dependencies

Clone the source repositories and install all Python packages. The first run has to download the dependencies.

In [None]:
!git clone https://github.com/R2E-Gym/R2E-Gym.git R2E-Gym
!(cd R2E-Gym && git checkout 0d94c4e && git apply ../../patches/r2e-gym-ags-clean.patch)

!git clone --depth=1 https://github.com/rllm-org/rllm.git rllm

!git clone https://github.com/verl-project/verl.git verl && cd verl && git checkout 2c6c65c

%pip install -e './verl[vllm]'
%pip install -e '../ags_tool[e2b]'
%pip install -e './R2E-Gym'
%pip install -e './rllm'

# Version pins:
# - datasets>=4.5.0: older versions fail when loading the nested fields of R2E-Gym-Subset
# - numpy<2.3: vLLM depends on numba, and numba requires NumPy<=2.2
%pip install "datasets>=4.5.0" "numpy<2.3"

Cloning into 'verl'...
remote: Enumerating objects: 27998, done.
remote: Counting objects: 100% (468/468), done.
remote: Compressing objects: 100% (347/347), done.
Receiving objects: 100% (27998/27998), 18.27 MiB | 7.02 MiB/s, done.
remote: Total 27998 (delta 326), reused 121 (delta 121), pack-reused 27530 (from 3)
Resolving deltas: 100% (20218/20218), done.
Note: switching to '2c6c65c'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 2c6c65cb [doc,algo] feat: Rollout Correction - Fix Metrics, Add D

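## Step 2: Configure Environment Variables

Configure the AGS credentials and runtime parameters. The settings are grouped by purpose; only the **AGS credentials** group has to be replaced with your own values.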

In [None]:
import os

env_config = {
    # ── AGS credentials (required: replace with your own values) ──
    "E2B_API_KEY": "xxx",
    "TENCENTCLOUD_SECRET_ID": "xxx",
    "TENCENTCLOUD_SECRET_KEY": "xxx",
    "AGS_REGION": "ap-guangzhou",
    "TENCENTCLOUD_REGION": "ap-guangzhou",

    # ── HuggingFace mirror ───────────────────────────────
    "HF_ENDPOINT": "https://hf-mirror.com",

    # ── vLLM runtime ─────────────────────────────────────
    "VLLM_USE_V1": "1",
    "VLLM_ALLOW_LONG_MAX_MODEL_LEN": "1",
    "VLLM_ENGINE_ITERATION_TIMEOUT_S": "1000000000",

    # ── PyTorch memory ───────────────────────────────────
    "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:False",

    # ── MLflow monitoring (optional) ─────────────────────
    "MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING": "true",
    "MLFLOW_TRACKING_URI": "xxx",
    "MLFLOW_TRACKING_USERNAME": "xxx",
    "MLFLOW_TRACKING_PASSWORD": "xxx",
}

os.environ.update(env_config)
print("Environment variables configured.")

Environment variables configured.
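
Because the credential entries ship with the placeholder value `"xxx"`, it can help to confirm they were replaced before any sandbox is requested. The sketch below is not part of rLLM or AGS; the helper name `find_unset_credentials` and the sample dictionary are hypothetical, and the only assumption taken from this notebook is that `"xxx"` marks an unfilled value.

```python
# Hypothetical helper: flag required credentials that are missing, empty,
# or still set to the "xxx" placeholder used in this notebook.
REQUIRED_KEYS = ("E2B_API_KEY", "TENCENTCLOUD_SECRET_ID", "TENCENTCLOUD_SECRET_KEY")
PLACEHOLDER = "xxx"


def find_unset_credentials(config, required=REQUIRED_KEYS, placeholder=PLACEHOLDER):
    """Return the required keys whose value is absent, empty, or the placeholder."""
    return [key for key in required if config.get(key, "") in ("", placeholder)]


# Illustrative input only: one placeholder, one filled value, one missing key.
sample_config = {
    "E2B_API_KEY": "xxx",
    "TENCENTCLOUD_SECRET_ID": "my-secret-id",
}

unset = find_unset_credentials(sample_config)
if unset:
    print("Replace these values before training:", ", ".join(unset))
```

Calling the same helper on the notebook's own `env_config` right after it is defined would surface unfilled credentials early, rather than as a sandbox creation failure in the middle of a rollout.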


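## Step 3: Prepare the Datasets

Download two datasets from HuggingFace:
- **R2E-Gym/SWE-Bench-Lite** (300 examples, test split) → used as the validation set
- **R2E-Gym/R2E-Gym-Subset** (4,578 examples, train split) → used as the training set

`MAX_SAMPLES` controls how many examples are taken from each dataset. Rows are sorted by `docker_image` before slicing,
so that the selection matches the data picked by the image push tool `push_to_tcr.py`.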
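
The reason for sorting before slicing is that the first N rows then no longer depend on the order in which the dataset happens to be loaded. A minimal sketch of that idea, using made-up rows and a hypothetical helper name `select_samples` (the real logic lives in `prepare_swe_data` below):

```python
def select_samples(rows, max_samples=None):
    """Sort rows by docker_image, then keep the first max_samples of them."""
    ordered = sorted(rows, key=lambda row: row.get("docker_image", ""))
    return ordered if max_samples is None else ordered[:max_samples]


# Made-up rows for illustration; real rows carry many more fields.
rows = [
    {"docker_image": "repo/c"},
    {"docker_image": "repo/a"},
    {"docker_image": "repo/b"},
]
shuffled = list(reversed(rows))

# Both input orders yield the same subset once the rows are sorted.
picked = select_samples(rows, max_samples=2)
picked_from_shuffled = select_samples(shuffled, max_samples=2)
print([row["docker_image"] for row in picked])
```

Any other tool that applies the same sort key and the same N selects the same images, which is what keeps the training data and the pushed images in sync.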

In [7]:
from datasets import load_dataset

from rllm.data.dataset import DatasetRegistry

# ========== Tunable parameters ==========
MAX_SAMPLES = 8  # Take the first N examples from each dataset; set to None to use all data


SWE_DATASETS = [
    "R2E-Gym/SWE-Bench-Lite",  # 300 examples
    # "R2E-Gym/SWE-Bench-Verified",  # 500 examples
    "R2E-Gym/R2E-Gym-Subset",  # 4,578 examples
    # "R2E-Gym/R2E-Gym-Lite",  # 11,788 examples
    # "R2E-Gym/R2E-Gym-V1",  # 7.47k examples
    # "r2e-edits/SweSmith-RL-Dataset",
]
# ========================================

def prepare_swe_data(max_samples: int | None = None):
    """
    Prepare and register SWE datasets for training and testing.

    Args:
        max_samples: If set, only use the first N examples per split.
                     Rows are sorted by docker_image (alphabetical) before slicing,
                     to match the ordering used by push_to_tcr.py.

    Returns:
        tuple: (train_datasets, test_datasets) - lists of registered datasets
    """

    def make_process_fn():
        def process_fn(row):
            row_dict = dict(row)
            return row_dict

        return process_fn

    process_fn = make_process_fn()
    train_datasets = []
    test_datasets = []

    for dataset_name in SWE_DATASETS:
        print(f"Processing dataset: {dataset_name}")
        try:
            # Load the dataset dictionary (which contains splits like 'train' or 'test')
            dataset_splits = load_dataset(dataset_name)
        except Exception as e:
            print(f"Failed to load dataset {dataset_name}: {e}")
            continue

        dataset_key = dataset_name.split("/")[-1].replace("-", "_")

        # Process train split if it exists
        if "train" in dataset_splits:
            print(f"Processing 'train' split for {dataset_name}")
            train_data = [process_fn(row) for row in dataset_splits["train"]]
            train_data.sort(key=lambda r: r.get("docker_image", ""))
            if max_samples is not None:
                train_data = train_data[:max_samples]
            train_dataset = DatasetRegistry.register_dataset(f"{dataset_key}", train_data, "train")
            train_datasets.append(train_dataset)
            print(f"Registered train dataset with {len(train_data)} examples")

        # Process test split if it exists
        if "test" in dataset_splits:
            print(f"Processing 'test' split for {dataset_name}")
            test_data = [process_fn(row) for row in dataset_splits["test"]]
            test_data.sort(key=lambda r: r.get("docker_image", ""))
            if max_samples is not None:
                test_data = test_data[:max_samples]
            test_dataset = DatasetRegistry.register_dataset(f"{dataset_key}", test_data, "test")
            test_datasets.append(test_dataset)
            print(f"Registered test dataset with {len(test_data)} examples")

        # If neither train nor test exists, use the first available split as train
        if "train" not in dataset_splits and "test" not in dataset_splits:
            available_splits = list(dataset_splits.keys())
            if available_splits:
                split_name = available_splits[0]
                print(f"Using '{split_name}' split as train data for {dataset_name}")
                train_data = [process_fn(row) for row in dataset_splits[split_name]]
                train_data.sort(key=lambda r: r.get("docker_image", ""))
                if max_samples is not None:
                    train_data = train_data[:max_samples]
                train_dataset = DatasetRegistry.register_dataset(f"{dataset_key}", train_data, "train")
                train_datasets.append(train_dataset)
                print(f"Registered train dataset with {len(train_data)} examples")

    return train_datasets, test_datasets


train_datasets, test_datasets = prepare_swe_data(max_samples=MAX_SAMPLES)
print("\nSummary:")
print(f"Total train datasets: {len(train_datasets)}")
print(f"Total test datasets: {len(test_datasets)}")

if train_datasets:
    print("Sample train example from first dataset:")
    print(train_datasets[0].get_data()[0])

if test_datasets:
    print("Sample test example from first dataset:")
    print(test_datasets[0].get_data()[0])

Processing dataset: R2E-Gym/SWE-Bench-Lite


2026-02-11 13:20:58,406 - rllm.data.dataset - INFO - Registered dataset 'SWE_Bench_Lite' split 'test' with 8 examples. Verl-processed version saved at /workspace/rllm/rllm/data/datasets/SWE_Bench_Lite/test_verl.parquet.


Processing 'test' split for R2E-Gym/SWE-Bench-Lite
Registered test dataset with 8 examples
Processing dataset: R2E-Gym/R2E-Gym-Subset
Processing 'train' split for R2E-Gym/R2E-Gym-Subset


2026-02-11 13:21:16,516 - rllm.data.dataset - INFO - Registered dataset 'R2E_Gym_Subset' split 'train' with 8 examples. Verl-processed version saved at /workspace/rllm/rllm/data/datasets/R2E_Gym_Subset/train_verl.parquet.


Registered train dataset with 8 examples

Summary:
Total train datasets: 1
Total test datasets: 1
Sample train example from first dataset:
Sample test example from first dataset:
{'repo': 'astropy/astropy', 'instance_id': 'astropy__astropy-12907', 'base_commit': 'd16bfe05a744909de4b27f5875fe0d4ed41ce607', 'patch': "diff --git a/astropy/modeling/separable.py b/astropy/modeling/separable.py\n--- a/astropy/modeling/separable.py\n+++ b/astropy/modeling/separable.py\n@@ -242,7 +242,7 @@ def _cstack(left, right):\n         cright = _coord_matrix(right, 'right', noutp)\n     else:\n         cright = np.zeros((noutp, right.shape[1]))\n-        cright[-right.shape[0]:, -right.shape[1]:] = 1\n+        cright[-right.shape[0]:, -right.shape[1]:] = right\n \n     return np.hstack([cleft, cright])\n \n", 'test_patch': "diff --git a/astropy/modeling/tests/test_separable.py b/astropy/modeling/tests/test_separable.py\n--- a/astropy/modeling/tests/test_separable.py\n+++ b/astropy/modeling/tests/test_sep

## Step 4: Configure and Launch Training

Launch PPO training with rLLM's `AgentTrainer`. The configuration is passed in as Hydra overrides on top of `agent_ppo_trainer.yaml`.
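
Hydra overrides are plain `dotted.key=value` strings, and a leading `+` adds a key that the base config does not define. As a rough illustration, the hypothetical helper `to_overrides` below flattens a nested dictionary into such strings. It is not part of Hydra or rLLM, and it only handles simple scalar values.

```python
def to_overrides(settings, prefix=""):
    """Flatten a nested dict into Hydra-style 'a.b.c=value' override strings."""
    overrides = []
    for key, value in settings.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            overrides.extend(to_overrides(value, path))
        else:
            overrides.append(f"{path}={value}")
    return overrides


# Keys that already exist in the base config are overridden directly.
existing = to_overrides({
    "algorithm": {"adv_estimator": "rloo"},
    "data": {"train_batch_size": 4},
})

# Keys that are new to the base config get a "+" prefix.
added = ["+" + item for item in to_overrides({"rllm": {"env": {"env_args": {"backend": "ags"}}}})]

print(existing + added)
```

The notebook itself writes the override list by hand, which keeps every setting visible in one place; a helper like this is only worthwhile when the settings are generated programmatically.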

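**Training flow:**
1. The vLLM rollout engine generates an agent action sequence for each SWE-Bench task
2. `SWEAgent` parses the LLM output into tool calls (`str_replace_editor`, `execute_bash`, `submit`)
3. `SWEEnv` executes the tool calls inside the AGS sandbox
4. The sandbox runs the project's test suite: tests pass → reward=1, tests fail → reward=0
5. The PPO algorithm updates the model weights based on the reward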
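
To make the reward signal concrete, the sketch below turns pass/fail outcomes into 0/1 rewards and computes leave-one-out advantages for the rollouts sampled from one task, which is the idea behind the `rloo` estimator selected in this demo. It is a simplified illustration under stated assumptions: the function names are hypothetical, and verl's actual estimator may differ in details such as how the value is spread over tokens.

```python
def binary_reward(tests_passed):
    """Map a sandbox test outcome to the 0/1 reward described above."""
    return 1.0 if tests_passed else 0.0


def rloo_advantages(rewards):
    """Leave-one-out advantage: each reward minus the mean of the other rewards."""
    n = len(rewards)
    if n < 2:
        raise ValueError("leave-one-out needs at least two rollouts per task")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# Illustrative outcomes for the 4 rollouts of one task (rollout.n=4 in this demo).
outcomes = [True, False, False, False]
rewards = [binary_reward(passed) for passed in outcomes]
advantages = rloo_advantages(rewards)
print(rewards)
print(advantages)
```

With a single passing rollout out of four, that rollout receives a positive advantage and the three failing ones receive a small negative one. If all four rollouts share the same outcome, every advantage is zero and the task contributes no gradient signal.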

### Configuration Parameter Groups

| Parameter group | Purpose | Key settings in this demo |
|--------|------|-------------------|
| `algorithm.*` | PPO/RLOO algorithm parameters | `adv_estimator=rloo`, `kl_coef=0.001` |
| `data.*` | Batch size, sequence length | `train_batch_size=4`, `max_response_length=32768` |
| `actor_rollout_ref.*` | Model, optimizer, vLLM inference, reference model | `model.path=Qwen3-8B`, `rollout.n=4`, `gpu_memory_utilization=0.5` |
| `rllm.*` | Agent/environment settings | **`env.env_args.backend=ags`** (enables AGS) |
| `trainer.*` | Logging, checkpoints, GPU topology | `n_gpus_per_node=8`, `total_epochs=2` |

**Key constraint**: `train_batch_size × rollout.n` must be divisible by `n_gpus_per_node` (in this demo: 4 × 4 = 16, and 16 ÷ 8 = 2).
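
The constraint is easy to check before launching a long run. The helper below is a hypothetical sketch, not an rLLM or verl function; it covers only the single-node rule stated above.

```python
def rollouts_per_gpu(train_batch_size, rollout_n, n_gpus_per_node):
    """Return how many rollouts each GPU handles, or raise if the split is uneven."""
    total_rollouts = train_batch_size * rollout_n
    if total_rollouts % n_gpus_per_node != 0:
        raise ValueError(
            f"train_batch_size * rollout.n = {total_rollouts} is not divisible "
            f"by n_gpus_per_node = {n_gpus_per_node}"
        )
    return total_rollouts // n_gpus_per_node


# Values used in this demo: 4 * 4 = 16 rollouts spread over 8 GPUs.
per_gpu = rollouts_per_gpu(train_batch_size=4, rollout_n=4, n_gpus_per_node=8)
print(per_gpu)
```

Changing `train_batch_size` to 3 with the same `rollout.n` and GPU count gives 12 rollouts, which does not divide evenly by 8, so the helper raises instead of letting the job fail later.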

In [None]:
import importlib.util  # importlib.util must be imported explicitly before find_spec is used
import os
from hydra import compose, initialize_config_dir
from hydra.core.global_hydra import GlobalHydra

from rllm.agents.swe_agent import SWEAgent
from rllm.environments.swe.swe import SWEEnv
from rllm.trainer.agent_trainer import AgentTrainer

# Locate rllm's config directory (realpath resolves symlinks from the editable install)
rllm_pkg_dir = os.path.realpath(importlib.util.find_spec('rllm').submodule_search_locations[0])
config_dir = os.path.join(rllm_pkg_dir, "rllm", "trainer", "config")


# Clear any previous Hydra state (safe for re-runs in Jupyter)
GlobalHydra.instance().clear()

with initialize_config_dir(config_dir=config_dir, version_base=None):
    config = compose(config_name="agent_ppo_trainer", overrides=[
        # Algorithm
        "algorithm.adv_estimator=rloo",
        "algorithm.kl_ctrl.kl_coef=0.001",
        # Data (train_files/val_files are set automatically by AgentTrainer from Dataset objects)
        "data.train_batch_size=4",
        "data.val_batch_size=4",
        "data.max_prompt_length=4096",
        "data.max_response_length=32768",
        "data.filter_overlong_prompts=True",
        "data.filter_overlong_prompts_workers=32",
        # Actor / Rollout / Ref
        "actor_rollout_ref.model.path=/mnt/cfs-turbo/Qwen3-8B",
        "actor_rollout_ref.hybrid_engine=True",
        "actor_rollout_ref.actor.optim.lr=1e-6",
        "actor_rollout_ref.model.use_remove_padding=True",
        "actor_rollout_ref.actor.loss_agg_mode=seq-mean-token-sum",
        "actor_rollout_ref.actor.ppo_mini_batch_size=8",
        "actor_rollout_ref.actor.use_dynamic_bsz=False",
        "actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1",
        "actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=True",
        "actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1",
        "actor_rollout_ref.actor.ppo_max_token_len_per_gpu=40960",
        "actor_rollout_ref.actor.use_kl_loss=False",
        "actor_rollout_ref.actor.clip_ratio_high=0.28",
        "actor_rollout_ref.actor.kl_loss_coef=0.001",
        "actor_rollout_ref.actor.kl_loss_type=low_var_kl",
        "actor_rollout_ref.actor.ulysses_sequence_parallel_size=1",
        "actor_rollout_ref.model.enable_gradient_checkpointing=True",
        "actor_rollout_ref.actor.fsdp_config.param_offload=False",
        "actor_rollout_ref.actor.fsdp_config.optimizer_offload=False",
        "actor_rollout_ref.rollout.tensor_model_parallel_size=8",
        "actor_rollout_ref.rollout.name=vllm",
        "actor_rollout_ref.rollout.mode=async",
        "actor_rollout_ref.rollout.enforce_eager=False",
        "actor_rollout_ref.rollout.temperature=1.0",
        "actor_rollout_ref.rollout.gpu_memory_utilization=0.5",
        "actor_rollout_ref.rollout.n=4",
        "actor_rollout_ref.rollout.val_kwargs.n=1",
        "actor_rollout_ref.rollout.val_kwargs.temperature=0",
        "actor_rollout_ref.ref.fsdp_config.param_offload=False",
        "actor_rollout_ref.actor.entropy_coeff=0.0",
        # rllm
        "rllm.mask_truncated_samples=False",
        "rllm.filter_token_mismatch=False",
        "rllm.env.name=swe",
        "+rllm.env.env_args.backend=ags",
        "rllm.agent.name=sweagent",
        "rllm.agent.max_steps=50",
        "rllm.agent.overlong_filter=True",
        "+rllm.agent.trajectory_timeout=5400",
        # Trainer
        "trainer.critic_warmup=0",
        "trainer.logger=[console]",
        "trainer.project_name=AgentRL-with-ags",
        "trainer.experiment_name=swe-agent-rl",
        "trainer.val_before_train=False",
        "trainer.n_gpus_per_node=8",
        "trainer.nnodes=1",
        "trainer.save_freq=10",
        "trainer.test_freq=10",
        "trainer.default_hdfs_dir=null",
        "trainer.total_epochs=2",
    ])

# train_datasets[0] = R2E_Gym_Subset, test_datasets[0] = SWE_Bench_Lite
# The number of examples is controlled by the MAX_SAMPLES parameter above
trainer = AgentTrainer(
    agent_class=SWEAgent,
    env_class=SWEEnv,
    config=config,
    train_dataset=train_datasets[0],
    val_dataset=test_datasets[0],
)


import ray
from rllm.trainer.verl.ray_runtime_env import get_ppo_ray_runtime_env

runtime_env = get_ppo_ray_runtime_env()
runtime_env["env_vars"].update(env_config)

if ray.is_initialized():
    ray.shutdown()
ray.init(runtime_env=runtime_env)

trainer.train()

2026-02-11 13:22:04,081	INFO worker.py:1588 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
2026-02-11 13:22:04,083	INFO worker.py:1723 -- Connecting to existing Ray cluster at address: 10.32.5.203:6379...
2026-02-11 13:22:04,090	INFO worker.py:1908 -- Connected to Ray cluster. View the dashboard at 10.32.5.203:8265


(TaskRunner pid=1398266) TaskRunner hostname: VM-5-203-tencentos, PID: 1398266
(TaskRunner pid=1398266) {'actor_rollout_ref': {'actor': {'_target_': 'verl.workers.config.FSDPActorConfig',
(TaskRunner pid=1398266)                                  'checkpoint': {'_target_': 'verl.trainer.config.CheckpointConfig',
(TaskRunner pid=1398266)                                                 'async_save': False,
(TaskRunner pid=1398266)                                                 'load_contents': ['model',
(TaskRunner pid=1398266)                                                                   'optimizer',
(TaskRunner pid=1398266)                                                                   'extra'],
(TaskRunner pid=1398266)                                                 'save_contents': ['model',
(TaskRunner pid=1398266)                                                                   'optimizer',

(TaskRunner pid=1398266)   self.use_critic = need_critic(self.config)


(TaskRunner pid=1398266) dataset len: 8


Generating train split: 8 examples [00:00, 483.12 examples/s]
(TaskRunner pid=1398266) num_proc must be <= 8. Reducing num_proc to 8 for dataset of size 8.
(TaskRunner pid=1398266) Setting TOKENIZERS_PARALLELISM=false for forked processes.
Filtering prompts longer than 4096 tokens (num_proc=8):   0%|          | 0/8 [00:00<?, ? examples/s]
Filtering prompts longer than 4096 tokens (num_proc=8):  12%|█▎        | 1/8 [00:00<00:04,  1.50 examples/s]
Filtering prompts longer than 4096 tokens (num_proc=8):  38%|███▊      | 3/8 [00:00<00:01,  4.13 examples/s]
Filtering prompts longer than 4096 tokens (num_proc=8):  50%|█████     | 4/8 [00:00<00:00,  5.15 examples/s]
Filtering prompts longer than 4096 tokens (num_proc=8):  75%|███████▌  | 6/8 [00:01<00:00,  6.96 examples/s]
Filtering prompts longer than 4096 tokens (num_proc=8): 100%|██████████| 8/8 [00:01<00:00,  8.32 examples/s]


(TaskRunner pid=1398266) filter dataset len: 8
(TaskRunner pid=1398266) Using dataset class: RLHFDataset


Filtering prompts longer than 4096 tokens (num_proc=8): 100%|██████████| 8/8 [00:01<00:00,  5.48 examples/s]


(TaskRunner pid=1398266) dataset len: 8


Generating train split: 8 examples [00:00, 558.24 examples/s]
(TaskRunner pid=1398266) num_proc must be <= 8. Reducing num_proc to 8 for dataset of size 8.
(TaskRunner pid=1398266) Setting TOKENIZERS_PARALLELISM=false for forked processes.
Filtering prompts longer than 4096 tokens (num_proc=8):   0%|          | 0/8 [00:00<?, ? examples/s]
Filtering prompts longer than 4096 tokens (num_proc=8):  12%|█▎        | 1/8 [00:00<00:04,  1.59 examples/s]
Filtering prompts longer than 4096 tokens (num_proc=8):  25%|██▌       | 2/8 [00:00<00:02,  2.68 examples/s]
Filtering prompts longer than 4096 tokens (num_proc=8):  50%|█████     | 4/8 [00:00<00:00,  5.27 examples/s]
Filtering prompts longer than 4096 tokens (num_proc=8):  62%|██████▎   | 5/8 [00:01<00:00,  6.00 examples/s]
Filtering prompts longer than 4096 tokens (num_proc=8):  75%|███████▌  | 6/8 [00:01<00:00,  6.52 examples/s]
Filtering prompts longer than 4096 tokens (num_proc=8): 100%|██████████| 8/8 [00:01<00:00,  7.44

(TaskRunner pid=1398266) filter dataset len: 8
(TaskRunner pid=1398266) Size of train dataloader: 2, Size of val dataloader: 2
(TaskRunner pid=1398266) Total training steps: 4
(TaskRunner pid=1398266) Using trajectory-level advantage, max_prompt_length and max_response_length will be applied episode-wise
(TaskRunner pid=1398266) colocated worker base class <class 'verl.single_controller.base.worker.Worker'>
(TaskRunner pid=1398266) bind role actor_rollout method chat_completion to class <class 'verl.single_controller.ray.base.create_colocated_worker_cls.<locals>.WorkerDict'>
(TaskRunner pid=1398266) bind role actor_rollout method generate to class <class 'verl.single_controller.ray.base.create_colocated_worker_cls.<locals>.WorkerDict'>
(TaskRunner pid=1398266) bind role actor_rollout method get_zeromq_address to class <class 'verl.single_controller.ray.base.create_colocated_worker_cls.<locals>.WorkerDict'>
(Ta

(WorkerDict pid=1398912) Flash Attention 2 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen3ForCausalLM is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
(WorkerDict pid=1398912) Flash Attention 2 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen3Model is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)`
Loading checkpoint sha

(WorkerDict pid=1398910) Model config after override: Qwen3Config {
(WorkerDict pid=1398910)   "architectures": [
(WorkerDict pid=1398910)     "Qwen3ForCausalLM"
(WorkerDict pid=1398910)   ],
(WorkerDict pid=1398910)   "attention_bias": false,
(WorkerDict pid=1398910)   "attention_dropout": 0.0,
(WorkerDict pid=1398910)   "eos_token_id": 151645,
(WorkerDict pid=1398910)   "head_dim": 128,
(WorkerDict pid=1398910)   "hidden_act": "silu",
(WorkerDict pid=1398910)   "hidden_size": 4096,
(WorkerDict pid=1398910)   "initializer_range": 0.02,
(WorkerDict pid=1398910)   "intermediate_size": 12288,
(WorkerDict pid=1398910)   "layer_types": [
(WorkerDict pid=1398910)     "full_attention",
(WorkerDict pid=1398910)     "full_attention",
(WorkerDict pid=1398910)     "full_attention",
(WorkerDict pid=1398910)     "full_attention",
(WorkerDict 

Loading checkpoint shards:  20%|██        | 1/5 [00:07<00:30,  7.58s/it]
(WorkerDict pid=1398910) Flash Attention 2 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen3Model is torch.float32. You should run training or inference using Automatic Mixed-Precision via the `with torch.autocast(device_type='torch_device'):` decorator, or load the model with the `torch_dtype` argument. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)` [repeated 14x across cluster]
Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s] [repeated 7x across cluster]
Loading checkpoint shards:  40%|████      | 2/5 [00:13<00:20,  6.77s/it] [repeated 8x across cluster]
Loading checkpoint shards:  60%|██████    | 3/5 [00:20<00:13,  6.70s/it] [repeated 8x across cluster]
Loading checkpoint shards:  80%|████████  | 4/5 [00:25<00:06,  6.06s/it]

(WorkerDict pid=1398910) Monkey patch state_dict in AutoModelForCausalLMWithValueHead. 
(WorkerDict pid=1398910) Monkey patch _flash_attention_forward in transformers.integrations.flash_attention
(WorkerDict pid=1398910) Skipping monkey patch for Qwen3ForCausalLM as use_fused_kernels is False or fused_kernels_backend is torch
(WorkerDict pid=1398910) Qwen3ForCausalLM contains 8.19B parameters
(WorkerDict pid=1398910) wrap_policy: functools.partial(<function _or_policy at 0x7efbfd10b400>, policies=[functools.partial(<function transformer_auto_wrap_policy at 0x7efbfd10b2e0>, transformer_layer_cls={<class 'transformers.models.qwen3.modeling_qwen3.Qwen3DecoderLayer'>})])
(WorkerDict pid=1398910) VM-5-203-tencentos:1398910:1400272 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0
(WorkerDict pid=1398910) VM-5-203-tencentos:1398910:1400272 [0] NCCL INFO Bootstrap: Using bond0:29.194.30.130<0>
(WorkerDict pid=13989

Loading checkpoint shards:  80%|████████  | 4/5 [00:25<00:06,  6.09s/it] [repeated 7x across cluster]


(WorkerDict pid=1398910) Total steps: 4, num_warmup_steps: 0
(WorkerDict pid=1398910) Actor use_remove_padding=True
(WorkerDict pid=1398910) Actor use_fused_kernels=False


Loading checkpoint shards: 100%|██████████| 5/5 [00:27<00:00,  5.60s/it] [repeated 7x across cluster]


(WorkerDict pid=1398914) Monkey patch state_dict in AutoModelForCausalLMWithValueHead.  [repeated 7x across cluster]
(WorkerDict pid=1398914) Monkey patch _flash_attention_forward in transformers.integrations.flash_attention [repeated 7x across cluster]
(WorkerDict pid=1398914) Skipping monkey patch for Qwen3ForCausalLM as use_fused_kernels is False or fused_kernels_backend is torch [repeated 7x across cluster]
(WorkerDict pid=1398916) VM-5-203-tencentos:1398916:1401145 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0 [repeated 14x across cluster]
(WorkerDict pid=1398916) VM-5-203-tencentos:1398916:1400169 [0] NCCL INFO Bootstrap: Using bond0:29.194.30.130<0> [repeated 7x across cluster]
(WorkerDict pid=1398916) VM-5-203-tencentos:1398916:1400169 [0] NCCL INFO cudaDriverVersion 12080 [repeated 7x across cluster]
(WorkerDict pid=1398916) VM-5-203-tencentos:139

(vLLMHttpServer pid=1401492) INFO:2026-02-11 13:23:15,527:vLLMHttpServer, replica_rank: 0, master address: 10.32.5.203, master port: 38037, data parallel master port: 36007
(vLLMHttpServer pid=1401492) INFO:2026-02-11 13:23:15,533:override_generation_config: {'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'repetition_penalty': 1.0, 'max_new_tokens': 32768}
(vLLMHttpServer pid=1401492) Using blocking ray.get inside async actor. This blocks the event loop. Please use `await` on object ref with asyncio.gather if you want to yield execution to the event loop instead.
(vLLMHttpServer pid=1401492) INFO:2026-02-11 13:23:16,275:replica_rank=0, node_rank=0, nnodes=1, get worker zmq addresses: ['ipc:///tmp/verl_vllm_zmq_1398910_root.ipc', 'ipc:///tmp/verl_vllm_zmq_1398911_root.ipc', 'ipc:///tmp/verl_vllm_zmq_1398912_root.ipc', 'ipc:///tmp/verl_vllm_zmq_1398913_root.ipc', 'ipc:///tmp/verl_vllm_zmq_1398914_root.ipc', 'ipc:///tmp/verl_vllm_zmq_1398915_root.ipc', 'i

(WorkerDict pid=1398910) CL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
(WorkerDict pid=1398910) VM-5-203-tencentos:1398910:1401139 [0] NCCL INFO CC Off, workFifoBytes 1048576
(WorkerDict pid=1398910) VM-5-203-tencentos:1398910:1401139 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
(WorkerDict pid=1398910) VM-5-203-tencentos:1398910:1401139 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v4 symbol.
(WorkerDict pid=1398910) VM-5-203-tencentos:1398910:1401139 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
(WorkerDict pid=1398910) VM-5-203-tencentos:1398910:1401139 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
(WorkerDict pid=1398910) VM-5-203-tencentos:1398910:1401139 [0] NCCL INFO ncclCommInitRankConfig comm 0x7ef81b7b4b40 rank 0 nranks 

Capturing CUDA graph shapes:   0%|          | 0/67 [00:00<?, ?it/s]
Capturing CUDA graph shapes:   3%|▎         | 2/67 [00:00<00:04, 14.23it/s]
Capturing CUDA graph shapes:   7%|▋         | 5/67 [00:00<00:03, 18.18it/s]
Capturing CUDA graph shapes:  12%|█▏        | 8/67 [00:00<00:03, 19.37it/s]
Capturing CUDA graph shapes:  16%|█▋        | 11/67 [00:00<00:02, 19.62it/s]
Capturing CUDA graph shapes:  21%|██        | 14/67 [00:00<00:02, 19.59it/s]
Capturing CUDA graph shapes:  24%|██▍       | 16/67 [00:00<00:02, 18.95it/s]
Capturing CUDA graph shapes:  27%|██▋       | 18/67 [00:00<00:02, 18.08it/s]
Capturing CUDA graph shapes:  30%|██▉       | 20/67 [00:01<00:02, 17.25it/s]
Capturing CUDA graph shapes:  33%|███▎      | 22/67 [00:01<00:02, 17.13it/s]
Capturing CUDA graph shapes:  36%|███▌      | 24/67 [00:01<00:02, 15.65it/s]
Capturing CUDA graph shapes:  39%|███▉      | 26/67 [00:01<00:02, 15.28it/s]
Capturing CUDA graph shapes:  43%|████▎     | 29/67 [00:01<00:02, 16.70it/s]
Capturing C

[36m(WorkerDict pid=1398916)[0m VM-5-203-tencentos:1398916:1405769 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1[32m [repeated 14x across cluster][0m
[36m(WorkerDict pid=1398916)[0m VM-5-203-tencentos:1398916:1400169 [0] NCCL INFO Connected all trees[32m [repeated 7x across cluster][0m
[36m(WorkerDict pid=1398916)[0m VM-5-203-tencentos:1398916:1402038 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 177[32m [repeated 7x across cluster][0m
[36m(WorkerDict pid=1398916)[0m VM-5-203-tencentos:1398916:1400169 [0] NCCL INFO ncclCommInitRank comm 0x7fd3254b33c0 rank 6 nranks 8 cudaDev 0 nvmlDev 6 busId c3000 commId 0x91a76f5afb58ec05 - Init COMPLETE[32m [repeated 7x across cluster][0m
[36m(WorkerDict pid=1398916)[0m VM-5-203-tencentos:1398916:1400169 [0] NCCL INFO Init timings - ncclCommInitRank: rank 6 nranks 8 total 0.24 (kernels 0.00, alloc 0.00, bootstrap 0.00, allgathers 0.01, topo 0.04, graphs 0.01, connections 0.18, rest 0.01)[32m [repeated 7x across cluste

[36m(vLLMHttpServer pid=1401492)[0m INFO:2026-02-11 13:24:27,663:Initializing a V1 LLM engine with config: model='/mnt/cfs-turbo/Qwen3-8B', speculative_config=None, tokenizer='/mnt/cfs-turbo/Qwen3-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=36864, download_dir=None, load_format=dummy, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/mnt/cfs-turbo/Qwen3-8B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_pr

[36m(TaskRunner pid=1398266)[0m n_parallel_agents: 16
[36m(TaskRunner pid=1398266)[0m train_sampling_params: {'temperature': 1.0, 'top_k': -1, 'top_p': 1, 'logprobs': 1}
[36m(TaskRunner pid=1398266)[0m val_sampling_params: {'temperature': 0, 'top_k': -1, 'top_p': 1.0, 'logprobs': 1}
[36m(TaskRunner pid=1398266)[0m Checkpoint tracker file does not exist: /workspace/checkpoints/AgentRL-with-ags/swe-agent-rl/latest_checkpointed_iteration.txt
[36m(TaskRunner pid=1398266)[0m Training from scratch
[36m(TaskRunner pid=1398266)[0m Time taken to validate agent: 4.315376281738281e-05
[36m(TaskRunner pid=1398266)[0m 'epoch 0, step 1 started'


[36m(TaskRunner pid=1398266)[0m 2026-02-11 13:24:29,886 - rllm.parser.chat_template_parser - INFO - model_name: /mnt/cfs-turbo/qwen3-8b, tokenizer_cls: qwen2tokenizerfast
[36m(TaskRunner pid=1398266)[0m 2026-02-11 13:24:29,886 - rllm.parser.chat_template_parser - INFO - Using QwenChatTemplateParser for /mnt/cfs-turbo/Qwen3-8B
[36m(TaskRunner pid=1398266)[0m 2026-02-11 13:24:29,901 - rllm.parser.chat_template_parser - INFO - model_name: /mnt/cfs-turbo/qwen3-8b, tokenizer_cls: qwen2tokenizerfast
[36m(TaskRunner pid=1398266)[0m 2026-02-11 13:24:29,901 - rllm.parser.chat_template_parser - INFO - Using QwenChatTemplateParser for /mnt/cfs-turbo/Qwen3-8B


[36m(TaskRunner pid=1398266)[0m ✅ AGS 客户端创建成功✅ AGS 客户端创建成功
[36m(TaskRunner pid=1398266)[0m 
[36m(TaskRunner pid=1398266)[0m ✅ AGS 客户端创建成功✅ AGS 客户端创建成功✅ AGS 客户端创建成功
[36m(TaskRunner pid=1398266)[0m ✅ AGS 客户端创建成功
[36m(TaskRunner pid=1398266)[0m ✅ AGS 客户端创建成功
[36m(TaskRunner pid=1398266)[0m 
[36m(TaskRunner pid=1398266)[0m 
[36m(TaskRunner pid=1398266)[0m ✅ AGS 客户端创建成功✅ AGS 客户端创建成功
[36m(TaskRunner pid=1398266)[0m 
[36m(TaskRunner pid=1398266)[0m ✅ AGS 客户端创建成功
[36m(TaskRunner pid=1398266)[0m ✅ AGS 客户端创建成功✅ AGS 客户端创建成功
[36m(TaskRunner pid=1398266)[0m 
[36m(TaskRunner pid=1398266)[0m ✅ AGS 客户端创建成功✅ AGS 客户端创建成功
[36m(TaskRunner pid=1398266)[0m 
[36m(TaskRunner pid=1398266)[0m 
[36m(TaskRunner pid=1398266)[0m 
[36m(TaskRunner pid=1398266)[0m 
[36m(TaskRunner pid=1398266)[0m ✅ AGS 客户端创建成功
[36m(TaskRunner pid=1398266)[0m ✅ AGS 客户端创建成功
[36m(TaskRunner pid=1398266)[0m 
[36m(TaskRunner pid=1398266)[0m 🚀 Creating e2b sandbox with template: aiohttp_f-032fb571f2

[36m(TaskRunner pid=1398266)[0m When assemble steps, detect the trajectory not accumulative at position 9549. Expected: [198, 151645], Got: [151645, 198]. Setting response_masks to all 0s. This is likely due to retokenization.


[36m(TaskRunner pid=1398266)[0m [33mTrajectory 8 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 2/16 completed[0m
[36m(TaskRunner pid=1398266)[0m [33mTrajectory 11 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 3/16 completed[0m


[36m(TaskRunner pid=1398266)[0m When assemble steps, detect the trajectory not accumulative at position 9447. Expected: [198, 151645], Got: [151645, 198]. Setting response_masks to all 0s. This is likely due to retokenization.
[36m(TaskRunner pid=1398266)[0m When assemble steps, detect the trajectory not accumulative at position 11271. Expected: [198, 151645], Got: [151645, 198]. Setting response_masks to all 0s. This is likely due to retokenization.


[36m(TaskRunner pid=1398266)[0m [33mTrajectory 15 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 4/16 completed[0m


[36m(TaskRunner pid=1398266)[0m When assemble steps, detect the trajectory not accumulative at position 4856. Expected: [198, 151645], Got: [151645, 198]. Setting response_masks to all 0s. This is likely due to retokenization.


[36m(TaskRunner pid=1398266)[0m [33mTrajectory 10 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 5/16 completed[0m


[36m(TaskRunner pid=1398266)[0m When assemble steps, detect the trajectory not accumulative at position 10160. Expected: [198, 151645], Got: [151645, 198]. Setting response_masks to all 0s. This is likely due to retokenization.


[36m(TaskRunner pid=1398266)[0m [33mTrajectory 13 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 6/16 completed[0m


[36m(TaskRunner pid=1398266)[0m When assemble steps, detect the trajectory not accumulative at position 20400. Expected: [198, 151645], Got: [151645, 198]. Setting response_masks to all 0s. This is likely due to retokenization.


[36m(TaskRunner pid=1398266)[0m [33mTrajectory 3 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 7/16 completed[0m
[36m(TaskRunner pid=1398266)[0m [33mTrajectory 4 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 8/16 completed[0m
[36m(TaskRunner pid=1398266)[0m [33mTrajectory 12 completed due to: MAX_STEPS. Reward is 0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [31mTrajectory 12 is masked out due to overlong filter.[0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 9/16 completed[0m
[36m(TaskRunner pid=1398266)[0m [32mTrajectory 7 completed due to: ENV_DONE. Reward is 1.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 10/16 completed[0m


[36m(TaskRunner pid=1398266)[0m When assemble steps, detect the trajectory not accumulative at position 25682. Expected: [198, 151645], Got: [151645, 198]. Setting response_masks to all 0s. This is likely due to retokenization.


[36m(TaskRunner pid=1398266)[0m [33mTrajectory 14 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 11/16 completed[0m


[36m(TaskRunner pid=1398266)[0m When assemble steps, detect the trajectory not accumulative at position 15350. Expected: [198, 151645], Got: [151645, 198]. Setting response_masks to all 0s. This is likely due to retokenization.


[36m(TaskRunner pid=1398266)[0m [33mTrajectory 0 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 12/16 completed[0m
[36m(TaskRunner pid=1398266)[0m [33mTrajectory 1 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 13/16 completed[0m
[36m(TaskRunner pid=1398266)[0m [33mTrajectory 6 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 14/16 completed[0m
[36m(TaskRunner pid=1398266)[0m [33mTrajectory 2 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 15/16 completed[0m
[36m(TaskRunner pid=1398266)[0m [33mTrajectory 5 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [

[36m(TaskRunner pid=1398266)[0m When assemble steps, detect the trajectory not accumulative at position 16378. Expected: [198, 151645], Got: [151645, 198]. Setting response_masks to all 0s. This is likely due to retokenization.


[36m(TaskRunner pid=1398266)[0m [33mTrajectory 5 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 5/16 completed[0m
[36m(TaskRunner pid=1398266)[0m [33mTrajectory 0 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 6/16 completed[0m


[36m(TaskRunner pid=1398266)[0m When assemble steps, detect the trajectory not accumulative at position 12955. Expected: [198, 151645], Got: [151645, 198]. Setting response_masks to all 0s. This is likely due to retokenization.


[36m(TaskRunner pid=1398266)[0m [33mTrajectory 7 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 7/16 completed[0m


[36m(TaskRunner pid=1398266)[0m When assemble steps, detect the trajectory not accumulative at position 6700. Expected: [198, 151645], Got: [151645, 198]. Setting response_masks to all 0s. This is likely due to retokenization.


[36m(TaskRunner pid=1398266)[0m [33mTrajectory 2 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 8/16 completed[0m
[36m(TaskRunner pid=1398266)[0m [33mTrajectory 12 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 9/16 completed[0m
[36m(TaskRunner pid=1398266)[0m [33mTrajectory 11 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 10/16 completed[0m
[36m(TaskRunner pid=1398266)[0m [33mTrajectory 9 completed due to: TRUNCATION. Reward is 0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [31mTrajectory 9 is masked out due to overlong filter.[0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 11/16 completed[0m
[36m(TaskRunner pid=1398266)[0m [33

[36m(TaskRunner pid=1398266)[0m When assemble steps, detect the trajectory not accumulative at position 14685. Expected: [198, 151645], Got: [151645, 198]. Setting response_masks to all 0s. This is likely due to retokenization.


[36m(TaskRunner pid=1398266)[0m [33mTrajectory 8 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 13/16 completed[0m


[36m(TaskRunner pid=1398266)[0m When assemble steps, detect the trajectory not accumulative at position 16553. Expected: [198, 151645], Got: [151645, 198]. Setting response_masks to all 0s. This is likely due to retokenization.


[36m(TaskRunner pid=1398266)[0m [33mTrajectory 15 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 14/16 completed[0m
[36m(TaskRunner pid=1398266)[0m [33mTrajectory 6 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 15/16 completed[0m
[36m(TaskRunner pid=1398266)[0m [33mTrajectory 10 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 16/16 completed[0m


[36m(TaskRunner pid=1398266)[0m When assemble steps, detect the trajectory not accumulative at position 17666. Expected: [198, 151645], Got: [151645, 198]. Setting response_masks to all 0s. This is likely due to retokenization.


[36m(TaskRunner pid=1398266)[0m [36m[1m
[36m(TaskRunner pid=1398266)[0m Sample 0[0m
[36m(TaskRunner pid=1398266)[0m [[2mmasked[0m [34munmasked[0m [42mreward > 0[0m [41mreward <= 0[0m]
[36m(TaskRunner pid=1398266)[0m [2m<|im_start|>[0m[2msystem[0m[2m\n[0m[2mYou[0m[2m are[0m[2m a[0m[2m programming[0m[2m agent[0m[2m who[0m[2m is[0m[2m provided[0m[2m a[0m[2m github[0m[2m issue[0m[2m and[0m[2m repository[0m[2m bash[0m[2m environment[0m[2m and[0m[2m is[0m[2m tasked[0m[2m to[0m[2m solve[0m[2m certain[0m[2m tasks[0m[2m ([0m[2me[0m[2m.g[0m[2m.,[0m[2m file[0m[2m localization[0m[2m,[0m[2m testcase[0m[2m generation[0m[2m,[0m[2m code[0m[2m repair[0m[2m and[0m[2m editing[0m[2m etc[0m[2m)[0m[2m to[0m[2m resolve[0m[2m the[0m[2m issue[0m[2m.\n\n[0m[2mWe[0m[2m have[0m[2m access[0m[2m to[0m[2m the[0m[2m following[0m[2m functions[0m[2m:\n\n[0m[2m––[0m[2m BEGIN[0m[2m FUNCTION

[36m(TaskRunner pid=1398266)[0m When assemble steps, detect the trajectory not accumulative at position 14958. Expected: [198, 151645], Got: [151645, 198]. Setting response_masks to all 0s. This is likely due to retokenization.


[36m(TaskRunner pid=1398266)[0m [33mTrajectory 14 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 3/16 completed[0m


[36m(TaskRunner pid=1398266)[0m When assemble steps, detect the trajectory not accumulative at position 14648. Expected: [198, 151645], Got: [151645, 198]. Setting response_masks to all 0s. This is likely due to retokenization.


[36m(TaskRunner pid=1398266)[0m [33mTrajectory 1 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 4/16 completed[0m


[36m(TaskRunner pid=1398266)[0m When assemble steps, detect the trajectory not accumulative at position 1862. Expected: [563, 79463, 13652, 563, 2266], Got: [47372, 563, 13652, 563, 2266]. Setting response_masks to all 0s. This is likely due to retokenization.


[36m(TaskRunner pid=1398266)[0m [33mTrajectory 5 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 5/16 completed[0m
[36m(TaskRunner pid=1398266)[0m [33mTrajectory 6 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 6/16 completed[0m
[36m(TaskRunner pid=1398266)[0m [33mTrajectory 4 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 7/16 completed[0m
[36m(TaskRunner pid=1398266)[0m [33mTrajectory 10 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 8/16 completed[0m
[36m(TaskRunner pid=1398266)[0m [33mTrajectory 13 completed due to: MAX_STEPS. Reward is 0. 
[36m(TaskRunner pid=1398266)[0m [0m


[36m(TaskRunner pid=1398266)[0m When assemble steps, detect the trajectory not accumulative at position 7616. Expected: [198, 151645], Got: [151645, 198]. Setting response_masks to all 0s. This is likely due to retokenization.


[36m(TaskRunner pid=1398266)[0m [33mTrajectory 9 completed due to: TRUNCATION. Reward is 0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [31mTrajectory 9 is masked out due to overlong filter.[0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 11/16 completed[0m
[36m(TaskRunner pid=1398266)[0m [33mTrajectory 2 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 12/16 completed[0m
[36m(TaskRunner pid=1398266)[0m [33mTrajectory 12 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 13/16 completed[0m
[36m(TaskRunner pid=1398266)[0m [33mTrajectory 3 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 14/16 completed[0m


[36m(TaskRunner pid=1398266)[0m When assemble steps, detect the trajectory not accumulative at position 20959. Expected: [198, 151645], Got: [151645, 198]. Setting response_masks to all 0s. This is likely due to retokenization.


[36m(TaskRunner pid=1398266)[0m [33mTrajectory 8 completed due to: TRUNCATION. Reward is 0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [31mTrajectory 8 is masked out due to overlong filter.[0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 15/16 completed[0m
[36m(TaskRunner pid=1398266)[0m [33mTrajectory 0 completed due to: ENV_DONE. Reward is 0.0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 16/16 completed[0m


[36m(TaskRunner pid=1398266)[0m When assemble steps, detect the trajectory not accumulative at position 21046. Expected: [198, 151645], Got: [151645, 198]. Setting response_masks to all 0s. This is likely due to retokenization.


[36m(TaskRunner pid=1398266)[0m [36m[1m
[36m(TaskRunner pid=1398266)[0m Sample 0[0m
[36m(TaskRunner pid=1398266)[0m [[2mmasked[0m [34munmasked[0m [42mreward > 0[0m [41mreward <= 0[0m]
[36m(TaskRunner pid=1398266)[0m ----------------
[36m(TaskRunner pid=1398266)[0m step:3 - traj/steps_mean:45.0625 - traj/steps_min:22 - traj/steps_max:50 - traj/reward_time_mean:5.127382883658776 - traj/reward_time_min:0.8436102867126465 - traj/reward_time_max:9.985153198242188 - traj/env_time_mean:12.746348813176155 - traj/env_time_min:3.600742816925049 - traj/env_time_max:32.30750226974487 - traj/llm_time_mean:83.14109659194946 - traj/llm_time_min:47.07543659210205 - traj/llm_time_max:164.82861471176147 - traj/total_time_mean:95.88744540512562 - traj/total_time_min:50.6761794090271 - traj/total_time_max:168.56003594398499 - traj/token_mismatch_mean:0.375 - traj/token_mismatch_min:0.0 - traj/token_mismatch_max:1.0 - batch/solve_none:4 - batch/solve_all:0 - batch/solve_partial:0 - act

[36m(TaskRunner pid=1398266)[0m When assemble steps, detect the trajectory not accumulative at position 4855. Expected: [73803, 21352, 522, 16181, 397], Got: [29, 64648, 522, 16181, 397]. Setting response_masks to all 0s. This is likely due to retokenization.


[36m(TaskRunner pid=1398266)[0m [33mTrajectory 1 completed due to: TRUNCATION. Reward is 0. 
[36m(TaskRunner pid=1398266)[0m [0m
[36m(TaskRunner pid=1398266)[0m [31mTrajectory 1 is masked out due to overlong filter.[0m
[36m(TaskRunner pid=1398266)[0m [36mNumber of Trajectories 4/4 completed[0m
[36m(TaskRunner pid=1398266)[0m [36m[1m
[36m(TaskRunner pid=1398266)[0m Sample 0[0m
[36m(TaskRunner pid=1398266)[0m [[2mmasked[0m [34munmasked[0m [42mreward > 0[0m [41mreward <= 0[0m]
[36m(TaskRunner pid=1398266)[0m [2m<|im_start|>[0m[2msystem[0m[2m\n[0m[2mYou[0m[2m are[0m[2m a[0m[2m programming[0m[2m agent[0m[2m who[0m[2m is[0m[2m provided[0m[2m a[0m[2m github[0m[2m issue[0m[2m and[0m[2m repository[0m[2m bash[0m[2m environment[0m[2m and[0m[2m is[0m[2m tasked[0m[2m to[0m[2m solve[0m[2m certain[0m[2m tasks[0m[2m ([0m[2me[0m[2m.g[0m[2m.,[0m[2m file[0m[2m localization[0m[2m,[0m[2m testcase[0m[2m gener

[33m(raylet)[0m A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffdb07151be87a790e7b8e56b50f000000 Worker ID: 1e7d3edadf54bb5d83eb08734cdde14a08a73213026a346614d16d27 Node ID: 86c54756ea071fcfadc9600b8d7058a6698bfdca88fe1f137cc10eb9 Worker IP address: 10.32.5.203 Worker port: 10202 Worker PID: 1398916 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker exits unexpectedly by a signal. SystemExit is raised (sys.exit is called). Exit code: 1. The process receives a SIGTERM.


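### Expected output

Once training starts, the console prints the following, in order:
1. Ray cluster initialization messages
2. The vLLM engine loading the model (Qwen3-8B)
3. Rollout progress for each step: `Number of Trajectories x/16 completed` (`train_batch_size=4 × rollout.n=4 = 16` trajectories)
4. PPO update logs: loss, entropy, KL divergence, and so on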
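
To check rollout progress without reading the log line by line, the trajectory lines can be tallied with a short script. The following is a minimal sketch; the regular expression is inferred from the log excerpt in this notebook (it is not taken from rLLM's source), and `summarize` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern inferred from lines such as:
#   (TaskRunner pid=...) Trajectory 7 completed due to: ENV_DONE. Reward is 1.0.
# This is an assumption about the log format, not an rLLM API.
TRAJ = re.compile(
    r"Trajectory (\d+) completed due to: (\w+)\. Reward is (\d+(?:\.\d+)?)"
)


def summarize(log_text):
    """Count termination reasons and average the rewards found in a log."""
    reasons = Counter()
    rewards = []
    for line in log_text.splitlines():
        match = TRAJ.search(line)
        if match:
            reasons[match.group(2)] += 1
            rewards.append(float(match.group(3)))
    mean_reward = sum(rewards) / len(rewards) if rewards else 0.0
    return {
        "count": len(rewards),
        "reasons": dict(reasons),
        "mean_reward": mean_reward,
    }


if __name__ == "__main__":
    sample = "\n".join([
        "(TaskRunner pid=1) Trajectory 8 completed due to: ENV_DONE. Reward is 0.0. ",
        "(TaskRunner pid=1) Trajectory 12 completed due to: MAX_STEPS. Reward is 0. ",
        "(TaskRunner pid=1) Trajectory 7 completed due to: ENV_DONE. Reward is 1.0. ",
        "(TaskRunner pid=1) Number of Trajectories 10/16 completed",
    ])
    print(summarize(sample))
```

Because the pattern uses `search`, it also matches lines that still carry color-code residue around the message.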

This demo sets `total_epochs=2`, `train_batch_size=4`, and `MAX_SAMPLES=8`, so each epoch has 2 steps (8 ÷ 4), for 4 steps in total.
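
The step arithmetic can be reproduced as below, which helps when changing the batch size or sample count. This is a sketch only: `rollout_plan` is a hypothetical helper, and it assumes a final partial batch is dropped (floor division), which this demo never hits because 8 divides evenly by 4.

```python
def rollout_plan(max_samples, train_batch_size, rollout_n, total_epochs):
    """Derive per-step trajectory count and step totals from the demo settings."""
    # Assumption: an incomplete final batch is dropped.
    steps_per_epoch = max_samples // train_batch_size
    return {
        "trajectories_per_step": train_batch_size * rollout_n,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * total_epochs,
    }


if __name__ == "__main__":
    # Values used in this demo: MAX_SAMPLES=8, train_batch_size=4,
    # rollout.n=4, total_epochs=2.
    print(rollout_plan(max_samples=8, train_batch_size=4, rollout_n=4, total_epochs=2))
```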

### (Optional) Launch training from the CLI

Training can also be launched from a shell script:

```bash
cd rllm/examples/swe
bash train_deepswe.sh
```

The script uses the same Hydra configuration but presets different hardware parameters.
Adjust `n_gpus_per_node` and `nnodes` to match the number of GPUs you have.
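
As a rough aid for picking those two values, the sketch below counts the GPUs listed in `CUDA_VISIBLE_DEVICES` and formats the matching Hydra overrides. Both function names are hypothetical, and the `trainer.` prefix is an assumption based on verl's usual config layout; check the keys actually used in `train_deepswe.sh` before copying them in.

```python
import os


def visible_gpu_count(env=None):
    """Count device entries in CUDA_VISIBLE_DEVICES (0 if unset or empty)."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    return len([dev for dev in raw.split(",") if dev.strip()])


def hardware_overrides(n_gpus_per_node, nnodes=1, prefix="trainer"):
    """Format Hydra-style overrides; the key prefix is an assumption."""
    if n_gpus_per_node < 1 or nnodes < 1:
        raise ValueError("n_gpus_per_node and nnodes must both be >= 1")
    return [
        f"{prefix}.n_gpus_per_node={n_gpus_per_node}",
        f"{prefix}.nnodes={nnodes}",
    ]


if __name__ == "__main__":
    gpus = visible_gpu_count() or 8  # fall back to the 8-GPU setup of this demo
    print(" ".join(hardware_overrides(gpus, nnodes=1)))
```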

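## Troubleshooting

| Problem | Cause | Fix |
|------|------|----------|
| AGS connection returns 401/403 | Wrong credentials or region mismatch | Check `E2B_API_KEY` and the Secret ID/Key, and confirm that `AGS_REGION` matches the region of the sandbox tool |
| `Numba needs NumPy 2.2 or less` | NumPy version is too new | `%pip install 'numpy<2.3'` |
| `must be called with a dataclass type or instance` | datasets version is too old | `%pip install 'datasets>=4.5.0'` |
| vLLM OOM | Not enough GPU memory | Lower `gpu_memory_utilization` or `max_response_length`, or use a smaller model |
| Ray initialization fails | Leftover Ray processes | Run `!ray stop --force`, then retry |
| Dataset download fails | Hugging Face is unreachable | Confirm that `HF_ENDPOINT` is set to a reachable mirror |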
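
Several rows in the table can be checked before launching training. The sketch below is one possible preflight check under stated assumptions: the helper names are hypothetical, it treats `AGS_REGION` as a required variable, and it only compares the version bounds quoted in the table (`numpy<2.3`, `datasets>=4.5.0`).

```python
import os
import re
from importlib import metadata

REQUIRED_ENV = (
    "E2B_API_KEY",
    "TENCENTCLOUD_SECRET_ID",
    "TENCENTCLOUD_SECRET_KEY",
    "AGS_REGION",  # assumption: treated as required here
)


def version_tuple(version):
    """Turn '2.2.6' or '2.3.0rc1' into a tuple of leading integers."""
    nums = []
    for part in version.split(".")[:3]:
        match = re.match(r"\d+", part)
        if not match:
            break
        nums.append(int(match.group()))
    return tuple(nums)


def installed_versions(packages=("numpy", "datasets")):
    """Look up installed versions; missing packages are simply omitted."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return found


def preflight(env, versions):
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = [f"missing env var: {name}" for name in REQUIRED_ENV if not env.get(name)]
    if not env.get("HF_ENDPOINT"):
        problems.append("HF_ENDPOINT is not set (needed if Hugging Face is unreachable)")
    numpy_v = versions.get("numpy")
    if numpy_v and version_tuple(numpy_v) >= (2, 3):
        problems.append(f"numpy {numpy_v} is too new; install 'numpy<2.3'")
    datasets_v = versions.get("datasets")
    if datasets_v and version_tuple(datasets_v) < (4, 5, 0):
        problems.append(f"datasets {datasets_v} is too old; install 'datasets>=4.5.0'")
    return problems


if __name__ == "__main__":
    for problem in preflight(os.environ, installed_versions()) or ["all checks passed"]:
        print(problem)
```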

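## Summary

This notebook walked through the full workflow for SWE-Bench Agent RL training with AGS as the sandbox backend:

1. Install the rLLM, verl, R2E-Gym, and ags_tool dependencies
2. Configure the AGS credentials and runtime environment variables
3. Download the SWE-Bench dataset from HuggingFace and register it
4. Configure the PPO training parameters through Hydra; the key setting is `backend=ags`
5. Launch distributed training with `AgentTrainer.train()`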

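### Core components

| Component | Role |
|------|------|
| **AGS** | Provides on-demand cloud sandboxes; each SWE-Bench task runs in its own container |
| **SWEAgent** | Parses LLM output into code-editing tool calls |
| **SWEEnv** | Executes tool calls in the AGS sandbox and computes the reward |
| **AgentTrainer** | Wraps the verl PPO training loop |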
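
The division of labour between the agent and the environment can be illustrated with a toy loop. Everything below is hypothetical: `ToyAgent` and `ToySandboxEnv` are stand-ins that do not mirror the real `SWEAgent` or `SWEEnv` interfaces, the `<tool ...>` markup is invented for the example, and the "sandbox" is just an in-memory string rather than an AGS container.

```python
import re
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    argument: str


class ToyAgent:
    """Hypothetical stand-in for SWEAgent: model text in, tool call out."""

    PATTERN = re.compile(r'<tool name="(\w+)">(.*?)</tool>', re.S)

    def parse(self, model_output):
        match = self.PATTERN.search(model_output)
        return ToolCall(match.group(1), match.group(2)) if match else None


class ToySandboxEnv:
    """Hypothetical stand-in for SWEEnv: applies edits, scores on submit."""

    def __init__(self, expected):
        self.expected = expected
        self.state = ""

    def step(self, call):
        if call.name == "edit":
            self.state = call.argument
            return 0.0, False
        if call.name == "submit":
            # Binary reward, mirroring the pass/fail = 1/0 signal in this notebook.
            return (1.0 if self.state == self.expected else 0.0), True
        return 0.0, False


def run_episode(agent, env, model_outputs, max_steps=50):
    """Feed model outputs through agent and env until done or out of steps."""
    for text in model_outputs[:max_steps]:
        call = agent.parse(text)
        if call is None:
            continue
        reward, done = env.step(call)
        if done:
            return reward, "ENV_DONE"
    return 0.0, "MAX_STEPS"


if __name__ == "__main__":
    outputs = ['<tool name="edit">fixed</tool>', '<tool name="submit"></tool>']
    print(run_episode(ToyAgent(), ToySandboxEnv("fixed"), outputs))
```

In the real pipeline the trainer collects many such episodes in parallel and passes their rewards to the PPO update.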

### Related resources
- Creating the sandbox tool: `ags-tool/example/swe_bench_ags_tool.ipynb`
- rLLM documentation: https://github.com/rllm-org/rllm
- R2E-Gym: https://github.com/R2E-Gym/R2E-Gym