[2026.04] SPARK accepted at ACL 2026 Main! 🎉🎉🎉
SPARK (Strategic Policy-Aware exploRation via Key-state dynamic branching) is a reinforcement learning framework specifically engineered for long-horizon agentic tasks.
Traditional exploration often suffers from "blind coverage." SPARK’s key insight is to activate adaptive branching exploration only at critical decision points. This allows for precise resource allocation, prioritizing sampling quality over quantity.
Figure 1: Paradigm comparison: uniform vs. strategic exploration.
- Root Initialization: Generates diverse starting trajectories to broaden the search space.
- Autonomous Branching: Selectively expands trajectories at high-uncertainty states using intrinsic `<explore>` signals.
- Budget Enforcement: Strategically constrains tree growth to stay within computational limits.
Figure 2: Overview of SPARK framework.
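The three mechanisms above can be sketched as a single rollout loop. This is a minimal, self-contained Python illustration only, not the actual SPARK implementation: `rollout_step`, the uncertainty signal, and all thresholds are hypothetical stand-ins.

```python
import random

def rollout_step(state):
    """Hypothetical policy/environment step: returns (next_state, uncertainty).
    In SPARK the uncertainty would come from the policy; here it is random."""
    return state + 1, random.random()

def is_key_state(uncertainty, threshold=0.8):
    # Branch only where the policy is uncertain, i.e. at key decision points.
    return uncertainty > threshold

def spark_rollout(num_roots=4, horizon=8, branch_factor=2, budget=16):
    frontier = [0] * num_roots             # (1) root initialization
    for _ in range(horizon):
        next_frontier = []
        for state in frontier:
            next_state, u = rollout_step(state)
            next_frontier.append(next_state)
            if is_key_state(u):            # (2) autonomous branching
                next_frontier.extend([next_state] * branch_factor)
        frontier = next_frontier[:budget]  # (3) budget enforcement
    return frontier
```

The point of the sketch is the control flow: sampling effort concentrates on uncertain states while the truncation to `budget` keeps total cost bounded, regardless of how many branches fire.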
Experiments on challenging tasks (ALFWorld, ScienceWorld, WebShop) demonstrate that SPARK achieves superior performance over baselines.
Figure 3: Multi-benchmark performance comparison on 1.5B backbone.
We recommend maintaining a separate conda environment for each benchmark environment (ALFWorld, ScienceWorld, WebShop).
Install with pip:
```bash
# conda environment
conda create -n spark-alfworld python==3.10 -y
conda activate spark-alfworld

# training infra
pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn==2.7.4.post1 --no-build-isolation
pip3 install -e .
pip3 install vllm==0.8.5

# env package
pip3 install gymnasium==0.29.1
pip3 install stable-baselines3==2.6.0
pip3 install alfworld
```

Download PDDL & Game files and the pre-trained MaskRCNN detector (stored in `~/.cache/alfworld/` by default):

```bash
# Download to ~/.cache/alfworld/ by default
alfworld-download -f
```

Note: You must correctly specify `env.alfworld.data_path` in `verl/trainer/config/ppo_trainer.yaml`; otherwise, the environment will fail to load the data.
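For reference, the relevant entry would look something like the fragment below. The surrounding key layout is inferred from the option name `env.alfworld.data_path`, and the path is only an example — point it at wherever `alfworld-download` actually stored the data on your machine.

```yaml
# verl/trainer/config/ppo_trainer.yaml (fragment; layout assumed)
env:
  alfworld:
    data_path: ~/.cache/alfworld/  # example; use your actual download location
```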
Play a TextWorld game:

```bash
alfworld-play-tw
```

```bash
# conda environment
conda create -n spark-sciworld python==3.10 -y
conda activate spark-sciworld

# training infra
pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn==2.7.4.post1 --no-build-isolation
pip3 install -e .
pip3 install vllm==0.8.5

# env package
pip3 install scienceworld
```

Create a new spark-webshop environment:
```bash
# conda environment
conda create -n spark-webshop python==3.10 -y
conda activate spark-webshop
```

Install WebShop:

```bash
# env package
cd ./agent_system/environments/env_package/webshop/webshop
./setup.sh -d all
```

If the Google Drive downloads fail, visit https://drive.google.com/, get your Google Drive cookie, and paste it into `.cache/gdown/cookies.txt`.
Or you may need to manually download the files.
After WebShop is installed, return to the root directory of the repository and install the verl package:
```bash
# training infra
cd repo_root/
pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn==2.7.4.post1 --no-build-isolation
pip3 install -e .
pip3 install vllm==0.8.5
```

Any warnings emitted during installation can be safely ignored.
Query LLMs to generate reasoning paths for the samples with golden actions located in `data/expert_traj`:

```bash
python data/all_annotation.py
python data/formulate_annotation_data.py
```

Extract interaction logs between teacher models and environments:

```bash
# Choose the matching conda environment: spark-alfworld, spark-sciworld, or spark-webshop
conda activate spark-alfworld
bash data/teacher_api_agent_all.sh
```

Consolidate the reasoning annotations and interaction logs into a final dataset:

```bash
python merge_sft_data.py
```

Our main experiments are based on Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct. Please download both models before proceeding.
Use the script below for the cold-start process. Set `model_path` to your downloaded model and `trainer.default_local_dir` to the SFT output directory, then run the following command to start training:
```bash
# For 1.5B backbone
bash examples/sft/cold_start/all_1.5b.sh

# For 7B backbone
bash examples/sft/cold_start/all_7b.sh
```

Before running the RL training script, ensure `cs_model_path` points to your cold-start model directory and `SAVE_PATH` is set to your desired checkpoint location. Once configured, execute the script:

```bash
export VERL_AUTO_PADDING=1
bash examples/spark_trainer/ys_run_alfworld_branch_cs_bsz16.sh
```

- To customize the main loop logic: modify the `branching_multi_turn_loop_with_masks` function located in `agent_system/multi_turn_rollout/rollout_loop.py`.
- To adjust branching criteria: replace or update the `explore_branching_condition` function in `agent_system/multi_turn_rollout/branch_utils.py`.
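As one example of a custom criterion, an entropy-based test could decide where to branch. This is purely illustrative — the actual signature and uncertainty signal of `explore_branching_condition` in `branch_utils.py` may differ.

```python
import math

def token_entropy(probs):
    """Shannon entropy of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def explore_branching_condition(probs, threshold=1.0):
    """Hypothetical criterion: branch when the policy's next-token
    distribution at this state is high-entropy, i.e. the policy is uncertain."""
    return token_entropy(probs) > threshold

# A peaked (confident) distribution should not trigger branching,
# while a near-uniform (uncertain) one should.
assert not explore_branching_condition([0.97, 0.01, 0.01, 0.01])
assert explore_branching_condition([0.25, 0.25, 0.25, 0.25])
```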
Please follow the instructions below:

- Create your environment package in `agent_system/environments/env_package/`.
- Write the corresponding prompt files in the `agent_system/environments/prompts/` directory.
- Add an environment manager in `agent_system/environments/env_manager.py`.
We sincerely appreciate the following awesome projects:
- Agentic RL training infrastructure: veRL, verl-agent and RLVMR.
- Interactive environments: ALFWorld, ScienceWorld, and WebShop.
If you find this repo useful for your research, we would appreciate it if you could cite our work:
```bibtex
@misc{wu2026sparkstrategicpolicyawareexploration,
      title={Spark: Strategic Policy-Aware Exploration via Dynamic Branching for Long-Horizon Agentic Learning},
      author={Jinyang Wu and Shuo Yang and Changpeng Yang and Yuhao Shen and Shuai Zhang and Zhengqi Wen and Jianhua Tao},
      year={2026},
      eprint={2601.20209},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2601.20209},
}
```
