The rapid development of large vision-language models (VLMs) has greatly advanced research on GUI agents. However, GUI agents still face significant challenges in handling long-horizon tasks. First, single-agent models struggle to balance high-level planning and low-level execution capabilities, suffering from pervasive responsibility coupling and capability conflicts. Second, agents lack awareness of the task state, leading to progress loss in long-horizon tasks. To address these challenges, we propose a staged execution-feedback reinforcement learning algorithm. Rather than training a unified policy model, we focus on training high-level scheduling models. Specifically, we propose and train two agents: a Coordinator, responsible for strategic planning and task decomposition; and a State Tracker, responsible for context compression and information management to maintain the task's state and coherence. Building on these, we construct the Coordinator-Executor-State Tracker (CES) multi-agent framework, which can be integrated with any low-level Executor model and assists the Executor in solving long-horizon tasks through task scheduling and state management. Experiments on long-horizon task benchmarks demonstrate that CES significantly enhances the system's planning and state management capabilities. Further analysis confirms that our trained high-level scheduling module is a generalizable, plug-and-play component that significantly improves the long-horizon capabilities of various Executors.
- Multi-Agent Decoupling: We build the CES multi-agent framework, featuring general-purpose, plug-and-play high-level components (Coordinator and State Tracker) that can integrate with various Executors and enhance their long-horizon abilities (see the sketch after this list).
- State Context Compression: We introduce a State Tracker whose core task is dynamic context compression and state summarization, effectively resolving the state-unawareness problem and maintaining the agent's logical coherence in long-horizon tasks.
- Staged Execution-Feedback RL: We propose a staged execution-feedback RL strategy. Its core idea is to decouple high-level capabilities from low-level execution: it freezes a pre-trained Executor and uses the reward signals derived from its execution to exclusively train the high-level Coordinator and State Tracker.
- Compelling Performance: Extensive experiments demonstrate that our method significantly enhances the long-horizon scheduling and state management capabilities of various Executor models and surpasses existing baselines.
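For intuition, here is a minimal Python sketch of how the three roles interact at inference time. All class and method names below (`CES`, `plan`, `act`, `compress`, the `env` interface) are illustrative placeholders, not this repository's actual API.

```python
# Minimal sketch of the CES loop. All names here (CES, plan, act, compress,
# the env interface) are illustrative placeholders, not this repo's real API.

class CES:
    def __init__(self, coordinator, executor, tracker):
        self.coordinator = coordinator  # high-level planner (trained)
        self.executor = executor        # low-level GUI model (frozen, pluggable)
        self.tracker = tracker          # state summarizer (trained)

    def run(self, task, env, max_steps=50):
        state = ""  # compressed task state maintained by the State Tracker
        for _ in range(max_steps):
            # 1. Coordinator: strategic planning and task decomposition.
            subgoal = self.coordinator.plan(task, state, env.screenshot())
            if subgoal == "DONE":
                break
            # 2. Executor: ground the subgoal into a concrete GUI action.
            action = self.executor.act(subgoal, env.screenshot())
            observation = env.step(action)
            # 3. State Tracker: compress history into a short state summary so
            #    long-horizon context never overflows and progress is not lost.
            state = self.tracker.compress(state, subgoal, action, observation)
        return state
```

Because the Executor only ever receives a subgoal and the current screen, any grounding model can be slotted into that role without retraining the high-level components.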
Create a conda virtual environment:
conda create --name ces python=3.10
conda activate ces
pip install -r requirements.txt
We use LLaMA-Factory for the warm-up SFT.
cd LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation
The SFT data is already placed in LLaMA-Factory/data, named planner_vl_sft and memory_sft respectively.
You just need to run:
bash examples/sft/train_coordinator.sh
bash examples/sft/train_tracker.sh
Then merge the trained LoRA adapters into the base models:
llamafactory-cli export examples/merge_lora/qwen2_5vl_lora_sft.yaml
llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
Download the dataset from Hugging Face and put it in ./data.
We use GUI-R1-7B as the Executor model, so download it first.
You can also try other, more powerful models, which may yield higher performance.
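For example, you can fetch both with huggingface_hub; the repo IDs below are hypothetical placeholders, so substitute the actual dataset and model repositories.

```python
# Example download via huggingface_hub; both repo IDs are hypothetical
# placeholders: substitute the actual dataset and GUI-R1-7B repositories.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="<dataset-repo-id>", repo_type="dataset",
                  local_dir="./data")
snapshot_download(repo_id="<gui-r1-7b-repo-id>",
                  local_dir="./models/GUI-R1-7B")
```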
Set the SFT_model and data paths in train_coordinator.sh, and then run:
cd ../
bash examples/train_rl/train_coordinator.sh
Note the path of the trained Coordinator, set it in train_tracker.sh, and then run:
bash examples/train_rl/train_tracker.sh
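The scripts above wrap the actual verl-based training. Purely for intuition, here is a toy, self-contained sketch of the staged execution-feedback idea; every class and function in it is a simplified placeholder, not the real training code.

```python
# Schematic of staged execution-feedback RL, NOT the verl implementation
# this repo uses. Everything below is a toy placeholder.
import random

class ToyModule:
    """Stand-in for a trainable high-level model (Coordinator or Tracker)."""
    def __init__(self, name):
        self.name = name
        self.updates = 0

    def policy_update(self, trajectory, reward):
        # Real training would apply a reward-weighted policy-gradient step.
        self.updates += 1

def rollout(trainable, frozen_executor, fixed_partner, task):
    """Placeholder for running the full CES loop in a GUI environment."""
    trajectory = [task]              # stand-in for (state, action) pairs
    success = random.random() > 0.5  # stand-in for the env's success check
    return trajectory, success

def train_stage(trainable, frozen_executor, fixed_partner, tasks):
    """Train one high-level module while the Executor stays frozen; the
    reward is pure execution feedback (task success or failure)."""
    for task in tasks:
        traj, success = rollout(trainable, frozen_executor, fixed_partner, task)
        trainable.policy_update(traj, reward=1.0 if success else 0.0)

# Stage 1 trains the Coordinator, Stage 2 the State Tracker, mirroring
# train_coordinator.sh followed by train_tracker.sh.
coordinator, tracker = ToyModule("coordinator"), ToyModule("tracker")
train_stage(coordinator, frozen_executor=None, fixed_partner=tracker, tasks=range(4))
train_stage(tracker, frozen_executor=None, fixed_partner=coordinator, tasks=range(4))
```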
We evaluate directly on the original data:
python examples/eval/eval.py
We thank the authors of the following code repositories: verl, LLaMA-Factory, vLLM, SWIRL, and GUI-R1.
If you find our work helpful, please cite our paper. Thank you very much!
@article{deng2025training,
title={Training High-Level Schedulers with Execution-Feedback Reinforcement Learning for Long-Horizon GUI Automation},
author={Deng, Zehao and Ju, Tianjie and Wu, Zheng and Zhang, Zhuosheng and Liu, Gongshen},
journal={arXiv preprint arXiv:2511.22235},
year={2025}
}

