A Long-Horizon GUI Automation Agent Framework with Enhanced Perception, Deep Reflection, and Compensating Execution
LongHorizonUI is an agent framework designed for long-horizon GUI automation tasks. Existing GUI agents suffer rapid success-rate degradation on long-horizon tasks (>10 steps) due to error accumulation. LongHorizonUI addresses this problem through three core modules:
| Module | Description |
|---|---|
| Multi-source Enhanced Perceiver (MEP) | Runs icon detection and OCR in parallel, resolves compound widget ambiguity via IoU semantic binding, and repairs missing key elements with template matching |
| Deep Reflective Decider (DRD) | Multi-step look-ahead reasoning, retrospective action review, and causal inference on UI states for high-quality action decisions |
| Compensating Action Executor (CAE) | Three-level fallback strategy (Index → Relative → Absolute+ε), post-execution verification, progress monitoring, and automatic rollback |
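The IoU semantic binding in MEP can be illustrated with a minimal sketch (not the actual implementation; function names and the threshold are illustrative): each detected icon box is paired with the OCR text box that overlaps it most, which disambiguates compound widgets such as a labeled button detected twice by the two parallel detectors.

```python
# Sketch of IoU-based semantic binding: attach OCR text to icon detections.
# All names and the 0.1 threshold are hypothetical, for illustration only.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def bind_labels(icon_boxes, ocr_results, threshold=0.1):
    """Bind each icon box to the best-overlapping OCR text, if any."""
    bound = []
    for box in icon_boxes:
        best = max(ocr_results, key=lambda t: iou(box, t["box"]), default=None)
        label = best["text"] if best and iou(box, best["box"]) >= threshold else None
        bound.append({"box": box, "label": label})
    return bound

icons = [(10, 10, 50, 50), (100, 10, 140, 50)]
ocr = [{"box": (12, 12, 48, 48), "text": "Send"}]
bound = bind_labels(icons, ocr)  # first icon bound to "Send", second unbound
```

Icons without a sufficiently overlapping text box keep a `None` label, which is where the template-matching repair step would take over.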
```bash
# Clone the repository
git clone <your-repo-url>
cd LongHorizonUI

# Install dependencies
pip install -r requirements.txt

# Copy the environment template
cp .env.example .env

# Edit the .env file and fill in the API keys for your LLM provider
```

Supported LLM Providers:
| Provider | Required Environment Variables |
|---|---|
| Gemini (Recommended) | LLM_PROJECT, LLM_LOCATION |
| Azure OpenAI | AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY |
| OpenAI | OPENAI_ENDPOINT, OPENAI_API_KEY |
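For example, a Gemini configuration in `.env` might look like this (the values below are placeholders; only the variable names come from the table above):

```
# .env — example Gemini configuration (placeholder values)
LLM_PROJECT=my-gcp-project
LLM_LOCATION=us-central1
```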
We provide the LongGUIBench benchmark dataset for evaluation:
🤗 LongGUIBench Dataset (link available upon publication)
After downloading, place the data under the data/ directory:
```
data/
├── general/                    # General application scenarios
│   ├── app_a/
│   │   ├── task_001/
│   │   │   ├── screenshot/     # UI screenshot sequences
│   │   │   │   ├── 001.png
│   │   │   │   ├── 002.png
│   │   │   │   └── ...
│   │   │   └── task_infos.json # Task description and annotations
│   │   └── ...
│   ├── app_b/
│   └── ...
└── game/                       # Game application scenarios
    ├── hero/
    └── ...
```
task_infos.json format example:

```json
{
  "task_name": "Create a new email in a mail app and send it to a contact",
  "task_steps": [
    {"action": "Click the menu button in the top-left corner"},
    {"action": "Select the compose email option"},
    {"action": "Enter the recipient address"},
    {"action": "Enter the email subject"},
    {"action": "Click the send button"}
  ]
}
```

No phone connection required: offline mode simulates the agent's full reasoning and execution pipeline from pre-recorded screenshot sequences. Suitable for:
- Offline evaluation and experiment reproduction
- Development and debugging without an Android device
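A hypothetical sketch of how an offline run might iterate one task directory (the real `run.py` may differ): load `task_infos.json`, then replay the screenshot sequence step by step alongside the annotated actions.

```python
# Hypothetical loader for one LongGUIBench task directory, for illustration.
import json
from pathlib import Path

def iter_offline_task(task_dir):
    """Yield (step_index, screenshot_path, annotated_action) per step."""
    task_dir = Path(task_dir)
    info = json.loads((task_dir / "task_infos.json").read_text())
    shots = sorted((task_dir / "screenshot").glob("*.png"))
    steps = info.get("task_steps", [])
    for i, shot in enumerate(shots):
        # Screenshots beyond the annotated steps carry no reference action.
        action = steps[i]["action"] if i < len(steps) else None
        yield i, shot, action
```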
```bash
# Low instruction mode (detailed step-by-step instructions provided)
python run.py offline \
    --data_dir data/general/app_a \
    --instruction_level low \
    --provider gemini \
    --model gemini-2.5-pro

# High instruction mode (only task description provided, agent plans autonomously)
python run.py offline \
    --data_dir data/game/game_a \
    --instruction_level high \
    --provider gemini \
    --model gemini-2.5-pro
```

| Parameter | Description | Default |
|---|---|---|
| `--provider` | LLM provider | `gemini` |
| `--model` | Model name | `gemini-2.5-pro` |
| `--instruction_level` | Instruction level: `high` / `low` | `low` |
| `--max_steps` | Maximum execution steps | `100` |
| `--temperature` | LLM sampling temperature | `0.4` |
| `--output_dir` | Output directory | `./output` |
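The options and defaults above could be collected with `argparse` along these lines (a hypothetical sketch, not the actual `run.py`):

```python
# Hypothetical CLI sketch mirroring the documented offline-mode options.
import argparse

def build_parser():
    p = argparse.ArgumentParser(prog="run.py")
    sub = p.add_subparsers(dest="mode", required=True)
    off = sub.add_parser("offline", help="replay pre-recorded screenshots")
    off.add_argument("--data_dir", required=True)
    off.add_argument("--provider", default="gemini")
    off.add_argument("--model", default="gemini-2.5-pro")
    off.add_argument("--instruction_level", choices=["high", "low"], default="low")
    off.add_argument("--max_steps", type=int, default=100)
    off.add_argument("--temperature", type=float, default=0.4)
    off.add_argument("--output_dir", default="./output")
    return p

args = build_parser().parse_args(["offline", "--data_dir", "data/general/app_a"])
```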
On our self-constructed LongGUIBench, LongHorizonUI significantly outperforms existing methods across both general and game long-horizon scenarios.
On the ScreenSpot cross-platform UI grounding benchmark, LongHorizonUI surpasses previous state-of-the-art methods, validating the effectiveness of the IoU semantic binding strategy in the enhanced perception module.
If this project is helpful for your research, please cite:
```bibtex
@inproceedings{anonymous2026longhorizonui,
  title={LongHorizon{UI}: A Unified Framework for Robust Long-Horizon Task Automation of {GUI} Agents},
  author={Anonymous},
  booktitle={Conference on Learning Representations},
  year={2026},
  url={#}
}
```

This project is licensed under the Apache License 2.0 — see the LICENSE file for details.


