
🖥️ LongHorizonUI

A Long-Horizon GUI Automation Agent Framework with Enhanced Perception, Deep Reflection, and Compensating Execution



📖 Introduction

LongHorizonUI is an agent framework designed for long-horizon GUI automation tasks. Existing GUI agents suffer from rapid success-rate degradation in long-step tasks (>10 steps) due to error accumulation. LongHorizonUI addresses this problem through three core modules:

| Module | Description |
|---|---|
| Multi-source Enhanced Perceiver (MEP) | Runs icon detection and OCR in parallel, resolves compound-widget ambiguity via IoU semantic binding, and repairs missing key elements with template matching |
| Deep Reflective Decider (DRD) | Performs multi-step look-ahead reasoning, retrospective action review, and causal inference on UI states to produce high-quality action decisions |
| Compensating Action Executor (CAE) | Applies a three-level fallback strategy (Index → Relative → Absolute+ε), post-execution verification, progress monitoring, and automatic rollback |
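The IoU semantic binding used by MEP can be illustrated with a minimal sketch: each OCR string is attached to the detected icon box it overlaps most, so an icon and its label resolve to one compound widget. The box format, function names, and threshold below are illustrative assumptions, not the repository's actual API.

```python
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def bind_text_to_icons(icon_boxes, ocr_items, threshold=0.3):
    """Attach each OCR string to the icon box it overlaps most,
    resolving compound-widget ambiguity (icon + label = one widget).
    threshold is a hypothetical cutoff below which no binding is made."""
    bound = []
    for text, text_box in ocr_items:
        best = max(icon_boxes, key=lambda ib: iou(ib, text_box), default=None)
        if best is not None and iou(best, text_box) >= threshold:
            bound.append({"icon": best, "label": text})
    return bound
```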

Framework Overview


🚀 Quick Start

1. Environment Setup

```bash
# Clone the repository
git clone <your-repo-url>
cd LongHorizonUI

# Install dependencies
pip install -r requirements.txt
```

2. Configure LLM API

```bash
cp .env.example .env
# Edit the .env file and fill in the API keys for your LLM provider
```

Supported LLM Providers:

| Provider | Required Environment Variables |
|---|---|
| Gemini (Recommended) | `LLM_PROJECT`, `LLM_LOCATION` |
| Azure OpenAI | `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY` |
| OpenAI | `OPENAI_ENDPOINT`, `OPENAI_API_KEY` |
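A minimal sketch of how a client might validate these variables before constructing a provider; the function name and error handling are assumptions, not the repository's actual code:

```python
import os

def resolve_provider(name: str) -> dict:
    """Collect the environment variables required for each supported
    LLM provider, raising if any required variable is missing."""
    required = {
        "gemini": ["LLM_PROJECT", "LLM_LOCATION"],
        "azure": ["AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY"],
        "openai": ["OPENAI_ENDPOINT", "OPENAI_API_KEY"],
    }
    keys = required[name]
    missing = [k for k in keys if k not in os.environ]
    if missing:
        raise EnvironmentError(f"{name}: missing {missing}")
    return {k: os.environ[k] for k in keys}
```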

3. Download Dataset (Optional)

We provide the LongGUIBench benchmark dataset for evaluation:

🤗 LongGUIBench Dataset (link available upon publication)

After downloading, place the data under the data/ directory:

```text
data/
├── general/          # General application scenarios
│   ├── app_a/
│   │   ├── task_001/
│   │   │   ├── screenshot/     # UI screenshot sequences
│   │   │   │   ├── 001.png
│   │   │   │   ├── 002.png
│   │   │   │   └── ...
│   │   │   └── task_infos.json # Task description and annotations
│   │   └── ...
│   ├── app_b/
│   └── ...
└── game/             # Game application scenarios
    ├── hero/
    └── ...
```

task_infos.json format example:

```json
{
  "task_name": "Create a new email in a mail app and send it to a contact",
  "task_steps": [
    {"action": "Click the menu button in the top-left corner"},
    {"action": "Select the compose email option"},
    {"action": "Enter the recipient address"},
    {"action": "Enter the email subject"},
    {"action": "Click the send button"}
  ]
}
```
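Loading a single task from the layout above can be sketched as follows; the helper name and return shape are illustrative assumptions, not the repository's loader:

```python
import json
from pathlib import Path

def load_task(task_dir):
    """Load one task directory: its task_infos.json annotation plus the
    screenshot sequence (001.png, 002.png, ...) ordered by filename."""
    task_dir = Path(task_dir)
    with open(task_dir / "task_infos.json", encoding="utf-8") as f:
        info = json.load(f)
    screenshots = sorted((task_dir / "screenshot").glob("*.png"))
    return info["task_name"], info["task_steps"], screenshots
```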

💻 Usage

Mode 1: Offline (Screenshot Simulation)

No phone connection is required: the agent's full reasoning and execution pipeline is simulated from pre-recorded screenshot sequences. This mode is suitable for:

- Offline evaluation and experiment reproduction
- Development and debugging without an Android device

```bash
# Low instruction mode (detailed step-by-step instructions provided)
python run.py offline \
  --data_dir data/general/app_a \
  --instruction_level low \
  --provider gemini \
  --model gemini-2.5-pro

# High instruction mode (only the task description is provided; the agent plans autonomously)
python run.py offline \
  --data_dir data/game/game_a \
  --instruction_level high \
  --provider gemini \
  --model gemini-2.5-pro
```
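Conceptually, offline mode replays the recorded screenshot sequence in place of live device execution; a minimal sketch of that loop (the `decide` callback and `"DONE"` sentinel are hypothetical, not the repository's interface):

```python
def run_offline(screenshots, decide, max_steps=100):
    """Replay a pre-recorded screenshot sequence: at each step decide()
    maps the current screenshot to an action, and 'execution' is simply
    advancing to the next recorded frame."""
    trace = []
    for step, shot in enumerate(screenshots):
        if step >= max_steps:
            break
        action = decide(shot)
        trace.append((shot, action))
        if action == "DONE":  # hypothetical terminal action
            break
    return trace
```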

Common Parameters

| Parameter | Description | Default |
|---|---|---|
| `--provider` | LLM provider | `gemini` |
| `--model` | Model name | `gemini-2.5-pro` |
| `--instruction_level` | Instruction level: `high` / `low` | `low` |
| `--max_steps` | Maximum number of execution steps | `100` |
| `--temperature` | LLM sampling temperature | `0.4` |
| `--output_dir` | Output directory | `./output` |
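For reference, an argparse definition matching the defaults in the table might look like this; it is a sketch only, and the actual `run.py` may define its CLI differently:

```python
import argparse

def build_parser():
    """Argument parser mirroring the common-parameters table above
    (a hypothetical reconstruction, not the repository's run.py)."""
    p = argparse.ArgumentParser(prog="run.py")
    p.add_argument("mode", choices=["offline"])
    p.add_argument("--data_dir", required=True)
    p.add_argument("--provider", default="gemini")
    p.add_argument("--model", default="gemini-2.5-pro")
    p.add_argument("--instruction_level", choices=["high", "low"], default="low")
    p.add_argument("--max_steps", type=int, default=100)
    p.add_argument("--temperature", type=float, default=0.4)
    p.add_argument("--output_dir", default="./output")
    return p
```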

📊 Main Results

LongGUIBench

On our newly constructed LongGUIBench, LongHorizonUI significantly outperforms existing methods in both general and game long-horizon scenarios.

LongGUIBench Results

ScreenSpot

On the ScreenSpot cross-platform UI grounding benchmark, LongHorizonUI surpasses previous state-of-the-art methods, validating the effectiveness of the IoU semantic binding strategy in the enhanced perception module.

ScreenSpot Results


📝 Citation

If this project is helpful for your research, please cite:

```bibtex
@inproceedings{anonymous2026longhorizonui,
  title={LongHorizon{UI}: A Unified Framework for Robust Long-Horizon Task Automation of {GUI} Agents},
  author={Anonymous},
  booktitle={International Conference on Learning Representations},
  year={2026},
  url={#}
}
```

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

About

ICLR2026
