AerialVP: Towards Accurate UAV Image Perception - Guiding Vision-Language Models with Stronger Task Prompts
This project introduces AerialVP (Aerial Visual Perception), an agent framework that automatically enhances task prompts for UAV image understanding with Vision-Language Models (VLMs). AerialVP analyzes the input task, selects suitable tools from a modular repository, and generates refined, information-enriched prompts to support perception, reasoning, and grounding in aerial imagery. The project also provides AerialSense, a unified benchmark suite covering Aerial Visual Reasoning, Aerial Visual Question Answering, and Aerial Visual Grounding, enabling standardized evaluation across diverse UAV scenarios.
AerialSense is a large-scale UAV perception benchmark curated from multiple public UAV sources and organized under a unified evaluation protocol. It is designed to support comprehensive, diverse, and realistic assessment of multimodal perception in aerial settings.
AerialSense jointly supports three core UAV perception tasks within a single benchmark:
- 🧠 Aerial Visual Reasoning (VR)
- ❓ Aerial Visual Question Answering (VQA)
- 🎯 Aerial Visual Grounding (VG)
- 🖼️ 7,119 UAV images
- 🧪 53,374 task samples

This high task density enables stable and systematic evaluation across multiple capabilities.
For grounding-oriented evaluation, AerialSense includes fine-grained referring expressions paired with object localization annotations, emphasizing:
- 🎨 semantic attributes (e.g., color, category)
- 📍 spatial cues (e.g., orientation/relative position)
- 🔗 relational descriptions (e.g., “next to”, “in front of”)
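Grounding predictions are typically scored against such box annotations with intersection-over-union (IoU), counting a prediction as correct when IoU ≥ 0.5 (the conventional Acc@0.5 metric). The sketch below shows a hypothetical sample layout — the field names and values are our own illustration, not the dataset's actual schema — together with a standard IoU computation:

```python
# Hypothetical AerialSense grounding sample; field names are illustrative
# and may differ from the released dataset's actual schema.
sample = {
    "image": "uav_000123.jpg",             # hypothetical file name
    "expression": "the red car next to the parking lot entrance",
    "bbox": [412.0, 230.0, 468.0, 275.0],  # (x_min, y_min, x_max, y_max)
}

def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as
    (x_min, y_min, x_max, y_max) in pixel coordinates."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A prediction is correct at Acc@0.5 when iou(pred, sample["bbox"]) >= 0.5.
```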
AerialSense covers 40 object categories and diverse target types, including:
- 🚗🚶 terrestrial moving objects (cars, pedestrians)
- 🏢
🅿️ stationary structures/regions (buildings, parking lots) - 🚢 waterborne moving objects (ships) It supports evaluation across terrestrial and aquatic scenes with varied operational contexts.
AerialSense includes images spanning a wide resolution range (approximately 512×512 to 4000×2250), enabling robust testing of:
- 🔎 multi-scale perception
- 📷 detail-sensitive recognition and localization
Task instructions are designed to be information-rich and reasoning-oriented, with an average length of 20.76 words and substantial coverage of:
- object attributes
- spatial relations
- functional states This encourages deeper language–vision alignment beyond short keyword-style prompts.
To reflect real UAV conditions, AerialSense incorporates broad visual variation in:
- 💡 illumination
- 🧩 scene complexity
- 🛰️ resolution

This supports fair evaluation of model adaptability and generalization in real-world aerial perception.
- [2025/12/23] 📢 Dataset Release: AerialSense on Hugging Face. We have released the AerialSense dataset on Hugging Face: https://huggingface.co/datasets/GuoMN/AerialSense. The dataset covers three UAV perception tasks: Aerial Visual Reasoning, Aerial Visual Question Answering, and Aerial Visual Grounding. The Aerial Visual Grounding task is built mainly from samples in the original data sources; the small number of samples lacking grounding instructions were manually annotated by our team. The Aerial Visual Reasoning and Aerial Visual Question Answering datasets were manually constructed and validated. If you have any questions or concerns about the data, please feel free to contact us.
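The dataset should be loadable through the standard `datasets` library. The sketch below is a minimal example assuming the release follows the usual Hub layout (the split name is an assumption — check the dataset card); the `normalize_answer`/`exact_match` helpers are our own illustration of a simple VQA exact-match check, not the paper's official metric.

```python
import re

def load_aerialsense(split: str = "test"):
    """Load AerialSense from the Hugging Face Hub.

    Requires the `datasets` package and network access; the split name
    here is an assumption -- see the dataset card for the actual layout.
    """
    from datasets import load_dataset
    return load_dataset("GuoMN/AerialSense", split=split)

def normalize_answer(ans: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so that
    answers like 'A car.' and 'a car' compare equal (illustrative
    normalization for simple exact-match VQA scoring)."""
    ans = ans.lower().strip()
    ans = re.sub(r"[^\w\s]", "", ans)
    return re.sub(r"\s+", " ", ans).strip()

def exact_match(pred: str, gold: str) -> bool:
    """True when the normalized prediction equals the normalized answer."""
    return normalize_answer(pred) == normalize_answer(gold)
```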
If you find our work useful in your research, please cite our paper:
@article{guo2025towards,
title={Towards Accurate UAV Image Perception: Guiding Vision-Language Models with Stronger Task Prompts},
author={Guo, Mingning and Wu, Mengwei and Li, Shaoxian and Li, Haifeng and Tao, Chao},
journal={arXiv preprint arXiv:2512.07302},
year={2025}
}

