AerialVP: Towards Accurate UAV Image Perception - Guiding Vision-Language Models with Stronger Task Prompts
This project introduces AerialVP (Aerial Visual Perception), an agent framework that automatically enhances task prompts for UAV image understanding with Vision-Language Models (VLMs). AerialVP analyzes the input task, selects suitable tools from a modular repository, and generates refined, information-enriched prompts to support perception, reasoning, and grounding in aerial imagery. The project also provides AerialSense, a unified benchmark suite covering Aerial Visual Reasoning, Aerial Visual Question Answering, and Aerial Visual Grounding, enabling standardized evaluation across diverse UAV scenarios.
AerialSense is a large-scale UAV perception benchmark curated from multiple public UAV sources and organized under a unified evaluation protocol. It is designed to support comprehensive, diverse, and realistic assessment of multimodal perception in aerial settings.
AerialSense jointly supports three core UAV perception tasks within a single benchmark:
- 🧠 Aerial Visual Reasoning (VR)
- ❓ Aerial Visual Question Answering (VQA)
- 🎯 Aerial Visual Grounding (VG)
- 🖼️ 7,119 UAV images
- 🧪 53,374 task samples

This high task density enables stable and systematic evaluation across multiple capabilities.
For grounding-oriented evaluation, AerialSense includes fine-grained referring expressions paired with object localization annotations, emphasizing:
- 🎨 semantic attributes (e.g., color, category)
- 📍 spatial cues (e.g., orientation/relative position)
- 🔗 relational descriptions (e.g., “next to”, “in front of”)
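Grounding predictions are typically scored against such box annotations with intersection-over-union (IoU), counting a prediction as correct when IoU ≥ 0.5 (the conventional Acc@0.5 metric). The sketch below shows a hypothetical sample layout — the field names and values are our own illustration, not the dataset's actual schema — together with a standard IoU computation:

```python
# Hypothetical AerialSense grounding sample; field names are illustrative
# and may differ from the released dataset's actual schema.
sample = {
    "image": "uav_000123.jpg",             # hypothetical file name
    "expression": "the red car next to the parking lot entrance",
    "bbox": [412.0, 230.0, 468.0, 275.0],  # (x_min, y_min, x_max, y_max)
}

def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as
    (x_min, y_min, x_max, y_max) in pixel coordinates."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A prediction is correct at Acc@0.5 when iou(pred, sample["bbox"]) >= 0.5.
```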
AerialSense covers 40 object categories and diverse target types, including:
- 🚗🚶 terrestrial moving objects (cars, pedestrians)
- 🏢
🅿️ stationary structures/regions (buildings, parking lots) - 🚢 waterborne moving objects (ships) It supports evaluation across terrestrial and aquatic scenes with varied operational contexts.
AerialSense includes images spanning a wide resolution range (approximately 512×512 to 4000×2250), enabling robust testing of:
- 🔎 multi-scale perception
- 📷 detail-sensitive recognition and localization
Task instructions are designed to be information-rich and reasoning-oriented, with an average length of 20.76 words and substantial coverage of:
- object attributes
- spatial relations
- functional states This encourages deeper language–vision alignment beyond short keyword-style prompts.
To reflect real UAV conditions, AerialSense incorporates broad visual variation in:
- 💡 illumination
- 🧩 scene complexity
- 🛰️ resolution

This supports fair evaluation of model adaptability and generalization in real-world aerial perception.
- [2025/12/23] 📢 Dataset Release: AerialSense on Hugging Face. We have released the AerialSense dataset on Hugging Face: https://huggingface.co/datasets/GuoMN/AerialSense. The dataset covers three UAV perception tasks: Aerial Visual Reasoning, Aerial Visual Question Answering, and Aerial Visual Grounding. The Aerial Visual Grounding task is built mainly from samples in the original data sources; the small number of samples lacking grounding instructions were manually annotated by our team. The Aerial Visual Reasoning and Aerial Visual Question Answering datasets were manually constructed and validated. If you have any questions or concerns about the data, please feel free to contact us.
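The dataset should be loadable through the standard `datasets` library. The sketch below is a minimal example assuming the release follows the usual Hub layout (the split name is an assumption — check the dataset card); the `normalize_answer`/`exact_match` helpers are our own illustration of a simple VQA exact-match check, not the paper's official metric.

```python
import re

def load_aerialsense(split: str = "test"):
    """Load AerialSense from the Hugging Face Hub.

    Requires the `datasets` package and network access; the split name
    here is an assumption -- see the dataset card for the actual layout.
    """
    from datasets import load_dataset
    return load_dataset("GuoMN/AerialSense", split=split)

def normalize_answer(ans: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so that
    answers like 'A car.' and 'a car' compare equal (illustrative
    normalization for simple exact-match VQA scoring)."""
    ans = ans.lower().strip()
    ans = re.sub(r"[^\w\s]", "", ans)
    return re.sub(r"\s+", " ", ans).strip()

def exact_match(pred: str, gold: str) -> bool:
    """True when the normalized prediction equals the normalized answer."""
    return normalize_answer(pred) == normalize_answer(gold)
```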
If you find our work useful in your research, please cite our paper:
@article{guo2025towards,
title={Towards Accurate UAV Image Perception: Guiding Vision-Language Models with Stronger Task Prompts},
author={Guo, Mingning and Wu, Mengwei and Li, Shaoxian and Li, Haifeng and Tao, Chao},
journal={arXiv preprint arXiv:2512.07302},
year={2025}
}

