lostwolves/AerialVP

AerialVP: Towards Accurate UAV Image Perception: Guiding Vision-Language Models with Stronger Task Prompts

This project introduces AerialVP (Aerial Visual Perception), an agent framework that automatically enhances task prompts for UAV image understanding with Vision-Language Models (VLMs). AerialVP analyzes the input task, selects suitable tools from a modular repository, and generates refined, information-enriched prompts to support perception, reasoning, and grounding in aerial imagery. The project also provides AerialSense, a unified benchmark suite covering Aerial Visual Reasoning, Aerial Visual Question Answering, and Aerial Visual Grounding, enabling standardized evaluation across diverse UAV scenarios.

AerialVP Overview

📦 AerialSense Dataset

AerialSense is a large-scale UAV perception benchmark curated from multiple public UAV sources and organized under a unified evaluation protocol. It is designed to support comprehensive, diverse, and realistic assessment of multimodal perception in aerial settings.

AerialSense Overview

🧭 Multi-Task Coverage

AerialSense jointly supports three core UAV perception tasks within a single benchmark:

  • 🧠 Aerial Visual Reasoning (VR)
  • Aerial Visual Question Answering (VQA)
  • 🎯 Aerial Visual Grounding (VG)

📊 Scale & Density

  • 🖼️ 7,119 UAV images
  • 🧪 53,374 task samples

This high task density enables stable and systematic evaluation across multiple capabilities.
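As a quick illustration of the task density mentioned above, the two reported counts can be turned into an average number of task samples per image. This is only arithmetic on the figures in this README; the variable names are illustrative, and the per-task breakdown is not stated here, so it is not computed.

```python
# Counts as reported in this README.
NUM_IMAGES = 7_119
NUM_TASK_SAMPLES = 53_374

# Average number of task samples (VR + VQA + VG combined) per UAV image.
samples_per_image = NUM_TASK_SAMPLES / NUM_IMAGES

print(f"Average task samples per image: {samples_per_image:.2f}")
```

On average, each image is therefore reused for roughly seven to eight task samples.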

🧩 Rich Annotations for Grounding

For grounding-oriented evaluation, AerialSense includes fine-grained referring expressions paired with object localization annotations, emphasizing:

  • 🎨 semantic attributes (e.g., color, category)
  • 📍 spatial cues (e.g., orientation/relative position)
  • 🔗 relational descriptions (e.g., “next to”, “in front of”)

🏷️ Broad Object & Scene Diversity

AerialSense covers 40 object categories and diverse target types, including:

  • 🚗🚶 terrestrial moving objects (cars, pedestrians)
  • 🏢🅿️ stationary structures/regions (buildings, parking lots)
  • 🚢 waterborne moving objects (ships)

It supports evaluation across terrestrial and aquatic scenes with varied operational contexts.

🛰️ Multi-Resolution UAV Imagery

AerialSense includes images spanning a wide resolution range (approximately 512×512 to 4000×2250), enabling robust testing of:

  • 🔎 multi-scale perception
  • 📷 detail-sensitive recognition and localization
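To give a sense of how wide the stated resolution range is, the sketch below compares the two approximate endpoints by pixel count and aspect ratio. The endpoint values come from this README; the helper names are hypothetical, and the distribution of resolutions between the endpoints is not described here.

```python
from fractions import Fraction

# Approximate endpoints of the resolution range, as (width, height).
MIN_RES = (512, 512)
MAX_RES = (4000, 2250)


def pixel_count(res):
    """Total number of pixels for a (width, height) pair."""
    width, height = res
    return width * height


def aspect_ratio(res):
    """Width-to-height ratio as an exact reduced fraction."""
    width, height = res
    return Fraction(width, height)


# How many times more pixels the largest images have than the smallest.
pixel_ratio = pixel_count(MAX_RES) / pixel_count(MIN_RES)

print(f"Smallest: {pixel_count(MIN_RES):,} px, aspect {aspect_ratio(MIN_RES)}")
print(f"Largest:  {pixel_count(MAX_RES):,} px, aspect {aspect_ratio(MAX_RES)}")
print(f"Pixel-count ratio: {pixel_ratio:.2f}x")
```

A model that resizes every input to one fixed size will compress the largest images far more than the smallest, which is part of what makes multi-scale perception hard to evaluate fairly.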

📝 Instruction Richness

Task instructions are designed to be information-rich and reasoning-oriented, with an average length of 20.76 words and substantial coverage of:

  • object attributes
  • spatial relations
  • functional states

This encourages deeper language–vision alignment beyond short keyword-style prompts.

🌦️ Realistic Visual Variations

To reflect real UAV conditions, AerialSense incorporates broad visual variation in:

  • 💡 illumination
  • 🧩 scene complexity
  • 🛰️ resolution

This variation supports fair evaluation of model adaptability and generalization in real-world aerial perception.

Release Notes

  • [2025/12/23] 📢 Dataset Release: AerialSense on Hugging Face. The AerialSense dataset is now available at https://huggingface.co/datasets/GuoMN/AerialSense. It covers three UAV perception tasks: Aerial Visual Reasoning, Aerial Visual Question Answering, and Aerial Visual Grounding. The Aerial Visual Grounding data is built mainly from samples in the original data sources; the small number of samples that lacked grounding instructions were annotated manually by our team. The Aerial Visual Reasoning and Aerial Visual Question Answering data were manually constructed and validated. If you have any questions or concerns about the data, please contact us.
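A minimal sketch of fetching the released data with the Hugging Face `datasets` library might look like the following. It assumes the repository can be read directly by `load_dataset`. The configuration names, split names, and column layout are not documented in this README, so check the dataset card before relying on them. `repo_id_from_url` and `load_aerialsense` are hypothetical helpers, not part of AerialVP.

```python
from urllib.parse import urlparse

# Dataset URL as given in the release note.
DATASET_URL = "https://huggingface.co/datasets/GuoMN/AerialSense"


def repo_id_from_url(url):
    """Turn a Hugging Face dataset URL into an "<owner>/<name>" repo id."""
    parts = urlparse(url).path.strip("/").split("/")
    # Expected path layout: datasets/<owner>/<name>
    if len(parts) != 3 or parts[0] != "datasets":
        raise ValueError(f"Not a Hugging Face dataset URL: {url}")
    return "/".join(parts[1:])


REPO_ID = repo_id_from_url(DATASET_URL)


def load_aerialsense(**kwargs):
    """Download the dataset (assumption: it is loadable via load_dataset).

    Any keyword arguments (for example a config name or split) are passed
    through unchanged; their valid values are not documented in this README.
    """
    # Imported lazily so the helper above works without the library installed.
    from datasets import load_dataset  # pip install datasets

    return load_dataset(REPO_ID, **kwargs)
```

Calling `load_aerialsense()` needs network access and the `datasets` package. If the repository stores raw files rather than a loadable dataset, `huggingface_hub.snapshot_download(repo_id=REPO_ID, repo_type="dataset")` is an alternative way to fetch them.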

Citation

If you find our work useful in your research, please cite our paper:

@article{guo2025towards,
  title={Towards Accurate UAV Image Perception: Guiding Vision-Language Models with Stronger Task Prompts},
  author={Guo, Mingning and Wu, Mengwei and Li, Shaoxian and Li, Haifeng and Tao, Chao},
  journal={arXiv preprint arXiv:2512.07302},
  year={2025}
}
