Shuoyuan Wang1, Yiran Wang2, Hongxin Wei1*
1Department of Statistics and Data Science, Southern University of Science and Technology, Shenzhen, China
2Department of Earth and Space Sciences, Southern University of Science and Technology, Shenzhen, China
*Corresponding author
We introduce MarsRetrieval, an extensive retrieval benchmark for evaluating the utility of vision-language models in Martian geospatial discovery. Specifically, MarsRetrieval organizes evaluation into three complementary tasks: (1) paired image–text retrieval, (2) landform retrieval, and (3) global geo-localization, covering multiple spatial scales and diverse geomorphic origins. MarsRetrieval aims to bridge the gap between multimodal AI capabilities and the needs of real-world planetary research.
1. Installation
For installation and other package requirements, please follow the instructions detailed in docs/INSTALL.md.
2. Data preparation
Please follow the instructions at docs/DATASET.md to prepare all datasets.
Example runs for each task, using representative models (e.g., PE-Core, Qwen3-VL-Embedding, MarScope):
Paired Image-Text Retrieval
```bash
GPU_ID=0
bash scripts/paired_image_text_retrieval/openclip.sh ${GPU_ID}
bash scripts/paired_image_text_retrieval/qwen3_vl_embedding.sh ${GPU_ID}
bash scripts/paired_image_text_retrieval/marscope.sh ${GPU_ID}
```
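The scripts above handle embedding extraction and scoring end to end. For readers unfamiliar with the task, the snippet below is a minimal sketch of the standard paired image–text retrieval metric, Recall@K over cosine similarities between paired embeddings; the array shapes and metric choice are generic assumptions, not the benchmark's exact evaluation code.

```python
import numpy as np

def recall_at_k(image_emb: np.ndarray, text_emb: np.ndarray, k: int = 5) -> float:
    """Fraction of images whose paired caption (same row index) appears in the top-k retrieved texts."""
    sims = image_emb @ text_emb.T                               # (N, N) cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]                     # top-k text indices per image
    hits = (topk == np.arange(len(sims))[:, None]).any(axis=1)  # is the true pair among them?
    return float(hits.mean())

# Toy usage with random unit vectors; replace with embeddings from any model below.
rng = np.random.default_rng(0)
img = rng.normal(size=(100, 512)); img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(100, 512)); txt /= np.linalg.norm(txt, axis=1, keepdims=True)
print("image-to-text Recall@5:", recall_at_k(img, txt, k=5))
```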
Landform Retrieval
```bash
GPU_ID=0
bash scripts/landform_retrieval/openclip.sh ${GPU_ID}
bash scripts/landform_retrieval/qwen3_vl_embedding.sh ${GPU_ID}
bash scripts/landform_retrieval/marscope.sh ${GPU_ID}
```
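Landform retrieval is scored by the scripts themselves; as a rough, assumption-laden illustration, the sketch below treats it as label-based nearest-neighbor retrieval (do a query's top-k neighbors share its landform class?), with hypothetical class labels and Precision@K as the example metric. The actual protocol and metrics are defined in scripts/landform_retrieval.

```python
import numpy as np

def precision_at_k(query_emb, gallery_emb, query_labels, gallery_labels, k=10):
    """Mean fraction of the top-k retrieved gallery items sharing the query's landform label.

    All embeddings are assumed L2-normalized; labels are integer landform classes.
    """
    sims = query_emb @ gallery_emb.T                     # cosine similarity to every gallery item
    topk = np.argsort(-sims, axis=1)[:, :k]              # indices of the k nearest gallery items
    correct = gallery_labels[topk] == query_labels[:, None]
    return float(correct.mean())

# Toy usage with hypothetical landform classes (e.g., 0=crater, 1=dune, 2=gully).
rng = np.random.default_rng(0)
gallery = rng.normal(size=(500, 512)); gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
queries = gallery[:50] + 0.1 * rng.normal(size=(50, 512))
queries /= np.linalg.norm(queries, axis=1, keepdims=True)
g_labels = rng.integers(0, 3, size=500)
q_labels = g_labels[:50]
print("P@10:", precision_at_k(queries, gallery, q_labels, g_labels, k=10))
```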
Global Geo-Localization
We highly recommend building the retrieval database in distributed mode first; experiments can then be run on a single GPU.
```bash
# Distributed DB build with 8 GPUs
bash scripts/geolocalization/openclip.sh 0,1,2,3,4,5,6,7

# Single-GPU runs
GPU_ID=0
bash scripts/geolocalization/openclip.sh ${GPU_ID}
bash scripts/geolocalization/qwen3_vl_embedding.sh ${GPU_ID}
bash scripts/geolocalization/marscope.sh ${GPU_ID}
```
For more models, see the scripts under scripts/paired_image_text_retrieval, scripts/landform_retrieval, and scripts/geolocalization for the respective tasks.
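Conceptually, global geo-localization first builds a database of reference-tile embeddings with known coordinates (the distributed build above) and then answers queries by nearest-neighbor search against it. The sketch below illustrates only the query step, under generic assumptions (L2-normalized embeddings with lon/lat metadata stored alongside them); it is not the repository's actual database format or pipeline.

```python
import numpy as np

def localize(query_emb, db_emb, db_lonlat, k=1):
    """Predict a query image's location as the coordinates of its k most similar database tiles.

    query_emb: (Q, D) and db_emb: (N, D) L2-normalized embeddings; db_lonlat: (N, 2) reference lon/lat.
    """
    sims = query_emb @ db_emb.T                 # cosine similarity to every reference tile
    topk = np.argsort(-sims, axis=1)[:, :k]     # k nearest reference tiles per query
    return db_lonlat[topk]                      # (Q, k, 2) predicted coordinates

# Toy usage: a random "global" database of 1000 reference tiles with lon/lat metadata.
rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 512)); db /= np.linalg.norm(db, axis=1, keepdims=True)
lonlat = np.stack([rng.uniform(-180, 180, 1000), rng.uniform(-90, 90, 1000)], axis=1)
queries = db[:5] + 0.05 * rng.normal(size=(5, 512))
queries /= np.linalg.norm(queries, axis=1, keepdims=True)
print(localize(queries, db, lonlat, k=1).squeeze(1))
```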
Encoder-based
| Model | Paper | Code |
|---|---|---|
| DFN2B-CLIP-ViT-L-14 | link | link |
| ViT-L-16-SigLIP-384 | link | link |
| ViT-L-16-SigLIP2-512 | link | link |
| PE-Core-L-14-336 | link | link |
| BGE-VL-large | link | link |
| aimv2-large-patch14-224 | link | link |
| aimv2-large-patch14-448 | link | link |
| dinov3-vitl16 | link | link |
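The encoder-based baselines are typically available through the open_clip library. The snippet below sketches how such an encoder can be loaded to embed a Mars image and candidate captions; the model name, pretrained tag, and image path are illustrative assumptions and may differ from the checkpoints and preprocessing used by the benchmark scripts.

```python
import torch
import open_clip
from PIL import Image

# Illustrative only: model name and pretrained tag are assumptions,
# not necessarily the exact checkpoints used by the benchmark scripts.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-16-SigLIP-384", pretrained="webli"
)
tokenizer = open_clip.get_tokenizer("ViT-L-16-SigLIP-384")
model.eval()

image = preprocess(Image.open("example_mars_tile.png")).unsqueeze(0)  # placeholder path
text = tokenizer(["a dune field on Mars", "an impact crater with a central peak"])

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(text)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)

print(img_emb @ txt_emb.T)  # cosine similarities between the image and each caption
```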
MLLM-based
| Model | Paper | Code |
|---|---|---|
| E5-V | link | link |
| gme | link | link |
| B3++ | link | link |
| jina-embeddings-v4 | link | link |
| VLM2Vec-V2.0 | link | link |
| Ops-MM-embedding-v1 | link | link |
| Qwen3-VL-Embedding | link | link |
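MLLM-based embedding models commonly turn a multimodal prompt into a single vector by pooling the decoder's last hidden layer (often at the final token) and L2-normalizing it. The sketch below shows only that pooling step on dummy hidden states; the actual prompting and model-loading code differs per model and is handled by the respective scripts.

```python
import torch

def last_token_embedding(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Pool a decoder's last hidden layer into one embedding per sequence.

    hidden_states: (B, T, D) last-layer states; attention_mask: (B, T) with 1 for real tokens.
    Takes the hidden state at each sequence's final non-padded position, then L2-normalizes.
    """
    last_idx = attention_mask.sum(dim=1) - 1          # index of the last real token per sequence
    pooled = hidden_states[torch.arange(hidden_states.size(0)), last_idx]
    return torch.nn.functional.normalize(pooled, dim=-1)

# Dummy usage: batch of 2 sequences, 7 tokens, 4096-dim states (second sequence padded after 5 tokens).
states = torch.randn(2, 7, 4096)
mask = torch.tensor([[1] * 7, [1] * 5 + [0] * 2])
print(last_token_embedding(states, mask).shape)       # torch.Size([2, 4096])
```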
If you find this useful in your research, please consider citing:
```bibtex
@article{wang2026marsretrieval,
  title={MarsRetrieval: Benchmarking Vision-Language Models for Planetary-Scale Geospatial Retrieval on Mars},
  author={Wang, Shuoyuan and Wang, Yiran and Wei, Hongxin},
  journal={arXiv preprint arXiv:2602.13961},
  year={2026}
}
```
