MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning

🌐 Homepage | 📖 Paper | 🤗 Dataset

Introduction

MARBLE is a challenging multimodal reasoning benchmark designed to scrutinize multimodal language models (MLLMs) in their ability to reason carefully, step by step, through complex multimodal problems and environments. MARBLE is composed of two highly challenging tasks, M-Portal and M-Cube, that require crafting and understanding multistep plans under spatial, visual, and physical constraints. We find that current MLLMs perform poorly on MARBLE: all 12 advanced models obtain near-random performance on M-Portal and 0% accuracy on M-Cube. Only on simplified subtasks do some models outperform the random baseline, indicating that complex reasoning remains a challenge for existing MLLMs. Moreover, we show that perception remains a bottleneck, with MLLMs occasionally failing to extract information from the visual inputs. By shedding light on the limitations of MLLMs, we hope MARBLE will spur the development of the next generation of models able to reason and plan across many multimodal reasoning steps.

Environment

Prepare the environment by installing the dependencies:

pip install -r requirements.txt
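
If you prefer an isolated setup, the following sketch creates a fresh virtual environment first (the environment name is illustrative, not prescribed by the repository):

# illustrative environment name; any Python 3 environment works
python -m venv marble-env
source marble-env/bin/activate
pip install -r requirements.txt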

Evaluation

To evaluate on M-Cube, replace <Your API KEY> in the code and run:

python eval_cube_api.py --subset <SUBSET> --model_name gpt-4o

Set <SUBSET> to cube or cube_easy. You can also use cube_perception to evaluate the model on the simpler perception task.
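
For example, to run GPT-4o on the full task, the easier variant, and the perception check:

python eval_cube_api.py --subset cube --model_name gpt-4o
python eval_cube_api.py --subset cube_easy --model_name gpt-4o
python eval_cube_api.py --subset cube_perception --model_name gpt-4o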

To evaluate on M-Portal, first download the map images and unzip them:

wget https://huggingface.co/datasets/mrble/MARBLE/resolve/main/images.zip
unzip images.zip -d .

Then replace <Your API KEY> in the code and run:

python eval_portal.py --subset <SUBSET> --model_name gpt-4o

Set <SUBSET> to portal_binary (plan-correctness) or portal_blanks (fill-the-blanks).
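
For example, to run GPT-4o on both M-Portal subsets:

python eval_portal.py --subset portal_binary --model_name gpt-4o
python eval_portal.py --subset portal_blanks --model_name gpt-4o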

Similarly, one can use eval_cube_local.py and eval_portal_local.py to evaluate open-source models.
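
As a sketch, assuming the local scripts accept the same --subset and --model_name arguments as their API counterparts (the model identifier below is only a placeholder):

python eval_cube_local.py --subset cube --model_name <LOCAL_MODEL>
python eval_portal_local.py --subset portal_binary --model_name <LOCAL_MODEL>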

Results will be saved to ./output.

Contact

BibTex

@article{jiang2025marble,
  title={MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning},
  author={Jiang, Yulun and Chai, Yekun and Brbi\'c, Maria and Moor, Michael},
  journal={arXiv preprint arXiv:2506.22992},
  year={2025},
  url={https://arxiv.org/abs/2506.22992}
}
