🌐 Homepage | 📖 Paper | 🤗 Dataset
MARBLE is a challenging multimodal reasoning benchmark designed to scrutinize the ability of multimodal language models (MLLMs) to reason carefully, step by step, through complex multimodal problems and environments. MARBLE comprises two highly challenging tasks, M-Portal and M-Cube, that require crafting and understanding multistep plans under spatial, visual, and physical constraints. We find that current MLLMs perform poorly on MARBLE: all 12 advanced models obtain near-random performance on M-Portal and 0% accuracy on M-Cube. Only on simplified subtasks do some models outperform the random baseline, indicating that complex reasoning remains a challenge for existing MLLMs. Moreover, we show that perception remains a bottleneck: MLLMs occasionally fail to extract information from the visual inputs. By shedding light on the limitations of MLLMs, we hope MARBLE will spur the development of the next generation of models able to reason and plan across many multimodal reasoning steps.
Prepare the environment with the following command:
pip install -r requirements.txt
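Optionally, you can install the dependencies inside a fresh virtual environment first. This is a common setup rather than a requirement stated by this repo:

# Optional: create and activate a clean virtual environment before installing
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt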
To evaluate on M-Cube, replace <Your API KEY> in the code and run:
python eval_cube_api.py --subset <SUBSET> --model_name gpt-4o
Set <SUBSET> to cube or cube_easy. You can also use cube_perception to evaluate the model on the simple perception task.
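For example, to run gpt-4o on the simplified M-Cube subset:

# Evaluate gpt-4o on the simplified subset
python eval_cube_api.py --subset cube_easy --model_name gpt-4o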
To evaluate on M-Portal, first download the map images and unzip the archive:
wget https://huggingface.co/datasets/mrble/MARBLE/resolve/main/images.zip
unzip images.zip -d .
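If you want to inspect the archive before (or after) extracting, you can list its contents; this uses a standard unzip flag and is independent of this repo:

# Preview the files inside the downloaded archive
unzip -l images.zip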
Then replace <Your API KEY> in the code and run:
python eval_portal.py --subset <SUBSET> --model_name gpt-4o
Set <SUBSET> to portal_binary (plan-correctness) or portal_blanks (fill-the-blanks).
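For example, to run the plan-correctness task with gpt-4o:

# Evaluate gpt-4o on the plan-correctness task
python eval_portal.py --subset portal_binary --model_name gpt-4o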
Similarly, one can use eval_cube_local.py and eval_portal_local.py to evaluate open-source models.
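Assuming the local scripts accept the same flags as their API counterparts (not verified here), an invocation might look like:

# Hypothetical example: the exact flags of eval_portal_local.py may differ
python eval_portal_local.py --subset portal_blanks --model_name <MODEL>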
Results will be saved to ./output.
- Yulun Jiang: yulun.jiang@epfl.ch
@article{jiang2025marble,
  title={MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning},
  author={Jiang, Yulun and Chai, Yekun and Brbi{\'c}, Maria and Moor, Michael},
  journal={arXiv preprint arXiv:2506.22992},
  year={2025},
  url={https://arxiv.org/abs/2506.22992}
}