MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning

🌐 Homepage | 📖 Paper | 🤗 Dataset

Introduction

MARBLE is a challenging multimodal reasoning benchmark designed to scrutinize multimodal language models (MLLMs) in their ability to reason carefully, step by step, through complex multimodal problems and environments. MARBLE is composed of two highly challenging tasks, M-Portal and M-Cube, that require crafting and understanding multistep plans under spatial, visual, and physical constraints. We find that current MLLMs perform poorly on MARBLE: all 12 advanced models obtain near-random performance on M-Portal and 0% accuracy on M-Cube. Only on simplified subtasks do some models outperform the random baseline, indicating that complex reasoning remains a challenge for existing MLLMs. Moreover, we show that perception remains a bottleneck, with MLLMs occasionally failing to extract information from the visual inputs. By shedding light on the limitations of MLLMs, we hope MARBLE will spur the development of the next generation of models able to reason and plan across many multimodal reasoning steps.

Environment

Prepare the environment by installing the dependencies:

pip install -r requirements.txt
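
If you prefer an isolated setup, the following sketch creates a fresh virtual environment first (the environment name is illustrative, not prescribed by the repository):

# illustrative environment name; any Python 3 environment works
python -m venv marble-env
source marble-env/bin/activate
pip install -r requirements.txt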

Evaluation

To evaluate on M-Cube, replace <Your API KEY> in the code and run:

python eval_cube_api.py --subset <SUBSET> --model_name gpt-4o

Set <SUBSET> to cube or cube_easy. You can also use cube_perception to evaluate the model on the simpler perception task.
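
For example, to run GPT-4o on the full task, the easier variant, and the perception check:

python eval_cube_api.py --subset cube --model_name gpt-4o
python eval_cube_api.py --subset cube_easy --model_name gpt-4o
python eval_cube_api.py --subset cube_perception --model_name gpt-4o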

To evaluate on M-Portal, first download the map images and unzip them:

wget https://huggingface.co/datasets/mrble/MARBLE/resolve/main/images.zip
unzip images.zip -d .

Then replace <Your API KEY> in the code and run:

python eval_portal.py --subset <SUBSET> --model_name gpt-4o

Set <SUBSET> to portal_binary (plan-correctness) or portal_blanks (fill-the-blanks).
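
For example, to run GPT-4o on both M-Portal subsets:

python eval_portal.py --subset portal_binary --model_name gpt-4o
python eval_portal.py --subset portal_blanks --model_name gpt-4o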

Similarly, one can use eval_cube_local.py and eval_portal_local.py to evaluate open-source models.
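
As a sketch, assuming the local scripts accept the same --subset and --model_name arguments as their API counterparts (the model identifier below is only a placeholder):

python eval_cube_local.py --subset cube --model_name <LOCAL_MODEL>
python eval_portal_local.py --subset portal_binary --model_name <LOCAL_MODEL>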

Results will be saved to ./output.

Contact

BibTex

@article{jiang2025marble,
  title={MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning},
  author={Jiang, Yulun and Chai, Yekun and Brbi\'c, Maria and Moor, Michael},
  journal={arXiv preprint arXiv:2506.22992},
  year={2025},
  url={https://arxiv.org/abs/2506.22992}
}
