Flat-Pack Bench:
Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

Aditya Chetan, Eric Cai, Peeyush Kushwaha, Bharath Raj Nagoor Kani, Utkarsh Mall, Qianqian Wang, Noah Snavely, Bharath Hariharan

CVPR 2026

Overview

The emergence of Large Vision-Language Models (LVLMs) has significantly advanced video understanding. However, existing benchmarks focus predominantly on coarse-grained tasks such as action segmentation, classification, captioning, and retrieval. These benchmarks also tend to rely on entities that are easy to name, such as household objects, animals, and human subjects, which limits their applicability to complex, in-the-wild video scenarios. Many applications, such as furniture assembly and cooking, require step-by-step, fine-grained spatio-temporal understanding of video, which current benchmarks do not sufficiently evaluate. To address this gap, we introduce Flat-Pack Bench, a novel benchmark centered on furniture assembly tasks. Our benchmark evaluates LVLMs on nuanced tasks, including temporal ordering of assembly actions, temporal localization of assembly state, understanding of part mating, and tracking. It uses multiple-choice questions paired with visual prompts that highlight the relevant parts as references for fine-grained questions. Our experiments reveal that state-of-the-art LVLMs struggle significantly with fine-grained spatio-temporal reasoning, highlighting their limitations in leveraging temporal information from videos, in tracking, and in understanding spatial interactions such as physical contact.

Setup

Installation and environment setup instructions are maintained in setup/README.md. The setup guide covers the default fpb environment plus the fpb-llava, fpb-plm, and fpb-sam2 environments used by model-specific experiments.

Experiments

Experiment and evaluation code lives under src/. See src/eval/ for benchmark inference, prompt rendering, evaluation, and tabulation, and src/tva/ for Temporal Video Agent experiments.

Run Evaluations

To run model evaluations, start with the evaluation package documentation in src/eval/README.md. It documents the Hydra configs, media pipelines, model wrappers, inference entrypoint, scoring scripts, and result tabulation utilities.
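
Because the benchmark is built from multiple-choice questions grouped by task type, the core of scoring can be pictured as per-task accuracy. The snippet below is a minimal sketch of that idea only; it is not the scoring code in src/eval/. The record keys (`task`, `prediction`, `answer`), the task names, and the function `per_task_accuracy` are hypothetical and chosen for illustration, so consult src/eval/README.md for the actual file formats and entrypoints.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Compute accuracy per task from multiple-choice predictions.

    `records` is an iterable of dicts with the hypothetical keys
    'task', 'prediction', and 'answer' (option letters as strings).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        task = rec["task"]
        total[task] += 1
        # Compare option letters case-insensitively, ignoring surrounding whitespace.
        if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


# Made-up records for illustration only; not real benchmark results.
example_records = [
    {"task": "temporal_ordering", "prediction": "A", "answer": "A"},
    {"task": "temporal_ordering", "prediction": "b", "answer": "C"},
    {"task": "part_mating", "prediction": " d ", "answer": "D"},
]

print(per_task_accuracy(example_records))
```

Reporting accuracy per task rather than as a single pooled number keeps a strong result on one question type from hiding a weak result on another.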

Dataset download and local data layout details are documented in data/README.md.

Limitations

While we strive to maintain the highest quality of annotations, some imperfections might exist. If you notice annotation issues, ambiguous questions, or other dataset problems, please point them out to us so we can improve the benchmark.

Acknowledgements

Flat-Pack Bench builds on IKEA Manuals at Work, whose furniture assembly videos and annotations made this benchmark possible. We thank the IKEA-Manuals-At-Work authors for releasing this valuable resource.

Citation

If you use Flat-Pack Bench in your research, please consider citing our paper:

@InProceedings{Chetan_2026_CVPR,
    author    = {Chetan, Aditya and Cai, Eric and Kushwaha, Peeyush and Kani, Bharath Raj Nagoor and Mall, Utkarsh and Wang, Qianqian and Snavely, Noah and Hariharan, Bharath},
    title     = {Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {16624-16634}
}

