Haozhe Qi1,2*, Kevin Qu3, Mahdi Rad1, Rui Wang1, Alexander Mathis2, Marc Pollefeys1,3
1 Microsoft Spatial AI Lab, 2 EPFL, 3 ETH Zurich
* Work done during an internship at Microsoft
AdaptToken is a training-free framework for long video understanding with multimodal large language models (MLLMs). It uses response entropy as a global uncertainty signal to allocate token budgets across video groups, and cross-modal attention to rank tokens within each group. This enables both strong long-context performance and an efficient early-stopping variant (AdaptToken-Lite).
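Until the official code is available, the two ideas above can be illustrated with a minimal, unofficial sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the function names (`response_entropy`, `allocate_budgets`, `select_tokens`), the softmax-over-negative-entropy weighting, the `temperature` parameter, and the convention that a lower-entropy (more confident) group receives a larger share of the budget. The paper may define the mapping from entropy to budget differently.

```python
import math


def response_entropy(probs):
    """Shannon entropy (in nats) of a probability distribution over candidate answers."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def allocate_budgets(entropies, total_budget, temperature=1.0):
    """Split total_budget across groups using softmax(-entropy / temperature) weights.

    ASSUMPTION (not from the paper): lower entropy means a larger share.
    """
    logits = [-h / temperature for h in entropies]
    max_logit = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - max_logit) for l in logits]
    norm = sum(weights)
    raw = [total_budget * w / norm for w in weights]

    # Floor each share, then give the leftover tokens to the groups
    # with the largest fractional parts so the budgets sum exactly.
    budgets = [int(r) for r in raw]
    leftover = total_budget - sum(budgets)
    by_fraction = sorted(
        range(len(raw)), key=lambda i: raw[i] - budgets[i], reverse=True
    )
    for i in by_fraction[:leftover]:
        budgets[i] += 1
    return budgets


def select_tokens(attention_scores, budget):
    """Return indices of the top-`budget` tokens by score, in original order."""
    ranked = sorted(
        range(len(attention_scores)),
        key=lambda i: attention_scores[i],
        reverse=True,
    )
    return sorted(ranked[:budget])


# Hypothetical example: two video groups, one confident and one uncertain response.
entropies = [
    response_entropy([0.97, 0.01, 0.01, 0.01]),
    response_entropy([0.25, 0.25, 0.25, 0.25]),
]
budgets = allocate_budgets(entropies, total_budget=8)
kept = select_tokens([0.1, 0.9, 0.3, 0.7], budget=2)
```

In this sketch a group whose budget exceeds its token count simply keeps all of its tokens; redistributing the unused remainder to other groups is left out for brevity.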
We are currently preparing the codebase for release. Stay tuned.
If you find our work useful, please consider citing:
@misc{qi2026adapttoken,
  title={AdaptToken: Entropy-based Adaptive Token Selection for MLLM Long Video Understanding},
  author={Haozhe Qi and Kevin Qu and Mahdi Rad and Rui Wang and Alexander Mathis and Marc Pollefeys},
  year={2026},
  eprint={2603.28696},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2603.28696},
}