Jihwan Kim1,2, Nikhil Parthasarathy1, Danfeng Qin1, Junhwa Hur1, Deqing Sun1, Bohyung Han1,2, Ming-Hsuan Yang1, Boqing Gong1
1Google DeepMind, 2Seoul National University
TL;DR: We propose LiteFrame, a highly efficient video encoder for Video Large Language Models that unlocks scalable, long-form video understanding by resolving inefficiencies in both the LLM and the ViT.
🚧 Note: Code and model weights will be released soon.
- [2026.05.18] Our paper, LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs, is now available on arXiv.
If you find our work useful for your research, please consider citing:
@article{kim2026liteframe,
title={LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs},
author={Kim, Jihwan and Parthasarathy, Nikhil and Qin, Danfeng and Hur, Junhwa and Sun, Deqing and Han, Bohyung and Yang, Ming-Hsuan and Gong, Boqing},
journal={arXiv preprint arXiv:2605.17260},
year={2026}
}