VideoBrain: Learning Adaptive Frame Sampling for Video Understanding

Junbo Zou, Ziheng Huang, Shengjie Zhang, Liwen Zhang, Weining Shen

🔥 Update

[2026.5] Our paper has been accepted to ICML 2026!

Overview

Long-form video understanding remains challenging for vision-language models (VLMs) because limited compute conflicts with the need to capture information spread across thousands of frames. Existing approaches either sample frames uniformly, which risks information loss, or select keyframes in a single pass, with no way to recover from poor choices.

VideoBrain addresses this with a behavior-aware reward function and two complementary sampling agents:

- CLIP Sample Agent: semantic retrieval across the whole video (finds specific visual content regardless of its temporal location)
- Uniform Sample Agent: dense temporal sampling within a chosen interval (captures fine-grained sequential information)
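
The two sampling strategies can be illustrated with a minimal sketch. It assumes frame and query embeddings have already been computed by a CLIP-style encoder; the function names `clip_sample` and `uniform_sample` are hypothetical and are not taken from this repository.

```python
import numpy as np


def clip_sample(frame_embs, query_emb, k):
    """Return indices of the k frames most similar to the query, in temporal order.

    frame_embs: array of shape (num_frames, dim); query_emb: array of shape (dim,).
    """
    # Cosine similarity = dot product of L2-normalised vectors.
    frames = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    query = query_emb / np.linalg.norm(query_emb)
    scores = frames @ query
    top = np.argsort(-scores)[:k]
    return sorted(int(i) for i in top)


def uniform_sample(start, end, n):
    """Return n evenly spaced frame indices in the closed interval [start, end]."""
    return [int(round(float(x))) for x in np.linspace(start, end, n)]


# Toy example: four 2-D "frame embeddings" and a query pointing along the first axis.
toy_frames = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
toy_query = np.array([1.0, 0.0])
retrieved = clip_sample(toy_frames, toy_query, k=2)
dense = uniform_sample(100, 200, 5)
```

Retrieval ignores where a frame sits in time, while uniform sampling ignores content and covers an interval evenly, which is why the two are complementary.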

Unlike prior agent-based methods, in which a text-only LLM orchestrates visual tools, the VLM here perceives frames directly and reasons about whether it has enough information to answer: it is "thinking with video."

Training Pipeline

The pipeline has three steps: data classification, followed by two training stages (SFT, then RL).

1. Data Classification: A dual-model pipeline sorts video QA samples into three categories according to whether agent invocation is beneficial:
   - Direct: Initial frames are sufficient; agent calls are not needed
   - Adaptive: Model capability gap; either strategy (answering directly or calling an agent) may help
   - Active: Additional visual information is genuinely required
2. Supervised Fine-Tuning (SFT): Cold-start training on teacher-generated trajectories for Adaptive and Active samples, teaching reasoning and tool usage
3. Reinforcement Learning (GRPO): Refines the agent usage policy with a behavior-aware reward that guards against reward hacking by discouraging unnecessary agent calls on Direct questions while encouraging exploration on Active ones
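
To make the idea of a behavior-aware reward concrete, here is a minimal sketch. The exact reward used in the paper is not given in this README, so the reward shape, the constants `CALL_PENALTY` and `EXPLORE_BONUS`, and the function names are all assumptions for illustration. The group-relative normalisation at the end is the standard GRPO advantage (reward minus group mean, divided by group standard deviation).

```python
from statistics import mean, pstdev

CALL_PENALTY = 0.2   # hypothetical value, not from the paper
EXPLORE_BONUS = 0.2  # hypothetical value, not from the paper


def behavior_aware_reward(correct, num_agent_calls, category):
    """Toy reward: answer correctness, adjusted by how agents were used.

    category is one of "direct", "adaptive", "active".
    """
    reward = 1.0 if correct else 0.0
    if category == "direct" and num_agent_calls > 0:
        # Discourage unnecessary agent calls when the initial frames suffice.
        reward -= CALL_PENALTY
    elif category == "active" and num_agent_calls > 0 and correct:
        # Encourage exploration, but only when it leads to a correct answer,
        # so calling agents cannot be farmed for reward on its own (assumption).
        reward += EXPLORE_BONUS
    return reward


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalise rewards within one group of rollouts."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of four rollouts for a single "direct" question.
rollouts = [(True, 0), (True, 2), (False, 0), (False, 1)]
rewards = [behavior_aware_reward(c, n, "direct") for c, n in rollouts]
advantages = group_advantages(rewards)
```

In this toy group, the rollout that answers correctly without calling an agent receives the largest advantage, which is the behavior the reward is meant to favor on Direct questions.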

Installation

# Clone the repository
git clone https://github.com/Jacob-Chow/VideoBrain.git
cd VideoBrain

# Install dependencies
pip install -r requirements.txt

Core Requirements

Python >= 3.10
PyTorch >= 2.4.0
vLLM == 0.11.1
transformers == 4.57.1
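
A quick way to confirm that an environment matches these pins is sketched below. It is not part of the repository; the helper names are hypothetical, and the version parsing is deliberately simple (it ignores local and pre-release suffixes such as `+cu121` or `rc1`).

```python
import sys
from importlib import metadata

# Pins copied from the list above; keys are pip distribution names.
REQUIREMENTS = {
    "torch": (">=", "2.4.0"),
    "vllm": ("==", "0.11.1"),
    "transformers": ("==", "4.57.1"),
}


def version_tuple(version):
    """Turn '2.4.0+cu121' into (2, 4, 0), keeping only leading numeric parts."""
    parts = []
    for piece in version.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def satisfies(installed, op, required):
    """Check an installed version string against a '>=' or '==' pin."""
    have, want = version_tuple(installed), version_tuple(required)
    return have >= want if op == ">=" else have == want


def check_environment():
    """Return {name: problem} for every requirement that is not met."""
    problems = {}
    if sys.version_info < (3, 10):
        problems["python"] = "need >= 3.10"
    for name, (op, required) in REQUIREMENTS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if not satisfies(installed, op, required):
            problems[name] = f"found {installed}, need {op} {required}"
    return problems
```

Calling `check_environment()` after `pip install -r requirements.txt` should return an empty dictionary when everything matches.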

Download Vision Model

Download the SigLIP2 model for CLIP-based frame retrieval:

Model: google/siglip-so400m-patch14-384
Place at: examples/agent/clip_module/SigLIP2_ViT/
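
One way to fetch the checkpoint is with `huggingface_hub`, as in the sketch below. It assumes that `huggingface_hub` is installed and that the code expects the snapshot files directly inside the target directory; neither is confirmed by this README, and `download_vision_model` is a hypothetical helper name.

```python
from pathlib import Path

MODEL_ID = "google/siglip-so400m-patch14-384"
TARGET_DIR = Path("examples/agent/clip_module/SigLIP2_ViT")


def download_vision_model(target_dir=TARGET_DIR):
    """Download the model snapshot into target_dir and return the local path."""
    # Imported lazily so this file can be loaded without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target_dir.mkdir(parents=True, exist_ok=True)
    return snapshot_download(repo_id=MODEL_ID, local_dir=str(target_dir))
```

Run `download_vision_model()` once from the repository root so that the relative path resolves to the location listed above.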

Inference

Run

python examples/agent/infer.py

Training

Training uses verl for distributed GRPO. The base model is Qwen3-VL-8B-Instruct.

Training Data

~8K samples curated from:

Split: ~1.6K for SFT, ~6.4K for RL, both trained for 1 epoch.

Infrastructure

8× H20 GPUs, ~472 GPU hours total.
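
As a rough guide to wall-clock time, the figures above can be converted as follows. This assumes all eight GPUs are busy for the whole run, which the README does not state, so treat the result as an approximation.

```python
NUM_GPUS = 8
TOTAL_GPU_HOURS = 472


def wall_clock_hours(total_gpu_hours, num_gpus):
    """GPU hours divided by GPU count, assuming every GPU runs the whole time."""
    return total_gpu_hours / num_gpus


estimate = wall_clock_hours(TOTAL_GPU_HOURS, NUM_GPUS)
print(f"~{estimate:.0f} hours wall-clock")  # prints "~59 hours wall-clock"
```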

Acknowledgements

This work builds upon verl, DeepEyes, FrameThinker, PyTorch, Transformers, vLLM, and Qwen-VL. All third-party code retains its original license.

Citation

@inproceedings{videobrain2026,
  title     = {VideoBrain: Learning Adaptive Frame Sampling for Video Understanding},
  author    = {Junbo Zou and Ziheng Huang and Shengjie Zhang and Liwen Zhang and Weining Shen},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year      = {2026},
}
