Jisheng Dang1, Quan Wan1, Dewei Liu1, Ziyue Wang1, Bimei Wang, Pei Liu3, Hong Peng1, Bin Hu1, Tat-Seng Chua4
1Lanzhou University, 2The Hong Kong University of Science and Technology, 3The Hong Kong University of Science and Technology, 4National University of Singapore
TL;DR: We propose a Hierarchical Multi-Agent RAG framework that decomposes complex video queries, retrieves external knowledge, and aggregates answers for robust video understanding.
VideoRAG addresses the limitations of current LLMs in video-text alignment and long-horizon reasoning. By coordinating specialized agents hierarchically, our framework effectively fuses internal temporal understanding with external knowledge retrieval.
Our approach consists of three specialized agents working in synergy:
- Question Decomposition Agent: Reformulates complex/ambiguous queries into structured sub-tasks.
- Multi-source Reasoning Agents:
  - Web Agent: Retrieves external open-world knowledge.
  - Memory-based Agent: Captures long-range temporal dependencies within videos.
- Answer Aggregation Agent: Synthesizes results, resolves contradictions, and generates the final prediction.
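The control flow described above can be sketched as a small pipeline. This is a minimal illustration only, not the released implementation: every function and class name below (`decompose`, `web_agent`, `memory_agent`, `aggregate`, `answer_video_question`, `AgentAnswer`) is hypothetical, and each agent is replaced by a trivial stub where the real system would call a model, a search backend, or a video memory.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class AgentAnswer:
    source: str        # which reasoning agent produced this answer
    sub_question: str  # the sub-task it was asked to solve
    answer: str        # the agent's (here: placeholder) response


def decompose(question: str) -> List[str]:
    """Stand-in for the Question Decomposition Agent.

    Assumption: a real agent would use an LLM; this stub only splits on ' and '.
    """
    parts = [p.strip(" ?") for p in question.split(" and ")]
    return [p + "?" for p in parts if p]


def web_agent(sub_question: str) -> AgentAnswer:
    """Stand-in for the Web Agent (external open-world knowledge)."""
    return AgentAnswer("web", sub_question, f"[external knowledge for: {sub_question}]")


def memory_agent(sub_question: str) -> AgentAnswer:
    """Stand-in for the Memory-based Agent (long-range video context)."""
    return AgentAnswer("memory", sub_question, f"[video memory for: {sub_question}]")


def aggregate(answers: Sequence[AgentAnswer]) -> str:
    """Stand-in for the Answer Aggregation Agent.

    Assumption: a real agent would resolve contradictions; this stub concatenates.
    """
    return " ".join(f"({a.source}) {a.answer}" for a in answers)


def answer_video_question(
    question: str,
    reasoners: Sequence[Callable[[str], AgentAnswer]] = (web_agent, memory_agent),
) -> str:
    """Run the hierarchy: decompose, query every reasoning agent, then aggregate."""
    answers = [agent(sq) for sq in decompose(question) for agent in reasoners]
    return aggregate(answers)


if __name__ == "__main__":
    print(answer_video_question("Who is the speaker and what happens after the goal?"))
```

Passing the reasoning agents as a sequence of callables keeps the hierarchy explicit: further sources could be added without touching the decomposition or aggregation steps.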
2025.12.3: Initial release of the multi-agent framework.
We compare our method with state-of-the-art models across four challenging benchmarks. Despite having only 2B parameters, fewer than every baseline except FrozenBiLM (1.2B), our framework achieves the highest accuracy on all four benchmarks.
| Method | Size | Acc@MME | Acc@QA | Acc@MVB | Acc@MLVU |
|---|---|---|---|---|---|
| FrozenBiLM | 1.2B | 32.5 | 48.6 | 31.0 | - |
| Video-ChatGPT | 7B | 38.5 | 55.2 | 33.8 | 39.4 |
| Otter | 9B | 45.3 | 59.1 | 40.5 | 41.2 |
| mPLUG-Owl | 7B | 48.6 | 56.5 | 51.4 | 46.2 |
| MovieChat | 7B | 46.5 | 58.2 | 46.8 | 48.1 |
| LLaMA-VID | 7B | 42.1 | 57.8 | 41.3 | 43.5 |
| TinyLLaVA | 3B | 44.2 | 58.1 | 45.5 | 44.8 |
| LLaVA-Phi | 2.7B | 42.5 | 56.4 | 43.1 | 41.2 |
| ST-LLM | 7B | 50.1 | 59.6 | 51.9 | 49.8 |
| VILA-2.7B | 2.7B | 48.9 | 60.5 | 49.2 | 51.0 |
| Ours | 2B | 53.26 | 66.62 | 52.8 | 62.3 |
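To make the comparison concrete, the short script below computes, for each benchmark, the strongest baseline and the margin by which our 2B model exceeds it. The scores are copied from the table above; the variable and function names are illustrative only and not part of the repository.

```python
# Accuracy per benchmark, copied from the table above.
# None marks a score that is not reported ("-").
BENCHMARKS = ("MME", "QA", "MVB", "MLVU")

BASELINES = {
    "FrozenBiLM": (32.5, 48.6, 31.0, None),
    "Video-ChatGPT": (38.5, 55.2, 33.8, 39.4),
    "Otter": (45.3, 59.1, 40.5, 41.2),
    "mPLUG-Owl": (48.6, 56.5, 51.4, 46.2),
    "MovieChat": (46.5, 58.2, 46.8, 48.1),
    "LLaMA-VID": (42.1, 57.8, 41.3, 43.5),
    "TinyLLaVA": (44.2, 58.1, 45.5, 44.8),
    "LLaVA-Phi": (42.5, 56.4, 43.1, 41.2),
    "ST-LLM": (50.1, 59.6, 51.9, 49.8),
    "VILA-2.7B": (48.9, 60.5, 49.2, 51.0),
}

OURS = (53.26, 66.62, 52.8, 62.3)


def margins_over_best_baseline():
    """Return {benchmark: (best_baseline_name, ours_minus_best)}."""
    result = {}
    for i, name in enumerate(BENCHMARKS):
        # Skip baselines that do not report this benchmark.
        scored = [(s[i], m) for m, s in BASELINES.items() if s[i] is not None]
        best_score, best_method = max(scored)
        result[name] = (best_method, round(OURS[i] - best_score, 2))
    return result


if __name__ == "__main__":
    for bench, (method, margin) in margins_over_best_baseline().items():
        print(f"{bench}: +{margin} over {method}")
```

On these numbers the strongest baseline is ST-LLM on MME and MVB and VILA-2.7B on QA and MLVU, with margins of 3.16, 6.12, 0.9, and 11.3 points respectively.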
- Release the code
- Clone the repository:

      git clone https://github.com/hanzif1/videoRAG.git
      cd videoRAG