Symphony is a cognitively-inspired multi-agent system designed to tackle the challenges of Long-Form Video Understanding (LVU). By emulating human cognition patterns, Symphony decomposes complex LVU tasks into fine-grained subtasks and incorporates a deep reasoning collaboration mechanism enhanced by reflection.
We are currently organizing the code.
Long-form video understanding is critical for applications like sports commentary, intelligent surveillance, and film analysis. However, existing Multimodal Large Language Model (MLLM) agents struggle with high information density and extended temporal spans. Simple task decomposition and retrieval-based methods often lose key information or fail at complex reasoning.
Symphony addresses these limitations through:
- Cognitive Task Decomposition: Specialized agents for planning, grounding, perception, and language.
- Reflection-Enhanced Collaboration: A dynamic mechanism that evaluates and refines reasoning chains.
- VLM-Based Grounding: Precise localization of video segments using LLM query expansion and VLM relevance scoring.
- Cognitively-Inspired Architecture: Decouples reasoning capabilities into functional dimensions (Planning, Reflection, Grounding, Subtitle, Visual Perception) to reduce cognitive load on individual models.
- Dynamic Collaboration: Uses a reflection-enhanced dynamic reasoning framework (inspired by Actor-Critic) to iteratively refine solutions based on critique.
- Advanced Grounding: Leverages LLMs for query decomposition and VLMs for semantic relevance scoring, handling abstract concepts and multi-hop reasoning better than traditional CLIP-based retrieval.
- State-of-the-Art Performance: Achieves superior results on major LVU benchmarks including LVBench, LongVideoBench, VideoMME, and MLVU.
Symphony consists of five specialized agents orchestrated by a central planning mechanism:
- Planning Agent: Central coordinator responsible for global task planning, multi-agent scheduling, and answer generation.
- Reflection Agent: Evaluates the reasoning trajectory. If logical inconsistencies or insufficient evidence are detected, it generates critiques to initiate refinement.
- Grounding Agent: Identifies relevant video segments using either VLM-based scoring or CLIP-based retrieval, depending on query complexity.
- Subtitle Agent: Analyzes textual subtitles for entity recognition, sentiment analysis, and topic modeling.
- Visual Perception Agent: Conducts multi-dimensional visual perception using tools like frame inspector, global summary, and multi-segment analysis.
(Figure: The reflection-enhanced dynamic reasoning framework in Symphony.)
Symphony achieves state-of-the-art performance across four representative LVU datasets.
| Method | LVBench | LongVideoBench (Val) | Video MME (Long) | MLVU |
|---|---|---|---|---|
| Commercial VLMs | ||||
| Gemini-1.5-Pro | 33.1 | 64.0 | 67.4 | - |
| GPT-4o | 48.9 | 66.7 | 65.3 | 54.9 |
| VLMs | ||||
| Seed 1.6 VL | 58.1 | 66.1 | 68.4 | 65.3 |
| Agent Based | ||||
| DVD | 66.8 | 67.2 | 61.5 | - |
| VideoDeepResearch | 55.5 | 70.6 | 76.3 | 64.5 |
| Ours (Symphony) | 71.8 | 77.1 | 78.1 | 81.0 |
On the challenging LVBench, Symphony surpasses the prior state-of-the-art method by 5.0%.
(Note: Specific installation instructions should be derived from the actual code repository. The following is a general guideline based on the paper.)
-
Clone the repository:
git clone https://github.com/Haiyang0226/Symphony.git cd Symphony -
Install dependencies:
conda env create -f environment.yml conda activate sym
-
Configure API Keys: Ensure you have access to the required models (DeepSeek, Seed VL, etc.) and configure your API keys in the
config.yamlfile. -
Run Inference:
python run.py --question "Your question here" --video_path "path/to/video.mp4"
If you find Symphony useful in your research, please consider citing our paper:
@article{yan2026symphony,
title = {Symphony: A Cognitively-Inspired Multi-Agent System for Long-Video Understanding},
author = {Yan, Haiyang and Zhou, Hongyun and Xu, Peng and Feng, Xiaoxue and Liu, Mengyi},
journal = {arXiv preprint arXiv:2603.17307},
year = {2026},
eprint = {2603.17307},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}This work was done during an internship at Kuaishou Technology. We thank the contributors and the open-source community for their valuable tools and datasets.
This project is licensed under the MIT License - see the LICENSE file for details.