marinero4972/Open-o3-Video


Open-o3 Video

by Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang and Zhuochen Wang

[📖 Paper] | [🌟 Project Page] | [🎥 Introduction] | [🤗 Model] | [🤗 Data]

TL;DR: Open-o3 Video integrates explicit spatio-temporal evidence (key timestamps and bounding boxes) into video reasoning through the curated STGR datasets and a two-stage SFT–RL training strategy, achieving state-of-the-art results on V-STAR and delivering verifiable, reliable reasoning for video understanding.

Abstract: Most video reasoning models only generate textual reasoning traces without indicating when and where key evidence appears. Recent models such as OpenAI-o3 have sparked wide interest in evidence-centered reasoning for images, yet extending this ability to videos is more challenging, as it requires joint temporal tracking and spatial localization across dynamic scenes. We introduce Open-o3 Video, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning, and carefully collect training data and design training strategies to address these challenges. The model highlights key timestamps, objects, and bounding boxes alongside its answers, allowing reasoning to be grounded in concrete visual observations. To enable this functionality, we first curate and build two high-quality datasets, STGR-CoT-30k for SFT and STGR-RL-36k for RL, with carefully constructed temporal and spatial annotations, since most existing datasets offer either temporal spans for videos or spatial boxes on images, lacking unified spatio-temporal supervision and reasoning traces. Then, we adopt a cold-start reinforcement learning strategy with multiple specially designed rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision. On the V-STAR benchmark, Open-o3 Video achieves state-of-the-art performance, raising mAM by 14.4% and mLGM by 24.2% over the Qwen2.5-VL baseline. Consistent improvements are also observed on a broad range of video understanding benchmarks, including VideoMME, WorldSense, VideoMMMU, and TVGBench. Beyond accuracy, the reasoning traces produced by Open-o3 Video also provide valuable signals for test-time scaling, enabling confidence-aware verification and improving answer reliability.

Open-o3 Video Model:

Stage 1: Cold-start initialization on STGR-CoT-30k equips the model with basic grounded reasoning.

Stage 2: Reinforcement learning with Group Sequence Policy Optimization stabilizes long-horizon optimization. We propose adaptive temporal proximity and temporal gating in the thinking reward design.
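The exact reward formulas are not given in this README, so the following is only a rough sketch of what a temporal-proximity score with a temporal gate could look like. The Gaussian form, the linear `annealed_sigma` schedule, the `gate` threshold, and all function names are assumptions made for illustration, not the repository's implementation.

```python
import math


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def temporal_proximity(t_pred, t_gt, sigma):
    """Assumed Gaussian score in (0, 1]; equals 1.0 when the timestamps match."""
    return math.exp(-((t_pred - t_gt) ** 2) / (2.0 * sigma ** 2))


def annealed_sigma(step, total_steps, sigma_start=4.0, sigma_end=1.0):
    """Hypothetical linear schedule: tolerant early in training, stricter later."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return sigma_start + frac * (sigma_end - sigma_start)


def gated_evidence_reward(t_pred, box_pred, t_gt, box_gt, sigma, gate=2.0):
    """Return (temporal, spatial) scores.

    The spatial score is only counted when the predicted timestamp lies
    within `gate` seconds of the ground truth (the assumed "temporal gate").
    """
    r_time = temporal_proximity(t_pred, t_gt, sigma)
    within_gate = abs(t_pred - t_gt) <= gate
    r_space = box_iou(box_pred, box_gt) if within_gate else 0.0
    return r_time, r_space
```

Under these assumptions, a box predicted on the wrong frame earns no spatial credit, which keeps the spatial term from rewarding boxes that match by coincidence at an unrelated timestamp.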

Updates

  • 2025.11, Improved Results with Qwen3-VL-8B: We have trained Open-o3 Video using Qwen3-VL-8B as the base model. On the V-STAR benchmark, our latest model achieves mAM: 35.5% and mLGM: 49.0%. This result establishes a new state of the art in spatio-temporal grounded video reasoning.

  • 2025.11, Media Coverage: Our work was recently featured by 量子位(QbitAI). 👉 Read the article here

Quick Start

Environment setup:

git clone https://github.com/marinero4972/Open-o3-Video 
cd Open-o3-Video 

conda create -n open-o3-video python=3.11
conda activate open-o3-video
bash setup.sh

Data Preparation:

To provide unified spatio-temporal supervision for grounded video reasoning, we build two datasets: STGR-CoT-30k for supervised fine-tuning and STGR-RL-36k for reinforcement learning.

JSON data download link and video source data download instructions: STGR

The overall data structure should be:

DATA_ROOT
├── json_data
│   ├── STGR-RL.json
│   └── STGR-SFT.json
└── videos
    ├── gqa
    ├── stgr
    │   ├── plm
    │   └── temporal_grounding
    ├── timerft
    ├── treevgr
    ├── tvg_r1
    ├── videoespresso
    └── videor1

Set DATA_ROOT in src/r1-v/configs/data_root.py to match your data path.
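Before launching training, it can help to confirm that the directory layout above is in place. The snippet below is a small stand-alone check and is not part of the repository; the `EXPECTED` list simply mirrors the tree shown above, and `missing_entries` is a hypothetical helper name.

```python
from pathlib import Path

# Relative paths copied from the layout described in this README.
EXPECTED = [
    "json_data/STGR-RL.json",
    "json_data/STGR-SFT.json",
    "videos/gqa",
    "videos/stgr/plm",
    "videos/stgr/temporal_grounding",
    "videos/timerft",
    "videos/treevgr",
    "videos/tvg_r1",
    "videos/videoespresso",
    "videos/videor1",
]


def missing_entries(data_root):
    """Return the expected relative paths that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

Calling `missing_entries("/path/to/DATA_ROOT")` returns an empty list when every expected file and folder is present.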

Training:

# cold start initialization
bash ./src/scripts/run_sft_video.sh

# reinforcement learning with GSPO
bash ./src/scripts/run_grpo_video.sh
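For readers unfamiliar with GSPO, the sketch below illustrates its central idea as generally described: the importance ratio is computed once per sequence (length-normalized) rather than per token, then clipped and weighted by a group-normalized advantage. This is a plain-Python illustration, not the code that the script above runs; the function names, the `eps` value, and the use of the population standard deviation are assumptions chosen for the example.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards):
    """Normalize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sd = pstdev(rewards)
    if sd == 0:
        return [0.0] * len(rewards)
    return [(r - mu) / sd for r in rewards]


def sequence_ratio(logp_new, logp_old):
    """Length-normalized sequence-level ratio.

    logp_new / logp_old are per-token log-probabilities of the same response
    under the current and the old policy.
    """
    n = len(logp_new)
    return math.exp(sum(a - b for a, b in zip(logp_new, logp_old)) / n)


def gspo_objective(logps_new, logps_old, rewards, eps=0.2):
    """Clipped surrogate objective averaged over the group (to be maximized)."""
    advs = group_advantages(rewards)
    terms = []
    for lp_new, lp_old, adv in zip(logps_new, logps_old, advs):
        s = sequence_ratio(lp_new, lp_old)
        s_clipped = min(max(s, 1.0 - eps), 1.0 + eps)
        terms.append(min(s * adv, s_clipped * adv))
    return sum(terms) / len(terms)
```

Because the ratio is a single scalar per response, one outlier token cannot dominate the update on its own, which is the usual motivation given for sequence-level clipping on long outputs.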

Evaluation:

Evaluate on benchmarks:

cd eval
bash ./scripts/eval_all.sh

Inference on examples:

cd eval
python ./inference_example.py

📊 Main Results

Performance on V-STAR benchmark

Performance on the V-STAR benchmark, which evaluates spatio-temporal reasoning across three dimensions. Chain1 denotes what–when–where, while Chain2 corresponds to what–where–when. Open-o3 Video sets a new state-of-the-art with mAM improved by +14.4%, and mLGM by +24.2%, surpassing GPT-4o and Gemini-2-Flash.

🎬 Demos

Each pair below shows the spatio-temporal grounded reasoning visualization of different videos.
Our model provides textual reasoning while highlighting when (temporal evidence) and where (spatial evidence) the key events occur, offering explicit, interpretable visual traces that ground the reasoning process.


License

This project is licensed under the Apache-2.0 License.

Citation

If you use our work or the implementation in this repo, or find them helpful, please consider citing:

@article{meng2025openo3,
  title={Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence}, 
  author={Meng, Jiahao and Li, Xiangtai and Wang, Haochen and Tan, Yue and Zhang, Tao and Kong, Lingdong and Tong, Yunhai and Wang, Anran and Teng, Zhiyang and Wang, Yujing and Wang, Zhuochen},
  journal={arXiv preprint arXiv:2510.20579},
  year={2025}
}

Acknowledgements

We sincerely thank the following projects for their contributions to this work:

About

Official implementation of "Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence"
