1AMD 2University of Rochester
This repository contains the official implementation of the VideoSeek agent.
VideoSeek is a long-horizon video agent that leverages video logic flow to actively seek answer-critical evidence instead of exhaustively parsing the full video.

VideoSeek (w/ subtitles) achieves the best performance while processing only about 1/300 as many frames as the second-best video agent.
Toolkit of the VideoSeek agent, including <overview>, <skim>, and <focus> tools:

<overview>: rapidly scans the entire video to build a coarse storyline and highlight promising intervals.<skim>: takes a quick glance at these candidate intervals at low cost to check whether query-relevant evidence is nearby.<focus>: zooms in on a fine-grained clip with dense inspection to obtain answer-critical observations.
- Create a conda virtual environment and activate it:
conda create -n videoseek python==3.13
conda activate videoseek- Clone the repository:
git clone https://github.com/jylins/videoseek
cd videoseek- Install the package:
pip install -e .Notes:
ffmpegis required.
Please execute videoseek-cli -h for help:
usage: videoseek [-h] --video_path VIDEO_PATH --user_query USER_QUERY [--subtitle_path SUBTITLE_PATH] [--output_dir OUTPUT_DIR] [--verbose] [--model_name MODEL_NAME] [--api_base API_BASE] [--api_key API_KEY] [--api_version API_VERSION]
[--reasoning_effort REASONING_EFFORT] [--seed SEED] [--temperature TEMPERATURE] [--max_tokens MAX_TOKENS] [--max_steps MAX_STEPS]
options:
-h, --help show this help message and exit
--video_path VIDEO_PATH
YouTube URL or local path to video.
--user_query USER_QUERY
Question/query towards this video.
--subtitle_path SUBTITLE_PATH
Local path to subtitle file.
--output_dir OUTPUT_DIR
Directory to write outputs (default: ./output/).
--verbose Print agent step logs.
--model_name MODEL_NAME
Model name.
--api_base API_BASE API base.
--api_key API_KEY API key.
--api_version API_VERSION
API version.
--reasoning_effort REASONING_EFFORT
Reasoning effort of the LLM.
--seed SEED Seed.
--temperature TEMPERATURE
Temperature.
--max_tokens MAX_TOKENS
Max output tokens of the LLM.
--max_steps MAX_STEPS
Max steps of the VideoSeek agent.
Run with a local video:
# Example from LVBench (qid: 3094)
videoseek-cli \
--video_path ./wgBlACG927Y.mp4 \
--subtitle_path ./wgBlACG927Y.srt \
--user_query "What animal statue is under the Dong Men Ding Food Street sign?\n(A) Hawk\n(B) Panda\n(C) Tiger\n(D) Lion\nPlease directly answer with the best option's letter from the given choices directly (A, B, C, or D)." \
--verboseOutputs are written under output/<VIDEO_ID>_<timestamp>/:
prediction.jsontrajectory.json
If you find our work useful, please consider citing:
@article{lin2026videoseek,
title={VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking},
author={Lin, Jingyang and Wu, Jialian and Liu, Jiang and Sun, Ximeng and Wang, Ze and Yu, Xiaodong and Luo, Jiebo and Liu, Zicheng and Barsoum, Emad},
journal={arXiv preprint arXiv:2603.20185},
year={2026}
}