Authors: Mohamed Eltahir, Ali Habibullah, Yazan Alshoibi, Lama Ayash, Tanveer Hussain and Naeemullah Khan
The VideoAtlas Environment. The state space is a hierarchical grid stack where the root grid covers the entire video. Deeper levels provide finer temporal resolution. Workers explore assigned subtrees concurrently while a Master steers exploration via uncertainty analysis.
- Lossless Hierarchical Representation: VideoAtlas renders any video as a navigable K×K image grid — no captions, no offline preprocessing, no context ceiling. Any frame is reachable in O(log T) steps.
- Video-RLM: A parallel Master-Worker architecture extending Recursive Language Models to video. Workers explore grid subtrees concurrently and accumulate evidence in a lossless Visual Scratchpad, while a Master steers exploration via uncertainty analysis.
- Logarithmic Compute Scaling: As video duration grows 600×, compute grows sub-linearly. Video-RLM uses up to 9.7× fewer tokens than linear-scaling caption baselines, further amplified by a 30–60% multimodal cache hit rate.
- Environment Budgeting: Bounding the maximum exploration depth provides a principled compute-accuracy hyperparameter, directly controlling temporal resolution rather than frame quantity.
- Emergent Adaptive Compute: Scattered answers (3+ temporal positions) consume 40% more tokens than localized ones — without any explicit supervision.
- VLM-Agnostic: Works with any backbone (Qwen3.5, Gemini-3-Flash, etc.) without modifying the environment or coordination protocol.
- [March 2026] VideoAtlas & Video-RLM Initial Release! 🎉
Video-RLM overview. In each round, the Master examines the root grid (with dead zones masked) and the scratchpad, then assigns promising cells to Workers. Workers autonomously explore their regions. After all Workers return, memories are updated and the Master performs uncertainty analysis.
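The round structure described above can be sketched as a small loop. This is a toy illustration and not the repository's implementation: `run_rounds`, `explore`, and `is_confident` are hypothetical names, the Master's cell ranking is replaced by "take the next unvisited cells", and the VLM calls are stubbed out as plain callables.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Set, Tuple


def run_rounds(
    num_cells: int,
    explore: Callable[[int], List[str]],        # hypothetical Worker: cell -> findings
    is_confident: Callable[[List[str]], bool],  # hypothetical uncertainty check
    num_workers: int = 4,
    max_rounds: int = 8,
) -> Tuple[List[str], Set[int]]:
    scratchpad: List[str] = []   # positive memory (evidence)
    dead_zones: Set[int] = set() # negative memory (explored, nothing found)
    visited: Set[int] = set()

    for _ in range(max_rounds):
        # Master step: assign unvisited cells (a real Master would rank them).
        assigned = [c for c in range(num_cells) if c not in visited][:num_workers]
        if not assigned:
            break
        # Workers explore their cells concurrently.
        with ThreadPoolExecutor(max_workers=num_workers) as pool:
            results = list(pool.map(explore, assigned))
        # After all Workers return, update both memories.
        for cell, findings in zip(assigned, results):
            visited.add(cell)
            if findings:
                scratchpad.extend(findings)
            else:
                dead_zones.add(cell)
        # Master step: stop once the evidence is judged sufficient.
        if is_confident(scratchpad):
            break
    return scratchpad, dead_zones
```

With a stub Worker that only finds evidence in cell 5, the loop stops after the second round and leaves the seven empty cells it visited marked as dead zones.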
At the core of VideoAtlas is a recursive K×K image grid (default K=8, yielding 64 cells). Given a video of duration T seconds:
- The root grid S₀ renders the full video as a contact sheet — a bird's-eye view at a glance.
- EXPAND recursively zooms into any cell; each level refines temporal resolution by a factor of K².
- At depth d, temporal resolution is Δt_d = T / K^(2(d+1)), reaching sub-second precision for 10-hour videos (with K=8, Δt_2 ≈ 0.14 s).
- Sub-grids are generated on-the-fly — no offline decoding, no RAM bottleneck.
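As a quick check of the resolution formula, the following sketch evaluates T / K^(2(d+1)) for a 10-hour video with the default K=8. The function name `temporal_resolution` is illustrative and not part of the repository.

```python
def temporal_resolution(duration_s: float, k: int, depth: int) -> float:
    """Seconds covered by one cell at the given depth: T / K^(2(d+1))."""
    return duration_s / (k ** (2 * (depth + 1)))


if __name__ == "__main__":
    ten_hours_s = 10 * 3600
    for depth in range(3):
        span = temporal_resolution(ten_hours_s, 8, depth)
        print(f"depth {depth}: {span:.3f} s per cell")
```

Under these assumptions a root cell spans 562.5 s, a depth-1 cell about 8.8 s, and a depth-2 cell about 0.14 s, so two EXPAND steps are enough to go below one second.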
| Category | Actions |
|---|---|
| Navigation | EXPAND(cell), BACKTRACK(), MARK_PROMISING(cells) |
| Perception | ZOOM(cell), INVESTIGATE(cell, before/after) |
| Commit | ADD_TO_SCRATCHPAD(items), FINISHED() |
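To make the navigation actions concrete, here is a minimal sketch of a grid stack that tracks which time interval the current grid covers. It assumes cells are indexed 0 to K²−1 in temporal order; the class `GridStack` and its methods are hypothetical and only mirror EXPAND and BACKTRACK, not the perception or commit actions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Interval = Tuple[float, float]


@dataclass
class GridStack:
    duration_s: float
    k: int = 8
    # Each entry is the (start, end) interval rendered by one grid level.
    stack: List[Interval] = field(default_factory=list)

    def __post_init__(self) -> None:
        self.stack.append((0.0, self.duration_s))  # root grid covers the whole video

    def cell_interval(self, cell: int) -> Interval:
        """Time interval of one cell in the grid currently on top of the stack."""
        if not 0 <= cell < self.k * self.k:
            raise ValueError(f"cell must be in [0, {self.k * self.k})")
        start, end = self.stack[-1]
        step = (end - start) / (self.k * self.k)
        return (start + cell * step, start + (cell + 1) * step)

    def expand(self, cell: int) -> Interval:
        """EXPAND: push the cell's interval as a new, finer grid."""
        interval = self.cell_interval(cell)
        self.stack.append(interval)
        return interval

    def backtrack(self) -> Interval:
        """BACKTRACK: return to the parent grid (the root is never popped)."""
        if len(self.stack) > 1:
            self.stack.pop()
        return self.stack[-1]
```

For a 6400-second video, `GridStack(6400.0).expand(1)` returns `(100.0, 200.0)`, and expanding cell 0 of that sub-grid narrows the view to `(100.0, 101.5625)`.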
- Positive Memory (M⁺, Visual Scratchpad): Lossless multimodal evidence tuples — image patch, subtitle, timestamp, confidence, description. Rendered as a labeled grid for cross-referencing.
- Negative Memory (M⁻, Dead Zones): Explored intervals with no findings are blacked out in the grid, physically preventing hallucination over already-explored regions.
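The two memories could be represented along the following lines. This is a sketch under stated assumptions: the `Evidence` fields follow the tuple listed above, with an image path standing in for the image patch, and `merge_intervals` and `is_dead` are hypothetical helpers showing how explored intervals might be merged before deciding whether a cell should be blacked out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Interval = Tuple[float, float]


@dataclass
class Evidence:
    """One positive-memory item (field names are illustrative)."""
    timestamp_s: float
    description: str
    confidence: float
    subtitle: Optional[str] = None
    image_path: Optional[str] = None  # stands in for the image patch


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching explored intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def is_dead(cell: Interval, dead_zones: List[Interval]) -> bool:
    """True if the cell lies entirely inside explored-and-empty time."""
    return any(s <= cell[0] and cell[1] <= e for s, e in merge_intervals(dead_zones))
```

Only cells that are fully covered are treated as dead here; a partially explored cell stays visible so its unexplored part can still be assigned.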
Logarithmic compute scaling with video duration. Compute over Video-RLM's hierarchical grid grows sub-linearly, as O(log T), requiring up to 9.7× fewer tokens than linear-scaling baselines at 10-hour duration.
| Method | Active Params | LVB | VMME |
|---|---|---|---|
| GPT-4o | Prop. | 66.7 | 65.3 |
| GPT-5 | Prop. | 72.6 | 81.8 |
| Gemini-3-Flash | Prop. | 74.5 | — |
| Claude-Opus-4.5 | Prop. | 67.2 | 77.6 |
| InternVL3.5-241B | 28B | 67.1 | 72.9 |
| GLM-4.5V-106B | 12B | 76.7 | 74.6 |
| MR.Video (Gemini+GPT-4o) | Prop. | 61.6 | 61.8 |
| VideoARM (GPT-o3+GPT-4o) | Prop. | 76.4 | 81.2 |
| Qwen3.5 (uniform, 160 fr.) | 3B | 61.5 | 63.8 |
| LLM over Captions | Prop.+3B | 62.4 | 64.2 |
| Video-RLM (Qwen3.5-35B) | 3B | 52.5 | 50.4 |
| Video-RLM (Gemini-3-Flash) | Prop. | 72.0 | 76.2 |
| Method | LVB-10hr Acc. | Δ | Tokens | VMME-10hr Acc. | Δ | Tokens |
|---|---|---|---|---|---|---|
| Qwen3.5 (uniform) | 49.2 | -12.3 | 212K | 50.6 | -13.2 | 232K |
| LLM over Captions | 62.1 | -0.3 | 207K‡ | 36.0 | -28.2 | 235K‡ |
| Video-RLM (Qwen) | 47.7 | -4.8 | 148K† | 49.7 | -0.7 | 403K† |
| Video-RLM (Gemini) | 70.1 | -1.9 | 307K | 69.1 | -7.1 | 390K |
Δ: Accuracy change relative to the same method's standard-benchmark score (negative means a drop). †Effective tokens after vLLM multimodal prefix cache (avg. 36–42% hit rate). ‡QA tokens only, excludes GPT-4o captioning cost.
Left: Environment budgeting — accuracy and tokens vs. max depth on LVB-10hr. Green marks the optimal depth (first sub-second layer). Right: Adaptive compute — average tokens scale with evidence spread without ground-truth supervision.
Worker scaling: wall-clock time (normalized) vs. number of workers on LVB-10hr. Accuracy (annotated) remains stable across all configurations while throughput improves 2.25× from 1 to 7 workers.
- Python 3.8+
- Gemini API key (Google AI Studio) or Vertex AI access (service account JSON)
- ffmpeg installed and on PATH (for subtitle extraction)
# Clone the repository
git clone https://github.com/mohammad2012191/VideoAtlas.git
cd VideoAtlas
# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate # Linux / macOS
.venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
Edit config.py to select your backend and model:
# Use Google AI (Gemini API key)
BACKEND = "google"
GOOGLE_API_KEY = "your-api-key-here"
# Or use Vertex AI (service account JSON)
BACKEND = "vertex"
SERVICE_ACCOUNT_FILE = "vertex_key.json"
Key parameters in config.py:
| Parameter | Description | Default |
|---|---|---|
| GRID_K | Grid size (K×K cells) | 8 |
| EXPLORE_MODE | "auto", "dfs", or "bfs" | "dfs" |
| NUM_CALLS | Total VLM call budget | 512 |
| BUDGET_PER_CELL | Steps per worker per cell | 8 |
| DFS_MAX_DEPTH | Max exploration depth ("auto" = stop at <1s span) | "auto" |
| NUM_WORKERS | Number of parallel workers | 4 |
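The sketch below mirrors these parameters in a dataclass, adds basic validation, and shows one plausible reading of DFS_MAX_DEPTH = "auto": stop at the first depth whose cells span less than one second. The class `ExploreConfig` and both methods are hypothetical, and the actual config.py may resolve "auto" differently.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class ExploreConfig:
    GRID_K: int = 8
    EXPLORE_MODE: str = "dfs"
    NUM_CALLS: int = 512
    BUDGET_PER_CELL: int = 8
    DFS_MAX_DEPTH: Union[int, str] = "auto"
    NUM_WORKERS: int = 4

    def validate(self) -> None:
        if self.EXPLORE_MODE not in {"auto", "dfs", "bfs"}:
            raise ValueError(f"unknown EXPLORE_MODE: {self.EXPLORE_MODE!r}")
        if self.GRID_K < 2:
            raise ValueError("GRID_K must be >= 2")
        for name in ("NUM_CALLS", "BUDGET_PER_CELL", "NUM_WORKERS"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be >= 1")
        if self.DFS_MAX_DEPTH != "auto" and int(self.DFS_MAX_DEPTH) < 0:
            raise ValueError("DFS_MAX_DEPTH must be 'auto' or >= 0")

    def resolve_max_depth(self, duration_s: float) -> int:
        """Assumed meaning of 'auto': first depth with a cell span under 1 s."""
        self.validate()
        if self.DFS_MAX_DEPTH != "auto":
            return int(self.DFS_MAX_DEPTH)
        cells = self.GRID_K ** 2
        depth, span = 0, duration_s / cells
        while span >= 1.0:
            depth += 1
            span /= cells
        return depth
```

With the defaults, a 10-hour video resolves to a maximum depth of 2, while a smaller grid such as GRID_K=4 needs depth 3 to reach the same sub-second span.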
python main.py
You will be prompted for:
- Video file path — any format supported by OpenCV/ffmpeg
- Subtitle JSON path — optional, leave blank to skip
- Question — supports both multiple-choice and open-ended questions
- Answer options — enter one per line, or press Enter immediately to use open-ended mode
Multiple-choice example:
Video file path: videos/lecture.mp4
Subtitle JSON path: subtitles/lecture_eng.json
Question: What topic is discussed at the beginning?
Option 0: Machine learning
Option 1: Computer vision
Option 2: Natural language processing
Option 3: ← press Enter to finish
Open-ended example:
Video file path: videos/documentary.mp4
Subtitle JSON path: ← press Enter to skip
Question: Summarize the main events that occur in the video.
Option 0: ← press Enter immediately for open-ended mode
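For batch runs it may be convenient to prepare the prompt answers ahead of time. The helper below only builds the text that would be typed at the prompts, in the order shown in the examples above. `build_prompt_answers` is a hypothetical name, and piping its output to main.py assumes the script reads its prompts from standard input, which the source does not confirm.

```python
from typing import List, Optional


def build_prompt_answers(
    video_path: str,
    question: str,
    subtitle_json: Optional[str] = None,
    options: Optional[List[str]] = None,
) -> str:
    """Return the newline-separated answers for the interactive prompts."""
    lines = [video_path, subtitle_json or "", question]
    lines.extend(options or [])
    lines.append("")  # blank line: finish the options, or select open-ended mode
    return "\n".join(lines) + "\n"


# Possible usage, assuming main.py reads from stdin:
#   import subprocess, sys
#   subprocess.run([sys.executable, "main.py"], input=payload, text=True)
```

Passing no options yields a single blank line after the question, which corresponds to pressing Enter immediately at "Option 0".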
Extract embedded subtitle tracks from a video and save as JSON (compatible with the pipeline):
# Auto-detect and extract subtitles
python extract_subtitles.py --video myvideo.mp4
# List available subtitle tracks
python extract_subtitles.py --video myvideo.mp4 --list
# Extract a specific track
python extract_subtitles.py --video myvideo.mp4 --track 1 --output subs.json
Subtitle JSON files are saved to the subtitles/ folder by default.
Generate a replay video from a completed run's debug images:
# Interactive mode
python visualize_run.py
# Direct mode
python visualize_run.py --run results/run_20260310_230348_images --fps 1.5
# With explicit result JSON
python visualize_run.py \
--run results/run_20260310_230348_images \
--result results/result_20260310_230348.json \
--fps 1.0
The replay video shows:
- Left panel: The agent's frame-by-frame exploration in true run order
- Right panel: Live scratchpad with evidence thumbnails and per-item reasoning (description, confidence, subtitle)
- Header: Frame category, progress counter, and predicted answer overlay
- End freeze: Last frame held for 4 seconds for easy reading
Each run produces the following in results/:
results/
├── run_<timestamp>.log # Full verbose log
├── result_<timestamp>.json # Predicted answer + metrics
└── run_<timestamp>_images/
├── 0001_global_grid.jpg # Root grid (master overview)
├── 0002_W1_C0_step0.jpg # Worker 1, Cell 0, Step 0
├── ...
├── 0051_scratchpad_5items.jpg # Evidence grid image
├── 0051_scratchpad_5items_reasoning.json # Per-item reasoning sidecar
└── replay.mp4 # Auto-generated replay video
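When post-processing a run, the worker frame names can be parsed back into their components. The pattern below is inferred only from the example name `0002_W1_C0_step0.jpg` in the listing above; other files in the folder (the global grid, scratchpad images, sidecar JSON) are deliberately not matched, and `parse_worker_frame` is a hypothetical helper.

```python
import re
from typing import Dict, Optional

# <sequence>_W<worker>_C<cell>_step<step>.jpg, as in the listing above.
WORKER_FRAME = re.compile(
    r"^(?P<seq>\d+)_W(?P<worker>\d+)_C(?P<cell>\d+)_step(?P<step>\d+)\.jpg$"
)


def parse_worker_frame(filename: str) -> Optional[Dict[str, int]]:
    """Return seq/worker/cell/step as integers, or None if the name differs."""
    match = WORKER_FRAME.match(filename)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}
```

Grouping the parsed entries by `worker` and sorting by `seq` gives each worker's exploration path in run order.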
{
"question": "How many yellow cards were shown?",
"candidates": ["3", "5", "8", "9"],
"predicted_choice": 2,
"predicted_answer": "8",
"reasoning": "Evidence [A] at 803.8s shows a referee holding a yellow card...",
"metrics": {
"vlm_calls": 24,
"total_tokens": 185400,
"frames_decoded": 512,
"coverage_pct": 38.5,
"wall_time_s": 513.8,
"mode_used": "DFS"
}
}
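A result file with this shape can be summarized with a few lines of Python. The sketch assumes the keys shown in the example above are present; how open-ended runs populate `candidates` and `predicted_choice` is not documented here, so the consistency check is skipped when either is missing. `summarize_result` is a hypothetical helper.

```python
import json
from typing import Any, Dict


def summarize_result(result: Dict[str, Any]) -> Dict[str, Any]:
    """Derive per-call averages and sanity-check the predicted choice."""
    metrics = result["metrics"]
    candidates = result.get("candidates") or []
    choice = result.get("predicted_choice")
    if candidates and choice is not None:
        consistent = candidates[choice] == result["predicted_answer"]
    else:
        consistent = True  # nothing to cross-check
    return {
        "answer": result["predicted_answer"],
        "choice_matches_answer": consistent,
        "tokens_per_call": metrics["total_tokens"] / metrics["vlm_calls"],
        "seconds_per_call": metrics["wall_time_s"] / metrics["vlm_calls"],
    }


def load_and_summarize(path: str) -> Dict[str, Any]:
    with open(path, "r", encoding="utf-8") as handle:
        return summarize_result(json.load(handle))
```

For the example above, this reports 7725 tokens per VLM call and confirms that `predicted_choice` 2 points at the answer "8".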
If you use VideoAtlas in your research, please cite:
@misc{eltahir2026videoatlasnavigatinglongformvideo,
title={VideoAtlas: Navigating Long-Form Video in Logarithmic Compute},
author={Mohamed Eltahir and Ali Habibullah and Yazan Alshoibi and Lama Ayash and Tanveer Hussain and Naeemullah Khan},
year={2026},
eprint={2603.17948},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.17948},
}