HERMES

[ACL 2026] KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

Project Page | Paper | HF Paper

🔥 News

  • [2026.04.06] HERMES is accepted to ACL 2026 Main 🎉
  • [2026.03.23] Full code released!
  • [2025.01.23] HERMES reached #3 Paper of the day on Hugging Face Daily Papers!
  • [2025.01.21] HERMES is available on arXiv.

🛠️ Installation

For LLaVA model inference:

conda create -n hermes-llava python=3.12 -y
conda activate hermes-llava
pip install -r requirements_llava.txt
pip install flash-attn --no-build-isolation

For Qwen2.5-VL model inference:

conda create -n hermes-qwen python=3.12 -y
conda activate hermes-qwen
pip install -r requirements_qwen.txt
pip install flash-attn --no-build-isolation
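
As an optional sanity check, you can confirm that PyTorch and FlashAttention import correctly in the freshly created environment (these are the standard import names, nothing repo-specific):

python -c "import torch, flash_attn; print(torch.__version__, torch.cuda.is_available())"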

📦 Preparation

Model Preparation

Create a models directory and download the model weights from HuggingFace:

mkdir models

We support the following models (choose one or more):

Model Family Model HuggingFace Link
LLaVA-OneVision llava-onevision-qwen2-0.5b-ov-hf llava-hf/llava-onevision-qwen2-0.5b-ov-hf
LLaVA-OneVision llava-onevision-qwen2-7b-ov-hf llava-hf/llava-onevision-qwen2-7b-ov-hf
LLaVA-OneVision llava-onevision-qwen2-72b-ov-hf llava-hf/llava-onevision-qwen2-72b-ov-hf
Qwen2.5-VL Qwen2.5-VL-3B-Instruct Qwen/Qwen2.5-VL-3B-Instruct
Qwen2.5-VL Qwen2.5-VL-7B-Instruct Qwen/Qwen2.5-VL-7B-Instruct
Qwen2.5-VL Qwen2.5-VL-32B-Instruct Qwen/Qwen2.5-VL-32B-Instruct
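
As one way to do this, the weights can be fetched with the Hugging Face CLI (a sketch; it requires huggingface_hub with its CLI installed, and any other download method works as long as the folder names under models/ match the table above):

huggingface-cli download llava-hf/llava-onevision-qwen2-7b-ov-hf --local-dir models/llava-onevision-qwen2-7b-ov-hf
huggingface-cli download Qwen/Qwen2.5-VL-7B-Instruct --local-dir models/Qwen2.5-VL-7B-Instruct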

Data Preparation

Download the benchmark videos from their official sources and place them according to the paths specified in the annotation files:

Streaming Benchmarks:

Benchmark Video Path Official Source
StreamingBench /data/streamingbench/videos/ 🤗 StreamingBench
OVO-Bench /data/ovobench/videos/ 🤗 OVO-Bench
RVS-Ego /data/rvs/ego/videos/ 🤗 RVS
RVS-Movie /data/rvs/movie/videos/ 🤗 RVS

Offline Benchmarks:

Benchmark Video Path Official Source
VideoMME /data/videomme/videos/ 🤗 VideoMME
MVBench /data/mvbench/videos/ 🤗 MVBench
EgoSchema /data/egoschema/videos/ 🤗 EgoSchema

The annotation JSON files contain the same information as the official releases, with minor formatting adjustments to fit our codebase.

After preparation, the project structure should look like this:

HERMES/
├── asset/
│   └── logo.png
├── data/
│   ├── egoschema/
│   │   ├── videos/
│   │   └── egoschema.json
│   ├── mvbench/
│   │   ├── videos/
│   │   └── mvbench.json
│   ├── ovobench/
│   │   ├── videos/
│   │   └── ovobench_realtime_backeward.json
│   ├── rvs/
│   │   ├── ego/
│   │   │   ├── videos/
│   │   │   └── ego4d_oe.json
│   │   └── movie/
│   │       ├── videos/
│   │       └── movienet_oe.json
│   ├── streamingbench/
│   │   ├── videos/
│   │   └── streamingbench_realtime.json
│   └── videomme/
│       ├── videos/
│       └── videomme.json
├── eval/
│   ├── eval_multiple_choice.py
│   └── eval_open_ended.py
├── inference/
│   ├── abstract_hermes.py
│   ├── llavaov_hermes.py
│   ├── qwenvl_hermes.py
│   ├── reindex_1d.py
│   └── reindex_3d.py
├── models/
│   ├── llava-onevision-qwen2-0.5b-ov-hf/
│   ├── llava-onevision-qwen2-7b-ov-hf/
│   ├── llava-onevision-qwen2-72b-ov-hf/
│   ├── Qwen2.5-VL-3B-Instruct/
│   ├── Qwen2.5-VL-7B-Instruct/
│   └── Qwen2.5-VL-32B-Instruct/
├── scripts/
│   └── run_infer.sh
├── video_qa/
│   ├── base.py
│   ├── hermes_vqa.py
│   └── run_infer.py
├── LICENSE
├── README.md
├── requirements_llava.txt
└── requirements_qwen.txt
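
A quick way to spot-check that the videos and annotation files landed where the code expects them (the paths follow the tables above; adjust to whichever benchmarks you downloaded):

find data -maxdepth 3 -name "*.json"        # annotation files
ls data/streamingbench/videos | head -n 3   # a few videos from one benchmark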

🚀 Inference

Simply run the inference script:

bash scripts/run_infer.sh

Here is the content of scripts/run_infer.sh:

export PYTHONPATH=$(cd "$(dirname "$0")/.." && pwd):$PYTHONPATH

num_chunks=8
model=llava_ov_7b
dataset=streamingbench

python video_qa/run_infer.py \
    --num_chunks $num_chunks \
    --model ${model} \
    --dataset ${dataset} \
    --sample_fps 0.5 \
    --kv_size 6000

Arguments:

Argument Description
model Model to use. Options: llava_ov_0.5b, llava_ov_7b, llava_ov_72b, qwen2.5_vl_3b, qwen2.5_vl_7b, qwen2.5_vl_32b
dataset Benchmark dataset. Options: videomme, mvbench, egoschema, rvs_ego, rvs_movie, ovobench, streamingbench
num_chunks Number of parallel processes for evaluation, typically set to the number of GPUs
sample_fps Frame sampling rate (frames per second) from the video
kv_size Maximum KV cache size for HERMES hierarchical memory management
only_eval If set, skip inference and only run evaluation on existing results
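
For example, to run Qwen2.5-VL-7B on VideoMME across 4 GPUs from the repository root (PYTHONPATH is set the same way scripts/run_infer.sh does):

PYTHONPATH=$(pwd) python video_qa/run_infer.py \
    --num_chunks 4 \
    --model qwen2.5_vl_7b \
    --dataset videomme \
    --sample_fps 0.5 \
    --kv_size 6000

Re-running the same command with --only_eval appended should skip inference and only score the existing predictions; we assume here that only_eval is a plain boolean switch, as the table above suggests.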

📊 Evaluation

The evaluation scripts compute metrics on the inference results:

  • Multiple-choice benchmarks (VideoMME, MVBench, EgoSchema, OVO-Bench, StreamingBench) are evaluated by eval/eval_multiple_choice.py, which takes a subcommand as its first argument (examples for the remaining subcommands follow after this list):

Subcommand Description Used by
general Compute overall accuracy, a task-specific breakdown (auto-detects OVO-Bench / StreamingBench), and a prediction error analysis MVBench, OVO-Bench, StreamingBench, VideoMME
videomme Report accuracy broken down by video duration (short / medium / long) VideoMME
egoschema Generate the EgoSchema submission CSV file EgoSchema

For example:

python eval/eval_multiple_choice.py general --results_path results/llava_ov_7b/streamingbench/fps0.5-kv6000/results.csv
  • Open-ended benchmarks (RVS-Ego, RVS-Movie) are evaluated by eval/eval_open_ended.py, which uses GPT for answer scoring:
python eval/eval_open_ended.py \
    --pred_path results/llava_ov_7b/rvs_ego/fps0.5-kv6000/results.csv \
    --output_dir results/llava_ov_7b/rvs_ego/fps0.5-kv6000/tmp \
    --output_json results/llava_ov_7b/rvs_ego/fps0.5-kv6000/results.json
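
The remaining eval/eval_multiple_choice.py subcommands follow the same pattern; the paths below assume the results/<model>/<dataset>/fps<fps>-kv<kv_size>/ layout seen above and that --results_path is shared across subcommands:

python eval/eval_multiple_choice.py videomme --results_path results/llava_ov_7b/videomme/fps0.5-kv6000/results.csv
python eval/eval_multiple_choice.py egoschema --results_path results/llava_ov_7b/egoschema/fps0.5-kv6000/results.csv

For the GPT-based open-ended scoring, an OpenAI API key is presumably required (for example via the OPENAI_API_KEY environment variable); check eval/eval_open_ended.py for the exact mechanism.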

📧 Contact

For any questions regarding the paper or the technical implementation, please feel free to contact haowei.zhang123@gmail.com.

🙏 Acknowledgements

Our codebase is built upon ReKV. We gratefully acknowledge their contributions to the community.

📝 Citation

If you find our work useful for your research, please cite our paper and give us a star 😄:

@misc{zhang2026hermeskvcachehierarchical,
      title={HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding}, 
      author={Haowei Zhang and Shudong Yang and Jinlan Fu and See-Kiong Ng and Xipeng Qiu},
      year={2026},
      eprint={2601.14724},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.14724}, 
}
