Yingxin Lai1, Zitong Yu1⋆, Jun Wang1⋆, Linlin Shen2, Yong Xu3, and Xiaochun Cao4
1 Great Bay University
2 Shenzhen University
3 Harbin Institute of Technology
4 School of Cyber Science and Technology, Sun Yat-sen University
Multimodal Large Language Models (MLLMs) enable interpretable multimedia forensics by generating textual rationales for forgery detection. However, processing dense visual sequences incurs high computational cost, especially for high-resolution images and videos. Existing visual token pruning methods are mostly semantic-driven: they preserve salient objects while often discarding background regions where manipulation traces such as high-frequency anomalies and temporal jitters reside.
To address this issue, we introduce ForensicZip, a training-free framework that reformulates token compression from a forgery-driven perspective. ForensicZip models temporal token evolution as a Birth-Death Optimal Transport problem with a slack dummy node, quantifying physical discontinuities associated with transient generative artifacts. The final forensic score further integrates transport-based novelty with high-frequency priors, allowing forensic evidence to be preserved under large-ratio compression.
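The Birth-Death Optimal Transport step described above can be sketched as follows. This is a minimal illustration, not the released implementation: the cosine cost, uniform marginals, and parameter names (`birth_cost`, `death_cost`, `eps`, `iters`) are assumptions that loosely mirror the `FORENSICZIP_*` configuration variables, and the novelty definition is one plausible reading of "transport-based novelty".

```python
# Hedged sketch: entropic OT between token sets of consecutive frames,
# augmented with a dummy row/column so tokens can be "born" or "die"
# at a fixed cost instead of being forced to match.
import numpy as np

def sinkhorn_birth_death(prev, curr, birth_cost=1.0, death_cost=1.0,
                         eps=0.05, iters=50):
    """prev: (M, D) tokens of frame t-1; curr: (N, D) tokens of frame t.
    Returns a per-token novelty score in [0, 1]: the fraction of a current
    token's mass received from the dummy "birth" slot rather than from any
    previous token."""
    # Cosine-distance cost between previous and current tokens.
    p = prev / np.linalg.norm(prev, axis=1, keepdims=True)
    c = curr / np.linalg.norm(curr, axis=1, keepdims=True)
    C = 1.0 - p @ c.T                                    # (M, N)
    # Augment with a dummy row (birth of new tokens) and column (death).
    C = np.block([[C, np.full((len(prev), 1), death_cost)],
                  [np.full((1, len(curr)), birth_cost), np.zeros((1, 1))]])
    K = np.exp(-C / eps)                                 # Gibbs kernel
    # Uniform marginals; the dummy bin absorbs the slack mass.
    a = np.full(len(prev) + 1, 1.0 / (len(prev) + 1))
    b = np.full(len(curr) + 1, 1.0 / (len(curr) + 1))
    u = np.ones_like(a)
    for _ in range(iters):                               # Sinkhorn iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]                      # transport plan
    # Novelty of each current token: share of its mass from the dummy row.
    return P[-1, :-1] / P[:, :-1].sum(axis=0)
```

Tokens well explained by the previous frame receive near-zero novelty, while transient tokens with no cheap match are pushed onto the dummy node and score high.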
On deepfake and AIGC benchmarks, ForensicZip delivers strong detection performance at aggressive compression ratios, achieving 2.97× speedup and over 90% FLOPs reduction at 10% token retention while maintaining state-of-the-art accuracy.
Figure 1. Overview of the ForensicZip framework. The method preserves forgery-relevant evidence under aggressive token compression by combining transport-based novelty with forensic priors.
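The high-frequency prior and its combination with transport novelty can be illustrated with a simple sketch. The radial FFT energy ratio below is an assumption, not the paper's exact prior, and the blending weight `eta` is only intended to echo the role of `FORENSICZIP_FORENSIC_ETA`.

```python
# Hedged sketch of a high-frequency prior: score each patch by the fraction
# of its spectral energy outside a low-frequency disk, then blend with a
# transport-based novelty score.
import numpy as np

def high_freq_energy(patch, radius_frac=0.25):
    """patch: (H, W) grayscale patch. Returns the high-frequency energy
    ratio in [0, 1]."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(patch))) ** 2
    h, w = spec.shape
    yy, xx = np.mgrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2)
    low = r <= radius_frac * min(h, w) / 2      # low-frequency disk (incl. DC)
    total = spec.sum()
    return float(spec[~low].sum() / total) if total > 0 else 0.0

def forensic_score(novelty, hf_prior, eta=0.5):
    """Convex blend of transport novelty and the high-frequency prior."""
    return (1 - eta) * novelty + eta * hf_prior
```

A constant patch concentrates all energy at DC and scores near zero, while noisy or artifact-laden patches spread energy into high frequencies and score high, so they survive compression even when semantically uninteresting.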
- `forensiczip/` — method implementation and helper utilities
- `fakevlm/` — FakeVLM-compatible skeleton modules
- `scripts/` — evaluation entrypoints
- `docs/` — running and data preparation notes
- `imgs/` — method figures
```bash
conda create -n forensiczip python=3.10 -y
conda activate forensiczip
pip install -r requirements.txt
```

If you already have a compatible environment, you can reuse it directly.
FakeClue evaluation:

```bash
MODEL_PATH_7B=<MODEL_PATH> \
FAKECLUE_TEST_JSON=<FAKECLUE_JSON> \
FAKECLUE_DATA_BASE=<FAKECLUE_MEDIA_DIR> \
CUDA_DEVICES=0 \
PYTHON_BIN=python \
bash scripts/eval_forensiczip_fakeclue.sh
```

LOKI evaluation:

```bash
MODEL_PATH_7B=<MODEL_PATH> \
LOKI_JSON_DIR=<LOKI_JSON_DIR> \
LOKI_MEDIA_ROOT=<LOKI_MEDIA_ROOT> \
CUDA_DEVICES=0 \
PYTHON_BIN=python \
bash scripts/eval_forensiczip_loki.sh
```

Both scripts also honor the following optional environment variables: `RETENTION_RATIOS_STR`, `VAL_BATCH_SIZE`, `WORKERS`, `MAX_LENGTH`, `MAX_NEW_TOKENS`, `FORENSICZIP_SELECT_LAYER`, `FORENSICZIP_BIRTH_COST`, `FORENSICZIP_DEATH_COST`, `FORENSICZIP_SINKHORN_EPS`, `FORENSICZIP_SINKHORN_ITERS`, `FORENSICZIP_EMA_BETA`, `FORENSICZIP_BIRTH_WEIGHT`, `FORENSICZIP_POS_LAMBDA`, `FORENSICZIP_FORENSIC_ETA`.
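As an illustration, a run can override a subset of these variables inline. The variable names come from the list above, but the paths, the comma-separated format of `RETENTION_RATIOS_STR`, and all values shown here are assumptions for the sake of example:

```shell
# Illustrative values only: sweep two retention ratios on FakeClue
# with custom Sinkhorn settings. Paths are placeholders.
MODEL_PATH_7B=/path/to/model \
FAKECLUE_TEST_JSON=/path/to/fakeclue_test.json \
FAKECLUE_DATA_BASE=/path/to/fakeclue_media \
RETENTION_RATIOS_STR="0.1,0.25" \
FORENSICZIP_SINKHORN_EPS=0.05 \
FORENSICZIP_SINKHORN_ITERS=50 \
CUDA_DEVICES=0 \
bash scripts/eval_forensiczip_fakeclue.sh
```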
Detailed usage notes are available in docs/running.md.
These resources are used by this repository but are not introduced by this work.
See docs/data_preparation.md for the expected local file layout.
This codebase is built on top of FakeVLM. We thank the FakeVLM project for providing the base model and evaluation structure used in this release.
If you find this repository useful, please consider citing:
```bibtex
@article{lai2026forensiczip,
  title={ForensicZip: More Tokens are Better but Not Necessary in Forensic Vision-Language Models},
  author={Lai, Yingxin and Yu, Zitong and Wang, Jun and Shen, Linlin and Xu, Yong and Cao, Xiaochun},
  journal={arXiv preprint},
  year={2026}
}
```

For questions about this repository, please contact: yingxinlai2@gmail.com