Res2CLIP is a few-shot generalist anomaly detection framework that aligns visual residuals with text residuals using a frozen CLIP backbone. It supports two modes:
| Mode | Symbol | Description |
|---|---|---|
| training-free | Res2CLIP* | Direct three-branch fusion on frozen CLIP features without fine-tuning. |
| finetune | Res2CLIP† | Lightweight adapters trained on an auxiliary dataset for higher performance. |

    conda create -n res2clip python=3.10
    conda activate res2clip
    pip install -r requirements.txt

Dataset metadata JSON files are generated following the same procedure as AnomalyCLIP; please refer to AnomalyCLIP for scripts and instructions.
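
As a quick sanity check after generating a metadata file, the following is a minimal sketch that counts normal and anomalous test images per class. The helper `summarize_meta` is hypothetical and not part of this repository. The JSON layout it reads (top-level `"train"` and `"test"` keys, each mapping a class name to a list of entries with an integer `"anomaly"` flag) is an assumption based on the AnomalyCLIP convention and should be checked against the files you actually generate.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_meta(meta_path):
    """Count normal and anomalous test images per class in a metadata file.

    Assumed layout (not confirmed by this README):
    {"train": {cls: [entry, ...]}, "test": {cls: [entry, ...]}}
    where each entry carries an integer "anomaly" flag (0 = normal, 1 = anomalous).
    """
    meta = json.loads(Path(meta_path).read_text())
    summary = {}
    for cls_name, entries in meta.get("test", {}).items():
        # Counter returns 0 for flags that never occur, so both keys are always set.
        counts = Counter(int(entry["anomaly"]) for entry in entries)
        summary[cls_name] = {"normal": counts[0], "anomalous": counts[1]}
    return summary
```

Calling `summarize_meta` on a generated metadata file returns a dictionary keyed by class name, which makes it easy to spot a class with no test entries before launching training or evaluation.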
We use the CLIP ViT-L/14@336px backbone. The model is downloaded automatically on first run to ./clip_model/ViT-L-14-336px.pt. Alternatively, download it manually from the OpenAI CLIP releases and place it at that path.
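
If you place the weights manually, a small check like the one below can confirm the file is where the scripts expect it. This is only a sketch: `check_clip_checkpoint` is a hypothetical helper, not part of this repository, and it verifies only that a non-empty file exists at the path given above, not that the weights are valid.

```python
from pathlib import Path

# Path taken from this README; the helper itself is hypothetical.
DEFAULT_CLIP_PATH = Path("./clip_model/ViT-L-14-336px.pt")


def check_clip_checkpoint(path=DEFAULT_CLIP_PATH):
    """Return True if a non-empty file exists at the given checkpoint path."""
    path = Path(path)
    return path.is_file() and path.stat().st_size > 0


if __name__ == "__main__":
    if check_clip_checkpoint():
        print(f"Found CLIP weights at {DEFAULT_CLIP_PATH}")
    else:
        print(f"No weights at {DEFAULT_CLIP_PATH}; they should be fetched on first run "
              "or can be placed there manually.")
```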
Edit paths in train.sh, then:

    bash train.sh

Adapters are trained separately on MVTec AD and VisA. Checkpoints are saved to ./checkpoints/{mvtec,visa}/.
Training-free (Res2CLIP*):

    bash test_trainingfree.sh

Fine-tuned (Res2CLIP†):

    bash test_finetune.sh

We thank AnomalyCLIP for their open-source codebase, on which clip_lib/ is based.
If you find this work helpful, please consider citing our paper.
@article{liu2026res2clip,
title={Res$^2$CLIP: Few-Shot Generalist Anomaly Detection with Residual-to-Residual Alignment},
author={Liu, Xinyue and Wang, Jianyuan and Leng, Biao and Zhang, Shuo},
journal={arXiv preprint arXiv:2605.16171},
year={2026}
}