
FT-CLIP

This repo is the official implementation of "CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet".

Introduction

Recent studies have shown that CLIP has achieved remarkable success in performing zero-shot inference, while its fine-tuning performance is not satisfactory. In this paper, we identify that fine-tuning performance is significantly impacted by hyper-parameter choices. We examine various key hyper-parameters and empirically evaluate their impact on fine-tuning CLIP for classification tasks through a comprehensive study. We find that the fine-tuning performance of CLIP is substantially underestimated. Equipped with hyper-parameter refinement, we demonstrate that CLIP itself is better than, or at least competitive with, large-scale supervised pre-training approaches and the latest works that use CLIP as the prediction target in Masked Image Modeling. Specifically, CLIP ViT-Base/16 and CLIP ViT-Large/14 can achieve 85.7% and 88.0% fine-tuning Top-1 accuracy on the ImageNet-1K dataset. These observations challenge the conventional conclusion that CLIP is not suitable for fine-tuning, and motivate us to rethink recently proposed improvements based on CLIP.

Results

Top-1 accuracy (%) on ImageNet-1K:

| Method | ViT-Base/16 @224 | ViT-Base/16 @384 | ViT-Large/16 @384 | ViT-Large/14 @224 | ViT-Large/14 @336 |
| --- | --- | --- | --- | --- | --- |
| FLOPs | 17.5G | 55.4G | 190.7G | 80.7G | 190.6G |
| Supervised baselines | | | | | |
| ImageNet-21K | 84.0 | 86.2 | 87.1 | ---- | ---- |
| JFT-300M | ---- | 86.7 | 88.0 | ---- | ---- |
| JFT-3B | ---- | 86.6 | 88.5 | ---- | ---- |
| MIM with CLIP as prediction target | | | | | |
| MVP | 84.4 | ---- | ---- | ---- | ---- |
| FD-CLIP | 84.9 | ---- | ---- | ---- | ---- |
| CAE-v2 | 85.3 | ---- | ---- | ---- | ---- |
| BEiT-2 | 85.5 | ---- | ---- | ---- | ---- |
| Fine-tuning CLIP directly | | | | | |
| FT-CLIP (ours) | 85.7 | 86.6 | ---- | 88.0 | 88.3 |

Setup

PyTorch, timm, and DeepSpeed are required. Differences in CUDA version or GPU type may slightly influence the results.

pip install torch==1.10.2+cu113 torchvision==0.11.3+cu113 -f https://download.pytorch.org/whl/torch_stable.html
pip install --user timm==0.4.12
pip install --user deepspeed==0.4.0
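
A quick way to confirm the environment is set up (a minimal sketch, not part of the repo; the versions pinned above are the tested ones, and other versions may shift results slightly):

import torch
import timm
import deepspeed

# Report the installed versions and confirm a CUDA-capable GPU is visible.
print("torch     :", torch.__version__)      # tested: 1.10.2+cu113
print("timm      :", timm.__version__)       # tested: 0.4.12
print("deepspeed :", deepspeed.__version__)  # tested: 0.4.0
print("CUDA available:", torch.cuda.is_available())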

Fine-tuning configs

The CLIP-Base/16 model can be fine-tuned on ImageNet-1K using 8 A100-40GB GPUs:

MODEL=CLIP_B16
OUTPUT_DIR=/path/to/save/your_model
DATA_PATH=/path/to/imagenet

echo $OUTPUT_DIR
mkdir -p $OUTPUT_DIR
cp $0 $OUTPUT_DIR

OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=8 run_class_finetuning.py \
    --model ${MODEL} --data_path $DATA_PATH \
    --input_size 224 \
    --finetune True \
    --num_workers 8 \
    --output_dir ${OUTPUT_DIR} \
    --batch_size 256 --lr 6e-4 --update_freq 1 \
    --warmup_epochs 10 --epochs 50 \
    --layer_decay 0.6 \
    --drop_path 0 \
    --dist_eval --eval_all --no_save_ckpt \
    --enable_deepspeed \
    --clip_mean_and_std \
    --layer_scale_init_value 0 \
    --abs_pos_emb --disable_rel_pos_bias \
    --weight_decay 0.05 --mixup 0 --cutmix 0 \
    --nb_classes 1000 --model_prefix visual. \
    --model_ema --model_ema_decay 0.9998 \
    2>&1 | tee -a ${OUTPUT_DIR}/log.txt
  • --batch_size: batch size per GPU.
  • Effective batch size = number of GPUs * --batch_size * --update_freq. So in the above example, the effective batch size is 8*256*1 = 2048.
  • --lr: base learning rate.
  • --layer_decay: layer-wise learning rate decay. Layers closer to the input get progressively smaller learning rates; the LR of the i-th layer counted from the top is lr * layer_decay ** i (see the sketch after this list).
  • --warmup_epochs: learning rate warmup epochs.
  • --epochs: total fine-tuning epochs.
  • --clip_mean_and_std: use the CLIP normalization mean and std instead of the ImageNet defaults.
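
To make these settings concrete, here is a small standalone sketch (illustrative only, not code from run_class_finetuning.py; the layer indexing is assumed to follow the BEiT-style scheme this codebase derives from, and the CLIP statistics are the standard OpenAI CLIP preprocessing constants):

# Effective batch size = number of GPUs * --batch_size * --update_freq
num_gpus, batch_size, update_freq = 8, 256, 1
print("effective batch size:", num_gpus * batch_size * update_freq)  # 8 * 256 * 1 = 2048

# Layer-wise LR decay (assumed BEiT-style indexing): the patch embedding is the
# bottom layer, the 12 transformer blocks of ViT-B/16 follow, and the head sits
# on top. Layers closer to the input receive smaller learning rates.
base_lr, layer_decay = 6e-4, 0.6
num_layers = 12 + 2  # embedding + transformer blocks + head
for i in range(num_layers):
    scale = layer_decay ** (num_layers - 1 - i)
    print(f"layer {i:2d}: lr = {base_lr * scale:.2e}")

# --clip_mean_and_std swaps the ImageNet normalization statistics for the
# constants used by OpenAI CLIP's image preprocessing.
IMAGENET_MEAN, IMAGENET_STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)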

See scripts/ for more configs.

Acknowledgments

This repository is modified from BEiT, built using the timm library, the DeiT repository and the CLIP repository. The CLIP model file is modified from DeCLIP.

Citation

If you use this code for your research, please cite our paper.

@article{dong2022ftclip,
  title={CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet},
  author={Dong, Xiaoyi and Bao, Jianmin and Zhang, Ting and Chen, Dongdong and Gu, Shuyang and Zhang, Weiming and Yuan, Lu and Chen, Dong and Wen, Fang and Yu, Nenghai},
  journal={arXiv preprint arXiv:2212.06138},
  year={2022}
}
