Poor performance on ResNet. #10

Closed
jingzhengli opened this issue Nov 30, 2022 · 3 comments

@jingzhengli

Although good performance is obtained by fine-tuning the ViT model, I found poor performance with the ResNet models. How should the CLIP model be fine-tuned when using the pre-trained ResNet backbones? Thanks.

@gabrielilharco (Contributor)

Hi @jingzhengli, could you give more details on your experimental setting and which results you are getting? Thanks!

@jingzhengli (Author)

jingzhengli commented Dec 1, 2022

> Hi @jingzhengli, could you give more details on your experimental setting and which results you are getting? Thanks!

Hi, thanks for your quick reply and your nice work. For fine-tuning CLIP, I have some questions.
The first question is about how to fine-tune CLIP. Different from the "end-to-end" and "linear classifier" baselines in your paper, I fully fine-tune (updating both the vision and text encoders) the ViT-B/16-based CLIP on 11 public datasets with 16 shots, and I find a boost on all of the datasets. However, if the ViT-based vision encoder is replaced with ResNet-50 or ResNet-101, the performance becomes worse than zero-shot CLIP.
I also implemented the "linear classifier" setting to fine-tune the ResNet-based CLIP on the ImageNet dataset; the accuracy is 56%, compared to 60% for zero-shot CLIP (a rough sketch of this setup is included below).
The second question is about the implementation of fine-tuning CLIP. I run the experiments with my own implementation rather than your released code, and I would like to know how to set the initial learning rate.
Thanks again for your consideration.
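
For concreteness, here is a minimal sketch of the linear-probe setting I am referring to. It assumes OpenAI's `clip` package; `train_loader` and `num_classes` are placeholders for the dataset at hand, not part of any released code.

```python
import torch
import clip  # OpenAI's CLIP package: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)
model.eval()

num_classes = 1000  # placeholder, e.g. ImageNet

# Extract frozen image features once. `train_loader` is a placeholder
# DataLoader yielding (image, label) batches preprocessed with `preprocess`.
features, labels = [], []
with torch.no_grad():
    for images, targets in train_loader:
        feats = model.encode_image(images.to(device))
        feats = feats / feats.norm(dim=-1, keepdim=True)
        features.append(feats.float().cpu())
        labels.append(targets)
features, labels = torch.cat(features), torch.cat(labels)

# Train a linear classifier on top of the frozen features.
classifier = torch.nn.Linear(features.shape[1], num_classes)
optimizer = torch.optim.AdamW(classifier.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()
for epoch in range(10):
    loss = loss_fn(classifier(features), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```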

@gabrielilharco (Contributor)

Hi @jingzhengli. It seems like there are quite a few experimental differences then, so it's hard to pinpoint what the issue might be. If I understood correctly, it's a bit odd that your linear classifier is giving lower accuracy than the corresponding zero-shot model. If you are initializing the head with the zero-shot weights, this is likely an issue with your hyper-parameters or a bug.
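
For reference, a minimal sketch of initializing the head with the zero-shot weights, assuming OpenAI's `clip` package; the class names and prompt template below are placeholders and should be replaced with the dataset's own labels and prompts:

```python
import torch
import clip  # OpenAI's CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)

# Placeholder class names / prompt template.
class_names = ["tench", "goldfish", "great white shark"]
with torch.no_grad():
    prompts = clip.tokenize([f"a photo of a {c}." for c in class_names]).to(device)
    text_features = model.encode_text(prompts)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# A linear head whose rows are the normalized text embeddings reproduces the
# zero-shot classifier (up to the logit scale), so fine-tuning starts from
# zero-shot accuracy rather than below it.
head = torch.nn.Linear(text_features.shape[1], len(class_names), bias=False)
head.weight.data = text_features.float()
```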

Re. learning rate, I'd recommend doing a sweep, since your experimental setting is different. Also note that weight interpolation (and thus WiSE-FT) can perform poorly if the learning rate is too large, so I'd recommend erring on the side of smaller learning rates if you can't do a proper hyper-parameter search.
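
As a rough sketch of the interpolation itself (the checkpoint paths and the `model` instance below are placeholders; both state dicts must come from the same architecture so the keys match):

```python
import torch

def interpolate(zeroshot_sd, finetuned_sd, alpha):
    """Return (1 - alpha) * zero-shot + alpha * fine-tuned, key by key."""
    return {
        key: (1 - alpha) * zeroshot_sd[key] + alpha * finetuned_sd[key]
        for key in zeroshot_sd
    }

# Placeholder checkpoint paths.
zeroshot_sd = torch.load("zeroshot.pt", map_location="cpu")
finetuned_sd = torch.load("finetuned.pt", map_location="cpu")

# `model` is the CLIP classifier built earlier (placeholder here);
# sweep alpha over [0, 1] and evaluate each mixture on held-out data.
model.load_state_dict(interpolate(zeroshot_sd, finetuned_sd, alpha=0.5))
```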
