Poor performance on ResNet. #10

Closed
jingzhengli opened this issue Nov 30, 2022 · 3 comments

@jingzhengli

Although good performance is obtained by fine-tuning the ViT model, I found poor performance with the ResNet models. How should the CLIP model be fine-tuned when using the pre-trained ResNet backbones? Thanks.

@gabrielilharco (Contributor)

Hi @jingzhengli, could you give more details on your experimental setting and which results you are getting? Thanks!

@jingzhengli (Author)

jingzhengli commented Dec 1, 2022

> Hi @jingzhengli, could you give more details on your experimental setting and which results you are getting? Thanks!

Hi, thanks for your quick reply and your nice work. For fine-tuning CLIP, I have some questions.
The first question is about how to fine-tune CLIP. Different from the "end-to-end" and "linear classifier" baselines in your paper, I fully fine-tune (updating both the vision and text encoders) the ViT-B/16-based CLIP on 11 public datasets with 16 shots, and I find a boost on all of the datasets. However, if the ViT-based vision encoder is replaced with ResNet-50 or ResNet-101, the performance becomes worse than zero-shot CLIP.
I also implemented the "linear classifier" setting to fine-tune the ResNet-based CLIP on the ImageNet dataset; the accuracy is 56%, compared to 60% for zero-shot CLIP (a rough sketch of this setup is included below).
The second question is about the implementation of fine-tuning CLIP. I run the experiments with my own implementation rather than your released code, and I would like to know how to set the initial learning rate.
Thanks again for your consideration.
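
For concreteness, here is a minimal sketch of the linear-probe setting I am referring to. It assumes OpenAI's `clip` package; `train_loader` and `num_classes` are placeholders for the dataset at hand, not part of any released code.

```python
import torch
import clip  # OpenAI's CLIP package: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)
model.eval()

num_classes = 1000  # placeholder, e.g. ImageNet

# Extract frozen image features once. `train_loader` is a placeholder
# DataLoader yielding (image, label) batches preprocessed with `preprocess`.
features, labels = [], []
with torch.no_grad():
    for images, targets in train_loader:
        feats = model.encode_image(images.to(device))
        feats = feats / feats.norm(dim=-1, keepdim=True)
        features.append(feats.float().cpu())
        labels.append(targets)
features, labels = torch.cat(features), torch.cat(labels)

# Train a linear classifier on top of the frozen features.
classifier = torch.nn.Linear(features.shape[1], num_classes)
optimizer = torch.optim.AdamW(classifier.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()
for epoch in range(10):
    loss = loss_fn(classifier(features), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```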

@gabrielilharco (Contributor)

Hi @jingzhengli. It seems like there are quite a few experimental differences then, so it's hard to pinpoint what the issue might be. If I understood correctly, it's a bit odd that your linear classifier is giving lower accuracy than the corresponding zero-shot model. If you are initializing the head with the zero-shot weights, this is likely an issue with your hyper-parameters or a bug.
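
For reference, a minimal sketch of initializing the head with the zero-shot weights, assuming OpenAI's `clip` package; the class names and prompt template below are placeholders and should be replaced with the dataset's own labels and prompts:

```python
import torch
import clip  # OpenAI's CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)

# Placeholder class names / prompt template.
class_names = ["tench", "goldfish", "great white shark"]
with torch.no_grad():
    prompts = clip.tokenize([f"a photo of a {c}." for c in class_names]).to(device)
    text_features = model.encode_text(prompts)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# A linear head whose rows are the normalized text embeddings reproduces the
# zero-shot classifier (up to the logit scale), so fine-tuning starts from
# zero-shot accuracy rather than below it.
head = torch.nn.Linear(text_features.shape[1], len(class_names), bias=False)
head.weight.data = text_features.float()
```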

Re. learning rate, I'd recommend doing a sweep, since your experimental setting is different. Also note that weight interpolation (and thus WiSE-FT) can perform poorly if the learning rate is too large, so I'd recommend erring on the side of smaller learning rates if you can't do a proper hyper-parameter search.
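
As a rough sketch of the interpolation itself (the checkpoint paths and the `model` instance below are placeholders; both state dicts must come from the same architecture so the keys match):

```python
import torch

def interpolate(zeroshot_sd, finetuned_sd, alpha):
    """Return (1 - alpha) * zero-shot + alpha * fine-tuned, key by key."""
    return {
        key: (1 - alpha) * zeroshot_sd[key] + alpha * finetuned_sd[key]
        for key in zeroshot_sd
    }

# Placeholder checkpoint paths.
zeroshot_sd = torch.load("zeroshot.pt", map_location="cpu")
finetuned_sd = torch.load("finetuned.pt", map_location="cpu")

# `model` is the CLIP classifier built earlier (placeholder here);
# sweep alpha over [0, 1] and evaluate each mixture on held-out data.
model.load_state_dict(interpolate(zeroshot_sd, finetuned_sd, alpha=0.5))
```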
