Possible to finetune? #18

Closed
afiaka87 opened this issue Sep 14, 2021 · 15 comments

Comments

@afiaka87

Is it possible to fine-tune from the existing OpenAI checkpoints rather than train from scratch with this codebase?

@mitchellnw
Contributor

Do you mean fine-tune on something like ImageNet or fine-tune on more image-caption pairs?

@afiaka87
Author

> Do you mean fine-tune on something like ImageNet or fine-tune on more image-caption pairs?

The latter: image-text pairs, not classes.

@mitchellnw
Contributor

I think you can do this with the --openai_pretrained flag, but I don't know how well tested it is. Let us know if it works!
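
For intuition, here's a minimal standalone sketch of the same recipe (fine-tuning from the OpenAI checkpoint on image-text pairs with the usual contrastive loss), using OpenAI's `clip` package directly rather than this repo's training script. `loader` is a hypothetical DataLoader yielding preprocessed images and `clip.tokenize`'d captions; hyperparameters are placeholders.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI's CLIP package, which this repo builds on

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device, jit=False)
model.float()  # the CUDA checkpoint loads in fp16; fp32 is safer for fine-tuning

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6, weight_decay=0.1)

# `loader` is a hypothetical DataLoader of (preprocessed image batch,
# tokenized caption batch) pairs from your image-text dataset.
for images, texts in loader:
    logits_per_image, logits_per_text = model(images.to(device), texts.to(device))
    labels = torch.arange(len(images), device=device)  # i-th image matches i-th text
    loss = (F.cross_entropy(logits_per_image, labels) +
            F.cross_entropy(logits_per_text, labels)) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```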

@mitchellnw
Contributor

FWIW, the former (fine-tuning on ImageNet) will soon be available in the upcoming code release for this paper: https://arxiv.org/abs/2109.01903

@afiaka87
Author

afiaka87 commented Sep 21, 2021

@mitchellnw Exciting stuff; can't wait for the release!

I did find the --openai_pretrained parameter and had a few issues with it. I believe they were specific to the dataset I'm working with, though. I'll be sure to push any fixes I find upstream, although I'm a bit distracted with DALLE-pytorch at the moment.

Do you have experience with fine-tuning at lower batch sizes? My understanding is that CLIP needs rather large batch sizes to work effectively. I'm able to fit a batch size of 108 on my RTX 2070 Super. Is there a concern about "forgetting" too much when fine-tuning at lower batch sizes?

@mitchellnw
Contributor

Okay, great! We haven't seen low batch sizes be a concern, and we'd expect them to matter even less for fine-tuning, but we don't know for sure.

@MyLtYkRiTiK

@mitchellnw
Hello! Why do you use the jit version of the model for fine-tuning with --openai_pretrained? Is it a bug or a feature?

@mitchellnw
Contributor

We set jit=False when using the pretrained version, as I don't think you can fine-tune a jit=True model.
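
A small sketch of the difference (following the reasoning above; the jit=True path is the inference-oriented traced model):

```python
import torch
import clip

# jit=True returns a traced torch.jit.ScriptModule, intended for inference;
# per the comment above, it isn't suitable for fine-tuning.
scripted_model, _ = clip.load("ViT-B/32", jit=True)
print(isinstance(scripted_model, torch.jit.ScriptModule))  # True

# jit=False returns a regular torch.nn.Module whose parameters can be
# handed to an optimizer as usual.
model, _ = clip.load("ViT-B/32", jit=False)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
```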

@mitchellnw
Contributor

mitchellnw commented Nov 3, 2021

As an update, you can now fine-tune CLIP on supervised learning tasks via this repo: https://github.com/mlfoundations/wise-ft

@iremonur

iremonur commented Feb 9, 2022

Hello, I would like to fine-tune CLIP on my own specific dataset (approx. 50k image-text pairs). I used the provided ViT-B/32 checkpoint as the initial model, but the accuracy starts at 1% and after 32 epochs reaches only around 30%. (I tried various weight decay and LR combinations; the best was weight decay=0.001 and LR=5e-4.) Have you tried fine-tuning CLIP on a small specific dataset, and if so, how was the performance? @afiaka87

@gabrielilharco
Collaborator

Hi @iremonur, a few questions: 1) When you say accuracy, what does this refer to? Are you doing image-to-text or text-to-image retrieval, using all samples from your dataset? 2) Do you expect performance on your dataset to be high? Would a human get reasonable performance on this task? I'm asking because the dataset could have a lot of similar images/captions, which might make it hard to get good retrieval performance.

@iremonur

iremonur commented Feb 10, 2022

I reported the training accuracy on my own dataset during fine-tuning, i.e., the fraction of correct matches with respect to the ground truth, computed over all samples in my dataset.
Actually, the samples in my dataset are quite similar to each other; I intended to fine-tune the model on this challenging dataset in the first place because the official network parameters do not perform well on it.
I agree that it is difficult for the model to perform well on such similar samples, but at the end of the first epoch the accuracy reaches only 1% (extremely low), which is surprising to me since I initialize from the pre-trained parameters.
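
To make the metric concrete, here is a sketch of the kind of in-batch matching accuracy described above, assuming the OpenAI `clip` model interface; `batch_match_accuracy` is a hypothetical helper, not something from the repo:

```python
import torch

@torch.no_grad()
def batch_match_accuracy(model, images, texts, device="cuda"):
    """In-batch matching accuracy (hypothetical helper): for each image,
    check whether its own caption gets the highest similarity score among
    all captions in the batch. Chance level is 1 / batch_size."""
    logits_per_image, _ = model(images.to(device), texts.to(device))
    predictions = logits_per_image.argmax(dim=-1)
    labels = torch.arange(len(images), device=device)
    return (predictions == labels).float().mean().item()
```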

@gabrielilharco
Collaborator

Hi @iremonur, thanks for the clarification. As a sanity check, it would be good to measure the pre-trained model's accuracy without any fine-tuning. If it's significantly higher than 1%, there's probably something wrong with loading the checkpoint, or some hyperparameter might be destabilizing fine-tuning (e.g., the learning rate is too high).
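
That sanity check could look like the following, reusing the hypothetical `batch_match_accuracy` helper from the sketch above (`loader` is again a hypothetical DataLoader over your dataset):

```python
import clip

# Score the stock checkpoint before any fine-tuning; if this is well
# above 1%, suspect checkpoint loading or hyperparameters instead.
model, preprocess = clip.load("ViT-B/32", device="cuda", jit=False)
accuracies = [batch_match_accuracy(model, images, texts)
              for images, texts in loader]
print(f"pre-fine-tuning accuracy: {sum(accuracies) / len(accuracies):.1%}")
```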

@iremonur

iremonur commented Feb 11, 2022

Thank you for your response @gabrielilharco, I'll check that out. Also, when I reduce the batch size (e.g., to 8), the model reaches much higher accuracy (around 50%), but I worry this may hurt generalization, since the model only learns to discriminate among samples within smaller groups. Have you seen reducing the batch size cause poor generalization? This is similar to the question in #18 (comment) asked by @afiaka87.

@gabrielilharco
Collaborator

I'm not sure about generalization, but note that the batch size does affect training accuracy: smaller batch sizes mean there are fewer options to choose from, so the task of finding the correct match for an image or text is easier. The raw training accuracy numbers might therefore not be representative of the model's capabilities.
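
A tiny illustration of this point, using the batch sizes mentioned in this thread: even a model guessing at random scores 1/N with batch size N, so shrinking the batch inflates the metric.

```python
# Chance-level in-batch matching accuracy for a few batch sizes.
for n in (8, 32, 108):
    print(f"batch size {n:3d}: chance accuracy = {1 / n:.1%}")
# batch size   8: chance accuracy = 12.5%
# batch size  32: chance accuracy = 3.1%
# batch size 108: chance accuracy = 0.9%
```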

rom1504 pushed a commit that referenced this issue Nov 23, 2022