Correct CLIPForImageClassification #30373
Conversation
cc @NielsRogge
Interesting, thanks for the plot! I decided to average-pool the patch tokens by default, as it was shown that this generally works better than using the final hidden state of the CLS token for downstream image classification. Question for @amyeroberts: we could switch to using the pooler output by default, as the […]
@Xmaster6y Could you re-run this experiment on the development branch now that the weight initialization fix has been merged into main? Based on that we can make a better decision :). If there's little to no difference, or if this change makes things worse, then there's nothing to change. If there's still a large difference, then maybe.
Sounds good, I'll try it this weekend.
My results hold. IMO this is expected: since CLIP was trained on the pooler output, the other tokens may be less relevant, and the classifier might be overfitting on the additional "noise" they carry. I proposed a change in the forward pass (not ideal) if we want to keep both options. Otherwise, I think it's best to just change the target (and also do it for SigLIP?).
Hi @Xmaster6y, thanks for re-running!
This is a very good example of why this is so difficult: there are many different ways one can implement this, and which is best will likely depend on the situation. Unless the model is already pre-defined, there's no clear implementation. In fact, this is a good indication that this task-specific head should never really have been added at all: the general rule for the library is to only add a head if there are checkpoints available. This case is a bit special because it's specifically mentioned in the OpenAI release. What I would suggest is this: we follow the pattern in other image classification models, where we have an optional […]
I get how to integrate […]
@Xmaster6y Apologies for the delay in my response here. No, that isn't what I was thinking. I was thinking it would be more like BEiT, where the vision model, CLIPVisionTransformer, takes an optional argument
However, this would then also mean introducing an if/else statement within […]. All in all, I don't think we should make this change, as it adds complexity to the model. It's unfortunate that it's implemented this way, but thankfully it's easy for people to use the base CLIP model and build their own task heads.
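To make the trade-off under discussion concrete, here is a rough, standalone numpy sketch of the kind of if/else branch the optional argument would introduce. This is not the transformers implementation, and the flag name use_pooler_output is hypothetical, standing in for whatever argument would have been added:

```python
import numpy as np

# Standalone illustration, NOT the transformers implementation.
# "use_pooler_output" is a hypothetical flag name for the optional
# argument discussed above.

def classify(last_hidden_state, pooler_output, classifier_weight,
             classifier_bias, use_pooler_output=False):
    """Compute class logits from one of the two pooling strategies."""
    if use_pooler_output:
        # the pooled CLS representation that CLIP's pre-training used
        features = pooler_output                   # (batch, hidden)
    else:
        # current behavior: average all token embeddings
        features = last_hidden_state.mean(axis=1)  # (batch, hidden)
    return features @ classifier_weight.T + classifier_bias
```

This is exactly the extra branching the comment above objects to: a second code path through the head, selected by an argument most users would never touch.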
Thanks for the clarification, I do agree that it would become cluttered.
Closing as not planned then.
What does this PR do?
This PR fixes CLIPForImageClassification by using the pooler_output instead of the mean pooling of the last_hidden_state to compute the class logits. See the graph below.
Suggestion
Maybe we could give a choice (in the config?) of which method to use.
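For illustration, here is a minimal numpy sketch of the two pooling strategies being compared. This is not the transformers code: CLIP's real pooler_output is the CLS token's final hidden state passed through a LayerNorm, which is omitted here:

```python
import numpy as np

def mean_pool(last_hidden_state):
    # average over all tokens: what CLIPForImageClassification
    # did before this PR
    return last_hidden_state.mean(axis=1)

def cls_pool(last_hidden_state):
    # take the CLS token (index 0); CLIP's pooler_output is derived
    # from it (the real model also applies a final LayerNorm)
    return last_hidden_state[:, 0, :]

# both map (batch, num_tokens, hidden) -> (batch, hidden) features
# for the linear classifier
```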
Performance
Here is a training comparison on a small dataset (everything is equal in the two runs). All the parameters are frozen except classifier.weight and classifier.bias (pink being the pooled output and orange the mean pooling).
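The frozen-classifier setup described above is a linear probe. It can be sketched as follows; the model here is a toy stand-in, not CLIPForImageClassification itself:

```python
import torch
from torch import nn

# Toy stand-in for a backbone plus classification head; the actual
# experiment used CLIPForImageClassification.
class TinyClassifier(nn.Module):
    def __init__(self, hidden=8, num_labels=3):
        super().__init__()
        self.backbone = nn.Linear(16, hidden)       # stand-in vision tower
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, x):
        return self.classifier(self.backbone(x))

model = TinyClassifier()

# Freeze everything except classifier.weight and classifier.bias,
# mirroring the comparison in the plot.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("classifier.")

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```

Because only the linear head trains, any accuracy gap between the two runs isolates the quality of the pooled features themselves.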
Before submitting
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
Related to #28952 @NielsRogge