
Correct CLIPForImageClassification #30373

Closed
wants to merge 5 commits

Conversation

Xmaster6y

What does this PR do?

This PR fixes CLIPForImageClassification by using the pooler_output instead of the mean pooling of the last_hidden_state to compute the class logits. See the graph below.

Suggestion

Maybe we could offer a choice (in the config?) of which pooling method to use.

Performance

Here is a training comparison on a small dataset (everything else is identical between the two runs). All parameters are frozen except classifier.weight and classifier.bias.

[W&B chart, 4/21/2024, 5:37 PM]

(Pink uses the pooled output; orange uses mean pooling.)
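The two strategies being compared can be sketched with dummy tensors. This is a hypothetical NumPy stand-in, not the actual transformers implementation: shapes follow CLIPVisionModel's outputs, the layer norm applied to the CLS token in the real pooler is omitted, and the classifier weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dummy CLIPVisionModel outputs: batch 2, 50 tokens (1 CLS + 49 patches), hidden 8.
last_hidden_state = rng.normal(size=(2, 50, 8))
# In CLIP, pooler_output is the layer-normalized CLS token; the norm is omitted
# here for brevity, so this is just the CLS token.
pooler_output = last_hidden_state[:, 0, :]

classifier_weight = rng.normal(size=(8, 3))  # hypothetical 3-class head

# Current behaviour: mean-pool all tokens of last_hidden_state, then classify.
logits_mean = last_hidden_state.mean(axis=1) @ classifier_weight
# Behaviour proposed in this PR: classify from pooler_output instead.
logits_pooled = pooler_output @ classifier_weight

print(logits_mean.shape, logits_pooled.shape)  # (2, 3) (2, 3)
```

Both heads produce logits of the same shape; they differ only in which summary of the token sequence feeds the classifier.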

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Related to #28952 @NielsRogge

@amyeroberts
Collaborator

cc @NielsRogge

@NielsRogge
Contributor

NielsRogge commented Apr 24, 2024

Interesting, thanks for the plot! I decided to average pool the patch tokens by default as it was shown that this generally works better than using the final hidden state of the CLS token for downstream image classification.

Question for @amyeroberts: we could switch to using the pooler output by default, as the CLIPForImageClassification class is pretty new, or we could make it configurable (although adding attributes to an existing config is a bit of an anti-pattern in the Transformers library).

@amyeroberts
Collaborator

@Xmaster6y Could you re-run this experiment on the development branch now that the weight initialization fix has been merged into main?

Based on that we can make a better decision :). If there's little to no difference, or this change makes things worse, then there's nothing to change.

If there's still a large difference, then maybe.

@Xmaster6y
Author

Sounds good, I'll try it this weekend.

@Xmaster6y
Author

Xmaster6y commented Apr 28, 2024

My results hold. IMO this is expected: since CLIP was trained on the pooler output, the other tokens may be less relevant. Here, the mean-pooled variant might be overfitting on the additional "noise".

[W&B charts (two panels), 4/28/2024, 3:06 PM]

I proposed a change in the forward pass (not ideal) if we want to keep both. Otherwise, I think it's best to just change the target (and also do it for SigLIP?).

@Xmaster6y
Author

Xmaster6y commented Apr 29, 2024

Btw, I read a bit more about the article you shared, and something is bugging me. From the ViT paper:

[Screenshot of the ViT paper excerpt]

So I wouldn't put too much trust in the paper, since they didn't explore the learning rates in depth. And it might not apply here since I froze the encoder.

@amyeroberts
Collaborator

Hi @Xmaster6y, thanks for re-running!

It looks like they didn't explore it in depth w.r.t. the learning rates. And it might not apply here since I froze the encoder.

This is a very good example of why this is so difficult: there are many different ways one can implement this, and what's best will likely depend on the situation. Unless the model is already pre-defined, there's no clear implementation. In fact, this is a good indication that this task-specific head should never really have been added at all: the general rule for the library is to only add a head if there are checkpoints available. This case is a bit special because it's specifically mentioned in the OpenAI release.

What I would suggest is this: we follow the pattern in other image classification models, where we have an optional add_pooling_layer flag when constructing the image classification model and a use_mean_pooling flag to denote how things are pooled, e.g. like here for beit. To keep backwards compatibility, we might need another flag which controls the addition of the layernorm.

@Xmaster6y
Author

I get how to integrate use_mean_pooling in the config, but to be clear, are you suggesting removing CLIPForImageClassification and using add_pooling_layer? If we do that, it feels like this parameter should be in the CLIP(Text|Vision)Config, and thus the classification output/loss would only live in the VisionModelOutput. I can propose changes for that if necessary.

@amyeroberts
Collaborator

@Xmaster6y Apologies for the delay in my response here. No, I wasn't thinking about CLIPForImageClassification being removed. In fact, we'll need to keep it for backwards compatibility reasons.

I was thinking it would be more like beit, where the vision model, CLIPVisionTransformer, takes an optional argument add_pooling_layer. For beit this is added automatically for BeitForImageClassification. In this case, for CLIPForImageClassification, we can do something else:

  • If the config has e.g. use_mean_pooling=True then add_pooling_layer is False and we use the current behaviour (should remain the default)
  • If use_mean_pooling=False then add_pooling_layer=True

However, this would then also mean introducing an if/else statement within CLIPForImageClassification's forward pass.
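That conditional could look roughly like the sketch below. It is purely illustrative: use_mean_pooling follows the beit-style flag being discussed, NumPy stands in for torch tensors, and none of the names are actual transformers API.

```python
import numpy as np

def classification_logits(last_hidden_state, pooler_output, weight,
                          use_mean_pooling=True):
    """Hypothetical head: pick the pooling strategy from a config-style flag."""
    if use_mean_pooling:
        # Current (default) behaviour: average over all tokens.
        pooled = last_hidden_state.mean(axis=1)
    else:
        # Alternative: rely on the vision model's pooled output.
        pooled = pooler_output
    return pooled @ weight

rng = np.random.default_rng(1)
hidden = rng.normal(size=(2, 50, 8))      # (batch, tokens, hidden)
pooled_out = hidden[:, 0, :]              # stand-in for pooler_output
w = rng.normal(size=(8, 3))               # hypothetical 3-class head
print(classification_logits(hidden, pooled_out, w).shape)  # (2, 3)
```

The flag only switches which pooled representation feeds the head; the output shape is unchanged either way, which is what keeps the change backwards compatible in principle.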

All in all, I don't think we should make this change, as it adds complexity to the model. It's unfortunate that it's implemented this way, but thankfully it's easy for people to use the base CLIP model and build their own task heads.

@Xmaster6y
Author

Thanks for the clarification, I do agree that it would become cluttered.

@Xmaster6y
Author

Closing as not planned then.

@Xmaster6y Xmaster6y closed this Jun 18, 2024