
Correct CLIPForImageClassification #30373

Closed
wants to merge 5 commits

Conversation

Xmaster6y

What does this PR do?

This PR fixes CLIPForImageClassification by using the pooler_output instead of the mean pooling of the last_hidden_state to compute the class logits. See the graph below.

Suggestion

Maybe we could offer a choice (in the config?) of which pooling method to use.

Performance

Here is a training comparison on a small dataset (everything else is identical between the two runs). All parameters are frozen except classifier.weight and classifier.bias.

[W&B chart, 4/21/2024, 5:37 PM]

(Pink uses the pooled output; orange uses mean pooling.)
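The two strategies being compared can be sketched with dummy tensors. This is a hypothetical NumPy stand-in, not the actual transformers implementation: shapes follow CLIPVisionModel's outputs, the layer norm applied to the CLS token in the real pooler is omitted, and the classifier weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dummy CLIPVisionModel outputs: batch 2, 50 tokens (1 CLS + 49 patches), hidden 8.
last_hidden_state = rng.normal(size=(2, 50, 8))
# In CLIP, pooler_output is the layer-normalized CLS token; the norm is omitted
# here for brevity, so this is just the CLS token.
pooler_output = last_hidden_state[:, 0, :]

classifier_weight = rng.normal(size=(8, 3))  # hypothetical 3-class head

# Current behaviour: mean-pool all tokens of last_hidden_state, then classify.
logits_mean = last_hidden_state.mean(axis=1) @ classifier_weight
# Behaviour proposed in this PR: classify from pooler_output instead.
logits_pooled = pooler_output @ classifier_weight

print(logits_mean.shape, logits_pooled.shape)  # (2, 3) (2, 3)
```

Both heads produce logits of the same shape; they differ only in which summary of the token sequence feeds the classifier.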

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Related to #28952 @NielsRogge

@amyeroberts
Collaborator

cc @NielsRogge

@NielsRogge
Contributor

NielsRogge commented Apr 24, 2024

Interesting, thanks for the plot! I decided to average pool the patch tokens by default as it was shown that this generally works better than using the final hidden state of the CLS token for downstream image classification.

Question for @amyeroberts: we could switch to using the pooler output by default, as the CLIPForImageClassification class is pretty new, or we could make it configurable (although adding attributes to an existing config is a bit of an anti-pattern in the Transformers library).

@amyeroberts
Collaborator

@Xmaster6y Could you re-run this experiment on the development branch now that the weight initialization fix has been merged into main?

Based on that we can make a better decision :). If there's little to no difference, or this change makes things worse, then there's nothing to change.

If there's still a large difference, then maybe.

@Xmaster6y
Author

Sounds good, I'll try it this weekend.

@Xmaster6y
Author

Xmaster6y commented Apr 28, 2024

My results hold. IMO this is expected: since CLIP was trained on the pooler output, the other tokens may be less relevant. Here, the mean-pooled variant might be overfitting on the additional "noise".

[W&B charts (two panels), 4/28/2024, 3:06 PM]

I proposed a change in the forward pass (not ideal) if we want to keep both. Otherwise, I think it's best to just change the target (and also do it for SigLIP?).

@Xmaster6y
Author

Xmaster6y commented Apr 29, 2024

Btw, I read a bit more about the article you shared, and something is bugging me. From the ViT paper:

[Screenshot of the ViT paper excerpt]

So I wouldn't put too much trust in the paper, since they didn't explore the learning rates in depth. And it might not apply here since I froze the encoder.

@amyeroberts
Collaborator

Hi @Xmaster6y, thanks for re-running!

It looks like they didn't explore it in depth w.r.t. the learning rates. And it might not apply here since I froze the encoder.

This is a very good example of why this is so difficult: there are many different ways one can implement this, and what's best will likely depend on the situation. Unless the model is already pre-defined, there's no clear implementation. In fact, this is a good indication that this task-specific head should never really have been added at all: the general rule for the library is to only add a head if there are checkpoints available. This case is a bit special because it's specifically mentioned in the OpenAI release.

What I would suggest is this: we follow the pattern in other image classification models, where we have an optional add_pooling_layer flag when constructing the image classification model and a use_mean_pooling flag to denote how things are pooled, e.g. like here for beit. To keep backwards compatibility, we might need another flag which controls the addition of the layernorm.

@Xmaster6y
Author

I get how to integrate use_mean_pooling in the config, but to be clear, are you suggesting removing CLIPForImageClassification and using add_pooling_layer? If we do that, it feels like this parameter should be in the CLIP(Text|Vision)Config, and thus the classification output/loss would only live in the VisionModelOutput. I can propose changes for that if necessary.

@amyeroberts
Collaborator

@Xmaster6y Apologies for the delay in my response here. No, I wasn't thinking about CLIPForImageClassification being removed. In fact, we'll need to keep it for backwards compatibility reasons.

I was thinking it would be more like beit, where the vision model, CLIPVisionTransformer, takes an optional argument add_pooling_layer. For beit this is added automatically for BeitForImageClassification. In this case, for CLIPForImageClassification, we can do something else:

  • If the config has e.g. use_mean_pooling=True then add_pooling_layer is False and we use the current behaviour (should remain the default)
  • If use_mean_pooling=False then add_pooling_layer=True

However, this would then also mean introducing an if/else statement within CLIPForImageClassification's forward pass.
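That conditional could look roughly like the sketch below. It is purely illustrative: use_mean_pooling follows the beit-style flag being discussed, NumPy stands in for torch tensors, and none of the names are actual transformers API.

```python
import numpy as np

def classification_logits(last_hidden_state, pooler_output, weight,
                          use_mean_pooling=True):
    """Hypothetical head: pick the pooling strategy from a config-style flag."""
    if use_mean_pooling:
        # Current (default) behaviour: average over all tokens.
        pooled = last_hidden_state.mean(axis=1)
    else:
        # Alternative: rely on the vision model's pooled output.
        pooled = pooler_output
    return pooled @ weight

rng = np.random.default_rng(1)
hidden = rng.normal(size=(2, 50, 8))      # (batch, tokens, hidden)
pooled_out = hidden[:, 0, :]              # stand-in for pooler_output
w = rng.normal(size=(8, 3))               # hypothetical 3-class head
print(classification_logits(hidden, pooled_out, w).shape)  # (2, 3)
```

The flag only switches which pooled representation feeds the head; the output shape is unchanged either way, which is what keeps the change backwards compatible in principle.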

All in all, I don't think we should make this change, as it adds complexity to the model. It's unfortunate that it's implemented this way, but thankfully it's easy for people to use the base CLIP model and build their own task heads.

@Xmaster6y
Author

Thanks for the clarification, I do agree that it would become cluttered.

@Xmaster6y
Author

Closing as not planned then.

@Xmaster6y Xmaster6y closed this Jun 18, 2024