🚨 Add Blip2ForImageTextRetrieval #29261
Conversation
cc @NielsRogge and @younesbelkada if one of you wants to review once @jpizarrom makes the CIs go green!
Hi, what could I do to make the CIs go green? Shall I just merge upstream/main into my branch, or rebase onto it?
@jpizarrom It's preferable for you to rebase onto main. To see how to make the CIs green, you'll need to click on
Thanks for adding this! Overall looks great, just a few small comments.
Once they're addressed we can move the checkpoints to be under the Salesforce org.
@classmethod
def from_vision_qformer_configs(
    cls,
    vision_config: Blip2VisionConfig,
    qformer_config: Blip2QFormerConfig,
    **kwargs,
):
    r"""
    Instantiate a [`Blip2Config`] (or a derived class) from BLIP-2 vision and Q-Former model configurations.

    Returns:
        [`Blip2Config`]: An instance of a configuration object
    """
    return cls(
        vision_config=vision_config.to_dict(),
        qformer_config=qformer_config.to_dict(),
        **kwargs,
    )
I don't think it's necessary to add a separate method here. We can just make text_config optional in from_vision_qformer_text_configs.
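A rough sketch of what that could look like (the Optional default and the None handling are assumptions, not the merged implementation):

@classmethod
def from_vision_qformer_text_configs(
    cls,
    vision_config: Blip2VisionConfig,
    qformer_config: Blip2QFormerConfig,
    text_config: Optional[PretrainedConfig] = None,  # now optional
    **kwargs,
):
    return cls(
        vision_config=vision_config.to_dict(),
        qformer_config=qformer_config.to_dict(),
        # only serialize the language model config when one is provided
        text_config=text_config.to_dict() if text_config is not None else None,
        **kwargs,
    )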
from_vision_qformer_configs was removed.
if self.device != torch.device("cpu"):
    with torch.cuda.amp.autocast(dtype=torch.float16):
        vision_outputs = self.vision_model(
            pixel_values=pixel_values,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
else:
    vision_outputs = self.vision_model(
        pixel_values=pixel_values,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )
Autocasting and typing should be handled outside of the model definition
vision_outputs = self.vision_model(
    pixel_values=pixel_values,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)
This was done because in the original model the autocast was applied only to the vision layers; I don't know yet how to do this in a different way.
cc @amyeroberts
it was removed, as discussed in #29261 (comment)
if config.use_qformer_text_input:
    self.embeddings = Blip2TextEmbeddings(config)
Instead of using this config argument to conditionally create and call this layer, I'd suggest instead calling self.embeddings if input_ids is not None.
self.embeddings = Blip2TextEmbeddings(config)
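A minimal sketch of that pattern in the forward pass (the surrounding signature is assumed; only the conditional is the point):

# inside Blip2QFormerModel.forward (sketch)
if input_ids is not None:
    # run the text embeddings only when text input is actually provided,
    # instead of gating on config.use_qformer_text_input
    query_embeds = self.embeddings(
        input_ids=input_ids,
        position_ids=position_ids,
        query_embeds=query_embeds,
    )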
When this layer is always created, I get the following errors and don't know how to fix them. Some BLIP-2 models do not use these BERT-based embeddings; they use OPT or Flan-T5 to create the query_embeds. Maybe I could try to refactor the code to move Blip2TextEmbeddings outside of Blip2QFormerModel and always pass query_embeds. What do you think?
FAILED tests/models/blip_2/test_modeling_blip_2.py::Blip2ForConditionalGenerationDecoderOnlyTest::test_training_gradient_checkpointing - AssertionError: False is not true : qformer.embeddings.word_embeddings.weight in Blip2ForConditionalGeneration has no gradient!
FAILED tests/models/blip_2/test_modeling_blip_2.py::Blip2ModelTest::test_training_gradient_checkpointing - AssertionError: False is not true : qformer.embeddings.word_embeddings.weight in Blip2ForConditionalGeneration has no gradient!
I did a refactor: the embeddings were removed from Blip2QFormerModel and placed into Blip2ForImageTextRetrieval and Blip2TextModelWithProjection, but to do so I needed to add a query_length param to Blip2QFormerModel.forward.
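Roughly, the resulting call pattern might look like this (a sketch based on the description above; attribute and argument names other than query_length are illustrative):

# sketch: the caller (e.g. Blip2ForImageTextRetrieval) now owns the text embeddings
query_tokens = self.query_tokens.expand(image_embeds.shape[0], -1, -1)
query_embeds = self.embeddings(input_ids=input_ids, query_embeds=query_tokens)

qformer_outputs = self.qformer(
    query_embeds=query_embeds,
    query_length=query_tokens.shape[1],  # new parameter on Blip2QFormerModel.forward
    encoder_hidden_states=image_embeds,
    encoder_attention_mask=image_attention_mask,
)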
# past_key_values_length
past_key_values_length = (
    past_key_values[0][0].shape[2] - self.config.query_length if past_key_values is not None else 0
)

query_length = query_embeds.shape[1] if query_embeds is not None else 0

embedding_output = self.layernorm(query_embeds)
if self.config.use_qformer_text_input:
if input_ids is not None:
This is outdated, because the embeddings were removed from Blip2QFormerModel.
# TODO: maybe have a cleaner way to cast the input (from `Blip2Processor` side?)
expected_dtype = self.dtype
if encoder_hidden_states is not None and encoder_hidden_states.dtype != expected_dtype:
    encoder_hidden_states = encoder_hidden_states.to(expected_dtype)
Is this even necessary?
Should not be necessary indeed given that modeling code is by default in torch.float32
it was removed
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
)

if self.device != torch.device("cpu"):
    with torch.cuda.amp.autocast(dtype=torch.float16):
As far as I can tell we don't add torch.cuda.amp.autocast code to modeling files; they are just in float32 by default. This was discussed on the original BLIP-2 model addition PR from what I remember. It's up to users to call something like torch.cuda.amp.autocast themselves if they wish to load the model in a different precision than the default one (cc @younesbelkada).
Hence in the conversion script I cast both the original weights and my BLIP-2 implementation to float32 in order to verify the conversion.
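For illustration, the user-side pattern is roughly the following (a runnable sketch with a stand-in module; in practice the wrapped call would be the BLIP-2 forward pass):

import torch
import torch.nn as nn

# stand-in for the model; its weights stay in the default float32
model = nn.Linear(16, 16).to("cuda")
inputs = torch.randn(2, 16, device="cuda")

# the caller opts into mixed precision, the modeling code does not
with torch.cuda.amp.autocast(dtype=torch.float16):
    outputs = model(inputs)

print(outputs.dtype)  # torch.float16 under autocast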
Ok, so this means that I need to remove maybe_autocast from https://github.com/NielsRogge/LAVIS/blob/blip2_float32/lavis/models/blip2_models/blip2_image_text_matching.py#L57-L58, right?
Yes that's right
It was removed; a PR was opened on your fork to also remove the autocast from the ITM model: NielsRogge/LAVIS#1
@@ -84,6 +84,99 @@ def to_tuple(self) -> Tuple[Any]:
    )


@dataclass
class Blip2ImageTextMatchingModelOutput(ModelOutput):
Not sure if feasible, but it'd be nice to match the output class of CLIP, which is also an image-text matching model. It consists of the following keys:
- loss
- logits_per_image (this I assume is the itm_score)
- logits_per_text (this I assume is the itm_score transposed)
- and some other keys which are CLIP-specific.
Making sure that Blip2ForImageTextRetrieval matches this would allow it to be added to the zero-shot image classification pipeline, which relies on this output key:
"logits": outputs.logits_per_image, |
Otherwise we will have a hard time adding BLIP-2 support to the zero-shot image classification pipeline.
Hi @NielsRogge, I updated the output to match the CLIP output, but this PR is not being updated with my latest commits.
Thanks for your work! I would request some changes, however, in order to be able to make BLIP-2 compatible with the zero-shot image classification pipeline.
input_ids: Optional[torch.FloatTensor] = None,
position_ids: Optional[torch.LongTensor] = None,
query_embeds: Optional[torch.FloatTensor] = None,
past_key_values_length: int = 0,
past_key_values are not used I assume, so the `past_key_values_length: int = 0,` argument can be removed.
it was removed. thanks
@jpizarrom once the CI is green I can assign a core maintainer for a final approval.
I believe the CI errors are not related to this branch; I see modeling_mra and other unrelated error logs. I don't know how to make the CI green; maybe rebase onto a more recent commit of the main branch?
Could you rebase on main and push?
Thanks,
Ok, pinging @ydshieh here
Hi, I merged main again, there are some errors in
Those could be ignored. But we could probably get them fixed in another PR soon.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for working on this - exciting to have this feature finally added!
Mostly a few nits. My main comment is about identifying the cause of the change in the integration test values and possibly rectifying it, given it indicates a degradation in performance.
@@ -347,6 +361,6 @@ def from_vision_qformer_text_configs(
     return cls(
         vision_config=vision_config.to_dict(),
         qformer_config=qformer_config.to_dict(),
-        text_config=text_config.to_dict(),
+        text_config=text_config.to_dict() if text_config is not None else None,
Making this optional is a bit funny given the name of the method. We should at least update the docstring to indicate that the language model config is optional.
The docstring was updated.
assert unexpected_keys == ["qformer.embeddings.position_ids"]

if "itm" in model_name:
    unexpected_keys = list(filter(lambda x: not x.startswith("Qformer.cls"), unexpected_keys))
Why is this filtering necessary here?
There are some keys from the original model that were excluded:
Qformer.cls.predictions.bias, Qformer.cls.predictions.transform.dense.weight, Qformer.cls.predictions.transform.dense.bias, Qformer.cls.predictions.transform.LayerNorm.weight, Qformer.cls.predictions.transform.LayerNorm.bias, Qformer.cls.predictions.decoder.weight
    [2, 15610, 1597, 2977, 6, 13011, 1594, 43052, 50118],
)
- self.assertEqual(generated_text, "it's not a city, it's a beach")
+ self.assertEqual(generated_text, "san diego, california")
Hmmmm - this doesn't seem right (the picture is indeed of a beach, not a city).
Could you try the following to see if you're able to recover the previous generations:
- Try without the additional generation kwargs
- Try without the added scaling included in the modeling file c.f. https://github.com/huggingface/transformers/pull/29261/files#r1501123761
Ok, I will change the test. I was only trying to get a similar result to test_inference_t5, since in that one the answer shows san diego. It was not related to the scaling change.
Hi, thanks for the feedback, I am making the suggested changes. I don't know which values from the integration tests you are referring to that indicate performance degradation; could you give more context?
Thanks for all the work adding this!
The only thing left to do is to update the checkpoint references to point to ones under the Salesforce org.
Shall I do it? Can I publish a model under the Salesforce org?
@@ -79,6 +82,12 @@ def create_rename_keys(config):
    # QFormer
    rename_keys.append(("Qformer.bert.embeddings.LayerNorm.weight", "qformer.layernorm.weight"))
    rename_keys.append(("Qformer.bert.embeddings.LayerNorm.bias", "qformer.layernorm.bias"))
    rename_keys.append(("Qformer.bert.embeddings.word_embeddings.weight", "embeddings.word_embeddings.weight"))
Hi, I just found that I got an error converting the blip2-opt-2.7b model:
KeyError: 'Qformer.bert.embeddings.word_embeddings.weight'
I'm going to have to apply these rename keys for the itm models only.
Yes, it'd be great to keep backwards compatibility for the existing checkpoints, and also to make sure that users don't get warnings when loading them (like unexpected keys in checkpoint, etc.).
- I fixed the keys issues in the convert script.
- Then the logit comparison in the convert script for blip2-opt-2.7b was failing; I needed to revert a change I had made, so now the scale is applied after the dot product between "query" and "key" (see the sketch after this list). The slow tests in test_modeling_blip_2.py are passing:
RUN_SLOW=1 python -m pytest tests/models/blip_2/test_modeling_blip_2.py
- But the tests on CI are failing, apparently for a reason unrelated to this PR.
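A minimal illustration of the two orderings mentioned above (shapes and names are illustrative, not the actual Q-Former code):

import math

import torch

batch, heads, seq, head_dim = 1, 2, 4, 8
query = torch.randn(batch, heads, seq, head_dim)
key = torch.randn(batch, heads, seq, head_dim)

# scaling applied after the query/key dot product (the behaviour kept in this PR)
scores_after = torch.matmul(query, key.transpose(-1, -2)) / math.sqrt(head_dim)

# mathematically equivalent pre-scaling of the query; the tiny floating point
# differences can be enough to trip a strict logit comparison in a convert script
scores_before = torch.matmul(query / math.sqrt(head_dim), key.transpose(-1, -2))

print((scores_after - scores_before).abs().max())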
The convert script for blip2-opt-2.7b is failing in the generation step; it looks like this is not related to the changes of this PR, because I get the same error on the main branch:
ValueError: Input length of input_ids is 0, but `max_length` is set to -14. This can lead to unexpected behavior. You should consider increasing `max_length` or, better yet, setting `max_new_tokens`.
cc @zucchini-nlp who fixed this error; we might need to update the conversion script of BLIP-2 for that.
Can you check that the model's generation config has a high enough max_length (if it doesn't have max_length, the default is 20)? Right now BLIP will count all tokens (image and text) towards max_length, so we can either add a higher max_length or max_new_tokens in the model's generation config, so that one can run generation directly with model.generate(**inputs).
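For example, something along these lines (a sketch using the existing OPT checkpoint; the value 50 is illustrative):

import requests
from PIL import Image

from transformers import Blip2ForConditionalGeneration, Blip2Processor

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

# give generate() room for up to 50 new tokens instead of counting
# image + text tokens towards the default max_length of 20
model.generation_config.max_new_tokens = 50

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, text="Question: what is in the image? Answer:", return_tensors="pt")

generated_ids = model.generate(**inputs)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))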
I added max_length=50 to the generate call of the LAVIS model in the convert script; now the conversion of blip2-opt-2.7b works.
I'll make sure the checkpoints get transferred.
Hi @jpizarrom could you push the checkpoints to your HF profile so that I can transfer them to the Salesforce org?
Hi, I just ran the conversion script, so the checkpoints are updated at https://huggingface.co/jpizarrom/blip2-itm-vit-g
What does this PR do?
Add Blip2ForImageTextRetrieval, Blip2TextModelWithProjection, and Blip2VisionModelWithProjection models to be able to get image-text matching scores and extract text, image, and multimodal features.
Fixes part of #25300 and #25245.
This is a continuation of #25612; I tried to apply most of the feedback received in that PR.
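A rough usage sketch of the new retrieval head (the checkpoint name assumes the weights end up under the Salesforce org as discussed above, and the output fields assume the CLIP-style output class; both are assumptions, not the final API):

import requests
import torch
from PIL import Image

from transformers import Blip2ForImageTextRetrieval, Blip2Processor

checkpoint = "Salesforce/blip2-itm-vit-g"  # assumed final location of the converted weights
processor = Blip2Processor.from_pretrained(checkpoint)
model = Blip2ForImageTextRetrieval.from_pretrained(checkpoint)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, text="two cats sleeping on a couch", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# CLIP-style image-text matching logits, as discussed in the review
print(outputs.logits_per_image)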
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@ArthurZucker @amyeroberts