
Add InstructBLIP #23460

Merged · 7 commits merged on Jun 26, 2023

Conversation

NielsRogge
Contributor

@NielsRogge NielsRogge commented May 19, 2023

What does this PR do?

This PR adds InstructBLIP, a visual instruction tuned version of BLIP-2.

It's a bit like an open-source multimodal GPT-4, leveraging Flan-T5 and Vicuna pre-trained checkpoints.

Basic usage is as follows:

from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration
import torch
from PIL import Image
import requests

model = InstructBlipForConditionalGeneration.from_pretrained("...")
processor = InstructBlipProcessor.from_pretrained("...")

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

url = "https://raw.githubusercontent.com/salesforce/LAVIS/main/docs/_static/Confusing-Pictures.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
prompt = "What is unusual about this image?"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)

outputs = model.generate(
    **inputs,
    do_sample=False,
    num_beams=1,
    max_length=256,
    min_length=1,
    top_p=0.9,
    repetition_penalty=1.5,
    length_penalty=1.0,
    temperature=1,
)
generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0].strip()
print(generated_text)

To do:

  • discuss whether to integrate the QFormerTokenizer into the processor
  • integration tests
  • figure out the best way to handle the various dtypes of the vision encoder and language model

Nice to haves:

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented May 19, 2023

The documentation is not available anymore as the PR was closed or merged.

@Reveyer

Reveyer commented May 21, 2023

Thank you for your contribution! I noticed a potential problem with this open PR. It seems that the InstructBLIP processor is missing the QformerTokenizer compared to the BLIP2Processor.

Collaborator

@sgugger sgugger left a comment


Thanks for working on this new model. It's mostly in good shape apart from the floating dtypes in the forward. Like all models in Transformers, this should run in the default precision and users can decide to use another dtype for parts (or all) of their models, but we can't hard-code this.
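A minimal sketch of that user-side approach, assuming the sub-modules are exposed as vision_model and language_model (as in BLIP-2); the checkpoint name and attribute names are assumptions for illustration, not part of this PR:

import torch
from transformers import InstructBlipForConditionalGeneration

# Load in the default precision, then cast individual sub-modules as desired.
model = InstructBlipForConditionalGeneration.from_pretrained("Salesforce/instructblip-flan-t5-xl")
model.vision_model.to(torch.float16)     # vision encoder in float16
model.language_model.to(torch.bfloat16)  # Flan-T5 language model in bfloat16
# Note: running inference with mixed dtypes also requires casting the inputs
# (e.g. pixel_values) to the matching dtypes.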

@NielsRogge
Contributor Author

Thanks for your review. Updates:

  • all autocast logic was removed; it turns out the implementation returns exactly the same logits as the original implementation when the original is also run in float32. However, we may need to think about supporting different dtypes for the building blocks of a model, because doing from_pretrained("...", torch_dtype=torch.float16) would break for the Flan-T5 checkpoints, which require bfloat16. It would be nice to provide the possibility to load the vision encoder in float16 and the language model in bfloat16.
  • The InstructBlipProcessor is a bit different from other processors in the sense that it consists of 1 image processor and 2 tokenizers (one for the language model, one for the Q-Former). I've included logic to save the Q-Former tokenizer files in a separate folder on the hub, as can be seen here, and had to override the from_pretrained and save_pretrained methods to make this work. I know that this logic may need to be addressed in a separate PR. A usage sketch of the resulting processor follows below this list.
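As referenced in the second bullet above, here is a minimal usage sketch of the composite processor; the checkpoint name and the exact output keys are assumptions based on the description, not guaranteed by this PR:

from PIL import Image
from transformers import InstructBlipProcessor

processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-flan-t5-xl")

image = Image.new("RGB", (224, 224))  # dummy image for illustration
inputs = processor(images=image, text="Describe the image.", return_tensors="pt")

# The image processor contributes pixel_values, the language-model tokenizer
# contributes input_ids/attention_mask, and the Q-Former tokenizer is expected
# to contribute qformer_input_ids/qformer_attention_mask.
print(sorted(inputs.keys()))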

Collaborator

@sgugger sgugger left a comment


Thanks for iterating! Changes LGTM apart from the modifications in the push to hub mixin which really need to go in their own PR.

src/transformers/utils/hub.py (review comment, outdated, resolved)
@yukw777

yukw777 commented May 31, 2023

Will the converted weights be hosted on the model hub like blip-2?

@NielsRogge
Contributor Author

NielsRogge commented Jun 5, 2023

All checkpoints are transferred: https://huggingface.co/models?other=instructblip.

Feel free to merge the PR.

The only thing left is uploading fast tokenizer files for the Vicuna-based checkpoints, but that can only be done once #23889 is fixed. Currently the fast tokenizer is created on-the-fly based on the slow tokenizer files when loading from the hub.

Update: that's now also done, so it's entirely ready
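As a small illustration of the on-the-fly conversion mentioned above (the checkpoint name is assumed): when only the slow SentencePiece tokenizer files are on the hub, requesting a fast tokenizer converts them at load time; once a tokenizer.json is uploaded, it is loaded directly.

from transformers import AutoTokenizer

# Either path ends up returning a LlamaTokenizerFast for the Vicuna-based checkpoints.
tokenizer = AutoTokenizer.from_pretrained("Salesforce/instructblip-vicuna-7b", use_fast=True)
print(type(tokenizer).__name__)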

Collaborator

@sgugger sgugger left a comment


@NielsRogge Please do not resolve comments without addressing them or explain why you refuse them.

@sgugger
Collaborator

sgugger commented Jun 5, 2023

@amyeroberts Could you have a final look and merge if you are happy?

Collaborator

@amyeroberts amyeroberts left a comment


Thanks for adding this model!

There are still a few small things to address before it's ready to merge. Main comments:

  • Docstrings for some of the main objects aren't correct - mostly missing inputs that need to be added
  • The tolerance for the logits check between the converted and original models is very high. Have you dug into this at all? Do you know where these differences are coming from?
  • Possibly missing tests? e.g. there's InstructBlipTextModelTester but no InstructBlipTextModelTest, and some tests for InstructBlipModel are skipped because they're run in individual model tests. It would be good to get @ydshieh's insight on this as he's both the composite model and testing king 👑

src/transformers/utils/hub.py (review comment, outdated, resolved)
tests/models/instructblip/test_modeling_instructblip.py (3 review comments, outdated, resolved)
@ydshieh
Collaborator

ydshieh commented Jun 6, 2023

There's InstructBlipTextModelTester but no InstructBlipTextModelTest

In general, I would say yes to having a 1-1 correspondence. But I don't want to make it a strict rule if it doesn't really bring anything valuable.

The pipeline testing script would be easier to maintain if we had such a correspondence, but since I was already able to manage BLIP-2, and this test file is similar to BLIP-2's, I think it's fine.

and some tests for InstructBlipModel are skipped because they're run in individual model tests.

It's the same as the CLIP test file, so it's OK :-)

@amyeroberts
Collaborator

@ydshieh Thanks for reviewing & info about the tests!

and some tests for InstructBlipModel are skipped because they're run in individual model tests.
It's same as CLIP test file, so it's OK :-)

Ah, sorry, I wasn't clear. What I meant was: if tests are skipped with the reason that they're already covered by individual model tests, don't we need the modular test classes implemented, i.e. InstructBlipTextModelTest?

@ydshieh
Collaborator

ydshieh commented Jun 6, 2023

Ah, sorry, I wasn't clear. What I meant was: if tests are skipped with the reason that they're already covered by individual model tests, don't we need the modular test classes implemented, i.e. InstructBlipTextModelTest?

I agree (I was thinking the same, but it got lost in my reply).

@NielsRogge I will let you explain why there is no text model test class :-), which is the same as in BLIP2.

Well, after looking a bit, the text part is not a fixed model class:

        if config.use_decoder_only_language_model:
            language_model = AutoModelForCausalLM.from_config(config.text_config)
        else:
            language_model = AutoModelForSeq2SeqLM.from_config(config.text_config)

I think that's the main reason why we don't have the test for that part.

@kfallah

kfallah commented Jun 12, 2023

Hi, will this land soon? I would love to try out this model. Thanks!

@NielsRogge
Contributor Author

NielsRogge commented Jun 12, 2023

Thanks @amyeroberts for your review. There was a bug with LlamaTokenizerFast that has now been fixed; the absolute tolerance is now much lower (1e-4 and 1e-5).

I've removed InstructBlipModel from this PR, as it was copied from Blip2Model using the CookieCutter template. The latter was added in #21817. However, I'm not sure why that got approved, because it's not really in line with the design of the library: xxxModel classes are models without any head on top that don't accept a labels argument, whereas Blip2Model seems to be an entire copy of Blip2ForConditionalGeneration, which seems odd to me.
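For context, the logits comparison in the integration tests follows this kind of pattern; the tensors below are random stand-ins rather than real model outputs, so this is purely illustrative:

import torch

# Compare outputs of the ported and original implementations within a small
# absolute tolerance (the real tests use actual model logits).
hf_logits = torch.randn(1, 8, 32000)
original_logits = hf_logits + 1e-6 * torch.randn_like(hf_logits)
assert torch.allclose(hf_logits, original_logits, atol=1e-4)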

Collaborator

@amyeroberts amyeroberts left a comment


Thanks again for adding this model!

Just a few small nit comments

README_es.md (review comment, outdated, resolved)
tests/models/instructblip/test_modeling_instructblip.py (3 review comments, outdated, resolved)
Comment on lines +1345 to +1287
def get_input_embeddings(self):
    return self.language_model.get_input_embeddings()

def set_input_embeddings(self, value):
    self.language_model.set_input_embeddings(value)

def set_output_embeddings(self, new_embeddings):
    self.language_model.set_output_embeddings(new_embeddings)

def get_output_embeddings(self) -> nn.Module:
    return self.language_model.get_output_embeddings()
Collaborator


I'd argue this is a bit confusing: if I call model.get_input_embeddings, it's not obvious which module the embeddings come from.

@zdxff

zdxff commented Jun 16, 2023

Does the prompt need any further formatting at inference time? For example, BLIP-2 uses "Question: {prompt}? Answer:" as a prompt template. Which type of prompt should be used for InstructBLIP, or do we just ask the model a question directly?

@amyeroberts
Collaborator

@NielsRogge It appears that the current diff contains some changes unrelated to this PR? Could you rebase to sync up with main? Could you also respond to the questions in the PR review instead of just marking them as resolved?

@ydshieh
Collaborator

ydshieh commented Jun 24, 2023

Well 💚

@ydshieh
Collaborator

ydshieh commented Jun 26, 2023

Merging it now, as CI is 🟢 and the PR is approved.

@ydshieh ydshieh merged commit 868363a into huggingface:main Jun 26, 2023
23 checks passed
@NielsRogge
Contributor Author

Hi @zdxff, there's no specific prompt template used for InstructBLIP. You can just ask it questions like "What is unusual about this image?"

@younesbelkada
Contributor

younesbelkada commented Jun 26, 2023

Will work on the 8bit / 4bit integration ASAP !

EDIT: here you go #24488
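A rough sketch of what the 8-bit loading path looks like (requires bitsandbytes and accelerate to be installed; the checkpoint name is illustrative, see #24488 for the actual integration):

from transformers import InstructBlipForConditionalGeneration

# Quantize the weights to 8-bit on load and let accelerate place the modules.
model = InstructBlipForConditionalGeneration.from_pretrained(
    "Salesforce/instructblip-vicuna-7b",
    load_in_8bit=True,
    device_map="auto",
)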
