Add video modality for InstructBLIP #30182

Merged: 19 commits into huggingface:main on Jun 25, 2024

Conversation

zucchini-nlp (Member):

What does this PR do?

I made these changes a month ago and forgot to contribute them. This PR adds video processing capabilities for InstructBLIP models. The paper states that InstructBLIP was trained and evaluated on video as well as images, and the original repo has some code showing how video inference works.

This feature seems to have some interest from the community (see here), so I believe we can add it.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


model_input_names = ["pixel_values"]

def __init__(
Contributor:

This can perhaps also be Copied from

Member Author:

Okay, let me try. Then I'll probably have to add "Ignore copy" on the preprocess method.

Contributor:

No, that's only if you add "Copied from" to the class. In that case you can add "Ignore copy" above the methods that you don't want copied.

NielsRogge (Contributor) left a comment:

Wow that's awesome, thanks for working on that!

I'm just concerned about 2 things:

  • making sure that we have a robust API for multimodal processors that is consistent
  • the current InstructBLIP models on the Hub all use BlipImageProcessor. This PR would introduce a new image processor, so I guess we would then need to update the auto mapping to make sure AutoImageProcessor still works as expected.

@@ -57,6 +57,7 @@ def __init__(self, image_processor, tokenizer, qformer_tokenizer):
def __call__(
self,
images: ImageInput = None,
videos: ImageInput = None,
NielsRogge (Contributor), Apr 11, 2024:

cc @molbap since we'd like to standardize multimodal processors, this one isn't making it easier 😅

At some point we will have a VideoTextToTextPipeline, and we'll need to make sure they all have the same API.

See also the ImageTextToTextPipeline being worked on in #29572. Technically, it could work if we just expect the following to work for any video-text-to-text model:

inputs = processor(videos=..., text=..., return_tensors="pt")

generated_ids = model.generate(**inputs, max_new_tokens=20)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)

Contributor:

Sure! Re normalizing processors: for models taking video inputs, vivit, videomae, and tvlt have videos: ImageInput, but tvp has videos: Union[ImageInput, List[ImageInput], List[List[ImageInput]]]. However, x_clip reuses videomae's processor.

Overall, ImageInput is defined as a type union of images or lists of images. It looks like in the future we might prefer supporting at least lists of lists of images, so a VideoInput defined as such could make sense, or a union of types as done in x_clip.
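For illustration, a minimal sketch of what such a VideoInput alias could look like; the name and exact members are assumptions for this discussion, not the library's actual definition:

from typing import List, Union

import numpy as np
import PIL.Image

# Hypothetical VideoInput alias: a single video as a list of frames, or a
# batch of videos as a list of lists of frames. Purely illustrative.
VideoInput = Union[
    List[PIL.Image.Image],        # one video as a list of PIL frames
    List[np.ndarray],             # one video as a list of arrays
    List[List[PIL.Image.Image]],  # a batch of videos
    List[List[np.ndarray]],
    np.ndarray,                   # one video as (num_frames, height, width, num_channels)
]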

zucchini-nlp (Member Author):

@NielsRogge

  1. making sure that we have a robust API for multimodal processors that is consistent
    Yeah, that needs to be reworked. Right now we only have Blip as the first model that supports videos, and there will be VideoLlava.

Unfortunately, VideoLlava processing is going a different way so that it can easily interleave modalities. I guess that is part of what is being discussed internally on Slack.

  2. the current InstructBLIP models on the hub all use BlipImageProcessor. This PR would introduce a new image processor, I guess we would then need to update the auto mapping (https://github.com/huggingface/transformers/blob/e516d1b19d035469b4852e34ba0356587e6f8ade/src/transformers/models/auto/image_processing_auto.py#L75) to make sure AutoImageProcessor still works as expected.

Yep, it seems like right now it only works when calling InstructBlipImageProcessor specifically.

amyeroberts (Collaborator) left a comment:

Thanks for working on adding this capability!

Two general comments:

  • We have to be careful here with the mappings with respect to backwards compatibility and expected behaviour. As a user, I should be able to do:
image_processor = InstructBlipImageProcessor()
images = image_processor(images, return_tensors="pt")

and get exactly the same output as I was getting before with the Blip image processor.

  • We should avoid adding lots of if-statements in existing modeling code and instead add a new model.

@@ -1368,11 +1368,46 @@ def forward(
>>> generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0].strip()
>>> print(generated_text)
The unusual aspect of this image is that a man is ironing clothes on the back of a yellow SUV, which is parked in the middle of a busy city street. This is an unconventional approach to ironing clothes, as it requires the man to balance himself and his ironing equipment on top of the vehicle while navigating through traffic. Additionally, the presence of taxis and other vehicles in the scene further emphasizes the unusual nature of this situation.

# To generate from video input
Collaborator:

The addition of all these if statements indicates that this should really be split into a different model. For things like torch.compile, we really want to avoid models whose inputs can have a varying number of dimensions, e.g. by adding InstructBLIPForVideoQuestionAnswering instead.

Comment on lines 130 to 131
if images is not None or videos is not None:
image_encoding = self.image_processor(images=images, videos=videos, return_tensors=return_tensors)
Collaborator:

This is going to break things. Either:

  • We add the new image processor to the auto map for existing checkpoints. This might lead to differences in output, as the image processor used is different. AFAICT the processing steps are the same but the output shape isn't.
  • We load the old image processors, which will break or emit warnings with the videos input.

size = get_size_dict(size, default_to_square=False)

if (images is None) ^ (videos is not None):
raise ValueError("InstructBLIP currently does not support both images and videos as input")
Collaborator:

Is it ever going to support both? Otherwise this message can be misleading.

# Ignore copy
def preprocess(
self,
images: ImageInput = None,
Collaborator:

If we're adding an image processor for InstructBlip, and it can process images, then it should process images consistently with how they were processed for the previous image processor (blip).

Whilst the processing steps are the same, i.e. the processed image is consistent, the output shape won't be, because this will add an extra axis to the output images, i.e. they become (batch_size, num_frames, num_channels, height, width). Instead, we should keep the same output shape for images; this will allow us to add this to the image_processing_auto.py mapping for InstructBlip.

zucchini-nlp (Member Author):

@amyeroberts
I see, but I am not sure how to make a new model while keeping it easy for users to load and for us to maintain. For example, we could make a separate processing class for the video modality, which could later be enhanced with new features, call it InstructBlipVideoProcessing, and another class for video-based generation, InstructBlipForConditionalVideoGeneration? For Blip, which will never interleave the two vision modalities at the same time, that might work. But I also want to make sure there will be consistency in how video+image modality LLMs are handled in transformers.

What if:

  1. The image processor returns one of two possible tensors: "pixel_values_image" (4-dim tensor) or "pixel_values_video" (5-dim tensor). This is the way I did it for VideoLlava, which actually can generate from both modalities interleaved at the same time.
    1.1 Or maybe make the image and video processors separate classes, and call the appropriate one during processing.
  2. Modeling stays the same, but we add one line which expands image dimensionality to (batch_size, 1, num_channels, height, width) and keep the rest of the code as if a 1-frame video were passed. All the conditions can then be removed if we treat vision inputs as video-like: 1 frame for an image and 4 frames for a video clip (see the sketch below).
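To illustrate option 2, a minimal sketch in PyTorch; the helper name and frame counts are assumptions for illustration, not the PR's actual code:

import torch

# Hypothetical helper: give images a singleton frame axis so images and
# videos can share one code path downstream.
def as_framewise(pixel_values: torch.Tensor) -> torch.Tensor:
    if pixel_values.ndim == 4:                    # (batch, channels, height, width): an image
        pixel_values = pixel_values.unsqueeze(1)  # -> (batch, 1, channels, height, width)
    return pixel_values                           # videos are already (batch, frames, channels, height, width)

images = torch.randn(2, 3, 224, 224)     # batch of 2 images
videos = torch.randn(2, 4, 3, 224, 224)  # batch of 2 four-frame clips
print(as_framewise(images).shape)  # torch.Size([2, 1, 3, 224, 224])
print(as_framewise(videos).shape)  # torch.Size([2, 4, 3, 224, 224])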

zucchini-nlp (Member Author):

@amyeroberts ping

molbap (Contributor) commented May 13, 2024:

@zucchini-nlp one related thing I'll merge this week, regarding processors: in #30511 I added a VideosKwargs but no VideosInput yet. Personal opinion from having spent too much time around processors: I think a separate model would actually be easier to maintain than patching a previous one, because it's easier to understand what the model does to which modality than with mixed modalities. I don't have the final say on this, so just a comment :)

amyeroberts (Collaborator):

@zucchini-nlp Sorry for the late reply here.

Modeling stays the same, but we add one line which expands image dimensionality so that it is (batch_size, 1, num_channels, height, width) and keep the rest of code as if it were a 1-frame video passed. All the conditions then can be removed, if we treat vision inputs as video-like but 1-frame for image and 4-frame for video clip.

This is a neat solution. My main question would be, what does this do to the shapes of the returned tensors e.g. hidden_states? If they remain the same, then this is a nice easy way to enable this within the modeling file.

If they're not the same, then we'll need to add a new class in the modeling file, e.g. InstructBlipForVideoConditionalGeneration, or perhaps have a new modeling file, src/transformers/models/instructblip/modeling_instructblip_video.py. The latter would add a new model type, e.g. VideoInstructBlip / InstructBlipVideo, which would enable the correct auto mapping in image_processing_auto.py. It's not unheard of for us to add new models which can load existing checkpoints under a new architecture name.

In terms of the image processor:

  • If we can use the same class and just expand within the forward pass, then the trick is to correctly batch the inputs such that for a batch of images the shape of the output pixel values remains (batch_size, num_channels, height, width), and for videos it's (batch_size, num_frames, num_channels, height, width). Both should be output as pixel_values. That is, the output of the image processor for images should remain unchanged from the previous behaviour: same shape, and same key names in the BatchFeature output.
  • If we can't use the same class, but add a single module in modeling_instructblip, you can do as above.
  • If we can't use the same class, but add a new modeling file, then we can add a new separate image processor.

My vote would be for a new model. Having to handle videos and images within the same image processor is a good indication they should be separated.

zucchini-nlp (Member Author):

Thanks for the detailed explanations! At this point it should be possible to go with the first option; I will have to go back and check. If not, making a separate model sounds good, since BLIP will never work with both modalities at the same time.

@molbap I see, probably having separate VideoProcessor files would be a better solution for multi-modal models, instead of packing it all into ImageProcessor.

zucchini-nlp (Member Author):

I went for the "keep one processing file and one modeling file" way. Currently the following changes are applied:

  1. The processor can accept either images or videos as an argument.
  2. The image processor returns pixel values of shape (b, c, h, w) for images by squeezing the extra dim, and adds an extra frame dimension for videos.
  3. The modeling file unsqueezes the frame dimension back for images and continues running as if all the inputs were videos. Finally, the frame dimension is merged into the embeddings' sequence length (see the sketch below).

Slow tests are passing locally, and I added a few tests for video generation.
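As a rough illustration of step 3 (not the actual modeling code; the helper name and shapes are assumptions):

import torch

# Hypothetical helper: after the vision encoder, fold the frame axis into the
# sequence axis so downstream attention sees one long sequence per sample.
def fold_frames_into_sequence(vision_embeds: torch.Tensor) -> torch.Tensor:
    batch, frames, seq_len, hidden = vision_embeds.shape
    return vision_embeds.reshape(batch, frames * seq_len, hidden)

image_embeds = torch.randn(2, 1, 257, 1408)  # images carried as 1-frame "videos"
video_embeds = torch.randn(2, 4, 257, 1408)  # 4-frame clips
print(fold_frames_into_sequence(image_embeds).shape)  # torch.Size([2, 257, 1408])
print(fold_frames_into_sequence(video_embeds).shape)  # torch.Size([2, 1028, 1408])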

amyeroberts (Collaborator):

Thanks for the continued work on this, and apologies for the delay in my review. Skimming over the PR and thinking about this more, I think we should split this up so that we have a video image processor and a video model. We want to avoid conditional outputs from our base processing objects, as well as conditional logic within our models' forward passes. As we won't interleave images and videos, i.e. we don't expect a user to be using video and images at the same time, we don't need to be able to handle these all at once.

zucchini-nlp (Member Author):

@amyeroberts Okay, I made Video InstructBlip its own model with its own modeling and image processing files. I added some tests and indicated in the model doc that it's the same model as InstructBlip except for the video processing capability. Ready for review.

@ArthurZucker I made more changes to the diff converter in this PR, as it didn't work in some cases. Specifically:

  • Some models use CamelCase without an underscore splitting the subwords, like InstructBlip. The diff converter cannot infer the correct model name in this case, so I added the possibility to indicate model names by passing them as args to the converter (see the sketch at the end of this comment).
  • If we have a config that inherits from another config in the library, the auto-generated config doesn't get all the imports from its parent class, causing errors. I added a special visited_module_config to store config-specific classes. I could have used the visited_module that is already defined, but that would result in configs being imported into the config file.
  • Sometimes configs need globally defined vars, e.g. the logger, so there's a new line that ports "SimpleStatements" to configs. If they are not used, everything will be cleaned up by ruff anyway.
  • Properties and their setters weren't being ported because they have identical names, so the fix is to iterate over "node.body" directly without saving it in a dict.

Plus, these are the changes from the LLaVa-NeXT-Video PR; I'm just duplicating them here to have everything written down somewhere :)

  • Sometimes we want to add new methods in a class and still retain all methods from the parent. This case wasn't covered; I added a few lines to fix it.
  • Inferring the model name from the file name didn't consider long model names with an underscore; fixed by modifying the regex.

I still have to fix cases where only the "init" docstring is changed; I couldn't make that work yet.
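For context, a toy illustration of why the model name can't be inferred for CamelCase names without underscores; this helper is hypothetical, not the converter's code:

import re

# Hypothetical name-inference helper: underscores in the diff file name mark
# where capital letters go; without them the split is unrecoverable.
def infer_camel_case(file_name: str) -> str:
    stem = re.sub(r"^diff_|\.py$", "", file_name)
    return "".join(part.capitalize() for part in stem.split("_"))

print(infer_camel_case("diff_llava_next.py"))    # LlavaNext - correct
print(infer_camel_case("diff_instructblip.py"))  # Instructblip - "InstructBlip" cannot be recovered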

amyeroberts (Collaborator) left a comment:

Looks great - thanks for splitting this up and adding this model!

Just a few comments and some questions about the changes to the diff converter. The main comment is about making sure generate is properly tested.

src/transformers/image_utils.py (comment resolved)
class InstructBlipVideoForConditionalGenerationDecoderOnlyTest(
ModelTesterMixin, GenerationTesterMixin, unittest.TestCase
):
all_model_classes = (InstructBlipVideoForConditionalGeneration,) if is_torch_available() else ()
Collaborator:

don't we need to define all_generative_models here too to properly test the generation mixin?

zucchini-nlp (Member Author), Jun 13, 2024:

GenerationMixin still can't work for VLMs, and I am planning to properly add it after some unification of VLM processors. Otherwise we'll have too many conditional checks inside the tests.

AFAIK, all VLMs are currently tested in IntegrationTests for that.

# fmt: off
self.python_module = python_module # we store the original module to use `code_for_node`
self.transformers_imports = {} # maps the imports name like "from transformers.models.xxx" to the parsed AST module
self.imported_mapping = {} # stores the name of the imported classes, with their source {"LlamaModel":"transformers.model.llama.modeling_llama"}
self.visited_module = {} # modules visited like "transformers.models.llama.modeling_llama"
self.visited_module_config = {} # modules visited like "transformers.models.llama.modeling_llama" in config file, needed to not mix config vs modeling imports
Collaborator:

This comment isn't completely clear to me, as transformers.models.llama.modeling_llama is a modeling import and not a config import, and the instructblip configuration file doesn't import transformers.models.llama.modeling_llama.

Member Author:

Sorry, maybe it needs another example. This was needed because the auto-generated config wasn't getting imports from its "super config". In other words, it was only copying imports from diff files, with unused ones removed by ruff.

One solution may be to indicate all imports in the diff, but if they aren't used, ruff removes them eventually. In my case, PretrainedConfig wasn't being imported, for example.

Collaborator:

Yep, this is also something I noticed; the import mixing was not handled super well.

@@ -457,13 +463,18 @@ def leave_ClassDef(self, original_node, updated_node):
f"Tried parsing the name of the imported package from {super_file_name}, could not extract the model name"
)

if super_file_name not in self.visited_module: # only extract classes once
visited_module = self.visited_module_config if "Config" in class_name else self.visited_module
Collaborator:

How well does this extend to other files we might have under the model folder e.g. do we need flags for visited_module_processor etc?

Member Author:

I'm not sure if processors work with diff. I actually tried to add the image processor to the diff, but it got messed up, so I believe it doesn't support that yet.

Collaborator:

Not yet supported, but planned for sure.

Comment on lines +568 to +577
parser.add_argument(
"--old_model_name",
required=False,
help="The name of the model from which the copying is done in CamelCase. If not provided is inferred from diff-file",
)
parser.add_argument(
"--new_model_name",
required=False,
help="The name of the new model being added in CamelCase. If not provided is inferred from diff-file",
)
Collaborator:

Is this for models with composite config files e.g. CLIPVisionConfig which we don't want to infer as CLIPVision?

zucchini-nlp (Member Author), Jun 13, 2024:

This is for models that use CamelCase without underscores, like InstructBlip (instructblip) vs LlavaNext (llava_next). In the second case we can infer where to put a capital letter, while in the former it's impossible, so I decided to give users the freedom to pass model names.

Collaborator:

That is a great addition indeed! I would even add this comment in the help 😉

@@ -474,7 +485,7 @@ def leave_ClassDef(self, original_node, updated_node):
start_insert_idx = self.global_scope_index
for dependency, _ in list_dependencies:
node = class_finder.global_nodes.get(dependency, None)
if node is not None:
if node is not None and "Config" not in class_name:
Collaborator:

Same q here - do we need to account for all other classes e.g. "Processor" not in class_name?

Member Author:

This is added because the diff was importing configs into the configuration files, even though they are defined as classes a few lines below.

Collaborator:

It would be good to get a second review of the changes in this file from @ArthurZucker.

Collaborator:

Overall good to me!
Separating config imports is the way to go, and further separating processor imports later on will be needed as well.

ArthurZucker (Collaborator) left a comment:

The diff converter changes look nice IMO, but we should not need to import all the classes. The new diff converter is able to parse dependencies, so tell me if this is not the case!

Collaborator:

I am late to the party here, but normally you should not have to import all these classes. The diff converter will automatically detect dependencies and copy the classes that are required!
Unless this is an edge case?

Member Author:

We discussed this on Slack and decided we shouldn't have separate imports for each file, and should instead let ruff clean out unnecessary ones. So I'm manually filtering out the issue with configs and adding all imports. That worked for InstructBlip.


zucchini-nlp (Member Author):

@amyeroberts this one is ready for the final review, I guess.

amyeroberts (Collaborator) left a comment:

Another great addition - thanks!

As with Llava Next Video, we just need a final run of the slow tests and we're good to merge ❤️

zucchini-nlp (Member Author):

Got the CI green, including slow tests. Will merge the PR.

zucchini-nlp merged commit fc689d7 into huggingface:main on Jun 25, 2024. 26 checks passed.