
fixes clip interpolate #30783

Open
wants to merge 1 commit into base: main

Conversation

@nileshkokane01 (Contributor) commented May 13, 2024

What does this PR do?

Solves #30579

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@amyeroberts

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@amyeroberts (Collaborator) left a comment

Thanks for working on this!

A few general comments:

  • interpolate_pos_encoding should be a boolean
  • It should be possible to call the model with this flag, e.g. model(**inputs, interpolate_pos_encoding=True) (see the sketch after this list)
  • All the docstrings should be updated to include this argument
  • Print statements should be removed
  • Tests should make sure that the processed image being passed to the model is not the default size
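
A minimal sketch of the call pattern the bullets above describe, assuming the PR exposes interpolate_pos_encoding on the CLIP forward pass; the checkpoint, prompt, and fixture path mirror the tests in this PR and are illustrative, not prescriptive:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
inputs = processor(text=["what's in the image"], images=image, return_tensors="pt")

with torch.no_grad():
    # The flag is a plain bool and is forwarded as a keyword argument.
    outputs = model(**inputs, interpolate_pos_encoding=True)

Note that with the default image processor the pixel values come out at the checkpoint's training resolution (224x224 here), so a test additionally needs to force a larger processed size, as the size/crop_size comments further down point out.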

tests/models/x_clip/test_modeling_x_clip.py (comment resolved)
tests/models/altclip/test_modeling_altclip.py (comment resolved)
@@ -1099,6 +1136,7 @@ def forward(
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
interpolate_pos_encoding: Optional[bool] = False,
Collaborator:

The value should be True or False, but not None

Suggested change
- interpolate_pos_encoding: Optional[bool] = False,
+ interpolate_pos_encoding: bool = False,

image_processor = BridgeTowerProcessor.from_pretrained(model_name)

image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
inputs = image_processor(text="what's in the image", images=image, return_tensors="pt").to(torch_device)
Collaborator:

Same here

image_processor = ChineseCLIPProcessor.from_pretrained(model_name)

image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
inputs = image_processor(text="what's in the image", images=image, return_tensors="pt").to(torch_device)
Collaborator:

Same here

# to visualize self-attention on higher resolution images.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(torch_device)

image_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32", size=480)
Collaborator:

Three comments:

  • This is returning the processor, not the image_processor
  • size should be a dictionary, e.g. size={"shortest_edge": 480} here
  • This won't test the interpolation, because the image processor crops after resizing; crop_size also has to be overridden (a sketch follows the suggested change below)
Suggested change
- image_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32", size=480)
+ processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32", size=480)
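
For reference, a sketch of overriding both parameters so the interpolation path is actually exercised; the dictionary keys follow the CLIP image processor's size/crop_size conventions, and 480 is an illustrative value rather than one taken from the PR:

from PIL import Image
from transformers import CLIPProcessor

# Override both the resize target and the crop size; otherwise the centre crop
# brings the image back to the default 224x224 and no interpolation happens.
processor = CLIPProcessor.from_pretrained(
    "openai/clip-vit-base-patch32",
    size={"shortest_edge": 480},
    crop_size={"height": 480, "width": 480},
)

image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
assert pixel_values.shape[-2:] == (480, 480)  # the processed image is no longer the default size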

processor = AutoProcessor.from_pretrained("microsoft/kosmos-2-patch14-224", padding_side="left")

image = Image.open("./tests/fixtures/tests_samples/COCO/000000039769.png")
inputs = processor(text="what's in the image", images=image, return_tensors="pt").to(torch_device)
Collaborator:

Same here - parameters affecting output size have to be updated

# to visualize self-attention on higher resolution images.
model = XCLIPModel.from_pretrained("microsoft/xclip-base-patch32").to(torch_device)

image_processor = XCLIPProcessor.from_pretrained("microsoft/xclip-base-patch32", size=480)
Collaborator:

Same here:

  • returns a processor
  • needs to override crop size
  • size and crop_size should be dicts
