CLIPForImageClassification has been added #27805

Andron00e · 2023-12-02T15:12:10Z

What does this PR do?

This PR adds a new model to the hub, called CLIPForImageClassification.

Details about the implementation and the pre-trained version on the hub can be found in my repo.

My "New model" issue describing the idea
Model on hub link
Code link

Tags:
vision models: @amyeroberts

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
[] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

NielsRogge · 2023-12-04T12:41:05Z

src/transformers/models/clip/modeling_clip.py

+    """
+    Repo with custom implementation: https://github.com/Andron00e/CLIPForImageClassification
+    Pre-trained model on hub: https://huggingface.co/Andron00e/CLIPForImageClassification-v1
+    """


this can be removed

NielsRogge · 2023-12-04T12:43:02Z

src/transformers/models/clip/modeling_clip.py

+
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,


Suggested change

input_ids: Optional[torch.LongTensor] = None,

no text needs to be passed in case of image classification

You are right, "input_ids" is not needed in this case.
But it is a key input when using CLIP: in a further enhancement I plan to output "logits_per_image" instead of "clip_output.image_embeddings", and that requires "input_ids".

NielsRogge · 2023-12-04T12:43:17Z

src/transformers/models/clip/modeling_clip.py

+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        return_loss: Optional[bool] = None,


Suggested change

attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
return_loss: Optional[bool] = None,

Same here.

NielsRogge · 2023-12-04T12:43:55Z

src/transformers/models/clip/modeling_clip.py

+
+        logits = self.head(clip_outputs.image_embeds)
+
+        loss = None


The same loss logic as in the following DINOv2 snippet needs to be defined here:

transformers/src/transformers/models/dinov2/modeling_dinov2.py

Lines 721 to 757 in 4d4febb

logits = self.classifier(linear_input)

loss = None
if labels is not None:
    # move labels to correct device to enable model parallelism
    labels = labels.to(logits.device)
    if self.config.problem_type is None:
        if self.num_labels == 1:
            self.config.problem_type = "regression"
        elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
            self.config.problem_type = "single_label_classification"
        else:
            self.config.problem_type = "multi_label_classification"

    if self.config.problem_type == "regression":
        loss_fct = MSELoss()
        if self.num_labels == 1:
            loss = loss_fct(logits.squeeze(), labels.squeeze())
        else:
            loss = loss_fct(logits, labels)
    elif self.config.problem_type == "single_label_classification":
        loss_fct = CrossEntropyLoss()
        loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
    elif self.config.problem_type == "multi_label_classification":
        loss_fct = BCEWithLogitsLoss()
        loss = loss_fct(logits, labels)

if not return_dict:
    output = (logits,) + outputs[2:]
    return ((loss,) + output) if loss is not None else output

return ImageClassifierOutput(
    loss=loss,
    logits=logits,
    hidden_states=outputs.hidden_states,
    attentions=outputs.attentions,
)
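
For readers skimming the snippet above, the branching that picks a loss can be isolated as a small dependency-free sketch. The helper names `infer_problem_type` and `select_loss_name` and the `labels_are_integer` flag are hypothetical and exist only for this illustration; in the referenced code the check is done on `labels.dtype` and the result is stored on `self.config.problem_type`.

```python
# Hypothetical, dependency-free sketch of the problem-type branching shown above.
# The real code inspects labels.dtype (torch.long / torch.int) instead of a flag.

LOSS_FOR_PROBLEM_TYPE = {
    "regression": "MSELoss",
    "single_label_classification": "CrossEntropyLoss",
    "multi_label_classification": "BCEWithLogitsLoss",
}


def infer_problem_type(num_labels, labels_are_integer):
    """Mirror the fallback used when config.problem_type is None."""
    if num_labels == 1:
        return "regression"
    if num_labels > 1 and labels_are_integer:
        return "single_label_classification"
    return "multi_label_classification"


def select_loss_name(num_labels, labels_are_integer, problem_type=None):
    """Return the name of the loss class the snippet would instantiate."""
    if problem_type is None:
        problem_type = infer_problem_type(num_labels, labels_are_integer)
    return LOSS_FOR_PROBLEM_TYPE[problem_type]


# Integer class indices over 10 classes select the cross-entropy branch.
chosen = select_loss_name(num_labels=10, labels_are_integer=True)
```

An explicitly configured `problem_type` takes precedence over the inferred one, which matches the `if self.config.problem_type is None` guard in the snippet.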

NielsRogge

Overall it's a good idea to add a CLIPForImageClassification, but one would need to make sure weights can be loaded using from_pretrained.

Andron00e

In my opinion, it would be better to keep this structure of the forward pass for further enhancements.

Andron00e · 2023-12-04T13:41:34Z

src/transformers/models/clip/modeling_clip.py

+
+    def forward(
+        self,
+        input_ids: Optional[torch.LongTensor] = None,


You are right, "input_ids" is not needed in this case.
But it is a key input when using CLIP: in a further enhancement I plan to output "logits_per_image" instead of "clip_output.image_embeddings", and that requires "input_ids".
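
To illustrate why text inputs matter for that variant: in CLIP, "logits_per_image" are scaled cosine similarities between image embeddings and text embeddings, so they cannot be computed from pixels alone. The following is a minimal pure-Python sketch of that computation, not the library implementation; the function names and the plain `logit_scale` float (standing in for the exponentiated learned temperature) are assumptions made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def logits_per_image(image_embeds, text_embeds, logit_scale=1.0):
    """Return one row per image with one similarity score per text.

    Hypothetical helper: logit_scale is a plain float here, standing in
    for the exponentiated learned temperature of the real model.
    """
    images = [l2_normalize(v) for v in image_embeds]
    texts = [l2_normalize(v) for v in text_embeds]
    return [
        [logit_scale * sum(a * b for a, b in zip(img, txt)) for txt in texts]
        for img in images
    ]


# One image embedding scored against two text embeddings: the first text
# points in the same direction as the image, the second is orthogonal to it.
scores = logits_per_image([[3.0, 4.0]], [[3.0, 4.0], [-4.0, 3.0]], logit_scale=100.0)
```

A classification head on top of the image embeddings, as in this PR, needs no text embeddings at all, which is the point of the review suggestion.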

github-actions · 2024-01-08T08:03:44Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Andron00e added 4 commits December 2, 2023 17:57

CLIPForImageClassification has been added

2e24fc2

Update modeling_clip.py

2e4ec69

Update __init__.py

c3f9b69

Merge branch 'huggingface:main' into main

4efc8cd

ArthurZucker mentioned this pull request Dec 4, 2023

Update __init__.py #27811

Closed

3 tasks

Update src/transformers/models/clip/modeling_clip.py

a6bfcf9

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

Andron00e and others added 3 commits December 4, 2023 16:44

Update src/transformers/models/clip/modeling_clip.py

56d5aeb

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

Update src/transformers/models/clip/modeling_clip.py

7c08310

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

Update src/transformers/models/clip/modeling_clip.py

4344d09

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

github-actions bot closed this Jan 16, 2024
