Add docs on zero-shot image classification prompt templates #31343

Merged: 6 commits, Jun 19, 2024
Changes from 2 commits
7 changes: 5 additions & 2 deletions docs/source/en/model_doc/siglip.md
@@ -29,6 +29,7 @@ The abstract from the paper is the following:
- Usage of SigLIP is similar to [CLIP](clip). The main difference is the training loss, which does not require a global view of all the pairwise similarities of images and texts within a batch. One needs to apply the sigmoid activation function to the logits, rather than the softmax.
- Training is not yet supported. If you want to fine-tune SigLIP or train from scratch, refer to the loss function from [OpenCLIP](https://github.com/mlfoundations/open_clip/blob/73ad04ae7fb93ede1c02dc9040a828634cb1edf1/src/open_clip/loss.py#L307), which leverages various `torch.distributed` utilities.
- When using the standalone [`SiglipTokenizer`] or [`SiglipProcessor`], make sure to pass `padding="max_length"` as that's how the model was trained.
- To get the same results as the pipeline, a prompt template of "This is a photo of {label}." should be used.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/siglip_table.jpeg"
alt="drawing" width="600"/>
@@ -59,7 +60,8 @@ The pipeline allows you to use the model in a few lines of code:
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> # inference
>>> outputs = image_classifier(image, candidate_labels=["2 cats", "a plane", "a remote"])
>>> candidate_labels = ["2 cats", "a plane", "a remote"]
>>> outputs = image_classifier(image, candidate_labels=candidate_labels)
>>> outputs = [{"score": round(output["score"], 4), "label": output["label"] } for output in outputs]
>>> print(outputs)
[{'score': 0.1979, 'label': '2 cats'}, {'score': 0.0, 'label': 'a remote'}, {'score': 0.0, 'label': 'a plane'}]
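The `image_classifier` above comes from the pipeline API; a minimal sketch of how it could be constructed, assuming a SigLIP checkpoint such as `google/siglip-base-patch16-224`:

```py
>>> from transformers import pipeline

>>> image_classifier = pipeline(
...     task="zero-shot-image-classification",
...     model="google/siglip-base-patch16-224",
... )
```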
@@ -81,7 +83,8 @@ If you want to do the pre- and postprocessing yourself, here's how to do that:
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> texts = ["a photo of 2 cats", "a photo of 2 dogs"]
>>> candidate_labels = ["2 cats", "2 dogs"]
>>> texts = [f'This is a photo of {label}.' for label in candidate_labels]  # follows the pipeline prompt template to get the same results
>>> # important: we pass padding="max_length" since the model was trained with this
>>> inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")
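A sketch of the forward pass and post-processing that would follow, assuming `model` has been loaded earlier (e.g. with `SiglipModel.from_pretrained`); note the sigmoid rather than softmax, per the usage tips above:

```py
>>> import torch

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> logits_per_image = outputs.logits_per_image
>>> # sigmoid: each (image, text) pair gets an independent probability
>>> probs = torch.sigmoid(logits_per_image)
>>> print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
```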

1 change: 1 addition & 0 deletions docs/source/en/tasks/zero_shot_image_classification.md
@@ -119,6 +119,7 @@ image for the model by resizing and normalizing it, and a tokenizer that takes c

```py
>>> candidate_labels = ["tree", "car", "bike", "cat"]
>>> candidate_labels = [f'This is a photo of {label}.' for label in candidate_labels]  # follows the pipeline prompt template to get the same results
>>> inputs = processor(images=image, text=candidate_labels, return_tensors="pt", padding=True)
```
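A sketch of the scoring step that would follow, assuming the CLIP-style `model` loaded earlier in the guide; for CLIP-like checkpoints the image-text logits are normalized with a softmax across the candidate labels:

```py
>>> import torch

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> logits = outputs.logits_per_image[0]
>>> probs = logits.softmax(dim=-1).numpy()
>>> # note: after the reassignment above, candidate_labels holds the templated prompts
>>> result = [
...     {"score": float(score), "label": label}
...     for score, label in sorted(zip(probs, candidate_labels), key=lambda x: -x[0])
... ]
```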
