
Add get_logits method and NLLB tokenizer #756

Merged: 7 commits into mlfoundations:main on Dec 9, 2023
Conversation

@visheratin (Contributor)

Hi!

I want to make OpenCLIP more usable for downstream applications, like zero-shot classification. Right now, to get the logits, the user has to call encode_image and encode_text, then matmul them, multiply the result by logit_scale, and optionally add logit_bias. I think it makes sense to have one method to get logits, as in OpenAI and HuggingFace. So I added the get_logits method to both CLIP and CustomTextCLIP classes.
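The manual pipeline described above can be sketched as follows. This is a minimal illustration, not OpenCLIP code: random pre-normalized features stand in for the outputs of encode_image and encode_text, and logit_bias is optional (absent on most CLIP models).

```python
import torch
import torch.nn.functional as F

# Random features stand in for encode_image / encode_text outputs.
torch.manual_seed(0)
image_features = F.normalize(torch.randn(2, 512), dim=-1)  # 2 images
text_features = F.normalize(torch.randn(3, 512), dim=-1)   # 3 captions

# OpenCLIP stores the scale as a log and calls .exp(); a plain value
# is used here for brevity.
logit_scale = torch.tensor(100.0)
logit_bias = None  # only present on some models

# The steps a user currently has to do by hand:
logits_per_image = logit_scale * image_features @ text_features.T
if logit_bias is not None:
    logits_per_image = logits_per_image + logit_bias
logits_per_text = logits_per_image.T
```

The proposed get_logits method bundles exactly these steps into one call.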

I also added the NLLBTokenizer class that has an additional langs parameter in the __call__ method. This is needed because the tokenizer for NLLB models adds a language token to the beginning of the sequence. The token that is added is controlled via the set_src_lang_special_tokens method. If the language is not set via this method, the tokenizer will add an English token to all sequences.

The PR also contains some formatting changes performed by Ruff. For me personally, the formatted code looks nicer, but if you don't like it, I can roll it back.

@gabrielilharco @rwightman

@rwightman (Collaborator)

@visheratin can we do this without blacking it and changing more lines than being added?

@rwightman (Collaborator)

also, should probably be rebased against latest

@visheratin (Contributor, Author)

@rwightman Left only my changes and rebased.

        image_logits += self.logit_bias
        text_logits = image_logits.T
        return image_logits, text_logits

rwightman (Collaborator):

This should be

    def get_logits(self, image, text):
        image_features = self.encode_image(image, normalize=True)
        text_features = self.encode_text(text, normalize=True)
        image_logits = self.logit_scale.exp() * image_features @ text_features.T
        if self.logit_bias is not None:
            image_logits += self.logit_bias
        text_logits = image_logits.T
        return image_logits, text_logits

visheratin (Contributor, Author):

My bad. Fixed.

    @@ -111,11 +110,18 @@ def get_tokenizer(
         context_length = text_config.get('context_length', DEFAULT_CONTEXT_LENGTH)

         if 'hf_tokenizer_name' in text_config:
    -        tokenizer = HFTokenizer(
    +        if model_name.startswith("nllb"):
    +            tokenizer = NLLBTokenizer(
rwightman (Collaborator):

really not a fan of having a model name based hack

@rwightman (Collaborator)

So, for the tokenizer, I'm not convinced it warrants a new tokenizer and associated maintenance. Isn't it pretty standard for multi-lingual to manually insert the language token per text? Are there any popular impl which do it this way?

@rwightman (Collaborator)

On the get_logits, it's useful to have, but I'll point out this won't work with torch.compile or FSDP, which only wrap forward() methods.

@visheratin (Contributor, Author)

Regarding the tokenizer, with add_special_tokens set to True by default, M2M100Tokenizer adds the language token automatically. If the user specifies the token in the text without calling set_src_lang_special_tokens, there will be two language tokens in the output, e.g., tensor([[256147, 256006, 718, 1159, ...]]) (256147 for English and 256006 for Afrikaans).
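The prefix-token behavior can be illustrated with a small sketch. The token ids are the ones quoted above; the function is illustrative, not the actual tokenizer code.

```python
# Illustrative language-token ids, taken from the example above.
LANG_TOKEN_IDS = {"eng_Latn": 256147, "afr_Latn": 256006}

def add_lang_prefix(token_ids, lang="eng_Latn"):
    """Mimic set_src_lang_special_tokens: prepend one language token.

    Without an explicit lang, English is prepended. If the text already
    contains a language token, the output ends up with two of them,
    which is the failure mode described above.
    """
    return [LANG_TOKEN_IDS[lang]] + list(token_ids)
```

For example, add_lang_prefix([256006, 718, 1159]) yields [256147, 256006, 718, 1159]: the English token is prepended in front of the Afrikaans one, reproducing the double-token output quoted above.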

When making an additional tokenizer, I tried to look at the problem from the end-user perspective. With the current HFTokenizer, if the user wants to use an NLLB-based model, they'd have to figure out a proper way to encode the inputs on their own. Having an alternative implementation with the langs parameter and a warning helps to indicate what needs to be done to get proper tokens for the model.

@visheratin (Contributor, Author)

Regarding get_logits and torchcompile or FSDP, I understand the limitation. The main usage of the method would be to do something like:

    from open_clip import create_model_and_transforms, get_tokenizer
    from PIL import Image

    model, _, transform = create_model_and_transforms(...)
    tokenizer = get_tokenizer(...)

    text_inputs = tokenizer(...)
    image = Image.open(...)
    image_inputs = transform(image)
    logits_per_image, logits_per_text = model.get_logits(image_inputs, text_inputs)

An example of such usage can be found in roboflow/supervision library.

@rwightman (Collaborator)

@visheratin rather than making a whole new tokenizer for this with a different tokenize interface, couldn't we pass through the set_src_lang methods and/or src-lang init kwargs? Assert/report an error if it's called on an underlying HF tokenizer that doesn't have it?

@visheratin (Contributor, Author)

Passing through the set_src_lang method is an option. But it will still be non-obvious for the user that they may need to use this method if they want to use languages other than English. Also, if we go this way, we make the users implement the logic that is now in the NLLBTokenizer themselves if they have multilingual input.

Maybe multilingual input text is too much of an edge case. If you think so, I can remove NLLBTokenizer from the PR. If the user wants to change the source language, they can always call tokenizer.tokenizer.set_src_lang, as is done in the CLIP benchmark.

@rwightman (Collaborator)

> Passing through the set_src_lang method is an option. But it will still be non-obvious for the user that they may need to use this method if they want to use languages other than English. Also, if we go this way, we make the users implement the logic that is now in the NLLBTokenizer themselves if they have multilingual input.
>
> Maybe multilingual input text is too much of an edge case. If you think so, I can remove NLLBTokenizer from the PR. If the user wants to change the source language, they can always call tokenizer.tokenizer.set_src_lang, as is done in the CLIP benchmark.

Isn't this how the HF tokenizers work, though? I think they have some sort of specific tokenization method with src/target lang as args, but usually it's either set on construction of the tokenizer or via the set method, no?

@visheratin (Contributor, Author)

In the case of M2M100Tokenizer, the src token is set as a prefix token, which doesn't require a specific tokenization method.

I removed the NLLBTokenizer and added logic for setting the language on init to the HFTokenizer.
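A minimal sketch of that direction, under stated assumptions: the wrapper class and method names below are hypothetical (this is not the merged code). The language is set once at init or via a pass-through method, and an error is raised when the wrapped tokenizer has no language hook.

```python
class LangAwareHFTokenizer:
    """Hypothetical wrapper sketching the set-language-on-init approach."""

    def __init__(self, hf_tokenizer, language=None):
        self.tokenizer = hf_tokenizer
        if language is not None:
            self.set_language(language)

    def set_language(self, src_lang):
        # Pass through to the underlying HF tokenizer when it supports
        # per-language special tokens (as M2M100Tokenizer does).
        if hasattr(self.tokenizer, "set_src_lang_special_tokens"):
            self.tokenizer.set_src_lang_special_tokens(src_lang)
        else:
            raise ValueError(
                "Underlying tokenizer does not support setting a source language"
            )
```

With this shape, a tokenizer lacking set_src_lang_special_tokens fails loudly at construction rather than silently tokenizing with the wrong language prefix.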

@rwightman rwightman merged commit ebe135b into mlfoundations:main Dec 9, 2023
5 checks passed
Interpause pushed a commit to Interpause/open_clip that referenced this pull request May 23, 2024
* Get logits method and set_language for tokenizer.