
Add get_logits method and NLLB tokenizer #756

Merged: 7 commits into mlfoundations:main on Dec 9, 2023
Conversation

@visheratin (Contributor)

Hi!

I want to make OpenCLIP more usable for downstream applications, like zero-shot classification. Right now, to get the logits, the user has to call encode_image and encode_text, then matmul them, multiply the result by logit_scale, and optionally add logit_bias. I think it makes sense to have one method to get logits, as in OpenAI and HuggingFace. So I added the get_logits method to both CLIP and CustomTextCLIP classes.
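The manual pipeline described above can be sketched as follows. This is a minimal illustration, not OpenCLIP code: random pre-normalized features stand in for the outputs of encode_image and encode_text, and logit_bias is optional (absent on most CLIP models).

```python
import torch
import torch.nn.functional as F

# Random features stand in for encode_image / encode_text outputs.
torch.manual_seed(0)
image_features = F.normalize(torch.randn(2, 512), dim=-1)  # 2 images
text_features = F.normalize(torch.randn(3, 512), dim=-1)   # 3 captions

# OpenCLIP stores the scale as a log and calls .exp(); a plain value
# is used here for brevity.
logit_scale = torch.tensor(100.0)
logit_bias = None  # only present on some models

# The steps a user currently has to do by hand:
logits_per_image = logit_scale * image_features @ text_features.T
if logit_bias is not None:
    logits_per_image = logits_per_image + logit_bias
logits_per_text = logits_per_image.T
```

The proposed get_logits method bundles exactly these steps into one call.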

I also added the NLLBTokenizer class that has an additional langs parameter in the __call__ method. This is needed because the tokenizer for NLLB models adds a language token to the beginning of the sequence. The token that is added is controlled via the set_src_lang_special_tokens method. If the language is not set via this method, the tokenizer will add an English token to all sequences.

The PR also contains some formatting changes performed by Ruff. For me personally, the formatted code looks nicer, but if you don't like it, I can roll it back.

@gabrielilharco @rwightman

@rwightman (Collaborator)

@visheratin can we do this without blacking it and changing more lines than being added?

@rwightman (Collaborator)

also, should probably be rebased against latest

@visheratin (Contributor, Author)

@rwightman Left only my changes and rebased.

        image_logits += self.logit_bias
        text_logits = image_logits.T
        return image_logits, text_logits

rwightman (Collaborator):

This should be

    def get_logits(self, image, text):
        image_features = self.encode_image(image, normalize=True)
        text_features = self.encode_text(text, normalize=True)
        image_logits = self.logit_scale.exp() * image_features @ text_features.T
        if self.logit_bias is not None:
            image_logits += self.logit_bias
        text_logits = image_logits.T
        return image_logits, text_logits

visheratin (Contributor, Author):

My bad. Fixed.

    @@ -111,11 +110,18 @@ def get_tokenizer(
         context_length = text_config.get('context_length', DEFAULT_CONTEXT_LENGTH)

         if 'hf_tokenizer_name' in text_config:
    -        tokenizer = HFTokenizer(
    +        if model_name.startswith("nllb"):
    +            tokenizer = NLLBTokenizer(
rwightman (Collaborator):

really not a fan of having a model name based hack

@rwightman (Collaborator)

So, for the tokenizer, I'm not convinced it warrants a new tokenizer and associated maintenance. Isn't it pretty standard for multi-lingual to manually insert the language token per text? Are there any popular impl which do it this way?

@rwightman (Collaborator)

On the get_logits, it's useful to have, but I'll point out this won't work with torch.compile or FSDP, which only wrap forward() methods.

@visheratin (Contributor, Author)

Regarding the tokenizer, with add_special_tokens set to True by default, M2M100Tokenizer adds the language token automatically. If the user specifies the token in the text without calling set_src_lang_special_tokens, there will be two language tokens in the output, e.g., tensor([[256147, 256006, 718, 1159, ...]]) (256147 for English and 256006 for Afrikaans).
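The prefix-token behavior can be illustrated with a small sketch. The token ids are the ones quoted above; the function is illustrative, not the actual tokenizer code.

```python
# Illustrative language-token ids, taken from the example above.
LANG_TOKEN_IDS = {"eng_Latn": 256147, "afr_Latn": 256006}

def add_lang_prefix(token_ids, lang="eng_Latn"):
    """Mimic set_src_lang_special_tokens: prepend one language token.

    Without an explicit lang, English is prepended. If the text already
    contains a language token, the output ends up with two of them,
    which is the failure mode described above.
    """
    return [LANG_TOKEN_IDS[lang]] + list(token_ids)
```

For example, add_lang_prefix([256006, 718, 1159]) yields [256147, 256006, 718, 1159]: the English token is prepended in front of the Afrikaans one, reproducing the double-token output quoted above.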

When making an additional tokenizer, I tried to look at the problem from the end-user perspective. With the current HFTokenizer, if the user wants to use an NLLB-based model, they'd have to figure out a proper way to encode the inputs on their own. Having an alternative implementation with the langs parameter and a warning helps to indicate what needs to be done to get proper tokens for the model.

@visheratin (Contributor, Author)

Regarding get_logits and torchcompile or FSDP, I understand the limitation. The main usage of the method would be to do something like:

    from open_clip import create_model_and_transforms, get_tokenizer
    from PIL import Image

    model, _, transform = create_model_and_transforms(...)
    tokenizer = get_tokenizer(...)

    text_inputs = tokenizer(...)
    image = Image.open(...)
    image_inputs = transform(image)
    logits_per_image, logits_per_text = model.get_logits(image_inputs, text_inputs)

An example of such usage can be found in roboflow/supervision library.

@rwightman (Collaborator)

@visheratin rather than making a whole new tokenizer for this with a different tokenize interface, couldn't we pass through the set_src_lang methods and/or src-lang init kwargs? Assert/report an error if it's called on an underlying HF tokenizer that doesn't have it?

@visheratin (Contributor, Author)

Passing through the set_src_lang method is an option. But it will still be non-obvious for the user that they may need to use this method if they want to use languages other than English. Also, if we go this way, we make the users implement the logic that is now in the NLLBTokenizer themselves if they have multilingual input.

Maybe multilingual input text is too much of an edge case. If you think so, I can remove NLLBTokenizer from the PR. If the user wants to change the source language, they can always call tokenizer.tokenizer.set_src_lang, as is done in the CLIP benchmark.

@rwightman (Collaborator)

> Passing through the set_src_lang method is an option. But it will still be non-obvious for the user that they may need to use this method if they want to use languages other than English. Also, if we go this way, we make the users implement the logic that is now in the NLLBTokenizer themselves if they have multilingual input.
>
> Maybe multilingual input text is too much of an edge case. If you think so, I can remove NLLBTokenizer from the PR. If the user wants to change the source language, they can always call tokenizer.tokenizer.set_src_lang, as is done in the CLIP benchmark.

Isn't this how the HF tokenizers work, though? I think they have some sort of specific tokenization method with src/target lang as args, but usually it's either set on construction of the tokenizer or via the set method, no?

@visheratin (Contributor, Author)

In the case of M2M100Tokenizer, the src token is set as a prefix token, which doesn't require a specific tokenization method.

I removed the NLLBTokenizer and added logic for setting the language on init to the HFTokenizer.
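A minimal sketch of that direction, under stated assumptions: the wrapper class and method names below are hypothetical (this is not the merged code). The language is set once at init or via a pass-through method, and an error is raised when the wrapped tokenizer has no language hook.

```python
class LangAwareHFTokenizer:
    """Hypothetical wrapper sketching the set-language-on-init approach."""

    def __init__(self, hf_tokenizer, language=None):
        self.tokenizer = hf_tokenizer
        if language is not None:
            self.set_language(language)

    def set_language(self, src_lang):
        # Pass through to the underlying HF tokenizer when it supports
        # per-language special tokens (as M2M100Tokenizer does).
        if hasattr(self.tokenizer, "set_src_lang_special_tokens"):
            self.tokenizer.set_src_lang_special_tokens(src_lang)
        else:
            raise ValueError(
                "Underlying tokenizer does not support setting a source language"
            )
```

With this shape, a tokenizer lacking set_src_lang_special_tokens fails loudly at construction rather than silently tokenizing with the wrong language prefix.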

@rwightman rwightman merged commit ebe135b into mlfoundations:main Dec 9, 2023
5 checks passed
Interpause pushed a commit to Interpause/open_clip that referenced this pull request May 23, 2024
* Get logits method and set_language for tokenizer.