SigLIP tokenizer not enforcing use_fast=True #29925

yxchng · 2024-03-28T04:56:26Z

System Info

transformers version: 4.38.2
Platform: Linux-4.18.0-477.27.1.el8_8.x86_64-x86_64-with-glibc2.28
Python version: 3.10.13
Huggingface_hub version: 0.21.4
Safetensors version: 0.4.2
Accelerate version: not installed
Accelerate config: not found
PyTorch version (GPU?): 2.2.1+cu121 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?:
Using distributed or parallel set-up in script?:

Who can help?

No response

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

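from transformers import AutoTokenizer

# Explicitly request the fast (Rust-backed) tokenizer.
t = AutoTokenizer.from_pretrained('google/siglip-so400m-patch14-384', use_fast=True)
# On 4.38.2 this assertion fails: a slow tokenizer is returned despite use_fast=True.
assert t.is_fast, 'tokenizer is not fast'
print('Success')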

Expected behavior

The script prints 'Success', which indicates that use_fast=True returned a fast tokenizer.


ArthurZucker · 2024-03-28T05:04:33Z

Hey! Siglip does not have a fast tokenizer:

transformers/src/transformers/models/auto/tokenization_auto.py

Line 404 in 6226b37

    
           ("siglip", ("SiglipTokenizer" if is_sentencepiece_available() else None, None)),
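For readers unsure how to read that line: each entry in the auto-tokenizer table appears to pair a model type with a (slow class name, fast class name) tuple, and a None in the second slot means no fast class is registered. The snippet below is only a toy sketch of that lookup, not the library's code. TOY_TOKENIZER_MAPPING and resolve_tokenizer_class are hypothetical names, and the fallback to the slow class is my understanding of the behaviour the reproduction above shows.

```python
from typing import Dict, Optional, Tuple

# Toy stand-in for the auto-tokenizer table: model type -> (slow name, fast name).
# The "siglip" row mirrors the quoted line; the "bert" row is here for contrast.
TOY_TOKENIZER_MAPPING: Dict[str, Tuple[Optional[str], Optional[str]]] = {
    "siglip": ("SiglipTokenizer", None),
    "bert": ("BertTokenizer", "BertTokenizerFast"),
}


def resolve_tokenizer_class(model_type: str, use_fast: bool = True) -> Optional[str]:
    """Pick the fast class when requested and registered, else the slow one."""
    slow_name, fast_name = TOY_TOKENIZER_MAPPING[model_type]
    if use_fast and fast_name is not None:
        return fast_name
    # No fast class registered (or not requested): fall back to the slow class.
    return slow_name


print(resolve_tokenizer_class("siglip", use_fast=True))
print(resolve_tokenizer_class("bert", use_fast=True))
```

Under this reading, use_fast=True acts as a preference rather than a guarantee, which is why checking t.is_fast after loading is the reliable test.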

yxchng · 2024-03-28T08:33:17Z

I don't quite understand the line you linked. Why is this model different from the others, and why doesn't it have a fast tokenizer?

ArthurZucker · 2024-03-30T07:33:38Z

That is because of

transformers/src/transformers/models/siglip/tokenization_siglip.py

Line 284 in 84d406c

def canonicalize_text(self, text, *, keep_punctuation_exact_string=None):

Basically, the equivalent fast tokenizer is not implemented because it needed a bit more work. I'll see if I have time to add it, but otherwise it's a good issue!
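To illustrate why a Python-side preprocessing hook makes a fast port harder, here is a rough standalone sketch of what a canonicalize_text step of this shape could do. The thread quotes only the signature, so the body below is an assumption: it strips ASCII punctuation (optionally preserving one exact substring) and collapses whitespace. The names canonicalize_text_sketch and remove_punctuation are illustrative; the linked file is the authoritative version.

```python
import re
import string


def remove_punctuation(text: str) -> str:
    # Drop every ASCII punctuation character.
    return text.translate(str.maketrans("", "", string.punctuation))


def canonicalize_text_sketch(text, *, keep_punctuation_exact_string=None):
    """Assumed behaviour: strip punctuation, then normalise whitespace."""
    if keep_punctuation_exact_string:
        # Protect one exact substring (e.g. a "{}" placeholder) from stripping.
        text = keep_punctuation_exact_string.join(
            remove_punctuation(part)
            for part in text.split(keep_punctuation_exact_string)
        )
    else:
        text = remove_punctuation(text)
    # Collapse runs of whitespace and trim both ends.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


print(canonicalize_text_sketch("Hello,   world!"))
print(canonicalize_text_sketch("a photo of {}, nice!", keep_punctuation_exact_string="{}"))
```

If the real method works along these lines, a fast tokenizer would presumably need to reproduce the same steps as a normalizer in the tokenizers backend, which may be the "bit more work" mentioned above.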

yxchng · 2024-04-22T05:59:52Z

Is this going to get merged anytime soon?

ArthurZucker · 2024-04-22T18:04:34Z

I reviewed it; it just needs @NielsRogge's updates.

ArthurZucker added the Feature request label Mar 30, 2024

ArthurZucker added the Good Difficult Issue label Mar 30, 2024

NielsRogge linked a pull request Mar 30, 2024 that will close this issue

[SigLIP] Add fast tokenizer #29969

Open

2 tasks
