
[SigLIP] Add fast tokenizer #29969

Open · wants to merge 7 commits into base: main

Conversation

@NielsRogge (Contributor) commented Mar 30, 2024

What does this PR do?

Fixes #29925.

To do:

  • fix remaining tests
  • add slow integration test
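
For context on what the fix enables: once this PR is merged, requesting the fast tokenizer should actually return one instead of silently falling back to the slow SentencePiece implementation. A minimal usage sketch (the checkpoint name is illustrative, and this assumes the PR's SiglipTokenizerFast is registered with AutoTokenizer):

from transformers import AutoTokenizer

# With this PR, use_fast=True should resolve to SiglipTokenizerFast rather
# than silently returning the slow SentencePiece-based SiglipTokenizer.
tokenizer = AutoTokenizer.from_pretrained("google/siglip-base-patch16-224", use_fast=True)
print(type(tokenizer).__name__)  # expected: SiglipTokenizerFast
inputs = tokenizer("a photo of 2 cats")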

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

# Copied from tests.models.t5.test_tokenization_t5.T5TokenizationTest.get_rust_tokenizer with T5->Siglip
def get_rust_tokenizer(self, **kwargs) -> SiglipTokenizerFast:
return self.rust_tokenizer_class.from_pretrained(self.tmpdirname, **kwargs)

Collaborator:
One of the tests can be skipped as the tokenizer does lower case

Contributor Author (@NielsRogge):

Which test do you mean can be skipped?

Collaborator:
The lower_case-related test, IMO.
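
(For reference, a skip along these lines is what is being suggested; the test name and reason string below are an assumption, shown outside the real TokenizerTesterMixin setup:)

import unittest

class SiglipTokenizationTest(unittest.TestCase):
    # Hypothetical illustration: the real test class mixes in TokenizerTesterMixin;
    # only the skip pattern for the case-sensitivity check is shown here.
    @unittest.skip("SiglipTokenizer lower-cases its input, so this case-sensitivity check does not apply")
    def test_added_tokens_do_lower_case(self):
        pass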

src/transformers/convert_slow_tokenizer.py
)

return normalizers.Sequence(list_normalizers)

Collaborator:
As is, you are also going to get the MetaSpace pre-tokenizer, but I am guessing that is wanted too.
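
For readers who don't have convert_slow_tokenizer.py open: SentencePiece-style converters define a normalizer and a Metaspace pre-tokenizer via the tokenizers library. A rough sketch of the pieces being discussed, where the exact normalizer list for SigLIP is an assumption rather than the PR's actual code:

from tokenizers import Regex, normalizers, pre_tokenizers

# Hypothetical SigLIP-style normalizer: collapse whitespace and lower-case,
# roughly mirroring what the slow SiglipTokenizer does to its input.
list_normalizers = [
    normalizers.Replace(Regex(r"\s+"), " "),
    normalizers.Lowercase(),
]
normalizer = normalizers.Sequence(list_normalizers)

# SentencePiece-based converters also attach a Metaspace pre-tokenizer, which is
# what the comment above refers to (older tokenizers releases spell the second
# argument add_prefix_space=True instead of prepend_scheme).
pre_tokenizer = pre_tokenizers.Metaspace(replacement="▁", prepend_scheme="always")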

@NielsRogge (Contributor Author) commented:

@ArthurZucker I'm down to these 3 tests failing:

FAILED tests/models/siglip/test_tokenization_siglip.py::SiglipTokenizationTest::test_added_tokens_do_lower_case - AssertionError: 'aaaaa bbbbbb ' == 'aaaaa bbbbbb '
FAILED tests/models/siglip/test_tokenization_siglip.py::SiglipTokenizationTest::test_special_tokens_initialization - AssertionError: Lists differ: [342, 322, 291, 269, 262, 266, 32100, 507, 4290, 1] != [342, 322, 291, 269, 262, 266, 32100, 12936, 1]
FAILED tests/models/siglip/test_tokenization_siglip.py::SiglipTokenizationTest::test_tokenization_python_rust_equals - AssertionError: Sequences differ: [291,[64 chars]62, 232, 141, 158, 232, 141, 163, 232, 142, 16[5335 chars]3, 1] != [291,[64 chars]62, 2, 16577, 266, 2, 1443, 412, 282, 1791, 13[517...

but I don't really know how to fix these. Are you able to look into these?
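
(If someone wants to reproduce these locally, the failures can be run in isolation with pytest's -k filter; the invocation below is just the standard way to do that, not copied from CI:)

import pytest

# Run only the three failing SigLIP tokenization tests with verbose diffs.
pytest.main([
    "tests/models/siglip/test_tokenization_siglip.py",
    "-k",
    "test_added_tokens_do_lower_case or test_special_tokens_initialization"
    " or test_tokenization_python_rust_equals",
    "-vv",
])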

@ArthurZucker (Collaborator) left a comment:
test_tokenization_python_rust_equals is the only one you really need. The others are not well designed TBH
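
(For context, what the python/rust equality test boils down to is something like the sketch below, assuming the SiglipTokenizerFast added in this PR and an illustrative checkpoint name:)

from transformers import SiglipTokenizer, SiglipTokenizerFast

# The slow (SentencePiece) and fast (tokenizers-backed) implementations must
# produce identical ids for the same input, including non-ASCII text.
slow = SiglipTokenizer.from_pretrained("google/siglip-base-patch16-224")
fast = SiglipTokenizerFast.from_pretrained("google/siglip-base-patch16-224")

text = "A photo of 2 cats, déjà vu!"
assert slow(text).input_ids == fast(text).input_ids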

Collaborator:
should be removed


self.vocab_file = vocab_file

@property
Collaborator:
Lots of "Copied from" comments are missing here as well.
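
(For anyone new to the convention: these are the # Copied from markers that utilities such as utils/check_copies.py keep in sync; the target and body below only illustrate the pattern and are not necessarily what this file needs:)

import os

class SiglipTokenizerFast:  # illustrative stub; the real class subclasses PreTrainedTokenizerFast
    def __init__(self, vocab_file=None):
        self.vocab_file = vocab_file

    # Copied from transformers.models.t5.tokenization_t5_fast.T5TokenizerFast.can_save_slow_tokenizer with T5->Siglip
    @property
    def can_save_slow_tokenizer(self) -> bool:
        return os.path.isfile(self.vocab_file) if self.vocab_file else False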

@@ -544,6 +549,15 @@ def test_model_input_names_signature(self):
# to make sure `tokenizer.pad(...)` works correctly
self.assertTrue(tokenizer.model_input_names[0] in accepted_model_main_input_names)

def test_model_input_names_python_rust_equals(self):
Collaborator:
That's a good addition.
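
(Roughly, the new test asserts that both implementations advertise and honor the same model_input_names; a standalone sketch of the idea, outside the tester mixin, with an illustrative checkpoint:)

from transformers import SiglipTokenizer, SiglipTokenizerFast

# Both tokenizers should expose the same model_input_names (for SigLIP this is
# typically just input_ids) and key their encodings accordingly.
slow = SiglipTokenizer.from_pretrained("google/siglip-base-patch16-224")
fast = SiglipTokenizerFast.from_pretrained("google/siglip-base-patch16-224")

assert slow.model_input_names == fast.model_input_names
enc_slow, enc_fast = slow("a photo of a cat"), fast("a photo of a cat")
for name in slow.model_input_names:
    assert name in enc_slow and name in enc_fast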


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@yxchng commented May 29, 2024:

Is this getting merged anytime soon?

@ArthurZucker (Collaborator):

Comments need to be addressed. cc @itazap if you want to take this over!

Successfully merging this pull request may close these issues:
SigLIP tokenizer not enforcing use_fast=True