
[SigLIP] Add fast tokenizer #29969

Open · wants to merge 7 commits into base: main

Conversation

@NielsRogge (Contributor) commented Mar 30, 2024

What does this PR do?

Fixes #29925.

To do:

  • fix remaining tests
  • add slow integration test
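
For context on what the fix enables: once this PR is merged, requesting the fast tokenizer should actually return one instead of silently falling back to the slow SentencePiece implementation. A minimal usage sketch (the checkpoint name is illustrative, and this assumes the PR's SiglipTokenizerFast is registered with AutoTokenizer):

from transformers import AutoTokenizer

# With this PR, use_fast=True should resolve to SiglipTokenizerFast rather
# than silently returning the slow SentencePiece-based SiglipTokenizer.
tokenizer = AutoTokenizer.from_pretrained("google/siglip-base-patch16-224", use_fast=True)
print(type(tokenizer).__name__)  # expected: SiglipTokenizerFast
inputs = tokenizer("a photo of 2 cats")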

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

# Copied from tests.models.t5.test_tokenization_t5.T5TokenizationTest.get_rust_tokenizer with T5->Siglip
def get_rust_tokenizer(self, **kwargs) -> SiglipTokenizerFast:
return self.rust_tokenizer_class.from_pretrained(self.tmpdirname, **kwargs)

Collaborator:
One of the tests can be skipped as the tokenizer does lower case

Contributor Author (@NielsRogge):

Which test do you mean can be skipped?

Collaborator:
The lower_case-related test, IMO.
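
(For reference, a skip along these lines is what is being suggested; the test name and reason string below are an assumption, shown outside the real TokenizerTesterMixin setup:)

import unittest

class SiglipTokenizationTest(unittest.TestCase):
    # Hypothetical illustration: the real test class mixes in TokenizerTesterMixin;
    # only the skip pattern for the case-sensitivity check is shown here.
    @unittest.skip("SiglipTokenizer lower-cases its input, so this case-sensitivity check does not apply")
    def test_added_tokens_do_lower_case(self):
        pass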

src/transformers/convert_slow_tokenizer.py
)

return normalizers.Sequence(list_normalizers)

Collaborator:
As is, you are also going to get the MetaSpace pre-tokenizer, but I am guessing that is wanted too.
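
For readers who don't have convert_slow_tokenizer.py open: SentencePiece-style converters define a normalizer and a Metaspace pre-tokenizer via the tokenizers library. A rough sketch of the pieces being discussed, where the exact normalizer list for SigLIP is an assumption rather than the PR's actual code:

from tokenizers import Regex, normalizers, pre_tokenizers

# Hypothetical SigLIP-style normalizer: collapse whitespace and lower-case,
# roughly mirroring what the slow SiglipTokenizer does to its input.
list_normalizers = [
    normalizers.Replace(Regex(r"\s+"), " "),
    normalizers.Lowercase(),
]
normalizer = normalizers.Sequence(list_normalizers)

# SentencePiece-based converters also attach a Metaspace pre-tokenizer, which is
# what the comment above refers to (older tokenizers releases spell the second
# argument add_prefix_space=True instead of prepend_scheme).
pre_tokenizer = pre_tokenizers.Metaspace(replacement="▁", prepend_scheme="always")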

@NielsRogge (Contributor Author) commented:

@ArthurZucker I'm down to these 3 tests failing:

FAILED tests/models/siglip/test_tokenization_siglip.py::SiglipTokenizationTest::test_added_tokens_do_lower_case - AssertionError: 'aaaaa bbbbbb ' == 'aaaaa bbbbbb '
FAILED tests/models/siglip/test_tokenization_siglip.py::SiglipTokenizationTest::test_special_tokens_initialization - AssertionError: Lists differ: [342, 322, 291, 269, 262, 266, 32100, 507, 4290, 1] != [342, 322, 291, 269, 262, 266, 32100, 12936, 1]
FAILED tests/models/siglip/test_tokenization_siglip.py::SiglipTokenizationTest::test_tokenization_python_rust_equals - AssertionError: Sequences differ: [291,[64 chars]62, 232, 141, 158, 232, 141, 163, 232, 142, 16[5335 chars]3, 1] != [291,[64 chars]62, 2, 16577, 266, 2, 1443, 412, 282, 1791, 13[517...

but I don't really know how to fix these. Are you able to look into these?
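
(If someone wants to reproduce these locally, the failures can be run in isolation with pytest's -k filter; the invocation below is just the standard way to do that, not copied from CI:)

import pytest

# Run only the three failing SigLIP tokenization tests with verbose diffs.
pytest.main([
    "tests/models/siglip/test_tokenization_siglip.py",
    "-k",
    "test_added_tokens_do_lower_case or test_special_tokens_initialization"
    " or test_tokenization_python_rust_equals",
    "-vv",
])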

@ArthurZucker (Collaborator) left a comment:
test_tokenization_python_rust_equals is the only one you really need. The others are not well designed TBH
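
(For context, what the python/rust equality test boils down to is something like the sketch below, assuming the SiglipTokenizerFast added in this PR and an illustrative checkpoint name:)

from transformers import SiglipTokenizer, SiglipTokenizerFast

# The slow (SentencePiece) and fast (tokenizers-backed) implementations must
# produce identical ids for the same input, including non-ASCII text.
slow = SiglipTokenizer.from_pretrained("google/siglip-base-patch16-224")
fast = SiglipTokenizerFast.from_pretrained("google/siglip-base-patch16-224")

text = "A photo of 2 cats, déjà vu!"
assert slow(text).input_ids == fast(text).input_ids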

Collaborator:
should be removed


self.vocab_file = vocab_file

@property
Collaborator:
Lots of "Copied from" comments are missing here as well.
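
(For anyone new to the convention: these are the # Copied from markers that utilities such as utils/check_copies.py keep in sync; the target and body below only illustrate the pattern and are not necessarily what this file needs:)

import os

class SiglipTokenizerFast:  # illustrative stub; the real class subclasses PreTrainedTokenizerFast
    def __init__(self, vocab_file=None):
        self.vocab_file = vocab_file

    # Copied from transformers.models.t5.tokenization_t5_fast.T5TokenizerFast.can_save_slow_tokenizer with T5->Siglip
    @property
    def can_save_slow_tokenizer(self) -> bool:
        return os.path.isfile(self.vocab_file) if self.vocab_file else False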

@@ -544,6 +549,15 @@ def test_model_input_names_signature(self):
# to make sure `tokenizer.pad(...)` works correctly
self.assertTrue(tokenizer.model_input_names[0] in accepted_model_main_input_names)

def test_model_input_names_python_rust_equals(self):
Collaborator:
That's a good addition.
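
(Roughly, the new test asserts that both implementations advertise and honor the same model_input_names; a standalone sketch of the idea, outside the tester mixin, with an illustrative checkpoint:)

from transformers import SiglipTokenizer, SiglipTokenizerFast

# Both tokenizers should expose the same model_input_names (for SigLIP this is
# typically just input_ids) and key their encodings accordingly.
slow = SiglipTokenizer.from_pretrained("google/siglip-base-patch16-224")
fast = SiglipTokenizerFast.from_pretrained("google/siglip-base-patch16-224")

assert slow.model_input_names == fast.model_input_names
enc_slow, enc_fast = slow("a photo of a cat"), fast("a photo of a cat")
for name in slow.model_input_names:
    assert name in enc_slow and name in enc_fast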


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@yxchng commented May 29, 2024:

Is this getting merged anytime soon?

@ArthurZucker (Collaborator):

Comments need to be addressed. cc @itazap if you want to take this over!

Successfully merging this pull request may close these issues:
SigLIP tokenizer not enforcing use_fast=True