Improve tokenizer tests #13594
Conversation
tests/test_tokenization_common.py (outdated diff)

-        tokenizers = [self.get_tokenizer(do_lower_case=True)] if self.test_slow_tokenizer else []
+        if not self.test_slow_tokenizer:
+            self.skipTest("This test is only for slow tokenizers")
+            return
This would be consistent with line 1670.
From the change you propose in the test_encode_decode_with_spaces test, I understand that Rust tokenizers now accept spaces in added tokens.
If this is the case, perhaps we should take the opportunity to modify this test to cover the Rust tokenizers as well (as suggested in the comment above). What do you think? 🙂
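For context, here is a minimal standalone sketch of what "spaces in added tokens" means for a fast (Rust-backed) tokenizer; the checkpoint name and the expected output are illustrative assumptions, not something taken from this PR:

```python
from transformers import AutoTokenizer

# Load a fast (Rust-backed) tokenizer; "bert-base-cased" is only an illustrative checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)

# Add a token that contains a space and check that it survives tokenization as a single piece.
tokenizer.add_tokens(["new york"])
print(tokenizer.tokenize("I love new york"))  # expected to contain "new york" as one token
```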
Sure, that makes sense. I'll work on it.
After doing some tests: although the Rust tokenizer supports spaces in added tokens, two other problems arise, and they seem unrelated to whether the tokens contain spaces or not.
The first one looks like the Rust tokenizer doesn't lowercase added tokens, so the first assertion fails:
self = <tests.test_tokenization_albert.AlbertTokenizationTest testMethod=test_added_tokens_do_lower_case>

    def test_added_tokens_do_lower_case(self):
        # TODO(thom) activate fast tokenizer tests once Rust tokenizers accepts white spaces in added tokens.
        #if not self.test_slow_tokenizer:
        #    self.skipTest("This test is only for slow tokenizers")
        #    return
        tokenizers = self.get_tokenizers(fast=True, do_lower_case=True)
        for tokenizer in tokenizers:
            with self.subTest(f"{tokenizer.__class__.__name__}"):
                #if not hasattr(tokenizer, "do_lower_case") or not tokenizer.do_lower_case:
                #    continue
                special_token = tokenizer.all_special_tokens[0]
                text = special_token + " aaaaa bbbbbb low cccccccccdddddddd l " + special_token
                text2 = special_token + " AAAAA BBBBBB low CCCCCCCCCDDDDDDDD l " + special_token
                toks0 = tokenizer.tokenize(text)  # toks before adding new_toks
                new_toks = ["aaaaa bbbbbb", "cccccccccdddddddd", "AAAAA BBBBBB", "CCCCCCCCCDDDDDDDD"]
                added = tokenizer.add_tokens(new_toks)
>               self.assertEqual(added, 2, tokenizer.get_added_vocab())
E               AssertionError: 4 != 2 : {'CCCCCCCCCDDDDDDDD': 30003, 'cccccccccdddddddd': 30001, 'aaaaa bbbbbb': 30000, 'AAAAA BBBBBB': 30002}
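To make the divergence concrete, here is a hedged standalone sketch of the behavior this failure suggests; the checkpoint name is illustrative, and the expected return values follow from the assertion error above rather than from running this exact snippet:

```python
from transformers import AlbertTokenizer, AlbertTokenizerFast

new_toks = ["aaaaa bbbbbb", "AAAAA BBBBBB"]

slow = AlbertTokenizer.from_pretrained("albert-base-v2", do_lower_case=True)
fast = AlbertTokenizerFast.from_pretrained("albert-base-v2", do_lower_case=True)

# The slow tokenizer lowercases added tokens when do_lower_case=True, so the two strings
# collapse into a single vocabulary entry; the fast tokenizer keeps the casing and adds both
# (which is why the test above sees 4 added tokens instead of the expected 2).
print(slow.add_tokens(new_toks))  # expected: 1
print(fast.add_tokens(new_toks))  # expected: 2
```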
The second one seems related to the special handling of the mask token in AlbertTokenizerFast, resulting in the mask token not being treated as a special token:
mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
                # Check that none of the special tokens are lowercased
                sequence_with_special_tokens = "A " + " yEs ".join(tokenizer.all_special_tokens) + " B"
                tokenized_sequence = tokenizer.tokenize(sequence_with_special_tokens)
                for special_token in tokenizer.all_special_tokens:
>                   self.assertTrue(special_token in tokenized_sequence, f"{tokenizer.all_special_tokens} {tokenized_sequence}")
E                   AssertionError: False is not true : ['[CLS]', '[SEP]', '<unk>', '<pad>', '[MASK]'] ['▁a', '▁', '[CLS]', '▁yes', '▁', '[SEP]', '▁yes', '▁', '<unk>', '▁yes', '▁', '<pad>', '▁yes', '▁[', 'mask', ']', '▁b']
Thanks a lot for detailing the errors that prevent extending this test to fast tokenizers.
I totally agree with your first analysis. Indeed, slow and fast tokenizers do not add the new tokens in the same way when do_lower_case is True. We should see on our side whether it's something we want to harmonize or not (cc @sgugger and @LysandreJik).
Regarding the second error, I think it's because we do mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) and not mask_token = AddedToken(mask_token, lstrip=True, rstrip=False, normalized=False). Did you have something else in mind?
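For readers following along, a hedged sketch of the fix being suggested here; normalized is a real flag on tokenizers.AddedToken, but whether this alone is the complete fix for the Albert fast tokenizer is an assumption:

```python
from tokenizers import AddedToken

# Mark the mask token as non-normalized so the fast tokenizer's normalization pipeline
# (e.g. lowercasing) does not rewrite it before added-token matching, keeping "[MASK]" intact.
mask_token = AddedToken("[MASK]", lstrip=True, rstrip=False, normalized=False)
```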
Oh, understood! That seems like the right behavior for fast tokenizers! It is slightly different from the slow one, but that's okay in this instance.
Thanks again! For the first problem, what do you think about a test that distinguishes between slow and fast tokenizers, like what is done in this test for example?
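One possible shape for such a branch, sketched against the conventions of tests/test_tokenization_common.py; this fragment lives inside the common test method (so it is not standalone), and the expected counts and the PreTrainedTokenizerFast check are assumptions rather than the final implementation:

```python
from transformers import PreTrainedTokenizerFast

tokenizers = self.get_tokenizers(do_lower_case=True)
for tokenizer in tokenizers:
    with self.subTest(f"{tokenizer.__class__.__name__}"):
        added = tokenizer.add_tokens(new_toks)
        if isinstance(tokenizer, PreTrainedTokenizerFast):
            # Fast tokenizers keep the original casing of added tokens.
            self.assertEqual(added, 4, tokenizer.get_added_vocab())
        else:
            # Slow tokenizers lowercase added tokens, so the cased/uncased pairs collapse.
            self.assertEqual(added, 2, tokenizer.get_added_vocab())
```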
Pinging @SaulLu, who has worked on this in the past.
Force-pushed from 21d330f to 5d5a98c.
Thank you very much for working on the tests!
I have left a few questions in the comments to understand all the proposed changes. Looking forward to reading your answers!
I'll take the liberty of pinging @Narsil to see if he can give any leads on how to unblock the failing pipeline tests.
I believe the pipeline tests are currently passing?
Hi @SaulLu, could you check whether all the changes look good to you when you're back?
So that we don't get lost, this PR is awaiting the outcome of PR #13930. 🙂
Force-pushed from 7b68e65 to 2bc0a28.
             new_toks = ["aaaaa bbbbbb", "cccccccccdddddddd", "AAAAA BBBBBB", "CCCCCCCCCDDDDDDDD"]
-            added = tokenizer.add_tokens(new_toks)
+            added = tokenizer.add_tokens([AddedToken(tok, lstrip=True, rstrip=True) for tok in new_toks])
All added tokens should have lstrip and rstrip, because every no_split_token in slow tokenizers strips left and right spaces. Kindly see lines 518~522.
transformers/src/transformers/tokenization_utils.py, lines 504 to 522 in 14cc50d:
for i, token in enumerate(tokens):
    if token in no_split_token:
        tok_extended = all_special_tokens_extended.get(token, None)
        left = tokens[i - 1] if i > 0 else None
        right = tokens[i + 1] if i < len(tokens) - 1 else None
        if isinstance(tok_extended, AddedToken):
            if tok_extended.rstrip and right:
                # A bit counter-intuitive but we strip the left of the string
                # since tok_extended.rstrip means the special token is eating all white spaces on its right
                tokens[i + 1] = right.lstrip()
            # Strip white spaces on the left
            if tok_extended.lstrip and left:
                tokens[i - 1] = left.rstrip()  # Opposite here
        else:
            # We strip left and right by default
            if right:
                tokens[i + 1] = right.lstrip()
            if left:
                tokens[i - 1] = left.rstrip()
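As a small usage sketch of the change under discussion (the checkpoint and sample text are illustrative), wrapping each new token in an AddedToken with lstrip/rstrip mirrors the default stripping that the slow-tokenizer code above applies around every no_split_token:

```python
from transformers import AddedToken, AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")  # illustrative checkpoint

new_toks = ["aaaaa bbbbbb", "cccccccccdddddddd"]
# lstrip/rstrip make the added tokens eat the surrounding whitespace, matching the
# default left/right stripping shown in the tokenization_utils.py snippet above.
tokenizer.add_tokens([AddedToken(tok, lstrip=True, rstrip=True) for tok in new_toks])

print(tokenizer.tokenize("low aaaaa bbbbbb low"))  # "aaaaa bbbbbb" should come out as one token
```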
Thank you so much for the explanation! I agree with you.
@SaulLu All tests passed. Thanks.
Thanks a lot for your patience with this PR and for extending the test_added_tokens_do_lower_case test to fast tokenizers! 🎊
I just left 2 nits that would make this test even easier to read by homogenizing the checks between the tokenizers with a do_lower_case argument and the others. If you don't have any time to devote to this PR, no problem, I will apply these nits myself in a follow-up PR. 🙂
@@ -632,30 +631,35 @@ def test_added_tokens_do_lower_case(self):
             text = special_token + " aaaaa bbbbbb low cccccccccdddddddd l " + special_token
             text2 = special_token + " AAAAA BBBBBB low CCCCCCCCCDDDDDDDD l " + special_token

-            toks0 = tokenizer.tokenize(text)  # toks before adding new_toks
+            toks_before_adding = tokenizer.tokenize(text)  # toks before adding new_toks
I really like the renaming of the variables!
What does this PR do?
Improve tokenizer common tests in tests/test_tokenization_common.py.

Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@LysandreJik