Improve tokenizer tests #13594

Merged: 14 commits, merged into huggingface:master on Dec 3, 2021

Conversation

@qqaatw (Contributor) commented Sep 16, 2021

What does this PR do?

Improve tokenizer common tests in tests/test_tokenization_common.py.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@LysandreJik

(Review thread on a hunk in tests/test_tokenization_common.py:)

tokenizers = [self.get_tokenizer(do_lower_case=True)] if self.test_slow_tokenizer else []
if not self.test_slow_tokenizer:
    self.skipTest("This test is only for slow tokenizers")
    return

qqaatw (Contributor Author):

This would be consistent with line 1670.

SaulLu (Contributor):

From the change you propose in the test_encode_decode_with_spaces test, I understand that Rust tokenizers now accept spaces in added tokens.

If this is the case, perhaps we should take the opportunity to modify this test so that it also covers the Rust tokenizers (as suggested in the comment above). What do you think? 🙂

qqaatw (Contributor Author):

Sure, that makes sense. I'll work on it.

qqaatw (Contributor Author):

After doing some tests: although the Rust tokenizer supports spaces in added tokens, two other problems arise, and they seem unrelated to whether the added tokens contain spaces.

The first is that the Rust tokenizer doesn't lowercase added tokens, so the first assertion fails:

self = <tests.test_tokenization_albert.AlbertTokenizationTest testMethod=test_added_tokens_do_lower_case>

    def test_added_tokens_do_lower_case(self):
        # TODO(thom) activate fast tokenizer tests once Rust tokenizers accepts white spaces in added tokens.
        #if not self.test_slow_tokenizer:
        #    self.skipTest("This test is only for slow tokenizers")
        #    return
        tokenizers = self.get_tokenizers(fast=True, do_lower_case=True)
        for tokenizer in tokenizers:
            with self.subTest(f"{tokenizer.__class__.__name__}"):
                #if not hasattr(tokenizer, "do_lower_case") or not tokenizer.do_lower_case:
                #    continue
    
                special_token = tokenizer.all_special_tokens[0]
    
                text = special_token + " aaaaa bbbbbb low cccccccccdddddddd l " + special_token
                text2 = special_token + " AAAAA BBBBBB low CCCCCCCCCDDDDDDDD l " + special_token
    
                toks0 = tokenizer.tokenize(text)  # toks before adding new_toks
    
                new_toks = ["aaaaa bbbbbb", "cccccccccdddddddd", "AAAAA BBBBBB", "CCCCCCCCCDDDDDDDD"]
                added = tokenizer.add_tokens(new_toks)
>               self.assertEqual(added, 2, tokenizer.get_added_vocab())
E               AssertionError: 4 != 2 : {'CCCCCCCCCDDDDDDDD': 30003, 'cccccccccdddddddd': 30001, 'aaaaa bbbbbb': 30000, 'AAAAA BBBBBB': 30002}

The second seems related to the special handling of the mask token in AlbertTokenizerFast, which results in the mask token not being treated as a special token:

mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token

                # Check that none of the special tokens are lowercased
                sequence_with_special_tokens = "A " + " yEs ".join(tokenizer.all_special_tokens) + " B"
                tokenized_sequence = tokenizer.tokenize(sequence_with_special_tokens)
    
                for special_token in tokenizer.all_special_tokens:
>                   self.assertTrue(special_token in tokenized_sequence, f"{tokenizer.all_special_tokens} {tokenized_sequence}")
E                   AssertionError: False is not true : ['[CLS]', '[SEP]', '<unk>', '<pad>', '[MASK]'] ['▁a', '▁', '[CLS]', '▁yes', '▁', '[SEP]', '▁yes', '▁', '<unk>', '▁yes', '▁', '<pad>', '▁yes', '▁[', 'mask', ']', '▁b']

SaulLu (Contributor):

Thanks a lot for detailing the errors that prevent extending this test to fast tokenizers.

I totally agree with your first analysis. Indeed, slow and fast tokenizers do not add new tokens in the same way when do_lower_case is True. We should see on our side whether it's something we want to harmonize or not (cc @sgugger and @LysandreJik).

Regarding the second error, I think it's because we do mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) and not mask_token = AddedToken(mask_token, lstrip=True, rstrip=False, normalized=False). Did you have something else in mind?
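
(For illustration, a minimal sketch of the change being discussed here, assuming the AddedToken class from the tokenizers library; the fix that later landed in PR #13930 may differ in detail.)

from tokenizers import AddedToken

# Current construction: the mask token is passed through the tokenizer's
# normalizer (which lowercases for ALBERT), so "[MASK]" ends up split into
# '▁[', 'mask', ']' instead of staying a single special token.
mask_token = AddedToken("[MASK]", lstrip=True, rstrip=False)

# Suggested fix: mark the added token as not normalized so the fast tokenizer
# matches it verbatim and keeps it intact.
mask_token = AddedToken("[MASK]", lstrip=True, rstrip=False, normalized=False)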

Collaborator:

Oh, understood! That seems like the right behavior for fast tokenizers! It is slightly different from the slow one, but that's okay in this instance.

qqaatw (Contributor Author):

@SaulLu This PR #13930 should fix the second problem as you suggested.

qqaatw (Contributor Author):

Hi @SaulLu,

As PR #13930 has been merged, the remaining problem is the first one, i.e. Rust tokenizers don't lowercase added tokens when do_lower_case is set to True. Have you decided how you'd like to handle this issue?

SaulLu (Contributor):

Thanks again! For the first problem, what do you think about a test that distinguishes between slow and fast tokenizers, like what is done in this test for example?
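
(For illustration only, a rough sketch of what such a slow/fast split could look like inside the existing TokenizerTesterMixin in tests/test_tokenization_common.py; the token strings and expected counts are placeholders, not the final test.)

def test_added_tokens_do_lower_case(self):
    tokenizers = self.get_tokenizers(do_lower_case=True)
    for tokenizer in tokenizers:
        with self.subTest(f"{tokenizer.__class__.__name__}"):
            new_toks = ["aaaaa bbbbbb", "AAAAA BBBBBB"]
            added = tokenizer.add_tokens(new_toks)
            if tokenizer.is_fast:
                # Rust tokenizers keep added tokens as-is, so both case
                # variants end up in the vocabulary.
                self.assertEqual(added, 2)
            else:
                # Slow tokenizers lowercase added tokens when do_lower_case
                # is True, so the uppercase variant collapses onto the
                # lowercase one.
                self.assertEqual(added, 1)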

qqaatw (Contributor Author):

Makes sense! There are two prerequisite PRs (#14364 and #14365) that need to be merged before this test case can be completed.

@LysandreJik (Member):

Pinging @SaulLu, who has worked on this in the past.

@qqaatw force-pushed the improve_tokenizer_test branch from 21d330f to 5d5a98c on September 18, 2021 09:23

@SaulLu (Contributor) left a comment:

Thank you very much for working on the tests!

I have left a few questions in the comments to understand all the proposed changes. Looking forward to reading your answers!


@SaulLu (Contributor) commented Sep 29, 2021

I'll take the liberty of pinging @Narsil to see if he can give any leads on how to unblock the failing pipeline tests.

@LysandreJik (Member):
I believe the pipeline tests are currently passing?

@sgugger (Collaborator) commented Oct 8, 2021

Hi @SaulLu, could you check whether all the changes look good to you when you're back?

@SaulLu (Contributor) commented Nov 9, 2021

So that we don't get lost: this PR is awaiting the outcome of PR #13930. 🙂

@qqaatw mentioned this pull request on Nov 11, 2021
@qqaatw force-pushed the improve_tokenizer_test branch from 7b68e65 to 2bc0a28 on December 1, 2021 18:24

(Review thread on another hunk in tests/test_tokenization_common.py:)

added = tokenizer.add_tokens(new_toks)
new_toks = ["aaaaa bbbbbb", "cccccccccdddddddd", "AAAAA BBBBBB", "CCCCCCCCCDDDDDDDD"]
added = tokenizer.add_tokens([AddedToken(tok, lstrip=True, rstrip=True) for tok in new_toks])

qqaatw (Contributor Author):

All added tokens should have lstrip and rstrip enabled, because in slow tokenizers every no_split_token strips the spaces to its left and right.

Kindly see lines 518-522:

for i, token in enumerate(tokens):
    if token in no_split_token:
        tok_extended = all_special_tokens_extended.get(token, None)
        left = tokens[i - 1] if i > 0 else None
        right = tokens[i + 1] if i < len(tokens) - 1 else None
        if isinstance(tok_extended, AddedToken):
            if tok_extended.rstrip and right:
                # A bit counter-intuitive but we strip the left of the string
                # since tok_extended.rstrip means the special token is eating all white spaces on its right
                tokens[i + 1] = right.lstrip()
            # Strip white spaces on the left
            if tok_extended.lstrip and left:
                tokens[i - 1] = left.rstrip()  # Opposite here
        else:
            # We strip left and right by default
            if right:
                tokens[i + 1] = right.lstrip()
            if left:
                tokens[i - 1] = left.rstrip()
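
(As a purely illustrative sketch of the point above: the checkpoint name and token strings below are placeholders chosen for this example, not part of the PR.)

from tokenizers import AddedToken
from transformers import BertTokenizer, BertTokenizerFast

new_toks = ["aaaaa bbbbbb", "cccccccccdddddddd"]

for cls in (BertTokenizer, BertTokenizerFast):
    tokenizer = cls.from_pretrained("bert-base-uncased")
    # Slow tokenizers strip spaces around added tokens by default (see the
    # snippet above); passing lstrip=True/rstrip=True asks the Rust tokenizer
    # to do the same, so per the discussion both should consume surrounding
    # spaces consistently.
    tokenizer.add_tokens([AddedToken(tok, lstrip=True, rstrip=True) for tok in new_toks])
    print(cls.__name__, tokenizer.tokenize("low aaaaa bbbbbb low"))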

@SaulLu (Contributor) commented on Dec 2, 2021:

Thank you so much for the explanations! I agree with you.

@qqaatw requested a review from SaulLu on December 1, 2021 18:34

@qqaatw (Contributor Author) commented Dec 2, 2021

@SaulLu All tests passed. Thanks.

@SaulLu (Contributor) left a comment:

Thanks a lot for your patience with this PR and for extending the test_added_tokens_do_lower_case test to fast tokenizers! 🎊

I just left 2 nits that would make this test even easier to read by homogenizing the checks between the tokenizers that take a do_lower_case argument and the others. If you don't have time to devote to this PR, no problem, I will apply these nits myself in a follow-up PR. 🙂

tests/test_tokenization_common.py (outdated hunk):
@@ -632,30 +631,35 @@ def test_added_tokens_do_lower_case(self):
text = special_token + " aaaaa bbbbbb low cccccccccdddddddd l " + special_token
text2 = special_token + " AAAAA BBBBBB low CCCCCCCCCDDDDDDDD l " + special_token

toks0 = tokenizer.tokenize(text) # toks before adding new_toks
toks_before_adding = tokenizer.tokenize(text) # toks before adding new_toks
SaulLu (Contributor):

I really like the renaming of the variables!


@SaulLu merged commit 66ea739 into huggingface:master on Dec 3, 2021