Improve tokenizer tests #13594
Conversation
tests/test_tokenization_common.py (outdated diff)

-        tokenizers = [self.get_tokenizer(do_lower_case=True)] if self.test_slow_tokenizer else []
+        if not self.test_slow_tokenizer:
+            self.skipTest("This test is only for slow tokenizers")
+            return
This would be consistent with line 1670.
From the change you propose in the test_encode_decode_with_spaces test, I understand that Rust tokenizers now accept spaces in added tokens.
If this is the case, perhaps we should take the opportunity to modify this test to cover the Rust tokenizers as well (as suggested in the comment above). What do you think? 🙂
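For context, here is a minimal standalone sketch of what "spaces in added tokens" means for a fast (Rust-backed) tokenizer; the checkpoint name and the expected output are illustrative assumptions, not something taken from this PR:

```python
from transformers import AutoTokenizer

# Load a fast (Rust-backed) tokenizer; "bert-base-cased" is only an illustrative checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)

# Add a token that contains a space and check that it survives tokenization as a single piece.
tokenizer.add_tokens(["new york"])
print(tokenizer.tokenize("I love new york"))  # expected to contain "new york" as one token
```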
Sure, that makes sense. I'll work on it.
After doing some tests: although the Rust tokenizer supports spaces in added tokens, two other problems arise, and they seem unrelated to whether the tokens contain spaces or not.
The first one looks like the Rust tokenizer doesn't lowercase added tokens, so the first assertion fails:
self = <tests.test_tokenization_albert.AlbertTokenizationTest testMethod=test_added_tokens_do_lower_case>

    def test_added_tokens_do_lower_case(self):
        # TODO(thom) activate fast tokenizer tests once Rust tokenizers accepts white spaces in added tokens.
        #if not self.test_slow_tokenizer:
        #    self.skipTest("This test is only for slow tokenizers")
        #    return
        tokenizers = self.get_tokenizers(fast=True, do_lower_case=True)
        for tokenizer in tokenizers:
            with self.subTest(f"{tokenizer.__class__.__name__}"):
                #if not hasattr(tokenizer, "do_lower_case") or not tokenizer.do_lower_case:
                #    continue
                special_token = tokenizer.all_special_tokens[0]
                text = special_token + " aaaaa bbbbbb low cccccccccdddddddd l " + special_token
                text2 = special_token + " AAAAA BBBBBB low CCCCCCCCCDDDDDDDD l " + special_token
                toks0 = tokenizer.tokenize(text)  # toks before adding new_toks
                new_toks = ["aaaaa bbbbbb", "cccccccccdddddddd", "AAAAA BBBBBB", "CCCCCCCCCDDDDDDDD"]
                added = tokenizer.add_tokens(new_toks)
>               self.assertEqual(added, 2, tokenizer.get_added_vocab())
E               AssertionError: 4 != 2 : {'CCCCCCCCCDDDDDDDD': 30003, 'cccccccccdddddddd': 30001, 'aaaaa bbbbbb': 30000, 'AAAAA BBBBBB': 30002}
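To make the divergence concrete, here is a hedged standalone sketch of the behavior this failure suggests; the checkpoint name is illustrative, and the expected return values follow from the assertion error above rather than from running this exact snippet:

```python
from transformers import AlbertTokenizer, AlbertTokenizerFast

new_toks = ["aaaaa bbbbbb", "AAAAA BBBBBB"]

slow = AlbertTokenizer.from_pretrained("albert-base-v2", do_lower_case=True)
fast = AlbertTokenizerFast.from_pretrained("albert-base-v2", do_lower_case=True)

# The slow tokenizer lowercases added tokens when do_lower_case=True, so the two strings
# collapse into a single vocabulary entry; the fast tokenizer keeps the casing and adds both
# (which is why the test above sees 4 added tokens instead of the expected 2).
print(slow.add_tokens(new_toks))  # expected: 1
print(fast.add_tokens(new_toks))  # expected: 2
```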
The second one seems related to the special handling of the mask token in AlbertTokenizerFast, resulting in the mask token not being treated as a special token:
mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
                # Check that none of the special tokens are lowercased
                sequence_with_special_tokens = "A " + " yEs ".join(tokenizer.all_special_tokens) + " B"
                tokenized_sequence = tokenizer.tokenize(sequence_with_special_tokens)
                for special_token in tokenizer.all_special_tokens:
>                   self.assertTrue(special_token in tokenized_sequence, f"{tokenizer.all_special_tokens} {tokenized_sequence}")
E                   AssertionError: False is not true : ['[CLS]', '[SEP]', '<unk>', '<pad>', '[MASK]'] ['▁a', '▁', '[CLS]', '▁yes', '▁', '[SEP]', '▁yes', '▁', '<unk>', '▁yes', '▁', '<pad>', '▁yes', '▁[', 'mask', ']', '▁b']
Thanks a lot for detailing the errors that prevent extending this test to fast tokenizers.
I totally agree with your first analysis. Indeed, slow and fast tokenizers do not add the new tokens in the same way when do_lower_case is True. We should see on our side whether it's something we want to harmonize or not (cc @sgugger and @LysandreJik).
Regarding the second error, I think it's because we do mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) and not mask_token = AddedToken(mask_token, lstrip=True, rstrip=False, normalized=False). Did you have something else in mind?
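For readers following along, a hedged sketch of the fix being suggested here; normalized is a real flag on tokenizers.AddedToken, but whether this alone is the complete fix for the Albert fast tokenizer is an assumption:

```python
from tokenizers import AddedToken

# Mark the mask token as non-normalized so the fast tokenizer's normalization pipeline
# (e.g. lowercasing) does not rewrite it before added-token matching, keeping "[MASK]" intact.
mask_token = AddedToken("[MASK]", lstrip=True, rstrip=False, normalized=False)
```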
Oh, understood! That seems like the right behavior for fast tokenizers! It is slightly different from the slow one, but that's okay in this instance.
Thanks again! For the first problem, what do you think about a test that distinguishes between slow and fast tokenizers, like what is done in this test for example?
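One possible shape for such a branch, sketched against the conventions of tests/test_tokenization_common.py; this fragment lives inside the common test method (so it is not standalone), and the expected counts and the PreTrainedTokenizerFast check are assumptions rather than the final implementation:

```python
from transformers import PreTrainedTokenizerFast

tokenizers = self.get_tokenizers(do_lower_case=True)
for tokenizer in tokenizers:
    with self.subTest(f"{tokenizer.__class__.__name__}"):
        added = tokenizer.add_tokens(new_toks)
        if isinstance(tokenizer, PreTrainedTokenizerFast):
            # Fast tokenizers keep the original casing of added tokens.
            self.assertEqual(added, 4, tokenizer.get_added_vocab())
        else:
            # Slow tokenizers lowercase added tokens, so the cased/uncased pairs collapse.
            self.assertEqual(added, 2, tokenizer.get_added_vocab())
```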
Pinging @SaulLu, who has worked on this in the past.
Force-pushed from 21d330f to 5d5a98c.
Thank you very much for working on the tests!
I have left a few questions in the comments to understand all the proposed changes. Looking forward to reading your answers!
I'll take the liberty of pinging @Narsil to see if he can give any leads on how to unblock the failing pipeline tests.
I believe the pipeline tests are currently passing?
Hi @SaulLu, could you check whether all the changes look good to you when you're back?
So that we don't get lost, this PR is awaiting the outcome of PR #13930. 🙂
Force-pushed from 7b68e65 to 2bc0a28.
             new_toks = ["aaaaa bbbbbb", "cccccccccdddddddd", "AAAAA BBBBBB", "CCCCCCCCCDDDDDDDD"]
-            added = tokenizer.add_tokens(new_toks)
+            added = tokenizer.add_tokens([AddedToken(tok, lstrip=True, rstrip=True) for tok in new_toks])
All added tokens should have lstrip and rstrip, because every no_split_token in slow tokenizers strips left and right spaces. Kindly see lines 518~522.
transformers/src/transformers/tokenization_utils.py, lines 504 to 522 in 14cc50d:
for i, token in enumerate(tokens):
    if token in no_split_token:
        tok_extended = all_special_tokens_extended.get(token, None)
        left = tokens[i - 1] if i > 0 else None
        right = tokens[i + 1] if i < len(tokens) - 1 else None
        if isinstance(tok_extended, AddedToken):
            if tok_extended.rstrip and right:
                # A bit counter-intuitive but we strip the left of the string
                # since tok_extended.rstrip means the special token is eating all white spaces on its right
                tokens[i + 1] = right.lstrip()
            # Strip white spaces on the left
            if tok_extended.lstrip and left:
                tokens[i - 1] = left.rstrip()  # Opposite here
        else:
            # We strip left and right by default
            if right:
                tokens[i + 1] = right.lstrip()
            if left:
                tokens[i - 1] = left.rstrip()
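As a small usage sketch of the change under discussion (the checkpoint and sample text are illustrative), wrapping each new token in an AddedToken with lstrip/rstrip mirrors the default stripping that the slow-tokenizer code above applies around every no_split_token:

```python
from transformers import AddedToken, AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")  # illustrative checkpoint

new_toks = ["aaaaa bbbbbb", "cccccccccdddddddd"]
# lstrip/rstrip make the added tokens eat the surrounding whitespace, matching the
# default left/right stripping shown in the tokenization_utils.py snippet above.
tokenizer.add_tokens([AddedToken(tok, lstrip=True, rstrip=True) for tok in new_toks])

print(tokenizer.tokenize("low aaaaa bbbbbb low"))  # "aaaaa bbbbbb" should come out as one token
```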
Thank you so much for the explanation! I agree with you.
@SaulLu All tests passed. Thanks.
Thanks a lot for your patience with this PR and for extending the test_added_tokens_do_lower_case test to fast tokenizers! 🎊
I just left 2 nits that would make this test even easier to read by homogenizing the checks between the tokenizers with a do_lower_case argument and the others. If you don't have any time to devote to this PR, no problem, I will apply these nits myself in a follow-up PR. 🙂
@@ -632,30 +631,35 @@ def test_added_tokens_do_lower_case(self):
             text = special_token + " aaaaa bbbbbb low cccccccccdddddddd l " + special_token
             text2 = special_token + " AAAAA BBBBBB low CCCCCCCCCDDDDDDDD l " + special_token

-            toks0 = tokenizer.tokenize(text)  # toks before adding new_toks
+            toks_before_adding = tokenizer.tokenize(text)  # toks before adding new_toks
I really like the renaming of the variables!
What does this PR do?
Improve tokenizer common tests in tests/test_tokenization_common.py.

Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@LysandreJik