
[Pegasus] Refactor Tokenizer #8731

Merged

Conversation

patrickvonplaten
Contributor

@patrickvonplaten commented Nov 23, 2020

What does this PR do?

Fixes #8689, #8594, #8536

This PR refactors the Pegasus Tokenizer.

1st: It decouples the Pegasus tokenizer from the Reformer tokenizer, because the two don't really have much in common.
2nd: Pegasus' mask tokens are added. As stated in the paper, PEGASUS has two mask tokens which are required for pre-training. Those two tokens, <mask_1> and <mask_2>, are added according to https://github.com/google-research/pegasus/blob/master/pegasus/ops/pretrain_parsing_ops.cc#L66. This should solve, or at least enable a solution for, all three issues above.
3rd: IMO, all special tokens - which in the case of Pegasus are the tokens 2 to 104 - should be added to additional_special_tokens. This is done here as well; see the usage sketch right after this list.
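
For illustration, here is a minimal usage sketch of what the refactored tokenizer should expose. The checkpoint name is taken from the tests further down; the expected values in the comments are my reading of the changes, not output copied from this PR.

```python
from transformers import PegasusTokenizer

# Slow tokenizer; requires the sentencepiece package.
tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-large")

# The single-token mask used for MLM during pre-training.
print(tokenizer.mask_token)  # expected: "<mask_2>"

# The sentence mask <mask_1> plus the reserved <unk_...> fillers should now
# appear in additional_special_tokens (the tokens 2 to 104 mentioned above).
print(tokenizer.additional_special_tokens[:3])
```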

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors which may be interested in your PR.

@patrickvonplaten changed the title from [Pegasus] Refactor Tokenizer to [WIP][Pegasus] Refactor Tokenizer on Nov 23, 2020
@patrickvonplaten linked an issue on Nov 26, 2020 that may be closed by this pull request
@@ -32,31 +37,136 @@
}


class PegasusTokenizer(ReformerTokenizer):
Contributor Author

Pegasus has nothing to do with Reformer, so decouple it here.

pad_token="<pad>",
eos_token="</s>",
unk_token="<unk>",
mask_token="<mask_2>",
Contributor Author

Pegasus has two mask tokens that were previously not added to the tokenizer. They are defined as the 2nd and 3rd tokens in the original implementation: https://github.com/google-research/pegasus/blob/939830367bcf411193d2b5eca2f2f90f3f9260ca/pegasus/ops/pretrain_parsing_ops.cc#L66
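
For reference, a sketch of the reserved-token layout this implies; the id assignments are my reading of the linked file together with the PR description, not something quoted from the diff.

```python
# Hypothetical summary of the reserved ids after this change.
RESERVED_TOKENS = {
    0: "<pad>",
    1: "</s>",
    2: "<mask_1>",  # masks a whole gap sentence (GSG objective)
    3: "<mask_2>",  # masks a single token (MLM objective)
    # ids 4 to 104: "<unk_2>", "<unk_3>", ... reserved fillers, unused after pre-training
}
```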

Contributor Author

This resolves both issue #8536 and issue #8594.

Member

awesome!


Seems like the bos_token, which is supposed to be passed into the decoder, is missing?

additional_special_tokens=None,
**kwargs
):
if additional_special_tokens is not None:
Contributor Author

As Sam pointed out before, tokens 2-104 were only used for pre-training. I think it makes sense to add them to the additional_special_tokens in this case.

Member

Yes indeed, that's the good place to put them.

@@ -71,10 +71,9 @@

class AlbertTokenizerFast(PreTrainedTokenizerFast):
"""
Construct a "fast" ALBERT tokenizer (backed by HuggingFace's `tokenizers` library). Based on `SentencePiece
<https://github.com/google/sentencepiece>`__. This tokenizer inherits from
Contributor Author

@thomwolf @LysandreJik @n1t0 - I don't think the fast tokenizers are based on google's sentencepiece anymore, so I removed this statement from all fast tokenizers.
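
For anyone curious, a quick way to inspect what the fast backend actually uses; the checkpoint name and the printed class are what I would expect, not something asserted by this diff.

```python
from transformers import AlbertTokenizerFast

# The "fast" tokenizer is backed by the Rust `tokenizers` library; its backend
# model is e.g. a Unigram model rather than Google's sentencepiece itself.
tok = AlbertTokenizerFast.from_pretrained("albert-base-v2")
print(type(tok.backend_tokenizer.model).__name__)  # e.g. "Unigram"
```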

Member

indeed

return PegasusTokenizer.from_pretrained("google/pegasus-large")

@unittest.skip("add_tokens does not work yet")
Contributor Author

The test seems to work, so I deleted the skip here.

Member

awesome!

]
vocab += [(f"unk_{i}", -100) for i in range(2, 2 + self.original_tokenizer.offset)]
Contributor Author

I think this was wrong previously -> it should have been "<unk_{i}>"
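
A standalone sketch of the difference, assuming offset = 103 (the number of reserved filler ids implied by the "tokens 2 to 104" range above):

```python
offset = 103
old = [f"unk_{i}" for i in range(2, 2 + offset)]    # "unk_2", "unk_3", ... never matches the slow tokenizer
new = [f"<unk_{i}>" for i in range(2, 2 + offset)]  # "<unk_2>", "<unk_3>", ... with angle brackets
print(old[0], new[0])
```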

Member

ok!

@patrickvonplaten changed the title from [WIP][Pegasus] Refactor Tokenizer to [Pegasus] Refactor Tokenizer on Nov 27, 2020
@patrickvonplaten linked an issue on Nov 27, 2020 that may be closed by this pull request
@patrickvonplaten
Contributor Author

Checked all slow and fast tests on GPU.

Member

@thomwolf left a comment

Looks good! Just a comment on the init with added tokens.

]
vocab += [(f"unk_{i}", -100) for i in range(2, 2 + self.original_tokenizer.offset)]
Member

ok!

pad_token="<pad>",
eos_token="</s>",
unk_token="<unk>",
mask_token="<mask_2>",
Member

awesome!

additional_special_tokens=None,
**kwargs
):
if additional_special_tokens is not None:
Member

Yes indeed, that's the good place to put them.

if mask_token_sent not in additional_special_tokens:
additional_special_tokens = [mask_token_sent] + additional_special_tokens
# fill additional tokens with ..., <unk_token_102> in case not all additional tokens are already taken
additional_special_tokens += [f"<unk_{i}>" for i in range(2, self.offset - len(additional_special_tokens))]
Member

You should check that these tokens are not already there (it's the case when you reload from a checkpoint saved with save_pretrained). You can check the logic in the init of T5.

Contributor Author

My plan was to only add tokens if the length is not full, via range(2, self.offset - len(additional_special_tokens))
-> I made it cleaner now, I think, with a raise ValueError test.
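
For illustration, a rough sketch of the kind of ValueError check mentioned here; the helper name, condition, and message are hypothetical, not the merged code.

```python
def _check_additional_special_tokens(additional_special_tokens):
    # Hypothetical guard: duplicates usually mean reserved tokens were passed
    # both explicitly and via the <unk_i> fill, so fail loudly instead of guessing.
    if len(set(additional_special_tokens)) != len(additional_special_tokens):
        raise ValueError(
            "additional_special_tokens must not contain duplicated tokens, "
            f"got {additional_special_tokens}."
        )
```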

Member

@n1t0 left a comment

LGTM!

Member

@LysandreJik left a comment

You removed the links to google/sentencepiece but you kept the "Based on SentencePiece".

It seems to me that if we reference SentencePiece then it's good to keep a link to the library, no? "Based on SentencePiece" means that it's based on the library IMO; maybe you wanted to say it's based on Unigram (or BPE if it's based on SentencePiece's BPE) instead?

Great changes, thanks for taking care of it!

Comment on lines +66 to +75
mask_token (:obj:`str`, `optional`, defaults to :obj:`"<mask_2>"`):
The token used for masking single token values. This is the token used when training this model with masked
language modeling (MLM). This is the token that the PEGASUS encoder will try to predict during pretraining.
It corresponds to `[MASK2]` in `PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive
Summarization <https://arxiv.org/pdf/1912.08777.pdf>`__.
mask_token_sent (:obj:`str`, `optional`, defaults to :obj:`"<mask_1>"`):
The token used for masking whole target sentences. This is the token used when training this model with gap
sentences generation (GSG). This is the sentence that the PEGASUS decoder will try to predict during
pretraining. It corresponds to `[MASK1]` in `PEGASUS: Pre-training with Extracted Gap-sentences for
Abstractive Summarization <https://arxiv.org/pdf/1912.08777.pdf>`__.
Member

Great docs here.
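
To make the two mask tokens concrete, a small sketch; the example text is made up and the assertion reflects the expected behaviour rather than output copied from the tests.

```python
from transformers import PegasusTokenizer

tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-large")

# <mask_1> stands in for a whole masked gap sentence (GSG),
# <mask_2> stands in for a single masked token (MLM).
text = "Important sentences get masked. <mask_1> Individual <mask_2> get masked too."
tokens = tokenizer.convert_ids_to_tokens(tokenizer(text).input_ids)

# Both mask tokens should survive tokenization as single pieces.
assert "<mask_1>" in tokens and "<mask_2>" in tokens
```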

@patrickvonplaten
Contributor Author

patrickvonplaten commented Nov 27, 2020

> You removed the links to google/sentencepiece but you kept the "Based on SentencePiece".
>
> It seems to me that if we reference SentencePiece then it's good to keep a link to the library, no? "Based on SentencePiece" means that it's based on the library IMO; maybe you wanted to say it's based on Unigram instead?
>
> Great changes, thanks for taking care of it!

Good point! I also think it would be nicer to have a link to it... For now, the text is always:

Construct a "fast" ALBERT tokenizer (backed by HuggingFace's `tokenizers` library). Based on SentencePiece.

=> So it's similar to what the other fast tokenizers have written in their docstrings. But I agree that it could be confusing, as "SentencePiece" doesn't really exist as an entity in tokenizers... I think the "fast" sentencepiece tokenizers are either BPE or Unigram in tokenizers, no? @thomwolf @n1t0. Should I change the comments and link to their respective tokenizers model instead? So to:

Construct a "fast" ALBERT tokenizer (backed by HuggingFace's `tokenizers` library). Based on `Unigram <link to unigram in tokenizers>`__ . 

@LysandreJik
Member

I think your proposal makes a lot of sense!

@@ -44,10 +44,17 @@ AlbertTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.AlbertTokenizer
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
:members: __call__, build_inputs_with_special_tokens, get_special_tokens_mask,
Contributor Author

The doc still needs to be nicely done - waiting for @sgugger's feedback here. Almost no model doc files list their fast tokenizers in the docs. Also, I think all tokenizers should at least show the __call__ method in the docs, as it's the most important method. @sgugger, can you give me your opinion on which methods should be shown for the slow tokenizer and which for the fast tokenizer? I'll then apply it to all model docs in this PR.

Contributor Author

Actually, the PR is getting a bit too big already - I will open a new PR to add the fast tokenizer docs and merge this one.

@patrickvonplaten merged commit 5ced23d into huggingface:master on Nov 29, 2020
stas00 pushed a commit to stas00/transformers that referenced this pull request Dec 5, 2020
* refactor

* further refactor

* fix the rest tomorrow

* save intermediate

* finish slow tokenizer

* make more tests pass

* finish refactor

* fix comment

* clean further

* fix name

* fix naming

* Update src/transformers/models/reformer/tokenization_reformer.py

* Apply suggestions from code review

* Apply suggestions from code review

* refactor

* fix init tokenizers

* refactor

* improve convert

* refactor

* correct convert slow tokenizer

* final fix for Pegasus Tok

* remove ipdb

* improve links
@patrickvonplaten deleted the refactor_pegasus_tok branch on May 2, 2021
Development

Successfully merging this pull request may close these issues.

[Question] Pegasus tokenizer
PEGASUS do not have mask token
5 participants