
[Pegasus] Refactor Tokenizer #8731

Merged

Conversation

patrickvonplaten
Contributor

@patrickvonplaten commented Nov 23, 2020

What does this PR do?

Fixes #8689, #8594, #8536

This PR refactors the Pegasus Tokenizer.

1st: It decouples the Pegasus tokenizer from the Reformer tokenizer, because the two don't really have much in common.
2nd: Pegasus' mask tokens are added. As stated in the paper, PEGASUS has two mask tokens which are required for pre-training. Those two tokens, <mask_1> and <mask_2>, are added according to https://github.com/google-research/pegasus/blob/master/pegasus/ops/pretrain_parsing_ops.cc#L66. This should solve, or at least enable a solution for, all three issues above.
3rd: IMO, all special tokens - which in the case of Pegasus are the tokens 2 to 104 - should be added to additional_special_tokens. This is done here as well; see the usage sketch right after this list.
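
For illustration, here is a minimal usage sketch of what the refactored tokenizer should expose. The checkpoint name is taken from the tests further down; the expected values in the comments are my reading of the changes, not output copied from this PR.

```python
from transformers import PegasusTokenizer

# Slow tokenizer; requires the sentencepiece package.
tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-large")

# The single-token mask used for MLM during pre-training.
print(tokenizer.mask_token)  # expected: "<mask_2>"

# The sentence mask <mask_1> plus the reserved <unk_...> fillers should now
# appear in additional_special_tokens (the tokens 2 to 104 mentioned above).
print(tokenizer.additional_special_tokens[:3])
```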

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors which may be interested in your PR.

@patrickvonplaten changed the title from [Pegasus] Refactor Tokenizer to [WIP][Pegasus] Refactor Tokenizer on Nov 23, 2020
@patrickvonplaten linked an issue on Nov 26, 2020 that may be closed by this pull request
@@ -32,31 +37,136 @@
}


class PegasusTokenizer(ReformerTokenizer):
Contributor Author

Pegasus has nothing to do with Reformer, so decouple it here.

pad_token="<pad>",
eos_token="</s>",
unk_token="<unk>",
mask_token="<mask_2>",
Contributor Author

Pegasus has two mask tokens that were previously not added to the tokenizer. They are defined as the 2nd and 3rd tokens in the original implementation: https://github.com/google-research/pegasus/blob/939830367bcf411193d2b5eca2f2f90f3f9260ca/pegasus/ops/pretrain_parsing_ops.cc#L66
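
For reference, a sketch of the reserved-token layout this implies; the id assignments are my reading of the linked file together with the PR description, not something quoted from the diff.

```python
# Hypothetical summary of the reserved ids after this change.
RESERVED_TOKENS = {
    0: "<pad>",
    1: "</s>",
    2: "<mask_1>",  # masks a whole gap sentence (GSG objective)
    3: "<mask_2>",  # masks a single token (MLM objective)
    # ids 4 to 104: "<unk_2>", "<unk_3>", ... reserved fillers, unused after pre-training
}
```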

Contributor Author

This resolves both issue #8536 and issue #8594.

Member

awesome!


Seems like the bos_token, which is supposed to be passed into the decoder, is missing?

additional_special_tokens=None,
**kwargs
):
if additional_special_tokens is not None:
Contributor Author

As Sam pointed out before, tokens 2-104 were only used for pre-training. I think it makes sense to add them to the additional_special_tokens in this case.

Member

Yes indeed, that's the good place to put them.

@@ -71,10 +71,9 @@

class AlbertTokenizerFast(PreTrainedTokenizerFast):
"""
Construct a "fast" ALBERT tokenizer (backed by HuggingFace's `tokenizers` library). Based on `SentencePiece
<https://github.com/google/sentencepiece>`__. This tokenizer inherits from
Contributor Author

@thomwolf @LysandreJik @n1t0 - I don't think the fast tokenizers are based on google's sentencepiece anymore, so I removed this statement from all fast tokenizers.
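
For anyone curious, a quick way to inspect what the fast backend actually uses; the checkpoint name and the printed class are what I would expect, not something asserted by this diff.

```python
from transformers import AlbertTokenizerFast

# The "fast" tokenizer is backed by the Rust `tokenizers` library; its backend
# model is e.g. a Unigram model rather than Google's sentencepiece itself.
tok = AlbertTokenizerFast.from_pretrained("albert-base-v2")
print(type(tok.backend_tokenizer.model).__name__)  # e.g. "Unigram"
```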

Member

indeed

return PegasusTokenizer.from_pretrained("google/pegasus-large")

@unittest.skip("add_tokens does not work yet")
Contributor Author

The test seems to work, so I deleted the skip here.

Member

awesome!

]
vocab += [(f"unk_{i}", -100) for i in range(2, 2 + self.original_tokenizer.offset)]
Contributor Author

I think this was wrong previously -> it should have been "<unk_{i}>"
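
A standalone sketch of the difference, assuming offset = 103 (the number of reserved filler ids implied by the "tokens 2 to 104" range above):

```python
offset = 103
old = [f"unk_{i}" for i in range(2, 2 + offset)]    # "unk_2", "unk_3", ... never matches the slow tokenizer
new = [f"<unk_{i}>" for i in range(2, 2 + offset)]  # "<unk_2>", "<unk_3>", ... with angle brackets
print(old[0], new[0])
```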

Member

ok!

@patrickvonplaten changed the title from [WIP][Pegasus] Refactor Tokenizer to [Pegasus] Refactor Tokenizer on Nov 27, 2020
@patrickvonplaten linked an issue on Nov 27, 2020 that may be closed by this pull request
@patrickvonplaten
Contributor Author

Checked all slow and fast tests on GPU.

Member

@thomwolf left a comment

Looks good! Just a comment on the init with added tokens.

]
vocab += [(f"unk_{i}", -100) for i in range(2, 2 + self.original_tokenizer.offset)]
Member

ok!

pad_token="<pad>",
eos_token="</s>",
unk_token="<unk>",
mask_token="<mask_2>",
Member

awesome!

additional_special_tokens=None,
**kwargs
):
if additional_special_tokens is not None:
Member

Yes indeed, that's the good place to put them.

if mask_token_sent not in additional_special_tokens:
additional_special_tokens = [mask_token_sent] + additional_special_tokens
# fill additional tokens with ..., <unk_token_102> in case not all additional tokens are already taken
additional_special_tokens += [f"<unk_{i}>" for i in range(2, self.offset - len(additional_special_tokens))]
Member

You should check that these tokens are not already there (it's the case when you reload from a checkpoint saved with save_pretrained). You can check the logic in the init of T5.

Contributor Author

My plan was to only add tokens if the length is not full, via range(2, self.offset - len(additional_special_tokens))
-> I made it cleaner now, I think, with a raise ValueError test.
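
For illustration, a rough sketch of the kind of ValueError check mentioned here; the helper name, condition, and message are hypothetical, not the merged code.

```python
def _check_additional_special_tokens(additional_special_tokens):
    # Hypothetical guard: duplicates usually mean reserved tokens were passed
    # both explicitly and via the <unk_i> fill, so fail loudly instead of guessing.
    if len(set(additional_special_tokens)) != len(additional_special_tokens):
        raise ValueError(
            "additional_special_tokens must not contain duplicated tokens, "
            f"got {additional_special_tokens}."
        )
```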

Member

@n1t0 left a comment

LGTM!

Member

@LysandreJik left a comment

You removed the links to google/sentencepiece but you kept the "Based on SentencePiece".

It seems to me that if we reference SentencePiece then it's good to keep a link to the library, no? "Based on SentencePiece" means that it's based on the library IMO; maybe you wanted to say it's based on Unigram (or BPE if it's based on SentencePiece's BPE) instead?

Great changes, thanks for taking care of it!

Comment on lines +66 to +75
mask_token (:obj:`str`, `optional`, defaults to :obj:`"<mask_2>"`):
The token used for masking single token values. This is the token used when training this model with masked
language modeling (MLM). This is the token that the PEGASUS encoder will try to predict during pretraining.
It corresponds to `[MASK2]` in `PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive
Summarization <https://arxiv.org/pdf/1912.08777.pdf>`__.
mask_token_sent (:obj:`str`, `optional`, defaults to :obj:`"<mask_1>"`):
The token used for masking whole target sentences. This is the token used when training this model with gap
sentences generation (GSG). This is the sentence that the PEGASUS decoder will try to predict during
pretraining. It corresponds to `[MASK1]` in `PEGASUS: Pre-training with Extracted Gap-sentences for
Abstractive Summarization <https://arxiv.org/pdf/1912.08777.pdf>`__.
Member

Great docs here.
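
To make the two mask tokens concrete, a small sketch; the example text is made up and the assertion reflects the expected behaviour rather than output copied from the tests.

```python
from transformers import PegasusTokenizer

tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-large")

# <mask_1> stands in for a whole masked gap sentence (GSG),
# <mask_2> stands in for a single masked token (MLM).
text = "Important sentences get masked. <mask_1> Individual <mask_2> get masked too."
tokens = tokenizer.convert_ids_to_tokens(tokenizer(text).input_ids)

# Both mask tokens should survive tokenization as single pieces.
assert "<mask_1>" in tokens and "<mask_2>" in tokens
```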

@patrickvonplaten
Contributor Author

patrickvonplaten commented Nov 27, 2020

> You removed the links to google/sentencepiece but you kept the "Based on SentencePiece".
>
> It seems to me that if we reference SentencePiece then it's good to keep a link to the library, no? "Based on SentencePiece" means that it's based on the library IMO; maybe you wanted to say it's based on Unigram instead?
>
> Great changes, thanks for taking care of it!

Good point! I also think it would be nicer to have a link to it... For now, the text is always:

Construct a "fast" ALBERT tokenizer (backed by HuggingFace's `tokenizers` library). Based on SentencePiece.

=> So it's similar to what the other fast tokenizers have written in their docstrings. But I agree that it could be confusing, as "SentencePiece" doesn't really exist as an entity in tokenizers... I think the "fast" sentencepiece tokenizers are either BPE or Unigram in tokenizers, no? @thomwolf @n1t0. Should I change the comments and link to their respective tokenizers model instead? So to:

Construct a "fast" ALBERT tokenizer (backed by HuggingFace's `tokenizers` library). Based on `Unigram <link to unigram in tokenizers>`__ . 

@LysandreJik
Member

I think your proposal makes a lot of sense!

@@ -44,10 +44,17 @@ AlbertTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.AlbertTokenizer
:members: build_inputs_with_special_tokens, get_special_tokens_mask,
:members: __call__, build_inputs_with_special_tokens, get_special_tokens_mask,
Contributor Author

The doc still needs to be nicely done - waiting for @sgugger's feedback here. Almost no model doc files list their fast tokenizers in the docs. Also, I think all tokenizers should at least show the __call__ method in the docs, as it's the most important method. @sgugger, can you give me your opinion on which methods should be shown for the slow tokenizer and which for the fast tokenizer? I'll then apply it to all model docs in this PR.

Contributor Author

Actually, the PR is getting a bit too big already - I will open a new PR to add the fast tokenizer docs and merge this one.

@patrickvonplaten merged commit 5ced23d into huggingface:master on Nov 29, 2020
stas00 pushed a commit to stas00/transformers that referenced this pull request Dec 5, 2020
* refactor

* further refactor

* fix the rest tomorrow

* save intermediate

* finish slow tokenizer

* make more tests pass

* finish refactor

* fix comment

* clean further

* fix name

* fix naming

* Update src/transformers/models/reformer/tokenization_reformer.py

* Apply suggestions from code review

* Apply suggestions from code review

* refactor

* fix init tokenizers

* refactor

* improve convert

* refactor

* correct convert slow tokenizer

* final fix for Pegasus Tok

* remove ipdb

* improve links
@patrickvonplaten deleted the refactor_pegasus_tok branch on May 2, 2021
Development

Successfully merging this pull request may close these issues.

[Question] Pegasus tokenizer
PEGASUS do not have mask token
5 participants