[Pegasus] Refactor Tokenizer #8731
Conversation
@@ -32,31 +37,136 @@
}

class PegasusTokenizer(ReformerTokenizer):
Pegasus has nothing to do with Reformer, so decouple it here.
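For illustration, a minimal sketch of what the decoupled class could look like, inheriting from PreTrainedTokenizer and owning its own SentencePiece model (the exact signature and arguments in the merged code may differ):

import sentencepiece as spm
from transformers import PreTrainedTokenizer

class PegasusTokenizer(PreTrainedTokenizer):
    # Sketch only: inherit from PreTrainedTokenizer directly and load the
    # SentencePiece model here, instead of reusing ReformerTokenizer.
    def __init__(self, vocab_file, pad_token="<pad>", eos_token="</s>",
                 unk_token="<unk>", mask_token="<mask_2>",
                 mask_token_sent="<mask_1>", **kwargs):
        super().__init__(pad_token=pad_token, eos_token=eos_token,
                         unk_token=unk_token, mask_token=mask_token, **kwargs)
        self.mask_token_sent = mask_token_sent
        self.sp_model = spm.SentencePieceProcessor()
        self.sp_model.Load(vocab_file)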
pad_token="<pad>",
eos_token="</s>",
unk_token="<unk>",
mask_token="<mask_2>",
Pegasus has two masked tokens that were previously not added to the tokenizer. They are defined as the 2nd and 3rd token according to the original implementation: https://github.com/google-research/pegasus/blob/939830367bcf411193d2b5eca2f2f90f3f9260ca/pegasus/ops/pretrain_parsing_ops.cc#L66
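For reference, the id layout implied by the linked code would look roughly like this (a sketch of the reserved ids, not the verbatim implementation):

# Reserved ids implied by the original Pegasus vocab (sketch)
encoder = {
    0: "<pad>",
    1: "</s>",
    2: "<mask_1>",  # sentence mask used for GSG pre-training
    3: "<mask_2>",  # token mask used for MLM pre-training
}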
awesome!
Seems like the bos_token, which is supposed to be passed into the decoder, is missing?
additional_special_tokens=None,
**kwargs
):
if additional_special_tokens is not None:
As Sam pointed out before, tokens 2-104 were only used for pre-training. I think it makes sense to add them to the additional_special_tokens in this case.
yes indeed, that's the right place to put them.
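For illustration, the default list (when the user passes nothing) could then look roughly like this; the exact range bound in the merged code may differ:

# <mask_1> plus the unused pre-training placeholders <unk_2> ... <unk_102> (sketch)
additional_special_tokens = ["<mask_1>"] + [f"<unk_{i}>" for i in range(2, 103)]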
@@ -71,10 +71,9 @@

class AlbertTokenizerFast(PreTrainedTokenizerFast):
    """
    Construct a "fast" ALBERT tokenizer (backed by HuggingFace's `tokenizers` library). Based on `SentencePiece
    <https://github.com/google/sentencepiece>`__. This tokenizer inherits from
@thomwolf @LysandreJik @n1t0 - I don't think the fast tokenizers are based on google's sentencepiece anymore, so I removed this statement from all fast tokenizers.
indeed
return PegasusTokenizer.from_pretrained("google/pegasus-large")

@unittest.skip("add_tokens does not work yet")
The test seems to work, so this skip can be deleted here.
awesome!
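For context, a hypothetical round-trip check along the lines of what such an add_tokens test verifies (not the actual test from the repo):

from transformers import PegasusTokenizer

tok = PegasusTokenizer.from_pretrained("google/pegasus-large")
old_size = len(tok)
num_added = tok.add_tokens(["<new_token>"])  # should now work after the refactor
assert num_added == 1 and len(tok) == old_size + 1
new_id = tok.convert_tokens_to_ids("<new_token>")
assert tok.convert_ids_to_tokens(new_id) == "<new_token>"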
]
vocab += [(f"unk_{i}", -100) for i in range(2, 2 + self.original_tokenizer.offset)]
I think this was wrong previously -> it should have been "<unk_{i}>"
ok!
Checked all slow and fast tests on GPU.
…into refactor_pegasus_tok
Looks good! Just a comment on the init with added tokens.
if mask_token_sent not in additional_special_tokens:
    additional_special_tokens = [mask_token_sent] + additional_special_tokens
# fill additional tokens with ..., <unk_token_102> in case not all additional tokens are already taken
additional_special_tokens += [f"<unk_{i}>" for i in range(2, self.offset - len(additional_special_tokens))]
You should check that these tokens are not already there (it's the case when you reload from a checkpoint saved with save_pretrained). You can check the logic in the init from T5.
My plan was to only add tokens if the length is not yet full via range(2, self.offset - len(additional_special_tokens)) -> made it cleaner now, I think, with a raise ValueError and a test.
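A sketch of that validate-then-fill approach (the helper name and error message are made up for illustration; the merged code may differ):

def extend_special_tokens(additional_special_tokens, mask_token_sent="<mask_1>", offset=103):
    # offset = number of reserved pre-training slots in the Pegasus vocab
    tokens = list(additional_special_tokens)
    if mask_token_sent not in tokens:
        tokens = [mask_token_sent] + tokens
    if len(tokens) > offset:
        raise ValueError(
            f"Got {len(tokens)} additional special tokens, but only {offset} reserved slots exist."
        )
    # fill the remaining reserved slots with <unk_i> placeholders
    tokens += [f"<unk_{i}>" for i in range(2, 2 + offset - len(tokens))]
    return tokens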
LGTM!
You removed the links to google/sentencepiece but you kept the "Based on SentencePiece" part.
It seems to me that if we reference SentencePiece then it's good to keep a link to the library, no? "Based on SentencePiece" means that it's based on the library imo; maybe you wanted to say it's based on Unigram (or BPE if it's based on SentencePiece's BPE) instead?
Great changes, thanks for taking care of it!
mask_token (:obj:`str`, `optional`, defaults to :obj:`"<mask_2>"`):
    The token used for masking single token values. This is the token used when training this model with masked
    language modeling (MLM). This is the token that the PEGASUS encoder will try to predict during pretraining.
    It corresponds to `[MASK2]` in `PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive
    Summarization <https://arxiv.org/pdf/1912.08777.pdf>`__.
mask_token_sent (:obj:`str`, `optional`, defaults to :obj:`"<mask_1>"`):
    The token used for masking whole target sentences. This is the token used when training this model with gap
    sentences generation (GSG). This is the sentence that the PEGASUS decoder will try to predict during
    pretraining. It corresponds to `[MASK1]` in `PEGASUS: Pre-training with Extracted Gap-sentences for
    Abstractive Summarization <https://arxiv.org/pdf/1912.08777.pdf>`__.
Great docs here.
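A hypothetical quick sanity check of the two mask tokens described in that docstring (the ids assume the layout discussed earlier in this PR; not output copied from an actual run):

from transformers import PegasusTokenizer

tok = PegasusTokenizer.from_pretrained("google/pegasus-large")
print(tok.convert_tokens_to_ids("<mask_1>"))  # expected: 2, the GSG sentence mask
print(tok.convert_tokens_to_ids("<mask_2>"))  # expected: 3, the MLM token mask
print(tok.mask_token)                         # "<mask_2>"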
Good point! I also think it would be nicer to have a link to it... For now, the text is always:
=> So similar to what other FastTokenizers have written in their comments. But I agree that it could be confusing as "SentencePiece" doesn't really exist as an entity in
I think your proposal makes a lot of sense!
docs/source/model_doc/albert.rst
Outdated
@@ -44,10 +44,17 @@ AlbertTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.AlbertTokenizer
    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
    :members: __call__, build_inputs_with_special_tokens, get_special_tokens_mask,
The doc still needs to be nicely done. Waiting for @sgugger's feedback here. Almost no model doc files have their fast tokenizers in the doc. Also, I think all tokenizers should at least show the __call__ method in the doc, as it's the most important method. @sgugger, can you give me your opinion on which methods should be shown for the slow tokenizer and which for the fast tokenizer, and I'll apply it to all model docs in this PR.
Actually, the PR is getting a bit too big already - will open a new PR to add the fast tokenizer docs and merge this one.
Force-pushed from 19e1724 to 7b1a4a5:
* refactor
* further refactor
* fix the rest tomorrow
* save intermediate
* finish slow tokenizer
* make more tests pass
* finish refactor
* fix comment
* clean further
* fix name
* fix naming
* Update src/transformers/models/reformer/tokenization_reformer.py
* Apply suggestions from code review
* Apply suggestions from code review
* refactor
* fix init tokenizers
* refactor
* improve convert
* refactor
* correct convert slow tokenizer
* final fix for Pegasus Tok
* remove ipdb
* improve links
What does this PR do?
Fixes #8689, #8594, #8536
This PR refactors the Pegasus Tokenizer.
1st: It decouples the tokenizer from the Reformer Tokenizer because they don't really have much in common.
2nd: Pegasus' masked tokens are added. As stated in the paper, PEGASUS has two masked tokens which are required for pre-training. Those two tokens <mask_1> and <mask_2> are added according to https://github.com/google-research/pegasus/blob/master/pegasus/ops/pretrain_parsing_ops.cc#L66. This should solve, or at least enable a solution for, all three issues above.

3rd: IMO, all special tokens - which are in the case of Pegasus the tokens 2 to 104 - should be added to additional_special_tokens. This is done here as well.

Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.