Add Remaining Tokenizers #45

Closed · 1 of 4 tasks
abheesht17 opened this issue Mar 16, 2022 · 3 comments
Comments

abheesht17 (Collaborator) commented Mar 16, 2022

Below is a list of commonly used tokenizers that could be implemented. @mattdangerw, @chenmoneygithub, let me know your thoughts on these tokenizers (whether we should implement all of them, or skip some). Thanks!

  • Space + Punctuation Tokenizer: I know that subword tokenizers are in vogue nowadays. However, a model as recent as Transformer-XL uses a generic space + punctuation tokenizer, so this could be added to the library for the sake of completeness (see the small sketch after this list).

  • Byte Pair Encoding (BPE): GPT-2 and RoBERTa use this.

  • WordPiece: Has already been implemented. BERT uses this.

  • SentencePiece: Mentioned in Add a SentencePiece tokenizer layer #27. XLNet uses this.
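
For illustration, a space + punctuation split is roughly a one-line regex. This is just a sketch of the behaviour, not a proposed KerasNLP API:

```python
import re

def space_punct_tokenize(text):
    # Keep runs of word characters together, and emit each punctuation mark
    # as its own token, similar in spirit to Transformer-XL's preprocessing.
    return re.findall(r"\w+|[^\w\s]", text)

print(space_punct_tokenize("Hello, world! It's 2022."))
# ['Hello', ',', 'world', '!', 'It', "'", 's', '2022', '.']
```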

A note on the differences between the subword tokenizers mentioned above (source: https://blog.floydhub.com/tokenization-nlp/):

BPE: Just uses the frequency of occurrences to identify the best pair to merge at every iteration until it reaches the predefined vocabulary size (a toy sketch of this loop follows these notes).

WordPiece: Similar to BPE and uses frequency of occurrences to identify potential merges, but makes the final decision based on the likelihood of the merged token.

Unigram: A fully probabilistic approach which does not rely on merge frequencies. Instead, it trains an LM over the current vocabulary, removes the token whose removal hurts the overall likelihood the least, and repeats until it reaches the final vocabulary size.
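
To make the frequency-based merge rule concrete, here is a minimal, framework-free sketch of BPE training over pre-tokenized word counts (the function and variable names are illustrative only, not a proposed API):

```python
from collections import Counter

def bpe_merges(word_counts, num_merges):
    # word_counts: mapping from a pre-tokenized word to its corpus frequency.
    # Each word starts out as a tuple of single characters.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        # BPE's rule: greedily merge the single most frequent pair.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = count
        vocab = new_vocab
    return merges

print(bpe_merges({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 4))
```

WordPiece would keep the same merge loop but score candidate pairs by the likelihood of the merged token rather than by raw frequency.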

If you think it's fine, I'll get started with Space + Punctuation Tokenizer and BPE Tokenizer.

Note on Unigram Tokenizer: Not sure if any model uses this. Can be skipped.

abheesht17 changed the title from "Add Tokenizers" to "Add Remaining Tokenizers" on Mar 16, 2022
mattdangerw (Member) commented:

Thanks!

  • For space and punctuation tokenizer, we have the TextVectorization layer in core Keras. It might be nice to wrap that in a Tokenizer base class form at some point for consistency, but that's something we are still discussing. Let's not pull that work now.
  • For the BPE tokenizer, this is indeed something we want, but the implementation will come with some challenges. I'll open an issue where we can discuss.
  • Unigram tokenization can be used through SentencePiece (quick sketch below).
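
For reference, a quick way to try unigram tokenization today is the standalone sentencepiece package (recent versions support the keyword-argument API below; the file name and vocab size are placeholders, and this says nothing about what the KerasNLP layer in #27 will look like):

```python
import sentencepiece as spm

# Train a unigram model (SentencePiece's default model type) on a plain-text corpus.
spm.SentencePieceTrainer.train(
    input="corpus.txt",        # placeholder path, one sentence per line
    model_prefix="unigram_demo",
    vocab_size=8000,
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="unigram_demo.model")
print(sp.encode("The quick brown fox.", out_type=str))
```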

abheesht17 (Collaborator, author) commented:

Ah, I see. Let's discuss BPE 👍🏼

mattdangerw (Member) commented:

Opened #46 for BPE. Let's close this, as it's sort of a catch-all issue.
