Add Remaining Tokenizers #45

Closed · 1 of 4 tasks
abheesht17 opened this issue Mar 16, 2022 · 3 comments
Comments

abheesht17 (Collaborator) commented Mar 16, 2022

Below is a list of commonly used tokenizers that could be implemented. @mattdangerw, @chenmoneygithub, let me know your thoughts on these tokenizers (whether we should implement all of them, or skip some). Thanks!

  • Space + Punctuation Tokenizer: I know that subword tokenizers are in vogue nowadays. However, a model as recent as Transformer-XL uses a generic space + punctuation tokenizer, so this could be added to the library for the sake of completeness (see the small sketch after this list).

  • Byte Pair Encoding (BPE): GPT-2 and RoBERTa use this.

  • WordPiece: Has already been implemented. BERT uses this.

  • SentencePiece: Mentioned in Add a SentencePiece tokenizer layer #27. XLNet uses this.
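
For illustration, a space + punctuation split is roughly a one-line regex. This is just a sketch of the behaviour, not a proposed KerasNLP API:

```python
import re

def space_punct_tokenize(text):
    # Keep runs of word characters together, and emit each punctuation mark
    # as its own token, similar in spirit to Transformer-XL's preprocessing.
    return re.findall(r"\w+|[^\w\s]", text)

print(space_punct_tokenize("Hello, world! It's 2022."))
# ['Hello', ',', 'world', '!', 'It', "'", 's', '2022', '.']
```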

A note on the differences between the subword tokenizers mentioned above (source: https://blog.floydhub.com/tokenization-nlp/):

BPE: Just uses the frequency of occurrences to identify the best pair to merge at every iteration until it reaches the predefined vocabulary size (a toy sketch of this loop follows these notes).

WordPiece: Similar to BPE and uses frequency of occurrences to identify potential merges, but makes the final decision based on the likelihood of the merged token.

Unigram: A fully probabilistic approach which does not rely on merge frequencies. Instead, it trains an LM over the current vocabulary, removes the token whose removal hurts the overall likelihood the least, and repeats until it reaches the final vocabulary size.
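
To make the frequency-based merge rule concrete, here is a minimal, framework-free sketch of BPE training over pre-tokenized word counts (the function and variable names are illustrative only, not a proposed API):

```python
from collections import Counter

def bpe_merges(word_counts, num_merges):
    # word_counts: mapping from a pre-tokenized word to its corpus frequency.
    # Each word starts out as a tuple of single characters.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        # BPE's rule: greedily merge the single most frequent pair.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = count
        vocab = new_vocab
    return merges

print(bpe_merges({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 4))
```

WordPiece would keep the same merge loop but score candidate pairs by the likelihood of the merged token rather than by raw frequency.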

If you think it's fine, I'll get started with Space + Punctuation Tokenizer and BPE Tokenizer.

Note on Unigram Tokenizer: Not sure if any model uses this. Can be skipped.

abheesht17 changed the title from "Add Tokenizers" to "Add Remaining Tokenizers" on Mar 16, 2022
mattdangerw (Member) commented:

Thanks!

  • For space and punctuation tokenizer, we have the TextVectorization layer in core Keras. It might be nice to wrap that in a Tokenizer base class form at some point for consistency, but that's something we are still discussing. Let's not pull that work now.
  • For the BPE tokenizer, this is indeed something we want, but the implementation will come with some challenges. I'll open an issue where we can discuss.
  • Unigram tokenization can be used through SentencePiece (quick sketch below).
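
For reference, a quick way to try unigram tokenization today is the standalone sentencepiece package (recent versions support the keyword-argument API below; the file name and vocab size are placeholders, and this says nothing about what the KerasNLP layer in #27 will look like):

```python
import sentencepiece as spm

# Train a unigram model (SentencePiece's default model type) on a plain-text corpus.
spm.SentencePieceTrainer.train(
    input="corpus.txt",        # placeholder path, one sentence per line
    model_prefix="unigram_demo",
    vocab_size=8000,
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="unigram_demo.model")
print(sp.encode("The quick brown fox.", out_type=str))
```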

abheesht17 (Collaborator, author) commented:

Ah, I see. Let's discuss BPE 👍🏼

mattdangerw (Member) commented:

Opened #46 for BPE. Let's close this, as it's sort of a catch-all issue.
