Below is a list of commonly used tokenizers which can be implemented. @mattdangerw, @chenmoneygithub, let me know your thoughts on these tokenizers (whether we should implement all of these, or skip some). Thanks!
- Space + Punctuation Tokenizer: I know that subword tokenizers are in vogue nowadays. However, a model as recent as Transformer-XL uses a generic space + punctuation tokenizer, so it could be added to the library for the sake of completeness (see the sketch after this list).
- Byte Pair Encoding (BPE): GPT-2 and RoBERTa use this.
- WordPiece: Has already been implemented. BERT uses this.
- SentencePiece: Mentioned in #27 (Add a SentencePiece tokenizer layer). XLNet uses this.
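For concreteness, here is a minimal sketch of what a generic space + punctuation tokenizer could look like (the regex and the `space_punct_tokenize` name are illustrative, not taken from any existing model):

```python
import re

# Words become tokens, and each punctuation mark becomes its own token.
# This is an illustrative regex, not the exact rule any particular model uses.
def space_punct_tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

print(space_punct_tokenize("Transformer-XL uses it, doesn't it?"))
# ['Transformer', '-', 'XL', 'uses', 'it', ',', 'doesn', "'", 't', 'it', '?']
```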
A note on the differences between the subword tokenizers mentioned above (source: https://blog.floydhub.com/tokenization-nlp/):
- BPE: Uses only the frequency of occurrences to identify the best merge at every iteration, until it reaches the predefined vocabulary size (see the training-loop sketch after this list).
- WordPiece: Similar to BPE and also uses frequency of occurrences to identify potential merges, but makes the final decision based on the likelihood of the merged token.
- Unigram: A fully probabilistic model which does not use frequency of occurrences. Instead, it trains a language model probabilistically, removes the token which improves the overall likelihood the least, and starts over until it reaches the final vocabulary size.
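To make the BPE vs. WordPiece distinction concrete, here is a minimal sketch of the BPE training loop in plain Python, along the lines of the Sennrich et al. reference implementation (the toy corpus and the fixed merge budget are made up for illustration; real training runs until the target vocabulary size is reached):

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Rewrite the vocab so every occurrence of `pair` becomes one merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: each word is pre-split into characters and mapped to its frequency.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

for _ in range(10):  # fixed merge budget for the demo; real BPE stops at a target vocab size
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # purely frequency-based: most common pair wins
    vocab = merge_pair(best, vocab)
    print(best)
```

WordPiece differs mainly in the selection criterion: instead of picking the pair with the highest raw count, it scores each candidate pair roughly as freq(pair) / (freq(left) * freq(right)), which approximates the likelihood gain from performing the merge.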
If you think it's fine, I'll get started with Space + Punctuation Tokenizer and BPE Tokenizer.
Note on Unigram Tokenizer: Not sure if any model uses this. Can be skipped.
For the space and punctuation tokenizer, we have the TextVectorization layer in core Keras. It might be nice to wrap that in a Tokenizer base class at some point for consistency, but that's something we are still discussing. Let's not pull that work in now.
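For reference, a minimal sketch of what TextVectorization gives us today (note that the built-in `standardize` option strips punctuation rather than emitting it as separate tokens, so a true space + punctuation tokenizer would need a custom callable):

```python
import tensorflow as tf

vectorizer = tf.keras.layers.TextVectorization(
    standardize="lower_and_strip_punctuation",  # lowercase, then drop punctuation
    split="whitespace",                         # split the remainder on spaces
    output_mode="int",
)
vectorizer.adapt(["The quick brown fox.", "Hello, world!"])
print(vectorizer(["Hello fox!"]))  # integer token ids looked up from the adapted vocab
```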
For the BPE tokenizer, this is indeed something we want, but the implementation will come with some challenges. I'll open an issue where we can discuss.
Unigram tokenization can be used through SentencePiece.
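For example, a minimal sketch with the `sentencepiece` Python package (the corpus path, model prefix, and vocab size below are placeholders):

```python
import sentencepiece as spm

# `model_type="unigram"` is what gives Unigram tokenization through SentencePiece.
# "corpus.txt" (one sentence per line) is assumed to exist; sizes are illustrative.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="unigram_demo",
    vocab_size=8000,
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="unigram_demo.model")
print(sp.encode("Unigram tokenization through SentencePiece.", out_type=str))
```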