Tokenization plays a crucial role in Natural Language Processing. However, comparative studies focused on tokenization are rare, especially for Natural Language Generation with small datasets and small model architectures. To address this gap, this repository presents a comparative analysis of the impact of four distinct tokenization approaches on the performance of a Neural Machine Translation task. By doing so, I hope to establish a valuable benchmark that can serve as a reference point for future research.
Word Tokenization
- Splits text into words according to a separator such as whitespace
- The simplest and most intuitive tokenization methodology
- Requires a large vocab size to cover diverse expressions
- Easily runs into Out-of-Vocabulary problems, and has difficulty handling new words
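A minimal sketch of the idea, in plain Python. The toy vocabulary and the `[UNK]` token are illustrative, not taken from this repository:

```python
def word_tokenize(text, vocab, unk_id=0):
    """Split on whitespace and map each word to an ID; unknown words -> unk_id."""
    return [vocab.get(word, unk_id) for word in text.split()]

vocab = {"[UNK]": 0, "the": 1, "cat": 2, "sat": 3}

print(word_tokenize("the cat sat", vocab))  # -> [1, 2, 3]
print(word_tokenize("the dog sat", vocab))  # "dog" is out of vocabulary -> [1, 0, 3]
```

Any word outside the vocabulary collapses to the same `[UNK]` ID, which is exactly the OOV weakness noted above.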
Character Tokenization
- Splits text into individual characters
- Only requires a vocab size equal to the number of distinct characters appearing in the text
- Out-of-Vocabulary rarely occurs, and new words are easy to handle
- Difficult to train the model, because it has to learn a large number of token combinations
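A quick sketch showing why the character-level vocabulary stays small; the toy corpus is illustrative:

```python
# The vocabulary is just the set of characters observed in the corpus.
corpus = "the cat sat"
vocab = {ch: i for i, ch in enumerate(sorted(set(corpus)))}

def char_tokenize(text, vocab):
    # Map each character to its ID, skipping unseen characters.
    return [vocab[ch] for ch in text if ch in vocab]

print(len(vocab))  # only 7 distinct characters in this corpus
ids = char_tokenize("the cat", vocab)
```

The trade-off is sequence length: every word becomes many tokens, so the model must learn far longer dependencies.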
Sub-Word Tokenization
- An intermediate form between Word Tokenization and Character Tokenization
- Splits text into subwords, which are smaller than words yet larger than characters
- There are various algorithms for constructing subwords
- Copes flexibly with new expressions, while preventing token sequences from getting too long
- The most commonly used approach across various models
Word-Level Tokenizer
This is the “classic” tokenization algorithm. It lets you simply map words to IDs without anything fancy. This has the advantage of being really simple to use and understand, but it requires extremely large vocabularies for good coverage. Using this model requires a PreTokenizer. No choice is made by the model itself; it simply maps input tokens to IDs.
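The two-stage pipeline (pre-tokenize, then look up IDs) can be sketched as follows; the regex pre-tokenizer and toy vocabulary here are illustrative assumptions, not the repository's actual configuration:

```python
import re

def pre_tokenize(text):
    # A simple pre-tokenizer: lowercase, then split words from punctuation.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def encode(text, vocab, unk_id=0):
    # The word-level "model" itself makes no decisions:
    # it only maps each pre-tokenized unit to an ID.
    return [vocab.get(tok, unk_id) for tok in pre_tokenize(text)]

vocab = {"[UNK]": 0, "hello": 1, "world": 2, "!": 3}
print(encode("Hello, world!", vocab))  # "," is unknown -> [1, 0, 2, 3]
```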
Word Piece Tokenizer
This is a subword tokenization algorithm quite similar to BPE, used mainly by Google in models like BERT. It uses a greedy algorithm that tries to build long tokens first, splitting a word into multiple tokens when the entire word does not exist in the vocabulary. This differs from BPE, which starts from characters and builds bigger tokens where possible. It uses the famous ## prefix to identify tokens that are part of a word (i.e., not starting a word).
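The greedy longest-match-first encoding described above can be sketched in plain Python. The toy vocabulary and `[UNK]` fallback are illustrative:

```python
def wordpiece_encode(word, vocab):
    """Greedily match the longest piece in vocab; non-initial pieces get '##'."""
    tokens, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation piece inside a word
            if piece in vocab:
                cur = piece
                break
            end -= 1  # shrink the candidate until it matches
        if cur is None:
            return ["[UNK]"]  # no piece matches at this position
        tokens.append(cur)
        start = end
    return tokens

vocab = {"un", "##aff", "##able"}
print(wordpiece_encode("unaffable", vocab))  # -> ['un', '##aff', '##able']
```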
Byte Pair Encoding Tokenizer
One of the most popular subword tokenization algorithms. Byte-Pair Encoding works by starting with characters and merging those that are most frequently seen together, thus creating new tokens. It then works iteratively to build new tokens out of the most frequent pairs it sees in a corpus. BPE can represent words it has never seen using multiple subword tokens, and thus requires smaller vocabularies, with less chance of producing “unk” (unknown) tokens.
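The iterative merge procedure can be sketched as follows; the tiny corpus and merge count are illustrative, not the repository's training data:

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Learn merge rules by repeatedly joining the most frequent adjacent pair."""
    vocab = Counter(tuple(w) for w in words)  # words as symbol tuples, with counts
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges

merges = bpe_train(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)  # first 'l'+'o', then 'lo'+'w'
```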
Unigram
Unigram is also a subword tokenization algorithm. It works by trying to identify the best set of subword tokens to maximize the probability of a given sentence. It differs from BPE in that it is not deterministic based on a set of rules applied sequentially; instead, Unigram computes multiple ways of tokenizing a sentence and chooses the most probable one.
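Given per-subword probabilities, choosing the most probable segmentation is a dynamic program over split points. The toy probabilities below are illustrative; in real Unigram models they are learned with EM:

```python
import math

def best_segmentation(text, probs):
    """Pick the split maximizing the product of token probabilities (Viterbi)."""
    n = len(text)
    best = [(-math.inf, None)] * (n + 1)  # (best log-prob, backpointer)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(probs[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    assert best[n][0] > -math.inf, "text cannot be segmented with this vocab"
    tokens, end = [], n  # backtrack through the pointers
    while end > 0:
        start = best[end][1]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1]

probs = {"un": 0.1, "fit": 0.1, "u": 0.01, "n": 0.01, "f": 0.01, "i": 0.01, "t": 0.01}
print(best_segmentation("unfit", probs))  # -> ['un', 'fit']
```

Here "un" + "fit" beats the character-by-character split because two probable tokens outscore five improbable ones.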
| Model Setup | Training Setup |
|---|---|
| Architecture: Transformer | N_Epochs: 10 |
| Embedding Dimension: 256 | Batch Size: 32 |
| Hidden Dimension: 256 | LR: 5e-4 |
| FFN Dimension: 512 | iters_to_accumulate: 4 |
| N Heads: 8 | Gradient Clip Max Norm: 1 |
| N Layers: 3 | Apply AMP: True |
| Tokenizer Type | 10k Model Score | 20k Model Score | 30k Model Score |
|---|---|---|---|
| Word Level | 16.09 | 12.80 | 12.50 |
| Word Piece | 19.96 | 14.62 | 12.17 |
| BPE | 13.32 | 13.39 | 13.28 |
| Unigram | 13.58 | 13.88 | 15.64 |
```shell
git clone https://github.com/moon23k/Tokenizers.git
cd Tokenizers

python3 setup.py -tokenizer_type ['all', 'WL', 'WP', 'BPE', 'UNI'] \
                 -vocab_size [10k, 20k, 30k]

python3 run.py -mode [train, test, inference] \
               -tokenizer_type ['WL', 'WP', 'BPE', 'UNI'] \
               -vocab_size [10k, 20k, 30k]
```