moon23k/Tokenizers


Introduction

  Tokenization plays a crucial role in Natural Language Processing, yet comparative studies focused on tokenization are rare, especially for Natural Language Generation with small datasets and small model architectures. To address this gap, this repository presents a comparative analysis of how four distinct tokenization approaches affect performance on a Neural Machine Translation task. The goal is to establish a useful benchmark that can serve as a reference point for future research.



Background

Word Tokenization

  • Splits text into words using a separator such as whitespace, as sketched below
  • The simplest and most intuitive tokenization method
  • Requires a large vocabulary to cover diverse expressions
  • Prone to out-of-vocabulary (OOV) problems and struggles to handle new words
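
A minimal sketch of whitespace-based word tokenization in plain Python (the sentence is an invented example, not data from this repository):

text = "Tokenization plays a crucial role in NLP"
word_tokens = text.split()   # split on whitespace
print(word_tokens)           # ['Tokenization', 'plays', 'a', 'crucial', 'role', 'in', 'NLP']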

Character Tokenization

  • Splits text into individual characters, as sketched below
  • Requires only a vocabulary as large as the set of characters appearing in the text
  • Out-of-vocabulary tokens rarely occur, and new words are handled easily
  • Harder to train, because the model must learn many token combinations over long sequences
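
The same idea at the character level, again as a plain-Python illustration:

char_tokens = list("role in NLP")   # every character, including spaces, becomes a token
print(char_tokens)                  # ['r', 'o', 'l', 'e', ' ', 'i', 'n', ' ', 'N', 'L', 'P']
print(len(set(char_tokens)))        # the vocabulary is just the set of distinct characters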

Sub-Word Tokenization

  • An intermediate form between word and character tokenization, illustrated in the sketch after this list
  • Splits text into subwords, which are smaller than words yet larger than characters
  • Various algorithms exist for constructing the subword vocabulary
  • Copes flexibly with new expressions while keeping token sequences from growing too long
  • The most commonly used approach across modern models
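
A toy sketch of subword splitting against a hand-written vocabulary; the greedy longest-match strategy and the ## continuation prefix here are simplified illustrations, not this repository's actual code:

def split_into_subwords(word, vocab):
    # greedily match the longest known piece; non-initial pieces carry a '##' prefix
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                break
            end -= 1
        if end == start:                  # no known piece covers this position
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"token", "##ization", "##izer", "new", "##s"}
print(split_into_subwords("tokenization", toy_vocab))   # ['token', '##ization']
print(split_into_subwords("news", toy_vocab))           # ['new', '##s']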



Tokenizers

Word-Level Tokenizer

This is the “classic” tokenization algorithm. It lets you simply map words to IDs without anything fancy. It has the advantage of being really simple to use and understand, but it requires extremely large vocabularies for good coverage. Using this model requires a pre-tokenizer. The model makes no decisions on its own; it simply maps input tokens to IDs.
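
A minimal sketch of training such a word-level tokenizer with the Hugging Face tokenizers library; the library choice, the placeholder corpus, and the 10k vocabulary size are assumptions for illustration, not necessarily what this repository's setup.py does:

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer

corpus = ["a tiny toy corpus", "replace this with the real training data"]   # placeholder

tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()               # a pre-tokenizer is required for this model
trainer = WordLevelTrainer(vocab_size=10000, special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

print(tokenizer.encode("a toy sentence").tokens)     # words outside the vocabulary become [UNK]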


Word Piece Tokenizer

This is a subword tokenization algorithm quite similar to BPE, used mainly by Google in models like BERT. It uses a greedy algorithm that tries to build long words first, splitting into multiple tokens when entire words do not exist in the vocabulary. This differs from BPE, which starts from characters and builds bigger tokens where possible. WordPiece uses the famous ## prefix to identify tokens that are part of a word (i.e., not starting a word).
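
A corresponding sketch using the tokenizers library's WordPiece model (again an assumed setup, not necessarily the repository's exact configuration):

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer

corpus = ["a tiny toy corpus", "replace this with the real training data"]   # placeholder

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = WordPieceTrainer(vocab_size=10000, special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

print(tokenizer.encode("tokenization").tokens)       # pieces inside a word carry the ## prefix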


Byte Pair Encoding Tokenizer

One of the most popular subword tokenization algorithms. Byte-Pair Encoding starts from characters and merges those that appear together most frequently, creating new tokens, then works iteratively to build further tokens out of the most frequent pairs seen in a corpus. BPE can compose words it has never seen from multiple subword tokens, and thus requires smaller vocabularies with less chance of producing “unk” (unknown) tokens.
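
A toy sketch of the core BPE training loop described above; the word frequencies are invented, and the end-of-word marker used by real implementations is omitted for brevity:

from collections import Counter

def learn_bpe_merges(word_freqs, num_merges):
    # start from characters and repeatedly merge the most frequent adjacent pair
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])   # fuse the pair into one token
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

toy_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
print(learn_bpe_merges(toy_freqs, 4))   # e.g. [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]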


Unigram

Unigram is also a subword tokenization algorithm. It works by trying to identify the best set of subword tokens to maximize the probability of a given sentence. Unlike BPE, it is not based on a set of merge rules applied sequentially; instead, Unigram considers multiple ways of tokenizing a sentence and chooses the most probable one.
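
A toy sketch of how a trained unigram model selects the most probable segmentation; the subword vocabulary and its log-probabilities below are invented for illustration (real training also has to estimate these probabilities, e.g. with EM):

import math

def unigram_segment(text, log_prob):
    # Viterbi search over all segmentations: keep the best score for every prefix
    n = len(text)
    best_score = [-math.inf] * (n + 1)
    best_split = [0] * (n + 1)
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_prob and best_score[start] + log_prob[piece] > best_score[end]:
                best_score[end] = best_score[start] + log_prob[piece]
                best_split[end] = start
    pieces, end = [], n                      # backtrack from the end of the string
    while end > 0:
        start = best_split[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]

toy_log_prob = {"token": -2.0, "ization": -3.5, "tokens": -4.0}
toy_log_prob.update({ch: -6.0 for ch in "tokeniza"})     # single characters as a fallback
print(unigram_segment("tokenization", toy_log_prob))     # ['token', 'ization']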



Experimental Setups

Model Setup

  • Architecture: Transformer
  • Embedding Dimension: 256
  • Hidden Dimension: 256
  • FFN Dimension: 512
  • N Heads: 8
  • N Layers: 3

Training Setup

  • N_Epochs: 10
  • Batch Size: 32
  • LR: 5e-4
  • iters_to_accumulate: 4
  • Gradient Clip Max Norm: 1
  • Apply AMP: True
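
The same settings expressed as a plain Python configuration dictionary, roughly how they might be passed to a training script; the key names are illustrative, not the repository's actual argument names:

config = {
    # model setup
    "architecture": "Transformer",
    "emb_dim": 256,
    "hidden_dim": 256,
    "ffn_dim": 512,
    "n_heads": 8,
    "n_layers": 3,
    # training setup
    "n_epochs": 10,
    "batch_size": 32,
    "lr": 5e-4,
    "iters_to_accumulate": 4,
    "grad_clip_max_norm": 1,
    "apply_amp": True,
}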



Evaluation

  Tokenizer Type    10k Model Score    20k Model Score    30k Model Score
  Word Level             16.09              12.80              12.50
  Word Piece             19.96              14.62              12.17
  BPE                    13.32              13.39              13.28
  Unigram                13.58              13.88              15.64



How to Use

git clone https://github.com/moon23k/Tokenizers.git
cd Tokenizers
python3 setup.py -tokenizer_type ['all', 'WL', 'WP', 'BPE', 'UNI']
                 -vocab_size [10k, 20k, 30k]
python3 run.py -mode [train, test, inference]
               -tokenizer_type ['WL', 'WP', 'BPE', 'UNI']
               -vocab_size [10k, 20k, 30k]


