## Character Level Tokenization
Character-level tokenization is simple to implement, has practically no out-of-vocabulary issues, and is useful for tasks that require character-level analysis.

In [4]:
text = "I Lovel NLP."
ch_tokens = list(text)
print(ch_tokens)

['I', ' ', 'L', 'o', 'v', 'e', 'l', ' ', 'N', 'L', 'P', '.']
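
Because the set of characters is small, the whole vocabulary can be built directly from the text. The sketch below maps each character to an integer id and back; the names `char_to_id` and `id_to_char` are illustrative and not part of the original notebook.

```python
text = "I Lovel NLP."

# Build the vocabulary from the unique characters, sorted for a stable order.
vocab = sorted(set(text))
char_to_id = {ch: i for i, ch in enumerate(vocab)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

# Encode the text as a list of ids, then decode it back to a string.
ids = [char_to_id[ch] for ch in text]
decoded = "".join(id_to_char[i] for i in ids)

print(ids)
print(decoded)
```

Decoding returns the original string unchanged, which is why character-level models rarely need an unknown token.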


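## Word Level Tokenization
Each token is an individual word, with spaces and punctuation acting as delimiters that mark word boundaries. A plain `split(" ")` only honors spaces, so punctuation stays attached to the neighbouring word, whereas NLTK's `word_tokenize` separates it.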

In [5]:
word_tokens = text.split(" ")
print(word_tokens)

['I', 'Lovel', 'NLP.']


In [6]:
from nltk.tokenize import word_tokenize

word_tokens = word_tokenize(text)
print(word_tokens)

['I', 'Lovel', 'NLP', '.']
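
When NLTK is not available, a regular expression gives a rough approximation of the same splitting. This is only a sketch: the pattern below is an assumption chosen for this example, and `word_tokenize` handles many cases (contractions, abbreviations) that it does not.

```python
import re

text = "I Lovel NLP."

# \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)
print(regex_tokens)
```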


## Sub Word Tokenization
Subword tokenization combines the strengths of character-level and word-level tokenization, balancing vocabulary size against semantic representation.

### 1. Byte Pair Encoding (BPE)
BPE starts with a vocabulary of individual characters and iteratively merges the most frequent pair of adjacent characters or subwords. The process continues until the desired vocabulary size is reached.

**STEPS**
1. Initialize the vocabulary with individual characters.
2. Count the frequency of adjacent symbol pairs in the corpus.
3. Merge the most frequent pair and add it to the vocabulary.
4. Repeat steps 2 and 3 until the desired vocabulary size is reached.
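
The steps above can be sketched in plain Python. This is a toy version for illustration only: the function name `train_bpe`, the word list, and the use of a merge count instead of a target vocabulary size are all assumptions, and real implementations are far more optimized.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn BPE merges from a list of words (toy sketch)."""
    # Step 1: represent every word as a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Step 3: merge the most frequent pair everywhere it occurs.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus  # Step 4: repeat on the updated corpus.
    return merges, corpus

# Hypothetical toy corpus.
words = ["low", "low", "low", "lower", "lowest"]
merges, corpus = train_bpe(words, num_merges=2)
print(merges)
print(dict(corpus))
```

On this toy corpus the two learned merges build `lo` and then `low`, so the frequent word ends up as a single token while the rarer suffixes stay split into characters.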

#### 1.1 Byte Level Byte Pair Encoding
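Byte-level BPE, used by GPT-2, takes bytes rather than characters as the base vocabulary. This fixes the base vocabulary size at 256 and lets the tokenizer handle any text without an unknown token. Note that the class below uses the standard `BPE` model with a `Whitespace` pre-tokenizer, so characters that were not seen during training still map to `[UNK]`.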
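
To see why a byte-based vocabulary never needs an unknown token, it helps to look at the UTF-8 bytes of a string. The snippet below is a minimal illustration with a made-up sample word; it shows only the base byte representation, not the merges a byte-level BPE tokenizer would learn on top of it.

```python
text = "café"  # hypothetical sample containing one non-ASCII character

# UTF-8 encoding turns any string into a sequence of byte values in 0..255.
byte_ids = list(text.encode("utf-8"))
print(byte_ids)

# The mapping is lossless: the bytes decode back to the original string.
restored = bytes(byte_ids).decode("utf-8")
print(restored)
```

The accented character becomes two bytes, so four characters yield five base tokens, each of which is guaranteed to be in the 256-entry base vocabulary.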

In [8]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import BertProcessing

In [9]:
class BPETokenizer:
    def __init__(self, vocab_size=1000):
        self.tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
        self.trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"], vocab_size=vocab_size)
        self.tokenizer.pre_tokenizer = Whitespace()
        
    def train(self, files):
        self.tokenizer.train(files, self.trainer)
        
    def tokenize(self, text):
        return self.tokenizer.encode(text).tokens

In [None]:
bpe_tokenizer = BPETokenizer()
corpus = ["data/text.txt"]
bpe_tokenizer.train(corpus)
# Tokenize a sentence after training the tokenizer on a Roman Nepali corpus.
tokens = bpe_tokenizer.tokenize("Hami sabai anusaasan ma basnu parcha.")
# Print the resulting tokens.
print(tokens)

['[UNK]', 'a', 'm', 'i', 'sabai', 'an', 'u', 'sa', 'a', 'san', 'ma', 'bas', 'nu', 'parcha', '.']


### WordPiece Tokenization
Similar to BPE, but it uses a different criterion for choosing which tokens to merge.

It often produces more meaningful subword units, balances the frequency and usefulness of tokens, and works well for morphologically rich languages. However, it can be more computationally expensive than BPE, and most implementations still require a pre-tokenization step.
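
The merge criterion is commonly described as a score that divides the pair frequency by the product of the frequencies of its two parts, so a pair wins when its parts rarely appear apart. The sketch below assumes that description; the function name `pair_statistics` and the toy corpus are illustrative, and the `tokenizers` library may differ in detail.

```python
from collections import Counter

def pair_statistics(corpus):
    """Return raw pair counts and WordPiece-style scores for a toy corpus."""
    symbol_counts = Counter()
    pair_counts = Counter()
    for symbols, freq in corpus.items():
        for symbol in symbols:
            symbol_counts[symbol] += freq
        for pair in zip(symbols, symbols[1:]):
            pair_counts[pair] += freq
    # score(a, b) = freq(ab) / (freq(a) * freq(b))
    scores = {
        pair: count / (symbol_counts[pair[0]] * symbol_counts[pair[1]])
        for pair, count in pair_counts.items()
    }
    return pair_counts, scores

# Hypothetical corpus: words already split into characters, with "##"
# marking symbols that continue a word, mapped to their frequencies.
corpus = {
    ("h", "##u", "##g"): 10,
    ("p", "##u", "##g"): 5,
    ("p", "##u", "##n"): 12,
    ("b", "##u", "##n"): 4,
    ("h", "##u", "##g", "##s"): 5,
}

pair_counts, scores = pair_statistics(corpus)
bpe_choice = max(pair_counts, key=pair_counts.get)
wordpiece_choice = max(scores, key=scores.get)
print("most frequent pair:", bpe_choice)
print("highest scoring pair:", wordpiece_choice)
```

Here the most frequent pair is `("##u", "##g")`, but the score prefers `("##g", "##s")` because `##s` never occurs without a preceding `##g`.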

In [12]:
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

class WordPieceTokenizer:
    def __init__(self, vocab_size=1000):
        self.tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
        self.trainer = WordPieceTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"], vocab_size=vocab_size)
        self.tokenizer.pre_tokenizer = Whitespace()
        
    def train(self, files):
        self.tokenizer.train(files, self.trainer)
        
    def tokenize(self, text):
        return self.tokenizer.encode(text).tokens


In [13]:
wordpiece_tokenizer = WordPieceTokenizer()
data = ["data/text.txt"]
wordpiece_tokenizer.train(data)

tokens = wordpiece_tokenizer.tokenize("Hami sabai anusaasan ma basnu parcha.")
print(tokens)

['[UNK]', 'sabai', 'an', '##u', '##s', '##a', '##as', '##an', 'ma', 'bas', '##nu', 'parcha', '.']


### Unigram
Unigram starts with a large vocabulary and iteratively removes tokens until the desired vocabulary size is reached.
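
Each token in a Unigram vocabulary carries a probability, and a piece of text is tokenized by choosing the segmentation with the highest total probability. The sketch below shows that selection step with a small dynamic program; the function name `viterbi_segment` and the probabilities are made up for illustration and do not come from a trained model.

```python
import math

def viterbi_segment(text, log_probs):
    """Return the most probable segmentation of text under a unigram model."""
    n = len(text)
    # best[i] = (score of the best segmentation of text[:i], start of its last token)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to recover the tokens.
    tokens, end = [], n
    while end > 0:
        start = best[end][1]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1]

# Hypothetical token probabilities.
probs = {"b": 0.05, "a": 0.1, "s": 0.05, "n": 0.05, "u": 0.05,
         "bas": 0.2, "nu": 0.2, "basnu": 0.01}
log_probs = {token: math.log(p) for token, p in probs.items()}

segmentation = viterbi_segment("basnu", log_probs)
print(segmentation)
```

With these numbers the split into `bas` and `nu` (probability 0.04) beats the whole word `basnu` (probability 0.01), so the two-token segmentation is chosen.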

In [14]:
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer
from tokenizers.pre_tokenizers import Whitespace

In [38]:
class UnigramTokenizerHF:
    def __init__(self, vocab_size=1000):
        self.tokenizer = Tokenizer(Unigram())
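        # unk_token lets the trained model map unseen characters to "[UNK]" instead of failing.
        self.trainer = UnigramTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"], vocab_size=vocab_size, unk_token="[UNK]")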
        self.tokenizer.pre_tokenizer = Whitespace()
        
    def train(self, files):
        self.tokenizer.train(files, self.trainer)
        
    def tokenize(self, text):
        return self.tokenizer.encode(text).tokens

In [39]:
unigram_tokenizer = UnigramTokenizerHF(vocab_size=1000)
data = ["data/text.txt"]
unigram_tokenizer.train(data)

### SentencePiece
SentencePiece is not a tokenization algorithm itself but a framework that can apply either the BPE or the Unigram algorithm.

## Comparison
- BPE: general purpose
- WordPiece: morphologically rich languages
- SentencePiece: multilingual text
- Unigram: handles ambiguity