# **Byte-Pair Encoding Tokenizer**

A tokenizer is a function that takes a sentence and returns a list of tokens, which are usually subwords.
The tokens are chosen to be those that are, in some sense, frequent and important in a given corpus. The set of all possible tokens is called the vocabulary.

Example:
$$
    \text{the cat is sleeping.} \rightarrow \text{th|e |ca|t| |is |s|l|ee|p|ing|.}
$$

A Byte-Pair Encoding Tokenizer is built by iteratively adding new tokens to the vocabulary until a fixed size has been reached.

This is done by:
- Breaking the training corpus down into single characters and creating a set of them as the initial vocabulary
- While k new tokens have not been added to this vocabulary:
    - find the most frequent viable pair of adjacent tokens (a pair is viable if the merged token has no white-space in its interior)
    - add the merged pair to the vocabulary
    - merge every instance of the pair in the corpus
    - append the merge (e.g. ('a', 'b') -> 'ab') to the list of merge operations
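
The steps above can be sketched on a toy corpus. This is a minimal illustration, not the class defined later in this notebook: the names `merge_pair` and `learn_merges` are hypothetical, and the corpus string was picked so that no two pairs tie for the highest count at any step.

```python
from collections import Counter

def merge_pair(tokens, pair):
    """Replace each left-to-right, non-overlapping occurrence of `pair` with the merged token."""
    merged, out, i = ''.join(pair), [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def learn_merges(corpus, k):
    """Toy BPE training loop; returns (vocab, merge_ops)."""
    tokens = list(corpus)
    vocab = sorted(set(tokens))  # initial vocabulary: single characters
    merge_ops = []
    for _ in range(k):
        # Count adjacent pairs, keeping only those whose merged token
        # has no space in its interior
        counts = Counter(
            (a, b) for a, b in zip(tokens, tokens[1:])
            if ' ' not in (a + b)[1:-1]
        )
        if not counts:
            break  # no viable pair left before k merges were learned
        pair = max(counts, key=counts.get)
        tokens = merge_pair(tokens, pair)
        vocab.append(''.join(pair))
        merge_ops.append(pair)
    return vocab, merge_ops

vocab, merge_ops = learn_merges("ab ab abab ", k=10)
print(merge_ops)  # [('a', 'b'), ('ab', ' '), ('ab', 'ab ')]
print(vocab)      # [' ', 'a', 'b', 'ab', 'ab ', 'abab ']
```

Although `k=10` was requested, the loop stops after three merges because every remaining pair would produce a token with a space in its interior.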

When tokenizing a new sentence, the merge operations are applied to it in the same order in which they were learned.
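
As a small illustration of why the order matters, the sketch below applies a hand-written list of merge operations to a sentence. The merge list and the helper names (`merge_pair`, `apply_merges`) are assumptions made for this example, not values learned from a corpus.

```python
def merge_pair(tokens, pair):
    """Replace each left-to-right, non-overlapping occurrence of `pair` with the merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def apply_merges(sentence, merge_ops):
    """Split into characters, then apply each merge operation in order."""
    tokens = list(sentence)
    for pair in merge_ops:
        tokens = merge_pair(tokens, pair)
    return tokens

# Hypothetical merge operations, in the order they would have been learned
merge_ops = [('t', 'h'), ('th', 'e'), ('i', 'n')]

in_order = apply_merges("the thin", merge_ops)
print(in_order)   # ['the', ' ', 'th', 'in']

# Applying ('th', 'e') before ('t', 'h') finds no 'th' token to merge yet
swapped = apply_merges("the", [('th', 'e'), ('t', 'h')])
print(swapped)    # ['th', 'e']
```

A later merge can only fire if the tokens it combines were already produced by earlier merges, which is why the list has to be replayed in training order.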

One important point is that a tokenizer usually returns the index of each token in the list of possible tokens (i.e. the vocabulary) rather than the token string itself. These indexes are then usually fed to an Embedding Table, whose entries are optimized using Gradient Descent.
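
The index lookup can be sketched as follows. This assumes a tiny hypothetical vocabulary and represents the embedding table as plain Python lists with random initial values; in practice the table would be a trainable parameter in a deep learning framework, and the training step is not shown here.

```python
import random

# Hypothetical vocabulary and embedding size, chosen only for illustration
vocab = ['a', 'b', 'ab']
embedding_dim = 4

# Map each token to its position in the vocabulary
token_to_idx = {token: idx for idx, token in enumerate(vocab)}

# One row of (randomly initialized) numbers per vocabulary entry
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-0.1, 0.1) for _ in range(embedding_dim)]
    for _ in vocab
]

tokens = ['ab', 'a', 'b']
idxs = [token_to_idx[token] for token in tokens]
vectors = [embedding_table[idx] for idx in idxs]  # one row per token

print(idxs)  # [2, 0, 1]
```

Each token index simply selects a row of the table, so the table needs exactly one row per vocabulary entry.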

## 1. Import the Corpus

We first read the text of the book Dracula into memory.

In [1]:
with open('data/dracula.txt', 'r') as f:
    book = f.read()

## 2. Defining BPE Tokenizer

The class contains a .fit() function, which learns the list of merge operations from the pair frequencies of a corpus, and a .tokenize() function, which applies those merge operations in order to tokenize a given sentence.

In [2]:
class BPETokenizer():
    
    def __init__(self, k):
        self.merge_ops = []
        self.vocab = []
        self.k = k

    # This function uses a corpus to build a list of merge operations
    def fit(self, corpus):

        self._reset()
        
        # Getting initial vocabulary (single chars)
        self.vocab = list(set(corpus))
        len_init_vocab = len(self.vocab)

        # Breaking corpus into single chars
        corpus_tk = list(corpus)

        # While k new tokens have not been added to vocab
        while len(self.vocab) < len_init_vocab + self.k:

            # Get most common pair
            pair = self._most_common_pair(corpus_tk)

            # If the corpus is too small, we might run out of pairs before k subwords are added
            # If that happens, break
            if len(pair)==0:
                break

            # Apply merge operation
            corpus_tk = self._merge(corpus_tk, pair)
        
            # Store pair
            self.merge_ops.append(pair)

            # Increase vocabulary
            self.vocab.append(''.join(pair))

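    # This function tokenizes a sentence with a trained tokenizer
    # If return_idx=True, it returns the indexes of the tokens in the vocabulary instead of the actual strings
    # (a character that never appeared in the training corpus is not in the vocabulary, so .index() raises ValueError)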
    def tokenize(self, sentence, return_idx=False):
        
        # Break down the sentence into single chars
        tokens = list(sentence)

        # Apply merge operations in the same order they were added
        for op in self.merge_ops:
            tokens = self._merge(tokens, op)

        if return_idx:
            return [self.vocab.index(token) for token in tokens]
        else:
            return tokens

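    # This function scans left to right and replaces every non-overlapping occurrence of the pair with the merged token
    def _merge(self, corpus_tk, pair):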
        new_corpus_tk = []
        i=0
        while i < len(corpus_tk):
            w1 = corpus_tk[i]
            w2 = corpus_tk[i+1] if i<len(corpus_tk)-1 else None
            if (w1,w2) == pair:
                new_corpus_tk.append(''.join(pair))
                i += 2
            else:
                new_corpus_tk.append(w1)
                i += 1
        return new_corpus_tk
                
    # This function counts every viable pair of adjacent tokens and returns the most frequent one
    def _most_common_pair(self, corpus_tk):
        freq = {}
        for i in range(len(corpus_tk)-1):
            pair = (corpus_tk[i], corpus_tk[i+1])
            if self._is_pair_viable(pair):
                if pair in freq:
                    freq[pair] += 1
                else:
                    freq[pair] = 1
        # An empty tuple signals that no viable pair exists
        if len(freq)>0:
            return max(freq, key=freq.get)
        else:
            return ()
        
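    # A pair is viable if the merged token has no white-space in its interior
    # (a space as the first or last character is allowed)
    def _is_pair_viable(self, pair):
        merge = ''.join(pair)
        return ' ' not in merge[1:-1]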

    def _reset(self):
        self.merge_ops = []
        self.vocab = []
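
Two details of the class above are worth a small side sketch. The iteration order of a set of strings can differ between Python runs (string hashing is randomized by default), so the indexes produced with `list(set(corpus))` are not guaranteed to be reproducible; and `list.index` raises `ValueError` for a token that is not in the vocabulary. The helpers below, `build_initial_vocab` and `tokens_to_indices`, are hypothetical alternatives and not part of the class; the `unk_idx=-1` placeholder for unknown tokens is likewise an assumption made for this example.

```python
def build_initial_vocab(corpus):
    """Single-character vocabulary in a fixed (sorted) order."""
    return sorted(set(corpus))

def tokens_to_indices(tokens, vocab, unk_idx=-1):
    """Map tokens to vocabulary indexes; unknown tokens get `unk_idx`."""
    lookup = {token: idx for idx, token in enumerate(vocab)}
    return [lookup.get(token, unk_idx) for token in tokens]

vocab = build_initial_vocab("hello world")
print(vocab)  # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']

# '!' never appears in the corpus, so it maps to the placeholder index
idxs = tokens_to_indices(list("hold!"), vocab)
print(idxs)   # [3, 5, 4, 1, -1]
```

The dictionary lookup also avoids scanning the whole vocabulary list once per token, which `list.index` does.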


## 3. Testing Tokenizer

#### 3.1 Testing a tokenizer with 10 added subwords.

In [3]:
bpe = BPETokenizer(k=10)

In [4]:
bpe.fit(book)

In [5]:
sentence = 'the cat is sleeping.'
tokens = bpe.tokenize(sentence)
token_idxs = bpe.tokenize(sentence, return_idx=True)

print(f"{sentence} -> |{'|'.join(tokens)}| ")
print(f"{sentence} -> {token_idxs}")
print(f"# tokens: {len(tokens)}")

the cat is sleeping. -> |th|e |c|a|t |i|s |s|l|e|e|p|in|g|.| 
the cat is sleeping. -> [86, 85, 80, 32, 88, 35, 89, 24, 9, 11, 11, 0, 91, 15, 75]
# tokens: 15


#### 3.2 Testing a tokenizer with 100 added subwords.

In [6]:
bpe = BPETokenizer(k=100)

In [7]:
bpe.fit(book)

In [8]:
sentence = 'the cat is sleeping.'
tokens = bpe.tokenize(sentence)
token_idxs = bpe.tokenize(sentence, return_idx=True)

print(f"{sentence} -> |{'|'.join(tokens)}| ")
print(f"{sentence} -> {token_idxs}")
print(f"# tokens: {len(tokens)}")

the cat is sleeping. -> |the |c|at |is |s|le|e|p|ing|.| 
the cat is sleeping. -> [121, 80, 111, 116, 24, 130, 11, 0, 105, 75]
# tokens: 10
