
Byte Pair Encoding (BPE) is a subword tokenization technique commonly used in Natural Language Processing (NLP) to split words into smaller units called subwords. It is an unsupervised algorithm that iteratively merges the most frequent pair of adjacent symbols (bytes or characters) in a corpus until a specified number of merges, and therefore a target vocabulary size, is reached.

BPE tokenizers are used in various NLP tasks such as text classification, machine translation, and speech recognition. They are particularly useful for handling out-of-vocabulary (OOV) words that do not appear in the training data: an unseen word can be split into subwords that were seen during training, which lets the model generalize to it.
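
To make the OOV point concrete, here is a minimal sketch. It assumes a tiny hand-written subword vocabulary (`SUBWORDS`, hypothetical and not learned from data) and uses greedy longest-match segmentation, which is a simplification rather than the BPE merge procedure itself. The point is only that a word never seen as a whole can still be covered by known pieces.

```python
# Hypothetical subword vocabulary; a real one would be learned from a corpus.
SUBWORDS = {"low", "new", "wid", "er", "est"}


def segment(word, vocab):
    """Greedy longest-match segmentation (a simplification, not BPE merging)."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first and shrink until a known piece is found.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: emit an unknown marker for this character.
            pieces.append("<unk>")
            start += 1
    return pieces


# None of these full words is in SUBWORDS, yet each is covered by known pieces.
print(segment("lowest", SUBWORDS))  # ['low', 'est']
print(segment("newer", SUBWORDS))   # ['new', 'er']
print(segment("widest", SUBWORDS))  # ['wid', 'est']
```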

The advantages of using a BPE tokenizer are:

1. Vocabulary size reduction: BPE keeps the vocabulary much smaller than a word-level vocabulary and handles OOV words by breaking them into smaller subwords that were seen in the training data.

2. Improved model performance: Frequent stems and affixes tend to become their own subwords, so variations and inflections of a word share tokens, which often improves model performance.

3. Language agnostic: BPE is learned purely from corpus statistics, so it can be applied to any language without hand-written linguistic rules.




The disadvantages of using a BPE tokenizer are:

1. Increased preprocessing time: BPE requires a pass over the text to learn the subword vocabulary, which can be time-consuming, especially for large datasets.

2. Increased computational cost: Splitting words into subwords produces longer token sequences than word-level tokenization, which increases the amount of computation the model performs per input.

3. Limited interpretability: Subwords do not always correspond to meaningful linguistic units, making it difficult to interpret the model's predictions.
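
The cost trade-off can be seen by counting tokens. The sketch below compares sequence lengths for word-level, subword-level, and character-level tokenization of one sentence. The subword segmentation is written by hand for illustration (an assumption, not the output of a trained tokenizer).

```python
sentence = "the widest lowest"

# Word-level: one token per whitespace-separated word.
word_tokens = sentence.split()

# Hypothetical subword segmentation, written by hand for illustration.
subword_segmentation = {
    "the": ["the"],
    "widest": ["wid", "est"],
    "lowest": ["low", "est"],
}
subword_tokens = [piece for word in word_tokens for piece in subword_segmentation[word]]

# Character-level: one token per character (spaces ignored here).
char_tokens = [char for word in word_tokens for char in word]

print(len(word_tokens), len(subword_tokens), len(char_tokens))  # 3 5 15
```

Subword sequences are longer than word sequences but much shorter than character sequences, which is the middle ground BPE aims for.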
 



Here is a step-by-step explanation of how the Byte Pair Encoding (BPE) tokenizer works:

1. Initialize the vocabulary: The BPE tokenizer starts with a set of base symbols, typically individual characters or bytes, and represents each word in the corpus as a sequence of these symbols.

2. Count the pairs of units: The BPE tokenizer counts the frequency of each pair of adjacent units in the training corpus, such as "a b", "b c", "c d", etc.

3. Merge the most frequent pair: The BPE tokenizer merges the most frequent pair of units into a new token and updates the vocabulary. For example, if the most frequent pair is "e s", the tokenizer would merge them into a new token "es" and update the vocabulary accordingly.

4. Repeat the process: The BPE tokenizer repeats steps 2 and 3 until a predetermined number of merges (or vocabulary size) is reached or until no pairs are left to merge.

5. Tokenize the text: Once the merges have been learned, the BPE tokenizer tokenizes new text by splitting each word into base symbols and applying the learned merges in the order they were learned. For example, the word "butterfly" may be tokenized into "butt", "er", and "fly" if those subwords are in the learned vocabulary.
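
As a worked example of steps 2 and 3, the sketch below runs a single iteration on a toy corpus. The words, their counts, and the helper names `count_pairs` and `merge_pair` are illustrative assumptions. Words are stored as tuples of symbols here, which is one possible representation among several.

```python
from collections import Counter

# Toy corpus (assumed for illustration): each word is a tuple of symbols ending
# in an end-of-word marker, mapped to how often the word occurs.
word_counts = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
    ("w", "i", "d", "e", "s", "t", "</w>"): 3,
}


def count_pairs(word_counts):
    """Step 2: count each adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for symbols, count in word_counts.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[left, right] += count
    return pairs


def merge_pair(pair, word_counts):
    """Step 3: replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, count in word_counts.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + count
    return merged


pairs = count_pairs(word_counts)
best = max(pairs, key=pairs.get)  # on a tie, the pair counted first wins
print(best, pairs[best])          # ('e', 's') 9
word_counts = merge_pair(best, word_counts)
print(list(word_counts))
```

In this corpus "e s" occurs 9 times (6 from "newest" plus 3 from "widest"), tied with "s t" and "t </w>". The tie is broken by insertion order, so "es" becomes the first new token.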
   



Here is an implementation of the Byte Pair Encoding (BPE) training procedure in plain Python. It only needs the standard library, so TensorFlow is not required:

import collections
import re

# Define a function that counts the frequency of pairs of consecutive symbols in a given vocabulary
def get_stats(vocab):
    """Count how often each pair of adjacent symbols occurs in the vocabulary.

    `vocab` maps each word, written as space-separated symbols (individual
    characters plus the end-of-word symbol '</w>'), to its frequency. Returns
    a dictionary mapping each (left, right) symbol pair to its count, where
    every occurrence is weighted by the frequency of the word it appears in.
    """
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

# Define a function that merges the most frequent pair of symbols in a given vocabulary
def merge_vocab(pair, v_in):
    """Return a new vocabulary in which every occurrence of `pair` is merged.

    Each word in `v_in` is a string of space-separated symbols. Every place
    where the two symbols of `pair` appear next to each other is replaced by
    their concatenation, and the word keeps its original frequency.
    """
    v_out = {}
    # Escape special characters so the pair is matched literally.
    bigram = re.escape(' '.join(pair))
    # The lookarounds require whitespace or a string boundary on both sides,
    # so the pattern matches the two symbols only as whole symbols and never
    # as the tail or head of a longer symbol.
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out

# Define the main function that performs the Byte Pair Encoding merges
def byte_pair_tokenize(vocab, num_merges):
    for i in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:  # Every word is already a single symbol, so nothing is left to merge
            break
        best = max(pairs, key=pairs.get)  # Find the most frequent pair of symbols
        vocab = merge_vocab(best, vocab)  # Merge the most frequent pair of symbols
    return vocab

# Example usage
corpus = ['hello world', 'hello tensorflow', 'tensorflow is awesome']

# Build the initial vocabulary: count each word's frequency, writing the word as space-separated characters followed by the '</w>' end-of-word symbol
vocab = collections.defaultdict(int)
for sentence in corpus:
    for word in sentence.split():
        vocab[' '.join(list(word)) + ' </w>'] += 1

num_merges = 10 # Specify the number of merges to perform
bpe_vocab = byte_pair_tokenize(vocab, num_merges) # Perform Byte Pair Encoding
print(bpe_vocab) # Print each word's final subword segmentation and its frequency
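
The code above returns the merged training vocabulary but does not record the order of the merges, which is what step 5 needs in order to tokenize new text. The following self-contained sketch shows one way to do that. The helpers `apply_merge`, `learn_merges`, and `encode_word` are hypothetical names introduced here, and words are stored as tuples of symbols rather than space-separated strings.

```python
from collections import Counter


def apply_merge(pair, symbols):
    """Replace each adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_merges(words, num_merges):
    """Learn up to `num_merges` merges and return them in the order they were learned."""
    vocab = Counter(tuple(word) + ("</w>",) for word in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:  # nothing left to merge
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, count in vocab.items():
            new_vocab[apply_merge(best, symbols)] += count
        vocab = new_vocab
    return merges


def encode_word(word, merges):
    """Tokenize one word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = apply_merge(pair, symbols)
    return list(symbols)


corpus = ['hello world', 'hello tensorflow', 'tensorflow is awesome']
words = [word for sentence in corpus for word in sentence.split()]
merges = learn_merges(words, 10)

print(merges[0])                     # ('l', 'o')
print(encode_word("hello", merges))  # a word seen during training
print(encode_word("flow", merges))   # a word not seen as a whole during training
```

Replaying the merges in learned order means a new word is segmented the same way the training words were. Joining the returned pieces always reproduces the original word followed by `</w>`.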
