<a href="https://colab.research.google.com/github/mshojaei77/NLP-Journey/blob/main/ch1/Tokenization_BPE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Tokenization
Tokenization is a fundamental step in natural language processing (NLP) that involves breaking down text into smaller, meaningful units called tokens. These tokens are then used as input to neural networks for various NLP tasks such as text classification, translation, and generation.

## Why is Tokenization Important?
Tokenization is crucial because it allows the model to understand and process natural language in a structured way. By breaking down text into smaller units, the model can learn the relationships between words and phrases, which is essential for tasks like language understanding and generation.

### Whitespace Tokenization

Below is an example of a simple tokenizer that does not follow any standard. All it does is extract tokens using whitespace as the separator.

Try running the following code blocks.

In [4]:
text = "Cyrus the Great founded the Achaemenid Empire."
tokens = text.split()
print(tokens)

['Cyrus', 'the', 'Great', 'founded', 'the', 'Achaemenid', 'Empire.']


The split() method divides the text at whitespace, producing one token per word. This approach is straightforward but ignores punctuation, so "Empire." stays a single token with the period still attached.
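One common way around this limitation is a regular expression that treats runs of word characters and individual punctuation marks as separate tokens. The snippet below is a minimal sketch, not a standard tokenizer; the pattern and the helper name `simple_tokenize` are illustrative choices.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens (illustrative helper)."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Cyrus the Great founded the Achaemenid Empire.")
print(tokens)
# ['Cyrus', 'the', 'Great', 'founded', 'the', 'Achaemenid', 'Empire', '.']
```

A pattern this simple still has rough edges: a contraction such as "don't" is split into three tokens, which is one reason the library tokenizers in the next section exist.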


## Advanced Tokenization Techniques
### NLTK Tokenizers
NLTK (Natural Language Toolkit) is a popular library in Python for working with human language data. It provides built-in tokenizers for splitting text into sentences or words.


In [9]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Download the Punkt tokenizer data (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')

text = "Cyrus the Great founded the Achaemenid Empire, It was a powerful empire."
sent_tokens = sent_tokenize(text)
word_tokens = word_tokenize(text)

print("Sentence Tokens:", sent_tokens)
print("Word Tokens:", word_tokens)


Sentence Tokens: ['Cyrus the Great founded the Achaemenid Empire, It was a powerful empire.']
Word Tokens: ['Cyrus', 'the', 'Great', 'founded', 'the', 'Achaemenid', 'Empire', ',', 'It', 'was', 'a', 'powerful', 'empire', '.']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### spaCy Tokenizers
spaCy is another popular NLP library that provides robust tokenization as the first step of its processing pipeline. It is designed for high performance and ease of use.

In [10]:
import spacy

# Load the small English spaCy model
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

text = "Cyrus the Great founded the Achaemenid Empire, It was a powerful empire."
doc = nlp(text)
tokens = [token.text for token in doc]

print(tokens)


['Cyrus', 'the', 'Great', 'founded', 'the', 'Achaemenid', 'Empire', ',', 'It', 'was', 'a', 'powerful', 'empire', '.']


## Subword Tokenization
Subword tokenization breaks down words into smaller units. This approach is useful for handling rare words or languages with complex morphology.
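To make the idea concrete before looking at how subwords are learned, here is a small sketch that splits words using a fixed set of subwords with greedy longest-match segmentation. Note that this is not BPE: the subword set is hand-picked and hypothetical, and the helper name `segment` is an illustrative choice.

```python
def segment(word, subwords):
    """Greedy longest-match segmentation of `word` using a set of known subwords."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, shrinking until a known subword is found
        for end in range(len(word), start, -1):
            if word[start:end] in subwords:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known subword starts here: fall back to a single character
            pieces.append(word[start])
            start += 1
    return pieces

# Hypothetical subword vocabulary, chosen by hand for illustration
subwords = {"token", "ization", "un", "break", "able"}

print(segment("tokenization", subwords))  # ['token', 'ization']
print(segment("unbreakable", subwords))   # ['un', 'break', 'able']
```

Even a word that never appeared as a whole can be represented, because in the worst case it falls back to individual characters.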


# Byte Pair Encoding (BPE)

Byte Pair Encoding is a technique used in natural language processing to break words into subwords. Let's explore how it works using a simple, step-by-step example.

## The Concept

Imagine you're creating a new alphabet for children learning to read. Instead of just having individual letters, you also create special blocks for common letter pairs or groups. This is essentially what BPE does!

## Let's implement a simple version of BPE

In [17]:
import re
from collections import Counter

def get_stats(vocab):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency"""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    """Merge every occurrence of the given pair in the vocabulary"""
    v_out = {}
    bigram = re.escape(' '.join(pair))
    # Match the pair only at whole-symbol boundaries. A plain str.replace could
    # also match inside longer symbols, e.g. 'e s' within 'ne s'.
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    replacement = ''.join(pair)
    for word in v_in:
        w_out = pattern.sub(lambda _: replacement, word)
        v_out[w_out] = v_in[word]
    return v_out

## Our Starting Point

Let's begin with a simple vocabulary of words. We'll use:
- "low" (appears 5 times)
- "lower" (appears 2 times)
- "newest" (appears 6 times)

We'll represent each word as a sequence of characters separated by spaces, with '</w>' marking the end of a word.
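If you start from raw word counts instead of typing the spaced-out form by hand, the representation can be generated. The following is a small sketch; `build_vocab` and `word_counts` are illustrative names, not part of the original notebook.

```python
def build_vocab(word_counts):
    """Turn {word: count} into space-separated characters ending with '</w>'."""
    return {
        " ".join(list(word) + ["</w>"]): count
        for word, count in word_counts.items()
    }

word_counts = {"low": 5, "lower": 2, "newest": 6}
initial_vocab = build_vocab(word_counts)
print(initial_vocab)
# {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6}
```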

In [18]:
# Initial vocabulary
vocab = {
    'l o w </w>': 5,
    'l o w e r </w>': 2,
    'n e w e s t </w>': 6
}

print("Initial vocabulary:")
print(vocab)

Initial vocabulary:
{'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6}


## The BPE Process

Now, let's apply BPE steps to see how it builds up subwords.

In [19]:
num_merges = 5

for i in range(num_merges):
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(f"\nMerge #{i+1}: {best}")
    print(vocab)


Merge #1: ('w', 'e')
{'l o w </w>': 5, 'l o we r </w>': 2, 'n e we s t </w>': 6}

Merge #2: ('l', 'o')
{'lo w </w>': 5, 'lo we r </w>': 2, 'n e we s t </w>': 6}

Merge #3: ('n', 'e')
{'lo w </w>': 5, 'lo we r </w>': 2, 'ne we s t </w>': 6}

Merge #4: ('ne', 'we')
{'lo w </w>': 5, 'lo we r </w>': 2, 'newe s t </w>': 6}

Merge #5: ('newe', 's')
{'lo w </w>': 5, 'lo we r </w>': 2, 'newes t </w>': 6}


## What Just Happened?

1. We started with individual characters.
2. In each step, we found the most frequent pair of adjacent tokens.
3. We created a new token by merging this pair.
4. We repeated this process several times.
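
The four steps above can also be packaged into one function that returns the ordered list of merges, which is what you need later to tokenize new words. This is a sketch under a few assumptions: the name `learn_bpe` is illustrative, words are stored as tuples of symbols rather than space-separated strings, and ties are broken by whichever pair was counted first.

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Return the ordered list of merges learned from {word: count} (illustrative helper)."""
    # Step 1: each word starts as a tuple of characters plus the '</w>' marker
    words = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, count in words.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[left, right] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Step 3: replace every occurrence of the best pair with the merged symbol
        merged_words = {}
        for symbols, count in words.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] = count
        words = merged_words  # Step 4: repeat on the updated vocabulary
    return merges

merges = learn_bpe({"low": 5, "lower": 2, "newest": 6}, num_merges=5)
print(merges)
# [('w', 'e'), ('l', 'o'), ('n', 'e'), ('ne', 'we'), ('newe', 's')]
```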

## The Benefits

- Common subwords emerge: Notice how 'lo', 'we', and 'newes' became tokens. Frequent character sequences are promoted to reusable subword units!
- Efficiency: We can now represent words with fewer tokens.
- Flexibility: This method can handle new words by breaking them into learned subwords.
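
The efficiency claim can be checked by counting symbols before and after the five merges, weighting each word by its frequency. This is a quick sketch; the helper name `total_symbols` is illustrative, and the two dictionaries are copied from the outputs above.

```python
def total_symbols(vocab):
    """Count symbols across the corpus, weighting each word by its frequency."""
    return sum(len(word.split()) * freq for word, freq in vocab.items())

before = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6}
after = {'lo w </w>': 5, 'lo we r </w>': 2, 'newes t </w>': 6}

print(total_symbols(before), "->", total_symbols(after))
# 74 -> 41
```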

## Real-world Application

In practice, BPE is applied to much larger vocabularies and is a crucial preprocessing step for many language models. It helps these models handle large vocabularies efficiently and deal with rare or unseen words effectively.

In [20]:
# Let's test our BPE on a new word
new_word = 'l o w e s t </w>'
print("New word:", new_word)

# Apply our learned merges in the order they were learned
learned_merges = [('w e', 'we'), ('l o', 'lo'), ('n e', 'ne'),
                  ('ne we', 'newe'), ('newe s', 'newes')]
for old, new in learned_merges:
    new_word = new_word.replace(old, new)

print("After applying BPE:", new_word)

New word: l o w e s t </w>
After applying BPE: lo we s t </w>


As you can see, our simple BPE model was able to apply learned subword units to a new word it hadn't seen before!

This is a simplified version of BPE, but it demonstrates the core concept. In real NLP applications, BPE is typically applied to much larger datasets and can learn thousands of merges, creating a rich vocabulary of subword units.
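
For a slightly more robust way to tokenize unseen words, the merges can be replayed on a list of symbols instead of relying on string replacement. The sketch below assumes the five merges printed in the training output; the helper name `apply_merges` is illustrative.

```python
def apply_merges(word, merges):
    """Segment `word` by replaying an ordered list of merges (illustrative helper)."""
    symbols = list(word) + ["</w>"]
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            # Merge only when both symbols match exactly, never partial symbols
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Merges copied from the training output, in the order they were learned
merges = [("w", "e"), ("l", "o"), ("n", "e"), ("ne", "we"), ("newe", "s")]

print(apply_merges("lowest", merges))  # ['lo', 'we', 's', 't', '</w>']
print(apply_merges("newest", merges))  # ['newes', 't', '</w>']
```

The order of the merges matters: later merges such as ('ne', 'we') can only fire after the earlier ones have produced their inputs.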