## Tokenization
Tokenization is the process of breaking down a text into individual units, called tokens. These tokens can be words, subwords, characters, or even punctuation marks. It's a fundamental step in most Natural Language Processing (NLP) pipelines because it converts raw text into a format that machine learning models can understand. Think of it like separating a sentence into its constituent words before you can analyze its grammar or meaning.

Here's why tokenization is important:

*   **Machine Learning Compatibility:** Machine learning models work with numerical data. Tokenization is the first step of that conversion: the text is split into tokens, each token is mapped to an integer ID from a vocabulary, and the model then looks up a vector (an embedding) for each ID.
*   **Feature Engineering:** Tokens can be used as features in NLP tasks like text classification, sentiment analysis, and machine translation.
*   **Vocabulary Creation:** Tokenization helps in creating a vocabulary of unique tokens, which is essential for many NLP models.
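
As a rough illustration of the vocabulary and ID-mapping steps described above, here is a minimal sketch. The tiny corpus, the whitespace splitting, and the `encode` helper are assumptions made for the example, not part of any particular library.

```python
# Hypothetical mini-corpus; whitespace splitting stands in for a real tokenizer.
corpus = ["the cat sat", "the dog sat down"]

# Build a vocabulary: reserve ID 0 for unknown tokens,
# then assign IDs to tokens in first-seen order.
vocab = {"<UNK>": 0}
for line in corpus:
    for token in line.split():
        vocab.setdefault(token, len(vocab))

def encode(text):
    """Map each whitespace-separated token to its ID, falling back to <UNK>."""
    return [vocab.get(token, vocab["<UNK>"]) for token in text.split()]

ids = encode("the cat sat down quietly")
print(vocab)
print(ids)  # "quietly" was never seen, so it maps to the <UNK> ID
```

These integer IDs, not the raw strings, are what a model's embedding layer receives.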

Now, let's explore some common tokenization techniques:

**1. Word-based Tokenization:**

*   **How it works:** This is the simplest form of tokenization. It splits the text into tokens based on spaces or punctuation. Each word is treated as a separate token.
*   **Example:** "The quick brown fox." becomes `["The", "quick", "brown", "fox", "."]`
*   **Advantages:** Easy to implement.
*   **Disadvantages:**
    *   Doesn't handle out-of-vocabulary (OOV) words well. If a word isn't in the model's vocabulary, it's often represented as `<UNK>`, losing information.
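    *   Can lead to a very large vocabulary, especially for morphologically rich languages where every inflected form becomes its own entry. A larger vocabulary means larger embedding and output layers, which costs memory and compute.
    *   Doesn't share information between related words: "unbreakable" is one opaque token, even though its parts "un," "break," and "able" each carry meaning.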
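
A minimal sketch of word-based tokenization, using a regular expression as one possible splitting rule. The `word_tokenize` helper and the `known` vocabulary are hypothetical and only serve to show the OOV problem.

```python
import re

def word_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = word_tokenize("The quick brown fox.")
print(tokens)

# Hypothetical vocabulary that a model was trained with ("brown" is missing).
known = {"The", "quick", "fox", "."}
mapped = [t if t in known else "<UNK>" for t in tokens]
print(mapped)  # the unseen word is replaced by <UNK>, so its identity is lost
```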

**2. Character-based Tokenization:**

*   **How it works:** Treats each character as a token.
*   **Example:** "Hello" becomes `["H", "e", "l", "l", "o"]`
*   **Advantages:**
    *   Handles OOV words well. Any word can be represented as a sequence of characters.
    *   Small vocabulary size.
*   **Disadvantages:**
    *   Doesn't capture the meaning of words or subwords directly.
    *   Sequences can become very long, requiring more computational resources.
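
The trade-off can be seen with a few lines of plain Python. This is only a sketch: it treats every character, including spaces, as a token and compares the result with a simple whitespace split.

```python
sentence = "Tokenization splits text into smaller units."

char_tokens = list(sentence)     # one token per character, spaces included
word_tokens = sentence.split()   # rough word-level split for comparison

# The character vocabulary is tiny, but the sequence is much longer.
char_vocab = sorted(set(char_tokens))
print("word tokens:", len(word_tokens))
print("char tokens:", len(char_tokens))
print("distinct characters:", len(char_vocab))
```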

**3. Subword Tokenization:**

Subword tokenization is a middle ground between word-based and character-based tokenization. It breaks words into smaller units (subwords) that can be morphemes (meaningful units like "un," "break," "able") or frequently occurring character sequences. This approach addresses the OOV problem and manages vocabulary size more effectively. Here are a few popular subword tokenization methods:

*   **a) WordPiece:**
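    *   **How it works:** Starts with a small vocabulary of characters and iteratively merges pairs of symbols to form subwords until a desired vocabulary size is reached. Rather than picking the pair with the highest raw count, WordPiece is usually described as picking the merge that most increases the likelihood of the training data, which favors pairs whose parts rarely occur apart. It's used in models like BERT.
    *   **Example:** Starting from the characters in "bats" and "cats", successive merges might produce "at", then "ats", and eventually "bats" and "cats" as whole tokens if those words are frequent enough. Pieces that continue a word are written with a `##` prefix, as in `["token", "##ization"]`.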
    *   **Advantages:** Balances vocabulary size and handles OOV words.

*   **b) Byte-Pair Encoding (BPE):**
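    *   **How it works:** BPE starts with a vocabulary of individual characters (or bytes) and repeatedly merges the most frequent adjacent pair of symbols into a new token, stopping when the target vocabulary size is reached. Models such as GPT-2 and RoBERTa use a byte-level variant.
    *   **Example:** If "e" followed by "s" is the most frequent adjacent pair in the corpus, BPE first adds "es" to the vocabulary. If "es" followed by "t" is then the most frequent pair, it adds "est", and so on.
    *   **Advantages:** Simple to train, and the learned merge list can be replayed in order to tokenize new text.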

*   **c) SentencePiece:**
    *   **How it works:** SentencePiece is a library rather than a separate algorithm: it implements both BPE and a unigram language model, and it trains directly on raw text without first splitting it into words. Whitespace is treated as an ordinary symbol and written as `▁`. It can also be configured for character-level tokenization. It's used in models like XLNet and ALBERT.
    *   **Example:** For "New York," it might learn `▁New` and `▁York` as separate tokens, where the leading `▁` records the space before each word.
    *   **Advantages:** Handles whitespace consistently and can be used for languages without clear word boundaries. It also makes detokenization (converting tokens back to text) straightforward.
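
To make the merge loop behind BPE concrete, here is a minimal pure-Python sketch of BPE training. The word frequencies and the helper names `get_pair_counts` and `merge_pair` are made up for illustration, and real implementations add details such as pre-tokenization, end-of-word markers, and efficient pair bookkeeping.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Hypothetical word frequencies; each word starts as a tuple of characters.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(w): f for w, f in word_freqs.items()}

merges = []
for _ in range(3):  # learn three merges
    pairs = get_pair_counts(words)
    best = max(pairs, key=pairs.get)  # most frequent pair; ties go to the first seen
    merges.append(best)
    words = merge_pair(best, words)

print(merges)
print(list(words))
```

Each learned merge adds one token to the vocabulary, so the number of loop iterations directly controls the vocabulary size.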

**Key Differences and Summary:**

| Technique        | Unit of Tokenization | Handles OOV | Vocabulary Size |
|-----------------|----------------------|-------------|-----------------|
| Word-based      | Word                 | Poor        | Large           |
| Character-based | Character            | Excellent   | Small           |
| WordPiece       | Subword              | Good        | Moderate        |
| BPE             | Subword              | Good        | Moderate        |
| SentencePiece   | Subword/Character    | Good        | Moderate        |
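
To make the "Handles OOV" column concrete for subword methods, the sketch below applies the greedy longest-match-first lookup that BERT-style WordPiece tokenizers use when splitting a word against a fixed vocabulary. The toy vocabulary and the `wordpiece_tokenize` helper are assumptions for the example; a real tokenizer also handles lowercasing, punctuation, and a maximum word length.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1  # try a shorter candidate
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary.
vocab = {"token", "##ization", "##ize", "un", "##break", "##able"}

print(wordpiece_tokenize("tokenization", vocab))
print(wordpiece_tokenize("unbreakable", vocab))
print(wordpiece_tokenize("xyz", vocab))
```

A word that was never seen as a whole can still be represented, as long as its pieces are in the vocabulary. Only when no piece matches does it fall back to the unknown token.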

In [3]:
#!pip install transformers tokenizers sentencepiece

In [4]:
import transformers
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# Sample sentence for tokenization
sentence = "Tokenization splits text into smaller units."

# WordPiece Tokenization using Hugging Face
wordpiece_tokenizer = transformers.BertTokenizerFast.from_pretrained("bert-base-uncased")
wordpiece_tokens = wordpiece_tokenizer.tokenize(sentence)
print("WordPiece Tokenization:", wordpiece_tokens)

# Byte-Pair Encoding (BPE) using Hugging Face Tokenizers
bpe_tokenizer = Tokenizer(models.BPE())
bpe_tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
bpe_trainer = trainers.BpeTrainer(special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"])

# Training on a one-sentence example corpus (load a real corpus for meaningful results).
# With so little data, BPE keeps merging until every word is a single token.
example_corpus = [sentence]
bpe_tokenizer.train_from_iterator(example_corpus, trainer=bpe_trainer)
bpe_tokens = bpe_tokenizer.encode(sentence).tokens
print("BPE Tokenization:", bpe_tokens)

WordPiece Tokenization: ['token', '##ization', 'splits', 'text', 'into', 'smaller', 'units', '.']
BPE Tokenization: ['Tokenization', 'splits', 'text', 'into', 'smaller', 'units', '.']


In [5]:
import sentencepiece as spm

# Sample text (you would use your own data here)
text = """This is a sample text.  It includes some words that are repeated, like sample.
We will use this text to demonstrate subword tokenization.  Subword tokenization is important
for handling out-of-vocabulary words and reducing vocabulary size.
"""

# SentencePiece is trained here from a file on disk, so save the sample text first
with open("text.txt", "w", encoding="utf-8") as f:
    f.write(text)

# SentencePiece Tokenization (using the 'sentencepiece' library)

# Train SentencePiece
spm.SentencePieceTrainer.Train(
    input='text.txt',
    model_prefix='sentencepiece_model',  # Prefix for the output .model and .vocab files
    vocab_size=60,                       # Kept small because the sample text is tiny
    character_coverage=0.9995,           # Fraction of characters covered; adjust as needed
    train_extremely_large_corpus=True,   # Intended for very large corpora; not required for this small sample
)

# Load the SentencePiece model
sentencepiece_tokenizer = spm.SentencePieceProcessor()
sentencepiece_tokenizer.Load("sentencepiece_model.model")

# Tokenize with SentencePiece
sentencepiece_output = sentencepiece_tokenizer.EncodeAsPieces(text)  # Use EncodeAsIds(text) to get integer IDs instead

print("\nSentencePiece Tokenization:")
print(sentencepiece_output)

# Example of detokenization (converting tokens back to text)

detokenized_sentencepiece = sentencepiece_tokenizer.DecodePieces(sentencepiece_output)
print("\nDetokenized SentencePiece:")
print(detokenized_sentencepiece)


SentencePiece Tokenization:
['▁', 'T', 'h', 'i', 's', '▁i', 's', '▁a', '▁sample', '▁text', '.', '▁', 'I', 't', '▁', 'in', 'c', 'l', 'u', 'de', 's', '▁s', 'o', 'm', 'e', '▁', 'word', 's', '▁th', 'at', '▁', 'ar', 'e', '▁re', 'p', 'e', 'ate', 'd', ',', '▁', 'li', 'ke', '▁sample', '.', '▁', 'W', 'e', '▁w', 'i', 'l', 'l', '▁', 'u', 's', 'e', '▁th', 'i', 's', '▁text', '▁to', '▁', 'de', 'm', 'on', 's', 't', 'r', 'ate', '▁s', 'u', 'b', 'word', '▁tokenization', '.', '▁', 'S', 'u', 'b', 'word', '▁tokenization', '▁i', 's', '▁i', 'mp', 'or', 't', 'an', 't', '▁', 'f', 'or', '▁', 'h', 'and', 'l', 'in', 'g', '▁', 'o', 'u', 't', '-', 'o', 'f', '-', 'v', 'o', 'c', 'a', 'b', 'u', 'l', 'ar', 'y', '▁', 'word', 's', '▁', 'and', '▁re', 'd', 'u', 'c', 'in', 'g', '▁', 'v', 'o', 'c', 'a', 'b', 'u', 'l', 'ar', 'y', '▁s', 'iz', 'e', '.']

Detokenized SentencePiece:
This is a sample text. It includes some words that are repeated, like sample. We will use this text to demonstrate subword tokenization. Subword tok

### Further Reading

https://huggingface.co/learn/nlp-course/en/chapter6/6

https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/tokenizer_training.ipynb#scrollTo=a0ima0_TfmSA