# MINI-PROJECT - Text Tokenizer

In this mini project, we will build a text tokenizer using several approaches:

1. Word Based
2. Character Based
3. Sub-word Based
   - WordPiece - builds a vocabulary by merging pairs of symbols, keeping a merge only when it improves how well the vocabulary fits the training text
   - Unigram - starts from a large list of candidate subwords and prunes it down to those that best explain the training text
   - SentencePiece - segments raw text (whitespace included) into subword pieces and assigns each a unique ID
4. Adding special tokens like <bos> and <eos> at the beginning and end of sentences.

### EXAMPLE 1

In [1]:
import sys  # needed so that {sys.executable} expands to the notebook's Python interpreter
!{sys.executable} -m pip install --upgrade --force-reinstall torch transformers tokenizers sentencepiece


In [2]:
import torch
from transformers import BertTokenizer

print("Torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

# Load tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Sample text
text = "Large Language Models (LLMs) are transforming how machines understand language."

# Tokenize and return PyTorch tensors
tokens = tokenizer(
    text,
    padding='max_length',
    truncation=True,
    max_length=32,
    return_tensors='pt'
)

# Output the tokenized tensors
print("Input IDs:\n", tokens['input_ids'])
print("Attention Mask:\n", tokens['attention_mask'])


Torch version: 2.7.1+cpu
CUDA available: False
Input IDs:
 tensor([[  101,  2312,  2653,  4275,  1006,  2222,  5244,  1007,  2024, 17903,
          2129,  6681,  3305,  2653,  1012,   102,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0]])
Attention Mask:
 tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]])
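
The relationship between the padded input IDs and the attention mask can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper name `pad_and_mask` is hypothetical, the IDs are copied from the output above, and the truncation here is naive (the real tokenizer keeps the closing `[SEP]` when it truncates). In `bert-base-uncased`, ID 101 is `[CLS]`, 102 is `[SEP]`, and 0 is `[PAD]`, which play the role of the beginning, end, and padding special tokens.

```python
# IDs copied from the output above: 101 = [CLS], 102 = [SEP].
real_ids = [101, 2312, 2653, 4275, 1006, 2222, 5244, 1007, 2024, 17903,
            2129, 6681, 3305, 2653, 1012, 102]


def pad_and_mask(ids, max_length, pad_id=0):
    """Naively truncate or right-pad `ids` and build the matching attention mask."""
    ids = ids[:max_length]                      # naive truncation (assumption)
    n_pad = max_length - len(ids)
    input_ids = ids + [pad_id] * n_pad          # pad on the right with [PAD]
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


input_ids, attention_mask = pad_and_mask(real_ids, max_length=32)
print(len(input_ids), sum(attention_mask))  # 32 16
```

The mask tells the model which positions carry real tokens, so the 16 padding positions are ignored by attention.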


### 1. Word-Based Tokenizer
Motivation
The simplest form of tokenization—splitting by whitespace—provides an intuitive way to represent language. It’s historically been the foundation for bag-of-words models.

Significance
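Efficient for small datasets and classical ML tasks (e.g., sentiment analysis, topic modeling). However, it fails on out-of-vocabulary (OOV) words, treats morphological variants such as "run" and "running" as unrelated tokens, and leaves punctuation attached to words when splitting on whitespace alone.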

Practical Usage
Used in early NLP pipelines (e.g., TF-IDF vectorizers, early RNNs).

In [3]:
text = "Tokenization is crucial in NLP."
tokens = text.split()
print(tokens)


['Tokenization', 'is', 'crucial', 'in', 'NLP.']
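
Note that the period stays attached to 'NLP.' in the output above. A slightly more careful sketch, assuming a simple regular expression and a hypothetical toy vocabulary, separates punctuation and shows how an out-of-vocabulary word collapses to a single `<unk>` ID:

```python
import re


def word_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


tokens = word_tokenize("Tokenization is crucial in NLP.")
print(tokens)  # ['Tokenization', 'is', 'crucial', 'in', 'NLP', '.']

# Hypothetical toy vocabulary; ID 0 is reserved for unknown words.
vocab = {"<unk>": 0, "Tokenization": 1, "is": 2, "crucial": 3, "in": 4, "NLP": 5, ".": 6}


def encode(tokens, vocab):
    """Map each token to its ID, falling back to <unk> for OOV words."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]


oov_ids = encode(word_tokenize("Tokenization is hard."), vocab)
print(oov_ids)  # [1, 2, 0, 6]  ("hard" is out of vocabulary)
```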


### 2. Character-Based Tokenizer
Motivation
By decomposing text into individual characters, the model can represent any string, including rare or novel words, with a very small vocabulary.

Significance
This approach avoids OOV issues and captures morphological patterns but at the cost of longer sequences.

Practical Usage
Used in text generation tasks and when modeling fine-grained linguistic structures (e.g., OCR, speech).

In [4]:
text = "Tokenizer"
tokens = list(text)
print(tokens)

['T', 'o', 'k', 'e', 'n', 'i', 'z', 'e', 'r']
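
To feed characters to a model, each one needs an integer ID. The following sketch builds a character vocabulary from the text itself (an assumption made for brevity; a real vocabulary would be built from the whole training corpus), round-trips the text, and compares sequence lengths against whitespace splitting to show the cost mentioned above:

```python
text = "Tokenizer"

# Build a character vocabulary (sorted so the ID order is stable).
chars = sorted(set(text))
char_to_id = {ch: i for i, ch in enumerate(chars)}
id_to_char = {i: ch for ch, i in char_to_id.items()}

ids = [char_to_id[ch] for ch in text]
decoded = "".join(id_to_char[i] for i in ids)

print(len(chars), len(ids))  # 8 9  (8 distinct characters, 9 tokens)
print(decoded == text)       # True

# Sequence-length cost: the same sentence as words vs. characters.
sentence = "Tokenization is crucial in NLP."
n_words = len(sentence.split())
n_chars = len(list(sentence))
print(n_words, n_chars)      # 5 31
```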


### 3. Subword-Based Tokenizer
Subword methods provide a balance between the flexibility of character tokenization and the compactness of word tokenization.

#### 3.1 WordPiece
Motivation
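Originally developed at Google for Japanese and Korean voice search and later adopted by BERT, WordPiece builds its vocabulary by repeatedly merging symbol pairs. At each step it picks the pair whose merge most increases the likelihood of the training data, rather than simply the most frequent pair.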

Significance
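Handles rare and compound words by splitting them into known pieces, where ## marks a piece that continues a word (ideally unaffordable → un ##afford ##able). The actual split depends on the learned vocabulary: bert-base-uncased produces una ##ff ##ord ##able, as the output below shows.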

Practical Usage
Used in BERT and other Transformer-based models.
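
Google's original WordPiece training code is not public, so the following is only a sketch of one commonly described formulation (for example in the Hugging Face course): each candidate pair is scored as count(pair) / (count(first) × count(second)), which favours pairs whose parts rarely occur apart. The toy word frequencies are hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus: words pre-split into WordPiece-style symbols, with frequencies.
words = {
    ("h", "##u", "##g"): 10,
    ("p", "##u", "##g"): 5,
    ("p", "##u", "##n"): 12,
    ("b", "##u", "##n"): 4,
    ("h", "##u", "##g", "##s"): 5,
}

symbol_counts = Counter()
pair_counts = Counter()
for symbols, freq in words.items():
    for s in symbols:
        symbol_counts[s] += freq
    for a, b in zip(symbols, symbols[1:]):
        pair_counts[(a, b)] += freq


def score(pair):
    """Assumed WordPiece-style score: pair count normalised by the counts of its parts."""
    a, b = pair
    return pair_counts[pair] / (symbol_counts[a] * symbol_counts[b])


most_frequent = max(pair_counts, key=pair_counts.get)
best_scored = max(pair_counts, key=score)
print("Most frequent pair:", most_frequent)   # ('##u', '##g')
print("Highest-scoring pair:", best_scored)   # ('##g', '##s')
```

A purely frequency-based merge would pick ('##u', '##g'), while the normalised score prefers ('##g', '##s') because '##s' never appears without '##g' before it.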

In [5]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("unaffordable housing"))


['una', '##ff', '##ord', '##able', 'housing']
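
The split above may look surprising. At tokenization time, BERT-style WordPiece uses greedy longest-match-first lookup: it repeatedly takes the longest prefix of the remaining word that exists in the vocabulary. The sketch below re-implements that idea in plain Python with two hypothetical toy vocabularies (the function and vocabularies are illustrative, not the library's code), showing how the presence of 'una' in the vocabulary changes the result:

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the candidate from the right and retry
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


vocab_a = {"un", "##afford", "##able"}
vocab_b = vocab_a | {"una", "##ff", "##ord"}

print(wordpiece_tokenize("unaffordable", vocab_a))  # ['un', '##afford', '##able']
print(wordpiece_tokenize("unaffordable", vocab_b))  # ['una', '##ff', '##ord', '##able']
```

Because 'una' is longer than 'un', the greedy match takes it first, which then rules out '##afford'.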


#### 3.2 Unigram
Motivation
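Instead of merging, Unigram starts from a large candidate vocabulary and iteratively prunes the subwords that contribute least to the likelihood of the training data. At tokenization time it picks the most probable segmentation of each word under the resulting vocabulary.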

Significance
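Every subword has a probability, so the tokenizer can score alternative segmentations of the same word and choose the most likely one (or sample among them during training).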

Practical Usage
Used, via SentencePiece, in models such as T5 and XLNet.

In [6]:
from tokenizers import Tokenizer, models, trainers, pre_tokenizers
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.normalizers import NFKC
from tokenizers.processors import TemplateProcessing

# Step 1: Sample data
corpus = [
    "unaffordable housing",
    "natural language processing is fun",
    "tokenization improves NLP performance",
]

# Step 2: Initialize a Unigram model
tokenizer = Tokenizer(models.Unigram())

# Step 3: Add normalizer and pre-tokenizer
tokenizer.normalizer = NFKC()
tokenizer.pre_tokenizer = Whitespace()

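# Step 4: Define trainer; register the special tokens so they receive real vocabulary IDs
special_tokens = ["<unk>", "<bos>", "<eos>", "<sep>"]
trainer = trainers.UnigramTrainer(
    vocab_size=100,
    show_progress=True,
    special_tokens=special_tokens,
    unk_token="<unk>",
)

# Step 5: Train tokenizer
tokenizer.train_from_iterator(corpus, trainer=trainer)

# (Optional) Wrap encoded sequences in special tokens, using the IDs assigned during training
tokenizer.post_processor = TemplateProcessing(
    single="<bos> $A <eos>",
    pair="<bos> $A <sep> $B:1 <eos>:1",
    special_tokens=[
        (tok, tokenizer.token_to_id(tok)) for tok in ("<bos>", "<eos>", "<sep>")
    ],
)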

# Step 6: Save the tokenizer to file
tokenizer.save("unigram_tokenizer.json")

# Step 7: Reload and test
tokenizer = Tokenizer.from_file("unigram_tokenizer.json")
output = tokenizer.encode("unaffordable housing")
print("Tokens:", output.tokens)


Tokens: ['<bos>', 'un', 'a', 'f', 'for', 'd', 'a', 'b', 'l', 'e', 'h', 'o', 'u', 's', 'i', 'ng', '<eos>']


This code trains a Unigram tokenizer with the tokenizers library. A three-sentence sample corpus serves as training data. The Unigram model splits words into subword units chosen by their likelihood under the learned vocabulary. Before training, text is normalized with NFKC (which unifies equivalent Unicode character forms) and split into words and punctuation by the Whitespace pre-tokenizer. The trainer's vocab_size=100 caps how many subword units are learned.

A TemplateProcessing post-processor then wraps each encoded sequence in <bos> (beginning of sentence) and <eos> (end of sentence), with <sep> (separator) between the two halves of a sentence pair. Finally, the tokenizer is saved to a JSON file, reloaded, and used to encode a new input. Because the corpus is so small, few multi-character subwords are learned, which is why the output above consists mostly of single characters.
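
How a Unigram model chooses among competing segmentations can be sketched with a small Viterbi search. The probabilities below are hypothetical and unnormalised (a trained model would learn them), and `viterbi_segment` is an illustrative helper, not part of the tokenizers library.

```python
import math

# Hypothetical subword probabilities; a trained Unigram model would learn these.
probs = {
    "un": 0.10, "afford": 0.05, "able": 0.08, "una": 0.01, "ff": 0.01, "ord": 0.01,
    "u": 0.02, "n": 0.02, "a": 0.03, "f": 0.01, "o": 0.02, "r": 0.02,
    "d": 0.02, "b": 0.01, "l": 0.02, "e": 0.03,
}


def viterbi_segment(word, probs):
    """Return the segmentation of `word` with the highest total log-probability."""
    n = len(word)
    # best[i] = (best log-prob of word[:i], start index of the last piece)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(probs[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError(f"cannot segment {word!r} with this vocabulary")
    # Walk the back-pointers from the end of the word to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1], best[n][0]


pieces, log_prob = viterbi_segment("unaffordable", probs)
print(pieces)  # ['un', 'afford', 'able']
```

With these toy numbers, three likely pieces beat any split into many unlikely ones, since every extra piece multiplies in another small probability.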

#### 3.3 SentencePiece
Motivation
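Builds subword units directly from raw text without requiring language-specific pre-tokenization. Whitespace is treated as an ordinary symbol and encoded as "▁" (U+2581), so tokenization is reversible up to text normalization. SentencePiece is a library rather than a single algorithm: it implements both Unigram and BPE, selected with the model_type option.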

Significance
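Language-agnostic: because it needs no whitespace-based word splitter, it also works for languages such as Japanese and Chinese that do not separate words with spaces, which makes it common in multilingual settings.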

Practical Usage
Used in models like ALBERT, mBART, and T5.

In [7]:
import sentencepiece as spm

# Step 1: Write your training corpus to a file
with open("corpus.txt", "w", encoding="utf-8") as f:
    corpus = [
        "unaffordable housing",
        "natural language processing is fun",
        "tokenization improves NLP performance",
        "deep learning models require a lot of data",
        "generative models can produce realistic text",
        "neural networks learn from examples",
        "transformers use attention mechanisms",
        "language models are pre-trained on massive datasets",
        "BERT and GPT are popular NLP architectures",
        "machine translation is a classic NLP task",
        "text summarization condenses information",
        "question answering systems understand queries",
        "word embeddings capture semantic meaning",
        "subword tokenization handles rare words",
        "sentencepiece segments text effectively",
        "training tokenizers requires representative data",
        "AI systems benefit from clean tokenized input"
    ]
    f.write("\n".join(corpus))

# Step 2: Train the SentencePiece model (Unigram with vocab size 50)
spm.SentencePieceTrainer.Train(
    input='corpus.txt',
    model_prefix='spm',
    vocab_size=50,
    model_type='unigram',
    bos_id=1,
    eos_id=2,
    pad_id=0,
    unk_id=3
)

# Step 3: Load the trained model and tokenize text
sp = spm.SentencePieceProcessor(model_file='spm.model')
tokens = sp.encode("unaffordable housing", out_type=str)
print("Tokens:", tokens)


Tokens: ['▁', 'u', 'n', 'a', 'f', 'f', 'o', 'r', 'd', 'a', 'b', 'l', 'e', '▁', 'h', 'o', 'u', 's', 'ing']
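
The "▁" pieces in the output are how SentencePiece records whitespace, including a marker added at the start of the text. A plain-Python sketch of the round trip, using the pieces printed above (the helper names are illustrative, and the real library also applies normalization first):

```python
WS = "\u2581"  # "▁", the marker SentencePiece uses in place of a space

# Pieces copied from the output above.
pieces = [WS, 'u', 'n', 'a', 'f', 'f', 'o', 'r', 'd', 'a', 'b', 'l', 'e',
          WS, 'h', 'o', 'u', 's', 'ing']


def to_marked_text(text):
    """Replace spaces with the marker and add one at the start, as seen in the output."""
    return WS + text.replace(" ", WS)


def detokenize(pieces):
    """Concatenate pieces, turn markers back into spaces, drop the leading space."""
    return "".join(pieces).replace(WS, " ").lstrip(" ")


print(detokenize(pieces))  # unaffordable housing
print("".join(pieces) == to_marked_text("unaffordable housing"))  # True
```

Because the space is part of the pieces, no separate rule is needed to decide where words begin when decoding. The mostly single-character pieces here reflect the very small vocab_size used for training.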
