## This notebook covers the basic concept of tokenizers and how they work with data. The basic definition of a tokenizer is:
## A tokenizer is a program that converts a sequence of characters into a sequence of tokens.

In [1]:
# Let's start with a simple example: split the text on spaces and give each word an integer ID.
with open('file.txt', 'r') as f:
    text = f.read()
words = text.split(" ")
# Note: if a word appears more than once, it keeps the index of its last occurrence.
tokens = {v: k for k, v in enumerate(words)}
tokens
# This is the most basic form of tokenization. It works for a short sentence, but it is too slow for huge text files. For fast tokenization, we use dedicated libraries.

{'This': 0,
 'is': 1,
 'the': 2,
 'txt': 3,
 'file': 4,
 'that': 5,
 'will': 6,
 'use': 7,
 'for': 8,
 'tokenization.': 9}

## Tokenization Overview

#### A tokenizer converts raw text into smaller units called tokens so that machines can process language. This notebook uses the Hugging Face `tokenizers` library.

#### Instead of using full words, modern NLP systems often use subword tokenization to handle unknown or rare words effectively.
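
#### To see why unknown words are a problem for word-level tokenization, here is a minimal sketch. The helper names `build_vocab` and `encode` are hypothetical and are not part of any library; the `[UNK]` fallback mirrors the special token used later in this notebook.

```python
def build_vocab(words, unk_token="[UNK]"):
    """Assign consecutive IDs to unique words, reserving ID 0 for unknown words."""
    vocab = {unk_token: 0}
    for word in words:
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def encode(text, vocab, unk_token="[UNK]"):
    """Map each space-separated word to its ID, or to the unknown ID if unseen."""
    return [vocab.get(word, vocab[unk_token]) for word in text.split(" ")]


vocab = build_vocab("This is the txt file".split(" "))

known_ids = encode("This is the file", vocab)
unknown_ids = encode("This is the tokenizer", vocab)

print(known_ids)    # [1, 2, 3, 5]
print(unknown_ids)  # [1, 2, 3, 0] -> "tokenizer" was never seen, so it becomes [UNK]
```

#### Every unseen word collapses to the same ID, so its meaning is lost. Subword tokenization avoids this by splitting unseen words into smaller known pieces.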

## Subword Tokenization Methods

### 1. Byte Pair Encoding (BPE):

#### starts from individual characters and repeatedly merges the most frequent pair of adjacent symbols into a new token. It is widely used in GPT-style models.

#### For example, suppose the texts are: low, lower, lowest. The most common pair is l + o, so it is merged into lo. Then lo + w is merged into low.

#### Final tokens might be: low, low + er, low + est. Simple rule: BPE learns tokens based on frequency.

#### Used in: GPT models, RoBERTa, and many production systems
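
#### The first two merges of the low, lower, lowest example can be sketched in plain Python. This is a toy illustration, not the implementation used by any library. The helpers `get_pair_counts` and `merge_pair` are hypothetical names, and ties between equally frequent pairs are broken by first-seen order, which is an assumption of this sketch.

```python
from collections import Counter


def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        new_symbols = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                new_symbols.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        key = tuple(new_symbols)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Each word starts as a tuple of single characters.
word_freqs = {tuple("low"): 1, tuple("lower"): 1, tuple("lowest"): 1}

merges = []
for _ in range(2):  # two merge steps, as in the example above
    pair_counts = get_pair_counts(word_freqs)
    best_pair = max(pair_counts, key=pair_counts.get)  # first-seen pair wins ties
    merges.append(best_pair)
    word_freqs = merge_pair(best_pair, word_freqs)

print(merges)            # [('l', 'o'), ('lo', 'w')]
print(list(word_freqs))  # [('low',), ('low', 'e', 'r'), ('low', 'e', 's', 't')]
```

#### Which merges come next depends on the pair counts in the corpus, so on a larger text the suffixes could end up as pieces like er and est.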

### 2. WordPiece:

#### is similar to BPE, but it selects the merge that most increases the likelihood of the training data rather than simply the most frequent pair. It is used in BERT-based models.

#### For example: playing → play + ##ing. The ## prefix means: “this piece comes after another piece”. Simple rule: WordPiece learns tokens based on probability.

#### Used in: BERT, DistilBERT
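
#### Once a WordPiece vocabulary exists, BERT-style tokenizers split a word by greedy longest-match-first lookup. The sketch below shows only that splitting step, not how the vocabulary is learned. The function `wordpiece_encode` and the tiny `toy_vocab` are hypothetical and chosen for illustration.

```python
def wordpiece_encode(word, vocab, unk_token="[UNK]"):
    """Split `word` into the longest matching vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position, so the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"play", "walk", "##ing", "##ed", "##er"}

print(wordpiece_encode("playing", toy_vocab))  # ['play', '##ing']
print(wordpiece_encode("walked", toy_vocab))   # ['walk', '##ed']
print(wordpiece_encode("jumping", toy_vocab))  # ['[UNK]']
```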

### 3. SentencePiece:

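#### treats text as a raw sequence of characters and handles the space as an ordinary symbol, so it does not rely on spaces to find word boundaries. This makes it suitable for languages that do not separate words with spaces. It is used in models like T5 and LLaMA.

#### It is especially useful for languages such as Chinese and Japanese, which are written without spaces between words, and for multilingual models. Simple rule: SentencePiece works directly on raw text, spaces included.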

#### Used in: T5, ALBERT, LLaMA
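
#### The idea of treating spaces as part of the text can be sketched as follows. SentencePiece represents a space with the visible symbol ▁ (U+2581), so the original text can be restored exactly from the tokens. This sketch only shows that character-level step. It does not learn any subword vocabulary, and the helper names `to_symbols` and `from_symbols` are hypothetical.

```python
WORD_BOUNDARY = "\u2581"  # the "▁" symbol SentencePiece uses in place of a space


def to_symbols(text):
    """Turn raw text into a list of characters, keeping spaces as visible symbols."""
    return list(text.replace(" ", WORD_BOUNDARY))


def from_symbols(symbols):
    """Rebuild the original text by joining symbols and restoring the spaces."""
    return "".join(symbols).replace(WORD_BOUNDARY, " ")


symbols = to_symbols("new york")
restored = from_symbols(symbols)

print(symbols)   # ['n', 'e', 'w', '▁', 'y', 'o', 'r', 'k']
print(restored)  # new york
```

#### Because no information about spaces is thrown away, the same procedure works for text with or without spaces between words.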

## Industry Usage

#### 1. Transformers / LLMs → Hugging Face tokenizers

#### 2. Classical NLP pipelines → spaCy tokenizer

In [17]:
# We will use the Hugging Face `tokenizers` library here
from tokenizers import Tokenizer
from tokenizers.models import BPE

In [18]:
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))

In [19]:
# To train a new tokenizer from scratch (like the BPE tokenizer created above), use a BPE trainer.
# Here we only register the special tokens, such as [UNK] (unknown), [CLS] (classification), [SEP] (separator), and [MASK] (masking); the other trainer settings keep their defaults.

from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[MASK]"]
)

In [22]:
files = [
    f"wikitext-103/wiki.{split}.tokens"
    for split in ["train", "test", "valid"]
]


In [23]:
# NOTE: Here, "train" means learning the vocabulary and merge rules from the text files, not training a model.
# WARNING: This cell can take a long time to run, around 1 to 1.5 hours.
tokenizer.train(files, trainer)
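
#### To try the same workflow without waiting for the full WikiText-103 run, here is a minimal sketch that trains on a few in-memory sentences with `train_from_iterator`. The sentences, the `vocab_size=60` limit, and the `small_` variable names are assumptions made for illustration. It also sets a `Whitespace` pre-tokenizer, which the cells above do not, so that merges never cross word boundaries.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# A tiny toy corpus (assumption: stands in for the WikiText-103 files).
corpus = [
    "low lower lowest",
    "new newer newest",
    "the lowest price is the newest price",
]

small_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
small_tokenizer.pre_tokenizer = Whitespace()  # split into words before applying BPE

small_trainer = BpeTrainer(
    vocab_size=60,  # small limit chosen for the toy corpus
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    show_progress=False,
)

# Learn the vocabulary and merge rules from the in-memory sentences.
small_tokenizer.train_from_iterator(corpus, trainer=small_trainer)

encoding = small_tokenizer.encode("the lowest price")
print(encoding.tokens)
print(encoding.ids)
```

#### The exact tokens depend on the merges learned from the corpus, so no expected output is shown here.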






# Conclusion

## Tokenization Guidelines

### Before creating embeddings, text data must be tokenized. How you do this depends on the type of data:

### 1. Sensitive data (legal, finance, medical reports, etc.):

#### Use a custom tokenizer and train it on your own data with a method like SentencePiece, WordPiece, or BPE.
#### ⚠️ Training may take a significant amount of time (for example, around 1.5 hours), but it ensures better handling of sensitive or domain-specific vocabulary.

### 2. Non-sensitive or general data:

#### You can use pre-built tokenizers from Hugging Face or spaCy.

#### ✅ This saves processing time and works well for most common datasets.