# **Tokenization**

***This notebook demonstrates how to implement tokenization using both Python’s standard-library `re` module and spaCy, and then builds a Byte Pair Encoding (BPE) tokenizer from scratch.***

- We start with the simplest approach of splitting on spaces, and then gradually expand the rules to include punctuation and other symbols. This helps illustrate how rule-based tokenization works, why empty tokens appear, and how small changes in the pattern can produce very different token sequences.
- The goal is to show that regex-based tokenization is easy to implement and fully transparent, but also sensitive to the structure of the text.

Reference: https://docs.python.org/3/library/re.html

In [1]:
import re

text = "I'm lucky to pay only $5000 for my mother-in-law’s candies, chocolates in the U.S.!"

# Split on whitespace; the capture group keeps each separator in the result
result = re.split(r'(\s)', text)
print("1:", result)
print("Total tokens 1:", len(result))

# Split on whitespace, commas, and periods
result = re.split(r'([,.]|\s)', text)
print("2:", result)

# Drop whitespace-only and empty items
result = [item for item in result if item.strip()]
print("3:", result)

# Expanded punctuation
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print("4:", result)
print("Total tokens (4):", len(result))


1: ["I'm", ' ', 'lucky', ' ', 'to', ' ', 'pay', ' ', 'only', ' ', '$5000', ' ', 'for', ' ', 'my', ' ', 'mother-in-law’s', ' ', 'candies,', ' ', 'chocolates', ' ', 'in', ' ', 'the', ' ', 'U.S.!']
Total tokens 1: 27
2: ["I'm", ' ', 'lucky', ' ', 'to', ' ', 'pay', ' ', 'only', ' ', '$5000', ' ', 'for', ' ', 'my', ' ', 'mother-in-law’s', ' ', 'candies', ',', '', ' ', 'chocolates', ' ', 'in', ' ', 'the', ' ', 'U', '.', 'S', '.', '!']
3: ["I'm", 'lucky', 'to', 'pay', 'only', '$5000', 'for', 'my', 'mother-in-law’s', 'candies', ',', 'chocolates', 'in', 'the', 'U', '.', 'S', '.', '!']
4: ['I', "'", 'm', 'lucky', 'to', 'pay', 'only', '$5000', 'for', 'my', 'mother-in-law’s', 'candies', ',', 'chocolates', 'in', 'the', 'U', '.', 'S', '.', '!']
Total tokens (4): 21
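
The empty string in output 2 comes from the comma and the space that follow `candies`: two separators match back to back, and `re.split` reports the zero-length text between them as `''`. The capture group is what keeps the separators themselves in the list. A minimal sketch on a shortened sample (the variable names are illustrative only):

```python
import re

sample = "candies, chocolates"

# Without a capture group the separators are discarded,
# but the empty string between "," and " " is still returned.
no_group = re.split(r'[,.]|\s', sample)

# With a capture group the separators are returned as list items too.
with_group = re.split(r'([,.]|\s)', sample)

# Filtering on item.strip() removes both the empty string and the space.
cleaned = [item for item in with_group if item.strip()]

print(no_group)    # ['candies', '', 'chocolates']
print(with_group)  # ['candies', ',', '', ' ', 'chocolates']
print(cleaned)     # ['candies', ',', 'chocolates']
```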


# **spaCy**

Reference: https://spacy.io/usage/spacy-101

spaCy is a Python NLP library whose tokenizer turns raw text into tokens using language-specific rules. Instead of treating every word boundary the same way, it reads text from left to right and applies those rules to decide where each token begins and ends.

For example, abbreviations such as “U.S.” are kept as a single token rather than being broken apart, while contractions like “we’re” are split into “we” and “’re”. spaCy also handles punctuation and symbols attached to words: prefixes (the “$” in “$5”), suffixes (the apostrophe in “dogs’”), and infixes (the hyphens in “brother-in-law”).

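- Text is first split on whitespace.
- Prefixes such as “$” are separated from the start of each chunk.
- Exceptions such as “U.S.” are matched as whole tokens, and suffixes such as “!” are detached from the end.
- Infixes such as the hyphens in “mother-in-law” are split out.
- The result is a sequence of tokens that reflects the structure of the text.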
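
The steps above can be sketched in plain Python. This is not spaCy's implementation: `toy_tokenize` and its four rule sets are hypothetical and hand-picked to cover a single sentence, whereas spaCy's real rules are much larger and language-specific. The sketch only illustrates the order in which exceptions, prefixes, suffixes, and infixes are tried.

```python
import re

# Hypothetical, hand-picked rules that cover only the sample sentence.
EXCEPTIONS = {"I'm": ["I", "'m"], "U.S.": ["U.S."]}
PREFIX = re.compile(r"^[$(]")
SUFFIX = re.compile(r"(?:’s|'s|[!?.,)])$")
INFIX = re.compile(r"(-)")

def toy_tokenize(text):
    tokens = []
    for chunk in text.split():              # 1. split on whitespace
        suffixes = []                       # detached suffixes, in reading order
        while chunk:
            if chunk in EXCEPTIONS:         # 2. an exception is emitted as-is
                tokens.extend(EXCEPTIONS[chunk])
                chunk = ""
                continue
            match = PREFIX.search(chunk)
            if match:                       # 3. detach one prefix
                tokens.append(match.group())
                chunk = chunk[match.end():]
                continue
            match = SUFFIX.search(chunk)
            if match:                       # 4. detach one suffix
                suffixes.insert(0, match.group())
                chunk = chunk[:match.start()]
                continue
            break                           # nothing left to detach
        # 5. split what remains on infixes, then append the suffixes
        tokens.extend(part for part in INFIX.split(chunk) if part)
        tokens.extend(suffixes)
    return tokens

sentence = "I'm lucky to pay only $5000 for my mother-in-law’s chocolates in the U.S.!"
print(toy_tokenize(sentence))
```

Without the `"U.S."` entry in `EXCEPTIONS`, the suffix rule would peel the final period off the abbreviation, which is why exceptions are checked before prefixes and suffixes on every pass.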

The first example below tokenizes a longer paragraph.
A shorter, single-sentence example follows it and is easier to trace.

In [2]:
# pip install spacy
# python -m spacy download en_core_web_sm

import spacy

nlp = spacy.load("en_core_web_sm")

# Sample text: https://en.wikipedia.org/wiki/Formula_One
text = """Formula One (F1) is the highest class of worldwide racing for
open-wheel single-seater formula racing cars sanctioned by the Fédération
Internationale de l'Automobile (FIA). The FIA Formula One World Championship
has been one of the world's premier forms of motorsport since its inaugural
running in 1950 and is often considered to be the pinnacle of motorsport. The
word formula in the name refers to the set of rules all participant cars must
follow. A Formula One season consists of a series of races, known as Grands
Prix. Grands Prix take place in multiple countries and continents on either
purpose-built circuits or closed roads. With the average annual cost of running
a team—e.g., designing, building, and maintaining cars; staff payroll;
transport—at approximately £193 million as of 2018,[2] Formula One's financial
and political battles are widely reported. The Formula One Group is owned by
Liberty Media, which acquired it in 2017 for US$8 billion. The United Kingdom
is the hub of Formula One racing, with six out of the ten teams based there."""

# Process text
doc = nlp(text)

# Print tokens
print("Tokens:")
for token in doc:
    print(token.text)

Tokens:
Formula
One
(
F1
)
is
the
highest
class
of
worldwide
racing
for


open
-
wheel
single
-
seater
formula
racing
cars
sanctioned
by
the
Fédération


Internationale
de
l'Automobile
(
FIA
)
.
The
FIA
Formula
One
World
Championship


has
been
one
of
the
world
's
premier
forms
of
motorsport
since
its
inaugural


running
in
1950
and
is
often
considered
to
be
the
pinnacle
of
motorsport
.
The


word
formula
in
the
name
refers
to
the
set
of
rules
all
participant
cars
must


follow
.
A
Formula
One
season
consists
of
a
series
of
races
,
known
as
Grands


Prix
.
Grands
Prix
take
place
in
multiple
countries
and
continents
on
either


purpose
-
built
circuits
or
closed
roads
.
With
the
average
annual
cost
of
running


a
team
—
e.g.
,
designing
,
building
,
and
maintaining
cars
;
staff
payroll
;


transport
—
at
approximately
£
193
million
as
of
2018,[2
]
Formula
One
's
financial


and
political
battles
are
widely
reported
.
The
Formula
One
Group
is
owned
by


Liberty
Media
,
which
acquired
it


In [3]:
import spacy

nlp = spacy.load("en_core_web_sm")

text = "I'm lucky to pay only $5000 for my mother-in-law’s chocolates in the U.S.!"

# Process text
doc = nlp(text)

# Print tokens
print("Tokens:")
for token in doc:
    print(token.text)

Tokens:
I
'm
lucky
to
pay
only
$
5000
for
my
mother
-
in
-
law
’s
chocolates
in
the
U.S.
!


# **Byte Pair Encoding Tokenizer**

Byte Pair Encoding (BPE) splits words into subword units, and it is simple enough to implement from scratch.
The implementation below uses spaCy to split the text into words, then:
- Builds a “vocabulary” in which each word is broken down into individual characters plus an end-of-word marker (`</es>` in the code).
- Counts how often adjacent character pairs appear.
- Repeatedly merges the most frequent pair, learning common letter clusters (subwords).
- Applies those learned merges to new text so words are represented as a sequence of subword tokens.
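
The first two steps can be traced by hand on a tiny word list. The sketch below uses the five words from the toy corpus in Example 2 and stores each word as a tuple of symbols rather than a space-joined string; that representation is an illustrative choice, not the one used in the notebook's functions.

```python
from collections import Counter

words = ["car", "cars", "carpet", "bar", "bartender"]

# Each word becomes a tuple of characters plus the end marker.
vocab = Counter(tuple(w) + ("</es>",) for w in words)

# Count adjacent symbol pairs, weighted by how often each word occurs.
pairs = Counter()
for symbols, freq in vocab.items():
    for left, right in zip(symbols, symbols[1:]):
        pairs[(left, right)] += freq

best_pair, best_count = pairs.most_common(1)[0]
print(best_pair, best_count)  # ('a', 'r') 5
```

The pair `('a', 'r')` occurs in all five words, so it is the first merge that training would learn from this corpus.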

Reference: https://huggingface.co/learn/llm-course/en/chapter6/5

In the examples below, BPE is implemented from scratch so you can see exactly how it works: it builds a vocabulary from characters, counts frequent pairs, merges them step by step, and then applies those learned merges to new text. You can adjust the number of merges to see how the output changes. The first example uses a small Formula One–themed corpus, and the second uses a minimal toy corpus to show the process more clearly.

If the first example is hard to follow, start with the second, and try different values of `num_merges` in both.

**Example 1:**

In [4]:
import re
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def tokenize(text):
    doc = nlp(text)
    return [t.text.lower() for t in doc if not t.is_space]

# Represent each word as a sequence of chars + </es> end marker
def words_to_vocab(words):
    vocab = Counter()
    for w in words:
        vocab[" ".join(list(w) + ["</es>"])] += 1
    return vocab

# Count frequency of symbol pairs
def get_freq(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i+1])] += freq
    return pairs

# Merge the most frequent pair into a single symbol
def merge_vocab(pair, vocab):
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    output = {}
    for word, freq in vocab.items():
        new_word = pattern.sub("".join(pair), word)  # "A B" becomes "AB"
        output[new_word] = output.get(new_word, 0) + freq
    return output

def learn_bpe(corpus_tokens, num_merges=50):
    vocab = words_to_vocab(corpus_tokens)
    merges = []
    for _ in range(num_merges):
        stats = get_freq(vocab)
        if not stats:
            break  # every word is already a single symbol
        best = max(stats, key=stats.get)
        if stats[best] < 2:
            break  # no pair occurs more than once, so stop early
        vocab = merge_vocab(best, vocab)
        merges.append(best)  # merges are recorded in the order they are learned
    return merges

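def apply_bpe_word(word, merges_set):
    # Start from single characters plus the end marker, as in training
    symbols = list(word) + ["</es>"]
    changed = True
    # Rescan left to right until no adjacent pair is a learned merge.
    # Only set membership is checked, so the learned merge order is not replayed.
    while changed:
        changed = False
        i = 0
        while i < len(symbols) - 1:
            pair = (symbols[i], symbols[i+1])
            if pair in merges_set:
                symbols[i:i+2] = ["".join(pair)]
                changed = True
            else:
                i += 1
    # Drop the end marker only if it was never merged into a longer symbol
    if symbols and symbols[-1] == "</es>":
        symbols = symbols[:-1]
    return symbols

def apply_bpe(text, merges):
    merges_set = set(merges)
    toks = tokenize(text)
    out = []
    for t in toks:
        # Only lowercase ASCII alphanumeric tokens are split into subwords;
        # punctuation and other tokens pass through unchanged
        if re.fullmatch(r"[a-z0-9]+", t):
            out.extend(apply_bpe_word(t, merges_set))
        else:
            out.append(t)
    return out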

corpus = """
Formula One teams race on street circuits like Monaco and purpose-built tracks like Silverstone.
Ferrari and Mercedes have dominated different eras of Formula One, but Red Bull has been ascendant recently.
Drivers compete for the World Championship across multiple Grands Prix in a season.
"""

# Train merges on corpus
corpus_tokens = tokenize(corpus)
merges = learn_bpe(corpus_tokens, num_merges=100)

# Apply to a different sentence
text = "Formula One cars battle on purpose-built circuits and tight street tracks."
bpe_tokens = apply_bpe(text, merges)

print("Learnt merges (first 20):", merges[:20])
print("\nOriginal:", text)
print("BPE tokens:", bpe_tokens)

Learnt merges (first 20): [('e', '</es>'), ('s', '</es>'), ('o', 'n'), ('e', 'r'), ('t', '</es>'), ('d', '</es>'), ('o', 'r'), ('u', 'l'), ('r', 'a'), ('e', 'n'), ('f', 'or'), ('m', 'ul'), ('a', '</es>'), ('on', 'e</es>'), ('r', 'e'), ('a', 'n'), ('.', '</es>'), ('r', 'i'), ('h', 'a'), ('for', 'mul')]

Original: Formula One cars battle on purpose-built circuits and tight street tracks.
BPE tokens: ['formula</es>', 'one</es>', 'c', 'a', 'r', 's</es>', 'b', 'a', 't', 't', 'l', 'e</es>', 'on</es>', 'p', 'u', 'r', 'p', 'os', 'e</es>', '-', 'b', 'ui', 'l', 't</es>', 'c', 'i', 'r', 'c', 'ui', 't', 's</es>', 'and</es>', 't', 'i', 'g', 'h', 't</es>', 'st', 're', 'e', 't</es>', 't', 'rac', 'k', 's</es>', '.']
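
The `</es>` suffix on tokens such as `'formula</es>'` records where a word ends, which is what lets subword pieces be joined back into words. A minimal sketch using the first six tokens of the output above; it assumes the final piece of every word still carries the marker, which `apply_bpe_word` does not guarantee because it drops a marker that was never merged.

```python
# First six tokens from the "BPE tokens" output above
tokens = ['formula</es>', 'one</es>', 'c', 'a', 'r', 's</es>']

# Concatenate the pieces, then turn each end marker into a word boundary
decoded = "".join(tokens).replace("</es>", " ").strip()
print(decoded)  # formula one cars
```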


**Example 2:**

In [5]:
import re
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def tokenize(text):
    doc = nlp(text)
    return [t.text.lower() for t in doc if not t.is_space]

# Represent each word as a sequence of chars + </es> end marker
def words_to_vocab(words):
    vocab = Counter()
    for w in words:
        vocab[" ".join(list(w) + ["</es>"])] += 1
    return vocab

# Count frequency of symbol pairs
def get_freq(vocab):
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i+1])] += freq
    return pairs

# Merge the most frequent pair into a single symbol
def merge_vocab(pair, vocab):
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')  # whole-token pair
    output = {}
    for word, freq in vocab.items():
        new_word = pattern.sub("".join(pair), word)  # "A B" becomes "AB"
        output[new_word] = output.get(new_word, 0) + freq
    return output

def learn_bpe(corpus_tokens, num_merges=2):
    vocab = words_to_vocab(corpus_tokens)
    merges = []
    for _ in range(num_merges):
        stats = get_freq(vocab)
        if not stats:
            break
        best = max(stats, key=stats.get)
        if stats[best] < 2:
            break
        vocab = merge_vocab(best, vocab)
        merges.append(best)
    return merges

def apply_bpe_word(word, merges_set):
    symbols = list(word) + ["</es>"]
    changed = True
    while changed:
        changed = False
        i = 0
        while i < len(symbols) - 1:
            pair = (symbols[i], symbols[i+1])
            if pair in merges_set:
                symbols[i:i+2] = ["".join(pair)]
                changed = True
            else:
                i += 1
    if symbols and symbols[-1] == "</es>":
        symbols = symbols[:-1]
    return symbols

def apply_bpe(text, merges):
    merges_set = set(merges)
    toks = tokenize(text)
    out = []
    for t in toks:
        if re.fullmatch(r"[a-z0-9]+", t):
            out.extend(apply_bpe_word(t, merges_set))
        else:
            out.append(t)
    return out

corpus = """
car cars carpet bar bartender
"""

# Train merges on corpus
corpus_tokens = tokenize(corpus)
merges = learn_bpe(corpus_tokens, num_merges=3)

# Apply to a different sentence
text = "I love cartier and barbie."
bpe_tokens = apply_bpe(text, merges)

print("Learnt merges (first n):", merges[:3])
print("\nOriginal:", text)
print("BPE tokens:", bpe_tokens)

Learnt merges (first n): [('a', 'r'), ('c', 'ar'), ('b', 'ar')]

Original: I love cartier and barbie.
BPE tokens: ['i', 'l', 'o', 'v', 'e', 'car', 't', 'i', 'e', 'r', 'a', 'n', 'd', 'bar', 'b', 'i', 'e', '.']
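
`apply_bpe_word` merges any adjacent pair it finds in the set of learned merges, scanning left to right. The usual BPE formulation instead replays the merges one at a time, in the order they were learned. The sketch below shows that variant; `apply_merges_in_order` is a hypothetical helper, not part of the notebook, and unlike `apply_bpe_word` it keeps the end marker in its result.

```python
def apply_merges_in_order(word, merges, end="</es>"):
    """Replay each learned merge over the word, in training order."""
    symbols = list(word) + [end]
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            is_pair = (
                i < len(symbols) - 1
                and symbols[i] == left
                and symbols[i + 1] == right
            )
            if is_pair:
                merged.append(left + right)  # fuse the pair into one symbol
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# The three merges printed in Example 2
learned = [("a", "r"), ("c", "ar"), ("b", "ar")]

print(apply_merges_in_order("cartier", learned))  # ['car', 't', 'i', 'e', 'r', '</es>']
print(apply_merges_in_order("barbie", learned))   # ['bar', 'b', 'i', 'e', '</es>']
```

For these two words the result matches the output above apart from the trailing marker. With a larger merge list the two strategies can diverge, because the set-based scan may apply a late merge before an earlier one.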
