# Chapter 2: Working with Text Data

## 2.2 Tokenizing Text

In [1]:
from importlib.metadata import version

print("PyTorch version:", version("torch"))
print("tiktoken version:", version("tiktoken"))

PyTorch version: 2.9.1
tiktoken version: 0.12.0


In this section, we will tokenize text into smaller units, such as individual words and punctuation characters.

Before that, we will load the raw text we want to work with.

In [2]:
import os
import requests

if not os.path.exists("the-verdict.txt"):
    url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"
    file_path = "the-verdict.txt"

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    with open(file_path, "wb") as f:
        f.write(response.content)

In [3]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print(f"Total number of characters: {len(raw_text)}")
print(f"First 100 characters:\n{raw_text[:100]}")

Total number of characters: 20479
First 100 characters:
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no g


In [4]:
# To start with, we will use `re` to tokenize the text into words and punctuation.
import re

text = "Hello, world! This is a test."
result = re.split(r'(\s)', text)

print(result)

['Hello,', ' ', 'world!', ' ', 'This', ' ', 'is', ' ', 'a', ' ', 'test.']


In [5]:
result = re.split(r'([,.]|\s)', text)

print(result)

['Hello', ',', '', ' ', 'world!', ' ', 'This', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


In [6]:
# Filter out empty strings and whitespace-only items.
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world!', 'This', 'is', 'a', 'test', '.']


**NOTE:** When developing a simple tokenizer, whether we encode whitespace as separate tokens or discard it depends on our application and its requirements. Removing whitespace reduces memory and computing requirements, but keeping it can be useful if we train models that are sensitive to the exact structure of the text (for example, Python code, where indentation carries meaning).
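To make this trade-off concrete, the following sketch tokenizes a short Python snippet twice, once discarding whitespace and once keeping it. The helper name `tokenize` and its `keep_whitespace` flag are illustrative assumptions and not part of the chapter's code.

```python
import re

def tokenize(text, keep_whitespace=False):
    # Hypothetical helper: split on punctuation and whitespace,
    # keeping the separators themselves as tokens
    parts = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    if keep_whitespace:
        # Drop only the empty strings produced between adjacent separators
        return [p for p in parts if p != ""]
    # Drop empty strings and whitespace-only items
    return [p for p in parts if p.strip()]

code = "if x:\n    return y"

without_ws = tokenize(code)
with_ws = tokenize(code, keep_whitespace=True)

print(without_ws)  # ['if', 'x', ':', 'return', 'y']
print(with_ws)
print(len(without_ws), "tokens without whitespace,", len(with_ws), "with whitespace")

# Only the whitespace-preserving variant can reproduce the input exactly
print("".join(with_ws) == code)  # True
```

In this example, discarding whitespace yields 5 tokens instead of 12, but the newline and indentation that define the structure of the Python snippet are lost.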

In [7]:
# Final tokenizer implementation
text = "Hello, world. Is this-- a test?"

result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


In [8]:
# Test the tokenizer on the raw text
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]

# Print the first 30 tokens
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


## 2.3 Converting Tokens into Token IDs

To convert textual tokens into numerical representations that machine learning models can process, we need to map each token to a unique integer ID. This process is essential for feeding text data into models like neural networks.

Before that, we need to build a vocabulary that defines how each unique word and special character maps to an integer. The vocabulary acts as a lookup table between tokens and the IDs that the model consumes.

We can build this vocabulary from the tokens produced in the previous section by sorting the unique tokens and assigning an integer ID to each one:

In [9]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print(f"Vocabulary size: {vocab_size}")

Vocabulary size: 1130


In [10]:
# Create a vocabulary mapping from token to ID
vocab = {token: integer for integer, token in enumerate(all_words)}

# Display the first 50 items in the vocabulary
for i, item in enumerate(vocab.items()):
    if i >= 50:
        break
    print(item)

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)


Next, we will apply the vocabulary to convert text tokens into their corresponding token IDs. When we want to convert the outputs of an LLM from numbers back into text, we can use the reverse mapping from token IDs to tokens.

To do this, we will implement a tokenizer class:

In [11]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        # Tokenize the input text
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]

        # Convert tokens to token IDs
        ids = [self.str_to_int[s] for s in preprocessed]

        return ids
    
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuation marks
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)

        return text

In [12]:
tokenizer = SimpleTokenizerV1(vocab)

text = """"It's the last he painted, you know," 
           Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


In [13]:
tokenizer.decode(ids)

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

In [14]:
tokenizer.decode(tokenizer.encode(text))

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

This looks good so far, but the tokenizer will raise an error if we try to encode a token that is not in the vocabulary:

In [15]:
text = "Hello, do you like tea?"
print(tokenizer.encode(text))

KeyError: 'Hello'

## 2.4 Adding Special Context Tokens

To handle unknown words and support special context tokens, we can extend our tokenizer class.

We will add two special tokens to our vocabulary:
- `<|unk|>` for unknown words
- `<|endoftext|>` to signify the end of a text sequence.

When training GPT-like LLMs on multiple independent documents or books, it is common to insert a token before each document or book that follows a previous text source. This helps the LLM understand that although these text sources are concatenated for training purposes, they are independent of each other.
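As a rough sketch of this idea, the snippet below joins several already-tokenized documents into a single training sequence and places a separator ID before every document that follows another one. The function name `concat_documents`, the token IDs, and the separator ID are all made up for illustration; in practice the ID depends on the vocabulary in use.

```python
def concat_documents(docs_ids, separator_id):
    """Join per-document token ID lists, inserting separator_id between documents."""
    combined = []
    for n, ids in enumerate(docs_ids):
        if n > 0:
            # Mark the boundary before every document that follows another one
            combined.append(separator_id)
        combined.extend(ids)
    return combined

# Hypothetical token IDs for three short, independent documents
docs = [[12, 7, 33], [5, 18], [42]]
SEPARATOR_ID = 9999  # placeholder ID standing in for <|endoftext|>

stream = concat_documents(docs, SEPARATOR_ID)
print(stream)  # [12, 7, 33, 9999, 5, 18, 9999, 42]
```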

Now we will modify our tokenizer class and vocabulary to include these special tokens.

In [18]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|unk|>", "<|endoftext|>"])

# Update the vocabulary to include special tokens
vocab = {token: integer for integer, token in enumerate(all_tokens)}

print(f"Updated vocabulary size: {len(vocab)}")

Updated vocabulary size: 1132


In [19]:
# Print the last 5 items in the updated vocabulary
for item in list(vocab.items())[-5:]:
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|unk|>', 1130)
('<|endoftext|>', 1131)


Next, we will update our tokenizer to handle unknown tokens gracefully by mapping them to the `<|unk|>` token ID during encoding.

In [20]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        # Tokenize the input text
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # Convert unknown tokens to <|unk|>
        preprocessed = [
            item if item in self.str_to_int
            else "<|unk|>" for item in preprocessed
        ]
        # Convert tokens to token IDs
        ids = [self.str_to_int[s] for s in preprocessed]

        return ids
    
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove spaces before the specified punctuation marks
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

In [21]:
tokenizer = SimpleTokenizerV2(vocab)

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

text = " <|endoftext|> ".join((text1, text2))

print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [22]:
tokenizer.encode(text)

[1130, 5, 355, 1126, 628, 975, 10, 1131, 55, 988, 956, 984, 722, 988, 1130, 7]

In [23]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'

Depending on the LLM, some researchers also consider additional special tokens:
- `[BOS]` (beginning of sequence) to mark the start of a text sequence. It signifies to the model where a piece of content begins.
- `[EOS]` (end of sequence) to mark the end of a text sequence. Similar to `<|endoftext|>`, it indicates where a piece of content concludes.
- `[PAD]` (padding) to fill in sequences to a uniform length. When training LLMs with batch sizes larger than one, the batch might contain texts of varying lengths. The `[PAD]` token extends shorter sequences to match the length of the longest sequence in the batch, so that all sequences have the same length for efficient processing.
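The padding step can be sketched in a few lines of plain Python. The function name `pad_batch` and the padding ID below are assumptions for illustration; a real vocabulary would reserve a dedicated ID for `[PAD]`.

```python
def pad_batch(batch, pad_id):
    """Right-pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

# Hypothetical token ID sequences of different lengths
batch = [[5, 9, 2], [7], [3, 1, 4, 1, 5]]
PAD_ID = 0  # placeholder; assumed not to collide with a regular token ID

padded = pad_batch(batch, PAD_ID)
for row in padded:
    print(row)
# [5, 9, 2, 0, 0]
# [7, 0, 0, 0, 0]
# [3, 1, 4, 1, 5]
```

This sketch pads on the right; whether padding goes on the left or the right is a design choice that varies between implementations.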

## 2.5 Byte Pair Encoding (BPE)

The **Byte Pair Encoding (BPE)** tokenizer was used to train LLMs such as GPT-2, GPT-3, and the original model used in ChatGPT. 

BPE allows the model to break down words that are not in its predefined vocabulary into smaller subword units or even individual characters, enabling it to handle out-of-vocabulary words more effectively.
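To give an intuition for how such subword units can arise, here is a minimal character-level sketch of the BPE idea: repeatedly merge the most frequent pair of adjacent symbols in a toy corpus, then apply the learned merges to new words. All function names and the corpus are made up for illustration. This is a simplified teaching example, not the byte-level implementation used for GPT-2.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in a symbol tuple with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_merges(corpus, num_merges):
    """Learn a list of merge rules from a whitespace-separated toy corpus."""
    # Start with every word split into individual characters
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        # Pick the most frequent pair (ties go to the pair seen first)
        best = max(counts, key=counts.get)
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges

def segment(word, merges):
    """Split a word into subword units by applying the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

corpus = "low low low lower lowest new newer newest"
merges = learn_merges(corpus, num_merges=4)

print(merges)
print(segment("lowest", merges))  # ['low', 'e', 's', 't']
print(segment("newly", merges))   # ['new', 'l', 'y']
```

The word `newly` never appears in the toy corpus, yet it can still be represented as the learned unit `new` followed by single characters. No input has to be mapped to an unknown-word token.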

We will explore how BPE works by using an existing BPE implementation from the `tiktoken` library.

In [24]:
import importlib.metadata  # import the submodule explicitly so importlib.metadata.version is available
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.12.0


In [25]:
# Initialize the BPE tokenizer for GPT-2
tokenizer = tiktoken.get_encoding("gpt2")

In [26]:
# Test the BPE tokenizer
text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace."

ids = tokenizer.encode(
    text,
    allowed_special={"<|endoftext|>"}
)

print(ids)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271, 13]


In [27]:
strings = tokenizer.decode(ids)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.


The `<|endoftext|>` token is assigned a relatively large token ID (50256). The GPT-2 BPE vocabulary contains 50,257 tokens, and `<|endoftext|>` is the last of them, so its ID is the largest in the vocabulary.

The BPE tokenizer can also handle unknown words, such as `someunknownPlace`, by breaking them down into smaller subword units.

In [28]:
tokenizer.encode("someunknownPlace")

[11246, 34680, 27271]

In [None]:
tokenizer.encode("some unknown Place")

[11246, 555, 74, 8474]