# Chapter 2: Working with Text Data

## 2.2 Tokenizing Text

In [1]:
from importlib.metadata import version

print("PyTorch version:", version("torch"))
print("tiktoken version:", version("tiktoken"))

PyTorch version: 2.9.1
tiktoken version: 0.12.0


In this section, we will tokenize text into smaller units, such as individual words and punctuation characters.

Before that, we will load the raw text we want to work with.

In [2]:
import os
import requests

if not os.path.exists("the-verdict.txt"):
    url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"
    file_path = "the-verdict.txt"

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    with open(file_path, "wb") as f:
        f.write(response.content)

In [3]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

print(f"Total number of characters: {len(raw_text)}")
print(f"First 100 characters:\n{raw_text[:100]}")

Total number of characters: 20479
First 100 characters:
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no g


In [4]:
# To start with, we will use `re` to tokenize the text into words and punctuation.
import re

text = "Hello, world! This is a test."
result = re.split(r'(\s)', text)

print(result)

['Hello,', ' ', 'world!', ' ', 'This', ' ', 'is', ' ', 'a', ' ', 'test.']


In [5]:
result = re.split(r'([,.]|\s)', text)

print(result)

['Hello', ',', '', ' ', 'world!', ' ', 'This', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


In [6]:
# Filter out empty strings and whitespace-only items.
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world!', 'This', 'is', 'a', 'test', '.']


**NOTE:** When developing a simple tokenizer, whether we should encode whitespace as separate tokens or remove it depends on our application and its requirements. Removing whitespace reduces memory and compute requirements, but keeping it can be useful if we train models that are sensitive to the exact structure of the text (for example, Python code, where indentation is significant).
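
To make the trade-off concrete, here is a minimal sketch that tokenizes a short piece of Python code both ways. The `tokenize` helper and its `keep_whitespace` flag are hypothetical names introduced for this illustration; the splitting pattern is the same one used in the cells above.

```python
import re

# Same splitting pattern as in the cells above (comma, period, or whitespace)
PATTERN = r'([,.]|\s)'

def tokenize(text, keep_whitespace=False):
    """Split text into tokens; optionally keep whitespace characters as tokens."""
    pieces = re.split(PATTERN, text)
    if keep_whitespace:
        # Drop only the empty strings that re.split emits between adjacent separators
        return [p for p in pieces if p != ""]
    # Drop empty strings and whitespace-only pieces
    return [p for p in pieces if p.strip()]

code_line = "if x:\n    return 1"

print(tokenize(code_line))
print(tokenize(code_line, keep_whitespace=True))
```

With `keep_whitespace=True`, joining the tokens reproduces the original string exactly, including the indentation. Without it, the newline and the four leading spaces are lost.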

In [7]:
# Final tokenizer implementation
text = "Hello, world. Is this-- a test?"

result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


In [8]:
# Test the tokenizer on the raw text
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]

# Print the first 30 tokens
print(preprocessed[:30])

['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


## 2.3 Converting Tokens into Token IDs

To convert textual tokens into numerical representations that machine learning models can process, we need to map each token to a unique integer ID. This process is essential for feeding text data into models like neural networks.

Before that, we need to build a vocabulary that defines how we map each unique word and special character to an integer. This vocabulary acts as a dictionary for the model to understand the input data.

From these tokens, we can build a vocabulary by assigning a unique integer ID to each unique token:

In [9]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print(f"Vocabulary size: {vocab_size}")

Vocabulary size: 1130


In [10]:
# Create a vocabulary mapping from token to ID
vocab = {token: integer for integer, token in enumerate(all_words)}

# Display the first 50 items in the vocabulary
for i, item in enumerate(vocab.items()):
    if i >= 50:
        break
    print(item)

('!', 0)
('"', 1)
("'", 2)
('(', 3)
(')', 4)
(',', 5)
('--', 6)
('.', 7)
(':', 8)
(';', 9)
('?', 10)
('A', 11)
('Ah', 12)
('Among', 13)
('And', 14)
('Are', 15)
('Arrt', 16)
('As', 17)
('At', 18)
('Be', 19)
('Begin', 20)
('Burlington', 21)
('But', 22)
('By', 23)
('Carlo', 24)
('Chicago', 25)
('Claude', 26)
('Come', 27)
('Croft', 28)
('Destroyed', 29)
('Devonshire', 30)
('Don', 31)
('Dubarry', 32)
('Emperors', 33)
('Florence', 34)
('For', 35)
('Gallery', 36)
('Gideon', 37)
('Gisburn', 38)
('Gisburns', 39)
('Grafton', 40)
('Greek', 41)
('Grindle', 42)
('Grindles', 43)
('HAD', 44)
('Had', 45)
('Hang', 46)
('Has', 47)
('He', 48)
('Her', 49)


Next, we will apply the vocabulary to convert text tokens into their corresponding token IDs. When we want to convert the outputs of an LLM from numbers back into text, we can use the reverse mapping from token IDs to tokens.

To do this, we will implement a tokenizer class:

In [11]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        # Tokenize the input text
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]

        # Convert tokens to token IDs
        ids = [self.str_to_int[s] for s in preprocessed]

        return ids
    
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuation marks
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)

        return text

In [12]:
tokenizer = SimpleTokenizerV1(vocab)

text = """"It's the last he painted, you know," 
           Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(ids)

[1, 56, 2, 850, 988, 602, 533, 746, 5, 1126, 596, 5, 1, 67, 7, 38, 851, 1108, 754, 793, 7]


In [13]:
tokenizer.decode(ids)

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

In [14]:
tokenizer.decode(tokenizer.encode(text))

'" It\' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.'

This looks good so far, but encoding raises an error if the text contains a token that is not in the vocabulary:

In [15]:
text = "Hello, do you like tea?"
print(tokenizer.encode(text))

KeyError: 'Hello'
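
One way to see this failure coming is to compare the tokens of a text against the vocabulary before encoding. The sketch below is only an illustration: `find_unknown_tokens` is a hypothetical helper, and `toy_vocab` is a tiny made-up vocabulary rather than the one built from the short story.

```python
import re

def find_unknown_tokens(text, vocab):
    """Return the tokens of `text` that have no entry in `vocab` (order preserved)."""
    tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    tokens = [t.strip() for t in tokens if t.strip()]
    return [t for t in tokens if t not in vocab]

# Hypothetical toy vocabulary, for illustration only
toy_tokens = sorted({",", "?", "do", "like", "tea", "you"})
toy_vocab = {token: idx for idx, token in enumerate(toy_tokens)}

missing = find_unknown_tokens("Hello, do you like tea?", toy_vocab)
print(missing)
```

Here only `Hello` is missing from the toy vocabulary, which is the same kind of token that triggers the `KeyError` above.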

## 2.4 Adding Special Context Tokens

To handle unknown words and to mark the boundaries between independent texts, we can extend our tokenizer with special context tokens.

We will add two special tokens to our vocabulary:
- `<|unk|>` for unknown words
- `<|endoftext|>` to signify the end of a text sequence.

When training GPT-like LLMs on multiple independent documents or books, it is common to insert a separator token such as `<|endoftext|>` before each document or book that follows a previous text source. This helps the LLM understand that although these text sources are concatenated for training purposes, they are independent of each other.

Now we will modify our tokenizer class and vocabulary to include these special tokens.

In [18]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|unk|>", "<|endoftext|>"])

# Update the vocabulary to include special tokens
vocab = {token: integer for integer, token in enumerate(all_tokens)}

print(f"Updated vocabulary size: {len(vocab)}")

Updated vocabulary size: 1132


In [19]:
# Print the last 5 items in the updated vocabulary
for item in list(vocab.items())[-5:]:
    print(item)

('younger', 1127)
('your', 1128)
('yourself', 1129)
('<|unk|>', 1130)
('<|endoftext|>', 1131)


Next, we will update our tokenizer to handle unknown tokens gracefully by mapping them to the `<|unk|>` token ID during encoding.

In [20]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        # Tokenize the input text
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # Convert unknown tokens to <|unk|>
        preprocessed = [
            item if item in self.str_to_int
            else "<|unk|>" for item in preprocessed
        ]
        # Convert tokens to token IDs
        ids = [self.str_to_int[s] for s in preprocessed]

        return ids
    
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove spaces before the specified punctuation marks
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

In [21]:
tokenizer = SimpleTokenizerV2(vocab)

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."

text = " <|endoftext|> ".join((text1, text2))

print(text)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.


In [22]:
tokenizer.encode(text)

[1130, 5, 355, 1126, 628, 975, 10, 1131, 55, 988, 956, 984, 722, 988, 1130, 7]

In [23]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, do you like tea? <|endoftext|> In the sunlit terraces of the <|unk|>.'

Depending on the LLM, some researchers also consider additional special tokens:
- `[BOS]` (beginning of sequence) to mark the start of a text sequence. It signifies to the model where a piece of content begins.
- `[EOS]` (end of sequence) to mark the end of a text sequence. Similar to `<|endoftext|>`, it indicates where a piece of content concludes.
- `[PAD]` (padding) to fill sequences to a uniform length. When training LLMs with batch sizes larger than one, the batch might contain texts of varying lengths. The `[PAD]` token extends shorter sequences to match the length of the longest sequence in the batch, so that all sequences can be processed together efficiently.
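
As a rough illustration of padding, the sketch below right-pads lists of token IDs to a common length. The `pad_batch` helper, the example token IDs, and the choice of `0` as the padding ID are all assumptions made for this example.

```python
def pad_batch(sequences, pad_id):
    """Right-pad each list of token IDs to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

# Hypothetical token IDs; pad_id=0 is an arbitrary choice for this sketch
batch = [[5, 8, 2], [7], [4, 9]]
padded = pad_batch(batch, pad_id=0)

print(padded)
```

In practice, the padded positions are typically masked so that they do not contribute to the training loss.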

## 2.5 Byte Pair Encoding (BPE)

The **Byte Pair Encoding (BPE)** tokenizer was used to train LLMs such as GPT-2, GPT-3, and the original model used in ChatGPT. 

BPE allows the model to break down words that are not in its predefined vocabulary into smaller subword units or even individual characters, enabling it to handle out-of-vocabulary words more effectively.
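
To give a feel for how subword units arise, here is a toy sketch of the merge idea behind BPE: start from single characters and repeatedly merge the most frequent adjacent pair. It works on characters rather than bytes, uses a made-up word list, and is not how `tiktoken` is implemented; `learn_merges`, `merge_pair`, and `apply_merges` are hypothetical helper names.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace each occurrence of `pair` in `symbols` with the joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def learn_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Start with every word split into single characters
    corpus = [list(word) for word in words]
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs
        pair_counts = Counter()
        for symbols in corpus:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        # Merge the most frequent pair everywhere it occurs
        best_pair = pair_counts.most_common(1)[0][0]
        merges.append(best_pair)
        corpus = [merge_pair(symbols, best_pair) for symbols in corpus]
    return merges

def apply_merges(word, merges):
    """Segment a (possibly unseen) word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

merges = learn_merges(["low", "lower", "lowest", "slow", "lot"], num_merges=2)
print(merges)
print(apply_merges("slowly", merges))
```

The word `slowly` never appears in the toy word list, yet it can still be segmented into a learned subword (`low`) plus individual characters, so no unknown-word token is needed.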

We will explore how BPE works by using an existing BPE implementation from the `tiktoken` library.

In [10]:
import importlib.metadata
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.12.0


In [11]:
# Initialize the BPE tokenizer for GPT-2
tokenizer = tiktoken.get_encoding("gpt2")

In [12]:
# Test the BPE tokenizer
text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace."

ids = tokenizer.encode(
    text,
    allowed_special={"<|endoftext|>"}
)

print(ids)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271, 13]


In [13]:
strings = tokenizer.decode(ids)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.


The `<|endoftext|>` token is assigned the largest token ID (50256): the GPT-2 BPE vocabulary contains 50,257 tokens with IDs 0 through 50256, and `<|endoftext|>` is the last entry.

The BPE tokenizer can also handle unknown words, such as `"someunknownPlace"`, correctly breaking it down into smaller subword units.

In [28]:
tokenizer.encode("someunknownPlace")

[11246, 34680, 27271]

In [None]:
tokenizer.encode("some unknown Place")

[11246, 555, 74, 8474]

## 2.6 Data Sampling with a Sliding Window

Our next step is to generate the input-target pairs required for training an LLM. LLMs are pretrained by predicting the next word in a text, one word at a time, so we need to prepare the training data accordingly.

In [14]:
# Load the text data
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

# Encode the entire text using the BPE tokenizer
enc_text = tokenizer.encode(raw_text)
print(f"Total number of tokens: {len(enc_text)}")

Total number of tokens: 5145


For demo purposes, we will remove the first 50 tokens from the dataset:

In [35]:
enc_sample = enc_text[50:]

Since we want the model to predict the next token, the targets are the inputs shifted by one position.

In [37]:
context_size = 4

x = enc_sample[:context_size]
y = enc_sample[1 : context_size + 1]

print(f"x: {x}")
print(f"y:      {y}")

x: [290, 4920, 2241, 287]
y:      [4920, 2241, 287, 257]


The *context size* defines how many tokens are included in the input.

Now we can create input-target pairs using a sliding window approach.

In [38]:
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    print(f"Context: {context} -> Desired: {desired}")

Context: [290] -> Desired: 4920
Context: [290, 4920] -> Desired: 2241
Context: [290, 4920, 2241] -> Desired: 287
Context: [290, 4920, 2241, 287] -> Desired: 257


In [39]:
# Convert token IDs back to strings for better readability
for i in range(1, context_size+1):
    context = enc_sample[:i]
    desired = enc_sample[i]

    context_str = tokenizer.decode(context)
    desired_str = tokenizer.decode([desired])

    print(f"Context: '{context_str}' -> Desired: '{desired_str}'")

Context: ' and' -> Desired: ' established'
Context: ' and established' -> Desired: ' himself'
Context: ' and established himself' -> Desired: ' in'
Context: ' and established himself in' -> Desired: ' a'


Next, we will implement an efficient dataloader that iterates over the input dataset and returns the inputs and targets as PyTorch tensors for training. There are two tensors: an input tensor containing the text that the LLM sees and a target tensor containing the expected outputs (the next token for each position in the input).

We will use `Dataset` and `DataLoader` classes from the `torch.utils.data` module to create a custom dataset and dataloader for our tokenized text data.

In [1]:
import torch
print("PyTorch version:", torch.__version__)

PyTorch version: 2.9.1+cpu


In [7]:
from torch.utils.data import Dataset, DataLoader

class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(
            txt,
            allowed_special={"<|endoftext|>"}
        )
        assert len(token_ids) > max_length, "Number of tokenized inputs must be at least `max_length + 1`."

        # Use a sliding window to chunk the book into overlapping sequences of `max_length` tokens
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i : i + max_length]
            target_chunk = token_ids[i+1 : i + max_length+1]

            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)
    
    def __getitem__(self, index):
        return self.input_ids[index], self.target_ids[index]

Then we use the `GPTDatasetV1` class to load the inputs in batches via a `DataLoader` instance.

In [8]:
def create_dataloader_v1(
    txt,
    batch_size=4,
    max_length=256,
    stride=128,
    shuffle=True,
    drop_last=True,
    num_workers=0
):
    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create the dataset
    dataset = GPTDatasetV1(
        txt,
        tokenizer,
        max_length,
        stride
    )

    # Create the dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

Test the `dataloader` with a batch size of 1 for an LLM with a context size of 4 to see how it works.

In [43]:
dataloader = create_dataloader_v1(
    raw_text,
    batch_size=1,
    max_length=4,
    stride=1,
    shuffle=False,
)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print("Input IDs:", first_batch[0])
print("Target IDs:", first_batch[1])

Input IDs: tensor([[  40,  367, 2885, 1464]])
Target IDs: tensor([[ 367, 2885, 1464, 1807]])


The `first_batch` variable contains two tensors: the input IDs and the target IDs. Since the `max_length` is set to 4, each tensor has a shape of `(1, 4)`, indicating one sequence of four tokens.

In [44]:
second_batch = next(data_iter)
print("Input IDs:", second_batch[0])
print("Target IDs:", second_batch[1])

Input IDs: tensor([[ 367, 2885, 1464, 1807]])
Target IDs: tensor([[2885, 1464, 1807, 3619]])


Comparing the first and second batches, we can see that the second batch's token IDs are shifted by **one position** compared to the first batch. The `stride` parameter in the `GPTDatasetV1` class controls this shifting behavior.

Next, we use the dataloader to sample with a larger batch size and a different stride:

In [46]:
dataloader = create_dataloader_v1(
    raw_text,
    batch_size=8,
    max_length=4,
    stride=4,
    shuffle=False,
)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Input IDs:\n", inputs)
print("Target IDs:\n", targets)

Input IDs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])
Target IDs:
 tensor([[  367,  2885,  1464,  1807],
        [ 3619,   402,   271, 10899],
        [ 2138,   257,  7026, 15632],
        [  438,  2016,   257,   922],
        [ 5891,  1576,   438,   568],
        [  340,   373,   645,  1049],
        [ 5975,   284,   502,   284],
        [ 3285,   326,    11,   287]])


With a stride of 4, equal to `max_length`, consecutive input sequences do not overlap, yet no token is skipped. We avoid overlap here because heavily overlapping samples could lead to increased overfitting during training.
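
The effect of the stride can be checked on a small stand-in sequence. The sketch below reuses the windowing loop from `GPTDatasetV1` in plain Python; `sliding_windows` is a hypothetical helper, and the integers `0` to `9` stand in for real token IDs.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs, where each target is the input shifted by one."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[i : i + max_length]
        targets = token_ids[i + 1 : i + max_length + 1]
        pairs.append((inputs, targets))
    return pairs

token_ids = list(range(10))  # stand-in for real token IDs

for stride in (1, 2, 4):
    pairs = sliding_windows(token_ids, max_length=4, stride=stride)
    inputs_only = [inputs for inputs, _ in pairs]
    print(f"stride={stride}: {len(pairs)} samples, inputs={inputs_only}")
```

A stride of 1 yields the most samples but with heavy overlap between neighboring inputs, while a stride equal to `max_length` yields fewer samples whose inputs share no tokens.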

## 2.7 Creating Token Embeddings

The data is almost ready for training an LLM. As a last step, we need to embed the tokens in a continuous vector representation using an embedding layer. Usually, these embedding layers are part of the LLM architecture itself and are updated (trained) during model training.

Suppose we have the following four input token IDs:

In [2]:
input_ids = torch.tensor([2, 3, 5, 1])

For demo purposes, suppose we have a small vocabulary of only 6 tokens and an embedding dimension of 3.

In [3]:
vocab_size = 6
output_dim = 3

torch.manual_seed(0)
embedding_layer = torch.nn.Embedding(
    num_embeddings=vocab_size,
    embedding_dim=output_dim
)

print(embedding_layer.weight)

Parameter containing:
tensor([[-1.1258, -1.1524,  0.5667],
        [ 0.7935,  0.5988, -1.5551],
        [-0.3414,  1.8530,  0.4681],
        [-0.1577, -0.1734,  0.1835],
        [ 1.3894,  1.5863,  0.9463],
        [-0.8437,  0.9318,  1.2590]], requires_grad=True)


This weight matrix has 6 rows (one for each token in the vocabulary) and 3 columns (the embedding dimension). 

Now we can apply it to a token ID to obtain the corresponding embedding vector.

In [4]:
print(embedding_layer(torch.tensor([3])))

tensor([[-0.1577, -0.1734,  0.1835]], grad_fn=<EmbeddingBackward0>)


This outputs the embedding vector for the token ID `3`. Because indexing starts at 0, this is the 4th row of the embedding weight matrix.
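
Conceptually, this lookup gives the same result as multiplying a one-hot vector by the weight matrix, only without the wasted multiplications by zero. The plain-Python sketch below illustrates that equivalence using the weight values printed above; `embed` and `embed_via_one_hot` are hypothetical helpers, not part of PyTorch.

```python
# Weight matrix copied from the printed output above (6 tokens x 3 dimensions)
weights = [
    [-1.1258, -1.1524,  0.5667],
    [ 0.7935,  0.5988, -1.5551],
    [-0.3414,  1.8530,  0.4681],
    [-0.1577, -0.1734,  0.1835],
    [ 1.3894,  1.5863,  0.9463],
    [-0.8437,  0.9318,  1.2590],
]

def embed(token_id):
    """Look up the embedding by selecting row `token_id`."""
    return weights[token_id]

def embed_via_one_hot(token_id):
    """Compute the same vector as a one-hot vector times the weight matrix."""
    one_hot = [1.0 if i == token_id else 0.0 for i in range(len(weights))]
    return [
        sum(one_hot[row] * weights[row][col] for row in range(len(weights)))
        for col in range(len(weights[0]))
    ]

print(embed(3))
print(embed_via_one_hot(3))
```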

Now we can use this embedding layer to convert all input token IDs into their corresponding embedding vectors, which can then be fed into the LLM for training or inference.

In [5]:
print(embedding_layer(input_ids))

tensor([[-0.3414,  1.8530,  0.4681],
        [-0.1577, -0.1734,  0.1835],
        [-0.8437,  0.9318,  1.2590],
        [ 0.7935,  0.5988, -1.5551]], grad_fn=<EmbeddingBackward0>)


## 2.8 Encoding Word Positions

The self-attention mechanism in LLMs has no built-in notion of token order: on its own, reordering the input tokens simply reorders the outputs (strictly speaking, the mechanism is **permutation equivariant**). The token embedding layer likewise maps a given token ID to the same vector no matter where the token appears. To address this, we need to encode the positions of tokens in a sequence so that the model can distinguish identical tokens at different positions.
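
The order-agnostic behavior can be seen in a stripped-down sketch. The function below is a simplified stand-in for self-attention that omits trainable weight matrices and causal masking, and the 2-dimensional token vectors are made up for illustration.

```python
import math

def simple_self_attention(vectors):
    """Simplified self-attention: each output is a softmax-weighted average of all inputs."""
    outputs = []
    for query in vectors:
        # Dot-product scores between this token and every token in the sequence
        scores = [sum(q * k for q, k in zip(query, key)) for key in vectors]
        max_score = max(scores)
        exp_scores = [math.exp(s - max_score) for s in scores]
        total = sum(exp_scores)
        attn = [e / total for e in exp_scores]
        outputs.append([
            sum(w * vec[d] for w, vec in zip(attn, vectors))
            for d in range(len(query))
        ])
    return outputs

# Hypothetical 2-dimensional token embeddings for three tokens
tokens = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
reordered = [tokens[2], tokens[0], tokens[1]]

out = simple_self_attention(tokens)
out_reordered = simple_self_attention(reordered)

# The third token yields the same output vector after being moved to the front
print(out[2])
print(out_reordered[0])
```

Up to floating-point rounding, each token receives the same output vector wherever it sits in the sequence, so nothing in this computation tells the model which token came first.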

Thus, we will inject additional positional information into the token embeddings before feeding them into the LLM. There are two broad categories of position-aware embeddings:
- absolute position embeddings
- relative position embeddings

#### Absolute Position Embeddings
In absolute position embeddings, each position in the input sequence is assigned a unique embedding vector based on its absolute position (e.g., first word, second word, etc.).

#### Relative Position Embeddings
In relative position embeddings, the model learns to represent the positions of tokens relative to each other, rather than based on their absolute positions in the sequence. This allows the model to focus on the relationships between tokens regardless of their specific positions.

The advantage of relative position embeddings is that the model can generalize better to sequences of varying lengths even if it has not seen such lengths during training.

OpenAI's GPT models use absolute position embeddings that are optimized during the training process rather than being fixed or predefined like the sinusoidal position embeddings used in the original Transformer model.
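
For contrast with learned position embeddings, here is a sketch of the fixed sinusoidal scheme, assuming the formulation from the original Transformer paper (sine on even dimensions, cosine on odd dimensions, with wavelengths growing geometrically up to a base of 10000). The function name and the small sizes are choices made for this illustration.

```python
import math

def sinusoidal_position_embeddings(context_length, dim):
    """Build fixed position embeddings: sine on even dimensions, cosine on odd ones."""
    table = []
    for pos in range(context_length):
        row = []
        for i in range(dim):
            # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / dim)
            angle = pos / (10000 ** ((i - i % 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_position_embeddings(context_length=4, dim=6)

print(pe[0])  # position 0: every sine is 0.0 and every cosine is 1.0
print(pe[1])
```

These values are computed once and never updated, whereas the GPT-style position embeddings used in this chapter are ordinary trainable parameters.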

Suppose we encode the input tokens into a 256-dimensional embedding space, and assume the token IDs were created by the BPE tokenizer from Section 2.5, which has a vocabulary size of 50257 (including the `<|endoftext|>` token).

In [6]:
vocab_size = 50257
output_dim = 256

token_embed_layer = torch.nn.Embedding(
    num_embeddings=vocab_size,
    embedding_dim=output_dim
)

When we sample data from the dataloader we created earlier, the token embedding layer defined above embeds each token in each batch into a 256-dimensional vector.

If we have a batch size of 8 with 4 tokens each (context size of 4), the resulting tensor of embedded tokens will have a shape of `(8, 4, 256)`, where:
- `8` is the batch size,
- `4` is the context size (number of tokens per sequence),
- `256` is the embedding dimension for each token.

In [15]:
max_length = 4
batch_size = 8

dataloader = create_dataloader_v1(
    raw_text,
    batch_size=batch_size,
    max_length=max_length,
    stride=max_length,
    shuffle=False,
)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)

print(f"Token IDs:\n{inputs}")
print(f"\nInputs shape:\n{inputs.shape}")

Token IDs:
tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Inputs shape:
torch.Size([8, 4])


In [16]:
token_embeddings = token_embed_layer(inputs)
print(f"Token Embeddings shape:\n{token_embeddings.shape}")

Token Embeddings shape:
torch.Size([8, 4, 256])


For the absolute position embeddings in a GPT model, we need to create another embedding layer that has the same embedding dimension as the `token_embed_layer`:

In [19]:
context_length = max_length
pos_embed_layer = torch.nn.Embedding(
    num_embeddings=context_length,
    embedding_dim=output_dim
)

# Create position embeddings for each position in the sequence
pos_embeddings = pos_embed_layer(torch.arange(context_length))
print(f"Position Embeddings shape:\n{pos_embeddings.shape}")

Position Embeddings shape:
torch.Size([4, 256])


The input to the `pos_embed_layer` is the tensor `torch.arange(context_length)`, which contains the position indices `0` to `context_length - 1`. The layer returns one embedding vector for each of these positions.

In this case, we set `context_length` equal to `max_length`, so the number of supported positions matches the number of tokens in each input sequence. In practice, input text can be longer than the supported context size, in which case we have to truncate the text.

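Next, we add the position embeddings to the token embeddings to create position-aware token embeddings that can be fed into the LLM. PyTorch broadcasts the `(4, 256)` position tensor across the batch dimension, so the same four position vectors are added to each of the 8 sequences.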

In [20]:
input_embeddings = token_embeddings + pos_embeddings
print(f"Input Embeddings shape:\n{input_embeddings.shape}")

Input Embeddings shape:
torch.Size([8, 4, 256])
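
The addition above works even though the two tensors have different shapes, because the position embeddings are reused for every sequence in the batch. The plain-Python sketch below spells out that behavior on a tiny made-up example (2 sequences, 3 positions, 2 dimensions); `add_position_embeddings` is a hypothetical helper, not a PyTorch function.

```python
def add_position_embeddings(token_embeddings, pos_embeddings):
    """Add pos_embeddings[p] to the token at position p in every sequence of the batch."""
    return [
        [
            [t + p for t, p in zip(token_vec, pos_vec)]
            for token_vec, pos_vec in zip(sequence, pos_embeddings)
        ]
        for sequence in token_embeddings
    ]

# Hypothetical values: batch of 2 sequences, 3 positions, 2 dimensions
token_embeddings = [
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
    [[4.0, 4.0], [5.0, 5.0], [6.0, 6.0]],
]
pos_embeddings = [[0.5, 0.25], [1.5, 1.25], [2.5, 2.25]]

result = add_position_embeddings(token_embeddings, pos_embeddings)

# The first position of both sequences receives the same offset [0.5, 0.25]
print(result[0][0], result[1][0])
```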
