# Working with Text (Tokenizing & Creating Embeddings)

This notebook covers data preparation and sampling process when building an LLM to get input data ready. 

The entire workflow of the data preparation and sampling process would be as follows:
1. Tokenizing Text
2. Converting tokens into token IDs
3. Adding special context tokens & BytePair encoding
4. Data sampling with a sliding window
5. Creating token embeddings
6. Encoding word positions

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/01.webp?timestamp=1" width="500px">

In [115]:
!pip install torch



## Word embeddings

**Word embeddings** are a way of representing words as vectors in a multi-dimensional space, where the distance and direction between vectors reflect the similarity and relationships among the corresponding words.

The goal of the data preparation process in building an LLM is to assign each word its own word embedding.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/02.webp" width="500px">

LLMs work with embeddings in high-dimensional spaces (i.e., thousands of dimensions).

The figure below illustrates a 2-dimensional embedding space.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/03.webp" width="300px">

## 1. Tokenizing text

**Tokenizing text** means breaking text into smaller units, such as individual words and punctuation characters

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/04.webp" width="300px">

The goal is to tokenize and embed this text for an LLM

A simple tokenizer will operate as follows:

In [127]:
import re

text = "Hello, world. This, is a test."
result = re.split(r'(\s)', text) # Split on whitespaces

print(result)

['Hello,', ' ', 'world.', ' ', 'This,', ' ', 'is', ' ', 'a', ' ', 'test.']


In [128]:
result = re.split(r'([,.]|\s)', text) # Split on commas and periods

print(result)

['Hello', ',', '', ' ', 'world', '.', '', ' ', 'This', ',', '', ' ', 'is', ' ', 'a', ' ', 'test', '.', '']


Remove empty strings:

In [129]:
# Strip whitespace from each item and then filter out any empty strings.
result = [item for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']


Handling other types of punctuation (periods, question marks, etc): 

In [130]:
text = "Hello, world. Is this-- a test? I am so glad to meet you!"

preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
preprocessed = [item.strip() for item in result if item.strip()]
print(preprocessed)

['Hello', ',', 'world', '.', 'This', ',', 'is', 'a', 'test', '.']


In [131]:
# Total number of tokens
print(len(preprocessed)) 

10


We are now ready to apply this tokenization to the raw text

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/05.webp" width="350px">

## 2. Converting tokens into token IDs

In this step, we convert the text tokens into token IDs that we can process via embedding layers later

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/06.webp" width="500px">

From the created tokens, build a vocabulary that consists of all the unique tokens

In [132]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print(vocab_size)

8


In [133]:
vocab = {token:integer for integer,token in enumerate(all_words)}

Entries in this vocabulary:

In [134]:
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 20:
        break

(',', 0)
('.', 1)
('Hello', 2)
('This', 3)
('a', 4)
('is', 5)
('test', 6)
('world', 7)


A Tokenizer class that handles the entire tokenization process:

In [135]:
class SimpleTokenizer:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)

        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

- The `encode` function turns text into token IDs
- The `decode` function turns token IDs back into text

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/08.webp?123" width="500px">

- We can use the tokenizer to encode (that is, tokenize) texts into integers
- These integers then can be embedded (later) as input for the LLM

In [139]:
tokenizer = SimpleTokenizer(vocab)

text = "Hello, world. This is a test."
ids = tokenizer.encode(text)
print(ids)

[2, 0, 7, 1, 3, 5, 4, 6, 1]


- Decoding the integers back into text

In [140]:
tokenizer.decode(ids)

'Hello, world. This is a test.'

In [141]:
tokenizer.decode(tokenizer.encode(text))

'Hello, world. This is a test.'

## 3. Adding special context tokens

Add some **special** tokens for unknown words, or to denote the end of a text

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/09.webp?123" width="500px">

- Some tokenizers use special tokens to help the LLM with additional context.
- Some of these special tokens are:
  - `[BOS]` (beginning of sequence) marks the beginning of text
  - `[EOS]` (end of sequence) marks where the text ends (this is usually used to concatenate multiple unrelated texts, e.g., two different Wikipedia articles or two different books, and so on)
  - `[PAD]` (padding) if we train LLMs with a batch size greater than 1 (we may include multiple texts with different lengths; with the padding token we pad the shorter texts to the longest length so that all texts have an equal length)
- `[UNK]` to represent words that are not included in the vocabulary

- GPT-2 does not need any of these tokens, but only uses an `<|endoftext|>` token (analogous to the `[EOS]`) to reduce complexity.
- GPT also uses the `<|endoftext|>` for padding (since we typically use a mask when training on batched inputs, we would not attend padded tokens anyways, so it does not matter what these tokens are)
- GPT-2 does not use an `<UNK>` token for out-of-vocabulary words; instead, GPT-2 uses a byte-pair encoding (BPE) tokenizer, which breaks down words into subword units.

We use the `<|endoftext|>` tokens between two independent sources of text:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/10.webp" width="500px">

Example:

In [None]:
# This chunk is supposed to produce an error.
tokenizer = SimpleTokenizer(vocab)

text = "Hello, do you like tea. Is this-- a test?"

tokenizer.encode(text)

KeyError: 'do'

The above produces an error because the word "do" is not contained in the vocabulary.

To deal with the **unknown words**, we can use special tokens: `"<|unk|>"` 

Let's also add another token called `"<|endoftext|>"` which is used in GPT-2 training to denote the end of a text (It's also used between concatenated text, when our training datasets consists of multiple articles, books, etc.)

In [None]:
all_tokens = sorted(list(set(preprocessed)))
all_tokens.extend(["<|endoftext|>", "<|unk|>"])

vocab = {token:integer for integer,token in enumerate(all_tokens)}

In [None]:
len(vocab.items())

20

In [None]:
# Confirm that those two tokens are added
for i, item in enumerate(list(vocab.items())[-5:]):
    print(item)

('the', 15)
('with', 16)
('you', 17)
('<|endoftext|>', 18)
('<|unk|>', 19)


Adjust the tokenizer accordingly with the new `<unk>` token

In [None]:
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = { i:s for s,i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed = [
            item if item in self.str_to_int
            else "<|unk|>" for item in preprocessed
        ]

        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Replace spaces before the specified punctuations
        text = re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)
        return text

Let's try to tokenize text with the modified tokenizer:

In [None]:
tokenizer = SimpleTokenizerV2(vocab)

text1 = "Hello, do you like tea?"
text2 = "Do you want some?"

text = " <|endoftext|> ".join((text1, text2))

print(text)

Hello, do you like tea? <|endoftext|> Do you want some?


In [None]:
tokenizer.encode(text)

[19, 2, 19, 17, 19, 19, 19, 18, 19, 17, 19, 19, 19]

The unknown tokens are replaced with `<unk>` token.

In [None]:
tokenizer.decode(tokenizer.encode(text))

'<|unk|>, <|unk|> you <|unk|> <|unk|> <|unk|> <|endoftext|> <|unk|> you <|unk|> <|unk|> <|unk|>'

### Special text encoding with BytePair

**GPT-2** used **BytePair encoding (BPE)** as its tokenizer.

It allows the model to break down words that aren't in its predefined vocabulary into smaller subword units or even individual characters, enabling it to handle out-of-vocabulary words.

For instance, if GPT-2's vocabulary doesn't have the word "unfamiliarword," it might tokenize it as ["unfam", "iliar", "word"] or some other subword breakdown, depending on its trained BPE merges.

The original BPE tokenizer can be found here: [https://github.com/openai/gpt-2/blob/master/src/encoder.py](https://github.com/openai/gpt-2/blob/master/src/encoder.py)

In this notebook, I'll be using the BPE tokenizer from OpenAI's open-source [tiktoken](https://github.com/openai/tiktoken) library, which implements its core algorithms in Rust to improve computational performance.

In [None]:
pip install tiktoken


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip3 install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [None]:
import importlib
import tiktoken

print("tiktoken version:", importlib.metadata.version("tiktoken"))

tiktoken version: 0.9.0


In [None]:
tokenizer = tiktoken.get_encoding("gpt2")

In [None]:
text = (
    "Hello, do you like tea? <|endoftext|> Do you want some?"
)

integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})

print(integers)

[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 2141, 345, 765, 617, 30]


In [None]:
strings = tokenizer.decode(integers)

print(strings)

Hello, do you like tea? <|endoftext|> Do you want some?


BPE tokenizers break down unknown words into subwords and individual characters:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/11.webp" width="300px">

## 4. Data sampling with a sliding window

We train LLMs to generate/predict one word at a time.

So we need to prepare the training data accordingly, where the next word in a sequence represents the target to predict:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/12.webp" width="400px">

In [None]:
raw_text = "LLMs learn to predict one word at a time."

enc_text = tokenizer.encode(raw_text)
print(enc_text)

[3069, 10128, 2193, 284, 4331, 530, 1573, 379, 257, 640, 13]


For each text chunk, we want the inputs and targets.

Since we want the model to predict the next word, the targets are the inputs shifted by one position to the right.

In [None]:
context_size = 4 # Define the context size as 4

x = enc_text[:context_size]
y = enc_text[1:context_size+1]

print(f"x: {x}")
print(f"y:      {y}")

x: [3069, 10128, 2193, 284]
y:      [10128, 2193, 284, 4331]


One by one, the prediction would look like:

In [None]:
# Represented with token IDs
for i in range(1, context_size+1):
    context = enc_text[:i]
    desired = enc_text[i]

    print(context, "---->", desired)

[3069] ----> 10128
[3069, 10128] ----> 2193
[3069, 10128, 2193] ----> 284
[3069, 10128, 2193, 284] ----> 4331


In [None]:
# Represented with tokens
for i in range(1, context_size+1):
    context = enc_text[:i]
    desired = enc_text[i]

    print(tokenizer.decode(context), "---->", tokenizer.decode([desired]))

LL ----> Ms
LLMs ---->  learn
LLMs learn ---->  to
LLMs learn to ---->  predict


A simple data loader that iterates over the input dataset and returns the inputs and targets shifted by one

In [None]:
import torch
print("PyTorch version:", torch.__version__)

PyTorch version: 2.6.0


We use a sliding window approach, changing the position by +1:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/13.webp?123" width="500px">

Create dataset and dataloader that extract chunks from the input text dataset

In [None]:
from torch.utils.data import Dataset, DataLoader

# A class for creating dataset (GPT)
class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the book into overlapping sequences of max_length
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

In [None]:
# Data loader for sliding window approach
def create_dataloader_v1(txt, batch_size=4, max_length=256,
                         stride=128, shuffle=True, drop_last=True,
                         num_workers=0):

    # Initialize the tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

Testing the dataloader with a batch size of 1 for an LLM with a context size of 4:

In [None]:
dataloader = create_dataloader_v1(
    raw_text, batch_size=1, max_length=4, stride=1, shuffle=False
)

data_iter = iter(dataloader)
first_batch = next(data_iter)
print(first_batch)

[tensor([[ 3069, 10128,  2193,   284]]), tensor([[10128,  2193,   284,  4331]])]


In [None]:
second_batch = next(data_iter)
print(second_batch)

[tensor([[10128,  2193,   284,  4331]]), tensor([[2193,  284, 4331,  530]])]


Creating batched outputs:

(Note: adjust the stride (overlaps between the batches) accordingly. More overlap could lead to increased overfitting)

In [None]:
dataloader = create_dataloader_v1(raw_text, batch_size=2, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

Inputs:
 tensor([[ 3069, 10128,  2193,   284],
        [ 4331,   530,  1573,   379]])

Targets:
 tensor([[10128,  2193,   284,  4331],
        [  530,  1573,   379,   257]])


## 5. Creating token embeddings

The data is already almost ready for an LLM.

The last/additional step is to embed the tokens in a continuous vector representation using an embedding layer.

Usually, these embedding layers are part of the LLM itself and are updated (trained) during model training.

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/15.webp" width="400px">

Suppose we have the following four input examples with input ids 2, 3, 5, and 1 (after tokenization):

In [None]:
input_ids = torch.tensor([2, 3, 5, 1])

- Also, suppose we have a small vocabulary of only 6 words and we want to create embeddings of size (dimension) 3:

In [None]:
vocab_size = 6
output_dim = 3

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

This would result in a 6x3 weight matrix:

In [None]:
print(embedding_layer.weight)

Parameter containing:
tensor([[ 0.3374, -0.1778, -0.1690],
        [ 0.9178,  1.5810,  1.3010],
        [ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-1.1589,  0.3255, -0.6315],
        [-2.8400, -0.7849, -1.4096]], requires_grad=True)


The embedding layer approach above is essentially just a more efficient way of implementing one-hot encoding followed by matrix multiplication in a fully-connected layer.

The embedding layer can be seen as a neural network layer that can be optimized via backpropagation.

Converting a token with id 3 into a 3-dimensional vector (The 4th row in the `embedding_layer` weight matrix):

In [None]:
print(embedding_layer(torch.tensor([3])))

tensor([[-0.4015,  0.9666, -1.1481]], grad_fn=<EmbeddingBackward0>)


Embedding all four `input_ids` values above:

In [None]:
print(embedding_layer(input_ids))

tensor([[ 1.2753, -0.2010, -0.1606],
        [-0.4015,  0.9666, -1.1481],
        [-2.8400, -0.7849, -1.4096],
        [ 0.9178,  1.5810,  1.3010]], grad_fn=<EmbeddingBackward0>)


An embedding layer is essentially a look-up operation:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/16.webp?123" width="500px">

## 6. Encoding word positions

**Embedding layer** convert token IDs into identical vector representations regardless their positions in the input sequence:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/17.webp" width="400px">

**Positional embeddings** are combined with the token embedding vector to form the input embeddings for a large language model:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/18.webp" width="500px">

The BytePair encoder has a vocabulary size of 50,257.

Suppose we want to encode the input tokens into a 256-dimensional vector representation:

In [None]:
vocab_size = 50257
output_dim = 256

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

If we sample data from the dataloader, we embed the tokens in each batch into a 256-dimensional vector.

If we have a batch size of 8 with 4 tokens each, this results in a 8 x 4 x 256 tensor:

In [None]:
print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)

Token IDs:
 tensor([[   40,   367,  2885,  1464],
        [ 1807,  3619,   402,   271],
        [10899,  2138,   257,  7026],
        [15632,   438,  2016,   257],
        [  922,  5891,  1576,   438],
        [  568,   340,   373,   645],
        [ 1049,  5975,   284,   502],
        [  284,  3285,   326,    11]])

Inputs shape:
 torch.Size([8, 4])


In [None]:
token_embeddings = token_embedding_layer(inputs)
print(token_embeddings.shape)

torch.Size([2, 4, 256])


GPT-2 uses absolute position embeddings, so just create another embedding layer:

In [None]:
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

In [None]:
pos_embeddings = pos_embedding_layer(torch.arange(max_length))
print(pos_embeddings.shape)

torch.Size([4, 256])


To create the input embeddings used in an LLM, simply add the token and the positional embeddings:

In [None]:
input_embeddings = token_embeddings + pos_embeddings
print(input_embeddings.shape)

torch.Size([2, 4, 256])


In the initial phase of the input processing workflow, the input text is segmented into separate tokens

Following this segmentation, these tokens are transformed into token IDs based on a predefined vocabulary:

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/ch02_compressed/19.webp" width="400px">