# Stage 2 - Tokenization and Data Pipelines


In [1]:
# Import previous stages
from tiny_gpt import env_display_tinygpt_modules, io_load_text_file
print("Running Stage 1")
print(50* "=")


env_display_tinygpt_modules()

print(50* "=")
raw_text_data = io_load_text_file()


print(50* "=")
print("Stage 1 Successfully Loaded")

Running Stage 1
Running env_display_tinygpt_modules
🔧 Environment Summary
--------------------------------------------------
Python version : 3.13.7
Platform       : Windows 11

Torch version  : 2.9.0+cu130
NumPy version  : 2.3.4
Tiktoken ver.  : 0.12.0
Pandas version : 2.3.3
--------------------------------------------------
⚙️  CUDA & GPU Information
--------------------------------------------------
✅ CUDA available: 13.0
🧠 GPU name      : NVIDIA GeForce RTX 4070 Laptop GPU
💽 Total memory  : 8.59 GB
--------------------------------------------------
🔤 Tokenizer Test (tiktoken)
--------------------------------------------------
Input text : Once upon a time in TinyGPT...
Tokens     : [7454, 2402, 257, 640, 287, 20443, 38, 11571, 986]
Decoded    : Once upon a time in TinyGPT...
--------------------------------------------------
✅ TinyGPT environment initialized successfully!
--------------------------------------------------
Running io_load_text_file
Loaded '../The_Verdict.txt' succes

---

## Part A – Tokenization




- Words are represented as continuous-valued vectors, allowing mathematical operations to be applied on them.

- The process of converting text into vector form is known as embedding.

- Retrieval-Augmented Generation (RAG) combines two steps:

  - Retrieval — searching an external knowledge base for relevant information.

  - Generation — producing coherent text using both the retrieved and contextual information.

- The embedding size (or dimensionality) represents the number of features in each word vector — it defines the hidden state size of the model.

- For instance, the smallest GPT-2 model uses 768-dimensional embeddings, meaning each token is encoded across 768 numerical features.
  Think of it as measuring a word’s meaning along many abstract axes: for example, a “bird” might score high on a ‘can-fly’ dimension.
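
To make "mathematical operations on word vectors" concrete, here is a minimal sketch that compares toy vectors with cosine similarity. The two axes (can-fly, can-swim) and all numbers are invented for illustration; real embeddings are learned, and their individual dimensions are generally not interpretable like this.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 2-dimensional "embeddings": [can-fly, can-swim]
toy_vectors = {
    "bird":  [0.9, 0.1],
    "plane": [0.8, 0.3],
    "fish":  [0.1, 0.9],
}

# Vectors pointing in similar directions get a score closer to 1
print(cosine_similarity(toy_vectors["bird"], toy_vectors["plane"]))
print(cosine_similarity(toy_vectors["bird"], toy_vectors["fish"]))
```

With these made-up values, "bird" ends up closer to "plane" than to "fish", which is the kind of geometric relationship embeddings are meant to capture.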


### Regex Tokenizer Concept
- Build a simple tokenizer using regular expressions.
- We also refrain from making all text lowercase because capitalization helps LLMs distinguish between proper nouns and common nouns, understand sentence structure, and learn to generate text with proper capitalization.
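
The source of the `RegexTokenizer` class from `tiny_gpt` is not shown in this notebook, so the following is only a minimal standalone sketch of the idea. The split pattern and the helper names (`split_tokens`, `build_vocab`, `encode`, `decode`) are assumptions made for illustration, not the library's actual API.

```python
import re

# Assumed split pattern: punctuation, double dashes, and whitespace
SPLIT_PATTERN = r'([,.:;?_!"()\']|--|\s)'

def split_tokens(text):
    """Split text into word and punctuation tokens, dropping whitespace."""
    pieces = re.split(SPLIT_PATTERN, text)
    return [p.strip() for p in pieces if p.strip()]

def build_vocab(tokens):
    """Map each unique token to an integer ID (sorted for determinism)."""
    return {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}

def encode(text, vocab):
    """Convert text to a list of token IDs (raises KeyError on unknown tokens)."""
    return [vocab[tok] for tok in split_tokens(text)]

def decode(ids, vocab):
    """Convert token IDs back to text."""
    id_to_token = {idx: tok for tok, idx in vocab.items()}
    text = " ".join(id_to_token[i] for i in ids)
    # Remove the space that join() placed before punctuation marks
    return re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)

sample = "Hello, world. Is this-- a test?"
tokens = split_tokens(sample)
vocab = build_vocab(tokens)
ids = encode(sample, vocab)
print(tokens)
print(decode(ids, vocab))
```

Capitalization is preserved, so "Hello" and "hello" would receive different IDs. In this sketch, a word that is missing from the vocabulary raises a `KeyError`, which is one motivation for special tokens and for subword schemes such as BPE.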



In [2]:
from tiny_gpt import RegexTokenizer



rt_obj = RegexTokenizer(raw_text_data)

# Step 1: Split the text into tokens
vocab_list = rt_obj.split()
print("\n🧩 Tokenized Vocabulary List:")
print(vocab_list)

# Step 2: Create the vocabulary mappings (token <-> ID)
vocab = rt_obj.create_vocabulary()
print("\n📘 Vocabulary Created:")
print(f"Total Tokens: {len(vocab)}")
print(list(vocab.items())[:10], "...")  # print first 10 tokens

# Step 3: Encode a sample text into IDs
text = """"It's the last he painted, you know," 
       Mrs. Gisburn said with pardonable pride."""
ids = rt_obj.encode(text)
print("\n🔢 Encoded IDs:")
print(ids)

# Step 4: Decode the IDs back into readable text
decoded_text = rt_obj.decode(ids)
print("\n🗣️ Decoded Text:")
print(decoded_text)

# Step 5: Encode text with <|endoftext|> separators (like GPT-style)
text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
joined_text = " <|endoftext|> ".join((text1, text2))
print("\n📜 Joined Text with <|endoftext|> token:")
print(joined_text)

# Step 6: Encode and decode the joined text
ids = rt_obj.encode(joined_text)
print("\n🔢 Encoded IDs for Joined Text:")
print(ids)

decoded_joined_text = rt_obj.decode(ids)
print("\n🗣️ Decoded Joined Text:")
print(decoded_joined_text)




🧩 Tokenized Vocabulary List:
['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in', 'the', 'height', 'of', 'his', 'glory', ',', 'he', 'had', 'dropped', 'his', 'painting', ',', 'married', 'a', 'rich', 'widow', ',', 'and', 'established', 'himself', 'in', 'a', 'villa', 'on', 'the', 'Riviera', '.', '(', 'Though', 'I', 'rather', 'thought', 'it', 'would', 'have', 'been', 'Rome', 'or', 'Florence', '.', ')', '"', 'The', 'height', 'of', 'his', 'glory', '"', '--', 'that', 'was', 'what', 'the', 'women', 'called', 'it', '.', 'I', 'can', 'hear', 'Mrs', '.', 'Gideon', 'Thwing', '--', 'his', 'last', 'Chicago', 'sitter', '--', 'deploring', 'his', 'unaccountable', 'abdication', '.', '"', 'Of', 'course', 'it', "'", 's', 'going', 'to', 'send', 'the', 'value', 'of', 'my', 'picture', "'", 'way', 'up', ';', 'but', 'I', 'don', "'", 

- [BOS] (beginning of sequence): This token marks the start of a text. It signals to the LLM where a piece of content begins.
- [EOS] (end of sequence): This token is positioned at the end of a text and is especially useful when concatenating multiple unrelated texts, similar to <|endoftext|>. For instance, when combining two different Wikipedia articles or books, the [EOS] token indicates where one ends and the next begins.
- [PAD] (padding): When training LLMs with batch sizes larger than one, the batch might contain texts of varying lengths. To ensure all texts have the same length, the shorter texts are extended or “padded” using the [PAD] token, up to the length of the longest text in the batch.
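
As a rough illustration of the padding step, the sketch below right-pads a batch of token-ID lists to a common length. The `PAD_ID` value and the `pad_batch` helper are hypothetical; a real tokenizer defines its own ID for the padding token.

```python
PAD_ID = 0  # hypothetical ID reserved for the [PAD] token

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]

batch = [[5, 6, 7, 8], [9, 10], [11]]
padded = pad_batch(batch)
print(padded)
```

After padding, every row has the same length, so the batch can be stacked into a single rectangular tensor.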

---

### TikToken Byte Pair Encoding

- In this section, we perform tokenization using **Byte Pair Encoding (BPE)** provided by the `tiktoken` library.
- We use the **GPT-2 tokenizer** as our base model for encoding the text.

- BPE breaks words that aren’t in its predefined vocabulary into smaller subword units or even individual characters. As a result, when the tokenizer encounters an unfamiliar word, it can still represent it as a sequence of known subword tokens instead of failing on out-of-vocabulary input.
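
The core merge step behind BPE can be sketched in a few lines. This is a toy, character-level illustration under simplifying assumptions (one short string, two merges); it is not how `tiktoken` is implemented, and the GPT-2 tokenizer works on bytes with a fixed, pre-trained merge table. The helper names are made up for this example.

```python
from collections import Counter

def most_frequent_pair(symbols):
    """Return the adjacent symbol pair that occurs most often."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

# Start from individual characters and apply two merges
symbols = list("low lower lowest")
for _ in range(2):
    pair = most_frequent_pair(symbols)
    symbols = merge_pair(symbols, pair)

print(symbols)
```

After two merges, the frequent character sequence "low" has become a single symbol, while the rarer endings stay split into characters. Repeating this process over a large corpus is what yields a subword vocabulary.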


In [3]:
# Example: Encoding and decoding text using TikTokenizer

from tiny_gpt import TikTokenizer

text1 = "Hello, do you like tea?"
text2 = "In the sunlit terraces of the palace."
joined_text = " <|endoftext|> ".join((text1, text2))

print("\nJoined Text with <|endoftext|> token:")
print(joined_text)


# Initialize tokenizer
tik_obj = TikTokenizer("gpt2", {"<|endoftext|>"})

# Encode text to token IDs
encoded_ids = tik_obj.encode(joined_text)
print("\nEncoded Token IDs:")
print(encoded_ids)

# Decode token IDs back to text
decoded_text = tik_obj.decode(encoded_ids)
print("\nDecoded Text:")
print(decoded_text)

# Optional: Show token count
print("\nTotal Tokens:", tik_obj.count_tokens(joined_text))



Joined Text with <|endoftext|> token:
Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.

Encoded Token IDs:
[15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 262, 20562, 13]

Decoded Text:
Hello, do you like tea? <|endoftext|> In the sunlit terraces of the palace.

Total Tokens: 19


## Part B - Data Pipelines

### LLM Training Notes

1. Tokenization

So far, we’ve explored how to convert raw text into tokens, using methods like:

- Regex tokenization – simple rule-based splitting of text.

- BPE (Byte Pair Encoding) – merges frequent character pairs into larger units, balancing efficiency and vocabulary size.

Tokens are the smallest meaningful units the model works with.

2. Dataset and Dataloader

To train a model, we need to feed data in a structured way — that’s where the dataset and dataloader come in.

- Dataset

  - The dataset is a subclass of torch.utils.data.Dataset.

  It contains:

  - Input tokens (the source text as token IDs)

  - Targets (the shifted version of input tokens)

- Inputs are created using:

  - context_length → defines how many tokens the model can see at once.

  - stride → defines how much to move (or slide) the input window forward for the next sample.

- For example:

  tokens: [A, B, C, D, E, F, G]
  context_length = 4
  stride = 2

  → samples:

  Input 1: [A, B, C, D] → Target 1: [B, C, D, E]

  Input 2: [C, D, E, F] → Target 2: [D, E, F, G]

Each input has a corresponding target, usually the same sequence shifted by one token.
LLMs are trained to predict the next token given the previous ones.
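
The sliding-window logic can be sketched in plain Python as follows. The `sliding_windows` helper is hypothetical and only mirrors the idea; the actual dataset class in `tiny_gpt` is not shown here and may differ in details such as how it handles the end of the token stream.

```python
def sliding_windows(token_ids, context_length, stride):
    """Build (input, target) pairs; each target is its input shifted by one token."""
    samples = []
    # Stop early enough that every target window is complete
    for start in range(0, len(token_ids) - context_length, stride):
        inputs = token_ids[start:start + context_length]
        targets = token_ids[start + 1:start + context_length + 1]
        samples.append((inputs, targets))
    return samples

tokens = list("ABCDEFG")
for inputs, targets in sliding_windows(tokens, context_length=4, stride=2):
    print(inputs, "->", targets)
```

Note that a window starting at C needs a seventh token (G here) so that its target is full length. With `stride` equal to `context_length`, consecutive inputs do not overlap; with a smaller stride, they share tokens.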

- Dataloader

The dataloader takes the dataset and:

- Divides it into batches (controlled by batch_size)

- Optionally shuffles the samples

- Can drop the last incomplete batch if needed

Each batch contains multiple input sequences (and their targets) that are fed together into the model during training.

This batching helps with:

- Parallelization on GPUs

- Stabilizing gradients through averaging across samples
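
The batching behavior described above can be approximated in plain Python. This is only a sketch with a hypothetical `make_batches` helper; `torch.utils.data.DataLoader` exposes the same ideas through its `batch_size`, `shuffle`, and `drop_last` arguments and additionally stacks the samples into tensors.

```python
import random

def make_batches(samples, batch_size, shuffle=False, drop_last=True, seed=0):
    """Group samples into lists of `batch_size` items."""
    order = list(range(len(samples)))
    if shuffle:
        # Fixed seed so the sketch is reproducible
        random.Random(seed).shuffle(order)
    batches = []
    for start in range(0, len(order), batch_size):
        chunk = [samples[i] for i in order[start:start + batch_size]]
        if drop_last and len(chunk) < batch_size:
            continue  # skip the final, incomplete batch
        batches.append(chunk)
    return batches

samples = list(range(10))
print(make_batches(samples, batch_size=4))
print(make_batches(samples, batch_size=4, drop_last=False))
```

Dropping the last incomplete batch keeps every batch the same shape, at the cost of skipping a few samples per epoch.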

3. Token Embeddings

The inputs from the dataloader are still integer token IDs.
Before feeding them into the transformer, they must be converted into dense, learnable vector representations called token embeddings.

The embedding layer maps each token ID → a high-dimensional vector.

Each dimension represents a different learned “feature” or aspect of meaning.

These embeddings form the model’s input representation space.
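
Conceptually, an embedding layer is a lookup table with one row per token ID. The plain-Python sketch below illustrates just that lookup, using made-up sizes and random values; it leaves out the learning part, which `torch.nn.Embedding` handles by treating the table as a trainable weight matrix.

```python
import random

rng = random.Random(123)          # fixed seed so the sketch is reproducible
vocab_size, embed_dim = 6, 3      # tiny, made-up sizes

# One row per token ID, randomly initialized like a fresh embedding layer
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(embed_dim)]
    for _ in range(vocab_size)
]

def embed(token_ids):
    """Look up one row of the table for each token ID."""
    return [embedding_table[token_id] for token_id in token_ids]

vectors = embed([2, 5, 2])
print(len(vectors), len(vectors[0]))  # sequence length, embedding dimension
```

The same token ID always maps to the same row, so the first and third vectors here are identical. That is exactly why position information has to be added separately.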

4. Positional Embeddings

Transformers have no inherent sense of order: self-attention on its own treats the input as an unordered set of tokens.
To fix this, we add positional embeddings to the token embeddings.

They encode where each token appears in the sequence.

When added to token embeddings, they let the model distinguish sequences that contain the same tokens in a different order (e.g., “cat sat on mat” ≠ “mat sat on cat”).

```
Raw text
   ↓
Tokenizer → Tokens
   ↓
Dataset (creates input-target pairs using stride & context length)
   ↓
Dataloader (batches & shuffles data)
   ↓
Token Embeddings (map token IDs → dense vectors)
   ↓
+ Positional Embeddings (add sequence order info)
   ↓
Transformer (self-attention, training, etc.)
```


In [4]:
from tiny_gpt import tiny_data_loader

# ----------------------------- #
# Example: Understanding stride and batch behavior
# ----------------------------- #

print("Stride = 1")
# Create dataloader with stride=1 → input window slides by 1 token each time
data_loader = tiny_data_loader(
    raw_text_data, batch_size=1, max_length=4, stride=1, shuffle=False
)
data_iter = iter(data_loader)

# Each batch contains (input_ids, target_ids)
first_batch = next(data_iter)
print("First Batch")
print(first_batch)

second_batch = next(data_iter)
print("Second Batch")
print(second_batch)

print(50 * "-")

# ----------------------------- #
# Stride = 2 → input moves 2 tokens ahead between samples
# Target is still shifted by 1 internally (always next-token prediction)
# ----------------------------- #
print("Stride = 2")
data_loader = tiny_data_loader(
    raw_text_data, batch_size=1, max_length=4, stride=2, shuffle=False
)
data_iter = iter(data_loader)

first_batch = next(data_iter)
print("First Batch")
print(first_batch)

second_batch = next(data_iter)
print("Second Batch")
print(second_batch)

print(50 * "-")

# ----------------------------- #
# Stride = 4 → no overlap, input window jumps completely to the next chunk
# ----------------------------- #
print("Stride = 4")
data_loader = tiny_data_loader(
    raw_text_data, batch_size=1, max_length=4, stride=4, shuffle=False
)
data_iter = iter(data_loader)

first_batch = next(data_iter)
print("First Batch")
print(first_batch)

second_batch = next(data_iter)
print("Second Batch")
print(second_batch)

print(50 * "-")

# ----------------------------- #
# Batch size = 4, Stride = 2
# Multiple samples are grouped together in one batch
# Each batch is a pair of tensors (inputs, targets), each of shape (batch_size, max_length)
# ----------------------------- #
print("Batch size 4 ,Stride = 2")
data_loader = tiny_data_loader(
    raw_text_data, batch_size=4, max_length=4, stride=2, shuffle=False
)
data_iter = iter(data_loader)

first_batch = next(data_iter)
print("First Batch")
print(first_batch)

second_batch = next(data_iter)
print("Second Batch")
print(second_batch)

print(50 * "-")


Stride = 1
First Batch
[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]
Second Batch
[tensor([[ 367, 2885, 1464, 1807]]), tensor([[2885, 1464, 1807, 3619]])]
--------------------------------------------------
Stride = 2
First Batch
[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]
Second Batch
[tensor([[2885, 1464, 1807, 3619]]), tensor([[1464, 1807, 3619,  402]])]
--------------------------------------------------
Stride = 4
First Batch
[tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]
Second Batch
[tensor([[1807, 3619,  402,  271]]), tensor([[ 3619,   402,   271, 10899]])]
--------------------------------------------------
Batch size 4 ,Stride = 2
First Batch
[tensor([[   40,   367,  2885,  1464],
        [ 2885,  1464,  1807,  3619],
        [ 1807,  3619,   402,   271],
        [  402,   271, 10899,  2138]]), tensor([[  367,  2885,  1464,  1807],
        [ 1464,  1807,  3619,   402],
        [ 3619,   402,   271,

## Create Input Embeddings (Token Embeddings + Position Embedding)

In [14]:
import torch
from tiny_gpt import tiny_data_loader
from tiny_gpt import TikTokenizer  # assuming your tokenizer class is here

# ------------------------------------------------------------
# Initialize tokenizer and get vocabulary info
# ------------------------------------------------------------
tik_obj = TikTokenizer("gpt2", {"<|endoftext|>"})

print(50 * "-")
print("Total tokens present in the vocabulary")
print(tik_obj.get_n_vocab_tokens())

# ------------------------------------------------------------
# Create token embedding layer
# ------------------------------------------------------------
vocab_size = tik_obj.get_n_vocab_tokens()  # total tokens in tokenizer vocabulary
output_dim = 256  # embedding vector dimension (hidden size)

token_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)
print(token_embedding_layer)

print(50 * "-")

# ------------------------------------------------------------
# Prepare a simple dataloader
# ------------------------------------------------------------
max_length = 4
dataloader = tiny_data_loader(
    raw_text_data,
    batch_size=8,
    max_length=max_length,
    stride=max_length - 1,
    shuffle=False,
)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)

print("Token IDs:\n", inputs)
print("\nInputs shape:\n", inputs.shape)
print(50 * "-")

# ------------------------------------------------------------
# Convert token IDs to embeddings
# ------------------------------------------------------------
token_embeddings = token_embedding_layer(inputs)
print("Token embeddings shape:", token_embeddings.shape)
print(50 * "-")

# ------------------------------------------------------------
# Create position embeddings
# ------------------------------------------------------------
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)

# Get positional embeddings for each position in the input (0, 1, 2, 3)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))
print("Positional embeddings shape:", pos_embeddings.shape)

# ------------------------------------------------------------
# Combine token + positional embeddings
# ------------------------------------------------------------
# Broadcasting automatically matches positions for each token in the batch
input_embeddings = token_embeddings + pos_embeddings
print("Final input embeddings shape:", input_embeddings.shape)

--------------------------------------------------
Total tokens present in the vocabulary
50257
Embedding(50257, 256)
--------------------------------------------------
Token IDs:
 tensor([[   40,   367,  2885,  1464],
        [ 1464,  1807,  3619,   402],
        [  402,   271, 10899,  2138],
        [ 2138,   257,  7026, 15632],
        [15632,   438,  2016,   257],
        [  257,   922,  5891,  1576],
        [ 1576,   438,   568,   340],
        [  340,   373,   645,  1049]])

Inputs shape:
 torch.Size([8, 4])
--------------------------------------------------
Token embeddings shape: torch.Size([8, 4, 256])
--------------------------------------------------
Positional embeddings shape: torch.Size([4, 256])
Final input embeddings shape: torch.Size([8, 4, 256])


## Token and Positional Embeddings — Shape, Broadcasting, and Learning Behavior

1. Token Embeddings

Defined as:

```
    token_embedding_layer = torch.nn.Embedding(vocab_size, embed_dim)
    token_embeddings = token_embedding_layer(input_ids)

```

Shape of input_ids: (batch_size, seq_len)

Shape of token_embeddings: (batch_size, seq_len, embed_dim)

Each token ID is mapped to a vector of size embed_dim.
This represents the semantic meaning of that token (like a feature vector for each word).

2. Positional Embeddings

Defined as:

```
    pos_embedding_layer = torch.nn.Embedding(context_length, embed_dim)
    pos_embeddings = pos_embedding_layer(torch.arange(seq_len))
```

Shape of pos_embeddings: (seq_len, embed_dim)

This layer assigns a unique embedding vector to each position in the input sequence (0, 1, 2, …).
These vectors are randomly initialized and then learned during training. Because the layer has only context_length rows, seq_len must not exceed context_length.

3. Addition via Broadcasting

When we add them:

input_embeddings = token_embeddings + pos_embeddings

PyTorch applies broadcasting automatically because:

token_embeddings has shape (batch_size, seq_len, embed_dim)

pos_embeddings has shape (seq_len, embed_dim)

Broadcasting expands pos_embeddings along the batch dimension:

```
  (batch_size, seq_len, embed_dim)
+             (seq_len, embed_dim)
----------------------------------
= (batch_size, seq_len, embed_dim)
```

Effectively, each sequence in the batch gets the same positional offsets added to its token embeddings.

So the i-th token in every sequence gets the same position vector added — e.g.:

input_embeddings[b, i, :] = token_embeddings[b, i, :] + pos_embeddings[i, :]
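
To spell out what the broadcast addition computes, here is a plain-Python sketch that applies the indexing rule above with nested lists. The `add_positions` helper and all values are made up for illustration; in PyTorch the single expression `token_embeddings + pos_embeddings` does the same thing.

```python
def add_positions(token_embeddings, pos_embeddings):
    """Add pos_embeddings[i] to the i-th token vector of every sequence.

    token_embeddings: nested list of shape [batch][seq_len][embed_dim]
    pos_embeddings:   nested list of shape [seq_len][embed_dim]
    """
    return [
        [
            [t + p for t, p in zip(token_vec, pos_vec)]
            for token_vec, pos_vec in zip(sequence, pos_embeddings)
        ]
        for sequence in token_embeddings
    ]

token_embeddings = [
    [[1.0, 2.0], [3.0, 4.0], [1.0, 2.0]],   # same token at positions 0 and 2
    [[5.0, 6.0], [7.0, 8.0], [9.0, 10.0]],
]
pos_embeddings = [[0.5, 0.25], [1.0, 0.5], [1.5, 0.75]]

input_embeddings = add_positions(token_embeddings, pos_embeddings)
print(input_embeddings[0][0], input_embeddings[0][2])
```

The first sequence contains the same token vector at positions 0 and 2, yet the two resulting vectors differ because a different position vector was added to each.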

4. Why Addition Works

Adding these two embeddings lets the model encode both meaning and order:

token_embeddings → represent what the token is

pos_embeddings → represent where it appears in the sequence

After addition, each token’s final embedding reflects both its identity and its position.

Example:

```
    token("cat") + pos(0) → "cat at start"
    token("cat") + pos(3) → "cat at end"
```


Without this, “cat” at position 0 and “cat” at position 3 would look identical to the model.

5. How Learning Differentiates Between Them

Both embedding layers (token_embedding_layer and pos_embedding_layer) have independent learnable weights.
During backpropagation:

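Because the two embeddings are summed, the same upstream gradient reaches both layers, and each layer updates only the rows that were looked up.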

token_embedding_layer.weight learns semantic relationships between tokens.

pos_embedding_layer.weight learns how position affects meaning.

Over many training steps:

The token embeddings form clusters for semantically similar words (e.g., “dog” and “cat”).

The position embeddings come to encode positional information, for example:

Early positions → may correlate with the beginnings of sentences.

Later positions → may correlate with sentence endings.

Even though position embeddings start out random, they typically settle into structured representations that help the model take token order into account.
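
The way the two tables receive updates can be sketched by hand. The example below assumes the loss gradient with respect to each summed input vector is already given (the numbers are invented) and applies one plain SGD step; `sgd_step` is a hypothetical helper, and in practice PyTorch's autograd and optimizer do this work.

```python
def sgd_step(token_table, pos_table, token_ids, grad_output, lr=0.1):
    """One plain SGD update for input[i] = token_table[id_i] + pos_table[i].

    grad_output[i] is the (assumed given) loss gradient w.r.t. input[i].
    For a sum, that same gradient applies to both addends.
    """
    for position, token_id in enumerate(token_ids):
        for dim, grad in enumerate(grad_output[position]):
            token_table[token_id][dim] -= lr * grad   # row of the token that appeared
            pos_table[position][dim] -= lr * grad     # row of the position it appeared at

# Tiny made-up example: 3 tokens in the vocabulary, 2 positions, 2 dimensions
token_table = [[0.0, 0.0] for _ in range(3)]
pos_table = [[0.0, 0.0] for _ in range(2)]

# Token 1 appears at both positions; the gradient values are invented
sgd_step(token_table, pos_table, token_ids=[1, 1],
         grad_output=[[1.0, 2.0], [3.0, 4.0]], lr=0.5)

print(token_table)
print(pos_table)
```

A token's row accumulates gradients from every position where that token occurred, while a position's row accumulates gradients from every token that occurred there. Rows that were never looked up (tokens 0 and 2 here) stay unchanged. This difference in what gets aggregated is one way to see why the two tables end up learning different things.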

✅ In short:

| Concept                  | Purpose                   | Output shape                       | Learns?               | Notes                                  |
| ------------------------ | ------------------------- | ---------------------------------- | --------------------- | -------------------------------------- |
| **Token Embedding**      | Represents token meaning  | `(batch_size, seq_len, embed_dim)` | ✅ Yes                | Learns vocabulary semantics            |
| **Positional Embedding** | Represents token position | `(seq_len, embed_dim)`             | ✅ Yes                | Learns order patterns                  |
| **Combined Embedding**   | Input to Transformer      | `(batch_size, seq_len, embed_dim)` | No weights of its own | Sum of the two, added via broadcasting |
