# Data Sampling and DataLoaders for LLM Training

When training Large Language Models (LLMs), we need to efficiently prepare and feed data to the model. This notebook demonstrates:

1. **Why we need data sampling**: Raw text needs to be converted into training examples
2. **Sliding window approach**: How to create overlapping sequences for better learning
3. **DataLoader creation**: Batching sequences for efficient training
4. **Input-target pairs**: Understanding the predict-next-token objective

By the end, you'll understand how to transform raw text into batches of training data that LLMs can learn from.

## 1. Setup and Imports

First, we import the necessary libraries:
- **PyTorch**: For tensor operations and data loading utilities
- **tiktoken**: OpenAI's tokenizer for converting text to tokens
- **Dataset & DataLoader**: PyTorch utilities for efficient data handling

In [6]:
import sys
from pathlib import Path

# Add the parent directory to the path so we can import from core
blog_dir = Path.cwd().parent
sys.path.insert(0, str(blog_dir))

In [7]:
import torch
import tiktoken
from torch.utils.data import Dataset, DataLoader

## 2. Understanding the Sliding Window Dataset

**The Problem**: LLMs are trained to predict the next token given previous tokens. We need to create many training examples from our text.

**The Solution**: Use a sliding window approach where:
- **max_length**: The number of tokens in each training sequence
- **stride**: How many tokens to shift the window (controls overlap)
- **Input**: A sequence of tokens
- **Target**: The same sequence shifted by 1 position (next token prediction)

For example, with tokens `[1, 2, 3, 4, 5]`, `max_length=3`, and `stride=1`:
- Input: `[1, 2, 3]` → Target: `[2, 3, 4]`
- Input: `[2, 3, 4]` → Target: `[3, 4, 5]`

Each pair packs `max_length` prediction tasks. In the first pair, the model sees `[1]` and predicts `2`, sees `[1, 2]` and predicts `3`, and sees `[1, 2, 3]` and predicts `4`.
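
The toy example can be reproduced with plain Python lists before bringing in tensors. This is a minimal sketch: `sliding_windows` is a hypothetical helper name, not part of the notebook's code, and it assumes the same windowing rule as the dataset class defined in this section.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    # Stop early enough that every target chunk still has max_length tokens
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs

pairs = sliding_windows([1, 2, 3, 4, 5], max_length=3, stride=1)
for input_chunk, target_chunk in pairs:
    print(input_chunk, "->", target_chunk)
```

With five tokens and `max_length=3`, only two windows fit, because the last window needs one extra token to serve as its final target.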

In [8]:
class SlidingWindowDataset(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []

        # Tokenize the entire text
        token_ids = tokenizer.encode(txt, allowed_special={"<|endoftext|>"})

        # Use a sliding window to chunk the text into overlapping sequences of max_length.
        # Stopping at len(token_ids) - max_length leaves room for the one-token-shifted target.
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]
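
It can help to know how many examples the dataset will hold before building it. The sketch below mirrors the `range(...)` call in the loop above; `num_windows` is a hypothetical helper introduced only for illustration. It also shows an edge case: a text with `max_length` tokens or fewer produces no examples at all.

```python
def num_windows(n_tokens, max_length, stride):
    """Count the (input, target) pairs the sliding-window loop would produce."""
    # len(range(...)) is 0 when the stop value is zero or negative
    return len(range(0, n_tokens - max_length, stride))

print(num_windows(5, max_length=3, stride=1))         # toy example from Section 2
print(num_windows(1000, max_length=256, stride=128))  # default-sized windows on 1000 tokens
print(num_windows(4, max_length=4, stride=1))         # text too short: no examples
```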

## 3. Creating the DataLoader Function

The `create_dataloader` function combines everything:
1. Initializes the tokenizer (GPT-2's BPE tokenizer)
2. Creates the dataset with sliding windows
3. Wraps it in a DataLoader for batching and shuffling

**Key Parameters**:
- `batch_size`: Number of sequences to process together
- `max_length`: Sequence length (context window)
- `stride`: Window shift size (smaller = more overlap, more training examples)
- `shuffle`: Randomize sequence order each epoch (so the model does not see the text in the same fixed order every time)
- `drop_last`: Drop incomplete final batch (keeps batch sizes consistent)
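
To see how `batch_size` and `drop_last` interact, the number of batches per epoch can be worked out by hand. This is a small sketch with a hypothetical helper, `num_batches`; it assumes a map-style dataset like the one defined above, where the loader walks over every example once per epoch.

```python
import math

def num_batches(n_examples, batch_size, drop_last):
    """Batches per epoch for a dataset with n_examples items."""
    if drop_last:
        # Only full batches are kept
        return n_examples // batch_size
    # The final, smaller batch is kept as well
    return math.ceil(n_examples / batch_size)

print(num_batches(10, 4, drop_last=True))   # 2 full batches; the last 2 examples are skipped
print(num_batches(10, 4, drop_last=False))  # 3 batches; the last one holds only 2 examples
```

Dropping the last batch wastes a few examples per epoch but keeps every batch the same shape, which is usually the simpler option during training.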

In [9]:
def create_dataloader(txt, batch_size=4, max_length=256, 
                      stride=128, shuffle=True, drop_last=True,
                      num_workers=0):
    """
    Create a DataLoader for training language models
    
    Args:
        txt: Raw text string to tokenize
        batch_size: Number of sequences per batch
        max_length: Length of each input sequence
        stride: Step size for sliding window (smaller = more overlap)
        shuffle: Whether to shuffle the data
        drop_last: Whether to drop the last incomplete batch
        num_workers: Number of worker processes for data loading
    
    Returns:
        DataLoader instance
    """
    # Initialize the GPT-2 tokenizer
    tokenizer = tiktoken.get_encoding("gpt2")

    # Create dataset
    dataset = SlidingWindowDataset(txt, tokenizer, max_length, stride)

    # Create dataloader
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle,
        drop_last=drop_last,
        num_workers=num_workers
    )

    return dataloader

## 4. Load the Training Data

Let's load our text data - Romeo and Juliet from Project Gutenberg.

In [11]:
file_path = "../data/romeo_juliet_gutenberg.txt"

with open(file_path, "r", encoding="utf-8") as file:
    text = file.read()
print(f"Data '{file_path}' loaded successfully.")

Data '../data/romeo_juliet_gutenberg.txt' loaded successfully.


## 5. Testing with Small Parameters

Let's create a dataloader with small parameters to understand how it works:
- `batch_size=1`: One sequence at a time
- `max_length=4`: Only 4 tokens per sequence
- `stride=1`: Shift window by 1 token (maximum overlap)
- `shuffle=False`: Keep original order to see the pattern

In [13]:
# Create dataloader with small parameters for demonstration
dataloader = create_dataloader(
    text, batch_size=1, max_length=4, stride=1, shuffle=False
)

# Get the first batch
data_iter = iter(dataloader)
first_batch = next(data_iter)
print("First batch:")
print(first_batch)

First batch:
[tensor([[  464,  4935, 20336, 46566]]), tensor([[ 4935, 20336, 46566,   286]])]


Notice how the first batch contains:
- A list with 2 tensors: `[input_tensor, target_tensor]`
- Each tensor has shape `[batch_size, max_length]` = `[1, 4]`

Let's get the second batch to see the sliding window effect:

In [14]:
# Get the second batch
second_batch = next(data_iter)
print("Second batch:")
print(second_batch)

Second batch:
[tensor([[ 4935, 20336, 46566,   286]]), tensor([[20336, 46566,   286, 43989]])]


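With `stride=1`, the window shifted by just 1 token, so consecutive examples share all but one token. Away from the edges of the text, each token appears in `max_length` (here 4) different input windows. This yields the largest number of training examples, but also the most redundancy between them.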
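
The amount of reuse can be counted directly. The sketch below uses a hypothetical helper, `input_coverage`, and a made-up text of 10 tokens; it applies the same windowing rule as the dataset class and tallies how many input windows contain each token position.

```python
def input_coverage(n_tokens, max_length, stride):
    """For each token position, count how many input windows include it."""
    counts = [0] * n_tokens
    for i in range(0, n_tokens - max_length, stride):
        for pos in range(i, i + max_length):
            counts[pos] += 1
    return counts

coverage = input_coverage(n_tokens=10, max_length=4, stride=1)
print(coverage)  # [1, 2, 3, 4, 4, 4, 3, 2, 1, 0]
```

Interior positions reach the maximum of 4, while positions near the edges appear less often. The final token never appears in an input because it is only ever used as a target.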

## 6. Batching Multiple Sequences

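Now let's create a dataloader that is closer to how training batches look:
- `batch_size=8`: Process 8 sequences at once
- `stride=4`: No overlap between inputs (the window shifts by the full `max_length`)
- Each token appears in at most one input window, so an epoch covers the text without repeated work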

In [15]:
# Create a dataloader with batch_size=8
dataloader = create_dataloader(text, batch_size=8, max_length=4, stride=4, shuffle=False)

data_iter = iter(dataloader)
inputs, targets = next(data_iter)

print("Inputs shape:", inputs.shape)  # [batch_size, max_length]
print("Targets shape:", targets.shape)  # [batch_size, max_length]
print("\nInputs (token IDs):\n", inputs)
print("\nTargets (token IDs):\n", targets)

Inputs shape: torch.Size([8, 4])
Targets shape: torch.Size([8, 4])

Inputs (token IDs):
 tensor([[  464,  4935, 20336, 46566],
        [  286, 43989,   290, 38201],
        [  198,   220,   220,   220],
        [  220,   198,  1212, 47179],
        [  318,   329,   262,   779],
        [  286,  2687,  6609,   287],
        [  262,  1578,  1829,   290],
        [  198,  1712,   584,  3354]])

Targets (token IDs):
 tensor([[ 4935, 20336, 46566,   286],
        [43989,   290, 38201,   198],
        [  220,   220,   220,   220],
        [  198,  1212, 47179,   318],
        [  329,   262,   779,   286],
        [ 2687,  6609,   287,   262],
        [ 1578,  1829,   290,   198],
        [ 1712,   584,  3354,   286]])


Perfect! We now have:
- **8 sequences** in the batch (rows)
- Each sequence has **4 tokens** (columns)
- Targets are shifted by 1 position
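
The shift can be checked mechanically. The sketch below copies the first four rows of the output above into plain Python lists (no tensors needed) and verifies two things: each target row is its input row shifted by one token, and, because `stride` equals `max_length`, the last target token of a row is the first input token of the next row.

```python
inputs = [
    [464, 4935, 20336, 46566],
    [286, 43989, 290, 38201],
    [198, 220, 220, 220],
    [220, 198, 1212, 47179],
]
targets = [
    [4935, 20336, 46566, 286],
    [43989, 290, 38201, 198],
    [220, 220, 220, 220],
    [198, 1212, 47179, 318],
]

# Within a row: the target is the input shifted left by one position
shifted_ok = all(t[:-1] == x[1:] for x, t in zip(inputs, targets))

# Across rows: the final target token starts the next input window
chained_ok = all(targets[r][-1] == inputs[r + 1][0] for r in range(len(inputs) - 1))

print("shifted by one:", shifted_ok)
print("rows chain together:", chained_ok)
```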

## 7. Decoding to See Actual Text

Let's decode the token IDs back to text to see what the model is actually learning from:

In [16]:
# Initialize tokenizer for decoding
tokenizer = tiktoken.get_encoding("gpt2")

print("=" * 50)
print("DECODED BATCH - Showing Input → Target pairs")
print("=" * 50)

# Decode each sequence in the batch
for i in range(len(inputs)):
    input_text = tokenizer.decode(inputs[i].tolist())
    target_text = tokenizer.decode(targets[i].tolist())
    
    print(f"\nSequence {i + 1}:")
    print(f"  Input:  {repr(input_text)}")
    print(f"  Target: {repr(target_text)}")

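==================================================
DECODED BATCH - Showing Input → Target pairs
==================================================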

Sequence 1:
  Input:  'The Project Gutenberg eBook'
  Target: ' Project Gutenberg eBook of'

Sequence 2:
  Input:  ' of Romeo and Juliet'
  Target: ' Romeo and Juliet\n'

Sequence 3:
  Input:  '\n   '
  Target: '    '

Sequence 4:
  Input:  ' \nThis ebook'
  Target: '\nThis ebook is'

Sequence 5:
  Input:  ' is for the use'
  Target: ' for the use of'

Sequence 6:
  Input:  ' of anyone anywhere in'
  Target: ' anyone anywhere in the'

Sequence 7:
  Input:  ' the United States and'
  Target: ' United States and\n'

Sequence 8:
  Input:  '\nmost other parts'
  Target: 'most other parts of'


## Understanding the Output

Notice how each target is the input shifted by one token:
- The model learns to predict the **next token** at each position
- For input token sequence `[A, B, C, D]`, the model predicts `[B, C, D, E]`
- At position 0: given `A`, predict `B`
- At position 1: given `A, B`, predict `C`
- At position 2: given `A, B, C`, predict `D`
- At position 3: given `A, B, C, D`, predict `E`

This is called **causal language modeling** - the fundamental training objective for LLMs like GPT!
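
The per-position view can be spelled out for the first row of the batch shown earlier. This is an illustrative sketch using plain lists; it only enumerates the prediction tasks contained in one input-target pair and does not run a model.

```python
input_ids = [464, 4935, 20336, 46566]   # 'The Project Gutenberg eBook'
target_ids = [4935, 20336, 46566, 286]  # ' Project Gutenberg eBook of'

steps = []
for pos in range(len(input_ids)):
    # Causal: only tokens up to and including position pos are visible
    context = input_ids[:pos + 1]
    steps.append((context, target_ids[pos]))

for context, next_token in steps:
    print(f"given {context} -> predict {next_token}")
```

One sequence of 4 tokens therefore supplies 4 separate next-token predictions, one per position.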

## Key Takeaways

1. **Sliding windows** create multiple training examples from continuous text
2. **Stride controls overlap** - a smaller stride produces more examples, but they overlap heavily and each epoch takes longer
3. **Batching** groups sequences for efficient parallel processing
4. **Input-target pairs** are shifted by 1 token for next-token prediction
5. **DataLoaders** handle shuffling, batching, and multi-process loading (via `num_workers`) automatically

This data preparation pipeline is essential for training any autoregressive language model!