# LLM Text Preprocessing Foundations

This notebook explores the fundamental concepts from Chapter 2 of *Build a Large Language Model (From Scratch)* by Sebastian Raschka.

### Learning Objectives:
- Understand tokenization strategies (word-level, character-level, subword)
- Implement Byte Pair Encoding (BPE) tokenization
- Create training samples using sliding windows
- Generate token embeddings
- Experiment with hyperparameters and understand their impact

In [None]:
# !pip install torch tiktoken

In [2]:
import re
import torch
import tiktoken
from importlib.metadata import version

print("torch version:", version("torch"))
print("tiktoken version:", version("tiktoken"))

torch version: 2.10.0
tiktoken version: 0.12.0


## 1. Loading and Preparing Text Data

The quality and preprocessing of training data directly affect model performance. For LLMs and agentic systems:

- **Data is the foundation**: Models learn patterns, syntax, semantics, and even reasoning from raw text
- **Preprocessing choices matter**: How we clean and structure text affects what the model learns
- **Scale requirements**: LLMs need massive text corpora (billions of tokens) to learn language effectively
- **Agentic implications**: For AI agents to interact naturally, they must be trained on diverse, high-quality conversational and instructional text


In [3]:
import urllib.request

url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"
file_path = "the-verdict.txt"

urllib.request.urlretrieve(url, file_path)

with open(file_path, "r", encoding="utf-8") as f:
    raw_text = f.read()

print(f"Total characters: {len(raw_text)}")
print(f"First 500 characters:\n{raw_text[:500]}")

Total characters: 20479
First 500 characters:
I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no great surprise to me to hear that, in the height of his glory, he had dropped his painting, married a rich widow, and established himself in a villa on the Riviera. (Though I rather thought it would have been Rome or Florence.)

"The height of his glory"--that was what the women called it. I can hear Mrs. Gideon Thwing--his last Chicago sitter--deploring his unaccountable abdication. "Of course it'

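The cell above downloads the file every time it runs. A minimal sketch of a cached variant follows; the helper name `load_text` is hypothetical and not part of the original notebook, and it assumes a file already present on disk is complete and valid.

```python
import os
import urllib.request

def load_text(url, file_path):
    """Download `url` to `file_path` only if the file is missing, then read it."""
    if not os.path.exists(file_path):
        urllib.request.urlretrieve(url, file_path)
    with open(file_path, "r", encoding="utf-8") as f:
        return f.read()

# Usage with the notebook's own variables:
# raw_text = load_text(url, file_path)
```

Skipping the download keeps repeated runs fast and lets the notebook work offline once the file exists.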

## 2. Tokenization

Tokenization is the critical bridge between human language and machine learning: it converts raw text into the integer IDs a model consumes.

- **Vocabulary size tradeoff**: 
  - Word-level: Large vocabularies (100K+ words); handles known words well, but cannot encode words missing from the vocabulary
  - Character-level: Tiny vocabulary (~100), but sequences become very long

- **Why BPE wins for LLMs**:
  - Efficiently handles rare words by breaking them into common subwords
  - No "unknown token" problem for new words: byte-level BPE, as used by GPT-2, can fall back to individual bytes
  - Balances sequence length with vocabulary size

- **Impact on agentic systems**:
  - Better tokenization = better understanding of domain-specific terms, code, URLs, etc.
  - Agents need to handle diverse inputs (technical terms, names, multilingual text)
  - Token efficiency directly affects inference cost and latency
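
The sequence-length side of the tradeoff can be seen on a single sentence. The sketch below is illustrative only: `tokenization_stats` and the sample sentence are hypothetical, and the word-level split uses a simplified regex rather than the one used later in this notebook.

```python
import re

def tokenization_stats(text):
    """Return (sequence length, vocabulary size) for word- and character-level splits."""
    # Word-level: split on punctuation and whitespace, keep punctuation as tokens.
    words = [t for t in re.split(r'([,.?!]|\s)', text) if t.strip()]
    # Character-level: every character, including spaces, is a token.
    chars = list(text)
    return {
        "word": (len(words), len(set(words))),
        "char": (len(chars), len(set(chars))),
    }

sample = "Do you like tea? I like tea, and you like tea."  # hypothetical sample
stats = tokenization_stats(sample)
for level, (seq_len, vocab) in stats.items():
    print(f"{level}-level: sequence length = {seq_len}, vocabulary size = {vocab}")
```

On a sample this short, the character vocabulary can be larger than the word vocabulary. The relationship reverses as the corpus grows, because the character set is bounded while the number of distinct words keeps increasing.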

In [4]:
# Simple word-level tokenization (baseline)
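# Split on punctuation, double dashes, and whitespace; the capture group keeps the delimiters
preprocessed = re.split(r'([,.?_!"()\']|--|\s)', raw_text)
# Drop empty strings and whitespace-only items
preprocessed = [item.strip() for item in preprocessed if item.strip()]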
print(f"Total tokens (simple): {len(preprocessed)}")
print(f"First 30 tokens: {preprocessed[:30]}")

Total tokens (simple): 4649
First 30 tokens: ['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']


In [5]:
# Build vocabulary
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(f"Vocabulary size: {vocab_size}")

vocab = {token: integer for integer, token in enumerate(all_words)}
print(f"First 20 vocabulary entries: {list(vocab.items())[:20]}")

Vocabulary size: 1159
First 20 vocabulary entries: [('!', 0), ('"', 1), ("'", 2), ('(', 3), (')', 4), (',', 5), ('--', 6), ('.', 7), (':', 8), (';', 9), ('?', 10), ('A', 11), ('Ah', 12), ('Among', 13), ('And', 14), ('Are', 15), ('Arrt', 16), ('As', 17), ('At', 18), ('Be', 19)]


In [6]:
# Simple tokenizer class
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab                              # token string -> ID
        self.int_to_str = {i: s for s, i in vocab.items()}   # ID -> token string

    def encode(self, text):
        # Same split rule used to build the vocabulary
        preprocessed = re.split(r'([,.?_!"()\']|--|\s)', text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        # Raises KeyError for any token that is not in the vocabulary
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the space that join() inserts before punctuation
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

tokenizer = SimpleTokenizerV1(vocab)
text = """"It's the last he painted, you know," Mrs. Gisburn said with pardonable pride."""
ids = tokenizer.encode(text)
print(f"Encoded: {ids}")
print(f"Decoded: {tokenizer.decode(ids)}")

Encoded: [1, 58, 2, 872, 1013, 615, 541, 763, 5, 1155, 608, 5, 1, 69, 7, 39, 873, 1136, 773, 812, 7]
Decoded: " It' s the last he painted, you know," Mrs. Gisburn said with pardonable pride.
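
`SimpleTokenizerV1` looks tokens up with `self.str_to_int[s]`, so any word that is absent from the vocabulary raises a `KeyError`. One common workaround is to reserve an unknown-token ID. The sketch below is a hypothetical variant (the class name `SimpleTokenizerWithUnk` and the toy vocabulary are assumptions for illustration), and it assumes the vocabulary IDs are contiguous from 0.

```python
import re

class SimpleTokenizerWithUnk:
    """Hypothetical variant: maps out-of-vocabulary tokens to an <|unk|> ID."""

    def __init__(self, vocab):
        self.str_to_int = dict(vocab)
        # Reserve the next free ID for the unknown token if it is missing
        # (assumes existing IDs run from 0 to len(vocab) - 1).
        self.str_to_int.setdefault("<|unk|>", len(self.str_to_int))
        self.int_to_str = {i: s for s, i in self.str_to_int.items()}

    def encode(self, text):
        tokens = re.split(r'([,.?_!"()\']|--|\s)', text)
        tokens = [t.strip() for t in tokens if t.strip()]
        unk_id = self.str_to_int["<|unk|>"]
        # dict.get falls back to the unknown ID instead of raising KeyError
        return [self.str_to_int.get(t, unk_id) for t in tokens]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)

# Toy vocabulary for illustration only (not the vocabulary built above).
toy_tokens = sorted({"Hello", ",", "do", "you", "like", "tea", "?"})
toy_vocab = {token: i for i, token in enumerate(toy_tokens)}

unk_tokenizer = SimpleTokenizerWithUnk(toy_vocab)
unk_ids = unk_tokenizer.encode("Hello, do you like coffee?")
print(unk_ids)
print(unk_tokenizer.decode(unk_ids))
```

The unknown token keeps encoding from failing, but the original word is lost on decoding, which is one motivation for the subword approach in the next section.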


### BPE Tokenization

In [7]:
# Use GPT-2's BPE tokenizer
tokenizer = tiktoken.get_encoding("gpt2")

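text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace."
# By default tiktoken raises an error when special tokens appear in the input;
# allowed_special lets <|endoftext|> be encoded as its single special-token ID
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})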

print(f"Encoded text: {integers}")
print(f"Decoded text: {tokenizer.decode(integers)}")
print(f"\nGPT-2 vocabulary size: {tokenizer.n_vocab}")

Encoded text: [15496, 11, 466, 345, 588, 8887, 30, 220, 50256, 554, 262, 4252, 18250, 8812, 2114, 286, 617, 34680, 27271, 13]
Decoded text: Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.

GPT-2 vocabulary size: 50257
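
To see where subword tokens such as those for `someunknownPlace` come from, it can help to run the BPE training idea by hand. The sketch below is a toy, character-level version of BPE merge learning on a hypothetical four-word corpus. It is not GPT-2's byte-level implementation and does not reflect tiktoken's internals; the function names and corpus are assumptions for illustration.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs over a {symbol_tuple: frequency} corpus."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: word -> frequency, each word split into characters.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(word): freq for word, freq in corpus.items()}

merges = []
for _ in range(3):
    counts = get_pair_counts(words)
    best = max(counts, key=counts.get)  # the first-seen pair wins ties
    merges.append(best)
    words = merge_pair(words, best)

print("Learned merges:", merges)
print("Segmented words:", list(words))
```

Each iteration merges the most frequent adjacent pair into a new symbol, so frequent fragments such as a shared suffix become single tokens while rare words stay split into smaller pieces.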


In [8]:
# Tokenize the full text
enc_text = tokenizer.encode(raw_text)
print(f"Total tokens in text: {len(enc_text)}")
print(f"First 50 token IDs: {enc_text[:50]}")

Total tokens in text: 5145
First 50 token IDs: [40, 367, 2885, 1464, 1807, 3619, 402, 271, 10899, 2138, 257, 7026, 15632, 438, 2016, 257, 922, 5891, 1576, 438, 568, 340, 373, 645, 1049, 5975, 284, 502, 284, 3285, 326, 11, 287, 262, 6001, 286, 465, 13476, 11, 339, 550, 5710, 465, 12036, 11, 6405, 257, 5527, 27075, 11]
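
The counts printed so far allow a rough comparison of token efficiency. The snippet below only does arithmetic on figures copied from the outputs above. The comparison is approximate, because the word-level split discards whitespace while GPT-2's BPE folds leading spaces into its tokens.

```python
# Figures copied from the notebook outputs above.
total_chars = 20479        # len(raw_text)
word_level_tokens = 4649   # regex-based word-level split
bpe_tokens = 5145          # GPT-2 BPE via tiktoken

chars_per_bpe_token = total_chars / bpe_tokens
bpe_vs_word = bpe_tokens / word_level_tokens

print(f"Characters per BPE token: {chars_per_bpe_token:.2f}")
print(f"BPE tokens per word-level token: {bpe_vs_word:.2f}")
```

BPE yields only about 10% more tokens than the word-level split on this text, while using a fixed 50,257-entry vocabulary that can encode any input.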
