# LLM Text Preprocessing Foundations

This notebook explores the fundamental concepts from Chapter 2 of *Build a Large Language Model (From Scratch)* by Sebastian Raschka.

### Learning Objectives
- Understand tokenization strategies (word-level, character-level, subword)
- Implement Byte Pair Encoding (BPE) tokenization
- Create training samples using sliding windows
- Generate token embeddings
- Experiment with hyperparameters and understand their impact

In [None]:
# !pip install torch tiktoken

In [2]:
import re
import torch
import tiktoken
from importlib.metadata import version

print("torch version:", version("torch"))
print("tiktoken version:", version("tiktoken"))

torch version: 2.10.0
tiktoken version: 0.12.0


## 1. Loading and Preparing Text Data

The quality and preprocessing of training data directly affect model performance. For LLMs and agentic systems:

- **Data is the foundation**: Models learn patterns, syntax, semantics, and even reasoning from raw text
- **Preprocessing choices matter**: How we clean and split text determines which units the model sees, and therefore what it can learn
- **Scale requirements**: LLMs need massive text corpora (billions of tokens) to learn language effectively
- **Agentic implications**: For AI agents to interact naturally, they must be trained on diverse, high-quality conversational and instructional text
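
To make the point about preprocessing choices concrete, here is a minimal sketch that compares two ways of splitting the same sentence. The sample sentence and the variable names are illustrative assumptions, not part of the dataset used below. The punctuation-aware pattern is one common choice for simple word-level tokenization, not the only option.

```python
import re

# Illustrative sample sentence (an assumption, not taken from the dataset)
sample = "Hello, world. Is this-- a test?"

# Choice 1: split on whitespace only; punctuation stays attached to words
whitespace_tokens = sample.split()

# Choice 2: also split punctuation and double dashes into separate tokens,
# then drop empty strings and pure-whitespace pieces
pieces = re.split(r'([,.:;?_!"()\']|--|\s)', sample)
punct_tokens = [p.strip() for p in pieces if p.strip()]

print(len(whitespace_tokens), whitespace_tokens)
print(len(punct_tokens), punct_tokens)
```

With the first choice, `world.` and `world` would count as different vocabulary entries. The second choice separates the punctuation, so the word itself is shared across contexts.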


In [None]:
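import os
import urllib.request

url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"
file_path = "the-verdict.txt"

# Download the text only if it is not already cached locally
if not os.path.exists(file_path):
    urllib.request.urlretrieve(url, file_path)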

with open(file_path, "r", encoding="utf-8") as f:
    raw_text = f.read()

print(f"Total characters: {len(raw_text)}")
print(f"First 500 characters:\n{raw_text[:500]}")
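
Before tokenizing, it can help to sanity-check the size of the loaded text. The sketch below assumes a hypothetical helper, `text_stats`, which is not part of the book's code. It runs on a short stand-in string so that it works on its own; in this notebook you would pass `raw_text` instead.

```python
def text_stats(text):
    """Return simple size statistics for a text string (hypothetical helper)."""
    words = text.split()  # naive whitespace split, punctuation stays attached
    return {
        "characters": len(text),
        "words": len(words),
        "unique_words": len({w.lower() for w in words}),
    }

# Stand-in sample; in the notebook, call text_stats(raw_text) instead
sample_text = "The cat sat. The cat ran."
stats = text_stats(sample_text)

for name, value in stats.items():
    print(f"{name}: {value}")
```

The word counts here come from a plain whitespace split, so they are only a rough estimate of the vocabulary a real tokenizer would produce.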