# LangChain Notebook (Part 2): Chunking + Tokenization + Context Window (Production mindset)

This notebook builds on the previous one:

✅ **Part 1 notebook**: *Readers + Document object + Cleaning*  
You created `docs` and `cleaned_docs` (`List[Document]`) and preserved metadata.

In this notebook, you will learn:
1) **Chunking strategies** (why, when, how)  
2) **Tokenization** (how to count tokens)  
3) **Context window budgeting** (how many chunks can you fit into a prompt)

---

## How to use this notebook

### Option A (recommended)
Run Part 1 notebook first, then run this notebook.  
You can re-run the “Load + Clean” cell below to regenerate `cleaned_docs` exactly like Part 1.

### Option B
If you already have `cleaned_docs` in memory (same kernel/session), skip the load/clean cell.

## 0) Install & Imports

We will use:
- `langchain_text_splitters` (splitters)
- `tiktoken` (token counting; widely used for OpenAI-style BPE tokenization)

If something is missing, run the install cell.

In [1]:
# If needed (run once)
# %pip install -U langchain langchain-core langchain-community langchain-text-splitters tiktoken pypdf unstructured python-dotenv

import os
import re
import unicodedata
from pathlib import Path
from typing import List, Dict, Any, Tuple

from langchain_core.documents import Document
from langchain_community.document_loaders import TextLoader, DirectoryLoader

# Splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter, TokenTextSplitter

# Token counter
try:
    import tiktoken
except ImportError:
    tiktoken = None

print("Ready ✅")

Ready ✅


## 1) (Reference from Part 1) Re-create `cleaned_docs`

This cell reuses the same cleaning approach from Part 1 so this notebook is **standalone**.

If you already have `cleaned_docs`, you can skip this cell.

In [None]:
# ---------- DEMO DATA (same idea as Part 1) ----------
DATA_DIR = Path("demo_docs")
DATA_DIR.mkdir(exist_ok=True)

# Create sample docs if folder is empty
if not any(DATA_DIR.glob("*.txt")):
    (DATA_DIR / "doc1.txt").write_text(
    """ACME SUPPORT PORTAL — INTERNAL USE ONLY
------------------------------------------

Hello   team,

This   is    a    test document.  

It contains    extra spaces,    odd line breaks,
and some unicode like café, naïve, and “smart quotes”.

ACME SUPPORT PORTAL — INTERNAL USE ONLY
Page 1
""", encoding="utf-8")

    (DATA_DIR / "doc2.txt").write_text(
    """ACME SUPPORT PORTAL — INTERNAL USE ONLY
------------------------------------------

FAQ:
1) Reset password — go to Settings → Security.
2) Contact support at support@example.com

Disclaimer: This email and any attachments are confidential.
Disclaimer: This email and any attachments are confidential.

ACME SUPPORT PORTAL — INTERNAL USE ONLY
Page 2
""", encoding="utf-8")

    (DATA_DIR / "doc3.txt").write_text(
    """Report Title: Quarterly Summary

\t\tThis line starts with tabs.
\n\n\nMultiple blank lines above.

• Bullet 1
• Bullet 2

Footer: Company Confidential
Footer: Company Confidential
Footer: Company Confidential
""", encoding="utf-8")

print("Files:", [p.name for p in DATA_DIR.glob("*.txt")])

# ---------- LOAD (Readers) ----------
dir_loader = DirectoryLoader(
    str(DATA_DIR),
    glob="**/*.txt",
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"}
)
docs: List[Document] = dir_loader.load()
print("Loaded docs:", len(docs))

# ---------- CLEANING (from Part 1) ----------
BOILERPLATE_PATTERNS = [
    r"^Disclaimer:.*confidential\.?$",
    r"^ACME SUPPORT PORTAL — INTERNAL USE ONLY$",
    r"^-{5,}$",
]

compiled_bp = [re.compile(p, flags=re.IGNORECASE) for p in BOILERPLATE_PATTERNS]

def normalize_unicode(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\u00A0", " ")
    return text

def normalize_whitespace(text: str) -> str:
    text = text.replace("\t", " ")
    text = "\n".join(line.rstrip() for line in text.splitlines())
    text = re.sub(r"[ ]{2,}", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

def remove_boilerplate_lines(text: str, patterns=compiled_bp) -> str:
    kept_lines = []
    for line in text.splitlines():
        if any(p.match(line.strip()) for p in patterns):
            continue
        kept_lines.append(line)
    return "\n".join(kept_lines)

def dedupe_consecutive_lines(text: str) -> str:
    out = []
    prev = None
    for ln in text.splitlines():
        if prev is not None and ln.strip() and ln.strip() == prev.strip():
            continue
        out.append(ln)
        prev = ln
    return "\n".join(out).strip()

from collections import Counter
def remove_repeated_lines(text: str, min_line_len: int = 10, freq_threshold: float = 0.25) -> str:
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return text
    counts = Counter(ln for ln in lines if len(ln) >= min_line_len)
    total = len(lines)
    repeated = {ln for ln, c in counts.items() if c / total >= freq_threshold}
    new_lines = []
    for ln in text.splitlines():
        s = ln.strip()
        if s and s in repeated:
            continue
        new_lines.append(ln)
    return "\n".join(new_lines).strip()

CLEANING_VERSION = "v1.0"

def clean_text(text: str) -> str:
    text = normalize_unicode(text)
    text = remove_boilerplate_lines(text)
    text = dedupe_consecutive_lines(text)
    text = normalize_whitespace(text)
    text = remove_repeated_lines(text, min_line_len=10, freq_threshold=0.25)
    text = normalize_whitespace(text)
    return text

def clean_documents(documents: List[Document]) -> List[Document]:
    cleaned_docs = []
    for d in documents:
        original = d.page_content
        cleaned = clean_text(original)
        new_meta = dict(d.metadata)
        new_meta.update({
            "cleaning_version": CLEANING_VERSION,
            "original_char_count": len(original),
            "clean_char_count": len(cleaned),
        })
        cleaned_docs.append(Document(page_content=cleaned, metadata=new_meta))
    return cleaned_docs

cleaned_docs: List[Document] = clean_documents(docs)
print("Cleaned docs:", len(cleaned_docs))

# Preview one
print("\nSOURCE:", cleaned_docs[0].metadata.get("source"))
print(cleaned_docs[0].page_content[:250])

# Part A — Chunking (in depth)

## Why chunking exists
LLMs can’t ingest unlimited text. For RAG, you typically:
1) Load → clean → split into chunks
2) Embed chunks → store in vector DB
3) Retrieve top-k chunks → put in prompt

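Chunking impacts:
- **Recall**: smaller, focused chunks are easier to match to a specific query
- **Precision**: chunks that are too small lose surrounding context, so a retrieved chunk may not contain enough to answer
- **Cost & latency**: bigger chunks increase prompt tokens
- **Faithfulness**: coherent, complete chunks give the model better evidence, which tends to reduce hallucinations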

---

## What makes a “good chunk”?
A good chunk should be:
- **Semantically coherent** (not cutting mid-sentence if possible)
- **Not too long** (token budget)
- **Not too short** (enough context to answer)
- Overlap only as needed (avoid excessive redundancy)

We’ll explore with `RecursiveCharacterTextSplitter` and token-aware splitting.

## 2) Recursive Character Chunking (most commonly used)

This splitter tries multiple separators in order, like:
- `\n\n` (paragraph)
- `\n` (line)
- `" "` (space)
- fallback to hard cuts

Good baseline for mixed text.
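
To make the fallback order concrete, here is a simplified pure-Python sketch of the idea. It is not LangChain's implementation: it ignores overlap and `start_index`, and `toy_recursive_split` is a hypothetical name used only for illustration.

```python
from typing import List

def toy_recursive_split(text: str, chunk_size: int, separators: List[str]) -> List[str]:
    """Split text into pieces of at most chunk_size characters, trying coarse separators first."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # No separators left: fall back to hard cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        if len(piece) > chunk_size:
            # Piece is too large for this separator: flush, then retry with finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(toy_recursive_split(piece, chunk_size, rest))
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # merge small pieces back together
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

text = "Para one is short.\n\nPara two is a bit longer than the limit allows.\n\nEnd."
for chunk in toy_recursive_split(text, 30, ["\n\n", "\n", " ", ""]):
    print(repr(chunk))
```

The second paragraph exceeds 30 characters, so only that paragraph is split further on spaces; the short paragraphs stay intact.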

In [None]:
def chunk_with_recursive(
    documents: List[Document],
    chunk_size: int = 500,
    chunk_overlap: int = 80,
) -> List[Document]:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", " ", ""],
        add_start_index=True,
    )
    chunks = splitter.split_documents(documents)
    return chunks

chunks_500 = chunk_with_recursive(cleaned_docs, chunk_size=500, chunk_overlap=80)
print("Chunks created:", len(chunks_500))
print("Example chunk metadata:", chunks_500[0].metadata)
print("Example chunk text:\n", chunks_500[0].page_content[:250])

## 3) Compare chunk sizes and overlap (experiment)

You’ll see how:
- chunk size affects number of chunks
- overlap increases total text duplication (cost)

Rule of thumb starting points:
- 300–800 characters (roughly 75–200 tokens for English text, at about 4 characters per token) for QA docs
- Overlap 10–20% of chunk size
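
The cost of overlap can be estimated before running any splitter. The sketch below assumes idealized fixed-size windows that advance by `chunk_size - chunk_overlap` characters; real splitters break on separators, so actual counts will differ. `estimate_chunking` is a hypothetical helper.

```python
import math

def estimate_chunking(text_len: int, chunk_size: int, chunk_overlap: int) -> dict:
    """Estimate chunk count and stored-text duplication for idealized sliding windows."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if text_len <= chunk_size:
        n_chunks = 1 if text_len > 0 else 0
        stored = text_len
    else:
        step = chunk_size - chunk_overlap
        n_chunks = math.ceil((text_len - chunk_overlap) / step)
        # Every chunk after the first repeats chunk_overlap characters.
        stored = text_len + (n_chunks - 1) * chunk_overlap
    duplication = stored / text_len if text_len else 0.0
    return {"chunks": n_chunks, "stored_chars": stored, "duplication": duplication}

# 10,000 characters, 500-character chunks, 80-character overlap:
print(estimate_chunking(10_000, 500, 80))  # 24 chunks, 11840 stored characters
```

In this example a 16% overlap stores about 18% more text than the original, and that extra text is embedded and indexed too.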

In [None]:
def chunk_stats(chunks: List[Document]) -> Dict[str, Any]:
    lens = [len(c.page_content) for c in chunks]
    return {
        "chunks": len(chunks),
        "min_chars": min(lens) if lens else 0,
        "max_chars": max(lens) if lens else 0,
        "avg_chars": sum(lens)/len(lens) if lens else 0,
    }

for size, overlap in [(200, 40), (500, 80), (1000, 120)]:
    ch = chunk_with_recursive(cleaned_docs, chunk_size=size, chunk_overlap=overlap)
    print(f"chunk_size={size}, overlap={overlap} ->", chunk_stats(ch))

## 4) Metadata hygiene for chunks

When chunking, you want chunk-level metadata like:
- `source` (file)
- `start_index` (where chunk begins in original text)
- `chunk_id` (unique id)
- optional: `doc_id`, `page` for PDFs

We'll add `chunk_id` in a deterministic way.
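
A quick way to see what "deterministic" buys you: the same inputs always give the same ID, while edited text gives a new one. The sketch below uses a hypothetical `make_chunk_id` helper that builds the ID from the file name, start offset, and a short content hash only; the sample strings are made up.

```python
import hashlib
from pathlib import Path

def make_chunk_id(source: str, start_index: int, text: str) -> str:
    """Build an ID from file name, start offset, and a short content hash."""
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()[:12]
    return f"{Path(source).name}:{start_index}:{digest}"

id_a = make_chunk_id("demo_docs/doc2.txt", 0, "FAQ: reset password")
id_b = make_chunk_id("demo_docs/doc2.txt", 0, "FAQ: reset password")
id_c = make_chunk_id("demo_docs/doc2.txt", 0, "FAQ: reset your password")

print(id_a == id_b)  # same inputs give the same ID
print(id_a == id_c)  # edited text gives a different ID
```

Stable IDs let you re-run ingestion and upsert into a vector store without creating duplicates, and a changed hash signals that a chunk needs re-embedding.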

In [None]:
import hashlib

def attach_chunk_ids(chunks: List[Document]) -> List[Document]:
    out = []
    for idx, c in enumerate(chunks):
        # Deterministic ID: file name + start_index + short content hash + position in the chunk list
        source = str(c.metadata.get("source", "unknown"))
        start = str(c.metadata.get("start_index", 0))
        h = hashlib.sha1(c.page_content.encode("utf-8")).hexdigest()[:12]
        chunk_id = f"{Path(source).name}:{start}:{h}:{idx}"
        meta = dict(c.metadata)
        meta["chunk_id"] = chunk_id
        out.append(Document(page_content=c.page_content, metadata=meta))
    return out

chunks_500 = attach_chunk_ids(chunks_500)
print("Chunk id example:", chunks_500[0].metadata["chunk_id"])

# Part B — Tokenization & Context Window (in depth)

## Why tokenization matters
LLMs operate in **tokens**, not characters.

Your total token budget includes:
- System + user prompt tokens
- Retrieved context tokens (all chunks you stuff into prompt)
- Tool/function calling overhead (if any)
- Output tokens (`max_tokens` / `max_new_tokens`)

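If you exceed the context window, the outcome depends on the API or framework:
- inputs are truncated (losing context)
- the request fails with an error (some APIs)
- answer quality degrades because part of the context was dropped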
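
The budget items above reduce to simple subtraction. The sketch below assumes the context window is shared by input and output tokens, as the list implies; `TokenBudget` and its field names are hypothetical, and the numbers are illustrative.

```python
from dataclasses import dataclass

@dataclass
class TokenBudget:
    context_window: int     # total tokens the model accepts (input + output)
    prompt_tokens: int      # system + user prompt
    tool_overhead: int      # tool/function schemas, if any
    max_output_tokens: int  # reserved for the answer

    def remaining_for_context(self) -> int:
        """Tokens left for retrieved chunks; a negative value means the request cannot fit."""
        return (self.context_window - self.prompt_tokens
                - self.tool_overhead - self.max_output_tokens)

budget = TokenBudget(context_window=8192, prompt_tokens=500, tool_overhead=0, max_output_tokens=800)
print(budget.remaining_for_context())  # 8192 - 500 - 0 - 800 = 6892
```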

---

We’ll implement:
- token counting for strings
- estimating chunk token sizes
- computing “how many chunks can fit” for a given context window

## 5) Token counting with `tiktoken`

`tiktoken` implements the BPE tokenizers used by OpenAI models.
Other model families (Llama, Mistral) use different tokenizers, so their counts will differ. `tiktoken` counts are still a useful approximation for budgeting, as long as you leave a safety margin.

If `tiktoken` is not installed, install it:
```python
%pip install -U tiktoken
```
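
For a quick sanity check, or when `tiktoken` cannot be installed, a character-based heuristic can stand in. The figure of about 4 characters per token is a common rule of thumb for English text with GPT-style tokenizers, not an exact value, and it can be far off for code or non-English text. `estimate_tokens_heuristic` and its `chars_per_token` parameter are hypothetical names.

```python
import math

def estimate_tokens_heuristic(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count; rounds up to stay conservative."""
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    return math.ceil(len(text) / chars_per_token)

print(estimate_tokens_heuristic("How do I reset my password and contact support?"))
```

Use it only for rough planning; switch to a real tokenizer before enforcing hard limits.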

In [None]:
def get_token_encoder(model_hint: str = "gpt-4o-mini"):
    if tiktoken is None:
        raise ImportError("tiktoken is not installed. Run: %pip install -U tiktoken")
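    try:
        # Picks the encoding registered for this model name
        return tiktoken.encoding_for_model(model_hint)
    except KeyError:
        # Unknown model name (or older tiktoken): fall back to a general-purpose encoding
        return tiktoken.get_encoding("cl100k_base")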

def count_tokens(text: str, model_hint: str = "gpt-4o-mini") -> int:
    enc = get_token_encoder(model_hint)
    return len(enc.encode(text))

sample_text = cleaned_docs[0].page_content
print("Sample chars:", len(sample_text))
print("Approx tokens:", count_tokens(sample_text))

## 6) Token-aware chunking (`TokenTextSplitter`)

Character chunking ≠ token chunking.

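`TokenTextSplitter` cuts on token boundaries, so each chunk holds at most about **N tokens** under the chosen encoding, which is safer for context windows. The trade-off is that it ignores paragraph and sentence boundaries.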
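
The core mechanism is a sliding window over tokens. The sketch below works on any token list and uses whitespace-separated words as a stand-in tokenizer, which is an assumption for illustration only; a real splitter encodes with a BPE tokenizer and decodes each window back to text. `window_tokens` is a hypothetical helper.

```python
from typing import List, Sequence

def window_tokens(tokens: Sequence[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Return overlapping windows of at most chunk_size tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows: List[List[str]] = []
    start = 0
    while start < len(tokens):
        windows.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end
        start += step
    return windows

words = "reset your password from the settings page under security".split()
for window in window_tokens(words, chunk_size=4, chunk_overlap=1):
    print(" ".join(window))
```

Each window repeats the last token of the previous one, which is the overlap at work.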

In [None]:
def chunk_with_tokens(
    documents: List[Document],
    chunk_size_tokens: int = 200,
    chunk_overlap_tokens: int = 40,
    model_hint: str = "gpt-4o-mini"
) -> List[Document]:
    # Use the same encoding as count_tokens() so chunk sizes and budgets agree
    splitter = TokenTextSplitter(
        chunk_size=chunk_size_tokens,
        chunk_overlap=chunk_overlap_tokens,
        encoding_name=get_token_encoder(model_hint).name,
    )
    return splitter.split_documents(documents)

token_chunks = chunk_with_tokens(cleaned_docs, chunk_size_tokens=120, chunk_overlap_tokens=20)
print("Token-chunks:", len(token_chunks))

# Show actual token sizes
sizes = [count_tokens(c.page_content) for c in token_chunks[:10]]
sizes, {"min": min(sizes), "max": max(sizes), "avg": sum(sizes)/len(sizes)}

## 7) Context window budgeting

You usually allocate budget like this:

- **Model context window**: e.g., 8k / 16k / 32k / 128k tokens
- Reserve tokens for:
  - Instructions + question (e.g., 300–800 tokens)
  - The answer (e.g., 400–1200 tokens)
- Remaining is retrieval context budget

We’ll write a helper:
- estimate how many chunks (top-k) can fit in remaining budget

In [None]:
def estimate_k_fit(
    chunks: List[Document],
    question: str,
    system_prompt: str,
    context_window: int = 8192,
    reserved_for_answer: int = 800,
    model_hint: str = "gpt-4o-mini",
) -> Dict[str, Any]:
    prompt_tokens = count_tokens(system_prompt + "\n" + question, model_hint=model_hint)
    remaining = context_window - reserved_for_answer - prompt_tokens
    if remaining < 0:
        return {
            "prompt_tokens": prompt_tokens,
            "remaining_for_context": remaining,
            "k_fit": 0,
            "note": "Prompt + reserved answer already exceed context window."
        }

    token_sizes = [count_tokens(c.page_content, model_hint=model_hint) for c in chunks]
    total = 0
    k = 0
    for sz in token_sizes:
        if total + sz > remaining:
            break
        total += sz
        k += 1

    return {
        "context_window": context_window,
        "prompt_tokens": prompt_tokens,
        "reserved_for_answer": reserved_for_answer,
        "remaining_for_context": remaining,
        "k_fit": k,
        "context_tokens_used": total,
    }

system_prompt = "You are a helpful assistant. Answer only from the provided context."
question = "How do I reset my password and contact support?"

# Use token-based chunks for realistic budgeting
token_chunks = chunk_with_tokens(cleaned_docs, chunk_size_tokens=180, chunk_overlap_tokens=30)

estimate_k_fit(
    chunks=token_chunks,
    question=question,
    system_prompt=system_prompt,
    context_window=8192,
    reserved_for_answer=800
)

## 8) Practical recipe (recommended defaults)

For many enterprise RAG Q&A cases:
- **Token chunk size**: 200–400 tokens
- **Overlap**: 10–20% (20–80 tokens)
- Retrieve: `top_k = 10–30` candidates (recall)
- Rerank: keep the best 3–8 (precision)
- Prompt stuffing: fit 3–8 chunks, depending on the context window
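
The last three steps can be sketched as one small function: sort candidates by relevance score, keep the best few, then add them to the prompt while the token budget holds. The scores and token counts below are made-up illustration values, and `pack_chunks` is a hypothetical helper, not a LangChain API.

```python
from typing import List, Tuple

def pack_chunks(
    candidates: List[Tuple[str, float, int]],  # (chunk_id, score, token_count)
    keep_top: int,
    context_budget: int,
) -> List[str]:
    """Keep the keep_top highest-scoring chunks, then pack them in rank order within the budget."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)[:keep_top]
    packed: List[str] = []
    used = 0
    for chunk_id, _score, tokens in ranked:
        if used + tokens > context_budget:
            continue  # skip a chunk that does not fit; a smaller one may still fit
        packed.append(chunk_id)
        used += tokens
    return packed

candidates = [("a", 0.91, 300), ("b", 0.85, 400), ("c", 0.80, 350), ("d", 0.40, 100)]
print(pack_chunks(candidates, keep_top=3, context_budget=700))  # ['a', 'b']
```

Skipping an oversized chunk with `continue` favors filling the budget; stopping at the first chunk that does not fit is the stricter alternative when rank order matters more.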

Next notebook:  
✅ Retrieval + MMR + Rerank + Prompt packing + LLM call