# Level 2 - Week 2 - 02 Chunking and Idempotent Ingestion

**Estimated time:** 60-90 minutes

## Learning Objectives

- Implement chunking with overlap
- Create stable chunk IDs
- Avoid duplicate ingestion


## Overview

Chunking defines the unit of retrieval: each chunk is embedded, stored, and returned as one piece of evidence.

Idempotent ingestion means:

- running ingestion twice does **not** create duplicates
- re-ingesting updated documents produces predictable results
- your `chunk_id` scheme is stable (or versioned deliberately)
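
To make the first property concrete, here is a minimal sketch of an upsert keyed by a deterministic ID. The in-memory `dict` store and the `ingest` helper are hypothetical stand-ins for whatever store you use, and hashing `doc_id` together with the text is one possible ID choice, not the course's prescribed scheme. The only assumption is that writing to an existing key overwrites rather than appends.

```python
import hashlib

# Hypothetical in-memory store: chunk_id -> chunk text.
store: dict[str, str] = {}


def ingest(doc_id: str, chunks: list[str], store: dict[str, str]) -> int:
    """Upsert chunks keyed by a deterministic ID; return how many keys were new."""
    new_keys = 0
    for text in chunks:
        # The ID depends only on doc_id and chunk text, so it is identical on every run.
        chunk_id = hashlib.sha256(f"{doc_id}\n{text}".encode("utf-8")).hexdigest()
        if chunk_id not in store:
            new_keys += 1
        store[chunk_id] = text  # overwrite, never append
    return new_keys


chunks = ["alpha beta", "beta gamma", "gamma delta"]
first = ingest("doc-1", chunks, store)
second = ingest("doc-1", chunks, store)
print(first, second, len(store))  # 3 0 3
```

The second run adds nothing because every ID already exists, which is the behavior the list above describes.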

## Underlying theory: tokens, overlap, and context budget

### Context budget constraint

If your model context window is $C$ tokens, a simplified budget is:

$$
C \approx T_{\text{system}} + T_{\text{question}} + k \cdot T_{\text{chunk}} + T_{\text{formatting}}
$$

Implications:

- if chunks are too large, you can't fit enough evidence (small $k$)
- if chunks are too small, each chunk may miss the key sentence needed to answer safely
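
As a rough illustration, the budget can be solved for $k$. The helper name `max_chunks` and all token counts below are made-up values for the sketch, not measurements from any particular model.

```python
def max_chunks(context_window: int, t_system: int, t_question: int,
               t_chunk: int, t_formatting: int = 0) -> int:
    """Largest k such that T_system + T_question + k * T_chunk + T_formatting <= C."""
    available = context_window - t_system - t_question - t_formatting
    return max(0, available // t_chunk)


# Illustrative numbers only: an 8192-token window with 800 tokens of fixed overhead.
small = max_chunks(8192, t_system=500, t_question=100, t_chunk=300, t_formatting=200)
large = max_chunks(8192, t_system=500, t_question=100, t_chunk=1500, t_formatting=200)
print(small, large)  # 24 4
```

With these numbers, 300-token chunks leave room for 24 pieces of evidence, while 1500-token chunks leave room for only 4.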

### Overlap math (how much redundancy you introduce)

For a fixed-size sliding window chunker:

- chunk size $S$
- overlap $O$
- stride $\Delta = S - O$
- document length $L$

Approximate number of chunks (exact when $L \ge S$; a non-empty document shorter than one chunk still yields a single chunk):

$$
N \approx \left\lceil \frac{L - O}{S - O} \right\rceil
$$

Redundancy factor:

$$
\text{redundancy} \approx \frac{S}{S - O}
$$

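So overlap reduces boundary misses, but increases storage and embedding cost by roughly this factor. For example, $S = 1200$ and $O = 200$ give a redundancy of $1200 / 1000 = 1.2$, about 20% more text to embed and store.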
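
The formulas can be checked against a direct simulation of the sliding window. This is a sketch with illustrative sizes; the function names are hypothetical, and the estimate is clamped to at least one chunk for short, non-empty documents.

```python
import math


def estimated_chunks(length: int, size: int, overlap: int) -> int:
    """ceil((L - O) / (S - O)), clamped to at least 1 for non-empty input."""
    if length <= 0:
        return 0
    return max(1, math.ceil((length - overlap) / (size - overlap)))


def redundancy(size: int, overlap: int) -> float:
    """Approximate factor by which overlap inflates the stored text."""
    return size / (size - overlap)


def count_windows(length: int, size: int, overlap: int) -> int:
    """Count chunks by stepping a window of `size` with stride `size - overlap`."""
    count, start = 0, 0
    while start < length:
        end = min(start + size, length)
        count += 1
        if end == length:
            break
        start = end - overlap
    return count


for overlap in (0, 200, 600):
    n = estimated_chunks(10_000, 1200, overlap)
    assert n == count_windows(10_000, 1200, overlap)
    print(overlap, n, f"{redundancy(1200, overlap):.2f}")
# 0 9 1.00
# 200 10 1.20
# 600 16 2.00
```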

## Practice Steps

- Implement a fixed-size chunker with overlap.
- Choose a stable `chunk_id` strategy (hash-based or doc_id+index).
- Verify IDs are stable across repeated ingestion.

### Sample code

Fixed-size chunker with overlap.


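```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    start: int  # inclusive character offset into the source text
    end: int    # exclusive character offset


def chunk_text(text: str, chunk_size: int = 1200, overlap: int = 200) -> list[Chunk]:
    """Split text into fixed-size windows that overlap by `overlap` characters.

    Sizes are measured in characters, not tokens.
    """
    if overlap < 0 or overlap >= chunk_size:
        raise ValueError('overlap must be >= 0 and < chunk_size')
    chunks = []
    i = 0
    n = len(text)
    while i < n:
        j = min(i + chunk_size, n)
        chunks.append(Chunk(text=text[i:j], start=i, end=j))
        if j == n:
            break  # the last window reached the end of the text
        i = j - overlap  # next window starts `overlap` characters before this one ended
    return chunks
```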


### Exercise: Stable chunk IDs (idempotency)

Implement both strategies below.

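- Hash-based IDs:
  - pro: identical text maps to the same ID, so re-ingesting it overwrites instead of duplicating
  - con: if chunking parameters change, the chunk text and therefore the IDs change, so you effectively create a new set of chunks

- Index-based IDs:
  - pro: human-readable and debuggable
  - con: insertions/deletions early in a document can shift later chunk indices, so an existing ID may end up pointing at different text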

Pick one as your course default and record it in your run artifacts (together with `chunk_size` and `overlap`).

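```python
import hashlib


def chunk_id_hash(text: str) -> str:
    # Content-addressed ID: the same text always maps to the same ID.
    # Note that identical text in two different documents also shares one ID.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunk_id_index(doc_id: str, index: int) -> str:
    # Position-based ID, zero-padded so IDs sort in chunk order, e.g. "doc-1#00003".
    return f"{doc_id}#{index:05d}"
```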
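
The two tradeoffs can be observed directly. The sketch below is self-contained, so it repeats the two ID functions and uses a minimal text-only chunker named `windows` (a hypothetical helper); the sample text and sizes are arbitrary.

```python
import hashlib


def windows(text: str, size: int, overlap: int) -> list[str]:
    """Minimal fixed-size chunker that returns chunk text only."""
    out, i = [], 0
    while i < len(text):
        j = min(i + size, len(text))
        out.append(text[i:j])
        if j == len(text):
            break
        i = j - overlap
    return out


def chunk_id_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunk_id_index(doc_id: str, index: int) -> str:
    return f"{doc_id}#{index:05d}"


text = "".join(f"sentence {n:03d}. " for n in range(50))

# 1) Hash IDs: changing chunk_size yields a different ID set for the same document.
ids_100 = {chunk_id_hash(c) for c in windows(text, 100, 20)}
ids_120 = {chunk_id_hash(c) for c in windows(text, 120, 20)}
print("hash IDs shared after parameter change:", len(ids_100 & ids_120))

# 2) Index IDs: an early insertion keeps the ID strings but changes the text behind them.
before = {chunk_id_index("doc-1", i): c for i, c in enumerate(windows(text, 100, 20))}
edited = "NEW INTRO. " + text
after = {chunk_id_index("doc-1", i): c for i, c in enumerate(windows(edited, 100, 20))}
changed = [cid for cid in before if after.get(cid) != before[cid]]
print("index IDs whose text changed:", len(changed), "of", len(before))
```

With hash IDs the old chunks remain under their old IDs until you delete them; with index IDs the old text is silently replaced under the same ID.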

## Self-check

- Are chunk IDs stable across re-ingestion?
- Are chunking parameters (`chunk_size`, `overlap`) recorded somewhere (config/run artifact)?
- If a citation validator fails, can you locate the exact chunk text by `chunk_id`?
- If you change chunking parameters, do you understand what happens to IDs (hash vs index tradeoff)?
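
One way to make these checks mechanical is to write a small run artifact. The sketch below is an assumption about what such an artifact could look like: `build_manifest` and its field names are hypothetical, not a required schema. It records the chunking parameters and the character span for each `chunk_id`, then uses the span to recover the exact chunk text.

```python
import hashlib
import json


def build_manifest(doc_id: str, text: str, chunk_size: int, overlap: int) -> dict:
    """Record chunking parameters plus the character span behind each chunk_id."""
    chunks = {}
    i = 0
    while i < len(text):
        j = min(i + chunk_size, len(text))
        chunk_id = hashlib.sha256(text[i:j].encode("utf-8")).hexdigest()
        chunks[chunk_id] = {"doc_id": doc_id, "start": i, "end": j}
        if j == len(text):
            break
        i = j - overlap
    return {"id_scheme": "sha256(text)", "chunk_size": chunk_size,
            "overlap": overlap, "chunks": chunks}


text = "".join(f"Sentence {n}. " for n in range(40))
run_a = build_manifest("doc-1", text, chunk_size=120, overlap=20)
run_b = build_manifest("doc-1", text, chunk_size=120, overlap=20)

# Stable across re-ingestion: same input and parameters give the same artifact.
assert json.dumps(run_a, sort_keys=True) == json.dumps(run_b, sort_keys=True)

# Locate the exact chunk text for a chunk_id through its recorded span.
some_id = next(iter(run_a["chunks"]))
span = run_a["chunks"][some_id]
located = text[span["start"]:span["end"]]
assert hashlib.sha256(located.encode("utf-8")).hexdigest() == some_id
```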