# Week 6 — Part 03: Chunking long text + synthesizing summaries

**Estimated time:** 60–100 minutes

## What success looks like (end of Part 03)

- You can split long text into overlapping chunks deterministically.
- You can produce a small per-chunk schema and synthesize a final summary.
- You can show the number of chunks created and inspect at least one chunk.

### Checkpoint

After running this notebook, you should see:

- `chunks=...` printed
- a final synthesized summary dict printed

## Learning Objectives

- Implement a simple chunking utility for long text
- Define a per-chunk summary schema
- Synthesize chunk outputs into a final summary
- Understand chunk-boundary pitfalls and overlap strategies

### What this part covers
This notebook covers **chunking** — splitting long text into overlapping pieces so each piece fits within the LLM's context window — and **synthesis** — combining per-chunk summaries into a final result.

**When do you need chunking?** When your text is longer than the context window allows. For example, a 50-page document at ~500 tokens/page = 25,000 tokens, which is too large for models with smaller context windows.

**The chunking trade-off:**
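- Chunks that are too large → may exceed the context window
- Chunks that are too small → lose cross-sentence context and require more API calls
- Overlap between chunks → reduces information loss at boundaries (a sentence cut by a boundary still appears whole in one chunk, provided it is no longer than the overlap), at the cost of processing some text twice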

## Overview

When text is too long for the context window:

1. **Split into chunks**
2. **Process each chunk** into a small structured summary
3. **Synthesize** a final summary from per-chunk outputs

---

## Underlying theory: chunking is a strategy for bounded-context reasoning

If the document is longer than what you can send at once, you have two options:

- **lossy compression**: summarize first (risk losing details)
- **chunking**: split and process pieces (risk missing cross-chunk dependencies)

Chunking introduces a boundary effect: related pieces of information that sit on opposite sides of a chunk boundary end up in different chunks, so no single call sees both.

Practical implication: if correctness depends on cross-paragraph links, you may need overlap or a second pass.
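
The boundary effect can be reproduced in a few lines. The following is a minimal sketch, not part of the notebook's cells: the `chunk` helper, the sample sentence, and the sizes are all made up for illustration. It shows a sentence being cut by a non-overlapping split and kept whole once the overlap is at least as long as the sentence.

```python
from typing import List


def chunk(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    """Split text into fixed-size character chunks; overlap=0 means no overlap."""
    step = max(1, chunk_size - overlap)
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


# Hypothetical document: filler, one key sentence, then more filler.
key_sentence = "The contract ends on 2031-05-01."
text = "x" * 90 + key_sentence + "y" * 90

plain = chunk(text, chunk_size=100)
overlapped = chunk(text, chunk_size=100, overlap=40)

# Without overlap, the boundary at index 100 cuts the sentence in two.
found_plain = any(key_sentence in c for c in plain)
# With overlap >= len(key_sentence), some chunk contains the whole sentence.
found_overlapped = any(key_sentence in c for c in overlapped)

print("whole sentence in a non-overlapping chunk:", found_plain)
print("whole sentence in an overlapping chunk:", found_overlapped)
```

Overlap only helps with spans shorter than the overlap itself. Links between distant paragraphs still need a second pass over the per-chunk outputs.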

### What this cell does
Defines four functions and runs a demo:

- **`chunk_text_simple()`** — splits text into non-overlapping chunks of `chunk_size` characters
- **`chunk_text_overlap()`** — splits with overlap: each chunk starts `(chunk_size - overlap)` characters after the previous one. This means the last `overlap` characters of chunk N are also the first `overlap` characters of chunk N+1.
- **`summarize_chunk_stub()`** — a placeholder that returns a minimal summary dict (replace with a real LLM call)
- **`synthesize_summaries()`** — combines per-chunk summaries into a final result by merging bullet lists

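**What to notice in the demo:** With `chunk_size=1800` and `overlap=150`, each chunk starts 1650 characters after the previous one, so a 4800-character text produces 3 chunks. The overlap reduces, but does not eliminate, information loss at chunk boundaries: a span of up to 150 characters that crosses a boundary appears whole in at least one chunk, while longer spans can still be cut.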
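
The chunk count can be checked by hand before running the demo. This is a small worked sketch of the arithmetic, assuming the same slicing scheme as `chunk_text_overlap()`; the variable names are illustrative.

```python
import math

text_len, chunk_size, overlap = 4800, 1800, 150

step = chunk_size - overlap              # distance between chunk starts
starts = list(range(0, text_len, step))  # where each chunk begins
# Each chunk ends chunk_size characters later, or at the end of the text.
spans = [(s, min(s + chunk_size, text_len)) for s in starts]

print("step=", step)
print("spans=", spans)
# For this slicing scheme, the chunk count equals ceil(text_len / step).
print("chunks=", len(spans), "check=", math.ceil(text_len / step))
```

The final span is shorter than `chunk_size` because the text runs out, which is worth remembering when a per-chunk prompt assumes a minimum amount of content.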

In [None]:
from typing import Any, Dict, List


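def chunk_text_simple(text: str, *, chunk_size: int = 1800) -> List[str]:
    """Split text into consecutive, non-overlapping chunks of chunk_size characters."""
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]


def chunk_text_overlap(text: str, *, chunk_size: int = 1800, overlap: int = 150) -> List[str]:
    """Split text into chunks of chunk_size characters that share `overlap` characters."""
    chunks: List[str] = []
    # Each chunk starts `step` characters after the previous one; max(1, ...)
    # keeps the loop advancing even if overlap >= chunk_size.
    step = max(1, chunk_size - overlap)
    for i in range(0, len(text), step):
        chunks.append(text[i : i + chunk_size])
    return chunks


def summarize_chunk_stub(chunk: str) -> Dict[str, Any]:
    """Placeholder for an LLM call: returns a minimal summary in the per-chunk schema."""
    return {
        "summary_bullets": [chunk[:40] + "..."] if chunk else [],
        "key_entities": ["<todo>"] if chunk else [],
        "open_questions": [],
    }


def synthesize_summaries(summaries: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge per-chunk bullets in chunk order and keep the first five."""
    bullets = [b for s in summaries for b in s.get("summary_bullets", [])]
    return {"summary": bullets[:5], "note": "synthesized from chunk summaries"}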


text = "B" * 4800
chunks = chunk_text_overlap(text, chunk_size=1800, overlap=150)
per_chunk = [summarize_chunk_stub(c) for c in chunks]
final_summary = synthesize_summaries(per_chunk)
print("chunks=", len(chunks))
print(final_summary)
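
Because the demo text is a single repeated character, its chunks all look alike. To inspect a chunk and confirm the overlap, a sketch along these lines may help. It repeats the slicing logic so it runs on its own, and the sample text is invented for illustration.

```python
from typing import List


def chunk_text_overlap(text: str, *, chunk_size: int = 1800, overlap: int = 150) -> List[str]:
    # Same slicing logic as the demo cell, repeated so this sketch is self-contained.
    step = max(1, chunk_size - overlap)
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


# Hypothetical sample text with varied content, so overlaps are visible.
sample = " ".join(f"word{n}" for n in range(1000))
sample_chunks = chunk_text_overlap(sample, chunk_size=1800, overlap=150)

print("chunks=", len(sample_chunks))
print("chunk 0 starts:", repr(sample_chunks[0][:40]))
print("chunk 1 starts:", repr(sample_chunks[1][:40]))

# Every full-size chunk should end with the text that opens the next chunk.
for prev, nxt in zip(sample_chunks, sample_chunks[1:]):
    if len(prev) == 1800:
        shared = min(150, len(nxt))
        assert prev[-150:][:shared] == nxt[:shared]
```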

### What this cell does
Defines `chunk_text_todo()` and `make_chunk_prompt()` — two functions for you to implement.

**`chunk_text_todo()`** should implement chunking with overlap. The key formula: `step = chunk_size - overlap`. Chunks start at indices `0`, `step`, `2*step`, and so on, and the chunk starting at index `i` covers `text[i : i+chunk_size]`.

**`make_chunk_prompt()`** should build a prompt that asks the LLM to summarize one chunk into a structured JSON with keys: `summary_bullets`, `key_entities`, `open_questions`. Keeping the schema small and consistent makes synthesis easier — you can merge lists across chunks without complex logic.

**Your task:** Implement both functions. The solution is in the Appendix if you get stuck.
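
Once a real model replaces the stub, its reply has to be checked against the schema before it is merged. Below is a minimal sketch of such a check, assuming the model returns a plain JSON string. `parse_chunk_summary` and `REQUIRED_KEYS` are hypothetical names, not part of this notebook.

```python
import json
from typing import Any, Dict

REQUIRED_KEYS = ("summary_bullets", "key_entities", "open_questions")


def parse_chunk_summary(raw: str) -> Dict[str, Any]:
    """Parse a model reply and check it matches the per-chunk schema.

    Hypothetical helper: raises ValueError if the reply is not a JSON object
    whose three required keys each hold a list.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    for key in REQUIRED_KEYS:
        if not isinstance(data.get(key), list):
            raise ValueError(f"missing or non-list key: {key}")
    # Drop any extra keys so downstream merging sees a fixed shape.
    return {key: data[key] for key in REQUIRED_KEYS}


# Hand-written stand-in for a model reply (not real model output).
fake_reply = '{"summary_bullets": ["Intro to chunking"], "key_entities": ["LLM"], "open_questions": []}'
print(parse_chunk_summary(fake_reply))
```

Rejecting malformed replies at this point keeps the synthesis step simple, since it can assume every summary has the same three list-valued keys.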

## Synthesis pattern (high-level)

- For each chunk: ask the model for a short structured summary.
- After all chunks: ask the model to combine those summaries.

This is a simple form of hierarchical summarization:

- level 1: per-chunk summaries
- level 2: global synthesis

Practical implication: enforcing a small per-chunk schema helps reduce drift and makes synthesis more reliable.
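
The level 2 step can be sketched as a prompt builder that takes the level 1 outputs as input. This is one possible shape, not the notebook's reference solution: `make_synthesis_prompt` is a hypothetical helper, and the prompt wording is illustrative only.

```python
import json
from typing import Any, Dict, List


def make_synthesis_prompt(summaries: List[Dict[str, Any]]) -> str:
    """Build a level-2 prompt from level-1 per-chunk summaries.

    Hypothetical helper: the wording and output keys are illustrative only.
    """
    numbered = [
        f"Chunk {n} summary:\n{json.dumps(s, ensure_ascii=False)}"
        for n, s in enumerate(summaries, start=1)
    ]
    return (
        "You are combining summaries of consecutive chunks of one document.\n"
        "Chunks overlap, so merge duplicate points instead of repeating them.\n"
        "Return JSON with keys: summary_bullets, key_entities, open_questions.\n\n"
        + "\n\n".join(numbered)
    )


# Hand-written stand-ins for per-chunk summaries (not real model output).
level_1 = [
    {"summary_bullets": ["Defines chunking"], "key_entities": ["LLM"], "open_questions": []},
    {"summary_bullets": ["Explains overlap"], "key_entities": ["LLM"], "open_questions": []},
]
prompt = make_synthesis_prompt(level_1)
print(prompt)
```

Numbering the summaries preserves chunk order, which gives the model a chance to reconstruct the document's flow even though it never sees the original text.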

In [None]:
from typing import List


def chunk_text_todo(text: str, chunk_size: int = 1500, overlap: int = 200) -> List[str]:
    # TODO: implement chunking with overlap.
    if chunk_size <= 0:
        return []
    if overlap < 0:
        overlap = 0
    step = max(1, chunk_size - overlap)
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


def make_chunk_prompt(chunk: str) -> str:
    # TODO: create a prompt template for per-chunk summaries.
    return "".join(
        [
            "You are summarizing one chunk of a longer document.\n",
            "Return JSON with keys: summary_bullets (list of short strings), key_entities (list), open_questions (list).\n",
            "Chunk:\n",
            chunk,
        ]
    )


_demo_chunks = chunk_text_todo("ABC" * 1200, chunk_size=1500, overlap=200)
print("demo_chunks=", len(_demo_chunks))
print(make_chunk_prompt(_demo_chunks[0])[:200] + "...")

## Appendix: Solutions (peek only after trying)

Reference implementations for the TODO functions in this notebook.

In [None]:
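from typing import List


def chunk_text_todo(text: str, chunk_size: int = 1500, overlap: int = 200) -> List[str]:
    # Fail fast on invalid settings instead of silently clamping them.
    if chunk_size <= 0:
        raise ValueError("chunk_size must be > 0")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    # Consecutive chunks start `step` characters apart and share `overlap` characters.
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]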


def make_chunk_prompt(chunk: str) -> str:
    return (
        "You are summarizing one chunk of a longer document.\n"
        "Return STRICT JSON with keys:\n"
        "- summary_bullets: list of short strings (max 3)\n"
        "- key_entities: list of strings\n"
        "- open_questions: list of strings\n\n"
        "Chunk:\n"
        + chunk
    )


print("solution_prompt_preview=", make_chunk_prompt("hello")[:120] + "...")