# Tutorial 2 — Semantic Chunking (Same Pipeline, Better Chunks)

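Only one variable changes from Tutorial 1: the chunking strategy (`fixed` → `semantic`). The documents, embeddings, vector store, and query set stay the same, so any metric difference can be attributed to chunking.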

```mermaid
flowchart LR
    A[Same Documents] --> B[Semantic Chunking]
    B --> C[OpenAI Embeddings]
    C --> D[Chroma]
    E[Same Query Set] --> F[Dense Retrieval]
    F --> G[Compare vs Tutorial 1]
```

## Learning checkpoint: what improved and what still fails

**What works better in Tutorial 2**
- Semantically grouped chunks keep a policy rule and its conditions together in the same chunk.
- Recall on context-dependent questions should improve versus Tutorial 1.
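
To see why grouping matters, here is a minimal sketch that contrasts character-based splitting with sentence-aware grouping. It is not the `rag_tutorials.chunking` implementation: the helper names `split_fixed` and `split_by_sentences`, the regex-based sentence splitter, and the toy policy text are all assumptions made up for illustration.

```python
import re


def split_fixed(text: str, chunk_size: int) -> list[str]:
    """Cut the text every chunk_size characters, ignoring sentence boundaries."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def split_by_sentences(text: str, max_chars: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow the chunk, so start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Toy text invented for this sketch; it is not taken from the handbook data.
policy = (
    "Employees may work from another country for up to 30 days per year. "
    "This applies only if the manager approves the request in advance. "
    "Equipment must be returned within 10 days of leaving the company."
)

fixed = split_fixed(policy, chunk_size=80)
grouped = split_by_sentences(policy, max_chars=140)

for label, chunks in (("fixed", fixed), ("grouped", grouped)):
    print(label)
    for chunk in chunks:
        print("  ", repr(chunk))
```

In this toy case the fixed splitter separates the 30-day rule from its approval condition, while the sentence-aware splitter keeps them in one chunk. A question about the condition can then be answered from a single retrieved chunk.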

**Challenges you should observe**
- Dense retrieval can still rank a relevant chunk below weaker ones.
- Similar chunks with overlapping terms may still be misordered.
- Exact-token questions (e.g., specific form IDs) are not consistently top-ranked.
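
The ranking problems above can be made measurable with the position of the first relevant chunk. The sketch below is a hedged illustration: `first_relevant_rank` and `reciprocal_rank` are hypothetical helpers, the chunk IDs are placeholders, and nothing here is claimed to match what `rag_tutorials.evaluation` computes.

```python
from typing import Optional


def first_relevant_rank(ranked_ids: list[str], relevant_ids: set[str]) -> Optional[int]:
    """Return the 1-based rank of the first relevant chunk, or None if absent."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in relevant_ids:
            return rank
    return None


def reciprocal_rank(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """Return 1/rank of the first relevant chunk, or 0.0 when none was retrieved."""
    rank = first_relevant_rank(ranked_ids, relevant_ids)
    return 0.0 if rank is None else 1.0 / rank


# Placeholder IDs: the relevant chunk was retrieved, but only in third position.
ranked = ["c7", "c2", "c9", "c4", "c1"]
relevant = {"c9"}

print(first_relevant_rank(ranked, relevant))
print(reciprocal_rank(ranked, relevant))
```

A retriever can find the right chunk in the top 5 and still score poorly on reciprocal rank. That gap is what the challenges in this list describe.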

**Why move to Tutorial 3**
- Chunking is better now, but ranking quality is still a bottleneck.
- We next add a reranking stage to reorder candidates by query-specific relevance.

In [None]:
# 1-3) Setup, config, and load data

import importlib.util  # the submodule must be imported explicitly for find_spec
import os
from pathlib import Path
import shutil
import subprocess
import sys

import pandas as pd
from dotenv import load_dotenv

# Ensure uv is available (installs with: pip install uv)
if shutil.which("uv") is None:
    print("uv not found. Installing with pip...")
    subprocess.run([sys.executable, "-m", "pip", "install", "uv"], check=True)

# Ensure notebook runs from repo root and local src/ is importable
cwd = Path.cwd().resolve()
repo_root = next(
    (path for path in [cwd, *cwd.parents] if (path / "pyproject.toml").exists() and (path / "src").exists()),
    cwd,
)
os.chdir(repo_root)
src_path = repo_root / "src"
if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))

REQUIRED_PACKAGES = [
    "openai",
    "chromadb",
    "numpy",
    "pandas",
    "rank_bm25",
    "sentence_transformers",
    "dotenv",
]

missing = [pkg for pkg in REQUIRED_PACKAGES if importlib.util.find_spec(pkg) is None]
if missing:
    print("Missing packages:", missing)
    print("Running: uv sync")
    subprocess.run(["uv", "sync"], check=True)
else:
    print("All required packages are available.")

from rag_tutorials.io_utils import load_handbook_documents, load_queries
from rag_tutorials.chunking import fixed_chunk_documents, semantic_chunk_documents
from rag_tutorials.pipeline import build_dense_retriever
from rag_tutorials.qa import answer_with_context
from rag_tutorials.evaluation import evaluate_single, summarize

load_dotenv()
if not os.getenv("OPENAI_API_KEY"):
    raise EnvironmentError("OPENAI_API_KEY is required; set it in your environment or a .env file")

embedding_model = os.getenv("OPENAI_EMBEDDING_MODEL", "text-embedding-3-small")
chat_model = os.getenv("OPENAI_CHAT_MODEL", "gpt-4.1-mini")

handbook_path = Path("data/handbook_manual.txt")
queries_path = Path("data/queries.jsonl")
if not handbook_path.exists() or not queries_path.exists():
    raise FileNotFoundError("Data files not found. Run: uv run python scripts/generate_data.py")

documents = load_handbook_documents(handbook_path)
queries = load_queries(queries_path)

In [None]:
# 4) Chunk and normalize text: fixed vs semantic

fixed_chunks = fixed_chunk_documents(documents, chunk_size=260)
semantic_chunks = semantic_chunk_documents(documents)

comparison = pd.DataFrame([
    {"mode": "fixed", "count": len(fixed_chunks), "avg_chars": sum(len(c.text) for c in fixed_chunks) / len(fixed_chunks)},
    {"mode": "semantic", "count": len(semantic_chunks), "avg_chars": sum(len(c.text) for c in semantic_chunks) / len(semantic_chunks)},
])
comparison
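
Averages can hide very short or very long chunks, so it can help to look at the spread as well. The following is a minimal sketch using only the standard library. `length_stats` is a hypothetical helper, and the toy strings stand in for real chunk texts.

```python
from statistics import mean, median


def length_stats(texts: list[str]) -> dict[str, float]:
    """Summarize chunk lengths in characters."""
    lengths = [len(text) for text in texts]
    if not lengths:
        raise ValueError("no chunks to summarize")
    return {
        "count": len(lengths),
        "min_chars": min(lengths),
        "median_chars": median(lengths),
        "mean_chars": mean(lengths),
        "max_chars": max(lengths),
    }


# Toy stand-ins for chunk texts with lengths 10, 20, and 60.
toy_texts = ["a" * 10, "b" * 20, "c" * 60]
stats = length_stats(toy_texts)
print(stats)
```

In this notebook the same helper could be applied to `[c.text for c in fixed_chunks]` and `[c.text for c in semantic_chunks]`, assuming each chunk exposes `.text` as in the cell above.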

In [None]:
# Chunk boundary visualization (same source text, different split strategies)

section_doc = next(doc for doc in documents if doc.section == "International Work")
fixed_view = [c.text for c in fixed_chunk_documents([section_doc], chunk_size=120)]
semantic_view = [c.text for c in semantic_chunk_documents([section_doc])]

print("Section:", section_doc.section)
print("\nFixed chunks:")
for idx, chunk_text in enumerate(fixed_view, start=1):
    print(f"[{idx}] {chunk_text}")

print("\nSemantic chunks:")
for idx, chunk_text in enumerate(semantic_view, start=1):
    print(f"[{idx}] {chunk_text}")
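
Reading the printed chunks by eye works for one section. A rough heuristic can flag chunks that stop mid-sentence across many sections. This sketch assumes that a chunk ending without sentence-final punctuation was cut mid-sentence, which will misfire on headings and list items. `ends_mid_sentence` is a hypothetical helper, not part of `rag_tutorials`.

```python
def ends_mid_sentence(chunk_text: str) -> bool:
    """Heuristic: True when a non-empty chunk does not end on sentence-final punctuation."""
    stripped = chunk_text.rstrip()
    if not stripped:
        return False
    return not stripped.endswith((".", "!", "?", ":"))


def count_cut_chunks(chunk_texts: list[str]) -> int:
    """Count chunks that appear to be cut off mid-sentence."""
    return sum(ends_mid_sentence(text) for text in chunk_texts)


# Toy chunk texts invented for this sketch.
toy_chunks = [
    "Remote work abroad requires",
    "manager approval in advance.",
    "Approval is valid for",
]
print(count_cut_chunks(toy_chunks))
```

Applied to `fixed_view` and `semantic_view` from the cell above, a lower count for the semantic chunks would be consistent with cleaner boundaries, though the heuristic is too crude to serve as a metric.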

In [None]:
# 5-8) Build embeddings/index and run retrieval pipeline on semantic chunks

semantic_retriever, semantic_vectors = build_dense_retriever(
    chunks=semantic_chunks,
    collection_name="tutorial2_semantic_dense",
    embedding_model=embedding_model,
)

probe = "What is the policy for working from another country?"
semantic_results = semantic_retriever(probe, top_k=5)

pd.DataFrame([
    {"rank": i + 1, "chunk_id": r.chunk_id, "score": r.score, "preview": r.text[:110]}
    for i, r in enumerate(semantic_results)
])
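
For intuition about what a dense retriever does with the probe, here is a minimal pure-Python sketch that ranks chunks by cosine similarity. It is an assumption that similarity ranking of this kind is what happens inside `build_dense_retriever`. The source does not show which distance function is used, nor whether `score` is a similarity or a distance. `rank_by_cosine` and the two-dimensional toy vectors are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_cosine(
    query_vec: list[float],
    chunk_vecs: dict[str, list[float]],
    k: int,
) -> list[tuple[str, float]]:
    """Return the k chunk IDs most similar to the query, best first."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-D "embeddings"; real embedding vectors have far more dimensions.
toy_vectors = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0]}
top = rank_by_cosine([1.0, 0.0], toy_vectors, k=2)
print(top)
```

Because the ranking depends only on vector geometry, two chunks with overlapping vocabulary can land very close together, which is one way the misordering noted earlier can arise.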

In [None]:
# 9-10) Evaluation and debug output (comparable with Tutorial 1)

rows = [
    evaluate_single(
        query=q,
        retrieval_fn=lambda question: semantic_retriever(question, top_k=5),
        answer_fn=lambda question, context: answer_with_context(question, context, model=chat_model),
        top_k=5,
    )
    for q in queries[:20]
]

print("Tutorial 2 metrics:", summarize(rows))

toy_q = queries[0].question
toy_results = semantic_retriever(toy_q, top_k=5)
print("\nNovice trace for one query:")
for i, r in enumerate(toy_results, start=1):
    print(f"{i}. {r.chunk_id} | {r.score:.4f} | {r.text[:90]}")
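
When comparing against Tutorial 1, it helps to know what a retrieval metric of this kind measures. The sketch below computes a hit rate at `k` over several queries. Whether `evaluate_single` and `summarize` report this exact metric is not shown here, so treat `hit_rate_at_k` and the placeholder chunk IDs as assumptions for illustration only.

```python
def hit_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> bool:
    """True when at least one relevant chunk appears in the top k results."""
    return any(chunk_id in relevant_ids for chunk_id in ranked_ids[:k])


def hit_rate_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Fraction of queries with at least one relevant chunk in the top k."""
    if not runs:
        raise ValueError("no queries to evaluate")
    hits = sum(hit_at_k(ranked, relevant, k) for ranked, relevant in runs)
    return hits / len(runs)


# Placeholder retrieval results: (ranked chunk IDs, relevant chunk IDs) per query.
toy_runs = [
    (["c1", "c2", "c3"], {"c1"}),  # relevant chunk ranked first
    (["c4", "c5", "c6"], {"c6"}),  # relevant chunk ranked third
    (["c7", "c8", "c9"], {"c0"}),  # relevant chunk not retrieved
]

print(hit_rate_at_k(toy_runs, k=1))
print(hit_rate_at_k(toy_runs, k=3))
```

The gap between the two values shows how much of the retrieved evidence sits below the first position, which is the part a later reranking stage would target.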