# Tutorial 2 — Semantic Chunking (Same Pipeline, Better Chunks)

Only one variable changes from Tutorial 1: chunking strategy (`fixed` → `semantic`).

```mermaid
flowchart LR
    A[Same Documents] --> B[Semantic Chunking]
    B --> C[OpenAI Embeddings]
    C --> D[Chroma]
    E[Same Query Set] --> F[Dense Retrieval]
    F --> G[Compare vs Tutorial 1]
```

## Learning checkpoint: what improved and what still fails

**What works better in Tutorial 2**
- Semantically grouped chunks keep a policy rule and its condition in the same chunk.
- Recall on context-dependent questions should improve versus Tutorial 1.

**Challenges you should observe**
- Dense retrieval can still rank a relevant chunk below weaker ones.
- Similar chunks with overlapping terms may still be misordered.
- Exact-token questions (e.g., specific form IDs) are not consistently top-ranked.

**Why move to Tutorial 3**
- Chunking is better now, but ranking quality is still a bottleneck.
- We next add a reranking stage to reorder candidates by query-specific relevance.

In [1]:
# 1-3) Setup, config, and load data

import importlib
import importlib.util  # explicit submodule import; find_missing() calls importlib.util.find_spec
import os
from pathlib import Path
import shutil
import subprocess
import sys

import pandas as pd
from dotenv import load_dotenv

# Ensure uv is available (installs with: pip install uv)
if shutil.which("uv") is None:
    print("uv not found. Installing with pip...")
    subprocess.run([sys.executable, "-m", "pip", "install", "uv"], check=True)

# Ensure notebook runs from repo root and local src/ is importable
cwd = Path.cwd().resolve()
repo_root = next(
    (path for path in [cwd, *cwd.parents] if (path / "pyproject.toml").exists() and (path / "src").exists()),
    cwd,
)
os.chdir(repo_root)
src_path = repo_root / "src"
if str(src_path) not in sys.path:
    sys.path.insert(0, str(src_path))

REQUIRED_PACKAGES = [
    "openai",
    "chromadb",
    "numpy",
    "pandas",
    "rank_bm25",
    "sentence_transformers",
    "dotenv",
]
PIP_NAME_MAP = {"rank_bm25": "rank-bm25", "sentence_transformers": "sentence-transformers", "dotenv": "python-dotenv"}

def find_missing(packages: list[str]) -> list[str]:
    importlib.invalidate_caches()
    return [pkg for pkg in packages if importlib.util.find_spec(pkg) is None]

missing = find_missing(REQUIRED_PACKAGES)
if missing:
    print("Missing packages:", missing)
    print("Running: uv sync")
    subprocess.run(["uv", "sync"], check=True)

missing_after_sync = find_missing(REQUIRED_PACKAGES)
if missing_after_sync:
    pip_targets = [PIP_NAME_MAP.get(pkg, pkg) for pkg in missing_after_sync]
    print("Installing into current kernel with pip:", pip_targets)
    subprocess.run([sys.executable, "-m", "pip", "install", *pip_targets], check=True)

final_missing = find_missing(REQUIRED_PACKAGES)
if final_missing:
    raise ImportError(f"Dependencies still missing in current kernel: {final_missing}")

from rag_tutorials.io_utils import load_handbook_documents, load_queries
from rag_tutorials.chunking import fixed_chunk_documents, semantic_chunk_documents
from rag_tutorials.pipeline import build_dense_retriever
from rag_tutorials.qa import answer_with_context
from rag_tutorials.evaluation import evaluate_single, summarize

load_dotenv()
if not os.getenv("OPENAI_API_KEY"):
    raise EnvironmentError("OPENAI_API_KEY is required")

embedding_model = os.getenv("OPENAI_EMBEDDING_MODEL", "text-embedding-3-small")
chat_model = os.getenv("OPENAI_CHAT_MODEL", "gpt-4.1-mini")

handbook_path = Path("data/handbook_manual.txt")
queries_path = Path("data/queries.jsonl")
if not handbook_path.exists() or not queries_path.exists():
    raise FileNotFoundError("Run: uv run python scripts/generate_data.py")

documents = load_handbook_documents(handbook_path)
queries = load_queries(queries_path)

Missing packages: ['rank_bm25']
Running: uv sync
Installing into current kernel with pip: ['rank-bm25']


Resolved 204 packages in 3ms
Audited 180 packages in 2ms


Collecting rank-bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank-bm25
Successfully installed rank-bm25-0.2.2



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.3[0m[39;49m -> [0m[32;49m26.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49m/opt/homebrew/opt/python@3.11/bin/python3.11 -m pip install --upgrade pip[0m


In [2]:
# 4) Chunk and normalize text: fixed vs semantic

fixed_chunks = fixed_chunk_documents(documents, chunk_size=260)
semantic_chunks = semantic_chunk_documents(documents)

comparison = pd.DataFrame([
    {"mode": "fixed", "count": len(fixed_chunks), "avg_chars": sum(len(c.text) for c in fixed_chunks) / len(fixed_chunks)},
    {"mode": "semantic", "count": len(semantic_chunks), "avg_chars": sum(len(c.text) for c in semantic_chunks) / len(semantic_chunks)},
])
comparison

|   | mode | count | avg_chars |
|---|------|-------|-----------|
| 0 | fixed | 6 | 191.0 |
| 1 | semantic | 7 | 163.142857 |


In [3]:
# Chunk boundary visualization (same source text, different split strategies)

section_doc = next(doc for doc in documents if doc.section == "International Work")
fixed_view = [c.text for c in fixed_chunk_documents([section_doc], chunk_size=120)]
semantic_view = [c.text for c in semantic_chunk_documents([section_doc])]

print("Section:", section_doc.section)
print("\nFixed chunks:")
for idx, chunk_text in enumerate(fixed_view, start=1):
    print(f"[{idx}] {chunk_text}")

print("\nSemantic chunks:")
for idx, chunk_text in enumerate(semantic_view, start=1):
    print(f"[{idx}] {chunk_text}")

Section: International Work

Fixed chunks:
[1] Working from another country is capped at 14 days in a rolling 12-month period without permit support. Beyond 14 days, e
[2] mployees must open a Global Mobility case and obtain HR, Legal, and Payroll approval. Violations can trigger immigration
[3] , payroll, and tax exposure.

Semantic chunks:
[1] Working from another country is capped at 14 days in a rolling 12-month period without permit support. Beyond 14 days, employees must open a Global Mobility case and obtain HR, Legal, and Payroll approval
[2] Violations can trigger immigration, payroll, and tax exposure.
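
The boundary difference above can be reproduced with a small standalone sketch. It does not use `rag_tutorials`: `fixed_chunks` and `sentence_chunks` are hypothetical helpers, and packing whole sentences up to a character budget is only an assumed stand-in for whatever `semantic_chunk_documents` actually does.

```python
import re

# Section text copied from the "International Work" output above.
TEXT = (
    "Working from another country is capped at 14 days in a rolling 12-month "
    "period without permit support. Beyond 14 days, employees must open a "
    "Global Mobility case and obtain HR, Legal, and Payroll approval. "
    "Violations can trigger immigration, payroll, and tax exposure."
)


def fixed_chunks(text: str, chunk_size: int) -> list[str]:
    """Cut every chunk_size characters, ignoring word and sentence boundaries."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def sentence_chunks(text: str, max_chars: int) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars (assumed strategy)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


fixed_view = fixed_chunks(TEXT, chunk_size=120)
sentence_view = sentence_chunks(TEXT, max_chars=220)

for label, view in (("fixed", fixed_view), ("sentence-based", sentence_view)):
    print(f"\n{label}:")
    for idx, chunk in enumerate(view, start=1):
        print(f"[{idx}] {chunk}")
```

With these settings the fixed splitter cuts inside the word "employees", while the sentence-based splitter keeps the 14-day cap and the approval requirement in one chunk. The `max_chars=220` budget is an arbitrary choice for this example.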


### Vector Embedding and Nearest-Neighbor Search (same mechanism as Tutorial 1, different chunks)

Each chunk is embedded into a high-dimensional vector using the configured embedding model
(`text-embedding-3-small` by default; override with `OPENAI_EMBEDDING_MODEL`).
Retrieval scores every chunk vector by **cosine similarity** to the query vector and
returns the **top-k highest-scoring** chunks. This is nearest-neighbor search.

#### Quick recap: how top-k nearest-neighbor works

```
for each chunk in the index:
    score = cosine_similarity(query_vector, chunk_vector)
sort chunks by score descending
return first k chunks           ← these are the k nearest neighbors
```
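
The recap above can be made runnable as a minimal pure-Python sketch. The 2-D vectors and chunk ids are made up for illustration; real embeddings have far more dimensions, and the tutorial's retriever presumably delegates this search to Chroma rather than looping in Python.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity to anything
    return dot / (norm_a * norm_b)


def top_k(query_vector, chunk_vectors, k):
    """Score every chunk, sort by score descending, keep the first k."""
    scored = [
        (chunk_id, cosine_similarity(query_vector, vector))
        for chunk_id, vector in chunk_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index: hypothetical chunk ids mapped to 2-D vectors.
toy_index = {
    "chunk-a": [1.0, 0.0],
    "chunk-b": [0.7, 0.7],
    "chunk-c": [0.0, 1.0],
    "chunk-d": [-1.0, 0.0],
}

nearest = top_k([1.0, 0.1], toy_index, k=2)
for chunk_id, score in nearest:
    print(f"{chunk_id}: {score:.3f}")
```

Here `chunk-a` points almost the same way as the query and ranks first, and `chunk-d` points the opposite way and gets a negative score, which is why cosine scores can fall below zero.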

#### What changes in Tutorial 2: the chunks, not the search algorithm

The nearest-neighbor algorithm is **identical** to Tutorial 1. The only difference is
*what* is stored in the index. Semantic chunking groups sentences by meaning, so each
chunk vector represents one coherent idea, and that changes which chunk ends up
as the nearest neighbor for a given query.

```
Same query: 'working from abroad' (illustrative scores, not measured output)

Tutorial 1 (fixed chunks)         Tutorial 2 (semantic chunks)
─────────────────────────         ─────────────────────────────
rank 1  [0.82]  remote-work-p1   rank 1  [0.91]  full remote-work section
rank 2  [0.74]  remote-work-p2   rank 2  [0.78]  international-transfer block
rank 3  [0.61]  leave-general    rank 3  [0.65]  tax-compliance paragraph

Fixed chunks split mid-sentence → two low-quality neighbors.
Semantic chunks keep the policy together → one high-quality neighbor.
```

See **Tutorial 1 cells 10–13** for the full cosine-similarity derivation and
a step-by-step nearest-neighbor example with 6 toy chunk vectors.


In [4]:
# 5-8) Build embeddings/index and run retrieval pipeline on semantic chunks

semantic_retriever, semantic_vectors = build_dense_retriever(
    chunks=semantic_chunks,
    collection_name="tutorial2_semantic_dense",
    embedding_model=embedding_model,
)

probe = "What is the policy for working from another country?"
semantic_results = semantic_retriever(probe, top_k=5)

pd.DataFrame([
    {"rank": i + 1, "chunk_id": r.chunk_id, "score": r.score, "preview": r.text[:110]}
    for i, r in enumerate(semantic_results)
])

|   | rank | chunk_id | score | preview |
|---|------|----------|-------|---------|
| 0 | 1 | DOC-HB-INTERNATIONALWORK-SEM-00 | 0.16067 | Working from another country is capped at 14 d... |
| 1 | 2 | DOC-HB-INTERNATIONALTAX-SEM-00 | -0.017738 | Employees traveling internationally may need F... |
| 2 | 3 | DOC-HB-REMOTEWORK-SEM-00 | -0.191435 | Z-Tech encourages remote work from home, co-wo... |
| 3 | 4 | DOC-HB-TRAVELAPPROVAL-SEM-00 | -0.220997 | International travel requests must be submitte... |
| 4 | 5 | DOC-HB-SECURITY-SEM-00 | -0.323462 | Employees handling customer data while traveli... |


In [5]:
# 9-10) Evaluation and debug output (comparable with Tutorial 1)

rows = [
    evaluate_single(
        query=q,
        retrieval_fn=lambda question: semantic_retriever(question, top_k=5),
        answer_fn=lambda question, context: answer_with_context(question, context, model=chat_model),
        top_k=5,
    )
    for q in queries[:20]
]

print("Tutorial 2 metrics:", summarize(rows))

toy_q = queries[0].question
toy_results = semantic_retriever(toy_q, top_k=5)
print("\nNovice trace for one query:")
for i, r in enumerate(toy_results, start=1):
    print(f"{i}. {r.chunk_id} | {r.score:.4f} | {r.text[:90]}")

Tutorial 2 metrics: {'recall_at_k': 1.0, 'mrr': 0.975, 'latency_ms': 2040.9615729935467, 'groundedness': 0.7643919249916471}

Novice trace for one query:
1. DOC-HB-REMOTEWORK-SEM-00 | 0.1225 | Z-Tech encourages remote work from home, co-working spaces, or temporary domestic location
2. DOC-HB-INTERNATIONALWORK-SEM-00 | -0.2440 | Working from another country is capped at 14 days in a rolling 12-month period without per
3. DOC-HB-REMOTEWORK-SEM-01 | -0.3477 | Public Wi-Fi usage is allowed only with corporate VPN enabled.
4. DOC-HB-INTERNATIONALTAX-SEM-00 | -0.3520 | Employees traveling internationally may need Form A-12 before departure when cross-border 
5. DOC-HB-SECURITY-SEM-00 | -0.3534 | Employees handling customer data while traveling must use VPN, hardware-backed MFA, and en
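
To make `recall_at_k` and `mrr` easier to interpret, here is a minimal sketch of the standard definitions. The implementation in `rag_tutorials.evaluation` is not shown in this notebook and may differ (for example, it may treat recall as a simple hit-or-miss flag), so the function names, signatures, and sample ids below are assumptions for illustration only.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunk ids that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Hypothetical run of 20 queries: the relevant chunk ("gold") is ranked
# first for 19 of them and second for the remaining one.
runs = [(["gold", "other"], ["gold"])] * 19 + [(["other", "gold"], ["gold"])]

mrr = mean(reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs)
recall = mean(recall_at_k(retrieved, relevant, k=5) for retrieved, relevant in runs)
print(f"recall_at_k={recall:.3f}  mrr={mrr:.3f}")
```

Under these definitions, a reciprocal rank of 1.0 means the relevant chunk was ranked first, so an MRR just below 1.0 means a few queries had their relevant chunk ranked lower. The hypothetical split used here happens to give the same `recall_at_k` and `mrr` as the reported run, but the actual per-query ranks are not shown in the output above.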
