# Level 2 - Week 3 - 03 Retrieval Evaluation

**Estimated time:** 60-90 minutes

## Learning Objectives

- Define a small eval set
- Compute hit rate and recall
- Save misses for inspection


## Overview

A saved eval set makes retrieval changes measurable: rerun the same queries after each change and compare the metrics before and after.
Start with 10 to 20 queries.
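
To make "saved" concrete, here is a minimal sketch of writing an eval set to JSONL and reading it back. The helper names `save_eval_set` and `load_eval_set` and the file name `eval_set.jsonl` are assumptions for illustration, not part of the course code. The item shape matches the `eval_items` used in the sample code.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def save_eval_set(items: list[dict], path: Path) -> None:
    # One JSON object per line; sorted keys keep diffs stable between saves.
    with path.open("w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, sort_keys=True, ensure_ascii=False) + "\n")


def load_eval_set(path: Path) -> list[dict]:
    # Skip blank lines so a trailing newline does not break loading.
    with path.open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical file name; a temp directory keeps this sketch free of side effects.
eval_path = Path(tempfile.mkdtemp()) / "eval_set.jsonl"
items = [
    {"id": "q_001", "query": "What is RAG?", "relevant_chunk_ids": ["rag_intro#02", "rag_overview#01"]},
    {"id": "q_002", "query": "How do I start the API?", "relevant_chunk_ids": ["fastapi#001", "uvicorn#003"]},
]
save_eval_set(items, eval_path)
loaded = load_eval_set(eval_path)
print("round trip ok:", loaded == items)
```

In a real project you would likely keep the file under version control next to the code, so that changes to the eval set show up in review.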

## Practice Steps

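- Define eval items, each with a query and its `relevant_chunk_ids`.
- Compute hit rate and `recall_at_k` over the eval set.
- Save the misses (queries with labeled chunks but no hit in the top k) for inspection.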
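
The metrics named in the steps above can be written out as follows. The notation is an assumption chosen for this sketch: $Q$ is the set of eval queries, $R_q$ is the set of relevant chunk ids for query $q$, $T_q^{k}$ is the set of the top $k$ retrieved chunk ids, and $\mathrm{rank}_q$ is the 1-based position of the first relevant chunk in the retrieved list.

```latex
\begin{aligned}
\mathrm{hit@}k(q) &= \mathbf{1}\left[\, R_q \cap T_q^{k} \neq \emptyset \,\right] \\
\mathrm{recall@}k(q) &= \frac{\lvert R_q \cap T_q^{k} \rvert}{\lvert R_q \rvert} \\
\mathrm{RR}(q) &= \begin{cases} 1 / \mathrm{rank}_q & \text{if a relevant chunk is retrieved} \\ 0 & \text{otherwise} \end{cases} \\
\text{hit rate} &= \frac{1}{\lvert Q \rvert} \sum_{q \in Q} \mathrm{hit@}k(q)
\end{aligned}
```

Recall is undefined when $R_q$ is empty. The sample code returns 0.0 in that case by convention.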


### Sample code

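A minimal eval loop that computes hit rate, recall@k, and MRR, and collects misses. It uses a deterministic search stub in place of a real retriever.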


In [None]:
from __future__ import annotations


def hit_at_k(relevant: set[str], retrieved: list[str], k: int) -> int:
    if k <= 0:
        raise ValueError("k must be > 0")
    topk = retrieved[:k]
    return int(any(cid in relevant for cid in topk))


def recall_at_k(relevant: set[str], retrieved: list[str], k: int) -> float:
    if k <= 0:
        raise ValueError("k must be > 0")
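    if not relevant:
        # No labeled chunks: recall is undefined, so report 0.0 by convention.
        return 0.0
    # Use a set so a chunk_id repeated in the results is counted only once.
    found = relevant & set(retrieved[:k])
    return len(found) / len(relevant)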


def mrr(relevant: set[str], retrieved: list[str]) -> float:
    # Reciprocal rank of the first relevant chunk (0.0 if none)
    for i, cid in enumerate(retrieved, start=1):
        if cid in relevant:
            return 1.0 / i
    return 0.0


# Minimal eval set: record which chunk_ids count as a "correct" retrieval for each query.
# In a real system, you save this as JSONL so it is repeatable and diffable.
eval_items: list[dict] = [
    {"id": "q_001", "query": "What is RAG?", "relevant_chunk_ids": ["rag_intro#02", "rag_overview#01"]},
    {"id": "q_002", "query": "How do I start the API?", "relevant_chunk_ids": ["fastapi#001", "uvicorn#003"]},
    {"id": "q_003", "query": "What is the weather in Tokyo tomorrow?", "relevant_chunk_ids": []},
]


def run_search_stub(query: str, top_k: int) -> list[str]:
    # Deterministic stub so repeated runs are stable.
    # Replace this with a real call to /search or your vector DB client.
    if "rag" in query.lower():
        base = ["rag_intro#02", "other#99", "rag_overview#01", "noise#00"]
    elif "api" in query.lower() or "start" in query.lower():
        base = ["noise#10", "uvicorn#003", "fastapi#001", "noise#11"]
    else:
        base = ["noise#20", "noise#21", "noise#22", "noise#23"]

    return base[:top_k]


def evaluate(items: list[dict], k: int = 5) -> dict:
    hit_scores: list[int] = []
    recall_scores: list[float] = []
    mrr_scores: list[float] = []
    misses: list[dict] = []

    for item in items:
        query = item["query"]
        relevant = set(item.get("relevant_chunk_ids", []))
        retrieved = run_search_stub(query=query, top_k=k)

        h = hit_at_k(relevant=relevant, retrieved=retrieved, k=k)
        r = recall_at_k(relevant=relevant, retrieved=retrieved, k=k)
        rr = mrr(relevant=relevant, retrieved=retrieved)

        hit_scores.append(h)
        recall_scores.append(r)
        mrr_scores.append(rr)

        # Record a miss only when the item has labeled chunks and none was retrieved.
        if relevant and h == 0:
            misses.append(
                {
                    "id": item.get("id"),
                    "query": query,
                    "relevant_chunk_ids": sorted(relevant),
                    "retrieved_chunk_ids": retrieved,
                }
            )

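    def mean(xs: list[float]) -> float:
        # max(..., 1) avoids division by zero on an empty eval set.
        return sum(xs) / max(len(xs), 1)

    # Note: items with no relevant_chunk_ids score 0 on every metric, so they
    # lower the averages without appearing in misses.
    metrics = {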
        "k": k,
        "n_items": len(items),
        "hit_rate": mean([float(x) for x in hit_scores]),
        "avg_recall_at_k": mean(recall_scores),
        "mrr": mean(mrr_scores),
        "n_misses": len(misses),
    }

    return {"metrics": metrics, "misses": misses}


result = evaluate(eval_items, k=3)
print("metrics:", result["metrics"])
print("misses:", result["misses"])
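
The loop above only prints the misses. A minimal sketch of saving them for later inspection follows, assuming a hypothetical helper `save_misses` and a hypothetical file name `misses_k3.jsonl`. With the stub search and `k=3`, the `misses` list is empty, so the sketch uses a made-up record in the same shape that `evaluate` builds.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def save_misses(misses: list[dict], path: Path) -> int:
    # Write one miss per line and return how many were written.
    with path.open("w", encoding="utf-8") as f:
        for miss in misses:
            f.write(json.dumps(miss, ensure_ascii=False) + "\n")
    return len(misses)


# Hypothetical miss record, same keys as the dicts appended in evaluate().
example_misses = [
    {
        "id": "q_004",
        "query": "How do I rotate an API key?",
        "relevant_chunk_ids": ["auth#007"],
        "retrieved_chunk_ids": ["noise#30", "noise#31", "noise#32"],
    }
]

# Hypothetical file name; a temp directory keeps this sketch free of side effects.
misses_path = Path(tempfile.mkdtemp()) / "misses_k3.jsonl"
n_written = save_misses(example_misses, misses_path)
print(f"wrote {n_written} misses to {misses_path}")
```

Putting `k` in the file name is one way to keep runs with different settings apart. Any naming scheme that identifies the run would work as well.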

### Student fill-in

Add more eval items and track misses.
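
As the eval set grows, small labeling mistakes become easy to miss. The sketch below shows one possible sanity check to run before evaluating. The function name `validate_eval_items` and the specific rules (unique ids, non-empty query, list of string chunk ids) are assumptions for illustration.

```python
from __future__ import annotations


def validate_eval_items(items: list[dict]) -> list[str]:
    # Return a list of human-readable problems; an empty list means the set looks sane.
    problems: list[str] = []
    seen_ids: set[str] = set()
    for i, item in enumerate(items):
        item_id = item.get("id")
        if not item_id:
            problems.append(f"item {i}: missing id")
        elif item_id in seen_ids:
            problems.append(f"item {i}: duplicate id {item_id!r}")
        else:
            seen_ids.add(item_id)

        if not str(item.get("query", "")).strip():
            problems.append(f"item {i}: empty query")

        ids = item.get("relevant_chunk_ids")
        if not isinstance(ids, list) or not all(isinstance(c, str) for c in ids):
            problems.append(f"item {i}: relevant_chunk_ids must be a list of strings")
    return problems


# Hypothetical items with deliberate mistakes in the second and third entries.
draft_items = [
    {"id": "q_001", "query": "What is RAG?", "relevant_chunk_ids": ["rag_intro#02"]},
    {"id": "q_001", "query": "  ", "relevant_chunk_ids": []},
    {"id": "q_003", "query": "How do I start the API?", "relevant_chunk_ids": "fastapi#001"},
]
problems = validate_eval_items(draft_items)
for p in problems:
    print(p)
```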


In [None]:
# TODO: add 5-10 eval items with relevant_chunk_ids


## Self-check

- Is the eval set saved and repeatable?
- Do you keep misses for debugging?
