# Level 2 - Week 5 - 02 Eval Set Design

**Estimated time:** 60-90 minutes

## Learning Objectives

- Create a 10 to 20 item eval set
- Cover in-KB and out-of-KB cases
- Store expected modes


## Overview

A small, fixed eval set gives you a repeatable signal: run the same questions after every change and compare the results.

You don’t need hundreds of examples to start.

## The underlying theory: you are sampling from a distribution

Your real users generate questions from some unknown distribution $D$.

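An eval set is a sample from $D$ (or an approximation of it). If the sample is biased, for example toward easy questions, gains measured on it may not carry over to real traffic, and you end up optimizing the wrong behavior.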

Practical implication:

- include the failure modes you actually care about, not just easy questions
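To make the sampling point concrete, here is a small arithmetic sketch. Every number in it is hypothetical and chosen only for illustration: it assumes a system that handles in-KB questions well and out-of-KB questions poorly, and compares the score on a biased eval set with the score under an assumed real traffic mix.

```python
# Hypothetical per-category accuracy of some system (illustrative numbers only).
accuracy = {"in_kb": 0.9, "out_of_kb": 0.4}


def expected_accuracy(mix: dict[str, float]) -> float:
    """Overall accuracy when questions arrive in the given category proportions."""
    return sum(share * accuracy[category] for category, share in mix.items())


# Assumed real traffic: 70% in-KB, 30% out-of-KB (hypothetical).
real_mix = {"in_kb": 0.7, "out_of_kb": 0.3}
# A biased eval set that contains only easy in-KB questions.
biased_mix = {"in_kb": 1.0, "out_of_kb": 0.0}

real_score = expected_accuracy(real_mix)
biased_score = expected_accuracy(biased_mix)
print(f"real traffic:    {real_score:.2f}")
print(f"biased eval set: {biased_score:.2f}")
```

Under these assumptions the biased set reports 0.90 while real traffic would score 0.75, so the eval set hides exactly the failure mode that matters.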

## Coverage beats size (early on)

With 10–20 items, your goal is not high statistical confidence. Your goal is to cover:

- common intents
- near-miss cases (where retrieval must be correct)
- out-of-KB boundary
- ambiguous queries that should trigger clarification

## Suggested composition (10–20 items)

- 50% obvious in-KB
- 30% near-miss (requires the correct chunk)
- 20% out-of-KB (should refuse/clarify)
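The percentages above rarely divide evenly for small sets. The following sketch turns them into whole item counts that always sum to the requested size. The category names and the `composition_counts` helper are illustrative assumptions, not part of the course material.

```python
# Suggested shares from the list above, as integer percentages.
SHARES = {"in_kb": 50, "near_miss": 30, "out_of_kb": 20}


def composition_counts(n_items: int) -> dict[str, int]:
    """Split n_items by SHARES so that the counts always sum to n_items."""
    # Start from the rounded-down count for each category.
    counts = {name: n_items * pct // 100 for name, pct in SHARES.items()}
    leftover = n_items - sum(counts.values())
    # Hand the leftover items to the categories with the largest remainders.
    by_remainder = sorted(
        SHARES, key=lambda name: (n_items * SHARES[name]) % 100, reverse=True
    )
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


for n in (10, 15, 20):
    print(n, composition_counts(n))
```

Treat the result as a starting point: with so few items, shifting one question between categories matters less than making sure every category is represented.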

## What to store per item (keep it mechanically checkable)

- `id`
- `question`
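- `expected_mode`: one of `answer`, `clarify`, `refuse`
- (recommended) `relevant_chunk_ids`: the chunk IDs that count as a correct retrieval (one or more for `answer` items, an empty list for `clarify` and `refuse` items)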

## Practice Steps

- Draft eval items with `expected_mode` and `relevant_chunk_ids`.
- Include at least:
  - one out-of-KB case
  - one ambiguous case that should trigger `clarify`.
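A quick way to confirm that a draft meets the minimums above is to count items per `expected_mode`. This is a minimal sketch: `draft_items` and `missing_modes` are hypothetical names, and it assumes that out-of-KB cases are labeled `refuse`.

```python
from collections import Counter

# Hypothetical draft that is still missing a `clarify` case.
draft_items = [
    {"id": "q_001", "question": "What endpoint shows service health?",
     "expected_mode": "answer", "relevant_chunk_ids": ["fastapi#001"]},
    {"id": "q_003", "question": "What is the weather in Tokyo tomorrow?",
     "expected_mode": "refuse", "relevant_chunk_ids": []},
]


def missing_modes(items: list[dict],
                  required: tuple[str, ...] = ("answer", "clarify", "refuse")) -> list[str]:
    """Return the required modes that no item in the set is labeled with."""
    seen = Counter(item["expected_mode"] for item in items)
    return [mode for mode in required if seen[mode] == 0]


print("missing modes:", missing_modes(draft_items))
```

An empty list means every mode has at least one item. It does not tell you whether the questions themselves are realistic.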

### Sample code

Example eval item format.


```python
eval_items = [
    {
        'id': 'q_001',
        'question': 'What endpoint shows service health?',
        'expected_mode': 'answer',
        'relevant_chunk_ids': ['fastapi#001'],
    },
]

print(eval_items)
```


### Student fill-in

Add more eval items.

Checklist:

- include at least 1–2 items for each `expected_mode`
- keep labels simple and checkable
- list acceptable `relevant_chunk_ids` (even if it’s just one)

```python
# The only labels an item may carry in `expected_mode`.
ALLOWED_MODES = {"answer", "clarify", "refuse"}


eval_items = [
    # In-KB question: the system should answer from the listed chunk.
    {
        "id": "q_001",
        "question": "What endpoint shows service health?",
        "expected_mode": "answer",
        "relevant_chunk_ids": ["fastapi#001"],
    },
    # Ambiguous question: the system should ask for clarification.
    {
        "id": "q_002",
        "question": "How do I configure retries?",
        "expected_mode": "clarify",
        "relevant_chunk_ids": [],
    },
    # Out-of-KB question: the system should refuse.
    {
        "id": "q_003",
        "question": "What is the weather in Tokyo tomorrow?",
        "expected_mode": "refuse",
        "relevant_chunk_ids": [],
    },
]


def validate_eval_item(item: dict) -> None:
    """Raise ValueError if a required field is missing or malformed."""
    if not item.get("id"):
        raise ValueError("missing id")
    if not item.get("question"):
        raise ValueError("missing question")
    if item.get("expected_mode") not in ALLOWED_MODES:
        raise ValueError(f"invalid expected_mode: {item.get('expected_mode')}")
    if "relevant_chunk_ids" not in item:
        raise ValueError("missing relevant_chunk_ids")
    if not isinstance(item["relevant_chunk_ids"], list):
        raise ValueError("relevant_chunk_ids must be a list")


# Fail fast on the first malformed item.
for it in eval_items:
    validate_eval_item(it)

print("n_eval_items:", len(eval_items))
print(eval_items)
```
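Because each label is mechanically checkable, scoring can be a few lines of code. The sketch below assumes your system returns a `mode` and a list of `retrieved_chunk_ids` per question. Those field names, the `score_item` helper, and the sample outputs (including the chunk IDs `fastapi#002` and `fastapi#004`) are hypothetical.

```python
# Hypothetical eval items in the format used above.
items = [
    {"id": "q_001", "expected_mode": "answer", "relevant_chunk_ids": ["fastapi#001"]},
    {"id": "q_003", "expected_mode": "refuse", "relevant_chunk_ids": []},
]

# Hypothetical system outputs keyed by item id; the field names are assumptions.
outputs = {
    "q_001": {"mode": "answer", "retrieved_chunk_ids": ["fastapi#004", "fastapi#001"]},
    "q_003": {"mode": "answer", "retrieved_chunk_ids": ["fastapi#002"]},
}


def score_item(item: dict, output: dict) -> dict:
    """Compare one system output against the labels of one eval item."""
    mode_ok = output["mode"] == item["expected_mode"]
    relevant = set(item["relevant_chunk_ids"])
    # Retrieval is only scored when the item lists acceptable chunks.
    if relevant:
        retrieval_ok = bool(relevant & set(output["retrieved_chunk_ids"]))
    else:
        retrieval_ok = None
    return {"id": item["id"], "mode_ok": mode_ok, "retrieval_ok": retrieval_ok}


results = [score_item(item, outputs[item["id"]]) for item in items]
mode_accuracy = sum(r["mode_ok"] for r in results) / len(results)

print(results)
print("mode_accuracy:", mode_accuracy)
```

Here `q_003` counts as a failure because the system answered an out-of-KB question instead of refusing, which is the kind of error an all-in-KB eval set would never surface.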

## Self-check

- Does your eval set include failure cases?
- Are labels checkable?
