# Level 2 - Week 5 - 03 Eval Script and Failure Analysis

**Estimated time:** 60-90 minutes

## Learning Objectives

- Compute metrics from eval items
- Log failures with evidence
- Save run artifacts


## Overview

A good `eval_rag.py` does two things:

- prints metrics
- prints failures with evidence

## Minimum metrics (recommended)

Define each metric precisely before coding it.

- retrieval hit rate / hit@k (or recall@k if each item labels all of its relevant chunks)
- citation coverage rate
- refusal correctness rate

Example definition:

- citation coverage rate = fraction of `mode=answer` responses that have >= 1 valid citation
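
The definition above only becomes computable once "valid" is pinned down. Here is a minimal sketch, assuming a citation counts as valid when its `chunk_id` appears among the chunks retrieved for that item. That validity rule and the helper names `has_valid_citation` and `citation_coverage_rate` are assumptions for illustration, not part of the course code.

```python
from __future__ import annotations


def has_valid_citation(citations: list[dict], retrieved_chunk_ids: list[str]) -> bool:
    """True if at least one citation points at a chunk that was retrieved.

    Assumption: "valid" means the cited chunk_id is in the retrieved set.
    """
    retrieved = set(retrieved_chunk_ids)
    return any(c.get("chunk_id") in retrieved for c in citations)


def citation_coverage_rate(items: list[dict]) -> float | None:
    """Fraction of mode=answer responses with >= 1 valid citation.

    Returns None when no response is in answer mode, so an undefined rate
    is not confused with a genuine 0.0.
    """
    answered = [it for it in items if it.get("actual_mode") == "answer"]
    if not answered:
        return None
    covered = sum(
        has_valid_citation(it.get("citations", []), it.get("retrieved_chunk_ids", []))
        for it in answered
    )
    return covered / len(answered)
```

Note that the denominator is the number of answer-mode responses, not the size of the eval set.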

## Metric formalization (examples)

Let the eval set have $n$ items.

### Hit@k (retrieval)

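Let $h_i \in \{0,1\}$ indicate whether item $i$ retrieved at least one relevant chunk in the top $k$ results. Items with no labeled relevant chunks (for example, expected refusals) count as $h_i = 0$ under this definition.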

$$
\mathrm{Hit@k} = \frac{1}{n} \sum_{i=1}^{n} h_i
$$
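
To make the role of $k$ explicit, the sketch below truncates the ranked list before checking for a hit, and shows recall@k next to it for comparison. The function names `hit_in_top_k` and `recall_in_top_k` and the toy data are hypothetical. The convention of scoring 0 for items with no relevant chunks is one choice among several.

```python
from __future__ import annotations


def hit_in_top_k(relevant: list[str], retrieved: list[str], k: int) -> int:
    """1 if any of the top-k retrieved chunk ids is relevant, else 0."""
    relevant_set = set(relevant)
    return int(any(cid in relevant_set for cid in retrieved[:k]))


def recall_in_top_k(relevant: list[str], retrieved: list[str], k: int) -> float:
    """Fraction of the relevant chunk ids that appear in the top-k results."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0  # convention chosen here; excluding such items is another option
    found = relevant_set.intersection(retrieved[:k])
    return len(found) / len(relevant_set)


def mean(values: list[float]) -> float:
    return sum(values) / len(values) if values else 0.0


# Hypothetical ranked results for two eval items
runs = [
    {"relevant": ["a", "b"], "retrieved": ["x", "a", "y", "b"]},
    {"relevant": ["c"], "retrieved": ["x", "y", "c"]},
]
k = 2
hit = mean([hit_in_top_k(r["relevant"], r["retrieved"], k) for r in runs])
recall = mean([recall_in_top_k(r["relevant"], r["retrieved"], k) for r in runs])
print(f"Hit@{k} = {hit:.2f}, Recall@{k} = {recall:.2f}")
```

With $k = 2$, the first item is a hit but finds only one of its two relevant chunks, and the second item misses, so Hit@2 is 0.50 while Recall@2 is 0.25.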

### Mode / refusal correctness

Let $r_i \in \{0,1\}$ indicate whether the predicted mode matches the expected mode.

$$
\mathrm{ModeCorrect} = \frac{1}{n} \sum_{i=1}^{n} r_i
$$
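
A single ModeCorrect number can hide the case where answers are fine but refusals fail. A minimal sketch of a per-mode breakdown follows, assuming items carry `expected_mode` and `actual_mode` fields. The helper name `mode_correct_by_expected` is hypothetical.

```python
from __future__ import annotations

from collections import defaultdict


def mode_correct_by_expected(items: list[dict]) -> dict[str, float]:
    """Mode correctness computed separately for each expected mode."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for it in items:
        expected = it.get("expected_mode", "")
        totals[expected] += 1
        correct[expected] += int(it.get("actual_mode", "") == expected)
    return {mode: correct[mode] / totals[mode] for mode in totals}
```

The entry for `refuse` is the refusal correctness rate: among items that should have been refused, the fraction that were.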

Writing definitions down prevents “metric drift”, where different runs compute different things under the same name.

## Uncertainty intuition (don’t overfit)

With only 10–20 items, a single item moves a rate by 5–10 percentage points, so it is easy to overfit: you tune until the eval looks good without truly improving the system.
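
To get a feel for how wide the uncertainty is at this sample size, you can put an interval around any rate. The sketch below uses the Wilson score interval for a binomial proportion. Choosing Wilson (rather than another interval) and the helper name `wilson_interval` are choices made here, not something the course prescribes.

```python
from __future__ import annotations

import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)  # no data: the rate could be anything
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


low, high = wilson_interval(16, 20)
print(f"16/20 correct: 95% interval is roughly [{low:.2f}, {high:.2f}]")
```

For 16 correct out of 20, the interval runs from about 0.58 to about 0.92, so a change from 0.80 to 0.85 between two runs is well within noise.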

Practical guardrails:

- keep a small hidden set (even 5 items) you don’t look at during tuning
- or periodically compare to a frozen baseline configuration
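
The hidden set is easiest to keep honest if the split is deterministic. One possible sketch ranks items by a hash of their `id` and holds out the first few, so the split does not depend on list order or on a random seed. The helper name `split_holdout` and the hashing scheme are assumptions for illustration.

```python
from __future__ import annotations

import hashlib


def _id_hash(item: dict) -> str:
    """Stable hash of the item id, used only to rank items."""
    return hashlib.sha256(str(item["id"]).encode("utf-8")).hexdigest()


def split_holdout(items: list[dict], holdout_size: int = 5) -> tuple[list[dict], list[dict]]:
    """Split items into (tune, holdout) with a fixed-size, order-independent holdout."""
    ranked = sorted(items, key=_id_hash)
    holdout_ids = {it["id"] for it in ranked[:holdout_size]}
    tune = [it for it in items if it["id"] not in holdout_ids]
    holdout = [it for it in items if it["id"] in holdout_ids]
    return tune, holdout
```

Run the holdout items only when you want a final check, and keep their failures out of your tuning notes.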

## Failure labeling (root cause)

For each failure, label one primary cause:

- `retrieval_miss`
- `context_too_noisy`
- `prompt_ambiguous`
- `citation_invalid`

Always add one short note: “what would have made this succeed?”
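
One way to enforce "one primary cause plus a note" is to validate failure records as they are built. The following is a sketch: the label set is copied from the list above, while the helper name `make_failure_record` and the choice to raise `ValueError` are assumptions.

```python
from __future__ import annotations

# Labels taken from the list above; extend this set if you add root causes.
ALLOWED_LABELS = {
    "retrieval_miss",
    "context_too_noisy",
    "prompt_ambiguous",
    "citation_invalid",
}


def make_failure_record(item: dict, label: str, note: str) -> dict:
    """Build one failure record with a single primary cause and a short note."""
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unknown label {label!r}; expected one of {sorted(ALLOWED_LABELS)}")
    if not note.strip():
        raise ValueError("note is required: what would have made this succeed?")
    return {
        "id": item.get("id"),
        "question": item.get("question"),
        "label": label,
        "note": note.strip(),
    }
```

Rejecting unknown labels early keeps label counts comparable across runs.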

## Practice Steps

- Implement metric calculations.
- Emit failure records with evidence fields (so you can debug without rerunning).

### Sample code

A minimal evaluation loop that collects mode mismatches as failures.


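```python
def evaluate(items: list[dict]) -> dict:
    """Collect items whose predicted mode differs from the expected mode."""
    failures = []
    for item in items:
        # A mode mismatch is the only failure condition in this minimal version.
        if item.get('actual_mode') != item.get('expected_mode'):
            failures.append(item)
    return {'n': len(items), 'failures': failures}
```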


### Student fill-in

Implement metric calculations and failure records.

Suggested per-item fields (minimal):

- `id`, `question`
- `expected_mode`, `actual_mode`
- `relevant_chunk_ids`, `retrieved_chunk_ids`
- `citations` (list of objects with at least `chunk_id`)
- `label` (root cause)
- `note` (what would have made this succeed?)

```python
from __future__ import annotations


def hit_at_k(relevant_chunk_ids: list[str], retrieved_chunk_ids: list[str]) -> int:
    """Return 1 if any retrieved chunk is relevant, else 0.

    Assumes `retrieved_chunk_ids` is already truncated to the top k.
    Items with no labeled relevant chunks score 0.
    """
    if not relevant_chunk_ids:
        return 0
    relevant = set(relevant_chunk_ids)
    return int(any(cid in relevant for cid in retrieved_chunk_ids))


def citation_coverage_for_item(actual_mode: str, citations: list[dict]) -> int:
    """Return 1 if an answer-mode response carries at least one citation.

    Note: this checks presence only; it does not verify that a citation is valid.
    """
    if actual_mode != "answer":
        return 0
    return int(len(citations) > 0)


def mode_correct(expected_mode: str, actual_mode: str) -> int:
    """Return 1 if the predicted mode matches the expected mode, else 0."""
    return int(expected_mode == actual_mode)


def evaluate(items: list[dict]) -> dict:
    """Compute aggregate metrics and collect failure records with evidence."""
    failures: list[dict] = []

    hit_sum = 0
    mode_sum = 0
    citation_cov_sum = 0
    n_answer = 0  # denominator for citation coverage

    for it in items:
        expected = it.get("expected_mode", "")
        actual = it.get("actual_mode", "")
        retrieved = it.get("retrieved_chunk_ids", [])
        relevant = it.get("relevant_chunk_ids", [])
        citations = it.get("citations", [])

        hit_sum += hit_at_k(relevant, retrieved)
        mode_sum += mode_correct(expected, actual)

        # Citation coverage is measured over answer-mode responses only.
        if actual == "answer":
            n_answer += 1
            citation_cov_sum += citation_coverage_for_item(actual, citations)

        # Only mode mismatches are recorded as failures here; widen this
        # condition if you also want retrieval misses or missing citations logged.
        if expected != actual:
            failures.append(
                {
                    "id": it.get("id"),
                    "question": it.get("question"),
                    "expected_mode": expected,
                    "actual_mode": actual,
                    "relevant_chunk_ids": relevant,
                    "retrieved_chunk_ids": retrieved,
                    "citations": citations,
                    "label": it.get("label", ""),
                    "note": it.get("note", ""),
                }
            )

    n = len(items)
    return {
        "n_items": n,
        "hit_at_k": hit_sum / n if n else 0.0,
        "mode_correct": mode_sum / n if n else 0.0,
        "citation_coverage": (citation_cov_sum / n_answer) if n_answer else 0.0,
        "n_failures": len(failures),
        "failures": failures,
    }


# Minimal synthetic example (replace with real `/search` + `/chat` outputs)
items = [
    {
        "id": "q_001",
        "question": "What endpoint shows service health?",
        "expected_mode": "answer",
        "actual_mode": "answer",
        "relevant_chunk_ids": ["fastapi#001"],
        "retrieved_chunk_ids": ["fastapi#001", "misc#123"],
        "citations": [{"chunk_id": "fastapi#001"}],
        "label": "",
        "note": "",
    },
    {
        "id": "q_003",
        "question": "What is the weather in Tokyo tomorrow?",
        "expected_mode": "refuse",
        "actual_mode": "answer",
        "relevant_chunk_ids": [],
        "retrieved_chunk_ids": ["misc#123"],
        "citations": [],
        "label": "prompt_ambiguous",
        "note": "Mode decision should be based on retrieval signals (empty/low score), not prompt vibes.",
    },
]

summary = evaluate(items)
# Print the aggregate metrics first, then each failure record on its own line.
print({k: v for k, v in summary.items() if k != "failures"})
print("failures:")
for f in summary["failures"]:
    print(f)
```
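
To debug without rerunning, the summary also needs to land on disk. Below is a minimal sketch of saving run artifacts, assuming a summary dict shaped like the one `evaluate` returns above. The directory layout, the file names `metrics.json` and `failures.jsonl`, and the helper name `save_run` are hypothetical choices, not a course requirement.

```python
from __future__ import annotations

import json
from datetime import datetime, timezone
from pathlib import Path


def save_run(summary: dict, config: dict, out_dir: str = "eval_runs") -> Path:
    """Write metrics and failures for one run into a timestamped directory."""
    run_id = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    run_dir = Path(out_dir) / run_id
    run_dir.mkdir(parents=True, exist_ok=True)

    # Aggregate metrics plus the configuration that produced them.
    metrics = {k: v for k, v in summary.items() if k != "failures"}
    payload = {"config": config, "metrics": metrics}
    (run_dir / "metrics.json").write_text(json.dumps(payload, indent=2), encoding="utf-8")

    # One failure record per line, so runs are easy to grep and diff.
    with (run_dir / "failures.jsonl").open("w", encoding="utf-8") as fh:
        for failure in summary.get("failures", []):
            fh.write(json.dumps(failure) + "\n")
    return run_dir
```

A call such as `save_run(summary, {"k": 5})` would then create one directory per run. The `config` dict is whatever you need to reproduce the run (the key `k` here is only an example).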

## Self-check

- Are metric definitions written down and implemented consistently?
- Do failures include enough evidence to debug without rerunning?
- Can you compare two runs with the same eval set and see stable behavior?
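
For the last check, a small comparison helper can make differences between two runs visible. This is a sketch assuming both summaries use the keys `n_items`, `hit_at_k`, `mode_correct`, `citation_coverage`, and `failures`, as in the `evaluate` output in this lesson. The helper name `compare_runs` is hypothetical.

```python
from __future__ import annotations

METRIC_KEYS = ("hit_at_k", "mode_correct", "citation_coverage")


def compare_runs(baseline: dict, candidate: dict) -> dict:
    """Report metric deltas and which failure ids changed between two runs."""
    if baseline.get("n_items") != candidate.get("n_items"):
        raise ValueError("runs cover a different number of items; compare on the same eval set")
    deltas = {k: candidate.get(k, 0.0) - baseline.get(k, 0.0) for k in METRIC_KEYS}
    base_failed = {f["id"] for f in baseline.get("failures", [])}
    cand_failed = {f["id"] for f in candidate.get("failures", [])}
    return {
        "deltas": deltas,
        "fixed": sorted(base_failed - cand_failed),
        "regressed": sorted(cand_failed - base_failed),
    }
```

If the metrics are unchanged but the `fixed` and `regressed` lists are both non-empty, the system is trading one failure for another rather than behaving stably.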