# Level 2 - Week 4 - 02 Citations Enforcement

**Estimated time:** 60-90 minutes

## Learning Objectives

- Format citations with doc_id and chunk_id
- Validate citations against retrieved hits
- Keep snippets traceable


## Overview

Citations are verifiable pointers to retrieved text.

### Underlying theory: verifiability

In a RAG system, the important property is **verifiability**:

- each answer claim should be backed by a pointer to evidence
- the pointer should resolve to a chunk you actually retrieved

This turns “correctness” into something testable: a program can check each pointer mechanically, without judging the answer itself.

### Why strict substring checks work (early on)

If you require `snippet` to be a strict substring of the stored chunk text:

- the model is forced to copy evidence rather than invent it
- you get a cheap mechanical check (no extra model call)

Tradeoff: strict checks are brittle. A snippet fails if its whitespace or formatting differs from the stored text, or if it spans a chunk boundary. That brittleness is often acceptable early on because it encourages stable chunking and stable storage.
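To make the whitespace brittleness concrete, here is a minimal sketch that compares the strict check with a relaxed one that collapses whitespace before matching. The helper names `normalize_ws` and `snippet_in_chunk` and the `strict` flag are assumptions for illustration; the lesson itself uses only the strict form. The relaxed mode is still deterministic and needs no model call.

```python
def normalize_ws(text: str) -> str:
    """Collapse every run of whitespace into one space and trim the ends."""
    return " ".join(text.split())


def snippet_in_chunk(snippet: str, chunk_text: str, strict: bool = True) -> bool:
    """Return True if snippet occurs in chunk_text.

    strict=True is the plain substring check from this lesson.
    strict=False (hypothetical relaxation) compares whitespace-normalized copies.
    """
    if strict:
        return snippet in chunk_text
    return normalize_ws(snippet) in normalize_ws(chunk_text)


# The stored chunk has extra spaces and a newline; the snippet does not.
chunk = "Evidence   text 1\ncontinues here"
snippet = "Evidence text 1 continues"

print(snippet_in_chunk(snippet, chunk))                # False
print(snippet_in_chunk(snippet, chunk, strict=False))  # True
```

If you relax the check like this, apply the same normalization wherever snippets are displayed or stored, so that what you validate is what the reader sees.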

## Practice Steps

- Format citations from hits.
- Validate:
  - `chunk_id` is one of the retrieved hits for this request
  - `snippet` appears in the stored chunk text
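The first step can be sketched as follows, assuming each hit is a dict with `doc_id`, `chunk_id`, and `text` keys (the shape used in the fill-in cell later in this lesson). The function name `format_citations` and the `max_snippet_chars` parameter are hypothetical. Copying the snippet as a prefix of the stored text guarantees it passes the substring check.

```python
def format_citations(hits: list[dict], max_snippet_chars: int = 80) -> list[dict]:
    """Build one citation per hit, copying the snippet verbatim from the hit text."""
    citations = []
    for hit in hits:
        citations.append({
            "doc_id": hit["doc_id"],
            "chunk_id": hit["chunk_id"],
            # A prefix of the stored text is always a substring of it.
            "snippet": hit["text"][:max_snippet_chars],
        })
    return citations


hits = [{"doc_id": "doc-1", "chunk_id": "c1", "text": "Evidence text 1"}]
citations = format_citations(hits, max_snippet_chars=8)
print(citations)
```

In a real pipeline the snippet would usually be the span the model quoted rather than a fixed prefix; the prefix is used here only to keep the sketch short.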

### Sample code

Minimal citation validator.


In [None]:
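def validate_citations(citations: list[dict], retrieved_ids: set[str], chunks_by_id: dict[str, str]) -> bool:
    """Return True only if every citation references a retrieved chunk and any snippet it carries is a substring of that chunk."""
    for c in citations:
        chunk_id = c.get('chunk_id')
        snippet = c.get('snippet', '')
        # Reject a missing chunk_id or one that was not retrieved for this request.
        if not chunk_id or chunk_id not in retrieved_ids:
            return False
        # A non-empty snippet must be an exact substring of the stored chunk text.
        # An empty or missing snippet is not checked here.
        if snippet and snippet not in chunks_by_id.get(chunk_id, ''):
            return False
    return True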
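
When a test case fails, a bare `False` does not say which rule was broken. A possible variant, sketched here with the same input shapes as `validate_citations`, collects one message per failing citation. The name `citation_errors` and the message wording are illustrative assumptions, not part of the course code.

```python
def citation_errors(
    citations: list[dict],
    retrieved_ids: set[str],
    chunks_by_id: dict[str, str],
) -> list[str]:
    """Return one human-readable message per invalid citation (empty list if all pass)."""
    errors = []
    for i, c in enumerate(citations):
        chunk_id = c.get("chunk_id")
        snippet = c.get("snippet", "")
        if not chunk_id:
            errors.append(f"citation {i}: missing chunk_id")
        elif chunk_id not in retrieved_ids:
            errors.append(f"citation {i}: chunk_id {chunk_id!r} was not retrieved")
        elif snippet and snippet not in chunks_by_id.get(chunk_id, ""):
            errors.append(f"citation {i}: snippet not found in chunk {chunk_id!r}")
    return errors


sample = [
    {"doc_id": "doc-1", "snippet": "Evidence text"},                       # missing chunk_id
    {"doc_id": "doc-1", "chunk_id": "c999", "snippet": "Evidence text"},   # not retrieved
    {"doc_id": "doc-1", "chunk_id": "c1", "snippet": "Not a substring"},   # bad snippet
    {"doc_id": "doc-1", "chunk_id": "c1", "snippet": "Evidence text"},     # valid
]
errors = citation_errors(sample, {"c1"}, {"c1": "Evidence text 1"})
for message in errors:
    print(message)
```

An empty result corresponds to `validate_citations` returning `True` on the same inputs.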


### Student fill-in

1. Build a citation list from hits.
2. Validate citations against:
   - the retrieved `chunk_id` set for this request
   - a `chunks_by_id` mapping from `chunk_id` -> stored chunk text

Then test invalid cases:

- missing `chunk_id`
- `chunk_id` not in retrieved set
- `snippet` not present in chunk text

In [None]:
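hits = [
    {"doc_id": "doc-1", "chunk_id": "c1", "text": "Evidence text 1"},
]

# IDs retrieved for this request, and the stored text for each chunk.
retrieved_ids = {"c1"}
chunks_by_id = {"c1": hits[0]["text"]}

# Valid: the chunk was retrieved and the snippet is a substring of its text.
citations_ok = [{"doc_id": "doc-1", "chunk_id": "c1", "snippet": "Evidence text"}]

# Invalid: no chunk_id at all.
citations_missing_chunk = [{"doc_id": "doc-1", "snippet": "Evidence text"}]

# Invalid: chunk_id is not in the retrieved set.
citations_bad_chunk = [{"doc_id": "doc-1", "chunk_id": "c999", "snippet": "Evidence text"}]

# Invalid: snippet does not appear in the chunk text.
citations_bad_snippet = [{"doc_id": "doc-1", "chunk_id": "c1", "snippet": "Not a substring"}]

print("ok:", validate_citations(citations_ok, retrieved_ids, chunks_by_id))
print("missing_chunk:", validate_citations(citations_missing_chunk, retrieved_ids, chunks_by_id))
print("bad_chunk:", validate_citations(citations_bad_chunk, retrieved_ids, chunks_by_id))
print("bad_snippet:", validate_citations(citations_bad_snippet, retrieved_ids, chunks_by_id))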

## Self-check

- Do citations use only `chunk_id` values retrieved for this request?
- Is snippet validation deterministic (same inputs, same result, no model call)?
