# Level 2 - Week 3 - 01 Retrieval as API

**Estimated time:** 60-90 minutes

## Learning Objectives

- Define the `/search` request and response contract
- Validate requests with typed models
- Keep response fields debuggable (ids, scores, metadata)


## Overview

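Retrieval should be a first-class endpoint: exposing `/search` on its own lets you inspect and test what the retriever returns before any generation happens.
Typed request and response models make failures consistent, because malformed input is rejected the same way every time instead of surfacing as ad hoc errors deeper in the stack.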

## Practice Steps

- Implement SearchRequest and SearchResponse.
- Stub the search handler.


### Sample code

Minimal Pydantic models with constraints.


In [None]:
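from typing import Any, Dict, List, Optional

from pydantic import BaseModel, Field


class SearchRequest(BaseModel):
    query: str = Field(min_length=1)  # reject empty queries
    top_k: int = Field(default=5, ge=1, le=50)  # bound the number of results
    # Optional[...] instead of `Dict | None` so the model also imports on Python < 3.10.
    filters: Optional[Dict[str, Any]] = None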

class SearchHit(BaseModel):
    doc_id: str
    chunk_id: str
    score: float
    text: str
    metadata: Dict

class SearchResponse(BaseModel):
    query: str
    hits: List[SearchHit]


### Student fill-in

Add a stub search handler that returns empty hits.


In [None]:
def search_handler(req: SearchRequest) -> SearchResponse:
    # TODO: replace with real vector DB query
    return SearchResponse(query=req.query, hits=[])
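
To see what the constraints buy you, the sketch below feeds a few payloads to a copy of `SearchRequest` and reports which field failed. The helper `validation_errors` is a hypothetical name introduced here for illustration, and the model is repeated only so the cell runs on its own.

```python
from typing import Any, Dict, Optional

from pydantic import BaseModel, Field, ValidationError


class SearchRequest(BaseModel):
    # Same constraints as the model defined earlier, repeated so this cell is self-contained.
    query: str = Field(min_length=1)
    top_k: int = Field(default=5, ge=1, le=50)
    filters: Optional[Dict[str, Any]] = None


def validation_errors(payload: dict) -> list:
    """Return the names of the fields that failed validation (empty list if valid)."""
    try:
        SearchRequest(**payload)
    except ValidationError as exc:
        # exc.errors() is a list of dicts; "loc" is a tuple path to the offending field.
        return [str(err["loc"][0]) for err in exc.errors()]
    return []


print(validation_errors({"query": "refund policy"}))         # valid payload
print(validation_errors({"query": "", "top_k": 5}))           # empty query
print(validation_errors({"query": "refund", "top_k": 500}))   # top_k above the limit
```

Every bad request fails at the boundary with a field name attached, which is the consistency the typed models are meant to provide.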


## Self-check

- Do request fields have validation?
- Does response include chunk_id and metadata?


## Underlying theory: retrieval is nearest-neighbor search with a contract

### Retrieval as a function

In a RAG system, retrieval can be modeled as a function:

$$
R(q; \theta) \rightarrow E
$$

Where:

- $q$ is the user query text
- $\theta$ are retrieval parameters (embedding model, distance metric, index settings, filters, `top_k`)
- $E$ is the evidence set (your returned chunk hits)

Making `/search` a separate endpoint is a practical way to make $R$ observable and testable.
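
One way to read $R(q; \theta) \rightarrow E$ in code is as a function that takes the query plus an explicit parameter object. The sketch below is only an illustration: `RetrievalParams`, `Evidence`, `fake_retriever`, and the model name `example-embedder` are hypothetical names, not part of the course code.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass(frozen=True)
class RetrievalParams:
    """theta: every knob that can change what retrieval returns."""
    embedding_model: str = "example-embedder"  # hypothetical model name
    metric: str = "cosine"
    top_k: int = 5
    filters: Optional[Dict[str, Any]] = None


@dataclass(frozen=True)
class Evidence:
    """One element of the evidence set E."""
    chunk_id: str
    score: float


# R has the shape (q, theta) -> E.
Retriever = Callable[[str, RetrievalParams], List[Evidence]]


def fake_retriever(q: str, theta: RetrievalParams) -> List[Evidence]:
    """Deterministic stand-in: the same (q, theta) always yields the same E."""
    return [Evidence(chunk_id=f"chunk-{i}", score=1.0 / (i + 1)) for i in range(theta.top_k)]


theta = RetrievalParams(top_k=3)
evidence = fake_retriever("what is the refund policy?", theta)
print([e.chunk_id for e in evidence])
```

Keeping $\theta$ in one object makes it easy to log alongside the results and to rerun the same query with one knob changed.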

### Nearest-neighbor definition (what the vector DB is doing)

Let:

- $f(\cdot)$ be the embedding function mapping text to $\mathbb{R}^d$
- $s(\cdot,\cdot)$ be a similarity score (larger is more similar) or a distance $d(\cdot,\cdot)$ (smaller is closer)

Compute:

$$
\mathbf{q} = f(q), \quad \mathbf{x}_i = f(e_i)
$$

Then top-k retrieval returns the $k$ evidence items with the best scores. For a similarity, sort in descending order:

$$
\mathrm{TopK}(q) = \operatorname{argsort}_{i} \; \bigl(-s(\mathbf{q}, \mathbf{x}_i)\bigr) \; [1:k]
$$

Or, for a distance, sort in ascending order:

$$
\mathrm{TopK}(q) = \operatorname{argsort}_{i} \; d(\mathbf{q}, \mathbf{x}_i) \; [1:k]
$$

Key implication:

- retrieval returns what is *numerically close* under your embedding + metric, not what is *factually correct*
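
The definitions above can be sketched as a brute-force search over a handful of toy vectors. This is a minimal illustration using cosine similarity and hand-written 2-d "embeddings"; a real vector store typically uses an approximate index rather than scoring every item, and the names `top_k_by_similarity` and `top_k_by_distance` are made up for this example.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k_by_similarity(q_vec: List[float], vectors: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Score every item, sort by similarity descending, keep the first k."""
    scored = [(chunk_id, cosine_similarity(q_vec, x)) for chunk_id, x in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def top_k_by_distance(q_vec: List[float], vectors: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Same search expressed as a distance (1 - cosine similarity), sorted ascending."""
    scored = [(chunk_id, 1.0 - cosine_similarity(q_vec, x)) for chunk_id, x in vectors.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy vectors standing in for f(e_i); a real f would be an embedding model.
toy_vectors = {
    "chunk-a": [1.0, 0.0],
    "chunk-b": [0.8, 0.6],
    "chunk-c": [0.0, 1.0],
    "chunk-d": [-1.0, 0.0],
}
q_vec = [0.9, 0.1]  # stands in for f(q)

print([cid for cid, _ in top_k_by_similarity(q_vec, toy_vectors, k=2)])
print([cid for cid, _ in top_k_by_distance(q_vec, toy_vectors, k=2)])
```

Both calls produce the same ranking, because the distance used here is a monotone transform of the similarity. Nothing in the computation looks at whether a chunk is true, only at how close its vector is to the query vector.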

### Practical meaning

Your goal in Week 3 is not to maximize quality; it is to make failures attributable:

- if `/search` looks wrong, fix embedding/chunking/filters/top_k
- if `/search` looks right but `/chat` is wrong, fix context assembly, prompting, or output validation
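
The two rules above can be turned into a small triage helper for labeled examples. This is a sketch under an assumption the text does not state: that you have, for each test question, a hand-labeled set of chunk ids that should be retrieved. The function name `localize_failure` is hypothetical.

```python
from typing import Iterable, List


def localize_failure(
    expected_chunk_ids: Iterable[str],
    retrieved_chunk_ids: List[str],
    answer_is_correct: bool,
) -> str:
    """Classify one labeled example as 'ok', 'retrieval', or 'generation'."""
    if answer_is_correct:
        return "ok"
    expected = set(expected_chunk_ids)
    if not expected & set(retrieved_chunk_ids):
        # None of the chunks labeled as relevant came back from /search:
        # look at embedding, chunking, filters, or top_k.
        return "retrieval"
    # Relevant evidence was retrieved, so the problem is downstream:
    # context assembly, prompting, or output validation.
    return "generation"


print(localize_failure({"chunk-7"}, ["chunk-1", "chunk-2"], answer_is_correct=False))
print(localize_failure({"chunk-7"}, ["chunk-7", "chunk-2"], answer_is_correct=False))
```

The check is deliberately coarse (any overlap counts as "retrieval looks right"); a stricter version could require the expected chunk to appear within the first few ranks.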

## Practical usage: what to log so retrieval is debuggable

For every `/search` call, log at least:

- query text (or a hash if sensitive)
- `top_k`
- filters
- embedding model name
- distance/similarity metric used by the store
- returned hits: `rank`, `chunk_id`, `doc_id` (or source), and `score`

This is what lets you answer: “Is the system failing because retrieval is wrong, or because generation ignored good evidence?”
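
A minimal sketch of such a log record, written as one JSON line per call using only the standard library. The function name `build_search_log`, the field names, and the model name `example-embedder` are illustrative choices, not a required schema.

```python
import hashlib
import json
from typing import Any, Dict, List, Optional


def build_search_log(
    query: str,
    top_k: int,
    filters: Optional[Dict[str, Any]],
    embedding_model: str,
    metric: str,
    hits: List[Dict[str, Any]],
    hash_query: bool = False,
) -> str:
    """Return one JSON line describing a /search call."""
    if hash_query:
        # Log a digest instead of the raw text when the query may be sensitive.
        query_field = {"query_sha256": hashlib.sha256(query.encode("utf-8")).hexdigest()}
    else:
        query_field = {"query": query}
    record = {
        **query_field,
        "top_k": top_k,
        "filters": filters,
        "embedding_model": embedding_model,
        "metric": metric,
        "hits": [
            {"rank": rank, "chunk_id": h["chunk_id"], "doc_id": h["doc_id"], "score": h["score"]}
            for rank, h in enumerate(hits, start=1)
        ],
    }
    # sort_keys keeps the line stable across runs, which makes logs easy to diff.
    return json.dumps(record, sort_keys=True)


line = build_search_log(
    query="example query",
    top_k=2,
    filters=None,
    embedding_model="example-embedder",  # hypothetical model name
    metric="cosine",
    hits=[
        {"chunk_id": "chunk-0", "doc_id": "doc-0", "score": 1.0},
        {"chunk_id": "chunk-1", "doc_id": "doc-1", "score": 0.9},
    ],
    hash_query=True,
)
print(line)
```

The chunk text is left out of the record on purpose: ids and scores are usually enough to reproduce a result, and they keep log lines short.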

### Exercise 1: Deterministic retrieval stub

Implement a deterministic stub that returns `top_k` hits with stable ids and descending scores.

Goal:

- make the `/search` response shape testable without needing a real vector DB yet
- ensure repeated runs return identical results for the same input

In [None]:
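from typing import List


def retrieve_stub(query: str, top_k: int = 5) -> List[SearchHit]:
    """Return top_k fake hits with stable ids and strictly descending scores."""
    hits: List[SearchHit] = []
    for i in range(top_k):
        hits.append(
            SearchHit(
                doc_id=f"doc-{i % 2}",  # alternate between two fake documents
                chunk_id=f"chunk-{i}",
                # round() avoids floating-point noise in the printed scores.
                # Scores drop below 0 once top_k exceeds 11; fine for a stub.
                score=round(1.0 - (i * 0.1), 3),
                text=f"Stub text for query={query} (rank={i+1})",
                metadata={"source": "stub"},
            )
        )
    return hits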


req = SearchRequest(query="example query", top_k=3)
res = SearchResponse(query=req.query, hits=retrieve_stub(req.query, req.top_k))
print("n_hits:", len(res.hits))
print("top_hit:", res.hits[0].chunk_id if res.hits else None)

### Exercise 2: Minimal retrieval metrics

Compute quick metrics that help you debug retrieval regressions without reading long outputs.

At minimum, compute:

- number of hits
- min/max score

Optional:

- average score
- list of returned `chunk_id`s in rank order

In [None]:
def retrieval_metrics(hits: list[SearchHit]) -> dict:
    """Summarize a hit list so regressions are visible at a glance."""
    if not hits:
        return {"count": 0, "min_score": None, "max_score": None, "avg_score": None, "chunk_ids": []}

    scores = [h.score for h in hits]
    return {
        "count": len(hits),
        "min_score": min(scores),
        "max_score": max(scores),
        "avg_score": sum(scores) / len(scores),
        "chunk_ids": [h.chunk_id for h in hits],
    }


print(retrieval_metrics(res.hits))
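
To use the `chunk_ids` list for spotting regressions, you can diff the ranked ids from two runs of the same query. The helper below is a hypothetical sketch (the name `compare_chunk_ids` and the output keys are illustrative) that works on plain lists, so it does not depend on the models above.

```python
from typing import Dict, List


def compare_chunk_ids(baseline: List[str], current: List[str]) -> Dict[str, object]:
    """Diff two ranked chunk_id lists produced by the same query."""
    baseline_set, current_set = set(baseline), set(current)
    return {
        "same_order": baseline == current,
        "overlap": len(baseline_set & current_set),
        # Preserve rank order so the most important changes are listed first.
        "dropped": [c for c in baseline if c not in current_set],
        "added": [c for c in current if c not in baseline_set],
    }


before = ["chunk-0", "chunk-1", "chunk-2"]
after = ["chunk-0", "chunk-3", "chunk-1"]
print(compare_chunk_ids(before, after))
```

A change in `dropped` or `added` after you touch chunking or the embedding model is a cheap signal that retrieval behavior moved, before you read any generated answers.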

### Exercise 3: Add a debug-print format

Print each hit in a stable, human-debuggable way:

- rank
- score
- `doc_id`
- `chunk_id`
- short text preview

This is the same shape you’ll eventually want in logs (even if you later log JSON instead of printing).

In [None]:
def print_hits(hits: list[SearchHit], preview_chars: int = 80) -> None:
    """Print one stable, human-readable block per hit, in rank order."""
    for rank, h in enumerate(hits, start=1):
        # Truncate long chunks so each hit stays readable in a terminal.
        preview = (h.text[:preview_chars] + "...") if len(h.text) > preview_chars else h.text
        print(f"#{rank} score={h.score:.3f} doc_id={h.doc_id} chunk_id={h.chunk_id}\n{preview}\n")


print_hits(res.hits)

## Self-check

- Is your `/search` output shape stable (same inputs => same outputs)?
- Do you return enough fields to debug retrieval (`chunk_id`, `doc_id`, `score`, `metadata`, `text`)?
- If `/chat` answers are wrong, can you use `/search` output to localize the failure (retrieval vs generation)?
- Are you logging/printing the retrieval knobs (`top_k`, filters, embedding model, metric)?