# rag-evaluation-starter — Walkthrough

**Evaluate any RAG pipeline against a golden set in under 30 minutes.**

This notebook walks you through every step: loading a golden set, plugging in your retriever and generator, running the evaluation, and interpreting the results.

---

| | |
|---|---|
| 🔗 GitHub | [github.com/infrixo-systems/rag-evaluation-starter](https://github.com/infrixo-systems/rag-evaluation-starter) |
| ⏱ Time to complete | ~20 minutes |
| 🔑 API key required? | No — works with the mock retriever out of the box |

## 1. Setup

On Colab, the cell below installs the dependencies and clones the repo; when run locally, it locates the repo root instead. `sentence-transformers` is only needed for the cosine-similarity relevance metric. If you skip it, pass `no_embeddings=True` to `evaluate()`.

In [None]:
import os, sys

# ── Colab: clone the repo and move into it ───────────────────────────────
if "google.colab" in sys.modules:
    import subprocess
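    # Use the notebook's own interpreter so packages land in the active environment
    subprocess.run([sys.executable, "-m", "pip", "install", "sentence-transformers", "tiktoken", "rich", "numpy", "-q"], check=True)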
    if not os.path.exists("rag-evaluation-starter"):
        subprocess.run(["git", "clone", "https://github.com/infrixo-systems/rag-evaluation-starter.git"])
    os.chdir("rag-evaluation-starter")

# ── Local: find repo root (works whether you launch from repo root or notebook/)
else:
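    # Walk up the directory tree until we find rag_eval.py
    here = os.path.abspath(".")
    root = here
    for _ in range(5):  # check the current directory and up to 4 parents
        if os.path.exists(os.path.join(root, "rag_eval.py")):
            break
        root = os.path.dirname(root)
    else:
        root = here  # not found: stay put so the assert below fails in the original location
    os.chdir(root)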
    if root not in sys.path:
        sys.path.insert(0, root)

print("Working directory:", os.getcwd())
assert os.path.exists("rag_eval.py"), "Could not find rag_eval.py — please launch Jupyter from the repo root"
print("✅ Setup complete")


In [None]:
import json
import numpy as np
from pathlib import Path

# Import the evaluation library
from rag_eval import (
    load_golden_set,
    evaluate,
    summarise,
    print_rich_table,
    save_json,
)

print('✅ Imports OK')

---
## 2. Load and Inspect the Golden Set

A **golden set** is a hand-curated list of question → expected_answer pairs, each annotated with the source document(s) the answer should come from.

Think of it as a regression test suite for your RAG pipeline.

In [None]:
golden = load_golden_set('examples/golden_set.json')

print(f'Loaded {len(golden)} questions\n')
print(json.dumps(golden[0], indent=2))

### Golden set field reference

| Field | Required | What it's for |
|---|---|---|
| `id` | ✅ | Stable identifier — used to track regressions over time |
| `question` | ✅ | The question exactly as a user would ask it |
| `expected_answer` | ✅ | Reference answer for Exact Match and Token F1 scoring |
| `expected_source_ids` | ✅ | Which document chunk(s) should be retrieved |
| `category` | Recommended | Groups results — e.g. `billing`, `api`, `onboarding` |
| `difficulty` | Recommended | `easy` / `medium` / `hard` — lets you see where the system breaks first |
| `notes` | Optional | Internal context about why this question is in the set |
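
To make the table concrete, here is a minimal sketch of a single entry together with a required-field check. Every value below is made up for illustration and is not taken from `examples/golden_set.json`; the `missing_required` helper is hypothetical and not part of `rag_eval`.

```python
# Hypothetical golden-set entry; all values are illustrative placeholders.
REQUIRED_FIELDS = ("id", "question", "expected_answer", "expected_source_ids")

example_entry = {
    "id": "q001",
    "question": "What is the rate limit for the Vanta Billing API?",
    "expected_answer": "<reference answer goes here>",
    "expected_source_ids": ["<chunk id goes here>"],
    "category": "api",
    "difficulty": "easy",
    "notes": "Illustrative entry only",
}

def missing_required(entry: dict) -> list[str]:
    """Return the required fields that are absent or empty in an entry."""
    return [field for field in REQUIRED_FIELDS if not entry.get(field)]

print(missing_required(example_entry))                        # []
print(missing_required({"id": "q002", "question": "..."}))    # ['expected_answer', 'expected_source_ids']
```

A check like this can be run over your own golden set before evaluation, so that malformed entries surface early rather than as confusing scores.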

In [None]:
# Quick summary of the example golden set
from collections import Counter

difficulties = Counter(e.get('difficulty') for e in golden)
categories   = Counter(e.get('category')   for e in golden)

print('By difficulty:', dict(difficulties))
print('By category:  ', dict(categories))

---
## 3. Define Your Retriever

The retriever must be a function with this signature:
```python
def my_retriever(question: str) -> list[dict]:
    # Returns a list of {"text": ..., "source_id": ...} dicts
    ...
```
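
As a rough illustration of that contract, the sketch below implements the signature with a toy keyword-overlap retriever over an in-memory corpus. The corpus, the `keyword_retriever` name, and the `k` default are all assumptions for illustration; a real retriever would query a vector store.

```python
# Toy in-memory corpus; the texts and source_ids are hypothetical.
CORPUS = [
    {"text": "Invoices are issued on the first day of each month.", "source_id": "doc_invoices"},
    {"text": "API requests are rate limited per account.", "source_id": "doc_rate_limits"},
    {"text": "New users complete onboarding in three steps.", "source_id": "doc_onboarding"},
]

def keyword_retriever(question: str, k: int = 2) -> list[dict]:
    """Rank chunks by how many lowercase words they share with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        CORPUS,
        key=lambda chunk: len(q_words & set(chunk["text"].lower().split())),
        reverse=True,
    )
    return ranked[:k]

hits = keyword_retriever("How are API requests rate limited?")
print([h["source_id"] for h in hits])  # ['doc_rate_limits', 'doc_invoices']
```

The only thing the evaluation relies on is the return shape: a list of dicts, each with a `text` and a `source_id` key.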

Below are three real-world examples. **Copy the one that matches your stack.**

In [None]:
# ── Option A: LangChain + any vector store ────────────────────────────────
#
# from langchain_community.vectorstores import Chroma
# from langchain_openai import OpenAIEmbeddings
#
# vectorstore = Chroma(
#     persist_directory="./my_chroma_db",
#     embedding_function=OpenAIEmbeddings(),
# )
#
# def my_retriever(question: str) -> list[dict]:
#     docs = vectorstore.similarity_search(question, k=3)
#     return [
#         {"text": d.page_content, "source_id": d.metadata.get("source", "unknown")}
#         for d in docs
#     ]

print('LangChain retriever (commented out — uncomment and fill in your vectorstore)')

In [None]:
# ── Option B: LlamaIndex ─────────────────────────────────────────────────
#
# from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
#
# documents = SimpleDirectoryReader("./my_docs").load_data()
# index     = VectorStoreIndex.from_documents(documents)
# retriever = index.as_retriever(similarity_top_k=3)
#
# def my_retriever(question: str) -> list[dict]:
#     nodes = retriever.retrieve(question)
#     return [
#         {"text": n.text, "source_id": n.metadata.get("file_name", "unknown")}
#         for n in nodes
#     ]

print('LlamaIndex retriever (commented out — uncomment and fill in your index path)')

In [None]:
# ── Option C: Raw ChromaDB ───────────────────────────────────────────────
#
# import chromadb
# from chromadb.utils import embedding_functions
#
# client     = chromadb.PersistentClient(path="./my_chroma_db")
# collection = client.get_collection(
#     "my_docs",
#     embedding_function=embedding_functions.SentenceTransformerEmbeddingFunction()
# )
#
# def my_retriever(question: str) -> list[dict]:
#     results = collection.query(query_texts=[question], n_results=3)
#     texts   = results["documents"][0]
#     metas   = results["metadatas"][0]
#     return [
#         {"text": t, "source_id": m.get("source", "unknown")}
#         for t, m in zip(texts, metas)
#     ]

print('Raw ChromaDB retriever (commented out — uncomment and fill in your collection)')

In [None]:
# ── Default: mock retriever (works without any live system) ───────────────
from examples.mock_retriever import mock_retriever_fn as my_retriever

# Quick test
sample = my_retriever("What is the rate limit for the Vanta Billing API?")
print(f'Retrieved {len(sample)} chunks')
print('Top result source_id:', sample[0]['source_id'])
print('Top result text excerpt:', sample[0]['text'][:80], '...')

---
## 4. Define Your Generator

The generator must be a function with this signature:
```python
def my_generator(question: str, context: list[str]) -> str:
    # context is a list of retrieved text chunks
    # Returns the generated answer as a string
    ...
```
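
For a sense of the minimum that satisfies this signature, here is a toy extractive generator that simply returns the first sentence of the top chunk. The name `first_chunk_generator` and its fallback string are hypothetical; it is a placeholder for wiring tests, not a substitute for an LLM.

```python
def first_chunk_generator(question: str, context: list[str]) -> str:
    """Toy extractive generator: return the first sentence of the top chunk."""
    if not context:
        return "I don't know."
    first_sentence = context[0].split(". ")[0].strip()
    # Restore the full stop that split() removed, if there was one
    return first_sentence if first_sentence.endswith(".") else first_sentence + "."

print(first_chunk_generator(
    "How are API requests limited?",
    ["API requests are rate limited per account. Limits reset hourly."],
))  # API requests are rate limited per account.
```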

In [None]:
# ── Option A: OpenAI ─────────────────────────────────────────────────────
#
# import openai, os
# client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
#
# def my_generator(question: str, context: list[str]) -> str:
#     ctx_text = "\n\n".join(context)
#     response = client.chat.completions.create(
#         model="gpt-4o-mini",
#         messages=[
#             {"role": "system", "content": f"Answer using only the context below.\n\n{ctx_text}"},
#             {"role": "user",   "content": question},
#         ],
#         temperature=0,
#         max_tokens=300,
#     )
#     return response.choices[0].message.content

print('OpenAI generator (commented out)')

In [None]:
# ── Option B: Anthropic ──────────────────────────────────────────────────
#
# import anthropic, os
# client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
#
# def my_generator(question: str, context: list[str]) -> str:
#     ctx_text = "\n\n".join(context)
#     response = client.messages.create(
#         model="claude-haiku-4-5-20251001",
#         max_tokens=300,
#         system=f"Answer using only the context below.\n\n{ctx_text}",
#         messages=[{"role": "user", "content": question}],
#     )
#     return response.content[0].text

print('Anthropic generator (commented out)')

In [None]:
# ── Option C: HuggingFace (fully local) ──────────────────────────────────
#
# from transformers import pipeline
# pipe = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")
#
# def my_generator(question: str, context: list[str]) -> str:
#     ctx_text = "\n\n".join(context)
#     prompt = f"Context:\n{ctx_text}\n\nQuestion: {question}\n\nAnswer:"
#     # Greedy decoding keeps scores repeatable; temperature only applies when do_sample=True
#     output = pipe(prompt, max_new_tokens=200, do_sample=False)
#     return output[0]["generated_text"].split("Answer:")[-1].strip()

print('HuggingFace generator (commented out)')

In [None]:
# ── Default: mock generator ───────────────────────────────────────────────
from examples.mock_retriever import mock_generator_fn as my_generator

# Quick test: retrieve and generate for the same question
question = "What is the rate limit for the Vanta Billing API free tier?"
context  = [c['text'] for c in my_retriever(question)]
answer   = my_generator(question, context)
print('Answer:', answer)

---
## 5. Run the Evaluation

One function call. `evaluate()` loops through the golden set, calls your retriever and generator, computes all five metrics, and returns a structured list of results.
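
Conceptually, the loop looks something like the sketch below. This is a simplified illustration and not `rag_eval`'s actual implementation: `evaluate_sketch`, its flat result keys, and the fractional recall definition are all assumptions, and it computes only recall and latency.

```python
import time

def evaluate_sketch(golden_set, retriever_fn, generator_fn, k=3):
    """Simplified loop: retrieve, generate, score recall@k, and time each question."""
    results = []
    for entry in golden_set:
        start = time.perf_counter()
        chunks = retriever_fn(entry["question"])[:k]
        answer = generator_fn(entry["question"], [c["text"] for c in chunks])
        latency_s = time.perf_counter() - start

        # Assumed definition: fraction of expected sources found in the top k
        expected = set(entry["expected_source_ids"])
        retrieved = {c["source_id"] for c in chunks}
        recall = len(expected & retrieved) / len(expected) if expected else 0.0

        results.append({
            "id": entry["id"],
            "generated_answer": answer,
            "recall_at_k": recall,
            "latency_s": latency_s,
        })
    return results

# Stub pipeline (hypothetical) to show the call shape
demo_golden = [{"id": "q1", "question": "Q?", "expected_source_ids": ["a"]}]
demo_results = evaluate_sketch(
    demo_golden,
    retriever_fn=lambda q: [{"text": "chunk", "source_id": "a"}],
    generator_fn=lambda q, ctx: ctx[0],
)
print(demo_results[0]["recall_at_k"])  # 1.0
```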

In [None]:
results = evaluate(
    golden_set   = golden,
    retriever_fn = my_retriever,
    generator_fn = my_generator,
    k            = 3,           # top-K for retrieval recall
    no_embeddings= True,        # set False if sentence-transformers is installed
)

summary = summarise(results)
print(f'Evaluated {summary["n_questions"]} questions')
print('Mean scores:', json.dumps(summary['mean_scores'], indent=2))

---
## 6. Inspect Results

### 6a. Rich console table

The table shows per-question scores colour-coded as **PASS** (green), **WARN** (amber), **FAIL** (red), with a summary row at the bottom.

In [None]:
print_rich_table(results, summary)

### 6b. Filter by category and difficulty

In [None]:
# Pull out just the result rows (excludes the _summary sentinel)
rows = [r for r in results if not r.get('_summary')]

# Filter to hard questions only
hard = [r for r in rows if r.get('difficulty') == 'hard']
print(f'Hard questions: {len(hard)}')
for r in hard:
    f1 = r['metrics']['token_f1']['score']
    recall = r['metrics']['retrieval_recall_at_k']['score']
    print(f"  {r['id']} | Recall@3={recall:.2f} | Token F1={f1:.2f} | Q: {r['question'][:60]}...")

In [None]:
# Mean Token F1 by category
from collections import defaultdict

cat_scores = defaultdict(list)
for r in rows:
    cat_scores[r.get('category', 'unknown')].append(r['metrics']['token_f1']['score'])

print('Mean Token F1 by category:')
for cat, scores in sorted(cat_scores.items()):
    print(f'  {cat:<15} {sum(scores)/len(scores):.3f}  (n={len(scores)})')

---
## 7. Interpret Your Scores

Raw numbers without context are hard to act on. Here's what each metric actually means:

---

### Retrieval Recall@K
**What it measures:** Did the right chunk come back in the top-K results?

| Score | What it means | What to do |
|---|---|---|
| 0.8 – 1.0 | Retrieval is working well | Focus on generation quality |
| 0.5 – 0.8 | Some queries miss the right chunk | Check chunk size, embedding model |
| < 0.5 | Retrieval is broken for many queries | Fix before touching generation |

**A Recall@3 of 0.6 means:** roughly 40% of expected sources were missing from the top 3 results (40% of questions, if each has a single expected source). For those questions your generator is being asked to answer from the wrong context, and no amount of prompt-tuning will fix that.
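
The arithmetic behind a 0.6 can be sketched as follows. This assumes per-question recall is the fraction of expected source IDs found in the top K, averaged over questions; `rag_eval` may define it differently (for example as a binary hit), and `recall_at_k` here is a hypothetical helper with made-up IDs.

```python
def recall_at_k(retrieved_ids: list[str], expected_ids: list[str], k: int = 3) -> float:
    """Fraction of expected source IDs that appear in the top-k retrieved IDs."""
    if not expected_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return sum(1 for e in expected_ids if e in top_k) / len(expected_ids)

# Five hypothetical questions, one expected source each: (retrieved, expected)
runs = [
    (["a", "b", "c"], ["a"]),  # hit
    (["x", "y", "z"], ["b"]),  # miss
    (["c", "d", "e"], ["e"]),  # hit
    (["f", "g", "h"], ["q"]),  # miss
    (["i", "j", "k"], ["j"]),  # hit
]
scores = [recall_at_k(retrieved, expected) for retrieved, expected in runs]
mean_recall = sum(scores) / len(scores)
print(mean_recall)  # 0.6
```

Three hits out of five gives 0.6, so two of the five questions reach the generator without their supporting chunk.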

---

### Answer Faithfulness
**What it measures:** Is the answer grounded in the retrieved context?

Low faithfulness + high retrieval recall → your generator is **hallucinating** even when the right context is present.
Low faithfulness + low retrieval recall → the generator is making things up because it received bad context.
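
Those two readings can be written down as a small triage helper. The `diagnose` function and its 0.5 threshold are hypothetical choices for illustration; pick a cutoff that matches how your own scores are distributed.

```python
def diagnose(recall: float, faithfulness: float, threshold: float = 0.5) -> str:
    """Map a (recall, faithfulness) pair to the likely failure mode."""
    if faithfulness >= threshold:
        return "grounded"
    if recall >= threshold:
        return "generator hallucinating despite good context"
    return "bad context: fix retrieval first"

print(diagnose(recall=0.9, faithfulness=0.3))  # generator hallucinating despite good context
print(diagnose(recall=0.2, faithfulness=0.3))  # bad context: fix retrieval first
```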

---

### Answer Relevance (cosine similarity)
**What it measures:** Is the answer semantically on-topic for the question?

A system can retrieve the right chunks and generate a grounded answer that still doesn't actually address the question. This metric catches that.
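
As a reminder of what the similarity itself computes, here is a minimal sketch using plain Python lists. The three-dimensional vectors are made-up stand-ins for real embeddings, which would normally come from a model such as one loaded via `sentence-transformers`.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy "embeddings" (hypothetical values, not real model output)
question_vec = [1.0, 0.0, 1.0]
on_topic_vec = [0.9, 0.1, 0.8]
off_topic_vec = [0.0, 1.0, 0.0]

on_topic = cosine_similarity(question_vec, on_topic_vec)
off_topic = cosine_similarity(question_vec, off_topic_vec)
print(on_topic > off_topic)  # True
```

An answer whose embedding points in nearly the same direction as the question's scores close to 1; an unrelated answer scores close to 0.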

---

### Token F1 / Exact Match
**What it measures:** How closely does the generated answer match the reference answer?

Most useful for **factual Q&A** where the answer is a specific value (a number, a date, a policy rule). Less useful for open-ended questions where multiple phrasings are valid.
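
For reference, a common way to compute these two scores is sketched below. The normalisation (lowercasing and stripping punctuation) is an assumption and `rag_eval` may normalise differently; the function names and example strings are illustrative.

```python
import re
from collections import Counter

def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, and split on whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalised token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("30 days.", "30 days"))                                          # 1.0
print(round(token_f1("Refunds are issued within 30 days", "within 30 days"), 3))   # 0.667
```

The second example shows why verbose answers lose points: all three reference tokens are present (recall 1.0), but only half of the six predicted tokens match (precision 0.5).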

---

### Latency + Cost
**What it measures:** Time per query, plus an estimated token cost when the `--llm-judge` flag is used.

Use this to catch regressions before deploying — a new embedding model might improve recall but triple latency.
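
If you want to time a component in isolation, a wrapper along these lines is one option. `timed` and `slow_echo` are hypothetical names, and the 10 ms sleep merely stands in for a real retriever or generator call.

```python
import time
from statistics import mean

def timed(fn):
    """Wrap a callable so each call returns (result, elapsed_seconds)."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        return result, time.perf_counter() - start
    return wrapper

def slow_echo(question: str) -> str:
    """Hypothetical stand-in for a retriever or generator call."""
    time.sleep(0.01)
    return question

latencies = [timed(slow_echo)(q)[1] for q in ["q1", "q2", "q3"]]
print(f"mean latency: {mean(latencies) * 1000:.1f} ms")
```

Comparing the mean (or a high percentile) before and after a change gives a quick signal on whether an accuracy gain came with a latency cost.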

In [None]:
# Deep-dive into a single result
# q005: the hard backdating question. Look it up by id, falling back to position
r = next((x for x in rows if x['id'] == 'q005'), rows[4])

print(f"Question:  {r['question']}")
print(f"Expected:  {r['expected_answer'][:80]}...")
print(f"Generated: {r['generated_answer'][:80]}...")
print()
print('Retrieved sources:')
for chunk in r['retrieved_chunks']:
    print(f"  [{chunk['source_id']}] {chunk['text'][:60]}...")
print()
for metric, data in r['metrics'].items():
    score = data['score']
    v     = data['verdict']
    print(f"  {metric:<30} {score!s:>6}  [{v}]")

---
## 8. Save Results

In [None]:
save_json(results, summary, 'my_results.json')
print('Saved → my_results.json')

# Also save as CSV
from rag_eval import save_csv
save_csv(results, 'my_results.csv')
print('Saved → my_results.csv')

---
## 9. Next Steps

### If Retrieval Recall is low
- Try smaller chunk sizes (512 → 256 tokens)
- Try a better embedding model (`BAAI/bge-large-en-v1.5` often beats MiniLM)
- Add a reranker (CrossEncoder) on top of your retriever
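
The reranker idea can be sketched as a wrapper that keeps the retriever signature from section 3 intact. `with_reranker`, `overlap_score`, and `wide_retriever` are hypothetical names; in practice the scoring function would typically call a cross-encoder model rather than count shared words.

```python
from typing import Callable

def with_reranker(
    retriever_fn: Callable[[str], list[dict]],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
) -> Callable[[str], list[dict]]:
    """Return a retriever that re-orders candidates by score_fn(question, text)."""
    def reranked(question: str) -> list[dict]:
        candidates = retriever_fn(question)
        ranked = sorted(candidates, key=lambda c: score_fn(question, c["text"]), reverse=True)
        return ranked[:top_n]
    return reranked

def overlap_score(question: str, text: str) -> float:
    """Toy scorer: number of lowercase words shared by question and text."""
    return float(len(set(question.lower().split()) & set(text.lower().split())))

def wide_retriever(question: str) -> list[dict]:
    """Stub first-stage retriever that returns a fixed, poorly ordered list."""
    return [
        {"text": "Onboarding takes three steps", "source_id": "doc_a"},
        {"text": "Refunds are processed within 30 days", "source_id": "doc_b"},
    ]

reranked_retriever = with_reranker(wide_retriever, overlap_score, top_n=1)
top = reranked_retriever("when are refunds processed")
print(top[0]["source_id"])  # doc_b
```

Because the wrapped function still takes a question and returns a list of `{"text", "source_id"}` dicts, it can be passed to `evaluate()` as `retriever_fn` without other changes.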

### If Faithfulness is low
- Strengthen your system prompt: *"Answer ONLY using the context below. If the answer is not in the context, say you don't know."*
- Reduce `max_tokens` so the model has less room to pad its answer with unsupported claims

### If Token F1 is low on hard questions
- Review those questions in your golden set — some may need updated expected answers
- Consider adding a reranker to surface the most relevant chunks first

### When you've outgrown this script
This tool is designed for diagnostic evaluation against a fixed golden set. When you need continuous eval in production, look at:
- [Ragas](https://docs.ragas.io) — comprehensive RAG metrics framework
- [LangSmith](https://smith.langchain.com) — evaluation + tracing for LangChain apps
- [TruLens](https://www.trulens.org) — feedback functions for LLM apps

---

*If your scores are consistently low and you're not sure why, this is the kind of thing we look at in a Foundation Check — [infrixo.com/start](https://infrixo.com/start)*