## 06 - RAG Evaluation

This notebook evaluates the quality of our RAG system using an LLM-as-judge approach.
We score each query on:
- retrieval_relevance (1 - 5): are retrieved chunks relevant to the question?
- answer_relevance (1 - 5): does the answer address the question?
- faithfulness (1 - 5): is the answer grounded in the retrieved sources?

The notebook creates the evaluation Delta table, inspects its schema, and demonstrates how to query evaluation metrics. All reusable evaluation logic (LLM-as-judge, scoring, logging) lives in `00_utils.ipynb`.
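
The judging logic itself lives in `00_utils.ipynb` and is not shown in this notebook. As a rough sketch of how a judge response might be validated against the 1-5 scale, assuming the judge returns a JSON object with the three score fields plus a `notes` string (the `parse_judge_scores` helper is hypothetical, not the notebook's actual implementation):

```python
import json

SCORE_FIELDS = ("retrieval_relevance", "answer_relevance", "faithfulness")

def parse_judge_scores(raw: str) -> dict:
    """Parse a judge response and check that each score is an int in 1..5.

    Hypothetical helper: the raw JSON format is an assumption.
    """
    data = json.loads(raw)
    scores = {}
    for field in SCORE_FIELDS:
        value = data.get(field)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{field} must be an integer, got {value!r}")
        if not 1 <= value <= 5:
            raise ValueError(f"{field} must be between 1 and 5, got {value}")
        scores[field] = value
    # Notes are free text; default to an empty string when absent.
    scores["notes"] = str(data.get("notes", ""))
    return scores
```

Rejecting out-of-range or non-integer values before writing keeps the `INT` score columns meaningful when averaged later.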

### Design notes

The evaluation table (`RAG_EVAL_TABLE`) enables both offline and online quality monitoring of the RAG system.

Each query is scored on:
- Retrieval relevance — did we fetch the right info?
- Answer relevance — did the model answer the question?
- Faithfulness — is the answer grounded in the retrieved sources?

This allows:
- Regression detection
- Model comparison
- Retriever A/B tests
- Prompt iteration tracking
- Hallucination analysis
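
As one illustration of regression detection, the sketch below flags days whose average score drops well below the mean of the preceding days. It assumes the daily averages have already been pulled into a Python list of `(day, avg_score)` pairs; the `find_regressions` name, the window size, and the drop threshold are all hypothetical choices, not part of the notebook.

```python
from statistics import mean

def find_regressions(daily_avgs, window=3, max_drop=0.5):
    """Return the days whose average score falls more than `max_drop`
    below the mean of the preceding `window` days.

    daily_avgs: list of (day, avg_score) pairs sorted by day (assumed input).
    """
    flagged = []
    for i in range(window, len(daily_avgs)):
        day, score = daily_avgs[i]
        # Baseline is the mean of the `window` days just before this one.
        baseline = mean(s for _, s in daily_avgs[i - window:i])
        if baseline - score > max_drop:
            flagged.append(day)
    return flagged

daily = [("d1", 4.0), ("d2", 4.2), ("d3", 4.1), ("d4", 3.0), ("d5", 4.0)]
print(find_regressions(daily))  # ['d4']
```

A rolling baseline is only one option; comparing against a fixed reference period would suit release-gating better than continuous monitoring.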

Evaluation is intentionally separated from serving so that:
- We can batch-evaluate
- We can re-score old answers
- We can add human labels later
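
Because each evaluation row records its `evaluator`, a new judge version can re-score queries that an older version already covered. A minimal sketch of picking which logged queries still need scoring by a given evaluator, assuming the IDs have been collected into plain Python lists (the `pending_query_ids` helper and the evaluator names are hypothetical):

```python
def pending_query_ids(logged_ids, existing_evals, evaluator):
    """Return logged query IDs that `evaluator` has not scored yet.

    logged_ids:     iterable of query_id strings from the log table
    existing_evals: iterable of (query_id, evaluator) pairs from the eval table
    """
    already_scored = {qid for qid, ev in existing_evals if ev == evaluator}
    # Preserve log order so batches are processed predictably.
    return [qid for qid in logged_ids if qid not in already_scored]

logged = ["q1", "q2", "q3"]
evals = [("q1", "llm_judge_v1"), ("q2", "llm_judge_v1"), ("q1", "llm_judge_v2")]
print(pending_query_ids(logged, evals, "llm_judge_v2"))  # ['q2', 'q3']
```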

In [0]:
%run ./00_install_deps_and_restart

In [0]:
%run ./00_constants

In [0]:
%run ./00_utils

In [0]:
# Create the evaluation results table (one row per scored query)

spark.sql(f"""
CREATE TABLE IF NOT EXISTS {RAG_EVAL_TABLE} (
  evaluation_id STRING,
  query_id STRING,
  question STRING,
  answer STRING,
  retrieval_relevance INT,
  answer_relevance INT,
  faithfulness INT,
  evaluator STRING,
  notes STRING,
  created_at TIMESTAMP
)
USING DELTA
""")


DataFrame[]

In [0]:
spark.table(RAG_EVAL_TABLE).printSchema()

root
 |-- evaluation_id: string (nullable = true)
 |-- query_id: string (nullable = true)
 |-- question: string (nullable = true)
 |-- answer: string (nullable = true)
 |-- retrieval_relevance: integer (nullable = true)
 |-- answer_relevance: integer (nullable = true)
 |-- faithfulness: integer (nullable = true)
 |-- evaluator: string (nullable = true)
 |-- notes: string (nullable = true)
 |-- created_at: timestamp (nullable = true)





In [0]:
# Inspect recent evaluations

spark.sql(f"""
SELECT
  created_at,
  query_id,
  retrieval_relevance,
  answer_relevance,
  faithfulness,
  evaluator
FROM {RAG_EVAL_TABLE}
ORDER BY created_at DESC
LIMIT 20
""").display()

created_at,query_id,retrieval_relevance,answer_relevance,faithfulness,evaluator


In [0]:
# Aggregate metrics

spark.sql(f"""
SELECT
  count(*) AS n,
  avg(retrieval_relevance) AS avg_retrieval,
  avg(answer_relevance) AS avg_answer,
  avg(faithfulness) AS avg_faithfulness
FROM {RAG_EVAL_TABLE}
""").display()

n,avg_retrieval,avg_answer,avg_faithfulness
0,,,


In [0]:
# Trend over time

spark.sql(f"""
SELECT
  date_trunc('day', created_at) AS day,
  avg(retrieval_relevance) AS avg_retrieval,
  avg(answer_relevance) AS avg_answer,
  avg(faithfulness) AS avg_faithfulness
FROM {RAG_EVAL_TABLE}
GROUP BY 1
ORDER BY 1
""").display()

day,avg_retrieval,avg_answer,avg_faithfulness


In [0]:
# Join logs + evaluations (root cause analysis)

spark.sql(f"""
SELECT
  e.created_at,
  l.question,
  e.retrieval_relevance,
  e.answer_relevance,
  e.faithfulness,
  l.retrieved_chunks[0].url AS top_source,
  e.notes
FROM {RAG_EVAL_TABLE} e
JOIN {RAG_LOG_TABLE} l
  ON e.query_id = l.query_id
ORDER BY e.created_at DESC
LIMIT 20
""").display()

created_at,question,retrieval_relevance,answer_relevance,faithfulness,top_source,notes


In [0]:
%run ./00_init_openai_client

In [0]:
from pyspark.sql import functions as F

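def enrich_chunks_with_text(retrieved_chunks_py: list[dict]) -> list[dict]:
    """Fill in chunk_text for each retrieved chunk by looking it up in chunks_df.

    Relies on the notebook-level chunks_df defined further down in this cell,
    which must exist before this function is called.
    Mutates and returns the input list.
    """
    chunk_ids = [c["chunk_id"] for c in retrieved_chunks_py if c.get("chunk_id")]
    if not chunk_ids:
        return retrieved_chunks_py

    # Fetch only the rows for the chunk IDs we need, then build an id -> text map.
    rows = (
        chunks_df
        .where(F.col("chunk_id").isin(chunk_ids))
        .select("chunk_id", "chunk_text")
        .collect()
    )
    text_map = {r["chunk_id"]: r["chunk_text"] for r in rows}

    for c in retrieved_chunks_py:
        cid = c.get("chunk_id")
        # Chunks not found in chunks_df keep an empty chunk_text.
        c["chunk_text"] = text_map.get(cid, "")
    return retrieved_chunks_py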

# Now rerun the evaluation properly:
logs_df = spark.table(RAG_LOG_TABLE)

recent_logs_5 = (
    logs_df
    .orderBy(F.col("created_at").desc())
    .limit(5)
    .collect()
)

chunks_df = spark.table(CHUNKS_TABLE).select("chunk_id", "chunk_text")

for r in recent_logs_5:
    query_id = r["query_id"]
    question = r["question"]
    answer = r["answer"]

    retrieved_chunks_py = []
    for c in r["retrieved_chunks"]:
        retrieved_chunks_py.append({
            "chunk_id": c["chunk_id"],
            "doc_id": c["doc_id"],
            "title": c["title"],
            "url": c["url"],
            "chunk_index": c["chunk_index"],
            "category": c["category"],
            "score": c["score"],
            # The log table schema for retrieved_chunks does not include chunk_text;
            # without it the judge has no real evidence to assess faithfulness.
            "chunk_text": ""  # enriched below
        })

    retrieved_chunks_py = enrich_chunks_with_text(retrieved_chunks_py)

    scores = judge_rag(question, answer, retrieved_chunks_py)
    eval_id = write_evaluation(query_id, question, answer, scores, evaluator="llm_judge_v1_with_text")

    print("Evaluated:", query_id, scores)

Evaluated: e26f88a9-6388-4875-be1d-2852a31fe1d8 {'retrieval_relevance': 2, 'answer_relevance': 5, 'faithfulness': 3, 'notes': 'The answer provides a detailed comparison between normal Azure VMs and ephemeral VMs, addressing the question effectively. However, the retrieved excerpts do not directly support the answer, leading to some concerns about faithfulness.'}
Evaluated: 6bb2bad3-7cc5-4805-b332-6fa0ffb75219 {'retrieval_relevance': 2, 'answer_relevance': 5, 'faithfulness': 5, 'notes': 'The answer provides a detailed comparison between normal Azure VMs and ephemeral VMs, addressing the question effectively. However, the retrieved excerpts do not contain relevant information directly related to the differences between normal and ephemeral VMs, which affects the retrieval relevance score.'}
Evaluated: 0ff22b09-ebc9-49e9-b32f-cf25a2c4ca6e {'retrieval_relevance': 2, 'answer_relevance': 5, 'faithfulness': 4, 'notes': 'The retrieved excerpts do not directly address the differences between no

[Trace(request_id=tr-837f26e601634d52914a048833188c73), Trace(request_id=tr-89d4e4935cca4bccb9951b033f6ffb0d), Trace(request_id=tr-d875238fb73b4d66aa9beaa89bdce308), Trace(request_id=tr-5cd48139636d4dec855b586af394f31e)]

In [0]:
# View evaluation summary in SQL

spark.sql(f"""
SELECT
  count(*) AS n,
  avg(retrieval_relevance) AS avg_retrieval_relevance,
  avg(answer_relevance) AS avg_answer_relevance,
  avg(faithfulness) AS avg_faithfulness
FROM {RAG_EVAL_TABLE}
""").display()

n,avg_retrieval_relevance,avg_answer_relevance,avg_faithfulness
4,2.0,5.0,3.75


In [0]:
# Show latest evaluations:

spark.sql(f"""
SELECT
  created_at,
  query_id,
  retrieval_relevance,
  answer_relevance,
  faithfulness,
  notes
FROM {RAG_EVAL_TABLE}
ORDER BY created_at DESC
LIMIT 20
""").display()

created_at,query_id,retrieval_relevance,answer_relevance,faithfulness,notes
2026-01-11T23:26:35.219874Z,0e2132c9-3324-4f02-b24d-a4dc628bc924,2,5,3,"The answer provides a detailed comparison between normal Azure VMs and ephemeral VMs, addressing the question well. However, the retrieved excerpts do not directly support the answer, leading to a lower faithfulness score. The excerpts focus on specific VM series and configurations without directly discussing the differences between normal and ephemeral VMs."
2026-01-11T23:26:31.821616Z,0ff22b09-ebc9-49e9-b32f-cf25a2c4ca6e,2,5,4,"The retrieved excerpts do not directly address the differences between normal Azure VMs and ephemeral VMs, which affects the relevance of the retrieval. However, the answer provided is comprehensive and accurately describes the differences based on general knowledge of Azure VMs. The answer is mostly faithful to the context of ephemeral VMs, but it lacks direct support from the excerpts."
2026-01-11T23:26:28.125017Z,6bb2bad3-7cc5-4805-b332-6fa0ffb75219,2,5,5,"The answer provides a detailed comparison between normal Azure VMs and ephemeral VMs, addressing the question effectively. However, the retrieved excerpts do not contain relevant information directly related to the differences between normal and ephemeral VMs, which affects the retrieval relevance score."
2026-01-11T23:26:24.138362Z,e26f88a9-6388-4875-be1d-2852a31fe1d8,2,5,3,"The answer provides a detailed comparison between normal Azure VMs and ephemeral VMs, addressing the question effectively. However, the retrieved excerpts do not directly support the answer, leading to some concerns about faithfulness."
