## 06 - RAG Evaluation

This notebook evaluates the quality of our RAG system using an LLM-as-judge approach.
We score each query on:
- retrieval_relevance (1 - 5): are retrieved chunks relevant to the question?
- answer_relevance (1 - 5): does the answer address the question?
- faithfulness (1 - 5): is the answer grounded in the retrieved sources?

The notebook creates the evaluation Delta table, inspects its schema, scores recent queries, and demonstrates how to query the resulting metrics. All reusable evaluation logic (LLM-as-judge, scoring, logging) lives in `00_utils.ipynb`.
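
The real judge logic lives in `00_utils.ipynb` and is not shown here. As a minimal sketch of the scoring contract described above, the following assumes the judge replies with a JSON object containing the three metrics plus a `notes` string, and rejects anything outside the 1 to 5 range. The helper name `parse_judge_scores` is hypothetical.

```python
import json

METRICS = ("retrieval_relevance", "answer_relevance", "faithfulness")


def parse_judge_scores(raw: str) -> dict:
    """Parse a judge reply (assumed to be JSON) and validate each 1-5 score."""
    data = json.loads(raw)
    scores = {}
    for metric in METRICS:
        value = data.get(metric)
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{metric} must be an integer from 1 to 5, got {value!r}")
        scores[metric] = value
    scores["notes"] = str(data.get("notes", ""))
    return scores


# Illustrative reply in the assumed format
example = (
    '{"retrieval_relevance": 2, "answer_relevance": 5, '
    '"faithfulness": 3, "notes": "Excerpts do not cover ephemeral VMs."}'
)
print(parse_judge_scores(example))
```

Validating before writing keeps malformed judge output out of the integer score columns of the evaluation table.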

### Design notes

The evaluation table enables both offline and online quality monitoring of the RAG system.

Each query is scored on:
- Retrieval relevance — did we fetch the right info?
- Answer relevance — did the model answer the question?
- Faithfulness — is the answer grounded in the retrieved sources?

This allows:
- Regression detection
- Model comparison
- Retriever A/B tests
- Prompt iteration tracking
- Hallucination analysis

Evaluation is intentionally separated from serving so that:
- We can batch-evaluate
- We can re-score old answers
- We can add human labels later
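
Because evaluation is decoupled from serving, re-scoring amounts to finding logged queries that a given evaluator has not judged yet. The sketch below shows that selection on plain Python data. The function name `pending_query_ids` and the short ids are hypothetical; in the notebook the same idea would more likely be an anti-join between the log and evaluation tables.

```python
def pending_query_ids(log_query_ids, evaluations, evaluator_name):
    """Return logged query ids with no evaluation from evaluator_name, in log order."""
    done = {e["query_id"] for e in evaluations if e["evaluator"] == evaluator_name}
    return [qid for qid in log_query_ids if qid not in done]


# Placeholder ids for illustration only
log_ids = ["q1", "q2", "q3"]
evals = [
    {"query_id": "q1", "evaluator": "llm_judge_v1_with_text"},
    {"query_id": "q2", "evaluator": "llm_judge_with_text"},
]

# q1 already has a v1 score; q2 was only scored by the older evaluator
print(pending_query_ids(log_ids, evals, "llm_judge_v1_with_text"))
```

Keying on the evaluator name means a new judge prompt or a human labeller can score the same historical answers without overwriting earlier rows.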

In [0]:
%run ./00_install_deps_and_restart

In [0]:
%run ./00_constants

In [0]:
%run ./00_utils

In [0]:
%run ./00_init_openai_client

In [0]:
import mlflow
# Disable mlflow autologging
mlflow.autolog(disable=True)
mlflow.openai.autolog(disable=True)

In [0]:
# Create the RAG evaluation table if it does not already exist

ensure_rag_eval_table(spark, RAG_EVAL_TABLE)

✅ RAG evaluation table ensured: databricks_rag_demo.default.rag_evaluations


In [0]:
spark.table(RAG_EVAL_TABLE).printSchema()

root
 |-- evaluation_id: string (nullable = true)
 |-- query_id: string (nullable = true)
 |-- question: string (nullable = true)
 |-- answer: string (nullable = true)
 |-- retrieval_relevance: integer (nullable = true)
 |-- answer_relevance: integer (nullable = true)
 |-- faithfulness: integer (nullable = true)
 |-- evaluator: string (nullable = true)
 |-- notes: string (nullable = true)
 |-- created_at: timestamp (nullable = true)



In [0]:
evaluate_recent_logs(
    spark=spark,
    rag_log_table=RAG_LOG_TABLE,
    chunks_table=CHUNKS_TABLE,
    n=5,
    judge_fn=judge_rag,
    write_eval_fn=write_evaluation,
    evaluator_name="llm_judge_v1_with_text"
)

✅ Evaluated: 827800ec-0f48-42e2-8ba9-77686e23eba4 {'retrieval_relevance': 5, 'answer_relevance': 5, 'faithfulness': 5, 'notes': 'The retrieved excerpts provide comprehensive information about the differences between Spot VMs and standard VMs, including cost, eviction policy, SLA, workload suitability, and quota management, which directly supports the answer given.'}
✅ Evaluated: 603f8671-d355-4e73-a3bf-b60971f83562 {'retrieval_relevance': 5, 'answer_relevance': 5, 'faithfulness': 5, 'notes': 'The retrieved excerpts directly support the answer provided, detailing how Azure handles VM disk persistence and the settings that can be configured to manage this behavior.'}
✅ Evaluated: bae458ca-5555-4cf2-b40b-a357c3670e78 {'retrieval_relevance': 5, 'answer_relevance': 5, 'faithfulness': 5, 'notes': 'The retrieved excerpts provide relevant information about Azure VM Scale Sets, including their features and management, which directly supports the answer provided.'}
✅ Evaluated: 5fecd526-9c6b-4af
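
The implementation of `evaluate_recent_logs` is in `00_utils.ipynb` and is not reproduced here. The sketch below only illustrates the dependency-injection pattern visible in the call: the judge and the writer are passed in as functions, so either can be swapped or stubbed. The function name `evaluate_logs_sketch`, the row fields, and the signatures of the judge and writer are all assumptions, and an in-memory list stands in for the Delta table.

```python
def evaluate_logs_sketch(logs, judge_fn, write_eval_fn, evaluator_name, n=5):
    """Judge the n most recent log rows and pass each result to write_eval_fn."""
    # ISO-8601 UTC timestamps sort correctly as plain strings
    recent = sorted(logs, key=lambda row: row["created_at"], reverse=True)[:n]
    evaluated = []
    for row in recent:
        scores = judge_fn(row["question"], row["answer"], row["retrieved_chunks"])
        write_eval_fn(query_id=row["query_id"], evaluator=evaluator_name, **scores)
        evaluated.append(row["query_id"])
    return evaluated


def stub_judge(question, answer, retrieved_chunks):
    # Stand-in for the LLM call; returns fixed placeholder scores
    return {"retrieval_relevance": 3, "answer_relevance": 3,
            "faithfulness": 3, "notes": "stub"}


written = []  # in-memory stand-in for the evaluation table


def collect(**record):
    written.append(record)


def make_log(query_id, created_at):
    # Placeholder log row; real rows carry the question, answer and chunks
    return {"query_id": query_id, "created_at": created_at,
            "question": "placeholder", "answer": "placeholder",
            "retrieved_chunks": []}


logs = [
    make_log("q1", "2026-01-15T06:00:00Z"),
    make_log("q2", "2026-01-15T06:05:00Z"),
    make_log("q3", "2026-01-15T06:10:00Z"),
]

evaluated = evaluate_logs_sketch(logs, stub_judge, collect, "stub_judge_v0", n=2)
print(evaluated)
```

Passing a stub judge like this makes it possible to test the evaluation plumbing without spending tokens on real LLM calls.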


In [0]:
# Inspect recent evaluations

spark.sql(f"""
SELECT
  created_at,
  query_id,
  retrieval_relevance,
  answer_relevance,
  faithfulness,
  evaluator
FROM {RAG_EVAL_TABLE}
ORDER BY created_at DESC
LIMIT 20
""").display()

created_at,query_id,retrieval_relevance,answer_relevance,faithfulness,evaluator
2026-01-15T06:13:19.304822Z,aa9f09f7-ffca-47b8-a349-98b7cc2cddd8,2,5,3,llm_judge_v1_with_text
2026-01-15T06:13:16.267989Z,5fecd526-9c6b-4afd-b5b6-d381c44f149c,2,5,2,llm_judge_v1_with_text
2026-01-15T06:13:13.362384Z,bae458ca-5555-4cf2-b40b-a357c3670e78,5,5,5,llm_judge_v1_with_text
2026-01-15T06:13:10.524493Z,603f8671-d355-4e73-a3bf-b60971f83562,5,5,5,llm_judge_v1_with_text
2026-01-15T06:13:07.843522Z,827800ec-0f48-42e2-8ba9-77686e23eba4,5,5,5,llm_judge_v1_with_text
2026-01-15T06:12:21.049506Z,827800ec-0f48-42e2-8ba9-77686e23eba4,5,5,5,llm_judge_with_text
2026-01-15T06:12:11.321303Z,603f8671-d355-4e73-a3bf-b60971f83562,5,5,5,llm_judge_with_text
2026-01-15T06:12:03.806685Z,bae458ca-5555-4cf2-b40b-a357c3670e78,5,5,5,llm_judge_with_text
2026-01-15T06:11:55.097941Z,5fecd526-9c6b-4afd-b5b6-d381c44f149c,2,5,2,llm_judge_with_text
2026-01-15T06:11:43.046162Z,aa9f09f7-ffca-47b8-a349-98b7cc2cddd8,2,5,3,llm_judge_with_text


In [0]:
# Aggregate metrics

spark.sql(f"""
SELECT
  count(*) AS n,
  avg(retrieval_relevance) AS avg_retrieval,
  avg(answer_relevance) AS avg_answer,
  avg(faithfulness) AS avg_faithfulness
FROM {RAG_EVAL_TABLE}
""").display()

n,avg_retrieval,avg_answer,avg_faithfulness
10,3.8,5.0,4.0
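
Note that `n` is 10 because the same five queries were scored under two evaluator names (`llm_judge_with_text` and `llm_judge_v1_with_text`), so the overall average mixes two judge runs. A per-evaluator breakdown separates them. The sketch below reproduces that grouping in plain Python using the score rows from the "recent evaluations" listing above; the function name `averages_by_evaluator` is hypothetical, and in the notebook the equivalent would be adding `GROUP BY evaluator` to the SQL.

```python
from collections import defaultdict

# (evaluator, retrieval_relevance, answer_relevance, faithfulness),
# copied from the recent-evaluations listing
rows = [
    ("llm_judge_v1_with_text", 2, 5, 3),
    ("llm_judge_v1_with_text", 2, 5, 2),
    ("llm_judge_v1_with_text", 5, 5, 5),
    ("llm_judge_v1_with_text", 5, 5, 5),
    ("llm_judge_v1_with_text", 5, 5, 5),
    ("llm_judge_with_text", 5, 5, 5),
    ("llm_judge_with_text", 5, 5, 5),
    ("llm_judge_with_text", 5, 5, 5),
    ("llm_judge_with_text", 2, 5, 2),
    ("llm_judge_with_text", 2, 5, 3),
]


def averages_by_evaluator(rows):
    """Return {evaluator: (avg_retrieval, avg_answer, avg_faithfulness)}."""
    grouped = defaultdict(list)
    for evaluator, *scores in rows:
        grouped[evaluator].append(scores)
    return {
        evaluator: tuple(sum(column) / len(column) for column in zip(*score_rows))
        for evaluator, score_rows in grouped.items()
    }


for evaluator, averages in averages_by_evaluator(rows).items():
    print(evaluator, averages)
```

On this data both evaluators average 3.8 / 5.0 / 4.0, which is why the combined row shows the same values; with diverging judges the combined figure would hide the difference.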


In [0]:
# Trend over time

spark.sql(f"""
SELECT
  date_trunc('day', created_at) AS day,
  avg(retrieval_relevance) AS avg_retrieval,
  avg(answer_relevance) AS avg_answer,
  avg(faithfulness) AS avg_faithfulness
FROM {RAG_EVAL_TABLE}
GROUP BY 1
ORDER BY 1
""").display()

day,avg_retrieval,avg_answer,avg_faithfulness
2026-01-15T00:00:00Z,3.8,5.0,4.0


In [0]:
# Join logs + evaluations (root cause analysis)

spark.sql(f"""
SELECT
  e.created_at,
  l.question,
  e.retrieval_relevance,
  e.answer_relevance,
  e.faithfulness,
  l.retrieved_chunks[0].url AS top_source,
  e.notes
FROM {RAG_EVAL_TABLE} e
JOIN {RAG_LOG_TABLE} l
  ON e.query_id = l.query_id
ORDER BY e.created_at DESC
LIMIT 20
""").display()

created_at,question,retrieval_relevance,answer_relevance,faithfulness,top_source,notes
2026-01-15T06:13:19.304822Z,What is the difference between a normal Azure VM and an ephemeral VM?,2,5,3,https://github.com/MicrosoftDocs/azure-compute-docs/blob/main/articles/virtual-machines/managed-disks-overview.md,"The answer provides a clear and detailed explanation of the differences between normal Azure VMs and ephemeral VMs, addressing the question directly. However, the retrieved excerpts do not contain relevant information specifically about ephemeral VMs, which affects the faithfulness score as the answer is not directly supported by the provided excerpts."
2026-01-15T06:13:16.267989Z,How do I resize an Azure virtual machine?,2,5,2,https://github.com/MicrosoftDocs/azure-compute-docs/blob/main/articles/virtual-machines/vm-customization.md,"The answer provides a detailed and accurate process for resizing an Azure virtual machine, which is relevant to the question. However, the retrieved excerpts do not directly support the specific steps mentioned in the answer, particularly regarding the resizing process, leading to a lower faithfulness score."
2026-01-15T06:13:13.362384Z,What is Azure VM Scale Sets?,5,5,5,https://github.com/MicrosoftDocs/azure-compute-docs/blob/main/articles/virtual-machine-scale-sets/flexible-virtual-machine-scale-sets-powershell.md,"The retrieved excerpts provide relevant information about Azure VM Scale Sets, including their features and management, which directly supports the answer provided."
2026-01-15T06:13:10.524493Z,How does Azure handle VM disk persistence?,5,5,5,https://github.com/MicrosoftDocs/azure-compute-docs/blob/main/articles/virtual-machines/managed-disks-overview.md,"The retrieved excerpts directly support the answer provided, detailing how Azure handles VM disk persistence and the settings that can be configured to manage this behavior."
2026-01-15T06:13:07.843522Z,What is the difference between Spot VM and normal VM?,5,5,5,https://github.com/MicrosoftDocs/azure-compute-docs/blob/main/articles/virtual-machines/spot-vms.md,"The retrieved excerpts provide comprehensive information about the differences between Spot VMs and standard VMs, including cost, eviction policy, SLA, workload suitability, and quota management, which directly supports the answer given."
2026-01-15T06:12:21.049506Z,What is the difference between Spot VM and normal VM?,5,5,5,https://github.com/MicrosoftDocs/azure-compute-docs/blob/main/articles/virtual-machines/spot-vms.md,"The retrieved excerpts provide comprehensive information that directly supports the answer regarding the differences between Spot VMs and normal VMs, including cost, eviction policy, SLA, workload suitability, and quota management."
2026-01-15T06:12:11.321303Z,How does Azure handle VM disk persistence?,5,5,5,https://github.com/MicrosoftDocs/azure-compute-docs/blob/main/articles/virtual-machines/managed-disks-overview.md,"The retrieved excerpts directly support the answer regarding VM disk persistence in Azure. The answer accurately reflects the information provided in the excerpts, particularly from source [3], which explicitly states that disks, NICs, and public IPs are persisted by default when a VM is deleted, and how to manage this behavior."
2026-01-15T06:12:03.806685Z,What is Azure VM Scale Sets?,5,5,5,https://github.com/MicrosoftDocs/azure-compute-docs/blob/main/articles/virtual-machine-scale-sets/flexible-virtual-machine-scale-sets-powershell.md,"The retrieved excerpts provide relevant information about Azure VM Scale Sets, including their features and management, which directly supports the answer provided."
2026-01-15T06:11:55.097941Z,How do I resize an Azure virtual machine?,2,5,2,https://github.com/MicrosoftDocs/azure-compute-docs/blob/main/articles/virtual-machines/vm-customization.md,"The answer provides a detailed and accurate process for resizing an Azure virtual machine, which is relevant to the question. However, the retrieved excerpts do not directly support the answer, particularly regarding the steps involved in resizing a VM. The first excerpt mentions downtime during resizing but lacks specific instructions or commands that align with the answer."
2026-01-15T06:11:43.046162Z,What is the difference between a normal Azure VM and an ephemeral VM?,2,5,3,https://github.com/MicrosoftDocs/azure-compute-docs/blob/main/articles/virtual-machines/managed-disks-overview.md,"The answer provides a clear and detailed explanation of the differences between normal Azure VMs and ephemeral VMs, addressing the question directly. However, the retrieved excerpts do not contain relevant information about ephemeral VMs, which affects the faithfulness score as the answer is not directly supported by the provided excerpts."
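
The joined rows above show a recurring pattern: low retrieval relevance paired with high answer relevance, meaning the model answered well from something other than the retrieved excerpts. A small triage rule can label such rows automatically. The sketch below is one possible heuristic; the function name `diagnose`, the label strings, and the `low`/`high` thresholds are all assumptions rather than part of the notebook's utilities.

```python
def diagnose(retrieval_relevance, answer_relevance, faithfulness, low=2, high=4):
    """Map a score triple to a coarse failure label (heuristic, assumed thresholds)."""
    if retrieval_relevance <= low and answer_relevance >= high:
        # Good answer without supporting excerpts: likely drawn from model memory
        return "retrieval_miss"
    if retrieval_relevance >= high and faithfulness <= low:
        # Relevant excerpts were fetched but the answer does not stick to them
        return "unfaithful_generation"
    if answer_relevance <= low:
        return "off_topic_answer"
    if min(retrieval_relevance, answer_relevance, faithfulness) >= high:
        return "ok"
    return "needs_review"


# Score triples taken from the joined output above
print(diagnose(2, 5, 3))  # ephemeral VM question
print(diagnose(2, 5, 2))  # VM resize question
print(diagnose(5, 5, 5))  # Spot VM question
```

Under these thresholds the ephemeral VM and resize questions are both labelled as retrieval misses, which points the investigation at the retriever or the indexed content rather than at the generation prompt.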


In [0]:
# Break down scores by retriever type to compare retrievers (e.g. in an A/B test)

spark.sql(f"""
SELECT
  retriever_type,
  avg(retrieval_relevance),
  avg(answer_relevance),
  avg(faithfulness)
FROM {RAG_EVAL_TABLE} e
JOIN {RAG_LOG_TABLE} l
  ON e.query_id = l.query_id
GROUP BY retriever_type
""").display()

retriever_type,avg(retrieval_relevance),avg(answer_relevance),avg(faithfulness)
A,3.8,5.0,4.0
