# Part 3: Evaluation & Debugging

In this notebook, we'll learn how to:

1. **Evaluate** RAG quality with test cases
2. **Debug** when things go wrong
3. **Improve** based on failures

A system is only as good as your ability to measure and fix it.

## Setup

In [None]:
!git clone https://github.com/i33ym/rag-workshop.git 2>/dev/null || echo "Already cloned"
%cd rag-workshop

In [None]:
!pip install -q openai langchain langchain-openai langchain-community chromadb rank-bm25

In [None]:
import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

In [None]:
# Load everything from Part 2
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_community.retrievers import BM25Retriever
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Load and prepare
loader = DirectoryLoader("docs/", glob="**/*.md", loader_cls=TextLoader)
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vector_store = Chroma.from_documents(documents=chunks, embedding=embeddings)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

print(f"Loaded {len(chunks)} chunks")

In [None]:
# Copy the pipeline functions from Part 2

def hybrid_search(query, k=5):
    """Fuse vector and BM25 results with Reciprocal Rank Fusion (RRF)."""
    vector_results = vector_store.similarity_search(query, k=k)
    # BM25Retriever returns at most bm25_retriever.k (5) results,
    # so k > 5 only widens the vector side
    bm25_results = bm25_retriever.invoke(query)[:k]
    
    k_constant = 60  # RRF damping constant
    rrf_scores = {}  # doc_id -> fused score
    docs_by_id = {}  # doc_id -> Document
    
    for ranking in (vector_results, bm25_results):
        for rank, doc in enumerate(ranking):
            # The first 100 characters act as a cheap identity for de-duplication
            doc_id = doc.page_content[:100]
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + 1 / (k_constant + rank + 1)
            docs_by_id[doc_id] = doc
    
    top_ids = sorted(rrf_scores, key=rrf_scores.get, reverse=True)[:k]
    results = [docs_by_id[doc_id] for doc_id in top_ids]
    scores = [rrf_scores[doc_id] for doc_id in top_ids]
    return results, scores

def rerank_documents(query, documents, top_n=3):
    rerank_prompt = ChatPromptTemplate.from_template(
        "Rate relevance 0-10. Reply with only a number.\n\nQuestion: {question}\n\nDocument: {document}\n\nScore:"
    )
    chain = rerank_prompt | llm | StrOutputParser()
    
    scored = []
    for doc in documents:
        try:
            score = float(chain.invoke({"question": query, "document": doc.page_content[:500]}).strip())
        except Exception:
            # Unparseable reply or API error: fall back to a neutral score
            score = 5.0
        scored.append((doc, score))
    
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:top_n]

def check_relevance(query, documents):
    context = "\n\n".join([doc.page_content[:300] for doc in documents])
    prompt = ChatPromptTemplate.from_template(
        "Can this context answer the question? Reply 'yes' or 'no'.\n\nQuestion: {question}\n\nContext: {context}"
    )
    result = (prompt | llm | StrOutputParser()).invoke({"question": query, "context": context})
    return "yes" in result.lower()

def generate_answer(query, documents):
    context = "\n\n---\n\n".join([doc.page_content for doc in documents])
    prompt = ChatPromptTemplate.from_template(
        "Answer based only on context. If unsure, say so.\n\nContext:\n{context}\n\nQuestion: {question}\n\nAnswer:"
    )
    return (prompt | llm | StrOutputParser()).invoke({"context": context, "question": query})

def check_grounding(answer, documents):
    context = "\n\n".join([doc.page_content for doc in documents])
    prompt = ChatPromptTemplate.from_template(
        "Is this answer supported by context? Reply 'yes' or 'no'.\n\nContext:\n{context}\n\nAnswer: {answer}"
    )
    result = (prompt | llm | StrOutputParser()).invoke({"context": context, "answer": answer})
    return "yes" in result.lower()

print("Pipeline functions loaded.")

## Creating a Test Dataset

To evaluate RAG, you need:
1. **Questions**: what users might ask
2. **Expected answers**: what the correct response should contain

This is called a "golden dataset" or "ground truth".

In [None]:
# Define test cases
test_cases = [
    {
        "question": "How do I authenticate API requests?",
        "expected_keywords": ["token", "authorization", "header"],
        "should_answer": True
    },
    {
        "question": "What is the endpoint for creating a payment?",
        "expected_keywords": ["POST", "payment", "api"],
        "should_answer": True
    },
    {
        "question": "What error codes can the API return?",
        "expected_keywords": ["error", "code", "400", "401", "500"],
        "should_answer": True
    },
    {
        "question": "How do I integrate with Stripe?",
        "expected_keywords": [],
        "should_answer": False  # Not in our docs!
    },
    {
        "question": "What is the meaning of life?",
        "expected_keywords": [],
        "should_answer": False  # Completely off-topic
    }
]

print(f"Created {len(test_cases)} test cases")
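
A golden dataset is only useful if its entries are well formed. The following is a minimal sketch of a sanity check, assuming the three-field layout used above; `validate_test_cases` and the rules it enforces are illustrative choices, not part of the workshop code.

```python
def validate_test_cases(cases):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    seen_questions = set()
    for i, case in enumerate(cases):
        question = case.get("question", "")
        keywords = case.get("expected_keywords")
        should_answer = case.get("should_answer")

        if not isinstance(question, str) or not question.strip():
            problems.append(f"case {i}: missing or empty question")
        elif question in seen_questions:
            problems.append(f"case {i}: duplicate question")
        else:
            seen_questions.add(question)

        if not isinstance(should_answer, bool):
            problems.append(f"case {i}: should_answer must be True or False")
        if not isinstance(keywords, list):
            problems.append(f"case {i}: expected_keywords must be a list")
        elif should_answer is True and not keywords:
            problems.append(f"case {i}: answerable case has no expected keywords")
        elif should_answer is False and keywords:
            problems.append(f"case {i}: refusal case should not list keywords")
    return problems


# Illustrative data; the third case is deliberately broken.
sample_cases = [
    {"question": "How do I authenticate API requests?",
     "expected_keywords": ["token", "header"], "should_answer": True},
    {"question": "How do I integrate with Stripe?",
     "expected_keywords": [], "should_answer": False},
    {"question": "What error codes can the API return?",
     "expected_keywords": [], "should_answer": True},
]
for problem in validate_test_cases(sample_cases):
    print(problem)
```

Running a check like this before each evaluation catches answerable cases that would otherwise score a trivial 100% keyword coverage.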

## Evaluation Metrics

We'll measure three things:

1. **Retrieval Quality**: Did we find relevant documents?
2. **Answer Quality**: Does the answer contain expected information?
3. **Appropriate Refusal**: Did we correctly say "I don't know" when needed?
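
The evaluation functions in this notebook score only the final answer, so the first metric is not measured directly. One rough way to approximate it is a hit rate: a retrieval counts as a hit if any retrieved chunk mentions an expected keyword. The sketch below assumes keywords are a fair proxy for relevance; `retrieval_hit`, `hit_rate`, and `min_keywords` are hypothetical names introduced for illustration.

```python
def retrieval_hit(retrieved_texts, expected_keywords, min_keywords=1):
    """True if any single retrieved chunk contains at least `min_keywords` expected keywords."""
    wanted = [kw.lower() for kw in expected_keywords]
    for text in retrieved_texts:
        lowered = text.lower()
        if sum(1 for kw in wanted if kw in lowered) >= min_keywords:
            return True
    return False


def hit_rate(retrieval_runs):
    """Fraction of (retrieved_texts, expected_keywords) pairs that count as hits."""
    if not retrieval_runs:
        return 0.0
    hits = sum(1 for texts, keywords in retrieval_runs if retrieval_hit(texts, keywords))
    return hits / len(retrieval_runs)


# Illustrative data only. In the notebook, the texts could come from
# [doc.page_content for doc in hybrid_search(question)[0]].
runs = [
    (["Send the token in the Authorization header."], ["token", "authorization"]),
    (["Refunds are processed within five days."], ["POST", "payment"]),
]
print(f"Hit rate: {hit_rate(runs):.2f}")
```

Keyword matching is a coarse signal. A dataset that labels the expected source chunk for each question would give a more reliable retrieval metric.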

In [None]:
def evaluate_answer(answer, test_case):
    """Evaluate a single answer against a test case."""
    
    result = {
        "question": test_case["question"],
        "answer": answer[:200],
        "metrics": {}
    }
    
    # Normalize case and curly apostrophes so "don’t" matches "don't"
    answer_lower = answer.lower().replace("’", "'")
    
    # Detect refusals once; both branches need the result
    refusal_phrases = ["don't have information", "cannot find", "no information", "i don't know"]
    is_refusal = any(phrase in answer_lower for phrase in refusal_phrases)
    
    if test_case["should_answer"]:
        # Should provide an answer that mentions the expected keywords
        keywords = test_case["expected_keywords"]
        keywords_found = sum(1 for kw in keywords if kw.lower() in answer_lower)
        # With no keywords to check, coverage is trivially complete
        result["metrics"]["keyword_coverage"] = keywords_found / len(keywords) if keywords else 1.0
        
        # Check it's not a refusal
        result["metrics"]["correctly_answered"] = not is_refusal
        
    else:
        # Should refuse to answer
        result["metrics"]["correctly_refused"] = is_refusal
    
    return result

In [None]:
def run_evaluation(test_cases, rag_function):
    """Run all test cases through the RAG system."""
    
    results = []
    
    for i, tc in enumerate(test_cases):
        print(f"Testing {i+1}/{len(test_cases)}: {tc['question'][:50]}...")
        
        # Get answer
        answer = rag_function(tc["question"])
        
        # Evaluate
        result = evaluate_answer(answer, tc)
        results.append(result)
    
    return results
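
Per-question metrics are hard to compare at a glance. The sketch below averages each metric across a results list of the shape returned by `run_evaluation`; `summarize` is a hypothetical helper, and the sample results are made up for illustration rather than taken from a real run.

```python
def summarize(results):
    """Average each metric across results, skipping metrics a result does not report."""
    totals, counts = {}, {}
    for result in results:
        for name, value in result["metrics"].items():
            # Booleans convert to 1.0 / 0.0, so their mean is a pass rate
            totals[name] = totals.get(name, 0.0) + float(value)
            counts[name] = counts.get(name, 0) + 1
    return {name: totals[name] / counts[name] for name in totals}


# Illustrative results in the shape produced by evaluate_answer; not real output.
example_results = [
    {"metrics": {"keyword_coverage": 1.0, "correctly_answered": True}},
    {"metrics": {"keyword_coverage": 0.5, "correctly_answered": True}},
    {"metrics": {"correctly_refused": False}},
]
for name, mean in summarize(example_results).items():
    print(f"{name}: {mean:.2f}")
```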

## Compare Simple vs Production RAG

In [None]:
# Simple RAG function
def simple_rag(query):
    docs = vector_store.similarity_search(query, k=3)
    return generate_answer(query, docs)

# Production RAG function
def production_rag(query):
    # Hybrid search
    docs, _ = hybrid_search(query, k=6)
    
    # Rerank
    reranked = rerank_documents(query, docs, top_n=3)
    top_docs = [doc for doc, _ in reranked]
    
    # Relevance check
    if not check_relevance(query, top_docs):
        return "I don't have information about this topic in the documentation."
    
    # Generate
    answer = generate_answer(query, top_docs)
    
    return answer

In [None]:
print("=" * 50)
print("EVALUATING SIMPLE RAG")
print("=" * 50)
simple_results = run_evaluation(test_cases, simple_rag)

In [None]:
print("\n" + "=" * 50)
print("EVALUATING PRODUCTION RAG")
print("=" * 50)
production_results = run_evaluation(test_cases, production_rag)

In [None]:
# Compare results
print("\n" + "=" * 50)
print("COMPARISON")
print("=" * 50)

for i, tc in enumerate(test_cases):
    print(f"\nQ: {tc['question']}")
    print(f"Should answer: {tc['should_answer']}")
    print(f"\nSimple RAG:")
    print(f"  {simple_results[i]['metrics']}")
    print(f"Production RAG:")
    print(f"  {production_results[i]['metrics']}")

## Debugging: When Things Go Wrong

When RAG fails, you need to find where in the pipeline it broke:

1. **Retrieval problem**: Wrong documents retrieved
2. **Reranking problem**: Good docs scored low
3. **Relevance problem**: False positive/negative
4. **Generation problem**: Right docs, wrong answer
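
The triage order above can be sketched as a small function that reads the recorded output of each stage and names the first one that looks wrong. This is a sketch under stated assumptions: the `trace` dictionary, its keys, and the `low_score` cutoff are hypothetical. Without a labeled "correct" chunk it cannot tell a retrieval problem from a reranking problem, so it reports the two together.

```python
def locate_failure(trace, should_answer=True, low_score=5.0):
    """Name the first pipeline stage whose recorded output looks wrong."""
    scores = trace.get("rerank_scores", [])
    if not should_answer:
        # For out-of-scope questions, the only failure is letting them through
        return "relevance" if trace.get("is_relevant", False) else "none"
    if not scores or max(scores) < low_score:
        return "retrieval or reranking"  # nothing scored as relevant
    if not trace.get("is_relevant", False):
        return "relevance"               # good docs, but the check said no
    if not trace.get("is_grounded", False):
        return "generation"              # right docs, unsupported answer
    return "none"


# Hypothetical traces, not real pipeline output.
print(locate_failure({"rerank_scores": [2.0, 3.0, 1.0], "is_relevant": False}))
print(locate_failure({"rerank_scores": [9.0, 7.0], "is_relevant": True, "is_grounded": False}))
```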

In [None]:
def debug_query(query):
    """Step through the pipeline and show what happens at each stage."""
    
    print(f"Query: {query}")
    print("=" * 60)
    
    # Stage 1: Hybrid Search
    print("\n[STAGE 1: HYBRID SEARCH]")
    docs, scores = hybrid_search(query, k=6)
    print(f"Retrieved {len(docs)} documents")
    for i, (doc, score) in enumerate(zip(docs[:3], scores[:3])):
        print(f"  {i+1}. (score: {score:.4f}) {doc.page_content[:60]}...")
    
    # Stage 2: Reranking
    print("\n[STAGE 2: RERANKING]")
    reranked = rerank_documents(query, docs, top_n=3)
    rerank_scores = [score for _, score in reranked]
    print(f"Rerank scores: {rerank_scores}")
    top_docs = [doc for doc, _ in reranked]
    for i, (doc, score) in enumerate(reranked):
        print(f"  {i+1}. (score: {score}/10) {doc.page_content[:60]}...")
    
    # Stage 3: Relevance Check
    print("\n[STAGE 3: RELEVANCE CHECK]")
    is_relevant = check_relevance(query, top_docs)
    print(f"Is relevant: {is_relevant}")
    
    if not is_relevant:
        print("\n❌ Pipeline stopped: Documents not relevant")
        return
    
    # Stage 4: Generate
    print("\n[STAGE 4: GENERATE ANSWER]")
    answer = generate_answer(query, top_docs)
    print(f"Answer: {answer[:300]}...")
    
    # Stage 5: Grounding
    print("\n[STAGE 5: GROUNDING CHECK]")
    is_grounded = check_grounding(answer, top_docs)
    print(f"Is grounded: {is_grounded}")
    
    if is_grounded:
        print("\n✅ Pipeline complete: Answer is grounded")
    else:
        print("\n⚠️ Warning: Answer may contain hallucinations")

In [None]:
# Debug a successful query
debug_query("How do I authenticate API requests?")

In [None]:
# Debug a query that the pipeline should refuse
debug_query("How do I integrate with PayPal?")

## Common Failure Patterns

### 1. Retrieval Failure
The right documents aren't being found.

**Symptoms:** Rerank scores are all low (< 5)

**Fixes:**
- Improve chunking (keep related content together)
- Add metadata to help filtering
- Weight BM25 and vector results differently in hybrid search (the `hybrid_search` above weights them equally)
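
The `hybrid_search` function in this notebook adds the two rankings with equal weight. One possible way to tune the balance is a weighted form of Reciprocal Rank Fusion, sketched below on plain document ids. `weighted_rrf` and its `weights` parameter are illustrative and not part of the workshop code; suitable weights would have to be found by running the evaluation set.

```python
def weighted_rrf(ranked_lists, weights, k_constant=60):
    """Fuse ranked lists of document ids; each list's contribution is scaled by its weight."""
    if len(ranked_lists) != len(weights):
        raise ValueError("need exactly one weight per ranked list")
    scores = {}
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k_constant + rank + 1)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Illustrative rankings, best match first.
vector_ids = ["auth", "payments", "errors"]
bm25_ids = ["errors", "auth", "webhooks"]

equal = weighted_rrf([vector_ids, bm25_ids], weights=[1.0, 1.0])
bm25_heavy = weighted_rrf([vector_ids, bm25_ids], weights=[1.0, 2.0])
print("Equal weights:", [doc_id for doc_id, _ in equal])
print("BM25-heavy:   ", [doc_id for doc_id, _ in bm25_heavy])
```

Doubling the BM25 weight moves BM25's top result ahead of the document that leads under equal weights. This is the kind of shift to verify against the test cases before keeping a setting.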

In [None]:
# Example: Check if retrieval is the problem
def diagnose_retrieval(query):
    docs, _ = hybrid_search(query, k=6)
    reranked = rerank_documents(query, docs)
    scores = [s for _, s in reranked]
    
    print(f"Query: {query}")
    print(f"Scores: {scores}")
    
    # Guard against an empty result set before averaging
    if not scores:
        print("⚠️ RETRIEVAL PROBLEM: No documents retrieved")
        return
    
    avg_score = sum(scores) / len(scores)
    max_score = max(scores)
    print(f"Average: {avg_score:.1f}, Max: {max_score}")
    
    if max_score < 5:
        print("⚠️ RETRIEVAL PROBLEM: No highly relevant docs found")
    elif avg_score < 4:
        print("⚠️ PARTIAL PROBLEM: Some relevant docs, but noisy")
    else:
        print("✅ Retrieval looks good")

diagnose_retrieval("How do I create a payment?")

### 2. False Refusals
The system says "I don't know" when the answer IS in the docs.

**Symptoms:** Relevance check returns False incorrectly

**Fixes:**
- Give the relevance check more context (it sees only the first 300 characters of each document)
- Improve the relevance prompt
- Check if chunking is splitting relevant content
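
Before changing the prompt, it helps to rule out the parsing step. `check_relevance` treats any reply containing the substring "yes" as positive, so a misparse can hide what the model actually said. Below is a sketch of a stricter parser that reads only the first word of the reply; `parse_yes_no` and its `default` argument are hypothetical names, and a `default` of `False` means ambiguous replies count as refusals.

```python
import re


def parse_yes_no(reply, default=False):
    """Read a yes/no verdict from the first word of an LLM reply."""
    # Skip leading quotes or punctuation, then require a whole word "yes" or "no"
    match = re.match(r"\W*(yes|no)\b", reply.strip().lower())
    if match is None:
        return default
    return match.group(1) == "yes"


# Illustrative replies, not real model output.
for reply in ["Yes.", "No, the context only mentions yes/no flags.", "'yes'", "Not sure"]:
    print(f"{reply!r} -> {parse_yes_no(reply)}")
```

The second reply shows the difference: a substring test would read it as "yes", while the first-word rule reads it as "no".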

In [None]:
# Check what the relevance check sees
def diagnose_relevance(query):
    docs, _ = hybrid_search(query, k=6)
    reranked = rerank_documents(query, docs, top_n=3)
    top_docs = [doc for doc, _ in reranked]
    
    print(f"Query: {query}")
    print(f"\nTop doc content:")
    print(top_docs[0].page_content[:500])
    print(f"\nRelevance check result: {check_relevance(query, top_docs)}")

diagnose_relevance("How do I get an access token?")

### 3. Hallucination
The answer includes information not in the documents.

**Symptoms:** Grounding check fails, or answer contains specific details not in context

**Fixes:**
- Strengthen the generation prompt
- Lower temperature
- Add explicit "only use provided context" instructions
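
The LLM-based grounding check can be complemented with a cheap lexical test: specific details such as numbers, paths, and uppercase codes in the answer should also appear in the context. The following is a heuristic sketch, not a replacement for `check_grounding`; `unsupported_tokens` and its token pattern are assumptions made for illustration.

```python
import re

# Numbers (401, 2.5), URL-style paths (/v1/payments), and uppercase words (POST)
TOKEN_PATTERN = r"\b\d+(?:\.\d+)?\b|/[\w/{}-]+|\b[A-Z]{2,}\b"


def unsupported_tokens(answer, context):
    """Return specific tokens that appear in the answer but nowhere in the context."""
    context_tokens = {t.lower() for t in re.findall(TOKEN_PATTERN, context)}
    missing = []
    for token in re.findall(TOKEN_PATTERN, answer):
        if token.lower() not in context_tokens and token not in missing:
            missing.append(token)
    return missing


# Illustrative strings, not taken from the workshop docs.
context = "Create a payment with POST /v1/payments. Invalid tokens return 401."
answer = "Send POST /v1/payments; invalid tokens return 403 and are retried 3 times."
print(unsupported_tokens(answer, context))
```

Any token this returns is a candidate for manual review. An empty list does not prove the answer is grounded, since paraphrased claims carry no such tokens.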

In [None]:
# Check for hallucination
def diagnose_hallucination(query):
    docs, _ = hybrid_search(query, k=6)
    reranked = rerank_documents(query, docs, top_n=3)
    top_docs = [doc for doc, _ in reranked]
    
    answer = generate_answer(query, top_docs)
    is_grounded = check_grounding(answer, top_docs)
    
    print(f"Query: {query}")
    print(f"\nAnswer: {answer[:300]}")
    print(f"\nGrounded: {is_grounded}")
    
    if not is_grounded:
        print("\n⚠️ HALLUCINATION DETECTED")
        print("Check: Does the answer contain info not in the docs?")

diagnose_hallucination("How do I authenticate?")

## Building a Feedback Loop

The best RAG systems improve over time by collecting user feedback.

In [None]:
# Simple feedback collection
feedback_log = []

def rag_with_feedback(query):
    """RAG that collects feedback."""
    
    # Get answer
    answer = production_rag(query)
    
    # Log for review
    entry = {
        "query": query,
        "answer": answer,
        "feedback": None  # To be filled by user
    }
    feedback_log.append(entry)
    
    return answer, len(feedback_log) - 1  # Return answer and log ID

def submit_feedback(log_id, is_helpful, comment=""):
    """Submit feedback for an answer."""
    feedback_log[log_id]["feedback"] = {
        "helpful": is_helpful,
        "comment": comment
    }
    print(f"Feedback recorded: {'👍' if is_helpful else '👎'}")

In [None]:
# Example usage
answer, log_id = rag_with_feedback("How do I create a payment?")
print(f"Answer: {answer[:200]}...")
print(f"Log ID: {log_id}")

In [None]:
# Submit feedback
submit_feedback(log_id, is_helpful=True, comment="Clear and complete")

In [None]:
# View feedback log
import json
print(json.dumps(feedback_log, indent=2, default=str))
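
Feedback closes the loop only when it flows back into the test set. The sketch below turns entries marked unhelpful into draft test cases for a person to complete. `feedback_to_test_cases` and the `note` field are hypothetical; the expected keywords are left empty on purpose, because they cannot be inferred from a thumbs-down.

```python
def feedback_to_test_cases(log):
    """Turn entries marked unhelpful into draft test cases for human review."""
    drafts = []
    for entry in log:
        feedback = entry.get("feedback")
        if feedback is None or feedback.get("helpful", True):
            continue  # unreviewed or helpful: nothing to add
        drafts.append({
            "question": entry["query"],
            "expected_keywords": [],  # to be filled in by a reviewer
            "should_answer": None,    # reviewer decides: answer or refuse
            "note": feedback.get("comment", ""),
        })
    return drafts


# Illustrative log in the shape built by rag_with_feedback; not real data.
sample_log = [
    {"query": "How do I create a payment?", "answer": "...",
     "feedback": {"helpful": True, "comment": "Clear and complete"}},
    {"query": "How do I refund a payment?", "answer": "...",
     "feedback": {"helpful": False, "comment": "Missing the endpoint"}},
    {"query": "What are the rate limits?", "answer": "...", "feedback": None},
]
for draft in feedback_to_test_cases(sample_log):
    print(draft)
```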

## Summary

**What we learned:**

1. **Test cases are essential**: Define questions and expected answers
2. **Measure at each stage**: Find where failures happen
3. **Common problems:**
   - Retrieval failure → Improve search/chunking
   - False refusals → Tune relevance check
   - Hallucinations → Strengthen generation prompt
4. **Collect feedback**: Improve over time

**Key insight:** A debuggable system beats a clever system. Always know what's happening inside.

In [None]:
print("✅ Workshop complete!")
print("")
print("You've learned:")
print("  1. Why simple RAG fails (30% accuracy)")
print("  2. How to build production RAG (86% accuracy)")
print("  3. How to evaluate and debug")
print("")
print("Next steps:")
print("  - Try with your own documents")
print("  - Build a web interface")
print("  - Add streaming responses")
print("  - Deploy to production")