# RAG System Evaluation with RAGAS Metrics

This notebook evaluates the Julius Caesar RAG system with five RAGAS metrics:
- **Faithfulness**: Are the claims in the answer supported by the retrieved context?
- **Answer Relevancy**: How directly does the answer address the question?
- **Context Precision**: Are the relevant retrieved chunks ranked above the irrelevant ones?
- **Context Recall**: How much of the ground truth is covered by the retrieved context?
- **Answer Correctness**: How closely does the answer match the ground truth?

In [None]:
# Import required libraries
import requests
import json
import os
import pandas as pd
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    answer_correctness
)
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings

print("All libraries imported successfully!")

In [None]:
# Configuration
API_URL = "http://127.0.0.1:8000/query"
# Read the key from the environment; never hard-code credentials in a notebook
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise RuntimeError("Set the GOOGLE_API_KEY environment variable before running this notebook")

print(f"API URL: {API_URL}")
print(f"API Key configured: {'Yes' if GOOGLE_API_KEY else 'No'}")

## Test Dataset

We'll create a test dataset of questions paired with ground-truth answers, then use it to evaluate the system's performance.

In [None]:
# Test questions with ground truth answers
test_data = [
    {
        "question": "Who is Brutus?",
        "ground_truth": "Brutus is a Roman senator and one of the main conspirators in the assassination of Julius Caesar. He is portrayed as an honorable man who joins the conspiracy because he believes Caesar's ambition threatens the Roman Republic."
    },
    {
        "question": "What does the Soothsayer say to Caesar?",
        "ground_truth": "The Soothsayer warns Caesar to 'Beware the Ides of March', which is March 15th, the day Caesar is ultimately assassinated."
    },
    {
        "question": "Why does Brutus kill Caesar?",
        "ground_truth": "Brutus kills Caesar because he believes Caesar's ambition would lead to tyranny and the destruction of the Roman Republic. He acts out of love for Rome rather than personal hatred for Caesar."
    },
    {
        "question": "What happens at Caesar's funeral?",
        "ground_truth": "At Caesar's funeral, Brutus speaks first and explains the reasons for the assassination. Then Mark Antony delivers his famous 'Friends, Romans, countrymen' speech, which turns the crowd against the conspirators."
    },
    {
        "question": "Who is Cassius?",
        "ground_truth": "Cassius is a Roman senator and the main instigator of the conspiracy against Caesar. He manipulates Brutus into joining the plot by appealing to his sense of honor and republican ideals."
    },
    {
        "question": "What is the relationship between Caesar and Brutus?",
        "ground_truth": "Caesar trusts and loves Brutus, treating him almost like a son. This makes Brutus's betrayal particularly tragic, as evidenced by Caesar's famous last words 'Et tu, Brute?' (And you, Brutus?)."
    },
    {
        "question": "What role does Mark Antony play?",
        "ground_truth": "Mark Antony is Caesar's loyal friend and supporter. After Caesar's assassination, he skillfully turns public opinion against the conspirators through his funeral oration and eventually leads forces against them."
    },
    {
        "question": "What are Caesar's last words?",
        "ground_truth": "Caesar's last words are 'Et tu, Brute? Then fall, Caesar!' expressing his shock and betrayal at seeing Brutus among his assassins."
    }
]

print(f"Created test dataset with {len(test_data)} questions")
for i, item in enumerate(test_data, 1):
    print(f"{i}. {item['question']}")

## Query the RAG System

Let's query our RAG API and collect the responses.
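The query loop in the next cell assumes every response is shaped like `{"answer": ..., "sources": [{"chunk": ...}]}`. As a minimal sketch under that assumption, the hypothetical helper `parse_rag_response` below extracts the same fields but skips malformed source entries instead of raising `KeyError`. It is not part of the RAG API or of RAGAS.

```python
from typing import Any


def parse_rag_response(data: dict[str, Any]) -> tuple[str, list[str]]:
    """Return (answer, contexts) from one decoded API response.

    Assumes the shape used in this notebook:
    {"answer": str, "sources": [{"chunk": str, ...}, ...]}.
    """
    answer = str(data.get("answer") or "")
    contexts: list[str] = []
    for source in data.get("sources") or []:
        # Skip entries that are not dicts or that lack a non-empty text chunk.
        chunk = source.get("chunk") if isinstance(source, dict) else None
        if isinstance(chunk, str) and chunk.strip():
            contexts.append(chunk)
    return answer, contexts


# Illustrative payload; the chunk text is a placeholder, not real API output.
example = {
    "answer": "Brutus is a Roman senator.",
    "sources": [{"chunk": "chunk one text"}, {"page": 3}, {"chunk": ""}],
}
answer, contexts = parse_rag_response(example)
print(answer)
print(len(contexts))
```

Only the first source survives here: the second has no `chunk` key and the third is empty.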

In [None]:
# Query the RAG system and collect results
results = []

print("Querying RAG system...\n")
print("="*80)

for i, item in enumerate(test_data, 1):
    question = item['question']
    print(f"\n[{i}/{len(test_data)}] Question: {question}")
    print("-"*80)
    
    try:
        # Query the API
        response = requests.post(API_URL, json={"query": question}, timeout=30)
        response.raise_for_status()
        data = response.json()
        
        # Extract answer and contexts
        answer = data.get('answer', '')
        sources = data.get('sources', [])
        contexts = [source['chunk'] for source in sources]
        
        print(f"Answer: {answer}")
        print(f"\nRetrieved {len(contexts)} context chunks")
        
        # Store result
        results.append({
            "question": question,
            "answer": answer,
            "contexts": contexts,
            "ground_truth": item['ground_truth']
        })
        
    except Exception as e:
        print(f"Error: {e}")
        results.append({
            "question": question,
            "answer": "Error: Could not get response",
            "contexts": [],
            "ground_truth": item['ground_truth']
        })

print("\n" + "="*80)
print(f"\nCollected {len(results)} responses")

## Prepare Data for RAGAS Evaluation

RAGAS expects the samples as a Hugging Face `Dataset` with `question`, `answer`, `contexts`, and `ground_truth` columns.
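Failed queries are stored above with an `"Error: ..."` placeholder answer and an empty context list, so they would be scored like real answers. The sketch below shows one way to separate such rows before building the dataset. `split_valid_records` and `REQUIRED_KEYS` are hypothetical names, and the `"Error:"` prefix check relies on the placeholder text used in this notebook.

```python
REQUIRED_KEYS = ("question", "answer", "contexts", "ground_truth")


def split_valid_records(records):
    """Split records into (valid, rejected) lists for evaluation."""
    valid, rejected = [], []
    for record in records:
        ok = (
            # All four columns must be present.
            all(key in record for key in REQUIRED_KEYS)
            # Text fields must be non-empty strings.
            and all(
                isinstance(record[key], str) and record[key].strip()
                for key in ("question", "answer", "ground_truth")
            )
            # Contexts must be a non-empty list of strings.
            and isinstance(record["contexts"], list)
            and len(record["contexts"]) > 0
            and all(isinstance(chunk, str) for chunk in record["contexts"])
            # Placeholder answers written by the error branch are rejected.
            and not record["answer"].startswith("Error:")
        )
        (valid if ok else rejected).append(record)
    return valid, rejected


# Illustrative records; the context text is a placeholder.
sample_records = [
    {
        "question": "Who is Brutus?",
        "answer": "Brutus is a Roman senator.",
        "contexts": ["chunk one text"],
        "ground_truth": "Brutus is a Roman senator and conspirator.",
    },
    {
        "question": "Who is Cassius?",
        "answer": "Error: Could not get response",
        "contexts": [],
        "ground_truth": "Cassius is a Roman senator.",
    },
]
valid, rejected = split_valid_records(sample_records)
print(f"{len(valid)} valid, {len(rejected)} rejected")
```

Reporting the rejected count alongside the scores keeps API failures visible instead of folding them into the averages.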

In [None]:
# Convert results to RAGAS format
ragas_data = {
    "question": [r["question"] for r in results],
    "answer": [r["answer"] for r in results],
    "contexts": [r["contexts"] for r in results],
    "ground_truth": [r["ground_truth"] for r in results]
}

# Create dataset
dataset = Dataset.from_dict(ragas_data)

print("Dataset created for RAGAS evaluation")
print(f"Number of samples: {len(dataset)}")
print(f"\nDataset columns: {dataset.column_names}")
print("\nSample data:")
print(dataset[0])

## Configure RAGAS with Google Gemini

RAGAS defaults to OpenAI models for its judge LLM and embeddings; here we pass Google Gemini models instead.

In [None]:
# Configure LLM and Embeddings for RAGAS
llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    google_api_key=GOOGLE_API_KEY,
    temperature=0
)

embeddings = GoogleGenerativeAIEmbeddings(
    model="models/embedding-001",
    google_api_key=GOOGLE_API_KEY
)

print("RAGAS configured with Google Gemini")
print(f"LLM Model: gemini-2.5-flash")
print(f"Embedding Model: models/embedding-001")

## Run RAGAS Evaluation

Now let's evaluate our RAG system on five metrics:

1. **Faithfulness**: Measures how much of the answer is supported by the retrieved context
2. **Answer Relevancy**: Measures how directly the answer addresses the question
3. **Context Precision**: Measures whether relevant chunks are ranked above irrelevant ones in the retrieved context
4. **Context Recall**: Measures how much of the ground truth is covered by the retrieved context
5. **Answer Correctness**: Measures how closely the answer matches the ground truth, factually and semantically
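Faithfulness and context recall both reduce to a ratio of supported statements. The sketch below illustrates only that arithmetic. In RAGAS an LLM extracts the statements and produces the verdicts; here the verdict lists are hand-written for illustration, and `supported_fraction` is a hypothetical helper.

```python
def supported_fraction(verdicts):
    """Return the fraction of True verdicts, or None if there are none to judge."""
    verdicts = list(verdicts)
    if not verdicts:
        return None
    return sum(1 for verdict in verdicts if verdict) / len(verdicts)


# Faithfulness: one verdict per claim extracted from the generated answer,
# True when the retrieved context supports the claim.
claim_verdicts = [True, True, False, True]
faithfulness_score = supported_fraction(claim_verdicts)

# Context recall: one verdict per ground-truth sentence,
# True when the sentence can be attributed to the retrieved context.
ground_truth_verdicts = [True, False]
context_recall_score = supported_fraction(ground_truth_verdicts)

print(faithfulness_score)    # 0.75
print(context_recall_score)  # 0.5
```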

In [None]:
# Define metrics to evaluate
metrics = [
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    answer_correctness
]

print("Evaluating RAG system with RAGAS metrics...")
print("This may take a few minutes...\n")

# Run evaluation
evaluation_result = evaluate(
    dataset=dataset,
    metrics=metrics,
    llm=llm,
    embeddings=embeddings
)

print("\nEvaluation completed!")

## Display Results

In [None]:
# Display overall scores
print("="*80)
print("RAGAS EVALUATION RESULTS")
print("="*80)
print("\nOverall Scores (0-1 scale, higher is better):\n")

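# Aggregate per-metric means from the per-sample table. Indexing the result
# object directly may return per-sample lists rather than a scalar, depending
# on the installed RAGAS version.
metric_names = ['faithfulness', 'answer_relevancy', 'context_precision',
                'context_recall', 'answer_correctness']
scores = evaluation_result.to_pandas()[metric_names].mean().to_dict()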

print(f"1. Faithfulness:        {scores['faithfulness']:.4f}")
print(f"   -> How factually accurate is the answer based on context?")
print()

print(f"2. Answer Relevancy:    {scores['answer_relevancy']:.4f}")
print(f"   -> How relevant is the answer to the question?")
print()

print(f"3. Context Precision:   {scores['context_precision']:.4f}")
print(f"   -> How precise is the retrieved context?")
print()

print(f"4. Context Recall:      {scores['context_recall']:.4f}")
print(f"   -> How much ground truth is captured in context?")
print()

print(f"5. Answer Correctness:  {scores['answer_correctness']:.4f}")
print(f"   -> Overall correctness of the answer")
print()

# Calculate average score
avg_score = sum([scores['faithfulness'], scores['answer_relevancy'], 
                 scores['context_precision'], scores['context_recall'], 
                 scores['answer_correctness']]) / 5

print("="*80)
print(f"AVERAGE SCORE: {avg_score:.4f}")
print("="*80)

In [None]:
# Convert to DataFrame for better visualization
results_df = evaluation_result.to_pandas()

print("\nDetailed Results per Question:\n")
print(results_df[['question', 'faithfulness', 'answer_relevancy', 
                  'context_precision', 'context_recall', 'answer_correctness']].to_string(index=False))

## Metric Interpretation

### Score Ranges:
These bands are rule-of-thumb thresholds chosen for this notebook, not part of RAGAS:
- **0.8 - 1.0**: Excellent
- **0.6 - <0.8**: Good
- **0.4 - <0.6**: Fair
- **<0.4**: Needs Improvement

### What Each Metric Means:

1. **Faithfulness (0-1)**
   - Measures the share of claims in the answer that the retrieved context supports
   - High score = Answer stays within the retrieved context
   - Low score = Answer contains claims the context does not support (possible hallucinations)

2. **Answer Relevancy (0-1)**
   - Measures how well the answer addresses the question
   - High score = Directly answers the question
   - Low score = Answer is off-topic, incomplete, or vague

3. **Context Precision (0-1)**
   - Measures whether relevant chunks are ranked above irrelevant ones
   - High score = Relevant chunks appear at the top of the retrieved list
   - Low score = Irrelevant chunks outrank the relevant ones

4. **Context Recall (0-1)**
   - Measures how much of the ground truth can be attributed to the retrieved context
   - High score = Context covers everything in the ground truth
   - Low score = Context is missing information the ground truth contains

5. **Answer Correctness (0-1)**
   - Measures factual and semantic similarity to ground truth
   - High score = Answer matches expected answer
   - Low score = Answer differs from ground truth
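Context precision is rank-aware: it averages precision@k over the positions that hold a relevant chunk, so the same chunks score differently depending on their order. The sketch below shows that calculation on hand-written relevance flags. In RAGAS the relevance judgments come from an LLM, and `context_precision_at_k` is a hypothetical name for illustration.

```python
def context_precision_at_k(relevance):
    """Mean of precision@k over the ranks that hold a relevant chunk.

    `relevance` lists one boolean per retrieved chunk, in ranked order.
    """
    relevance = [bool(flag) for flag in relevance]
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    hits = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # precision@rank = relevant chunks seen so far / chunks seen so far
            weighted_sum += hits / rank
    return weighted_sum / total_relevant


# Two relevant chunks out of three, in different rank orders.
print(round(context_precision_at_k([True, True, False]), 4))  # 1.0
print(round(context_precision_at_k([True, False, True]), 4))  # 0.8333
print(round(context_precision_at_k([False, True, True]), 4))  # 0.5833
```

The score drops as relevant chunks move down the ranking, even though the retrieved set is identical.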

In [None]:
# Performance categorization
def categorize_score(score):
    if score >= 0.8:
        return "Excellent"
    elif score >= 0.6:
        return "Good"
    elif score >= 0.4:
        return "Fair"
    else:
        return "Needs Improvement"

print("\nPerformance Summary:\n")
print("="*80)

metrics_summary = {
    "Faithfulness": scores['faithfulness'],
    "Answer Relevancy": scores['answer_relevancy'],
    "Context Precision": scores['context_precision'],
    "Context Recall": scores['context_recall'],
    "Answer Correctness": scores['answer_correctness']
}

for metric_name, score in metrics_summary.items():
    category = categorize_score(score)
    print(f"{metric_name:20s}: {score:.4f} - {category}")

print("="*80)
print(f"\nOverall System Performance: {categorize_score(avg_score)}")
print(f"Average Score: {avg_score:.4f}")

## Sample Answers Review

Let's look at a few sample Q&A pairs to understand the system's performance.
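The first few samples are not necessarily the most informative ones. As a sketch, the hypothetical helper `lowest_scoring` ranks per-question rows by their mean metric score so the weakest answers can be reviewed first. It takes a list of dicts, such as `results_df.to_dict('records')`, and skips missing or NaN scores. The rows below are invented for illustration and are not results from this system.

```python
import math

METRIC_COLUMNS = (
    "faithfulness",
    "answer_relevancy",
    "context_precision",
    "context_recall",
    "answer_correctness",
)


def mean_score(row):
    """Mean of the available metric values in a row, ignoring missing or NaN scores."""
    values = [
        row[name]
        for name in METRIC_COLUMNS
        if isinstance(row.get(name), (int, float)) and not math.isnan(row[name])
    ]
    return sum(values) / len(values) if values else None


def lowest_scoring(rows, n=3):
    """Return up to n (mean_score, row) pairs, weakest first."""
    scored = [(mean_score(row), row) for row in rows]
    scored = [(score, row) for score, row in scored if score is not None]
    scored.sort(key=lambda pair: pair[0])
    return scored[:n]


# Invented scores, for illustration only.
demo_rows = [
    {"question": "Q1", "faithfulness": 0.9, "answer_relevancy": 0.7},
    {"question": "Q2", "faithfulness": 0.4, "answer_relevancy": float("nan")},
    {"question": "Q3", "faithfulness": 0.6, "answer_relevancy": 0.6},
]
for score, row in lowest_scoring(demo_rows, n=2):
    print(f"{row['question']}: {score:.2f}")
```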

In [None]:
# Display sample Q&A pairs
print("\nSample Question-Answer Pairs:\n")
print("="*80)

for i in range(min(3, len(results))):
    print(f"\n[Sample {i+1}]")
    print("-"*80)
    print(f"Question: {results[i]['question']}")
    print(f"\nGenerated Answer:\n{results[i]['answer']}")
    print(f"\nGround Truth:\n{results[i]['ground_truth']}")
    print(f"\nNumber of Context Chunks: {len(results[i]['contexts'])}")
    
    if i < len(results_df):
        print("\nScores for this question:")
        print(f"  Faithfulness: {results_df.iloc[i]['faithfulness']:.4f}")
        print(f"  Answer Relevancy: {results_df.iloc[i]['answer_relevancy']:.4f}")
        print(f"  Answer Correctness: {results_df.iloc[i]['answer_correctness']:.4f}")
    
    print("="*80)

## Save Results

In [None]:
# Save results to file
output_file = "evaluation_results.json"

output_data = {
    "overall_scores": {
        "faithfulness": float(scores['faithfulness']),
        "answer_relevancy": float(scores['answer_relevancy']),
        "context_precision": float(scores['context_precision']),
        "context_recall": float(scores['context_recall']),
        "answer_correctness": float(scores['answer_correctness']),
        "average_score": float(avg_score)
    },
    # Round-trip through to_json so NumPy arrays and NaN values serialize cleanly
    "detailed_results": json.loads(results_df.to_json(orient='records'))
}

with open(output_file, 'w') as f:
    json.dump(output_data, f, indent=2)

print(f"Results saved to {output_file}")

# Also save as CSV
csv_file = "evaluation_results.csv"
results_df.to_csv(csv_file, index=False)
print(f"Detailed results saved to {csv_file}")

## Conclusion

This evaluation scores the RAG system on five RAGAS metrics that cover both retrieval and generation:

- **Faithfulness** flags answers that make claims the context does not support
- **Answer Relevancy** checks that answers address the question asked
- **Context Precision** checks that relevant chunks are ranked near the top
- **Context Recall** checks that the retrieved context covers the ground truth
- **Answer Correctness** checks agreement with the ground-truth answer

Use these metrics to:
1. Track improvements over time
2. Compare different retrieval strategies
3. Identify weaknesses in the system
4. Make data-driven optimization decisions
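To track improvements or compare retrieval strategies, two runs can be compared metric by metric. The sketch below assumes each run is a dict shaped like the `overall_scores` entry saved in `evaluation_results.json`. `compare_runs` and its `tolerance` parameter are hypothetical, and the numbers are invented for illustration.

```python
def compare_runs(baseline, candidate, tolerance=0.01):
    """Return {metric: (delta, status)} for metrics present in both runs.

    A change smaller than `tolerance` in either direction counts as unchanged.
    """
    report = {}
    for metric in sorted(set(baseline) & set(candidate)):
        delta = candidate[metric] - baseline[metric]
        if delta > tolerance:
            status = "improved"
        elif delta < -tolerance:
            status = "regressed"
        else:
            status = "unchanged"
        report[metric] = (round(delta, 4), status)
    return report


# Invented scores, for illustration only.
baseline_scores = {"faithfulness": 0.70, "context_recall": 0.60}
candidate_scores = {"faithfulness": 0.82, "context_recall": 0.595}

for metric, (delta, status) in compare_runs(baseline_scores, candidate_scores).items():
    print(f"{metric:20s}: {delta:+.4f} ({status})")
```

With only eight test questions, small differences between runs are likely noise, so the tolerance should be generous.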