# Construction RAG - Evaluation with RAGAS

This notebook demonstrates how to evaluate the RAG pipeline using RAGAS-inspired metrics:
- Context Precision
- Context Recall
- F1 Score
- Keyword Coverage

With an LLM enabled, two additional metrics are available:
- Faithfulness
- Answer Relevancy
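
For orientation, the retrieval metrics can be sketched in a few lines. This is an illustration under stated assumptions, not the code in `construction_rag.evaluation`. It assumes precision is the fraction of retrieved chunks whose type is among the expected types, recall is the fraction of expected types that show up in the retrieved chunks, and keyword coverage is the fraction of keywords found in the retrieved text. All function names and the chunk types in the example are hypothetical.

```python
def context_precision(retrieved_types, relevant_types):
    """Fraction of retrieved chunks whose type is expected (assumed definition)."""
    if not retrieved_types:
        return 0.0
    relevant = set(relevant_types)
    hits = sum(1 for t in retrieved_types if t in relevant)
    return hits / len(retrieved_types)


def context_recall(retrieved_types, relevant_types):
    """Fraction of expected chunk types that were retrieved (assumed definition)."""
    relevant = set(relevant_types)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved_types)) / len(relevant)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def keyword_coverage(retrieved_text, keywords):
    """Fraction of keywords found in the retrieved text, ignoring case."""
    if not keywords:
        return 0.0
    text = retrieved_text.lower()
    found = sum(1 for kw in keywords if kw.lower() in text)
    return found / len(keywords)


# Made-up retrieval result: four chunks returned, two expected chunk types
retrieved = ["table", "table", "note", "title_block"]
expected = ["table", "legend"]

p = context_precision(retrieved, expected)
r = context_recall(retrieved, expected)
print(p, r, f1_score(p, r))
```

Under these definitions, duplicates count toward precision (each retrieved chunk is judged on its own) but not toward recall (each expected type is counted once).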

In [None]:
import os
from construction_rag import ConstructionRAGPipeline, ConstructionDrawingRAG
from construction_rag.evaluation import RAGEvaluator, TEST_CASES

## 1. Setup: Index Some Documents First

Evaluation needs a populated index. The cell below processes the sample images only if the index is still empty.

In [None]:
import glob

# Initialize the pipeline; summaries are disabled to keep the demo fast
pipeline = ConstructionRAGPipeline(
    persist_directory="./eval_db",
    enable_summaries=False
)

# Index the sample images only if the collection is still empty
images = sorted(glob.glob("sample_images/*.jpg"))
total_chunks = pipeline.get_stats()['total_chunks']

if total_chunks > 0:
    print(f"Using existing index with {total_chunks} chunks")
elif images:
    print(f"Processing {len(images)} sample images...")
    batch_results = pipeline.process_batch(images, verbose=True)
else:
    print("No images found in sample_images/ and the index is empty")

## 2. View Test Cases

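The evaluation uses predefined test cases. Each one pairs a question with its ground truth: the chunk types that should be retrieved and the keywords the retrieved context should contain.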
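
A test case with the three keys used in this notebook (`question`, `relevant_chunk_types`, `keywords`) might look like the sketch below. The example values and the `validate_test_case` helper are hypothetical and not part of `construction_rag`. The helper shows one way to sanity-check custom test cases before passing them to the evaluator.

```python
REQUIRED_KEYS = {"question", "relevant_chunk_types", "keywords"}

# Hypothetical example; the question, chunk types, and keywords are made up
example_case = {
    "question": "What is the door schedule for level 2?",
    "relevant_chunk_types": ["table", "schedule"],
    "keywords": ["door", "schedule"],
}


def validate_test_case(case):
    """Return a list of problems found in one test case (empty if none)."""
    problems = []
    missing = REQUIRED_KEYS - set(case)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if "question" in case and not str(case["question"]).strip():
        problems.append("empty question")
    for key in ("relevant_chunk_types", "keywords"):
        # Only check the value when the key is present; absence is reported above
        if key in case and (not isinstance(case[key], list) or not case[key]):
            problems.append(f"{key} should be a non-empty list")
    return problems


print(validate_test_case(example_case))
```

An empty list means the case has the expected shape. Whether the chunk types match what the index actually contains still has to be checked against your own data.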

In [None]:
print(f"Number of test cases: {len(TEST_CASES)}\n")

for i, tc in enumerate(TEST_CASES[:3], 1):
    print(f"{i}. {tc['question']}")
    print(f"   Expected types: {tc['relevant_chunk_types']}")
    print(f"   Keywords: {tc['keywords']}")
    print()

## 3. Initialize Evaluator

Create the evaluator from the pipeline's RAG component. Passing `llm=None` restricts it to the retrieval metrics.

In [None]:
# Get the RAG component
rag = pipeline.rag

# Initialize evaluator (without LLM for basic metrics)
evaluator = RAGEvaluator(rag, llm=None)

print("Evaluator initialized")
print(f"  RAG collection: {rag.collection_name}")
print(f"  Chunks indexed: {rag.get_stats()['total_chunks']}")

## 4. Run Evaluation

In [None]:
# Run evaluation on all test cases
results = evaluator.evaluate_all(TEST_CASES, n_results=5, verbose=True)

## 5. View Results

In [None]:
# Aggregate metrics
metrics = results['aggregate_metrics']

print("="*50)
print("AGGREGATE METRICS")
print("="*50)
print(f"\nTest cases: {metrics['num_test_cases']}")
print("\nRetrieval Metrics:")
print(f"  Context Precision: {metrics['mean_context_precision']:.2%}")
print(f"  Context Recall:    {metrics['mean_context_recall']:.2%}")
print(f"  F1 Score:          {metrics['f1_score']:.2%}")
print(f"  Keyword Coverage:  {metrics['mean_keyword_coverage']:.2%}")

In [None]:
# Individual results
print("\n" + "="*50)
print("INDIVIDUAL RESULTS")
print("="*50)

for i, result in enumerate(results['individual_results'], 1):
    print(f"\n{i}. {result['question'][:50]}...")
    print(f"   Precision: {result['context_precision']:.2f}")
    print(f"   Recall:    {result['context_recall']:.2f}")
    print(f"   Keywords:  {result['keyword_coverage']:.2f}")

## 6. Interpretation

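The assessment below applies the thesis evaluation thresholds: precision is good from 70% and moderate from 40%, recall is good from 50% and moderate from 30%, and the F1 target is 60%.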

In [None]:
print("\nQuality Assessment:")
print("-" * 40)

# Precision
if metrics['mean_context_precision'] >= 0.7:
    print("✓ Good precision: Retrieved chunks are mostly relevant")
elif metrics['mean_context_precision'] >= 0.4:
    print("~ Moderate precision: Some irrelevant chunks retrieved")
else:
    print("✗ Low precision: Many irrelevant chunks retrieved")

# Recall
if metrics['mean_context_recall'] >= 0.5:
    print("✓ Good recall: Most relevant chunk types found")
elif metrics['mean_context_recall'] >= 0.3:
    print("~ Moderate recall: Some relevant types missing")
else:
    print("✗ Low recall: Many relevant chunk types not found")

# F1
if metrics['f1_score'] >= 0.6:
    print(f"✓ Good F1 score: {metrics['f1_score']:.2%} (threshold: 60%)")
else:
    print(f"~ Below threshold: {metrics['f1_score']:.2%} (threshold: 60%)")

## 7. With LLM Metrics (Optional)

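If the `OPENROUTER_API_KEY` environment variable is set, you can enable the LLM-based metrics (faithfulness and answer relevancy). The example runs only the first three test cases to keep the demo short.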

In [None]:
# Uncomment to run with LLM metrics

# from construction_rag import OpenRouterLLM
# 
# if os.environ.get("OPENROUTER_API_KEY"):
#     llm = OpenRouterLLM()
#     evaluator_llm = RAGEvaluator(rag, llm=llm)
#     results_llm = evaluator_llm.evaluate_all(TEST_CASES[:3])  # Just 3 for demo
#     
#     print("\nLLM Metrics:")
#     print(f"  Faithfulness:      {results_llm['aggregate_metrics']['mean_faithfulness']:.2%}")
#     print(f"  Answer Relevancy:  {results_llm['aggregate_metrics']['mean_answer_relevancy']:.2%}")

## 8. Save Results

In [None]:
import json

# Save results to file
with open("evaluation_results.json", "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2)

print("Results saved to evaluation_results.json")
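
`json.dump` raises `TypeError` when it meets a value it cannot encode, for example a NumPy scalar or a set. Whether `evaluate_all` ever returns such values is not stated in this notebook. If it does, a `default` hook is one way around it. The helper names `to_jsonable` and `save_results` below are hypothetical.

```python
import json


def to_jsonable(obj):
    """Fallback for json.dump: convert values the encoder does not know."""
    # NumPy scalars and arrays expose .tolist(), which returns plain Python values
    if hasattr(obj, "tolist") and callable(obj.tolist):
        return obj.tolist()
    if isinstance(obj, (set, frozenset)):
        return sorted(obj, key=str)
    # Last resort: keep a readable string instead of failing
    return str(obj)


def save_results(results, path):
    """Write results as indented JSON, converting unknown types on the way."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, default=to_jsonable)
```

The hook is only called for objects the encoder rejects, so results that are already plain dicts, lists, strings, and numbers are written unchanged.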