# PDF RAG System Evaluation Framework

This framework evaluates PDF-based Retrieval-Augmented Generation (RAG) systems. It scores four aspects of RAG performance: retrieval relevance, answer faithfulness, context coverage, and answer quality.



## Prerequisites
- Python 3.10 or lower (the setup commands below assume `python3.10` is on your PATH)
- macOS environment

## Setup Instructions

### 1. Create and Activate Virtual Environment
```bash
# Create virtual environment
python3.10 -m venv venv

# Activate virtual environment
source venv/bin/activate
```

### 2. Install PyTorch
Install PyTorch, torchvision, and torchaudio first:
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
```

### 3. Install Jupyter and Setup Kernel
```bash
# Install Jupyter and IPython kernel
pip install jupyter ipykernel

# Register the virtual environment as a Jupyter kernel
python -m ipykernel install --user --name=venv --display-name "Python (venv)"

# Start Jupyter Lab
jupyter lab
```

### 4. Additional Setup Notes
- Make sure to select the "Python (venv)" kernel in your Jupyter notebook
- The kernel name will appear in the top right corner of your notebook
- You can switch kernels at any time using the kernel menu

### Troubleshooting
- If you encounter any issues with the kernel, try:
  1. Restarting the kernel
  2. Rerunning the kernel installation command
  3. Verifying that your virtual environment is activated
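
One quick way to carry out step 3 from inside a notebook is to ask the interpreter where it is running from. The sketch below uses only the standard library; the helper name `in_virtualenv` is made up for illustration.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Running inside a virtual environment:", in_virtualenv())
```

If the interpreter path does not point into your `venv` directory, the notebook is probably using a different kernel than the one registered above.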

In [1]:
pip install PyPDF2 sentence-transformers rank-bm25 llama-cpp-python fpdf ollama rouge_score tqdm scikit-learn numpy

Note: you may need to restart the kernel to use updated packages.


In [2]:
from typing import List, Dict, Any, Tuple
from dataclasses import dataclass
from sentence_transformers import SentenceTransformer, util
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from rouge_score import rouge_scorer
import re
from tqdm import tqdm


### PDFRAGResult Class
```python
@dataclass
class PDFRAGResult:
    query: str
    contexts: List[Dict[str, Any]]
    answer: str
    ground_truth: str = ""
    context_scores: List[float] = None
```
This dataclass encapsulates the results of a RAG query, including:
- The original query
- Retrieved contexts with PDF metadata
- Generated answer
- Ground truth (if available)
- Context relevance scores

In [3]:
@dataclass
class PDFRAGResult:
    query: str
    contexts: List[Dict[str, Any]]  # Each dict carries 'content', 'filename', and 'page'
    answer: str
    ground_truth: str = ""
    context_scores: List[float] = None
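
To make the expected shape of a result concrete, here is a minimal sketch. `ExampleRAGResult` is a hypothetical stand-in that mirrors the fields of `PDFRAGResult` so the snippet runs on its own. The `default_factory` default is one possible alternative to `None`, not what the class above does. The context keys `content`, `filename`, and `page` are the ones the evaluator reads.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ExampleRAGResult:
    """Hypothetical stand-in mirroring the fields of PDFRAGResult."""
    query: str
    contexts: List[Dict[str, Any]]
    answer: str
    ground_truth: str = ""
    # default_factory gives every instance its own empty list
    context_scores: List[float] = field(default_factory=list)


example = ExampleRAGResult(
    query="What are the rate limits for the REST API?",
    contexts=[{
        "content": "Maximum 1000 requests per second per TV-API-KEY.",
        "filename": "api_docs.pdf",  # used for source diversity
        "page": 1,                   # used for source attribution
    }],
    answer="The limit is 1000 requests per second per key (page 1).",
)

print(example.contexts[0]["filename"], example.context_scores)
```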


# PDF RAG Evaluator

The `PDFRAGEvaluator` class is designed to assess how well a RAG (Retrieval-Augmented Generation) system works with PDF documents. It evaluates four main aspects:

1. **Answer Quality**: How good is the generated answer compared to the expected answer?
   - Uses ROUGE scores and semantic similarity
   - Checks if the answer length is appropriate

2. **Relevance**: Are the retrieved PDF chunks related to the question?
   - Measures how well retrieved content matches the query
   - Uses semantic similarity to score relevance

3. **Faithfulness**: Does the answer stick to the source material?
   - Checks if the answer uses information from the sources
   - Detects if the answer makes up information not in the sources
   - Verifies if the answer properly cites PDF pages

4. **Context Coverage**: How well does the system use different sources?
   - Measures if retrieved content is too repetitive
   - Checks if information comes from various PDF sources

The evaluator uses sentence transformers for semantic understanding and provides scores that help improve RAG system performance.
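
The lexical part of the faithfulness check can be illustrated without any model. The following is a simplified sketch, assuming whitespace tokenization and a "page N" citation style as in the evaluator code; the function name `lexical_faithfulness` and the sample strings are made up for illustration.

```python
import re
from typing import Dict, List


def lexical_faithfulness(answer: str, context: str, pages: List[int]) -> Dict[str, float]:
    """Toy version of the lexical checks: token overlap, novelty, page citations."""
    # Lowercase and split on whitespace; punctuation stays attached to tokens
    context_words = set(context.lower().split())
    answer_words = set(answer.lower().split())
    if not answer_words:
        return {"lexical_overlap": 0.0, "novelty_ratio": 1.0, "source_attribution": 0.0}

    # Share of answer tokens found in the context, and the share that is new
    overlap = len(answer_words & context_words) / len(answer_words)
    novelty = len(answer_words - context_words) / len(answer_words)

    # Fraction of retrieved pages that the answer cites as "page N"
    mentioned = set(re.findall(r"page\s+\d+", answer.lower()))
    actual = {f"page {p}" for p in pages}
    attribution = len(mentioned & actual) / len(actual) if actual else 0.0

    return {
        "lexical_overlap": overlap,
        "novelty_ratio": novelty,
        "source_attribution": attribution,
    }


scores = lexical_faithfulness(
    answer="The limit is 1000 requests per second, see page 1.",
    context="Maximum 1000 requests per second per TV-API-KEY.",
    pages=[1],
)
print(scores)
```

In this example only 3 of the 10 distinct answer tokens appear in the context, partly because `second,` and `second` are different tokens once punctuation is attached. A high novelty ratio is therefore a hint of hallucination rather than proof.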

In [4]:
class PDFRAGEvaluator:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.embedding_model = SentenceTransformer(model_name)
        self.rouge_scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
        
    def evaluate_answer_quality(self, answer: str, ground_truth: str) -> Dict[str, float]:
        """
        Evaluate the quality of the generated answer against ground truth
        """
        # Calculate ROUGE scores
        rouge_scores = self.rouge_scorer.score(ground_truth, answer)
        
        # Calculate semantic similarity using embeddings
        answer_embedding = self.embedding_model.encode([answer])[0]
        truth_embedding = self.embedding_model.encode([ground_truth])[0]
        semantic_similarity = cosine_similarity([answer_embedding], [truth_embedding])[0][0]
        
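        # Calculate length ratio (answer length / ground truth length),
        # guarding against a whitespace-only ground truth
        truth_length = len(ground_truth.split())
        length_ratio = len(answer.split()) / truth_length if truth_length else 0.0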
        
        return {
            'rouge1': rouge_scores['rouge1'].fmeasure,
            'rouge2': rouge_scores['rouge2'].fmeasure,
            'rougeL': rouge_scores['rougeL'].fmeasure,
            'semantic_similarity': float(semantic_similarity),
            'length_ratio': length_ratio
        }

    def evaluate_relevance(self, query: str, contexts: List[Dict[str, Any]]) -> List[float]:
        """
        Evaluate semantic relevance of retrieved contexts to the query
        Now handles PDF chunks with metadata
        """
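        # Without contexts there is nothing to score; returning early also
        # avoids passing an empty array to cosine_similarity
        if not contexts:
            return []

        # Get embeddings
        query_embedding = self.embedding_model.encode([query])[0]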
        context_texts = [ctx['content'] for ctx in contexts]
        context_embeddings = self.embedding_model.encode(context_texts)
        
        # Calculate cosine similarities
        similarities = cosine_similarity([query_embedding], context_embeddings)[0]
        
        return similarities.tolist()

    def evaluate_faithfulness(self, answer: str, contexts: List[Dict[str, Any]]) -> Dict[str, float]:
        """
        Evaluate if the generated answer is faithful to the retrieved contexts
        Adapted for PDF chunks
        """
        # Combine contexts
        combined_context = " ".join([ctx['content'] for ctx in contexts])
        
        # Calculate ROUGE scores between answer and context
        rouge_scores = self.rouge_scorer.score(combined_context, answer)
        
        # Calculate lexical overlap
        context_words = set(combined_context.lower().split())
        answer_words = set(answer.lower().split())
        overlap = len(context_words.intersection(answer_words)) / len(answer_words) if answer_words else 0
        
        # Check for potential hallucination (words in answer not in context)
        novel_words = len(answer_words - context_words) / len(answer_words) if answer_words else 1
        
        # Check for source attribution
        source_mentions = self._evaluate_source_attribution(answer, contexts)
        
        return {
            'rouge1': rouge_scores['rouge1'].fmeasure,
            'rouge2': rouge_scores['rouge2'].fmeasure,
            'rougeL': rouge_scores['rougeL'].fmeasure,
            'lexical_overlap': overlap,
            'novelty_ratio': novel_words,
            'source_attribution': source_mentions
        }
    
    def _evaluate_source_attribution(self, answer: str, contexts: List[Dict[str, Any]]) -> float:
        """
        Evaluate if the answer properly attributes information to PDF sources
        """
        # Extract page numbers cited in the answer (e.g. "page 3")
        page_pattern = r"page\s+(\d+)"
        mentioned_pages = set(re.findall(page_pattern, answer.lower()))
        
        # Get actual page numbers from contexts, as strings to match the regex output
        actual_pages = set(str(ctx['page']) for ctx in contexts)
        
        # Calculate attribution score
        if not actual_pages:
            return 0.0
        
        return len(mentioned_pages.intersection(actual_pages)) / len(actual_pages)
    
    def evaluate_context_coverage(self, contexts: List[Dict[str, Any]]) -> Dict[str, float]:
        """
        Evaluate the diversity and redundancy of retrieved contexts
        Adapted for PDF chunks
        """
        if not contexts:
            return {'diversity': 0.0, 'redundancy': 0.0, 'source_diversity': 0.0}
            
        # Get embeddings for all contexts
        context_texts = [ctx['content'] for ctx in contexts]
        embeddings = self.embedding_model.encode(context_texts)
        
        # Calculate pairwise similarities
        similarities = cosine_similarity(embeddings)
        
        # Calculate diversity metrics
        n = len(contexts)
        if n < 2:
            return {'diversity': 1.0, 'redundancy': 0.0, 'source_diversity': 1.0}
            
        diversity_scores = []
        redundancy_scores = []
        
        for i in range(n):
            for j in range(i + 1, n):
                similarity = similarities[i][j]
                diversity_scores.append(1 - similarity)
                redundancy_scores.append(similarity)
        
        # Calculate source diversity (unique PDFs referenced)
        unique_sources = len(set([ctx['filename'] for ctx in contexts]))
        source_diversity = unique_sources / len(contexts)
                
        return {
            'diversity': float(np.mean(diversity_scores)),
            'redundancy': float(np.mean(redundancy_scores)),
            'source_diversity': source_diversity
        }
    
    def evaluate_result(self, result: PDFRAGResult) -> Dict[str, Any]:
        """
        Comprehensive evaluation of a single PDF RAG result
        """
        # Evaluate context relevance
        relevance_scores = self.evaluate_relevance(result.query, result.contexts)
        
        # Evaluate answer faithfulness
        faithfulness_metrics = self.evaluate_faithfulness(result.answer, result.contexts)
        
        # Evaluate context coverage
        coverage_metrics = self.evaluate_context_coverage(result.contexts)
        
        # Evaluate answer quality if ground truth is available
        quality_metrics = {}
        if result.ground_truth:
            quality_metrics = self.evaluate_answer_quality(result.answer, result.ground_truth)
        
        return {
            'relevance': {
                'scores': relevance_scores,
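                # 0.0 when nothing was retrieved (np.mean of an empty list is nan)
                'mean_relevance': float(np.mean(relevance_scores)) if relevance_scores else 0.0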
            },
            'faithfulness': faithfulness_metrics,
            'coverage': coverage_metrics,
            'quality': quality_metrics
        }
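
The coverage metrics computed by `evaluate_context_coverage` can be reproduced on toy vectors without loading an embedding model. This is a sketch under simplifying assumptions: the 2-D vectors stand in for real sentence embeddings, and the helper names `cosine` and `coverage` are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def coverage(embeddings: List[Sequence[float]], filenames: List[str]) -> Dict[str, float]:
    """Mean pairwise redundancy/diversity plus the share of distinct source files."""
    if not embeddings:
        return {"diversity": 0.0, "redundancy": 0.0, "source_diversity": 0.0}
    pairs = list(combinations(range(len(embeddings)), 2))
    if not pairs:  # a single chunk has nothing to be redundant with
        return {"diversity": 1.0, "redundancy": 0.0, "source_diversity": 1.0}
    sims = [cosine(embeddings[i], embeddings[j]) for i, j in pairs]
    redundancy = sum(sims) / len(sims)
    return {
        "diversity": 1.0 - redundancy,  # the mean of (1 - s) equals 1 - mean(s)
        "redundancy": redundancy,
        "source_diversity": len(set(filenames)) / len(filenames),
    }


# Two identical chunks from a.pdf and one unrelated chunk from b.pdf
toy = coverage([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], ["a.pdf", "a.pdf", "b.pdf"])
print(toy)
```

Because diversity is defined as one minus the pairwise similarity, the `diversity` and `redundancy` scores always sum to 1 when there are at least two chunks, so they carry the same information.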

# PDF RAG Optimizer

The `PDFRAGOptimizer` class analyzes evaluation results and suggests improvements for your RAG system. It looks for common issues and provides specific recommendations in four key areas:

1. **Relevance Issues** (if mean relevance < 0.7)
   - Suggests adjusting PDF chunk sizes
   - Recommends fine-tuning search weights
   - Advises reviewing PDF parsing

2. **Hallucination Problems** (if novelty ratio > 0.3)
   - Suggests reviewing chunk sizes
   - Recommends adjusting LLM temperature
   - Advises strengthening source attribution

3. **Redundancy Issues** (if redundancy > 0.3)
   - Suggests adjusting chunk overlap
   - Recommends implementing deduplication
   - Advises reviewing segmentation

4. **Source Diversity Problems** (if source diversity < 0.5)
   - Suggests favoring different PDF sources
   - Recommends reviewing relevance scoring
   - Advises implementing diversity penalties

Each suggestion includes practical steps to improve the specific aspect of RAG system performance.
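
The four thresholds above can also be read as a small rule table. The sketch below is one possible way to express them, not the optimizer's actual implementation; `RULES`, `flagged_areas`, and the sample values are illustrative assumptions, while the dictionary keys follow the structure returned by `evaluate_result`.

```python
from typing import Any, Callable, Dict, List, Tuple

# (label, how to read the metric, predicate that flags a problem)
Rule = Tuple[str, Callable[[Dict[str, Any]], float], Callable[[float], bool]]

RULES: List[Rule] = [
    ("relevance", lambda r: r["relevance"]["mean_relevance"], lambda v: v < 0.7),
    ("hallucination", lambda r: r["faithfulness"]["novelty_ratio"], lambda v: v > 0.3),
    ("redundancy", lambda r: r["coverage"]["redundancy"], lambda v: v > 0.3),
    ("source_diversity", lambda r: r["coverage"]["source_diversity"], lambda v: v < 0.5),
]


def flagged_areas(results: Dict[str, Any]) -> List[str]:
    """Return the labels of all rules whose threshold is violated."""
    return [label for label, read, is_problem in RULES if is_problem(read(results))]


# Illustrative values for a single evaluated test case
sample = {
    "relevance": {"mean_relevance": 0.64},
    "faithfulness": {"novelty_ratio": 0.0},
    "coverage": {"redundancy": 0.0, "source_diversity": 1.0},
}
print(flagged_areas(sample))
```

Keeping the thresholds in one table makes it easier to tune them for a specific corpus, since the cutoffs (0.7, 0.3, 0.5) are heuristics rather than calibrated values.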

In [5]:
class PDFRAGOptimizer:
    def __init__(self, evaluator: PDFRAGEvaluator):
        self.evaluator = evaluator
        
    def suggest_improvements(self, evaluation_results: Dict[str, Any]) -> List[str]:
        """
        Analyze evaluation results and suggest improvements for PDF RAG
        """
        suggestions = []
        
        # Analyze relevance
        mean_relevance = evaluation_results['relevance']['mean_relevance']
        if mean_relevance < 0.7:
            suggestions.append(
                "Consider improving retrieval relevance:\n"
                "- Adjust chunk size for PDF processing\n"
                "- Fine-tune hybrid search weights\n"
                "- Review PDF parsing quality"
            )
            
        # Analyze faithfulness
        if evaluation_results['faithfulness']['novelty_ratio'] > 0.3:
            suggestions.append(
                "High novelty ratio indicates potential hallucination:\n"
                "- Review PDF chunk size\n"
                "- Adjust LLM temperature\n"
                "- Strengthen source attribution in prompts"
            )
            
        # Analyze coverage
        if evaluation_results['coverage']['redundancy'] > 0.3:
            suggestions.append(
                "High context redundancy detected:\n"
                "- Adjust chunk overlap in PDF processing\n"
                "- Implement cross-document deduplication\n"
                "- Review PDF segmentation strategy"
            )
            
        # Analyze source diversity
        if evaluation_results['coverage']['source_diversity'] < 0.5:
            suggestions.append(
                "Low source diversity:\n"
                "- Adjust retrieval to favor different PDF sources\n"
                "- Review document relevance scoring\n"
                "- Consider document-level diversity penalties"
            )
        
        return suggestions


# PDF RAG System Evaluation Function

The `evaluate_pdf_rag_system` function is the main testing pipeline that runs and evaluates your RAG system. Here's what it does:

1. **Test Execution**
   - Takes a RAG system and test cases as input
   - Runs each test query through the RAG system
   - Collects responses and used contexts

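2. **Metric Collection**
   - Tracks one headline score per aspect:
     - Relevance: mean cosine similarity between the query and the retrieved chunks
     - Faithfulness: ROUGE-L F-measure between the answer and the combined contexts
     - Coverage: mean pairwise diversity of the retrieved chunks
     - Answer quality: semantic similarity to the ground truth (only when a ground truth is provided)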

3. **Results**
   - Returns a dictionary with two entries:
     - `detailed_results`: a `(test, response, evaluation)` tuple for each test case
     - `aggregate_metrics`: mean scores across all tests

This function helps you understand how well your RAG system performs across multiple test cases and different evaluation aspects.
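
The aggregation step can be sketched in isolation. This simplified version assumes per-case evaluation dictionaries shaped like the output of `evaluate_result` and at least one test case; the function name `aggregate` and the sample numbers are made up for illustration.

```python
from statistics import mean
from typing import Any, Dict, List, Optional


def aggregate(evaluations: List[Dict[str, Any]]) -> Dict[str, Optional[float]]:
    """Average one headline score per aspect across all evaluated test cases."""
    relevance = [e["relevance"]["mean_relevance"] for e in evaluations]
    faithfulness = [e["faithfulness"]["rougeL"] for e in evaluations]
    coverage = [e["coverage"]["diversity"] for e in evaluations]
    # Quality exists only for test cases that supplied a ground truth
    quality = [e["quality"]["semantic_similarity"] for e in evaluations if e["quality"]]
    return {
        "mean_relevance": mean(relevance),
        "mean_faithfulness": mean(faithfulness),
        "mean_coverage": mean(coverage),
        "mean_quality": mean(quality) if quality else None,
    }


evaluations = [
    {"relevance": {"mean_relevance": 0.6}, "faithfulness": {"rougeL": 1.0},
     "coverage": {"diversity": 1.0}, "quality": {"semantic_similarity": 0.9}},
    {"relevance": {"mean_relevance": 0.8}, "faithfulness": {"rougeL": 0.5},
     "coverage": {"diversity": 1.0}, "quality": {}},  # no ground truth given
]
summary = aggregate(evaluations)
print(summary)
```

Note that `mean_quality` is averaged over fewer cases than the other metrics whenever some test cases lack a ground truth, and it is `None` when none of them has one.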

In [6]:
def evaluate_pdf_rag_system(rag_system: Any, test_cases: List[Dict[str, str]], 
                          evaluator: PDFRAGEvaluator) -> Dict[str, Any]:
    """
    Evaluate a PDF RAG system using a set of test cases
    """
    results = []
    metrics = {
        'relevance': [],
        'faithfulness': [],
        'coverage': [],
        'quality': []
    }
    
    for test in tqdm(test_cases, desc="Evaluating test cases"):
        # Generate RAG response
        response = rag_system.generate_response(test['query'])
        
        # Create evaluation result
        result = PDFRAGResult(
            query=test['query'],
            contexts=response['used_chunks'],
            answer=response['response'],
            ground_truth=test.get('ground_truth', '')
        )
        
        # Evaluate
        evaluation = evaluator.evaluate_result(result)
        results.append((test, response, evaluation))
        
        # Aggregate metrics
        metrics['relevance'].append(evaluation['relevance']['mean_relevance'])
        metrics['faithfulness'].append(evaluation['faithfulness']['rougeL'])
        metrics['coverage'].append(evaluation['coverage']['diversity'])
        if evaluation['quality']:
            metrics['quality'].append(evaluation['quality'].get('semantic_similarity', 0))
    
    # Calculate aggregate metrics
    aggregate_metrics = {
        'mean_relevance': np.mean(metrics['relevance']),
        'mean_faithfulness': np.mean(metrics['faithfulness']),
        'mean_coverage': np.mean(metrics['coverage']),
        'mean_quality': np.mean(metrics['quality']) if metrics['quality'] else None
    }
    
    return {
        'detailed_results': results,
        'aggregate_metrics': aggregate_metrics
    }


`TECHNOVISION_TEST_CASES` is a collection of sample test queries for evaluating the RAG system against technical documentation. Each test case pairs a query with its ground truth answer. The four cases cover hardware specifications, an incident report, API rate limits, and regional system status, so they check whether the RAG system can accurately retrieve and present technical requirements, outage details, API limitations, and region-specific issues from the documentation.

In [7]:
# Example test cases for TechnoVision AI documentation
TECHNOVISION_TEST_CASES = [
    {
        'query': "What are the hardware requirements for NeuroStack platform?",
        'ground_truth': "TechnoVision Custom Silicon including TV-GPU-2024 series, NeuroStack Accelerator Cards, and minimum 128GB TechnoVision Certified Memory."
    },
    {
        'query': "What happened during the March 10, 2024 outage?",
        'ground_truth': "TechnoVision API rate limiter malfunction caused an outage affecting 4 enterprise customers in APAC, resolved with Emergency patch TV-Hotfix-2024-03."
    },
    {
        'query': "What are the rate limits for TechnoVision's REST API?",
        'ground_truth': "1000 requests per second per TV-API-KEY"
    },
    {
        'query': "What issues might a Singapore-based customer face in March 2024?",
        'ground_truth': "Singapore cluster at 92% capacity, known bug in version 3.2.1-beta affecting Asian region deployments, and scheduled maintenance on March 25, 2024."
    }
]


The `MockPDFRAGSystem` is a simplified implementation of a RAG system used for testing purposes. It maintains a predefined collection of PDF chunks organized by topics (hardware, outage, api, and singapore), where each chunk contains the actual content, filename, and page number. This structure simulates a real PDF-based knowledge base with technical documentation about TechnoVision's systems and services.

The system implements a basic retrieval and response generation mechanism through its `generate_response` method. When given a query, it uses simple keyword matching to find relevant chunks from its collection, and then generates a response by concatenating the content of matched chunks. While this is a simplified approach compared to production RAG systems, it provides a practical way to test the evaluation framework with realistic technical content and document structure.
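
The keyword matching described above amounts to a substring test of each topic key against the lowercased query. A minimal standalone sketch follows; the `CHUNKS` dictionary, its contents, and the `retrieve` function are hypothetical and only mimic the mock system's lookup.

```python
from typing import Dict, List

# Hypothetical topic -> chunk texts mapping
CHUNKS: Dict[str, List[str]] = {
    "api": ["API rate limit text"],
    "outage": ["Incident report text"],
}


def retrieve(query: str, chunks_by_topic: Dict[str, List[str]]) -> List[str]:
    """Return the chunks of every topic whose key occurs in the query."""
    query_lower = query.lower()
    matched: List[str] = []
    for topic, chunks in chunks_by_topic.items():
        # Plain substring test: no tokenization, stemming, or ranking
        if topic.lower() in query_lower:
            matched.extend(chunks)
    return matched


print(retrieve("What caused the outage?", CHUNKS))
print(retrieve("How rapid is the rollout?", CHUNKS))  # "api" is a substring of "rapid"
print(retrieve("Tell me about pricing", CHUNKS))      # no topic key matches
```

The second query shows a false positive from substring matching, and the third returns no chunks at all. Both cases are worth keeping in mind when reading the scores produced with the mock system.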

In [8]:
# Mock RAG system for testing
class MockPDFRAGSystem:
    def __init__(self):
        self.pdf_chunks = {
            "hardware": [{
                "content": "TechnoVision Custom Silicon requirements include TV-GPU-2024 series and NeuroStack Accelerator Cards. Minimum 128GB TechnoVision Certified Memory required for optimal performance.",
                "filename": "hardware_specs.pdf",
                "page": 1
            }],
            "outage": [{
                "content": "March 10, 2024 Incident Report: TechnoVision API rate limiter malfunction caused service disruption. Impact: 4 enterprise customers in APAC region affected. Resolution: Emergency patch TV-Hotfix-2024-03 deployed.",
                "filename": "incident_report.pdf",
                "page": 1
            }],
            "api": [{
                "content": "TechnoVision REST API Rate Limits: Maximum 1000 requests per second per TV-API-KEY. Enterprise customers may request limit increases.",
                "filename": "api_docs.pdf",
                "page": 1
            }],
            "singapore": [{
                "content": "Singapore Region Status (March 2024): Cluster utilization at 92% capacity. Known issues: Version 3.2.1-beta bug affecting Asian deployments. Scheduled maintenance: March 25, 2024.",
                "filename": "region_status.pdf",
                "page": 1
            }]
        }
    
    def generate_response(self, query: str) -> dict:
        # Simple keyword-based retrieval for demo
        used_chunks = []
        for key, chunks in self.pdf_chunks.items():
            if key.lower() in query.lower():
                used_chunks.extend(chunks)
        
        # Simple response generation by concatenating chunks
        response = " ".join([chunk["content"] for chunk in used_chunks])
        
        return {
            "response": response,
            "used_chunks": used_chunks
        }



# RAG Evaluation Runner

The `run_rag_evaluation` function provides a complete pipeline for testing and analyzing a RAG system's performance. The first three items below are its steps; the fourth summarizes what it prints:

1. **Setup**
   - Creates the evaluator, optimizer, and mock RAG system
   - Prepares all components for testing

2. **Evaluation**
   - Runs the evaluation using test cases
   - Collects detailed metrics and results

3. **Results Reporting**
   - Shows aggregate metrics across all tests
   - Displays optimization suggestions based on the first test case's evaluation
   - Prints detailed analysis of the first test case

4. **Output Format**
   - Metrics: relevance, faithfulness, coverage scores
   - Suggestions for system improvements
   - Comparison between generated answers and ground truth

The function serves as a one-stop solution for running tests, analyzing performance, and getting actionable feedback for improving your RAG system.
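
Since the runner derives its suggestions from a single test case, it can be useful to collect them over every case instead. The sketch below shows one way to do that. It assumes `detailed_results` holds `(test, response, evaluation)` tuples as returned by `evaluate_pdf_rag_system`; `suggestions_for_all_cases`, `toy_suggest`, and the sample data are hypothetical stand-ins so the snippet runs on its own.

```python
from collections import Counter
from typing import Any, Callable, Dict, List, Tuple

DetailedResult = Tuple[Dict[str, Any], Dict[str, Any], Dict[str, Any]]


def suggestions_for_all_cases(
    detailed_results: List[DetailedResult],
    suggest: Callable[[Dict[str, Any]], List[str]],
) -> Counter:
    """Count how many test cases trigger each suggestion."""
    counts: Counter = Counter()
    for _test, _response, evaluation in detailed_results:
        counts.update(suggest(evaluation))
    return counts


def toy_suggest(evaluation: Dict[str, Any]) -> List[str]:
    """Stand-in for an optimizer's suggest_improvements method."""
    if evaluation["relevance"]["mean_relevance"] < 0.7:
        return ["Improve retrieval relevance"]
    return []


toy_results: List[DetailedResult] = [
    ({"query": "q1"}, {"response": "a1"}, {"relevance": {"mean_relevance": 0.64}}),
    ({"query": "q2"}, {"response": "a2"}, {"relevance": {"mean_relevance": 0.82}}),
    ({"query": "q3"}, {"response": "a3"}, {"relevance": {"mean_relevance": 0.55}}),
]
counts = suggestions_for_all_cases(toy_results, toy_suggest)
for suggestion, n in counts.most_common():
    print(f"{n} of {len(toy_results)} cases: {suggestion}")
```

With the real classes, `optimizer.suggest_improvements` could be passed as the `suggest` argument.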

In [9]:
# Test the evaluation system
def run_rag_evaluation():
    # Initialize components
    evaluator = PDFRAGEvaluator()
    optimizer = PDFRAGOptimizer(evaluator)
    rag_system = MockPDFRAGSystem()
    
    # Run evaluation
    print("Starting RAG system evaluation...")
    evaluation_results = evaluate_pdf_rag_system(rag_system, TECHNOVISION_TEST_CASES, evaluator)
    
    # Print aggregate metrics
    print("\nAggregate Metrics:")
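    for metric, value in evaluation_results['aggregate_metrics'].items():
        # mean_quality is None when no test case provides a ground truth
        formatted = f"{value:.3f}" if value is not None else "n/a"
        print(f"{metric}: {formatted}")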
    
    # Get optimization suggestions
    print("\nOptimization Suggestions:")
    sample_detailed_eval = evaluation_results['detailed_results'][0][2]  # Get first test case evaluation
    suggestions = optimizer.suggest_improvements(sample_detailed_eval)
    for suggestion in suggestions:
        print(f"\n{suggestion}")
    
    # Print detailed results for first test case
    print("\nDetailed Results for First Test Case:")
    first_test = evaluation_results['detailed_results'][0]
    test_case, response, evaluation = first_test
    
    print(f"\nQuery: {test_case['query']}")
    print(f"Generated Answer: {response['response']}")
    print(f"Ground Truth: {test_case['ground_truth']}")
    print("\nEvaluation Metrics:")
    print(f"Relevance: {evaluation['relevance']['mean_relevance']:.3f}")
    print(f"Faithfulness (RougeL): {evaluation['faithfulness']['rougeL']:.3f}")
    print(f"Coverage Diversity: {evaluation['coverage']['diversity']:.3f}")



In [10]:
if __name__ == "__main__":
    run_rag_evaluation()

Starting RAG system evaluation...


Evaluating test cases: 100%|███████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.18it/s]


Aggregate Metrics:
mean_relevance: 0.595
mean_faithfulness: 1.000
mean_coverage: 1.000
mean_quality: 0.879

Optimization Suggestions:

Consider improving retrieval relevance:
- Adjust chunk size for PDF processing
- Fine-tune hybrid search weights
- Review PDF parsing quality

Detailed Results for First Test Case:

Query: What are the hardware requirements for NeuroStack platform?
Generated Answer: TechnoVision Custom Silicon requirements include TV-GPU-2024 series and NeuroStack Accelerator Cards. Minimum 128GB TechnoVision Certified Memory required for optimal performance.
Ground Truth: TechnoVision Custom Silicon including TV-GPU-2024 series, NeuroStack Accelerator Cards, and minimum 128GB TechnoVision Certified Memory.

Evaluation Metrics:
Relevance: 0.640
Faithfulness (RougeL): 1.000
Coverage Diversity: 1.000



