# RAG Application Evaluation Using RAGAS

This notebook evaluates the performance and accuracy of the RAG-based support ticket system using the RAGAS framework.

## Metrics Evaluated:
- **Faithfulness**: How grounded the answer is in the retrieved context
- **Answer Relevancy**: How relevant the answer is to the question
- **Context Precision**: Whether the relevant retrieved contexts are ranked above the irrelevant ones
- **Context Recall**: Coverage of ground truth in retrieved contexts
- **Answer Correctness**: Semantic and factual correctness of the answer against the ground truth
- **Answer Similarity**: Semantic similarity to ground truth
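
To make the ranking-sensitive metric concrete, the sketch below computes Context Precision@K from a ranked list of binary relevance verdicts, following the formula given in the RAGAS documentation (precision@k averaged over the ranks that hold a relevant context). In RAGAS itself an LLM produces the verdicts; here they are supplied by hand, and the function name `context_precision_at_k` is illustrative, not part of the library.

```python
from typing import Sequence


def context_precision_at_k(relevance: Sequence[int]) -> float:
    """Average precision@k over the ranks k that hold a relevant context.

    `relevance` is a ranked list of 0/1 verdicts (1 = relevant), best rank first.
    """
    hits = 0          # relevant contexts seen so far
    weighted = 0.0    # running sum of precision@k at relevant ranks
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            weighted += hits / k
    return weighted / hits if hits else 0.0


# Same two relevant contexts, different ranking
print(context_precision_at_k([1, 1, 0]))  # relevant contexts ranked first
print(context_precision_at_k([0, 1, 1]))  # relevant contexts ranked last
```

The first call scores higher than the second even though both lists contain two relevant contexts, which is the behavior the metric is meant to capture.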

## 1. Installation and Setup

In [None]:
# Install required packages
# Note: Run this cell first if packages are not already installed
# !pip install ragas datasets langchain langchain-community langchain-groq langchain-google-genai sentence-transformers pandas matplotlib seaborn requests python-dotenv

In [1]:
# Import required libraries
import json
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from typing import List, Dict, Tuple
import warnings
warnings.filterwarnings('ignore')

# RAGAS imports
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    answer_correctness,
    answer_similarity
)

# LangChain imports for LLM setup (Groq and Gemini)
from langchain_groq import ChatGroq
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_community.embeddings import HuggingFaceEmbeddings

print("✅ All libraries imported successfully!")

✅ All libraries imported successfully!


In [2]:
# Configuration
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv('./rag-backend/.env')

# API Configuration
RAG_API_BASE_URL = "http://localhost:8000"  # Your FastAPI backend

# LLM API Keys - RAGAS will use these for evaluation
GROQ_API_KEY = os.getenv('GROQ_API_KEY', '')
GOOGLE_API_KEY = os.getenv('GOOGLE_API_KEY', '')  # For Gemini

# File paths
TICKETS_FILE = './rag-frontend/Json_Files/tickets_sample.json'
SAMPLE_QUERIES_FILE = './rag-frontend/Json_Files/sample.json'

# LLM Choice for RAGAS evaluation
LLM_PROVIDER1 = "groq"  # Options: "groq" or "gemini"
LLM_MODEL1 = "llama-3.3-70b-versatile"  # For Groq: llama-3.1-70b-versatile, mixtral-8x7b-32768
# For Gemini: gemini-1.5-pro, gemini-1.5-flash
LLM_PROVIDER2 = "gemini"
LLM_MODEL2 = "gemini-2.5-flash"

print(f"RAG API URL: {RAG_API_BASE_URL}")
print(f"LLM Provider: {LLM_PROVIDER1}")
print(f"LLM Model: {LLM_MODEL1}")
print(f"LLM Provider: {LLM_PROVIDER2}")
print(f"LLM Model: {LLM_MODEL2}")
print(f"Tickets file: {TICKETS_FILE}")
print(f"Sample queries file: {SAMPLE_QUERIES_FILE}")

RAG API URL: http://localhost:8000
LLM Provider: groq
LLM Model: llama-3.3-70b-versatile
LLM Provider: gemini
LLM Model: gemini-2.5-flash
Tickets file: ./rag-frontend/Json_Files/tickets_sample.json
Sample queries file: ./rag-frontend/Json_Files/sample.json


### LLM Configuration Guide

**Available Options:**

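1. **Groq (Recommended - Fast & Free)**
   - Set `LLM_PROVIDER1 = "groq"` and `LLM_MODEL1` to the model name
   - Add `GROQ_API_KEY` to `.env` file
   - Models: `llama-3.3-70b-versatile` (used in this notebook), `llama-3.1-70b-versatile`, `llama-3.1-8b-instant`, `mixtral-8x7b-32768`
   - Get API key: https://console.groq.com/

2. **Google Gemini (Alternative)**
   - Set `LLM_PROVIDER2 = "gemini"` and `LLM_MODEL2` to the model name
   - Add `GOOGLE_API_KEY` to `.env` file
   - Models: `gemini-2.5-flash` (used in this notebook), `gemini-1.5-pro`, `gemini-1.5-flash`
   - Get API key: https://ai.google.dev/

**Note:** RAGAS requires an LLM to evaluate responses. The setup cell in Section 6 uses Groq when `LLM_PROVIDER1` is `"groq"` and `GROQ_API_KEY` is set; otherwise it falls back to Gemini when `GOOGLE_API_KEY` is set.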
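
The choice between providers can be reduced to "use the first provider whose key is available". The sketch below is a simplification of the notebook's setup logic: it looks only at which keys are set and ignores the `LLM_PROVIDER*` variables. The names `PROVIDER_KEYS` and `pick_provider` are hypothetical helpers, not part of RAGAS or LangChain.

```python
from typing import Mapping, Optional, Sequence, Tuple

# (provider name, environment variable holding its key), in preference order
PROVIDER_KEYS: Sequence[Tuple[str, str]] = (
    ("groq", "GROQ_API_KEY"),
    ("gemini", "GOOGLE_API_KEY"),
)


def pick_provider(env: Mapping[str, str]) -> Optional[str]:
    """Return the first provider whose API key is present and non-empty."""
    for provider, key_name in PROVIDER_KEYS:
        if env.get(key_name, "").strip():
            return provider
    return None


print(pick_provider({"GROQ_API_KEY": "", "GOOGLE_API_KEY": "abc"}))  # gemini
```

In the notebook this would be called as `pick_provider(os.environ)` after `load_dotenv(...)` has run.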

## 2. Load Test Data

In [3]:
# Load tickets (ground truth data from vector DB)
with open(TICKETS_FILE, 'r', encoding='utf-8') as f:
    tickets_data = json.load(f)

print(f"✅ Loaded {len(tickets_data)} tickets from vector DB data")
print(f"\nFirst ticket example:")
print(f"ID: {tickets_data[0]['ticket_id']}")
print(f"Title: {tickets_data[0]['title']}")
print(f"Status: {tickets_data[0]['status']}")

✅ Loaded 50 tickets from vector DB data

First ticket example:
ID: TKT-0001
Title: Payment API Returning 500 Internal Server Error
Status: Resolved


In [4]:
# Load sample user queries
with open(SAMPLE_QUERIES_FILE, 'r', encoding='utf-8') as f:
    sample_queries = json.load(f)

print(f"✅ Loaded {len(sample_queries)} sample queries")
print(f"\nFirst query example:")
print(f"Title: {sample_queries[0]['title']}")
print(f"Description: {sample_queries[0]['description'][:100]}...")

✅ Loaded 20 sample queries

First query example:
Title: Payment API Throwing 500 Errors
Description: Payment processing API returning 500 internal server errors for all payment transactions. Started ab...


In [5]:
# Display data statistics
tickets_df = pd.DataFrame(tickets_data)
print("\n=== Tickets Dataset Statistics ===")
print(f"Total tickets: {len(tickets_df)}")
print(f"\nCategories:")
print(tickets_df['category'].value_counts())
print(f"\nSeverity:")
print(tickets_df['severity'].value_counts())
print(f"\nStatus:")
print(tickets_df['status'].value_counts())


=== Tickets Dataset Statistics ===
Total tickets: 50

Categories:
category
Monitoring        32
Infrastructure     6
API Issues         3
Performance        3
Security           2
Database           1
DevOps             1
Networking         1
Storage            1
Name: count, dtype: int64

Severity:
severity
High        23
Medium      16
Critical     8
Low          3
Name: count, dtype: int64

Status:
status
Resolved       30
In Progress    15
Open            5
Name: count, dtype: int64


## 3. Create Evaluation Dataset

Map sample queries to their corresponding ground truth solutions from the tickets database.

In [6]:
def create_evaluation_dataset(sample_queries: List[Dict], tickets_data: List[Dict]) -> List[Dict]:
    """
    Create evaluation dataset by mapping sample queries to ground truth from tickets.
    
    Each entry contains:
    - question: User query
    - ground_truth: Expected solution from ticket
    - expected_ticket_id: Ticket that should be retrieved
    """
    eval_dataset = []
    
    # Create mapping based on similar content
    # Assuming samples correspond to first N tickets
    for idx, sample in enumerate(sample_queries):
        if idx < len(tickets_data):
            ticket = tickets_data[idx]
            
            # Construct question from sample
            question = f"{sample['title']}. {sample['description']}"
            
            # Construct ground truth from ticket
            ground_truth = (
                f"Solution: {ticket['solution']}\n\n"
                f"Reasoning: {ticket['reasoning']}\n\n"
                f"Category: {ticket['category']}\n"
                f"Severity: {ticket['severity']}"
            )
            
            eval_dataset.append({
                'question': question,
                'ground_truth': ground_truth,
                'expected_ticket_id': ticket['ticket_id'],
                'expected_title': ticket['title'],
                'category': ticket['category'],
                'severity': ticket['severity']
            })
    
    return eval_dataset

# Create evaluation dataset
eval_dataset = create_evaluation_dataset(sample_queries, tickets_data)
print(f"✅ Created evaluation dataset with {len(eval_dataset)} test cases")
print(f"\nExample test case:")
print(f"Question: {eval_dataset[0]['question'][:100]}...")
print(f"Expected Ticket: {eval_dataset[0]['expected_ticket_id']}")

✅ Created evaluation dataset with 20 test cases

Example test case:
Question: Payment API Throwing 500 Errors. Payment processing API returning 500 internal server errors for all...
Expected Ticket: TKT-0001
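
Because the mapping above is purely positional (sample *i* is paired with ticket *i*), a cheap sanity check on the pairing can catch a reordered file. The sketch below flags pairs whose titles share few words, using Jaccard overlap of lowercase tokens. The helper names and the `min_overlap=0.2` cutoff are assumptions chosen for illustration, not values taken from the notebook.

```python
import re
from typing import Dict, List, Set


def title_tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens of a title."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def title_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two titles."""
    ta, tb = title_tokens(a), title_tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def flag_misaligned(samples: List[Dict], tickets: List[Dict],
                    min_overlap: float = 0.2) -> List[int]:
    """Indices of positional pairs whose titles overlap less than min_overlap."""
    return [
        i for i, (sample, ticket) in enumerate(zip(samples, tickets))
        if title_overlap(sample["title"], ticket["title"]) < min_overlap
    ]


# The first pair from the notebook output shares 3 of 9 distinct tokens
print(title_overlap("Payment API Throwing 500 Errors",
                    "Payment API Returning 500 Internal Server Error"))
```

Running `flag_misaligned(sample_queries, tickets_data)` before building the evaluation set would list the pairs worth inspecting by hand; a low overlap is a hint, not proof, of a wrong pairing.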


## 4. Query RAG System

Query the RAG API to get responses and retrieved contexts for evaluation.

In [7]:
def query_rag_system(question: str, api_url: str = RAG_API_BASE_URL) -> Dict:
    """
    Query the RAG system API and return the response with contexts.
    
    Returns:
        Dict with 'answer', 'contexts', and 'metadata'
    """
    try:
        # Use GET request with query parameters (matches your backend API)
        response = requests.get(
            f"{api_url}/api/v1/tickets/search",
            params={
                "query": question,
                "limit": 3,
                "collection_name": "SupportTickets"
            },
            timeout=30
        )
        
        if response.status_code == 200:
            data = response.json()
            
            # Extract contexts from retrieved documents
            contexts = []
            retrieved_ids = []
            
            # Your backend returns 'results' array with ticket data
            if 'results' in data:
                for doc in data['results']:
                    # Combine relevant fields for context
                    context = (
                        f"Ticket ID: {doc.get('ticket_id', 'N/A')}\n"
                        f"Title: {doc.get('title', '')}\n"
                        f"Description: {doc.get('description', '')}\n"
                        f"Solution: {doc.get('solution', '')}\n"
                        f"Reasoning: {doc.get('reasoning', '')}"
                    )
                    contexts.append(context)
                    retrieved_ids.append(doc.get('ticket_id', 'N/A'))
            
            return {
                'answer': data.get('answer', ''),  # Your API may not return 'answer' for search
                'contexts': contexts if contexts else ['No context retrieved'],
                'retrieved_ticket_ids': retrieved_ids,
                'success': True
            }
        else:
            print(f"⚠️  API returned status code: {response.status_code}")
            print(f"Response: {response.text[:200]}")
            return {
                'answer': 'Error: API request failed',
                'contexts': ['Error retrieving contexts'],
                'retrieved_ticket_ids': [],
                'success': False
            }
    
    except Exception as e:
        print(f"❌ Error querying RAG system: {str(e)}")
        return {
            'answer': f'Error: {str(e)}',
            'contexts': ['Error retrieving contexts'],
            'retrieved_ticket_ids': [],
            'success': False
        }

# Test the connection
print("Testing RAG API connection...")
print(f"Endpoint: {RAG_API_BASE_URL}/api/v1/tickets/search")
test_response = query_rag_system(eval_dataset[0]['question'])
print(f"\nAPI Status: {'✅ Connected' if test_response['success'] else '❌ Failed'}")
if test_response['success']:
    print(f"Contexts retrieved: {len(test_response['contexts'])}")
    print(f"Retrieved ticket IDs: {test_response['retrieved_ticket_ids']}")
else:
    print("\n⚠️  Troubleshooting:")
    print("  1. Ensure backend is running: docker-compose up")
    print("  2. Check health endpoint: http://localhost:8000/health")
    print("  3. Verify collection exists in Weaviate")

Testing RAG API connection...
Endpoint: http://localhost:8000/api/v1/tickets/search

API Status: ✅ Connected
Contexts retrieved: 3
Retrieved ticket IDs: ['TKT-0001', 'TKT-0009', 'TKT-0017']
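
The context-building step can be separated from the HTTP call so it can be checked without a running backend. The sketch below assumes the same payload shape the notebook reads (a `results` array of ticket dictionaries); `build_contexts`, `CONTEXT_FIELDS`, and the sample payload are hypothetical. Unlike the cell above, it omits fields that are empty instead of emitting blank lines.

```python
from typing import Any, Dict, List, Tuple

# Ticket fields folded into each context string, in display order
CONTEXT_FIELDS = ("title", "description", "solution", "reasoning")


def build_contexts(payload: Dict[str, Any]) -> Tuple[List[str], List[str]]:
    """Turn a search payload into (contexts, ticket_ids), skipping empty fields."""
    contexts: List[str] = []
    ticket_ids: List[str] = []
    for doc in payload.get("results", []):
        ticket_id = doc.get("ticket_id", "N/A")
        lines = [f"Ticket ID: {ticket_id}"]
        lines += [f"{name.title()}: {doc[name]}" for name in CONTEXT_FIELDS if doc.get(name)]
        contexts.append("\n".join(lines))
        ticket_ids.append(ticket_id)
    return contexts, ticket_ids


# Hypothetical payload shaped like the backend's 'results' array
sample_payload = {
    "results": [
        {"ticket_id": "TKT-0001",
         "title": "Payment API Returning 500 Internal Server Error",
         "solution": ""},
    ]
}
contexts, ids = build_contexts(sample_payload)
print(ids)
print(contexts[0])
```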


In [8]:
# Query RAG system for all test cases
print("Querying RAG system for all test cases...\n")

rag_responses = []
for idx, test_case in enumerate(eval_dataset):
    print(f"Processing {idx + 1}/{len(eval_dataset)}: {test_case['expected_ticket_id']}")
    
    response = query_rag_system(test_case['question'])
    
    rag_responses.append({
        'question': test_case['question'],
        'answer': response['answer'],
        'contexts': response['contexts'],
        'ground_truth': test_case['ground_truth'],
        'expected_ticket_id': test_case['expected_ticket_id'],
        'retrieved_ticket_ids': response['retrieved_ticket_ids'],
        'success': response['success']
    })

print(f"\n✅ Collected {len(rag_responses)} RAG responses")
print(f"Successful queries: {sum(1 for r in rag_responses if r['success'])}")

Querying RAG system for all test cases...

Processing 1/20: TKT-0001
Processing 2/20: TKT-0002
Processing 3/20: TKT-0003
Processing 4/20: TKT-0004
Processing 5/20: TKT-0005
Processing 6/20: TKT-0006
Processing 7/20: TKT-0007
Processing 8/20: TKT-0008
Processing 9/20: TKT-0009
Processing 10/20: TKT-0010
Processing 11/20: TKT-0011
Processing 12/20: TKT-0012
Processing 13/20: TKT-0013
Processing 14/20: TKT-0014
Processing 15/20: TKT-0015
Processing 16/20: TKT-0016
Processing 17/20: TKT-0017
Processing 18/20: TKT-0018
Processing 19/20: TKT-0019
Processing 20/20: TKT-0020

✅ Collected 20 RAG responses
Successful queries: 20


## 5. Prepare Dataset for RAGAS Evaluation

In [9]:
# Convert to RAGAS format
ragas_data = {
    'question': [r['question'] for r in rag_responses],
    'answer': [r['answer'] for r in rag_responses],
    'contexts': [r['contexts'] for r in rag_responses],
    'ground_truth': [r['ground_truth'] for r in rag_responses]
}

# Create HuggingFace Dataset
ragas_dataset = Dataset.from_dict(ragas_data)

print(f"✅ Created RAGAS dataset with {len(ragas_dataset)} samples")
print(f"\nDataset structure:")
print(ragas_dataset)

✅ Created RAGAS dataset with 20 samples

Dataset structure:
Dataset({
    features: ['question', 'answer', 'contexts', 'ground_truth'],
    num_rows: 20
})
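
Before spending LLM calls on evaluation, it can help to check the column dictionary itself. `query_rag_system` falls back to an empty string when the search response carries no `answer`, and answer-based metrics computed on empty text are not meaningful. The sketch below is one possible pre-flight check; `validate_ragas_columns` is a hypothetical helper, not a RAGAS function.

```python
from typing import Dict, List

REQUIRED_COLUMNS = ("question", "answer", "contexts", "ground_truth")


def validate_ragas_columns(data: Dict[str, list]) -> List[str]:
    """Return human-readable problems; an empty list means the data looks usable."""
    missing = [name for name in REQUIRED_COLUMNS if name not in data]
    if missing:
        return [f"missing columns: {missing}"]

    problems: List[str] = []
    lengths = {name: len(data[name]) for name in REQUIRED_COLUMNS}
    if len(set(lengths.values())) != 1:
        problems.append(f"column lengths differ: {lengths}")

    # Rows whose answer is empty or whitespace only
    empty_answers = [i for i, a in enumerate(data["answer"]) if not str(a).strip()]
    if empty_answers:
        problems.append(f"empty answers at rows {empty_answers}")

    # Each contexts entry must be a list of strings
    bad_contexts = [
        i for i, ctx in enumerate(data["contexts"])
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx)
    ]
    if bad_contexts:
        problems.append(f"contexts must be lists of strings; bad rows {bad_contexts}")
    return problems


toy = {
    "question": ["q1", "q2"],
    "answer": ["restart the service", ""],
    "contexts": [["ctx a"], ["ctx b"]],
    "ground_truth": ["gt1", "gt2"],
}
print(validate_ragas_columns(toy))
```

Calling `validate_ragas_columns(ragas_data)` before `Dataset.from_dict` would surface empty answers early.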


## 6. Run RAGAS Evaluation

Evaluate the RAG system using multiple metrics.

In [10]:
# Configure LLM for RAGAS evaluation (Groq or Gemini)
print("=" * 60)
print("CONFIGURING LLM FOR RAGAS EVALUATION")
print("=" * 60)

llm = None

if LLM_PROVIDER1.lower() == "groq" and GROQ_API_KEY:
    print(f"✅ Using Groq LLM: {LLM_MODEL1}")
    try:
        llm = ChatGroq(
            model=LLM_MODEL1,
            api_key=GROQ_API_KEY,
            temperature=0,
            max_retries=3,
        )
        print("✅ Groq LLM initialized successfully!")
    except Exception as e:
        print(f"❌ Error initializing Groq: {e}")

elif LLM_PROVIDER2.lower() == "gemini" and GOOGLE_API_KEY:
    print(f"✅ Using Google Gemini: {LLM_MODEL2}")
    try:
        llm = ChatGoogleGenerativeAI(
            model=LLM_MODEL2 if LLM_MODEL2.startswith("gemini") else "gemini-1.5-flash",
            google_api_key=GOOGLE_API_KEY,
            temperature=0,
            max_retries=3,
        )
        print("✅ Gemini LLM initialized successfully!")
    except Exception as e:
        print(f"❌ Error initializing Gemini: {e}")

else:
    print("⚠️  No valid API key found!")
    print("Please set one of the following in your .env file:")
    print("  - GROQ_API_KEY (for Groq/Llama models)")
    print("  - GOOGLE_API_KEY (for Gemini models)")
    print("\nRAGAS requires an LLM for evaluation.")

# Initialize embeddings for RAGAS
print("\n📊 Initializing embeddings model...")
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)
print("✅ Embeddings initialized!")

print("=" * 60)

CONFIGURING LLM FOR RAGAS EVALUATION
✅ Using Groq LLM: llama-3.3-70b-versatile
✅ Groq LLM initialized successfully!

📊 Initializing embeddings model...
✅ Embeddings initialized!


In [11]:
# Test LLM connection
if llm:
    print("\n🧪 Testing LLM connection...")
    try:
        test_response = llm.invoke("Say 'Hello! LLM is working.' in one sentence.")
        print(f"✅ LLM Response: {test_response.content}")
        print("✅ LLM is ready for RAGAS evaluation!")
    except Exception as e:
        print(f"❌ LLM test failed: {e}")
        print("Please check your API key and internet connection.")
else:
    print("⚠️  Skipping LLM test - No LLM configured")


🧪 Testing LLM connection...
✅ LLM Response: Hello, LLM is working.
✅ LLM is ready for RAGAS evaluation!


In [12]:
# Define metrics to evaluate
metrics = [
    faithfulness,           # Is answer grounded in retrieved context?
    answer_relevancy,       # Does answer address the question?
    context_precision,      # Are relevant contexts ranked higher?
    context_recall,         # Do contexts cover the ground truth?
    answer_correctness,     # Semantic + factual correctness
    answer_similarity       # Semantic similarity to ground truth
]

print("Metrics to evaluate:")
for metric in metrics:
    print(f"  - {metric.name}")

Metrics to evaluate:
  - faithfulness
  - answer_relevancy
  - context_precision
  - context_recall
  - answer_correctness
  - answer_similarity


In [None]:
# Run RAGAS evaluation
print("\n🚀 Running RAGAS evaluation...\n")
print("This may take a few minutes...\n")

if llm is None:
    print("❌ Cannot run evaluation: No LLM configured!")
    print("Please configure GROQ_API_KEY or GOOGLE_API_KEY in your .env file")
    results = None
else:
    try:
        # Run evaluation with configured LLM
        results = evaluate(
            ragas_dataset,
            metrics=metrics,
            llm=llm,
            embeddings=embeddings,
        )
        
        print("\n✅ Evaluation completed successfully!")
        
    except Exception as e:
        print(f"\n❌ Evaluation failed: {str(e)}")
        print("\nTroubleshooting tips:")
        print("1. Ensure you have a valid API key (Groq or Google)")
        print("2. Check your internet connection")
        print("3. Verify the RAG API is running and accessible")
        print("4. Make sure the selected model is available:")
        print(f"   - Current LLM: {LLM_PROVIDER1} - {LLM_MODEL1}")
        # print("\nFor Groq, available models:")
        # print("  - llama-3.1-70b-versatile")
        # print("  - llama-3.1-8b-instant")
        # print("  - mixtral-8x7b-32768")
        # print("\nFor Gemini, available models:")
        # print("  - gemini-1.5-pro")
        # print("  - gemini-1.5-flash")
        results = None


🚀 Running RAGAS evaluation...

This may take a few minutes...



Evaluating:   0%|          | 0/120 [00:00<?, ?it/s]

## 7. Display Results

In [None]:
if results:
    # Display overall scores
    print("\n" + "="*60)
    print("RAG SYSTEM EVALUATION RESULTS")
    print("="*60)
    
    print(f"\n📊 Overall Metrics (0.0 - 1.0 scale):")
    print("-" * 60)
    
    for metric_name, score in results.items():
        if metric_name != 'question':  # Skip the question field
            # Determine status emoji
            if score >= 0.8:
                status = "🟢 Excellent"
            elif score >= 0.6:
                status = "🟡 Good"
            elif score >= 0.4:
                status = "🟠 Fair"
            else:
                status = "🔴 Needs Improvement"
            
            print(f"{metric_name:25s}: {score:.4f}  {status}")
    
    print("\n" + "="*60)
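
The banding logic in the cell above can be factored into a small function so the same thresholds are reused elsewhere. The sketch below keeps the notebook's cutoffs (0.8, 0.6, 0.4) and adds one assumption of its own: a score that is missing or NaN is reported as "Not scored" rather than falling through to the lowest band. `score_band` is an illustrative name.

```python
import math
from typing import Optional


def score_band(score: Optional[float]) -> str:
    """Map a 0.0-1.0 score to the status label used in the results table."""
    if score is None or math.isnan(score):
        return "Not scored"  # assumption: keep unscored metrics out of the bands
    if score >= 0.8:
        return "Excellent"
    if score >= 0.6:
        return "Good"
    if score >= 0.4:
        return "Fair"
    return "Needs Improvement"


for value in (0.92, 0.61, 0.45, 0.10, float("nan")):
    print(f"{value:.2f} -> {score_band(value)}")
```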

In [None]:
if results:
    # Convert to DataFrame for detailed analysis
    results_df = results.to_pandas()
    
    # Add metadata
    results_df['expected_ticket_id'] = [r['expected_ticket_id'] for r in rag_responses]
    results_df['retrieved_ticket_ids'] = [r['retrieved_ticket_ids'] for r in rag_responses]
    results_df['success'] = [r['success'] for r in rag_responses]
    
    # Display detailed results
    print("\n📋 Detailed Results Per Test Case:")
    print(results_df.to_string())
    
    # Save results
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    results_file = f'ragas_evaluation_results_{timestamp}.csv'
    results_df.to_csv(results_file, index=False)
    print(f"\n💾 Results saved to: {results_file}")

## 8. Custom Metrics Analysis

Additional metrics specific to support ticket retrieval.

In [None]:
# Calculate retrieval accuracy
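def calculate_retrieval_accuracy(responses: List[Dict]) -> Dict:
    """
    Calculate how often the expected ticket is retrieved at rank 1 and within the top 3.

    With a single expected ticket per query this is a hit rate (top-k accuracy);
    the `precision_at_*` key names are kept because later cells read them.
    """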
    total = len(responses)
    top1_correct = 0  # Exact match at rank 1
    top3_correct = 0  # Exact match in top 3
    
    for response in responses:
        expected = response['expected_ticket_id']
        retrieved = response['retrieved_ticket_ids']
        
        if retrieved:
            # Check if expected ticket is in top 1
            if retrieved[0] == expected:
                top1_correct += 1
                top3_correct += 1
            # Check if expected ticket is in top 3
            elif expected in retrieved[:3]:
                top3_correct += 1
    
    return {
        'precision_at_1': top1_correct / total if total > 0 else 0,
        'precision_at_3': top3_correct / total if total > 0 else 0,
        'total_queries': total
    }

# Calculate custom metrics
retrieval_metrics = calculate_retrieval_accuracy(rag_responses)

print("\n" + "="*60)
print("CUSTOM RETRIEVAL METRICS")
print("="*60)
print(f"\nPrecision@1 (Top-1 Accuracy): {retrieval_metrics['precision_at_1']:.2%}")
print(f"Precision@3 (Top-3 Accuracy): {retrieval_metrics['precision_at_3']:.2%}")
print(f"Total Queries Evaluated: {retrieval_metrics['total_queries']}")
print("="*60)
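
Top-1 and top-3 accuracy do not distinguish a hit at rank 2 from a hit at rank 3. Mean reciprocal rank (MRR) is a common complement that does. The sketch below generalizes the hit-rate calculation to any `k` and adds MRR, using the same `expected_ticket_id` / `retrieved_ticket_ids` keys as `rag_responses`. The function names and toy data are illustrative.

```python
from typing import Dict, List


def hit_rate_at_k(responses: List[Dict], k: int) -> float:
    """Fraction of queries whose expected ticket appears in the top k results."""
    if not responses:
        return 0.0
    hits = sum(
        1 for r in responses
        if r["expected_ticket_id"] in r["retrieved_ticket_ids"][:k]
    )
    return hits / len(responses)


def mean_reciprocal_rank(responses: List[Dict]) -> float:
    """Average of 1/rank of the expected ticket; a miss contributes 0."""
    if not responses:
        return 0.0
    total = 0.0
    for r in responses:
        retrieved = r["retrieved_ticket_ids"]
        if r["expected_ticket_id"] in retrieved:
            total += 1.0 / (retrieved.index(r["expected_ticket_id"]) + 1)
    return total / len(responses)


# Toy data: a hit at rank 1, a hit at rank 2, and a failed query
toy_responses = [
    {"expected_ticket_id": "TKT-0001", "retrieved_ticket_ids": ["TKT-0001", "TKT-0009", "TKT-0017"]},
    {"expected_ticket_id": "TKT-0002", "retrieved_ticket_ids": ["TKT-0005", "TKT-0002", "TKT-0003"]},
    {"expected_ticket_id": "TKT-0003", "retrieved_ticket_ids": []},
]
print(hit_rate_at_k(toy_responses, 1), hit_rate_at_k(toy_responses, 3))
print(mean_reciprocal_rank(toy_responses))
```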

## 9. Visualizations

In [None]:
if results:
    # Set style
    plt.style.use('seaborn-v0_8-darkgrid')
    sns.set_palette("husl")
    
    # Create figure with subplots
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    fig.suptitle('RAG System Evaluation Results', fontsize=16, fontweight='bold')
    
    # 1. Overall Metrics Bar Chart
    ax1 = axes[0, 0]
    metric_names = [k for k in results.keys() if k != 'question']
    metric_values = [results[k] for k in metric_names]
    
    bars = ax1.barh(metric_names, metric_values, color='skyblue')
    ax1.set_xlabel('Score', fontweight='bold')
    ax1.set_title('Overall RAGAS Metrics', fontweight='bold')
    ax1.set_xlim(0, 1)
    
    # Add value labels
    for i, (bar, val) in enumerate(zip(bars, metric_values)):
        ax1.text(val + 0.02, i, f'{val:.3f}', va='center')
    
    # Add threshold line
    ax1.axvline(x=0.7, color='green', linestyle='--', alpha=0.5, label='Good (0.7)')
    ax1.legend()
    
    # 2. Metric Distribution Box Plot
    ax2 = axes[0, 1]
    # Plot only numeric score columns; text columns (question, answer, contexts, ...)
    # and the boolean 'success' flag cannot be drawn as box plots.
    score_cols = list(results_df.select_dtypes(include='number').columns)
    metric_data = [results_df[col].dropna() for col in score_cols]
    metric_labels = [col.replace('_', ' ').title() for col in score_cols]
    
    bp = ax2.boxplot(metric_data, labels=metric_labels, patch_artist=True)
    ax2.set_ylabel('Score', fontweight='bold')
    ax2.set_title('Score Distribution by Metric', fontweight='bold')
    plt.setp(ax2.xaxis.get_majorticklabels(), rotation=45, ha='right')
    
    # 3. Retrieval Accuracy
    ax3 = axes[1, 0]
    retrieval_data = [
        retrieval_metrics['precision_at_1'],
        retrieval_metrics['precision_at_3']
    ]
    retrieval_labels = ['Precision@1', 'Precision@3']
    colors = ['#ff9999', '#66b3ff']
    
    bars = ax3.bar(retrieval_labels, retrieval_data, color=colors)
    ax3.set_ylabel('Accuracy', fontweight='bold')
    ax3.set_title('Retrieval Accuracy', fontweight='bold')
    ax3.set_ylim(0, 1)
    
    # Add percentage labels
    for bar, val in zip(bars, retrieval_data):
        height = bar.get_height()
        ax3.text(bar.get_x() + bar.get_width()/2., height + 0.02,
                f'{val:.1%}', ha='center', va='bottom', fontweight='bold')
    
    # 4. Performance Summary (Radar Chart)
    ax4 = axes[1, 1]
    ax4.remove()  # Remove the axis
    ax4 = fig.add_subplot(2, 2, 4, projection='polar')
    
    # Prepare data for radar chart
    categories = metric_names
    
    # Number of variables
    N = len(categories)
    angles = [n / float(N) * 2 * np.pi for n in range(N)]
    # Close the polygon on a copy so metric_values is not mutated in place
    values = metric_values + metric_values[:1]
    angles += angles[:1]
    
    ax4.plot(angles, values, 'o-', linewidth=2, color='b', label='RAG System')
    ax4.fill(angles, values, alpha=0.25, color='b')
    ax4.set_xticks(angles[:-1])
    ax4.set_xticklabels([c.replace('_', '\n') for c in categories], size=8)
    ax4.set_ylim(0, 1)
    ax4.set_title('Performance Profile', fontweight='bold', pad=20)
    ax4.grid(True)
    
    plt.tight_layout()
    
    # Save figure
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    fig_file = f'ragas_evaluation_visualization_{timestamp}.png'
    plt.savefig(fig_file, dpi=300, bbox_inches='tight')
    print(f"\n📊 Visualization saved to: {fig_file}")
    
    plt.show()

## 10. Detailed Analysis by Category

In [None]:
if results:
    # Add category information to results
    results_df['category'] = [eval_dataset[i]['category'] for i in range(len(eval_dataset))]
    results_df['severity'] = [eval_dataset[i]['severity'] for i in range(len(eval_dataset))]
    
    # Group by category
    print("\n" + "="*60)
    print("PERFORMANCE BY CATEGORY")
    print("="*60)
    
    category_metrics = ['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall']
    
    for category in results_df['category'].unique():
        category_data = results_df[results_df['category'] == category]
        print(f"\n📁 {category}:")
        print(f"   Number of cases: {len(category_data)}")
        
        for metric in category_metrics:
            if metric in category_data.columns:
                avg_score = category_data[metric].mean()
                print(f"   {metric:20s}: {avg_score:.4f}")
    
    print("\n" + "="*60)
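
Several categories in the ticket statistics hold a single ticket, so a per-category mean is only as reliable as its case count. The sketch below shows the same grouping in plain Python, returning the mean together with the number of scored cases and leaving out values that are missing or NaN. `mean_by_group` and the toy rows are hypothetical; in the notebook the rows would come from `results_df.to_dict('records')`.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def mean_by_group(rows: List[Dict], group_key: str, metric: str) -> Dict[str, Tuple[float, int]]:
    """Map each group to (mean score, number of scored cases), ignoring NaN/missing."""
    buckets = defaultdict(list)
    for row in rows:
        value = row.get(metric)
        if value is not None and not math.isnan(value):
            buckets[row[group_key]].append(value)
    return {group: (sum(vals) / len(vals), len(vals)) for group, vals in buckets.items()}


toy_rows = [
    {"category": "Monitoring", "faithfulness": 1.0},
    {"category": "Monitoring", "faithfulness": 0.5},
    {"category": "Database", "faithfulness": float("nan")},  # unscored case
]
print(mean_by_group(toy_rows, "category", "faithfulness"))
```

A category whose cases are all unscored is absent from the result, which makes the gap visible instead of reporting a NaN mean.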

## 11. Failure Analysis

In [None]:
if results:
    # Identify cases with low scores
    print("\n" + "="*60)
    print("FAILURE ANALYSIS")
    print("="*60)
    
    threshold = 0.5
    
    for metric in metric_names:
        if metric in results_df.columns:
            low_scores = results_df[results_df[metric] < threshold]
            
            if len(low_scores) > 0:
                print(f"\n⚠️  Cases with low {metric} (< {threshold}):")
                print(f"   Count: {len(low_scores)} / {len(results_df)}")
                
                for idx, row in low_scores.iterrows():
                    print(f"   - Case {idx}: {row['expected_ticket_id']} (Score: {row[metric]:.3f})")
                    print(f"     Category: {row['category']}, Severity: {row['severity']}")
            else:
                print(f"\n✅ All cases passed {metric} threshold ({threshold})")
    
    print("\n" + "="*60)
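
A comparison such as `results_df[metric] < threshold` evaluates to False for NaN, so a case whose score is NaN is not listed as low and is indistinguishable from a pass. If the evaluation leaves any scores as NaN, separating them out gives a more honest picture. The sketch below partitions a list of scores into three groups; `partition_by_threshold` is an illustrative helper.

```python
import math
from typing import Dict, List, Optional, Sequence


def partition_by_threshold(scores: Sequence[Optional[float]],
                           threshold: float = 0.5) -> Dict[str, List[int]]:
    """Split case indices into low, ok, and unscored (None or NaN)."""
    groups: Dict[str, List[int]] = {"low": [], "ok": [], "unscored": []}
    for i, score in enumerate(scores):
        if score is None or math.isnan(score):
            groups["unscored"].append(i)
        elif score < threshold:
            groups["low"].append(i)
        else:
            groups["ok"].append(i)
    return groups


print(partition_by_threshold([0.9, 0.2, float("nan"), 0.5]))
```

In the notebook this could be applied per metric with `partition_by_threshold(results_df[metric].tolist(), threshold)`.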

## 12. Generate Summary Report

In [None]:
if results:
    # Generate comprehensive summary
    summary = {
        'evaluation_date': datetime.now().isoformat(),
        'total_test_cases': len(rag_responses),
        'successful_queries': sum(1 for r in rag_responses if r['success']),
        'ragas_metrics': {k: float(v) for k, v in results.items() if k != 'question'},
        'retrieval_metrics': retrieval_metrics,
        'category_breakdown': results_df.groupby('category').size().to_dict(),
        'severity_breakdown': results_df.groupby('severity').size().to_dict(),
    }
    
    # Save summary as JSON
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    summary_file = f'evaluation_summary_{timestamp}.json'
    
    with open(summary_file, 'w', encoding='utf-8') as f:
        json.dump(summary, f, indent=2, ensure_ascii=False)
    
    print(f"\n💾 Summary report saved to: {summary_file}")
    
    # Display summary
    print("\n" + "="*60)
    print("EVALUATION SUMMARY")
    print("="*60)
    print(json.dumps(summary, indent=2))
    print("="*60)
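
By default `json.dump` writes a NaN float as the bare token `NaN`, which is not valid JSON and may be rejected by other parsers. If any metric in the summary can be NaN, converting non-finite floats to `null` first keeps the report portable. The sketch below is one way to do that; `json_safe` and the toy summary are hypothetical.

```python
import json
import math
from typing import Any


def json_safe(value: Any) -> Any:
    """Recursively replace NaN and infinite floats with None."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {key: json_safe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [json_safe(item) for item in value]
    return value


toy_summary = {
    "total_test_cases": 20,
    "ragas_metrics": {"faithfulness": float("nan"), "context_recall": 0.75},
}
# allow_nan=False makes json raise instead of silently writing NaN
print(json.dumps(json_safe(toy_summary), allow_nan=False))
```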

## 13. Recommendations

Based on the evaluation results, generate actionable recommendations.

In [None]:
if results:
    print("\n" + "="*60)
    print("RECOMMENDATIONS FOR IMPROVEMENT")
    print("="*60)
    
    recommendations = []
    
    # Check each metric and provide recommendations
    if results.get('faithfulness', 0) < 0.7:
        recommendations.append(
            "📌 Low Faithfulness: The model may be hallucinating. Consider:\n"
            "   - Improving prompt engineering to stay grounded in context\n"
            "   - Adding source citation requirements\n"
            "   - Using more conservative generation parameters"
        )
    
    if results.get('context_precision', 0) < 0.7:
        recommendations.append(
            "📌 Low Context Precision: Retrieved contexts may not be relevant. Consider:\n"
            "   - Improving embedding model quality\n"
            "   - Refining chunk size and overlap\n"
            "   - Adding metadata filtering"
        )
    
    if results.get('context_recall', 0) < 0.7:
        recommendations.append(
            "📌 Low Context Recall: Missing relevant information. Consider:\n"
            "   - Increasing number of retrieved documents (top_k)\n"
            "   - Improving document chunking strategy\n"
            "   - Enhancing data ingestion process"
        )
    
    if results.get('answer_relevancy', 0) < 0.7:
        recommendations.append(
            "📌 Low Answer Relevancy: Responses not addressing questions well. Consider:\n"
            "   - Refining system prompts\n"
            "   - Improving query understanding\n"
            "   - Adding query expansion/reformulation"
        )
    
    if retrieval_metrics['precision_at_1'] < 0.7:
        recommendations.append(
            "📌 Low Retrieval Accuracy: Not finding correct tickets. Consider:\n"
            "   - Fine-tuning embedding model on domain data\n"
            "   - Improving ticket descriptions and metadata\n"
            "   - Using hybrid search (vector + keyword)"
        )
    
    if recommendations:
        for i, rec in enumerate(recommendations, 1):
            print(f"\n{i}. {rec}")
    else:
        print("\n✅ System is performing well across all metrics!")
        print("\nSuggested next steps:")
        print("  - Monitor performance over time")
        print("  - Test with more diverse queries")
        print("  - Evaluate on edge cases")
    
    print("\n" + "="*60)
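
The chain of `if` checks above follows one pattern: compare a metric to 0.7 and, if it is lower, emit a message. A table-driven version keeps the rules in one place. The sketch below is a minimal variant that returns only the short headline of each recommendation; `RULES` and `recommend` are illustrative names, and, as in the cell above, a metric missing from the scores is treated as 0.

```python
from typing import Dict, List, Tuple

# (metric name, headline shown when the metric falls below the threshold)
RULES: Tuple[Tuple[str, str], ...] = (
    ("faithfulness", "Low Faithfulness"),
    ("context_precision", "Low Context Precision"),
    ("context_recall", "Low Context Recall"),
    ("answer_relevancy", "Low Answer Relevancy"),
)


def recommend(scores: Dict[str, float], threshold: float = 0.7) -> List[str]:
    """Return the headlines of all rules whose metric is below the threshold."""
    return [headline for metric, headline in RULES
            if scores.get(metric, 0.0) < threshold]


toy_scores = {
    "faithfulness": 0.9,
    "context_precision": 0.5,
    "context_recall": 0.7,
    "answer_relevancy": 0.95,
}
print(recommend(toy_scores))
```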

## Conclusion

This notebook provides a comprehensive evaluation of your RAG-based support ticket system using RAGAS metrics and custom retrieval accuracy measures. Regular evaluation helps identify areas for improvement and track system performance over time.