# Evaluating RAG Systems with FloTorch

[FloTorch](https://www.flotorch.ai/) provides an evaluation framework for Retrieval-Augmented Generation (RAG) systems that lets you assess and compare Large Language Models (LLMs) on accuracy, cost, and latency, the metrics that matter most for enterprise deployments.

## Key Evaluation Metrics for this Notebook

In this notebook, we will focus on evaluating our RAG pipelines using the following metrics:

* **Context Precision:** This metric quantifies how well the retriever ranks relevant context chunks ahead of irrelevant ones. It is computed as the mean of the precision@k scores taken at each rank k that holds a relevant chunk, where precision@k is the proportion of relevant chunks among the top k retrieved chunks.

* **Response Relevancy:** This metric assesses how well the generated response addresses the user's query. Higher scores indicate greater relevance and completeness, while lower scores suggest incompleteness or the inclusion of unnecessary information.

* **Faithfulness:** This metric measures how factually consistent the generated response is with the retrieved context. It is reported alongside the two accuracy metrics above in the Ragas evaluation cell.

* **Inference Cost:** This refers to the total cost incurred for invoking Bedrock models to generate responses for all entries in the ground truth dataset.

* **Latency:** This measures the time taken for the inference process, specifically the duration of the Bedrock model invocations.
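
The Context Precision definition above can be worked through by hand. The sketch below is a minimal illustration, not the Ragas implementation: in Ragas the relevance verdict for each chunk comes from an LLM judge, whereas here the verdicts are supplied as a hypothetical list of 0/1 flags in rank order, and `context_precision` is an illustrative helper name.

```python
from typing import List


def context_precision(relevance: List[int]) -> float:
    """Mean of precision@k over the ranks k that hold a relevant chunk."""
    hits = 0
    weighted_sum = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / k  # precision@k at a relevant rank
    # No relevant chunk at all yields a score of 0
    return weighted_sum / hits if hits else 0.0


# Three retrieved chunks; the ones at ranks 1 and 3 are relevant
score = context_precision([1, 0, 1])
print(round(score, 4))  # (1/1 + 2/3) / 2 -> 0.8333
```

Moving the irrelevant chunk to the last rank (`[1, 1, 0]`) raises the score to 1.0, which is why the metric rewards retrievers that rank relevant chunks first.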

## Leveraging Ragas for Evaluation

This evaluation uses [Ragas](https://docs.ragas.io/en/stable/), a library for evaluating Large Language Model (LLM) applications.

Ragas uses LLMs internally to compute the Context Precision and Response Relevancy scores. In this evaluation, we use `amazon.titan-embed-text-v2:0` to generate embeddings and `us.amazon.nova-micro-v1:0` as the judge model for inference, matching the values in the Evaluation Config section.

### Load env variables

In [None]:
import json
with open("../Lab 1/variables.json", "r") as f:
    variables = json.load(f)

variables

### Evaluation Config

In [None]:
evaluation_config_data = {
   "eval_embedding_model" : "amazon.titan-embed-text-v2:0",
   "eval_retrieval_model" : "us.amazon.nova-micro-v1:0",
   "eval_retrieval_service" : "bedrock",
   "aws_region" : variables['regionName'],
   "eval_embed_vector_dimension" : 1024,
   "retrieval_model": "us.amazon.nova-lite-v1:0",
}

### Load RAG response data 

In [None]:
from typing import List, Optional, Dict
from flotorch_core.evaluator.evaluation_item import EvaluationItem

def convert_to_evaluation_dict(data: Dict) -> Dict[str, List[EvaluationItem]]:
    """
    Converts the given dictionary into a dictionary where the key is the KB type
    and the value is a list of EvaluationItem objects.  This version handles
    dynamic KB keys.

    Args:
        data: The input dictionary.

    Returns:
        A dictionary where the key is the KB type and the value is a list of EvaluationItem.
    """

    evaluation_dict: Dict[str, List[EvaluationItem]] = {}

    for kb_type, items in data.items():
        if isinstance(items, list):  # Ensure we're processing a list of items
            evaluation_items: List[EvaluationItem] = []
            for item_data in items:
                # Convert only records that carry every field EvaluationItem needs
                required_keys = ("question", "expected_answer", "generated_answer", "retrieved_contexts")
                if isinstance(item_data, dict) and all(key in item_data for key in required_keys):
                    evaluation_item = EvaluationItem(
                        question=item_data["question"],
                        generated_answer=item_data["generated_answer"],
                        expected_answer=item_data["expected_answer"],
                        context=[context["text"] for context in item_data["retrieved_contexts"]]
                    )
                    evaluation_items.append(evaluation_item)
            evaluation_dict[kb_type] = evaluation_items

    return evaluation_dict

In [None]:
import json

filename = "../results/rag_evaluation_responses.json"

with open(filename, 'r', encoding='utf-8') as f:
    loaded_responses = json.load(f)

evaluation_dataset_per_kb = convert_to_evaluation_dict(loaded_responses)
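
The converter silently skips any record that lacks one of the four required fields, so it can help to check the loaded file first. The sketch below infers the expected shape from the keys the notebook code reads (`question`, `expected_answer`, `generated_answer`, `retrieved_contexts`, and the `metadata` fields `inputTokens`, `outputTokens`, `latencyMs`). The KB key `example_kb`, all sample values, and the helper `count_skipped` are hypothetical.

```python
REQUIRED_KEYS = {"question", "expected_answer", "generated_answer", "retrieved_contexts"}

# Hypothetical sample; the KB key and every value are illustrative only
sample_responses = {
    "example_kb": [
        {
            "question": "What is the refund window?",
            "expected_answer": "30 days",
            "generated_answer": "Refunds are accepted within 30 days.",
            "retrieved_contexts": [{"text": "Refunds are accepted within 30 days of purchase."}],
            "metadata": {"inputTokens": 120, "outputTokens": 15, "latencyMs": 850},
        },
        {"question": "Incomplete record"},  # missing fields, so it would be skipped
    ]
}


def count_skipped(data: dict) -> dict:
    """Return, per KB key, how many records lack a required field."""
    skipped = {}
    for kb_type, items in data.items():
        if not isinstance(items, list):
            continue
        skipped[kb_type] = sum(
            1 for item in items
            if not (isinstance(item, dict) and REQUIRED_KEYS.issubset(item))
        )
    return skipped


print(count_skipped(sample_responses))  # {'example_kb': 1}
```

A non-zero count means the accuracy metrics will be computed over fewer items than the file contains.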

### Evaluation output

In [None]:
final_evaluation = {}

### Accuracy Evaluation with Ragas

In [None]:
from flotorch_core.embedding.embedding_registry import embedding_registry
from flotorch_core.embedding.titanv2_embedding import TitanV2Embedding
from flotorch_core.embedding.cohere_embedding import CohereEmbedding
from flotorch_core.inferencer.inferencer_provider_factory import InferencerProviderFactory
from flotorch_core.evaluator.ragas_evaluator import RagasEvaluator
import numpy as np

# Initialize embeddings
embedding_class = embedding_registry.get_model(evaluation_config_data.get("eval_embedding_model"))
embedding = embedding_class(evaluation_config_data.get("eval_embedding_model"), 
                            evaluation_config_data.get("aws_region"), 
                            int(evaluation_config_data.get("eval_embed_vector_dimension"))
                            )

# Initialize inferencer
inferencer = InferencerProviderFactory.create_inferencer_provider(
    False,"","",
    evaluation_config_data.get("eval_retrieval_service"),
    evaluation_config_data.get("eval_retrieval_model"), 
    evaluation_config_data.get("aws_region"), 
    variables['bedrockExecutionRoleArn'],
    float(0.1)
)

evaluator = RagasEvaluator(inferencer, embedding)

for evaluation_dataset_kb_id in evaluation_dataset_per_kb:
    ragas_report = evaluator.evaluate(evaluation_dataset_per_kb[evaluation_dataset_kb_id])
    final_evaluation[evaluation_dataset_kb_id] = {
        'llm_context_precision_with_reference': np.mean(ragas_report['llm_context_precision_with_reference']),
        'faithfulness': np.mean(ragas_report['faithfulness']),
        'answer_relevancy': np.mean(ragas_report['answer_relevancy'])
    }
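
One caveat with averaging via `np.mean`: if any per-sample score is NaN, the aggregate is NaN as well. Whether and when the Ragas report contains NaN values (for example, when a judge response cannot be parsed) is an assumption here, so treat the following as a sketch of the NumPy behavior rather than a description of the evaluator. The scores are hypothetical.

```python
import numpy as np

# Hypothetical per-sample scores; NaN marks a sample that could not be scored
scores = [0.9, float("nan"), 0.7]

plain_mean = np.mean(scores)      # NaN propagates to the aggregate
robust_mean = np.nanmean(scores)  # averages only the scored samples
scored = int(np.count_nonzero(~np.isnan(scores)))

print(plain_mean, round(float(robust_mean), 2), scored)
```

If NaN values do occur, reporting the number of scored samples next to the `np.nanmean` result keeps the aggregate honest about how many items it covers.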

### Cost and Latency Evaluation

In [None]:
import pandas as pd

MILLION = 1_000_000
THOUSAND = 1_000
SECONDS_IN_MINUTE = 60
MINUTES_IN_HOUR = 60
# Read the Bedrock pricing table into a pandas DataFrame
pricing_df = pd.read_csv('../data/bedrock_limits_small.csv')

def calculate_bedrock_inference_cost(input_tokens, output_tokens, inference_model, aws_region):
    """Return the inference cost for the given token counts, using the per-million-token prices in the CSV."""
    # Select the pricing row for this model and region
    pricing = pricing_df[
        (pricing_df["model"] == inference_model) & (pricing_df["Region"] == aws_region)
    ]
    if pricing.empty:
        raise ValueError(f"No pricing found for model {inference_model!r} in region {aws_region!r}")

    input_price_per_million_tokens = float(pricing["input_price"].values[0])
    output_price_per_million_tokens = float(pricing["output_price"].values[0])

    input_actual_cost = (input_price_per_million_tokens * float(input_tokens)) / MILLION
    output_actual_cost = (output_price_per_million_tokens * float(output_tokens)) / MILLION
    return input_actual_cost + output_actual_cost
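
To make the cost arithmetic concrete, here is a worked example without the CSV lookup. The prices are hypothetical placeholders, and treating them as USD per million tokens is an assumption. The real values come from the pricing file.

```python
MILLION = 1_000_000

# Hypothetical prices per million tokens; real values come from the pricing CSV
input_price_per_million = 0.06
output_price_per_million = 0.24

input_tokens = 1_200
output_tokens = 300

# Each side is price-per-million scaled by the token count
input_cost = input_price_per_million * input_tokens / MILLION
output_cost = output_price_per_million * output_tokens / MILLION
total_cost = input_cost + output_cost

print(f"{total_cost:.6f}")  # 0.000072 + 0.000072 -> 0.000144
```

Output tokens are typically priced higher than input tokens, so a short answer can cost as much as a much longer prompt, as it does in this example.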

In [None]:
from typing import Dict, Any, List
from decimal import Decimal
from dataclasses import dataclass

@dataclass
class MetricsData:
    """Data class to store metrics information"""
    cost: Decimal
    latency: float
    input_tokens: int
    output_tokens: int

def extract_metadata_metrics(metadata: Dict[str, Any]) -> MetricsData:
    """
    Extract metrics from metadata dictionary
    
    Args:
        metadata: Dictionary containing metadata information
    Returns:
        MetricsData object with extracted metrics
    """
    return MetricsData(
        input_tokens=metadata.get("inputTokens", 0),
        output_tokens=metadata.get("outputTokens", 0),
        latency=float(metadata.get("latencyMs", 0)),
        cost=Decimal('0.0000')
    )


In [None]:
for kb_type, items in loaded_responses.items():
    if not isinstance(items, list):  # Skip entries that are not a list of responses
        continue

    total_cost = Decimal('0.0000')
    total_latency = 0.0
    item_count = 0

    for item_data in items:
        if not isinstance(item_data, dict) or "metadata" not in item_data:
            continue

        metrics = extract_metadata_metrics(item_data["metadata"])
        
        # Calculate cost for this item
        item_cost = calculate_bedrock_inference_cost(
            metrics.input_tokens,
            metrics.output_tokens,
            evaluation_config_data["retrieval_model"],
            evaluation_config_data["aws_region"]
        )
        
        total_cost += Decimal(str(item_cost))
        total_latency += metrics.latency
        item_count += 1

    if item_count > 0:
        # Merge totals and per-item averages into any existing metrics for this KB
        final_evaluation.setdefault(kb_type, {}).update({
            'cost': float(total_cost),
            'average_cost': float(total_cost / item_count),
            'latency': total_latency,
            'average_latency': total_latency / item_count,
            'processed_items': item_count
        })
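
The loop above accumulates cost as `Decimal(str(item_cost))` rather than `Decimal(item_cost)`. The sketch below shows the likely reason, using a hypothetical per-item cost: converting a float directly exposes its full binary expansion, while going through `str` keeps the short decimal form.

```python
from decimal import Decimal

item_cost = 0.1  # hypothetical per-item cost held as a float

exact_binary = Decimal(item_cost)     # carries the float's full binary expansion
via_string = Decimal(str(item_cost))  # uses the short repr, "0.1"

# Summing the string-converted values stays exact in decimal arithmetic
total = sum((Decimal(str(item_cost)) for _ in range(3)), Decimal("0"))

print(via_string, total)           # 0.1 0.3
print(exact_binary == via_string)  # False
```

Because the totals are converted back with `float(total_cost)` when stored, the benefit is limited to avoiding drift while accumulating many small per-item costs.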

### Evaluation metrics as pandas df

In [None]:
import pandas as pd

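# Convert the nested dictionary to a DataFrame.
# Use a separate name so the pricing DataFrame loaded earlier is not overwritten.
evaluation_df = pd.DataFrame.from_dict(final_evaluation, orient='index')

# Move kb_type from the index into a regular column
evaluation_df = evaluation_df.reset_index().rename(columns={'index': 'kb_type'})

print(evaluation_df)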