# Evaluating RAG Systems with FloTorch

[FloTorch](https://www.flotorch.ai/) provides an evaluation framework for Retrieval-Augmented Generation (RAG) systems that lets you assess and compare Large Language Models (LLMs) side by side. It focuses on three metrics that matter for enterprise deployments: accuracy, cost, and latency.

## Key Evaluation Metrics for this Notebook

In this notebook, we will focus on evaluating our RAG pipelines using the following metrics:

* **Context Precision:** This metric quantifies how relevant the retrieved context chunks are and whether the relevant chunks are ranked near the top. It is calculated as the mean of the precision@k scores over the chunks in the retrieved context, where precision@k is the proportion of relevant chunks among the top k retrieved chunks.

* **Response Relevancy:** This metric assesses how well the generated response addresses the user's query. Higher scores indicate greater relevance and completeness, while lower scores suggest incompleteness or the inclusion of unnecessary information.

* **Inference Cost:** This refers to the total cost incurred for invoking Bedrock models to generate responses for all entries in the ground truth dataset.

* **Latency:** This measures the time taken for the inference process, specifically the duration of the Bedrock model invocations.
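
To make the Context Precision definition concrete, here is a minimal sketch that follows the formula given in the Ragas documentation: precision@k is computed at each rank k, and only ranks that hold a relevant chunk contribute to the average. The function name and the 0/1 relevance list are illustrative assumptions; in practice Ragas obtains the relevance verdicts from an LLM judge rather than from hand-written labels.

```python
from typing import Sequence


def context_precision(relevance: Sequence[int]) -> float:
    """Average precision@k over the ranks that hold a relevant chunk.

    `relevance` is a hypothetical list of 0/1 verdicts, one per retrieved
    chunk, in ranked order (1 = relevant).
    """
    relevant_seen = 0
    weighted_sum = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_seen += 1
            # precision@k = relevant chunks in the top k, divided by k
            weighted_sum += relevant_seen / k
    # If nothing relevant was retrieved, this sketch returns 0.0.
    return weighted_sum / relevant_seen if relevant_seen else 0.0


# Relevant chunks at ranks 1 and 3: (1/1 + 2/3) / 2
score = context_precision([1, 0, 1])
```

Because each term is weighted by rank, the same two relevant chunks score higher when they appear earlier: `[1, 1, 0]` yields 1.0, while `[0, 1, 1]` yields roughly 0.58.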

## Leveraging Ragas for Evaluation

This evaluation uses [Ragas](https://docs.ragas.io/en/stable/), a library for evaluating Large Language Model (LLM) applications.

Ragas uses an LLM as a judge, together with an embedding model, to compute the Context Precision and Response Relevancy scores. In this evaluation, we use `amazon.titan-embed-text-v2:0` for generating embeddings and `us.amazon.nova-lite-v1:0` as the judge LLM, matching the evaluation config below. The accuracy evaluation cell also records the Ragas `faithfulness` score.
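
The embedding model matters because, per the Ragas documentation, Response Relevancy is based on embedding similarity: the judge LLM generates questions that the response would answer, and the score is the mean cosine similarity between those questions and the original query. The sketch below illustrates only that final averaging step, using toy vectors in place of real embeddings. The function names are hypothetical, and the actual metric is computed inside Ragas.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def response_relevancy(
    question_embedding: Sequence[float],
    generated_question_embeddings: Sequence[Sequence[float]],
) -> float:
    """Mean similarity between the user's question and the questions
    an LLM generated back from the response (hypothetical helper)."""
    if not generated_question_embeddings:
        return 0.0
    similarities = [
        cosine_similarity(question_embedding, generated)
        for generated in generated_question_embeddings
    ]
    return sum(similarities) / len(similarities)


# Toy 2-D vectors: one generated question matches the original exactly,
# the other is orthogonal to it.
score = response_relevancy([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```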

### Load env variables

In [None]:
import json
with open("../Lab 1/variables.json", "r") as f:
    variables = json.load(f)

variables

### Evaluation Config

In [None]:
evaluation_config_data = {
   "eval_embedding_model" : "amazon.titan-embed-text-v2:0",
   "eval_retrieval_model" : "us.amazon.nova-lite-v1:0",
   "eval_retrieval_service" : "bedrock",
   "aws_region" : variables['regionName'],
   "eval_embed_vector_dimension" : 1024
}

### Load RAG response data 

In [None]:
import json
from evaluation_utils import convert_to_evaluation_dict

filename = "../results/ragas_evaluation_responses_for_different_models.json"

with open(filename, 'r', encoding='utf-8') as f:
    loaded_responses = json.load(f)

evaluation_dataset_per_model = convert_to_evaluation_dict(loaded_responses)

### Evaluation output

In [None]:
final_evaluation = {}

### Accuracy Evaluation with Ragas

In [None]:
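from flotorch_core.embedding.embedding_registry import embedding_registry
# The two embedding classes below are not referenced directly; they are presumably
# imported so that they register themselves with embedding_registry.
from flotorch_core.embedding.titanv2_embedding import TitanV2Embedding
from flotorch_core.embedding.cohere_embedding import CohereEmbedding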
from flotorch_core.inferencer.inferencer_provider_factory import InferencerProviderFactory
from flotorch_core.evaluator.ragas_evaluator import RagasEvaluator

# Initialize embeddings
embedding_class = embedding_registry.get_model(evaluation_config_data.get("eval_embedding_model"))
embedding = embedding_class(
    evaluation_config_data.get("eval_embedding_model"),
    evaluation_config_data.get("aws_region"),
    int(evaluation_config_data.get("eval_embed_vector_dimension")),
)

# Initialize inferencer
inferencer = InferencerProviderFactory.create_inferencer_provider(
    False,"","",
    evaluation_config_data.get("eval_retrieval_service"),
    evaluation_config_data.get("eval_retrieval_model"), 
    evaluation_config_data.get("aws_region"), 
    variables['bedrockExecutionRoleArn'],
    float(0.1)
)

evaluator = RagasEvaluator(inferencer, embedding)

for model, dataset in evaluation_dataset_per_model.items():
    ragas_report = evaluator.evaluate(dataset)
    if not ragas_report:
        # Skip models without a report instead of reusing stale or undefined metrics
        continue
    # Round float scores to two decimals for readability
    eval_metrics = {
        key: round(value, 2) if isinstance(value, float) else value
        for key, value in ragas_report._repr_dict.items()
    }
    final_evaluation[model] = {
        'llm_context_precision_with_reference': eval_metrics['llm_context_precision_with_reference'],
        'faithfulness': eval_metrics['faithfulness'],
        'answer_relevancy': eval_metrics['answer_relevancy'],
    }

### Cost and Latency Evaluation
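
As a rough illustration of what a cost and latency summary involves, the sketch below prices each invocation from its token counts and averages the recorded latencies. It is not the implementation of `calculate_cost_and_latency_metrics`; the record fields, the function name, and the per-1,000-token prices are all assumptions chosen for the example, and real prices depend on the model and region.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class InvocationRecord:
    """Hypothetical per-request record; the field names are assumptions."""
    input_tokens: int
    output_tokens: int
    latency_ms: float


def summarize_cost_and_latency(
    records: List[InvocationRecord],
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> Dict[str, float]:
    """Total token-based cost and mean latency across all invocations."""
    if not records:
        return {"inference_cost": 0.0, "average_latency_ms": 0.0}
    total_cost = sum(
        r.input_tokens / 1000 * input_price_per_1k
        + r.output_tokens / 1000 * output_price_per_1k
        for r in records
    )
    return {
        "inference_cost": total_cost,
        "average_latency_ms": mean(r.latency_ms for r in records),
    }


# Placeholder prices, not actual Bedrock pricing.
records = [
    InvocationRecord(input_tokens=1000, output_tokens=500, latency_ms=800.0),
    InvocationRecord(input_tokens=2000, output_tokens=1000, latency_ms=1200.0),
]
summary = summarize_cost_and_latency(records, 0.001, 0.002)
```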

In [None]:
from cost_compute_utils import calculate_cost_and_latency_metrics

# .items() yields (model, responses) pairs, so unpack both instead of re-indexing
for model, inference_data in loaded_responses.items():
    cost_and_latency_metrics = calculate_cost_and_latency_metrics(
        inference_data, model, evaluation_config_data["aws_region"]
    )
    
    if model not in final_evaluation:
        # Insert - key doesn't exist yet
        final_evaluation[model] = cost_and_latency_metrics
    else:
        # Update - key already exists
        final_evaluation[model].update(cost_and_latency_metrics)

### Evaluation metrics as a pandas DataFrame

In [None]:
import pandas as pd

# Convert the nested dictionary to a DataFrame
evaluation_df = pd.DataFrame.from_dict(final_evaluation, orient='index')

# Move the model name out of the index and into a 'model' column
evaluation_df = evaluation_df.reset_index().rename(columns={'index': 'model'})

print(evaluation_df)