# Semantic KB: Evaluating Multiple Models with FloTorch

[FloTorch](https://www.flotorch.ai/) provides a framework for evaluating and comparing Retrieval-Augmented Generation (RAG) systems. It focuses on three measures that matter most when assessing a RAG pipeline: accuracy, cost, and latency.

### Key Evaluation Metrics for this Notebook

This notebook will focus on evaluating our RAG pipelines using the following metrics:

* **Context Precision:** Measures how relevant the retrieved context chunks are and how well the relevant ones are ranked. It is the mean of the precision@k values taken at each rank k that holds a relevant chunk, where precision@k is the proportion of relevant chunks among the top k retrieved chunks.

* **Faithfulness:** Quantifies how factually consistent the generated response is with the retrieved context. Scores range from 0 to 1, with higher values indicating greater consistency. A response is considered faithful if all its claims are supported by the retrieved context.

* **Response Relevancy:** Assesses how well the generated response addresses the user's query. Higher scores indicate better relevance and completeness, while lower scores suggest incompleteness or the inclusion of irrelevant information.

* **Inference Cost:** The total cost incurred for using Bedrock models to generate responses for all questions in the ground truth dataset.

* **Latency:** The time taken for the inference process, specifically the duration of Bedrock model invocations.
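
As a rough illustration of how the first two metrics in the list above can be computed once the judgments are available, here is a minimal sketch. It assumes the relevance and claim-support judgments are already given as booleans (in Ragas they come from an LLM), and the function names are hypothetical. It is not the Ragas implementation.

```python
from typing import List


def context_precision(relevance: List[bool]) -> float:
    """Mean of precision@k taken over the ranks k that hold a relevant chunk.

    `relevance[i]` is True when the chunk at rank i + 1 is relevant.
    Returns 0.0 when no chunk is relevant.
    """
    hits = 0
    precision_sum = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / k  # precision@k at a relevant rank
    return precision_sum / hits if hits else 0.0


def faithfulness(claim_supported: List[bool]) -> float:
    """Fraction of response claims that the retrieved context supports."""
    if not claim_supported:
        return float("nan")  # no claims, so there is nothing to score
    return sum(claim_supported) / len(claim_supported)


# Hypothetical judgments for a single question
print(context_precision([True, False, True]))   # relevant chunks at ranks 1 and 3
print(faithfulness([True, True, False, True]))  # 3 of 4 claims supported
```

Because precision@k is only accumulated at relevant ranks, moving a relevant chunk higher in the ranking raises the score even when the set of retrieved chunks stays the same.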

### Leveraging Ragas for Evaluation

This evaluation process uses [Ragas](https://docs.ragas.io/en/stable/), a library for evaluating Large Language Model (LLM) applications.

Internally, Ragas uses Large Language Models (LLMs) to calculate the Context Precision, Faithfulness, and Response Relevancy scores. In this evaluation, we use `amazon.titan-embed-text-v2:0` for generating embeddings and `us.amazon.nova-pro-v1:0` for the inference tasks.

#### Load env variables

In [None]:
import json
with open("variables.json", "r") as f:
    variables = json.load(f)

variables

#### Evaluation Config

We will evaluate the RAG pipeline using Amazon Nova Pro.

In [None]:
evaluation_config_data = {
   "eval_embedding_model" : "amazon.titan-embed-text-v2:0",
   "eval_retrieval_model" : "us.amazon.nova-pro-v1:0",
   "eval_retrieval_service" : "bedrock",
   "aws_region" : variables['regionName'],
   "eval_embed_vector_dimension" : 1024
}

### Load RAG response data 

In [None]:
import json
from evaluation_utils import convert_to_evaluation_dict

filename = "./results/ragas_evaluation_responses_for_different_models.json"

with open(filename, 'r', encoding='utf-8') as f:
    loaded_responses = json.load(f)

evaluation_dataset_per_model = convert_to_evaluation_dict(loaded_responses)

### Evaluation output

In [None]:
final_evaluation = {}

#### Accuracy Evaluation with Ragas

**Important Note on Metric Variability:** Due to the inherent stochasticity of Large Language Models (LLMs), the sensitivity of evaluation metrics, and the quality of the LLM used, Ragas metrics for the same dataset can vary between evaluations. While efforts are made to improve reproducibility, please be aware that fluctuations in evaluation scores are possible.

For more information, refer to this GitHub issue: [https://github.com/explodinggradients/ragas/issues/1125](https://github.com/explodinggradients/ragas/issues/1125)
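
One way to make this variability visible is to repeat the evaluation several times and report the mean together with the spread. The sketch below uses made-up scores and a hypothetical helper name. It only shows the aggregation step, not how the runs are produced.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) for scores from repeated runs."""
    mean = statistics.mean(scores)
    # stdev needs at least two data points, so report 0.0 for a single run
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, spread


# Hypothetical faithfulness scores from five evaluation runs on the same dataset
run_scores = [0.82, 0.79, 0.85, 0.80, 0.84]

mean, spread = summarize_runs(run_scores)
print(f"faithfulness: {mean:.2f} +/- {spread:.2f} over {len(run_scores)} runs")
```

A difference between two models that is smaller than this spread is hard to distinguish from run-to-run noise.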

**Understanding NaN Values in Evaluation Results:**

You might encounter `NaN` (Not a Number) values in the evaluation results. This typically occurs for two primary reasons:

1.  **JSON Parsing Issue:** Ragas expects LLM outputs to be in a JSON-parsable format because its prompts are structured using Pydantic. If the model's output is not valid JSON, the response cannot be parsed and the affected score may appear as `NaN`.

2.  **Non-Ideal Cases for Scoring:** Certain scenarios within the evaluation dataset might not be suitable for calculating specific metrics. For instance, assessing the faithfulness of a response like "I don’t know" might not be meaningful, leading to a `NaN` value for that metric in such cases.

For further details, please consult the Ragas documentation: [https://github.com/explodinggradients/ragas/blob/main/docs/index.md#frequently-asked-questions](https://github.com/explodinggradients/ragas/blob/main/docs/index.md#frequently-asked-questions)
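
When aggregating per-question scores yourself, it helps to skip `NaN` entries explicitly and to report how many were skipped, since any arithmetic involving `NaN` yields `NaN`. The following is a minimal sketch with hypothetical scores and a hypothetical helper name. It does not describe how Ragas aggregates its results internally.

```python
import math


def nan_aware_mean(scores):
    """Average the valid scores and report how many entries were NaN."""
    valid = [s for s in scores if not math.isnan(s)]
    skipped = len(scores) - len(valid)
    # If every entry is NaN there is nothing to average
    mean = sum(valid) / len(valid) if valid else float("nan")
    return mean, skipped


# Hypothetical per-question faithfulness scores; NaN marks an unscored question
per_question = [0.9, float("nan"), 0.7, 1.0]

mean, skipped = nan_aware_mean(per_question)
print(f"mean={mean:.2f}, skipped={skipped}")
```

Reporting the skipped count alongside the mean shows how much of the dataset actually contributed to the score.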

In [None]:
from flotorch_core.embedding.embedding_registry import embedding_registry
from flotorch_core.embedding.titanv2_embedding import TitanV2Embedding
from flotorch_core.embedding.cohere_embedding import CohereEmbedding
from flotorch_core.inferencer.inferencer_provider_factory import InferencerProviderFactory
from flotorch_core.evaluator.ragas_evaluator import RagasEvaluator

# Initialize embeddings
embedding_class = embedding_registry.get_model(evaluation_config_data.get("eval_embedding_model"))
embedding = embedding_class(evaluation_config_data.get("eval_embedding_model"), 
                            evaluation_config_data.get("aws_region"), 
                            int(evaluation_config_data.get("eval_embed_vector_dimension"))
                            )

# Initialize inferencer
inferencer = InferencerProviderFactory.create_inferencer_provider(
    False,"","",
    evaluation_config_data.get("eval_retrieval_service"),
    evaluation_config_data.get("eval_retrieval_model"), 
    evaluation_config_data.get("aws_region"), 
    variables['bedrockExecutionRoleArn'],
    float(0.1)
)

evaluator = RagasEvaluator(inferencer, embedding)

for model in evaluation_dataset_per_model:
    # Ragas may print warnings and errors to the console here; they can be
    # ignored because they do not interrupt the evaluation flow.
    ragas_report = evaluator.evaluate(evaluation_dataset_per_model[model])
    if not ragas_report:
        # No report returned: skip this model instead of reusing stale metrics
        continue
    eval_metrics = ragas_report._repr_dict
    # Round float scores to two decimals; leave other values untouched
    eval_metrics = {key: round(value, 2) if isinstance(value, float) else value
                    for key, value in eval_metrics.items()}
    final_evaluation[model] = {
        'llm_context_precision_with_reference': eval_metrics['llm_context_precision_with_reference'],
        'faithfulness': eval_metrics['faithfulness'],
        'answer_relevancy': eval_metrics['answer_relevancy']
    }

### Cost and Latency Evaluation

In [None]:
from cost_compute_utils import calculate_cost_and_latency_metrics

for model in loaded_responses:
    inference_data = loaded_responses[model]
    cost_and_latency_metrics = calculate_cost_and_latency_metrics(inference_data, model,
                evaluation_config_data["aws_region"])
    
    if model not in final_evaluation:
        # Insert - key doesn't exist yet
        final_evaluation[model] = cost_and_latency_metrics
    else:
        # Update - key already exists
        final_evaluation[model].update(cost_and_latency_metrics)
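
The internals of `calculate_cost_and_latency_metrics` are not shown in this notebook. As a rough idea of what a token-based cost and latency calculation can look like, here is a minimal sketch. The record fields (`input_tokens`, `output_tokens`, `latency_ms`), the function name, and the prices are all hypothetical placeholders, not real Bedrock pricing and not the helper's actual logic.

```python
def estimate_cost_and_latency(records, input_price_per_1k, output_price_per_1k):
    """Sum token-based cost and average latency over inference records.

    Each record is assumed to be a dict with `input_tokens`, `output_tokens`,
    and `latency_ms`; these field names are hypothetical.
    """
    cost = 0.0
    total_latency = 0.0
    for record in records:
        cost += record["input_tokens"] / 1000 * input_price_per_1k
        cost += record["output_tokens"] / 1000 * output_price_per_1k
        total_latency += record["latency_ms"]
    average_latency = total_latency / len(records) if records else 0.0
    return {"inference_cost": round(cost, 6), "average_latency_ms": average_latency}


# Placeholder records and prices for illustration only
records = [
    {"input_tokens": 1200, "output_tokens": 300, "latency_ms": 850},
    {"input_tokens": 800, "output_tokens": 200, "latency_ms": 650},
]
metrics = estimate_cost_and_latency(
    records, input_price_per_1k=0.001, output_price_per_1k=0.004
)
print(metrics)
```

Returning a flat dictionary keeps the result easy to merge into a per-model results dictionary with `dict.update`, which is the pattern the cell above uses.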

### Evaluation metrics as pandas df

In [None]:
import pandas as pd

# Convert the nested dictionary to a DataFrame
evaluation_df = pd.DataFrame.from_dict(final_evaluation, orient='index')

# Move the model name from the index into a regular 'model' column
evaluation_df = evaluation_df.reset_index().rename(columns={'index': 'model'})

evaluation_df

In [None]:
from matplotlab_utils import plot_grouped_bar

plot_grouped_bar(evaluation_df, 'model', ['llm_context_precision_with_reference', 'faithfulness', 'answer_relevancy'], show_values=True, title='Evaluation Metrics', xlabel='Model', ylabel='Metrics')