# Evaluating RAG Systems with FloTorch

[FloTorch](https://www.flotorch.ai/) offers a robust evaluation framework for Retrieval-Augmented Generation (RAG) systems, enabling comprehensive assessment and comparison of Large Language Models (LLMs). It focuses on key metrics such as accuracy, cost, and latency, crucial for enterprise-level deployments.

## Key Evaluation Metrics for this Notebook

In this notebook, we will focus on evaluating our RAG pipelines using the following metrics:

* **Correctness:** The number of samples in which the generated answer semantically matches the expected answer.

* **Inference Cost:** This refers to the total cost incurred for invoking Bedrock models to generate responses for all entries in the ground truth dataset.

* **Latency:** This measures the time taken for the inference process, specifically the duration of the Bedrock model invocations.


RAG systems are evaluated with a scoring method that measures the quality of the responses to the questions in the evaluation set. Each response is rated as Correct, Missing, or Incorrect:

- Correct: The response correctly answers the user question and contains no hallucinated content.

- Missing: The response does not provide the requested information, for example “I don’t know” or “I’m sorry I can’t find …”, or a similar reply that gives no concrete answer to the question.

- Incorrect: The response provides wrong or irrelevant information in answer to the user question.
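
As a rough illustration of how such ratings could be aggregated, the sketch below tallies a list of per-response labels into counts and rates. It assumes each response has already been assigned one of the three labels; the helper name `summarize_labels` and the sample labels are hypothetical and are not part of FloTorch or `custom_evaluator`.

```python
from collections import Counter

# The three ratings described above, in lowercase for comparison
LABELS = ("correct", "missing", "incorrect")


def summarize_labels(labels):
    """Count each rating and report its share of the evaluation set.

    Hypothetical helper for illustration only.
    """
    normalized = [label.strip().lower() for label in labels]
    unknown = set(normalized) - set(LABELS)
    if unknown:
        raise ValueError(f"Unexpected labels: {sorted(unknown)}")

    counts = Counter(normalized)
    total = len(normalized)
    return {
        label: {
            "count": counts[label],
            "rate": counts[label] / total if total else 0.0,
        }
        for label in LABELS
    }


# Made-up labels for four responses
example = summarize_labels(["correct", "Missing", "correct", "incorrect"])
print(example)
```

Reporting rates alongside raw counts makes knowledge bases comparable even if they were evaluated on different numbers of questions.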



### Load env variables

In [None]:
import json
with open("../Lab 1/variables.json", "r") as f:
    variables = json.load(f)

variables

### Evaluation Config

In [None]:
evaluation_config_data = {
   "eval_embedding_model" : "amazon.titan-embed-text-v2:0",
   "eval_retrieval_model" : "us.amazon.nova-micro-v1:0",
   "eval_retrieval_service" : "bedrock",
   "aws_region" : variables['regionName'],
   "eval_embed_vector_dimension" : 1024,
   "inference_model": "us.amazon.nova-lite-v1:0",
}

### Load RAG response data 

In [None]:
import json

filename = "../results/ragas_evaluation_responses_for_different_kbs.json"

with open(filename, 'r', encoding='utf-8') as f:
    loaded_responses = json.load(f)


### Accuracy Evaluation with a Custom Evaluator

In [None]:
from custom_evaluator import CustomEvaluator

In [None]:
evaluator = CustomEvaluator(evaluator_llm_info = evaluation_config_data)
evaluation_metrics = {}
for kb_id, inference_data in loaded_responses.items():
    results = evaluator.evaluate(inference_data)
    evaluation_metrics[kb_id] = results
    print(f"Evaluation completed for {kb_id}")

### Evaluation output

In [None]:
final_evaluation = evaluator.evaluate_results(evaluation_metrics)

### Cost and Latency Evaluation
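
The internals of `calculate_cost_and_latency_metrics` are not shown in this notebook. As a rough sketch of the general idea, token-based inference cost and average latency could be aggregated as follows. The record fields (`input_tokens`, `output_tokens`, `latency_ms`), the helper name, and the per-1,000-token prices are all assumptions made for illustration; the prices are placeholders, not actual Bedrock pricing.

```python
from statistics import mean

# Placeholder prices in USD per 1,000 tokens (NOT real Bedrock pricing)
PRICE_PER_1K_INPUT = 0.001
PRICE_PER_1K_OUTPUT = 0.002


def summarize_cost_and_latency(records):
    """Sum token-based cost and average the latency over all records.

    Hypothetical helper; the field names are assumed for this sketch.
    """
    total_cost = 0.0
    for record in records:
        total_cost += record["input_tokens"] / 1000 * PRICE_PER_1K_INPUT
        total_cost += record["output_tokens"] / 1000 * PRICE_PER_1K_OUTPUT

    latencies = [record["latency_ms"] for record in records]
    return {
        "total_cost_usd": total_cost,
        "avg_latency_ms": mean(latencies) if latencies else 0.0,
    }


# Made-up sample records for two model invocations
sample = [
    {"input_tokens": 1000, "output_tokens": 500, "latency_ms": 800},
    {"input_tokens": 2000, "output_tokens": 1000, "latency_ms": 1200},
]
summary = summarize_cost_and_latency(sample)
print(summary)
```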

In [None]:
from cost_compute_utils import calculate_cost_and_latency_metrics

for kb_type, inference_data in loaded_responses.items():
    cost_and_latency_metrics = calculate_cost_and_latency_metrics(
        inference_data,
        evaluation_config_data["inference_model"],
        evaluation_config_data["aws_region"],
    )

    # Merge cost and latency into the accuracy metrics for this knowledge base;
    # setdefault creates an empty entry if this kb_type has no metrics yet.
    final_evaluation.setdefault(kb_type, {}).update(cost_and_latency_metrics)

### Evaluation metrics as pandas df

In [None]:
import pandas as pd

# Convert the nested dictionary to a DataFrame
evaluation_df = pd.DataFrame.from_dict(final_evaluation, orient='index')

# If you want the kb_type as a column instead of an index
evaluation_df = evaluation_df.reset_index().rename(columns={'index': 'kb_type'})

evaluation_df

#### Plot a selected metric

In [None]:
from plot_util import plot_column
import ipywidgets as widgets
from IPython.display import display

In [None]:
# Dropdown widget
dropdown = widgets.Dropdown(
    # Offer every metric column; 'kb_type' is the label column, not a metric
    options=[col for col in evaluation_df.columns if col != 'kb_type'],
    description='Metric:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='50%')
)

# Optional: toggle buttons for plot kind (bar or line)
plot_kind = widgets.ToggleButtons(
    options=['bar', 'line'],
    description='Plot Type:',
    style={'description_width': 'initial'}
)

# Function to call on change
def update_plot(column, kind):
    plot_column(evaluation_df, column, kind=kind)

# Link widgets to function
interactive_plot = widgets.interactive_output(update_plot, {
    'column': dropdown,
    'kind': plot_kind
})

# Display UI
display(dropdown, plot_kind, interactive_plot)