# Lab-02: Compare LLMs using Ragas Evaluations

### Context

The LLM used in a Retrieval Augmented Generation (RAG) system has a major impact on the quality of the generated output. Evaluating the results generated by different LLMs can help identify the right LLM for a particular use case.
In this notebook, we will build a Q&A application using the Retrieve API provided by Knowledge Bases for Amazon Bedrock, along with LangChain and Ragas for evaluating the responses. We will query the knowledge base to get the desired number of document chunks based on similarity search, prompt the query using Anthropic Claude models and an Amazon Titan model, and then evaluate the responses using metrics such as faithfulness, answer_relevancy, context_recall, context_precision, context_entity_recall, answer_similarity, answer_correctness, harmfulness, maliciousness, coherence, correctness and conciseness.

#### Notebook Walkthrough

For our notebook we will use the `Retrieve API` provided by Knowledge Bases for Amazon Bedrock. It converts user queries into
embeddings, searches the knowledge base, and returns the relevant results, giving you more control to build custom
workflows on top of the semantic search results. The output of the `Retrieve API` includes the `retrieved text chunks`, the `location type` and `URI` of the source data, as well as the relevance `scores` of the retrievals.
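
To make the shape of that output concrete, here is a minimal sketch that flattens a `Retrieve` response into plain records. The helper name `summarize_retrieval_results`, the sample response, and its bucket URI are illustrative assumptions, not part of the lab. The field names assume an S3-backed data source.

```python
def summarize_retrieval_results(response):
    """Flatten a Retrieve API response into a list of plain dicts."""
    records = []
    for result in response.get("retrievalResults", []):
        location = result.get("location", {})
        records.append({
            "text": result.get("content", {}).get("text", ""),
            "location_type": location.get("type"),
            "uri": location.get("s3Location", {}).get("uri"),
            "score": result.get("score"),
        })
    return records


# Hypothetical response, hand-written for illustration only
sample_response = {
    "retrievalResults": [
        {
            "content": {"text": "Placeholder chunk one."},
            "location": {"type": "S3", "s3Location": {"uri": "s3://example-bucket/doc.pdf"}},
            "score": 0.5,
        },
        {
            "content": {"text": "Placeholder chunk two."},
            "location": {"type": "S3", "s3Location": {"uri": "s3://example-bucket/doc.pdf"}},
            "score": 0.25,
        },
    ]
}

for record in summarize_retrieval_results(sample_response):
    print(record["score"], record["location_type"], record["uri"])
```

With a real `bedrock-agent-runtime` client, a response of this shape would come from a call along the lines of `client.retrieve(knowledgeBaseId=..., retrievalQuery={"text": query}, retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 5}})`.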


We will then augment the original prompt with the retrieved text chunks and pass it through the `anthropic.claude-3-sonnet-20240229-v1:0` model, the `anthropic.claude-3-haiku-20240307-v1:0` model, and the `amazon.titan-text-express-v1` model.
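
The augmentation step can be sketched as follows. In this lab LangChain's `RetrievalQA` chain builds the prompt for us, so the `build_augmented_prompt` helper and the template wording below are assumptions made purely to illustrate the idea, not the chain's actual prompt.

```python
def build_augmented_prompt(question, chunks):
    """Combine retrieved chunks and the user question into a single prompt."""
    # Number each chunk so the model (and a human reader) can tell them apart
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say that you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks, for illustration only
prompt = build_augmented_prompt(
    "What is discussed in the document?",
    ["First placeholder chunk. ", "Second placeholder chunk."],
)
print(prompt)
```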

Finally, we will evaluate the generated responses with Ragas, using metrics such as faithfulness, answer relevancy, and context precision. For evaluation, we will use `anthropic.claude-3-sonnet-20240229-v1:0`.

#### Dataset

In this example, you will use Octank's financial 10-K reports (a synthetically generated dataset) as the text corpus to perform Q&A on. This data will be ingested into the knowledge base.

### Python 3.10

⚠  For this lab we need to run the notebook based on a Python 3.10 runtime. ⚠


## Configuration leveraging knowledge base created from LAB01
We'll use the following data:

* Example financial statement documents of the fictional company "Octank"


In [None]:
import logging
logging.getLogger("sagemaker.config").setLevel(logging.WARNING)
import sagemaker
import boto3
import pprint
from botocore.client import Config

pp = pprint.PrettyPrinter(indent=2)
# Specify your bucket to be the default sagemaker bucket
sess = sagemaker.Session()
bucket = sess.default_bucket() #sagemaker-abcdef
filename = 'octank_financial_10K.pdf'

# Create an S3 client
s3 = boto3.client('s3')

# Upload the file
s3.upload_file(filename, bucket, filename)
print("Upload Successful!!\n")
print(f"{filename} was successfully uploaded to {bucket}/{filename}")

### Follow the steps below to set up necessary packages

1. Import the necessary libraries for creating the `bedrock-runtime` client for invoking foundation models and the `bedrock-agent-runtime` client for using the Retrieve API provided by Knowledge Bases for Amazon Bedrock.
2. Import LangChain for:
   1. Initializing the Bedrock models `anthropic.claude-3-haiku-20240307-v1:0`, `anthropic.claude-3-sonnet-20240229-v1:0`, and `amazon.titan-text-express-v1` as our large language models to perform query completions using the RAG pattern.
   2. Initializing the Bedrock model `anthropic.claude-3-sonnet-20240229-v1:0` as our large language model to perform the RAG assessment.
   3. Initializing the LangChain retriever integrated with Knowledge Bases.
   4. Later in the notebook, we will wrap the LLM and retriever with `RetrievalQA` to build our Q&A application.

In [None]:
from langchain.retrievers.bedrock import AmazonKnowledgeBasesRetriever
from langchain.chains import RetrievalQA
from langchain_aws import BedrockEmbeddings
from langchain_aws import ChatBedrock
import pandas as pd


bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})
bedrock_runtime_client = boto3.client('bedrock-runtime')
bedrock_agent_client = boto3.client('bedrock-agent')
bedrock_agent_runtime_client = boto3.client("bedrock-agent-runtime",
                              config=bedrock_config
                              )

kwargs = {
    "temperature": 0
}

llm_for_text_generation_haiku = ChatBedrock(model_id="anthropic.claude-3-haiku-20240307-v1:0", client=bedrock_runtime_client, model_kwargs=kwargs)

llm_for_text_generation_sonnet = ChatBedrock(model_id="anthropic.claude-3-sonnet-20240229-v1:0", client=bedrock_runtime_client, model_kwargs=kwargs)

llm_for_text_generation_amazon_titan = ChatBedrock(model_id="amazon.titan-text-express-v1", client=bedrock_runtime_client, model_kwargs=kwargs)

llm_for_evaluation = ChatBedrock(model_id="anthropic.claude-3-sonnet-20240229-v1:0", client=bedrock_runtime_client, model_kwargs=kwargs)

llms = [
    llm_for_text_generation_haiku,
    llm_for_text_generation_sonnet,
    llm_for_text_generation_amazon_titan,
]
print("List of LLMs to be evaluated with Ragas")
for llm in llms:
    print(llm.model_id)

### Initialize Knowledge Base from Previous Lab

In [None]:
from utility import interactive_sleep, create_knowledge_base, create_ds

boto3_session = boto3.session.Session()
region_name = boto3_session.region_name

cohere_embed_knowledge_base = create_knowledge_base(
    index_name="cohere-embed-english-v3", 
    collection_name="cohere-embed-english-v3",
    knowledge_base_name="cohere-embed-english-v3",
    vector_store_name="cohere-embed-english-v3",
    access_policy_name="cohere-embed-access-policy",
    embedding_model_arn=f"arn:aws:bedrock:{region_name}::foundation-model/cohere.embed-english-v3"
    )

bedrock_embeddings = BedrockEmbeddings(model_id="cohere.embed-english-v3",client=bedrock_runtime_client)

### Retrieve API: Process flow 

Create an `AmazonKnowledgeBasesRetriever` object from LangChain, which will call the `Retrieve API` provided by Knowledge Bases for Amazon Bedrock. The API converts user queries into
embeddings, searches the knowledge base, and returns the relevant results, giving you more control to build custom
workflows on top of the semantic search results. The output of the `Retrieve API` includes the `retrieved text chunks`, the `location type` and `URI` of the source data, as well as the relevance `scores` of the retrievals.

In [None]:
# Create retriever with the knowledge base ID
retriever = AmazonKnowledgeBasesRetriever(
        knowledge_base_id=cohere_embed_knowledge_base["knowledgeBaseId"],
        retrieval_config={"vectorSearchConfiguration": {"numberOfResults": 5}},
        # endpoint_url=endpoint_url,
        # region_name="us-east-1",
        # credentials_profile_name="<profile_name>",
)


## Preparing the Evaluation Data

As Ragas aims to be a reference-free evaluation framework, the required preparations of the evaluation dataset are minimal. You will need to prepare `question` and `ground_truth` pairs, from which you can derive the remaining information through inference as shown below. If you are not interested in the `context_recall` metric, you don’t need to provide the `ground_truth` information. In this case, all you need to prepare are the questions.
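
Before handing data to Ragas, it can help to check that the columns line up. The sketch below is a hypothetical helper (`validate_eval_data` is not part of Ragas or this lab) that assumes the column names used in this notebook: `question`, `answer`, `contexts`, and optionally `ground_truth`.

```python
def validate_eval_data(data, require_ground_truth=True):
    """Check column presence, equal lengths, and the shape of `contexts`.

    Returns the number of rows if the data looks consistent.
    """
    required = ["question", "answer", "contexts"]
    if require_ground_truth:
        required.append("ground_truth")

    missing = [col for col in required if col not in data]
    if missing:
        raise ValueError(f"Missing columns: {missing}")

    lengths = {col: len(data[col]) for col in required}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Column lengths differ: {lengths}")

    # Each row of `contexts` must be a list of retrieved text chunks
    for i, ctx in enumerate(data["contexts"]):
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            raise ValueError(f"contexts[{i}] must be a list of strings")

    return lengths["question"]


# Placeholder row, for illustration only
example = {
    "question": ["Placeholder question?"],
    "answer": ["Placeholder answer."],
    "contexts": [["Placeholder chunk one.", "Placeholder chunk two."]],
    "ground_truth": ["Placeholder reference answer."],
}
print(validate_eval_data(example))
```

Passing `require_ground_truth=False` skips the reference column for runs that only use metrics which do not need it.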

In [None]:
from datasets import Dataset
from concurrent.futures import ThreadPoolExecutor

questions = [
    "What was the primary reason for the increase in net cash provided by operating activities for Octank Financial in 2021?",
    "In which year did Octank Financial have the highest net cash used in investing activities, and what was the primary reason for this?",
    "What was the primary source of cash inflows from financing activities for Octank Financial in 2021?",
    "Calculate the year-over-year percentage change in cash and cash equivalents for Octank Financial from 2020 to 2021.",
    "Based on the information provided, what can you infer about Octank Financial's overall financial health and growth prospects?"
]
ground_truth = [
    "The increase in net cash provided by operating activities was primarily due to an increase in net income and favorable changes in operating assets and liabilities.",
    "Octank Financial had the highest net cash used in investing activities in 2021, at $360 million, compared to $290 million in 2020 and $240 million in 2019. The primary reason for this was an increase in purchases of property, plant, and equipment and marketable securities.",
    "The primary source of cash inflows from financing activities for Octank Financial in 2021 was an increase in proceeds from the issuance of common stock and long-term debt.",
    "To calculate the year-over-year percentage change in cash and cash equivalents from 2020 to 2021: \
    2020 cash and cash equivalents: $350 million \
    2021 cash and cash equivalents: $480 million \
    Percentage change = (2021 value - 2020 value) / 2020 value * 100 \
    = ($480 million - $350 million) / $350 million * 100 \
    = 37.14% increase",
    "Based on the information provided, Octank Financial appears to be in a healthy financial position and has good growth prospects. The company has consistently increased its net cash provided by operating activities, indicating strong profitability and efficient management of working capital. Additionally, Octank Financial has been investing in long-term assets, such as property, plant, and equipment, and marketable securities, which suggests plans for future growth and expansion. The company has also been able to finance its growth through the issuance of common stock and long-term debt, indicating confidence from investors and lenders. Overall, Octank Financial's steady increase in cash and cash equivalents over the past three years provides a strong foundation for future growth and investment opportunities."
]

def execute_rag(llm):
    """Answer every question with the given LLM and collect the retrieved contexts."""
    answers = []
    contexts = []
    # RetrievalQA retrieves chunks for each query and passes them to the LLM as context
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm, retriever=retriever, return_source_documents=True
    )
    print(f"Answering questions for llm: {llm.model_id}...")
    for query in questions:
        response = qa_chain.invoke(query)
        answers.append(response["result"])
        contexts.append([doc.page_content for doc in response["source_documents"]])
    # Assemble the columns used for the Ragas evaluation
    data = {
        "question": questions,
        "answer": answers,
        "contexts": contexts,
        "ground_truth": ground_truth
    }
    return llm.model_id, data

with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(execute_rag, llm) for llm in llms]
    datasets = {}
    for future in futures:
        model_id, data = future.result()
        datasets[model_id] = Dataset.from_dict(data)

print("Questions answered and datasets created!!")

## Evaluating the RAG application with different LLMs
First, import all the metrics you want to use from `ragas.metrics`. Then, you can use the `evaluate()` function and simply pass in the relevant metrics and the prepared dataset.

In [None]:
# LLM model Ragas comparison

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    context_entity_recall,
    answer_similarity,
    answer_correctness
)

from ragas.metrics.critique import (
    harmfulness,
    maliciousness,
    coherence,
    correctness,
    conciseness
)

#specify the metrics here
metrics = [
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
        context_entity_recall,
        answer_similarity,
        answer_correctness,
        harmfulness, 
        maliciousness, 
        coherence, 
        correctness, 
        conciseness
    ]

def evaluate_llm(llm, embeddings, datasets, llm_for_evaluation):
    llm_model_id = llm.model_id
    result = evaluate(
        dataset=datasets[llm_model_id],
        metrics=metrics,
        llm=llm_for_evaluation,
        embeddings=embeddings,
    )
    return llm_model_id, result.to_pandas()

with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(evaluate_llm, llm, bedrock_embeddings, datasets, llm_for_evaluation) for llm in llms]
    dfs = {}
    for future in futures:
        model_id, df = future.result()
        dfs[model_id] = df

print("Ragas evaluation complete!!")
# disregard any "failed to parse output" errors. Those originate from the Ragas library 

### RAGAS METRICS AND RESULTS ANALYSIS


Just like in any machine learning system, the performance of individual components within the LLM and RAG pipeline has a significant impact on the overall experience. Ragas offers metrics tailored for evaluating each component of your RAG pipeline in isolation.


<img src="./assets/RAGAS_Scores.png" width=50% height=20% />

Below we will plot the results:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

numeric_cols = ['faithfulness', 'answer_relevancy', 'context_precision', 'context_recall',
                'context_entity_recall', 'answer_similarity', 'answer_correctness', 'harmfulness',
                'maliciousness', 'coherence', 'correctness', 'conciseness']


# Calculate the mean of numeric columns for each DataFrame
means = []
llm_names = []
for model_id, df in dfs.items():
    mean_values = df[numeric_cols].mean()
    means.append(mean_values)
    # The keys of dfs are already the model IDs of the evaluated LLMs
    llm_names.append(model_id)

# Combine the means into a single DataFrame
means_df = pd.DataFrame(means, columns=numeric_cols)
means_df = means_df.set_index(pd.Series(llm_names))

# Reshape the DataFrame for plotting
means_df = means_df.reset_index().melt(id_vars='index', value_vars=numeric_cols, var_name='Metric', value_name='Mean')

In [None]:
# Create a figure and axis
fig, ax = plt.subplots(figsize=(12, 6))

# Plot the means as a bar chart (first 7 metrics x 3 LLMs = 21 rows)
bar_plot = means_df[0:21].set_index(['Metric', 'index'])['Mean'].unstack().plot(kind='bar', ax=ax, figsize=(20, 6), width=0.8, legend=True)

# Set the chart title and axis labels
ax.set_title('Ragas Performance - Metrics Block1', fontsize=14)
ax.set_xlabel('Metrics', fontsize=12)
ax.set_ylabel('Average', fontsize=12)

# Rotate the x-axis labels for better readability
plt.xticks(rotation=45)

# Label the legend with the LLM model IDs. Use the labels pandas generated:
# unstack() sorts the columns, so their order may differ from llm_names.
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles, labels, loc='lower right', ncol=1)

# Display the chart
plt.show()

In [None]:
# Create a figure and axis
fig, ax = plt.subplots(figsize=(12, 6))

# Plot the means as a bar chart (remaining 5 critique metrics x 3 LLMs)
bar_plot = means_df[21:].set_index(['Metric', 'index'])['Mean'].unstack().plot(kind='bar', ax=ax, figsize=(20, 6), width=0.8, legend=True)

# Set the chart title and axis labels
ax.set_title('Ragas Performance - Metrics Block2', fontsize=14)
ax.set_xlabel('Metrics', fontsize=12)
ax.set_ylabel('Average', fontsize=12)

# Rotate the x-axis labels for better readability
plt.xticks(rotation=45)

# Label the legend with the LLM model IDs. Use the labels pandas generated:
# unstack() sorts the columns, so their order may differ from llm_names.
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles, labels, loc='lower right', ncol=1)

# Display the chart
plt.show()

In [None]:
print (means_df[:18])

print (means_df[18:])

> Note: The scores above give a relative idea of the performance of your RAG application and should be used with caution, not as standalone scores. Also note that we used only 5 question/answer pairs for evaluation; as a best practice, you should use enough data to cover the different aspects of your documents when evaluating a model.
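
Because the scores are relative, one way to read them is to ask which model comes out ahead on each metric. The sketch below uses only the standard library. The helper `best_model_per_metric`, the model names, and the scores are placeholders for illustration, not results from this lab. It also assumes that lower is better for `harmfulness` and `maliciousness`, where a score of 1 flags a problematic answer.

```python
from statistics import mean

# Assumption: for these critique metrics a lower mean is the better outcome
LOWER_IS_BETTER = {"harmfulness", "maliciousness"}


def best_model_per_metric(scores):
    """scores: {model_id: {metric: [per-question scores]}} -> {metric: best model_id}"""
    means = {
        model: {metric: mean(values) for metric, values in per_metric.items()}
        for model, per_metric in scores.items()
    }
    metrics = sorted({metric for per_metric in means.values() for metric in per_metric})
    best = {}
    for metric in metrics:
        candidates = {m: per[metric] for m, per in means.items() if metric in per}
        pick = min if metric in LOWER_IS_BETTER else max
        best[metric] = pick(candidates, key=candidates.get)
    return best


# Placeholder scores for two hypothetical models
placeholder_scores = {
    "model-a": {"faithfulness": [1.0, 0.5], "harmfulness": [1, 0]},
    "model-b": {"faithfulness": [0.5, 0.5], "harmfulness": [0, 0]},
}
print(best_model_per_metric(placeholder_scores))
```

With only a handful of questions, small differences between means are not meaningful, so a per-metric "winner" is a prompt for closer inspection rather than a verdict.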

Based on the scores, you can review other components of your RAG workflow to further optimize the results. A few recommended options are to review your chunking strategy, refine your prompt instructions, and increase `numberOfResults` to provide additional context.
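
As a starting point for experimenting with `numberOfResults`, the sketch below builds the same `retrieval_config` dictionary used earlier in this notebook for several values. The helper name `make_retrieval_config` and the candidate values are assumptions for illustration.

```python
def make_retrieval_config(number_of_results):
    """Build a retrieval_config dict in the shape used by the retriever above."""
    if not isinstance(number_of_results, int) or number_of_results < 1:
        raise ValueError("number_of_results must be a positive integer")
    return {"vectorSearchConfiguration": {"numberOfResults": number_of_results}}


# Hypothetical candidate values to compare in separate evaluation runs
candidate_configs = {k: make_retrieval_config(k) for k in (3, 5, 10)}
for k, config in candidate_configs.items():
    print(k, config)
```

Each configuration can be passed as `retrieval_config` when creating an `AmazonKnowledgeBasesRetriever`, after which the question-answering and evaluation steps are rerun to compare scores.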