# Evaluate context based RAG responses from Knowledge Bases, Iterate and Monitor with TruEra

SageMaker JumpStart provides a variety of pretrained open source and proprietary models such as Llama-2, Anthropic’s Claude and Cohere Command that can be quickly deployed in the Sagemaker environment. In many cases however, these foundation models are not sufficient on their own for production use cases, needing to be adapted to a particular style or new tasks. One way to surface this need is by evaluating the model against a curated ground truth dataset. Once the need to adapt the foundation model is clear, one could leverage a set of techniques to carry that out. A popular approach is to fine-tune the model on a dataset that is tailored to the use case.

One challenge with this approach is that curated ground truth datasets are expensive to create. In this blog post, we address this challenge by augmenting this workflow with a framework for extensible, automated evaluations. We start off with a baseline foundation model from SageMaker JumpStart and evaluate it with TruLens, an open source library for evaluating & tracking LLM apps. Once we identify the need for adaptation, we can leverage fine-tuning in Sagemaker Jumpstart and confirm improvement with TruLens.

TruLens evaluations make use of an abstraction of feedback functions. These functions can be implemented in several ways, including BERT-style models, appropriately prompted Large Language Models, and more. TruLens’ integration with AWS Bedrock allows you to easily run evaluations using LLMs available from AWS Bedrock. The reliability of Bedrock’s infrastructure is particularly valuable for use in performing evaluations across development and production.


---
In this demo notebook, we demonstrate how to use the SageMaker Python SDK to deploy pre-trained Llama 2 model as well as fine-tune it for your dataset in domain adaptation or instruction tuning format. We will also use TruLens to identify performance issues with the base model and validate improvement of the fine-tuned model.

---

In [12]:
# ! pip install trulens_eval==0.18.3 sagemaker datasets boto3 

## Install the necessary packages for SDK set up

In [13]:
!pip install --upgrade pip

# Install Boto3 version 1.0 specifically
!pip install boto3==1.0.0

# Make sure Boto3 is no older than version 1.15.0
!pip install boto3>=1.15.0

# Avoid versions of Boto3 newer than version 1.15.3
!pip install boto3<=1.15.3

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
[0mLooking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting boto3==1.0.0
  Downloading boto3-1.0.0-py2.py3-none-any.whl (94 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m94.4/94.4 kB[0m [31m68.2 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting botocore==1.0.0 (from boto3==1.0.0)
  Downloading botocore-1.0.0-py2.py3-none-any.whl (1.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.7/1.7 MB[0m [31m82.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting jmespath<1.0.0,>=0.6.2 (from boto3==1.0.0)
  Downloading jmespath-0.10.0-py2.py3-none-any.whl (24 kB)
Collecting futures==2.2.0 (from boto3==1.0.0)
  Downloading futures-2.2.0-py2.py3-none-any.whl (16 kB)
Collecting jmespath<1.0.0,>=0.6.2 (from boto3==1.0.0)
  Downloading jmespath-0.7.1-py2.py3-none-any.whl (19 kB)
Installing collected packages: jmespath, futures, botocore, boto3
  Attempting uninst

## RetrieveAndGenerate VS. Retrieve API calls on KB

---
Next, we invoke the KB APIs for retrieve and generate and retrieve API calls to make sure which option is the best for our use case

---

In [19]:
!pip install --upgrade boto3 botocore

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting boto3
  Downloading boto3-1.34.11-py3-none-any.whl.metadata (6.6 kB)
Collecting botocore
  Downloading botocore-1.34.11-py3-none-any.whl.metadata (5.6 kB)
Collecting urllib3<2.1,>=1.25.4 (from botocore)
  Downloading urllib3-2.0.7-py3-none-any.whl.metadata (6.6 kB)
Downloading boto3-1.34.11-py3-none-any.whl (139 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m139.3/139.3 kB[0m [31m47.7 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading botocore-1.34.11-py3-none-any.whl (11.9 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m11.9/11.9 MB[0m [31m161.0 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hDownloading urllib3-2.0.7-py3-none-any.whl (124 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m124.2/124.2 kB[0m [31m119.1 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: urllib3, botocore, boto3
  Attempting uninstall: urllib3
    

In [20]:
import boto3
import pprint
from botocore.client import Config

pp = pprint.PrettyPrinter(indent=2)

bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})
bedrock_client = boto3.client('bedrock-runtime')
bedrock_agent_client = boto3.client("bedrock-agent-runtime",
                              config=bedrock_config)

model_id = "anthropic.claude-instant-v1" # try with both claude instant as well as claude-v2. for claude v2 - "anthropic.claude-v2"
region_id = "us-east-1" # replace it with the region you're running sagemaker notebook
kb_id = "UAZGA1FONQ" # replace it with the Knowledge base id.

In [21]:
def retrieveAndGenerate(input, kbId, sessionId=None, model_id = "anthropic.claude-v2", region_id = "us-east-1"):
    model_arn = f'arn:aws:bedrock:{region_id}::foundation-model/{model_id}'
    if sessionId:
        return bedrock_agent_client.retrieve_and_generate(
            input={
                'text': input
            },
            retrieveAndGenerateConfiguration={
                'type': 'KNOWLEDGE_BASE',
                'knowledgeBaseConfiguration': {
                    'knowledgeBaseId': kbId,
                    'modelArn': model_arn
                }
            },
            sessionId=sessionId
        )
    else:
        return bedrock_agent_client.retrieve_and_generate(
            input={
                'text': input
            },
            retrieveAndGenerateConfiguration={
                'type': 'KNOWLEDGE_BASE',
                'knowledgeBaseConfiguration': {
                    'knowledgeBaseId': kbId,
                    'modelArn': model_arn
                }
            }
        )

### Now,Retrieving and Generating Responses

In [22]:
query = "What is the amazon sagemaker?"
response = retrieveAndGenerate(query, kb_id,model_id=model_id,region_id=region_id)
generated_text = response['output']['text']
pp.pprint(generated_text)

('Amazon SageMaker is a fully managed service that enables developers and data '
 'scientists to quickly and easily build, train, and deploy machine learning '
 'models at any scale. SageMaker removes all the barriers that typically slow '
 'down developers who want to use machine learning.')


## Now, let's set up several questions in a dictionary to evaluate the KB responses on RetrieveAndGenerate API

In [23]:
import boto3
import pprint

# Define your queries
queries = [
    "What is Amazon EC2?",
    "How does Amazon S3 work?",
    "What are the benefits of Amazon RDS?",
    "Explain Amazon Lambda usage.",
    "What is Amazon DynamoDB?",
    "Describe Amazon VPC.",
    "How to use Amazon SES?",
    "What is Amazon Redshift?",
    "Explain the purpose of Amazon EKS.",
    "What features does Amazon CloudFront offer?"
]

# Initialize the PrettyPrinter
pp = pprint.PrettyPrinter(indent=2)

# Loop through each query and get responses
responses = []
for query in queries:
    response = retrieveAndGenerate(query, kb_id, model_id=model_id, region_id=region_id)
    generated_text = response['output']['text']
    responses.append(generated_text)

# Print the responses
for i, response in enumerate(responses):
    print(f"Query: {queries[i]}")
    pp.pprint(response)
    print("\n" + "-"*50 + "\n")

Query: What is Amazon EC2?
('Amazon EC2 is a web service that provides secure, resizable compute capacity '
 'in the cloud. It allows users to obtain and configure computing resources '
 'like server instances quickly and easily.')

--------------------------------------------------

Query: How does Amazon S3 work?
('Amazon S3 is an object storage service that allows users to store and '
 'protect any amount of data. Users can organize their data and control access '
 'permissions based on their business and compliance needs. Data stored in '
 'Amazon S3 has 99.999999999% durability.')

--------------------------------------------------

Query: What are the benefits of Amazon RDS?
('Amazon RDS makes it easy to set up, operate, and scale a relational database '
 'in the cloud. It provides cost-efficient and resizable capacity while '
 'automating time- consuming administration tasks such as hardware '
 'provisioning, database setup, patching and backups. It frees you to focus on '
 'you

### Set up as responses from the KB

In [26]:
!pip install trulens_eval

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting trulens_eval
  Downloading trulens_eval-0.20.0-py3-none-any.whl.metadata (3.0 kB)
Collecting frozendict>=2.3.8 (from trulens_eval)
  Downloading frozendict-2.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (23 kB)
Collecting munch>=3.0.0 (from trulens_eval)
  Downloading munch-4.0.0-py2.py3-none-any.whl.metadata (5.9 kB)
Collecting merkle-json>=1.0.0 (from trulens_eval)
  Downloading merkle_json-1.0.0-py3-none-any.whl (5.2 kB)
Collecting millify>=0.1.1 (from trulens_eval)
  Downloading millify-0.1.1.tar.gz (1.2 kB)
  Preparing metadata (setup.py) ... [?25ldone
[?25hCollecting humanize>=4.6.0 (from trulens_eval)
  Downloading humanize-4.9.0-py3-none-any.whl.metadata (7.9 kB)
Collecting streamlit>=1.27.0 (from trulens_eval)
  Downloading streamlit-1.29.0-py2.py3-none-any.whl.metadata (8.2 kB)
Collecting streamlit-aggrid>=0.3.4.post3 (from trulens_eval)
  Downloading streamlit_agg

In [28]:
import pandas as pd
from trulens_eval.feedback import GroundTruthAgreement, Groundedness
from trulens_eval import TruBasicApp, Feedback, Tru, Select
from trulens_eval import Bedrock

In [29]:
## Set up the retriever for context based responses
def retrieve(query, kbId, numberOfResults=5):
    return bedrock_agent_client.retrieve(
        retrievalQuery= {
            'text': query
        },
        knowledgeBaseId=kbId,
        retrievalConfiguration= {
            'vectorSearchConfiguration': {
                'numberOfResults': numberOfResults
            }
        }
    )

## Setting up the test dataset

In [30]:
# Define your queries
queries = [
    "What is Amazon EC2?",
    "How does Amazon S3 work?",
    "What are the benefits of Amazon RDS?",
    "Explain Amazon Lambda usage.",
    "What is Amazon DynamoDB?",
    "Describe Amazon VPC.",
    "How to use Amazon SES?",
    "What is Amazon Redshift?",
    "Explain the purpose of Amazon EKS.",
    "What features does Amazon CloudFront offer?"
]

# Your knowledge base ID and other configuration details
kb_id = "UAZGA1FONQ"
model_id = "anthropic.claude-v2"
region_id = "us-east-1"

# Sample test dataset preparation
test_dataset = []

for query in queries:
    # Retrieve context for the query
    context = retrieve(query, kb_id)  # Assuming this returns a string or similar
    
    # Retrieve and generate response for the query
    response = retrieveAndGenerate(query, kb_id, model_id=model_id, region_id=region_id)
    
    # Assuming both functions return a string, append to the test dataset
    test_dataset.append({
        "query": query,
        "context": context,
        "response": response
    })

## Run the tru eval open source evaluator

In [43]:
import pandas as pd
from trulens_eval.feedback import GroundTruthAgreement, Groundedness
from trulens_eval import TruBasicApp, Feedback, Tru, Select, Bedrock

def KB_responses(instruction, context):
    input_output_demarkation_key = "\n\n### Response:\n"
    payload = {
        "inputs": template["prompt"].format(instruction=instruction, context=context) + input_output_demarkation_key,
        "parameters": {"max_new_tokens": 200},
    }
    return retrieveAndGenerate(query, kb_id, model_id=model_id, region_id=region_id)

# Prepare test dataset
test_dataset = []

for query in queries:
    context = retrieve(query, kb_id)  # Retrieve context for the query
    response = retrieveAndGenerate(query, kb_id, model_id=model_id, region_id=region_id)  # Generate response
    test_dataset.append({"query": query, "context": context, "response": response})

# Process the dataset using KB_responses
for item in test_dataset:
    kb_response = KB_responses(item["query"], item["context"])
    print(f"Query: {item['query']}\nResponse: {kb_response}\n")

Query: What is Amazon EC2?
Response: {'ResponseMetadata': {'RequestId': 'f4e5de82-a6f5-49b3-a1f0-84c19bb993b6', 'HTTPStatusCode': 200, 'HTTPHeaders': {'date': 'Tue, 02 Jan 2024 15:42:00 GMT', 'content-type': 'application/json', 'content-length': '5221', 'connection': 'keep-alive', 'x-amzn-requestid': 'f4e5de82-a6f5-49b3-a1f0-84c19bb993b6'}, 'RetryAttempts': 0}, 'sessionId': 'c04ed4ec-6efe-42cc-a0af-48d09306688f', 'output': {'text': 'Amazon CloudFront offers a fast content delivery network (CDN) service that securely delivers data, videos, applications, and APIs to customers globally with low latency and high transfer speeds. It is integrated with AWS services like AWS Shield for DDoS mitigation, Amazon S3, Elastic Load Balancing or Amazon EC2 as origins, and Lambda@Edge to customize the user experience. CloudFront has a simple, pay-as-you-go pricing model with no upfront fees or required long-term contracts.'}, 'citations': [{'generatedResponsePart': {'textResponsePart': {'text': 'Amaz

In [46]:
for item in test_dataset:
    kb_response = KB_responses(item["query"], item["context"])
    
    # Extract the main text response
    main_text = kb_response['output']['text']
    
    print(f"Query: {item['query']}\nResponse: {main_text}\n")


Query: What is Amazon EC2?
Response: Amazon CloudFront offers a fast content delivery network (CDN) service that securely delivers data, videos, applications, and APIs to customers globally with low latency and high transfer speeds. It is integrated with AWS services like AWS Shield for DDoS mitigation, Amazon S3, Elastic Load Balancing or Amazon EC2 as origins, and Lambda@Edge to customize the user experience. CloudFront has a simple, pay-as-you-go pricing model with no upfront fees or required long-term contracts.

Query: How does Amazon S3 work?
Response: Amazon CloudFront offers a fast content delivery network service that securely delivers data, videos, applications, and APIs to customers globally with low latency and high transfer speeds. It is integrated with AWS services like AWS Shield for DDoS mitigation, Amazon S3, Elastic Load Balancing or Amazon EC2 as origins, and Lambda@Edge to customize the user experience. CloudFront has a simple pay-as-you-go pricing model with no upf

In [47]:
from trulens_eval.feedback import GroundTruthAgreement, Groundedness
from trulens_eval import TruBasicApp, Feedback, Tru, Select
import boto3

import os

In [50]:
# Rename columns
test_dataset = pd.DataFrame(test_dataset)
test_dataset.rename(columns={"instruction": "query"}, inplace=True)

# Convert DataFrame to a list of dictionaries
golden_set = test_dataset[["query","response"]].to_dict(orient='records')

In [51]:
# Create a Feedback object for ground truth similarity
ground_truth = GroundTruthAgreement(golden_set)
# Call the agreement measure on the instruction and output
f_groundtruth = (Feedback(ground_truth.agreement_measure, name = "Ground Truth Agreement")
                 .on(Select.Record.calls[0].args.args[0])
                 .on_output()
                )

# Instantiate Bedrock
from trulens_eval import Bedrock

# Initialize Bedrock as feedback function provider
bedrock = Bedrock(model_id = "amazon.titan-tg1-large", region_name="us-east-1")

# Answer Relevance
f_answer_relevance = (Feedback(bedrock.relevance_with_cot_reasons, name = "Answer Relevance")
                      .on(Select.Record.calls[0].args.args[0])
                      .on_output()
                      )

# Context Relevance
f_context_relevance = (Feedback(bedrock.qs_relevance_with_cot_reasons, name = "Context Relevance")
                       .on(Select.Record.calls[0].args.args[0])
                       .on(Select.Record.calls[0].args.args[1])
                      )

# Groundedness
grounded = Groundedness(groundedness_provider=bedrock)
f_groundedness = (Feedback(grounded.groundedness_measure_with_cot_reasons, name = "Groundedness")
                .on(Select.Record.calls[0].args.args[1])
                .on_output()
                .aggregate(grounded.grounded_statements_aggregator)
            )

OpenAIError: The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable

In [52]:


# Prepare the dataset
test_dataset = pd.DataFrame(test_dataset)
test_dataset.rename(columns={"instruction": "query"}, inplace=True)
golden_set = test_dataset[["query", "response"]].to_dict(orient='records')

# Setup feedback mechanisms
ground_truth = GroundTruthAgreement(golden_set)
f_groundtruth = Feedback(ground_truth.agreement_measure, name="Ground Truth Agreement").on(Select.Record.calls[0].args.args[0]).on_output()

bedrock = Bedrock(model_id="amazon.titan-tg1-large", region_name="us-east-1")
f_answer_relevance = Feedback(bedrock.relevance_with_cot_reasons, name="Answer Relevance").on(Select.Record.calls[0].args.args[0]).on_output()
f_context_relevance = Feedback(bedrock.qs_relevance_with_cot_reasons, name="Context Relevance").on(Select.Record.calls[0].args.args[0]).on(Select.Record.calls[0].args.args[1])

grounded = Groundedness(groundedness_provider=bedrock)
f_groundedness = Feedback(grounded.groundedness_measure_with_cot_reasons, name="Groundedness").on(Select.Record.calls[0].args.args[1]).on_output().aggregate(grounded.grounded_statements_aggregator)

# Setup TruLens App
finetuned_recorder = TruBasicApp(KB_responses, app_id="KB retrieveandgenerate API", feedbacks=[f_groundtruth, f_answer_relevance, f_context_relevance, f_groundedness])

# Evaluate the dataset
for i in range(len(test_dataset)):
    with finetuned_recorder as recording:
        finetuned_recorder.app(test_dataset["query"][i], test_dataset["context"][i])

# Retrieve and display results
records, feedback = Tru().get_leaderboard(app_ids=["KB retrieveandgenerate API"])
print(records)
print(feedback)


OpenAIError: The api_key client option must be set either by passing api_key to the client or by setting the OPENAI_API_KEY environment variable