## Loading Generated Synthetic Datasets

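This section loads a synthetic question-answer dataset generated for testing. Each record holds a question, a ground-truth answer, a question type (`simple` or `complex`), and the source contexts.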

In [1]:
import pandas as pd

df = pd.read_json("data/sample_qa_dataset.jsonl", lines=True)
df.head()

| | question | ground_truth | question_type | contexts |
| --- | --- | --- | --- | --- |
| 0 | How do temperature, Top K, and Top P parameter... | Temperature, Top K, and Top P are parameters t... | complex | • If you set a high temperature, the probabili... |
| 1 | How long will Amazon Bedrock support base mode... | Amazon Bedrock will support base models for a ... | simple | • EOL: This version is no longer available for... |
| 2 | How does the system handle a scenario where a ... | The system doesn't explicitly show a function ... | complex | 'payment_date': ['2021-10-05', '2021-10-06', '... |
| 3 | What is the purpose of an S3 retrieval node in... | An S3 retrieval node lets you retrieve data fr... | simple | An S3 retrieval node lets you retrieve data fr... |
| 4 | How can a developer create a new prompt versio... | To create a new prompt version, retrieve its i... | complex | make a CreatePromptVersion Agents for Amazon B... |


In [2]:
from datasets import Dataset
import ast
import re

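def clean_string(s):
    # Drop non-ASCII characters (e.g. bullet glyphs), then normalize quotes.
    s = re.sub(r'[^\x00-\x7F]+', '', s)
    s = s.replace("'", '"')
    return s

def convert_to_list(example):
    # `contexts` is stored as a string; try to parse it as a Python literal.
    cleaned_context = clean_string(example["contexts"])
    try:
        contexts = ast.literal_eval(cleaned_context)
    except (ValueError, TypeError, SyntaxError, MemoryError, RecursionError):
        # Not a valid literal: keep the cleaned string unchanged.
        contexts = cleaned_context
    return {"contexts": contexts}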


subset_length = 10  # Number of rows to evaluate; change to cover more or fewer rows
test_dataset = Dataset.from_pandas(df.head(subset_length))

test_dataset = test_dataset.map(convert_to_list)
print(test_dataset)


Dataset({
    features: ['question', 'ground_truth', 'question_type', 'contexts'],
    num_rows: 10
})
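
The parsing step can be made more predictable. The sketch below is a variation, not the notebook's method: a hypothetical `parse_contexts` helper tries `ast.literal_eval` on the raw string first and always returns a list, so every row ends up with the same type. It skips the quote replacement, because rewriting `'` as `"` can turn a valid literal that contains an apostrophe into an invalid one.

```python
import ast


def parse_contexts(raw):
    """Parse a stringified Python list of contexts (hypothetical helper).

    Falls back to a one-item list when `raw` is not a valid literal.
    """
    try:
        value = ast.literal_eval(raw)
    except (ValueError, TypeError, SyntaxError, MemoryError, RecursionError):
        # Plain text (not a Python literal): wrap it so the type stays a list.
        return [raw]
    if isinstance(value, (list, tuple)):
        return [str(v) for v in value]
    # A single literal such as a quoted string.
    return [str(value)]


print(parse_contexts("['first chunk', 'second chunk']"))
print(parse_contexts("An S3 retrieval node lets you retrieve data"))
```

With `Dataset.map`, such a helper could be used as `lambda ex: {"contexts": parse_contexts(ex["contexts"])}`.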


## RAG Pipeline Setting

The test dataset simulates real-world queries against a RAG pipeline, which combines document retrieval with response generation.

The RAG configuration here uses the default settings of a knowledge base in Amazon Bedrock.

_1. To run the code below, the knowledge base must already be configured in the Amazon Bedrock console._

_2. To evaluate a different RAG pipeline, modify the cells below accordingly._

### Context Retrieval

This section tests the system's ability to retrieve relevant context from the knowledge base for a given query.

Retrieval is a critical step in the RAG pipeline because the accuracy of the retrieved context directly affects the quality of the generated responses.
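
The retrieval helper lives in `libs/bedrock_kb_util.py` and is not shown in this notebook. As a rough sketch of what such a helper might do, the hypothetical `retrieve_contexts` function below calls the `retrieve` operation of the `bedrock-agent-runtime` client and flattens the response into dicts with `index` and `content` keys. The function name, the returned shape, and the extra `score` key are assumptions for illustration; the client is passed in so the function can be exercised without AWS access.

```python
def retrieve_contexts(client, query, top_k, kb_id, search_type="SEMANTIC"):
    """Query a Bedrock knowledge base and return simplified results.

    `client` is expected to be a boto3 "bedrock-agent-runtime" client.
    This is a hypothetical stand-in for the notebook's own helper.
    """
    response = client.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {
                "numberOfResults": top_k,
                "overrideSearchType": search_type,  # "SEMANTIC" or "HYBRID"
            }
        },
    )
    # Keep only the chunk text and score, numbering results from 1.
    return [
        {"index": i, "content": r["content"]["text"], "score": r.get("score")}
        for i, r in enumerate(response["retrievalResults"], start=1)
    ]


# Example wiring (requires AWS credentials and an existing knowledge base):
# import boto3
# client = boto3.client("bedrock-agent-runtime", region_name="us-west-2")
# results = retrieve_contexts(client, "What is Top P?", 3, "<your-kb-id>")
```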

In [16]:
# RAG implementation sample 1 (Replace with RAG pipeline for evaluation)
from libs.bedrock_kb_util import context_retrieval_from_kb

amazon_kb_id = 'HNCKVA5XST'  # Replace with the ID of your own knowledge base

question = test_dataset[0]['question']
print("question:\n", question)
search_result = context_retrieval_from_kb(question, 3, 'us-west-2', amazon_kb_id, 'SEMANTIC')
print("search_result[0]:", search_result[0])

contexts = "\n--\n".join([result['content'] for result in search_result])
print("context:", contexts)

question:
 How do temperature, Top K, and Top P parameters interact in Amazon Bedrock's foundation models, and how might adjusting these affect the output when generating text about different types of equines?
search_result[0]: {'index': 1, 'content': 'If you set Top P as 0.7, the model only considers "horses" because it is the only candidate that lies in the top 70% of the probability distribution. If you set Top P as 0.9, the model considers "horses" and "zebras" as they are in the top 90% of probability distribution.     Randomness and diversity 325Amazon Bedrock User Guide     Length     Foundation models typically support parameters that limit the length of the response. Examples of these parameters are provided below.     ? Response length ? An exact value to specify the minimum or maximum number of tokens to return in the generated response.     ? Penalties ? Specify the degree to which to penalize outputs in a response. Examples include the following.     ? The length of the re

In [9]:
import boto3
from botocore.config import Config

model_id = "anthropic.claude-3-5-sonnet-20240620-v1:0"
region = 'us-west-2'

retry_config = Config(
    region_name=region,
    retries={"max_attempts": 10, "mode": "standard"}
)
boto3_client = boto3.client("bedrock-runtime", config=retry_config)

### Answer Generation

This step generates an answer from the retrieved context by calling the model through the Bedrock Converse API.

In [10]:
def generate_answer(question, contexts):
    system_prompt = """You are an AI assistant that uses retrieved context to answer questions accurately. 
    Follow these guidelines:
    1. Use the provided context to inform your answers.
    2. If the context doesn't contain relevant information, say "I don't have enough information to answer that."
    3. Be concise and to the point in your responses."""

    user_prompt = f"""Context: {contexts}

    Question: {question}

    Please answer the question based on the given context."""

    response = boto3_client.converse(
        modelId=model_id,
        messages=[{'role': 'user', 'content': [{'text': user_prompt}]}],
        system=[{'text': system_prompt}]
    )

    answer = response['output']['message']['content'][0]['text']
    return answer

generate_answer(question, contexts)

'Based on the provided context, I can explain how temperature, Top K, and Top P parameters interact in Amazon Bedrock\'s foundation models and how adjusting them might affect the output when generating text about different types of equines:\n\n1. Temperature: \n- Lower values increase the likelihood of higher-probability tokens and decrease the likelihood of lower-probability tokens.\n- Higher values increase the likelihood of lower-probability tokens and decrease the likelihood of higher-probability tokens.\n\n2. Top K:\n- Lower values remove lower-probability tokens from consideration.\n- Higher values allow more lower-probability tokens to be considered.\n\n3. Top P:\n- Lower values remove lower-probability tokens by considering only the top percentage of the probability distribution.\n- Higher values allow more lower-probability tokens by considering a larger percentage of the probability distribution.\n\nIn the context of generating text about different types of equines:\n\n- If y

### Full Process for All Sample Questions

This section runs the entire pipeline, from context retrieval to answer generation, for every question in the test subset.
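
Running many requests back to back makes throttling a practical concern. One common alternative to pausing a fixed interval before every item is to retry only when a call fails, waiting longer after each failure. The sketch below shows that idea with a hypothetical `call_with_backoff` helper that is not part of this notebook; it catches every exception for brevity, whereas real code would narrow this to throttling errors.

```python
import time


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `fn()` and retry with exponentially growing delays on failure.

    Delays are base_delay * 2**attempt: 1s, 2s, 4s, ... for base_delay=1.
    `sleep` is injectable so the behaviour can be tested without waiting.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Re-raise once the final attempt has failed.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Example: wrap any zero-argument callable.
print(call_with_backoff(lambda: "done"))
```

A pipeline step could then be wrapped as `call_with_backoff(lambda: generate_answer(question, contexts))`.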

In [11]:
from time import sleep

kb_region = 'us-west-2'
kb_id = amazon_kb_id
top_k = 3

def process_item(item):
    sleep(5)  # Prevent throttling
    question = item['question']
    search_result = context_retrieval_from_kb(question, top_k, kb_region, kb_id, 'SEMANTIC')

    contexts = [result['content'] for result in search_result]
    answer = generate_answer(question, "\n--\n".join(contexts))

    return {
        'question': item['question'],
        'ground_truth': item['ground_truth'],
        'original_contexts': item['contexts'],
        'retrieved_contexts': contexts,
        'answer': answer
    }

updated_dataset = test_dataset.map(process_item)


### Saving Intermediate Results to File

In [12]:
import json
output_file = "data/sample_processed_qa_dataset.jsonl"

with open(output_file, 'w', encoding='utf-8') as f:
    for item in updated_dataset:
        json.dump(item, f, ensure_ascii=False)
        f.write('\n')

print(f"Dataset saved to {output_file}")

Dataset saved to data/sample_processed_qa_dataset.jsonl


### Data Format Verification

In [13]:
import json
from datasets import Dataset

input_file = "data/sample_processed_qa_dataset.jsonl"
def read_jsonl(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            yield json.loads(line.strip())

updated_dataset = Dataset.from_list(list(read_jsonl(input_file)))

item = updated_dataset[0]
print(f"Question: {item['question']}\n\n")
print(f"Answer: {item['answer']}")

Question: How do temperature, Top K, and Top P parameters interact in Amazon Bedrock's foundation models, and how might adjusting these affect the output when generating text about different types of equines?


Answer: Based on the provided context, here's how temperature, Top K, and Top P parameters interact in Amazon Bedrock's foundation models and how adjusting them might affect output about different types of equines:

1. Temperature: 
   - Lower values increase the likelihood of higher-probability tokens and decrease the likelihood of lower-probability tokens.
   - Higher values increase the likelihood of lower-probability tokens and decrease the likelihood of higher-probability tokens.
   - For equine-related text, lower temperature might favor more common horse terms, while higher temperature could introduce more diverse or unusual equine references.

2. Top K:
   - Lower values remove lower-probability tokens from consideration.
   - Higher values allow more lower-probability t

## Evaluation for Each Metric

We now evaluate the pipeline on four metrics: answer relevancy, faithfulness, context recall, and context precision.

For the detailed implementations, refer to the `libs/custom_ragas.py` file, which defines the criteria used to score the RAG pipeline along each of these dimensions.
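
As one illustration of such a metric, the open-source Ragas library computes context precision as a rank-weighted average precision over per-chunk relevance verdicts. The sketch below assumes that `libs/custom_ragas.py` follows the same formula, which this notebook does not confirm, and that binary verdicts (1 = relevant, 0 = not relevant) have already been produced by an LLM judge for each retrieved chunk in rank order. The function name `context_precision` is hypothetical.

```python
def context_precision(verdicts):
    """Rank-weighted average precision over binary relevance verdicts.

    For each relevant chunk at rank k, add precision@k; then divide by the
    number of relevant chunks. The small epsilon avoids division by zero.
    """
    numerator = sum(
        (sum(verdicts[: i + 1]) / (i + 1)) * verdicts[i]
        for i in range(len(verdicts))
    )
    denominator = sum(verdicts) + 1e-10
    return numerator / denominator


# Relevant chunks ranked first score higher than the same chunks ranked last.
print(context_precision([1, 0, 1]))  # roughly 5/6
print(context_precision([0, 1, 1]))  # roughly 7/12
print(context_precision([0, 0, 0]))  # 0.0
```

Because of the epsilon in the denominator, a perfect ranking scores slightly below 1.0 rather than exactly 1.0.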

In [14]:
from libs.custom_ragas import (
    evaluate,
    AnswerRelevancy, 
    Faithfulness, 
    ContextRecall,
    ContextPrecision
)

llm_id = "anthropic.claude-3-5-sonnet-20240620-v1:0"
emb_id = "amazon.titan-embed-text-v2:0"
region = "us-west-2"

metrics = [AnswerRelevancy, Faithfulness, ContextRecall, ContextPrecision]

def map_dataset(example):
    return {
        "user_input": example["question"],
        "retrieved_contexts": example["retrieved_contexts"],
        "referenced_contexts": example["original_contexts"],
        "response": example["answer"],
        "reference": example["ground_truth"]
    }

dataset = updated_dataset.map(map_dataset)
results = evaluate(dataset, metrics, llm_id, emb_id, region)

print("Average Scores:")
print(results['average_scores'])

print("\nDetailed Results:")
for row in results['detailed_results']:
    print(row)


AnswerRelevancy - Row 1: Score = 0.9479375190746863
Faithfulness - Row 1: Score = 0.8
ContextRecall - Row 1: Score = 0.0
ContextPrecision - Row 1: Score = 0.99999999995
AnswerRelevancy - Row 2: Score = 0.0
Faithfulness - Row 2: Score = 0.0
ContextRecall - Row 2: Score = 0.0
ContextPrecision - Row 2: Score = 0.0
AnswerRelevancy - Row 3: Score = 0.0
Faithfulness - Row 3: Score = 0.8
ContextRecall - Row 3: Score = 1.0
ContextPrecision - Row 3: Score = 0.99999999995
AnswerRelevancy - Row 4: Score = 0.9774043577012451
Faithfulness - Row 4: Score = 0.6666666666666666
ContextRecall - Row 4: Score = 1.0
ContextPrecision - Row 4: Score = 0.8333333332916666
AnswerRelevancy - Row 5: Score = 0.0
Faithfulness - Row 5: Score = 0.8333333333333334
ContextRecall - Row 5: Score = 0.0
ContextPrecision - Row 5: Score = 0.0
AnswerRelevancy - Row 6: Score = 0.9393254721491061
Faithfulness - Row 6: Score = 1.0
ContextRecall - Row 6: Score = 1.0
ContextPrecision - Row 6: Score = 0.5833333333041666
AnswerRelev

In [15]:
json_results = {
    'average_scores': results['average_scores'],
    'detailed_results': results['detailed_results']
}

json_filename = "data/sample_ragas_result.json"

with open(json_filename, 'w', encoding='utf-8') as f:
    json.dump(json_results, f, ensure_ascii=False, indent=4)

print(f"Results saved to {json_filename}")
print(json_results)


Results saved to data/sample_ragas_result.json
{'average_scores': {'AnswerRelevancy': 0.6499562679769817, 'Faithfulness': 0.76, 'ContextRecall': 0.5, 'ContextPrecision': 0.4249999999745834}, 'detailed_results': [{'row': 1, 'AnswerRelevancy': 0.9479375190746863, 'Faithfulness': 0.8, 'ContextRecall': 0.0, 'ContextPrecision': 0.99999999995}, {'row': 2, 'AnswerRelevancy': 0.0, 'Faithfulness': 0.0, 'ContextRecall': 0.0, 'ContextPrecision': 0.0}, {'row': 3, 'AnswerRelevancy': 0.0, 'Faithfulness': 0.8, 'ContextRecall': 1.0, 'ContextPrecision': 0.99999999995}, {'row': 4, 'AnswerRelevancy': 0.9774043577012451, 'Faithfulness': 0.6666666666666666, 'ContextRecall': 1.0, 'ContextPrecision': 0.8333333332916666}, {'row': 5, 'AnswerRelevancy': 0.0, 'Faithfulness': 0.8333333333333334, 'ContextRecall': 0.0, 'ContextPrecision': 0.0}, {'row': 6, 'AnswerRelevancy': 0.9393254721491061, 'Faithfulness': 1.0, 'ContextRecall': 1.0, 'ContextPrecision': 0.5833333333041666}, {'row': 7, 'AnswerRelevancy': 0.9358363
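
To double-check a saved results file, the averages can be recomputed from the per-row entries. This is a minimal sketch that assumes the structure printed above (a list of dicts with a `row` key plus one key per metric) and a plain arithmetic mean. How `libs/custom_ragas.py` treats missing or NaN scores is not shown in this notebook, so `recompute_averages` is a hypothetical helper, not the library's own aggregation.

```python
def recompute_averages(detailed_results):
    """Average each metric across rows, ignoring the `row` identifier."""
    totals, counts = {}, {}
    for entry in detailed_results:
        for name, score in entry.items():
            if name == "row":
                continue
            totals[name] = totals.get(name, 0.0) + score
            counts[name] = counts.get(name, 0) + 1
    return {name: totals[name] / counts[name] for name in totals}


# The first two rows from the output above, used as sample input.
sample = [
    {"row": 1, "AnswerRelevancy": 0.9479375190746863, "Faithfulness": 0.8,
     "ContextRecall": 0.0, "ContextPrecision": 0.99999999995},
    {"row": 2, "AnswerRelevancy": 0.0, "Faithfulness": 0.0,
     "ContextRecall": 0.0, "ContextPrecision": 0.0},
]
print(recompute_averages(sample))
```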