## Generating Synthetic Test Datasets

**Why use Synthetic Test Datasets?**

Evaluating the performance of Retrieval-Augmented Generation (RAG) pipelines is crucial.

However, manually creating hundreds of QA samples (question, answer, and context) from documents is time-consuming and labor-intensive. Hand-written questions may also fall short of the complexity needed for a thorough evaluation, which lowers the quality of the assessment.

Using synthetic data generation can reduce developer time in the data aggregation process **by up to 90%**.


In [3]:
! pip install pdfplumber


Collecting pdfplumber
  Downloading pdfplumber-0.11.5-py3-none-any.whl.metadata (42 kB)
Collecting pdfminer.six==20231228 (from pdfplumber)
  Using cached pdfminer.six-20231228-py3-none-any.whl.metadata (4.2 kB)
Collecting pypdfium2>=4.18.0 (from pdfplumber)
  Downloading pypdfium2-4.30.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (48 kB)
Downloading pdfplumber-0.11.5-py3-none-any.whl (59 kB)
Using cached pdfminer.six-20231228-py3-none-any.whl (5.6 MB)
Downloading pypdfium2-4.30.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.9 MB)
Installing collected packages: pypdfium2, pdfminer.six, pdfplumber
Successfully installed pdfminer.six-20231228 pdfplumber-0.11.5 pypdfium2-4.30.1


In [8]:
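import os
from typing import Any, Dict, Optional

import pdfplumber


class PDFLoader:
    """Extract text and basic metadata from a page range of a PDF file."""

    def __init__(self, file_path: str, start_page: Optional[int] = None, end_page: Optional[int] = None):
        # start_page and end_page are 1-based and inclusive; None means first / last page.
        self.file_path = file_path
        self.start_page = start_page
        self.end_page = end_page

    def load(self) -> Dict[str, Any]:
        page_texts = []
        metadata = {}

        with pdfplumber.open(self.file_path) as pdf:
            total_pages = len(pdf.pages)

            # Convert the 1-based page range to 0-based indices, clamped to the document length.
            start = max((self.start_page or 1) - 1, 0)
            end = min(self.end_page or total_pages, total_pages)

            for page_num in range(start, end):
                # Guard against pages without a text layer, where no text is extracted.
                text = pdf.pages[page_num].extract_text() or ""
                page_texts.append(text)

            metadata = {
                "source": self.file_path,
                "filename": os.path.basename(self.file_path),
                "total_pages": total_pages,
                "extracted_pages": f"{start + 1}-{end}"
            }

            # Copy simple PDF metadata fields (title, author, ...) next to our own keys.
            for key, value in pdf.metadata.items():
                if isinstance(value, (str, int)):
                    metadata[key] = value

        return {
            "page_content": "\n".join(page_texts).strip(),
            "metadata": metadata
        }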

## Document Used for Practice

Amazon Bedrock Manual Documentation (https://docs.aws.amazon.com/bedrock/latest/userguide/)

- Link: https://d1jp7kj5nqor8j.cloudfront.net/bedrock-manual.pdf
- File name: `bedrock-manual.pdf`

_Please copy the downloaded file to the data folder for the practice session_

In [9]:
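# The path follows the file name listed above; adjust it if you saved the PDF under a different name.
loader = PDFLoader("data/bedrock-manual.pdf", start_page=19, end_page=100)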
docs = loader.load()

## Document Preprocessing

In [10]:
import re
from typing import List, Optional

def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100, separators: Optional[List[str]] = None) -> List[str]:
    separators = separators or ["\n\n", "\n", " ", ""]

    def _split_text_recursive(text: str, separators: List[str]) -> List[str]:
        if not separators:
            return [text]

        separator = separators[0]
        splits = re.split(f"({re.escape(separator)})", text)
        splits = ["".join(splits[i:i+2]) for i in range(0, len(splits), 2)]

        final_chunks = []
        current_chunk = ""

        for split in splits:
            if len(current_chunk) + len(split) <= chunk_size:
                current_chunk += split
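            else:
                if current_chunk:
                    final_chunks.append(current_chunk)
                if len(split) > chunk_size:
                    # The piece alone exceeds chunk_size: split it further with the next separator.
                    subsplits = _split_text_recursive(split, separators[1:])
                    final_chunks.extend(subsplits)
                    # Reset so the chunk flushed above is not appended a second time.
                    current_chunk = ""
                else:
                    current_chunk = split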

        if current_chunk:
            final_chunks.append(current_chunk)

        return final_chunks

    chunks = _split_text_recursive(text, separators)

    if chunk_overlap > 0:
        overlapped_chunks = []
        for i, chunk in enumerate(chunks):
            if i == 0:
                overlapped_chunks.append(chunk)
            else:
                overlap_text = chunks[i-1][-chunk_overlap:]
                overlapped_chunks.append(overlap_text + chunk)
        chunks = overlapped_chunks

    return chunks
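
The overlap step at the end of `split_text` can be looked at in isolation. The sketch below is a minimal, standalone illustration of that step only; `add_overlap` and the toy chunks are hypothetical names used for the demo, not part of the notebook.

```python
from typing import List


def add_overlap(chunks: List[str], chunk_overlap: int) -> List[str]:
    """Prefix every chunk except the first with the tail of the previous chunk."""
    if chunk_overlap <= 0 or not chunks:
        return list(chunks)
    overlapped = [chunks[0]]
    for prev, cur in zip(chunks, chunks[1:]):
        # Take the last `chunk_overlap` characters of the previous (original) chunk.
        overlapped.append(prev[-chunk_overlap:] + cur)
    return overlapped


toy_chunks = ["abcdef", "ghijkl", "mnopqr"]
overlapped = add_overlap(toy_chunks, 2)
print(overlapped)  # ['abcdef', 'efghijkl', 'klmnopqr']
```

Because the overlap is added after splitting, an overlapped chunk can be up to `chunk_overlap` characters longer than the chunk it was built from.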

In [11]:
chunks = split_text(docs['page_content'], 1000, 0)
len(chunks)

89

In [12]:
chunks_with_metadata = []
for i, chunk in enumerate(chunks):
    chunks_with_metadata.append({
        'content': chunk,
        'metadata': {
            'chunk_id': i,
            'filename': docs['metadata'].get('filename', 'unknown')
        }
    })

In [13]:
chunks_with_metadata[0]

{'content': "Amazon Bedrock User Guide\nWhat is Amazon Bedrock?\nAmazon Bedrock is a fully managed service that makes high-performing foundation models (FMs)\nfrom leading AI companies and Amazon available for your use through a unified API. You can\nchoose from a wide range of foundation models to find the model that is best suited for your use\ncase. Amazon Bedrock also offers a broad set of capabilities to build generative AI applications with\nsecurity, privacy, and responsible AI. Using Amazon Bedrock, you can easily experiment with and\nevaluate top foundation models for your use cases, privately customize them with your data using\ntechniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that\nexecute tasks using your enterprise systems and data sources.\nWith Amazon Bedrock's serverless experience, you can get started quickly, privately customize\nfoundation models with your own data, and easily and securely integrate and deploy them into\n",
 '

## Test Q&A Dataset Generation

In [14]:
import boto3
from botocore.config import Config

region = 'us-west-2'
retry_config = Config(
    region_name=region,
    retries={"max_attempts": 10, "mode": "standard"}
)
boto3_client = boto3.client("bedrock-runtime", region_name=region, config=retry_config)

In [15]:
import random
import json
from time import sleep

def converse_with_bedrock(model_id, sys_prompt, usr_prompt):
    temperature = 0.5
    top_p = 0.9
    inference_config = {"temperature": temperature, "topP": top_p}
    response = boto3_client.converse(
        modelId=model_id,
        messages=usr_prompt, 
        system=sys_prompt,
        inferenceConfig=inference_config,
    )
    return response

def create_prompt(sys_template, user_template):
    sys_prompt = [{"text": sys_template}]
    usr_prompt = [{"role": "user", "content": [{"text": user_template}]}]
    return sys_prompt, usr_prompt

def get_context_chunks(chunks_with_metadata, start_id):
    context_chunks = [
        chunks_with_metadata[start_id]['content'],
        chunks_with_metadata[start_id + 1]['content'],
        chunks_with_metadata[start_id + 2]['content']
    ]
    return " ".join(context_chunks)
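
`get_context_chunks` indexes three consecutive chunks directly, so it raises an `IndexError` if `start_id` is one of the last two positions. A possible variant, sketched here under the assumption that a shorter context near the end of the document is acceptable, uses slicing instead. `get_context_window` and its `window` parameter are hypothetical names, not part of the notebook.

```python
from typing import Any, Dict, List


def get_context_window(chunks: List[Dict[str, Any]], start_id: int, window: int = 3) -> str:
    """Join up to `window` consecutive chunk contents, starting at start_id."""
    if not 0 <= start_id < len(chunks):
        raise IndexError(f"start_id {start_id} is out of range for {len(chunks)} chunks")
    # Slicing stops at the end of the list instead of raising.
    selected = chunks[start_id:start_id + window]
    return " ".join(item["content"] for item in selected)


toy = [{"content": f"chunk-{i}"} for i in range(4)]
print(get_context_window(toy, 0))  # chunk-0 chunk-1 chunk-2
print(get_context_window(toy, 3))  # chunk-3
```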

### Tool Use 

The LLM generates each Q&A pair as a tool call, so the output conforms to the schema described in the tool use configuration.

In [16]:
tool_config = {
    "tools": [
        {
            "toolSpec": {
                "name": "QuestionAnswerGenerator",
                "description": "Generates questions and answers based on the given context.",
                "inputSchema": {
                    "json": {
                        "type": "object",
                        "properties": {
                            "question": {
                                "type": "string",
                                "description": "The generated question"
                            },
                            "answer": {
                                "type": "string",
                                "description": "The answer to the generated question"
                            }
                        },
                        "required": ["question", "answer"]
                    }
                }
            }
        }
    ]
}
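
With the configuration above, the model decides on its own whether to call the tool. The Converse API also accepts a `toolChoice` entry inside the tool configuration that requests a specific tool. The sketch below shows one way to add it, assuming the chosen model supports forced tool choice; `with_forced_tool` and `demo_tool_config` are hypothetical names used only for this illustration.

```python
import copy
from typing import Any, Dict


def with_forced_tool(tool_config: Dict[str, Any], tool_name: str) -> Dict[str, Any]:
    """Return a copy of tool_config that asks the model to call tool_name."""
    known_names = [tool["toolSpec"]["name"] for tool in tool_config["tools"]]
    if tool_name not in known_names:
        raise ValueError(f"{tool_name!r} is not defined in tool_config: {known_names}")
    forced = copy.deepcopy(tool_config)  # leave the caller's config untouched
    forced["toolChoice"] = {"tool": {"name": tool_name}}
    return forced


# Trimmed stand-in for the tool_config defined above.
demo_tool_config = {
    "tools": [
        {
            "toolSpec": {
                "name": "QuestionAnswerGenerator",
                "description": "Generates questions and answers based on the given context.",
                "inputSchema": {"json": {"type": "object", "properties": {}, "required": []}},
            }
        }
    ]
}

forced_config = with_forced_tool(demo_tool_config, "QuestionAnswerGenerator")
print(forced_config["toolChoice"])
```

The result can be passed as `toolConfig` in place of the original dictionary. Whether forcing the tool is desirable depends on the model and on how often it answers in plain text instead.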

In [17]:
def converse_with_bedrock_tools(sys_prompt, usr_prompt, tool_config):
    temperature = 0.0
    top_p = 0.1
    top_k = 1
    inference_config = {"temperature": temperature, "topP": top_p}
    additional_model_fields = {"top_k": top_k}
    response = boto3_client.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        messages=usr_prompt,
        system=sys_prompt,
        inferenceConfig=inference_config,
        additionalModelRequestFields=additional_model_fields,
        toolConfig=tool_config
    )
    return response

def parse_tool_use(message):
    stop_reason = message['stopReason']

    if stop_reason == 'tool_use':
        tool_requests = message['output']['message']['content']
        for tool_request in tool_requests:
            if 'toolUse' in tool_request:
                tool = tool_request['toolUse']

                if tool['name'] == 'QuestionAnswerGenerator':
                    return tool['input']
    return None
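
To see what `parse_tool_use` expects, it helps to run the same logic on a hand-written response. The sketch below uses a mock dictionary that mirrors the shape of a Converse response with a tool call; the values are invented for illustration and `extract_tool_input` is a hypothetical standalone equivalent of the function above.

```python
from typing import Any, Dict, Optional


def extract_tool_input(response: Dict[str, Any], tool_name: str) -> Optional[Dict[str, Any]]:
    """Return the input of the first matching toolUse block, or None."""
    if response.get("stopReason") != "tool_use":
        return None
    for block in response["output"]["message"]["content"]:
        tool = block.get("toolUse")  # text blocks have no "toolUse" key
        if tool and tool["name"] == tool_name:
            return tool["input"]
    return None


# Hand-written mock, not real model output.
mock_response = {
    "stopReason": "tool_use",
    "output": {
        "message": {
            "role": "assistant",
            "content": [
                {"text": "Calling the tool."},
                {
                    "toolUse": {
                        "toolUseId": "tooluse_example",
                        "name": "QuestionAnswerGenerator",
                        "input": {"question": "Example question?", "answer": "Example answer."},
                    }
                },
            ],
        }
    },
}

qa = extract_tool_input(mock_response, "QuestionAnswerGenerator")
print(qa)  # {'question': 'Example question?', 'answer': 'Example answer.'}
```

When the model replies with plain text instead of a tool call, `stopReason` is not `tool_use` and the function returns `None`, which is why the generation loop checks the result before saving it.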

## Q&A Dataset Generation Instruction

- `simple`: questions that can be answered directly from the given context
- `complex`: questions whose answers require reasoning over the given context

_Tailor the system and user prompts to your own dataset._

Each generated Q&A pair is appended to `data/sample_qa_dataset.jsonl`, the default `output_file`.

In [18]:
def generate_qa_dataset(chunks, num_pairs=5, output_file="data/sample_qa_dataset.jsonl"):
    total_chunks = len(chunks)
    dataset = []

    for i in range(num_pairs):
        start_id = random.randint(0, total_chunks - 3)
        # Use the chunks passed to this function rather than the global variable.
        context = get_context_chunks(chunks, start_id)

        if i % 2 == 0:
            sys_template = """
            You are an expert at generating practical questions based on given documentation.
            Your task is to generate complex, reasoning questions and answers.

            Follow these rules:
            1. Generate questions that reflect real user information needs related to the document's subject matter (e.g., technical docs : feature availability, implementation details)
            2. Ensure questions are relevant, concise, preferably under 25 words, and fully answerable with the provided information
            3. Focus on extracting key information that users are likely to seek, while avoiding narrow or less important questions.
            4. When provided with code blocks, focus on understanding the overall functionality rather than the specific syntax or variables. Feel free to request examples of how to use key APIs or features.
            5. Do not use phrases like 'based on the provided context' or 'according to the context'.
            """
            question_type = "complex"
        else:
            sys_template = """
            You are an expert at generating practical questions based on given documentation.
            Your task is to create simple, directly answerable questions from the given context.

            Follow these rules:
            1. Generate questions that reflect real user information needs related to the document's subject matter (e.g., technical docs : feature availability, implementation details)
            2. Ensure questions are relevant, concise, preferably under 10 words, and fully answerable with the provided information
            3. Focus on extracting key information that users are likely to seek, while avoiding narrow or less important questions.
            4. When provided with code blocks, focus on understanding the overall functionality rather than the specific syntax or variables. Feel free to request examples of how to use key APIs or features.
            5. Do not use phrases like 'based on the provided context' or 'according to the context'.
            """
            question_type = "simple"

        user_template = f"""
        Generate a {question_type} question and its answer based on the following context:

        Context: {context}

        Use the QuestionAnswerGenerator tool to provide the output.
        """

        sys_prompt, user_prompt = create_prompt(sys_template, user_template)
        response = converse_with_bedrock_tools(sys_prompt, user_prompt, tool_config)
        qa_data = parse_tool_use(response)

        if qa_data:
            qa_item = {
                "question": qa_data["question"],
                "ground_truth": qa_data["answer"],
                "question_type": question_type,
                "contexts": context
            }

            print(qa_item)

            with open(output_file, 'a') as f:
                json.dump(qa_item, f)
                f.write('\n')

            dataset.append(qa_item)

        sleep(5)

    return dataset

{'question': 'What command lists available foundation models in Amazon Bedrock?', 
'ground_truth': 'The command to list available foundation models in Amazon Bedrock is: aws bedrock list-foundation-models --region us-east-1', 
'question_type': 'simple', 
'contexts': 'This section guides you through trying out some common operations in Amazon Bedrock using the\nAWS CLI to test that your permissions and authentication are set up properly. Before you run the\nfollowing examples, you should check that you have fulfilled the following prerequisites:\nPrerequisites\n• You have an AWS account and a user or role with authentication set up and the necessary\npermissions for Amazon Bedrock. Otherwise, follow the steps at Getting started with the API.\n• You\'ve requested access to the Amazon Titan Text G1 - Express model. Otherwise, follow the\nsteps at Request access to an Amazon Bedrock foundation model.\nRequest access to Amazon Bedrock models 18\nAmazon Bedrock User Guide\n• You\'ve installed and set up authentication for the AWS CLI. To install the CLI, follow the steps at\nInstall or update to the latest version of the AWS CLI. Verify that you\'ve set up your credentials to\nuse the CLI by following the steps at Get credentials to grant programmatic access.\n Test that your permissions are set up properly for Amazon Bedrock, using a user or role that you set\nup with the proper permissions.\nTopics\n• List the foundation models that Amazon Bedrock has to offer\n• Submit a text prompt to a model and generate a text response with InvokeModel\n• Submit a text prompt to a model and generate a text response with Converse\nList the foundation models that Amazon Bedrock has to offer\nThe following example runs the ListFoundationModels operation using an Amazon Bedrock\nendpoint. ListFoundationModels lists the foundation models (FMs) that are available in\nAmazon Bedrock in your region. 
In a terminal, run the following command:\naws bedrock list-foundation-models --region us-east-1\nIf the command is successful, the response returns a list of foundation models that are available in\nAmazon Bedrock.\nSubmit a text prompt to a model and generate a text response with InvokeModel\nThe following example runs the InvokeModel operation using an Amazon Bedrock runtime\n endpoint. InvokeModel lets you submit a prompt to generate a model response. In a terminal, run\nthe following command:\naws bedrock-runtime invoke-model \\\n--model-id amazon.titan-text-express-v1 \\\n--body \'{"inputText": "Describe the purpose of a \\"hello world\\" program in one line.",\n"textGenerationConfig" : {"maxTokenCount": 512, "temperature": 0.5, "topP": 0.9}}\' \\\n--cli-binary-format raw-in-base64-out \\\ninvoke-model-output-text.txt\nRun examples with the AWS CLI 19\nAmazon Bedrock User Guide\nIf the command is successful, the response generated by the model is written to the invoke-\nmodel-output-text.txt file. The text response is returned in the outputText field, alongside\naccompanying information.\nSubmit a text prompt to a model and generate a text response with Converse\nThe following example runs the Converse operation using an Amazon Bedrock runtime endpoint.\nConverse lets you submit a prompt to generate a model response. We recommend using\n'}

{'question': 'What are the key steps to set up and manage access for Amazon Bedrock in an existing AWS account, and how does this process differ from setting up access for a new administrative user?', 
'ground_truth': 'For an existing AWS account, the key steps to set up and manage access for Amazon Bedrock are:\n\n1. Create an IAM role with the AmazonBedrockFullAccess managed policy.\n2. Create a custom policy to manage access to Amazon Bedrock models, including marketplace actions like ViewSubscriptions, Unsubscribe, and Subscribe.\n3. Attach the custom policy to the Amazon Bedrock role.\n4. Add users to the Amazon Bedrock role and grant them permissions to switch to this role.\n\nThis process differs from setting up access for a new administrative user in the following ways:\n- For new users, you would configure access using the default IAM Identity Center directory.\n- New administrative users receive a sign-in URL for the AWS access portal.\n- Existing account setup focuses on creating and managing IAM roles, while new user setup involves creating IAM Identity Center users.\n- The existing account method provides more granular control over permissions through custom policies, whereas the new user method relies more on predefined access levels within IAM Identity Center.', 
'question_type': 'complex', 
'contexts': 'Configure user access with the default IAM Identity Center directory in the AWS IAM Identity\nCenter User Guide.\nSign in as the user with administrative access\n• To sign in with your IAM Identity Center user, use the sign-in URL that was sent to your email\naddress when you created the IAM Identity Center user.\nFor help signing in using an IAM Identity Center user, see Signing in to the AWS access portal in\nthe AWS Sign-In User Guide.\nTo learn more about IAM, see Identity and access management for Amazon Bedrock and the IAM\nUser Guide.\nAfter you have created an administrative user, proceed to I already have an AWS account to set up\npermissions for Amazon Bedrock.\nI already have an AWS account\nUse IAM to create a role for with the necessary permissions to use Amazon Bedrock. You can then\nadd users to this role to grant the permissions.\nI already have an AWS account 7\nAmazon Bedrock User Guide\nTo create an Amazon Bedrock role\n 1. Create a role with a name of your choice by following the steps at Creating a role to delegate\npermissions to an IAM user in the IAM User Guide. When you reach the step to attach a policy\nto the role, attach the AmazonBedrockFullAccess AWS managed policy.\n2. Create a new policy to allow your role to manage access to Amazon Bedrock models. From the\nfollowing list, select the link that corresponds to your method of choice and follow the steps.\nUse the following JSON object as the policy.\n• Creating IAM policies (console)\n• Creating IAM policies (AWS CLI)\n• Creating IAM policies (AWS API)\n{\n"Version": "2012-10-17",\n"Statement": [\n{\n"Sid": "MarketplaceBedrock",\n"Effect": "Allow",\n"Action": [\n"aws-marketplace:ViewSubscriptions",\n"aws-marketplace:Unsubscribe",\n"aws-marketplace:Subscribe"\n],\n"Resource": "*"\n}\n]\n}\n3. 
Attach the policy that you created in the last step to your Amazon Bedrock role by following\nthe steps at Adding and removing IAM identity permissions.\n To add users to the Amazon Bedrock role\n1. For users to access an IAM role, you must add them to the role. You can add both users in your\naccount or from other accounts. To grant users permissions to switch to the Amazon Bedrock\nrole that you created, follow the steps at Granting a user permissions to switch roles and\nspecify the Amazon Bedrock role as the Resource.\nI already have an AWS account 8\nAmazon Bedrock User Guide\nNote\nIf you need to create more users in your account so that you can give them access\nto the Amazon Bedrock role, follow the steps in Creating an IAM user in your AWS\naccount.\n2. After you\'ve granted a user permissions to use the Amazon Bedrock role, provide the user\nwith role name and ID or alias of the account to which the role belongs. Then, guide the user\nthrough how to switch to the role by following the instructions at Providing information to the\nuser.\nRequest access to an Amazon Bedrock foundation model\n'}
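
Each record is written as one JSON object per line, so the file can be read back line by line for later evaluation. The sketch below is one way to do that; `load_qa_dataset` is a hypothetical helper, and the demo writes two invented records to a temporary file rather than touching the real dataset.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def load_qa_dataset(path: str) -> List[Dict[str, Any]]:
    """Read a JSONL file into a list of dictionaries, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Invented records with the same keys as the generated items.
sample = [
    {"question": "Q1?", "ground_truth": "A1", "question_type": "simple", "contexts": "ctx 1"},
    {"question": "Q2?", "ground_truth": "A2", "question_type": "complex", "contexts": "ctx 2"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_qa.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        for record in sample:
            f.write(json.dumps(record) + "\n")
    loaded = load_qa_dataset(demo_path)

print(len(loaded), [item["question_type"] for item in loaded])  # 2 ['simple', 'complex']
```

Note that `generate_qa_dataset` opens the output file in append mode, so repeated runs accumulate records in the same file.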


In [19]:
generate_qa_dataset(chunks_with_metadata, 50)

{'question': 'What are the key steps to set up and manage access for Amazon Bedrock in an existing AWS account, and how does this process differ from setting up access for a new administrative user?', 'ground_truth': 'For an existing AWS account, the key steps to set up and manage access for Amazon Bedrock are:\n\n1. Create an IAM role with the AmazonBedrockFullAccess managed policy.\n2. Create a custom policy to manage access to Amazon Bedrock models, including marketplace actions like ViewSubscriptions, Unsubscribe, and Subscribe.\n3. Attach the custom policy to the Amazon Bedrock role.\n4. Add users to the Amazon Bedrock role and grant them permissions to switch to this role.\n\nThis process differs from setting up access for a new administrative user in the following ways:\n- For new users, you would configure access using the default IAM Identity Center directory.\n- New administrative users receive a sign-in URL for the AWS access portal.\n- Existing account setup focuses on crea

[{'question': 'What are the key steps to set up and manage access for Amazon Bedrock in an existing AWS account, and how does this process differ from setting up access for a new administrative user?',
  'ground_truth': 'For an existing AWS account, the key steps to set up and manage access for Amazon Bedrock are:\n\n1. Create an IAM role with the AmazonBedrockFullAccess managed policy.\n2. Create a custom policy to manage access to Amazon Bedrock models, including marketplace actions like ViewSubscriptions, Unsubscribe, and Subscribe.\n3. Attach the custom policy to the Amazon Bedrock role.\n4. Add users to the Amazon Bedrock role and grant them permissions to switch to this role.\n\nThis process differs from setting up access for a new administrative user in the following ways:\n- For new users, you would configure access using the default IAM Identity Center directory.\n- New administrative users receive a sign-in URL for the AWS access portal.\n- Existing account setup focuses on c