Let's start by creating a [Knowledge Base for Amazon Bedrock](https://aws.amazon.com/bedrock/knowledge-bases/) 
to provide knowledge about mortgages. In this notebook, we create a knowledge base from the content in the `mortgage_dataset` folder. The Mortgage assistant agent will use this knowledge base to answer general questions about mortgages.

Step 1: Install and import the libraries required

In [None]:
# Dependencies are managed by uv via pyproject.toml
# Run 'uv sync' in terminal to install all dependencies
# Update the kernel to point to the uv environment created as part of the prerequisites
print("Dependencies installed via uv sync")

In [None]:
import os
import time
import boto3
import logging
import botocore
import json
from textwrap import dedent

%load_ext autoreload
%autoreload 2

In the following cell, we add the parent directory to the Python path so that `knowledge_base_helper` can be imported. This helper provides functionality for creating the knowledge base if it does not already exist.



In [None]:
import sys
sys.path.insert(0, '..') 


from src.utils.knowledge_base_helper import KnowledgeBasesForAmazonBedrock

kb = KnowledgeBasesForAmazonBedrock()

Create boto3 clients

In [None]:
s3_client = boto3.client('s3')
sts_client = boto3.client('sts')
bedrock_agent_runtime_client = boto3.client('bedrock-agent-runtime')

Get the region and build the bucket name. The bucket will be created if it's not already present.

In [None]:
region = boto3.session.Session().region_name
account_id = sts_client.get_caller_identity()["Account"]
suffix = f"{region}-{account_id}"
bucket_name = f'agentcore-workshop-{suffix}'

In [None]:
agent_foundation_model = ["us.anthropic.claude-3-7-sonnet-20250219-v1:0"]

### Create Knowledge Base 
We will now create the knowledge base with Amazon OpenSearch Serverless as the vector store. To do so, we will use the helper class `KnowledgeBasesForAmazonBedrock`, which creates the knowledge base and all of its prerequisites:
1. IAM roles and policies
2. S3 bucket
3. Amazon OpenSearch Serverless encryption, network and data access policies
4. Amazon OpenSearch Serverless collection
5. Amazon OpenSearch Serverless vector index
6. Knowledge Base
7. Knowledge Base data source

This might take a few minutes, so have a break!

In [None]:
knowledge_base_name = "mortgage-agent-kb-test1"

knowledge_base_description = "KB containing information on mortgages"

In [None]:
%%time
kb_id, ds_id = kb.create_or_retrieve_knowledge_base(
    knowledge_base_name,
    knowledge_base_description,
    bucket_name
)

print(f"Knowledge Base ID: {kb_id}")
print(f"Data Source ID: {ds_id}")
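
Before moving on, it can help to confirm that the knowledge base is ready. The following is a minimal sketch and is not part of the helper class: `get_kb_status` is a hypothetical function name, and it assumes a `bedrock-agent` control-plane client is passed in.

```python
def get_kb_status(bedrock_agent_client, kb_id):
    """Return the status string reported for a knowledge base."""
    # Assumes a client created with boto3.client("bedrock-agent"),
    # the control-plane API (not "bedrock-agent-runtime").
    response = bedrock_agent_client.get_knowledge_base(knowledgeBaseId=kb_id)
    return response["knowledgeBase"]["status"]
```

For example, `get_kb_status(boto3.client("bedrock-agent"), kb_id)` should report `ACTIVE` once creation has finished.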


In [None]:
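# Function to upload a file to an S3 bucket
import os
import boto3

def upload_file_to_s3(file_path, bucket_name, object_key=None):
    """Upload a file to an S3 bucket, creating the bucket if it does not exist."""
    s3_client = boto3.client('s3')

    # Check if the bucket exists, create it if not
    existing_buckets = [bucket['Name'] for bucket in s3_client.list_buckets()['Buckets']]
    if bucket_name not in existing_buckets:
        bucket_region = s3_client.meta.region_name
        if bucket_region == 'us-east-1':
            # us-east-1 does not accept an explicit LocationConstraint
            s3_client.create_bucket(Bucket=bucket_name)
        else:
            # Every other region requires the LocationConstraint
            s3_client.create_bucket(
                Bucket=bucket_name,
                CreateBucketConfiguration={'LocationConstraint': bucket_region},
            )

    # Default the object key to the file name
    if object_key is None:
        object_key = os.path.basename(file_path)

    s3_client.upload_file(file_path, bucket_name, object_key)
    return f"s3://{bucket_name}/{object_key}"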

In [None]:
upload_file_to_s3("mortgage_dataset/15-Year vs. 30-Year Mortgage What's the Difference .html", bucket_name,"15-Year vs. 30-Year Mortgage What's the Difference .html") 

In [None]:
upload_file_to_s3("mortgage_dataset/Mortgage Refinancing When Does It Make Sense .html", bucket_name,"Mortgage Refinancing When Does It Make Sense .html")
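
The two upload cells above repeat the same call for each file. If the folder grows, a loop avoids the repetition. Here is a minimal sketch, assuming every `.html` file in `mortgage_dataset` should be ingested; `upload_folder` is a hypothetical helper, and the upload function is passed in as an argument so the sketch itself does not depend on AWS access.

```python
from pathlib import Path

def upload_folder(folder, bucket_name, upload_fn, pattern="*.html"):
    """Upload every file in `folder` matching `pattern`; return the S3 URIs."""
    uris = []
    for path in sorted(Path(folder).glob(pattern)):
        # Use the bare file name as the object key, as the cells above do
        uris.append(upload_fn(str(path), bucket_name, path.name))
    return uris
```

In this notebook the call would look like `upload_folder("mortgage_dataset", bucket_name, upload_file_to_s3)`.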

Now we ingest the documents. Ingestion chunks the source documents and stores an embedding for each chunk in the underlying knowledge base vector store. For a simple example like this one, ingestion takes a couple of minutes.

In [None]:
%%time
# Start an ingestion job to synchronize data
kb.synchronize_data(kb_id, ds_id)
print('KB synchronization completed\n')
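
To see what the ingestion job actually did, the status and document counts of the most recent job can be read back. This is a minimal sketch, assuming a `bedrock-agent` client is passed in; `latest_ingestion_summary` is a hypothetical helper name and is not part of `KnowledgeBasesForAmazonBedrock`.

```python
def latest_ingestion_summary(bedrock_agent_client, kb_id, ds_id):
    """Return status and document counts of the most recent ingestion job, or None."""
    response = bedrock_agent_client.list_ingestion_jobs(
        knowledgeBaseId=kb_id,
        dataSourceId=ds_id,
        # Newest job first, and only that one
        sortBy={"attribute": "STARTED_AT", "order": "DESCENDING"},
        maxResults=1,
    )
    jobs = response.get("ingestionJobSummaries", [])
    if not jobs:
        return None
    job = jobs[0]
    stats = job.get("statistics", {})
    return {
        "status": job["status"],
        "scanned": stats.get("numberOfDocumentsScanned"),
        "failed": stats.get("numberOfDocumentsFailed"),
    }
```

A non-zero `failed` count is a hint that one of the uploaded files could not be parsed or embedded.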

### Test the Knowledge Base
Now that the Knowledge Base is available, we can test it using the [**retrieve**](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-agent-runtime/client/retrieve.html) and [**retrieve_and_generate**](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-agent-runtime/client/retrieve_and_generate.html) functions.

#### Testing Knowledge Base with Retrieve and Generate API

Let's first test the knowledge base using the retrieve and generate API. With this API, Bedrock takes care of retrieving the necessary references from the knowledge base and generating the final answer using a Bedrock LLM.

In [None]:
f"arn:aws:bedrock:{region}:{account_id}:inference-profile/{agent_foundation_model[0]}"

**Note:** The sync operation may take a few minutes to complete. Until the first sync finishes, calls to `retrieve_and_generate` or `retrieve` will not return an answer. If that happens, wait a few minutes and try again.
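
The "wait and try again" advice can be automated with a small retry loop. This is a minimal sketch under stated assumptions: `call_with_retry` is a hypothetical helper, and the attempt count and delay are illustrative defaults rather than recommended values.

```python
import time

def call_with_retry(fn, is_ready, attempts=5, delay_seconds=60, sleep=time.sleep):
    """Call fn() until is_ready(result) is true or the attempts run out.

    Returns the last result either way, so the caller can inspect it.
    """
    result = None
    for attempt in range(1, attempts + 1):
        result = fn()
        if is_ready(result):
            return result
        # Do not sleep after the final attempt
        if attempt < attempts:
            sleep(delay_seconds)
    return result
```

With the retrieve API, for instance, `fn` could wrap the `retrieve` call and `is_ready` could be `lambda r: bool(r["retrievalResults"])`.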

In [None]:
response = bedrock_agent_runtime_client.retrieve_and_generate(
    input={
        "text": "compare and contrast 15-year vs 30-year mortgage type"
    },
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            'knowledgeBaseId': kb_id,
            "modelArn": f"arn:aws:bedrock:{region}:{account_id}:inference-profile/{agent_foundation_model[0]}",
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults":5
                } 
            }
        }
    }
)

print(response['output']['text'],end='\n'*2)

As you can see, the retrieve and generate API gives us the final response directly, and the printed text does not show the sources used to generate it. Let's now retrieve the source information from the knowledge base with the retrieve API.

#### Testing Knowledge Base with Retrieve API

If you need an extra layer of control, you can retrieve the chunks that best match your query using the retrieve API. In this setup, you can configure the desired number of results and control the final answer with your own application logic. The API returns the matching content, its S3 location, the similarity score, and the chunk metadata.

In [None]:
response_ret = bedrock_agent_runtime_client.retrieve(
    knowledgeBaseId=kb_id,
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "numberOfResults":3,
        } 
    },
    retrievalQuery={
        'text': 'What are the cons of a 15-year mortgage?'
    }
)

def response_print(retrieve_resp):
    # Structure: 'retrievalResults' is a list of chunks, each with content, location, score, and metadata
    for num, chunk in enumerate(retrieve_resp['retrievalResults'], 1):
        print('-----------------------------------------------------------------------------------------')
        print(f'Chunk {num}: ',chunk['content']['text'],end='\n'*2)
        print(f'Chunk {num} Location: ',chunk['location'],end='\n'*2)
        print(f'Chunk {num} Score: ',chunk['score'],end='\n'*2)
        print(f'Chunk {num} Metadata: ',chunk['metadata'],end='\n'*2)

response_print(response_ret)
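
The "own application logic" mentioned above could be as simple as dropping low-scoring chunks and assembling the rest into a prompt. This is a minimal sketch: `select_chunks` and `build_prompt` are hypothetical helpers, and the `0.5` score threshold and the prompt wording are arbitrary assumptions that would need tuning for real data.

```python
def select_chunks(retrieve_response, min_score=0.5, max_chunks=3):
    """Return the text of the best-scoring chunks at or above min_score."""
    results = retrieve_response.get("retrievalResults", [])
    kept = [r for r in results if r.get("score", 0.0) >= min_score]
    # Highest similarity first
    kept.sort(key=lambda r: r.get("score", 0.0), reverse=True)
    return [r["content"]["text"] for r in kept[:max_chunks]]

def build_prompt(question, chunks):
    """Number the chunks and place them ahead of the question."""
    context = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, 1))
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {question}"
```

The resulting prompt can then be sent to any model of your choice, which is the main difference from `retrieve_and_generate`, where Bedrock performs both steps.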

Store the knowledge base ID for subsequent labs

In [None]:
kb_id

Store the knowledge base ID **kb_id** in AWS Systems Manager Parameter Store. The labs that follow will read it from there.

In [None]:

param_name = '/app/mortgage_assistant/agentcore/kb_id'

ssm = boto3.client("ssm")
ssm.put_parameter(Name=param_name, Value=kb_id, Type="String", Overwrite=True)
print(f"Stored {kb_id} in SSM: {param_name}")
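
Later labs can read the value back with `get_parameter`. This is a minimal sketch, assuming an SSM client is passed in; `read_kb_id` is a hypothetical helper name, and the default parameter name is the one used in the cell above.

```python
def read_kb_id(ssm_client, name="/app/mortgage_assistant/agentcore/kb_id"):
    """Return the knowledge base ID stored in Parameter Store."""
    # Assumes a client created with boto3.client("ssm")
    response = ssm_client.get_parameter(Name=name)
    return response["Parameter"]["Value"]
```

Calling `read_kb_id(ssm)` right after the cell above is a quick way to check that the value was stored as expected.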

In this lab, we created a knowledge base that one of the agents will use to answer queries about mortgages.