## Create a Knowledge Base with Fixed Chunking Strategy
#### What will we do in this workshop?
1. Create a Knowledge Base (KB) in the vector database.
2. We will create a data source for the KB. The data source will be the Amazon Science and 10K documents stored in S3.
3. We will ingest the data from S3, use Fixed Chunking to chunk the data, generate vector embeddings, and store the chunks and their corresponding vector embeddings in the KB.
4. We will then ask some questions and query the KB to return some chunks and inspect their relevance scores.
<br>Note: We are not sending the query and its chunks to an LLM in this notebook. We will do that in other notebooks.
![We are generating vector embeddings and storing them in a KB in Vector Database](./Fixed_Chunking.png)

Chunking data is essential. If you add large documents with hundreds of pages to your knowledge base, you need to split them up and return only the relevant sections as context for inference. Returning too much context increases cost (models charge based on input token count) and latency, and it may also harm output quality. Shorter chunks tend to match a query more precisely but may lack the context necessary to answer a question.

Bedrock Knowledge Bases offers a few different chunking strategies to choose from. They range from splitting at semantic boundaries such as paragraphs to splitting along hierarchical structures. However, some document types can benefit from custom chunking. For example, a custom chunking approach can use any form of markup in the document to decide where to split.

You can also create your own custom chunking approach using a Lambda function. If you want to add any custom metadata then you will need to add a Lambda function. You can either handle the chunking yourself, edit an existing chunk or just add metadata. Metadata can then be used for filtering.

It is important to tune your chunking to the type of documents being ingested. Choosing the wrong chunk size will affect accuracy and response times. It will also increase costs in both the vector storage and inference steps. The defaults supplied in Bedrock are pretty good, but they may need to be tailored to your specific circumstances. Longer and more technical documents may need larger chunk sizes to make sure they include more context. Speech (like a chat transcript) can benefit from shorter chunks.

![Chunking Strategies](./chunking-strategies.png)

## Overview

In this notebook, we will implement a knowledge base using a fixed chunking strategy. Here are the key steps we'll perform:

1. **Create a Knowledge Base**: Set up an Amazon Bedrock Knowledge Base with fixed-size chunking configuration that will store and retrieve our vector embeddings.

2. **Create a Data Source**: Connect our Knowledge Base to the documents we uploaded to S3 in the previous notebook.

3. **Start Ingestion Job**: Begin the process of transforming our documents into chunks, creating embeddings, and storing them in our vector database.

4. **Retrieve and Generate**: Test our Knowledge Base by retrieving relevant information based on a sample query.

#### Concept

**Fixed Chunking**: Involves dividing your documents into fixed-size chunks, regardless of the content within them. Each chunk contains a predefined number of tokens or characters, and this method allows for more uniform data organization. 

![How Fixed Sized Chunking Works](./Fixed_how_it_works.png)

Fixed chunking is useful when you want to ensure that your chunks are of a consistent size, making them easier to process and retrieve in a predictable manner. The document is split into sections of equal length, and each section becomes a separate chunk. This method works well when the content is relatively homogeneous, and the chunk boundaries are not as crucial to understanding the underlying context.
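
The idea can be illustrated with a minimal sketch. It assumes whitespace-separated words stand in for tokens (Bedrock uses its own tokenizer, so real chunk boundaries will differ), and `fixed_size_chunks` is a hypothetical name, not a function from `advanced_rag_utils`. The defaults mirror the `maxTokens` and `overlapPercentage` values used later in this notebook.

```python
def fixed_size_chunks(tokens, max_tokens=300, overlap_percentage=20):
    """Split a token list into fixed-size chunks that overlap by a percentage."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_percentage < 100:
        raise ValueError("overlap_percentage must be in [0, 100)")

    # Number of tokens shared between consecutive chunks.
    overlap = max_tokens * overlap_percentage // 100
    # Each new chunk starts this many tokens after the previous one.
    step = max_tokens - overlap

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        # Stop once a chunk has reached the end of the document.
        if start + max_tokens >= len(tokens):
            break
    return chunks


# Assumption: whitespace-separated words are used as stand-in tokens.
words = ("lorem ipsum " * 500).split()  # 1000 stand-in tokens
chunks = fixed_size_chunks(words)
print(len(chunks), [len(c) for c in chunks])
```

With 300-token chunks and 20% overlap, each chunk starts 240 tokens after the previous one, so neighbouring chunks share 60 tokens. The overlap reduces the chance that a sentence cut at a boundary is lost entirely.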

#### Benefits

- **Uniformity**: Each chunk has the same size, making the system more predictable. This helps with processing efficiency since you know that each chunk is of a consistent size, making batch operations and parallel processing easier.
- **Simplified Retrieval**: Since the chunk sizes are uniform, searching through the data becomes straightforward. You can quickly determine the length of chunks, which can be useful for performance optimization and scalability in large datasets.
- **Performance Optimization**: Fixed chunks are ideal when you want to control the computational cost of document retrieval and chunking. Having equal-sized chunks reduces the chance of computational bottlenecks in scenarios requiring large-scale document processing.

> **Note:** While fixed chunking can be efficient for certain use cases, it may not preserve the natural semantic boundaries of the content, such as paragraphs or sections. This may lead to chunks that start or end at arbitrary places, potentially cutting off context in the middle of a sentence or idea.

### **Best Use Cases**
Fixed chunking is suitable for cases where:
- **Homogeneous content**: The content is consistent, and boundaries are not as important.
- **Performance**: You need uniform-sized chunks for predictable processing or optimization of large-scale systems.
- **Simplified text processing**: When chunk boundaries do not need to match natural semantic structures like paragraphs or sentences.

Examples include:
- **General document indexing**: When large datasets are involved, and uniform chunk sizes optimize retrieval.
- **Text summarization**: Fixed chunking is helpful when generating summaries from uniformly sized data pieces.


In [41]:
# Import a module with a few helper functions.
# These functions will help us create a knowledge base (KB), create a data source for the KB, and ingest data into the KB using fixed chunking.

import importlib
import advanced_rag_utils

# Reload module
importlib.reload(advanced_rag_utils)

# Re-import all functions
from advanced_rag_utils import *

from datetime import datetime, timedelta, UTC

notebook_start_time = datetime.now(UTC)

In [42]:
# Let's load the variables we saved in the first notebook. We will use these variables
import json
with open("../variables.json", "r") as f:
    variables = json.load(f)

variables

{'accountNumber': '270597685972',
 'regionName': 'us-west-2',
 'collectionArn': 'arn:aws:aoss:us-west-2:270597685972:collection/3ethft3xms9as2092ulg',
 'collectionId': '3ethft3xms9as2092ulg',
 'vectorIndexName': 'ws-index-',
 'bedrockExecutionRoleArn': 'arn:aws:iam::270597685972:role/advanced-rag-workshop-bedrock_execution_role-us-west-2',
 's3Bucket': '270597685972-us-west-2-advanced-rag-workshop',
 'kbFixedChunk': 'SN9KSOQPOV',
 'kbSemanticChunk': 'KMZYCTNSWW',
 'kbHierarchicalChunk': 'V8EJKFPYTK',
 'kbCustomChunk': 'G8P2D7M28S',
 'sagemakerLLMEndpoint': 'endpoint-llama-3-2-3b-instruct-2025-05-02-18-22-06'}

In [43]:
# Load the dataframe related to costs from a csv file (if it already exists)
df_costs = load_df_from_csv()
df_costs

Loaded existing file: /home/sagemaker-user/sample-advanced-rag-using-bedrock-and-sagemaker/embed_algo_costs.csv


Unnamed: 0,chunking_algo,embedding_seconds,input_tokens,invocation_count,total_token_costs
0,fixed,41.132119,243046,906,0.004861
1,hierarchical,56.287275,295358,1157,0.005907
2,semantic,131.845039,680599,4178,0.013612


### 1. Create a Knowledge Base
Let's specify the chunking strategy, name, and description for the Knowledge Base (KB) and create it.

In [44]:
model_id = "amazon.titan-embed-text-v2:0"
kb_chunking_strategy = "fixed" # ["fixed", "hierarchical", "semantic", "custom"]

In [45]:
kb_name = f"advanced-rag-workshop-{kb_chunking_strategy}-chunking"

kb_description = "Knowledge base using Amazon OpenSearch Service as a vector store"

kb = create_kb(kb_name, kb_description, kb_chunking_strategy, variables, model_id)

{'collectionArn': 'arn:aws:aoss:us-west-2:270597685972:collection/3ethft3xms9as2092ulg', 'vectorIndexName': 'ws-index-fixed', 'fieldMapping': {'vectorField': 'vector', 'textField': 'text', 'metadataField': 'text-metadata'}}
{'collectionArn': 'arn:aws:aoss:us-west-2:270597685972:collection/3ethft3xms9as2092ulg', 'vectorIndexName': 'ws-index-fixed', 'fieldMapping': {'vectorField': 'vector', 'textField': 'text', 'metadataField': 'text-metadata'}}
{'collectionArn': 'arn:aws:aoss:us-west-2:270597685972:collection/3ethft3xms9as2092ulg', 'vectorIndexName': 'ws-index-fixed', 'fieldMapping': {'vectorField': 'vector', 'textField': 'text', 'metadataField': 'text-metadata'}}
Knowledge Base already exists. Retrieving its ID...
Found existing knowledge base with Name: advanced-rag-workshop-fixed-chunking and ID: SN9KSOQPOV
OpenSearch Knowledge Response: {
    "createdAt": "2025-05-02 22:29:24.527448+00:00",
    "description": "Knowledge base using Amazon OpenSearch Service as a vector store",
    "k

### 2. Create Datasource for Knowledge Base

In [46]:
data_source_name = f"advanced-rag-example-{kb_chunking_strategy}"

ds_object = create_data_source_for_kb(kb_chunking_strategy, data_source_name, kb, variables)

Found existing data source 'advanced-rag-example-fixed'. Deleting it...
Waiting for data source deletion to complete...
Data source deleted successfully.
Creating new data source 'advanced-rag-example-fixed' with {'chunkingStrategy': 'FIXED_SIZE', 'fixedSizeChunkingConfiguration': {'maxTokens': 300, 'overlapPercentage': 20}} chunking...
fixed chunking data source created successfully.
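
The chunking settings printed above correspond to the `vectorIngestionConfiguration` argument of the boto3 `bedrock-agent` `create_data_source` call. The following is a sketch only: how `create_data_source_for_kb` builds its request is not shown in this notebook, and `fixed_chunking_ingestion_config` is a hypothetical helper name.

```python
def fixed_chunking_ingestion_config(max_tokens=300, overlap_percentage=20):
    """Build the vectorIngestionConfiguration dict for fixed-size chunking."""
    return {
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": max_tokens,
                "overlapPercentage": overlap_percentage,
            },
        }
    }


ingestion_config = fixed_chunking_ingestion_config()
print(ingestion_config)

# Assumed usage with boto3 (requires AWS credentials, so it is left commented out;
# kb_id and bucket_arn are placeholders you would supply):
#
# import boto3
# bedrock_agent = boto3.client("bedrock-agent")
# bedrock_agent.create_data_source(
#     knowledgeBaseId=kb_id,
#     name="advanced-rag-example-fixed",
#     dataSourceConfiguration={
#         "type": "S3",
#         "s3Configuration": {"bucketArn": bucket_arn},
#     },
#     vectorIngestionConfiguration=ingestion_config,
# )
```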


### 3. Start Ingestion Job for Amazon Bedrock Knowledge base pointing to Amazon OpenSearch

> **Note**: The ingestion process will take approximately 2-3 minutes to complete. During this time, the system is processing your documents by:
> 1. Extracting text from the source files
> 2. Chunking the content according to the defined strategy (Fixed / Semantic / Hierarchical / Custom)
> 3. Generating embeddings for each chunk
> 4. Storing the embeddings and associated metadata in a Knowledge Base (KB) in OpenSearch vector database
>
> You'll see status updates as the process progresses. Please wait for the "Ingestion job completed successfully" message before proceeding to the next step.
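
The status updates come from polling the ingestion job until it reaches a terminal state. A minimal sketch of such a loop is shown here, assuming a boto3 `bedrock-agent` client and its `get_ingestion_job` call. The internals of `create_ingestion_job` in `advanced_rag_utils` are not shown in this notebook, so `wait_for_ingestion` and its parameters are hypothetical.

```python
import time


def wait_for_ingestion(client, kb_id, ds_id, job_id,
                       poll_seconds=10, timeout_seconds=900):
    """Poll an ingestion job until it finishes, fails, or times out."""
    terminal_states = ("COMPLETE", "FAILED", "STOPPED")
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = client.get_ingestion_job(
            knowledgeBaseId=kb_id,
            dataSourceId=ds_id,
            ingestionJobId=job_id,
        )
        status = response["ingestionJob"]["status"]
        if status in terminal_states:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Ingestion job {job_id} still {status} after timeout")
        print("running...")
        time.sleep(poll_seconds)


# Assumed usage (requires AWS credentials; the IDs are placeholders):
#
# import boto3
# client = boto3.client("bedrock-agent")
# job = client.start_ingestion_job(knowledgeBaseId=kb_id, dataSourceId=ds_id)
# status = wait_for_ingestion(client, kb_id, ds_id,
#                             job["ingestionJob"]["ingestionJobId"])
```

Passing the client in as a parameter keeps the loop easy to exercise with a stub before pointing it at a real account.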

In [47]:
from time import sleep
ingestion_start_time = datetime.now(UTC)
sleep(3)
create_ingestion_job(kb, ds_object, variables)
sleep(3)
ingestion_end_time = datetime.now(UTC)

Ingestion job started successfully for kb_name = advanced-rag-workshop-fixed-chunking and kb_id = SN9KSOQPOV

running...
running...
running...
running...
Job completed successfully



In [48]:
time_taken = (ingestion_end_time-ingestion_start_time).total_seconds()
print(f"time taken to ingest into KB = {fmt_n(time_taken)} seconds")

time taken to ingest into KB = 47.17 seconds


## Embedding LLM Costs
1. Specify model id
2. Specify start and end time
3. Invoke a helper function to query CloudWatch
4. Calculate costs (please note that pricing is subject to change per region and over time)

<br>![Embedding LLM Input Token Costs](./Input_token_embedding_llm_costs.png)

In [49]:
vector_store_embedding_cost = get_bedrock_token_based_cost(model_id, ingestion_start_time, ingestion_end_time)
print(json.dumps(vector_store_embedding_cost, indent=4))

{
    "model_id": "amazon.titan-embed-text-v2:0",
    "start_time": "2025-05-02T23:42:36.153412+00:00",
    "end_time": "2025-05-02T23:43:23.327428+00:00",
    "duration in minutes": 0.7862336,
    "input_tokens": 111341,
    "output_tokens": 0,
    "invocation_count": 417,
    "per million input token costs": 0.02,
    "per million output token costs": 0.0,
    "input token costs": 0.00222682,
    "output token costs": 0.0,
    "total token costs": 0.00222682,
    "average token costs per invocation": 5.340095923261391e-06,
    "token costs per MILLION such invocations": 5.340095923261391
}
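
The figures in this output follow from simple arithmetic on the token count and the per-million-token price. The sketch below reproduces them from the values printed above. `token_cost` is a hypothetical name, and the actual `get_bedrock_token_based_cost` helper additionally queries CloudWatch for the token and invocation counts.

```python
def token_cost(input_tokens, invocation_count, price_per_million_input_tokens):
    """Compute embedding cost figures from token and invocation counts."""
    total = input_tokens / 1_000_000 * price_per_million_input_tokens
    per_invocation = total / invocation_count
    return {
        "total": total,
        "per_invocation": per_invocation,
        "per_million_invocations": per_invocation * 1_000_000,
    }


# Values taken from the cell output above.
cost = token_cost(input_tokens=111341, invocation_count=417,
                  price_per_million_input_tokens=0.02)
print(cost)
```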


In [50]:
# Let's add or update the cost info in the dataframe.
# This will help us compare the costs from various chunking strategies visually.
new_row = {
    'chunking_algo': kb_chunking_strategy,
    'embedding_seconds': vector_store_embedding_cost['duration in minutes']*60,
    'input_tokens': vector_store_embedding_cost['input_tokens'],
    'invocation_count': vector_store_embedding_cost['invocation_count'],
    'total_token_costs': vector_store_embedding_cost['total token costs']
}
df_costs = update_or_add_row(df_costs, new_row)
df_costs

Updated existing row for: fixed


Unnamed: 0,chunking_algo,embedding_seconds,input_tokens,invocation_count,total_token_costs
0,fixed,47.174016,111341,417,0.002227
1,hierarchical,56.287275,295358,1157,0.005907
2,semantic,131.845039,680599,4178,0.013612


In [51]:
# Let's save the df
save_df_to_csv(df_costs)

Successfully saved DataFrame to: /home/sagemaker-user/sample-advanced-rag-using-bedrock-and-sagemaker/embed_algo_costs.csv


### 4. Retrieve: Use input query to RETRIEVE chunks from Vector Database
We will use a helper function that lets you specify the number of chunks to retrieve.<br>
The helper function will 1/ generate a vector embedding for the query, 2/ search the Knowledge Base (KB) vector database for that embedding, 3/ return the number of chunks specified, and 4/ optionally apply a minimum similarity score, in which case only chunks with at least that relevance score are returned.

<b>Warning: After data is ingested into a KB, when you query immediately, the results might be empty because of eventual consistency. If that happens, please wait for a few seconds and then retry.</b>
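
For orientation, here is a sketch of what such a helper might do, assuming the boto3 `bedrock-agent-runtime` `retrieve` API. The body of `retrieve_from_kb` is not shown in this notebook, so `retrieve_chunks` and its client-side score filter are illustrative assumptions rather than the workshop's implementation.

```python
def retrieve_chunks(client, kb_id, query, n_chunks=5, min_score=None):
    """Retrieve up to n_chunks results, optionally dropping low-scoring ones."""
    response = client.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": n_chunks}
        },
    )
    results = response.get("retrievalResults", [])
    if min_score is not None:
        # Assumption: the minimum score is applied client-side after retrieval.
        results = [r for r in results if r.get("score", 0.0) >= min_score]
    return results


# Assumed usage (requires AWS credentials; kb_id is a placeholder):
#
# import boto3
# client = boto3.client("bedrock-agent-runtime")
# chunks = retrieve_chunks(client, kb_id, "What were net incomes of Amazon?",
#                          n_chunks=5, min_score=0.5)
```

The service embeds the query with the Knowledge Base's embedding model, so the caller only sends text.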

In [63]:
# Let's ask a completely irrelevant question and see what we get from the Knowledge Base (KB) in the vector database.
query = "What were the Taco sales in our franchise location on Guadalupe street?"

# specify the number of chunks we want to get from the KB in vector database.
n_chunks = 3 

# get chunks from KB
chunks_from_kb = retrieve_from_kb(query, kb, n_chunks, variables)

# print the chunks, metadata, and the score
print(json.dumps(chunks_from_kb, indent=4))

# You should see a very low score (typically less than 0.4) as there is nothing related to the question in the KB.

[
    {
        "content": "We also have firm, non-cancellable commitments for certain products offered in our Whole Foods Market stores.     Accounts Receivable, Net and Other     Included in \u201cAccounts receivable, net and other\u201d on our consolidated balance sheets are amounts primarily related to customers, vendors, and sellers. As of December 31, 2021 and 2022, customer receivables, net, were $20.2 billion and $26.6 billion, vendor receivables, net, were $5.3 billion and $6.9 billion, and seller receivables, net, were $1.0 billion and $1.3 billion. Seller receivables are amounts due from sellers related to our seller lending program, which provides funding to sellers primarily to procure inventory.     We estimate losses on receivables based on expected losses, including our historical experience of actual losses. Receivables are considered impaired and written-off when it is probable that all contractual payments due will not be collected in accordance with the terms of the

In [64]:
# Let's summarize with total chunks, minimum score, maximum score, average score, 
# and lastly the number of chunks with a score more than a specified threshold.
score_threshold = 0.40

# We will use the helper function below to iterate through each element in json structure and print the 
# score statistics for the returned chunks from the KB
score_structure = analyze_chunk_scores_above_threshold(chunks_from_kb, score_threshold)

print(json.dumps(score_structure, indent=4))

{
    "total_chunks": 3,
    "min_score": 0.3711651,
    "max_score": 0.37424883,
    "avg_score": 0.3722690433333333,
    "count_above_threshold": 0
}
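
A summary like the one above can be computed in a few lines. This sketch uses a hypothetical `summarize_scores` function and made-up scores. It assumes each chunk is a dict with a `score` key and that "above threshold" means strictly greater than, neither of which is confirmed for `analyze_chunk_scores_above_threshold`.

```python
def summarize_scores(chunks, score_threshold):
    """Return count, min, max, mean, and above-threshold count of chunk scores."""
    scores = [chunk["score"] for chunk in chunks]
    if not scores:
        return {"total_chunks": 0, "min_score": None, "max_score": None,
                "avg_score": None, "count_above_threshold": 0}
    return {
        "total_chunks": len(scores),
        "min_score": min(scores),
        "max_score": max(scores),
        "avg_score": sum(scores) / len(scores),
        # Assumption: strictly greater than the threshold.
        "count_above_threshold": sum(1 for s in scores if s > score_threshold),
    }


# Illustrative scores only; these are not results from the Knowledge Base.
example_chunks = [{"score": 0.2}, {"score": 0.5}, {"score": 0.8}]
summary = summarize_scores(example_chunks, score_threshold=0.4)
print(summary)
```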


In [65]:
# Let's ask something more relevant: a question about Amazon's net income.
query = "What were net incomes of Amazon in 2022, 2023 and 2024?"

# specify the number of chunks
n_chunks = 5

# get chunks from KB
chunks_from_kb = retrieve_from_kb(query, kb, n_chunks, variables)

#print the chunks, metadata, and the score
print(json.dumps(chunks_from_kb, indent=4))

# You should see relevant content with a relatively higher score compared to when you asked an irrelevant question.

[
    {
        "content": "(5,936) 37,557 68,614 Benefit (provision) for income taxes 3,217 (7,120) (9,265) Equity-method investment activity, net of tax (3) (12) (101) Net income (loss) $ (2,722) $ 30,425 $ 59,248 Basic earnings per share $ (0.27) $ 2.95 $ 5.66 Diluted earnings per share $ (0.27) $ 2.90 $ 5.53 Weighted-average shares used in computation of earnings per share:     Basic 10,189 10,304 10,473 Diluted 10,189 10,492 10,721     See accompanying notes to consolidated financial statements.     37Table of Contents     AMAZON.COM, INC.",
        "metadata": {
            "x-amz-bedrock-kb-source-uri": "s3://270597685972-us-west-2-advanced-rag-workshop/data/pdf_documents/Amazon-10k-2025.pdf",
            "x-amz-bedrock-kb-document-page-number": 37.0,
            "year": 2025.0,
            "docType": "10K Report",
            "x-amz-bedrock-kb-data-source-id": "HLINVC0VCJ",
            "company": "Amazon",
            "x-amz-bedrock-kb-chunk-id": "1%3A0%3ASfthk5YBIk6Sd2CH9MgH",

In [67]:
# Let's summarize with total chunks, minimum score, maximum score, average score, 
# and lastly the number of chunks with a score more than a specified threshold.
score_threshold = 0.56
score_structure = analyze_chunk_scores_above_threshold(chunks_from_kb, score_threshold)
print(json.dumps(score_structure, indent=4))

{
    "total_chunks": 5,
    "min_score": 0.536493,
    "max_score": 0.55985385,
    "avg_score": 0.5453553099999999,
    "count_above_threshold": 0
}


In [66]:
# For the following query, we know that the document does not use the abbreviations CEO, CFO, or CTO for these roles but mentions
# Chief Executive Officer and Chief Financial Officer. Let's see how well the vector search performs.
query = "Who is the CEO, CFO, and CTO of Amazon? While answering the question, only use the data in context. If for any part of the question, you dont find the information in the context, please say I dont know for that part of the question."

#specify the number of chunks
n_chunks = 5 

# get chunks from KB
chunks_from_kb = retrieve_from_kb(query, kb, n_chunks, variables)

#print the chunks, metadata, and the score
print(json.dumps(chunks_from_kb, indent=2))

# You will see that vector search retrieves the chunks that contain the phrases Chief Executive Officer and Chief Financial Officer

[
  {
    "content": "We promptly make available on this website, free of charge, the reports that we file or furnish with the Securities and Exchange Commission (\u201cSEC\u201d), corporate governance information (including our Code of Business Conduct and Ethics), and select press releases.     Executive Officers and Directors The following tables set forth certain information regarding our Executive Officers and Directors as of January 29, 2025:     Information About Our Executive Officers Name Age Position     Jeffrey P. Bezos 61 Executive Chair Andrew R. Jassy 57 President and Chief Executive Officer Matthew S. Garman 48 CEO Amazon Web Services Douglas J. Herrington 58 CEO Worldwide Amazon Stores Brian T. Olsavsky 61 Senior Vice President and Chief Financial Officer Shelley L. Reynolds 60 Vice President, Worldwide Controller, and Principal Accounting Officer David A. Zapolsky 61 Senior Vice President, Global Public Policy and General Counsel     Jeffrey P. Bezos. Mr. Bezos founded

#### Note: In the above results, the metadata has the name of the file, page number and other info. Optionally, this is something you could choose to share in your Generative application. Users can then click on the link and learn more from that content.

In [68]:
# Now let's pick the chunks with some minimum relevance score for the same question.
query = "Who is the CEO, CFO, and CTO of Amazon? While answering the question, only use the data in context. If for any part of the question, you dont find the information in the context, please say I dont know for that part of the question."

#specify the number of chunks
n_chunks = 5

# Let's specify a minimum similarity score. We should see fewer chunks retrieved compared to the previous invocation.
min_score = 0.50

# get chunks from KB
chunks_from_kb = retrieve_from_kb(query, kb, n_chunks, variables, min_score)

print(json.dumps(chunks_from_kb, indent=2))

# You should see fewer chunks retrieved than in the previous cell
# whenever some chunks score below the minimum relevance score.

[
  {
    "content": "We promptly make available on this website, free of charge, the reports that we file or furnish with the Securities and Exchange Commission (\u201cSEC\u201d), corporate governance information (including our Code of Business Conduct and Ethics), and select press releases.     Executive Officers and Directors The following tables set forth certain information regarding our Executive Officers and Directors as of January 29, 2025:     Information About Our Executive Officers Name Age Position     Jeffrey P. Bezos 61 Executive Chair Andrew R. Jassy 57 President and Chief Executive Officer Matthew S. Garman 48 CEO Amazon Web Services Douglas J. Herrington 58 CEO Worldwide Amazon Stores Brian T. Olsavsky 61 Senior Vice President and Chief Financial Officer Shelley L. Reynolds 60 Vice President, Worldwide Controller, and Principal Accounting Officer David A. Zapolsky 61 Senior Vice President, Global Public Policy and General Counsel     Jeffrey P. Bezos. Mr. Bezos founded

In [69]:
# Let's summarize with total chunks, minimum score, maximum score, average score, 
# and lastly the number of chunks with a score more than a specified threshold.
score_threshold = 0.55
score_structure = analyze_chunk_scores_above_threshold(chunks_from_kb, score_threshold)
print(json.dumps(score_structure, indent=4))

{
    "total_chunks": 5,
    "min_score": 0.539641,
    "max_score": 0.55985385,
    "avg_score": 0.5470688100000001,
    "count_above_threshold": 1
}


### Cost Summary for Running This Notebook
In this notebook, we have used an embedding LLM for two purposes. 
1. Populate a vector store for six PDF files and one CSV file. (7 documents in total)
2. Generate a query embedding.

In [70]:
import time
time.sleep(5)

# Marking notebook endtime
notebook_end_time = datetime.now(UTC)

In [71]:
from IPython.display import display, Markdown
from advanced_rag_utils import embedding_cost_report

cost_for_notebook = get_bedrock_token_based_cost(model_id, notebook_start_time, notebook_end_time)

# Your assumptions for your use case:
scenario_number_of_documents = 1000
scenario_number_of_queries = 15000000
 
display(Markdown(embedding_cost_report(vector_store_embedding_cost, cost_for_notebook, scenario_number_of_documents, scenario_number_of_queries)))


#### Scenario
* Number of documents to ingest: 1000
* Number of queries: 15000000

#### Cost Estimation based on the Scenario (USD)
|-| Notebook Cost | Scenario Cost |
|-|-|-|
|VectorStore|0.002227|0.318117|
|Queries|8.080000000000153e-06|12.12|
|**TOTAL**|0.002235|12.438116999999998|

#### The cost estimate assumes that documents and queries similar to those in this notebook are scaled up linearly to the scenario volumes.
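
The vector store row of the table is consistent with a simple linear scaling of the notebook cost. The sketch below shows that arithmetic under the assumption that cost grows in proportion to the number of documents. `scale_cost` is a hypothetical name, and `embedding_cost_report` may compute its figures differently. The number of queries run in this notebook is not shown in the output, so only the document side is reproduced here.

```python
def scale_cost(notebook_cost, notebook_units, scenario_units):
    """Scale a measured cost linearly from notebook volume to scenario volume."""
    if notebook_units <= 0:
        raise ValueError("notebook_units must be positive")
    return notebook_cost / notebook_units * scenario_units


# 7 documents were ingested in this notebook (six PDF files and one CSV file).
vector_store_scenario_cost = scale_cost(
    notebook_cost=0.00222682,  # total token costs from the ingestion step
    notebook_units=7,
    scenario_units=1000,
)
print(round(vector_store_scenario_cost, 6))
```

Real documents vary in length, so treat a per-document average as a rough planning figure rather than a forecast.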
        