## Create a Knowledge Base with Fixed Chunking Strategy
#### What will we do in this workshop?
1. Create a Knowledge Base (KB) in the vector database.
2. We will create a data source for the KB. The data source will be the Amazon Science and 10K documents stored in S3.
3. We will ingest the data from S3, use Fixed Chunking to chunk the data, generate vector embeddings, and store the chunks and their corresponding vector embeddings in the KB.
4. We will then ask some questions and query the KB to return some chunks and inspect their relevance scores.
<br>Note: We are not sending the query and its chunks to an LLM in this notebook. We will do that in other notebooks.
![We are generating vector embeddings and storing them in a KB in Vector Database](./Fixed_Chunking.png)

Chunking data is essential. If you are adding large documents with hundreds of pages to your knowledge base then you need to split them up and return only the relevant sections to use as context for your inference. If you are returning too much context it will increase costs (models charge based on input token count) and latency. It may also harm output quality. Shorter chunks will provide a better match but may lack the context necessary to answer a question.

Amazon Bedrock Knowledge Bases offers several chunking strategies to choose from. They range from splitting at fixed sizes to splitting at semantic boundaries such as paragraphs, or along hierarchical structures. However, some document types can benefit from custom chunking. For example, any form of markup can be exploited by a custom chunking approach.

You can also create your own custom chunking approach using a Lambda function. If you want to add any custom metadata then you will need to add a Lambda function. You can either handle the chunking yourself, edit an existing chunk or just add metadata. Metadata can then be used for filtering.

It is important to tune your chunking to the type of documents being ingested. Choosing the wrong chunk size will affect accuracy and response times. It will also increase the costs in both the vector storage and inference steps. The defaults supplied in Bedrock are pretty good, but they may need to be tailored to your specific circumstances. Longer and more technical documents may need larger chunk sizes to make sure they include more context. Speech (like a chat transcript) can benefit from shorter chunks.

![Chunking Strategies](./chunking-strategies.png)

## Overview

In this notebook, we will implement a knowledge base using a fixed chunking strategy. Here are the key steps we'll perform:

1. **Create a Knowledge Base**: Set up an Amazon Bedrock Knowledge Base with fixed-size chunking configuration that will store and retrieve our vector embeddings.

2. **Create a Data Source**: Connect our Knowledge Base to the documents we uploaded to S3 in the previous notebook.

3. **Start Ingestion Job**: Begin the process of transforming our documents into chunks, creating embeddings, and storing them in our vector database.

4. **Retrieve**: Test our Knowledge Base by retrieving relevant chunks for a few sample queries and inspecting their relevance scores.

#### Concept

**Fixed Chunking**: Involves dividing your documents into fixed-size chunks, regardless of the content within them. Each chunk contains a predefined number of tokens or characters, and this method allows for more uniform data organization. 

![How Fixed Sized Chunking Works](./Fixed_how_it_works.png)

Fixed chunking is useful when you want to ensure that your chunks are of a consistent size, making them easier to process and retrieve in a predictable manner. The document is split into sections of equal length, and each section becomes a separate chunk. This method works well when the content is relatively homogeneous, and the chunk boundaries are not as crucial to understanding the underlying context.
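To make the idea concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not the Bedrock implementation: it assumes that whitespace-separated words stand in for tokens and that the overlap is taken as a percentage of the chunk size. The function name `fixed_size_chunks` is hypothetical.

```python
def fixed_size_chunks(tokens, max_tokens=300, overlap_percentage=20):
    """Split a token list into windows of at most max_tokens, sharing an overlap.

    Illustrative only: real tokenizers and the managed service may behave differently.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_percentage < 100:
        raise ValueError("overlap_percentage must be in [0, 100)")

    overlap = max_tokens * overlap_percentage // 100  # tokens shared by neighbours
    step = max_tokens - overlap                       # how far each window advances

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks


# Assumption: whitespace splitting is a stand-in for a real tokenizer.
words = "the quick brown fox jumps over the lazy dog again and again".split()
chunks = fixed_size_chunks(words, max_tokens=5, overlap_percentage=20)
for chunk in chunks:
    print(" ".join(chunk))
```

Each window starts one word before the previous window ended, so a sentence cut at a boundary still appears partly in both neighbouring chunks. This is the same trade-off that the `maxTokens` and `overlapPercentage` settings control later in this notebook.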

#### Benefits

- **Uniformity**: Each chunk has the same size, making the system more predictable. This helps with processing efficiency since you know that each chunk is of a consistent size, making batch operations and parallel processing easier.
- **Simplified Retrieval**: Since the chunk sizes are uniform, searching through the data becomes straightforward. You can quickly determine the length of chunks, which can be useful for performance optimization and scalability in large datasets.
- **Performance Optimization**: Fixed chunks are ideal when you want to control the computational cost of document retrieval and chunking. Having equal-sized chunks reduces the chance of computational bottlenecks in scenarios requiring large-scale document processing.

> **Note:** While fixed chunking can be efficient for certain use cases, it may not preserve the natural semantic boundaries of the content, such as paragraphs or sections. This may lead to chunks that start or end at arbitrary places, potentially cutting off context in the middle of a sentence or idea.

### **Best Use Cases**
Fixed chunking is suitable for cases where:
- **Homogeneous content**: The content is consistent, and boundaries are not as important.
- **Performance**: You need uniform-sized chunks for predictable processing or optimization of large-scale systems.
- **Simplified text processing**: When chunk boundaries do not need to match natural semantic structures like paragraphs or sentences.

Examples include:
- **General document indexing**: When large datasets are involved, and uniform chunk sizes optimize retrieval.
- **Text summarization**: Fixed chunking is helpful when generating summaries from uniformly sized data pieces.


In [7]:
# Import a module with a few helper functions.
# These functions will help us create a knowledge base (KB), create a data source for the KB,
# and ingest data into the KB using the chosen chunking strategy (fixed chunking in this notebook).

import importlib
import advanced_rag_utils

# Reload module
importlib.reload(advanced_rag_utils)

# Re-import all functions
from advanced_rag_utils import *

from datetime import datetime, timedelta, UTC

notebook_start_time = datetime.now(UTC)

In [8]:
# Let's load the variables we saved in the first notebook. We will use these variables throughout this notebook.
import json
with open("../variables.json", "r") as f:
    variables = json.load(f)

variables

{'accountNumber': '674655509879',
 'regionName': 'us-west-2',
 'collectionArn': 'arn:aws:aoss:us-west-2:674655509879:collection/itgjkhz5b0epjlrptxql',
 'collectionId': 'itgjkhz5b0epjlrptxql',
 'vectorIndexName': 'ws-index-',
 'bedrockExecutionRoleArn': 'arn:aws:iam::674655509879:role/advanced-rag-workshop-bedrock_execution_role-us-west-2',
 's3Bucket': '674655509879-us-west-2-advanced-rag-workshop'}

In [9]:
# Load the dataframe related to costs from a csv file (if it already exists)
df_costs = load_df_from_csv()
df_costs

Loaded existing file: /home/sagemaker-user/sample-advanced-rag-using-bedrock-and-sagemaker/embed_algo_costs.csv


Unnamed: 0,chunking_algo,embedding_seconds,input_tokens,invocation_count,total_token_costs
0,fixed,41.270667,297653,1111,0.0
1,semantic,151.967345,994947,6676,0.0


### 1. Create a Knowledge Base
Let's specify the chunking strategy, name, and description for the Knowledge Base (KB), and then create it.

In [10]:
kb_chunking_strategy = "fixed" # ["fixed", "hierarchical", "semantic", "custom"]

In [11]:
kb_name = f"advanced-rag-workshop-{kb_chunking_strategy}-chunking"

kb_description = "Knowledge base using Amazon OpenSearch Service as a vector store"

kb = create_kb(kb_name, kb_description, kb_chunking_strategy, variables)

{'collectionArn': 'arn:aws:aoss:us-west-2:674655509879:collection/itgjkhz5b0epjlrptxql', 'vectorIndexName': 'ws-index-fixed', 'fieldMapping': {'vectorField': 'vector', 'textField': 'text', 'metadataField': 'text-metadata'}}
OpenSearch Knowledge Response: {
    "ResponseMetadata": {
        "RequestId": "318edd21-40a3-4eb8-9630-0faacdd15872",
        "HTTPStatusCode": 200,
        "HTTPHeaders": {
            "date": "Fri, 02 May 2025 06:15:19 GMT",
            "content-type": "application/json",
            "content-length": "956",
            "connection": "keep-alive",
            "x-amzn-requestid": "318edd21-40a3-4eb8-9630-0faacdd15872",
            "x-amz-apigw-id": "J7RmwFMsPHcEY8A=",
            "x-amzn-trace-id": "Root=1-681462f7-26312ed64f0ae80b223c8a8e"
        },
        "RetryAttempts": 0
    },
    "knowledgeBase": {
        "createdAt": "2025-05-02 06:15:17.935242+00:00",
        "description": "Knowledge base using Amazon OpenSearch Service as a vector store",
        "k

### 2. Create Datasource for Knowledge Base

In [12]:
data_source_name = f"advanced-rag-example-{kb_chunking_strategy}"

ds_object = create_data_source_for_kb(kb_chunking_strategy, data_source_name, kb, variables)

Creating new data source 'advanced-rag-example-fixed' with {'chunkingStrategy': 'FIXED_SIZE', 'fixedSizeChunkingConfiguration': {'maxTokens': 300, 'overlapPercentage': 20}} chunking...
fixed chunking data source created successfully.
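The log line above shows the chunking settings the helper applied. As a sketch of how such a configuration can be assembled, the function below builds the same dictionary. The name `fixed_chunking_config` is hypothetical, and it is an assumption that the helper passes this structure as the `vectorIngestionConfiguration` argument of the boto3 `bedrock-agent` client's `create_data_source` call; the helper's source is not shown here.

```python
def fixed_chunking_config(max_tokens=300, overlap_percentage=20):
    """Build a fixed-size chunking configuration (hypothetical helper)."""
    return {
        "chunkingConfiguration": {
            "chunkingStrategy": "FIXED_SIZE",
            "fixedSizeChunkingConfiguration": {
                "maxTokens": max_tokens,                  # upper bound on chunk length
                "overlapPercentage": overlap_percentage,  # share of a chunk repeated in its neighbour
            },
        }
    }


config = fixed_chunking_config()

# Rough size of the shared region between neighbouring chunks, in tokens.
fixed = config["chunkingConfiguration"]["fixedSizeChunkingConfiguration"]
overlap_tokens = fixed["maxTokens"] * fixed["overlapPercentage"] // 100
print(config)
print(f"approximate overlap: {overlap_tokens} tokens")
```

With the values used in this notebook (300 tokens, 20% overlap), neighbouring chunks share roughly 60 tokens. Raising either value increases the number of tokens embedded and stored.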


### 3. Start Ingestion Job for Amazon Bedrock Knowledge base pointing to Amazon OpenSearch

> **Note**: The ingestion process will take approximately 2-3 minutes to complete. During this time, the system is processing your documents by:
> 1. Extracting text from the source files
> 2. Chunking the content according to the defined strategy (Fixed / Semantic / Hierarchical / Custom)
> 3. Generating embeddings for each chunk
> 4. Storing the embeddings and associated metadata in a Knowledge Base (KB) in OpenSearch vector database
>
> You'll see status updates as the process progresses. Please wait for the "Ingestion job completed successfully" message before proceeding to the next step.

In [13]:
ingestion_start_time = datetime.now(UTC)
create_ingestion_job(kb, ds_object, variables)
ingestion_end_time = datetime.now(UTC)

Ingestion job started successfully for kb_name = advanced-rag-workshop-fixed-chunking and kb_id = X747HTHUTQ

running...
running...
running...
running...
running...
Job completed successfully



In [14]:
time_taken = (ingestion_end_time-ingestion_start_time).total_seconds()
print(f"time taken to ingest into KB = {fmt_n(time_taken)} seconds")

time taken to ingest into KB = 51.37 seconds
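The repeated `running...` lines come from waiting on the ingestion job. A minimal sketch of that polling pattern is shown below. The function `wait_for_job` and its parameters are hypothetical, and the status source is a fake iterator so the example runs without AWS credentials. In a real notebook, `get_status` might wrap the `bedrock-agent` client's `get_ingestion_job` call and read `["ingestionJob"]["status"]`; whether the workshop helper does exactly this is an assumption.

```python
import time

# Assumption: these are the states after which polling should stop.
TERMINAL_STATES = {"COMPLETE", "FAILED", "STOPPED"}


def wait_for_job(get_status, poll_seconds=5.0, timeout_seconds=600.0, sleep=time.sleep):
    """Call get_status() until it returns a terminal state or the timeout elapses."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still {status} after {waited:.0f} seconds")
        print("running...")
        sleep(poll_seconds)
        waited += poll_seconds


# Fake status source standing in for a real status lookup.
fake_statuses = iter(["STARTING", "IN_PROGRESS", "IN_PROGRESS", "COMPLETE"])
final_status = wait_for_job(lambda: next(fake_statuses), poll_seconds=0.0)
print(f"Job finished with status {final_status}")
```

Passing the status lookup in as a callable keeps the loop testable and makes the timeout explicit, which matters if an ingestion job stalls.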


## Embedding LLM Costs
1. Specify the model ID
2. Specify the start and end time
3. Invoke a helper function to query Amazon CloudWatch
4. Calculate costs (please note that pricing is subject to change per region and over time)

<br>![Embedding LLM Input Token Costs](./Input_token_embedding_llm_costs.png)

In [18]:
model_id = 'arn:aws:bedrock:us-west-2::foundation-model/amazon.titan-embed-text-v2:0'
tokens = get_embedding_LLM_costs_for_KB(model_id, ingestion_start_time, ingestion_end_time)
print(json.dumps(tokens, indent=4))

{
    "model_id": "arn:aws:bedrock:us-west-2::foundation-model/amazon.titan-embed-text-v2:0",
    "start_time": "2025-05-02T06:15:25.896237+00:00",
    "end_time": "2025-05-02T06:16:17.265704+00:00",
    "duration in minutes": 0.8561577833333334,
    "input_tokens": 0,
    "output_tokens": 0,
    "invocation_count": 0,
    "per million input token costs": 0.0,
    "per million output token costs": 0.0,
    "input token costs": 0.0,
    "output token costs": 0.0,
    "total token costs": 0.0,
    "average token costs per invocation": 0,
    "token costs per MILLION such invocations": 0
}
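The cost fields in this output are derived from token counts and a per-million-token price. The sketch below shows that arithmetic, using the 297,653 input tokens recorded for the fixed strategy in the costs table loaded earlier. The price of $0.02 per million input tokens is an assumption for illustration only; check the current Amazon Bedrock pricing for your region. The function name `embedding_token_cost` is hypothetical.

```python
def embedding_token_cost(input_tokens, price_per_million_tokens):
    """Return the cost in dollars of embedding input_tokens tokens."""
    return input_tokens / 1_000_000 * price_per_million_tokens


# Assumption: illustrative price, not taken from this notebook's output.
ASSUMED_PRICE_PER_MILLION = 0.02

# 297,653 is the input token count recorded for the fixed strategy earlier in this notebook.
fixed_cost = embedding_token_cost(297_653, ASSUMED_PRICE_PER_MILLION)
print(f"Estimated embedding cost: ${fixed_cost:.6f}")
```

Embedding models are billed on input tokens only, which is why the output token fields stay at zero. If the reported token counts are zero immediately after ingestion, the metrics may simply not have been published yet.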


In [19]:
# Let's add or update the cost info in the dataframe.
# This will help us compare the costs from various chunking strategies visually.
new_row = {
    'chunking_algo': kb_chunking_strategy,
    'embedding_seconds': tokens['duration in minutes']*60,
    'input_tokens': tokens['input_tokens'],
    'invocation_count': tokens['invocation_count'],
    'total_token_costs': tokens['total token costs']
}
df_costs = update_or_add_row(df_costs, new_row)
df_costs

Updated existing row for: fixed


Unnamed: 0,chunking_algo,embedding_seconds,input_tokens,invocation_count,total_token_costs
0,fixed,51.369467,0,0,0.0
1,semantic,151.967345,994947,6676,0.0


In [20]:
# Let's save the df
save_df_to_csv(df_costs)

Successfully saved DataFrame to: /home/sagemaker-user/sample-advanced-rag-using-bedrock-and-sagemaker/embed_algo_costs.csv


### 4. Retrieve: Use input query to RETRIEVE chunks from Vector Database
We will use a helper function that lets you specify the number of chunks to retrieve.<br>
The helper function will (1) generate a vector embedding for the query, (2) search for that embedding in the Knowledge Base (KB) vector database, and (3) return the number of chunks specified. Optionally, you can also specify a minimum similarity score, in which case the helper function returns only chunks with at least that relevance.

<b>Warning: After data is ingested into a KB, when you query immediately, the results might be empty because of eventual consistency. If that happens, please wait for a few seconds and then retry.</b>
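As a rough sketch of what such a helper could do, the functions below build the arguments for the boto3 `bedrock-agent-runtime` client's `retrieve` call and apply an optional minimum score afterwards. The function names, the placeholder KB ID, and the sample results are hypothetical, and applying the minimum score as a client-side filter is an assumption; the workshop helper's actual implementation is not shown in this notebook.

```python
def build_retrieve_request(kb_id, query, n_chunks):
    """Build keyword arguments for a Knowledge Base retrieve call."""
    return {
        "knowledgeBaseId": kb_id,
        "retrievalQuery": {"text": query},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {"numberOfResults": n_chunks}
        },
    }


def filter_by_min_score(results, min_score=None):
    """Keep only results whose score is at least min_score (all results if None)."""
    if min_score is None:
        return list(results)
    return [r for r in results if r.get("score", 0.0) >= min_score]


# Placeholder KB ID and made-up results, for illustration only.
request = build_retrieve_request("EXAMPLEKBID", "Who is the CEO of Amazon?", 5)
sample_results = [
    {"content": {"text": "chunk A"}, "score": 0.62},
    {"content": {"text": "chunk B"}, "score": 0.48},
]
kept = filter_by_min_score(sample_results, min_score=0.50)
print(request)
print([r["content"]["text"] for r in kept])
```

With real credentials, the request could be sent as `client.retrieve(**request)` and the chunks read from the `retrievalResults` list in the response. Filtering after retrieval means fewer than `n_chunks` results may come back, which matches the behaviour described above.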

In [21]:
# Let's ask a completely irrelevant question and see what we get from the Knowledge Base (KB) in the vector database.
query = "What were the Taco sales in our franchise location on Guadalupe street?"

# specify the number of chunks we want to get from the KB in vector database.
n_chunks = 3 

# get chunks from KB
chunks_from_kb = retrieve_from_kb(query, kb, n_chunks, variables)

# print the chunks, metadata, and the score
print(json.dumps(chunks_from_kb, indent=4))

# You should see a very low score (typically less than 0.4) as there is nothing related to the question in the KB.

[
    {
        "content": "Overview     Our primary source of revenue is the sale of a wide range of products and services to customers. The products offered through our stores include merchandise and content we have purchased for resale and products offered by third-party sellers, and we also manufacture and sell electronic devices and produce media content. Generally, we recognize gross revenue from items we sell from our inventory as product sales and recognize our net share of revenue of items sold by third-party sellers as service sales. We seek to increase unit sales across our stores, through increased product selection, across numerous product categories. We also offer other services such as compute, storage, and database offerings, fulfillment, advertising, publishing, and digital content subscriptions.     Our financial focus is on long-term, sustainable growth in free cash flows. Free cash flows are driven primarily by increasing operating income and efficiently managing ac

In [22]:
# Let's summarize with total chunks, minimum score, maximum score, average score, 
# and lastly the number of chunks with a score more than a specified threshold.
score_threshold = 0.40

# We will use the helper function below to iterate through each element in json structure and print the 
# score statistics for the returned chunks from the KB
score_structure = analyze_chunk_scores_above_threshold(chunks_from_kb, score_threshold)

print(json.dumps(score_structure, indent=4))

{
    "total_chunks": 3,
    "min_score": 0.37658685,
    "max_score": 0.3767433,
    "avg_score": 0.37669115000000003,
    "count_above_threshold": 0
}
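A summary like the one above can be computed in a few lines. The sketch below is one possible version, not the workshop helper itself: the function name `summarize_scores` is hypothetical, the sample scores are made up, and counting only scores strictly greater than the threshold is an assumption.

```python
def summarize_scores(chunks, score_threshold):
    """Summarize the relevance scores of retrieved chunks."""
    scores = [chunk["score"] for chunk in chunks]
    if not scores:
        # Nothing retrieved: report an empty summary instead of dividing by zero.
        return {"total_chunks": 0, "min_score": None, "max_score": None,
                "avg_score": None, "count_above_threshold": 0}
    return {
        "total_chunks": len(scores),
        "min_score": min(scores),
        "max_score": max(scores),
        "avg_score": sum(scores) / len(scores),
        # Assumption: "above" means strictly greater than the threshold.
        "count_above_threshold": sum(1 for s in scores if s > score_threshold),
    }


# Made-up scores for illustration only.
sample_chunks = [{"score": 0.2}, {"score": 0.5}, {"score": 0.8}]
summary = summarize_scores(sample_chunks, score_threshold=0.4)
print(summary)
```

Looking at the minimum, maximum, and average together helps when choosing a threshold: a narrow band of low scores, as in the output above, suggests that nothing in the KB matches the question.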


In [23]:
# Let's ask something more relevant: a question about Amazon's net income.
query = "What were net incomes of Amazon in 2022, 2023 and 2024?"

# specify the number of chunks
n_chunks = 5

# get chunks from KB
chunks_from_kb = retrieve_from_kb(query, kb, n_chunks, variables)

#print the chunks, metadata, and the score
print(json.dumps(chunks_from_kb, indent=4))

# You should see relevant content with a relatively higher score than when you asked an irrelevant question.

[
    {
        "content": ", 2022 10,242 108 (7,837) 75,066 (4,487) 83,193 146,043 Net income \u2014 \u2014 \u2014 \u2014 \u2014 30,425 30,425 Other comprehensive income (loss) \u2014 \u2014 \u2014 \u2014 1,447 \u2014 1,447 Stock-based compensation and issuance of employee benefit plan stock 141 1 \u2014 23,959 \u2014 \u2014 23,960 Balance as of December 31, 2023 10,383 $ 109 $ (7,837) $ 99,025 $ (3,040) $ 113,618 $ 201,875     See accompanying notes to consolidated financial statements.     41Table of Contents     AMAZON.COM, INC. NOTES TO CONSOLIDATED FINANCIAL STATEMENTS     Note 1 \u2014 DESCRIPTION OF BUSINESS, ACCOUNTING POLICIES, AND SUPPLEMENTAL DISCLOSURES     Description of Business We seek to be Earth\u2019s most customer-centric company. In each of our segments, we serve our primary customer sets, consisting of consumers, sellers,     developers, enterprises, content creators, advertisers, and employees. We serve consumers through our online and physical stores and focus o

In [24]:
# Let's summarize with total chunks, minimum score, maximum score, average score, 
# and lastly the number of chunks with a score more than a specified threshold.
score_threshold = 0.60
score_structure = analyze_chunk_scores_above_threshold(chunks_from_kb, score_threshold)
print(json.dumps(score_structure, indent=4))

{
    "total_chunks": 5,
    "min_score": 0.56256187,
    "max_score": 0.64564687,
    "avg_score": 0.583513602,
    "count_above_threshold": 1
}


In [29]:
# For the following query, we know that the document does not use the abbreviations CFO or CTO for Amazon's officers, but it does mention
# Chief Executive Officer and Chief Financial Officer. Let's see how good the vector search is.
query = "Who is the CEO, CFO, and CTO of Amazon? While answering the question, only use the data in context. If for any part of the question, you dont find the information in the context, please say I dont know for that part of the question."

#specify the number of chunks
n_chunks = 5 

# get chunks from KB
chunks_from_kb = retrieve_from_kb(query, kb, n_chunks, variables)

#print the chunks, metadata, and the score
print(json.dumps(chunks_from_kb, indent=2))

# You will see that vector search retrieves the chunks that contain the phrases Chief Executive Officer and Chief Financial Officer

[
  {
    "content": "We promptly make available on this website, free of charge, the reports that we file or furnish with the Securities and Exchange Commission (\u201cSEC\u201d), corporate governance information (including our Code of Business Conduct and Ethics), and select press releases.     Executive Officers and Directors The following tables set forth certain information regarding our Executive Officers and Directors as of January 29, 2025:     Information About Our Executive Officers Name Age Position     Jeffrey P. Bezos 61 Executive Chair Andrew R. Jassy 57 President and Chief Executive Officer Matthew S. Garman 48 CEO Amazon Web Services Douglas J. Herrington 58 CEO Worldwide Amazon Stores Brian T. Olsavsky 61 Senior Vice President and Chief Financial Officer Shelley L. Reynolds 60 Vice President, Worldwide Controller, and Principal Accounting Officer David A. Zapolsky 61 Senior Vice President, Global Public Policy and General Counsel     Jeffrey P. Bezos. Mr. Bezos founded

#### Note: In the above results, the metadata includes the file name, page number, and other information. Optionally, you could choose to surface this in your generative AI application so that users can follow a link to the source and learn more from that content.

In [34]:
# Now let's pick the chunks with some minimum relevance score for the same question.
query = "Who is the CEO, CFO, and CTO of Amazon? While answering the question, only use the data in context. If for any part of the question, you dont find the information in the context, please say I dont know for that part of the question."

#specify the number of chunks
n_chunks = 5

# Let's specify a minimum similarity score. We should see fewer chunks retrieved than in the previous invocation.
min_score = 0.50

# get chunks from KB
chunks_from_kb = retrieve_from_kb(query, kb, n_chunks, variables, min_score)

print(json.dumps(chunks_from_kb, indent=2))

# You should see fewer chunks retrieved than in the previous cell
# because of the minimum relevance score.

[
  {
    "content": "We promptly make available on this website, free of charge, the reports that we file or furnish with the Securities and Exchange Commission (\u201cSEC\u201d), corporate governance information (including our Code of Business Conduct and Ethics), and select press releases.     Executive Officers and Directors The following tables set forth certain information regarding our Executive Officers and Directors as of January 29, 2025:     Information About Our Executive Officers Name Age Position     Jeffrey P. Bezos 61 Executive Chair Andrew R. Jassy 57 President and Chief Executive Officer Matthew S. Garman 48 CEO Amazon Web Services Douglas J. Herrington 58 CEO Worldwide Amazon Stores Brian T. Olsavsky 61 Senior Vice President and Chief Financial Officer Shelley L. Reynolds 60 Vice President, Worldwide Controller, and Principal Accounting Officer David A. Zapolsky 61 Senior Vice President, Global Public Policy and General Counsel     Jeffrey P. Bezos. Mr. Bezos founded

In [33]:
# Let's summarize with total chunks, minimum score, maximum score, average score, 
# and lastly the number of chunks with a score more than a specified threshold.
score_threshold = 0.50
score_structure = analyze_chunk_scores_above_threshold(chunks_from_kb, score_threshold)
print(json.dumps(score_structure, indent=4))

{
    "total_chunks": 5,
    "min_score": 0.539641,
    "max_score": 0.55985385,
    "avg_score": 0.5470688100000001,
    "count_above_threshold": 0
}


In [None]:
#Let's print the costs of running this notebook.

model_id = 'arn:aws:bedrock:us-west-2::foundation-model/amazon.titan-embed-text-v2:0'

notebook_end_time = datetime.now(UTC)
tokens = get_bedrock_tokens(model_id, notebook_start_time, notebook_end_time, 5)
print(json.dumps(tokens, indent=4))
print(f"Cost of running this notebook is approximately ${tokens['total token costs']}")

{
    "model_id": "arn:aws:bedrock:us-west-2::foundation-model/amazon.titan-embed-text-v2:0",
    "start_time": "2025-05-02T06:15:13.083975+00:00",
    "end_time": "2025-05-02T06:20:30.629528+00:00",
    "duration in minutes": 5.292425883333333,
    "input_tokens": 51,
    "output_tokens": 0,
    "invocation_count": 3,
    "per million input token costs": 0.0,
    "per million output token costs": 0.0,
    "input token costs": 0.0,
    "output token costs": 0.0,
    "total token costs": 0.0,
    "average token costs per invocation": 0.0,
    "token costs per MILLION such invocations": 0.0
}
Cost of running this notebook is approximately $0.0
