## Create a Knowledge Base with a Custom Chunking Strategy

#### Custom Chunking Logic with Lambda Functions in Amazon Bedrock

When creating a Knowledge Base (KB) for Amazon Bedrock, you can connect a Lambda function to specify your custom chunking logic. During the ingestion process, if a Lambda function is provided, the Knowledge Base will execute the Lambda function and store the input and output values in the specified intermediate S3 bucket.

#### Use Cases for Lambda Functions in KBs

- **Custom Chunking Logic:** Lambda functions can be used to implement custom logic for chunking documents during ingestion, enabling more control over how documents are divided into meaningful chunks.
- **Chunk-level Metadata Processing:** Lambda functions can also process chunked data, for example, by adding custom metadata at the chunk level, enriching the data for more advanced retrieval or analysis.

This allows for more flexibility and tailored handling of document data within the Knowledge Base, making it possible to apply unique chunking strategies and augment the data with specific metadata for improved search and retrieval.
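
To make the two use cases concrete, here is a minimal sketch of the kind of logic such a Lambda function might apply to one document's text. The function name `chunk_with_metadata`, the word-based splitting rule, and the `chunk_index` / `word_count` metadata keys are illustrative assumptions, not the workshop's actual Lambda code. The `contentBody` / `contentType` / `contentMetadata` field names are modeled on the intermediate file layout used by Bedrock custom transformations and should be checked against the current AWS documentation.

```python
from typing import Any, Optional


def chunk_with_metadata(
    text: str,
    max_words: int = 50,
    extra_metadata: Optional[dict] = None,
) -> list:
    """Split text into chunks of at most max_words words and tag each chunk.

    Hypothetical helper: the output field names are assumed to mirror the
    Bedrock custom transformation file layout; verify before relying on them.
    """
    words = text.split()
    chunks: list = []
    for start in range(0, len(words), max_words):
        piece = words[start:start + max_words]
        # Copy document-level metadata, then add chunk-level fields.
        metadata: dict[str, Any] = dict(extra_metadata or {})
        metadata["chunk_index"] = len(chunks)
        metadata["word_count"] = len(piece)
        chunks.append(
            {
                "contentBody": " ".join(piece),
                "contentType": "TEXT",  # assumed value for plain-text chunks
                "contentMetadata": metadata,
            }
        )
    return chunks
```

In a real Lambda handler, a function like this would sit between reading the input batch from the intermediate S3 bucket and writing the transformed batch back to it.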


In [1]:
# Import the advanced_rag_utils module
import advanced_rag_utils
import json
import importlib

# Reload module
importlib.reload(advanced_rag_utils)

# Re-import all functions
from advanced_rag_utils import *

from datetime import datetime, timedelta, UTC

notebook_start_time = datetime.now(UTC)

# Load the variables from the JSON file
with open("../variables.json", "r") as f:
    variables = json.load(f)

variables

{'accountNumber': '989679345636',
 'regionName': 'us-west-2',
 'collectionArn': 'arn:aws:aoss:us-west-2:989679345636:collection/ny2d41n7rmju74rh4ue2',
 'collectionId': 'ny2d41n7rmju74rh4ue2',
 'vectorIndexName': 'ws-index-',
 'bedrockExecutionRoleArn': 'arn:aws:iam::989679345636:role/advanced-rag-workshop-bedrock_execution_role-us-west-2',
 's3Bucket': '989679345636-us-west-2-advanced-rag-workshop',
 'kbFixedChunk': 'TYG3IXCHCX',
 'kbSemanticChunk': 'N7ZHYZVLOX',
 'kbHierarchicalChunk': 'UDPUVOULM1',
 'kbCustomChunk': 'AD07GOEBQ2'}

In [2]:
kb_chunking_strategy = "custom" # ["fixed", "hierarchical", "semantic", "custom"]

In [3]:
df_costs = load_df_from_csv()
df_costs

Loaded existing file: /home/sagemaker-user/brsk-GTM/Advanced_RAG_Workshop/simplified_labs/embed_algo_costs.csv


Unnamed: 0,chunking_algo,embedding_seconds,input_tokens,invocation_count,total_token_costs
0,fixed,54.113933,0,0,0.0
1,semantic,122.857522,0,0,0.0
2,hierarchical,46.745694,0,0,0.0


### 0. Create a Lambda function with custom chunking logic

In [4]:
# Create or update the Lambda function with custom chunking logic
role_arn, function_arn = create_or_update_custom_chunking_lambda(
    region_name=variables["regionName"],
    account_number=variables["accountNumber"],
    role_name=f"advanced-rag-custom-chunk-{variables['regionName']}-role",
    function_name="advanced-rag-custom-chunk",
    s3_bucket=variables['s3Bucket']
)

IAM role 'advanced-rag-custom-chunk-us-west-2-role' already exists. Using the existing role.
Lambda function 'advanced-rag-custom-chunk' already exists. Updating code...
Lambda function code updated successfully


In [5]:
# Create an S3 bucket for custom chunking if it doesn't exist
create_custom_chunk_s3_bucket(
    s3_bucket=variables["s3Bucket"],
    region_name=variables["regionName"]
)

Bucket '989679345636-us-west-2-advanced-rag-workshop-custom-chunk' already exists.


'989679345636-us-west-2-advanced-rag-workshop-custom-chunk'

### 1. Create a Knowledge Base

In [6]:
# Create the knowledge base with custom chunking
kb = create_kb(
    kb_name="advanced-rag-workshop-custom-chunking",
    kb_description="Knowledge base using Amazon OpenSearch Service as a vector store",
    kb_chunking_type="custom",
    variables=variables
)

{'collectionArn': 'arn:aws:aoss:us-west-2:989679345636:collection/ny2d41n7rmju74rh4ue2', 'vectorIndexName': 'ws-index-custom', 'fieldMapping': {'vectorField': 'vector', 'textField': 'text', 'metadataField': 'text-metadata'}}
Knowledge Base already exists. Retrieving its ID...
Found existing knowledge base with Name: advanced-rag-workshop-custom-chunking and ID: AD07GOEBQ2
OpenSearch Knowledge Response: {
    "createdAt": "2025-03-19 16:39:43.196942+00:00",
    "description": "Knowledge base using Amazon OpenSearch Service as a vector store",
  

### 2. Create a Data Source for the Knowledge Base

In [7]:
# Create the data source with custom transformation configuration
ds_custom_chunk = create_custom_data_source_for_kb(
    kb=kb,
    variables=variables,
    data_source_name="advanced-rag-example",
    function_arn=function_arn
)

Checking for existing data sources in knowledge base AD07GOEBQ2...
Found existing data source 'advanced-rag-example'. Deleting it...
Waiting for data source deletion to complete...
Data source deleted.
Creating new data source 'advanced-rag-example' with custom chunking...
Custom chunking data source created successfully with ID: JX7YUQ0R7E
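
The helper `create_custom_data_source_for_kb` hides the data source configuration. As a rough sketch, the ingestion settings that attach a Lambda function to a data source might look like the dictionary built below, which follows the `vectorIngestionConfiguration` shape accepted by the `bedrock-agent` `create_data_source` API. The function name `build_custom_chunking_config`, the choice of `"NONE"` as the built-in chunking strategy, and the example ARN and S3 URI are assumptions for illustration; the workshop helper may configure these differently.

```python
def build_custom_chunking_config(lambda_arn: str, intermediate_s3_uri: str) -> dict:
    """Build a vectorIngestionConfiguration that applies a Lambda transformation.

    Hypothetical helper. "NONE" disables built-in chunking so the Lambda
    function alone decides chunk boundaries (an illustrative choice).
    """
    return {
        "chunkingConfiguration": {"chunkingStrategy": "NONE"},
        "customTransformationConfiguration": {
            # Bucket where Bedrock stores the Lambda input and output files.
            "intermediateStorage": {"s3Location": {"uri": intermediate_s3_uri}},
            "transformations": [
                {
                    "stepToApply": "POST_CHUNKING",
                    "transformationFunction": {
                        "transformationLambdaConfiguration": {"lambdaArn": lambda_arn}
                    },
                }
            ],
        },
    }


# Placeholder values, not the ones used in this workshop.
config = build_custom_chunking_config(
    lambda_arn="arn:aws:lambda:us-west-2:123456789012:function:example-chunker",
    intermediate_s3_uri="s3://example-intermediate-bucket/",
)
```

The resulting dictionary would be passed as the `vectorIngestionConfiguration` argument when creating the data source.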


### 3. Start Ingestion Job for Amazon Bedrock Knowledge Base pointing to Amazon OpenSearch

> **Note**: The ingestion process will take approximately 2-3 minutes to complete. During this time, the system is processing your documents by:
> 1. Extracting text from the source files
> 2. Chunking the content according to the defined strategy (Fixed / Semantic / Hierarchical / Custom)
> 3. Generating embeddings for each chunk
> 4. Storing the embeddings and associated metadata in the OpenSearch vector database
>
> You'll see status updates as the process progresses. Please wait for the "Job completed successfully" message before proceeding to the next step.

In [8]:
# Create the ingestion job
import time  # imported explicitly rather than relying on the wildcard import

ingestion_start_time = datetime.now(UTC)
time.sleep(5)

create_ingestion_job(
    kb=kb,
    ds_object=ds_custom_chunk,
    variables=variables
)
ingestion_end_time = datetime.now(UTC)

Ingestion job started successfully for kb_name = advanced-rag-workshop-custom-chunking and kb_id = AD07GOEBQ2

running...
running...
running...
running...
running...
running...
Job completed successfully
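
The repeated `running...` lines suggest that `create_ingestion_job` polls the job status until it reaches a terminal state. A minimal sketch of such a loop is shown below, assuming a boto3 `bedrock-agent` client is passed in. The function name `wait_for_ingestion_job` and its polling parameters are hypothetical; `get_ingestion_job` and the status values are taken from the `bedrock-agent` API, but this is not necessarily how the workshop helper is implemented.

```python
import time


def wait_for_ingestion_job(client, kb_id, ds_id, job_id, poll_seconds=10.0, max_polls=60):
    """Poll an ingestion job until it completes, fails, or is stopped.

    `client` is expected to behave like boto3.client("bedrock-agent").
    Returns the terminal status string.
    """
    terminal_statuses = {"COMPLETE", "FAILED", "STOPPED"}
    for _ in range(max_polls):
        response = client.get_ingestion_job(
            knowledgeBaseId=kb_id,
            dataSourceId=ds_id,
            ingestionJobId=job_id,
        )
        status = response["ingestionJob"]["status"]
        if status in terminal_statuses:
            return status
        print("running...")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Ingestion job {job_id} did not finish after {max_polls} polls")
```

Bounding the number of polls keeps a stuck job from blocking the notebook indefinitely.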



In [9]:
time_taken = (ingestion_end_time-ingestion_start_time).total_seconds()
print(f"time taken to ingest into KB = {fmt_n(time_taken)} seconds")

time taken to ingest into KB = 67.00 seconds


In [10]:
model_id = 'arn:aws:bedrock:us-west-2::foundation-model/amazon.titan-embed-text-v2:0'

# use the helper function to get input tokens to embedding LLM and the associated costs
tokens = get_embedding_LLM_costs_for_KB(model_id, ingestion_start_time, ingestion_end_time)

print(json.dumps(tokens, indent=4))

{
    "model_id": "arn:aws:bedrock:us-west-2::foundation-model/amazon.titan-embed-text-v2:0",
    "start_time": "2025-04-30T03:11:54.511984+00:00",
    "end_time": "2025-04-30T03:13:01.511178+00:00",
    "duration in minutes": 1.1166532333333334,
    "input_tokens": 0,
    "invocation_count": 0,
    "per million input token costs": 0.02,
    "total token costs": 0.0
}
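
The `total token costs` value is presumably the input token count scaled by the per-million-token price reported in the same output. A small sketch of that arithmetic follows; the function name `embedding_token_cost` is hypothetical and is not part of `advanced_rag_utils`.

```python
def embedding_token_cost(input_tokens: int, price_per_million_tokens: float) -> float:
    """Return the cost in USD of embedding input_tokens tokens."""
    return input_tokens / 1_000_000 * price_per_million_tokens


# With the values reported above (0 tokens at $0.02 per million), the cost is 0.0.
reported_cost = embedding_token_cost(0, 0.02)
```

Because the reported `input_tokens` is 0, the computed cost is 0.0 regardless of the price.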


In [11]:
# Let's add or update the cost info in the dataframe.
# This will help us compare the costs from various chunking strategies visually.
new_row = {
    'chunking_algo': kb_chunking_strategy,
    'embedding_seconds': tokens['duration in minutes']*60,
    'input_tokens': tokens['input_tokens'],
    'invocation_count': tokens['invocation_count'],
    'total_token_costs': tokens['total token costs']
}
df_costs = update_or_add_row(df_costs, new_row)
df_costs

Added new row for: custom


Unnamed: 0,chunking_algo,embedding_seconds,input_tokens,invocation_count,total_token_costs
0,fixed,54.113933,0,0,0.0
1,semantic,122.857522,0,0,0.0
2,hierarchical,46.745694,0,0,0.0
3,custom,66.999194,0,0,0.0


### 4. Retrieve

In [12]:
# Define the query for retrieving relevant documents
query = "What were net incomes of Amazon in 2022, 2023 and 2024?"

# Get the knowledge base ID from the variables
kb_id = variables.get("kbCustomChunk")

# Retrieve results from the knowledge base
chunks_from_kb = retrieve_from_kb(
    query=query,
    kb={"knowledgeBaseId": kb_id},
    n_chunks=3,
    variables=variables
)


# Optionally specify a minimum similarity score. With it applied, we should see
# fewer chunks retrieved than in the invocation above.
min_score = 0.50

# Uncomment to retrieve again with the minimum score applied:
# chunks_from_kb = retrieve_from_kb(query, kb, n_chunks, variables, min_score)

print(json.dumps(chunks_from_kb, indent=2))


[
  {
    "content": "computation of earnings per share: Basic 10,005 10,117 10,189 Diluted 10,198 10,296 10,189 See accompanying notes to consolidated financial statements. 37Table of Contents AMAZON.COM, INC. CONSOLIDATED STATEMENTS OF COMPREHENSIVE INCOME (LOSS) (in millions) Year Ended December 31, 2020 2021 2022 Net income (loss) $ 21,331 $ 33,364 $ (2,722) Other comprehensive income (loss): Foreign currency translation adjustments, net of tax of $(36), $47, and $100 561 (819) (2,586) Net change in unrealized gains (losses) on available-for-sale debt securities: Unrealized gains (losses), net of tax of $(83), $72, and $159 273 (343) (823) Reclassification adjustment for losses (gains) included in \u201cOther",
    "metadata": {
      "x-amz-bedrock-kb-source-uri": "s3://989679345636-us-west-2-advanced-rag-workshop/data/pdf_documents/Amazon-10k-2023.pdf",
      "x-amz-bedrock-kb-document-page-number": 1.0,
      "year": 2023.0,
      "docType": "10K Report",
      "x-amz-bedrock-kb

> **Note**: After creating the knowledge base, you can explore its details and settings in the Amazon Bedrock console. This gives you a more visual interface to understand how the knowledge base is structured.
> 
> **[➡️ View your Knowledge Bases in the AWS Console](https://us-west-2.console.aws.amazon.com/bedrock/home?region=us-west-2#/knowledge-bases)**
>
> In the console, you can:
> - See all your knowledge bases in one place
> - View ingestion status and statistics
> - Test queries through the built-in chat interface
> - Modify settings and configurations

In [13]:
# Let's summarize with total chunks, minimum score, maximum score, average score, 
# and lastly the number of chunks with a score more than a specified threshold.
score_threshold = 0.40
score_structure = analyze_chunk_scores_above_threshold(chunks_from_kb, score_threshold)
print(json.dumps(score_structure, indent=4))

{
    "total_chunks": 3,
    "min_score": 0.5282634,
    "max_score": 0.5553531,
    "avg_score": 0.53924048,
    "count_above_threshold": 3
}
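
For readers without access to `advanced_rag_utils`, a summary like the one above could be computed roughly as follows. This is a hypothetical re-implementation named `summarize_scores`, not the helper's actual source. It assumes each chunk dictionary carries a numeric `score` key and that "above threshold" means strictly greater than the threshold.

```python
def summarize_scores(chunks, threshold):
    """Summarize similarity scores for a list of retrieved chunks."""
    scores = [chunk["score"] for chunk in chunks]
    if not scores:
        # No chunks retrieved: report an empty summary instead of dividing by zero.
        return {
            "total_chunks": 0,
            "min_score": None,
            "max_score": None,
            "avg_score": None,
            "count_above_threshold": 0,
        }
    return {
        "total_chunks": len(scores),
        "min_score": min(scores),
        "max_score": max(scores),
        "avg_score": sum(scores) / len(scores),
        "count_above_threshold": sum(1 for score in scores if score > threshold),
    }
```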


In [14]:
#Let's print the costs of running this notebook.

model_id = 'arn:aws:bedrock:us-west-2::foundation-model/amazon.titan-embed-text-v2:0'

notebook_end_time = datetime.now(UTC)
tokens = get_bedrock_tokens(model_id, notebook_start_time, notebook_end_time, 5)
print(json.dumps(tokens, indent=4))
print(f"Cost of running this notebook is approximately ${tokens['total token costs']}")

{
    "model_id": "arn:aws:bedrock:us-west-2::foundation-model/amazon.titan-embed-text-v2:0",
    "start_time": "2025-04-30T03:11:17.494770+00:00",
    "end_time": "2025-04-30T03:13:02.151004+00:00",
    "duration in minutes": 1.7442705666666667,
    "input_tokens": 0,
    "output_tokens": 0,
    "invocation_count": 0,
    "per million input token costs": 0.02,
    "per million output token costs": 0,
    "input token costs": 0.0,
    "output token costs": 0.0,
    "total token costs": 0.0,
    "average token costs per invocation": 0,
    "token costs per MILLION such invocations": 0
}
Cost of running this notebook is approximately $0.0
