## Create a Knowledge Base with the Semantic Chunking Strategy

#### Concept

Semantic chunking analyzes the relationships within a text and divides it into meaningful, complete chunks based on the semantic similarity computed by the embedding model. This preserves the integrity of the information during retrieval, which helps produce accurate and contextually appropriate results. Knowledge Bases for Amazon Bedrock first divides documents into chunks based on the specified token size. It then creates an embedding for each chunk and combines chunks that are close in the embedding space, according to the similarity threshold and buffer size, to form new chunks. As a result, chunk size can vary from chunk to chunk.
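
To make the idea concrete, here is a minimal, self-contained sketch of percentile-based semantic splitting. It is an illustration only and not how Bedrock implements the strategy internally: it works on sentences rather than token-sized chunks, and `toy_embed`, `cosine_distance`, `percentile`, and `semantic_chunks` are hypothetical helpers. In particular, `toy_embed` is a bag-of-words stand-in for a real embedding model such as Titan.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return 1.0 - (dot / norm if norm else 0.0)


def percentile(values, pct):
    """Linear-interpolation percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def semantic_chunks(text, buffer_size=1, breakpoint_percentile=95):
    """Split text where adjacent sentence windows are unusually dissimilar."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) < 2:
        return sentences

    # Embed every sentence together with `buffer_size` neighbours on each side
    windows = [
        " ".join(sentences[max(0, i - buffer_size): i + buffer_size + 1])
        for i in range(len(sentences))
    ]
    vectors = [toy_embed(w) for w in windows]

    # Distance between each window and the next one
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, breakpoint_percentile)

    # Start a new chunk wherever the distance exceeds the percentile threshold
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sample = (
    "The cat sat on the mat. The cat chased the mouse. "
    "Stock prices rose sharply today. Stock markets closed higher today."
)
for chunk in semantic_chunks(sample, buffer_size=0):
    print(chunk)
```

The sketch leaves out the maximum-token cap; a fuller version would also split any chunk that grows past that limit.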

#### Benefits

* By focusing on the text’s meaning and context, semantic chunking can improve retrieval quality. Use it where maintaining the semantic integrity of the text is crucial.

* Although this method is more computationally intensive than fixed-size chunking, it can be beneficial for documents where contextual boundaries aren’t clear, such as legal documents or technical manuals.

In [None]:
import json
with open("variables.json", "r") as f:
    variables = json.load(f)

variables

### 1. Create a Knowledge Base

In [None]:
# Helper function definition
from retrying import retry
import boto3

# Initialize the Bedrock agent client with the specified region
bedrock_agent = boto3.client("bedrock-agent", region_name=variables["regionName"])

@retry(wait_random_min=1000, wait_random_max=2000, stop_max_attempt_number=3)
def create_knowledge_base_func(name, description, chunking_type):
    # The embedding model used by Bedrock to embed ingested documents and real-time prompts
    embedding_model_arn = f"arn:aws:bedrock:{variables['regionName']}::foundation-model/amazon.titan-embed-text-v2:0"
    
    # Configuration for OpenSearch Serverless to store vectors and associated metadata
    opensearch_serverless_configuration = {
        "collectionArn": variables["collectionArn"],  # ARN for the OpenSearch collection
        "vectorIndexName": variables["vectorIndexName"] + chunking_type,  # Vector index name appended with chunking type
        "fieldMapping": {  # Mapping the fields for vector, text, and metadata
            "vectorField": "vector",
            "textField": "text",
            "metadataField": "text-metadata"
        }
    }
    
    # Printing the configuration to verify before creating the Knowledge Base
    print(opensearch_serverless_configuration)
    
    # Create the Knowledge Base using Bedrock Agent's API
    create_kb_response = bedrock_agent.create_knowledge_base(
        name=name,  # Knowledge base name
        description=description,  # Knowledge base description
        roleArn=variables["bedrockExecutionRoleArn"],  # IAM role ARN for Bedrock to assume
        knowledgeBaseConfiguration={  # Configuration for the knowledge base
            "type": "VECTOR",  # Type of Knowledge Base: VECTOR for vectorized data
            "vectorKnowledgeBaseConfiguration": {
                "embeddingModelArn": embedding_model_arn  # ARN for the embedding model
            }
        },
        storageConfiguration={  # Storage configuration for the knowledge base
            "type": "OPENSEARCH_SERVERLESS",  # Using OpenSearch Serverless as the storage option
            "opensearchServerlessConfiguration": opensearch_serverless_configuration  # OpenSearch configuration details
        }
    )
    
    # Return the created knowledge base details
    return create_kb_response["knowledgeBase"]

In [None]:
import boto3
import json

try:
    # Create a knowledge base using the predefined function
    kb = create_knowledge_base_func(
        name="advanced-rag-workshop-semantic-chunking",
        description="Knowledge base using Amazon OpenSearch Service as a vector store",
        chunking_type="semantic"
    )

    # Retrieve details of the newly created knowledge base
    get_kb_response = bedrock_agent.get_knowledge_base(knowledgeBaseId=kb['knowledgeBaseId'])

    # Update the variables dictionary with the new knowledge base ID
    variables["kbSemanticChunk"] = kb['knowledgeBaseId']

    # Save updated variables to a JSON file, handling datetime serialization
    with open("variables.json", "w") as f:
        json.dump(variables, f, indent=4, default=str)  # Convert datetime to string

    # Print the retrieved knowledge base response in a readable format
    print(f'OpenSearch Knowledge Response: {json.dumps(get_kb_response, indent=4, default=str)}')
    
except Exception as e:
    # Check if error message indicates the knowledge base already exists
    error_message = str(e).lower()
    if any(phrase in error_message for phrase in ["already exist", "duplicate", "already been created"]):
        print("Knowledge Base already exists. Retrieving its ID...")
        
        # List all knowledge bases to find the one that already exists
        list_kb_response = bedrock_agent.list_knowledge_bases()
        
        # Look for a knowledge base with the desired name
        for kb in list_kb_response.get('knowledgeBaseSummaries', []):
            if kb['name'] == "advanced-rag-workshop-semantic-chunking":
                kb_id = kb['knowledgeBaseId']
                print(f"Found existing knowledge base with ID: {kb_id}")
                
                # Get the details of the existing knowledge base
                get_kb_response = bedrock_agent.get_knowledge_base(knowledgeBaseId=kb_id)
                
                # Re-read variables.json so that only the semantic chunking entry is changed
                try:
                    # Read existing variables
                    with open("variables.json", "r") as f:
                        existing_variables = json.load(f)
                except (FileNotFoundError, json.JSONDecodeError):
                    # If file doesn't exist or is invalid JSON
                    existing_variables = {}
                
                # Update only the semantic chunking value
                existing_variables["kbSemanticChunk"] = kb_id
                                
                # Write back all variables
                with open("variables.json", "w") as f:
                    json.dump(existing_variables, f, indent=4, default=str)
                
                # Print the retrieved knowledge base response
                print(f'OpenSearch Knowledge Response: {json.dumps(get_kb_response, indent=4, default=str)}')
                break        
        else:
            print("Could not find a knowledge base with the specified name.")
    else:
        # If it's a different error, re-raise it with its original traceback
        raise
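
The fallback above inspects only the first page returned by `list_knowledge_bases`. In an account with many knowledge bases, the lookup may need to follow `nextToken`. The sketch below shows one way to do that; `find_knowledge_base_id` is a hypothetical helper name, and `client` is assumed to be a `bedrock-agent` client like `bedrock_agent` above.

```python
def find_knowledge_base_id(client, name):
    """Return the ID of the knowledge base called `name`, or None if absent."""
    kwargs = {}
    while True:
        page = client.list_knowledge_bases(**kwargs)
        for summary in page.get("knowledgeBaseSummaries", []):
            if summary["name"] == name:
                return summary["knowledgeBaseId"]
        token = page.get("nextToken")
        if not token:
            # No more pages and no match
            return None
        # Ask for the next page on the following iteration
        kwargs["nextToken"] = token
```

For example, `find_knowledge_base_id(bedrock_agent, "advanced-rag-workshop-semantic-chunking")` would return the existing ID, or `None` if no knowledge base with that name exists.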

### 2. Create a Data Source for the Knowledge Base

In [None]:
import time

# Define the chunking strategy configuration for semantic chunking
chunkingStrategyConfiguration = {
    "chunkingStrategy": "SEMANTIC",  # Using semantic chunking strategy
    "semanticChunkingConfiguration": {
        "maxTokens": 300,  # Maximum number of tokens per chunk
        "bufferSize": 1,   # Number of neighboring sentences on each side that are embedded together with a sentence
        "breakpointPercentileThreshold": 95  # Dissimilarity percentile at which a chunk boundary is placed
    }
}

# Configuration for the data source, here it is an S3 bucket where documents will be ingested from
s3Configuration = {
    "bucketArn": f"arn:aws:s3:::{variables['s3Bucket']}",  # S3 bucket ARN
    # "inclusionPrefixes": ["shareholder_letters"] # Optional: can be used to filter specific files in S3
}

# If this cell was run before, the variable 'ds_semantic_chunk' holds the previous data source; delete it first
if 'ds_semantic_chunk' in locals():
    try:
        # Deleting the existing data source
        bedrock_agent.delete_data_source(
            knowledgeBaseId = kb['knowledgeBaseId'],
            dataSourceId = ds_semantic_chunk["dataSourceId"],
        )
        time.sleep(10)  # Give the deletion time to finish before creating a new data source
    except Exception as e:
        # Report deletion errors (e.g., the data source no longer exists) and continue
        print(e)

# Create a new data source for ingestion into the knowledge base
create_ds_response = bedrock_agent.create_data_source(
    name = 'advanced-rag-example',  # Data source name
    description = "A data source for Advanced RAG workshop",  # Description for the data source
    knowledgeBaseId = kb['knowledgeBaseId'],  # Reference to the knowledge base
    dataSourceConfiguration = {
        "type": "S3",  # Data source type is S3
        "s3Configuration": s3Configuration  # S3 configuration for the data source
    },
    vectorIngestionConfiguration = {
        "chunkingConfiguration": chunkingStrategyConfiguration  # Apply the defined chunking strategy
    }
)

# Save the created data source in a variable for later use
ds_semantic_chunk = create_ds_response["dataSource"]

# Display the created data source object
ds_semantic_chunk
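
When experimenting with different semantic chunking settings, it can help to validate the values before calling `create_data_source`. The sketch below uses a hypothetical helper, `build_semantic_chunking_config`. The allowed ranges in it (buffer size 0 or 1, percentile 50 to 99, at least 1 token) are assumptions based on the Bedrock API reference at the time of writing and should be checked against the current documentation.

```python
def build_semantic_chunking_config(max_tokens=300, buffer_size=1, breakpoint_percentile=95):
    """Build a chunkingConfiguration dict for the SEMANTIC strategy."""
    # Assumed limits; verify against the current Bedrock API reference
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    if buffer_size not in (0, 1):
        raise ValueError("buffer_size must be 0 or 1")
    if not 50 <= breakpoint_percentile <= 99:
        raise ValueError("breakpoint_percentile must be between 50 and 99")

    return {
        "chunkingStrategy": "SEMANTIC",
        "semanticChunkingConfiguration": {
            "maxTokens": max_tokens,
            "bufferSize": buffer_size,
            "breakpointPercentileThreshold": breakpoint_percentile,
        },
    }


# With the defaults, this reproduces the configuration used in this notebook
print(build_semantic_chunking_config())
```

Lowering `breakpoint_percentile` places boundaries at smaller dissimilarities, which generally produces more, smaller chunks.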

### 3. Start Ingestion Job for Amazon Bedrock Knowledge Base pointing to Amazon OpenSearch

> **Note**: The ingestion process will take approximately 2-3 minutes to complete. During this time, the system is processing your documents by:
> 1. Extracting text from the source files
> 2. Chunking the content according to the defined strategy (Fixed / Semantic / Hierarchical / Custom)
> 3. Generating embeddings for each chunk
> 4. Storing the embeddings and associated metadata in the OpenSearch vector database
>
> You'll see "running..." printed while the job is in progress. Please wait for the "Job completed successfully" message before proceeding to the next step.

In [None]:
import time

# List to store ingestion jobs
ingest_jobs=[]

# Start an ingestion job for the given data source and knowledge base
try:
    # Initiate the ingestion job and capture the response
    start_job_response = bedrock_agent.start_ingestion_job(
        knowledgeBaseId = kb['knowledgeBaseId'],  # Knowledge base ID
        dataSourceId = ds_semantic_chunk["dataSourceId"]  # Data source ID
    )
    job = start_job_response["ingestionJob"]  # Retrieve the ingestion job details
    print("Ingestion job started successfully\n")

    # Poll until the job reaches a terminal status, so a failed or stopped job does not loop forever
    while job['status'] not in ('COMPLETE', 'FAILED', 'STOPPED'):
        print("running...")
        time.sleep(10)  # Wait between status checks
        # Check the status of the ingestion job
        get_job_response = bedrock_agent.get_ingestion_job(
            knowledgeBaseId = kb['knowledgeBaseId'],
            dataSourceId = ds_semantic_chunk["dataSourceId"],
            ingestionJobId = job["ingestionJobId"]  # Use the ingestion job ID to fetch the status
        )
        job = get_job_response["ingestionJob"]  # Update the job status

    if job['status'] == 'COMPLETE':
        print("Job completed successfully\n")
    else:
        # Report the terminal status and any failure reasons returned by the service
        print(f"Job ended with status {job['status']}\n")
        print(job.get('failureReasons', []))

except Exception as e:
    # Handle errors raised while starting or polling the job
    print("Couldn't start or monitor the job.\n")
    print(e)
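
A polling loop with no upper bound can hang a notebook if a job stalls. As a variation, the sketch below wraps the wait in a small helper with a timeout. `wait_for_ingestion` and its parameters are hypothetical names, and the set of terminal statuses is an assumption to check against the `get_ingestion_job` documentation.

```python
import time

# Statuses treated as final (assumed; see the get_ingestion_job documentation)
TERMINAL_STATUSES = {"COMPLETE", "FAILED", "STOPPED"}


def wait_for_ingestion(get_status, poll_seconds=10, timeout_seconds=900, sleep=time.sleep):
    """Poll `get_status()` until it returns a terminal status or the timeout elapses."""
    waited = 0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Ingestion job still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Example wiring with the objects defined in this notebook:
# final_status = wait_for_ingestion(
#     lambda: bedrock_agent.get_ingestion_job(
#         knowledgeBaseId=kb["knowledgeBaseId"],
#         dataSourceId=ds_semantic_chunk["dataSourceId"],
#         ingestionJobId=job["ingestionJobId"],
#     )["ingestionJob"]["status"]
# )
```

Passing the status lookup as a callable keeps the helper independent of boto3, which makes it easy to try out with a stub.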

### 4. Retrieve

In [None]:
import boto3

# Initialize the Bedrock agent runtime client to interact with the Bedrock service
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name=variables["regionName"])

# Define the query to retrieve relevant documents from the knowledge base
query = "What were net incomes of Amazon in 2022, 2023 and 2024?" 

# Use the Bedrock agent runtime to retrieve relevant documents from the knowledge base
relevant_documents_os = bedrock_agent_runtime.retrieve(
    retrievalQuery= {
        'text': query  # The text query for retrieving documents
    },
    knowledgeBaseId=kb['knowledgeBaseId'],  # The knowledge base ID to search within
    retrievalConfiguration= {
        'vectorSearchConfiguration': {
            'numberOfResults': 3  # Fetch the top 3 documents that closely match the query
        }
    }
)

# Print the text of each retrieved chunk
print(json.dumps([i["content"]["text"] for i in relevant_documents_os["retrievalResults"]], indent=2))
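
Besides the chunk text, each retrieval result also carries a relevance score and the location of the source document. The sketch below flattens a `retrieve` response into a compact summary. `summarize_results` is a hypothetical helper, and the sample response is made-up data that only mirrors the response shape.

```python
def summarize_results(response, max_chars=200):
    """Return rank, score, source URI, and a text preview for each result."""
    rows = []
    for rank, result in enumerate(response.get("retrievalResults", []), start=1):
        text = result.get("content", {}).get("text", "")
        uri = result.get("location", {}).get("s3Location", {}).get("uri", "unknown")
        rows.append({
            "rank": rank,
            "score": result.get("score"),
            "source": uri,
            "preview": text[:max_chars],
        })
    return rows


# Made-up response for illustration; real values come from bedrock_agent_runtime.retrieve
sample_response = {
    "retrievalResults": [
        {
            "content": {"text": "Example chunk text about net income."},
            "location": {"type": "S3", "s3Location": {"uri": "s3://example-bucket/report.pdf"}},
            "score": 0.61,
        }
    ]
}
for row in summarize_results(sample_response):
    print(row)
```

In the notebook, `summarize_results(relevant_documents_os)` would give the same view for the query above, which makes it easier to compare chunking strategies by score and source.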

> **Note**: After creating the knowledge base, you can explore its details and settings in the Amazon Bedrock console. This gives you a more visual interface to understand how the knowledge base is structured.
> 
> **[➡️ View your Knowledge Bases in the AWS Console](https://us-west-2.console.aws.amazon.com/bedrock/home?region=us-west-2#/knowledge-bases)**
>
> In the console, you can:
> - See all your knowledge bases in one place
> - View ingestion status and statistics
> - Test queries through the built-in chat interface
> - Modify settings and configurations