## Create a Knowledge Base with a Fixed Chunking Strategy

## Overview

In this notebook, we will implement a knowledge base using a fixed chunking strategy. Here are the key steps we'll perform:

1. **Create a Knowledge Base**: Set up an Amazon Bedrock Knowledge Base with fixed-size chunking configuration that will store and retrieve our vector embeddings.

2. **Create a Data Source**: Connect our Knowledge Base to the documents we uploaded to S3 in the previous notebook.

3. **Start Ingestion Job**: Begin the process of transforming our documents into chunks, creating embeddings, and storing them in our vector database.

4. **Retrieve**: Test our Knowledge Base by retrieving relevant information based on a sample query.

#### Concept

**Fixed Chunking**: Divides your documents into fixed-size chunks, regardless of the content within them. Each chunk contains at most a predefined number of tokens or characters, which keeps the data uniformly organized.

Fixed chunking is useful when you want to ensure that your chunks are of a consistent size, making them easier to process and retrieve in a predictable manner. The document is split into sections of equal length, and each section becomes a separate chunk. This method works well when the content is relatively homogeneous, and the chunk boundaries are not as crucial to understanding the underlying context.
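
As a rough illustration of the idea, the sketch below splits a token list into fixed-size windows with a percentage overlap. It is a minimal sketch, not the splitter Amazon Bedrock uses: the function name `fixed_size_chunks` is hypothetical, and a whitespace split stands in for a real tokenizer.

```python
def fixed_size_chunks(tokens, max_tokens, overlap_percentage):
    """Split a token list into windows of at most max_tokens with a percentage overlap."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap_percentage < 100:
        raise ValueError("overlap_percentage must be in [0, 100)")

    overlap = int(max_tokens * overlap_percentage / 100)
    stride = max_tokens - overlap  # how far each window advances (always >= 1)

    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the current window already reaches the end of the document
        start += stride
    return chunks


text = "one two three four five six seven eight nine ten"
tokens = text.split()  # assumption: whitespace split in place of a real tokenizer
chunks = fixed_size_chunks(tokens, max_tokens=4, overlap_percentage=25)
for chunk in chunks:
    print(" ".join(chunk))
```

Each window shares its first token with the end of the previous one, which is what the overlap setting is meant to achieve: a sentence that straddles a boundary still appears whole in at least part of a neighboring chunk.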

#### Benefits

- **Uniformity**: Every chunk has the same maximum size, so per-chunk processing cost is predictable. This simplifies batch operations and parallel processing.
- **Simplified Retrieval**: Because chunk sizes are uniform, you know in advance roughly how much text each retrieved chunk contains, which helps with performance tuning and scalability on large datasets.
- **Performance Optimization**: Fixed chunks help you control the computational cost of chunking and retrieval. Equal-sized chunks reduce the chance of bottlenecks caused by unusually large chunks in large-scale document processing.

> **Note:** While fixed chunking can be efficient for certain use cases, it may not preserve the natural semantic boundaries of the content, such as paragraphs or sections. This may lead to chunks that start or end at arbitrary places, potentially cutting off context in the middle of a sentence or idea.
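
The boundary problem described in the note can be seen with a small character-based example. This is a sketch under simple assumptions: the helper names `char_chunks` and `ends_mid_sentence` are hypothetical, and "ends a sentence" is approximated by checking for `.`, `!`, or `?` at the end of the chunk.

```python
import re


def char_chunks(text, size):
    """Cut text into consecutive pieces of `size` characters (the last may be shorter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def ends_mid_sentence(chunk):
    """Return True when the chunk does not end on sentence-final punctuation."""
    return re.search(r"[.!?]\s*$", chunk) is None


sample = "Chunking is simple. Boundaries are arbitrary. Context can be lost."
pieces = char_chunks(sample, 20)
for piece in pieces:
    marker = "cut" if ends_mid_sentence(piece) else "ok "
    print(f"[{marker}] {piece!r}")
```

With a chunk size of 20 characters, the middle pieces split the word "arbitrary" and end partway through a sentence, while the first and last pieces happen to end cleanly.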

#### Best Use Cases

Fixed chunking is suitable for cases where:
- **Homogeneous content**: The content is consistent, so exact chunk boundaries matter less.
- **Performance**: You need uniform-sized chunks for predictable processing or optimization of large-scale systems.
- **Simplified text processing**: Chunk boundaries do not need to match natural semantic structures like paragraphs or sentences.

Examples include:
- **General document indexing**: When large datasets are involved, and uniform chunk sizes optimize retrieval.
- **Text summarization**: Fixed chunking is helpful when generating summaries from uniformly sized data pieces.


In [None]:
import json
with open("variables.json", "r") as f:
    variables = json.load(f)

variables

### 1. Create a Knowledge Base
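
The cells in this notebook read several keys from `variables.json`. Before creating the knowledge base, it can help to confirm that none of them is missing. The check below is a minimal sketch: the helper name `missing_keys` is hypothetical, the key list is taken from the keys this notebook reads, and the example dictionary is made up for demonstration.

```python
# Keys that the cells in this notebook read from variables.json
REQUIRED_KEYS = [
    "regionName",
    "collectionArn",
    "vectorIndexName",
    "bedrockExecutionRoleArn",
    "s3Bucket",
]


def missing_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in config."""
    return [key for key in required if not config.get(key)]


# Hypothetical, incomplete configuration used only to demonstrate the check
example = {"regionName": "us-west-2", "s3Bucket": "my-example-bucket"}
print(missing_keys(example))
```

In the notebook itself, calling `missing_keys(variables)` should return an empty list if the earlier setup notebooks ran to completion.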

In [None]:
from retrying import retry
import boto3

# Initialize the Bedrock Agent client using the provided AWS region
bedrock_agent = boto3.client("bedrock-agent", region_name=variables["regionName"])

# Retry decorator: make at most 3 attempts in total, waiting a random 1-2 seconds between attempts
@retry(wait_random_min=1000, wait_random_max=2000, stop_max_attempt_number=3)
def create_knowledge_base_func(name, description, chunking_type):
    """
    Creates a knowledge base in Amazon Bedrock with OpenSearch Serverless as the vector store.
    
    Parameters:
        name (str): The name of the knowledge base.
        description (str): A description of the knowledge base.
        chunking_type (str): The type of chunking strategy applied to vector indexing.

    Returns:
        dict: The response containing details of the created knowledge base.
    """
    
    # Define the ARN of the embedding model used for vectorization
    embedding_model_arn = f"arn:aws:bedrock:{variables['regionName']}::foundation-model/amazon.titan-embed-text-v2:0"

    # Configure OpenSearch Serverless for vector storage
    opensearch_serverless_configuration = {
        "collectionArn": variables["collectionArn"],  # ARN of the OpenSearch collection
        "vectorIndexName": variables["vectorIndexName"] + chunking_type,  # Index name based on chunking strategy
        "fieldMapping": {  # Define field mappings for vectors, text, and metadata
            "vectorField": "vector",
            "textField": "text",
            "metadataField": "text-metadata"
        }
    }

    print(opensearch_serverless_configuration)  # Print configuration for debugging

    # Create the knowledge base in Amazon Bedrock
    create_kb_response = bedrock_agent.create_knowledge_base(
        name=name,
        description=description,
        roleArn=variables["bedrockExecutionRoleArn"],  # IAM Role ARN for Bedrock execution
        knowledgeBaseConfiguration={
            "type": "VECTOR",
            "vectorKnowledgeBaseConfiguration": {
                "embeddingModelArn": embedding_model_arn  # Reference to the embedding model
            }
        },
        storageConfiguration={
            "type": "OPENSEARCH_SERVERLESS",
            "opensearchServerlessConfiguration": opensearch_serverless_configuration
        }
    )

    return create_kb_response["knowledgeBase"]  # Return the created knowledge base details

In [None]:
import boto3
import json

# Create a knowledge base using the predefined function
kb = create_knowledge_base_func(
    name="advanced-rag-workshop-fixed-chunking",
    description="Knowledge base using Amazon OpenSearch Serverless as a vector store",
    chunking_type="fixed"
)

# Retrieve details of the newly created knowledge base
get_kb_response = bedrock_agent.get_knowledge_base(knowledgeBaseId=kb['knowledgeBaseId'])

# Update the variables dictionary with the new knowledge base ID
variables["kbFixedChunk"] = kb['knowledgeBaseId']

# Save updated variables to a JSON file, handling datetime serialization
with open("variables.json", "w") as f:
    json.dump(variables, f, indent=4, default=str)  # Convert datetime to string

# Print the retrieved knowledge base response in a readable format
print(f'Knowledge base details: {json.dumps(get_kb_response, indent=4, default=str)}')

### 2. Create a Data Source for the Knowledge Base
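
Knowledge base creation is asynchronous, so it can be worth confirming that the knowledge base is ready before attaching a data source. The sketch below polls `get_knowledge_base` and assumes the response reports a `status` of `CREATING` followed by `ACTIVE`. The helper name `wait_for_kb_active` and the `FakeClient` class are hypothetical; the fake client only exists so the example runs without AWS access.

```python
import time


def wait_for_kb_active(client, knowledge_base_id, poll_seconds=5, max_polls=24):
    """Poll get_knowledge_base until the status is ACTIVE or the polls run out."""
    for _ in range(max_polls):
        response = client.get_knowledge_base(knowledgeBaseId=knowledge_base_id)
        status = response["knowledgeBase"]["status"]
        if status == "ACTIVE":
            return status
        if status != "CREATING":
            # Any other status (for example FAILED) will not resolve by waiting
            raise RuntimeError(f"Knowledge base entered unexpected status: {status}")
        time.sleep(poll_seconds)
    raise TimeoutError("Knowledge base did not become ACTIVE in time")


class FakeClient:
    """Stand-in for the boto3 client so the sketch runs without AWS access."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def get_knowledge_base(self, knowledgeBaseId):
        return {
            "knowledgeBase": {
                "knowledgeBaseId": knowledgeBaseId,
                "status": next(self._statuses),
            }
        }


result = wait_for_kb_active(FakeClient(["CREATING", "ACTIVE"]), "KB123", poll_seconds=0)
print(result)
```

In the notebook, the same helper could be called as `wait_for_kb_active(bedrock_agent, kb['knowledgeBaseId'])`.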

In [None]:
import time

# Define the chunking strategy for data ingestion.
# This specifies how the text will be divided into smaller chunks before storing in OpenSearch.
chunking_strategy_configuration = {
    "chunkingStrategy": "FIXED_SIZE",  # Use fixed-size chunks
    "fixedSizeChunkingConfiguration": {
        "maxTokens": 1024,  # Maximum number of tokens per chunk
        "overlapPercentage": 20  # Overlap percentage between consecutive chunks
    }
}

# Define the S3 bucket configuration for the data source.
# This tells Bedrock where to fetch the documents from.
s3_configuration = {
    "bucketArn": f"arn:aws:s3:::{variables['s3Bucket']}"  # S3 bucket ARN for document storage
    # "inclusionPrefixes": ["shareholder_letters"]  # Uncomment to filter specific folder prefixes
}

# Check if a data source (`ds_fixed_chunk`) already exists in local variables.
# If it exists, delete it before creating a new one.
if 'ds_fixed_chunk' in locals():
    try:
        bedrock_agent.delete_data_source(
            knowledgeBaseId=kb['knowledgeBaseId'],  # ID of the Knowledge Base
            dataSourceId=ds_fixed_chunk["dataSourceId"],  # ID of the existing data source
        )
        time.sleep(15)  # Wait for deletion to complete before proceeding
    except Exception as e:
        # Report the problem and continue, even if deletion fails
        print(f"Error while deleting existing data source: {e}")

# Create a new data source in the Knowledge Base.
create_ds_response = bedrock_agent.create_data_source(
    name="advanced-rag-example",  # Name of the data source
    description="A data source for Advanced RAG workshop",  # Description of the data source
    knowledgeBaseId=kb['knowledgeBaseId'],  # Associate with the correct Knowledge Base
    dataSourceConfiguration={
        "type": "S3",
        "s3Configuration": s3_configuration  # Use the defined S3 configuration
    },
    vectorIngestionConfiguration={
        "chunkingConfiguration": chunking_strategy_configuration  # Define chunking settings
    }
)

# Store the created data source object for future reference
ds_fixed_chunk = create_ds_response["dataSource"]

# Print the newly created data source information
ds_fixed_chunk
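
To get a feel for what `maxTokens` and `overlapPercentage` imply, the following back-of-the-envelope sketch estimates how many chunks a document of a given token length would produce. It assumes each chunk after the first advances by `maxTokens` minus the overlap. How the service actually counts tokens and applies overlap is not stated in this notebook, so treat the numbers as an approximation. The function name `estimate_chunk_count` is hypothetical.

```python
import math


def estimate_chunk_count(document_tokens, max_tokens=1024, overlap_percentage=20):
    """Estimate the number of fixed-size chunks for a document of document_tokens tokens."""
    if document_tokens <= 0:
        return 0
    if document_tokens <= max_tokens:
        return 1
    overlap = int(max_tokens * overlap_percentage / 100)
    stride = max_tokens - overlap
    # The first chunk covers max_tokens; each later chunk adds `stride` new tokens
    return 1 + math.ceil((document_tokens - max_tokens) / stride)


for n in (500, 1024, 5000):
    print(n, estimate_chunk_count(n))
```

With the settings used above (1024 tokens, 20% overlap), the overlap is 204 tokens and each additional chunk contributes 820 new tokens, so a 5,000-token document comes out at roughly six chunks under these assumptions.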

### 3. Start an Ingestion Job for the Amazon Bedrock Knowledge Base Backed by Amazon OpenSearch Serverless

> **Note**: The ingestion process will take approximately 2-3 minutes to complete. During this time, the system is processing your documents by:
> 1. Extracting text from the source files
> 2. Chunking the content according to the defined strategy (Fixed / Semantic / Hierarchical / Custom)
> 3. Generating embeddings for each chunk
> 4. Storing the embeddings and associated metadata in the OpenSearch vector database
>
> You'll see status updates as the process progresses. Please wait for the "Ingestion job completed successfully" message before proceeding to the next step.

In [None]:
import time

# List to keep track of all ingestion jobs
ingest_jobs = []

# Start an ingestion job for the data source
try:
    start_job_response = bedrock_agent.start_ingestion_job(
        knowledgeBaseId=kb['knowledgeBaseId'],  # ID of the Knowledge Base
        dataSourceId=ds_fixed_chunk["dataSourceId"]  # ID of the associated data source
    )
    
    # Extract job details
    job = start_job_response["ingestionJob"]
    print("Ingestion job started successfully.")

    # Poll the job status until it reaches a terminal state.
    # Checking only for 'COMPLETE' would loop forever if the job fails or is stopped.
    while job['status'] not in ('COMPLETE', 'FAILED', 'STOPPED'):
        time.sleep(30)  # Wait before checking the status again

        get_job_response = bedrock_agent.get_ingestion_job(
            knowledgeBaseId=kb['knowledgeBaseId'],  # ID of the Knowledge Base
            dataSourceId=ds_fixed_chunk["dataSourceId"],  # ID of the data source
            ingestionJobId=job["ingestionJobId"]  # ID of the running ingestion job
        )
        
        # Update job status
        job = get_job_response["ingestionJob"]
        print(f"Job status: {job['status']}")  # Log the current job status

    if job['status'] == 'COMPLETE':
        print("Ingestion job completed successfully.")
    else:
        # FAILED or STOPPED: show any reasons reported by the service
        print(f"Ingestion job ended with status {job['status']}: {job.get('failureReasons', [])}")

except Exception as e:
    print("Error: The ingestion job could not be started or monitored.")
    print(e)  # Print the exact error message for debugging
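
Once the job finishes, the job description can be condensed into a short summary. The sketch below assumes the `ingestionJob` dictionary carries a `statistics` block with document counters and an optional `failureReasons` list. The helper name `summarize_ingestion_job` is hypothetical, and the sample job is made up for illustration.

```python
def summarize_ingestion_job(job):
    """Build a one-line summary from an ingestion job description."""
    stats = job.get("statistics", {})
    scanned = stats.get("numberOfDocumentsScanned", 0)
    indexed = stats.get("numberOfNewDocumentsIndexed", 0)
    failed = stats.get("numberOfDocumentsFailed", 0)
    summary = f"status={job.get('status')} scanned={scanned} indexed={indexed} failed={failed}"
    reasons = job.get("failureReasons", [])
    if reasons:
        summary += " reasons=" + "; ".join(reasons)
    return summary


# Made-up job description for illustration; real values come from get_ingestion_job
sample_job = {
    "status": "COMPLETE",
    "statistics": {
        "numberOfDocumentsScanned": 4,
        "numberOfNewDocumentsIndexed": 4,
        "numberOfDocumentsFailed": 0,
    },
}
print(summarize_ingestion_job(sample_job))
```

In the notebook, `summarize_ingestion_job(job)` can be called after the polling loop ends. A non-zero failed count is a signal to inspect the source documents before running retrieval.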

### 4. Retrieve

In [None]:
import boto3

# Initialize the Bedrock Agent Runtime client
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name=variables["regionName"])

# Define the query for retrieving relevant documents
query = "What are three sub-tasks in question answering over knowledge bases?"

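# Default to None so the final line of this cell still works if retrieval fails
relevant_documents_os = None

try: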
    # Retrieve the top 3 most relevant documents from the knowledge base
    relevant_documents_os = bedrock_agent_runtime.retrieve(
        retrievalQuery={
            'text': query  # Query text for document retrieval
        },
        knowledgeBaseId=kb['knowledgeBaseId'],  # ID of the Knowledge Base to search in
        retrievalConfiguration={
            'vectorSearchConfiguration': {
                'numberOfResults': 3  # Fetch the top 3 most relevant documents
            }
        }
    )

    # Confirm that the retrieval call succeeded
    print("Successfully retrieved relevant documents.")

except Exception as e:
    print("Error: Unable to retrieve relevant documents.")
    print(e)  # Print the error details for debugging

# Output the retrieved documents
relevant_documents_os
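
The raw response is verbose, so a compact view of each result can make it easier to judge how well the fixed-size chunks match the query. The sketch below assumes each entry in `retrievalResults` has `content.text`, `location.s3Location.uri`, and `score` fields. The helper name `format_results` is hypothetical, and the sample response is made up for illustration.

```python
def format_results(response, preview_chars=80):
    """Return one summary line per retrieved chunk: rank, score, source, and a text preview."""
    rows = []
    for rank, result in enumerate(response.get("retrievalResults", []), start=1):
        text = result.get("content", {}).get("text", "")
        uri = result.get("location", {}).get("s3Location", {}).get("uri", "unknown")
        score = result.get("score")
        score_text = f"{score:.3f}" if score is not None else "n/a"
        rows.append(f"{rank}. score={score_text} source={uri} text={text[:preview_chars]!r}")
    return rows


# Made-up response for illustration; real values come from the retrieve call
sample_response = {
    "retrievalResults": [
        {
            "content": {"text": "Fixed chunking splits documents into equal-sized pieces."},
            "location": {"type": "S3", "s3Location": {"uri": "s3://example-bucket/doc.pdf"}},
            "score": 0.4213,
        }
    ]
}
rows = format_results(sample_response)
for row in rows:
    print(row)
```

In the notebook, `format_results(relevant_documents_os)` can be used once the retrieval call has succeeded. Chunks that begin or end mid-sentence in the preview are a visible effect of fixed-size boundaries.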