# Create Unstructured Amazon Bedrock Knowledge Base


This notebook demonstrates how to create and configure an Amazon Bedrock Knowledge Base for unstructured data.

The Knowledge Base integrates Amazon S3 as the data source and uses Amazon OpenSearch Serverless as the vector store. It enables retrieval-augmented generation (RAG) over unstructured content, which in this workshop is a set of product reviews.

We will create an unstructured knowledge base that indexes JSON files from an S3 bucket into an Amazon OpenSearch Serverless collection.

![Unstructured Knowledge Base](../images/unstructured_kb.png)


## Setup and prerequisites

### Prerequisites
* Python 3.13
* AWS account
* Amazon Bedrock foundation model access
* IAM role with permissions to create an Amazon Bedrock Knowledge Base, an Amazon S3 bucket, and an Amazon OpenSearch Serverless collection

Let's now install the required packages and define the clients needed to create our Amazon Bedrock Knowledge Base:


Import required libraries for AWS service interaction, data handling, and logging to support Knowledge Base creation and management:

In [None]:
import json
import logging
import os
import random
import string
import time
import uuid
from datetime import datetime

import boto3
import botocore
import requests

Initialize AWS service clients to interact with S3, STS, and Bedrock services throughout the notebook:

In [None]:
s3_client = boto3.client('s3')
sts_client = boto3.client('sts')
session = boto3.session.Session()
region = session.region_name
account_id = sts_client.get_caller_identity()["Account"]
logging.basicConfig(format='[%(asctime)s] p%(process)s {%(filename)s:%(lineno)d} %(levelname)s - %(message)s', level=logging.INFO)
logger = logging.getLogger(__name__)
bedrock_agent_runtime_client = boto3.client("bedrock-agent-runtime")
bedrock_runtime = boto3.client("bedrock-runtime")

print(f"AWS region: {region}")
print(f"AWS account ID: {account_id}")


Generate a unique random suffix for AWS resource names. This prevents naming conflicts when multiple participants run the workshop simultaneously in the same AWS account.

In [None]:
# Generate unique suffix for resource names
suffix = ''.join(random.choices(string.ascii_lowercase + string.digits, k=8))

print(f"Suffix: {suffix}")


## Step 1: Use the Amazon Bedrock Knowledge Base helper

We will be using the `BedrockKnowledgeBase` class from `utils.knowledge_base` to create and configure the Amazon Bedrock Knowledge Base. This class simplifies the process of creating and managing the Knowledge Base by abstracting away the details of the API calls required to configure the Knowledge Base.


In [None]:
import os
if 'Lab 1' in os.getcwd():
    %cd ..
else:
    print(os.getcwd())

from utils.knowledge_base import BedrockKnowledgeBase

## Step 2: Create Amazon Bedrock Knowledge Base for Unstructured Data
In this section we will configure the Amazon Bedrock Knowledge Base containing the product reviews. We will use Amazon OpenSearch Serverless as the underlying vector store and Amazon S3 as the data source containing the product review JSON files.


These settings define the core identity and capabilities of your Knowledge Base. The unique name prevents conflicts, the foundation model determines response quality, and the embedding model controls how documents are vectorized for semantic search.

In [None]:
knowledge_base_name = f"product-reviews-unstructured-kb-{suffix}"
knowledge_base_description = "Unstructured Knowledge Base containing product review documents."
foundation_model = "anthropic.claude-haiku-4-5-20251001-v1:0"  # Will use inference profile for invocation
generation_model = "global."+foundation_model
embedding_model = "cohere.embed-multilingual-v3"

Test that we have access to the foundation model and embedding model used in the workshop.

Note: We test the foundation model through a global inference profile, not by invoking the model ID directly.

In [None]:
# First, verify AWS Marketplace permissions by checking Bedrock model access
bedrock_client = boto3.client('bedrock')
try:
    # This call indirectly verifies marketplace permissions for third-party models
    # List foundation models to check if Cohere models are accessible
    models = bedrock_client.list_foundation_models()
    cohere_models = [m for m in models['modelSummaries'] if 'cohere' in m['modelId'].lower()]
    claude_models = [m for m in models['modelSummaries'] if 'claude' in m['modelId'].lower()]
    print(f"✅ Bedrock access verified - found {len(cohere_models)} Cohere models available")
    if claude_models:
        print(f"   Available Claude models: {[m['modelId'] for m in claude_models[:3]]}")
    if cohere_models:
        print(f"   Available Cohere models: {[m['modelId'] for m in cohere_models[:3]]}")
except Exception as e:
    print(f"⚠️  Bedrock model access check: {e}")

# Test foundation model using inference profile
# Note: Claude 4 and higher require a cross-Region inference (CRIS) profile
try:
    foundation_response = bedrock_runtime.invoke_model(
        modelId=generation_model,
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 10,
            "messages": [{"role": "user", "content": "Hello"}]
        }),
        contentType='application/json'
    )
    print(f"Foundation model {foundation_model} is active")
except Exception as e:
    print(f"Foundation model error: {e}")

# Test embedding model (using embedding format)
try:
    embedding_response = bedrock_runtime.invoke_model(
        modelId=embedding_model,
        body=json.dumps({
            "texts": ["test"],
            "input_type": "search_document"
        }),
        contentType='application/json'
    )
    print(f"Embedding model {embedding_model} is active")
except Exception as e:
    print(f"Embedding model error: {e}")
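The checks above discard the response bodies. As a minimal sketch, assuming the Cohere embedding payload exposes its vectors under an `embeddings` key, the body can be decoded to inspect the vector length, which is the dimension the vector index has to match. `read_model_body`, `embedding_dimension`, and the fake response are hypothetical names used for illustration; they are not part of the workshop helpers.

```python
import io
import json


def read_model_body(response):
    """Decode the JSON payload from an invoke_model response dict."""
    # boto3 returns the payload as a readable stream under the "body" key.
    return json.loads(response["body"].read())


def embedding_dimension(payload):
    """Return the length of the first vector in an embeddings payload."""
    # Assumption: vectors are listed under "embeddings", one per input text.
    return len(payload["embeddings"][0])


# Stand-in for a real response so the sketch runs without AWS credentials.
fake_body = json.dumps({"embeddings": [[0.1, 0.2, 0.3]]}).encode()
fake_response = {"body": io.BytesIO(fake_body)}

payload = read_model_body(fake_response)
print(embedding_dimension(payload))
```

In the notebook, `embedding_response` from the cell above could be passed to `read_model_body` in place of the fake response.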


For this notebook, we'll create a Knowledge Base with an Amazon S3 data source containing the product review documents.


The Knowledge Base needs to know where to find your documents. This configuration specifies the S3 bucket location that will store the product review JSON files for indexing.

In [None]:
data_bucket_name = f'product-reviews-unstructured-{suffix}-bucket'
data_sources = [{"type": "S3", "bucket_name": data_bucket_name}]
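Because the bucket name is assembled from a prefix and a random suffix, it can help to check it locally before calling S3. The sketch below covers only a subset of the S3 naming rules (length of 3 to 63 characters, lowercase letters, digits, dots and hyphens, alphanumeric first and last character, no consecutive dots). `looks_like_valid_bucket_name` is a hypothetical helper, and S3 remains the final authority on whether a name is accepted.

```python
import re

# First and last characters must be a lowercase letter or digit; 3 to 63 characters in total.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name):
    """Return True if the name passes a subset of the S3 bucket naming rules."""
    return bool(_BUCKET_RE.fullmatch(name)) and ".." not in name


# Example names; the suffix here is made up for illustration.
for candidate in ["product-reviews-unstructured-ab12cd34-bucket", "Bad_Name", "ab"]:
    print(candidate, looks_like_valid_bucket_name(candidate))
```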


### Create the Amazon S3 bucket and upload the product review documents
We'll create an S3 bucket and upload the product review documents that will serve as our unstructured data source.


In [None]:
def create_s3_bucket(bucket_name, region=None):
    s3 = boto3.client('s3', region_name=region)

    try:
        if region is None or region == 'us-east-1':
            s3.create_bucket(Bucket=bucket_name)
        else:
            s3.create_bucket(
                Bucket=bucket_name,
                CreateBucketConfiguration={'LocationConstraint': region}
            )
        print(f"Bucket '{bucket_name}' created successfully.")
    except botocore.exceptions.ClientError as e:
        if e.response['Error']['Code'] == 'BucketAlreadyOwnedByYou':
            print(f"Bucket '{bucket_name}' already exists and is owned by you.")
        else:
            print(f"Failed to create bucket: {e.response['Error']['Message']}")

create_s3_bucket(data_bucket_name, region)


This helper function recursively uploads all files from a local directory to the S3 bucket, maintaining the directory structure.

In [None]:
def upload_directory(path, bucket_name):
    file_count = 0
    for root, dirs, files in os.walk(path):
        for file in files:
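            file_to_upload = os.path.join(root, file)
            # Use the path relative to the source directory as the key so subfolders are preserved
            key = os.path.relpath(file_to_upload, path).replace(os.sep, "/")
            print(f"Uploading file {file_to_upload} to {bucket_name}/{key}")
            s3_client.upload_file(file_to_upload, bucket_name, key)
            file_count += 1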

    if file_count == 0:
        raise ValueError(f"No files found in {path}")
    
    print(f"Successfully uploaded {file_count} files")

Upload the sample product review documents from the local directory to the S3 bucket for Knowledge Base ingestion.

In [None]:
# Upload the documents
upload_directory("sample_unstructured_data/selected_reviews", data_bucket_name)


### Create the Unstructured Knowledge Base

We now create the Knowledge Base using the `BedrockKnowledgeBase` helper class we imported earlier.

**Note:** The Knowledge Base creation process may take approximately 6 minutes to complete.


In [None]:
unstructured_knowledge_base = BedrockKnowledgeBase(
    kb_name=f'{knowledge_base_name}',
    kb_description=knowledge_base_description,
    generation_model=generation_model,
    data_sources=data_sources,
    embedding_model=embedding_model,
    chunking_strategy="FIXED_SIZE", 
    suffix=f'{suffix}-u' 
)


### Start ingestion job
Once the Knowledge Base and data source are created, we can start the ingestion job for the data source. During ingestion, the Knowledge Base fetches the documents from the data source, pre-processes them to extract text, splits the text into chunks according to the configured chunking strategy, creates an embedding for each chunk, and writes the embeddings to the vector store (OpenSearch Serverless).
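To make the chunking step concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification: Amazon Bedrock measures chunk size in tokens and takes the overlap as a percentage, whereas this sketch counts whitespace-separated words and takes an absolute overlap. `fixed_size_chunks` is a hypothetical function and is not what the service runs internally.

```python
def fixed_size_chunks(words, chunk_size, overlap):
    """Split a list of words into windows of chunk_size that share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


review = "the battery lasts two days and the screen is bright".split()
chunks = fixed_size_chunks(review, chunk_size=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk repeats the last word of the previous one, so a sentence cut at a boundary still appears with some of its context in the next chunk.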


In [None]:
# Ensure that the kb is available. When doing Run All, 
# we may otherwise start the ingestion job whilst the Datasource of the KB is still creating asynchronously.
time.sleep(60)

# Sync knowledge base
unstructured_knowledge_base.start_ingestion_job()
# Keep the kb_id for invocation later in the invoke request
unstructured_kb_id = unstructured_knowledge_base.get_knowledge_base_id()
print(f"Unstructured Knowledge Base ID: {unstructured_kb_id}")
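A fixed `time.sleep(60)` either waits longer than needed or not long enough. One alternative, sketched here under the assumption that the resource status can be polled, is to wait until a ready state is reported. `wait_for_status` is a hypothetical helper, and the status values below come from an offline stand-in so the cell runs without AWS access.

```python
import time


def wait_for_status(get_status, ready, failed=(), timeout=600, interval=10):
    """Call get_status() until it returns a value in `ready`; raise on failure or timeout."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in ready:
            return status
        if status in failed:
            raise RuntimeError(f"resource entered failed state: {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status} after {timeout} seconds")
        time.sleep(interval)


# Stand-in status source that reports CREATING twice and then ACTIVE.
statuses = iter(["CREATING", "CREATING", "ACTIVE"])
final_status = wait_for_status(lambda: next(statuses), ready={"ACTIVE"}, failed={"FAILED"}, interval=0)
print(final_status)
```

With a real client, `get_status` could wrap a call such as `boto3.client("bedrock-agent").get_knowledge_base(knowledgeBaseId=unstructured_kb_id)["knowledgeBase"]["status"]`. Whether the `BedrockKnowledgeBase` helper already waits internally is not shown in this notebook.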


Store the Knowledge Base ID, region, and bucket name in Jupyter's variable store. This allows the next notebook (1.2-test-unstructured-kb.ipynb) to access these values without manual copying.

In [None]:
kb_region = region
%store unstructured_kb_id
%store kb_region
%store data_bucket_name


print("="*60)
print(f"Unstructured Knowledge Base ID: {unstructured_kb_id}")
print(f"Region: {kb_region}")
print(f"S3 Bucket: {data_bucket_name}")

print("="*60)
print("Configuration stored successfully!")



Display the Knowledge Base ID for verification before storing it in Parameter Store.

In [None]:
unstructured_kb_id

The Knowledge Base ID is needed by the agent in Lab 3 to query the unstructured data. Storing it in SSM Parameter Store provides a centralized, secure way to share configuration across different components without hardcoding values.

In [None]:
param_name = '/app/intelligent_rag/agentcore/unstructured_kb_id'

ssm = boto3.client("ssm")
ssm.put_parameter(Name=param_name, Value=unstructured_kb_id, Type="String", Overwrite=True)
print(f"Stored {unstructured_kb_id} in SSM: {param_name}")
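A consumer such as the Lab 3 agent can read the value back with `get_parameter`. The sketch below shows the shape of that lookup. `read_kb_id` and `FakeSSM` are hypothetical names, and the fake client only mimics the response structure so the cell runs offline; in the notebook, the `ssm` client created above would be passed instead.

```python
def read_kb_id(ssm_client, name):
    """Fetch a String parameter's value from SSM Parameter Store."""
    response = ssm_client.get_parameter(Name=name)
    return response["Parameter"]["Value"]


class FakeSSM:
    """Offline stand-in that mimics the shape of the boto3 get_parameter response."""

    def __init__(self, values):
        self._values = values

    def get_parameter(self, Name):
        return {"Parameter": {"Name": Name, "Value": self._values[Name]}}


example_name = "/app/intelligent_rag/agentcore/unstructured_kb_id"
fake_ssm = FakeSSM({example_name: "EXAMPLEKBID"})  # made-up ID for illustration
print(read_kb_id(fake_ssm, example_name))
```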

## Clean up the resources

To avoid additional costs, delete the resources you created once you have finished the other notebooks.

📋 **For detailed cleanup instructions, please refer to the [Cleanup-Instructions.ipynb](../Cleanup-Instructions.ipynb) notebook** which provides step-by-step guidance for removing all workshop resources safely.

### Summary
If all the above cells executed successfully, you have:

- Created an S3 bucket for unstructured data  
- Uploaded the product review documents
- Created an Amazon Bedrock Knowledge Base  
- Configured OpenSearch Serverless as the vector store  
- Ingested the product review documents  
- Stored the Knowledge Base ID for use in the main notebook  

You can now proceed to test the unstructured knowledge base with the [1.2-test-unstructured-kb.ipynb](1.2-test-unstructured-kb.ipynb) notebook.
