# Prerequisites

1. Prepare documents to be used in Amazon Bedrock Knowledge Base.
2. Add metadata to the input documents for advanced query features (covered in Lab2).
3. Create required AWS resources to run the Bedrock Knowledge Base service.
4. Create an Amazon OpenSearch Serverless collection as a vector store.

### 1. Environment

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
# Update python packages
!pip install -U boto3 opensearch-py 2>/dev/null

In [None]:
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth

# Initialize Boto3 session
boto3_session = boto3.session.Session()
credentials = boto3_session.get_credentials()
region_name = "us-west-2"

# Retrieve AWS account details (one STS call returns both values)
sts_client = boto3_session.client("sts")
caller_identity = sts_client.get_caller_identity()
account_number = caller_identity["Account"]
role_arn = caller_identity["Arn"]

# Set up authentication for OpenSearch
awsauth = AWSV4SignerAuth(credentials, region_name, "aoss")

# Print account details for verification
print(f"AWS Account: {account_number}")
print(f"Role ARN: {role_arn}")

In [None]:
# Resource names to be used in the workshop

s3_bucket_name = f"{account_number}-{region_name}-advanced-rag-workshop"
knowledge_base_name_aoss = "advanced-rag-workshop-knowledgebase-aoss"
knowledge_base_name_graphrag = "advanced-rag-workshop-knowledgebase-graphrag"

oss_vector_store_name = "advancedrag"
oss_index_name = "ws-index-"

# Print resource names for verification
print(f"S3 Bucket Name: {s3_bucket_name}")
print(f"Knowledge Base (AOSS): {knowledge_base_name_aoss}")
print(f"Knowledge Base (GraphRAG): {knowledge_base_name_graphrag}")
print(f"OpenSearch Vector Store Name: {oss_vector_store_name}")
print(f"OpenSearch Index Name Prefix: {oss_index_name}")

### 2. Create required AWS resources 

#### IAM Role

In [None]:
from bedrock_excution_iam_role import AdvancedRagIamRoles
import boto3
from botocore.exceptions import ClientError

# Initialize IAM role handler
bedrock_execution_iam_role = AdvancedRagIamRoles(account_number, region_name)

# Check if the role already exists
iam_client = boto3.client("iam", region_name=region_name)
role_name = f"advanced-rag-workshop-bedrock_execution_role-{region_name}"
bedrock_kb_execution_role_arn = ""

try:
    # Try to get the existing role
    existing_role = iam_client.get_role(RoleName=role_name)
    # Keep the role object too: a later cell passes it to the OpenSearch policy helper.
    # Assumption: the helper only needs the same {"Role": {...}} shape that role creation returns.
    bedrock_kb_execution_role = existing_role
    bedrock_kb_execution_role_arn = existing_role["Role"]["Arn"]
    print(f"Role already exists. ARN: {bedrock_kb_execution_role_arn}")
except ClientError as e:
    if e.response["Error"]["Code"] == "NoSuchEntity":
        try:
            # Role does not exist, create it
            bedrock_kb_execution_role = bedrock_execution_iam_role.create_bedrock_execution_role(s3_bucket_name)
            bedrock_kb_execution_role_arn = bedrock_kb_execution_role["Role"]["Arn"]
            print(f"Created Bedrock Knowledge Base Execution Role ARN: {bedrock_kb_execution_role_arn}")
        except Exception as create_error:
            print(create_error)
            print("Role creation failed. If the policies already exist, clean them up first.")
    else:
        # Any other client error (for example, missing IAM permissions)
        print(f"Could not look up role '{role_name}': {e}")

if not bedrock_kb_execution_role_arn:
    print("WARNING: Could not determine the Bedrock KB execution role ARN. Falling back to the expected ARN.")
    bedrock_kb_execution_role_arn = f"arn:aws:iam::{account_number}:role/{role_name}"

#### S3 bucket

In [None]:
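from botocore.exceptions import ClientError

# Initialize S3 client with the specified AWS region
s3 = boto3.client("s3", region_name=region_name)

try:
    # Check if the S3 bucket already exists
    s3.head_bucket(Bucket=s3_bucket_name)
    print(f"Bucket '{s3_bucket_name}' already exists.")
except ClientError as e:
    # Only create the bucket when it is missing; re-raise access or other errors
    if e.response["Error"]["Code"] not in ("404", "NoSuchBucket"):
        raise
    create_args = {"Bucket": s3_bucket_name}
    if region_name != "us-east-1":
        # us-east-1 is the default location and must not be passed as a LocationConstraint
        create_args["CreateBucketConfiguration"] = {"LocationConstraint": region_name}
    s3.create_bucket(**create_args)
    print(f"Bucket '{s3_bucket_name}' created.")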

In [None]:
import os

# Define a function to upload all files from a local directory to an S3 bucket
def upload_directory(path, bucket_name, data_s3_prefix):
    for root, dirs, files in os.walk(path):
        for file in files:
            key = f"{data_s3_prefix}/{file}"  # Construct the S3 object key from the file name only
            s3.upload_file(os.path.join(root, file), bucket_name, key)  # Upload the file
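
Because `upload_directory` builds each key from the file name alone, two files with the same name in different subdirectories would overwrite each other in S3. The sketch below shows one possible way to keep the relative path in the key. `iter_upload_keys` is a hypothetical helper, not part of the workshop code.

```python
import os
from typing import Iterator, Tuple


def iter_upload_keys(path: str, data_s3_prefix: str) -> Iterator[Tuple[str, str]]:
    """Yield (local_file_path, s3_key) pairs, keeping subdirectories in the key."""
    for root, _dirs, files in os.walk(path):
        for name in sorted(files):
            local_file = os.path.join(root, name)
            # Path relative to the upload root, normalized to forward slashes for S3
            relative = os.path.relpath(local_file, path).replace(os.sep, "/")
            yield local_file, f"{data_s3_prefix.rstrip('/')}/{relative}"
```

Each pair could then be passed to `s3.upload_file(local_file, bucket_name, key)`. The notebook's flat layout does not need this, so treat it as an option for nested directories only.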

### 3. Preparing Data Sources with .metadata.json

### Role of Metadata While Indexing Data in Vector Databases  

Metadata provides additional context about each document, which can be used to filter, sort, and rank search results. Filtering on metadata narrows the set of candidate chunks, which can reduce search latency and improve the accuracy of responses.  

The following are some key uses of metadata when loading documents into a vector data store:  

- **Document Identification** – Metadata can include unique identifiers for each document, such as document IDs, URLs, or file names. These identifiers can be used to uniquely reference and retrieve specific documents from the vector data store.  
- **Content Categorization** – Metadata can provide information about the content or category of a document, such as the subject matter, domain, or topic. This information can be used to organize and filter documents based on specific categories or domains.  
- **Document Attributes** – Metadata can store additional attributes related to the document, such as the author, publication date, language, or any other relevant information. These attributes can be used for filtering, sorting, or faceted search within the vector data store.  
- **Access Control** – Metadata can include information about access permissions or security levels associated with a document. This information can be used to control access to sensitive or restricted documents within the vector data store.  
- **Relevance Scoring** – Metadata can be used to enhance the relevance scoring of search results. For example, if a user searches for documents within a specific date range or authored by a particular individual, the metadata can be used to prioritize and rank the most relevant documents.  
- **Data Enrichment** – Metadata can be used to enrich the vector representations of documents by incorporating additional contextual information. This can potentially improve the accuracy and quality of search results.  
- **Data Lineage and Auditing** – Metadata can provide information about the provenance and lineage of documents, such as the source system, data ingestion pipeline, or any transformations applied to the data. This information can be valuable for data governance, auditing, and compliance purposes.  
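
To make the filtering use case concrete, the sketch below applies a metadata filter to a handful of in-memory records before any similarity search would run. It is a plain-Python illustration only: the `matches` helper and the sample records are hypothetical, and nothing here calls Amazon Bedrock.

```python
from typing import Any, Dict, List

# Hypothetical document records, each carrying metadata attributes
documents: List[Dict[str, Any]] = [
    {"id": "paper-1", "metadata": {"docType": "science", "year": 2021}},
    {"id": "paper-2", "metadata": {"docType": "science", "year": 2025}},
    {"id": "filing-1", "metadata": {"docType": "10K Report", "year": 2025}},
]


def matches(metadata: Dict[str, Any], doc_type: str, min_year: int) -> bool:
    """Return True when the metadata satisfies both filter conditions."""
    return metadata.get("docType") == doc_type and metadata.get("year", 0) >= min_year


# Narrow the candidate set first; a vector search would then rank only these documents
candidates = [d["id"] for d in documents if matches(d["metadata"], "science", 2024)]
print(candidates)  # ['paper-2']
```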


#### 3.1 Unstructured (PDF) document

#### Amazon Science papers

In [None]:
from urllib.request import urlretrieve
import json
import os
import shutil

# Define URLs of Amazon Science Publications to download as example documents
urls = [
    "https://assets.amazon.science/44/ba/e16182124eac8687e89d3cb0ea3d/retrieval-reranking-and-multi-task-learning-for-knowledge-base-question-answering.pdf",
    "https://assets.amazon.science/36/be/2669792342f2ba366ddca794069f/practiq-a-practical-conversational-text-to-sql-dataset-with-ambiguous-and-unanswerable-queries.pdf",
    "https://assets.amazon.science/a7/7c/8bdade5c4eda9168f3dee6434fff/pc-amazon-frontier-model-safety-framework-2-7-final-2-9.pdf"
]

# Define standard filenames to maintain consistency when loading data to Amazon S3
filenames = [
    "retrieval-reranking-and-multi-task-learning-for-knowledge-base-question-answering.pdf",
    "practiq-a-practical-conversational-text-to-sql-dataset-with-ambiguous-and-unanswerable-queries.pdf",
    "pc-amazon-frontier-model-safety-framework-2-7-final-2-9.pdf"
]

# Create a local temporary directory to store downloaded files before uploading to S3
local_data_path = "./data/"
os.makedirs(local_data_path, exist_ok=True)

# Download files from URLs and save them in the local directory
for url, filename in zip(urls, filenames):
    file_path = os.path.join(local_data_path, filename)
    urlretrieve(url, file_path)

# Define metadata corresponding to each document for indexing in the vector database
metadata = [
    {
        "metadataAttributes": {
            "company": "Amazon",
            "authors": ["Zhiguo Wang", "Patrick Ng", "Ramesh Nallapati", "Bing Xiang"],
            "docType": "science",
            "year": 2021
        }
    },
    {
        "metadataAttributes": {
            "company": "Amazon",
            "authors": ["Marvin Dong", "Nischal Ashok Kumar", "Yiqun Hu", "Anuj Chauhan", "Chung-Wei Hang", "Shuaichen Chang", 
                        "Lin Pan", "Wuwei Lan", "Henry Zhu", "Jiarong Jiang", "Patrick Ng", "Zhiguo Wang"],
            "docType": "science",
            "year": 2025
        }
    },
    {
        "metadataAttributes": {
            "company": "Amazon",
            "authors": ["Amazon"],
            "docType": "science",
            "year": 2025
        }
    }
]

# Save metadata as JSON files alongside the corresponding documents
for i, file in enumerate(filenames):
    with open(f"{local_data_path}{file}.metadata.json", "w") as f:
        json.dump(metadata[i], f)

# Upload the directory to Amazon S3 under the 'data/pdf_documents' prefix
upload_directory(local_data_path, s3_bucket_name, "data/pdf_documents")

# Delete the local directory and its contents after upload to save space
shutil.rmtree(local_data_path)
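
Before writing the sidecar files, it can help to check that each metadata entry has the expected shape. The sketch below is an example under stated assumptions: `validate_metadata` is a hypothetical helper, and the allowed value types (string, number, boolean, list of strings) reflect our understanding of the Bedrock Knowledge Bases metadata format and should be checked against the current documentation.

```python
from typing import Any, Dict, List


def validate_metadata(entry: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one .metadata.json payload (empty if none)."""
    problems: List[str] = []
    attributes = entry.get("metadataAttributes")
    if not isinstance(attributes, dict):
        return ["missing or invalid 'metadataAttributes' object"]
    for key, value in attributes.items():
        # Assumed supported types: string, number, boolean, or a list of strings
        is_scalar = isinstance(value, (str, int, float, bool))
        is_string_list = isinstance(value, list) and all(isinstance(v, str) for v in value)
        if not (is_scalar or is_string_list):
            problems.append(f"attribute '{key}' has unsupported type {type(value).__name__}")
    return problems


sample = {"metadataAttributes": {"company": "Amazon", "authors": ["Amazon"], "year": 2025}}
print(validate_metadata(sample))  # an empty list means no problems were found
```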

#### Amazon 10-K filings

In [None]:
from urllib.request import urlretrieve
import json
import os
import shutil

# Define URLs of Amazon's 10-K reports to be downloaded as example documents
urls = [
    "https://d18rn0p25nwr6d.cloudfront.net/CIK-0001018724/e42c2068-bad5-4ab6-ae57-36ff8b2aeffd.pdf",
    "https://d18rn0p25nwr6d.cloudfront.net/CIK-0001018724/c7c14359-36fa-40c3-b3ca-5bf7f3fa0b96.pdf",
    "https://d18rn0p25nwr6d.cloudfront.net/CIK-0001018724/d2fde7ee-05f7-419d-9ce8-186de4c96e25.pdf"
]

# Define standard filenames to maintain consistency when loading data to Amazon S3
filenames = [
    "Amazon-10k-2025.pdf",
    "Amazon-10k-2024.pdf",
    "Amazon-10k-2023.pdf"
]

# Create a local temporary directory to store downloaded files before uploading to S3
local_data_path = "./data/"
os.makedirs(local_data_path, exist_ok=True)

# Download files from URLs and save them in the local directory
for idx, url in enumerate(urls):
    file_path = os.path.join(local_data_path, filenames[idx])
    urlretrieve(url, file_path)

# Define metadata corresponding to each document for indexing in the vector database
metadata = [
    {
        "metadataAttributes": {
            "company": "Amazon",
            "authors": ["Amazon"],
            "docType": "10K Report",
            "year": 2025
        }
    },
    {
        "metadataAttributes": {
            "company": "Amazon",
            "authors": ["Amazon"],
            "docType": "10K Report",
            "year": 2024
        }
    },
    {
        "metadataAttributes": {
            "company": "Amazon",
            "authors": ["Amazon"],
            "docType": "10K Report",
            "year": 2023
        }
    }
]

# Save metadata as JSON files alongside the corresponding documents
for i, file in enumerate(filenames):
    metadata_file_path = os.path.join(local_data_path, f"{file}.metadata.json")
    with open(metadata_file_path, "w") as f:
        json.dump(metadata[i], f, indent=4)

# Upload the directory to Amazon S3 under the 'data/pdf_documents' prefix
upload_directory(local_data_path, s3_bucket_name, "data/pdf_documents")

# Delete the local directory and its contents after upload to save space
shutil.rmtree(local_data_path)

#### 3.2 Metadata customization for CSV files
The dataset comes from the [ali-ce/datasets](https://github.com/ali-ce/datasets) repository and is licensed under the [Creative Commons Attribution-ShareAlike 4.0 International license](https://github.com/ali-ce/datasets/blob/master/README.md#:~:text=Creative%20Commons%20Attribution%2DShareAlike%204.0%20International%20License.).

In [None]:
import csv
import json
import os
import shutil
import requests

# Define a function to generate JSON metadata from a CSV file
def generate_json_metadata(csv_file, content_fields, metadata_fields, excluded_fields):
    """
    Generates a JSON metadata file for a given CSV file.

    Parameters:
        csv_file (str): Path to the CSV file.
        content_fields (list): List of fields that contain document content.
        metadata_fields (list): List of fields to include as metadata.
        excluded_fields (list): List of fields to exclude (automatically populated if empty).

    The function reads the CSV file, extracts headers, and structures metadata accordingly.
    It then saves the metadata as a JSON file in the same directory as the CSV file.
    """
    # Open the CSV file and read its headers
    with open(csv_file, 'r', encoding='utf-8') as file:
        reader = csv.DictReader(file)
        headers = reader.fieldnames  # Get column names

    # Define JSON structure for metadata
    json_data = {
        "metadataAttributes": {},
        "documentStructureConfiguration": {
            "type": "RECORD_BASED_STRUCTURE_METADATA",
            "recordBasedStructureMetadata": {
                "contentFields": [{"fieldName": field} for field in content_fields],
                "metadataFieldsSpecification": {
                    "fieldsToInclude": [{"fieldName": field} for field in metadata_fields],
                    "fieldsToExclude": []
                }
            }
        }
    }

    # Determine fields to exclude (all fields not in content_fields or metadata_fields)
    if not excluded_fields:
        # Sort so the generated file is identical from run to run
        excluded_fields = sorted(set(headers) - set(content_fields + metadata_fields))

    json_data["documentStructureConfiguration"]["recordBasedStructureMetadata"]["metadataFieldsSpecification"]["fieldsToExclude"] = [
        {"fieldName": field} for field in excluded_fields
    ]

    # Generate the output JSON file name
    output_file = f"{os.path.splitext(csv_file)[0]}.metadata.json"

    # Save metadata to a JSON file
    with open(output_file, 'w', encoding='utf-8') as file:
        json.dump(json_data, file, indent=4)

    print(f"JSON metadata file '{output_file}' has been generated.")

# Create a directory to store the video game CSV dataset
local_dir = "./videogame/"
os.makedirs(local_dir, exist_ok=True)

# Define the URL of the dataset and the local file path
csv_url = "https://raw.githubusercontent.com/ali-ce/datasets/master/Most-Expensive-Things/Videogames.csv"
csv_file_path = os.path.join(local_dir, "video_games.csv")

# Download the CSV file (certificate verification stays enabled; stop on HTTP errors)
response = requests.get(csv_url, timeout=30)
response.raise_for_status()
with open(csv_file_path, 'wb') as file:
    file.write(response.content)
print(f"CSV file downloaded successfully: {csv_file_path}")

# Generate JSON metadata for the downloaded CSV file
generate_json_metadata(
    csv_file=csv_file_path,
    content_fields=["Description"],
    metadata_fields=["Year", "Developer", "Publisher"],
    excluded_fields=[]  # Automatically determine excluded fields
)

# Upload directory containing the CSV and metadata JSON to S3
upload_directory(local_dir, s3_bucket_name, "data/csv")

# Remove the local directory after upload to save space
shutil.rmtree(local_dir)
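
For reference, the sketch below shows the shape of the sidecar that results for a small, made-up header row. `build_csv_metadata` is a hypothetical, file-free variant of the function above, and the column names `Title` and `Price` are illustrative only; they are not taken from the actual dataset.

```python
import json
from typing import Any, Dict, List


def build_csv_metadata(headers: List[str], content_fields: List[str],
                       metadata_fields: List[str]) -> Dict[str, Any]:
    """Build the record-based metadata structure without touching the file system."""
    keep = set(content_fields) | set(metadata_fields)
    # Sort so the output is stable from run to run
    excluded = sorted(h for h in headers if h not in keep)
    return {
        "metadataAttributes": {},
        "documentStructureConfiguration": {
            "type": "RECORD_BASED_STRUCTURE_METADATA",
            "recordBasedStructureMetadata": {
                "contentFields": [{"fieldName": f} for f in content_fields],
                "metadataFieldsSpecification": {
                    "fieldsToInclude": [{"fieldName": f} for f in metadata_fields],
                    "fieldsToExclude": [{"fieldName": f} for f in excluded],
                },
            },
        },
    }


# Hypothetical header row: 'Title' and 'Price' are illustrative column names
headers = ["Title", "Description", "Year", "Developer", "Publisher", "Price"]
sidecar = build_csv_metadata(headers, ["Description"], ["Year", "Developer", "Publisher"])
print(json.dumps(sidecar, indent=4))
```

Every column that is neither content nor metadata ends up in `fieldsToExclude`, which matches the automatic behavior of `generate_json_metadata` when `excluded_fields` is empty.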

### 4. Create a Vector Store using Amazon OpenSearch Serverless

#### 4.1 Amazon OpenSearch Vector Collection  
This will be used in Amazon Bedrock Knowledge Bases.  

**Code Steps:**  
1. **Create security, network, and data access policies** within Amazon OpenSearch Serverless.  
   - These will be assigned to the OpenSearch Vector Collection.  
2. **Create an OpenSearch Serverless Vector Collection.**  
3. **Retrieve the OpenSearch Serverless collection URL** for the Vector Collection created above.  
4. **Wait for the Vector Collection** to leave the `CREATING` state and become `ACTIVE`.  
5. **Create an OpenSearch Serverless access policy** and attach it to the Bedrock execution role.


> **Note**: This process will take approximately 4-5 minutes to complete. The system is creating security policies, network configurations, and a vector collection for storing embeddings.

In [None]:
import boto3
import time

# Initialize the OpenSearch Serverless client
aoss = boto3.client("opensearchserverless", region_name=region_name)

print("Creating OpenSearch Serverless vector collection. This process will take approximately 4-5 minutes...")

# Create security, network, and data access policies within OpenSearch Serverless (OSS)
# These policies are essential for the correct access configuration of the OSS
try:
    result = bedrock_execution_iam_role.create_policies_in_oss(
        vector_store_name=oss_vector_store_name,
        aoss_client=aoss,
        bedrock_kb_execution_role_arn=bedrock_kb_execution_role_arn
    )
    if result is not None:  # Check if the result is valid
        encryption_policy, network_policy, access_policy = result
    else:
        print("Policies already exist or were not created properly.")
        encryption_policy = network_policy = access_policy = None
except Exception as e:
    print(f"Error creating policies: {str(e)}")
    encryption_policy = network_policy = access_policy = None

# Check if the collection already exists before creation
try:
    response = aoss.batch_get_collection(names=[oss_vector_store_name])
    if response['collectionDetails']:
        print(f"Collection '{oss_vector_store_name}' already exists.")
        # Extract the collection ID from the existing collection
        collection_id = response['collectionDetails'][0]['id']
        host = f"{collection_id}.{region_name}.aoss.amazonaws.com"  # Construct the host URL
        print(f"Collection Host URL: {host}")
    else:
        # Create an OpenSearch Serverless Vector Collection
        collection = aoss.create_collection(name=oss_vector_store_name, type='VECTORSEARCH')
        collection_id = collection['createCollectionDetail']['id']
        host = f"{collection_id}.{region_name}.aoss.amazonaws.com"  # Construct the host URL
        print(f"Collection Host URL: {host}")
except Exception as e:
    # Without a collection ID the remaining steps cannot run, so surface the error
    print(f"Could not look up or create collection '{oss_vector_store_name}': {e}")
    raise

# Wait for collection creation to complete
# The creation process can take a few minutes, so we check the status periodically
response = aoss.batch_get_collection(names=[oss_vector_store_name])
status = response['collectionDetails'][0]['status']
# Periodically check the collection's status until it's no longer 'CREATING'
while status == 'CREATING':
    print('Collection is still being created...')
    time.sleep(10)  # Sleep for 10 seconds before checking again
    response = aoss.batch_get_collection(names=[oss_vector_store_name])
    status = response['collectionDetails'][0]['status']

# Report the final status instead of assuming success
if status == 'ACTIVE':
    print('\nCollection successfully created!')
else:
    print(f"\nCollection creation ended with status '{status}'. Check the OpenSearch Serverless console.")

# Create the OpenSearch Serverless access policy and attach it to the Bedrock execution role
# This ensures that the execution role has the correct permissions to access the collection
try:
    bedrock_execution_iam_role.create_oss_policy_attach_bedrock_execution_role(
        collection_id=collection_id,
        bedrock_kb_execution_role=bedrock_kb_execution_role
    )
    # Wait for the data access rules to propagate (this can take about a minute)
    time.sleep(60)
except Exception as e:
    print(f"Policy already exists or could not be attached: {e}")
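
The wait loop above polls until the status changes but has no upper bound. A minimal sketch of a bounded poll is shown below. `wait_for_status` is a hypothetical helper, and it assumes the collection reports `ACTIVE` on success and some other value (such as `FAILED`) otherwise.

```python
import time
from typing import Callable


def wait_for_status(get_status: Callable[[], str], pending: str = "CREATING",
                    target: str = "ACTIVE", timeout_s: float = 600.0,
                    poll_s: float = 10.0) -> str:
    """Poll get_status() until it leaves `pending`; raise on timeout or unexpected state."""
    deadline = time.monotonic() + timeout_s
    status = get_status()
    while status == pending:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {pending} after {timeout_s} seconds")
        time.sleep(poll_s)
        status = get_status()
    if status != target:
        raise RuntimeError(f"Unexpected final status: {status}")
    return status
```

In this notebook, `get_status` could be a lambda that returns `aoss.batch_get_collection(names=[oss_vector_store_name])["collectionDetails"][0]["status"]`. Passing the status lookup in as a callable keeps the helper independent of boto3.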


#### 4.2 Create an index for the collection

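The cell below creates one index per chunking strategy (`fixed`, `hierarchical`, `semantic`, `custom`). These indices will be managed via Bedrock Knowledge Bases.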

**Code Steps:**

1. **Create Index Body JSON**: Define the index settings and field mappings that will be used for indexing in the OpenSearch Vector Collection.
   
2. **Create OpenSearch Object**: Instantiate an object of the `OpenSearch` class from the `opensearchpy` Python module. This object will be used to connect to the OpenSearch Vector Collection.

3. **Create Index**: Using the OpenSearch object and the index body JSON, create the index in the OpenSearch Vector Collection.


In [None]:
import time
import json
import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth, RequestError

# Step 1: Set up AWS credentials for authentication with OpenSearch Service
credentials = boto3.Session().get_credentials()  # Retrieves AWS credentials from the environment
awsauth = AWSV4SignerAuth(credentials, region_name, "aoss")  # AWS authentication for OpenSearch

# Define the base index name prefix
oss_index_name = "ws-index-"

# Step 2: Define the JSON body for index settings and mappings
body_json = {
    "settings": {
        "index.knn": "true",  # Enable KNN (K-Nearest Neighbor) search
        "number_of_shards": 1,  # Set the number of primary shards for the index
        "knn.algo_param.ef_search": 512,  # KNN search efficiency parameter
        "number_of_replicas": 0,  # Set the number of replicas to 0 (no redundancy)
    },
    "mappings": {
        "properties": {
            "vector": {
                "type": "knn_vector",  # Define a KNN vector field for storing embeddings
                "dimension": 1024,  # Set the vector's dimension to 1024
                "method": {
                    "name": "hnsw",  # Use the HNSW algorithm for KNN search
                    "engine": "faiss",  # Use FAISS engine for efficient vector search
                    "space_type": "l2"  # Use L2 (Euclidean) space for distance calculation
                },
            },
            "text": {
                "type": "text"  # Define a text field for storing unstructured text
            },
            "text-metadata": {
                "type": "text"  # Define a text field for storing associated metadata
            }
        }
    }
}

# Step 3: Build the OpenSearch client using AWS credentials and settings
oss_client = OpenSearch(
    hosts=[{'host': host, 'port': 443}],  # Provide OpenSearch host details
    http_auth=awsauth,  # Use AWS authentication for API requests
    use_ssl=True,  # Enable SSL for secure connection
    verify_certs=True,  # Verify SSL certificates
    connection_class=RequestsHttpConnection,  # Use RequestsHttpConnection for HTTP communication
    timeout=300  # Set a timeout for the connection
)

# Step 4: Attempt to create multiple indices for different chunking strategies
for strategy in ["fixed", "hierarchical", "semantic", "custom"]:
    index_name = oss_index_name + strategy
    try:
        # Check if the index already exists
        if oss_client.indices.exists(index=index_name):
            print(f'Index "{index_name}" already exists. Skipping creation.')
            continue

        # Create the index if it doesn't exist
        oss_client.indices.create(index=index_name, body=json.dumps(body_json))
        print(f'Created index: {index_name}')
    except RequestError as e:
        print(f'Error while trying to create the index "{index_name}": {e.error}')

print('Index Creation Process Completed.')  # Inform user that the process is finished
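
The vector dimension in the index mapping needs to match the output size of the embedding model that the knowledge base uses; a mismatch typically surfaces as an error at ingestion time. The sketch below makes the dimension an explicit parameter. `build_index_body` is a hypothetical helper, and its field names and settings simply mirror the cell above.

```python
from typing import Any, Dict


def build_index_body(dimension: int = 1024, vector_field: str = "vector",
                     text_field: str = "text",
                     metadata_field: str = "text-metadata") -> Dict[str, Any]:
    """Return index settings and mappings for a k-NN index of the given dimension."""
    if dimension <= 0:
        raise ValueError("dimension must be a positive integer")
    return {
        "settings": {
            "index.knn": "true",
            "number_of_shards": 1,
            "knn.algo_param.ef_search": 512,
            "number_of_replicas": 0,
        },
        "mappings": {
            "properties": {
                vector_field: {
                    "type": "knn_vector",
                    "dimension": dimension,  # must equal the embedding model's output size
                    "method": {"name": "hnsw", "engine": "faiss", "space_type": "l2"},
                },
                text_field: {"type": "text"},
                metadata_field: {"type": "text"},
            }
        },
    }


body = build_index_body(dimension=1024)
print(body["mappings"]["properties"]["vector"]["dimension"])  # 1024
```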


### Export variables to a file for the next lab

> **Note**: We're saving all the important configuration variables to a JSON file so they can be easily accessed in subsequent notebooks. This ensures consistency and prevents the need to recreate these resources for each notebook in the workshop.

In [None]:
import json

# Build the collection ARN using the standard format
collection_arn = f"arn:aws:aoss:{region_name}:{account_number}:collection/{collection_id}"

with open("variables.json", "w") as f:
    json.dump(
        {
            "accountNumber": account_number,
            "regionName": region_name,
            "collectionArn": collection_arn,
            "collectionId": collection_id,
            "vectorIndexName": oss_index_name,  # Index name prefix, e.g. "ws-index-"
            "bedrockExecutionRoleArn": bedrock_kb_execution_role_arn,
            "s3Bucket": s3_bucket_name
        }, f, indent=4
    )
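
A later notebook can read these values back with a few lines of Python. The sketch below is one possible loader: `load_workshop_variables` and `REQUIRED_KEYS` are hypothetical names, and the key list simply mirrors the dictionary written above.

```python
import json
from typing import Any, Dict

# Keys written by the export cell above
REQUIRED_KEYS = {
    "accountNumber", "regionName", "collectionArn", "collectionId",
    "vectorIndexName", "bedrockExecutionRoleArn", "s3Bucket",
}


def load_workshop_variables(path: str = "variables.json") -> Dict[str, Any]:
    """Load the exported variables and fail early if any expected key is missing."""
    with open(path, "r", encoding="utf-8") as f:
        variables = json.load(f)
    missing = REQUIRED_KEYS - set(variables)
    if missing:
        raise KeyError(f"{path} is missing keys: {sorted(missing)}")
    return variables
```

Note that `vectorIndexName` holds the index name prefix, so a consumer would append a strategy suffix, for example `variables["vectorIndexName"] + "fixed"`.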