# Data Ingestion to Knowledge Base for Amazon Bedrock
**_Use of Knowledge Bases for Amazon Bedrock with Amazon OpenSearch Serverless as a vector database for storing embeddings_**

This notebook provides sample code for a data pipeline that ingests documents (typically stored in Amazon S3) into a knowledge base backed by a vector database, in this case Amazon OpenSearch Serverless.

This notebook works well with the `Data Science 3.0` kernel on a SageMaker Studio `ml.t3.medium` instance.

Here is a list of packages that are used in this notebook.
```
!pip list | grep -E -w "boto3|sagemaker|langchain|langchainhub|opensearch-py|SQLAlchemy"
----------------------------------------------------------------------------------------
boto3                                1.34.107
langchain                            0.1.16
langchain-aws                        0.1.0
langchain-community                  0.0.34
langchain-core                       0.1.52
langchain-text-splitters             0.0.2
langchainhub                         0.1.15
opensearch-py                        2.3.1
sagemaker                            2.215.0
SQLAlchemy                           2.0.28
```

# Prerequisites

The following IAM policies need to be attached to the SageMaker execution role that you use to run this notebook:

- AmazonSageMakerFullAccess
- AWSCloudFormationReadOnlyAccess
- AmazonS3FullAccess
- inline policy for Amazon OpenSearch Serverless (replace the Region and account ID in the ARNs below with your own)
  ```
  {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Action": [
                  "aoss:BatchGetCollection",
                  "aoss:GetAccessPolicy",
                  "aoss:GetAccountSettings",
                  "aoss:GetSecurityConfig",
                  "aoss:GetSecurityPolicy",
                  "aoss:ListAccessPolicies",
                  "aoss:ListCollections",
                  "aoss:ListSecurityConfigs",
                  "aoss:ListSecurityPolicies",
                  "aoss:ListTagsForResource",
                  "aoss:ListVpcEndpoints",
                  "aoss:UpdateAccessPolicy"
              ],
              "Resource": "*",
              "Effect": "Allow",
              "Sid": "UsingOpenSearchServerlessIntheConsole"
          },
          {
              "Action": "aoss:APIAccessAll",
              "Resource": "arn:aws:aoss:us-east-1:819320734790:collection/*",
              "Effect": "Allow",
              "Sid": "OpenSearchServerlessCollectionAccess"
          },
          {
              "Action": "aoss:DashboardsAccessAll",
              "Resource": "arn:aws:aoss:us-east-1:819320734790:dashboards/default",
              "Effect": "Allow",
              "Sid": "OpenSearchServerlessDashboardAccess"
          }
      ]
  }
  ```
- inline policy for Amazon Bedrock
  ```
  {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Action": [
                  "bedrock:ListDataSources",
                  "bedrock:ListFoundationModelAgreementOffers",
                  "bedrock:ListFoundationModels",
                  "bedrock:ListIngestionJobs",
                  "bedrock:ListKnowledgeBases",
                  "bedrock:ListModelInvocationJobs"
              ],
              "Resource": "*",
              "Effect": "Allow",
              "Sid": "BedrockList"
          },
          {
              "Action": [
                  "bedrock:GetDataSource",
                  "bedrock:GetFoundationModel",
                  "bedrock:GetFoundationModelAvailability",
                  "bedrock:GetIngestionJob",
                  "bedrock:GetKnowledgeBase",
                  "bedrock:GetModelInvocationJob",
                  "bedrock:InvokeModel",
                  "bedrock:InvokeModelWithResponseStream",
                  "bedrock:ListTagsForResource",
                  "bedrock:Retrieve"
              ],
              "Resource": "*",
              "Effect": "Allow",
              "Sid": "BedrockRead"
          },
          {
              "Action": [
                  "bedrock:CreateFoundationModelAgreement",
                  "bedrock:CreateModelInvocationJob",
                  "bedrock:CreateProvisionedModelThroughput",
                  "bedrock:DeleteFoundationModelAgreement",
                  "bedrock:DeleteModelInvocationLoggingConfiguration",
                  "bedrock:DeleteProvisionedModelThroughput",
                  "bedrock:PutModelInvocationLoggingConfiguration",
                  "bedrock:RetrieveAndGenerate",
                  "bedrock:StartIngestionJob",
                  "bedrock:UpdateDataSource",
                  "bedrock:UpdateKnowledgeBase"
              ],
              "Resource": "*",
              "Effect": "Allow",
              "Sid": "BedrockWrite"
          },
          {
              "Action": [
                  "bedrock:TagResource",
                  "bedrock:UntagResource"
              ],
              "Resource": "*",
              "Effect": "Allow",
              "Sid": "BedrockTagging"
          }
      ]
  }
  ```


# Data Ingestion

## Step 1: Setup
Install the required packages.

In [None]:
!pip install -Uq pip

!pip install -Uq langchain==0.1.16
!pip install -Uq "boto3>=1.26.159" langchain-aws==0.1.0
!pip install -Uq langchain-community==0.0.34
!pip install -Uq langchainhub==0.1.15
!pip install -Uq SQLAlchemy==2.0.28

!pip install -Uq opensearch-py==2.3.1

In [None]:
!pip list | grep -E -w "boto3|sagemaker|langchain|langchainhub|opensearch-py|SQLAlchemy"

boto3                                1.34.107
langchain                            0.1.16
langchain-aws                        0.1.0
langchain-community                  0.0.34
langchain-core                       0.1.52
langchain-text-splitters             0.0.2
langchainhub                         0.1.15
opensearch-py                        2.3.1
sagemaker                            2.215.0
sagemaker-data-insights              0.3.3
sagemaker-datawrangler               0.4.3
sagemaker-headless-execution-driver  0.0.13
sagemaker-scikit-learn-extension     2.5.0
sagemaker-studio-analytics-extension 0.0.20
sagemaker-studio-sparkmagic-lib      0.1.4
SQLAlchemy                           2.0.28


## Step 2: Check if Amazon OpenSearch Serverless Collection exists

In [None]:
import pprint
import time

pp = pprint.PrettyPrinter(indent=2)

In [None]:
import boto3
from sagemaker import get_execution_role

aws_region = boto3.Session().region_name
sagemaker_execution_role = get_execution_role()

aws_region, sagemaker_execution_role

In [None]:
from utils import (
    check_if_index_exists,
    get_aws_auth,
    get_aoss_data_access_policy,
    update_aoss_data_access_policy_with_caller_arn,
    get_cfn_outputs
)

In [None]:
CFN_STACK_NAME = "BedrockKnowledgeBaseStack"
cfn_stack_outputs = get_cfn_outputs(CFN_STACK_NAME, aws_region)

knowledge_base_id = cfn_stack_outputs['KnowledgeBaseId']
data_source_name = cfn_stack_outputs['DataSourceName']

knowledge_base_id, data_source_name

In [None]:
bedrock_agent_client = boto3.client(
    'bedrock-agent',
    region_name=aws_region
)

In [None]:
# Get KnowledgeBase

response = bedrock_agent_client.get_knowledge_base(
    knowledgeBaseId=knowledge_base_id
)

kb_info = response['knowledgeBase']
kb_info

In [None]:
collection_arn = kb_info['storageConfiguration']['opensearchServerlessConfiguration']['collectionArn']
collection_arn

In [None]:
region_name = aws_region
collection_id = collection_arn.split('/')[-1]

opensearch_endpoint_url = f"https://{collection_id}.{region_name}.aoss.amazonaws.com"
opensearch_endpoint_url
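
The endpoint derivation in the cell above can also be written as a small helper that reads the Region from the ARN itself instead of from the session. This is a minimal sketch: `aoss_endpoint_from_arn` is a hypothetical name, and it assumes the same `https://<collection-id>.<region>.aoss.amazonaws.com` pattern that the cell above uses.

```python
def aoss_endpoint_from_arn(collection_arn: str) -> str:
    """Build the collection endpoint URL from an OpenSearch Serverless collection ARN."""
    # Expected layout: arn:aws:aoss:<region>:<account-id>:collection/<collection-id>
    parts = collection_arn.split(":", 5)
    if len(parts) != 6 or parts[2] != "aoss":
        raise ValueError(f"Not an OpenSearch Serverless ARN: {collection_arn!r}")

    region = parts[3]
    resource_type, _, collection_id = parts[5].partition("/")
    if resource_type != "collection" or not collection_id:
        raise ValueError(f"ARN does not point to a collection: {collection_arn!r}")

    # Assumption: same endpoint pattern as the notebook cell above.
    return f"https://{collection_id}.{region}.aoss.amazonaws.com"
```

Validating the ARN up front turns a malformed value into an immediate `ValueError` rather than a confusing connection error later on.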

In [None]:
opensearch_vector_index = kb_info['storageConfiguration']['opensearchServerlessConfiguration']['vectorIndexName']
opensearch_vector_index

In [None]:
data_access_policy = get_aoss_data_access_policy(collection_id, aws_region)
opensearch_data_access_policy_name = data_access_policy['name']
opensearch_data_access_policy_name, data_access_policy

In [None]:
%%time

is_ok = update_aoss_data_access_policy_with_caller_arn(
    policy_name=opensearch_data_access_policy_name,
    caller_arn=sagemaker_execution_role,
    region_name=aws_region
)

is_ok

In [None]:
aws_auth = get_aws_auth(region_name=aws_region)

exists = check_if_index_exists(
    index_name=opensearch_vector_index,
    host=opensearch_endpoint_url,
    auth=aws_auth
)

exists

## Step 3: Download and prepare dataset

### Dataset

In this example, you will use several years of Amazon's Letter to Shareholders as a text corpus to perform Q&A on.

In [None]:
from pathlib import Path
from urllib.request import urlretrieve

data_root_dir = Path('./data')
data_root_dir.mkdir(parents=True, exist_ok=True)

urls = [
    'https://s2.q4cdn.com/299287126/files/doc_financials/2020/ar/2019-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2021/ar/Amazon-2020-Shareholder-Letter-and-1997-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2022/ar/2021-Shareholder-Letter.pdf',
    'https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/2022-Shareholder-Letter.pdf',
]

filenames = [
    'AMZN-2019-Shareholder-Letter.pdf',
    'AMZN-2020-Shareholder-Letter.pdf',
    'AMZN-2021-Shareholder-Letter.pdf',
    'AMZN-2022-Shareholder-Letter.pdf',
]

# Pair each URL with its local filename and download it
for url, filename in zip(urls, filenames):
    file_path = data_root_dir.joinpath(filename)
    urlretrieve(url, file_path)
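
When the notebook is re-run, the download cell fetches every file again. A possible variation, shown here as a sketch rather than as part of the original pipeline, skips files that already exist locally. The helper name `download_missing` and its injectable `fetch` parameter are assumptions made for illustration.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_missing(urls, filenames, dest_dir, fetch=urlretrieve):
    """Download each URL to dest_dir/filename unless the file already exists."""
    if len(urls) != len(filenames):
        raise ValueError("urls and filenames must have the same length")

    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    downloaded = []
    for url, name in zip(urls, filenames):
        target = dest / name
        if target.exists():
            # Already present from a previous run; skip the network call.
            continue
        fetch(url, target)
        downloaded.append(target)
    return downloaded
```

With the variables from the cell above, the call would be `download_missing(urls, filenames, data_root_dir)`; the return value lists only the files fetched in this run.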

## Step 4: Upload data to S3 Bucket

In [None]:
# Get DataSourceId

response = bedrock_agent_client.list_data_sources(
    knowledgeBaseId=knowledge_base_id
)

data_source_id = response['dataSourceSummaries'][0]['dataSourceId']
data_source_id

In [None]:
# Get DataSource

response = bedrock_agent_client.get_data_source(
    knowledgeBaseId=knowledge_base_id,
    dataSourceId=data_source_id
)

ds_info = response['dataSource']
ds_info

In [None]:
data_source_s3_bucket_arn = ds_info['dataSourceConfiguration']['s3Configuration']['bucketArn']
data_source_s3_bucket_name = data_source_s3_bucket_arn.split(':')[-1]
data_source_s3_bucket_arn, data_source_s3_bucket_name

In [None]:
from sagemaker.s3 import S3Uploader

bucket, prefix = data_source_s3_bucket_name, 'data' # Replace prefix with yours

dataset_s3_path = S3Uploader.upload(
    local_path=str(data_root_dir), desired_s3_uri=f"s3://{bucket}/{prefix}"
)

dataset_s3_path

## Step 5: Start ingestion job

Once the Knowledge Base and Data Source have been created by deploying the CDK stacks, we can start the ingestion job. During ingestion, the Knowledge Base fetches the documents from the data source, pre-processes them to extract text, splits the text into chunks based on the configured chunk size, creates an embedding for each chunk, and writes the embeddings to the vector database, in this case Amazon OpenSearch Serverless.
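
To make the chunking step more concrete, here is an illustrative sketch of fixed-size chunking with overlap. It is not the implementation used by Knowledge Bases for Amazon Bedrock: the service performs chunking itself according to the data source configuration, and the function below splits on whitespace-separated words purely for illustration. The name `chunk_text` and its default sizes are assumptions.

```python
def chunk_text(text, chunk_size=300, overlap=50):
    """Split text into chunks of up to chunk_size words, with overlapping edges."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            # The window already reached the end of the text.
            break
    return chunks
```

The overlap means the end of one chunk is repeated at the start of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.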

In [None]:
# Start an ingestion job

start_job_response = bedrock_agent_client.start_ingestion_job(
    knowledgeBaseId=knowledge_base_id,
    dataSourceId=data_source_id
)

In [None]:
job = start_job_response["ingestionJob"]
pp.pprint(job)

In [None]:
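# Poll until the job reaches a terminal state. Checking only for 'COMPLETE'
# would loop forever if the job ends in 'FAILED'.
while job['status'] not in ('COMPLETE', 'FAILED'):
    time.sleep(30)
    get_job_response = bedrock_agent_client.get_ingestion_job(
        knowledgeBaseId=knowledge_base_id,
        dataSourceId=data_source_id,
        ingestionJobId=job["ingestionJobId"]
    )

    job = get_job_response["ingestionJob"]
    print(f"Ingestion job status: {job['status']}")

pp.pprint(job)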
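
The polling loop can also be wrapped in a reusable helper with a timeout, so that a stuck job does not block the notebook indefinitely. The following is a sketch under stated assumptions: `wait_for_ingestion_job` is a hypothetical helper, the 30-minute default timeout is arbitrary, and the set of terminal statuses is limited to `COMPLETE` and `FAILED`. The client is passed in, so any object with a compatible `get_ingestion_job` method works.

```python
import time

# Assumption: these are the statuses after which the job no longer changes.
TERMINAL_STATUSES = {"COMPLETE", "FAILED"}


def wait_for_ingestion_job(client, knowledge_base_id, data_source_id,
                           ingestion_job_id, poll_seconds=30,
                           timeout_seconds=1800, sleep=time.sleep):
    """Poll an ingestion job until it reaches a terminal status or times out."""
    waited = 0
    while True:
        job = client.get_ingestion_job(
            knowledgeBaseId=knowledge_base_id,
            dataSourceId=data_source_id,
            ingestionJobId=ingestion_job_id,
        )["ingestionJob"]

        if job["status"] in TERMINAL_STATUSES:
            return job
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"Ingestion job {ingestion_job_id} still {job['status']} "
                f"after {waited} seconds"
            )
        sleep(poll_seconds)
        waited += poll_seconds
```

In this notebook the call would look like `wait_for_ingestion_job(bedrock_agent_client, knowledge_base_id, data_source_id, job["ingestionJobId"])`. The caller should still inspect the returned status, since `FAILED` is returned rather than raised.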

# Test the knowledge base

## Using Knowledge Bases for Amazon Bedrock APIs

### RetrieveAndGenerate API

Behind the scenes, the RetrieveAndGenerate API converts the query into an embedding, searches the knowledge base, augments the foundation model (FM) prompt with the search results as context, and returns the FM-generated response to the question. For multi-turn conversations, Knowledge Bases manages short-term memory of the conversation to provide more contextual results.

The output of the RetrieveAndGenerate API includes the generated response, the source attribution, and the retrieved text chunks.

In [None]:
bedrock_agent_runtime_client = boto3.client(
    "bedrock-agent-runtime",
    region_name=aws_region
)

In [None]:
model_id = "anthropic.claude-3-sonnet-20240229-v1:0"
model_arn = f"arn:aws:bedrock:{aws_region}::foundation-model/{model_id}"

model_arn

In [None]:
query = "What is Amazon doing in the field of generative AI?"

response = bedrock_agent_runtime_client.retrieve_and_generate(
    input={
        'text': query
    },
    retrieveAndGenerateConfiguration={
        'type': 'KNOWLEDGE_BASE',
        'knowledgeBaseConfiguration': {
            'knowledgeBaseId': knowledge_base_id,
            'modelArn': model_arn
        }
    },
)

generated_text = response['output']['text']
pp.pprint(generated_text)
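
For the multi-turn case mentioned above, the response carries a `sessionId` that can be passed back on the next call so the service can use the earlier turns as context. The wrapper below is a minimal sketch; `ask_knowledge_base` is a hypothetical name, and the client is passed in as a parameter rather than created inside the function.

```python
def ask_knowledge_base(client, question, knowledge_base_id, model_arn,
                       session_id=None):
    """Call RetrieveAndGenerate and return (generated_text, session_id)."""
    kwargs = {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": knowledge_base_id,
                "modelArn": model_arn,
            },
        },
    }
    if session_id:
        # Continue an existing conversation instead of starting a new one.
        kwargs["sessionId"] = session_id

    response = client.retrieve_and_generate(**kwargs)
    return response["output"]["text"], response.get("sessionId")
```

A follow-up question would then reuse the returned identifier, for example `text, sid = ask_knowledge_base(bedrock_agent_runtime_client, query, knowledge_base_id, model_arn)` followed by a second call with `session_id=sid`.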

In [None]:
# Print the source citations from the original documents to check that the generated response is grounded in the retrieved context.

citations = response["citations"]
contexts = []
for citation in citations:
    retrievedReferences = citation["retrievedReferences"]
    for reference in retrievedReferences:
        contexts.append(reference["content"]["text"])

pp.pprint(contexts)
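
Besides the chunk text, each citation also records which part of the answer it supports and where the chunk came from. The sketch below pairs each generated span with the source URIs behind it. `citation_sources` is a hypothetical helper, and it assumes an S3-backed data source, so that each reference has a `location.s3Location.uri` field; missing fields are skipped rather than treated as errors.

```python
def citation_sources(response):
    """Return a list of (generated_span, [source URIs]) from a RetrieveAndGenerate response."""
    pairs = []
    for citation in response.get("citations", []):
        span = (
            citation.get("generatedResponsePart", {})
            .get("textResponsePart", {})
            .get("text", "")
        )
        uris = []
        for reference in citation.get("retrievedReferences", []):
            # Assumption: the data source is S3, so the location is an S3 URI.
            uri = reference.get("location", {}).get("s3Location", {}).get("uri")
            if uri and uri not in uris:
                uris.append(uri)
        pairs.append((span, uris))
    return pairs
```

Calling `citation_sources(response)` on the response from the cell above yields one entry per citation, which makes it easier to see which shareholder letter each statement was drawn from.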

### Retrieve API

The Retrieve API converts user queries into embeddings, searches the knowledge base, and returns the relevant results, giving you more control to build custom workflows on top of the semantic search results. The output of the Retrieve API includes the retrieved text chunks, the location type and URI of the source data, and the relevance scores of the retrievals.

In [None]:
# Use the Retrieve API to fetch only the relevant context.

relevant_documents = bedrock_agent_runtime_client.retrieve(
    retrievalQuery= {
        'text': query
    },
    knowledgeBaseId=knowledge_base_id,
    retrievalConfiguration= {
        'vectorSearchConfiguration': {
            'numberOfResults': 3 # Fetch the top 3 chunks that most closely match the query.
        }
    }
)

pp.pprint(relevant_documents["retrievalResults"])
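
Because the Retrieve API returns a relevance score with each result, a custom workflow can filter out weak matches before passing context to a model. The following is a sketch only: `filter_by_score` is a hypothetical helper, and the `0.5` threshold is an arbitrary placeholder that would need tuning, since the score scale depends on the search configuration.

```python
def filter_by_score(results, min_score=0.5):
    """Keep retrieval results at or above min_score, highest score first."""
    kept = [r for r in results if r.get("score", 0.0) >= min_score]
    return sorted(kept, key=lambda r: r.get("score", 0.0), reverse=True)
```

For example, `filter_by_score(relevant_documents["retrievalResults"], min_score=0.4)` would return only the chunks scoring at least 0.4, ordered from most to least relevant.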

## Using LangChain Integration with AWS

### Using the Knowledge Bases Retriever (AmazonKnowledgeBasesRetriever)

In [None]:
from langchain_aws import AmazonKnowledgeBasesRetriever


retriever = AmazonKnowledgeBasesRetriever(
    knowledge_base_id=knowledge_base_id,
    retrieval_config={
        "vectorSearchConfiguration": {
            "numberOfResults": 3,
            # 'overrideSearchType': "SEMANTIC", # optional, [SEMANTIC, HYBRID]
        }
    },
    region_name=aws_region
)

In [None]:
query = "What is Amazon doing in the field of Generative AI?"

retrieved_docs = retriever.invoke(query)
pp.pprint(retrieved_docs)

### Q&A with RAG using LangChain RetrievalQA

In [None]:
from langchain_aws import ChatBedrock as BedrockChat


llm = BedrockChat(
    model_id=model_id,
    model_kwargs={
        "max_tokens": 512,
        "temperature": 0,
        "top_p": 0.9
    }
)

In [None]:
from langchain.prompts import PromptTemplate


PROMPT_TEMPLATE = """
Human: You are a financial advisor AI system that answers questions using fact-based and statistical information when possible.
Use the following pieces of information to provide a concise answer to the question enclosed in <question> tags.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
<context>
{context}
</context>

<question>
{question}
</question>

The response should be specific and use statistics or numbers when possible.

Assistant:"""
claude_prompt = PromptTemplate(template=PROMPT_TEMPLATE,
                               input_variables=["context", "question"])

In [None]:
from langchain.chains import RetrievalQA


qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": claude_prompt}
)

In [None]:
answer = qa.invoke(query)
pp.pprint(answer)

### Q&A with RAG using LCEL (LangChain Expression Language) Chains

In [None]:
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import (
  create_retrieval_chain
)
from langchain import hub


retrieval_qa_chat_prompt = hub.pull("langchain-ai/retrieval-qa-chat")
combine_docs_chain = create_stuff_documents_chain(llm, retrieval_qa_chat_prompt)
retrieval_qa_chain = create_retrieval_chain(retriever, combine_docs_chain)

In [None]:
answer = retrieval_qa_chain.invoke({'input': query})
pp.pprint(answer)
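
The dictionary returned by the retrieval chain contains the generated `answer` alongside the retrieved documents under `context`. The sketch below condenses that output into an answer plus a short list of sources. `summarize_answer` is a hypothetical helper, and it assumes that each document's `metadata` carries the `location` and `score` fields set by the Knowledge Bases retriever; documents without them fall back to `"unknown"` and `None`.

```python
def summarize_answer(result, preview_chars=80):
    """Condense a retrieval-chain result into the answer and its sources."""
    sources = []
    for doc in result.get("context", []):
        # Assumption: metadata holds 'location' and 'score' from the retriever.
        location = doc.metadata.get("location", {})
        sources.append({
            "uri": location.get("s3Location", {}).get("uri", "unknown"),
            "score": doc.metadata.get("score"),
            "preview": doc.page_content[:preview_chars],
        })
    return {"answer": result.get("answer", ""), "sources": sources}
```

Running `pp.pprint(summarize_answer(answer))` after the cell above gives a more compact view than printing the full chain output.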

## Cleanup

To avoid incurring future charges, delete the resources you created. You can do this by deleting the CloudFormation stacks that were deployed for this example (for instance, `BedrockKnowledgeBaseStack`).
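
If you prefer to delete a stack programmatically, the following sketch shows one way to do it with a CloudFormation client. `delete_stack_and_wait` is a hypothetical helper, and the client is passed in as a parameter. Note that the prerequisites above only attach `AWSCloudFormationReadOnlyAccess` to the notebook role, so this would need to run with credentials that are allowed to delete stacks.

```python
def delete_stack_and_wait(cfn_client, stack_name):
    """Request deletion of a CloudFormation stack and block until it is gone."""
    cfn_client.delete_stack(StackName=stack_name)

    # The waiter polls the stack status until deletion completes
    # and raises an error if the deletion fails.
    waiter = cfn_client.get_waiter("stack_delete_complete")
    waiter.wait(StackName=stack_name)
    return stack_name


# Example (destructive, shown commented out on purpose):
# import boto3
# cfn = boto3.client("cloudformation", region_name=aws_region)
# delete_stack_and_wait(cfn, CFN_STACK_NAME)
```

Deleting the stack through the AWS console or with the same CDK tooling that deployed it achieves the same result.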

---

## Conclusion

In this notebook, we saw how to use Knowledge Bases for Amazon Bedrock to generate embeddings for a set of documents, ingest those embeddings into Amazon OpenSearch Serverless, and run a similarity search for user input against the documents (embeddings) stored there. We also used LangChain as an abstraction layer to talk to both Amazon Bedrock and the knowledge base backed by Amazon OpenSearch Serverless.

## References

  * [Amazon Bedrock Knowledge Base - Samples for building RAG workflows](https://github.com/aws-samples/amazon-bedrock-samples/tree/main/knowledge-bases) - This repository contains examples for customers to get started using the Amazon Bedrock Service.
  * [Build a powerful question answering bot with Amazon SageMaker, Amazon OpenSearch Service, Streamlit, and LangChain](https://aws.amazon.com/blogs/machine-learning/build-a-powerful-question-answering-bot-with-amazon-sagemaker-amazon-opensearch-service-streamlit-and-langchain/)
  * [Using the Amazon SageMaker Studio Image Build CLI to build container images from your Studio notebooks](https://aws.amazon.com/blogs/machine-learning/using-the-amazon-sagemaker-studio-image-build-cli-to-build-container-images-from-your-studio-notebooks/)
  * [LangChain](https://python.langchain.com/docs/get_started/introduction.html) - A framework for developing applications powered by language models.
  * [LangChain-AWS](https://python.langchain.com/v0.1/docs/integrations/platforms/aws/) - The `LangChain` integrations related to `Amazon AWS` platform.
  * [LangChain > Components > Chains](https://python.langchain.com/v0.1/docs/modules/chains/) - Chains refer to sequences of calls - whether to an LLM, a tool, or a data preprocessing step. The primary supported way to do this is with [LCEL](https://python.langchain.com/v0.1/docs/expression_language/).
  * [LangChain Use cases > Q&A with RAG](https://python.langchain.com/v0.1/docs/use_cases/question_answering/)