# Indexing Pipeline to write Documents to Amazon OpenSearch

**_Use of Amazon OpenSearch as a vector database for storing embeddings_**

This notebook works well with the `PyTorch 2.0.0 Python 3.10 CPU Optimized` kernel on a SageMaker Studio `ml.c5.2xlarge` instance.

The following package versions are used in this notebook.

```
!pip freeze | grep -E "sagemaker|boto3|haystack|opensearch|transformers|torch"
------------------------------------------
boto3==1.26.132
farm-haystack==1.21.0
opensearch-py==2.3.1
sagemaker==2.188.0
sagemaker-experiments==0.1.43
sagemaker-pytorch-training==2.8.0
sagemaker-training==4.5.0
sentence-transformers==2.2.2
smdebug @ file:///tmp/sagemaker-debugger
torch==2.0.0
torchaudio==2.0.1
torchdata @ file:///opt/conda/conda-bld/torchdata_1679615656247/work
torchtext==0.15.1
torchvision==0.15.1
transformers==4.32.1
```

In [1]:
!pip install -U -r requirements.txt
!pip install -U sagemaker

In [2]:
!pip freeze | grep -E "sagemaker|boto3|haystack|opensearch|transformers|torch"


In [3]:
import boto3
import json


def get_opensearch_endpoint(stack_name: str, region_name: str = 'us-east-1'):
    cf_client = boto3.client('cloudformation', region_name=region_name)
    response = cf_client.describe_stacks(StackName=stack_name)
    outputs = response["Stacks"][0]["Outputs"]

    # Outputs that are not exported have no 'ExportName' key, so use .get() to avoid a KeyError.
    ops_endpoint = [e for e in outputs if e.get('ExportName') == 'OpenSearchDomainEndpoint'][0]
    ops_endpoint_name = ops_endpoint['OutputValue']
    return ops_endpoint_name



def get_secret_name(stack_name: str, region_name: str = 'us-east-1'):
    cf_client = boto3.client('cloudformation', region_name=region_name)
    response = cf_client.describe_stacks(StackName=stack_name)
    outputs = response["Stacks"][0]["Outputs"]

    # Outputs that are not exported have no 'ExportName' key, so use .get() to avoid a KeyError.
    secrets = [e for e in outputs if e.get('ExportName') == 'MasterUserSecretId'][0]
    secret_name = secrets['OutputValue']
    return secret_name


def get_secret(secret_name: str, region_name: str = 'us-east-1'):
    client = boto3.client('secretsmanager', region_name=region_name)
    get_secret_value_response = client.get_secret_value(SecretId=secret_name)
    secret = get_secret_value_response['SecretString']

    return json.loads(secret)
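
The lookup logic in the two CloudFormation helpers above can be tried offline. The following is a minimal sketch, not part of the original notebook: `find_stack_output` and `fake_response` are hypothetical names, and the dictionary only imitates the shape of a `describe_stacks` response.

```python
from typing import Any, Dict


def find_stack_output(response: Dict[str, Any], export_name: str) -> str:
    """Return the OutputValue of the stack output exported under export_name."""
    outputs = response["Stacks"][0].get("Outputs", [])
    for output in outputs:
        # Outputs without an export carry no 'ExportName' key, hence .get().
        if output.get("ExportName") == export_name:
            return output["OutputValue"]
    raise KeyError(f"No stack output exported as {export_name!r}")


# Hypothetical data shaped like a describe_stacks response (values are made up).
fake_response = {
    "Stacks": [
        {
            "Outputs": [
                {"OutputKey": "VpcId", "OutputValue": "vpc-0123"},
                {
                    "OutputKey": "Endpoint",
                    "OutputValue": "search-example.us-east-1.es.amazonaws.com",
                    "ExportName": "OpenSearchDomainEndpoint",
                },
            ]
        }
    ]
}

endpoint = find_stack_output(fake_response, "OpenSearchDomainEndpoint")
print(endpoint)
```

Raising a `KeyError` with the export name makes a missing output easier to diagnose than an `IndexError` from indexing an empty list.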

In [4]:
AWS_REGION_NAME = boto3.Session().region_name

AWS_REGION_NAME

In [5]:
stack_name = 'RAGHaystackOpenSearchStack'
secret_name = get_secret_name(stack_name, region_name=AWS_REGION_NAME)
secret = get_secret(secret_name, region_name=AWS_REGION_NAME)
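# Show only the secret's keys; displaying the whole secret would write the password into the notebook output.
display(list(secret.keys()))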

opensearch_endpoint = get_opensearch_endpoint(stack_name, region_name=AWS_REGION_NAME)
display(opensearch_endpoint)

In [6]:
OPENSEARCH_HOST = opensearch_endpoint
OPENSEARCH_PORT = 443
OPENSEARCH_USERNAME = secret['username']
OPENSEARCH_PASSWORD = secret['password']

In [7]:
import warnings

warnings.filterwarnings("ignore")  # suppress all warnings, which also avoids printing out absolute paths

In [8]:
from haystack.document_stores import OpenSearchDocumentStore

In [9]:
doc_store = OpenSearchDocumentStore(host=OPENSEARCH_HOST,
                                    port=OPENSEARCH_PORT,
                                    username=OPENSEARCH_USERNAME,
                                    password=OPENSEARCH_PASSWORD,
                                    embedding_dim=384)

In [10]:
from haystack.nodes import JsonConverter

converter = JsonConverter()

In [11]:
from haystack.nodes import PreProcessor

preprocessor = PreProcessor(
    clean_empty_lines=True,
    split_by='word',
    split_respect_sentence_boundary=True,
    split_length=80,
    split_overlap=20
)
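
To illustrate what `split_length=80` and `split_overlap=20` mean, here is a simplified sketch of word-based splitting with overlap. It is not Haystack's implementation: `split_words_with_overlap` is a hypothetical helper, it tokenizes on whitespace only, and it ignores `split_respect_sentence_boundary`, so the real `PreProcessor` will produce different chunk boundaries.

```python
from typing import List


def split_words_with_overlap(text: str, split_length: int, split_overlap: int) -> List[str]:
    """Split text into chunks of split_length words; consecutive chunks share split_overlap words."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap  # each new chunk starts this many words later
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # the current chunk already reaches the end of the text
    return chunks


# Synthetic 200-word text: w0 w1 ... w199
sample = " ".join(f"w{i}" for i in range(200))
chunks = split_words_with_overlap(sample, split_length=80, split_overlap=20)
for chunk in chunks:
    tokens = chunk.split()
    print(tokens[0], "...", tokens[-1], f"({len(tokens)} words)")
```

With these settings each chunk begins 60 words after the previous one, so the last 20 words of a chunk reappear at the start of the next.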

In [12]:
from haystack.nodes import EmbeddingRetriever

retriever = EmbeddingRetriever(
    document_store=doc_store,
    embedding_model="sentence-transformers/all-MiniLM-L12-v2",
    devices=["cpu"],
    top_k=5
)

In [13]:
from haystack import Pipeline

indexing_pipeline = Pipeline()
indexing_pipeline.add_node(component=converter, name="Converter", inputs=["File"])
indexing_pipeline.add_node(component=preprocessor, name="PreProcessor", inputs=["Converter"])
indexing_pipeline.add_node(component=retriever, name="Retriever", inputs=["PreProcessor"])
indexing_pipeline.add_node(component=doc_store, name="DocumentStore", inputs=["Retriever"])

In [14]:
%%sh

mkdir -p data
cd ./data
wget https://raw.githubusercontent.com/deepset-ai/haystack-sagemaker/main/data/opensearch-documentation-2.7.json
wget https://raw.githubusercontent.com/deepset-ai/haystack-sagemaker/main/data/opensearch-website.json

In [15]:
indexing_pipeline.run(file_paths=[
    "data/opensearch-documentation-2.7.json",
    "data/opensearch-website.json"
])

---
## Run a similarity search of user input against documents (embeddings) in Amazon OpenSearch

In [16]:
from haystack.pipelines import DocumentSearchPipeline
from haystack.utils import print_documents

In [17]:
p_retrieval = DocumentSearchPipeline(retriever)

In [18]:
query = "What is OpenSearch?"
res = p_retrieval.run(query=query, params={"Retriever": {"top_k": 10}})
print_documents(res, max_text_len=200, print_meta=True)

Batches: 100%|██████████| 1/1 [00:00<00:00, 41.96it/s]



Query: What is OpenSearch?

{   'content': 'Really it’s a way to provide anything as a service, any '
               'application, and we’re pleased to announce that OpenSearch is '
               'the latest certified container available in the Virtuozzo '
               'DevOps PaaS solution.\n'
               'Op...',
    'meta': {   '_split_id': 1,
                '_split_overlap': [   {   'doc_id': '3f56a02ceac197f7bbaeac1e9973cb89',
                                          'range': [0, 332]},
                                      {   'doc_id': '246cd78e38ab76ea38332ca185e213f4',
                                          'range': [333, 549]}],
                'keywords': ['partners'],
                'title': 'Partner Highlight: How to Offer OpenSearch as a '
                         'Service using Virtuozzo DevOps PaaS',
                'type': 'News',
                'url': '/blog/opensearch-as-a-service/'},
    'name': None}

{   'content': 'OpenSearch provides a basis for ma
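
The ranking step behind the query above can be sketched with toy vectors. This is an illustration only, assuming dot-product scoring and an exhaustive scan; the similarity function actually used depends on how the document store is configured, and OpenSearch searches the 384-dimensional embeddings with its own k-NN index rather than a Python loop. `top_k_by_dot_product`, `toy_docs`, and `toy_query` are hypothetical names.

```python
from typing import Dict, List, Tuple


def top_k_by_dot_product(
    query_vec: List[float], doc_vecs: Dict[str, List[float]], top_k: int
) -> List[Tuple[str, float]]:
    """Score every document by dot product with the query and return the top_k best."""
    scored = [
        (doc_id, sum(q * d for q, d in zip(query_vec, vec)))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest score first
    return scored[:top_k]


# Toy 3-dimensional "embeddings" (made-up values).
toy_docs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.6, 0.8, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
toy_query = [0.8, 0.6, 0.0]

ranked = top_k_by_dot_product(toy_query, toy_docs, top_k=2)
print(ranked)
```

Here `doc_b` ranks first because it points in nearly the same direction as the query, while `doc_c` is orthogonal to it and drops out of the top 2.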

---
## Text Retrieval using BM25

In [19]:
from haystack.nodes import BM25Retriever

bm25_retriever = BM25Retriever(document_store=doc_store)
bm25_retrieval = DocumentSearchPipeline(bm25_retriever)

In [20]:
query = "What is OpenSearch?"
res = bm25_retrieval.run(query=query, params={"Retriever": {"top_k": 10}})
print_documents(res, max_text_len=200, print_meta=True)


Query: What is OpenSearch?

{   'content': 'Plugins are fundamental to how OpenSearch works, and the '
               'similarity extends to OpenSearch Dashboards too. Infact almost '
               'everything that you see inside OpenSearch Dashboards is built '
               'inside a plugin. A...',
    'meta': {   '_split_id': 0,
                '_split_overlap': [   {   'doc_id': 'acbabd38a6b5351a13ccd5a6624f94a6',
                                          'range': [334, 479]}],
                'keywords': ['technical-post'],
                'title': 'Introduction to OpenSearch Dashboard Plugins',
                'type': 'News',
                'url': '/blog/dashboards-plugins-intro/'},
    'name': None}

{   'content': 'Simulate an index by index name\n'
               'It is challenging to predict the appearance of the index when '
               'taking into account existing templates. To resolve this issue, '
               'OpenSearch will attempt to match the index ...',
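
Unlike the embedding search, BM25 ranks documents by term overlap with the query. The sketch below is a simplified Okapi BM25 scorer for intuition only; it assumes whitespace tokenization and the commonly used parameters `k1=1.2` and `b=0.75`, whereas OpenSearch applies its own analyzers and scoring internals. `bm25_scores` and `toy_corpus` are hypothetical names.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Return one simplified BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = (sum(len(tokens) for tokens in tokenized) / n_docs) or 1.0
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            doc_freq = sum(1 for other in tokenized if term in other)
            if doc_freq == 0:
                continue  # a term absent from the corpus contributes nothing
            # Rare terms get a larger inverse document frequency weight.
            idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
            freq = term_freq[term]
            # Term frequency saturates with k1 and is normalized by document length via b.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * freq * (k1 + 1) / (freq + k1 * length_norm)
        scores.append(score)
    return scores


toy_corpus = [
    "opensearch is a search engine",
    "plugins extend opensearch dashboards",
    "cats sleep all day",
]
scores = bm25_scores("opensearch plugins", toy_corpus)
print(scores)
```

The second document scores highest because it contains both query terms, and the third scores zero because it contains neither. This also shows why BM25 and embedding retrieval can return different top results for the same query.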
  

---

## References

  * [Build production-ready generative AI applications for enterprise search using Haystack pipelines and Amazon SageMaker JumpStart with LLMs (2023-08-14)](https://aws.amazon.com/blogs/machine-learning/build-production-ready-generative-ai-applications-for-enterprise-search-using-haystack-pipelines-and-amazon-sagemaker-jumpstart-with-llms/)
    * [Haystack Retrieval-Augmented Generative QA Pipelines with SageMaker JumpStart](https://github.com/deepset-ai/haystack-sagemaker/)
  * [Using the Amazon SageMaker Studio Image Build CLI to build container images from your Studio notebooks](https://aws.amazon.com/blogs/machine-learning/using-the-amazon-sagemaker-studio-image-build-cli-to-build-container-images-from-your-studio-notebooks/)
  * [Haystack](https://docs.haystack.deepset.ai/docs) - The open source Python framework by deepset for building custom apps with large language models (LLMs).
  * [Tutorial: How to Use Pipelines](https://haystack.deepset.ai/tutorials/11_pipelines)