# OpenSearch Serverless Collection Creation
This notebook demonstrates how to create an OpenSearch Serverless collection using the AWS SDK for Python (Boto3). OpenSearch Serverless is an on-demand, fully managed configuration of Amazon OpenSearch Service that removes the need to provision and manage OpenSearch clusters yourself. It automatically provisions, configures, and scales the resources required to run OpenSearch workloads.

In recent years, machine learning (ML) techniques have become increasingly popular for enhancing search. Among them are embedding models, which encode a large body of data into an n-dimensional space: each entity is encoded as a vector (a data point in that space), and the space is organized so that similar entities lie closer together. An embedding model could, for instance, encode the semantics of a corpus.

By searching for the vectors nearest to an encoded document — k-nearest neighbor (k-NN) search — you can find the most semantically similar documents. Sophisticated embedding models can support multiple modalities, for instance, encoding the image and text of a product catalog and enabling similarity matching on both modalities.
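
To make the idea of nearest-vector search concrete, here is a minimal sketch of exact (brute-force) k-NN in plain Python. The three-dimensional vectors and document names are made up for illustration, and cosine similarity is just one common choice of similarity measure; it is not necessarily the measure used by the index created later in this notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_search(query, corpus, k=2):
    """Return the k (doc_id, score) pairs most similar to the query vector."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" (hypothetical values, far smaller than real model output)
toy_corpus = {
    "baseball_batter": [0.9, 0.1, 0.0],
    "baseball_team": [0.8, 0.3, 0.1],
    "mountain_lake": [0.0, 0.2, 0.9],
}

nearest = knn_search([1.0, 0.2, 0.0], toy_corpus, k=2)
print([doc_id for doc_id, _ in nearest])
```

Every query compares against every stored vector, which is why exact search becomes expensive as the corpus grows.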

With OpenSearch Service’s vector database capabilities, you can implement semantic search, Retrieval Augmented Generation (RAG) with LLMs, recommendation engines, and search rich media.



## Install required libraries
The following cell installs the required Python libraries specified in the `requirements.txt` file.

In [4]:
#This cell installs the required libraries specified in the 'requirements.txt' file
!pip install -r requirements.txt --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
sparkmagic 0.21.0 requires pandas<2.0.0,>=0.17.1, but you have pandas 2.1.4 which is incompatible.

In [5]:
import os
import pandas as pd
import sagemaker
import boto3
import json
import pprint
import random 
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth, helpers
import time

session = boto3.session.Session()
region_name = session.region_name

## Required permissions
Your role or user needs several policies attached to run the code below, including AmazonBedrockFullAccess, AmazonOpenSearchServiceFullAccess, and the following policy for OpenSearch Serverless. This policy grants full access to the OpenSearch Serverless service, allowing you to create, manage, and delete OpenSearch Serverless resources.

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "aoss:*",
            "Resource": "*"
        }
    ]
}
```

The following cells create this policy and attach it to the current user or role. If running in a SageMaker notebook, the code will attempt to attach the policy to the SageMaker execution role.

In [6]:
# Create an IAM client
iam = boto3.client('iam')

suffix = random.randrange(200, 900)

# Define the policy document
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "aoss:*",
            "Resource": "*"
        }
    ]
}

# Create the IAM policy
aossAccessPolicy = iam.create_policy(
    PolicyName='AOSSAccessPolicy-{0}'.format(suffix),
    PolicyDocument=json.dumps(policy_document)
)


aossAccessPolicyArn = aossAccessPolicy["Policy"]["Arn"]

#wait for the policy to be created
time.sleep(10)
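
As an alternative to a fixed sleep, the wait could be expressed with the `policy_exists` waiter that Boto3 provides for IAM. This is a minimal sketch, not what the notebook runs; the helper name `wait_for_policy` and its default delay and attempt values are assumptions chosen for illustration. The waiter only confirms that IAM can return the policy, so it may not cover every propagation delay.

```python
def wait_for_policy(iam_client, policy_arn, delay_seconds=2, max_attempts=10):
    """Block until IAM reports that the policy exists, using the built-in waiter."""
    waiter = iam_client.get_waiter("policy_exists")
    waiter.wait(
        PolicyArn=policy_arn,
        WaiterConfig={"Delay": delay_seconds, "MaxAttempts": max_attempts},
    )


# In this notebook the call could look like this (requires AWS credentials):
# wait_for_policy(iam, aossAccessPolicyArn)
```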

In [7]:
# get the current identity ARN
# if running this in SageMaker this should indicate a SageMaker execution role
identity_arn = ""

try:
    # Get the execution role ARN
    identity_arn = sagemaker.get_execution_role()
    
except Exception as e:
    print("Not a sagemaker role, trying to retrieve the user identity")
    # Create an STS client
    sts_client = boto3.client('sts')

    # Get the caller identity
    caller_identity = sts_client.get_caller_identity()
    identity_arn = caller_identity['Arn']

print(f"Identity ARN:{identity_arn}")

Identity ARN:arn:aws:iam::207390309313:role/service-role/AmazonSageMaker-ExecutionRole-20220409T200590


In [8]:
# Check if the identity ARN is for a user or a role

try:
    # Try to get the user information
    user = iam.get_user(UserName=identity_arn.split('/')[-1])
    print(f"The identity ARN '{identity_arn}' is for a user.")

    # Attach the policy to the user
    iam.attach_user_policy(
        UserName=user['User']['UserName'],
        PolicyArn=aossAccessPolicyArn
    )
except iam.exceptions.NoSuchEntityException:
    # If the identity ARN is not for a user, it must be for a role
    print(f"The identity ARN '{identity_arn}' is for a role.")

    # Attach the policy to the role
    iam.attach_role_policy(
        RoleName=identity_arn.split('/')[-1],
        PolicyArn=aossAccessPolicyArn
    )

The identity ARN 'arn:aws:iam::207390309313:role/service-role/AmazonSageMaker-ExecutionRole-20220409T200590' is for a role.
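
One caveat with taking the last `/`-separated segment of the ARN: when the STS fallback runs under an assumed role, `get_caller_identity` returns an ARN of the form `arn:aws:sts::<account>:assumed-role/<role-name>/<session-name>`, so the last segment is the session name rather than the role name. The sketch below shows one way to tell the cases apart by parsing the ARN. The helper name `parse_iam_identity` and the example ARNs are hypothetical.

```python
def parse_iam_identity(arn):
    """Return (kind, name) for an IAM user, IAM role, or STS assumed-role ARN."""
    # ARN layout: arn:partition:service:region:account-id:resource
    resource = arn.split(":", 5)[5]
    parts = resource.split("/")
    kind = parts[0]
    if kind == "assumed-role":
        # assumed-role/<role-name>/<session-name>: the role name is the middle part
        return "role", parts[1]
    if kind in ("user", "role"):
        # user/<optional path/>name or role/<optional path/>name
        return kind, parts[-1]
    raise ValueError(f"Unsupported identity ARN: {arn}")


# Example ARNs with a placeholder account ID
print(parse_iam_identity("arn:aws:iam::123456789012:role/service-role/ExampleRole"))
print(parse_iam_identity("arn:aws:sts::123456789012:assumed-role/ExampleRole/my-session"))
```

With the kind known up front, the code can call `attach_user_policy` or `attach_role_policy` directly instead of relying on a `NoSuchEntityException`.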


## OpenSearch Collection Creation
Now that the policy granting full access to OpenSearch Serverless (OSS) is created and attached, we are ready to create an OSS collection to house our embeddings and enriched metadata. A few additional policies are required before we can create the collection:

1. Data access policy - grants the current user or role (set as the principal) permissions on the collection and its indexes, such as creating indexes and reading and writing documents.
2. Security policy (encryption) - uses AWS owned keys for encryption.
3. Network policy - allows access from the public internet. NOTE: this is not recommended in production environments, where you should define a policy that limits access to specific resources.
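
For comparison with the public network policy used in this lab, here is a sketch of what a more restrictive policy document might look like, limiting access to a VPC endpoint. It assumes an OpenSearch Serverless VPC endpoint already exists; the endpoint ID, collection name, and helper name `build_private_network_policy` are placeholders, not values from this notebook.

```python
import json


def build_private_network_policy(collection_name, vpc_endpoint_ids):
    """Build a network policy document that allows access only via the given VPC endpoints."""
    return json.dumps([
        {
            "Description": "Private access for the collection",
            "Rules": [
                {
                    "ResourceType": "collection",
                    "Resource": [f"collection/{collection_name}"],
                }
            ],
            "AllowFromPublic": False,
            "SourceVPCEs": list(vpc_endpoint_ids),
        }
    ])


# Placeholder values for illustration only
policy_json = build_private_network_policy("media-search-example", ["vpce-0123456789abcdef0"])
print(policy_json)
```

The resulting string would be passed as the `policy` argument of `create_security_policy` with `type='network'`, in the same way as the public policy.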

The following cells instantiate the Boto3 OSS client and create the required data access, encryption, and network policies for the collection and index.



In [9]:
# data access policy for OSS

collection_name = 'media-search-{0}'.format(suffix)
# Create an OpenSearch Serverless client
oss_client = boto3.client('opensearchserverless')

# define the data access policy
data_access_policy = json.dumps([
      {
        "Rules": [
          {
            "Resource": [
              f"collection/{collection_name}"
            ],
            "Permission": [
              "aoss:CreateCollectionItems",
              "aoss:DeleteCollectionItems",
              "aoss:UpdateCollectionItems",
              "aoss:DescribeCollectionItems"
            ],
            "ResourceType": "collection"
          },
          {
            "Resource": [
              f"index/{collection_name}/*"
            ],
            "Permission": [
              "aoss:CreateIndex",
              "aoss:DeleteIndex",
              "aoss:UpdateIndex",
              "aoss:DescribeIndex",
              "aoss:ReadDocument",
              "aoss:WriteDocument"
            ],
            "ResourceType": "index"
          }
        ],
        "Principal": [
          identity_arn
        ],
        "Description": "data-access-rule"
      }
    ], indent=2)

data_access_policy_name_nb = f"{collection_name}-policy-notebook"

# Create the data access policy
response = oss_client.create_access_policy(
    description='Data access policy for semantic search collection',
    name=data_access_policy_name_nb,
    policy=str(data_access_policy),
    type='data'
)

pprint.pp(response)

{'accessPolicyDetail': {'createdDate': 1732368203020,
                        'description': 'Data access policy for semantic search '
                                       'collection',
                        'lastModifiedDate': 1732368203020,
                        'name': 'media-search-833-policy-notebook',
                        'policy': [{'Rules': [{'Resource': ['collection/media-search-833'],
                                               'Permission': ['aoss:CreateCollectionItems',
                                                              'aoss:DeleteCollectionItems',
                                                              'aoss:UpdateCollectionItems',
                                                              'aoss:DescribeCollectionItems'],
                                               'ResourceType': 'collection'},
                                              {'Resource': ['index/media-search-833/*'],
                                               'Permiss

In [10]:
# create the security policy 
encryption_policy_name = f"{collection_name}-sp-notebook"

encryption_policy = oss_client.create_security_policy(
    name=encryption_policy_name,
    policy=json.dumps(
        {
            'Rules': [{'Resource': ['collection/' + collection_name],
                       'ResourceType': 'collection'}],
            'AWSOwnedKey': True
        }),
        type='encryption'
    )
pprint.pp(encryption_policy)

{'securityPolicyDetail': {'createdDate': 1732368206594,
                          'lastModifiedDate': 1732368206594,
                          'name': 'media-search-833-sp-notebook',
                          'policy': {'Rules': [{'Resource': ['collection/media-search-833'],
                                                'ResourceType': 'collection'}],
                                     'AWSOwnedKey': True},
                          'policyVersion': 'MTczMjM2ODIwNjU5NF8x',
                          'type': 'encryption'},
 'ResponseMetadata': {'RequestId': 'bf644c45-a2ec-4e4f-b361-b2fdb2334369',
                      'HTTPStatusCode': 200,
                      'HTTPHeaders': {'x-amzn-requestid': 'bf644c45-a2ec-4e4f-b361-b2fdb2334369',
                                      'date': 'Sat, 23 Nov 2024 13:23:26 GMT',
                                      'content-type': 'application/x-amz-json-1.0',
                                      'content-length': '297',
                         

In [11]:
# create the network policy 
network_policy_name = f"{collection_name}-np-notebook"
network_policy = oss_client.create_security_policy(
    name=network_policy_name,
    policy=json.dumps(
        [
            {'Rules': [{'Resource': ['collection/' + collection_name],
                        'ResourceType': 'collection'}],
             'AllowFromPublic': True}
        ]),
        type='network'
    )

pprint.pp(network_policy)

{'securityPolicyDetail': {'createdDate': 1732368209301,
                          'lastModifiedDate': 1732368209301,
                          'name': 'media-search-833-np-notebook',
                          'policy': [{'Rules': [{'Resource': ['collection/media-search-833'],
                                                 'ResourceType': 'collection'}],
                                      'AllowFromPublic': True}],
                          'policyVersion': 'MTczMjM2ODIwOTMwMV8x',
                          'type': 'network'},
 'ResponseMetadata': {'RequestId': '8131728e-a347-454c-9512-ee02a508f454',
                      'HTTPStatusCode': 200,
                      'HTTPHeaders': {'x-amzn-requestid': '8131728e-a347-454c-9512-ee02a508f454',
                                      'date': 'Sat, 23 Nov 2024 13:23:29 GMT',
                                      'content-type': 'application/x-amz-json-1.0',
                                      'content-length': '300',
                    

We are now ready to create the OSS collection and index. The following cells create a collection, an index, and the index schema required to house our metadata, including a vector field to store our embeddings. A `search_client` (an `OpenSearch` client from opensearch-py) is created in order to create the index and issue calls to OSS.

In [None]:
# create the collection of type vector search
oss_client = boto3.client('opensearchserverless')
collection = oss_client.create_collection(name=collection_name, type='VECTORSEARCH')
collection_id = collection['createCollectionDetail']['id']
host = collection_id + '.' + region_name + '.aoss.amazonaws.com'
print("OSS host: {0}".format(host))

# create the OSS client
service = 'aoss'
credentials = boto3.Session().get_credentials()
awsauth = AWSV4SignerAuth(credentials, region_name, service)

# Build the OpenSearch client
search_client = OpenSearch(
    hosts=[{'host': host, 'port': 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
    timeout=300
)
# It can take up to a minute for data access rules to be enforced
time.sleep(60)
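
Collection creation is asynchronous, so the collection may still be in the `CREATING` state when the sleep ends. As a hedged alternative to a fixed wait, the status could be polled with `batch_get_collection` until it becomes `ACTIVE`. The helper name `wait_for_collection_active` and its polling interval and timeout are assumptions for illustration.

```python
import time


def wait_for_collection_active(oss_client, collection_name, poll_seconds=10, timeout_seconds=600):
    """Poll the collection status until it is ACTIVE, then return its detail record."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        details = oss_client.batch_get_collection(names=[collection_name])["collectionDetails"]
        if details:
            status = details[0]["status"]
            if status == "ACTIVE":
                return details[0]
            if status == "FAILED":
                raise RuntimeError(f"Collection {collection_name} failed to create")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Collection {collection_name} not ACTIVE after {timeout_seconds}s")
        time.sleep(poll_seconds)


# In this notebook the call could look like this (requires AWS credentials):
# detail = wait_for_collection_active(oss_client, collection_name)
# print(detail["collectionEndpoint"])
```

The wait for data access rules to take effect is a separate concern, so a short pause before the first index call may still be needed.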

### Defining the OpenSearch Serverless index
The following cell defines an OSS index schema for our dataset and creates the index via the OpenSearch client. The schema defines the fields and their data type mappings. For our use case it is a fairly flat structure, but nested and JSON structures can also be defined as needed to store more complex relationships.

Exact k-NN (nearest neighbor) search uses a brute-force approach to find similar vectors, which is inefficient and resource-intensive for large datasets or high-dimensional embeddings. For this purpose, OpenSearch uses Approximate Nearest Neighbor (ANN) algorithms from the nmslib, Faiss, or Lucene libraries to power k-NN search. Further details are provided in the next notebook; for the purposes of this lab we use the Faiss engine, which also supports the k-NN filtering used later to drive more intelligent search.

In this instance, we only need to store the image ID, path, title, description, keywords, and tags, as well as the vector embeddings created in the enrichment notebook. Note that the `dimension` of the `knn_vector` field must match the size of those embeddings.
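
A quick sanity check before indexing can catch a dimension mismatch early. The sketch below uses made-up vectors; the helper name `find_dimension_mismatches` is hypothetical, and in the notebook the expected size would be the `dimension` value in the index mapping.

```python
def find_dimension_mismatches(embeddings, expected_dimension):
    """Return the positions of embeddings whose length differs from expected_dimension."""
    return [i for i, vec in enumerate(embeddings) if len(vec) != expected_dimension]


# Toy data: the second vector has the wrong size
toy_embeddings = [[0.1] * 1024, [0.2] * 384, [0.3] * 1024]
mismatches = find_dimension_mismatches(toy_embeddings, expected_dimension=1024)
print(mismatches)
```

In the notebook this could be applied to the embeddings column of the dataframe before any documents are sent to the index.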



In [20]:
#index configuration. note that we're adding both text metadata as well as the vector mapping property that will be storing our embedding for each media asset.
# For additional information on the K-NN index configuration, please read the below documentation.
#https://opensearch.org/docs/latest/field-types/supported-field-types/knn-vector/
#https://opensearch.org/docs/latest/search-plugins/knn/knn-index/

index_name = 'smart-search-index-faiss'
index_body = {
   "settings": {
      "index.knn": "true"
   },
   "mappings": {
      "properties": {
          "image_vector": {
              "type": "knn_vector",
              "dimension": 1024, # Embedding size for the Amazon Titan Multimodal Embeddings G1 model: 1,024 (default), 384, or 256
              "method": {
                  "engine": "faiss",
                  "name": "hnsw"
                }
          },
          "image_id" : {"type": "text"},
          "path": {"type": "text"},
          "title": {"type": "text"},
          "description": {"type": "text"},
          "keywords": {"type": "text"},
          "tags": {"type": "text"}
      }
   }
}

# We would get an index already exists exception if the index already exists, and that is fine.
try:
    response = search_client.indices.create(index_name, body=index_body)
    print(f"response received for the create index -> {response}")
except Exception as e:
    print(f"error in creating index={index_name}, exception={e}")

response received for the create index -> {'acknowledged': True, 'shards_acknowledged': True, 'index': 'smart-search-index-faiss'}
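
Instead of relying on the exception raised when the index already exists, the check could be made explicit. This is a sketch under the assumption that the `indices.exists` call is available against the collection; the helper name `create_index_if_missing` is hypothetical.

```python
def create_index_if_missing(client, index_name, index_body):
    """Create the index only when it does not exist yet; return True if it was created."""
    if client.indices.exists(index=index_name):
        return False
    client.indices.create(index=index_name, body=index_body)
    return True


# In this notebook the call could look like this:
# created = create_index_if_missing(search_client, index_name, index_body)
```

Keeping the existence check separate means that other failures, such as a permissions error, are not silently treated as "index already exists".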


In [21]:
#display information on the index you just created

# Get index mapping
response = search_client.indices.get_mapping(index=index_name)
pprint.pp(response)

# Get index settings
response = search_client.indices.get_settings(index=index_name)
pprint.pp(response)

# Get index aliases
response = search_client.indices.get_alias(index=index_name) 
pprint.pp(response)

{'smart-search-index-faiss': {'mappings': {'properties': {'description': {'type': 'text'},
                                                          'image_id': {'type': 'text'},
                                                          'image_vector': {'type': 'knn_vector',
                                                                           'dimension': 1024,
                                                                           'method': {'engine': 'faiss',
                                                                                      'space_type': 'l2',
                                                                                      'name': 'hnsw',
                                                                                      'parameters': {}}},
                                                          'keywords': {'type': 'text'},
                                                          'path': {'type': 'text'},
                                       

In [22]:
# deleting indices
# search_client.indices.delete(index=index_name)

## Loading the data in the index 

The index is created and ready for use. The following cells reload the data from the previous notebook and populate the index so that we can issue queries.
If there are any issues with the stored variable, uncomment the line that reloads the dataset from the CSV file saved in notebook 2.

In [23]:
# load the dataset from notebook 2 
%store -r df_metadata
# df_metadata = pd.read_csv('./data/enriched_dataset.csv')
df_metadata.head()

| Unnamed: 0 | image_id | path | title | description | keywords | tags | embeddings |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 00a7655d4eabf186.jpg | ./data/resized-images/00a7655d4eabf186.jpg | Baseball player batting during a game | The image depicts a baseball player in a batti... | baseball, sports, batting, player, jersey, sta... | baseball, sports, athlete, action, competition | [0.041014045, 0.021385895, -0.02275303, 0.0187... |
| 1 | 19aa926f2f7d9782.jpg | ./data/resized-images/19aa926f2f7d9782.jpg | Baseball Player Batting During a Game | The image depicts a professional baseball play... | baseball, sports, athlete, batting, swing, sta... | baseball, sports, athlete, batting, stadium, g... | [0.03466529, 0.02793032, -0.029118843, -0.0156... |
| 2 | 39209fa476d1430c.jpg | ./data/resized-images/39209fa476d1430c.jpg | Baseball Players Gathered on Field | The image depicts a group of baseball players ... | baseball, players, team, field, uniforms, bats... | sports, baseball, athletes, coaching, team gat... | [0.031110628, 0.030195609, -0.060391217, -0.01... |
| 3 | 1efc2db85591a04f.jpg | ./data/resized-images/1efc2db85591a04f.jpg | Baseball batter at home plate | The image depicts a baseball game in progress.... | baseball, batter, swing, pitch, catcher, field... | baseball, sports, game, athlete, competition | [0.047018528, 0.025594763, -0.030524123, 0.023... |
| 4 | 3c06d149c8027e71.jpg | ./data/resized-images/3c06d149c8027e71.jpg | Youth Baseball Player Batting | The image depicts a young baseball player in a... | baseball, youth sports, batting, swing, athlet... | sports, baseball, youth, athlete, batting | [0.011220759, 0.06612398, -0.048392408, -0.019... |


Load the entire contents of the dataframe into the OpenSearch index. The following cell does this by iterating over the dataframe and indexing each row as a document.

In [24]:
%%time
from tqdm.notebook import tqdm

# Index each dataframe row as one document in the OpenSearch index
for idx, record in tqdm(df_metadata.iterrows(), total=len(df_metadata)):
    document = {
        "image_vector": record["embeddings"],
        "description": record["description"],
        "image_id": record["image_id"],
        # Stored under 'image_url', although the index mapping above declares 'path'
        "image_url": record["path"],
        "title": record["title"],
        "keywords": record["keywords"],
        "tags": record["tags"],
    }
    response = search_client.index(
        index=index_name,
        body=document
    )

  0%|          | 0/107 [00:00<?, ?it/s]

CPU times: user 630 ms, sys: 35.5 ms, total: 665 ms
Wall time: 27.1 s
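
Indexing one document per request is simple but issues a round trip for every row. The `helpers` module imported at the top of this notebook offers a bulk API that could reduce that overhead. The sketch below only builds the bulk actions, using a made-up row; the helper name `generate_bulk_actions` is hypothetical. No `_id` is set, on the assumption that the collection assigns document IDs itself.

```python
def generate_bulk_actions(rows, index_name):
    """Yield one bulk 'index' action per row, mirroring the document layout used above."""
    for row in rows:
        yield {
            "_index": index_name,
            "_source": {
                "image_vector": row["embeddings"],
                "description": row["description"],
                "image_id": row["image_id"],
                "image_url": row["path"],
                "title": row["title"],
                "keywords": row["keywords"],
                "tags": row["tags"],
            },
        }


# Made-up row for illustration (real embeddings have 1,024 values)
toy_rows = [
    {
        "embeddings": [0.1, 0.2, 0.3],
        "description": "A toy description",
        "image_id": "example.jpg",
        "path": "./data/resized-images/example.jpg",
        "title": "Toy title",
        "keywords": "toy, example",
        "tags": "toy",
    }
]
actions = list(generate_bulk_actions(toy_rows, "smart-search-index-faiss"))
print(actions[0]["_source"]["image_url"])

# In this notebook the call could look like this:
# helpers.bulk(search_client, generate_bulk_actions(df_metadata.to_dict("records"), index_name))
```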


The following cell saves the collection, host, and index names, along with the policy names and identity ARN, of the created OpenSearch Serverless resources for use in Notebook 4.

In [26]:
# save variables for use in search notebook
%store index_name
%store data_access_policy_name_nb
%store network_policy_name
%store encryption_policy_name
%store aossAccessPolicyArn
%store collection_name
%store collection_id
%store host
%store identity_arn

Stored 'index_name' (str)
Stored 'data_access_policy_name_nb' (str)
Stored 'network_policy_name' (str)
Stored 'encryption_policy_name' (str)
Stored 'aossAccessPolicyArn' (str)
Stored 'collection_name' (str)
Stored 'collection_id' (str)
Stored 'host' (str)
