# Vector Search using vCore-based Azure Cosmos DB for MongoDB

This notebook demonstrates using an Azure OpenAI embedding model to vectorize documents already stored in Azure Cosmos DB API for MongoDB, storing the embedding vectors and the creation of a vector index. Lastly, the notebook will demonstrate how to query the vector index to find similar documents.

This lab expects the data that was loaded in Lab 2.

In [2]:
import os
import pymongo
import time
import json
from openai import AzureOpenAI
from dotenv import load_dotenv
from tenacity import retry, wait_random_exponential, stop_after_attempt

## Load settings

This lab expects the `.env` file that was created in Lab 1 to obtain the connection string for the database.

Add the following entries into the `.env` file to support the connection to Azure OpenAI API, replacing the values for `<your key>` and `<your endpoint>` with the values from your Azure OpenAI API resource.

```text
AOAI_ENDPOINT="<your endpoint>"
AOAI_KEY="<your key>""
```

In [3]:
load_dotenv()
CONNECTION_STRING = os.environ.get("DB_CONNECTION_STRING")
EMBEDDINGS_DEPLOYMENT_NAME = "text-embedding-3-small"
COMPLETIONS_DEPLOYMENT_NAME = "gpt-35-turbo-16k"
AOAI_ENDPOINT = os.environ.get("AOAI_ENDPOINT")
AOAI_KEY = os.environ.get("AOAI_KEY")
AOAI_API_VERSION = "2023-05-15"

## Establish connectivity to the database

In [4]:
db_client = pymongo.MongoClient(CONNECTION_STRING)
# Create database to hold cosmic works data
# MongoDB will create the database if it does not exist
db = db_client.cosmic_works

  db_client = pymongo.MongoClient(CONNECTION_STRING)


## Establish Azure OpenAI connectivity

In [5]:
ai_client = AzureOpenAI(
    azure_endpoint = AOAI_ENDPOINT,
    api_version = AOAI_API_VERSION,
    api_key = AOAI_KEY
    )

## Vectorize and store the embeddings in each document

The process of creating a vector embedding field on each document only needs to be done once. However, if a document changes, the vector embedding field will need to be updated with an updated vector.

In [6]:
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(3))
def generate_embeddings(text: str):
    '''
    Generate embeddings from string of text using the deployed Azure OpenAI API embeddings model.
    This will be used to vectorize document data and incoming user messages for a similarity search with
    the vector index.
    '''
    response = ai_client.embeddings.create(input=text, model=EMBEDDINGS_DEPLOYMENT_NAME)
    embeddings = response.data[0].embedding
    time.sleep(0.5) # rest period to avoid rate limiting on AOAI
    return embeddings

In [7]:
# demonstrate embeddings generation using a test string
test = "hello, world"
print(generate_embeddings(test))

[-0.016808968, -0.034862, 0.018986076, 0.0018201469, -0.013154537, -0.0437118, -0.024584353, 0.037632864, -0.021290418, -0.008800322, 0.016596913, -0.017685467, -0.024867095, -0.02517811, 0.009507176, 0.030847073, -0.05315536, 0.042778756, 0.019466737, 0.02667664, 0.048263937, -0.0020003945, 0.006276856, 0.028047934, 0.02719971, -0.009952493, -0.007626946, 0.043259416, -0.0064005554, -0.049338352, 0.030592605, -0.03664327, -0.0017635986, -0.015522496, 0.021460062, 0.017402725, 0.028373087, -0.0071957656, 0.0015011794, -0.014137064, -0.014476353, -0.004608683, 0.03418342, 0.01078658, -0.010694689, 0.0133241825, -0.042015355, 0.018590238, 0.03129946, 0.07198593, -0.005704305, -0.015508359, 0.024004735, 0.09698026, 0.0013421375, -0.033052456, 0.009818191, 0.05468216, -0.024810547, 0.042184997, 0.017840974, 0.009295119, 0.007641083, 0.00021117239, 0.013069715, -0.0026489324, -0.023114098, 0.0080864, -0.008496375, 0.056124143, 0.0067009684, 0.041732613, -0.03293936, -0.007001381, -0.0365867

### Vectorize and update all documents in the Cosmic Works database

In [11]:
def add_collection_content_vector_field(collection_name: str):
    '''
    Add a new field to the collection to hold the vectorized content of each document.
    '''
    collection = db[collection_name]
    bulk_operations = []
    documents = list(collection.find())
    for doc in documents:
        # remove any previous contentVector embeddings
        if "contentVector" in doc:
            del doc["contentVector"]

        # generate embeddings for the document string representation
        content = json.dumps(doc, default=str)
        content_vector = generate_embeddings(content)       
        
        bulk_operations.append(pymongo.UpdateOne(
            {"_id": doc["_id"]},
            {"$set": {"contentVector": content_vector}},
            upsert=True
        ))
    # execute bulk operations
    collection.bulk_write(bulk_operations)

In [12]:
# Add vector field to products documents - this will take approximately 3-5 minutes due to rate limiting
add_collection_content_vector_field("products")

In [13]:
# Add vector field to customers documents - this will take approximately 1-2 minutes due to rate limiting
add_collection_content_vector_field("customers")

In [14]:
# Add vector field to customers documents - this will take approximately 15-20 minutes due to rate limiting
add_collection_content_vector_field("sales")

In [15]:
# Create the products vector index
db.command({
  'createIndexes': 'products',
  'indexes': [
    {
      'name': 'VectorSearchIndex',
      'key': {
        "contentVector": "cosmosSearch"
      },
      'cosmosSearchOptions': {
        'kind': 'vector-ivf',
        'numLists': 1,
        'similarity': 'COS',
        'dimensions': 1536
      }
    }
  ]
})

# Create the customers vector index
db.command({
  'createIndexes': 'customers',
  'indexes': [
    {
      'name': 'VectorSearchIndex',
      'key': {
        "contentVector": "cosmosSearch"
      },
      'cosmosSearchOptions': {
        'kind': 'vector-ivf',
        'numLists': 1,
        'similarity': 'COS',
        'dimensions': 1536
      }
    }
  ]
})

# Create the sales vector index
db.command({
  'createIndexes': 'sales',
  'indexes': [
    {
      'name': 'VectorSearchIndex',
      'key': {
        "contentVector": "cosmosSearch"
      },
      'cosmosSearchOptions': {
        'kind': 'vector-ivf',
        'numLists': 1,
        'similarity': 'COS',
        'dimensions': 1536
      }
    }
  ]
})

{'raw': {'defaultShard': {'numIndexesBefore': 1,
   'numIndexesAfter': 2,
   'createdCollectionAutomatically': False,
   'ok': 1}},
 'ok': 1}

## Use vector search in vCore-based Azure Cosmos DB for MongoDB

Now that each document has its associated vector embedding and the vector indexes have been created on each collection, we can now use the vector search capabilities of vCore-based Azure Cosmos DB for MongoDB.

In [16]:
def vector_search(collection_name, query, num_results=3):
    """
    Perform a vector search on the specified collection by vectorizing
    the query and searching the vector index for the most similar documents.

    returns a list of the top num_results most similar documents
    """
    collection = db[collection_name]
    query_embedding = generate_embeddings(query)    
    pipeline = [
        {
            '$search': {
                "cosmosSearch": {
                    "vector": query_embedding,
                    "path": "contentVector",
                    "k": num_results
                },
                "returnStoredSource": True }},
        {'$project': { 'similarityScore': { '$meta': 'searchScore' }, 'document' : '$$ROOT' } }
    ]
    results = collection.aggregate(pipeline)
    return results

def print_product_search_result(result):
    '''
    Print the search result document in a readable format
    '''
    print(f"Similarity Score: {result['similarityScore']}")  
    print(f"Name: {result['document']['name']}")   
    print(f"Category: {result['document']['categoryName']}")
    print(f"SKU: {result['document']['categoryName']}")
    print(f"_id: {result['document']['_id']}\n")

In [17]:
query = "What bikes do you have?"
results = vector_search("products", query, num_results=4)
for result in results:
    print_product_search_result(result)   

Similarity Score: 0.4092844832796503
Name: Mountain-100 Silver, 42
Category: Bikes, Mountain Bikes
SKU: Bikes, Mountain Bikes
_id: 4DA12D36-495E-4DCA-95B0-F18CAA099779

Similarity Score: 0.40601979386452625
Name: Mountain-300 Black, 48
Category: Bikes, Mountain Bikes
SKU: Bikes, Mountain Bikes
_id: E8767BC9-D6BA-47FC-9842-3511468869B6

Similarity Score: 0.40278359626678717
Name: Mountain-100 Black, 42
Category: Bikes, Mountain Bikes
SKU: Bikes, Mountain Bikes
_id: C0FBA4E8-B617-4889-B1A5-091D12783313

Similarity Score: 0.40272855758668136
Name: Mountain-500 Black, 42
Category: Bikes, Mountain Bikes
SKU: Bikes, Mountain Bikes
_id: 8B541087-A7F5-43B1-AC9F-EEFB4F4ADAFA



In [18]:
query = "What do you have that is yellow?"
results = vector_search("products", query, num_results=4)
for result in results:
    print_product_search_result(result)   

Similarity Score: 0.32723481492788764
Name: LL Touring Frame - Yellow, 62
Category: Components, Touring Frames
SKU: Components, Touring Frames
_id: 91AA100C-D092-4190-92A7-7C02410F04EA

Similarity Score: 0.31530604867303136
Name: LL Touring Frame - Yellow, 44
Category: Components, Touring Frames
SKU: Components, Touring Frames
_id: 6F733A5D-9B66-4718-B69C-627DE4E164BA

Similarity Score: 0.31411414402383264
Name: LL Touring Frame - Yellow, 54
Category: Components, Touring Frames
SKU: Components, Touring Frames
_id: 55594B1E-1E16-4B2E-A16F-983E492321BC

Similarity Score: 0.31382573760594557
Name: Touring-1000 Yellow, 60
Category: Bikes, Touring Bikes
SKU: Bikes, Touring Bikes
_id: 5B5E90B8-FEA2-4D6C-B728-EC586656FA6D



## Use vector search results in a RAG pattern with Chat GPT-3.5

In [19]:
# A system prompt describes the responsibilities, instructions, and persona of the AI.
system_prompt = """
You are a helpful, fun and friendly sales assistant for Cosmic Works, a bicycle and bicycle accessories store. 
Your name is Cosmo.
You are designed to answer questions about the products that Cosmic Works sells.

Only answer questions related to the information provided in the list of products below that are represented
in JSON format.

If you are asked a question that is not in the list, respond with "I don't know."

List of products:
"""

In [20]:
def rag_with_vector_search(question: str, num_results: int = 3):
    """
    Use the RAG model to generate a prompt using vector search results based on the
    incoming question.  
    """
    # perform the vector search and build product list
    results = vector_search("products", question, num_results=num_results)
    product_list = ""
    for result in results:
        if "contentVector" in result["document"]:
            del result["document"]["contentVector"]
        product_list += json.dumps(result["document"], indent=4, default=str) + "\n\n"

    # generate prompt for the LLM with vector results
    formatted_prompt = system_prompt + product_list

    # prepare the LLM request
    messages = [
        {"role": "system", "content": formatted_prompt},
        {"role": "user", "content": question}
    ]

    completion = ai_client.chat.completions.create(messages=messages, model=COMPLETIONS_DEPLOYMENT_NAME)
    return completion.choices[0].message.content

In [21]:
print(rag_with_vector_search("What bikes do you have?", 5))

We have the following bikes available:

1. Mountain-100 Silver, 42
2. Mountain-300 Black, 48
3. Mountain-100 Black, 42
4. Mountain-500 Black, 42
5. Mountain-200 Silver, 46

Please let me know if you have any specific questions about these bikes or if you would like more information.


In [22]:
print(rag_with_vector_search("What are the names and skus of yellow products?", 5))

The names and SKUs of the yellow products are:

1. Touring-1000 Yellow, 60 (SKU: BK-T79Y-60)
2. Road-550-W Yellow, 48 (SKU: BK-R64Y-48)
3. Road-550-W Yellow, 40 (SKU: BK-R64Y-40)
4. ML Road Frame-W - Yellow, 42 (SKU: FR-R72Y-42)
5. Touring-3000 Yellow, 58 (SKU: BK-T18Y-58)
