## Storing and querying for embeddings with Azure Cosmos DB for MongoDB vCore

This notebook demonstrates how to use Azure Cosmos DB for MongoDB vCore to store and search vectors:

+ Create IVF and HNSW vector indexes in Azure Cosmos DB for MongoDB vCore
+ Embed the documents using Azure OpenAI's text-embedding-ada-002 model
+ Index the vector and nonvector fields in Azure Cosmos DB for MongoDB vCore
+ Run vector similarity search queries

The code uses Azure OpenAI to generate embeddings for title and content fields. You'll need access to Azure OpenAI to run this notebook.

The code reads the `text-sample.json` file, which contains the input data for which embeddings need to be generated.

The output combines human-readable text with embeddings and is inserted into an Azure Cosmos DB for MongoDB vCore collection.
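
Later cells read `title`, `content`, and `category` from each record, so each entry in `text-sample.json` presumably looks something like the record below. This is a minimal sketch: the sample values and the `missing_fields` helper are illustrative assumptions, not part of the original notebook.

```python
import json

# Field names taken from how later cells access each record.
REQUIRED_FIELDS = ("title", "content", "category")

# Hypothetical sample record; the real file's values will differ.
sample_json = '''
[
  {
    "title": "Azure App Service",
    "content": "A fully managed platform for building and hosting web apps.",
    "category": "Web"
  }
]
'''

def missing_fields(record):
    """Return the required field names that are absent from a record."""
    return [name for name in REQUIRED_FIELDS if name not in record]

records = json.loads(sample_json)
for record in records:
    print(record["title"], "-> missing:", missing_fields(record))
```

Running a check like this before generating embeddings can surface malformed records early, before any Azure OpenAI calls are made.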

### Prerequisites

+ An Azure subscription, with [access to Azure OpenAI](https://aka.ms/oai/access). You must have the Azure OpenAI service name and an API key.

+ A deployment of the text-embedding-ada-002 embedding model.

+ Azure Cosmos DB for MongoDB vCore. See [Vector Store in Azure Cosmos DB for MongoDB vCore](https://learn.microsoft.com/en-us/azure/cosmos-db/mongodb/vcore/vector-search).

We used Python 3.10, [Visual Studio Code with the Python extension](https://code.visualstudio.com/docs/python/python-tutorial), and the [Jupyter extension](https://marketplace.visualstudio.com/items?itemName=ms-toolsai.jupyter) to test this example.

### Set up a Python virtual environment in Visual Studio Code

1. Open the Command Palette (Ctrl+Shift+P).
1. Search for **Python: Create Environment**.
1. Select **Venv**.
1. Select a Python interpreter. Choose 3.10 or later.

It can take a minute to set up. If you run into problems, see [Python environments in VS Code](https://code.visualstudio.com/docs/python/environments).

### Install packages

In [None]:
%pip install -U -q pymongo openai python-dotenv langchain-openai langchain-community

### Deploy Azure Cosmos DB for MongoDB

In [None]:
import datetime
import os

resource_group_name = "VectorSearch" # change the name to match your naming style
resource_group_location = "southeastasia" # change the location to your preferred Azure region
db_name = "vsdb"
server_version="6.0" # 5.0 6.0
collection_name = "vsc"
admin_username = "azadmin" # change the name to match your naming style
cmd_stdout = ! < /dev/urandom tr -dc 'A-Za-z0-9' | head -c12; echo
admin_password = cmd_stdout.n
cmd_stdout = ! echo -n {resource_group_name} | sha1sum | head -c 6
cluster_name = "mgovs-" +  cmd_stdout.n # cosmos db for mongodb account name,  change the name to match your naming style
deployment_name = os.path.basename(os.path.dirname(globals()['__vsc_ipynb_file__'])) 
sku = 'M40' # Free M40

In [None]:
resource_group_stdout = ! az group create --name {resource_group_name} --location {resource_group_location}
if resource_group_stdout.n.startswith("ERROR"):
    print(resource_group_stdout)
else:
    print("✅ Azure Resource Group ", resource_group_name, " created ⌚ ", datetime.datetime.now().time())

! az deployment group create --name {deployment_name} --resource-group {resource_group_name} --template-file "cosmos_db_for_mongodb_vcore.bicep" --parameters clusterName={cluster_name} adminUsername={admin_username} adminPassword={admin_password} sku={sku}

### Create Cosmos DB for MongoDB database and collection

In [None]:
import pymongo
import urllib.parse

mongo_conn = "mongodb+srv://"+urllib.parse.quote(admin_username)+":"+urllib.parse.quote(admin_password)+"@"+cluster_name+".mongocluster.cosmos.azure.com/"+"?tls=true&authMechanism=SCRAM-SHA-256&retrywrites=false&maxIdleTimeMS=120000"
mongo_client = pymongo.MongoClient(mongo_conn)

# Get the database named by db_name (MongoDB creates it lazily on first write)
db = mongo_client[db_name]

# Create collection if it doesn't exist
if collection_name not in db.list_collection_names():
    # Create an unsharded collection
    db.create_collection(collection_name)
    print("Created collection '{}'.\n".format(collection_name))
else:
    print("Using collection: '{}'.\n".format(collection_name))

collection = db[collection_name]
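
The connection string above relies on the generated password being alphanumeric. If you supply your own credentials, note that `urllib.parse.quote` leaves `/` unescaped by default, whereas `urllib.parse.quote_plus` encodes it. The sketch below shows one way to build the same string with `quote_plus`; the `build_connection_string` helper and the sample credentials are hypothetical.

```python
from urllib.parse import quote_plus

def build_connection_string(username, password, cluster_name):
    """Assemble the SRV connection string with percent-encoded credentials."""
    options = "tls=true&authMechanism=SCRAM-SHA-256&retrywrites=false&maxIdleTimeMS=120000"
    host = f"{cluster_name}.mongocluster.cosmos.azure.com"
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}@{host}/?{options}"

# Hypothetical credentials containing reserved characters.
print(build_connection_string("azadmin", "p@ss/word", "mgovs-abc123"))
```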

In [None]:
import os
from openai import AzureOpenAI
from dotenv import load_dotenv

load_dotenv()

client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),  
    api_version=os.getenv("OPENAI_API_VERSION"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
)
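
The client above reads three settings from the environment (typically loaded from a `.env` file by `load_dotenv()`). As a rough sketch, a check like the following can report which of them are missing before any request is sent; the `missing_env_vars` helper is an illustrative addition, not part of the original notebook.

```python
import os

# Variable names taken from the AzureOpenAI(...) call above.
REQUIRED_ENV_VARS = ("AZURE_OPENAI_API_KEY", "OPENAI_API_VERSION", "AZURE_OPENAI_ENDPOINT")

def missing_env_vars(environ=os.environ):
    """Return the names of required settings that are unset or empty."""
    return [name for name in REQUIRED_ENV_VARS if not environ.get(name)]

missing = missing_env_vars()
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All Azure OpenAI settings are present.")
```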


### Create the vector index

**IMPORTANT: You can only create one index per vector property.** That is, you cannot create more than one index that points to the same vector property. If you want to change the index type (e.g., from IVF to HNSW) you must drop the index first before creating a new index.
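
The drop-before-recreate step might look like the sketch below. It uses PyMongo's `list_indexes()` and `drop_index()`; the `drop_index_if_present` helper and the `FakeCollection` stand-in (included only so the example runs without a live cluster) are hypothetical.

```python
def drop_index_if_present(collection, index_name):
    """Drop the named index when it exists; return True if it was dropped."""
    existing = [index["name"] for index in collection.list_indexes()]
    if index_name in existing:
        collection.drop_index(index_name)
        return True
    return False

class FakeCollection:
    """Tiny stand-in for a PyMongo collection so the sketch runs offline."""
    def __init__(self, names):
        self.names = list(names)

    def list_indexes(self):
        return [{"name": name} for name in self.names]

    def drop_index(self, name):
        self.names.remove(name)

demo = FakeCollection(["_id_", "titleVectorIndex"])
print(drop_index_if_present(demo, "titleVectorIndex"))
print(demo.names)
```

Against a real cluster you would pass the notebook's `collection` object instead of the stand-in.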

#### IVF index
IVF is an approximate nearest neighbors (ANN) approach that uses clustering to speed up the search for similar vectors in a dataset. It's a good choice for proofs of concept and smaller datasets (under a few thousand documents). However, it's not recommended at scale or when higher throughput is needed.

IVF is supported on all cluster tiers, including the free tier.
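
To make the clustering idea concrete, here is a toy, pure-Python illustration of inverted-file search. It is not how Cosmos DB implements IVF: the centroids are simply the first `num_lists` vectors rather than trained clusters, and all function names are made up for this sketch. With a single list, which is what `numLists: 1` in this notebook's index definition amounts to, every search scans all vectors.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def build_ivf(vectors, num_lists):
    """Use the first num_lists vectors as centroids and bucket every vector
    under its nearest centroid (a real index would train the centroids)."""
    centroids = vectors[:num_lists]
    lists = [[] for _ in centroids]
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: cosine_distance(vec, centroids[c]))
        lists[nearest].append(idx)
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, k, num_probes=1):
    """Scan only the num_probes lists whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: cosine_distance(query, centroids[c]))
    candidates = [idx for c in order[:num_probes] for idx in lists[c]]
    candidates.sort(key=lambda idx: cosine_distance(query, vectors[idx]))
    return candidates[:k]

vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
centroids, lists = build_ivf(vectors, num_lists=2)
print(ivf_search([1.0, 0.05], vectors, centroids, lists, k=2))
```

Probing fewer lists means fewer distance computations but a chance of missing a true neighbor that landed in an unprobed list, which is the speed versus recall trade-off behind ANN indexes.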

#### HNSW Index
HNSW stands for Hierarchical Navigable Small World, a graph-based index that partitions vectors into clusters and subclusters. With HNSW, you can perform fast approximate nearest neighbor search at higher speeds with greater accuracy. HNSW is now available on M40 and higher cluster tiers.

In [None]:
db.command({
  'createIndexes': collection_name,
  'indexes': [
    {
      'name': 'titleVectorIndex',
      'key': {
        "titleVector": "cosmosSearch"
      },
      'cosmosSearchOptions': {
        'kind': 'vector-ivf',
        'numLists': 1,
        'similarity': 'COS',
        'dimensions': 1536
      }
    },
    {
      'name': 'contentVectorIndex',
      'key': {
        "contentVector": "cosmosSearch"
      },
      'cosmosSearchOptions': {
        "kind": "vector-hnsw", 
        "m": 16, # default value 
        "efConstruction": 64, # default value 
        "similarity": "COS", 
        'dimensions': 1536
      }
    }    
  ]
})
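
Both indexes declare `'dimensions': 1536`, which matches the vector length produced by text-embedding-ada-002. A quick sanity check along the following lines can flag documents whose vectors have a different length before you insert them; the `find_dimension_mismatches` helper and the sample documents are assumptions made for illustration.

```python
EXPECTED_DIMENSIONS = 1536  # same value as 'dimensions' in the index definitions
VECTOR_FIELDS = ("titleVector", "contentVector")

def find_dimension_mismatches(documents, expected=EXPECTED_DIMENSIONS):
    """Return (document position, field, actual length) for every vector
    field that is missing or has the wrong length."""
    problems = []
    for position, doc in enumerate(documents):
        for field in VECTOR_FIELDS:
            actual = len(doc.get(field, []))
            if actual != expected:
                problems.append((position, field, actual))
    return problems

# Hypothetical documents: the second one has a truncated content vector.
docs = [
    {"titleVector": [0.0] * 1536, "contentVector": [0.0] * 1536},
    {"titleVector": [0.0] * 1536, "contentVector": [0.0] * 10},
]
print(find_dimension_mismatches(docs))
```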

### Helper function to create document embeddings

In [None]:
import os
import json
import time

def generate_embeddings(text):
    '''
    Generate embeddings from string of text.
    This will be used to vectorize data and user input for interactions with Azure OpenAI.
    '''
    response = client.embeddings.create(input=text, model="text-embedding-ada-002")
    embeddings = response.model_dump()
    time.sleep(0.5)  # brief pause to reduce the chance of hitting rate limits
    return embeddings['data'][0]['embedding']

def generate_document_embeddings(input_json_path, output_json_path):
    # Read the input JSON file
    with open(input_json_path, 'r', encoding='utf-8') as file:
        input_data = json.load(file)

    titles = [item['title'] for item in input_data]
    content = [item['content'] for item in input_data]
    
    # Generate embeddings for titles
    title_response = client.embeddings.create(input=titles, model="text-embedding-ada-002")
    title_embeddings = [item.embedding for item in title_response.data]
    
    # Generate embeddings for content
    content_response = client.embeddings.create(input=content, model="text-embedding-ada-002")
    content_embeddings = [item.embedding for item in content_response.data]

    # Assign embeddings to the original data
    for i, item in enumerate(input_data):
        item['titleVector'] = title_embeddings[i]
        item['contentVector'] = content_embeddings[i]

    # Ensure the output directory exists
    output_directory = os.path.dirname(output_json_path)
    if not os.path.exists(output_directory):
        os.makedirs(output_directory)
    
    # Write the modified data to the output JSON file
    with open(output_json_path, "w") as f:
        json.dump(input_data, f)


# Generate and load the documents with embeddings from the output file
output_path = os.path.join('output', 'docVectors.json')
if not os.path.exists(output_path):
    generate_document_embeddings(os.path.join('..', 'data', 'text-sample.json'), output_path)
    
with open(output_path, 'r') as file:  
    documents = json.load(file)
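
`generate_document_embeddings` sends every title, and then every content string, in a single request. For larger inputs you may want to split the work into batches. The sketch below shows one possible approach; `chunked`, `embed_in_batches`, and the batch size of 16 are illustrative choices, and `fake_embed` is an offline stand-in for a wrapper around `client.embeddings.create`.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, embed_batch, batch_size=16):
    """Call embed_batch on each slice and concatenate results in input order.

    embed_batch is any callable that takes a list of strings and returns
    one vector per string.
    """
    vectors = []
    for batch in chunked(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors

# Offline stand-in for the embedding call: one fake 1-dimensional vector per text.
fake_embed = lambda batch: [[float(len(text))] for text in batch]
print(embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2))
```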

### Insert documents into Cosmos DB for MongoDB vCore

In [None]:
# Insert every document so that a search for k results has enough candidates
collection.insert_many(documents)

### Query items using embeddings

In [None]:
# Simple function to assist with vector search
def vector_search(query, num_results=5):
    query_embedding = generate_embeddings(query)
    pipeline = [
        {
            '$search': {
                "cosmosSearch": {
                    "vector": query_embedding,
                    "path": "contentVector",
                    "k": num_results
                    # Optional settings left commented out by the author:
                    # "efsearch": 40  (HNSW only)
                    # "filter": {"title": {"$ne": "Azure Cosmos DB"}}
                },
                "returnStoredSource": True }},
        {'$project': { 'similarityScore': { '$meta': 'searchScore' }, 'document' : '$$ROOT' } }
    ]
    results = collection.aggregate(pipeline)
    return results

query = "What is Azure App Service?"  # alternative: "What are the services for running ML models?"
results = vector_search(query)
for result in results: 
    # print(result)
    print(f"Similarity Score: {result['similarityScore']}")  
    print(f"Title: {result['document']['title']}")  
    print(f"Content: {result['document']['content']}")  
    print(f"Category: {result['document']['category']}\n")  
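
The print loop above raises a `KeyError` if a document lacks one of the expected fields. A slightly more tolerant formatter could look like the sketch below; `format_result` and the sample hit are hypothetical and only mirror the shape produced by the `$project` stage.

```python
def format_result(result):
    """Render one search hit, tolerating documents that lack a field."""
    doc = result.get("document", {})
    lines = [
        f"Similarity Score: {result.get('similarityScore', float('nan')):.4f}",
        f"Title: {doc.get('title', '(none)')}",
        f"Category: {doc.get('category', '(none)')}",
    ]
    return "\n".join(lines)

# Hypothetical hit shaped like the $project stage's output.
hit = {"similarityScore": 0.8731, "document": {"title": "Azure App Service"}}
print(format_result(hit))
```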

### Filtered vector search (Preview)

You can add query filters to your vector search by creating an index on the field you want to filter on and specifying the filter in the search pipeline.

Note: Filtered vector search is in preview and must be enabled through Azure Preview Features for your subscription. Search for the preview feature "filtering on vector search". To learn more, see https://learn.microsoft.com/azure/azure-resource-manager/management/preview-features?tabs=azure-portal

In [None]:
# Add a filter index
db.command( {
    "createIndexes": collection_name,
    "indexes": [ {
        "key": { 
            "title": 1 
               }, 
        "name": "titleFilterIndex" 
    }
    ] 
} 
)

# Verify all indexes are present
for i in collection.list_indexes():
    print(i)

# Simple function to assist with filtered vector search
def filtered_vector_search(query, num_results=5):
    query_embedding = generate_embeddings(query)
    pipeline = [
        {
            '$search': {
                "cosmosSearch": {
                    "vector": query_embedding,
                    "path": "contentVector",
                    "k": num_results,
                    "filter": {"title": {"$nin": ["Azure App Service"]}}
                },
                "returnStoredSource": True }},
        {'$project': { 'similarityScore': { '$meta': 'searchScore' }, 'document' : '$$ROOT' } }
    ]
    results = collection.aggregate(pipeline)
    return results

query = "What is Azure App Service?"  # alternative: "What are the services for running ML models?"
results = filtered_vector_search(query)
for result in results: 
#     print(result)
    print(f"Similarity Score: {result['similarityScore']}")  
    print(f"Title: {result['document']['title']}")  
    print(f"Content: {result['document']['content']}")  
    print(f"Category: {result['document']['category']}\n")  
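
The filtered and unfiltered searches differ only in the `filter` key, so the pipeline can be built by one function and inspected without a database connection. The following is a sketch under that assumption; `build_vector_search_pipeline` and its parameter names are not part of the original notebook.

```python
def build_vector_search_pipeline(query_embedding, path="contentVector", k=5, filter_expr=None):
    """Return the aggregation pipeline for an (optionally filtered) vector search."""
    cosmos_search = {"vector": query_embedding, "path": path, "k": k}
    if filter_expr is not None:
        cosmos_search["filter"] = filter_expr
    return [
        {"$search": {"cosmosSearch": cosmos_search, "returnStoredSource": True}},
        {"$project": {"similarityScore": {"$meta": "searchScore"}, "document": "$$ROOT"}},
    ]

# A short dummy vector stands in for a real 1536-dimensional embedding.
pipeline = build_vector_search_pipeline(
    [0.1, 0.2, 0.3],
    filter_expr={"title": {"$nin": ["Azure App Service"]}},
)
print(pipeline[0]["$search"]["cosmosSearch"]["filter"])
```

The returned list can be passed directly to `collection.aggregate(...)`.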

### Q&A over the data with GPT-3.5

Finally, we'll create a helper function to feed prompts into the chat completions model. Then we'll create an interactive loop where you can pose questions to the model and receive information grounded in your data.

In [None]:
# This function helps to ground the model with prompts and system instructions.

def generate_completion(vector_search_results, user_prompt):
    system_prompt = '''
    You are an intelligent assistant for Microsoft Azure services.
    You are designed to provide helpful answers to user questions about Azure services given the information about to be provided.
        - Only answer questions related to the information provided below, provide at least 3 clear suggestions in a list format.
        - Write two lines of whitespace between each answer in the list.
        - If you're unsure of an answer, you can say "I don't know" or "I'm not sure" and recommend users search themselves.
        - Only provide answers that have products that are part of Microsoft Azure and part of these following prompts.
    '''

    messages=[{"role": "system", "content": system_prompt}]
    for item in vector_search_results:
        messages.append({"role": "system", "content": item['document']['content']})
    messages.append({"role": "user", "content": user_prompt})
    response = client.chat.completions.create(model='gpt-35-turbo', messages=messages, temperature=0)

    return response.model_dump()
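
The message assembly inside `generate_completion` can be separated from the API call, which makes it easy to inspect what the model will see. This is a minimal sketch; `build_messages`, the `max_documents` cap, and the sample hit are illustrative assumptions.

```python
def build_messages(system_prompt, search_results, user_prompt, max_documents=5):
    """Assemble chat messages: instructions, grounding documents, then the question."""
    messages = [{"role": "system", "content": system_prompt}]
    for item in list(search_results)[:max_documents]:
        messages.append({"role": "system", "content": item["document"]["content"]})
    messages.append({"role": "user", "content": user_prompt})
    return messages

# Hypothetical search hit shaped like the vector_search output.
hits = [{"document": {"content": "Azure App Service hosts web apps."}}]
for message in build_messages("You are an Azure assistant.", hits, "What is App Service?"):
    print(message["role"], "->", message["content"])
```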

In [None]:
# Create a loop of user input and model output. You can now perform Q&A over the sample data!

user_input = ""
print("*** Please ask your model questions about Azure services. Type 'end' to end the session.\n")
user_input = input("User prompt: ")
while user_input.lower() != "end":
    search_results = vector_search(user_input)
    completions_results = generate_completion(search_results, user_input)
    print("\n")
    print(completions_results['choices'][0]['message']['content'])
    user_input = input("User prompt: ")

In [28]:
delete_stdout = ! az group delete --name {resource_group_name} -y

if delete_stdout.n.startswith("ERROR"):
    print(delete_stdout)
else:
    print("✅ Azure Resource Group ", resource_group_name, " deleted ⌚ ", datetime.datetime.now().time())