## Storing and querying for embeddings with Azure Cosmos DB for NoSQL

This notebook demonstrates how to use Azure Cosmos DB for NoSQL to store and search vectors:

+ Create indexing and vector embedding policies in Azure Cosmos DB
+ Embed the documents using Azure OpenAI's text-embedding-ada-002 model
+ Index the vector and nonvector fields in Azure Cosmos DB for NoSQL
+ Run a vector similarity search query

The code uses Azure OpenAI to generate embeddings for title and content fields. You'll need access to Azure OpenAI to run this notebook.

The code reads the `text-sample.json` file, which contains the input data for which embeddings need to be generated.

The output is a combination of human-readable text and embeddings that can be inserted into an Azure Cosmos DB for NoSQL container.
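To make the shape of that output concrete, here is a minimal sketch of a single record. The `id`, `title`, and `content` values are hypothetical, and the vectors are shortened to three numbers for readability (the real ones produced by text-embedding-ada-002 have 1536). The helper `split_text_and_vectors` is not part of the notebook; it only illustrates which fields are text and which are embeddings.

```python
import json

# Hypothetical record: real vectors have 1536 floats, shortened here to 3.
sample_document = {
    "id": "1",
    "title": "Azure App Service",
    "content": "Azure App Service is a managed platform for hosting web apps.",
    "titleVector": [0.012, -0.034, 0.056],
    "contentVector": [0.021, 0.043, -0.065],
}


def split_text_and_vectors(document):
    """Separate human-readable fields from embedding fields (names ending in 'Vector')."""
    text = {k: v for k, v in document.items() if not k.endswith("Vector")}
    vectors = {k: v for k, v in document.items() if k.endswith("Vector")}
    return text, vectors


text_fields, vector_fields = split_text_and_vectors(sample_document)
print(json.dumps(text_fields, indent=2))
# Show each vector field with its number of dimensions
print({name: len(values) for name, values in vector_fields.items()})
```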

### Prerequisites

+ An Azure subscription, with [access to Azure OpenAI](https://aka.ms/oai/access). You must have the Azure OpenAI service name and an API key.

+ A deployment of the text-embedding-ada-002 embedding model.

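+ The Azure CLI (`az`), signed in to your subscription. The deployment cells in this notebook use it to create an Azure Cosmos DB for NoSQL account with vector search enabled. For background, see [Index and query vectors in Azure Cosmos DB for NoSQL in Python](https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/how-to-python-vector-index-query).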

We used Python 3.10, [Visual Studio Code with the Python extension](https://code.visualstudio.com/docs/python/python-tutorial), and the [Jupyter extension](https://marketplace.visualstudio.com/items?itemName=ms-toolsai.jupyter) to test this example.

### Set up a Python virtual environment in Visual Studio Code

1. Open the Command Palette (Ctrl+Shift+P).
1. Search for **Python: Create Environment**.
1. Select **Venv**.
1. Select a Python interpreter. Choose 3.10 or later.

It can take a minute to set up. If you run into problems, see [Python environments in VS Code](https://code.visualstudio.com/docs/python/environments).

### Install packages

In [None]:
%pip install -U -q azure-cosmos python-dotenv openai

### Deploy Azure Cosmos DB for NoSQL

In [None]:
import datetime

resource_group_name = "VectorSearch"  # resource group name; change to match your naming style
resource_group_location = "southeastasia"  # Azure region; change to the region you want to deploy to
cosmosdb_account_name = "vssea"  # Cosmos DB for NoSQL account name; change to match your naming style
cosmosdb_db_name = "vsdb"  # Cosmos DB for NoSQL database name; change to match your naming style
cosmosdb_container_name = "vsc"  # Cosmos DB for NoSQL container name; change to match your naming style

In [None]:
resource_group_stdout = ! az group create --name {resource_group_name} --location {resource_group_location}
if resource_group_stdout.n.startswith("ERROR"):
    print(resource_group_stdout)
else:
    print("✅ Azure Resource Group ", resource_group_name, " created ⌚ ", datetime.datetime.now().time())

cmd_stdout = ! echo -n {resource_group_name} | sha1sum | head -c 6
suffix = cmd_stdout.n

cosmosdb_stdout = ! az cosmosdb create -n {cosmosdb_account_name}-{suffix} -g {resource_group_name} --default-consistency-level Eventual --capabilities EnableServerless EnableNoSQLVectorSearch

if cosmosdb_stdout.n.startswith("ERROR"):
    print(cosmosdb_stdout)
else:
    print("✅ Cosmos DB ", f'{cosmosdb_account_name}-{suffix}', " created ⌚ ", datetime.datetime.now().time())
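The `echo -n ... | sha1sum | head -c 6` pipeline above assumes a Unix-like shell. On systems without `sha1sum`, the same six-character suffix can be derived in Python. The sketch below is one way to do it; `name_suffix` is a hypothetical helper name, not something the notebook defines.

```python
import hashlib


def name_suffix(seed: str, length: int = 6) -> str:
    """Return the first `length` hex characters of the SHA-1 digest of `seed`."""
    return hashlib.sha1(seed.encode("utf-8")).hexdigest()[:length]


# Same seed as the shell pipeline: the resource group name without a trailing newline
suffix = name_suffix("VectorSearch")
print(suffix)
```

Because the suffix is derived from the resource group name, rerunning the cell yields the same account name rather than a new one each time.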


In [None]:
# Look up the account endpoint and primary key for the account created above
endpoint_stdout = ! az cosmosdb show -g {resource_group_name} -n {cosmosdb_account_name}-{suffix} --query documentEndpoint --output tsv
key_stdout = ! az cosmosdb keys list -g {resource_group_name} -n {cosmosdb_account_name}-{suffix} --query primaryMasterKey --output tsv
endpoint = endpoint_stdout.n
key = key_stdout.n

### Create Cosmos DB for NoSQL database and container

In [None]:

from azure.cosmos import CosmosClient, PartitionKey, exceptions

client = CosmosClient(endpoint, credential=key)
try:
    database = client.create_database(cosmosdb_db_name)
except exceptions.CosmosResourceExistsError:
    database = client.get_database_client(cosmosdb_db_name)

vector_embedding_policy = {
    "vectorEmbeddings": [ 
        { 
            "path": "/titleVector", 
            "dataType": "float32", 
            "distanceFunction": "cosine", 
            "dimensions": 1536
        },
        { 
            "path": "/contentVector", 
            "dataType": "float32", 
            "distanceFunction": "cosine", 
            "dimensions": 1536
        }        
    ] 
}

indexing_policy = {
    "indexingMode": "consistent",
    "automatic": True,
    "includedPaths": [
        {
            "path": "/*"
        }
    ],
    "excludedPaths": [
        {
            "path": "/_etag/?"
        }
    ],
    "vectorIndexes": [
        {
            "path": "/titleVector",
            "type": "quantizedFlat"
        },
        {
            "path": "/contentVector",
            "type": "quantizedFlat"
        }
    ]
}

# Create the container with both policies, or reuse it if it already exists.
# Any CosmosHttpResponseError propagates to the caller.
container = database.create_container_if_not_exists(
    id=cosmosdb_container_name,
    partition_key=PartitionKey(path='/id'),
    indexing_policy=indexing_policy,
    vector_embedding_policy=vector_embedding_policy)
print("Container with id '{0}' is ready".format(cosmosdb_container_name))
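The two policies work as a pair: as far as we understand, every path listed under `vectorIndexes` should also be declared under `vectorEmbeddings`. The sketch below checks that locally before calling the service. `missing_embedding_paths` is a hypothetical helper, and the example policies are trimmed copies of the ones defined in this section.

```python
def missing_embedding_paths(indexing_policy, vector_embedding_policy):
    """Return vector index paths that have no matching entry in the embedding policy."""
    embedded = {entry["path"] for entry in vector_embedding_policy.get("vectorEmbeddings", [])}
    indexed = [entry["path"] for entry in indexing_policy.get("vectorIndexes", [])]
    return [path for path in indexed if path not in embedded]


# Trimmed copies of the policies used in this notebook
example_embedding_policy = {
    "vectorEmbeddings": [
        {"path": "/titleVector", "dataType": "float32", "distanceFunction": "cosine", "dimensions": 1536},
        {"path": "/contentVector", "dataType": "float32", "distanceFunction": "cosine", "dimensions": 1536},
    ]
}
example_indexing_policy = {
    "vectorIndexes": [
        {"path": "/titleVector", "type": "quantizedFlat"},
        {"path": "/contentVector", "type": "quantizedFlat"},
    ]
}

missing = missing_embedding_paths(example_indexing_policy, example_embedding_policy)
print("Paths missing from the embedding policy:", missing)
```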

### Helper function to create document embeddings

In [None]:
import os
import json
from openai import AzureOpenAI
from dotenv import load_dotenv

load_dotenv()

def generate_document_embeddings(input_json_path, output_json_path):
    client = AzureOpenAI(
        azure_deployment="text-embedding-ada-002",
        api_key=os.getenv("AZURE_OPENAI_API_KEY"),  
        api_version=os.getenv("OPENAI_API_VERSION"),
        azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
    )

    # Read the input JSON file
    with open(input_json_path, 'r', encoding='utf-8') as file:
        input_data = json.load(file)

    titles = [item['title'] for item in input_data]
    content = [item['content'] for item in input_data]
    
    # Generate embeddings for titles
    title_response = client.embeddings.create(input=titles, model="text-embedding-ada-002")
    title_embeddings = [item.embedding for item in title_response.data]
    
    # Generate embeddings for content
    content_response = client.embeddings.create(input=content, model="text-embedding-ada-002")
    content_embeddings = [item.embedding for item in content_response.data]

    # Assign embeddings to the original data
    for i, item in enumerate(input_data):
        item['titleVector'] = title_embeddings[i]
        item['contentVector'] = content_embeddings[i]

    # Ensure the output directory exists (skip when the path has no directory part)
    output_directory = os.path.dirname(output_json_path)
    if output_directory:
        os.makedirs(output_directory, exist_ok=True)

    # Write the modified data to the output JSON file
    with open(output_json_path, "w", encoding="utf-8") as f:
        json.dump(input_data, f)


# Generate and load the documents with embeddings from the output file
output_path = os.path.join('output', 'docVectors.json')
if not os.path.exists(output_path):
    generate_document_embeddings(os.path.join('..', 'data', 'text-sample.json'), output_path)
    
with open(output_path, 'r', encoding='utf-8') as file:
    documents = json.load(file)
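The helper above sends every title, and then every content string, in a single embeddings request. That is fine for a small sample file, but larger inputs may need to be split into batches. The sketch below shows one way to do that. `embed_in_batches`, `fake_embed_batch`, and the default `batch_size` of 16 are assumptions for illustration; check the request limits of your Azure OpenAI deployment before choosing a value.

```python
from typing import Callable, List


def embed_in_batches(
    texts: List[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 16,
) -> List[List[float]]:
    """Call `embed_batch` on consecutive slices of `texts` and concatenate the results in order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_batch(batch))
    return embeddings


def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    """Stand-in for the real service: a one-element 'vector' holding each text's length."""
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(vectors)  # [[1.0], [2.0], [3.0]]
```

With the real client, `embed_batch` could wrap the same call used in the helper, for example a function that returns `[d.embedding for d in client.embeddings.create(input=batch, model="text-embedding-ada-002").data]`.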

### Insert documents into Cosmos DB for NoSQL

In [None]:
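# Insert every document; upsert_item replaces an existing item with the same id,
# so the cell can be rerun without raising a conflict error
for doc in documents:
    container.upsert_item(body=doc)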

### Generate embeddings for the query

In [None]:
client = AzureOpenAI(
    azure_deployment="text-embedding-ada-002",
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),  
    api_version=os.getenv("OPENAI_API_VERSION"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
)

content_response = client.embeddings.create(input="What is Azure App Service", model="text-embedding-ada-002")
content_embedding = content_response.data[0].embedding


### Query items using embeddings

Note: Always use `TOP` or `LIMIT` in a vector search query.

A vector search query without `TOP` or `LIMIT` can consume a large number of request units (RUs) quickly and take a long time to run, so make sure every vector search query includes one of the two clauses.
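One way to keep that rule from being forgotten is to build the query text through a small function that always emits a `TOP` clause. This is only a sketch: `build_vector_query` is a hypothetical helper, and it assumes the same `c.title` projection and `@embedding` parameter name that this notebook uses.

```python
def build_vector_query(vector_field: str, top_k: int = 5) -> str:
    """Build a vector search query that always carries a TOP clause."""
    # Reject booleans explicitly because bool is a subclass of int in Python
    if isinstance(top_k, bool) or not isinstance(top_k, int) or top_k < 1:
        raise ValueError("top_k must be a positive integer")
    # Only allow a simple property name so nothing unexpected ends up in the query text
    if not vector_field.isidentifier():
        raise ValueError("vector_field must be a simple property name")
    distance = f"VectorDistance(c.{vector_field}, @embedding)"
    return (
        f"SELECT TOP {top_k} c.title, {distance} AS SimilarityScore "
        f"FROM c ORDER BY {distance}"
    )


query = build_vector_query("contentVector", top_k=5)
print(query)
```

The embedding itself still travels as the `@embedding` query parameter; only the field name and the row limit are formatted into the text.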

In [None]:
for item in container.query_items( 
        query='SELECT TOP 5 c.title, VectorDistance(c.contentVector,@embedding) AS SimilarityScore FROM c ORDER BY VectorDistance(c.contentVector,@embedding)', 
        parameters=[ 
            {"name": "@embedding", "value": content_embedding} 
        ], 
        enable_cross_partition_query=True): 
    print(json.dumps(item, indent=2))
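Because the embedding policy sets `distanceFunction` to `cosine`, we expect `SimilarityScore` to behave like cosine similarity: values range from -1 to 1, and higher means more similar. The sketch below computes that quantity in plain Python for two small vectors so the scores are easier to interpret. It illustrates the formula only and is not how the service computes the score internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```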