# Vector Search using vCore-based Azure Cosmos DB for MongoDB

This notebook demonstrates how to use an Azure OpenAI embedding model to vectorize documents already stored in vCore-based Azure Cosmos DB for MongoDB, store the embedding vectors in those documents, and create a vector index. Finally, it shows how to query the vector index to find similar documents.

*This lab expects the data that was loaded in Lab 2.*

In [2]:
import os
import time
import json
from pprint import pprint

import pymongo
from openai import AzureOpenAI

from models import CalendarCourse, Degree, Course, Department, User

from dotenv import load_dotenv
load_dotenv()

True

In [3]:
CONNECTION_STRING = os.getenv("DB_CONNECTION_STRING")
EMBEDDINGS_DEPLOYMENT_NAME = os.getenv("EMBEDDINGS_DEPLOYMENT_NAME")
COMPLETIONS_DEPLOYMENT_NAME = os.getenv("COMPLETIONS_DEPLOYMENT_NAME")
AOAI_ENDPOINT = os.getenv("AOAI_ENDPOINT")
AOAI_KEY = os.getenv("AOAI_KEY")
AOAI_API_VERSION = os.getenv("AOAI_API_VERSION")

### Create the PyMongo client for Cosmos DB and the Azure OpenAI client

In [4]:
db_client = pymongo.MongoClient(CONNECTION_STRING)
db = db_client.db

ai_client = AzureOpenAI(
    azure_endpoint = AOAI_ENDPOINT,
    api_version = AOAI_API_VERSION,
    api_key = AOAI_KEY)


## Vectorize and store the embeddings in each document

The vector embedding field only needs to be created once per document. However, if a document changes, its embedding must be regenerated and the field updated with the new vector.
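
One way to detect that a document's embedding is out of date is to store a hash of the embedded text next to the vector. The sketch below is a minimal illustration under that assumption: the `contentHash` field and the helper names are hypothetical and are not part of the lab's schema, and the sample document is made up.

```python
import hashlib

def build_content(doc, keys):
    """Join the values of the selected keys, in the order given by keys."""
    return ", ".join(str(doc[k]) for k in keys if k in doc)

def content_hash(content):
    """Stable fingerprint of the text that was embedded."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

def needs_embedding(doc, keys):
    """True if the doc has no embedding or its text changed since it was embedded."""
    if "embedding" not in doc:
        return True
    # "contentHash" is a hypothetical field written alongside the embedding
    return doc.get("contentHash") != content_hash(build_content(doc, keys))

# Made-up document for illustration
doc = {"_id": "MATH101", "name": "Calculus I", "description": "Limits and derivatives."}
keys = ["name", "description"]
print(needs_embedding(doc, keys))  # True: no embedding yet

doc["embedding"] = [0.0, 0.0, 0.0]  # placeholder vector
doc["contentHash"] = content_hash(build_content(doc, keys))
print(needs_embedding(doc, keys))  # False: hash matches current text

doc["description"] = "Limits, derivatives and integrals."
print(needs_embedding(doc, keys))  # True: text changed after embedding
```

With this approach, an update job would re-embed only the documents for which `needs_embedding` returns `True` and write both the new vector and the new hash.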

In [4]:
def generate_embeddings(text):
    """Generate embeddings from string of text using the deployed Azure OpenAI API embeddings model"""
    response = ai_client.embeddings.create(input=text, model=EMBEDDINGS_DEPLOYMENT_NAME)
    embeddings = response.data[0].embedding
    return embeddings

In [5]:
# demonstrate embeddings generation using a test string

x = generate_embeddings("hello, world")

len(x), sum([xi**2 for xi in x])**0.5

(1536, 1.000000032497339)
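
The second value in the output is the Euclidean norm of the vector, which is approximately 1. For unit-length vectors, cosine similarity reduces to a plain dot product. The sketch below illustrates this with small hand-made vectors rather than real embeddings; the function names are illustrative only.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b."""
    return dot(a, b) / (norm(a) * norm(b))

def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]

u = normalize([3.0, 4.0])
v = normalize([4.0, 3.0])

# For unit vectors the denominator is 1, so both computations agree
print(math.isclose(norm(u), 1.0))
print(math.isclose(cosine_similarity(u, v), dot(u, v)))
```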

### Vectorize and update all documents in a collection

In [None]:

def add_collection_content_vector_field(collection, keys):
    """Vectorize a string built from a subset of each doc's keys and store it in the embedding field"""

    # Get all documents for the collection
    docs = list(collection.find({}))

    # Compute embeddings and prepare bulk operations
    bulk_operations = []
    for doc in docs:
        # Skip documents that already have an embedding
        if "embedding" in doc:
            continue

        # Get the string content from the document for the specified keys
        filtered_info = {k: v for k, v in doc.items() if k in keys}
        content = ', '.join([str(v) for v in filtered_info.values()])

        # Generate the embedding and queue an update for this document
        content_vector = generate_embeddings(content)
        bulk_operations.append(pymongo.UpdateOne(
            {"_id": doc["_id"]},
            {"$set": {"embedding": content_vector}}
        ))

    # bulk_write raises InvalidOperation on an empty list, so only write when there is work to do
    if bulk_operations:
        collection.bulk_write(bulk_operations)

### Add embeddings to the courses and degrees collections

In [None]:
course_keys = ["code", "name", "description"]
add_collection_content_vector_field(db.calendar_courses, course_keys)

In [None]:
degree_keys = ["code", "name", "cred"]
add_collection_content_vector_field(db.degrees, degree_keys)

### Create a vector index on each collection for fast lookup

In [9]:
# Course index
db.command({
  'createIndexes': 'calendar_courses',
  'indexes': [
    {
      'name': 'VectorSearchIndex',
      'key': {
        "embedding": "cosmosSearch"
      },
      'cosmosSearchOptions': {
        'kind': 'vector-ivf',  
        'numLists': 1,         
        'similarity': 'COS',   
        'dimensions': 1536
      }
    }
  ]
})

db.command({
  'createIndexes': 'degrees', # collection name
  'indexes': [ # array of indexes, here there is only one index
    {
      'name': 'VectorSearchIndex', # name of the index to be created, 
                                   # nice to keep these all the same across collections to simplify next steps
      'key': {
        "embedding": "cosmosSearch" # field name
      },
      'cosmosSearchOptions': {
        'kind': 'vector-ivf',  # indexing algorithm
        'numLists': 1,         # number of inverted lists
        'similarity': 'COS',   # cosine similarity
        'dimensions': 1536,
      }
    }
  ]
})

{'raw': {'defaultShard': {'numIndexesBefore': 1,
   'numIndexesAfter': 2,
   'createdCollectionAutomatically': False,
   'ok': 1}},
 'ok': 1}
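
A `numLists` value of 1 puts every vector in a single list, so a query effectively scans the whole collection; that is reasonable for the small collections in this lab. For larger collections, a rule of thumb given in the Azure Cosmos DB for MongoDB vCore documentation is roughly one list per 1,000 documents up to one million documents, and the square root of the document count beyond that. The helper below is a sketch of that heuristic. `suggest_num_lists` and `vector_index_spec` are hypothetical names, the document count of 5000 is made up, and the thresholds should be checked against the current documentation before use.

```python
import math

def suggest_num_lists(document_count):
    """Heuristic numLists (assumption: count/1000 up to 1M docs, sqrt(count) above)."""
    if document_count <= 1_000_000:
        return max(1, document_count // 1000)
    return max(1, math.isqrt(document_count))

def vector_index_spec(collection_name, document_count, dimensions=1536):
    """Build a createIndexes command document in the same shape as the cells above."""
    return {
        "createIndexes": collection_name,
        "indexes": [{
            "name": "VectorSearchIndex",
            "key": {"embedding": "cosmosSearch"},
            "cosmosSearchOptions": {
                "kind": "vector-ivf",
                "numLists": suggest_num_lists(document_count),
                "similarity": "COS",
                "dimensions": dimensions,
            },
        }],
    }

# Illustrative document count, not taken from the lab data
spec = vector_index_spec("calendar_courses", 5000)
print(spec["indexes"][0]["cosmosSearchOptions"]["numLists"])  # 5
```

In the notebook, the resulting dictionary could be passed to `db.command(...)` in place of the hand-written command, with the count taken from `collection.count_documents({})`.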

In [8]:
# To remove an index

# db.degrees.drop_index('VectorSearchIndex')

### Use vector search in vCore-based Azure Cosmos DB for MongoDB

In [47]:
query = "Gauss"

query_embedding = generate_embeddings(query)    
pipeline = [
    {
        '$search': {
            "cosmosSearch": {
                "vector": query_embedding,
                "path": "embedding",
                "k": 10
            },
            "returnStoredSource": True }},
    {'$project': { 'similarity': { '$meta': 'searchScore' }, 'document' : '$$ROOT' } }
]

for result in db.degrees.aggregate(pipeline):
    degree = result['document']
    score = result['similarity']
    d = Degree(**degree)
    print(d.code, d.title, d.cred, f'{score: .6f}')

print()

for result in db.calendar_courses.aggregate(pipeline):
    course = result['document']
    score = result['similarity']
    c = CalendarCourse(**course)
    print(c.code, c.name, f'{score: .6f}')

MNR-MATH Mathematics General  0.804171
MNR-CMSC Computer Science General  0.804171
MNR-PHYS Physics General  0.804171
MNR-STAT Statistics General  0.804171
MNR-COSC Computer Science Minor  0.802024
MNR-APPL Applications of Psychology and Leadership Minor  0.802024
MNR-CSSY Computer Systems Minor  0.802024
MNR-ASTX Astronomy Minor  0.802024
MNR-BUSX Business Minor  0.802024
MNR-AE Art Education Minor  0.802024

PHYS326 Electricity and Magnetism  0.807680
GMST365 Marx, Nietzsche, Freud  0.806756
PHYS500A Quantum Mechanics  0.806253
PHYS415 General Relativity and Cosmology  0.805927
PHYS321A Classical Mechanics I  0.805449
PHYS502A Classical Electrodynamics  0.803858
GMST355 German Expressionism (1910-1933)  0.802304
PHYS501A Quantum Theory and Quantum Fields  0.801889
PHYS410 Topics in Mathematical Physics I  0.801147
PHYS321B Classical Mechanics II  0.800799


### Basic RAG pattern

In [54]:
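def generate_embedding(ai_client, text: str):
    """Generate an embedding for text using the deployed Azure OpenAI embeddings model"""
    # Use the configured deployment name rather than a hard-coded one
    response = ai_client.embeddings.create(input=text, model=EMBEDDINGS_DEPLOYMENT_NAME)
    embeddings = response.data[0].embedding
    return embeddings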

In [55]:
def vector_search(collection, embedded_query, num_results=3):
    """
    Perform a vector search on the specified collection using an
    already-embedded query and the collection's vector index.

    Returns a cursor over the top num_results most similar documents,
    each with 'similarity' and 'document' fields
    """
    pipeline = [
        {
            '$search': {
                "cosmosSearch": {
                    "vector": embedded_query,
                    "path": "embedding",
                    "k": num_results
                },
                "returnStoredSource": True }},
        {'$project': { 'similarity': { '$meta': 'searchScore' }, 'document' : '$$ROOT' } }
    ]
    results = collection.aggregate(pipeline)
    return results
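
Because the search always returns the `k` nearest neighbours, weak matches can end up in the context passed to the model. One option is to drop results below a similarity threshold before building the prompt. The sketch below assumes each result has the `similarity` and `document` fields produced by the `$project` stage above; `filter_by_score`, the sample results, and the threshold value are hypothetical and would need tuning against real scores.

```python
def filter_by_score(results, min_score):
    """Keep only results whose similarity is at least min_score, best first."""
    kept = [r for r in results if r["similarity"] >= min_score]
    return sorted(kept, key=lambda r: r["similarity"], reverse=True)

# Hand-made results in the shape returned by the pipeline (values are illustrative)
sample = [
    {"similarity": 0.79, "document": {"_id": "A100"}},
    {"similarity": 0.85, "document": {"_id": "B200"}},
    {"similarity": 0.81, "document": {"_id": "C300"}},
]

for r in filter_by_score(sample, 0.80):
    print(r["document"]["_id"], r["similarity"])
# B200 0.85
# C300 0.81
```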


In [56]:
def calendar_course_to_string(doc):
    """
    Convert a CalendarCourse to a string for passing to LLM
    """
    return doc['_id'] + ": " + doc['name'] + "\n" + doc['description']

In [65]:
system_prompt = """
You are a helpful assistant.
You are given a prompt and context and you provide more information using your general knowledge about these courses and topics.
Only reference courses that have been explicitly provided to you as context.
"""

def chat(ai_client, augmented_question: str):
    """
    Get Chat Completion using augmented prompt
    """
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": augmented_question}]
    # Use the configured completions deployment rather than a hard-coded name
    completion = ai_client.chat.completions.create(
        messages=messages, model=COMPLETIONS_DEPLOYMENT_NAME, temperature=0.42)
    return completion.choices[0].message.content
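
The augmented question sent to `chat` is the user's prompt followed by the retrieved course descriptions. A small helper can make that step reusable. The sketch below assumes documents with `_id`, `name`, and `description` fields, as in the calendar courses collection; `build_augmented_question` is a hypothetical name, and the sample description is shortened.

```python
def build_augmented_question(query, docs):
    """Combine the user query with an 'id: name' line and a description line per document."""
    lines = ["Prompt: " + query, "Context:"]
    for doc in docs:
        lines.append(f"{doc['_id']}: {doc['name']}")
        lines.append(doc["description"])
    return "\n".join(lines)

# Sample document with a shortened description, for illustration
docs = [{
    "_id": "GMST454",
    "name": "A Cultural History of Vampires in Literature and Film",
    "description": "A study of literary and cinematic vampires in historical context.",
}]
print(build_augmented_question("What courses do you have about vampires?", docs))
```

Keeping prompt assembly in one function makes it easier to change the context format, or to cap the number of documents, without touching the search or chat code.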

In [80]:
query = "What courses do you have about vampires?"

# Embed the query
embedded_query = generate_embedding(ai_client, query)

# Perform vector search on the query
search_results = vector_search(db.calendar_courses, embedded_query, 10)

# Create augmented question
augmented_question = "Prompt: " + query + "\nContext:"
for result in search_results:
    doc = result['document']
    docstring = calendar_course_to_string(doc)
    augmented_question += "\n" + docstring
print("Augmented question \n\n", augmented_question)

chat_response = chat(ai_client, augmented_question)

Augmented question 

 Prompt: What courses do you have about vampires?
Context:
GMST454: A Cultural History of Vampires in Literature and Film
A study of literary and cinematic vampires in historical context. Without focusing exclusively on German literature and film, follows the vampire myth and its various guises from classicism to postmodernism in novels and films.
ENSH312: Horror
Study of horror, textual and visual; the evolution of horror tropes and their adaptation to anxieties about social change, shifting ideas of race and gender, technological advancement and political impotence; horror as cultural commentary. May include short stories, novels, film or other genres.
SLST501: Introduction to the Disciplines of Germanic and Slavic Studies
An introduction to the research specialties that make up Germanic and Slavic Studies: literary and cultural studies, film studies, cultural history and second language acquisition. May include sessions on how to write a research grant proposal,

In [None]:
pprint(chat_response)

('Based on the provided courses, there is only one course that specifically '
 'focuses on vampires, which is GMST454: A Cultural History of Vampires in '
 'Literature and Film. This course explores the vampire myth and its various '
 'guises in literature and film from classicism to postmodernism.\n'
 '\n'
 'However, if you are interested in horror as a genre, ENSH312: Horror is a '
 'course that studies horror in both textual and visual forms. It explores the '
 'evolution of horror tropes and their adaptation to anxieties about social '
 'change, shifting ideas of race and gender, technological advancement, and '
 'political impotence. This course may include short stories, novels, film, or '
 'other genres.\n'
 '\n'
 'If you are interested in speculative fiction, ENSH310: Speculative Fiction '
 'is a course that studies fiction that imagines alternate histories, futures, '
 'or worlds, such as science fiction, fantasy, utopian, dystopian, and '
 'post-apocalyptic.\n'
 '\n'
 'Additi