# NOTE: Currently not working - please skip this notebook

The last part of the notebook currently fails. The issue is described here:
https://github.com/langchain-ai/langchain/issues/26097

# LangChain with Azure Cosmos DB for MongoDB vCore

## Setup, Vectorize and Load Data

In this lab, we'll see how to use a sample dataset stored in Azure Cosmos DB for MongoDB to ground OpenAI models. We'll do this by taking advantage of the [vector similarity search](https://learn.microsoft.com/azure/cosmos-db/mongodb/vcore/vector-search) functionality of Azure Cosmos DB for MongoDB vCore.

In the Azure portal, create an Azure Cosmos DB for MongoDB vCore cluster on the M40 tier or higher; vector indexes can be created on M40 cluster tiers and higher. After the cluster is created, add its connection string to the `.env` file.

Let's start by importing the modules we will use.

In [1]:
import ijson
from openai import AzureOpenAI

from tenacity import retry, stop_after_attempt, wait_random_exponential
from time import sleep

from langchain.chains import ConversationalRetrievalChain
from langchain.globals import set_llm_cache
from langchain.memory import ConversationBufferMemory
from langchain.prompts import PromptTemplate
from langchain_community.cache import AzureCosmosDBSemanticCache
from langchain_community.chat_message_histories import MongoDBChatMessageHistory
from langchain_community.vectorstores.azure_cosmos_db import (
    AzureCosmosDBVectorSearch,
    CosmosDBSimilarityType,
    CosmosDBVectorSearchType)
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings

We will now load the values from the `.env` file in the root of the repository and instantiate the MongoDB and Azure OpenAI clients.

In [2]:
import os
import pymongo
from dotenv import load_dotenv

# Load environment variables
if load_dotenv():
    # Use an f-string so a missing variable prints "None" instead of raising TypeError
    print(f"Found Azure OpenAI Endpoint: {os.getenv('AZURE_OPENAI_ENDPOINT')}")
else:
    print("No file .env found")

cosmos_conn = os.getenv("MONGO_DB_CONNECTION_STRING")
cosmos_database = os.getenv("MONGO_DB_database_name")
cosmos_collection = os.getenv("MONGO_DB_collection_name")
cosmos_vector_property = os.getenv("MONGO_DB_vector_property_name")
cosmos_cache = os.getenv("MONGO_DB_cache_collection_name")

storage_file_url = os.getenv("storage_file_url")

Found Azure OpenAI Endpoint: https://msazuredev.openai.azure.com/
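
Every later cell depends on the settings read above, so it can help to check them before going further. The following is a minimal sketch, not part of the original lab: `REQUIRED_SETTINGS` and `find_missing_settings` are hypothetical names, and the list only covers variables this notebook reads with `os.getenv` in the cell above.

```python
import os

# Variable names taken from the os.getenv calls in the cell above.
REQUIRED_SETTINGS = [
    "AZURE_OPENAI_ENDPOINT",
    "MONGO_DB_CONNECTION_STRING",
    "MONGO_DB_database_name",
    "MONGO_DB_collection_name",
    "MONGO_DB_vector_property_name",
    "MONGO_DB_cache_collection_name",
    "storage_file_url",
]


def find_missing_settings(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = find_missing_settings(REQUIRED_SETTINGS)
if missing:
    print("Missing settings in .env:", ", ".join(missing))
```

Passing a plain dictionary as `env` makes the helper easy to try without touching the real environment.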


In [3]:
from langchain_openai import AzureChatOpenAI
from langchain_openai import AzureOpenAIEmbeddings

# Create the MongoDB client
cosmos_client = pymongo.MongoClient(cosmos_conn)

# Create the OpenAI client
openai_client = AzureOpenAI(
	azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
	api_key=os.getenv("OPENAI_API_KEY"),
	api_version=os.getenv("OPENAI_API_VERSION")
)

# Create an Embeddings Instance of Azure OpenAI
embeddings = AzureOpenAIEmbeddings(
    azure_deployment = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME"),
    openai_api_version = os.getenv("OPENAI_EMBEDDING_API_VERSION"),
    model= os.getenv("AZURE_OPENAI_EMBEDDING_MODEL")
)

# Create a Chat Completion Instance of Azure OpenAI
llm = AzureChatOpenAI(
    azure_deployment = os.getenv("AZURE_OPENAI_COMPLETION_DEPLOYMENT_NAME")
)



## Create a collection with a vector index

This function takes a database object, a collection name, the name of the document property that will store vectors, and the number of vector dimensions used for the embeddings.

In [4]:
def create_collection_and_vector_index(database, cosmos_collection, vector_property, embeddings_dimensions):

    collection = database[cosmos_collection]

    database.command(
        {
            "createIndexes": cosmos_collection,
            "indexes": [
                {
                    "name": "VectorSearchIndex",
                    "key": {
                        vector_property: "cosmosSearch"
                    },
                    "cosmosSearchOptions": { 
                        "kind": "vector-hnsw", 
                        "m": 16, # default value 
                        "efConstruction": 64, # default value 
                        "similarity": "COS", 
                        "dimensions": embeddings_dimensions # Number of dimensions for vector similarity. The maximum number of supported dimensions is 2000
                    } 
                } 
            ] 
        }
    )  

    return collection
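
Once such an index exists, the collection can be queried by vector. The sketch below only builds the aggregation pipeline, following the `cosmosSearch` query shape described in the vector search documentation linked at the top of this notebook. `build_vector_search_pipeline` is a hypothetical helper name, and the `similarityScore` and `document` field names are illustrative choices.

```python
def build_vector_search_pipeline(query_vector, vector_property, k=5):
    """Build an aggregation pipeline for a cosmosSearch vector query."""
    return [
        {
            "$search": {
                "cosmosSearch": {
                    "vector": query_vector,   # embedding of the query text
                    "path": vector_property,  # document property that stores the vectors
                    "k": k,                   # number of nearest neighbours to return
                },
                "returnStoredSource": True,
            }
        },
        {
            # Expose the similarity score next to the full stored document
            "$project": {
                "similarityScore": {"$meta": "searchScore"},
                "document": "$$ROOT",
            }
        },
    ]


# Hypothetical usage once the collection has been populated with embeddings:
# results = collection.aggregate(
#     build_vector_search_pipeline(query_vector, cosmos_vector_property, k=5))
```

Later in this lab LangChain issues the vector queries for us, so this pipeline is only meant to show what happens underneath.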
    

## Create the Database and Collections with Vector Index

In this lab, we will create two collections: one that stores the movie data with its embeddings, and another that stores the prompts and answers with their embeddings to implement a semantic cache.

❓ What is semantic caching?

Caching systems typically store commonly retrieved data so that it can be served efficiently later. In the context of LLMs, a semantic cache stores previously asked questions and their responses, uses a similarity measure to find semantically similar queries, and returns the cached response when a match falls within the similarity threshold. If the cache cannot return a response, the answer comes from a fresh LLM call.

👌 Benefits of semantic caching:

- Cost optimization: Since the responses are served without invoking LLMs, caching responses can bring significant cost benefits. We have come across use cases where customers have reported that 20-30% of the total queries from users can be served by the caching layer.
- Improvement in latency: LLMs are known to exhibit higher latencies when generating responses. Response caching reduces this to the extent that queries are answered from the caching layer instead of invoking LLMs every time.
- Scaling: Since questions answered by a cache hit do not invoke LLMs, provisioned resources/endpoints are free to answer unseen or newer questions from users. This can be helpful when applications are scaled to handle more users.
- Consistency in responses: Since the caching layer answers from cached responses, there is no actual generation involved and the same response is provided to queries deemed semantically similar.
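
To make the lookup logic concrete, here is a toy in-memory semantic cache. It is a sketch under simplifying assumptions, not how `AzureCosmosDBSemanticCache` is implemented: `ToySemanticCache` and `cosine_similarity` are hypothetical names, the two-dimensional vectors stand in for real embeddings, and the linear scan replaces the vector index.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class ToySemanticCache:
    def __init__(self, score_threshold):
        self.score_threshold = score_threshold
        self.entries = []  # list of (embedding, response) pairs

    def lookup(self, query_embedding):
        """Return the best cached response if it meets the threshold, else None."""
        best_score, best_response = 0.0, None
        for embedding, response in self.entries:
            score = cosine_similarity(query_embedding, embedding)
            if score > best_score:
                best_score, best_response = score, response
        if best_score >= self.score_threshold:
            return best_response
        return None  # cache miss: the caller would invoke the LLM

    def update(self, query_embedding, response):
        """Store a response under the embedding of the question that produced it."""
        self.entries.append((list(query_embedding), response))


cache_demo = ToySemanticCache(score_threshold=0.99)
cache_demo.update([1.0, 0.0], "cached answer")
print(cache_demo.lookup([0.999, 0.01]))  # nearly the same direction: cache hit
print(cache_demo.lookup([0.0, 1.0]))     # orthogonal vector: cache miss
```

A high threshold such as 0.99 means only near-identical questions are served from the cache; lowering it trades accuracy for more hits.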

In [5]:

# Check whether the database already exists and drop it if it does
if cosmos_database in cosmos_client.list_database_names():
    cosmos_client.drop_database(cosmos_database)

# Create the database 
database = cosmos_client[cosmos_database]

# Create the data collection with vector index
collection = create_collection_and_vector_index(database, cosmos_collection, cosmos_vector_property, 1536)

# Create the cache collection with vector index
cache = create_collection_and_vector_index(database, cosmos_cache, cosmos_vector_property, 1536)


## Generate embeddings from Azure OpenAI

The following function will generate embeddings for a given text. We add retry to handle any throttling due to quota limits.

In [6]:
@retry(wait=wait_random_exponential(min=1, max=200), stop=stop_after_attempt(20))
def generate_embeddings(text):
    
    response = openai_client.embeddings.create(
        input=text,
        model=os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME"),
        dimensions=1536
    )
    
    embeddings = response.model_dump()
    return embeddings['data'][0]['embedding']
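
For intuition about what the retry decorator does, the sketch below retries a call with randomized exponential waits using only the standard library. It approximates the idea and is not tenacity's implementation; `call_with_backoff` and its parameters are hypothetical names.

```python
import random
import time


def call_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=200.0,
                      sleep=time.sleep):
    """Call func(), retrying on any exception with randomized exponential waits."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The upper bound doubles on each attempt, capped at max_delay
            upper = min(max_delay, base_delay * 2 ** attempt)
            # A random wait spreads out retries from concurrent callers
            sleep(random.uniform(0, upper))


# Hypothetical usage with the function above:
# vector = call_with_backoff(lambda: generate_embeddings("A toy comes to life"))
```

The `sleep` parameter is injectable so the behaviour can be exercised without actually waiting.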

## Stream, vectorize & store

In this lab we'll use a subset of the MovieLens dataset.
We will stream the data out of blob storage, generate vectors from the *overview* property of each JSON document using the function above, then store the documents in an Azure Cosmos DB for MongoDB collection.

In [7]:
import urllib.request
# open the file and stream the data to ingest
stream = urllib.request.urlopen(storage_file_url)

counter = 0
max_count = 500

# iterate through the stream, generate vectors and insert into collection
for movie in ijson.items(stream, 'item', use_float=True):
    if counter >= max_count:
        break

    # generate embeddings
    vectorArray = generate_embeddings(movie['overview'])

    # add a vector field to the document that will contain the embeddings
    movie[cosmos_vector_property] = vectorArray

    # insert the document into the collection
    collection.insert_one(movie)

    counter += 1

    if counter % 100 == 0:
        print("Inserted {} documents into collection: '{}'.".format(counter, collection.name))
        sleep(.5)   # sleep for 0.5 seconds to help avoid rate limiting


print("Data inserted into collection: '{}'.\n".format(collection.name))

Inserted 100 documents into collection: 'movie_data'.
Inserted 200 documents into collection: 'movie_data'.
Inserted 300 documents into collection: 'movie_data'.
Inserted 400 documents into collection: 'movie_data'.
Inserted 500 documents into collection: 'movie_data'.
Data inserted into collection: 'movie_data'.



## Configure Vector Search with LangChain

In [8]:
# from_connection_string is a class method, so no separate instance is needed
vectorstore = AzureCosmosDBVectorSearch.from_connection_string(
    connection_string = cosmos_conn,
    namespace = cosmos_database + "." + cosmos_collection,
    embedding = embeddings,
    embedding_key = cosmos_vector_property,
    text_key = "overview")




## Setup RAG and Semantic Caching with your LLM

First, let's write the prompt template for the LLM. We are setting up an AI assistant to help answer questions about our movies dataset, and we instruct it to use only the context of the documents retrieved from the vector store.

In [9]:
prompt_template = """
You are an upbeat AI assistant who is excited to help answer questions about movies.  
Use only the context which is the overview of the movies:

{context},

or this chat history

{chat_history},

to answer this question. 

Question: {question}
If you don't know the answer, just say that you don't know. Don't try to make up an answer.
"""
chatbot_prompt = PromptTemplate(
    template = prompt_template, input_variables = ["context", "question", "chat_history"])

In this section, we'll implement the RAG pattern using LangChain. In LangChain, a retriever is used to augment the prompt with contextual data. In this case, the already established vector store will be used as the `retriever`. This retriever is configured to use a similarity search with a score threshold of 0.2 and to return the top 5 most similar results (k=5).

Next we have the `ConversationalRetrievalChain` object, which is responsible for managing the retrieval of responses in a conversational context. It is configured with the previously created `retriever` and is set to return the source documents of the retrieved responses. The `combine_docs_chain_kwargs` parameter sets the final prompt of the `ConversationalRetrievalChain`. We add the verbose flag to print the final prompt and see the retrieved documents that will be passed to the LLM.

The last part of the chain sets up the semantic cache. We use a similarity threshold of 0.99, so only questions that are nearly identical to a cached one are answered from the cache.

In [10]:
def prepare_chain():
    
    retriever = vectorstore.as_retriever(
        search_type = "similarity",
        search_kwargs = {"k": 5, 'score_threshold': 0.2})

    sem_qa = ConversationalRetrievalChain.from_llm(
        llm = llm,
        chain_type = "stuff",
        retriever = retriever,
        return_source_documents = True,
        combine_docs_chain_kwargs = {"prompt": chatbot_prompt},
        verbose=True)

    similarity_algorithm = CosmosDBSimilarityType.COS
    kind = CosmosDBVectorSearchType.VECTOR_IVF
    num_lists = 1
    score_threshold = 0.99

    sem_cache = AzureCosmosDBSemanticCache(
            cosmosdb_connection_string = cosmos_conn,
            cosmosdb_client = None,
            embedding = embeddings,
            database_name = cosmos_database, 
            collection_name = cosmos_cache,
            similarity = similarity_algorithm,
            num_lists = num_lists,
            kind = kind,
            dimensions = 1536,
            score_threshold = score_threshold)

    set_llm_cache(sem_cache)

    return retriever, llm, sem_qa, sem_cache
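
The combined effect of `k` and `score_threshold` on the retriever can be pictured with plain Python. This is an illustrative sketch that assumes the threshold acts as a minimum score; `select_top_k` is a hypothetical helper and the scores below are made up for the example.

```python
def select_top_k(scored_docs, k=5, score_threshold=0.2):
    """Keep the k highest-scoring documents whose score meets the threshold."""
    kept = [(doc, score) for doc, score in scored_docs if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


# Made-up (title, similarity score) pairs for illustration only
candidates = [("Toy Story", 0.83), ("Heat", 0.12), ("Toy Story 2", 0.79), ("Jumanji", 0.31)]
print(select_top_k(candidates, k=2))
```

With a low threshold such as 0.2 the retriever stays permissive, while the 0.99 threshold on the cache keeps cache hits strict.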
    

In [11]:
retriever, llm, chain, sem_cache = prepare_chain()

Let's test the chatbot with a question. The first time, the question is not yet in the cache, so it should take longer because it is sent to the LLM.

In [12]:
%%time 
query = "Tell me about films with Buzz Lightyear"
response = chain.invoke({"question": query, 'chat_history': [] })
print("***********************************************************")
print (response['answer'])


KeyError: 'metadata'

If you ask a very similar question, you will notice how much faster the response is, because it is served from the semantic cache instead of the LLM.

In [None]:
%%time
query = "Tell me something about films with Buzz Lightyear"
response = chain.invoke({"question": query, 'chat_history': [] })
print("***********************************************************")
print (response['answer'])

Now notice that, to answer the question below, the chain uses the context from the documents it retrieves from the vector store.

In [None]:
query = "Whose spaceship crashed on a desert planet"
response = chain.invoke({"question": query, 'chat_history': [] })
print("***********************************************************")
print (response['answer'])

## Next Section

📣 [Implement Retrieval Augmented Generation with Azure AI Search as vector store with semantic ranking](./aisearch.ipynb)