# Queries with and without Azure OpenAI

Now that our search engine is loaded **from two different data sources in two different indexes**, we are going to try some example queries and then use the Azure OpenAI service to see if we can get even better results.

The idea is that a user can ask a question about Computer Science (first datasource/index) or about Covid (second datasource/index), and the engine will respond accordingly.
This **Multi-Index** demo mimics the scenario where a company loads multiple types of documents about completely different topics, and the search engine must respond with the most relevant results.

## Set up variables

In [1]:
import os
import random
import urllib
import requests
from collections import OrderedDict
from IPython.display import display, HTML
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.llms import AzureOpenAI
from langchain.chat_models import AzureChatOpenAI
from langchain.vectorstores import FAISS
from langchain.docstore.document import Document
from langchain.chains.question_answering import load_qa_chain
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.embeddings import OpenAIEmbeddings

from app.prompts import COMBINE_QUESTION_PROMPT, COMBINE_PROMPT
from app.utils import model_tokens_limit, num_tokens_from_docs

# Demo Datasource Blob Storage. Change if using your own data
DATASOURCE_SAS_TOKEN = "?sv=2022-11-02&ss=bfqt&srt=sco&sp=rwdlacupyx&se=2024-05-24T08:46:47Z&st=2023-04-24T00:46:47Z&spr=https&sig=jttV8Xj2fBbzWklIZXCc%2BUroUoUygcXzS3XyFv%2F0XW0%3D"

# Don't mess with these unless you really know what you are doing
AZURE_SEARCH_API_VERSION = '2021-04-30-Preview'
AZURE_OPENAI_API_VERSION = "2023-03-15-preview"

# Change these below with your own services credentials

AZURE_SEARCH_ENDPOINT = "https://cog-search-lrj44ck74ca4y.search.windows.net"
AZURE_SEARCH_KEY = "tfEzqIH0tgFA8fi04C99RKVgz4BwtFXpcr0NBKLEvxAzSeBhNwug" # Make sure this is the MANAGEMENT KEY, not the query key
AZURE_OPENAI_ENDPOINT = "https://openainlp1.openai.azure.com/"
AZURE_OPENAI_API_KEY = "9e64f10580c44989b8fabe9b23336344"

In [2]:
# Setup the Payloads header
headers = {'Content-Type': 'application/json','api-key': AZURE_SEARCH_KEY}

## Multi-Index Search queries

In [5]:
# Index that we are going to query (from Notebook 01 and 02)
index1_name = "cogsrch-index-files"
index2_name = "cogsrch-index-csv"
#indexes = [index1_name, index2_name]
indexes = [index1_name]

Try questions that you think might be answered or addressed in computer science papers from 2020-2021, or in medical publications about COVID from 2020-2021. Try comparing the results with the open version of ChatGPT.<br>
The idea is that the answers from Azure OpenAI are based only on the information contained in these publications.

**Example Questions you can ask**:
- What is CLP?
- How do Markov chains work?
- What are some examples of reinforcement learning?
- What are the main risk factors for Covid-19?
- What medicine reduces inflammation in the lungs?
- Why doesn't Covid affect kids as much as adults?
- Does chloroquine really work against Covid?
- Tell me use cases where I can use deep learning

In [39]:
QUESTION = "muestrame detalles de  porblemas con goldengate?" 

### Search on both indexes individually and aggregate results

**Note**: In order to standardize the indexes, we require 5 mandatory fields in each index: id, title, content, pages, language. These fields must be present in each index so that every document can be treated the same way throughout the code.

In [40]:
agg_search_results = []

for index in indexes:
    url = AZURE_SEARCH_ENDPOINT + '/indexes/'+ index + '/docs'
    url += '?api-version={}'.format(AZURE_SEARCH_API_VERSION)
    url += '&search={}'.format(QUESTION)
    url += '&select=*'
    url += '&$top=5'  # You can change this to anything you need/want
    url += '&queryLanguage=en-us'
    url += '&queryType=semantic'
    url += '&semanticConfiguration=my-semantic-config'
    url += '&$count=true'
    url += '&speller=lexicon'
    url += '&answers=extractive|count-3'
    url += '&captions=extractive|highlight-false'

    resp = requests.get(url, headers=headers)
    print(url)
    print(resp.status_code)

    search_results = resp.json()
    agg_search_results.append(search_results)
    print("Results Found: {}, Results Returned: {}".format(search_results['@odata.count'], len(search_results['value'])))

https://cog-search-lrj44ck74ca4y.search.windows.net/indexes/cogsrch-index-files/docs?api-version=2021-04-30-Preview&search=muestrame detalles de  porblemas con goldengate?&select=*&$top=5&queryLanguage=en-us&queryType=semantic&semanticConfiguration=my-semantic-config&$count=true&speller=lexicon&answers=extractive|count-3&captions=extractive|highlight-false
200
Results Found: 10, Results Returned: 5
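The request above concatenates the raw question into the URL, so spaces and the trailing `?` travel unescaped (as the printed URL shows). Below is a minimal sketch of building the same query string with percent-encoding, using only the standard library. `build_search_url` is a hypothetical helper that is not part of this notebook, the endpoint is a placeholder, and the parameter values simply mirror the cell above.

```python
from urllib.parse import quote, urlencode


def build_search_url(endpoint, index, question, api_version, top=5):
    """Return a docs-search URL with every parameter value percent-encoded."""
    params = {
        "api-version": api_version,
        "search": question,
        "select": "*",
        "$top": top,
        "queryLanguage": "en-us",
        "queryType": "semantic",
        "semanticConfiguration": "my-semantic-config",
        "$count": "true",
        "speller": "lexicon",
        "answers": "extractive|count-3",
        "captions": "extractive|highlight-false",
    }
    # quote_via=quote encodes spaces as %20 (not "+"); "$", "|" and "*" are
    # kept literal so the URL stays close to the one printed above.
    query = urlencode(params, safe="$|*", quote_via=quote)
    return "{}/indexes/{}/docs?{}".format(endpoint.rstrip("/"), index, query)


example_url = build_search_url(
    "https://example.search.windows.net",  # placeholder endpoint (assumption)
    "cogsrch-index-files",
    "what is reinforcement learning?",
    "2021-04-30-Preview",
)
print(example_url)
```

Alternatively, `requests.get(url, params=...)` accepts a dictionary and performs its own encoding.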


### Display the top results (from both searches) based on the score

In [48]:
display(HTML('<h4>Top Answers</h4>'))

for search_results in agg_search_results:
    for result in search_results['@search.answers']:
        if result['score'] > 0.01: # Keep answers scoring above 0.01 (max possible score=1); raise to 0.5 to require 50% of the max
            display(HTML('<h5>' + 'Answer - score: ' + str(round(result['score'],2)) + '</h5>'))
            display(HTML(result['text']))
            
print("\n\n")
display(HTML('<h4>Top Results</h4>'))

content = dict()
ordered_content = OrderedDict()


for search_results in agg_search_results:
    for result in search_results['value']:
        if result['@search.rerankerScore'] > 0.01: # Keep results scoring above 0.01 (max possible score=4); raise to 1 to require 25% of the max
            content[result['id']]={
                                    "title": result['title'],
                                    "chunks": result['pages'],
                                    "language": result['language'], 
                                    "caption": result['@search.captions'][0]['text'],
                                    "score": result['@search.rerankerScore'],
                                    "name": result['metadata_storage_name'], 
                                    "location": result['metadata_storage_path']                  
                                }
    
# After the results have been filtered, sort them by score and add them to an ordered dictionary
for id in sorted(content, key= lambda x: content[x]["score"], reverse=True):
    ordered_content[id] = content[id]
    url = ordered_content[id]['location'] + DATASOURCE_SAS_TOKEN
    title = str(ordered_content[id]['title']) if (ordered_content[id]['title']) else ordered_content[id]['name']
    score = str(round(ordered_content[id]['score'],2))
    display(HTML('<h5><a href="'+ url + '">' + title + '</a> - score: '+ score + '</h5>'))
    display(HTML(ordered_content[id]['caption']))
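The filter-then-sort logic of the cell above can be checked in isolation with made-up data. This is a minimal sketch, not the notebook's code: `merge_and_rank` is a hypothetical helper and the sample results are invented, shaped like the `value` list of an Azure Search response.

```python
from collections import OrderedDict


def merge_and_rank(agg_search_results, min_score=0.01):
    """Flatten results from several indexes and order them by reranker score."""
    content = {}
    for search_results in agg_search_results:
        for result in search_results.get("value", []):
            score = result["@search.rerankerScore"]
            if score > min_score:
                content[result["id"]] = {"title": result["title"], "score": score}
    # Highest reranker score first, regardless of which index it came from
    ranked = sorted(content.items(), key=lambda item: item[1]["score"], reverse=True)
    return OrderedDict(ranked)


# Invented results from two indexes (illustration only)
sample = [
    {"value": [
        {"id": "a1", "title": "Markov chains", "@search.rerankerScore": 2.1},
        {"id": "a2", "title": "Off-topic paper", "@search.rerankerScore": 0.005},
    ]},
    {"value": [
        {"id": "b1", "title": "Covid risk factors", "@search.rerankerScore": 3.4},
    ]},
]

ranked = merge_and_rank(sample)
print(list(ranked))
```

Because the documents are keyed by `id`, two indexes that happen to share an id would overwrite each other; prefixing the key with the index name is one way to avoid that.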






## Comments on Query results

As seen above, the semantic search feature of the Azure Cognitive Search service works well. It gives us some answers, plus the top results with the corresponding file and the paragraph where the answer is likely located.

Let's see if we can make this better with Azure OpenAI

# Using Azure OpenAI

To use OpenAI to get a better answer to our question, the thought process is: send the documents from the search results to the GPT model, and let it understand their content and provide a better response.

We will use a great library called **LangChain** that wraps a lot of boilerplate code.
LangChain does a lot of the prompt engineering for us under the hood. For more information see [here](https://python.langchain.com/en/latest/index.html)

## A gentle intro to chaining LLMs and prompt engineering

Chains are what you get by connecting one or more large language models (LLMs) in a logical way. (Chains can be built of entities other than LLMs but for now, let’s stick with this definition for simplicity).

Azure OpenAI is one LLM provider that you can use, but there are others like Cohere, Hugging Face, etc.

Chains can be simple (i.e. Generic) or specialized (i.e. Utility).

* Generic — A single LLM is the simplest chain. It takes an input prompt and the name of the LLM and then uses the LLM for text generation (i.e. output for the prompt).

Here’s an example:

In [49]:
# Set the ENV variables that Langchain needs to connect to Azure OpenAI
os.environ["OPENAI_API_BASE"] = os.environ["AZURE_OPENAI_ENDPOINT"] = AZURE_OPENAI_ENDPOINT
os.environ["OPENAI_API_KEY"] = os.environ["AZURE_OPENAI_API_KEY"] = AZURE_OPENAI_API_KEY
os.environ["OPENAI_API_VERSION"] = os.environ["AZURE_OPENAI_API_VERSION"] = AZURE_OPENAI_API_VERSION
os.environ["OPENAI_API_TYPE"] = "azure"

In [50]:
# Create our LLM model
# Make sure you have the deployment named "gpt-35-turbo" for the model "gpt-35-turbo (0301)". 
# Use "gpt-4" if you have it available.
MODEL = "gpt-35-turbo" # options: gpt-35-turbo, gpt-4, gpt-4-32k
llm = AzureChatOpenAI(deployment_name=MODEL, temperature=0, max_tokens=500)

In [51]:
# Now we create a simple prompt template
prompt = PromptTemplate(
    input_variables=["question", "language"],
    template='Answer the following question: "{question}". Give your response in {language}',
)

print(prompt.format(question=QUESTION, language="French"))

Answer the following question: "muestrame detalles de  porblemas con goldengate?". Give your response in French


In [52]:
# And finally we create our first generic chain
chain_chat = LLMChain(llm=llm, prompt=prompt)
chain_chat({"question": QUESTION, "language": "French"})

{'question': 'muestrame detalles de  porblemas con goldengate?',
 'language': 'French',
 'text': "Je suis désolé, en tant qu'IA, je ne peux pas fournir de réponse en français car je suis programmé pour répondre en anglais. Cependant, je peux vous dire que GoldenGate est un logiciel de réplication de données qui peut rencontrer des problèmes tels que des erreurs de configuration, des conflits de données, des pannes de réseau, des problèmes de performance, etc. Pour obtenir des détails spécifiques sur les problèmes avec GoldenGate, il est recommandé de consulter la documentation officielle ou de contacter le support technique."}

Great! Now you know how to create a simple prompt and use a chain to answer a general question using ChatGPT's knowledge.

It is important to note that we rarely use generic chains as standalone chains. More often they are used as building blocks for Utility chains (as we will see next). Also important to notice is that we are NOT using our documents or the result of the Azure Search yet, just the knowledge of ChatGPT on the data it was trained on.
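To make the idea of a generic chain concrete, here is a minimal sketch in plain Python: a template with named placeholders, plus a callable that plays the role of the model. `TinyChain` and `echo_llm` are hypothetical stand-ins for illustration; this is not how LangChain implements `LLMChain`, only the same shape of input and output.

```python
class TinyChain:
    """Hypothetical stand-in for a generic chain: format a prompt, call a model."""

    def __init__(self, llm, template, input_variables):
        self.llm = llm
        self.template = template
        self.input_variables = input_variables

    def __call__(self, inputs):
        missing = [name for name in self.input_variables if name not in inputs]
        if missing:
            raise ValueError("Missing input variables: {}".format(missing))
        prompt = self.template.format(**{k: inputs[k] for k in self.input_variables})
        # Return the inputs plus the generated text, like the output shown above
        return dict(inputs, text=self.llm(prompt))


def echo_llm(prompt):
    """Fake model that just reports the prompt it received."""
    return "LLM received: " + prompt


chain_sketch = TinyChain(
    llm=echo_llm,
    template='Answer the following question: "{question}". Give your response in {language}',
    input_variables=["question", "language"],
)
result = chain_sketch({"question": "What is CLP?", "language": "French"})
print(result["text"])
```

Swapping `echo_llm` for a real model call is the only change needed to turn the sketch into something useful, which is why chains compose so easily.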

**The second type of Chains are Utility:**

* Utility — These are specialized chains, comprised of many LLMs to help solve a specific task. For example, LangChain supports some end-to-end chains (such as [QA_WITH_SOURCES](https://python.langchain.com/en/latest/modules/chains/index_examples/qa_with_sources.html) for QnA Doc retrieval, Summarization, etc) and some specific ones (such as GraphQnAChain for creating, querying, and saving graphs). 

We will look at one specific chain called **qa_with_sources** in this workshop to dig deeper and solve our use case of enhancing the results of Azure Cognitive Search.


But before dealing with the utility chain, we need to address this problem first: **the content of the search result files can be very lengthy, exceeding the number of tokens allowed by the Azure OpenAI GPT models**. So what we need to do is: split the content into chunks, vectorize those chunks, and run a vector similarity search to get the top chunks, so that we provide the best, and not too lengthy, context to the LLM.

Notice that **the document chunks are already created in Azure Search**. The *ordered_content* dictionary (created a few cells above) contains the pages (chunks) of each document. So we don't really need to chunk them again, but we still need to make sure that we are as fast as possible and that we stay below the maximum input token limit of our selected OpenAI model.

In [53]:
# Iterate over each of the results chunks and create a LangChain Document class to use further in the pipeline
docs = []
for key,value in ordered_content.items():
    for page in value["chunks"]:
        docs.append(Document(page_content=page, metadata={"source": value["location"]}))
        
print("Number of chunks:",len(docs))

Number of chunks: 5


We now need to calculate the number of tokens for all the chunks combined to decide what to do:
1) Should we embed the chunks as vectors and do cosine similarity because there is too much data to fit in the prompt as context?
2) If embedding is the decision, should we use the OpenAI embedding model or a faster, parallelizable local embedder?

In [54]:
# Calculate number of tokens of our docs
if(len(docs)>0):
    tokens_limit = model_tokens_limit(MODEL) # this is a custom function we created in app/utils.py
    num_tokens = num_tokens_from_docs(docs) # this is a custom function we created in app/utils.py
    print("Custom token limit for", MODEL, ":", tokens_limit)
    print("Combined docs tokens count:",num_tokens)
        
else:
    print("NO RESULTS FROM AZURE SEARCH")

Custom token limit for gpt-35-turbo : 3000
Combined docs tokens count: 5035
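The helpers `model_tokens_limit` and `num_tokens_from_docs` live in `app/utils.py` and are not shown here. As a rough illustration of the decision they support, the sketch below estimates tokens with a common rule of thumb of about 4 characters per token for English text. That ratio is an assumption, not what the repo uses; a real tokenizer gives exact counts. All function names are hypothetical.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate; assumes ~4 characters per token (heuristic)."""
    return -(-len(text) // chars_per_token)  # ceiling division


def estimate_tokens_from_pages(pages):
    """Sum the estimates over a list of page/chunk strings."""
    return sum(estimate_tokens(page) for page in pages)


def needs_similarity_search(pages, tokens_limit):
    """True when the combined chunks would not fit in the prompt budget."""
    return estimate_tokens_from_pages(pages) > tokens_limit


pages = ["x" * 5000] * 5  # five made-up chunks of 5000 characters each
print(estimate_tokens_from_pages(pages))
print(needs_similarity_search(pages, tokens_limit=3000))  # 3000 is the limit printed above
```

With five full-size chunks the estimate already exceeds the 3000-token budget, which matches the situation in the output above where the combined count is larger than the limit.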


Now, depending on the number of chunks/pages returned from the search, which is closely related to the size of the documents returned, we pick the embedding model that gives us fast results.

The logic is: if there are fewer than 50 chunks (of 5000 characters each) to vectorize, we use OpenAI models, which currently don't offer batch processing; if there are 50 or more chunks, we use a BERT-based in-memory model that processes in batches and in parallel (a VM with at least 4 cores is recommended).

For more information on in-memory BERT transformer models that you can use, see [HERE](https://www.sbert.net/docs/pretrained_models.html)

In [55]:
%%time
if num_tokens > tokens_limit:
    # Select the Embedder model
    if len(docs) < 50:
        # OpenAI models are accurate but slower, they also only (for now) accept one text at a time (chunk_size)
        embedder = OpenAIEmbeddings(deployment="text-embedding-ada-002", chunk_size=1) 
    else:
        # Bert based models are faster (3x-10x) but not as great in accuracy as OpenAI models
        # Since this repo supports Multiple languages we need to use a multilingual model. 
        # But if English only is the requirement, use "multi-qa-MiniLM-L6-cos-v1"
        # The fastest english model is "all-MiniLM-L12-v2"
        embedder = HuggingFaceEmbeddings(model_name = 'distiluse-base-multilingual-cased-v2')
    
    print(embedder)
    
    # Create our in-memory vector database index from the chunks given by Azure Search.
    # We are using FAISS. https://ai.facebook.com/tools/faiss/
    db = FAISS.from_documents(docs, embedder)
    top_docs = db.similarity_search(QUESTION, k=4)  # Return the top 4 documents
    
    # Now we need to recalculate the tokens count of the top results from similarity vector search
    # in order to select the chain type: stuff (all chunks in one prompt) or 
    # map_reduce (multiple calls to the LLM to summarize/reduce the chunks and then combine them)
    
    num_tokens = num_tokens_from_docs(top_docs)
    print("Token count after similarity search:", num_tokens)
    chain_type = "map_reduce" if num_tokens > tokens_limit else "stuff"
    
else:
    # if total tokens is less than our limit, we don't need to vectorize and do similarity search
    top_docs = docs
    chain_type = "stuff"
    
print("Chain Type selected:", chain_type)

client=<class 'openai.api_resources.embedding.Embedding'> model='text-embedding-ada-002' deployment='text-embedding-ada-002' openai_api_version=None openai_api_base=None openai_api_type=None embedding_ctx_length=8191 openai_api_key=None openai_organization=None allowed_special=set() disallowed_special='all' chunk_size=1 max_retries=6 request_timeout=None headers=None
Token count after similarity search: 4021
Chain Type selected: map_reduce
CPU times: user 28 ms, sys: 535 µs, total: 28.5 ms
Wall time: 615 ms
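Conceptually, the similarity search ranks chunks by how close their embedding vectors are to the embedding of the question. The sketch below shows that idea with cosine similarity on toy 3-dimensional vectors in plain Python. It is an illustration only: FAISS uses optimized indexes and, depending on configuration, may use a different metric such as L2 distance, and real embeddings have hundreds or thousands of dimensions. The vectors and names are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the ids of the k vectors most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy embeddings (made up for the example)
toy_vectors = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.9, 0.1, 0.0],
    "chunk-c": [0.0, 1.0, 0.0],
    "chunk-d": [0.0, 0.0, 1.0],
}
query = [1.0, 0.2, 0.0]
print(top_k(query, toy_vectors, k=2))
```

`chunk-b` ranks above `chunk-a` because it points in almost the same direction as the query, even though `chunk-a` has the larger first component.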


At this point we already have the most similar chunks (in order of relevance) in **top_docs**.

Now we need the Azure OpenAI GPT model to understand these top chunks and give us an answer to the question.

For this task, we need to come back to the Utility Chain: **qa_with_sources** that we mentioned before. See [HERE](https://python.langchain.com/en/latest/modules/chains/index_examples/qa_with_sources.html) for reference.

We created our own custom prompts so we can add translation to a specified language. But, for more information on the different types of prompts for this utility chain please see [HERE](https://github.com/hwchase17/langchain/tree/master/langchain/chains/question_answering)


In [56]:
if chain_type == "stuff":
    chain = load_qa_with_sources_chain(llm, chain_type=chain_type, 
                                       prompt=COMBINE_PROMPT)
elif chain_type == "map_reduce":
    chain = load_qa_with_sources_chain(llm, chain_type=chain_type, 
                                       question_prompt=COMBINE_QUESTION_PROMPT,
                                       combine_prompt=COMBINE_PROMPT,
                                       return_intermediate_steps=True)
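The difference between the two chain types is mostly about how many model calls are made and what each call sees. Here is a minimal sketch with plain functions; `fake_llm` is a stand-in that records its prompts, and the prompt wording is invented. It is not LangChain's implementation nor the custom prompts from `app/prompts.py`.

```python
def stuff_answer(llm, question, chunks):
    """One call: every chunk is placed into a single prompt."""
    context = "\n\n".join(chunks)
    return llm("Context:\n{}\n\nQuestion: {}".format(context, question))


def map_reduce_answer(llm, question, chunks):
    """One call per chunk (map), then one call over the partial answers (reduce)."""
    partials = [
        llm("Extract what is relevant to '{}' from:\n{}".format(question, chunk))
        for chunk in chunks
    ]
    final = stuff_answer(llm, question, partials)
    return final, partials


calls = []


def fake_llm(prompt):
    """Stand-in model: records the prompt and returns a short tag."""
    calls.append(prompt)
    return "answer-{}".format(len(calls))


chunks = ["chunk one", "chunk two", "chunk three"]
stuff_answer(fake_llm, "What is CLP?", chunks)
calls_for_stuff = len(calls)

calls.clear()
final, partials = map_reduce_answer(fake_llm, "What is CLP?", chunks)
calls_for_map_reduce = len(calls)
print(calls_for_stuff, calls_for_map_reduce)
```

With three chunks, the stuff approach makes one call and map_reduce makes four, which is why map_reduce is slower but can handle context that does not fit in a single prompt.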

In [20]:
# Uncomment the line below to inspect our custom COMBINE_PROMPT.
# It only works when chain_type == "map_reduce": a "stuff" chain (StuffDocumentsChain) has no
# combine_document_chain attribute, so use chain.llm_chain.prompt.template in that case.
#print(chain.combine_document_chain.llm_chain.prompt.template)

In [57]:
%%time
# Try with other language as well
response = chain({"input_documents": top_docs, "question": QUESTION, "language": "English"})

CPU times: user 14.7 ms, sys: 4.15 ms, total: 18.8 ms
Wall time: 20.7 s


In [58]:
answer = response['output_text']

display(HTML('<h4>Azure OpenAI ChatGPT Answer:</h4>'))
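# partition() avoids an IndexError when the model returns no "SOURCES:" section
answer_text, _, sources_text = answer.partition("SOURCES:")
display(HTML(answer_text))

sources_list = [s.strip() for s in sources_text.split(",") if s.strip()]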

sources_html = '<u>Sources</u>: '
for index, value in enumerate(sources_list):
    url = value + DATASOURCE_SAS_TOKEN
    sources_html +='<sup><a href="'+ url + '">[' + str(index+1) + ']</a></sup>'
    
display(HTML(sources_html))

In [38]:
# Inspect the intermediate results of the map_reduce chain type: one summary per top similar chunk (k=4 by default)

if chain_type == "map_reduce":
    for step in response['intermediate_steps']:
        display(HTML("<b>Chunk Summary:</b> " + step))

# Summary
##### This answer is much better than the raw result from Azure Cognitive Search. In summary:
- Azure Cognitive Search gives us the top results (context)
- Azure OpenAI takes these results, understands the content, and uses it as context to give the best answer
- Best of both worlds!

# NEXT
We just added a smart layer on top of Azure Cognitive Search. This is the backend for a GPT Smart Search Engine.

However, we are missing something: **How to have a conversation with this engine?**

In the next notebook, we are going to look at the concept of **memory**. This is necessary in order to have a chatbot that can hold a conversation with the user. Without memory, there is no real conversation.