# Azure Vector Search 

**Azure Cognitive Search now has vector search capabilities** ([Watch this video](https://aka.ms/Vector_SearchSnackableVideo)). The advantages of vector search in Azure Cognitive Search include its integration with the other capabilities of Azure Cognitive Search, the ability to use any type of data (text, image, audio, video, etc.) from diverse Azure datastores to inform a single generative AI-powered application, and support for vector fields in search indexes. It also offers pure vector search, hybrid retrieval, and a sophisticated re-ranking system powered by Bing in a single integrated solution (see the release [blog post](https://techcommunity.microsoft.com/t5/azure-ai-services-blog/announcing-vector-search-in-azure-cognitive-search-public/ba-p/3872868)).


![vector-search](https://techcommunity.microsoft.com/t5/image/serverpage/image-id/489211i001E2B9B34F483C2/image-dimensions/876x416?v=v2)


**The main limitations (for now) of vector search in Azure Cognitive Search are:**

- It does not generate vector embeddings for the content. Users need to provide the embeddings themselves by using a service such as Azure OpenAI.
- There is no field type for a collection of vectors, meaning that each document in the vector-based index must be either a small document or a chunk of a bigger document.
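
Because each indexed document can hold only one vector, longer texts have to be split before they are embedded. A minimal sketch of character-based chunking with overlap follows. The function name `split_into_chunks` and the default size and overlap are illustrative assumptions, not part of this notebook's utilities; a library splitter such as LangChain's `RecursiveCharacterTextSplitter` (imported further down) is the more realistic choice.

```python
def split_into_chunks(text, chunk_size=1000, overlap=100):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters, so text near a boundary
    appears in both neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text,
        # otherwise the loop would emit a redundant tail chunk.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each returned chunk would then become one document in the vector-based index, with its own embedding.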

<br>
So, based on the information above, more questions arise:

1) **How do we create the vectors for each document in the index? Do we need to manually split the text, vectorize each chunk, and push it to a new vector-based index?**
2) **Or can we use the existing text-based, AI-enriched index, which can ingest any type of file on a schedule, as the base for a new vector-based index?**

The answer, as usual, is: it depends.

Let's think about this. If your use case is just PDFs, for example, you can parse them with the [PyPDF library](https://pypi.org/project/pypdf/) or the [Azure AI Document Intelligence SDK (formerly Form Recognizer)](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/overview?view=doc-intel-3.0.0), vectorize the text using the OpenAI API, and push the content to the vector-based index. This is probably the simplest and fastest way to go. However, if your use case entails connecting to a data lake, SharePoint libraries, or any other document source with thousands of documents of multiple file types that can change dynamically, then you will want to use the ingestion, document cracking, and AI enrichment capabilities of the Azure Search engine and avoid a lot of painful custom code.

Let's try both:

1. Manually parse PDF documents using the pypdf library and Azure AI Document Intelligence, create the chunks, vectorize each chunk, and push the chunk vector and chunk text to a vector-based index.
2. Use our current text-based indexes, which already contain chunks, vectorize each chunk using a custom skill function or on demand (as documents are discovered by user searches), load new vector-based indexes, and use these new indexes in each user query.

In [2]:
import os
import json
import time
import requests
import random
from collections import OrderedDict
import urllib.request
from tqdm import tqdm
import langchain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma, FAISS
from langchain import OpenAI, VectorDBQA
from langchain.chat_models import AzureChatOpenAI
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.docstore.document import Document
from langchain.chains.question_answering import load_qa_chain
from langchain.chains.qa_with_sources import load_qa_with_sources_chain

from common.utils import parse_pdf, text_to_base64
from common.prompts import COMBINE_QUESTION_PROMPT, COMBINE_PROMPT
from common.utils import model_tokens_limit, num_tokens_from_docs

from IPython.display import Markdown, HTML, display  

from dotenv import load_dotenv
load_dotenv("credentials.env")

def printmd(string):
    display(Markdown(string))

# Set the Data source connection string.
# You can change it and use your own data if you wish
BLOB_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=demodatasetsp;AccountKey=QVFgIKPiWB+8f0mH+F7fidVLG7wq1S3WhtAqXOWaMWtr6fZ4frhVgmUzgBSdkmw4VsjoEAo7C2Hn+ASt2Cc5HA==;EndpointSuffix=core.windows.net"
BLOB_SAS_TOKEN="?sv=2022-11-02&ss=bf&srt=sco&sp=rltfx&se=2024-10-02T01:02:07Z&st=2023-08-03T17:02:07Z&spr=https&sig=gLxStXFSY6X29OPpPDpBEhoQDdtJNDrMVExNYJ%2BhmBQ%3D"
BLOB_CONTAINER_NAME = "books"
BASE_CONTAINER_URL = "https://demodatasetsp.blob.core.windows.net/" + BLOB_CONTAINER_NAME + "/"
LOCAL_FOLDER = "./data/books/"

# Create the local download folder if it does not exist yet
os.makedirs(LOCAL_FOLDER,exist_ok=True)

In [None]:
# Set the ENV variables that Langchain needs to connect to Azure OpenAI
os.environ["OPENAI_API_BASE"] = os.environ["AZURE_OPENAI_ENDPOINT"]
os.environ["OPENAI_API_KEY"] = os.environ["AZURE_OPENAI_API_KEY"]
os.environ["OPENAI_API_VERSION"] = os.environ["AZURE_OPENAI_API_VERSION"]
os.environ["OPENAI_API_TYPE"] = "azure"

In [3]:
embedder = OpenAIEmbeddings(deployment="text-embedding-ada-002", chunk_size=1) 


## 1 - Manual Document Cracking with Push to Vector-based Index

In the previous notebook, we developed solutions for various types of files and data formats commonly found in organizations. However, we encountered an issue when dealing with questions that require answers from complex files. The complexity of these files arises from their length and the way information is distributed within them.

One example of such complex files is the Technical Specification Guides, which can span hundreds of pages and contain information in the form of images, tables, forms, and more. Books are also complex due to their length and the presence of images or tables.

These files are typically in PDF format. To better handle these PDFs, we need a smarter parsing method that treats each document as a special source and processes them page by page. The objective is to obtain more accurate and faster answers from our system. Fortunately, there are usually not many of these types of documents in an organization, allowing us to make exceptions and treat them differently.

Within our demo storage account, we have a container named `books`, which holds 5 books of different lengths, languages, and complexities. Let's create a `cogsrch-index-books-vector` and load it with the pages of all these books.


In [12]:
books = ["Azure_Cognitive_Search_Documentation.pdf", 
         "Boundaries_When_to_Say_Yes_How_to_Say_No_to_Take_Control_of_Your_Life.pdf",
         "Fundamentals_of_Physics_Textbook.pdf",
         "Made_To_Stick.pdf",
         "Pere_Riche_Pere_Pauvre.pdf"]

Let's download the files to the local `./data/books/` folder:

In [13]:
for book in tqdm(books):
    book_url = BASE_CONTAINER_URL + book + BLOB_SAS_TOKEN
    urllib.request.urlretrieve(book_url, LOCAL_FOLDER+ book)

100%|██████████| 5/5 [00:02<00:00,  1.81it/s]


### What to use: pyPDF or AI Document Intelligence API (Form Recognizer)?

In `utils.py` there is a **parse_pdf()** function. This utility function can parse local files using the PyPDF library, and can also parse local PDF files or PDFs referenced by URL (`from_url`) using Azure AI Document Intelligence (formerly Form Recognizer).

If `form_recognizer=False`, the function parses the PDF using the Python pyPDF library, which does a good job roughly 75% of the time.<br>

Setting `form_recognizer=True` selects the best (and slowest) parsing method, the AI Document Intelligence API (formerly known as Form Recognizer). You can specify the prebuilt model to use; the default is `model="prebuilt-document"`. However, if you have a complex document with tables, charts, and figures, you can try `model="prebuilt-layout"`, which captures all of the nuances of each page (it takes longer, of course).

**Note: Many PDFs are scanned images. For example, any signed contract that was scanned and saved as PDF will NOT be parsed by pyPDF. Only the AI Document Intelligence API will work.**
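
One way to decide automatically when to fall back from pyPDF to Document Intelligence is to check how much of the extracted text is actually readable. The sketch below is a heuristic only: `alnum_ratio`, `needs_ocr`, and both thresholds are hypothetical names and guessed values, not part of `common/utils.py`.

```python
def alnum_ratio(text):
    """Fraction of non-whitespace characters that are letters or digits."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch.isalnum() for ch in visible) / len(visible)


def needs_ocr(page_texts, min_ratio=0.5, min_good_pages=0.5):
    """Return True when too few pages contain mostly readable text.

    page_texts: list of strings, one per page, as returned by a text parser.
    min_ratio: a page counts as readable if its alnum_ratio is at least this.
    min_good_pages: fraction of pages that must be readable.
    Both thresholds are illustrative guesses, not tuned values.
    """
    if not page_texts:
        return True  # nothing was extracted at all
    good = sum(1 for text in page_texts if alnum_ratio(text) >= min_ratio)
    return good / len(page_texts) < min_good_pages
```

Assuming the page text sits at index 2 of each entry returned by `parse_pdf()` (as the cells in this notebook use it), a call could look like `needs_ocr([page[2] for page in book_map])`.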

In [15]:
book_pages_map = dict()
for book in books:
    print("Extracting Text from",book,"...")
    
    # Capture the start time
    start_time = time.time()
    
    # Parse the PDF
    book_path = LOCAL_FOLDER+book
    book_map = parse_pdf(file=book_path, form_recognizer=False, verbose=True)
    book_pages_map[book]= book_map
    
    # Capture the end time and Calculate the elapsed time
    end_time = time.time()
    elapsed_time = end_time - start_time

    print(f"Parsing took: {elapsed_time:.6f} seconds")
    print(f"{book} contained {len(book_map)} pages\n")

Extracting Text from Azure_Cognitive_Search_Documentation.pdf ...
Extracting text using PyPDF
Parsing took: 43.208935 seconds
Azure_Cognitive_Search_Documentation.pdf contained 1947 pages

Extracting Text from Boundaries_When_to_Say_Yes_How_to_Say_No_to_Take_Control_of_Your_Life.pdf ...
Extracting text using PyPDF
Parsing took: 2.035961 seconds
Boundaries_When_to_Say_Yes_How_to_Say_No_to_Take_Control_of_Your_Life.pdf contained 357 pages

Extracting Text from Fundamentals_of_Physics_Textbook.pdf ...
Extracting text using PyPDF
Parsing took: 128.467710 seconds
Fundamentals_of_Physics_Textbook.pdf contained 1450 pages

Extracting Text from Made_To_Stick.pdf ...
Extracting text using PyPDF
Parsing took: 9.256242 seconds
Made_To_Stick.pdf contained 225 pages

Extracting Text from Pere_Riche_Pere_Pauvre.pdf ...
Extracting text using PyPDF
Parsing took: 1.204234 seconds
Pere_Riche_Pere_Pauvre.pdf contained 225 pages



Now let's check a random page of each book to make sure the parsing was done correctly:

In [18]:
for bookname,bookmap in book_pages_map.items():
    print(bookname,"\n","chunk text:",bookmap[random.randint(10, 50)][2][:80],"...\n")

Azure_Cognitive_Search_Documentation.pdf 
 chunk text: 1. Select Create demo app  at the bottom of the page to generate the HTML file.
 ...

Boundaries_When_to_Say_Yes_How_to_Say_No_to_Take_Control_of_Your_Life.pdf 
 chunk text: 26
father said, “Did I hear you right? You don’t think he has a
problem?”
“That’ ...

Fundamentals_of_Physics_Textbook.pdf 
 chunk text: 9PROBLEMS••6You can easily convert common units and measures electroni-cally, bu ...

Made_To_Stick.pdf 
 chunk text: to a halt, ongoing activities are interrupted, our attention focuses in- 
volunt ...

Pere_Riche_Pere_Pauvre.pdf 
 chunk text: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~
~ ...



As we can see above, all books were parsed except `Pere_Riche_Pere_Pauvre.pdf` (this is "Rich Dad, Poor Dad" in French). Why? As noted earlier, pyPDF cannot parse scanned PDFs, and this book was scanned, so each page is an image. We need a PDF parser with good OCR capabilities to extract the content of this PDF.
Let's try to parse this book again, this time using the Azure Document Intelligence API (formerly Form Recognizer):

In [20]:
%%time
book = "Pere_Riche_Pere_Pauvre.pdf"
book_path = LOCAL_FOLDER+book
book_map = parse_pdf(file=book_path, form_recognizer=True, model="prebuilt-document",from_url=False, verbose=True)
book_pages_map[book]= book_map

Extracting text using Azure Document Intelligence
CPU times: user 13.8 s, sys: 252 ms, total: 14.1 s
Wall time: 1min 15s


In [24]:
print(book,"\n","chunk text:",book_map[random.randint(10, 50)][2][:80],"...\n")

Pere_Riche_Pere_Pauvre.pdf 
 chunk text: monde qui les attend, un univers axé davantage sur les dépenses que sur l'épargn ...



As demonstrated above, Azure Document Intelligence handled the scanned book that pyPDF could not parse. For production scenarios, we strongly recommend using Azure Document Intelligence consistently. When doing so, it's important to choose wisely among the available models, such as "prebuilt-document", "prebuilt-layout", or others. You can find more information on model selection [HERE](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/choose-model-feature?view=doc-intel-3.0.0).


## Create Vector-based index


Now that we have the content of the books' chunks (each page of each book) in the dictionary `book_pages_map`, let's create the vector-based index in our Azure Search engine where this content is going to land.

In [25]:
index_name = "cogsrch-index-books-vector"

In [26]:
### Create Azure Search Vector-based Index
# Set up the request headers and query parameters
headers = {'Content-Type': 'application/json','api-key': os.environ['AZURE_SEARCH_KEY']}
params = {'api-version': os.environ['AZURE_SEARCH_API_VERSION']}

In [29]:
index_payload = {
    "name": index_name,
    "fields": [
        {"name": "id", "type": "Edm.String", "key": "true", "filterable": "true" },
        {"name": "title","type": "Edm.String","searchable": "true","retrievable": "true"},
        {"name": "chunks","type": "Edm.String","searchable": "true","retrievable": "true"},
        {"name": "chunkVector","type": "Collection(Edm.Single)","searchable": "true","retrievable": "true","dimensions": 1536,"vectorSearchConfiguration": "vectorConfig"},
        {"name": "name", "type": "Edm.String", "searchable": "true", "retrievable": "true", "sortable": "false", "filterable": "false", "facetable": "false"},
        {"name": "location", "type": "Edm.String", "searchable": "false", "retrievable": "true", "sortable": "false", "filterable": "false", "facetable": "false"},
        {"name": "page_num","type": "Edm.Int32","searchable": "false","retrievable": "true"},
        
    ],
    "vectorSearch": {
        "algorithmConfigurations": [
            {
                "name": "vectorConfig",
                "kind": "hnsw"
            }
        ]
    },
    "semantic": {
        "configurations": [
            {
                "name": "my-semantic-config",
                "prioritizedFields": {
                    "titleField": {
                        "fieldName": "title"
                    },
                    "prioritizedContentFields": [
                        {
                            "fieldName": "chunks"
                        }
                    ],
                    "prioritizedKeywordsFields": []
                }
            }
        ]
    }
}

r = requests.put(os.environ['AZURE_SEARCH_ENDPOINT'] + "/indexes/" + index_name,
                 data=json.dumps(index_payload), headers=headers, params=params)
print(r.status_code)
print(r.ok)

201
True
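
The `hnsw` configuration above asks the service for approximate nearest-neighbor search. As a mental model only, the exact search that such an index approximates can be sketched as a brute-force scan. This sketch assumes cosine similarity as the metric and uses toy 2-dimensional vectors instead of 1536-dimensional embeddings; `cosine_similarity` and `top_k` are hypothetical helpers, not service APIs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as 0
    return dot / (norm_a * norm_b)


def top_k(query_vector, vectors_by_id, k=5):
    """Return the k (id, score) pairs most similar to the query, best first."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in vectors_by_id.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A brute-force scan compares the query with every stored vector, which is why approximate structures such as HNSW are used once an index holds thousands of chunks.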


In [13]:
# Uncomment to debug errors
# r.text

## Upload the Document chunks and their vectors to the Vector-Based Index

In [31]:
for bookname,bookmap in book_pages_map.items():
    print("Uploading chunks from",bookname)
    for page in tqdm(bookmap):
        try:
            page_num = page[0] + 1
            content = page[2]
            book_url = BASE_CONTAINER_URL + bookname + os.environ['BLOB_SAS_TOKEN']
            upload_payload = {
                "value": [
                    {
                        "id": text_to_base64(bookname + str(page_num)),
                        "title": f"{bookname}_page_{str(page_num)}",
                        "chunks": content,
                        "chunkVector": embedder.embed_query(content if content!="" else "-------"),
                        "name": bookname,
                        "location": book_url,
                        "page_num": page_num,
                        "@search.action": "upload"
                    },
                ]
            }

            r = requests.post(os.environ['AZURE_SEARCH_ENDPOINT'] + "/indexes/" + index_name + "/docs/index",
                                 data=json.dumps(upload_payload), headers=headers, params=params)
            if r.status_code != 200:
                print(r.status_code)
                print(r.text)
        except Exception as e:
            print("Exception:",e)
            print(content)
            continue

Uploading chunks from Azure_Cognitive_Search_Documentation.pdf


100%|██████████| 1947/1947 [05:22<00:00,  6.04it/s]


Uploading chunks from Boundaries_When_to_Say_Yes_How_to_Say_No_to_Take_Control_of_Your_Life.pdf


100%|██████████| 357/357 [01:00<00:00,  5.91it/s]


Uploading chunks from Fundamentals_of_Physics_Textbook.pdf


100%|██████████| 1450/1450 [04:29<00:00,  5.37it/s]


Uploading chunks from Made_To_Stick.pdf


100%|██████████| 225/225 [00:39<00:00,  5.69it/s]


Uploading chunks from Pere_Riche_Pere_Pauvre.pdf


100%|██████████| 225/225 [00:39<00:00,  5.66it/s]
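
The loop above sends one HTTP request per page. Since the `value` array of an indexing request is a list, a possible refinement is to group several pages into each request. The sketch below only builds the payloads and sends nothing. `build_batches` and `make_key` are hypothetical helpers (the latter a stand-in for `text_to_base64`, whose implementation is not shown here), and the batch size of 100 is an arbitrary assumption; the service enforces its own limits on documents and bytes per request, and embedding vectors make each document fairly large.

```python
import base64


def make_key(text):
    """URL-safe Base64 of the text (hypothetical stand-in for text_to_base64)."""
    return base64.urlsafe_b64encode(text.encode("utf-8")).decode("ascii")


def build_batches(documents, batch_size=100):
    """Group documents into payloads of the form {"value": [...]}.

    Each document gets "@search.action": "upload" unless it already
    specifies an action.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batches = []
    for start in range(0, len(documents), batch_size):
        group = []
        for doc in documents[start:start + batch_size]:
            item = dict(doc)  # copy so the caller's dict is not modified
            item.setdefault("@search.action", "upload")
            group.append(item)
        batches.append({"value": group})
    return batches
```

Each payload could then be posted with the same `requests.post(...)` call used in the loop, one request per batch instead of one per page.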


## Query the Index

In [32]:
QUESTION = "what normally rich dad do that is different from poor dad?"
# QUESTION = "Tell me a summary of the book Boundaries"
# QUESTION = "Dime que significa la radiacion del cuerpo negro"
# QUESTION = "what is the acronym of the main point of Made to Stick book"
# QUESTION = "Tell me a python example of how do I push documents with vectors to an index using the python SDK?"
# QUESTION = "who won the soccer championship?" # this question should have no answer

In [33]:
search_payload = {
    "vectors": [{"value": embedder.embed_query(QUESTION),"fields": "chunkVector","k": 5}],
    "select": "title, chunks, name, location, page_num",
}

r = requests.post(os.environ['AZURE_SEARCH_ENDPOINT'] + "/indexes/" + index_name + "/docs/search",
                         data=json.dumps(search_payload), headers=headers, params=params)

ordered_results = r.json()
print("Results Returned: {}".format(len(ordered_results['value'])))

Results Returned: 5
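
The payload above is a pure vector query. The introduction mentions that the service also supports hybrid retrieval, so a hedged sketch of a payload builder for both modes follows. The `vectors` entry mirrors the cell above; adding a `search` string to combine keyword and vector retrieval, and `top` to cap the number of results, are assumptions about the preview API version used in this notebook, and `build_search_payload` is a hypothetical helper.

```python
def build_search_payload(question, question_vector, k=5, hybrid=False,
                         select="title, chunks, name, location, page_num"):
    """Build a search request body for a vector or hybrid query.

    question_vector: the embedding of the question (list of floats).
    hybrid: when True, the question text is sent as well (assumed to make
            the service combine keyword and vector results).
    """
    payload = {
        "vectors": [{"value": question_vector, "fields": "chunkVector", "k": k}],
        "select": select,
        "top": k,  # assumption: cap the total number of returned results
    }
    if hybrid:
        payload["search"] = question
    return payload
```

With the objects from this notebook, the call would be `build_search_payload(QUESTION, embedder.embed_query(QUESTION), hybrid=True)`, and the result would be posted exactly like `search_payload` above.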


In [34]:
ordered_results

{'@odata.context': "https://cog-search-arbnfv3wffx5o.search.windows.net/indexes('cogsrch-index-books-vector')/$metadata#docs(*)",
 'value': [{'@search.score': 0.8654169,
   'title': 'Pere_Riche_Pere_Pauvre.pdf_page_1',
   'chunks': "Best-seller du New York Times\nPère riche Père pauvre\nVersion française de Rich Dad, Poor Dad\nCe que les parents riches enseignent à leurs enfants à propos de l'argent afin qu'il soit à leur service\nRobert T. Kiyosaki et Sharon L. Lechter UN MONDE DIFFÉRENT ",
   'name': 'Pere_Riche_Pere_Pauvre.pdf',
   'location': 'https://demodatasetsp.blob.core.windows.net/books/Pere_Riche_Pere_Pauvre.pdf?sv=2022-11-02&ss=b&srt=o&sp=rlytf&se=2024-04-17T01:54:45Z&st=2023-08-10T17:54:45Z&spr=https&sig=nJg%2BFf7rs%2Bjp2syY5BET0GvaXOYjxGJmw36kQgVm7TE%3D',
   'page_num': 1},
  {'@search.score': 0.85955626,
   'title': 'Pere_Riche_Pere_Pauvre.pdf_page_3',
   'chunks': 'Données de catalogage avant publication (Canada)\nKiyosaki, Robert T., 1947-\nPère riche, père pauvre : de

In [35]:
MODEL = "gpt-35-turbo-16k" # options: gpt-35-turbo, gpt-35-turbo-16k, gpt-4, gpt-4-32k
llm = AzureChatOpenAI(deployment_name=MODEL, temperature=0, max_tokens=1000)

In [37]:
# Iterate over each of the results chunks and create a LangChain Document class to use further in the pipeline
top_docs = []
for page in ordered_results.get("value", []):
    location = page["location"] if page["location"] is not None else ""
    top_docs.append(Document(page_content=page["chunks"], metadata={"source": location}))
        
print("Number of chunks:",len(top_docs))

Number of chunks: 5


In [38]:
# Calculate number of tokens of our docs
if(len(top_docs)>0):
    tokens_limit = model_tokens_limit(MODEL) # this is a custom function we created in common/utils.py
    num_tokens = num_tokens_from_docs(top_docs) # this is a custom function we created in common/utils.py
    print("Custom token limit for", MODEL, ":", tokens_limit)
    print("Combined docs tokens count:",num_tokens)
        
else:
    print("NO RESULTS FROM AZURE SEARCH")

Custom token limit for gpt-35-turbo-16k : 14500
Combined docs tokens count: 1786


In [39]:
chain_type = "map_reduce" if num_tokens > tokens_limit else "stuff"  
print("Chain Type selected:", chain_type)

Chain Type selected: stuff
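
The decision above hinges on whether all retrieved chunks fit in a single prompt: `stuff` sends them together, while `map_reduce` processes each chunk separately and then combines the partial answers. The helpers `model_tokens_limit` and `num_tokens_from_docs` live in `common/utils.py` and are not shown here, so the following is only a sketch of the idea. `estimate_tokens` is a crude assumption (about four characters per token); a real implementation would use the model's tokenizer.

```python
def estimate_tokens(text):
    """Very rough token estimate: about 4 characters per token (assumption)."""
    return (len(text) + 3) // 4


def choose_chain_type(chunks, tokens_limit, count_tokens=estimate_tokens):
    """Pick 'stuff' when all chunks fit within tokens_limit, else 'map_reduce'.

    chunks: list of strings (the text of each retrieved chunk).
    Returns a (chain_type, total_tokens) tuple.
    """
    total = sum(count_tokens(chunk) for chunk in chunks)
    chain_type = "map_reduce" if total > tokens_limit else "stuff"
    return chain_type, total
```

Passing a different `count_tokens` callable lets the same selection logic work with an exact tokenizer.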


In [40]:
if chain_type == "stuff":
    chain = load_qa_with_sources_chain(llm, chain_type=chain_type, 
                                       prompt=COMBINE_PROMPT)
elif chain_type == "map_reduce":
    chain = load_qa_with_sources_chain(llm, chain_type=chain_type, 
                                       question_prompt=COMBINE_QUESTION_PROMPT,
                                       combine_prompt=COMBINE_PROMPT,
                                       return_intermediate_steps=True)

In [41]:
%%time
# Try with other language as well
response = chain({"input_documents": top_docs, "question": QUESTION, "language": "English"})

CPU times: user 8.65 ms, sys: 375 µs, total: 9.02 ms
Wall time: 9.83 s


In [42]:
display(Markdown(response['output_text']))

In the book "Père riche, Père pauvre" (Rich Dad, Poor Dad), it is explained that the rich dad and the poor dad have different approaches to money and financial education. The rich dad teaches his children about money and how to make it work for them, while the poor dad does not provide the same financial education. The book explores the different mindsets and strategies for achieving financial success<sup><a href="https://demodatasetsp.blob.core.windows.net/books/Pere_Riche_Pere_Pauvre.pdf?sv=2022-11-02&ss=b&srt=o&sp=rlytf&se=2024-04-17T01:54:45Z&st=2023-08-10T17:54:45Z&spr=https&sig=nJg%2BFf7rs%2Bjp2syY5BET0GvaXOYjxGJmw36kQgVm7TE%3D">[1]</a></sup><sup><a href="https://demodatasetsp.blob.core.windows.net/books/Pere_Riche_Pere_Pauvre.pdf?sv=2022-11-02&ss=b&srt=o&sp=rlytf&se=2024-04-17T01:54:45Z&st=2023-08-10T17:54:45Z&spr=https&sig=nJg%2BFf7rs%2Bjp2syY5BET0GvaXOYjxGJmw36kQgVm7TE%3D">[2]</a></sup><sup><a href="https://demodatasetsp.blob.core.windows.net/books/Pere_Riche_Pere_Pauvre.pdf?sv=2022-11-02&ss=b&srt=o&sp=rlytf&se=2024-04-17T01:54:45Z&st=2023-08-10T17:54:45Z&spr=https&sig=nJg%2BFf7rs%2Bjp2syY5BET0GvaXOYjxGJmw36kQgVm7TE%3D">[3]</a></sup><sup><a href="https://demodatasetsp.blob.core.windows.net/books/Pere_Riche_Pere_Pauvre.pdf?sv=2022-11-02&ss=b&srt=o&sp=rlytf&se=2024-04-17T01:54:45Z&st=2023-08-10T17:54:45Z&spr=https&sig=nJg%2BFf7rs%2Bjp2syY5BET0GvaXOYjxGJmw36kQgVm7TE%3D">[4]</a></sup>.

## 2 - On-Demand Vectorization with Text-based-AI-Enriched Index

The previous method proved to be highly effective: it not only solved the challenge of handling large and complex PDF documents, but also made searches approximately 10 seconds faster on average thanks to vector search.

However, this method does have its limitations:

- It is limited to processing only PDF files.
- Because this is a PUSH method, it doesn't use the advantages of the Indexer (PULL method): Scheduler, Change and Delete file detection, automated id key creation, etc.

Our ultimate goal is to rely solely on vector indexes to overcome our initial limitations. While it is possible to manually code parsers with OCR for various file types and develop a scheduler to synchronize data with the index, there is a more efficient alternative: **Azure Cognitive Search is going to release automated chunking strategies and vectorization within the coming months**, so we have three options:
1. Wait for this functionality and, in the meantime, keep embedding on demand as shown in Notebook 3.
2. Create a vector-based index for each text-based index and fill it on demand as documents are discovered.
3. Use custom skills (for chunking and vectorization) and knowledge stores to create a vector-based index from a text-based, AI-enriched index at ingestion time. See [HERE](https://github.com/Azure/cognitive-search-vector-pr/blob/main/demo-python/code/azure-search-vector-ingestion-python-sample.ipynb) for instructions on how to do this.

Below we are going to try Option 2: **Create a vector-based index for each text-based index and fill it on demand as documents are discovered**. Why? Because it is simpler and quicker to implement while we wait for Option 1 to become a feature of the Azure Search engine.

As you noticed in Notebooks 1 and 2, there is a field in the index called `vectorized` that we have not used yet. Now we will make use of that field.
The goal is NOT to vectorize all documents at ingestion time, but instead to vectorize the chunks as people search. That way we spend money and resources only when the documents are needed.
Normally, in an organization with vast amounts of documents in a data lake, only about 20% of the documents are what people need; the rest are never needed.
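
The on-demand idea can be sketched in a few lines: embed a chunk the first time a search surfaces it, and reuse the stored vector afterwards. In this sketch a plain dictionary stands in for the vector-based index, the result shape (`id` and `chunk` keys) is assumed, and `vectorize_on_demand` and `fake_embed` are hypothetical names; the notebook's own helpers in `common/utils.py` may work differently.

```python
def vectorize_on_demand(search_results, vector_store, embed):
    """Embed only the chunks that have not been embedded before.

    search_results: list of dicts with "id" and "chunk" keys (assumed shape).
    vector_store: dict mapping chunk id -> vector; stands in for a vector index.
    embed: callable taking a string and returning a list of floats.
    Returns the ids that were newly embedded during this call.
    """
    newly_embedded = []
    for result in search_results:
        chunk_id = result["id"]
        if chunk_id in vector_store:
            continue  # already vectorized by an earlier search
        vector_store[chunk_id] = embed(result["chunk"])
        newly_embedded.append(chunk_id)
    return newly_embedded


def fake_embed(text):
    """Toy embedder for demonstration only; a real one would call a model."""
    return [float(len(text)), float(text.count(" "))]
```

Running the same search twice embeds each chunk only once, which is where the cost saving comes from: documents that nobody searches for are never sent to the embedding model.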

In [1]:
#custom libraries that we will use later in the app
from common.utils import (
    get_search_results,
    order_search_results,
    model_tokens_limit,
    num_tokens_from_docs,
    embed_docs,
    search_docs,
    get_answer,
)

In [None]:
index1_name = "cogsrch-index-files"
index2_name = "cogsrch-index-csv"
indexes = [index1_name, index2_name]

agg_search_results = get_search_results(QUESTION, indexes)
ordered_results = order_search_results(agg_search_results, reranker_threshold=1)

# Summary

In this notebook, we have acquired an understanding of how to address the challenge of indexing complex or large documents by leveraging the vector search capabilities offered by Azure Cognitive Search.

Additionally, we concluded that, until Azure Search introduces automated chunk index creation via the Indexer, it is more straightforward to proceed with an on-the-fly vectorization strategy for the majority of smaller documents stored in the data lake. Although this might seem inefficient for now, it avoids the manual complexity of creating custom skills and keeping the document index and the chunk index synchronized.

# NEXT
So far we have learned how to use OpenAI vectors and completion APIs to get an excellent answer from our documents stored in Azure Cognitive Search. This is the backbone of a GPT Smart Search Engine.

However, we are missing something: **How to have a conversation with this engine?**

In the next notebook, we are going to explore the concept of **memory**. This is necessary in order to have a chatbot that can hold a conversation with the user. Without memory, there is no real conversation.