# How to deal with complex/large Documents

In the previous notebook, we developed a solution that handles the file types and data formats commonly found in organizations, which covers roughly 90% of the use cases. However, questions whose answers live inside complex files remain a problem. These files are complex because of their length and the way information is distributed within them. Large documents are a persistent challenge for search engines.

One example of such complex files is Technical Specification Guides or Product Manuals, which can span hundreds of pages and contain information in the form of images, tables, forms, and more. Books are also complex due to their length and the presence of images or tables.

These files are typically in PDF format. To handle these PDFs better, we need a smarter parsing method that treats each document as a special source and processes it page by page. The objective is to obtain more accurate and faster answers from our system. Fortunately, an organization usually has few documents of this type, so we can make exceptions and treat them differently.

If your use case involves only PDFs, you can use the [PyPDF library](https://pypi.org/project/pypdf/) or the [Azure AI Document Intelligence SDK (formerly Form Recognizer)](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/overview?view=doc-intel-3.0.0), vectorize the content with the OpenAI API, and push it to a vector-based index. This is probably the simplest and fastest way to go. However, if your use case entails connecting to a data lake, SharePoint libraries, or any other document source with thousands of documents of multiple file types that can change dynamically, then you will want to use the ingestion, document cracking, and AI enrichment capabilities of the Azure Search engine (Notebooks 1-3) and avoid a lot of painful custom code.
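For the PDF-only path, a minimal sketch of the first step (page-by-page extraction) might look like the following. It assumes the `pypdf` package is installed. The helper names `extract_pages` and `pages_to_records` and the record layout are illustrative; they are not the `parse_pdf()` utility used later in this notebook.

```python
from typing import Dict, List, Tuple


def extract_pages(pdf_path: str) -> List[Tuple[int, str]]:
    """Return (zero-based page index, extracted text) for every page of a PDF."""
    # Imported lazily so the rest of the sketch works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() can come back empty for scanned (image-only) pages.
    return [(i, page.extract_text() or "") for i, page in enumerate(reader.pages)]


def pages_to_records(name: str, pages: List[Tuple[int, str]]) -> List[Dict]:
    """Turn page tuples into one record per page, ready to be embedded and indexed."""
    return [
        {"name": name, "page_num": index + 1, "chunk": text.strip()}
        for index, text in pages
    ]


# Usage with in-memory pages (hypothetical content, no PDF required):
sample_pages = [(0, " Title page "), (1, "Chapter 1\nOnce upon a time")]
records = pages_to_records("example.pdf", sample_pages)
print(records[1]["page_num"], repr(records[0]["chunk"]))
```

Each record would then be embedded and pushed to the index, which is what the rest of this notebook does with its own helpers.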


In [1]:
import os
import json
import time
import requests
import random
from collections import OrderedDict
import urllib.request
from tqdm import tqdm
import langchain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma, FAISS
from langchain import OpenAI, VectorDBQA
from langchain.chat_models import AzureChatOpenAI
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.docstore.document import Document
from langchain.chains.question_answering import load_qa_chain
from langchain.chains.qa_with_sources import load_qa_with_sources_chain

from common.utils import parse_pdf, read_pdf_files, text_to_base64
from common.prompts import COMBINE_QUESTION_PROMPT, COMBINE_PROMPT, COMBINE_PROMPT_TEMPLATE
from common.utils import (
    get_search_results,
    model_tokens_limit,
    num_tokens_from_docs,
    num_tokens_from_string
)


from IPython.display import Markdown, HTML, display  

from dotenv import load_dotenv
load_dotenv("credentials_my.env")

def printmd(string):
    display(Markdown(string))
    
os.makedirs("data/books/",exist_ok=True)
LOCAL_FOLDER = "./data/books/"

BLOB_CONTAINER_NAME = "books"
storage_account = [r for r in os.environ['BLOB_CONNECTION_STRING_PUBLIC'].split(';')][1].split('=')[1]
BASE_CONTAINER_URL = f"https://{storage_account}.blob.core.windows.net/{BLOB_CONTAINER_NAME}/"
print(f"Books public location: {BASE_CONTAINER_URL}")

MODEL = os.environ["COMPLETION3516_DEPLOYMENT"]
book_index_name = "cogsrch-index-books-vector"

Books public location: https://demodatasetsp.blob.core.windows.net/books/
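
The cell above reads the storage account name by position (the second `;`-separated field of the connection string). A slightly more defensive sketch looks the value up by key instead. The helper name `account_name_from_connection_string` and the sample connection string are hypothetical; `AccountName` is the usual field name in Azure Storage connection strings.

```python
def account_name_from_connection_string(connection_string: str) -> str:
    """Look up AccountName by key rather than by position in the string."""
    parts = {}
    for segment in connection_string.split(";"):
        if "=" in segment:
            # Split on the first "=" only: account keys often end with "=" padding.
            key, value = segment.split("=", 1)
            parts[key.strip()] = value.strip()
    if "AccountName" not in parts:
        raise ValueError("connection string has no AccountName field")
    return parts["AccountName"]


# Hypothetical connection string, for illustration only.
sample = (
    "DefaultEndpointsProtocol=https;AccountName=mystorageacct;"
    "AccountKey=abc123==;EndpointSuffix=core.windows.net"
)
print(account_name_from_connection_string(sample))
```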


In [2]:
# Check if this is the first time that we run this notebook
headers         = {'Content-Type': 'application/json','api-key': os.environ['AZURE_SEARCH_KEY']}
params          = {'api-version': os.environ['AZURE_SEARCH_API_VERSION']}
r               = requests.get(os.environ['AZURE_SEARCH_ENDPOINT'] + "/indexes", headers=headers, params=params)
FIRST_EXECUTION = book_index_name not in r.text
print(f"FIRST_EXECUTION: {FIRST_EXECUTION}")

FIRST_EXECUTION: True
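
The check above looks for the index name anywhere in the raw response body, so it would also match an index whose name merely contains `cogsrch-index-books-vector`. A stricter sketch compares names exactly. It assumes the list-indexes response is a JSON object with a `value` array of index definitions, each carrying a `name`; the helper `index_exists` and the sample payload are hypothetical.

```python
import json


def index_exists(list_indexes_body: str, index_name: str) -> bool:
    """Return True only if an index with exactly this name is listed."""
    payload = json.loads(list_indexes_body)
    names = {index.get("name") for index in payload.get("value", [])}
    return index_name in names


# Hypothetical response body, shaped like the assumed list-indexes output.
sample_body = json.dumps(
    {"value": [{"name": "cogsrch-index-books-vector-v2"}, {"name": "other-index"}]}
)
exact = index_exists(sample_body, "cogsrch-index-books-vector")
substring = "cogsrch-index-books-vector" in sample_body
print(exact, substring)
```

With such a helper, the cell could set `FIRST_EXECUTION = not index_exists(r.text, book_index_name)`.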


In [3]:
# Set the ENV variables that Langchain needs to connect to Azure OpenAI
os.environ["OPENAI_API_BASE"]    = os.environ["AZURE_OPENAI_ENDPOINT"]
os.environ["OPENAI_API_KEY"]     = os.environ["AZURE_OPENAI_API_KEY"]
os.environ["OPENAI_API_VERSION"] = os.environ["AZURE_OPENAI_API_VERSION"]
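# OPENAI_API_TYPE is already set by credentials_my.env; this self-assignment only raises a KeyError early if it is missing
os.environ["OPENAI_API_TYPE"]    = os.environ["OPENAI_API_TYPE"]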

In [4]:
embedder = OpenAIEmbeddings(deployment=os.environ["EMBEDDING_DEPLOYMENT"], chunk_size=1)

## 1 - Manual Document Cracking with Push to Vector-based Index

Within our demo storage account, we have a container named `books`, which holds 5 books of different lengths, languages, and complexities. Let's create an index named `cogsrch-index-books-vector` and load it with the pages of all these books.

We begin by listing the books that we are going to download to our local machine:

In [5]:
if FIRST_EXECUTION:    
    books = ["Azure_Cognitive_Search_Documentation.pdf", 
             "Boundaries_When_to_Say_Yes_How_to_Say_No_to_Take_Control_of_Your_Life.pdf",
             "Fundamentals_of_Physics_Textbook.pdf",
             "Made_To_Stick.pdf",
             "Pere_Riche_Pere_Pauvre.pdf"]
    
else:
    print("Command skipped since it's not the first execution")

Let's download the files to the local `./data/books/` folder:

In [6]:
if FIRST_EXECUTION:
    
    for book in tqdm(books):
        book_url = BASE_CONTAINER_URL + book + os.environ['BLOB_SAS_TOKEN_PUBLIC']
        print(f"Downloading {book_url} to {LOCAL_FOLDER + book}...")
        urllib.request.urlretrieve(book_url, LOCAL_FOLDER + book)

    print("Copy data completed. Please note that the files might have been overwritten by this copy task.")
    
else:
    print("Command skipped since it's not the first execution")

  0%|          | 0/5 [00:00<?, ?it/s]

Downloading https://demodatasetsp.blob.core.windows.net/books/Azure_Cognitive_Search_Documentation.pdf?sv=2022-11-02&ss=bf&srt=sco&sp=rl&se=2025-11-06T23:27:04Z&st=2023-11-06T15:27:04Z&spr=https&sig=IxmYt1nWtSI0MtBHeQBC1t%2F4VeoN19HqQM1Xu6tvacU%3D to ./data/books/Azure_Cognitive_Search_Documentation.pdf...


 20%|██        | 1/5 [00:07<00:29,  7.34s/it]

Downloading https://demodatasetsp.blob.core.windows.net/books/Boundaries_When_to_Say_Yes_How_to_Say_No_to_Take_Control_of_Your_Life.pdf?sv=2022-11-02&ss=bf&srt=sco&sp=rl&se=2025-11-06T23:27:04Z&st=2023-11-06T15:27:04Z&spr=https&sig=IxmYt1nWtSI0MtBHeQBC1t%2F4VeoN19HqQM1Xu6tvacU%3D to ./data/books/Boundaries_When_to_Say_Yes_How_to_Say_No_to_Take_Control_of_Your_Life.pdf...


 40%|████      | 2/5 [00:09<00:12,  4.18s/it]

Downloading https://demodatasetsp.blob.core.windows.net/books/Fundamentals_of_Physics_Textbook.pdf?sv=2022-11-02&ss=bf&srt=sco&sp=rl&se=2025-11-06T23:27:04Z&st=2023-11-06T15:27:04Z&spr=https&sig=IxmYt1nWtSI0MtBHeQBC1t%2F4VeoN19HqQM1Xu6tvacU%3D to ./data/books/Fundamentals_of_Physics_Textbook.pdf...


 60%|██████    | 3/5 [00:15<00:10,  5.10s/it]

Downloading https://demodatasetsp.blob.core.windows.net/books/Made_To_Stick.pdf?sv=2022-11-02&ss=bf&srt=sco&sp=rl&se=2025-11-06T23:27:04Z&st=2023-11-06T15:27:04Z&spr=https&sig=IxmYt1nWtSI0MtBHeQBC1t%2F4VeoN19HqQM1Xu6tvacU%3D to ./data/books/Made_To_Stick.pdf...


 80%|████████  | 4/5 [00:17<00:03,  3.75s/it]

Downloading https://demodatasetsp.blob.core.windows.net/books/Pere_Riche_Pere_Pauvre.pdf?sv=2022-11-02&ss=bf&srt=sco&sp=rl&se=2025-11-06T23:27:04Z&st=2023-11-06T15:27:04Z&spr=https&sig=IxmYt1nWtSI0MtBHeQBC1t%2F4VeoN19HqQM1Xu6tvacU%3D to ./data/books/Pere_Riche_Pere_Pauvre.pdf...


100%|██████████| 5/5 [00:19<00:00,  3.94s/it]

Copy data completed. Please note that the files might have been overwritten by this copy task.





### What to use: pyPDF or AI Document Intelligence API (Form Recognizer)?

In `utils.py` there is a **parse_pdf()** function. This utility function can parse local files using the PyPDF library, and it can parse local or remote (`from_url`) PDF files using Azure AI Document Intelligence (formerly Form Recognizer).

If `form_recognizer=False`, the function parses the PDF using the Python pyPDF library, which does a good job about 75% of the time.

Setting `form_recognizer=True` selects the best (and slower) parsing method, the AI Document Intelligence API (formerly known as Form Recognizer). You can specify the prebuilt model to use; the default is `model="prebuilt-document"`. However, if you have a complex document with tables, charts, and figures, you can try `model="prebuilt-layout"`, which captures all of the nuances of each page (it takes longer, of course).

**Note: Many PDFs are scanned images. For example, a signed contract that was scanned and saved as a PDF will NOT be parsed by pyPDF. Only the AI Document Intelligence API will work.**

In [7]:
if FIRST_EXECUTION:
    book_pages_map = dict()
    for book in books:
        print("Extracting Text from",book,"...")

        # Capture the start time
        start_time = time.time()

        # Parse the PDF
        book_path = LOCAL_FOLDER+book
        book_map = parse_pdf(file=book_path, form_recognizer=False, verbose=True)
        book_pages_map[book]= book_map

        # Capture the end time and Calculate the elapsed time
        end_time = time.time()
        elapsed_time = end_time - start_time

        print(f"Parsing took: {elapsed_time:.6f} seconds")
        print(f"{book} contained {len(book_map)} pages\n")
        
else:
    print("Command skipped since it's not the first execution")

Extracting Text from Azure_Cognitive_Search_Documentation.pdf ...
Extracting text using PyPDF
Parsing took: 39.764616 seconds
Azure_Cognitive_Search_Documentation.pdf contained 1947 pages

Extracting Text from Boundaries_When_to_Say_Yes_How_to_Say_No_to_Take_Control_of_Your_Life.pdf ...
Extracting text using PyPDF
Parsing took: 2.244751 seconds
Boundaries_When_to_Say_Yes_How_to_Say_No_to_Take_Control_of_Your_Life.pdf contained 357 pages

Extracting Text from Fundamentals_of_Physics_Textbook.pdf ...
Extracting text using PyPDF
Parsing took: 113.456667 seconds
Fundamentals_of_Physics_Textbook.pdf contained 1450 pages

Extracting Text from Made_To_Stick.pdf ...
Extracting text using PyPDF
Parsing took: 9.163140 seconds
Made_To_Stick.pdf contained 225 pages

Extracting Text from Pere_Riche_Pere_Pauvre.pdf ...
Extracting text using PyPDF
Parsing took: 1.322419 seconds
Pere_Riche_Pere_Pauvre.pdf contained 225 pages



### You may execute the next cell multiple times to check a random page of each book to make sure the parsing was done correctly

In [10]:
if FIRST_EXECUTION:
    for bookname,bookmap in book_pages_map.items():
        print(bookname, "\n","chunk text:", bookmap[random.randint(10, 50)][2][:80], "...\n")
        
else:
    print("Command skipped since it's not the first execution")

Azure_Cognitive_Search_Documentation.pdf 
 chunk text: Categor y                            Featur es
Inbound access Azur e role-b ased ...

Boundaries_When_to_Say_Yes_How_to_Say_No_to_Take_Control_of_Your_Life.pdf 
 chunk text: 39
and Eve. We need to own our attitudes and convictions because
they fall withi ...

Fundamentals_of_Physics_Textbook.pdf 
 chunk text: xixPREFACEinstead of just being flat on a printed page. Not only does this give  ...

Made_To_Stick.pdf 
 chunk text: Idea Clinics  
The goal of this book is to help you make your idea s stick. So,  ...

Pere_Riche_Pere_Pauvre.pdf 
 chunk text: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~ ...



As we can see above, all books were parsed except `Pere_Riche_Pere_Pauvre.pdf` (this book is "Rich Dad, Poor Dad" in French). Why? As the note above warned, this is a scanned book: each page is an image, set in a very distinctive font. Extracting the content of this PDF requires a parser with good OCR capabilities.
Let's parse this book again, this time using the Azure Document Intelligence API (formerly Form Recognizer):

In [11]:
%%time
if FIRST_EXECUTION:
    book = "Pere_Riche_Pere_Pauvre.pdf"
    book_path = LOCAL_FOLDER+book
    book_map = parse_pdf(file=book_path, form_recognizer=True, model="prebuilt-document",from_url=False, verbose=True)
    book_pages_map[book]= book_map
    
else:
    print("Command skipped since it's not the first execution")

Extracting text using Azure Document Intelligence
CPU times: user 12.3 s, sys: 260 ms, total: 12.5 s
Wall time: 35.4 s


### Note: If the above command throws an error, create another Form Recognizer resource in the Azure portal in the same resource group (don't delete it), and try again.
#### This seems to be a transient error.

In [16]:
if FIRST_EXECUTION:
    print(book,"\n","chunk text:",book_map[random.randint(10, 50)][2][:80],"...\n")

else:
    print("Command skipped since it's not the first execution")

Pere_Riche_Pere_Pauvre.pdf 
 chunk text: parvient même pas à payer ses factures. La plupart des gens, si on leur donne pl ...



As demonstrated above, Azure Document Intelligence handles the scanned book that pyPDF could not parse. **For production scenarios, we strongly recommend using Azure Document Intelligence consistently**. When doing so, choose carefully among the available models, such as "prebuilt-document", "prebuilt-layout", or others. You can find more information on model selection [here](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/choose-model-feature?view=doc-intel-3.0.0).
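
A quick way to spot books like this one before indexing is to measure how much of the extracted text is actually alphanumeric. The sketch below is a heuristic under stated assumptions: the function names and the 0.5 threshold are illustrative choices, not part of `common.utils`, and each page entry is assumed to be a tuple whose third element is the page text, as the indexing in the cells above suggests.

```python
from typing import Sequence


def alnum_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are letters or digits."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch.isalnum() for ch in visible) / len(visible)


def needs_ocr(page_map: Sequence[Sequence], threshold: float = 0.5) -> bool:
    """Flag a document when most of its pages contain little readable text."""
    if not page_map:
        return True
    poor_pages = sum(1 for page in page_map if alnum_ratio(page[2]) < threshold)
    return poor_pages > len(page_map) / 2


# Hypothetical page maps; the middle tuple element is a placeholder.
readable = [(0, 0, "The goal of this book is to help you make your ideas stick.")]
garbled = [(0, 0, "~~~~~~~~~~~~~~~~~~~~~~~~\n~~~~~~~~")]
print(needs_ocr(readable), needs_ocr(garbled))
```

A document flagged this way is a candidate for re-parsing with `form_recognizer=True`.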


## Create Vector-based index


Now that we have the content of the books' chunks (each page of each book) in the dictionary `book_pages_map`, let's create the vector-based index in Azure Search where this content is going to land.

In [17]:
if FIRST_EXECUTION:
    ### Create Azure Search Vector-based Index
    # Setup the Payloads header
    headers = {'Content-Type': 'application/json','api-key': os.environ['AZURE_SEARCH_KEY']}
    params  = {'api-version': os.environ['AZURE_SEARCH_API_VERSION']}
    
else:
    print("Command skipped since it's not the first execution")

In [18]:
if FIRST_EXECUTION:
    index_payload = {
        "name": book_index_name,
        "fields": [
            {"name": "id", "type": "Edm.String", "key": "true", "filterable": "true" },
            {"name": "title","type": "Edm.String","searchable": "true","retrievable": "true"},
            {"name": "chunk","type": "Edm.String","searchable": "true","retrievable": "true"},
            {"name": "chunkVector","type": "Collection(Edm.Single)","searchable": "true","retrievable": "true","dimensions": 1536,"vectorSearchConfiguration": "vectorConfig"},
            {"name": "name", "type": "Edm.String", "searchable": "true", "retrievable": "true", "sortable": "false", "filterable": "false", "facetable": "false"},
            {"name": "location", "type": "Edm.String", "searchable": "false", "retrievable": "true", "sortable": "false", "filterable": "false", "facetable": "false"},
            {"name": "page_num","type": "Edm.Int32","searchable": "false","retrievable": "true"},

        ],
        "vectorSearch": {
            "algorithmConfigurations": [
                {
                    "name": "vectorConfig",
                    "kind": "hnsw"
                }
            ]
        },
        "semantic": {
            "configurations": [
                {
                    "name": "my-semantic-config",
                    "prioritizedFields": {
                        "titleField": {
                            "fieldName": "title"
                        },
                        "prioritizedContentFields": [
                            {
                                "fieldName": "chunk"
                            }
                        ],
                        "prioritizedKeywordsFields": []
                    }
                }
            ]
        }
    }

    r = requests.put(os.environ['AZURE_SEARCH_ENDPOINT'] + "/indexes/" + book_index_name,
                     data=json.dumps(index_payload), headers=headers, params=params)
    print(r.status_code)
    print(r.ok)
    
else:
    print("Command skipped since it's not the first execution")

201
True


In [19]:
# Uncomment to debug errors
# r.text

## Upload the Document chunks and their vectors to the Vector-Based Index

The following code iterates over each chunk of each book and uses the Azure Search REST API upload method to insert each document, with its corresponding vector (computed with the OpenAI embedding model), into the index.

In [20]:
%%time
if FIRST_EXECUTION:
    for bookname,bookmap in book_pages_map.items():
        print("Uploading chunks from",bookname)
        for page in tqdm(bookmap):
            try:
                page_num = page[0] + 1
                content = page[2]
                book_url = BASE_CONTAINER_URL + bookname
                upload_payload = {
                    "value": [
                        {
                            "id": text_to_base64(bookname + str(page_num)),
                            "title": f"{bookname}_page_{str(page_num)}",
                            "chunk": content,
                            "chunkVector": embedder.embed_query(content if content!="" else "-------"),
                            "name": bookname,
                            "location": book_url,
                            "page_num": page_num,
                            "@search.action": "upload"
                        },
                    ]
                }

                r = requests.post(os.environ['AZURE_SEARCH_ENDPOINT'] + "/indexes/" + book_index_name + "/docs/index",
                                     data=json.dumps(upload_payload), headers=headers, params=params)
                if r.status_code != 200:
                    print(r.status_code)
                    print(r.text)
            except Exception as e:
                print("Exception:",e)
                print(content)
                continue
            
else:
    print("Command skipped since it's not the first execution")

Uploading chunks from Azure_Cognitive_Search_Documentation.pdf


100%|██████████| 1947/1947 [07:14<00:00,  4.48it/s]


Uploading chunks from Boundaries_When_to_Say_Yes_How_to_Say_No_to_Take_Control_of_Your_Life.pdf


 74%|███████▎  | 263/357 [00:58<00:20,  4.53it/s]Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 4.0 seconds as it raised RateLimitError: Requests to the Get a vector representation of a given input that can be easily consumed by machine learning models and algorithms. Operation under Azure OpenAI API version 2023-05-15 have exceeded call rate limit of your current OpenAI S0 pricing tier. Please retry after 1 second. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit..
100%|██████████| 357/357 [01:24<00:00,  4.24it/s]


Uploading chunks from Fundamentals_of_Physics_Textbook.pdf


  3%|▎         | 39/1450 [00:09<05:48,  4.05it/s]Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 4.0 seconds as it raised RateLimitError: Requests to the Get a vector representation of a given input that can be easily consumed by machine learning models and algorithms. Operation under Azure OpenAI API version 2023-05-15 have exceeded call rate limit of your current OpenAI S0 pricing tier. Please retry after 1 second. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit..

Uploading chunks from Made_To_Stick.pdf


 54%|█████▍    | 122/225 [00:28<00:23,  4.35it/s]Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 4.0 seconds as it raised RateLimitError: Requests to the Get a vector representation of a given input that can be easily consumed by machine learning models and algorithms. Operation under Azure OpenAI API version 2023-05-15 have exceeded call rate limit of your current OpenAI S0 pricing tier. Please retry after 1 second. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit..

Uploading chunks from Pere_Riche_Pere_Pauvre.pdf


 74%|███████▍  | 167/225 [00:40<00:13,  4.33it/s]Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 4.0 seconds as it raised RateLimitError: Requests to the Get a vector representation of a given input that can be easily consumed by machine learning models and algorithms. Operation under Azure OpenAI API version 2023-05-15 have exceeded call rate limit of your current OpenAI S0 pricing tier. Please retry after 4 seconds. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit..

CPU times: user 3min 8s, sys: 4.18 s, total: 3min 12s
Wall time: 25min 4s
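
The loop above sends one HTTP request per page, about 4,200 requests in this run. The Azure Search indexing endpoint accepts several documents in the `value` array of a single request (the service documents a limit of 1,000 documents per batch), so grouping pages should reduce request overhead. The embedding calls, which triggered the rate-limit retries in the log above, would still be made per page. A minimal sketch of the grouping step, with hypothetical helper names and a batch size of 100 chosen arbitrarily:

```python
import json
from typing import Dict, Iterable, Iterator, List


def batched(items: Iterable[Dict], size: int) -> Iterator[List[Dict]]:
    """Yield consecutive lists of at most `size` items."""
    batch: List[Dict] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def build_upload_payloads(documents: Iterable[Dict], size: int = 100) -> List[str]:
    """Serialize documents into JSON request bodies, one body per batch."""
    payloads = []
    for batch in batched(documents, size):
        actions = [{**doc, "@search.action": "upload"} for doc in batch]
        payloads.append(json.dumps({"value": actions}))
    return payloads


# Hypothetical documents; real ones would also carry chunk, chunkVector, etc.
docs = [{"id": str(n), "page_num": n} for n in range(1, 251)]
bodies = build_upload_payloads(docs, size=100)
print(len(bodies), [len(json.loads(body)["value"]) for body in bodies])
```

Each body could then be posted to the same `/docs/index` endpoint used in the loop above.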





## Query the Index

In [21]:
# QUESTION = "what normally rich dad do that is different from poor dad?"
# QUESTION = "Tell me a summary of the book Boundaries"
# QUESTION = "Dime que significa la radiacion del cuerpo negro"
# QUESTION = "What is the acronym of the main point of Made to Stick book"
QUESTION = "Tell me a python example of how do I push documents with vectors to an index using the python SDK?"
# QUESTION = "Who won the soccer worldcup in 1994?" # this question should have no answer

In [22]:
vector_indexes = [book_index_name]

ordered_results = get_search_results(
    QUESTION, vector_indexes, 
    k=10,
    reranker_threshold=1,
    vector_search=True, 
    similarity_k=10,
    query_vector = embedder.embed_query(QUESTION)
)

**Note**: We are picking a larger `k=10` because these chunks are not 5000 characters each as in prior notebooks; instead, each page is a chunk.
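
As the summary of this notebook notes, retrieval here is hybrid (text + vector search). Azure's documentation describes merging the text ranking and the vector ranking with Reciprocal Rank Fusion (RRF). The sketch below illustrates the RRF idea only and is not the service's implementation. The constant `k=60` is the value commonly used in descriptions of RRF and is unrelated to the `k=10` above; the page ids are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked id lists: each list adds 1 / (k + rank) to an id's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a deterministic order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical page ids ranked by keyword search and by vector search.
text_ranking = ["page_12", "page_40", "page_7"]
vector_ranking = ["page_40", "page_99", "page_12"]
fused = reciprocal_rank_fusion([text_ranking, vector_ranking])
print(fused)
```

Pages that appear near the top of both rankings end up first, which is why a page found by both keyword and vector search tends to outrank a page found by only one of them.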

In [23]:
COMPLETION_TOKENS = 1000
llm = AzureChatOpenAI(deployment_name=MODEL, temperature=0.5, max_tokens=COMPLETION_TOKENS)

In [24]:
top_docs = []
for key,value in ordered_results.items():
    location = value["location"] if value["location"] is not None else ""
    top_docs.append(Document(page_content=value["chunk"], metadata={"source": location+os.environ['BLOB_SAS_TOKEN']}))
        
print("Number of chunks:", len(top_docs))

Number of chunks: 10


In [25]:
# Calculate number of tokens of our docs
if(len(top_docs)>0):
    tokens_limit = model_tokens_limit(MODEL) # this is a custom function we created in common/utils.py
    prompt_tokens = num_tokens_from_string(COMBINE_PROMPT_TEMPLATE) # this is a custom function we created in common/utils.py
    context_tokens = num_tokens_from_docs(top_docs) # this is a custom function we created in common/utils.py
    
    requested_tokens = prompt_tokens + context_tokens + COMPLETION_TOKENS
    
    chain_type = "map_reduce" if requested_tokens > 0.9 * tokens_limit else "stuff"  
    
    print("System prompt token count:",prompt_tokens)
    print("Max Completion Token count:", COMPLETION_TOKENS)
    print("Combined docs (context) token count:",context_tokens)
    print("--------")
    print("Requested token count:",requested_tokens)
    print("Token limit for", MODEL, ":", tokens_limit)
    print("Chain Type selected:", chain_type)
        
else:
    print("NO RESULTS FROM AZURE SEARCH")

System prompt token count: 1669
Max Completion Token count: 1000
Combined docs (context) token count: 3595
--------
Requested token count: 6264
Token limit for gpt-35-turbo-16k : 16384
Chain Type selected: stuff
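
The selection rule in the cell above can be isolated into a small function, which makes the arithmetic easy to check. The function name `choose_chain_type` is hypothetical; the 0.9 safety margin and the token counts are taken from the cell and its output.

```python
def choose_chain_type(prompt_tokens: int, context_tokens: int,
                      completion_tokens: int, tokens_limit: int,
                      margin: float = 0.9) -> str:
    """Use "stuff" when everything fits in one call, otherwise "map_reduce"."""
    requested = prompt_tokens + context_tokens + completion_tokens
    return "map_reduce" if requested > margin * tokens_limit else "stuff"


# Numbers from the output above: 1669 + 3595 + 1000 = 6264 requested tokens,
# compared against 0.9 * 16384 = 14745.6.
fits = choose_chain_type(1669, 3595, 1000, 16384)
# The same request against a hypothetical 4096-token limit (threshold 3686.4).
too_large = choose_chain_type(1669, 3595, 1000, 4096)
print(fits, too_large)
```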


In [26]:
if chain_type == "stuff":
    chain = load_qa_with_sources_chain(llm, chain_type=chain_type, 
                                       prompt=COMBINE_PROMPT)
elif chain_type == "map_reduce":
    chain = load_qa_with_sources_chain(llm, chain_type=chain_type, 
                                       question_prompt=COMBINE_QUESTION_PROMPT,
                                       combine_prompt=COMBINE_PROMPT,
                                       return_intermediate_steps=True)

In [27]:
%%time
# Try with other language as well
response = chain({"input_documents": top_docs, "question": QUESTION, "language": "English"})

CPU times: user 3.85 ms, sys: 3.98 ms, total: 7.83 ms
Wall time: 8.6 s


In [28]:
display(Markdown(response['output_text']))

To push documents with vectors to an index using the Python SDK, you can use the following code example:

```Python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Set the endpoint, index name, and API key
endpoint = "https://your-search-service.search.windows.net"
index_name = "your-index-name"
api_key = "your-api-key"

# Create an instance of the SearchClient
search_client = SearchClient(endpoint=endpoint, index_name=index_name, credential=AzureKeyCredential(api_key))

# Define your documents with vectors
documents = [
    {
        "@search.action": "upload",
        "id": "1",
        "title": "Document 1",
        "vector": [0.1, 0.2, 0.3]
    },
    {
        "@search.action": "upload",
        "id": "2",
        "title": "Document 2",
        "vector": [0.4, 0.5, 0.6]
    }
]

# Upload the documents to the index
result = search_client.upload_documents(documents=documents)
print("Upload of new documents succeeded: {}".format(result[0].succeeded))
```

This code example uses the Azure SDK for Python to create a `SearchClient` instance with the endpoint, index name, and API key. It then defines the documents with vectors using the `@search.action` property to specify the action as "upload". Finally, it calls the `upload_documents` method of the `SearchClient` to push the documents to the index.

[1]<sup><a href="https://demodatasetsp.blob.core.windows.net/books/Azure_Cognitive_Search_Documentation.pdf?sv=2023-01-03&ss=btqf&srt=sco&st=2023-11-25T09%3A09%3A24Z&se=2030-11-26T09%3A09%3A00Z&sp=rl&sig=1zCNOg4UIcVHew2GngrqYs%2FyF1Nq%2BnvD5nPf6Ka3k%2B0%3D">Source</a></sup>

Let me know if there's anything else I can assist you with.

# Summary

In this notebook we learned how to deal with complex and large documents and make them available for Q&A using [Hybrid Search](https://learn.microsoft.com/en-us/azure/search/search-get-started-vector#hybrid-search) (text + vector search).

We also learned the power of the Azure Document Intelligence API and why it is recommended for production scenarios where manual document parsing (instead of Azure Search Indexer document cracking) is necessary.

Using Azure Cognitive Search with its Vector capabilities and hybrid search features eliminates the need for other vector databases such as Weaviate, Qdrant, Milvus, Pinecone, and so on.


# NEXT
So far we have learned how to use OpenAI vectors and completion APIs to get an excellent answer from our documents stored in Azure Cognitive Search. This is the backbone of a GPT Smart Search Engine.

However, we are missing something: **How to have a conversation with this engine?**

In the next notebook, we are going to explore the concept of **memory**. This is necessary in order to have a chatbot that can hold a conversation with the user. Without memory, there is no real conversation.