# How to deal with complex/large Documents

In the previous notebook, we developed a solution for the various file types and data formats commonly found in organizations, which covers the vast majority of use cases. However, you will run into issues with questions whose answers live in complex files. The complexity of these files arises from their length and from the way information is distributed within them. Large documents are always a challenge for search engines.

One example of such complex files is Technical Specification Guides or Product Manuals, which can span hundreds of pages and contain information in the form of images, tables, forms, and more. Books are also complex due to their length and the presence of images or tables.

These files are typically in PDF format. To better handle these PDFs, we need a smarter parsing method that treats each document as a special source and processes them page by page (1 page = 1 chunk). The objective is to obtain more accurate and faster answers from our system. Fortunately, there are usually not many of these types of documents in an organization, allowing us to make exceptions and treat them differently.

If your use case is just PDFs, you can simply use the [PyPDF library](https://pypi.org/project/pypdf/) or the [Azure AI Document Intelligence SDK (formerly Form Recognizer)](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/overview?view=doc-intel-3.0.0), vectorize the content using the OpenAI API, and push it to a vector-based index. This is probably the simplest and fastest way to go. However, if your use case entails connecting to a data lake, SharePoint libraries, or any other document source with thousands of documents of multiple file types that can change dynamically, then you will want to use the Ingestion, Document Cracking, and AI-Enrichment capabilities of the Azure Search engine (Notebooks 1-3) and avoid a lot of painful custom code.


In [1]:
import os
import json
import time
import requests
import random
from collections import OrderedDict
import urllib.request
from tqdm import tqdm
from typing import List

from langchain_openai import AzureOpenAIEmbeddings
from langchain_openai import AzureChatOpenAI
from langchain_core.retrievers import BaseRetriever
from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.messages import HumanMessage
from langchain_core.runnables import ConfigurableField
from langchain_core.output_parsers import StrOutputParser
from operator import itemgetter


from common.utils import parse_pdf, read_pdf_files, text_to_base64
from common.prompts import DOCSEARCH_PROMPT
from common.utils import CustomAzureSearchRetriever


from IPython.display import Markdown, HTML, display  

from dotenv import load_dotenv
load_dotenv("credentials.env")

def printmd(string):
    display(Markdown(string))
    
os.makedirs("data/books/",exist_ok=True)
    

BLOB_CONTAINER_NAME = "usecases"
BASE_CONTAINER_URL = "https://blobstorageffpanhhmq7wy3.blob.core.windows.net/" + BLOB_CONTAINER_NAME + "/"
LOCAL_FOLDER = "./data/usecases/"

os.makedirs(LOCAL_FOLDER,exist_ok=True)


In [2]:
# Set the ENV variables that Langchain needs to connect to Azure OpenAI
os.environ["OPENAI_API_VERSION"] = os.environ["AZURE_OPENAI_API_VERSION"]

In [3]:
batch_size = 75
embedder = AzureOpenAIEmbeddings(deployment=os.environ["EMBEDDING_DEPLOYMENT_NAME"], chunk_size=batch_size, 
                                 max_retries=2, 
                                 retry_min_seconds= 60,
                                 retry_max_seconds= 70)

## 1 - Manual Document Cracking with Push to Vector-based Index

Within our demo storage account, we have a container named `books`, which holds 5 books of different lengths, languages, and complexities. Let's create a `cogsrch-index-books-vector` and load it with the pages of all these books.

We begin by downloading these books to our local machine:

In [8]:
books = ["BigVGAN.pdf", 
         "PL-BERT.pdf",
         "StarGANv2.pdf",
         "StyleTTS2.pdf"
         ]

usecases = [ '2323-EQU-Y-SA-0014_EQU design basis.PDF' , 'TR2258 Digital Factory - Information and Exchange Models.PDF' , 'TR2381 LCI Requirements Master for FPSO.PDF',
'TR1212 SAS Operator Station HMI.PDF' ,     'TR2325 Piping Detail Standard.PDF'       ,                      'TR3032 Field instrumentation.PDF']


import urllib.request
import ssl

Let's download the files to the local `./data/usecases/` folder:

In [None]:
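from urllib.parse import quote

for book in tqdm(usecases):
    # URL-encode the file name (several names contain spaces) and append the SAS token.
    # The full URL is not printed because it would expose the SAS token.
    book_url = BASE_CONTAINER_URL + quote(book) + os.environ['BLOB_SAS_TOKEN']
    # urlretrieve returns a (local_filename, headers) tuple, not a response object
    local_path, _ = urllib.request.urlretrieve(book_url, LOCAL_FOLDER + book)
    print(f"Downloaded {book} to {local_path}")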

### What to use: pyPDF or AI Document Intelligence API (Form Recognizer)?

In `utils.py` there is a **parse_pdf()** function. This utility function can parse local files using the PyPDF library, and it can also parse local or remote (`from_url`) PDF files using Azure AI Document Intelligence (formerly Form Recognizer).

If `form_recognizer=False`, the function parses the PDF using the Python pyPDF library, which does a good job about 75% of the time.<br>

Setting `form_recognizer=True` selects the best (and slowest) parsing method, the AI Document Intelligence API (formerly known as Form Recognizer). You can specify the prebuilt model to use; the default is `model="prebuilt-document"`. However, if you have a complex document with tables, charts, and figures, you can try
`model="prebuilt-layout"`, which captures all of the nuances of each page (it takes longer, of course).

**Note: Many PDFs are scanned images. For example, any signed contract that was scanned and saved as PDF will NOT be parsed by pyPDF. Only the AI Document Intelligence API will work.**
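
One way to decide when the slower OCR path is worth it is to check how much text pyPDF actually returned. The following is a minimal sketch, not part of `utils.py`: the helper names `pages_needing_ocr` and `should_use_ocr`, the thresholds, and the `(page_index, offset, text)` page shape are all assumptions (the shape mirrors how this notebook reads `page[0]` and `page[2]` from the `parse_pdf()` result).

```python
from typing import List, Tuple

# Assumed page shape: (page_index, offset, text). Only page[0] and page[2]
# are used, matching how the notebook indexes parse_pdf() results.
Page = Tuple[int, int, str]


def pages_needing_ocr(pages: List[Page], min_chars: int = 20) -> List[int]:
    """Return the indexes of pages whose extracted text is (almost) empty."""
    return [page[0] for page in pages if len(page[2].strip()) < min_chars]


def should_use_ocr(pages: List[Page], max_empty_ratio: float = 0.5) -> bool:
    """Heuristic: fall back to OCR when most pages came back without text."""
    if not pages:
        # Nothing was extracted at all, so OCR is the only option left.
        return True
    empty_pages = pages_needing_ocr(pages)
    return len(empty_pages) / len(pages) > max_empty_ratio


# Toy data: a scanned document (mostly empty pages) and a digital one.
scanned = [(0, 0, ""), (1, 0, "  \n"), (2, 0, "Signed by both parties on this date.")]
digital = [
    (0, 0, "Chapter 1: Introduction to the design basis."),
    (1, 45, "Section 1.1 covers the scope of work."),
]

print(should_use_ocr(scanned), should_use_ocr(digital))
```

A document flagged this way could then be re-parsed with `form_recognizer=True`, so that only the files that need OCR pay its cost.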

In [None]:
import time
import concurrent.futures
from pathlib import Path

# Initialize a dictionary to store the results
book_pages_map = dict()

# Function to process each book
def process_book(book):
    try:
        print(f"Extracting Text from {book}...")

        # Capture the start time for the individual book
        start_time = time.time()

        # Create the full path to the book
        book_path = Path(LOCAL_FOLDER) / book

        # Parse the PDF (assuming parse_pdf is a custom function)
        book_map = parse_pdf(file=str(book_path), form_recognizer=False, verbose=True)

        # Capture the elapsed time
        elapsed_time = time.time() - start_time

        # Print the time it took to process the book and the number of pages
        print(f"Parsing {book} took: {elapsed_time:.2f} seconds")
        print(f"{book} contained {len(book_map)} pages\n")

        # Return the book name and the parsed result (book_map)
        return book, book_map

    except Exception as e:
        print(f"Error processing {book}: {e}")
        return book, None

# Capture the start time for the entire process
overall_start_time = time.time()

# Process the books in parallel using ThreadPoolExecutor
with concurrent.futures.ThreadPoolExecutor() as executor:
    # Map results of the parallel processing to 'results'
    results = executor.map(process_book, usecases)

# Store the results in the book_pages_map dictionary (only store successfully processed books)
book_pages_map = {book: book_map for book, book_map in results if book_map}

# Capture the end time for the entire process
overall_end_time = time.time()

# Calculate the total elapsed time for all books
total_elapsed_time = overall_end_time - overall_start_time
print(f"Total time for parsing all books: {total_elapsed_time:.2f} seconds")


Now let's check a random page of each book to make sure the parsing was done correctly:

In [None]:
import random

# Loop over each book and its corresponding map
for bookname, bookmap in book_pages_map.items():
    try:
        # Ensure the random index is within the bounds of the bookmap (number of pages)
        if len(bookmap) > 5:  # randint(5, ...) below needs at least 6 pages
            random_page_index = random.randint(5, min(50, len(bookmap)-1))  # Ensure index doesn't exceed the available pages
            
            # Get the content of the randomly selected page
            page_content = bookmap[random_page_index]

            # Check if the page_content has at least 3 elements to safely access [2]
            if len(page_content) > 2 and isinstance(page_content[2], str):  # Ensure there's a valid chunk at index 2
                chunk_text = page_content[2][:120]  # Get the first 120 characters
                print(f"{bookname}\nChunk text (from page {random_page_index}): {chunk_text}...\n")
            else:
                print(f"{bookname}\nPage {random_page_index} does not have enough content or a valid chunk at index [2].\n")
        else:
            print(f"{bookname} does not have enough pages to select a random chunk.\n")
    
    except IndexError as e:
        print(f"IndexError for {bookname}: {e}")
    except Exception as e:
        print(f"Error processing {bookname}: {e}")


As we can see above, all books were parsed except `Pere_Riche_Pere_Pauvre.pdf` (this book is "Rich Dad, Poor Dad" written in French). Why? As mentioned above, this book was scanned, so each page is an image, set in a very unusual font. Extracting the content of this PDF requires a parser with good OCR capabilities.
Let's try to parse this book again, this time using the Azure Document Intelligence API (formerly Form Recognizer):

In [None]:
%%time
book = "2323-EQU-Y-SA-0014_EQU design basis.PDF"  # same name (and extension case) as in the usecases list
book_path = LOCAL_FOLDER+book
book_map = parse_pdf(file=book_path, form_recognizer=True, model="prebuilt-document",from_url=False, verbose=True)
book_pages_map[book]= book_map

In [None]:
if book in book_pages_map:
    book_map = book_pages_map[book]
    if len(book_map) > 1:
        random_page = random.randint(1, min(50, len(book_map)-1))  # Ensure index doesn't exceed the available pages
        if len(book_map[random_page]) > 2 and isinstance(book_map[random_page][2], str):  # Ensure there's a valid chunk at index [2]
            print(book, "\n", "chunk text:", book_map[random_page][2][:80], "...\n")
        else:
            print(f"{book} does not have enough content or a valid chunk at index [2].")
    else:
        print(f"{book} does not have enough pages to select a random chunk.")
else:
    print(f"{book} is not defined in book_pages_map.")


As demonstrated above, Azure Document Intelligence proves to be superior to pyPDF. **For production scenarios, we strongly recommend using Azure Document Intelligence consistently**. When doing so, it's important to make a wise choice between the available models, such as "prebuilt-document," "prebuilt-layout," or others. You can find more information on model selection [HERE](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/choose-model-feature?view=doc-intel-3.0.0).


## Create Vector-based index


Now that we have the content of the book's chunks (each page of each book) in the dictionary `book_pages_map`, let's create the Vector index in our Azure Search Engine where this content is going to land

In [4]:

usecase_index_name = "srch-index-usecases"
books_index_name = "srch-index-books"

In [5]:
### Create Azure Search Vector-based Index
# Setup the Payloads header
headers = {'Content-Type': 'application/json','api-key': os.environ['AZURE_SEARCH_KEY']}
params = {'api-version': os.environ['AZURE_SEARCH_API_VERSION']}


Please note the following points regarding the index:

- The ParentKey field is absent.
- The page_num field is present.

The absence of the ParentKey field is due to the utilization of a PUSH method, rather than a PULL method. This approach indicates that we are not leveraging the integrated indexing provided by the Azure AI Search engine. Instead, we are engaging in the process of parsing, performing OCR, and manually creating and pushing the content along with its vectors.

This manual parsing process uses either the pyPDF library or the Azure Document Intelligence API. Both allow the content to be segmented by page rather than by a specified number of characters, which is the method employed by the Azure AI Search indexer. Consequently, we can include page_num as a field in our index.
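
To illustrate the page-level segmentation described above, here is a minimal sketch of turning a list of page texts into one record per page. The function name `pages_to_chunks` is hypothetical, and skipping blank pages is a design choice made for this example; only the field names `name`, `page_num`, and `chunk` come from the index used in this notebook.

```python
from typing import Dict, List, Union


def pages_to_chunks(doc_name: str, page_texts: List[str]) -> List[Dict[str, Union[str, int]]]:
    """Build one chunk record per non-empty page (1 page = 1 chunk)."""
    chunks = []
    # Pages are numbered from 1 so that page_num matches what a reader sees.
    for page_num, text in enumerate(page_texts, start=1):
        cleaned = text.strip()
        if not cleaned:
            continue  # blank pages add nothing to the index
        chunks.append({"name": doc_name, "page_num": page_num, "chunk": cleaned})
    return chunks


chunks = pages_to_chunks("Manual.pdf", ["Intro text", "   ", "Technical specs"])
for chunk in chunks:
    print(chunk["page_num"], chunk["chunk"])
```

Because the page number travels with each chunk, an answer can cite the exact page it came from, which a character-based splitter cannot do directly.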

REST API version 2023-10-01-Preview supports external and internal vectorization. This Notebook assumes an external vectorization strategy. This API also supports:
    
- vectorSearch algorithms, hnsw and exhaustiveKnn nearest neighbors, with parameters for indexing and scoring.
- vectorProfiles for multiple combinations of algorithm configurations.

Vector search algorithms include **exhaustive k-nearest neighbors (KNN)** and **Hierarchical Navigable Small World (HNSW)**. Exhaustive KNN performs a brute-force search that scans the entire vector space. HNSW performs an approximate nearest neighbor (ANN) search. While KNN provides exact nearest neighbor search results with high accuracy, its computational cost and poor scalability make it impractical for large datasets or real-time applications. HNSW, on the other hand, offers a highly efficient and scalable solution for nearest neighbor searches by finding approximate nearest neighbors quickly, making it more suitable for large-scale and high-dimensional data applications.


Check [HERE](https://learn.microsoft.com/en-us/azure/search/vector-search-how-to-create-index?tabs=config-2023-10-01-Preview%2Crest-2023-11-01%2Cpush%2Cportal-check-index) for the details of the vector configuration.
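
To make the exhaustive KNN description above concrete, the following toy sketch scores a query against every stored vector using cosine similarity and keeps the top `k`. It is an illustration in pure Python, not how Azure AI Search implements the algorithm; the names `cosine_similarity`, `exhaustive_knn`, and `toy_index` are made up for this example.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def exhaustive_knn(query: List[float], vectors: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Brute force: compare the query with every vector, then keep the k best."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


toy_index = {
    "page_1": [1.0, 0.0],
    "page_2": [0.7, 0.7],
    "page_3": [0.0, 1.0],
}
top = exhaustive_knn([1.0, 0.1], toy_index, k=2)
print(top)
```

Every query touches every vector, so the cost grows linearly with the size of the index. HNSW trades a small amount of recall to avoid that full scan.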

In [None]:
index_payload = {
    "name": usecase_index_name,  # defined in the cell above; `book_index_name` was never defined
    "vectorSearch": {
        "algorithms": [  # We are showing here 3 types of search algorithms configurations that you can do
             {
                 "name": "my-hnsw-config-1",
                 "kind": "hnsw",
                 "hnswParameters": {
                     "m": 4,
                     "efConstruction": 400,
                     "efSearch": 500,
                     "metric": "cosine"
                 }
             },
             {
                 "name": "my-hnsw-config-2",
                 "kind": "hnsw",
                 "hnswParameters": {
                     "m": 8,
                     "efConstruction": 800,
                     "efSearch": 800,
                     "metric": "cosine"
                 }
             },
             {
                 "name": "my-eknn-config",
                 "kind": "exhaustiveKnn",
                 "exhaustiveKnnParameters": {
                     "metric": "cosine"
                 }
             }
        ],
        "vectorizers": [
            {
                "name": "openai",
                "kind": "azureOpenAI",
                "azureOpenAIParameters":
                {
                    "resourceUri" : os.environ['AZURE_OPENAI_ENDPOINT'],
                    "apiKey" : os.environ['AZURE_OPENAI_API_KEY'],
                    "deploymentId" : os.environ['EMBEDDING_DEPLOYMENT_NAME'],
                    "modelName" : os.environ['EMBEDDING_DEPLOYMENT_NAME']
                }
            }
        ],
        "profiles": [  # profiles are the different combinations of algorithms and vectorizers
            {
             "name": "my-vector-profile-1",
             "algorithm": "my-hnsw-config-1",
             "vectorizer":"openai"
            },
            {
             "name": "my-vector-profile-2",
             "algorithm": "my-hnsw-config-2",
             "vectorizer":"openai"
            },
            {
             "name": "my-vector-profile-3",
             "algorithm": "my-eknn-config",
             "vectorizer":"openai"
            }
        ]
    },
    "semantic": {
        "configurations": [
            {
                "name": "my-semantic-config",
                "prioritizedFields": {
                    "titleField": {
                        "fieldName": "title"
                    },
                    "prioritizedContentFields": [
                        {
                            "fieldName": "chunk"
                        }
                    ],
                    "prioritizedKeywordsFields": []
                }
            }
        ]
    },
    "fields": [
        {"name": "id", "type": "Edm.String", "key": "true", "filterable": "true" },
        {"name": "title","type": "Edm.String","searchable": "true","retrievable": "true"},
        {"name": "chunk","type": "Edm.String","searchable": "true","retrievable": "true"},
        {"name": "name", "type": "Edm.String", "searchable": "true", "retrievable": "true", "sortable": "false", "filterable": "false", "facetable": "false"},
        {"name": "location", "type": "Edm.String", "searchable": "false", "retrievable": "true", "sortable": "false", "filterable": "false", "facetable": "false"},
        {"name": "page_num","type": "Edm.Int32","searchable": "false","retrievable": "true"},
        {
            "name": "chunkVector",
            "type": "Collection(Edm.Single)",
            "dimensions": 3072,
            "vectorSearchProfile": "my-vector-profile-3", # we picked profile 3 to show that this index uses eKNN vs HNSW (on prior notebooks)
            "searchable": "true",
            "retrievable": "true",
            "filterable": "false",
            "sortable": "false",
            "facetable": "false"
        }
        
    ],
}

r = requests.put(os.environ['AZURE_SEARCH_ENDPOINT'] + "/indexes/" + usecase_index_name,
                 data=json.dumps(index_payload), headers=headers, params=params)
print(r.status_code)
print(r.ok)

In [None]:
# Uncomment to debug errors
# r.text

## Upload the Document chunks and their vectors to the Index

The following code will iterate over each chunk of each book and use the Azure Search Rest API upload method to insert each document with its corresponding vector (using OpenAI embedding model) to the index.
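
Two small pieces of this upload step can be sketched in isolation: building a document key and slicing pages into batches. To our knowledge, Azure AI Search keys may only contain letters, digits, underscores, dashes, and equal signs, so URL-safe base64 is a reasonable encoding. We have not checked which base64 variant `text_to_base64` in `utils.py` uses, so `make_document_key` and `batched` below are hypothetical stand-ins, not the notebook's helpers.

```python
import base64
from typing import Iterator, List, Sequence


def make_document_key(bookname: str, page_num: int) -> str:
    """Deterministic key built from the file name and page number.

    URL-safe base64 only emits A-Z, a-z, 0-9, '-', '_' and '=' padding.
    """
    raw = f"{bookname}{page_num}".encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def batched(items: Sequence, size: int) -> Iterator[List]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


key = make_document_key("TR2325 Piping Detail Standard.PDF", 12)
print(key)
print(list(batched(list(range(5)), 2)))
```

Because the key is derived only from the file name and page number, re-running the upload overwrites the same documents instead of creating duplicates.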

In [94]:
# Function to process a batch of pages, retrying a limited number of times on errors
def process_batch(bookname, pages, retries_left=3):
    try:
        contents = [page[2] for page in pages]
        chunk_vectors = embedder.embed_documents(contents)
        
        upload_payload = {"value": []}
        for i, page in enumerate(pages):
            page_num = page[0] + 1
            content = page[2]
            book_url = BASE_CONTAINER_URL + bookname
            
            payload = {
                "@search.action": "upload",
                "id": text_to_base64(bookname + str(page_num)),
                "title": f"{bookname}_page_{str(page_num)}",
                "chunk": content,
                "chunkVector": chunk_vectors[i],
                "name": bookname,
                "location": book_url,
                "page_num": page_num
            }
            upload_payload["value"].append(payload)
        
        # `usecase_index_name` is defined above; `book_index_name` was never defined
        r = requests.post(os.environ['AZURE_SEARCH_ENDPOINT'] + "/indexes/" + usecase_index_name + "/docs/index",
                          data=json.dumps(upload_payload), headers=headers, params=params)
        if r.status_code != 200:
            print(f"Failed to upload batch of pages from {bookname}: {r.status_code}")
            print(r.text)
    except Exception as e:
        print(f"Exception processing batch of pages from {bookname}: {e}")
        if retries_left > 0:
            time.sleep(10)  # Wait before retrying
            process_batch(bookname, pages, retries_left - 1)  # Retry the same batch
        else:
            print(f"Giving up on this batch of pages from {bookname}.")

In [None]:
%%time
for bookname, bookmap in book_pages_map.items():
        print("Uploading chunks from", bookname)
        # Split bookmap into batches of size batch_size
        for i in tqdm(range(0, len(bookmap), batch_size)):
            batch = bookmap[i:i + batch_size]
            process_batch(bookname, batch)

## Query the Index

In [7]:
QUESTION = "Considerations in design for future tie-in(s) and tie-back(s). whats the reservoir pressure in the cambroiol central and Cappahayden West block?"
QUESTION1 = "Ablation study for verifying the effectiveness of MLM, P2G, and BERT compared to StyleTTS w/ PL-BERT."
QUESTION2 = "How did BigVGAN perform in terms of PESQ score vs WaveGlow-256?"
QUESTION3 = "what is the acronym of the main point of Made to Stick book"
QUESTION4= "Tell me a python example of how do I push documents with vectors to an index using the python SDK?"
QUESTION5 = "who won the soccer worldcup in 1994?" # this question should have no answer


In [8]:

usecases_index_name = "srch-index-usecases"
books_index_name = "srch-index-books"

indexes = [usecases_index_name]  # the documents above were uploaded to this index
k=20 # in this index k corresponds to the top pages as well


In [9]:
retriever = CustomAzureSearchRetriever(indexes=indexes, topK=k, reranker_threshold=1)

In [10]:
COMPLETION_TOKENS = 2500
llm = AzureChatOpenAI(deployment_name=os.environ["GPT4_DEPLOYMENT_NAME"], temperature=0.5, max_tokens=COMPLETION_TOKENS).configurable_alternatives(
    ConfigurableField(id="model"),
    default_key="gpt35",
    gpt4=AzureChatOpenAI(deployment_name=os.environ["GPT4_DEPLOYMENT_NAME"], temperature=0, max_tokens=COMPLETION_TOKENS),
)

In `utils.py` we created the **CustomAzureSearchRetriever** class that we will use going forward

In [11]:
chain = (
    {
        "context": itemgetter("question") | retriever, # Passes the question to the retriever and the results are assigned to context
        "question": itemgetter("question")
    }
    | DOCSEARCH_PROMPT  # Passes the variables above to the prompt template
    | llm   # Passes the finished prompt to the LLM
    | StrOutputParser()  # converts the output (Runnable object) to the desired output (string)
)

#### With GPT 4

In [None]:
import os
import re

# Retrieve the SAS token from the environment (or define it directly)
sas_token = os.environ['BLOB_SAS_TOKEN']  # Assuming the token starts with '?'

# Regex pattern to match .pdf/.PDF links (with trailing characters like ) and spaces)
pdf_pattern = re.compile(r'(https?://[^\s\)]+\.pdf)', re.IGNORECASE)

# Set to track processed URLs
processed_urls = set()

# Initialize a buffer to store chunks that may contain split URLs
buffer = ''

# Process each chunk in the stream
for chunk in chain.with_config(configurable={"model": "gpt4"}).stream(
    {"question": QUESTION, "language": "English"}):
    
    # Append the new chunk to the buffer
    buffer += chunk

    # Use regex to find all occurrences of .pdf/.PDF links in the buffer
    matches = pdf_pattern.findall(buffer)

    #if matches:
        #print(f"Matches found: {matches}")  # Debug: Print found matches

    for pdf_url in matches:
        # Check if this URL has already been processed
        if pdf_url not in processed_urls:
            # Append the SAS token to the URL
            pdf_url_with_token = pdf_url  + sas_token

            # Replace the URL in the buffer with the one that includes the SAS token
            buffer = buffer.replace(pdf_url, pdf_url_with_token)

            # Mark this URL as processed
            processed_urls.add(pdf_url)

            # Debug output to see what's happening
            #print(f"\nFound PDF URL: {pdf_url}\nReplaced with: {pdf_url_with_token}\n")


# Output the processed buffer
print(buffer)
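
The loop above rescans the whole buffer on every streamed chunk. If streaming output is not required, the same link rewriting can be applied once to the finished answer. The following is a minimal sketch under that assumption: `add_sas_token` and `PDF_LINK` are hypothetical names, the SAS token is assumed to start with `?`, and links are assumed to contain no spaces or closing parentheses.

```python
import re

# Same idea as pdf_pattern above, plus a lookahead that skips links which
# already carry a query string (so the token is never appended twice).
PDF_LINK = re.compile(r"https?://[^\s)]+\.pdf(?!\?)", re.IGNORECASE)


def add_sas_token(text: str, sas_token: str) -> str:
    """Append the SAS token to every bare .pdf/.PDF link in the text."""
    # A function replacement avoids backslash escapes in the token being interpreted.
    return PDF_LINK.sub(lambda match: match.group(0) + sas_token, text)


answer = (
    "See [page 3](https://acct.blob.core.windows.net/books/Guide.PDF) and "
    "https://acct.blob.core.windows.net/books/a.pdf?sv=old."
)
signed = add_sas_token(answer, "?sv=abc")
print(signed)
```

Running the function again on its own output leaves the text unchanged, which makes it safe to call from more than one place.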


# Summary

In this notebook we learned how to deal with complex and large Documents and make them available for Q&A over them using [Hybrid Search](https://learn.microsoft.com/en-us/azure/search/search-get-started-vector#hybrid-search) (text + vector search).

We also learned the power of the Azure Document Intelligence API and why it is recommended for production scenarios where manual document parsing (instead of Azure Search Indexer Document Cracking) is necessary.

Using Azure AI Search with its Vector capabilities and hybrid search features eliminates the need for other vector databases such as Weaviate, Qdrant, Milvus, Pinecone, and so on.


# NEXT
So far we have learned how to use OpenAI vectors and completion APIs to get an excellent answer from our documents stored in Azure AI Search. This is the backbone of a GPT Smart Search Engine.

However, we are missing something: **How to have a conversation with this engine?**

On the next Notebook, we are going to understand the concept of **memory**. This is necessary in order to have a chatbot that can establish a conversation with the user. Without memory, there is no real conversation.