# How to deal with complex/large Documents

In the previous notebook, we developed a solution for the various file types and data formats commonly found in organizations, which covers about 90% of use cases. However, you will run into problems with questions whose answers live in complex files. The complexity of these files comes from their length and from the way information is distributed within them. Large documents are a persistent challenge for search engines.

One example of such complex files is Technical Specification Guides or Product Manuals, which can span hundreds of pages and contain information in the form of images, tables, forms, and more. Books are also complex due to their length and the presence of images or tables.

These files are typically in PDF format. To handle these PDFs better, we need a smarter parsing method that treats each document as a special source and processes it page by page. The objective is to obtain faster and more accurate answers from our system. Fortunately, an organization usually has few documents of this type, so we can make exceptions and treat them differently.

If your use case involves only PDFs, you can simply use the [PyPDF library](https://pypi.org/project/pypdf/) or the [Azure AI Document Intelligence SDK (formerly Form Recognizer)](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/overview?view=doc-intel-3.0.0), vectorize the content with the OpenAI API, and push it to a vector-based index. This is probably the simplest and fastest way to go. However, if your use case entails connecting to a data lake, SharePoint libraries, or any other document source with thousands of documents of multiple file types that can change dynamically, you will want to use the ingestion, document cracking, and AI enrichment capabilities of the Azure Search engine (Notebooks 1-3) and avoid a lot of painful custom code.


In [12]:
import os
import json
import time
import requests
import random
from collections import OrderedDict
import urllib.request
from tqdm import tqdm
import langchain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma, FAISS
from langchain import OpenAI, VectorDBQA
from langchain.chat_models import AzureChatOpenAI
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.docstore.document import Document
from langchain.chains.question_answering import load_qa_chain
from langchain.chains.qa_with_sources import load_qa_with_sources_chain

from common.prompts_povel import COMBINE_QUESTION_PROMPT, COMBINE_PROMPT, COMBINE_PROMPT_TEMPLATE
from common.utils_povel import (
    parse_pdf,
    parse_pdf_from_blob,
    read_pdf_files,
    text_to_base64,
    DocSearchTool,
    get_search_results,
    model_tokens_limit,
    num_tokens_from_docs,
    num_tokens_from_string)


from IPython.display import Markdown, HTML, display  

from dotenv import load_dotenv
load_dotenv()

def printmd(string):
    display(Markdown(string))
    
os.makedirs("data/books/",exist_ok=True)
    

# BLOB_CONTAINER_NAME = "digge-ekonomi-dokument"
# BASE_CONTAINER_URL = "https://blobstoragejd5ypzfx2l6vi.blob.core.windows.net/digge-ekonomi-dokument/"

# BLOB_CONTAINER_NAME = "digge-100-dokument"
# BASE_CONTAINER_URL = "https://blobstoragejd5ypzfx2l6vi.blob.core.windows.net/digge-100-dokument/"

BLOB_CONTAINER_NAME="test2"
BASE_CONTAINER_URL="https://blobstoragejd5ypzfx2l6vi.blob.core.windows.net/test2/"

MODEL = "gpt-35-turbo-16k" # options: gpt-35-turbo, gpt-35-turbo-16k, gpt-4, gpt-4-32k

In [13]:
# Set the ENV variables that Langchain needs to connect to Azure OpenAI
os.environ["OPENAI_API_BASE"] = os.environ["AZURE_OPENAI_ENDPOINT"]
os.environ["OPENAI_API_KEY"] = os.environ["AZURE_OPENAI_API_KEY"]
os.environ["OPENAI_API_VERSION"] = os.environ["AZURE_OPENAI_API_VERSION"]
os.environ["OPENAI_API_TYPE"] = "azure"

In [14]:
embedder = OpenAIEmbeddings(deployment="text-embedding-ada-002", chunk_size=1)

### What to use: pyPDF or AI Document Intelligence API (Form Recognizer)?

In `utils.py` there is a **parse_pdf()** function. It can parse local files using the PyPDF library, and it can also parse local files or PDFs referenced by URL using Azure AI Document Intelligence (formerly Form Recognizer).

If `form_recognizer=False`, the function parses the PDF with the Python pyPDF library, which does a good job about 75% of the time.<br>

Setting `form_recognizer=True` selects the better (and slower) parsing method, which uses the AI Document Intelligence API (formerly known as Form Recognizer). You can specify the prebuilt model to use; the default is `model="prebuilt-document"`. However, if you have a complex document with tables, charts, and figures, you can try
`model="prebuilt-layout"`, which captures all of the nuances of each page (and takes longer, of course).

**Note: Many PDFs are scanned images. For example, a signed contract that was scanned and saved as a PDF will NOT be parsed by pyPDF. Only the AI Document Intelligence API will work.**
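One way to act on this note is to check how much text the cheap parser returned and fall back to OCR-based parsing when most pages come back empty. The sketch below is a heuristic only, not part of `parse_pdf()`. The function name `looks_scanned` and both thresholds are hypothetical and would need tuning on real documents.

```python
from typing import Sequence


def looks_scanned(
    page_texts: Sequence[str],
    min_chars: int = 20,           # assumed threshold: fewer characters counts as "empty"
    max_empty_ratio: float = 0.5,  # assumed threshold: above this share, treat as scanned
) -> bool:
    """Return True when most pages yielded (almost) no extractable text."""
    if not page_texts:
        # No pages extracted at all: treat the file as needing OCR.
        return True
    empty = sum(1 for text in page_texts if len(text.strip()) < min_chars)
    return empty / len(page_texts) > max_empty_ratio


# Text as a plain-text extractor might return it for a digital PDF and a scanned PDF.
digital = [
    "Chapter 1. Accounts and cost centres are described below.",
    "The table of accounts continues on this page.",
]
scanned = ["", "  ", "\n"]

print(looks_scanned(digital))  # False
print(looks_scanned(scanned))  # True
```

A file flagged this way would then be parsed again with `form_recognizer=True`.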

Overall, Azure Document Intelligence is the more capable of the two parsers. **For production scenarios, we strongly recommend using Azure Document Intelligence consistently**. When doing so, it is important to choose wisely among the available models, such as `prebuilt-document`, `prebuilt-layout`, or others. You can find more information on model selection [here](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/choose-model-feature?view=doc-intel-3.0.0).

In [15]:
# from azure.storage.blob import BlobServiceClient

# # Create a blob service client
# blob_service_client = BlobServiceClient.from_connection_string(os.environ["BLOB_CONNECTION_STRING"])

# # Get a reference to the container
# container_client = blob_service_client.get_container_client(BLOB_CONTAINER_NAME)

# # List all blobs in the container
# blob_list = container_client.list_blobs()

# model="prebuilt-layout"
# book_pages_map = dict()

# # Parse every PDF blob in the container
# for document in blob_list:
#     if document.name.endswith(".pdf"):
#         print("Extracting Text from",document.name,"...")

#         # Capture the start time
#         start_time = time.time()

#         # Get a reference to the blob
#         blob_client = container_client.get_blob_client(document.name)
#         # print(document.name)
        
#         # Download the blob to a stream
#         stream = blob_client.download_blob().readall()

#         # with open(stream, "rb") as filename:
#         book_map = parse_pdf_from_blob(file=stream, form_recognizer=True, verbose=True)
#         book_pages_map[document.name]= book_map

#         # Capture the end time and Calculate the elapsed time
#         end_time = time.time()
#         elapsed_time = end_time - start_time

#         print(f"Parsing took: {elapsed_time:.6f} seconds")
#         print(f"{document.name} contained {len(book_map)} pages\n")

Extracting Text from Kontoplan.pdf ...
Extracting text using Azure Document Intelligence
Parsing took: 14.954447 seconds
Kontoplan.pdf contained 95 pages

Extracting Text from Kundnummer i Agresso.pdf ...
Extracting text using Azure Document Intelligence
Parsing took: 5.259512 seconds
Kundnummer i Agresso.pdf contained 1 pages



## Create Vector-based index


Now that we have the content of each book's chunks (one chunk per page) in the dictionary `book_pages_map`, let's create the vector-based index in our Azure Search engine where this content will land.
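To make the shape of `book_pages_map` concrete, here is a minimal sketch of how such a dictionary could be built from plain page texts. The upload loop later in this notebook reads `page[0]` as a zero-based page index and `page[2]` as the page text. The middle element is assumed here to be a running character offset, and `build_page_map` is a hypothetical helper, not the actual parser.

```python
from typing import Dict, List, Tuple

# Assumed layout of one entry: (zero-based page index, character offset, page text)
PageEntry = Tuple[int, int, str]


def build_page_map(page_texts: List[str]) -> List[PageEntry]:
    """Turn a list of page texts into (index, offset, text) tuples."""
    page_map: List[PageEntry] = []
    offset = 0
    for index, text in enumerate(page_texts):
        page_map.append((index, offset, text))
        offset += len(text)  # the next page starts where this one ended
    return page_map


# One dictionary entry per document, keyed by file name.
book_pages_map: Dict[str, List[PageEntry]] = {
    "example.pdf": build_page_map(["First page text.", "Second page text."]),
}

for name, pages in book_pages_map.items():
    for index, offset, text in pages:
        print(name, "page", index + 1, "offset", offset, "->", text)
```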

In [16]:
# # index_name = "1b-digge-ekonomi-full"
# # index_name = "1b-digge-100-full"
# index_name="test"

In [17]:
# # Setup the Payloads header
# headers = {'Content-Type': 'application/json','api-key': os.environ['AZURE_SEARCH_KEY']}
# params = {'api-version': "2023-07-01-Preview"}

In [18]:
# index_payload = {
#     "name": index_name,
#     "fields": [
#         {"name": "id", "type": "Edm.String", "key": "true", "filterable": "true" },
#         {"name": "title","type": "Edm.String","searchable": "true","retrievable": "true"},
#         {"name": "content","type": "Edm.String","searchable": "true","retrievable": "true"},
#         {"name": "filepath", "type": "Edm.String", "searchable": "true", "retrievable": "true", "sortable": "false", "filterable": "false", "facetable": "false"},
#         {"name": "url", "type": "Edm.String", "searchable": "false", "retrievable": "true", "sortable": "false", "filterable": "false", "facetable": "false"},
#         {"name": "page_num","type": "Edm.Int32","searchable": "false","retrievable": "true"},
#         {"name": "contentVector", "type": "Collection(Edm.Single)", "searchable": "true", "retrievable": "true", "dimensions": 1536, "vectorSearchConfiguration": "vectorConfig"}
        
#     ],
#     "vectorSearch": {
#         "algorithmConfigurations": [
#             {
#                 "name": "vectorConfig",
#                 "kind": "hnsw"
#             }
#         ]
#     },
#     "semantic": {
#         "configurations": [
#             {
#                 "name": "default",
#                 "prioritizedFields": {
#                     "titleField": {
#                         "fieldName": "title"
#                     },
#                     "prioritizedContentFields": [
#                         {
#                             "fieldName": "content"
#                         }
#                     ],
#                     "prioritizedKeywordsFields": []
#                 }
#             }
#         ]
#     }
# }

# r = requests.put(os.environ['AZURE_SEARCH_ENDPOINT'] + "/indexes/" + index_name,
#                  data=json.dumps(index_payload), headers=headers, params=params)
# print(r.status_code)
# print(r.ok)

204
True


In [19]:
# Uncomment to debug errors
# r.text

## Upload the Document chunks and their vectors to the Vector-Based Index

The following code iterates over each chunk of each book and uses the upload action of the Azure Search REST API to insert each document, together with its vector (computed with the OpenAI embedding model), into the index.
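The payload for a single page can be sketched in isolation, without calling Azure Search or OpenAI. In the sketch below, `make_key` is a hypothetical stand-in for `text_to_base64` (it is assumed, not confirmed, that the real helper behaves similarly), `fake_embed` replaces the real embedding call, and the URL is a placeholder.

```python
import base64
import json
from typing import Callable, Dict, List, Tuple

# Assumed layout of one entry: (zero-based page index, character offset, page text)
PageEntry = Tuple[int, int, str]


def make_key(text: str) -> str:
    """Hypothetical key helper: URL-safe base64 of the input text."""
    return base64.urlsafe_b64encode(text.encode("utf-8")).decode("ascii")


def build_upload_payload(
    bookname: str,
    page: PageEntry,
    base_url: str,
    embed: Callable[[str], List[float]],
) -> Dict[str, list]:
    """Build the request body that uploads one page as one search document."""
    page_num = page[0] + 1  # stored page numbers are one-based
    content = page[2]
    document = {
        "id": make_key(bookname + str(page_num)),
        "title": f"{bookname}_page_{page_num}",
        "content": content,
        # Empty pages are embedded as a placeholder string, as in the notebook loop.
        "contentVector": embed(content if content != "" else "-------"),
        "filepath": bookname,
        "url": base_url + bookname,
        "page_num": page_num,
        "@search.action": "upload",
    }
    return {"value": [document]}


def fake_embed(text: str) -> List[float]:
    """Stand-in for the embedding model; returns a tiny deterministic vector."""
    return [float(len(text)), 0.0, 1.0]


payload = build_upload_payload(
    "example.pdf", (0, 0, "Hello"), "https://example.invalid/container/", fake_embed
)
print(json.dumps(payload, indent=2))
```

Passing the embedding function in as an argument keeps the payload logic testable offline; in the notebook, `embedder.embed_query` would take the place of `fake_embed`.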

In [20]:
# %%time
# for bookname,bookmap in book_pages_map.items():
#     print("Uploading chunks from",bookname)
#     for page in tqdm(bookmap):
#         try:
#             page_num = page[0] + 1
#             content = page[2]
#             book_url = BASE_CONTAINER_URL + bookname
#             upload_payload = {
#                 "value": [
#                     {
#                         "id": text_to_base64(bookname + str(page_num)),
#                         "title": f"{bookname}_page_{str(page_num)}",
#                         "content": content,
#                         "contentVector": embedder.embed_query(content if content!="" else "-------"),
#                         "filepath": bookname,
#                         "url": book_url,
#                         "page_num": page_num,
#                         "@search.action": "upload"
#                     },
#                 ]
#             }

#             r = requests.post(os.environ['AZURE_SEARCH_ENDPOINT'] + "/indexes/" + index_name + "/docs/index",
#                                  data=json.dumps(upload_payload), headers=headers, params=params)
#             if r.status_code != 200:
#                 print(r.status_code)
#                 print(r.text)
#         except Exception as e:
#             print("Exception:",e)
#             print(content)
#             continue

Uploading chunks from Kontoplan.pdf


100%|██████████| 95/95 [00:14<00:00,  6.58it/s]


Uploading chunks from Kundnummer i Agresso.pdf


100%|██████████| 1/1 [00:00<00:00,  6.00it/s]

CPU times: user 4.37 s, sys: 93.1 ms, total: 4.46 s
Wall time: 14.6 s





## Query the Index

In [None]:
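# The questions are kept in Swedish because the indexed documents are in Swedish.
# English: "Which account should I use to invoice the patient co-payment for assistive devices?"
QUESTION = "Vilket konto ska jag fakturera egenavgift för hjälpmedel?"
# English: "What rights do individuals have under GDPR?"
# QUESTION = "Hur ser rättigheterna ut för enskilda personer gällande GDPR?"
# English: "How do I use Agresso?"
# QUESTION = "Hur använder jag Agresso?"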

In [None]:
from langchain.callbacks.manager import CallbackManager
from common.callbacks import StdOutCallbackHandler

cb_handler = StdOutCallbackHandler()
cb_manager = CallbackManager(handlers=[cb_handler])

llm = AzureChatOpenAI(deployment_name=MODEL, temperature=0.2, max_tokens=800)
vector_only_indexes = ["1b-digge-100-full", "1b-digge-ekonomi-full"]

doc_search = DocSearchTool(llm=llm, vector_only_indexes = vector_only_indexes,
                           k=10, similarity_k=5, reranker_th=1,
                           sas_token=os.environ['BLOB_SAS_TOKEN'],
                           callback_manager=cb_manager, return_direct=True,
                           # This is how you can edit the default values of name and description
                           name="@docsearch",
                           description="useful when the questions includes the term: @docsearch.\n")

In [None]:
printmd(doc_search.run(QUESTION))
