# How to deal with complex/large Documents

In the previous notebook, we developed a solution for the file types and data formats commonly found in organizations, which covers roughly 90% of use cases. However, questions whose answers live in complex files remain problematic. The complexity of these files comes from their length and from the way information is distributed within them. Large documents are always a challenge for search engines.

One example of such complex files is Technical Specification Guides or Product Manuals, which can span hundreds of pages and contain information in the form of images, tables, forms, and more. Books are also complex due to their length and the presence of images or tables.

These files are typically in PDF format. To handle these PDFs better, we need a smarter parsing method that treats each document as a special source and processes it page by page. The objective is to obtain more accurate and faster answers from our system. Fortunately, an organization usually has few documents of this type, which allows us to make exceptions and treat them differently.

If your use case is just PDFs, you can simply use the [PyPDF library](https://pypi.org/project/pypdf/) or the [Azure AI Document Intelligence SDK (formerly Form Recognizer)](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/overview?view=doc-intel-3.0.0), vectorize the text using the OpenAI API, and push the content to a vector-based index. This is probably the simplest and fastest way to go. However, if your use case entails connecting to a data lake, SharePoint libraries, or any other document source with thousands of documents of multiple file types that can change dynamically, then you will want to use the ingestion, document cracking, and AI enrichment capabilities of the Azure Search engine (Notebooks 1-3) and avoid a lot of painful custom code.


In [25]:
import os
import json
import time
import requests
import random
from collections import OrderedDict
import urllib.request
from tqdm import tqdm
import langchain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma, FAISS
from langchain import OpenAI, VectorDBQA
from langchain.chat_models import AzureChatOpenAI
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.docstore.document import Document
from langchain.chains.question_answering import load_qa_chain
from langchain.chains.qa_with_sources import load_qa_with_sources_chain

from common.utils import parse_pdf, parse_pdf_from_blob, read_pdf_files, text_to_base64
from common.prompts import COMBINE_QUESTION_PROMPT, COMBINE_PROMPT, COMBINE_PROMPT_TEMPLATE
from common.utils import (
    get_search_results,
    model_tokens_limit,
    num_tokens_from_docs,
    num_tokens_from_string
)


from IPython.display import Markdown, HTML, display  

from dotenv import load_dotenv
load_dotenv("credentials.env")

def printmd(string):
    display(Markdown(string))
    
LOCAL_FOLDER = "data/books/"  # local folder referenced by the commented-out download/parsing cells
os.makedirs(LOCAL_FOLDER, exist_ok=True)
    

# BLOB_CONTAINER_NAME = "digge-ekonomi-dokument"  # alternative container (was overwritten by the next line)
BLOB_CONTAINER_NAME = "digge-100-dokument"
BASE_CONTAINER_URL = "https://blobstoragejd5ypzfx2l6vi.blob.core.windows.net/digge-100-dokument/"

MODEL = "gpt-35-turbo-16k" # options: gpt-35-turbo, gpt-35-turbo-16k, gpt-4, gpt-4-32k

In [26]:
# Set the ENV variables that Langchain needs to connect to Azure OpenAI
os.environ["OPENAI_API_BASE"] = os.environ["AZURE_OPENAI_ENDPOINT"]
os.environ["OPENAI_API_KEY"] = os.environ["AZURE_OPENAI_API_KEY"]
os.environ["OPENAI_API_VERSION"] = os.environ["AZURE_OPENAI_API_VERSION"]
os.environ["OPENAI_API_TYPE"] = "azure"

In [27]:
embedder = OpenAIEmbeddings(deployment="text-embedding-ada-002", chunk_size=1) 

## 1 - Manual Document Cracking with Push to Vector-based Index

Within our demo storage account, we have a container named `books`, which holds 5 books of different lengths, languages, and complexities. Let's create a `cogsrch-index-books-vector` and load it with the pages of all these books.

We begin by downloading these books to our local machine:

In [28]:
# books = ["Azure_Cognitive_Search_Documentation.pdf", 
#          "Boundaries_When_to_Say_Yes_How_to_Say_No_to_Take_Control_of_Your_Life.pdf",
#          "Fundamentals_of_Physics_Textbook.pdf",
#          "Made_To_Stick.pdf",
#          "Pere_Riche_Pere_Pauvre.pdf"]

Let's download the files to the local `./data/books/` folder:

In [29]:
# for book in tqdm(books):
#     book_url = BASE_CONTAINER_URL + book + os.environ['BLOB_SAS_TOKEN']
#     urllib.request.urlretrieve(book_url, LOCAL_FOLDER+ book)

### What to use: pyPDF or AI Document Intelligence API (Form Recognizer)?

In `utils.py` there is a **parse_pdf()** function. This utility function can parse local files using the PyPDF library, and it can also parse local PDFs or PDFs referenced by URL using Azure AI Document Intelligence (formerly Form Recognizer).

If `form_recognizer=False`, the function parses the PDF using the Python pyPDF library, which does a good job about 75% of the time.

Setting `form_recognizer=True` selects the most accurate (and slower) parsing method, the AI Document Intelligence API (formerly known as Form Recognizer). You can specify the prebuilt model to use; the default is `model="prebuilt-document"`. However, if you have a complex document with tables, charts, and figures, you can try `model="prebuilt-layout"`, which captures all the nuances of each page (it takes longer, of course).

**Note: Many PDFs are scanned images. For example, any signed contract that was scanned and saved as a PDF will NOT be parsed by pyPDF. Only the AI Document Intelligence API will work.**
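One way to choose between the two parsers is to try the cheap one first and fall back to OCR only when it returns almost no text. The sketch below is a hypothetical heuristic and not part of `utils.py`: the name `needs_ocr` and both thresholds are assumptions, and the input is simply the list of per-page strings that a text-layer parser returned.

```python
def needs_ocr(page_texts, min_chars_per_page=20, max_empty_ratio=0.5):
    """Return True when most pages contain (almost) no extractable text.

    page_texts: list of strings (or None), one entry per page, as produced
    by a text-layer parser. Both thresholds are illustrative assumptions.
    """
    if not page_texts:
        # Nothing was extracted at all: treat the file as scanned.
        return True
    empty_pages = sum(
        1 for text in page_texts
        if len((text or "").strip()) < min_chars_per_page
    )
    return empty_pages / len(page_texts) > max_empty_ratio
```

If the function returns `True`, the same file could be parsed again with `form_recognizer=True`.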

In [30]:
# book_pages_map = dict()
# for book in books:
#     print("Extracting Text from",book,"...")
    
#     # Capture the start time
#     start_time = time.time()
    
#     # Parse the PDF
#     book_path = LOCAL_FOLDER+book
#     book_map = parse_pdf(file=book_path, form_recognizer=False, verbose=True)
#     book_pages_map[book]= book_map
    
#     # Capture the end time and Calculate the elapsed time
#     end_time = time.time()
#     elapsed_time = end_time - start_time

#     print(f"Parsing took: {elapsed_time:.6f} seconds")
#     print(f"{book} contained {len(book_map)} pages\n")

In [31]:
from azure.storage.blob import BlobServiceClient

# Create a blob service client
blob_service_client = BlobServiceClient.from_connection_string(os.environ["BLOB_CONNECTION_STRING"])

# Get a reference to the container
container_client = blob_service_client.get_container_client(BLOB_CONTAINER_NAME)

# List all blobs in the container
blob_list = container_client.list_blobs()

model="prebuilt-layout"
book_pages_map = dict()

# Parse every PDF blob in the container
for document in blob_list:
    if document.name.endswith(".pdf"):
        print("Extracting Text from",document.name,"...")

        # Capture the start time
        start_time = time.time()

        # Get a reference to the blob
        blob_client = container_client.get_blob_client(document.name)
        # print(document.name)
        
        # Download the blob to a stream
        stream = blob_client.download_blob().readall()

        # with open(stream, "rb") as filename:
        book_map = parse_pdf_from_blob(file=stream, form_recognizer=True, verbose=True)
        book_pages_map[document.name]= book_map

        # Capture the end time and Calculate the elapsed time
        end_time = time.time()
        elapsed_time = end_time - start_time

        print(f"Parsing took: {elapsed_time:.6f} seconds")
        print(f"{document.name} contained {len(book_map)} pages\n")

Extracting Text from Administratörsmanual ledningssystemet.pdf ...
Extracting text using Azure Document Intelligence
Parsing took: 6.486604 seconds
Administratörsmanual ledningssystemet.pdf contained 25 pages

Extracting Text from Anmälan om brott till polisen(186680).pdf ...
Extracting text using Azure Document Intelligence
Parsing took: 5.249366 seconds
Anmälan om brott till polisen(186680).pdf contained 1 pages

Extracting Text from Anskaffning, utveckling och förändring av informationssystem.pdf ...
Extracting text using Azure Document Intelligence
Parsing took: 5.426353 seconds
Anskaffning, utveckling och förändring av informationssystem.pdf contained 6 pages

Extracting Text from Ansvars- och rollfördelning i ärendeberedningsprocessen(288355).pdf ...
Extracting text using Azure Document Intelligence
Parsing took: 5.212321 seconds
Ansvars- och rollfördelning i ärendeberedningsprocessen(288355).pdf contained 1 pages

Extracting Text from Användarmanual för att söka och läsa styrand

Now let's check a random page of each book to make sure the parsing was done correctly:

In [1]:
# for bookname,bookmap in book_pages_map.items():
#     # print(bookmap)
#     print(bookname,"\n","chunk text:",bookmap[random.randint(0, 2)][2][:50],"...\n")
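Many of the documents parsed above have only one or two pages, so a fixed index range can raise an `IndexError`. Here is a minimal sketch that draws the page index from the actual page count. It assumes each entry of a page map is a tuple whose first element is the page index and whose third element is the page text (as the `[2]` indexing in the cell above suggests); `preview_random_page` is a hypothetical helper.

```python
import random

def preview_random_page(bookmap, max_chars=50, rng=random):
    """Return (page_index, text_preview) for a random page of one document."""
    if not bookmap:
        return None  # nothing was parsed for this document
    # randrange(len(...)) always yields a valid index, even for 1-page files
    page = bookmap[rng.randrange(len(bookmap))]
    return page[0], page[2][:max_chars]

# Possible usage inside the notebook:
# for bookname, bookmap in book_pages_map.items():
#     print(bookname, preview_random_page(bookmap))
```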

As we can see above, all books were parsed except `Pere_Riche_Pere_Pauvre.pdf` (this book is "Rich Dad, Poor Dad" in French). Why? As mentioned above, this book was scanned, so each page is an image, and it uses a very unusual font. We need a good PDF parser with good OCR capabilities to extract the content of this PDF.
Let's try to parse this book again, this time using the Azure Document Intelligence API (formerly Form Recognizer):

In [18]:
# %%time
# book = "Pere_Riche_Pere_Pauvre.pdf"
# book_path = LOCAL_FOLDER+book
# book_map = parse_pdf(file=book_path, form_recognizer=True, model="prebuilt-document",from_url=False, verbose=True)
# book_pages_map[book]= book_map

Extracting text using Azure Document Intelligence
CPU times: user 11.5 s, sys: 128 ms, total: 11.7 s
Wall time: 34.5 s


In [9]:
# Note: If the above command throws an error, create another Form Recognizer resource in the Azure portal
# in the same resource group (don't delete it), and try again. This seems to be a transient error.

In [19]:
# print(book,"\n","chunk text:",book_map[random.randint(10, 50)][2][:80],"...\n")

Pere_Riche_Pere_Pauvre.pdf 
 chunk text: de s'efforcer d'être de bons employés tout en faisant leur possible afin de poss ...



As demonstrated above, Azure Document Intelligence proves to be superior to pyPDF. **For production scenarios, we strongly recommend using Azure Document Intelligence consistently**. When doing so, it's important to make a wise choice between the available models, such as "prebuilt-document," "prebuilt-layout," or others. You can find more information on model selection [HERE](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/choose-model-feature?view=doc-intel-3.0.0).


## Create Vector-based index


Now that we have the content of each book's chunks (one chunk per page) in the dictionary `book_pages_map`, let's create the vector-based index in our Azure Search engine where this content will land.
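For orientation, here is a sketch of the shape the upload code relies on: each document name maps to a list of per-page tuples, where element `[0]` is a zero-based page index and element `[2]` is the page text. What element `[1]` holds is not shown in this notebook; the sketch assumes a running character offset, and `build_page_map` is a hypothetical stand-in for the parsing helpers.

```python
def build_page_map(page_texts):
    """Turn a list of page strings into (page_index, offset, text) tuples."""
    page_map = []
    offset = 0  # assumed meaning of element [1]: characters before this page
    for page_index, text in enumerate(page_texts):
        page_map.append((page_index, offset, text))
        offset += len(text)
    return page_map

# Illustrative data only; the real dictionary is filled by the parsing cell.
example_pages_map = {
    "example.pdf": build_page_map(["First page.", "Second page."]),
}
```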

In [8]:
# index_name = "digge-ekonomi-index-files-vector-form-rec"
index_name = "digge-100-index-files-vector-form-rec"

In [26]:
### Create Azure Search Vector-based Index
# Setup the Payloads header
headers = {'Content-Type': 'application/json','api-key': os.environ['AZURE_SEARCH_KEY']}
params = {'api-version': os.environ['AZURE_SEARCH_API_VERSION']}

In [27]:
index_payload = {
    "name": index_name,
    "fields": [
        {"name": "id", "type": "Edm.String", "key": "true", "filterable": "true" },
        {"name": "title","type": "Edm.String","searchable": "true","retrievable": "true"},
        {"name": "chunk","type": "Edm.String","searchable": "true","retrievable": "true"},
        {"name": "chunkVector","type": "Collection(Edm.Single)","searchable": "true","retrievable": "true","dimensions": 1536,"vectorSearchConfiguration": "vectorConfig"},
        {"name": "name", "type": "Edm.String", "searchable": "true", "retrievable": "true", "sortable": "false", "filterable": "false", "facetable": "false"},
        {"name": "location", "type": "Edm.String", "searchable": "false", "retrievable": "true", "sortable": "false", "filterable": "false", "facetable": "false"},
        {"name": "page_num","type": "Edm.Int32","searchable": "false","retrievable": "true"},
        
    ],
    "vectorSearch": {
        "algorithmConfigurations": [
            {
                "name": "vectorConfig",
                "kind": "hnsw"
            }
        ]
    },
    "semantic": {
        "configurations": [
            {
                "name": "my-semantic-config",
                "prioritizedFields": {
                    "titleField": {
                        "fieldName": "title"
                    },
                    "prioritizedContentFields": [
                        {
                            "fieldName": "chunk"
                        }
                    ],
                    "prioritizedKeywordsFields": []
                }
            }
        ]
    }
}

r = requests.put(os.environ['AZURE_SEARCH_ENDPOINT'] + "/indexes/" + index_name,
                 data=json.dumps(index_payload), headers=headers, params=params)
print(r.status_code)
print(r.ok)

204
True


In [28]:
# Uncomment to debug errors
# r.text

## Upload the Document chunks and their vectors to the Vector-Based Index

The following code iterates over each chunk of each book and uses the Azure Search REST API upload method to insert each document, together with its vector (computed with the OpenAI embedding model), into the index.

In [29]:
%%time
for bookname,bookmap in book_pages_map.items():
    print("Uploading chunks from",bookname)
    for page in tqdm(bookmap):
        try:
            page_num = page[0] + 1
            content = page[2]
            book_url = BASE_CONTAINER_URL + bookname
            upload_payload = {
                "value": [
                    {
                        "id": text_to_base64(bookname + str(page_num)),
                        "title": f"{bookname}_page_{str(page_num)}",
                        "chunk": content,
                        "chunkVector": embedder.embed_query(content if content!="" else "-------"),
                        "name": bookname,
                        "location": book_url,
                        "page_num": page_num,
                        "@search.action": "upload"
                    },
                ]
            }

            r = requests.post(os.environ['AZURE_SEARCH_ENDPOINT'] + "/indexes/" + index_name + "/docs/index",
                                 data=json.dumps(upload_payload), headers=headers, params=params)
            if r.status_code != 200:
                print(r.status_code)
                print(r.text)
        except Exception as e:
            print("Exception:",e)
            print(content)
            continue

Uploading chunks from Administratörsmanual ledningssystemet.pdf


100%|██████████| 25/25 [00:05<00:00,  4.57it/s]


Uploading chunks from Anmälan om brott till polisen(186680).pdf


100%|██████████| 1/1 [00:00<00:00,  7.22it/s]


Uploading chunks from Anskaffning, utveckling och förändring av informationssystem.pdf


100%|██████████| 6/6 [00:00<00:00,  7.05it/s]


Uploading chunks from Ansvars- och rollfördelning i ärendeberedningsprocessen(288355).pdf


100%|██████████| 1/1 [00:00<00:00,  7.67it/s]


Uploading chunks from Användarmanual för att söka och läsa styrande information i ledningssystemet.pdf


100%|██████████| 27/27 [00:04<00:00,  6.45it/s]


Uploading chunks from Användarmanual för hantering av styrande dokument i ledningssystemet..pdf


100%|██████████| 93/93 [00:14<00:00,  6.49it/s]


Uploading chunks from Arbets- och ansvarsfördelning mellan primärvård och psykiatri(305771).pdf


100%|██████████| 9/9 [00:01<00:00,  7.01it/s]


Uploading chunks from Arbetsgrupp för vårdetik(170276).pdf


100%|██████████| 2/2 [00:00<00:00,  7.52it/s]


Uploading chunks from Arbetsmiljö.pdf


100%|██████████| 2/2 [00:00<00:00,  6.56it/s]


Uploading chunks from Arkivering av allmänna handlingar(163750).pdf


100%|██████████| 5/5 [00:00<00:00,  6.94it/s]


Uploading chunks from Arkivorganisation(163749).pdf


100%|██████████| 3/3 [00:00<00:00,  6.64it/s]


Uploading chunks from Att lämna ut allmän handling(423630).pdf


100%|██████████| 2/2 [00:00<00:00,  7.25it/s]


Uploading chunks from Avbrottsgrupp.pdf


100%|██████████| 2/2 [00:00<00:00,  7.33it/s]


Uploading chunks from Avbrottsrutin för ledningssystemet(378317).pdf


100%|██████████| 3/3 [00:00<00:00,  6.81it/s]


Uploading chunks from Avbrottsrutin för ledningssystemet.pdf


100%|██████████| 5/5 [00:00<00:00,  7.17it/s]


Uploading chunks from Barnrättsombud.pdf


100%|██████████| 3/3 [00:00<00:00,  6.76it/s]


Uploading chunks from Behandling av mineraliseringsstörningar(193329).pdf


100%|██████████| 5/5 [00:00<00:00,  6.51it/s]


Uploading chunks from Beställarfunktion för Hälsoval Västerbotten(264207).pdf


100%|██████████| 4/4 [00:00<00:00,  6.67it/s]


Uploading chunks from Beställning av förtäring vid särskild händelse.pdf


100%|██████████| 1/1 [00:00<00:00,  7.43it/s]


Uploading chunks from Blankett för rapportering av personuppgiftsincident.pdf


100%|██████████| 4/4 [00:00<00:00,  7.62it/s]


Uploading chunks from Brandskydd för verksamhetschefer(230659).pdf


100%|██████████| 3/3 [00:00<00:00,  7.09it/s]


Uploading chunks from Brandskyddsintroduktion för nyanställda(230664).pdf


100%|██████████| 2/2 [00:00<00:00,  7.42it/s]


Uploading chunks from Brandskyddskontroll ledningsstaben.pdf


100%|██████████| 4/4 [00:00<00:00,  6.91it/s]


Uploading chunks from Brukarinstruktion brandfarliga gaser och gasolanläggningar, laboratorier(245357).pdf


100%|██████████| 3/3 [00:00<00:00,  7.91it/s]


Uploading chunks from Dataskyddsombud(304761).pdf


100%|██████████| 1/1 [00:00<00:00,  7.20it/s]


Uploading chunks from Dental erosion(186841).pdf


100%|██████████| 10/10 [00:01<00:00,  6.85it/s]


Uploading chunks from Dental traumaguide(233745).pdf


100%|██████████| 1/1 [00:00<00:00,  7.58it/s]


Uploading chunks from Diarieföring av allmänna handlingar(165680).pdf


100%|██████████| 4/4 [00:00<00:00,  7.37it/s]


Uploading chunks from Digital informationshantering .pdf


100%|██████████| 5/5 [00:00<00:00,  6.93it/s]


Uploading chunks from Dokumenthanteringsplan för forskningsverksamhet_forskningsprojekt(194655).pdf


100%|██████████| 6/6 [00:00<00:00,  6.52it/s]


Uploading chunks from Dokumenthanteringsplan för forskningsverksamhet_forskningsprojekt(370552).pdf


100%|██████████| 6/6 [00:00<00:00,  6.79it/s]


Uploading chunks from Dokumenthanteringsplan för patientinformation och övrig medicinsk dokumentation(370559).pdf


100%|██████████| 30/30 [00:04<00:00,  6.72it/s]


Uploading chunks from Dokumenthanteringsplan för sociala medier(194696).pdf


100%|██████████| 4/4 [00:00<00:00,  6.83it/s]


Uploading chunks from Dokumenthanteringsplan för sociala medier(373862).pdf


100%|██████████| 4/4 [00:00<00:00,  7.17it/s]


Uploading chunks from ECC- Early childhood caries(193331).pdf


100%|██████████| 6/6 [00:00<00:00,  7.31it/s]


Uploading chunks from Egenkontroll av det systematiska säkerhetsarbetet(184386).pdf


100%|██████████| 2/2 [00:00<00:00,  7.15it/s]


Uploading chunks from Egenkontroll av verksamhetens brandskydd(230684).pdf


100%|██████████| 13/13 [00:01<00:00,  6.93it/s]


Uploading chunks from Ekonomi och förvaltning.pdf


100%|██████████| 2/2 [00:00<00:00,  6.82it/s]


Uploading chunks from Enskildas rättigheter enligt GDPR.pdf


100%|██████████| 5/5 [00:00<00:00,  6.92it/s]


Uploading chunks from Exempel på en handlingsplan vid hot och våld i en verksamhet(179647).pdf


 25%|██▌       | 1/4 [00:00<00:00,  6.60it/s]Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 4.0 seconds as it raised RateLimitError: Requests to the Get a vector representation of a given input that can be easily consumed by machine learning models and algorithms. Operation under Azure OpenAI API version 2023-05-15 have exceeded call rate limit of your current OpenAI S0 pricing tier. Please retry after 14 seconds. Please go here: https://aka.ms/oai/quotaincrease if you would like to further increase the default rate limit..
Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 4.0 seconds as it raised RateLimitError: Requests to the Get a vector representation of a given input that can be easily consumed by machine learning models and algorithms. Operation under Azure OpenAI API version 2023-05-15 have exceeded call rate limit of your current OpenAI S0 pricing tier. Please retry after 10 seconds. Please go here: https:/

Uploading chunks from Fluorguiden(235976).pdf


100%|██████████| 1/1 [00:00<00:00,  7.76it/s]


Uploading chunks from Funktionshinderspolitisk strategi för Västerbottens läns landsting 2017-2020(359602).pdf


100%|██████████| 7/7 [00:01<00:00,  6.82it/s]


Uploading chunks from Förebyggande tandvård för barn, ungdom och unga vuxna(186742).pdf


100%|██████████| 10/10 [00:01<00:00,  6.36it/s]


Uploading chunks from Granska och fastställa styrande dokument.pdf


100%|██████████| 3/3 [00:00<00:00,  7.33it/s]


Uploading chunks from Grunder för beslutsfattande.pdf


100%|██████████| 5/5 [00:00<00:00,  7.02it/s]


Uploading chunks from Gränsdragningslista_brandskyddet mellan fastighetsägare och verksamhetsägare(232538).pdf


100%|██████████| 3/3 [00:00<00:00,  6.12it/s]


Uploading chunks from Handbok för systematiskt brandskyddsarbete inom Region Västerbotten(265146).pdf


100%|██████████| 12/12 [00:01<00:00,  6.64it/s]


Uploading chunks from Handlingsplan Hot & Våld .pdf


100%|██████████| 3/3 [00:00<00:00,  7.26it/s]


Uploading chunks from Handlingsplan vid brand(230683).pdf


100%|██████████| 3/3 [00:00<00:00,  7.83it/s]


Uploading chunks from Handläggning av frivilligorganisationers ansökan om folkhälsobidrag(275965).pdf


100%|██████████| 3/3 [00:00<00:00,  6.47it/s]


Uploading chunks from Handläggning av visselblåsarärenden.pdf


100%|██████████| 3/3 [00:00<00:00,  7.33it/s]


Uploading chunks from Hantering av nyckelbrytare för brandfarliga gaser(245483).pdf


100%|██████████| 3/3 [00:00<00:00,  7.32it/s]


Uploading chunks from Hantering av redovisande dokument.pdf


100%|██████████| 4/4 [00:00<00:00,  6.65it/s]


Uploading chunks from Hanteringsrutiner för verksamhetsansvariga med tillgång till brandfarliga gaser(245449).pdf


100%|██████████| 3/3 [00:00<00:00,  6.67it/s]


Uploading chunks from Hälso- och sjukvårdsnämndens delegationsordning(346082).pdf


100%|██████████| 14/14 [00:02<00:00,  6.86it/s]


Uploading chunks from Hälso- och sjukvårdsnämndens reglemente(346112).pdf


100%|██████████| 11/11 [00:01<00:00,  6.78it/s]


Uploading chunks from Identifiering, bemötande, stöd och behandling till våldsutsatta vuxna.pdf


100%|██████████| 19/19 [00:02<00:00,  6.80it/s]


Uploading chunks from Incidentrapportering - Störningar i kontinuiteten i hälso- och sjukvårdstjänsten (NIS).pdf


100%|██████████| 8/8 [00:01<00:00,  7.03it/s]


Uploading chunks from Informationssäkerhet -  förvaltning och drift.pdf


100%|██████████| 11/11 [00:01<00:00,  6.95it/s]


Uploading chunks from Informationssäkerhet - användare.pdf


100%|██████████| 6/6 [00:01<00:00,  5.96it/s]


Uploading chunks from Informationssäkerhet.pdf


100%|██████████| 2/2 [00:00<00:00,  7.67it/s]


Uploading chunks from Informationssäkerhetsklassning i KLASSA.pdf


100%|██████████| 14/14 [00:02<00:00,  6.85it/s]


Uploading chunks from Informationsunderlag - Hantering av brandfarlig vara, gas och vätska(247283).pdf


100%|██████████| 2/2 [00:00<00:00,  7.18it/s]


Uploading chunks from Inrättande av ledningsgrupp, styrgrupp, kommitté och råd(269474).pdf


100%|██████████| 3/3 [00:00<00:00,  7.33it/s]


Uploading chunks from Instruktion för registerförteckning.pdf


100%|██████████| 2/2 [00:00<00:00,  7.62it/s]


Uploading chunks from Intern rutin för hantering av explosiv eter(188636).pdf


100%|██████████| 3/3 [00:00<00:00,  7.09it/s]


Uploading chunks from Jämställdhet och jämlikhet.pdf


100%|██████████| 2/2 [00:00<00:00,  7.34it/s]


Uploading chunks from Kamerabevakning i regionen.pdf


100%|██████████| 5/5 [00:00<00:00,  7.22it/s]


Uploading chunks from Kartlägg riskerna för hot och våld i arbetsmiljön(185773).pdf


100%|██████████| 1/1 [00:00<00:00,  7.93it/s]


Uploading chunks from Kommunikation.pdf


100%|██████████| 2/2 [00:00<00:00,  6.58it/s]


Uploading chunks from Konsekvensbedömning avseende dataskydd.pdf


100%|██████████| 4/4 [00:00<00:00,  7.10it/s]


Uploading chunks from Kontrollsystem för efterlevnad av Dataskyddsförordningen(313124).pdf


100%|██████████| 2/2 [00:00<00:00,  7.73it/s]


Uploading chunks from Kostanamnes Barn(280181).pdf


100%|██████████| 2/2 [00:00<00:00,  6.78it/s]


Uploading chunks from Kostanamnes Vuxna(280184).pdf


100%|██████████| 2/2 [00:00<00:00,  7.01it/s]


Uploading chunks from Kvalitet.pdf


100%|██████████| 2/2 [00:00<00:00,  5.72it/s]


Uploading chunks from Landstingsplan 2016-2019(266621).pdf


100%|██████████| 1/1 [00:00<00:00,  7.46it/s]


Uploading chunks from Ledning- och styrmodell för informationssäkerhet(269051).pdf


100%|██████████| 5/5 [00:00<00:00,  7.00it/s]


Uploading chunks from Länsrutiner för Hälsoval Västerbotten gällande röntgenremisser utfärdade av fysioterapeut(383967).pdf


100%|██████████| 3/3 [00:00<00:00,  7.07it/s]


Uploading chunks from Mall för larm och utökad beredskapsnivå.pdf


100%|██████████| 6/6 [00:00<00:00,  7.26it/s]


Uploading chunks from Mall för reservrutiner .pdf


100%|██████████| 10/10 [00:01<00:00,  6.80it/s]


Uploading chunks from Mall – för utredning av personuppgiftsincidenter.pdf


100%|██████████| 6/6 [00:00<00:00,  7.44it/s]


Uploading chunks from Mekaniska lås- och passagesystem, överfalls_bråk- och inbrottslarm sjukhusbyggnader och externa lokaler.pdf


100%|██████████| 4/4 [00:00<00:00,  7.10it/s]


Uploading chunks from Mekaniska lås- och passagesystem, överfalls_bråk- och inbrottslarm,  Folktandvården(285825).pdf


100%|██████████| 2/2 [00:00<00:00,  6.58it/s]


Uploading chunks from Mekaniska lås- och passagesystem, överfalls_bråk- och inbrottslarm, Hälsocentraler(285830).pdf


100%|██████████| 3/3 [00:00<00:00,  7.05it/s]


Uploading chunks from Mekaniska låssystem,  sjukhusbyggnader(285522).pdf


100%|██████████| 2/2 [00:00<00:00,  6.88it/s]


Uploading chunks from Miljö.pdf


100%|██████████| 2/2 [00:00<00:00,  7.13it/s]


Uploading chunks from Minneslista vid hot(180254).pdf


100%|██████████| 2/2 [00:00<00:00,  6.90it/s]


Uploading chunks from Misstänkta försändelser(185880).pdf


100%|██████████| 2/2 [00:00<00:00,  6.79it/s]


Uploading chunks from Passagesystem, överfalls_bråk- och inbrottslarm, sjukhusbyggnader(285568).pdf


100%|██████████| 2/2 [00:00<00:00,  6.92it/s]


Uploading chunks from Patientnämndens reglemente(346114).pdf


100%|██████████| 7/7 [00:01<00:00,  6.93it/s]


Uploading chunks from Plan för kris- och katastrofmedicinsk beredskap(418633).pdf


100%|██████████| 53/53 [00:07<00:00,  6.83it/s]


Uploading chunks from Plan för krisstöd vid särskild händelse(362232).pdf


100%|██████████| 10/10 [00:01<00:00,  7.22it/s]


Uploading chunks from Principer för fissurförsegling(301569).pdf


100%|██████████| 3/3 [00:00<00:00,  6.76it/s]


Uploading chunks from Principer för gränsdragning mellan förvaltningarnas uppgifter avseende ledningssystem(357719).pdf


100%|██████████| 3/3 [00:00<00:00,  7.37it/s]


Uploading chunks from Process för ordnat övertag från annat verksamhetsområde.pdf


100%|██████████| 6/6 [00:00<00:00,  7.45it/s]


Uploading chunks from Rapportering av säkerhetsincidenter.pdf


100%|██████████| 3/3 [00:00<00:00,  7.18it/s]


Uploading chunks from Rapportering och utredning av personuppgiftsincidenter .pdf


100%|██████████| 6/6 [00:00<00:00,  7.24it/s]


Uploading chunks from Regional evakueringsplan.pdf


100%|██████████| 10/10 [00:01<00:00,  7.02it/s]


Uploading chunks from Regionala utvecklingsnämndens reglemente(346113).pdf


100%|██████████| 10/10 [00:01<00:00,  6.75it/s]

CPU times: user 25 s, sys: 442 ms, total: 25.4 s
Wall time: 1min 58s
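The loop above sends one request per page. The indexing endpoint accepts several documents in a single `value` array (the documented cap is 1000 documents per request), so a batched variant can reduce the number of round trips. The sketch below only builds the payloads and rests on several assumptions: `make_doc_id`, `build_batches`, and the batch size of 100 are hypothetical, the vectors are passed in precomputed, and `make_doc_id` merely stands in for `text_to_base64`, whose implementation is not shown here.

```python
import base64

def make_doc_id(text):
    """URL-safe base64 of the text; a stand-in for text_to_base64 (assumption)."""
    return base64.urlsafe_b64encode(text.encode("utf-8")).decode("ascii")

def build_batches(bookname, base_url, pages, vectors, batch_size=100):
    """Yield upload payloads holding up to batch_size page documents each.

    pages:   list of (page_index, offset, text) tuples for one document
    vectors: list of embedding vectors, one per page, in the same order
    """
    docs = []
    for page, vector in zip(pages, vectors):
        page_num = page[0] + 1  # stored page numbers are 1-based
        docs.append({
            "id": make_doc_id(bookname + str(page_num)),
            "title": f"{bookname}_page_{page_num}",
            "chunk": page[2],
            "chunkVector": vector,
            "name": bookname,
            "location": base_url + bookname,
            "page_num": page_num,
            "@search.action": "upload",
        })
    # Slice the documents into consecutive batches
    for start in range(0, len(docs), batch_size):
        yield {"value": docs[start:start + batch_size]}
```

Each yielded payload can be sent with the same `requests.post(...)` call used in the loop above.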





## Query the Index

In [9]:
# QUESTION = "what normally rich dad do that is different from poor dad?"
# QUESTION = "Tell me a summary of the book Boundaries"
# QUESTION = "Dime que significa la radiacion del cuerpo negro"
# QUESTION = "what is the acronym of the main point of Made to Stick book"
# QUESTION = "Tell me a python example of how do I push documents with vectors to an index using the python SDK?"
# QUESTION = "who won the soccer worldcup in 1994?" # this question should have no answer
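# QUESTION = "Vilket konto ska jag fakturera egenavgift för hjälpmedel?"  # "Which account should I invoice the patient fee for assistive devices to?"
QUESTION = "Hur ser rättigheterna ut för enskilda personer gällander GDPR?"  # "What rights do individuals have under GDPR?"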


In [10]:
vector_indexes = [index_name]

ordered_results = get_search_results(QUESTION, vector_indexes, 
                                        k=10,
                                        reranker_threshold=1,
                                        vector_search=True, 
                                        similarity_k=5,
                                        query_vector = embedder.embed_query(QUESTION)
                                        )

**Note**: We pick a larger `k=10` because the chunks here are not 5000 characters each as in the prior notebooks; instead, each page is a chunk.
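What `get_search_results` does internally is not shown in this notebook. As a rough mental model only, a helper like it might collect the candidate hits, drop those at or below the reranker threshold, and keep the best `similarity_k`. The sketch below assumes each hit is a dict with an `id` key and an `@search.rerankerScore` key (the score field returned by semantic ranking); `select_top_results` is a hypothetical name.

```python
from collections import OrderedDict

def select_top_results(hits, reranker_threshold=1, similarity_k=5):
    """Keep hits scoring above the threshold, best first, at most similarity_k."""
    kept = [
        hit for hit in hits
        if hit.get("@search.rerankerScore", 0) > reranker_threshold
    ]
    kept.sort(key=lambda hit: hit["@search.rerankerScore"], reverse=True)
    # An OrderedDict keyed by document id preserves the ranking order
    return OrderedDict((hit["id"], hit) for hit in kept[:similarity_k])
```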

In [11]:
COMPLETION_TOKENS = 1000
MODEL = "gpt-35-turbo-16k"
llm = AzureChatOpenAI(deployment_name=MODEL, temperature=0.2, max_tokens=COMPLETION_TOKENS)

In [12]:
top_docs = []
for key,value in ordered_results.items():
    location = value["location"] if value["location"] is not None else ""
    top_docs.append(Document(page_content=value["chunk"], metadata={"source": location+os.environ['BLOB_SAS_TOKEN']}))
        
print("Number of chunks:",len(top_docs))

Number of chunks: 5
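Blob names in this container contain spaces and non-ASCII characters, so links built by plain string concatenation may not resolve in every client. Here is a minimal sketch of a percent-encoded source URL using only the standard library; `build_source_url` is a hypothetical helper, and it assumes the SAS token is a query string that already starts with `?`.

```python
from urllib.parse import quote

def build_source_url(container_url, blob_name, sas_token=""):
    """Join container URL and blob name with exactly one slash.

    The blob name is percent-encoded; the SAS token is appended unchanged.
    """
    return container_url.rstrip("/") + "/" + quote(blob_name) + sas_token
```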


In [20]:
# Calculate number of tokens of our docs
if(len(top_docs)>0):
    tokens_limit = model_tokens_limit(MODEL) # this is a custom function we created in common/utils.py
    prompt_tokens = num_tokens_from_string(COMBINE_PROMPT_TEMPLATE) # this is a custom function we created in common/utils.py
    context_tokens = num_tokens_from_docs(top_docs) # this is a custom function we created in common/utils.py
    
    requested_tokens = prompt_tokens + context_tokens + COMPLETION_TOKENS
    
    chain_type = "map_reduce" if requested_tokens > 0.9 * tokens_limit else "stuff"  
    
    print("System prompt token count:",prompt_tokens)
    print("Max Completion Token count:", COMPLETION_TOKENS)
    print("Combined docs (context) token count:",context_tokens)
    print("--------")
    print("Requested token count:",requested_tokens)
    print("Token limit for", MODEL, ":", tokens_limit)
    print("Chain Type selected:", chain_type)
        
else:
    print("NO RESULTS FROM AZURE SEARCH")

System prompt token count: 1669
Max Completion Token count: 1000
Combined docs (context) token count: 3911
--------
Requested token count: 6580
Token limit for gpt-35-turbo-16k : 16384
Chain Type selected: stuff
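The decision above is simple arithmetic: 1669 + 3911 + 1000 = 6580 requested tokens, which is below 90% of the 16384-token limit (about 14745), so `stuff` is selected. The same rule can be written as a standalone function with the 0.9 safety margin exposed as a parameter; the function name is hypothetical.

```python
def choose_chain_type(prompt_tokens, context_tokens, completion_tokens,
                      tokens_limit, safety_margin=0.9):
    """Return 'map_reduce' when the request exceeds the margin, else 'stuff'."""
    requested = prompt_tokens + context_tokens + completion_tokens
    return "map_reduce" if requested > safety_margin * tokens_limit else "stuff"
```

With a smaller model limit, the same documents would no longer fit in one prompt and the chain would switch to `map_reduce`.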


In [21]:
if chain_type == "stuff":
    chain = load_qa_with_sources_chain(llm, chain_type=chain_type, 
                                       prompt=COMBINE_PROMPT)
elif chain_type == "map_reduce":
    chain = load_qa_with_sources_chain(llm, chain_type=chain_type, 
                                       question_prompt=COMBINE_QUESTION_PROMPT,
                                       combine_prompt=COMBINE_PROMPT,
                                       return_intermediate_steps=True)

In [22]:
%%time
# Try with other languages as well
response = chain({"input_documents": top_docs, "question": QUESTION, "language": "Swedish"})

CPU times: user 2.48 ms, sys: 4.11 ms, total: 6.58 ms
Wall time: 12.2 s


In [23]:
display(Markdown(response['output_text']))

Enligt GDPR har enskilda personer följande rättigheter:

1. Rätt till tillgång (Art 15): Den registrerade har rätt att få information om när och hur deras personuppgifter behandlas, samt att få tillgång till personuppgifterna<sup><a href="https://blobstoragejd5ypzfx2l6vi.blob.core.windows.net/digge-100-dokumentEnskildas rättigheter enligt GDPR.pdf?sv=2022-11-02&ss=bfqt&srt=c&sp=rwdlacupiytfx&se=2024-01-30T01:31:54Z&st=2023-11-20T17:31:54Z&spr=https&sig=Ztal%2BZ4HCnOpVZBTKfCKfGpK7E0SkRHO2dcvmBL%2BHKc%3D">[1]</a></sup><sup><a href="https://blobstoragejd5ypzfx2l6vi.blob.core.windows.net/digge-100-dokumentEnskildas rättigheter enligt GDPR.pdf?sv=2022-11-02&ss=bfqt&srt=c&sp=rwdlacupiytfx&se=2024-01-30T01:31:54Z&st=2023-11-20T17:31:54Z&spr=https&sig=Ztal%2BZ4HCnOpVZBTKfCKfGpK7E0SkRHO2dcvmBL%2BHKc%3D">[2]</a></sup>.

2. Rätt till rättelse (Art 16): Den registrerade har rätt att begära att få sina personuppgifter rättade utan onödigt dröjsmål<sup><a href="https://blobstoragejd5ypzfx2l6vi.blob.core.windows.net/digge-100-dokumentEnskildas rättigheter enligt GDPR.pdf?sv=2022-11-02&ss=bfqt&srt=c&sp=rwdlacupiytfx&se=2024-01-30T01:31:54Z&st=2023-11-20T17:31:54Z&spr=https&sig=Ztal%2BZ4HCnOpVZBTKfCKfGpK7E0SkRHO2dcvmBL%2BHKc%3D">[1]</a></sup>.

3. Rätt till radering (Art 17): Den registrerade har rätt att begära att deras personuppgifter raderas, förutsatt att det inte finns något lagligt skäl för att behålla dem<sup><a href="https://blobstoragejd5ypzfx2l6vi.blob.core.windows.net/digge-100-dokumentEnskildas rättigheter enligt GDPR.pdf?sv=2022-11-02&ss=bfqt&srt=c&sp=rwdlacupiytfx&se=2024-01-30T01:31:54Z&st=2023-11-20T17:31:54Z&spr=https&sig=Ztal%2BZ4HCnOpVZBTKfCKfGpK7E0SkRHO2dcvmBL%2BHKc%3D">[1]</a></sup>.

4. Rätt till begränsning av behandling (Art 18): Den registrerade har rätt att begära att behandlingen av deras personuppgifter begränsas under vissa omständigheter<sup><a href="https://blobstoragejd5ypzfx2l6vi.blob.core.windows.net/digge-100-dokumentEnskildas rättigheter enligt GDPR.pdf?sv=2022-11-02&ss=bfqt&srt=c&sp=rwdlacupiytfx&se=2024-01-30T01:31:54Z&st=2023-11-20T17:31:54Z&spr=https&sig=Ztal%2BZ4HCnOpVZBTKfCKfGpK7E0SkRHO2dcvmBL%2BHKc%3D">[1]</a></sup>.

5. Rätt till dataportabilitet (Art 20): Den registrerade har rätt att

# Summary

In this notebook we learned how to handle large, complex documents and make them available for Q&A using [Hybrid Search](https://learn.microsoft.com/en-us/azure/search/search-get-started-vector#hybrid-search) (text + vector search).

We also saw the power of the Azure Document Intelligence API and why it is recommended for production scenarios where manual document parsing (instead of Azure Search Indexer document cracking) is necessary.

Using Azure Cognitive Search with its Vector capabilities and hybrid search features eliminates the need for other vector databases such as Weaviate, Qdrant, Milvus, Pinecone, and so on.


# NEXT
So far we have learned how to use OpenAI vectors and completion APIs to get an excellent answer from our documents stored in Azure Cognitive Search. This is the backbone of a GPT Smart Search Engine.

However, we are missing something: **How to have a conversation with this engine?**

In the next notebook, we are going to explore the concept of **memory**. This is necessary for a chatbot that can hold a conversation with the user. Without memory, there is no real conversation.