# Visual Document Search with Azure Content Understanding
## Objective
This document walks through an example workflow that uses the Azure AI Content Understanding API to improve the quality of document search.

The sample will demonstrate the following steps:
1. Extract the layout and content of a document using Azure AI Document Intelligence.
2. For each figure in the document, extract its content with a custom Azure AI Content Understanding analyzer, and insert that content at the figure's location in the document content.
3. Chunk and embed the document content with LangChain and Azure OpenAI, then index the chunks with Azure Search.
4. Use an OpenAI chat model to answer a natural language query from the retrieved document content.


## Pre-requisites
1. Follow the [README](../README.md#configure-azure-ai-service-resource) to create the required resources for this sample.
1. Install the required packages.

In [None]:
%pip install -r ../requirements.txt

## Load environment variables

In [None]:
from dotenv import load_dotenv
import os

load_dotenv(override=True)

# Load and validate Azure AI Services configs
AZURE_AI_SERVICE_ENDPOINT = os.getenv("AZURE_AI_SERVICE_ENDPOINT")
AZURE_AI_SERVICE_API_VERSION = os.getenv("AZURE_AI_SERVICE_API_VERSION") or "2024-12-01-preview"
AZURE_DOCUMENT_INTELLIGENCE_API_VERSION = os.getenv("AZURE_DOCUMENT_INTELLIGENCE_API_VERSION") or "2024-11-30"

# Load and validate Azure OpenAI configs
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_CHAT_DEPLOYMENT_NAME = os.getenv("AZURE_OPENAI_CHAT_DEPLOYMENT_NAME")
AZURE_OPENAI_CHAT_API_VERSION = os.getenv("AZURE_OPENAI_CHAT_API_VERSION") or "2024-08-01-preview"
AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME")
AZURE_OPENAI_EMBEDDING_API_VERSION = os.getenv("AZURE_OPENAI_EMBEDDING_API_VERSION") or "2023-05-15"

# Load and validate Azure Search Services configs
AZURE_SEARCH_ENDPOINT = os.getenv("AZURE_SEARCH_ENDPOINT")
AZURE_SEARCH_INDEX_NAME = os.getenv("AZURE_SEARCH_INDEX_NAME") or "sample-index-visual-doc"
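
The comments in the cell above mention validation, but the cell only loads the values. The following is a minimal sketch of one way to fail fast on missing settings. `find_missing_settings` and `REQUIRED_SETTINGS` are hypothetical names, and the list of required settings is an assumption based on which variables above have no fallback default.

```python
import os

# Settings with no fallback default in the cell above (assumed to be required).
REQUIRED_SETTINGS = [
    "AZURE_AI_SERVICE_ENDPOINT",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_CHAT_DEPLOYMENT_NAME",
    "AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME",
    "AZURE_SEARCH_ENDPOINT",
]

def find_missing_settings(names, env=None):
    """Return the names that are unset or blank in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not (env.get(name) or "").strip()]

missing = find_missing_settings(REQUIRED_SETTINGS)
if missing:
    # Report every missing setting at once rather than failing on the first one.
    print("Missing settings: " + ", ".join(missing))
```

Running a check like this right after `load_dotenv` surfaces configuration problems before any service call is made.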

## File to analyze

In [None]:
from pathlib import Path

# Get the path to the file that will be analyzed
# Sample report source: https://www.imf.org/en/Publications/CR/Issues/2024/07/18/United-States-2024-Article-IV-Consultation-Press-Release-Staff-Report-and-Statement-by-the-552100
file = Path("../data/sample_report.pdf")

## Create custom analyzer using chart and diagram understanding template

In [None]:
import json
import sys
import uuid

# Add the parent directory to the path to use shared modules
parent_dir = Path.cwd().parent
sys.path.append(str(parent_dir))
from python.content_understanding_client import AzureContentUnderstandingClient

from azure.identity import DefaultAzureCredential, get_bearer_token_provider
credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")

# Get path to sample template
ANALYZER_TEMPLATE_PATH = "../analyzer_templates/image_chart_diagram_understanding.json"

# Create analyzer
ANALYZER_ID = "content-understanding-search-sample-" + str(uuid.uuid4())
content_understanding_client = AzureContentUnderstandingClient(
    endpoint=AZURE_AI_SERVICE_ENDPOINT,
    api_version=AZURE_AI_SERVICE_API_VERSION,
    token_provider=token_provider,
    x_ms_useragent="azure-ai-content-understanding-python/search_with_visusal_document", # This header is used for sample usage telemetry, please comment out this line if you want to opt out.
)

try:
    response = content_understanding_client.begin_create_analyzer(ANALYZER_ID, analyzer_template_path=ANALYZER_TEMPLATE_PATH)
    result = content_understanding_client.poll_result(response)
    print(f'Analyzer details for {result["result"]["analyzerId"]}:')
    print(json.dumps(result, indent=2))
except Exception as e:
    print(e)
    print("Error in creating analyzer. Please double-check your analysis settings.\nIf there is a conflict, you can delete the analyzer and then recreate it, or move to the next cell and use the existing analyzer.")

## Analyze document layout and compose with figure descriptions

In [None]:
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeResult
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest
import fitz
from PIL import Image

# Define helper functions for document-figure composition
def insert_figure_contents(md_content, figure_contents, span_offsets):
    """
    Inserts the figure content for each of the provided figures in figure_contents
    before the span offset of that figure in the given markdown content.

    Args:
    - md_content (str): The original markdown content.
    - figure_contents (list[str]): The contents of each figure to insert.
    - span_offsets (list[int]): The span offsets of each figure in order. These should be sorted and strictly increasing.

    Returns:
    - str: The modified markdown content with the figure contents prepended to each figure's span.
    """
    # NOTE: In this notebook, we only alter the Markdown content returned by the Document Intelligence API,
    # and not the per-element spans in the API response. Thus, after figure content insertion, these per-element spans will be inaccurate.
    # This may impact use cases like citation page number calculation.
    # Additional code may be needed to correct the spans or otherwise infer the page numbers for each citation.
    # The main purpose of the notebook is to show the feasibility of using Content Understanding with Azure Search for RAG chat applications.

    # Validate span_offsets are sorted and strictly increasing
    if any(current >= following for current, following in zip(span_offsets, span_offsets[1:])):
        raise ValueError("span_offsets should be sorted and strictly increasing.")

    # Split the content based on the provided spans
    parts = []
    preamble = None
    for i, offset in enumerate(span_offsets):
        if i == 0 and offset > 0:
            # Keep any text that precedes the first figure
            preamble = md_content[0:offset]
        if i == len(span_offsets) - 1:
            # The last part runs to the end of the document (this also covers a single figure)
            parts.append(md_content[offset:])
        else:
            parts.append(md_content[offset:span_offsets[i + 1]])

    # Join the parts back together with the figure content inserted
    modified_content = ""
    if preamble:
        modified_content += preamble
    for i, part in enumerate(parts):
        modified_content += f"<!-- FigureContent=\"{figure_contents[i]}\" -->" + part

    return modified_content

def crop_image_from_pdf_page(pdf_path, page_number, bounding_box):
    """
    Crops a region from a given page in a PDF and returns it as an image.

    Args:    
    - pdf_path (pathlib.Path): Path to the PDF file.
    - page_number (int): The page number to crop from (0-indexed).
    - bounding_box (tuple): A tuple of (x0, y0, x1, y1) coordinates for the bounding box.
    
    Returns:
    - PIL.Image: A PIL Image of the cropped area.
    """
    doc = fitz.open(pdf_path)
    page = doc.load_page(page_number)
    
    # Cropping the page. The rect requires the coordinates in the format (x0, y0, x1, y1).
    bbx = [x * 72 for x in bounding_box]
    rect = fitz.Rect(bbx)
    pix = page.get_pixmap(matrix=fitz.Matrix(300 / 72, 300 / 72), clip=rect)
    
    img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
    
    doc.close()

    return img

def format_content_understanding_result(content_understanding_result):
    """
    Formats the JSON output of the Content Understanding result as Markdown for downstream usage in text.
    
    Args:
    - content_understanding_result (dict): A dictionary containing the output from Content Understanding.

    Returns:
    - str: A Markdown string of the result content.
    """
    def _format_result(key, result):
        result_type = result["type"]
        if result_type in ["string", "integer", "number", "boolean"]:
            return f"**{key}**: " + str(result[f'value{result_type.capitalize()}']) + "\n"
        elif result_type == "array":
            # Each array item carries its own type, which determines its value key
            items = [str(item[f"value{item['type'].capitalize()}"]) for item in result["valueArray"]]
            return f"**{key}**: " + ', '.join(items) + "\n"
        elif result_type == "object":
            return f"**{key}**\n" + ''.join([_format_result(f"{key}.{k}", result["valueObject"][k]) for k in result["valueObject"]])
        # Skip field types that this helper does not format, instead of returning None
        return ""

    fields = content_understanding_result['result']['contents'][0]['fields']
    markdown_result = ""
    for field in fields:
        markdown_result += _format_result(field, fields[field])

    return markdown_result
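
To make the offset-based insertion easier to follow, here is a self-contained toy version of the same idea. `insert_before_offsets` is a hypothetical, simplified stand-in for `insert_figure_contents`, and the sample string is made up for illustration. It is a sketch of the technique, not the notebook's implementation.

```python
def insert_before_offsets(text, inserts, offsets):
    """Insert inserts[i] immediately before text[offsets[i]].

    Offsets refer to positions in the ORIGINAL text and must be strictly increasing.
    """
    if any(a >= b for a, b in zip(offsets, offsets[1:])):
        raise ValueError("offsets must be strictly increasing")
    pieces = []
    cursor = 0
    for insert, offset in zip(inserts, offsets):
        pieces.append(text[cursor:offset])  # original text up to this figure
        pieces.append(f'<!-- FigureContent="{insert}" -->')
        cursor = offset
    pieces.append(text[cursor:])  # remainder after the last offset
    return "".join(pieces)

sample = "Intro. <figure>A</figure> Middle. <figure>B</figure>"
offsets = [sample.index("<figure>A"), sample.index("<figure>B")]
print(insert_before_offsets(sample, ["chart A", "chart B"], offsets))
```

Because every offset is measured against the original text, the pieces are sliced first and joined afterwards. Inserting in place would shift the later offsets.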

In [None]:
import io
import json
import os

# Run Content Understanding on each figure, format figure contents, and insert figure contents into corresponding document locations
with open(file, 'rb') as f:
    pdf_bytes = f.read()

    document_intelligence_client = DocumentIntelligenceClient(
        endpoint=AZURE_AI_SERVICE_ENDPOINT,
        api_version=AZURE_DOCUMENT_INTELLIGENCE_API_VERSION,
        credential=credential,
    )

    poller = document_intelligence_client.begin_analyze_document(
        "prebuilt-layout",
        AnalyzeDocumentRequest(bytes_source=pdf_bytes),
        output=[str('figures')],
        features=['ocrHighResolution'],
        output_content_format="markdown"
    )

    result: AnalyzeResult = poller.result()
    
    md_content = result.content

    figure_contents = []
    if result.figures:
        print("Extracting figure contents with Content Understanding.")
        for figure_idx, figure in enumerate(result.figures):
            # Use the first bounding region of the figure for both the crop box and the page number
            # To learn more about bounding regions, see https://aka.ms/bounding-region
            region = figure.bounding_regions[0]
            # Uncomment the below to print out the bounding region of each figure
            # print(f"Figure {figure_idx + 1} body bounding region: {region}")
            bounding_box = (
                region.polygon[0],  # x0 (left)
                region.polygon[1],  # y0 (top)
                region.polygon[4],  # x1 (right)
                region.polygon[5]   # y1 (bottom)
            )
            page_number = region['pageNumber']
            cropped_img = crop_image_from_pdf_page(file, page_number - 1, bounding_box)

            os.makedirs("figures", exist_ok=True)

            figure_filename = f"figure_{figure_idx + 1}.png"
            # Full path for the file
            figure_filepath = os.path.join("figures", figure_filename)

            # Save the figure so that Content Understanding can analyze it from disk
            cropped_img.save(figure_filepath)

            # Collect formatted content from the figure
            content_understanding_response = content_understanding_client.begin_analyze(ANALYZER_ID, figure_filepath)
            content_understanding_result = content_understanding_client.poll_result(content_understanding_response, timeout_seconds=1000)
            figure_content = format_content_understanding_result(content_understanding_result)
            figure_contents.append(figure_content)
            print(f"Figure {figure_idx + 1} contents:\n{figure_content}")

        # Insert figure content into corresponding location in document
        md_content = insert_figure_contents(md_content, figure_contents, [f.spans[0]["offset"] for f in result.figures])
    
    # Save results as a JSON file to cache the result for downstream use
    result.content = md_content
    output = {}
    output['analyzeResult'] = result.as_dict()
    output = json.dumps(output)
    with open('sample_report.cache', 'w') as f:
        f.write(output)
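
The figure-cropping step above reads polygon indices 0, 1, 4, and 5 as the top-left and bottom-right corners, which assumes an axis-aligned rectangle whose points start at the top-left corner. As a hedged alternative, the sketch below derives the box from the minimum and maximum of all polygon coordinates and converts from inches to PDF points (72 per inch), matching the scaling used in `crop_image_from_pdf_page`. `polygon_to_bbox_points` is a hypothetical helper, and the sample polygon is made up.

```python
def polygon_to_bbox_points(polygon, points_per_unit=72.0):
    """Convert a flat [x0, y0, x1, y1, ...] polygon into an (x0, y0, x1, y1) box in points.

    Taking min/max over all vertices does not depend on the vertex order,
    so it also covers slightly rotated or skewed regions.
    """
    if len(polygon) < 4 or len(polygon) % 2 != 0:
        raise ValueError("polygon must contain an even number of coordinates (at least 2 points)")
    xs = polygon[0::2]  # every even index is an x coordinate
    ys = polygon[1::2]  # every odd index is a y coordinate
    return (
        min(xs) * points_per_unit,
        min(ys) * points_per_unit,
        max(xs) * points_per_unit,
        max(ys) * points_per_unit,
    )

# Hypothetical region given in inches, listed clockwise from the top-left corner
print(polygon_to_bbox_points([1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 1.0, 4.0]))
```

Note that this helper already returns points, so it would replace the `x * 72` scaling rather than be combined with it.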

In [None]:
# Uncomment the first line below to load in a previously cached result.
# output = open("sample_report.cache").read()
document_content = json.loads(output)
document_content = document_content['analyzeResult']['content']

## Chunk text with Markdown header splitting and recursive character splitting
This is a simple starting point. Feel free to give your own chunking strategies a try!

In [None]:

from langchain_text_splitters import MarkdownHeaderTextSplitter
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Configure langchain text splitting settings
EMBEDDING_CHUNK_SIZE = 512
EMBEDDING_CHUNK_OVERLAP = 20
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3")
]

# First split text using Markdown headers
text_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, strip_headers=False)
chunks = text_splitter.split_text(document_content)

# Then further split the text using recursive character text splitting
char_text_splitter = RecursiveCharacterTextSplitter(separators=["<!--", "\n\n", "#"], chunk_size=EMBEDDING_CHUNK_SIZE, chunk_overlap=EMBEDDING_CHUNK_OVERLAP, is_separator_regex=True)
chunks = char_text_splitter.split_documents(chunks)

print("Number of chunks: " + str(len(chunks)))
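
For intuition about the first splitting pass, the sketch below shows header-based splitting in plain Python. It is a simplified illustration, not LangChain's implementation: `split_by_headers` is a hypothetical function, it ignores fenced code blocks, and it only tracks the three header levels configured above. Header lines are kept in the section text, in the spirit of `strip_headers=False`.

```python
import re

# Matches Markdown headers of level 1 to 3, e.g. "## Outlook"
HEADER_PATTERN = re.compile(r"^(#{1,3})\s+(.*)$")

def split_by_headers(markdown_text):
    """Split Markdown into sections, each tagged with the headers it falls under."""
    sections = []
    current_headers = {}  # header level (int) -> header title
    current_lines = []

    def flush():
        body = "\n".join(current_lines).strip()
        if body:
            metadata = {f"Header {level}": title for level, title in sorted(current_headers.items())}
            sections.append({"metadata": metadata, "text": body})

    for line in markdown_text.splitlines():
        match = HEADER_PATTERN.match(line)
        if match:
            # Close the previous section before starting a new one
            flush()
            current_lines.clear()
            level = len(match.group(1))
            # A new header replaces its own level and discards any deeper levels
            for stale in [lvl for lvl in current_headers if lvl >= level]:
                del current_headers[stale]
            current_headers[level] = match.group(2).strip()
        current_lines.append(line)
    flush()
    return sections

doc = "# Report\nSummary.\n## Outlook\nGrowth.\n## Risks\nDebt."
for section in split_by_headers(doc):
    print(section["metadata"], "->", repr(section["text"]))
```

Each section carries its header path as metadata, which is what lets a retrieved chunk be traced back to the part of the document it came from.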

## Query vector index to retrieve relevant documents

In [None]:
from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.vectorstores.azuresearch import AzureSearch

aoai_embeddings = AzureOpenAIEmbeddings(model=AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME,
                                        azure_endpoint=AZURE_OPENAI_ENDPOINT,
                                        azure_ad_token_provider=token_provider,
                                        api_version=AZURE_OPENAI_EMBEDDING_API_VERSION)

vector_store = AzureSearch(
    azure_search_endpoint=AZURE_SEARCH_ENDPOINT,
    azure_search_key=None,
    index_name=AZURE_SEARCH_INDEX_NAME,
    embedding_function=aoai_embeddings.embed_query
)

# This is a one-time operation to add the documents to the vector store. Comment out this line if you are re-running this cell with the same index.
vector_store.add_documents(documents=chunks)

# Set up the retriever that will be used to query the index for similar documents
retriever = vector_store.as_retriever(search_type="similarity")

In [None]:
# Retrieve relevant documents
query = "What was the crude oil production in 2019?"
retrieved_docs = retriever.invoke(query)

# Print retrieved documents
for doc in retrieved_docs:
    print("Document id:", doc.metadata['id'])
    print("Content:", doc.page_content)
    print("=" * 50)
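
Conceptually, a similarity retriever embeds the query and returns the indexed chunks whose vectors are closest to it. The sketch below illustrates that ranking with cosine similarity over tiny made-up vectors. It is only a conceptual illustration: `cosine_similarity` and `top_k` are hypothetical helpers, and a search service typically uses approximate nearest-neighbor indexing with a metric set by the index configuration rather than a full scan like this.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vector, indexed, k=4):
    """Rank (doc_id, vector) pairs by similarity to the query and keep the best k."""
    scored = [(doc_id, cosine_similarity(query_vector, vector)) for doc_id, vector in indexed]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up 2-dimensional "embeddings" purely for illustration
toy_index = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
for doc_id, score in top_k([1.0, 0.0], toy_index, k=2):
    print(doc_id, round(score, 3))
```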

## Generate answer to query

In [None]:
# Define system prompt template for chat model
prompt = """
You are an expert in document analysis. You are proficient in reading and analyzing technical reports. You are good at numerical reasoning and have a good understanding of financial concepts. You are given a question which you need to answer based on the references provided. To answer this question, you may first read the question carefully to know what information is required or helpful to answer the question. Then, you may read the references to find the relevant information.

If you find enough information to answer the question, you can first write down your thinking process and then provide a concise answer at the end.
If you find that there is not enough information to answer the question, you can state that there is insufficient information.
If you are not able or sure how to answer the question, say that you are not able to answer the question.
Do not provide any information that is not present in the references.
References are in markdown format, you may follow the markdown syntax to better understand the references.

---
References:
{context}
---

Now, here is the question:
---
Question:
{question}
---
Thinking Process::: 
Answer::: 
"""

# Helper function to generate the formatted context from each retrieved document
def generate_context(chunks):
    context = []
    for i, chunk in enumerate(chunks):
        s = (f'Source {i} Metadata: {chunk.metadata}\n'
                f'Source {i} Content: {chunk.page_content}')
        context.append(s)
    context = '\n---\n'.join(context)
    return context

# Remove redundant chunks
appeared = set()
unique_chunks = []
for chunk in retrieved_docs:
    chunk_id = chunk.metadata['id']
    if chunk_id not in appeared:
        appeared.add(chunk_id)
        unique_chunks.append(chunk)
context = generate_context(unique_chunks)

# Format the prompt with the provided query and formatted context
prompt = prompt.format(question=query,
                       context=context)

In [None]:
from langchain_openai import AzureChatOpenAI

chat_llm = AzureChatOpenAI(model=AZURE_OPENAI_CHAT_DEPLOYMENT_NAME,
                            azure_endpoint=AZURE_OPENAI_ENDPOINT,
                            azure_ad_token_provider=token_provider,
                            api_version=AZURE_OPENAI_CHAT_API_VERSION,
                            temperature=0.7)

# Print the LLM's answer to the query with the retrieved documents as additional context
answer = chat_llm.invoke(prompt)
print(answer.content)
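
The prompt asks the model to write its reasoning after `Thinking Process:::` and its final answer after `Answer:::`. If only the final answer is needed downstream, the response text can be split on those markers. The sketch below is one way to do that. `split_model_response` is a hypothetical helper, and it assumes the model actually follows the requested format, which is not guaranteed.

```python
def split_model_response(text, answer_marker="Answer:::", thinking_marker="Thinking Process:::"):
    """Return (thinking, answer) from a response that follows the prompt's output format.

    If the answer marker is absent, the whole text is treated as the answer.
    """
    # Split on the LAST occurrence so that marker text quoted in the reasoning is tolerated
    before, marker, after = text.rpartition(answer_marker)
    if not marker:
        return "", text.strip()
    thinking = before.replace(thinking_marker, "", 1).strip()
    return thinking, after.strip()

# Made-up response text purely for illustration
example = "Thinking Process::: Reference 2 lists the value.\nAnswer::: 12.3 units"
thinking, final_answer = split_model_response(example)
print("Thinking:", thinking)
print("Answer:", final_answer)
```

In the notebook, the input would be `answer.content` from the cell above.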