# Retrieval Augmented Generation with Content Understanding

This notebook shows how to use Azure AI Content Understanding for Retrieval Augmented Generation (RAG) over document files.

Content Understanding's layout analysis extracts tables, paragraphs, and layout information from PDF files and returns the result as markdown. LangChain's markdown header splitter then splits that markdown into semantic chunks, which are embedded and indexed in an Azure AI Search vector store. When a user query arrives, Azure AI Search retrieves the relevant chunks, and a chat model uses them to generate a context-aware response.

# Prerequisites
1. Follow the [README](../README.md#configure-azure-ai-service-resource) to create the essential resources used in this sample.
2. Install the required packages.

In [None]:
%pip install -r ../requirements.txt
%pip install python-dotenv langchain langchain-community langchain-openai langchainhub openai tiktoken azure-identity azure-search-documents==11.6.0b3

In [None]:
import os
from pathlib import Path
from dotenv import load_dotenv

# Load environment variables from .env file in the parent directory
env_path = Path(__file__).parent.parent / '.env' if '__file__' in globals() else Path('../.env')
load_dotenv(dotenv_path=env_path)

print(f"✓ Loaded environment variables from: {env_path.absolute()}")

# Load environment variables

In [None]:
# Load Azure AI Services configs (optional values fall back to defaults)
AZURE_AI_SERVICE_ENDPOINT = os.getenv("AZURE_AI_SERVICE_ENDPOINT")
AZURE_AI_SERVICE_API_VERSION = os.getenv("AZURE_AI_SERVICE_API_VERSION") or "2024-12-01-preview"
AZURE_DOCUMENT_INTELLIGENCE_API_VERSION = os.getenv("AZURE_DOCUMENT_INTELLIGENCE_API_VERSION") or "2024-11-30"

# Load Azure OpenAI configs
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_CHAT_DEPLOYMENT_NAME = os.getenv("AZURE_OPENAI_CHAT_DEPLOYMENT_NAME")
AZURE_OPENAI_CHAT_API_VERSION = os.getenv("AZURE_OPENAI_CHAT_API_VERSION") or "2024-08-01-preview"
AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME")
AZURE_OPENAI_EMBEDDING_API_VERSION = os.getenv("AZURE_OPENAI_EMBEDDING_API_VERSION") or "2023-05-15"

# Load Azure AI Search configs
AZURE_SEARCH_ENDPOINT = os.getenv("AZURE_SEARCH_ENDPOINT")
AZURE_SEARCH_INDEX_NAME = os.getenv("AZURE_SEARCH_INDEX_NAME") or "sample-doc-index"
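
The cell above reads the settings but does not check them. A minimal sketch of a fail-fast check follows, assuming that the settings without a fallback default are the required ones. `find_missing_settings` and `REQUIRED_SETTINGS` are hypothetical names and are not part of the sample repository.

```python
import os

# Settings that have no fallback default in the cell above (assumed to be required).
REQUIRED_SETTINGS = [
    "AZURE_AI_SERVICE_ENDPOINT",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_CHAT_DEPLOYMENT_NAME",
    "AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME",
    "AZURE_SEARCH_ENDPOINT",
]


def find_missing_settings(names, env=None):
    """Return the names whose value is unset or blank in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not (env.get(name) or "").strip()]


missing = find_missing_settings(REQUIRED_SETTINGS)
if missing:
    print("Missing settings in .env: " + ", ".join(missing))
else:
    print("All required settings are present.")
```

Running a check like this right after `load_dotenv` surfaces a misconfigured `.env` file before any service call is made.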

# Create custom analyzer

In [45]:
import json
import sys
import uuid
from pathlib import Path

from azure.identity import DefaultAzureCredential, get_bearer_token_provider

# File location
DOC_LOCATION = Path("../data/sample_layout.pdf")

# Add the parent directory to the path to use shared modules
parent_dir = Path.cwd().parent
sys.path.append(str(parent_dir))

from python.content_understanding_client import AzureContentUnderstandingClient
credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(credential, "https://cognitiveservices.azure.com/.default")


ANALYZER_TEMPLATE_PATH = "../analyzer_templates/content_document.json"
ANALYZER_ID = "layout-sample-" + str(uuid.uuid4())

# Create Content Understanding client
content_understanding_client = AzureContentUnderstandingClient(
    endpoint=AZURE_AI_SERVICE_ENDPOINT,
    api_version=AZURE_AI_SERVICE_API_VERSION,
    token_provider=token_provider,
    x_ms_useragent="azure-ai-content-understanding-python/content_extraction",  # Header used for sample usage telemetry; comment out this line to opt out.
)

# Create analyzer and use analyzer to extract document content with layout analysis
try:
    # Create analyzer
    response = content_understanding_client.begin_create_analyzer(ANALYZER_ID, analyzer_template_path=ANALYZER_TEMPLATE_PATH)
    result = content_understanding_client.poll_result(response)
    
    # Analyze document
    response = content_understanding_client.begin_analyze(ANALYZER_ID, file_location=DOC_LOCATION)
    result = content_understanding_client.poll_result(response)
    result_data = result.get("result", {})
    contents = result_data.get("contents", [])

    # Extract markdown content
    for content in contents:
        markdown_content = content.get("markdown", "")
        print("Markdown", markdown_content)
    print(json.dumps(result, indent=2))
except Exception as e:
    print(e)
    print("Error while creating the analyzer or analyzing the document. Please double-check your analysis settings.\nIf there is a conflict, you can delete the analyzer and then recreate it, or move to the next cell and use the existing analyzer.")

# Delete the analyzer if it is no longer needed
content_understanding_client.delete_analyzer(ANALYZER_ID)

Markdown <!-- PageHeader="This is the header of the document." -->


# This is title


## 1. Text

Latin refers to an ancient Italic language
originating in the region of Latium in
ancient Rome.


## 2. Page Objects


### 2.1 Table

Here's a sample table below, designed to
be simple for easy understand and quick
reference.


<table>
<caption>Table 1: This is a dummy table</caption>
<tr>
<th>Name</th>
<th>Corp</th>
<th>Remark</th>
</tr>
<tr>
<td>Foo</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Bar</td>
<td>Microsoft</td>
<td>Dummy</td>
</tr>
</table>


### 2.2. Figure


<figure>
<figcaption>Figure 1: Here is a figure with text</figcaption>

Values

500

450

400

400

350

300

300

250

200

200

100

0

Jan

Feb

Mar

Apr

May

Jun

Months

</figure>


## 3. Others

Al Document Intelligence is an Al service
that applies advanced machine learning
to extract text, key-value pairs, tables,
and structures from documents
automatically and accurately:

☒
clear

☒
precise

☐
vague

☒
coherent

☐


<Response [204]>
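
The loop in the cell above keeps only the markdown of the last entry in `contents`, which is enough for this single-document sample. A minimal sketch of collecting every entry follows, assuming the result shape used above (`result["result"]["contents"][i]["markdown"]`). `collect_markdown` is a hypothetical helper and `sample_result` is a made-up stand-in, not real analyzer output.

```python
def collect_markdown(result, separator="\n\n"):
    """Join the markdown of every entry under result["result"]["contents"]."""
    contents = result.get("result", {}).get("contents", [])
    parts = [content.get("markdown", "") for content in contents]
    # Skip entries that have no markdown so the separator is not doubled.
    return separator.join(part for part in parts if part)


# Made-up stand-in for an analyzer response, for illustration only.
sample_result = {
    "result": {
        "contents": [
            {"markdown": "# Part one"},
            {"kind": "document"},
            {"markdown": "## Part two"},
        ]
    }
}
print(collect_markdown(sample_result))
```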

# Split document content into semantic chunks

In [46]:
from langchain import hub
from langchain_openai import AzureChatOpenAI
from langchain_openai import AzureOpenAIEmbeddings
from langchain.schema import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain.vectorstores.azuresearch import AzureSearch

# Text splitting settings (not used by the header splitter below)
EMBEDDING_CHUNK_SIZE = 512
EMBEDDING_CHUNK_OVERLAP = 20

# Split the document into chunks based on markdown headers.
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

text_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

docs_string = markdown_content
splits = text_splitter.split_text(docs_string)

print("Length of splits: " + str(len(splits)))

Length of splits: 5
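
To make the header-based chunking concrete, here is a simplified pure-Python illustration of the idea: each chunk holds the text under a header and carries the active headers as metadata. It is a sketch, not LangChain's implementation (for example, it does not handle fenced code blocks), and `split_by_headers` is a hypothetical name.

```python
def split_by_headers(markdown_text, headers_to_split_on):
    """Group lines under the nearest preceding headers and record them as metadata."""
    # Check longer markers first so "###" is tested before "#".
    markers = sorted(headers_to_split_on, key=lambda pair: len(pair[0]), reverse=True)
    chunks, current_lines, active = [], [], {}

    def flush():
        text = "\n".join(current_lines).strip()
        if text:
            chunks.append({"page_content": text, "metadata": dict(active)})
        current_lines.clear()

    for line in markdown_text.splitlines():
        stripped = line.strip()
        for marker, name in markers:
            if stripped.startswith(marker + " "):
                flush()
                level = len(marker)
                # A new header closes any header at the same or a deeper level.
                for other_marker, other_name in headers_to_split_on:
                    if len(other_marker) >= level:
                        active.pop(other_name, None)
                active[name] = stripped[len(marker):].strip()
                break
        else:
            current_lines.append(line)
    flush()
    return chunks


sample = """# This is title

## 1. Text
Latin refers to an ancient Italic language.

## 2. Page Objects

### 2.1 Table
Here's a sample table.
"""
headers = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
chunks = split_by_headers(sample, headers)
for chunk in chunks:
    print(chunk["metadata"], "->", chunk["page_content"])
```

Headers with no text of their own (such as `## 2. Page Objects` in the sample) produce no chunk but still appear in the metadata of the sections nested under them.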


# Embed and index the chunks

In [47]:
# Embed the split documents and insert them into the Azure AI Search vector store
def embed_and_index_chunks(docs):
    aoai_embeddings = AzureOpenAIEmbeddings(
        azure_deployment=AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME,
        openai_api_version=AZURE_OPENAI_EMBEDDING_API_VERSION,  # e.g., "2023-12-01-preview"
        azure_endpoint=AZURE_OPENAI_ENDPOINT,
        azure_ad_token_provider=token_provider
    )

    vector_store: AzureSearch = AzureSearch(
        azure_search_endpoint=AZURE_SEARCH_ENDPOINT,
        azure_search_key=None,
        index_name=AZURE_SEARCH_INDEX_NAME,
        embedding_function=aoai_embeddings.embed_query
    )
    vector_store.add_documents(documents=docs)
    return vector_store


# Embed and index the chunks
vector_store = embed_and_index_chunks(splits)
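
Once the chunks are indexed, a similarity search ranks them by how close their embedding vectors are to the query's embedding. The sketch below illustrates that idea with cosine similarity on toy vectors. It is only conceptual: Azure AI Search uses its own indexing and ranking, and `cosine_similarity`, `top_k`, and `toy_index` are hypothetical names with made-up data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_chunks, k=3):
    """Return the k chunk texts whose vectors are most similar to the query vector."""
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy 3-dimensional vectors, made up for illustration; real embeddings have many more dimensions.
toy_index = [
    ("chunk about tables", [0.9, 0.1, 0.0]),
    ("chunk about figures", [0.1, 0.9, 0.0]),
    ("chunk about text", [0.0, 0.2, 0.9]),
]
print(top_k([1.0, 0.0, 0.0], toy_index, k=2))
```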

# Retrieve relevant chunks based on a question

In [48]:
# Retrieve relevant chunks based on the question

retriever = vector_store.as_retriever(search_type="similarity", k=3)

retrieved_docs = retriever.invoke(
    "<your question>"
)

print(retrieved_docs[0].page_content)

# Use a prompt for RAG that is checked into the LangChain prompt hub (https://smith.langchain.com/hub/rlm/rag-prompt?organizationId=989ad331-949f-4bac-9694-660074a208a7)
prompt = hub.pull("rlm/rag-prompt")
llm = AzureChatOpenAI(
    openai_api_version=AZURE_OPENAI_CHAT_API_VERSION,  # e.g., "2023-12-01-preview"
    azure_deployment=AZURE_OPENAI_CHAT_DEPLOYMENT_NAME,
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    temperature=1,
    azure_ad_token_provider=token_provider
)


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
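
The first stages of `rag_chain` can be hard to read from the pipe syntax alone: the dict sends the question to the retriever and formats the retrieved chunks as `context`, passes the question through unchanged, and the prompt then fills both into its template. The sketch below mimics those stages in plain Python. `TEMPLATE` is a hypothetical template, not the text of the `rlm/rag-prompt` hub prompt, and `build_prompt` and `fake_retriever` are stand-ins invented for illustration.

```python
from types import SimpleNamespace


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


# Hypothetical template, NOT the text of the "rlm/rag-prompt" hub prompt.
TEMPLATE = "Context:\n{context}\n\nQuestion: {question}"


def build_prompt(question, retrieve):
    """Mimic the first two chain stages: build the input dict, then fill the prompt."""
    inputs = {"context": format_docs(retrieve(question)), "question": question}
    return TEMPLATE.format(**inputs)


def fake_retriever(question):
    # Stand-in for `retriever`; returns objects that expose .page_content like LangChain documents.
    return [SimpleNamespace(page_content="chunk A"), SimpleNamespace(page_content="chunk B")]


print(build_prompt("What is in the table?", fake_retriever))
```

In the real chain, the filled prompt is then sent to `llm`, and `StrOutputParser()` reduces the model message to a plain string.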


# Document Q&A

In [49]:
# Ask a question about the document

rag_chain.invoke("What is the main theme of the document?")

'The document\'s main theme is a page header—it\'s a repeated header placeholder ("This is the header of the document.").'

# Document Q&A with references

In [None]:
# Return the retrieved documents or certain source metadata from the documents

from operator import itemgetter

from langchain.schema.runnable import RunnableMap

rag_chain_from_docs = (
    {
        "context": lambda input: format_docs(input["documents"]),
        "question": itemgetter("question"),
    }
    | prompt
    | llm
    | StrOutputParser()
)
rag_chain_with_source = RunnableMap(
    {"documents": retriever, "question": RunnablePassthrough()}
) | {
    "documents": lambda input: [doc.metadata for doc in input["documents"]],
    "answer": rag_chain_from_docs,
}

rag_chain_with_source.invoke("What is the longest word in the document?")
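
The chain above returns a dict with an `answer` string and a `documents` list holding the metadata of each retrieved chunk. A minimal sketch of turning that metadata into readable references follows, assuming the `Header 1` to `Header 3` keys set by the header splitter survive indexing. `describe_sources` is a hypothetical helper and `sample_output` is a made-up stand-in for the chain's return value.

```python
def describe_sources(result):
    """Render each source's header path, e.g. 'Title > Section > Subsection'."""
    lines = []
    for metadata in result.get("documents", []):
        path = [metadata[key] for key in ("Header 1", "Header 2", "Header 3") if key in metadata]
        lines.append(" > ".join(path) if path else "(no header metadata)")
    return lines


# Made-up stand-in for the dict returned by rag_chain_with_source.invoke(...).
sample_output = {
    "documents": [
        {"Header 1": "This is title", "Header 2": "2. Page Objects", "Header 3": "2.1 Table"},
        {},
    ],
    "answer": "(model answer)",
}
for line in describe_sources(sample_output):
    print(line)
```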