#### Developed using API version 2024-02-29-preview

# Retrieval Augmented Generation (RAG) with Azure AI Document Intelligence

In an earlier notebook, I demonstrated how Azure AI Search can automatically convert data into vectors using the built-in vectorization feature. It can manage the entire workflow of pipeline tasks from ingestion, extraction, enrichment and data upload to the search index with minimal or no custom coding. However, a drawback is that the existing skills may not capture all the relevant content from the document.

In this notebook, I demonstrate a solution that uses the prebuilt layout model of Azure AI Document Intelligence to extract all the necessary content from the PDF booklet and to enable semantic chunking. This should overcome the limitation encountered with the previous solution and improve the relevance and accuracy of search retrieval.

This is the first of two notebooks that show a solution using Azure AI Document Intelligence and LangChain to create a Retrieval Augmented Generation (RAG) workflow. It uses the LangChain Azure AI Document Intelligence document loader to get tables, paragraphs, and layout information from a PDF file. The output is in markdown format, which is processed by LangChain's markdown header splitter. This allows the semantic chunking feature of the Azure AI Document Intelligence service to produce semantic chunks of the source document.

We use the Azure AI Search Python SDK to build the search index, load the semantically chunked documents into it, and run a hybrid + semantic search query at the end of the notebook to assess the relevance of the results.

![Semantic chunking in RAG](https://github.com/jbernec/rag-orchestrations/blob/main/images/semantic-chunking.png?raw=true)


## Prerequisites
- An Azure AI Document Intelligence resource - follow [this document](https://learn.microsoft.com/azure/ai-services/document-intelligence/create-document-intelligence-resource?view=doc-intel-4.0.0) to create one if you don't have one.
- An Azure AI Search resource - follow [this document](https://learn.microsoft.com/azure/search/search-create-service-portal) to create one if you don't have one.
- An Azure OpenAI resource with deployments for an embeddings model and a chat model - follow [this document](https://learn.microsoft.com/azure/ai-services/openai/how-to/create-resource?pivots=web-portal) to create one if you don't have one.
- I have attached a requirements file in the same repo folder as this notebook that lists the Python libraries required for this proof of concept.


In [0]:
# Import required packages
from langchain import hub
from langchain_openai import AzureChatOpenAI
from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeResult, AnalyzeDocumentRequest, ContentFormat
from langchain_openai import AzureOpenAIEmbeddings
from langchain.schema import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain.vectorstores.azuresearch import AzureSearch
from azure.core.credentials import AzureKeyCredential
from azure.storage.blob import BlobServiceClient
import base64
from openai import AzureOpenAI
import azure.identity
from azure.identity import DefaultAzureCredential, EnvironmentCredential, ManagedIdentityCredential, SharedTokenCacheCredential
from azure.identity import ClientSecretCredential
import time

In [0]:
"""
This code loads and sets the necessary variables for Azure services.
The variables are loaded from Azure Key Vault.
"""
azure_openai_endpoint=dbutils.secrets.get(scope="myscope", key="aoai-endpoint")
azure_openai_api_key=dbutils.secrets.get(scope="myscope", key="aoai-api-key")
azure_openai_api_version = "2024-02-15-preview"
azure_openai_embedding_deployment = dbutils.secrets.get(scope="myscope", key="aoai-embedding-deployment")
azure_openai_embedding_model = dbutils.secrets.get(scope="myscope", key="aoai-embedding-model")
doc_intelligence_endpoint = dbutils.secrets.get(scope="myscope", key="docintelligence-endpoint")
doc_intelligence_key = dbutils.secrets.get(scope="myscope", key="docintelligence-key")

In [0]:
# Connect to Blob Storage
# blob_connection_string = dbutils.secrets.get(scope="myscope", key="blobstore-connstr")
# blob_service_client = BlobServiceClient.from_connection_string(blob_connection_string)

# Service principal authentication variables
tenant_id=dbutils.secrets.get(scope="myscope", key="tenantid")
client_id = dbutils.secrets.get(scope="myscope", key="clientid")
client_secret = dbutils.secrets.get(scope="myscope", key="clientsecret")
credential = azure.identity.ClientSecretCredential(tenant_id=tenant_id, client_id=client_id, client_secret=client_secret)

blob_storage_name = "blobstore05" #dbutils.secrets.get(scope="myscope", key="blobstore-account-name")
# Use the above defined service principal to authenticate against the blob storage endpoint of the ADLS Gen 2 service
blob_service_client = BlobServiceClient(
    account_url=f"https://{blob_storage_name}.blob.core.windows.net",
    credential=credential
)
blob_container_name = "document-intelligence"
container_client = blob_service_client.get_container_client(blob_container_name)
container_url = container_client.url
blobs = container_client.list_blobs()
first_blob = next(blobs)
blob_url = container_client.get_blob_client(first_blob).url
print(f"URL of first blob: {blob_url}")

## Utility Function Definitions.
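
The helpers below rely on the fact that each page in the analysis result carries spans (an `offset` and a `length`) that point into the single `content` string. A minimal sketch of that slicing follows. `Span` and `Page` are hypothetical stand-in classes for illustration, not the SDK's own models, and the sample text is made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """Hypothetical stand-in for a span: a slice into the full content string."""
    offset: int
    length: int


@dataclass
class Page:
    """Hypothetical stand-in for a page and the spans that belong to it."""
    page_number: int
    spans: List[Span]


def page_text(content: str, page: Page) -> str:
    """Join the text of every span that belongs to the page."""
    return "".join(content[s.offset:s.offset + s.length] for s in page.spans)


# Made-up content: two "pages" of 21 characters each
content = "# Plan A\nPremium: 10\n# Plan B\nPremium: 20\n"
pages = [
    Page(page_number=1, spans=[Span(offset=0, length=21)]),
    Page(page_number=2, spans=[Span(offset=21, length=21)]),
]
for page in pages:
    print(page.page_number, repr(page_text(content, page)))
```

The notebook's own helpers read only the first span of each page; the sketch joins all spans, which may matter if a page is reported with more than one.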

In [0]:
# Initialize the OpenAI client
# client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
azure_openai_client = AzureOpenAI(
    api_key=azure_openai_api_key,
    api_version=azure_openai_api_version,
    azure_deployment=dbutils.secrets.get(scope="myscope", key="aoai-deploymentname"),
    azure_endpoint=azure_openai_endpoint,
)

In [0]:
def extract_pdf(url: str):
    print(f"{url}\n\n")
    print(f"---------------------------------------------")
    
    document_intelligence_client = DocumentIntelligenceClient(endpoint=doc_intelligence_endpoint, credential=AzureKeyCredential(key=doc_intelligence_key), api_version="2024-02-29-preview")

    poller= document_intelligence_client.begin_analyze_document(model_id="prebuilt-layout", analyze_request=AnalyzeDocumentRequest(url_source=url), output_content_format="markdown")

    result: AnalyzeResult = poller.result()
    return result


def get_table_description(table_content):
    prompt = f"""
    Given the following table and its context from the original document,
    provide a detailed description of the table. Then, include the table in markdown format.

    Table Content:
    {table_content}

    Please provide:
    1. A comprehensive description of the table.
    2. The table in markdown format.
    """

    response = azure_openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that describes tables and formats them in markdown."},
            {"role": "user", "content": prompt}
        ]
    )

    return response.choices[0].message.content


def capture_table_info(result: AnalyzeResult):
    # Initialize an empty dictionary to store table information
    table_info = {}

    if result.tables:
        for table_idx, table in enumerate(result.tables):
            # Initialize a dictionary to store information about each table
            table_details = {
                "page_number": None,
                "content": ""
            }

            # Capture table location information
            if table.bounding_regions:
                table_details["page_number"] = int(table.bounding_regions[0].page_number)

            # Capture raw table content
            for page in result.pages:
                if page.page_number == table_details["page_number"]:
                    start_pos = page.spans[0].offset
                    end_pos = start_pos + page.spans[0].length
                    table_details["content"] += result.content[start_pos:end_pos] + " "

            # Store the table details in the dictionary using the table index as the key
            table_info[f"Table_{table_idx}"] = table_details

    return table_info

# Example usage:
# table_info = capture_table_info(result)
# Now, table_info dictionary contains all the details about the tables
# You can retrieve the information about any table using its index, for example, table_info["Table_0"]
    
# Function to crack and extract PDF documents using Azure AI Document Intelligence
def parse_pdf_content(result: AnalyzeResult):
    # Initialize an empty list to store page information
    page_list = []

    for page in result.pages:
        page_num = page.page_number
        page_content = ""

        # Capture raw page content
        start_pos = page.spans[0].offset
        end_pos = start_pos + page.spans[0].length
        page_content += result.content[start_pos:end_pos] + " "

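        # Check if there are tables on the current page
        # (result.tables is None when the document contains no tables)
        tables_on_page = [
            table for table in (result.tables or [])
            if table.bounding_regions and table.bounding_regions[0].page_number == page_num
        ]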
        if tables_on_page:
            # If there are tables on the page, label the page as type: "table"
            page_type = "table"
        else:
            page_type = "page"

        # Add the page information to the list
        page_list.append({
            "type": page_type,
            "page_number": page_num,
            "content": page_content
        })

    return page_list

# Example usage:
# result = extract_pdf(url=blob_url)
# page_list = parse_pdf_content(result)
# Now, page_list contains all the pages with their type labeled accordingly

## Load a document and split it into semantic chunks
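
Because the layout model returns markdown, headings mark natural chunk boundaries. The following is a minimal pure-Python sketch of header-based splitting. It only illustrates the idea and is not LangChain's `MarkdownHeaderTextSplitter` implementation; the function name `split_by_headers` and the sample text are assumptions made for illustration.

```python
import re
from typing import Dict, List

# An ATX heading: one to six '#' characters, whitespace, then the title
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headers(markdown: str) -> List[Dict[str, str]]:
    """Split markdown into chunks, starting a new chunk at each heading."""
    chunks: List[Dict[str, str]] = []
    current = {"header": "", "text": ""}
    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            # Close the running chunk before starting a new one
            if current["header"] or current["text"].strip():
                chunks.append(current)
            current = {"header": match.group(2).strip(), "text": ""}
        else:
            current["text"] += line + "\n"
    # Keep the last chunk if it holds anything
    if current["header"] or current["text"].strip():
        chunks.append(current)
    return chunks


# Made-up sample: a heading with prose, then a heading with a small table
sample = "# Overview\nIntro text.\n## Costs\n| Plan | Cost |\n|---|---|\n| Standard | 45 |\n"
for chunk in split_by_headers(sample):
    print(chunk["header"], "->", len(chunk["text"]), "characters")
```

Keeping a table in the same chunk as its heading is the main benefit here: the heading gives the retriever context that the table rows alone lack.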

#### Create a new index with custom filterable and retrievable fields and upload the data to the content field of the search index.
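
Every document uploaded later has to match the field names and the vector size declared in the index. As a rough sketch, a small check like the one below could catch mismatches before upload. The helper `validate_payload` is hypothetical; the field names and the 1536 dimensions are taken from the index definition in this notebook.

```python
from typing import Any, Dict, List

# Field names and vector size mirror the index defined in this notebook
EXPECTED_FIELDS = {"parent_id", "title", "chunk", "location", "pagenum", "vector"}
VECTOR_DIMENSIONS = 1536


def validate_payload(payload: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a document payload (empty if none)."""
    problems: List[str] = []
    missing = EXPECTED_FIELDS - set(payload)
    extra = set(payload) - EXPECTED_FIELDS
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if extra:
        problems.append(f"unknown fields: {sorted(extra)}")
    vector = payload.get("vector", [])
    if len(vector) != VECTOR_DIMENSIONS:
        problems.append(
            f"vector has {len(vector)} dimensions, expected {VECTOR_DIMENSIONS}"
        )
    return problems


# Made-up payload with a wrong field name and no vector
print(validate_payload({"parent_id": "abc", "content": "some text"}))
```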

In [0]:
# Create the search index fields and vector search configuration

from azure.search.documents.indexes.models import (
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    AzureOpenAIVectorizer,
    AzureOpenAIVectorizerParameters,
    SearchIndex,
    SemanticConfiguration, SemanticSearch, SemanticPrioritizedFields, SemanticField
)
from azure.search.documents.indexes import SearchIndexClient

import os
from azure.search.documents import SearchClient
from azure.identity import DefaultAzureCredential, AzureAuthorityHosts


fields = [
    SearchField(name="parent_id",key=True,type=SearchFieldDataType.String),
    SearchField(name="title",type=SearchFieldDataType.String),
    SearchField(name="chunk",type=SearchFieldDataType.String, sortable=False, filterable=False, facetable=False),
    SearchField(name="location",type=SearchFieldDataType.String),
    SearchField(name="pagenum",type=SearchFieldDataType.String),
    SearchField(name="vector",type=SearchFieldDataType.Collection(SearchFieldDataType.Single), vector_search_dimensions=1536, vector_search_profile_name="myHnswProfile")
]

# Define the vector search configuration and parameters
vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(name="myHnsw")
    ],
    profiles=[
        VectorSearchProfile(
            name="myHnswProfile",
            algorithm_configuration_name="myHnsw",
            vectorizer_name="myOpenAI"
        )
    ],
    vectorizers=[
        AzureOpenAIVectorizer(
            vectorizer_name="myOpenAI",
            kind="azureOpenAI",
            parameters=AzureOpenAIVectorizerParameters(
                resource_url=azure_openai_endpoint,
                deployment_name=azure_openai_embedding_deployment,
                model_name=azure_openai_embedding_model,
            )
        )
    ]
)

# Configure semantic search on the index
semantic_config = SemanticConfiguration(
    name="my-semantic-config",
    prioritized_fields=SemanticPrioritizedFields(
        title_field=SemanticField(field_name="title"),
        content_fields=[SemanticField(field_name="chunk")]
    )
)

# Create the semantic search config
semantic_search = SemanticSearch(configurations=[semantic_config])

scoring_profiles = []

In [0]:
# Create a search index client required to create the index
search_credential = AzureKeyCredential(dbutils.secrets.get(scope="myscope", key="aisearch-key"))
search_endpoint = dbutils.secrets.get(scope="myscope", key="aisearch-endpoint")
index_client = SearchIndexClient(endpoint=search_endpoint, credential=search_credential)

index_name = "benefits-index"
index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search, scoring_profiles=scoring_profiles, semantic_search=semantic_search)
result = index_client.create_or_update_index(index=index)
print(f"{result.name} created")

#### Upload documents to AI Search index

In [0]:
# Create the langchain azure open ai embedding object. This will be used to embed the vector field content
# https://python.langchain.com/v0.1/docs/integrations/vectorstores/azuresearch/#create-embeddings-and-vector-store-instances

aoai_embeddings = AzureOpenAIEmbeddings(
    azure_deployment=azure_openai_embedding_deployment,
    openai_api_version=azure_openai_api_version,
    azure_endpoint=azure_openai_endpoint,
    api_key=azure_openai_api_key,
)

In [0]:
def text_to_base64(text):
    # Convert text to bytes using UTF-8 encoding.
    # This function generates the parent_id key values for the Azure AI Search index
    bytes_data = text.encode('utf-8')

    # Perform URL-safe Base64 encoding, because document keys cannot contain '+' or '/'
    base64_encoded = base64.urlsafe_b64encode(bytes_data)

    # Convert the result back to a UTF-8 string representation
    base64_text = base64_encoded.decode('utf-8')

    return base64_text

#### Upload the semantically chunked documents and their vectors to the Azure AI Search index
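
Azure AI Search document keys may only contain letters, digits, dashes, underscores, and equal signs, so a key derived from free text is safest when encoded with the URL-safe Base64 alphabet. A minimal sketch follows, assuming a hypothetical helper `make_document_key` that combines the document name with the page number so that each page gets a distinct key. The document name is a made-up sample.

```python
import base64
import re


def make_document_key(doc_name: str, page_number: int) -> str:
    """Build a key from the document name and page number (hypothetical helper)."""
    raw = f"{doc_name}-{page_number}"
    # urlsafe_b64encode uses '-' and '_' instead of '+' and '/'
    return base64.urlsafe_b64encode(raw.encode("utf-8")).decode("ascii")


key = make_document_key("Benefit_Options", 4)
# Check that every character falls inside the set that document keys accept
is_valid = re.fullmatch(r"[A-Za-z0-9_\-=]+", key) is not None
print(key, is_valid)
```

Using the page number rather than a slice of the page text avoids collisions when two pages start with the same words, for example a repeated page header.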

In [0]:
# Dictionary that maps each document to its content and page numbers
doc_map = {}
doc_table_map = {}

for doc in container_client.list_blob_names():
    print(f"Extracting content from {doc}...")

    # Capture the start time
    start_time = time.time()
    url = container_url + "/" + doc

    # Start extraction
    result = extract_pdf(url=url)
    page_list = parse_pdf_content(result=result)
    doc_name = doc.split(sep=".")[0].title()
    # Replace the raw content of each page that contains a table with an LLM-generated description
    for page in page_list:
        if page["type"] == "table":
            page["content"] = get_table_description(table_content=page["content"])
    doc_map[doc_name] = page_list

    # Capture the end time and Calculate the elapsed time
    end_time = time.time()
    elapsed_time = end_time - start_time

    print(f"Parsing took: {elapsed_time:.6f} seconds")
    print(f"The {doc_name} claim contains {len(page_list)} pages\n")

In [0]:
from azure.search.documents import SearchClient

search_client = SearchClient(search_endpoint, index_name, credential=search_credential)
payload_list = []
for doc, pagelist in doc_map.items():
    for page in pagelist:
        try:
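            # Build the key from the document name and page number so that every page gets a unique key
            doc_key = f"{doc}-{page['page_number']}"
            title = f"{doc}"
            upload_payload = {
                        "parent_id": text_to_base64(text=doc_key),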
                        "title": title,
                        "chunk": page["content"],
                        "location": container_url + "/" + doc + ".pdf",
                        "pagenum": str(page["page_number"]),
                        "vector": aoai_embeddings.embed_query(page["content"] if page["content"]!="" else "-------")
            }
            payload_list.append(upload_payload)
            print(f"Uploading pages.............for :{doc}")
            result_upload = search_client.upload_documents(documents=[upload_payload])
            print(f"Successfully uploaded pages for :{doc}")
        except Exception as e:
            print("Exception:", e)

#### Perform a hybrid search + semantic reranking
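
A hybrid query runs the keyword search and the vector search side by side, and Azure AI Search merges the two ranked lists with Reciprocal Rank Fusion (RRF) before the semantic reranker rescores the top results. The sketch below shows the RRF idea in plain Python. The constant `k = 60` is the value commonly used in the RRF literature and the document ids are made up, so treat the numbers as illustrative rather than as what the service computes internally.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Fuse several ranked id lists: each list adds 1 / (k + rank) to an id's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


# Made-up result lists from the keyword side and the vector side
keyword_hits = ["page-4", "page-2", "page-7"]
vector_hits = ["page-4", "page-7", "page-1"]
for doc_id, score in reciprocal_rank_fusion([keyword_hits, vector_hits]).items():
    print(f"{doc_id}: {score:.4f}")
```

A document that appears in both lists accumulates two contributions, which is why results that both searches agree on tend to rise to the top.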

In [0]:
from azure.search.documents.models import (
    QueryType,
    QueryCaptionType,
    QueryAnswerType
)

from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery

# Semantic Hybrid Search
# query = "Which is more comprehensive, Northwind Health Plus vs Northwind Standard?"
#query = "Can you summarize the employee handbook for me in 3 sentences. Use bullet points."
query = "How much is the employee's cost per pay check for the north wind standard?"

search_client = SearchClient(search_endpoint, index_name, search_credential)
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=1, fields="vector", exhaustive=True)

results = search_client.search(  
    search_text=query,
    vector_queries=[vector_query],
    select=["parent_id", "chunk"],
    query_type=QueryType.SEMANTIC,
    semantic_configuration_name='my-semantic-config',
    query_caption=QueryCaptionType.EXTRACTIVE,
    query_answer=QueryAnswerType.EXTRACTIVE,
    top=1
)

semantic_answers = results.get_answers()
if semantic_answers:
    for answer in semantic_answers:
        if answer.highlights:
            print(f"Semantic Answer: {answer.highlights}")
        else:
            print(f"Semantic Answer: {answer.text}")
        print(f"Semantic Answer Score: {answer.score}\n")

for result in results:
    print(f"parent_id: {result['parent_id']}")   
    print(f"Reranker Score: {result['@search.reranker_score']}")
    print(f"Content: {result['chunk']}")  

    captions = result["@search.captions"]
    if captions:
        caption = captions[0]
        if caption.highlights:
            print(f"Caption: {caption.highlights}\n")
        else:
            print(f"Caption: {caption.text}\n")


