# Retrieval Augmented Generation (RAG)

In an earlier notebook, I demonstrated how Azure AI Search can automatically convert data into vectors using the built-in vectorization feature. It can manage the entire workflow of pipeline tasks from ingestion, extraction, enrichment and data upload to the search index with minimal or no custom coding. However, a drawback is that the existing skills may not capture all the relevant content from the document.

In this notebook, I demonstrate a solution that uses the prebuilt layout model of Azure AI Document Intelligence to extract all the necessary content from the PDF booklet and to enable semantic chunking. This should overcome the limitation encountered with the previous solution and improve the relevance and accuracy of search retrieval.

This is the first of two notebooks that show how to use Azure AI Document Intelligence and LangChain to create a Retrieval Augmented Generation (RAG) workflow. It uses the LangChain Azure AI Document Intelligence document loader to get tables, paragraphs, and layout information from a PDF file. The output is in markdown format, which is processed by LangChain's markdown header splitter. Because the markdown keeps the document's headings, splitting on them produces semantic chunks of the source document.

We employ the AI Search Python SDK to build the Azure AI Search index, load the semantically chunked documents into this index and execute a hybrid + semantic search query at the end of the notebook to assess the search result relevance.

![Semantic chunking in RAG](https://github.com/jbernec/rag-orchestrations/blob/main/images/semantic-chunking.png?raw=true)


## Prerequisites
- An Azure AI Document Intelligence resource - follow [this document](https://learn.microsoft.com/azure/ai-services/document-intelligence/create-document-intelligence-resource?view=doc-intel-4.0.0) to create one if you don't have one.
- An Azure AI Search resource - follow [this document](https://learn.microsoft.com/azure/search/search-create-service-portal) to create one if you don't have one.
- An Azure OpenAI resource with deployments for an embeddings model and a chat model - follow [this document](https://learn.microsoft.com/azure/ai-services/openai/how-to/create-resource?pivots=web-portal) to create one if you don't have one.
- A requirements file in the same repo folder as this notebook lists the Python libraries required for this PoC.


In [0]:
# Import required packages
from langchain import hub
from langchain_openai import AzureChatOpenAI
from langchain_community.document_loaders import AzureAIDocumentIntelligenceLoader
from langchain_openai import AzureOpenAIEmbeddings
from langchain.schema import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.text_splitter import MarkdownHeaderTextSplitter
from langchain.vectorstores.azuresearch import AzureSearch
from azure.core.credentials import AzureKeyCredential
from azure.storage.blob import BlobServiceClient
import base64

In [0]:
"""
This code loads and sets the variables needed for the Azure services.
The values are read with dbutils.secrets from a Databricks secret scope backed by Azure Key Vault.
"""

azure_openai_endpoint=dbutils.secrets.get(scope="myscope", key="aoai-endpoint")
azure_openai_api_key=dbutils.secrets.get(scope="myscope", key="aoai-api-key")
azure_openai_api_version = "2024-02-15-preview"
azure_openai_embedding_deployment = dbutils.secrets.get(scope="myscope", key="aoai-embedding-deployment")
doc_intelligence_endpoint = dbutils.secrets.get(scope="myscope", key="docintelligence-endpoint")
doc_intelligence_key = dbutils.secrets.get(scope="myscope", key="docintelligence-key")
#docUrl = "https://raw.githubusercontent.com/jbernec/rag-orchestrations/main/data/Benefit_Options.pdf"

In [0]:

# Connect to Blob Storage
blob_connection_string = dbutils.secrets.get(scope="myscope", key="blobstore-connstr")
blob_container_name = "aisearch-rag"
blob_service_client = BlobServiceClient.from_connection_string(blob_connection_string)
container_client = blob_service_client.get_container_client(blob_container_name)
blobs = container_client.list_blobs()
# Take the first blob in the listing (raises StopIteration if the container is empty)
first_blob = next(blobs)
blob_url = container_client.get_blob_client(first_blob).url
#print(f"URL of first blob: {blob_url}")
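
The cell above takes whichever blob the listing returns first, whatever its type. A minimal sketch of a more defensive selection, assuming the document of interest is a PDF: `first_matching` is a hypothetical helper that works on plain name strings, which in this notebook could come from `b.name for b in container_client.list_blobs()`.

```python
def first_matching(names, suffix=".pdf"):
    """Return the first name that ends with `suffix` (case-insensitive), or None."""
    return next((name for name in names if name.lower().endswith(suffix)), None)


# Illustrative names only; in the notebook they would come from the container listing.
sample_names = ["readme.txt", "Benefit_Options.pdf", "other.PDF"]
selected = first_matching(sample_names)
print(selected)
```

Returning `None` instead of raising lets the caller decide how to report an empty or PDF-free container.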

## Load a document and split it into semantic chunks

In [0]:
# Instantiate the LangChain Azure AI Document Intelligence loader. You can pass either file_path or url_path to point it at the document.
# Ensure that the Document Intelligence managed identity has the Storage Blob Data Contributor (SBDC) RBAC role on the Blob storage resource.
loader = AzureAIDocumentIntelligenceLoader(
    url_path=blob_url,
    api_key=doc_intelligence_key,
    api_endpoint=doc_intelligence_endpoint,
    api_model="prebuilt-layout",
)
docs = loader.load()

# Split the document into semantic chunks based on markdown headers, using the MarkdownHeaderTextSplitter class.
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
text_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

docs_string = docs[0].page_content
splits = text_splitter.split_text(docs_string)

print("Length of splits: " + str(len(splits)))

Length of splits: 6


In [0]:
# Display first document in the splits list object
splits[0]

Document(page_content='Contoso Electronics Plan and Benefit Packages  \n<figure>  \n![](figures/0)  \n<!-- FigureContent="Contoso Electronics" -->  \n</figure>  \nThis document contains information generated using a language model (Azure OpenAI). The information contained in this document is only for demonstration purposes and does not reflect the opinions or beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the information contained in this document.  \nAll rights reserved to Microsoft  \n<!-- PageHeader="Welcome to Contoso Electronics! We are excited to offer our employees two comprehensive health insurance plans through Northwind Health." -->')

In [0]:
# Display second document in the splits list object
splits[1]

Document(metadata={'Header 1': 'Northwind Health Plus'}, page_content='Northwind Health Plus is a comprehensive plan that provides comprehensive coverage for medical, vision, and dental services. This plan also offers prescription drug coverage, mental health and substance abuse coverage, and coverage for preventive care services. With Northwind Health Plus, you can choose from a variety of in-network providers, including primary care physicians, specialists, hospitals, and pharmacies. This plan also offers coverage for emergency services, both in-network and out-of-network.')
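
To make the header-based chunking above concrete, here is a simplified, standard-library sketch of the idea. It is a stand-in for illustration and not LangChain's implementation; `split_by_headers` and the sample text are hypothetical. As in the output above, text that precedes the first header ends up in a chunk with empty metadata.

```python
import re


def split_by_headers(markdown_text, max_level=3):
    """Split markdown into chunks, one per header section.

    Returns a list of dicts with `metadata` (the headers in effect) and `content`.
    """
    header_re = re.compile(r"^(#{1,%d})\s+(.*)$" % max_level)
    chunks = []
    active = {}  # header level -> title currently in effect
    lines = []   # body lines collected for the current chunk

    def flush():
        body = "\n".join(lines).strip()
        if body:
            metadata = {f"Header {lvl}": title for lvl, title in sorted(active.items())}
            chunks.append({"metadata": metadata, "content": body})
        lines.clear()

    for line in markdown_text.splitlines():
        match = header_re.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header replaces its own level and invalidates deeper ones
            for lvl in [l for l in active if l >= level]:
                del active[lvl]
            active[level] = match.group(2).strip()
        else:
            lines.append(line)
    flush()
    return chunks


sample = """Intro text before any header.
# Northwind Health Plus
Comprehensive plan.
## Cost Comparison
Table goes here.
# Northwind Standard
Basic plan.
"""
chunks = split_by_headers(sample)
for chunk in chunks:
    print(chunk["metadata"], "->", chunk["content"])
```

Each chunk carries the full header path that was active when its text appeared, which is what makes the `Header 1` value usable later as a title field.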

#### Create a new index with custom filterable and retrievable fields

In [0]:
# Create the search index fields and vector search configuration

from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    SimpleField,
    SearchableField,
    HnswAlgorithmConfiguration,
    HnswParameters,
    VectorSearchAlgorithmMetric,
    ExhaustiveKnnAlgorithmConfiguration,
    ExhaustiveKnnParameters,
    VectorSearchProfile,
    AzureOpenAIVectorizer,
    AzureOpenAIParameters,
    SemanticConfiguration,
    SemanticSearch,
    SemanticPrioritizedFields,
    SemanticField,
    SearchIndex,
)

search_credential = AzureKeyCredential(dbutils.secrets.get(scope="myscope", key="aisearch-adminkey"))
search_endpoint = dbutils.secrets.get(scope="myscope", key="aisearch-endpoint")
# Create a search index client required to create the index
index_client = SearchIndexClient(endpoint=search_endpoint, credential=search_credential)

fields = [
    SimpleField(name="parent_id", key=True, type=SearchFieldDataType.String, filterable=True, sortable=True, facetable=True),
    SearchableField(name="title", type=SearchFieldDataType.String, filterable=True, searchable=True, retrievable=True),
    SearchableField(name="content", type=SearchFieldDataType.String, searchable=True, sortable=True, facetable=True, retrievable=True),
    SearchableField(name="location", type=SearchFieldDataType.String, searchable=True, filterable=True, retrievable=True),
    SearchField(name="vector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single), searchable=True, retrievable=True, hidden=False, vector_search_dimensions=1536, vector_search_profile_name="myHnswProfile")
]

# Configure the vector search config
vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(
            name="myHnsw",
            parameters=HnswParameters(
                m=4,
                ef_construction=400,
                ef_search=500,
                metric=VectorSearchAlgorithmMetric.COSINE
            )
        )
    ],
    profiles=[  
        VectorSearchProfile(  
            name="myHnswProfile",  
            algorithm_configuration_name="myHnsw",  
            vectorizer="myOpenAI",  
        ),
    ],
    vectorizers=[  
        AzureOpenAIVectorizer(  
            name="myOpenAI",  
            kind="azureOpenAI",  
            azure_open_ai_parameters=AzureOpenAIParameters(  
                resource_uri=azure_openai_endpoint,  
                deployment_id=azure_openai_embedding_deployment,  
                api_key=azure_openai_api_key,  
            ),  
        ),  
    ]
)

# Configure semantic search on the index
semantic_config = SemanticConfiguration(
    name="my-semantic-config",
    prioritized_fields=SemanticPrioritizedFields(
        content_fields=[
            SemanticField(field_name="content")
        ]
    )
)
# Create the semantic search config
semantic_search = SemanticSearch(configurations=[semantic_config])
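
The HNSW configuration above compares vectors with the cosine metric. As a rough illustration of what that metric measures, here is a plain-Python sketch using toy 3-dimensional vectors (the index itself is configured for 1536 dimensions); `cosine_similarity` and the sample vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


query_vec = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]  # same direction, different length
orthogonal = [0.0, 3.0, 0.0]      # shares no component with the query

print(cosine_similarity(query_vec, same_direction))
print(cosine_similarity(query_vec, orthogonal))
```

Cosine ignores vector length and looks only at direction, so a vector scaled by any positive factor scores the same against the query.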

In [0]:
# Create the search index
index_name = "manual-aisearch-index"
index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search, semantic_search=semantic_search)
result = index_client.create_or_update_index(index=index)
print(f"{result.name} created")

manual-aisearch-index created


#### Upload documents to AI Search index

In [0]:
# Create the langchain azure open ai embedding object. This will be used to embed the vector field content
# https://python.langchain.com/v0.1/docs/integrations/vectorstores/azuresearch/#create-embeddings-and-vector-store-instances

aoai_embeddings = AzureOpenAIEmbeddings(
    azure_deployment=azure_openai_embedding_deployment,
    openai_api_version=azure_openai_api_version,
    azure_endpoint=azure_openai_endpoint,
    api_key=azure_openai_api_key,
)

In [0]:
def text_to_base64(text):
    # Convert text to bytes using UTF-8 encoding
    # and use this function for generating a unique value for the Azure AI Search Index parent_id values
    bytes_data = text.encode('utf-8')

    # Perform Base64 encoding
    base64_encoded = base64.b64encode(bytes_data)

    # Convert the result back to a UTF-8 string representation
    base64_text = base64_encoded.decode('utf-8')

    return base64_text
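
Two details of key generation are worth a sketch. Azure AI Search document keys may contain only letters, digits, underscores, dashes, and equal signs, while standard Base64 can emit `+` and `/`. In addition, two chunks that share a `Header 1` value would encode to the same key, so the later upload would overwrite the earlier one. The hypothetical `make_chunk_key` helper below is one possible way to address both, assuming the chunk's position in `splits` is available as an index.

```python
import base64


def make_chunk_key(title, chunk_index):
    """Build a document key from the chunk title and its position in the split list.

    urlsafe_b64encode swaps '+' and '/' for '-' and '_', and the index keeps
    chunks that share a title from colliding.
    """
    raw = f"{title}|{chunk_index}".encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


key_a = make_chunk_key("Cost Comparison", 0)
key_b = make_chunk_key("Cost Comparison", 1)
print(key_a, key_b)
```

In the upload loop this could be called as `make_chunk_key(title, i)` inside `for i, doc in enumerate(splits)`.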

#### Upload the semantically chunked documents and their vectors to the Azure AI Search index

In [0]:
# upload data to the search index

from azure.search.documents import SearchClient

bookname = "Benefit_Options.pdf"
search_client = SearchClient(search_endpoint, index_name, credential=search_credential)
for doc in splits:
    try:
        content = doc.page_content
        # Chunks that precede the first header carry no metadata, so fall back to "Default"
        title = doc.metadata.get("Header 1", "Default")
        upload_payload = {
            "parent_id": text_to_base64(title),
            "title": title,
            "content": content,
            "location": blob_url,
            # Embed a placeholder string when the chunk is empty
            "vector": aoai_embeddings.embed_query(content if content != "" else "-------"),
        }

        result_upload = search_client.upload_documents(documents=[upload_payload])
        # upload_documents returns one IndexingResult per document. A new document
        # reports 201 and an overwritten one 200, so test `succeeded` rather than the code.
        if not result_upload[0].succeeded:
            print("Status code:", result_upload[0].status_code)
            print("Error message:", result_upload[0].error_message)
    except Exception as e:
        print("Exception:", e)
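
The loop above issues one request per chunk, which is fine for six chunks. For larger documents, a batched variant may be preferable. The sketch below is one way to structure it: `upload_in_batches` is a hypothetical helper, and `upload_fn` is assumed to be any callable that accepts `documents=[...]` and returns one result per document with `key` and `succeeded` attributes (in this notebook that would be `search_client.upload_documents`). A stand-in uploader keeps the example runnable without a search service.

```python
def upload_in_batches(documents, upload_fn, batch_size=100):
    """Send documents in batches and return the keys of any that failed."""
    failed_keys = []
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        for result in upload_fn(documents=batch):
            if not result.succeeded:
                failed_keys.append(result.key)
    return failed_keys


# Stand-in result and uploader, for illustration only.
class FakeResult:
    def __init__(self, key, succeeded):
        self.key = key
        self.succeeded = succeeded


def fake_upload(documents):
    # Pretend that documents with empty content are rejected.
    return [FakeResult(d["parent_id"], bool(d["content"])) for d in documents]


docs_to_upload = [
    {"parent_id": "a", "content": "first chunk"},
    {"parent_id": "b", "content": ""},
    {"parent_id": "c", "content": "third chunk"},
]
failed = upload_in_batches(docs_to_upload, fake_upload, batch_size=2)
print(failed)
```

Collecting failed keys rather than printing inside the loop makes it easier to retry only the documents that did not go through.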

#### Perform a hybrid search + semantic reranking

In [0]:
from azure.search.documents import SearchClient
from azure.search.documents.models import (
    QueryAnswerType,
    QueryCaptionType,
    QueryType,
    VectorizableTextQuery,
)

# Semantic Hybrid Search
# query = "Which is more comprehensive, Northwind Health Plus vs Northwind Standard?"
query = "How much is the employee's cost per pay check for the north wind standard?"

search_client = SearchClient(search_endpoint, index_name, search_credential)
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=1, fields="vector", exhaustive=True)

results = search_client.search(  
    search_text=query,
    vector_queries=[vector_query],
    select=["parent_id", "content"],
    query_type=QueryType.SEMANTIC,
    semantic_configuration_name='my-semantic-config',
    query_caption=QueryCaptionType.EXTRACTIVE,
    query_answer=QueryAnswerType.EXTRACTIVE,
    top=1
)

semantic_answers = results.get_answers()
if semantic_answers:
    for answer in semantic_answers:
        if answer.highlights:
            print(f"Semantic Answer: {answer.highlights}")
        else:
            print(f"Semantic Answer: {answer.text}")
        print(f"Semantic Answer Score: {answer.score}\n")

for result in results:
    print(f"parent_id: {result['parent_id']}")   
    print(f"Reranker Score: {result['@search.reranker_score']}")
    print(f"Content: {result['content']}")  

    captions = result["@search.captions"]
    if captions:
        caption = captions[0]
        if caption.highlights:
            print(f"Caption: {caption.highlights}\n")
        else:
            print(f"Caption: {caption.text}\n")


parent_id: Q29zdCBDb21wYXJpc29u
Reranker Score: 3.1452205181121826
Content: Contoso Electronics deducts the employee's portion of the healthcare cost from each paycheck. This means that the cost of the health insurance will be spread out over the course of the year, rather than being paid in one lump sum. The employee's portion of the cost will be calculated based on the selected health plan and the number of people covered by the insurance. The table below shows a cost comparison between the different health plans offered by Contoso Electronics:  
| | Employee's cost per paycheck ||
| | Northwind Standard | Northwind Health Plus |
| - | - | - |
| Employee Only | $45.00 | $55.00 |
| Employee +1 | $65.00 | $71.00 |
| Employee +2 or more | $78.00 | $89.00 |
Caption: Contoso Electronics deducts the employee's portion of the healthcare cost from each paycheck. This means that the cost of the health insurance will be spread out over the course of the year, rather than being paid in one lump
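
For context on the "hybrid" part of the query: Azure AI Search merges the keyword ranking and the vector ranking with Reciprocal Rank Fusion (RRF) before the semantic reranker rescores the top results. The sketch below shows the fusion step only. The document ids are made up, and `k=60` is a commonly cited constant that is assumed here rather than taken from this notebook.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids; each hit adds 1 / (k + rank), rank from 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


keyword_ranking = ["cost-comparison", "standard-plan", "health-plus"]
vector_ranking = ["cost-comparison", "health-plus", "overview"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

A document that appears near the top of both lists accumulates the largest score, which is why a chunk matched by both keyword and vector search tends to surface first.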