#### In this notebook I demonstrate the Azure AI Search integrated vectorization feature, using the following skills and prerequisites:
* Document Layout skill to extract data from a sample PDF file in Azure Blob storage 
* https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-document-intelligence-layout
* https://learn.microsoft.com/en-us/azure/search/search-how-to-semantic-chunking
* SplitSkill to chunk the data
* OCR skill to add page numbers for every chunk that is extracted
* Deploy the following services in the same region: Azure AI Document Intelligence, Azure AI Search, Azure OpenAI, Azure AI Foundry, and Azure Blob Storage
* Enable a system-assigned managed identity on the Azure AI Search service
* Deploy text-embedding-3-small on Azure OpenAI (in Azure AI Foundry) for embeddings
* Deploy gpt-4o on Azure OpenAI for chat completion
* Configure search service RBAC to Azure Blob Storage by adding the Storage Blob Data Reader role, assigned to the search service's system-assigned managed identity
* Configure search service RBAC to Azure OpenAI by adding the Cognitive Services OpenAI User role, assigned to the search service's system-assigned managed identity
* The model names and endpoint should be saved in Azure Key Vault (AKV). Embedding skills and vectorizers assemble the full endpoint internally, so only the resource URI is needed. For example, given https://MY-FAKE-ACCOUNT.openai.azure.com/openai/deployments/text-embedding-3-large/embeddings?api-version=2024-06-01, the endpoint to provide in skill and vectorizer definitions is https://MY-FAKE-ACCOUNT.openai.azure.com.
* The Azure AI multiservice account is used for skills processing. The multiservice account key must be provided, even if RBAC is in use. The key isn't used on the connection, but it's currently used for billing purposes.
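
The endpoint rule in the prerequisites can be checked with a few lines of standard-library Python. This is a minimal sketch: `resource_uri` is a hypothetical helper name of mine, not part of any Azure SDK, and it only keeps the scheme and host of whatever URL it is given.

```python
from urllib.parse import urlsplit


def resource_uri(full_endpoint: str) -> str:
    """Return only scheme://host from a full Azure OpenAI request URL."""
    parts = urlsplit(full_endpoint)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"Not an absolute URL: {full_endpoint!r}")
    # Drop the path (/openai/deployments/...) and the api-version query string.
    return f"{parts.scheme}://{parts.netloc}"


# The example URL from the prerequisites list above.
full_url = (
    "https://MY-FAKE-ACCOUNT.openai.azure.com/openai/deployments/"
    "text-embedding-3-large/embeddings?api-version=2024-06-01"
)
print(resource_uri(full_url))
```

Storing the trimmed value in the secret scope avoids having to strip the deployment path every time the skill or vectorizer is defined.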

#### Import Required Packages

In [0]:
import base64
import os

import azure.identity
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeResult, AnalyzeDocumentRequest, ContentFormat
from azure.core.credentials import AzureKeyCredential
from azure.identity import (
    AzureAuthorityHosts,
    ClientSecretCredential,
    DefaultAzureCredential,
    EnvironmentCredential,
    ManagedIdentityCredential,
    SharedTokenCacheCredential,
)
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    AzureOpenAIVectorizer,
    AzureOpenAIVectorizerParameters,
    SearchIndex,
    SemanticConfiguration,
    SemanticSearch,
    SemanticPrioritizedFields,
    SemanticField,
)
from azure.storage.blob import BlobServiceClient
from openai import AzureOpenAI

#### Define Required Variables

In [0]:
"""
This code loads and sets the variables needed for the Azure services.
The values are read with dbutils.secrets.get from the Databricks secret scope "myscope",
which is assumed here to be backed by Azure Key Vault.
"""
azure_openai_endpoint = dbutils.secrets.get(scope="myscope", key="aoai-endpoint")
azure_openai_api_key = dbutils.secrets.get(scope="myscope", key="aoai-api-key")
azure_openai_api_version = "2024-02-15-preview"
azure_openai_embedding_deployment = dbutils.secrets.get(scope="myscope", key="aoai-embedding-deployment")
azure_openai_embedding_model = dbutils.secrets.get(scope="myscope", key="aoai-embedding-model")
azure_openai_vector_dimension = 1536
doc_intelligence_endpoint = dbutils.secrets.get(scope="myscope", key="docintelligence-endpoint")
doc_intelligence_key = dbutils.secrets.get(scope="myscope", key="docintelligence-key")
search_credential = AzureKeyCredential(dbutils.secrets.get(scope="myscope", key="aisearch-key"))
search_endpoint = dbutils.secrets.get(scope="myscope", key="aisearch-endpoint")
index_name = "integrated-vector-layout-index"
data_source_connection_name = "iv-indexer-datasource-connection"
azure_ai_services_key = dbutils.secrets.get(scope="myscope", key="azure-ai-services-key")
azure_ai_services_endpoint = dbutils.secrets.get(scope="myscope", key="azure-ai-services-endpoint")

# Connect to Blob Storage
# blob_connection_string = dbutils.secrets.get(scope="myscope", key="blobstore-connstr")
# blob_service_client = BlobServiceClient.from_connection_string(blob_connection_string)

# Service principal authentication variables
tenant_id=dbutils.secrets.get(scope="myscope", key="tenantid")
client_id = dbutils.secrets.get(scope="myscope", key="clientid")
client_secret = dbutils.secrets.get(scope="myscope", key="clientsecret")
credential = azure.identity.ClientSecretCredential(tenant_id=tenant_id, client_id=client_id, client_secret=client_secret)

blob_storage_name = "blobstore05" #dbutils.secrets.get(scope="myscope", key="blobstore-account-name")
# Use the above defined service principal to authenticate against the blob storage endpoint of the ADLS Gen 2 service
blob_service_client = BlobServiceClient(
    account_url=f"https://{blob_storage_name}.blob.core.windows.net",
    credential=credential
)
blob_container_name = "integrated-vectorization"
container_client = blob_service_client.get_container_client(blob_container_name)
use_layout = True
document_layout_depth = "h3"

#### Create Azure AI Search Datasource Object

In [0]:
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import (
    SearchIndexerDataContainer,
    SearchIndexerDataSourceConnection,
    SearchIndexerDataIdentity,
    SearchIndexerDataUserAssignedIdentity,
    SearchIndexerDataNoneIdentity
)
from azure.search.documents.indexes.models import (
    NativeBlobSoftDeleteDeletionDetectionPolicy,
    HighWaterMarkChangeDetectionPolicy,
    DataChangeDetectionPolicy
)

indexer_client = SearchIndexerClient(
    endpoint=search_endpoint, credential=search_credential
)
indexer_container = SearchIndexerDataContainer(name=blob_container_name)
data_source_connection = SearchIndexerDataSourceConnection(
    name=data_source_connection_name,
    type="azureblob",
    connection_string=dbutils.secrets.get(scope="myscope", key="blobstore-connstr"),
    container=indexer_container,
    # Remove search documents when the source blob is soft-deleted
    data_deletion_detection_policy=NativeBlobSoftDeleteDeletionDetectionPolicy(),
)

# Create the data source object
data_source = indexer_client.create_or_update_data_source_connection(data_source_connection=data_source_connection)

print(f"Data source {data_source.name} created or updated successfully.")

Data source iv-indexer-datasource-connection created or updated successfully.
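
The prerequisites grant Storage Blob Data Reader to the search service's managed identity, while the cell above reads a full connection string from the secret scope. As far as I know, Azure AI Search also accepts a key-less connection string in the `ResourceId=...;` form for a system-assigned identity. The sketch below only builds that string; the helper name and the subscription and resource group values are hypothetical placeholders, and the account name is the one used earlier in this notebook.

```python
def managed_identity_connection_string(
    subscription_id: str, resource_group: str, storage_account: str
) -> str:
    """Build a key-less blob data source connection string (ResourceId form)."""
    return (
        f"ResourceId=/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{storage_account};"
    )


# Placeholder identifiers (assumptions); substitute real values before use.
example = managed_identity_connection_string(
    subscription_id="00000000-0000-0000-0000-000000000000",
    resource_group="my-resource-group",
    storage_account="blobstore05",
)
print(example)
```

The resulting string would be passed as `connection_string` when building `SearchIndexerDataSourceConnection`, in place of the secret that holds the account key.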


#### Create Azure AI Search Index

In [0]:
fields = [
    SearchField(name="parent_id",type=SearchFieldDataType.String, sortable=True, filterable=True, facetable=True),
    SearchField(name="title",type=SearchFieldDataType.String),
    SearchField(name="chunk",type=SearchFieldDataType.String, sortable=False, filterable=False, facetable=False),
    SearchField(name="chunk_id", type=SearchFieldDataType.String, key=True, sortable=True, filterable=True, facetable=True, analyzer_name="keyword"),
    SearchField(name="vector",type=SearchFieldDataType.Collection(SearchFieldDataType.Single), vector_search_dimensions=azure_openai_vector_dimension, vector_search_profile_name="myHnswProfile")
]

if use_layout:
    fields.extend([
        SearchField(name="header_1",type=SearchFieldDataType.String, sortable=True, filterable=True, facetable=False),
        SearchField(name="header_2",type=SearchFieldDataType.String, sortable=True, filterable=True, facetable=False),
        SearchField(name="header_3",type=SearchFieldDataType.String, sortable=True, filterable=True, facetable=False),
    ])


# Define the vector search configuration and parameters
vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(name="myHnsw")
    ],
    profiles=[
        VectorSearchProfile(
            name="myHnswProfile",
            algorithm_configuration_name="myHnsw",
            vectorizer_name="myOpenAI"
        )
    ],
    vectorizers=[
        AzureOpenAIVectorizer(
            vectorizer_name="myOpenAI",
            kind="azureOpenAI",
            parameters=AzureOpenAIVectorizerParameters(
                resource_url=azure_openai_endpoint,
                deployment_name=azure_openai_embedding_deployment,
                model_name=azure_openai_embedding_model,
            )
        )
    ]
)

# Configure semantic search on the index
semantic_config = SemanticConfiguration(
    name="my-semantic-config",
    prioritized_fields=SemanticPrioritizedFields(
        title_field=SemanticField(field_name="title"),
        content_fields=[SemanticField(field_name="chunk")]
    )
)

# Create the semantic search config
semantic_search = SemanticSearch(configurations=[semantic_config])

scoring_profiles = []

In [0]:
# Create a search index client required to create the index
index_client = SearchIndexClient(endpoint=search_endpoint, credential=search_credential)

index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search, scoring_profiles=scoring_profiles, semantic_search=semantic_search)
result = index_client.create_or_update_index(index=index)
print(f"{result.name} created")

integrated-vector-layout-index created


> #### Create Required Skillsets for the document extraction processes and operations.
Skills drive integrated vectorization. The Document Layout skill extracts markdown sections from each file, Text Split provides data chunking, and AzureOpenAIEmbedding handles calls to Azure OpenAI, using the connection information loaded into the variables defined above. An index projection maps each chunk to its own document in the target index.
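
To make the chunking parameters concrete, here is a simplified sketch of splitting text into pages of at most 2000 characters with a 500-character overlap, the values passed to `SplitSkill` in the next cell. This is an illustration only: it is a plain character window written by me, and the real skill is not guaranteed to cut at the same positions.

```python
def split_with_overlap(text: str, max_len: int = 2000, overlap: int = 500) -> list[str]:
    """Character-based sliding window; consecutive chunks share `overlap` characters."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_len])
        # Stop once the current window already reaches the end of the text.
        if start + max_len >= len(text):
            break
    return chunks


# A synthetic 3500-character document (assumption, for illustration only).
sample = "".join(chr(97 + i % 26) for i in range(3500))
pages = split_with_overlap(sample)
print(len(pages), [len(p) for p in pages])
```

The overlap means a sentence that straddles a page boundary still appears whole in at least one chunk, at the cost of embedding some text twice.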

In [0]:
# Import required libraries
from azure.search.documents.indexes.models import (
    SplitSkill,
    AzureOpenAIEmbeddingSkill,
    OcrSkill,
    SearchIndexerSkillset,
    DocumentIntelligenceLayoutSkill,
    DocumentIntelligenceLayoutSkillMarkdownHeaderDepth,
    InputFieldMappingEntry,
    OutputFieldMappingEntry,
    SearchIndexerIndexProjection,
    SearchIndexerIndexProjectionSelector,
    SearchIndexerIndexProjectionsParameters,
    IndexProjectionMode,
    AIServicesAccountKey,
    AIServicesAccountIdentity
)

skillset_name = f"{index_name}-skillset"

def create_layout_skillset():
    layout_skill = DocumentIntelligenceLayoutSkill(
        description="Extracts layout information from the document",
        context="/document",
        output_mode="oneToMany",
        markdown_header_depth=document_layout_depth,
        inputs=[
            InputFieldMappingEntry(name="file_data", source="/document/file_data"),
        ],
        outputs=[
            OutputFieldMappingEntry(name="markdown_document", target_name="markdownDocument"),
        ]
    )

    split_skill = SplitSkill(
        description="Split skill to chunk documents",
        text_split_mode="pages",
        context="/document/markdownDocument/*",
        maximum_page_length=2000,
        page_overlap_length=500,
        inputs=[
            InputFieldMappingEntry(name="text", source="/document/markdownDocument/*/content"),
        ],
        outputs=[
            OutputFieldMappingEntry(name="textItems", target_name="pages")
        ]
    )

    embedding_skill = AzureOpenAIEmbeddingSkill(
        description="Skill to generate embeddings via Azure OpenAI",
        context="/document/markdownDocument/*/pages/*",
        resource_url=azure_openai_endpoint,
        deployment_name=azure_openai_embedding_deployment,
        model_name=azure_openai_embedding_model,
        dimensions=azure_openai_vector_dimension,
        api_key=azure_openai_api_key,
        inputs=[
            InputFieldMappingEntry(name="text", source="/document/markdownDocument/*/pages/*"),
        ],
        outputs=[
            OutputFieldMappingEntry(name="embedding", target_name="vector")
        ]
    )

    index_projections = SearchIndexerIndexProjection(
        selectors=[
            SearchIndexerIndexProjectionSelector(
                target_index_name=index_name,
                parent_key_field_name="parent_id",
                source_context="/document/markdownDocument/*/pages/*",
                mappings=[
                    InputFieldMappingEntry(name="chunk", source="/document/markdownDocument/*/pages/*"),
                    InputFieldMappingEntry(name="vector", source="/document/markdownDocument/*/pages/*/vector"),
                    InputFieldMappingEntry(name="title", source="/document/metadata_storage_name"),
                    InputFieldMappingEntry(name="header_1", source="/document/markdownDocument/*/sections/h1"),
                    InputFieldMappingEntry(name="header_2", source="/document/markdownDocument/*/sections/h2"),
                    InputFieldMappingEntry(name="header_3", source="/document/markdownDocument/*/sections/h3"),
                ]
            )
        ],
        parameters=SearchIndexerIndexProjectionsParameters(
            projection_mode=IndexProjectionMode.SKIP_INDEXING_PARENT_DOCUMENTS
        )
    )

    skills = [layout_skill, split_skill, embedding_skill]

    return SearchIndexerSkillset(
        name=skillset_name,
        description="Skillset to chunk documents and generate embeddings",
        skills=skills,
        index_projection=index_projections,
        cognitive_services_account=AIServicesAccountKey(key=azure_ai_services_key, subdomain_url=azure_ai_services_endpoint)
    )

def create_skillset():
    split_skill = SplitSkill(
        description="Skill to split document text into smaller manageable chunks",
        text_split_mode="pages",
        context="/document",
        maximum_page_length=2000,
        page_overlap_length=500,
        inputs=[
            InputFieldMappingEntry(name="text", source="/document/content"),
        ],
        outputs=[
            OutputFieldMappingEntry(name="textItems", target_name="pages")
        ]
    )

    embedding_skill = AzureOpenAIEmbeddingSkill(
        description="Skill used to generate embeddings via Azure Open AI embedding model",
        context="/document/pages/*",
        resource_url=azure_openai_endpoint,
        deployment_name=azure_openai_embedding_deployment,
        model_name=azure_openai_embedding_model,
        dimensions=azure_openai_vector_dimension,
        api_key=azure_openai_api_key,
        inputs=[
            InputFieldMappingEntry(name="text", source="/document/pages/*"),
        ],
        outputs=[
            OutputFieldMappingEntry(name="embedding", target_name="vector")
        ]
    )

    index_projections = SearchIndexerIndexProjection(
        selectors=[
            SearchIndexerIndexProjectionSelector(
                target_index_name=index_name,
                parent_key_field_name="parent_id",
                source_context="/document/pages/*",
                mappings=[
                    InputFieldMappingEntry(name="chunk", source="/document/pages/*"),
                    InputFieldMappingEntry(name="vector", source="/document/pages/*/vector"),
                    InputFieldMappingEntry(name="title", source="/document/metadata_storage_name"),
                ]
            )
        ],
        parameters=SearchIndexerIndexProjectionsParameters(
            projection_mode=IndexProjectionMode.SKIP_INDEXING_PARENT_DOCUMENTS
        )
    )

    skills = [split_skill, embedding_skill]

    return SearchIndexerSkillset(
        name=skillset_name,
        skills=skills,
        index_projection=index_projections,
        description="Skillset that enables document chunking and embedding generation",
        cognitive_services_account=AIServicesAccountKey(key=azure_ai_services_key, subdomain_url=azure_ai_services_endpoint)
    )

if use_layout:
    skillset = create_layout_skillset()
else:
    skillset = create_skillset()

indexer_client.create_or_update_skillset(skillset)
print(f"Created skillset {skillset.name}")

Created skillset integrated-vector-layout-index-skillset


In [0]:
help(DocumentIntelligenceLayoutSkill)

Help on class DocumentIntelligenceLayoutSkill in module azure.search.documents.indexes._generated.models._models_py3:

class DocumentIntelligenceLayoutSkill(SearchIndexerSkill)
 |  DocumentIntelligenceLayoutSkill(*, inputs: List[ForwardRef('_models.InputFieldMappingEntry')], outputs: List[ForwardRef('_models.OutputFieldMappingEntry')], name: Optional[str] = None, description: Optional[str] = None, context: Optional[str] = None, output_mode: Union[str, ForwardRef('_models.DocumentIntelligenceLayoutSkillOutputMode')] = 'oneToMany', markdown_header_depth: Union[str, ForwardRef('_models.DocumentIntelligenceLayoutSkillMarkdownHeaderDepth')] = 'h6', **kwargs: Any) -> None
 |  
 |  A skill that extracts content and layout information (as markdown), via Azure AI Services, from
 |  files within the enrichment pipeline.
 |  
 |  All required parameters must be populated in order to send to server.
 |  
 |  :ivar odata_type: A URI fragment specifying the type of skill. Required.
 |  :vartype odat

#### Create Indexer

In [0]:
from azure.search.documents.indexes.models import (
    SearchIndexer,
    IndexingParameters,
    IndexingParametersConfiguration,
    BlobIndexerImageAction
)

# Define indexer name  
indexer_name = f"{index_name}-indexer"

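index_parameters = None
if use_layout:
  index_parameters = IndexingParameters(
    configuration=IndexingParametersConfiguration(
      # The layout skill reads the raw file bytes from /document/file_data
      allow_skillset_to_read_file_data=True,
      # query_timeout applies to SQL data sources only; None keeps it out of the request for this blob data source
      query_timeout=None,
    )
  )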

indexer = SearchIndexer(
  name=indexer_name,
  description="Indexer to orchestrate the document indexing and embedding generation",
  skillset_name=skillset_name,
  target_index_name=index_name,
  data_source_name=data_source.name,
  parameters=index_parameters
)

indexer_result = indexer_client.create_or_update_indexer(indexer)

# Run the indexer to kick off the indexing process
indexer_client.run_indexer(indexer_name)
print(f' {indexer_name} is created and running. If queries return no results, please wait a bit and try again.')

 integrated-vector-layout-index-indexer is created and running. If queries return no results, please wait a bit and try again.
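
Rather than waiting and retrying by hand, the indexer status can be polled until the run finishes. The sketch below assumes `SearchIndexerClient.get_indexer_status` returns an object whose `last_result.status` is `"inProgress"` while running and `"success"` or `"transientFailure"` when done; `wait_for_indexer` and its timing parameters are hypothetical names of mine.

```python
import time


def wait_for_indexer(client, name, poll_seconds=5.0, timeout_seconds=600.0):
    """Poll client.get_indexer_status(name) until the last run reaches a final state."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = client.get_indexer_status(name)
        last = status.last_result  # None until a first run has been recorded
        state = last.status if last is not None else None
        if state in ("success", "transientFailure"):
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Indexer {name!r} still in state {state!r}")
        time.sleep(poll_seconds)


# Usage in this notebook (needs the live service, so it is left commented out):
# final_state = wait_for_indexer(indexer_client, indexer_name)
# print(f"Indexer finished with state: {final_state}")
```

A `"transientFailure"` result is returned rather than raised so the caller can inspect the indexer's error list before deciding whether to rerun.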


#### Perform a vector similarity search

This example shows a pure vector search using a vectorizable text query. You pass in text, and the vectorizer configured on the index handles the query vectorization.

If you indexed the health plan PDF file, send queries that ask plan-related questions.
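
The `@search.score` printed for a pure vector query is not the raw cosine similarity. Assuming the index uses the cosine metric (which I believe is the default when `HnswAlgorithmConfiguration` is created without parameters), Azure AI Search documents the score as `1 / (1 + cosine_distance)`, so the similarity can be recovered as `2 - 1 / score`. The function name below is hypothetical.

```python
def cosine_similarity_from_score(score: float) -> float:
    """Invert score = 1 / (1 + (1 - cosine_similarity)); valid for the cosine metric only."""
    if not 0.0 < score <= 1.0:
        raise ValueError("a cosine-metric @search.score lies in (0, 1]")
    return 2.0 - 1.0 / score


# 0.814661 is the score printed by the first query cell in this notebook.
print(cosine_similarity_from_score(0.814661))
```

This conversion is useful when comparing results against similarities computed outside the service, for example with embeddings generated locally.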

In [0]:
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery

# Pure Vector Search
query = "Which is more comprehensive, Northwind Health Plus vs Northwind Standard?"

search_client = SearchClient(endpoint=search_endpoint, index_name=index_name, credential=search_credential)
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=1, fields="vector", exhaustive=True)
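# To supply a precomputed query embedding instead of relying on the index vectorizer, use VectorizedQuery
# (from azure.search.documents.models). generate_embeddings is a helper that this notebook does not define.
# vector_query = VectorizedQuery(vector=generate_embeddings(query), k_nearest_neighbors=3, fields="vector")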
  
results = search_client.search(  
    search_text=None,  
    vector_queries= [vector_query],
    top=1
)  
  
for result in results:  
    print(f"parent_id: {result['parent_id']}")  
    print(f"chunk_id: {result['chunk_id']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"Content: {result['chunk']}")   

parent_id: aHR0cHM6Ly9ibG9ic3RvcmUwNS5ibG9iLmNvcmUud2luZG93cy5uZXQvaW50ZWdyYXRlZC12ZWN0b3JpemF0aW9uL0JlbmVmaXRfT3B0aW9ucy5wZGY1
chunk_id: 9cf44e99b2e2_aHR0cHM6Ly9ibG9ic3RvcmUwNS5ibG9iLmNvcmUud2luZG93cy5uZXQvaW50ZWdyYXRlZC12ZWN0b3JpemF0aW9uL0JlbmVmaXRfT3B0aW9ucy5wZGY1_markdownDocument_3_pages_0
Score: 0.814661
Content: Both plans offer coverage for routine physicals, well-child visits, immunizations, and other preventive
care services. The plans also cover preventive care services such as mammograms, colonoscopies, and
other cancer screenings.

Northwind Health Plus offers more comprehensive coverage than Northwind Standard. This plan offers
coverage for emergency services, both in-network and out-of-network, as well as mental health and
substance abuse coverage. Northwind Standard does not offer coverage for emergency services, mental
health and substance abuse coverage, or out-of-network services.

Both plans offer coverage for prescription drugs. Northwind Health Plus offers
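
The `parent_id` printed above looks opaque, but it appears to be the blob URL encoded as URL-safe Base64 in the "URL token" style, where the final digit records how many `=` padding characters were removed. Under that assumption it can be decoded as follows; `decode_url_token` is a hypothetical helper name.

```python
import base64


def decode_url_token(token: str) -> str:
    """Decode URL-safe Base64 whose last character is the number of stripped '=' pads."""
    pad_count = int(token[-1])
    body = token[:-1] + "=" * pad_count
    return base64.urlsafe_b64decode(body).decode("utf-8")


# The parent_id value printed by the query cell above.
parent_id = (
    "aHR0cHM6Ly9ibG9ic3RvcmUwNS5ibG9iLmNvcmUud2luZG93cy5uZXQvaW50ZWdyYXRlZC12"
    "ZWN0b3JpemF0aW9uL0JlbmVmaXRfT3B0aW9ucy5wZGY1"
)
print(decode_url_token(parent_id))
```

Decoding the key is a quick way to confirm which source file a chunk came from without adding a separate path field to the index.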

In [0]:
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery

# Pure Vector Search
query = "How much is the employee's cost per pay check for the Northwind Health Plus?"

search_client = SearchClient(endpoint=search_endpoint, index_name=index_name, credential=search_credential)
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=1, fields="vector", exhaustive=True)
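# To supply a precomputed query embedding instead of relying on the index vectorizer, use VectorizedQuery
# (from azure.search.documents.models). generate_embeddings is a helper that this notebook does not define.
# vector_query = VectorizedQuery(vector=generate_embeddings(query), k_nearest_neighbors=3, fields="vector")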
  
results = search_client.search(  
    search_text=None,  
    vector_queries= [vector_query],
    top=1
)  
  
for result in results:  
    print(f"parent_id: {result['parent_id']}")  
    print(f"chunk_id: {result['chunk_id']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"Content: {result['chunk']}")   

parent_id: aHR0cHM6Ly9ibG9ic3RvcmUwNS5ibG9iLmNvcmUud2luZG93cy5uZXQvaW50ZWdyYXRlZC12ZWN0b3JpemF0aW9uL0JlbmVmaXRfT3B0aW9ucy5wZGY1
chunk_id: 9cf44e99b2e2_aHR0cHM6Ly9ibG9ic3RvcmUwNS5ibG9iLmNvcmUud2luZG93cy5uZXQvaW50ZWdyYXRlZC12ZWN0b3JpemF0aW9uL0JlbmVmaXRfT3B0aW9ucy5wZGY1_markdownDocument_4_pages_0
Score: 0.7760778
Content: Contoso Electronics deducts the employee's portion of the healthcare cost from each paycheck. This
means that the cost of the health insurance will be spread out over the course of the year, rather
than being paid in one lump sum. The employee's portion of the cost will be calculated based on the
selected health plan and the number of people covered by the insurance. The table below shows a
cost comparison between the different health plans offered by Contoso Electronics:


<table>
<tr>
<th rowspan="2"></th>
<th colspan="2">Employee's cost per paycheck</th>
</tr>
<tr>
<th>Northwind Standard</th>
<th>Northwind Health Plus</th>
</tr>
<tr>
<td>Employee Only

#### Perform a hybrid search + semantic reranking
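
A hybrid query runs the keyword search and the vector search side by side and merges the two ranked lists. Azure AI Search documents this merge as Reciprocal Rank Fusion (RRF); the semantic ranker then re-scores the top fused results, which is where `@search.reranker_score` comes from. The sketch below shows RRF only, with made-up document ids and `k=60` as an assumed constant; it is not the service's implementation.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each document scores sum(1 / (k + rank)) over the lists it appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical result lists from the keyword and vector sides of one query.
keyword_ranking = ["chunk_a", "chunk_b", "chunk_c"]
vector_ranking = ["chunk_b", "chunk_c", "chunk_a"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Because RRF uses only ranks, the keyword and vector scores never need to be on the same scale.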

In [0]:
from azure.search.documents.models import (
    QueryType,
    QueryCaptionType,
    QueryAnswerType
)
# Semantic Hybrid Search
query = "Which is more comprehensive, Northwind Health Plus vs Northwind Standard?"

search_client = SearchClient(endpoint=search_endpoint, index_name=index_name, credential=search_credential)
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=1, fields="vector", exhaustive=True)

results = search_client.search(  
    search_text=query,
    vector_queries=[vector_query],
    select=["parent_id", "chunk_id", "chunk"],
    query_type=QueryType.SEMANTIC,
    semantic_configuration_name='my-semantic-config',
    query_caption=QueryCaptionType.EXTRACTIVE,
    query_answer=QueryAnswerType.EXTRACTIVE,
    top=1
)

semantic_answers = results.get_answers()
if semantic_answers:
    for answer in semantic_answers:
        if answer.highlights:
            print(f"Semantic Answer: {answer.highlights}")
        else:
            print(f"Semantic Answer: {answer.text}")
        print(f"Semantic Answer Score: {answer.score}\n")

for result in results:
    print(f"parent_id: {result['parent_id']}")  
    print(f"chunk_id: {result['chunk_id']}")  
    print(f"Reranker Score: {result['@search.reranker_score']}")
    print(f"Content: {result['chunk']}")  

    captions = result["@search.captions"]
    if captions:
        caption = captions[0]
        if caption.highlights:
            print(f"Caption: {caption.highlights}\n")
        else:
            print(f"Caption: {caption.text}\n")

Semantic Answer: Contoso Electronics deducts the employee's portion of the healthcare... The table below shows a cost comparison between the different health plans offered by Contoso Electronics:      Employee's cost per paycheck<em>   Northwind Standard Northwind Health Plus </em>  Employee Only $45.00 $55.00   Employee +1 $65.00 $71.00   Employee +2 or more $78.00 $89.00.
Semantic Answer Score: 0.9549999833106995

parent_id: aHR0cHM6Ly9ibG9ic3RvcmUwNS5ibG9iLmNvcmUud2luZG93cy5uZXQvaW50ZWdyYXRlZC12ZWN0b3JpemF0aW9uL0JlbmVmaXRfT3B0aW9ucy5wZGY1
chunk_id: 9cf44e99b2e2_aHR0cHM6Ly9ibG9ic3RvcmUwNS5ibG9iLmNvcmUud2luZG93cy5uZXQvaW50ZWdyYXRlZC12ZWN0b3JpemF0aW9uL0JlbmVmaXRfT3B0aW9ucy5wZGY1_markdownDocument_3_pages_0
Reranker Score: 3.4568681716918945
Content: Both plans offer coverage for routine physicals, well-child visits, immunizations, and other preventive
care services. The plans also cover preventive care services such as mammograms, colonoscopies, and
other cancer screenings.

Northwi