#### In this notebook I demonstrate the Azure AI Search integrated vectorization feature, using the Split skill and the Azure OpenAI Embedding skill to index a glossary dataset in JSON format and build an agentic RAG solution on it.
* Key vault retrievals are implemented using the azure key vault sdk
* The data source object connection string parameter was updated to use the storage account resource id: ResourceId=/subscriptions/00000000-0000-8888-7777-555555555555/resourceGroups/rgxxx/providers/Microsoft.Storage/storageAccounts/blob00000store
* The NativeBlobSoftDeleteDeletionDetectionPolicy is not supported when the indexer parsingMode is set to delimitedText. Further research is required.
* Use the pinned packages in this directory's requirements.txt file for now, to avoid breaking changes in newer versions
* https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-document-intelligence-layout
* https://learn.microsoft.com/en-us/azure/search/search-how-to-semantic-chunking
* SplitSkill to chunk the data
* AzureOpenAIEmbedding skill to embed the dataset "Definition" field 
* Deploy the following services in the same region: Azure AI Document Intelligence, Azure AI Search, Azure OpenAI, AI Foundry, and Azure Blob Storage
* Enable system assigned managed identity
* Deploy text-embedding-3-small on Azure OpenAI (in Azure AI Foundry) for embeddings
* Deploy gpt-4o on Azure OpenAI for chat completion
* Configure search engine RBAC to Azure Blob Storage by adding a role for Storage Blob Data Reader, assigned to the search service system-managed identity
* Configure search engine RBAC to Azure Open AI by adding a role for Cognitive Services OpenAI User, assigned to the search service system-managed identity
* The model names and endpoint should be saved in AKV. Embedding skills and vectorizers assemble the full endpoint internally, so only the resource URI is needed. For example, given https://MY-FAKE-ACCOUNT.openai.azure.com/openai/deployments/text-embedding-3-large/embeddings?api-version=2024-06-01, the endpoint to provide in skill and vectorizer definitions is https://MY-FAKE-ACCOUNT.openai.azure.com.
* The Azure AI multiservice account is used for skills processing. The multiservice account key must be provided, even if RBAC is in use. The key isn't used on the connection, but it's currently used for billing purposes.
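
The endpoint note above can be sketched with the standard library. This is a minimal illustration, not part of the notebook's pipeline: `resource_uri` is a hypothetical helper name, and it assumes the value stored in Key Vault may be a full deployment URL that needs trimming to the resource URI.

```python
from urllib.parse import urlsplit


def resource_uri(full_endpoint: str) -> str:
    """Return scheme://host for an Azure OpenAI URL, dropping path and query."""
    parts = urlsplit(full_endpoint)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"Not an absolute URL: {full_endpoint!r}")
    return f"{parts.scheme}://{parts.netloc}"


# Example URL taken from the note above (a placeholder account name).
full = (
    "https://MY-FAKE-ACCOUNT.openai.azure.com/openai/deployments/"
    "text-embedding-3-large/embeddings?api-version=2024-06-01"
)
print(resource_uri(full))
```

Passing a value that is already a bare resource URI returns it unchanged, so the helper is safe to apply to whatever is stored in the vault.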

#### Import Required Packages

In [93]:
# %pip install azure
# %pip install azure-keyvault-secrets
# %pip install azure-storage-blob
# %pip install azure-identity azure-search-documents azure-storage-blob

In [1]:
from azure.core.credentials import AzureKeyCredential
from azure.storage.blob import BlobServiceClient
import base64
from openai import AzureOpenAI
import azure.identity
from azure.identity import DefaultAzureCredential, EnvironmentCredential, ManagedIdentityCredential, SharedTokenCacheCredential
from azure.identity import ClientSecretCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeResult, AnalyzeDocumentRequest
from azure.search.documents.indexes.models import (
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    AzureOpenAIVectorizer,
    AzureOpenAIVectorizerParameters,
    SearchIndex,
    BlobIndexerParsingMode,
    SemanticConfiguration, SemanticSearch, SemanticPrioritizedFields, SemanticField
)
from azure.search.documents.indexes import SearchIndexClient

import os
from azure.search.documents import SearchClient
from azure.identity import AzureAuthorityHosts
from azure.keyvault.secrets import SecretClient

#### Define Required Variables

In [95]:
# try:
#     keyVaultName = os.environ["KEY_VAULT_NAME"]
# except KeyError:
#     # Get input from user if not set
#     keyVaultName = input("Please enter your Key Vault name: ")
#     # Save for future cells in this session
#     os.environ["KEY_VAULT_NAME"] = keyVaultName

In [2]:
# keyVaultName = os.environ["KEY_VAULT_NAME"]
keyVaultName = "akvlab00"  # Replace with your Key Vault name
KVUri = f"https://{keyVaultName}.vault.azure.net"

credential = DefaultAzureCredential()
client = SecretClient(vault_url=KVUri, credential=credential)

In [15]:
"""
This code loads and sets the necessary variables for Azure services.
The variables are loaded from Azure Key Vault.
"""
# Azure OpenAI
azure_openai_endpoint = client.get_secret(name="aoai-endpoint").value
azure_openai_api_key = client.get_secret(name="aoai-api-key").value
azure_openai_api_version = "2024-02-15-preview"
# Embedding
# azure_openai_embedding_deployment = client.get_secret(name="aoai-embedding-deployment").value
azure_openai_embedding_deployment = "text-embedding-3-small"
azure_openai_embedding_model = client.get_secret(name="aoai-embedding-model").value
azure_openai_vector_dimension = 1536
# AI Search
search_credential = AzureKeyCredential(client.get_secret(name="aisearch-key").value)
search_endpoint = client.get_secret(name="aisearch-endpoint").value
source = 'json'
index_name = f"{source}-glossary-index"
data_source_connection_name = f"{source}-glossary-ds"
# Blob Storage
blob_container_name = f"{source}-data"
blob_storage_name = client.get_secret(name="blobstore-account-name").value
# Cognitive Services
# azure_ai_cognitive_services_key = client.get_secret(name="azure-ai-cognitive-services-key").value
# azure_ai_cognitive_services_endpoint = client.get_secret(name="azure-ai-cognitive-services-endpoint").value
# Azure AI services (multiservice account used for skillset billing)
azure_ai_services_key = client.get_secret(name="azure-ai-services-key").value
azure_ai_services_endpoint = client.get_secret(name="azure-ai-services-endpoint").value

#### Create Azure AI Search Datasource Object

In [16]:
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import (
    SearchIndexerDataContainer,
    SearchIndexerDataSourceConnection,
    SearchIndexerDataIdentity,
    SearchIndexerDataUserAssignedIdentity,
    SearchIndexerDataNoneIdentity
)
from azure.search.documents.indexes.models import (
    NativeBlobSoftDeleteDeletionDetectionPolicy,
    HighWaterMarkChangeDetectionPolicy,
    DataChangeDetectionPolicy
)

indexer_client = SearchIndexerClient(
    endpoint=search_endpoint, credential=search_credential
)
indexer_container = SearchIndexerDataContainer(name=blob_container_name)
# The connection string is the storage account resource ID (ResourceId=...),
# so the search service connects with its managed identity rather than an account key.
resource_id = client.get_secret(name="ds-resource-id").value
data_source_connection = SearchIndexerDataSourceConnection(
    name=data_source_connection_name,
    type="azureblob",
    connection_string=resource_id,
    container=indexer_container,
)
# Create the data source object
data_source = indexer_client.create_or_update_data_source_connection(data_source_connection=data_source_connection)

print(f"Data source {data_source.name} created or updated successfully.")

Data source json-glossary-ds created or updated successfully.
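
The notes at the top of this notebook give the data source connection string as a storage account resource ID. As a rough sketch of how that string is put together, the helper below assembles it from its parts. `blob_resource_id_connection` is a hypothetical name, and the IDs are the placeholder values from those notes, not real resources.

```python
def blob_resource_id_connection(subscription_id: str, resource_group: str, storage_account: str) -> str:
    """Build a ResourceId-style connection string for a blob data source."""
    for label, value in (
        ("subscription_id", subscription_id),
        ("resource_group", resource_group),
        ("storage_account", storage_account),
    ):
        if not value or "/" in value:
            raise ValueError(f"Invalid {label}: {value!r}")
    return (
        f"ResourceId=/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{storage_account}"
    )


# Placeholder values copied from the notes at the top of the notebook.
print(
    blob_resource_id_connection(
        "00000000-0000-8888-7777-555555555555", "rgxxx", "blob00000store"
    )
)
```

With this form, access depends on the Storage Blob Data Reader role assigned to the search service identity, so no account key has to be stored in the connection string.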


#### Create Azure AI Search Index

In [17]:
fields = [
    SearchField(name="chunk_id", type=SearchFieldDataType.String, searchable=True, filterable=False, sortable=True, key=True, facetable=False, analyzer_name="keyword"),
    SearchField(name="parent_id", type=SearchFieldDataType.String, searchable=False, filterable=True, sortable=False, key=False, facetable=False),
    SearchField(name="chunk", type=SearchFieldDataType.String, searchable=True, filterable=False, sortable=False, facetable=False, key=False),
    SearchField(name="title",type=SearchFieldDataType.String, searchable=True, sortable=True, filterable=True, facetable=True),
    SearchField(name="text_vector",type=SearchFieldDataType.Collection(SearchFieldDataType.Single), searchable=True, filterable=False, sortable=False, facetable=False, key=False, vector_search_dimensions=azure_openai_vector_dimension, vector_search_profile_name="myHnswProfile"),
    SearchField(name="gender",type=SearchFieldDataType.String, searchable=True, sortable=True, filterable=True, facetable=True),
    SearchField(name="definition",type=SearchFieldDataType.String, searchable=True, sortable=True, filterable=True, facetable=True),
    SearchField(name="context",type=SearchFieldDataType.String, searchable=True, sortable=True, filterable=True, facetable=True),
    SearchField(name="note",type=SearchFieldDataType.String, searchable=True, sortable=True, filterable=True, facetable=True),
    SearchField(name="incorrectTerm",type=SearchFieldDataType.String, searchable=True, sortable=True, filterable=True, facetable=True),
    SearchField(name="domain",type=SearchFieldDataType.String, searchable=True, sortable=True, filterable=True, facetable=True),
    SearchField(name="modificationDate",type=SearchFieldDataType.String, searchable=True, sortable=True, filterable=True, facetable=True),
    SearchField(name="source",type=SearchFieldDataType.String, searchable=True, sortable=True, filterable=True, facetable=True),
    SearchField(name="link",type=SearchFieldDataType.String, searchable=True, sortable=True, filterable=True, facetable=True),
    SearchField(name="englishTerm",type=SearchFieldDataType.String, searchable=True, sortable=True, filterable=True, facetable=True),
    SearchField(name="creationDate",type=SearchFieldDataType.String, searchable=True, sortable=True, filterable=True, facetable=True),
]

# Define the vector search configuration and parameters
vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(name="myHsnw")

    ],
    profiles=[
        VectorSearchProfile(
            name="myHnswProfile",
            algorithm_configuration_name="myHsnw",
            vectorizer_name="myOpenAI"
        )
    ],
    vectorizers=[
        AzureOpenAIVectorizer(
            vectorizer_name="myOpenAI",
            kind="azureOpenAI",
            parameters=AzureOpenAIVectorizerParameters(
                resource_url=azure_openai_endpoint,
                deployment_name=azure_openai_embedding_deployment,
                model_name=azure_openai_embedding_model,
            )
        )
    ]
)

# Configure semantic search on the index
semantic_config = SemanticConfiguration(
    name="my-semantic-config",
    prioritized_fields=SemanticPrioritizedFields(
        title_field=SemanticField(field_name="chunk"),
        content_fields=[SemanticField(field_name="chunk"), SemanticField(field_name="context"), SemanticField(field_name="note"), SemanticField(field_name="incorrectTerm")],
        keywords_fields=[SemanticField(field_name="chunk"), SemanticField(field_name="context"), SemanticField(field_name="note"), SemanticField(field_name="incorrectTerm")],
    )
)

# Create the semantic search config
semantic_search = SemanticSearch(configurations=[semantic_config])

scoring_profiles = []

In [18]:
# Create a search index client required to create the index
index_client = SearchIndexClient(endpoint=search_endpoint, credential=search_credential)

index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search, scoring_profiles=scoring_profiles, semantic_search=semantic_search)
result = index_client.create_or_update_index(index=index)
print(f"{result.name} created")

json-glossary-index created


> #### Create Required Skillsets for the document extraction processes and operations.
Skills drive integrated vectorization. The Text Split skill provides data chunking. The AzureOpenAIEmbedding skill handles calls to Azure OpenAI, using the connection information loaded from Azure Key Vault earlier in this notebook. An index projection specifies the index that receives the chunked data.
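
To make the chunking settings used below more concrete, here is a simplified sketch of fixed-size splitting with overlap. It is not how the Split skill is implemented: the real skill in `pages` mode also tries to break on sentence boundaries, while this hypothetical `split_with_overlap` helper only slices by character count to show how a maximum page length of 2000 and an overlap of 500 interact.

```python
def split_with_overlap(text: str, max_len: int = 2000, overlap: int = 500) -> list[str]:
    """Slice text into chunks of at most max_len characters, each sharing
    `overlap` characters with the previous chunk."""
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("Require max_len > 0 and 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_len])
        if start + max_len >= len(text):
            break  # the current chunk already reaches the end of the text
        start += step
    return chunks


sample = "".join(chr(97 + i % 26) for i in range(4500))
pages = split_with_overlap(sample)
print([len(p) for p in pages])
```

A definition shorter than the maximum page length comes back as a single chunk, which is probably the common case for short glossary entries.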

In [20]:
# Import required libraries
from azure.search.documents.indexes.models import (
    SplitSkill,
    AzureOpenAIEmbeddingSkill,
    OcrSkill,
    SearchIndexerSkillset,
    DocumentIntelligenceLayoutSkill,
    DocumentIntelligenceLayoutSkillMarkdownHeaderDepth,
    InputFieldMappingEntry,
    OutputFieldMappingEntry,
    SearchIndexerIndexProjection,
    SearchIndexerIndexProjectionSelector,
    SearchIndexerIndexProjectionsParameters,
    IndexProjectionMode,
    AIServicesAccountKey,
    AIServicesAccountIdentity
)

skillset_name = f"{index_name}-skillset"

def create_skillset():
    split_skill = SplitSkill(
        description="Split skill to chunk documents",
        text_split_mode="pages",
        default_language_code="en",
        context="/document",
        maximum_page_length=2000,
        page_overlap_length=500,
        maximum_pages_to_take=0,
        unit="characters",
        inputs=[
            InputFieldMappingEntry(name="text", source="/document/definition"),
        ],
        outputs=[
            OutputFieldMappingEntry(name="textItems", target_name="pages")
        ]
    )

    embedding_skill = AzureOpenAIEmbeddingSkill(
        description="Skill to generate embeddings via Azure OpenAI",
        context="/document/pages/*",
        resource_url=azure_openai_endpoint,
        deployment_name=azure_openai_embedding_deployment,
        model_name=azure_openai_embedding_model,
        dimensions=azure_openai_vector_dimension,
        api_key=azure_openai_api_key,
        inputs=[
            InputFieldMappingEntry(name="text", source="/document/pages/*"), # Each chunk produced by the split skill
        ],
        outputs=[
            OutputFieldMappingEntry(name="embedding", target_name="text_vector") # Store the embedding as text_vector on each chunk of the enriched doc
        ]
    )

    index_projections = SearchIndexerIndexProjection(
        selectors=[
            SearchIndexerIndexProjectionSelector(
                target_index_name=index_name,
                parent_key_field_name="parent_id",
                source_context="/document/pages/*",
                mappings=[
                    InputFieldMappingEntry(name="text_vector", source="/document/pages/*/text_vector"),
                    InputFieldMappingEntry(name="chunk", source="/document/pages/*"),
                    InputFieldMappingEntry(name="title", source="/document/title"),
                    InputFieldMappingEntry(name="gender", source="/document/gender"),
                    InputFieldMappingEntry(name="definition", source="/document/definition"),
                    InputFieldMappingEntry(name="incorrectTerm", source="/document/incorrectTerm"),
                    InputFieldMappingEntry(name="domain", source="/document/domain"),
                    InputFieldMappingEntry(name="englishTerm", source="/document/englishTerm"),
                    InputFieldMappingEntry(name="creationDate", source="/document/creationDate"),
                    InputFieldMappingEntry(name="modificationDate", source="/document/modificationDate"),
                    InputFieldMappingEntry(name="source", source="/document/source"),
                    InputFieldMappingEntry(name="link", source="/document/link"),
                    InputFieldMappingEntry(name="context", source="/document/context"),
                    InputFieldMappingEntry(name="note", source="/document/note"),

                ]
            )
        ],
        parameters=SearchIndexerIndexProjectionsParameters(
            projection_mode=IndexProjectionMode.SKIP_INDEXING_PARENT_DOCUMENTS
        )
    )

    skills = [split_skill, embedding_skill]

    return SearchIndexerSkillset(
        name=skillset_name,
        description="Skillset to chunk documents and generate embeddings",
        skills=skills,
        index_projection=index_projections,
        cognitive_services_account=AIServicesAccountKey(key=azure_ai_services_key, subdomain_url=azure_ai_services_endpoint)
    )

skillset = create_skillset()


indexer_client.create_or_update_skillset(skillset)
print(f"Created skillset {skillset.name}")

Created skillset json-glossary-index-skillset
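
The index projection in the skillset fans one source glossary entry out into one search document per chunk, with the parent fields repeated on each. The sketch below imitates that shape with plain dictionaries. It is only an illustration: `project_chunks` is a hypothetical helper, the vectors are dummy values, and the `chunk_id` format is invented here because the service generates the real keys.

```python
def project_chunks(parent_key: str, parent_fields: dict, chunks: list, vectors: list) -> list:
    """Return one index document per chunk, copying the parent fields onto each."""
    if len(chunks) != len(vectors):
        raise ValueError("Each chunk needs exactly one vector")
    docs = []
    for i, (chunk, vector) in enumerate(zip(chunks, vectors)):
        doc = {
            "chunk_id": f"{parent_key}_pages_{i}",  # invented key format, for illustration only
            "parent_id": parent_key,
            "chunk": chunk,
            "text_vector": vector,
        }
        doc.update(parent_fields)  # title, note, context, and so on
        docs.append(doc)
    return docs


parent = {"title": "DNA", "note": "Essential for inheritance."}
docs = project_chunks("doc1", parent, ["first chunk", "second chunk"], [[0.1, 0.2], [0.3, 0.4]])
for d in docs:
    print(d["chunk_id"], d["parent_id"], d["title"])
```

Because the projection mode skips indexing parent documents, only these chunk-level documents end up in the index.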


#### Create Indexer

In [21]:
from azure.search.documents.indexes.models import (
    SearchIndexer,
    IndexingParameters,
    IndexingParametersConfiguration,
    BlobIndexerImageAction
)

# Define indexer name  
indexer_name = f"{index_name}-indexer"

index_parameters = IndexingParameters(
    configuration=IndexingParametersConfiguration(
        data_to_extract="contentAndMetadata",
        parsing_mode="jsonArray",  # alternative: jsonLines
        document_root="terms/term",  # path to the array of terms inside each JSON file
        # fail_on_unprocessable_document=False,
        # fail_on_unsupported_content_type=False,
        # first_line_contains_headers=True,
        query_timeout=None,  # left unset; this setting is meant for Azure SQL data sources
        # allow_skillset_to_read_file_data=True
    )
)

indexer = SearchIndexer(
    name=indexer_name,
    description="Indexer to orchestrate the document indexing and embedding generation",
    skillset_name=skillset_name,
    target_index_name=index_name,
    data_source_name=data_source.name,
    parameters=index_parameters,
)

indexer_result = indexer_client.create_or_update_indexer(indexer)

# Run the indexer to kick off the indexing process
indexer_client.run_indexer(indexer_name)
print(f' {indexer_name} is created and running. If queries return no results, please wait a bit and try again.')

# To schedule the indexer (for example, to run every 24 hours), see the SDK sample:
# https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/search/azure-search-documents/samples/sample_indexer_datasource_skillset.py

 json-glossary-index-indexer is created and running. If queries return no results, please wait a bit and try again.
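
Instead of waiting and retrying queries by hand, the indexer status can be polled until the run finishes. The sketch below keeps the polling logic separate from the SDK so it runs on its own. `wait_for_indexer` is a hypothetical helper, and the status strings (`inProgress`, `success`, `transientFailure`, `reset`) are the execution status values I would expect from the service. In this notebook the callable could be something like `lambda: indexer_client.get_indexer_status(indexer_name).last_result.status`, assuming `last_result` is already populated.

```python
import time
from typing import Callable


def wait_for_indexer(
    get_status: Callable[[], str],
    timeout_s: float = 300.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status until it stops returning 'inProgress' or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status != "inProgress":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Indexer still running after {timeout_s} seconds")
        sleep(poll_s)


# Stand-in for the service: two polls in progress, then success.
_fake_statuses = iter(["inProgress", "inProgress", "success"])
print(wait_for_indexer(lambda: next(_fake_statuses), poll_s=0, sleep=lambda s: None))
```

Any status other than `success` is returned to the caller as well, so the caller decides whether to inspect errors or rerun.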


#### Perform a vector similarity search

This example shows a pure vector search using a vectorizable text query. You pass in text, and the vectorizer configured on the index handles the query vectorization. A plain text search runs first for comparison.

The index holds glossary entries, so send queries that ask about glossary terms.

In [None]:
# Optimize Search - Searching algorithm/ Return multiple Scores/ values
# Add Incorrect Term - Done
# Language
# Json or XML
# Another alternate for Searching Algorithm

#### Text Search

In [22]:
# Text Search
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery

query = "DNA"

search_client = SearchClient(endpoint=search_endpoint, index_name=index_name, credential=search_credential)

results = search_client.search(
    search_text=query,
    select=["chunk", "note", "context", "incorrectTerm"],
    top=5
)

for result in results:
    # print(f"id: {result['id']}")
    print(f"Score: {result['@search.score']}")
    print(f"Definition: {result['chunk']}")
    print(f"note: {result['note']}")
    print(f"context: {result['context']}")
    print(f"incorrectTerm: {result['incorrectTerm']}")


Score: 8.409295
Definition: Deoxyribonucleic acid (DNA) is a molecule composed of two polynucleotide chains that coil around each other to form a double helix carrying genetic instructions for the development, functioning, growth, and reproduction of all known organisms and many viruses.
note: DNA is essential for inheritance, coding for proteins, and the genetic instruction guide for life and its processes.
context: DNA is used in genetic research and forensic science.
incorrectTerm: Deoxyribose Nucleic Acid


In [23]:
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery

# Pure Vector Search
# query = "What can you tell me about application programming interface ?"
query = "What is Agile"

search_client = SearchClient(endpoint=search_endpoint, index_name=index_name, credential=search_credential)
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=5, fields="text_vector", exhaustive=True)
# print(vector_query)

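# To pass a precomputed embedding instead of relying on query vectorization, use VectorizedQuery
# (named RawVectorQuery in early preview SDKs). generate_embeddings is not defined in this notebook.
# vector_query = VectorizedQuery(vector=generate_embeddings(query), k_nearest_neighbors=3, fields="text_vector")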
  
results = search_client.search(  
    search_text=None,  
    vector_queries= [vector_query],
    # select=["chunk"],
    top=1
)  
  
for result in results:  
    # print(f"id: {result['id']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"Content: {result['chunk']}")
    print(f"note: {result['note']}")  

Score: 0.5916212
Content: Software Development is the process of conceiving, specifying, designing, programming, and testing applications and systems.
note: Agile methodologies are popular in modern software development.
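
The vector score above is easier to interpret as a cosine similarity. The sketch below assumes the index uses the default cosine metric, for which Azure AI Search documents the score as `1 / (1 + cosine_distance)`, with cosine distance equal to one minus cosine similarity. `score_to_cosine_similarity` is a hypothetical helper name.

```python
def score_to_cosine_similarity(score: float) -> float:
    """Invert score = 1 / (1 + cosine_distance), where distance = 1 - similarity."""
    if not 0.0 < score <= 1.0:
        raise ValueError("A cosine-metric search score lies in the interval (0, 1]")
    cosine_distance = (1.0 - score) / score
    return 1.0 - cosine_distance


# Score taken from the recorded output of the vector search cell.
print(round(score_to_cosine_similarity(0.5916212), 4))
```

Under that assumption, the recorded score corresponds to a cosine similarity of about 0.31, a fairly weak match, which fits the result being a neighbouring term rather than a definition of Agile.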


#### Perform a hybrid search + semantic reranking
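
A hybrid query runs the text search and the vector search separately and merges the two ranked lists. Azure AI Search documents Reciprocal Rank Fusion (RRF) for this step, after which the semantic ranker reorders the top fused results. The sketch below is a generic RRF implementation, not the service's code. The document IDs are made up, and `k=60` is the constant commonly cited for RRF, used here as an assumption.

```python
def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse ranked lists of document IDs: each list adds 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


text_ranking = ["a", "b", "c"]    # hypothetical keyword search order
vector_ranking = ["b", "c", "a"]  # hypothetical vector search order
fused = reciprocal_rank_fusion([text_ranking, vector_ranking])
print([doc_id for doc_id, _ in fused])
```

Document `b` comes out on top because it ranks well in both lists, even though it is first in only one of them.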

In [25]:
from azure.search.documents.models import (
    QueryType,
    QueryCaptionType,
    QueryAnswerType
)
# Semantic Hybrid Search
query = " What is Agile?"

search_client = SearchClient(endpoint=search_endpoint, index_name=index_name, credential=search_credential)
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=5, fields="text_vector", exhaustive=True)

results = search_client.search(  
    search_text=query,
    vector_queries=[vector_query],
    select=["context", "chunk",  "note", "incorrectTerm"],
    query_type=QueryType.SEMANTIC,
    semantic_configuration_name='my-semantic-config',
    query_caption=QueryCaptionType.EXTRACTIVE,
    query_answer=QueryAnswerType.EXTRACTIVE,
    top=3
)

semantic_answers = results.get_answers()
if semantic_answers:
    for answer in semantic_answers:
        # A semantic answer carries key, text, highlights, and score. Index fields such as
        # note, context, or incorrectTerm are read from the matching search result below.
        if answer.highlights:
            print(f"Semantic Answer: {answer.highlights}")
        else:
            print(f"Semantic Answer: {answer.text}")
        print(f"Semantic Answer Score: {answer.score}\n")

for result in results:
    # print(f"id: {result['id']}")  
    print(f"note: {result['note']}")
    print(f"context: {result['context']}")
    print(f"incorrectTerm: {result['incorrectTerm']}") 
    print(f"Reranker Score: {result['@search.reranker_score']}")
    print(f"Content: {result['chunk']}")  

    captions = result["@search.captions"]
    if captions:
        caption = captions[0]
        if caption.highlights:
            print(f"Caption: {caption.highlights}\n")
        else:
            print(f"Caption: {caption.text}\n")

note: Agile methodologies are popular in modern software development.
context: It is used to create various software solutions across industries.
incorrectTerm: Software Developement
Reranker Score: 2.37892746925354
Content: Software Development is the process of conceiving, specifying, designing, programming, and testing applications and systems.
Caption: <em>Software Development </em>is the process of conceiving, specifying,<em> designing, programming, and testing applications and systems.</em> It is used to create various software solutions across industries. Agile methodologies are popular in modern<em> software development.</em> Software Developement.

note: Focusing on UX can significantly improve customer satisfaction.
context: UX is critical in product design and service delivery.
incorrectTerm: User Experence
Reranker Score: 1.7951167821884155
Content: User Experience (UX) refers to a person's emotions and attitudes about using a particular product, system or service.
Caption: