## Part 3 -- Configuration Deep Dive: Empowering Conversations with Vector Storage

To build LLM applications on Azure Cognitive Search we first need to set up its components. As explained in the first part, these are indexes, indexers, a knowledge store, data sources and skillsets. In this part we create the backend functions that support all of this.

If you prefer, check the backend folder for the entire code, which is also very well documented.

### create_indexes

This function wraps everything we have to do; it is called later from the backend, which is implemented as an Azure Function.
It first instantiates the `DocumentIndexManager` and creates the document index resources with `create_document_index_resources`:

- Index.
- Indexer.
- Datasource.
- Skillset with custom skill (OpenAI embedding generator).

Remember that these resources are tied to the source documents: PDF, Word, Excel, PowerPoint, Markdown, or any other supported format. At this point we don't have any vector storage yet. However, when we create the skillset we also define a knowledge store, which means the output of the custom skill is saved into the knowledge store. More on this later.


Then we instantiate the `ChunkIndexManager`, which creates the chunk index resources using `create_chunk_index_resources`:

- Index.
- Indexer.
- Datasource.

In this second set of resources, the indexer uses a datasource that points to our knowledge store. Remember that a knowledge store is just a storage account; it holds the projections generated by the previous step, which are a large number of JSON files containing the embeddings. You will see this later in the code.
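
To make the shape of those projections more concrete, here is a minimal sketch of one projected chunk as it might appear in the knowledge store container. The field names come from the projection inputs and the chunk index schema shown later in this part; all values are invented for illustration, and the 3-element embedding stands in for the real 1536-dimension vector.

```python
import json

# Field names defined by the chunk index schema later in this part.
CHUNK_INDEX_FIELDS = {
    "id", "source_document_id", "source_document_filepath", "source_field_name",
    "title", "index", "offset", "length", "hash", "text", "embedding",
}

# Hypothetical content of a single projected JSON blob (values are made up).
sample_projection = json.loads("""
{
  "id": "generated-key-0001",
  "source_document_id": "aHR0cHM6Ly9leGFtcGxl0",
  "source_document_filepath": "handbook.pdf",
  "source_field_name": "content",
  "title": "handbook.pdf",
  "text": "First chunk of the document...",
  "embedding": [0.12, -0.03, 0.27],
  "index": 0,
  "offset": 0,
  "length": 30
}
""")


def unknown_fields(document: dict, index_fields: set) -> set:
    """Keys in the projected document that the index does not define."""
    return set(document) - index_fields


def missing_fields(document: dict, index_fields: set) -> set:
    """Index fields that the projected document does not provide."""
    return index_fields - set(document)


print(unknown_fields(sample_projection, CHUNK_INDEX_FIELDS))
print(missing_fields(sample_projection, CHUNK_INDEX_FIELDS))
```

Because the chunk indexer parses each blob as JSON, every key that matches an index field name can be picked up without extra field mappings, which is why the projection inputs and the chunk index fields share their names.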

In [None]:
import time

from chunkindexmanager import ChunkIndexManager
from documentindexmanager import DocumentIndexManager


def create_indexes(prefix, customer_storage_connection_string, container_name, config):
    """
    Create the document-level and chunk-level search resources in Azure Search.
    """
    # Stage 1: index, datasource, skillset and indexer over the source documents
    index_manager = DocumentIndexManager()
    doc_index_resources = index_manager.create_document_index_resources(prefix, customer_storage_connection_string, container_name, config)
    # Short pause before creating the second set of resources
    time.sleep(5)
    # Stage 2: index, datasource and indexer over the knowledge store projections;
    # the storage details are derived from prefix and config
    chunk_index_manager = ChunkIndexManager()
    chunk_index_resources = chunk_index_manager.create_chunk_index_resources(prefix, config)
    return {"doc_index_resources": doc_index_resources, "chunk_index_resources": chunk_index_resources}
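
The `config` argument is a plain dictionary. As a rough sketch of how it could be assembled, the helper below collects the keys that the code in this part reads from `config` and fails early when one is missing. `build_config` and `REQUIRED_KEYS` are hypothetical names, and the knowledge store key name is an assumption based on the environment variable used in the utilities module.

```python
import os

# Keys read from `config` in this part; the last one is an assumed name.
REQUIRED_KEYS = (
    "AZURE_SEARCH_SERVICE_ENDPOINT",
    "AZURE_SEARCH_ADMIN_KEY",
    "AZURE_SEARCH_EMBEDDING_SKILL_ENDPOINT",
    "AZURE_SEARCH_COGNITIVE_SERVICES_KEY",
    "AZURE_KNOWLEDGE_STORE_STORAGE_CONNECTION_STRING",
)


def build_config(env=None) -> dict:
    """Collect the required settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        # Fail before any Azure resource is created with a partial config
        raise KeyError(f"Missing configuration values: {', '.join(missing)}")
    return {key: env[key] for key in REQUIRED_KEYS}
```

With such a helper, a call would look like `create_indexes("demo", storage_connection_string, "documents", build_config())`, where `"demo"` and `"documents"` are placeholder values.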



### Document Indexing Manager

The code below defines a Python class called *DocumentIndexManager* that creates, manages, and deletes the resources needed to index documents with Azure Cognitive Search: the document index, the datasource, the skillset, and the indexer. Let's break down its main components:

- **_create_document_index:** This function creates a document index within Azure Cognitive Search. It defines the schema of the index, specifying fields such as document ID, content, filesize, filepath, and more, and marks which of them are searchable, filterable, sortable, or retrievable.

- **_create_document_datasource:** This function establishes a blob datasource within Azure Search, allowing documents to be ingested from a specified storage container. The function takes inputs like the index prefix, storage connection string, container name, and Azure Search configuration to create the datasource.

- **_create_document_skillset:** This function defines a skillset, the set of skills applied to the ingested content to extract or enrich information. The code defines OpenAI embedding, OCR, merge, and image analysis skills, but only the OpenAI embedding skill is added to the skillset in this project; the others are there for you to try. A knowledge store has to be defined together with the skillset because the output of the custom skill needs a place to be saved.

- **_create_document_indexer:** This function creates an indexer that connects the datasource to the document index. The indexer specifies how data should be processed and ingested into the index, including field mappings and indexing parameters. It utilizes the previously defined skillset to enhance the indexed content.

- **create_document_index_resources:** This function orchestrates the creation of all necessary resources for document indexing. It invokes the previously defined functions to create the index, datasource, skillset, and indexer. After setting up these resources, it waits for the indexer to complete its processing.

- **delete_document_index_resources:** This function cleans up the resources associated with a document index. It deletes the index, indexer, datasource, and skillset, and then deletes the knowledge store blob container that holds the chunk projections for that index.
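
The embedding skill in the skillset is a custom Web API skill, so the indexer posts a JSON body with a `values` array and expects one result per `recordId` in return. The sketch below shows that exchange under a few assumptions: the chunking is a naive fixed-size split, `embed` is a placeholder for the real OpenAI embedding call, and `handle_skill_request` and `split_text` are hypothetical names. The input names and the chunk layout follow the skill inputs and projection paths used in the code of this part.

```python
from typing import Callable, List


def split_text(text: str, size: int) -> List[dict]:
    """Cut text into fixed-size character windows, keeping each offset."""
    pieces = []
    for start in range(0, len(text), size):
        content = text[start:start + size]
        pieces.append({"content": content, "offset": start, "length": len(content)})
    return pieces


def handle_skill_request(body: dict, embed: Callable[[str], List[float]], chunk_size: int = 1000) -> dict:
    """Build a custom-skill response: one output record per input record."""
    results = []
    for record in body.get("values", []):
        data = record["data"]
        chunks = []
        for i, piece in enumerate(split_text(data["text"], chunk_size)):
            chunks.append({
                "title": data["filepath"],
                "content": piece["content"],
                "embedding_metadata": {
                    "fieldname": data["fieldname"],
                    "embedding": embed(piece["content"]),  # placeholder embedding call
                    "index": i,
                    "offset": piece["offset"],
                    "length": piece["length"],
                },
            })
        results.append({
            "recordId": record["recordId"],  # must echo the incoming id
            "data": {"chunks": chunks},
            "errors": [],
            "warnings": [],
        })
    return {"values": results}
```

The `chunks` output is what the knowledge store projection walks with `/document/content/chunks/*`, so each element ends up as one JSON file in the chunk container.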

In short, the **DocumentIndexManager** class wraps every step needed to set up and tear down document indexing in Azure Cognitive Search, so the rest of the backend only has to call `create_document_index_resources` and `delete_document_index_resources`.
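
One detail worth illustrating is the `base64Encode` field mapping that turns `metadata_storage_path` into the `document_id` key, since a blob URL contains characters that are not allowed in a document key. To our understanding, when the mapping function is given no parameters it uses a URL-safe Base64 variant in which the trailing `=` padding is replaced by a digit holding the padding count. The function below is a local approximation of that behaviour for illustration, not the service's implementation, and `url_token_encode` is a hypothetical name.

```python
import base64


def url_token_encode(value: str) -> str:
    """URL-safe Base64 with '=' padding replaced by the padding count."""
    if not value:
        return ""
    raw = base64.urlsafe_b64encode(value.encode("utf-8")).decode("ascii")
    stripped = raw.rstrip("=")
    # The final digit records how many padding characters were removed
    return stripped + str(len(raw) - len(stripped))


# Example with a made-up blob URL
print(url_token_encode("https://example.blob.core.windows.net/docs/handbook.pdf"))
```

The resulting string contains only letters, digits, dashes and underscores, which makes it usable as a key in the document index.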

The code is well documented; use it at your own risk.

In [None]:
import time

from azure.storage.blob import BlobServiceClient
from azure.core.exceptions import ResourceNotFoundError
from azure.search.documents.indexes.models import (
    SimpleField,
    SearchableField,
    SearchFieldDataType,
    SearchIndexer,
    IndexingParameters,
    FieldMapping,
    FieldMappingFunction,
    InputFieldMappingEntry,
    OutputFieldMappingEntry,
    SearchIndexerSkillset,
    SearchIndexerKnowledgeStore,
    SearchIndexerKnowledgeStoreProjection,
    SearchIndexerKnowledgeStoreFileProjectionSelector,
    WebApiSkill,
    OcrSkill,
    ImageAnalysisSkill,
    MergeSkill,
    CognitiveServicesAccountKey
)

from utilities import (
    get_index_name,
    create_index,
    get_datasource_name,
    create_blob_datasource,
    get_indexer_name,
    get_indexer_client,
    get_knowledge_store_connection_string,
    get_chunk_index_blob_container_name,
    wait_for_indexer_completion,
    get_index_client,
    get_skillset_name
)


class DocumentIndexManager():
    def _create_document_index(self, index_prefix, config):
        """
        Creates a document index in Azure Search with the given index_prefix and config.

        Args:
            index_prefix (str): The prefix to use for the index name.
            config (SearchServiceClientConfiguration): The configuration for the Azure Search service.

        Returns:
            Index: The created document index.
        """
        # Get the name for the index
        name = get_index_name(index_prefix)
        # Define the fields for the index
        fields = [
            SimpleField(name="document_id", type=SearchFieldDataType.String, filterable=True, sortable=True, key=True),
            SearchableField(name="content", type=SearchFieldDataType.String),
            SimpleField(name="filesize", type=SearchFieldDataType.Int64),
            SimpleField(name="filepath", type=SearchFieldDataType.String),
            SearchableField(name="metadata_storage_name", type=SearchFieldDataType.String, filterable=True, retrievable=True),
            SimpleField(name="metadata_storage_path", type=SearchFieldDataType.String, retrievable=True),
            SearchableField(name="merged_content", type=SearchFieldDataType.String, retrievable=True),
            SimpleField(name="text", type="Collection(Edm.String)", retrievable=True, searchable=True),
            SimpleField(name="layoutText", type="Collection(Edm.String)", retrievable=True, searchable=True)
        ]
        # Create the index using the custom utility function
        return create_index(name, fields, config=config, vector_search=None, semantic_title_field_name="filepath", semantic_content_field_names=["content"])

    def _create_document_datasource(self, index_prefix, storage_connection_string, container_name, config):
        """
        Creates a blob datasource in Azure Search with the given index_prefix, storage_connection_string, container_name, and config.

        Args:
            index_prefix (str): The prefix to use for the datasource name.
            storage_connection_string (str): The connection string for the storage account.
            container_name (str): The name of the container to index.
            config (SearchServiceClientConfiguration): The configuration for the Azure Search service.

        Returns:
            DataSource: The created blob datasource.
        """
        # Get the name for the datasource
        name = get_datasource_name(index_prefix)
        # Create the datasource using the custom utility function
        return create_blob_datasource(name, storage_connection_string, container_name, config)

    def _create_document_skillset(self, index_prefix, config, content_field_name="content"):
        """
        Creates a skillset for a document using Azure Search.

        Args:
            index_prefix (str): The prefix for the index.
            config (dict): The configuration dictionary.
            content_field_name (str, optional): The name of the content field. Defaults to "content".

        Returns:
            Skillset: The created skillset.
        """

        # Get the endpoint for the embedding skill from the configuration dictionary
        embedding_skill_endpoint = config['AZURE_SEARCH_EMBEDDING_SKILL_ENDPOINT']

        # Get the name of the skillset
        name = get_skillset_name(index_prefix)

        # Get the name of the chunk index blob container
        chunk_index_blob_container_name = get_chunk_index_blob_container_name(index_prefix)

        # Define the content context
        content_context = f"/document/{content_field_name}"

        # Define the embedding skill
        embedding_skill = WebApiSkill(
            name="chunking-embedding-skill",
            uri=embedding_skill_endpoint,
            timeout="PT3M",
            batch_size=1,
            degree_of_parallelism=1,
            context=content_context,
            inputs=[
                InputFieldMappingEntry(name="document_id", source="/document/document_id"),
                InputFieldMappingEntry(name="text", source=content_context),
                InputFieldMappingEntry(name="filepath", source="/document/filepath"),
                InputFieldMappingEntry(name="fieldname", source=f"='{content_field_name}'")
            ],
            outputs=[OutputFieldMappingEntry(name="chunks", target_name="chunks")]
        )

        # Define the OCR skill
        ocr_skill = OcrSkill(
            name="ocr-skill",
            context=content_context,
            inputs=[InputFieldMappingEntry(name="image", source="/document/normalized_images/*")],
            outputs=[
                OutputFieldMappingEntry(name="text", target_name="text"),
                OutputFieldMappingEntry(name="layoutText", target_name="layoutText")
            ]
        )

        # Define the merge skill
        merge_skill = MergeSkill(
            name="merge-skill",
            context="/document",
            inputs=[
                InputFieldMappingEntry(name="text", source="/document/content"),
                InputFieldMappingEntry(name="itemsToInsert", source="/document/normalized_images/*/text"),  # Example field
                InputFieldMappingEntry(name="offsets", source="/document/normalized_images/*/contentOffset")  # Example field
            ],
            outputs=[
                OutputFieldMappingEntry(name="mergedText", target_name="merged_text")
            ]
        )

        # Define the ImageAnalysisSkill
        image_analysis_skill = ImageAnalysisSkill(
            name="image-analysis-skill",
            context=content_context,
            inputs=[InputFieldMappingEntry(name="image", source="/document/normalized_images/*")],  # Add inputs parameter
            visual_features=["tags", "description"],
            outputs=[
                OutputFieldMappingEntry(name="categories", target_name="categories"),
                OutputFieldMappingEntry(name="tags", target_name="tags"),
                OutputFieldMappingEntry(name="description", target_name="description"),
                OutputFieldMappingEntry(name="faces", target_name="faces")
            ]
        )

        # Define the knowledge store
        knowledge_store = SearchIndexerKnowledgeStore(
            storage_connection_string=get_knowledge_store_connection_string(config),
            projections=[
                SearchIndexerKnowledgeStoreProjection(
                    objects=[SearchIndexerKnowledgeStoreFileProjectionSelector(
                        storage_container=chunk_index_blob_container_name,
                        generated_key_name="id",
                        source_context=f"{content_context}/chunks/*",
                        inputs=[
                            InputFieldMappingEntry(name="source_document_id", source="/document/document_id"),
                            InputFieldMappingEntry(name="source_document_filepath", source="/document/filepath"),
                            InputFieldMappingEntry(name="source_field_name", source=f"{content_context}/chunks/*/embedding_metadata/fieldname"),
                            InputFieldMappingEntry(name="title", source=f"{content_context}/chunks/*/title"),
                            InputFieldMappingEntry(name="text", source=f"{content_context}/chunks/*/content"),
                            InputFieldMappingEntry(name="embedding", source=f"{content_context}/chunks/*/embedding_metadata/embedding"),
                            InputFieldMappingEntry(name="index", source=f"{content_context}/chunks/*/embedding_metadata/index"),
                            InputFieldMappingEntry(name="offset", source=f"{content_context}/chunks/*/embedding_metadata/offset"),
                            InputFieldMappingEntry(name="length", source=f"{content_context}/chunks/*/embedding_metadata/length")
                        ]
                    )]
                ),
                SearchIndexerKnowledgeStoreProjection(
                    files=[SearchIndexerKnowledgeStoreFileProjectionSelector(
                        storage_container=f"{chunk_index_blob_container_name}images",
                        generated_key_name="imagepath",
                        source="/document/normalized_images/*",
                        inputs=[]
                    )]
                )
            ]
        )

        # Define the cognitive services account
        cognitiveservicesaccount = CognitiveServicesAccountKey(description="Cognitive Services Account", key=config['AZURE_SEARCH_COGNITIVE_SERVICES_KEY'])

        # Define the skillset
        skillset = SearchIndexerSkillset(
            name=name,
            skills=[embedding_skill], #here more skills can be added
            description=name,
            knowledge_store=knowledge_store,
            cognitive_services_account=cognitiveservicesaccount
        )

        # Create the skillset using the indexer client
        client = get_indexer_client(config)
        return client.create_skillset(skillset)

    def _create_document_indexer(self, index_prefix, data_source_name, index_name, skillset_name, config, content_field_name="content", generate_page_images=True):
        """
        Creates an indexer in Azure Search with the given index_prefix, data_source_name, index_name, skillset_name, config, content_field_name, and generate_page_images.

        Args:
            index_prefix (str): The prefix to use for the indexer name.
            data_source_name (str): The name of the data source to use for the indexer.
            index_name (str): The name of the index to use for the indexer.
            skillset_name (str): The name of the skillset to use for the indexer.
            config (dict): The configuration for the Azure Search service.
            content_field_name (str): The name of the content field to use for the indexer. Defaults to "content".
            generate_page_images (bool): Whether to generate normalized images for each page of the document. Defaults to True.

        Returns:
            Indexer: The created indexer.
        """
        # Get the name for the indexer
        name = get_indexer_name(index_prefix)

        # Define the indexer configuration based on the generate_page_images parameter
        indexer_config = {"dataToExtract": "contentAndMetadata", "imageAction": "generateNormalizedImagePerPage"} if generate_page_images else {"dataToExtract": "contentAndMetadata"}

        # Define the indexing parameters
        parameters = IndexingParameters(max_failed_items=-1, configuration=indexer_config)

        # Define the field mappings for the indexer
        field_mappings = [
            FieldMapping(source_field_name="metadata_storage_path", target_field_name="document_id", mapping_function=FieldMappingFunction(name="base64Encode", parameters=None)),
            FieldMapping(source_field_name="metadata_storage_name", target_field_name="filepath"),
            FieldMapping(source_field_name="metadata_storage_size", target_field_name="filesize")
        ]

        # Define the output field mappings for the indexer
        output_field_mappings = []

        # Create the indexer using the custom utility function
        indexer = SearchIndexer(
            name=name,
            data_source_name=data_source_name,
            target_index_name=index_name,
            skillset_name=skillset_name,
            field_mappings=field_mappings,
            output_field_mappings=output_field_mappings,
            parameters=parameters
        )
        indexer_client = get_indexer_client(config)
        return indexer_client.create_indexer(indexer)

    def create_document_index_resources(self, index_prefix, customer_storage_connection_string, customer_container_name, config) -> dict:
        """
        Creates the necessary resources for a document index in Azure Search with the given index_prefix, customer_storage_connection_string, customer_container_name, and config.

        Args:
            index_prefix (str): The prefix to use for the index, data source, indexer, and skillset names.
            customer_storage_connection_string (str): The connection string for the customer's storage account.
            customer_container_name (str): The name of the container in the customer's storage account.
            config (dict): The configuration for the Azure Search service.

        Returns:
            dict: A dictionary containing the names of the created index, data source, indexer, and skillset.
        """
        # Create the index, data source, skillset, and indexer using the custom utility functions
        index_name = self._create_document_index(index_prefix, config).name
        data_source_name = self._create_document_datasource(index_prefix, customer_storage_connection_string, customer_container_name, config).name
        skillset_name = self._create_document_skillset(index_prefix, config).name
        time.sleep(5)
        indexer_name = self._create_document_indexer(index_prefix, data_source_name, index_name, skillset_name, config=config).name
        wait_for_indexer_completion(indexer_name, config=config)

        # Return a dictionary containing the names of the created index, data source, indexer, and skillset
        return {"index_name": index_name, "data_source_name": data_source_name, "skillset_name": skillset_name, "indexer_name": indexer_name}

    def delete_document_index_resources(self, index_prefix, config):
        """
        Deletes the resources for a document index in Azure Search with the given index_prefix and config.

        Args:
            index_prefix (str): The prefix used for the index, data source, indexer, and skillset names.
            config (dict): The configuration for the Azure Search service.
        """
        # Get the index and indexer clients using the custom utility functions
        index_client = get_index_client(config)
        indexer_client = get_indexer_client(config)

        # Delete the index, indexer, data source, and skillset using the corresponding client methods
        index_client.delete_index(index=get_index_name(index_prefix))
        indexer_client.delete_indexer(indexer=get_indexer_name(index_prefix))
        indexer_client.delete_data_source_connection(data_source_connection=get_datasource_name(index_prefix))
        indexer_client.delete_skillset(skillset=get_skillset_name(index_prefix))

        # Look up the knowledge store storage account (same call as at creation time)
        knowledge_store_connection_string = get_knowledge_store_connection_string(config)

        # Delete the container directly from storage
        try:
            blob_service = BlobServiceClient.from_connection_string(knowledge_store_connection_string)
            blob_service.delete_container(get_chunk_index_blob_container_name(index_prefix))
        except ResourceNotFoundError:
            # Handle resource not found error
            pass


### ChunkIndexManager

This code defines a Python class called **ChunkIndexManager**, which creates, manages, and deletes the resources used to index the chunks produced from each document: a chunk index, a datasource, and an indexer. Let's break down the main components of the code:

- **_create_chunk_index:** This function creates a chunk index within Azure Cognitive Search. As with the document index, it defines the schema with fields such as id, source_document_id, title, text, embedding, and more. Additionally, it configures vector search using the HNSW algorithm with the cosine metric for the embedding field, which is what allows similarity searches over the chunk embeddings.

- **_create_chunk_datasource:** This function establishes a blob datasource for the chunk index. It takes inputs such as the index prefix, storage connection string, container name, and Azure Search configuration to create the datasource. This datasource allows the chunks of data (e.g., paragraphs, sections) from documents to be ingested.

- **_create_chunk_indexer:** This function creates an indexer for the chunk index. It connects the datasource to the index and specifies indexing parameters, including parsing_mode set to "json". The indexer processes the chunks of data from the datasource and indexes them in the chunk index.

- **create_chunk_index_resources:** This function orchestrates the creation of resources for chunk indexing. It invokes the previously defined functions to create the chunk index, datasource, and indexer. After setting up these resources, it waits for the indexer to complete its processing.

- **delete_chunk_index_resources:** This function cleans up the resources associated with chunk indexing. It deletes the chunk index, the chunk indexer, and the chunk datasource.
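
To give an intuition for what the vector configuration on the embedding field computes, here is a small brute-force sketch of cosine-similarity ranking in plain Python. HNSW is an approximate nearest-neighbour structure that avoids comparing the query against every vector, so this exact scan only illustrates the metric, not the algorithm. The function names and the 2-dimensional vectors are invented for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], chunks: Dict[str, Sequence[float]], k: int = 3) -> List[Tuple[str, float]]:
    """Score every chunk against the query and keep the k most similar."""
    scored = [(chunk_id, cosine_similarity(query, emb)) for chunk_id, emb in chunks.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy embeddings; real ones have 1536 dimensions
chunks = {"chunk-a": [1.0, 0.0], "chunk-b": [0.0, 1.0], "chunk-c": [1.0, 1.0]}
print(top_k([1.0, 0.0], chunks, k=2))
```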

In short, the **ChunkIndexManager** class wraps the steps needed to set up and tear down chunk-based indexing in Azure Cognitive Search, reading its input from the knowledge store that the document skillset populated.

import time

from azure.search.documents.indexes.models import (
    SimpleField,
    SearchField,
    SearchableField,
    SearchFieldDataType,
    SearchIndexer,
    IndexingParameters,
    VectorSearch,
    VectorSearchAlgorithmConfiguration
)

from utilities import (
    get_index_name,
    create_index,
    get_datasource_name,
    create_blob_datasource,
    get_indexer_name,
    get_indexer_client,
    get_knowledge_store_connection_string,
    get_chunk_index_blob_container_name,
    wait_for_indexer_completion,
    get_index_client
)


class ChunkIndexManager():

    def _create_chunk_index(self, index_prefix, config):
        """
        Creates a chunk index in Azure Search with the given index_prefix and config.

        Args:
            index_prefix (str): The prefix to use for the index name.
            config (SearchServiceClientConfiguration): The configuration for the Azure Search service.

        Returns:
            SearchIndex: The created index.
        """
        name = get_index_name(f"{index_prefix}-chunk")
        vector_search = VectorSearch(
            algorithm_configurations=[
                VectorSearchAlgorithmConfiguration(
                    name="my-vector-config",
                    kind="hnsw",
                    hnsw_parameters={
                        "m": 4,
                        "efConstruction": 400,
                        "efSearch": 1000,
                        "metric": "cosine"
                    }
                )
            ]
        )
        fields = [
            SimpleField(name="id", type=SearchFieldDataType.String,  filterable=True, sortable=True, key=True),
            SimpleField(name="source_document_id", type=SearchFieldDataType.String),
            SimpleField(name="source_document_filepath", type=SearchFieldDataType.String),
            SimpleField(name="source_field_name", type=SearchFieldDataType.String),
            SearchableField(name="title", type=SearchFieldDataType.String),
            SimpleField(name="index", type=SearchFieldDataType.Int64),
            SimpleField(name="offset", type=SearchFieldDataType.Int64),
            SimpleField(name="length", type=SearchFieldDataType.Int64),
            SimpleField(name="hash", type=SearchFieldDataType.String),
            SearchableField(name="text", type=SearchFieldDataType.String),
            SearchField(name="embedding",
                        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                        searchable=True,
                        vector_search_dimensions=1536,
                        vector_search_configuration="my-vector-config")
        ]
        index = create_index(name, fields, vector_search=vector_search, semantic_title_field_name="title", semantic_content_field_names=["text"], config=config)
        return index

    def _create_chunk_datasource(self, index_prefix, storage_connection_string, container_name, config):
        """
        Creates a blob data source for the chunk index with the given index_prefix, storage_connection_string, container_name, and config.

        Args:
            index_prefix (str): The prefix to use for the data source name.
            storage_connection_string (str): The connection string for the Azure Storage account.
            container_name (str): The name of the blob container.
            config (SearchServiceClientConfiguration): The configuration for the Azure Search service.

        Returns:
            SearchIndexerDataSource: The created data source.
        """
        name = get_datasource_name(f"{index_prefix}-chunk")
        return create_blob_datasource(name, storage_connection_string, container_name, config=config)

    def _create_chunk_indexer(self, index_prefix, data_source_name, index_name, config):
        """
        Creates an indexer for the chunk index with the given index_prefix, data_source_name, index_name, and config.

        Args:
            index_prefix (str): The prefix to use for the indexer name.
            data_source_name (str): The name of the data source.
            index_name (str): The name of the index.
            config (SearchServiceClientConfiguration): The configuration for the Azure Search service.

        Returns:
            SearchIndexer: The created indexer.
        """
        name = get_indexer_name(f"{index_prefix}-chunk")
        parameters = IndexingParameters(configuration={"parsing_mode": "json"})
        indexer = SearchIndexer(
            name=name,
            data_source_name=data_source_name,
            target_index_name=index_name,
            parameters=parameters
        )
        indexer_client = get_indexer_client(config)
        return indexer_client.create_indexer(indexer)

    def create_chunk_index_resources(self, index_prefix, config) -> dict:
        """
        Creates the resources for the chunk index with the given index_prefix and config.

        Args:
            index_prefix (str): The prefix to use for the index, data source, and indexer names.
            config (SearchServiceClientConfiguration): The configuration for the Azure Search service.

        Returns:
            dict: A dictionary containing information about the created resources.
        """
        chunk_index_storage_connection_string = get_knowledge_store_connection_string(config)
        chunk_index_blob_container_name = get_chunk_index_blob_container_name(index_prefix)
        index_name = self._create_chunk_index(index_prefix, config).name
        data_source_name = self._create_chunk_datasource(index_prefix, chunk_index_storage_connection_string, chunk_index_blob_container_name, config=config).name
        time.sleep(5)
        indexer_name = self._create_chunk_indexer(index_prefix, data_source_name, index_name, config=config).name
        wait_for_indexer_completion(indexer_name, config=config)
        return {"index_name": index_name, "data_source_name": data_source_name, "indexer_name": indexer_name}

    def delete_chunk_index_resources(self, index_prefix, config):
        """
        Deletes the resources for the chunk index with the given index_prefix and config.

        Args:
            index_prefix (str): The prefix used for the index, data source, and indexer names.
            config (SearchServiceClientConfiguration): The configuration for the Azure Search service.
        """
        index_client = get_index_client(config)
        indexer_client = get_indexer_client(config)

        # Build the resource names with the same helpers used at creation time
        index_client.delete_index(index=get_index_name(f"{index_prefix}-chunk"))
        indexer_client.delete_indexer(indexer=get_indexer_name(f"{index_prefix}-chunk"))
        indexer_client.delete_data_source_connection(data_source_connection=get_datasource_name(f"{index_prefix}-chunk"))


### Utilities

This code provides utility functions and methods for interacting with Azure Cognitive Search services and Azure Blob Storage, particularly focused on managing index, datasource, and indexer resources. Let's break down the key components and functionalities:

- **Environment Variable Configuration:**
The module starts by reading configuration values from environment variables: `AZURE_SEARCH_SERVICE_ENDPOINT`, `AZURE_SEARCH_API_KEY` (the admin key for Azure Search), and `AZURE_KNOWLEDGE_STORE_STORAGE_CONNECTION_STRING` (the connection string of the storage account that backs the Knowledge Store). Note that the functions shown here take their settings from the `config` argument rather than from these module-level values.

- **Client Functions:**
`get_index_client` and `get_indexer_client` return instances of `SearchIndexClient` and `SearchIndexerClient` respectively. These clients are used to interact with Azure Cognitive Search indexes and indexers.

- **Utility Functions:**
Several utility functions generate resource names and wrap common operations:

  - **get_index_name, get_datasource_name, get_skillset_name, get_indexer_name, get_chunk_index_blob_container_name:** These functions derive the names of the Azure Search index, datasource, skillset, and indexer, and of the blob container used for chunk indexing, from an index prefix. The container name is lowercased and stripped of hyphens.

  - **get_knowledge_store_connection_string:** This function returns the connection string of the Knowledge Store storage account from the configuration.

  - **create_index:** This function creates an Azure Search index with the specified fields, vector search settings, and a semantic configuration named `default`. It uses `SearchIndexClient` to create the index.

  - **create_blob_datasource:** This function creates an Azure Search datasource for Azure Blob Storage. It sends a REST request because, as the code comment notes, the Python SDK does not yet support the blob soft delete policy. It then reads the datasource back through `SearchIndexerClient`.

  - **wait_for_indexer_completion:** This function polls the indexer status every 5 seconds until the last run reports `success`, and stops early if the indexer reports a `transientFailure`.

Together, these utilities streamline the creation, management, and monitoring of Azure Cognitive Search resources, with a particular focus on chunk indexing backed by Azure Blob Storage. The index managers build on them so that naming, client creation, and status polling are defined in one place.
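The environment variable names and the keys that the utility functions read from `config` differ slightly (for example `AZURE_SEARCH_API_KEY` versus `AZURE_SEARCH_ADMIN_KEY`). The following is a minimal sketch of how such a `config` dictionary could be assembled. The helper `build_config` and the mapping `ENV_TO_CONFIG_KEY` are assumptions made for illustration, not part of the backend code.

```python
import os

# Assumed mapping: environment variable name -> key read by the utility functions
ENV_TO_CONFIG_KEY = {
    "AZURE_SEARCH_SERVICE_ENDPOINT": "AZURE_SEARCH_SERVICE_ENDPOINT",
    "AZURE_SEARCH_API_KEY": "AZURE_SEARCH_ADMIN_KEY",
    "AZURE_KNOWLEDGE_STORE_STORAGE_CONNECTION_STRING": "AZURE_SEARCH_KNOWLEDGE_STORE_CONNECTION_STRING",
}


def build_config(environ=os.environ) -> dict:
    """Hypothetical helper: builds the config dict and fails early on missing values."""
    missing = [name for name in ENV_TO_CONFIG_KEY if not environ.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {key: environ[name] for name, key in ENV_TO_CONFIG_KEY.items()}


# Placeholder values stand in for real secrets
example_env = {
    "AZURE_SEARCH_SERVICE_ENDPOINT": "https://<service-name>.search.windows.net",
    "AZURE_SEARCH_API_KEY": "<admin-key>",
    "AZURE_KNOWLEDGE_STORE_STORAGE_CONNECTION_STRING": "<storage-connection-string>",
}
config = build_config(example_env)
print(sorted(config))
```

Failing early on a missing value tends to give a clearer error than a `KeyError` raised later from inside one of the client functions.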

In [None]:
import os
import time
import requests

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient, SearchIndexerClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SemanticSettings,
    SemanticConfiguration,
    PrioritizedFields,
    SemanticField
)

AZURE_SEARCH_SERVICE_ENDPOINT = os.getenv("AZURE_SEARCH_SERVICE_ENDPOINT")
AZURE_SEARCH_KEY = os.getenv("AZURE_SEARCH_API_KEY")
AZURE_SEARCH_KNOWLEDGE_STORE_CONNECTION_STRING = os.getenv("AZURE_KNOWLEDGE_STORE_STORAGE_CONNECTION_STRING")


def get_index_client(config) -> SearchIndexClient:
    """Returns a SearchIndexClient object for the specified Azure Search service."""
    return SearchIndexClient(config['AZURE_SEARCH_SERVICE_ENDPOINT'], AzureKeyCredential(config['AZURE_SEARCH_ADMIN_KEY']))


def get_indexer_client(config) -> SearchIndexerClient:
    """Returns a SearchIndexerClient object for the specified Azure Search service."""
    return SearchIndexerClient(config['AZURE_SEARCH_SERVICE_ENDPOINT'], AzureKeyCredential(config['AZURE_SEARCH_ADMIN_KEY']))


def get_index_name(index_prefix):
    """Returns the name of an Azure Search index given a prefix."""
    return f"{index_prefix}-index"


def get_datasource_name(index_prefix):
    """Returns the name of an Azure Search datasource given a prefix."""
    return f"{index_prefix}-datasource"


def get_skillset_name(index_prefix):
    """Returns the name of an Azure Search skillset given a prefix."""
    return f"{index_prefix}-skillset"


def get_indexer_name(index_prefix):
    """Returns the name of an Azure Search indexer given a prefix."""
    return f"{index_prefix}-indexer"


def get_chunk_index_blob_container_name(index_prefix):
    """Returns the name of an Azure Blob Storage container for chunk indexing given a prefix."""
    return f"{index_prefix}ChunkIndex".replace('-', '').lower()


def get_knowledge_store_connection_string(config):
    """Returns the connection string for an Azure Knowledge Store."""
    return config['AZURE_SEARCH_KNOWLEDGE_STORE_CONNECTION_STRING']


def create_index(index_name, fields, vector_search, semantic_title_field_name, semantic_content_field_names, config):
    """Creates an Azure Search index with the specified fields and semantic settings."""
    semantic_settings = SemanticSettings(
        configurations=[SemanticConfiguration(
            name='default',
            prioritized_fields=PrioritizedFields(
                title_field=SemanticField(field_name=semantic_title_field_name),
                prioritized_content_fields=[
                    SemanticField(field_name=field_name)
                    for field_name in semantic_content_field_names]))])
    index = SearchIndex(
        name=index_name,
        fields=fields,
        vector_search=vector_search,
        semantic_settings=semantic_settings)
    index_client = get_index_client(config)
    return index_client.create_index(index)


def create_blob_datasource(datasource_name, storage_connection_string, container_name, config):
    """Creates an Azure Search datasource for Azure Blob Storage with the specified connection string and container name."""
    # This example utilizes a REST request as the python SDK doesn't support the blob soft delete policy yet
    api_version = '2023-07-01-Preview'
    headers = {
        'Content-Type': 'application/json',
        'api-key': f'{config["AZURE_SEARCH_ADMIN_KEY"]}'
    }
    data_source = {
        "name": datasource_name,
        "type": "azureblob",
        "credentials": {"connectionString": storage_connection_string},
        "container": {"name": container_name},
        "dataDeletionDetectionPolicy": {"@odata.type": "#Microsoft.Azure.Search.NativeBlobSoftDeleteDeletionDetectionPolicy"}
    }

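    url = '{}/datasources/{}?api-version={}'.format(config['AZURE_SEARCH_SERVICE_ENDPOINT'], datasource_name, api_version)
    response = requests.put(url, json=data_source, headers=headers)
    # Fail fast if the service rejected the request, rather than failing later on the lookup below
    response.raise_for_status()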

    ds_client = get_indexer_client(config)
    return ds_client.get_data_source_connection(datasource_name)


def wait_for_indexer_completion(indexer_name, config):
    """Waits for an Azure Search indexer to complete indexing."""
    indexer_client = get_indexer_client(config)
    # Poll the status until the last indexer run reports success
    status = f"Indexer {indexer_name} not started yet"
    while True:
        # Query the service once per iteration
        last_result = indexer_client.get_indexer_status(indexer_name).last_result
        if last_result is not None:
            status = last_result.status
            if status == "success":
                break
        print(f"Indexing status:{status}")

        # It's possible that the indexer may reach a state of transient failure, especially when generating embeddings
        # via Open AI. For the purposes of the demo, we'll just break out of the loop and continue with the rest of the steps.
        if status == "transientFailure":
            print(f"Indexer {indexer_name} failed before fully indexing documents")
            break
        time.sleep(5)

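The polling loop above waits indefinitely if the indexer never reaches `success` or `transientFailure`. As a hedged sketch of one way to bound the wait (the function `wait_until_done` and its parameters are hypothetical, not part of the backend code), the status lookup can be passed in as a callable, which also lets the loop be tried out without an Azure service:

```python
import time


def wait_until_done(get_status, timeout_seconds=600, poll_seconds=5,
                    sleep=time.sleep, clock=time.monotonic):
    """Hypothetical variant of the polling loop with an upper bound on the wait.

    get_status is any callable returning the latest status string, or None
    if the indexer has not produced a result yet.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        # Both values end the wait, mirroring the loop in wait_for_indexer_completion
        if status in ("success", "transientFailure"):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Indexer still '{status}' after {timeout_seconds}s")
        sleep(poll_seconds)


# Simulated sequence of statuses instead of a real indexer
statuses = iter([None, "inProgress", "inProgress", "success"])
result = wait_until_done(lambda: next(statuses), poll_seconds=0)
print(result)
```

With a real client, `get_status` could wrap the same call used above, for example `lambda: getattr(indexer_client.get_indexer_status(indexer_name).last_result, "status", None)`.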

### Requirements

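The following pip packages are required to run the solution. Note that `azure-search-documents` is pinned to a pre-release (beta) version, so the class names imported in the utilities (for example `SemanticSettings` and `PrioritizedFields`) match that release and may differ in other versions.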

```
# DO NOT include azure-functions-worker in this file
# The Python Worker is managed by Azure Functions platform
# Manually managing azure-functions-worker may cause unexpected issues

azure-functions
langchain
openai
openai[datalib]
azure-storage-blob
azure-identity
azure-core
unstructured 
tiktoken
#pre release https://pypi.org/project/azure-search-documents/#history
azure-search-documents==11.4.0b6

```
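Because the code depends on a specific beta release, it can help to confirm what is actually installed before running it. The following is a minimal sketch using the standard library's `importlib.metadata`. The helper `check_pins` and the `PINNED` dictionary are hypothetical names introduced for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Pin taken from the requirements list above
PINNED = {"azure-search-documents": "11.4.0b6"}


def check_pins(pins, get_version=version):
    """Hypothetical helper: returns {package: (expected, installed_or_None)} for mismatches."""
    mismatches = {}
    for package, expected in pins.items():
        try:
            installed = get_version(package)
        except PackageNotFoundError:
            # Package is not installed in the current environment
            installed = None
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches


for package, (expected, installed) in check_pins(PINNED).items():
    print(f"{package}: expected {expected}, found {installed}")
```

An empty result means the installed versions match the pins.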

### Conclusion

Azure Cognitive Search offers powerful capabilities for search indexing, but wiring up its resources by hand can be daunting. The utility functions and code snippets in this part streamline the creation, configuration, and management of search indexes, datasources, and indexers. With naming, REST calls, and status polling wrapped in small helpers, developers can focus on building effective search solutions that deliver actionable insights from their data.

Incorporating these utilities into your development workflow shortens the path to a deployed search solution and gives the rest of the backend one consistent way to create and delete resources, so your organization can base its decisions on accurate and up-to-date information.