## 📚 Prerequisites

Before running this notebook, configure Azure AI services, set the required configuration parameters, and create a Conda environment for reproducibility. The setup instructions, including how to create the Conda environment, are in the [REQUIREMENTS.md](REQUIREMENTS.md) file.

## 📋 Table of Contents

This notebook lays the foundation for subsequent notebooks by guiding you through the creation of two Azure AI Search indexes. The first index will house content extracted from documents in SharePoint Online and Blob Storage. The second index will be dedicated to storing metadata extracted from images and audio files in Blob Storage.

> We'll be using the Azure AI Search SDK for Python (`azure-search-documents`) to accomplish this.

The notebook walks through creating an Azure AI Search index in the following sections:

1. [**Define Field Types**](#define-field-types): Outlines the process of defining the structure and behavior of an index using various field types.

2. [**Configuring Vector Search**](#configuring-vector-search): Discusses the setup of algorithms and profiles for handling vector-based queries.

3. [**Configuring Semantic Search**](#configuring-semantic-search): Explores how to enhance search capabilities by leveraging advanced AI models.

4. [**Create or Update Index**](#create-or-update-index): Details the steps to create a new index or update an existing one.

For additional information, refer to the following resources:
- [Azure AI Search Documentation](https://learn.microsoft.com/en-us/azure/search/)

In [1]:
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    ComplexField,
    ExhaustiveKnnAlgorithmConfiguration,
    ExhaustiveKnnParameters,
    HnswAlgorithmConfiguration,
    HnswParameters,
    SearchableField,
    SearchField,
    SearchFieldDataType,
    SearchIndex,
    SemanticConfiguration,
    SemanticField,
    SemanticPrioritizedFields,
    SemanticSearch,
    SimpleField,
    VectorSearch,
    VectorSearchAlgorithmKind,
    VectorSearchAlgorithmMetric,
    VectorSearchProfile,
)

# Define the target directory (change yours)
target_directory = r"C:\Users\pablosal\Desktop\gbbai-chat-with-your-database"

# Check if the directory exists
if os.path.exists(target_directory):
    # Change the current working directory
    os.chdir(target_directory)
    print(f"Directory changed to {os.getcwd()}")
else:
    print(f"Directory {target_directory} does not exist.")

Directory changed to C:\Users\pablosal\Desktop\gbbai-azure-ai-document-intelligence


In [2]:
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Read the service endpoint and admin key from the environment
endpoint = os.environ["AZURE_AI_SEARCH_SERVICE_ENDPOINT"]

# Create an SDK client for managing indexes.
# SearchIndexClient works at the service level, so it takes no index name;
# the index name is set on the SearchIndex object when the index is created.
admin_documents_index_client = SearchIndexClient(
    endpoint=endpoint,
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_ADMIN_KEY"]),
)
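
Because the client setup reads its settings straight from `os.environ`, a missing variable surfaces as a bare `KeyError`. The following is a minimal sketch of an up-front check. The helper name `find_missing_settings` is hypothetical and not part of the SDK, and the list of required names simply mirrors the variables used in this notebook.

```python
import os
from typing import Iterable, List, Mapping, Optional


def find_missing_settings(
    required: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names in `required` that are unset or empty in `env`.

    `env` defaults to os.environ; passing a plain dict makes the check testable.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Variable names taken from this notebook; adjust to match your .env file.
REQUIRED_SETTINGS = ["AZURE_AI_SEARCH_SERVICE_ENDPOINT", "AZURE_SEARCH_ADMIN_KEY"]

missing = find_missing_settings(REQUIRED_SETTINGS)
if missing:
    print(f"Missing settings (check your .env file): {', '.join(missing)}")
```

Running a check like this right after `load_dotenv()` gives a readable message before any client is constructed.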

# Creating the Index for Images and Audio

> We'll start by creating an index specifically for images and audio. Later, we'll adapt this process to suit our needs for document indexing.

## Define Field Types

### 🧠 Understanding Field Types in Azure AI Search

In Azure AI Search (formerly Azure Cognitive Search), the structure and behavior of an index are defined using various field types, each tailored for specific use cases. These field types are `SearchField`, `SimpleField`, `SearchableField`, and `ComplexField`.

- **SearchField**: This is the foundational field type for defining an index's schema. It encompasses a broad range of attributes that specify the field's role and behavior in the index. Key attributes include:
  - `name` and `type`: Define the field's identifier and data type.
  - `key`: Indicates if the field is a unique identifier for documents.
  - `searchable`: Specifies if the field undergoes full-text search analysis.
  - `filterable`, `sortable`, `facetable`: Determine how the field interacts with search queries.
  - Analyzers (`analyzer_name`, `search_analyzer_name`, `index_analyzer_name`): Configure text analysis for the field.
  - Advanced search attributes like `vector_search_dimensions` and `synonym_map_names`.
  - `fields`: For complex types, defining nested sub-fields. 

- **SimpleField**: A streamlined version of `SearchField`, designed for fields that don't require full-text search or advanced text analysis. It's typically used for non-textual data like identifiers and metadata, supporting attributes like `key`, `filterable`, `sortable`, and `facetable`.

- **SearchableField**: Tailored for fields that require full-text search capabilities, this type includes most of the attributes of `SearchField`. It's particularly suitable for fields with textual content that needs to be searchable, like titles, descriptions, or full text.

- **ComplexField**: Designed for fields that contain nested data structures, `ComplexField` allows you to define a field with sub-fields. It's characterized by:
  - `name`: The unique identifier for the field.
  - `collection`: A boolean indicating if the field is a collection of complex objects.
  - `fields`: A list of sub-fields, which can be of any field type, including nested `ComplexField`.

### How to Use These Field Types 🛠️

- **Creating Simple and Searchable Fields**: Use `SimpleField` for basic data types and `SearchableField` for text-heavy fields requiring search capabilities.

- **Designing Complex Data Structures**: Utilize `ComplexField` to model hierarchical or nested data within your index, defining each level of the hierarchy with appropriate sub-fields. 

- **Optimizing Search Behavior**: Leverage `SearchField` for granular control over search behavior, including the use of analyzers and advanced search features like vector search.

> **Note:** Full-text search analyzes and searches through all text within documents, considering language nuances and relevance. Non-full-text search, on the other hand, looks for exact matches or range queries in specific fields or attributes.
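
The difference described in the note can be illustrated with a toy, in-memory sketch. It is not how Azure AI Search analyzes text (real analyzers handle language rules, stemming, and relevance scoring); the tokenizer, the helper names, and the sample records are all assumptions made for illustration.

```python
import re
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric terms (toy analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def full_text_match(documents: List[Dict[str, str]], field: str, query: str) -> List[Dict[str, str]]:
    """Return documents whose analyzed field shares at least one term with the query."""
    query_terms = set(tokenize(query))
    return [doc for doc in documents if query_terms & set(tokenize(doc[field]))]


def exact_filter(documents: List[Dict[str, str]], field: str, value: str) -> List[Dict[str, str]]:
    """Return documents whose field equals the value exactly (no analysis)."""
    return [doc for doc in documents if doc[field] == value]


# Hypothetical sample records, shaped like the fields defined in this notebook.
sample_documents = [
    {"table_name": "sales_2023", "table_content": "Quarterly Sales figures by region"},
    {"table_name": "hr", "table_content": "Employee headcount"},
]

text_hits = full_text_match(sample_documents, "table_content", "sales")
filter_hits = exact_filter(sample_documents, "table_name", "sales")
print(len(text_hits), len(filter_hits))
```

The full-text query finds the first record even though the casing differs, while the exact filter on `table_name` finds nothing because `"sales"` is not the complete value.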

In [6]:
fields_query_index = [
    # The 'document_id' field serves as a unique identifier for each document.
    # It's a string, marked as a key, and is sortable, filterable, and facetable for efficient querying.
    SimpleField(
        name="document_id",
        type=SearchFieldDataType.String,
        key=True,
        sortable=True,
        filterable=True,
        facetable=True,
    ),
    # The 'table_content' field holds the searchable text for each entry.
    # It's searchable for detailed full-text queries.
    SearchableField(name="table_content", type=SearchFieldDataType.String),
    # The 'table_name' field stores the table's name as plain metadata.
    SimpleField(name="table_name", type=SearchFieldDataType.String),
    # The 'table_vector' field is a vector representation (embedding) of the entry.
    # It's used for vector search and configured with specific dimensions and a search profile.
    SearchField(
        name="table_vector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=1536,
        vector_search_profile_name="myHnswProfile",
    ),
]
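
Before uploading documents to an index with this schema, it can help to check each one locally. The sketch below is illustrative only: `validate_document` is a hypothetical helper, and the field names and the 1536 dimension count are copied from the field list above.

```python
from typing import Any, Dict, List

# Field names and vector size copied from the schema defined above.
INDEX_FIELDS = {"document_id", "table_content", "table_name", "table_vector"}
KEY_FIELD = "document_id"
VECTOR_FIELD = "table_vector"
VECTOR_DIMENSIONS = 1536


def validate_document(doc: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the document looks fine."""
    problems = []
    if not doc.get(KEY_FIELD):
        problems.append(f"missing key field '{KEY_FIELD}'")
    unexpected = sorted(set(doc) - INDEX_FIELDS)
    if unexpected:
        problems.append(f"fields not in the index schema: {unexpected}")
    vector = doc.get(VECTOR_FIELD)
    if vector is not None and len(vector) != VECTOR_DIMENSIONS:
        problems.append(
            f"'{VECTOR_FIELD}' has {len(vector)} dimensions, expected {VECTOR_DIMENSIONS}"
        )
    return problems


# Hypothetical document with a vector of the wrong length.
example = {"document_id": "1", "table_name": "sales", "table_vector": [0.0] * 3}
print(validate_document(example))
```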

## Configuring Vector Search

Configuring vector search in Azure AI Search involves setting up algorithms and profiles to handle vector-based queries. These are particularly useful for semantic search scenarios, such as finding similar items based on vector representations.

### Understanding the Configuration

The configuration consists of two main components: algorithm configurations and vector search profiles.

#### Algorithm Configurations:

1. **HnswAlgorithmConfiguration**: Hierarchical Navigable Small World (HNSW) is a high-performance algorithm for approximate nearest neighbor (ANN) search in high-dimensional spaces. It builds a multi-layer graph that lets a query reach nearby vectors quickly without scanning the whole index. The configuration includes:
   - `name`: A unique identifier for this configuration.
   - `kind`: Specifies the algorithm type, in this case HNSW.
   - `parameters`: Settings that let you customize HNSW's behavior for performance and accuracy:
     - `m`: Controls the number of bi-directional links created for each node in the graph, affecting search speed, accuracy, and memory use.
     - `ef_construction`: Sets the size of the candidate list used while building the graph, influencing index construction time and quality.
     - `ef_search`: Sets the size of the candidate list used at query time, determining the trade-off between search time and accuracy.
     - `metric`: Specifies the distance function used for measuring vector similarity, such as "cosine".

2. **ExhaustiveKnnAlgorithmConfiguration**: This is a brute-force algorithm that compares the query against every vector in the index at query time. It's slower than HNSW but returns the exact nearest neighbors, which makes it useful for small indexes or as a baseline when measuring recall. Like HNSW, it has `name`, `kind`, and a `metric` parameter, but none of the additional HNSW tuning parameters.
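
To make the brute-force idea concrete, here is a minimal pure-Python sketch of exhaustive k-nearest-neighbor search with cosine similarity. It only illustrates the technique and is not the service's implementation; the function names and the tiny two-dimensional vectors are assumptions for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def exhaustive_knn(
    query: Sequence[float], vectors: Dict[str, Sequence[float]], k: int
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and return the k most similar."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy index with 2-dimensional vectors.
toy_index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
top_two = exhaustive_knn([1.0, 0.1], toy_index, k=2)
print([doc_id for doc_id, _ in top_two])
```

Every query touches every vector, so the cost grows linearly with the index size. HNSW trades some recall to avoid exactly that cost.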


### Tuning HNSW Parameters for Optimal Performance

**Striking the Right Balance between Recall, Latency, and Indexing**

The HNSW algorithm parameters can be adjusted to optimize the performance of your vector search. Here are some strategies:

- **Increase `ef_search`**: This can improve recall without needing to reindex. However, monitor your system for potential latency increases. If increasing `ef_search` isn't effective or causes high latency, consider the next steps.

- **Reindex with higher values of `m` and/or `ef_construction`**: This can improve the quality of the search. However, keep in mind that increasing `ef_construction` may result in longer indexing latency.

- **Increase the `m` value**: This should be done carefully and only if other parameters don't sufficiently improve recall after trying the previous steps. Increasing `m` can improve the quality of the HNSW graph, but it may also increase the memory usage and indexing time.

Remember, tuning these parameters involves a trade-off between recall and latency. It's important to test different configurations and monitor their impact on your system's performance.   
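
One way to quantify the recall side of this trade-off is to compare the results of an approximate (HNSW) query with those of an exhaustive query for the same input. The sketch below assumes you have already collected both ranked ID lists; `recall_at_k` is a hypothetical helper, not an SDK function.

```python
from typing import List


def recall_at_k(approximate_ids: List[str], exact_ids: List[str], k: int) -> float:
    """Fraction of the exact top-k results that also appear in the approximate top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    exact_top = set(exact_ids[:k])
    if not exact_top:
        return 0.0
    found = exact_top & set(approximate_ids[:k])
    return len(found) / len(exact_top)


# Hypothetical result lists for a single query.
hnsw_results = ["doc1", "doc2", "doc9"]
exhaustive_results = ["doc1", "doc2", "doc3"]
print(recall_at_k(hnsw_results, exhaustive_results, k=3))
```

Averaging this value over a representative set of queries, before and after changing `ef_search`, `m`, or `ef_construction`, shows whether a change is worth its latency or indexing cost.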

#### Vector Search Profiles:

A vector search profile gives a name to an algorithm configuration so that vector fields can reference it. Each profile, like `myHnswProfile` or `myExhaustiveKnnProfile`, is linked to an algorithm configuration via `algorithm_configuration_name`, and a vector field selects a profile through `vector_search_profile_name`.

For example, you might have a profile `fastSearchProfile` linked to an HNSW configuration for general queries where speed is essential, and another profile `accurateSearchProfile` linked to an exhaustive KNN configuration for scenarios where precision is paramount.

```python
fastSearchProfile = {
    "name": "fastSearchProfile",
    "algorithm_configuration_name": "myHnswConfiguration"
}

accurateSearchProfile = {
    "name": "accurateSearchProfile",
    "algorithm_configuration_name": "myExhaustiveKnnConfiguration"
}
```
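
Since profiles refer to algorithm configurations by name, a typo in either name leaves a profile pointing at nothing. The following sketch checks that every profile resolves. It works on plain dictionaries like the ones above, and `unresolved_profiles` is a hypothetical helper introduced only for illustration.

```python
from typing import Dict, Iterable, List


def unresolved_profiles(
    algorithm_names: Iterable[str], profiles: List[Dict[str, str]]
) -> List[str]:
    """Return the names of profiles whose algorithm configuration is not defined."""
    known = set(algorithm_names)
    return [
        profile["name"]
        for profile in profiles
        if profile["algorithm_configuration_name"] not in known
    ]


# Illustrative names: the second profile points at a configuration that is not defined.
defined_algorithms = ["myHnswConfiguration"]
example_profiles = [
    {"name": "fastSearchProfile", "algorithm_configuration_name": "myHnswConfiguration"},
    {"name": "accurateSearchProfile", "algorithm_configuration_name": "myExhaustiveKnnConfiguration"},
]
print(unresolved_profiles(defined_algorithms, example_profiles))
```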

### Why Configure Vector Search This Way?

- **Flexibility**: Having different algorithms and profiles lets you tailor your search strategy to specific needs. For example, use HNSW for general queries where speed is essential and exhaustive KNN for scenarios where precision is paramount.

- **Tunable Performance**: HNSW algorithm parameters can be adjusted to find the right balance between speed and accuracy, making it adaptable to various datasets and search requirements.

- **Accuracy vs. Speed Trade-offs**: Exhaustive KNN offers high accuracy at the cost of speed and is suitable for scenarios where search completeness is critical.

In [None]:
!pip install -r requ

In [7]:
# Configure the vector search configuration
vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(
            name="myHnsw",
            kind=VectorSearchAlgorithmKind.HNSW,
            parameters=HnswParameters(
                m=5,
                ef_construction=300,
                ef_search=400,
                metric=VectorSearchAlgorithmMetric.COSINE,
            ),
        ),
        ExhaustiveKnnAlgorithmConfiguration(
            name="myExhaustiveKnn",
            kind=VectorSearchAlgorithmKind.EXHAUSTIVE_KNN,
            parameters=ExhaustiveKnnParameters(
                metric=VectorSearchAlgorithmMetric.COSINE
            ),
        ),
    ],
    profiles=[
        VectorSearchProfile(
            name="myHnswProfile",
            algorithm_configuration_name="myHnsw",
        ),
        VectorSearchProfile(
            name="myExhaustiveKnnProfile",
            algorithm_configuration_name="myExhaustiveKnn",
        ),
    ],
)

## Configuring Semantic Search

In Azure AI Search, `SemanticConfiguration` enhances search capabilities by leveraging advanced AI models to interpret the intent and context of search queries. This configuration is particularly useful for creating a more intuitive and context-aware search experience. The key components of this configuration include `SemanticPrioritizedFields` and `SemanticField`.

### SemanticPrioritizedFields

`SemanticPrioritizedFields` plays a critical role in guiding the semantic search engine towards the most relevant parts of your documents. It includes three main properties:

1. **Title Field (`title_field`)**: This field is typically given higher priority in semantic analysis. It's crucial for summarizing the document and is often used in generating captions, highlights, and determining semantic relevance.

2. **Content Fields (`content_fields`)**: These fields usually contain the bulk of the document's text in natural language. They provide detailed context and are essential for in-depth semantic analysis. The order of the fields indicates their priority, with higher-priority fields being more influential in the analysis.

3. **Keywords Fields (`keywords_fields`)**: These fields should contain key terms or concepts relevant to the document. They are used to enhance the semantic understanding of the document's main themes or topics.

### SemanticField

`SemanticField` specifies individual fields from the index to be used in the `SemanticPrioritizedFields`. Each `SemanticField` requires only one attribute:

- **Field Name (`field_name`)**: This is the name of the field in the index that is to be used for semantic analysis.
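
Because a `SemanticField` only carries a field name, nothing in the object itself guarantees that the name exists in the index. The sketch below is a small local check under that assumption; `missing_semantic_fields` is a hypothetical helper, and the field names are the ones defined earlier in this notebook plus one deliberately wrong name.

```python
from typing import Iterable, List


def missing_semantic_fields(
    index_field_names: Iterable[str],
    title_field: str,
    content_fields: List[str],
    keywords_fields: List[str],
) -> List[str]:
    """Return referenced semantic field names that are not defined in the index."""
    known = set(index_field_names)
    missing: List[str] = []
    for name in [title_field, *content_fields, *keywords_fields]:
        if name not in known and name not in missing:
            missing.append(name)
    return missing


# Field names from the index defined in this notebook.
index_fields = ["document_id", "table_content", "table_name", "table_vector"]

# 'document_category' is not in the index, so it is reported.
print(
    missing_semantic_fields(
        index_fields,
        title_field="table_name",
        content_fields=["table_content"],
        keywords_fields=["document_category"],
    )
)
```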


In [8]:
semantic_config_query_index = SemanticConfiguration(
    name="query-index-semantic-config",
    prioritized_fields=SemanticPrioritizedFields(
        title_field=SemanticField(field_name="table_name"),
        keywords_fields=[SemanticField(field_name="table_name")],
        content_fields=[SemanticField(field_name="table_content")],
    ),
)
# Create the semantic settings with the configuration
semantic_search_audio_images = SemanticSearch(
    configurations=[semantic_config_query_index]
)

In this example, `query-index-semantic-config` is the unique identifier for the semantic configuration. `SemanticPrioritizedFields` uses `table_name` as both the title field and the keywords field, and `table_content` as the content field. This configuration ensures that the search engine focuses on these fields for semantic analysis, thus enhancing the relevance and accuracy of search results.

## Create or Update Index

In [10]:
index = SearchIndex(
    name="query-dev-index",
    fields=fields_query_index,
    vector_search=vector_search,
    semantic_search=semantic_search_audio_images,
)

try:
    result = admin_documents_index_client.create_or_update_index(index)
    print("Index", result.name, "created")
except Exception as ex:
    print(ex)

Index query-dev-index created

