# Custom Asynchronous Indexing Pipeline with Text and Image Embeddings  
   
This notebook demonstrates how to create a **custom asynchronous indexing pipeline** that:  
   
- Reads PDF documents from Azure Blob Storage.  
- Extracts text and images using Azure Document Intelligence.  
- Generates embeddings for text and images using **Cohere models** in **Azure AI Foundry**.  
- Indexes the data into **Azure AI Search** with separate `text_vector` and `image_vector` fields.  
- Allows searching over text and image vectors.  
   
We will go through the following steps:  
   
1. **Configure Logging**  
2. **Set Up Environment Variables**  
3. **Create the Azure AI Search Index**  
4. **Define the Custom Indexing Pipeline Components**  
5. **Initialize and Run the Indexing Pipeline**  
6. **Perform Test Searches**  

## 1. Configure Logging  
We configure logging first so that the notebook produces clear, structured output. This makes it easier to track progress and debug issues.

In [15]:
# Configure logging for a clearer experience   
import logging  
   
# Configure the root logger  
logging.basicConfig(  
    level=logging.INFO,  # Set root logger level  
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'  
)  
   
# Show only WARNING and above from the azure and uamqp libraries  
logging.getLogger('azure').setLevel(logging.WARNING)  
logging.getLogger('uamqp').setLevel(logging.WARNING)  

## 2. Set Up Environment Variables  
   
Ensure that the `.env` file contains the following environment variables:  
  
- `AZURE_SEARCH_SERVICE_ENDPOINT`  
- `AZURE_SEARCH_API_KEY`  
- `AZURE_STORAGE_ACCOUNT_NAME`  
- `AZURE_STORAGE_ACCOUNT_SUB_ID`  
- `AZURE_STORAGE_ACCOUNT_RG_NAME`  
- `AZURE_STORAGE_ACCOUNT_CONTAINER_NAME`  
- `DOCUMENTINTELLIGENCE_ENDPOINT`  
- `DOCUMENTINTELLIGENCE_API_KEY`  
- `AZURE_AI_FOUNDRY_ENDPOINT`  
- `AZURE_AI_FOUNDRY_KEY`  
- `TEXT_EMBEDDING_MODEL` 
- `TEXT_EMBEDDING_DIMENSIONS` 
- `IMAGE_EMBEDDING_MODEL` 
- `IMAGE_EMBEDDING_DIMENSIONS` 


Avoid hardcoding sensitive information in the notebook. 
   
Ensure that the identities used have the necessary permissions (e.g., **Storage Blob Data Reader** role).  

In [16]:
import os  
from dotenv import load_dotenv  
   
# Load environment variables from .env file  
load_dotenv(override=True)  
   
# Azure AI Search settings  
search_service_endpoint = os.environ["AZURE_SEARCH_SERVICE_ENDPOINT"]  
search_api_key = os.environ["AZURE_SEARCH_API_KEY"]  
index_name = "asynch-custom-push-demo"  
   
# Azure Storage settings  
storage_account_name = os.environ["AZURE_STORAGE_ACCOUNT_NAME"]  
storage_container_name = os.environ["AZURE_STORAGE_ACCOUNT_CONTAINER_NAME"]  
   
# Azure AI Inference settings (for embeddings)  
ai_foundry_endpoint = os.environ["AZURE_AI_FOUNDRY_ENDPOINT"]  
ai_foundry_key = os.environ["AZURE_AI_FOUNDRY_KEY"] 
text_embedding_model = os.environ["TEXT_EMBEDDING_MODEL"]  
text_embedding_dimensions = int(os.getenv("TEXT_EMBEDDING_DIMENSIONS", 1024))  # Set this based on your text embedding model  
image_embedding_model = os.environ["IMAGE_EMBEDDING_MODEL"]  
image_embedding_dimensions = int(os.getenv("IMAGE_EMBEDDING_DIMENSIONS", 1024))  # Set this based on your image embedding model  
   
# Azure Document Intelligence settings  
document_intelligence_endpoint = os.environ["DOCUMENTINTELLIGENCE_ENDPOINT"]  
document_intelligence_key = os.environ["DOCUMENTINTELLIGENCE_API_KEY"]  
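
The cell above stops with a `KeyError` at the first missing variable. As a minimal sketch, the helper below reports every missing variable in one pass; it is meant to run after `load_dotenv`. The names `REQUIRED_ENV_VARS` and `find_missing_env_vars` are illustrative and not part of the original notebook, and the list covers only the variables this notebook reads without a default.

```python
import os

# Variables that the settings cell reads with os.environ[...] (no default value).
REQUIRED_ENV_VARS = [
    "AZURE_SEARCH_SERVICE_ENDPOINT",
    "AZURE_SEARCH_API_KEY",
    "AZURE_STORAGE_ACCOUNT_NAME",
    "AZURE_STORAGE_ACCOUNT_CONTAINER_NAME",
    "AZURE_AI_FOUNDRY_ENDPOINT",
    "AZURE_AI_FOUNDRY_KEY",
    "TEXT_EMBEDDING_MODEL",
    "IMAGE_EMBEDDING_MODEL",
    "DOCUMENTINTELLIGENCE_ENDPOINT",
    "DOCUMENTINTELLIGENCE_API_KEY",
]


def find_missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = find_missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Passing a plain dictionary as `environ` makes the helper easy to check without touching the real environment.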

## 3. Create the Azure AI Search Index  
   
We'll create the search index with the appropriate schema, including separate fields for `text_vector` and `image_vector`, and a `page_number` field to track the pages.  

In [None]:
from azure.search.documents.indexes import SearchIndexClient  
from azure.search.documents.indexes.models import (  
    SearchField,  
    SearchFieldDataType,  
    VectorSearch,  
    HnswAlgorithmConfiguration,  
    VectorSearchProfile,  
    SemanticConfiguration,  
    SemanticSearch,  
    SemanticPrioritizedFields,  
    SemanticField,  
    SearchIndex  
)  

from azure.core.credentials import AzureKeyCredential  
   
# Create a SearchIndexClient  
search_index_client = SearchIndexClient(  
    endpoint=search_service_endpoint,  
    credential=AzureKeyCredential(search_api_key)  
)  
   
# Define the index schema  
fields = [  
    SearchField(  
        name="parent_id",  
        type=SearchFieldDataType.String,  
        filterable=True,  
        facetable=True,  
        sortable=True  
    ),  
    SearchField(  
        name="chunk_id",  
        type=SearchFieldDataType.String,  
        key=True,  
        filterable=True,  
        facetable=True,  
        sortable=True  
    ),  
    SearchField(  
        name="title",  
        type=SearchFieldDataType.String,  
        filterable=True,  
        facetable=True,  
        sortable=True  
    ),  
    SearchField(  
        name="chunk",  
        type=SearchFieldDataType.String,  
        searchable=True  
    ),  
    SearchField(  
        name="text_vector",  
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),  
        vector_search_dimensions=text_embedding_dimensions,  
        vector_search_profile_name="textHnswProfile",  
    ),  
    SearchField(  
        name="image_vector",  
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),  
        vector_search_dimensions=image_embedding_dimensions,  
        vector_search_profile_name="imageHnswProfile",  
    ),  
    SearchField(  
        name="page_number",  
        type=SearchFieldDataType.Int32,  
        filterable=True,  
        facetable=True,  
        sortable=True  
    ),  
]  
   
# Configure the vector search settings  
vector_search = VectorSearch(  
    algorithms=[  
        HnswAlgorithmConfiguration(name="myHnswAlgorithm")  
    ],  
    profiles=[  
        VectorSearchProfile(  
            name="textHnswProfile",  
            algorithm_configuration_name="myHnswAlgorithm",  
        ),  
        VectorSearchProfile(  
            name="imageHnswProfile",  
            algorithm_configuration_name="myHnswAlgorithm",  
        ),  
    ],  
)  
   
# Configure semantic search settings (optional)  
semantic_config = SemanticConfiguration(  
    name="my-semantic-config",  
    prioritized_fields=SemanticPrioritizedFields(  
        title_field=SemanticField(field_name="title"),  
        content_fields=[SemanticField(field_name="chunk")],  
    ),  
)  
   
semantic_search = SemanticSearch(configurations=[semantic_config])  
   
# Create the search index  
index = SearchIndex(  
    name=index_name,  
    fields=fields,  
    vector_search=vector_search,  
    semantic_search=semantic_search,  
)  
   
# Create or update the index in Azure AI Search  
search_index_client.create_or_update_index(index)  
   
print(f"Index '{index.name}' created or updated.")  
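
To make the schema concrete, here is a rough sketch of a document that would fit this index. Azure AI Search document keys may contain only letters, digits, underscores, dashes, and equal signs, so the sketch derives `chunk_id` from a URL-safe Base64 encoding of the blob name. The helper names, the ID naming scheme, and the use of the encoded blob name as `parent_id` are assumptions made for illustration; the pipeline used later in this notebook may build its documents differently, and how it populates `image_vector` is not shown here.

```python
import base64
import re

# Azure AI Search document keys may contain only letters, digits,
# underscore (_), dash (-), and equal sign (=).
VALID_KEY_PATTERN = re.compile(r"^[A-Za-z0-9_\-=]+$")


def make_chunk_id(blob_name: str, page_number: int, chunk_index: int) -> str:
    """Build a key-safe ID (hypothetical naming scheme, not the pipeline's own)."""
    encoded_name = base64.urlsafe_b64encode(blob_name.encode("utf-8")).decode("ascii")
    return f"{encoded_name}_p{page_number}_c{chunk_index}"


def build_text_document(blob_name, page_number, chunk_index, chunk,
                        text_vector, expected_dimensions):
    """Return a dict whose keys match the index fields defined above."""
    # The vector length must equal the field's vector_search_dimensions.
    if len(text_vector) != expected_dimensions:
        raise ValueError(
            f"Expected {expected_dimensions} dimensions, got {len(text_vector)}"
        )
    chunk_id = make_chunk_id(blob_name, page_number, chunk_index)
    if not VALID_KEY_PATTERN.match(chunk_id):
        raise ValueError(f"Invalid document key: {chunk_id}")
    return {
        # Assumption: the encoded blob name identifies the parent document.
        "parent_id": chunk_id.split("_p")[0],
        "chunk_id": chunk_id,
        "title": blob_name,
        "chunk": chunk,
        "text_vector": text_vector,
        "page_number": page_number,
    }


example = build_text_document(
    blob_name="reports/annual report.pdf",
    page_number=2,
    chunk_index=0,
    chunk="Example chunk text.",
    text_vector=[0.1, 0.2, 0.3],  # Toy vector; real vectors are much longer
    expected_dimensions=3,
)
print(example["chunk_id"])
```

In the notebook itself, `expected_dimensions` would be `text_embedding_dimensions`.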

## 4. Define the Custom Indexing Pipeline Components  
   
The custom indexing pipeline consists of the following components:  
   
- **FileReader**: Reads PDFs using Azure Document Intelligence and extracts text and images.  
- **Chunker**: Splits the text into chunks for embedding.  
- **TextEmbedder**: Generates text embeddings using the Cohere text embedding model in Azure AI Foundry.  
- **ImageEmbedder**: Generates image embeddings using the Cohere image embedding model in Azure AI Foundry.  
- **FileUploader**: Uploads the processed documents into Azure AI Search.  
- **AsynchronousIndexer**: Orchestrates the entire pipeline asynchronously.  
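
The component implementations live in the `asynch_indexer` package and are not reproduced in this notebook. To illustrate what the chunking step does, here is a minimal sketch of fixed-size chunking with overlap. The function name `chunk_text` and its default sizes are assumptions for illustration; the actual `Chunker` may split on sentences, tokens, or layout elements instead.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into character windows of chunk_size that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    chunks = []
    step = chunk_size - overlap  # How far the window advances each time
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():  # Skip windows that contain only whitespace
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break  # The current window already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
```

The overlap keeps text that straddles a chunk boundary visible in both neighboring chunks, at the cost of embedding some characters twice.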

## 5. Initialize and Run the Indexing Pipeline  
   
Now we'll initialize the `AsynchronousIndexer` with the appropriate settings and run the indexing pipeline. 

In [None]:
import nest_asyncio  
import asyncio  
from asynch_indexer.AsynchronousIndexer import AsynchronousIndexer  
   
# Necessary when running asyncio in Jupyter notebooks  
nest_asyncio.apply()  
   
# Initialize the AsynchronousIndexer  
indexer = AsynchronousIndexer(  
    index_name=index_name,  
    search_endpoint=search_service_endpoint,  
    search_api_key=search_api_key,  
    storage_account_name=storage_account_name,  
    storage_container_name=storage_container_name,  
    ai_foundry_endpoint=ai_foundry_endpoint,  
    ai_foundry_key=ai_foundry_key,  
    text_embedding_model=text_embedding_model,  
    image_embedding_model=image_embedding_model,  
    document_intelligence_endpoint=document_intelligence_endpoint,  
    document_intelligence_key=document_intelligence_key,  
)  
   
# Run the indexing pipeline  
asyncio.run(indexer.run_indexing())  
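
The internals of `run_indexing` are not shown in this notebook. As a rough illustration of how an asynchronous orchestrator can process many documents while limiting the number of concurrent service calls, the sketch below combines `asyncio.Semaphore` with `asyncio.gather`. The functions `process_document` and `run_pipeline` are hypothetical stand-ins, not the `AsynchronousIndexer` API.

```python
import asyncio


async def process_document(name: str) -> str:
    """Stand-in for read -> chunk -> embed -> upload on a single document."""
    await asyncio.sleep(0.01)  # Simulates waiting on a network call
    return f"indexed:{name}"


async def run_pipeline(names, max_concurrency: int = 4):
    """Process all documents, with at most max_concurrency in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(name):
        async with semaphore:
            return await process_document(name)

    # gather returns results in input order; return_exceptions=True keeps one
    # failed document from cancelling the rest of the batch.
    return await asyncio.gather(
        *(bounded(name) for name in names), return_exceptions=True
    )


# In a notebook, this call relies on nest_asyncio.apply() as in the cell above.
results = asyncio.run(run_pipeline(["a.pdf", "b.pdf", "c.pdf"], max_concurrency=2))
print(results)
```

Bounding concurrency this way is one common approach to staying within service rate limits while still overlapping I/O waits.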

## 6. Perform Test Searches  
   
Finally, we'll perform test searches against the index to verify that the text and image vectors have been indexed correctly.  

In [None]:
from azure.search.documents import SearchClient
from azure.ai.inference import EmbeddingsClient   
from azure.search.documents.models import VectorizedQuery   
   
# Initialize the SearchClient  
search_client = SearchClient(  
    endpoint=search_service_endpoint,  
    index_name=index_name,  
    credential=AzureKeyCredential(search_api_key),  
)

embeddings_client = EmbeddingsClient(  
    endpoint=ai_foundry_endpoint,  
    credential=AzureKeyCredential(ai_foundry_key),  
    model=text_embedding_model,  
)  
   
# Define the search query  
query_text = "Enter your search query here"  
   
# Get the query embedding from the Cohere embedding model
def get_query_embedding(query):  
    response = embeddings_client.embed(  
        input=[query]  
    )  
    print("Model:", response.model)
    print("Usage:", response.usage)
    return response.data[0].embedding
   
# Get the query embedding  
query_embedding = get_query_embedding(query_text) 
   
# Perform the vector search  
vector_query = VectorizedQuery(  
    vector=query_embedding,  
    k_nearest_neighbors=3,  
    fields="text_vector", 
)  
   
results = search_client.search(  
    search_text=query_text,  # Hybrid search: keyword query plus vector query (use None for vector-only)  
    vector_queries=[vector_query],  
    select=["title", "chunk", "page_number"],  
    top=3  
)  
   
# Print the results  
for result in results:  
    print(f"Title: {result['title']}")  
    print(f"Page Number: {result['page_number']}")  
    print(f"Chunk: {result['chunk']}")  
    print("---")  
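
To build intuition for what `k_nearest_neighbors=3` asks for, the sketch below ranks a handful of toy vectors by cosine similarity, which is the default similarity metric for HNSW in Azure AI Search. This is an exact brute-force ranking over made-up data, whereas the service uses an approximate HNSW graph; the function names and the toy vectors are illustrative only.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat zero vectors as having no similarity
    return dot / (norm_a * norm_b)


def top_k(query, vectors: dict, k: int = 3):
    """Return the k (id, score) pairs most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy two-dimensional "embeddings" keyed by a made-up chunk ID
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
for doc_id, score in top_k([1.0, 0.0], toy_vectors, k=2):
    print(doc_id, round(score, 3))
```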

## Conclusion  
   
In this notebook, we've built a custom asynchronous indexing pipeline that processes both text and images from PDF documents, generates embeddings using Cohere models in Azure AI Foundry, and indexes them into Azure AI Search. This enables vector-based search over both text and images.  
   
You can extend this pipeline to handle more complex scenarios, larger datasets, or integrate additional processing steps as needed.  