# Custom Asynchronous Indexing Pipeline with Text and Image Embeddings  
   
This notebook demonstrates how to create a **custom asynchronous indexing pipeline** that:  
   
- Reads PDF documents from Azure Blob Storage.  
- Extracts text and images using Azure Document Intelligence.  
- Generates embeddings for text and images using **Cohere models** in **Azure AI Foundry**.  
- Indexes the data into **Azure AI Search** with separate `text_vector` and `image_vector` fields.  
- Allows searching over text and image vectors.  
   
We will go through the following steps:  
   
1. **Configure Logging**  
2. **Set Up Environment Variables**  
3. **Create the Azure AI Search Index**  
4. **Define the Custom Indexing Pipeline Components**  
5. **Initialize and Run the Indexing Pipeline**  
6. **Perform Test Searches**  

## 1. Configure Logging  
To ensure a clear and structured output during the execution of this notebook, we configure logging at the start. This helps track the progress and debug any issues efficiently.

In [1]:
# Configure logging for a clearer experience   
import logging  
   
# Configure the root logger  
logging.basicConfig(  
    level=logging.INFO,  # Set root logger level  
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'  
)  
   
# Suppress logs from azure and uamqp libraries  
logging.getLogger('azure').setLevel(logging.WARNING)  
logging.getLogger('uamqp').setLevel(logging.WARNING)  

## 2. Set Up Environment Variables  
   
Ensure that the `.env` file contains the following environment variables:  
  
- `AZURE_SEARCH_SERVICE_ENDPOINT`  
- `AZURE_SEARCH_API_KEY`  
- `AZURE_STORAGE_ACCOUNT_NAME`  
- `AZURE_STORAGE_ACCOUNT_SUB_ID`  
- `AZURE_STORAGE_ACCOUNT_RG_NAME`  
- `AZURE_STORAGE_ACCOUNT_CONTAINER_NAME`  
- `DOCUMENTINTELLIGENCE_ENDPOINT`  
- `DOCUMENTINTELLIGENCE_API_KEY`  
- `AZURE_AI_FOUNDRY_ENDPOINT`  
- `AZURE_AI_FOUNDRY_KEY`  
- `TEXT_EMBEDDING_MODEL` 
- `TEXT_EMBEDDING_DIMENSIONS` 
- `IMAGE_EMBEDDING_MODEL` 
- `IMAGE_EMBEDDING_DIMENSIONS` 


Avoid hardcoding sensitive information in the notebook. 
   
Ensure that the identities used have the necessary permissions (e.g., **Storage Blob Data Reader** role).  

In [2]:
import os  
from dotenv import load_dotenv  
   
# Load environment variables from .env file  
load_dotenv(override=True)  
   
# Azure AI Search settings  
search_service_endpoint = os.environ["AZURE_SEARCH_SERVICE_ENDPOINT"]  
search_api_key = os.environ["AZURE_SEARCH_API_KEY"]  
index_name = "asynch-custom-push-products-demo"  
   
# Azure Storage settings  
storage_account_name = os.environ["AZURE_STORAGE_ACCOUNT_NAME"]  
storage_container_name = os.environ["AZURE_STORAGE_ACCOUNT_CONTAINER_NAME"]  
   
# Azure AI Inference settings (for embeddings)  
ai_foundry_endpoint = os.environ["AZURE_AI_FOUNDRY_ENDPOINT"]  
ai_foundry_key = os.environ["AZURE_AI_FOUNDRY_KEY"] 
text_embedding_model = os.environ["TEXT_EMBEDDING_MODEL"]
text_embedding_dimensions = int(os.getenv("TEXT_EMBEDDING_DIMENSIONS", 1024)) 
image_embedding_model = os.environ["IMAGE_EMBEDDING_MODEL"]
image_embedding_dimensions = int(os.getenv("IMAGE_EMBEDDING_DIMENSIONS", 1024))  # Set this based on your image embedding model  
   
# Azure Document Intelligence settings  
document_intelligence_endpoint = os.environ["DOCUMENTINTELLIGENCE_ENDPOINT"]  
document_intelligence_key = os.environ["DOCUMENTINTELLIGENCE_API_KEY"]  
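
Because the cell above reads most settings with `os.environ[...]`, a missing variable surfaces as a bare `KeyError` on the first name that is absent. As a minimal sketch, the check below reports all missing names at once. The variable names come from the list above; the helper `find_missing_vars` is a hypothetical name introduced here for illustration and is not part of the notebook's pipeline.

```python
import os

# Variable names taken from the list above that the notebook reads directly.
REQUIRED_VARS = [
    "AZURE_SEARCH_SERVICE_ENDPOINT",
    "AZURE_SEARCH_API_KEY",
    "AZURE_STORAGE_ACCOUNT_NAME",
    "AZURE_STORAGE_ACCOUNT_CONTAINER_NAME",
    "DOCUMENTINTELLIGENCE_ENDPOINT",
    "DOCUMENTINTELLIGENCE_API_KEY",
    "AZURE_AI_FOUNDRY_ENDPOINT",
    "AZURE_AI_FOUNDRY_KEY",
    "TEXT_EMBEDDING_MODEL",
    "IMAGE_EMBEDDING_MODEL",
]


def find_missing_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = find_missing_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this right after `load_dotenv` makes configuration problems visible before any Azure client is created.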

## 3. Create the Azure AI Search Index  
   
We'll create the search index with the appropriate schema, including separate fields for `text_vector` and `image_vector`, and a `page_number` field to track the pages.  

In [3]:
from azure.search.documents.indexes import SearchIndexClient  
from azure.search.documents.indexes.models import (  
    SearchField, 
    SimpleField, 
    SearchFieldDataType,  
    VectorSearch,  
    HnswAlgorithmConfiguration,  
    VectorSearchProfile,  
    SemanticConfiguration,  
    SemanticSearch,  
    SemanticPrioritizedFields,  
    SemanticField,  
    SearchIndex  
)  

from azure.core.credentials import AzureKeyCredential  
   
# Create a SearchIndexClient  
search_index_client = SearchIndexClient(  
    endpoint=search_service_endpoint,  
    credential=AzureKeyCredential(search_api_key)  
)  
   
# Define the index schema  
fields = [  
    SearchField(  
        name="parent_id",  
        type=SearchFieldDataType.String,  
        filterable=True,  
        facetable=True,  
        sortable=True  
    ),  
    SearchField(  
        name="chunk_id",  
        type=SearchFieldDataType.String,  
        key=True,  
        filterable=True,  
        facetable=True,  
        sortable=True  
    ),  
    SearchField(  
        name="title",  
        type=SearchFieldDataType.String,  
        filterable=True,  
        facetable=True,  
        sortable=True  
    ),  
    SearchField(  
        name="chunk",  
        type=SearchFieldDataType.String,  
        searchable=True  
    ),  
    SearchField(  
        name="text_vector",  
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),  
        vector_search_dimensions=text_embedding_dimensions,  
        vector_search_profile_name="textHnswProfile",  
    ),  
    SearchField(  
        name="image_vector",  
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),  
        vector_search_dimensions=image_embedding_dimensions,  
        vector_search_profile_name="imageHnswProfile",  
    ),  
    SearchField(  
        name="page_number",  
        type=SearchFieldDataType.Int32,  
        filterable=True,  
        facetable=True,  
        sortable=True  
    ), 
    # Field for content type (text, image)
    SimpleField(  
        name="content_type",  
        type="Edm.String",  
        filterable=True,  
        facetable=True,  
        sortable=True  
    ), 
    # Field to retrieve source document for citation 
    SimpleField(  
        name="source_link",  
        type="Edm.String",  
        retrievable=True  
    ),    
]  
   
# Configure the vector search settings  
vector_search = VectorSearch(  
    algorithms=[  
        HnswAlgorithmConfiguration(name="myHnswAlgorithm")  
    ],  
    profiles=[  
        VectorSearchProfile(  
            name="textHnswProfile",  
            algorithm_configuration_name="myHnswAlgorithm",  
        ),  
        VectorSearchProfile(  
            name="imageHnswProfile",  
            algorithm_configuration_name="myHnswAlgorithm",  
        ),  
    ],  
)  
   
# Configure semantic search settings (optional)  
semantic_config = SemanticConfiguration(  
    name="my-semantic-config",  
    prioritized_fields=SemanticPrioritizedFields(  
        title_field=SemanticField(field_name="title"),  
        content_fields=[SemanticField(field_name="chunk")],  
    ),  
)  
   
semantic_search = SemanticSearch(configurations=[semantic_config])  
   
# Create the search index  
index = SearchIndex(  
    name=index_name,  
    fields=fields,  
    vector_search=vector_search,  
    semantic_search=semantic_search,  
)  
   
# Create or update the index in Azure AI Search  
search_index_client.create_or_update_index(index)  
   
print(f"Index '{index.name}' created or updated.")  

Index 'asynch-custom-push-products-demo' created or updated.
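
The schema above makes `chunk_id` the key field. Azure AI Search document keys may contain only letters, digits, underscores, dashes, and equal signs, while the source file names in this demo contain spaces and dots. The sketch below shows one possible way to derive a safe key and shape a text document for this index. How `asynch_indexer` actually builds its keys is not shown in this notebook, so `make_chunk_id` and `build_text_document` are hypothetical helpers.

```python
import base64


def make_chunk_id(parent_id: str, page_number: int, chunk_index: int) -> str:
    """Build a key-safe ID by URL-safe base64-encoding the identifying parts (assumed scheme)."""
    raw = f"{parent_id}|{page_number}|{chunk_index}"
    return base64.urlsafe_b64encode(raw.encode("utf-8")).decode("ascii")


def build_text_document(title, page_number, chunk_index, chunk, vector, source_link):
    """Return a dict whose keys match the index fields defined above (text chunks only)."""
    return {
        "parent_id": title,
        "chunk_id": make_chunk_id(title, page_number, chunk_index),
        "title": title,
        "chunk": chunk,
        "text_vector": vector,
        "page_number": page_number,
        "content_type": "text",
        "source_link": source_link,
    }


doc = build_text_document(
    title="Relaxed_Fit Linen_Pants.pdf",
    page_number=2,
    chunk_index=0,
    chunk="example chunk text",
    vector=[0.0, 0.1, 0.2],  # placeholder; a real vector has text_embedding_dimensions entries
    source_link="https://example.invalid/placeholder.pdf",  # placeholder URL
)
print(doc["chunk_id"])
```

An image document would follow the same shape with `image_vector` in place of `text_vector` and `content_type` set to `"image"`.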


## 4. Define the Custom Indexing Pipeline Components  
   
The custom indexing pipeline consists of the following components:  
   
- **FileReader**: Reads PDFs using Azure Document Intelligence and extracts text and images.  
- **Chunker**: Splits the text into chunks for embedding.  
- **TextEmbedder**: Generates text embeddings using the Cohere text embedding model in Azure AI Foundry.  
- **ImageEmbedder**: Generates image embeddings using the Cohere image embedding model in Azure AI Foundry.  
- **FileUploader**: Uploads the processed documents into Azure AI Search.  
- **AsynchronousIndexer**: Orchestrates the entire pipeline asynchronously.  
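
To make the **Chunker** step more concrete, here is a minimal sketch of fixed-size chunking with overlap. The source of `asynch_indexer` is not shown in this notebook, so this is not necessarily how its Chunker works; the function name `chunk_text` and the `max_chars` and `overlap` parameters are assumptions chosen for illustration.

```python
def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into windows of at most max_chars characters that overlap by `overlap` characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be in the range [0, max_chars)")
    if not text:
        return []

    chunks = []
    start = 0
    step = max_chars - overlap
    while True:
        chunks.append(text[start:start + max_chars])
        # Stop once the current window reaches the end of the text,
        # so no trailing chunk consists only of already-covered overlap.
        if start + max_chars >= len(text):
            break
        start += step
    return chunks


print(chunk_text("abcdefghij", max_chars=4, overlap=1))
```

Overlap keeps a sentence that straddles a chunk boundary visible in both neighbouring chunks, at the cost of embedding some text twice.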

## 5. Initialize and Run the Indexing Pipeline  
   
Now we'll initialize the `AsynchronousIndexer` with the appropriate settings and run the indexing pipeline. 

In [4]:
import nest_asyncio  
import asyncio  
from asynch_indexer.AsynchronousIndexer import AsynchronousIndexer  
   
# Necessary when running asyncio in Jupyter notebooks  
nest_asyncio.apply()  
   
# Initialize the AsynchronousIndexer  
indexer = AsynchronousIndexer(  
    index_name=index_name,  
    search_endpoint=search_service_endpoint,  
    search_api_key=search_api_key,  
    storage_account_name=storage_account_name,  
    storage_container_name=storage_container_name,  
    ai_foundry_endpoint=ai_foundry_endpoint,
    ai_foundry_key=ai_foundry_key,
    text_embedding_model=text_embedding_model,
    image_embedding_model=image_embedding_model,
    document_intelligence_endpoint=document_intelligence_endpoint,  
    document_intelligence_key=document_intelligence_key,  
)  
   
# Run the indexing pipeline  
asyncio.run(indexer.run_indexing())  

2025-01-09 14:51:13,205 - asynch_indexer.AsynchronousIndexer - INFO - Reader read_worker_0: Reading document Chunky_Knit_Oversized_Sweater.pdf
2025-01-09 14:51:13,221 - asynch_indexer.AsynchronousIndexer - INFO - Reader read_worker_1: Reading document Classic_Denim_Jacket.pdf
2025-01-09 14:51:13,221 - asynch_indexer.AsynchronousIndexer - INFO - Reader read_worker_2: Reading document Relaxed_Fit Linen_Pants.pdf
2025-01-09 14:51:16,871 - asynch_indexer.AsynchronousIndexer - INFO - Reader read_worker_2: Completed analyze_document for Relaxed_Fit Linen_Pants.pdf
2025-01-09 14:51:17,056 - asynch_indexer.AsynchronousIndexer - INFO - Reader read_worker_0: Completed analyze_document for Chunky_Knit_Oversized_Sweater.pdf
2025-01-09 14:51:17,071 - asynch_indexer.AsynchronousIndexer - INFO - Reader read_worker_1: Completed analyze_document for Classic_Denim_Jacket.pdf
2025-01-09 14:51:18,121 - asynch_indexer.AsynchronousIndexer - INFO - Reader read_worker_1: Found 1 figures in Classic_Denim_Jacke
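
The log above shows several reader workers (`read_worker_0`, `read_worker_1`, `read_worker_2`) picking up documents concurrently. A common way to get this behaviour is a pool of coroutines consuming from an `asyncio.Queue`. The sketch below illustrates that pattern only; it is not the `AsynchronousIndexer` implementation, and `read_worker`, `run_readers`, and the `asyncio.sleep` stand-in for the Document Intelligence call are assumptions.

```python
import asyncio


async def read_worker(name, queue, results):
    """Pull document names from the queue until cancelled (hypothetical worker)."""
    while True:
        doc = await queue.get()
        try:
            await asyncio.sleep(0.01)  # stand-in for the real analyze_document call
            results.append((name, doc))
        finally:
            queue.task_done()


async def run_readers(documents, num_workers=3):
    """Process all documents with a fixed pool of concurrent workers."""
    queue = asyncio.Queue()
    for doc in documents:
        queue.put_nowait(doc)

    results = []
    workers = [
        asyncio.create_task(read_worker(f"read_worker_{i}", queue, results))
        for i in range(num_workers)
    ]
    await queue.join()  # wait until every queued document has been processed
    for worker in workers:
        worker.cancel()  # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results


# In Jupyter, either apply nest_asyncio first or use `await run_readers(...)`.
processed = asyncio.run(run_readers(["a.pdf", "b.pdf", "c.pdf", "d.pdf"]))
print(len(processed), "documents processed")
```

Capping the number of workers bounds how many requests hit the downstream service at once, which helps when that service enforces rate limits.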

## 6. Perform Test Searches  
    
Finally, we'll perform test searches against the index to verify that both the text and image vectors have been indexed correctly. We'll demonstrate how to:

1. Perform a text-over-text search, filtering results to only text content.
2. Perform an image-over-image search, retrieving image-related results.
3. Perform a text-over-image and text search, combining both content types.

### 6.1 Setup: Initialize Clients 
First, we'll initialize the necessary clients for Azure AI Search and Azure AI Foundry.

In [5]:
from azure.search.documents import SearchClient
from azure.ai.inference import EmbeddingsClient
from azure.ai.inference import ImageEmbeddingsClient  
from azure.ai.inference.models import EmbeddingInput   
from azure.search.documents.models import VectorizedQuery   
   
# Initialize the SearchClient  
search_client = SearchClient(  
    endpoint=search_service_endpoint,  
    index_name=index_name,  
    credential=AzureKeyCredential(search_api_key),  
)

# Initialize the EmbeddingsClient for text embeddings  
text_embeddings_client = EmbeddingsClient(  
    endpoint=ai_foundry_endpoint,  
    credential=AzureKeyCredential(ai_foundry_key),  
    model=text_embedding_model  
)  

# Initialize the ImageEmbeddingsClient for image embeddings  
image_embeddings_client = ImageEmbeddingsClient(  
    endpoint=ai_foundry_endpoint,  
    credential=AzureKeyCredential(ai_foundry_key),  
    model=image_embedding_model  
)

### 6.2 Helper Functions
 
We'll define helper functions to generate embeddings for text and images.

In [6]:
import base64  
from PIL import Image  
import io  
  
def get_text_embedding(query):  
    response = text_embeddings_client.embed(  
        input=[query]  
    )  
    print("Text Embedding Model:", response.model)  
    print("Usage:", response.usage)  
    return response.data[0].embedding  
  
def get_image_embedding(image_path):  
    # Open the image file  
    with open(image_path, "rb") as image_file:  
        image_data = image_file.read()  
    # Convert image data to base64 data URL  
    image_base64 = base64.b64encode(image_data).decode('utf-8')  
    data_url = f"data:image/png;base64,{image_base64}"  
    response = image_embeddings_client.embed(  
        input=[EmbeddingInput(image=data_url)]  
    )  
    print("Image Embedding Model:", response.model)  
    print("Usage:", response.usage)  
    return response.data[0].embedding  
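
`get_image_embedding` above always labels the data URL as `image/png`, which matches the PNG test image used later. If you query with other formats, one option is to infer the MIME type from the file extension. This is a sketch using only the standard library; `to_data_url` is a hypothetical helper and is not part of the notebook's code.

```python
import base64
import mimetypes


def to_data_url(image_path: str) -> str:
    """Encode an image file as a base64 data URL, guessing the MIME type from its extension."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {image_path!r}")
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string could be passed where `data_url` is built in `get_image_embedding`, assuming the embedding model accepts that image format.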

### 6.3 Text-over-Text Search
 
In this section, we'll perform a search where we input a text query and retrieve text content from the index.

In [7]:
# Define the search query  
query_text = "wedding dress"  
   
# Get the query embedding  
query_embedding = get_text_embedding(query_text) 
   
# Define the vector query  
vector_query = VectorizedQuery(  
    vector=query_embedding,  
    k_nearest_neighbors=3,  
    fields="text_vector", 
)  
   
# Perform the search with a filter on content_type.  
# We are only interested in text content, so results are restricted to content_type 'text'.  
results = search_client.search(  
    search_text=query_text,  # Hybrid search; set to None to use only vector search.
    vector_queries=[vector_query],  
    filter="content_type eq 'text'",
    select=["title", "chunk", "page_number", "source_link", "content_type"],   
    top=1  
)  
   
for result in results:  
    print(f"Title: {result['title']}")  
    print(f"Page Number: {result['page_number']}")  
    print(f"Content Type: {result['content_type']}")  
    print(f"Chunk: {result['chunk']}")  
    print(f"source_link: {result['source_link']}")  
    print("---")  

Text Embedding Model: embed-multilingual-v3.0
Usage: {'prompt_tokens': 2, 'completion_tokens': 0, 'total_tokens': 2}
Title: Classic_Denim_Jacket.pdf
Page Number: 2
Content Type: text
Chunk: modern unisex fit, and a versatile design suitable for all seasons. With two front chest pockets,
button closures, and subtle distressing, it's a perfect blend of casual and chic.
How to Use:
· Pair it with a plain white t-shirt and black jeans for a classic, everyday outfit.
· Layer it over a hoodie for a relaxed streetwear aesthetic on cooler days.
. Wear it over a summer dress or shorts for a laid-back, trendy look during warmer
months.
Key Features:
· Fabric: 100% Cotton Denim
· Fit: Regular Unisex Fit
· Button closures and adjustable cuffs
· Machine washable for easy care
· Available in sizes XS to XXL
source_link: https://demosharedstorage1.blob.core.windows.net/products-demo/Classic_Denim_Jacket.pdf
---
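
In the query above, `k_nearest_neighbors` sets how many neighbours the vector query returns, and `top` caps how many results come back overall. To illustrate what "nearest" means, here is a brute-force sketch that ranks toy vectors by cosine similarity, assuming the index uses the cosine metric. The HNSW algorithm configured in the index is approximate and far more efficient than this exact scan; `cosine_similarity`, `k_nearest`, and the toy data are illustrative names only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def k_nearest(query, vectors, k):
    """Return the k (doc_id, score) pairs most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy two-dimensional "embeddings"; real vectors have text_embedding_dimensions entries.
toy_index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
for doc_id, score in k_nearest([1.0, 0.0], toy_index, k=2):
    print(doc_id, round(score, 3))
```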


### 6.4 Image-over-Image Search
 
Next, we'll perform a search where we input an image query and retrieve image content from the index.

In [8]:
# Provide the path to your query image  
image_query_path = "./sample_Data/test_image.png"

# Get the image embedding  
image_query_embedding = get_image_embedding(image_query_path) 

# Define the vector query  
vector_query = VectorizedQuery(  
    vector=image_query_embedding,  
    k_nearest_neighbors=5,  
    fields="image_vector",  
)  

# Perform the search with a filter on content_type  
results = search_client.search(  
    search_text="",  
    vector_queries=[vector_query],  
    filter="content_type eq 'image'",  
    select=["title", "page_number", "source_link", "content_type"],  
    top=1  
)

# Print the results  
print("Image-over-Image Search Results:")  
for result in results:  
    print(f"Title: {result['title']}")  
    print(f"Page Number: {result['page_number']}")  
    print(f"Content Type: {result['content_type']}")  
    print(f"Blob URI: {result['source_link']}")  
    print("---")  

Image Embedding Model: embed-multilingual-v3.0-image
Usage: {'prompt_tokens': 1000, 'completion_tokens': 0, 'total_tokens': 1000, 'images': 1}
Image-over-Image Search Results:
Title: Classic_Denim_Jacket.pdf
Page Number: 1
Content Type: image
Blob URI: https://demosharedstorage1.blob.core.windows.net/products-demo/Classic_Denim_Jacket.pdf
---


### 6.5 Text-over-Image and Text Search
 
Finally, we'll perform a search where we input a text query and retrieve both text and image content from the index.

In [9]:
# Define the search query  
query_text = "wedding dress"  

# Get the query embedding  
query_embedding = get_text_embedding(query_text)  

# Define vector queries for both text and image vectors  
text_vector_query = VectorizedQuery(  
    vector=query_embedding,  
    k_nearest_neighbors=5,  
    fields="text_vector",  
)  
  
image_vector_query = VectorizedQuery(  
    vector=query_embedding,  
    k_nearest_neighbors=5,  
    fields="image_vector",  
)  
  
# Perform the searches separately  
text_results = search_client.search(  
    search_text=None,  
    vector_queries=[text_vector_query],  
    select=["title", "chunk", "page_number", "source_link", "content_type"],  
    top=1 
)  
  
image_results = search_client.search(  
    search_text=None,  
    vector_queries=[image_vector_query],  
    select=["title", "page_number", "source_link", "content_type"],  
    top=1  
)  

# Display the text results  
for result in text_results:  
    print(f"Title: {result['title']}")  
    print(f"Page Number: {result['page_number']}")  
    print(f"Content Type: {result['content_type']}")   
    print(f"source_link: {result['source_link']}")  
    print("---") 


# Display the image results  
for result in image_results:  
    print(f"Title: {result['title']}")  
    print(f"Page Number: {result['page_number']}")  
    print(f"Content Type: {result['content_type']}")   
    print(f"source_link: {result['source_link']}")  
    print("---")   

Text Embedding Model: embed-multilingual-v3.0
Usage: {'prompt_tokens': 2, 'completion_tokens': 0, 'total_tokens': 2}
Title: Relaxed_Fit Linen_Pants.pdf
Page Number: 2
Content Type: text
source_link: https://demosharedstorage1.blob.core.windows.net/products-demo/Relaxed_Fit%20Linen_Pants.pdf
---
Title: Classic_Denim_Jacket.pdf
Page Number: 1
Content Type: image
source_link: https://demosharedstorage1.blob.core.windows.net/products-demo/Classic_Denim_Jacket.pdf
---
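
The cell above runs the text and image vector queries separately and prints each result list on its own. If you want a single ranked list instead, one option is to merge the two lists client-side with Reciprocal Rank Fusion (RRF), the same family of method Azure AI Search uses to fuse hybrid and multi-vector queries. The sketch below is an illustration under assumptions: `reciprocal_rank_fusion`, the constant `k=60`, and the toy IDs are not part of the notebook.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of IDs: each list contributes 1 / (k + rank) per document, rank starting at 1."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Toy ranked ID lists standing in for the chunk_id values of each result set.
text_ranked = ["t1", "shared", "t2"]
image_ranked = ["shared", "i1"]

for doc_id, score in reciprocal_rank_fusion([text_ranked, image_ranked]):
    print(doc_id, round(score, 5))
```

A document that appears in both lists accumulates score from each, so it tends to rise above documents found by only one query. To use this with real results, you would add `chunk_id` to `select` so each hit can be identified.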


## Conclusion  
   
In this notebook, we've built a custom asynchronous indexing pipeline that processes both text and images from PDF documents, generates embeddings using Cohere models in Azure AI Foundry, and indexes them into Azure AI Search. This enables vector search over both text and images.  
   
You can extend this pipeline to handle more complex scenarios, larger datasets, or integrate additional processing steps as needed.  