# Ingesting Data Using Azure AI Search Indexer (Pull Method)  
   
This notebook demonstrates how to ingest data into Azure AI Search using the indexer (pull method). We'll set up a data source, create an index, define a skillset for data enrichment, configure an indexer, and perform a search query to retrieve results.  
   
## Prerequisites  
   
- **Azure Subscription** with access to:  
  - Azure AI Search service  
  - Azure Storage account (Blob storage)  
  - Azure OpenAI service  
  - Azure AI Services  
- **Environment variables** set in a `.env` file or in the process environment:  
  - `AZURE_SEARCH_SERVICE_ENDPOINT`  
  - `AZURE_SEARCH_API_KEY`  
  - `AZURE_STORAGE_ACCOUNT_SUB_ID`  
  - `AZURE_STORAGE_ACCOUNT_RG_NAME`  
  - `AZURE_STORAGE_ACCOUNT_NAME`  
  - `AZURE_OPENAI_ENDPOINT`  
  - `AZURE_OPENAI_KEY`  
  - `AZURE_OPENAI_EMBEDDING_DEPLOYMENT`  
  - `AZURE_OPENAI_EMBEDDING_MODEL_NAME`  
  - `AZURE_OPENAI_EMBEDDING_DIMENSIONS` (optional; the code below defaults to 1536)  
  - `AZURE_AI_SERVICES_ENDPOINT`  
  - `AZURE_AI_SERVICES_KEY`  
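
The variables listed above can be checked up front, so that a missing value is reported by name rather than surfacing later as a `KeyError`. The following is a minimal sketch and not part of the original notebook; the helper name `find_missing_vars` is hypothetical, and the list of names is copied from the prerequisites.

```python
import os

# Names copied from the prerequisites list; adjust if your setup differs.
REQUIRED_VARS = [
    "AZURE_SEARCH_SERVICE_ENDPOINT",
    "AZURE_SEARCH_API_KEY",
    "AZURE_STORAGE_ACCOUNT_SUB_ID",
    "AZURE_STORAGE_ACCOUNT_RG_NAME",
    "AZURE_STORAGE_ACCOUNT_NAME",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_KEY",
    "AZURE_OPENAI_EMBEDDING_DEPLOYMENT",
    "AZURE_OPENAI_EMBEDDING_MODEL_NAME",
    "AZURE_AI_SERVICES_ENDPOINT",
    "AZURE_AI_SERVICES_KEY",
]


def find_missing_vars(env=None, required=REQUIRED_VARS):
    """Return the required names that are unset or empty in `env`.

    `env` defaults to os.environ; any mapping can be passed for testing.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = find_missing_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```

If a `.env` file is used, this check is meaningful only after `load_dotenv` has been called.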

## Section 1: Set Up Environment Variables and Credentials  
   
Import necessary libraries and load environment variables required for authentication and configuration.  

In [12]:
# Import necessary libraries  
from dotenv import load_dotenv  
from azure.core.credentials import AzureKeyCredential  
import os  
  
# Load environment variables from a .env file  
load_dotenv(override=True)  # Take environment variables from .env.  
  
# Azure AI Search credentials  
search_service_endpoint = os.environ["AZURE_SEARCH_SERVICE_ENDPOINT"]  
search_api_key = AzureKeyCredential(os.environ["AZURE_SEARCH_API_KEY"])  
index_name = "indexer-demo"  
  
# Azure Storage account details  
storage_subscription_id = os.environ["AZURE_STORAGE_ACCOUNT_SUB_ID"]  
storage_resource_group = os.environ["AZURE_STORAGE_ACCOUNT_RG_NAME"]  
storage_account_name = os.environ["AZURE_STORAGE_ACCOUNT_NAME"]  
  
# Construct the data source connection string for the storage account  
storage_connection_string = (  
    f"ResourceId=/subscriptions/{storage_subscription_id}"  
    f"/resourceGroups/{storage_resource_group}"  
    f"/providers/Microsoft.Storage/storageAccounts/{storage_account_name}/;"  
)  
  
# Azure OpenAI service credentials  
openai_endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]  
openai_api_key = os.environ["AZURE_OPENAI_KEY"]  
openai_embedding_deployment = os.environ["AZURE_OPENAI_EMBEDDING_DEPLOYMENT"]  
openai_model_name = os.environ["AZURE_OPENAI_EMBEDDING_MODEL_NAME"]  
openai_model_dimensions = int(  
    os.getenv("AZURE_OPENAI_EMBEDDING_DIMENSIONS", 1536)  # Default to 1536 dimensions  
)  
  
# Azure AI Services credentials  
ai_services_endpoint = os.environ["AZURE_AI_SERVICES_ENDPOINT"]  
ai_services_api_key = os.environ["AZURE_AI_SERVICES_KEY"]  
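
The `ResourceId=...` connection string built above contains no account key, so access depends on the search service's identity holding a suitable role on the storage account. As a sketch under that assumption, the same string can be produced by a small function that also rejects obviously malformed inputs. The function name and the placeholder values are hypothetical and only for illustration.

```python
def build_resource_id_connection_string(subscription_id, resource_group, account_name):
    """Build the ResourceId-style connection string used in this notebook.

    Raises ValueError if a part is empty or contains a slash, since either
    would silently produce a malformed resource path.
    """
    parts = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "account_name": account_name,
    }
    for label, value in parts.items():
        if not value or "/" in value:
            raise ValueError(f"Invalid {label}: {value!r}")
    return (
        f"ResourceId=/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Storage/storageAccounts/{account_name}/;"
    )


# Placeholder values, not real resources.
example = build_resource_id_connection_string(
    "00000000-0000-0000-0000-000000000000", "my-rg", "mystorageaccount"
)
print(example)
```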

## Section 2: Create a Blob Data Source Connector on Azure AI Search  
   
Set up a data source connection to your Azure Blob Storage, which the indexer will use to pull data.  

In [13]:
# Import required classes for the indexer client  
from azure.search.documents.indexes import SearchIndexerClient  
from azure.search.documents.indexes.models import (  
    SearchIndexerDataContainer,  
    SearchIndexerDataSourceConnection  
)  
  
# Create the indexer client  
indexer_client = SearchIndexerClient(  
    endpoint=search_service_endpoint,  
    credential=search_api_key  
)  
  
# Define the data source connection  
data_source_name = f"{index_name}-blob"  
data_container_name = "demo-indexer-storage"  # Replace with your blob container name  
data_source = SearchIndexerDataSourceConnection(  
    name=data_source_name,  
    type="azureblob",  
    connection_string=storage_connection_string,  
    container=SearchIndexerDataContainer(name=data_container_name)  
)  
  
# Create or update the data source connection  
indexer_client.create_or_update_data_source_connection(data_source)  
print(f"Data source '{data_source.name}' created or updated.")  
  
# Reminder to set permissions  
print(  
    "Please ensure your Azure AI Search service has the 'Storage Blob Data Reader' role "  
    "assigned on the storage account to access blob data."  
)  


Data source 'indexer-demo-blob' created or updated.
Please ensure your Azure AI Search service has the 'Storage Blob Data Reader' role assigned on the storage account to access blob data.


## Section 3: Create a Search Index  
   
Define the index schema, including fields and configurations for vector and semantic search. 

In [14]:
# Import required classes for creating the search index  
from azure.search.documents.indexes import SearchIndexClient  
from azure.search.documents.indexes.models import (  
    SearchField,  
    SearchFieldDataType,  
    VectorSearch,  
    HnswAlgorithmConfiguration,  
    VectorSearchProfile,  
    AzureOpenAIVectorizer,  
    AzureOpenAIParameters,  
    AIServicesVisionVectorizer,  
    AIServicesVisionParameters,  
    SemanticConfiguration,  
    SemanticSearch,  
    SemanticPrioritizedFields,  
    SemanticField,  
    SearchIndex  
)  
  
# Create a search index client  
search_index_client = SearchIndexClient(  
    endpoint=search_service_endpoint,  
    credential=search_api_key  
)  
  
# Define the index schema fields  
fields = [  
    # Field for parent ID of text documents  
    SearchField(  
        name="text_parent_id",  
        type=SearchFieldDataType.String,  
        sortable=True,  
        filterable=True,  
        facetable=True  
    ),  
    # Field for parent ID of image documents  
    SearchField(  
        name="image_parent_id",  
        type=SearchFieldDataType.String,  
        sortable=True,  
        filterable=True,  
        facetable=True  
    ),  
    # Field for document title  
    SearchField(  
        name="title",  
        type=SearchFieldDataType.String  
    ),  
    # Field for chunk ID, used as the key  
    SearchField(  
        name="chunk_id",  
        type=SearchFieldDataType.String,  
        key=True,  
        sortable=True,  
        filterable=True,  
        facetable=True,  
        analyzer_name="keyword"  
    ),  
    # Field for text chunks  
    SearchField(  
        name="chunk",  
        type=SearchFieldDataType.String,  
        sortable=False,  
        filterable=False,  
        facetable=False  
    ),  
    # Field for text embeddings (vector)  
    SearchField(  
        name="text_vector",  
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),  
        vector_search_dimensions=openai_model_dimensions,  
        vector_search_profile_name="textVectorSearchProfile"  
    ),  
    # Field for image embeddings (vector)  
    SearchField(  
        name="image_vector",  
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),  
        vector_search_dimensions=1024,  
        vector_search_profile_name="imageVectorSearchProfile"  
    ),  
]  
  
# Configure vector search settings  
vector_search = VectorSearch(  
    algorithms=[  
        HnswAlgorithmConfiguration(name="hnswAlgorithm"),  # HNSW algorithm for approximate nearest neighbor search  
    ],  
    profiles=[  
        # Profile for text vector search using Azure OpenAI  
        VectorSearchProfile(  
            name="textVectorSearchProfile",  
            algorithm_configuration_name="hnswAlgorithm",  
            vectorizer="AzureOpenAIVectorizer"  
        ),  
        # Profile for image vector search using AI Services Vision  
        VectorSearchProfile(  
            name="imageVectorSearchProfile",  
            algorithm_configuration_name="hnswAlgorithm",  
            vectorizer="AIServicesVisionVectorizer"  
        ),  
    ],  
    vectorizers=[  
        # Vectorizer for AI Services Vision (images)  
        AIServicesVisionVectorizer(  
            name="AIServicesVisionVectorizer",  
            kind="aiServicesVision",  
            ai_services_vision_parameters=AIServicesVisionParameters(  
                model_version="2023-04-15",  
                resource_uri=ai_services_endpoint,  
                api_key=ai_services_api_key,  
            )  
        ),  
        # Vectorizer for Azure OpenAI (text)  
        AzureOpenAIVectorizer(  
            name="AzureOpenAIVectorizer",  
            kind="azureOpenAI",  
            azure_open_ai_parameters=AzureOpenAIParameters(  
                resource_uri=openai_endpoint,  
                deployment_id=openai_embedding_deployment,  
                model_name=openai_model_name,  
                api_key=openai_api_key,  
            ),  
        ),  
    ],  
)  
  
# Configure semantic search settings  
semantic_config = SemanticConfiguration(  
    name="semantic-config",  
    prioritized_fields=SemanticPrioritizedFields(  
        title_field=SemanticField(field_name="title"),  
        content_fields=[SemanticField(field_name="chunk")]  
    )  
)  
  
semantic_search = SemanticSearch(configurations=[semantic_config])  
  
# Create the search index with the defined schema and configurations  
index = SearchIndex(  
    name=index_name,  
    fields=fields,  
    vector_search=vector_search,  
    semantic_search=semantic_search  
)  
  
# Create or update the index in Azure AI Search  
search_index_client.create_or_update_index(index)  
print(f"Index '{index.name}' created or updated.")  

Index 'indexer-demo' created or updated.
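
In the schema above, each vector search profile refers to its algorithm and vectorizer by name, so a typo in any of those strings leaves a dangling reference. The sketch below checks such references locally using plain dictionaries rather than SDK objects. The function name and the dictionary layout are assumptions made for illustration; the names themselves are the ones used in the cell above.

```python
def find_dangling_profile_references(profiles, algorithm_names, vectorizer_names):
    """Return messages for profiles that reference undefined names."""
    problems = []
    for profile in profiles:
        if profile["algorithm"] not in algorithm_names:
            problems.append(
                f"{profile['name']}: unknown algorithm '{profile['algorithm']}'"
            )
        vectorizer = profile.get("vectorizer")
        # A profile may omit the vectorizer; only check it when present.
        if vectorizer is not None and vectorizer not in vectorizer_names:
            problems.append(f"{profile['name']}: unknown vectorizer '{vectorizer}'")
    return problems


# Plain-dict mirror of the profiles defined in the index cell.
profiles = [
    {
        "name": "textVectorSearchProfile",
        "algorithm": "hnswAlgorithm",
        "vectorizer": "AzureOpenAIVectorizer",
    },
    {
        "name": "imageVectorSearchProfile",
        "algorithm": "hnswAlgorithm",
        "vectorizer": "AIServicesVisionVectorizer",
    },
]

problems = find_dangling_profile_references(
    profiles,
    algorithm_names={"hnswAlgorithm"},
    vectorizer_names={"AzureOpenAIVectorizer", "AIServicesVisionVectorizer"},
)
print(problems or "All profile references resolve.")
```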


## Section 4: Create a Skillset  
   
Define a skillset for data enrichment, including skills for splitting documents, generating embeddings, and processing images.

In [15]:
# Import required classes for creating the skillset  
from azure.search.documents.indexes.models import (  
    SplitSkill,  
    InputFieldMappingEntry,  
    OutputFieldMappingEntry,  
    AzureOpenAIEmbeddingSkill,  
    VisionVectorizeSkill,  
    SearchIndexerIndexProjections,  
    SearchIndexerIndexProjectionSelector,  
    SearchIndexerIndexProjectionsParameters,  
    IndexProjectionMode,  
    SearchIndexerSkillset,  
    CognitiveServicesAccountKey  
)  
  
# Define the SplitSkill to split documents into smaller chunks (pages)  
split_skill = SplitSkill(  
    name="SplitSkill",  
    description="Split documents into pages for chunking",  
    context="/document",  
    text_split_mode="pages",  
    maximum_page_length=2000,  
    page_overlap_length=500,  
    inputs=[  
        InputFieldMappingEntry(name="text", source="/document/content"),  
    ],  
    outputs=[  
        OutputFieldMappingEntry(name="textItems", target_name="pages")  
    ],  
)  
  
# Define the VisionVectorizeSkill for image processing  
vision_vectorize_skill = VisionVectorizeSkill(  
    name="VisionVectorizeSkill",  
    description="Generate vector representations of images",  
    context="/document/normalized_images/*",  
    inputs=[  
        InputFieldMappingEntry(name="image", source="/document/normalized_images/*"),  
    ],  
    outputs=[  
        OutputFieldMappingEntry(name="vector", target_name="image_vector")  
    ],  
    model_version="2023-04-15"  
)  
  
# Define the AzureOpenAIEmbeddingSkill for text embeddings  
openai_embedding_skill = AzureOpenAIEmbeddingSkill(  
    name="AzureOpenAIEmbeddingSkill",  
    description="Generate text embeddings using Azure OpenAI",  
    context="/document/pages/*",  
    resource_uri=openai_endpoint,  
    deployment_id=openai_embedding_deployment,  
    model_name=openai_model_name,  
    dimensions=openai_model_dimensions,  
    api_key=openai_api_key,  
    inputs=[  
        InputFieldMappingEntry(name="text", source="/document/pages/*"),  
    ],  
    outputs=[  
        OutputFieldMappingEntry(name="embedding", target_name="text_vector")  
    ],  
)  
  
# Define index projections to map the output of the skillset to the search index  
index_projections = SearchIndexerIndexProjections(  
    selectors=[  
        # Selector for text documents  
        SearchIndexerIndexProjectionSelector(  
            target_index_name=index_name,  
            parent_key_field_name="text_parent_id",  
            source_context="/document/pages/*",  
            mappings=[  
                InputFieldMappingEntry(  
                    name="chunk",  
                    source="/document/pages/*"  
                ),  
                InputFieldMappingEntry(  
                    name="text_vector",  
                    source="/document/pages/*/text_vector"  
                ),  
                InputFieldMappingEntry(  
                    name="title",  
                    source="/document/metadata_storage_name"  
                ),  
            ],  
        ),  
        # Selector for image documents  
        SearchIndexerIndexProjectionSelector(  
            target_index_name=index_name,  
            parent_key_field_name="image_parent_id",  
            source_context="/document/normalized_images/*",  
            mappings=[  
                InputFieldMappingEntry(  
                    name="image_vector",  
                    source="/document/normalized_images/*/image_vector"  
                ),  
            ],  
        ),  
    ],  
    parameters=SearchIndexerIndexProjectionsParameters(  
        projection_mode=IndexProjectionMode.SKIP_INDEXING_PARENT_DOCUMENTS  
    ),  
)  
  
# Combine all skills into a skillset  
skills = [split_skill, openai_embedding_skill, vision_vectorize_skill]  
  
skillset_name = f"{index_name}-skillset"  
  
# Define the cognitive services account for AI enrichment  
cognitive_services_account = CognitiveServicesAccountKey(  
    key=ai_services_api_key,  
    description="Azure Cognitive Services account key for AI enrichment",  
)  
  
# Create the skillset  
skillset = SearchIndexerSkillset(  
    name=skillset_name,  
    description="Skillset for chunking documents and generating embeddings",  
    skills=skills,  
    index_projections=index_projections,  
    cognitive_services_account=cognitive_services_account,  
)  
  
# Create or update the skillset in Azure AI Search  
indexer_client.create_or_update_skillset(skillset)  
print(f"Skillset '{skillset.name}' created or updated.")  

Skillset 'indexer-demo-skillset' created or updated.
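
To build intuition for the `maximum_page_length=2000` and `page_overlap_length=500` settings used by the split skill, the sketch below applies a plain character-based sliding window with the same two numbers. This is a simplification for illustration only: it is not the service's implementation, and the function name `split_into_pages` is hypothetical. The real skill may choose break points differently, for example at sentence boundaries.

```python
def split_into_pages(text, maximum_page_length=2000, page_overlap_length=500):
    """Split text into fixed-size windows that overlap by a fixed amount."""
    if maximum_page_length <= 0:
        raise ValueError("maximum_page_length must be positive")
    if not 0 <= page_overlap_length < maximum_page_length:
        raise ValueError("page_overlap_length must be in [0, maximum_page_length)")

    # Each new page starts this many characters after the previous one.
    step = maximum_page_length - page_overlap_length
    pages = []
    start = 0
    while start < len(text):
        pages.append(text[start:start + maximum_page_length])
        if start + maximum_page_length >= len(text):
            break  # the current page already reaches the end of the text
        start += step
    return pages


sample = "x" * 5000
pages = split_into_pages(sample)
print([len(page) for page in pages])
```

With these settings each page shares its last 500 characters with the start of the next one, which helps keep a sentence that straddles a boundary intact in at least one chunk.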


## Section 5: Create and Run the Indexer  
   
Configure and run the indexer to process data from the data source, apply the skillset, and index the documents. 

In [16]:
# Import required classes for creating the indexer  
from azure.search.documents.indexes.models import (  
    SearchIndexer,  
    FieldMapping,  
    IndexingParameters,  
    IndexingParametersConfiguration,  
)  
  
# Define the indexer name  
indexer_name = f"{index_name}-indexer"  
  
# Configure indexing parameters  
indexing_parameters = IndexingParameters(  
    configuration=IndexingParametersConfiguration(  
        image_action="generateNormalizedImages",  # Generate normalized images for processing  
        query_timeout=None,  # Omit the query timeout, which does not apply to blob data sources  
        data_to_extract="contentAndMetadata",  
    )  
)  
  
# Create the indexer  
indexer = SearchIndexer(  
    name=indexer_name,  
    description="Indexer to process documents and generate embeddings",  
    skillset_name=skillset_name,  
    target_index_name=index_name,  
    data_source_name=data_source.name,  
    # Map the metadata_storage_name field to the title field in the index  
    field_mappings=[  
        FieldMapping(  
            source_field_name="metadata_storage_name",  
            target_field_name="title"  
        )  
    ],  
    parameters=indexing_parameters,  
)  
  
# Create or update the indexer in Azure AI Search  
indexer_client.create_or_update_indexer(indexer)  
  
# Run the indexer to start indexing data  
indexer_client.run_indexer(indexer_name)  
print(f"Indexer '{indexer_name}' created and running.")  

Indexer 'indexer-demo-indexer' created and running.
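
`run_indexer` only starts the run, so a query issued immediately afterwards may find an empty or partial index. One option is to poll the indexer status until the run is no longer in progress. The sketch below keeps the polling loop independent of the SDK by accepting a callable that returns the current status string; the function name, the timeout values, and the simulated statuses are assumptions for illustration.

```python
import time


def wait_for_indexer(get_status, timeout_seconds=600, poll_interval_seconds=10,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll `get_status` until it reports something other than 'inProgress'.

    `get_status` returns a status string, or None if no result exists yet.
    Raises TimeoutError if the run has not finished within the timeout.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status is not None and status != "inProgress":
            return status
        if clock() >= deadline:
            raise TimeoutError("Indexer did not finish within the timeout")
        sleep(poll_interval_seconds)


# Simulated status sequence so the sketch runs without a search service.
fake_statuses = iter(["inProgress", "inProgress", "success"])
final_status = wait_for_indexer(lambda: next(fake_statuses), sleep=lambda _: None)
print(f"Indexer finished with status: {final_status}")

# Against the real service, a status callable could look like this
# (assumes `indexer_client` and `indexer_name` from the cells above):
#
# def current_status():
#     last = indexer_client.get_indexer_status(indexer_name).last_result
#     return last.status if last else None
#
# wait_for_indexer(current_status)
```

Note that the status reported just after `run_indexer` may still describe a previous run, so a short delay before the first poll can be worthwhile.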


## Section 6: Perform a Search and Display Results  
   
Use the search client to query the indexed data and display the results.  
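
Chunks extracted from laid-out documents often contain many line breaks, so printed results can run long. A small helper such as the following can condense each chunk to a one-line preview before printing. It is a sketch that is not part of the original notebook; the name `preview_chunk` and the 200-character limit are arbitrary choices.

```python
def preview_chunk(text, max_chars=200):
    """Collapse all whitespace runs to single spaces and truncate for display."""
    condensed = " ".join(text.split())
    if len(condensed) <= max_chars:
        return condensed
    return condensed[:max_chars].rstrip() + "..."


# Sample text shaped like a chunk with layout-induced line breaks.
sample_chunk = "London \nLondon is the capital and \n\nmost populous city of \n\nEngland"
print(preview_chunk(sample_chunk))
```

In the results loop, `preview_chunk(result['chunk'])` could then be printed in place of the raw chunk.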

In [18]:
# Import necessary classes for searching  
from azure.search.documents import SearchClient  
from azure.search.documents.models import VectorizableTextQuery  
  
# Initialize the SearchClient  
search_client = SearchClient(  
    endpoint=search_service_endpoint,  
    index_name=index_name,  
    credential=search_api_key,  
)  
  
# Define the search query  
query_text = "London"  # Query text  
  
# Create a vectorizable text query; the service embeds the text using the index's vectorizer  
vector_query = VectorizableTextQuery(  
    text=query_text,  
    k_nearest_neighbors=3,  
    fields="text_vector",  # Use the text vector field for vector search  
)  
  
# Perform a hybrid search that combines the keyword query with the vector query  
results = search_client.search(  
    search_text=query_text,  
    vector_queries=[vector_query],  
    top=3  # Retrieve the top 3 results  
)  
  
# Print the results  
for result in results:  
    print(f"Chunk: {result['chunk']}\n")  
    print(f"Score: {result['@search.score']}\n")  

Chunk: Margie’s Travel Presents… 

London 
London is the capital and 

most populous city of 

England and the United 

Kingdom. Standing on the 

River Thames in the south 

east of the island of Great 

Britain, London has been 

a major settlement for two 

millennia. It was founded 

by the Romans, who 

named it Londinium. 

London's ancient core, the 

City of London, largely 

retains its 1.12-square- 

mile medieval boundaries. 

Since at least the 19th century, London has also referred to the metropolis around this core, 

historically split between Middlesex, Essex, Surrey, Kent, and Hertfordshire, which today largely 

makes up Greater London, governed by the Mayor of London and the London Assembly. 

 

 
Mostly popular for: 
Leisure, Outdoors, Historical, Arts 
& Culture 

Best time to visit: 
Jun-Aug 
Averag Precipitation: 1.9 in 
Average Temperature: 56-67°F 

 
 
 

 

London Hotels 

Margie’s Travel offers the following accommodation options in London: 

The Buckingham