# Azure AI Search custom vectorization sample
This sample demonstrates how to use Azure AI Search as a vector store. A skillset pipeline automatically chunks documents and generates embeddings with a custom embedding skill, so you can choose whichever embedding model works for your use case.
## Prerequisites
To run the code, install the following packages. This sample currently uses version `11.4.0b12`. Note that the integrated vectorization feature is in preview and has not yet been published to [azure-search-documents](https://pypi.org/project/azure-search-documents/#description) on PyPI. To use this feature, install the package from the whl file. We hope to publish an updated version soon!

In [None]:
! pip install ../whl/azure_search_documents-11.4.0b12-py3-none-any.whl --quiet  
! pip install openai azure-storage-blob python-dotenv --quiet
! pip install sentence-transformers --quiet

## Download the embedding model used by the GetTextEmbeddings function

In [4]:
import os
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
model.save(os.path.join(os.getcwd(), "functions", "all-MiniLM-L6-v2"))

## Deploy the GetTextEmbeddings function
Deploy the `GetTextEmbeddings` function in the functions folder to your Azure subscription. Then set the environment variable `AZURE_SEARCH_CUSTOM_VECTORIZER_ENDPOINT` to the URL of the deployed function. This is the variable name that the code in this notebook reads.
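
A custom Web API skill exchanges JSON with its endpoint: the request carries a `values` array of records, each with a `recordId` and a `data` object, and the response returns one entry per `recordId`. The sketch below shows that contract for the `text` input and `vector` output used by this sample. It is not the code of the deployed function: `handle_skill_request` and `fake_embed` are hypothetical names, and `fake_embed` stands in for a real call to the sentence-transformers model.

```python
import json
from typing import Callable, List


def handle_skill_request(body: str, embed: Callable[[str], List[float]]) -> str:
    """Build a custom-skill response body from a custom-skill request body."""
    request = json.loads(body)
    results = []
    for record in request.get("values", []):
        entry = {"recordId": record["recordId"], "data": {}, "errors": [], "warnings": []}
        text = record.get("data", {}).get("text")
        if isinstance(text, str):
            # The output name "vector" matches the skill output used in this sample.
            entry["data"]["vector"] = embed(text)
        else:
            entry["errors"].append({"message": "Missing 'text' input"})
        results.append(entry)
    return json.dumps({"values": results})


def fake_embed(text: str) -> List[float]:
    # Assumption: placeholder for a real embedding call; returns a dummy vector.
    return [float(len(text)), 0.0, 1.0]


sample_request = json.dumps(
    {"values": [{"recordId": "0", "data": {"text": "hello"}}, {"recordId": "1", "data": {}}]}
)
sample_response = json.loads(handle_skill_request(sample_request, fake_embed))
print(json.dumps(sample_response, indent=2))
```

Returning a per-record error instead of failing the whole request lets the indexer report which record could not be embedded.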

## Import required libraries and environment variables

In [19]:
# Import required libraries  
from azure.core.credentials import AzureKeyCredential  
from azure.search.documents import SearchClient  
from azure.search.documents.indexes import SearchIndexClient, SearchIndexerClient  
from azure.search.documents.models import (
    QueryAnswerType,
    QueryCaptionType,
    QueryLanguage,
    QueryType,
    RawVectorQuery,
    VectorizableTextQuery,
    VectorFilterMode,    
)
from azure.search.documents.indexes.models import (  
    WebApiSkill,  
    CustomVectorizerParameters,  
    CustomVectorizer,  
    ExhaustiveKnnParameters,  
    ExhaustiveKnnVectorSearchAlgorithmConfiguration,
    FieldMapping,  
    HnswParameters,  
    HnswVectorSearchAlgorithmConfiguration,  
    IndexProjectionMode,  
    InputFieldMappingEntry,  
    OutputFieldMappingEntry,  
    PrioritizedFields,    
    SearchField,  
    SearchFieldDataType,  
    SearchIndex,  
    SearchIndexer,  
    SearchIndexerDataContainer,  
    SearchIndexerDataSourceConnection,  
    SearchIndexerIndexProjectionSelector,  
    SearchIndexerIndexProjections,  
    SearchIndexerIndexProjectionsParameters,  
    SearchIndexerSkillset,  
    SemanticConfiguration,  
    SemanticField,  
    SemanticSettings,  
    SplitSkill,  
    VectorSearch,  
    VectorSearchAlgorithmKind,  
    VectorSearchAlgorithmMetric,  
    VectorSearchProfile,  
)  
from azure.core.pipeline.policies import HTTPPolicy
from azure.storage.blob import BlobServiceClient  
from dotenv import load_dotenv  
  
# Configure environment variables  
load_dotenv()
service_endpoint = os.getenv("AZURE_SEARCH_SERVICE_ENDPOINT")  
index_name = os.getenv("AZURE_SEARCH_INDEX_NAME")  
key = os.getenv("AZURE_SEARCH_ADMIN_KEY")  
custom_vectorizer_endpoint = os.getenv("AZURE_SEARCH_CUSTOM_VECTORIZER_ENDPOINT")
blob_connection_string = os.getenv("BLOB_CONNECTION_STRING")  
container_name = os.getenv("BLOB_CONTAINER_NAME")  
credential = AzureKeyCredential(key)  
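
Because `os.getenv` returns `None` for an unset variable, a missing setting tends to surface later as a less obvious error. The following is a minimal sketch of an upfront check, assuming the variable names read in the cell above; the helper name `missing_settings` is hypothetical.

```python
import os
from typing import List, Mapping, Optional

# Variable names taken from the configuration cell of this notebook.
REQUIRED_SETTINGS = [
    "AZURE_SEARCH_SERVICE_ENDPOINT",
    "AZURE_SEARCH_INDEX_NAME",
    "AZURE_SEARCH_ADMIN_KEY",
    "AZURE_SEARCH_CUSTOM_VECTORIZER_ENDPOINT",
    "BLOB_CONNECTION_STRING",
    "BLOB_CONTAINER_NAME",
]


def missing_settings(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required settings that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


# Illustrative environment with only one value set.
example_env = {"AZURE_SEARCH_SERVICE_ENDPOINT": "https://example.search.windows.net"}
print(missing_settings(example_env))
```

Calling `missing_settings()` with no argument checks the real process environment after `load_dotenv()` has run.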

## Connect to Blob Storage  
Retrieve documents from Blob Storage. You can use the sample documents in the [documents](../data/documents) folder.  

In [None]:
# Connect to Blob Storage
blob_service_client = BlobServiceClient.from_connection_string(blob_connection_string)
container_client = blob_service_client.get_container_client(container_name)
blobs = container_client.list_blobs()

first_blob = next(blobs)
blob_url = container_client.get_blob_client(first_blob).url
print(f"URL of the first blob: {blob_url}")

## Connect your Blob Storage to a data source in Azure AI Search

In [5]:
# Create a data source 
ds_client = SearchIndexerClient(service_endpoint, AzureKeyCredential(key))
container = SearchIndexerDataContainer(name=container_name)
data_source_connection = SearchIndexerDataSourceConnection(
    name=f"{index_name}-blob",
    type="azureblob",
    connection_string=blob_connection_string,
    container=container
)
data_source = ds_client.create_or_update_data_source_connection(data_source_connection)

print(f"Data source '{data_source.name}' created or updated")

Data source 'customvectorizersample-blob' created or updated


## Create a search index

In [7]:
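# Workaround required to use the preview SDK: rename the property
# 'customVectorizerParameters' to 'customWebApiParameters' in the request body.
class CustomVectorizerRewritePolicy(HTTPPolicy):
    def send(self, request):
        body = request.http_request.body
        if isinstance(body, str):  # Skip requests that carry no string body, such as GET
            request.http_request.body = body.replace('customVectorizerParameters', 'customWebApiParameters')
        return self.next.send(request)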
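
To see what the workaround does, it can help to apply the same string replacement to a JSON payload outside the HTTP pipeline. The sketch below is illustrative only: `rewrite_body` is a hypothetical helper, and the payload shape (including the `kind` value and the URL) is an assumption rather than the exact body the SDK sends.

```python
import json
from typing import Optional


def rewrite_body(body: Optional[str]) -> Optional[str]:
    """Rename the vectorizer parameters property; leave missing bodies untouched."""
    if body is None:
        return None
    return body.replace("customVectorizerParameters", "customWebApiParameters")


# Assumed, simplified payload for illustration.
payload = json.dumps(
    {
        "vectorizers": [
            {
                "name": "customVectorizer",
                "kind": "customWebApi",
                "customVectorizerParameters": {"uri": "https://example.invalid/api/GetTextEmbeddings"},
            }
        ]
    }
)
rewritten = json.loads(rewrite_body(payload))
print(rewritten["vectorizers"][0])
```

The vectorizer name `customVectorizer` is unaffected, because only the full string `customVectorizerParameters` is replaced.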

In [23]:
# Create a search index  
index_client = SearchIndexClient(endpoint=service_endpoint, credential=credential, per_call_policies=[CustomVectorizerRewritePolicy()])  
fields = [  
    SearchField(name="parent_id", type=SearchFieldDataType.String, sortable=True, filterable=True, facetable=True),  
    SearchField(name="title", type=SearchFieldDataType.String),  
    SearchField(name="chunk_id", type=SearchFieldDataType.String, key=True, sortable=True, filterable=True, facetable=True, analyzer_name="keyword"),  
    SearchField(name="chunk", type=SearchFieldDataType.String, sortable=False, filterable=False, facetable=False),  
    SearchField(name="vector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single), vector_search_dimensions=384, vector_search_profile="myHnswProfile"),  
]  
  
# Configure the vector search configuration  
vector_search = VectorSearch(  
    algorithms=[  
        HnswVectorSearchAlgorithmConfiguration(  
            name="myHnsw",  
            kind=VectorSearchAlgorithmKind.HNSW,  
            parameters=HnswParameters(  
                m=4,  
                ef_construction=400,  
                ef_search=500,  
                metric=VectorSearchAlgorithmMetric.COSINE,  
            ),  
        ),  
        ExhaustiveKnnVectorSearchAlgorithmConfiguration(  
            name="myExhaustiveKnn",  
            kind=VectorSearchAlgorithmKind.EXHAUSTIVE_KNN,  
            parameters=ExhaustiveKnnParameters(  
                metric=VectorSearchAlgorithmMetric.COSINE,  
            ),  
        ),  
    ],  
    profiles=[  
        VectorSearchProfile(  
            name="myHnswProfile",  
            algorithm="myHnsw",  
            vectorizer="customVectorizer",  
        ),  
        VectorSearchProfile(  
            name="myExhaustiveKnnProfile",  
            algorithm="myExhaustiveKnn",  
            vectorizer="customVectorizer",  
        ),  
    ],  
    vectorizers=[  
        CustomVectorizer(name="customVectorizer", custom_vectorizer_parameters=CustomVectorizerParameters(uri=custom_vectorizer_endpoint))
    ],  
)  
  
semantic_config = SemanticConfiguration(  
    name="my-semantic-config",  
    prioritized_fields=PrioritizedFields(  
        prioritized_content_fields=[SemanticField(field_name="chunk")]  
    ),  
)  
  
# Create the semantic settings with the configuration  
semantic_settings = SemanticSettings(configurations=[semantic_config])  
  
# Create the search index with the semantic settings  
index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search, semantic_settings=semantic_settings)  
result = index_client.create_or_update_index(index)  
print(f"{result.name} created")  


customvectorizersample created


## Create a skillset

In [21]:
# Create a skillset  
skillset_name = f"{index_name}-skillset"  
  
split_skill = SplitSkill(  
    description="Split skill to chunk documents",  
    text_split_mode="pages",  
    context="/document",  
    maximum_page_length=300,  
    page_overlap_length=20,  
    inputs=[  
        InputFieldMappingEntry(name="text", source="/document/content"),  
    ],  
    outputs=[  
        OutputFieldMappingEntry(name="textItems", target_name="pages")  
    ],  
)  
  
embedding_skill = WebApiSkill(  
    description="Skill to generate embeddings via a custom endpoint",  
    context="/document/pages/*",
    uri=custom_vectorizer_endpoint, 
    inputs=[
        InputFieldMappingEntry(name="text", source="/document/pages/*"),  
    ],  
    outputs=[  
        OutputFieldMappingEntry(name="vector", target_name="vector")  
    ],
)  
  
index_projections = SearchIndexerIndexProjections(  
    selectors=[  
        SearchIndexerIndexProjectionSelector(  
            target_index_name=index_name,  
            parent_key_field_name="parent_id",  
            source_context="/document/pages/*",  
            mappings=[  
                InputFieldMappingEntry(name="chunk", source="/document/pages/*"),  
                InputFieldMappingEntry(name="vector", source="/document/pages/*/vector"),  
                InputFieldMappingEntry(name="title", source="/document/metadata_storage_name"),  
            ],  
        ),  
    ],  
    parameters=SearchIndexerIndexProjectionsParameters(  
        projection_mode=IndexProjectionMode.SKIP_INDEXING_PARENT_DOCUMENTS  
    ),  
)  
  
skillset = SearchIndexerSkillset(  
    name=skillset_name,  
    description="Skillset to chunk documents and generate embeddings",  
    skills=[split_skill, embedding_skill],  
    index_projections=index_projections,  
)  
  
client = SearchIndexerClient(service_endpoint, AzureKeyCredential(key))  
client.create_or_update_skillset(skillset)  
print(f"{skillset.name} created")  


customvectorizersample-skillset created
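
The split skill above produces pages of at most 300 characters with a 20-character overlap between consecutive pages. The sketch below illustrates the idea with a plain fixed-size window. It is a simplification, not the service's algorithm: the real skill also tries to break at sentence boundaries, and `split_with_overlap` is a hypothetical name.

```python
from typing import List


def split_with_overlap(text: str, max_length: int = 300, overlap: int = 20) -> List[str]:
    """Cut text into windows of max_length characters that share `overlap` characters."""
    if max_length <= overlap:
        raise ValueError("max_length must be greater than overlap")
    step = max_length - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_length])
        if start + max_length >= len(text):
            break  # The last window already reaches the end of the text.
        start += step
    return chunks


# Synthetic 700-character document for illustration.
sample = "".join(str(i % 10) for i in range(700))
pages = split_with_overlap(sample)
print([len(page) for page in pages])
```

The overlap means the end of one page is repeated at the start of the next, which reduces the chance that a sentence relevant to a query is cut in half across two chunks.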


## Create an indexer

In [22]:
# Create an indexer  
indexer_name = f"{index_name}-indexer"  
  
indexer = SearchIndexer(  
    name=indexer_name,  
    description="Indexer to index documents and generate embeddings",  
    skillset_name=skillset_name,  
    target_index_name=index_name,  
    data_source_name=data_source.name,  
    # Map the metadata_storage_name field to the title field in the index to display the PDF title in the search results  
    field_mappings=[FieldMapping(source_field_name="metadata_storage_name", target_field_name="title")]  
)  
  
indexer_client = SearchIndexerClient(service_endpoint, AzureKeyCredential(key))  
indexer_result = indexer_client.create_or_update_indexer(indexer)  
  
# Run the indexer  
indexer_client.run_indexer(indexer_name)  
print(f' {indexer_name} created')  


 customvectorizersample-indexer created


## Get the status of the indexer

This code gets the status of the indexer. Alternatively, you can view the status in the Azure portal on the Indexers tab.

In [26]:
# Get the status of the indexer  
indexer_status = indexer_client.get_indexer_status(indexer_name)
print(f"Indexer status: {indexer_status.last_result.status}")

Indexer status: success
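
An indexer run takes time, so a single status check right after `run_indexer` may not yet show a finished run. One option is to poll until the run completes. The sketch below is generic and does not call the SDK: `wait_until_finished` is a hypothetical helper that accepts any callable returning the current status, and the `"inProgress"` value is an assumption about how an unfinished run is reported.

```python
import time
from typing import Callable, Optional


def wait_until_finished(
    get_status: Callable[[], Optional[str]],
    poll_seconds: float = 5.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status until it reports something other than None or 'inProgress'."""
    for _ in range(max_polls):
        status = get_status()
        if status is not None and status != "inProgress":
            return status
        sleep(poll_seconds)
    raise TimeoutError("Indexer did not finish within the polling budget")


# Simulated status sequence for illustration; in the notebook, get_status would
# wrap indexer_client.get_indexer_status(indexer_name) instead.
fake_statuses = iter([None, "inProgress", "success"])
final_status = wait_until_finished(lambda: next(fake_statuses), poll_seconds=0)
print(f"Indexer status: {final_status}")
```

Treating `None` as "not finished" covers the case where no completed run has been recorded yet.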


## Perform a vector similarity search

This example shows a pure vector search using a vectorizable text query. You only need to pass in text; the vectorizer configured on the index handles the query vectorization.

In [27]:
# Pure Vector Search
query = "What is contoso corporation?"  
  
search_client = SearchClient(service_endpoint, index_name, credential=credential)
vector_query = VectorizableTextQuery(text=query, k=1, fields="vector", exhaustive=True)
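# To pass a precomputed vector instead of relying on query vectorization, use the query below.
# Note: generate_embeddings is not defined in this notebook; supply your own embedding function.
# vector_query = RawVectorQuery(vector=generate_embeddings(query), k=3, fields="vector")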
  
results = search_client.search(  
    search_text=None,  
    vector_queries= [vector_query],
    select=["parent_id", "chunk_id", "chunk"],
    top=1
)  
  
for result in results:  
    print(f"parent_id: {result['parent_id']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"Content: {result['chunk']}")  


parent_id: aHR0cHM6Ly9tYWdvdHRlaS5ibG9iLmNvcmUud2luZG93cy5uZXQvY3VzdG9tdmVjdG9yaXplcnNhbXBsZS9NU0ZUX2Nsb3VkX2FyY2hpdGVjdHVyZV9jb250b3NvLnBkZg2
Score: 0.8214695
Content: Architects

Contoso s offices around the world follow a three tier design.

The Contoso Corporation is a global business with headquarters in Paris, France. It is a 

conglomerate manufacturing, sales, and support organization with over 100,000 products.
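
The score printed above is not the raw cosine similarity. For the cosine metric, the Azure AI Search documentation describes `@search.score` as `1 / (1 + cosine distance)`, where the distance is `1 - cosine similarity`. Assuming that mapping applies to this index, the score can be converted back as sketched below; `score_to_cosine_similarity` is a hypothetical helper name.

```python
def score_to_cosine_similarity(score: float) -> float:
    """Invert score = 1 / (1 + distance), with distance = 1 - cosine similarity."""
    if score <= 0:
        raise ValueError("score must be positive")
    cosine_distance = (1.0 - score) / score
    return 1.0 - cosine_distance


# Score taken from the vector search output above.
similarity = score_to_cosine_similarity(0.8214695)
print(f"Approximate cosine similarity: {similarity:.4f}")
```

Under this mapping a score of 1.0 corresponds to identical direction, and lower scores correspond to larger angles between the query and document vectors.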


## Perform a hybrid search

In [28]:
# Hybrid Search
query = "What is contoso corporation?"  
  
search_client = SearchClient(service_endpoint, index_name, credential=credential)
vector_query = VectorizableTextQuery(text=query, k=1, fields="vector", exhaustive=True)
  
results = search_client.search(  
    search_text=query,  
    vector_queries= [vector_query],
    select=["parent_id", "chunk_id", "chunk"],
    top=1
)  
  
for result in results:  
    print(f"parent_id: {result['parent_id']}")  
    print(f"chunk_id: {result['chunk_id']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"Content: {result['chunk']}")  


parent_id: aHR0cHM6Ly9tYWdvdHRlaS5ibG9iLmNvcmUud2luZG93cy5uZXQvY3VzdG9tdmVjdG9yaXplcnNhbXBsZS9NU0ZUX2Nsb3VkX2FyY2hpdGVjdHVyZV9jb250b3NvLnBkZg2
chunk_id: a0767aa61fbe_aHR0cHM6Ly9tYWdvdHRlaS5ibG9iLmNvcmUud2luZG93cy5uZXQvY3VzdG9tdmVjdG9yaXplcnNhbXBsZS9NU0ZUX2Nsb3VkX2FyY2hpdGVjdHVyZV9jb250b3NvLnBkZg2_pages_7
Score: 0.03279569745063782
Content: Architects

Contoso s offices around the world follow a three tier design.

The Contoso Corporation is a global business with headquarters in Paris, France. It is a 

conglomerate manufacturing, sales, and support organization with over 100,000 products.
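
The hybrid score is much smaller than the vector-only score because hybrid queries merge the keyword and vector rankings with Reciprocal Rank Fusion (RRF) rather than reporting a similarity value. The sketch below shows the general RRF formula. The constant `k = 60` and the 1-based ranks are assumptions for illustration, and the document IDs are made up, so the numbers are not expected to reproduce the score above.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Sum 1 / (k + rank) for each document across all ranked lists."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Return documents ordered from highest to lowest fused score.
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


# Hypothetical rankings from a keyword query and a vector query.
keyword_ranking = ["chunk_7", "chunk_2", "chunk_9"]
vector_ranking = ["chunk_7", "chunk_9"]
fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
for doc_id, score in fused.items():
    print(f"{doc_id}: {score:.6f}")
```

Because RRF only uses rank positions, a document that appears near the top of both lists outranks one that scores highly in only one of them.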


## Perform a hybrid search with semantic reranking

In [29]:
# Semantic Hybrid Search
query = "What is contoso corporation?"

search_client = SearchClient(service_endpoint, index_name, AzureKeyCredential(key))
vector_query = VectorizableTextQuery(text=query, k=2, fields="vector", exhaustive=True)

results = search_client.search(  
    search_text=query,
    vector_queries=[vector_query],
    select=["parent_id", "chunk_id", "chunk"],
    query_type=QueryType.SEMANTIC,
    query_language=QueryLanguage.EN_US,
    semantic_configuration_name='my-semantic-config',
    query_caption=QueryCaptionType.EXTRACTIVE,
    query_answer=QueryAnswerType.EXTRACTIVE,
    top=2
)

semantic_answers = results.get_answers()
for answer in semantic_answers:
    if answer.highlights:
        print(f"Semantic Answer: {answer.highlights}")
    else:
        print(f"Semantic Answer: {answer.text}")
    print(f"Semantic Answer Score: {answer.score}\n")

for result in results:
    print(f"parent_id: {result['parent_id']}")  
    print(f"chunk_id: {result['chunk_id']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"Content: {result['chunk']}")  

    captions = result["@search.captions"]
    if captions:
        caption = captions[0]
        if caption.highlights:
            print(f"Caption: {caption.highlights}\n")
        else:
            print(f"Caption: {caption.text}\n")


Semantic Answer: Architects

Contoso s offices around the world follow a three tier design.

The Contoso Corporation is a global business with headquarters in Paris, France. It is<em> a 

conglomerate manufacturing, sales, and support organization</em> with over 100,000 products..
Semantic Answer Score: 0.99560546875

parent_id: aHR0cHM6Ly9tYWdvdHRlaS5ibG9iLmNvcmUud2luZG93cy5uZXQvY3VzdG9tdmVjdG9yaXplcnNhbXBsZS9NU0ZUX2Nsb3VkX2FyY2hpdGVjdHVyZV9jb250b3NvLnBkZg2
chunk_id: a0767aa61fbe_aHR0cHM6Ly9tYWdvdHRlaS5ibG9iLmNvcmUud2luZG93cy5uZXQvY3VzdG9tdmVjdG9yaXplcnNhbXBsZS9NU0ZUX2Nsb3VkX2FyY2hpdGVjdHVyZV9jb250b3NvLnBkZg2_pages_7
Score: 0.03279569745063782
Content: Architects

Contoso s offices around the world follow a three tier design.

The Contoso Corporation is a global business with headquarters in Paris, France. It is a 

conglomerate manufacturing, sales, and support organization with over 100,000 products.
Caption: Architects

Contoso s offices around the world follow a three tier design.
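
The `parent_id` values in the results look opaque, but they appear to be URL-safe Base64 encodings of the source blob URL with a trailing digit that records how many `=` padding characters were removed. The sketch below decodes one under that assumption; `decode_index_key` is a hypothetical helper, not part of the SDK.

```python
import base64


def decode_index_key(encoded: str) -> str:
    """Decode a URL-safe Base64 key whose last character is the padding count."""
    padding_count = int(encoded[-1])
    payload = encoded[:-1] + "=" * padding_count
    return base64.urlsafe_b64decode(payload).decode("utf-8")


# parent_id copied from the search output in this notebook.
parent_id = (
    "aHR0cHM6Ly9tYWdvdHRlaS5ibG9iLmNvcmUud2luZG93cy5uZXQvY3VzdG9tdmVjdG9yaXplcnNhbXBsZS9N"
    "U0ZUX2Nsb3VkX2FyY2hpdGVjdHVyZV9jb250b3NvLnBkZg2"
)
decoded_url = decode_index_key(parent_id)
print(decoded_url)
```

Decoding the key is a convenient way to trace a chunk back to the document it came from.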