
<div style="background: linear-gradient(90deg, #00a4ef, #7fba00, #ffb900, #f25022); padding: 20px; border-radius: 10px; text-align: left; color: black;">
    <h1> 🔍 | Step 0: Setup Zava Product Index </h1>
    <p>
    This notebook sets up an Azure AI Search index for the Zava product catalog and enables semantic search through vector embeddings generated by Azure OpenAI. NOTE: Run the script that updates the RBAC roles before running this notebook.
    </p>
</div>



---

This notebook creates a product search index for the Zava product catalog. It reads a [product catalog](products.csv) CSV file, generates an embedding for each product description with Azure OpenAI, and uploads the resulting documents to an Azure AI Search index.

## Prerequisites

You need the following Azure services configured:
- **Azure AI Search Service** - for indexing and searching products
- **Azure OpenAI Service** - for generating text embeddings

You need the following embedding model deployed:
- **`text-embedding-ada-002`**

You need the `products-paints.csv` file in a `data` folder next to this notebook (the upload cell reads `data/products-paints.csv`).
- It should contain about 50 product items to index

The service endpoints and the index name should be stored in a `.env` file in the root of this repository. You can use the [`.env.sample`](../../.env.sample) file as a template.

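Required environment variables:
- `AZURE_AISEARCH_ENDPOINT` - Your Azure AI Search service endpoint
- `AZURE_AISEARCH_INDEX` - Name of the search index to create (for example, `zava-products`)
- `AZURE_OPENAI_ENDPOINT` - Your Azure OpenAI service endpoint

Log in with the Azure CLI so that `DefaultAzureCredential` can authenticate: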
- `az login`
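
Before running the cells, it can help to confirm that the variables the notebook reads through `os.environ` are actually set. The following is a minimal sketch, not part of the notebook itself; the helper name `missing_env_vars` and the `REQUIRED_VARS` tuple are hypothetical and only collect the three names used by the code cells.

```python
import os
from typing import Iterable, List, Mapping

# Names the notebook's code cells look up in os.environ.
REQUIRED_VARS = (
    "AZURE_AISEARCH_ENDPOINT",
    "AZURE_AISEARCH_INDEX",
    "AZURE_OPENAI_ENDPOINT",
)


def missing_env_vars(
    required: Iterable[str] = REQUIRED_VARS,
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the names in `required` that are unset or blank in `env`."""
    return [name for name in required if not env.get(name, "").strip()]


missing = missing_env_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Calling this after `load_dotenv()` surfaces a missing entry up front instead of as a `KeyError` in a later cell.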

---

## 1. Import Dependencies, Load Env Vars

This section imports all necessary Python libraries for:
- **Azure AI Search**: Creating and managing search indexes
- **Azure OpenAI**: Generating text embeddings for vector search
- **Data Processing**: Reading and processing the product catalog CSV file
- **Authentication**: Using Azure Default Credentials for secure service access

In [1]:
import os
import pandas as pd
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    HnswParameters,
    HnswAlgorithmConfiguration,
    SemanticPrioritizedFields,
    SearchableField,
    SearchField,
    SearchFieldDataType,
    SearchIndex,
    SemanticSearch,
    SemanticConfiguration,
    SemanticField,
    SimpleField,
    VectorSearch,
    VectorSearchAlgorithmKind,
    VectorSearchAlgorithmMetric,
    ExhaustiveKnnAlgorithmConfiguration,
    ExhaustiveKnnParameters,
    VectorSearchProfile,
)
from typing import List, Dict
from openai import AzureOpenAI
from dotenv import load_dotenv

from pathlib import Path

load_dotenv()

True

## 2. Utility Functions

### Delete Existing Index

This function safely deletes an existing search index if it exists. This ensures we start with a clean slate when creating our Zava product index.

In [2]:
def delete_index(search_index_client: SearchIndexClient, search_index: str):
    print(f"deleting index {search_index}")
    search_index_client.delete_index(search_index)

## 3. Search Index Definition

### Create Index Schema

This function defines the structure of our Azure AI Search index for Zava products. The index includes:

- **Standard fields**: `id`, `content`, `filepath`, `title`, `url`
- **Product fields**: `price` (filterable/sortable), `stock` (filterable/sortable)
- **Vector field**: `contentVector` - 1536-dimensional embeddings for semantic search
- **Semantic search**: Configured to prioritize product content and titles
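- **Vector search algorithms**: HNSW (approximate nearest neighbor, used by `contentVector` through `myHnswProfile`) and exhaustive KNN (exact search, defined as `myExhaustiveKnnProfile` but not assigned to a field)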

In [3]:
def create_index_definition(name: str) -> SearchIndex:
    """
    Returns an Azure AI Search index definition with the given name.
    """
    # The fields we want to index. The "contentVector" field is a vector field that
    # will be used for vector search.
    fields = [
        SimpleField(name="id", type=SearchFieldDataType.String, key=True),
        SearchableField(name="content", type=SearchFieldDataType.String),
        SimpleField(name="filepath", type=SearchFieldDataType.String),
        SearchableField(name="title", type=SearchFieldDataType.String),
        SimpleField(name="url", type=SearchFieldDataType.String),
        SimpleField(name="price", type=SearchFieldDataType.Double, filterable=True, sortable=True),
        SimpleField(name="stock", type=SearchFieldDataType.Int32, filterable=True, sortable=True),
        SearchField(
            name="contentVector",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            searchable=True,
            # Size of the vector created by the text-embedding-ada-002 model.
            vector_search_dimensions=1536,
            vector_search_profile_name="myHnswProfile",
        ),
    ]

    # The "content" field should be prioritized for semantic ranking.
    semantic_config = SemanticConfiguration(
        name="default",
        prioritized_fields=SemanticPrioritizedFields(
            title_field=SemanticField(field_name="title"),
            keywords_fields=[],
            content_fields=[SemanticField(field_name="content")],
        ),
    )

    # For vector search, we want to use the HNSW (Hierarchical Navigable Small World)
    # algorithm (a type of approximate nearest neighbor search algorithm) with cosine
    # distance.
    vector_search = VectorSearch(
        algorithms=[
            HnswAlgorithmConfiguration(
                name="myHnsw",
                kind=VectorSearchAlgorithmKind.HNSW,
                parameters=HnswParameters(
                    m=4,
                    ef_construction=400,
                    ef_search=500,
                    metric=VectorSearchAlgorithmMetric.COSINE,
                ),
            ),
            ExhaustiveKnnAlgorithmConfiguration(
                name="myExhaustiveKnn",
                kind=VectorSearchAlgorithmKind.EXHAUSTIVE_KNN,
                parameters=ExhaustiveKnnParameters(
                    metric=VectorSearchAlgorithmMetric.COSINE
                ),
            ),
        ],
        profiles=[
            VectorSearchProfile(
                name="myHnswProfile",
                algorithm_configuration_name="myHnsw",
            ),
            VectorSearchProfile(
                name="myExhaustiveKnnProfile",
                algorithm_configuration_name="myExhaustiveKnn",
            ),
        ],
    )

    # Create the semantic settings with the configuration
    semantic_search = SemanticSearch(configurations=[semantic_config])

    # Create the search index.
    index = SearchIndex(
        name=name,
        fields=fields,
        semantic_search=semantic_search,
        vector_search=vector_search,
    )

    return index
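
Both algorithm configurations above use the cosine metric. As a rough illustration of what an exhaustive KNN search computes, the sketch below scores every stored vector against a query and keeps the top `k`. It is a toy model in plain Python, not how Azure AI Search is implemented; the function names and the 3-dimensional vectors are made up for the example. HNSW approximates the same ranking by walking a graph instead of scanning every vector.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def exhaustive_knn(
    query: Sequence[float],
    vectors: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the top k."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors standing in for 1536-dimensional embeddings.
toy_vectors = {
    "paint-a": [1.0, 0.0, 0.0],
    "paint-b": [0.9, 0.1, 0.0],
    "brush-c": [0.0, 1.0, 0.0],
}
top = exhaustive_knn([1.0, 0.0, 0.0], toy_vectors, k=2)
print(top)
```

The full scan costs one similarity computation per stored vector, which is why an approximate index such as HNSW is the usual default for larger collections.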

## 4. Data Processing and Embedding Generation

### Process Zava Product Catalog

This function processes the Zava product catalog CSV file and generates the data for indexing:

1. **Load product data** from the CSV file
2. **Generate embeddings** using Azure OpenAI's text-embedding-ada-002 model
3. **Create search documents** with proper formatting for the search index
4. **Return structured data** ready for upload to Azure AI Search

The function creates embeddings from product descriptions to enable semantic search capabilities.

In [4]:
def gen_zava_products(
    path: str,
    n: int = None,
) -> List[Dict[str, object]]:
    """
    Process Zava product catalog and generate embeddings for each product.
    
    Args:
        path: Path to the products.csv file
        n: Number of products to process (if None, process all products)
        
    Returns:
        List of product documents ready for indexing
    """
    openai_service_endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]
    # Deployment name of the embedding model (or os.environ["AZURE_AISEARCH_EMBEDDING"])
    openai_deployment = "text-embedding-ada-002"

    # Authenticate with Microsoft Entra ID tokens instead of an API key
    token_provider = get_bearer_token_provider(DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default")
    # Initialize Azure OpenAI client
    client = AzureOpenAI(
        api_version="2025-02-01-preview",
        azure_endpoint=openai_service_endpoint,
        azure_deployment=openai_deployment,
        azure_ad_token_provider=token_provider
    )

    # Load Zava product catalog
    products = pd.read_csv(path)
    
    # Limit to first n products if specified
    if n is not None:
        products = products.head(n)
        print(f"Processing first {len(products)} products from catalog")
    else:
        print(f"Processing all {len(products)} products from catalog")
    
    items = []
    
    for product in products.to_dict("records"):
        # Use description as the main content for embedding
        content = product["description"]
        # Use SKU as the unique identifier (cast to str in case pandas parsed it as a number)
        sku = str(product["sku"])
        title = product["name"]
        # Create URL from the lowercased SKU
        url = f"/products/{sku.lower()}"

        # Generate embedding for the product description
        emb = client.embeddings.create(input=content, model=openai_deployment)

        # Create search document
        rec = {
            "id": sku,
            "content": content,
            "filepath": sku.lower(),
            "title": title,
            "url": url,
            "price": float(product["price"]),
            "stock": int(product["stock_level"]),
            "contentVector": emb.data[0].embedding,
        }
        items.append(rec)

    return items
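
The loop above sends one embedding request per product. For larger catalogs, one option is to group descriptions into batches first. The sketch below shows only the grouping step; `chunked` is a hypothetical helper, and the batch size and sample strings are arbitrary illustrative values.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Hypothetical descriptions standing in for the CSV's `description` column.
descriptions = [f"description {i}" for i in range(5)]
batches = list(chunked(descriptions, size=2))
print([len(batch) for batch in batches])
```

Each batch could then be passed as a list to `client.embeddings.create(input=batch, model=openai_deployment)`; the entries in the response's `data` carry an `index` that maps each embedding back to its position in the batch. Check the service's per-request input limits before choosing a batch size.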

## 5. Create Search Index

### Initialize Search Service and Create Index

This section:
1. **Connects to Azure AI Search** using the endpoint from environment variables
2. **Deletes any existing index** to ensure a clean setup
3. **Creates a new search index** with the defined schema for Zava products
4. **Configures the index** with vector search and semantic search capabilities

In [5]:
# Configure Azure AI Search service connection
zava_search = os.environ["AZURE_AISEARCH_ENDPOINT"]
index_name = os.environ["AZURE_AISEARCH_INDEX"]

# Initialize search index client
search_index_client = SearchIndexClient(
    zava_search, DefaultAzureCredential()
)

# Delete existing index if it exists
delete_index(search_index_client, index_name)

# Create new index with defined schema
index = create_index_definition(index_name)
print(f"Creating index {index_name}")
search_index_client.create_or_update_index(index)
print(f"Index {index_name} created successfully")

deleting index zava-products
Creating index zava-products
Index zava-products created successfully


## 6. Upload Product Data to Index

### Process and Upload Zava Products

This final section:
1. **Processes the product catalog** using the `gen_zava_products()` function
2. **Generates embeddings** for all product descriptions
3. **Uploads the documents** to the Azure AI Search index
4. **Completes the indexing process** making products searchable

Once complete, the Zava product catalog will be fully indexed and ready for semantic search queries.

In [6]:
# Process Zava product catalog and generate embeddings
print("Processing Zava product catalog and generating embeddings...")
# To index the full catalog (420+ products) instead, use:
#docs = gen_zava_products("products.csv", n=None)

# products-paints.csv is a 51-item subset focused on painting; n=50 processes its first 50 rows
# (change n as needed, or set it to None to process every row)
docs = gen_zava_products("data/products-paints.csv", n=50)

# Initialize search client for document upload
search_client = SearchClient(
    endpoint=zava_search,
    index_name=index_name,
    credential=DefaultAzureCredential(),
)

Processing Zava product catalog and generating embeddings...
Processing first 50 products from catalog


In [7]:

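# Upload all product documents to the index
print(f"Uploading {len(docs)} Zava products to index {index_name}")
results = search_client.upload_documents(docs)
# Report success only if every per-document IndexingResult succeeded
failed = [r.key for r in results if not r.succeeded]
if failed:
    raise RuntimeError(f"{len(failed)} documents failed to upload: {failed}")
print(f"Successfully uploaded {len(docs)} products to the search index!")
print("The Zava product catalog is now ready for semantic search.")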

Uploading 50 Zava products to index zava-products
Successfully uploaded 50 products to the search index!
The Zava product catalog is now ready for semantic search.
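
Because `price` and `stock` are declared filterable in the index schema, queries can narrow results with an OData filter expression. The sketch below only builds the filter string; `build_product_filter` and its parameters are hypothetical names chosen for illustration.

```python
from typing import Optional


def build_product_filter(
    max_price: Optional[float] = None,
    in_stock_only: bool = False,
) -> Optional[str]:
    """Build an OData filter over the index's `price` and `stock` fields."""
    clauses = []
    if max_price is not None:
        # `le` is the OData "less than or equal" operator.
        clauses.append(f"price le {float(max_price)}")
    if in_stock_only:
        # `gt` is the OData "greater than" operator.
        clauses.append("stock gt 0")
    return " and ".join(clauses) if clauses else None


print(build_product_filter(max_price=25, in_stock_only=True))
```

The returned string can be passed as the `filter` argument of `search_client.search(...)`; a value of `None` means no filter is applied.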
