# Azure OpenAI and AI Search Pipeline for Menu Ingestion

## 1: Notebook Introduction
This notebook demonstrates how to:
1. Configure Azure OpenAI and Azure AI Search services.
2. Prepare the JSON data for ingestion into Azure AI Search.
3. Upload the prepared data to Azure AI Search for hybrid semantic search capabilities.


## 2: Install Required Packages

### Description
This cell installs all the necessary packages required for the notebook. 
It ensures that all dependencies are met before proceeding with the rest of the notebook.


In [2]:
%pip install azure-core azure-search-documents python-dotenv langchain-openai langchain-community openai pydantic tenacity pdf2image pytesseract

Note: you may need to restart the kernel to use updated packages.


## 3: Imports and Environment Setup

### Description
This cell imports necessary libraries and loads environment variables using `dotenv`. 
Ensure your `.env` file is properly set up with the required Azure API keys and endpoints.

In [3]:
# Import required libraries
from azure.core.credentials import AzureKeyCredential
from azure.core.exceptions import HttpResponseError
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SimpleField,
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    AzureOpenAIVectorizer,
    AzureOpenAIVectorizerParameters,
    SemanticConfiguration,
    SemanticPrioritizedFields,
    SemanticField,
    SemanticSearch
)
from dotenv import load_dotenv
from langchain_community.vectorstores.azuresearch import AzureSearch
from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from openai import AzureOpenAI
from pydantic import BaseModel
from tenacity import retry, stop_after_attempt, wait_exponential
from typing import List

import base64
import json
import os
import openai
import re

# Load environment variables
load_dotenv()


True
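
`load_dotenv()` returns `True` as soon as it finds a `.env` file, which does not guarantee that every variable is present. A minimal sketch of a pre-flight check follows; the variable names are the ones read in the configuration cell, while `REQUIRED_ENV_VARS` and `find_missing_env_vars` are illustrative names that are not part of the original notebook.

```python
import os

# Variable names taken from the configuration cell of this notebook.
REQUIRED_ENV_VARS = [
    "AZURE_OPENAI_EASTUS_ENDPOINT",
    "AZURE_OPENAI_EASTUS_API_KEY",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_EMBEDDING_DEPLOYMENT",
    "AZURE_SEARCH_ENDPOINT",
    "AZURE_SEARCH_API_KEY",
    "AZURE_SEARCH_INDEX",
]

def find_missing_env_vars(names, env=None):
    """Return the names that are unset or blank in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not (env.get(name) or "").strip()]

missing = find_missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Running this right after `load_dotenv()` surfaces a misnamed or empty variable before any Azure client is created.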

## 4: Azure OpenAI and Azure AI Search Configuration

### Description
This cell reads the Azure OpenAI and Azure AI Search settings from environment variables and creates the clients used in the rest of the notebook.
Ensure that the endpoints, API keys, and deployment names in the `.env` file match your Azure resource setup.

In [4]:
# Azure OpenAI setup
aoai_eastus_endpoint = os.getenv("AZURE_OPENAI_EASTUS_ENDPOINT")
aoai_eastus_api_key = os.getenv("AZURE_OPENAI_EASTUS_API_KEY")
aoai_gpt4o_deployment = os.getenv("AZURE_OPENAI_GPT4O_DEPLOYMENT")
aoai_gpt4o_mini_deployment = os.getenv("AZURE_OPENAI_GPT4O_MINI_DEPLOYMENT")
aoai_openai_api_version = os.getenv("AZURE_OPENAI_API_VERSION")
aoai_embedding_deployment = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT")

# Initialize the Azure OpenAI client
aoai_client = AzureOpenAI(
    azure_endpoint=aoai_eastus_endpoint,
    api_version=aoai_openai_api_version,
    api_key=aoai_eastus_api_key,
)

# Azure AI Search credentials
search_service_endpoint = os.getenv("AZURE_SEARCH_ENDPOINT")
search_api_key = os.getenv("AZURE_SEARCH_API_KEY")
index_name = os.getenv("AZURE_SEARCH_INDEX")
search_client = SearchClient(endpoint=search_service_endpoint, index_name=index_name, credential=AzureKeyCredential(search_api_key))
search_index_client = SearchIndexClient(endpoint=search_service_endpoint, credential=AzureKeyCredential(search_api_key))

print("Azure OpenAI and Azure Search clients initialized successfully.")

Azure OpenAI and Azure Search clients initialized successfully.


## 5: Define and Create/Update Index Schema with Semantic Configuration

### Description
This cell defines the schema for the Azure AI Search index, including fields for semantic search and vector search capabilities. It also includes logic to delete the existing index if it exists and create or update the index schema with the new configuration.

In [17]:

# Define and Create/Update Index Schema with Semantic Configuration
index_schema = SearchIndex(
    name=index_name,
    fields=[
        # Unique identifier for each menu item
        SimpleField(name="id", type=SearchFieldDataType.String, key=True, sortable=True, filterable=True),
        
        # Fields for semantic search
        SearchField(name="category", type=SearchFieldDataType.String, sortable=True, filterable=True, facetable=True),
        SearchField(name="name", type=SearchFieldDataType.String, sortable=True, filterable=True, facetable=True),
        SearchField(name="description", type=SearchFieldDataType.String),
        SearchField(name="longDescription", type=SearchFieldDataType.String),  # Added long description
        SearchField(name="origin", type=SearchFieldDataType.String, filterable=True, facetable=True),
        SearchField(name="caffeineContent", type=SearchFieldDataType.String, filterable=True),  # Treat as string for now
        SearchField(name="brewingMethod", type=SearchFieldDataType.String, filterable=True),
        SearchField(name="popularity", type=SearchFieldDataType.String, filterable=True, facetable=True),
        
        # Sizes stored as a JSON-encoded string
        SearchField(name="sizes", type=SearchFieldDataType.String, filterable=False, facetable=False),

        # Embedding field for vector search
        SearchField(
            name="embedding", 
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single), 
            vector_search_dimensions=3072,  # Must match the embedding model's output size (3072 for text-embedding-3-large)
            vector_search_profile_name="menuHnswProfile"
        )
    ],
    vector_search=VectorSearch(
        algorithms=[
            HnswAlgorithmConfiguration(
                name="menuHnsw",
                kind="hnsw",
                parameters={
                    "m": 10,  # Links per node: higher improves recall but uses more memory
                    "efConstruction": 200  # Candidates considered while building the graph: higher improves recall but slows indexing
                }
            )
        ],
        profiles=[
            VectorSearchProfile(
                name="menuHnswProfile",
                algorithm_configuration_name="menuHnsw",
                vectorizer_name="menuVectorizer"
            )
        ],
        vectorizers=[
            AzureOpenAIVectorizer(
                vectorizer_name="menuVectorizer",
                parameters=AzureOpenAIVectorizerParameters(
                    resource_url=aoai_eastus_endpoint,
                    deployment_name=aoai_embedding_deployment,
                    model_name=aoai_embedding_deployment,  # Assumes the deployment is named after the model
                    api_key=aoai_eastus_api_key
                )
            )
        ]
    ),
    semantic_search=SemanticSearch(
        configurations=[
            SemanticConfiguration(
                name="menuSemanticConfig",
                prioritized_fields=SemanticPrioritizedFields(
                    title_field=SemanticField(field_name="name"),  # Prioritize the "name" (e.g., "Espresso")
                    content_fields=[
                        SemanticField(field_name="description"),  # Primary content field
                        SemanticField(field_name="longDescription"),  # Provide detailed context
                        SemanticField(field_name="category")  # Assist in grouping similar items
                    ]
                )
            )
        ]
    ),
)

# Delete the existing index if it exists
try:
    search_index_client.delete_index(index_name)
    print(f"Deleted existing index: {index_name}")
except Exception as e:
    print(f"Index {index_name} does not exist or could not be deleted: {e}")

# Create or update the index schema
search_index_client.create_or_update_index(index=index_schema)
print(f"Created index: {index_name}")


Deleted existing index: coffee-chat2
Created index: coffee-chat2
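
The cell above deletes the index inside a broad `except Exception`, which also hides authentication or network errors. As a hedged alternative, the sketch below checks whether the index exists before deleting it. It assumes a client that behaves like `SearchIndexClient`, using only its `list_index_names()`, `delete_index()` and `create_index()` methods; `recreate_index` is an illustrative helper, not part of the original notebook.

```python
def recreate_index(index_client, index):
    """Delete `index.name` only if it already exists, then create the index.

    `index_client` is expected to behave like SearchIndexClient.
    Returns True when an existing index was replaced, False otherwise.
    """
    existed = index.name in set(index_client.list_index_names())
    if existed:
        index_client.delete_index(index.name)
    index_client.create_index(index)
    return existed

# Usage in this notebook (requires a live Azure AI Search service):
# replaced = recreate_index(search_index_client, index_schema)
# print(f"{'Recreated' if replaced else 'Created'} index: {index_name}")
```

Any error other than a missing index now propagates instead of being printed and ignored.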


## 6: Load and Process Menu Data

### Description
This cell reads the `menuItems.json` file, rebuilds each item with a fixed set of fields (missing optional fields become empty strings), and prints the structured JSON data.


In [11]:
# Define the path to the menuItems.json file
menu_items_path = os.path.join('..', 'frontend', 'src', 'data', 'menuItems.json')

# Read the JSON file
with open(menu_items_path, 'r', encoding='utf-8') as file:
    menu_items = json.load(file)
    

# Build JSON with all fields, put empty string if not applicable
structured_menu_items = {"menuItems": []}

for category in menu_items['menuItems']:
    category_dict = {"category": category['category'], "items": []}
    for item in category['items']:
        item_dict = {
            "name": item['name'],
            "description": item.get('description', ''),
            "longDescription": item.get('longDescription', ''),
            "origin": item.get('origin', ''),
            "caffeineContent": item.get('caffeineContent', ''),
            "brewingMethod": item.get('brewingMethod', ''),
            "popularity": item.get('popularity', ''),
            "sizes": [{"size": size['size'], "price": size['price']} for size in item.get('sizes', [])]
        }
        category_dict["items"].append(item_dict)
    structured_menu_items["menuItems"].append(category_dict)

print(json.dumps(structured_menu_items, indent=2))

{
  "menuItems": [
    {
      "category": "Coffee",
      "items": [
        {
          "name": "Espresso",
          "description": "Rich and bold single or double shot",
          "longDescription": "Espresso is the quintessential coffee experience, brewed under pressure to extract its bold, rich flavor. A favorite in Italy and around the world, it forms the base for many coffee drinks, including cappuccinos and lattes. Perfect for those who enjoy an intense coffee kick with a creamy layer of crema.",
          "origin": "Italy",
          "caffeineContent": "63 mg per shot",
          "brewingMethod": "Espresso Machine",
          "popularity": "High",
          "sizes": [
            {
              "size": "single",
              "price": 1.0
            },
            {
              "size": "double",
              "price": 2.0
            }
          ]
        },
        {
          "name": "Americano",
          "description": "Espresso with hot water",
          "longDescrip

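Because the structuring loop substitutes empty strings for missing fields, gaps in the source data are easy to overlook. The following sketch lists which fields ended up empty for each item. `OPTIONAL_FIELDS`, `report_empty_fields` and the sample values are illustrative assumptions; only the field names come from the cell above.

```python
# Optional fields that the structuring loop fills with '' when absent.
OPTIONAL_FIELDS = [
    "description", "longDescription", "origin",
    "caffeineContent", "brewingMethod", "popularity",
]

def report_empty_fields(structured):
    """Map 'category/name' to the list of fields left empty for that item."""
    report = {}
    for category in structured["menuItems"]:
        for item in category["items"]:
            empty = [field for field in OPTIONAL_FIELDS if not item.get(field)]
            if not item.get("sizes"):
                empty.append("sizes")
            if empty:
                report[f"{category['category']}/{item['name']}"] = empty
    return report

# Hypothetical sample in the same shape as structured_menu_items;
# the values are invented for illustration.
sample = {
    "menuItems": [
        {
            "category": "Extras",
            "items": [
                {
                    "name": "Extra Shot",
                    "description": "Add an extra shot of espresso",
                    "longDescription": "",
                    "origin": "",
                    "caffeineContent": "63 mg per shot",
                    "brewingMethod": "Espresso Machine",
                    "popularity": "",
                    "sizes": [],
                }
            ],
        }
    ]
}
print(report_empty_fields(sample))
```

In the notebook, calling `report_empty_fields(structured_menu_items)` would give the same kind of report for the real menu.
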
## 7: Data Preparation and Embedding Generation

### Description
This cell defines functions to sanitize document keys, generate embeddings using Azure OpenAI, and prepare the structured JSON data for ingestion into Azure AI Search. It includes retry logic for embedding generation and assigns embeddings to each document.


In [None]:
def sanitize_key(key):
    """Sanitize the document key to contain only valid characters."""
    return re.sub(r'[^a-zA-Z0-9_\-]', '_', key)

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=5, max=60))
def generate_embeddings(texts):
    """
    Generate embeddings using Azure OpenAI with retry logic for a batch of texts.
    """
    response = aoai_client.embeddings.create(input=texts, model=aoai_embedding_deployment)
    return [res.embedding for res in response.data]

def prepare_data_for_azure_search(menu_items_data):
    """Transform parsed data for ingestion into Azure AI Search."""
    azure_search_documents = []
    menu_items = menu_items_data["menuItems"]  # Extract the list of categories
    texts_for_embedding = []
    document_keys = []

    for category in menu_items:
        for item in category["items"]:
            # Combine relevant fields for embedding
            combined_text = f"{category['category']} {item['name']} {item['description']} {item.get('longDescription', '')}"
            document_key = sanitize_key(f"{category['category']}_{item['name'].replace(' ', '_')}".lower())
            
            # Collect texts and keys for batch embedding
            texts_for_embedding.append(combined_text)
            document_keys.append(document_key)
            
            # Add the document to Azure Search format without embedding
            azure_search_documents.append({
                "id": document_key,  # Use sanitized document key
                "category": category["category"],
                "name": item["name"],
                "description": item["description"],
                "longDescription": item.get("longDescription", ""),  # Include long description if available
                "origin": item.get("origin", ""),  # Include origin if available
                "caffeineContent": item.get("caffeineContent", ""),  # Include caffeine content if available
                "brewingMethod": item.get("brewingMethod", ""),  # Include brewing method if available
                "popularity": item.get("popularity", ""),  # Include popularity if available
                "sizes": json.dumps(item["sizes"]),  # Convert sizes to JSON string
            })

    # Generate embeddings in batch
    embeddings = generate_embeddings(texts_for_embedding)

    # Assign embeddings to documents
    for i, embedding in enumerate(embeddings):
        azure_search_documents[i]["embedding"] = embedding

    return azure_search_documents

# Example usage
documents_for_index = prepare_data_for_azure_search(structured_menu_items)

for doc in documents_for_index:
    print(f"ID: {doc['id']}")
    print(f"Category: {doc['category']}")
    print(f"Name: {doc['name']}")
    print(f"Description: {doc['description']}")
    print(f"Long Description: {doc['longDescription']}")
    print(f"Origin: {doc['origin']}")
    print(f"Caffeine Content: {doc['caffeineContent']}")
    print(f"Brewing Method: {doc['brewingMethod']}")
    print(f"Popularity: {doc['popularity']}")
    print(f"Sizes: {doc['sizes']}")
    print(f"Embedding: {doc['embedding'][:10]}...")  # Print the first 10 dimensions for brevity
    print()  # Add a blank line between documents


ID: coffee_espresso
Category: Coffee
Name: Espresso
Description: Rich and bold single or double shot
Long Description: Espresso is the quintessential coffee experience, brewed under pressure to extract its bold, rich flavor. A favorite in Italy and around the world, it forms the base for many coffee drinks, including cappuccinos and lattes. Perfect for those who enjoy an intense coffee kick with a creamy layer of crema.
Origin: Italy
Caffeine Content: 63 mg per shot
Brewing Method: Espresso Machine
Popularity: High
Sizes: [{"size": "single", "price": 1.0}, {"size": "double", "price": 2.0}]
Embedding: [0.0025835344567894936, -0.03195502236485481, -0.023361852392554283, -0.012528417631983757, -0.06191285327076912, 0.020405488088726997, 0.006648536305874586, -0.025989733636379242, -0.04241398349404335, 0.008428924717009068]...

ID: coffee_americano
Category: Coffee
Name: Americano
Description: Espresso with hot water
Long Description: The Americano is a smooth, diluted espresso drink insp
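
Two things are worth checking before upload. First, `sanitize_key` maps every disallowed character to `_`, so two names that differ only in such a character produce the same `id`, and the later document would overwrite the earlier one. Second, each embedding must have the number of dimensions declared in the index (3072 in the schema above). The sketch below checks both; `validate_documents` is an illustrative helper and the sample documents are invented.

```python
from collections import Counter

def validate_documents(documents, expected_dims=3072):
    """Return a list of human-readable problems; the list is empty when all checks pass."""
    problems = []
    # Detect ids that occur more than once after key sanitization.
    key_counts = Counter(doc["id"] for doc in documents)
    for key, count in key_counts.items():
        if count > 1:
            problems.append(f"duplicate id '{key}' appears {count} times")
    # Compare each embedding's length with the index's vector dimensions.
    for doc in documents:
        dims = len(doc.get("embedding") or [])
        if dims != expected_dims:
            problems.append(
                f"{doc['id']}: embedding has {dims} dimensions, expected {expected_dims}"
            )
    return problems

# Hypothetical documents with 3-dimensional vectors, for illustration only.
docs = [
    {"id": "coffee_caf__latte", "embedding": [0.1, 0.2, 0.3]},
    {"id": "coffee_caf__latte", "embedding": [0.1, 0.2]},
]
for problem in validate_documents(docs, expected_dims=3):
    print(problem)
```

In the notebook, `validate_documents(documents_for_index)` should return an empty list before the upload step runs.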

## 8: Upload to Azure AI Search

### Description
This cell defines and calls a function to upload the prepared data to Azure AI Search.
Ensure the Azure AI Search index is properly configured before running this step.

In [24]:
def upload_documents_to_search(documents):
    """Upload documents in batches and count those the service accepted."""
    batch_size = 15
    total_batches = (len(documents) + batch_size - 1) // batch_size  # Ceiling division
    successful_uploads = 0

    for i in range(0, len(documents), batch_size):
        batch = documents[i:i + batch_size]
        batch_number = i // batch_size + 1
        try:
            # upload_documents returns one IndexingResult per document
            results = search_client.upload_documents(documents=batch)
        except HttpResponseError as e:
            print(f"Error uploading batch {batch_number}/{total_batches}: {e}")
            continue

        # A request can succeed while individual documents are rejected
        failed = [r for r in results if not r.succeeded]
        successful_uploads += len(batch) - len(failed)
        if failed:
            print(f"Batch {batch_number}/{total_batches}: {len(failed)} document(s) failed: {[r.key for r in failed]}")
        else:
            print(f"Uploaded batch {batch_number}/{total_batches} successfully. Batch size: {len(batch)}")

    print(f"Embedding index created and documents uploaded successfully. Total successful uploads: {successful_uploads}/{len(documents)}")

upload_documents_to_search(documents_for_index)

Uploaded batch 1/2 successfully. Batch size: 15
Uploaded batch 2/2 successfully. Batch size: 2
Embedding index created and documents uploaded successfully. Total successful uploads: 17/17


## 9: Testing Search Capabilities

### Summary
- Upgraded the Azure Search Documents package for compatibility.
- Defined four search function implementations:
  - Simple text-based search
  - Semantic search using the configured semantic configuration
  - Vector search using embeddings for semantic similarity 
  - Hybrid search combining text and vector capabilities
- Tested each search type with different queries:
  - Simple search for "espresso"
  - Semantic search for "coffee with chocolate flavors"
  - Vector search for "strong coffee with intense flavor" 
  - Hybrid search for "smooth coffee from not Italy"
  - Filter-based search for coffee in the "Brewed Coffee" category from Italy
- Each search function formats and displays results with relevant coffee item details.
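
The filter-based test passes an OData expression as a plain string. A small sketch for building such expressions is shown here, on the assumption that only string equality on filterable fields is needed; `odata_quote` and `eq_filter` are illustrative helpers, not part of the SDK. OData string literals escape a single quote by doubling it. Keyword and vector ranking are unlikely to treat the word "not" in a query such as "smooth coffee from not Italy" as an exclusion, so a filter like `origin ne 'Italy'` is one way to express that intent.

```python
def odata_quote(value):
    """Quote a string literal for an OData filter, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"

def eq_filter(**conditions):
    """Join field/value pairs into `field eq 'value'` clauses combined with `and`."""
    return " and ".join(
        f"{field} eq {odata_quote(value)}" for field, value in conditions.items()
    )

print(eq_filter(category="Brewed Coffee", origin="Italy"))
# category eq 'Brewed Coffee' and origin eq 'Italy'

# Exclusion expressed as a filter rather than as query text.
print(f"origin ne {odata_quote('Italy')}")
# origin ne 'Italy'
```

The resulting string can be passed as `filter_condition` to `perform_simple_search`.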


In [10]:
# Test Azure AI Search with various query types
%pip install --upgrade azure-search-documents
from azure.search.documents.models import VectorizedQuery

def perform_simple_search(query_text, filter_condition=None, top=5):
    """Perform a simple text-based search against the Azure Search index"""
    results = search_client.search(
        search_text=query_text,
        filter=filter_condition,
        select=["id", "name", "description", "category", "origin"],
        top=top,
        include_total_count=True
    )
    
    print(f"\n=== Simple Search Results for '{query_text}' ===")
    print(f"Total matches: {results.get_count()}")
    
    for result in results:
        print(f"\nName: {result['name']}")
        print(f"Category: {result['category']}")
        print(f"Origin: {result.get('origin', 'Not specified')}")
        print(f"Description: {result['description']}")
        print("-" * 50)
    
    return results

def perform_semantic_search(query_text, top=5):
    """Perform a semantic search using the configured semantic settings"""
    results = search_client.search(
        search_text=query_text,
        select=["id", "name", "description", "category", "origin"],
        top=top,
        query_type="semantic",
        semantic_configuration_name="menuSemanticConfig",
        include_total_count=True
    )
    
    print(f"\n=== Semantic Search Results for '{query_text}' ===")
    print(f"Total matches: {results.get_count()}")
    
    for result in results:
        print(f"\nName: {result['name']}")
        print(f"Category: {result['category']}")
        print(f"Origin: {result.get('origin', 'Not specified')}")
        print(f"Description: {result['description']}")
        print("-" * 50)
    
    return results

def perform_vector_search(query_text, top=5):
    """Generate embedding for the query and perform vector search"""
    # Generate the query embedding with the client created in the configuration cell
    response = aoai_client.embeddings.create(
        input=query_text,
        model=aoai_embedding_deployment
    )
    query_vector = response.data[0].embedding

    # Vector-only search: pass the embedding through vector_queries
    vector_query = VectorizedQuery(vector=query_vector, k_nearest_neighbors=top, fields="embedding")
    results = search_client.search(
        search_text=None,
        select=["id", "name", "description", "category", "origin"],
        top=top,
        vector_queries=[vector_query],
        include_total_count=True
    )
    
    print(f"\n=== Vector Search Results for '{query_text}' ===")
    print(f"Total matches: {results.get_count()}")
    
    for result in results:
        print(f"\nName: {result['name']}")
        print(f"Category: {result['category']}")
        print(f"Origin: {result.get('origin', 'Not specified')}")
        print(f"Description: {result['description']}")
        print("-" * 50)
    
    return results

def perform_hybrid_search(query_text, top=5):
    """Perform a hybrid search combining vector search with text search"""
    # Generate the query embedding with the client created in the configuration cell
    response = aoai_client.embeddings.create(
        input=query_text,
        model=aoai_embedding_deployment
    )
    query_vector = response.data[0].embedding

    # Hybrid search: send both the query text and the embedding
    vector_query = VectorizedQuery(vector=query_vector, k_nearest_neighbors=top, fields="embedding")
    results = search_client.search(
        search_text=query_text,
        select=["id", "name", "description", "category", "origin"],
        top=top,
        vector_queries=[vector_query],
        include_total_count=True
    )
    
    print(f"\n=== Hybrid Search Results for '{query_text}' ===")
    print(f"Total matches: {results.get_count()}")
    
    for result in results:
        print(f"\nName: {result['name']}")
        print(f"Category: {result['category']}")
        print(f"Origin: {result.get('origin', 'Not specified')}")
        print(f"Description: {result['description']}")
        print("-" * 50)
    
    return results

# Test different search types
print("\n========= TESTING SEARCH CAPABILITIES =========")

# Test 1: Simple text search
simple_results = perform_simple_search("espresso")

# Test 2: Semantic search for more natural language queries
semantic_results = perform_semantic_search("coffee with chocolate flavors")

# Test 3: Vector search for semantic similarity
vector_results = perform_vector_search("strong coffee with intense flavor")

# Test 4: Hybrid search combining text and vector capabilities
hybrid_results = perform_hybrid_search("smooth coffee from not Italy")

# Test 5: Search with filters
filtered_results = perform_simple_search(
    "coffee", 
    filter_condition="category eq 'Brewed Coffee' and origin eq 'Italy'"
)

Note: you may need to restart the kernel to use updated packages.


=== Simple Search Results for 'espresso' ===
Total matches: 10

Name: Cappuccino
Category: Coffee
Origin: Italy
Description: Espresso with steamed milk and foam
--------------------------------------------------

Name: Granita Cappuccino
Category: Chilled Coffees
Origin: Custom House Recipe
Description: Granules of sugar and ice mixed with espresso, cream, cocoa, and whipped cream
--------------------------------------------------

Name: Espresso
Category: Coffee
Origin: Italy
Description: Rich and bold single or double shot
--------------------------------------------------

Name: Mocha
Category: Coffee
Origin: Yemen
Description: Espresso with chocolate and steamed milk
--------------------------------------------------

Name: Extra Shot
Category: Extras
Origin: 
Description: Add an extra shot of espresso
--------------------------------------------------

=== Semantic Search Results for 'coffee with chocolate flavors
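
For hybrid queries, Azure AI Search merges the text ranking and the vector ranking with Reciprocal Rank Fusion (RRF). The sketch below illustrates the idea on two hand-written rankings. It is not the service's implementation: the constant `k=60` is a commonly cited default used here as an assumption, and the document ids are invented.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked id lists: each list adds 1 / (k + rank) to an id's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical rankings for one query.
text_ranking = ["cappuccino", "espresso", "mocha"]
vector_ranking = ["espresso", "americano", "cappuccino"]

print(reciprocal_rank_fusion([text_ranking, vector_ranking]))
# ['espresso', 'cappuccino', 'americano', 'mocha']
```

Items that rank well in both lists rise to the top, which is why a hybrid query can order results differently from either the text-only or the vector-only search.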

## 10: Final Summary

### Summary
- Installed required packages.
- Configured Azure OpenAI and Azure AI Search services.
- Defined and created/updated the index schema with semantic configuration.
- Loaded and processed menu data from `menuItems.json`.
- Prepared data for ingestion, including generating embeddings using Azure OpenAI.
- Uploaded the structured data into Azure AI Search.
- Tested simple, semantic, vector, hybrid, and filtered searches against the index.

The pipeline is now complete! 