# Azure OpenAI and AI Search Pipeline for Menu Ingestion

## 1: Notebook Introduction

This notebook demonstrates how to:
1. Extract text from a menu PDF with OCR.
2. Parse the text into structured JSON with a GPT-4o model.
3. Upload the parsed data to Azure AI Search for hybrid (keyword, vector, and semantic) search.


## 2: Imports and Environment Setup

### Description
This cell imports the required libraries and loads environment variables with `dotenv`.
The `.env` file must define the Azure OpenAI and Azure AI Search endpoints, API keys, deployment names, and index name that the next section reads.

In [22]:
# Import required libraries
from azure.core.credentials import AzureKeyCredential
from azure.core.exceptions import HttpResponseError
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SimpleField,
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    AzureOpenAIVectorizer,
    AzureOpenAIVectorizerParameters,
    SemanticConfiguration,
    SemanticPrioritizedFields,
    SemanticField,
    SemanticSearch
)
from dotenv import load_dotenv
from langchain_community.vectorstores.azuresearch import AzureSearch
from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from openai import AzureOpenAI
from pydantic import BaseModel
from tenacity import retry, stop_after_attempt, wait_exponential
from typing import List

import base64
import json
import os
import openai
import re

# Load environment variables
load_dotenv()


True
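
Because `os.getenv` returns `None` for unset variables, a missing `.env` entry only surfaces later as a client error. A minimal sketch of an up-front check follows, assuming the variable names read in the configuration cell of the next section. `missing_env_vars` is a hypothetical helper, not part of `dotenv`.

```python
import os

# Variable names read by the configuration cell in the next section.
REQUIRED_ENV_VARS = [
    "AZURE_OPENAI_EASTUS_ENDPOINT",
    "AZURE_OPENAI_EASTUS_API_KEY",
    "AZURE_OPENAI_GPT4O_DEPLOYMENT",
    "AZURE_OPENAI_GPT4O_MINI_DEPLOYMENT",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_EMBEDDING_DEPLOYMENT",
    "AZURE_SEARCH_ENDPOINT",
    "AZURE_SEARCH_KEY",
    "INDEX_NAME",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or blank in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not (env.get(name) or "").strip()]


# Report missing settings before any Azure client is created.
missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this right after `load_dotenv()` points at the misconfigured setting by name instead of failing inside an SDK call.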

## 3: Azure OpenAI and Azure AI Search Configuration

### Description
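This cell creates the Azure OpenAI and Azure AI Search clients and defines the index schema, including the vector search profile and the semantic configuration.
Note that it deletes any existing index with the same name before recreating it.
Ensure that the endpoints, API keys, and deployment names in the `.env` file match your Azure resource setup.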

In [23]:
# Azure OpenAI setup
aoai_eastus_endpoint = os.getenv("AZURE_OPENAI_EASTUS_ENDPOINT")
aoai_eastus_api_key = os.getenv("AZURE_OPENAI_EASTUS_API_KEY")
aoai_gpt4o_deployment = os.getenv("AZURE_OPENAI_GPT4O_DEPLOYMENT")
aoai_gpt4o_mini_deployment = os.getenv("AZURE_OPENAI_GPT4O_MINI_DEPLOYMENT")
aoai_openai_api_version = os.getenv("AZURE_OPENAI_API_VERSION")
aoai_embedding_deployment = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT")

# Initialize the Azure OpenAI client
aoai_client = AzureOpenAI(
    azure_endpoint=aoai_eastus_endpoint,
    api_version=aoai_openai_api_version,
    api_key=aoai_eastus_api_key,
)

# Azure AI Search settings, read from the .env file
search_service_endpoint = os.getenv("AZURE_SEARCH_ENDPOINT")  # Azure AI Search service endpoint
search_api_key = os.getenv("AZURE_SEARCH_KEY")  # Admin API key (required to create or delete indexes)
index_name = os.getenv("INDEX_NAME")  # Name of the index to create and populate
search_credential = AzureKeyCredential(search_api_key)
search_client = SearchClient(endpoint=search_service_endpoint, index_name=index_name, credential=search_credential)
# SearchIndexClient works at the service level, so it does not take an index name
search_index_client = SearchIndexClient(endpoint=search_service_endpoint, credential=search_credential)


# Define and Create/Update Index Schema with Semantic Configuration
index_schema = SearchIndex(
    name=index_name,
    fields=[
        SimpleField(name="id", type=SearchFieldDataType.String, key=True, sortable=True, filterable=True, facetable=True),
        SearchField(name="category", type=SearchFieldDataType.String, sortable=True, filterable=True, facetable=True),
        SearchField(name="item", type=SearchFieldDataType.String, sortable=True, filterable=True, facetable=True),
        SearchField(name="description", type=SearchFieldDataType.String),
        SimpleField(name="price", type=SearchFieldDataType.String, sortable=True, filterable=True, facetable=True),

        SearchField(name="embedding", type=SearchFieldDataType.Collection(SearchFieldDataType.Single), vector_search_dimensions=3072, vector_search_profile_name="myHnswProfile")
    ],
    vector_search=VectorSearch(
        algorithms=[
            HnswAlgorithmConfiguration(
                name="myHnsw",
                kind="hnsw",
                parameters={
                    "m": 10, # Number of neighbors, higher values improve accuracy at the cost of memory
                    "efConstruction": 200 # Larger value improves recall but increases indexing time
                }
            )
        ],
        profiles=[
            VectorSearchProfile(
                name="myHnswProfile",
                algorithm_configuration_name="myHnsw",
                vectorizer_name="myVectorizer"
            )
        ],
        vectorizers=[
            AzureOpenAIVectorizer(
                vectorizer_name="myVectorizer",
                parameters=AzureOpenAIVectorizerParameters(
                    resource_url=aoai_eastus_endpoint,
                    deployment_name=aoai_embedding_deployment,
                    # model_name expects the model name; this assumes the deployment is named after its model
                    model_name=aoai_embedding_deployment,
                    api_key=aoai_eastus_api_key
                )
            )
        ]
    ),
    semantic_search=SemanticSearch(
        configurations=[SemanticConfiguration(
            name="mySemanticConfig",
            prioritized_fields=SemanticPrioritizedFields(
                title_field=SemanticField(field_name="item"),
                content_fields=[
                    SemanticField(field_name="description")
                ]
            )
        )]
    ),
)

# Delete the existing index if it exists
try:
    search_index_client.delete_index(index_name)
    print(f"Deleted existing index: {index_name}")
except Exception as e:
    print(f"Index {index_name} does not exist or could not be deleted: {e}")

search_index_client.create_or_update_index(index=index_schema)
print(f"Created index: {index_name}")


Deleted existing index: coffee-chat
Created index: coffee-chat
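
The `embedding` field declares `vector_search_dimensions=3072`, which matches the default output size of `text-embedding-3-large`. The notebook does not state which model the embedding deployment uses, so the sketch below assumes the model name is known. `check_embedding_dimensions` and the lookup table are illustrative helpers, not part of any SDK.

```python
# Default output sizes of the Azure OpenAI embedding models.
EMBEDDING_DIMENSIONS = {
    "text-embedding-3-large": 3072,
    "text-embedding-3-small": 1536,
    "text-embedding-ada-002": 1536,
}

# Must equal vector_search_dimensions on the index's "embedding" field.
INDEX_VECTOR_DIMENSIONS = 3072


def check_embedding_dimensions(model_name, index_dimensions=INDEX_VECTOR_DIMENSIONS):
    """Return the model's default vector size, or raise if it does not fit the index."""
    model_dimensions = EMBEDDING_DIMENSIONS.get(model_name)
    if model_dimensions is None:
        raise ValueError(f"Unknown embedding model: {model_name!r}")
    if model_dimensions != index_dimensions:
        raise ValueError(
            f"{model_name} produces {model_dimensions}-dimensional vectors, "
            f"but the index field expects {index_dimensions}"
        )
    return model_dimensions


print(check_embedding_dimensions("text-embedding-3-large"))
```

A mismatch caught here is cheaper than one discovered when the service rejects every uploaded document.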


## 4: Define GPT-4o Parsing Function

### Description
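This function sends the extracted menu text to a GPT-4o deployment via Azure OpenAI and uses structured outputs to parse it into a `CoffeeMenu` Pydantic model.
Each parsed item has the fields `category`, `item`, `description`, and `price`. The example call uses the GPT-4o mini deployment.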

In [24]:
class CoffeeMenuItem(BaseModel):
    category: str
    item: str
    description: str
    price: str = None

class CoffeeMenu(BaseModel):
    items: List[CoffeeMenuItem]

def print_parsed_menu(parsed_menu):
    # Print the parsed menu in a formatted way
    if parsed_menu:
        for item in parsed_menu.items:
            print(f"Category: {item.category}")
            print(f"Item: {item.item}")
            print(f"Description: {item.description}")
            print(f"Price: {item.price}")
            print()  # Add a blank line between items

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=5, max=60))
def parse_menu_with_gpt4o(raw_text, model_deployment_name):
    """Parse the raw text into structured JSON using GPT-4o."""
    prompt = f"""
    You are a menu parser. Convert the following raw text from a coffee menu into structured JSON with the fields:
    - category
    - item
    - description
    - price (if available)

    Here is the coffee menu text:
    ---
    {raw_text}
    ---
    """
    try:
        response = aoai_client.beta.chat.completions.parse(
            model=model_deployment_name,
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt}
            ],
            # max_tokens=1500,
            temperature=0,
            response_format=CoffeeMenu
        )

        parsed_output = response.choices[0].message.parsed
        return parsed_output

    except openai.ContentFilterFinishReasonError as e:
        print(f"Content filter error: {e}")
        print(f"Problematic prompt: {prompt}")
        return None

# Example usage
sample_raw_text = """
Espresso Drinks
- Espresso: A strong coffee brewed by forcing hot water under pressure through finely ground coffee beans. $2.99
- Cappuccino: Espresso with steamed milk and a layer of foam. $3.99

Cold Brews
- Cold Brew: Coffee brewed cold for a smooth, rich flavor. $4.99
- Nitro Cold Brew: Cold brew infused with nitrogen for a creamy texture. $5.99
"""

parsed_menu = parse_menu_with_gpt4o(sample_raw_text, aoai_gpt4o_mini_deployment)
# print(parsed_menu)

print_parsed_menu(parsed_menu)


Category: Espresso Drinks
Item: Espresso
Description: A strong coffee brewed by forcing hot water under pressure through finely ground coffee beans.
Price: $2.99

Category: Espresso Drinks
Item: Cappuccino
Description: Espresso with steamed milk and a layer of foam.
Price: $3.99

Category: Cold Brews
Item: Cold Brew
Description: Coffee brewed cold for a smooth, rich flavor.
Price: $4.99

Category: Cold Brews
Item: Nitro Cold Brew
Description: Cold brew infused with nitrogen for a creamy texture.
Price: $5.99
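
The schema keeps `price` as a string, so sorting or filtering on it compares text rather than numbers: `"10.00"` sorts before `"4.00"`. A minimal sketch of turning a price string into a number follows. `parse_price` is a hypothetical helper, and storing its result would need an additional numeric index field that this notebook does not define.

```python
import re
from decimal import Decimal

# Matches the first integer or decimal number in a string such as "$2.99" or "4.00".
_PRICE_PATTERN = re.compile(r"\d+(?:\.\d{1,2})?")


def parse_price(price):
    """Return the first number in a price string as a Decimal, or None if absent."""
    if not price:
        return None
    match = _PRICE_PATTERN.search(price.replace(",", ""))
    return Decimal(match.group()) if match else None


prices = ["4.00", "10.00", "$2.90"]
print(sorted(prices))                   # text order: ['$2.90', '10.00', '4.00']
print(sorted(prices, key=parse_price))  # numeric order: ['$2.90', '4.00', '10.00']
```

`Decimal` avoids binary floating-point rounding, which matters if the values are later summed or compared exactly.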



## 5: Extract Text from PDF

### Description
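This cell uses `pdf2image` to convert each page of the provided PDF file into an image. It then uses `pytesseract` to perform Optical Character Recognition (OCR) on each image to extract the text. The extracted text from all pages is combined into a single string for further processing. Both libraries wrap system tools: `pdf2image` requires the Poppler utilities and `pytesseract` requires the Tesseract OCR engine to be installed.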



In [25]:
import pytesseract
from pdf2image import convert_from_path

def extract_text_from_pdf_images(pdf_path):
    """Extract text from images in a PDF file using OCR."""
    # Convert PDF pages to images
    images = convert_from_path(pdf_path)

    # Perform OCR on each page image and join the pages with newlines
    return "\n".join(pytesseract.image_to_string(image) for image in images)

# Extract raw text from PDF images
pdf_path = "coffee-chat-beverage-menu.pdf"
raw_text = extract_text_from_pdf_images(pdf_path)

print(raw_text)

Beverage Menu View Text Menu

4 Menus Available
(COFFEE SPECIALTIES)

Intermezzo House Coffee... Proudly serving
Dancing Goats coffee. Priced for each
kannchen 4.00

Coffee Infusion... “French Press” infused at
your table (about 3 cups; please wait 2 minutes
to push the press). 5.00

Espresso... la créme de café ... The essence of
pure, rich coffee 2.90

Espresso Doppio... Double espresso 4.00

Caffé Americano... Double espresso diluted
with purified water 4.00

Turkish Coffee...

Pulverized light roast beans blended with
cardamon and sugar, boiled 3 times, as served in
Kolschitsky’s Coffeehouse in Vienna from 1683,
and throughtout Arab and Greek nations. In
contradiction to the famous story of Mr.
Kolschitsky, the first recorded coffee-serving
privilege in Vienna was granted in 1685 to an
Armenian merchant named Deodato.

(Notice: The grounds remain in the pot, some
passing into your cup, making this extremely
rich, sip slowly.) 6.00

Café Cubano... Double-rich espresso extraction
wit
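
The OCR output hard-wraps each menu entry across several lines and separates entries with blank lines. A minimal sketch of rejoining those wrapped lines before parsing follows. `rejoin_wrapped_lines` is a hypothetical helper, and it assumes blank lines mark entry boundaries, which does not hold for every entry (the Turkish Coffee text above spans several blocks).

```python
import re


def rejoin_wrapped_lines(ocr_text):
    """Return one single-line string per blank-line-separated block of OCR text."""
    # Split on blank lines (possibly containing stray whitespace).
    blocks = re.split(r"\n\s*\n", ocr_text.strip())
    # Collapse internal newlines and repeated spaces within each block.
    return [" ".join(block.split()) for block in blocks if block.strip()]


sample = (
    "Espresso Doppio... Double espresso 4.00\n"
    "\n"
    "Caffé Americano... Double espresso diluted\n"
    "with purified water 4.00\n"
)
for entry in rejoin_wrapped_lines(sample):
    print(entry)
```

This step is optional: the model can usually cope with wrapped lines, but cleaner input makes the prompt shorter and the result easier to inspect.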

## 6: Parse Menu Data

### Description
This cell sends the extracted raw text to the `parse_menu_with_gpt4o` function,
which returns a `CoffeeMenu` object holding the structured menu items for further processing.

In [None]:
# Parse the OCR text with the GPT-4o mini deployment
parsed_menu = parse_menu_with_gpt4o(raw_text, aoai_gpt4o_mini_deployment)

print_parsed_menu(parsed_menu)

## 7: Prepare Data for Azure AI Search

### Description
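These functions transform the parsed menu into documents ready for ingestion into Azure AI Search.
Each menu item receives an embedding of its combined text and an ID derived from its category and item name. `sanitize_key` replaces every character other than letters, digits, `_`, and `-` with `_`, which keeps the ID within the characters Azure AI Search accepts in document keys.
The cell relies on the `re` module imported in section 2, so run that cell first.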

In [21]:
def sanitize_key(key):
    """Sanitize the document key to contain only valid characters."""
    return re.sub(r'[^a-zA-Z0-9_\-]', '_', key)

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=5, max=60))
def generate_embedding(text):
    """
    Generate embeddings using Azure OpenAI with retry logic
    """
    response = aoai_client.embeddings.create(input=[text], model=aoai_embedding_deployment)
    return response.data[0].embedding

def prepare_data_for_azure_search(parsed_menu):
    """Transform parsed data for ingestion into Azure AI Search."""
    azure_search_documents = []
    for item in parsed_menu.items:
        # Leave out a missing price so the literal string "None" is not embedded
        parts = [item.category, item.item, item.description, item.price]
        combined_text = " ".join(part for part in parts if part)
        # sanitize_key already maps spaces and other invalid characters to "_"
        document_key = sanitize_key(f"{item.category}_{item.item}".lower())
        azure_search_documents.append({
            "id": document_key,  # Use sanitized document key
            "category": item.category,
            "item": item.item,
            "description": item.description,
            "price": item.price,
            "embedding": generate_embedding(combined_text)  # Add embedding field
        })
    return azure_search_documents

# Transform structured data
documents_for_index = prepare_data_for_azure_search(parsed_menu)

for doc in documents_for_index:
    print(f"ID: {doc['id']}")
    print(f"Category: {doc['category']}")
    print(f"Item: {doc['item']}")
    print(f"Description: {doc['description']}")
    print(f"Price: {doc['price']}")
    print()  # Add a blank line between documents

NameError: name 're' is not defined
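
Replacing invalid characters with `_` is lossy: two names that differ only in accents or punctuation can end up with the same key, and uploading the second would overwrite the first. Azure AI Search's key validation message recommends URL-safe Base64 instead. A minimal sketch of that alternative follows; `encode_document_key`, `decode_document_key`, and `is_valid_key` are hypothetical helpers.

```python
import base64
import re

# Characters Azure AI Search accepts in a document key.
_VALID_KEY = re.compile(r"[A-Za-z0-9_\-=]+")


def encode_document_key(raw_key):
    """Encode an arbitrary string as a reversible, URL-safe Base64 key."""
    return base64.urlsafe_b64encode(raw_key.encode("utf-8")).decode("ascii")


def decode_document_key(encoded_key):
    """Recover the original string from an encoded key."""
    return base64.urlsafe_b64decode(encoded_key.encode("ascii")).decode("utf-8")


def is_valid_key(key):
    """Check that a key uses only the characters the service allows."""
    return _VALID_KEY.fullmatch(key) is not None


raw = "coffee specialties_café cubano"
key = encode_document_key(raw)
print(is_valid_key(raw), is_valid_key(key))  # False True
print(decode_document_key(key) == raw)       # True
```

The trade-off is readability: encoded keys are opaque, so keep `category` and `item` as separate fields for display and filtering.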

## 8: Upload to Azure AI Search

### Description
This cell defines and calls a function to upload the prepared data to Azure AI Search.
Ensure the Azure AI Search index is properly configured before running this step.

In [20]:
def upload_documents_to_search(documents):
    """Upload documents in batches and report how many the service accepted."""
    batch_size = 15
    total_batches = (len(documents) + batch_size - 1) // batch_size  # Ceiling division
    successful_uploads = 0

    for i in range(0, len(documents), batch_size):
        batch = documents[i:i + batch_size]
        batch_number = i // batch_size + 1
        try:
            # upload_documents returns one IndexingResult per document
            results = search_client.upload_documents(documents=batch)
        except HttpResponseError as e:
            print(f"Error uploading batch {batch_number}/{total_batches}: {e}")
            continue

        # Count only the documents the service reports as indexed
        succeeded = sum(1 for result in results if result.succeeded)
        successful_uploads += succeeded
        print(f"Uploaded batch {batch_number}/{total_batches}: {succeeded}/{len(batch)} documents succeeded.")

    print(f"Upload finished. Total successful uploads: {successful_uploads}/{len(documents)}")

upload_documents_to_search(documents_for_index)

Error uploading batch 1/7: (InvalidName) The request is invalid. Details: actions : 0: Invalid document key: 'coffee specialties_intermezzo_house_coffee'. Keys can only contain letters, digits, underscore (_), dash (-), or equal sign (=). If the keys in your source data contain other characters, we recommend encoding them with a URL-safe version of Base64 before uploading them to your index. If that is not an option, you can add the 'allowUnsafeKeys' query string parameter to disable this check. 1: Invalid document key: 'coffee specialties_coffee_infusion'. Keys can only contain letters, digits, underscore (_), dash (-), or equal sign (=). If the keys in your source data contain other characters, we recommend encoding them with a URL-safe version of Base64 before uploading them to your index. If that is not an option, you can add the 'allowUnsafeKeys' query string parameter to disable this check. 2: Invalid document key: 'coffee specialties_espresso'. Keys can only contain letters, dig
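
`upload_documents` returns one result per document, each exposing `key`, `succeeded`, and `error_message`, so a batch that raises no exception can still contain rejected documents. A minimal sketch of summarizing those results follows. `summarize_indexing_results` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for the SDK's result objects so the example runs without a search service.

```python
from types import SimpleNamespace


def summarize_indexing_results(results):
    """Split per-document results into succeeded keys and a failed-key-to-error map."""
    succeeded = [result.key for result in results if result.succeeded]
    failed = {
        result.key: result.error_message
        for result in results
        if not result.succeeded
    }
    return succeeded, failed


# Stand-ins with the same attribute names as the SDK's indexing results.
fake_results = [
    SimpleNamespace(key="cold_brews_cold_brew", succeeded=True, error_message=None),
    SimpleNamespace(key="bad key", succeeded=False, error_message="Invalid document key"),
]

ok, errors = summarize_indexing_results(fake_results)
print(f"{len(ok)} succeeded, {len(errors)} failed")
for failed_key, message in errors.items():
    print(f"  {failed_key!r}: {message}")
```

Collecting the failed keys makes it possible to fix and retry only those documents instead of re-uploading the whole menu.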

## 9: Final Summary

### Summary
- Extracted text from the PDF file.
- Parsed the text into structured JSON using GPT-4o.
- Uploaded the structured data into Azure AI Search.

The pipeline is now complete!
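
Once documents are in the index, a hybrid query combines keyword matching, vector similarity, and semantic reranking in a single call. A minimal sketch follows, assuming the `embedding` field, the `mySemanticConfig` configuration, and the index-side vectorizer defined in section 3, and assuming semantic ranking is enabled on the search service. `hybrid_search_kwargs` and `run_hybrid_search` are hypothetical helpers.

```python
def hybrid_search_kwargs(query_text, vector_query, top=5):
    """Build the keyword arguments for a hybrid SearchClient.search call."""
    return {
        "search_text": query_text,          # keyword part of the query
        "vector_queries": [vector_query],   # vector part of the query
        "query_type": "semantic",           # rerank the merged results semantically
        "semantic_configuration_name": "mySemanticConfig",
        "select": ["category", "item", "description", "price"],
        "top": top,
    }


def run_hybrid_search(search_client, query_text, top=5):
    """Run a hybrid query and return (item, price) pairs in ranked order."""
    # Imported here so the helper above stays usable without the SDK installed.
    from azure.search.documents.models import VectorizableTextQuery

    # The service embeds the query text itself, using the index's vectorizer.
    vector_query = VectorizableTextQuery(
        text=query_text, k_nearest_neighbors=top, fields="embedding"
    )
    results = search_client.search(**hybrid_search_kwargs(query_text, vector_query, top))
    return [(doc["item"], doc["price"]) for doc in results]
```

With the `search_client` from section 3, a call such as `run_hybrid_search(search_client, "something cold and creamy")` would return the top-ranked items for that query.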
