# Azure OpenAI and AI Search Pipeline for Menu Ingestion

## 1: Notebook Introduction

This notebook demonstrates how to:
1. Extract text from a menu PDF.
2. Parse the text using GPT-4o into structured JSON format.
3. Upload the parsed data to Azure AI Search for hybrid semantic search capabilities.


## 2: Imports and Environment Setup

### Description
This cell imports necessary libraries and loads environment variables using `dotenv`. 
Ensure your `.env` file is properly set up with the required Azure API keys and endpoints.

In [1]:
# Import required libraries
from azure.search.documents import SearchClient
from azure.core.credentials import AzureKeyCredential
from dotenv import load_dotenv
from langchain_community.vectorstores.azuresearch import AzureSearch
from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from openai import AzureOpenAI
from pydantic import BaseModel
from tenacity import retry, stop_after_attempt, wait_exponential
from typing import List

import base64
import json
import os
import openai

# Load environment variables
load_dotenv()


True
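
Because `os.getenv` returns `None` for an unset variable, a missing entry in `.env` only surfaces later as an authentication or connection error. A minimal sketch of an early check, using the variable names read in the configuration cell of this notebook; the helper `find_missing_env_vars` is hypothetical and not part of the original notebook:

```python
import os

# Variable names read by this notebook's configuration cell.
REQUIRED_ENV_VARS = [
    "AZURE_OPENAI_EASTUS_ENDPOINT",
    "AZURE_OPENAI_EASTUS_API_KEY",
    "AZURE_OPENAI_GPT4O_DEPLOYMENT",
    "AZURE_OPENAI_GPT4O_CHAT_DEPLOYMENT_VERSION",
    "AZURE_OPENAI_GPT4O_MINI_DEPLOYMENT",
    "AZURE_OPENAI_GPT4O_MINI_CHAT_DEPLOYMENT_VERSION",
    "AZURE_OPENAI_EMBEDDING_DEPLOYMENT",
    "AZURE_SEARCH_ENDPOINT",
    "AZURE_SEARCH_KEY",
    "INDEX_NAME",
]

def find_missing_env_vars(names, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Hypothetical usage: report anything missing before creating the clients.
missing = find_missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```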

## 3: Azure OpenAI and Azure AI Search Configuration

### Description
This cell sets up the Azure OpenAI and AI Search configurations, including the embeddings and vector store. 
Ensure that the endpoints, API keys, and deployment names in the `.env` file match your Azure resource setup.

In [2]:
# Azure OpenAI setup
azure_eastus_endpoint = os.getenv("AZURE_OPENAI_EASTUS_ENDPOINT")
azure_eastus_api_key = os.getenv("AZURE_OPENAI_EASTUS_API_KEY")
model_deployment_name = os.getenv("AZURE_OPENAI_GPT4O_DEPLOYMENT")
azure_gpt4o_chat_version = os.getenv("AZURE_OPENAI_GPT4O_CHAT_DEPLOYMENT_VERSION")
azure_gpt4o_mini_deployment = os.getenv("AZURE_OPENAI_GPT4O_MINI_DEPLOYMENT")
azure_gpt4o_mini_chat_version = os.getenv("AZURE_OPENAI_GPT4O_MINI_CHAT_DEPLOYMENT_VERSION")
azure_embedding_deployment = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT")

# Azure AI Search setup
vector_store_address = os.getenv("AZURE_SEARCH_ENDPOINT")
vector_store_password = os.getenv("AZURE_SEARCH_KEY")
index_name = os.getenv("INDEX_NAME")

# Initialize embeddings
embeddings = AzureOpenAIEmbeddings(
    azure_deployment=azure_embedding_deployment,
    openai_api_version=azure_gpt4o_chat_version,
    azure_endpoint=azure_eastus_endpoint,
    api_key=azure_eastus_api_key,
)

# Initialize vector store
vector_store = AzureSearch(
    azure_search_endpoint=vector_store_address,
    azure_search_key=vector_store_password,
    index_name=index_name,
    embedding_function=embeddings.embed_query,
    additional_search_client_options={"retry_total": 4},
)

# Initialize the Azure OpenAI client
azure_openai_client = AzureOpenAI(
    azure_endpoint=azure_eastus_endpoint,
    api_version=azure_gpt4o_mini_chat_version,
    api_key=azure_eastus_api_key,
)


## 4: Define GPT-4o Parsing Function

### Description
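This cell defines the Pydantic models `CoffeeMenuItem` and `CoffeeMenu`, a helper that prints a parsed menu, and `parse_menu_with_gpt4o`, which sends the extracted menu text to GPT-4o via Azure OpenAI and returns the result as a `CoffeeMenu` object.
Each parsed item has the fields `category`, `item`, `description`, and `price`.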

In [16]:


from typing import Optional

class CoffeeMenuItem(BaseModel):
    category: str
    item: str
    description: str
    price: Optional[str] = None  # None when the menu lists no price

class CoffeeMenu(BaseModel):
    items: List[CoffeeMenuItem]

def print_parsed_menu(parsed_menu):
    # Print the parsed menu in a formatted way
    if parsed_menu:
        for item in parsed_menu.items:
            print(f"Category: {item.category}")
            print(f"Item: {item.item}")
            print(f"Description: {item.description}")
            print(f"Price: {item.price}")
            print()  # Add a blank line between items

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=5, max=60))
def parse_menu_with_gpt4o(raw_text, model_deployment_name):
    """Parse the raw text into a CoffeeMenu object using GPT-4o structured outputs."""
    prompt = f"""
    You are a menu parser. Convert the following raw text from a coffee menu into structured JSON with the fields:
    - category
    - item
    - description
    - price (if available)

    Here is the coffee menu text:
    ---
    {raw_text}
    ---
    """
    try:
        response = azure_openai_client.beta.chat.completions.parse(
            model=model_deployment_name,
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt}
            ],
            temperature=0,
            response_format=CoffeeMenu
        )

        parsed_output = response.choices[0].message.parsed
        return parsed_output

    except openai.ContentFilterFinishReasonError as e:
        print(f"Content filter error: {e}")
        print(f"Problematic prompt: {prompt}")
        return None

# Example usage
sample_raw_text = """
Espresso Drinks
- Espresso: A strong coffee brewed by forcing hot water under pressure through finely ground coffee beans. $2.99
- Cappuccino: Espresso with steamed milk and a layer of foam. $3.99

Cold Brews
- Cold Brew: Coffee brewed cold for a smooth, rich flavor. $4.99
- Nitro Cold Brew: Cold brew infused with nitrogen for a creamy texture. $5.99
"""

parsed_menu = parse_menu_with_gpt4o(sample_raw_text, azure_gpt4o_mini_deployment)
# print(parsed_menu)

print_parsed_menu(parsed_menu)


Category: Espresso Drinks
Item: Espresso
Description: A strong coffee brewed by forcing hot water under pressure through finely ground coffee beans.
Price: $2.99

Category: Espresso Drinks
Item: Cappuccino
Description: Espresso with steamed milk and a layer of foam.
Price: $3.99

Category: Cold Brews
Item: Cold Brew
Description: Coffee brewed cold for a smooth, rich flavor.
Price: $4.99

Category: Cold Brews
Item: Nitro Cold Brew
Description: Cold brew infused with nitrogen for a creamy texture.
Price: $5.99



## 5: Extract Text from PDF

### Description
This cell uses `pdf2image` to convert each page of the provided PDF file into an image, then uses `pytesseract` to run Optical Character Recognition (OCR) on each image. The text extracted from all pages is combined into a single string for further processing. Both libraries wrap external tools: `pdf2image` requires Poppler, and `pytesseract` requires the Tesseract OCR engine to be installed on the system.



In [12]:
import pytesseract
from pdf2image import convert_from_path

def extract_text_from_pdf_images(pdf_path):
    """Extract text from images in a PDF file using OCR."""
    # Convert PDF pages to images
    images = convert_from_path(pdf_path)

    # Perform OCR on each image
    text = ""
    for image in images:
        text += pytesseract.image_to_string(image)

    return text

# Extract raw text from PDF images
pdf_path = "coffee-chat-beverage-menu.pdf"
raw_text = extract_text_from_pdf_images(pdf_path)

print(raw_text)

Beverage Menu View Text Menu

4 Menus Available
(COFFEE SPECIALTIES)

Intermezzo House Coffee... Proudly serving
Dancing Goats coffee. Priced for each
kannchen 4.00

Coffee Infusion... “French Press” infused at
your table (about 3 cups; please wait 2 minutes
to push the press). 5.00

Espresso... la créme de café ... The essence of
pure, rich coffee 2.90

Espresso Doppio... Double espresso 4.00

Caffé Americano... Double espresso diluted
with purified water 4.00

Turkish Coffee...

Pulverized light roast beans blended with
cardamon and sugar, boiled 3 times, as served in
Kolschitsky’s Coffeehouse in Vienna from 1683,
and throughtout Arab and Greek nations. In
contradiction to the famous story of Mr.
Kolschitsky, the first recorded coffee-serving
privilege in Vienna was granted in 1685 to an
Armenian merchant named Deodato.

(Notice: The grounds remain in the pot, some
passing into your cup, making this extremely
rich, sip slowly.) 6.00

Café Cubano... Double-rich espresso extraction
wit
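
OCR output like the text above tends to contain trailing spaces, typographic quotes, and runs of blank lines. A small normalization pass before parsing might look like the following sketch; `clean_ocr_text` is a hypothetical helper, and the specific rules are assumptions rather than part of the original notebook:

```python
import re

def clean_ocr_text(text):
    """Normalize common OCR artifacts while leaving the menu wording intact."""
    # Replace typographic quotes with plain ASCII equivalents.
    quote_map = {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}
    for fancy, plain in quote_map.items():
        text = text.replace(fancy, plain)
    # Treat form feeds (page separators) as line breaks and strip trailing whitespace.
    lines = [line.rstrip() for line in text.replace("\f", "\n").splitlines()]
    # Collapse runs of blank lines into a single blank line.
    cleaned = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return cleaned.strip()

# Illustrative input, loosely modeled on the OCR output above.
sample = "Espresso Doppio... Double espresso 4.00  \n\n\n\n\u201cFrench Press\u201d infused at\nyour table 5.00\f"
print(clean_ocr_text(sample))
```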

## 6: Parse Menu Data

### Description
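This cell sends the extracted raw text to the `parse_menu_with_gpt4o` function, which returns a `CoffeeMenu` object ready for further processing.
In the run recorded below, the call failed with `LengthFinishReasonError`: the completion reached its length limit (1,000 completion tokens) before the structured output was complete.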

In [15]:
# Parse menu with GPT-4o
parsed_menu = parse_menu_with_gpt4o(raw_text, azure_gpt4o_mini_deployment)

print_parsed_menu(parsed_menu)

LengthFinishReasonError: Could not parse response content as the length limit was reached - CompletionUsage(completion_tokens=1000, prompt_tokens=3095, total_tokens=4095, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0))
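
The error above reports that the completion stopped at the length limit before the structured output was complete. One possible mitigation, sketched here under assumptions, is to split the raw text into smaller chunks at blank lines and parse each chunk separately so that each response stays short. `split_menu_text` and the character budget are hypothetical choices, not part of the original notebook:

```python
def split_menu_text(raw_text, max_chars=1500):
    """Split text into chunks of roughly max_chars, breaking only at blank lines.

    A single paragraph longer than max_chars is kept whole in its own chunk.
    """
    paragraphs = [p.strip() for p in raw_text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the budget: close the current chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# Illustrative input with a deliberately small budget.
sample = "Espresso... 2.90\n\nEspresso Doppio... 4.00\n\nTurkish Coffee... 6.00"
for chunk in split_menu_text(sample, max_chars=40):
    print(repr(chunk))
```

Each chunk could then be passed to `parse_menu_with_gpt4o` and the resulting `items` lists concatenated. A category heading may land in a different chunk from its items, so some category context can be lost with this approach.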

## 7: Prepare Data for Azure AI Search

### Description
This function takes the `CoffeeMenu` object returned by `parse_menu_with_gpt4o` and converts it into a list of dictionaries ready for ingestion into Azure AI Search.
Each menu item is assigned an ID built from its category and item name.

In [11]:
import re

def prepare_data_for_azure_search(structured_data):
    """Transform a parsed CoffeeMenu into documents for Azure AI Search."""
    azure_search_documents = []
    if structured_data is None:  # The parser returns None on a content filter error
        return azure_search_documents
    # CoffeeMenu is a Pydantic model: iterate its `items` list, not the model itself
    for item in structured_data.items:
        category = item.category or "Uncategorized"
        # Document keys may only contain letters, digits, underscores, dashes, and equal signs
        doc_id = re.sub(r"[^A-Za-z0-9_=-]", "_", f"{category}_{item.item}").lower()
        azure_search_documents.append({
            "category": category,
            "item": item.item,
            "description": item.description,
            "price": item.price,
            "id": doc_id,  # Built from category and item name
        })
    return azure_search_documents

# Transform structured data
documents_for_azure = prepare_data_for_azure_search(parsed_menu)

print(documents_for_azure)
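
An ID built from the category and item name is unique only if no two items share both. A minimal sketch of one way to disambiguate collisions by appending a counter; `ensure_unique_ids` is a hypothetical helper that assumes each document is a dict with an `id` field, and it does not guard against a suffixed ID clashing with an ID that already exists:

```python
from collections import Counter

def ensure_unique_ids(documents):
    """Return copies of the documents with a numeric suffix added to repeated IDs."""
    seen = Counter()
    unique_documents = []
    for document in documents:
        base_id = document["id"]
        seen[base_id] += 1
        # The first occurrence keeps its ID; later ones get _2, _3, ...
        new_id = base_id if seen[base_id] == 1 else f"{base_id}_{seen[base_id]}"
        unique_documents.append({**document, "id": new_id})
    return unique_documents

# Illustrative data: two sizes of the same drink produce the same ID.
docs = [
    {"id": "cold_brews_cold_brew", "price": "$4.99"},
    {"id": "cold_brews_cold_brew", "price": "$5.49"},
]
print([d["id"] for d in ensure_unique_ids(docs)])
```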

## 8: Upload to Azure AI Search

### Description
This cell defines and calls a function to upload the prepared data to Azure AI Search.
Before running this step, ensure the index named by `INDEX_NAME` exists and defines the fields `id` (the document key), `category`, `item`, `description`, and `price`.

In [None]:
def upload_to_azure_search(documents):
    """Upload structured data to Azure AI Search."""
    search_client = SearchClient(
        endpoint=vector_store_address,
        index_name=index_name,
        credential=AzureKeyCredential(vector_store_password),
    )
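    result = search_client.upload_documents(documents=documents)
    # upload_documents returns one IndexingResult per document; count the successes
    succeeded = sum(1 for r in result if r.succeeded)
    print(f"Uploaded {succeeded} of {len(result)} documents to Azure AI Search.")
    return result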

# Upload documents
upload_to_azure_search(documents_for_azure)
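
Azure AI Search limits a single indexing request to 1,000 documents, so a larger document set needs to be uploaded in batches. A minimal sketch of the batching step; `iter_batches` is a hypothetical helper, not part of the original notebook, and each batch it yields could be passed to `upload_to_azure_search`:

```python
def iter_batches(documents, batch_size=1000):
    """Yield consecutive slices of documents with at most batch_size items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(documents), batch_size):
        yield documents[start:start + batch_size]

# Illustrative data: 2,500 placeholder documents.
documents = [{"id": f"item_{n}"} for n in range(2500)]
print([len(batch) for batch in iter_batches(documents)])
```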


## 9: Final Summary

### Summary
- Extracted text from the PDF file.
- Parsed the text into structured JSON using GPT-4o.
- Uploaded the structured data into Azure AI Search.

The pipeline is now complete!
