# Restaurant Agent Evaluation

This notebook evaluates the restaurant agent for Scheibmeir's using a comprehensive set of test queries.
The evaluation includes both questions about Scheibmeir's restaurant (answerable from the PDFs the agent is grounded on) and general questions that test the agent's ability to stay on topic.
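
A minimal sketch of how such a mixed test set could be laid out as JSON Lines, assuming each row carries a `query` field (the field this notebook reads when loading test data). The `category` field and the file name `example_queries.jsonl` are hypothetical and only illustrate the on-topic/off-topic split.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical rows: "query" matches the field read later in this notebook;
# "category" is an illustrative extra field, not part of the original data.
example_rows = [
    {"query": "What are the opening hours for Scheibmeir's?", "category": "restaurant"},
    {"query": "What steaks does Scheibmeir's serve?", "category": "restaurant"},
    {"query": "How do I cook pasta?", "category": "off_topic"},
    {"query": "What's the weather like today?", "category": "off_topic"},
]

def write_jsonl(rows, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

def read_jsonl(path):
    """Read a JSONL file, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Round-trip the rows through a temporary file to confirm the layout
example_path = Path(tempfile.gettempdir()) / "example_queries.jsonl"
write_jsonl(example_rows, example_path)
loaded_rows = read_jsonl(example_path)
print(f"Wrote and re-read {len(loaded_rows)} rows")
```

Keeping a category label next to each query makes it possible to compare scores for restaurant questions and off-topic questions separately.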

## Setup and Configuration

In [1]:
import os
import json
from dotenv import load_dotenv
from azure.ai.evaluation import evaluate
from azure.ai.evaluation import (
    GroundednessEvaluator,
    RelevanceEvaluator,
    CoherenceEvaluator,
    FluencyEvaluator,
    AzureOpenAIModelConfiguration
)
from azure.ai.agents import AgentsClient
from azure.identity import AzureCliCredential

# Load environment variables from .env file
load_dotenv(override=True)

print("Libraries imported successfully!")

Libraries imported successfully!


In [2]:
# Configuration from environment variables
RESTAURANT_ASSISTANT_ID = os.getenv("RESTAURANT_ASSISTANT_ID")
RESTAURANT_ASSISTANT_MODEL = os.getenv("RESTAURANT_ASSISTANT_MODEL")
RESTAURANT_EVALUATION_MODEL = os.getenv("RESTAURANT_EVALUATION_MODEL")
RESTAURANT_ASSISTANT_PROJECT = os.getenv("RESTAURANT_ASSISTANT_PROJECT")

# Azure OpenAI configuration for evaluators - using restaurant project
RESTAURANT_OPENAI_ENDPOINT = os.getenv("RESTAURANT_ASSISTANT_PROJECT")
RESTAURANT_OPENAI_API_VERSION = os.getenv("RESTAURANT_OPENAI_API_VERSION", "2025-01-01-preview")
RESTAURANT_API_KEY = os.getenv("RESTAURANT_API_KEY")
RESTAURANT_MODEL_DEPLOYMENT_NAME = os.getenv("RESTAURANT_EVALUATION_MODEL")

# Azure AI project configuration - using restaurant project
RESTAURANT_SUBSCRIPTION_ID = os.getenv("RESTAURANT_SUBSCRIPTION_ID")
RESTAURANT_RESOURCE_GROUP_NAME = os.getenv("RESTAURANT_RESOURCE_GROUP")
RESTAURANT_PROJECT_NAME = os.getenv("RESTAURANT_PROJECT_NAME")

print(f"Restaurant Assistant ID: {RESTAURANT_ASSISTANT_ID}")
print(f"Restaurant Assistant Model: {RESTAURANT_ASSISTANT_MODEL}")
print(f"Restaurant Evaluation Model: {RESTAURANT_EVALUATION_MODEL}")
print(f"Restaurant OpenAI Endpoint: {RESTAURANT_OPENAI_ENDPOINT}")
#print(f"Restaurant API Key: {RESTAURANT_API_KEY}")
print(f"Restaurant OpenAI API Version: {RESTAURANT_OPENAI_API_VERSION}")
print(f"Restaurant Project: {RESTAURANT_PROJECT_NAME}")
print(f"Restaurant Subscription: {RESTAURANT_SUBSCRIPTION_ID}")
print(f"Restaurant Resource Group: {RESTAURANT_RESOURCE_GROUP_NAME}")

Restaurant Assistant ID: asst_eJv1oQY8pHlj3mgXyV2sLCYG
Restaurant Assistant Model: gpt-4.1-mini
Restaurant Evaluation Model: gpt-4o-mini
Restaurant OpenAI Endpoint: https://aipmaker-project-resource.services.ai.azure.com/api/projects/aipmaker-project
Restaurant OpenAI API Version: 2024-12-01-preview
Restaurant Project: aipmaker-project
Restaurant Subscription: 21039746-6e73-4627-88af-efa80f856c2c
Restaurant Resource Group: rg-AIPMaker


In [3]:
# Reload configuration from environment variables (fresh from .env file)
RESTAURANT_ASSISTANT_ID = os.getenv("RESTAURANT_ASSISTANT_ID")
RESTAURANT_ASSISTANT_MODEL = os.getenv("RESTAURANT_ASSISTANT_MODEL")
RESTAURANT_EVALUATION_MODEL = os.getenv("RESTAURANT_EVALUATION_MODEL")
RESTAURANT_ASSISTANT_PROJECT = os.getenv("RESTAURANT_ASSISTANT_PROJECT")

# Azure OpenAI configuration for evaluators - using restaurant project
RESTAURANT_OPENAI_ENDPOINT = os.getenv("RESTAURANT_ASSISTANT_PROJECT")
RESTAURANT_OPENAI_API_VERSION = os.getenv("RESTAURANT_OPENAI_API_VERSION", "2025-01-01-preview")
RESTAURANT_MODEL_DEPLOYMENT_NAME = os.getenv("RESTAURANT_EVALUATION_MODEL")

# Azure AI project configuration - using restaurant project
RESTAURANT_SUBSCRIPTION_ID = os.getenv("RESTAURANT_SUBSCRIPTION_ID")
RESTAURANT_RESOURCE_GROUP_NAME = os.getenv("RESTAURANT_RESOURCE_GROUP")
RESTAURANT_PROJECT_NAME = os.getenv("RESTAURANT_PROJECT_NAME")

print("✅ Configuration reloaded from .env file!")
print(f"Restaurant Assistant ID: {RESTAURANT_ASSISTANT_ID}")
print(f"Restaurant Assistant Model: {RESTAURANT_ASSISTANT_MODEL}")
print(f"Restaurant Evaluation Model: {RESTAURANT_EVALUATION_MODEL}")
print(f"Restaurant OpenAI Endpoint: {RESTAURANT_OPENAI_ENDPOINT}")
print(f"Restaurant OpenAI API Version: {RESTAURANT_OPENAI_API_VERSION}")  # Shows the value from .env; 2025-01-01-preview is only the fallback when the variable is unset
print(f"Restaurant Project: {RESTAURANT_PROJECT_NAME}")
print(f"Restaurant Subscription: {RESTAURANT_SUBSCRIPTION_ID}")
print(f"Restaurant Resource Group: {RESTAURANT_RESOURCE_GROUP_NAME}")

✅ Configuration reloaded from .env file!
Restaurant Assistant ID: asst_eJv1oQY8pHlj3mgXyV2sLCYG
Restaurant Assistant Model: gpt-4.1-mini
Restaurant Evaluation Model: gpt-4o-mini
Restaurant OpenAI Endpoint: https://aipmaker-project-resource.services.ai.azure.com/api/projects/aipmaker-project
Restaurant OpenAI API Version: 2024-12-01-preview
Restaurant Project: aipmaker-project
Restaurant Subscription: 21039746-6e73-4627-88af-efa80f856c2c
Restaurant Resource Group: rg-AIPMaker


## Initialize Agents Client

In [4]:
# Initialize the Agents client with Azure CLI credentials
credential = AzureCliCredential()
agents_client = AgentsClient(
    endpoint=RESTAURANT_ASSISTANT_PROJECT,
    credential=credential
)

print("Agents client initialized successfully with Azure CLI authentication!")

Agents client initialized successfully with Azure CLI authentication!


## Define Target Function for Evaluation

This function will be called by the evaluator for each test query.
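
A minimal sketch of the contract such a target function is expected to satisfy, assuming the evaluator passes each row's `query` as a keyword argument and reads `response`, `query`, and `context` from the returned dictionary. `stub_restaurant_agent` and `check_target_output` are hypothetical helpers for local testing; they do not call the real agent.

```python
REQUIRED_KEYS = {"response", "query", "context"}

def stub_restaurant_agent(query: str) -> dict:
    """Hypothetical stand-in for the real agent call; returns canned text."""
    return {
        "response": f"(stub) You asked: {query}",
        "query": query,
        "context": "",  # the real function would put retrieved file-search text here
    }

def check_target_output(output: dict) -> list:
    """Return a sorted list of required keys missing from a target's output."""
    return sorted(REQUIRED_KEYS - set(output))

result = stub_restaurant_agent("What are the opening hours?")
missing = check_target_output(result)
print("Missing keys:", missing)
```

Running a check like this on both the success path and the error path helps ensure every row produces the same set of columns.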

In [None]:
from azure.ai.agents.models import (
    FilePurpose,
    FileSearchTool,
    ListSortOrder,
    RunAdditionalFieldList,
    RunStepFileSearchToolCall,
    RunStepToolCallDetails,
)

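def query_restaurant_agent(query: str) -> dict:
    """Send one query to the restaurant agent and return its response, the query, and any file-search context."""
    try: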
        agents_client.get_agent(RESTAURANT_ASSISTANT_ID)
        thread = agents_client.threads.create()
        message = agents_client.messages.create(
            thread_id=thread.id,
            content=query,
            role="user"
        )
        run = agents_client.runs.create_and_process(
            thread_id=thread.id,
            agent_id=RESTAURANT_ASSISTANT_ID,
        )
        print(f"Run finished with status: {run.status}")

        context_chunks = []

        for run_step in agents_client.run_steps.list(
            thread_id=thread.id, run_id=run.id, include=[RunAdditionalFieldList.FILE_SEARCH_CONTENTS]
        ):
            if isinstance(run_step.step_details, RunStepToolCallDetails):
                for tool_call in run_step.step_details.tool_calls:
                    if not isinstance(tool_call, RunStepFileSearchToolCall) or not tool_call.file_search:
                        continue
                    # Collect the text of every content item in every file-search result,
                    # rather than overwriting the context with the first result of the last tool call
                    for result in tool_call.file_search.results or []:
                        for content in result.content or []:
                            if content.text:
                                context_chunks.append(content.text)

        # Store the union of all of the contents' text
        all_text = "\n".join(context_chunks)

        messages = agents_client.messages.list(thread_id=thread.id, order=ListSortOrder.ASCENDING)

        messages_list = list(messages)

        print(f"DEBUG: Found {len(messages_list)} messages")
        
        for i, msg in enumerate(messages_list):
            print(f"DEBUG: Message {i}: role={msg.role}, content_type={type(msg.content)}")
            if msg.role == "assistant":  # Assistant role for agent responses
                if msg.content and len(msg.content) > 0:
                    print(f"DEBUG: Content type: {type(msg.content[0])}")
                    if hasattr(msg.content[0], 'text'):
                        return {
                            "response": msg.content[0].text.value,
                            "query": query,
                            "context": all_text
                        }
                    elif hasattr(msg.content[0], 'value'):
                        return {
                            "response": msg.content[0].value,
                            "query": query,
                            "context": all_text
                        }
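    except Exception as e:
        print(f"Error querying agent: {str(e)}")
        # Return the same keys as the success path so every row has a consistent schema
        return {
            "response": f"Error querying agent: {str(e)}",
            "query": query,
            "context": ""
        }
    # No assistant message with text content was found; return an empty response instead of None
    return {"response": "", "query": query, "context": all_text}
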
# Test the function with a sample query
test_result = query_restaurant_agent("What are the opening hours for Scheibmeir's?")
print("Test query result:")
print(f"Query: {test_result['query']}")
print(f"Response: {test_result['response'][:200]}...")
# if there's a context key, print it
if 'context' in test_result:
    print(f"Context: {test_result['context'][:200]}...")

Run finished with status: RunStatus.COMPLETED
Error querying agent: name 'RunStepToolCallDetails' is not defined
Test query result:
Query: What are the opening hours for Scheibmeir's?
Response: Error querying agent: name 'RunStepToolCallDetails' is not defined...


## Test Evaluation Model Directly

Let's test the evaluation model directly to make sure it's working before proceeding.

In [None]:
# Test the evaluation model using Azure AI Projects client
import os
from azure.ai.projects import AIProjectClient

try:
    # Set the OpenAI API version environment variable as required by the Azure AI Projects client
    os.environ['OPENAI_API_VERSION'] = '2024-10-21'
    
    # Create Azure AI Projects client for Azure AI Foundry project endpoint
    project_client = AIProjectClient(
        endpoint=RESTAURANT_OPENAI_ENDPOINT,
        credential=credential
    )
    
    # Get the Azure OpenAI client from the project using the inference property
    # This automatically handles the correct endpoint and authentication
    eval_client = project_client.inference.get_azure_openai_client(
        api_version='2024-10-21'
    )
    
    print(f"Testing evaluation model: {RESTAURANT_EVALUATION_MODEL}")
    print(f"Project endpoint: {RESTAURANT_OPENAI_ENDPOINT}")
    print("Using OpenAI API version: 2024-10-21")
    
    # Simple test prompt
    test_messages = [
        {"role": "user", "content": "Hello, can you respond with 'Model is working correctly'?"}
    ]
    
    response = eval_client.chat.completions.create(
        model=RESTAURANT_EVALUATION_MODEL,
        messages=test_messages,
        max_tokens=50,
        temperature=0
    )
    
    print(f"✅ Model Response: {response.choices[0].message.content}")
    print("✅ Evaluation model is working correctly!")
    
except Exception as e:
    print(f"❌ Error testing evaluation model: {str(e)}")
    print("Please check your model configuration and credentials.")

## Configure Evaluators

In [None]:
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://aipmaker-project-resource.openai.azure.com/",
    azure_deployment=RESTAURANT_EVALUATION_MODEL,
    api_version=RESTAURANT_OPENAI_API_VERSION,
    api_key=RESTAURANT_API_KEY
)

# Initialize evaluators
try:
    groundedness_evaluator = GroundednessEvaluator(model_config=model_config)
    relevance_evaluator = RelevanceEvaluator(model_config=model_config)
    coherence_evaluator = CoherenceEvaluator(model_config=model_config)
    fluency_evaluator = FluencyEvaluator(model_config=model_config)
    print("Evaluators configured successfully!")
except Exception as e:
    print(f"❌ Error configuring evaluators: {str(e)}")
    # To debug, uncomment the next line to start the debugger at the point of the exception
    # import pdb; pdb.post_mortem()

## Load Test Data

In [None]:
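# Load the evaluation queries (one JSON object per line)
queries = []
with open("small_eval_queries.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:  # skip blank lines, which json.loads would reject
            queries.append(json.loads(line))

print(f"Loaded {len(queries)} test queries")
print("Sample queries:")
# Slice instead of indexing so a file with fewer than two rows does not raise IndexError
for i, item in enumerate(queries[:2], start=1):
    print(f"  {i}. {item['query']}")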

## Run Evaluation

This will evaluate the restaurant agent using all test queries and multiple evaluation metrics.
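
A minimal sketch of how per-row scores could be rolled up into one number per metric, assuming each evaluator yields a score on a 1 to 5 scale for every row. The scores below are hypothetical placeholders, not results from this notebook, and `summarize` is an illustrative helper rather than part of the evaluation library.

```python
from statistics import mean

# Hypothetical per-row scores on a 1-5 scale; these are not real results.
example_row_scores = [
    {"relevance": 5, "coherence": 4, "fluency": 4},
    {"relevance": 3, "coherence": 4, "fluency": 5},
    {"relevance": 4, "coherence": None, "fluency": 3},  # None: no score for this row
]

def summarize(rows):
    """Average each metric over the rows where a score is present."""
    summary = {}
    for name in sorted({key for row in rows for key in row}):
        values = [row[name] for row in rows if row.get(name) is not None]
        if values:
            summary[name] = mean(values)
    return summary

example_summary = summarize(example_row_scores)
for name, value in example_summary.items():
    print(f"{name}: {value:.2f}")
```

Skipping missing scores keeps a single failed row from distorting the average, at the cost of hiding how many rows were actually scored, so it can be worth reporting the row count alongside each mean.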

In [None]:
import datetime

# Azure AI project configuration for evaluation
azure_ai_project = RESTAURANT_OPENAI_ENDPOINT

# Run the evaluation
print("Starting evaluation... This may take a while.")

evaluation_result = evaluate(
    data="small_eval_queries.jsonl",
    target=query_restaurant_agent,
    evaluators={
        #"groundedness": groundedness_evaluator,
        "relevance": relevance_evaluator,
        "coherence": coherence_evaluator,
        "fluency": fluency_evaluator,
    },
    azure_ai_project=azure_ai_project,
    evaluation_name="restaurant_evaluation_"+datetime.datetime.now().strftime("%Y%m%d_%H%M%S"),
)

print("Evaluation completed!")
print(f"Azure AI Foundry Studio URL: {evaluation_result.get('studio_url')}")

## Display Results

In [None]:
# Display evaluation metrics
print("Evaluation Metrics:")
print("=" * 50)

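metrics = evaluation_result.get("metrics", {})
for metric_name, metric_value in metrics.items():
    # Only apply float formatting to numeric values; print anything else as-is
    if isinstance(metric_value, (int, float)):
        print(f"{metric_name}: {metric_value:.4f}")
    else:
        print(f"{metric_name}: {metric_value}")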

print("\nDetailed results are available in Azure AI Foundry Studio.")

## Sample Results Analysis

Let's look at a few specific examples to understand how the agent performs.
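
A minimal sketch of a helper for previewing long responses, assuming a 300-character cutoff is enough to judge a response at a glance. The name `preview` is hypothetical.

```python
def preview(text: str, limit: int = 300) -> str:
    """Return text cut to `limit` characters, with '...' appended if truncated."""
    return text if len(text) <= limit else text[:limit] + "..."

# Short demonstration with a small limit
print(preview("Scheibmeir's is open daily.", limit=12))
```

Factoring the truncation into one function keeps the print statements short and makes the cutoff easy to change in one place.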

In [None]:
# Test a few specific queries to see the responses
sample_queries = [
    "What are the opening hours for Scheibmeir's restaurant?",
    "What steaks does Scheibmeir's serve?",
    "How do I cook pasta?",  # Non-restaurant query
    "What's the weather like today?",  # Non-restaurant query
    "Does Scheibmeir's serve Chinese food?",
]

print("Sample Agent Responses:")
print("=" * 60)

for i, query in enumerate(sample_queries, 1):
    print(f"\n{i}. Query: {query}")
    result = query_restaurant_agent(query)
    print(f"   Response: {result['response'][:300]}{'...' if len(result['response']) > 300 else ''}")
    print("-" * 60)

## Summary

This evaluation notebook:
1. Tests the restaurant agent with 250 diverse queries
2. Evaluates responses using multiple AI-assisted metrics (relevance, coherence, fluency; the groundedness evaluator is configured but commented out in the evaluation run)
3. Provides detailed results in Azure AI Foundry Studio
4. Shows sample responses to understand agent behavior

The evaluation helps assess:
- How well the agent answers questions about Scheibmeir's restaurant using the grounded PDF data
- Whether the agent stays on topic and handles non-restaurant queries appropriately
- The quality and coherence of the agent's responses
- Overall performance across different types of queries