# Restaurant Agent Evaluation

This notebook evaluates the Scheibmeir's restaurant agent against a set of test queries.
The queries cover both questions about Scheibmeir's restaurant (answerable from the grounded PDFs) and general, off-topic questions that test whether the agent stays on topic.

## Setup and Configuration

In [None]:
import os
import json
from dotenv import load_dotenv
from azure.ai.evaluation import (
    evaluate,
    GroundednessEvaluator,
    RelevanceEvaluator,
    CoherenceEvaluator,
    FluencyEvaluator,
    AzureOpenAIModelConfiguration
)
from azure.ai.agents import AgentsClient
from azure.identity import DefaultAzureCredential

# Load environment variables from .env file
load_dotenv()

print("Libraries imported successfully!")

In [None]:
# Configuration from environment variables
RESTAURANT_ASSISTANT_ID = os.getenv("RESTAURANT_ASSISTANT_ID")
RESTAURANT_ASSISTANT_MODEL = os.getenv("RESTAURANT_ASSISTANT_MODEL")
RESTAURANT_ASSISTANT_PROJECT = os.getenv("RESTAURANT_ASSISTANT_PROJECT")

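# Azure OpenAI configuration for evaluators
# Strip the project path to get the base resource endpoint.
# The "" default avoids an AttributeError on None when the variable is missing.
AZURE_OPENAI_ENDPOINT = os.getenv("DEEP_RESEARCH_PROJECT_ENDPOINT", "").replace("/api/projects/deep-research-demo-project", "")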
AZURE_OPENAI_API_VERSION = "2024-10-21"
AGENT_MODEL_DEPLOYMENT_NAME = os.getenv("AGENT_MODEL_DEPLOYMENT_NAME")

# Azure AI project configuration
AZURE_SUBSCRIPTION_ID = os.getenv("AZURE_SUBSCRIPTION_ID")
AZURE_RESOURCE_GROUP_NAME = os.getenv("AZURE_RESOURCE_GROUP_NAME")
AZURE_PROJECT_NAME = os.getenv("AZURE_PROJECT_NAME")

print(f"Restaurant Assistant ID: {RESTAURANT_ASSISTANT_ID}")
print(f"Azure OpenAI Endpoint: {AZURE_OPENAI_ENDPOINT}")
print(f"Model Deployment: {AGENT_MODEL_DEPLOYMENT_NAME}")
print(f"Project: {AZURE_PROJECT_NAME}")
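
The configuration cell above assumes every variable is present in `.env`. A minimal sketch of a fail-fast check is shown below. `require_env` and `base_endpoint` are hypothetical helpers (not part of any Azure SDK), and the variable name and URL in the demo are made up for illustration.

```python
import os
from urllib.parse import urlsplit


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


def base_endpoint(project_endpoint: str) -> str:
    """Keep only the scheme and host of a URL, dropping any project path."""
    parts = urlsplit(project_endpoint)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"Not a valid endpoint URL: {project_endpoint!r}")
    return f"{parts.scheme}://{parts.netloc}"


# Hypothetical variable name and value, for illustration only.
os.environ["DEMO_PROJECT_ENDPOINT"] = (
    "https://example.invalid/api/projects/demo-project"
)
demo_endpoint = base_endpoint(require_env("DEMO_PROJECT_ENDPOINT"))
print(demo_endpoint)
```

Parsing the URL instead of replacing a hard-coded path means the cell would not depend on one specific project name, at the cost of assuming the resource endpoint is exactly the scheme plus host.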

## Initialize Agents Client

In [None]:
# Initialize the Agents client
credential = DefaultAzureCredential()
agents_client = AgentsClient(
    endpoint=RESTAURANT_ASSISTANT_PROJECT,
    credential=credential
)

print("Agents client initialized successfully!")

## Define Target Function for Evaluation

`evaluate()` calls this function once for each row of the test data, passing the row's `query` field as the argument.

In [None]:
def query_restaurant_agent(query: str) -> dict:
    """
    Function to query the restaurant agent and return the response.
    This function will be used by the Azure AI evaluation framework.
    """
    try:
        # Create a thread and run in one step
        result = agents_client.create_thread_and_process_run(
            agent_id=RESTAURANT_ASSISTANT_ID,
            thread={
                "messages": [
                    {
                        "role": "user",
                        "content": query
                    }
                ]
            }
        )
        
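        if result.status.value == "completed":
            # Get messages from the thread
            messages = agents_client.messages.list(thread_id=result.thread_id)
            
            # Find the agent's response. Agent messages are expected to carry the
            # "assistant" role value; "agent" is also accepted in case the SDK differs.
            for msg in messages:
                if msg.role.value in ("assistant", "agent"):
                    return {
                        "response": msg.content[0].text.value,
                        "query": query
                    }
            
            # The run completed but no agent message was found
            return {
                "response": "Agent run completed but returned no agent message",
                "query": query
            }
        
        return {
            "response": f"Agent run failed with status: {result.status.value}",
            "query": query
        }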
        
    except Exception as e:
        return {
            "response": f"Error querying agent: {str(e)}",
            "query": query
        }

# Test the function with a sample query
test_result = query_restaurant_agent("What are the opening hours for Scheibmeir's?")
print("Test query result:")
print(f"Query: {test_result['query']}")
print(f"Response: {test_result['response'][:200]}...")
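
Because the target function is called once per query, a transient service failure affects only that row, but it still ends up scored as if it were a real answer. One possible mitigation is a retry wrapper, sketched below. `with_retries` is a hypothetical helper, and it assumes a variant of the target function that lets exceptions propagate instead of catching them; `flaky_agent` is a stand-in used only to exercise the wrapper.

```python
import time
from typing import Callable


def with_retries(
    fn: Callable[[str], dict],
    attempts: int = 3,
    delay_seconds: float = 1.0,
) -> Callable[[str], dict]:
    """Wrap a query function so that exceptions trigger a bounded retry."""

    def wrapper(query: str) -> dict:
        last_error = None
        for attempt in range(1, attempts + 1):
            try:
                return fn(query)
            except Exception as exc:  # broad on purpose: this is a sketch
                last_error = exc
                if attempt < attempts:
                    # Linear backoff: wait longer after each failed attempt
                    time.sleep(delay_seconds * attempt)
        # All attempts failed: return the same shape as a normal result
        return {"response": f"Error querying agent: {last_error}", "query": query}

    return wrapper


# Stand-in for the real agent call: fails twice, then succeeds.
calls = {"count": 0}


def flaky_agent(query: str) -> dict:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return {"response": "stub answer", "query": query}


robust_agent = with_retries(flaky_agent, attempts=3, delay_seconds=0.0)
demo_result = robust_agent("What are the opening hours?")
print(demo_result)
```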

## Configure Evaluators

In [None]:
# Configure the model for AI-assisted evaluators
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    azure_deployment=AGENT_MODEL_DEPLOYMENT_NAME,
    api_version=AZURE_OPENAI_API_VERSION,
    credential=credential
)

# Initialize evaluators
groundedness_evaluator = GroundednessEvaluator(model_config=model_config)
relevance_evaluator = RelevanceEvaluator(model_config=model_config)
coherence_evaluator = CoherenceEvaluator(model_config=model_config)
fluency_evaluator = FluencyEvaluator(model_config=model_config)

print("Evaluators configured successfully!")

## Load Test Data

In [None]:
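# Load the evaluation queries
queries = []
with open("evaluation_queries.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:  # skip blank lines, which json.loads would reject
            queries.append(json.loads(line))

print(f"Loaded {len(queries)} test queries")
print("Sample queries:")
for i, item in enumerate(queries[:5], 1):  # slicing avoids an IndexError on short files
    print(f"  {i}. {item['query']}")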
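
The loading cell expects one JSON object per line, each with a `query` field. The sketch below shows that format and a loader that reports the offending line when a row is malformed. `load_queries` is a hypothetical helper, and the sample rows and file name are made up for the demo (they are not taken from `evaluation_queries.jsonl`).

```python
import json
import os
import tempfile


def load_queries(path: str, required_key: str = "query") -> list:
    """Read a JSONL file, skipping blank lines and checking for a required key."""
    rows = []
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            if required_key not in row:
                raise ValueError(
                    f"Line {line_number} has no '{required_key}' field"
                )
            rows.append(row)
    return rows


# Made-up rows that follow the assumed one-object-per-line format.
sample_rows = [
    {"query": "What are the opening hours for Scheibmeir's?"},
    {"query": "How do I cook pasta?"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_queries.jsonl")
    with open(sample_path, "w", encoding="utf-8") as f:
        for row in sample_rows:
            f.write(json.dumps(row) + "\n")
        f.write("\n")  # a trailing blank line, which the loader skips
    loaded = load_queries(sample_path)

print(f"Loaded {len(loaded)} sample queries")
```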

## Run Evaluation

This cell runs every test query through the agent and scores each response with the four evaluators configured above.

In [None]:
# Azure AI project configuration for evaluation
azure_ai_project = {
    "subscription_id": AZURE_SUBSCRIPTION_ID,
    "project_name": AZURE_PROJECT_NAME,
    "resource_group_name": AZURE_RESOURCE_GROUP_NAME,
}

# Run the evaluation
print("Starting evaluation... This may take a while.")

evaluation_result = evaluate(
    data="evaluation_queries.jsonl",
    target=query_restaurant_agent,
    evaluators={
        "groundedness": groundedness_evaluator,
        "relevance": relevance_evaluator,
        "coherence": coherence_evaluator,
        "fluency": fluency_evaluator,
    },
    azure_ai_project=azure_ai_project,
)

print("Evaluation completed!")
print(f"Azure AI Foundry Studio URL: {evaluation_result.get('studio_url')}")

## Display Results

In [None]:
# Display evaluation metrics
print("Evaluation Metrics:")
print("=" * 50)

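metrics = evaluation_result.get("metrics", {})
for metric_name, metric_value in metrics.items():
    # Format numbers to four decimals; print any other value as-is
    if isinstance(metric_value, (int, float)):
        print(f"{metric_name}: {metric_value:.4f}")
    else:
        print(f"{metric_name}: {metric_value}")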

print("\nDetailed results are available in Azure AI Foundry Studio.")
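
Aggregate metrics hide which individual queries scored poorly. A rough sketch of pulling out the lowest-scoring rows follows. It assumes per-row results are available as a list of dictionaries and that the score column is named like `outputs.relevance.relevance`; both the key names and the demo rows are assumptions for illustration, so check them against the actual result before relying on this. `lowest_scoring` is a hypothetical helper.

```python
def lowest_scoring(rows: list, score_key: str, n: int = 3) -> list:
    """Return the n rows with the smallest numeric score, ignoring missing scores."""
    scored = [row for row in rows if isinstance(row.get(score_key), (int, float))]
    return sorted(scored, key=lambda row: row[score_key])[:n]


# Made-up rows; the key names are assumed, not confirmed by the notebook.
SCORE_KEY = "outputs.relevance.relevance"
demo_rows = [
    {"inputs.query": "q1", SCORE_KEY: 5},
    {"inputs.query": "q2", SCORE_KEY: 2},
    {"inputs.query": "q3", SCORE_KEY: None},  # evaluator produced no score
    {"inputs.query": "q4", SCORE_KEY: 3},
]

worst = lowest_scoring(demo_rows, SCORE_KEY, n=2)
for row in worst:
    print(f"{row['inputs.query']}: {row[SCORE_KEY]}")
```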

## Sample Results Analysis

Let's look at a few specific examples to understand how the agent performs.

In [None]:
# Test a few specific queries to see the responses
sample_queries = [
    "What are the opening hours for Scheibmeir's restaurant?",
    "What steaks does Scheibmeir's serve?",
    "How do I cook pasta?",  # Non-restaurant query
    "What's the weather like today?",  # Non-restaurant query
    "Does Scheibmeir's serve Chinese food?",
]

print("Sample Agent Responses:")
print("=" * 60)

for i, query in enumerate(sample_queries, 1):
    print(f"\n{i}. Query: {query}")
    result = query_restaurant_agent(query)
    print(f"   Response: {result['response'][:300]}{'...' if len(result['response']) > 300 else ''}")
    print("-" * 60)

## Summary

This evaluation notebook:
1. Tests the restaurant agent with 250 diverse queries
2. Evaluates responses using multiple AI-assisted metrics (groundedness, relevance, coherence, fluency)
3. Provides detailed results in Azure AI Foundry Studio
4. Shows sample responses to understand agent behavior

The evaluation helps assess:
- How well the agent answers questions about Scheibmeir's restaurant using the grounded PDF data
- Whether the agent stays on topic and handles non-restaurant queries appropriately
- The quality and coherence of the agent's responses
- Overall performance across different types of queries
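
To compare restaurant questions with off-topic ones, scores can be averaged per query type. The sketch below assumes each row carries a `category` label and a numeric score; neither field is confirmed to exist in the test data, and the rows shown are made up. `mean_score_by_category` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_category(rows: list, category_key: str, score_key: str) -> dict:
    """Average the numeric scores of rows grouped by their category label."""
    grouped = defaultdict(list)
    for row in rows:
        score = row.get(score_key)
        if isinstance(score, (int, float)):
            grouped[row.get(category_key, "unknown")].append(score)
    return {category: mean(scores) for category, scores in grouped.items()}


# Made-up rows with an assumed "category" field and an assumed score field.
labelled_rows = [
    {"category": "restaurant", "relevance": 5},
    {"category": "restaurant", "relevance": 4},
    {"category": "off_topic", "relevance": 2},
    {"category": "off_topic", "relevance": 3},
]

by_category = mean_score_by_category(labelled_rows, "category", "relevance")
for category, average in by_category.items():
    print(f"{category}: {average:.2f}")
```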