# Questionnaire Agent Evaluation

This notebook evaluates the Questionnaire Agent against a set of test queries centered on Azure AI topics.
The set mixes Azure AI-specific questions with general questions to check that the agent stays on topic and returns accurate, contextually relevant responses.

The evaluation uses the Azure AI Evaluation SDK to assess:
- **Relevance**: How relevant responses are to the queries
- **Coherence**: How logically structured and consistent responses are
- **Fluency**: How well-written and readable responses are
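
The mixed query set described above is read later in this notebook from `evaluation_queries.jsonl`. The following is a minimal sketch of what such a file could look like, assuming each line is a JSON object with a `query` field (inferred from the `query` parameter of the target function; the real file may carry more fields). The second and third questions are hypothetical examples, and the demo writes to a temporary directory rather than to the real file.

```python
import json
import tempfile
from pathlib import Path

# Sample queries: two Azure AI questions and one general, off-topic question.
# Only the first comes from this notebook; the other two are hypothetical.
sample_queries = [
    {"query": "What are the key features of Azure AI?"},
    {"query": "How does Azure OpenAI handle content filtering?"},
    {"query": "What is the capital of France?"},
]


def write_jsonl(records, path):
    """Write one JSON object per line, the layout JSONL readers expect."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dictionaries, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Written to a temporary directory so the real evaluation_queries.jsonl is untouched.
demo_path = Path(tempfile.mkdtemp()) / "evaluation_queries_demo.jsonl"
write_jsonl(sample_queries, demo_path)
loaded = read_jsonl(demo_path)
print(f"{len(loaded)} queries written to {demo_path.name}")
```

Keeping one general question per handful of Azure questions is one way to check the on-topic behavior without letting off-topic queries dominate the averages.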

## Environment Setup and Configuration

In [None]:
import os
import sys
import json
import datetime
from pathlib import Path
from typing import Dict, List, Tuple, Optional
from dotenv import load_dotenv

# Azure AI evaluation imports
from azure.ai.evaluation import evaluate
from azure.ai.evaluation import (
    RelevanceEvaluator,
    CoherenceEvaluator,
    FluencyEvaluator,
    AzureOpenAIModelConfiguration
)

# Azure authentication and client imports
from azure.identity import AzureCliCredential, DefaultAzureCredential
from azure.ai.projects import AIProjectClient

# Add the parent directory to sys.path to import the questionnaire agent
parent_dir = Path(__file__).parent.parent if '__file__' in globals() else Path.cwd().parent
sys.path.insert(0, str(parent_dir))

# Load environment variables from .env file
load_dotenv(override=True)

print("✅ Libraries imported successfully!")
print(f"📁 Working directory: {Path.cwd()}")
print(f"📂 Parent directory added to path: {parent_dir}")

In [None]:
# Load configuration from environment variables
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_MODEL_DEPLOYMENT = os.getenv("AZURE_OPENAI_MODEL_DEPLOYMENT", "gpt-4o-mini")

# Evaluation model configuration (use the same or different model for evaluation)
EVALUATION_MODEL_DEPLOYMENT = os.getenv("EVALUATION_MODEL_DEPLOYMENT")
EVALUATION_MODEL_ENDPOINT = os.getenv("EVALUATION_MODEL_ENDPOINT")
EVALUATION_OPENAI_API_VERSION = os.getenv("EVALUATION_OPENAI_API_VERSION")

# Bing Search configuration
BING_CONNECTION_ID = os.getenv("BING_CONNECTION_ID")

# Application Insights for tracing
APPLICATIONINSIGHTS_CONNECTION_STRING = os.getenv("APPLICATIONINSIGHTS_CONNECTION_STRING")

print("🔧 Configuration loaded:")
print(f"  Azure OpenAI Endpoint: {AZURE_OPENAI_ENDPOINT}")
print(f"  Main Model Deployment: {AZURE_OPENAI_MODEL_DEPLOYMENT}")
print(f"  Evaluation Model: {EVALUATION_MODEL_DEPLOYMENT}")
print(f"  Evaluation Endpoint: {EVALUATION_MODEL_ENDPOINT}")
print(f"  Evaluation API Version: {EVALUATION_OPENAI_API_VERSION}")
print(f"  Bing Connection ID: {BING_CONNECTION_ID}")

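# Verify required configurations, including the evaluation model settings used later
required_configs = {
    "AZURE_OPENAI_ENDPOINT": AZURE_OPENAI_ENDPOINT,
    "AZURE_OPENAI_MODEL_DEPLOYMENT": AZURE_OPENAI_MODEL_DEPLOYMENT,
    "BING_CONNECTION_ID": BING_CONNECTION_ID,
    "EVALUATION_MODEL_ENDPOINT": EVALUATION_MODEL_ENDPOINT,
    "EVALUATION_MODEL_DEPLOYMENT": EVALUATION_MODEL_DEPLOYMENT,
}
missing_configs = [name for name, value in required_configs.items() if not value]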

if missing_configs:
    print(f"❌ Missing required environment variables: {', '.join(missing_configs)}")
    print("   Please check your .env file configuration.")
else:
    print("✅ All required environment variables are configured!")

## Initialize Azure AI Project Client

In [None]:
# Initialize Azure AI Project Client for evaluation
try:
    # Use Azure CLI credentials for authentication
    credential = AzureCliCredential()
    
    # Initialize the Azure AI Project Client
    # (AZURE_OPENAI_ENDPOINT is expected to hold the Azure AI project endpoint)
    project_client = AIProjectClient(
        endpoint=AZURE_OPENAI_ENDPOINT,
        credential=credential
    )
    
    print("✅ Azure AI Project Client initialized successfully!")
    print(f"🔗 Connected to endpoint: {AZURE_OPENAI_ENDPOINT}")
    
except Exception as e:
    print(f"❌ Failed to initialize Azure AI Project Client: {str(e)}")
    print("   Please ensure you are logged in with 'az login' and have proper permissions.")
    raise

## Create Questionnaire Agent Query Function

This function serves as the target for the evaluation system. It will use the existing questionnaire agent to process queries and return responses in the format expected by the evaluation framework.
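
The contract described above (take a query string, always return a dictionary with `query` and `response`) can be tried without calling any Azure service. The sketch below is illustrative only: `make_target`, `stub_agent`, and the placeholder answer text are hypothetical names and values, not part of the questionnaire agent.

```python
from typing import Callable, Dict, Tuple

FALLBACK_RESPONSE = "The agent was unable to generate a response for this query."


def make_target(
    answer_fn: Callable[[str], Tuple[bool, str]]
) -> Callable[[str], Dict[str, str]]:
    """Wrap an answer function so it always returns a query/response dictionary."""

    def target(query: str) -> Dict[str, str]:
        try:
            success, answer = answer_fn(query)
        except Exception as exc:  # keep the evaluation run alive on agent errors
            return {
                "query": query,
                "response": f"Error occurred while processing query: {exc}",
            }
        if success and answer:
            return {"query": query, "response": answer}
        return {"query": query, "response": FALLBACK_RESPONSE}

    return target


def stub_agent(query: str) -> Tuple[bool, str]:
    """Hypothetical stand-in for the real agent: answers only Azure questions."""
    if "azure" in query.lower():
        return True, "Placeholder answer about Azure AI."
    return False, ""


stub_target = make_target(stub_agent)
on_topic = stub_target("What is Azure AI Foundry?")
off_topic = stub_target("What is the capital of France?")
print(on_topic["response"])
print(off_topic["response"])
```

Returning a fallback string instead of raising means a single failed query shows up as a low score rather than aborting the whole run, which matches the behavior of the real target function in the next cells.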

In [None]:
# Import the questionnaire agent
try:
    from question_answerer import QuestionnaireAgentUI
    print("✅ Successfully imported QuestionnaireAgentUI")
except ImportError as e:
    print(f"❌ Failed to import QuestionnaireAgentUI: {str(e)}")
    print("   Make sure the question_answerer.py file is in the parent directory")
    raise

# Initialize the questionnaire agent in headless mode for evaluation
try:
    questionnaire_agent = QuestionnaireAgentUI(
        headless_mode=True, 
        max_retries=3,  # Limit retries for faster evaluation
        mock_mode=False  # Use real Azure AI services for evaluation
    )
    print("✅ Questionnaire agent initialized in headless mode")
except Exception as e:
    print(f"❌ Failed to initialize questionnaire agent: {str(e)}")
    raise



In [None]:
def query_questionnaire_agent(query: str) -> Dict[str, str]:
    """
    Target function for evaluation that queries the questionnaire agent.
    
    Args:
        query (str): The question to ask the agent
        
    Returns:
        Dict[str, str]: Dictionary containing query and response for evaluation
    """
    try:
        # Use default context and parameters for evaluation
        context = "Microsoft Azure AI"
        char_limit = 2000
        max_retries = 3
        verbose = False
        
        # Process the query using the questionnaire agent
        success, answer, links = questionnaire_agent.process_single_question_cli(
            question=query,
            context=context,
            char_limit=char_limit,
            verbose=verbose,
            max_retries=max_retries
        )
        
        if success and answer:
            # Return in the format expected by the evaluation framework
            return {
                "query": query,
                "response": answer
            }
        else:
            # Handle case where agent failed to generate a response
            return {
                "query": query,
                "response": "The agent was unable to generate a response for this query."
            }
            
    except Exception as e:
        print(f"❌ Error querying agent for '{query[:50]}...': {str(e)}")
        return {
            "query": query,
            "response": f"Error occurred while processing query: {str(e)}"
        }

# Test the function with a sample query
print("🧪 Testing the query function with a sample question...")
test_result = query_questionnaire_agent("What are the key features of Azure AI?")
print(f"✅ Test query: {test_result['query']}")
print(f"📝 Response preview: {test_result['response'][:200]}...")

## Configure AI Evaluation Models

Set up the Azure OpenAI model configuration and initialize the evaluation metrics that will assess the quality of the agent's responses.
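
Because the evaluation model has its own endpoint, deployment, API version, and key, it can help to check all four before building the configuration. The following is a small sketch of such a check; `find_missing` is a hypothetical helper, and the example values are made up for illustration (in the notebook the mapping would be `os.environ`).

```python
from typing import List, Mapping, Optional

# Variable names taken from this notebook's configuration cells.
EVALUATION_SETTINGS = [
    "EVALUATION_MODEL_ENDPOINT",
    "EVALUATION_MODEL_DEPLOYMENT",
    "EVALUATION_OPENAI_API_VERSION",
    "EVALUATION_API_KEY",
]


def find_missing(
    settings: Mapping[str, Optional[str]], required: List[str]
) -> List[str]:
    """Return the required names whose value is absent or blank."""
    return [name for name in required if not (settings.get(name) or "").strip()]


# Hypothetical values for illustration only.
example_env = {
    "EVALUATION_MODEL_ENDPOINT": "https://example.openai.azure.com/",
    "EVALUATION_MODEL_DEPLOYMENT": "gpt-4o-mini",
    "EVALUATION_OPENAI_API_VERSION": "",
}
missing = find_missing(example_env, EVALUATION_SETTINGS)
print(f"Missing evaluation settings: {missing}")
```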

In [None]:
# Configure Azure OpenAI model for evaluation
# Note: We need to use API key authentication for the evaluation models
# since the evaluation SDK requires explicit API keys

# Read the evaluation API key from the EVALUATION_API_KEY environment variable
EVALUATION_API_KEY = os.getenv("EVALUATION_API_KEY")

if not EVALUATION_API_KEY:
    print("⚠️  Warning: No Azure OpenAI API key found in environment variables.")
    print("   Please set EVALUATION_API_KEY in your .env file.")
    print("   The evaluation SDK requires explicit API key authentication.")
else:
    print(f"✅ Found API key for evaluation (length: {len(EVALUATION_API_KEY)} characters)")

# Configure the evaluation model
try:
    model_config = AzureOpenAIModelConfiguration(
        azure_endpoint=EVALUATION_MODEL_ENDPOINT,
        azure_deployment=EVALUATION_MODEL_DEPLOYMENT,
        api_version=EVALUATION_OPENAI_API_VERSION,
        api_key=EVALUATION_API_KEY
    )
    print("✅ Model configuration created for evaluation")
    print(f"   Endpoint: {EVALUATION_MODEL_ENDPOINT}")
    print(f"   Deployment: {EVALUATION_MODEL_DEPLOYMENT}")
    print(f"   API Version: {EVALUATION_OPENAI_API_VERSION}")
    
except Exception as e:
    print(f"❌ Failed to create model configuration: {str(e)}")
    raise

In [None]:
# Initialize all evaluators
try:
    print("🔧 Initializing evaluation metrics...")
    
    # Relevance: Measures how relevant the response is to the query
    relevance_evaluator = RelevanceEvaluator(model_config=model_config)
    print("✅ Relevance evaluator initialized")
    
    # Coherence: Measures the logical flow and consistency of the response
    coherence_evaluator = CoherenceEvaluator(model_config=model_config)
    print("✅ Coherence evaluator initialized")
    
    # Fluency: Measures the readability and linguistic quality of the response
    fluency_evaluator = FluencyEvaluator(model_config=model_config)
    print("✅ Fluency evaluator initialized")
    
    print("\n🎯 All evaluators configured successfully!")
    print("   Each evaluator will assess different aspects of response quality:")
    print("   • Relevance: How relevant responses are to queries")
    print("   • Coherence: How logically structured responses are")
    print("   • Fluency: How well-written and readable responses are")
    
except Exception as e:
    print(f"❌ Error configuring evaluators: {str(e)}")
    print("   This might be due to API key issues or model configuration problems.")
    raise

## Run Comprehensive Agent Evaluation

Run the relevance, coherence, and fluency evaluators against the test dataset in `evaluation_queries.jsonl`. Each query is sent to the agent, and the response is then scored by every evaluator.

**Note**: The run can take 10-30 minutes, depending on the number of queries and on the response times of the agent and the evaluation model.
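
Given the run time, one option is to try the pipeline on a few queries first. The sketch below copies the first few lines of a JSONL file into a smaller file that could be passed as the dataset for a quick smoke test. `write_subset` and the file names are hypothetical, and the demo data is generated on the fly in a temporary directory.

```python
import itertools
import tempfile
from pathlib import Path


def write_subset(source: Path, destination: Path, limit: int) -> int:
    """Copy the first `limit` non-blank lines of a JSONL file; return how many were copied."""
    copied = 0
    with open(source, encoding="utf-8") as src, open(
        destination, "w", encoding="utf-8"
    ) as dst:
        non_blank = (line for line in src if line.strip())
        for line in itertools.islice(non_blank, limit):
            # Make sure every copied record ends with a newline
            dst.write(line if line.endswith("\n") else line + "\n")
            copied += 1
    return copied


# Demonstration on a throwaway file with hypothetical queries.
work_dir = Path(tempfile.mkdtemp())
full_file = work_dir / "queries_full.jsonl"
full_file.write_text(
    "\n".join(f'{{"query": "Hypothetical question {i}"}}' for i in range(1, 11)) + "\n",
    encoding="utf-8",
)
smoke_file = work_dir / "queries_smoke.jsonl"
copied_count = write_subset(full_file, smoke_file, limit=3)
print(f"Copied {copied_count} of 10 queries to {smoke_file.name}")
```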

In [None]:
# Prepare for evaluation
evaluation_name = f"questionnaire_agent_evaluation_{datetime.datetime.now().strftime('%Y%m%d_%H%M%S')}"

print(f"🚀 Starting comprehensive evaluation: {evaluation_name}")
print(f"⏰ Started at: {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("\n⚠️  This evaluation may take 10-30 minutes depending on query complexity...")
print("   Each query requires multiple API calls to the agent and evaluation models.")

# Run the evaluation
try:
    # Configure the evaluators dictionary with proper naming for Azure AI Foundry
    evaluators_config = {
        "relevance": relevance_evaluator,
        "coherence": coherence_evaluator,
        "fluency": fluency_evaluator,
    }
    
    # Execute the evaluation
    evaluation_result = evaluate(
        data="evaluation_queries.jsonl",
        target=query_questionnaire_agent,
        evaluators=evaluators_config,
        azure_ai_project=AZURE_OPENAI_ENDPOINT,  # Azure AI project endpoint
        evaluation_name=evaluation_name,
    )
    
    print("\n✅ Evaluation completed successfully!")
    print(f"⏰ Finished at: {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    
    # Display Azure AI Foundry Studio URL if available (the key may be present but empty)
    studio_url = evaluation_result.get('studio_url')
    if studio_url:
        print(f"🔗 Azure AI Foundry Studio URL: {studio_url}")
        print("   You can view detailed results and visualizations in the Azure AI Foundry portal.")
    
except Exception as e:
    print(f"❌ Evaluation failed: {str(e)}")
    print("\n🔧 Troubleshooting tips:")
    print("   • Ensure Azure CLI authentication is working: 'az login'")
    print("   • Check that all required environment variables are set")
    print("   • Verify Azure OpenAI API quotas and limits")
    print("   • Check network connectivity to Azure services")
    raise

## Summary and Next Steps

This notebook assesses the Questionnaire Agent's performance across three dimensions of response quality:

### What This Evaluation Measures:

1. **Relevance**: How relevant and on-topic the responses are to the input queries
2. **Coherence**: How logically structured and internally consistent the responses are
3. **Fluency**: How well-written, readable, and linguistically sound the responses are

### Key Benefits:

- **Objective Assessment**: Uses AI-assisted evaluation for consistent, scalable assessment
- **Multi-dimensional Analysis**: Evaluates different aspects of response quality
- **Azure Integration**: Results are available in Azure AI Foundry Studio for detailed analysis
- **Reproducible**: Can be run repeatedly to track improvements over time

### Next Steps:

1. **Review Detailed Results**: Use the Azure AI Foundry Studio URL to explore individual query results
2. **Identify Improvement Areas**: Focus on metrics with lower scores
3. **Iterative Development**: Re-run evaluation after making improvements to track progress
4. **Expand Evaluation**: Consider adding more specific test cases or custom evaluators
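
To find the metrics with lower scores, the aggregate scores can be sorted from weakest to strongest. This is a minimal sketch that assumes the scores are available as a flat name-to-score dictionary; the key names and numbers below are hypothetical and do not reflect actual SDK output or real results.

```python
from typing import Dict, List, Tuple


def rank_metrics(metrics: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort aggregate metric scores from lowest to highest."""
    return sorted(metrics.items(), key=lambda item: item[1])


# Hypothetical aggregate scores; key names are assumptions, not SDK output.
example_metrics = {"relevance": 4.2, "coherence": 4.6, "fluency": 3.9}
ranked = rank_metrics(example_metrics)
for name, score in ranked:
    print(f"{name:<10} {score:.2f}")

# The first entry is the weakest dimension and a natural place to start.
weakest_metric = ranked[0][0]
```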

### Files Generated:

- Detailed JSON results for programmatic analysis
- Human-readable summary reports
- Links to Azure AI Foundry Studio for interactive exploration
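
The cells above do not show an explicit step that writes results to disk, so a local JSON copy may need to be saved by hand. The following is one possible sketch, assuming the result is a plain dictionary; `save_results` and the example payload are hypothetical.

```python
import datetime
import json
import tempfile
from pathlib import Path


def save_results(
    result: dict, directory: Path, prefix: str = "questionnaire_agent_evaluation"
) -> Path:
    """Write a result dictionary to a timestamped JSON file and return its path."""
    timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    output_file = directory / f"{prefix}_{timestamp}.json"
    with open(output_file, "w", encoding="utf-8") as handle:
        # default=str keeps the dump from failing on values JSON cannot encode
        json.dump(result, handle, indent=2, ensure_ascii=False, default=str)
    return output_file


# Hypothetical payload; real results would come from the evaluation run.
example_result = {
    "metrics": {"relevance": 4.2},
    "rows": [{"query": "What are the key features of Azure AI?"}],
}
saved_path = save_results(example_result, Path(tempfile.mkdtemp()))
reloaded = json.loads(saved_path.read_text(encoding="utf-8"))
print(f"Saved results to {saved_path.name}")
```

The timestamp format mirrors the one used for `evaluation_name`, which makes it easier to match a saved file to its run.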