# Questionnaire Agent Evaluation

This notebook evaluates the Questionnaire Agent against a set of test queries.
The set mixes Azure AI-specific questions with general, off-topic questions to test whether the agent stays on topic and gives accurate, contextually relevant responses.

The evaluation uses the Azure AI Evaluation SDK (`azure-ai-evaluation`) to assess:
- **Groundedness**: Whether the claims in a response are supported by the supplied context
- **Relevance**: How relevant responses are to the queries
- **Coherence**: How logically structured and consistent responses are
- **Fluency**: How well-written and readable responses are
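
Each metric reads a different subset of the record produced for a query. The sketch below checks a record for the fields each metric needs before an evaluation run. The field requirements in `REQUIRED_FIELDS` reflect my understanding of the SDK's built-in evaluators and should be treated as an assumption to verify against the installed version; `missing_inputs` and the sample record are hypothetical.

```python
from typing import Dict, List, Mapping, Tuple

# Assumed input fields per metric; confirm against the installed SDK version.
REQUIRED_FIELDS: Dict[str, Tuple[str, ...]] = {
    "groundedness": ("response", "context"),
    "relevance": ("query", "response"),
    "coherence": ("query", "response"),
    "fluency": ("response",),
}


def missing_inputs(record: Mapping[str, str]) -> Dict[str, List[str]]:
    """Return, per metric, the required fields that are absent or empty."""
    gaps: Dict[str, List[str]] = {}
    for metric, fields in REQUIRED_FIELDS.items():
        absent = [field for field in fields if not record.get(field)]
        if absent:
            gaps[metric] = absent
    return gaps


# Hypothetical record without a context field
record = {"query": "What is Azure AI Search?", "response": "A managed search service."}
gaps = missing_inputs(record)
print(gaps)
```

A record without `context` would only block the groundedness metric here; the other three could still be scored.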

## Environment Setup and Configuration

In [None]:
import os
import sys
import json
import datetime
from pathlib import Path
from typing import Dict, List, Tuple, Optional
from dotenv import load_dotenv

# Azure AI evaluation imports
from azure.ai.evaluation import evaluate
from azure.ai.evaluation import (
    GroundednessEvaluator,
    RelevanceEvaluator,
    CoherenceEvaluator,
    FluencyEvaluator,
    AzureOpenAIModelConfiguration
)

# Azure authentication and client imports
from azure.identity import AzureCliCredential, DefaultAzureCredential
from azure.ai.projects import AIProjectClient

# Add the parent directory to sys.path to import the questionnaire agent
parent_dir = Path(__file__).parent.parent if '__file__' in globals() else Path.cwd().parent
sys.path.insert(0, str(parent_dir))

# Load environment variables from .env file
load_dotenv(override=True)

print("✅ Libraries imported successfully!")
print(f"📁 Working directory: {Path.cwd()}")
print(f"📂 Parent directory added to path: {parent_dir}")

In [None]:
# Load configuration from environment variables
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_MODEL_DEPLOYMENT = os.getenv("AZURE_OPENAI_MODEL_DEPLOYMENT", "gpt-4o-mini")
AZURE_OPENAI_API_VERSION = os.getenv("AZURE_OPENAI_API_VERSION", "2025-01-01-preview")

# Evaluation model configuration (use the same or different model for evaluation)
EVALUATION_MODEL = os.getenv("EVALUATION_MODEL", AZURE_OPENAI_MODEL_DEPLOYMENT)
EVALUATION_MODEL_ENDPOINT = os.getenv("EVALUATION_MODEL_ENDPOINT", AZURE_OPENAI_ENDPOINT)
EVALUATION_OPENAI_API_VERSION = os.getenv("EVALUATION_OPENAI_API_VERSION", AZURE_OPENAI_API_VERSION)

# Bing Search configuration
BING_CONNECTION_ID = os.getenv("BING_CONNECTION_ID")

# Application Insights for tracing
APPLICATIONINSIGHTS_CONNECTION_STRING = os.getenv("APPLICATIONINSIGHTS_CONNECTION_STRING")

print("🔧 Configuration loaded:")
print(f"  Azure OpenAI Endpoint: {AZURE_OPENAI_ENDPOINT}")
print(f"  Main Model Deployment: {AZURE_OPENAI_MODEL_DEPLOYMENT}")
print(f"  Evaluation Model: {EVALUATION_MODEL}")
print(f"  Evaluation Endpoint: {EVALUATION_MODEL_ENDPOINT}")
print(f"  API Version: {AZURE_OPENAI_API_VERSION}")
print(f"  Bing Connection ID: {BING_CONNECTION_ID}")

# Verify required configurations
missing_configs = []
if not AZURE_OPENAI_ENDPOINT:
    missing_configs.append("AZURE_OPENAI_ENDPOINT")
if not AZURE_OPENAI_MODEL_DEPLOYMENT:
    missing_configs.append("AZURE_OPENAI_MODEL_DEPLOYMENT")
if not BING_CONNECTION_ID:
    missing_configs.append("BING_CONNECTION_ID")

if missing_configs:
    print(f"❌ Missing required environment variables: {', '.join(missing_configs)}")
    print("   Please check your .env file configuration.")
else:
    print("✅ All required environment variables are configured!")

## Initialize Azure AI Project Client

In [None]:
# Initialize Azure AI Project Client for evaluation
try:
    # Use Azure CLI credentials for authentication
    credential = AzureCliCredential()
    
    # Initialize the Azure AI Project Client
    project_client = AIProjectClient(
        endpoint=AZURE_OPENAI_ENDPOINT,
        credential=credential
    )
    
    print("✅ Azure AI Project Client initialized successfully!")
    print(f"🔗 Connected to endpoint: {AZURE_OPENAI_ENDPOINT}")
    
except Exception as e:
    print(f"❌ Failed to initialize Azure AI Project Client: {str(e)}")
    print("   Please ensure you are logged in with 'az login' and have proper permissions.")
    raise

## Create Questionnaire Agent Query Function

This function serves as the target for the evaluation system. It will use the existing questionnaire agent to process queries and return responses in the format expected by the evaluation framework.
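
The contract described above can be tried without Azure access. This is a minimal sketch, assuming the agent call returns a `(success, answer, links)` tuple as in this notebook; `make_target` and `fake_agent` are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, List, Tuple

AgentCall = Callable[[str], Tuple[bool, str, List[str]]]


def make_target(agent_call: AgentCall, context: str = "Microsoft Azure AI") -> Callable[[str], Dict[str, str]]:
    """Wrap an agent call so it always returns query/response/context strings."""

    def target(query: str) -> Dict[str, str]:
        try:
            success, answer, links = agent_call(query)
        except Exception as exc:
            # One failing query should not abort the whole evaluation run
            return {"query": query, "response": f"Error: {exc}", "context": "Agent raised an exception"}
        if not (success and answer):
            return {"query": query, "response": "No response generated.", "context": f"Failed in context: {context}"}
        link_text = ", ".join(links) if links else "None"
        return {"query": query, "response": answer, "context": f"Context: {context}. Links: {link_text}"}

    return target


def fake_agent(question: str) -> Tuple[bool, str, List[str]]:
    """Hypothetical stand-in for the real agent."""
    if "fail" in question:
        return False, "", []
    return True, f"Answer to: {question}", ["https://example.com/doc"]


target = make_target(fake_agent)
result = target("What is Azure AI?")
failed = target("please fail")
print(result)
```

Returning a well-formed dictionary on failure keeps every row scoreable, at the cost of failed queries appearing as low-scoring responses rather than as missing rows.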

In [None]:
# Import the questionnaire agent
try:
    from question_answerer import QuestionnaireAgentUI
    print("✅ Successfully imported QuestionnaireAgentUI")
except ImportError as e:
    print(f"❌ Failed to import QuestionnaireAgentUI: {str(e)}")
    print("   Make sure the question_answerer.py file is in the parent directory")
    raise

# Initialize the questionnaire agent in headless mode for evaluation
try:
    questionnaire_agent = QuestionnaireAgentUI(
        headless_mode=True, 
        max_retries=3,  # Limit retries for faster evaluation
        mock_mode=False  # Use real Azure AI services for evaluation
    )
    print("✅ Questionnaire agent initialized in headless mode")
except Exception as e:
    print(f"❌ Failed to initialize questionnaire agent: {str(e)}")
    raise

In [None]:
def query_questionnaire_agent(query: str) -> Dict[str, str]:
    """
    Target function for evaluation that queries the questionnaire agent.
    
    Args:
        query (str): The question to ask the agent
        
    Returns:
        Dict[str, str]: Dictionary containing query, response, and context for evaluation
    """
    try:
        # Use default context and parameters for evaluation
        context = "Microsoft Azure AI"
        char_limit = 2000
        max_retries = 3
        verbose = False
        
        # Process the query using the questionnaire agent
        success, answer, links = questionnaire_agent.process_single_question_cli(
            question=query,
            context=context,
            char_limit=char_limit,
            verbose=verbose,
            max_retries=max_retries
        )
        
        if success and answer:
            # Return in the format expected by the evaluation framework
            return {
                "query": query,
                "response": answer,
                "context": f"Question processed in context: {context}. Links found: {', '.join(links) if links else 'None'}"
            }
        else:
            # Handle case where agent failed to generate a response
            return {
                "query": query,
                "response": "The agent was unable to generate a response for this query.",
                "context": f"Failed to process question in context: {context}"
            }
            
    except Exception as e:
        print(f"❌ Error querying agent for '{query[:50]}...': {str(e)}")
        return {
            "query": query,
            "response": f"Error occurred while processing query: {str(e)}",
            "context": "Error context - agent encountered an exception"
        }

# Test the function with a sample query
print("🧪 Testing the query function with a sample question...")
test_result = query_questionnaire_agent("What are the key features of Azure AI?")
print(f"✅ Test query: {test_result['query']}")
print(f"📝 Response preview: {test_result['response'][:200]}...")
print(f"📋 Context: {test_result['context']}")

## Configure AI Evaluation Models

Set up the Azure OpenAI model configuration and initialize the evaluation metrics that will assess the quality of the agent's responses.
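
The fallback chain for the evaluation settings can be sketched as a plain function over a mapping of environment values. This is an illustrative sketch only: `build_model_config` is a hypothetical helper, and it returns an ordinary dictionary rather than an SDK object. It leaves `api_key` out entirely when no key is configured.

```python
from typing import Dict, Mapping


def build_model_config(env: Mapping[str, str]) -> Dict[str, str]:
    """Resolve evaluation settings, falling back to the main model's settings."""
    endpoint = env.get("EVALUATION_MODEL_ENDPOINT") or env.get("AZURE_OPENAI_ENDPOINT")
    if not endpoint:
        raise ValueError("No evaluation endpoint configured")

    config = {
        "azure_endpoint": endpoint,
        "azure_deployment": env.get("EVALUATION_MODEL")
        or env.get("AZURE_OPENAI_MODEL_DEPLOYMENT", "gpt-4o-mini"),
        "api_version": env.get("EVALUATION_OPENAI_API_VERSION")
        or env.get("AZURE_OPENAI_API_VERSION", "2025-01-01-preview"),
    }

    # Add the key only when one exists, so the config never carries api_key=None
    api_key = env.get("AZURE_OPENAI_KEY") or env.get("AZURE_OPENAI_API_KEY")
    if api_key:
        config["api_key"] = api_key
    return config


keyless = build_model_config({"AZURE_OPENAI_ENDPOINT": "https://example.openai.azure.com"})
keyed = build_model_config(
    {"AZURE_OPENAI_ENDPOINT": "https://example.openai.azure.com", "AZURE_OPENAI_KEY": "secret"}
)
print(keyless)
```

In the notebook, `os.environ` would be passed as `env` after `load_dotenv` has run.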

In [None]:
# Configure the Azure OpenAI model used by the evaluators.
# An API key is read from the environment when available. If none is set, the key is
# left out of the configuration instead of being passed as None; whether keyless
# authentication then works depends on the installed SDK version.

# Try to get API key from environment
EVALUATION_API_KEY = os.getenv("AZURE_OPENAI_KEY") or os.getenv("AZURE_OPENAI_API_KEY")

if not EVALUATION_API_KEY:
    print("⚠️  Warning: No Azure OpenAI API key found in environment variables.")
    print("   The evaluators will rely on credential-based authentication, which may not be supported.")
    print("   Consider setting AZURE_OPENAI_KEY in your .env file.")

# Configure the evaluation model
try:
    model_config_kwargs = {
        "azure_endpoint": EVALUATION_MODEL_ENDPOINT,
        "azure_deployment": EVALUATION_MODEL,
        "api_version": EVALUATION_OPENAI_API_VERSION,
    }
    if EVALUATION_API_KEY:
        # Only include the key when one was found
        model_config_kwargs["api_key"] = EVALUATION_API_KEY
    model_config = AzureOpenAIModelConfiguration(**model_config_kwargs)
    print(f"✅ Model configuration created for evaluation")
    print(f"   Endpoint: {EVALUATION_MODEL_ENDPOINT}")
    print(f"   Deployment: {EVALUATION_MODEL}")
    print(f"   API Version: {EVALUATION_OPENAI_API_VERSION}")
    
except Exception as e:
    print(f"❌ Failed to create model configuration: {str(e)}")
    raise

In [None]:
# Initialize all evaluators
try:
    print("🔧 Initializing evaluation metrics...")
    
    # Groundedness: Measures how well the response is grounded in the provided context
    groundedness_evaluator = GroundednessEvaluator(model_config=model_config)
    print("✅ Groundedness evaluator initialized")
    
    # Relevance: Measures how relevant the response is to the query
    relevance_evaluator = RelevanceEvaluator(model_config=model_config)
    print("✅ Relevance evaluator initialized")
    
    # Coherence: Measures the logical flow and consistency of the response
    coherence_evaluator = CoherenceEvaluator(model_config=model_config)
    print("✅ Coherence evaluator initialized")
    
    # Fluency: Measures the readability and linguistic quality of the response
    fluency_evaluator = FluencyEvaluator(model_config=model_config)
    print("✅ Fluency evaluator initialized")
    
    print("\n🎯 All evaluators configured successfully!")
    print("   Each evaluator will assess different aspects of response quality:")
    print("   • Groundedness: How well responses are based on context")
    print("   • Relevance: How relevant responses are to queries")
    print("   • Coherence: How logically structured responses are")
    print("   • Fluency: How well-written and readable responses are")
    
except Exception as e:
    print(f"❌ Error configuring evaluators: {str(e)}")
    print("   This might be due to API key issues or model configuration problems.")
    raise

## Load Test Queries from JSONL File

Load the evaluation queries that include both Azure AI-specific questions and general questions to comprehensively test the agent's capabilities and topic adherence.
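
The file is expected to hold one JSON object per line, each with a `query` field. The sketch below writes a small sample file to a temporary directory and loads it, skipping blank lines, invalid JSON, and records without `query`. The sample questions and the `load_queries` helper are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def load_queries(path: Path) -> List[str]:
    """Read the 'query' field from every valid line of a JSONL file."""
    queries: List[str] = []
    with path.open(encoding="utf-8") as handle:
        for line_number, raw in enumerate(handle, 1):
            raw = raw.strip()
            if not raw:
                continue  # blank lines are allowed
            try:
                record = json.loads(raw)
            except json.JSONDecodeError:
                print(f"Skipping invalid JSON on line {line_number}")
                continue
            if isinstance(record, dict) and "query" in record:
                queries.append(record["query"])
            else:
                print(f"Skipping line {line_number}: no 'query' field")
    return queries


# Hypothetical sample content: two valid records and three lines to skip
sample_lines = [
    '{"query": "What is Azure AI Search?"}',
    "",
    "not json",
    '{"question": "wrong field name"}',
    '{"query": "What is the capital of France?"}',
]

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "evaluation_queries.jsonl"
    sample_path.write_text("\n".join(sample_lines), encoding="utf-8")
    loaded = load_queries(sample_path)

print(loaded)
```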

In [None]:
# Load evaluation queries from JSONL file
evaluation_queries_file = "evaluation_queries.jsonl"

try:
    # Check if the file exists
    if not Path(evaluation_queries_file).exists():
        print(f"❌ Evaluation queries file not found: {evaluation_queries_file}")
        print(f"   Current working directory: {Path.cwd()}")
        print(f"   Looking for file at: {Path(evaluation_queries_file).absolute()}")
        raise FileNotFoundError(f"Could not find {evaluation_queries_file}")
    
    # Read the JSONL file
    queries = []
    with open(evaluation_queries_file, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if line:  # Skip empty lines
                try:
                    query_data = json.loads(line)
                    if 'query' in query_data:
                        queries.append(query_data['query'])
                    else:
                        print(f"⚠️  Warning: Line {line_num} missing 'query' field: {line}")
                except json.JSONDecodeError as e:
                    print(f"⚠️  Warning: Invalid JSON on line {line_num}: {line}")
                    print(f"   Error: {str(e)}")
    
    print(f"✅ Successfully loaded {len(queries)} evaluation queries")
    
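    # Analyze the types of queries.
    # 'ai' is matched as a whole word only: a plain substring test would also match
    # words such as "explain" or "train" and misclassify general queries.
    import re
    azure_keyword_pattern = re.compile(r'azure|openai|cognitive|machine learning|\bai\b', re.IGNORECASE)
    azure_ai_queries = [q for q in queries if azure_keyword_pattern.search(q)]
    general_queries = [q for q in queries if q not in azure_ai_queries]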
    
    print(f"📊 Query breakdown:")
    print(f"   • Azure AI related queries: {len(azure_ai_queries)}")
    print(f"   • General/off-topic queries: {len(general_queries)}")
    print(f"   • Total queries: {len(queries)}")
    
    # Show some sample queries
    print(f"\n📝 Sample Azure AI queries:")
    for i, query in enumerate(azure_ai_queries[:3], 1):
        print(f"   {i}. {query}")
    
    print(f"\n📝 Sample general queries:")
    for i, query in enumerate(general_queries[:3], 1):
        print(f"   {i}. {query}")
        
except Exception as e:
    print(f"❌ Error loading evaluation queries: {str(e)}")
    raise

## Run Comprehensive Agent Evaluation

Execute the evaluation pipeline using all configured evaluators against the test dataset. This process will measure agent performance across multiple dimensions.

**Note**: This evaluation may take a significant amount of time depending on the number of queries and the response time of the agent and evaluation models.
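
Because a full run is slow, it can help to run a small subset first to confirm that authentication and configuration work. The following sketch copies the first few queries into a separate JSONL file that could be used as the data file for a trial run. The helper `write_subset` and the file name `evaluation_queries_smoke.jsonl` are hypothetical.

```python
import tempfile
from pathlib import Path


def write_subset(source: Path, destination: Path, limit: int) -> int:
    """Copy the first `limit` non-empty lines of a JSONL file; return the count written."""
    written = 0
    with source.open(encoding="utf-8") as src, destination.open("w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # ignore blank lines
            if written >= limit:
                break
            dst.write(line.rstrip("\n") + "\n")
            written += 1
    return written


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the real query file: five placeholder queries
    full = Path(tmp) / "evaluation_queries.jsonl"
    full.write_text("".join(f'{{"query": "q{i}"}}\n' for i in range(5)), encoding="utf-8")

    subset = Path(tmp) / "evaluation_queries_smoke.jsonl"
    count = write_subset(full, subset, limit=2)
    subset_lines = subset.read_text(encoding="utf-8").splitlines()

print(count, subset_lines)
```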

In [None]:
# Prepare for evaluation
evaluation_name = f"questionnaire_agent_evaluation_{datetime.datetime.now().strftime('%Y%m%d_%H%M%S')}"

print(f"🚀 Starting comprehensive evaluation: {evaluation_name}")
print(f"📊 Total queries to evaluate: {len(queries)}")
print(f"⏰ Started at: {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("\n⚠️  This evaluation may take 10-30 minutes depending on query complexity...")
print("   Each query requires multiple API calls to the agent and evaluation models.")

# Run the evaluation
try:
    # Configure the evaluators dictionary with proper naming for Azure AI Foundry
    evaluators_config = {
        "groundedness": groundedness_evaluator,
        "relevance": relevance_evaluator,
        "coherence": coherence_evaluator,
        "fluency": fluency_evaluator,
    }
    
    # Execute the evaluation
    evaluation_result = evaluate(
        data=evaluation_queries_file,
        target=query_questionnaire_agent,
        evaluators=evaluators_config,
        azure_ai_project=AZURE_OPENAI_ENDPOINT,  # Must resolve to the Azure AI project endpoint for results to be logged to the portal
        evaluation_name=evaluation_name,
    )
    
    print(f"\n✅ Evaluation completed successfully!")
    print(f"⏰ Finished at: {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    
    # Display Azure AI Foundry Studio URL if available
    if 'studio_url' in evaluation_result:
        print(f"🔗 Azure AI Foundry Studio URL: {evaluation_result['studio_url']}")
        print("   You can view detailed results and visualizations in the Azure AI Foundry portal.")
    
except Exception as e:
    print(f"❌ Evaluation failed: {str(e)}")
    print("\n🔧 Troubleshooting tips:")
    print("   • Ensure Azure CLI authentication is working: 'az login'")
    print("   • Check that all required environment variables are set")
    print("   • Verify Azure OpenAI API quotas and limits")
    print("   • Check network connectivity to Azure services")
    raise

## Process and Display Evaluation Results

Extract and format evaluation metrics, display summary statistics, and provide detailed analysis of agent performance across different query types.
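
A breakdown by query type needs per-row scores rather than the aggregate metrics. The sketch below groups rows into Azure AI and general queries and averages one score per group. The row keys `inputs.query` and `outputs.relevance.relevance` are assumptions about how per-row results are named and should be checked against the actual output; the sample rows are made up.

```python
import re
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Mapping

# 'ai' must be a whole word so that "explain" or "rain" do not count as Azure AI topics
AZURE_PATTERN = re.compile(r"azure|openai|cognitive|machine learning|\bai\b", re.IGNORECASE)


def query_type(query: str) -> str:
    return "azure_ai" if AZURE_PATTERN.search(query) else "general"


def mean_by_type(rows: Iterable[Mapping[str, object]], query_key: str, score_key: str) -> Dict[str, float]:
    """Average one score column separately for Azure AI and general queries."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for row in rows:
        score = row.get(score_key)
        # Skip rows whose score is missing or not numeric
        if isinstance(score, (int, float)) and not isinstance(score, bool):
            grouped[query_type(str(row.get(query_key, "")))].append(float(score))
    return {name: mean(scores) for name, scores in grouped.items()}


# Made-up rows using the assumed key names
rows = [
    {"inputs.query": "What is Azure OpenAI?", "outputs.relevance.relevance": 5},
    {"inputs.query": "How do I train a model in Azure?", "outputs.relevance.relevance": 4},
    {"inputs.query": "Explain how rain forms", "outputs.relevance.relevance": 2},
    {"inputs.query": "Is AI safe?", "outputs.relevance.relevance": None},
]
breakdown = mean_by_type(rows, "inputs.query", "outputs.relevance.relevance")
print(breakdown)
```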

In [None]:
# Extract and display evaluation metrics
print("📊 EVALUATION RESULTS SUMMARY")
print("=" * 50)

try:
    # Get the overall metrics
    metrics = evaluation_result.get("metrics", {})
    
    if not metrics:
        print("⚠️  No metrics found in evaluation results")
        print("   This might indicate an issue with the evaluation process")
    else:
        print("\n🎯 Overall Performance Metrics:")
        for metric_name, metric_value in metrics.items():
            # Format the metric value based on its type
            if isinstance(metric_value, (int, float)):
                print(f"   • {metric_name.title()}: {metric_value:.4f}")
            else:
                print(f"   • {metric_name.title()}: {metric_value}")
        
        # Calculate and display average score
        numeric_metrics = {k: v for k, v in metrics.items() if isinstance(v, (int, float))}
        if numeric_metrics:
            avg_score = sum(numeric_metrics.values()) / len(numeric_metrics)
            print(f"\n📈 Average Score Across All Metrics: {avg_score:.4f}")
            
            # Provide interpretation
            print(f"\n📝 Performance Interpretation:")
            if avg_score >= 4.0:
                print("   🟢 Excellent: The agent demonstrates high-quality responses")
            elif avg_score >= 3.0:
                print("   🟡 Good: The agent performs well with room for improvement")
            elif avg_score >= 2.0:
                print("   🟠 Fair: The agent shows moderate performance, needs improvement")
            else:
                print("   🔴 Poor: The agent requires significant improvements")
        
        # Analyze individual metrics
        print(f"\n🔍 Detailed Metric Analysis:")
        for metric_name, metric_value in numeric_metrics.items():
            if isinstance(metric_value, (int, float)):
                if metric_value >= 4.0:
                    status = "🟢 Excellent"
                elif metric_value >= 3.0:
                    status = "🟡 Good"
                elif metric_value >= 2.0:
                    status = "🟠 Fair"
                else:
                    status = "🔴 Poor"
                print(f"   • {metric_name.title()}: {metric_value:.4f} - {status}")
    
    # Display per-query result information (the SDK returns per-query results under "rows")
    rows = evaluation_result.get("rows") or []
    if rows:
        print(f"\n📋 Detailed Results Available:")
        print(f"   • Number of evaluated samples: {len(rows)}")
    
    # Display information about the evaluation dataset
    print(f"\n📁 Evaluation Data Info:")
    print(f"   • Data source: {evaluation_queries_file}")
    print(f"   • Total queries processed: {len(queries)}")
    print(f"   • Azure AI queries: {len(azure_ai_queries)}")
    print(f"   • General queries: {len(general_queries)}")

except Exception as e:
    print(f"❌ Error processing evaluation results: {str(e)}")
    print("   Raw evaluation result keys:", list(evaluation_result.keys()) if evaluation_result else "None")
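
The Excellent/Good/Fair/Poor thresholds appear in several places in this notebook. One way to keep them consistent is a single helper, sketched here under the assumption that scores sit on a 1 to 5 scale. `rate`, `summarize`, and the sample metric names are hypothetical.

```python
import math
from typing import Dict, Mapping, Optional

# Lower bound of each band, checked from highest to lowest
BANDS = [(4.0, "Excellent"), (3.0, "Good"), (2.0, "Fair")]


def rate(score: float) -> str:
    """Map a score to its performance band."""
    for threshold, label in BANDS:
        if score >= threshold:
            return label
    return "Poor"


def summarize(metrics: Mapping[str, object]) -> Dict[str, object]:
    """Rate every numeric metric and the average of all of them."""
    numeric = {
        name: float(value)
        for name, value in metrics.items()
        # Exclude booleans and NaN, which would distort the average
        if isinstance(value, (int, float)) and not isinstance(value, bool) and not math.isnan(value)
    }
    average: Optional[float] = sum(numeric.values()) / len(numeric) if numeric else None
    return {
        "ratings": {name: rate(value) for name, value in numeric.items()},
        "average": average,
        "overall": rate(average) if average is not None else None,
    }


sample = {"groundedness": 4.5, "relevance": 3.0, "coherence": 1.5, "run_id": "demo"}
summary = summarize(sample)
print(summary)
```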

## Export Results for Analysis

Save evaluation results to files, generate reports, and provide information for further analysis and visualization.
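
Results often contain values that `json` cannot write directly, such as timestamps, paths, or `NaN` scores. The sketch below shows one way to handle both: `default=str` converts unknown objects to strings, and non-finite floats are replaced with `null` so the file stays valid JSON. The helper names and the sample payload are hypothetical.

```python
import datetime
import json
import math
from pathlib import Path


def replace_non_finite(value):
    """Recursively replace NaN and infinite floats with None."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {key: replace_non_finite(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [replace_non_finite(item) for item in value]
    return value


def to_json(payload) -> str:
    # default=str turns datetime, Path, and other unknown objects into strings;
    # allow_nan=False makes json raise if a non-finite float slipped through
    return json.dumps(replace_non_finite(payload), indent=2, default=str, allow_nan=False)


sample = {
    "evaluation_name": "demo",
    "finished": datetime.datetime(2025, 1, 1, 12, 0, 0),
    "results_dir": Path("evaluation_results"),
    "metrics": {"relevance": 4.25, "coherence": float("nan")},
}
restored = json.loads(to_json(sample))
print(restored)
```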

In [None]:
# Export evaluation results to files
results_dir = Path("evaluation_results")
results_dir.mkdir(exist_ok=True)

timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
results_file = results_dir / f"evaluation_results_{timestamp}.json"
summary_file = results_dir / f"evaluation_summary_{timestamp}.txt"

try:
    # Save detailed results as JSON
    print(f"💾 Saving detailed results to: {results_file}")
    with open(results_file, 'w', encoding='utf-8') as f:
        json.dump({
            "evaluation_name": evaluation_name,
            "timestamp": timestamp,
            "configuration": {
                "model_deployment": AZURE_OPENAI_MODEL_DEPLOYMENT,
                "evaluation_model": EVALUATION_MODEL,
                "total_queries": len(queries),
                "azure_ai_queries": len(azure_ai_queries),
                "general_queries": len(general_queries)
            },
            "metrics": metrics,
            "evaluation_result": evaluation_result
        }, f, indent=2, default=str)
    
    # Generate summary report
    print(f"📄 Generating summary report: {summary_file}")
    with open(summary_file, 'w', encoding='utf-8') as f:
        f.write("QUESTIONNAIRE AGENT EVALUATION SUMMARY\n")
        f.write("=" * 50 + "\n\n")
        f.write(f"Evaluation Name: {evaluation_name}\n")
        f.write(f"Timestamp: {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
        f.write(f"Total Queries Evaluated: {len(queries)}\n")
        f.write(f"Azure AI Related Queries: {len(azure_ai_queries)}\n")
        f.write(f"General/Off-topic Queries: {len(general_queries)}\n\n")
        
        f.write("PERFORMANCE METRICS:\n")
        f.write("-" * 20 + "\n")
        for metric_name, metric_value in metrics.items():
            if isinstance(metric_value, (int, float)):
                f.write(f"{metric_name.title()}: {metric_value:.4f}\n")
            else:
                f.write(f"{metric_name.title()}: {metric_value}\n")
        
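        # Recompute here so this cell does not depend on a variable that the
        # previous cell only defines when metrics are present
        numeric_metrics = {k: v for k, v in metrics.items() if isinstance(v, (int, float))}
        if numeric_metrics: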
            avg_score = sum(numeric_metrics.values()) / len(numeric_metrics)
            f.write(f"\nAverage Score: {avg_score:.4f}\n")
            
            f.write(f"\nPerformance Level: ")
            if avg_score >= 4.0:
                f.write("Excellent\n")
            elif avg_score >= 3.0:
                f.write("Good\n")
            elif avg_score >= 2.0:
                f.write("Fair\n")
            else:
                f.write("Poor\n")
        
        if 'studio_url' in evaluation_result:
            f.write(f"\nAzure AI Foundry Studio URL:\n{evaluation_result['studio_url']}\n")
    
    print(f"✅ Results exported successfully!")
    print(f"📁 Results directory: {results_dir.absolute()}")
    print(f"   • Detailed JSON: {results_file.name}")
    print(f"   • Summary report: {summary_file.name}")
    
except Exception as e:
    print(f"❌ Error exporting results: {str(e)}")
    print("   Results are still available in the evaluation_result variable")

## Summary and Next Steps

This notebook assesses the Questionnaire Agent's responses along four quality dimensions and exports the results for further analysis.

### What This Evaluation Measures:

1. **Groundedness**: Whether the claims in the agent's responses are supported by the context supplied with each query
2. **Relevance**: How relevant and on-topic the responses are to the input queries
3. **Coherence**: How logically structured and internally consistent the responses are
4. **Fluency**: How well-written, readable, and linguistically sound the responses are

### Evaluation Dataset:

- **Azure AI Queries**: Tests the agent's knowledge and accuracy on its primary domain
- **General Queries**: Tests the agent's ability to stay on topic and handle off-domain questions appropriately

### Key Benefits:

- **Objective Assessment**: Uses AI-assisted evaluation for consistent, scalable assessment
- **Multi-dimensional Analysis**: Evaluates different aspects of response quality
- **Azure Integration**: When the run is logged to an Azure AI project, results can be explored in the Azure AI Foundry portal
- **Reproducible**: Can be run repeatedly to track improvements over time
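
Tracking improvement over time amounts to comparing the `metrics` section of two exported result files. This is a minimal sketch, assuming both files have a top-level `metrics` object as written by the export cell; the file names and metric values are made up.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def load_metrics(path: Path) -> Dict[str, float]:
    """Read the numeric entries of the 'metrics' object in a results file."""
    data = json.loads(path.read_text(encoding="utf-8"))
    return {
        name: float(value)
        for name, value in data.get("metrics", {}).items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }


def metric_deltas(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, float]:
    """Change per metric (after minus before) for metrics present in both runs."""
    return {name: round(after[name] - before[name], 4) for name in sorted(before.keys() & after.keys())}


with tempfile.TemporaryDirectory() as tmp:
    # Made-up results from two runs
    old_file = Path(tmp) / "evaluation_results_old.json"
    new_file = Path(tmp) / "evaluation_results_new.json"
    old_file.write_text(json.dumps({"metrics": {"relevance": 3.5, "fluency": 4.0}}), encoding="utf-8")
    new_file.write_text(
        json.dumps({"metrics": {"relevance": 4.0, "fluency": 3.75, "coherence": 4.2}}), encoding="utf-8"
    )
    deltas = metric_deltas(load_metrics(old_file), load_metrics(new_file))

print(deltas)
```

Metrics present in only one run are left out, so adding an evaluator between runs does not produce misleading deltas.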

### Next Steps:

1. **Review Detailed Results**: Open the `studio_url` printed after the run, when one is returned, to explore individual query results in the Azure AI Foundry portal
2. **Identify Improvement Areas**: Focus on metrics with lower scores
3. **Iterative Development**: Re-run evaluation after making improvements to track progress
4. **Expand Evaluation**: Consider adding more specific test cases or custom evaluators
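
As an illustration of the last step, a custom code-based evaluator can be a plain callable that takes keyword arguments and returns a dictionary of scores. The class below is a hypothetical sketch that checks answers against the 2000-character limit used in this notebook; how it would be registered alongside the built-in evaluators is an assumption to confirm against the SDK documentation.

```python
from typing import Dict


class AnswerLengthEvaluator:
    """Hypothetical evaluator: reports answer length and whether it fits a limit."""

    def __init__(self, max_chars: int = 2000) -> None:
        self.max_chars = max_chars

    def __call__(self, *, response: str, **kwargs) -> Dict[str, int]:
        length = len(response)
        return {
            "answer_length": length,
            # 1 when the answer fits the limit, 0 otherwise, so the value can be averaged
            "within_limit": int(length <= self.max_chars),
        }


evaluator = AnswerLengthEvaluator(max_chars=10)
short = evaluator(response="Hello")
long = evaluator(response="This answer is too long", query="ignored extra field")
print(short, long)
```

Accepting `**kwargs` lets the evaluator ignore fields it does not use, such as `query` or `context`.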

### Files Generated:

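- `evaluation_results/evaluation_results_<timestamp>.json`: detailed results for programmatic analysis
- `evaluation_results/evaluation_summary_<timestamp>.txt`: human-readable summary report, including the Azure AI Foundry URL when one is returned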