# üçè Health Assistant Evaluation Demo üçé

This notebook demonstrates how to use Azure AI Foundry's evaluation capabilities to assess the quality and safety of AI-generated health and fitness responses.

## üîê Authentication Setup

Before running the next cell, make sure you're authenticated with Azure CLI. Run this command in your terminal:

```bash
az login --use-device-code
```

This will provide you with a device code and URL to authenticate in your browser, which is useful for:
- Remote development environments
- Systems without a default browser
- Corporate environments with strict security policies

After successful authentication, you can proceed with the notebook cells below.

## üìä Available Evaluators in Azure AI Foundry

Azure AI Foundry provides a comprehensive set of built-in evaluators for different aspects of AI model quality:

### **AI Quality (AI Assisted)**
- **Groundedness** - Measures how well responses are grounded in provided context
- **Relevance** - Evaluates how relevant responses are to the input query  
- **Coherence** - Assesses logical flow and consistency in responses
- **Fluency** - Measures language quality and readability
- **GPT Similarity** - Compares responses to reference answers

### **AI Quality (NLP Metrics)**
- **F1 Score** - Measures precision and recall balance
- **ROUGE Score** - Evaluates text summarization quality
- **BLEU Score** - Measures translation and generation quality
- **GLEU Score** - Google's BLEU variant for better correlation
- **METEOR Score** - Considers synonyms and stemming

### **Risk and Safety**
- **Violence** - Detects violent content
- **Sexual** - Identifies sexual content
- **Self-harm** - Detects self-harm related content
- **Hate/Unfairness** - Identifies hateful or unfair content
- **Protected Material** - Detects copyrighted content
- **Indirect Attack** - Identifies indirect prompt injection attempts

üìö **For complete details on all available evaluators, their parameters, and usage examples, visit:**  
**[Azure AI Foundry Evaluators Documentation](https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/observability)**

---

# üèãÔ∏è‚Äç‚ôÄÔ∏è Azure AI Foundry Evaluations üèãÔ∏è‚Äç‚ôÇÔ∏è

This notebook demonstrates how to evaluate AI models using Azure AI Foundry with both **local** and **cloud** evaluations.

## What This Notebook Does:
1. **Setup & Data Creation** - Creates synthetic health & fitness Q&A data
2. **Local Evaluation** - Runs F1Score and Relevance evaluators locally  
3. **Cloud Evaluation** - Uploads results to Azure AI Foundry project

## Key Features:
‚úÖ **Local Evaluations** - F1Score and AI-assisted Relevance evaluators
‚úÖ **Cloud Integration** - Upload results to Azure AI Foundry
‚úÖ **Browser Authentication** - Uses InteractiveBrowserCredential  
‚úÖ **Error Handling** - Robust fallbacks and clear status reporting

In [1]:
# Setup and Data Creation
import json
import os
import time
from pathlib import Path
from dotenv import load_dotenv

# Load environment variables from root directory
root_env_path = os.environ.get("ROOT_ENV_PATH", '../../../.env')
load_dotenv(root_env_path)
print(f"‚úÖ Environment variables loaded from: {root_env_path}")

# Check required environment variables for Azure AI Foundry
AI_FOUNDRY_PROJECT_ENDPOINT = os.environ.get("AI_FOUNDRY_PROJECT_ENDPOINT")
TENANT_ID = os.environ.get("TENANT_ID")

print("üîç Environment Variables Status:")
print(
    f"   AI_FOUNDRY_PROJECT_ENDPOINT: {'‚úÖ Set' if AI_FOUNDRY_PROJECT_ENDPOINT else '‚ùå Missing'}"
)
print(f"   TENANT_ID: {'‚úÖ Set' if TENANT_ID else '‚ùå Missing'}")

if not AI_FOUNDRY_PROJECT_ENDPOINT:
    print("\n‚ö†Ô∏è Required environment variables missing!")
    print("Please add these to your .env file:")
    print("AI_FOUNDRY_PROJECT_ENDPOINT=<your-azure-ai-project-endpoint>")
    print("TENANT_ID=<your-azure-tenant-id>")
else:
    print(f"\n‚úÖ All environment variables configured correctly!")
    print(f"üîß Loaded values:")
    print(f"   AI_FOUNDRY_PROJECT_ENDPOINT: {AI_FOUNDRY_PROJECT_ENDPOINT}")
    print(f"   TENANT_ID: {TENANT_ID}")

# Create synthetic health & fitness evaluation data
synthetic_eval_data = [
    {
        "query": "How can I start a beginner workout routine at home?",
        "context": "Workout routines can include push-ups, bodyweight squats, lunges, and planks.",
        "response": "You can just go for 10 push-ups total.",
        "ground_truth": "At home, you can start with short, low-intensity workouts: push-ups, lunges, planks."
    },
    {
        "query": "Are diet sodas healthy for daily consumption?",
        "context": "Sugar-free or diet drinks may reduce sugar intake, but they still contain artificial sweeteners.",
        "response": "Yes, diet sodas are 100% healthy.",
        "ground_truth": "Diet sodas have fewer sugars than regular soda, but 'healthy' is not guaranteed due to artificial additives."
    },
    {
        "query": "What's the capital of France?",
        "context": "France is in Europe. Paris is the capital.",
        "response": "London.",
        "ground_truth": "Paris."
    }
]

# Write data to JSONL file
eval_data_filename = os.environ.get("EVAL_DATA_FILENAME", "health_fitness_eval_data.jsonl")
eval_data_path = Path(f"./{eval_data_filename}")
with eval_data_path.open("w", encoding="utf-8") as f:
    for row in synthetic_eval_data:
        f.write(json.dumps(row) + "\n")

print(f"‚úÖ Evaluation data created: {eval_data_path.resolve()}")
print(f"üìä Total samples: {len(synthetic_eval_data)}")

‚úÖ Environment variables loaded from: ../../../.env
üîç Environment Variables Status:
   AI_FOUNDRY_PROJECT_ENDPOINT: ‚úÖ Set
   TENANT_ID: ‚úÖ Set

‚úÖ All environment variables configured correctly!
üîß Loaded values:
   AI_FOUNDRY_PROJECT_ENDPOINT: https://demopocaifoundry.services.ai.azure.com/api/projects/demoproject
   TENANT_ID: 16b3c013-d300-468d-ac64-7eda0820b6d3
‚úÖ Evaluation data created: C:\src\ai-foundry-e2e-lab\observability-and-evaluations\health_fitness_eval_data.jsonl
üìä Total samples: 3


## üîç Local Evaluation

Run evaluations locally using F1Score (basic text similarity) and Relevance (AI-assisted) evaluators.

In [None]:
# Local Evaluation with Azure AI Foundry
from azure.ai.evaluation import evaluate, F1ScoreEvaluator, RelevanceEvaluator
import logging

# Reduce logging noise
logging.getLogger('promptflow').setLevel(logging.ERROR)
logging.getLogger('azure.ai.evaluation').setLevel(logging.WARNING)

print("üîç Running Local Evaluation...")

# Configure evaluators
evaluators = {
    "f1_score": F1ScoreEvaluator()
}

evaluator_config = {
    "f1_score": {
        "column_mapping": {
            "response": "${data.response}",
            "ground_truth": "${data.ground_truth}"
        }
    }
}

# Add AI-assisted evaluator if Azure OpenAI is configured
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT", ""),
    "api_key": os.environ.get("AZURE_OPENAI_API_KEY", ""),
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT", os.environ.get("MODEL_DEPLOYMENT_NAME", "gpt-4")),
    "api_version": os.environ.get("AOAI_API_VERSION", os.environ.get("API_VERSION", "2024-02-15-preview")),
}

if model_config["azure_endpoint"] and model_config["api_key"]:
    print("ü§ñ Adding AI-assisted Relevance evaluator...")
    evaluators["relevance"] = RelevanceEvaluator(model_config=model_config)
    evaluator_config["relevance"] = {
        "column_mapping": {
            "query": "${data.query}",
            "response": "${data.response}"
        }
    }
else:
    print("‚ö†Ô∏è Azure OpenAI not configured - using F1Score only")

# Run local evaluation
try:
    local_result = evaluate(
        data=str(eval_data_path),
        evaluators=evaluators,
        evaluator_config=evaluator_config
    )
    
    print("‚úÖ Local evaluation completed!")
    
    # Display results
    metrics = local_result['metrics']
    for metric_name, value in metrics.items():
        print(f"üìä {metric_name}: {value:.4f}")
        
        # Save results locally
        local_results_filename = os.environ.get("LOCAL_RESULTS_FILENAME", "local_evaluation_results.json")
        with open(local_results_filename, "w") as f:
            json.dump(local_result, f, indent=2)

        print(f"üíæ Results saved to: {local_results_filename}")
except Exception as e:
    print(f"‚ùå Local evaluation failed: {e}")
    local_result = None

üîç Running Local Evaluation...
ü§ñ Adding AI-assisted Relevance evaluator...
2026-01-02 10:32:48 -0600   22648 execution.bulk     INFO     Finished 3 / 3 lines.
2026-01-02 10:32:48 -0600   22648 execution.bulk     INFO     Average execution time for completed lines: 0.0 seconds. Estimated time for incomplete lines: 0.0 seconds.


Aggregated metrics for evaluator is not a dictionary will not be logged as metrics



Run name: "f1_score_20260102_163248_643561"
Run status: "Completed"
Start time: "2026-01-02 16:32:48.643561+00:00"
Duration: "0:00:01.013010"

2026-01-02 10:32:54 -0600   45700 execution.bulk     INFO     Finished 1 / 3 lines.
2026-01-02 10:32:54 -0600   45700 execution.bulk     INFO     Average execution time for completed lines: 5.46 seconds. Estimated time for incomplete lines: 10.92 seconds.
2026-01-02 10:32:54 -0600   45700 execution.bulk     INFO     Finished 2 / 3 lines.
2026-01-02 10:32:54 -0600   45700 execution.bulk     INFO     Average execution time for completed lines: 2.77 seconds. Estimated time for incomplete lines: 2.77 seconds.
2026-01-02 10:32:54 -0600   45700 execution.bulk     INFO     Finished 3 / 3 lines.
2026-01-02 10:32:54 -0600   45700 execution.bulk     INFO     Average execution time for completed lines: 1.92 seconds. Estimated time for incomplete lines: 0.0 seconds.


Aggregated metrics for evaluator is not a dictionary will not be logged as metrics



Run name: "relevance_20260102_163248_636419"
Run status: "Completed"
Start time: "2026-01-02 16:32:48.636419+00:00"
Duration: "0:00:05.822927"


{
    "f1_score": {
        "status": "Completed",
        "duration": "0:00:01.013010",
        "completed_lines": 3,
        "failed_lines": 0,
        "log_path": null
    },
    "relevance": {
        "status": "Completed",
        "duration": "0:00:05.822927",
        "completed_lines": 3,
        "failed_lines": 0,
        "log_path": null
    }
}


‚úÖ Local evaluation completed!
üìä f1_score.f1_score: 0.1870
üíæ Results saved to: local_evaluation_results.json
üìä f1_score.f1_threshold: 0.5000
üíæ Results saved to: local_evaluation_results.json
üìä relevance.relevance: 1.6667
üíæ Results saved to: local_evaluation_results.json
üìä relevance.gpt_relevance: 1.6667
üíæ Results saved to: local_evaluation_results.json
üìä relevance.relevance_threshold: 3.0000
üíæ Results saved to: local_evaluation_results.json
üìä f1_score.binary

‚òÅÔ∏è Setting up Cloud Evaluation with Azure AI Foundry...
üè¢ Foundry Project Endpoint: https://demopocaifoundry.services.ai.azure.com/api/projects/demoproject
üîë Tenant ID: 16b3c013-d300-468d-ac64-7eda0820b6d3
üîê Setting up authentication...
‚úÖ AIProjectClient created successfully!
üì§ Uploading evaluation data to Azure AI Foundry...
‚úÖ Data uploaded successfully! Dataset ID: azureai://accounts/demopocaifoundry/projects/demoproject/data/health-fitness-dataset-1767374513/versions/1.0
‚öôÔ∏è Configuring evaluators for cloud evaluation...
üöÄ Creating and submitting cloud evaluation...
üéâ CLOUD EVALUATION SUBMITTED!
   üìã Name: b203928d-ee9e-43c1-bd56-8fdf35a39b6f
   üìã Status: NotStarted
   üìã Response Type: <class 'azure.ai.projects.models._models.Evaluation'>
   üìã ID: b203928d-ee9e-43c1-bd56-8fdf35a39b6f

üîó View detailed results at: https://ai.azure.com/
   Navigate to your project ‚Üí Evaluation ‚Üí View evaluation runs
üíæ Results saved to: cloud_evaluation

## ‚òÅÔ∏è Cloud Evaluation

Upload evaluation results to Azure AI Foundry project for tracking and collaboration.

In [3]:
# Cloud Evaluation - Following Official Microsoft Documentation
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import (
    EvaluatorConfiguration,
    EvaluatorIds,
    Evaluation,
    InputDataset
)
import os
import json
import time

print("‚òÅÔ∏è Setting up Cloud Evaluation with Azure AI Foundry...")

# Configuration from environment variables
AI_FOUNDRY_PROJECT_ENDPOINT = os.environ.get("AI_FOUNDRY_PROJECT_ENDPOINT")
TENANT_ID = os.environ.get("TENANT_ID")

print(f"üè¢ Foundry Project Endpoint: {AI_FOUNDRY_PROJECT_ENDPOINT}")
print(f"üîë Tenant ID: {TENANT_ID}")

if not AI_FOUNDRY_PROJECT_ENDPOINT:
    print("‚ö†Ô∏è Missing AI_FOUNDRY_PROJECT_ENDPOINT in .env file")
    cloud_result = None
else:
    try:
        # Step 1: Create project client using DefaultAzureCredential (as per Microsoft docs)
        print("üîê Setting up authentication...")
        project_client = AIProjectClient(
            endpoint=AI_FOUNDRY_PROJECT_ENDPOINT,
            credential=DefaultAzureCredential(),
        )
        print("‚úÖ AIProjectClient created successfully!")

        # Step 2: Upload evaluation data to Azure AI Foundry (required for cloud evaluation)
        print("üì§ Uploading evaluation data to Azure AI Foundry...")
        dataset_name = os.environ.get("DATASET_NAME", f"health-fitness-dataset-{int(time.time())}")
        dataset_version = os.environ.get("DATASET_VERSION", "1.0")
        try:
            data_upload = project_client.datasets.upload_file(
                name=dataset_name,
                version=dataset_version,
                file_path=str(eval_data_path),
            )
            data_id = data_upload.id
            print(f"‚úÖ Data uploaded successfully! Dataset ID: {data_id}")
        except Exception as upload_error:
            print(f"‚ùå Data upload failed: {upload_error}")
            raise upload_error

        # Step 3: Configure evaluators using Azure AI Foundry built-in evaluators
        print("‚öôÔ∏è Configuring evaluators for cloud evaluation...")

        evaluators = {
            "bleu_score": EvaluatorConfiguration(
                id=EvaluatorIds.BLEU_SCORE.value,
                data_mapping={
                    "response": "${data.response}",
                    "ground_truth": "${data.ground_truth}",
                },
            ),
        }

        # Step 4: Create and submit evaluation
        print("üöÄ Creating and submitting cloud evaluation...")
        evaluation_name = os.environ.get("EVALUATION_NAME", f"health-fitness-eval-{int(time.time())}")
        evaluation = Evaluation(
            display_name=evaluation_name,
            description="Health and fitness AI response evaluation",
            data=InputDataset(id=data_id),
            evaluators=evaluators,
        )

        # Submit the evaluation
        evaluation_response = project_client.evaluations.create(evaluation)

        print("üéâ CLOUD EVALUATION SUBMITTED!")
        print(f"   üìã Name: {evaluation_response.name}")
        print(f"   üìã Status: {evaluation_response.status}")
        print(f"   üìã Response Type: {type(evaluation_response)}")

        # Get evaluation ID - handle different possible attribute names
        evaluation_id = None
        if hasattr(evaluation_response, 'id'):
            evaluation_id = evaluation_response.id
        elif hasattr(evaluation_response, 'name'):
            evaluation_id = evaluation_response.name  # Use name as ID if no separate ID exists

        if evaluation_id:
            print(f"   üìã ID: {evaluation_id}")

        print(f"\nüîó View detailed results at: https://ai.azure.com/")
        print("   Navigate to your project ‚Üí Evaluation ‚Üí View evaluation runs")

        # Save results
        cloud_result = {
            "evaluation_name": evaluation_response.name,
            "status": evaluation_response.status,
            "project_endpoint": AI_FOUNDRY_PROJECT_ENDPOINT,
            "dataset_id": data_id,
            "timestamp": int(time.time()),
        }

        # Add evaluation ID if available
        if evaluation_id:
            cloud_result["evaluation_id"] = evaluation_id

        with open(os.environ.get("CLOUD_RESULTS_FILENAME", "cloud_evaluation_results.json"), "w") as f:
            json.dump(cloud_result, f, indent=2, default=str)
        print(f"üíæ Results saved to: {os.environ.get('CLOUD_RESULTS_FILENAME', 'cloud_evaluation_results.json')}")

        print("\n‚úÖ SUCCESS: Cloud evaluation submitted to Azure AI Foundry!")
        print("   The evaluation will run in the cloud and results will be available in the Azure AI Foundry portal.")

    except Exception as e:
        print(f"‚ùå Cloud evaluation failed: {e}")
        print(f"üìã Error type: {type(e).__name__}")

        # Enhanced error handling
        error_str = str(e).lower()
        if "401" in error_str or "unauthorized" in error_str:
            print("\nüîê AUTHENTICATION ISSUE:")
            print("   - Make sure you're logged in with: az login")
            print("   - Ensure you have access to the Azure AI Foundry project")
        elif "403" in error_str or "forbidden" in error_str:
            print("\nüö´ PERMISSION ISSUE:")
            print("   - Verify you have 'AI Developer' or 'Contributor' role")
            print("   - Check Azure AI Foundry project permissions")
        elif "404" in error_str or "not found" in error_str:
            print("\nüîç RESOURCE NOT FOUND:")
            print("   - Verify AI_FOUNDRY_PROJECT_ENDPOINT is correct")
            print("   - Check if project exists in Azure AI Foundry")
        elif "storage" in error_str or "blob" in error_str:
            print("\nüíæ STORAGE ISSUE:")
            print("   - Ensure your Azure AI Foundry project has a connected storage account")
            print("   - Check storage account permissions for the project")
        else:
            print(f"\nüí° TROUBLESHOOTING:")
            print(f"   - Full error: {str(e)[:300]}...")
            print("   - Try running local evaluation first")
            print("   - Check Azure AI Foundry project configuration")

        cloud_result = None

## üßπ Cleanup Evaluation Files

Remove all generated evaluation data and results files to keep the workspace clean.

In [4]:
# Cleanup evaluation files
import os
from pathlib import Path

print("üßπ Cleaning up evaluation files...")

files_to_cleanup = [
    os.environ.get("EVAL_DATA_FILENAME", "health_fitness_eval_data.jsonl"),
    os.environ.get("LOCAL_RESULTS_FILENAME", "local_evaluation_results.json"),
    os.environ.get("CLOUD_RESULTS_FILENAME", "cloud_evaluation_results.json")
]

deleted_count = 0
for file_path in files_to_cleanup:
    try:
        if os.path.exists(file_path):
            os.remove(file_path)
            print(f"‚úÖ Deleted: {file_path}")
            deleted_count += 1
        else:
            print(f"‚ö†Ô∏è  File not found (already deleted?): {file_path}")
    except Exception as e:
        print(f"‚ùå Error deleting {file_path}: {e}")

if deleted_count > 0:
    print(f"\nüéâ Cleanup completed! Deleted {deleted_count} file(s).")
else:
    print("\n‚úÖ No files to cleanup (all already deleted).")

## ‚òÅÔ∏è Cleanup Cloud Resources (Azure AI Foundry)

Remove uploaded datasets and evaluation results from Azure AI Foundry to keep your cloud workspace clean.

In [None]:
# Cleanup cloud resources from Azure AI Foundry
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
import json

print("‚òÅÔ∏è Cleaning up cloud resources from Azure AI Foundry...")

try:
    # Load saved cloud results to get dataset ID
    cloud_results_file = os.environ.get("CLOUD_RESULTS_FILENAME", "cloud_evaluation_results.json")
    
    if os.path.exists(cloud_results_file):
        with open(cloud_results_file, "r") as f:
            cloud_result = json.load(f)
            dataset_id = cloud_result.get("dataset_id")
            dataset_name = cloud_result.get("evaluation_name", "").replace("eval", "dataset")
            print(f"üìã Found dataset ID from previous run: {dataset_id}")
    else:
        print("‚ö†Ô∏è  No cloud results file found. You can manually specify the dataset name:")
        dataset_name = input("Enter dataset name (or press Enter to skip): ").strip()
        dataset_id = None
    
    # Connect to Azure AI Foundry
    AI_FOUNDRY_PROJECT_ENDPOINT = os.environ.get("AI_FOUNDRY_PROJECT_ENDPOINT")
    
    if AI_FOUNDRY_PROJECT_ENDPOINT:
        project_client = AIProjectClient(
            endpoint=AI_FOUNDRY_PROJECT_ENDPOINT,
            credential=DefaultAzureCredential(),
        )
        print("‚úÖ Connected to Azure AI Foundry")
        
        # List and delete datasets
        print("\nüìä Listing datasets...")
        try:
            datasets = project_client.datasets.list()
            dataset_list = list(datasets)
            
            if dataset_list:
                print(f"Found {len(dataset_list)} dataset(s):")
                for idx, ds in enumerate(dataset_list, 1):
                    print(f"  {idx}. Name: {ds.name}, ID: {ds.id}, Version: {ds.version if hasattr(ds, 'version') else 'N/A'}")
                
                # Option to delete specific dataset or all
                delete_choice = input("\nEnter dataset number to delete (or 'all' to delete all, or 'skip' to skip): ").strip().lower()
                
                if delete_choice == "all":
                    for ds in dataset_list:
                        try:
                            project_client.datasets.delete(name=ds.name, version=ds.version if hasattr(ds, 'version') else None)
                            print(f"‚úÖ Deleted dataset: {ds.name}")
                        except Exception as e:
                            print(f"‚ùå Failed to delete {ds.name}: {e}")
                elif delete_choice.isdigit() and 1 <= int(delete_choice) <= len(dataset_list):
                    ds = dataset_list[int(delete_choice) - 1]
                    try:
                        project_client.datasets.delete(name=ds.name, version=ds.version if hasattr(ds, 'version') else None)
                        print(f"‚úÖ Deleted dataset: {ds.name}")
                    except Exception as e:
                        print(f"‚ùå Failed to delete dataset: {e}")
                else:
                    print("‚è≠Ô∏è  Skipped dataset deletion")
            else:
                print("‚úÖ No datasets found in Azure AI Foundry")
                
        except Exception as e:
            print(f"‚ùå Error listing/deleting datasets: {e}")
            print("üí° You may need to delete datasets manually in Azure AI Foundry portal:")
            print(f"   https://ai.azure.com/ ‚Üí Your Project ‚Üí Data ‚Üí Datasets")
        
        print("\nüéâ Cloud cleanup completed!")
        
    else:
        print("‚ùå AI_FOUNDRY_PROJECT_ENDPOINT not set - cannot connect to Azure AI Foundry")
        
except Exception as e:
    print(f"‚ùå Cloud cleanup failed: {e}")
    print("\nüí° Manual cleanup options:")
    print("1. Go to: https://ai.azure.com/")
    print("2. Navigate to your project ‚Üí Data ‚Üí Datasets")
    print("3. Select and delete the evaluation datasets")
    print("4. Navigate to Evaluation ‚Üí View runs to delete evaluation results")

: 