# AI Evaluation with Azure AI Foundry

This notebook demonstrates how to use the Azure AI Evaluation SDK (the `azure.ai.evaluation` package) to assess the quality and safety of AI responses. We'll explore three types of evaluators:

1. **NLP-based evaluators** (e.g., BLEU score) for measuring text similarity
2. **AI-assisted quality evaluators** (e.g., Relevance) that use language models to assess quality
3. **AI-assisted safety evaluators** (e.g., Violence detection) for content safety

Let's start by importing the necessary libraries and evaluators.

In [None]:
import os

from azure.ai.evaluation import evaluate, RelevanceEvaluator, ViolenceEvaluator, BleuScoreEvaluator

## 1. NLP-based Evaluation: BLEU Score

The BLEU (Bilingual Evaluation Understudy) score is a traditional NLP metric that measures how similar a generated response is to a reference text. It's based on n-gram matching and doesn't require a language model.

**Use case**: Measuring text similarity, especially useful for translation tasks or when you have ground truth reference answers.
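
To make the n-gram matching idea concrete, here is a minimal sketch of clipped n-gram precision, one ingredient of BLEU. It is not the SDK's implementation: the regex tokenizer and the helper names `ngrams` and `modified_precision` are assumptions for illustration, and full BLEU also combines several n-gram orders and applies a brevity penalty.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(response, reference, n):
    """Clipped n-gram precision: share of response n-grams also found in the reference."""
    # Simple word tokenizer (an assumption): lowercase, keep alphanumeric runs only.
    response_counts = Counter(ngrams(re.findall(r"\w+", response.lower()), n))
    reference_counts = Counter(ngrams(re.findall(r"\w+", reference.lower()), n))
    total = sum(response_counts.values())
    if total == 0:
        return 0.0
    # Clip each n-gram's count at the number of times it occurs in the reference.
    overlap = sum(min(count, reference_counts[gram]) for gram, count in response_counts.items())
    return overlap / total


response = "Tokyo is the capital of Japan."
reference = "The capital of Japan is Tokyo."
unigram_precision = modified_precision(response, reference, 1)
bigram_precision = modified_precision(response, reference, 2)
print(f"1-gram precision: {unigram_precision:.2f}")
print(f"2-gram precision: {bigram_precision:.2f}")
```

Both sentences use the same six words, so the unigram precision is perfect, while the bigram precision drops because the word order differs. This is why a BLEU score can be low for a response that a human would consider correct.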

In [None]:
# NLP bleu score evaluator
bleu_score_evaluator = BleuScoreEvaluator()
result = bleu_score_evaluator(
    response="Tokyo is the capital of Japan.",
    ground_truth="The capital of Japan is Tokyo."
)

print(f"BLEU Score Result: {result}")
print(f"Score: {result['bleu_score']}")

## 2. AI-Assisted Quality Evaluation: Relevance

AI-assisted evaluators use language models to assess the quality of responses. The Relevance evaluator determines how well the response answers the given query.

**Requirements**: 
- Azure OpenAI endpoint and API key
- Deployment name for your model
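
The requirements above can be checked before any model call is made. The following is a small sketch, assuming the settings live in the environment variables this notebook reads; the helper `missing_env_vars` is hypothetical and not part of the SDK.

```python
import os

# Variable names match the ones this notebook reads; the helper itself is hypothetical.
REQUIRED_VARS = ("AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY", "AZURE_OPENAI_DEPLOYMENT")


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing settings: {', '.join(missing)}")
else:
    print("All Azure OpenAI settings are present.")
```

Failing early with a list of missing names tends to be easier to debug than an authentication error raised from inside an evaluator call.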

First, let's set up the model configuration using environment variables.

In [None]:
# AI assisted quality evaluator
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
}

print("Model Configuration:")
print(f"Endpoint: {model_config['azure_endpoint']}")
print(f"Deployment: {model_config['azure_deployment']}")
print(f"API Key configured: {'Yes' if model_config['api_key'] else 'No'}")

Now let's create the Relevance evaluator and test it with a simple query-response pair:

In [None]:
relevance_evaluator = RelevanceEvaluator(model_config)
result = relevance_evaluator(
    query="What is the capital of Japan?",
    response="The capital of Japan is Tokyo."
)

print(f"Relevance Evaluation Result: {result}")
print(f"Relevance Score: {result['relevance']}")
print(f"Reasoning: {result.get('relevance_reason', 'Not provided')}")
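
A result dictionary like the one above can be turned into a simple pass/fail gate. This is a sketch under stated assumptions: the score is numeric with higher meaning more relevant, the cutoff of 3.0 is an illustrative choice, and `passes_threshold` is a hypothetical helper rather than an SDK function.

```python
def passes_threshold(result, metric="relevance", threshold=3.0):
    """Return True when the metric is present and at or above the cutoff."""
    score = result.get(metric)
    if score is None:
        return False
    return float(score) >= threshold


# Stand-in result dictionaries; real ones come from calling the evaluator.
example_results = [{"relevance": 4.0}, {"relevance": 2.0}, {}]
for example in example_results:
    print(example, "->", "pass" if passes_threshold(example) else "fail")
```

Treating a missing score as a failure is a conservative design choice: a response that could not be scored is not silently counted as acceptable.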

## 3. AI-Assisted Safety Evaluation: Violence Detection

Safety evaluators help detect potentially harmful content in AI responses. The Violence evaluator checks for violent content in the response.

**Configuration Options**: There are two ways to configure the Azure AI Project for safety evaluators:
1. Using project details (subscription ID, resource group, project name)
2. Using the project URL directly
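
The choice between the two options can also be made at runtime. The sketch below prefers a project URL when one is available and otherwise falls back to the three-field dictionary. The helper `resolve_project_config` and the variable name `AZURE_AI_PROJECT_ENDPOINT` are assumptions for illustration; the other variable names are the ones this notebook reads.

```python
import os


def resolve_project_config(env=None):
    """Return a project URL string if one is set, otherwise the three-field dictionary."""
    env = os.environ if env is None else env
    # AZURE_AI_PROJECT_ENDPOINT is a hypothetical variable name used only in this sketch.
    project_url = env.get("AZURE_AI_PROJECT_ENDPOINT")
    if project_url:
        return project_url
    return {
        "subscription_id": env.get("AZURE_SUBSCRIPTION_ID"),
        "resource_group_name": env.get("AZURE_RESOURCE_GROUP"),
        "project_name": env.get("AZURE_AI_PROJECT_NAME"),
    }


config = resolve_project_config()
print(f"Using {'project URL' if isinstance(config, str) else 'project details'}")
```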

Let's explore both approaches.

### Option 1: Using Azure AI Project Details

This approach requires you to specify the subscription ID, resource group name, and project name separately:

In [None]:
# Option 1: Azure AI project details read from environment variables
azure_ai_project = {
    "subscription_id": os.environ.get("AZURE_SUBSCRIPTION_ID"),
    "resource_group_name": os.environ.get("AZURE_RESOURCE_GROUP"),
    "project_name": os.environ.get("AZURE_AI_PROJECT_NAME"),
}
if not all(azure_ai_project.values()):
    print("One or more Azure AI Project details are missing from environment variables.")
else:
    print("All Azure AI Project details are set.")


In [None]:
print("Azure AI Project configuration:")
for key, value in azure_ai_project.items():
    print(f"  {key}: {value if value else 'NOT SET'}")

In [None]:
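from azure.identity import DefaultAzureCredential

# Safety evaluators call the Azure AI project service, so they need a credential
# in addition to the project configuration.
violence_evaluator = ViolenceEvaluator(
    credential=DefaultAzureCredential(),
    azure_ai_project=azure_ai_project,
)
result = violence_evaluator(
    query="What is the capital of France?",
    response="Paris."
)

print(f"Violence Evaluation Result (Option 1): {result}")
print(f"Violence Severity: {result.get('violence', 'Not provided')}")
print(f"Violence Score: {result.get('violence_score', 'Not provided')}")
print(f"Violence Reason: {result.get('violence_reason', 'Not provided')}")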

### Option 2: Using Azure AI Project URL

This approach uses a direct URL to your Azure AI project, which is more concise:

In [None]:
# Option 2: Azure AI project URL
azure_ai_project = "https://{resource_name}.services.ai.azure.com/api/projects/{project_name}"

# Note: Replace {resource_name} and {project_name} with your actual values
print(f"Azure AI Project URL (Option 2): {azure_ai_project}")
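
Rather than editing the placeholder string by hand, the URL can be filled in programmatically. This is a minimal sketch that reuses the URL pattern from the cell above; the helper `build_project_url` and the example values are hypothetical.

```python
PROJECT_URL_TEMPLATE = "https://{resource_name}.services.ai.azure.com/api/projects/{project_name}"


def build_project_url(resource_name, project_name):
    """Fill the URL template, rejecting empty values so placeholders never leak through."""
    if not resource_name or not project_name:
        raise ValueError("resource_name and project_name must both be non-empty")
    return PROJECT_URL_TEMPLATE.format(resource_name=resource_name, project_name=project_name)


# "my-resource" and "my-project" are made-up values for illustration only.
print(build_project_url("my-resource", "my-project"))
```

Raising on empty input guards against passing an unformatted or partially formatted URL to an evaluator.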

In [None]:
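from azure.identity import DefaultAzureCredential

# The evaluator is constructed the same way as in Option 1; only the form of
# azure_ai_project differs (a URL string instead of a dictionary).
violence_evaluator = ViolenceEvaluator(
    credential=DefaultAzureCredential(),
    azure_ai_project=azure_ai_project,
)
result = violence_evaluator(
    query="What is the capital of France?",
    response="Paris."
)

print(f"Violence Evaluation Result (Option 2): {result}")
print(f"Violence Severity: {result.get('violence', 'Not provided')}")
print(f"Violence Score: {result.get('violence_score', 'Not provided')}")
print(f"Violence Reason: {result.get('violence_reason', 'Not provided')}")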

## Summary

This notebook demonstrated three types of evaluation approaches in Azure AI Foundry:

1. **NLP-based Evaluators (BLEU Score)**: Fast, deterministic metrics that don't require API calls. Good for measuring text similarity against ground truth.

2. **AI-assisted Quality Evaluators (Relevance)**: Use language models to assess quality aspects such as relevance, coherence, and fluency. More nuanced than NLP metrics, but each evaluation requires a model call.

3. **AI-assisted Safety Evaluators (Violence)**: Specifically designed to detect harmful content. Essential for production AI systems.

### Key Takeaways:
- Choose evaluators based on your specific use case and requirements
- NLP metrics are fast but limited in scope
- AI-assisted evaluators provide more nuanced assessment but require Azure OpenAI or AI Foundry resources
- Safety evaluators are crucial for responsible AI deployment
- You can configure Azure AI projects using either detailed parameters or direct URLs

### Next Steps:
- Configure your actual Azure credentials and project details
- Experiment with different evaluator types and parameters
- Consider combining multiple evaluators for comprehensive assessment
- Integrate evaluations into your AI development workflow
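
As a starting point for combining evaluators, the sketch below runs several callables on the same inputs and collects their results by name. The helper `run_evaluators` and the two stand-in evaluators are hypothetical; they only mimic the call style used in this notebook (keyword inputs in, dictionary out). For scoring a whole dataset, the SDK's own `evaluate` function, imported at the top of this notebook, is likely the better fit.

```python
def run_evaluators(evaluators, **inputs):
    """Call each evaluator with the same keyword inputs and collect results by name."""
    combined = {}
    for name, evaluator in evaluators.items():
        try:
            combined[name] = evaluator(**inputs)
        except Exception as error:  # keep going so one failure does not hide the others
            combined[name] = {"error": str(error)}
    return combined


# Stand-in evaluators for illustration; swap in real evaluator instances as needed.
def length_evaluator(*, response, **_):
    return {"response_length": len(response.split())}


def exact_match_evaluator(*, response, ground_truth, **_):
    return {"exact_match": float(response.strip() == ground_truth.strip())}


report = run_evaluators(
    {"length": length_evaluator, "exact_match": exact_match_evaluator},
    query="What is the capital of Japan?",
    response="The capital of Japan is Tokyo.",
    ground_truth="The capital of Japan is Tokyo.",
)
print(report)
```

Real evaluators may reject keyword arguments they do not expect, so the inputs might need to be filtered per evaluator before the call.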