# Azure AI Evaluation SDK Demo

This notebook demonstrates how to use the Azure AI Evaluation SDK to evaluate generative AI applications. The SDK provides tools for assessing the quality, safety, and performance of AI-generated content.

## Overview

The Azure AI Evaluation SDK allows you to:
- Evaluate a single response or an entire dataset
- Use built-in evaluators for quality, safety, and performance metrics
- Create custom evaluators for specific needs
- Track evaluation results in Azure AI projects

Let's explore how to use this SDK for various evaluation scenarios.

## Installation

First, let's install the Azure AI Evaluation SDK:

In [None]:
# Install the Azure AI Evaluation SDK
# %pip installs into the environment of the running kernel
%pip install azure-ai-evaluation azure-identity

## Setup and Configuration

Let's set up authentication and read our Azure AI project and Azure OpenAI settings from environment variables:

In [None]:
import os
from azure.identity import DefaultAzureCredential

# Set up authentication
credential = DefaultAzureCredential()

# Initialize Azure AI project and Azure OpenAI connection
# Replace these with your actual values or set as environment variables
azure_ai_project = {
    "subscription_id": os.environ.get("AZURE_SUBSCRIPTION_ID"),
    "resource_group_name": os.environ.get("AZURE_RESOURCE_GROUP"),
    "project_name": os.environ.get("AZURE_PROJECT_NAME"),
}

# Model configuration for evaluators that use LLMs
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
    "api_version": os.environ.get("AZURE_OPENAI_API_VERSION"),
    # The model configuration expects the key "azure_deployment"
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT_NAME")
}
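
Because `os.environ.get` silently returns `None` for an unset variable, a missing setting only surfaces later as an authentication or connection error. The following is a minimal sketch of an early check. `missing_env_vars` and `REQUIRED_ENV_VARS` are hypothetical names introduced for illustration and are not part of the SDK; the variable names are the ones read in the cell above.

```python
import os

# Hypothetical list (not part of the SDK): the variables read by the setup cell.
REQUIRED_ENV_VARS = [
    "AZURE_SUBSCRIPTION_ID",
    "AZURE_RESOURCE_GROUP",
    "AZURE_PROJECT_NAME",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Running this before creating any evaluator makes configuration problems visible in one place.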

## Performance and Quality Evaluators

The SDK provides several evaluators for assessing the quality of AI-generated content. Let's explore some of these evaluators:

### AI-Assisted Quality Evaluators

- **GroundednessEvaluator**: Evaluates if the response is grounded in the provided context
- **RelevanceEvaluator**: Evaluates if the response is relevant to the query
- **CoherenceEvaluator**: Evaluates the logical flow and consistency of the response
- **FluencyEvaluator**: Evaluates the grammatical correctness and readability
- **SimilarityEvaluator**: Evaluates how similar the response is to a ground-truth answer

In [None]:
# Import AI-Assisted quality evaluators
from azure.ai.evaluation import (
    GroundednessEvaluator,
    RelevanceEvaluator,
    CoherenceEvaluator,
    FluencyEvaluator,
    SimilarityEvaluator
)

# Initialize evaluators
groundedness_eval = GroundednessEvaluator(model_config)
relevance_eval = RelevanceEvaluator(model_config)
coherence_eval = CoherenceEvaluator(model_config)
fluency_eval = FluencyEvaluator(model_config)
similarity_eval = SimilarityEvaluator(model_config)

# Example: Evaluate groundedness of a response based on context
context = "The exact speed of light in a vacuum is 299,792,458 meters per second, a constant used in physics to represent 'c'."
query = "What is the speed of light?"
response = "The speed of light is approximately 299,792,458 meters per second."

groundedness_score = groundedness_eval(context=context, query=query, response=response)
print("Groundedness Evaluation:")
print(groundedness_score)

# Example: Evaluate relevance of a response to a query
relevance_score = relevance_eval(query=query, response=response)
print("\nRelevance Evaluation:")
print(relevance_score)

# Example: Evaluate coherence of a response (the evaluator takes the query as well)
coherence_score = coherence_eval(query=query, response=response)
print("\nCoherence Evaluation:")
print(coherence_score)

### NLP-Based Metrics

These evaluators use traditional NLP metrics to assess the quality of generated text:

- **F1ScoreEvaluator**: Calculates F1 score between reference and generated text
- **RougeScoreEvaluator**: Calculates ROUGE metrics
- **BleuScoreEvaluator**: Calculates BLEU score
- **MeteorScoreEvaluator**: Calculates METEOR score

In [None]:
# Import NLP metric evaluators
from azure.ai.evaluation import (
    F1ScoreEvaluator,
    RougeScoreEvaluator,
    RougeType,
    BleuScoreEvaluator,
    MeteorScoreEvaluator
)

# Initialize evaluators
f1_eval = F1ScoreEvaluator()
# RougeScoreEvaluator needs to be told which ROUGE variant to compute
rouge_eval = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_L)
bleu_eval = BleuScoreEvaluator()
meteor_eval = MeteorScoreEvaluator()

# Example: Calculate metrics between reference and generated text
reference = "The speed of light in a vacuum is exactly 299,792,458 meters per second."
generated = "The speed of light is approximately 299,792,458 meters per second."

# The NLP evaluators take the reference text through the ground_truth parameter
f1_score = f1_eval(ground_truth=reference, response=generated)
print("F1 Score:")
print(f1_score)

rouge_score = rouge_eval(ground_truth=reference, response=generated)
print("\nROUGE Score:")
print(rouge_score)

bleu_score = bleu_eval(ground_truth=reference, response=generated)
print("\nBLEU Score:")
print(bleu_score)
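
To make the F1 metric concrete, here is a minimal sketch of a token-overlap F1 in plain Python. It illustrates the idea only: `token_f1` is a hypothetical helper, and the lowercasing and whitespace tokenization are assumptions rather than the SDK's implementation, so its numbers may differ from those of `F1ScoreEvaluator`.

```python
from collections import Counter


def token_f1(reference: str, generated: str) -> float:
    """Harmonic mean of token precision and recall between two texts."""
    ref_tokens = reference.lower().split()
    gen_tokens = generated.lower().split()
    if not ref_tokens or not gen_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(ref_tokens) & Counter(gen_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen_tokens)  # share of generated tokens that match
    recall = overlap / len(ref_tokens)     # share of reference tokens recovered
    return 2 * precision * recall / (precision + recall)


score = token_f1(
    "The speed of light in a vacuum is exactly 299,792,458 meters per second.",
    "The speed of light is approximately 299,792,458 meters per second.",
)
print(f"Token-level F1: {score:.3f}")
```

Precision penalizes extra words in the generated text, recall penalizes words missing from it, and F1 balances the two.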

## Risk and Safety Evaluators

The SDK provides evaluators to detect potentially harmful content:

- **ViolenceEvaluator**: Detects violent content
- **SexualEvaluator**: Detects sexual content
- **SelfHarmEvaluator**: Detects content promoting self-harm
- **HateUnfairnessEvaluator**: Detects hate speech or unfair bias
- **CodeVulnerabilityEvaluator**: Detects vulnerabilities in code

In [None]:
# Import safety evaluators
from azure.ai.evaluation import (
    ViolenceEvaluator,
    HateUnfairnessEvaluator,
    CodeVulnerabilityEvaluator
)

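# Initialize evaluators
# Safety evaluators call the Azure AI project's safety evaluation service,
# so they take a credential and project details rather than model_config
violence_eval = ViolenceEvaluator(credential=credential, azure_ai_project=azure_ai_project)
hate_eval = HateUnfairnessEvaluator(credential=credential, azure_ai_project=azure_ai_project)
code_vuln_eval = CodeVulnerabilityEvaluator(credential=credential, azure_ai_project=azure_ai_project)

# Example: Check a response for potential issues (each call takes a query and a response)
safe_query = "How should we approach this problem?"
safe_response = "To solve this problem, we need to analyze the data carefully and draw conclusions based on evidence."
code_query = "Write a password verification function."
code_response = "Here's a password verification function:\ndef verify(input_pwd, stored_pwd):\n    return input_pwd == stored_pwd"

violence_score = violence_eval(query=safe_query, response=safe_response)
print("Violence Evaluation:")
print(violence_score)

code_vuln_score = code_vuln_eval(query=code_query, response=code_response)
print("\nCode Vulnerability Evaluation:")
print(code_vuln_score)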
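
Once a safety result is available, you will often want to turn it into a pass/fail decision. The sketch below assumes the result is a dictionary with a numeric `<metric>_score` entry, where higher values mean higher severity. `flag_unsafe`, the default threshold of 3, and the sample dictionary are all hypothetical; compare the key names with the output printed above for your SDK version before relying on it.

```python
def flag_unsafe(result: dict, metric: str, threshold: int = 3) -> bool:
    """Return True when the severity score for `metric` reaches the threshold."""
    key = f"{metric}_score"
    if key not in result:
        # Fail loudly rather than treating a missing score as safe.
        raise KeyError(f"No '{key}' entry in result")
    return result[key] >= threshold


# Hypothetical result, written by hand for illustration only.
hypothetical_result = {"violence": "Very low", "violence_score": 0}
print("Flagged as unsafe:", flag_unsafe(hypothetical_result, "violence"))
```

Raising on a missing key is a deliberate choice here: a silent default would let unscored content pass as safe.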

## Custom Evaluators

You can create custom evaluators for specific evaluation needs. Here's an example of a simple code-based evaluator:

In [None]:
# Define a simple custom evaluator to measure answer length
class AnswerLengthEvaluator:
    def __init__(self):
        pass
    
    # A class is made callable by implementing the special method __call__
    def __call__(self, *, response: str, **kwargs):
        return {"value": len(response)}

# Initialize and use the custom evaluator
answer_length_eval = AnswerLengthEvaluator()
length_result = answer_length_eval(response="The speed of light is approximately 299,792,458 meters per second.")
print("Answer Length Evaluation:")
print(length_result)
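
The same callable pattern extends to evaluators that take configuration. As a further sketch, the hypothetical `KeywordCoverageEvaluator` below receives its keywords in `__init__` and reports the fraction of them that appear in the response. The class name, output keys, and case-insensitive substring matching are assumptions made for illustration.

```python
class KeywordCoverageEvaluator:
    """Hypothetical evaluator: fraction of expected keywords found in the response."""

    def __init__(self, keywords):
        # Normalize once so matching is case-insensitive.
        self._keywords = [keyword.lower() for keyword in keywords]

    def __call__(self, *, response: str, **kwargs):
        text = response.lower()
        matched = sum(1 for keyword in self._keywords if keyword in text)
        coverage = matched / len(self._keywords) if self._keywords else 0.0
        return {"keyword_coverage": coverage, "matched_count": matched}


keyword_eval = KeywordCoverageEvaluator(["light", "vacuum"])
keyword_result = keyword_eval(
    response="The speed of light is approximately 299,792,458 meters per second."
)
print("Keyword Coverage Evaluation:")
print(keyword_result)
```

Returning plain numbers in a dictionary keeps the result easy to aggregate across rows.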

## Evaluating Datasets

The SDK provides the `evaluate()` function to evaluate an entire dataset with multiple evaluators:

In [None]:
# Import the evaluate function
from azure.ai.evaluation import evaluate

# Create a sample dataset (in real scenarios, you would load this from a file)
import json

# Create a sample dataset as a list of dictionaries
sample_data = [
    {
        "query": "What is the speed of light?",
        "context": "The exact speed of light in a vacuum is 299,792,458 meters per second, a constant used in physics to represent 'c'.",
        "response": "The speed of light is approximately 299,792,458 meters per second."
    },
    {
        "query": "What is the capital of France?",
        "context": "Paris is the capital and most populous city of France. It is located on the Seine River.",
        "response": "Paris is the capital of France."
    }
]

# Save the sample data to a JSONL file
with open("sample_data.jsonl", "w") as f:
    for item in sample_data:
        f.write(json.dumps(item) + "\n")

# Evaluate the dataset using multiple evaluators
result = evaluate(
    data="sample_data.jsonl",  # path to the data file
    evaluators={
        "groundedness": groundedness_eval,
        "answer_length": answer_length_eval
    },
    # Column mapping to tell the evaluator which columns to use
    evaluator_config={
        "groundedness": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${data.context}",
                "response": "${data.response}"
            }
        },
        "answer_length": {
            "column_mapping": {
                "response": "${data.response}"
            }
        }
    },
    # Optionally provide Azure AI project information to track results
    # azure_ai_project=azure_ai_project,
    # Output path to save results
    output_path="./evaluation_results.json"
)

# Print evaluation results
print("Aggregate Metrics:")
print(result["metrics"])
print("\nRow-level Results (first row):")
print(result["rows"][0])
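
Column mappings such as `${data.context}` only work when every row actually contains that field. A small pre-flight check can catch incomplete rows before any evaluator runs. This is a sketch under assumptions: `find_rows_missing_columns` is a hypothetical helper, and the demo writes its own temporary JSONL file rather than touching `sample_data.jsonl`.

```python
import json
import os
import tempfile


def find_rows_missing_columns(path, required_columns):
    """Return (line_number, missing_columns) pairs for rows lacking required fields."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            row = json.loads(line)
            missing = [col for col in required_columns if col not in row]
            if missing:
                problems.append((line_number, missing))
    return problems


# Demo data: the second row deliberately omits "context".
demo_rows = [
    {"query": "q1", "context": "c1", "response": "r1"},
    {"query": "q2", "response": "r2"},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        for row in demo_rows:
            f.write(json.dumps(row) + "\n")
    problems = find_rows_missing_columns(demo_path, ["query", "context", "response"])

print("Rows with missing columns:", problems)
```

An empty list means every row has the fields that the column mappings refer to.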

## Multi-Modal Evaluation

The SDK also supports evaluation of multi-modal content, including images:

In [None]:
# Note: This example demonstrates the structure for multi-modal evaluation
# Actual execution requires appropriate models and images

from azure.ai.evaluation import ViolenceEvaluator
import base64

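# Initialize a safety evaluator for multi-modal content
# (safety evaluators take a credential and project details, not model_config)
safety_evaluator = ViolenceEvaluator(credential=credential, azure_ai_project=azure_ai_project)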

# Example of how to structure a conversation with an image
# In a real scenario, you would load and encode an actual image
sample_conversation = {
    "messages": [
        {
            "content": "What's in this image?",
            "role": "user",
        },
        {
            # In a real scenario, replace with actual base64 encoded image
            "content": [{"type": "image_url", "image_url": {"url": "data:image/jpg;base64,<ENCODED_IMAGE>"}}],
            "role": "assistant",
        },
    ]
}

# Example showing how you would evaluate the conversation (commented out as it requires actual images)
# safety_score = safety_evaluator(conversation=sample_conversation)
# print("Safety Score for Multimodal Content:")
# print(safety_score)
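
To replace the `<ENCODED_IMAGE>` placeholder, the image bytes have to be base64-encoded and wrapped in a data URL. The sketch below shows one way to do that with the standard library and then builds a conversation with the same structure as above. `image_bytes_to_data_url` and `build_image_conversation` are hypothetical helpers, the default MIME type is an assumption, and the bytes used here are a stand-in rather than a real image.

```python
import base64


def image_bytes_to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Base64-encode raw image bytes and wrap them in a data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_image_conversation(prompt: str, data_url: str) -> dict:
    """Build a two-message conversation mirroring the structure shown above."""
    return {
        "messages": [
            {"content": prompt, "role": "user"},
            {
                "content": [{"type": "image_url", "image_url": {"url": data_url}}],
                "role": "assistant",
            },
        ]
    }


# Stand-in bytes; in practice read them with open(path, "rb").read().
placeholder_bytes = b"not a real image"
data_url = image_bytes_to_data_url(placeholder_bytes)
conversation = build_image_conversation("What's in this image?", data_url)
print(conversation["messages"][1]["content"][0]["image_url"]["url"][:40])
```

Keeping the encoding in its own function makes it easy to pass a different `mime_type` for PNG or other formats.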

## Conclusion

The Azure AI Evaluation SDK provides a comprehensive toolkit for evaluating generative AI applications. It offers:

- **Built-in evaluators** for quality, safety, and performance metrics
- **Custom evaluator** capabilities for specific evaluation needs
- **Dataset evaluation** for assessing large amounts of data
- **Multi-modal evaluation** for content including images
- **Integration with Azure AI projects** for tracking evaluation results

By using these evaluation tools, you can gain insights into your AI application's performance, identify areas for improvement, and ensure the quality and safety of generated content.