1. Introduction
 
The Azure AI Evaluation SDK allows you to quantitatively and qualitatively evaluate Generative AI applications both locally and at scale. It includes a variety of built-in evaluators you can use with your test data, and supports evaluation for both single-turn and multi-turn conversations, as well as multi-modal data (e.g., images).

2. Environment Setup
 
Make sure you have access to the necessary Azure OpenAI resources. Set the following environment variables in your system (or in your notebook for demonstration):

In [None]:
import os  
  
os.environ["AZURE_ENDPOINT"] = "endpoint"  
os.environ["AZURE_API_KEY"] = "your keys"  
os.environ["AZURE_DEPLOYMENT_NAME"] = "model deployment name"  
os.environ["AZURE_API_VERSION"] = "api version"   # e.g., "2024-02-15-preview"  
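Before moving on, it can help to confirm that all four variables are actually set. The following is a minimal sketch using only the standard library; the helper name `missing_env_vars` is made up for this illustration and is not part of the SDK.

```python
import os

# The four variables set in the cell above.
REQUIRED_VARS = (
    "AZURE_ENDPOINT",
    "AZURE_API_KEY",
    "AZURE_DEPLOYMENT_NAME",
    "AZURE_API_VERSION",
)


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Running this check first turns a confusing authentication error later in the notebook into an explicit list of what is missing.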

3. SDK Installation
 
Install the Azure AI Evaluation SDK:

In [None]:
!pip install azure-ai-evaluation  

In [None]:
!pip install azure-ai-projects

4. Model Configuration
 
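AI-assisted evaluators use a GPT model as the judge, so they need a model configuration that identifies which deployment to call.
Some safety evaluators are the exception: they take Azure AI project details and a credential instead.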

In [2]:
from azure.ai.evaluation import AzureOpenAIModelConfiguration  
  
model_config = AzureOpenAIModelConfiguration(  
    azure_endpoint=os.environ["AZURE_ENDPOINT"],  
    api_key=os.environ["AZURE_API_KEY"],  
    azure_deployment=os.environ["AZURE_DEPLOYMENT_NAME"],  
    api_version=os.environ["AZURE_API_VERSION"],  
)  

5. Running Built-in Evaluators (Single Row)
 
Let's run an evaluator on a simple query-response pair using the RelevanceEvaluator:

In [None]:
from azure.ai.evaluation import RelevanceEvaluator  
  
query = "What is the capital of France?"  
response = "Paris."  
  
relevance_eval = RelevanceEvaluator(model_config)  
result = relevance_eval(query=query, response=response)  
print(result)  

Supported Built-in Evaluators

General purpose: CoherenceEvaluator, FluencyEvaluator, QAEvaluator, etc.
Similarity: SimilarityEvaluator, F1ScoreEvaluator, BleuScoreEvaluator, etc.
RAG: GroundednessEvaluator, RetrievalEvaluator, etc.
Safety: ViolenceEvaluator, ContentSafetyEvaluator, etc.
See full list in Azure Docs
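To give a feel for what a similarity metric measures, here is a rough sketch of a token-overlap F1 score between a response and a ground truth. This is an illustration only: it assumes simple lowercase whitespace tokenization, and the SDK's F1ScoreEvaluator may normalize and tokenize text differently. The function name `token_f1` is made up for this example.

```python
from collections import Counter


def token_f1(response: str, ground_truth: str) -> float:
    """Token-overlap F1 (illustrative; not the SDK's implementation)."""
    response_tokens = response.lower().split()
    truth_tokens = ground_truth.lower().split()
    if not response_tokens or not truth_tokens:
        return 0.0
    # Count tokens shared by both texts, respecting repeated tokens.
    overlap = sum((Counter(response_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(response_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("the capital is paris", "paris"))
```

Metrics of this kind need no judge model, which is why the similarity evaluators that compute them can run without a model configuration.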

6. Batch Evaluation with .jsonl Dataset
 
Prepare your dataset as a .jsonl file (JSON Lines):

Example: data.jsonl

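{"query": "What is the capital of France?", "context": "Paris is the capital of France.", "response": "Paris."}
{"query": "What atoms compose water?", "context": "A water molecule consists of two hydrogen atoms and one oxygen atom.", "response": "Hydrogen and oxygen."}
{"query": "What color is my shirt?", "context": "The user is wearing a blue shirt.", "response": "Blue."}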

You can now run evaluators over this dataset:

In [None]:
from azure.ai.evaluation import evaluate, GroundednessEvaluator  
  
groundedness_eval = GroundednessEvaluator(model_config)  
  
result = evaluate(  
    data="data.jsonl",  
    evaluators={"groundedness": groundedness_eval},  
    output_path="./eval_results.json"    # Output is optional  
)  
import json  
print(json.dumps(result['metrics'], indent=2))  
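If you pass output_path, the results can be reloaded later without re-running the evaluation. The sketch below assumes the saved file mirrors the returned dictionary, with "metrics" and "rows" keys; the helper name `load_eval_output` is made up, and the demo file contains placeholder values rather than real scores.

```python
import json
import os
import tempfile


def load_eval_output(path):
    """Return (metrics, row_count) from a saved results file (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        saved = json.load(f)
    return saved.get("metrics", {}), len(saved.get("rows", []))


# Demo on a throwaway file; the metric name and value are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "eval_results.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump({"rows": [{}, {}, {}], "metrics": {"placeholder.metric": 0.0}}, f)
    metrics, row_count = load_eval_output(demo_path)

print(row_count, metrics)
```

In a real run you would call `load_eval_output("./eval_results.json")` on the file written by evaluate().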

Data Requirements
 

Each line in .jsonl must be a valid JSON object.
Key names should match the evaluator's expected input (query, response, context, etc).
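Both requirements can be checked before starting an evaluation run. The following is a minimal sketch using only the standard library; `check_jsonl` is a made-up helper, and the list of required keys is something you choose per evaluator.

```python
import json
import os
import tempfile


def check_jsonl(path, required_keys):
    """Return (line_number, problem) tuples; an empty list means no problems found."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((line_number, "not a JSON object"))
                continue
            missing = [key for key in required_keys if key not in record]
            if missing:
                problems.append((line_number, "missing keys: " + ", ".join(missing)))
    return problems


# Demo on a throwaway file so the sketch runs without touching real data.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.jsonl")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write('{"query": "What is the capital of France?", "response": "Paris."}\n')
        f.write('{"query": "What atoms compose water?"}\n')
    problems = check_jsonl(sample_path, ["query", "response"])

print(problems)
```

For the dataset above you would call `check_jsonl("data.jsonl", [...])` with the keys your chosen evaluator expects.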

7. Evaluating Conversations
 
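For multi-turn data, pass a conversation dictionary containing a list of messages. Each assistant message can carry the context it was generated from.

Conversation Example: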

In [None]:
from azure.ai.evaluation import GroundednessEvaluator  
  
conversation = {  
    "messages": [  
        {"content": "Which tent is the most waterproof?", "role": "user"},  
        {  
            "content": "The Alpine Explorer Tent is the most waterproof",  
            "role": "assistant",  
            "context": "From our product list the Alpine Explorer Tent is the most waterproof.",  
        },  
        {"content": "How much does it cost?", "role": "user"},  
        {  
            "content": "The Alpine Explorer Tent is $120.",  
            "role": "assistant",  
            "context": None,  
        },  
    ]  
}  
  
groundedness_eval = GroundednessEvaluator(model_config)  
score = groundedness_eval(conversation=conversation)  
  
import json  
print(json.dumps(score, indent=2))  

JSONL Format for Conversations:

In [None]:
{"conversation": { "messages": [...] }}  
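A conversation dataset in this format can be produced with the standard json module. This is a small sketch: the helper name `write_conversations` and the file name conversations.jsonl are made up for illustration, and the demo writes to a temporary directory.

```python
import json
import os
import tempfile


def write_conversations(path, conversations):
    """Write one {"conversation": ...} JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for conversation in conversations:
            f.write(json.dumps({"conversation": conversation}) + "\n")


conversation = {
    "messages": [
        {"content": "Which tent is the most waterproof?", "role": "user"},
        {
            "content": "The Alpine Explorer Tent is the most waterproof",
            "role": "assistant",
            "context": "From our product list the Alpine Explorer Tent is the most waterproof.",
        },
    ]
}

# Write to a throwaway file, then read it back to confirm the shape.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "conversations.jsonl")
    write_conversations(path, [conversation])
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()

first = json.loads(lines[0])
print(len(lines), list(first))
```

Note that json.dumps writes a Python None as JSON null, so a message with `"context": None` stays valid JSON.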

8. Using Composite Evaluators
 
Composite evaluators group several metrics under one evaluator:

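QA Evaluator Example (each row needs query, response, context, and ground_truth):

In [7]:
evaluation_data = [
    {
        "query": "Who invented the lightbulb?",
        "response": "Thomas Edison invented the first commercially successful incandescent light bulb.",
        "context": "In 1879, Thomas Edison created the first commercially successful incandescent light bulb.",
        # The similarity and F1 metrics inside QAEvaluator compare the response to this value
        "ground_truth": "Thomas Edison created the first commercially successful incandescent light bulb in 1879."
    },
    # Add more entries as needed
]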


In [8]:
import json

with open("evaluation_data.jsonl", "w") as f:
    for entry in evaluation_data:
        f.write(json.dumps(entry) + "\n")


In [None]:
from azure.ai.evaluation import evaluate, QAEvaluator

# Initialize your evaluator
qa_evaluator = QAEvaluator(model_config)

# Run the evaluation
result = evaluate(
    data="evaluation_data.jsonl",
    evaluators={"qa": qa_evaluator},
    evaluation_name="RAG Evaluation Demo"
)

print(result["metrics"])


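To log results to an Azure AI project, sign in with the Azure CLI first:

In [None]:
!az login


QA Evaluator, logged to an Azure AI project: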

In [None]:
from azure.ai.evaluation import evaluate, QAEvaluator

# Define your Azure AI project details
azure_ai_project = {
    "subscription_id": "your sub id",
    "project_name": "your project name",
    "resource_group_name": "your resource group name"
}

# Initialize your evaluator
qa_evaluator = QAEvaluator(model_config)

# Run the evaluation
result = evaluate(
    data="evaluation_data.jsonl",
    evaluators={"qa": qa_evaluator},
    evaluation_name="RAG Evaluation Demo",
    azure_ai_project=azure_ai_project
)

# Output the evaluation metrics and the link to Azure AI Foundry
print(result["metrics"])
print(f"View results in Azure AI Foundry: {result.get('studio_url')}")


Groundedness Evaluator:

In [None]:
from azure.ai.evaluation import evaluate, GroundednessEvaluator

# Define your Azure AI project details
azure_ai_project = {
    "subscription_id": "your sub id",
    "project_name": "your project name",
    "resource_group_name": "your resource group name"
}

# Initialize the evaluator
groundedness_evaluator = GroundednessEvaluator(model_config)

# Run the evaluation
result = evaluate(
    data="evaluation_data.jsonl",
    evaluators={"groundedness": groundedness_evaluator},
    evaluation_name="RAG Groundedness Evaluation",
    azure_ai_project=azure_ai_project
)

# Output the evaluation metrics and the link to Azure AI Foundry
print(result["metrics"])
print(f"View results in Azure AI Foundry: {result.get('studio_url')}")


9. Tracking Evaluations in Azure AI Project
 
You can log evaluation runs to your Azure AI project for easier tracking by passing azure_ai_project to evaluate(). The example below combines several evaluators in one logged run:

In [None]:
!pip install azure-ai-evaluation azure-identity


In [None]:
from azure.ai.evaluation import (
    evaluate,
    GroundednessEvaluator,
    RetrievalEvaluator,
    ViolenceEvaluator,
    BleuScoreEvaluator,
)
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
# Define your Azure AI project details
azure_ai_project = {
    "subscription_id": "your sub id",
    "project_name": "your project name",
    "resource_group_name": "your resource group name"
}

# Initialize evaluators
evaluators = {
    "groundedness": GroundednessEvaluator(model_config),
    "retrieval": RetrievalEvaluator(model_config),
    "violence": ViolenceEvaluator(credential=credential, azure_ai_project=azure_ai_project),
    "bleu": BleuScoreEvaluator(threshold=0.5)
}

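# Run the evaluation. evaluation_data_new.jsonl must supply every column these
# evaluators read: query, response, context, and ground_truth (for BLEU).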
result = evaluate(
    data="evaluation_data_new.jsonl",
    evaluators=evaluators,
    evaluation_name="RAG Comprehensive Evaluation",
    azure_ai_project=azure_ai_project
)

# Output the evaluation metrics and the link to Azure AI Foundry
print(result["metrics"])
print(f"View results in Azure AI Foundry: {result.get('studio_url')}")
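Rather than hard-coding the project details in each cell, they can be read from environment variables, in the same spirit as the model settings in the Environment Setup section. This is a sketch only: the variable names AZURE_SUBSCRIPTION_ID, AZURE_RESOURCE_GROUP, and AZURE_PROJECT_NAME are assumptions chosen for this example, not names the SDK looks for, and `project_from_env` is a made-up helper.

```python
import os

# Hypothetical variable names; use whatever your environment defines.
PROJECT_ENV_VARS = {
    "subscription_id": "AZURE_SUBSCRIPTION_ID",
    "resource_group_name": "AZURE_RESOURCE_GROUP",
    "project_name": "AZURE_PROJECT_NAME",
}


def project_from_env(environ=None):
    """Build the azure_ai_project dictionary from environment variables."""
    env = os.environ if environ is None else environ
    missing = [var for var in PROJECT_ENV_VARS.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in PROJECT_ENV_VARS.items()}


# Demo with placeholder values instead of the real environment.
demo_env = {
    "AZURE_SUBSCRIPTION_ID": "your sub id",
    "AZURE_RESOURCE_GROUP": "your resource group name",
    "AZURE_PROJECT_NAME": "your project name",
}
demo_project = project_from_env(demo_env)
print(demo_project)
```

In the notebook you would call `project_from_env()` with no arguments and pass the result as azure_ai_project.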


10. Advanced: Local Evaluation on a Target (Optional)
 
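If you want to run evaluation against a live application (e.g., an API or callable class), provide it as the target parameter. evaluate() then calls the target for each row of the dataset and evaluates the returned answers. The column_mapping tells each evaluator where its inputs come from: ${data.*} refers to columns in the dataset and ${outputs.*} to values returned by the target.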

Example (assuming you have a callable askwiki class):

In [None]:
from askwiki import askwiki  # Your implementation  
  
result = evaluate(  
    data="data.jsonl",  
    target=askwiki,  
    evaluators={"groundedness": groundedness_eval},  
    evaluator_config={  
        "default": {  
            "column_mapping": {  
                "query": "${data.query}",  
                "context": "${outputs.context}",  
                "response": "${outputs.response}"  
            }  
        }  
    },  
    output_path="target_eval.json"  
)  
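For reference, a target compatible with the column mapping above could look like the sketch below. It is a hypothetical stand-in, not the askwiki implementation: it assumes the target is called with the dataset's query column as a keyword argument and returns a dictionary whose keys become the ${outputs.*} columns. The class name `AskWikiStub` and its in-memory lookup table are made up for illustration.

```python
class AskWikiStub:
    """Hypothetical stand-in for a real application; answers from a tiny in-memory table."""

    _FACTS = {
        "What is the capital of France?": (
            "Paris.",
            "Paris is the capital of France.",
        ),
    }

    def __call__(self, query: str) -> dict:
        # Return the two fields that the column mapping reads from the outputs.
        response, context = self._FACTS.get(query, ("I don't know.", ""))
        return {"response": response, "context": context}


stub = AskWikiStub()
output = stub(query="What is the capital of France?")
print(output)
```

A real target would call your application or API inside `__call__` and return the retrieved context alongside the answer, so that evaluators such as groundedness have something to check the response against.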