# Evaluate with quantitative NLP evaluators

## Objective
This notebook demonstrates how to use NLP-based evaluators to assess the quality of generated text by comparing it to reference text. By the end of this tutorial, you'll be able to:
 - Understand different NLP evaluators such as `BleuScoreEvaluator`, `GleuScoreEvaluator`, `MeteorScoreEvaluator`, and `RougeScoreEvaluator`.
 - Evaluate a dataset using these evaluators.

## Time
You should expect to spend about 10 minutes running this notebook.

## Before you begin

### Installation
Install the following packages required to execute this notebook.

In [1]:
# Install the packages
%pip install azure-ai-evaluation

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
from pprint import pprint
from dotenv import load_dotenv
load_dotenv("../.credentials.env")

True

## NLP Evaluators

### BleuScoreEvaluator

BLEU (Bilingual Evaluation Understudy) is a metric that originated in machine translation and is also widely used
for text summarization and text generation. It measures how closely the generated text matches the reference text
in terms of n-gram overlap. The BLEU score ranges from 0 to 1, with higher scores indicating a closer match to the
reference.

In [3]:
from azure.ai.evaluation import BleuScoreEvaluator

bleu = BleuScoreEvaluator()

In [4]:
result = bleu(response="London is the capital of England.", ground_truth="The capital of England is London.")

print(result)

{'bleu_score': 0.22961813530951883, 'bleu_result': 'fail', 'bleu_threshold': 0.5}
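
The low score above is easier to interpret when broken down by n-gram order. The following is a minimal sketch of clipped n-gram precision, the quantity BLEU is built on. It is not the evaluator's implementation: the regex tokenizer and the helper names (`tokenize`, `ngram_counts`, `modified_precision`) are assumptions made for illustration, and the brevity penalty and any smoothing are left out.

```python
import re
from collections import Counter
from fractions import Fraction


def tokenize(text):
    # Split into word tokens and single punctuation marks (case-sensitive).
    return re.findall(r"\w+|[^\w\s]", text)


def ngram_counts(tokens, n):
    # Count every contiguous n-gram in the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(response, ground_truth, n):
    # Clipped precision: each response n-gram is credited at most as many
    # times as it occurs in the reference.
    hyp = ngram_counts(tokenize(response), n)
    ref = ngram_counts(tokenize(ground_truth), n)
    overlap = sum((hyp & ref).values())
    return Fraction(overlap, max(sum(hyp.values()), 1))


response = "London is the capital of England."
ground_truth = "The capital of England is London."
precisions = [modified_precision(response, ground_truth, n) for n in range(1, 5)]
for n, p in enumerate(precisions, start=1):
    print(f"{n}-gram precision: {p}")
```

Under these assumptions the two sentences share most unigrams but no 4-grams, so reordering the same words is penalized heavily at higher orders. The evaluator still reports a nonzero score, which suggests some form of smoothing is applied; that detail is not shown in this notebook.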


### GleuScoreEvaluator

The GLEU (Google-BLEU) score evaluator measures the similarity between generated and reference texts by
evaluating n-gram overlap, considering both precision and recall. This balanced evaluation, designed for
sentence-level assessment, makes it ideal for detailed analysis of translation quality. GLEU is well-suited for
use cases such as machine translation, text summarization, and text generation.

In [5]:
from azure.ai.evaluation import GleuScoreEvaluator

gleu = GleuScoreEvaluator()

In [6]:
result = gleu(response="London is the capital of England.", ground_truth="The capital of England is London.")

print(result)

{'gleu_score': 0.4090909090909091, 'gleu_result': 'fail', 'gleu_threshold': 0.5}
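
To make the "precision and recall" description concrete, here is a minimal sketch of a GLEU-style score: count the n-grams of orders 1 to 4 shared by both texts, then take the smaller of precision and recall. The tokenizer and the names `all_ngrams` and `gleu_sketch` are hypothetical, and the evaluator's own tokenization may differ.

```python
import re
from collections import Counter


def tokenize(text):
    # Split into word tokens and single punctuation marks (case-sensitive).
    return re.findall(r"\w+|[^\w\s]", text)


def all_ngrams(tokens, min_n=1, max_n=4):
    # Pool the counts of all n-grams from min_n to max_n.
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def gleu_sketch(response, ground_truth):
    hyp = all_ngrams(tokenize(response))
    ref = all_ngrams(tokenize(ground_truth))
    matches = sum((hyp & ref).values())
    precision = matches / max(sum(hyp.values()), 1)
    recall = matches / max(sum(ref.values()), 1)
    # GLEU takes the minimum of n-gram precision and recall.
    return min(precision, recall)


gleu_value = gleu_sketch(
    "London is the capital of England.",
    "The capital of England is London.",
)
print(round(gleu_value, 4))
```

With this tokenization, 9 of the 22 n-grams in each sentence match, which is consistent with the `gleu_score` printed above.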


### MeteorScoreEvaluator

The METEOR (Metric for Evaluation of Translation with Explicit Ordering) evaluator scores generated text against
reference text, focusing on precision, recall, and content alignment. It addresses limitations of metrics like
BLEU by taking synonyms, word stems, and paraphrases into account, which lets it capture meaning and language
variation more accurately. In addition to machine translation and text summarization, paraphrase detection is a
good use case for the METEOR score.

In [7]:
from azure.ai.evaluation import MeteorScoreEvaluator

meteor = MeteorScoreEvaluator(alpha=0.9, beta=3.0, gamma=0.5)

In [8]:
result = meteor(response="London is the capital of England.", ground_truth="The capital of England is London.")

print(result)

{'meteor_score': 0.9067055393586005, 'meteor_result': 'pass', 'meteor_threshold': 0.5}
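
The roles of `alpha`, `beta`, and `gamma` can be illustrated with a simplified METEOR calculation. The sketch below handles exact token matches only (no stemming or synonym matching) and uses a greedy alignment, so it is an approximation under stated assumptions rather than the evaluator's implementation. The function names are hypothetical.

```python
import re


def tokenize(text):
    # Lowercased word and punctuation tokens.
    return re.findall(r"\w+|[^\w\s]", text.lower())


def align_exact(hyp, ref):
    # Greedy exact-match alignment: pair each response token with the
    # first unused identical reference token.
    used = set()
    pairs = []
    for i, token in enumerate(hyp):
        for j, ref_token in enumerate(ref):
            if j not in used and token == ref_token:
                used.add(j)
                pairs.append((i, j))
                break
    return pairs


def count_chunks(pairs):
    # A chunk is a run of matches that is contiguous in both texts.
    if not pairs:
        return 0
    chunks = 1
    for (i0, j0), (i1, j1) in zip(pairs, pairs[1:]):
        if i1 != i0 + 1 or j1 != j0 + 1:
            chunks += 1
    return chunks


def meteor_sketch(response, ground_truth, alpha=0.9, beta=3.0, gamma=0.5):
    hyp, ref = tokenize(response), tokenize(ground_truth)
    pairs = align_exact(hyp, ref)
    matches = len(pairs)
    if matches == 0:
        return 0.0
    precision = matches / len(hyp)
    recall = matches / len(ref)
    # alpha weights precision against recall in the harmonic mean.
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # gamma scales the fragmentation penalty; beta controls its shape.
    penalty = gamma * (count_chunks(pairs) / matches) ** beta
    return f_mean * (1 - penalty)


meteor_value = meteor_sketch(
    "London is the capital of England.",
    "The capital of England is London.",
)
print(round(meteor_value, 4))
```

In this example every token matches, so precision and recall are both 1 and the only deduction comes from word order: the 7 matches fall into 4 chunks, giving a penalty of 0.5 × (4/7)³. The result agrees with the `meteor_score` printed above, which explains why METEOR rates this reordered sentence far higher than BLEU or GLEU do.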


### RougeScoreEvaluator

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used to evaluate automatic
summarization and machine translation. It measures the overlap between generated text and reference summaries.
ROUGE focuses on recall-oriented measures to assess how well the generated text covers the reference text. Text
summarization and document comparison are among optimal use cases for ROUGE, particularly in scenarios where text
coherence and relevance are critical.


In [9]:
from azure.ai.evaluation import RougeScoreEvaluator, RougeType

rouge = RougeScoreEvaluator(rouge_type=RougeType.ROUGE_1)

In [10]:
result = rouge(response="London is the capital of England.", ground_truth="The capital of England is London.")

print(result)

{'rouge_precision': 1.0, 'rouge_recall': 1.0, 'rouge_f1_score': 1.0, 'rouge_precision_result': 'pass', 'rouge_recall_result': 'pass', 'rouge_f1_score_result': 'pass', 'rouge_precision_threshold': 0.5, 'rouge_recall_threshold': 0.5, 'rouge_f1_score_threshold': 0.5}
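
ROUGE-1 looks only at unigram overlap, which is why a reordered sentence containing the same words scores a perfect 1.0 above. The following sketch shows the calculation under the assumption that text is lowercased and reduced to alphanumeric tokens; `rouge1_sketch` is a hypothetical helper, not the evaluator's implementation.

```python
import re
from collections import Counter


def rouge1_sketch(response, ground_truth):
    # Lowercase alphanumeric tokens; punctuation is dropped.
    hyp = Counter(re.findall(r"[a-z0-9]+", response.lower()))
    ref = Counter(re.findall(r"[a-z0-9]+", ground_truth.lower()))
    # Clipped unigram overlap between response and reference.
    overlap = sum((hyp & ref).values())
    precision = overlap / max(sum(hyp.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


rouge_values = rouge1_sketch(
    "London is the capital of England.",
    "The capital of England is London.",
)
print(rouge_values)
```

Because word order plays no role here, ROUGE-1 is best read as a coverage measure; other `RougeType` variants can be passed to `RougeScoreEvaluator` when ordering matters.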


## Evaluate a Dataset using NLP Evaluators

The code below evaluates the dataset using BLEU, GLEU, METEOR, and ROUGE evaluators. Results are computed locally and then uploaded to Azure ML using MLflow with Azure AD authentication.
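
The contents of `nlp_data.jsonl` are not shown in this notebook. Judging by the `inputs.response` and `inputs.ground_truth` columns in the row-level results further down, each line is presumably a JSON object with `response` and `ground_truth` keys. The sketch below writes a file in that assumed shape using the three rows visible in the results; the file name `nlp_data_sample.jsonl` is hypothetical and chosen so the real data file is not overwritten.

```python
import json
import tempfile
from pathlib import Path

# Rows copied from the row-level results shown later in this notebook.
rows = [
    {"response": "The cat sits on the mat.", "ground_truth": "A cat is sitting on the mat."},
    {"response": "She enjoys reading books.", "ground_truth": "She loves to read books."},
    {"response": "He quickly ran to the store.", "ground_truth": "He ran to the store in a hurry."},
]

# Write one JSON object per line (JSON Lines format).
sample_path = Path(tempfile.gettempdir()) / "nlp_data_sample.jsonl"
with sample_path.open("w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read the file back to confirm that every line parses on its own.
loaded = [json.loads(line) for line in sample_path.read_text(encoding="utf-8").splitlines()]
print(f"Wrote {len(loaded)} rows to {sample_path}")
```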

In [20]:
from azure.ai.evaluation import evaluate
from azure.identity import DefaultAzureCredential
import random

# Use the same credential
credential = DefaultAzureCredential()

run_suffix = random.randint(1111, 9999)
result = evaluate(
    data="nlp_data.jsonl",
    evaluation_name=f"NLP-demo-{run_suffix}",
    evaluators={
        "bleu": bleu,
        "gleu": gleu,
        "meteor": meteor,
        "rouge": rouge,
    },
    # Optionally provide your AI Studio project information to track your evaluation results in your Azure AI Studio project
    # azure_ai_project = {
    #     "subscription_id": os.environ.get("AZURE_SUBSCRIPTION_ID"),
    #     "resource_group_name": os.environ.get("AZURE_RESOURCE_GROUP"),
    #     "project_name": os.environ.get("AZURE_PROJECT_NAME"),
    # },
    # Add credential for storage access
    credential=credential
)

2025-10-20 16:05:34 +0000 139989675398848 execution.bulk     INFO     Finished 3 / 3 lines.
2025-10-20 16:05:34 +0000 139989675398848 execution.bulk     INFO     Average execution time for completed lines: 0.0 seconds. Estimated time for incomplete lines: 0.0 seconds.
2025-10-20 16:05:34 +0000 139989692184256 execution.bulk     INFO     Finished 3 / 3 lines.
2025-10-20 16:05:34 +0000 139989692184256 execution.bulk     INFO     Average execution time for completed lines: 0.04 seconds. Estimated time for incomplete lines: 0.0 seconds.
2025-10-20 16:05:34 +0000 139990109497024 execution.bulk     INFO     Finished 3 / 3 lines.
2025-10-20 16:05:34 +0000 139990109497024 execution.bulk     INFO     Average execution time for com

Aggregated metrics for evaluator is not a dictionary will not be logged as metrics
Aggregated metrics for evaluator is not a dictionary will not be logged as metrics
Aggregated metrics for evaluator is not a dictionary will not be logged as metrics


2025-10-20 16:05:37 +0000 139989683791552 execution.bulk     INFO     Finished 3 / 3 lines.
2025-10-20 16:05:37 +0000 139989683791552 execution.bulk     INFO     Average execution time for completed lines: 1.1 seconds. Estimated time for incomplete lines: 0.0 seconds.


Aggregated metrics for evaluator is not a dictionary will not be logged as metrics
Aggregated metrics for evaluator is not a dictionary will not be logged as metrics
Aggregated metrics for evaluator is not a dictionary will not be logged as metrics



Run name: "meteor_20251020_160534_049455"
Run status: "Completed"
Start time: "2025-10-20 16:05:34.049455+00:00"
Duration: "0:00:03.304629"


{
    "bleu": {
        "status": "Completed",
        "duration": "0:00:01.005341",
        "completed_lines": 3,
        "failed_lines": 0,
        "log_path": null
    },
    "gleu": {
        "status": "Completed",
        "duration": "0:00:01.009556",
        "completed_lines": 3,
        "failed_lines": 0,
        "log_path": null
    },
    "meteor": {
        "status": "Completed",
        "duration": "0:00:03.304629",
        "completed_lines": 3,
        "failed_lines": 0,
        "log_path": null
    },
    "rouge": {
        "status": "Completed",
        "duration": "0:00:01.005115",
        "completed_lines": 3,
        "failed_lines": 0,
        "log_path": null
    }
}




### View Evaluation Results

The evaluation ran successfully in local mode. Let's examine the results:

In [21]:
import pandas as pd

# Display the aggregated metrics
print("=== Aggregated Metrics ===")
for metric_name, metric_value in result["metrics"].items():
    print(f"{metric_name}: {metric_value:.4f}")

# Display a sample of the detailed row-level results
print("\n=== Sample Row-Level Results ===")
results_df = pd.DataFrame(result["rows"])
print(results_df.head())

=== Aggregated Metrics ===
bleu.bleu_score: 0.2762
bleu.bleu_threshold: 0.5000
gleu.gleu_score: 0.3484
gleu.gleu_threshold: 0.5000
meteor.meteor_score: 0.7350
meteor.meteor_threshold: 0.5000
rouge.rouge_precision: 0.6667
rouge.rouge_recall: 0.5321
rouge.rouge_f1_score: 0.5914
rouge.rouge_precision_threshold: 0.5000
rouge.rouge_recall_threshold: 0.5000
rouge.rouge_f1_score_threshold: 0.5000
bleu.binary_aggregate: 0.0000
gleu.binary_aggregate: 0.0000
meteor.binary_aggregate: 1.0000
rouge.binary_aggregate: 0.6700

=== Sample Row-Level Results ===
                inputs.response              inputs.ground_truth  \
0      The cat sits on the mat.     A cat is sitting on the mat.   
1     She enjoys reading books.         She loves to read books.   
2  He quickly ran to the store.  He ran to the store in a hurry.   

   outputs.bleu.bleu_score outputs.bleu.bleu_result  \
0                 0.376850                     fail   
1                 0.109826                     fail   
2           

## Upload Results to Azure ML with MLflow

This section configures MLflow to use Azure AD authentication and uploads the evaluation results to Azure ML Studio for tracking and visualization.
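
`mlflow.log_metric` expects numeric values, so it can be useful to filter the aggregated metrics before logging them. The helper below is a hypothetical sketch (the name `numeric_metrics` is not part of any library) that keeps only finite real numbers and drops everything else.

```python
import math
from numbers import Real


def numeric_metrics(metrics):
    # Keep only finite real-valued entries; booleans are excluded on purpose.
    return {
        name: float(value)
        for name, value in metrics.items()
        if isinstance(value, Real)
        and not isinstance(value, bool)
        and math.isfinite(value)
    }


# Illustrative input: two valid metrics, one string, one NaN.
example = {
    "bleu.bleu_score": 0.2762,
    "meteor.binary_aggregate": 1.0,
    "note": "n/a",
    "bad": float("nan"),
}
print(numeric_metrics(example))
```

Applied to `result["metrics"]`, a filter like this would let the logging loop skip any entry that cannot be stored as a metric instead of failing partway through a run.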

In [None]:
# Install MLflow packages for Azure ML integration
%pip install -q mlflow azureml-mlflow

Collecting azure-storage-blob<=12.19.0,>=12.5.0 (from azureml-mlflow)
  Using cached azure_storage_blob-12.19.0-py3-none-any.whl.metadata (26 kB)
Using cached azure_storage_blob-12.19.0-py3-none-any.whl (394 kB)
Installing collected packages: azure-storage-blob
  Attempting uninstall: azure-storage-blob
    Found existing installation: azure-storage-blob 12.27.0
    Uninstalling azure-storage-blob-12.27.0:
      Successfully uninstalled azure-storage-blob-12.27.0
ERROR: pip's dependency resolver does not currently take in

In [24]:
import mlflow
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
import os
from dotenv import load_dotenv

# Load environment variables from the parent directory's .credentials.env
load_dotenv("../.credentials.env")

# Initialize credential
credential = DefaultAzureCredential()

# Get configuration from environment
subscription_id = os.environ.get("AZURE_SUBSCRIPTION_ID")
resource_group = os.environ.get("AZURE_RESOURCE_GROUP")
workspace_name = os.environ.get("AZURE_PROJECT_NAME")

print(f"Subscription: {subscription_id}")
print(f"Resource Group: {resource_group}")
print(f"Workspace/Project: {workspace_name}")

# Create MLClient to get the proper tracking URI
ml_client = MLClient(
    credential=credential,
    subscription_id=subscription_id,
    resource_group_name=resource_group,
    workspace_name=workspace_name
)

# Get the workspace and its MLflow tracking URI
workspace = ml_client.workspaces.get(workspace_name)
tracking_uri = workspace.mlflow_tracking_uri
print(f"\nMLflow Tracking URI: {tracking_uri}")

# Set the tracking URI for MLflow
mlflow.set_tracking_uri(tracking_uri)

# azureml-mlflow will use DefaultAzureCredential automatically
# when the tracking URI is from Azure ML
print("\n✓ MLflow configured with Azure AD authentication")
print("✓ Using DefaultAzureCredential for all operations")

Subscription: d2056f9b-aeff-4e5b-9010-d9f5fce1e8a9
Resource Group: Eval-Hack
Workspace/Project: Eval-Hack-Project


Overriding of current TracerProvider is not allowed
Overriding of current MeterProvider is not allowed
Attempting to instrument while already instrumented



MLflow Tracking URI: azureml://swedencentral.api.azureml.ms/mlflow/v2.0/subscriptions/d2056f9b-aeff-4e5b-9010-d9f5fce1e8a9/resourceGroups/Eval-Hack/providers/Microsoft.MachineLearningServices/workspaces/eval-hack-project

✓ MLflow configured with Azure AD authentication
✓ Using DefaultAzureCredential for all operations


In [None]:
import random

# Create an evaluation name for this MLflow run
evaluation_name = f"NLP-Eval-{random.randint(1111, 9999)}"

print(f"Uploading evaluation results to Azure ML: {evaluation_name}\n")

# Log the results to Azure ML using MLflow with AAD auth
with mlflow.start_run(run_name=evaluation_name) as run:
    # Log all metrics
    print("Metrics:")
    for metric_name, metric_value in result["metrics"].items():
        mlflow.log_metric(metric_name, metric_value)
        print(f"  • {metric_name}: {metric_value:.4f}")
    
    # Log parameters for context
    mlflow.log_param("num_samples", len(result["rows"]))
    mlflow.log_param("evaluators", "bleu, gleu, meteor, rouge")
    mlflow.log_param("data_file", "nlp_data.jsonl")
    
    # Get the run details
    run_id = run.info.run_id
    experiment_id = run.info.experiment_id
    experiment_name = mlflow.get_experiment(experiment_id).name
    
print(f"\n{'='*70}")
print(f"✅ Results uploaded to Azure ML!")
print(f"{'='*70}")
print(f"\nRun Details:")
print(f"  • Run ID: {run_id}")
print(f"  • Experiment: {experiment_name}")
print(f"  • Samples evaluated: {len(result['rows'])}")
print(f"\n🔗 View in Azure ML Studio:")
# Build the link from the Studio base URL; the MLflow tracking URI uses the azureml:// scheme and is not browsable
studio_url = f"https://ml.azure.com/runs/{run_id}?wsid=/subscriptions/{subscription_id}/resourcegroups/{resource_group}/workspaces/{workspace_name}&tid={os.environ.get('AZURE_TENANT_ID')}"
print(f"   {studio_url}")

Logging evaluation results to Azure ML as: NLP-Eval-MLflow-5173
Results from variable 'result' will be logged

Logging metrics:
  ✓ bleu.bleu_score = 0.2762
  ✓ bleu.bleu_threshold = 0.5000
  ✓ gleu.gleu_score = 0.3484
  ✓ gleu.gleu_threshold = 0.5000
  ✓ meteor.meteor_score = 0.7350
  ✓ meteor.meteor_threshold = 0.5000
  ✓ rouge.rouge_precision = 0.6667
  ✓ rouge.rouge_recall = 0.5321
  ✓ rouge.rouge_f1_score = 0.5914
  ✓ rouge.rouge_precision_threshold = 0.5000
  ✓ rouge.rouge_recall_threshold = 0.5000
  ✓ rouge.rouge_f1_score_threshold = 0.5000
  ✓

## Summary

**Workflow:**
1. ✅ Evaluations run locally using NLP evaluators (BLEU, GLEU, METEOR, ROUGE)
2. ✅ Results displayed in the notebook with detailed metrics
3. ✅ Metrics automatically uploaded to Azure ML using MLflow with Azure AD authentication
4. ✅ Results trackable in Azure ML Studio with full experiment history

**Why this approach?**
When storage accounts have key-based authentication disabled (a security best practice), uploading results through the `azure_ai_project` parameter of `evaluate` can fail. This notebook instead runs the evaluation locally and uploads the metrics through MLflow with Azure AD credentials, which keeps the workflow compatible with your organization's security policies.

## View Results in Azure ML Studio

1. Navigate to [Azure ML Studio](https://ml.azure.com)
2. Select your workspace: **Eval-Hack-Project**
3. Click on **Experiments** in the left navigation
4. Find your run (name starting with "NLP-Eval-")
5. View metrics, compare runs, and create visualizations