# Evaluate with quantitative NLP evaluators

## Objective
This notebook demonstrates how to use NLP-based evaluators to assess the quality of generated text by comparing it to reference text. By the end of this tutorial, you'll be able to:
 - Understand different NLP evaluators such as `BleuScoreEvaluator`, `GleuScoreEvaluator`, `MeteorScoreEvaluator`, and `RougeScoreEvaluator`.
 - Evaluate a dataset using these evaluators.
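
All four evaluators score a response by how much of its wording overlaps with a reference text. As a rough illustration of that idea, a unigram-overlap F1 in the spirit of ROUGE-1 can be sketched in plain Python. This is not the library's implementation, which handles tokenization and other details differently; the function name `unigram_f1` and the whitespace tokenization are assumptions made for this example.

```python
from collections import Counter


def unigram_f1(response: str, ground_truth: str) -> float:
    """Simplified ROUGE-1-style F1: clipped unigram overlap between two texts."""
    response_tokens = response.lower().split()
    reference_tokens = ground_truth.lower().split()
    if not response_tokens or not reference_tokens:
        return 0.0

    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(response_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(response_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


score = unigram_f1("the cat sat on the mat", "the cat is on the mat")
print(round(score, 3))
```

A score near 1 means the response reuses most of the reference's words; a score near 0 means little lexical overlap, even if the meaning is similar.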

## Time
You should expect to spend about 10 minutes running this notebook.

## Before you begin

### Installation
Install the following packages required to execute this notebook.

In [1]:
# Install the packages
%pip install azure-ai-evaluation

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
from pprint import pprint
from dotenv import load_dotenv
load_dotenv("../.credentials.env")

True
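
The `True` above indicates that `load_dotenv` set at least one variable, not that every variable this notebook reads is present. A small check like the following can surface missing values early. The helper name `missing_env_vars` is hypothetical, and the list only covers the project variables used in the next cell; extend it with the endpoint and key variables you rely on.

```python
import os

# Variable names taken from the cells in this notebook; extend as needed.
REQUIRED_VARS = [
    "AZURE_SUBSCRIPTION_ID",
    "AZURE_RESOURCE_GROUP",
    "AZURE_PROJECT_NAME",
]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```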

## NLP Evaluators

In [3]:
# Initialize the Azure AI project and Azure OpenAI connection from your environment variables
azure_ai_project = {
    "subscription_id": os.environ.get("AZURE_SUBSCRIPTION_ID"),
    "resource_group_name": os.environ.get("AZURE_RESOURCE_GROUP"),
    "project_name": os.environ.get("AZURE_PROJECT_NAME"),
}

## Set up env vars for model endpoints and keys

In [4]:
env_var = { 
    "gpt-35-turbo": {
        "endpoint": os.environ.get("AZURE_OPENAI_GPT35_ENDPOINT"),
        "key": os.environ.get("AZURE_OPENAI_GPT35_API_KEY"),
    },
    "gpt-4": {
        "endpoint": os.environ.get("AZURE_OPENAI_GPT4_ENDPOINT"),
        "key": os.environ.get("AZURE_OPENAI_GPT4_API_KEY"),
    },
    "gpt-4o": {
        "endpoint": os.environ.get("AZURE_OPENAI_GPT4o_ENDPOINT"),
        "key": os.environ.get("AZURE_OPENAI_GPT4o_API_KEY"),
    },
    "gpt-4o-mini": {
        "endpoint": os.environ.get("AZURE_OPENAI_GPT4o-mini_ENDPOINT"),
        "key": os.environ.get("AZURE_OPENAI_GPT4o-mini_API_KEY"),
    },
}
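
A later cell in this notebook appends the deployment path to these base endpoints with string concatenation. That raises a `TypeError` when a variable is unset and produces a malformed URL when the base endpoint lacks a trailing slash. A small helper can make both cases explicit. `build_chat_url` is a hypothetical name; the path layout and `api-version` value mirror the ones used later in this notebook.

```python
def build_chat_url(base_endpoint, deployment, api_version="2024-02-15-preview"):
    """Join an Azure OpenAI base endpoint and deployment name into a chat completions URL."""
    if not base_endpoint or not deployment:
        raise ValueError("Both the base endpoint and the deployment name must be set")
    # Normalize the base so the result has exactly one slash before "openai".
    base = base_endpoint.rstrip("/")
    return f"{base}/openai/deployments/{deployment}/chat/completions?api-version={api_version}"


# Placeholder values for illustration only.
print(build_chat_url("https://example.openai.azure.com/", "my-deployment"))
```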

In [5]:
with open("target_nlp_api/target_nlp_api.py") as fin:
    print(fin.read())

import requests
from typing_extensions import Self
from typing import TypedDict
from promptflow.tracing import trace


class ModelEndpoints:
    def __init__(self: Self, env: dict, model_type: str) -> str:
        self.env = env
        self.model_type = model_type

    class Response(TypedDict):
        query: str
        response: str

    @trace
    def __call__(self: Self, query: str) -> Response:
        if self.model_type == "gpt-4":
            output = self.call_gpt4_endpoint(query)
        elif self.model_type == "gpt-35-turbo":
            output = self.call_gpt35_turbo_endpoint(query)
        elif self.model_type == "gpt-4o":
            output = self.call_gpt4o_endpoint(query)
        elif self.model_type == "gpt-4o-mini":
            output = self.call_gpt4o_mini_endpoint(query)
        else:
            output = self.call_default_endpoint(query)

        return output

    def query(self: Self, endpoint: str, headers: str, payload: str) -> str:
        response = requests

In [6]:
from target_nlp_api.target_nlp_api import ModelEndpoints
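
`ModelEndpoints` serves as the `target` for evaluation: a callable that takes a `query` and returns a dictionary with `query` and `response` keys. For a dry run of the pipeline without calling any endpoint, a stand-in with the same shape could be used. `EchoTarget` below is a hypothetical stub, not part of the notebook's `target_nlp_api` module.

```python
from typing import TypedDict


class Response(TypedDict):
    query: str
    response: str


class EchoTarget:
    """Hypothetical offline target that mirrors the ModelEndpoints call shape."""

    def __init__(self, prefix: str = "echo: ") -> None:
        self.prefix = prefix

    def __call__(self, query: str) -> Response:
        # A real target would call a model endpoint here.
        return {"query": query, "response": self.prefix + query}


target = EchoTarget()
print(target("What is BLEU?"))
```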

In [None]:
from azure.ai.evaluation import (
    evaluate,
    BleuScoreEvaluator,
    GleuScoreEvaluator,
    MeteorScoreEvaluator,
    RougeScoreEvaluator,
    RougeType,
)
import random

from target_nlp_api.target_nlp_api import ModelEndpoints

# Re-initialize env_var with FULL endpoint URLs including deployment names
env_var = { 
    "gpt-35-turbo": {
        "endpoint": os.environ.get("AZURE_OPENAI_GPT35_ENDPOINT") + "openai/deployments/" + os.environ.get("AZURE_OPENAI_GPT35_DEPLOYMENT") + "/chat/completions?api-version=2024-02-15-preview",
        "key": os.environ.get("AZURE_OPENAI_GPT35_API_KEY"),
    },
    "gpt-4": {
        "endpoint": os.environ.get("AZURE_OPENAI_GPT4_ENDPOINT") + "openai/deployments/" + os.environ.get("AZURE_OPENAI_GPT4_DEPLOYMENT") + "/chat/completions?api-version=2024-02-15-preview",
        "key": os.environ.get("AZURE_OPENAI_GPT4_API_KEY"),
    },
    "gpt-4o": {
        "endpoint": os.environ.get("AZURE_OPENAI_GPT4o_ENDPOINT") + "openai/deployments/" + os.environ.get("AZURE_OPENAI_GPT4o_DEPLOYMENT") + "/chat/completions?api-version=2024-02-15-preview",
        "key": os.environ.get("AZURE_OPENAI_GPT4o_API_KEY"),
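    },
    # NOTE: this entry reuses the gpt-4o variables, so "gpt-4o-mini" is served by the
    # gpt-4o deployment unless you point it at dedicated gpt-4o-mini variables.
    "gpt-4o-mini": {
        "endpoint": os.environ.get("AZURE_OPENAI_GPT4o_ENDPOINT") + "openai/deployments/" + os.environ.get("AZURE_OPENAI_GPT4o_DEPLOYMENT") + "/chat/completions?api-version=2024-02-15-preview",
        "key": os.environ.get("AZURE_OPENAI_GPT4o_API_KEY"),
    },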
}

models = ["gpt-35-turbo","gpt-4","gpt-4o","gpt-4o-mini"]

for model in models:
    print(" Evaluating NLP metrics - ", model)
    print("-----------------------------------")
    run_suffix = random.randint(1111, 9999)  # random suffix to keep run names distinct
    result = evaluate(
        # azure_ai_project=azure_ai_project,  # uncomment to log results to your Azure AI project
        data="ai_data.jsonl",
        evaluation_name="NLP-" + model.title() + "_Run-" + str(run_suffix),
        target=ModelEndpoints(env_var, model),

        evaluators={
            "bleu": BleuScoreEvaluator(),
            "gleu": GleuScoreEvaluator(),
            "meteor": MeteorScoreEvaluator(alpha=0.9, beta=3.0, gamma=0.5),
            "rouge": RougeScoreEvaluator(rouge_type=RougeType.ROUGE_1),
        },
        evaluator_config={
            "bleu": {
                "column_mapping": {
                    "ground_truth": "${data.ground_truth}",
                    "response": "${target.response}"}
            },
        }
    )

 Evaluating NLP metrics -  gpt-35-turbo
-----------------------------------
2025-10-21 08:42:39 +0000 140016922646208 execution.bulk     INFO     Finished 1 / 4 lines.
2025-10-21 08:42:39 +0000 140016922646208 execution.bulk     INFO     Average execution time for completed lines: 0.9 seconds. Estimated time for incomplete lines: 2.7 seconds.
2025-10-21 08:42:39 +0000 140016922646208 execution.bulk     INFO     Finished 2 / 4 lines.
2025-10-21 08:42:39 +0000 140016922646208 execution.bulk     INFO     Average execution time for completed lines: 0.51 seconds. Estimated time for incomplete lines: 1.02 seconds.

Aggregated metrics for evaluator is not a dictionary will not be logged as metrics



Run name: "gleu_20251021_084241_942989"
Run status: "Completed"
Start time: "2025-10-21 08:42:41.942989+00:00"
Duration: "0:00:01.008575"


Run name: "bleu_20251021_084241_946498"
Run status: "Completed"
Start time: "2025-10-21 08:42:41.946498+00:00"
Duration: "0:00:01.011501"


Run name: "rouge_20251021_084241_956636"
Run status: "Completed"
Start time: "2025-10-21 08:42:41.956636+00:00"
Duration: "0:00:01.029718"

2025-10-21 08:42:45 +0000 140016896419520 execution.bulk     INFO     Finished 4 / 4 lines.
2025-10-21 08:42:45 +0000 140016896419520 execution.bulk     INFO     Average execution time for completed lines: 0.85 seconds. Estimated time for incomplete lines: 0.0 seconds.





Run name: "meteor_20251021_084241_965495"
Run status: "Completed"
Start time: "2025-10-21 08:42:41.965495+00:00"
Duration: "0:00:03.413234"


{
    "bleu": {
        "status": "Completed",
        "duration": "0:00:01.011501",
        "completed_lines": 4,
        "failed_lines": 0,
        "log_path": null
    },
    "gleu": {
        "status": "Completed",
        "duration": "0:00:01.008575",
        "completed_lines": 4,
        "failed_lines": 0,
        "log_path": null
    },
    "meteor": {
        "status": "Completed",
        "duration": "0:00:03.413234",
        "completed_lines": 4,
        "failed_lines": 0,
        "log_path": null
    },
    "rouge": {
        "status": "Completed",
        "duration": "0:00:01.029718",
        "completed_lines": 4,
        "failed_lines": 0,
        "log_path": null
    }
}




EvaluationException: (UserError) Failed to upload evaluation run to the cloud due to insufficient permission to access the storage. Please ensure that the necessary access rights are granted.
Visit https://aka.ms/azsdk/python/evaluation/remotetracking/troubleshoot to troubleshoot this issue.
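
The `evaluate` call above reads `ai_data.jsonl`, which holds one JSON object per line. Judging by the column mapping and the target's signature, each row presumably carries a `query` for the target and a `ground_truth` for the evaluators. The file's actual contents are not shown in this notebook, so the rows below are placeholders, written to a separate, hypothetical file name in a temporary directory.

```python
import json
import tempfile
from pathlib import Path

# Placeholder rows; replace with your own queries and reference answers.
rows = [
    {"query": "What is the capital of France?", "ground_truth": "Paris is the capital of France."},
    {"query": "What color is the sky on a clear day?", "ground_truth": "The sky is blue on a clear day."},
]

# Written to a temporary directory so the notebook's own data file is untouched.
data_path = Path(tempfile.mkdtemp()) / "sample_data.jsonl"
with data_path.open("w", encoding="utf-8") as fout:
    for row in rows:
        fout.write(json.dumps(row) + "\n")

# Read the file back to confirm that every line parses as a JSON object.
with data_path.open(encoding="utf-8") as fin:
    loaded = [json.loads(line) for line in fin]
print(len(loaded), "rows written to", data_path)
```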

View the results below. Alternatively, you can view them in AI Foundry.

In [None]:
from pprint import pprint

pprint(result)
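
Because `result` is reassigned on every pass of the loop, `pprint(result)` shows only the last model evaluated (`gpt-4o-mini`). One way to keep every run is to store each result under its model name. In the sketch below, `run_evaluation` is a hypothetical stand-in for the `evaluate(...)` call; it returns an empty placeholder so the snippet runs without Azure access, and its `metrics` key is assumed to mirror the dictionary that `evaluate` returns.

```python
from pprint import pprint

models = ["gpt-35-turbo", "gpt-4", "gpt-4o", "gpt-4o-mini"]


def run_evaluation(model: str) -> dict:
    """Hypothetical stand-in for evaluate(...); returns a placeholder, not real scores."""
    return {"model": model, "metrics": {}}


# Keep one result per model instead of overwriting a single variable.
results = {model: run_evaluation(model) for model in models}

for model, model_result in results.items():
    print(model)
    pprint(model_result["metrics"])
```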