# Cloud evaluation: Evaluating AI app data remotely in the cloud 

## Objective

This tutorial provides a step-by-step guide on how to evaluate data generated by AI applications or LLMs remotely in the cloud. 

This tutorial uses the following Azure AI services:

- [Azure AI Safety Evaluation](https://aka.ms/azureaistudiosafetyeval)
- [azure-ai-evaluation](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk)

## Time

You should expect to spend 20 minutes running this sample. 

## About this example

This example demonstrates cloud evaluation of query and response pairs generated by an AI application or an LLM. You need access to Azure OpenAI credentials and an Azure AI project. **To create data for your own evaluation, see the [simulator documentation](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/simulator-interaction-data).** This example demonstrates:

- Single-instance, triggered cloud evaluation on a test dataset (to be used for pre-deployment evaluation of an AI application).

## Before you begin
### Prerequisites
- An Azure OpenAI deployment of a GPT model that supports `chat completion`, for example `gpt-4`.
- An active Azure sign-in: run `az login` to log in to your Azure subscription first.
- Test data to evaluate, containing the user queries and responses (and optionally context or ground truth) from your AI application. See [data requirements for our built-in evaluators](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk#data-requirements-for-built-in-evaluators). Alternatively, to simulate data against your application endpoints with the Azure AI Evaluation SDK, see our samples on simulation.
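
As a minimal sketch of what such test data can look like, the snippet below writes two rows to a JSON Lines file and checks that each row carries the fields the evaluators will read. The field names `query`, `response`, `context`, and `ground_truth` follow the description above; the file name `sample_eval_data.jsonl`, the row contents, and the helper `missing_fields` are illustrative assumptions. The exact fields each evaluator needs are listed in the linked data requirements.

```python
import json
import os
import tempfile

# Illustrative rows (made-up content): one JSON object per line.
rows = [
    {
        "query": "What is the capital of France?",
        "response": "Paris is the capital of France.",
        "context": "France is a country in Western Europe. Its capital is Paris.",
        "ground_truth": "Paris",
    },
    {
        "query": "Which planet is closest to the Sun?",
        "response": "Mercury is the closest planet to the Sun.",
        "context": "Mercury is the first planet from the Sun.",
        "ground_truth": "Mercury",
    },
]


def missing_fields(row, required=("query", "response")):
    """Return the required field names that are absent or empty in a row."""
    return [name for name in required if not row.get(name)]


# Write the rows as JSON Lines into a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sample_eval_data.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read the file back and collect rows that lack a required field.
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

problems = {i: missing_fields(r) for i, r in enumerate(loaded) if missing_fields(r)}
print(f"{len(loaded)} rows loaded, {len(problems)} with missing fields")
```

Checking the file locally before uploading is cheap and avoids waiting on a cloud job that fails because a column is missing.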

### Installation

Install the following packages required to execute this notebook. 

```
pip install -r requirements.txt
```

In [1]:
!az login --tenant 16b3c013-d300-468d-ac64-7eda0820b6d3


In [2]:
import os

from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import (
    EvaluatorConfiguration,
    EvaluatorIds,
)
from azure.identity import DefaultAzureCredential
from dotenv import load_dotenv

load_dotenv(override=True)

True

### Configuration

Set the following variables for use in this notebook:

In [3]:
# Project endpoint, in the format "https://<account_name>.services.ai.azure.com/api/projects/<project_name>"
azure_ai_connection_string = os.environ.get("AZURE_AI_PROJECT_URL")
# Name of your Azure OpenAI deployment; it must be a GPT model
azure_openai_deployment = os.environ.get("AZURE_OPENAI_DEPLOYMENT")
azure_openai_api_version = os.environ.get("AZURE_OPENAI_API_VERSION")
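
Because `os.environ.get` returns `None` for an unset variable, a missing setting only surfaces later as a service error that can be hard to trace. A small guard like the following sketch can fail early instead. The helper name `missing_settings` is hypothetical, and the list of names simply mirrors the environment variables read in this notebook.

```python
import os

# Environment variables read in this notebook.
REQUIRED_SETTINGS = [
    "AZURE_AI_PROJECT_URL",
    "AZURE_OPENAI_DEPLOYMENT",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_KEY",
]


def missing_settings(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_settings(REQUIRED_SETTINGS)
if missing:
    print("Set these variables (for example in your .env file):", ", ".join(missing))
else:
    print("All required settings are present.")
```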

In [4]:

# Optional: name and version of an existing dataset to reuse
dataset_name = os.environ.get("DATASET_NAME", "dataset-test")
dataset_version = os.environ.get("DATASET_VERSION", "1.0")

### Connect to your Azure OpenAI deployment
To evaluate your LLM-generated data remotely in the cloud, you need to connect to your Azure AI project and to an Azure OpenAI deployment. The deployment must be a GPT model that supports `chat completion`, such as `gpt-4`. To find the value for `AZURE_AI_PROJECT_URL`, go to the Overview page of your Azure AI project and copy the project endpoint.

In [5]:


# Create the project client (Foundry project and credentials)
project_client = AIProjectClient(
    endpoint=azure_ai_connection_string,
    credential=DefaultAzureCredential(),
)
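
The evaluation output later in this notebook shows a project endpoint of the form `https://<account>.services.ai.azure.com/api/projects/<project>`. As a rough sanity check before creating the client, the sketch below parses a URL of that shape. The helper `parse_project_endpoint` and the example URL are hypothetical, and other endpoint shapes may be valid for your resource.

```python
from urllib.parse import urlparse


def parse_project_endpoint(url):
    """Split a project endpoint into (host, project_name).

    Assumes the shape https://<host>/api/projects/<project> and raises
    ValueError for anything else.
    """
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError(f"Expected an https URL, got: {url!r}")
    if len(parts) != 3 or parts[:2] != ["api", "projects"]:
        raise ValueError(
            f"Expected a path like /api/projects/<project>, got: {parsed.path!r}"
        )
    return parsed.netloc, parts[2]


# Hypothetical endpoint used only for illustration.
host, project = parse_project_endpoint(
    "https://my-account.services.ai.azure.com/api/projects/my-project"
)
print(host, project)
```

A semicolon-separated connection string does not pass this check, which makes it easy to spot a stale value in a `.env` file.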

### Data
The following code demonstrates how to upload the data for evaluation to your Azure AI project. Below we use `evaluate_test_data.jsonl` which exemplifies LLM-generated data in the query-response format expected by the Azure AI Evaluation SDK. For your use case, you should upload data in the same format, which can be generated using the [`Simulator`](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/simulator-interaction-data) from Azure AI Evaluation SDK. 

Alternatively, if you already have a dataset for evaluation, you can reuse it: look it up in your [registry](https://ml.azure.com/registries), or find its dataset ID by listing the datasets in your project.

In [6]:
# List each dataset with name, version and id
for ds in project_client.datasets.list():
    print(f"- {ds.name}  (version: {ds.version}, id: {ds.id})")

- nw_dataset  (version: 1.0, id: azureai://accounts/ai-jacwang-1965/projects/evalproj/data/nw_dataset/versions/1.0)
- eval-data  (version: 1.0, id: azureai://accounts/ai-jacwang-1965/projects/evalproj/data/eval-data/versions/1.0)
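
The dataset IDs printed above follow the pattern `azureai://accounts/<account>/projects/<project>/data/<name>/versions/<version>`. If you need the individual parts, for instance to confirm which version an ID points to, a sketch like the following can split them out. The helper `parse_dataset_id` and the example ID are hypothetical, and the pattern is inferred from the output above rather than from a documented contract.

```python
import re

# Pattern inferred from the dataset IDs printed by the listing cell.
DATASET_ID_PATTERN = re.compile(
    r"^azureai://accounts/(?P<account>[^/]+)/projects/(?P<project>[^/]+)"
    r"/data/(?P<name>[^/]+)/versions/(?P<version>[^/]+)$"
)


def parse_dataset_id(dataset_id):
    """Return the account, project, name, and version encoded in a dataset ID."""
    match = DATASET_ID_PATTERN.match(dataset_id)
    if match is None:
        raise ValueError(f"Unrecognized dataset ID: {dataset_id!r}")
    return match.groupdict()


# Hypothetical ID used only for illustration.
parts = parse_dataset_id(
    "azureai://accounts/my-account/projects/my-project/data/eval-data/versions/1.0"
)
print(parts["name"], parts["version"])
```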


In [7]:
# Upload a local JSONL file; if this name and version already exist, reuse the registered dataset
try:
    data_id = project_client.datasets.upload_file(
        name="eval-data",
        version=dataset_version,
        file_path="../../data/evaluate_test_data.jsonl",
    ).id
    print(f"Successfully uploaded dataset with ID: {data_id}")
except Exception as e:
    if "409" in str(e) or "Conflict" in str(e):
        print(f"Dataset 'eval-data' with version '{dataset_version}' already exists.")
        # Retrieve the existing dataset
        for ds in project_client.datasets.list():
            if ds.name == "eval-data" and ds.version == dataset_version:
                data_id = ds.id
                print(f"Using existing dataset with ID: {data_id}")
                break
        else:
            # No matching dataset was found, so data_id would be undefined
            raise
    else:
        print(f"Error uploading dataset: {e}")
        raise

Dataset 'eval-data' with version '1.0' already exists.
Using existing dataset with ID: azureai://accounts/ai-jacwang-1965/projects/evalproj/data/eval-data/versions/1.0


### Configure Evaluators to Run
The code below shows how to configure the evaluators you want to run in AI Foundry. Note that every `EvaluatorConfiguration` must be configured correctly for the cloud evaluation job as a whole to succeed.

In [8]:

model_config = {
    "azure_endpoint": os.environ.get("AZURE_AI_PROJECT_URL"), # https://<account_name>.services.ai.azure.com
    "api_key": os.environ.get("AZURE_OPENAI_KEY"),
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
}
        

In [9]:
from azure.ai.projects.models import EvaluatorConfiguration, EvaluatorIds
from azure.ai.evaluation import RougeType

# Built-in evaluator configurations
evaluators = {
    "relevance": EvaluatorConfiguration(
        id=EvaluatorIds.RELEVANCE.value,
        init_params={"model_config": model_config},
    ),
    "coherence": EvaluatorConfiguration(
        id=EvaluatorIds.COHERENCE.value,
        init_params={"model_config": model_config},
    ),
    "response_completeness": EvaluatorConfiguration(
        id=EvaluatorIds.RESPONSE_COMPLETENESS.value,
        init_params={"model_config": model_config},
    ),
    "intent_resolution": EvaluatorConfiguration(
        id=EvaluatorIds.INTENT_RESOLUTION.value,
        init_params={"model_config": model_config},
    ),
    "hate_unfairness": EvaluatorConfiguration(
        id=EvaluatorIds.HATE_UNFAIRNESS.value,
        init_params={"azure_ai_project": azure_ai_connection_string},
    ),
    "violence": EvaluatorConfiguration(
        id=EvaluatorIds.VIOLENCE.value,
        init_params={"azure_ai_project": azure_ai_connection_string},
    ),
    "content_safety": EvaluatorConfiguration(
        id=EvaluatorIds.CONTENT_SAFETY.value,
        init_params={"azure_ai_project": azure_ai_connection_string},
    ),
    "self_harm": EvaluatorConfiguration(
        id=EvaluatorIds.SELF_HARM.value,
        init_params={"azure_ai_project": azure_ai_connection_string},
    ),
    "sexual": EvaluatorConfiguration(
        id=EvaluatorIds.SEXUAL.value,
        init_params={"azure_ai_project": azure_ai_connection_string},
    ),
    "protected_material": EvaluatorConfiguration(
        id=EvaluatorIds.PROTECTED_MATERIAL.value,
        init_params={"azure_ai_project": azure_ai_connection_string},
    ),
    "bleu_score": EvaluatorConfiguration(
        id=EvaluatorIds.BLEU_SCORE.value
    ),
    "f1_score": EvaluatorConfiguration(
        id=EvaluatorIds.F1_SCORE.value
    ),
    "gleu_score": EvaluatorConfiguration(
        id=EvaluatorIds.GLEU_SCORE.value
    ),
    "meteor_score": EvaluatorConfiguration(
        id=EvaluatorIds.METEOR_SCORE.value,
        init_params={"alpha": 0.8},
    ),
    "rouge_score": EvaluatorConfiguration(
        id=EvaluatorIds.ROUGE_SCORE.value,
        init_params={"rouge_type": RougeType.ROUGE_4},
    ),
    "groundedness": EvaluatorConfiguration(
        id=EvaluatorIds.GROUNDEDNESS.value,
        init_params={"azure_ai_project": azure_ai_connection_string},
    ),
    "groundedness_pro": EvaluatorConfiguration(
        id=EvaluatorIds.GROUNDEDNESS_PRO.value,
        init_params={"azure_ai_project": azure_ai_connection_string},
    ),
    "code_vulnerability": EvaluatorConfiguration(
        id=EvaluatorIds.CODE_VULNERABILITY.value,
        init_params={"azure_ai_project": azure_ai_connection_string},
    ),
    "fluency": EvaluatorConfiguration(
        id=EvaluatorIds.FLUENCY.value
    ),
    "indirect_attack": EvaluatorConfiguration(
        id=EvaluatorIds.INDIRECT_ATTACK.value
    ),
    "retrieval": EvaluatorConfiguration(
        id=EvaluatorIds.RETRIEVAL.value,
        init_params={"azure_ai_project": azure_ai_connection_string},
    ),
    "similarity": EvaluatorConfiguration(
        id=EvaluatorIds.SIMILARITY.value
    ),
    "document_retrieval": EvaluatorConfiguration(
        id=EvaluatorIds.DOCUMENT_RETRIEVAL.value,
        init_params={"azure_ai_project": azure_ai_connection_string},
    ),
    "tool_call_accuracy": EvaluatorConfiguration(
        id=EvaluatorIds.TOOL_CALL_ACCURACY.value
    ),
}
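
Each `EvaluatorConfiguration` also accepts a `data_mapping` (it appears as the empty `dataMapping` in the service response shown later), with expressions such as `${data.Input}` pointing an evaluator input at a dataset column. To build intuition for what such a mapping expresses, the sketch below resolves expressions of that form locally against a single row. This is only an illustration of the idea, not the service's implementation; `resolve_mapping` and the column names `Input` and `Output` are hypothetical.

```python
import re

# Matches expressions of the form ${data.<column>}.
_DATA_REF = re.compile(r"^\$\{data\.([A-Za-z_][A-Za-z0-9_]*)\}$")


def resolve_mapping(data_mapping, row):
    """Resolve {evaluator_input: "${data.<column>}"} against one dataset row."""
    resolved = {}
    for evaluator_input, expression in data_mapping.items():
        match = _DATA_REF.match(expression)
        if match is None:
            raise ValueError(f"Unsupported expression: {expression!r}")
        column = match.group(1)
        if column not in row:
            raise KeyError(f"Column {column!r} is missing from the row")
        resolved[evaluator_input] = row[column]
    return resolved


# Hypothetical row whose columns are named differently from the evaluator inputs.
row = {"Input": "What is the capital of France?", "Output": "Paris."}
mapping = {"query": "${data.Input}", "response": "${data.Output}"}
print(resolve_mapping(mapping, row))
```

A mapping like this is only needed when your dataset columns do not already use the names the evaluator expects.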

In [None]:
# Evaluators verified to work; use these to test in a supported region such as East US 2
'''from azure.ai.projects.models import (
    EvaluatorConfiguration,
    EvaluatorIds,
)
from azure.ai.evaluation import RougeScoreEvaluator, RougeType


# Built-in evaluator configurations
evaluators = {
    # "relevance": EvaluatorConfiguration(
    #     id=EvaluatorIds.RELEVANCE.value,
    #     init_params={"model_config": model_config},
    #     #data_mapping={"query": "${data.Input}", "response": "${data.Output}"},
    # ),
    "hate_unfairness": EvaluatorConfiguration(
        id=EvaluatorIds.HATE_UNFAIRNESS.value,
        init_params={"azure_ai_project": azure_ai_connection_string},
    ),
    "violence": EvaluatorConfiguration(
        id=EvaluatorIds.VIOLENCE.value,
        init_params={"azure_ai_project": azure_ai_connection_string},
    ),
    # "groundedness": EvaluatorConfiguration(
    #     id=EvaluatorIds.GROUNDEDNESS.value,
    #     init_params={"azure_ai_project": azure_ai_connection_string},
    # ),
    # "groundedness_pro": EvaluatorConfiguration(
    #     id=EvaluatorIds.GROUNDEDNESS_PRO.value,
    #     init_params={"azure_ai_project": azure_ai_connection_string},
    # ),
    "bleu_score": EvaluatorConfiguration(
        id=EvaluatorIds.BLEU_SCORE.value,
        # init_params={"azure_ai_project": azure_ai_connection_string},
    ),
    # "code_vulnerability": EvaluatorConfiguration(
    #     id=EvaluatorIds.CODE_VULNERABILITY.value,
    #     init_params={"azure_ai_project": azure_ai_connection_string},
    # ),
    # "coherence": EvaluatorConfiguration(
    #     id=EvaluatorIds.COHERENCE.value,
    #     init_params={"azure_ai_project": azure_ai_connection_string},
    # ),
    # "content_safety": EvaluatorConfiguration(
    #     id=EvaluatorIds.CONTENT_SAFETY.value,
    #     init_params={"azure_ai_project": azure_ai_connection_string},
    # ),
    "f1_score": EvaluatorConfiguration(
        id=EvaluatorIds.F1_SCORE.value,
        # init_params={"azure_ai_project": azure_ai_connection_string},
    ),
    # "fluency": EvaluatorConfiguration(
    #     id=EvaluatorIds.FLUENCY.value,
    #     init_params={"azure_ai_project": azure_ai_connection_string},
    # ),
    "gleu_score": EvaluatorConfiguration(
        id=EvaluatorIds.GLEU_SCORE.value,
        # init_params={"azure_ai_project": azure_ai_connection_string},
    ),
    # "indirect_attack": EvaluatorConfiguration(
    #     id=EvaluatorIds.INDIRECT_ATTACK.value,
    #     init_params={"azure_ai_project": azure_ai_connection_string},
    # ),
    # "intent_resolution": EvaluatorConfiguration(
    #     id=EvaluatorIds.INTENT_RESOLUTION.value,
    #     init_params={"azure_ai_project": azure_ai_connection_string},
    # ),
    "meteor_score": EvaluatorConfiguration(
        id=EvaluatorIds.METEOR_SCORE.value,
        init_params={"alpha": 0.8},
    ),
    # "protected_material": EvaluatorConfiguration(
    #     id=EvaluatorIds.PROTECTED_MATERIAL.value,
    #     init_params={"azure_ai_project": azure_ai_connection_string},
    # ),
    # "retrieval": EvaluatorConfiguration(
    #     id=EvaluatorIds.RETRIEVAL.value,
    #     init_params={"azure_ai_project": azure_ai_connection_string},
    # ),
    "rouge_score": EvaluatorConfiguration(
        id=EvaluatorIds.ROUGE_SCORE.value,
        init_params={"rouge_type": RougeType.ROUGE_4},
    ),
    "self_harm": EvaluatorConfiguration(
        id=EvaluatorIds.SELF_HARM.value,
        init_params={"azure_ai_project": azure_ai_connection_string},
    ),
    "sexual": EvaluatorConfiguration(
        id=EvaluatorIds.SEXUAL.value,
        init_params={"azure_ai_project": azure_ai_connection_string},
    ),
    # "similarity": EvaluatorConfiguration(
    #     id=EvaluatorIds.SIMILARITY.value,
    #     init_params={"azure_ai_project": azure_ai_connection_string},
    # ),
}
'''

### Create cloud evaluation
Below we demonstrate how to trigger a single-instance cloud evaluation remotely on a test dataset. This can be used for pre-deployment testing of your AI application.

We pass in the `data_id` of the dataset to evaluate.

In [10]:
from azure.ai.projects.models import (
    Evaluation,
    InputDataset
)

# Create an evaluation with the dataset and evaluators specified
evaluation = Evaluation(
    display_name="Cloud evaluation",
    description="Evaluation of dataset",
    data=InputDataset(id=data_id),
    evaluators=evaluators,
)

# Run the evaluation 
evaluation_response = project_client.evaluations.create(
    evaluation,
    headers={
        "model-endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
        "api-key": os.environ.get("AZURE_OPENAI_KEY"),
    },
)

print("Created evaluation:", evaluation_response.name)
print("Status:", evaluation_response.status)

HttpResponseError: (UserError) Invalid connectionId format: 335nuZ4bUIaJgL6iyOjJN4Ggxk5jxX5v2GXGlYWHrMt11OZQWr04JQQJ99BEACHYHv6XJ3w3AAAAACOGYS8f. Expected 'connections/{name}' or 'connections/{name}/...' is invalid
Code: UserError
Message: Invalid connectionId format: 335nuZ4bUIaJgL6iyOjJN4Ggxk5jxX5v2GXGlYWHrMt11OZQWr04JQQJ99BEACHYHv6XJ3w3AAAAACOGYS8f. Expected 'connections/{name}' or 'connections/{name}/...' is invalid

In [15]:
evaluation_response

{'data': {'id': 'azureai://accounts/ai-jacwang-1965/projects/evalproj/data/eval-data/versions/1.0', 'type': 'Dataset'}, 'target': None, 'description': 'Evaluation of dataset', 'evaluators': {'hate_unfairness': {'id': 'azureai://built-in/evaluators/hate_unfairness', 'initParams': {'azure_ai_project': 'https://ai-jacwang-1965.services.ai.azure.com/api/projects/evalproj'}, 'dataMapping': {}}, 'violence': {'id': 'azureai://built-in/evaluators/violence', 'initParams': {'azure_ai_project': 'https://ai-jacwang-1965.services.ai.azure.com/api/projects/evalproj'}, 'dataMapping': {}}, 'bleu_score': {'id': 'azureai://built-in/evaluators/bleu_score', 'initParams': {}, 'dataMapping': {}}, 'f1_score': {'id': 'azureai://built-in/evaluators/f1_score', 'initParams': {}, 'dataMapping': {}}, 'gleu_score': {'id': 'azureai://built-in/evaluators/gleu_score', 'initParams': {}, 'dataMapping': {}}, 'meteor_score': {'id': 'azureai://built-in/evaluators/meteor_score', 'initParams': {'alpha': 0.8}, 'dataMapping': 

In [12]:
# Inspect the response returned when the evaluation was created
get_evaluation_response = evaluation_response

print("----------------------------------------------------------------")
print("Evaluation status: ", get_evaluation_response.status)
print("AI Foundry Portal URI: ", get_evaluation_response.properties["AiStudioEvaluationUri"])
print("----------------------------------------------------------------")

----------------------------------------------------------------
Evaluation status:  NotStarted
AI Foundry Portal URI:  https://ai.azure.com/resource/build/evaluation/8615bd26-9b45-4abc-8973-0475773ded38?wsid=/subscriptions/6025ba02-1dfd-407f-b358-88f811c7c7aa/resourceGroups/aigent_eus/providers/Microsoft.CognitiveServices/accounts/ai-jacwang-1965/projects/evalproj&tid=16b3c013-d300-468d-ac64-7eda0820b6d3
----------------------------------------------------------------
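
Right after submission the status is `NotStarted`, so you need to wait for the job to finish before results exist. A plain `while True` loop works, but it never gives up if the job stalls. The sketch below adds a timeout around the same idea. The function `wait_for_terminal_status` and its parameters are hypothetical, the terminal states are the ones checked in this notebook's polling loop, and the intermediate `Running` status in the demo is an assumption used only to drive the fake sequence.

```python
import time

# Terminal states checked by the polling loop in this notebook.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_terminal_status(
    get_status,
    timeout_s=1800,
    interval_s=10,
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Call get_status() until it returns a terminal state or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Still in state {status!r} after {timeout_s} seconds")
        sleep(interval_s)


# Demo with a fake status sequence instead of a real service call.
fake_statuses = iter(["NotStarted", "Running", "Completed"])
final = wait_for_terminal_status(lambda: next(fake_statuses), interval_s=0)
print(final)
```

With the client from this notebook, `get_status` could be `lambda: project_client.evaluations.get(evaluation_response.name).status`.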


## Retrieve and download evaluation results

In [None]:
import time

import requests

# Poll until the evaluation reaches a terminal state
eval_name = evaluation_response.name
while True:
    ev = project_client.evaluations.get(eval_name)
    status = ev.status
    print(f"Evaluation status: {status}")
    if status in ["Completed", "Failed", "Canceled"]:
        break
    time.sleep(10)

# List evaluation result versions for this evaluation and keep the first one
print("Evaluation result versions:")
result = None
for res in project_client.evaluation_results.list_versions(eval_name):
    print(f"- version: {res.version}, blob URI: {res.blob_uri}")
    result = res
    break
if result is None:
    raise RuntimeError("No evaluation results were found for this evaluation.")

# Download the instance results JSONL
r = requests.get(result.blob_uri)
r.raise_for_status()  # Stop here if the download failed
with open("instance_results.jsonl", "wb") as f:
    f.write(r.content)
print("Downloaded instance_results.jsonl")

# Preview the first few rows
with open("instance_results.jsonl", "r", encoding="utf-8") as f:
    for _ in range(5):
        print(f.readline().strip())
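
Beyond previewing raw rows, you will usually want an aggregate per metric. The sketch below averages every numeric field across the rows of a JSON Lines source. It assumes each line is a flat JSON object; the actual schema of `instance_results.jsonl` is not shown in this notebook, so the field name `outputs.f1_score.f1_score` in the sample and the helper `mean_numeric_fields` are hypothetical.

```python
import json
from collections import defaultdict


def mean_numeric_fields(lines):
    """Average each numeric top-level field across an iterable of JSON lines."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for key, value in record.items():
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                totals[key] += value
                counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Made-up sample rows standing in for downloaded results.
sample_lines = [
    '{"outputs.f1_score.f1_score": 0.5, "inputs.query": "q1"}',
    '{"outputs.f1_score.f1_score": 1.0, "inputs.query": "q2"}',
]
print(mean_numeric_fields(sample_lines))
```

To run it on the downloaded file, pass the open file object, since iterating a text file yields one line at a time: `with open("instance_results.jsonl", encoding="utf-8") as f: print(mean_numeric_fields(f))`.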