# 🏋️‍♀️ Health & Fitness Evaluations with Azure AI Foundry 🏋️‍♂️

This notebook demonstrates how to **evaluate** a generative AI model (or application) using the **Azure AI Foundry** ecosystem. We'll highlight three key Python SDKs:
1. **azure-ai-projects** (AIProjectClient): manage and orchestrate evaluations in the cloud.
2. **azure-ai-inference**: run model inference (optional, but useful for generating data to evaluate).
3. **azure-ai-evaluation**: run automated metrics for the quality and safety of language model output.

We'll create or reuse some synthetic health and fitness Q&A data, then measure how well your model answers. We'll run evaluations both **locally** and **in the cloud** (in an Azure AI Foundry project).

> **Disclaimer**: This covers a hypothetical health and fitness scenario. **No real medical advice is provided**. Always consult professionals.

## Notebook Contents
1. [Setup and Imports](#1-Setup-and-Imports)
2. [Local Evaluation Examples](#3-Local-Evaluation)
3. [Cloud Evaluation with AIProjectClient](#4-Cloud-Evaluation)

# 1. Setup and Imports


We'll install the required libraries, import them, and define some synthetic data.

### Dependencies
- `azure-ai-projects` to orchestrate evaluations in your Azure AI Foundry project.
- `azure-ai-evaluation` for built-in or custom metrics (such as Relevance, Groundedness, F1Score, etc.).
- `azure-ai-inference` (optional) if you want to generate completions to produce data to evaluate.
- `azure-identity` for authenticating with Azure via DefaultAzureCredential.

### Synthetic Data
We'll create a small JSONL file of health and fitness question-answer pairs, each with query, response, context, and ground_truth fields. This simulates a scenario where we have user questions, the model's answers, and a ground-truth reference.

You can adapt this approach to any domain: for example, finance, e-commerce, etc.
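
Before handing a JSONL file to an evaluator, it can help to confirm that every row carries the four fields described above. The following is a minimal sketch of such a check; the helper name `find_invalid_rows` and the `REQUIRED_FIELDS` constant are illustrative and not part of any Azure SDK.

```python
import json
from pathlib import Path

# Field names taken from the dataset description above.
REQUIRED_FIELDS = ("query", "context", "response", "ground_truth")


def find_invalid_rows(jsonl_path):
    """Return (line_number, problem) tuples for rows that fail the check."""
    problems = []
    with Path(jsonl_path).open(encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                row = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(row, dict):
                problems.append((line_number, "row is not a JSON object"))
                continue
            # A field counts as missing if it is absent or empty.
            missing = [name for name in REQUIRED_FIELDS if not row.get(name)]
            if missing:
                problems.append((line_number, "missing or empty: " + ", ".join(missing)))
    return problems
```

An empty list means every non-blank line parsed as a JSON object with all four fields present.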

<img src="./seq-diagrams/2-evals.png" alt="Evaluation Flow" width="30%"/>


In [16]:
%%capture
# If you need to install these, uncomment:
# !pip install azure-ai-projects azure-ai-evaluation azure-ai-inference azure-identity
# !pip install opentelemetry-sdk azure-core-tracing-opentelemetry  # optional for advanced tracing

import json
import os
import uuid
from pathlib import Path
from typing import Dict, Any

from azure.identity import DefaultAzureCredential

# We'll create a synthetic dataset in JSON Lines format
synthetic_eval_data = [
    {
        "query": "How can I start a beginner workout routine at home?",
        "context": "Workout routines can include push-ups, bodyweight squats, lunges, and planks.",
        "response": "You can just go for 10 push-ups total.",
        "ground_truth": "At home, you can start with short, low-intensity workouts: push-ups, lunges, planks."
    },
    {
        "query": "Are diet sodas healthy for daily consumption?",
        "context": "Sugar-free or diet drinks may reduce sugar intake, but they still contain artificial sweeteners.",
        "response": "Yes, diet sodas are 100% healthy.",
        "ground_truth": "Diet sodas have fewer sugars than regular soda, but 'healthy' is not guaranteed due to artificial additives."
    },
    {
        "query": "What's the capital of France?",
        "context": "France is in Europe. Paris is the capital.",
        "response": "London.",
        "ground_truth": "Paris."
    }
]

# Write them to a local JSONL file
eval_data_path = Path("./health_fitness_eval_data.jsonl")
with eval_data_path.open("w", encoding="utf-8") as f:
    for row in synthetic_eval_data:
        f.write(json.dumps(row) + "\n")

print(f"Sample evaluation data written to {eval_data_path.resolve()}")

# 3. Local Evaluation Examples

We'll show how to run a local, code-based evaluation on a JSONL dataset. We will:
1. **Load** the data.
2. **Define** one or more evaluators (for example, F1ScoreEvaluator, RelevanceEvaluator, GroundednessEvaluator, or custom ones).
3. **Run** evaluate(...) to produce a dictionary of metrics.

> We can also work with multi-turn conversation data or add extra columns, such as ground_truth, for advanced metrics.
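
To build intuition for what an F1-style metric measures when it compares a response with its ground truth, here is a simplified token-overlap F1. It is a sketch only: it assumes lowercase whitespace tokenization and is not the implementation used by F1ScoreEvaluator, whose tokenization and normalization may differ.

```python
from collections import Counter


def token_f1(response, ground_truth):
    """Simplified token-overlap F1 between a response and a reference answer."""
    response_tokens = response.lower().split()
    truth_tokens = ground_truth.lower().split()
    if not response_tokens or not truth_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(response_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(response_tokens)  # share of response tokens that match
    recall = overlap / len(truth_tokens)        # share of reference tokens recovered
    return 2 * precision * recall / (precision + recall)


# "London." shares no token with "Paris.", so the score is 0.0.
print(token_f1("London.", "Paris."))
```

Precision penalizes extra tokens in the response and recall penalizes missing ones, so a short answer that matches only part of the reference scores below 1 even when every word it uses is correct.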

## Example 1: Combining F1Score, Relevance, and Groundedness
We'll combine:
- F1ScoreEvaluator (NLP-based, compares response against ground_truth)
- RelevanceEvaluator (AI-assisted, uses GPT to judge how well response answers query)
- GroundednessEvaluator (checks how well the response is grounded in the provided context)
- A custom code-based evaluator that records the length of the response.
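
The custom code-based evaluator in the last bullet could look something like the sketch below. It assumes the common pattern of a callable class that accepts keyword arguments and returns a dictionary of metric values; the class name `ResponseLengthEvaluator` and the output keys are hypothetical.

```python
class ResponseLengthEvaluator:
    """Hypothetical code-based evaluator that reports the size of a response."""

    def __call__(self, *, response: str, **kwargs):
        # Extra keyword arguments (query, context, ...) are accepted and ignored.
        return {
            "response_chars": len(response),
            "response_words": len(response.split()),
        }


length_eval = ResponseLengthEvaluator()
print(length_eval(response="You can just go for 10 push-ups total."))
```

Under that assumption, an instance could be added to the `evaluators` dictionary passed to `evaluate(...)` under a key such as `"response_length"`, with a column mapping from `response` to `${data.response}`.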


In [None]:
import os
from azure.ai.evaluation import (
    evaluate,
    F1ScoreEvaluator,
    RelevanceEvaluator,
    GroundednessEvaluator
)
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential
from azure.ai.projects.models import ConnectionType

# AI-assisted evaluators (Relevance, Groundedness) need a model configuration.
# Here it is built from the project's default Azure OpenAI connection; if you only
# use NLP-based evaluators such as F1ScoreEvaluator, you can skip this part.
connection_string = os.environ.get("PROJECT_CONNECTION_STRING")

project_client = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str=connection_string
)

default_connection = project_client.connections.get_default(
    connection_type=ConnectionType.AZURE_OPEN_AI
)

model_config = default_connection.to_evaluator_model_config(
    deployment_name='gpt-4o', 
    api_version='2024-12-01-preview'
)
model_config['type'] = 'azure_openai'

f1_eval = F1ScoreEvaluator()
rel_eval = RelevanceEvaluator(model_config=model_config)
ground_eval = GroundednessEvaluator(model_config=model_config)

# We'll run evaluate(...) with these evaluators.
results = evaluate(
    data=str(eval_data_path),
    evaluators={
        "f1_score": f1_eval,
        "relevance": rel_eval,
        "groundedness": ground_eval,
    },
    evaluator_config={
        "f1_score": {
            "column_mapping": {
                "response": "${data.response}",
                "ground_truth": "${data.ground_truth}"
            }
        },
        "relevance": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}"
            }
        },
        "groundedness": {
            "column_mapping": {
                "context": "${data.context}",
                "response": "${data.response}"
            }
        }
    }
)

print("Local evaluation result =>")
print(results)

[2025-04-05 14:15:30 -0300][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_nvdjs1d0_20250405_141530_853137, log path: C:\Users\pablocastao\.promptflow\.runs\azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_nvdjs1d0_20250405_141530_853137\logs.txt
[2025-04-05 14:15:30 -0300][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_dsmsk78q_20250405_141530_853137, log path: C:\Users\pablocastao\.promptflow\.runs\azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_dsmsk78q_20250405_141530_853137\logs.txt
[2025-04-05 14:15:30 -0300][promptflow._sdk._orchestrator.run_submitter][INFO] - Submitting run azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_wf2tl5t6_20250405_141530_853137, log path: C:\Users\pablocastao\.promptflow\.runs\azure_ai_evaluation_evaluators_common_base_eval_a

2025-04-05 14:15:31 -0300   10296 execution.bulk     INFO     Current thread is not main thread, skip signal handler registration in BatchEngine.
2025-04-05 14:15:31 -0300   10296 execution.bulk     INFO     Finished 3 / 3 lines.
2025-04-05 14:15:31 -0300   10296 execution.bulk     INFO     Average execution time for completed lines: 0.01 seconds. Estimated time for incomplete lines: 0.0 seconds.

Run name: "azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_wf2tl5t6_20250405_141530_853137"
Run status: "Completed"
Start time: "2025-04-05 14:15:30.848696-03:00"
Duration: "0:00:01.211030"
Output path: "C:\Users\pablocastao\.promptflow\.runs\azure_ai_evaluation_evaluators_common_base_eval_asyncevaluatorbase_wf2tl5t6_20250405_141530_853137"

2025-04-05 14:15:37 -0300   10296 execution.bulk     INFO     Finished 1 / 3 lines.
2025-04-05 14:15:37 -0300   10296 execution.bulk     INFO     Average execution time for completed lines: 7.01 seconds. Estimated time for incomplete li


**Inspecting Local Results**

The `evaluate(...)` call returns a dictionary with:
- **`metrics`**: aggregated metrics across rows (like average F1, Relevance, or Groundedness)
- **`rows`**: row-by-row results with inputs and evaluator outputs
- **`traces`**: debugging info (if any)

You can further analyze these results, store them in a database, or integrate them into your CI/CD pipeline.
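
As one way to analyze the result further, the sketch below prints the aggregate metrics and collects rows that fall under a threshold. It runs on a stand-in dictionary with made-up values; the flattened key names (`f1_score.f1_score`, `inputs.query`, `outputs.f1_score.f1_score`) are assumptions about the result layout, so print your own `results` to confirm the exact names before relying on them.

```python
def summarize_results(results, score_column, threshold):
    """Print aggregate metrics and return the rows scoring below the threshold."""
    for name, value in sorted(results.get("metrics", {}).items()):
        print(f"{name}: {value}")
    low_rows = []
    for row in results.get("rows", []):
        score = row.get(score_column)
        # Skip rows where the evaluator produced no numeric score.
        if isinstance(score, (int, float)) and score < threshold:
            low_rows.append(row)
    return low_rows


# Stand-in result with made-up values; key names are assumed, not confirmed.
sample_results = {
    "metrics": {"f1_score.f1_score": 0.4},
    "rows": [
        {"inputs.query": "example question A", "outputs.f1_score.f1_score": 0.8},
        {"inputs.query": "example question B", "outputs.f1_score.f1_score": 0.0},
    ],
}

flagged = summarize_results(sample_results, "outputs.f1_score.f1_score", threshold=0.5)
for row in flagged:
    print("Needs review:", row["inputs.query"])
```

In a CI/CD pipeline, a non-empty `flagged` list could be used to fail the build or open a review task.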

# 4. Cloud Evaluation with `AIProjectClient`

Sometimes, we want to:
- Evaluate large or sensitive datasets in the cloud (scalability, governed access).
- Keep track of evaluation results in an Azure AI Foundry project.
- Optionally schedule recurring evaluations.

We'll do that in four steps:
1. **Upload** the local JSONL to your Azure AI Foundry project.
2. **Create** an `Evaluation` referencing built-in or custom evaluator definitions (with retry logic for resilience).
3. **Check** the job status; the cell below fetches it once, so poll if you need to wait for completion.
4. **Review** the results in the portal or via `project_client.evaluations.get(...)`.
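
Waiting for a cloud job to finish can be sketched with a generic polling helper like the one below. The helper `poll_until_done` is hypothetical and takes any zero-argument function that returns the current status, so it makes no assumption about the SDK itself. The set of terminal status names must come from you; the names in the usage comment are assumptions, not values confirmed by this notebook.

```python
import time


def poll_until_done(fetch_status, terminal_statuses,
                    interval_seconds=10.0, timeout_seconds=600.0):
    """Call fetch_status() repeatedly until it returns a terminal status."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status in terminal_statuses:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Status is still {status!r} after {timeout_seconds} seconds"
            )
        time.sleep(interval_seconds)


# Possible usage with the client from this notebook (status names are assumed):
# final_status = poll_until_done(
#     lambda: project_client.evaluations.get(cloud_eval.id).status,
#     terminal_statuses={"Completed", "Failed", "Canceled"},
# )
```

Using a deadline based on `time.monotonic()` keeps the timeout unaffected by system clock changes while the job runs.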

### Prerequisites
- An Azure AI Foundry project with a valid **Connection String** (from your project’s Overview page).
- An Azure OpenAI deployment (if using AI-assisted evaluators).
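
Because the code in this notebook reads the connection string from the `PROJECT_CONNECTION_STRING` environment variable, a missing value only surfaces later as a less obvious client error. A small guard such as the following sketch fails early with a clear message; the helper name `require_env` is illustrative.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set. "
            "Set it before creating the project client."
        )
    return value


# Possible usage before creating the client:
# project_conn_str = require_env("PROJECT_CONNECTION_STRING")
```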


In [17]:
import os
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import (
    Evaluation, Dataset, EvaluatorConfiguration, ConnectionType
)
from azure.ai.evaluation import F1ScoreEvaluator, RelevanceEvaluator, ViolenceEvaluator
from azure.identity import DefaultAzureCredential
from azure.core.exceptions import ServiceResponseError
import time

# 1) Connect to Azure AI Foundry project
project_conn_str = os.environ.get("PROJECT_CONNECTION_STRING")
credential = DefaultAzureCredential()

project_client = AIProjectClient.from_connection_string(
    credential=credential,
    conn_str=project_conn_str
)
print("✅ Created AIProjectClient.")

# 2) Upload data for evaluation
uploaded_data_id, _ = project_client.upload_file(str(eval_data_path))
print("✅ Uploaded JSONL to project. Data asset ID:", uploaded_data_id)

# 3) Prepare an Azure OpenAI connection for AI-assisted evaluators
default_connection = project_client.connections.get_default(
    connection_type=ConnectionType.AZURE_OPEN_AI
)

model_config = default_connection.to_evaluator_model_config(
    deployment_name='gpt-4o', 
    api_version='2024-12-01-preview'
)
model_config['type'] = 'azure_openai'

# 4) Define the evaluation: the uploaded dataset plus the evaluator configurations
evaluation = Evaluation(
    display_name="Health Fitness Remote Evaluation",
    description="Evaluating dataset for correctness.",
    data=Dataset(id=uploaded_data_id),
    evaluators={
        "f1_score": EvaluatorConfiguration(id=F1ScoreEvaluator.id),
        "relevance": EvaluatorConfiguration(
            id=RelevanceEvaluator.id,
            init_params={"model_config": model_config}
        ),
        "violence": EvaluatorConfiguration(
            id=ViolenceEvaluator.id,
            init_params={"azure_ai_project": project_client.scope}
        )
    }
)

# Helper: Create evaluation with retry logic
def create_evaluation_with_retry(project_client, evaluation, max_retries=3, retry_delay=5):
    for attempt in range(max_retries):
        try:
            result = project_client.evaluations.create(evaluation=evaluation)
            return result
        except ServiceResponseError as e:
            if attempt == max_retries - 1:
                raise
            print(f"⚠️ Attempt {attempt+1} failed: {str(e)}. Retrying in {retry_delay} seconds...")
            time.sleep(retry_delay)

# 5) Create & track the evaluation using retry logic
cloud_eval = create_evaluation_with_retry(project_client, evaluation)
print("✅ Created evaluation job. ID:", cloud_eval.id)

# 6) Fetch the current status (a single check; the job keeps running in the cloud)
fetched_eval = project_client.evaluations.get(cloud_eval.id)
print("Current status:", fetched_eval.status)

# Print the portal link if the service returned one
properties = getattr(fetched_eval, "properties", None) or {}
link = properties.get("AiStudioEvaluationUri", "")
if link:
    print("View details in Foundry:", link)
else:
    print("No link found.")

### Viewing Cloud Evaluation Results
- Navigate to the **Evaluations** tab in your AI Foundry project to see your evaluation job.
- Open the evaluation to view aggregated metrics and row-level details.
- For AI-assisted or risk & safety evaluators, you'll see both average scores and detailed per-row results.