# 🏋️‍♀️ Health & Fitness Evaluations with Azure AI Foundry 🏋️‍♂️

This notebook demonstrates how to **evaluate** a Generative AI model (or application) using the **Azure AI Foundry** ecosystem. We'll highlight three key Python SDKs:
1. **`azure-ai-projects`** (`AIProjectClient`): manage & orchestrate evaluations in the cloud.
2. **`azure-ai-inference`**: perform model inference (optional but helpful if generating data for evaluation).
3. **`azure-ai-evaluation`**: run automated metrics for LLM output quality & safety.

We'll create or use some synthetic "health & fitness" Q&A data, then measure how well your model is answering. We'll do both **local** evaluation and **cloud** evaluation (on an Azure AI Foundry project).

> **Disclaimer**: This covers a hypothetical health & fitness scenario. **No real medical advice** is provided. Always consult professionals.

## Notebook Contents
1. [Setup & Imports](#1-Setup-and-Imports)
2. [Local Evaluation Examples](#3-Local-Evaluation)
3. [Cloud Evaluation with `AIProjectClient`](#4-Cloud-Evaluation)
4. [Extra Topics](#5-Extra-Topics)
   - [Risk & Safety Evaluators](#5.1-Risk-and-Safety)
   - [More Quality Evaluators](#5.2-Quality)
   - [Custom Evaluators](#5.3-Custom)
   - [Simulators & Adversarial Data](#5.4-Simulators)
5. [Conclusion](#6-Conclusion)


## 1. Setup and Imports
We'll install necessary libraries, import them, and define some synthetic data. 

### Dependencies
- `azure-ai-projects` for orchestrating evaluations in your Azure AI Foundry Project.
- `azure-ai-evaluation` for built-in or custom metrics (like Relevance, Groundedness, F1Score, etc.).
- `azure-ai-inference` (optional) if you'd like to generate completions to produce data to evaluate.
- `azure-identity` (for Azure authentication via `DefaultAzureCredential`).

### Synthetic Data
We'll create a small JSONL file of *health & fitness* Q&A pairs. Each row has a `query`, `response`, `context`, and `ground_truth`, which simulates a scenario where we have user questions, the model's answers, and a reference answer to compare against.

You can adapt this approach to any domain: e.g., finance, e-commerce, etc.
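Before running any evaluators, it can help to confirm that every JSONL row actually carries the fields described above. The following is a minimal sketch using only the standard library; the helper name `find_schema_problems` and the demo file are hypothetical and not part of any Azure SDK.

```python
import json
import tempfile
from pathlib import Path

# Field names taken from the dataset description above
REQUIRED_FIELDS = ("query", "response", "context", "ground_truth")

def find_schema_problems(jsonl_path, required=REQUIRED_FIELDS):
    """Return a list of (line_number, problem) tuples for a JSONL file."""
    problems = []
    with Path(jsonl_path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                row = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_no, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(row, dict):
                problems.append((line_no, "row is not a JSON object"))
                continue
            # A field counts as missing if it is absent or only whitespace
            missing = [k for k in required if not str(row.get(k, "")).strip()]
            if missing:
                problems.append((line_no, "missing or empty: " + ", ".join(missing)))
    return problems

# Demo on a throwaway file: the second row lacks two fields
demo_rows = [
    {"query": "q", "response": "r", "context": "c", "ground_truth": "g"},
    {"query": "q", "response": "r"},
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    demo_path.write_text(
        "\n".join(json.dumps(r) for r in demo_rows) + "\n", encoding="utf-8"
    )
    demo_problems = find_schema_problems(demo_path)

print(demo_problems)
```

An empty list means every row has all four fields; anything else points at the line to fix before evaluating.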

<img src="./seq-diagrams/2-evals.png" alt="Evaluation Flow" width="30%"/>


In [None]:
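%%capture
# Note: %%capture hides all output from this cell, including the final print.
# Remove it if you want to see where the data file was written.
# If you need to install these, uncomment: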
# !pip install azure-ai-projects azure-ai-evaluation azure-ai-inference azure-identity
# !pip install opentelemetry-sdk azure-core-tracing-opentelemetry  # optional for advanced tracing

import json
import os
import uuid
from pathlib import Path
from typing import Dict, Any

from azure.identity import DefaultAzureCredential

# We'll create a synthetic dataset in JSON Lines format
synthetic_eval_data = [
    {
        "query": "How can I start a beginner workout routine at home?",
        "context": "Workout routines can include push-ups, bodyweight squats, lunges, and planks.",
        "response": "You can just go for 10 push-ups total.",
        "ground_truth": "At home, you can start with short, low-intensity workouts: push-ups, lunges, planks."
    },
    {
        "query": "Are diet sodas healthy for daily consumption?",
        "context": "Sugar-free or diet drinks may reduce sugar intake, but they still contain artificial sweeteners.",
        "response": "Yes, diet sodas are 100% healthy.",
        "ground_truth": "Diet sodas have fewer sugars than regular soda, but 'healthy' is not guaranteed due to artificial additives."
    },
    {
        "query": "What's the capital of France?",
        "context": "France is in Europe. Paris is the capital.",
        "response": "London.",
        "ground_truth": "Paris."
    }
]

# Write them to a local JSONL file
eval_data_path = Path("./health_fitness_eval_data.jsonl")
with eval_data_path.open("w", encoding="utf-8") as f:
    for row in synthetic_eval_data:
        f.write(json.dumps(row) + "\n")

print(f"Sample evaluation data written to {eval_data_path.resolve()}")

# 3. Local Evaluation Examples

We'll show how to run local, code-based evaluation on a JSONL dataset. We'll:
1. **Load** the data.
2. **Define** one or more evaluators (e.g., `F1ScoreEvaluator`, `RelevanceEvaluator`, `GroundednessEvaluator`, or a custom one).
3. **Run** `evaluate(...)` to produce a dictionary of metrics.
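To make the load, define, run flow concrete, here is a conceptual sketch of what such a loop does: call each evaluator on each row, then average the numeric outputs. This is not the SDK's implementation; `run_evaluators`, `exact_match`, and the `name.metric` key format are illustrative assumptions.

```python
from statistics import mean

def run_evaluators(rows, evaluators):
    """Apply each evaluator to each row, then average the numeric outputs."""
    per_row = []
    for row in rows:
        outputs = {}
        for name, fn in evaluators.items():
            # Each evaluator receives the row's columns as keyword arguments
            for metric, value in fn(**row).items():
                outputs[f"{name}.{metric}"] = value
        per_row.append({"inputs": row, "outputs": outputs})
    metric_names = list(per_row[0]["outputs"]) if per_row else []
    metrics = {m: mean(r["outputs"][m] for r in per_row) for m in metric_names}
    return {"rows": per_row, "metrics": metrics}

# Two tiny code-based evaluators (hypothetical examples)
def exact_match(response, ground_truth, **kwargs):
    same = response.strip().lower() == ground_truth.strip().lower()
    return {"exact_match": float(same)}

def response_length(response, **kwargs):
    return {"resp_length": len(response)}

demo_rows = [
    {"query": "What's the capital of France?", "response": "London.", "ground_truth": "Paris."},
    {"query": "What's the capital of France?", "response": "Paris.", "ground_truth": "Paris."},
]
demo_result = run_evaluators(demo_rows, {"em": exact_match, "len": response_length})
print(demo_result["metrics"])
```

Any callable that accepts the row's columns and returns a dictionary fits this pattern, which is why a plain function can serve as a custom evaluator.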

> The same approach also works for multi-turn conversation data, and extra columns like `ground_truth` enable more advanced metrics.

## Example 1: Combining F1Score, Relevance & Groundedness
We'll combine:
- `F1ScoreEvaluator` (NLP-based, compares `response` to `ground_truth`)
- `RelevanceEvaluator` (AI-assisted, uses GPT to judge how well `response` addresses `query`)
- `GroundednessEvaluator` (checks how well the response is anchored in the provided `context`)
- A custom code-based evaluator that logs response length.
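For intuition about the first metric, a token-overlap F1 score can be sketched in a few lines. This is a simplified illustration that assumes lowercase whitespace tokenization; the SDK's `F1ScoreEvaluator` may normalize text differently, so its scores will not necessarily match this sketch.

```python
from collections import Counter

def token_f1(response: str, ground_truth: str) -> float:
    """Harmonic mean of token precision and recall (whitespace tokens)."""
    pred = response.lower().split()
    ref = ground_truth.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("London.", "Paris."))
print(token_f1("Paris.", "Paris."))
```

Because this kind of score only measures word overlap, a response can be factually fine yet score low if it is phrased differently, which is one reason to pair it with the AI-assisted evaluators.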


In [None]:
import os
from azure.ai.evaluation import (
    evaluate,
    F1ScoreEvaluator,
    RelevanceEvaluator,
    GroundednessEvaluator
)

# Our custom evaluator to measure response length.
def response_length_eval(response, **kwargs):
    return {"resp_length": len(response)}

# Model configuration for the AI-assisted evaluators (Relevance, Groundedness).
# Fill in your Azure OpenAI settings via the environment variables below;
# the fallback values are placeholders and will not authenticate.
# If you only use code-based evaluators (F1, response length), you can omit this.
model_config = {
    "azure_endpoint": os.environ.get("AOAI_ENDPOINT", "https://dummy-endpoint.azure.com"),
    "api_key": os.environ.get("AOAI_API_KEY", "fake-key"),
    "azure_deployment": os.environ.get("AOAI_DEPLOYMENT", "gpt-4"),
    "api_version": os.environ.get("AOAI_API_VERSION", "2023-07-01-preview"),
}

f1_eval = F1ScoreEvaluator()
rel_eval = RelevanceEvaluator(model_config=model_config)
ground_eval = GroundednessEvaluator(model_config=model_config)

# We'll run evaluate(...) with these evaluators.
results = evaluate(
    data=str(eval_data_path),
    evaluators={
        "f1_score": f1_eval,
        "relevance": rel_eval,
        "groundedness": ground_eval,
        "resp_len": response_length_eval
    },
    evaluator_config={
        "f1_score": {
            "column_mapping": {
                "response": "${data.response}",
                "ground_truth": "${data.ground_truth}"
            }
        },
        "relevance": {
            "column_mapping": {
                "query": "${data.query}",
                "response": "${data.response}"
            }
        },
        "groundedness": {
            "column_mapping": {
                "context": "${data.context}",
                "response": "${data.response}"
            }
        },
        "resp_len": {
            "column_mapping": {
                "response": "${data.response}"
            }
        }
    }
)

print("Local evaluation result =>")
print(results)

**Inspecting Local Results**

The `evaluate(...)` call returns a dictionary with:
- **`metrics`**: aggregated metrics across rows (like average F1, Relevance, or Groundedness)
- **`rows`**: row-by-row results with inputs and evaluator outputs
- **`traces`** or **`studio_url`**: depending on the SDK version, debugging info (if any) or a link to the results in the portal

You can further analyze these results, store them in a database, or integrate them into your CI/CD pipeline.
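As one example of CI/CD integration, the aggregated metrics can be compared against minimum thresholds. The sketch below uses a hand-written dictionary shaped like the result described above; the metric key names (such as `f1_score.f1_score`) and the threshold values are assumptions, so print your own `results["metrics"]` to see the actual keys.

```python
def check_thresholds(metrics, thresholds):
    """Return human-readable failures for metrics below their minimum."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: metric missing")
        elif value < minimum:
            failures.append(f"{name}: {value:.2f} < {minimum:.2f}")
    return failures

# Hypothetical result; real key names may differ by SDK version
demo_results = {
    "metrics": {"f1_score.f1_score": 0.31, "relevance.relevance": 3.7},
    "rows": [],
}
demo_failures = check_thresholds(
    demo_results["metrics"],
    {
        "f1_score.f1_score": 0.50,
        "relevance.relevance": 3.0,
        "groundedness.groundedness": 3.0,
    },
)
for failure in demo_failures:
    print("FAIL", failure)
```

In a pipeline, a non-empty failure list could trigger a non-zero exit code so that the build stops when quality regresses.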

# 4. Cloud Evaluation with `AIProjectClient`

Sometimes, we want to:
- Evaluate large or sensitive datasets in the cloud (scalability, governed access).
- Keep track of evaluation results in an Azure AI Foundry project.
- Optionally schedule recurring evaluations.

We'll do that in four steps:
1. **Upload** the local JSONL to your Azure AI Foundry project.
2. **Create** an `Evaluation` referencing built-in or custom evaluator definitions (with retry logic for transient failures).
3. **Check** the job status, re-checking until the job is done.
4. **Review** the results in the portal or via `project_client.evaluations.get(...)`.
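The status-check step can be sketched as a generic polling loop with a timeout. This is a minimal sketch: the terminal status names in `TERMINAL_STATUSES` are assumptions (check what your job actually reports), and `poll_until_done` is a hypothetical helper, not an SDK function.

```python
import time

# Assumed terminal status names; adjust to what the service returns
TERMINAL_STATUSES = {"Completed", "Failed", "Canceled"}

def poll_until_done(get_status, interval=10.0, timeout=1800.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call get_status() until it returns a terminal status or time runs out."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Still '{status}' after {timeout} seconds")
        sleep(interval)

# Demo with a fake status source instead of a real cloud job
fake_statuses = iter(["Queued", "Running", "Running", "Completed"])
demo_status = poll_until_done(
    lambda: next(fake_statuses), interval=0, sleep=lambda seconds: None
)
print(demo_status)
```

With the client used in this notebook, `get_status` could be something like `lambda: project_client.evaluations.get(cloud_eval.id).status`, once the evaluation has been created.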

### Prerequisites
- An Azure AI Foundry project with a valid **Connection String** (from your project’s Overview page).
- An Azure OpenAI deployment (if using AI-assisted evaluators).


In [None]:
import os
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import (
    Evaluation, Dataset, EvaluatorConfiguration, ConnectionType
)
from azure.ai.evaluation import F1ScoreEvaluator, RelevanceEvaluator, ViolenceEvaluator
from azure.identity import DefaultAzureCredential
from azure.core.exceptions import ServiceResponseError
import time

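# 1) Connect to Azure AI Foundry project
project_conn_str = os.environ.get("PROJECT_CONNECTION_STRING")
if not project_conn_str:
    # Fail early with a clear message instead of an opaque client error
    raise ValueError("Set PROJECT_CONNECTION_STRING to your project's connection string.")
credential = DefaultAzureCredential()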

project_client = AIProjectClient.from_connection_string(
    credential=credential,
    conn_str=project_conn_str
)
print("✅ Created AIProjectClient.")

# 2) Upload data for evaluation
uploaded_data_id, _ = project_client.upload_file(str(eval_data_path))
print("✅ Uploaded JSONL to project. Data asset ID:", uploaded_data_id)

# 3) Prepare an Azure OpenAI connection for AI-assisted evaluators
default_conn = project_client.connections.get_default(ConnectionType.AZURE_OPEN_AI)

deployment_name = os.environ.get("MODEL_DEPLOYMENT", "gpt-4o")
api_version = os.environ.get("AOAI_API_VERSION", "2024-12-01-preview")

# Build the model config that the AI-assisted evaluators will use
model_config = default_conn.to_evaluator_model_config(
    deployment_name=deployment_name,
    api_version=api_version
)

# 4) Construct the evaluation object
evaluation = Evaluation(
    display_name="Health Fitness Remote Evaluation",
    description="Evaluating dataset for correctness.",
    data=Dataset(id=uploaded_data_id),
    evaluators={
        "f1_score": EvaluatorConfiguration(id=F1ScoreEvaluator.id),
        "relevance": EvaluatorConfiguration(
            id=RelevanceEvaluator.id,
            init_params={"model_config": model_config}
        ),
        "violence": EvaluatorConfiguration(
            id=ViolenceEvaluator.id,
            init_params={"azure_ai_project": project_client.scope}
        )
    }
)

# Helper: Create evaluation with retry logic
def create_evaluation_with_retry(project_client, evaluation, max_retries=3, retry_delay=5):
    for attempt in range(max_retries):
        try:
            result = project_client.evaluations.create(evaluation=evaluation)
            return result
        except ServiceResponseError as e:
            if attempt == max_retries - 1:
                raise
            print(f"⚠️ Attempt {attempt+1} failed: {str(e)}. Retrying in {retry_delay} seconds...")
            time.sleep(retry_delay)

# 5) Create & track the evaluation using retry logic
cloud_eval = create_evaluation_with_retry(project_client, evaluation)
print("✅ Created evaluation job. ID:", cloud_eval.id)

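# 6) Fetch the current status (a single check; the job may still be running,
#    so re-run this cell or poll in a loop until it reaches a final state)
fetched_eval = project_client.evaluations.get(cloud_eval.id)
print("Current status:", fetched_eval.status)
# Treat a missing or empty properties attribute the same way
properties = getattr(fetched_eval, "properties", None) or {}
link = properties.get("AiStudioEvaluationUri", "")
if link:
    print("View details in Foundry:", link)
else:
    print("No link found.")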

### Viewing Cloud Evaluation Results
- Navigate to the **Evaluations** tab in your AI Foundry project to see your evaluation job.
- Open the evaluation to view aggregated metrics and row-level details.
- For AI-assisted or risk & safety evaluators, you'll see both average scores and detailed per-row results.