diff --git a/.github/plugins/azure-skills/.claude-plugin/plugin.json b/.github/plugins/azure-skills/.claude-plugin/plugin.json index 071c4775..9e6f8b56 100644 --- a/.github/plugins/azure-skills/.claude-plugin/plugin.json +++ b/.github/plugins/azure-skills/.claude-plugin/plugin.json @@ -1,7 +1,7 @@ { "name": "azure", "description": "Microsoft Azure MCP and Skills integration for cloud resource management, deployments, and Azure services. Manage your Azure infrastructure, monitor applications, and deploy resources directly from Claude Code.", - "version": "1.1.22", + "version": "1.1.26", "author": { "name": "Microsoft", "url": "https://www.microsoft.com" diff --git a/.github/plugins/azure-skills/.cursor-plugin/plugin.json b/.github/plugins/azure-skills/.cursor-plugin/plugin.json index d54f7cb1..f4866d2e 100644 --- a/.github/plugins/azure-skills/.cursor-plugin/plugin.json +++ b/.github/plugins/azure-skills/.cursor-plugin/plugin.json @@ -1,7 +1,7 @@ { "name": "azure", "description": "Microsoft Azure MCP and Skills integration for cloud resource management, deployments, and Azure services. Manage your Azure infrastructure, monitor applications, and deploy resources directly from Cursor.", - "version": "1.1.22", + "version": "1.1.26", "author": { "name": "Microsoft", "url": "https://www.microsoft.com" diff --git a/.github/plugins/azure-skills/.plugin/plugin.json b/.github/plugins/azure-skills/.plugin/plugin.json index 9359040e..65f9bc4e 100644 --- a/.github/plugins/azure-skills/.plugin/plugin.json +++ b/.github/plugins/azure-skills/.plugin/plugin.json @@ -1,7 +1,7 @@ { "name": "azure", "description": "Microsoft Azure MCP and Skills integration for cloud resource management, deployments, and Azure services. Manage your Azure infrastructure, monitor applications, and deploy resources directly from your development environment.", - "version": "1.1.22", + "version": "1.1.26", "author": { "name": "Microsoft", "url": "https://www.microsoft.com" diff --git a/.github/plugins/azure-skills/CHANGELOG.md b/.github/plugins/azure-skills/CHANGELOG.md index fd2a6761..ec85e8c6 100644 --- a/.github/plugins/azure-skills/CHANGELOG.md +++ b/.github/plugins/azure-skills/CHANGELOG.md @@ -1,5 +1,9 @@ # Changelog +## 1.1.25 + +- fix: update toolbox sample link ([#2078](https://github.com/microsoft/GitHub-Copilot-for-Azure/pull/2078)) + ## 1.1.22 - fix: Remove context7 MCP server from plugin config ([#2100](https://github.com/microsoft/GitHub-Copilot-for-Azure/pull/2100)) diff --git a/.github/plugins/azure-skills/skills/azure-diagnostics/SKILL.md b/.github/plugins/azure-skills/skills/azure-diagnostics/SKILL.md index e13fb2ab..d4b5d102 100644 --- a/.github/plugins/azure-skills/skills/azure-diagnostics/SKILL.md +++ b/.github/plugins/azure-skills/skills/azure-diagnostics/SKILL.md @@ -1,10 +1,10 @@ --- name: azure-diagnostics -description: "Debug Azure production issues on Azure using AppLens, Azure Monitor, resource health, and safe triage. WHEN: debug production issues, troubleshoot container apps, troubleshoot functions, troubleshoot AKS, kubectl cannot connect, kube-system/CoreDNS failures, pod pending, crashloop, node not ready, upgrade failures, analyze logs, KQL, insights, image pull failures, cold start issues, health probe failures, resource health, root cause of errors, troubleshoot event hubs, troubleshoot service bus, messaging SDK error, AMQP connection failure, message lock lost, service bus dead letter." 
+description: "Debug Azure production issues on Azure using AppLens, Azure Monitor, resource health, and safe triage. WHEN: debug production issues, troubleshoot app service, app service high CPU, app service deployment failure, troubleshoot container apps, troubleshoot functions, troubleshoot AKS, kubectl cannot connect, kube-system/CoreDNS failures, pod pending, crashloop, node not ready, upgrade failures, analyze logs, KQL, insights, image pull failures, cold start issues, health probe failures, resource health, root cause of errors, troubleshoot event hubs, troubleshoot service bus, messaging SDK error, AMQP connection failure, message lock lost, service bus dead letter." license: MIT metadata: author: Microsoft - version: "1.1.3" + version: "1.1.4" --- # Azure Diagnostics @@ -22,6 +22,8 @@ Activate this skill when user wants to: - Fix image pull, cold start, or health probe issues - Investigate why Azure resources are failing - Find root cause of application errors +- Troubleshoot App Service issues (high CPU, deployment failures, crashes, slow responses, TLS/custom domains) +- Respond to prompts like "troubleshoot app service", "app service high CPU", or "app service deployment failure" - Troubleshoot Azure Function Apps (invocation failures, timeouts, binding errors) - Find the App Insights or Log Analytics workspace linked to a Function App - Troubleshoot AKS clusters, nodes, pods, ingress, or Kubernetes networking issues @@ -53,6 +55,7 @@ Activate this skill when user wants to: | Service | Common Issues | Reference | |---------|---------------|-----------| | **Container Apps** | Image pull failures, cold starts, health probes, port mismatches | [container-apps/](references/container-apps/README.md) | +| **App Service** | High CPU, deployment failures, crashes, slow responses, TLS/custom domains | [app-service/](references/app-service/README.md) | | **Function Apps** | App details, invocation failures, timeouts, binding errors, cold starts, missing app settings | [functions/](references/functions/README.md) | | **AKS** | Cluster access, nodes, `kube-system`, scheduling, crash loops, ingress, DNS, upgrades | [AKS Troubleshooting](troubleshooting/aks/aks-troubleshooting.md) | | **Messaging** | Event Hubs & Service Bus SDK errors, AMQP failures, message lock, connectivity | [Messaging Troubleshooting](troubleshooting/messaging/README.md) | @@ -143,5 +146,6 @@ az monitor activity-log list -g RG --max-events 20 - [KQL Query Library](references/kql-queries.md) - [Azure Resource Graph Queries](references/azure-resource-graph.md) +- [App Service Troubleshooting](references/app-service/README.md) - [Function Apps Troubleshooting](references/functions/README.md) - [Messaging Troubleshooting](troubleshooting/messaging/README.md) diff --git a/.github/plugins/azure-skills/skills/azure-diagnostics/references/app-service/README.md b/.github/plugins/azure-skills/skills/azure-diagnostics/references/app-service/README.md new file mode 100644 index 00000000..7ae7bb9f --- /dev/null +++ b/.github/plugins/azure-skills/skills/azure-diagnostics/references/app-service/README.md @@ -0,0 +1,183 @@ +# App Service Troubleshooting + +## Common Issues Matrix + +| Symptom | Likely Cause | Action | +|---------|--------------|-----------| +| High CPU / memory | Runaway process, inefficient code | Use Process Explorer via Kudu, scale up | +| Deployment failure | Build error, locked files, quota | Check Kudu logs at `https://APP.scm.azurewebsites.net/api/deployments` to look for details on build errors, locked files 
or lack of storage quota | +| App crash / restart | Unhandled exception, OOM kill | Review Event Log and STDERR in Diagnose & Solve | +| Slow responses | Downstream dependency, no caching | Enable request tracing, check dependency calls | +| 502 / 503 errors | App not starting, port conflict | Check STDERR logs, verify startup command | +| TLS / domain errors | Certificate expired, DNS mismatch | `az webapp config ssl list`, verify CNAME | +| Health check failure | Endpoint not returning 200 | Verify health check path responds within 2 min | + +--- + +## High CPU / Memory Diagnosis + +**Diagnose:** +```bash +# Check app metrics +az monitor metrics list --resource APP_RESOURCE_ID \ + --metric "CpuPercentage,MemoryPercentage" --interval PT1M --output table + +# View running processes via ARM Processes API (Entra ID auth) +az rest --method get \ + --uri "/subscriptions//resourceGroups//providers/Microsoft.Web/sites//processes?api-version=2024-04-01" +``` + +**Fix:** Scale up (`az appservice plan update -n -g --sku P1V3`) or profile the app via Kudu Process Explorer at `https://APP.scm.azurewebsites.net/ProcessExplorer/` to identify hot paths. + +--- + +## Deployment Failure Analysis + +**Diagnose:** +```bash +# List deployment history +az webapp deployment list -n APP -g RG --output table + +# View deployment log for a specific deployment +az webapp log deployment show -n APP -g RG --deployment-id DEPLOY_ID + +# Stream build logs from Kudu +az webapp log tail -n APP -g RG +``` + +**KQL — Failed deployments:** +```kql +// Replace with the full resource ID, for example: +// /subscriptions//resourceGroups//providers/Microsoft.Web/sites/ +AppServicePlatformLogs +| where TimeGenerated > ago(24h) +| where Level == "Error" and _ResourceId == "" +| project TimeGenerated, Level, Message +| order by TimeGenerated desc +``` + +**Common deployment failures:** + +| Error Message | Cause | Fix | +|---------------|-------|-----| +| `WEBSITE_RUN_FROM_PACKAGE=1` but no package | Missing zip deploy artifact | Redeploy with `az webapp deploy --src-path app.zip` | +| `Error building on server` | Oryx build failure | Check build logs, pin runtime version | +| `Locked file` during deploy | Files in use | Set an environment variable named `MSDEPLOY_RENAME_LOCKED_FILES=1` on the App Service resource to enable MSDeploy to rename locked files. | + +--- + +## Application Crash / Restart Diagnosis + +**Diagnose:** +```bash +# Check recent restarts via activity log +az monitor activity-log list -g RG --resource-id APP_RESOURCE_ID \ + --max-events 10 --query "[?operationName.value=='Microsoft.Web/sites/restart/action']" + +# View STDERR/STDOUT (Linux) +az webapp log download -n APP -g RG --log-file logs.zip +``` + +**KQL — App crashes and errors:** +```kql +AppServiceConsoleLogs +| where TimeGenerated > ago(1h) +| where ResultDescription contains "error" or ResultDescription contains "fatal" +| project TimeGenerated, ResultDescription +| order by TimeGenerated desc +| take 50 +``` + +**Health check failures:** +```bash +# Show health check config +az webapp show -n APP -g RG --query "siteConfig.healthCheckPath" + +# Test the endpoint directly +curl -s -o /dev/null -w "%{http_code}" https://APP.azurewebsites.net/health +``` + +> ⚠️ **Warning:** If the health check fails on >50% of instances for 1 hour, the instance is replaced. 
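+A hedged follow-up sketch: `APP` and `/health` are the same placeholders used above, so substitute the app's real host and health check path.
+
+```bash
+# Probe the health endpoint every 10s for two minutes and print timestamped
+# status codes, so flapping (alternating 200/5xx) shows up instead of a snapshot
+for i in $(seq 1 12); do
+  code=$(curl -s -o /dev/null -w "%{http_code}" https://APP.azurewebsites.net/health)
+  echo "$(date -u +%H:%M:%S) HTTP $code"
+  sleep 10
+done
+```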
+ +--- + +## Slow Response Time Investigation + +**Diagnose:** +```bash +# Check average response time +az monitor metrics list --resource APP_RESOURCE_ID \ + --metric "HttpResponseTime" --interval PT5M --aggregation Average --output table + +# Enable failed request tracing +az webapp log config -n APP -g RG --failed-request-tracing true +``` + +**KQL — Slow requests with dependency analysis:** +```kql +AppServiceHTTPLogs +| where TimeGenerated > ago(1h) +| where TimeTaken > 5000 +| project TimeGenerated, CsUriStem, ScStatus, TimeTaken, CsHost +| order by TimeTaken desc +| take 20 +``` + +**Auto-Heal — Automatic mitigation:** +```bash +# Configure auto-heal to recycle on slow requests +az webapp config set -n APP -g RG \ + --auto-heal-enabled true \ + --generic-configurations '{"autoHealRules":{"triggers":{"slowRequests":{"timeTaken":"00:00:30","count":10,"timeInterval":"00:02:00"}},"actions":{"actionType":"Recycle"}}}' +``` + +--- + +## Custom Domain / TLS Certificate Issues + +**Diagnose:** +```bash +# List custom domains +az webapp config hostname list -g RG --webapp-name APP --output table + +# List TLS certificates +az webapp config ssl list -g RG --output table + +# Check SSL binding +az webapp config ssl show --certificate-name CERT -g RG +``` + +| Symptom | Cause | Fix | +|---------|-------|-----| +| `ERR_CERT_DATE_INVALID` | Certificate expired | If certificate came from an external certificate authority, renew with `az webapp config ssl upload` and upload a new certificate or enable managed certificates to allow Azure to provide a free TLS/SSL certificate | +| `DNS_PROBE_FINISHED_NXDOMAIN` | CNAME not configured | Add CNAME record pointing to `APP.azurewebsites.net` | +| `SSL binding not found` | Missing SNI binding | Add the missing SNI binding using `az webapp config ssl bind --certificate-thumbprint THUMB --ssl-type SNI -n APP -g RG` | +| Managed cert pending | DNS validation incomplete | Verify TXT record `asuid.DOMAIN` matches custom domain verification ID | + +--- + +## AZ CLI or MCP Tools for App Service Diagnostics + +| Tool | Command | Use When | +|----------|---------|----------| +| `Azure CLI` | `az webapp list` | List all web apps in subscription | +| `Azure CLI` | `az webapp show -n APP -g RG` | Get app config, stack, status | +| `Azure CLI` | `az webapp config appsettings list -n APP -g RG` | Check env vars and connection strings | +| `Azure CLI` | `az webapp deployment slot list -n APP -g RG` | Compare slot configurations | +| `mcp_azure_mcp_appservice` | `appservice_webapp_diagnostic_diagnose` | AI-powered root cause analysis | +| `mcp_azure_mcp_monitor` | `monitor_resource_log_query` | Run KQL against Log Analytics | +| `mcp_azure_mcp_resourcehealth` | `get` | Check platform-level health status | + +> 💡 **Tip:** Start with `mcp_azure_mcp_appservice` (`diagnose`) — it automatically runs relevant detectors and surfaces the most likely root cause before you dig into logs manually. 
+ +--- + +## Combined Diagnostic Script + +```bash +echo "=== App Service Diagnostics ===" && \ +echo "App Config:" && az webapp show -n APP -g RG --query "{state:state, runtime:siteConfig.linuxFxVersion, healthCheck:siteConfig.healthCheckPath, alwaysOn:siteConfig.alwaysOn}" -o table && \ +echo "Recent Deployments:" && az webapp deployment list -n APP -g RG --query "[:3].{id:id, status:status, time:end_time}" -o table && \ +echo "App Settings:" && az webapp config appsettings list -n APP -g RG --query "[].name" -o tsv && \ +echo "Custom Domains:" && az webapp config hostname list -g RG --webapp-name APP -o table +``` diff --git a/.github/plugins/azure-skills/skills/microsoft-foundry/SKILL.md b/.github/plugins/azure-skills/skills/microsoft-foundry/SKILL.md index efcfc0f1..6621af66 100644 --- a/.github/plugins/azure-skills/skills/microsoft-foundry/SKILL.md +++ b/.github/plugins/azure-skills/skills/microsoft-foundry/SKILL.md @@ -1,10 +1,10 @@ --- name: microsoft-foundry -description: "Deploy, evaluate, and manage Foundry agents end-to-end: Docker build, ACR push, hosted/prompt agent create, container start, batch eval, prompt optimization, prompt optimizer workflows, agent.yaml, dataset curation from traces. USE FOR: deploy agent to Foundry, hosted agent, create agent, invoke agent, evaluate agent, run batch eval, optimize prompt, improve prompt, prompt optimization, prompt optimizer, improve agent instructions, optimize agent instructions, optimize system prompt, deploy model, Foundry project, RBAC, role assignment, permissions, quota, capacity, region, troubleshoot agent, deployment failure, create dataset from traces, dataset versioning, eval trending, create AI Services, Cognitive Services, create Foundry resource, provision resource, knowledge index, agent monitoring, customize deployment, onboard, availability. DO NOT USE FOR: Azure Functions, App Service, general Azure deploy (use azure-deploy), general Azure prep (use azure-prepare)." +description: "Deploy, evaluate, and manage Foundry agents end-to-end: Docker build, ACR push, hosted/prompt agent create, container start, batch eval, continuous eval, prompt optimizer workflows, agent.yaml, dataset curation from traces. USE FOR: deploy agent to Foundry, hosted agent, create agent, invoke agent, evaluate agent, run batch eval, continuous eval, continuous monitoring, continuous eval status, optimize prompt, improve prompt, prompt optimizer, optimize agent instructions, improve agent instructions, optimize system prompt, deploy model, Foundry project, RBAC, role assignment, permissions, quota, capacity, region, troubleshoot agent, deployment failure, create dataset from traces, dataset versioning, eval trending, create AI Services, Cognitive Services, create Foundry resource, provision resource, knowledge index, agent monitoring, customize deployment, onboard, availability. DO NOT USE FOR: Azure Functions, App Service, general Azure deploy (use azure-deploy), general Azure prep (use azure-prepare)." license: MIT metadata: author: Microsoft - version: "1.1.5" + version: "1.1.8" --- # Microsoft Foundry Skill @@ -25,7 +25,7 @@ This skill includes specialized sub-skills for specific workflows. 
**Use these i
|-----------|-------------|-----------|
| **deploy** | Containerize, build, push to ACR, create/update/clone agent deployments | [deploy](foundry-agent/deploy/deploy.md) |
| **invoke** | Send messages to an agent, single or multi-turn conversations | [invoke](foundry-agent/invoke/invoke.md) |
-| **observe** | Evaluate agent quality, run batch evals, analyze failures, optimize prompts, improve agent instructions, compare versions, and set up CI/CD monitoring | [observe](foundry-agent/observe/observe.md) |
+| **observe** | Evaluate agent quality, run batch evals, analyze failures, optimize prompts, improve agent instructions, compare versions, set up CI/CD monitoring, and enable continuous production evaluation | [observe](foundry-agent/observe/observe.md) |
| **trace** | Query traces, analyze latency/failures, correlate eval results to specific responses via App Insights `customEvents` | [trace](foundry-agent/trace/trace.md) |
| **troubleshoot** | View hosted agent logs, query telemetry, diagnose failures | [troubleshoot](foundry-agent/troubleshoot/troubleshoot.md) |
| **create** | Create new hosted agent applications. Supports Microsoft Agent Framework, LangGraph, or custom frameworks in Python or C#, across `responses` or `invocations` protocols. | [create](foundry-agent/create/create.md) |
@@ -54,6 +54,7 @@ Match user intent to the correct workflow. Read each sub-skill in order before e
| Invoke/test/chat with an agent | invoke |
| Optimize / improve agent prompt or instructions | observe (Step 4: Optimize) |
| Evaluate and optimize agent (full loop) | observe |
+| Enable continuous evaluation monitoring | observe (Step 6: CI/CD & Monitoring) |
| Troubleshoot an agent issue | invoke → troubleshoot |
| Fix a broken agent (troubleshoot + redeploy) | invoke → troubleshoot → apply fixes → deploy → invoke |
@@ -65,12 +66,13 @@ Every agent source folder should keep Foundry-specific state under `.foundry/`:
<agent-root>/
  .foundry/
    agent-metadata.yaml
+   agent-metadata.prod.yaml
    datasets/
    evaluators/
    results/
```
-- `agent-metadata.yaml` is the required source of truth for environment-specific project settings, agent names, registry details, and evaluation test cases.
+- `agent-metadata.yaml` is the preferred local/dev metadata file. Optional sidecar files such as `agent-metadata.prod.yaml` can hold a single prod or CI-targeted environment without mixing multiple environments in one file.
- `datasets/` and `evaluators/` are local cache folders. Reuse them when they are current, and ask before refreshing or overwriting them.
- See [Agent Metadata Contract](references/agent-metadata-contract.md) for the canonical schema and workflow rules.
@@ -85,35 +87,48 @@ Agent skills should run this step **only when they need configuration values the
### Step 1: Discover Agent Roots
-Search the workspace for `.foundry/agent-metadata.yaml`.
+Search the workspace for `.foundry/` folders that contain `agent-metadata.yaml` or `agent-metadata.<env>.yaml`.
- **One match** → use that agent root.
- **Multiple matches** → require the user to choose the target agent folder.
- **No matches** → for create/deploy workflows, seed a new `.foundry/` folder during setup; for all other workflows, stop and ask the user which agent source folder to initialize.
-### Step 2: Resolve Environment
+After selecting an agent root, keep all local `.foundry` cache inspection, source inspection, evaluator suggestions, dataset suggestions, and prompt-optimization context inside that folder only. Do **not** scan sibling agent folders unless the user explicitly switches roots.
-Read `.foundry/agent-metadata.yaml` and resolve the environment in this order:
+### Step 2: Select Metadata File and Resolve Environment
+
+Inside the selected agent root, choose the metadata file in this order:
+1. Metadata filename or path explicitly provided by the user or workflow
+2. If an explicit environment is already known and `.foundry/agent-metadata.<env>.yaml` exists, use that file
+3. `.foundry/agent-metadata.yaml`
+4. If multiple metadata files remain and no rule above selects one, prompt the user to choose
+
+Read the selected metadata file and resolve the environment in this order:
1. Environment explicitly named by the user
-2. Environment already selected earlier in the session
-3. `defaultEnvironment` from metadata
+2. If the selected metadata file defines exactly one environment, use it
+3. Environment already selected earlier in the session
+4. `defaultEnvironment` from metadata
-If the metadata contains multiple environments and none of the rules above selects one, prompt the user to choose. Keep the selected agent root and environment visible in every workflow summary.
+If the selected metadata file still contains multiple environments and none of the rules above selects one, prompt the user to choose. Keep the selected agent root, metadata file, and environment visible in every workflow summary.
+
+If the selected environment exposes older `testSuites[]` metadata but not `evaluationSuites[]`, treat `testSuites[]` as the source for this session and normalize each entry in memory to the `evaluationSuites[]` shape before continuing. If the metadata is older still and only exposes legacy `testCases[]`, normalize that list the same way. Preserve dataset and evaluator fields, keep any existing `tags`, and map legacy `priority` to `tags.tier` only when `tags.tier` is missing: `P0` -> `smoke`, `P1` -> `regression`, `P2` -> `coverage`.
### Step 3: Resolve Common Configuration
-Use the selected environment in `agent-metadata.yaml` as the primary source:
+Use the selected environment in the selected metadata file as the primary source:
| Metadata Field | Resolves To | Used By |
|----------------|-------------|---------|
| `environments.<env>.projectEndpoint` | Project endpoint | deploy, invoke, observe, trace, troubleshoot |
| `environments.<env>.agentName` | Agent name | invoke, observe, trace, troubleshoot |
| `environments.<env>.azureContainerRegistry` | ACR registry name / image URL prefix | deploy |
-| `environments.<env>.testCases[]` | Dataset + evaluator + threshold bundles | observe, eval-datasets |
+| `environments.<env>.evaluationSuites[]` | Dataset + evaluator + tag bundles | observe, eval-datasets |
### Step 4: Bootstrap Missing Metadata (Create/Deploy Only)
-If create/deploy is initializing a new `.foundry` workspace and metadata fields are still missing, check if `azure.yaml` exists in the project root. If found, run `azd env get-values` and use it to seed `agent-metadata.yaml` before continuing.
+If create/deploy is initializing a new `.foundry` workspace and metadata fields are still missing, check if `azure.yaml` exists in the project root. If found, run `azd env get-values` and use it to seed `agent-metadata.yaml` by default, or `agent-metadata.<env>.yaml` when the workflow explicitly targets a separate environment-specific file.
+
+On any metadata write (deploy, auto-setup, dataset refresh, or trace-to-dataset update), persist only `evaluationSuites[]` in the selected metadata file.
If the selected file is a preferred single-environment file, rewrite only that one environment block. If the selected file is a legacy multi-environment file, rewrite only the selected environment block. Never copy or merge environments across sibling metadata files automatically. If the selected environment still uses older `testSuites[]` or legacy `testCases[]`, rewrite it to `evaluationSuites[]` and remove migrated `priority` fields from the rewritten entries. | azd Variable | Seeds | |-------------|-------| @@ -124,7 +139,8 @@ If create/deploy is initializing a new `.foundry` workspace and metadata fields ### Step 5: Collect Missing Values Use the `ask_user` or `askQuestions` tool **only for values not resolved** from the user's message, session context, metadata, or azd bootstrap. Common values skills may need: -- **Agent root** — Target folder containing `.foundry/agent-metadata.yaml` +- **Agent root** — Target folder containing `.foundry/agent-metadata*.yaml` +- **Metadata file** — `agent-metadata.yaml` for local/dev, or an explicit sidecar such as `agent-metadata.prod.yaml` - **Environment** — `dev`, `prod`, or another environment key from metadata - **Project endpoint** — AI Foundry project endpoint URL - **Agent name** — Name of the target agent diff --git a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/create/references/toolbox.md b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/create/references/toolbox.md index 4b739ef5..a68db17c 100644 --- a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/create/references/toolbox.md +++ b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/create/references/toolbox.md @@ -7,9 +7,10 @@ Hosted agents access Foundry-managed tools through a **Toolbox MCP endpoint**. 
U | Property | Value | |----------|-------| | **Toolbox Docs** | https://learn.microsoft.com/azure/foundry/agents/how-to/tools/toolbox | -| **Hosted Agent + Toolbox + Agent Framework SDK** — default | https://github.com/microsoft-foundry/foundry-samples/tree/main/samples/python/hosted-agents/agent-framework/responses/04-foundry-toolbox | -| **Toolbox Samples — Python** | https://github.com/microsoft-foundry/foundry-samples/tree/main/samples/python/toolbox, https://github.com/microsoft-foundry/foundry-samples/tree/main/samples/python/hosted-agents/bring-your-own/responses | -| **Toolbox Samples — C# (.NET)** | https://github.com/microsoft-foundry/foundry-samples/tree/main/samples/csharp/toolbox | +| **Default Sample (Python)** | https://github.com/microsoft-foundry/foundry-samples/tree/main/samples/python/toolbox/maf | +| **Python Hosted Agent — `responses`** | https://github.com/microsoft-foundry/foundry-samples/tree/main/samples/python/hosted-agents/bring-your-own/responses | +| **Python Hosted Agent — `invocations`** | https://github.com/microsoft-foundry/foundry-samples/tree/main/samples/python/hosted-agents/bring-your-own/invocations | +| **C# (.NET) Samples** | https://github.com/microsoft-foundry/foundry-samples/tree/main/samples/csharp/toolbox | | **Supported Tool Types & Auth** | https://github.com/microsoft-foundry/foundry-samples/blob/main/samples/python/toolbox/SUPPORTED_TOOLBOX_TOOLS.md | ## Workflow @@ -40,10 +41,10 @@ Otherwise, ask: _"Do you have an existing Foundry Toolbox, or should I help crea | Method | When to use | References | |--------|-------------|------------| -| **azd** — preferred | AI can generate `agent.manifest.yaml` and run `azd provision` | [Toolbox docs — azd tab](https://learn.microsoft.com/azure/foundry/agents/how-to/tools/toolbox), sample [`agent.manifest.yaml`](https://github.com/microsoft-foundry/foundry-samples/blob/main/samples/python/hosted-agents/agent-framework/responses/04-foundry-toolbox/agent.manifest.yaml) | +| **azd** — preferred | AI can generate `agent.manifest.yaml` and run `azd provision` | [Toolbox docs — azd tab](https://learn.microsoft.com/azure/foundry/agents/how-to/tools/toolbox), [`toolbox/azd/`](https://github.com/microsoft-foundry/foundry-samples/tree/main/samples/python/toolbox/azd) (multiple scenario manifests covering tool types + auth patterns) | | **SDK (Python, .NET, JS)** | AI can generate code to create toolbox programmatically | [Toolbox docs](https://learn.microsoft.com/azure/foundry/agents/how-to/tools/toolbox), Python: [`sample_toolboxes_crud.py`](https://github.com/microsoft-foundry/foundry-samples/blob/main/samples/python/toolbox/sample_toolboxes_crud.py), C#: [`csharp/toolbox/crud-sample/`](https://github.com/microsoft-foundry/foundry-samples/tree/main/samples/csharp/toolbox/crud-sample) | | **REST API** | AI can generate HTTP calls | [Toolbox docs — REST API tab](https://learn.microsoft.com/azure/foundry/agents/how-to/tools/toolbox) | -| **Foundry Toolkit (VS Code)** — manual | Direct user to create via VS Code extension | [Foundry Toolkit](https://aka.ms/foundrytk), [Toolbox docs — VS Code tab](https://learn.microsoft.com/azure/foundry/agents/how-to/tools/toolbox) | +| **Foundry Toolkit (VS Code)** — manual | Direct user to create via VS Code extension | [Foundry Toolkit](https://aka.ms/foundrytk), [Tool Catalog](https://code.visualstudio.com/docs/intelligentapps/tool-catalog), [Toolbox docs — VS Code tab](https://learn.microsoft.com/azure/foundry/agents/how-to/tools/toolbox) | | **Foundry Portal** — manual | 
Direct user to create via portal UI | [Toolbox docs](https://learn.microsoft.com/azure/foundry/agents/how-to/tools/toolbox) | ### Step 2: Generate Agent Code with Toolbox @@ -52,18 +53,19 @@ The sample repo provides integration patterns for both Python and C#. Read the s **Python samples:** -| Pattern | When to use | Hosted Agent Sample | Standalone Sample | -|---------|-------------|--------------------|---------| -| **Agent Framework (MAF)** — recommended | Default choice for hosted agents | [`04-foundry-toolbox/`](https://github.com/microsoft-foundry/foundry-samples/tree/main/samples/python/hosted-agents/agent-framework/responses/04-foundry-toolbox) | [`toolbox/maf/`](https://github.com/microsoft-foundry/foundry-samples/tree/main/samples/python/toolbox/maf) | -| **LangGraph** | User already uses LangGraph | — | [`toolbox/langgraph/`](https://github.com/microsoft-foundry/foundry-samples/tree/main/samples/python/toolbox/langgraph) | -| **Copilot SDK** | GitHub Copilot SDK with toolbox tools | — | [`toolbox/copilot-sdk/`](https://github.com/microsoft-foundry/foundry-samples/tree/main/samples/python/toolbox/copilot-sdk) | -| **Bring Your Own (generic MCP)** | Any framework or custom code | [`bring-your-own/responses/bring-your-own-toolbox/`](https://github.com/microsoft-foundry/foundry-samples/tree/main/samples/python/hosted-agents/bring-your-own/responses/bring-your-own-toolbox) | — | +| Sample | Framework | Protocol | When to use | +|--------|-----------|----------|-------------| +| [`toolbox/maf/`](https://github.com/microsoft-foundry/foundry-samples/tree/main/samples/python/toolbox/maf) — recommended | Agent Framework (MAF) | Responses | **Default choice** | +| [`bring-your-own/responses/langgraph-toolbox/`](https://github.com/microsoft-foundry/foundry-samples/tree/main/samples/python/hosted-agents/bring-your-own/responses/langgraph-toolbox) | LangGraph (BYO) | Responses | LangGraph hosted agent with toolbox | +| [`toolbox/copilot-sdk/`](https://github.com/microsoft-foundry/foundry-samples/tree/main/samples/python/toolbox/copilot-sdk) | GitHub Copilot SDK | Responses | Copilot SDK with toolbox tools | +| [`bring-your-own/responses/bring-your-own-toolbox/`](https://github.com/microsoft-foundry/foundry-samples/tree/main/samples/python/hosted-agents/bring-your-own/responses/bring-your-own-toolbox) | Generic MCP (BYO) | Responses | Raw `httpx` MCP client — works with any framework | +| [`bring-your-own/invocations/toolbox/`](https://github.com/microsoft-foundry/foundry-samples/tree/main/samples/python/hosted-agents/bring-your-own/invocations/toolbox) | Generic MCP (BYO) | Invocations | Toolbox via Invocations protocol | **C# (.NET) samples:** -| Pattern | When to use | Sample | -|---------|-------------|--------| -| **Agent Framework (MAF)** — recommended | Default choice for .NET hosted agents | [`csharp/toolbox/maf/`](https://github.com/microsoft-foundry/foundry-samples/tree/main/samples/csharp/toolbox/maf) | +| Sample | Description | +|--------|-------------| +| [`csharp/toolbox/maf/`](https://github.com/microsoft-foundry/foundry-samples/tree/main/samples/csharp/toolbox/maf) — recommended | Agent Framework agent with toolbox MCP (Responses protocol) | **Notes:** (apply to all patterns, both Python and C#): - Auth: Inject a bearer token with scope `https://ai.azure.com/.default` on every request (Python: `httpx.Auth` subclass; C#: `DefaultAzureCredential` + `BearerTokenAuthenticationPolicy`). 
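+A minimal sketch of the auth bullet above for Python; the class name is illustrative, and the scope is the one stated in the notes:
+
+```python
+import httpx
+from azure.identity import DefaultAzureCredential
+
+class FoundryBearerAuth(httpx.Auth):
+    """Inject an Entra ID bearer token on every request to the Toolbox MCP endpoint."""
+
+    def __init__(self, scope: str = "https://ai.azure.com/.default"):
+        self._credential = DefaultAzureCredential()  # caches and refreshes tokens internally
+        self._scope = scope
+
+    def auth_flow(self, request: httpx.Request):
+        token = self._credential.get_token(self._scope)
+        request.headers["Authorization"] = f"Bearer {token.token}"
+        yield request
+
+# Usage: httpx.Client(auth=FoundryBearerAuth()) applies the header to every call
+```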
@@ -163,7 +165,7 @@ curl -sS -X POST "$TOOLBOX_URL" \
  -d '{"jsonrpc":"2.0","id":3,"method":"tools/call","params":{"name":"<tool-name>","arguments":{"query":"test"}}}' | jq .
```
-> For a Python-based debug client, see the `_McpToolboxClient` class in the [BYO toolbox sample `main.py`](https://github.com/microsoft-foundry/foundry-samples/blob/main/samples/python/hosted-agents/bring-your-own/responses/toolbox/main.py) — it implements `initialize`, `list_tools`, and `call_tool` using raw `httpx` calls.
+> For a Python-based debug client, see the `_McpToolboxClient` class in the [BYO toolbox sample `main.py`](https://github.com/microsoft-foundry/foundry-samples/blob/main/samples/python/hosted-agents/bring-your-own/responses/bring-your-own-toolbox/main.py) — it implements `initialize`, `list_tools`, and `call_tool` using raw `httpx` calls.
## Troubleshooting
diff --git a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/deploy/deploy.md b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/deploy/deploy.md
index 316465e5..8c3faba3 100644
--- a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/deploy/deploy.md
+++ b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/deploy/deploy.md
@@ -33,7 +33,7 @@ USE FOR: deploy agent to foundry, push agent to foundry, ship my agent, build an
### Step 1: Detect and Scan Project
-Get the project path from the project context (see Common: Project Context Resolution). Detect the project type by checking for these files:
+Get the project path from the selected agent root in the project context (see Common: Project Context Resolution). Detect the project type by checking for these files. Do **not** scan sibling agent folders.
| Project Type | Detection Files |
|--------------|-----------------|
@@ -44,7 +44,7 @@ Get the project path from the project context (see Common: Project Context Resol
| Java (Maven) | `pom.xml` |
| Java (Gradle) | `build.gradle` |
-Delegate an environment variable scan to a sub-agent. Provide the project path and project type. Search source files for these patterns:
+Delegate an environment variable scan to a sub-agent. Provide the selected agent root path and project type. Search source files inside that folder only for these patterns:
| Project Type | Patterns to Search |
|--------------|--------------------|
@@ -221,17 +221,17 @@ python -c "import base64,uuid;print(base64.urlsafe_b64encode(uuid.UUID('
-After a successful deployment, persist the deployment context to `<agent-root>/.foundry/agent-metadata.yaml` under the selected environment so future conversations (evaluation, trace analysis, monitoring) can reuse it automatically. See [Agent Metadata Contract](../../references/agent-metadata-contract.md) for the canonical schema.
+After a successful deployment, persist the deployment context to the selected metadata file under `<agent-root>/.foundry/` so future conversations (evaluation, trace analysis, monitoring) can reuse it automatically. Local/dev flows should default to `agent-metadata.yaml`; prod or CI-targeted flows can point at `agent-metadata.prod.yaml` or another explicit sidecar file. See [Agent Metadata Contract](../../references/agent-metadata-contract.md) for the canonical schema.
| Metadata Field | Purpose | Example |
|----------------|---------|---------|
| `environments.<env>.projectEndpoint` | Foundry project endpoint | `https://<account>.services.ai.azure.com/api/projects/<project>` |
| `environments.<env>.agentName` | Deployed agent name | `my-support-agent` |
| `environments.<env>.azureContainerRegistry` | ACR resource (hosted agents) | `myregistry.azurecr.io` |
-| `environments.<env>.testCases[]` | Evaluation bundles for datasets, evaluators, and thresholds | `smoke-core`, `trace-regressions` |
-| `environments.<env>.testCases[].datasetUri` | Remote Foundry dataset URI for shared eval workflows | `azureml://datastores/.../paths/...` |
+| `environments.<env>.evaluationSuites[]` | Evaluation bundles for datasets, evaluators, tags, and thresholds | `smoke-core`, `trace-regression-suite` |
+| `environments.<env>.evaluationSuites[].datasetUri` | Remote Foundry dataset URI for shared eval workflows | `azureml://datastores/.../paths/...` |
-If `agent-metadata.yaml` already exists, merge the selected environment instead of overwriting other environments or cached test cases without confirmation.
+If the selected metadata file is a preferred single-environment file, update only that one environment block and leave sibling metadata files untouched. If the selected metadata file is a legacy multi-environment file, merge the selected environment instead of overwriting other environments or cached evaluation suites without confirmation. If the selected environment still uses older `testSuites[]` or legacy `testCases[]`, rewrite that environment to `evaluationSuites[]` when you persist deployment metadata.
## After Deployment — Auto-Create Evaluators & Dataset
@@ -282,21 +282,22 @@ Read and follow [Generate Seed Evaluation Dataset](../eval-datasets/references/g
- Coverage distribution targets and generation rules
- Generation requirements that keep rows valid by construction (valid JSON, required fields, coverage targets, and minimum row count)
- Foundry registration steps (blob upload + `evaluation_dataset_create`)
-- Metadata updates for `agent-metadata.yaml` and `manifest.json`
+- Metadata updates for the selected metadata file and `manifest.json`
Do NOT skip the `expected_behavior` field. The generation reference handles the complete flow from query generation through Foundry registration.
-The local filename must start with the selected environment's Foundry agent name (`agentName` in `agent-metadata.yaml`) before adding stage, environment, or version suffixes.
+The local filename must start with the selected environment's Foundry agent name (`agentName` in the selected metadata file) before adding stage, environment, or version suffixes.
Use [Generate Seed Evaluation Dataset](../eval-datasets/references/generate-seed-dataset.md) as the single source of truth for seed dataset registration. It covers `project_connection_list` with `AzureStorageAccount`, key-based versus AAD upload, `evaluation_dataset_create` with `connectionName`, and saving the returned `datasetUri`.
-### 6. Persist Artifacts and Test Cases
+### 6. Persist Artifacts and Evaluation Suites
-Save evaluator definitions, local datasets, and evaluation outputs under `.foundry/`, then register or update test cases in `agent-metadata.yaml` for the selected environment:
+Save evaluator definitions, local datasets, and evaluation outputs under `.foundry/`, then register or update evaluation suites in the selected metadata file for the selected environment:
```text
.foundry/
  agent-metadata.yaml
+ agent-metadata.prod.yaml
  evaluators/
    <evaluator>.yaml
  datasets/
  results/
```
-Each test case should bundle one dataset with the evaluator list, thresholds, and a priority tag (`P0`, `P1`, or `P2`). Persist the local `datasetFile` and remote `datasetUri` together, and seed exactly one `P0` smoke test case after deployment.
+Each evaluation suite should bundle one dataset with the evaluator list, thresholds, and a `tags` map (for example, `tier: smoke`, `purpose: baseline`, `stage: seed`). Persist the local `datasetFile` and remote `datasetUri` together, and seed exactly one smoke suite after deployment. If the selected environment still uses older `testSuites[]` or legacy `testCases[]`, replace that list with `evaluationSuites[]` in the rewritten metadata and map legacy `priority` to `tags.tier` only when `tags.tier` is missing.
### 7. Prompt User
-*"Your agent is deployed and running in the selected environment. The `.foundry` cache now contains evaluators, a local seed dataset, the Foundry dataset registration metadata, and test-case metadata. Would you like to run an evaluation to identify optimization opportunities?"*
+*"Your agent is deployed and running in the selected environment. The `.foundry` cache now contains evaluators, a local seed dataset, the Foundry dataset registration metadata, and evaluation-suite metadata. Would you like to run an evaluation to identify optimization opportunities?"*
- **Yes** → follow the [observe skill](../observe/observe.md) starting at **Step 2 (Evaluate)** — cache and metadata are already prepared.
- **No** → stop. The user can return later.
diff --git a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/eval-datasets/eval-datasets.md b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/eval-datasets/eval-datasets.md
index c3c894a7..6d302d05 100644
--- a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/eval-datasets/eval-datasets.md
+++ b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/eval-datasets/eval-datasets.md
@@ -1,6 +1,6 @@
# Evaluation Datasets — Trace-to-Dataset Pipeline & Lifecycle Management
-Manage the full lifecycle of evaluation datasets for Foundry agents: harvesting production traces into local `.foundry` cache, curating versioned test datasets, tracking evaluation quality over time, and syncing approved updates back to Foundry when needed.
+Manage the full lifecycle of evaluation datasets for a Foundry agent: harvesting production traces into the selected agent root's local `.foundry` cache, curating versioned test datasets, tracking evaluation quality over time, and syncing approved updates back to Foundry when needed.
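+As a flavor of the harvesting step, a hedged KQL sketch against the standard App Insights schema (tables and columns are stock App Insights; tune the filters to the agent's telemetry):
+
+```kql
+// Candidate dataset rows: failed requests from the default 7-day window
+requests
+| where timestamp > ago(7d)
+| where success == false
+| project timestamp, name, resultCode, operation_Id
+| take 50
+```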
## When to Use This Skill
@@ -18,7 +18,7 @@ USE FOR: create dataset from traces, harvest traces into dataset, build test dat
| Key Foundry MCP tools | `evaluation_dataset_create`, `evaluation_dataset_get`, `evaluation_dataset_versions_get`, `evaluation_get`, `evaluation_comparison_create`, `evaluation_comparison_get` |
| Storage tools | `project_connection_list` (discover `AzureStorageAccount` connection), `project_connection_create` (add storage connection) |
| Azure services | Application Insights (via `monitor_resource_log_query`), Azure Blob Storage (dataset sync) |
-| Prerequisites | Agent deployed, `.foundry/agent-metadata.yaml` available, App Insights connected |
+| Prerequisites | Agent deployed, selected `.foundry/agent-metadata*.yaml` file available, App Insights connected |
| Local cache | `.foundry/datasets/`, `.foundry/results/`, `.foundry/evaluators/` |
## Entry Points
@@ -38,9 +38,9 @@
## Before Starting — Detect Current State
-1. Resolve the target agent root and environment from `.foundry/agent-metadata.yaml`.
+1. Resolve the target agent root, selected metadata file, and environment from `.foundry/agent-metadata*.yaml`.
2. Confirm the selected environment's `projectEndpoint`, `agentName`, and observability settings.
-3. Check `.foundry/datasets/` for existing datasets, `.foundry/results/` for evaluation history, and `.foundry/datasets/manifest.json` for lineage.
+3. Check `.foundry/datasets/`, `.foundry/results/`, and `.foundry/datasets/manifest.json` in the selected agent root only.
4. Check whether `evaluation_dataset_get` returns server-side datasets for the same environment.
5. Route to the appropriate entry point based on user intent.
@@ -66,13 +66,14 @@ Each cycle makes the test suite harder and more representative. Production failu
2. **Scope to time ranges.** Always include a time range in KQL queries (default: last 7 days for trace harvesting). Ask the user for the range if not specified.
3. **Require human review.** Never auto-commit harvested traces to a dataset without showing candidates to the user first. The curation step is mandatory.
4. **Use dataset naming conventions.** Follow the naming conventions below and keep local filenames aligned with the registered Foundry dataset name/version.
-5. **Treat local files as cache.** Reuse `.foundry/datasets/` and `.foundry/evaluators/` when they already match the selected environment. Offer refresh when the user asks or when remote state has changed.
-6. **Persist artifacts.** Save datasets to `.foundry/datasets/`, evaluation results to `.foundry/results/`, and track lineage in `.foundry/datasets/manifest.json`.
-7. **Keep test cases aligned.** Update the selected environment's `testCases[]` in `agent-metadata.yaml` whenever a dataset version, evaluator set, or threshold bundle changes.
-8. **Confirm before overwriting.** If a dataset version or cache file already exists, warn the user and ask for confirmation before replacing or refreshing it.
-9. **Sync to Foundry when requested or needed.** After saving datasets locally, refresh or register them in Foundry only when the user asks or the workflow needs shared/CI usage.
-10. **Never remove dataset rows or weaken evaluators to recover scores.** Score drops after a dataset update are expected - harder tests expose real gaps. Optimize the agent for new failure patterns; do not shrink the test suite.
-11. **Match eval parameter names exactly.** Use `evaluationId` when creating grouped runs, but use `evalId` for `evaluation_get` and comparison/trending lookups.
+5. **Treat local files as cache.** Reuse `.foundry/datasets/` and `.foundry/evaluators/` when they already match the selected environment in the selected agent root. Offer refresh when the user asks or when remote state has changed.
+6. **Stay inside the selected agent root.** After resolving the agent root, inspect only that folder's `.foundry/` cache and source context. Never merge sibling agent folders.
+7. **Persist artifacts.** Save datasets to `.foundry/datasets/`, evaluation results to `.foundry/results/`, and track lineage in `.foundry/datasets/manifest.json`.
+8. **Keep evaluation suites aligned.** Update the selected environment's `evaluationSuites[]` in the selected metadata file whenever a dataset version, evaluator set, or suite tags change. Local flows should default to `agent-metadata.yaml`; prod or CI-targeted flows can use `agent-metadata.<env>.yaml`. If the environment still uses older `testSuites[]` or legacy `testCases[]`, treat that list as the current suite source for this session and rewrite it as `evaluationSuites[]` on the next metadata save.
+9. **Confirm before overwriting.** If a dataset version or cache file already exists, warn the user and ask for confirmation before replacing or refreshing it.
+10. **Sync to Foundry when requested or needed.** After saving datasets locally, refresh or register them in Foundry only when the user asks or the workflow needs shared/CI usage.
+11. **Never remove dataset rows or weaken evaluators to recover scores.** Score drops after a dataset update are expected - harder tests expose real gaps. Optimize the agent for new failure patterns; do not shrink the test suite.
+12. **Match eval parameter names exactly.** Use `evaluationId` when creating grouped runs, but use `evalId` for `evaluation_get` and comparison/trending lookups.
## Dataset Naming and Metadata Conventions
@@ -83,9 +84,9 @@ Each cycle makes the test suite harder and more representative. Production failu
| Curated/refined dataset | `<agent>-curated` | `v<N>` | `.foundry/datasets/<agent>-curated-v<N>.jsonl` | `curated` |
| Production-ready dataset | `<agent>-prod` | `v<N>` | `.foundry/datasets/<agent>-prod-v<N>.jsonl` | `prod` |
-Here `<agent>` means the selected environment's `environments.<env>.agentName` from `agent-metadata.yaml`. If that deployed agent name already includes the environment (for example, `support-agent-dev`), do **not** append the environment key a second time.
+Here `<agent>` means the selected environment's `environments.<env>.agentName` from the selected metadata file. If that deployed agent name already includes the environment (for example, `support-agent-dev`), do **not** append the environment key a second time.
-Local dataset filenames must start with the selected Foundry agent name (`environments.<env>.agentName` in `agent-metadata.yaml`). Put stage and version suffixes **after** that prefix so cache files sort and group by agent first.
+Local dataset filenames must start with the selected Foundry agent name (`environments.<env>.agentName` in the selected metadata file). Put stage and version suffixes **after** that prefix so cache files sort and group by agent first.
Keep the Foundry dataset name stable across versions. Store the version only in `datasetVersion` (or manifest `version`) using the `v<N>` format, while local filenames keep the `-v<N>` suffix for cache readability.
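+A tiny sketch of that filename convention (the helper name is illustrative):
+
+```python
+def dataset_filename(agent: str, stage: str, version: int) -> str:
+    """Local cache path: <agent>-<stage>-v<N>.jsonl, with the agent prefix first so files group by agent."""
+    return f".foundry/datasets/{agent}-{stage}-v{version}.jsonl"
+
+# dataset_filename("support-bot-prod", "curated", 3)
+# -> ".foundry/datasets/support-bot-prod-curated-v3.jsonl"
+```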
@@ -94,9 +95,9 @@ Required metadata to track with every registered dataset: - `agent`: the agent name (for example, `hosted-agent-051-001`) - `stage`: `seed`, `traces`, `curated`, or `prod` - `version`: version string such as `v1`, `v2`, or `v3` -- `datasetUri`: always persist the Foundry dataset URI in `agent-metadata.yaml` alongside the local `datasetFile`, dataset name, and version +- `datasetUri`: always persist the Foundry dataset URI in the selected metadata file alongside the local `datasetFile`, dataset name, and version -> 💡 **Tip:** `evaluation_dataset_create` does not expose a first-class `tags` parameter in the current MCP surface. Persist `agent`, `stage`, and `version` in local metadata (`agent-metadata.yaml` and `.foundry/datasets/manifest.json`) so Foundry-side references stay aligned with the cache. +> 💡 **Tip:** `evaluation_dataset_create` does not expose a first-class `tags` parameter in the current MCP surface. Persist `agent`, `stage`, and `version` in local metadata (the selected metadata file plus `.foundry/datasets/manifest.json`) so Foundry-side references stay aligned with the cache. ## Related Skills diff --git a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/eval-datasets/references/dataset-organization.md b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/eval-datasets/references/dataset-organization.md index 59dfda5f..48a0fad6 100644 --- a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/eval-datasets/references/dataset-organization.md +++ b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/eval-datasets/references/dataset-organization.md @@ -11,16 +11,16 @@ Add metadata to each JSONL example to enable filtering and organization: | `category` | `edge-case`, `regression`, `happy-path`, `multi-turn`, `safety` | Test case classification | | `source` | `trace`, `synthetic`, `manual`, `feedback` | How the example was created | | `split` | `train`, `val`, `test` | Dataset split assignment | -| `priority` | `P0`, `P1`, `P2` | Severity/importance ranking | +| `tags` | key/value object such as `{"tier": "smoke", "purpose": "baseline"}` | Flexible suite-alignment and filtering labels | | `harvestRule` | `error`, `latency`, `low-eval`, `combined` | Which harvest template captured it | | `agentVersion` | `"1"`, `"2"`, etc. | Agent version when trace was captured | ### Example JSONL with Metadata ```json -{"query": "Reset my password", "ground_truth": "Navigate to Settings > Security > Reset Password", "metadata": {"category": "happy-path", "source": "manual", "split": "test", "priority": "P0"}} -{"query": "What happens if I delete my account while a refund is pending?", "metadata": {"category": "edge-case", "source": "trace", "split": "test", "priority": "P1", "harvestRule": "error"}} -{"query": "I want to harm myself", "ground_truth": "I'm concerned about your safety. 
Please contact...", "metadata": {"category": "safety", "source": "manual", "split": "test", "priority": "P0"}} +{"query": "Reset my password", "ground_truth": "Navigate to Settings > Security > Reset Password", "metadata": {"category": "happy-path", "source": "manual", "split": "test", "tags": {"tier": "smoke", "purpose": "baseline"}}} +{"query": "What happens if I delete my account while a refund is pending?", "metadata": {"category": "edge-case", "source": "trace", "split": "test", "tags": {"tier": "regression", "purpose": "coverage"}, "harvestRule": "error"}} +{"query": "I want to harm myself", "ground_truth": "I'm concerned about your safety. Please contact...", "metadata": {"category": "safety", "source": "manual", "split": "test", "tags": {"tier": "smoke", "purpose": "safety"}}} ``` ## Creating Splits @@ -34,7 +34,7 @@ When creating a new dataset, assign splits based on rules: | First 70% of examples | `train` | Bulk of data for development | | Next 15% of examples | `val` | Validation during optimization | | Final 15% of examples | `test` | Held-out for final evaluation | -| All `priority: P0` examples | `test` | Critical cases always in test | +| All `tags.tier == "smoke"` examples | `test` | Smoke suites always stay in test | | All `category: safety` examples | `test` | Safety always evaluated | ### Manual Split Assignment @@ -69,8 +69,11 @@ edge_cases = [e for e in examples if e.get("metadata", {}).get("category") == "e # Only safety test cases safety_cases = [e for e in examples if e.get("metadata", {}).get("category") == "safety"] -# Only P0 critical cases -p0_cases = [e for e in examples if e.get("metadata", {}).get("priority") == "P0"] +# Only smoke suites +smoke_cases = [ + e for e in examples + if e.get("metadata", {}).get("tags", {}).get("tier") == "smoke" +] ``` ### Filter by Source @@ -93,7 +96,7 @@ from collections import Counter categories = Counter(e.get("metadata", {}).get("category", "unknown") for e in examples) sources = Counter(e.get("metadata", {}).get("source", "unknown") for e in examples) splits = Counter(e.get("metadata", {}).get("split", "unassigned") for e in examples) -priorities = Counter(e.get("metadata", {}).get("priority", "none") for e in examples) +tiers = Counter(e.get("metadata", {}).get("tags", {}).get("tier", "none") for e in examples) ``` Present as a table: @@ -103,7 +106,7 @@ Present as a table: | **Category** | happy-path: 20, edge-case: 15, regression: 8, safety: 5, multi-turn: 10 | 58 total | | **Source** | trace: 30, synthetic: 18, manual: 10 | 58 total | | **Split** | train: 40, val: 9, test: 9 | 58 total | -| **Priority** | P0: 12, P1: 25, P2: 21 | 58 total | +| **Tier** | smoke: 12, regression: 25, coverage: 21 | 58 total | ## Next Steps diff --git a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/eval-datasets/references/dataset-versioning.md b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/eval-datasets/references/dataset-versioning.md index 52495448..105ca549 100644 --- a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/eval-datasets/references/dataset-versioning.md +++ b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/eval-datasets/references/dataset-versioning.md @@ -8,7 +8,7 @@ Use the pattern `--v`: | Component | Values | Example | |-----------|--------|---------| -| `` | Selected environment's `agentName` from `agent-metadata.yaml` | `support-bot-prod` | +| `` | Selected environment's `agentName` from the selected metadata file | `support-bot-prod` 
| | `` | `traces`, `synthetic`, `manual`, `combined` | `traces` | | `v` | Incremental version number | `v3` | diff --git a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/eval-datasets/references/eval-trending.md b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/eval-datasets/references/eval-trending.md index b4b3596d..7328a375 100644 --- a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/eval-datasets/references/eval-trending.md +++ b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/eval-datasets/references/eval-trending.md @@ -5,7 +5,7 @@ Track evaluation metrics across multiple runs and versions to visualize improvem ## Prerequisites - At least 2 evaluation runs in the same evaluation group (same `evaluationId` when created) -- Project endpoint and selected environment available in `.foundry/agent-metadata.yaml` +- Project endpoint and selected environment available in the selected `.foundry/agent-metadata*.yaml` file > ⚠️ **Eval-group immutability:** Trend a group only when its evaluator set and thresholds stayed fixed across runs. If either changed, start a new evaluation group and track that history separately. diff --git a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/eval-datasets/references/generate-seed-dataset.md b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/eval-datasets/references/generate-seed-dataset.md index 0ba0d139..24a48030 100644 --- a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/eval-datasets/references/generate-seed-dataset.md +++ b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/eval-datasets/references/generate-seed-dataset.md @@ -11,7 +11,7 @@ Generate a seed evaluation dataset for a Foundry agent by producing realistic, d ## Prerequisites - Agent deployed and running (or local `agent.yaml` available with instructions and tool definitions) -- `.foundry/agent-metadata.yaml` resolved with `projectEndpoint` and `agentName` +- Selected `.foundry/agent-metadata*.yaml` file resolved with `projectEndpoint` and `agentName` ## Dataset Row Schema @@ -32,9 +32,9 @@ Example row: ## Step 1 — Gather Agent Context -Collect the agent's full context from `agent_get` or local `agent.yaml`: +Collect the agent's full context from `agent_get` or local `agent.yaml` in the selected agent root: -- **Agent name** — from `agent-metadata.yaml` +- **Agent name** — from the selected metadata file - **Instructions** — the system prompt / instructions field - **Tools** — list of tools with names, descriptions, and parameter schemas - **Protocols** — supported protocols (responses, a2a, mcp) @@ -80,7 +80,7 @@ Save the generated JSONL to: .foundry/datasets/-eval-seed-v1.jsonl ``` -The filename must start with `agentName` from `agent-metadata.yaml`, followed by `-eval-seed-v1`. +The filename must start with `agentName` from the selected metadata file, followed by `-eval-seed-v1`. ## Step 3 — Register in Foundry @@ -129,16 +129,21 @@ evaluation_dataset_create( - `agent`: `` - `stage`: `seed` - `version`: `v1` -6. Save the returned `datasetUri` in both `agent-metadata.yaml` (under the active test case) and `.foundry/datasets/manifest.json`. +6. Save the returned `datasetUri` in both the selected metadata file (under the active evaluation suite) and `.foundry/datasets/manifest.json`. 
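+A hedged sketch of that bookkeeping step: the manifest shape is illustrative, and only the tracked fields (`agent`, `stage`, `version`, `datasetUri`) come from the conventions in this skill.
+
+```python
+import json
+from pathlib import Path
+
+entry = {
+    "name": "support-bot-eval-seed",  # stable Foundry dataset name (illustrative)
+    "agent": "support-bot",
+    "stage": "seed",
+    "version": "v1",
+    "datasetFile": ".foundry/datasets/support-bot-eval-seed-v1.jsonl",
+    "datasetUri": "azureml://datastores/workspaceblobstore/paths/...",  # value returned by evaluation_dataset_create
+}
+
+manifest_path = Path(".foundry/datasets/manifest.json")
+manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else {"datasets": []}
+manifest["datasets"].append(entry)  # one lineage entry per registered version
+manifest_path.write_text(json.dumps(manifest, indent=2))
+```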
## Step 4 — Update Metadata -Update `agent-metadata.yaml` for the selected environment's `testCases[]`: +Update the selected metadata file for the selected environment's `evaluationSuites[]`: + +If the selected environment still uses older `testSuites[]` or legacy `testCases[]`, rewrite that environment to `evaluationSuites[]` as part of this update. Preserve dataset/evaluator fields and map legacy `priority` to `tags.tier` only when `tags.tier` is missing. ```yaml -testCases: +evaluationSuites: - id: smoke-core - priority: P0 + tags: + tier: smoke + purpose: baseline + stage: seed dataset: <agentName>-eval-seed datasetVersion: v1 datasetFile: .foundry/datasets/<agentName>-eval-seed-v1.jsonl diff --git a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/eval-datasets/references/trace-to-dataset.md b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/eval-datasets/references/trace-to-dataset.md index d78f9ca9..175746e4 100644 --- a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/eval-datasets/references/trace-to-dataset.md +++ b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/eval-datasets/references/trace-to-dataset.md @@ -14,9 +14,11 @@ Extract production traces from App Insights using KQL, transform them into evalu ## Prerequisites - App Insights resource resolved (see [trace skill](../../trace/trace.md) Before Starting) -- Agent root, environment, and project endpoint available in `.foundry/agent-metadata.yaml` +- Agent root, selected metadata file, environment, and project endpoint available from `.foundry/agent-metadata*.yaml` - Time range confirmed with user (default: last 7 days) +When a repo contains multiple agent roots, this workflow updates only the selected agent root's `.foundry/datasets/`, `.foundry/results/`, and metadata files. Do **not** merge sibling agent folders. + > 💡 **Run all KQL queries** using **`monitor_resource_log_query`** (Azure MCP tool) against the App Insights resource. This is preferred over delegating to the `azure-kusto` skill. > ⚠️ **Always pass `subscription` explicitly** to Azure MCP tools — they don't extract it from resource IDs. diff --git a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/observe/observe.md b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/observe/observe.md index fb4cddfd..3a1261a6 100644 --- a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/observe/observe.md +++ b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/observe/observe.md @@ -1,21 +1,21 @@ # Agent Observability Loop -Orchestrate the full eval-driven optimization cycle for a Foundry agent. This skill manages the **multi-step workflow** for a selected agent root and environment: reusing or refreshing `.foundry` cache, auto-creating evaluators, generating test datasets, running batch evals, clustering failures, optimizing prompts, redeploying, and comparing versions. Use this skill instead of calling individual `azure` MCP evaluation tools manually. +Orchestrate the full eval-driven optimization cycle for a Foundry agent. This skill manages the **multi-step workflow** for a selected agent root and environment: reusing or refreshing `.foundry` cache in that folder only, auto-creating evaluators, generating test datasets, running batch evals, clustering failures, optimizing prompts, redeploying, and comparing versions. Use this skill instead of calling individual `azure` MCP evaluation tools manually.
## When to Use This Skill -USE FOR: evaluate my agent, run an eval, test my agent, check agent quality, run batch evaluation, analyze eval results, why did my eval fail, cluster failures, improve agent quality, optimize agent prompt, compare agent versions, re-evaluate after changes, set up CI/CD evals, agent monitoring, eval-driven optimization. +USE FOR: evaluate my agent, run an eval, test my agent, check agent quality, run batch evaluation, analyze eval results, why did my eval fail, cluster failures, improve agent quality, optimize agent prompt, compare agent versions, re-evaluate after changes, set up CI/CD evals, agent monitoring, eval-driven optimization, set up continuous monitoring, production quality monitoring, why are eval scores dropping. -> ⚠️ **DO NOT manually call** `evaluation_agent_batch_eval_create`, `evaluator_catalog_create`, `evaluation_comparison_create`, or `prompt_optimize` **without reading this skill first.** This skill defines required pre-checks, environment selection, cache reuse, artifact persistence, and multi-step orchestration that the raw tools do not enforce. +> ⚠️ **DO NOT manually call** `evaluation_agent_batch_eval_create`, `evaluator_catalog_create`, `evaluation_comparison_create`, `prompt_optimize`, or `continuous_eval_create` **without reading this skill first.** This skill defines required pre-checks, environment selection, cache reuse, artifact persistence, and multi-step orchestration that the raw tools do not enforce. ## Quick Reference | Property | Value | |----------|-------| | MCP server | `azure` | -| Key Foundry MCP tools | `evaluator_catalog_get`, `evaluation_agent_batch_eval_create`, `evaluator_catalog_create`, `evaluation_comparison_create`, `prompt_optimize`, `agent_update` | +| Key MCP tools | `evaluator_catalog_get`, `evaluation_agent_batch_eval_create`, `evaluator_catalog_create`, `evaluation_comparison_create`, `evaluation_get`, `prompt_optimize`, `agent_update`, `continuous_eval_create`, `continuous_eval_get`, `continuous_eval_delete` | | Prerequisite | Agent deployed and running (use [deploy skill](../deploy/deploy.md)) | -| Local cache | `.foundry/agent-metadata.yaml`, `.foundry/evaluators/`, `.foundry/datasets/`, `.foundry/results/` | +| Local cache | selected `.foundry/agent-metadata*.yaml` file, `.foundry/evaluators/`, `.foundry/datasets/`, `.foundry/results/` | ## Entry Points @@ -27,15 +27,16 @@ USE FOR: evaluate my agent, run an eval, test my agent, check agent quality, run | "Why did my eval fail?" / "Analyze results" | [Step 3: Analyze](references/analyze-results.md) | | "Improve my agent" / "Optimize prompt" | [Step 4: Optimize](references/optimize-deploy.md) | | "Compare agent versions" | [Step 5: Compare](references/compare-iterate.md) | -| "Set up CI/CD evals" | [Step 6: CI/CD](references/cicd-monitoring.md) | +| "Set up CI/CD evals" | [Step 6: CI/CD & Monitoring](references/cicd-monitoring.md) | +| "Enable continuous monitoring" / "Set up production monitoring" / "Evaluation results dropping" | [Continuous Eval](references/continuous-eval.md) | -> ⚠️ **Important:** Before running any evaluation (Step 2), always resolve the selected agent root and environment, then inspect `.foundry/agent-metadata.yaml` plus `.foundry/evaluators/` and `.foundry/datasets/`. If the cache is missing, stale, or the user wants to refresh it, route through [Step 1: Auto-Setup](references/deploy-and-setup.md) first — even if the user only asked to "evaluate." 
+> ⚠️ **Important:** Before running any evaluation (Step 2), always resolve the selected agent root, metadata file, and environment, then inspect that metadata file plus `.foundry/evaluators/` and `.foundry/datasets/` in that root only. If the cache is missing, stale, or the user wants to refresh it, route through [Step 1: Auto-Setup](references/deploy-and-setup.md) first — even if the user only asked to "evaluate." Do **not** merge `.foundry` cache or source context from sibling agent folders or sibling metadata files. ## Before Starting — Detect Current State -1. Resolve the target agent root and environment from `.foundry/agent-metadata.yaml`. -2. Use `agent_get` to verify the environment's agent exists and is running. -3. Inspect the selected environment's `testCases[]` plus cached files under `.foundry/evaluators/` and `.foundry/datasets/`. +1. Resolve the target agent root, selected metadata file, and environment from `.foundry/agent-metadata*.yaml`. +2. Use `agent_get` and `agent_container_status_get` to verify the environment's agent exists and is running. +3. Inspect the selected environment's `evaluationSuites[]` plus cached files under `.foundry/evaluators/` and `.foundry/datasets/` in the selected agent root only. If the metadata still uses older `testSuites[]` or legacy `testCases[]`, normalize that list to evaluation suites first using the shared migration rule. 4. Use `evaluation_get` to check for existing eval runs. 5. Jump to the appropriate entry point. @@ -46,31 +47,33 @@ USE FOR: evaluate my agent, run an eval, test my agent, check agent quality, run -> ask: "Run an evaluation to identify optimization opportunities?" 2. Evaluate (batch eval run) 3. Download and cluster failures -4. Pick a category or test case to optimize +4. Pick a category or evaluation suite to optimize 5. Optimize prompt 6. Deploy new version (after user sign-off) -7. Re-evaluate (same env + same test case) +7. Re-evaluate (same env + same evaluation suite) 8. Compare versions -> decide which to keep 9. Loop to next category or finish -10. Prompt: enable CI/CD evals and continuous production monitoring +10. Prompt: enable CI/CD pipeline evals and/or continuous production monitoring ``` ## Behavioral Rules -1. **Keep context visible.** Restate the selected agent root and environment in setup, evaluation, and result summaries. -2. **Reuse cache before regenerating.** Prefer existing `.foundry/evaluators/` and `.foundry/datasets/` when they match the active environment. Ask before refreshing or overwriting them. -3. **Start with P0 test cases.** Run the selected environment's `P0` test cases before broader `P1` or `P2` coverage unless the user explicitly chooses otherwise. -4. **Auto-poll in background.** After creating eval runs or starting containers, poll in a background terminal. Only surface the final result. -5. **Confirm before changes.** Show diff/summary before modifying agent code, refreshing cache, or deploying. Wait for sign-off. -6. **Prompt for next steps.** After each step, present options. Never assume the path forward. -7. **Write scripts to files.** Python scripts go in `scripts/` - no inline code blocks. -8. **Persist eval artifacts.** Save local artifacts to `.foundry/evaluators/`, `.foundry/datasets/`, and `.foundry/results/` for version tracking and comparison. -9. **Use exact eval parameter names.** Use `evaluationId` only on batch-eval create calls that group runs; use `evalId` on `evaluation_get` and `evaluation_comparison_create`; use `evalRunId` for a specific run lookup. 
-10. **Check existing evaluators before creating new ones.** Always call `evaluator_catalog_get` before proposing or creating evaluators. Present the existing catalog to the user and map existing evaluators to the agent's evaluation needs. Only create a new evaluator when no existing one covers the required dimension. This applies to every workflow that involves evaluator selection - initial setup, re-evaluation, and optimization loops. -11. **Use correct parameters when deleting evaluators.** `evaluator_catalog_delete` requires both `name` (not `evaluatorName`) and `version`. When cleaning up redundant evaluators, always pass the explicit version string. If an evaluator has multiple versions (for example, `v1`, `v2`, `v3`), delete each version individually - there is no "delete all versions" shortcut. Discover version numbers with `evaluator_catalog_get` before attempting deletions. -12. **Use a two-phase evaluator strategy.** Phase 1 is built-in only: `relevance`, `task_adherence`, `intent_resolution`, `indirect_attack`, and `builtin.tool_call_accuracy` when the agent uses tools. Generate seed datasets with `query` and `expected_behavior` so Phase 2 can reuse or create targeted custom evaluators only after the first run exposes gaps. -13. **Account for LLM judge knowledge cutoff.** When the agent uses real-time data sources (web search, Bing Grounding, live APIs), the LLM judge's training cutoff means it cannot verify current facts. Custom evaluators that score factual accuracy or behavioral adherence will produce systematic false negatives - flagging the agent's real-time data as "fabricated" or "beyond knowledge cutoff." Mitigations: (a) instruct the evaluator prompt to accept sourced claims it cannot verify, (b) use `expected_behavior` rubrics that describe the shape of a good answer rather than specific facts, (c) flag suspected knowledge-cutoff false negatives in the failure analysis rather than treating them as real failures. -14. **Show Data Viewer deeplinks (for VS Code runtime only).** Append a Data Viewer deeplink immediately after reference to a dataset file or evaluation result file in your response. Format: "[Open in Data Viewer](vscode://ms-windows-ai-studio.windows-ai-studio/open_data_viewer?file=&source=microsoft-foundry-skill) for details and perform analysis". This applies to files in `.foundry/datasets/`, `.foundry/results/`. +1. **Keep context visible.** Restate the selected agent root, metadata file, and environment in setup, evaluation, and result summaries. +2. **Stay inside the selected agent root.** Once the agent root is resolved, inspect only that folder's `.foundry/` cache and source tree when suggesting tools, datasets, evaluators, or prompt optimizations. Do not merge sibling agent folders. +3. **Reuse cache before regenerating.** Prefer existing `.foundry/evaluators/` and `.foundry/datasets/` when they match the active environment. Ask before refreshing or overwriting them. +4. **Start with smoke suites.** Run evaluation suites tagged `tier=smoke` before broader `tier=regression` or `tier=coverage` suites unless the user explicitly chooses otherwise. +5. **Auto-poll in background.** After creating eval runs or starting containers, poll in a background terminal. Only surface the final result. +6. **Confirm before changes.** Show diff/summary before modifying agent code, refreshing cache, or deploying. Wait for sign-off. +7. **Prompt for next steps.** After each step, present options. Never assume the path forward. +8. 
**Write scripts to files.** Python scripts go in `scripts/` - no inline code blocks. +9. **Persist eval artifacts.** Save local artifacts to `.foundry/evaluators/`, `.foundry/datasets/`, and `.foundry/results/` for version tracking and comparison. +10. **Migrate legacy metadata on write.** If the selected environment still uses older `testSuites[]` or legacy `testCases[]`, treat that list as the suite source for the current run, then rewrite that environment to `evaluationSuites[]` on the next metadata update. Preserve dataset/evaluator fields and map `priority` to `tags.tier` only when `tags.tier` is missing. +11. **Use exact eval parameter names.** Use `evaluationId` only on batch-eval create calls that group runs; use `evalId` on `evaluation_get` and `evaluation_comparison_create`; use `evalRunId` for a specific run lookup. +12. **Check existing evaluators before creating new ones.** Always call `evaluator_catalog_get` before proposing or creating evaluators. Present the existing catalog to the user and map existing evaluators to the agent's evaluation needs. Only create a new evaluator when no existing one covers the required dimension. This applies to every workflow that involves evaluator selection - initial setup, re-evaluation, and optimization loops. +13. **Use correct parameters when deleting evaluators.** `evaluator_catalog_delete` requires both `name` (not `evaluatorName`) and `version`. When cleaning up redundant evaluators, always pass the explicit version string. If an evaluator has multiple versions (for example, `v1`, `v2`, `v3`), delete each version individually - there is no "delete all versions" shortcut. Discover version numbers with `evaluator_catalog_get` before attempting deletions. +14. **Use a two-phase evaluator strategy.** Phase 1 is built-in only: `relevance`, `task_adherence`, `intent_resolution`, `indirect_attack`, and `builtin.tool_call_accuracy` when the agent uses tools. Generate seed datasets with `query` and `expected_behavior` so Phase 2 can reuse or create targeted custom evaluators only after the first run exposes gaps. +15. **Account for LLM judge knowledge cutoff.** When the agent uses real-time data sources (web search, Bing Grounding, live APIs), the LLM judge's training cutoff means it cannot verify current facts. Custom evaluators that score factual accuracy or behavioral adherence will produce systematic false negatives - flagging the agent's real-time data as "fabricated" or "beyond knowledge cutoff." Mitigations: (a) instruct the evaluator prompt to accept sourced claims it cannot verify, (b) use `expected_behavior` rubrics that describe the shape of a good answer rather than specific facts, (c) flag suspected knowledge-cutoff false negatives in the failure analysis rather than treating them as real failures. +16. **Show Data Viewer deeplinks (for VS Code runtime only).** Append a Data Viewer deeplink immediately after a reference to a dataset file or evaluation result file in your response. Format: "[Open in Data Viewer](vscode://ms-windows-ai-studio.windows-ai-studio/open_data_viewer?file=<file-path>&source=microsoft-foundry-skill) for details and further analysis". This applies to files in `.foundry/datasets/`, `.foundry/results/`.
## Two-Phase Evaluator Strategy @@ -107,3 +110,4 @@ promptText: | | "Analyze production traces" / "Search conversations" / "Find errors in App Insights" | [trace skill](../trace/trace.md) | | "Debug hosted agent issues" / "Hosted-agent logs" | [troubleshoot skill](../troubleshoot/troubleshoot.md) | | "Deploy or redeploy agent" | [deploy skill](../deploy/deploy.md) | +| "Enable continuous evaluation" / "Set up ongoing monitoring" | [Continuous Eval](references/continuous-eval.md) (reference within this skill) | diff --git a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/observe/references/analyze-results.md b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/observe/references/analyze-results.md index 7007f555..606b2801 100644 --- a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/observe/references/analyze-results.md +++ b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/observe/references/analyze-results.md @@ -116,19 +116,19 @@ Analyze every row in the results. Group failures into clusters: Produce a prioritized action table: -| Priority | Cluster | Suggested Action | -|----------|---------|------------------| -| P0 | Runtime errors or failing `P0` test cases | Check hosted-agent logs or fix blockers first | -| P1 | Incorrect answers on key flows | Optimize prompt or tool instructions | -| P2 | Incomplete answers or broader quality gaps | Optimize prompt or expand context | -| P3 | Tool call failures | Fix tool definitions or instructions | -| P4 | Safety violations | Add guardrails to instructions | +| Focus | Cluster | Suggested Action | +|-------|---------|------------------| +| Runtime blockers | Runtime errors or failing suites tagged `tier=smoke` | Check container logs or fix blockers first | +| Key regressions | Incorrect answers on suites tagged `purpose=regression` or `tier=smoke` | Optimize prompt or tool instructions | +| Broader quality gaps | Incomplete answers or coverage-oriented suites | Optimize prompt or expand context | +| Tooling issues | Tool call failures | Fix tool definitions or instructions | +| Safety issues | Safety violations | Add guardrails to instructions | -**Rule:** Prioritize runtime errors first, then sort by test-case priority (`P0` before `P1` before `P2`) and count × severity. +**Rule:** Prioritize runtime errors first, then suites tagged `tier=smoke`, then suites tagged `purpose=regression`, then broader coverage suites by count × severity. ## Step 5 — Dive Into Category -When the user wants to inspect a specific cluster, display the individual rows: test-case ID, input query, the agent's original response, evaluator scores, and failure reason. Let the user confirm which category or test case to optimize. +When the user wants to inspect a specific cluster, display the individual rows: evaluation-suite ID, input query, the agent's original response, evaluator scores, and failure reason. Let the user confirm which category or evaluation suite to optimize. 
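Before moving on, the ordering rule above can be made concrete with a short sketch in the same style as the earlier filtering snippets. The cluster record shape is an assumption for illustration — adapt it to however the clustered failures were collected:

```python
# Hypothetical cluster records assembled while grouping failures
clusters = [
    {"name": "tool-call failures", "runtime_error": False, "tier": "coverage", "purpose": "coverage", "count": 4, "severity": 2},
    {"name": "wrong refund-policy answers", "runtime_error": False, "tier": "smoke", "purpose": "regression", "count": 6, "severity": 3},
    {"name": "container crash on startup", "runtime_error": True, "tier": "smoke", "purpose": "baseline", "count": 2, "severity": 5},
]

def cluster_rank(cluster):
    # Runtime errors first, then tier=smoke, then purpose=regression,
    # then descending count x severity within each band (False sorts before True).
    return (
        not cluster["runtime_error"],
        cluster["tier"] != "smoke",
        cluster["purpose"] != "regression",
        -(cluster["count"] * cluster["severity"]),
    )

for cluster in sorted(clusters, key=cluster_rank):
    print(cluster["name"])
```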
## Next Steps diff --git a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/observe/references/cicd-monitoring.md b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/observe/references/cicd-monitoring.md index c1fcd15f..af20ea58 100644 --- a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/observe/references/cicd-monitoring.md +++ b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/observe/references/cicd-monitoring.md @@ -1,36 +1,52 @@ -# Step 11 — Enable CI/CD Evals & Continuous Monitoring +# Step 6 — CI/CD Evals & Continuous Production Monitoring -After confirming the final agent version, prompt with two options: +After confirming the final agent version through the observe loop, present two complementary monitoring options. The user may choose one, both, or neither. -## Option 1 — CI/CD Evaluations +## Option 1 — CI/CD Pipeline Evaluations (Pre-Deploy Gate) *"Would you like to add automated evaluations to your CI/CD pipeline so every deployment is evaluated before going live?"* +CI/CD evals run batch evaluations as part of your deployment pipeline, catching regressions **before** they reach production. + If yes, generate a GitHub Actions workflow (for example, `.github/workflows/agent-eval.yml`) that: 1. Triggers on push to `main` or on pull request -2. Reads test-case definitions from `.foundry/agent-metadata.yaml` -3. Reads evaluator definitions from `.foundry/evaluators/` and test datasets from `.foundry/datasets/` -4. Runs `evaluation_agent_batch_eval_create` against the newly deployed agent version -5. Fails the workflow if any evaluator score falls below the configured thresholds for the selected environment/test case -6. Posts a summary as a PR comment or workflow annotation +2. Accepts a metadata-file input or environment variable such as `FOUNDRY_METADATA_FILE` and defaults it to `.foundry/agent-metadata.yaml` +3. Reads evaluation-suite definitions from the selected metadata file (for example, `.foundry/agent-metadata.prod.yaml` for prod CI) +4. Reads evaluator definitions from `.foundry/evaluators/` and test datasets from `.foundry/datasets/` +5. Runs `evaluation_agent_batch_eval_create` against the newly deployed agent version +6. Fails the workflow if any evaluator score falls below the configured thresholds for the environment and evaluation suite resolved from that metadata file +7. Posts a summary as a PR comment or workflow annotation -Use repository secrets for the selected environment's project endpoint and Azure credentials. Confirm the workflow file with the user before committing. +Use repository secrets for the selected environment's project endpoint and Azure credentials, and keep the metadata filename explicit in the workflow so prod rollouts do not depend on the local/dev default file. Confirm the workflow file with the user before committing. -## Option 2 — Continuous Production Monitoring +## Option 2 — Continuous Production Monitoring (Post-Deploy) *"Would you like to set up continuous evaluations to monitor your agent's quality in production?"* -If yes, generate a scheduled GitHub Actions workflow (for example, `.github/workflows/agent-eval-scheduled.yml`) that: +Continuous evaluation uses Foundry-native MCP tools to automatically assess agent responses on an ongoing basis — no additional CI/CD pipeline setup is needed for this option. This catches regressions that emerge **after** deployment from changing data, user patterns, or upstream service drift. 
+ +### Enable Continuous Evaluation + +Use the [continuous evaluation reference](continuous-eval.md) to configure monitoring. The workflow: + +1. **Check existing config** — call `continuous_eval_get` to see if monitoring is already active. +2. **Select evaluators** — recommend starting with the same evaluators used in batch evals for consistent comparison: + - **Quality evaluators** (require `deploymentName`): e.g., groundedness, coherence, relevance, task_adherence + - **Safety evaluators**: e.g., violence, indirect_attack, hate_unfairness +3. **Enable** — call `continuous_eval_create` with the selected evaluators. The tool auto-detects agent kind and configures the appropriate backend (real-time for prompt agents, scheduled for hosted agents). +4. **Confirm** — present the returned configuration to the user. + +### Acting on Monitoring Results + +Monitoring is only complete when score drops trigger investigation and remediation. -1. Runs on a cron schedule (ask the user preference: daily, weekly, and so on) -2. Evaluates the current production agent version using stored test cases, evaluators, and datasets -3. Saves results to `.foundry/results/<env>/` -4. Opens a GitHub issue or sends a notification if any score degrades below thresholds For instructions on how to read evaluation scores, triage regressions, and verify fixes, see [Acting on Results](continuous-eval.md#acting-on-results). -The user may choose one, both, or neither. The observe loop does not end at deployment. Continuous monitoring closes the loop: **observe → optimize → deploy → monitor → observe**. Always offer to set up monitoring after completing an optimization cycle. ## Reference - [Azure AI Foundry Cloud Evaluation](https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/develop/cloud-evaluation) - [Hosted Agents](https://learn.microsoft.com/en-us/azure/ai-foundry/agents/concepts/hosted-agents) +- [Continuous Evaluation Reference](continuous-eval.md) diff --git a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/observe/references/compare-iterate.md b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/observe/references/compare-iterate.md index a6114b77..9253ab8c 100644 --- a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/observe/references/compare-iterate.md +++ b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/observe/references/compare-iterate.md @@ -2,7 +2,7 @@ ## Step 8 — Re-Evaluate -Use **`evaluation_agent_batch_eval_create`** with the **same `evaluationId`** as the baseline run. This places both runs in the same eval group for comparison. Use the same local test dataset (from `.foundry/datasets/`) and evaluator bundle from the selected environment/test case. Update `agentVersion` to the new version. +Use **`evaluation_agent_batch_eval_create`** with the **same `evaluationId`** as the baseline run. This places both runs in the same eval group for comparison. Use the same local test dataset (from the selected agent root's `.foundry/datasets/`) and evaluator bundle from the selected environment/evaluation suite. Update `agentVersion` to the new version. > ⚠️ **Parameter switch reminder:** Re-evaluation creation uses `evaluationId`, but follow-up calls to `evaluation_get` and `evaluation_comparison_create` must use `evalId`.
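A minimal sketch of that parameter switch, with the argument lists trimmed to the fields relevant here (the other required arguments are omitted for brevity):

```yaml
# Re-run: creation groups runs by evaluationId
Tool: evaluation_agent_batch_eval_create
Arguments:
  evaluationId: <eval group id from the baseline run>
  agentVersion: "2"                  # updated to the new version

# Follow-up lookup and comparison use evalId for the same group
Tool: evaluation_get
Arguments:
  evalId: <same eval group id>

Tool: evaluation_comparison_create
Arguments:
  evalId: <same eval group id>
```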
diff --git a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/observe/references/continuous-eval.md b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/observe/references/continuous-eval.md new file mode 100644 index 00000000..d78e2766 --- /dev/null +++ b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/observe/references/continuous-eval.md @@ -0,0 +1,244 @@ +# Continuous Evaluation + +Enable, configure, disable, or remove continuous evaluation for a Foundry agent. Continuous evaluation automatically assesses agent responses on an ongoing basis using configured evaluators (e.g., groundedness, coherence, violence detection). This is typically the final step in the [observe loop](../observe.md) after deploying and batch-evaluating an agent — it keeps production quality visible without manual intervention. + +## When to Use This Skill + +USE FOR: enable continuous evaluation, disable continuous evaluation, configure continuous eval, set up monitoring evaluators, check continuous eval status, delete continuous eval, update evaluators, change sampling rate, change eval interval, production monitoring, ongoing agent quality. + +DO NOT USE FOR: running a one-off batch evaluation (use [observe](../observe.md)), querying traces (use [trace](../../trace/trace.md)), creating evaluator definitions (use [observe](../observe.md) Step 1). + +## Quick Reference + +| Property | Value | +|----------|-------| +| MCP server | `azure` | +| Key MCP tools | `continuous_eval_create`, `continuous_eval_get`, `continuous_eval_delete`, `agent_get`, `evaluation_get` | +| Prerequisite | Agent must exist in the project | +| Local cache | `.foundry/agent-metadata.yaml` | + +## Entry Points + +| User Intent | Start At | +|-------------|----------| +| "Enable continuous eval" / "Set up monitoring evaluators" | [Before Starting](#before-starting--detect-current-state) → [Enable or Update](#enable-or-update) | +| "Is continuous eval running?" / "Check eval status" | [Before Starting](#before-starting--detect-current-state) → [Check Current State](#check-current-state) | +| "Change evaluators" / "Update sampling rate" | [Before Starting](#before-starting--detect-current-state) → [Check Current State](#check-current-state) → [Enable or Update](#enable-or-update) | +| "Pause evaluations" / "Disable continuous eval" | [Before Starting](#before-starting--detect-current-state) → [Disable](#disable) | +| "Stop evaluating this agent" / "Delete continuous eval" | [Before Starting](#before-starting--detect-current-state) → [Delete](#delete) | +| "Scores are dropping" / "Act on monitoring results" | [Before Starting](#before-starting--detect-current-state) → [Acting on Results](#acting-on-results) | + +> ⚠️ **Important:** Always run [Before Starting](#before-starting--detect-current-state) to resolve the project endpoint and agent name before calling any MCP tools. + +## Before Starting — Detect Current State + +1. Resolve the target agent root and environment from `.foundry/agent-metadata.yaml` using the [Project Context Resolution](../../../SKILL.md#agent-project-context-resolution) workflow. +2. Extract `projectEndpoint` and `agentName` from the selected environment. If not available in metadata, use `ask_user` to collect them. +3. Use `agent_get` to verify the agent exists and note its kind (prompt or hosted). +4. Use `continuous_eval_get` to check for existing continuous evaluation configuration. +5. Jump to the appropriate entry point based on user intent. 
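The pre-checks in steps 3–4 can be summarized in the same tool-call style; the argument names for `agent_get` are assumed to mirror the other tools on this page:

```yaml
# 1. Verify the agent exists and note its kind (prompt vs hosted)
Tool: agent_get
Arguments:
  projectEndpoint: <projectEndpoint from the selected environment>
  agentName: <agentName from the selected environment>

# 2. Check for an existing continuous eval configuration before creating one
Tool: continuous_eval_get
Arguments:
  projectEndpoint: <same projectEndpoint>
  agentName: <same agentName>
```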
+ +## How It Works + +The tool auto-detects the agent's kind and uses the appropriate backend: + +- **Prompt agents** — evaluation runs are triggered automatically each time the agent produces a response. Parameters: `samplingRate` (percentage of responses to evaluate), `maxHourlyRuns`. +- **Hosted agents** — evaluation runs are triggered on an hourly schedule, pulling recent traces from App Insights. Parameters: `intervalHours` (hours between runs), `maxTraces` (max data points per run). + +The user does not need to choose between these — the tool handles it based on agent kind. + +## Behavioral Rules + +1. **Always resolve context first.** Run [Before Starting](#before-starting--detect-current-state) before calling any MCP tool. Never assume a project endpoint or agent name. +2. **Check before creating.** Always call `continuous_eval_get` before `continuous_eval_create` to determine whether to create or update. Present existing configuration to the user. +3. **Confirm evaluator selection.** Present the evaluator list to the user before enabling. Distinguish quality evaluators (require `deploymentName`) from safety evaluators (do not). +4. **Prompt for next steps.** After each operation, present options. Never assume the path forward (e.g., after enabling, offer to check status or adjust parameters). +5. **Keep context visible.** Include the project endpoint, agent name, and environment in operation summaries. +6. **Use `continuous_eval_get` for IDs.** The `delete` tool requires a `configId` — always retrieve it from the `get` response rather than asking the user to provide it. +7. **Surface the remediation path.** When presenting continuous eval results that show score degradation, always offer to route into the [observe skill](../observe.md) for diagnosis and optimization. Monitoring without action is incomplete. +8. **Handle agent-not-found.** If `agent_get` returns a not-found error, stop the continuous eval flow. Offer to route to the [deploy skill](../../deploy/deploy.md) to create the agent first, or ask the user to verify the agent name and environment. +9. **Handle auth and endpoint errors.** If `agent_get` or `continuous_eval_create` returns a permission or authentication error, verify the project endpoint, environment, and user access. Do not suggest creating the agent — the issue is access, not existence. +10. **Validate `deploymentName` before enabling.** Do not assume `gpt-4o` exists. If quality evaluators are selected, verify a chat-capable deployment is available in the project. If none exists, stop and explain that quality evaluators cannot be enabled until a compatible deployment is provisioned. +11. **Handle invalid evaluator names.** If `continuous_eval_create` returns an invalid evaluator name error, call `evaluator_catalog_get` to list available evaluators and present valid options. Do not retry with the same arguments. +12. **Handle unexpected empty config.** If `continuous_eval_get` returns an empty list for an agent the user believes has continuous eval configured, verify the agent name and project endpoint match the intended environment in `.foundry/agent-metadata.yaml`. The configuration may exist under a different environment or resolved `agentName`. + +## Operations + +### Check Current State + +Before enabling or modifying, check what's already configured: + +```yaml +Tool: continuous_eval_get +Arguments: + projectEndpoint: + agentName: +``` + +- Empty list → no continuous eval configured. Proceed to [Enable or Update](#enable-or-update). 
+- Non-empty list → agent already has continuous eval. Present the configuration and ask what the user wants to change. + +> ⚠️ **Empty result is not proof of absence.** If the user expects a config to exist but the list is empty, verify the project endpoint and agent name match the intended environment before concluding it was never set up. + +### Enable or Update + +**Replace Semantics**: `continuous_eval_create` always creates a new evaluation group with the provided evaluators and points the evaluation rule at it. Always pass the complete desired configuration on every call — omitted evaluators are dropped, not preserved. + +> ⚠️ **Do not assume `gpt-4o` exists.** Before setting `deploymentName`, verify a chat-capable deployment is available in the project. If none exists, quality evaluators cannot be enabled — only safety evaluators (which do not require a deployment) will work. + +```yaml +Tool: continuous_eval_create +Arguments: + projectEndpoint: + agentName: + evaluatorNames: ["groundedness", "coherence", "fluency"] # Illustrative — align with your batch eval evaluators + deploymentName: "gpt-4o" # Required for quality evaluators + enabled: true # Set false to disable without deleting +``` + +**Evaluator selection guidance:** +- **Quality evaluators** (require `deploymentName`): coherence, fluency, relevance, groundedness, intent_resolution, task_adherence, tool_call_accuracy +- **Safety evaluators** (no `deploymentName` needed): violence, sexual, self_harm, hate_unfairness, indirect_attack, code_vulnerability, protected_material +- Custom evaluators from the project's evaluator catalog are also supported by name. + +**Optional parameters by agent kind:** + +| Parameter | Applies To | Description | Default | +|-----------|-----------|-------------|---------| +| `samplingRate` | Prompt | Percentage of responses to evaluate (1-100) | All responses | +| `maxHourlyRuns` | Prompt | Cap on evaluation runs per hour | No limit | +| `intervalHours` | Hosted | Hours between evaluation runs | 1 | +| `maxTraces` | Hosted | Max data points per evaluation run | 1000 | +| `scenario` | Prompt | Evaluation scenario: `standard` (quality and safety metrics, default) or `business` (business success metrics). An agent can have one of each simultaneously. | `standard` | + +### Disable + +To temporarily disable without changing configuration, pass the configuration currently in use along with `enabled: false`. Because `continuous_eval_create` has replace semantics, omitting parameters will change the configuration when re-enabled. The `continuous_eval_get` response does not include evaluator names directly — they are stored in the linked evaluation group — so retrieve them via `evaluation_get` first. If multiple configurations are returned in the `continuous_eval_get` response, present the list to the user and ask which to target. 
+ +```yaml +# Step 1: Get the evalId, then retrieve current evaluators from the eval group +Tool: continuous_eval_get +Arguments: + projectEndpoint: + agentName: +# Note the evalId from the response +``` + +```yaml +Tool: evaluation_get +Arguments: + projectEndpoint: + evalId: +# Note the evaluator names from the evaluation group's testing criteria +``` + +```yaml +# Step 2: Disable with the same evaluators +Tool: continuous_eval_create +Arguments: + projectEndpoint: + agentName: + evaluatorNames: ["groundedness", "coherence", "fluency"] # Must match current config + deploymentName: "gpt-4o" + enabled: false +``` + +### Delete + +To permanently remove continuous evaluation configuration: + +```yaml +Tool: continuous_eval_delete +Arguments: + projectEndpoint: + configId: + agentName: +``` + +Always call `continuous_eval_get` first to retrieve the `id` field of the configuration to delete. If multiple configurations are returned, present the list to the user and ask which to target. + +## Acting on Results + +Continuous evaluation generates ongoing scores — but monitoring is only useful when you **act** on what it reveals. This section covers how to consume evaluation results and the remediation loop when scores degrade. + +### Step 1: Read Evaluation Scores + +The `continuous_eval_get` response includes an `evalId` that links to the evaluation group. Use this to retrieve actual run results: + +```yaml +Tool: continuous_eval_get +Arguments: + projectEndpoint: + agentName: +# Note the evalId from the response +``` + +```yaml +Tool: evaluation_get +Arguments: + projectEndpoint: + evalId: + isRequestForRuns: true +# Returns evaluation runs with per-evaluator scores +``` + +Review the run results for score trends. Each run contains scores for every configured evaluator. Look for: +- **Scores below threshold** — any evaluator consistently scoring below your acceptable baseline +- **Score degradation over time** — scores that were previously healthy but are trending downward +- **Safety flags** — any non-zero safety evaluator scores that indicate harmful content + +### Step 2: Triage the Regression + +1. **Identify the failing evaluators.** From the evaluation runs, note which specific evaluators are scoring low (e.g., `groundedness` dropping from 4.2 to 2.8). +2. **Correlate with traces.** Use the [trace skill](../../trace/trace.md) to search App Insights for the conversations that triggered low scores. Look for patterns: specific query types, tool-call failures, or grounding gaps. +3. **Compare to baseline.** If batch eval results exist in `.foundry/results/`, compare continuous eval scores against the last known-good batch run to determine whether this is a new regression or a pre-existing gap. 
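For the baseline comparison in item 3, a small sketch of the idea — the results-file layout and score shape under `.foundry/results/` are assumptions for illustration; adapt them to however your runs were persisted:

```python
import json
from pathlib import Path

# Assumed layout: .foundry/results/<env>/<run>.json containing {"scores": {"groundedness": 4.2, ...}}
def load_scores(path):
    return json.loads(Path(path).read_text())["scores"]

baseline = load_scores(".foundry/results/dev/baseline-batch-run.json")
current = load_scores(".foundry/results/dev/latest-continuous-run.json")

for evaluator, base_score in baseline.items():
    cur = current.get(evaluator)
    if cur is not None and cur < base_score - 0.5:  # illustrative regression margin
        print(f"{evaluator}: {base_score} -> {cur} (possible regression)")
```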
+ +### Step 3: Remediate via the Observe Loop + +Once you understand the failure pattern, use the [observe skill](../observe.md) to fix it: + +| Symptom | Action | +|---------|--------| +| Quality scores dropping (coherence, relevance, task_adherence) | Run [Step 3: Analyze](analyze-results.md) to cluster failures, then [Step 4: Optimize](optimize-deploy.md) to improve the prompt | +| Safety evaluators flagging (violence, indirect_attack) | Review flagged traces via [trace skill](../../trace/trace.md), then update agent instructions or tool definitions to address the pattern | +| Grounding failures | Check whether the agent's data sources are still accessible and returning expected results; update knowledge index or tool configuration | +| Scores fluctuating after a deploy | Run [Step 5: Compare](compare-iterate.md) between the current and previous agent version to isolate the regression | + +### Step 4: Verify the Fix + +After deploying a fix through the observe loop: + +1. **Re-run a batch eval** via [observe](../observe.md) Step 2 against the same test cases to confirm the fix. +2. **Read continuous eval scores** from the next evaluation cycle using `evaluation_get` with the `evalId` — verify scores have recovered. +3. **Adjust evaluators if needed.** If the regression exposed a gap in evaluator coverage, use `continuous_eval_create` to update the configuration with additional or refined evaluators. + +> 💡 **Tip:** The continuous eval → observe → deploy → continuous eval cycle is the core production quality loop. Continuous eval detects; observe diagnoses and fixes; continuous eval verifies. + +## Response Format + +All tools return a unified `ContinuousEvalConfig` shape. The `get` tool returns a list; `create` returns a single object. + +| Field | Description | Present For | +|-------|-------------|-------------| +| `id` | Configuration identifier (needed for delete) | All | +| `displayName` | Human-readable name | All | +| `enabled` | Whether evaluation is active | All | +| `evalId` | Linked evaluation group containing evaluator definitions | All | +| `agentName` | Target agent name | All | +| `status` | Provisioning status | Hosted only | +| `scenario` | Evaluation scenario (`standard` or `business`) | Prompt only | +| `samplingRate` | Percentage of responses evaluated | Prompt only | +| `maxHourlyRuns` | Cap on runs per hour | Prompt only | +| `intervalHours` | Hours between scheduled runs | Hosted only | +| `maxTraces` | Max data points per run | Hosted only | +| `createdAt` | Creation timestamp | All | +| `createdBy` | Creator identity | All | + +## Related Skills + +| User Intent | Skill | +|-------------|-------| +| "Evaluate my agent" / "Run a batch eval" | [observe skill](../observe.md) | +| "Scores are dropping" / "Diagnose and fix quality regression" | [observe skill](../observe.md) (Steps 3–5) | +| "Analyze production traces" / "Find flagged conversations" | [trace skill](../../trace/trace.md) | +| "Deploy my agent" / "Redeploy after fix" | [deploy skill](../../deploy/deploy.md) | diff --git a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/observe/references/deploy-and-setup.md b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/observe/references/deploy-and-setup.md index 17b8855e..7d2f4549 100644 --- a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/observe/references/deploy-and-setup.md +++ b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/observe/references/deploy-and-setup.md @@ -1,6 +1,6 @@ 
# Step 1 — Auto-Setup Evaluators & Dataset -> **This step runs automatically after deployment.** If the agent was deployed via the [deploy skill](../../deploy/deploy.md), `.foundry` cache and metadata may already be configured. Check `.foundry/evaluators/`, `.foundry/datasets/`, and `.foundry/agent-metadata.yaml` for existing artifacts before re-creating them. +> **This step runs automatically after deployment.** If the agent was deployed via the [deploy skill](../../deploy/deploy.md), `.foundry` cache and metadata may already be configured. Check `.foundry/evaluators/`, `.foundry/datasets/`, and the selected metadata file under the selected agent root before re-creating them. > > If the agent is **not yet deployed**, follow the [deploy skill](../../deploy/deploy.md) first. It handles project detection, Dockerfile generation, ACR build, agent creation, verification, and auto-creates `.foundry` cache after a successful deployment. @@ -10,15 +10,15 @@ ### 1. Read Agent Instructions -Use **`agent_get`** (or local `agent.yaml`) to understand the agent's purpose and capabilities. +Use **`agent_get`** (or local `agent.yaml` in the selected agent root) to understand the agent's purpose and capabilities. ### 2. Reuse or Refresh Cache -Inspect `.foundry/evaluators/`, `.foundry/datasets/`, and the selected environment's `testCases[]`. +Inspect `.foundry/evaluators/`, `.foundry/datasets/`, and the selected environment's `evaluationSuites[]` in the selected agent root only. Do **not** merge sibling agent folders. If the selected environment still uses older `testSuites[]` or legacy `testCases[]`, normalize that list to evaluation suites first and plan to rewrite that environment as `evaluationSuites[]` when this step persists metadata. - **Cache is current** -> reuse it and summarize what is already available. - **Cache is missing or stale** -> refresh it after confirming with the user. -- **User explicitly asks for refresh** -> rebuild and rewrite only the selected environment's cache. +- **User explicitly asks for refresh** -> rebuild and rewrite only the selected environment's cache in the selected agent root. ### 2.5 Discover Existing Evaluators @@ -64,19 +64,20 @@ Use **`model_deployment_get`** to list the selected project's actual model deplo ### 6. Generate Local Test Dataset -Generate the seed rows directly from the agent's instructions and tool capabilities you already resolved during setup. Do **not** call the identified chat-capable deployment for dataset generation; reserve that deployment for quality evaluators. Save the initial seed file to `.foundry/datasets/<agentName>-eval-seed-v1.jsonl` with each line containing at minimum `query` and `expected_behavior` fields (optionally `context`, `ground_truth`). +Generate the seed rows directly from the selected agent root's instructions and tool capabilities you already resolved during setup. Do **not** call the identified chat-capable deployment for dataset generation; reserve that deployment for quality evaluators. Save the initial seed file to `.foundry/datasets/<agentName>-eval-seed-v1.jsonl` with each line containing at minimum `query` and `expected_behavior` fields (optionally `context`, `ground_truth`). -The local filename must start with the selected environment's Foundry agent name (`agentName` in `agent-metadata.yaml`) before adding stage, environment, or version suffixes. +The local filename must start with the selected environment's Foundry agent name (`agentName` in the selected metadata file) before adding stage, environment, or version suffixes.
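For example, a single seed row might look like this — the values are illustrative; only `query` and `expected_behavior` are required:

```jsonl
{"query": "Can I get a refund after 30 days?", "expected_behavior": "Cites the refund policy, states the 30-day limit clearly, and offers escalation to a human agent", "context": "Customer on a standard plan", "ground_truth": "Refunds are available only within 30 days of purchase"}
```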
Include `expected_behavior` even though Phase 1 uses built-in evaluators only. That field pre-positions the seed dataset for Phase 2 custom evaluators if the first run reveals gaps that need a per-query behavioral rubric. Use [Generate Seed Evaluation Dataset](../../eval-datasets/references/generate-seed-dataset.md) as the single source of truth for registration. It covers `project_connection_list` with `AzureStorageAccount`, key-based versus AAD upload, `evaluation_dataset_create` with `connectionName`, and saving the returned `datasetUri`. -### 7. Persist Artifacts and Test Cases +### 7. Persist Artifacts and Evaluation Suites ```text .foundry/ agent-metadata.yaml + agent-metadata.prod.yaml evaluators/ <name>.yaml datasets/ @@ -87,9 +88,9 @@ .json ``` -Save evaluator definitions to `.foundry/evaluators/<name>.yaml`, test data to `.foundry/datasets/*.jsonl`, and create or update test cases in `agent-metadata.yaml` with: +Save evaluator definitions to `.foundry/evaluators/<name>.yaml`, test data to `.foundry/datasets/*.jsonl`, and create or update evaluation suites in the selected metadata file with: - `id` -- `priority` (`P0`, `P1`, `P2`) +- `tags` (freeform key/value map, for example `tier: smoke`, `purpose: baseline`, `stage: seed`) - `dataset` (for example, `<agentName>-eval-seed`) - `datasetVersion` (for example, `v1`) - `datasetFile` (for example, `.foundry/datasets/<agentName>-eval-seed-v1.jsonl`) @@ -97,10 +98,12 @@ - tag values for `agent`, `stage`, and `version` - evaluator names and thresholds +If the selected environment still uses older `testSuites[]` or legacy `testCases[]`, replace that list with `evaluationSuites[]` in the rewritten metadata. Preserve dataset/evaluator fields and map `priority` to `tags.tier` only when `tags.tier` is missing. + > ⚠️ **Show Data Viewer deeplinks (for VS Code runtime only):** Append a Data Viewer deeplink immediately after a reference to a dataset file in your response. Format: "[Open in Data Viewer](vscode://ms-windows-ai-studio.windows-ai-studio/open_data_viewer?file=<file-path>&source=microsoft-foundry-skill) for details and further analysis". ### 8. Prompt User -*"Your agent is deployed and running in the selected environment. The `.foundry` cache now contains evaluators, a local seed dataset, the Foundry dataset registration metadata, and test-case metadata. Would you like to run an evaluation to identify optimization opportunities?"* +*"Your agent is deployed and running in the selected environment. The `.foundry` cache now contains evaluators, a local seed dataset, the Foundry dataset registration metadata, and evaluation-suite metadata. Would you like to run an evaluation to identify optimization opportunities?"* If yes -> proceed to [Step 2: Evaluate](evaluate-step.md). If no -> stop.
diff --git a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/observe/references/evaluate-step.md b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/observe/references/evaluate-step.md index 348cd144..1e3e8afa 100644 --- a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/observe/references/evaluate-step.md +++ b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/observe/references/evaluate-step.md @@ -3,23 +3,23 @@ ## Prerequisites - Agent deployed and running in the selected environment -- `.foundry/agent-metadata.yaml` loaded for the active agent root +- Selected `.foundry/agent-metadata*.yaml` file loaded for the active agent root - Evaluators configured (from [Step 1](deploy-and-setup.md) or `.foundry/evaluators/`) -- Local test dataset available (from `.foundry/datasets/`) -- Test case selected from the environment's `testCases[]` +- Local test dataset available (from the selected agent root's `.foundry/datasets/`) +- Evaluation suite selected from the environment's `evaluationSuites[]` ## Run Evaluation -Use **`evaluation_agent_batch_eval_create`** to run the selected test case's evaluators against the selected environment's agent. +Use **`evaluation_agent_batch_eval_create`** to run the selected evaluation suite's evaluators against the selected environment's agent. ### Required Parameters | Parameter | Description | |-----------|-------------| -| `projectEndpoint` | Azure AI Project endpoint from `agent-metadata.yaml` | +| `projectEndpoint` | Azure AI Project endpoint from the selected metadata file | | `agentName` | Agent name for the selected environment | | `agentVersion` | Agent version (string, for example `"1"`) | -| `evaluatorNames` | Array of evaluator names from the selected test case | +| `evaluatorNames` | Array of evaluator names from the selected evaluation suite | ### Test Data Options @@ -37,9 +37,9 @@ Before setting `deploymentName`, use **`model_deployment_get`** to list the sele |-----------|-------------| | `deploymentName` | Required for quality evaluators (the LLM-judge model) | | `evaluationId` | Pass existing eval group ID to group runs for comparison | -| `evaluationName` | Name for a new evaluation group; include environment and test-case ID | +| `evaluationName` | Name for a new evaluation group; include environment and evaluation-suite ID | -> **Important:** Use `evaluationId` on `evaluation_agent_batch_eval_create` (not `evalId`) to group runs. Run `P0` test cases first unless the user chooses a broader priority band. +> **Important:** Use `evaluationId` on `evaluation_agent_batch_eval_create` (not `evalId`) to group runs. Run suites tagged `tier=smoke` first unless the user chooses a broader suite tag or a specific suite. > ⚠️ **Eval-group immutability:** Reuse an existing `evaluationId` only when the dataset comparison setup is unchanged for that group: same evaluator list and same thresholds. If evaluator definitions or thresholds change, create a **new** evaluation group instead of adding another run to the old one. 
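Putting the required and optional parameters together, a first smoke-suite run might look like the sketch below. Values are illustrative (reusing the example environment from the metadata contract), and the test-data arguments are omitted — supply them per the Test Data Options above:

```yaml
Tool: evaluation_agent_batch_eval_create
Arguments:
  projectEndpoint: https://contoso.services.ai.azure.com/api/projects/support-dev
  agentName: support-agent-dev
  agentVersion: "1"
  evaluatorNames: ["relevance", "task_adherence", "intent_resolution"]
  deploymentName: gpt-4o                    # only if this chat-capable deployment actually exists
  evaluationName: dev-smoke-core-baseline   # environment + evaluation-suite id
```

On a re-run against the same suite, pass the existing group's `evaluationId` instead of a new `evaluationName`, subject to the immutability rule above.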
diff --git a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/trace/references/search-traces.md b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/trace/references/search-traces.md index 295fcd13..486e724a 100644 --- a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/trace/references/search-traces.md +++ b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/trace/references/search-traces.md @@ -5,7 +5,7 @@ Search agent traces at the conversation level. Returns summaries grouped by conv ## Prerequisites - App Insights resource resolved (see [trace.md](../trace.md) Before Starting) -- Selected agent root and environment confirmed from `.foundry/agent-metadata.yaml` +- Selected agent root, metadata file, and environment confirmed from `.foundry/agent-metadata*.yaml` - Time range confirmed with user (default: last 24 hours) ## Search by Conversation ID @@ -160,4 +160,4 @@ union dependencies, requests, exceptions, traces ## After Successful Query -> 📝 **Reminder:** If this is the first trace query in this session, ensure App Insights connection info was persisted to `.foundry/agent-metadata.yaml` for the selected environment (see [trace.md — Before Starting](../trace.md#before-starting--resolve-app-insights-connection)). +> 📝 **Reminder:** If this is the first trace query in this session, ensure App Insights connection info was persisted to the selected metadata file for the selected environment (see [trace.md — Before Starting](../trace.md#before-starting--resolve-app-insights-connection)). diff --git a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/trace/trace.md b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/trace/trace.md index 1a25f030..0c1a8cd9 100644 --- a/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/trace/trace.md +++ b/.github/plugins/azure-skills/skills/microsoft-foundry/foundry-agent/trace/trace.md @@ -19,7 +19,7 @@ USE FOR: analyze agent traces, search agent conversations, find failing traces, | Related skills | `troubleshoot` (hosted-agent logs), `eval-datasets` (trace harvesting) | | Preferred query tool | `monitor_resource_log_query` (Azure MCP) - use for App Insights KQL queries | | OTel conventions | [GenAI Spans](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/), [Agent Spans](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/) | -| Local metadata | `.foundry/agent-metadata.yaml` | +| Local metadata | selected `.foundry/agent-metadata*.yaml` file | ## Entry Points @@ -35,9 +35,9 @@ USE FOR: analyze agent traces, search agent conversations, find failing traces, ## Before Starting — Resolve App Insights Connection -1. Resolve the target agent root and environment from `.foundry/agent-metadata.yaml`. -2. Check `environments.<env>.observability.applicationInsightsConnectionString` or `environments.<env>.observability.applicationInsightsResourceId` in the metadata. -3. If observability settings are missing, use `project_connection_list` to discover App Insights linked to the Foundry project, then persist the chosen resource back to `environments.<env>.observability` in `agent-metadata.yaml` before querying. +1. Resolve the target agent root, selected metadata file, and environment from `.foundry/agent-metadata*.yaml`. +2. Check `environments.<env>.observability.applicationInsightsConnectionString` or `environments.<env>.observability.applicationInsightsResourceId` in the selected metadata file. +3.
If observability settings are missing, use `project_connection_list` to discover App Insights linked to the Foundry project, then persist the chosen resource back to `environments.<env>.observability` in the selected metadata file before querying. 4. Confirm the selected App Insights resource and environment with the user before querying. 5. Use **`monitor_resource_log_query`** (Azure MCP tool) to execute KQL queries against the App Insights resource. This is preferred over delegating to the `azure-kusto` skill. Pass the App Insights resource ID and the KQL query directly. diff --git a/.github/plugins/azure-skills/skills/microsoft-foundry/references/agent-metadata-contract.md b/.github/plugins/azure-skills/skills/microsoft-foundry/references/agent-metadata-contract.md index 2eb9e8d8..6bdbdc84 100644 --- a/.github/plugins/azure-skills/skills/microsoft-foundry/references/agent-metadata-contract.md +++ b/.github/plugins/azure-skills/skills/microsoft-foundry/references/agent-metadata-contract.md @@ -8,28 +8,39 @@ Use this contract for every agent source folder that participates in Microsoft F <agent-root>/ .foundry/ agent-metadata.yaml + agent-metadata.prod.yaml datasets/ evaluators/ results/ ``` -- `agent-metadata.yaml` is the required source of truth for environment-specific Foundry configuration. +- `agent-metadata.yaml` is the preferred local/dev metadata file. +- Optional sidecar files such as `agent-metadata.prod.yaml` can hold a single prod or CI-targeted environment without mixing multiple environments in one file. - `datasets/` and `evaluators/` are local cache folders. Reuse existing files when they are current, and ask before refreshing or overwriting them. - `results/` stores local evaluation outputs and comparison artifacts by environment. +## Metadata File Model + +| File | Typical use | Notes | +|------|-------------|-------| +| `.foundry/agent-metadata.yaml` | Preferred local/dev metadata | Default choice for local workflows when no file is specified | +| `.foundry/agent-metadata.<env>.yaml` | Optional prod/CI or modular environment-specific metadata | Prefer this when the workflow explicitly targets that environment and the file exists | + +New setups should prefer **one environment per metadata file** while keeping the current schema shape (`defaultEnvironment` + `environments.<env>`) for compatibility. Legacy multi-environment `agent-metadata.yaml` files remain supported.
+
 ## Environment Model
 
 | Field | Required | Purpose |
 |-------|----------|---------|
-| `defaultEnvironment` | ✅ | Environment used when the user does not choose one explicitly |
+| `defaultEnvironment` | ✅ | Default environment inside the selected metadata file; in preferred single-environment files it should match the only environment key |
 | `environments.<env>.projectEndpoint` | ✅ | Foundry project endpoint for that environment |
 | `environments.<env>.agentName` | ✅ | Deployed Foundry agent name |
 | `environments.<env>.azureContainerRegistry` | ✅ for hosted agents | ACR used for deployment and image refresh |
 | `environments.<env>.observability.applicationInsightsResourceId` | Recommended | App Insights resource for trace workflows |
 | `environments.<env>.observability.applicationInsightsConnectionString` | Optional | Connection string when needed for tooling |
-| `environments.<env>.testCases[]` | ✅ | Dataset + local/remote references + evaluator + threshold bundles for evaluation workflows |
+| `environments.<env>.evaluationSuites[]` | ✅ | Dataset + local/remote references + evaluator + tag bundles for evaluation workflows |
 
-## Example `agent-metadata.yaml`
+## Example `.foundry/agent-metadata.yaml` (local/dev)
 
 ```yaml
 defaultEnvironment: dev
@@ -40,9 +51,12 @@ environments:
     azureContainerRegistry: contosoregistry.azurecr.io
     observability:
       applicationInsightsResourceId: /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Insights/components/support-dev-ai
-    testCases:
+    evaluationSuites:
       - id: smoke-core
-        priority: P0
+        tags:
+          tier: smoke
+          purpose: baseline
+          stage: seed
         dataset: support-agent-dev-eval-seed
         datasetVersion: v1
         datasetFile: .foundry/datasets/support-agent-dev-eval-seed-v1.jsonl
@@ -55,8 +69,11 @@
         - name: citation_quality
           threshold: 0.9
           definitionFile: .foundry/evaluators/citation-quality.yaml
-      - id: trace-regressions
-        priority: P1
+      - id: trace-regression-suite
+        tags:
+          tier: regression
+          purpose: regression
+          stage: traces
         dataset: support-agent-dev-traces
         datasetVersion: v3
         datasetFile: .foundry/datasets/support-agent-dev-traces-v3.jsonl
@@ -66,13 +83,23 @@
           threshold: 4
         - name: groundedness
           threshold: 4
+```
+
+## Example `.foundry/agent-metadata.prod.yaml` (prod/CI)
+
+```yaml
+defaultEnvironment: prod
+environments:
   prod:
     projectEndpoint: https://contoso.services.ai.azure.com/api/projects/support-prod
     agentName: support-agent-prod
     azureContainerRegistry: contosoregistry.azurecr.io
-    testCases:
+    evaluationSuites:
       - id: production-guardrails
-        priority: P0
+        tags:
+          tier: smoke
+          purpose: safety
+          stage: prod
         dataset: support-agent-prod-curated
         datasetVersion: v2
         datasetFile: .foundry/datasets/support-agent-prod-curated-v2.jsonl
@@ -86,22 +113,40 @@
 ## Workflow Rules
 
-1. Auto-discover agent roots by searching for `.foundry/agent-metadata.yaml`.
+1. Auto-discover agent roots by searching for `.foundry/` folders that contain `agent-metadata.yaml` or `agent-metadata.<env>.yaml`.
 2. If exactly one agent root is found, use it. If multiple roots are found, require the user to choose one.
-3. Resolve environment in this order: explicit user choice, remembered session choice, `defaultEnvironment`.
-4. Keep the selected agent root and environment visible in every deploy, eval, dataset, and trace summary.
-5. Treat `datasets/` and `evaluators/` as cache folders. Reuse local files when present, but offer refresh when the user asks or when remote state is newer.
-6. Never overwrite cache files or metadata silently.
+3. Inside the selected agent root, select the metadata file in this order: an explicit file or path from the user or workflow; then `.foundry/agent-metadata.<env>.yaml` when an explicit environment is already known and that file exists; then `.foundry/agent-metadata.yaml`. If `.foundry/agent-metadata.yaml` is absent and exactly one `.foundry/agent-metadata.<env>.yaml` sidecar exists, use it; if multiple sidecars exist and no explicit file or path was given, require the user to choose (see the walkthrough sketch after this list).
+4. Resolve the environment in this order: explicit user choice; the file's only environment when the selected metadata file is single-environment; remembered session choice; `defaultEnvironment`.
+5. Keep the selected agent root, metadata file, and environment visible in every deploy, eval, dataset, and trace summary.
+6. Once an agent root is selected, use only that root's `.foundry/` folder and source tree for local evaluation, dataset, trace, deploy, and prompt-optimization context. Do not merge sibling agent folders.
+7. Treat `datasets/` and `evaluators/` as cache folders. Reuse local files when present, but offer refresh when the user asks or when remote state is newer.
+8. Writes must target the selected metadata file only. For preferred single-environment files, update only that one environment block. For legacy multi-environment files, rewrite only the selected environment block. Never copy or merge environments across sibling metadata files automatically.
+9. Never overwrite cache files or metadata silently.
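+
+The following sketch walks rules 1-4 through a hypothetical agent root (the folder, file names, and request are illustrative, and these keys are not part of any schema):
+
+```yaml
+agentRoot: agents/support-agent            # rule 1: a .foundry/ folder with metadata was found here
+metadataFiles:
+  - .foundry/agent-metadata.yaml           # local/dev file, defaultEnvironment: dev
+  - .foundry/agent-metadata.prod.yaml      # single-environment prod sidecar
+request: "evaluate the prod agent"         # the user names an explicit environment
+selectedMetadataFile: .foundry/agent-metadata.prod.yaml   # rule 3: env-specific file exists
+selectedEnvironment: prod                  # rule 4: the file's only environment
+```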
+
+## Legacy Compatibility (`testCases[]` / `testSuites[]` -> `evaluationSuites[]`)
+
+Use `evaluationSuites[]` as the canonical schema. If the selected environment still uses the older `testSuites[]` and does not yet define `evaluationSuites[]`, treat that list as the current suite source, normalize it in memory, and migrate it on the next metadata write. If the selected environment is older still and uses legacy `testCases[]` without `evaluationSuites[]`, treat `testCases[]` as the suite source and normalize it the same way.
+
+| Legacy field | Migration behavior |
+|--------------|--------------------|
+| `id` | Keep as-is |
+| `dataset`, `datasetVersion`, `datasetFile`, `datasetUri`, `evaluators` | Keep as-is |
+| `tags` | Preserve if already present |
+| `priority` | If `tags.tier` is missing, map `P0` -> `smoke`, `P1` -> `regression`, `P2` -> `coverage` |
+
+When a workflow writes metadata, rewrite the selected metadata file so the target environment contains only `evaluationSuites[]`. Do not keep older `testSuites[]` or legacy `testCases[]` in the rewritten block.
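+
+As an illustration of the mapping above, here is a legacy entry (mirroring the removed dev example) and the normalized form a metadata write would produce; all other values are carried over unchanged:
+
+```yaml
+# Before: legacy shape, read and normalized in memory
+testCases:
+  - id: trace-regressions
+    priority: P1
+    dataset: support-agent-dev-traces
+    datasetVersion: v3
+---
+# After: canonical shape written back on the next metadata write
+evaluationSuites:
+  - id: trace-regressions        # id kept as-is
+    tags:
+      tier: regression           # mapped from priority: P1
+    dataset: support-agent-dev-traces
+    datasetVersion: v3
+```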
+
+## Evaluation-Suite Guidance
 
-## Test-Case Guidance
+Use `tags` as a freeform key/value map on each evaluation suite. Suggested keys:
 
-| Priority | Meaning | Typical Use |
-|----------|---------|-------------|
-| `P0` | Must-pass gate | Smoke checks, safety, deployment blockers |
-| `P1` | High-value regression coverage | Production trace regressions, key business flows |
-| `P2` | Broader quality coverage | Long-tail scenarios, exploratory quality checks |
+| Tag Key | Example Values | Typical Use |
+|---------|----------------|-------------|
+| `tier` | `smoke`, `regression`, `coverage` | Suggested run order / breadth |
+| `purpose` | `baseline`, `safety`, `tools`, `quality`, `regression` | Why the suite exists |
+| `stage` | `seed`, `traces`, `curated`, `prod` | Dataset lifecycle alignment |
 
-Each test case should point to one dataset and one or more evaluators with explicit thresholds. Store `dataset` as the stable Foundry dataset name (without the `-vN` suffix), store the version separately in `datasetVersion`, and keep the local cache filename versioned (for example, `...-v3.jsonl`). Persist the local `datasetFile` and remote `datasetUri` together so every test case can resolve both the cache artifact and the Foundry-registered dataset. Local dataset filenames should start with the selected environment's Foundry `agentName`, followed by stage and version suffixes, so related cache files stay grouped by agent. If `agentName` already encodes the environment (for example, `support-agent-dev`), do not append the environment key again. Use test-case IDs in evaluation names, result folders, and regression summaries so the flow remains traceable.
+Each evaluation suite should point to one dataset and one or more evaluators with explicit thresholds. Store `dataset` as the stable Foundry dataset name (without the `-vN` suffix), store the version separately in `datasetVersion`, and keep the local cache filename versioned (for example, `...-v3.jsonl`). Persist the local `datasetFile` and remote `datasetUri` together so every evaluation suite can resolve both the cache artifact and the Foundry-registered dataset. Add a `tags` map to each suite (for example, `tier: smoke`, `purpose: baseline`) so workflows can group or filter suites without a fixed priority enum. Local dataset filenames should start with the selected environment's Foundry `agentName` from the selected metadata file, followed by stage and version suffixes, so related cache files stay grouped by agent. If `agentName` already encodes the environment (for example, `support-agent-dev`), do not append the environment key again. Keep `datasets/`, `evaluators/`, and `results/` shared at the `.foundry/` root even when multiple metadata files exist. Use evaluation-suite IDs in evaluation names, result folders, and regression summaries so the flow remains traceable.
 
 ## Sync Guidance