docs: update demo guide and add CI/CD integration guide #194
Conversation
- Fix DEMO-SCRIPT.md to match current CLI commands
- Remove references to `waza generate` command (doesn't exist)
- Replace with `waza new skill` and `waza new eval`
- Remove outdated flags: `--log`, `--suggestions`, `--trials`, `--fail-threshold`
- Replace with current flags: `--session-log`, `--session-dir`, `--task`, `--parallel`
- Update Part 5+ sections to reflect current CLI behavior
- Add comprehensive CI/CD integration guide (docs/CI-CD-GUIDE.md)
- GitHub Actions examples (basic, multi-model, baseline comparison)
- Azure DevOps pipeline examples
- Secrets management for both platforms
- Best practices: caching, quality gates, parallel execution, logging
- Troubleshooting guide with common issues
- Advanced workflows: approval gates, trend tracking

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Adds a CI/CD-focused documentation guide intended to help users run waza evaluations in automated pipelines (GitHub Actions / Azure DevOps) and use results for gating and reporting.
Changes:
- Added a new `docs/CI-CD-GUIDE.md` with copy/pasteable CI examples and operational guidance.
- Included sections for workflow patterns (multi-model, baseline), secrets management, best practices, troubleshooting, and advanced workflows.
| File | Description |
|---|---|
| docs/CI-CD-GUIDE.md | Introduces a CI/CD integration guide with pipeline examples, gating, artifacts, and troubleshooting. |
Copilot's findings
Comments suppressed due to low confidence (2)
docs/CI-CD-GUIDE.md:407
- The PR comment script assumes `results.summary.tasks_passed`, `tasks_total`, and `pass_rate`, but the CLI's `summary` uses fields like `total_tests`, `succeeded`, and `success_rate`. This snippet won't render correct numbers until it's aligned with the actual JSON schema.
```javascript
const fs = require('fs');
const results = JSON.parse(fs.readFileSync('results.json', 'utf8'));
const comment = `
## Evaluation Results
- **Pass Rate:** ${(results.summary.pass_rate * 100).toFixed(1)}%
- **Tasks Passed:** ${results.summary.tasks_passed}/${results.summary.tasks_total}
- **Duration:** ${results.summary.duration_ms}ms
```
docs/CI-CD-GUIDE.md:485
- The "Result JSON Structure" example doesn't match the actual `waza run --output` schema (e.g., `summary.pass_rate` vs `summary.success_rate`, `tasks[*].id`/`name` vs `tasks[*].test_id`/`display_name`, and `summary.tasks_passed`/`tasks_total` aren't present). Since earlier workflow snippets parse these fields, please update this JSON example to reflect the real `internal/models.EvaluationOutcome` JSON shape.
### Result JSON Structure
```json
{
  "summary": {
    "eval_name": "my-skill-eval",
    "status": "passed",
    "pass_rate": 0.95,
    "tasks_passed": 19,
    "tasks_total": 20,
    "duration_ms": 5000,
    "timestamp": "2025-01-20T10:30:00Z"
  },
  "tasks": [
    {
      "id": "task-001",
      "name": "Example Task",
      "status": "passed",
      "score": 1.0,
      "duration_ms": 250,
      "trials": 1
    }
  ]
}
```
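For contrast, a sketch of the same structure using only the field names the review cites (`success_rate`, `succeeded`, `total_tests`, `tasks[*].test_id`/`display_name`); any field not named in the review is an unverified assumption:

```json
{
  "summary": {
    "success_rate": 0.95,
    "succeeded": 19,
    "total_tests": 20
  },
  "tasks": [
    {
      "test_id": "task-001",
      "display_name": "Example Task",
      "status": "passed"
    }
  ]
}
```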
- Files reviewed: 1/1 changed files
- Comments generated: 10
```yaml
run: |
  PASS_RATE=$(jq '.summary.pass_rate' results.json)
  echo "Pass Rate: $PASS_RATE"
  if (( $(echo "$PASS_RATE < 0.8" | bc -l) )); then
    echo "❌ Pass rate below threshold"
    exit 1
  fi
```
The examples read jq '.summary.pass_rate' / results.summary.pass_rate, but the CLI output JSON uses summary.success_rate (see internal/models/outcome.go -> OutcomeDigest.SuccessRate tagged as json:"success_rate"). As written, the threshold checks and PR comment snippet will either return null or fail. Please update the jq queries and the “Result JSON Structure” example to match the actual output schema.
This issue also appears in the following locations of the same file:
- line 401
- line 462
```yaml
- name: Run evaluation
  run: |
    waza run evals/my-skill/eval.yaml \
      --context-dir evals/my-skill/fixtures \
      --output results.json
  env:
    COPILOT_SDK_TOKEN: ${{ secrets.COPILOT_SDK_TOKEN }}
```
The workflow examples set COPILOT_SDK_TOKEN, but this repo’s docs indicate copilot-sdk uses GITHUB_TOKEN (and CI can typically use the mock executor with no secrets). As written, users will add a secret/env var that waza doesn’t read and still fail auth. Please align the env var + surrounding text with README’s “Environment Requirements” section and update the other COPILOT_SDK_TOKEN references in this guide accordingly.
```markdown
# CI/CD Integration Guide

This guide explains how to integrate waza evaluations into your CI/CD pipelines using GitHub Actions, Azure DevOps, and other platforms.
```
PR metadata says this “Fixes #89”, but issue #89’s acceptance criteria calls for a docs-site page under site/src/content/docs/guides/ (and sidebar/nav update). This PR adds docs/CI-CD-GUIDE.md, but I don’t see a corresponding site guide; consider either porting this content into the site docs structure (and adding it to navigation) or adjusting the “Fixes #89” claim/scope.
### Issue: Tests timeout in CI

**Solution:** Increase timeout or use `--timeout` flag:

```yaml
- name: Run evaluation
  timeout-minutes: 10
  run: |
    waza run evals/my-skill/eval.yaml \
      --context-dir evals/my-skill/fixtures \
      --output results.json
```
The troubleshooting section suggests using a --timeout flag with waza run, but waza run doesn’t define a --timeout CLI flag (timeouts are configured via config.timeout_seconds in the eval spec / defaults). This example as written will fail with an unknown flag error; please update the guidance to the supported configuration mechanism.
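A sketch of the in-spec mechanism instead, based on the `config.timeout_seconds` field this comment names; the surrounding layout and value are illustrative assumptions:

```yaml
# eval.yaml -- timeout configured in the spec rather than via a CLI flag
# (field name per the review comment; layout is illustrative)
config:
  model: claude-sonnet-4-20250514
  timeout_seconds: 600
```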
### Threshold Configuration

Set pass thresholds in `eval.yaml`:

```yaml
config:
  model: claude-sonnet-4-20250514
  pass_threshold: 0.8
metrics:
  - name: task_completion
    threshold: 0.85
  - name: trigger_accuracy
    threshold: 0.90
```
The eval.yaml snippet includes config.pass_threshold, but the spec schema in this repo doesn’t define that field (YAML parsing uses KnownFields(true), so this would fail with an unknown field error). Please remove pass_threshold and document the supported thresholding mechanism (per-metric metrics[*].threshold, and/or whatever determines overall pass/fail in the current runner).
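A version of the snippet with the unsupported field dropped, keeping only the per-metric `metrics[*].threshold` mechanism this comment says the schema defines (still a sketch, not verified against the spec):

```yaml
config:
  model: claude-sonnet-4-20250514
metrics:
  - name: task_completion
    threshold: 0.85
  - name: trigger_accuracy
    threshold: 0.90
```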
To create a Copilot SDK token, authenticate with the CLI:

```bash
copilot auth login
```
This guide instructs copilot auth login, but the codebase error messages direct users to run copilot login (no auth subcommand). Update the command so readers can successfully authenticate.
```diff
-copilot auth login
+copilot login
```
```bash
git stash
git checkout main
waza run evals/my-skill/eval.yaml \
  --context-dir evals/my-skill/fixtures \
  --output results-baseline.json
git checkout -
git stash pop
```
The “Baseline Comparison” example is likely to fail: git stash won’t create a stash entry when there are no tracked changes, but git stash pop will then error with “No stash entries found” (and results-current.json is typically untracked). Consider using git stash -u + git stash pop || true, or (more robust) use git worktree add to run the baseline eval in a separate checkout without stashing/checking out back and forth.
```diff
-git stash
+git stash -u
 git checkout main
 waza run evals/my-skill/eval.yaml \
   --context-dir evals/my-skill/fixtures \
   --output results-baseline.json
 git checkout -
-git stash pop
+git stash pop || true
```
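The more robust worktree alternative mentioned above can be sketched as follows; the throwaway repo setup is only there to make the sketch self-contained, and the `waza run` invocation is replaced by a placeholder:

```shell
set -e
# Throwaway repo (assumption: in real CI you are already inside the checkout
# on a feature branch, so you would skip this setup block).
demo=$(mktemp -d); cd "$demo"
git init -q -b main .
git -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m baseline
git checkout -q -b feature

# Run the baseline eval in a separate worktree -- no stash/checkout dance,
# and the feature checkout is never disturbed.
baseline=$(mktemp -d -u)
git worktree add -q "$baseline" main
(
  cd "$baseline"
  # Placeholder for:
  # waza run evals/my-skill/eval.yaml --context-dir evals/my-skill/fixtures \
  #   --output "$OLDPWD/results-baseline.json"
  echo "baseline eval runs here on branch $(git branch --show-current)"
)
git worktree remove "$baseline"
```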
```yaml
- name: Upload results
  if: always()
  uses: actions/upload-artifact@v3
```
This guide uses actions/upload-artifact@v3, but the repo’s own workflows have already standardized on actions/upload-artifact@v4. Updating to v4 avoids deprecation warnings and keeps the docs consistent with in-repo examples.
```diff
-uses: actions/upload-artifact@v3
+uses: actions/upload-artifact@v4
```
```yaml
- uses: actions/checkout@v4

- name: Set up Go
  uses: actions/setup-go@v4
```
This guide uses actions/setup-go@v4, but the repo’s workflows use actions/setup-go@v5 (and Go 1.26). For copy/paste parity with existing workflows, update to v5 here too.
```diff
-uses: actions/setup-go@v4
+uses: actions/setup-go@v5
```
```yaml
with:
  path: |
    .waza-cache
    ~/.cache/waza
```
The cache example includes ~/.cache/waza, but the CLI’s configurable cache directory defaults to .waza-cache (and there’s no reference in the codebase to a ~/.cache/waza location). Including an unused path may confuse users and reduce cache hit rates. Suggest removing ~/.cache/waza or replacing it with the actual directory/directories waza writes to.
```diff
-    ~/.cache/waza
```
Fixes #112 and #89
Changes
Issue #112 - DEMO-SCRIPT.md Audit
Issue #89 - CI/CD Integration Guide
Testing
Documentation-only changes; no build/test issues.