docs: update demo guide and add CI/CD integration guide #194
Conversation
- Fix DEMO-SCRIPT.md to match current CLI commands
- Remove references to `waza generate` command (doesn't exist)
- Replace with `waza new skill` and `waza new eval`
- Remove outdated flags: `--log`, `--suggestions`, `--trials`, `--fail-threshold`
- Replace with current flags: `--session-log`, `--session-dir`, `--task`, `--parallel`
- Update Part 5+ sections to reflect current CLI behavior
- Add comprehensive CI/CD integration guide (docs/CI-CD-GUIDE.md)
- GitHub Actions examples (basic, multi-model, baseline comparison)
- Azure DevOps pipeline examples
- Secrets management for both platforms
- Best practices: caching, quality gates, parallel execution, logging
- Troubleshooting guide with common issues
- Advanced workflows: approval gates, trend tracking

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Adds a CI/CD-focused documentation guide intended to help users run waza evaluations in automated pipelines (GitHub Actions / Azure DevOps) and use results for gating and reporting.
Changes:
- Added a new `docs/CI-CD-GUIDE.md` with copy/pasteable CI examples and operational guidance.
- Included sections for workflow patterns (multi-model, baseline), secrets management, best practices, troubleshooting, and advanced workflows.
| File | Description |
|---|---|
| docs/CI-CD-GUIDE.md | Introduces a CI/CD integration guide with pipeline examples, gating, artifacts, and troubleshooting. |
Copilot's findings
Comments suppressed due to low confidence (2)
docs/CI-CD-GUIDE.md:407
- The PR comment script assumes `results.summary.tasks_passed`, `tasks_total`, and `pass_rate`, but the CLI's `summary` uses fields like `total_tests`, `succeeded`, and `success_rate`. This snippet won't render correct numbers until it's aligned with the actual JSON schema.
```javascript
const fs = require('fs');
const results = JSON.parse(fs.readFileSync('results.json', 'utf8'));
const comment = `
## Evaluation Results
- **Pass Rate:** ${(results.summary.pass_rate * 100).toFixed(1)}%
- **Tasks Passed:** ${results.summary.tasks_passed}/${results.summary.tasks_total}
- **Duration:** ${results.summary.duration_ms}ms
```
docs/CI-CD-GUIDE.md:485
- The "Result JSON Structure" example doesn't match the actual `waza run --output` schema (e.g., `summary.pass_rate` vs `summary.success_rate`, `tasks[*].id`/`name` vs `tasks[*].test_id`/`display_name`, and `summary.tasks_passed`/`tasks_total` aren't present). Since earlier workflow snippets parse these fields, please update this JSON example to reflect the real `internal/models.EvaluationOutcome` JSON shape.
### Result JSON Structure
```json
{
  "summary": {
    "eval_name": "my-skill-eval",
    "status": "passed",
    "pass_rate": 0.95,
    "tasks_passed": 19,
    "tasks_total": 20,
    "duration_ms": 5000,
    "timestamp": "2025-01-20T10:30:00Z"
  },
  "tasks": [
    {
      "id": "task-001",
      "name": "Example Task",
      "status": "passed",
      "score": 1.0,
      "duration_ms": 250,
      "trials": 1
    }
  ]
}
```
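For contrast, a sketch of the same structure using only the field names the review cites (`success_rate`, `succeeded`, `total_tests`, `tasks[*].test_id`/`display_name`); any field not named in the review is an unverified assumption:

```json
{
  "summary": {
    "success_rate": 0.95,
    "succeeded": 19,
    "total_tests": 20
  },
  "tasks": [
    {
      "test_id": "task-001",
      "display_name": "Example Task",
      "status": "passed"
    }
  ]
}
```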
- Files reviewed: 1/1 changed files
- Comments generated: 10
```yaml
run: |
  PASS_RATE=$(jq '.summary.pass_rate' results.json)
  echo "Pass Rate: $PASS_RATE"
  if (( $(echo "$PASS_RATE < 0.8" | bc -l) )); then
    echo "❌ Pass rate below threshold"
    exit 1
  fi
```
The examples read jq '.summary.pass_rate' / results.summary.pass_rate, but the CLI output JSON uses summary.success_rate (see internal/models/outcome.go -> OutcomeDigest.SuccessRate tagged as json:"success_rate"). As written, the threshold checks and PR comment snippet will either return null or fail. Please update the jq queries and the “Result JSON Structure” example to match the actual output schema.
This issue also appears in the following locations of the same file:
- line 401
- line 462
```yaml
- name: Run evaluation
  run: |
    waza run evals/my-skill/eval.yaml \
      --context-dir evals/my-skill/fixtures \
      --output results.json
  env:
    COPILOT_SDK_TOKEN: ${{ secrets.COPILOT_SDK_TOKEN }}
```
The workflow examples set COPILOT_SDK_TOKEN, but this repo’s docs indicate copilot-sdk uses GITHUB_TOKEN (and CI can typically use the mock executor with no secrets). As written, users will add a secret/env var that waza doesn’t read and still fail auth. Please align the env var + surrounding text with README’s “Environment Requirements” section and update the other COPILOT_SDK_TOKEN references in this guide accordingly.
```markdown
# CI/CD Integration Guide

This guide explains how to integrate waza evaluations into your CI/CD pipelines using GitHub Actions, Azure DevOps, and other platforms.
```
PR metadata says this “Fixes #89”, but issue #89’s acceptance criteria calls for a docs-site page under site/src/content/docs/guides/ (and sidebar/nav update). This PR adds docs/CI-CD-GUIDE.md, but I don’t see a corresponding site guide; consider either porting this content into the site docs structure (and adding it to navigation) or adjusting the “Fixes #89” claim/scope.
### Issue: Tests timeout in CI

**Solution:** Increase timeout or use `--timeout` flag:

```yaml
- name: Run evaluation
  timeout-minutes: 10
  run: |
    waza run evals/my-skill/eval.yaml \
      --context-dir evals/my-skill/fixtures \
      --output results.json
```
The troubleshooting section suggests using a --timeout flag with waza run, but waza run doesn’t define a --timeout CLI flag (timeouts are configured via config.timeout_seconds in the eval spec / defaults). This example as written will fail with an unknown flag error; please update the guidance to the supported configuration mechanism.
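A sketch of the in-spec mechanism instead, based on the `config.timeout_seconds` field this comment names; the surrounding layout and value are illustrative assumptions:

```yaml
# eval.yaml -- timeout configured in the spec rather than via a CLI flag
# (field name per the review comment; layout is illustrative)
config:
  model: claude-sonnet-4-20250514
  timeout_seconds: 600
```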
### Threshold Configuration

Set pass thresholds in `eval.yaml`:

```yaml
config:
  model: claude-sonnet-4-20250514
  pass_threshold: 0.8
metrics:
  - name: task_completion
    threshold: 0.85
  - name: trigger_accuracy
    threshold: 0.90
```
The eval.yaml snippet includes config.pass_threshold, but the spec schema in this repo doesn’t define that field (YAML parsing uses KnownFields(true), so this would fail with an unknown field error). Please remove pass_threshold and document the supported thresholding mechanism (per-metric metrics[*].threshold, and/or whatever determines overall pass/fail in the current runner).
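A version of the snippet with the unsupported field dropped, keeping only the per-metric `metrics[*].threshold` mechanism this comment says the schema defines (still a sketch, not verified against the spec):

```yaml
config:
  model: claude-sonnet-4-20250514
metrics:
  - name: task_completion
    threshold: 0.85
  - name: trigger_accuracy
    threshold: 0.90
```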
To create a Copilot SDK token, authenticate with the CLI:

```bash
copilot auth login
```
This guide instructs copilot auth login, but the codebase error messages direct users to run copilot login (no auth subcommand). Update the command so readers can successfully authenticate.
```diff
-copilot auth login
+copilot login
```
```bash
git stash
git checkout main
waza run evals/my-skill/eval.yaml \
  --context-dir evals/my-skill/fixtures \
  --output results-baseline.json
git checkout -
git stash pop
```
The “Baseline Comparison” example is likely to fail: git stash won’t create a stash entry when there are no tracked changes, but git stash pop will then error with “No stash entries found” (and results-current.json is typically untracked). Consider using git stash -u + git stash pop || true, or (more robust) use git worktree add to run the baseline eval in a separate checkout without stashing/checking out back and forth.
```diff
-git stash
+git stash -u
 git checkout main
 waza run evals/my-skill/eval.yaml \
   --context-dir evals/my-skill/fixtures \
   --output results-baseline.json
 git checkout -
-git stash pop
+git stash pop || true
```
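The more robust worktree alternative mentioned above can be sketched as follows; the throwaway repo setup is only there to make the sketch self-contained, and the `waza run` invocation is replaced by a placeholder:

```shell
set -e
# Throwaway repo (assumption: in real CI you are already inside the checkout
# on a feature branch, so you would skip this setup block).
demo=$(mktemp -d); cd "$demo"
git init -q -b main .
git -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m baseline
git checkout -q -b feature

# Run the baseline eval in a separate worktree -- no stash/checkout dance,
# and the feature checkout is never disturbed.
baseline=$(mktemp -d -u)
git worktree add -q "$baseline" main
(
  cd "$baseline"
  # Placeholder for:
  # waza run evals/my-skill/eval.yaml --context-dir evals/my-skill/fixtures \
  #   --output "$OLDPWD/results-baseline.json"
  echo "baseline eval runs here on branch $(git branch --show-current)"
)
git worktree remove "$baseline"
```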
```yaml
- name: Upload results
  if: always()
  uses: actions/upload-artifact@v3
```
This guide uses actions/upload-artifact@v3, but the repo’s own workflows have already standardized on actions/upload-artifact@v4. Updating to v4 avoids deprecation warnings and keeps the docs consistent with in-repo examples.
```diff
-uses: actions/upload-artifact@v3
+uses: actions/upload-artifact@v4
```
```yaml
- uses: actions/checkout@v4

- name: Set up Go
  uses: actions/setup-go@v4
```
This guide uses actions/setup-go@v4, but the repo’s workflows use actions/setup-go@v5 (and Go 1.26). For copy/paste parity with existing workflows, update to v5 here too.
```diff
-uses: actions/setup-go@v4
+uses: actions/setup-go@v5
```
```yaml
with:
  path: |
    .waza-cache
    ~/.cache/waza
```
The cache example includes ~/.cache/waza, but the CLI’s configurable cache directory defaults to .waza-cache (and there’s no reference in the codebase to a ~/.cache/waza location). Including an unused path may confuse users and reduce cache hit rates. Suggest removing ~/.cache/waza or replacing it with the actual directory/directories waza writes to.
```diff
-    ~/.cache/waza
```
Fixes #112 and #89
Changes
Issue #112 - DEMO-SCRIPT.md Audit
Issue #89 - CI/CD Integration Guide
Testing
Documentation-only changes; no build/test issues.