feat(evals): add npm eval scripts and shared environments #1626
Codecov Report
✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1626      +/-   ##
==========================================
+ Coverage   85.50%   86.53%   +1.02%
==========================================
  Files          82       76       -6
  Lines       11805    10746    -1059
==========================================
- Hits        10094     9299     -795
+ Misses       1711     1447     -264
- add eval:run, eval:run:skills, eval:run:agents, eval:run:scripts, eval:compare npm scripts
- add environments map to .vally.yaml with security, coding-standards, and security-and-coding named environments
- refactor eval.yaml files to use named environment references instead of relative paths
- add vally version check to copilot-setup-steps verify step

🧪 - Generated by Copilot
61504ac to a9c177f
wbreza
left a comment
LGTM — approving. Small, well-scoped refactor (6 files, +23/-12), all CI green, named environments resolve to the same skill paths as the prior inline lists. Three non-blocking suggestions below; none gate merge.
🟡 Minor (non-blocking)
1. eval:run fail-fast via && (package.json)
Chained with &&, so the first failing suite halts the rest. Likely intentional for fast feedback, but if you ever want all suites to attempt in a single run (and aggregate failures), consider npm-run-all -c eval:run:skills eval:run:agents eval:run:scripts, or leave a comment documenting fail-fast intent.
2. npx vally --version in copilot-setup-steps.yml
If vally isn't already installed by a prior step in this job, npx will silently download it on first invocation — masking a "not installed" regression and adding CI time. If the intent is to verify a previously installed binary, prefer npm exec --no-install vally -- --version so it fails loudly when missing.
3. eval:compare with no args
Worth a one-line note (script comment or README) on what vally compare defaults to (last two runs? interactive picker?) so the script is self-documenting.
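To make the trade-off in item 1 concrete, here is a minimal shell sketch of the two behaviors: `&&` chaining stops at the first failing suite, while a continue-on-error loop (the behavior `npm-run-all -c` is suggested for above) runs every suite and reports an aggregate status. The `suite_*` functions are hypothetical stand-ins for the real eval:run:* scripts, and the agents suite is made to fail purely for illustration.

```shell
# Hypothetical stand-ins for the real suites; "agents" fails on purpose.
suite_skills()  { echo "skills: ok";     return 0; }
suite_agents()  { echo "agents: FAILED"; return 1; }
suite_scripts() { echo "scripts: ok";    return 0; }

# Fail-fast: same shape as `a && b && c`; stops at the first non-zero exit.
run_fail_fast() {
  suite_skills && suite_agents && suite_scripts
}

# Continue-on-error: run every suite, then return an aggregate status.
run_all() {
  status=0
  for suite in suite_skills suite_agents suite_scripts; do
    "$suite" || status=1
  done
  return "$status"
}
```

With these stand-ins, `run_fail_fast` never reaches the scripts suite, whereas `run_all` still runs it and then returns non-zero because one suite failed.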
🟢 Nits
- security-and-coding duplicates entries from security + coding-standards. Explicit is fine; if vally supports env composition (extends/array merge), a future cleanup could DRY it.
- All three referenced skill paths (security/owasp-top-10, security/owasp-cicd, coding-standards/python-foundational) verified present at head SHA.
- No leftover inline environment: blocks in the three suite eval.yaml files. Refactor is consistent.
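The path check in the second nit can be scripted. This is a minimal sketch, assuming the skill paths resolve relative to a single root directory; the root is a hypothetical parameter, since the thread does not state where the skills live, and `check_skill_paths` is an illustrative helper rather than anything in this PR.

```shell
# Hypothetical helper: report every listed path that does not exist under root.
check_skill_paths() {
  root=$1
  shift
  missing=0
  for rel in "$@"; do
    if [ ! -e "$root/$rel" ]; then
      echo "missing: $rel" >&2
      missing=1
    fi
  done
  # Non-zero when at least one path was absent.
  return "$missing"
}
```

It would be called with the root followed by the three paths named above (security/owasp-top-10, security/owasp-cicd, coding-standards/python-foundational) and returns non-zero if any of them is absent.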
Phase 3 (CI workflow) deferral with the security-design note is the right call.
Addresses wbreza review feedback: npx without --no silently downloads on first invocation, masking a not-installed regression.

🔧 - Generated by Copilot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Thanks for the review @wbreza! Addressing your suggestions: 1. 2. 3.
bindsi
left a comment
Looks good
- All three referenced skill paths exist (security/owasp-top-10, security/owasp-cicd, coding-standards/python-foundational).
- Inline environment: blocks fully removed from the three suite files; refactor is consistent, no drift.
- npx --no is the right call over bare npx (fails loudly rather than silently downloading).
- Fail-fast && chaining in eval:run is reasonable and intent confirmed in-thread.
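A minimal sketch of the fail-loudly check referred to in the list above, assuming the tool is expected to have been installed by an earlier step. `verify_installed` is a hypothetical helper, not part of this PR; it places `--` after `--no` so that everything following is passed to the tool rather than parsed by npx.

```shell
# Hypothetical helper: succeed only when the tool is already installed.
verify_installed() {
  tool=$1
  # --no makes npx refuse to download a missing package;
  # -- ends npx's own options so --version reaches the tool.
  if version=$(npx --no -- "$tool" --version 2>/dev/null); then
    echo "$tool $version"
  else
    echo "error: $tool is not installed; refusing to download it" >&2
    return 1
  fi
}
```

Calling `verify_installed vally` in a verify step would then fail the step when the binary is missing instead of fetching it on the fly.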
Non-blocking findings from @wbreza already addressed in 4d64b66 and follow-up comments — nothing further from me on the code.
One small request
Could you please redo the PR description using the repo's .github/PULL_REQUEST_TEMPLATE.md? The current body is great content-wise, but the template's sections (linked issue, change type checkboxes, testing/validation, etc.) keep the changelog and downstream tooling consistent across PRs. Thanks!
Updated the PR description to use the repo template - thanks for the catch.
Description
Implements Phases 1-2 of the Vally evaluation framework integration (issue #1599).
Phase 1 - npm scripts: Added suite-specific eval commands to package.json (eval:run, eval:run:skills, eval:run:agents, eval:run:scripts, eval:compare). Fail-fast chaining via && for CI feedback.
Phase 2 - Shared environments: Added environments map to .vally.yaml with three named environments (security, coding-standards, security-and-coding). Refactored all eval.yaml files to use named string references instead of inline relative paths.
Phase 1b - Tooling verification: Added vally version check to copilot-setup-steps verify step using npx --no (fails loudly if not installed).
Related Issue(s)
Closes phase 1-2 of #1599
Type of Change
Select all that apply:
Code & Documentation:
Infrastructure & Configuration:
Testing
Checklist
Required Checks
Required Automated Checks
The following validation commands must pass before merging:
Security Considerations
Additional Notes