feat(evals): add npm eval scripts and shared environments by jongio · Pull Request #1626 · microsoft/hve-core

jongio · 2026-05-20T22:35:09Z

Description

Implements Phases 1-2 of the Vally evaluation framework integration (issue #1599).

Phase 1 - npm scripts: Added suite-specific eval commands to package.json (eval:run, eval:run:skills, eval:run:agents, eval:run:scripts, eval:compare). Fail-fast chaining via && for CI feedback.

Phase 2 - Shared environments: Added environments map to .vally.yaml with three named environments (security, coding-standards, security-and-coding). Refactored all eval.yaml files to use named string references instead of inline relative paths.

Phase 1b - Tooling verification: Added vally version check to copilot-setup-steps verify step using npx --no (fails loudly if not installed).

Related Issue(s)

Closes phase 1-2 of #1599

Type of Change

Select all that apply:

Code & Documentation:

New feature (non-breaking change adding functionality)

Infrastructure & Configuration:

GitHub Actions workflow
Dependency update

Testing

All three eval specs pass vally lint
Environment references resolve correctly per the resolveEnvironment() contract (string ref maps to named lookup in project config)
npx --no vally -- --version fails loudly when vally is not installed, succeeds when present
CI: 56/56 checks green

Checklist

Required Checks

Documentation is updated (if applicable)
Files follow existing naming conventions
Changes are backwards compatible (if applicable)
Tests added for new functionality (if applicable)

Required Automated Checks

The following validation commands must pass before merging:

Markdown linting: npm run lint:md
Spell checking: npm run spell-check
Frontmatter validation: npm run lint:frontmatter
Skill structure validation: npm run validate:skills
Link validation: npm run lint:md-links
PowerShell analysis: npm run lint:ps
Plugin freshness: npm run plugin:generate
Docusaurus tests: npm run docs:test

Security Considerations

This PR does not contain any sensitive or NDA information
Any new dependencies have been reviewed for security issues
Security-related scripts follow the principle of least privilege

Additional Notes

Phase 3 (CI workflow) deferred to follow-up PR pending security design decisions around token strategy and fork PR gating
vally-cli pinned at 0.4.0 as devDependency
Upstream issues filed for preset support and environment inheritance: microsoft/evaluate#204, microsoft/evaluate#158

codecov-commenter · 2026-05-20T22:39:54Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.53%. Comparing base (46c0c03) to head (a8d72a3).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1626      +/-   ##
==========================================
+ Coverage   85.50%   86.53%   +1.02%     
==========================================
  Files          82       76       -6     
  Lines       11805    10746    -1059     
==========================================
- Hits        10094     9299     -795     
+ Misses       1711     1447     -264

Flag	Coverage Δ
pester	`83.65% <ø> (-0.02%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.
see 8 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

- add eval:run, eval:run:skills, eval:run:agents, eval:run:scripts, eval:compare npm scripts - add environments map to .vally.yaml with security, coding-standards, and security-and-coding named environments - refactor eval.yaml files to use named environment references instead of relative paths - add vally version check to copilot-setup-steps verify step 🧪 - Generated by Copilot

wbreza

LGTM — approving. Small, well-scoped refactor (6 files, +23/-12), all CI green, named environments resolve to the same skill paths as the prior inline lists. Three non-blocking suggestions below; none gate merge.

🟡 Minor (non-blocking)

1. eval:run fail-fast via && (package.json)
Chained with &&, so the first failing suite halts the rest. Likely intentional for fast feedback, but if you ever want all suites to attempt in a single run (and aggregate failures), consider npm-run-all -c eval:run:skills eval:run:agents eval:run:scripts, or leave a comment documenting fail-fast intent.

2. npx vally --version in copilot-setup-steps.yml
If vally isn't already installed by a prior step in this job, npx will silently download it on first invocation — masking a "not installed" regression and adding CI time. If the intent is to verify a previously installed binary, prefer npm exec --no-install vally -- --version so it fails loudly when missing.

3. eval:compare with no args
Worth a one-line note (script comment or README) on what vally compare defaults to (last two runs? interactive picker?) so the script is self-documenting.

🟢 Nits

security-and-coding duplicates entries from security + coding-standards. Explicit is fine; if vally supports env composition (extends/array merge), a future cleanup could DRY it.
All three referenced skill paths (security/owasp-top-10, security/owasp-cicd, coding-standards/python-foundational) verified present at head SHA.
No leftover inline environment: blocks in the three suite eval.yaml files. Refactor is consistent.

Phase 3 (CI workflow) deferral with the security-design note is the right call.

Addresses wbreza review feedback: npx without --no silently downloads on first invocation, masking a not-installed regression. 🔧 - Generated by Copilot Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

jongio · 2026-05-21T01:24:33Z

Thanks for the review @wbreza! Addressing your suggestions:

1. eval:run fail-fast via && — Intentional. We want fast feedback in CI; first failure stops the run. If we later need an "all suites regardless" mode, we'll add eval:run:all with npm-run-all -c.

2. npx vally --version → npx --no — Fixed in 4d64b66. Now fails loudly if the binary isn't already in node_modules/.bin/ from npm ci.

3. eval:compare with no args — vally compare requires explicit --run-a and --run-b paths (it has no default). The script is a convenience alias; users pass paths via -- --run-a <path> --run-b <path>. Will add usage to docs when Phase 3 CI workflow lands.

bindsi

Looks good

All three referenced skill paths exist (security/owasp-top-10, security/owasp-cicd, coding-standards/python-foundational).
Inline environment: blocks fully removed from the three suite files — refactor is consistent, no drift.
npx --no is the right call over bare npx (fails loudly rather than silently downloading).
Fail-fast && chaining in eval:run is reasonable and intent confirmed in-thread.

Non-blocking findings from @wbreza already addressed in 4d64b66 and follow-up comments — nothing further from me on the code.

One small request

Could you please redo the PR description using the repo's .github/PULL_REQUEST_TEMPLATE.md? The current body is great content-wise, but the template's sections (linked issue, change type checkboxes, testing/validation, etc.) keep the changelog and downstream tooling consistent across PRs. Thanks!

jongio · 2026-05-21T13:22:24Z

Updated the PR description to use the repo template - thanks for the catch.

jongio requested a review from a team as a code owner May 20, 2026 22:35

github-advanced-security AI found potential problems May 20, 2026

View reviewed changes

jongio force-pushed the feat/vally-phase1-2 branch from 61504ac to a9c177f Compare May 21, 2026 00:53

wbreza approved these changes May 21, 2026

View reviewed changes

fix(workflows): use npx --no to fail loudly if vally missing

4d64b66

Addresses wbreza review feedback: npx without --no silently downloads on first invocation, masking a not-installed regression. 🔧 - Generated by Copilot Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

jongio requested review from WilliamBerryiii, bindsi and rezatnoMsirhC May 21, 2026 01:45

bindsi approved these changes May 21, 2026

View reviewed changes

Merge branch 'main' into feat/vally-phase1-2

a8d72a3

WilliamBerryiii approved these changes May 22, 2026

View reviewed changes

WilliamBerryiii merged commit 1102901 into microsoft:main May 22, 2026
56 checks passed

This was referenced May 22, 2026

chore(main): pre-release 4.1.138 #1476

Open

chore(main): release hve-core 4.0.0 #1184

Open

matysanchez mentioned this pull request May 25, 2026

Fix #1634: docs: sync lint:all chain in build-system.md with package.json #1679

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(evals): add npm eval scripts and shared environments#1626

feat(evals): add npm eval scripts and shared environments#1626
WilliamBerryiii merged 3 commits into
microsoft:mainfrom
jongio:feat/vally-phase1-2

jongio commented May 20, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov-commenter commented May 20, 2026 •

edited

Loading

Uh oh!

wbreza left a comment

Uh oh!

jongio commented May 21, 2026

Uh oh!

bindsi left a comment

Uh oh!

jongio commented May 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Conversation

jongio commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Related Issue(s)

Type of Change

Testing

Checklist

Required Checks

Required Automated Checks

Security Considerations

Additional Notes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codecov-commenter commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

wbreza left a comment

Choose a reason for hiding this comment

🟡 Minor (non-blocking)

🟢 Nits

Uh oh!

jongio commented May 21, 2026

Uh oh!

bindsi left a comment

Choose a reason for hiding this comment

Looks good

One small request

Uh oh!

jongio commented May 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

jongio commented May 20, 2026 •

edited

Loading

codecov-commenter commented May 20, 2026 •

edited

Loading