Skip to content

feat(evals): add npm eval scripts and shared environments#1626

Merged
WilliamBerryiii merged 3 commits into
microsoft:mainfrom
jongio:feat/vally-phase1-2
May 22, 2026
Merged

feat(evals): add npm eval scripts and shared environments#1626
WilliamBerryiii merged 3 commits into
microsoft:mainfrom
jongio:feat/vally-phase1-2

Conversation

@jongio
Copy link
Copy Markdown
Contributor

@jongio jongio commented May 20, 2026

Description

Implements Phases 1-2 of the Vally evaluation framework integration (issue #1599).

Phase 1 - npm scripts: Added suite-specific eval commands to package.json (eval:run, eval:run:skills, eval:run:agents, eval:run:scripts, eval:compare). Fail-fast chaining via && for CI feedback.

Phase 2 - Shared environments: Added environments map to .vally.yaml with three named environments (security, coding-standards, security-and-coding). Refactored all eval.yaml files to use named string references instead of inline relative paths.

Phase 1b - Tooling verification: Added vally version check to copilot-setup-steps verify step using npx --no (fails loudly if not installed).

Related Issue(s)

Closes phase 1-2 of #1599

Type of Change

Select all that apply:

Code & Documentation:

  • New feature (non-breaking change adding functionality)

Infrastructure & Configuration:

  • GitHub Actions workflow
  • Dependency update

Testing

  • All three eval specs pass vally lint
  • Environment references resolve correctly per the resolveEnvironment() contract (string ref maps to named lookup in project config)
  • npx --no vally -- --version fails loudly when vally is not installed, succeeds when present
  • CI: 56/56 checks green

Checklist

Required Checks

  • Documentation is updated (if applicable)
  • Files follow existing naming conventions
  • Changes are backwards compatible (if applicable)
  • Tests added for new functionality (if applicable)

Required Automated Checks

The following validation commands must pass before merging:

  • Markdown linting: npm run lint:md
  • Spell checking: npm run spell-check
  • Frontmatter validation: npm run lint:frontmatter
  • Skill structure validation: npm run validate:skills
  • Link validation: npm run lint:md-links
  • PowerShell analysis: npm run lint:ps
  • Plugin freshness: npm run plugin:generate
  • Docusaurus tests: npm run docs:test

Security Considerations

  • This PR does not contain any sensitive or NDA information
  • Any new dependencies have been reviewed for security issues
  • Security-related scripts follow the principle of least privilege

Additional Notes

  • Phase 3 (CI workflow) deferred to follow-up PR pending security design decisions around token strategy and fork PR gating
  • vally-cli pinned at 0.4.0 as devDependency
  • Upstream issues filed for preset support and environment inheritance: microsoft/evaluate#204, microsoft/evaluate#158

@jongio jongio requested a review from a team as a code owner May 20, 2026 22:35
Comment thread demo/remotion-videos/package.json Fixed
Comment thread demo/remotion-videos/package.json Fixed
Comment thread demo/remotion-videos/package.json Fixed
Comment thread demo/remotion-videos/package.json Fixed
Comment thread demo/remotion-videos/package.json Fixed
Comment thread demo/remotion-videos/package.json Fixed
Comment thread demo/remotion-videos/package.json Fixed
Comment thread demo/remotion-videos/package.json Fixed
Comment thread demo/remotion-videos/package.json Fixed
Comment thread demo/remotion-videos/package.json Fixed
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 86.53%. Comparing base (46c0c03) to head (a8d72a3).

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1626      +/-   ##
==========================================
+ Coverage   85.50%   86.53%   +1.02%     
==========================================
  Files          82       76       -6     
  Lines       11805    10746    -1059     
==========================================
- Hits        10094     9299     -795     
+ Misses       1711     1447     -264     
Flag Coverage Δ
pester 83.65% <ø> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.
see 8 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

- add eval:run, eval:run:skills, eval:run:agents, eval:run:scripts, eval:compare npm scripts
- add environments map to .vally.yaml with security, coding-standards, and security-and-coding named environments
- refactor eval.yaml files to use named environment references instead of relative paths
- add vally version check to copilot-setup-steps verify step

🧪 - Generated by Copilot
@jongio jongio force-pushed the feat/vally-phase1-2 branch from 61504ac to a9c177f Compare May 21, 2026 00:53
Copy link
Copy Markdown

@wbreza wbreza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — approving. Small, well-scoped refactor (6 files, +23/-12), all CI green, named environments resolve to the same skill paths as the prior inline lists. Three non-blocking suggestions below; none gate merge.

🟡 Minor (non-blocking)

1. eval:run fail-fast via && (package.json)
Chained with &&, so the first failing suite halts the rest. Likely intentional for fast feedback, but if you ever want all suites to attempt in a single run (and aggregate failures), consider npm-run-all -c eval:run:skills eval:run:agents eval:run:scripts, or leave a comment documenting fail-fast intent.

2. npx vally --version in copilot-setup-steps.yml
If vally isn't already installed by a prior step in this job, npx will silently download it on first invocation — masking a "not installed" regression and adding CI time. If the intent is to verify a previously installed binary, prefer npm exec --no-install vally -- --version so it fails loudly when missing.

3. eval:compare with no args
Worth a one-line note (script comment or README) on what vally compare defaults to (last two runs? interactive picker?) so the script is self-documenting.

🟢 Nits

  • security-and-coding duplicates entries from security + coding-standards. Explicit is fine; if vally supports env composition (extends/array merge), a future cleanup could DRY it.
  • All three referenced skill paths (security/owasp-top-10, security/owasp-cicd, coding-standards/python-foundational) verified present at head SHA.
  • No leftover inline environment: blocks in the three suite eval.yaml files. Refactor is consistent.

Phase 3 (CI workflow) deferral with the security-design note is the right call.

Addresses wbreza review feedback: npx without --no silently
downloads on first invocation, masking a not-installed regression.

🔧 - Generated by Copilot

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jongio
Copy link
Copy Markdown
Contributor Author

jongio commented May 21, 2026

Thanks for the review @wbreza! Addressing your suggestions:

1. eval:run fail-fast via && — Intentional. We want fast feedback in CI; first failure stops the run. If we later need an "all suites regardless" mode, we'll add eval:run:all with npm-run-all -c.

2. npx vally --versionnpx --no — Fixed in 4d64b66. Now fails loudly if the binary isn't already in node_modules/.bin/ from npm ci.

3. eval:compare with no argsvally compare requires explicit --run-a and --run-b paths (it has no default). The script is a convenience alias; users pass paths via -- --run-a <path> --run-b <path>. Will add usage to docs when Phase 3 CI workflow lands.

Copy link
Copy Markdown
Member

@bindsi bindsi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

  • All three referenced skill paths exist (security/owasp-top-10, security/owasp-cicd, coding-standards/python-foundational).
  • Inline environment: blocks fully removed from the three suite files — refactor is consistent, no drift.
  • npx --no is the right call over bare npx (fails loudly rather than silently downloading).
  • Fail-fast && chaining in eval:run is reasonable and intent confirmed in-thread.

Non-blocking findings from @wbreza already addressed in 4d64b66 and follow-up comments — nothing further from me on the code.

One small request

Could you please redo the PR description using the repo's .github/PULL_REQUEST_TEMPLATE.md? The current body is great content-wise, but the template's sections (linked issue, change type checkboxes, testing/validation, etc.) keep the changelog and downstream tooling consistent across PRs. Thanks!

@jongio
Copy link
Copy Markdown
Contributor Author

jongio commented May 21, 2026

Updated the PR description to use the repo template - thanks for the catch.

@WilliamBerryiii WilliamBerryiii merged commit 1102901 into microsoft:main May 22, 2026
56 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants