Add promptfoo eval harness for skills + consolidate MCP endpoint#65
Conversation
ari-launchdarkly
left a comment
There was a problem hiding this comment.
This is a lot. I'm gonna ask Devin to take a pass
Review SummaryThorough and well-structured PR. The eval harness design is solid — running skills through the Claude Agent SDK with an in-process mock MCP server is the right call for realism. The bug fixes are all clearly motivated and the unit test coverage (66 tests, all passing) is strong. A few observations below. Architecture & Design
Bug FixesThe fixes called out in the description all look correct:
Items Worth Considering
Minor Nits
Overall this is a well-thought-out eval harness with solid engineering. The CI workflow design (diff-gated re-runs, artifact-based aggregation, PR score-diff comments) is particularly nice for keeping API costs down while maintaining visibility. |
ari-launchdarkly
left a comment
There was a problem hiding this comment.
I want to see if the CI will allow this to go through with the failures
ari-launchdarkly
left a comment
There was a problem hiding this comment.
Let's block merges when the CI failures exist
Co-authored-by: ari-launchdarkly <asalem@launchdarkly.com>
Skill eval results
Only suites whose source actually changed since their last recorded score were re-run. Soft-failing while we stabilise the baseline. |
Add eval-gate job for required status check and remove continue-on-error from evaluate
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
14b5af1 to
1ee2558
Compare
Summary
promptfoo-based eval harness underevals/that runs each skill end-to-end through the Claude Agent SDK with mocked LaunchDarkly MCP tools, asserting on both tool-call trajectory and response quality.github/workflows/eval-skills.yml) that runs evals on PRs and nightly.mcp.jsonand migration docs to the unifiedmcp/launchdarklyendpointEval harness (
evals/)Provider (
providers/claude-skill-agent-sdk.js): loads the skill the same way a real Claude Code session does (off-disk via.claude/skills/<slug>/), exposes an in-process mock MCP server, and records a full tool-call trajectory for assertions.Suites (one
promptfooconfig.yamlper skill, 3–5 test cases each):aiconfig-create— 4/4 passing (100%)aiconfig-update— 4/5 passing (80%)aiconfig-tools— 3/4 passing (75%)aiconfig-variations— 4/5 passing (80%)launchdarkly-flag-create— 2/3 passing (67%, pre-existing mock conflict with seed data)Scripts:
aggregate.js— runs all suites, writeseval-scores.jsonat repo rootdiff-changed-skills.js— git-diff-gated re-run (only re-evaluates suites whose source changed)render-badges.js— rewrites per-skill README eval score badgesrun-models.js— cross-model matrix runner (haiku / sonnet / opus)Unit tests (
tests/*.test.js, 66 tests, no API calls needed):transform.test.js,output-valid.test.js,assertions.test.js,jsonschema-to-zod.test.js,mock.test.jsnpm testfromevals/CI workflow (
.github/workflows/eval-skills.yml)AGENT_MODELfromclaude-sonnet-4-20250514→claude-sonnet-4-6unit-testjob that runsnpm test(no API key needed) before the LLM eval jobsJob flow:
unit-test— runs 66 unit tests on every trigger, no API callsdiff— computes which suites changed since their last recorded scoreevaluate— runs changed suites in parallel (max 3), gated onunit-testpassingaggregate— merges results intoeval-scores.json, posts PR score-diff comment, commits refreshed scores/badges on scheduleTriggers:
pull_request(onskills/**,evals/**, workflow file),schedule(09:17 UTC daily),workflow_dispatch(withrun_alloverride)Bug fixes (found during review against PR #54 implementation)
run-models.js:.filter(Array.isArray([]) || true)evaluated to.filter(true)— aTypeErrorcrash on any non-empty results array; removed the no-op filterrun-models.js:runSuiteWithModelonly checkedresult.error(OS spawn errors), notresult.status !== 0; promptfoo eval failures were silently treated as successrun-models.js: output path was relative instead of absolute, inconsistent withaggregate.js_mock.js:update-ai-config-variationspread all ofinputinto the variation, polluting state withconfigKey,variationKey, etc.; now only patches the allowed variation fields_mock.js:list-ai-configsdropped agent-created configs whose key collided with a seed template key; state entries now always win_mock.js:create-ai-config-variationfor a nonexistent parent config silently orphaned the variation; now creates a minimal stubpreviewMock: unconditionally appended"..."even for strings under 200 charsaiconfig-createTest 3: deadexploredvariable (computed but never used inpass)flag-createTest 2:pass: false, score: 0.5was internally inconsistent; changed topass: true, score: 0.5tool-responses.json:list-feature-flagsreturned empty whilelist-flagshad seed data; aligned themMCP endpoint consolidation
.mcp.json: replaced separatemcp/fm+mcp/aiconfigsentries with singlemcp/launchdarklyentrymcp-config-templates.md: updated migration section to deprecate bothmcp/fmandmcp/aiconfigs(previously onlymcp/aiconfigswas marked deprecated;mcp/fmwas incorrectly described as a valid mirror)Assorted
evals/.env.example: added links to credential consoles (console.anthropic.com,platform.openai.com/api-keys)EVAL_MODELenv var fallback from the provider (onlyAGENT_MODELis supported)Test plan
cd evals && npm test— 66 unit tests pass, no API calls needednpm run eval:aiconfig-create:single— smoke test (~30s, requiresANTHROPIC_API_KEY)npm run eval:all— full suite run, checkeval-scores.jsonmatches expected scores above