refactor(ci): let resolution gate reuse benchmark artifact #1167
The pre-publish-benchmark job's `Run resolution benchmark` step builds codegraphs for ~34 language fixtures, computes precision/recall, and writes resolution-result.json. The `Gate on resolution thresholds` step that follows then ran the same vitest suite, which independently copied every fixture and rebuilt the graphs again, doubling the most expensive slice of the publish pipeline.

Extend the script's per-language LangResult to include falsePositiveEdges and falseNegativeEdges so the gate test has everything it needs for the existing precision/recall threshold assertions and failure messages. Refactor the gate test to consume that artifact when RESOLUTION_RESULT_JSON is set, falling back to the build-from-fixtures path when unset so devs can still run `npx vitest run tests/benchmarks/resolution/...` standalone. Wire the env var through the workflow's Gate step.

Verified locally: gate test in artifact mode passes 170/170 in ~0.5s against an artifact produced by scripts/resolution-benchmark.ts, and the legacy build-from-fixtures path still passes for the javascript fixture.

Closes #1052
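As a rough illustration of the extended per-language result described above, the sketch below lists the fields named in this thread and shows how a gate could build a failure message from the artifact alone. The interface name `LangResultSketch`, the `Edge` string type, the `byMode` shape, and the helper `describePrecisionFailure` are all assumptions for illustration; the real definitions live in scripts/resolution-benchmark.ts and are not shown here.

```typescript
// Sketch only: field names come from the review snippets in this thread;
// the edge element type and the byMode shape are assumptions.
type Edge = string; // hypothetical representation, e.g. "caller -> callee"

interface LangResultSketch {
  precision: number;
  recall: number;
  truePositives: number;
  falsePositives: number;
  falseNegatives: number;
  totalResolved: number;
  totalExpected: number;
  byMode: Record<string, unknown>; // actual shape is not shown in the PR
  falsePositiveEdges: Edge[]; // new: lets the gate list the offending edges
  falseNegativeEdges: Edge[]; // new: lets the gate list the missed edges
}

// Hypothetical helper: returns null when the precision threshold is met,
// otherwise a message that names the false-positive edges from the artifact.
function describePrecisionFailure(
  lang: string,
  result: LangResultSketch,
  minPrecision: number,
): string | null {
  if (result.precision >= minPrecision) return null;
  const edges = result.falsePositiveEdges.join(", ");
  return `${lang}: precision ${result.precision} < ${minPrecision}; false positives: ${edges}`;
}
```

Because the edge lists travel with the counts, the gate can report the same detail in artifact mode as it does after rebuilding the fixtures.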
Claude finished @carlos-alm's task
PR Review: Resolution Benchmark Deduplication
Root Cause Analysis
Problem Being Solved: The CI pipeline was running the resolution benchmark twice, once in
Why This Approach: The solution extends the script's output format to include edge arrays (
Risk Assessment: ✅ Low risk. The change maintains a fallback path for local development and adds proper validation to ensure the artifact contains required fields.
Configuration Change Detection
✅ No high-risk configuration changes detected
Test Integrity
✅ Test integrity maintained
Code Quality
✅ Good implementation quality
Minor observations:
Backlog Compliance
Zero-dep: ✅ No new runtime dependencies added. Uses existing Node.js fs/path modules.
Foundation-aligned: ✅ Fully compliant with all core principles:
Problem-fit: 4/5. Directly improves CI efficiency, which benefits development velocity and AI agent workflows by reducing feedback cycle time. While not directly addressing the "lost AI" problem, faster CI enables more rapid iteration on core features that do.
Breaking: No. Purely additive change. Existing test behavior preserved when environment variable is unset. No CLI, API, or DB schema changes.
Tier: 1 (zero-dep ✓ + foundation-aligned ✓)
Critical Concerns
None identified. The implementation is well-engineered with appropriate fallbacks and validation.
Final Recommendation
Minor suggestions for future iterations:
This is exactly the kind of thoughtful optimization that improves developer experience without compromising test quality or introducing technical debt.
Greptile Summary
This PR eliminates the duplicate fixture-rebuild work in the
Confidence Score: 5/5
Safe to merge: artifact-mode and fixture-mode paths are well-isolated, all validation gaps from earlier reviews have been addressed, and workflow step ordering guarantees the artifact exists before the gate consumes it.
All three issues flagged in prior review rounds (missing numeric-field validation, silent empty-artifact pass-through, and the rounding discrepancy) are now guarded. The new code paths are simple, the fallback fixture mode is untouched, and the only workflow change is a single env-var injection into an already-ordered step sequence. No files require special attention.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Run resolution benchmark] -->|writes resolution-result.json| B[resolution-result.json]
B --> C{RESOLUTION_RESULT_JSON set?}
C -->|Yes - CI| D[loadArtifact validate + parse JSON]
D --> E{artifact empty?}
E -->|Yes| F[throw error]
E -->|No| G[metricsFromArtifact per lang]
G -->|missing fields| H[throw error regenerate]
G -->|valid| I[Run threshold assertions]
C -->|No - local| J[discoverFixtures]
J --> K[copyFixture + buildFixtureGraph]
K --> L[computeMetrics]
L --> I
I -->|pass| M[Gate pass]
I -->|fail| N[Gate fail with edge list]
Reviews (4): Last reviewed commit: "Merge branch 'main' into refactor/1052-d..."
function metricsFromArtifact(lang: string, raw: ArtifactLangResult): BenchmarkMetrics {
  if (!Array.isArray(raw.falsePositiveEdges) || !Array.isArray(raw.falseNegativeEdges)) {
    throw new Error(
      `Resolution artifact for ${lang} is missing falsePositiveEdges/falseNegativeEdges — regenerate with the current resolution-benchmark.ts.`,
    );
  }
metricsFromArtifact validates the two edge-list arrays but leaves every numeric field (precision, recall, truePositives, totalResolved, totalExpected, byMode) unchecked. A stale or hand-edited artifact with null or a missing key for one of these fields will throw a confusing TypeError at the threshold assertions rather than the clear "regenerate" message.
- function metricsFromArtifact(lang: string, raw: ArtifactLangResult): BenchmarkMetrics {
-   if (!Array.isArray(raw.falsePositiveEdges) || !Array.isArray(raw.falseNegativeEdges)) {
-     throw new Error(
-       `Resolution artifact for ${lang} is missing falsePositiveEdges/falseNegativeEdges — regenerate with the current resolution-benchmark.ts.`,
-     );
-   }
+ function metricsFromArtifact(lang: string, raw: ArtifactLangResult): BenchmarkMetrics {
+   if (
+     typeof raw.precision !== 'number' ||
+     typeof raw.recall !== 'number' ||
+     typeof raw.truePositives !== 'number' ||
+     typeof raw.falsePositives !== 'number' ||
+     typeof raw.falseNegatives !== 'number' ||
+     typeof raw.totalResolved !== 'number' ||
+     typeof raw.totalExpected !== 'number' ||
+     !raw.byMode
+   ) {
+     throw new Error(
+       `Resolution artifact for ${lang} is missing required numeric fields — regenerate with the current resolution-benchmark.ts.`,
+     );
+   }
+   if (!Array.isArray(raw.falsePositiveEdges) || !Array.isArray(raw.falseNegativeEdges)) {
+     throw new Error(
+       `Resolution artifact for ${lang} is missing falsePositiveEdges/falseNegativeEdges — regenerate with the current resolution-benchmark.ts.`,
+     );
+   }
const artifact = ARTIFACT_PATH ? loadArtifact(ARTIFACT_PATH) : null;
// In artifact mode, drive the suite from the keys in the artifact so we never
// silently skip a language the script reported. In local mode, discover from
// the filesystem like before.
const languages = artifact ? Object.keys(artifact).sort() : discoverFixtures();
Empty artifact silently passes the gate
If resolution-result.json is valid JSON but contains {} (e.g., the benchmark script ran with no discoverable fixtures), Object.keys(artifact) returns [], no describe blocks are registered, and vitest exits 0 with "0 tests". The gate would pass without evaluating a single threshold. A guard after loadArtifact throwing on an empty result would make this failure mode explicit.
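One way such a guard could look is sketched below. This is not the PR's `loadArtifact`; `parseArtifactSketch` is a hypothetical stand-in that takes the file contents as a string, so the example stays free of filesystem calls, and the error wording is invented for illustration.

```typescript
// Hypothetical stand-in for the guard suggested for loadArtifact.
// Input is the raw JSON text of resolution-result.json.
function parseArtifactSketch(text: string): Record<string, unknown> {
  const parsed: unknown = JSON.parse(text);

  // Reject anything that is not a plain object keyed by language.
  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
    throw new Error("Resolution artifact must be a JSON object keyed by language.");
  }

  const artifact = parsed as Record<string, unknown>;

  // Without this check, an empty object would register zero describe blocks
  // and the test runner would exit 0 without evaluating any threshold.
  if (Object.keys(artifact).length === 0) {
    throw new Error("Resolution artifact contains no languages; refusing to pass the gate.");
  }

  return artifact;
}
```

Failing at load time keeps the error close to its cause: the gate stops before any suite is registered instead of reporting a misleading green run.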
Codegraph Impact Analysis
3 functions changed → 1 callers affected across 1 files
…acts (#1167)
- scripts/resolution-benchmark.ts: stop rounding precision/recall to 3 decimals before writing the artifact. The rounding let a near-miss like 0.8497 round up to 0.850 and silently clear a 0.85 threshold in CI artifact mode while failing in fixture mode.
- tests/benchmarks/resolution/resolution-benchmark.test.ts: validate numeric fields in metricsFromArtifact so a stale or malformed artifact surfaces a clear 'regenerate' error instead of a confusing TypeError at the threshold assertions.
- tests/benchmarks/resolution/resolution-benchmark.test.ts: reject an empty artifact in loadArtifact. Without this guard, an empty {} would register zero describe blocks and vitest would exit 0 with '0 tests', silently passing the gate.
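The rounding discrepancy in the first item can be reproduced with a few lines of arithmetic. The helper `roundTo3` below is an assumption about how three-decimal rounding might have been done; the script's previous rounding code is not shown in this thread.

```typescript
// Assumed rounding helper (hypothetical; the script's own code is not shown).
function roundTo3(x: number): number {
  return Math.round(x * 1000) / 1000;
}

const THRESHOLD = 0.85;
const rawPrecision = 0.8497; // the near-miss value from the commit message

// Fixture mode compares the unrounded value against the threshold.
const passesUnrounded = rawPrecision >= THRESHOLD;

// Artifact mode, before the fix, compared the value rounded to 3 decimals.
const passesRounded = roundTo3(rawPrecision) >= THRESHOLD;

console.log({ passesUnrounded, passesRounded });
```

With these inputs the unrounded comparison fails and the rounded one passes, which is the mode-dependent result the commit removes by writing full-precision values to the artifact.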
Addressed Greptile feedback in e7b2f27:
Verified by feeding the test crafted artifacts (empty, malformed, and valid); each path produces the expected error or passes.
Summary
- The `pre-publish-benchmark` job ran the full resolution benchmark twice: once in `scripts/resolution-benchmark.ts` (which writes `resolution-result.json`) and again in `tests/benchmarks/resolution/resolution-benchmark.test.ts` (which independently rebuilt every fixture). With ~34 language fixtures this roughly doubled the slowest part of the pre-publish pipeline.
- Extends the script's `LangResult` output to include `falsePositiveEdges` / `falseNegativeEdges` arrays, and teaches the gate test to consume `resolution-result.json` when `RESOLUTION_RESULT_JSON` is set. The legacy build-from-fixtures path stays as a fallback so devs can still run the test standalone with `npx vitest run …`.
- `publish.yml` now passes `RESOLUTION_RESULT_JSON: ${{ github.workspace }}/resolution-result.json` into the Gate step.

Test plan
- `npx vitest run tests/benchmarks/resolution/resolution-benchmark.test.ts -t javascript`: fixture mode, 5/5 pass
- Generated `resolution-result.json` via `node scripts/resolution-benchmark.ts` (34 languages, all keys present including the new edge lists)
- `RESOLUTION_RESULT_JSON=/tmp/resolution-result.json npx vitest run …`: artifact mode, 170/170 pass in ~0.5 s (vs. minutes for fixture rebuild)
- Artifact missing `falsePositiveEdges` → clear "regenerate" error

Notes
The issue also asks to consider the same dedup for the tracer validation gate. That duplication exists, but it requires a more invasive change to the script's output format (raw tracer edges + a status field, not just counts), and the toolchain-missing semantics differ between the script and the tracer test. Tracked separately in #1166 to keep this PR focused.
Closes #1052