Skip to content

refactor(ci): let resolution gate reuse benchmark artifact#1167

Merged
carlos-alm merged 3 commits into
mainfrom
refactor/1052-dedupe-resolution-bench
May 20, 2026
Merged

refactor(ci): let resolution gate reuse benchmark artifact#1167
carlos-alm merged 3 commits into
mainfrom
refactor/1052-dedupe-resolution-bench

Conversation

@carlos-alm
Copy link
Copy Markdown
Contributor

Summary

  • The pre-publish-benchmark job ran the full resolution benchmark twice — once in scripts/resolution-benchmark.ts (which writes resolution-result.json) and again in tests/benchmarks/resolution/resolution-benchmark.test.ts (which independently rebuilt every fixture). With ~34 language fixtures this roughly doubled the slowest part of the pre-publish pipeline.
  • This change extends the script's LangResult output to include falsePositiveEdges / falseNegativeEdges arrays, and teaches the gate test to consume resolution-result.json when RESOLUTION_RESULT_JSON is set. The legacy build-from-fixtures path stays as a fallback so devs can still run the test standalone with npx vitest run ….
  • publish.yml now passes RESOLUTION_RESULT_JSON: ${{ github.workspace }}/resolution-result.json into the Gate step.

Test plan

  • npx vitest run tests/benchmarks/resolution/resolution-benchmark.test.ts -t javascript — fixture mode, 5/5 pass
  • Generate resolution-result.json via node scripts/resolution-benchmark.ts (34 languages, all keys present including the new edge lists)
  • RESOLUTION_RESULT_JSON=/tmp/resolution-result.json npx vitest run … — artifact mode, 170/170 pass in ~0.5 s (vs. minutes for fixture rebuild)
  • Sanity-check error paths: missing artifact file → clear error; artifact missing falsePositiveEdges → clear "regenerate" error
  • Biome lint on the changed files clean (pre-existing warnings in unrelated test files only)
  • Watch publish workflow on next dispatch to confirm CI savings

Notes

The issue also asks to consider the same dedup for the tracer validation gate. That duplication exists, but it requires a more invasive change to the script's output format (raw tracer edges + a status field, not just counts), and the toolchain-missing semantics differ between the script and the tracer test. Tracked separately in #1166 to keep this PR focused.

Closes #1052

The pre-publish-benchmark job's `Run resolution benchmark` step builds
codegraphs for ~34 language fixtures, computes precision/recall, and
writes resolution-result.json. The `Gate on resolution thresholds` step
that follows then ran the same vitest suite which independently copied
every fixture and rebuilt the graphs again — doubling the most expensive
slice of the publish pipeline.

Extend the script's per-language LangResult to include
falsePositiveEdges and falseNegativeEdges so the gate test has
everything it needs for the existing precision/recall threshold
assertions and failure messages. Refactor the gate test to consume
that artifact when RESOLUTION_RESULT_JSON is set, falling back to the
build-from-fixtures path when unset so devs can still run
`npx vitest run tests/benchmarks/resolution/...` standalone. Wire the
env var through the workflow's Gate step.

Verified locally: gate test in artifact mode passes 170/170 in ~0.5s
against an artifact produced by scripts/resolution-benchmark.ts, and
the legacy build-from-fixtures path still passes for the javascript
fixture.

Closes #1052
@claude
Copy link
Copy Markdown

claude Bot commented May 19, 2026

Claude finished @carlos-alm's task —— View job


PR Review: Resolution Benchmark Deduplication

Todo List:

  • Read and understand changed files
  • Check repository guidelines (CLAUDE.md)
  • Phase 1: Root Cause Analysis
  • Phase 2: Configuration Change Detection
  • Phase 3: Test Integrity
  • Phase 4: Code Quality
  • Phase 5: Backlog Compliance
  • Provide final assessment and recommendation

Root Cause Analysis

Problem Being Solved: The CI pipeline was running the resolution benchmark twice — once in scripts/resolution-benchmark.ts to generate metrics and again in the test suite which rebuilt all ~34 language fixtures from scratch. This roughly doubled the execution time of the slowest part of the pre-publish pipeline.

Why This Approach: The solution extends the script's output format to include edge arrays (falsePositiveEdges/falseNegativeEdges) and teaches the test to consume the pre-computed artifact when RESOLUTION_RESULT_JSON is set. This eliminates duplicate fixture processing while maintaining backward compatibility for local development.

Risk Assessment: ✅ Low risk. The change maintains a fallback path for local development and adds proper validation to ensure the artifact contains required fields.

Configuration Change Detection

No high-risk configuration changes detected

  • Workflow modification adds environment variable but doesn't relax any thresholds
  • No test timeout increases or coverage reductions
  • No linting rules disabled or build configuration modified

Test Integrity

Test integrity maintained

  • Tests maintain identical validation logic and thresholds
  • Added validation ensures artifact format compatibility: metricsFromArtifact:298-302
  • Error handling provides clear guidance when artifact is missing/invalid: loadArtifact:288-295
  • Backward compatibility preserved for local runs without the environment variable

Code Quality

Good implementation quality

  • Clean separation: Artifact loading logic is well-contained in lines tests/benchmarks/resolution/resolution-benchmark.test.ts:274-316
  • Type safety: Proper TypeScript interfaces for artifact format: ArtifactLangResult:275-286
  • Error handling: Clear error messages for missing artifacts and format validation
  • Comments: Excellent documentation explaining the optimization purpose and fallback behavior

Minor observations:

  • Code duplication between script and test for computeMetrics function, but this is acceptable for the performance gain achieved
  • Interface naming could be more consistent (ArtifactLangResult vs LangResult), but functionally equivalent

Backlog Compliance

Zero-dep: ✅ — No new runtime dependencies added. Uses existing Node.js fs/path modules.

Foundation-aligned: ✅ — Fully compliant with all core principles:

  • P1 (always-current): Doesn't add cloud dependencies or external services
  • P4 (zero-cost core): Optimizes CI performance without changing runtime behavior
  • P5 (embeddable-first): No API contract changes
  • P8 (not an agent): Pure optimization, no decision-making or code generation

Problem-fit: 4/5 — Directly improves CI efficiency which benefits development velocity and AI agent workflows by reducing feedback cycle time. While not directly addressing the "lost AI" problem, faster CI enables more rapid iteration on core features that do.

Breaking: No — Purely additive change. Existing test behavior preserved when environment variable is unset. No CLI, API, or DB schema changes.

Tier: 1 (zero-dep ✓ + foundation-aligned ✓)

Critical Concerns

None identified. The implementation is well-engineered with appropriate fallbacks and validation.

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: Excellent performance optimization that addresses a real CI bottleneck. The implementation is well-architected with proper error handling, backward compatibility, and clear documentation. The ~50% reduction in pre-publish CI time for resolution benchmarks directly benefits development velocity. Code quality is high with appropriate abstractions and type safety.

Minor suggestions for future iterations:

  1. Consider consolidating the duplicate computeMetrics logic into a shared module
  2. The tracer validation mentioned in the PR description could benefit from similar treatment in a follow-up PR

This is exactly the kind of thoughtful optimization that improves developer experience without compromising test quality or introducing technical debt.


@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 19, 2026

Greptile Summary

This PR eliminates the duplicate fixture-rebuild work in the pre-publish-benchmark job by extending the resolution benchmark script to emit falsePositiveEdges/falseNegativeEdges in its JSON output, and teaching the gate test to consume that artifact via RESOLUTION_RESULT_JSON instead of re-running every 34-language build from scratch. All three previously flagged P1 concerns (missing numeric-field validation, silent empty-artifact pass-through, and precision/recall rounding discrepancy) have been addressed in this version.

  • scripts/resolution-benchmark.ts: Removes precision/recall rounding and adds falsePositiveEdges/falseNegativeEdges arrays to LangResult.
  • tests/benchmarks/resolution/resolution-benchmark.test.ts: Adds loadArtifact and metricsFromArtifact with full validation, branching beforeAll for artifact vs fixture mode.
  • .github/workflows/publish.yml: Passes RESOLUTION_RESULT_JSON into the gate step so CI uses the pre-computed artifact.

Confidence Score: 5/5

Safe to merge — artifact-mode and fixture-mode paths are well-isolated, all validation gaps from earlier reviews have been addressed, and workflow step ordering guarantees the artifact exists before the gate consumes it.

All three issues flagged in prior review rounds (missing numeric-field validation, silent empty-artifact pass-through, and the rounding discrepancy) are now guarded. The new code paths are simple, the fallback fixture mode is untouched, and the only workflow change is a single env-var injection into an already-ordered step sequence.

No files require special attention.

Important Files Changed

Filename Overview
scripts/resolution-benchmark.ts Adds falsePositiveEdges/falseNegativeEdges to LangResult and removes precision/recall rounding
tests/benchmarks/resolution/resolution-benchmark.test.ts Adds artifact-mode path with loadArtifact/metricsFromArtifact; all previously flagged validation gaps are now guarded
.github/workflows/publish.yml Adds RESOLUTION_RESULT_JSON env var to the gate step; step ordering ensures the artifact exists before the gate consumes it

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Run resolution benchmark] -->|writes resolution-result.json| B[resolution-result.json]
    B --> C{RESOLUTION_RESULT_JSON set?}
    C -->|Yes - CI| D[loadArtifact validate + parse JSON]
    D --> E{artifact empty?}
    E -->|Yes| F[throw error]
    E -->|No| G[metricsFromArtifact per lang]
    G -->|missing fields| H[throw error regenerate]
    G -->|valid| I[Run threshold assertions]
    C -->|No - local| J[discoverFixtures]
    J --> K[copyFixture + buildFixtureGraph]
    K --> L[computeMetrics]
    L --> I
    I -->|pass| M[Gate pass]
    I -->|fail| N[Gate fail with edge list]
Loading

Reviews (4): Last reviewed commit: "Merge branch 'main' into refactor/1052-d..." | Re-trigger Greptile

Comment on lines +297 to +302
function metricsFromArtifact(lang: string, raw: ArtifactLangResult): BenchmarkMetrics {
if (!Array.isArray(raw.falsePositiveEdges) || !Array.isArray(raw.falseNegativeEdges)) {
throw new Error(
`Resolution artifact for ${lang} is missing falsePositiveEdges/falseNegativeEdges — regenerate with the current resolution-benchmark.ts.`,
);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 metricsFromArtifact validates the two edge-list arrays but leaves every numeric field (precision, recall, truePositives, totalResolved, totalExpected, byMode) unchecked. A stale or hand-edited artifact with null or a missing key for one of these fields will throw a confusing TypeError at the threshold assertions rather than the clear "regenerate" message.

Suggested change
function metricsFromArtifact(lang: string, raw: ArtifactLangResult): BenchmarkMetrics {
if (!Array.isArray(raw.falsePositiveEdges) || !Array.isArray(raw.falseNegativeEdges)) {
throw new Error(
`Resolution artifact for ${lang} is missing falsePositiveEdges/falseNegativeEdges — regenerate with the current resolution-benchmark.ts.`,
);
}
function metricsFromArtifact(lang: string, raw: ArtifactLangResult): BenchmarkMetrics {
if (
typeof raw.precision !== 'number' ||
typeof raw.recall !== 'number' ||
typeof raw.truePositives !== 'number' ||
typeof raw.falsePositives !== 'number' ||
typeof raw.falseNegatives !== 'number' ||
typeof raw.totalResolved !== 'number' ||
typeof raw.totalExpected !== 'number' ||
!raw.byMode
) {
throw new Error(
`Resolution artifact for ${lang} is missing required numeric fields — regenerate with the current resolution-benchmark.ts.`,
);
}
if (!Array.isArray(raw.falsePositiveEdges) || !Array.isArray(raw.falseNegativeEdges)) {
throw new Error(
`Resolution artifact for ${lang} is missing falsePositiveEdges/falseNegativeEdges — regenerate with the current resolution-benchmark.ts.`,
);
}

Fix in Claude Code

Comment on lines +331 to +335
const artifact = ARTIFACT_PATH ? loadArtifact(ARTIFACT_PATH) : null;
// In artifact mode, drive the suite from the keys in the artifact so we never
// silently skip a language the script reported. In local mode, discover from
// the filesystem like before.
const languages = artifact ? Object.keys(artifact).sort() : discoverFixtures();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Empty artifact silently passes the gate

If resolution-result.json is valid JSON but contains {} (e.g., the benchmark script ran with no discoverable fixtures), Object.keys(artifact) returns [], no describe blocks are registered, and vitest exits 0 with "0 tests". The gate would pass without evaluating a single threshold. A guard after loadArtifact throwing on an empty result would make this failure mode explicit.

Fix in Claude Code

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 19, 2026

Codegraph Impact Analysis

3 functions changed1 callers affected across 1 files

  • LangResult.falsePositiveEdges in scripts/resolution-benchmark.ts:69 (0 transitive callers)
  • LangResult.falseNegativeEdges in scripts/resolution-benchmark.ts:70 (0 transitive callers)
  • computeMetrics in scripts/resolution-benchmark.ts:102 (1 transitive callers)

…acts (#1167)

- scripts/resolution-benchmark.ts: stop rounding precision/recall to 3
  decimals before writing the artifact. The rounding let a near-miss like
  0.8497 round up to 0.850 and silently clear a 0.85 threshold in CI
  artifact mode while failing in fixture mode.
- tests/benchmarks/resolution/resolution-benchmark.test.ts: validate
  numeric fields in metricsFromArtifact so a stale or malformed artifact
  surfaces a clear 'regenerate' error instead of a confusing TypeError at
  the threshold assertions.
- tests/benchmarks/resolution/resolution-benchmark.test.ts: reject an
  empty artifact in loadArtifact. Without this guard, an empty {} would
  register zero describe blocks and vitest would exit 0 with '0 tests',
  silently passing the gate.
@carlos-alm
Copy link
Copy Markdown
Contributor Author

Addressed Greptile feedback in e7b2f27:

  • P1 (precision/recall rounding): Removed the Math.round(... * 1000) / 1000 step in scripts/resolution-benchmark.ts:computeMetrics. The artifact now stores full-precision values so the CI gate compares the exact numbers the fixture-mode gate would, closing the 0.8497 -> 0.850 silent-pass band.
  • P2 (missing numeric field validation): metricsFromArtifact now type-checks precision, recall, truePositives, falsePositives, falseNegatives, totalResolved, totalExpected, and byMode before consumption, surfacing the clear 'regenerate' error instead of a downstream TypeError.
  • P2 (empty artifact silently passes): loadArtifact now rejects {} / non-object payloads with an explicit error, so an empty artifact can no longer register zero describe blocks and exit 0.

Verified by feeding the test crafted artifacts (empty, malformed, and valid) — each path produces the expected error or passes.

@carlos-alm
Copy link
Copy Markdown
Contributor Author

@greptileai

@carlos-alm carlos-alm merged commit 383cbbe into main May 20, 2026
21 checks passed
@carlos-alm carlos-alm deleted the refactor/1052-dedupe-resolution-bench branch May 20, 2026 00:29
@github-actions github-actions Bot locked and limited conversation to collaborators May 20, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

follow-up: deduplicate resolution benchmark work between script and gate test

1 participant