refactor(ci): let resolution gate reuse benchmark artifact by carlos-alm · Pull Request #1167 · optave/ops-codegraph-tool

carlos-alm · 2026-05-19T07:55:15Z

Summary

The pre-publish-benchmark job ran the full resolution benchmark twice — once in scripts/resolution-benchmark.ts (which writes resolution-result.json) and again in tests/benchmarks/resolution/resolution-benchmark.test.ts (which independently rebuilt every fixture). With ~34 language fixtures this roughly doubled the slowest part of the pre-publish pipeline.
This change extends the script's LangResult output to include falsePositiveEdges / falseNegativeEdges arrays, and teaches the gate test to consume resolution-result.json when RESOLUTION_RESULT_JSON is set. The legacy build-from-fixtures path stays as a fallback so devs can still run the test standalone with npx vitest run ….
publish.yml now passes RESOLUTION_RESULT_JSON: ${{ github.workspace }}/resolution-result.json into the Gate step.

Test plan

npx vitest run tests/benchmarks/resolution/resolution-benchmark.test.ts -t javascript — fixture mode, 5/5 pass
Generate resolution-result.json via node scripts/resolution-benchmark.ts (34 languages, all keys present including the new edge lists)
RESOLUTION_RESULT_JSON=/tmp/resolution-result.json npx vitest run … — artifact mode, 170/170 pass in ~0.5 s (vs. minutes for fixture rebuild)
Sanity-check error paths: missing artifact file → clear error; artifact missing falsePositiveEdges → clear "regenerate" error
Biome lint on the changed files clean (pre-existing warnings in unrelated test files only)
Watch publish workflow on next dispatch to confirm CI savings

Notes

The issue also asks to consider the same dedup for the tracer validation gate. That duplication exists, but it requires a more invasive change to the script's output format (raw tracer edges + a status field, not just counts), and the toolchain-missing semantics differ between the script and the tracer test. Tracked separately in #1166 to keep this PR focused.

Closes #1052

The pre-publish-benchmark job's `Run resolution benchmark` step builds codegraphs for ~34 language fixtures, computes precision/recall, and writes resolution-result.json. The `Gate on resolution thresholds` step that follows then ran the same vitest suite which independently copied every fixture and rebuilt the graphs again — doubling the most expensive slice of the publish pipeline. Extend the script's per-language LangResult to include falsePositiveEdges and falseNegativeEdges so the gate test has everything it needs for the existing precision/recall threshold assertions and failure messages. Refactor the gate test to consume that artifact when RESOLUTION_RESULT_JSON is set, falling back to the build-from-fixtures path when unset so devs can still run `npx vitest run tests/benchmarks/resolution/...` standalone. Wire the env var through the workflow's Gate step. Verified locally: gate test in artifact mode passes 170/170 in ~0.5s against an artifact produced by scripts/resolution-benchmark.ts, and the legacy build-from-fixtures path still passes for the javascript fixture. Closes #1052

claude · 2026-05-19T07:55:31Z

Claude finished @carlos-alm's task —— View job

PR Review: Resolution Benchmark Deduplication

Todo List:

Read and understand changed files
Check repository guidelines (CLAUDE.md)
Phase 1: Root Cause Analysis
Phase 2: Configuration Change Detection
Phase 3: Test Integrity
Phase 4: Code Quality
Phase 5: Backlog Compliance
Provide final assessment and recommendation

Root Cause Analysis

Problem Being Solved: The CI pipeline was running the resolution benchmark twice — once in scripts/resolution-benchmark.ts to generate metrics and again in the test suite which rebuilt all ~34 language fixtures from scratch. This roughly doubled the execution time of the slowest part of the pre-publish pipeline.

Why This Approach: The solution extends the script's output format to include edge arrays (falsePositiveEdges/falseNegativeEdges) and teaches the test to consume the pre-computed artifact when RESOLUTION_RESULT_JSON is set. This eliminates duplicate fixture processing while maintaining backward compatibility for local development.

Risk Assessment: ✅ Low risk. The change maintains a fallback path for local development and adds proper validation to ensure the artifact contains required fields.

Configuration Change Detection

✅ No high-risk configuration changes detected

Workflow modification adds environment variable but doesn't relax any thresholds
No test timeout increases or coverage reductions
No linting rules disabled or build configuration modified

Test Integrity

✅ Test integrity maintained

Tests maintain identical validation logic and thresholds
Added validation ensures artifact format compatibility: metricsFromArtifact:298-302
Error handling provides clear guidance when artifact is missing/invalid: loadArtifact:288-295
Backward compatibility preserved for local runs without the environment variable

Code Quality

✅ Good implementation quality

Clean separation: Artifact loading logic is well-contained in lines tests/benchmarks/resolution/resolution-benchmark.test.ts:274-316
Type safety: Proper TypeScript interfaces for artifact format: ArtifactLangResult:275-286
Error handling: Clear error messages for missing artifacts and format validation
Comments: Excellent documentation explaining the optimization purpose and fallback behavior

Minor observations:

Code duplication between script and test for computeMetrics function, but this is acceptable for the performance gain achieved
Interface naming could be more consistent (ArtifactLangResult vs LangResult), but functionally equivalent

Backlog Compliance

Zero-dep: ✅ — No new runtime dependencies added. Uses existing Node.js fs/path modules.

Foundation-aligned: ✅ — Fully compliant with all core principles:

P1 (always-current): Doesn't add cloud dependencies or external services
P4 (zero-cost core): Optimizes CI performance without changing runtime behavior
P5 (embeddable-first): No API contract changes
P8 (not an agent): Pure optimization, no decision-making or code generation

Problem-fit: 4/5 — Directly improves CI efficiency which benefits development velocity and AI agent workflows by reducing feedback cycle time. While not directly addressing the "lost AI" problem, faster CI enables more rapid iteration on core features that do.

Breaking: No — Purely additive change. Existing test behavior preserved when environment variable is unset. No CLI, API, or DB schema changes.

Tier: 1 (zero-dep ✓ + foundation-aligned ✓)

Critical Concerns

None identified. The implementation is well-engineered with appropriate fallbacks and validation.

Final Recommendation

Rating: ⭐⭐⭐⭐☆ (4/5)
Action: APPROVE
Reasoning: Excellent performance optimization that addresses a real CI bottleneck. The implementation is well-architected with proper error handling, backward compatibility, and clear documentation. The ~50% reduction in pre-publish CI time for resolution benchmarks directly benefits development velocity. Code quality is high with appropriate abstractions and type safety.

Minor suggestions for future iterations:

Consider consolidating the duplicate computeMetrics logic into a shared module
The tracer validation mentioned in the PR description could benefit from similar treatment in a follow-up PR

This is exactly the kind of thoughtful optimization that improves developer experience without compromising test quality or introducing technical debt.

greptile-apps · 2026-05-19T07:59:14Z

Greptile Summary

This PR eliminates the duplicate fixture-rebuild work in the pre-publish-benchmark job by extending the resolution benchmark script to emit falsePositiveEdges/falseNegativeEdges in its JSON output, and teaching the gate test to consume that artifact via RESOLUTION_RESULT_JSON instead of re-running every 34-language build from scratch. All three previously flagged P1 concerns (missing numeric-field validation, silent empty-artifact pass-through, and precision/recall rounding discrepancy) have been addressed in this version.

scripts/resolution-benchmark.ts: Removes precision/recall rounding and adds falsePositiveEdges/falseNegativeEdges arrays to LangResult.
tests/benchmarks/resolution/resolution-benchmark.test.ts: Adds loadArtifact and metricsFromArtifact with full validation, branching beforeAll for artifact vs fixture mode.
.github/workflows/publish.yml: Passes RESOLUTION_RESULT_JSON into the gate step so CI uses the pre-computed artifact.

Confidence Score: 5/5

Safe to merge — artifact-mode and fixture-mode paths are well-isolated, all validation gaps from earlier reviews have been addressed, and workflow step ordering guarantees the artifact exists before the gate consumes it.

All three issues flagged in prior review rounds (missing numeric-field validation, silent empty-artifact pass-through, and the rounding discrepancy) are now guarded. The new code paths are simple, the fallback fixture mode is untouched, and the only workflow change is a single env-var injection into an already-ordered step sequence.

No files require special attention.

Important Files Changed

Filename	Overview
scripts/resolution-benchmark.ts	Adds falsePositiveEdges/falseNegativeEdges to LangResult and removes precision/recall rounding
tests/benchmarks/resolution/resolution-benchmark.test.ts	Adds artifact-mode path with loadArtifact/metricsFromArtifact; all previously flagged validation gaps are now guarded
.github/workflows/publish.yml	Adds RESOLUTION_RESULT_JSON env var to the gate step; step ordering ensures the artifact exists before the gate consumes it

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Run resolution benchmark] -->|writes resolution-result.json| B[resolution-result.json]
    B --> C{RESOLUTION_RESULT_JSON set?}
    C -->|Yes - CI| D[loadArtifact validate + parse JSON]
    D --> E{artifact empty?}
    E -->|Yes| F[throw error]
    E -->|No| G[metricsFromArtifact per lang]
    G -->|missing fields| H[throw error regenerate]
    G -->|valid| I[Run threshold assertions]
    C -->|No - local| J[discoverFixtures]
    J --> K[copyFixture + buildFixtureGraph]
    K --> L[computeMetrics]
    L --> I
    I -->|pass| M[Gate pass]
    I -->|fail| N[Gate fail with edge list]

_{Reviews (4): Last reviewed commit: "Merge branch 'main' into refactor/1052-d..." | Re-trigger Greptile}

greptile-apps · 2026-05-19T07:59:21Z

+function metricsFromArtifact(lang: string, raw: ArtifactLangResult): BenchmarkMetrics {
+  if (!Array.isArray(raw.falsePositiveEdges) || !Array.isArray(raw.falseNegativeEdges)) {
+    throw new Error(
+      `Resolution artifact for ${lang} is missing falsePositiveEdges/falseNegativeEdges — regenerate with the current resolution-benchmark.ts.`,
+    );
+  }


metricsFromArtifact validates the two edge-list arrays but leaves every numeric field (precision, recall, truePositives, totalResolved, totalExpected, byMode) unchecked. A stale or hand-edited artifact with null or a missing key for one of these fields will throw a confusing TypeError at the threshold assertions rather than the clear "regenerate" message.

Suggested change

function metricsFromArtifact(lang: string, raw: ArtifactLangResult): BenchmarkMetrics {

if (!Array.isArray(raw.falsePositiveEdges) || !Array.isArray(raw.falseNegativeEdges)) {

throw new Error(

`Resolution artifact for ${lang} is missing falsePositiveEdges/falseNegativeEdges — regenerate with the current resolution-benchmark.ts.`,

);

}

function metricsFromArtifact(lang: string, raw: ArtifactLangResult): BenchmarkMetrics {

if (

typeof raw.precision !== 'number' ||

typeof raw.recall !== 'number' ||

typeof raw.truePositives !== 'number' ||

typeof raw.falsePositives !== 'number' ||

typeof raw.falseNegatives !== 'number' ||

typeof raw.totalResolved !== 'number' ||

typeof raw.totalExpected !== 'number' ||

!raw.byMode

) {

throw new Error(

`Resolution artifact for ${lang} is missing required numeric fields — regenerate with the current resolution-benchmark.ts.`,

);

}

if (!Array.isArray(raw.falsePositiveEdges) || !Array.isArray(raw.falseNegativeEdges)) {

throw new Error(

`Resolution artifact for ${lang} is missing falsePositiveEdges/falseNegativeEdges — regenerate with the current resolution-benchmark.ts.`,

);

}

greptile-apps · 2026-05-19T07:59:24Z

+const artifact = ARTIFACT_PATH ? loadArtifact(ARTIFACT_PATH) : null;
+// In artifact mode, drive the suite from the keys in the artifact so we never
+// silently skip a language the script reported. In local mode, discover from
+// the filesystem like before.
+const languages = artifact ? Object.keys(artifact).sort() : discoverFixtures();


Empty artifact silently passes the gate

If resolution-result.json is valid JSON but contains {} (e.g., the benchmark script ran with no discoverable fixtures), Object.keys(artifact) returns [], no describe blocks are registered, and vitest exits 0 with "0 tests". The gate would pass without evaluating a single threshold. A guard after loadArtifact throwing on an empty result would make this failure mode explicit.

github-actions · 2026-05-19T08:02:39Z

Codegraph Impact Analysis

3 functions changed → 1 callers affected across 1 files

LangResult.falsePositiveEdges in scripts/resolution-benchmark.ts:69 (0 transitive callers)
LangResult.falseNegativeEdges in scripts/resolution-benchmark.ts:70 (0 transitive callers)
computeMetrics in scripts/resolution-benchmark.ts:102 (1 transitive callers)

…acts (#1167) - scripts/resolution-benchmark.ts: stop rounding precision/recall to 3 decimals before writing the artifact. The rounding let a near-miss like 0.8497 round up to 0.850 and silently clear a 0.85 threshold in CI artifact mode while failing in fixture mode. - tests/benchmarks/resolution/resolution-benchmark.test.ts: validate numeric fields in metricsFromArtifact so a stale or malformed artifact surfaces a clear 'regenerate' error instead of a confusing TypeError at the threshold assertions. - tests/benchmarks/resolution/resolution-benchmark.test.ts: reject an empty artifact in loadArtifact. Without this guard, an empty {} would register zero describe blocks and vitest would exit 0 with '0 tests', silently passing the gate.

carlos-alm · 2026-05-19T08:03:23Z

Addressed Greptile feedback in e7b2f27:

P1 (precision/recall rounding): Removed the Math.round(... * 1000) / 1000 step in scripts/resolution-benchmark.ts:computeMetrics. The artifact now stores full-precision values so the CI gate compares the exact numbers the fixture-mode gate would, closing the 0.8497 -> 0.850 silent-pass band.
P2 (missing numeric field validation): metricsFromArtifact now type-checks precision, recall, truePositives, falsePositives, falseNegatives, totalResolved, totalExpected, and byMode before consumption, surfacing the clear 'regenerate' error instead of a downstream TypeError.
P2 (empty artifact silently passes): loadArtifact now rejects {} / non-object payloads with an explicit error, so an empty artifact can no longer register zero describe blocks and exit 0.

Verified by feeding the test crafted artifacts (empty, malformed, and valid) — each path produces the expected error or passes.

carlos-alm · 2026-05-19T08:03:29Z

@greptileai

greptile-apps Bot reviewed May 19, 2026

View reviewed changes

Merge branch 'main' into refactor/1052-dedupe-resolution-bench

b9d431b

carlos-alm mentioned this pull request May 19, 2026

refactor(ci): let tracer-validation gate reuse benchmark artifact #1171

Merged

6 tasks

carlos-alm merged commit 383cbbe into main May 20, 2026
21 checks passed

carlos-alm deleted the refactor/1052-dedupe-resolution-bench branch May 20, 2026 00:29

github-actions Bot locked and limited conversation to collaborators May 20, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(ci): let resolution gate reuse benchmark artifact#1167

refactor(ci): let resolution gate reuse benchmark artifact#1167
carlos-alm merged 3 commits into
mainfrom
refactor/1052-dedupe-resolution-bench

carlos-alm commented May 19, 2026

Uh oh!

claude Bot commented May 19, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented May 19, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot May 19, 2026

Uh oh!

greptile-apps Bot May 19, 2026

Uh oh!

github-actions Bot commented May 19, 2026 •

edited

Loading

Uh oh!

carlos-alm commented May 19, 2026

Uh oh!

carlos-alm commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

carlos-alm commented May 19, 2026

Summary

Test plan

Notes

Uh oh!

claude Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Resolution Benchmark Deduplication

Todo List:

Root Cause Analysis

Configuration Change Detection

Test Integrity

Code Quality

Backlog Compliance

Critical Concerns

Final Recommendation

Uh oh!

greptile-apps Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codegraph Impact Analysis

Uh oh!

carlos-alm commented May 19, 2026

Uh oh!

carlos-alm commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented May 19, 2026 •

edited

Loading

greptile-apps Bot commented May 19, 2026 •

edited

Loading

github-actions Bot commented May 19, 2026 •

edited

Loading