Skip to content

refactor(ci): let tracer-validation gate reuse benchmark artifact#1171

Merged
carlos-alm merged 6 commits into
mainfrom
refactor/1166-dedupe-tracer-validation
May 20, 2026
Merged

refactor(ci): let tracer-validation gate reuse benchmark artifact#1171
carlos-alm merged 6 commits into
mainfrom
refactor/1166-dedupe-tracer-validation

Conversation

@carlos-alm
Copy link
Copy Markdown
Contributor

Summary

Changes

  • scripts/resolution-benchmark.ts — Extend the per-language LangResult with a new required tracer: { status: 'ok' | 'skipped', edges } block. runDynamicTracer now mirrors tracer-validation.test.ts's skip semantics (error && edges.length === 0skipped). Existing dynamicEdges / dynamicConfirmed count fields are preserved for the report scripts.
  • tests/benchmarks/resolution/tracer/tracer-validation.test.ts — Add loadArtifact / tracerFromArtifact consuming RESOLUTION_RESULT_JSON when set. In artifact mode the suite drives the language list from artifact keys (so a language the script reported can never be silently skipped). When the env var is unset, fall back to spawning run-tracer.mjs so local npx vitest run … still works standalone.
  • .github/workflows/publish.yml — Wire RESOLUTION_RESULT_JSON into the Run tracer validation step, mirroring the pattern already in place for the resolution gate.

Stacked on

This branch is stacked on #1167 (refactor/1052-dedupe-resolution-bench) — base will need to retarget main after that lands.

Test plan

  • Lint clean (npm run lint)
  • Fallback mode: npx vitest run tests/benchmarks/resolution/tracer/tracer-validation.test.ts -t "javascript" passes (spawns run-tracer.mjs like before)
  • Artifact mode (status=ok): synthetic resolution-result.json with tracer.status="ok" exercises the recall computation and threshold check
  • Artifact mode (status=skipped): tracer.status="skipped" → test reports status: skipped row, no failure
  • Artifact mode (missing tracer block): clear regenerate with the current scripts/resolution-benchmark.ts error
  • CI: Run tracer validation step picks up RESOLUTION_RESULT_JSON from the workspace and reuses the artifact instead of re-spawning the tracer subprocess

Closes #1166

The pre-publish-benchmark job's `Run resolution benchmark` step builds
codegraphs for ~34 language fixtures, computes precision/recall, and
writes resolution-result.json. The `Gate on resolution thresholds` step
that follows then ran the same vitest suite which independently copied
every fixture and rebuilt the graphs again — doubling the most expensive
slice of the publish pipeline.

Extend the script's per-language LangResult to include
falsePositiveEdges and falseNegativeEdges so the gate test has
everything it needs for the existing precision/recall threshold
assertions and failure messages. Refactor the gate test to consume
that artifact when RESOLUTION_RESULT_JSON is set, falling back to the
build-from-fixtures path when unset so devs can still run
`npx vitest run tests/benchmarks/resolution/...` standalone. Wire the
env var through the workflow's Gate step.

Verified locally: gate test in artifact mode passes 170/170 in ~0.5s
against an artifact produced by scripts/resolution-benchmark.ts, and
the legacy build-from-fixtures path still passes for the javascript
fixture.

Closes #1052
…acts (#1167)

- scripts/resolution-benchmark.ts: stop rounding precision/recall to 3
  decimals before writing the artifact. The rounding let a near-miss like
  0.8497 round up to 0.850 and silently clear a 0.85 threshold in CI
  artifact mode while failing in fixture mode.
- tests/benchmarks/resolution/resolution-benchmark.test.ts: validate
  numeric fields in metricsFromArtifact so a stale or malformed artifact
  surfaces a clear 'regenerate' error instead of a confusing TypeError at
  the threshold assertions.
- tests/benchmarks/resolution/resolution-benchmark.test.ts: reject an
  empty artifact in loadArtifact. Without this guard, an empty {} would
  register zero describe blocks and vitest would exit 0 with '0 tests',
  silently passing the gate.
The pre-publish-benchmark job's `Run resolution benchmark` step already
runs the per-language tracer once for every fixture (the script writes
dynamic-edge counts into resolution-result.json for telemetry). The
`Run tracer validation` step that follows then re-runs the same tracer
subprocess once more per language to compute same-file recall — doubling
tracer cost on the publish path.

Spun off from #1052 / #1167, which deduped the resolution gate the
same way. The tracer side wasn't trivial to fold into that PR because
runDynamicTracer and runTracer had different toolchain-missing
semantics (empty array vs null), so the artifact needed a status field
to round-trip the "skipped" signal.

- Extend the script's TracerArtifact to `{ status: 'ok' | 'skipped',
  edges }`, attached to every LangResult as the new `tracer` block.
  `skipped` is set when run-tracer.mjs reports an error and produced
  no edges — mirroring tracer-validation.test.ts's skip semantics so
  it round-trips through the artifact.
- Refactor tracer-validation.test.ts to consume the artifact when
  RESOLUTION_RESULT_JSON is set (driving the language list from the
  artifact keys so we never silently drop a language), falling back
  to spawning run-tracer.mjs when unset so local
  `npx vitest run …` runs still work standalone.
- Wire RESOLUTION_RESULT_JSON into the workflow's tracer validation
  step, same pattern as the resolution gate.

Verified locally: artifact mode correctly distinguishes status=ok
(runs recall computation, hits threshold check) from status=skipped
(graceful skip), and the missing-tracer-block error path surfaces a
clear "regenerate" message. Fallback mode still drives the suite
from the filesystem when the env var is unset.

Closes #1166
@claude
Copy link
Copy Markdown

claude Bot commented May 19, 2026

Claude finished @carlos-alm's task —— View job


PR Review: Tracer Validation Deduplication

Review Progress

  • Read and analyze changed files
  • Phase 1: Root cause analysis
  • Phase 2: Configuration change detection
  • Phase 3: Test integrity review
  • Phase 4: Code quality assessment
  • Phase 5: Backlog compliance evaluation
  • Final assessment and recommendation

Phase 1: Root Cause Analysis ✅

Problem Being Solved: The pre-publish-benchmark job was running per-language tracers twice - once in "Run resolution benchmark" and again in "Run tracer validation". This PR eliminates the duplicate work by having the tracer validation reuse the benchmark artifact.

Why This Approach: The approach is well-justified. The duplication was identified as inefficiency in issue #1166. Rather than re-spawning tracer subprocesses, the validation now reads pre-captured edges from the resolution benchmark artifact, cutting CI time in half for this step.

Risk Assessment: Low risk. This is a pure optimization that preserves all existing semantics. The fallback mode ensures local development workflows remain unchanged.

Phase 2: Configuration Change Detection ✅

No high-risk changes detected:

  • No Biome/linting rules modified
  • No test timeouts changed or coverage thresholds lowered
  • No TypeScript strict flags disabled
  • No module resolution changes
  • Workflow change is purely additive (sets env var for artifact reuse)

Phase 3: Test Integrity Assessment ✅

The changes preserve test integrity well:

Preserved semantics: The tracerFromArtifact function maintains the same null-return contract for skipped languages as the original runTracer function

Robust fallback: Local development unaffected - when RESOLUTION_RESULT_JSON is unset, falls back to spawning run-tracer.mjs

Artifact validation: Strong validation logic rejects malformed artifacts and provides clear error messages

Improved safety: Artifact mode drives test discovery from artifact keys instead of filesystem, preventing silent skips when a language the script reported is missing

Phase 4: Code Quality Assessment ✅

Implementation Quality: Excellent

  • Clean type definitions with clear documentation
  • Proper error handling with descriptive messages
  • Maintains backward compatibility seamlessly
  • Good separation of concerns (artifact loading vs execution logic)

Specific Strengths:

  • Line 184: scripts/resolution-benchmark.ts:184 - Excellent error messaging for missing tracer blocks
  • Line 203: scripts/resolution-benchmark.ts:203 - Proper handling of edge case where error occurs but edges exist
  • Line 188: tests/.../tracer-validation.test.ts:188 - Preserves null-as-skipped contract consistently
  • Line 249: Test suite drives from artifact keys in CI mode, preventing silent language omissions

No significant concerns found

Phase 5: Backlog Compliance ✅

Backlog Compliance

  • Zero-dep: ✓ — No new runtime dependencies added. Builds entirely on existing infrastructure.
  • Foundation-aligned: ✓ — Fully aligned with all core principles. Improves P1 (always-current) by making CI faster. Maintains P4 (zero-cost core) by eliminating duplicate work. No cloud APIs or external services.
  • Problem-fit: 3/5 — This is a CI optimization that doesn't directly address the core "AI agents waste tokens" problem. However, it's valuable infrastructure work that makes the benchmark pipeline more efficient.
  • Breaking: No — This is purely additive. No CLI output changes, API changes, or schema changes. Fallback mode preserves all existing behavior.
  • Tier: 1 — Zero-dep ✓ + Foundation-aligned ✓

Critical Concerns

None identified. This is a well-executed infrastructure optimization with thorough error handling and backward compatibility.

Final Assessment

## Root Cause Analysis
**Problem Being Solved**: Pre-publish CI was running per-language tracers twice (benchmark + validation), causing inefficiency
**Why This Approach**: Clean artifact reuse with robust fallback, eliminates duplicate tracer spawning while preserving all semantics
**Risk Assessment**: Very low risk - pure optimization with comprehensive backward compatibility

## Backlog Compliance  
- Zero-dep: ✓ — No new runtime dependencies
- Foundation-aligned: ✓ — Improves P1 (always-current) by optimizing CI pipeline
- Problem-fit: 3 — Infrastructure optimization, doesn't directly solve core AI agent problem but valuable
- Breaking: No — Purely additive with complete fallback preservation
- Tier: 1

## Critical Concerns
None

## Final Recommendation
- Rating: ⭐⭐⭐⭐ (4/5)
- Action: APPROVE
- Reasoning: Excellent technical implementation of a valuable CI optimization. Clean code, thorough error handling, preserved semantics, zero breaking changes. The artifact reuse pattern could be a model for other CI optimizations.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 19, 2026

Greptile Summary

This PR eliminates a redundant per-language tracer subprocess invocation in CI by storing the tracer edges captured during Run resolution benchmark into resolution-result.json and having the Run tracer validation step read them back via RESOLUTION_RESULT_JSON, mirroring the pattern already in place for the resolution gate.

  • scripts/resolution-benchmark.ts: LangResult gains a required tracer: { status, edges } field; runDynamicTracer returns TracerArtifact instead of a raw edge array, with status: 'skipped' when the toolchain is missing and status: 'ok' otherwise.
  • tests/benchmarks/resolution/tracer/tracer-validation.test.ts: New loadArtifact / tracerFromArtifact helpers consume the artifact in CI; the language list is driven from artifact keys (preventing silent skips); local runs fall back to spawning run-tracer.mjs unchanged.
  • .github/workflows/publish.yml: RESOLUTION_RESULT_JSON env var added to the tracer-validation step.

Confidence Score: 5/5

Safe to merge; the deduplication logic is well-guarded and the local fallback is preserved.

All three changed files have narrow, well-scoped changes. The artifact-read path has robust error handling for missing files, malformed JSON, and empty results. The existing workflow step ordering guarantees resolution-result.json is written before it is consumed. The local fallback via runTracer is untouched.

No files require special attention beyond the note about artifact-mode language discovery in tracer-validation.test.ts.

Important Files Changed

Filename Overview
scripts/resolution-benchmark.ts Adds TracerArtifact type and extends LangResult with a required tracer field; changes runDynamicTracer return type from DynamicEdge[] to TracerArtifact.
tests/benchmarks/resolution/tracer/tracer-validation.test.ts Adds loadArtifact / tracerFromArtifact helpers and switches the language list to artifact keys in CI mode; JSON parse errors are now wrapped with a helpful message.
.github/workflows/publish.yml Wires RESOLUTION_RESULT_JSON into the tracer-validation step using the workspace-relative path already used by the resolution gate.

Sequence Diagram

sequenceDiagram
    participant RB as Run resolution benchmark
    participant FS as resolution-result.json
    participant RG as Gate on resolution thresholds
    participant TV as Run tracer validation

    RB->>RB: runDynamicTracer(lang)
    RB->>FS: write JSON with tracer block per lang
    RG->>FS: read via RESOLUTION_RESULT_JSON
    RG->>RG: validate thresholds
    TV->>FS: read via RESOLUTION_RESULT_JSON
    TV->>TV: loadArtifact() returns artifact map
    TV->>TV: tracerFromArtifact per lang
    TV->>TV: "computeSameFileRecall assert >= threshold"
    note over RB,TV: Before this PR TV spawned run-tracer.mjs per language. After TV reuses edges from RB.
Loading

Fix All in Claude Code

Reviews (4): Last reviewed commit: "fix(test): wrap JSON.parse with helpful ..." | Re-trigger Greptile

Comment on lines +158 to +176
function loadArtifact(artifactPath: string): Record<string, ArtifactLangResult> {
if (!fs.existsSync(artifactPath)) {
throw new Error(
`RESOLUTION_RESULT_JSON=${artifactPath} not found — run scripts/resolution-benchmark.ts first.`,
);
}
const parsed = JSON.parse(fs.readFileSync(artifactPath, 'utf-8')) as Record<
string,
ArtifactLangResult
>;
// Refuse to proceed on an empty artifact: with zero languages, vitest would
// register no test cases and exit 0, silently passing the gate.
if (!parsed || typeof parsed !== 'object' || Object.keys(parsed).length === 0) {
throw new Error(
`RESOLUTION_RESULT_JSON=${artifactPath} contains no language results — regenerate with scripts/resolution-benchmark.ts.`,
);
}
return parsed;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 JSON parse error not wrapped with a helpful message

If resolution-result.json exists but contains malformed JSON, JSON.parse will throw a raw SyntaxError with no reference to RESOLUTION_RESULT_JSON or what to do next. The existing helpful error messages ("not found", "contains no language results") cover the empty-file and missing-file cases, but a corrupt-file scenario produces a confusing traceback. Wrapping JSON.parse in a try/catch and re-throwing with the same "regenerate with scripts/resolution-benchmark.ts" context message would keep the DX consistent.

Fix in Claude Code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — wrapped JSON.parse in try/catch and re-throw with the 'regenerate with scripts/resolution-benchmark.ts' guidance, matching the helpful-error pattern used for the missing-file and empty-file cases. Verified locally that a malformed resolution-result.json now produces a clear error message instead of a raw SyntaxError traceback.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 19, 2026

Codegraph Impact Analysis

6 functions changed4 callers affected across 2 files

  • TracerArtifact.status in scripts/resolution-benchmark.ts:62 (0 transitive callers)
  • TracerArtifact.edges in scripts/resolution-benchmark.ts:63 (0 transitive callers)
  • LangResult.tracer in scripts/resolution-benchmark.ts:82 (0 transitive callers)
  • computeMetrics in scripts/resolution-benchmark.ts:114 (3 transitive callers)
  • runDynamicTracer in scripts/resolution-benchmark.ts:187 (1 transitive callers)
  • dynamicConfirmed in scripts/resolution-benchmark.ts:314 (0 transitive callers)

Base automatically changed from refactor/1052-dedupe-resolution-bench to main May 20, 2026 00:29
carlos-alm and others added 3 commits May 19, 2026 22:51
…tifact loader (#1171)

When RESOLUTION_RESULT_JSON points to a file with malformed JSON, the
artifact loader previously threw a raw SyntaxError with no context.
Mirror the existing 'regenerate with scripts/resolution-benchmark.ts'
guidance so the DX is consistent with the missing-file and empty-file
cases.
@carlos-alm
Copy link
Copy Markdown
Contributor Author

@greptileai

@carlos-alm carlos-alm merged commit 4d8df7b into main May 20, 2026
25 checks passed
@carlos-alm carlos-alm deleted the refactor/1166-dedupe-tracer-validation branch May 20, 2026 08:09
@github-actions github-actions Bot locked and limited conversation to collaborators May 20, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

follow-up: deduplicate tracer validation work between script and gate test

1 participant