Skip to content

refactor(ci): let tracer-validation reuse benchmark tracer output#1173

Closed
carlos-alm wants to merge 6 commits into
mainfrom
refactor/1166-dedup-tracer-validation
Closed

refactor(ci): let tracer-validation reuse benchmark tracer output#1173
carlos-alm wants to merge 6 commits into
mainfrom
refactor/1166-dedup-tracer-validation

Conversation

@carlos-alm
Copy link
Copy Markdown
Contributor

Summary

  • Extend `scripts/resolution-benchmark.ts` per-language result with a `tracer` artifact (`status: 'ok' | 'skipped'` + raw edges) so the tracer-validation gate test can reuse the subprocess output instead of respawning `run-tracer.mjs` per fixture.
  • Refactor `tests/benchmarks/resolution/tracer/tracer-validation.test.ts` to consume the artifact when `RESOLUTION_RESULT_JSON` is set, preserving skip-on-toolchain-missing semantics. Falls back to running the tracer directly when unset so standalone `vitest` runs still work.
  • Wire `RESOLUTION_RESULT_JSON` through the `Run tracer validation` step in `.github/workflows/publish.yml`.

Mirrors the resolution-gate dedup approach in #1052 (PR d697929).

Why

The pre-publish benchmark job runs each language's tracer subprocess once for telemetry (writing `dynamicEdges` / `dynamicConfirmed` into `resolution-result.json`). The tracer-validation step that follows ran the same subprocess again per fixture, doubling tracer cost in the publish pipeline.

The status field distinguishes "tracer ran, found nothing" (`ok` with empty edges) from "toolchain missing" (`skipped`), mirroring `runTracer`'s null-return semantics so the test still skips gracefully on environments without rustc/ghc/ruby/etc.

Test plan

  • `npx vitest run tests/benchmarks/resolution/tracer/tracer-validation.test.ts` (standalone) — passes (1/1 with the rest skipped on this machine since most language toolchains aren't installed)
  • `RESOLUTION_RESULT_JSON=/tmp/resolution-result.json npx vitest run tests/benchmarks/resolution/tracer/tracer-validation.test.ts` — passes 35/35 in ~160ms against an artifact produced by `scripts/resolution-benchmark.ts`
  • `RESOLUTION_RESULT_JSON=/tmp/resolution-result.json npx vitest run tests/benchmarks/resolution/resolution-benchmark.test.ts` — passes 170/170 (confirms the new `tracer` field doesn't disturb the existing resolution gate)
  • CI publish workflow's pre-publish-benchmark job

Closes #1166

The pre-publish-benchmark job's `Run resolution benchmark` step already
spawns `run-tracer.mjs` once per language fixture for telemetry (writing
`dynamicEdges` / `dynamicConfirmed` counts into resolution-result.json).
The `Run tracer validation` step that follows then ran the same per-
language tracer subprocess again to compute same-file recall — doubling
the tracer cost in the pre-publish job.

Extend the script's per-language LangResult with a `tracer` artifact
containing status ('ok' | 'skipped') and the raw captured edges. The
status distinction mirrors the gate test's `runTracer` semantics
(null-return on toolchain-missing → 'skipped'). Refactor the gate test
to consume that artifact when RESOLUTION_RESULT_JSON is set, falling
back to running run-tracer.mjs directly when unset so devs can still
execute `npx vitest run tests/benchmarks/resolution/tracer/...`
standalone. Wire the env var through the workflow's tracer-validation
step.

Verified locally: gate test in artifact mode passes 35/35 in ~160ms
against an artifact produced by scripts/resolution-benchmark.ts (vs
~320ms for the standalone path that respawns the JS tracer subprocess).
The resolution-benchmark gate test still passes 170/170 with the new
`tracer` field present in the artifact.

Closes #1166
@claude
Copy link
Copy Markdown

claude Bot commented May 19, 2026

Claude finished @carlos-alm's task —— View job


PR Review: Tracer Validation Deduplication

Final recommendation completed.

Todo List:

  • Read and analyze the 3 changed files
  • Phase 1: Root Cause Analysis
  • Phase 2: Configuration Change Detection
  • Phase 3: Test Integrity Assessment
  • Phase 4: Code Quality Review
  • Phase 5: Backlog Compliance Rating
  • Provide final recommendation and rating

Root Cause Analysis

Problem Being Solved: The pre-publish benchmark job runs each language's tracer subprocess once for telemetry, but the tracer-validation step that follows runs the same subprocess again per fixture, doubling tracer cost in the publish pipeline.

Why This Approach: The solution mirrors the resolution-gate dedup approach from PR #1052. It extends the resolution benchmark script to include a tracer artifact with status and raw edges, allowing the validation test to reuse this output instead of respawning processes.

Risk Assessment: Low risk. This is a CI optimization that preserves existing behavior while reducing duplicate work. Fallback to direct tracer execution ensures standalone tests still work.

Configuration Change Detection

Assessment: ✅ NO HIGH-RISK CHANGES DETECTED

  • No linting rules disabled or relaxed
  • No test timeouts increased or coverage thresholds lowered
  • No build configuration or TypeScript strict mode changes
  • Only adds environment variable to CI workflow
  • Changes are purely additive optimization

Test Integrity Assessment

Assessment: ✅ NO RED FLAGS DETECTED

  • Test assertions remain unchanged - same recall thresholds and expectations
  • Preserves skip-on-toolchain-missing semantics (status: 'skipped' mirrors null return)
  • Fallback behavior ensures standalone vitest runs still work
  • No weakened assertions or conditional validation logic
  • Logic correctly distinguishes "tracer ran, found nothing" (ok + empty edges) from "toolchain missing" (skipped)

Code Quality Assessment

Assessment: ✅ GOOD IMPLEMENTATION QUALITY

Strengths:

  • Clean interface design: TracerArtifact with discriminated union status: 'ok' | 'skipped'
  • Proper error handling: tracer crashes → skipped, subprocess errors → skipped
  • Type safety: Well-defined interfaces for TracerEdge and ArtifactTracerEntry
  • Graceful fallbacks: artifact missing/malformed → use direct execution path
  • Performance optimization: ~50% CI time reduction for tracer validation step

Minor observations:

  • Consistent with existing codebase patterns (mirrors resolution-gate approach)
  • Appropriate timeout handling (60s matches existing)
  • Good separation of concerns: benchmark script emits, test consumes

No significant issues identified.

Backlog Compliance

  • Zero-dep: ✅ — No new runtime dependencies added
  • Foundation-aligned: ✅ — Pure CI optimization, no violation of core principles
  • Problem-fit: 3/5 — Useful CI efficiency improvement but doesn't directly address core "lost AI" problem
  • Breaking: No — Purely additive, preserves all existing behavior
  • Tier: 1 (Zero-dep ✓ + Foundation-aligned ✓)

Critical Concerns

None identified. This is a well-implemented CI optimization.

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: Clean implementation of a valuable CI optimization that reduces duplicate work without changing behavior. Well-designed interfaces, proper fallbacks, and follows established patterns. The problem-fit is moderate (CI efficiency vs. core AI agent value), but the implementation quality is high and risk is minimal.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 19, 2026

Greptile Summary

This PR eliminates duplicate tracer subprocess invocations in the publish CI pipeline by having scripts/resolution-benchmark.ts embed per-language tracer output (edges + a status discriminator) into resolution-result.json, which tracer-validation.test.ts then reads via RESOLUTION_RESULT_JSON instead of re-spawning run-tracer.mjs per fixture.

  • scripts/resolution-benchmark.ts: runDynamicTracer now returns a TracerArtifact ({ status: 'ok' | 'skipped', edges }) with semantics that map cleanly to the test's null-return convention for missing toolchains; the artifact is written to every LangResult entry in the output JSON.
  • tests/benchmarks/resolution/tracer/tracer-validation.test.ts: Adds an IIFE that parses the artifact at module-load time (with a helpful error on missing/malformed JSON from the previous review), then short-circuits runTracer to consume pre-computed edges when the artifact is present — falling back to the existing subprocess path when RESOLUTION_RESULT_JSON is unset.
  • .github/workflows/publish.yml: Wires RESOLUTION_RESULT_JSON into the "Run tracer validation" step, mirroring the same pattern already in place for the "Gate on resolution thresholds" step.

Confidence Score: 5/5

Safe to merge — the refactor correctly deduplicates tracer subprocess calls in CI without altering any observable test semantics or benchmark metrics.

All three changed files make narrowly scoped, well-tested changes. The TracerArtifact type is structurally identical to what the test's TracerEdge interface expects, so no type mismatch occurs at runtime. The ok/skipped status mapping faithfully mirrors the existing null-return convention in runTracer. Error handling for missing or malformed JSON is in place. The workflow path for RESOLUTION_RESULT_JSON is consistent with the file written by the preceding benchmark step. No correctness or data-loss risks were identified.

No files require special attention.

Important Files Changed

Filename Overview
scripts/resolution-benchmark.ts Introduces TracerArtifact type and refactors runDynamicTracer to return it; emits metrics.tracer into every language result. Logic for ok/skipped status is correct and backward-compatible with existing dynamicEdges/dynamicConfirmed metrics.
tests/benchmarks/resolution/tracer/tracer-validation.test.ts Adds artifact-mode path for runTracer guarded by RESOLUTION_RESULT_JSON; JSON parse is now wrapped in try/catch with a clear regeneration message. Skip semantics (null return) are faithfully preserved for status === 'skipped' entries.
.github/workflows/publish.yml Adds RESOLUTION_RESULT_JSON env var to the "Run tracer validation" step, pointing at the file written by the preceding resolution benchmark step. Path is consistent with the existing gate-test step.

Sequence Diagram

sequenceDiagram
    participant WF as GitHub Actions
    participant BM as resolution-benchmark.ts
    participant RJ as resolution-result.json
    participant TV as tracer-validation.test.ts
    participant RT as run-tracer.mjs

    WF->>BM: node scripts/resolution-benchmark.ts
    loop per language fixture
        BM->>RT: execFileSync (run-tracer.mjs)
        RT-->>BM: "{ edges, error? }"
        BM->>BM: "Build TracerArtifact {status, edges}"
        BM->>RJ: write LangResult incl. tracer field
    end
    BM-->>WF: resolution-result.json written

    WF->>TV: vitest run (RESOLUTION_RESULT_JSON set)
    TV->>RJ: readFileSync + JSON.parse (module load)
    loop per language
        TV->>TV: runTracer(lang)
        alt "artifact has status=ok"
            TV->>TV: return entry.edges (no subprocess)
        else "artifact missing or status=skipped"
            TV->>TV: "return null -> skip test"
        end
    end
    TV-->>WF: recall assertions complete
Loading

Reviews (7): Last reviewed commit: "fix: surface helpful error when RESOLUTI..." | Re-trigger Greptile

Comment on lines +68 to +69
return JSON.parse(fs.readFileSync(RESOLUTION_RESULT_JSON, 'utf-8'));
})();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 If RESOLUTION_RESULT_JSON points to a file that exists but contains invalid or truncated JSON (e.g., the benchmark step was killed mid-write and the redirect left a partial file), JSON.parse throws a raw SyntaxError during module load. Every test in the suite then fails with an opaque parse error instead of a useful message explaining which file to regenerate.

Suggested change
return JSON.parse(fs.readFileSync(RESOLUTION_RESULT_JSON, 'utf-8'));
})();
try {
return JSON.parse(fs.readFileSync(RESOLUTION_RESULT_JSON, 'utf-8'));
} catch (err) {
throw new Error(
`RESOLUTION_RESULT_JSON=${RESOLUTION_RESULT_JSON} is not valid JSON — regenerate it with scripts/resolution-benchmark.ts. (${err})`,
);
}
})();

Fix in Claude Code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in c2c5889 — wrapped the JSON.parse in a try/catch that throws a helpful error pointing the developer at scripts/resolution-benchmark.ts to regenerate the artifact, instead of letting the raw SyntaxError surface during module load. Verified manually with a malformed JSON file.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 19, 2026

Codegraph Impact Analysis

4 functions changed1 callers affected across 1 files

  • TracerArtifact.status in scripts/resolution-benchmark.ts:68 (0 transitive callers)
  • TracerArtifact.edges in scripts/resolution-benchmark.ts:69 (0 transitive callers)
  • LangResult.tracer in scripts/resolution-benchmark.ts:87 (0 transitive callers)
  • runDynamicTracer in scripts/resolution-benchmark.ts:189 (1 transitive callers)

@carlos-alm
Copy link
Copy Markdown
Contributor Author

@greptileai

@carlos-alm carlos-alm closed this May 20, 2026
@github-actions github-actions Bot locked and limited conversation to collaborators May 20, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

follow-up: deduplicate tracer validation work between script and gate test

1 participant