
fix(bench): discard warmup runs in query benchmark median#1077

Merged
carlos-alm merged 2 commits into main from fix/1076-fndeps-regression
May 7, 2026

Conversation

@carlos-alm
Contributor

Summary

  • Root cause: native fn_deps pays a cold-start cost on the first 2–3 calls per worker process (rusqlite statement-cache warmup, OS page-cache fill, and NAPI-side static init from tree-sitter 0.25's transitive crates linked into the .node binary in #1054, "perf(native): ~2 second flat overhead added to rebuild operations in v3.10.0"). On Linux x86_64 CI that pulled median(5) into cold-start territory and surfaced as fnDeps depth 1: 28.7 → 48.6 ms (+69%). Steady-state warm per-call latency is unchanged from v3.9.6.
  • Fix: run two warmup iterations per (depth, fn) before timing in scripts/query-benchmark.ts so all 5 sampled timings reflect warm-call latency, which is what the regression gate is meant to track.
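The warmup-before-timing approach can be sketched as follows. This is a minimal sketch, not the actual scripts/query-benchmark.ts source: the helper name `benchDepth`, the simplified `fn(depth)` signature, and the `median` helper are assumptions for illustration.

```typescript
import { performance } from "node:perf_hooks";

const WARMUP_RUNS = 2; // cold-start calls to discard before timing
const RUNS = 5;        // timed samples per (depth, fn)

function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  return s[Math.floor(s.length / 2)];
}

// Hypothetical shape of the benchmark helper; fn is the query under test.
function benchDepth(fn: (depth: number) => void, depth: number): number {
  // Warmup: call fn, discard both result and timing.
  for (let i = 0; i < WARMUP_RUNS; i++) fn(depth);
  // Timed runs: every sample now reflects warm-call latency.
  const timings: number[] = [];
  for (let i = 0; i < RUNS; i++) {
    const t0 = performance.now();
    fn(depth);
    timings.push(performance.now() - t0);
  }
  return median(timings);
}
```

The key design point is that the warmup loop sits inside the per-(depth, fn) iteration, so each benchmarked function pays its own cold-start before its own timed samples begin.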

Verification

Linux x86_64 (Docker, qemu) on main HEAD:

| metric | before | after |
| --- | --- | --- |
| native fnDeps d1 | 178.2 ms | 119.6 ms |
| native fnDeps d3 | 124.9 ms | 122.0 ms |
| native fnDeps d5 | 118.5 ms | 113.1 ms |

After the fix, d1 ≈ d3 ≈ d5, matching the warm plateau. Wasm path is unaffected (only ever paid cold-start on iteration 1; median already excluded it).

Mac arm64 cross-check: native d1 = 26.1 ms (unchanged — Mac was always sampling warm).
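As a sanity check on why cold samples corrupt median(5): the median of five samples is the third sorted value, so with 2–3 cold calls among the five, the reported number is a cold-call time. A small illustration (the sample values are made up for the example, not taken from the benchmark):

```typescript
// Median of an array: middle element after sorting.
function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  return s[Math.floor(s.length / 2)];
}

// Three cold calls out of five: the median lands on a cold sample.
const coldHeavy = [48.6, 47.9, 49.1, 28.7, 29.0];
console.log(median(coldHeavy)); // 47.9 — cold-start latency reported

// After two warmup runs, all five samples are warm.
const warm = [28.7, 29.0, 28.9, 29.3, 28.8];
console.log(median(warm)); // 28.9 — warm latency reported
```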

Test plan

  • Pre-publish benchmark gate passes on the next dev publish
  • fnDeps d1/d3/d5 native values land within ~5% of the v3.9.6 baseline (28.7 / 29.1 / 33.2)

Closes #1076

The native fn_deps NAPI path pays a cold-start cost on the first 2-3
calls per worker process — rusqlite statement-cache warmup, OS page
cache fill for the DB file, and NAPI-side static init from
tree-sitter 0.25's transitive crates (indexmap, hashbrown, equivalent)
linked into the .node binary at #1054.

The benchmark's median(5) sample then lands inside cold-start
territory rather than warm steady-state, which is what the regression
gate is supposed to track. On Linux x86_64 CI that surfaced as
fnDeps depth 1 native: 28.7 → 48.6 (+69%) once #1054 enlarged the
binary's init footprint, even though warm per-call latency is
unchanged from v3.9.6.

Run two warmup iterations per (depth, fn) before timing so all 5
sampled timings are warm. Verified on Linux x86_64 (Docker, qemu):
native d1 178.2ms → 119.6ms, now matching the warm d3/d5 plateau
(122/113ms). Wasm path is unaffected by the change since wasm only
ever paid cold-start on its first iteration.

Closes #1076
@github-actions
Contributor

github-actions Bot commented May 7, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@claude

claude Bot commented May 7, 2026

Claude finished @carlos-alm's task


PR Review: Benchmark Warmup Fix

Analyzing this PR, which addresses a benchmark regression by adding warmup runs, and conducting a thorough review following the established criteria.

Todo List

  • Read and analyze the changed file (scripts/query-benchmark.ts)
  • Phase 1: Root Cause Analysis
  • Phase 2: Configuration Change Detection
  • Phase 3: Test Integrity Assessment
  • Phase 4: Code Quality Review
  • Phase 5: Backlog Compliance Evaluation
  • Provide Final Assessment and Rating

Root Cause Analysis

Problem Being Solved: Native fn_deps benchmark regression where cold-start costs (rusqlite statement-cache warmup, OS page-cache fill, NAPI static init from tree-sitter 0.25) inflated median(5) timing from 28.7ms to 48.6ms (+69%) on Linux x86_64 CI.

Why This Approach: Adding 2 warmup runs ensures all 5 timing samples reflect steady-state warm performance rather than cold-start overhead. This is the correct approach - the benchmark is meant to measure query performance, not initialization costs. The regression was masking actual performance by conflating one-time startup costs with per-call latency.

Risk Assessment: ✅ LOW RISK. This fixes a measurement methodology flaw rather than hiding a real performance problem. The verification data shows the fix correctly isolates warm-call latency (d1≈d3≈d5 plateau).

Configuration Changes

No problematic configuration relaxation detected. The only change is adding warmup methodology to benchmark measurement - this improves measurement accuracy rather than hiding issues.

Test Integrity

No test weakening detected. This change improves the quality of performance measurement by eliminating cold-start noise from the metric. The benchmark will now correctly track performance regression in the actual query code rather than initialization overhead.

Code Quality Review

Positive aspects:

  • Clear, comprehensive documentation explaining the root cause (lines 111-118)
  • Appropriate warmup count (WARMUP_RUNS = 2) based on analysis that cold-start affects "first 2-3 calls"
  • Minimal, surgical change that only affects the measurement methodology
  • Consistent application to both fnDeps and fnImpact benchmarks via benchDepths()

Technical correctness:

  • Warmup runs use identical parameters to timing runs (fn(name, dbPath, { depth, noTests: true }))
  • No return value used from warmup runs (correctly discarded)
  • Maintains existing RUNS = 5 for statistical validity

No issues identified - this is a well-implemented fix.

Backlog Compliance

  • Zero-dep: ✅ No new runtime dependencies added
  • Foundation-aligned: ✅ Fully compliant - improves measurement accuracy for local tool
  • Problem-fit: 3/5 - Useful fix but tangential to core "AI agents navigating codebases" problem
  • Breaking: No - Purely internal benchmark methodology improvement
  • Tier: 1 (zero-dep ✓ + foundation-aligned ✓)

Critical Concerns

None identified. This is a methodologically sound fix that addresses a real measurement issue.

Final Assessment

## Root Cause Analysis
**Problem Being Solved**: Cold-start costs in native engine corrupting benchmark median calculation
**Why This Approach**: Isolates warm-call latency measurement from one-time initialization overhead  
**Risk Assessment**: Low risk - improves measurement accuracy without masking real issues

## Backlog Compliance  
- Zero-dep: ✓ - No new dependencies
- Foundation-aligned: ✓ - Improves local tool quality
- Problem-fit: 3/5 - Useful but tangential to AI navigation problem
- Breaking: No - Internal methodology improvement
- Tier: 1

## Critical Concerns
None - methodologically sound fix with clear verification data.

## Final Recommendation
- Rating: ⭐⭐⭐⭐☆ (4/5)
- Action: APPROVE  
- Reasoning: Well-implemented fix that addresses a real measurement flaw. Clear documentation, appropriate technical approach, and solid verification data. Only loses one star due to moderate problem-fit score.

Recommendation: APPROVE - This is a solid technical fix that improves benchmark accuracy.


@greptile-apps
Contributor

greptile-apps Bot commented May 7, 2026

Greptile Summary

Adds 3 warmup iterations per depth in benchDepths before the 5 timed samples, so the median reflects warm-call latency rather than cold-start overhead introduced by tree-sitter 0.25's larger NAPI init footprint.

  • WARMUP_RUNS = 3 is added as a named constant with a detailed comment explaining the rusqlite/NAPI cold-start sources, and 3 is now consistent with the "2–3 calls" worst-case range described in the comment (previously 2).
  • Warmup only touches benchDepths; benchDiffImpact is called after all fnDeps/fnImpact rounds, so the native module is already warm by the time it runs — no change needed there.

Confidence Score: 5/5

Safe to merge — a small, isolated addition of warmup iterations to a benchmark script with no effect on production code paths.

The change adds a named constant and a three-iteration warmup loop inside a single benchmark helper. It does not touch any production logic, modifies no data, and the constant value (3) is now aligned with the worst-case cold-start window described in the comment. The benchDiffImpact path is unaffected because it runs after all fnDeps/fnImpact rounds have already warmed the native module.

No files require special attention.

Important Files Changed

Filename Overview
scripts/query-benchmark.ts Adds WARMUP_RUNS=3 constant and 3 pre-timing warmup iterations inside benchDepths to discard cold-start latency before the 5 measured runs; logic is correct and well-commented.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[benchDepths called\nwith fn, name, depths] --> B[for each depth]
    B --> C[Warmup loop\ni = 0 .. WARMUP_RUNS-1\nfn called, result discarded]
    C --> D[Timing loop\ni = 0 .. RUNS-1\nperformance.now start]
    D --> E[fn called\ntiming pushed]
    E --> F{i < RUNS?}
    F -- yes --> D
    F -- no --> G[median of timings\nstored in result]
    G --> H{more depths?}
    H -- yes --> B
    H -- no --> I[return result]

    style C fill:#f9c74f,stroke:#f3722c
    style D fill:#90be6d,stroke:#43aa8b

Reviews (2): Last reviewed commit: "fix(bench): bump WARMUP_RUNS from 2 to 3..."

Comment thread on scripts/query-benchmark.ts (Outdated)
// tree-sitter 0.25 grew the binary's init footprint (#1076), even though
// steady-state per-call latency is unchanged. Discard the first WARMUP_RUNS
// before timing so the metric reflects warm-call latency, not cold-start.
const WARMUP_RUNS = 2;
Contributor


P2 The comment on line 111 says cold-start can span "2-3" calls, but WARMUP_RUNS is only 2. If a particular CI worker pays a 3-call cold start (e.g. first-run OS page-cache miss on top of rusqlite and NAPI init), the first timed sample for fnDeps depth1 is still a cold measurement, and the empirically observed flakiness could resurface. The verification data is convincing, but bumping to 3 would align the constant with the worst case stated in the comment itself.

Suggested change
const WARMUP_RUNS = 2;
const WARMUP_RUNS = 3;


Contributor Author


Fixed in 0c7d42d — bumped WARMUP_RUNS to 3 to match the worst-case described in the comment. Good catch.

@github-actions
Contributor

github-actions Bot commented May 7, 2026

Codegraph Impact Analysis

1 function changed, 1 caller affected, across 1 file

  • benchDepths in scripts/query-benchmark.ts:181 (1 transitive callers)

@carlos-alm
Contributor Author

@greptileai

@carlos-alm carlos-alm merged commit 5d93d25 into main May 7, 2026
25 of 26 checks passed
@carlos-alm carlos-alm deleted the fix/1076-fndeps-regression branch May 7, 2026 08:24
@github-actions github-actions Bot locked and limited conversation to collaborators May 7, 2026


Development

Successfully merging this pull request may close these issues.

query benchmark: fnDeps depth 1 regresses ~70% (28.7ms → 48.6ms) on publish gate
