
fix(bench): discard warmup runs in query benchmark median#1077

Merged
carlos-alm merged 2 commits into main from fix/1076-fndeps-regression
May 7, 2026

Conversation

@carlos-alm
Contributor

Summary

  • Root cause: native fn_deps pays a cold-start cost on the first 2–3 calls per worker process (rusqlite statement-cache warmup, OS page-cache fill, and NAPI-side static init from tree-sitter 0.25's transitive crates linked into the .node binary in #1054, "perf(native): ~2 second flat overhead added to rebuild operations in v3.10.0"). On Linux x86_64 CI that pulled median(5) into cold-start territory and surfaced as fnDeps depth 1: 28.7 → 48.6 ms (+69%). Steady-state warm per-call latency is unchanged from v3.9.6.
  • Fix: run two warmup iterations per (depth, fn) before timing in scripts/query-benchmark.ts so all 5 sampled timings reflect warm-call latency, which is what the regression gate is meant to track.
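The warmup-before-timing approach can be sketched as follows. This is a minimal sketch, not the actual scripts/query-benchmark.ts source: the helper name `benchDepth`, the simplified `fn(depth)` signature, and the `median` helper are assumptions for illustration.

```typescript
import { performance } from "node:perf_hooks";

const WARMUP_RUNS = 2; // cold-start calls to discard before timing
const RUNS = 5;        // timed samples per (depth, fn)

function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  return s[Math.floor(s.length / 2)];
}

// Hypothetical shape of the benchmark helper; fn is the query under test.
function benchDepth(fn: (depth: number) => void, depth: number): number {
  // Warmup: call fn, discard both result and timing.
  for (let i = 0; i < WARMUP_RUNS; i++) fn(depth);
  // Timed runs: every sample now reflects warm-call latency.
  const timings: number[] = [];
  for (let i = 0; i < RUNS; i++) {
    const t0 = performance.now();
    fn(depth);
    timings.push(performance.now() - t0);
  }
  return median(timings);
}
```

The key design point is that the warmup loop sits inside the per-(depth, fn) iteration, so each benchmarked function pays its own cold-start before its own timed samples begin.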

Verification

Linux x86_64 (Docker, qemu) on main HEAD:

| metric | before | after |
| --- | --- | --- |
| native fnDeps d1 | 178.2 ms | 119.6 ms |
| native fnDeps d3 | 124.9 ms | 122.0 ms |
| native fnDeps d5 | 118.5 ms | 113.1 ms |

After the fix, d1 ≈ d3 ≈ d5, matching the warm plateau. Wasm path is unaffected (only ever paid cold-start on iteration 1; median already excluded it).

Mac arm64 cross-check: native d1 = 26.1 ms (unchanged — Mac was always sampling warm).
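As a sanity check on why cold samples corrupt median(5): the median of five samples is the third sorted value, so with 2–3 cold calls among the five, the reported number is a cold-call time. A small illustration (the sample values are made up for the example, not taken from the benchmark):

```typescript
// Median of an array: middle element after sorting.
function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  return s[Math.floor(s.length / 2)];
}

// Three cold calls out of five: the median lands on a cold sample.
const coldHeavy = [48.6, 47.9, 49.1, 28.7, 29.0];
console.log(median(coldHeavy)); // 47.9 — cold-start latency reported

// After two warmup runs, all five samples are warm.
const warm = [28.7, 29.0, 28.9, 29.3, 28.8];
console.log(median(warm)); // 28.9 — warm latency reported
```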

Test plan

  • Pre-publish benchmark gate passes on the next dev publish
  • fnDeps d1/d3/d5 native values land within ~5% of the v3.9.6 baseline (28.7 / 29.1 / 33.2)

Closes #1076

The native fn_deps NAPI path pays a cold-start cost on the first 2-3
calls per worker process — rusqlite statement-cache warmup, OS page
cache fill for the DB file, and NAPI-side static init from
tree-sitter 0.25's transitive crates (indexmap, hashbrown, equivalent)
linked into the .node binary at #1054.

The benchmark's median(5) sample then lands inside cold-start
territory rather than warm steady-state, which is what the regression
gate is supposed to track. On Linux x86_64 CI that surfaced as
fnDeps depth 1 native: 28.7 → 48.6 (+69%) once #1054 enlarged the
binary's init footprint, even though warm per-call latency is
unchanged from v3.9.6.

Run two warmup iterations per (depth, fn) before timing so all 5
sampled timings are warm. Verified on Linux x86_64 (Docker, qemu):
native d1 178.2ms → 119.6ms, now matching the warm d3/d5 plateau
(122/113ms). Wasm path is unaffected by the change since wasm only
ever paid cold-start on its first iteration.

Closes #1076
@github-actions
Contributor

github-actions Bot commented May 7, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@claude

claude Bot commented May 7, 2026

Claude finished @carlos-alm's task


PR Review: Benchmark Warmup Fix

Analyzing this PR, which addresses a benchmark regression by adding warmup runs, and conducting a thorough review following the established criteria.

Todo List

  • Read and analyze the changed file (scripts/query-benchmark.ts)
  • Phase 1: Root Cause Analysis
  • Phase 2: Configuration Change Detection
  • Phase 3: Test Integrity Assessment
  • Phase 4: Code Quality Review
  • Phase 5: Backlog Compliance Evaluation
  • Provide Final Assessment and Rating

Root Cause Analysis

Problem Being Solved: Native fn_deps benchmark regression where cold-start costs (rusqlite statement-cache warmup, OS page-cache fill, NAPI static init from tree-sitter 0.25) inflated median(5) timing from 28.7ms to 48.6ms (+69%) on Linux x86_64 CI.

Why This Approach: Adding 2 warmup runs ensures all 5 timing samples reflect steady-state warm performance rather than cold-start overhead. This is the correct approach - the benchmark is meant to measure query performance, not initialization costs. The regression was masking actual performance by conflating one-time startup costs with per-call latency.

Risk Assessment: ✅ LOW RISK. This fixes a measurement methodology flaw rather than hiding a real performance problem. The verification data shows the fix correctly isolates warm-call latency (d1≈d3≈d5 plateau).

Configuration Changes

No problematic configuration relaxation detected. The only change is adding warmup methodology to benchmark measurement - this improves measurement accuracy rather than hiding issues.

Test Integrity

No test weakening detected. This change improves the quality of performance measurement by eliminating cold-start noise from the metric. The benchmark will now correctly track performance regression in the actual query code rather than initialization overhead.

Code Quality Review

Positive aspects:

  • Clear, comprehensive documentation explaining the root cause (lines 111-118)
  • Appropriate warmup count (WARMUP_RUNS = 2) based on analysis that cold-start affects "first 2-3 calls"
  • Minimal, surgical change that only affects the measurement methodology
  • Consistent application to both fnDeps and fnImpact benchmarks via benchDepths()

Technical correctness:

  • Warmup runs use identical parameters to timing runs (fn(name, dbPath, { depth, noTests: true }))
  • No return value used from warmup runs (correctly discarded)
  • Maintains existing RUNS = 5 for statistical validity

No issues identified - this is a well-implemented fix.

Backlog Compliance

  • Zero-dep: ✅ No new runtime dependencies added
  • Foundation-aligned: ✅ Fully compliant - improves measurement accuracy for local tool
  • Problem-fit: 3/5 - Useful fix but tangential to core "AI agents navigating codebases" problem
  • Breaking: No - Purely internal benchmark methodology improvement
  • Tier: 1 (zero-dep ✓ + foundation-aligned ✓)

Critical Concerns

None identified. This is a methodologically sound fix that addresses a real measurement issue.

Final Assessment

## Root Cause Analysis
**Problem Being Solved**: Cold-start costs in native engine corrupting benchmark median calculation
**Why This Approach**: Isolates warm-call latency measurement from one-time initialization overhead  
**Risk Assessment**: Low risk - improves measurement accuracy without masking real issues

## Backlog Compliance  
- Zero-dep: ✓ - No new dependencies
- Foundation-aligned: ✓ - Improves local tool quality
- Problem-fit: 3/5 - Useful but tangential to AI navigation problem
- Breaking: No - Internal methodology improvement
- Tier: 1

## Critical Concerns
None - methodologically sound fix with clear verification data.

## Final Recommendation
- Rating: ⭐⭐⭐⭐☆ (4/5)
- Action: APPROVE  
- Reasoning: Well-implemented fix that addresses a real measurement flaw. Clear documentation, appropriate technical approach, and solid verification data. Only loses one star due to moderate problem-fit score.

Recommendation: APPROVE - This is a solid technical fix that improves benchmark accuracy.


@greptile-apps
Contributor

greptile-apps Bot commented May 7, 2026

Greptile Summary

Adds 3 warmup iterations per depth in benchDepths before the 5 timed samples, so the median reflects warm-call latency rather than cold-start overhead introduced by tree-sitter 0.25's larger NAPI init footprint.

  • WARMUP_RUNS = 3 is added as a named constant with a detailed comment explaining the rusqlite/NAPI cold-start sources, and 3 is now consistent with the "2–3 calls" worst-case range described in the comment (previously 2).
  • Warmup only touches benchDepths; benchDiffImpact is called after all fnDeps/fnImpact rounds, so the native module is already warm by the time it runs — no change needed there.

Confidence Score: 5/5

Safe to merge — a small, isolated addition of warmup iterations to a benchmark script with no effect on production code paths.

The change adds a named constant and a three-iteration warmup loop inside a single benchmark helper. It does not touch any production logic, modifies no data, and the constant value (3) is now aligned with the worst-case cold-start window described in the comment. The benchDiffImpact path is unaffected because it runs after all fnDeps/fnImpact rounds have already warmed the native module.

No files require special attention.

Important Files Changed

Filename Overview
scripts/query-benchmark.ts Adds WARMUP_RUNS=3 constant and 3 pre-timing warmup iterations inside benchDepths to discard cold-start latency before the 5 measured runs; logic is correct and well-commented.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[benchDepths called\nwith fn, name, depths] --> B[for each depth]
    B --> C[Warmup loop\ni = 0 .. WARMUP_RUNS-1\nfn called, result discarded]
    C --> D[Timing loop\ni = 0 .. RUNS-1\nperformance.now start]
    D --> E[fn called\ntiming pushed]
    E --> F{i < RUNS?}
    F -- yes --> D
    F -- no --> G[median of timings\nstored in result]
    G --> H{more depths?}
    H -- yes --> B
    H -- no --> I[return result]

    style C fill:#f9c74f,stroke:#f3722c
    style D fill:#90be6d,stroke:#43aa8b

Reviews (2): Last reviewed commit: "fix(bench): bump WARMUP_RUNS from 2 to 3..."

Comment thread on scripts/query-benchmark.ts (Outdated)
// tree-sitter 0.25 grew the binary's init footprint (#1076), even though
// steady-state per-call latency is unchanged. Discard the first WARMUP_RUNS
// before timing so the metric reflects warm-call latency, not cold-start.
const WARMUP_RUNS = 2;
Contributor


P2 The comment on line 111 says cold-start can span "2-3" calls, but WARMUP_RUNS is only 2. If a particular CI worker pays a 3-call cold start (e.g. first-run OS page-cache miss on top of rusqlite and NAPI init), the first timed sample for fnDeps depth1 is still a cold measurement, and the empirically observed flakiness could resurface. The verification data is convincing, but bumping to 3 would align the constant with the worst case stated in the comment itself.

Suggested change
const WARMUP_RUNS = 2;
const WARMUP_RUNS = 3;


Contributor Author


Fixed in 0c7d42d — bumped WARMUP_RUNS to 3 to match the worst-case described in the comment. Good catch.

@github-actions
Contributor

github-actions Bot commented May 7, 2026

Codegraph Impact Analysis

1 function changed, 1 caller affected, across 1 file

  • benchDepths in scripts/query-benchmark.ts:181 (1 transitive callers)

@carlos-alm
Contributor Author

@greptileai

@carlos-alm carlos-alm merged commit 5d93d25 into main May 7, 2026
25 of 26 checks passed
@carlos-alm carlos-alm deleted the fix/1076-fndeps-regression branch May 7, 2026 08:24
@github-actions github-actions Bot locked and limited conversation to collaborators May 7, 2026


Development

Successfully merging this pull request may close these issues.

query benchmark: fnDeps depth 1 regresses ~70% (28.7ms → 48.6ms) on publish gate
