fix(bench): discard warmup runs in query benchmark median #1077
carlos-alm merged 2 commits into main
Conversation
The native fn_deps NAPI path pays a cold-start cost on the first 2-3 calls per worker process: rusqlite statement-cache warmup, OS page cache fill for the DB file, and NAPI-side static init from tree-sitter 0.25's transitive crates (indexmap, hashbrown, equivalent) linked into the `.node` binary as of #1054. The benchmark's `median(5)` sample then lands inside cold-start territory rather than the warm steady state the regression gate is supposed to track. On Linux x86_64 CI that surfaced as `fnDeps depth 1 native: 28.7 → 48.6 (+69%)` once #1054 enlarged the binary's init footprint, even though warm per-call latency is unchanged from v3.9.6.

Fix: run two warmup iterations per `(depth, fn)` before timing so all 5 sampled timings are warm.

Verified on Linux x86_64 (Docker, qemu): native d1 178.2 ms → 119.6 ms, now matching the warm d3/d5 plateau (122/113 ms). The wasm path is unaffected by the change, since wasm only ever paid cold-start on its first iteration.

Closes #1076
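The warmup-then-measure pattern described above can be sketched as follows. This is an illustrative shape only: `benchDepths`, `WARMUP_RUNS`, and `RUNS` mirror names from the PR's flowchart, but the function signature and everything else here is assumed, not the real `scripts/query-benchmark.ts`.

```typescript
// Sketch of the warmup-discard benchmark pattern (assumed shape, not the
// real script). Cold-start costs are paid during the warmup loop, so the
// median is computed over warm samples only.
import { performance } from "node:perf_hooks";

const WARMUP_RUNS = 3; // cold-start can span the first 2-3 calls
const RUNS = 5;

function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

function benchDepths(
  fn: (depth: number) => unknown,
  depths: number[],
): Map<number, number> {
  const result = new Map<number, number>();
  for (const depth of depths) {
    // Warmup: pay rusqlite/NAPI/page-cache init here; results discarded.
    for (let i = 0; i < WARMUP_RUNS; i++) fn(depth);
    // Timed runs: all samples are now warm steady-state.
    const timings: number[] = [];
    for (let i = 0; i < RUNS; i++) {
      const start = performance.now();
      fn(depth);
      timings.push(performance.now() - start);
    }
    result.set(depth, median(timings));
  }
  return result;
}
```

Keeping the warmup inside the per-depth loop (rather than once per process) is the conservative choice: it also absorbs any depth-specific first-call costs such as statement-cache misses.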
All contributors have signed the CLA ✍️ ✅
Claude finished @carlos-alm's task — PR Review: Benchmark Warmup Fix
**Root Cause Analysis**

Problem Being Solved: Native `fn_deps` calls pay a one-time cold-start cost, so `median(5)` was sampling cold-start timings instead of warm steady state.

Why This Approach: Adding 2 warmup runs ensures all 5 timing samples reflect steady-state warm performance rather than cold-start overhead. This is the correct approach - the benchmark is meant to measure query performance, not initialization costs. The regression was masking actual performance by conflating one-time startup costs with per-call latency.

Risk Assessment: ✅ LOW RISK. This fixes a measurement methodology flaw rather than hiding a real performance problem. The verification data shows the fix correctly isolates warm-call latency (d1 ≈ d3 ≈ d5 plateau).

**Configuration Changes**

✅ No problematic configuration relaxation detected. The only change is adding warmup methodology to benchmark measurement - this improves measurement accuracy rather than hiding issues.

**Test Integrity**

✅ No test weakening detected. This change improves the quality of performance measurement by eliminating cold-start noise from the metric. The benchmark will now correctly track performance regression in the actual query code rather than initialization overhead.

**Code Quality Review**

Positive aspects:

Technical correctness: No issues identified - this is a well-implemented fix.

**Backlog Compliance**

**Critical Concerns**

None identified. This is a methodologically sound fix that addresses a real measurement issue.

**Final Assessment**

Recommendation: ✅ APPROVE - This is a solid technical fix that improves benchmark accuracy.
Greptile Summary

Adds 3 warmup iterations per depth in `scripts/query-benchmark.ts`.
Confidence Score: 5/5

Safe to merge — a small, isolated addition of warmup iterations to a benchmark script with no effect on production code paths. The change adds a named constant and a three-iteration warmup loop inside a single benchmark helper. It does not touch any production logic, modifies no data, and the constant value (3) is now aligned with the worst-case cold-start window described in the comment. The benchDiffImpact path is unaffected because it runs after all fnDeps/fnImpact rounds have already warmed the native module. No files require special attention.

Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
  A[benchDepths called\nwith fn, name, depths] --> B[for each depth]
  B --> C[Warmup loop\ni = 0 .. WARMUP_RUNS-1\nfn called, result discarded]
  C --> D[Timing loop\ni = 0 .. RUNS-1\nperformance.now start]
  D --> E[fn called\ntiming pushed]
  E --> F{i < RUNS?}
  F -- yes --> D
  F -- no --> G[median of timings\nstored in result]
  G --> H{more depths?}
  H -- yes --> B
  H -- no --> I[return result]
  style C fill:#f9c74f,stroke:#f3722c
  style D fill:#90be6d,stroke:#43aa8b
```
Reviews (2): Last reviewed commit: "fix(bench): bump WARMUP_RUNS from 2 to 3..."
```ts
// tree-sitter 0.25 grew the binary's init footprint (#1076), even though
// steady-state per-call latency is unchanged. Discard the first WARMUP_RUNS
// before timing so the metric reflects warm-call latency, not cold-start.
const WARMUP_RUNS = 2;
```
The comment on line 111 says cold-start can span "2-3" calls, but
WARMUP_RUNS is only 2. If a particular CI worker pays a 3-call cold start (e.g. first-run OS page-cache miss on top of rusqlite and NAPI init), the first timed sample for fnDeps depth1 is still a cold measurement, and the empirically observed flakiness could resurface. The verification data is convincing, but bumping to 3 would align the constant with the worst case stated in the comment itself.
Suggested change:

```diff
-const WARMUP_RUNS = 2;
+const WARMUP_RUNS = 3;
```
Fixed in 0c7d42d — bumped WARMUP_RUNS to 3 to match the worst-case described in the comment. Good catch.
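The cold-sample arithmetic behind this thread can be made concrete with a toy check. The timings below are made up (warm ≈ 29 ms, cold ≈ 48 ms, loosely echoing the CI numbers) — they are illustrative, not the PR's measurements.

```typescript
// Illustrative: how many cold samples among 5 timed runs it takes to drag
// median(5) up. Timings are invented for the demo.
function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

const warm = 29;
const cold = 48;

// First nCold of 5 samples are cold, the rest warm.
const sample = (nCold: number): number[] =>
  [...Array(5)].map((_, i) => (i < nCold ? cold : warm));

// 0 warmups, 3-call cold start: 3 of 5 samples cold -> median is cold.
console.log(median(sample(3))); // 48
// 2 warmups, 3-call cold start: 1 cold sample remains -> median is already
// warm, but a cold outlier still sits in the sample set.
console.log(median(sample(1))); // 29
// 3 warmups: every timed sample is warm.
console.log(median(sample(0))); // 29
```

With 5 samples the median only moves once 3 or more are cold, which is why the original regression showed up at all (a 3-call cold start poisoned the median) and why bumping `WARMUP_RUNS` to 3 matches the stated worst case.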
Codegraph Impact Analysis: 1 function changed → 1 caller affected across 1 file
Summary
- `fn_deps` pays a cold-start cost on the first 2–3 calls per worker process (rusqlite statement-cache warmup, OS page-cache fill, NAPI-side static init from tree-sitter 0.25's transitive crates linked into the `.node` binary as of "perf(native): ~2 second flat overhead added to rebuild operations in v3.10.0" #1054). On Linux x86_64 CI that pulled `median(5)` into cold-start territory and surfaced as `fnDeps depth 1: 28.7 → 48.6 (+69%)`. Steady-state warm per-call latency is unchanged from v3.9.6.
- Run warmup iterations per `(depth, fn)` before timing in `scripts/query-benchmark.ts` so all 5 sampled timings reflect warm-call latency, which is what the regression gate is meant to track.

Verification
Linux x86_64 (Docker, qemu) on `main` HEAD: after the fix, d1 ≈ d3 ≈ d5, matching the warm plateau. Wasm path is unaffected (it only ever paid cold-start on iteration 1; the median already excluded it).
Mac arm64 cross-check: native d1 = 26.1 ms (unchanged — Mac was always sampling warm).
Test plan

- `fnDeps d1/d3/d5` native values land within ~5% of the v3.9.6 baseline (28.7 / 29.1 / 33.2)

Closes #1076