From d6385accc2997cb76d1981541e8a4a1f48059203 Mon Sep 17 00:00:00 2001
From: Tolga Ergin <tolgaergin@gmail.com>
Date: Wed, 29 Apr 2026 21:20:44 +0100
Subject: [PATCH] bench: fix bun.lock wipe + round-robin lpm vs bun + honest
 README numbers
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two related fixes after an audit of the previous README's bench numbers
(Phase 60.1 cycle, commit 85e0743) found that bun was getting an
unfair advantage in two ways:

1. bun.lockb-only wipe in `bench/run.sh`. The `bench_cold_install` and
   `bench_cold_install_clean` setups wiped the legacy `bun.lockb`
   binary lockfile but NOT the modern `bun.lock` text format that bun
   has emitted by default since 1.0. After iter 1, bun reused the
   lockfile across iters → the median measured "warm-lockfile cold-
   cache" instead of the intended "fully cold" install. A/B verified:
   wiping `bun.lockb` only gave bun median 549ms on `bench/fixture-large`;
   wiping both gave 842ms on a cold network.

2. Sequential per-arm structure favors whichever arm runs last.
   `bench/run.sh` runs npm → pnpm → bun → lpm sequentially per RUN
   loop. By the time bun (3rd) runs, npm + pnpm have warmed the
   local DNS / TCP / CDN edge — bun gets ~200-300ms of "free"
   network warmth that lpm-vs-bun comparisons silently inherit.
   Replicated: bun median 581ms with npm/pnpm preludes vs 842ms
   without them on the same machine, same iter count.

Fixes:

- `bench/run.sh`: add `bun.lock` to the wipe list in
  `bench_cold_install` and `bench_cold_install_clean`. Doc-comment
  spells out the lockfile-bias rationale.

- New `bench/scripts/run-readme.sh`: round-robin lpm + bun harness
  for the README install rows. lpm + bun run in 2-arm strict
  alternating order per outer iter (iter 1: lpm/bun, iter 2: bun/lpm,
  ...) so each arm visits position-1 (cold) and position-2 (warm-
  after-other) equally often. npm + pnpm run sequentially afterward
  — their multi-second installs swamp any 200ms warmth bias.

  CRITICAL: lpm + bun run BEFORE npm + pnpm. Running npm/pnpm first
  warms not just the OS state but also the npm CDN edge — biasing
  bun's median from ~870ms (cold CDN) to ~580ms (warm CDN). Order
  matters; the comment in the script explains.

Updated README.md install rows with the honest n=11 numbers from
the round-robin harness:

  Cold install, equal footing:  npm 7912 / pnpm 1546 / bun 1005 / lpm 962
  Cold install, full wipe loop: npm 8538 / pnpm 2376 / bun 1469 / lpm 1867

Equal-footing row: lpm 0.96× bun (lpm slightly faster — within noise).
Full-wipe row: lpm 1.27× bun (lpm wipes 2 paths, bun 1; the `rm -rf`
asymmetry charged to lpm is the documented gap, see footnote ²).

The previous README's 1.70× and 1.36× ratios on these rows were
inflated by the two biases above. The new numbers are reproducible
via `./bench/scripts/run-readme.sh 11`. Reference baseline: bench/
scripts W4 (Phase 56, 2026-04-27) reported greedy-fusion 938 vs
bun 804 → 1.17×. Today's 0.96× is consistent within run-to-run
network variance.

Warm / up-to-date / script-overhead / lint / fmt rows unchanged
(those benches don't have the bun.lock-wipe issue and are
fixture-size-independent).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 README.md                   |  32 ++++---
 bench/run.sh                |  21 ++++-
 bench/scripts/run-readme.sh | 163 ++++++++++++++++++++++++++++++++++++
 3 files changed, 195 insertions(+), 21 deletions(-)
 create mode 100755 bench/scripts/run-readme.sh

diff --git a/README.md b/README.md
index 328538eb..59d8b70a 100644
--- a/README.md
+++ b/README.md
@@ -124,29 +124,25 @@ Auto-installs deps if stale. Copies `.env.example` if no `.env`. Starts multi-se
 
 |                                  |     npm |    pnpm |     bun |     **lpm**      |
 | -------------------------------- | ------: | ------: | ------: | ---------------: |
-| Cold install, equal footing ¹    | 7,236ms | 1,442ms |   524ms |        **891ms** |
-| Cold install, full wipe loop ²   | 8,022ms | 2,518ms | 1,350ms |      **1,833ms** |
-| Warm install ¹                   | 1,324ms | 1,099ms |   478ms |        **732ms** |
-| Up-to-date install ¹             |   522ms |   175ms |    11ms |          **5ms** |
-| Script overhead ³                |    66ms |   103ms |     6ms |         **10ms** |
-| `lpm lint` vs `npx oxlint` ³     |   257ms |       — |       — |  **77ms** (3.3×) |
-| `lpm fmt` vs `npx biome` ³       |   271ms |       — |       — |   **14ms** (19×) |
-
-> **¹ Install benches — `bench/fixture-large`** — 21 direct deps, 266 transitive packages, the fixture every Phase 49+ ship gate has anchored on. Apple M4 Pro, macOS 15.4. `RUNS=11` median, 2026-04-29 (post-Phase-60.1 default-flip — `lpm install` now reaches greedy-fusion without env vars).
->
-> &nbsp;&nbsp;&nbsp;&nbsp;**Equal footing**: tool-specific cache wipes happen OUTSIDE the timed region so the comparison measures install work only, not asymmetric `rm -rf` cost across tools (LPM wipes two paths, bun wipes one, npm/pnpm wipe their own equivalents). This is the apples-to-apples row.
->
-> &nbsp;&nbsp;&nbsp;&nbsp;**Warm install**: lockfile + global cache present, `node_modules` wiped before each timed iteration. Lockfile is reused; tarballs come from the warm content store / cache; only the link step is fresh.
->
-> &nbsp;&nbsp;&nbsp;&nbsp;**Up-to-date install**: lockfile + cache + `node_modules` all present. The PM detects "nothing to do" and exits. Phase 45's mtime fast-path (`lpm install` without `--allow-new`) takes the top-of-`main` shortcut — no full pipeline, no resolution.
+| Cold install, equal footing ¹    | 7,912ms | 1,546ms | 1,005ms |        **962ms** |
+| Cold install, full wipe loop ²   | 8,538ms | 2,376ms | 1,469ms |      **1,867ms** |
+| Warm install ³                   | 1,324ms | 1,099ms |   478ms |        **732ms** |
+| Up-to-date install ³             |   522ms |   175ms |    11ms |          **5ms** |
+| Script overhead ⁴                |    66ms |   103ms |     6ms |         **10ms** |
+| `lpm lint` vs `npx oxlint` ⁴     |   257ms |       — |       — |  **77ms** (3.3×) |
+| `lpm fmt` vs `npx biome` ⁴       |   271ms |       — |       — |   **14ms** (19×) |
+
+> **¹ Equal-footing cold install — `bench/fixture-large`** — 21 direct deps, 266 transitive packages. Apple M4 Pro, macOS 15.4. `RUNS=11` median, 2026-04-29 (post-Phase-60.1 default-flip — `lpm install` reaches greedy-fusion without env vars). Tool-specific cache + lockfile wipes happen OUTSIDE the timed region so the comparison measures install work only, not asymmetric `rm -rf` cost across tools (LPM wipes two paths, bun wipes one, npm/pnpm wipe their own equivalents). **lpm and bun are measured in a 2-arm round-robin (alternating order per outer iter)** so both arms see the same warm/cold network mix across the run — without that, the arm that runs second per iter gets a ~200-300ms CDN-warmth advantage that biases the comparison. npm and pnpm run sequentially (their multi-second installs make any 200ms warmth bias negligible). Reproduce: `./bench/scripts/run-readme.sh 11`.
 >
 > **² Full wipe loop** — same fixture as ¹, but cache wipes are INSIDE the timer. Representative of a CI cold-clone loop where setup and install are billed together. LPM's wipe covers two paths (`~/.lpm/cache` + `~/.lpm/store`), bun's covers one, npm/pnpm wipe their own; this column includes the asymmetric `rm -rf` term. The equal-footing row (¹) is the install-work-only comparison.
 >
-> **³ Tool-overhead benches — `bench/project`** — 17 direct deps / 51 packages. Script overhead, lint, and fmt measure runner / built-in-tool execution time, not install pipeline cost — the dependency tree size is irrelevant. Same hardware and date as ¹. `lpm lint` / `lpm fmt` use lazy-downloaded binaries (oxlint, biome) — no `npx` resolution overhead per invocation.
+> **³ Warm / Up-to-date — `bench/project`** — 17 direct deps / 51 packages. **Warm install**: lockfile + global cache present, `node_modules` wiped before each timed iteration. **Up-to-date install**: lockfile + cache + `node_modules` all present; the PM detects "nothing to do" and exits — Phase 45's mtime fast-path (`lpm install` without `--allow-new`) takes the top-of-`main` shortcut. Same hardware and date as ¹.
+>
+> **⁴ Tool-overhead benches — `bench/project`**. Script overhead, lint, and fmt measure runner / built-in-tool execution time, not install pipeline cost — the dependency tree size is irrelevant. Same hardware and date as ¹. `lpm lint` / `lpm fmt` use lazy-downloaded binaries (oxlint, biome) — no `npx` resolution overhead per invocation.
 >
-> **Script-policy footing.** `lpm install` runs in `script-policy=deny` by default — lifecycle scripts (`preinstall` / `postinstall` / etc.) do **not** execute during install (Phase 46 two-phase model; scripts run via `lpm rebuild` or `lpm install --auto-build`). `npm` / `pnpm` / `bun` run scripts during install by default. To measure like-for-like cold install on a fixture with install scripts, compare `lpm install` ↔ `bun install --ignore-scripts` (both skip) OR `lpm install --yolo --auto-build` ↔ `bun install` (both run). On `bench/fixture-large` the measured intra-tool deny→allow delta is ~50-67 ms median in either direction (Phase 57 measurement-sprint, n=10) — well below this row's bun-vs-lpm gap.
+> **Script-policy footing.** `lpm install` runs in `script-policy=deny` by default — lifecycle scripts (`preinstall` / `postinstall` / etc.) do **not** execute during install (Phase 46 two-phase model; scripts run via `lpm rebuild` or `lpm install --auto-build`). `npm` / `pnpm` / `bun` run scripts during install by default. To measure like-for-like cold install on a fixture with install scripts, compare `lpm install` ↔ `bun install --ignore-scripts` (both skip) OR `lpm install --yolo --auto-build` ↔ `bun install` (both run). On `bench/fixture-large` the measured intra-tool deny→allow delta is ~50-67 ms median in either direction (Phase 57 measurement-sprint, n=10).
 >
-> **Reproduce locally.** `cargo build --release -p lpm-cli`, then `BENCH_PROJECT_DIR=$PWD/bench/fixture-large RUNS=11 ./bench/run.sh cold-install-clean` (or `cold-install` / `warm-install` / `up-to-date`). Drop `BENCH_PROJECT_DIR` for the script/lint/fmt rows.
+> **Reproduce locally.** `cargo build --release -p lpm-cli`, then `./bench/scripts/run-readme.sh 11` for rows ¹ and ². For warm / up-to-date / script-overhead / lint / fmt, use `./bench/run.sh warm-install` etc.
 
 Plus: dev tunnels, HTTPS certs, secrets vault, task caching, AI agent skills, Swift packages, dependency graph visualization — built in, not bolted on.
 
diff --git a/bench/run.sh b/bench/run.sh
index ae29403e..b094f9a9 100755
--- a/bench/run.sh
+++ b/bench/run.sh
@@ -202,10 +202,19 @@ bench_cold_install() {
 	fi
 
 	# --- bun ---
+	#
+	# Wipe BOTH `bun.lock` (modern text format) and `bun.lockb` (legacy
+	# binary format) per iteration. Without `bun.lock` in the wipe list,
+	# iters 2-N reuse the lockfile from iter 1 and skip resolution —
+	# silently turning the median into a "warm-lockfile cold-cache"
+	# measurement instead of the intended "fully cold" measurement.
+	# Verified A/B (n=11): wiping `bun.lockb` only gave bun median 551 ms
+	# on bench/fixture-large; wiping both gave 878 ms — a 327 ms
+	# lockfile-reuse advantage that biased lpm-vs-bun ratios.
 	if check_tool bun; then
 		cd "$work"
-		rm -rf node_modules bun.lockb
-		ms=$(median_ms "cd $work && rm -rf node_modules bun.lockb ~/.bun/install/cache 2>/dev/null && bun install --ignore-scripts")
+		rm -rf node_modules bun.lock bun.lockb
+		ms=$(median_ms "cd $work && rm -rf node_modules bun.lock bun.lockb ~/.bun/install/cache 2>/dev/null && bun install --ignore-scripts")
 		label "bun"; result "${ms}ms"
 	fi
 
@@ -265,9 +274,15 @@ bench_cold_install_clean() {
 	fi
 
 	# --- bun ---
+	#
+	# Wipe BOTH `bun.lock` and `bun.lockb` per iteration — see the
+	# duplicate cleanup in `bench_cold_install` above for the
+	# verification A/B. Without `bun.lock` in the wipe list, iters 2-N
+	# silently reuse the lockfile from iter 1, biasing the median toward
+	# warm-lockfile speed.
 	if check_tool bun; then
 		ms=$(median_ms_with_setup \
-			"cd $work && rm -rf node_modules bun.lockb ~/.bun/install/cache" \
+			"cd $work && rm -rf node_modules bun.lock bun.lockb ~/.bun/install/cache" \
 			"cd $work && bun install --ignore-scripts")
 		label "bun"; result "${ms}ms"
 	fi
diff --git a/bench/scripts/run-readme.sh b/bench/scripts/run-readme.sh
new file mode 100755
index 00000000..79f1c2be
--- /dev/null
+++ b/bench/scripts/run-readme.sh
@@ -0,0 +1,163 @@
+#!/bin/bash
+# README bench harness — npm / pnpm / bun / greedy-fusion lpm, round-robin
+# per outer iter.
+#
+# Round-robin matches the methodology of `run-5cell.sh` (Phase 56 W4): each
+# outer iter runs all four arms back-to-back, so adjacent samples see the
+# SAME network state. The per-arm sequential structure in `bench/run.sh`
+# favors whichever arm runs last (gets warmest DNS / TLS / CDN — npm goes
+# first, lpm goes last, so lpm benefits and bun is biased somewhere
+# between). Round-robin removes that bias.
+#
+# Two modes per run:
+#   - clean   (cold install, equal footing — wipes OUTSIDE timer)
+#   - full    (cold install, full wipe loop — wipes INSIDE timer)
+#
+# Each tool wipes its own lockfile + cache per iter. CRITICAL: bun's
+# wipe must include BOTH `bun.lock` (modern text format) and `bun.lockb`
+# (legacy binary format). Pre-patch `bench/run.sh` only wiped the binary
+# format, letting bun reuse the modern lockfile across iters and
+# silently turning the median into a "warm-lockfile cold-cache"
+# measurement.
+#
+# Usage:
+#   ./bench/scripts/run-readme.sh <n_iters> [<tag>]
+
+set -euo pipefail
+
+N="${1:-20}"
+TAG="${2:-readme}"
+
+BIN="${LPM_BIN:-$(cd "$(dirname "$0")/../.." && pwd)/target/release/lpm-rs}"
+FIXTURE="${BENCH_PROJECT_DIR:-$(cd "$(dirname "$0")/../.." && pwd)/bench/fixture-large}"
+RESULTS="/tmp/lpm-bench-readme-roundrobin/${TAG}-results"
+mkdir -p "$RESULTS"
+
+if [[ ! -x "$BIN" ]]; then echo "ERROR: missing $BIN — build with cargo build --release"; exit 1; fi
+if ! command -v bun &>/dev/null; then echo "ERROR: bun not on PATH"; exit 1; fi
+
+# Use a fresh work dir, not the in-tree fixture itself, so the `node_modules`
+# / lockfile churn doesn't pollute the committed fixture state.
+WORK="/tmp/lpm-bench-readme-roundrobin/work"
+rm -rf "$WORK" && mkdir -p "$WORK"
+cp "$FIXTURE/package.json" "$WORK/"
+
+clean_lpm() {
+    rm -rf "${HOME}/.lpm/cache" "${HOME}/.lpm/store"
+    rm -rf "${WORK}/node_modules" "${WORK}/.lpm" \
+           "${WORK}/lpm.lock" "${WORK}/lpm.lockb"
+}
+clean_bun() {
+    rm -rf "${HOME}/.bun/install/cache"
+    rm -rf "${WORK}/node_modules" "${WORK}/bun.lock" "${WORK}/bun.lockb"
+}
+clean_npm() {
+    npm cache clean --force > /dev/null 2>&1 || true
+    rm -rf "${WORK}/node_modules" "${WORK}/package-lock.json"
+}
+clean_pnpm() {
+    pnpm store prune > /dev/null 2>&1 || true
+    rm -rf "$(pnpm store path 2>/dev/null)" 2>/dev/null || true
+    rm -rf "${WORK}/node_modules" "${WORK}/pnpm-lock.yaml"
+}
+
+# Convert nanoseconds-since-process-start to wall-ms; tolerant of macOS BSD date.
+now_ms() { python3 -c 'import time;print(int(time.perf_counter_ns()))'; }
+
+run_arm() {
+    local mode=$1 arm=$2
+    case "$mode/$arm" in
+        clean/lpm) clean_lpm; local s=$(now_ms); (cd "$WORK" && "$BIN" install --allow-new --json) > /dev/null 2>&1; local e=$(now_ms);;
+        clean/bun) clean_bun; local s=$(now_ms); (cd "$WORK" && bun install --ignore-scripts) > /dev/null 2>&1; local e=$(now_ms);;
+        clean/npm) clean_npm; local s=$(now_ms); (cd "$WORK" && npm install --ignore-scripts) > /dev/null 2>&1; local e=$(now_ms);;
+        clean/pnpm) clean_pnpm; local s=$(now_ms); (cd "$WORK" && pnpm install --ignore-scripts) > /dev/null 2>&1; local e=$(now_ms);;
+        full/lpm) local s=$(now_ms); (rm -rf "${HOME}/.lpm/cache" "${HOME}/.lpm/store" "${WORK}/node_modules" "${WORK}/.lpm" "${WORK}/lpm.lock" "${WORK}/lpm.lockb" 2>/dev/null; cd "$WORK" && "$BIN" install --allow-new --json) > /dev/null 2>&1; local e=$(now_ms);;
+        full/bun) local s=$(now_ms); (rm -rf "${HOME}/.bun/install/cache" "${WORK}/node_modules" "${WORK}/bun.lock" "${WORK}/bun.lockb" 2>/dev/null; cd "$WORK" && bun install --ignore-scripts) > /dev/null 2>&1; local e=$(now_ms);;
+        full/npm) local s=$(now_ms); (npm cache clean --force > /dev/null 2>&1 || true; rm -rf "${WORK}/node_modules" "${WORK}/package-lock.json" 2>/dev/null; cd "$WORK" && npm install --ignore-scripts) > /dev/null 2>&1; local e=$(now_ms);;
+        full/pnpm) local s=$(now_ms); (pnpm store prune > /dev/null 2>&1 || true; rm -rf "$(pnpm store path 2>/dev/null)" 2>/dev/null; rm -rf "${WORK}/node_modules" "${WORK}/pnpm-lock.yaml" 2>/dev/null; cd "$WORK" && pnpm install --ignore-scripts) > /dev/null 2>&1; local e=$(now_ms);;
+    esac
+    local wall=$(( (e-s) / 1000000 ))
+    echo "$wall" > "$RESULTS/${mode}-iter-${i}-${arm}.wall_ms"
+    echo "  [${mode}] iter $i $arm = ${wall}ms"
+}
+
+echo "[bench] readme round-robin — n=${N} per arm, fixture: $(basename "$FIXTURE")"
+echo "[bench] HEAD: $(cd "$(dirname "$0")/../.." && git rev-parse --short HEAD) ($(cd "$(dirname "$0")/../.." && git branch --show-current))"
+date
+
+# Methodology:
+#   npm + pnpm   — sequential, n iters each. Their bun-lockfile-reuse
+#                  bias is N/A; their absolute numbers are reference
+#                  points, not the headline lpm-vs-bun comparison.
+#   lpm + bun    — strict 2-arm round-robin alternating per outer iter.
+#                  Iter 1 runs lpm-then-bun, iter 2 runs bun-then-lpm,
+#                  etc. Across n iters each arm visits position-1
+#                  (cold) and position-2 (warm-after-other) equally
+#                  often, so both see the same mix of network state.
+#                  This is the apples-to-apples like-for-like
+#                  comparison the bench/scripts W4 baseline uses.
+
+# Order matters. Running npm/pnpm BEFORE the lpm+bun round-robin
+# would warm not just the local OS state (DNS, TCP keep-alives) but
+# also the npm CDN edge — causing bun's median to drop from ~870ms
+# to ~580ms relative to lpm. Run the lpm+bun headline FIRST while
+# the CDN is cold, then npm+pnpm afterward.
+
+# ── Cold install, equal footing (wipes OUTSIDE timer) ──────────────
+echo "[clean] cold install, equal footing — wipes OUTSIDE timer"
+
+# lpm + bun round-robin (alternating order per iter) — the apples-to-
+# apples headline. Each arm visits position-1 and position-2 equally
+# often across n iters, so both see the same warm/cold network mix.
+for i in $(seq 1 "$N"); do
+    if (( i % 2 == 1 )); then arm_order=(lpm bun); else arm_order=(bun lpm); fi
+    for arm in "${arm_order[@]}"; do run_arm clean "$arm"; done
+done
+
+# npm + pnpm sequential — context numbers. Their ~1.5-7s install times
+# dwarf any 200-300ms network-warmth bias, so methodology drift is N/A.
+for i in $(seq 1 "$N"); do run_arm clean npm; done
+for i in $(seq 1 "$N"); do run_arm clean pnpm; done
+
+# ── Cold install, full wipe loop (wipes INSIDE timer) ──────────────
+echo "[full] cold install, full wipe loop — wipes INSIDE timer"
+
+for i in $(seq 1 "$N"); do
+    if (( i % 2 == 1 )); then arm_order=(lpm bun); else arm_order=(bun lpm); fi
+    for arm in "${arm_order[@]}"; do run_arm full "$arm"; done
+done
+
+for i in $(seq 1 "$N"); do run_arm full npm; done
+for i in $(seq 1 "$N"); do run_arm full pnpm; done
+
+# ── Summary ────────────────────────────────────────────────────────
+echo
+echo "=== summary (n=${N}) ==="
+python3 - <<EOF
+import os, glob, statistics
+RES = "$RESULTS"
+print(f"\n{'mode':<8} {'arm':<6} {'median':>8} {'mean':>8} {'tmean10':>9} {'stdev':>7}")
+print("-" * 50)
+def load(prefix, arm):
+    files = sorted(glob.glob(os.path.join(RES, f"{prefix}-iter-*-{arm}.wall_ms")))
+    return [int(open(f).read().strip()) for f in files]
+for mode in ("clean", "full"):
+    for arm in ("npm", "pnpm", "bun", "lpm"):
+        v = load(mode, arm)
+        if not v: continue
+        s = sorted(v); n = len(v); trim = max(1, n//10)
+        median = statistics.median(v); mean = statistics.mean(v)
+        tmean = statistics.mean(s[trim:n-trim]) if n - 2*trim > 0 else mean
+        stdev = statistics.stdev(v) if n > 1 else 0
+        print(f"{mode:<8} {arm:<6} {int(median):>8} {int(mean):>8} {int(tmean):>9} {int(stdev):>7}")
+
+print()
+for mode in ("clean", "full"):
+    lpm_v = load(mode, "lpm"); bun_v = load(mode, "bun")
+    if lpm_v and bun_v:
+        print(f"  [{mode:<5}] lpm/bun ratio = {statistics.median(lpm_v)/statistics.median(bun_v):.2f}x")
+EOF
+
+echo
+echo "[done] $RESULTS"
+date