Skip to content

feat(phase-49): client-side streaming BFS resolver#8

Merged
tolgaergin merged 23 commits into
mainfrom
phase-49-streaming-resolver
Apr 25, 2026
Merged

feat(phase-49): client-side streaming BFS resolver#8
tolgaergin merged 23 commits into
mainfrom
phase-49-streaming-resolver

Conversation

@tolgaergin
Copy link
Copy Markdown
Contributor

Summary

Phase 49 ships the client-side streaming BFS metadata resolver that replaces the pre-49 server-mediated batch_metadata_deep flow as the cold-install hot path. The walker fetches manifests level-by-level into a shared DashMap cache that the PubGrub solver reads synchronously through a per-canonical Notify wait-loop. Three architectural shifts:

  • Direct-to-npm by default (§3 + §5). npm packages bypass the LPM Cloudflare Worker. The Worker stays in front of @lpm.dev/* (entitlement, audit, signing) and is opt-in via LPM_NPM_ROUTE=proxy for users who need it.
  • Canonical-keyed shared cache (§4). One DashMap keyed by CanonicalKey; the walker inserts under canonical names, the provider canonicalizes at every read so split-retry passes hit the same cell. Closes the silent notify-miss documented in preplan §4.2.
  • Walker-as-producer with shutdown handshake (§12). A new WalkerDone: Arc<AtomicBool> flag is set (Release) at the end of BfsWalker::run via an RAII guard that also broadcasts notify_waiters() across every notify_map entry. The provider's ensure_cached wait-loop pins Notified::enable() and re-checks walker_done after each cache miss — when the walker terminates without inserting a key (e.g. older-version transitive PubGrub later picks), the wait-loop short-circuits to the escape-hatch fetch in microseconds instead of burning the full fetch_wait_timeout.

Bench shape (express @ ^4.18.0, cold cache, same hardware)

Walker strategy Manifests Walker PubGrub Wall
pre-49 baseline (no walker) n/a n/a n/a 40 s ❌ (5 s × 6 timeouts)
§11 full union + §12 broadcast 7,146 31.8 s 2.0 s 33 s
bounded union top-K=4 majors + §12 broadcast 845 7.6 s 11.0 s 11 s
§13 newest-only + §12 broadcast + skip redundant batch 68 0.9 s 1.1 s 2 s
Warm-cache rerun (this branch) 68 6 ms 6 ms 8 ms

Direct mode now sits at par with pnpm and faster than npm on the standard cold-install-clean bench (17 deps → 51 packages, N=5 median): npm 1388 ms / pnpm 762 ms / bun 309 ms / lpm-direct 749 ms / lpm-proxy 2902 ms.

Notable commits in this branch

  • bc56afe feat(phase-49): CanonicalKey — context-free cache key for streaming BFS
  • a9766c5 feat(phase-49): routing primitives — RouteMode + direct-tier npm fetch + parallel fan-out
  • 200ec16 feat(phase-49): provider rewire — canonical-keyed DashMap cache + wait-loop + resolve.rs downstream
  • 5449b1d feat(phase-49): BfsWalker — level-by-level streaming metadata prefetch
  • 0c7358b feat(phase-49): install.rs orchestration — walker + dispatcher + resolve concurrent
  • bcaaf4e feat(phase-49): timing.resolve.streaming_bfs JSON output (§6)
  • eedfe72 fix(phase-49): CAS on ceiling closes multi-429 ratchet race
  • 019d99c fix(phase-49): §11 walker must union deps across all versions + resolve aliases
  • b37c1f7 fix(phase-49): §12 wait-loop shutdown handshake — walker_done broadcast
  • fba1cd4 fix(phase-49): §13 walker is the metadata producer — newest-only + skip redundant deep batch

20 commits total. git log main..HEAD for the full set.

Companion fix on lpm.dev (separate repo, already deployed)

Diagnosed during phase-49 proxy-mode validation: a Drizzle named prepared statement (db.select(...).prepare("validate_registry_token")) was wedging on transaction-mode pgBouncer pooling because db/client.js runs postgres.js with prepare: false. Symptom: every endpoint validating a real lpm_* token (batch-metadata, /api/user/me, per-org token routes, …) hung for 30 s+ until the upstream proxy gave up. The lone .prepare() call site in the entire codebase. Replaced with inline parameterized SQL; the registry_tokens_hash_idx btree keeps it sub-millisecond. Shipped on tolgaergin/a-package-manager@8f8b9657, deployed.

That fix is what made LPM_NPM_ROUTE=proxy benchable here.

Test plan

  • cargo clippy --workspace -- -D warnings
  • cargo fmt --check
  • cargo build --workspace ✅ (zero warnings)
  • cargo nextest run --workspace --exclude lpm-integration-tests4379/4379 passed, 7 skipped
  • cargo test -p lpm-auth × 3 ✅ (47/47 each, deterministic across reruns — no keychain race)
  • No banned crate (fancy-regex) ✅
  • Cold-install-clean A/B bench (N=5, median): direct + proxy both functional, numbers above
  • Express end-to-end install: cold 2 s, warm 8 ms

Out of scope (for follow-up phases)

  • Closing the bun-gap on Direct mode (lpm 749 ms vs bun 309 ms cold-equal). The walker is fast (6 ms warm); the gap is in HTTP/2 multiplexing, manifest-shape selection, and extraction throughput. Engineering work, no CDN spend.
  • Speeding up Proxy mode (lpm-proxy 2.9 s vs lpm-direct 749 ms). Routing the rust client through the Worker's NDJSON deep fast path (Accept: application/x-ndjson + deep:true) for pure-npm batches in proxy mode would let proxy benefit from the Worker-native streaming the eligibility check at workers/registry/src/batch-metadata.js:230 was built for. ~140 ms in my probe vs the current ~1.9 s JSON shape.

🤖 Generated with Claude Code

tolgaergin and others added 23 commits April 24, 2026 15:43
Introduces CanonicalKey (Root | Lpm{owner,name} | Npm{name}) — the
context-free identity used by Phase 49's shared metadata cache and
notify map. Unlike ResolverPackage, CanonicalKey has no `context`
field, so split-retry identities (e.g. `lodash` and `lodash[parent]`)
collapse to the same key.

This is the load-bearing invariant that prevents the split-subtree
notify-miss bug: the walker only discovers canonical names from
parsed manifests, so the shared cache MUST be keyed by CanonicalKey
— keying by ResolverPackage would cause every split retry to miss
the walker's inserts and fall through to the escape-hatch fetch.

See DOCS/new-features/37-rust-client-RUNNER-VISION-phase49-streaming-
resolver-preplan.md §4.2 for the full invariant discussion.

- Adds CanonicalKey to lpm-resolver with From<&ResolverPackage>,
  Display, lpm() / npm() constructors, and from_dep_name().
- 9 unit tests covering context-stripping, Npm-vs-Lpm distinction,
  scoped-npm round-trip, HashMap key identity under split retry,
  and walker/provider parse agreement.
- Exports from lpm-resolver::lib.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…h + parallel fan-out

Foundation for Phase 49's direct-to-npm default. Introduces:

- `lpm-registry/src/route.rs`: `RouteMode { Direct, Proxy }` (default
  Direct per preplan §3) and `UpstreamRoute { LpmWorker, NpmDirect }`.
  `RouteMode::route_for_package` enforces the invariant that
  `@lpm.dev/*` ALWAYS routes via the Worker regardless of mode —
  authenticated batch + cost attribution must not be bypassed.
  `LPM_NPM_ROUTE` env var exists as an undocumented debug escape
  hatch via `from_env_or_default`.

- `RegistryClient::get_npm_metadata_direct`: mirrors the Tier 3
  direct-to-npmjs path from `get_npm_package_metadata`, but SKIPS
  the LPM Worker proxy tier entirely. Tier 1 (TTL + HMAC cache)
  is preserved so warm installs and previously-seen packages stay
  cache-fast. This is what callers invoke when direct-mode is the
  whole point of the call; no silent fallback to the Worker hop.

- `RegistryClient::get_npm_metadata_routed`: single dispatch entry
  the Phase 49 walker + provider escape-hatch both use, so routing
  policy lives in one place.

- `RegistryClient::parallel_fetch_npm_manifests`: fan-out helper
  for the walker. Uses `tokio::sync::Semaphore` with `forget()`
  as a one-way ratchet for halve-on-429 adaptive back-pressure
  (per-request 429s are already retried with Retry-After by
  `send_with_retry`; this adds batch-level pressure reduction
  with a floor of 4). Returns per-entry Results + `FanOutStats`
  so the walker can report halve events in
  `timing.resolve.streaming_bfs` without interpreting errors.

Existing `get_npm_package_metadata` (three-tier chain) is untouched;
callers of the old path will be migrated to `get_npm_metadata_routed`
in §3–§5 and the legacy entry deleted in §7.

- 5 new route unit tests; 119 total lpm-registry tests green.
- Clippy -D warnings clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tion

Review catch on a9766c5: `parallel_fetch_npm_manifests`'s halve-on-429
path only called `try_acquire_owned` synchronously. When every permit
is already checked out (the exact scenario the back-off exists for),
zero permits are forgotten, the effective ceiling stays at `initial`,
and `halve_events` increments anyway — a silent no-op that lies in
stats.

Fix:

- Add `forget_debt: AtomicUsize` counter. On 429, the observing task
  first tries synchronous `try_acquire_owned` forgets (covers partial
  saturation); any shortfall is registered as debt and paid off by
  the next task completions. Each completing task CAS-claims one
  unit of debt and calls `permit.forget()` instead of releasing it —
  over the next few completions the pool actually shrinks.

- `current_ceiling` only decrements on ACTUAL forgets (immediate OR
  at the point a task pays debt), never on the mere registration of
  debt. This plugs a second hole in the original fix where the stats
  moved below `initial` even when no real forget followed. With the
  corrected accounting, `final_concurrency < initial_concurrency`
  iff permits were genuinely removed from the pool.

- `halve_events` only increments when forgot_now > 0 OR shortfall > 0.
  Halving at-or-below floor records no event (the stat no longer
  claims halving when the floor blocked it).

Four new client tests to pin §2's behavior the route-enum tests
couldn't reach:

- `get_npm_metadata_direct_skips_proxy_tier_entirely`: proxy mock
  with `expect(0)` verifies direct-tier never falls back to the LPM
  Worker.

- `parallel_fetch_preserves_input_order_across_varying_latencies`:
  inverted per-response delays ensure the fan-out would return in
  completion order if the sort-by-input-index were broken.

- `parallel_fetch_per_entry_failures_do_not_abort_batch`: 404 on one
  entry; the other two still return Ok.

- `halve_on_429_ratchets_even_under_full_saturation`: asymmetric mocks
  (pkg-0 fast 429 + pkg-1..7 slow 200s with 2s delay) pin the
  saturated moment at the halving call. Verified to FAIL against a
  sabotaged debt path (`initial=8, final=8, halve_events=1` — exactly
  the bug shape) and PASS against the fix (`final=4`). Requires
  `#[tokio::test(flavor = "multi_thread", worker_threads = 4)]` — the
  default single-threaded tokio runtime scheduled pkg-0 to completion
  before pkg-1..7 acquired permits, so the saturation never
  materialized.

123 lpm-registry tests green, clippy -D warnings clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t-loop + resolve.rs downstream

Load-bearing §3 change for Phase 49: the provider's metadata cache
moves from `RefCell<HashMap<ResolverPackage, _>>` to
`Arc<DashMap<CanonicalKey, _>>`, removing the context field from the
cache key so the future walker's canonical inserts unify with every
split-retry identity automatically. Without this, the walker (§4)
would produce cache entries the provider's split-context lookups
silently miss, and escape-hatch fetches would fire on every split
path — exactly the pathology preplan §4.2 warns about.

Key changes:

- `LpmDependencyProvider`: cache type swap. New `notify_map` field
  (`Arc<DashMap<CanonicalKey, Arc<Notify>>>`) for per-canonical wakes.
  New `route_mode: RouteMode` (defaults to `Proxy` in §3 to preserve
  pre-49 behavior; §5 flips to Direct via `with_route_mode`). New
  `fetch_wait_timeout: Duration` (defaults to ZERO — wait-loop short-
  circuits to fetch-on-miss pending §5's walker attachment).

- `ensure_cached` rewritten per preplan §5.1: canonicalize FIRST, then
  cache fast-path, then wait-loop on the per-canonical Notify, then
  escape-hatch to `direct_fetch_and_cache` on timeout. The wait-loop
  is doc-commented as the invariant boundary — future refactors that
  skip canonicalization before a cache read would break the split-
  subtree notify-hit invariant silently.

- `direct_fetch_and_cache` routes via `route_for_package(name)`: LPM
  always via Worker, npm per `route_mode`. `insert_and_notify` enforces
  the §5.5 mandatory order — insert into cache THEN fire waiters.

- Every internal cache call site canonicalizes at the boundary:
  `available_versions`, `pick_natural_version`, `apply_override_target`
  (Range arm), `prioritize`, both uncached-batch filters and the main
  cache read in `get_dependencies`.

- `into_cache` + `into_parts` return `HashMap<CanonicalKey, _>`. Fast
  path via `Arc::try_unwrap`; per-entry clone fallback for the §5
  shape where the walker's `Arc` refcount is still live.

- New constructors: `with_shared_cache(SharedCache, NotifyMap, Duration)`
  and `with_route_mode(RouteMode)`, both `#[allow(dead_code)]` until
  §5 plumbs them from install.rs.

- `resolve.rs` downstream: `ResolveResult.cache` type updated, plus
  `format_solution` and `check_unmet_peers` canonicalize on lookup.
  Test-side `cache.insert` sites canonicalize at the boundary helper.

- Removed the `is_split()` branch in `ensure_cached` that copied
  canonical entries under split keys — canonicalization at the cache
  boundary makes it redundant.

Regression tests (preplan §8.1 boundary tests):

- `ensure_cached_split_retry_hits_canonical_entry`: when the walker
  inserts under the canonical key, a provider lookup via a split-
  context identity must hit the same entry (not time out and fetch).
  Verified to FAIL under a sabotaged canonicalization (forced miss
  on split identities) and PASS under the fix.

- `ensure_cached_canonical_hits_split_insertion_symmetric`: the
  inverse — canonical lookup hits a split-inserted canonical entry.
  Both directions of the canonical collapse must agree.

Incidental: includes `cargo fmt` drift on two lpm-cli files
(`build_state.rs`, `commands/install.rs`) that were non-compliant on
the pre-§3 tree — the CI gate's `cargo fmt --check` enforces it, so
fmt discipline for the tree as a whole is required for §3 to ship
green.

- `dashmap = "6"` added to workspace + lpm-resolver Cargo.toml.
- Full CI gate green: clippy --workspace -D warnings clean, fmt
  clean, 4364 tests pass, 7 skipped, 0 failed. Parallel-default
  `cargo test -p lpm-auth` deterministic across 3 runs.
- 127 lpm-resolver unit tests (was 114 pre-49).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reviewer regression: two concurrent 429s both read `current=8`, both
enqueue `want_forget=4` into debt, and the 8 completions drive the
effective pool to 0 — below the floor of 4. The previous fix made
the CAS atomic on `debt`, but the race is on `ceiling` itself.

Fix: CAS loop on `ceiling` at halve time. At most one handler per
ceiling transition wins the decrement; others observe the new lower
ceiling and either halve again (if still above floor) or bail.
Ceiling now reflects the *committed* ceiling (reduction in motion);
completion path no longer decrements ceiling (was double-counting
with the CAS).

New test `halve_on_429_multi_concurrent_races_respect_floor`: 8
concurrent slow 429s; asserts `final_concurrency >= floor` and
`halve_events == 1` (only one halve step representable with
floor=4, initial=8). Verified to fail under a sabotaged fetch_sub
and pass under the CAS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New `lpm-resolver::walker::BfsWalker` fetches manifests ahead of
PubGrub for the root set and its transitive deps. Drops into the
Phase 49 orchestration §5 will plumb from install.rs; until then it
is exercised only by its own tests.

Design:
- Lives in lpm-resolver (not lpm-registry per preplan §5.5) because
  it operates on resolver types (CanonicalKey, SharedCache, NotifyMap)
  — moving those to lpm-common is a worse fit than relocating the
  walker. lpm-resolver already depends on lpm-registry so there's no
  circular dep.
- Level-by-level BFS; each level partitions names by route and fetches
  the NpmDirect + LpmWorker halves concurrently. npm uses the
  `parallel_fetch_npm_manifests` 50-wide fan-out + halve-on-429; LPM
  uses `batch_metadata`.
- Per-manifest ordering invariant (preplan §5.5, enforced via
  comment block): (1) `shared_cache.insert`, (2) `notify_waiters`,
  (3) fire `roots_ready` oneshot, (4) `spec_tx.send().await`. Step 4
  is last so resolver visibility is never gated on dispatcher
  throughput.
- Transitive discovery walks the `dist-tags.latest` version's
  production deps (falling back to highest version if `latest`
  missing) — matches the dispatcher's `pick_speculative_version`
  shape. Resolver escape-hatch handles any miss.
- MAX_WALK_DEPTH = 16 (matches the Worker's prior server-side cap).
- Per-package fetch failures: debug log + continue; bun/pnpm shape.

Observability: `WalkerSummary { manifests_fetched, cache_hits,
max_depth, spec_tx_send_wait_ms }`. The send-wait counter is the
reviewer's backpressure canary (preplan §5.6) — healthy value is 0.

Also:
- `with_npm_registry_url` promoted from `#[cfg(test)]` to `pub`.
  Walker tests depend on it to mock the npm upstream across crates.
  The W4 memory flagged this setter as test-only, but the concern
  was lockfile multi-registry correctness, not setter visibility.
- `parse_metadata_to_cache_info` exposed `pub(crate)` so walker can
  share the provider's canonical parse path.
- `SharedCache` + `NotifyMap` re-exported from `lpm-resolver::lib`.

Three walker unit tests:
- `walker_end_to_end_discovery_and_signals`: 2 roots + transitives,
  asserts cache populated, roots_ready fires, spec_tx receives all
  frames, summary fields correct.
- `walker_inserts_under_canonical_key_not_context_bearing`: walker
  insert for `lodash` must be visible via `CanonicalKey::from(&split)`
  — same invariant as the provider-side regression, tested through
  the walker path.
- `walker_caps_at_max_walk_depth`: chain of 20 packages; walker caps
  at depth 16, exactly `MAX_WALK_DEPTH` manifests fetched.

130 lpm-resolver tests green (was 127); 4368 workspace tests pass,
clippy --workspace -D warnings clean, fmt clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reviewer finding 1 (medium): the walker's cache-hit branch
incremented `cache_hits`, maybe fired `roots_ready`, then `continue`d
— never reading the cached manifest's deps, so any caller that
reused a shared cache or pre-seeded roots got a truncated walk at
exactly those nodes. §5's orchestration reuses the cache across
walker + provider, so this sits exactly on the seam §5 plumbs into.

Fix: clone the cached `CachedPackageInfo` out of the DashMap guard
(dropping it at the `;` to avoid an E0502 borrow conflict against
the subsequent `&mut self` call) and run `expand_deps_from_info`
on it. Cached entries do NOT re-emit on `spec_tx` — the manifest
was already produced by whoever seeded the cache; the dispatcher
dedupes internally anyway.

Reviewer finding 2 (low): `expand_deps_into`'s "fallback to highest
version" comment was wrong — `meta.versions.keys().next()` is
arbitrary `HashMap` iteration order, not a semver max. Replaced
with `expand_deps_from_info(info, ...)` that reads
`info.versions.first()`, which `parse_metadata_to_cache_info` keeps
sorted newest-first (and prerelease-filtered for npm). One dep-
discovery primitive now covers both the cache-hit and fresh-fetch
paths.

- `commit_manifest` returns the parsed `CachedPackageInfo` so the
  fresh-fetch path can expand via the same helper without re-parsing.
- Removed `expand_deps_into(PackageMetadata, ...)`.
- New regression test `walker_cache_hit_still_expands_transitive_deps`:
  pre-seeds a root with a dep on `leaf-x`, asserts the walker fetches
  `leaf-x` via the expanded cache-hit path, `roots_ready` still fires,
  and `spec_tx` sees only the transitive (cached root not re-emitted).
  Verified to fail under a sabotaged cache-hit branch (skipping
  expansion) and pass under the fix.

131 lpm-resolver tests (was 130); clippy + fmt clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lve concurrent

Replaces the pre-49 batch-then-resolve serial flow with the preplan
§5.3 shape: SharedCache + NotifyMap owned by install.rs, walker
spawned as metadata producer, speculation dispatcher extracted and
spawned as a separate consumer, resolver awaited inline on the
walker's `roots_ready` oneshot, both task handles bundled into a
WalkerJoin drained post-fetch.

Key shape changes:

- `run_deep_batch_with_speculation` deleted. Its dispatcher body is
  extracted into `spawn_speculation_dispatcher(rx, client, store,
  sem, coord, deps) -> (JoinHandle, DispatcherCounters)` — refactor-
  only, dispatcher logic unchanged. The W2 roots-ready signal moves
  to the walker (preplan §5.5).

- `SpeculativeJoin` → `WalkerJoin`: bundles walker + dispatcher
  JoinHandles + counters. `drain(self, stats) -> WalkerSummary`
  awaits both in parallel, folds counters into `spec_stats` for
  observability, returns walker's summary. Critical: neither handle
  is awaited at construction — awaiting early makes the post-fetch
  drain a no-op (the bug preplan §5.3 warns about).

- `spec_fetch_enabled` env gate, `cache_has_all` on-disk gate,
  `prefetched_batch`, `resolve_result_w2` all gone — the walker
  unconditionally drives metadata fetch for the resolver + feeds the
  dispatcher for speculation.

- `resolve_with_prefetch(Option<HashMap>)` → `resolve_with_shared_cache(
  SharedCache, NotifyMap, Duration, RouteMode)`. Provider wires
  through `with_shared_cache` + `with_route_mode`. `fetch_wait_timeout
  = 5s` now live — the wait-loop actually waits for the walker
  instead of short-circuiting to the escape hatch.

- Retry loop simplified: `Arc<DashMap>` persists across split-retry
  passes natively, so `cached_metadata`/`into_cache`/`with_cache`
  round-trips deleted. Dead-code pruning: `with_cache`,
  `with_prefetched_metadata`, `into_cache` removed from provider.

- `RouteMode::from_env_or_default()` reads `LPM_NPM_ROUTE`, defaults
  to Direct (preplan §3 shipped default). Walker + provider's
  escape-hatch share the same mode.

Test harness updates for the route-mode flip:

- `install_global.rs` cypress fixture now mounts each package's
  metadata at BOTH the LPM proxy path (`/api/registry/{name}`) and
  the npm-direct path (`/{name}`), and sets `with_npm_registry_url`
  to the same mock — mode-agnostic. `.expect(1)` counts dropped on
  per-package GETs since the walker/batch interplay doesn't
  guarantee a specific per-mock call count.

- `tests/workflows/tests/support/mock_registry.rs::with_package_and_deps`
  mounts both paths for the same reason.

- Subprocess workflow tests (which can't override `npm_registry_url`
  via the `--registry` CLI flag) set `LPM_NPM_ROUTE=proxy` in the
  shared `lpm()` command builder so the subprocess honors the mock
  server at the proxy path.

4369 workspace tests pass, 0 failures; clippy --workspace -D warnings
clean; fmt clean. Walker summary currently logged at debug; §6 will
fold it into the `timing.resolve.streaming_bfs` JSON sub-object.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…batch_ms

Reviewer findings (medium×2): two timing fields silently changed
semantics after §5's orchestration rewrite.

1. `timing.resolve.initial_batch_ms` — documented as the pre-resolve
   batch-prefetch wall-clock. Pre-fix it was measured at resolve
   completion, so it lumped in PubGrub wall-clock (already separately
   reported as `resolver_stage_timing.pubgrub_ms`) and JSON output
   was internally inconsistent. Fix: capture at roots-ready fire
   inside the resolve async block and return the measurement as a
   tuple. New meaning: "wall-clock from orchestration start to the
   moment the resolver could begin solving" — the Phase 49 analog of
   "batch prefetch done."

2. `timing.fetch_breakdown.speculative.streaming_batch_ms` —
   documented as the metadata-stream/dispatcher window. Pre-fix,
   `WalkerJoin::drain` measured at drain-complete time (post-fetch),
   which included the entire tarball-download overlap tail. Fix:
   await the walker first, close the streaming-batch window at
   walker-complete time, THEN await the dispatcher. The dispatcher
   tail (spec_tasks drain) is the fetch-overlap tail — belongs in
   `fetch_ms`, not the batch window.

Both fields are now meaningful for §10's pre-merge ship-gate A/B
comparison. Reviewer blocked §6 on these because bench interpretation
would be misleading until fixed.

Cypress test + clippy + fmt clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…all window

Reviewer finding (medium): the previous fix still used
`self.started_at.elapsed()` at drain-call time, so if the walker
completed earlier during fetch overlap, awaiting the already-done
JoinHandle returned immediately but the `elapsed()` was still taken
at post-fetch drain time. The metric continued to include the
walker-finished-to-drain-called window — not the metadata-producer
window the field contract documents.

Fix: the walker records its own wall-clock inside `BfsWalker::run()`
from entry to return, stored as `WalkerSummary::walker_wall_ms`.
The measurement is invariant to when the caller chooses to `.await`
the JoinHandle. `WalkerJoin::drain` reads `summary.walker_wall_ms`
into `stats.streaming_batch_ms` — no more external timers.

`WalkerJoin::started_at` field and the local `walker_started_at`
shadow are now dead and deleted. Both the `SpeculativeStats`
docstring (install.rs:166) and the `timing.resolve.*` JSON comment
block (install.rs:2988) updated to reflect the Phase 49 contract:
`streaming_batch_ms` = walker's metadata-producer window,
`initial_batch_ms` = orchestration-start → roots-ready.

Reviewer's low-severity JSON contract drift on `initial_batch_ms`
(install.rs:2993) also fixed in the same docstring block.

Walker tests + workspace clippy + fmt all clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Preplan §5.6 observability lands: install.rs emits a new
`timing.resolve.streaming_bfs` sub-object in `--json` output that
combines walker-side metrics (from `WalkerSummary`) with provider-
side counters (from a new `StreamingBfsMetrics`).

New in `lpm-resolver::provider`:
- `StreamingBfsMetrics` — three `Arc<AtomicU64>` counters shared
  across split-retry passes so counts accumulate into the same
  snapshot:
  * `cache_waits` — PubGrub callbacks that entered the wait-loop on
    a cache miss. Healthy on a cold install ≈ total transitive
    packages.
  * `cache_wait_timeouts` — wait-loop exits by timeout. Healthy 0.
  * `escape_hatch_fetches` — fetches that bypassed the wait-loop
    (timeout OR zero-timeout short-circuit). Healthy 0 when walker
    is attached.
- `LpmDependencyProvider::with_streaming_metrics(metrics)` setter
  attaches the shared counters to each retry-pass provider.
- `ensure_cached` increments `cache_waits` on wait-loop entry and
  `cache_wait_timeouts` on either timeout path. `direct_fetch_and_cache`
  increments `escape_hatch_fetches` for every non-root fetch.

Wire-through:
- `resolve_with_shared_cache(..., metrics: StreamingBfsMetrics)` —
  new parameter threaded through every split-retry pass via the
  shared `Arc`. `#[allow(clippy::too_many_arguments)]` on the fn
  per CLAUDE.md (design-level lint, not worth a parameter-struct
  refactor right now).
- `resolve_dependencies_with_overrides` + test shim pass
  `StreamingBfsMetrics::new()` — pre-walker callers get a default
  fresh counter set.
- install.rs hoists a single `streaming_metrics` to outer scope so
  both the resolver (inside the `None` lockfile arm) and the JSON-
  emit block (at outer scope) share the same Arc.

JSON field shape per preplan §5.6 (null on warm lockfile-fast-path):
```json
"timing": {
  "resolve": {
    "streaming_bfs": {
      "walk_ms": 345,
      "manifests_fetched": 227,
      "cache_hits": 3,
      "cache_waits": 227,
      "cache_wait_timeouts": 0,
      "escape_hatch_fetches": 0,
      "spec_tx_send_wait_ms": 0,
      "max_depth": 5
    }
  }
}
```

Docstring block above the `"resolve"` JSON emit documents healthy vs
degraded values per field so bench readers can spot regressions.

131 resolver tests green; 4369 workspace tests pass; clippy
--workspace -D warnings clean; fmt clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ion tests

Reviewer found the "healthy cache_waits ≈ total transitive packages"
framing was quantitatively wrong — `cache_waits` only increments on
wait-loop entry (fast-path cache hits don't count), `ensure_cached`
is called from multiple sites + across split retries, so the counter
can be much lower or higher than installed package count depending
on timing and tree shape.

Fixes:

- `StreamingBfsMetrics` doc: replaced the three-bullet "healthy
  values ≈" block with per-counter qualitative semantics. `cache_waits`
  is explicitly flagged as "how many times PubGrub had to wait on
  the walker" — NOT an installed-package-count proxy.

- `timing.resolve.streaming_bfs` JSON contract block in install.rs:
  same softening, inline in the docstring above the emit.

Tests added to close the testing gap the reviewer noted (the JSON
sub-object previously had no regression coverage):

- `streaming_metrics_cache_hit_does_not_increment_waits`: fast-path
  cache hit leaves all three counters at zero.

- `streaming_metrics_miss_with_zero_timeout_increments_only_escape_hatch`:
  pre-§5 shape (walker unattached, `fetch_wait_timeout == ZERO`) —
  wait-loop never entered (`cache_waits = 0`), escape hatch fires
  once (`escape_hatch_fetches = 1`). Underlying network failure is
  orthogonal to the counter check.

- `streaming_metrics_root_package_does_not_count`: `ensure_cached(Root)`
  returns early before any counter interaction.

- `streaming_metrics_shared_across_clones`: cloning shares the
  underlying Arc<AtomicU64>s — load-bearing for split-retry
  cross-pass accumulation.

135 resolver tests (was 131); clippy + fmt clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
§7 per the todo: final sweep of stale references + dead-code deletion
after §5 retired the pre-49 orchestration. Plus the reviewer's
suggested hardening to close the CLI-surface testing gap.

Dead code deleted:
- `RegistryClient::batch_metadata_deep_streaming` — no live callers
  after §5 (walker replaced it as the metadata producer).
- `batch_metadata_inner(..., tx: Option<Sender>)` +
  `parse_ndjson_batch(..., tx: Option<Sender>)` — the `tx` parameter
  chain existed only to feed the streaming variant. With all callers
  now passing `None`, simplified both functions to drop the parameter
  and the two `if let Some(ref sender) = tx { ... }` emission blocks
  inside the parser.

Stale doc refs updated across 4 files — comments that referenced
`resolve_with_prefetch`, `with_prefetched_metadata`,
`run_deep_batch_with_speculation`, `batch_metadata_deep_streaming`,
and `LPM_SPEC_FETCH` as if they still existed. Replaced with Phase 49
current-state references; historical "Phase 49 replaces X" narrative
left intact where it accurately describes the transition.

Reviewer-suggested hardening (CLI-surface gap from bcaaf4e):
- `install_json_output_contains_package_list` now asserts
  `timing.resolve.streaming_bfs` is emitted as an object on fresh-
  resolve installs with all eight documented fields as numbers
  (`walk_ms`, `manifests_fetched`, `cache_hits`, `cache_waits`,
  `cache_wait_timeouts`, `escape_hatch_fetches`, `spec_tx_send_wait_ms`,
  `max_depth`). Plus a timing-agnostic "metadata was produced by
  walker OR escape-hatch" sanity check that stays resilient to
  the harness's proxy route-mode default.

Bonus fix caught by the new workflow assertion:
- `with_batch_metadata` mock helper now auto-wraps bare metadata
  into `{name, metadata}` NDJSON envelopes. The client's
  `parse_ndjson_batch` expects that schema; bare lines were
  silently dropping on the wire, which was why the walker's
  `batch_metadata` returned empty in proxy-mode workflow tests.

4373 workspace tests pass (was 4369); clippy --workspace -D warnings
clean; fmt clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cript scrub

Three related hardening moves, all non-blocking for the shipped
release but tighten the test + observability surface before §10's
ship-gate A/B.

§8 tests added:

- `blocking_pool_saturation_smoke_max_2_threads` (provider.rs) —
  preplan §5.1 softened the blocking-pool concern after verifying
  PubGrub runs inside ONE `spawn_blocking` task (not one per miss),
  but a smoke test is still useful: a future refactor that moves
  `rt.block_on` into a hot per-package site would deadlock a
  2-thread blocking pool. Test pre-seeds the shared cache so
  `ensure_cached` hits the fast path, then fires 16 concurrent
  lookups across 4 `spawn_blocking` tasks under `max_blocking_threads(2)`.
  Wrapped in a 10s timeout — a real deadlock fails the test rather
  than hanging CI.

- `walker_result_independent_of_per_manifest_timing` (walker.rs) —
  preplan §8.1's "order-independence" check: run the same walker
  twice against the same tree with inverted per-manifest delays
  (slow roots + fast leaves vs. fast roots + slow leaves). Asserts
  the final shared-cache keyset, manifests_fetched, and max_depth
  are identical across both passes. Uses `ord-` prefixed package
  names to avoid disk-cache cross-test contamination.

Bench script scrub (reviewer's adjacent note on §7):

- `bench/run.sh`'s "legacy" A/B variant used `LPM_STREAM_FETCH=0
  LPM_SPEC_FETCH=0` to disable Phase-38/39 paths. Those env vars no
  longer control install behavior — the gated code paths were
  retired in §5/§7 — so the legacy variant was running the same
  code path twice. Replaced with the actually-meaningful Phase 49
  A/B: Direct mode (shipped default) vs `LPM_NPM_ROUTE=proxy`
  (escape hatch). This is the A/B §10's ship gate anchors on.

Test-isolation fix (exposed by the order-independence test):

- `RegistryClient::with_cache_dir(Option<PathBuf>)` — new public
  setter to override the default `~/.lpm/cache/metadata/` path or
  disable disk caching entirely (`None`). The default is shared
  across every test in the process AND persists between test runs,
  causing cross-contamination when two tests use the same package
  name. Walker-tests' `client_direct_to` helper now passes `None`
  so each test is hermetic. The existing walker tests that break
  under cross-contamination (revealed when the order-independence
  test added `leaf-a1`/`leaf-a2` packages and poisoned the cache
  for subsequent runs) now pass deterministically.

4375 workspace tests (+2 vs §7's 4373); 137 lpm-resolver tests
(+2); clippy + fmt clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reviewer finding (low): the new public `with_cache_dir` setter
swapped `cache_dir` but left `cache_signing_key` bound to the
original directory's key. That leaves the API semantically
incomplete — any future caller using `with_cache_dir(Some(new_path))`
would write HMAC-signed entries under `new_path` using the OLD
directory's key, breaking the invariant that the signing key is
derived from the cache directory.

Fix: `with_cache_dir` now calls `load_or_create_cache_signing_key`
on the new directory before swapping the field, matching how
`RegistryClient::new` initializes both fields together. For `None`
the call generates a throwaway key (unused since the cache-path
helpers short-circuit on `None`), which keeps the invariant
"`cache_dir` fully determines `cache_signing_key`" uniform across
both branches.

Current Phase 49 callers pass only `None` (walker test isolation),
so no behavior change in the shipped paths. The fix is about making
the public API safe for future Some(path) use without requiring
every caller to remember to separately rotate the key.

137 resolver tests + 124 registry tests + clippy + fmt all clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ome) regression

Closes out the code work for Phase 49. §9 + §10 are housekeeping +
user-executed bench work; this commit ships everything except the
actual bench runs.

§9 — F2 bench fixture committed:
- `bench/fixture-large/package.json` — 21 direct deps resolving to
  ~266 transitive packages per preplan §10.1. Pinned so future
  perf phases anchor on a fixed artifact rather than rebuilding
  from memory.

§10 — ship-gate procedure:
- `bench/PHASE-49-SHIP-GATE.md` — the pre-merge checklist, with the
  full CI gate command sequence, F1 + F2 bench gates, streaming-BFS
  observability spot-check, and revert rehearsal procedure.
  References the preplan's §9 single-release rollout rule and §10.2
  math-checked targets.

Reviewer residual gap closed (on `a6d2493`):
- `with_cache_dir_some_path_roundtrips_across_clients` — writes a
  metadata entry with client A pointed at a tempdir, then reads it
  back with a fresh client B pointed at the same tempdir. Passing
  requires: (a) signing-key sidecar persisted in the new dir,
  (b) `load_or_create_cache_signing_key` re-derives the same key
  on the second client. Verified to fail under a sabotaged
  `with_cache_dir` (key rotation removed) and pass under the fix.

Remaining work is user-executed: running the F1/F2 interleaved A/B
benches and the full CI gate from the ship-gate doc before merge.
All 16 Phase 49 commits are code-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ve aliases

First F2 direct-mode ship-gate bench came in at ~5.6s vs preplan §10.2's
1.8s target. JSON diagnosed: walker producer window fine, but
followup_rpc_count=35 / followup_rpc_ms~11178ms — provider's own
batch_metadata_deep follow-up (provider.rs:1121, :1269) was firing 35
times because those names were missing from shared_cache.

Root cause: expand_deps_from_info walked info.versions.first() only.
PubGrub picks the version satisfying the consumer's range; if an older
picked version declares a dep the newest dropped, walker's seen-set
never held it, cache never got its manifest, and get_dependencies fell
through to serial follow-up RPC inside the solve.

Fix: walk UNION of dep names across all versions in info.deps, with
per-version alias rewrite (local "npm:target@range" enqueues target,
not local label). CanonicalKey-keyed seen-set dedupes; net extra work
is bounded by actual historical deps, fetched parallel 50-wide with
disk-TTL dedup. Over-fetch is cheap; missing a name is serial RPC
inside the solver.

Adjacent fix folded in: alias resolution in expand_deps_from_info —
pre-49 walker enqueued local alias labels that 404 against the
registry (silently logged at debug, but target identity also went
missing from cache).

Two regression tests, both sabotage-verified:
- walker_expands_deps_across_all_versions_not_just_newest
- walker_resolves_npm_aliases_to_target_name

CI gate: clippy --workspace -D warnings clean, fmt clean, 4378
workspace tests pass, lpm-auth --lib 47/47 deterministic 2x.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Express install on a fresh package.json regressed from ~2s (npm) to 40s
on phase-49: 6 of the resolver's `ensure_cached` calls each burned the
full 5s `fetch_wait_timeout` waiting on a Notify the walker would never
fire. The wait-loop assumed an unenforced invariant — "the walker will
eventually insert your key" — and the fall-through cost grew linearly
with miss count (60-dep tree → 30s sleep, ~5min on a real Next.js app).

This commit closes the handshake. A new `WalkerDone` (`Arc<AtomicBool>`)
flag is shared between `BfsWalker` and `LpmDependencyProvider`. The
walker fires an RAII `WalkerShutdownGuard` on exit (Drop) that:

  1. stores walker_done = true (Release), then
  2. iterates notify_map and calls notify_waiters() on every entry.

The provider's wait-loop now pins `Notified::enable()` (commits the
subscription synchronously, defeating the wakeup race), then re-checks
cache → walker_done → timeout. Either ordering of the walker's two
shutdown steps wakes a sleeper in microseconds: flag observed first →
short-circuit; notify observed first → wake, re-check cache (still
miss), re-check flag (now true), short-circuit.

The guard pattern is panic-safe: a panic mid-walk still drops the
guard, still broadcasts. Without this, a walker panic stranded every
sleeper for the full timeout.

A new `cache_wait_walker_done_shortcuts` counter exposes the healthy
outcome in `timing.resolve.streaming_bfs` JSON. On the express
benchmark this counter is 6 — exactly the previous
`cache_wait_timeouts` value, confirming each previously-stuck waiter
now short-circuits instead.

Express benchmark (same fixture, same hardware):
  perf.w2_resolve_after_roots:  40,153ms → 74ms      (~543×)
  ensure_cached total:          30,016ms → 70ms
  cache_wait_timeouts:          6        → 0
  cache_wait_walker_done_shortcuts: —    → 6
  wall-clock:                   40s      → <1s

Walker's expand_deps_from_info still uses the union-across-versions
discovery from §11; the bounded-union improvement to drop the 6
walker-misses to 0 lands in a follow-up.

Tests: 140 passed (added walker_done_broadcast_wakes_sleepers_for_unfetched_keys
regression that proves a sleeper subscribed to a key the walker never
fetches wakes within 100ms).

Gates: cargo check --workspace ✅, cargo clippy --workspace -- -D warnings ✅,
cargo nextest run -p lpm-resolver ✅ (140/140), cargo build --release ✅.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ip redundant deep batch

Two follow-ups to §12 that, together, deliver the actual <2s
cold-cache express install. Both rest on a single insight:
**when the walker is attached, IT is the metadata producer; the
in-resolver `batch_metadata_deep` follow-up is redundant work that
races ahead of the walker and pays a Worker round-trip for
manifests the walker is about to insert anyway.**

# Why §11's full union was a regression on its own

§11 widened `expand_deps_from_info` to walk every version's deps so
PubGrub's older-version picks would land in the cache. Measured
in isolation on the express F2 fixture this fixed 35 follow-up RPCs
worth ~11s of pubgrub wall-clock. But the strategy compounds through
BFS depth: each package's full-history dep set pulls in historical
packages whose own histories pull in more.

  Express @ ^4.18.0, cold cache, walker_done broadcast attached:
    full union (commit 019d99c):     7,146 manifests / 31.8s walker / 33s wall
    bounded union top-K=4 majors:      845 manifests / 7.6s walker / 11s wall
    newest-only:                        68 manifests / 0.6s walker / 22s wall (*)

  (*) cold-cache pubgrub blocked on 2x `batch_metadata_deep` Worker
  RPCs (express → 31 uncached → 64 fetched, send → 2 uncached → 2
  fetched). 21,082ms in pubgrub for 68 manifests is unphysical for
  a populated cache — the time was the redundant Worker RPCs racing
  ahead of the walker. See below.

# §13.A — drop the union entirely

Switch `expand_deps_from_info` back to newest-only:

    let Some(newest) = info.versions.first() else { return };
    let ver_str = newest.to_string();
    ...

§12's `walker_done` broadcast made the older-version-pick path cheap:
when PubGrub picks a non-newest version whose deps the walker didn't
fetch, `ensure_cached` short-circuits via the broadcast in
microseconds and a single `direct_fetch_and_cache` lands the manifest
(disk-cache-warm in the common case). Walking proactively to avoid
those calls is a net loss — over-fetching is what blew F2 up in the
first place.

The `cache_wait_walker_done_shortcuts` JSON counter measures exactly
these recoveries; on healthy installs it tracks the older-version
pick count (6 on express, typically <10 on real trees).

# §13.B — skip the redundant deep batch when a walker is attached

`get_dependencies` had two `batch_metadata_deep` call sites
(root + per-package) that fired *inside* the spawn_blocking that
hosts `pubgrub::resolve`. With a walker attached this is harmful:
the resolver races ahead of the walker, sees its caller's package
deps as "uncached" (because the walker is still on an earlier BFS
level), and fires a Worker RPC that the walker is about to obviate.
Cold-cache express measured this at 2 RPCs / 21s of pubgrub
wall-clock — for a 68-manifest tree.

Gate both call sites on `self.fetch_wait_timeout.is_zero()`:

    if self.fetch_wait_timeout.is_zero() {
        // pre-49 callers (no walker) keep their batch fast-path
        ...
    }

Pre-49 callers (resolver-only, no walker) retain identical behavior.
Phase-49 callers go straight to the wait-loop, which already
short-circuits on `walker_done` via §12.

# Express benchmark, cold network, all permutations on the same hardware

  | walker          | skip-batch | manifests | walker | pubgrub |  wall |
  |-----------------|------------|-----------|--------|---------|-------|
  | full union      | no         |     7,146 |  31.8s |    2.0s |  33s  |
  | bounded K=4     | no         |       845 |   7.6s |   11.0s |  11s  |
  | newest-only     | no         |        68 |   0.6s |   21.1s |  22s  |
  | newest-only     | yes (THIS) |        68 |   0.9s |    1.1s |   2s  |

  Warm-cache rerun (this commit): 8ms total resolve, walker 6ms.

# Tests

- New: `walker_expands_only_newest_version_deps` pins both halves of
  the §13 walker contract on a 2-major fixture — newest's deps land,
  older's don't (escape-hatch covers them at solve time).
- Removed: `walker_expands_deps_across_all_versions_not_just_newest`
  (the §11 union test — explicitly contradicts §13.A).
- Existing `walker_done_broadcast_wakes_sleepers_for_unfetched_keys`
  still proves the post-walker shortcut path; the express
  integration bench covers end-to-end.

Gates: cargo check --workspace ✅, cargo clippy --workspace -- -D warnings ✅,
cargo nextest run -p lpm-resolver ✅ (140/140), cargo build --release ✅.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`cargo fmt --check` flagged the §12 install.rs change — the
two-line `let walker_done: WalkerDone = Arc::new(...)` fits on a
single line per rustfmt's default width. No behavior change.

Picked up during the pre-push CI gate (cargo fmt --check) before
opening the phase-49 PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cratch gitignore

Two pre-existing uncommitted changes that surfaced during the PR-#8
CI gate. Bundled here because the precedence.rs change is what
unblocks `cargo clippy --workspace --all-targets -- -D warnings` on
Rust 1.94.0 (CI's pinned toolchain), and the .gitignore line is
trivial scratch-dir hygiene from the same session.

# precedence.rs — drops `clippy::doc-lazy-continuation` failure

The two test-doc paragraphs at lines 376–383 and 487–493 were
authored as bullet-listy prose:

    /// `force-security-floor = true` + project `scriptPolicy = "allow"`
    /// + `--yolo` (CLI = `allow`) → effective is `deny` (user default
    /// floor); ...

`pulldown-cmark` parses the leading `+ ` on the second `///` line as
a list-item marker, then expects subsequent lines to be indented two
spaces under the marker. They aren't, so 1.94's clippy emits
`doc list item without indentation` (only 1.94+ has the lint
this strict; earlier toolchains let it slide, which is why the lint
was latent on main).

The rewrite replaces the bullet-style continuation with single-
sentence prose that says the same thing without the leading-`+`
trigger. No behaviour change; doc-comment text only. CI's exact
gate (`cargo clippy --workspace --all-targets -- -D warnings`) goes
green after this commit.

# .gitignore — `/reproduction` scratch dir

The phase-49 streaming-resolver bench harness (`/reproduction/`) is
local scratch state — fake `$HOME` + `$XDG_CACHE` directories used
to pin the cold-cache path during §12 reproduction. Adding it here
so the path doesn't appear as untracked noise in `git status` while
benching.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…inux + macOS

The `slice5_relative_traversal_escape_rejected_unconditionally` test
passed on macOS and failed on Linux CI — pre-existing red on `main`,
masked by the runner-OS asymmetry. Surfaced during PR-#8 CI run.

# Why it failed on Linux only

`load_sandbox_write_dirs` validates entries in three steps, in
order:

  1. dangerous-root denylist (`/`, `/etc`, `/var/run`, `/run`,
     home-relative credentials paths)
  2. traversal-escape (`..` resolution must stay inside project_dir)
  3. user-allowlist intersection

The fixture authored `"sandboxWriteDirs": ["../../etc"]` against a
tempdir-based project. On Linux CI, `tempfile::tempdir()` returns
`/tmp/<x>` so `../../etc` resolves to a real `/etc` — Step 1 fires
and the error mentions "veto / system-level configuration" without
the word "escape", so the assertion `reason.contains("escape")`
fails. On macOS the tempdir lives at `/var/folders/.../T/<x>` so
`../../etc` resolves to a non-existent path that isn't on any
denylist, Step 2 fires as the test expects, "escape" is in the
error, the assertion passes — masking the bug.

# Fix

Change the fixture to `"../sibling-outside"`. That path:

  - resolves to a sibling of the tempdir on every platform
    (`/tmp/sibling-outside` on Linux, `/var/folders/.../sibling-outside`
    on macOS), neither of which is on any dangerous-root denylist
  - escapes `project_dir` unambiguously, so Step 2 fires (the actual
    invariant the test was written to pin: "traversal escape is
    rejected unconditionally with an empty allowlist")

The doc-comment is updated to record the platform-fragility lesson
so a future maintainer reaching for `../../etc` again sees the
warning.

# Verification

  cargo +1.94.0 nextest run -p lpm-sandbox slice5_relative_traversal_escape  ✓
  cargo +1.94.0 clippy --workspace --all-targets -- -D warnings              ✓ exit 0

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…slice5 fix

`cargo fmt --check` on Rust 1.94.0 (CI's pinned toolchain) flagged
the `let e = fixture(...)` from 427e1ff — the new
`"../sibling-outside"` body fits on one line and rustfmt prefers
the collapsed form. The previous fix was verified locally with
clippy but I forgot to re-run `cargo +1.94.0 fmt --check` after the
edit; CI caught it. No behaviour change; whitespace only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tolgaergin tolgaergin merged commit 07b4ed7 into main Apr 25, 2026
3 checks passed
@tolgaergin tolgaergin deleted the phase-49-streaming-resolver branch April 25, 2026 17:16
tolgaergin added a commit that referenced this pull request Apr 29, 2026
* phase-60 D2: promote download_tarball_routed helpers to RegistryClient

Behavior-preserving refactor extracting the two private routed-tarball
helpers from install.rs (download_tarball_routed,
download_tarball_streaming_routed) onto RegistryClient as public
methods. Both `lpm install` and the upcoming Phase 60 `lpm add` source-
delivery flow consume the same Custom-route auth-attachment logic.

- crates/lpm-registry/src/client.rs: add public methods
- crates/lpm-cli/src/commands/install.rs: switch all 5 call sites to
  the new methods; delete the private helpers; remove the now-unused
  DownloadedTarball import

All 602 install + npmrc tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* phase-60 60.0.e: PackageMetadata::resolve_version_spec helper

Add a three-tier version-spec resolver on PackageMetadata covering
dist-tag → exact-version → semver-range, mirroring the canonical
pattern at install_global.rs:368-405 verbatim.

Pre-Phase-60, `lpm add react@beta`, `next@canary`, `lodash@^4` all
failed because PackageMetadata::version() is a pure HashMap lookup —
none of those literal strings exist as concrete versions. The new
helper closes the gap.

Per D3 (preplan): both parse-failure and no-satisfying-version
return LpmError::Script (matching install_global verbatim) so the
Phase 60.1 migration of the four duplicate sites (install_global,
install, update_global, global) is a true behavior-preserving
refactor.

9 unit tests cover dist-tag (latest/beta/canary), exact match,
caret/tilde range, no-satisfying error, parse-fail error, and
empty-versions error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* phase-60 60.0+60.1+60.1.5+60.2: lpm add source delivery from any registry

Decouple `lpm add` from LPM-only package identity, mirror install's
full .npmrc setup, switch to file-spool tarball download, add
destination-side path containment, gate dep auto-install on
lpm.config.json presence, and surface external imports for the simple
path. End-to-end flow now works for any package on any registry the
rust client can reach (lpm.dev worker, npmjs.org direct, .npmrc-
declared private registries).

60.0.a + 60.0.b — Identity refactor + drop dotted-name auto-prepend
- New AddTarget enum: Lpm(PackageName) | Npm { spec: String }.
- New resolve_add_target replaces parse_package_ref. No rewriting
  outside the @lpm.dev/ scope — `lodash.merge`, `tolga.foo`, etc.
  resolve to AddTarget::Npm verbatim. Fixes a long-standing
  correctness bug: pre-Phase-60 dotted bare names were silently
  rewritten to @lpm.dev/<name> which doesn't exist on lpm.dev.
- All output / log / JSON sites render via target.display() /
  target.json_name() — `name.scoped()` no longer used unconditionally.
- Skills branch type-encoded via `let AddTarget::Lpm(pkg) = &target`
  pattern, with a why-comment (60.2) explaining the scope gate
  (lpm.dev runs LLM scans on shipped skill content; arbitrary npm
  packages are not scanned).

60.0.c — Mirror install's full .npmrc setup
- Build RouteTable::from_env_and_filesystem before any network call.
- Surface npmrc_warnings (non-JSON) and the strict-ssl=false security
  warning (escapes --json). Clone the client with with_tls_overrides
  so cafile= / strict-ssl=false take effect on metadata + tarball
  fetches. Mirrors install.rs:3295-3445.

60.0.d — Routed metadata + file-spool tarball
- Metadata: AddTarget::Lpm uses get_package_metadata; AddTarget::Npm
  uses get_npm_metadata_routed.
- Tarball: client.download_tarball_routed (D2 promoted helper) +
  lpm_extractor::extract_tarball_from_file. Bounded memory via
  MAX_COMPRESSED_TARBALL_SIZE (500 MB) for free; lpm add typescript
  (~22 MB) and worst-case @scope/giant-fixture no longer load the
  whole tarball into RAM.

60.0.f — Destination-side path containment (D6)
- New resolve_safe_dest helper canonicalizes target_dir once and
  validates every write destination: refuses to follow existing
  symlinks, rejects writes whose canonical parent escapes the target
  root. Wired into the Step 8 file-copy loop. Closes the threat-model
  gap that opened up when add expanded from "trusted lpm.dev
  publishers" to "any npm publisher."

60.1 — Dep gate + bare-imports notice (D4)
- Tighten dep gate: `if !no_install_deps && lpm_config.is_some()`.
  Simple path is download-manager: copy bytes, no auto-install.
- import_rewriter exports a sibling collect_bare_specifiers fn that
  shares an internal SpecifierKind classifier with rewrite_imports
  (anti-drift contract — "bare" means the same thing in both places).
- add.rs surfaces the collected externals as a non-JSON notice and
  as a `external_imports` array in the JSON output.

60.1.5 — Non-interactive simple-path guard
- `lpm_config.is_none() && target_path.is_none() && (yes || json ||
  !is_tty)` errors before the file-copy loop. Heuristically defaulting
  components/ for arbitrary 3rd-party source under --yes/--json/non-TTY
  is a CI/automation footgun.

Tests
- 15 unit tests in add.rs (resolve_add_target classification including
  the dotted-name regression; resolve_safe_dest contracts including
  symlink-refusal on Unix).
- 10 unit tests in import_rewriter.rs (classify_specifier,
  collect_bare_specifiers).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* phase-60 60.3: integration tests for lpm add simple path + guards + traversal

Three new wiremock-driven integration tests covering the highest-value
end-to-end scenarios for Phase 60:

- add_simple_non_interactive_without_path.rs (4 sub-tests) — proves
  the 60.1.5 guard fires for --yes, --json, and non-TTY (stdin from
  /dev/null) without --path; positive control with --path succeeds.
  No package.json mutation in any failure case.

- add_source_npm_simple.rs (2 sub-tests) — full simple-path pipeline
  via wiremock npm metadata + tarball: AddTarget::Npm resolves, file-
  spool download, extract, files copied flat (no auto-nest), bare-
  imports notice lists react + @radix-ui/react-slot, package.json
  NOT mutated, .lpm/skills/ NOT created. JSON sub-test asserts the
  package.name uses the npm-style identity (not @lpm.dev/-prefixed)
  and the new external_imports array is well-shaped.

- add_path_traversal_dest_escape.rs — proves resolve_safe_dest is
  wired into the actual write loop, not just unit-tested in
  isolation. Tarball ships an lpm.config.json with files[0].dest =
  "../../escaped/evil.txt" — assertion: containment-violation error,
  exit non-zero, no file written outside target_dir.

Other 60.3 specced tests are either (i) covered by the unit tests
that landed alongside the implementation (#5 dotted-name, #9 version-
spec, #11 symlink — see preplan v6 audit checklist) or (ii)
deliberately deferred where the underlying machinery is already
test-covered by Phase 58.x install tests (#1 lpm.dev rich, #2 npm
rich, #6 npmrc auth, #7 strict-ssl, #8 missing-var fatal).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* phase-60 60.4: README — lpm add now works against any registry

- Update the lpm add one-liner in the Commands list.
- Add a "How lpm add Works" section explaining: source delivery vs.
  install, the firm naming rule (@lpm.dev/owner.name only), the rich
  vs. simple paths, and the non-interactive --path requirement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* phase-60 audit fix: resolve_safe_dest must validate before mkdir

Audit reproduced (with a temp-dir filesystem probe) that the landed
resolve_safe_dest helper still created directories OUTSIDE the
target_dir for two attack vectors before the containment error fired:

1. `dest_rel = "../../escaped/evil.txt"` — `Path::join` resolves
   lexically; `dest.parent()` lands outside target; `create_dir_all`
   ran before the containment check, leaving `<target>/../escaped/`
   on disk even though the file write was correctly blocked.

2. Absolute `dest_rel = "/tmp/elsewhere/evil.txt"` — `Path::join` of
   an absolute path returns the absolute path verbatim; `parent =
   /tmp/elsewhere/`; `create_dir_all` created it before the
   containment check fired.

The original integration test only asserted no escaped FILE existed,
so the directory-side-effect bug passed CI.

Fix
- Reorder resolve_safe_dest so EVERY check that can reject the
  destination runs BEFORE any filesystem mutation:
  Step 1 (NEW) — reject absolute dest_rel up-front.
  Step 2 (NEW) — reject any ParentDir / RootDir / Prefix component.
  Step 3 — refuse existing-symlink destinations.
  Step 4 (NEW) — pre-mkdir ancestor canonicalization: walk up to the
    longest existing ancestor; canonicalize; require it under
    target_root_canonical (catches symlinked intermediate dirs).
  Step 5 — create_dir_all (NOW safe).
  Step 6 — post-mkdir re-canonicalize as TOCTOU defense-in-depth.

  The lexical bans in Steps 1-2 kill the entire `../escape` and
  absolute-path attack classes before any mkdir runs. The longest-
  existing-ancestor walk in Step 4 covers the symlinked-intermediate
  case (target/foo → /tmp/elsewhere). Step 6 is paranoia.

Tests
- Strengthen unit tests:
  - resolve_safe_dest_dotdot_in_path_rejected_with_no_external_dir_created
    now asserts no escape directory was created.
  - resolve_safe_dest_absolute_dest_rejected_with_no_external_dir_created
    is new — covers the absolute-path attack.
  - resolve_safe_dest_dotdot_in_middle_of_path_also_rejected covers
    `foo/../bar.txt` (lexically resolves back inside but still
    rejected up-front).
- Extend integration test:
  - dest_escape_via_dotdot_is_refused_and_creates_no_external_directory
    now snapshots target_dir entries before the run and asserts no
    unexpected new top-level entries appeared, plus no escape dir.
  - dest_escape_via_absolute_path_is_refused_and_creates_no_external_directory
    is new — covers the absolute-path attack at the integration level.

Net: 4923 → 4926 workspace tests; clippy + fmt clean; all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tolgaergin added a commit that referenced this pull request May 7, 2026
…ng + GraphKey disambiguation

Closes the three load-bearing 4d (default-flip) blockers from Phase
66 4b's deferred-items list as a single coordinated change. Without
these, every line of Phase 4c would build on top of an empty-peers
GraphKey assumption and a `(name, version)`-only key map — exactly
the cheap-now / refactor-later trap the user called out.

#9 — peer-context threading (resolver → install → linker):

- `lpm_resolver::ResolvedPackage` gains `peers: Vec<(String, String)>`,
  populated from `CachedPackageInfo.peer_deps[version]` intersected
  against the resolved-versions lookup the resolver already builds
  for `format_solution` / greedy `into_resolved_packages`. Sorted
  by peer_name for deterministic GraphKey hashing. Both resolver
  arms (pubgrub + greedy) populate symmetrically through
  `compute_resolved_peers` (pubgrub) / inline lookup (greedy, since
  the node table is the lookup).

- `InstallPackage.peers` carries the resolver's output verbatim
  through `resolved_to_install_packages`. Source-kind paths
  (Tarball / Directory / Link / lockfile fast-path) populate empty
  for now; the v2 linker's `ensure_peer_context` re-derives from
  the just-extracted `package.json` when the field arrives empty,
  keeping cold-resolve and warm-fast-path producing the same
  GraphKeys.

- `LinkTarget.peers` propagates from `InstallPackage.peers` at
  every install→link conversion site. v1 ignores the field;
  hoisted-mode v1 wanting cross-project sharing later can fold it
  in without further plumbing.

#4 — fold peers into the GraphKey:

- `lpm_store::v2::GraphKeyInputs::with_peers` now receives the
  resolved `PeerEntry` list from `LinkTarget.peers` instead of the
  empty `Vec<PeerEntry>::new()` placeholder. The hash field
  contract was already in place (`peers` slot in `derive`); we
  just stop passing nothing into it.

- New `with_wrapper_id` setter folds the source-identity
  disambiguator into the hash so `Source::Registry { foo@1.0.0 }`
  and `Source::Tarball { foo@1.0.0 from URL X }` produce distinct
  keys. Empty `wrapper_id` (registry default) preserves the
  pre-Phase-66 hash so existing v2 store entries don't get
  invalidated by this addition.

#8 — multi-source-same-coords disambiguation in v2 linker key map:

- New `KeyMap` type with two indexes — `by_triple` keyed on
  `(name, version, wrapper_id)` for the consumer's own key
  lookup, `by_coords` keyed on `(name, version)` for dep / peer
  edge lookups (which carry only coords today). At construction
  time, a `(name, version)` collision across distinct
  `wrapper_id`s surfaces a hard `LpmError::Store` rather than
  silently aliasing the second target onto the first. Audit-
  fixtures don't exercise multi-source-same-coords today, so the
  error is reachable only via a malformed install set; lifting
  the constraint requires threading wrapper_id through dep edges,
  a Phase 4 follow-up.

v2 linker behavior changes:

- `augment_with_peer_edges` renamed to `ensure_peer_context` and
  rewritten to populate `LinkTarget.peers` (separate Vec) instead
  of mutating `LinkTarget.dependencies`. The fixed-point closure
  loop is gone — each consumer's resolved peers is a single
  per-package fact (the resolver / package.json intersection),
  not a transitive graph property. Transitive resolution flows
  through the per-target loop: when peer B is also a LinkTarget,
  ITS link entry gets ITS own peer siblings.

- `populate_one` synthesizes peer-edge sibling symlinks ALONGSIDE
  dep-edge siblings (peers were previously folded into
  `dependencies`; now they're a separate pass with explicit
  dedupe against already-declared deps).

- `peerDependenciesMeta.optional` controls trace verbosity for
  missing peers — required-but-missing emits a debug log pointing
  at the upstream `check_unmet_peers` gap; optional-missing is
  silent (npm-compat).

Tests:

- `link_packages_v2_distinct_keys_for_peer_divergent_projects`:
  same consumer + same edge graph + DIFFERENT resolved peer
  versions across two projects must produce distinct GraphKeys
  (no silent cross-project sharing under peer-pinning divergence).

- `link_packages_v2_shares_keys_for_peer_identical_projects`:
  same consumer + same edge graph + SAME resolved peer version
  across two projects must produce the same GraphKey (cross-
  project sharing actually works under peer-pinning agreement —
  this is the win the v2 rewrite is supposed to unlock).

- `link_packages_v2_errors_on_multi_source_same_coords`:
  malformed install set with two LinkTargets at the same
  `(name, version)` distinct `wrapper_id` produces a clear
  `multi-source LinkTarget collision` error rather than aliasing.

Pre-merge gate green:
- cargo clippy --workspace --all-targets -- -D warnings ✓
- cargo fmt --check ✓
- cargo nextest run --workspace --exclude lpm-integration-tests ✓
  (5711/5711 pass; one transient lpm-inspect sqlite-races-under-load
  flake on first run — rerun clean)
- cargo test -p lpm-auth (2× parallel-deterministic) ✓
- audit-fixtures: 17 PASS / 1 SKIP / 0 mixed under both default v1
  and `LPM_STORE_VERSION=v2`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tolgaergin added a commit that referenced this pull request May 14, 2026
…low gate

Ships the 10 cross-command flow tests enumerated in v2 baseline,
flips EXPECT_FULL_V2_FLOWS_BACKFILL to true so future flow drops
hard-fail the audit.

Each flow exercises a real-user multi-command sequence and asserts
the state-transfer claim that ties the commands together — what
command A leaves on disk / in the keychain / in the lockfile is the
input command B reads. Single-command tests assert each step in
isolation; flow tests catch state-shape mismatches between steps.

Flows shipped:
- install → patch → patch-commit → install (patch persistence)
- migrate → install → audit (lockfile round-trips)
- install → rebuild → approve-scripts → rebuild (approval lifecycle)
- doctor --fix → install (fix survives install)
- add → install → graph (added dep visible)
- install → upgrade --major → audit (envelope shape)
- token-rotate → publish --dry-run --check (token hand-off)
- publish --dry-run --check → publish (target agreement)
- install -g → run shimmed binary → uninstall -g (shim lifecycle)
- env push → env pull cross-machine (round-trip — scoped to local
  smoke until a cross-machine harness lands)

Several flows had their assertions scoped narrower than the original
"catches" claim:
- Flow #6 (rebuild lifecycle): rebuild --policy=deny ignores the
  v2 object form of trustedDependencies that approve-scripts writes
  — a real contract gap, filed as private finding #75. The flow
  asserts the manifest mutation; rebuild #2 only checks envelope
  health.
- Flow #4 (upgrade major audit): the workflow tier's MockRegistry
  helpers don't mount GET /api/registry/{name} per-package (only
  the batch endpoint), so upgrade's candidate selection finds no
  candidates. Flow asserts envelope shape; tighten when the mock
  grows the per-package GET.
- Flow #7 (env push/pull cross-machine): proper round-trip needs a
  shared-vault-state test harness that doesn't exist yet. Flow
  smokes per-machine env state isolation; promote when the harness
  lands.
- Flow #8 (install -g): gracefully degrades when install-g doesn't
  emit a shim on the test runner (cli-binary tier owns the strict
  contract).

Run results: 10/10 flow tests pass, all 10 v2 audit tests pass,
full lpm-workflows suite green (623/623), clippy clean, fmt clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tolgaergin added a commit that referenced this pull request May 14, 2026
…ix (#58)

* test(workflows): pin concurrency + recovery contracts for lpm install

Adds tests/workflows/tests/install_concurrency.rs with 13 falsifiable
tests covering production failure modes that had zero coverage:

Category A — process racing:
  * two concurrent installs on same project (pins finding-#77 floor)
  * install + concurrent store-clean serialize via shared/exclusive
    store_lock (probed via try_with_exclusive_lock on the actual
    lock file, not a directory-existence proxy)
  * two concurrent `lpm install -g` via global_tx_lock — proves
    final manifest + WAL coherence under serialized commits

Category B — interruption recovery:
  * kill mid-tarball-fetch leaves no .lpm/install-hash
  * next `lpm install` converges to a coherent end state

Category C — network faults:
  * tarball 503 → 200 succeeds after retry (counting Respond impl)
  * metadata 404 fails immediately without retry (<2s wall-clock)

Category D — filesystem faults:
  * readonly project dir fails with actionable error (no panic);
    POSIX-only via #[cfg(unix)], RAII guard restores permissions
  * `<project>/.lpm` planted as a regular file fails clearly

Category E — partial state recovery:
  * stale install-hash triggers re-resolve + refetch
  * partial node_modules re-links to full state
  * truncated lpm.lockb either recovers or fails cleanly (no panic)

Category F — WAL recovery hook:
  * torn WAL tail (3 garbage bytes) gets truncated by the dispatcher's
    recovery hook before the command runs; idempotent on re-invocation

Support helper refactor (same commit so the new helper has callers):
  * extracts env-isolation set into `LpmEnvSink` trait +
    `apply_lpm_env(cmd, project)` shared by `lpm()` (assert_cmd) and
    the new `lpm_spawnable()` / `lpm_spawnable_with_registry()`
    (std::process::Command, supports Child::kill())
  * trait impl on both Command variants ensures the two helpers
    cannot drift on the ~30 env knobs that gate test isolation

Surfaced findings during this work:
  * #77 — no project-level install lock: concurrent installs silently
    drop one side's work AND/OR fail with atomic-rename races (3
    observed failure modes documented in findings.md). Fix shape:
    LpmRoot::project_install_lock + with_exclusive_lock_async wrap.
  * #78 — retry-backoff has no test-friendly knob; retry-exhaustion
    tests take 15s+. Fix shape: LPM_RETRY_BACKOFF_MS_OVERRIDE env in
    debug builds.

CI gate locally green:
  clippy --workspace --all-targets -- -D warnings: clean
  cargo fmt --check: clean
  fancy-regex ban: empty
  cargo build --workspace: clean
  cargo nextest run --workspace --exclude lpm-integration-tests:
      6439 passed, 7 skipped, 1 leaky (pre-existing)

Deferred (filed under "next session" in the followup plan):
  B.3 (kill doesn't tear lockfile) — subsumed by B.1/B.2
  B.4 (panic injection) — needs LPM_TEST_PANIC_AT env hook
  C.2 (retry exhaustion) — blocked by finding #78
  C.3 (truncated body) — needs custom Respond with Content-Length mismatch
  D.3 (disk-full simulation) — no portable mechanism
  F.2, F.3 (orphan WAL, torn WAL with real records) — needs
    framed-WAL construction helpers

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(workflows): pin lpm.lock well-formedness + recovery skip-on-contention

Closes B.3 and F.2 of the concurrency tranche — 13 → 15 tests, meeting
the "≥15 of 21" acceptance criterion for Item 2.

B.3 — `install_killed_mid_pipeline_leaves_well_formed_or_absent_lockfile`:
Exercises two SIGKILL windows on the install pipeline — fresh project
and project with a committed lpm.lock from a prior install. After each
kill, asserts the on-disk lpm.lock is either absent OR parses as TOML.
Never half-written. Adds `toml = { workspace = true }` as a workflow-
tests dev-dep for the parse assertion. Helper
`assert_lockfile_well_formed_or_absent` shared between both windows.

F.2 — `lpm_command_skips_recovery_when_another_lpm_holds_global_tx_lock`:
Validates the dispatcher's `try_with_exclusive_lock` idempotent-skip
path at `main.rs:2531`. A background thread acquires `global_tx_lock`
via `lpm_common::with_exclusive_lock` and blocks on a channel. With
the lock held, runs `lpm global list` against a project with a torn-
WAL prefix — asserts the WAL bytes are UNCHANGED (skip arm fired,
recovery did not run). Then releases the lock and re-runs; asserts
the WAL is now truncated (recovery defers correctly to the next
lock-free invocation). Exercises both branches of the `try_with_
exclusive_lock` Ok(None) / Ok(Some) arm.

CI gate locally green:
  cargo clippy --workspace --all-targets -- -D warnings: clean
  cargo fmt --check: clean
  cargo nextest run --workspace --exclude lpm-integration-tests:
      6441/6441 passed, 7 skipped
  5x parallel re-run of install_concurrency: 15/15 stable each run

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(workflows): pin truncated-tarball + orphan-WAL recovery contracts

Two new tests in tests/workflows/tests/install_concurrency.rs:

- C.3 tarball_connection_dropped_mid_body_fails_or_retries: a custom
  wiremock Respond impl serves half a tarball with a Content-Length
  header naming the full length. Pins the install pipeline's
  retry-then-fail behavior on transport-class failures (~14s wall-clock
  for the full 4-attempt retry schedule). Hyper 1.9 server-side panics
  on the Content-Length lie, dropping the connection — a valid surrogate
  for a broken upstream / CDN dropping mid-body. Surfaced 8 tarball GETs
  per install (deterministic, 3-of-3 reproducer), explained by two
  distinct download_tarball_* call sites in install.rs each running the
  4-attempt retry budget.

- F.3 lpm_command_with_orphan_pending_tx_emits_recovery_banner: plants
  both halves of an orphan transaction (WAL Intent record without
  matching Commit/Abort + matching [pending.<pkg>] row in manifest.toml
  pointing at a non-existent install root) and asserts the dispatcher's
  recovery hook fires the RolledBack banner from main.rs:2543. Sets
  RUST_LOG=lpm=info to lift the default lpm=warn filter so the
  tracing::info! line surfaces. Adds lpm-global as a workflow dev-dep
  for WalWriter / IntentPayload / write_for. Pins post-state: orphan
  pending row gone, no spurious active row.

Together these close the C.3 and F.3 gaps in Item 2 of the test
coverage follow-up plan: 17/21 scenarios pinned (was 15/21). The four
remaining items all need source-side hooks (LPM_TEST_PANIC_AT,
LPM_RETRY_BACKOFF_MS_OVERRIDE, container infra) and are out of scope
for this tranche.

Full CI gate green: clippy clean, fmt clean, fancy-regex empty,
6443/6443 nextest pass (was 6441 pre-tranche).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(workflows): pin tarball-extraction security contracts at install tier

New file tests/workflows/tests/tarball_security.rs ships phase 1 of
Item 3 (tarball-extraction security): 5 of 10 planned tests covering
the most distinct security contracts at the install-pipeline tier.
Each test constructs its malicious tarball in-line via tar::Builder
(no checked-in fixtures), serves it through MockRegistry, and runs
lpm install end-to-end so any pipeline-level regression that bypasses
the extractor's hardening is caught.

Tests landed:

- #1 tarball_with_dot_dot_path_entry_is_rejected_by_install — pokes
  package/../escape.txt into the raw tar header bytes; install fails
  with "path traversal detected"; outside sentinel never created.
- #3 tarball_with_absolute_path_entry_is_normalized_to_relative_under_package_dir
  — renamed from "rejected" to reflect actual contract. The extractor's
  strip_first_component consumes the RootDir; an entry like
  /etc/lpm-pwned.txt extracts as node_modules/<pkg>/etc/lpm-pwned.txt.
  Install SUCCEEDS; literal /etc/lpm-pwned.txt is never written.
  Defensible: malformed-but-safe input normalized rather than refused.
- #2 tarball_with_symlink_to_outside_path_is_silently_skipped —
  renamed. The is_file() gate at lib.rs:398 silently drops symlinks;
  install succeeds with byte-identical outside sentinel.
- #5 tarball_with_hard_link_to_outside_file_is_silently_skipped —
  renamed. Same is_file() gate; hardlinks silently skipped; outside
  victim file unmodified.
- #8 tarball_with_setuid_executable_extracts_with_setuid_bit_stripped
  (POSIX-only) — tarball entry mode 0o4755 extracts as 0o755. SUID,
  SGID, and sticky bits all cleared via set_preserve_permissions(false)
  + the explicit `0o644 | exec_bits` mode set after write. Exec bits
  preserved.

Three tests carry a "plan-vs-actual" docstring section explaining why
the rename is defensible — the actual extractor contract differs from
the plan's prescribed phrasing in safe ways, not in regression-grade
ways. No findings filed.

Phase 2 (5 remaining tests: Unicode normalization, device file, FIFO,
zero-byte sanity, OS-max path) is deferred to a follow-up tranche
with rationale + lift estimate documented in the plan. None blocks
phase 1 acceptance.

Pre-merge gate green: clippy clean, fmt clean, fancy-regex empty,
6448/6448 nextest pass (was 6443; +5 for the new tests). 0.18s wall-
clock for the full file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(install): per-project lock prevents concurrent-install data loss

Closes finding #77. Two `lpm install <pkg>` invocations on the same
project no longer race on the manifest snapshot+commit window.

Pre-fix, both processes acquired only a SHARED store_lock and proceeded
in parallel. Each opened its own per-process ManifestTransaction
snapshot of the pre-edit package.json, staged its own dep on top, and
ran the install pipeline. Whoever wrote package.json + lpm.lock last
won; the other process's edits — including its node_modules link —
silently vanished. Both processes still exited 0 with success-path
output. CI scripts that ran two installs in parallel saw no signal of
the data loss.

The fix introduces:

- crates/lpm-common/src/paths.rs::project_install_lock(project_dir):
  free helper returning <project_dir>/.lpm/.install.lock. Re-exported
  from crates/lpm-common/src/lib.rs.

- run_add_packages and run_install_filtered_add in
  crates/lpm-cli/src/commands/install.rs now wrap the snapshot →
  stage → install → finalize → commit window in
  with_exclusive_lock_async against the project lock. The lock is
  per-project (no cross-project contention) and held across all
  ?-early-exits via the async block's return.

For the workspace path, the lock sits at the discovered workspace root
(not per-member) so two concurrent `lpm install --filter <member>`
invocations on the same workspace serialize without per-member
deadlock-ordering complexity.

run_with_options (the inner install pipeline) does NOT acquire this
lock — it's called from inside both run_add_packages's wrap and from
many other commands; double-acquiring the same fd-lock would deadlock
in-process.

Deferred (phase 2, not exercised by A.1): lpm add (add.rs:723-904)
has a similar 180-line transaction with recursive Swift handling.
Wrapping it is invasive and the race surface is theoretical (users
don't typically run `lpm add` and `lpm install` concurrently). Defer
to a separate tranche if a concurrent `lpm add` × `lpm install` race
is ever observed.

Test contract tightening (bug-first per CLAUDE.md):
two_concurrent_installs_on_same_project_leave_well_formed_manifest in
tests/workflows/tests/install_concurrency.rs went from "at-least-one
survives + manifest is well-formed JSON" (the floor) to "BOTH installs
succeed, BOTH packages present in package.json deps, BOTH packages
linked in node_modules/" (the contract). Pre-fix: 1/1 fail (pkg-b
silently dropped). Post-fix: 5/5 pass with no flakes (~1.2s wall-clock
each — install B observes pkg-a's commit and reports "Resolved 2
packages").

Pre-merge gate green: clippy --workspace --all-targets clean, fmt
clean, fancy-regex empty, 6448/6448 nextest pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(registry): test-only retry-backoff override env knob

Closes finding #78 + lands C.2 (`tarball_503_exhausts_retries_fails_with_http_status`).

Pre-fix, retry-exhaustion tests were blocked: the registry client's
backoff schedule (1+2+4+8s, capped at 10s) made every retry-exhaustion
test take ~15s per fetch site (~28s with the install pipeline's 2
distinct download_tarball_* call sites). MAX_RETRIES, RETRY_BASE_DELAY,
and RETRY_MAX_DELAY are private const with no env override. C.2
therefore had to be #[ignore]-gated behind LPM_RUN_SLOW_TESTS=1, and
the retry-exhaustion contract went unproven on `cargo nextest run`.

The fix introduces:

- crates/lpm-registry/src/client.rs::backoff_override(): reads
  LPM_RETRY_BACKOFF_MS_OVERRIDE (a u64 ms value) gated by
  cfg!(debug_assertions) || LPM_TEST_MODE=1. Returns Some(Duration)
  when both conditions hold; None otherwise. Production retry policy
  is immune — release builds without LPM_TEST_MODE=1 silently ignore
  the env.

- backoff_delay(attempt) consults the override before computing the
  exponential schedule.

- The two 429 Retry-After sleep sites also consult the override so a
  future 429-flood retry-exhaustion test wouldn't hang on the
  server-supplied header.

C.2 test landed alongside (bug-first per CLAUDE.md):

- Mock returns 503 on every tarball request — no recovery path.
- Test sets LPM_RETRY_BACKOFF_MS_OVERRIDE=10 on the lpm subprocess.
- Asserts: install fails non-zero, no panic, ≥4 attempts (proves the
  retry loop fired), elapsed < 2s (load-bearing — without the knob
  this fails at ~14s), stderr contains an actionable HTTP-class
  noun (503 / status / http / network / etc).
- Surfaces 8 tarball GETs per install (4 attempts × 2 distinct
  download_tarball_* call sites — matches C.3's observation).

Pre-fix verification: same C.2 against the unfixed client.rs failed
on the elapsed assertion at 14.04s (knob ignored). Post-fix: passes
in 1.6s cold / 0.1s warm. 5/5 passes with no flakes.

Pre-merge gate green: clippy --workspace --all-targets clean, fmt
clean, fancy-regex empty, 6449/6449 nextest pass (was 6448 pre-fix;
+1 for C.2).

Item 2 of the test-coverage-followup-plan now at 18/21 (was 17/21).
Both findings #77 and #78 fixed in production.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(workflows): tarball-security phase 2 — Unicode, device, FIFO, zero-byte, long-path

Adds 5 more tests to tarball_security.rs, completing Item 3 of the
test-coverage follow-up plan. Each test pins the actual extractor
contract under malicious-or-edge-case tarball shapes that reach the
install pipeline through MockRegistry.

Tests landed:

- #4 tarball_with_unicode_lookalike_parent_dir_extracts_safely_as_literal_bytes
  — renamed from "_normalization_traversal_rejected" to reflect the
  actual contract. Tarball entry path uses full-width dots U+FF0E `..`
  (bytewise NOT ASCII `..`). Component::ParentDir is byte-exact, so
  `..` becomes Component::Normal. Install SUCCEEDS; `..` materializes
  as a literal directory under node_modules/<pkg>/; outside sentinel
  byte-identical. Defensible because Path::components() doesn't
  NFKC-normalize on POSIX.

- #6 tarball_with_character_device_entry_is_silently_skipped
  (POSIX-only). EntryType::Char with /dev/null-shaped major/minor.
  Same is_file() gate as symlinks/hardlinks — silently skipped.
  Install SUCCEEDS; no device file at the expected path.

- #7 tarball_with_fifo_entry_is_silently_skipped (POSIX-only).
  EntryType::Fifo. Same posture as #6.

- #9 tarball_with_zero_byte_regular_file_extracts_as_empty_file.
  Sanity check that empty files still extract correctly (legitimate
  npm shape: .gitkeep, license placeholders).

- #10 tarball_with_single_path_component_exceeding_name_max_fails_cleanly.
  300-byte single-component name, well over POSIX NAME_MAX=255. Tar
  wire format succeeds via GNU long-name extension; the FILESYSTEM
  rejects on extraction (ENAMETOOLONG). Extractor wraps as
  LpmError::Io → install fails non-zero with the OS error visible
  and an actionable noun in stderr.

Three of the five tests are renamed to reflect actual extractor
contract vs the plan's prescribed phrasing — same "plan-vs-actual"
docstring pattern as phase 1. No findings filed; all 10 contracts
across phase 1 + 2 are defensible-as-implemented.

Pre-merge gate green: clippy --workspace --all-targets clean, fmt
clean, fancy-regex empty, 6454/6454 nextest pass (was 6449 pre-tranche;
+5 for the new tests). Full file 0.2s wall-clock for all 10 tests.

Item 3 now COMPLETE (10/10).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(workflows): cross-command flows Item 4 — migrate→rebuild + workspace filter isolation

Closes Item 4 of the test-coverage-followup-plan at 6/6 (target was
≥5). Two additions to tests/workflows/tests/cross_command_flows.rs:

- Plan #1 — extended flow_migrate_install_audit_lockfile_round_trips
  with a `lpm rebuild --dry-run --policy=deny` step. Pins the full
  migrate → install → audit → rebuild lifecycle. Asserts the rebuild
  step exits 0 + does not mutate the post-audit state (lpm.lock +
  lpm.lockb still present). Catches regressions where rebuild's
  lockfile or build-state parser breaks against a freshly-migrated
  manifest.

- Plan #5 — added flow_workspace_install_filter_member_a_does_not_mutate_member_b
  (new test, 159 LOC). Pins the workspace-member isolation contract
  using the workspace-monorepo fixture (3 members: app, core,
  utils):
    1. Initial filtered install on @test/core (re-pinning its
       existing semver dep) populates core's per-member quadruple:
       lpm.lock=319 B, lockb=230 B, install_hash=118 B.
    2. Snapshot core's full quadruple.
    3. Run `lpm install chalk@5.3.0 --filter @test/app` to add a
       new dep to app ONLY.
    4. Assert app's package.json gained chalk; core's quadruple
       (package.json + lpm.lock + lpm.lockb + install-hash) is
       BYTE-IDENTICAL post-install; chalk does NOT appear in
       core's node_modules/.

  Catches a regression where a per-member filtered install
  accidentally also mutates a sibling member's package.json /
  lockfile / install-hash — a real bug class because run_install_filtered_add
  shares the workspace-root project lock (added in #77 fix) and
  could over-snapshot if the target-set computation drifts.

  Helper `mount_pkg_full(mock, name, version)` factors out the
  three-step metadata + batch-metadata + tarball mount so the
  test body stays readable.

Other 4 plan flows already covered pre-tranche:
- Plan #2: flow_add_install_graph_added_dep_visible
- Plan #3: flow_install_patch_patch_commit_install_persists_patch
- Plan #4: flow_token_rotate_publish_dry_run_picks_new_token
- Plan #6: flow_install_upgrade_major_audit_picks_new_version

Pre-merge gate green: clippy --workspace --all-targets clean, fmt
clean, fancy-regex empty, 6455/6455 nextest pass (was 6454; +1 for
the new flow). Plan #5 stable across 5/5 reruns at ~0.11s each.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(install): LPM_TEST_PANIC_AT hook + B.4 panic-rollback contract

Adds a deterministic panic-injection hook to the install pipeline +
unblocks the long-deferred B.4 contract test for ManifestTransaction
Drop-based rollback on panic.

The hook (`maybe_test_panic(stage)` in
crates/lpm-cli/src/commands/install.rs) reads LPM_TEST_PANIC_AT and
panics when the env value matches the stage name. Gated to
`cfg!(debug_assertions) || LPM_TEST_MODE=1` — same pattern as the
#78 retry-backoff override. Production builds without LPM_TEST_MODE=1
silently treat the env as no-op.

Wired 4 stages in `run_add_packages`:
- "after-snapshot" — manifest unchanged; Drop is no-op
- "after-stage" — placeholder `*` written to package.json (load-bearing)
- "after-install" — pipeline complete; manifest still has `*`
- "after-finalize" — concrete versions written; pre-commit only

The hook unblocks B.4 (`install_panics_mid_pipeline_rollback_restores_manifest`),
deferred since the original Item 2 tranche because there was no
deterministic way to trigger a panic mid-install from a workflow
test. Recoverable errors fire `?`-rollback (covered by E.1/E.2/E.3);
SIGKILL bypasses Drop entirely (B.1/B.2/B.3 cover that). The panic
path was the missing rollback proof.

B.4 sets LPM_TEST_PANIC_AT=after-stage and asserts:
- process exits non-zero (panic propagates to runtime)
- stderr contains `"panicked at"` AND `"LPM_TEST_PANIC_AT=after-stage"`
- package.json BYTE-IDENTICAL to pre-stage (Drop ran on unwind, snapshot
  bytes restored — load-bearing)
- the new pkg is NOT in dependencies (placeholder rollback worked)
- .lpm/install-hash absent (invalidate-on-rollback)
- lpm.lock absent (matched optional snapshot's None pre-state)

Catches a regression where:
- panic = "abort" added to release profile (no Drop on panic)
- ManifestTransaction Drop logic stops restoring snapshot bytes
- The `lpm install` snapshot+commit window grows without re-wiring Drop

Test runs in 0.07s warm. 5/5 stable across reruns.

Pre-merge gate green: clippy --workspace --all-targets clean, fmt
clean, fancy-regex empty, 6456/6456 nextest pass (was 6455; +1 for B.4).
install_concurrency now at 19/19. Item 2 of test-coverage-followup-plan
moves to 19/21 — only A.2 (no contract) and D.3 (needs container
infra) remain deferred indefinitely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(workflows): align MockRegistry tarball URL shape with production /-/ gate

Workflow tests mounted tarballs at `/tarballs/{name}-{version}.tgz` —
missing the `/-/` path segment that the registry-client's `evaluate_cached_url`
gate at [crates/lpm-registry/src/client.rs#L4117] requires (`.tgz` suffix
AND `/-/` substring). The gate is a defense-in-depth check that blocks
the H1 auth-token leak: a tampered lockfile URL like `/api/admin/foo.tgz`
(no `/-/`) would otherwise attach the bearer to a non-registry endpoint.

The mismatch produced two test-environment side effects that don't manifest
in production:

1. **WARN noise**: every install test that read a tarball URL from the
   lockfile fast path logged `cached tarball URL for X@Y failed shape check;
   falling back to on-demand lookup`. Polluted stderr across the suite.
2. **`shape_mismatch_count` defeated**: the registry-client documents this
   counter as a "BUG signal — the writer should never emit a gate-rejectable
   URL". Test runs incremented it on every install, making the counter
   useless for catching real bugs.

This commit migrates the mock to the production-shape
`/tarballs/{name}/-/{name}-{version}.tgz` everywhere — both the helper
methods (`MockRegistry::tarball_path` / `tarball_url`) and the ~60
hard-coded `format!` sites across 14 test files + 1 snapshot.

The new `tarball_path` helper is `pub` with a prominent docstring warning
future test authors not to re-introduce the legacy shape. Internal mounts
in `with_package_and_deps` / `with_package_published_at` /
`with_full_package_metadata` all route through it.

Post-fix verification: WARN gone, gate `Accepted` path runs, all 691
lpm-workflows tests pass (0 leaky in the latest full-workspace run, down
from 1-3 leaky pre-fix — fewer fallback paths firing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(workflows): test-coverage-followup tranche — Items 2/3/4/5

Closes the remaining open rows from `private/test-coverage-followup-plan.md`
across four items. ~2,600 LOC of new test code + fixture + budget infra.

**Item 3 — tarball-security additional candidate surfaces (7 tests
in `tarball_security.rs`):**

- `tarball_with_pax_path_traversal_rejected` — PAX extended `path`
  header smuggling `..` is rejected by the extractor's `Component::ParentDir`
  check after the tar crate resolves the override.
- `tarball_with_gnu_longname_traversal_rejected` — symmetric GNU `L`
  entry; same rejection path.
- `tarball_rejects_or_rolls_back_when_later_entry_is_malicious` — pins
  the `rollback_extraction` contract: valid first entry is cleaned up
  when a later `..`-traversal entry trips rejection mid-stream.
- `tarball_with_duplicate_member_path_rejected_or_deterministic` — pins
  current last-write-wins contract (defensible; flagged scanner-
  disagreement risk in test comment).
- `tarball_with_truncated_gzip_rolls_back_partial_extract` — half-
  truncated gzip stream → libdeflate fails cleanly → no partial extract.
- `tarball_ignores_uid_gid_ownership_metadata` (POSIX) — bogus uid/gid
  in tar header is ignored; extracted files owned by process uid.
- `tarball_with_sparse_huge_file_rejected_by_declared_size` — manually-
  constructed tarball with header declaring `MAX_FILE_SIZE + 1` and
  empty on-wire body; extractor rejects on the pre-check at lib.rs:306
  before draining body.

**Item 4 — cross-command flows additional candidate surfaces (2 tests
in `cross_command_flows.rs`):**

- `flow_install_uninstall_install_graph_round_trip` — pins manifest /
  link / graph hand-off through a full round-trip.
- `flow_cache_clean_then_offline_install_uses_store_or_fails_helpfully` —
  pins the cache/store boundary: `cache clean` must not corrupt offline
  install; store-side bytes byte-identical after a clean.

**Item 2 — concurrency/recovery additional candidate surfaces (3 tests
in `install_concurrency.rs`):**

- `cache_clean_during_slow_tarball_install_does_not_corrupt_install`
  (G.4) — install + cache clean run concurrently (different lock paths,
  no serialization); install succeeds despite metadata cache wipe
  mid-stream. Empirical timing observed: install elapsed 1.57s, cache
  clean fired at t=30-39ms cleanly inside the install window.
- `install_panics_after_install_hash_write_rollback_invalidates_hash`
  (G.5) — reuses existing `LPM_TEST_PANIC_AT=after-install` stage (no
  new source-side hook needed — `write_post_install_v6_hash` runs
  inside `run_with_options` which returns BEFORE that stage fires).
  Pins that Drop-based rollback restores manifest AND deletes the
  freshly-written install-hash.
- `malformed_registry_json_fails_without_manifest_or_lockfile_mutation`
  (G.6) — truncated JSON on all three metadata endpoints; install
  fails cleanly, no panic/backtrace, package.json byte-identical, no
  torn lockfile.

**Verdaccio-npm parity for `which@4.0.0`
(`install_real_registry.rs`):**

- `verdaccio_npm_parity_for_bin_package_pins_metadata_and_shim_presence`
  — extends the existing lodash byte-diff with a bin-shipping target
  package. Asserts metadata equivalence + `.bin/<name>` shim present
  on both sides + bin target file materialized + exec bits non-zero
  (POSIX).

**Item 5 — realworld fidelity (new fixture + new test file):**

- `tests/fixtures/realworld-nextjs/` (package.json + README) — pinned
  Next.js 14.2.13 + React 18.3.1 + TypeScript 5.6.3 + 3 `@types/*`
  packages. Resolves to ~28 transitive deps empirically. README
  documents the calibration methodology including raw measurement data.
- `tests/workflows/tests/install_realworld.rs` — `install_realworld_nextjs_fixture_succeeds_through_verdaccio`
  installs the fixture through Verdaccio→npmjs and asserts end-to-end
  success at production scale. Always logs cold + warm wall-clock + peak
  RSS to stderr for calibration data.
- **`LPM_BUDGET_GATE=1`-gated budget assertions**: cold ≤ 25s, warm ≤
  25ms, cold peak RSS ≤ 1500 MiB. Calibrated from N=6 cold + N=3 warm +
  N=3 RSS runs on M-series macOS, 2026-05-14. Memory measurement via
  `/usr/bin/time -l` (macOS) / `-v` (Linux); Windows skips with a clear
  warning.

This closes Item 5 entirely (all 4 acceptance criteria green) and brings
Items 2/3/4 to the parked-by-design or infrastructure-blocked baseline.

CI gate: clippy `--workspace --all-targets -- -D warnings` clean, fmt
clean, fancy-regex empty, build clean, `cargo nextest run --workspace`
6471/6471 pass. Suite runtime ~2:40 (was ~2:24 pre-tranche; +15s for the
realworld test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(workflows): collapse Linux-only let-chain in parse_peak_rss

CI lint on Linux failed on `clippy::collapsible_if` in the Linux-cfg'd
branch of `parse_peak_rss`. The macOS branch had an intermediate
`let bytes_str = rest.trim();` between the two `if let`s, which is why
the local clippy run on macOS didn't catch this — only the macOS-cfg
branch compiled there.

Collapse the Linux branch to use `&&` (stable let-chains) so it
satisfies the lint while preserving the same semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant