feat(phase-49): client-side streaming BFS resolver#8
Merged
Conversation
Introduces CanonicalKey (Root | Lpm{owner,name} | Npm{name}) — the
context-free identity used by Phase 49's shared metadata cache and
notify map. Unlike ResolverPackage, CanonicalKey has no `context`
field, so split-retry identities (e.g. `lodash` and `lodash[parent]`)
collapse to the same key.
This is the load-bearing invariant that prevents the split-subtree
notify-miss bug: the walker only discovers canonical names from
parsed manifests, so the shared cache MUST be keyed by CanonicalKey
— keying by ResolverPackage would cause every split retry to miss
the walker's inserts and fall through to the escape-hatch fetch.
See DOCS/new-features/37-rust-client-RUNNER-VISION-phase49-streaming-
resolver-preplan.md §4.2 for the full invariant discussion.
- Adds CanonicalKey to lpm-resolver with From<&ResolverPackage>,
Display, lpm() / npm() constructors, and from_dep_name().
- 9 unit tests covering context-stripping, Npm-vs-Lpm distinction,
scoped-npm round-trip, HashMap key identity under split retry,
and walker/provider parse agreement.
- Exports from lpm-resolver::lib.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…h + parallel fan-out
Foundation for Phase 49's direct-to-npm default. Introduces:
- `lpm-registry/src/route.rs`: `RouteMode { Direct, Proxy }` (default
Direct per preplan §3) and `UpstreamRoute { LpmWorker, NpmDirect }`.
`RouteMode::route_for_package` enforces the invariant that
`@lpm.dev/*` ALWAYS routes via the Worker regardless of mode —
authenticated batch + cost attribution must not be bypassed.
`LPM_NPM_ROUTE` env var exists as an undocumented debug escape
hatch via `from_env_or_default`.
- `RegistryClient::get_npm_metadata_direct`: mirrors the Tier 3
direct-to-npmjs path from `get_npm_package_metadata`, but SKIPS
the LPM Worker proxy tier entirely. Tier 1 (TTL + HMAC cache)
is preserved so warm installs and previously-seen packages stay
cache-fast. This is what callers invoke when direct-mode is the
whole point of the call; no silent fallback to the Worker hop.
- `RegistryClient::get_npm_metadata_routed`: single dispatch entry
the Phase 49 walker + provider escape-hatch both use, so routing
policy lives in one place.
- `RegistryClient::parallel_fetch_npm_manifests`: fan-out helper
for the walker. Uses `tokio::sync::Semaphore` with `forget()`
as a one-way ratchet for halve-on-429 adaptive back-pressure
(per-request 429s are already retried with Retry-After by
`send_with_retry`; this adds batch-level pressure reduction
with a floor of 4). Returns per-entry Results + `FanOutStats`
so the walker can report halve events in
`timing.resolve.streaming_bfs` without interpreting errors.
Existing `get_npm_package_metadata` (three-tier chain) is untouched;
callers of the old path will be migrated to `get_npm_metadata_routed`
in §3–§5 and the legacy entry deleted in §7.
- 5 new route unit tests; 119 total lpm-registry tests green.
- Clippy -D warnings clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tion Review catch on a9766c5: `parallel_fetch_npm_manifests`'s halve-on-429 path only called `try_acquire_owned` synchronously. When every permit is already checked out (the exact scenario the back-off exists for), zero permits are forgotten, the effective ceiling stays at `initial`, and `halve_events` increments anyway — a silent no-op that lies in stats. Fix: - Add `forget_debt: AtomicUsize` counter. On 429, the observing task first tries synchronous `try_acquire_owned` forgets (covers partial saturation); any shortfall is registered as debt and paid off by the next task completions. Each completing task CAS-claims one unit of debt and calls `permit.forget()` instead of releasing it — over the next few completions the pool actually shrinks. - `current_ceiling` only decrements on ACTUAL forgets (immediate OR at the point a task pays debt), never on the mere registration of debt. This plugs a second hole in the original fix where the stats moved below `initial` even when no real forget followed. With the corrected accounting, `final_concurrency < initial_concurrency` iff permits were genuinely removed from the pool. - `halve_events` only increments when forgot_now > 0 OR shortfall > 0. Halving at-or-below floor records no event (the stat no longer claims halving when the floor blocked it). Four new client tests to pin §2's behavior the route-enum tests couldn't reach: - `get_npm_metadata_direct_skips_proxy_tier_entirely`: proxy mock with `expect(0)` verifies direct-tier never falls back to the LPM Worker. - `parallel_fetch_preserves_input_order_across_varying_latencies`: inverted per-response delays ensure the fan-out would return in completion order if the sort-by-input-index were broken. - `parallel_fetch_per_entry_failures_do_not_abort_batch`: 404 on one entry; the other two still return Ok. - `halve_on_429_ratchets_even_under_full_saturation`: asymmetric mocks (pkg-0 fast 429 + pkg-1..7 slow 200s with 2s delay) pin the saturated moment at the halving call. Verified to FAIL against a sabotaged debt path (`initial=8, final=8, halve_events=1` — exactly the bug shape) and PASS against the fix (`final=4`). Requires `#[tokio::test(flavor = "multi_thread", worker_threads = 4)]` — the default single-threaded tokio runtime scheduled pkg-0 to completion before pkg-1..7 acquired permits, so the saturation never materialized. 123 lpm-registry tests green, clippy -D warnings clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t-loop + resolve.rs downstream Load-bearing §3 change for Phase 49: the provider's metadata cache moves from `RefCell<HashMap<ResolverPackage, _>>` to `Arc<DashMap<CanonicalKey, _>>`, removing the context field from the cache key so the future walker's canonical inserts unify with every split-retry identity automatically. Without this, the walker (§4) would produce cache entries the provider's split-context lookups silently miss, and escape-hatch fetches would fire on every split path — exactly the pathology preplan §4.2 warns about. Key changes: - `LpmDependencyProvider`: cache type swap. New `notify_map` field (`Arc<DashMap<CanonicalKey, Arc<Notify>>>`) for per-canonical wakes. New `route_mode: RouteMode` (defaults to `Proxy` in §3 to preserve pre-49 behavior; §5 flips to Direct via `with_route_mode`). New `fetch_wait_timeout: Duration` (defaults to ZERO — wait-loop short- circuits to fetch-on-miss pending §5's walker attachment). - `ensure_cached` rewritten per preplan §5.1: canonicalize FIRST, then cache fast-path, then wait-loop on the per-canonical Notify, then escape-hatch to `direct_fetch_and_cache` on timeout. The wait-loop is doc-commented as the invariant boundary — future refactors that skip canonicalization before a cache read would break the split- subtree notify-hit invariant silently. - `direct_fetch_and_cache` routes via `route_for_package(name)`: LPM always via Worker, npm per `route_mode`. `insert_and_notify` enforces the §5.5 mandatory order — insert into cache THEN fire waiters. - Every internal cache call site canonicalizes at the boundary: `available_versions`, `pick_natural_version`, `apply_override_target` (Range arm), `prioritize`, both uncached-batch filters and the main cache read in `get_dependencies`. - `into_cache` + `into_parts` return `HashMap<CanonicalKey, _>`. Fast path via `Arc::try_unwrap`; per-entry clone fallback for the §5 shape where the walker's `Arc` refcount is still live. - New constructors: `with_shared_cache(SharedCache, NotifyMap, Duration)` and `with_route_mode(RouteMode)`, both `#[allow(dead_code)]` until §5 plumbs them from install.rs. - `resolve.rs` downstream: `ResolveResult.cache` type updated, plus `format_solution` and `check_unmet_peers` canonicalize on lookup. Test-side `cache.insert` sites canonicalize at the boundary helper. - Removed the `is_split()` branch in `ensure_cached` that copied canonical entries under split keys — canonicalization at the cache boundary makes it redundant. Regression tests (preplan §8.1 boundary tests): - `ensure_cached_split_retry_hits_canonical_entry`: when the walker inserts under the canonical key, a provider lookup via a split- context identity must hit the same entry (not time out and fetch). Verified to FAIL under a sabotaged canonicalization (forced miss on split identities) and PASS under the fix. - `ensure_cached_canonical_hits_split_insertion_symmetric`: the inverse — canonical lookup hits a split-inserted canonical entry. Both directions of the canonical collapse must agree. Incidental: includes `cargo fmt` drift on two lpm-cli files (`build_state.rs`, `commands/install.rs`) that were non-compliant on the pre-§3 tree — the CI gate's `cargo fmt --check` enforces it, so fmt discipline for the tree as a whole is required for §3 to ship green. - `dashmap = "6"` added to workspace + lpm-resolver Cargo.toml. - Full CI gate green: clippy --workspace -D warnings clean, fmt clean, 4364 tests pass, 7 skipped, 0 failed. Parallel-default `cargo test -p lpm-auth` deterministic across 3 runs. - 127 lpm-resolver unit tests (was 114 pre-49). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reviewer regression: two concurrent 429s both read `current=8`, both enqueue `want_forget=4` into debt, and the 8 completions drive the effective pool to 0 — below the floor of 4. The previous fix made the CAS atomic on `debt`, but the race is on `ceiling` itself. Fix: CAS loop on `ceiling` at halve time. At most one handler per ceiling transition wins the decrement; others observe the new lower ceiling and either halve again (if still above floor) or bail. Ceiling now reflects the *committed* ceiling (reduction in motion); completion path no longer decrements ceiling (was double-counting with the CAS). New test `halve_on_429_multi_concurrent_races_respect_floor`: 8 concurrent slow 429s; asserts `final_concurrency >= floor` and `halve_events == 1` (only one halve step representable with floor=4, initial=8). Verified to fail under a sabotaged fetch_sub and pass under the CAS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New `lpm-resolver::walker::BfsWalker` fetches manifests ahead of
PubGrub for the root set and its transitive deps. Drops into the
Phase 49 orchestration §5 will plumb from install.rs; until then it
is exercised only by its own tests.
Design:
- Lives in lpm-resolver (not lpm-registry per preplan §5.5) because
it operates on resolver types (CanonicalKey, SharedCache, NotifyMap)
— moving those to lpm-common is a worse fit than relocating the
walker. lpm-resolver already depends on lpm-registry so there's no
circular dep.
- Level-by-level BFS; each level partitions names by route and fetches
the NpmDirect + LpmWorker halves concurrently. npm uses the
`parallel_fetch_npm_manifests` 50-wide fan-out + halve-on-429; LPM
uses `batch_metadata`.
- Per-manifest ordering invariant (preplan §5.5, enforced via
comment block): (1) `shared_cache.insert`, (2) `notify_waiters`,
(3) fire `roots_ready` oneshot, (4) `spec_tx.send().await`. Step 4
is last so resolver visibility is never gated on dispatcher
throughput.
- Transitive discovery walks the `dist-tags.latest` version's
production deps (falling back to highest version if `latest`
missing) — matches the dispatcher's `pick_speculative_version`
shape. Resolver escape-hatch handles any miss.
- MAX_WALK_DEPTH = 16 (matches the Worker's prior server-side cap).
- Per-package fetch failures: debug log + continue; bun/pnpm shape.
Observability: `WalkerSummary { manifests_fetched, cache_hits,
max_depth, spec_tx_send_wait_ms }`. The send-wait counter is the
reviewer's backpressure canary (preplan §5.6) — healthy value is 0.
Also:
- `with_npm_registry_url` promoted from `#[cfg(test)]` to `pub`.
Walker tests depend on it to mock the npm upstream across crates.
The W4 memory flagged this setter as test-only, but the concern
was lockfile multi-registry correctness, not setter visibility.
- `parse_metadata_to_cache_info` exposed `pub(crate)` so walker can
share the provider's canonical parse path.
- `SharedCache` + `NotifyMap` re-exported from `lpm-resolver::lib`.
Three walker unit tests:
- `walker_end_to_end_discovery_and_signals`: 2 roots + transitives,
asserts cache populated, roots_ready fires, spec_tx receives all
frames, summary fields correct.
- `walker_inserts_under_canonical_key_not_context_bearing`: walker
insert for `lodash` must be visible via `CanonicalKey::from(&split)`
— same invariant as the provider-side regression, tested through
the walker path.
- `walker_caps_at_max_walk_depth`: chain of 20 packages; walker caps
at depth 16, exactly `MAX_WALK_DEPTH` manifests fetched.
130 lpm-resolver tests green (was 127); 4368 workspace tests pass,
clippy --workspace -D warnings clean, fmt clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reviewer finding 1 (medium): the walker's cache-hit branch incremented `cache_hits`, maybe fired `roots_ready`, then `continue`d — never reading the cached manifest's deps, so any caller that reused a shared cache or pre-seeded roots got a truncated walk at exactly those nodes. §5's orchestration reuses the cache across walker + provider, so this sits exactly on the seam §5 plumbs into. Fix: clone the cached `CachedPackageInfo` out of the DashMap guard (dropping it at the `;` to avoid an E0502 borrow conflict against the subsequent `&mut self` call) and run `expand_deps_from_info` on it. Cached entries do NOT re-emit on `spec_tx` — the manifest was already produced by whoever seeded the cache; the dispatcher dedupes internally anyway. Reviewer finding 2 (low): `expand_deps_into`'s "fallback to highest version" comment was wrong — `meta.versions.keys().next()` is arbitrary `HashMap` iteration order, not a semver max. Replaced with `expand_deps_from_info(info, ...)` that reads `info.versions.first()`, which `parse_metadata_to_cache_info` keeps sorted newest-first (and prerelease-filtered for npm). One dep- discovery primitive now covers both the cache-hit and fresh-fetch paths. - `commit_manifest` returns the parsed `CachedPackageInfo` so the fresh-fetch path can expand via the same helper without re-parsing. - Removed `expand_deps_into(PackageMetadata, ...)`. - New regression test `walker_cache_hit_still_expands_transitive_deps`: pre-seeds a root with a dep on `leaf-x`, asserts the walker fetches `leaf-x` via the expanded cache-hit path, `roots_ready` still fires, and `spec_tx` sees only the transitive (cached root not re-emitted). Verified to fail under a sabotaged cache-hit branch (skipping expansion) and pass under the fix. 131 lpm-resolver tests (was 130); clippy + fmt clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lve concurrent
Replaces the pre-49 batch-then-resolve serial flow with the preplan
§5.3 shape: SharedCache + NotifyMap owned by install.rs, walker
spawned as metadata producer, speculation dispatcher extracted and
spawned as a separate consumer, resolver awaited inline on the
walker's `roots_ready` oneshot, both task handles bundled into a
WalkerJoin drained post-fetch.
Key shape changes:
- `run_deep_batch_with_speculation` deleted. Its dispatcher body is
extracted into `spawn_speculation_dispatcher(rx, client, store,
sem, coord, deps) -> (JoinHandle, DispatcherCounters)` — refactor-
only, dispatcher logic unchanged. The W2 roots-ready signal moves
to the walker (preplan §5.5).
- `SpeculativeJoin` → `WalkerJoin`: bundles walker + dispatcher
JoinHandles + counters. `drain(self, stats) -> WalkerSummary`
awaits both in parallel, folds counters into `spec_stats` for
observability, returns walker's summary. Critical: neither handle
is awaited at construction — awaiting early makes the post-fetch
drain a no-op (the bug preplan §5.3 warns about).
- `spec_fetch_enabled` env gate, `cache_has_all` on-disk gate,
`prefetched_batch`, `resolve_result_w2` all gone — the walker
unconditionally drives metadata fetch for the resolver + feeds the
dispatcher for speculation.
- `resolve_with_prefetch(Option<HashMap>)` → `resolve_with_shared_cache(
SharedCache, NotifyMap, Duration, RouteMode)`. Provider wires
through `with_shared_cache` + `with_route_mode`. `fetch_wait_timeout
= 5s` now live — the wait-loop actually waits for the walker
instead of short-circuiting to the escape hatch.
- Retry loop simplified: `Arc<DashMap>` persists across split-retry
passes natively, so `cached_metadata`/`into_cache`/`with_cache`
round-trips deleted. Dead-code pruning: `with_cache`,
`with_prefetched_metadata`, `into_cache` removed from provider.
- `RouteMode::from_env_or_default()` reads `LPM_NPM_ROUTE`, defaults
to Direct (preplan §3 shipped default). Walker + provider's
escape-hatch share the same mode.
Test harness updates for the route-mode flip:
- `install_global.rs` cypress fixture now mounts each package's
metadata at BOTH the LPM proxy path (`/api/registry/{name}`) and
the npm-direct path (`/{name}`), and sets `with_npm_registry_url`
to the same mock — mode-agnostic. `.expect(1)` counts dropped on
per-package GETs since the walker/batch interplay doesn't
guarantee a specific per-mock call count.
- `tests/workflows/tests/support/mock_registry.rs::with_package_and_deps`
mounts both paths for the same reason.
- Subprocess workflow tests (which can't override `npm_registry_url`
via the `--registry` CLI flag) set `LPM_NPM_ROUTE=proxy` in the
shared `lpm()` command builder so the subprocess honors the mock
server at the proxy path.
4369 workspace tests pass, 0 failures; clippy --workspace -D warnings
clean; fmt clean. Walker summary currently logged at debug; §6 will
fold it into the `timing.resolve.streaming_bfs` JSON sub-object.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…batch_ms Reviewer findings (medium×2): two timing fields silently changed semantics after §5's orchestration rewrite. 1. `timing.resolve.initial_batch_ms` — documented as the pre-resolve batch-prefetch wall-clock. Pre-fix it was measured at resolve completion, so it lumped in PubGrub wall-clock (already separately reported as `resolver_stage_timing.pubgrub_ms`) and JSON output was internally inconsistent. Fix: capture at roots-ready fire inside the resolve async block and return the measurement as a tuple. New meaning: "wall-clock from orchestration start to the moment the resolver could begin solving" — the Phase 49 analog of "batch prefetch done." 2. `timing.fetch_breakdown.speculative.streaming_batch_ms` — documented as the metadata-stream/dispatcher window. Pre-fix, `WalkerJoin::drain` measured at drain-complete time (post-fetch), which included the entire tarball-download overlap tail. Fix: await the walker first, close the streaming-batch window at walker-complete time, THEN await the dispatcher. The dispatcher tail (spec_tasks drain) is the fetch-overlap tail — belongs in `fetch_ms`, not the batch window. Both fields are now meaningful for §10's pre-merge ship-gate A/B comparison. Reviewer blocked §6 on these because bench interpretation would be misleading until fixed. Cypress test + clippy + fmt clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…all window Reviewer finding (medium): the previous fix still used `self.started_at.elapsed()` at drain-call time, so if the walker completed earlier during fetch overlap, awaiting the already-done JoinHandle returned immediately but the `elapsed()` was still taken at post-fetch drain time. The metric continued to include the walker-finished-to-drain-called window — not the metadata-producer window the field contract documents. Fix: the walker records its own wall-clock inside `BfsWalker::run()` from entry to return, stored as `WalkerSummary::walker_wall_ms`. The measurement is invariant to when the caller chooses to `.await` the JoinHandle. `WalkerJoin::drain` reads `summary.walker_wall_ms` into `stats.streaming_batch_ms` — no more external timers. `WalkerJoin::started_at` field and the local `walker_started_at` shadow are now dead and deleted. Both the `SpeculativeStats` docstring (install.rs:166) and the `timing.resolve.*` JSON comment block (install.rs:2988) updated to reflect the Phase 49 contract: `streaming_batch_ms` = walker's metadata-producer window, `initial_batch_ms` = orchestration-start → roots-ready. Reviewer's low-severity JSON contract drift on `initial_batch_ms` (install.rs:2993) also fixed in the same docstring block. Walker tests + workspace clippy + fmt all clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Preplan §5.6 observability lands: install.rs emits a new
`timing.resolve.streaming_bfs` sub-object in `--json` output that
combines walker-side metrics (from `WalkerSummary`) with provider-
side counters (from a new `StreamingBfsMetrics`).
New in `lpm-resolver::provider`:
- `StreamingBfsMetrics` — three `Arc<AtomicU64>` counters shared
across split-retry passes so counts accumulate into the same
snapshot:
* `cache_waits` — PubGrub callbacks that entered the wait-loop on
a cache miss. Healthy on a cold install ≈ total transitive
packages.
* `cache_wait_timeouts` — wait-loop exits by timeout. Healthy 0.
* `escape_hatch_fetches` — fetches that bypassed the wait-loop
(timeout OR zero-timeout short-circuit). Healthy 0 when walker
is attached.
- `LpmDependencyProvider::with_streaming_metrics(metrics)` setter
attaches the shared counters to each retry-pass provider.
- `ensure_cached` increments `cache_waits` on wait-loop entry and
`cache_wait_timeouts` on either timeout path. `direct_fetch_and_cache`
increments `escape_hatch_fetches` for every non-root fetch.
Wire-through:
- `resolve_with_shared_cache(..., metrics: StreamingBfsMetrics)` —
new parameter threaded through every split-retry pass via the
shared `Arc`. `#[allow(clippy::too_many_arguments)]` on the fn
per CLAUDE.md (design-level lint, not worth a parameter-struct
refactor right now).
- `resolve_dependencies_with_overrides` + test shim pass
`StreamingBfsMetrics::new()` — pre-walker callers get a default
fresh counter set.
- install.rs hoists a single `streaming_metrics` to outer scope so
both the resolver (inside the `None` lockfile arm) and the JSON-
emit block (at outer scope) share the same Arc.
JSON field shape per preplan §5.6 (null on warm lockfile-fast-path):
```json
"timing": {
"resolve": {
"streaming_bfs": {
"walk_ms": 345,
"manifests_fetched": 227,
"cache_hits": 3,
"cache_waits": 227,
"cache_wait_timeouts": 0,
"escape_hatch_fetches": 0,
"spec_tx_send_wait_ms": 0,
"max_depth": 5
}
}
}
```
Docstring block above the `"resolve"` JSON emit documents healthy vs
degraded values per field so bench readers can spot regressions.
131 resolver tests green; 4369 workspace tests pass; clippy
--workspace -D warnings clean; fmt clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ion tests Reviewer found the "healthy cache_waits ≈ total transitive packages" framing was quantitatively wrong — `cache_waits` only increments on wait-loop entry (fast-path cache hits don't count), `ensure_cached` is called from multiple sites + across split retries, so the counter can be much lower or higher than installed package count depending on timing and tree shape. Fixes: - `StreamingBfsMetrics` doc: replaced the three-bullet "healthy values ≈" block with per-counter qualitative semantics. `cache_waits` is explicitly flagged as "how many times PubGrub had to wait on the walker" — NOT an installed-package-count proxy. - `timing.resolve.streaming_bfs` JSON contract block in install.rs: same softening, inline in the docstring above the emit. Tests added to close the testing gap the reviewer noted (the JSON sub-object previously had no regression coverage): - `streaming_metrics_cache_hit_does_not_increment_waits`: fast-path cache hit leaves all three counters at zero. - `streaming_metrics_miss_with_zero_timeout_increments_only_escape_hatch`: pre-§5 shape (walker unattached, `fetch_wait_timeout == ZERO`) — wait-loop never entered (`cache_waits = 0`), escape hatch fires once (`escape_hatch_fetches = 1`). Underlying network failure is orthogonal to the counter check. - `streaming_metrics_root_package_does_not_count`: `ensure_cached(Root)` returns early before any counter interaction. - `streaming_metrics_shared_across_clones`: cloning shares the underlying Arc<AtomicU64>s — load-bearing for split-retry cross-pass accumulation. 135 resolver tests (was 131); clippy + fmt clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
§7 per the todo: final sweep of stale references + dead-code deletion
after §5 retired the pre-49 orchestration. Plus the reviewer's
suggested hardening to close the CLI-surface testing gap.
Dead code deleted:
- `RegistryClient::batch_metadata_deep_streaming` — no live callers
after §5 (walker replaced it as the metadata producer).
- `batch_metadata_inner(..., tx: Option<Sender>)` +
`parse_ndjson_batch(..., tx: Option<Sender>)` — the `tx` parameter
chain existed only to feed the streaming variant. With all callers
now passing `None`, simplified both functions to drop the parameter
and the two `if let Some(ref sender) = tx { ... }` emission blocks
inside the parser.
Stale doc refs updated across 4 files — comments that referenced
`resolve_with_prefetch`, `with_prefetched_metadata`,
`run_deep_batch_with_speculation`, `batch_metadata_deep_streaming`,
and `LPM_SPEC_FETCH` as if they still existed. Replaced with Phase 49
current-state references; historical "Phase 49 replaces X" narrative
left intact where it accurately describes the transition.
Reviewer-suggested hardening (CLI-surface gap from bcaaf4e):
- `install_json_output_contains_package_list` now asserts
`timing.resolve.streaming_bfs` is emitted as an object on fresh-
resolve installs with all eight documented fields as numbers
(`walk_ms`, `manifests_fetched`, `cache_hits`, `cache_waits`,
`cache_wait_timeouts`, `escape_hatch_fetches`, `spec_tx_send_wait_ms`,
`max_depth`). Plus a timing-agnostic "metadata was produced by
walker OR escape-hatch" sanity check that stays resilient to
the harness's proxy route-mode default.
Bonus fix caught by the new workflow assertion:
- `with_batch_metadata` mock helper now auto-wraps bare metadata
into `{name, metadata}` NDJSON envelopes. The client's
`parse_ndjson_batch` expects that schema; bare lines were
silently dropping on the wire, which was why the walker's
`batch_metadata` returned empty in proxy-mode workflow tests.
4373 workspace tests pass (was 4369); clippy --workspace -D warnings
clean; fmt clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cript scrub Three related hardening moves, all non-blocking for the shipped release but tighten the test + observability surface before §10's ship-gate A/B. §8 tests added: - `blocking_pool_saturation_smoke_max_2_threads` (provider.rs) — preplan §5.1 softened the blocking-pool concern after verifying PubGrub runs inside ONE `spawn_blocking` task (not one per miss), but a smoke test is still useful: a future refactor that moves `rt.block_on` into a hot per-package site would deadlock a 2-thread blocking pool. Test pre-seeds the shared cache so `ensure_cached` hits the fast path, then fires 16 concurrent lookups across 4 `spawn_blocking` tasks under `max_blocking_threads(2)`. Wrapped in a 10s timeout — a real deadlock fails the test rather than hanging CI. - `walker_result_independent_of_per_manifest_timing` (walker.rs) — preplan §8.1's "order-independence" check: run the same walker twice against the same tree with inverted per-manifest delays (slow roots + fast leaves vs. fast roots + slow leaves). Asserts the final shared-cache keyset, manifests_fetched, and max_depth are identical across both passes. Uses `ord-` prefixed package names to avoid disk-cache cross-test contamination. Bench script scrub (reviewer's adjacent note on §7): - `bench/run.sh`'s "legacy" A/B variant used `LPM_STREAM_FETCH=0 LPM_SPEC_FETCH=0` to disable Phase-38/39 paths. Those env vars no longer control install behavior — the gated code paths were retired in §5/§7 — so the legacy variant was running the same code path twice. Replaced with the actually-meaningful Phase 49 A/B: Direct mode (shipped default) vs `LPM_NPM_ROUTE=proxy` (escape hatch). This is the A/B §10's ship gate anchors on. Test-isolation fix (exposed by the order-independence test): - `RegistryClient::with_cache_dir(Option<PathBuf>)` — new public setter to override the default `~/.lpm/cache/metadata/` path or disable disk caching entirely (`None`). The default is shared across every test in the process AND persists between test runs, causing cross-contamination when two tests use the same package name. Walker-tests' `client_direct_to` helper now passes `None` so each test is hermetic. The existing walker tests that break under cross-contamination (revealed when the order-independence test added `leaf-a1`/`leaf-a2` packages and poisoned the cache for subsequent runs) now pass deterministically. 4375 workspace tests (+2 vs §7's 4373); 137 lpm-resolver tests (+2); clippy + fmt clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reviewer finding (low): the new public `with_cache_dir` setter swapped `cache_dir` but left `cache_signing_key` bound to the original directory's key. That leaves the API semantically incomplete — any future caller using `with_cache_dir(Some(new_path))` would write HMAC-signed entries under `new_path` using the OLD directory's key, breaking the invariant that the signing key is derived from the cache directory. Fix: `with_cache_dir` now calls `load_or_create_cache_signing_key` on the new directory before swapping the field, matching how `RegistryClient::new` initializes both fields together. For `None` the call generates a throwaway key (unused since the cache-path helpers short-circuit on `None`), which keeps the invariant "`cache_dir` fully determines `cache_signing_key`" uniform across both branches. Current Phase 49 callers pass only `None` (walker test isolation), so no behavior change in the shipped paths. The fix is about making the public API safe for future Some(path) use without requiring every caller to remember to separately rotate the key. 137 resolver tests + 124 registry tests + clippy + fmt all clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ome) regression Closes out the code work for Phase 49. §9 + §10 are housekeeping + user-executed bench work; this commit ships everything except the actual bench runs. §9 — F2 bench fixture committed: - `bench/fixture-large/package.json` — 21 direct deps resolving to ~266 transitive packages per preplan §10.1. Pinned so future perf phases anchor on a fixed artifact rather than rebuilding from memory. §10 — ship-gate procedure: - `bench/PHASE-49-SHIP-GATE.md` — the pre-merge checklist, with the full CI gate command sequence, F1 + F2 bench gates, streaming-BFS observability spot-check, and revert rehearsal procedure. References the preplan's §9 single-release rollout rule and §10.2 math-checked targets. Reviewer residual gap closed (on `a6d2493`): - `with_cache_dir_some_path_roundtrips_across_clients` — writes a metadata entry with client A pointed at a tempdir, then reads it back with a fresh client B pointed at the same tempdir. Passing requires: (a) signing-key sidecar persisted in the new dir, (b) `load_or_create_cache_signing_key` re-derives the same key on the second client. Verified to fail under a sabotaged `with_cache_dir` (key rotation removed) and pass under the fix. Remaining work is user-executed: running the F1/F2 interleaved A/B benches and the full CI gate from the ship-gate doc before merge. All 16 Phase 49 commits are code-ready. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ve aliases First F2 direct-mode ship-gate bench came in at ~5.6s vs preplan §10.2's 1.8s target. JSON diagnosed: walker producer window fine, but followup_rpc_count=35 / followup_rpc_ms~11178ms — provider's own batch_metadata_deep follow-up (provider.rs:1121, :1269) was firing 35 times because those names were missing from shared_cache. Root cause: expand_deps_from_info walked info.versions.first() only. PubGrub picks the version satisfying the consumer's range; if an older picked version declares a dep the newest dropped, walker's seen-set never held it, cache never got its manifest, and get_dependencies fell through to serial follow-up RPC inside the solve. Fix: walk UNION of dep names across all versions in info.deps, with per-version alias rewrite (local "npm:target@range" enqueues target, not local label). CanonicalKey-keyed seen-set dedupes; net extra work is bounded by actual historical deps, fetched parallel 50-wide with disk-TTL dedup. Over-fetch is cheap; missing a name is serial RPC inside the solver. Adjacent fix folded in: alias resolution in expand_deps_from_info — pre-49 walker enqueued local alias labels that 404 against the registry (silently logged at debug, but target identity also went missing from cache). Two regression tests, both sabotage-verified: - walker_expands_deps_across_all_versions_not_just_newest - walker_resolves_npm_aliases_to_target_name CI gate: clippy --workspace -D warnings clean, fmt clean, 4378 workspace tests pass, lpm-auth --lib 47/47 deterministic 2x. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Express install on a fresh package.json regressed from ~2s (npm) to 40s on phase-49: 6 of the resolver's `ensure_cached` calls each burned the full 5s `fetch_wait_timeout` waiting on a Notify the walker would never fire. The wait-loop assumed an unenforced invariant — "the walker will eventually insert your key" — and the fall-through cost grew linearly with miss count (60-dep tree → 30s sleep, ~5min on a real Next.js app). This commit closes the handshake. A new `WalkerDone` (`Arc<AtomicBool>`) flag is shared between `BfsWalker` and `LpmDependencyProvider`. The walker fires an RAII `WalkerShutdownGuard` on exit (Drop) that: 1. stores walker_done = true (Release), then 2. iterates notify_map and calls notify_waiters() on every entry. The provider's wait-loop now pins `Notified::enable()` (commits the subscription synchronously, defeating the wakeup race), then re-checks cache → walker_done → timeout. Either ordering of the walker's two shutdown steps wakes a sleeper in microseconds: flag observed first → short-circuit; notify observed first → wake, re-check cache (still miss), re-check flag (now true), short-circuit. The guard pattern is panic-safe: a panic mid-walk still drops the guard, still broadcasts. Without this, a walker panic stranded every sleeper for the full timeout. A new `cache_wait_walker_done_shortcuts` counter exposes the healthy outcome in `timing.resolve.streaming_bfs` JSON. On the express benchmark this counter is 6 — exactly the previous `cache_wait_timeouts` value, confirming each previously-stuck waiter now short-circuits instead. Express benchmark (same fixture, same hardware): perf.w2_resolve_after_roots: 40,153ms → 74ms (~543×) ensure_cached total: 30,016ms → 70ms cache_wait_timeouts: 6 → 0 cache_wait_walker_done_shortcuts: — → 6 wall-clock: 40s → <1s Walker's expand_deps_from_info still uses the union-across-versions discovery from §11; the bounded-union improvement to drop the 6 walker-misses to 0 lands in a follow-up. Tests: 140 passed (added walker_done_broadcast_wakes_sleepers_for_unfetched_keys regression that proves a sleeper subscribed to a key the walker never fetches wakes within 100ms). Gates: cargo check --workspace ✅, cargo clippy --workspace -- -D warnings ✅, cargo nextest run -p lpm-resolver ✅ (140/140), cargo build --release ✅. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ip redundant deep batch
Two follow-ups to §12 that, together, deliver the actual <2s
cold-cache express install. Both rest on a single insight:
**when the walker is attached, IT is the metadata producer; the
in-resolver `batch_metadata_deep` follow-up is redundant work that
races ahead of the walker and pays a Worker round-trip for
manifests the walker is about to insert anyway.**
# Why §11's full union was a regression on its own
§11 widened `expand_deps_from_info` to walk every version's deps so
PubGrub's older-version picks would land in the cache. Measured
in isolation on the express F2 fixture this fixed 35 follow-up RPCs
worth ~11s of pubgrub wall-clock. But the strategy compounds through
BFS depth: each package's full-history dep set pulls in historical
packages whose own histories pull in more.
Express @ ^4.18.0, cold cache, walker_done broadcast attached:
full union (commit 019d99c): 7,146 manifests / 31.8s walker / 33s wall
bounded union top-K=4 majors: 845 manifests / 7.6s walker / 11s wall
newest-only: 68 manifests / 0.6s walker / 22s wall (*)
(*) cold-cache pubgrub blocked on 2x `batch_metadata_deep` Worker
RPCs (express → 31 uncached → 64 fetched, send → 2 uncached → 2
fetched). 21,082ms in pubgrub for 68 manifests is unphysical for
a populated cache — the time was the redundant Worker RPCs racing
ahead of the walker. See below.
# §13.A — drop the union entirely
Switch `expand_deps_from_info` back to newest-only:
let Some(newest) = info.versions.first() else { return };
let ver_str = newest.to_string();
...
§12's `walker_done` broadcast made the older-version-pick path cheap:
when PubGrub picks a non-newest version whose deps the walker didn't
fetch, `ensure_cached` short-circuits via the broadcast in
microseconds and a single `direct_fetch_and_cache` lands the manifest
(disk-cache-warm in the common case). Walking proactively to avoid
those calls is a net loss — over-fetching is what blew F2 up in the
first place.
The `cache_wait_walker_done_shortcuts` JSON counter measures exactly
these recoveries; on healthy installs it tracks the older-version
pick count (6 on express, typically <10 on real trees).
# §13.B — skip the redundant deep batch when a walker is attached
`get_dependencies` had two `batch_metadata_deep` call sites
(root + per-package) that fired *inside* the spawn_blocking that
hosts `pubgrub::resolve`. With a walker attached this is harmful:
the resolver races ahead of the walker, sees its caller's package
deps as "uncached" (because the walker is still on an earlier BFS
level), and fires a Worker RPC that the walker is about to obviate.
Cold-cache express measured this at 2 RPCs / 21s of pubgrub
wall-clock — for a 68-manifest tree.
Gate both call sites on `self.fetch_wait_timeout.is_zero()`:
if self.fetch_wait_timeout.is_zero() {
// pre-49 callers (no walker) keep their batch fast-path
...
}
Pre-49 callers (resolver-only, no walker) retain identical behavior.
Phase-49 callers go straight to the wait-loop, which already
short-circuits on `walker_done` via §12.
# Express benchmark, cold network, all permutations on the same hardware
| walker | skip-batch | manifests | walker | pubgrub | wall |
|-----------------|------------|-----------|--------|---------|-------|
| full union | no | 7,146 | 31.8s | 2.0s | 33s |
| bounded K=4 | no | 845 | 7.6s | 11.0s | 11s |
| newest-only | no | 68 | 0.6s | 21.1s | 22s |
| newest-only | yes (THIS) | 68 | 0.9s | 1.1s | 2s |
Warm-cache rerun (this commit): 8ms total resolve, walker 6ms.
# Tests
- New: `walker_expands_only_newest_version_deps` pins both halves of
the §13 walker contract on a 2-major fixture — newest's deps land,
older's don't (escape-hatch covers them at solve time).
- Removed: `walker_expands_deps_across_all_versions_not_just_newest`
(the §11 union test — explicitly contradicts §13.A).
- Existing `walker_done_broadcast_wakes_sleepers_for_unfetched_keys`
still proves the post-walker shortcut path; the express
integration bench covers end-to-end.
Gates: cargo check --workspace ✅, cargo clippy --workspace -- -D warnings ✅,
cargo nextest run -p lpm-resolver ✅ (140/140), cargo build --release ✅.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`cargo fmt --check` flagged the §12 install.rs change — the two-line `let walker_done: WalkerDone = Arc::new(...)` fits on a single line per rustfmt's default width. No behavior change. Picked up during the pre-push CI gate (cargo fmt --check) before opening the phase-49 PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cratch gitignore Two pre-existing uncommitted changes that surfaced during the PR-#8 CI gate. Bundled here because the precedence.rs change is what unblocks `cargo clippy --workspace --all-targets -- -D warnings` on Rust 1.94.0 (CI's pinned toolchain), and the .gitignore line is trivial scratch-dir hygiene from the same session. # precedence.rs — drops `clippy::doc-lazy-continuation` failure The two test-doc paragraphs at lines 376–383 and 487–493 were authored as bullet-listy prose: /// `force-security-floor = true` + project `scriptPolicy = "allow"` /// + `--yolo` (CLI = `allow`) → effective is `deny` (user default /// floor); ... `pulldown-cmark` parses the leading `+ ` on the second `///` line as a list-item marker, then expects subsequent lines to be indented two spaces under the marker. They aren't, so 1.94's clippy emits `doc list item without indentation` (only 1.94+ has the lint this strict; earlier toolchains let it slide, which is why the lint was latent on main). The rewrite replaces the bullet-style continuation with single- sentence prose that says the same thing without the leading-`+` trigger. No behaviour change; doc-comment text only. CI's exact gate (`cargo clippy --workspace --all-targets -- -D warnings`) goes green after this commit. # .gitignore — `/reproduction` scratch dir The phase-49 streaming-resolver bench harness (`/reproduction/`) is local scratch state — fake `$HOME` + `$XDG_CACHE` directories used to pin the cold-cache path during §12 reproduction. Adding it here so the path doesn't appear as untracked noise in `git status` while benching. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…inux + macOS The `slice5_relative_traversal_escape_rejected_unconditionally` test passed on macOS and failed on Linux CI — pre-existing red on `main`, masked by the runner-OS asymmetry. Surfaced during PR-#8 CI run. # Why it failed on Linux only `load_sandbox_write_dirs` validates entries in three steps, in order: 1. dangerous-root denylist (`/`, `/etc`, `/var/run`, `/run`, home-relative credentials paths) 2. traversal-escape (`..` resolution must stay inside project_dir) 3. user-allowlist intersection The fixture authored `"sandboxWriteDirs": ["../../etc"]` against a tempdir-based project. On Linux CI, `tempfile::tempdir()` returns `/tmp/<x>` so `../../etc` resolves to a real `/etc` — Step 1 fires and the error mentions "veto / system-level configuration" without the word "escape", so the assertion `reason.contains("escape")` fails. On macOS the tempdir lives at `/var/folders/.../T/<x>` so `../../etc` resolves to a non-existent path that isn't on any denylist, Step 2 fires as the test expects, "escape" is in the error, the assertion passes — masking the bug. # Fix Change the fixture to `"../sibling-outside"`. That path: - resolves to a sibling of the tempdir on every platform (`/tmp/sibling-outside` on Linux, `/var/folders/.../sibling-outside` on macOS), neither of which is on any dangerous-root denylist - escapes `project_dir` unambiguously, so Step 2 fires (the actual invariant the test was written to pin: "traversal escape is rejected unconditionally with an empty allowlist") The doc-comment is updated to record the platform-fragility lesson so a future maintainer reaching for `../../etc` again sees the warning. # Verification cargo +1.94.0 nextest run -p lpm-sandbox slice5_relative_traversal_escape ✓ cargo +1.94.0 clippy --workspace --all-targets -- -D warnings ✓ exit 0 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…slice5 fix `cargo fmt --check` on Rust 1.94.0 (CI's pinned toolchain) flagged the `let e = fixture(...)` from 427e1ff — the new `"../sibling-outside"` body fits on one line and rustfmt prefers the collapsed form. The previous fix was verified locally with clippy but I forgot to re-run `cargo +1.94.0 fmt --check` after the edit; CI caught it. No behaviour change; whitespace only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
7 tasks
tolgaergin
added a commit
that referenced
this pull request
Apr 29, 2026
* phase-60 D2: promote download_tarball_routed helpers to RegistryClient
Behavior-preserving refactor extracting the two private routed-tarball
helpers from install.rs (download_tarball_routed,
download_tarball_streaming_routed) onto RegistryClient as public
methods. Both `lpm install` and the upcoming Phase 60 `lpm add` source-
delivery flow consume the same Custom-route auth-attachment logic.
- crates/lpm-registry/src/client.rs: add public methods
- crates/lpm-cli/src/commands/install.rs: switch all 5 call sites to
the new methods; delete the private helpers; remove the now-unused
DownloadedTarball import
All 602 install + npmrc tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* phase-60 60.0.e: PackageMetadata::resolve_version_spec helper
Add a three-tier version-spec resolver on PackageMetadata covering
dist-tag → exact-version → semver-range, mirroring the canonical
pattern at install_global.rs:368-405 verbatim.
Pre-Phase-60, `lpm add react@beta`, `next@canary`, `lodash@^4` all
failed because PackageMetadata::version() is a pure HashMap lookup —
none of those literal strings exist as concrete versions. The new
helper closes the gap.
Per D3 (preplan): both parse-failure and no-satisfying-version
return LpmError::Script (matching install_global verbatim) so the
Phase 60.1 migration of the four duplicate sites (install_global,
install, update_global, global) is a true behavior-preserving
refactor.
9 unit tests cover dist-tag (latest/beta/canary), exact match,
caret/tilde range, no-satisfying error, parse-fail error, and
empty-versions error.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* phase-60 60.0+60.1+60.1.5+60.2: lpm add source delivery from any registry
Decouple `lpm add` from LPM-only package identity, mirror install's
full .npmrc setup, switch to file-spool tarball download, add
destination-side path containment, gate dep auto-install on
lpm.config.json presence, and surface external imports for the simple
path. End-to-end flow now works for any package on any registry the
rust client can reach (lpm.dev worker, npmjs.org direct, .npmrc-
declared private registries).
60.0.a + 60.0.b — Identity refactor + drop dotted-name auto-prepend
- New AddTarget enum: Lpm(PackageName) | Npm { spec: String }.
- New resolve_add_target replaces parse_package_ref. No rewriting
outside the @lpm.dev/ scope — `lodash.merge`, `tolga.foo`, etc.
resolve to AddTarget::Npm verbatim. Fixes a long-standing
correctness bug: pre-Phase-60 dotted bare names were silently
rewritten to @lpm.dev/<name> which doesn't exist on lpm.dev.
- All output / log / JSON sites render via target.display() /
target.json_name() — `name.scoped()` no longer used unconditionally.
- Skills branch type-encoded via `let AddTarget::Lpm(pkg) = &target`
pattern, with a why-comment (60.2) explaining the scope gate
(lpm.dev runs LLM scans on shipped skill content; arbitrary npm
packages are not scanned).
60.0.c — Mirror install's full .npmrc setup
- Build RouteTable::from_env_and_filesystem before any network call.
- Surface npmrc_warnings (non-JSON) and the strict-ssl=false security
warning (escapes --json). Clone the client with with_tls_overrides
so cafile= / strict-ssl=false take effect on metadata + tarball
fetches. Mirrors install.rs:3295-3445.
60.0.d — Routed metadata + file-spool tarball
- Metadata: AddTarget::Lpm uses get_package_metadata; AddTarget::Npm
uses get_npm_metadata_routed.
- Tarball: client.download_tarball_routed (D2 promoted helper) +
lpm_extractor::extract_tarball_from_file. Bounded memory via
MAX_COMPRESSED_TARBALL_SIZE (500 MB) for free; lpm add typescript
(~22 MB) and worst-case @scope/giant-fixture no longer load the
whole tarball into RAM.
60.0.f — Destination-side path containment (D6)
- New resolve_safe_dest helper canonicalizes target_dir once and
validates every write destination: refuses to follow existing
symlinks, rejects writes whose canonical parent escapes the target
root. Wired into the Step 8 file-copy loop. Closes the threat-model
gap that opened up when add expanded from "trusted lpm.dev
publishers" to "any npm publisher."
60.1 — Dep gate + bare-imports notice (D4)
- Tighten dep gate: `if !no_install_deps && lpm_config.is_some()`.
Simple path is download-manager: copy bytes, no auto-install.
- import_rewriter exports a sibling collect_bare_specifiers fn that
shares an internal SpecifierKind classifier with rewrite_imports
(anti-drift contract — "bare" means the same thing in both places).
- add.rs surfaces the collected externals as a non-JSON notice and
as a `external_imports` array in the JSON output.
60.1.5 — Non-interactive simple-path guard
- `lpm_config.is_none() && target_path.is_none() && (yes || json ||
!is_tty)` errors before the file-copy loop. Heuristically defaulting
components/ for arbitrary 3rd-party source under --yes/--json/non-TTY
is a CI/automation footgun.
Tests
- 15 unit tests in add.rs (resolve_add_target classification including
the dotted-name regression; resolve_safe_dest contracts including
symlink-refusal on Unix).
- 10 unit tests in import_rewriter.rs (classify_specifier,
collect_bare_specifiers).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* phase-60 60.3: integration tests for lpm add simple path + guards + traversal
Three new wiremock-driven integration tests covering the highest-value
end-to-end scenarios for Phase 60:
- add_simple_non_interactive_without_path.rs (4 sub-tests) — proves
the 60.1.5 guard fires for --yes, --json, and non-TTY (stdin from
/dev/null) without --path; positive control with --path succeeds.
No package.json mutation in any failure case.
- add_source_npm_simple.rs (2 sub-tests) — full simple-path pipeline
via wiremock npm metadata + tarball: AddTarget::Npm resolves, file-
spool download, extract, files copied flat (no auto-nest), bare-
imports notice lists react + @radix-ui/react-slot, package.json
NOT mutated, .lpm/skills/ NOT created. JSON sub-test asserts the
package.name uses the npm-style identity (not @lpm.dev/-prefixed)
and the new external_imports array is well-shaped.
- add_path_traversal_dest_escape.rs — proves resolve_safe_dest is
wired into the actual write loop, not just unit-tested in
isolation. Tarball ships an lpm.config.json with files[0].dest =
"../../escaped/evil.txt" — assertion: containment-violation error,
exit non-zero, no file written outside target_dir.
Other 60.3 specced tests are either (i) covered by the unit tests
that landed alongside the implementation (#5 dotted-name, #9 version-
spec, #11 symlink — see preplan v6 audit checklist) or (ii)
deliberately deferred where the underlying machinery is already
test-covered by Phase 58.x install tests (#1 lpm.dev rich, #2 npm
rich, #6 npmrc auth, #7 strict-ssl, #8 missing-var fatal).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* phase-60 60.4: README — lpm add now works against any registry
- Update the lpm add one-liner in the Commands list.
- Add a "How lpm add Works" section explaining: source delivery vs.
install, the firm naming rule (@lpm.dev/owner.name only), the rich
vs. simple paths, and the non-interactive --path requirement.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* phase-60 audit fix: resolve_safe_dest must validate before mkdir
Audit reproduced (with a temp-dir filesystem probe) that the landed
resolve_safe_dest helper still created directories OUTSIDE the
target_dir for two attack vectors before the containment error fired:
1. `dest_rel = "../../escaped/evil.txt"` — `Path::join` resolves
lexically; `dest.parent()` lands outside target; `create_dir_all`
ran before the containment check, leaving `<target>/../escaped/`
on disk even though the file write was correctly blocked.
2. Absolute `dest_rel = "/tmp/elsewhere/evil.txt"` — `Path::join` of
an absolute path returns the absolute path verbatim; `parent =
/tmp/elsewhere/`; `create_dir_all` created it before the
containment check fired.
The original integration test only asserted no escaped FILE existed,
so the directory-side-effect bug passed CI.
Fix
- Reorder resolve_safe_dest so EVERY check that can reject the
destination runs BEFORE any filesystem mutation:
Step 1 (NEW) — reject absolute dest_rel up-front.
Step 2 (NEW) — reject any ParentDir / RootDir / Prefix component.
Step 3 — refuse existing-symlink destinations.
Step 4 (NEW) — pre-mkdir ancestor canonicalization: walk up to the
longest existing ancestor; canonicalize; require it under
target_root_canonical (catches symlinked intermediate dirs).
Step 5 — create_dir_all (NOW safe).
Step 6 — post-mkdir re-canonicalize as TOCTOU defense-in-depth.
The lexical bans in Steps 1-2 kill the entire `../escape` and
absolute-path attack classes before any mkdir runs. The longest-
existing-ancestor walk in Step 4 covers the symlinked-intermediate
case (target/foo → /tmp/elsewhere). Step 6 is paranoia.
Tests
- Strengthen unit tests:
- resolve_safe_dest_dotdot_in_path_rejected_with_no_external_dir_created
now asserts no escape directory was created.
- resolve_safe_dest_absolute_dest_rejected_with_no_external_dir_created
is new — covers the absolute-path attack.
- resolve_safe_dest_dotdot_in_middle_of_path_also_rejected covers
`foo/../bar.txt` (lexically resolves back inside but still
rejected up-front).
- Extend integration test:
- dest_escape_via_dotdot_is_refused_and_creates_no_external_directory
now snapshots target_dir entries before the run and asserts no
unexpected new top-level entries appeared, plus no escape dir.
- dest_escape_via_absolute_path_is_refused_and_creates_no_external_directory
is new — covers the absolute-path attack at the integration level.
Net: 4923 → 4926 workspace tests; clippy + fmt clean; all green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
7 tasks
tolgaergin
added a commit
that referenced
this pull request
May 7, 2026
…ng + GraphKey disambiguation Closes the three load-bearing 4d (default-flip) blockers from Phase 66 4b's deferred-items list as a single coordinated change. Without these, every line of Phase 4c would build on top of an empty-peers GraphKey assumption and a `(name, version)`-only key map — exactly the cheap-now / refactor-later trap the user called out. #9 — peer-context threading (resolver → install → linker): - `lpm_resolver::ResolvedPackage` gains `peers: Vec<(String, String)>`, populated from `CachedPackageInfo.peer_deps[version]` intersected against the resolved-versions lookup the resolver already builds for `format_solution` / greedy `into_resolved_packages`. Sorted by peer_name for deterministic GraphKey hashing. Both resolver arms (pubgrub + greedy) populate symmetrically through `compute_resolved_peers` (pubgrub) / inline lookup (greedy, since the node table is the lookup). - `InstallPackage.peers` carries the resolver's output verbatim through `resolved_to_install_packages`. Source-kind paths (Tarball / Directory / Link / lockfile fast-path) populate empty for now; the v2 linker's `ensure_peer_context` re-derives from the just-extracted `package.json` when the field arrives empty, keeping cold-resolve and warm-fast-path producing the same GraphKeys. - `LinkTarget.peers` propagates from `InstallPackage.peers` at every install→link conversion site. v1 ignores the field; hoisted-mode v1 wanting cross-project sharing later can fold it in without further plumbing. #4 — fold peers into the GraphKey: - `lpm_store::v2::GraphKeyInputs::with_peers` now receives the resolved `PeerEntry` list from `LinkTarget.peers` instead of the empty `Vec<PeerEntry>::new()` placeholder. The hash field contract was already in place (`peers` slot in `derive`); we just stop passing nothing into it. - New `with_wrapper_id` setter folds the source-identity disambiguator into the hash so `Source::Registry { foo@1.0.0 }` and `Source::Tarball { foo@1.0.0 from URL X }` produce distinct keys. Empty `wrapper_id` (registry default) preserves the pre-Phase-66 hash so existing v2 store entries don't get invalidated by this addition. #8 — multi-source-same-coords disambiguation in v2 linker key map: - New `KeyMap` type with two indexes — `by_triple` keyed on `(name, version, wrapper_id)` for the consumer's own key lookup, `by_coords` keyed on `(name, version)` for dep / peer edge lookups (which carry only coords today). At construction time, a `(name, version)` collision across distinct `wrapper_id`s surfaces a hard `LpmError::Store` rather than silently aliasing the second target onto the first. Audit- fixtures don't exercise multi-source-same-coords today, so the error is reachable only via a malformed install set; lifting the constraint requires threading wrapper_id through dep edges, a Phase 4 follow-up. v2 linker behavior changes: - `augment_with_peer_edges` renamed to `ensure_peer_context` and rewritten to populate `LinkTarget.peers` (separate Vec) instead of mutating `LinkTarget.dependencies`. The fixed-point closure loop is gone — each consumer's resolved peers is a single per-package fact (the resolver / package.json intersection), not a transitive graph property. Transitive resolution flows through the per-target loop: when peer B is also a LinkTarget, ITS link entry gets ITS own peer siblings. - `populate_one` synthesizes peer-edge sibling symlinks ALONGSIDE dep-edge siblings (peers were previously folded into `dependencies`; now they're a separate pass with explicit dedupe against already-declared deps). - `peerDependenciesMeta.optional` controls trace verbosity for missing peers — required-but-missing emits a debug log pointing at the upstream `check_unmet_peers` gap; optional-missing is silent (npm-compat). Tests: - `link_packages_v2_distinct_keys_for_peer_divergent_projects`: same consumer + same edge graph + DIFFERENT resolved peer versions across two projects must produce distinct GraphKeys (no silent cross-project sharing under peer-pinning divergence). - `link_packages_v2_shares_keys_for_peer_identical_projects`: same consumer + same edge graph + SAME resolved peer version across two projects must produce the same GraphKey (cross- project sharing actually works under peer-pinning agreement — this is the win the v2 rewrite is supposed to unlock). - `link_packages_v2_errors_on_multi_source_same_coords`: malformed install set with two LinkTargets at the same `(name, version)` distinct `wrapper_id` produces a clear `multi-source LinkTarget collision` error rather than aliasing. Pre-merge gate green: - cargo clippy --workspace --all-targets -- -D warnings ✓ - cargo fmt --check ✓ - cargo nextest run --workspace --exclude lpm-integration-tests ✓ (5711/5711 pass; one transient lpm-inspect sqlite-races-under-load flake on first run — rerun clean) - cargo test -p lpm-auth (2× parallel-deterministic) ✓ - audit-fixtures: 17 PASS / 1 SKIP / 0 mixed under both default v1 and `LPM_STORE_VERSION=v2`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tolgaergin
added a commit
that referenced
this pull request
May 14, 2026
…low gate Ships the 10 cross-command flow tests enumerated in v2 baseline, flips EXPECT_FULL_V2_FLOWS_BACKFILL to true so future flow drops hard-fail the audit. Each flow exercises a real-user multi-command sequence and asserts the state-transfer claim that ties the commands together — what command A leaves on disk / in the keychain / in the lockfile is the input command B reads. Single-command tests assert each step in isolation; flow tests catch state-shape mismatches between steps. Flows shipped: - install → patch → patch-commit → install (patch persistence) - migrate → install → audit (lockfile round-trips) - install → rebuild → approve-scripts → rebuild (approval lifecycle) - doctor --fix → install (fix survives install) - add → install → graph (added dep visible) - install → upgrade --major → audit (envelope shape) - token-rotate → publish --dry-run --check (token hand-off) - publish --dry-run --check → publish (target agreement) - install -g → run shimmed binary → uninstall -g (shim lifecycle) - env push → env pull cross-machine (round-trip — scoped to local smoke until a cross-machine harness lands) Several flows had their assertions scoped narrower than the original "catches" claim: - Flow #6 (rebuild lifecycle): rebuild --policy=deny ignores the v2 object form of trustedDependencies that approve-scripts writes — a real contract gap, filed as private finding #75. The flow asserts the manifest mutation; rebuild #2 only checks envelope health. - Flow #4 (upgrade major audit): the workflow tier's MockRegistry helpers don't mount GET /api/registry/{name} per-package (only the batch endpoint), so upgrade's candidate selection finds no candidates. Flow asserts envelope shape; tighten when the mock grows the per-package GET. - Flow #7 (env push/pull cross-machine): proper round-trip needs a shared-vault-state test harness that doesn't exist yet. Flow smokes per-machine env state isolation; promote when the harness lands. - Flow #8 (install -g): gracefully degrades when install-g doesn't emit a shim on the test runner (cli-binary tier owns the strict contract). Run results: 10/10 flow tests pass, all 10 v2 audit tests pass, full lpm-workflows suite green (623/623), clippy clean, fmt clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tolgaergin
added a commit
that referenced
this pull request
May 14, 2026
…ix (#58) * test(workflows): pin concurrency + recovery contracts for lpm install Adds tests/workflows/tests/install_concurrency.rs with 13 falsifiable tests covering production failure modes that had zero coverage: Category A — process racing: * two concurrent installs on same project (pins finding-#77 floor) * install + concurrent store-clean serialize via shared/exclusive store_lock (probed via try_with_exclusive_lock on the actual lock file, not a directory-existence proxy) * two concurrent `lpm install -g` via global_tx_lock — proves final manifest + WAL coherence under serialized commits Category B — interruption recovery: * kill mid-tarball-fetch leaves no .lpm/install-hash * next `lpm install` converges to a coherent end state Category C — network faults: * tarball 503 → 200 succeeds after retry (counting Respond impl) * metadata 404 fails immediately without retry (<2s wall-clock) Category D — filesystem faults: * readonly project dir fails with actionable error (no panic); POSIX-only via #[cfg(unix)], RAII guard restores permissions * `<project>/.lpm` planted as a regular file fails clearly Category E — partial state recovery: * stale install-hash triggers re-resolve + refetch * partial node_modules re-links to full state * truncated lpm.lockb either recovers or fails cleanly (no panic) Category F — WAL recovery hook: * torn WAL tail (3 garbage bytes) gets truncated by the dispatcher's recovery hook before the command runs; idempotent on re-invocation Support helper refactor (same commit so the new helper has callers): * extracts env-isolation set into `LpmEnvSink` trait + `apply_lpm_env(cmd, project)` shared by `lpm()` (assert_cmd) and the new `lpm_spawnable()` / `lpm_spawnable_with_registry()` (std::process::Command, supports Child::kill()) * trait impl on both Command variants ensures the two helpers cannot drift on the ~30 env knobs that gate test isolation Surfaced findings during this work: * #77 — no project-level install lock: concurrent installs silently drop one side's work AND/OR fail with atomic-rename races (3 observed failure modes documented in findings.md). Fix shape: LpmRoot::project_install_lock + with_exclusive_lock_async wrap. * #78 — retry-backoff has no test-friendly knob; retry-exhaustion tests take 15s+. Fix shape: LPM_RETRY_BACKOFF_MS_OVERRIDE env in debug builds. CI gate locally green: clippy --workspace --all-targets -- -D warnings: clean cargo fmt --check: clean fancy-regex ban: empty cargo build --workspace: clean cargo nextest run --workspace --exclude lpm-integration-tests: 6439 passed, 7 skipped, 1 leaky (pre-existing) Deferred (filed under "next session" in the followup plan): B.3 (kill doesn't tear lockfile) — subsumed by B.1/B.2 B.4 (panic injection) — needs LPM_TEST_PANIC_AT env hook C.2 (retry exhaustion) — blocked by finding #78 C.3 (truncated body) — needs custom Respond with Content-Length mismatch D.3 (disk-full simulation) — no portable mechanism F.2, F.3 (orphan WAL, torn WAL with real records) — needs framed-WAL construction helpers Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(workflows): pin lpm.lock well-formedness + recovery skip-on-contention Closes B.3 and F.2 of the concurrency tranche — 13 → 15 tests, meeting the "≥15 of 21" acceptance criterion for Item 2. B.3 — `install_killed_mid_pipeline_leaves_well_formed_or_absent_lockfile`: Exercises two SIGKILL windows on the install pipeline — fresh project and project with a committed lpm.lock from a prior install. After each kill, asserts the on-disk lpm.lock is either absent OR parses as TOML. Never half-written. Adds `toml = { workspace = true }` as a workflow- tests dev-dep for the parse assertion. Helper `assert_lockfile_well_formed_or_absent` shared between both windows. F.2 — `lpm_command_skips_recovery_when_another_lpm_holds_global_tx_lock`: Validates the dispatcher's `try_with_exclusive_lock` idempotent-skip path at `main.rs:2531`. A background thread acquires `global_tx_lock` via `lpm_common::with_exclusive_lock` and blocks on a channel. With the lock held, runs `lpm global list` against a project with a torn- WAL prefix — asserts the WAL bytes are UNCHANGED (skip arm fired, recovery did not run). Then releases the lock and re-runs; asserts the WAL is now truncated (recovery defers correctly to the next lock-free invocation). Exercises both branches of the `try_with_ exclusive_lock` Ok(None) / Ok(Some) arm. CI gate locally green: cargo clippy --workspace --all-targets -- -D warnings: clean cargo fmt --check: clean cargo nextest run --workspace --exclude lpm-integration-tests: 6441/6441 passed, 7 skipped 5x parallel re-run of install_concurrency: 15/15 stable each run Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(workflows): pin truncated-tarball + orphan-WAL recovery contracts Two new tests in tests/workflows/tests/install_concurrency.rs: - C.3 tarball_connection_dropped_mid_body_fails_or_retries: a custom wiremock Respond impl serves half a tarball with a Content-Length header naming the full length. Pins the install pipeline's retry-then-fail behavior on transport-class failures (~14s wall-clock for the full 4-attempt retry schedule). Hyper 1.9 server-side panics on the Content-Length lie, dropping the connection — a valid surrogate for a broken upstream / CDN dropping mid-body. Surfaced 8 tarball GETs per install (deterministic, 3-of-3 reproducer), explained by two distinct download_tarball_* call sites in install.rs each running the 4-attempt retry budget. - F.3 lpm_command_with_orphan_pending_tx_emits_recovery_banner: plants both halves of an orphan transaction (WAL Intent record without matching Commit/Abort + matching [pending.<pkg>] row in manifest.toml pointing at a non-existent install root) and asserts the dispatcher's recovery hook fires the RolledBack banner from main.rs:2543. Sets RUST_LOG=lpm=info to lift the default lpm=warn filter so the tracing::info! line surfaces. Adds lpm-global as a workflow dev-dep for WalWriter / IntentPayload / write_for. Pins post-state: orphan pending row gone, no spurious active row. Together these close the C.3 and F.3 gaps in Item 2 of the test coverage follow-up plan: 17/21 scenarios pinned (was 15/21). The four remaining items all need source-side hooks (LPM_TEST_PANIC_AT, LPM_RETRY_BACKOFF_MS_OVERRIDE, container infra) and are out of scope for this tranche. Full CI gate green: clippy clean, fmt clean, fancy-regex empty, 6443/6443 nextest pass (was 6441 pre-tranche). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(workflows): pin tarball-extraction security contracts at install tier New file tests/workflows/tests/tarball_security.rs ships phase 1 of Item 3 (tarball-extraction security): 5 of 10 planned tests covering the most distinct security contracts at the install-pipeline tier. Each test constructs its malicious tarball in-line via tar::Builder (no checked-in fixtures), serves it through MockRegistry, and runs lpm install end-to-end so any pipeline-level regression that bypasses the extractor's hardening is caught. Tests landed: - #1 tarball_with_dot_dot_path_entry_is_rejected_by_install — pokes package/../escape.txt into the raw tar header bytes; install fails with "path traversal detected"; outside sentinel never created. - #3 tarball_with_absolute_path_entry_is_normalized_to_relative_under_package_dir — renamed from "rejected" to reflect actual contract. The extractor's strip_first_component consumes the RootDir; an entry like /etc/lpm-pwned.txt extracts as node_modules/<pkg>/etc/lpm-pwned.txt. Install SUCCEEDS; literal /etc/lpm-pwned.txt is never written. Defensible: malformed-but-safe input normalized rather than refused. - #2 tarball_with_symlink_to_outside_path_is_silently_skipped — renamed. The is_file() gate at lib.rs:398 silently drops symlinks; install succeeds with byte-identical outside sentinel. - #5 tarball_with_hard_link_to_outside_file_is_silently_skipped — renamed. Same is_file() gate; hardlinks silently skipped; outside victim file unmodified. - #8 tarball_with_setuid_executable_extracts_with_setuid_bit_stripped (POSIX-only) — tarball entry mode 0o4755 extracts as 0o755. SUID, SGID, and sticky bits all cleared via set_preserve_permissions(false) + the explicit `0o644 | exec_bits` mode set after write. Exec bits preserved. Three tests carry a "plan-vs-actual" docstring section explaining why the rename is defensible — the actual extractor contract differs from the plan's prescribed phrasing in safe ways, not in regression-grade ways. No findings filed. Phase 2 (5 remaining tests: Unicode normalization, device file, FIFO, zero-byte sanity, OS-max path) is deferred to a follow-up tranche with rationale + lift estimate documented in the plan. None blocks phase 1 acceptance. Pre-merge gate green: clippy clean, fmt clean, fancy-regex empty, 6448/6448 nextest pass (was 6443; +5 for the new tests). 0.18s wall- clock for the full file. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(install): per-project lock prevents concurrent-install data loss Closes finding #77. Two `lpm install <pkg>` invocations on the same project no longer race on the manifest snapshot+commit window. Pre-fix, both processes acquired only a SHARED store_lock and proceeded in parallel. Each opened its own per-process ManifestTransaction snapshot of the pre-edit package.json, staged its own dep on top, and ran the install pipeline. Whoever wrote package.json + lpm.lock last won; the other process's edits — including its node_modules link — silently vanished. Both processes still exited 0 with success-path output. CI scripts that ran two installs in parallel saw no signal of the data loss. The fix introduces: - crates/lpm-common/src/paths.rs::project_install_lock(project_dir): free helper returning <project_dir>/.lpm/.install.lock. Re-exported from crates/lpm-common/src/lib.rs. - run_add_packages and run_install_filtered_add in crates/lpm-cli/src/commands/install.rs now wrap the snapshot → stage → install → finalize → commit window in with_exclusive_lock_async against the project lock. The lock is per-project (no cross-project contention) and held across all ?-early-exits via the async block's return. For the workspace path, the lock sits at the discovered workspace root (not per-member) so two concurrent `lpm install --filter <member>` invocations on the same workspace serialize without per-member deadlock-ordering complexity. run_with_options (the inner install pipeline) does NOT acquire this lock — it's called from inside both run_add_packages's wrap and from many other commands; double-acquiring the same fd-lock would deadlock in-process. Deferred (phase 2, not exercised by A.1): lpm add (add.rs:723-904) has a similar 180-line transaction with recursive Swift handling. Wrapping it is invasive and the race surface is theoretical (users don't typically run `lpm add` and `lpm install` concurrently). Defer to a separate tranche if a concurrent `lpm add` × `lpm install` race is ever observed. Test contract tightening (bug-first per CLAUDE.md): two_concurrent_installs_on_same_project_leave_well_formed_manifest in tests/workflows/tests/install_concurrency.rs went from "at-least-one survives + manifest is well-formed JSON" (the floor) to "BOTH installs succeed, BOTH packages present in package.json deps, BOTH packages linked in node_modules/" (the contract). Pre-fix: 1/1 fail (pkg-b silently dropped). Post-fix: 5/5 pass with no flakes (~1.2s wall-clock each — install B observes pkg-a's commit and reports "Resolved 2 packages"). Pre-merge gate green: clippy --workspace --all-targets clean, fmt clean, fancy-regex empty, 6448/6448 nextest pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(registry): test-only retry-backoff override env knob Closes finding #78 + lands C.2 (`tarball_503_exhausts_retries_fails_with_http_status`). Pre-fix, retry-exhaustion tests were blocked: the registry client's backoff schedule (1+2+4+8s, capped at 10s) made every retry-exhaustion test take ~15s per fetch site (~28s with the install pipeline's 2 distinct download_tarball_* call sites). MAX_RETRIES, RETRY_BASE_DELAY, and RETRY_MAX_DELAY are private const with no env override. C.2 therefore had to be #[ignore]-gated behind LPM_RUN_SLOW_TESTS=1, and the retry-exhaustion contract went unproven on `cargo nextest run`. The fix introduces: - crates/lpm-registry/src/client.rs::backoff_override(): reads LPM_RETRY_BACKOFF_MS_OVERRIDE (a u64 ms value) gated by cfg!(debug_assertions) || LPM_TEST_MODE=1. Returns Some(Duration) when both conditions hold; None otherwise. Production retry policy is immune — release builds without LPM_TEST_MODE=1 silently ignore the env. - backoff_delay(attempt) consults the override before computing the exponential schedule. - The two 429 Retry-After sleep sites also consult the override so a future 429-flood retry-exhaustion test wouldn't hang on the server-supplied header. C.2 test landed alongside (bug-first per CLAUDE.md): - Mock returns 503 on every tarball request — no recovery path. - Test sets LPM_RETRY_BACKOFF_MS_OVERRIDE=10 on the lpm subprocess. - Asserts: install fails non-zero, no panic, ≥4 attempts (proves the retry loop fired), elapsed < 2s (load-bearing — without the knob this fails at ~14s), stderr contains an actionable HTTP-class noun (503 / status / http / network / etc). - Surfaces 8 tarball GETs per install (4 attempts × 2 distinct download_tarball_* call sites — matches C.3's observation). Pre-fix verification: same C.2 against the unfixed client.rs failed on the elapsed assertion at 14.04s (knob ignored). Post-fix: passes in 1.6s cold / 0.1s warm. 5/5 passes with no flakes. Pre-merge gate green: clippy --workspace --all-targets clean, fmt clean, fancy-regex empty, 6449/6449 nextest pass (was 6448 pre-fix; +1 for C.2). Item 2 of the test-coverage-followup-plan now at 18/21 (was 17/21). Both findings #77 and #78 fixed in production. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(workflows): tarball-security phase 2 — Unicode, device, FIFO, zero-byte, long-path Adds 5 more tests to tarball_security.rs, completing Item 3 of the test-coverage follow-up plan. Each test pins the actual extractor contract under malicious-or-edge-case tarball shapes that reach the install pipeline through MockRegistry. Tests landed: - #4 tarball_with_unicode_lookalike_parent_dir_extracts_safely_as_literal_bytes — renamed from "_normalization_traversal_rejected" to reflect the actual contract. Tarball entry path uses full-width dots U+FF0E `..` (bytewise NOT ASCII `..`). Component::ParentDir is byte-exact, so `..` becomes Component::Normal. Install SUCCEEDS; `..` materializes as a literal directory under node_modules/<pkg>/; outside sentinel byte-identical. Defensible because Path::components() doesn't NFKC-normalize on POSIX. - #6 tarball_with_character_device_entry_is_silently_skipped (POSIX-only). EntryType::Char with /dev/null-shaped major/minor. Same is_file() gate as symlinks/hardlinks — silently skipped. Install SUCCEEDS; no device file at the expected path. - #7 tarball_with_fifo_entry_is_silently_skipped (POSIX-only). EntryType::Fifo. Same posture as #6. - #9 tarball_with_zero_byte_regular_file_extracts_as_empty_file. Sanity check that empty files still extract correctly (legitimate npm shape: .gitkeep, license placeholders). - #10 tarball_with_single_path_component_exceeding_name_max_fails_cleanly. 300-byte single-component name, well over POSIX NAME_MAX=255. Tar wire format succeeds via GNU long-name extension; the FILESYSTEM rejects on extraction (ENAMETOOLONG). Extractor wraps as LpmError::Io → install fails non-zero with the OS error visible and an actionable noun in stderr. Three of the five tests are renamed to reflect actual extractor contract vs the plan's prescribed phrasing — same "plan-vs-actual" docstring pattern as phase 1. No findings filed; all 10 contracts across phase 1 + 2 are defensible-as-implemented. Pre-merge gate green: clippy --workspace --all-targets clean, fmt clean, fancy-regex empty, 6454/6454 nextest pass (was 6449 pre-tranche; +5 for the new tests). Full file 0.2s wall-clock for all 10 tests. Item 3 now COMPLETE (10/10). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(workflows): cross-command flows Item 4 — migrate→rebuild + workspace filter isolation Closes Item 4 of the test-coverage-followup-plan at 6/6 (target was ≥5). Two additions to tests/workflows/tests/cross_command_flows.rs: - Plan #1 — extended flow_migrate_install_audit_lockfile_round_trips with a `lpm rebuild --dry-run --policy=deny` step. Pins the full migrate → install → audit → rebuild lifecycle. Asserts the rebuild step exits 0 + does not mutate the post-audit state (lpm.lock + lpm.lockb still present). Catches regressions where rebuild's lockfile or build-state parser breaks against a freshly-migrated manifest. - Plan #5 — added flow_workspace_install_filter_member_a_does_not_mutate_member_b (new test, 159 LOC). Pins the workspace-member isolation contract using the workspace-monorepo fixture (3 members: app, core, utils): 1. Initial filtered install on @test/core (re-pinning its existing semver dep) populates core's per-member quadruple: lpm.lock=319 B, lockb=230 B, install_hash=118 B. 2. Snapshot core's full quadruple. 3. Run `lpm install chalk@5.3.0 --filter @test/app` to add a new dep to app ONLY. 4. Assert app's package.json gained chalk; core's quadruple (package.json + lpm.lock + lpm.lockb + install-hash) is BYTE-IDENTICAL post-install; chalk does NOT appear in core's node_modules/. Catches a regression where a per-member filtered install accidentally also mutates a sibling member's package.json / lockfile / install-hash — a real bug class because run_install_filtered_add shares the workspace-root project lock (added in #77 fix) and could over-snapshot if the target-set computation drifts. Helper `mount_pkg_full(mock, name, version)` factors out the three-step metadata + batch-metadata + tarball mount so the test body stays readable. Other 4 plan flows already covered pre-tranche: - Plan #2: flow_add_install_graph_added_dep_visible - Plan #3: flow_install_patch_patch_commit_install_persists_patch - Plan #4: flow_token_rotate_publish_dry_run_picks_new_token - Plan #6: flow_install_upgrade_major_audit_picks_new_version Pre-merge gate green: clippy --workspace --all-targets clean, fmt clean, fancy-regex empty, 6455/6455 nextest pass (was 6454; +1 for the new flow). Plan #5 stable across 5/5 reruns at ~0.11s each. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(install): LPM_TEST_PANIC_AT hook + B.4 panic-rollback contract Adds a deterministic panic-injection hook to the install pipeline + unblocks the long-deferred B.4 contract test for ManifestTransaction Drop-based rollback on panic. The hook (`maybe_test_panic(stage)` in crates/lpm-cli/src/commands/install.rs) reads LPM_TEST_PANIC_AT and panics when the env value matches the stage name. Gated to `cfg!(debug_assertions) || LPM_TEST_MODE=1` — same pattern as the #78 retry-backoff override. Production builds without LPM_TEST_MODE=1 silently treat the env as no-op. Wired 4 stages in `run_add_packages`: - "after-snapshot" — manifest unchanged; Drop is no-op - "after-stage" — placeholder `*` written to package.json (load-bearing) - "after-install" — pipeline complete; manifest still has `*` - "after-finalize" — concrete versions written; pre-commit only The hook unblocks B.4 (`install_panics_mid_pipeline_rollback_restores_manifest`), deferred since the original Item 2 tranche because there was no deterministic way to trigger a panic mid-install from a workflow test. Recoverable errors fire `?`-rollback (covered by E.1/E.2/E.3); SIGKILL bypasses Drop entirely (B.1/B.2/B.3 cover that). The panic path was the missing rollback proof. B.4 sets LPM_TEST_PANIC_AT=after-stage and asserts: - process exits non-zero (panic propagates to runtime) - stderr contains `"panicked at"` AND `"LPM_TEST_PANIC_AT=after-stage"` - package.json BYTE-IDENTICAL to pre-stage (Drop ran on unwind, snapshot bytes restored — load-bearing) - the new pkg is NOT in dependencies (placeholder rollback worked) - .lpm/install-hash absent (invalidate-on-rollback) - lpm.lock absent (matched optional snapshot's None pre-state) Catches a regression where: - panic = "abort" added to release profile (no Drop on panic) - ManifestTransaction Drop logic stops restoring snapshot bytes - The `lpm install` snapshot+commit window grows without re-wiring Drop Test runs in 0.07s warm. 5/5 stable across reruns. Pre-merge gate green: clippy --workspace --all-targets clean, fmt clean, fancy-regex empty, 6456/6456 nextest pass (was 6455; +1 for B.4). install_concurrency now at 19/19. Item 2 of test-coverage-followup-plan moves to 19/21 — only A.2 (no contract) and D.3 (needs container infra) remain deferred indefinitely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(workflows): align MockRegistry tarball URL shape with production /-/ gate Workflow tests mounted tarballs at `/tarballs/{name}-{version}.tgz` — missing the `/-/` path segment that the registry-client's `evaluate_cached_url` gate at [crates/lpm-registry/src/client.rs#L4117] requires (`.tgz` suffix AND `/-/` substring). The gate is a defense-in-depth check that blocks the H1 auth-token leak: a tampered lockfile URL like `/api/admin/foo.tgz` (no `/-/`) would otherwise attach the bearer to a non-registry endpoint. The mismatch produced two test-environment side effects that don't manifest in production: 1. **WARN noise**: every install test that read a tarball URL from the lockfile fast path logged `cached tarball URL for X@Y failed shape check; falling back to on-demand lookup`. Polluted stderr across the suite. 2. **`shape_mismatch_count` defeated**: the registry-client documents this counter as a "BUG signal — the writer should never emit a gate-rejectable URL". Test runs incremented it on every install, making the counter useless for catching real bugs. This commit migrates the mock to the production-shape `/tarballs/{name}/-/{name}-{version}.tgz` everywhere — both the helper methods (`MockRegistry::tarball_path` / `tarball_url`) and the ~60 hard-coded `format!` sites across 14 test files + 1 snapshot. The new `tarball_path` helper is `pub` with a prominent docstring warning future test authors not to re-introduce the legacy shape. Internal mounts in `with_package_and_deps` / `with_package_published_at` / `with_full_package_metadata` all route through it. Post-fix verification: WARN gone, gate `Accepted` path runs, all 691 lpm-workflows tests pass (0 leaky in the latest full-workspace run, down from 1-3 leaky pre-fix — fewer fallback paths firing). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(workflows): test-coverage-followup tranche — Items 2/3/4/5 Closes the remaining open rows from `private/test-coverage-followup-plan.md` across four items. ~2,600 LOC of new test code + fixture + budget infra. **Item 3 — tarball-security additional candidate surfaces (7 tests in `tarball_security.rs`):** - `tarball_with_pax_path_traversal_rejected` — PAX extended `path` header smuggling `..` is rejected by the extractor's `Component::ParentDir` check after the tar crate resolves the override. - `tarball_with_gnu_longname_traversal_rejected` — symmetric GNU `L` entry; same rejection path. - `tarball_rejects_or_rolls_back_when_later_entry_is_malicious` — pins the `rollback_extraction` contract: valid first entry is cleaned up when a later `..`-traversal entry trips rejection mid-stream. - `tarball_with_duplicate_member_path_rejected_or_deterministic` — pins current last-write-wins contract (defensible; flagged scanner- disagreement risk in test comment). - `tarball_with_truncated_gzip_rolls_back_partial_extract` — half- truncated gzip stream → libdeflate fails cleanly → no partial extract. - `tarball_ignores_uid_gid_ownership_metadata` (POSIX) — bogus uid/gid in tar header is ignored; extracted files owned by process uid. - `tarball_with_sparse_huge_file_rejected_by_declared_size` — manually- constructed tarball with header declaring `MAX_FILE_SIZE + 1` and empty on-wire body; extractor rejects on the pre-check at lib.rs:306 before draining body. **Item 4 — cross-command flows additional candidate surfaces (2 tests in `cross_command_flows.rs`):** - `flow_install_uninstall_install_graph_round_trip` — pins manifest / link / graph hand-off through a full round-trip. - `flow_cache_clean_then_offline_install_uses_store_or_fails_helpfully` — pins the cache/store boundary: `cache clean` must not corrupt offline install; store-side bytes byte-identical after a clean. **Item 2 — concurrency/recovery additional candidate surfaces (3 tests in `install_concurrency.rs`):** - `cache_clean_during_slow_tarball_install_does_not_corrupt_install` (G.4) — install + cache clean run concurrently (different lock paths, no serialization); install succeeds despite metadata cache wipe mid-stream. Empirical timing observed: install elapsed 1.57s, cache clean fired at t=30-39ms cleanly inside the install window. - `install_panics_after_install_hash_write_rollback_invalidates_hash` (G.5) — reuses existing `LPM_TEST_PANIC_AT=after-install` stage (no new source-side hook needed — `write_post_install_v6_hash` runs inside `run_with_options` which returns BEFORE that stage fires). Pins that Drop-based rollback restores manifest AND deletes the freshly-written install-hash. - `malformed_registry_json_fails_without_manifest_or_lockfile_mutation` (G.6) — truncated JSON on all three metadata endpoints; install fails cleanly, no panic/backtrace, package.json byte-identical, no torn lockfile. **Verdaccio-npm parity for `which@4.0.0` (`install_real_registry.rs`):** - `verdaccio_npm_parity_for_bin_package_pins_metadata_and_shim_presence` — extends the existing lodash byte-diff with a bin-shipping target package. Asserts metadata equivalence + `.bin/<name>` shim present on both sides + bin target file materialized + exec bits non-zero (POSIX). **Item 5 — realworld fidelity (new fixture + new test file):** - `tests/fixtures/realworld-nextjs/` (package.json + README) — pinned Next.js 14.2.13 + React 18.3.1 + TypeScript 5.6.3 + 3 `@types/*` packages. Resolves to ~28 transitive deps empirically. README documents the calibration methodology including raw measurement data. - `tests/workflows/tests/install_realworld.rs` — `install_realworld_nextjs_fixture_succeeds_through_verdaccio` installs the fixture through Verdaccio→npmjs and asserts end-to-end success at production scale. Always logs cold + warm wall-clock + peak RSS to stderr for calibration data. - **`LPM_BUDGET_GATE=1`-gated budget assertions**: cold ≤ 25s, warm ≤ 25ms, cold peak RSS ≤ 1500 MiB. Calibrated from N=6 cold + N=3 warm + N=3 RSS runs on M-series macOS, 2026-05-14. Memory measurement via `/usr/bin/time -l` (macOS) / `-v` (Linux); Windows skips with a clear warning. This closes Item 5 entirely (all 4 acceptance criteria green) and brings Items 2/3/4 to the parked-by-design or infrastructure-blocked baseline. CI gate: clippy `--workspace --all-targets -- -D warnings` clean, fmt clean, fancy-regex empty, build clean, `cargo nextest run --workspace` 6471/6471 pass. Suite runtime ~2:40 (was ~2:24 pre-tranche; +15s for the realworld test). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(workflows): collapse Linux-only let-chain in parse_peak_rss CI lint on Linux failed on `clippy::collapsible_if` in the Linux-cfg'd branch of `parse_peak_rss`. The macOS branch had an intermediate `let bytes_str = rest.trim();` between the two `if let`s, which is why the local clippy run on macOS didn't catch this — only the macOS-cfg branch compiled there. Collapse the Linux branch to use `&&` (stable let-chains) so it satisfies the lint while preserving the same semantics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Phase 49 ships the client-side streaming BFS metadata resolver that replaces the pre-49 server-mediated
batch_metadata_deepflow as the cold-install hot path. The walker fetches manifests level-by-level into a sharedDashMapcache that the PubGrub solver reads synchronously through a per-canonicalNotifywait-loop. Three architectural shifts:@lpm.dev/*(entitlement, audit, signing) and is opt-in viaLPM_NPM_ROUTE=proxyfor users who need it.DashMapkeyed byCanonicalKey; the walker inserts under canonical names, the provider canonicalizes at every read so split-retry passes hit the same cell. Closes the silent notify-miss documented in preplan §4.2.WalkerDone: Arc<AtomicBool>flag is set (Release) at the end ofBfsWalker::runvia an RAII guard that also broadcastsnotify_waiters()across everynotify_mapentry. The provider'sensure_cachedwait-loop pinsNotified::enable()and re-checkswalker_doneafter each cache miss — when the walker terminates without inserting a key (e.g. older-version transitive PubGrub later picks), the wait-loop short-circuits to the escape-hatch fetch in microseconds instead of burning the fullfetch_wait_timeout.Bench shape (express @ ^4.18.0, cold cache, same hardware)
Direct mode now sits at par with pnpm and faster than npm on the standard
cold-install-cleanbench (17 deps → 51 packages, N=5 median): npm 1388 ms / pnpm 762 ms / bun 309 ms / lpm-direct 749 ms / lpm-proxy 2902 ms.Notable commits in this branch
bc56afe feat(phase-49): CanonicalKey — context-free cache key for streaming BFSa9766c5 feat(phase-49): routing primitives — RouteMode + direct-tier npm fetch + parallel fan-out200ec16 feat(phase-49): provider rewire — canonical-keyed DashMap cache + wait-loop + resolve.rs downstream5449b1d feat(phase-49): BfsWalker — level-by-level streaming metadata prefetch0c7358b feat(phase-49): install.rs orchestration — walker + dispatcher + resolve concurrentbcaaf4e feat(phase-49): timing.resolve.streaming_bfs JSON output (§6)eedfe72 fix(phase-49): CAS on ceiling closes multi-429 ratchet race019d99c fix(phase-49): §11 walker must union deps across all versions + resolve aliasesb37c1f7 fix(phase-49): §12 wait-loop shutdown handshake — walker_done broadcastfba1cd4 fix(phase-49): §13 walker is the metadata producer — newest-only + skip redundant deep batch20 commits total.
git log main..HEADfor the full set.Companion fix on lpm.dev (separate repo, already deployed)
Diagnosed during phase-49 proxy-mode validation: a Drizzle named prepared statement (
db.select(...).prepare("validate_registry_token")) was wedging on transaction-mode pgBouncer pooling becausedb/client.jsruns postgres.js withprepare: false. Symptom: every endpoint validating a reallpm_*token (batch-metadata, /api/user/me, per-org token routes, …) hung for 30 s+ until the upstream proxy gave up. The lone.prepare()call site in the entire codebase. Replaced with inline parameterized SQL; theregistry_tokens_hash_idxbtree keeps it sub-millisecond. Shipped ontolgaergin/a-package-manager@8f8b9657, deployed.That fix is what made
LPM_NPM_ROUTE=proxybenchable here.Test plan
cargo clippy --workspace -- -D warnings✅cargo fmt --check✅cargo build --workspace✅ (zero warnings)cargo nextest run --workspace --exclude lpm-integration-tests✅ 4379/4379 passed, 7 skippedcargo test -p lpm-auth× 3 ✅ (47/47 each, deterministic across reruns — no keychain race)fancy-regex) ✅Out of scope (for follow-up phases)
Accept: application/x-ndjson+deep:true) for pure-npm batches in proxy mode would let proxy benefit from the Worker-native streaming the eligibility check atworkers/registry/src/batch-metadata.js:230was built for. ~140 ms in my probe vs the current ~1.9 s JSON shape.🤖 Generated with Claude Code