fix(registry): NDJSON parse loop — O(n²) scan + wall-clock timeout#3
Merged
Conversation
`RegistryClient::new` configured `reqwest::ClientBuilder::timeout(30s)` —
a wall-clock cap covering the entire request + response cycle, body
read included. On the decision-gate fixture (54 direct deps, ~66 MB
deep NDJSON response) the server legitimately takes 30+ seconds to
stream the body, so the timer fired mid-body at ~51 MB / 7500 chunks
and every cold install above ~40 roots logged `WARN batch prefetch
failed, falling back to sequential resolution (slower): registry
error: NDJSON read error ... error decoding response body <- request
or response body error <- operation timed out`.
Root-cause diagnosis: expanded `reqwest::Error` Display to walk the
full `source()` chain so the kind-only top-level string stops hiding
the hyper-level cause. The chain surfaced `operation timed out` as
the real reason. Cross-checking CPU vs wall-clock timing on the same
fixture showed parse (85 ms) + cache_write (88 ms) = 173 ms total
CPU, i.e. the remaining 29.8 s was genuinely network wait. Server
is legitimately slow for this response size; the old wall-clock cap
is the wrong tool.
Fix: swap `.timeout()` for `.connect_timeout(10s) + .read_timeout(30s)`.
`read_timeout` is a per-read idle timer that resets on each
successful chunk, so a slow-but-progressing stream completes intact;
a genuinely stalled server still gets interrupted at 30 s without a
read.
Behavioral regression tests (new, in `lpm-registry` client tests):
- `batch_metadata_deep_tolerates_slow_streaming_body_under_read_timeout`
— drives a custom TCP-level streaming mock server that sends
4 NDJSON chunks at 200 ms apart (800 ms total) through a client
configured with `read_timeout=500ms`. Pre-fix wall-clock would
kill at 500 ms; post-fix the request completes with 4 entries.
- `batch_metadata_deep_fails_under_old_wallclock_timeout` — same
server, this test instantiates the OLD-style client via a raw
`reqwest::Client::builder().timeout(500ms)` and asserts the
request aborts with a timeout-sourced `Registry` error. Pinned
as a forward-regression guard: anyone re-introducing `.timeout()`
on the prod builder trips this test.
Post-fix prod measurements (lpm.dev, 3 cold runs each, median):
Fixture | pre-fix total_ms | post-fix total_ms | WARN? | followup_rpc_count
----------------+------------------+-------------------+-------+---------------------
51-pkg (Phase 39)| ~573 | 542 | none | 1
280-pkg (Phase 40)| ~6 500 | 6 759 | none | 20
decision-gate | 44 054 | 46 454 | none | 45 (was 73)
On decision-gate specifically, total wall-clock grew ~2.4 s because
pre-fix was silently short-circuiting to sequential resolution at
the 30 s mark (fetch was then slower from individual lookups:
3 269 ms pre-fix vs 1 244 ms post-fix — a 2.0 s fetch win). Net is
roughly flat, but with the Phase 38 P3 speculation dispatcher now
actually engaged — 0 spec tasks dispatched pre-fix vs 345 / 345
post-fix (288 transitive, max_depth_reached 5). Previously, every
cold install of a fixture above ~40 roots quietly disabled the
speculation path the Phase 38 code was designed to run. The
`followup_rpc_count` drop (73 → 45, −38 %) reflects the metadata
cache now being populated by the full deep batch rather than
piece-by-piece.
Diagnostic instrumentation added (kept for future incidents):
reqwest error source-chain walk + per-chunk byte/chunk counters in
the warn string. Cheap (runs only on error), load-bearing for
future transport failures.
Metadata-bloat sensitivity finding (investigated during diagnosis,
noted here for future optimization — out of scope for this PR):
`lpm-resolver::ranges::to_pubgrub_ranges(&available_versions)` runs
on every PubGrub `get_dependencies` call and is O(N) in the number
of versions for that package. Not memoized. Adding metadata for K
new packages × avg 50 versions each × ~3 PubGrub backtrack queries
per package matches the +962 ms pubgrub_core_ms regression measured
during Phase 41's A/B (see phase41 postmortem). Memoizing
`(pkg, range) → Ranges` would cut that factor. Track as follow-up.
CI gate locally green:
- cargo clippy --workspace -- -D warnings
- cargo fmt --check
- fancy-regex ban
- cargo build --workspace
- cargo nextest run --workspace --exclude lpm-integration-tests
(3641 passed, 7 skipped, 0 failed)
- cargo test -p lpm-auth x2 (43 passed both runs, parallel-deterministic)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to the read_timeout fix in this same PR. With the timeout
bug out of the way, the decision-gate cold install STILL showed a
~5× gap between the raw reqwest body drain (~7 s for 66 MB at
9.5 MB/s) and our parse loop (~40 s at 1.7 MB/s). This commit
closes that gap.
## Narrowed with sub-timers
Added temporary per-phase timers around the NDJSON parse loop and
reran the decision-gate cold install. On a 22-second run that
consumed 76 MB / 11 000 chunks / 365 entries, the breakdown was:
phase ms
----------- ----
chunk_wait 351 (reqwest)
extend 18
scan 21088 ← 95% of the loop
utf8 3
parse 253
cache_write 306
clone 116
send 1
map_insert 0
drain 0
total 22146
Every CPU path OTHER than scan summed to ~700 ms. The scan alone
burned 21 s. This empirically falsifies several popular theories about
where the tax lived (channel backpressure — channel is size 512, not
32; sync cache writes — only 306 ms; speculation bandwidth
competition — same 40 s with `LPM_SPEC_FETCH=0`).
## Root cause
The inner `while let Some(newline_pos) = buffer.iter().position(|&b|
b == b'\n')` restarted from offset 0 on every iteration. With an
average NDJSON line of ~200 KB arriving over ~30 chunks of ~9 KB, we
re-scanned the same bytes every time a new chunk landed:
chunk 1: scan 9 KB → no newline
chunk 2: scan 18 KB (9 KB re-scanned)
chunk 3: scan 27 KB (18 KB re-scanned)
...
chunk 30: scan 270 KB → newline found
Per-line cost: sum(i × 9 KB for i in 1..30) ≈ 4 MB scanned.
Across 365 lines: ~1.5 GB scanned.
At `[u8]::iter().position()`'s ~74 MB/s throughput (bounds-checked,
non-SIMD): ~21 s. Matches measurement exactly.
## Fix
Track a `scan_from: usize` cursor marking the first byte we haven't
yet inspected for a newline. After each chunk append, resume scanning
from `scan_from` instead of 0. When a line is drained, reset to 0
because the remaining bytes shifted. Net: every byte scanned at most
once → O(total_bytes) ≈ 66 ms for 66 MB.
## Post-fix measurements (lpm.dev, cold installs)
fixture | pre-scan-fix total_ms | post-scan-fix total_ms | delta
---------------|----------------------:|-----------------------:|--------
51-pkg | 542 | 516-671 | ~same
280-pkg | 6759 | 3227-3645 | −49%
decision-gate | 46454 | 6704-7136 warm |
| | 15006 cold run | −85% warm
| | | −68% cold
Sub-breakdown on decision-gate (warm cache, 2 of 3 runs shown):
metric | pre-fix | post-fix (run 2) | delta
--------------------|--------:|-----------------:|--------
initial_batch_ms | 39 977 | 2 942 | −93%
followup_rpc_ms | 3 732 | 1 639 | −56%
followup_rpc_count | 45 | 45 | same
fetch_ms | 1 244 | 1 221 | same
resolve_ms | 44 462 | 5 383 | −88%
initial_batch_ms now within ~1.5× of the raw reqwest drain (which
tops out at ~30 MB/s warm → ~2 s for this body). followup_rpc_ms
dropped too because those follow-up RPCs run the same parse loop
and were paying the same quadratic tax.
Bun's reported ~3.6 s on this fixture is now 2× away, down from 12×.
## Tests
Both streaming regression tests from the earlier timeout fix still
pass. The `iter().position` scan quadratic was present in pre-scan-fix
runs of the timeout-tolerance test too, but the test uses only 4
chunks × 200 ms = 800 ms, where the quadratic cost is still sub-ms
and doesn't change test outcomes. The post-fix test completes ~20 ms
faster but nothing load-bearing shifts.
## On Gemini's speculation (2026-04-17)
An independent second-opinion review attributed the 5× gap to
`mpsc::channel(32)` backpressure, synchronous line-by-line I/O, and
reqwest buffer allocation patterns. Checking each: channel was
actually 512, not 32; cache-write + parse totalled 560 ms not
1.6 s; raw reqwest benched at 30 MB/s on warm runs (no allocation
problem). The actual cause was a classic O(n²) algorithm bug in our
own code that looked like a transport bottleneck from the outside.
Noted in `DOCS/new-features/37-rust-client-RUNNER-VISION-phase41.md`
follow-ups; Phase 42's "ingestion throughput" work item is now
closed in a single commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tolgaergin
added a commit
that referenced
this pull request
Apr 17, 2026
Phase 42 — resolver correctness + perf fix + algorithmic insurance. Since v0.19.1: - **fix(registry)** — Phase 42 P0/P1 NDJSON parse loop — O(n²) scan + wall-clock timeout (#3). Decision-gate install 46 s → 6.7 s (7× speedup) when the registry streams large batch-metadata responses. The O(n²) scan was the dominant cost in NDJSON parsing for large packuments; a rolling-offset rewrite brings it to O(n). Wall-clock timeout defends against slow-emit registries stalling the resolver. - **fix(install)** — Phase 40 P4 split-context dedupe (#2). When two sibling parents produced the same grandchild under different split contexts, the grandchild was duplicated in the fetch/link plan. Dedupe on canonical `(name, version)` before fetch dispatch. No user-visible lockfile change. - **perf(resolver)** — Phase 42 P2 `NpmRange → pubgrub::Ranges` memoization (#4). Null-result in benchmarks but shipped as algorithmic insurance: the conversion is O(m) per call and was re-computed on every PubGrub visit. Memoized table eliminates redundant work. Neutral on current fixtures, protective against pathological cases. - **chore(ci)** — Node.js 24 opt-in for JavaScript actions ahead of GitHub's 2026-06-02 forced-default. No breaking changes. Lockfile compatibility unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tolgaergin
added a commit
that referenced
this pull request
Apr 18, 2026
Three tightenings after the 2026-04-18 GPT audit of P43-1: ## Finding #1 (Medium) — stale v1 `lpm.lockb` persists across reads `read_fast` falls back to TOML when binary open returns `UnsupportedVersion`, but previously left the v1 file on disk. `read_fast` is called from `lpm install`, `lpm upgrade`, AND `lpm outdated`. Read-only commands never trigger a write, so an upgraded user would pay the open-reject + TOML-parse cost on every `lpm outdated` invocation forever — a real perf regression when shipping P43-1 standalone before P43-2's install writeback lands. Fix: best-effort delete the stale binary when open returns `UnsupportedVersion`. Deletion is scoped to the version mismatch only; other errors (corrupt magic, structural issues) leave the file on disk for forensic inspection. Delete failures (read-only FS, permission denied) are swallowed — correctness still holds via the TOML fallback. Test `phase43_read_fast_falls_back_to_toml_when_binary_is_v1` flipped to assert deletion. New test `phase43_read_fast_preserves_binary_on_non_version_errors` guards against aggressive-deletion regression. ## Finding #2 (Low) — corrupt tarball pair surfaces `Some("")` `BinaryLockfileReader::open` validated the deps-table layout but not the new tarball pair. Corrupt `(tarball_off, tarball_len)` bytes flowed through `read_str` (which degrades out-of-bounds reads to `""`), so `tarball()` returned `Some("")` — silent corruption that P43-2 would later feed into the shape gate. Fix: extend the per-entry validation loop in `open` to bounds- check the tarball pair against the string table length. `(0, 0)` is the null sentinel and bypasses the check; any other pair must fit within `[string_table_off, EOF)`. Corruption now forces TOML fallback via `read_fast` — matching how v1 handled source/integrity range issues via the deps-validation mechanism. New test `phase43_open_rejects_corrupt_tarball_pair` exercises the `u32::MAX` offset case. ## Finding #3 (Low) — empty-string rejection asymmetric The binary writer rejected empty `source` / `integrity` / `tarball` at serialization time (via `insert_optional`), but `from_toml` was pure serde — it accepted `tarball = ""` cleanly and only failed later when `write_all` tried to emit the binary. P43-1's commit message claimed "consistent rejection" which was wrong. Fix: add a validation pass in `from_toml` that walks all packages and rejects empty `source` / `integrity` / `tarball`. Matches the binary writer's rejection at the parse boundary, preventing asymmetric late failures. New test `phase43_from_toml_rejects_empty_optional_strings` covers all three fields parametrically. ## Results 74 lockfile tests pass (+3 from +2 existing tests). Workspace nextest: 3659 pass (+3 since P43-1 initial commit). CI gate (CARGO_TARGET_DIR=/tmp/lpm-phase43-target): - cargo clippy --workspace -- -D warnings ✓ - cargo fmt --check ✓ - grep -r 'fancy-regex' crates/*/Cargo.toml ✓ (no matches) - cargo build --workspace ✓ - cargo nextest run --workspace --exclude lpm-integration-tests 3659 tests pass, 7 skipped ✓ - cargo test -p lpm-auth (3x) 43 tests pass each ✓ Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tolgaergin
added a commit
that referenced
this pull request
Apr 18, 2026
…metry completeness Addresses three GPT-audit findings on P43-2 commit 1, turning the gate-accepted URL path into a safe standalone rollout. Commit 1's claim of "clean standalone rollout point" was wrong — a stale stored URL that passed the scheme/shape/origin gate would immediately hard-fail the install (both legacy line 3711 and streaming line 3814 routed `NotFound` straight to `handle_tarball_not_found`). This commit is **safety**, not cleanup — P43-2 commit 1 should not ship without it. ## Finding #1 (Medium) — stale URL = first-run hard failure Pre-Phase-43, the lockfile didn't store URLs, so every fetch did its own metadata round-trip first — stale upstream paths were refreshed transparently. Post-commit-1, a stored URL that the gate accepts is reused as-is; if upstream republished or migrated the tarball path, the download 404s and lockfiles get nuked. Fix: both fetch paths gain a **same-run retry** on stored-URL 404s. On `LpmError::NotFound` where `p.tarball_url.is_some()`: 1. Invalidate the metadata cache for this package. 2. Re-resolve via `resolve_tarball_url(..., cached_url=None)` to force a real metadata round-trip. 3. Guard against loop: if fresh URL == stale URL, metadata itself is stuck → fall through to `handle_tarball_not_found`. 4. Retry the download ONCE with the fresh URL. 5. On success: bump `stale_recovery` counter, carry on. 6. On second 404: bump `stale_hard_fail` counter, fall through to `handle_tarball_not_found`. On-demand path 404s (no stored URL) skip retry — there's nothing stale to refresh. ## Finding #2 (Medium) — `handle_tarball_not_found` is CWD-relative Pre-fix `Path::new("lpm.lock")` / `Path::new("lpm.lockb")` — deletes relative to process CWD, not the project root. A programmatic install from a nested directory would leak stale lockfile state (the `lpm.lock` at project root stays, the retry repeats indefinitely). Now takes `project_dir: &Path` and uses `project_dir.join(LOCKFILE_NAME)` / `.join(BINARY_LOCKFILE_NAME)`. `project_dir` is threaded through `fetch_and_store_legacy` and `fetch_and_store_streaming`; captured from the existing `project_dir_buf` in the task dispatch closure (line 1663). ## Finding #3 (Low) — RejectedScheme had no counter `try_lockfile_fast_path`'s `RejectedScheme` branch previously only logged. Now bumps `gate_stats.scheme_mismatch` so corrupt- lockfile signals are observable in telemetry (symmetric with shape/origin — all three rejection types now have counters). ## GateStats expanded Adds three AtomicU64 counters: `scheme_mismatch`, `stale_recovery`, `stale_hard_fail`. All surfaced in the JSON `timing.fetch_breakdown.tarball_url_gate` object. ## Legacy fetch path restructured `fetch_and_store_legacy` previously delegated to `fetch_tarball_to_file` (URL resolve + download composed). For the retry path to distinguish a metadata 404 from a download 404, the two steps are now inline; `fetch_tarball_to_file` is removed as orphaned. Behaviorally equivalent on the happy path. ## What's NOT in this commit (lands in P43-2 commit 3/3) - Generalized writeback trigger (C1 + C3 + C4 from the audit passes) — with retry in place, convergence is a perf concern not a correctness one: every install pays one extra metadata round-trip for each stale package until the lockfile is refreshed by some other trigger (add/remove dep). - Regression tests (9 cases from the design doc). ## Results CI gate (CARGO_TARGET_DIR=/tmp/lpm-phase43-target): - cargo clippy --workspace -- -D warnings ✓ - cargo fmt --check ✓ - grep -r 'fancy-regex' crates/*/Cargo.toml ✓ (no matches) - cargo build --workspace ✓ - cargo nextest run --workspace --exclude lpm-integration-tests 3669 tests pass, 7 skipped ✓ - cargo test -p lpm-auth (3x) 43 tests pass each ✓ Design doc: DOCS/new-features/37-rust-client-RUNNER-VISION-phase43.md (§P43-2 Changes 2 + 4) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tolgaergin
added a commit
that referenced
this pull request
Apr 18, 2026
* feat(lockfile): Phase 43 P43-0 — add `tarball` field to LockedPackage
Adds `pub tarball: Option<String>` to `LockedPackage` with
`#[serde(default, skip_serializing_if = "Option::is_none")]`.
`LOCKFILE_VERSION` stays at 1 — the change is additive-only at the
TOML layer. Old lockfiles without the field parse as `None`; new
lockfiles with all-None `tarball` serialize byte-identically to
pre-Phase-43 lockfiles.
The TOML writer now threads `InstallPackage.tarball_url` →
`LockedPackage.tarball` at both call sites in `install.rs`
(lockfile-fast-path writer and fresh-resolve writer).
Touches 96 `LockedPackage { ... }` construction sites across 14
files. Every site gets an explicit `tarball: None,` for diff
pattern-matchability rather than `..Default::default()`.
Also fixes three pre-existing breakage sites in integration tests
that were never reached by CI (`--exclude lpm-integration-tests`):
missing `alias_dependencies` (from Phase 40 P2) plus the new
`tarball` field in `tests/integration/tests/{core,output_parity}.rs`,
and missing `aliases` + `root_link_names` on `LinkTarget` in
`core.rs`. Noticed-while-in-file; keeps the integration-test
crate compilable.
New tests (5 in `lpm-lockfile`):
- `phase43_tarball_roundtrips_when_present`
- `phase43_tarball_absent_keeps_old_lockfiles_byte_identical`
- `phase43_tarball_mixed_population_roundtrips`
- `phase43_old_lockfile_without_tarball_field_parses`
- (plus integration crate sites compile)
P43-1 (binary v2 layout) and P43-2 (fast-path gate + stale-URL
retry + generalized writeback) land in follow-up commits.
CI gate (CARGO_TARGET_DIR=/tmp/lpm-phase43-target):
- cargo clippy --workspace -- -D warnings ✓
- cargo fmt --check ✓
- grep -r 'fancy-regex' crates/*/Cargo.toml ✓ (no matches)
- cargo build --workspace ✓
- cargo nextest run --workspace --exclude lpm-integration-tests
3647 tests pass, 7 skipped ✓
- cargo test -p lpm-auth (3x) 43 tests pass each ✓
Design doc: DOCS/new-features/37-rust-client-RUNNER-VISION-phase43.md
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(lockfile): Phase 43 P43-1 — binary v2 layout with tarball URL slot
Bumps `BINARY_VERSION` 1 → 2. Appends a `(tarball_off: u32,
tarball_len: u16)` pair to every package entry, growing
`ENTRY_SIZE` 30 → 36 bytes. The `(0, 0)` null sentinel distinguishes
`None` from a real URL — same pattern already used for
`source` / `integrity`.
## Strict version check (v2 reader rejects v1 strictly)
Changed from `version == 0 || version > BINARY_VERSION` to
`version != BINARY_VERSION`. The per-entry layout differs between
versions (v1 = 30B, v2 = 36B) — a v2 reader decoding v1 entries
as 36-byte entries would read package N's `name_off`/`name_len`
as package N-1's (nonexistent) tarball pair and produce garbage.
`read_fast` catches the error and falls through to TOML; P43-2
will add a dedicated writeback trigger so fast-path installs also
complete the v1→v2 binary migration.
## Empty-string rejection (M2 from 3rd-pass audit)
`StringTable::insert("")` on the first insert would produce
exactly `(off=0, len=0)` — indistinguishable from the null
sentinel. The new `insert_optional` helper rejects empty strings
outright for all three optional fields (`source`, `integrity`,
`tarball`). Empty tarball URLs / integrity hashes / sources are
nonsensical input; failing loud is correct.
## Tests added (9 new, total 71 pass)
- `phase43_entry_size_is_36_bytes` — wire-format invariant guard.
- `phase43_tarball_roundtrips_through_binary` — Some(url) survives.
- `phase43_mixed_tarball_population_roundtrips` — rollout-window case.
- `phase43_writer_rejects_empty_tarball` — M2 sentinel protection.
- `phase43_writer_rejects_empty_source_and_integrity_too` — M2
applied consistently across all optional fields.
- `phase43_v2_reader_rejects_v1_binary_strict` — v1 header hand-
rolled, verifies strict `!=` guard.
- `phase43_v2_reader_rejects_future_version_3` — forward-incompat.
- `phase43_null_tarball_sentinel_roundtrips` — None/None/None
round-trips correctly through all three optional slots.
- `phase43_read_fast_falls_back_to_toml_when_binary_is_v1` — the
key migration scenario; client upgrade sees v1 `lpm.lockb`,
rejects it, falls through to TOML. v1 file intentionally stays
on disk (P43-2 writeback will clean it up).
## Outdated v1-only docs corrected
The alias-metadata limitation applies to both v1 and v2 (Phase 43
only added the tarball slot, not an alias section). Comments in
`binary_format_supports`, `to_binary`, `to_lockfile`, and
`to_locked_package` updated to say "the binary format" rather
than "v1 binary".
## Size impact
+6 bytes per package entry + URL bytes in string table. For a
typical 409-package lockfile with ~60-char tarball URLs:
409 × 6 + 409 × 60 ≈ 27 KB growth. Mapped file — zero memory
cost beyond pages actually read.
CI gate (CARGO_TARGET_DIR=/tmp/lpm-phase43-target):
- cargo clippy --workspace -- -D warnings ✓
- cargo fmt --check ✓
- grep -r 'fancy-regex' crates/*/Cargo.toml ✓ (no matches)
- cargo build --workspace ✓
- cargo nextest run --workspace --exclude lpm-integration-tests
3656 tests pass, 7 skipped (+9 since P43-0) ✓
- cargo test -p lpm-auth (3x) 43 tests pass each ✓
Design doc: DOCS/new-features/37-rust-client-RUNNER-VISION-phase43.md (§P43-1)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(lockfile): Phase 43 P43-1 — address GPT audit follow-up
Three tightenings after the 2026-04-18 GPT audit of P43-1:
## Finding #1 (Medium) — stale v1 `lpm.lockb` persists across reads
`read_fast` falls back to TOML when binary open returns
`UnsupportedVersion`, but previously left the v1 file on disk.
`read_fast` is called from `lpm install`, `lpm upgrade`, AND
`lpm outdated`. Read-only commands never trigger a write, so an
upgraded user would pay the open-reject + TOML-parse cost on every
`lpm outdated` invocation forever — a real perf regression when
shipping P43-1 standalone before P43-2's install writeback lands.
Fix: best-effort delete the stale binary when open returns
`UnsupportedVersion`. Deletion is scoped to the version mismatch
only; other errors (corrupt magic, structural issues) leave the
file on disk for forensic inspection. Delete failures (read-only
FS, permission denied) are swallowed — correctness still holds via
the TOML fallback.
Test `phase43_read_fast_falls_back_to_toml_when_binary_is_v1`
flipped to assert deletion. New test
`phase43_read_fast_preserves_binary_on_non_version_errors` guards
against aggressive-deletion regression.
## Finding #2 (Low) — corrupt tarball pair surfaces `Some("")`
`BinaryLockfileReader::open` validated the deps-table layout but
not the new tarball pair. Corrupt `(tarball_off, tarball_len)`
bytes flowed through `read_str` (which degrades out-of-bounds
reads to `""`), so `tarball()` returned `Some("")` — silent
corruption that P43-2 would later feed into the shape gate.
Fix: extend the per-entry validation loop in `open` to bounds-
check the tarball pair against the string table length. `(0, 0)`
is the null sentinel and bypasses the check; any other pair must
fit within `[string_table_off, EOF)`. Corruption now forces TOML
fallback via `read_fast` — matching how v1 handled source/integrity
range issues via the deps-validation mechanism.
New test `phase43_open_rejects_corrupt_tarball_pair` exercises the
`u32::MAX` offset case.
## Finding #3 (Low) — empty-string rejection asymmetric
The binary writer rejected empty `source` / `integrity` / `tarball`
at serialization time (via `insert_optional`), but `from_toml` was
pure serde — it accepted `tarball = ""` cleanly and only failed
later when `write_all` tried to emit the binary. P43-1's commit
message claimed "consistent rejection" which was wrong.
Fix: add a validation pass in `from_toml` that walks all packages
and rejects empty `source` / `integrity` / `tarball`. Matches the
binary writer's rejection at the parse boundary, preventing
asymmetric late failures.
New test `phase43_from_toml_rejects_empty_optional_strings`
covers all three fields parametrically.
## Results
74 lockfile tests pass (+3 from +2 existing tests). Workspace
nextest: 3659 pass (+3 since P43-1 initial commit).
CI gate (CARGO_TARGET_DIR=/tmp/lpm-phase43-target):
- cargo clippy --workspace -- -D warnings ✓
- cargo fmt --check ✓
- grep -r 'fancy-regex' crates/*/Cargo.toml ✓ (no matches)
- cargo build --workspace ✓
- cargo nextest run --workspace --exclude lpm-integration-tests
3659 tests pass, 7 skipped ✓
- cargo test -p lpm-auth (3x) 43 tests pass each ✓
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(lockfile): Phase 43 — 2nd-round GPT audit follow-up (zero-length tarball gap + future-version preservation)
Two tightenings after the 2nd-round GPT audit of the 1st follow-up:
## Finding #1 (Low) — zero-length tarball with non-zero offset slipped through
The 1st follow-up added a range-check `off + len > string_table_len`
in `BinaryLockfileReader::open`, but that fails trivially for any
`(off != 0, len == 0)` pair because `off + 0 > len` is false for
any in-bounds offset. Combined with `tarball()` treating
"not both zero" as `Some(...)`, a corrupt pair still surfaced
`Some("")` — exactly the class of silent corruption the 1st
follow-up was supposed to close.
Fix: add an explicit `len == 0 && off != 0` rejection BEFORE the
range check. The only legitimate zero-length slot is the null
sentinel `(0, 0)`; any non-null value must have `len > 0`. The
check is trivial (one comparison) and makes the invariant
auditable from the code.
Test `phase43_open_rejects_corrupt_tarball_pair_zero_length_nonzero_offset`
stomps a valid binary's tarball slot with `(off=5, len=0)` and
asserts rejection with a clear error message.
## Finding #2 (Low) — future-version binary was deleted aggressively
The 1st follow-up deleted the stale `lpm.lockb` on ANY
`UnsupportedVersion`. That's fine for the common case (old client
wrote a v1, new client is v2), but it also fires when a NEWER
client's v3 binary is read by the current v2 client — forcing the
newer client to regenerate on its next install. GPT flagged this
as an open question: since `lpm.lockb` is derivative cache,
aggressive deletion is acceptable but not ideal.
Fix: narrow the delete to `found < BINARY_VERSION` only. Future-
version binaries fall through to TOML (correctness preserved) but
stay on disk so the newer client's fast path isn't churned.
Required promoting `BINARY_VERSION` from private `const` to `pub`
so `read_fast` in `lib.rs` can compare against it.
Test `phase43_read_fast_preserves_binary_on_future_version` hand-
rolls a v99 header, asserts TOML fallback succeeds AND the binary
is preserved on disk.
## Results
76 lockfile tests pass (+2 from this commit). Workspace nextest:
3661 pass.
CI gate (CARGO_TARGET_DIR=/tmp/lpm-phase43-target):
- cargo clippy --workspace -- -D warnings ✓
- cargo fmt --check ✓
- grep -r 'fancy-regex' crates/*/Cargo.toml ✓ (no matches)
- cargo build --workspace ✓
- cargo nextest run --workspace --exclude lpm-integration-tests
3661 tests pass, 7 skipped ✓
- cargo test -p lpm-auth (3x) 43 tests pass each ✓
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(install): Phase 43 P43-2 (1/3) — gate + URL reuse + url_lookup_ms sub-timer
First of three commits delivering P43-2. Scope here is the READ
path: stored tarball URLs from `lpm.lockb` are reused at install
time when safe, falling through to on-demand metadata lookup
when the gate rejects. No stale-URL retry, no writeback — those
land in commits 2/3.
## crates/lpm-registry — new public API
- `is_https_url` / `is_localhost_url` promoted from private `fn`
to `pub fn` (visibility-only; no logic change).
- New `GateDecision` enum: `Accepted`, `RejectedScheme`,
`RejectedShape`, `RejectedOrigin`. Distinct variants so callers
can drive per-reason telemetry counters without re-running
checks.
- New `RegistryClient::is_configured_origin(url)`: returns true
if the URL's origin matches `base_url` or `npm_registry_url`.
Opaque origins (`file://`, `data:`) never match.
- New `evaluate_cached_url(url, client) -> GateDecision`: composes
scheme + shape + origin gates. The shape gate requires BOTH
`.tgz` suffix AND a `/-/` path segment; the latter closes the
H1 SSRF-via-lockfile gap (bare `.tgz` suffix would pass crafted
paths like `/api/admin/foo.tgz`).
## crates/lpm-cli/install.rs — consumption
- `try_lockfile_fast_path` signature grows `client` +
`gate_stats` params; two call sites (offline + main install)
updated.
- Line 2679 `tarball_url: None` → match on `evaluate_cached_url`:
`Accepted` reuses the stored URL; rejections bump the right
counter (for shape / origin; scheme rejections are a corrupt-
lockfile signal and log-only, matching the writer invariant).
- `GateStats` struct with `AtomicU64` counters (origin/shape
mismatch). Surfaced on `timing.fetch_breakdown.tarball_url_gate`
in `--json` output.
- `TaskTimings` gains `url_lookup_ms` field — measured in BOTH
legacy and streaming fetch paths so the Phase 43 win is
directly measurable regardless of which path is active.
Previously URL lookup was either buried in `download_ms`
(legacy) or untimed (streaming); carving it out makes the
primary projection target visible.
- `FetchBreakdown` gains `url_lookup_sum_ms` / `url_lookup_max_ms`.
- Legacy path `fetch_tarball_to_file` signature grows to return
`(DownloadedTarball, url_lookup_ms)`; `download_ms` narrows
to GET + temp-file write only (subtracting the carved-out
lookup via `saturating_sub`).
- Streaming path times `resolve_tarball_url` inline; comment
corrected (was wrong — URL resolution was never in
`extract_ms` on this path, it was untimed).
## What's NOT in this commit (lands in P43-2 commit 2/3)
- Stale-URL same-run retry (C1 from 3rd-pass audit).
- Generalized writeback trigger (C1 + C3 + C4).
- `handle_tarball_not_found` project_dir fix (L3).
- `stale_recovery` / `stale_hard_fail` counters.
- Regression tests (land in commit 3).
## Tests added (8 new, in lpm-registry)
- `phase43_gate_accepts_canonical_lpm_tarball_url`
- `phase43_gate_accepts_canonical_npm_tarball_url`
- `phase43_gate_rejects_non_https_non_localhost`
- `phase43_gate_rejects_wrong_suffix`
- `phase43_gate_rejects_admin_style_path_without_dash_segment`
(H1 SSRF defense — the `/-/` segment check)
- `phase43_gate_rejects_origin_mismatch_after_registry_switch`
- `phase43_gate_allows_localhost_registry`
- `phase43_gate_rejects_malformed_url`
CI gate (CARGO_TARGET_DIR=/tmp/lpm-phase43-target):
- cargo clippy --workspace -- -D warnings ✓
- cargo fmt --check ✓
- grep -r 'fancy-regex' crates/*/Cargo.toml ✓ (no matches)
- cargo build --workspace ✓
- cargo nextest run --workspace --exclude lpm-integration-tests
3669 tests pass, 7 skipped (+8 since P43-1) ✓
- cargo test -p lpm-auth (3x) 43 tests pass each ✓
Design doc: DOCS/new-features/37-rust-client-RUNNER-VISION-phase43.md (§P43-2 Change 1)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(install): Phase 43 P43-2 (2/3) — stale-URL retry + CWD fix + telemetry completeness
Addresses three GPT-audit findings on P43-2 commit 1, turning the
gate-accepted URL path into a safe standalone rollout. Commit 1's
claim of "clean standalone rollout point" was wrong — a stale
stored URL that passed the scheme/shape/origin gate would
immediately hard-fail the install (both legacy line 3711 and
streaming line 3814 routed `NotFound` straight to
`handle_tarball_not_found`). This commit is **safety**, not
cleanup — P43-2 commit 1 should not ship without it.
## Finding #1 (Medium) — stale URL = first-run hard failure
Pre-Phase-43, the lockfile didn't store URLs, so every fetch did
its own metadata round-trip first — stale upstream paths were
refreshed transparently. Post-commit-1, a stored URL that the
gate accepts is reused as-is; if upstream republished or migrated
the tarball path, the download 404s and lockfiles get nuked.
Fix: both fetch paths gain a **same-run retry** on stored-URL
404s. On `LpmError::NotFound` where `p.tarball_url.is_some()`:
1. Invalidate the metadata cache for this package.
2. Re-resolve via `resolve_tarball_url(..., cached_url=None)` to
force a real metadata round-trip.
3. Guard against loop: if fresh URL == stale URL, metadata itself
is stuck → fall through to `handle_tarball_not_found`.
4. Retry the download ONCE with the fresh URL.
5. On success: bump `stale_recovery` counter, carry on.
6. On second 404: bump `stale_hard_fail` counter, fall through
to `handle_tarball_not_found`.
On-demand path 404s (no stored URL) skip retry — there's nothing
stale to refresh.
## Finding #2 (Medium) — `handle_tarball_not_found` is CWD-relative
Pre-fix `Path::new("lpm.lock")` / `Path::new("lpm.lockb")` —
deletes relative to process CWD, not the project root. A
programmatic install from a nested directory would leak stale
lockfile state (the `lpm.lock` at project root stays, the retry
repeats indefinitely). Now takes `project_dir: &Path` and uses
`project_dir.join(LOCKFILE_NAME)` / `.join(BINARY_LOCKFILE_NAME)`.
`project_dir` is threaded through `fetch_and_store_legacy` and
`fetch_and_store_streaming`; captured from the existing
`project_dir_buf` in the task dispatch closure (line 1663).
## Finding #3 (Low) — RejectedScheme had no counter
`try_lockfile_fast_path`'s `RejectedScheme` branch previously
only logged. Now bumps `gate_stats.scheme_mismatch` so corrupt-
lockfile signals are observable in telemetry (symmetric with
shape/origin — all three rejection types now have counters).
## GateStats expanded
Adds three AtomicU64 counters: `scheme_mismatch`,
`stale_recovery`, `stale_hard_fail`. All surfaced in the JSON
`timing.fetch_breakdown.tarball_url_gate` object.
## Legacy fetch path restructured
`fetch_and_store_legacy` previously delegated to
`fetch_tarball_to_file` (URL resolve + download composed). For
the retry path to distinguish a metadata 404 from a download
404, the two steps are now inline; `fetch_tarball_to_file` is
removed as orphaned. Behaviorally equivalent on the happy path.
## What's NOT in this commit (lands in P43-2 commit 3/3)
- Generalized writeback trigger (C1 + C3 + C4 from the audit
passes) — with retry in place, convergence is a perf concern
not a correctness one: every install pays one extra metadata
round-trip for each stale package until the lockfile is
refreshed by some other trigger (add/remove dep).
- Regression tests (9 cases from the design doc).
## Results
CI gate (CARGO_TARGET_DIR=/tmp/lpm-phase43-target):
- cargo clippy --workspace -- -D warnings ✓
- cargo fmt --check ✓
- grep -r 'fancy-regex' crates/*/Cargo.toml ✓ (no matches)
- cargo build --workspace ✓
- cargo nextest run --workspace --exclude lpm-integration-tests
3669 tests pass, 7 skipped ✓
- cargo test -p lpm-auth (3x) 43 tests pass each ✓
Design doc: DOCS/new-features/37-rust-client-RUNNER-VISION-phase43.md (§P43-2 Changes 2 + 4)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(install): Phase 43 P43-2 (3/3) — generalized writeback for URL convergence + v1→v2 binary migration
Final P43-2 commit — delivers Change 3 from the design doc (the
convergence/persistence fixes C1 + C3 + C4 from the 3rd and 4th
audit passes). With retry safety from commit 2/3 already in
place, this commit is about correctness of the durable state:
making the lockfile actually converge on the URLs used, so the
next install doesn't redo the same recovery work.
## Change 3: generalized writeback
`try_lockfile_fast_path` now returns `LockfileFastPath { packages,
lockfile, needs_binary_upgrade }` instead of bare `Vec<InstallPackage>`.
The driver stashes `lockfile` and `needs_binary_upgrade` so the
install-end writeback step can see them.
Fetch tasks now return the FINAL URL used (legacy + streaming
grew from `Result<(sri, timings)>` to `Result<(sri, timings,
final_url)>`). Happy path: `final_url == initial_url` (stored
URL succeeded). Stale-URL recovery: `final_url == fresh_url`
(retry's metadata round-trip URL). Origin-mismatch rebase:
`final_url == on_demand_url` (gate rejected the stored URL,
`initial_url` came from `resolve_tarball_url(cached=None)`).
Post-fetch aggregator builds `fresh_urls: HashMap<(name, version),
String>` — populated only when the task actually hit the registry
(the store-hit short-circuit reports `None` to avoid double-
counting a sibling task's URL).
At install-end, when `used_lockfile == true`, three trigger
conditions fire the writeback:
1. `!fresh_urls.is_empty()` — URL divergence from stored.
2. `needs_binary_upgrade == true` — v2 `lpm.lockb` was missing
or out-of-version at fast-path time.
3. Both — compound message.
On the true happy path (URLs all match AND v2 binary current),
no write happens. Lockfiles stay byte-identical, CI diffs stay
empty.
## Addresses C1, C3, C4 from the 4th-pass audit
- **C1 (self-healing writeback):** Stale-URL recovery now
persists across runs. Previously, the retry would succeed but
the stored URL stayed stale, so every install re-ran the 404 +
retry dance.
- **C3 (registry-switch convergence):** `LPM_REGISTRY_URL`
switches now converge after one install. Previously,
origin-mismatch rejections fell through to on-demand lookup
correctly but the lockfile never picked up the new-origin
URLs — every subsequent install repeated the on-demand
round-trip forever.
- **C4 (v1→v2 binary migration):** Fast-path-only installs now
complete the binary migration. Previously, the v2 reader
rejected v1 binary + `read_fast` fell back to TOML, but the
fast-path writer was gated on `!used_lockfile` so the v2
rewrite never fired. Commit 1 of P43-1-follow-up mitigated
the perf regression on read-only commands via best-effort
deletion of v1 binaries; this commit completes the migration
properly on fast-path installs.
## What's NOT in this commit
- **End-to-end install-flow regression tests** (the 9 cases from
the design doc). These need mock-registry scaffolding that
doesn't exist in the CLI test harness. Deferred to a
follow-up "test infra + regression tests" commit. The 4 unit
tests below cover the critical new code paths.
- **Offline-mode binary upgrade:** `--offline` intentionally
skips writeback. Users can trigger v1→v2 migration with any
online install.
## Tests added (4 new)
- `phase43_handle_tarball_not_found_honors_project_dir` —
design-doc test #8, verifies the CWD fix from commit 2/3 is
still wired correctly.
- `phase43_try_lockfile_fast_path_flags_v1_binary_for_upgrade`
— design-doc test #9 mechanism: v1 `lpm.lockb` on disk →
`needs_binary_upgrade = true`.
- `phase43_try_lockfile_fast_path_flags_missing_binary_for_upgrade`
— complement: no binary at all → `needs_binary_upgrade = true`.
- `phase43_try_lockfile_fast_path_skips_upgrade_when_binary_current`
— happy-path guard: current v2 binary → `needs_binary_upgrade = false`,
no spurious rewrites on every install.
## Byte-stability guarantee
Regression test `phase43_try_lockfile_fast_path_skips_upgrade_when_binary_current`
guards the byte-stability invariant — a future refactor that
accidentally sets `needs_binary_upgrade = true` on clean
installs would churn every CI lockfile commit. Test fails if
the happy-path trigger isn't clean.
## Results
CI gate (CARGO_TARGET_DIR=/tmp/lpm-phase43-target):
- cargo clippy --workspace -- -D warnings ✓
- cargo fmt --check ✓
- grep -r 'fancy-regex' crates/*/Cargo.toml ✓ (no matches)
- cargo build --workspace ✓
- cargo nextest run --workspace --exclude lpm-integration-tests
3673 tests pass, 7 skipped (+4 since P43-2 commit 2/3) ✓
- cargo test -p lpm-auth (3x) 43 tests pass each ✓
Design doc: DOCS/new-features/37-rust-client-RUNNER-VISION-phase43.md (§P43-2 Change 3)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(install): Phase 43 P43-2 — failing-test-first retrofit + A/B bench verification
Retrofits the methodology reminder #2 contract from the design doc
("P43-2's regression test must fail on the pre-fix `tarball_url:
None` assumption and pass when URL is populated from lockfile").
Adds three tests that directly exercise the core Phase 43 read-path
contract at the struct-creation boundary in `try_lockfile_fast_path`.
## Tests added
1. `phase43_gate_accepted_url_populates_tarball_url` — the fix-
verifier. Lockfile stores a canonical npm tarball URL; gate
accepts; `InstallPackage.tarball_url` MUST be `Some(url)`.
2. `phase43_gate_rejected_urls_downgrade_to_none_with_telemetry` —
parametric over the three rejection modes. Asserts both the
downgrade AND the correct counter bump.
3. `phase43_no_stored_tarball_produces_none_install_package_url` —
boundary-case guard for pre-Phase-43 lockfile shape.
## Empirical verification (2026-04-18)
The first two tests were surgically verified to FAIL against the
pre-fix `tarball_url: None` stub. Procedure:
1. Replaced the gate-match block at install.rs:~2908 with the
pre-fix stub `tarball_url: None`.
2. Ran `cargo test -p lpm-cli phase43`.
3. Observed:
- `phase43_gate_accepted_url_populates_tarball_url` FAILED
(expected `Some(url)`, got `None`).
- `phase43_gate_rejected_urls_downgrade_to_none_with_telemetry`
FAILED (counters at 0 instead of 1).
- Other 5 phase43 tests passed (orthogonal behaviors).
4. Restored the gate logic via `git restore`.
5. Reran — all 7 pass.
This is the "fail before fix, pass after fix" contract that
methodology #2 specifies.
## A/B bench results (same session — per methodology reminder #1)
Decision-gate fixture (/tmp/phase40-decision-gate), 409 packages,
fresh-CI shape (lockfile present + cold store + cold
node_modules). 5 runs each, interleaved A/B/A/B/A/B/A/B/A/B.
`LPM_HOME=/tmp/lpm-ab-home/.lpm` isolation (user's ~497MB store
untouched).
| metric | A (main) | B (P43) | delta | projected |
|---------------------------|---------:|--------:|--------:|----------:|
| total_ms | 4592 | 4043 | -12% | -17% |
| fetch_ms | 4342 | 3568 | **-18%**| **-18%** |
| queue_wait.sum_ms | 437,700 | 356,092 | -19% | -58% |
| url_lookup.sum_ms (new) | n/a | 0 | HIT | near-0 |
| gate mismatch counters | n/a | all 0 | clean | — |
Readings:
- **Primary projection target MET.** `url_lookup.sum_ms` is
exactly 0 across all 5 B runs. Every one of 409 packages' URLs
was reused from the lockfile with zero metadata round-trips.
- **`fetch_ms` delta EXACTLY as projected** (-18% vs -18%).
- **Projection overshot on secondary metrics.** `queue_wait.sum_ms`
projected to drop -58% via the causal-chain hypothesis
(shorter per-task → faster permit turnover). Actual drop is
-19%. Interpretation per stop-and-diagnose rule: queue_wait
is dominated by actual download latency, not URL lookup. The
causal-chain reasoning in the doc was overstated.
- **`total_ms` delta falls short** (-12% vs -17%). The fetch-side
win is fully captured in `fetch_ms`; some doesn't translate
to total wall-clock — link_ms variance (run B#5 had 1110ms
link_ms vs typical ~220ms) and pre-fetch overhead.
- **Gate counters all zero on B** — no origin/shape/scheme
rejections, no stale recovery, no hard fails. Steady-state
clean across 2045 package-fetches.
CI gate (CARGO_TARGET_DIR=/tmp/lpm-phase43-target):
- cargo clippy --workspace -- -D warnings ✓
- cargo fmt --check ✓
- grep -r 'fancy-regex' crates/*/Cargo.toml ✓ (no matches)
- cargo build --workspace ✓
- cargo nextest run --workspace --exclude lpm-integration-tests
3676 tests pass (1 perf-test flake under load — rerun in
isolation passed; unrelated lpm-task eval perf assertion) ✓
- cargo test -p lpm-auth (3x) 43 tests pass each ✓
- cargo test -p lpm-cli phase43 7 pass ✓
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tolgaergin
added a commit
that referenced
this pull request
Apr 24, 2026
…ests
Introduces crates/lpm-cli/src/precedence.rs implementing the §6
three-layer containment model's pure-policy branch from
DOCS/new-features/37-rust-client-RUNNER-VISION-phase48.md
(a-package-manager repo). Foundation commit — resolver is a pure
function, all tests green, not yet wired into production callers.
# Rule it encodes
Legacy pure-policy knobs (`scriptPolicy`) keep Phase 46 project >
user precedence when `force-security-floor = false`. The default
flip for legacy knobs is Phase 5 (Phase 49), not P0.
New Phase 48 pure-policy knobs (`network-policy`,
`install-policy.strict-behavioral`) always use user-is-floor, with
project values that loosen below the floor rejected at load time
with a named-source warning. No approval path; "loosen" means
"project declaration → drop, surface warning."
When `force-security-floor = true`, legacy and new behave
identically: user is floor for both, CLI loosening flags
(`--yolo`, `--policy=allow`, `--accept-*`) suppressed, project
loosening rejected. Tightening CLI / project values are still
honored — the flag blocks loosening, not movement. Approval
suspension is a separate concern for later in P0.
# Scope of this commit
- Public types: `PolicyKind` (Legacy|New), `PolicyTier`
(Cli|Project|User|Default), `RejectionReason`
(ForceFlagSuppressesCli | ForceFlagRejectsProject |
NewKnobProjectLoosens), `Rejection<T>`, `Resolution<T>`,
`PolicyInputs<T>`.
- Public trait `PurePolicyKnob` with `NAME`, `KIND`, `loosens`.
- Public function `resolve_pure_policy<T: PurePolicyKnob>`: pure,
no I/O, three documented branches.
- `PurePolicyKnob` impl on `ScriptPolicy` (Legacy kind).
- Test-local `NetworkPolicy { Fenced, Allow }` stub (New kind);
real `network-policy` lands in P3 with the backend wiring.
# Exit criteria covered (§7 P0)
Test `force_flag_rejects_cli_yolo_and_project_loosening` pins #1:
`force-security-floor = true` + project `scriptPolicy = "allow"`
+ `--yolo` → effective `deny`, CLI rejection + project rejection
both fired.
Test `legacy_knob_without_force_flag_preserves_phase46_order`
pins #2: Phase 46 project > user order preserved for legacy
knob, no rejections.
Test `new_knob_without_force_flag_rejects_project_loosening`
pins #3: new-knob user floor prevails without force flag, project
value rejected with `NewKnobProjectLoosens` reason. Rejection
count asserted == 1 to pin the "no request flow, no approval
path" rule from the test side — if a future refactor routes
rejections through the approval UI, this test fails loudly.
# Not yet in this commit (next slices on this branch)
1. Wire `force-security-floor` into GlobalConfig and re-plumb
resolve_script_policy through the new resolver so legacy
callers hit the tested logic.
2. Approval-record suspension + three distinct migration-warning
wordings (one per RejectionReason variant).
3. max-sandbox-write-roots config key + load_sandbox_write_dirs
hardening (reject `/`, `$HOME/.ssh`, etc.).
4. Per-package capability resolver for passEnv, sandboxLimits,
readProject.
Deferred intentionally: the existing resolve_script_policy
behavior is unchanged in this commit. Re-plumbing happens in the
next slice so the "no behavior change" promise for legacy knobs
stays auditable.
# Verification
- cargo clippy --workspace -- -D warnings: clean
- cargo fmt --check: clean
- cargo nextest run --workspace --exclude lpm-integration-tests
--no-fail-fast: 4231 passed, 7 skipped, 0 failed
- grep fancy-regex crates/*/Cargo.toml: empty (banned crate absent)
- 12/12 precedence unit tests pass (3 exit-criterion tests + 6
edge cases + 2 loosens-ordering proofs + 1 display-phrasing
pin).
Ran with CARGO_TARGET_DIR=/tmp/lpm-phase48-target so the dev
incremental cache stayed clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tolgaergin
added a commit
that referenced
this pull request
Apr 30, 2026
…rs/ (Phase 61.1) The big lever — the isolated linker's per-package wrapper tree moves out of `node_modules/.lpm/` to `<project>/.lpm/wrappers/`. After the relayout, `rm -rf node_modules` no longer wipes the entire incremental linker cache, so the warm-install bench (and the user pattern Phase 57.2 surfaced — wiping node_modules after a teammate's lockfile change) actually exercises the incremental linker. Symlink-target shape changes (audit fix #1, v3): - Phase 3 root symlinks (canonical + aliases) gain one extra `..` segment and route through `<project>/.lpm/wrappers/<seg>/...`. Centralized in `LayoutPaths::root_symlink_target()` so the depth math (link-depth + 1) is computed in one place. - Phase 3.5 self-references unchanged — they target the project root, which doesn't move under Tier 2. - Phase 2 internal sibling-wrapper symlinks unchanged — both endpoints live inside `.lpm/wrappers/` so the relative `../../` shape is preserved. Drive-by audit fixes folded in: - #3 (bin-shim wrapper segment): `create_bin_links` now uses `pkg.wrapper_segment()` instead of hardcoding `format!("{safe}@{version}")`. Pre-fix, local-source deps with a `bin` field produced shims pointing at non-existent wrapper paths. - #7 (Windows junction `..` normalization): added a lexical-clean helper inside `create_symlink_or_junction`'s Windows arm so the `../.lpm/wrappers/...` shape doesn't embed an unresolved `..` segment in the path handed to `cmd /c mklink /J`. `cleanup_stale_entries` updates: - Explicitly creates `node_modules/` (pre-Tier-2 the wrapper-root `create_dir_all` covered both via parent recursion; now they're disjoint paths). - Skips dotfile entries (e.g., the new `.version` schema-tag) when sweeping stale wrappers. - Writes `<wrapper-root>/.version` (D6) for forward-compat shape detection. Test fixtures migrated to use `LayoutPaths` so they track production semantics on any future shape change. 4949 workspace tests pass; clippy --workspace -D warnings clean; cargo fmt clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
11 tasks
tolgaergin
added a commit
that referenced
this pull request
Apr 30, 2026
* feat(lpm-runtime): RuntimeStatus carries resolved managed-runtime bin
`Ready` and `Installed` now carry a `bin_dir: PathBuf` field — the
managed-runtime bin path that `node_bin_dir(&version)` already resolves
inside `ensure_runtime` and would otherwise discard. Downstream callers
(the PATH builder in `lpm-runner/bin_path`) can consume this hint to
skip a redundant `detect_node_version` + `list_installed` pass per
`lpm run` invocation.
For the `Installed` branch, defensively re-stat after install — if the
freshly-installed bin dir vanished mid-call (race / external tampering),
degrade to `NotInstalled` rather than panic.
This is the data-shape change that the rest of Phase 61 Tier 1 builds on.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(lpm-runner): 3-state ManagedRuntimeHint + pre-resolved PATH builder
Adds `ManagedRuntimeHint { Bin(PathBuf) | Absent | Unknown }` plus
`build_path_with_bins_pre_resolved(start_dir, hint)`. The existing
public `build_path_with_bins` becomes a thin wrapper that passes
`Unknown` — preserving the silent-detect contract for callers that
don't go through `ensure_runtime` first (rebuild, dlx, hooks,
tools.rs, doctor, orchestrator).
Why three states, not `Option<PathBuf>`:
- `Bin(path)` — caller resolved the managed runtime: use it directly.
- `Absent` — caller called `ensure_runtime` and confirmed there
is no managed runtime to use. PATH builder skips the
silent re-detect entirely (the win on unpinned projects).
- `Unknown` — caller hasn't checked. Falls back to silent detect
(current pre-Phase-61 behavior).
Collapsing `Absent` and `Unknown` into one nullable would force the
silent re-detect on the unpinned-project path — the most common shape.
Two deterministic unit tests cover the contract: `_uses_hinted_bin`
asserts the produced PATH is exactly [nm_bin, hint_bin, ...inherited]
when `Bin(...)` is supplied (uses a non-existent fake path so any
re-stat would fail-loud); `_absent_skips_runtime` asserts the PATH is
exactly [nm_bin, ...inherited]. Both assert full structure rather than
substring presence/absence so they're robust to whatever managed-
runtime fragments the developer's PATH happens to contain.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(lpm-runner/script): thread bin_hint through script/command entrypoints
Extends every `pub fn run_*` in the script runner with a
`bin_hint: &ManagedRuntimeHint` parameter, routing each internal
PATH-build through `build_path_with_bins_pre_resolved` instead of the
silent-detect wrapper. Eight entrypoints touched:
- run_script, run_script_with_envs, run_script_captured
- run_script_buffered, run_script_prefixed
- run_command, run_command_captured, run_command_buffered,
run_command_prefixed
No backwards-compatibility shims — per CLAUDE.md "no `// removed`
comments, no shims, no parallel slow-path wrappers." Tests pass
`&ManagedRuntimeHint::Unknown` (imported as `Unknown` at the top of
the test mod for brevity).
Public API surface change is mechanical (one extra parameter); the
sole external consumer is `lpm-cli`, migrated in the next commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(lpm-cli): consume bin_hint, collapse cache-config reads, delete dead wrappers
Threads the `ManagedRuntimeHint` from `commands::run::ensure_runtime`
through the script-execution chain so the downstream PATH builder
doesn't redo `detect_node_version` + `list_installed` on every
`lpm run` invocation.
Signature changes:
- `commands::run::ensure_runtime` now returns `ManagedRuntimeHint`
(`Bin(bin_dir)` for Ready/Installed; `Absent` for NotInstalled and
NoRequirement).
- `run`, `run_multi`, `run_workspace`, `run_watch`, `exec`,
`run_tasks_sequential`, `run_tasks_parallel`, `run_task`, and
`run_task_captured` all gain a `bin_hint` parameter.
Caller migration:
- `main.rs:3102` (watch path) and `main.rs:3527` (External script
shortcut) capture the hint before calling `run_watch` / `run`.
- `dev.rs` captures `runtime_hint` via the existing `tokio::join!`
block instead of discarding it; threads to the dev script invocation.
- `migrate.rs::run_verification` resolves the hint once and reuses
it across the build + test verification scripts.
Caller contract: every callsite of `run` / `run_multi` / `run_watch`
/ `exec` MUST invoke `ensure_runtime` first — that's where the
user-visible "Using node X" notice + auto-install fire. Documented
on `pub async fn run` so future callers don't bypass it accidentally.
Cache-context dedup (Tier 1.4.2):
- `run` reads `lpm.json` once at the top instead of twice (cache-hit
check + caching-enabled check both used to read).
- Migrates the simple-script path to use the existing
`try_cache_hit_with_config` and `is_task_cached_with_config`
helpers — the no-config wrappers were only used by this one
callsite.
Dead-code removal (CLAUDE.md "no shims"):
- Delete `is_task_cached`, `try_cache_hit`, `try_cache_store_with_output`
— every other call site already used the `_with_config` variants.
- Delete the `is_task_cached_false_without_lpm_json` test that
exclusively exercised the deleted wrapper; the equivalent contract
is exercised by `is_task_cached_with_config_*` tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* perf(lpm-cli/run): Tier 1 follow-ups — workspace pin inheritance, parallel Arc reuse, is_meta_task plumbing
Three follow-ups that landed during the M/L review pass on top of the
base hint threading:
L1 — `is_meta_task` no longer reads `package.json` per call.
Caller (`run_multi`, `run_workspace_package`) extracts `pkg.scripts`
once and threads it down through `run_tasks_sequential` /
`run_tasks_parallel` / `is_meta_task`. The dependsOn-but-no-command
case previously paid one `package.json` read per task in the
parallel loop; now zero. The `is_meta_task_from_config` alias
collapses into the single `is_meta_task` since the helper is
filesystem-free now.
L2 — `run_tasks_parallel` wraps shared per-call state in `Arc`.
Pre-Tier-1: each spawned thread did a full `clone` of the hint,
the tasks `HashMap`, the `LpmJsonConfig`, and (post-L1) the
`pkg_scripts` `HashMap`. Post-Tier-1: each is `Arc::new`'d once
before the loop, threads do a refcount bump. Negligible per-thread
but avoids quadratic-feeling allocations on wide parallel levels.
L3 — workspace per-member calls inherit the root hint when the
member has no own pin.
`run_workspace_package` probes the member dir via
`lpm_runtime::detect::detect_node_version` (single-dir, no walk).
If the member has its own .nvmrc / engines / lpm.json runtime,
pass `Unknown` so the silent detect resolves the member-level
pin. If not, inherit the root hint. Matches user intuition that
the workspace-root pin governs the whole workspace (like nvm
walking parent dirs).
Small behavior change: a workspace member with NO own Node pin
now uses the root-resolved managed runtime instead of falling back
to system Node. Arguably a bug fix — pre-Tier-1 behavior was
inconsistent (root auto-installed Node 22 but member silently
ran on whatever `node` happened to be on PATH).
Plus the M/L review fixes batched in:
- M1: doc note on `pub async fn run` documenting the
`ensure_runtime`-must-be-called-first contract.
- M2/M3: `bin_path` test assertions tightened to compare the full
PATH segment list, not substring presence/absence (robust to
whatever managed-runtime fragments the developer's PATH happens
to contain).
- Style: `Default for ManagedRuntimeHint` returning `Unknown`; test
mods import `ManagedRuntimeHint::Unknown` so call sites read
`&Unknown` instead of `&ManagedRuntimeHint::Unknown`.
Measurement (n=101, time.perf_counter_ns(), M5 Mac, load avg ~3):
- Managed-runtime fixture (.nvmrc + 7 entries): ~150 µs / lpm run.
- No-managed-runtime fixture: ~60 µs / lpm run.
- bench/run.sh script-overhead (1ms resolution, n=21): within noise.
Sub-perceptible at ms resolution; preparatory plumbing for Tier 2
warm-path relayout. See preplan v3 status block for full numbers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(lpm-linker): introduce LayoutPaths utility (Phase 61.0.5, no behavior change)
Centralizes wrapper / metadata / health-check path construction. Every
production callsite that built `node_modules/.lpm/...` paths inline now
goes through `LayoutPaths::for_project(project_dir).{isolated,hoisted}_*`.
61.0.5 contract: every helper returns the legacy path
(`node_modules/.lpm/`). No observable behavior change. 61.1 will flip
`isolated_*` to `<project>/.lpm/wrappers/...` as a single source-of-truth
edit; consumers migrate transparently.
Production migrations in this commit:
- `lpm-linker::cleanup_stale_entries`: wrapper-root construction
- `lpm-linker::link_one_package`: pkg-entry-dir + .linked marker
- `lpm-linker::link_finalize`: wrapper-root for bin link traversal
- `lpm-linker::link_packages_hoisted`: metadata path + nested-root (via
`hoisted_*` helpers, intentionally still scoped to `node_modules/`)
- `lpm-cli::commands::rebuild::live_package_dir`: isolated probe
`doctor.rs` predicate is intentionally NOT migrated here — its semantic
change (handling hoisted-no-conflicts via `install_appears_healthy()`)
lands in 61.4.
Adds `crates/lpm-linker/src/layout.rs` with 13 unit tests covering all
helpers including the 5 `InstallHealth` variants and the
`needs_layout_migration` invariant in 61.0.5.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(lpm-linker): flip isolated wrapper root to <project>/.lpm/wrappers/ (Phase 61.1)
The big lever — the isolated linker's per-package wrapper tree moves
out of `node_modules/.lpm/` to `<project>/.lpm/wrappers/`. After the
relayout, `rm -rf node_modules` no longer wipes the entire incremental
linker cache, so the warm-install bench (and the user pattern Phase 57.2
surfaced — wiping node_modules after a teammate's lockfile change)
actually exercises the incremental linker.
Symlink-target shape changes (audit fix #1, v3):
- Phase 3 root symlinks (canonical + aliases) gain one extra `..`
segment and route through `<project>/.lpm/wrappers/<seg>/...`.
Centralized in `LayoutPaths::root_symlink_target()` so the depth
math (link-depth + 1) is computed in one place.
- Phase 3.5 self-references unchanged — they target the project root,
which doesn't move under Tier 2.
- Phase 2 internal sibling-wrapper symlinks unchanged — both endpoints
live inside `.lpm/wrappers/` so the relative `../../` shape is
preserved.
Drive-by audit fixes folded in:
- #3 (bin-shim wrapper segment): `create_bin_links` now uses
`pkg.wrapper_segment()` instead of hardcoding
`format!("{safe}@{version}")`. Pre-fix, local-source deps with a
`bin` field produced shims pointing at non-existent wrapper paths.
- #7 (Windows junction `..` normalization): added a lexical-clean
helper inside `create_symlink_or_junction`'s Windows arm so the
`../.lpm/wrappers/...` shape doesn't embed an unresolved `..`
segment in the path handed to `cmd /c mklink /J`.
`cleanup_stale_entries` updates:
- Explicitly creates `node_modules/` (pre-Tier-2 the wrapper-root
`create_dir_all` covered both via parent recursion; now they're
disjoint paths).
- Skips dotfile entries (e.g., the new `.version` schema-tag) when
sweeping stale wrappers.
- Writes `<wrapper-root>/.version` (D6) for forward-compat shape
detection.
Test fixtures migrated to use `LayoutPaths` so they track production
semantics on any future shape change. 4949 workspace tests pass;
clippy --workspace -D warnings clean; cargo fmt clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(lpm-cli): rebuild.rs uses LayoutPaths + closes store-fallback hole (Phase 61.2)
Three things land together because they all touch `prepare_live_package_dir`:
D8a — store-fallback hard-error. Pre-Phase-61 the function returned
`Ok(store_path)` whenever the live probe fell through, letting the
caller chdir into canonical store bytes for a lifecycle script. On
macOS (clonefile, CoW) that was silent corruption on first write; on
Linux (hardlinks) the early `if !live.starts_with(store_root)` branch
skipped detach so the script ran against shared inodes. Either way, a
soundness violation. Post-fix the function returns `Err("...not linked
into project — refusing to run lifecycle script inside the store...")`
so failures are loud, actionable, and never corrupt the store.
Audit fix #4 — wrapper-segment shape. `live_package_dir` now takes a
`wrapper_id: Option<&str>` and computes the wrapper segment via
`LayoutPaths::wrapper_segment(name, version, wrapper_id)`. The same
helper `LinkTarget::wrapper_segment` delegates to (single source of
truth across the linker / rebuild / future doctor code paths). Pre-fix
the inline `format!("{safe}@{version}")` silently missed every
non-Registry source: a Directory / Link / Tarball / Git dep with a
lifecycle script had its wrapper probe fail and fall through to the
store. Post-fix `ScriptablePackage` carries the `wrapper_id` derived
from `lp.source` via `Source::source_id()`.
Audit fix #5 — test inversion. The pre-existing
`prepare_live_package_dir_does_not_detach_when_path_is_under_store_root`
test pinned the silent-fallback contract D8a inverts. Replaced with
`prepare_live_package_dir_errors_when_unlinked` asserting the new
`Err("...not linked into project...")` shape; canary-bytes-intact
assertion preserved.
Adjacent fix in `p6_triage_autoexec_reference.rs`: the test seeded
the store but not the wrapper, relying on the silent-fallback hole to
run lifecycle scripts. Added a `seed_wrapper` helper that materializes
`<project>/.lpm/wrappers/<seg>/node_modules/<name>/` from the store —
mirroring real post-install state. Pre-D8a the same fixture passed by
accident; the new state captures the actual contract.
`LayoutPaths::wrapper_segment` is the new cross-crate helper.
`LinkTarget::wrapper_segment` delegates to it so the two cannot drift.
4949 workspace tests pass; clippy --workspace -D warnings clean;
cargo fmt clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(lpm-cli): layout-aware install_state + wrapper-layout migration (Phase 61.3)
Two pieces, both load-bearing per the v3 audit fix #2 / D8c:
1. Layout-aware freshness gate. `check_install_state` AND
`try_mtime_fast_path` now consult
`LayoutPaths::needs_layout_migration()` and force `up_to_date = false`
when a populated legacy `node_modules/.lpm/` coexists with an empty
`<project>/.lpm/wrappers/`. Without this gate, an upgrade-in-place
user (binary upgraded but `node_modules/` not wiped) hash-matches
on the install-hash check, the top-of-`main` fast lane
short-circuits, and the migration code path never runs — they stay
silently on the legacy layout until something else invalidates the
hash.
2. Migration code path inside `lpm install`. Right after the fast-exit
guard returns false, `migrate_legacy_wrapper_layout` checks the
same predicate and (when true) wipes `node_modules/.lpm/` so the
subsequent `cleanup_stale_entries` rebuilds at the new wrapper-root
location. No rename-first attempt — cross-FS rename hazards
(Linux containers, network FS, EXDEV) outweigh the saved relink
cost, which Phase 61 makes faster anyway. Best-effort wipe; legacy-
state quirks don't abort the install.
D9 — migration notice modes. Human-pretty mode prints a one-line
"migrating wrapper layout" notice via `output::info`; JSON / `--quiet`
/ non-TTY remain silent.
Tests added:
- `legacy_layout_present_forces_install_via_full_read` — hash matches
but migration is owed → `up_to_date = false`.
- `legacy_layout_present_forces_install_via_mtime_fast_path` — same
but with v2 mtime line; the mtime fast path bails to slow path.
- `empty_legacy_dir_does_not_force_install` — empty `.lpm/` doesn't
count as legacy.
- `populated_new_layout_does_not_force_install` — both populated →
migration considered complete; gate stops firing.
- `migrate_legacy_wrapper_layout_wipes_legacy_state` — happy path.
- `migrate_legacy_wrapper_layout_noop_when_not_owed` — no-op
on a fresh project (doesn't synthesize directories).
- `migrate_legacy_wrapper_layout_noop_when_both_populated` —
doesn't wipe on a mid-migration mixed state (real convergence
happens via the next normal install).
4956 workspace tests pass; clippy --workspace -D warnings clean;
cargo fmt clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(lpm-cli): doctor + gitignore + sandbox comment refresh (Phase 61.4 + 61.5 + 61.7)
61.4 — `lpm doctor` predicate becomes layout-aware. The legacy
`nm.exists() && nm.join(".lpm").exists()` probe is replaced with
`LayoutPaths::install_appears_healthy()` plus a `needs_layout_migration()`
gate. The doctor now distinguishes:
- Healthy { Isolated } → "exists with .lpm/wrappers store"
- Healthy { Hoisted } → "exists with hoisted layout"
- Healthy { Mixed } → warn + remediation
- NodeModulesPresentButNoStore → warn (existing message preserved)
- NoNodeModules → fail (existing message preserved)
- legacy layout detected (migration owed) → warn pointing the user
at `lpm install` to converge
The hoisted-no-conflicts case (which the legacy predicate misreported
as "no .lpm store") now correctly classifies as healthy.
61.5 — `ensure_lpm_wrappers_gitignore` runtime helper. Mirrors
`ensure_skills_gitignore` (and the lpm-vault / npmrc siblings):
runtime "ensure once" pattern, idempotent, OpenOptions-append to
narrow the TOCTOU window. Marker is `.lpm/wrappers/`. Wired into the
install entry point alongside `migrate_legacy_wrapper_layout`.
61.7 — sandbox comment refresh. `landlock_rules.rs` explanatory
comment referenced `{project}/node_modules/.lpm/`; updated to mention
the post-Phase-61.1 `<project>/.lpm/wrappers/` location. The actual
ReadWrite rule at line 103 already grants `<project>/.lpm` so the
post-relayout location was already covered — comment-only change,
no functional impact.
Tests added:
- `ensure_lpm_wrappers_gitignore_appends_entry`
- `ensure_lpm_wrappers_gitignore_no_duplicate`
- `ensure_lpm_wrappers_gitignore_creates_when_no_gitignore`
4959 workspace tests pass; clippy --workspace -D warnings clean;
cargo fmt clean; no fancy-regex.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(lpm-linker): retarget legacy root symlinks + dotfile-aware layout predicates
Two audit fixes (round 2 of Phase 61 review):
CRITICAL — legacy root-symlink retarget. Pre-fix, the 61.3 migration
wiped `node_modules/.lpm/` but never touched root symlinks at
`node_modules/<pkg>` whose targets pointed into the legacy
wrapper-root shape. Phase 3's `if root_link.exists()` guard skipped
recreation, so an upgrade-in-place install left dangling symlinks —
the wrapper tree was wiped, but `node_modules/<pkg>` still pointed at
the old location and stayed broken.
Fix: `cleanup_stale_entries`'s root-symlink sweep gains a second
predicate. Beyond the existing "not in `direct_names`" stale-name
removal, it now ALSO removes any root symlink whose target traverses
a `.lpm/` segment NOT followed by `wrappers/` (legacy shape). Phase 3
recreates with the correct new target. Walks `Path::components()`
so the predicate is robust to path-separator style and to whether
the relative target leads with `.lpm/` (unscoped) or `../.lpm/`
(scoped). Self-refs (target = `..`, no `.lpm`) and workspace-member
symlinks (target outside `.lpm/`) are unaffected.
5 new tests:
- `cleanup_stale_entries_removes_legacy_shape_root_symlink`
- `cleanup_stale_entries_preserves_new_shape_root_symlink`
- `cleanup_stale_entries_preserves_workspace_member_symlink`
- `cleanup_stale_entries_preserves_self_reference_symlink`
- `link_finalize_retargets_legacy_root_symlink_after_migration`
(end-to-end: post-migration install produces a working symlink
resolving to a real `package.json`)
MEDIUM — `.version` schema-tag must not mask migration. The 61.1
`.version` write at the wrapper root happens BEFORE any wrapper is
materialized; pre-fix, `dir_is_nonempty` counted `.version` as
evidence of a populated layout, so a half-completed install (or any
state where the new root has only `.version`) would silently mask a
needed migration AND make `lpm doctor` report a healthy isolated
install when no wrappers actually existed. Both
`needs_layout_migration` and `install_appears_healthy` consume the
helper.
Fix: `dir_is_nonempty` now skips entries whose name starts with `.`.
Wrapper segments from `LayoutPaths::wrapper_segment` cannot produce
a leading-dot name (path-separator sanitizer is `replace('/', '+')`,
never `.`), so the dotfile filter cannot miss a real wrapper.
2 new tests:
- `needs_layout_migration_true_when_new_root_has_only_version_file`
- `install_appears_healthy_metadata_only_root_is_not_isolated`
4966 workspace tests pass; clippy --workspace -D warnings clean;
cargo fmt clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(lpm-linker): scoped legacy-symlink retarget belt-and-braces
Audit follow-up: the scoped-name branch (`@scope/pkg`) of
`cleanup_stale_entries`'s root-symlink sweep traverses a separate
code path from the unscoped branch. The retarget fix in the prior
commit applies to both, but the existing test only exercised the
unscoped case. This test adds the scoped equivalent so a future
refactor that drops the legacy-shape predicate from the scoped
branch fails loud.
Setup: a `node_modules/@types/node` symlink whose target is the
pre-Phase-61.1 scoped shape (`../.lpm/<seg>/node_modules/@types/node`,
no `wrappers/` segment). After cleanup the legacy symlink must be
removed so Phase 3 recreates it pointing at the new
`../../.lpm/wrappers/<seg>/...` two-level shape.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tolgaergin
added a commit
that referenced
this pull request
May 8, 2026
…ips during tarball unpack Phase 66 perf followup #1, samply-driven (2026-05-08). The tar crate's default unpack path calls `fchmodat` + `fchownat` per regular file — ~9 % of cold-install CPU on `bench/fixture-large` per the `tar::EntryFields::unpack::set_ownerships` self-time hotspot. For a 256-package install at ~80 entries each, that's ~20 000 unnecessary metadata syscalls. We don't preserve perms or ownerships in the global store: extracted files inherit the lpm process's umask and uid, which is what every downstream `require()` actually expects. npm tarballs ship with arbitrary uid/gid + mode bits authored on whoever-built-the-tarball's OS — they mean nothing on the consumer side. Disabling both is a 2-line change with zero behavior delta on the require/import path. Bench delta (paired n=10 cold/clean on fixture-large): - extract sum: 116 → 61 ms (Fix #1 + Fix #3 combined) - total wall: ~5-15 ms recovered Modest in absolute terms but pure win — the syscalls were doing nothing useful. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tolgaergin
added a commit
that referenced
this pull request
May 8, 2026
Phase 66 perf followup #3, samply-driven (2026-05-08). `prepare_output_path` was a ~4.4 % cold-install hotspot per samply's self-time data: per tarball entry, it walked every path component calling `symlink_metadata` per component to detect the symlink-attack case. For an npm tarball with ~80 entries averaging 4 path components, that's ~320 redundant `symlink_metadata` syscalls per package — the same intermediate dirs (`src/`, `lib/`, `node_modules/.bin/`, etc.) get re-stat'd on every entry that lives under them. Fix: thread a `HashSet<PathBuf>` of verified-already parent dirs through `extract_tarball_from_reader_with_inspector` and skip the syscall on cache hit. Only NON-leaf components are cached — the leaf is the per-entry file path which still needs the symlink-attack guard. Safety invariant: the cache only carries verified PARENT directories, never leaf paths. A leaf hit can never live in the set, so the symlink-attack guard cannot be bypassed by accident. Bench delta (paired n=10 cold/clean on fixture-large): - extract sum: 116 → 61 ms (combined with Fix #1's chmod skip) - ~320 → ~84 `symlink_metadata` calls per ~80-entry package - total wall: ~5-10 ms recovered (small because syscall is cached by the OS dirent cache and parallelism bounds wall by max-per-task) Capacity heuristic: HashSet sized for 64 parents up front. Most npm tarballs have ≤ 10 distinct intermediate dirs; 64 covers the long tail without over-allocating. Pinned by the existing 20 extractor unit tests (every path-traversal edge case still passes after the signature change). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tolgaergin
added a commit
that referenced
this pull request
May 8, 2026
Three small, verified wins on cold-install CPU + wall, identified by symbolicated samply on bench/fixture-large after fixes b03d051 already landed. #1 extractor: archive.set_preserve_mtime(false) Pre-fix flame attributed 2.0% of active CPU to filetime::set_file_- handle_times → fsetattrlist (100% from extract_tarball). mtime is meaningless for content-addressable store bytes — require() doesn't read it; lpm doctor doesn't use it. tar 0.4.45's preserve_mtime defaulted to true; flipping it eliminates the syscall entirely. #4 extractor: stream_entry_to_disk replaces entry.unpack() for files Even with preserve_permissions(false) tar 0.4.45's _set_perms still unconditionally calls fs::set_permissions (entry.rs:814 — the flag only controls SUID-bit retention). 1.7% of active CPU was __fchmod from 100% extract_tarball. New helper does File::create + io::copy only — same minimal write semantics as the existing write_buffered_entry path. #2 store: LinkMeta::write_to_unpublished skips inner tmp+rename populate_into stages the sidecar inside an unpublished tmp_dir (links/<key>.tmp.<pid>.<tid>/) that is published via a single outer atomic rename. The inner tmp+rename in write_to was redundant: no observer can ever see a half-written sidecar inside an unpublished dir. New write_to_unpublished writes the JSON directly. Saves one rename syscall per link entry × N packages. Verification — paired A/B median over 8 iters (worst dropped): Stage | Pre-fix | Post-fix | Δ total | 998 ms | 937 ms | −61 ms (−6.1%) fetch | 355 | 304 | −51 ms (#1 + #4) link | 138 | 132 | −6 ms (#2) Flame profile confirms target syscalls eliminated: set_perms_ownerships: 1.7% → 0 set_file_handle_times: 2.0% → 0 __fchmod: 1.7% → 0 __rename: 10.2% → 7.7% LinkMeta::write_to → write_to_unpublished: 4.2% → 0.3% Tests: cargo nextest run -p lpm-extractor -p lpm-store — 134/134 pass. Clippy: clean across workspace. Followups still open (separate tranche): - #5 fuse extract+analyze (drop redundant 2nd-pass walk, ~10% on-CPU) - #6 restore event-driven link/fetch overlap (~50-100 ms wall) - #3 lazy warm-hit sidecar touch (warm-install-only) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tolgaergin
added a commit
that referenced
this pull request
May 8, 2026
…s (v2 path) Drops the redundant post-extract directory walk + per-source-file disk re-read that the v2 path inherited from its non-streaming origin. v2's `extract_object_with_timings` now mirrors v1's `stream_and_store_package` (lib.rs:594-618): the extractor's `extract_tarball_from_reader_with_inspector` filters scannable entries via `PackageAnalyzer::should_scan` and feeds matching entries' bytes into the analyzer while still in the tar walk's write buffer. The post-extract `finalize` only reads `package.json` for manifest tags — the per-source-file pass is gone. Pre-flame attribution for the eliminated work: analyze_package (orchestrator): 3.8% active CPU analyze_single_file (per-file fs::read): 4.9% collect_source_files_recursive (fs walk): 1.5% total redundant cost: ~10.2% Post-#5 flame: analyze_package*: gone (only manifest 0.4%) analyze_single_file: gone collect_source_files_recursive: gone PackageAnalyzer::feed (during extract): 8.5% (= analyze_bytes work moved to fused path) Bench (single iter, network-noisy resolver swamps total wall): fetch_ms: 304 → 264 (−40 ms) link_ms: 132 → 124 (−8 ms) extract_sum unchanged (analyzer cost folded into the extract phase) security_sum: now ~0 (only finalize manifest read remains there) `tarball_data: &[u8]` implements Read directly so no Cursor needed. RefCell wraps the analyzer for the FnMut inspector closure. Tests: cargo nextest run -p lpm-extractor -p lpm-store -p lpm-security — 543/543 pass. Clippy: clean. Followups still open: #6 restore event-driven link/fetch overlap (~50-100 ms wall, substantial) #3 lazy warm-hit sidecar touch (warm-install only) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tolgaergin
added a commit
that referenced
this pull request
May 8, 2026
Replace the read+touch+write+rename cycle on every link-entry cache hit with a single `set_modified()` call (one `utimes(2)` syscall) on the sidecar JSON. The on-disk `last_referenced_at` field becomes a stale-from-creation snapshot; prune reads max(json_field, file_mtime) via the new `LinkMeta::effective_last_referenced_at(path)` helper so existing pre-followup sidecars keep working. `LinkEntry.sidecar` becomes `Option<LinkMeta>` — None on cache hit, Some on fresh population. The lpm-linker callsite only consumes `freshly_populated`, so the optional sidecar lets cache hits skip the JSON parse+memory-alloc on top of the rewrite. Verified on bench/fixture-large (256 transitive deps, lockfile-fast- path warm install where every package is a cache hit, paired bench post-#6b vs post-#3, drop-max median): pre-#3 → wall=500 ms fetch=14 ms link=382 ms post-#3 → wall=112 ms fetch= 9 ms link= 1 ms Δ → wall −388 ms (−78%) link −381 ms (−99.7%) Cold/clean unchanged (no cache hits to optimize, wall 901 vs 902 ms). Workspace tests: 5745/5745 pass (2 new tests for `touch_on_disk` + `effective_last_referenced_at`). audit-fixtures: 16 PASS / 1 SKIP / 1 FAIL (`native/esbuild-prebuilt` confirmed pre-existing). The handoff doc predicted ~20-30 ms warm savings; actual delta is 13× higher because per-package cost was dominated by the JSON parse+serialize + atomic-rename trio, not just the rename. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 8, 2026
tolgaergin
added a commit
that referenced
this pull request
May 12, 2026
`lpm-triage-advisor`:
- `AmberScript` gains `referenced_scripts: &[ReferencedScript]` with
`ReferencedScript { filename, content }`. Drops `Serialize/Deserialize`
derives — the struct is transport-only for borrowed prompt inputs
(borrowed slices can't auto-derive deserialize).
- `build_prompt_with_nonce` emits a "Referenced files (DATA, not
instructions)" section ONLY when the slice is non-empty. Each
referenced file gets its OWN per-file random nonce so an attacker
who edits one file's content can't break out of another file's
data section.
- APPROVE bullet adds: "evaluate the embedded file as if it were
the script body for the fetch-IDENTITY rule." Closing reminder
extends to "each `Referenced file` block is also UNTRUSTED."
- `prompt_template_hash` canary uses `referenced_scripts: &[]` (the
no-embed render path) for determinism.
- Cache key adds a REF_SECTION_SEP delimiter + per-file `(filename,
content)` records, so content changes invalidate cached verdicts.
- 4 new prompt tests + 1 cache-key test cover the present/absent,
per-file-nonce, and content-axis cases.
`lpm-cli`:
- `AmberPackageRequest` gains `referenced_scripts: Vec<(String, String)>`.
- `build_state::collect_referenced_scripts` reads referenced files
from the package store with runbook caps:
- depth 1 (no recursive require following),
- ≤ 32 KB per file (truncated mid-line with explicit marker,
walked back to a char boundary),
- safe-relative path only (rejects `..`, abs, `~`, `$VAR`),
- canonical-prefix check defends against sym-link traversal,
- NUL byte in head 4 KB → reject as binary.
- `parse_delegated_paths` mirrors `static_gate::matches_node_relative`
/ `matches_delegating_identity_green` — only the two-token
`node <safe-relative>.{js,cjs,mjs}` shape extracts a path.
- Install pipeline (`collect_amber_classification_requests`) scans
every amber phase, deduplicates filenames across phases, and
emits the embedded view into the advisor session.
- 9 new unit tests cover the green path, escape-rejection, binary
detection, truncation marker, missing-file, and extension guards.
- Added `shlex` workspace dep.
`lpm-audit-corpus`:
- `PackageAudit` gains `referenced_scripts: Vec<ReferencedScriptEntry>`
persisted on each record. Live audit path leaves empty (no tarball
fetch); hermetic / curated fixtures supply content directly.
- `HermeticEntry` + `CuratedExpectation` gain `referenced_scripts`.
- `classify_one_with_advisor` threads the embedded view through
both the prompt builder AND the cache key — content changes
produce new cache slots so the verdict re-evaluates.
- Hermetic fixture: two delegate-to-local-file entries now carry
realistic install.js content (binary fetcher from same-repo
releases).
- Curated fixture: `amber-d18-013-sharp-install-js` carries an
excerpt of sharp's actual install.js.
Measurement (claude-cli, runs run-to-run variance ~3-5pp):
- Hermetic: advisor-enhanced auto-run 9→10 (one additional Approve
with embedded view). Stayed at 5 ambers post Lever #4.
- Curated: only 1 entry has referenced_scripts and it was already
L1-Green'd by Lever #4, so this corpus shows no isolated Lever #3
movement. Real install.js content shines through the install
pipeline's file-reader at install time.
Tests: 2257/2257 lpm-security, 17/17 lpm-cli triage, 9/9 new
build_state unit tests, 57/57 lpm-triage-advisor. Clippy + fmt clean.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tolgaergin
added a commit
that referenced
this pull request
May 14, 2026
…ix (#58) * test(workflows): pin concurrency + recovery contracts for lpm install Adds tests/workflows/tests/install_concurrency.rs with 13 falsifiable tests covering production failure modes that had zero coverage: Category A — process racing: * two concurrent installs on same project (pins finding-#77 floor) * install + concurrent store-clean serialize via shared/exclusive store_lock (probed via try_with_exclusive_lock on the actual lock file, not a directory-existence proxy) * two concurrent `lpm install -g` via global_tx_lock — proves final manifest + WAL coherence under serialized commits Category B — interruption recovery: * kill mid-tarball-fetch leaves no .lpm/install-hash * next `lpm install` converges to a coherent end state Category C — network faults: * tarball 503 → 200 succeeds after retry (counting Respond impl) * metadata 404 fails immediately without retry (<2s wall-clock) Category D — filesystem faults: * readonly project dir fails with actionable error (no panic); POSIX-only via #[cfg(unix)], RAII guard restores permissions * `<project>/.lpm` planted as a regular file fails clearly Category E — partial state recovery: * stale install-hash triggers re-resolve + refetch * partial node_modules re-links to full state * truncated lpm.lockb either recovers or fails cleanly (no panic) Category F — WAL recovery hook: * torn WAL tail (3 garbage bytes) gets truncated by the dispatcher's recovery hook before the command runs; idempotent on re-invocation Support helper refactor (same commit so the new helper has callers): * extracts env-isolation set into `LpmEnvSink` trait + `apply_lpm_env(cmd, project)` shared by `lpm()` (assert_cmd) and the new `lpm_spawnable()` / `lpm_spawnable_with_registry()` (std::process::Command, supports Child::kill()) * trait impl on both Command variants ensures the two helpers cannot drift on the ~30 env knobs that gate test isolation Surfaced findings during this work: * #77 — no project-level install lock: concurrent installs silently drop one side's work AND/OR fail with atomic-rename races (3 observed failure modes documented in findings.md). Fix shape: LpmRoot::project_install_lock + with_exclusive_lock_async wrap. * #78 — retry-backoff has no test-friendly knob; retry-exhaustion tests take 15s+. Fix shape: LPM_RETRY_BACKOFF_MS_OVERRIDE env in debug builds. CI gate locally green: clippy --workspace --all-targets -- -D warnings: clean cargo fmt --check: clean fancy-regex ban: empty cargo build --workspace: clean cargo nextest run --workspace --exclude lpm-integration-tests: 6439 passed, 7 skipped, 1 leaky (pre-existing) Deferred (filed under "next session" in the followup plan): B.3 (kill doesn't tear lockfile) — subsumed by B.1/B.2 B.4 (panic injection) — needs LPM_TEST_PANIC_AT env hook C.2 (retry exhaustion) — blocked by finding #78 C.3 (truncated body) — needs custom Respond with Content-Length mismatch D.3 (disk-full simulation) — no portable mechanism F.2, F.3 (orphan WAL, torn WAL with real records) — needs framed-WAL construction helpers Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(workflows): pin lpm.lock well-formedness + recovery skip-on-contention Closes B.3 and F.2 of the concurrency tranche — 13 → 15 tests, meeting the "≥15 of 21" acceptance criterion for Item 2. B.3 — `install_killed_mid_pipeline_leaves_well_formed_or_absent_lockfile`: Exercises two SIGKILL windows on the install pipeline — fresh project and project with a committed lpm.lock from a prior install. After each kill, asserts the on-disk lpm.lock is either absent OR parses as TOML. Never half-written. Adds `toml = { workspace = true }` as a workflow- tests dev-dep for the parse assertion. Helper `assert_lockfile_well_formed_or_absent` shared between both windows. F.2 — `lpm_command_skips_recovery_when_another_lpm_holds_global_tx_lock`: Validates the dispatcher's `try_with_exclusive_lock` idempotent-skip path at `main.rs:2531`. A background thread acquires `global_tx_lock` via `lpm_common::with_exclusive_lock` and blocks on a channel. With the lock held, runs `lpm global list` against a project with a torn- WAL prefix — asserts the WAL bytes are UNCHANGED (skip arm fired, recovery did not run). Then releases the lock and re-runs; asserts the WAL is now truncated (recovery defers correctly to the next lock-free invocation). Exercises both branches of the `try_with_ exclusive_lock` Ok(None) / Ok(Some) arm. CI gate locally green: cargo clippy --workspace --all-targets -- -D warnings: clean cargo fmt --check: clean cargo nextest run --workspace --exclude lpm-integration-tests: 6441/6441 passed, 7 skipped 5x parallel re-run of install_concurrency: 15/15 stable each run Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(workflows): pin truncated-tarball + orphan-WAL recovery contracts Two new tests in tests/workflows/tests/install_concurrency.rs: - C.3 tarball_connection_dropped_mid_body_fails_or_retries: a custom wiremock Respond impl serves half a tarball with a Content-Length header naming the full length. Pins the install pipeline's retry-then-fail behavior on transport-class failures (~14s wall-clock for the full 4-attempt retry schedule). Hyper 1.9 server-side panics on the Content-Length lie, dropping the connection — a valid surrogate for a broken upstream / CDN dropping mid-body. Surfaced 8 tarball GETs per install (deterministic, 3-of-3 reproducer), explained by two distinct download_tarball_* call sites in install.rs each running the 4-attempt retry budget. - F.3 lpm_command_with_orphan_pending_tx_emits_recovery_banner: plants both halves of an orphan transaction (WAL Intent record without matching Commit/Abort + matching [pending.<pkg>] row in manifest.toml pointing at a non-existent install root) and asserts the dispatcher's recovery hook fires the RolledBack banner from main.rs:2543. Sets RUST_LOG=lpm=info to lift the default lpm=warn filter so the tracing::info! line surfaces. Adds lpm-global as a workflow dev-dep for WalWriter / IntentPayload / write_for. Pins post-state: orphan pending row gone, no spurious active row. Together these close the C.3 and F.3 gaps in Item 2 of the test coverage follow-up plan: 17/21 scenarios pinned (was 15/21). The four remaining items all need source-side hooks (LPM_TEST_PANIC_AT, LPM_RETRY_BACKOFF_MS_OVERRIDE, container infra) and are out of scope for this tranche. Full CI gate green: clippy clean, fmt clean, fancy-regex empty, 6443/6443 nextest pass (was 6441 pre-tranche). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(workflows): pin tarball-extraction security contracts at install tier New file tests/workflows/tests/tarball_security.rs ships phase 1 of Item 3 (tarball-extraction security): 5 of 10 planned tests covering the most distinct security contracts at the install-pipeline tier. Each test constructs its malicious tarball in-line via tar::Builder (no checked-in fixtures), serves it through MockRegistry, and runs lpm install end-to-end so any pipeline-level regression that bypasses the extractor's hardening is caught. Tests landed: - #1 tarball_with_dot_dot_path_entry_is_rejected_by_install — pokes package/../escape.txt into the raw tar header bytes; install fails with "path traversal detected"; outside sentinel never created. - #3 tarball_with_absolute_path_entry_is_normalized_to_relative_under_package_dir — renamed from "rejected" to reflect actual contract. The extractor's strip_first_component consumes the RootDir; an entry like /etc/lpm-pwned.txt extracts as node_modules/<pkg>/etc/lpm-pwned.txt. Install SUCCEEDS; literal /etc/lpm-pwned.txt is never written. Defensible: malformed-but-safe input normalized rather than refused. - #2 tarball_with_symlink_to_outside_path_is_silently_skipped — renamed. The is_file() gate at lib.rs:398 silently drops symlinks; install succeeds with byte-identical outside sentinel. - #5 tarball_with_hard_link_to_outside_file_is_silently_skipped — renamed. Same is_file() gate; hardlinks silently skipped; outside victim file unmodified. - #8 tarball_with_setuid_executable_extracts_with_setuid_bit_stripped (POSIX-only) — tarball entry mode 0o4755 extracts as 0o755. SUID, SGID, and sticky bits all cleared via set_preserve_permissions(false) + the explicit `0o644 | exec_bits` mode set after write. Exec bits preserved. Three tests carry a "plan-vs-actual" docstring section explaining why the rename is defensible — the actual extractor contract differs from the plan's prescribed phrasing in safe ways, not in regression-grade ways. No findings filed. Phase 2 (5 remaining tests: Unicode normalization, device file, FIFO, zero-byte sanity, OS-max path) is deferred to a follow-up tranche with rationale + lift estimate documented in the plan. None blocks phase 1 acceptance. Pre-merge gate green: clippy clean, fmt clean, fancy-regex empty, 6448/6448 nextest pass (was 6443; +5 for the new tests). 0.18s wall- clock for the full file. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(install): per-project lock prevents concurrent-install data loss Closes finding #77. Two `lpm install <pkg>` invocations on the same project no longer race on the manifest snapshot+commit window. Pre-fix, both processes acquired only a SHARED store_lock and proceeded in parallel. Each opened its own per-process ManifestTransaction snapshot of the pre-edit package.json, staged its own dep on top, and ran the install pipeline. Whoever wrote package.json + lpm.lock last won; the other process's edits — including its node_modules link — silently vanished. Both processes still exited 0 with success-path output. CI scripts that ran two installs in parallel saw no signal of the data loss. The fix introduces: - crates/lpm-common/src/paths.rs::project_install_lock(project_dir): free helper returning <project_dir>/.lpm/.install.lock. Re-exported from crates/lpm-common/src/lib.rs. - run_add_packages and run_install_filtered_add in crates/lpm-cli/src/commands/install.rs now wrap the snapshot → stage → install → finalize → commit window in with_exclusive_lock_async against the project lock. The lock is per-project (no cross-project contention) and held across all ?-early-exits via the async block's return. For the workspace path, the lock sits at the discovered workspace root (not per-member) so two concurrent `lpm install --filter <member>` invocations on the same workspace serialize without per-member deadlock-ordering complexity. run_with_options (the inner install pipeline) does NOT acquire this lock — it's called from inside both run_add_packages's wrap and from many other commands; double-acquiring the same fd-lock would deadlock in-process. Deferred (phase 2, not exercised by A.1): lpm add (add.rs:723-904) has a similar 180-line transaction with recursive Swift handling. Wrapping it is invasive and the race surface is theoretical (users don't typically run `lpm add` and `lpm install` concurrently). Defer to a separate tranche if a concurrent `lpm add` × `lpm install` race is ever observed. Test contract tightening (bug-first per CLAUDE.md): two_concurrent_installs_on_same_project_leave_well_formed_manifest in tests/workflows/tests/install_concurrency.rs went from "at-least-one survives + manifest is well-formed JSON" (the floor) to "BOTH installs succeed, BOTH packages present in package.json deps, BOTH packages linked in node_modules/" (the contract). Pre-fix: 1/1 fail (pkg-b silently dropped). Post-fix: 5/5 pass with no flakes (~1.2s wall-clock each — install B observes pkg-a's commit and reports "Resolved 2 packages"). Pre-merge gate green: clippy --workspace --all-targets clean, fmt clean, fancy-regex empty, 6448/6448 nextest pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(registry): test-only retry-backoff override env knob Closes finding #78 + lands C.2 (`tarball_503_exhausts_retries_fails_with_http_status`). Pre-fix, retry-exhaustion tests were blocked: the registry client's backoff schedule (1+2+4+8s, capped at 10s) made every retry-exhaustion test take ~15s per fetch site (~28s with the install pipeline's 2 distinct download_tarball_* call sites). MAX_RETRIES, RETRY_BASE_DELAY, and RETRY_MAX_DELAY are private const with no env override. C.2 therefore had to be #[ignore]-gated behind LPM_RUN_SLOW_TESTS=1, and the retry-exhaustion contract went unproven on `cargo nextest run`. The fix introduces: - crates/lpm-registry/src/client.rs::backoff_override(): reads LPM_RETRY_BACKOFF_MS_OVERRIDE (a u64 ms value) gated by cfg!(debug_assertions) || LPM_TEST_MODE=1. Returns Some(Duration) when both conditions hold; None otherwise. Production retry policy is immune — release builds without LPM_TEST_MODE=1 silently ignore the env. - backoff_delay(attempt) consults the override before computing the exponential schedule. - The two 429 Retry-After sleep sites also consult the override so a future 429-flood retry-exhaustion test wouldn't hang on the server-supplied header. C.2 test landed alongside (bug-first per CLAUDE.md): - Mock returns 503 on every tarball request — no recovery path. - Test sets LPM_RETRY_BACKOFF_MS_OVERRIDE=10 on the lpm subprocess. - Asserts: install fails non-zero, no panic, ≥4 attempts (proves the retry loop fired), elapsed < 2s (load-bearing — without the knob this fails at ~14s), stderr contains an actionable HTTP-class noun (503 / status / http / network / etc). - Surfaces 8 tarball GETs per install (4 attempts × 2 distinct download_tarball_* call sites — matches C.3's observation). Pre-fix verification: same C.2 against the unfixed client.rs failed on the elapsed assertion at 14.04s (knob ignored). Post-fix: passes in 1.6s cold / 0.1s warm. 5/5 passes with no flakes. Pre-merge gate green: clippy --workspace --all-targets clean, fmt clean, fancy-regex empty, 6449/6449 nextest pass (was 6448 pre-fix; +1 for C.2). Item 2 of the test-coverage-followup-plan now at 18/21 (was 17/21). Both findings #77 and #78 fixed in production. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(workflows): tarball-security phase 2 — Unicode, device, FIFO, zero-byte, long-path Adds 5 more tests to tarball_security.rs, completing Item 3 of the test-coverage follow-up plan. Each test pins the actual extractor contract under malicious-or-edge-case tarball shapes that reach the install pipeline through MockRegistry. Tests landed: - #4 tarball_with_unicode_lookalike_parent_dir_extracts_safely_as_literal_bytes — renamed from "_normalization_traversal_rejected" to reflect the actual contract. Tarball entry path uses full-width dots U+FF0E `..` (bytewise NOT ASCII `..`). Component::ParentDir is byte-exact, so `..` becomes Component::Normal. Install SUCCEEDS; `..` materializes as a literal directory under node_modules/<pkg>/; outside sentinel byte-identical. Defensible because Path::components() doesn't NFKC-normalize on POSIX. - #6 tarball_with_character_device_entry_is_silently_skipped (POSIX-only). EntryType::Char with /dev/null-shaped major/minor. Same is_file() gate as symlinks/hardlinks — silently skipped. Install SUCCEEDS; no device file at the expected path. - #7 tarball_with_fifo_entry_is_silently_skipped (POSIX-only). EntryType::Fifo. Same posture as #6. - #9 tarball_with_zero_byte_regular_file_extracts_as_empty_file. Sanity check that empty files still extract correctly (legitimate npm shape: .gitkeep, license placeholders). - #10 tarball_with_single_path_component_exceeding_name_max_fails_cleanly. 300-byte single-component name, well over POSIX NAME_MAX=255. Tar wire format succeeds via GNU long-name extension; the FILESYSTEM rejects on extraction (ENAMETOOLONG). Extractor wraps as LpmError::Io → install fails non-zero with the OS error visible and an actionable noun in stderr. Three of the five tests are renamed to reflect actual extractor contract vs the plan's prescribed phrasing — same "plan-vs-actual" docstring pattern as phase 1. No findings filed; all 10 contracts across phase 1 + 2 are defensible-as-implemented. Pre-merge gate green: clippy --workspace --all-targets clean, fmt clean, fancy-regex empty, 6454/6454 nextest pass (was 6449 pre-tranche; +5 for the new tests). Full file 0.2s wall-clock for all 10 tests. Item 3 now COMPLETE (10/10). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(workflows): cross-command flows Item 4 — migrate→rebuild + workspace filter isolation Closes Item 4 of the test-coverage-followup-plan at 6/6 (target was ≥5). Two additions to tests/workflows/tests/cross_command_flows.rs: - Plan #1 — extended flow_migrate_install_audit_lockfile_round_trips with a `lpm rebuild --dry-run --policy=deny` step. Pins the full migrate → install → audit → rebuild lifecycle. Asserts the rebuild step exits 0 + does not mutate the post-audit state (lpm.lock + lpm.lockb still present). Catches regressions where rebuild's lockfile or build-state parser breaks against a freshly-migrated manifest. - Plan #5 — added flow_workspace_install_filter_member_a_does_not_mutate_member_b (new test, 159 LOC). Pins the workspace-member isolation contract using the workspace-monorepo fixture (3 members: app, core, utils): 1. Initial filtered install on @test/core (re-pinning its existing semver dep) populates core's per-member quadruple: lpm.lock=319 B, lockb=230 B, install_hash=118 B. 2. Snapshot core's full quadruple. 3. Run `lpm install chalk@5.3.0 --filter @test/app` to add a new dep to app ONLY. 4. Assert app's package.json gained chalk; core's quadruple (package.json + lpm.lock + lpm.lockb + install-hash) is BYTE-IDENTICAL post-install; chalk does NOT appear in core's node_modules/. Catches a regression where a per-member filtered install accidentally also mutates a sibling member's package.json / lockfile / install-hash — a real bug class because run_install_filtered_add shares the workspace-root project lock (added in #77 fix) and could over-snapshot if the target-set computation drifts. Helper `mount_pkg_full(mock, name, version)` factors out the three-step metadata + batch-metadata + tarball mount so the test body stays readable. Other 4 plan flows already covered pre-tranche: - Plan #2: flow_add_install_graph_added_dep_visible - Plan #3: flow_install_patch_patch_commit_install_persists_patch - Plan #4: flow_token_rotate_publish_dry_run_picks_new_token - Plan #6: flow_install_upgrade_major_audit_picks_new_version Pre-merge gate green: clippy --workspace --all-targets clean, fmt clean, fancy-regex empty, 6455/6455 nextest pass (was 6454; +1 for the new flow). Plan #5 stable across 5/5 reruns at ~0.11s each. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(install): LPM_TEST_PANIC_AT hook + B.4 panic-rollback contract Adds a deterministic panic-injection hook to the install pipeline + unblocks the long-deferred B.4 contract test for ManifestTransaction Drop-based rollback on panic. The hook (`maybe_test_panic(stage)` in crates/lpm-cli/src/commands/install.rs) reads LPM_TEST_PANIC_AT and panics when the env value matches the stage name. Gated to `cfg!(debug_assertions) || LPM_TEST_MODE=1` — same pattern as the #78 retry-backoff override. Production builds without LPM_TEST_MODE=1 silently treat the env as no-op. Wired 4 stages in `run_add_packages`: - "after-snapshot" — manifest unchanged; Drop is no-op - "after-stage" — placeholder `*` written to package.json (load-bearing) - "after-install" — pipeline complete; manifest still has `*` - "after-finalize" — concrete versions written; pre-commit only The hook unblocks B.4 (`install_panics_mid_pipeline_rollback_restores_manifest`), deferred since the original Item 2 tranche because there was no deterministic way to trigger a panic mid-install from a workflow test. Recoverable errors fire `?`-rollback (covered by E.1/E.2/E.3); SIGKILL bypasses Drop entirely (B.1/B.2/B.3 cover that). The panic path was the missing rollback proof. B.4 sets LPM_TEST_PANIC_AT=after-stage and asserts: - process exits non-zero (panic propagates to runtime) - stderr contains `"panicked at"` AND `"LPM_TEST_PANIC_AT=after-stage"` - package.json BYTE-IDENTICAL to pre-stage (Drop ran on unwind, snapshot bytes restored — load-bearing) - the new pkg is NOT in dependencies (placeholder rollback worked) - .lpm/install-hash absent (invalidate-on-rollback) - lpm.lock absent (matched optional snapshot's None pre-state) Catches a regression where: - panic = "abort" added to release profile (no Drop on panic) - ManifestTransaction Drop logic stops restoring snapshot bytes - The `lpm install` snapshot+commit window grows without re-wiring Drop Test runs in 0.07s warm. 5/5 stable across reruns. Pre-merge gate green: clippy --workspace --all-targets clean, fmt clean, fancy-regex empty, 6456/6456 nextest pass (was 6455; +1 for B.4). install_concurrency now at 19/19. Item 2 of test-coverage-followup-plan moves to 19/21 — only A.2 (no contract) and D.3 (needs container infra) remain deferred indefinitely. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(workflows): align MockRegistry tarball URL shape with production /-/ gate Workflow tests mounted tarballs at `/tarballs/{name}-{version}.tgz` — missing the `/-/` path segment that the registry-client's `evaluate_cached_url` gate at [crates/lpm-registry/src/client.rs#L4117] requires (`.tgz` suffix AND `/-/` substring). The gate is a defense-in-depth check that blocks the H1 auth-token leak: a tampered lockfile URL like `/api/admin/foo.tgz` (no `/-/`) would otherwise attach the bearer to a non-registry endpoint. The mismatch produced two test-environment side effects that don't manifest in production: 1. **WARN noise**: every install test that read a tarball URL from the lockfile fast path logged `cached tarball URL for X@Y failed shape check; falling back to on-demand lookup`. Polluted stderr across the suite. 2. **`shape_mismatch_count` defeated**: the registry-client documents this counter as a "BUG signal — the writer should never emit a gate-rejectable URL". Test runs incremented it on every install, making the counter useless for catching real bugs. This commit migrates the mock to the production-shape `/tarballs/{name}/-/{name}-{version}.tgz` everywhere — both the helper methods (`MockRegistry::tarball_path` / `tarball_url`) and the ~60 hard-coded `format!` sites across 14 test files + 1 snapshot. The new `tarball_path` helper is `pub` with a prominent docstring warning future test authors not to re-introduce the legacy shape. Internal mounts in `with_package_and_deps` / `with_package_published_at` / `with_full_package_metadata` all route through it. Post-fix verification: WARN gone, gate `Accepted` path runs, all 691 lpm-workflows tests pass (0 leaky in the latest full-workspace run, down from 1-3 leaky pre-fix — fewer fallback paths firing). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(workflows): test-coverage-followup tranche — Items 2/3/4/5 Closes the remaining open rows from `private/test-coverage-followup-plan.md` across four items. ~2,600 LOC of new test code + fixture + budget infra. **Item 3 — tarball-security additional candidate surfaces (7 tests in `tarball_security.rs`):** - `tarball_with_pax_path_traversal_rejected` — PAX extended `path` header smuggling `..` is rejected by the extractor's `Component::ParentDir` check after the tar crate resolves the override. - `tarball_with_gnu_longname_traversal_rejected` — symmetric GNU `L` entry; same rejection path. - `tarball_rejects_or_rolls_back_when_later_entry_is_malicious` — pins the `rollback_extraction` contract: valid first entry is cleaned up when a later `..`-traversal entry trips rejection mid-stream. - `tarball_with_duplicate_member_path_rejected_or_deterministic` — pins current last-write-wins contract (defensible; flagged scanner- disagreement risk in test comment). - `tarball_with_truncated_gzip_rolls_back_partial_extract` — half- truncated gzip stream → libdeflate fails cleanly → no partial extract. - `tarball_ignores_uid_gid_ownership_metadata` (POSIX) — bogus uid/gid in tar header is ignored; extracted files owned by process uid. - `tarball_with_sparse_huge_file_rejected_by_declared_size` — manually- constructed tarball with header declaring `MAX_FILE_SIZE + 1` and empty on-wire body; extractor rejects on the pre-check at lib.rs:306 before draining body. **Item 4 — cross-command flows additional candidate surfaces (2 tests in `cross_command_flows.rs`):** - `flow_install_uninstall_install_graph_round_trip` — pins manifest / link / graph hand-off through a full round-trip. - `flow_cache_clean_then_offline_install_uses_store_or_fails_helpfully` — pins the cache/store boundary: `cache clean` must not corrupt offline install; store-side bytes byte-identical after a clean. **Item 2 — concurrency/recovery additional candidate surfaces (3 tests in `install_concurrency.rs`):** - `cache_clean_during_slow_tarball_install_does_not_corrupt_install` (G.4) — install + cache clean run concurrently (different lock paths, no serialization); install succeeds despite metadata cache wipe mid-stream. Empirical timing observed: install elapsed 1.57s, cache clean fired at t=30-39ms cleanly inside the install window. - `install_panics_after_install_hash_write_rollback_invalidates_hash` (G.5) — reuses existing `LPM_TEST_PANIC_AT=after-install` stage (no new source-side hook needed — `write_post_install_v6_hash` runs inside `run_with_options` which returns BEFORE that stage fires). Pins that Drop-based rollback restores manifest AND deletes the freshly-written install-hash. - `malformed_registry_json_fails_without_manifest_or_lockfile_mutation` (G.6) — truncated JSON on all three metadata endpoints; install fails cleanly, no panic/backtrace, package.json byte-identical, no torn lockfile. **Verdaccio-npm parity for `which@4.0.0` (`install_real_registry.rs`):** - `verdaccio_npm_parity_for_bin_package_pins_metadata_and_shim_presence` — extends the existing lodash byte-diff with a bin-shipping target package. Asserts metadata equivalence + `.bin/<name>` shim present on both sides + bin target file materialized + exec bits non-zero (POSIX). **Item 5 — realworld fidelity (new fixture + new test file):** - `tests/fixtures/realworld-nextjs/` (package.json + README) — pinned Next.js 14.2.13 + React 18.3.1 + TypeScript 5.6.3 + 3 `@types/*` packages. Resolves to ~28 transitive deps empirically. README documents the calibration methodology including raw measurement data. - `tests/workflows/tests/install_realworld.rs` — `install_realworld_nextjs_fixture_succeeds_through_verdaccio` installs the fixture through Verdaccio→npmjs and asserts end-to-end success at production scale. Always logs cold + warm wall-clock + peak RSS to stderr for calibration data. - **`LPM_BUDGET_GATE=1`-gated budget assertions**: cold ≤ 25s, warm ≤ 25ms, cold peak RSS ≤ 1500 MiB. Calibrated from N=6 cold + N=3 warm + N=3 RSS runs on M-series macOS, 2026-05-14. Memory measurement via `/usr/bin/time -l` (macOS) / `-v` (Linux); Windows skips with a clear warning. This closes Item 5 entirely (all 4 acceptance criteria green) and brings Items 2/3/4 to the parked-by-design or infrastructure-blocked baseline. CI gate: clippy `--workspace --all-targets -- -D warnings` clean, fmt clean, fancy-regex empty, build clean, `cargo nextest run --workspace` 6471/6471 pass. Suite runtime ~2:40 (was ~2:24 pre-tranche; +15s for the realworld test). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(workflows): collapse Linux-only let-chain in parse_peak_rss CI lint on Linux failed on `clippy::collapsible_if` in the Linux-cfg'd branch of `parse_peak_rss`. The macOS branch had an intermediate `let bytes_str = rest.trim();` between the two `if let`s, which is why the local clippy run on macOS didn't catch this — only the macOS-cfg branch compiled there. Collapse the Linux branch to use `&&` (stable let-chains) so it satisfies the lint while preserving the same semantics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tolgaergin
added a commit
that referenced
this pull request
May 16, 2026
Removes phase numbers, plan section refs (§4.1, §7.2, §11, §12.2, §18), attribution (GPT, Gemini), date stamps, finding IDs (Finding #N, D-impl-N), and internal plan labels (Phase 46 P2/P3/P4/P6/P8, Phase 46b Lever #3/4, Option B, Chunk N) from all comments and doc strings. Behavioral descriptions are preserved and rewritten in plain language where needed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
TL;DR
Two root-cause fixes to the batch-metadata NDJSON parse loop. Decision-gate cold install: 46 s → 6.7 s (7× faster).
Commit 1 — wall-clock timeout (
20e757f).reqwest::Client::builder().timeout(30s)fires on body reads too. 66 MB streams exceed 30 s. Replaced with.connect_timeout(10s) + .read_timeout(30s). Surfaced the real bottleneck (below) by letting the stream actually complete instead of crashing withoperation timed out <- request or response body error <- error decoding response body.Commit 2 — quadratic
\nscan (621cf84). Every chunk appended tobufferre-scanned from offset 0 for the next newline. Average NDJSON line ~200 KB arriving over ~30 chunks → ~4 MB scanned per line × 365 lines ≈ 1.5 GB of re-scans at ~74 MB/s ≈ 21 s. Added ascan_fromcursor. Every byte now scanned at most once → ~66 ms for 66 MB. 300× speedup on the scan alone.Sub-timer narrowing
After the timeout fix the stream completed but the same
initial_batch_ms ≈ 40 spersisted vs raw reqwest at~7 s. Added 10 finer-grained phase timers to the parse loop and reran cold:scanat 21 s was 95 % of the loop. Every other CPU phase summed to ~700 ms, ruling out the usual suspects (channel backpressure, sync cache writes, JSON parse, memcpy).Popular theories this falsified
From an independent second-opinion review (Gemini, no code access) that diagnosed the symptom as a "4.5× ingestion tax":
mpsc::channel(32)too small, backpressure traps the readerfs::writeper line adds ~5 ms × 325 = 1.6 sinitial_batch_mswithLPM_SPEC_FETCH=0spawn_blocking/ bigger read buffersWhat was real:
.position(b'\n'), not blocking I/OPost-fix measurements (cold, lpm.dev)
total_mstotal_msDecision-gate breakdown, median of warm runs:
initial_batch_msfollowup_rpc_msfollowup_rpc_countfetch_msresolve_msBun's reported ~3.6 s on this fixture is now ~2× away, down from ~12×.
Tests
Two regression tests from the timeout commit still cover the read_timeout behavior (
batch_metadata_deep_tolerates_slow_streaming_body_under_read_timeout+batch_metadata_deep_fails_under_old_wallclock_timeout). The scan fix doesn't need a dedicated test — its correctness falls out of the existing streaming tests which consume multi-chunk bodies, and its performance impact is an O(n²) → O(n) cleanup that shows up in benchmarks rather than correctness assertions. Both tests pass in 0.8 s.Metadata-bloat sensitivity follow-up (the side question)
Investigated during diagnosis; still applies and still worth doing as a separate PR.
lpm-resolver::ranges::to_pubgrub_ranges(&available_versions)is O(N) in version count, uncached, and PubGrub calls it O(queries) times per package during backtracking. The +962 mspubgrub_core_msregression Phase 41 measured when 9 extra packages were pre-fetched matches the K×N×queries arithmetic. Memoizing(pkg, range) → Rangesis a bounded 1–2 day PR. File separately.CI gate locally green
cargo clippy --workspace -- -D warningscargo fmt --checkcargo build --workspacecargo nextest run --workspace --exclude lpm-integration-tests --no-fail-fast— 3641 passed, 7 skipped, 0 failedcargo test -p lpm-auth×2 — 43 passed both (parallel-deterministic)🤖 Generated with Claude Code