
fix(ci): per-PR cargo registry to break intel-runner concurrent-write race (ANDON paiml/infra#77)#1043

Merged
noahgift merged 2 commits into main from fix/ci-per-pr-cargo-registry-isolation
Apr 24, 2026

Conversation

@noahgift

Summary

Stop-the-line fix for the intel-runner shared cargo registry race that is blocking aprender PRs #1031..#1042 (11 PRs, SHIP-TWO-001 algorithmic coverage).

Problem

All 11 stacked PRs (plus unrelated PR #1025 itself) currently fail ci / security or workspace-test with variants of:

error: couldn't read /home/noah/.cargo/registry/src/<crate>/lib.rs: Permission denied (os error 13)
error: could not compile `fnv` (lib) due to 1 previous error

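or the rustix-0.38 variant: E0432 (unresolved import libc / libc_errno) originating in the syscall macro, which the rustix build.rs regenerates from the extracted src/ files. When src/ is missing, the macro cannot find the libc crate and the errors cascade.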

Root cause (five whys — full write-up in paiml/infra#77)

  1. ci / security fails → cargo install cargo-audit can't read fnv-1.0.7/lib.rs.
  2. EACCES → file is missing or owned by a different UID.
  3. Who else writes? → the 16 intel-clean-room-* runners that ALL bind-mount the SAME host /home/noah/.cargo/registry, plus ci-reaper.sh sweeping src/ on a TTL.
  4. Why shared? → ci.yml:49 was authored for throughput (avoid ~200 MB crate re-download per job). Race class not modeled.
  5. Why not fixed sooner? → target/ hit the exact same race (task #134) and was fixed by per-PR isolation on /mnt/nvme-raid0/targets/aprender-ci/<pr#>. The registry never got the same treatment.

PR #1025's self-heal is a band-aid that only runs inside ci / security and itself races with concurrent jobs. It does not close the class.
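
Step 2 of the five whys (a file owned by a different UID) can be checked directly on a runner host. The following is a minimal diagnostic sketch, not part of this PR or of ci-reaper.sh; `foreign_owned` is a hypothetical helper name, and the registry path is whatever directory is bind-mounted into the containers.

```shell
#!/usr/bin/env bash
# Hypothetical diagnostic (not part of ci.yml or ci-reaper.sh):
# list registry entries that are NOT owned by the current user.
# $1 = registry root on the host, e.g. the bind-mount source directory.
foreign_owned() {
  local registry="$1"
  # `-user` accepts a numeric UID; `!` negates the match, so this prints
  # every file or directory owned by some other UID (e.g. root).
  find "$registry" ! -user "$(id -u)" -print
}
```

An empty result means every entry belongs to the invoking user. Any path printed is a candidate for the Permission denied failure shown above.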

Fix

Mirror the existing target-dir pattern (ci.yml:55) for the cargo registry:

       volumes:
-        - /home/noah/.cargo/registry:/usr/local/cargo/registry
+        - /mnt/nvme-raid0/cargo-ci/registry/${{ github.event.pull_request.number || github.ref_name }}:/usr/local/cargo/registry

Each PR now owns its registry lifecycle. No cross-PR writes → no race.
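
The `||` in the volume expression falls back to the ref name when the run has no pull request number (for example, a push to main). A rough shell equivalent of that resolution is sketched below; `registry_dir` is a hypothetical helper used only for illustration and does not exist in ci.yml.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring
#   github.event.pull_request.number || github.ref_name
# $1 = PR number (empty when the run is not a pull request)
# $2 = ref name (e.g. "main")
registry_dir() {
  local pr_number="$1" ref_name="$2"
  # ${var:-fallback} expands to the fallback when var is unset or empty.
  printf '/mnt/nvme-raid0/cargo-ci/registry/%s\n' "${pr_number:-$ref_name}"
}

registry_dir 1043 ignored   # pull request run: keyed by PR number
registry_dir "" main        # push run: keyed by ref name
```

One consequence of keying on the ref name: a ref that contains `/` would resolve to a nested directory under registry/ rather than a single leaf.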

Cost

  • First-run cache miss: ~200 MB crate download per PR. Same profile the fleet already absorbed for the target/ fix.
  • Run 2+ on the same PR: cache hit, no extra cost.

Follow-up (paiml/infra#77)

  • Forjar recipe: pre-create /mnt/nvme-raid0/cargo-ci/ owner=noah:noah, 0755, on all 16 intel runners.
  • Reaper: extend machines/intel/sovereign-ci/systemd/reaper/ci-reaper.sh:308 TTL sweep to include /mnt/nvme-raid0/cargo-ci/registry/*/src.
  • Once infra PR lands, drop the ANDON comment in this ci.yml.

Docker auto-creates the mount-source leaf dir on first run, so this PR is landable standalone; infra PR only improves GC.
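
The reaper follow-up could look roughly like the sketch below. It is an assumption about the shape of the sweep, not the actual ci-reaper.sh logic: `sweep_registry_src` and its two parameters are hypothetical, and the TTL value is left to the caller because the real TTL is not stated here.

```shell
#!/usr/bin/env bash
# Hypothetical sweep (not the real ci-reaper.sh): remove extracted crate
# directories under <root>/<key>/src whose mtime is older than ttl_days.
# $1 = registry root, e.g. /mnt/nvme-raid0/cargo-ci/registry
# $2 = TTL in days
sweep_registry_src() {
  local root="$1" ttl_days="$2"
  local src
  for src in "$root"/*/src; do
    # Skip the literal pattern when the glob matches nothing.
    [ -d "$src" ] || continue
    # Only direct children of src/ are considered; -mmin +N matches
    # entries last modified more than N minutes ago.
    find "$src" -mindepth 1 -maxdepth 1 -type d \
      -mmin +"$((ttl_days * 1440))" -exec rm -rf {} +
  done
}
```

This sketch does not coordinate with running jobs, so a real sweep would still need a TTL long enough (or a lock) to avoid deleting src/ under an active build.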

Test plan

🤖 Generated with Claude Code

noahgift and others added 2 commits April 24, 2026 10:23
… race (paiml/infra#77)

ANDON 2026-04-24 — aprender 11-PR stack (#1031..#1042) all failing `ci / security`
and `workspace-test` with:

  error: couldn't read /home/noah/.cargo/registry/src/<crate>/lib.rs:
         Permission denied (os error 13)

and the rustix-0.38 equivalent (E0432 unresolved import `libc`/`libc_errno`
originating in the `syscall` macro, which the rustix build.rs regenerates from
src/ files — missing src/ → macro can't find libc crate → cascading errors).

FIVE WHYS
─────────
 1 `ci / security` fails: `cargo install cargo-audit --locked` hits EACCES
   reading `fnv-1.0.7/lib.rs`.
 2 EACCES: the file is missing OR owned by root (docker container creates
   extractions as root on the bind-mounted host registry).
 3 Concurrent writers: 16 self-hosted `intel-clean-room-*` runners bind-mount
   the SAME /home/noah/.cargo/registry — cargo extractions, the ci-reaper
   TTL sweep, and cross-container chown cycles all touch identical paths.
 4 Shared by design: ci.yml:49 was authored for throughput — re-downloading
   crates per job is ~200MB, so the host registry was shared across all
   runners. Race class not modeled.
 5 Precedent already exists: target/ hit the identical race under concurrent
   PRs (task #134) and was fixed by per-PR isolation on
   /mnt/nvme-raid0/targets/aprender-ci/<pr#>. The registry simply never got
   the same treatment.

ROOT CAUSE
──────────
Shared mutable bind mount + concurrent multi-runner write access ≈ guaranteed
race. The existing band-aid (PR #1025 "self-heal cargo registry cache",
cargo-ok + Cargo.toml marker check) only runs inside `ci / security` and
itself races with concurrent jobs that have already passed the cache check.

FIX (this PR)
─────────────
Mirror the target-dir pattern from ci.yml:55 for the cargo registry. Each
PR (or branch) gets its own registry under /mnt/nvme-raid0/cargo-ci/registry/<pr#>.
Docker auto-creates the leaf dir on first mount; the ci-reaper TTL sweep
(ci-reaper.sh:308) needs a companion infra update (paiml/infra#77) to include
the new /mnt path.

 - Removes: /home/noah/.cargo/registry:/usr/local/cargo/registry
 - Adds:    /mnt/nvme-raid0/cargo-ci/registry/${pr#|ref_name}:/usr/local/cargo/registry

Cost: ~200MB per PR on first run (cargo re-downloads crates). Same cost
profile as the target/ isolation fix, which the fleet already absorbed.
Once cargo-ci/registry/<pr#> warms on run 1, run 2+ hit the cache.

FOLLOW-UP
─────────
paiml/infra#77 tracks:
  - forjar recipe to pre-create /mnt/nvme-raid0/cargo-ci/ owner=noah:noah
  - reaper extension: GC /mnt/nvme-raid0/cargo-ci/registry/<pr#>/src with same TTL
  - once infra lands, drop the ANDON comment above

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) April 24, 2026 10:21
@noahgift noahgift merged commit f6b4dff into main Apr 24, 2026
10 checks passed
@noahgift noahgift deleted the fix/ci-per-pr-cargo-registry-isolation branch April 24, 2026 10:38