… race (paiml/infra#77)

ANDON 2026-04-24: aprender 11-PR stack (#1031..#1042) all failing `ci / security` and `workspace-test` with:

    error: couldn't read /home/noah/.cargo/registry/src/<crate>/lib.rs: Permission denied (os error 13)

and the rustix-0.38 equivalent (E0432 unresolved import `libc`/`libc_errno` originating in the `syscall` macro, which the rustix build.rs regenerates from src/ files: missing src/ → macro can't find libc crate → cascading errors).

FIVE WHYS
─────────
1 `ci / security` fails: `cargo install cargo-audit --locked` hits EACCES reading `fnv-1.0.7/lib.rs`.
2 EACCES: the file is missing OR owned by root (docker container creates extractions as root on the bind-mounted host registry).
3 Concurrent writers: 16 self-hosted `intel-clean-room-*` runners bind-mount the SAME /home/noah/.cargo/registry. Cargo extractions, the ci-reaper TTL sweep, and cross-container chown cycles all touch identical paths.
4 Shared by design: ci.yml:49 was authored for throughput. Re-downloading crates per job is ~200MB, so the host registry was shared across all runners. Race class not modeled.
5 Precedent already exists: target/ hit the identical race under concurrent PRs (task #134) and was fixed by per-PR isolation on /mnt/nvme-raid0/targets/aprender-ci/<pr#>. The registry simply never got the same treatment.

ROOT CAUSE
──────────
Shared mutable bind mount + concurrent multi-runner write access ≈ guaranteed race. The existing band-aid (PR #1025 "self-heal cargo registry cache", cargo-ok + Cargo.toml marker check) only runs inside `ci / security` and itself races with concurrent jobs that have already passed the cache check.

FIX (this PR)
─────────────
Mirror the target-dir pattern from ci.yml:55 for the cargo registry. Each PR (or branch) gets its own registry under /mnt/nvme-raid0/cargo-ci/registry/<pr#>. Docker auto-creates the leaf dir on first mount; the ci-reaper TTL sweep (ci-reaper.sh:308) needs a companion infra update (paiml/infra#77) to include the new /mnt path.

- Removes: /home/noah/.cargo/registry:/usr/local/cargo/registry
- Adds: /mnt/nvme-raid0/cargo-ci/registry/${pr#|ref_name}:/usr/local/cargo/registry

Cost: ~200MB per PR on first run (cargo re-downloads crates). Same cost profile as the target/ isolation fix, which the fleet already absorbed. Once cargo-ci/registry/<pr#> warms on run 1, run 2+ hit the cache.

FOLLOW-UP
─────────
paiml/infra#77 tracks:
- forjar recipe to pre-create /mnt/nvme-raid0/cargo-ci/ owner=noah:noah
- reaper extension: GC /mnt/nvme-raid0/cargo-ci/registry/<pr#>/src with same TTL
- once infra lands, drop the ANDON comment above

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Stop-the-line fix for the intel-runner shared cargo registry race that is blocking aprender PRs #1031..#1042 (11 PRs, SHIP-TWO-001 algorithmic coverage).
Problem
All 11 stacked PRs (plus unrelated PR #1025 itself) currently fail `ci / security` or `workspace-test` with variants of:
    error: couldn't read /home/noah/.cargo/registry/src/<crate>/lib.rs: Permission denied (os error 13)
or the rustix-0.38 version (`E0432: unresolved import libc/libc_errno` from the `syscall` macro that regenerates from extracted src/).
Root cause (five whys; full write-up in paiml/infra#77)
1. `ci / security` fails → `cargo install cargo-audit` can't read `fnv-1.0.7/lib.rs`.
2. EACCES: the file is missing or owned by root (the docker container creates extractions as root on the bind-mounted host registry).
3. Concurrent writers: 16 self-hosted `intel-clean-room-*` runners that ALL bind-mount the SAME host `/home/noah/.cargo/registry`, plus `ci-reaper.sh` sweeping `src/` on a TTL.
4. `ci.yml:49` was authored for throughput (avoid ~200 MB crate re-download per job). Race class not modeled.
5. `target/` hit the exact same race (task #134) and was fixed by per-PR isolation on `/mnt/nvme-raid0/targets/aprender-ci/<pr#>`. The registry never got the same treatment.
PR #1025's self-heal is a band-aid that only runs inside `ci / security` and itself races with concurrent jobs. It does not close the class.
Fix
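Mirror the existing target-dir pattern (`ci.yml:55`) for the cargo registry:
- Removes: `/home/noah/.cargo/registry:/usr/local/cargo/registry`
- Adds: `/mnt/nvme-raid0/cargo-ci/registry/${pr#|ref_name}:/usr/local/cargo/registry`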
Each PR now owns its registry lifecycle. No cross-PR writes → no race.
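As a rough sketch of the per-PR mount, the bind-mount spec could be derived as below. The helper `registry_mount` and its two arguments are hypothetical stand-ins for whatever `ci.yml` actually interpolates; only the host root and the container path come from this PR. Flattening `/` in branch names is an added assumption, made so that each key stays a single directory level.

```shell
#!/usr/bin/env bash
# Sketch only: registry_mount and its arguments are illustrative names,
# not taken from ci.yml. The two paths are the ones named in this PR.
set -eu

REGISTRY_ROOT="/mnt/nvme-raid0/cargo-ci/registry"
CONTAINER_REGISTRY="/usr/local/cargo/registry"

# Print the host:container bind-mount spec for one PR, falling back to the
# branch name when no PR number is available.
# Usage: registry_mount <pr-number-or-empty> <ref-name>
registry_mount() {
  local pr_number="${1:-}" ref_name="${2:-}"
  local key="${pr_number:-$ref_name}"
  if [ -z "$key" ]; then
    echo "registry_mount: need a PR number or ref name" >&2
    return 1
  fi
  # Assumption: flatten '/' in branch names so the key is one directory level.
  key=$(printf '%s' "$key" | tr '/' '-')
  printf '%s/%s:%s\n' "$REGISTRY_ROOT" "$key" "$CONTAINER_REGISTRY"
}
```

A job could then pass the result to Docker as `-v "$(registry_mount "$pr" "$ref")"`, where `$pr` and `$ref` are whatever the workflow provides. Two PRs never resolve to the same host directory, which is the property the fix relies on.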
Cost
~200 MB per PR on first run (cargo re-downloads crates); same cost profile as the `target/` fix.
Follow-up (paiml/infra#77)
- forjar recipe to pre-create `/mnt/nvme-raid0/cargo-ci/` owner=noah:noah, 0755, on all 16 intel runners.
- Extend the `machines/intel/sovereign-ci/systemd/reaper/ci-reaper.sh:308` TTL sweep to include `/mnt/nvme-raid0/cargo-ci/registry/*/src`.
Docker auto-creates the mount-source leaf dir on first run, so this PR is landable standalone; infra PR only improves GC.
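The reaper extension might look roughly like the sketch below. `ci-reaper.sh` itself is not shown in this PR, so the function name `sweep_registry_src`, its arguments, and the use of directory mtime as the staleness signal are all assumptions. The TTL is left as a parameter because its value is not stated here.

```shell
#!/usr/bin/env bash
# Sketch only: sweep_registry_src is an illustrative name; the real logic
# lives in ci-reaper.sh and may differ.
set -eu

# Delete per-PR registry src/ trees whose mtime is older than ttl_days.
# Usage: sweep_registry_src <registry-root> <ttl-days>
sweep_registry_src() {
  local root="$1" ttl_days="$2"
  # Nothing to sweep if the root has not been created yet.
  [ -d "$root" ] || return 0
  # Depth 2 restricts matches to <root>/<pr#>/src, mirroring the glob
  # /mnt/nvme-raid0/cargo-ci/registry/*/src. Note that a directory's mtime
  # only changes when its direct entries change, so it is a coarse proxy
  # for "last used".
  find "$root" -mindepth 2 -maxdepth 2 -type d -name src \
    -mtime "+$ttl_days" -exec rm -rf -- {} +
}
```

Because each `src/` tree now belongs to a single PR, a sweep like this can only disturb jobs of that one PR rather than every runner at once. The pre-create step could be as simple as `install -d -m 0755 -o noah -g noah /mnt/nvme-raid0/cargo-ci` run as root, though the follow-up names a forjar recipe for it.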
Test plan
- `ci / security` and `workspace-test` on the new mount (proves the race is gone on the fresh per-PR path).
🤖 Generated with Claude Code