fix(ci): per-runner CARGO_HOME for security job (ANDON paiml/infra#77) #32
Merged
…nstall race

ANDON companion to aprender#1043 + paiml/infra#77. aprender#1043 addresses the workspace-test container's shared registry mount; THIS fix addresses a SEPARATE root cause in the bare-metal `security` job.

FIVE WHYS
─────────
1 `ci / security` fails in aprender#1043 (and every other sovereign-ci repo under concurrent load) with:

    warning: failed to write cache, path: /home/noah/.cargo/registry/index/.../.cache/ca/rg/<crate>, Permission denied (os error 13)
    error: couldn't read /home/noah/.cargo/registry/src/.../fnv-1.0.7/lib.rs: Permission denied
    error: could not compile `fnv`

2 EACCES reading lib.rs → the file is missing or owned by a different UID than the current process.
3 Who wrote it? → another concurrent runner on the same intel host.
4 Why same path? → `security` job runs `runs-on: [self-hosted, clean-room]` (bare-metal, NO container). HOME=/home/noah, so $HOME/.cargo/registry resolves to the same physical path across all 16 intel-clean-room-* runners. `cargo install cargo-audit` extracts to src/, writes index/, and leaves root/noah-mixed ownership depending on prior job state.
5 Why bare-metal not containerized? → the existing `test`/`lint`/`coverage` jobs above already use `container:` with per-runner bind-mounts for their targets; `security` was added later (rule 8, 2026-04-12) as a bare-metal job because cargo-audit install only takes ~15s, and containerizing it would have required also solving sibling-checkout PWD (the bug FIVE-WHYS'd at line 770). The race class wasn't modeled at that time; it became observable only after auto-merge-green-PRs policy drove concurrent CI load up in 2026-04.

ROOT CAUSE
──────────
Shared $HOME/.cargo/registry on a bare-metal multi-runner host. Cargo install is not concurrency-safe under `--root ~/.cargo` when multiple processes extract the same crate tarball: one writes .cargo-ok + src/, another deletes src/ via ci-reaper TTL sweep, a third's cache check trusts the stale state, then fails at compile.

FIX
───
Per-runner CARGO_HOME isolation. The `target/` directory hit the identical race class (intel-runner disk-race, task #134) and was fixed by per-PR isolation; the container `test`/`lint`/`coverage` jobs use per-runner CARGO_HOME; `security` simply hadn't gotten the treatment.

    export CARGO_HOME="/tmp/cargo-home-security-${{ runner.name }}"
    mkdir -p "$CARGO_HOME"
    cargo install cargo-audit --locked --root "$CARGO_HOME" || true
    echo "$CARGO_HOME/bin" >> "$GITHUB_PATH"

Each intel-clean-room-<N> runner now writes to /tmp/cargo-home-security-<N>, no cross-runner contention. /tmp is tmpfs on intel (reboot-cleared) so no reaper extension needed, unlike aprender#1043's persistent cargo-ci which requires paiml/infra#78 reaper coverage.

COST
────
~200 MB cargo-audit install per runner per cold boot. Warm-cache reruns are free (same CARGO_HOME path across job reruns until reboot). 16 runners × 200 MB = 3.2 GB total on /tmp at steady state, well within the intel host's tmpfs budget.

BLAST RADIUS
────────────
Every sovereign-ci caller (38 repos) gets the fix automatically on next workflow dispatch after this merges to main.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
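The per-runner path derivation in the fix above can be sketched as a small shell helper. This is a minimal illustration, assuming it runs as a step on a GitHub Actions runner where the default `RUNNER_NAME` environment variable is set. The helper name `cargo_home_for_runner`, the `local` fallback, and the character sanitizing are hypothetical additions for illustration and are not part of the merged workflow.

```shell
# Sketch: derive a per-runner CARGO_HOME so concurrent runners never share a path.
# ASSUMPTION: cargo_home_for_runner, the "local" fallback and the sanitizing step
# are illustrative; the merged workflow interpolates ${{ runner.name }} directly.

cargo_home_for_runner() (
  name="${1:-local}"
  # Replace anything outside [A-Za-z0-9._-] so the name is safe as one path component.
  safe="$(printf '%s' "$name" | tr -c 'A-Za-z0-9._-' '_')"
  printf '/tmp/cargo-home-security-%s\n' "$safe"
)

# RUNNER_NAME is provided by GitHub Actions; fall back to "local" elsewhere.
CARGO_HOME="$(cargo_home_for_runner "${RUNNER_NAME:-local}")"
export CARGO_HOME
mkdir -p "$CARGO_HOME"
echo "CARGO_HOME=$CARGO_HOME"
```

Because the runner name is the only input, two runners on the same host resolve to different directories, while reruns on the same runner reuse the warm directory until `/tmp` is cleared.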
noahgift added a commit to paiml/aprender that referenced this pull request on Apr 24, 2026
noahgift added a commit to paiml/aprender that referenced this pull request on Apr 24, 2026
… race (ANDON paiml/infra#77) (#1043)

* fix(ci): per-PR cargo registry to break intel-runner concurrent-write race (paiml/infra#77)

ANDON 2026-04-24: aprender 11-PR stack (#1031..#1042) all failing `ci / security` and `workspace-test` with:

    error: couldn't read /home/noah/.cargo/registry/src/<crate>/lib.rs: Permission denied (os error 13)

and the rustix-0.38 equivalent (E0432 unresolved import `libc`/`libc_errno` originating in the `syscall` macro, which the rustix build.rs regenerates from src/ files; missing src/ → macro can't find libc crate → cascading errors).

FIVE WHYS
─────────
1 `ci / security` fails: `cargo install cargo-audit --locked` hits EACCES reading `fnv-1.0.7/lib.rs`.
2 EACCES: the file is missing OR owned by root (docker container creates extractions as root on the bind-mounted host registry).
3 Concurrent writers: 16 self-hosted `intel-clean-room-*` runners bind-mount the SAME /home/noah/.cargo/registry. Cargo extractions, the ci-reaper TTL sweep, and cross-container chown cycles all touch identical paths.
4 Shared by design: ci.yml:49 was authored for throughput. Re-downloading crates per job is ~200MB, so the host registry was shared across all runners. Race class not modeled.
5 Precedent already exists: target/ hit the identical race under concurrent PRs (task #134) and was fixed by per-PR isolation on /mnt/nvme-raid0/targets/aprender-ci/<pr#>. The registry simply never got the same treatment.

ROOT CAUSE
──────────
Shared mutable bind mount + concurrent multi-runner write access ≈ guaranteed race. The existing band-aid (PR #1025 "self-heal cargo registry cache", cargo-ok + Cargo.toml marker check) only runs inside `ci / security` and itself races with concurrent jobs that have already passed the cache check.

FIX (this PR)
─────────────
Mirror the target-dir pattern from ci.yml:55 for the cargo registry. Each PR (or branch) gets its own registry under /mnt/nvme-raid0/cargo-ci/registry/<pr#>. Docker auto-creates the leaf dir on first mount; the ci-reaper TTL sweep (ci-reaper.sh:308) needs a companion infra update (paiml/infra#77) to include the new /mnt path.

- Removes: /home/noah/.cargo/registry:/usr/local/cargo/registry
- Adds: /mnt/nvme-raid0/cargo-ci/registry/${pr#|ref_name}:/usr/local/cargo/registry

Cost: ~200MB per PR on first run (cargo re-downloads crates). Same cost profile as the target/ isolation fix, which the fleet already absorbed. Once cargo-ci/registry/<pr#> warms on run 1, run 2+ hit the cache.

FOLLOW-UP
─────────
paiml/infra#77 tracks:
- forjar recipe to pre-create /mnt/nvme-raid0/cargo-ci/ owner=noah:noah
- reaper extension: GC /mnt/nvme-raid0/cargo-ci/registry/<pr#>/src with same TTL
- once infra lands, drop the ANDON comment above

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* ci: trigger fresh run to pick up paiml/.github#32 security-job CARGO_HOME fix

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
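The `${pr#|ref_name}` selection in the bind mount above (PR number when there is one, branch name otherwise) can be sketched as follows. This is an illustration only: the helper name `registry_dir_for` is hypothetical, and flattening `/` in branch names is an added assumption so that each branch maps to a single leaf directory. The commit message does not say how ci.yml handles such names.

```shell
# Sketch: choose the host-side registry directory for a CI run.
# ASSUMPTION: registry_dir_for and the '/' flattening are illustrative,
# not taken from the actual ci.yml.

# registry_dir_for PR_NUMBER REF_NAME
#   PR_NUMBER may be empty (push to a branch); REF_NAME is then used instead.
registry_dir_for() (
  pr_number="$1"
  ref_name="$2"
  base="/mnt/nvme-raid0/cargo-ci/registry"
  if [ -n "$pr_number" ]; then
    printf '%s/%s\n' "$base" "$pr_number"
  else
    # Branch names may contain '/', which would nest directories; flatten them.
    printf '%s/%s\n' "$base" "$(printf '%s' "$ref_name" | tr '/' '-')"
  fi
)

# Example: a pull-request run and a branch push resolve to different leaf dirs.
registry_dir_for 1043 "main"
registry_dir_for "" "release/v1"
```

The resulting path is what would appear on the left-hand side of the `:/usr/local/cargo/registry` mount, so runs for different PRs never write into the same host directory.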
noahgift added a commit to paiml/forjar that referenced this pull request on Apr 24, 2026
`cargo deny check` and `cargo audit` are distinct tools reading distinct config sources. `cargo deny` reads `deny.toml` [advisories.ignore]. `cargo audit` 0.22 does NOT read config files, only CLI --ignore flags.

forjar's audit.yml ran `cargo audit` bare. After RUSTSEC-2026-0097 (rustls-webpki) and RUSTSEC-2026-0104 (rustls-webpki CRL panic) published against rustls-webpki 0.103.12 (both already exempted in deny.toml), `cargo audit` correctly exited non-zero: the exemptions never reached it. CI green on deny, red on audit, despite the same advisory IDs being on the ignore list.

Fix mirrors the aprender sovereign-ci.yml pattern:
- New `.cargo/audit.toml` with the cargo-audit-native schema `[advisories] ignore = [...]`. Single source of truth for cargo-audit, kept in sync with deny.toml by convention (documented in file header).
- audit.yml parses `.cargo/audit.toml` for RUSTSEC IDs at run time and builds `--ignore <id>` CLI flags, matching how paiml/.github#32 solved the same class upstream.

Covers both -0097 and -0104 (same rustls-webpki transitive class, no safe upgrade before upstream 0.104).
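The "parse `.cargo/audit.toml` for RUSTSEC IDs and build `--ignore <id>` flags" step can be sketched like this. It is a rough illustration, not the actual audit.yml step: the helper name `ignore_flags_from` and the sample file contents are assumptions, and the text-based extraction (strip `#` comments, then match `RUSTSEC-YYYY-NNNN`) stands in for whatever parsing the real workflow uses.

```shell
# Sketch: turn the advisory IDs listed in an audit.toml into cargo-audit CLI flags.
# ASSUMPTION: ignore_flags_from and the sample file below are illustrative only.

# ignore_flags_from FILE
#   Prints "--ignore <id>" for every distinct RUSTSEC ID in FILE, on one line.
ignore_flags_from() (
  sed 's/#.*//' "$1" \
    | grep -oE 'RUSTSEC-[0-9]{4}-[0-9]{4}' \
    | sort -u \
    | sed 's/^/--ignore /' \
    | paste -sd ' ' -
)

# Build a throwaway sample file in the `[advisories] ignore = [...]` shape.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
# Keep in sync with deny.toml
[advisories]
ignore = [
    "RUSTSEC-2026-0097", # rustls-webpki
    "RUSTSEC-2026-0104", # rustls-webpki CRL panic
]
EOF

flags="$(ignore_flags_from "$sample")"
# Print the command instead of running it, so the sketch works without cargo-audit.
echo "cargo audit $flags"
rm -f "$sample"
```

Stripping comments before matching keeps an ID that is only mentioned in a comment from being silently ignored, and `sort -u` makes the flag list stable across runs.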
noahgift added a commit to paiml/forjar that referenced this pull request on Apr 24, 2026
#119)

* fix: propagate sidecar errors + reseal recovery subcommand (Refs #118)

Two related defects in state integrity handling.

DEFECT 1 (src/core/state/mod.rs:57,108): silent sidecar error

    let _ = integrity::write_b3_sidecar(&path);

After atomic rename of state.lock.yaml, the sidecar's Result was discarded. Any failure (disk full, permission, signal, reaper race) left lock.yaml (new) + .b3 (stale); next apply hard-failed with "integrity check failed". Toyota Way violation: no signal at moment of corruption.

Fix: propagate with `?`; message points user at `forjar reseal`.

DEFECT 2: no recovery for pre-existing drift

Users with drift from OLD forjar versions or git checkout had no recovery short of `forjar apply --yes`. Adds `reseal` subcommand that rewrites sidecars from current lock contents without converging infrastructure:

    forjar reseal --all        # reseal every state/*/lock.yaml
    forjar reseal --file <path>
    forjar reseal --machine <name>
    forjar reseal --all --dry-run

Safety: each target YAML-parsed before sidecar rewrite; corrupt lock cannot be blessed with a fresh sidecar.

FILES
- src/core/state/mod.rs: `?` propagation in save_lock + save_global_lock.
- src/cli/reseal.rs (NEW, TDG 97.5 A-): cmd_reseal + 3 small helpers.
- src/cli/{mod,dispatch_misc}.rs + src/cli/commands/{mod,state_args}.rs: Commands::Reseal wiring.

TEST
Smoke-tested against paiml/infra with 13/24 lock files mismatched. Apply correctly rejected; `reseal --all` resealed 23 files 0 failures; next apply passed the integrity gate.

Closes #118.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore(deny): add RUSTSEC-2026-0104 exemption for rustls-webpki CRL panic (Refs #118)

RUSTSEC-2026-0104 was published 2026-04-23: reachable panic in rustls-webpki 0.103.12's CRL parsing. Transitive via rustls → rustls-native-certs; upstream fix in rustls-webpki 0.104 but rustls hasn't bumped yet.

aprender's `.cargo/audit.toml` already ignores this (observed in aprender CI audit-cmd `--ignore` list 2026-04-24). Syncing forjar's deny.toml to match so forjar CI (`cargo deny check`) doesn't block on the same class across repos.

This unblocks the audit gate for #119 (integrity atomic-write fix). Fleet follow-up: paiml/infra clean-room template needs the same exemption (filed separately).

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(audit): cargo-audit --ignore sync with deny.toml (Refs #118)

`cargo deny check` and `cargo audit` are distinct tools reading distinct config sources. `cargo deny` reads `deny.toml` [advisories.ignore]. `cargo audit` 0.22 does NOT read config files, only CLI --ignore flags.

forjar's audit.yml ran `cargo audit` bare. After RUSTSEC-2026-0097 (rustls-webpki) and RUSTSEC-2026-0104 (rustls-webpki CRL panic) published against rustls-webpki 0.103.12 (both already exempted in deny.toml), `cargo audit` correctly exited non-zero: the exemptions never reached it. CI green on deny, red on audit, despite the same advisory IDs being on the ignore list.

Fix mirrors the aprender sovereign-ci.yml pattern:
- New `.cargo/audit.toml` with the cargo-audit-native schema `[advisories] ignore = [...]`. Single source of truth for cargo-audit, kept in sync with deny.toml by convention (documented in file header).
- audit.yml parses `.cargo/audit.toml` for RUSTSEC IDs at run time and builds `--ignore <id>` CLI flags, matching how paiml/.github#32 solved the same class upstream.

Covers both -0097 and -0104 (same rustls-webpki transitive class, no safe upgrade before upstream 0.104).

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Companion fix to aprender#1043 + paiml/infra#77/#78. aprender's ci.yml PR fixes the containerized `workspace-test` registry-mount race; THIS fix addresses the SEPARATE bare-metal `security` job race that's blocking every sovereign-ci consumer under concurrent load.
Problem
Every PR across sovereign-ci repos is hitting `ci / security` failures: `Permission denied (os error 13)` on files under `/home/noah/.cargo/registry/`.
`security` job at sovereign-ci.yml:649 is bare-metal (`runs-on: [self-hosted, clean-room]`, no `container:`). HOME=/home/noah resolves to the same physical path across all 16 intel-clean-room-* runners. Concurrent `cargo install cargo-audit` extracts into the same `~/.cargo/registry/src/`, and the ci-reaper TTL sweep races with it.
Root cause (five-whys; full write-up in commit message)
`ci / security` fails → `cargo install cargo-audit` → EACCES on `fnv-1.0.7/lib.rs`.
Precedent: `test`/`lint`/`coverage` jobs above (lines 87, 241, 376) all run in `container:` with per-runner mounts. `security` was the only hold-out.
Fix
Per-runner CARGO_HOME isolation: the job exports `CARGO_HOME="/tmp/cargo-home-security-${{ runner.name }}"`, installs cargo-audit with `--locked --root "$CARGO_HOME"`, and appends `$CARGO_HOME/bin` to `$GITHUB_PATH`.
Each intel-clean-room-<N> runner writes to `/tmp/cargo-home-security-<N>`. No cross-runner contention. `/tmp` is tmpfs (reboot-cleared), so no reaper extension is needed (unlike aprender#1043's persistent `/mnt/nvme-raid0/cargo-ci/`).
Cost
~200 MB cargo-audit install per runner per cold boot; 16 runners × 200 MB = 3.2 GB on `/tmp`, well within intel's tmpfs budget.
Blast radius
Every sovereign-ci caller (38 repos) gets the fix automatically on next workflow dispatch after merge.
Test plan
`ci / security` flips from FAILURE → SUCCESS on next run.
… `security` job.
🤖 Generated with Claude Code