fix(ci): per-runner CARGO_HOME for security job (ANDON paiml/infra#77) #32

Merged
noahgift merged 1 commit into main from fix/security-per-runner-cargo-home
Apr 24, 2026

@noahgift

Summary

Companion fix to aprender#1043 + paiml/infra#77/#78. aprender's ci.yml PR fixes the containerized workspace-test registry-mount race; THIS fix addresses the SEPARATE bare-metal security job race that's blocking every sovereign-ci consumer under concurrent load.

Problem

Every PR across sovereign-ci repos is hitting ci / security failures:

warning: failed to write cache, path: /home/noah/.cargo/registry/index/.../.cache/ca/rg/<crate>, Permission denied (os error 13)
error: couldn't read /home/noah/.cargo/registry/src/.../fnv-1.0.7/lib.rs: Permission denied (os error 13)
error: could not compile `fnv` (lib) due to 1 previous error
error: failed to compile `cargo-audit v0.22.1`

The security job at sovereign-ci.yml:649 is bare-metal (runs-on: [self-hosted, clean-room], no container:). HOME=/home/noah resolves to the same physical path on all 16 intel-clean-room-* runners, so concurrent cargo install cargo-audit runs extract into the same ~/.cargo/registry/src/ while the ci-reaper TTL sweep deletes from it.

Root cause (five-whys — full write-up in commit message)

  1. ci / security fails → cargo install cargo-audit → EACCES on fnv-1.0.7/lib.rs.
  2. EACCES → file missing or UID mismatch.
  3. Why? → another concurrent runner wrote/deleted the same path.
  4. Why same path? → bare-metal security job, no container isolation, shared $HOME/.cargo/registry.
  5. Why bare-metal? → rule 8 (2026-04-12) added security as a quick bare-metal check; race class was not modeled.

Precedent: test/lint/coverage jobs above (lines 87, 241, 376) all run in container: with per-runner mounts. security was the only hold-out.

Fix

Per-runner CARGO_HOME isolation:

- name: Install cargo-audit (per-runner CARGO_HOME)
  run: |
    export CARGO_HOME="/tmp/cargo-home-security-${{ runner.name }}"
    mkdir -p "$CARGO_HOME"
    echo "CARGO_HOME=$CARGO_HOME" >> "$GITHUB_ENV"
    cargo install cargo-audit --locked --root "$CARGO_HOME" || true
    echo "$CARGO_HOME/bin" >> "$GITHUB_PATH"

Each intel-clean-room-<N> runner writes to /tmp/cargo-home-security-<N>, so there is no cross-runner contention. /tmp is tmpfs (reboot-cleared), so no reaper extension is needed (unlike aprender#1043's persistent /mnt/nvme-raid0/cargo-ci/).
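The path derivation can be tried outside a workflow. The sketch below is illustrative only: the helper name `per_runner_cargo_home`, the example runner names, and the character sanitising are assumptions, not part of the merged change. It shows that distinct runner names map to disjoint directories.

```shell
# Hypothetical helper: build a per-runner CARGO_HOME path.
#   $1 = runner name (what ${{ runner.name }} would expand to in the workflow)
#   $2 = base directory (defaults to /tmp, as in the fix above)
per_runner_cargo_home() {
  runner_name="$1"
  base_dir="${2:-/tmp}"
  # Assumption: replace every character outside [A-Za-z0-9._-] with "_"
  # so the name is safe to use as a single path segment.
  safe_name=$(printf '%s' "$runner_name" | tr -c 'A-Za-z0-9._-' '_')
  printf '%s/cargo-home-security-%s\n' "$base_dir" "$safe_name"
}

# Two example runner names (hypothetical) yield two different directories.
home_a=$(per_runner_cargo_home "intel-clean-room-1")
home_b=$(per_runner_cargo_home "intel-clean-room-2")
echo "$home_a"
echo "$home_b"
```

Because the directory is keyed on the runner name alone, reruns on the same runner reuse the same path, which is what makes warm-cache reruns cheap.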

Cost

  • ~200 MB cargo-audit install per runner per cold boot.
  • Warm-cache reruns: free (same CARGO_HOME until reboot).
  • Steady state: 16 runners × 200 MB = 3.2 GB on /tmp — well within intel's tmpfs budget.
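The steady-state figure is simple arithmetic and can be rechecked if the runner count or install size changes. This is a minimal sketch: the function name `fleet_footprint_mb` is hypothetical, and the 200 MB input is the approximate figure quoted above, not a measurement.

```shell
# Hypothetical helper: total footprint in MB for N runners at M MB each.
fleet_footprint_mb() {
  echo $(( $1 * $2 ))
}

runners=16
mb_per_runner=200   # approximate install size quoted in the cost estimate
total_mb=$(fleet_footprint_mb "$runners" "$mb_per_runner")

# Print MB and GB (1 GB = 1000 MB, matching the estimate), one decimal place.
printf '%d runners x %d MB = %d MB (%d.%d GB)\n' \
  "$runners" "$mb_per_runner" "$total_mb" \
  $(( total_mb / 1000 )) $(( (total_mb % 1000) / 100 ))
```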

Blast radius

Every sovereign-ci caller (38 repos) gets the fix automatically on next workflow dispatch after merge.

Test plan

  • This PR's own CI passes (uses the new per-runner CARGO_HOME).
  • After merge: aprender#1043's ci / security flips from FAILURE → SUCCESS on next run.
  • 24h soak across multiple repos — no EACCES recurrence on security job.
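One way to check the soak criterion is to count the EACCES signature in collected job logs. The sketch below assumes the logs have already been downloaded to plain-text files; the function name `count_eacces` and the sample log are made up for illustration.

```shell
# Hypothetical helper: count log lines carrying the EACCES signature.
count_eacces() {
  # grep -c prints the number of matching lines; it exits non-zero when the
  # count is 0, so "|| true" keeps a calling script from aborting under set -e.
  grep -c 'Permission denied (os error 13)' "$1" || true
}

# Sample log with two failing lines and one unrelated line.
log_file=$(mktemp)
cat > "$log_file" <<'EOF'
warning: failed to write cache, path: /home/noah/.cargo/registry/index/.../.cache/ca/rg/<crate>, Permission denied (os error 13)
error: couldn't read /home/noah/.cargo/registry/src/.../fnv-1.0.7/lib.rs: Permission denied (os error 13)
error: could not compile `fnv` (lib) due to 1 previous error
EOF

echo "EACCES lines: $(count_eacces "$log_file")"
```

A count of 0 across the soak window would satisfy the "no EACCES recurrence" item.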

🤖 Generated with Claude Code

…nstall race

ANDON companion to aprender#1043 + paiml/infra#77. aprender#1043 addresses the
workspace-test container's shared registry mount; THIS fix addresses a
SEPARATE root cause in the bare-metal `security` job.

FIVE WHYS
─────────
 1 `ci / security` fails in aprender#1043 (and every other sovereign-ci
   repo under concurrent load) with:
     warning: failed to write cache, path: /home/noah/.cargo/registry/index/.../.cache/ca/rg/<crate>, Permission denied (os error 13)
     error: couldn't read /home/noah/.cargo/registry/src/.../fnv-1.0.7/lib.rs: Permission denied
     error: could not compile `fnv`
 2 EACCES reading lib.rs → the file is missing or owned by a different
   UID than the current process.
 3 Who wrote it? → another concurrent runner on the same intel host.
 4 Why same path? → `security` job runs `runs-on: [self-hosted, clean-room]`
   (bare-metal, NO container). HOME=/home/noah, so $HOME/.cargo/registry
   resolves to the same physical path across all 16 intel-clean-room-*
   runners. `cargo install cargo-audit` extracts to src/, writes index/,
   and leaves root/noah-mixed ownership depending on prior job state.
 5 Why bare-metal not containerized? → the existing `test`/`lint`/`coverage`
   jobs above already use `container:` with per-runner bind-mounts for
   their targets; `security` was added later (rule 8, 2026-04-12) as a
   bare-metal job because cargo-audit install only takes ~15s, and
   containerizing it would have required also solving sibling-checkout
   PWD (the bug FIVE-WHYS'd at line 770). The race class wasn't modeled
   at that time — it became observable only after auto-merge-green-PRs
   policy drove concurrent CI load up in 2026-04.

ROOT CAUSE
──────────
Shared $HOME/.cargo/registry on a bare-metal multi-runner host. Cargo
install is not concurrency-safe under `--root ~/.cargo` when multiple
processes extract the same crate tarball: one writes .cargo-ok + src/,
another deletes src/ via ci-reaper TTL sweep, a third's cache check
trusts the stale state, then fails at compile.
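The mixed-ownership symptom can be spotted before a build by listing entries under a registry directory that are not owned by the expected user. This is a diagnostic sketch only: the function name `foreign_owned` is hypothetical, and it is not the self-heal check from PR #1025.

```shell
# Hypothetical diagnostic: list entries under a directory that are NOT owned
# by the expected user.
#   $1 = directory to scan
#   $2 = expected owner as a numeric UID (defaults to the current user)
foreign_owned() {
  dir="$1"
  owner="${2:-$(id -u)}"
  # find accepts a numeric user ID for -user; "!" negates the match.
  find "$dir" ! -user "$owner" -print
}

# Demo on a scratch directory: everything is owned by the current user,
# so the scan prints nothing.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/src/fnv-1.0.7"
touch "$demo_dir/src/fnv-1.0.7/lib.rs"
foreign_owned "$demo_dir"
```

On a shared registry, any output would point at files written by another UID, such as root-owned extractions.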

FIX
───
Per-runner CARGO_HOME isolation. The `target/` directory hit the
identical race class (intel-runner disk-race, task #134) and was fixed
by per-PR isolation; the container `test`/`lint`/`coverage` jobs use
per-runner CARGO_HOME; `security` simply hadn't gotten the treatment.

  export CARGO_HOME="/tmp/cargo-home-security-${{ runner.name }}"
  mkdir -p "$CARGO_HOME"
  cargo install cargo-audit --locked --root "$CARGO_HOME" || true
  echo "$CARGO_HOME/bin" >> "$GITHUB_PATH"

Each intel-clean-room-<N> runner now writes to /tmp/cargo-home-security-<N>,
no cross-runner contention. /tmp is tmpfs on intel (reboot-cleared) so
no reaper extension needed — unlike aprender#1043's persistent cargo-ci
which requires paiml/infra#78 reaper coverage.

COST
────
~200 MB cargo-audit install per runner per cold boot. Warm-cache reruns
are free (same CARGO_HOME path across job reruns until reboot). 16
runners × 200 MB = 3.2 GB total on /tmp at steady state — well within
the intel host's tmpfs budget.

BLAST RADIUS
────────────
Every sovereign-ci caller (38 repos) gets the fix automatically on next
workflow dispatch after this merges to main.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift merged commit 68961a4 into main Apr 24, 2026
2 checks passed
@noahgift noahgift deleted the fix/security-per-runner-cargo-home branch April 24, 2026 08:40
noahgift added a commit to paiml/aprender that referenced this pull request Apr 24, 2026
… race (ANDON paiml/infra#77) (#1043)

* fix(ci): per-PR cargo registry to break intel-runner concurrent-write race (paiml/infra#77)

ANDON 2026-04-24 — aprender 11-PR stack (#1031..#1042) all failing `ci / security`
and `workspace-test` with:

  error: couldn't read /home/noah/.cargo/registry/src/<crate>/lib.rs:
         Permission denied (os error 13)

and the rustix-0.38 equivalent (E0432 unresolved import `libc`/`libc_errno`
originating in the `syscall` macro, which the rustix build.rs regenerates from
src/ files — missing src/ → macro can't find libc crate → cascading errors).

FIVE WHYS
─────────
 1 `ci / security` fails: `cargo install cargo-audit --locked` hits EACCES
   reading `fnv-1.0.7/lib.rs`.
 2 EACCES: the file is missing OR owned by root (docker container creates
   extractions as root on the bind-mounted host registry).
 3 Concurrent writers: 16 self-hosted `intel-clean-room-*` runners bind-mount
   the SAME /home/noah/.cargo/registry — cargo extractions, the ci-reaper
   TTL sweep, and cross-container chown cycles all touch identical paths.
 4 Shared by design: ci.yml:49 was authored for throughput — re-downloading
   crates per job is ~200MB, so the host registry was shared across all
   runners. Race class not modeled.
 5 Precedent already exists: target/ hit the identical race under concurrent
   PRs (task #134) and was fixed by per-PR isolation on
   /mnt/nvme-raid0/targets/aprender-ci/<pr#>. The registry simply never got
   the same treatment.

ROOT CAUSE
──────────
Shared mutable bind mount + concurrent multi-runner write access ≈ guaranteed
race. The existing band-aid (PR #1025 "self-heal cargo registry cache",
cargo-ok + Cargo.toml marker check) only runs inside `ci / security` and
itself races with concurrent jobs that have already passed the cache check.

FIX (this PR)
─────────────
Mirror the target-dir pattern from ci.yml:55 for the cargo registry. Each
PR (or branch) gets its own registry under /mnt/nvme-raid0/cargo-ci/registry/<pr#>.
Docker auto-creates the leaf dir on first mount; the ci-reaper TTL sweep
(ci-reaper.sh:308) needs a companion infra update (paiml/infra#77) to include
the new /mnt path.

 - Removes: /home/noah/.cargo/registry:/usr/local/cargo/registry
 - Adds:    /mnt/nvme-raid0/cargo-ci/registry/${pr#|ref_name}:/usr/local/cargo/registry

Cost: ~200MB per PR on first run (cargo re-downloads crates). Same cost
profile as the target/ isolation fix, which the fleet already absorbed.
Once cargo-ci/registry/<pr#> warms on run 1, run 2+ hit the cache.
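The `${pr#|ref_name}` placeholder above reads as "use the PR number when there is one, otherwise the ref name". A minimal sketch of that selection follows. The function name `registry_dir` and the replacement of `/` with `-` in ref names are assumptions; the actual workflow expression is not shown here.

```shell
# Hypothetical helper: pick the per-PR (or per-branch) registry directory.
#   $1 = PR number (empty string when the run is not for a PR)
#   $2 = ref name, used as the fallback key
registry_dir() {
  pr_number="$1"
  ref_name="$2"
  base="/mnt/nvme-raid0/cargo-ci/registry"
  if [ -n "$pr_number" ]; then
    key="$pr_number"
  else
    # Assumption: flatten slashes so a ref like "feature/x" stays one path segment.
    key=$(printf '%s' "$ref_name" | tr '/' '-')
  fi
  printf '%s/%s\n' "$base" "$key"
}

registry_dir "1043" "main"       # PR run: keyed by PR number
registry_dir "" "feature/x"      # branch run: keyed by flattened ref name
```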

FOLLOW-UP
─────────
paiml/infra#77 tracks:
  - forjar recipe to pre-create /mnt/nvme-raid0/cargo-ci/ owner=noah:noah
  - reaper extension: GC /mnt/nvme-raid0/cargo-ci/registry/<pr#>/src with same TTL
  - once infra lands, drop the ANDON comment above

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* ci: trigger fresh run to pick up paiml/.github#32 security-job CARGO_HOME fix

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit to paiml/forjar that referenced this pull request Apr 24, 2026
`cargo deny check` and `cargo audit` are distinct tools reading distinct
config sources. `cargo deny` reads `deny.toml` [advisories.ignore].
`cargo audit` 0.22 does NOT read config files — only CLI --ignore flags.

forjar's audit.yml ran `cargo audit` bare. After RUSTSEC-2026-0097
(rustls-webpki) and RUSTSEC-2026-0104 (rustls-webpki CRL panic) published
against rustls-webpki 0.103.12 (both already exempted in deny.toml),
`cargo audit` correctly exited non-zero — the exemptions never reached
it. CI green on deny, red on audit, despite the same advisory IDs being
on the ignore list.

Fix mirrors the aprender sovereign-ci.yml pattern:
- New `.cargo/audit.toml` with the cargo-audit-native schema
  `[advisories] ignore = [...]`. Single source of truth for cargo-audit,
  kept in sync with deny.toml by convention (documented in file header).
- audit.yml parses `.cargo/audit.toml` for RUSTSEC IDs at run time and
  builds `--ignore <id>` CLI flags, matching how paiml/.github#32 solved
  the same class upstream.

Covers both -0097 and -0104 (same rustls-webpki transitive class, no
safe upgrade before upstream 0.104).
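The "parse IDs, build `--ignore` flags" step described above could look roughly like the following. It is a sketch under assumptions: the function name `ignore_flags` and the sample file are illustrative, and the real audit.yml parsing is not shown in this thread. It does a plain text match, so an ID that appears only in a comment would also be picked up.

```shell
# Hypothetical helper: turn RUSTSEC IDs found in a file into "--ignore <id>" flags.
ignore_flags() {
  # Text match on RUSTSEC-YYYY-NNNN, de-duplicated, joined onto one line.
  grep -oE 'RUSTSEC-[0-9]{4}-[0-9]{4}' "$1" \
    | sort -u \
    | sed 's/^/--ignore /' \
    | paste -s -d ' ' -
}

# Sample file in the [advisories] ignore = [...] shape described above.
audit_toml=$(mktemp)
cat > "$audit_toml" <<'EOF'
[advisories]
ignore = [
  "RUSTSEC-2026-0097",
  "RUSTSEC-2026-0104",
]
EOF

flags=$(ignore_flags "$audit_toml")
# Print the command instead of running it, so the sketch works without cargo-audit.
echo "cargo audit $flags"
```

Left unquoted in a real invocation (`cargo audit $flags`), the string splits into separate `--ignore <id>` arguments.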
noahgift added a commit to paiml/forjar that referenced this pull request Apr 24, 2026
#119)

* fix: propagate sidecar errors + reseal recovery subcommand (Refs #118)

Two related defects in state integrity handling.

DEFECT 1 — src/core/state/mod.rs:57,108 — silent sidecar error

    let _ = integrity::write_b3_sidecar(&path);

After atomic rename of state.lock.yaml, the sidecar's Result was discarded.
Any failure (disk full, permission, signal, reaper race) left lock.yaml
(new) + .b3 (stale); next apply hard-failed with "integrity check failed".
Toyota Way violation: no signal at moment of corruption.

Fix: propagate with `?`; message points user at `forjar reseal`.

DEFECT 2 — no recovery for pre-existing drift

Users with drift from OLD forjar versions or git checkout had no recovery
short of `forjar apply --yes`. Adds `reseal` subcommand that rewrites
sidecars from current lock contents without converging infrastructure:

    forjar reseal --all           # reseal every state/*/lock.yaml
    forjar reseal --file <path>
    forjar reseal --machine <name>
    forjar reseal --all --dry-run

Safety: each target YAML-parsed before sidecar rewrite — corrupt lock
cannot be blessed with a fresh sidecar.

FILES

- src/core/state/mod.rs — `?` propagation in save_lock + save_global_lock.
- src/cli/reseal.rs (NEW, TDG 97.5 A-) — cmd_reseal + 3 small helpers.
- src/cli/{mod,dispatch_misc}.rs + src/cli/commands/{mod,state_args}.rs —
  Commands::Reseal wiring.

TEST

Smoke-tested against paiml/infra with 13/24 lock files mismatched. Apply
correctly rejected; `reseal --all` resealed 23 files 0 failures; next
apply passed the integrity gate.

Closes #118.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore(deny): add RUSTSEC-2026-0104 exemption for rustls-webpki CRL panic (Refs #118)

RUSTSEC-2026-0104 was published 2026-04-23 — reachable panic in
rustls-webpki 0.103.12's CRL parsing. Transitive via rustls →
rustls-native-certs; upstream fix in rustls-webpki 0.104 but rustls
hasn't bumped yet. aprender's `.cargo/audit.toml` already ignores this
(observed in aprender CI audit-cmd `--ignore` list 2026-04-24). Syncing
forjar's deny.toml to match so forjar CI (`cargo deny check`) doesn't
block on the same class across repos.

This unblocks the audit gate for #119 (integrity atomic-write fix).

Fleet follow-up: paiml/infra clean-room template needs the same
exemption — filed separately.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(audit): cargo-audit --ignore sync with deny.toml (Refs #118)

`cargo deny check` and `cargo audit` are distinct tools reading distinct
config sources. `cargo deny` reads `deny.toml` [advisories.ignore].
`cargo audit` 0.22 does NOT read config files — only CLI --ignore flags.

forjar's audit.yml ran `cargo audit` bare. After RUSTSEC-2026-0097
(rustls-webpki) and RUSTSEC-2026-0104 (rustls-webpki CRL panic) published
against rustls-webpki 0.103.12 (both already exempted in deny.toml),
`cargo audit` correctly exited non-zero — the exemptions never reached
it. CI green on deny, red on audit, despite the same advisory IDs being
on the ignore list.

Fix mirrors the aprender sovereign-ci.yml pattern:
- New `.cargo/audit.toml` with the cargo-audit-native schema
  `[advisories] ignore = [...]`. Single source of truth for cargo-audit,
  kept in sync with deny.toml by convention (documented in file header).
- audit.yml parses `.cargo/audit.toml` for RUSTSEC IDs at run time and
  builds `--ignore <id>` CLI flags, matching how paiml/.github#32 solved
  the same class upstream.

Covers both -0097 and -0104 (same rustls-webpki transitive class, no
safe upgrade before upstream 0.104).

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>