fix(ci)!: per-run target-dir — P0 STOP THE LINE on cancel-corrupt-state race by noahgift · Pull Request #1693 · paiml/aprender

noahgift · 2026-05-15T11:37:07Z

Andon

STOP THE LINE. Recurring `couldn't create a temp dir: /workspace/target/debug/deps/...`
flakes hitting 4-5 PRs simultaneously, blocking the entire merge queue.

Root cause (five-whys)

1. CI fails with `couldn't create a temp dir: /workspace/target/debug/deps/...`
2. `deps/` was unlinked mid-build, then cargo tried to create a tempfile inside it
3. A concurrent process on the same host path is racing this run's cargo
4. `concurrency.cancel-in-progress: true` (`.github/workflows/ci.yml:22`)
   cancels the old run when `update-branch` pushes a new commit, but cancellation
   is signal-based with a 30s SIGTERM→SIGKILL window. During those 30s the old
   cargo is still writing/cleaning files in `/workspace/target/...`
5. The mount was per-PR (`aprender-ci/${PR_OR_REF}`), so the new run mounted
   THE SAME host directory as the dying old run: they shared
   `/workspace/target/debug/deps/` on the host
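
The chain above can be reproduced in miniature. The following is a minimal sketch, not the actual CI code: `host_target` is a hypothetical stand-in for the shared host mount, the `rmtree` call plays the dying old run, and `mkdtemp` plays the new run's cargo asking for a temp dir under `deps/`.

```python
import shutil
import tempfile
from pathlib import Path


def simulate_cancel_race() -> str:
    """Show what happens when one actor removes deps/ under another.

    All names here are illustrative; nothing is taken from ci.yml.
    """
    # Stand-in for the host directory that both runs had mounted.
    host_target = Path(tempfile.mkdtemp(prefix="aprender-ci-"))
    deps = host_target / "debug" / "deps"
    deps.mkdir(parents=True)
    try:
        # Old run, still inside its SIGTERM window, cleans up deps/.
        shutil.rmtree(deps)
        # New run, mounted on the same host path, now wants a temp dir there.
        try:
            tempfile.mkdtemp(dir=deps)
        except FileNotFoundError:
            return "couldn't create a temp dir (os error 2)"
        return "ok"
    finally:
        shutil.rmtree(host_target, ignore_errors=True)


if __name__ == "__main__":
    print(simulate_cancel_race())
```

The sketch is sequential, so it always hits the failure; in CI the same outcome depends on timing inside the 30s window, which is why it showed up as a flake.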

What the prior fix solved vs. what it introduced

The 2026-04-23 fleet fix (paiml/.github#31) moved sovereign-ci from a shared
runner-wide target dir to per-PR isolation, solving CROSS-PR same-runner
collisions. That worked. But the per-PR isolation INTRODUCED this
cancel-corrupt-state collision for SAME-PR sequential runs — every
`update-branch` or new push triggers a new run that mounts the SAME host path.

Fix

Bump the mount path one level deeper:

```
aprender-ci/${PR_OR_REF} → aprender-ci/${PR_OR_REF}/run-${GITHUB_RUN_ID}
```

Now every CI run gets its own isolated target dir; no two cargo invocations
ever share a host directory.
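
As an illustration of why the extra path component is enough, here is a small sketch that builds the host-side directory for a run. The function name `per_run_mount`, the base path `/srv/targets`, and the run IDs are hypothetical; only the `aprender-ci/<PR_OR_REF>/run-<GITHUB_RUN_ID>` layout comes from this PR.

```python
from pathlib import PurePosixPath


def per_run_mount(base: str, pr_or_ref: str, run_id: str) -> str:
    """Build the host-side target dir for one CI run.

    Mirrors the layout aprender-ci/${PR_OR_REF}/run-${GITHUB_RUN_ID}.
    `base` is an assumed host root, not a value from the workflow file.
    """
    if not pr_or_ref or not run_id:
        raise ValueError("pr_or_ref and run_id must be non-empty")
    return str(PurePosixPath(base) / "aprender-ci" / pr_or_ref / f"run-{run_id}")


# Two runs of the same PR (hypothetical run IDs) resolve to different dirs,
# so a cancelled run and its replacement never touch the same files.
old_run = per_run_mount("/srv/targets", "1693", "1001")
new_run = per_run_mount("/srv/targets", "1693", "1002")
print(old_run)
print(new_run)
```

Because the run ID is unique per workflow run, the old run's cleanup can only affect its own `run-<id>` directory.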

Trade-offs

- **sccache**: unchanged. Lives on its own mount (`/home/noah/data/sccache`)
  and continues to dedupe across ALL runs of ALL PRs. This is the heavy
  lifter, with a typical 80%+ hit rate on warm cache.
- **cargo-incremental**: lost per new run. Cost is small because sccache
  already covers most of the rebuild surface; cargo-incremental is just
  per-crate metadata, not codegen.
- **Disk**: per-run dirs accumulate under `aprender-ci/<PR>/`. Existing
  disk-guard hook deletes old PR dirs after merge (incl. all run subdirs).
  No new cleanup logic needed.

Verified

- `python3 -c "import yaml; yaml.safe_load(open('.github/workflows/ci.yml'))"` → OK
- 4 mount lines updated (workspace-test: 3 docker steps + ownership-fix step)
- Inline comment block documents the race for future maintainers
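
The YAML parse only shows that the file is well-formed. A complementary check, sketched below, would flag any line that still uses the per-PR path without the per-run suffix. The helper `stale_mount_lines` and the sample `docker run` lines are hypothetical and are not part of this PR; the check is a plain substring match, so it would also flag comments that mention the old path.

```python
PER_PR = "aprender-ci/${PR_OR_REF}"
PER_RUN = PER_PR + "/run-${GITHUB_RUN_ID}"


def stale_mount_lines(workflow_text: str) -> list[int]:
    """Return 1-based numbers of lines that mention the per-PR path
    but lack the per-run suffix."""
    stale = []
    for number, line in enumerate(workflow_text.splitlines(), start=1):
        if PER_PR in line and PER_RUN not in line:
            stale.append(number)
    return stale


# Hypothetical workflow excerpt: line 1 is updated, line 2 is not.
sample = "\n".join([
    "docker run -v /srv/aprender-ci/${PR_OR_REF}/run-${GITHUB_RUN_ID}:/workspace/target img",
    "docker run -v /srv/aprender-ci/${PR_OR_REF}:/workspace/target img",
])
print(stale_mount_lines(sample))
```

An empty result on the real `ci.yml` would be consistent with all 4 mount lines having been updated.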

Impact

Once this lands, the recurring "No such file or directory (os error 2)"
flakes hitting the queue should stop. They are currently cascading across
the PR queue and blocking it.

Refs Toyota Way andon: this is the second iteration of the per-PR-target
fix. The 2026-04-23 fix was correct in direction (isolate from runner shared
state) but didn't account for the cancel-in-progress race. Both fixes are
needed: per-PR isolates across-PRs; per-RUN isolates within-PR rebuilds.

🤖 Generated with Claude Code


…0 fix (#1704)

The P0 per-run target-dir fix (#1693, 2026-05-15) eliminated cargo-incremental cross-run warmth. Cold compiles now happen on every run; sccache covers codegen (~80% hit rate on warm cache) but cargo's metadata + linking + test binaries still cost ~40-50 min cold.

Under runner-pool saturation (7+ concurrent CI runs observed today), the previous 55min timeout was exactly hit by 5 simultaneously-rebasing PRs (run 25919246467 and siblings, all of which timed out at 55:00.0 with no real test failure).

Bumps:

- workspace-test step: 55min → 75min
- workspace-test job: 65min → 85min (10min overhead headroom)

Trade-off: a genuinely stuck run now eats up to 75min of runner time instead of 55min. Acceptable: we have 16 self-hosted runners (per memory `reference_self_hosted_runner_disk_guard.md`) and the cost of a timeout-false-alarm cascade (5 PRs simultaneously red) is much higher than the cost of one extra 20min of waiting for a truly hung run.

Refs Toyota Way: don't ignore the line stoppage; extend the time window to match the new cold-compile reality.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

#1798)

Adds a chown step BEFORE the cargo step that runs `docker run --rm` as root and chowns the per-RUN target dir + cargo registry to noah:1000.

## Why

Docker's bind-mount creates missing host directories with the daemon's uid (root). Since #1693 switched to per-RUN target dirs (`/mnt/nvme-raid0/targets/aprender-ci/<PR>/run-<RUN_ID>`), every fresh run gets a root-owned target dir. Cargo (running as uid 1000 inside the container) cannot write to it and fails with:

    error: failed to create directory `/workspace/target/debug`: No such file or directory (os error 2)

The existing post-job chown (line 245) was meant to fix this for the NEXT run's git-clean, but per-RUN paths invalidate that since each run gets a brand-new root-owned dir. First-runs always fail.

This was observed across 6+ in-flight PRs (#1784, #1791-#1797) on 2026-05-18: every "infrastructure flake" turned out to be the same ownership bug at different cargo entry points.

## Fix

Pre-cargo chown step. Idempotent (`|| true`). Runs the existing sovereign-ci image as root for the chown, then exits; adds maybe 2s to runs. Matches the pattern of the post-job chown step that already exists; just moves it to BEFORE cargo as well.

## Manual one-shot

The 6 currently-stuck PRs were unblocked by manually chowning their per-RUN dirs on the runner host:

    ssh intel sudo chown -R 1000:1000 \
      /mnt/nvme-raid0/targets/aprender-ci/{1792,1793,1794,1796,1797,main}/run-*

After this PR lands, future runs will fix themselves.

Co-authored-by: Noah Gift <claude@noahgift.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
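
The ownership mismatch described in #1798 can be detected with a plain `stat` before cargo starts. The following is a minimal sketch for POSIX hosts; the helper `needs_chown` is hypothetical and is not the step that PR added, which runs the chown unconditionally through `docker run --rm` as root.

```python
import os
import tempfile


def needs_chown(path: str, uid: int) -> bool:
    """True when `path` is owned by a uid other than the one that
    will run cargo, e.g. a bind-mount dir created by a root daemon."""
    return os.stat(path).st_uid != uid


# A directory we just created is owned by us, so no chown is needed.
with tempfile.TemporaryDirectory() as fresh_dir:
    print(needs_chown(fresh_dir, os.getuid()))
```

On a runner, the same check against a freshly created per-RUN dir with `uid=1000` would be expected to report a mismatch until the pre-cargo chown has run.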

noahgift enabled auto-merge (squash) May 15, 2026 11:37

noahgift merged commit b13645a into main May 15, 2026
11 checks passed

noahgift deleted the fix/ci-per-run-target-dir branch May 15, 2026 12:00

noahgift mentioned this pull request May 15, 2026

fix(ci): bump workspace-test timeout 55→75min — followup to per-run P0 fix #1704

Merged

noahgift mentioned this pull request May 18, 2026

ci: pre-build chown of per-RUN target dir to fix root-owned bind mount #1798

Merged

4 tasks

noahgift merged 1 commit into main from fix/ci-per-run-target-dir
