fix(ci)!: per-run target-dir (P0 STOP THE LINE on cancel-corrupt-state race) #1693
Merged
## Root cause (five-whys)
1. CI fails with `couldn't create a temp dir: /workspace/target/debug/deps/...`
2. `deps/` was unlinked mid-build, then cargo tried to create a tempfile inside it
3. A concurrent process on the same host path is racing this run's cargo
4. `concurrency.cancel-in-progress: true` (ci.yml:22) cancels the old run when
`update-branch` pushes a new commit — but cancellation is signal-based with
a 30s SIGTERM→SIGKILL window. During those 30s the old cargo is still
writing/cleaning files in `/workspace/target/...`
5. The mount was per-PR (`aprender-ci/${PR_OR_REF}`), so the new run mounted
THE SAME host directory as the dying old run — they shared
`/workspace/target/debug/deps/` on the host
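The failure in steps 2 to 5 can be replayed in miniature without cargo or Docker. This is a minimal sketch, assuming only a POSIX shell with `mktemp`; the helper name `simulate_race` and the `rustcXXXXXX` template are illustrative and do not come from the CI configuration. The two "runs" are executed back to back so the outcome is deterministic, whereas the real race depends on timing.

```shell
#!/bin/sh
# Miniature, deterministic replay of the cancel-corrupt-state race.
# Hypothetical helper: not part of ci.yml.
simulate_race() {
    shared=$(mktemp -d) || return 2      # stands in for the shared host target dir
    mkdir -p "$shared/debug/deps"        # state left behind by the old (cancelled) run

    # Old run, still alive inside its termination window, cleans up.
    rm -rf "$shared/debug/deps"

    # New run on the same host path tries to create a temp file inside deps/.
    if mktemp "$shared/debug/deps/rustcXXXXXX" >/dev/null 2>&1; then
        result="created"
    else
        result="failed: deps/ vanished"
    fi

    rm -rf "$shared"                     # tidy up the scratch directory
    echo "$result"
}

simulate_race
```

Because the parent directory is gone by the time the second step runs, the temp-file creation fails, which mirrors the `couldn't create a temp dir` error in step 1.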
## What the prior fix solved vs. what it introduced
The 2026-04-23 fleet fix (paiml/.github#31) moved sovereign-ci from a shared
runner-wide target dir to per-PR isolation, solving CROSS-PR same-runner
collisions. That worked. But the per-PR isolation INTRODUCED this
cancel-corrupt-state collision for SAME-PR sequential runs (every `update-branch`
or new push triggers a new run that mounts the SAME host path).
## Fix
Bump the mount path one level deeper:
`aprender-ci/${PR_OR_REF}` → `aprender-ci/${PR_OR_REF}/run-${GITHUB_RUN_ID}`
Now every CI run gets its own isolated target dir; no two cargo invocations
ever share a host directory.
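As a sketch of the path change, the helper below builds the per-run host directory from its three parts. The function name `per_run_target_dir` and the run IDs `1001`/`1002` are hypothetical; the base directory is assumed to be `/mnt/nvme-raid0/targets`, and the real workflow interpolates `${PR_OR_REF}` and `${GITHUB_RUN_ID}` directly in its mount lines rather than calling a function.

```shell
#!/bin/sh
# Hypothetical helper illustrating the new mount path layout.
# $1 = host base dir (assumed), $2 = PR number or ref, $3 = GitHub run id
per_run_target_dir() {
    printf '%s/aprender-ci/%s/run-%s\n' "$1" "$2" "$3"
}

base=/mnt/nvme-raid0/targets                          # assumed host base dir
old_run=$(per_run_target_dir "$base" 1693 1001)       # run being cancelled
new_run=$(per_run_target_dir "$base" 1693 1002)       # run triggered by the new push

echo "$old_run"
echo "$new_run"
```

Two runs of the same PR now resolve to different directories, so the dying run can no longer touch the new run's `deps/`. One caveat worth keeping in mind: `GITHUB_RUN_ID` does not change when an existing run is re-run, so a re-run attempt would land in the same directory as the original attempt.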
## Trade-offs
- **sccache**: unchanged. Lives on its own mount (`/home/noah/data/sccache`)
and continues to dedupe across ALL runs of ALL PRs. This is the heavy
lifter — typical 80%+ hit rate on warm cache.
- **cargo-incremental**: lost per new run. Cost is small because sccache
already covers most of the rebuild surface; cargo-incremental is the
per-crate metadata, not the codegen.
- **Disk**: per-run dirs accumulate under `aprender-ci/<PR>/`. Existing
disk-guard hook deletes old PR dirs after merge (incl. all run subdirs).
No new cleanup logic needed.
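The disk point relies on the run directories being nested under the PR directory. A minimal sketch of why no extra cleanup is needed, using a scratch directory instead of the real mount; `cleanup_pr` is a hypothetical stand-in for the disk-guard hook, whose actual implementation is not shown in this PR.

```shell
#!/bin/sh
# Hypothetical stand-in for the disk-guard hook: removing a merged PR's
# directory also removes every run-* subdirectory beneath it.
cleanup_pr() {
    rm -rf -- "$1/aprender-ci/$2"        # $1 = base dir, $2 = PR number
}

base=$(mktemp -d)                        # scratch dir, not the real mount
mkdir -p "$base/aprender-ci/1693/run-1001/debug" \
         "$base/aprender-ci/1693/run-1002/debug" \
         "$base/aprender-ci/1700/run-1003/debug"

cleanup_pr "$base" 1693                  # PR 1693 merged: drop all of its runs
remaining=$(ls "$base/aprender-ci")      # only the other PR's directory is left
echo "$remaining"

rm -rf "$base"
```

Until the PR merges, each run still keeps a full target dir on disk, so usage per open PR grows with the number of runs.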
## Verified
- `python3 -c "import yaml; yaml.safe_load(open('.github/workflows/ci.yml'))"` → OK
- 4 mount lines updated (workspace-test: 3 docker steps + ownership-fix step)
- Inline comment block documents the cancel-corrupt-state race for future maintainers
## Impact
Once this lands, the recurring "No such file or directory (os error 2)"
flakes hitting the queue should stop. Currently 4+ PRs blocked at the
same defect simultaneously.
Refs Toyota Way andon: this is the second iteration of the per-PR-target
fix; the 2026-04-23 fix was correct in direction (isolate from runner
shared state) but didn't account for the cancel-in-progress race.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on May 15, 2026
…0 fix (#1704)

The P0 per-run target-dir fix (#1693, 2026-05-15) eliminated cargo-incremental cross-run warmth. Cold compiles now happen on every run; sccache covers codegen (~80% hit rate on warm cache) but cargo's metadata + linking + test binaries still cost ~40-50 min cold.

Under runner-pool saturation (7+ concurrent CI runs observed today), the previous 55min timeout was exactly hit by 5 simultaneously-rebasing PRs (run 25919246467 and siblings; all timed out at 55:00.0 with no real test failure).

Bumps:
- workspace-test step: 55min → 75min
- workspace-test job: 65min → 85min (10min overhead headroom)

Trade-off: a genuinely stuck run now eats up to 75min of runner time instead of 55min. Acceptable: we have 16 self-hosted runners (per memory `reference_self_hosted_runner_disk_guard.md`) and the cost of a timeout-false-alarm cascade (5 PRs simultaneously red) is much higher than the cost of one extra 20min of waiting for a truly hung run.

Refs Toyota Way: don't ignore the line stoppage; extend the time window to match the new cold-compile reality.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request on May 18, 2026
#1798)

Adds a chown step BEFORE the cargo step that runs `docker run --rm` as root and chowns the per-RUN target dir + cargo registry to noah:1000.

## Why

Docker's bind-mount creates missing host directories with the daemon's uid (root). Since #1693 switched to per-RUN target dirs (`/mnt/nvme-raid0/targets/aprender-ci/<PR>/run-<RUN_ID>`), every fresh run gets a root-owned target dir. Cargo (running as uid 1000 inside the container) cannot write to it and fails with:

    error: failed to create directory `/workspace/target/debug`: No such file or directory (os error 2)

The existing post-job chown (line 245) was meant to fix this for the NEXT run's git-clean, but per-RUN paths invalidate that since each run gets a brand-new root-owned dir. First-runs always fail.

This was observed across 6+ in-flight PRs (#1784, #1791-#1797) on 2026-05-18: every "infrastructure flake" turned out to be the same ownership bug at different cargo entry points.

## Fix

Pre-cargo chown step. Idempotent (`|| true`). Runs the existing sovereign-ci image as root for the chown, then exits; adds maybe 2s to runs. Matches the pattern of the post-job chown step that already exists; just moves it to BEFORE cargo as well.

## Manual one-shot

The 6 currently-stuck PRs were unblocked by manually chowning their per-RUN dirs on the runner host:

    ssh intel sudo chown -R 1000:1000 \
      /mnt/nvme-raid0/targets/aprender-ci/{1792,1793,1794,1796,1797,main}/run-*

After this PR lands, future runs will fix themselves.

Co-authored-by: Noah Gift <claude@noahgift.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
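A rough sketch of what a pre-cargo chown step of this shape could look like, written as a dry run that only prints the command. Everything here is an assumption for illustration: the helper name `chown_cmd`, the image name `sovereign-ci` used as a bare tag, and the single target-dir mount (the commit also chowns the cargo registry, which is omitted). The actual step in `ci.yml` may differ.

```shell
#!/bin/sh
# Hypothetical dry run: prints a pre-cargo chown command instead of executing it.
# The image name and directory values are placeholders, not the real workflow values.
chown_cmd() {
    host_dir=$1   # per-run host target dir (created root-owned by the bind mount)
    image=$2      # CI image name (placeholder)
    # --user root lets chown succeed; "|| true" keeps the step idempotent.
    printf 'docker run --rm --user root -v %s:/workspace/target %s chown -R 1000:1000 /workspace/target || true\n' \
        "$host_dir" "$image"
}

chown_cmd /mnt/nvme-raid0/targets/aprender-ci/1792/run-1002 sovereign-ci
```

Running the chown before cargo matters because the bind mount itself creates the missing host directory, so a step that only runs after the job can never fix the current run.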
## Andon
STOP THE LINE. Recurring `couldn't create a temp dir: /workspace/target/debug/deps/...`
flakes hitting 4-5 PRs simultaneously, blocking the entire merge queue.
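## Root cause (five-whys)
1. CI fails with `couldn't create a temp dir: /workspace/target/debug/deps/...`
2. `deps/` was unlinked mid-build, then cargo tried to create a tempfile inside it
3. A concurrent process on the same host path is racing this run's cargo
4. `concurrency.cancel-in-progress: true` (ci.yml:22) cancels the old run when
`update-branch` pushes a new commit, but cancellation is signal-based with
a 30s SIGTERM→SIGKILL window. During those 30s the old cargo is still
writing/cleaning files in `/workspace/target/...`
5. The mount was per-PR (`aprender-ci/${PR_OR_REF}`), so the new run mounted
THE SAME host directory as the dying old run: they shared
`/workspace/target/debug/deps/` on the host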
## What the prior fix solved vs. what it introduced
The 2026-04-23 fleet fix (paiml/.github#31) moved sovereign-ci from a shared
runner-wide target dir to per-PR isolation, solving CROSS-PR same-runner
collisions. That worked. But the per-PR isolation INTRODUCED this
cancel-corrupt-state collision for SAME-PR sequential runs — every
`update-branch` or new push triggers a new run that mounts the SAME host path.
## Fix
Bump the mount path one level deeper:
```
aprender-ci/${PR_OR_REF} → aprender-ci/${PR_OR_REF}/run-${GITHUB_RUN_ID}
```
Now every CI run gets its own isolated target dir; no two cargo invocations
ever share a host directory.
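## Trade-offs
- **sccache**: unchanged. Lives on its own mount (`/home/noah/data/sccache`)
and continues to dedupe across ALL runs of ALL PRs. This is the heavy
lifter, with a typical 80%+ hit rate on warm cache.
- **cargo-incremental**: lost per new run. Cost is small because sccache
already covers most of the rebuild surface; cargo-incremental is just
per-crate metadata, not codegen.
- **Disk**: per-run dirs accumulate under `aprender-ci/<PR>/`. Existing
disk-guard hook deletes old PR dirs after merge (incl. all run subdirs).
No new cleanup logic needed.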
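## Verified
- `python3 -c "import yaml; yaml.safe_load(open('.github/workflows/ci.yml'))"` → OK
- 4 mount lines updated (workspace-test: 3 docker steps + ownership-fix step)
- Inline comment block documents the cancel-corrupt-state race for future maintainers
## Impact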
Once this lands, the recurring "No such file or directory (os error 2)"
flakes hitting the queue should stop. The defect is currently blocking the
PR queue in a cascade.
Refs Toyota Way andon: this is the second iteration of the per-PR-target
fix. The 2026-04-23 fix was correct in direction (isolate from runner shared
state) but didn't account for the cancel-in-progress race. Both fixes are
needed: per-PR isolates across-PRs; per-RUN isolates within-PR rebuilds.
🤖 Generated with Claude Code