fix(ci)!: per-run target-dir — P0 STOP THE LINE on cancel-corrupt-state race #1693

Merged: noahgift merged 1 commit into main from fix/ci-per-run-target-dir on May 15, 2026

@noahgift

Andon

STOP THE LINE. Recurring `couldn't create a temp dir: /workspace/target/debug/deps/...`
flakes hitting 4-5 PRs simultaneously, blocking the entire merge queue.

Root cause (five-whys)

  1. CI fails with `couldn't create a temp dir: /workspace/target/debug/deps/...`
  2. `deps/` was unlinked mid-build, then cargo tried to create a tempfile inside it
  3. A concurrent process on the same host path is racing this run's cargo
  4. `concurrency.cancel-in-progress: true` (`.github/workflows/ci.yml:22`)
    cancels the old run when `update-branch` pushes a new commit — but cancellation
    is signal-based with a 30s SIGTERM→SIGKILL window. During those 30s the old
    cargo is still writing/cleaning files in `/workspace/target/...`
  5. The mount was per-PR (`aprender-ci/${PR_OR_REF}`), so the new run mounted
    THE SAME host directory as the dying old run — they shared
    `/workspace/target/debug/deps/` on the host
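
The failure mode in steps 2 and 5 can be reproduced in miniature. The sketch below is an illustration only, not the CI code: the helper names are hypothetical, and a plain `shutil.rmtree` stands in for whatever the cancelled cargo process does to `deps/` during the SIGTERM window.

```python
import os
import shutil
import tempfile


def make_target_dir() -> str:
    """Create a throwaway directory laid out like a cargo target dir."""
    root = tempfile.mkdtemp(prefix="target-")
    os.makedirs(os.path.join(root, "debug", "deps"))
    return root


def old_run_cleanup(target_dir: str) -> None:
    """Stand-in for the cancelled run: unlinks deps/ while it shuts down."""
    shutil.rmtree(os.path.join(target_dir, "debug", "deps"))


def new_run_tempdir(target_dir: str) -> str:
    """Stand-in for the new run's cargo creating a temp dir inside deps/."""
    deps = os.path.join(target_dir, "debug", "deps")
    try:
        tempfile.mkdtemp(dir=deps)
        return "ok"
    except FileNotFoundError:
        return "No such file or directory (os error 2)"


def simulate(shared: bool) -> str:
    """Run the old cleanup and the new build against shared or separate dirs."""
    old_dir = make_target_dir()
    new_dir = old_dir if shared else make_target_dir()
    try:
        old_run_cleanup(old_dir)
        return new_run_tempdir(new_dir)
    finally:
        for d in {old_dir, new_dir}:
            shutil.rmtree(d, ignore_errors=True)
```

Calling `simulate(shared=True)` models the per-PR mount, where the new run loses `deps/` underneath it; `simulate(shared=False)` models two runs that never touch the same host directory.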

What the prior fix solved vs. what it introduced

The 2026-04-23 fleet fix (paiml/.github#31) moved sovereign-ci from a shared
runner-wide target dir to per-PR isolation, solving CROSS-PR same-runner
collisions. That worked. But the per-PR isolation INTRODUCED this
cancel-corrupt-state collision for SAME-PR sequential runs — every
`update-branch` or new push triggers a new run that mounts the SAME host path.

Fix

Bump the mount path one level deeper:

```
aprender-ci/${PR_OR_REF} → aprender-ci/${PR_OR_REF}/run-${GITHUB_RUN_ID}
```

Now every CI run gets its own isolated target dir; no two cargo invocations
ever share a host directory.
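
As a rough illustration of why the extra path component is enough, the sketch below builds both mount paths for two runs of the same PR. The function names and the run IDs are made up for the example; only the path templates come from this PR.

```python
def per_pr_mount(pr_or_ref: str) -> str:
    """Old layout: one host directory per PR (or ref)."""
    return f"aprender-ci/{pr_or_ref}"


def per_run_mount(pr_or_ref: str, run_id: str) -> str:
    """New layout: one host directory per CI run, nested under the PR."""
    return f"aprender-ci/{pr_or_ref}/run-{run_id}"


# Two runs of the same PR; the run IDs are hypothetical.
old_run, new_run = "1001", "1002"

# Old layout: both runs resolve to the same directory, so they can race.
shares_dir_before = per_pr_mount("1693") == per_pr_mount("1693")

# New layout: the run ID makes the two paths distinct.
shares_dir_after = per_run_mount("1693", old_run) == per_run_mount("1693", new_run)
```

Because `GITHUB_RUN_ID` differs between workflow runs, a cancelled run and its replacement resolve to different directories even when they belong to the same PR.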

Trade-offs

  • sccache: unchanged. Lives on its own mount (`/home/noah/data/sccache`)
    and continues to dedupe across ALL runs of ALL PRs. This is the heavy
    lifter — typical 80%+ hit rate on warm cache.
  • cargo-incremental: lost per new run. Cost is small because sccache
    already covers most of the rebuild surface; cargo-incremental is just
    per-crate metadata, not codegen.
  • Disk: per-run dirs accumulate under `aprender-ci/<PR>/`. Existing
    disk-guard hook deletes old PR dirs after merge (incl. all run subdirs).
    No new cleanup logic needed.
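
The disk-guard hook itself is not shown in this PR, so the following is only a sketch of why nesting run dirs under the PR dir needs no new cleanup logic: a recursive delete of the PR dir takes every `run-*` subdir with it. The `delete_pr_dir` helper, the PR numbers, and the run names are hypothetical.

```python
import os
import shutil
import tempfile


def delete_pr_dir(targets_root: str, pr: str) -> None:
    """Hypothetical post-merge cleanup: remove one PR's whole directory tree."""
    shutil.rmtree(os.path.join(targets_root, "aprender-ci", pr), ignore_errors=True)


def demo_cleanup() -> tuple[bool, bool]:
    """Return (merged PR dir still exists, other PR's run dir still exists)."""
    root = tempfile.mkdtemp(prefix="targets-")
    try:
        for pr, run in [("1693", "run-1"), ("1693", "run-2"), ("1700", "run-3")]:
            os.makedirs(os.path.join(root, "aprender-ci", pr, run))
        delete_pr_dir(root, "1693")
        merged_left = os.path.exists(os.path.join(root, "aprender-ci", "1693"))
        other_left = os.path.exists(os.path.join(root, "aprender-ci", "1700", "run-3"))
        return merged_left, other_left
    finally:
        shutil.rmtree(root, ignore_errors=True)
```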

Verified

  • `python3 -c "import yaml; yaml.safe_load(open('.github/workflows/ci.yml'))"` → OK
  • 4 mount lines updated (workspace-test: 3 docker steps + ownership-fix step)
  • Inline comment block documents the race for future maintainers
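
The YAML parse above only proves the file is well-formed. A stricter check, sketched here under assumptions, would scan the workflow text and confirm that every `aprender-ci/` mount carries the per-run suffix. The sample fragment, the image name, and the `check_mount_lines` helper are hypothetical and do not come from the real `ci.yml`.

```python
# Hypothetical workflow fragment: the first mount is per-run, the second
# still uses the old per-PR path and should be flagged.
SAMPLE_WORKFLOW = """\
run: |
  docker run --rm \\
    -v /mnt/nvme-raid0/targets/aprender-ci/${PR_OR_REF}/run-${GITHUB_RUN_ID}:/workspace/target \\
    sovereign-ci cargo test
  docker run --rm \\
    -v /mnt/nvme-raid0/targets/aprender-ci/${PR_OR_REF}:/workspace/target \\
    sovereign-ci cargo build
"""


def check_mount_lines(workflow_text: str) -> tuple[int, list[str]]:
    """Return (count of aprender-ci mount lines, lines missing the per-run suffix)."""
    mounts = [
        line.strip()
        for line in workflow_text.splitlines()
        if "aprender-ci/" in line and not line.lstrip().startswith("#")
    ]
    missing = [m for m in mounts if "run-${GITHUB_RUN_ID}" not in m]
    return len(mounts), missing


count, missing = check_mount_lines(SAMPLE_WORKFLOW)
```

Run against the real workflow file, a check like this would be expected to report 4 mount lines and an empty `missing` list.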

Impact

Once this lands, the recurring "No such file or directory (os error 2)"
flakes hitting the queue should stop. The defect is currently blocking the entire PR queue.

Refs Toyota Way andon: this is the second iteration of the per-PR-target
fix. The 2026-04-23 fix was correct in direction (isolate from runner shared
state) but didn't account for the cancel-in-progress race. Both fixes are
needed: per-PR isolation separates different PRs; per-RUN isolation separates
successive runs within one PR.

🤖 Generated with Claude Code

…ate race)

## Root cause (five-whys)

1. CI fails with `couldn't create a temp dir: /workspace/target/debug/deps/...`
2. `deps/` was unlinked mid-build, then cargo tried to create a tempfile inside it
3. A concurrent process on the same host path is racing this run's cargo
4. `concurrency.cancel-in-progress: true` (ci.yml:22) cancels the old run when
   `update-branch` pushes a new commit — but cancellation is signal-based with
   a 30s SIGTERM→SIGKILL window. During those 30s the old cargo is still
   writing/cleaning files in `/workspace/target/...`
5. The mount was per-PR (`aprender-ci/${PR_OR_REF}`), so the new run mounted
   THE SAME host directory as the dying old run — they shared
   `/workspace/target/debug/deps/` on the host

## What the prior fix solved vs. what it introduced

The 2026-04-23 fleet fix (paiml/.github#31) moved sovereign-ci from a shared
runner-wide target dir to per-PR isolation, solving CROSS-PR same-runner
collisions. That worked. But the per-PR isolation INTRODUCED this
cancel-corrupt-state collision for SAME-PR sequential runs (every `update-branch`
or new push triggers a new run that mounts the SAME host path).

## Fix

Bump the mount path one level deeper:
  `aprender-ci/${PR_OR_REF}` → `aprender-ci/${PR_OR_REF}/run-${GITHUB_RUN_ID}`

Now every CI run gets its own isolated target dir; no two cargo invocations
ever share a host directory.

## Trade-offs

- **sccache**: unchanged. Lives on its own mount (`/home/noah/data/sccache`)
  and continues to dedupe across ALL runs of ALL PRs. This is the heavy
  lifter — typical 80%+ hit rate on warm cache.
- **cargo-incremental**: lost per new run. Cost is small because sccache
  already covers most of the rebuild surface; cargo-incremental is the
  per-crate metadata, not the codegen.
- **Disk**: per-run dirs accumulate under `aprender-ci/<PR>/`. Existing
  disk-guard hook deletes old PR dirs after merge (incl. all run subdirs).
  No new cleanup logic needed.

## Verified

- `python3 -c "import yaml; yaml.safe_load(open('.github/workflows/ci.yml'))"` → OK
- 4 mount lines updated (workspace-test: 3 docker steps + ownership-fix step)
- Inline comment block documents the cancel-corrupt-state race for future maintainers

## Impact

Once this lands, the recurring "No such file or directory (os error 2)"
flakes hitting the queue should stop. Currently 4+ PRs are blocked on the
same defect simultaneously.

Refs Toyota Way andon: this is the second iteration of the per-PR-target
fix; the 2026-04-23 fix was correct in direction (isolate from runner
shared state) but didn't account for the cancel-in-progress race.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift enabled auto-merge (squash) on May 15, 2026 11:37
@noahgift merged commit b13645a into main on May 15, 2026
11 checks passed
@noahgift deleted the fix/ci-per-run-target-dir branch on May 15, 2026 12:00
noahgift added a commit that referenced this pull request May 15, 2026
…0 fix (#1704)

The P0 per-run target-dir fix (#1693, 2026-05-15) eliminated
cargo-incremental cross-run warmth. Cold compiles now happen on every
run; sccache covers codegen (~80% hit rate on warm cache) but cargo's
metadata + linking + test binaries still cost ~40-50 min cold.

Under runner-pool saturation (7+ concurrent CI runs observed today),
the previous 55min timeout was exactly hit by 5 simultaneously-rebasing
PRs (run 25919246467 and siblings — all timed out at 55:00.0 with no
real test failure).

Bumps:
  workspace-test step: 55min → 75min
  workspace-test job:  65min → 85min  (10min overhead headroom)

Trade-off: a genuinely stuck run now eats up to 75min of runner time
instead of 55min. Acceptable — we have 16 self-hosted runners (per
memory `reference_self_hosted_runner_disk_guard.md`) and the cost of
a timeout-false-alarm cascade (5 PRs simultaneously red) is much higher
than the cost of an extra 20min of waiting for a truly hung run.
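
The arithmetic behind the two bumps is simple enough to spell out. This is a sketch, with hypothetical function names, that uses only the numbers quoted in this commit message.

```python
HEADROOM_MIN = 10  # job-level overhead allowed on top of the step timeout


def job_timeout(step_timeout_min: int, headroom_min: int = HEADROOM_MIN) -> int:
    """Job timeout = step timeout plus fixed overhead headroom."""
    return step_timeout_min + headroom_min


def extra_hung_run_cost(old_step_min: int, new_step_min: int) -> int:
    """Additional runner minutes a genuinely stuck run can consume."""
    return new_step_min - old_step_min


old_job = job_timeout(55)               # previous job-level limit
new_job = job_timeout(75)               # new job-level limit
penalty = extra_hung_run_cost(55, 75)   # worst-case extra wait per hung run
```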

Refs Toyota Way: don't ignore the line stoppage — extend the time
window to match the new cold-compile reality.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 18, 2026
#1798)

Adds a chown step BEFORE the cargo step. The new step runs `docker run --rm` as
root and chowns the per-RUN target dir + cargo registry to noah:1000.

## Why

Docker's bind-mount creates missing host directories with the daemon's
uid (root). Since #1693 switched to per-RUN target dirs
(`/mnt/nvme-raid0/targets/aprender-ci/<PR>/run-<RUN_ID>`), every fresh
run gets a root-owned target dir. Cargo (running as uid 1000 inside the
container) cannot write to it and fails with:

    error: failed to create directory `/workspace/target/debug`:
    No such file or directory (os error 2)

The existing post-job chown (line 245) was meant to fix this for the
NEXT run's git-clean, but per-RUN paths invalidate that since each
run gets a brand-new root-owned dir. Every run is now a first run for its
dir, so it always fails.

This was observed across 6+ in-flight PRs (#1784, #1791-#1797) on
2026-05-18 — every "infrastructure flake" turned out to be the same
ownership bug at different cargo entry points.

## Fix

Pre-cargo chown step. Idempotent (`|| true`). Runs the existing
sovereign-ci image as root for the chown, then exits, adding roughly 2s
per run. Matches the pattern of the post-job chown step that already
exists; the same chown now also runs BEFORE cargo.
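
One way to confirm the diagnosis on a runner host is to compare the owner of a target dir with the uid cargo runs as. The sketch below is not part of the workflow; `needs_chown` is a hypothetical helper, and it assumes a Unix host where `os.geteuid` is available.

```python
import os
import tempfile


def needs_chown(path: str, expected_uid: int) -> bool:
    """True when `path` is owned by a uid other than the one cargo runs as."""
    return os.stat(path).st_uid != expected_uid


# A directory created by the current user is owned by that user, so no chown
# is needed. A root-created bind-mount dir checked against uid 1000 would
# return True instead.
with tempfile.TemporaryDirectory() as fresh_dir:
    owned_by_us = not needs_chown(fresh_dir, os.geteuid())
    owned_by_other = needs_chown(fresh_dir, os.geteuid() + 1)
```

On an affected runner, calling `needs_chown` on a freshly created `run-<RUN_ID>` directory with `expected_uid=1000` would be expected to return `True` until the pre-cargo chown step has run.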

## Manual one-shot

The 6 currently-stuck PRs were unblocked by manually chowning their
per-RUN dirs on the runner host:

    ssh intel sudo chown -R 1000:1000 \
        /mnt/nvme-raid0/targets/aprender-ci/{1792,1793,1794,1796,1797,main}/run-*

After this PR lands, future runs will fix themselves.

Co-authored-by: Noah Gift <claude@noahgift.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>