ci: pre-build chown of per-RUN target dir to fix root-owned bind mount#1798

Merged
noahgift merged 5 commits into main from fix/ci-pre-build-chown-per-run-target-dir
May 18, 2026
Conversation

@noahgift
Contributor

Summary

Adds a chown step BEFORE the cargo step in `workspace-test` that runs the sovereign-ci image as root and chowns the per-RUN target dir (`/mnt/nvme-raid0/targets/aprender-ci/<PR>/run-<RUN_ID>`) and cargo registry to noah:1000.

Why — five-whys

  1. Why do fresh runs fail with `failed to create /workspace/target/debug`? Cargo runs as uid 1000 inside the container and can't write to root-owned dirs.
  2. Why root-owned? Docker's bind-mount creates missing host directories with the daemon's uid (root).
  3. Why does this fire every run? Per-RUN paths (#1693, `aprender-ci/<PR>/run-<RUN_ID>`) are always fresh, so Docker creates them root-owned every time.
  4. Why didn't this happen before #1693? Before #1693, the per-PR (not per-RUN) path persisted across runs, and the existing post-job chown step at line 245 fixed perms for the next run's git-clean. Per-RUN paths invalidate that.
  5. Root cause: the chown step runs AFTER cargo, not BEFORE. First-runs always fail; reruns appear to work only when a sibling run's belated chown happened to fix the (now-stale) dir.
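
The difference between steps 3 and 4 can be illustrated with a small shell sketch. The helper names (`per_pr_dir`, `per_run_dir`, `dir_state`) are hypothetical and not part of the workflow, and the pre-#1693 layout is assumed to be `<base>/<PR>`. The point is only that a per-PR path is found on disk by a later run, while a per-RUN path never is.

```shell
# Hypothetical helpers for illustration; not taken from ci.yml.
# Assumed pre-#1693 layout: <base>/<PR>. Post-#1693: <base>/<PR>/run-<RUN_ID>.
per_pr_dir()  { printf '%s/%s\n' "$1" "$2"; }
per_run_dir() { printf '%s/%s/run-%s\n' "$1" "$2" "$3"; }

# "reused": the dir already exists, so an earlier run's post-job chown
#           could have fixed its ownership.
# "fresh":  the dir is missing, so Docker would create it (root-owned)
#           when it sets up the bind mount.
dir_state() {
  if [ -d "$1" ]; then echo reused; else echo fresh; fi
}

# Demo in a throwaway directory: run 100 leaves its dir behind, then
# run 101 of the same PR looks for its own target dir.
demo_base=$(mktemp -d)
mkdir -p "$(per_run_dir "$demo_base" 1792 100)"
dir_state "$(per_pr_dir "$demo_base" 1792)"        # prints "reused"
dir_state "$(per_run_dir "$demo_base" 1792 101)"   # prints "fresh"
```

Under the per-RUN layout every run lands in the "fresh" branch, which is why a chown that only runs after cargo can never help the run that needs it.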

Empirical (2026-05-18 lambda-vector ssh intel)

Inspection showed 6 in-flight runs hit the same error simultaneously across unrelated PRs:

```
ROOT-OWNED: /mnt/nvme-raid0/targets/aprender-ci/1792/run-26040118649 (root:root)
ROOT-OWNED: /mnt/nvme-raid0/targets/aprender-ci/1793/run-26040120976 (root:root)
ROOT-OWNED: /mnt/nvme-raid0/targets/aprender-ci/1794/run-26040155236 (root:root)
ROOT-OWNED: /mnt/nvme-raid0/targets/aprender-ci/1796/run-26040126733 (root:root)
ROOT-OWNED: /mnt/nvme-raid0/targets/aprender-ci/1797/run-26040155057 (root:root)
ROOT-OWNED: /mnt/nvme-raid0/targets/aprender-ci/main/run-26040057475 (root:root)
```

All resolved immediately after manual `sudo chown -R 1000:1000` on the runner host. Disk space and inode counts were fine (67% used / 22% inodes); this was pure ownership.
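
A listing like the one above could be produced by a scan along these lines. This is a sketch, not the command actually run on the host: `report_mis_owned` and its `MIS-OWNED` label are hypothetical, and it assumes GNU `stat` (the `-c` flag) as found on a Linux runner.

```shell
# Hypothetical helper: print one line for every <base>/*/run-* directory
# whose owner uid differs from the expected uid.
# Usage: report_mis_owned <base-dir> <expected-uid>
report_mis_owned() {
  rmo_base=$1
  rmo_want=$2
  for rmo_dir in "$rmo_base"/*/run-*; do
    # An unmatched glob stays literal; skip anything that is not a dir.
    [ -d "$rmo_dir" ] || continue
    rmo_uid=$(stat -c '%u' "$rmo_dir")
    if [ "$rmo_uid" != "$rmo_want" ]; then
      printf 'MIS-OWNED: %s (%s)\n' "$rmo_dir" "$(stat -c '%U:%G' "$rmo_dir")"
    fi
  done
}
```

On the runner host this might be invoked as `report_mis_owned /mnt/nvme-raid0/targets/aprender-ci 1000`; an empty result would mean every per-RUN dir is already owned by uid 1000.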

This was misdiagnosed as a transient infra flake across the day's session — every "workspace-test failure" turned out to be the same root cause at different cargo entry points (lint, test, coverage, gate all error in the same way).

Fix

A new step BEFORE `Workspace lib tests`:

```yaml
- name: Pre-build chown — fix per-RUN root ownership
  run: |
    docker run --rm \
      -v "/mnt/nvme-raid0/targets/aprender-ci/${PR_OR_REF}/run-${GITHUB_RUN_ID}:/workspace/target" \
      -v "/mnt/nvme-raid0/cargo-ci/registry/${PR_OR_REF}:/usr/local/cargo/registry" \
      "$IMAGE" \
      bash -c 'chown -R 1000:1000 /workspace/target /usr/local/cargo/registry 2>/dev/null || true'
```

Idempotent: re-running `chown` on a dir that is already noah-owned is a no-op, and `|| true` keeps the step from failing the job if `chown` itself errors. Adds ~2s per run since chown of an empty fresh dir is fast.
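
Because the step sends `chown` errors to `/dev/null` and then applies `|| true`, a chown that fails would only surface later as a cargo error. One possible follow-up, not part of this PR, is a fail-fast check right before cargo. `require_writable` is a hypothetical helper; it tests writability for whichever user runs it, so it would need to run as uid 1000 inside the container to be meaningful.

```shell
# Hypothetical guard: succeed only if the directory exists and the
# current user can write to it; otherwise print a diagnostic and fail.
require_writable() {
  if [ -d "$1" ] && [ -w "$1" ]; then
    return 0
  fi
  echo "error: $1 is missing or not writable by uid $(id -u)" >&2
  return 1
}
```

A cargo step could then start with `require_writable /workspace/target || exit 1`, so an ownership problem is reported as such instead of as a directory-creation failure.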

Test plan

  • YAML lint: `python3 -c "import yaml; yaml.safe_load(open('.github/workflows/ci.yml'))"` PASS
  • Manual one-shot `sudo chown` of stuck dirs unblocked 6 in-flight PRs
  • First CI run on this branch verifies the new step works (this PR is self-testing — if the chown step is buggy, this PR's own CI will fail)
  • Subsequent runs on other PRs confirm the fix is stable

Cross-refs

🤖 Generated with Claude Code

Adds a chown step BEFORE the cargo step that runs `docker run --rm` as
root and chowns the per-RUN target dir + cargo registry to noah:1000.

## Why

Docker's bind-mount creates missing host directories with the daemon's
uid (root). Since #1693 switched to per-RUN target dirs
(`/mnt/nvme-raid0/targets/aprender-ci/<PR>/run-<RUN_ID>`), every fresh
run gets a root-owned target dir. Cargo (running as uid 1000 inside the
container) cannot write to it and fails with:

    error: failed to create directory `/workspace/target/debug`:
    No such file or directory (os error 2)

The existing post-job chown (line 245) was meant to fix this for the
NEXT run's git-clean — but per-RUN paths invalidate that since each
run gets a brand-new root-owned dir. First-runs always fail.

This was observed across 6+ in-flight PRs (#1784, #1791-#1797) on
2026-05-18 — every "infrastructure flake" turned out to be the same
ownership bug at different cargo entry points.

## Fix

Pre-cargo chown step. Idempotent (`|| true`). Runs the existing
sovereign-ci image as root for the chown, then exits — adds maybe 2s
to runs. Matches the pattern of the post-job chown step that already
exists; just moves it to BEFORE cargo as well.

## Manual one-shot

The 6 currently-stuck PRs were unblocked by manually chowning their
per-RUN dirs on the runner host:

    ssh intel sudo chown -R 1000:1000 \
        /mnt/nvme-raid0/targets/aprender-ci/{1792,1793,1794,1796,1797,main}/run-*

After this PR lands, future runs will fix themselves.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 18, 2026 14:49
@noahgift noahgift merged commit 1c21f49 into main May 18, 2026
10 checks passed
@noahgift noahgift deleted the fix/ci-pre-build-chown-per-run-target-dir branch May 18, 2026 17:49