Skip to content

fix(prune): serialize integration checks against branch -D (Windows .git/config race)#2808

Merged
max-sixty merged 1 commit into
mainfrom
prune-windows-race
May 19, 2026
Merged

fix(prune): serialize integration checks against branch -D (Windows .git/config race)#2808
max-sixty merged 1 commit into
mainfrom
prune-windows-race

Conversation

@max-sixty
Copy link
Copy Markdown
Owner

Summary

On Windows, wt step prune intermittently panicked prune should succeed with unable to access '.git/config': Permission denied. prune runs parallel integration_reason branch-integration checks (each spawns git subprocesses that read .git/config) on a background thread while the main loop performs inline removals — and the fast-path removal synchronously runs git branch -D, which rewrites .git/config (it removes the branch's [branch "<name>"] section) via git's lockfile + atomic-rename. On Windows the rename briefly holds .git/config with delete access; a concurrent reader's fopen(...,"r") does not pass FILE_SHARE_DELETE and is not retried, so it fails. POSIX open()/rename() semantics never exhibit this — Linux/macOS CI is green.

Approach (the "B-prime" option)

A prune-local Arc<RwLock<()>>: each integration_reason call is wrapped in a read guard (scoped to drop before the channel send); try_remove takes the write guard over its body. The write guard blocks until every in-flight reader has dropped and blocks new ones until the removal completes, so no integration-check git process is alive while the synchronous git branch -D rewrites .git/config. The parallel checks themselves are not serialized — only a reader-vs-removal overlap is excluded — so prune's streaming/interleaved output (the property that fixed the historical start-silence) is unchanged. The lock is confined to prune.rs; the read path is a single guarded scope with no changes to integration_reason, Cmd, or handlers.rs.

Alternatives considered and rejected: full serialization (reintroduces start-silence); git update-ref -d + deferred config-section purge (a Ctrl-C between ref delete and purge leaves a silent ghost-upstream that can corrupt prune's own integration decisions — a data-safety regression); a global git mutex (invasive, kills the parallelism). Poisoning is recovered via unwrap_or_else(|e| e.into_inner()) rather than .expect() — the lock guards (), so a poisoned lock is meaningless and panicking would only cascade an unrelated panic across every later removal.

Known residual gap (deliberate, documented — not closed here)

B-prime closes the race on the instant-rename fast path, which is the path the observed CI failure took (synchronous git branch -D under the guard). On the cross-filesystem / .gitmodules / Windows-file-lock fallback, git branch -D must be deferred into a detached git worktree remove && git branch -D (the worktree still references the branch until it is removed, so an in-process branch -D would fail) — that deferred write runs after the guard drops and is not covered. The fallback is rare for prune targets (integrated worktrees the user is done with; the lock-prone current worktree is deferred last, after the check thread has drained), so the residual exposure is small but non-zero. The check_lock rationale comment documents this precisely rather than overclaiming an unconditional guarantee.

test_prune_fallback_config_race_canary is the empirical probe for whether that residual gap matters: it forces the fallback for one non-current integrated worktree (pre-blocking its staged path) while several other integrated worktrees keep the parallel integration fan-out running, so the deferred config-rewriting branch -D overlaps live .git/config readers. On POSIX it is a deterministic fallback-path smoke test (a concurrent read and rename never collide). On Windows it is a canary: a flake here during normal CI runs is empirical proof the fallback path needs closing (which would require synchronous/foreground fallback removal — a real perf regression, hence deferred to evidence rather than decided on speculation). The canary asserts only the immediate race signal (prune exit + absence of the .git/config error); removal-completion polling was intentionally dropped after it proved load-sensitive (a 60s wait_for timeout under heavy concurrent load) and carries no canary value.

Testing

pre-commit run --all-files clean; full suite green (3737 passed, 0 skipped); prune suite 36 passed, 0 failed. The fix's correctness is established by macOS reasoning + the green macOS suite; the actual Windows-only behavior can only be confirmed by test (windows) here (the originally-flaking test_prune_locally_merged_when_upstream_diverged going/staying green) and by the canary's flake rate over subsequent Windows runs.

Ref #2801

…git/config race)

On Windows, `wt step prune` intermittently failed with
`unable to access '.git/config': Permission denied`: parallel
`integration_reason` checks read `.git/config` while the inline removal's
synchronous `git branch -D` rewrites it via lockfile+rename. Windows holds
the file with delete access during the rename and a concurrent reader's
`fopen` (no FILE_SHARE_DELETE, no retry) fails; POSIX is unaffected.

Add a prune-local `RwLock<()>`: each `integration_reason` call takes a read
guard, `try_remove` takes the write guard, so no reader git process is in
flight while the synchronous fast-path `branch -D` runs. Parallel checks
themselves are not serialized; streaming/interleaved output is unchanged.

The cross-fs/.gitmodules/Windows-file-lock fallback defers `branch -D` into
a detached command (the worktree still references the branch), which runs
outside the guard — documented in-code as a known residual gap, not closed.
`test_prune_fallback_config_race_canary` exercises that fallback under
parallel fan-out: a deterministic POSIX pass, a Windows canary for whether
the residual gap matters in practice.

Ref #2801

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@max-sixty max-sixty merged commit 6fd2b70 into main May 19, 2026
34 checks passed
@max-sixty max-sixty deleted the prune-windows-race branch May 19, 2026 17:35
max-sixty added a commit that referenced this pull request May 19, 2026
…Context (#2811)

Follow-up to #2808. That PR fixed the Windows `.git/config` race in `wt
step prune` by adding a prune-local `RwLock` parameter to `try_remove`,
pushing it to 8 parameters and a `#[allow(clippy::too_many_arguments)]`
— deliberately deferred as an optional standalone cleanup to keep the
race fix minimal.

Of `try_remove`'s 8 arguments, only `candidate` varies; the other 7 are
identical at all three call sites in `step_prune` (two in the `rx` loop,
one for `deferred_current`). This bundles them into a
`RemovalContext<'a>` struct of borrows, constructed once and passed by
reference, following the existing `CommandContext` /
`RemovedWorktreeOutputContext` / `TemplateContext` pattern. Each call
site collapses from a 9-line argument block to one line, the
loop-invariant relationship is now expressed in the type rather than
re-asserted by hand three times, and the `#[allow]` is removed (clippy
is clean without it).

The B-prime concurrency semantics are intact: the struct carries
`&RwLock<()>` like the other borrows, and `try_remove` still acquires
the `check_lock` write guard as the first statement of its body, held
over the whole body — load-bearing for the Windows fix.

Pure refactor, no behavior change. Rebased onto main after #2806 (the
approval-boundary TOCTOU fix), which renamed `try_remove`'s `run_hooks:
bool` to `hook_plan: &ApprovedHookPlan`; `RemovalContext` carries that
field unchanged.

Follow-up commit: the struct refactor incidentally pushed
`handle_remove_output(...)` past rustfmt's `fn_call_width` (the
`ctx.`-prefixed args), splitting it multi-line and moving the trailing
`)?;` onto its own line — which LLVM-cov flags as a `codecov/patch` miss
despite the line executing (line 205 is covered and 205→210 is
straight-line). Reading the `Copy` fields into locals keeps that call
byte-identical to its covered form on main, so the diff stays a faithful
no-behavior-change refactor.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants