[anneal][v2] (Buggy) atomic toolchain management #3375

Closed: joshlf wants to merge 1 commit into main from G3gq762w6ejo7xxhycou2wdqjw7osy6qd

@joshlf (Member) commented May 15, 2026

This commit introduces an initial implementation of atomic toolchain
installation and garbage collection. However, concurrency analysis has
identified several critical race conditions and robustness flaws in this
implementation:

  1. Readdir TOCTOU Flaw: fs::read_dir does not return an atomic snapshot of
    the directory. If a concurrent installation creates a temporary symlink
    and its target directory mid-scan, GC can miss the symlink (if it is placed
    before the scan cursor in hash order) while still discovering the directory
    (if it is placed after the cursor). GC then treats the directory as
    unreferenced and deletes it while it is being populated.
  2. Multiple Canonical Destinations Race: gc only performs post-iteration
    symlink verification on the single dst path passed to install. If
    multiple canonical toolchains (e.g., arm vs x86) share the parent directory,
    a concurrent update to another toolchain can be missed during read_dir
    and its valid directory deleted.
  3. Unguarded Resource Leak: std::os::unix::fs::symlink and fs::create_dir
    are called before the RAII guard is armed. If directory creation fails
    or a panic occurs, orphaned temporary symlinks are left on disk permanently.
  4. Infinite Spin-Loop on Deletion Errors: The while dir_path.exists() loop
    in GC silently ignores all errors from fs::remove_dir_all. If a directory
    cannot be removed due to permission or I/O errors, the loop never
    terminates and the worker busy-spins indefinitely.
  5. Readdir Metadata Abortion: Calling entry.file_type()? on filesystems where
    d_type is DT_UNKNOWN triggers lstat. If the entry has been deleted
    concurrently, lstat fails with ENOENT and the ? operator propagates the
    error, aborting the entire operation.
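
Flaws 4 and 5 are local and can be illustrated in isolation. The following is a minimal sketch, not the code from this commit: the helper names `scan_dirs` and `remove_once` are hypothetical. It shows a scan that skips entries which vanish between `readdir` and `lstat`, and a deletion that runs exactly once and reports failures to the caller instead of retrying. It does not address flaws 1 and 2, which need synchronization.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: lists the subdirectories of `parent`, skipping any
/// entry that disappears between being listed and being inspected.
pub fn scan_dirs(parent: &Path) -> io::Result<Vec<PathBuf>> {
    let mut dirs = Vec::new();
    for entry in fs::read_dir(parent)? {
        let entry = entry?;
        match entry.file_type() {
            Ok(ty) if ty.is_dir() => dirs.push(entry.path()),
            // Regular files and symlinks are not candidates here.
            Ok(_) => {}
            // The entry was removed concurrently; ignore it rather than
            // aborting the whole scan.
            Err(e) if e.kind() == io::ErrorKind::NotFound => {}
            Err(e) => return Err(e),
        }
    }
    Ok(dirs)
}

/// Hypothetical helper: attempts the deletion exactly once. A directory that
/// is already gone counts as success; any other error is returned so the
/// caller can log it or skip the directory, with no retry loop.
pub fn remove_once(dir: &Path) -> io::Result<()> {
    match fs::remove_dir_all(dir) {
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(()),
        other => other,
    }
}
```

With this shape, a permission or I/O error surfaces as an `Err` from `remove_once`, so a caller can decide whether to continue with the remaining directories.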

Proposed Holistic Redesign:
To address these flaws, we propose moving to a quiescent Readers-Writers
synchronization model paired with structured artifact naming:

  • Directory Locking: Use advisory file locking via flock(2) on parent/.lock.
    install acts as a Reader (LOCK_SH), allowing parallel installations. gc
    acts as a Writer (LOCK_EX | LOCK_NB), ensuring that garbage collection only
    runs when zero installations and zero other GC workers are active.
  • Quiescent Mark-and-Sweep: While gc holds LOCK_EX, no lock-respecting
    installer or GC worker can modify the directory, so iteration is quiescent.
    This eliminates the readdir TOCTOU races and the ENOENT failures from
    concurrently deleted entries.
  • Structured Artifact Lifecycle: Arm the RAII guard prior to creating files.
    Prefix temporary files distinctively (e.g., .tmp.*), allowing GC under
    exclusive lock to safely identify and sweep any orphaned artifacts from
    terminated processes.
  • Unconditional Single Deletion: Remove the while exists() spin-loop and
    perform a single robust deletion per unreachable directory.
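
The locking scheme above could look roughly like the sketch below. It is written under several assumptions and is not the implementation proposed in this PR: the names `DirLock`, `lock_for_install`, and `try_lock_for_gc` are hypothetical, `flock` is declared by hand to keep the example dependency-free, and the flag values are the ones used on Linux, macOS, and the BSDs (real code would more likely take them from the `libc` crate). Retrying on `EINTR` is omitted.

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::os::raw::c_int;
use std::os::unix::io::AsRawFd;
use std::path::Path;

// flock(2) operation flags. Assumption: Linux/macOS/BSD values.
const LOCK_SH: c_int = 1;
const LOCK_EX: c_int = 2;
const LOCK_NB: c_int = 4;

unsafe extern "C" {
    // Hand-written declaration of flock(2) from the platform C library.
    fn flock(fd: c_int, operation: c_int) -> c_int;
}

/// Hypothetical guard type: the lock is held for as long as this value
/// lives, because closing the file releases the flock.
pub struct DirLock {
    _file: File,
}

fn open_lock_file(parent: &Path) -> io::Result<File> {
    OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .truncate(false)
        .open(parent.join(".lock"))
}

/// Reader side, for `install`: blocks while a GC holds the exclusive lock,
/// but any number of installers may hold the shared lock at once.
pub fn lock_for_install(parent: &Path) -> io::Result<DirLock> {
    let file = open_lock_file(parent)?;
    // SAFETY: `file` is open, so its descriptor is valid during the call.
    if unsafe { flock(file.as_raw_fd(), LOCK_SH) } != 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(DirLock { _file: file })
}

/// Writer side, for `gc`: returns `Ok(None)` instead of blocking when an
/// installer or another GC worker currently holds the lock.
pub fn try_lock_for_gc(parent: &Path) -> io::Result<Option<DirLock>> {
    let file = open_lock_file(parent)?;
    // SAFETY: `file` is open, so its descriptor is valid during the call.
    if unsafe { flock(file.as_raw_fd(), LOCK_EX | LOCK_NB) } != 0 {
        let err = io::Error::last_os_error();
        return if err.kind() == io::ErrorKind::WouldBlock {
            Ok(None)
        } else {
            Err(err)
        };
    }
    Ok(Some(DirLock { _file: file }))
}
```

In this sketch, `install` would hold the value returned by `lock_for_install` for the whole installation, and `gc` would simply return early when `try_lock_for_gc` yields `None`. Recent Rust standard library releases also expose file-locking methods on `File`, which may remove the need for the hand-written declaration.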

⬇️ Download this PR

Branch

git fetch origin refs/heads/G3gq762w6ejo7xxhycou2wdqjw7osy6qd && git checkout -b pr-G3gq762w6ejo7xxhycou2wdqjw7osy6qd FETCH_HEAD

Checkout

git fetch origin refs/heads/G3gq762w6ejo7xxhycou2wdqjw7osy6qd && git checkout FETCH_HEAD

Cherry Pick

git fetch origin refs/heads/G3gq762w6ejo7xxhycou2wdqjw7osy6qd && git cherry-pick FETCH_HEAD

Pull

git pull origin refs/heads/G3gq762w6ejo7xxhycou2wdqjw7osy6qd

Stacked PRs enabled by GHerrit.

gherrit-pr-id: G3gq762w6ejo7xxhycou2wdqjw7osy6qd
@joshlf joshlf closed this May 15, 2026
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.88%. Comparing base (be6f199) to head (1cb1153).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3375   +/-   ##
=======================================
  Coverage   91.88%   91.88%           
=======================================
  Files          20       20           
  Lines        6076     6076           
=======================================
  Hits         5583     5583           
  Misses        493      493           
