[anneal][v2] (Buggy) atomic toolchain management#3375
Closed
joshlf wants to merge 1 commit into
Closed
Conversation
This commit introduces an initial implementation of atomic toolchain installation and garbage collection. However, rigorous concurrency analysis has identified several critical race conditions and robustness flaws in this implementation: 1. Readdir TOCTOU Flaw: `fs::read_dir` is non-atomic. If a concurrent installation creates a temporary symlink and target directory mid-scan, GC can miss the symlink (if placed before the scan cursor in hash order) while discovering the directory (if placed after the cursor). GC will then delete the active directory while it is being populated. 2. Multiple Canonical Destinations Race: `gc` only performs post-iteration symlink verification on the single `dst` path passed to `install`. If multiple canonical toolchains (e.g., arm vs x86) share the parent directory, a concurrent update to another toolchain can be missed during `read_dir` and its valid directory deleted. 3. Unguarded Resource Leak: `std::os::unix::fs::symlink` and `fs::create_dir` are called before the RAII guard is armed. If directory creation fails or a panic occurs, orphaned temporary symlinks are left on disk permanently. 4. Infinite Spin-Loop on Deletion Errors: `while dir_path.exists()` in GC silently ignores all errors from `fs::remove_dir_all`. If a directory cannot be removed due to permission or I/O errors, the worker enters an infinite CPU lockup loop. 5. Readdir Metadata Abortion: Calling `entry.file_type()?` on filesystems where d_type is DT_UNKNOWN triggers `lstat`. If another thread deleted the entry concurrently, this fails with ENOENT and aborts the entire operation. Proposed Holistic Redesign: To address these flaws, we propose moving to a quiescent Readers-Writers synchronization model paired with structured artifact naming: - Directory Locking: Use POSIX file locking on `parent/.lock`. `install` acts as a Reader (`LOCK_SH`), allowing parallel installations. `gc` acts as a Writer (`LOCK_EX | LOCK_NB`), ensuring that garbage collection only runs when zero installations and zero other GC workers are active. - Quiescent Mark-and-Sweep: Under `LOCK_EX`, directory iteration is completely quiescent, eliminating readdir TOCTOU races and metadata deletion errors. - Structured Artifact Lifecycle: Arm the RAII guard prior to creating files. Prefix temporary files distinctively (e.g., `.tmp.*`), allowing GC under exclusive lock to safely identify and sweep any orphaned artifacts from terminated processes. - Unconditional Single Deletion: Remove the `while exists()` spin-loop and perform a single robust deletion per unreachable directory. gherrit-pr-id: G3gq762w6ejo7xxhycou2wdqjw7osy6qd
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3375 +/- ##
=======================================
Coverage 91.88% 91.88%
=======================================
Files 20 20
Lines 6076 6076
=======================================
Hits 5583 5583
Misses 493 493 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This commit introduces an initial implementation of atomic toolchain
installation and garbage collection. However,
analysis has identified several critical race conditions and robustness
flaws in this implementation:
fs::read_diris non-atomic. If a concurrentinstallation creates a temporary symlink and target directory mid-scan,
GC can miss the symlink (if placed before the scan cursor in hash order)
while discovering the directory (if placed after the cursor). GC will
then delete the active directory while it is being populated.
gconly performs post-iterationsymlink verification on the single
dstpath passed toinstall. Ifmultiple canonical toolchains (e.g., arm vs x86) share the parent directory,
a concurrent update to another toolchain can be missed during
read_dirand its valid directory deleted.
std::os::unix::fs::symlinkandfs::create_dirare called before the RAII guard is armed. If directory creation fails
or a panic occurs, orphaned temporary symlinks are left on disk permanently.
while dir_path.exists()in GCsilently ignores all errors from
fs::remove_dir_all. If a directory cannotbe removed due to permission or I/O errors, the worker enters an infinite
CPU lockup loop.
entry.file_type()?on filesystems whered_type is DT_UNKNOWN triggers
lstat. If another thread deleted the entryconcurrently, this fails with ENOENT and aborts the entire operation.
Proposed Holistic Redesign:
To address these flaws, we propose moving to a quiescent Readers-Writers
synchronization model paired with structured artifact naming:
parent/.lock.installacts as a Reader (
LOCK_SH), allowing parallel installations.gcactsas a Writer (
LOCK_EX | LOCK_NB), ensuring that garbage collection onlyruns when zero installations and zero other GC workers are active.
LOCK_EX, directory iteration is completelyquiescent, eliminating readdir TOCTOU races and metadata deletion errors.
Prefix temporary files distinctively (e.g.,
.tmp.*), allowing GC underexclusive lock to safely identify and sweep any orphaned artifacts from
terminated processes.
while exists()spin-loop andperform a single robust deletion per unreachable directory.
⬇️ Download this PR
Branch
git fetch origin refs/heads/G3gq762w6ejo7xxhycou2wdqjw7osy6qd && git checkout -b pr-G3gq762w6ejo7xxhycou2wdqjw7osy6qd FETCH_HEADCheckout
git fetch origin refs/heads/G3gq762w6ejo7xxhycou2wdqjw7osy6qd && git checkout FETCH_HEADCherry Pick
git fetch origin refs/heads/G3gq762w6ejo7xxhycou2wdqjw7osy6qd && git cherry-pick FETCH_HEADPull
Stacked PRs enabled by GHerrit.