Merged
Conversation
brianpepin
approved these changes
Feb 19, 2026
Contributor
brianpepin
left a comment
There was a problem hiding this comment.
These changes look good to me; thanks for contributing.
Collaborator
Author
|
Sneaking in another change to relax a time-based test, found while soaking under cdb with page-heap. I've found that a loaded system sometimes needs >50ms epsilon to service delayed tasks. |
jasonsandlin
approved these changes
Feb 19, 2026
6 tasks
Collaborator
Author
|
Don't merge this PR just yet. There was an independent latent bug #938 and the fix obsoletes the need for a change in this PR. |
…etime • Relax VerifyDistributedAsyncCall timing by allowing a per-iteration slack window and wait for Cleanup to be recorded before opcode verification. • Add explicit completion signaling and queue-drain waits in VerifyWaitForCompletion, VerifySimpleAsyncCall, and VerifyDistributedAsyncCall to avoid threadpool timing races. • Hold a per-call reference in FactorialWorkerSimple/FactorialWorkerDistributed to prevent UAF during asynchronous callbacks. • In _VerifyQueueTermination, only wait for counts to settle when termination is non-blocking.
Access violation crash in `AsyncBlockTests::VerifyAsyncBlockReuse` due to concurrent `std::vector::push_back` operations on shared `opCodes` member from overlapping async call lifecycle phases. The test intentionally reuses `XAsyncBlock` and `FactorialCallData` across sequential async calls. When the first call's cleanup (invoked from completion callback) races with the second call's initialization, both threads attempt to push_back into the same `std::vector`, causing heap corruption during vector reallocation. **Crash stack trace:** ``` ucrtbased!_free_dbg (heap corruption during vector realloc) ← std::vector<XAsyncOp>::push_back ← FactorialWorkerSimple (Cleanup opcode from first call) ← AsyncState::~AsyncState ← CompletionCallback [concurrent with] ← FactorialWorkerSimple (Begin/DoWork from second call) ← VerifyAsyncBlockReuse ``` **Detection:** Heisenbug found after 6-hour soak test under Windows CDB with page heap enabled (`gflags /p /enable`). Replace `std::vector<XAsyncOp> opCodes` with fixed-capacity lock-free append buffer: - `std::array<std::atomic<XAsyncOp>, 16>` for storage (capacity exceeds max test depth) - `std::atomic<size_t>` for thread-safe index allocation - `RecordOp(op)`: atomic fetch-add for index, then `store(memory_order_release)` to array slot - `GetOpCodes()`: snapshot current state into vector via `load(memory_order_acquire)` **Why this approach:** - Aligns with library philosophy of avoiding synchronization primitives - No dynamic allocation eliminates reallocation races - Bounded opcode sequences (max ~9 in distributed factorial tests) - Append-only during async lifecycle, read-only during verification - Proper release-acquire semantics ensure visibility on ARM/weakly-ordered architectures - Natural lock-free semantics: each writer gets unique slot via atomic index - `Tests/UnitTests/Tests/AsyncBlockTests.cpp`: - Added `#include <array>` for std::array support - Replaced `std::vector<XAsyncOp> opCodes` with `std::atomic<XAsyncOp>` array buffer - Updated `FactorialWorkerSimple` and `FactorialWorkerDistributed` to use `RecordOp()` - Updated all test verification sites to use `GetOpCodes()` snapshot method - Optimized multi-call sites to snapshot once and reuse - **Specific test**: `VerifyAsyncBlockReuse` passes 10/10 rapid runs - **Full suite**: All 23 AsyncBlockTests passed with no regressions - **Note**: Original heisenbug required 6hr soak to reproduce; single-pass testing verifies compilation and basic functionality, but extended soak testing would be needed to fully validate stability under stress - Test-only change, no production code affected - Eliminates data race without introducing mutex overhead - Maintains test semantics and coverage
be45d64 to
14286a1
Compare
Collaborator
Author
|
Clean and ready for review. |
jasonsandlin
approved these changes
Feb 19, 2026
Merged
5 tasks
jasonsandlin
pushed a commit
that referenced
this pull request
Feb 20, 2026
…sults in AsyncBlockTests (#940) * fix 935: drain queue before verifying opcodes in all verification tests Apply queue drain timing fix to tests that verify async opcodes. These tests were checking opcodes immediately after async completion without ensuring all cleanup work had been recorded to the opcode log, causing intermittent test failures. * Refine cdb test soak script Refactor cdb test script to capture stacks independently, as well as output log, stacks, and dmp for all abnormal exits (including Ctrl+C). * Fix AsyncBlockTests: drain queue before verifying in al tests Apply consistent queue drain pattern to 8 AsyncBlockTests before final queue verification to eliminate timing races where cleanup work completes asynchronously after XAsyncGetStatus() returns. Root Cause: The async framework's Cleanup operation is initiated by the provider but completed asynchronously through the task queue. Tests checking queue state or opcode snapshots immediately after XAsyncGetStatus() could race with the pending Cleanup work, resulting in intermittent failures (heisenbug-like behavior with "8 vs 9 opcodes" or "queue not empty" errors). Solution: All queue verification now preceded by explicit drain loop: - Checks both Completion and Work ports - 10ms sleep granularity, 2000ms timeout - Ensures all async cleanup completes before verification
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem (#932, #938)
Unit tests for async operations and task queues exhibit intermittent failures on win32 target (others untested) due to timing assumptions, races, and use-after-free in test fixtures (#932). In addition,
AsyncBlockTests::VerifyAsyncBlockReusecan crash under stress when concurrent async lifecycle phases append to a sharedstd::vector(#938).Key failure modes:
VerifyDuplicateQueueHandletests an error path by passing a freed queue handle, which can be dereferenced in the duplicate-handle failure path.opCodes.push_back()and heap corruption during vector reallocation.Solution
Stabilize async/task-queue unit tests and harden factorial worker lifetime (#932)
VerifySimpleAsyncCall,VerifyWaitForCompletion, andVerifyDistributedAsyncCallto ensure async operations complete before verification.VerifyDistributedAsyncCalltiming check (5 iterations x 100ms with one-interval slack).FactorialWorkerSimpleandFactorialWorkerDistributedto prevent use-after-free during async callbacks._VerifyQueueTermination, only wait for counts to settle for non-blocking termination cases.Fix use-after-free in test VerifyDuplicateQueueHandle (#932)
XTaskQueueDuplicateHandlefailure path.Fix data race in AsyncBlockTests opcode logging (#938)
std::vector<XAsyncOp>with a fixed-capacity lock-free log:std::array<std::atomic<XAsyncOp>, 16>+std::atomic<size_t>.store(memory_order_release)for writes andload(memory_order_acquire)for snapshots to ensure visibility on weakly-ordered architectures (ARM).RecordOp()for lock-free append andGetOpCodes()for snapshots.Impact
Testing
VerifyAsyncBlockReuse: 10/10 passes.