Fix: Data race in AsyncBlockTests with lock-free opcode logging (#938)#939
Closed
Fix: Data race in AsyncBlockTests with lock-free opcode logging (#938)#939
Conversation
Access violation crash in `AsyncBlockTests::VerifyAsyncBlockReuse` due to concurrent `std::vector::push_back` operations on shared `opCodes` member from overlapping async call lifecycle phases. The test intentionally reuses `XAsyncBlock` and `FactorialCallData` across sequential async calls. When the first call's cleanup (invoked from completion callback) races with the second call's initialization, both threads attempt to push_back into the same `std::vector`, causing heap corruption during vector reallocation. **Crash stack trace:** ``` ucrtbased!_free_dbg (heap corruption during vector realloc) ← std::vector<XAsyncOp>::push_back ← FactorialWorkerSimple (Cleanup opcode from first call) ← AsyncState::~AsyncState ← CompletionCallback [concurrent with] ← FactorialWorkerSimple (Begin/DoWork from second call) ← VerifyAsyncBlockReuse ``` **Detection:** Heisenbug found after 6-hour soak test under Windows CDB with page heap enabled (`gflags /p /enable`). Replace `std::vector<XAsyncOp> opCodes` with fixed-capacity lock-free append buffer: - `std::array<std::atomic<XAsyncOp>, 16>` for storage (capacity exceeds max test depth) - `std::atomic<size_t>` for thread-safe index allocation - `RecordOp(op)`: atomic fetch-add for index, then `store(memory_order_release)` to array slot - `GetOpCodes()`: snapshot current state into vector via `load(memory_order_acquire)` **Why this approach:** - Aligns with library philosophy of avoiding synchronization primitives - No dynamic allocation eliminates reallocation races - Bounded opcode sequences (max ~9 in distributed factorial tests) - Append-only during async lifecycle, read-only during verification - Proper release-acquire semantics ensure visibility on ARM/weakly-ordered architectures - Natural lock-free semantics: each writer gets unique slot via atomic index - `Tests/UnitTests/Tests/AsyncBlockTests.cpp`: - Added `#include <array>` for std::array support - Replaced `std::vector<XAsyncOp> opCodes` with `std::atomic<XAsyncOp>` array buffer - Updated `FactorialWorkerSimple` and `FactorialWorkerDistributed` to use `RecordOp()` - Updated all test verification sites to use `GetOpCodes()` snapshot method - Optimized multi-call sites to snapshot once and reuse - **Specific test**: `VerifyAsyncBlockReuse` passes 10/10 rapid runs - **Full suite**: All 23 AsyncBlockTests passed with no regressions - **Note**: Original heisenbug required 6hr soak to reproduce; single-pass testing verifies compilation and basic functionality, but extended soak testing would be needed to fully validate stability under stress - Test-only change, no production code affected - Eliminates data race without introducing mutex overhead - Maintains test semantics and coverage
Collaborator
Author
|
Alternatively, I can add this fix directly to #935. Please advise. |
Collaborator
Author
|
Closing this PR in favor of directly updating #935 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
fyi: this will merge conflict with PR #935. I'll fix the conflict once that PR lands in main.
Overview
Fixes intermittent access violation crash in
AsyncBlockTests::VerifyAsyncBlockReusecaused by concurrent
std::vector::push_backoperations during async block reuse.Problem Statement (#938)
The Bug
Under stress testing with page heap,
VerifyAsyncBlockReusecrashes with access violationin
ucrtbased!_free_dbgduring vector reallocation.Root cause: The test intentionally reuses
XAsyncBlockand sharedFactorialCallDataacross sequential async calls. When the first call's cleanup (running in completion callback)
races with the second call's initialization (running on main thread), both threads concurrently
call
push_backon the sharedstd::vector<XAsyncOp> opCodes, corrupting the heap duringreallocation.
Detection
Race Condition
Solution
Approach
Replace
std::vector<XAsyncOp> opCodeswith lock-free fixed-capacity append buffer:Implementation
Thread-safe append with proper memory ordering for ARM:
Snapshot for verification with acquire semantics:
Snapshot optimization for multiple verification calls:
Why This Design
Testing
Validation Results
VerifyAsyncBlockReuse(targeted)AsyncBlockTestssuiteTest Coverage
recommended to fully validate fix under production-like conditions
Build Verification
Impact Assessment
Scope
AsyncBlockTests.cpp)Risk
Checklist
Additional Notes
This fix demonstrates the value of stress testing with page heap enabled. The race
condition is subtle and timing-dependent, only manifesting under specific concurrency
patterns during async block reuse. The solution maintains the test's intent (verify
XAsyncBlock reuse semantics) while eliminating the data race through a design that
fits naturally with the library's lock-free architecture.