Fix: Data race in AsyncBlockTests with lock-free opcode logging (#938) #939

Closed
jhugard wants to merge 1 commit into main from fix-test-datarace-in-verifyasyncblockreuse

@jhugard jhugard commented Feb 19, 2026

FYI: this will cause a merge conflict with PR #935. I'll resolve the conflict once that PR lands in main.

Overview

Fixes an intermittent access violation crash in AsyncBlockTests::VerifyAsyncBlockReuse
caused by concurrent std::vector::push_back calls during async block reuse.

Problem Statement (#938)

The Bug

Under stress testing with page heap, VerifyAsyncBlockReuse crashes with an access violation
in ucrtbased!_free_dbg during vector reallocation.

Root cause: The test intentionally reuses an XAsyncBlock and a shared FactorialCallData
across sequential async calls. When the first call's cleanup (running in the completion callback)
races with the second call's initialization (running on the main thread), both threads concurrently
call push_back on the shared std::vector<XAsyncOp> opCodes, corrupting the heap during
reallocation.

Detection

  • Method: 6-hour soak test under Windows CDB with page heap enabled
  • Frequency: Heisenbug - intermittent failure under high concurrency
  • Environment: Debug x64 build with Application Verifier

Race Condition

Thread 1 (Completion):          Thread 2 (Test):
└─ Cleanup(first call)          └─ Begin(second call)
   └─ opCodes.push_back() ──────┬── opCodes.push_back()
                                │
                            [RACE: concurrent vector mutation]

Solution

Approach

Replace std::vector<XAsyncOp> opCodes with a lock-free, fixed-capacity append buffer:

// Before:
std::vector<XAsyncOp> opCodes;

// After:
static constexpr size_t MAX_OPCODES = 16;
std::array<std::atomic<XAsyncOp>, MAX_OPCODES> opCodesArray{};
std::atomic<size_t> opCodesCount{ 0 };

Implementation

  1. Thread-safe append with proper memory ordering for ARM:

    void RecordOp(XAsyncOp op) {
        size_t idx = opCodesCount.fetch_add(1, std::memory_order_relaxed);
        if (idx < MAX_OPCODES) {
            opCodesArray[idx].store(op, std::memory_order_release);
        }
    }

  2. Snapshot for verification with acquire semantics:

    std::vector<XAsyncOp> GetOpCodes() const {
        size_t count = opCodesCount.load(std::memory_order_acquire);
        count = (count < MAX_OPCODES) ? count : MAX_OPCODES;
        std::vector<XAsyncOp> result;
        result.reserve(count);
        for (size_t i = 0; i < count; i++) {
            result.push_back(opCodesArray[i].load(std::memory_order_acquire));
        }
        return result;
    }

  3. Snapshot optimization for multiple verification calls:

    // Snapshot once, reuse for multiple checks:
    auto opCodes = data.Ref->GetOpCodes();
    VerifyHasOp(opCodes, XAsyncOp::Cancel);
    VerifyHasOp(opCodes, XAsyncOp::Cleanup);
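
The three steps above can be put together in a minimal, self-contained sketch. The `OpLog` wrapper and the `HasOp` helper are hypothetical names used only for illustration (in the PR the members appear to live on FactorialCallData, and verification goes through VerifyHasOp), and the `XAsyncOp` enum here is a stand-in so the snippet compiles without the library headers.

```cpp
#include <algorithm>
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the library's XAsyncOp enum so the sketch compiles on its own.
enum class XAsyncOp : std::uint32_t { Begin, DoWork, GetResult, Cancel, Cleanup };

// Hypothetical wrapper type (assumption): groups the buffer members from the PR.
struct OpLog {
    static constexpr std::size_t MAX_OPCODES = 16;
    std::array<std::atomic<XAsyncOp>, MAX_OPCODES> opCodesArray{};
    std::atomic<std::size_t> opCodesCount{0};

    // Reserve a unique slot, then publish the opcode into it.
    void RecordOp(XAsyncOp op) {
        std::size_t idx = opCodesCount.fetch_add(1, std::memory_order_relaxed);
        if (idx < MAX_OPCODES) {
            opCodesArray[idx].store(op, std::memory_order_release);
        }
    }

    // Copy the recorded opcodes into a plain vector for verification.
    std::vector<XAsyncOp> GetOpCodes() const {
        std::size_t count = opCodesCount.load(std::memory_order_acquire);
        count = (count < MAX_OPCODES) ? count : MAX_OPCODES;
        std::vector<XAsyncOp> result;
        result.reserve(count);
        for (std::size_t i = 0; i < count; i++) {
            result.push_back(opCodesArray[i].load(std::memory_order_acquire));
        }
        return result;
    }
};

// Hypothetical non-asserting counterpart of the test's VerifyHasOp helper.
inline bool HasOp(const std::vector<XAsyncOp>& ops, XAsyncOp op) {
    return std::find(ops.begin(), ops.end(), op) != ops.end();
}
```

One caveat worth keeping in mind: the count is incremented before the slot is stored, so a reader that runs concurrently with RecordOp could see a reserved slot that has not been written yet. The snapshot is only reliably complete once the writers are known to have finished, which matches the "append during lifecycle, verify afterwards" usage described here.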

Why This Design

  • Lock-free: Aligns with library philosophy of avoiding synchronization primitives
  • No allocation: Fixed capacity eliminates reallocation races
  • Bounded: Max test depth ~9 opcodes, capacity=16 provides margin
  • ARM-safe: Release-acquire semantics ensure visibility on weakly-ordered architectures
  • Natural semantics: Append-only during async lifecycle, read-only verification
  • Minimal change: Test behavior and coverage unchanged
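
The "Bounded" point has one consequence that is easy to miss: once the buffer is full, further opcodes are dropped silently. The sketch below uses a deliberately tiny capacity to show that behavior. `BoundedLog`, `Stored()`, and `Overflowed()` are hypothetical names and are not part of this PR; the idea is only that the counter keeps counting attempts, so a test could assert that the margin was never exceeded.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical miniature of the buffer; capacity 4 makes overflow easy to trigger.
struct BoundedLog {
    static constexpr std::size_t kCapacity = 4;
    std::array<std::atomic<int>, kCapacity> slots{};
    std::atomic<std::size_t> attempts{0};

    void Record(int value) {
        std::size_t idx = attempts.fetch_add(1, std::memory_order_relaxed);
        if (idx < kCapacity) {
            slots[idx].store(value, std::memory_order_release);
        }
        // Otherwise the value is dropped, but the attempt is still counted.
    }

    // Number of values actually stored (clamped to capacity).
    std::size_t Stored() const {
        std::size_t n = attempts.load(std::memory_order_acquire);
        return n < kCapacity ? n : kCapacity;
    }

    // True if at least one Record() call was dropped.
    bool Overflowed() const {
        return attempts.load(std::memory_order_acquire) > kCapacity;
    }
};
```

With the PR's numbers (about 9 opcodes against a capacity of 16) such a check should never fire, but it would turn a future silent truncation into an explicit test failure.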

Testing

Validation Results

Test                                Runs   Passed      Failed
VerifyAsyncBlockReuse (targeted)    10     10 runs     0
Full AsyncBlockTests suite          1      23 tests    0
Test Coverage

  • ✅ All AsyncBlockTests pass without regression
  • ✅ Rapid successive runs show immediate stability
  • ⚠️ Note: The original bug required a 6-hour soak to reproduce; extended stress testing
    is recommended to fully validate the fix under production-like conditions

Build Verification

  • ✅ Debug x64 build successful
  • ✅ No new compiler warnings
  • ✅ Test DLL loads and executes correctly

Impact Assessment

Scope

  • Code affected: Test harness only (AsyncBlockTests.cpp)
  • Production impact: None (test-only change)
  • API changes: None
  • Binary compatibility: Not applicable

Risk

  • Low: Isolated to test infrastructure
  • Regression risk: Minimal - test semantics preserved
  • Performance: Negligible (fixed allocation vs. vector overhead)

Checklist

  • Code compiles without warnings
  • All existing tests pass
  • No production code changes
  • Fix aligns with library design philosophy (lock-free)
  • Issue documented with root cause analysis
  • Commit message includes testing details

Additional Notes

This fix demonstrates the value of stress testing with page heap enabled. The race
condition is subtle and timing-dependent, only manifesting under specific concurrency
patterns during async block reuse. The solution maintains the test's intent (verify
XAsyncBlock reuse semantics) while eliminating the data race through a design that
fits naturally with the library's lock-free architecture.

Access violation crash in `AsyncBlockTests::VerifyAsyncBlockReuse` due to
concurrent `std::vector::push_back` operations on shared `opCodes` member
from overlapping async call lifecycle phases.

The test intentionally reuses `XAsyncBlock` and `FactorialCallData` across
sequential async calls. When the first call's cleanup (invoked from the completion
callback) races with the second call's initialization, both threads attempt to
`push_back` into the same `std::vector`, causing heap corruption during vector
reallocation.

**Crash stack trace:**
```
ucrtbased!_free_dbg (heap corruption during vector realloc)
  ← std::vector<XAsyncOp>::push_back
  ← FactorialWorkerSimple (Cleanup opcode from first call)
  ← AsyncState::~AsyncState
  ← CompletionCallback
  [concurrent with]
  ← FactorialWorkerSimple (Begin/DoWork from second call)
  ← VerifyAsyncBlockReuse
```

**Detection:** Heisenbug found after 6-hour soak test under Windows CDB with
page heap enabled (`gflags /p /enable`).

Replace `std::vector<XAsyncOp> opCodes` with fixed-capacity lock-free
append buffer:

- `std::array<std::atomic<XAsyncOp>, 16>` for storage (capacity exceeds max test depth)
- `std::atomic<size_t>` for thread-safe index allocation
- `RecordOp(op)`: atomic fetch-add for index, then `store(memory_order_release)` to array slot
- `GetOpCodes()`: snapshot current state into vector via `load(memory_order_acquire)`

**Why this approach:**
- Aligns with library philosophy of avoiding synchronization primitives
- No dynamic allocation eliminates reallocation races
- Bounded opcode sequences (max ~9 in distributed factorial tests)
- Append-only during async lifecycle, read-only during verification
- Proper release-acquire semantics ensure visibility on ARM/weakly-ordered architectures
- Natural lock-free semantics: each writer gets unique slot via atomic index
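
As a rough illustration of the "unique slot via atomic index" point, the following sketch has two threads append to a shared buffer and reads it back only after both are joined. `AppendBuffer` and `RunConcurrentAppend` are hypothetical names, and plain `int` payloads stand in for `XAsyncOp` values; this is not the test code from the PR.

```cpp
#include <algorithm>
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-alone buffer; int payloads stand in for XAsyncOp values.
struct AppendBuffer {
    static constexpr std::size_t kCapacity = 16;
    std::array<std::atomic<int>, kCapacity> slots{};
    std::atomic<std::size_t> count{0};

    void Append(int value) {
        // fetch_add hands every caller a distinct index, so writers never share a slot.
        std::size_t idx = count.fetch_add(1, std::memory_order_relaxed);
        if (idx < kCapacity) {
            slots[idx].store(value, std::memory_order_release);
        }
    }
};

// Two writers append four distinct values each. The snapshot is taken only
// after both threads are joined, so every reserved slot has been written.
inline std::vector<int> RunConcurrentAppend() {
    AppendBuffer buffer;
    std::thread first([&buffer] {
        for (int i = 0; i < 4; ++i) buffer.Append(100 + i);
    });
    std::thread second([&buffer] {
        for (int i = 0; i < 4; ++i) buffer.Append(200 + i);
    });
    first.join();
    second.join();

    std::size_t n = buffer.count.load(std::memory_order_acquire);
    n = (n < AppendBuffer::kCapacity) ? n : AppendBuffer::kCapacity;
    std::vector<int> snapshot;
    snapshot.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        snapshot.push_back(buffer.slots[i].load(std::memory_order_acquire));
    }
    std::sort(snapshot.begin(), snapshot.end());  // interleaving order is not deterministic
    return snapshot;
}
```

The order in which the two threads' values interleave varies from run to run, which is why the sketch sorts before returning; what stays constant is that all eight values are present and none is overwritten.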

- `Tests/UnitTests/Tests/AsyncBlockTests.cpp`:
  - Added `#include <array>` for std::array support
  - Replaced `std::vector<XAsyncOp> opCodes` with `std::atomic<XAsyncOp>` array buffer
  - Updated `FactorialWorkerSimple` and `FactorialWorkerDistributed` to use `RecordOp()`
  - Updated all test verification sites to use `GetOpCodes()` snapshot method
  - Optimized multi-call sites to snapshot once and reuse

- **Specific test**: `VerifyAsyncBlockReuse` passes 10/10 rapid runs
- **Full suite**: All 23 AsyncBlockTests passed with no regressions
- **Note**: Original heisenbug required 6hr soak to reproduce; single-pass
  testing verifies compilation and basic functionality, but extended soak
  testing would be needed to fully validate stability under stress

- Test-only change, no production code affected
- Eliminates data race without introducing mutex overhead
- Maintains test semantics and coverage

jhugard commented Feb 19, 2026

Alternatively, I can add this fix directly to #935. Please advise.

jhugard commented Feb 19, 2026

Closing this PR in favor of directly updating #935

@jhugard jhugard closed this Feb 19, 2026
@jhugard jhugard deleted the fix-test-datarace-in-verifyasyncblockreuse branch February 19, 2026 19:12