
Stabilize unit tests (#932, #938)#935

Merged
jasonsandlin merged 4 commits into main from stabilize-queue-unit-tests
Feb 19, 2026

Conversation

@jhugard
Collaborator

@jhugard jhugard commented Feb 18, 2026

Problem (#932, #938)

Unit tests for async operations and task queues fail intermittently on the win32 target (other targets untested) because of timing assumptions, races, and a use-after-free in test fixtures (#932). In addition, AsyncBlockTests::VerifyAsyncBlockReuse can crash under stress when concurrent async lifecycle phases append to a shared std::vector (#938).

Key failure modes:

Solution

Stabilize async/task-queue unit tests and harden factorial worker lifetime (#932)

  • Explicit synchronization: Add wait loops in VerifySimpleAsyncCall, VerifyWaitForCompletion, and VerifyDistributedAsyncCall to ensure async operations complete before verification.
  • Queue drains: Wait for both Work and Completion ports to be empty before verifying final state.
  • Relaxed timing: Allow per-iteration slack window for VerifyDistributedAsyncCall timing check (5 iterations x 100ms with one-interval slack).
  • Lifetime hardening: Hold per-call references in FactorialWorkerSimple and FactorialWorkerDistributed to prevent use-after-free during async callbacks.
  • Conditional waits: In _VerifyQueueTermination, only wait for counts to settle for non-blocking termination cases.
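
The lifetime-hardening bullet above can be illustrated with a minimal sketch, assuming an intrusive reference count. `CallData`, `StartCall`, and `g_liveCallData` are hypothetical names used only for illustration; they are not taken from the test sources, and the real workers receive their context through the async provider callback rather than a raw `std::thread`.

```cpp
#include <atomic>
#include <thread>

// Test-only bookkeeping so the sketch can show when the object is destroyed.
std::atomic<int> g_liveCallData{0};

// Hypothetical stand-in for the test's per-call data (not the real FactorialCallData).
struct CallData
{
    std::atomic<int> refs{1};  // the creator's reference
    int result = 0;

    CallData() { g_liveCallData.fetch_add(1); }
    ~CallData() { g_liveCallData.fetch_sub(1); }

    void AddRef() { refs.fetch_add(1, std::memory_order_relaxed); }

    void Release()
    {
        // acq_rel: writes made before any Release are visible to whoever deletes.
        if (refs.fetch_sub(1, std::memory_order_acq_rel) == 1)
        {
            delete this;
        }
    }
};

// Starts an "async call" that holds its own reference for the whole call and
// drops it in the cleanup step, so the callback never touches freed memory
// even if the creator releases its reference first.
inline std::thread StartCall(CallData* data, int value)
{
    data->AddRef();  // per-call reference, released at cleanup
    return std::thread([data, value] {
        data->result = value * 2;  // stands in for the DoWork phase
        data->Release();           // stands in for the Cleanup phase
    });
}
```

With this shape, the creator may call `Release()` immediately after `StartCall()`; the object stays alive until the worker's cleanup step drops the last reference.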

Fix use-after-free in test VerifyDuplicateQueueHandle (#932)

  • Replace freed queue handle with stack-allocated fake handle (64-byte aligned buffer) for error testing.
  • Prevents dereferencing of previously-freed memory in the XTaskQueueDuplicateHandle failure path.
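
A rough sketch of the fake-handle idea, under stated assumptions: the handle layout, the `kQueueSignature` value, and the `DuplicateHandle` validator below are all hypothetical stand-ins, not the real XTaskQueue internals. The point is only that an invalid handle backed by live stack memory lets the failure path read valid bytes instead of freed heap.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical signature and layout; the real queue object may differ.
constexpr std::uint32_t kQueueSignature = 0x58544B51;

struct QueueObject
{
    std::uint32_t signature;
};
using QueueHandle = QueueObject*;

// Hypothetical validator standing in for the failure path of a
// duplicate-handle call: rejects anything without the expected signature.
inline bool DuplicateHandle(QueueHandle handle, QueueHandle* out)
{
    std::uint32_t sig = 0;
    std::memcpy(&sig, handle, sizeof(sig));  // reads only memory that is still alive
    if (sig != kQueueSignature)
    {
        *out = nullptr;
        return false;
    }
    *out = handle;
    return true;
}

// Zero-filled, 64-byte aligned stack storage. It is never a real queue,
// so the signature check fails without touching freed memory.
struct alignas(64) FakeHandleStorage
{
    unsigned char bytes[64] = {};

    QueueHandle AsHandle() { return reinterpret_cast<QueueHandle>(bytes); }
};
```

A test would pass `FakeHandleStorage{}.AsHandle()` where it previously passed the closed handle and assert that duplication fails.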

Fix data race in AsyncBlockTests opcode logging (#938)

  • Replace std::vector<XAsyncOp> with a fixed-capacity lock-free log: std::array<std::atomic<XAsyncOp>, 16> + std::atomic<size_t>.
  • Use store(memory_order_release) for writes and load(memory_order_acquire) for snapshots to ensure visibility on weakly-ordered architectures (ARM).
  • Provide RecordOp() for lock-free append and GetOpCodes() for snapshots.
  • Snapshot optimization for repeated verification checks.

Impact

Testing

  • VerifyAsyncBlockReuse: 10/10 passes.
  • AsyncBlockTests suite: 23/23 passes.
  • Note: The original data-race crash required a long soak with page heap; extended stress testing is recommended for maximum confidence.

@jhugard jhugard changed the title Stabilize queue unit tests #932 Stabilize queue unit tests (#932) Feb 18, 2026
Contributor

@brianpepin brianpepin left a comment

These changes look good to me; thanks for contributing.

@jhugard
Collaborator Author

jhugard commented Feb 19, 2026

Sneaking in another change to relax a time-based test, found while soaking under cdb with page-heap. I've found that a loaded system sometimes needs >50ms epsilon to service delayed tasks.

@jhugard jhugard requested a review from brianpepin February 19, 2026 03:55
@jhugard
Collaborator Author

jhugard commented Feb 19, 2026

Don't merge this PR just yet. There was an independent latent bug #938 and the fix obsoletes the need for a change in this PR.

…etime

•	Relax VerifyDistributedAsyncCall timing by allowing a per-iteration slack window and wait for Cleanup to be recorded before opcode verification.
•	Add explicit completion signaling and queue-drain waits in VerifyWaitForCompletion, VerifySimpleAsyncCall, and VerifyDistributedAsyncCall to avoid threadpool timing races.
•	Hold a per-call reference in FactorialWorkerSimple/FactorialWorkerDistributed to prevent UAF during asynchronous callbacks.
•	In _VerifyQueueTermination, only wait for counts to settle when termination is non-blocking.
Access violation crash in `AsyncBlockTests::VerifyAsyncBlockReuse` due to
concurrent `std::vector::push_back` operations on shared `opCodes` member
from overlapping async call lifecycle phases.

The test intentionally reuses `XAsyncBlock` and `FactorialCallData` across
sequential async calls. When the first call's cleanup (invoked from completion
callback) races with the second call's initialization, both threads attempt to
push_back into the same `std::vector`, causing heap corruption during vector
reallocation.

**Crash stack trace:**
```
ucrtbased!_free_dbg (heap corruption during vector realloc)
  ← std::vector<XAsyncOp>::push_back
  ← FactorialWorkerSimple (Cleanup opcode from first call)
  ← AsyncState::~AsyncState
  ← CompletionCallback
  [concurrent with]
  ← FactorialWorkerSimple (Begin/DoWork from second call)
  ← VerifyAsyncBlockReuse
```

**Detection:** Heisenbug found after 6-hour soak test under Windows CDB with
page heap enabled (`gflags /p /enable`).

Replace `std::vector<XAsyncOp> opCodes` with a fixed-capacity, lock-free
append buffer:

- `std::array<std::atomic<XAsyncOp>, 16>` for storage (capacity exceeds max test depth)
- `std::atomic<size_t>` for thread-safe index allocation
- `RecordOp(op)`: atomic fetch-add for index, then `store(memory_order_release)` to array slot
- `GetOpCodes()`: snapshot current state into vector via `load(memory_order_acquire)`

**Why this approach:**
- Aligns with library philosophy of avoiding synchronization primitives
- No dynamic allocation eliminates reallocation races
- Bounded opcode sequences (max ~9 in distributed factorial tests)
- Append-only during async lifecycle, read-only during verification
- Proper release-acquire semantics ensure visibility on ARM/weakly-ordered architectures
- Natural lock-free semantics: each writer gets unique slot via atomic index

- `Tests/UnitTests/Tests/AsyncBlockTests.cpp`:
  - Added `#include <array>` for std::array support
  - Replaced `std::vector<XAsyncOp> opCodes` with `std::atomic<XAsyncOp>` array buffer
  - Updated `FactorialWorkerSimple` and `FactorialWorkerDistributed` to use `RecordOp()`
  - Updated all test verification sites to use `GetOpCodes()` snapshot method
  - Optimized multi-call sites to snapshot once and reuse

- **Specific test**: `VerifyAsyncBlockReuse` passes 10/10 rapid runs
- **Full suite**: All 23 AsyncBlockTests passed with no regressions
- **Note**: Original heisenbug required 6hr soak to reproduce; single-pass
  testing verifies compilation and basic functionality, but extended soak
  testing would be needed to fully validate stability under stress

- Test-only change, no production code affected
- Eliminates data race without introducing mutex overhead
- Maintains test semantics and coverage
@jhugard jhugard force-pushed the stabilize-queue-unit-tests branch from be45d64 to 14286a1 Compare February 19, 2026 18:46
@jhugard jhugard changed the title Stabilize queue unit tests (#932) Stabilize unit tests (#932, #938) Feb 19, 2026
@jhugard jhugard requested a review from jasonsandlin February 19, 2026 19:09
@jhugard
Collaborator Author

jhugard commented Feb 19, 2026

Clean and ready for review.

@jasonsandlin jasonsandlin merged commit b872e39 into main Feb 19, 2026
15 checks passed
@jasonsandlin jasonsandlin deleted the stabilize-queue-unit-tests branch February 19, 2026 19:25
jasonsandlin pushed a commit that referenced this pull request Feb 20, 2026
…sults in AsyncBlockTests (#940)

* fix 935: drain queue before verifying opcodes in all verification tests

Apply queue drain timing fix to tests that verify async opcodes.
These tests were checking opcodes immediately after async completion
without ensuring all cleanup work had been recorded to the opcode log,
causing intermittent test failures.

* Refine cdb test soak script

Refactor cdb test script to capture stacks independently, as well as
output log, stacks, and dmp for all abnormal exits (including Ctrl+C).

* Fix AsyncBlockTests: drain queue before verifying in all tests

Apply consistent queue drain pattern to 8 AsyncBlockTests before final queue
verification to eliminate timing races where cleanup work completes asynchronously
after XAsyncGetStatus() returns.

Root Cause:
  The async framework's Cleanup operation is initiated by the provider but
  completed asynchronously through the task queue. Tests checking queue state
  or opcode snapshots immediately after XAsyncGetStatus() could race with the
  pending Cleanup work, resulting in intermittent failures (heisenbug-like
  behavior with "8 vs 9 opcodes" or "queue not empty" errors).

Solution:
  All queue verification is now preceded by an explicit drain loop:
    - Checks both Completion and Work ports
    - 10ms sleep granularity, 2000ms timeout
    - Ensures all async cleanup completes before verification

3 participants