Summary
In aicpu_executor.cpp, the idle-dispatch batch loop can index core_id_map_[-1], causing an out-of-bounds memory access that corrupts core state and stalls the scheduler.
Root Cause
The idle-dispatch code (Phase 4) computes want = valid_cluster_states.count() (the number of idle clusters) and passes it as the max_count argument to pop_ready_tasks_batch(). This implicitly assumes 1 task = 1 cluster, but an SPMD task with logical_block_num >> cluster_count (e.g. 256 blocks on 24 clusters) consumes all idle clusters in a single do-while pass.
When got > 1 (multiple tasks in the batch), the first task's do-while exhausts all idle clusters. The for (bi) loop then advances to the next task and re-enters the do-while. Because a do-while executes its body before evaluating the guard, pop_first() is called on an empty bitmask and returns -1.
The -1 is then used as cluster_offset in:
- core_id_map_[-1] → out-of-bounds array read
- 1ULL << -1 in change_core_state → undefined behavior (shift by a negative count)
This corrupts core_states_, causing subsequent scheduling decisions to be wrong, eventually leading to a scheduler stall.
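To make the failure mode concrete, here is a minimal self-contained C++ sketch of the flawed shape. ClusterMask, pop_first(), and the loop below are hypothetical stand-ins modeled on this report's description, not the actual aicpu_executor.cpp code:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the idle-cluster bitmask described above:
// pop_first() returns the lowest set bit's index and clears it, or -1
// when the mask is empty (the value that then escapes into core_id_map_).
struct ClusterMask {
    uint64_t bits;
    int pop_first() {
        if (bits == 0) return -1;          // empty mask: the bad -1
        int idx = __builtin_ctzll(bits);   // index of lowest set bit
        bits &= bits - 1;                  // clear that bit
        return idx;
    }
    int count() const { return __builtin_popcountll(bits); }
};

// Models the flawed batch shape: task 0 drains the mask inside its
// do-while, then task 1's do-while body runs unconditionally and calls
// pop_first() on the now-empty mask. Returns task 1's first offset.
int second_task_first_offset(int cluster_count, int first_task_blocks) {
    ClusterMask mask{(1ULL << cluster_count) - 1};  // all clusters idle
    do {
        (void)mask.pop_first();            // task 0 consumes one cluster
    } while (--first_task_blocks > 0 && mask.count() > 0);
    return mask.pop_first();               // task 1's unguarded body
}
```

With the paged-attention numbers above (256 blocks on 24 clusters), the sketch returns -1, the value the real code then uses as an array index and a shift count.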
Trigger Condition
- logical_block_num >> cluster_count (e.g. paged-attention with block_num = batch × q_loop = 256, cluster_count = 24)
- Multiple tasks queued simultaneously in the ready queue (bn > 1, or any scenario where pop_ready_tasks_batch returns got > 1)
- The first task in the batch exhausts all idle clusters, leaving none for subsequent tasks
With bn=1 the bug is masked because only one task is in the ready queue at a time, so got is always 1.
Affected Files
- src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (idle dispatch do-while, ~L1831)
- src/a5/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (same pattern, ~L1814)
Proposed Fix
Convert the do-while to a guarded pattern: check valid_cluster_states.has_value() before entering the loop body. When clusters are exhausted mid-batch, re-enqueue the remaining tasks and break out of the for loop. The change is minimal, with negligible hot-path cost (a single branch per task).
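A sketch of the guarded pattern, using the same hypothetical ClusterMask stand-in rather than the real aicpu_executor.cpp types; the actual fix would also re-enqueue the undispatched tail of the batch:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Same hypothetical bitmask as in the description above.
struct ClusterMask {
    uint64_t bits;
    int pop_first() {
        if (bits == 0) return -1;
        int idx = __builtin_ctzll(bits);
        bits &= bits - 1;
        return idx;
    }
    int count() const { return __builtin_popcountll(bits); }
};

// Guarded dispatch sketch: the emptiness check runs before each task's
// loop body, so pop_first() is never called on an empty mask. Returns
// how many tasks were dispatched; the caller re-enqueues the rest.
int dispatch_batch_guarded(const std::vector<int>& block_counts, int cluster_count) {
    ClusterMask mask{(1ULL << cluster_count) - 1};  // all clusters idle
    int dispatched = 0;
    for (int blocks : block_counts) {
        if (mask.count() == 0) break;      // exhausted mid-batch: stop here
        while (blocks > 0 && mask.count() > 0) {
            int off = mask.pop_first();    // guard guarantees off >= 0
            assert(off >= 0);
            --blocks;
        }
        ++dispatched;
    }
    return dispatched;
}
```

Under the trigger scenario (first task with 256 blocks, 24 clusters, a second task queued), the guarded version dispatches only the first task and leaves the second for re-enqueue instead of indexing with -1.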
Impact
- Severity: High — out-of-bounds memory access plus undefined behavior (negative shift) on hardware
- Scope: any SPMD workload with logical_block_num > cluster_count and concurrent task submission