
Bug: dispatch batch loop OOB when SPMD task drains all idle clusters #565

@chenshengxin2026

Description

Summary

In aicpu_executor.cpp, the idle-dispatch batch loop can index core_id_map_[-1], causing an out-of-bounds memory access that corrupts core state and stalls the scheduler.

Root Cause

The idle-dispatch code (Phase 4) computes want = valid_cluster_states.count() (the number of idle clusters) and passes it as max_count to pop_ready_tasks_batch(). This implicitly assumes one task consumes one cluster, but an SPMD task with logical_block_num >> cluster_count (e.g. 256 blocks on 24 clusters) consumes every idle cluster in a single do-while pass.

When got > 1 (multiple tasks in the batch), the first task's do-while exhausts all clusters. The for (bi) loop then advances to the next task and enters its do-while; because a do-while body executes unconditionally before the guard is checked, pop_first() is called on an empty bitmask and returns -1.

The -1 is then used as cluster_offset in:

  • core_id_map_[-1] → array out-of-bounds read
  • 1ULL << -1 in change_core_state → undefined behavior (negative shift)

This corrupts core_states_, causing subsequent scheduling decisions to be wrong, eventually leading to a scheduler stall.
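For illustration, the failure shape can be reproduced in a self-contained sketch. The names ClusterBitmask and buggy_dispatch, and the one-pop-per-pass accounting, are hypothetical simplifications standing in for the executor's real types, not the code in aicpu_executor.cpp:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the executor's idle-cluster bitmask.
struct ClusterBitmask {
    uint64_t bits = 0;
    bool has_value() const { return bits != 0; }
    // Lowest set bit index, or -1 when the mask is empty -- the value
    // that later indexes core_id_map_[-1] in the real bug.
    int pop_first() {
        if (bits == 0) return -1;
        int idx = __builtin_ctzll(bits);
        bits &= bits - 1;  // clear the bit we just claimed
        return idx;
    }
};

// Mirrors the Phase-4 shape: each do-while pass claims one cluster, and
// the body runs once before the guard, even when no clusters are left.
std::vector<int> buggy_dispatch(ClusterBitmask mask, int got,
                                int blocks_per_task) {
    std::vector<int> offsets;
    for (int bi = 0; bi < got; ++bi) {
        int remaining = blocks_per_task;
        do {
            offsets.push_back(mask.pop_first());  // -1 once drained
            --remaining;
        } while (remaining > 0 && mask.has_value());
    }
    return offsets;
}
```

With two idle clusters and got = 2, the first task drains the mask, and the second task's unconditional first pass records -1, matching the core_id_map_[-1] access described above.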

Trigger Condition

  • logical_block_num >> cluster_count (e.g. paged-attention with block_num = batch × q_loop = 256, cluster_count = 24)
  • Multiple tasks queued simultaneously in the ready queue (bn > 1, or any scenario where pop_ready_tasks_batch returns got > 1)
  • The first task in the batch exhausts all idle clusters, leaving none for subsequent tasks

With bn=1 the bug is masked because only one task is in the ready queue at a time, so got is always 1.

Affected Files

  • src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (idle dispatch do-while, ~L1831)
  • src/a5/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (same pattern, ~L1814)

Proposed Fix

Convert the do-while to a guarded pattern: check valid_cluster_states.has_value() before entering the loop body. When clusters are exhausted mid-batch, re-enqueue the remaining tasks and break out of the for loop. The change is minimal and costs a single branch check on the hot path.
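The guarded shape can be sketched under the same simplified model; ClusterBitmask, guarded_dispatch, and the requeued counter are hypothetical stand-ins for the executor's real types and its ready-queue re-enqueue call:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the executor's idle-cluster bitmask.
struct ClusterBitmask {
    uint64_t bits = 0;
    bool has_value() const { return bits != 0; }
    int pop_first() {
        if (bits == 0) return -1;
        int idx = __builtin_ctzll(bits);
        bits &= bits - 1;
        return idx;
    }
};

// Guarded variant: check has_value() BEFORE touching a task; when the
// clusters run out mid-batch, count the untouched tasks for re-enqueue
// instead of letting pop_first() return -1.
std::vector<int> guarded_dispatch(ClusterBitmask mask, int got,
                                  int blocks_per_task, int& requeued) {
    std::vector<int> offsets;
    requeued = 0;
    for (int bi = 0; bi < got; ++bi) {
        if (!mask.has_value()) {
            requeued = got - bi;  // push these back to the ready queue
            break;
        }
        int remaining = blocks_per_task;
        while (remaining > 0 && mask.has_value()) {
            offsets.push_back(mask.pop_first());  // mask non-empty here
            --remaining;
        }
    }
    return offsets;
}
```

In the same two-cluster, two-task scenario, no offset can be -1: the second task is counted for re-enqueue rather than dispatched against an empty mask.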

Impact

  • Severity: High — OOB memory access + UB on hardware
  • Scope: Any SPMD workload with logical_block_num > cluster_count and concurrent task submission
