Summary
In aicpu_executor.cpp, the idle-dispatch batch loop can index core_id_map_[-1], causing an out-of-bounds memory access that corrupts core state and stalls the scheduler.
Root Cause
The idle-dispatch code (Phase 4) computes want = valid_cluster_states.count() (the number of idle clusters) and passes it as the max_count argument to pop_ready_tasks_batch(). This implicitly assumes 1 task = 1 cluster, but an SPMD task with logical_block_num >> cluster_count (e.g. 256 blocks on 24 clusters) consumes all idle clusters in a single do-while pass.
When got > 1 (multiple tasks in the batch), the first task's do-while exhausts all idle clusters. The for (bi) loop then advances to the next task and re-enters the do-while. Because a do-while executes its body before evaluating the guard, pop_first() is called on an empty bitmask and returns -1.
The -1 is then used as cluster_offset in:
- core_id_map_[-1] → out-of-bounds array read
- 1ULL << -1 in change_core_state → undefined behavior (shift by a negative count)
This corrupts core_states_, causing subsequent scheduling decisions to be wrong, eventually leading to a scheduler stall.
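To make the failure mode concrete, here is a minimal self-contained C++ sketch of the flawed shape. ClusterMask, pop_first(), and the loop below are hypothetical stand-ins modeled on this report's description, not the actual aicpu_executor.cpp code:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the idle-cluster bitmask described above:
// pop_first() returns the lowest set bit's index and clears it, or -1
// when the mask is empty (the value that then escapes into core_id_map_).
struct ClusterMask {
    uint64_t bits;
    int pop_first() {
        if (bits == 0) return -1;          // empty mask: the bad -1
        int idx = __builtin_ctzll(bits);   // index of lowest set bit
        bits &= bits - 1;                  // clear that bit
        return idx;
    }
    int count() const { return __builtin_popcountll(bits); }
};

// Models the flawed batch shape: task 0 drains the mask inside its
// do-while, then task 1's do-while body runs unconditionally and calls
// pop_first() on the now-empty mask. Returns task 1's first offset.
int second_task_first_offset(int cluster_count, int first_task_blocks) {
    ClusterMask mask{(1ULL << cluster_count) - 1};  // all clusters idle
    do {
        (void)mask.pop_first();            // task 0 consumes one cluster
    } while (--first_task_blocks > 0 && mask.count() > 0);
    return mask.pop_first();               // task 1's unguarded body
}
```

With the paged-attention numbers above (256 blocks on 24 clusters), the sketch returns -1, the value the real code then uses as an array index and a shift count.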
Trigger Condition
- logical_block_num >> cluster_count (e.g. paged-attention with block_num = batch × q_loop = 256, cluster_count = 24)
- Multiple tasks queued simultaneously in the ready queue (bn > 1, or any scenario where pop_ready_tasks_batch returns got > 1)
- The first task in the batch exhausts all idle clusters, leaving none for subsequent tasks
With bn=1 the bug is masked because only one task is in the ready queue at a time, so got is always 1.
Affected Files
- src/a2a3/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (idle dispatch do-while, ~L1831)
- src/a5/runtime/tensormap_and_ringbuffer/aicpu/aicpu_executor.cpp (same pattern, ~L1814)
Proposed Fix
Convert the do-while to a guarded pattern: check valid_cluster_states.has_value() before entering the loop body. When clusters are exhausted mid-batch, re-enqueue the remaining tasks and break out of the for loop. The change is minimal, with negligible hot-path cost (a single branch per task).
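A sketch of the guarded pattern, using the same hypothetical ClusterMask stand-in rather than the real aicpu_executor.cpp types; the actual fix would also re-enqueue the undispatched tail of the batch:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Same hypothetical bitmask as in the description above.
struct ClusterMask {
    uint64_t bits;
    int pop_first() {
        if (bits == 0) return -1;
        int idx = __builtin_ctzll(bits);
        bits &= bits - 1;
        return idx;
    }
    int count() const { return __builtin_popcountll(bits); }
};

// Guarded dispatch sketch: the emptiness check runs before each task's
// loop body, so pop_first() is never called on an empty mask. Returns
// how many tasks were dispatched; the caller re-enqueues the rest.
int dispatch_batch_guarded(const std::vector<int>& block_counts, int cluster_count) {
    ClusterMask mask{(1ULL << cluster_count) - 1};  // all clusters idle
    int dispatched = 0;
    for (int blocks : block_counts) {
        if (mask.count() == 0) break;      // exhausted mid-batch: stop here
        while (blocks > 0 && mask.count() > 0) {
            int off = mask.pop_first();    // guard guarantees off >= 0
            assert(off >= 0);
            --blocks;
        }
        ++dispatched;
    }
    return dispatched;
}
```

Under the trigger scenario (first task with 256 blocks, 24 clusters, a second task queued), the guarded version dispatches only the first task and leaves the second for re-enqueue instead of indexing with -1.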
Impact
- Severity: High — out-of-bounds memory access plus undefined behavior (negative shift) on hardware
- Scope: any SPMD workload with logical_block_num > cluster_count and concurrent task submission