Refactor: enforce mix strict priority in scheduler dispatch #855
Merged
ChaoWao merged 1 commit on May 26, 2026
Code Review
This pull request refactors the scheduler's dispatch logic to implement a MIX-strict-priority policy, introducing a more structured two-phase dispatch process (IDLE and PENDING) within the new dispatch_ready_tasks method. It also adds cross-thread idle gating via has_idle_in_other_threads. Feedback was provided regarding the implementation of has_idle_in_other_threads, specifically noting that performing cross-thread reads without std::atomic constitutes a data race and undefined behavior under the C++ memory model, regardless of hardware-level atomicity guarantees on specific architectures.
1e30ea2 to e1728ea
… dispatch
Apply to both a2a3 and a5 runtimes. Phase 4 of resolve_and_dispatch is
reshaped from shape-outer/phase-inner into a new dispatch_ready_tasks
pass with phase-split semantics:
* IDLE-MIX runs first. If mix tasks remain (local_buf + ready_queue),
AIC and AIV yield both their IDLE and PENDING stages for the pass.
* MIX-PENDING is always considered next, gated only on whether any
peer scheduler thread has an idle cluster — so residual mix continues
to drain via pending slots regardless of skip_aic_aiv.
* After MIX-PENDING, AIC/AIV-PENDING runs only when mix is fully
drained and the corresponding shape has no peer idle core.
* Local buffers are flushed between the IDLE and PENDING stages so
PENDING-stage queue checks and peer threads see IDLE-stage results,
and again on every return path via an RAII FlushGuard so
release_fanin output during PENDING does not carry into the next
iteration's IDLE.
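The stage ordering above can be sketched as follows. This is a minimal illustration of one reading of the bullets, not the PR's implementation: PassInputs, dispatch_ready_tasks_sketch, and the string log are hypothetical, and the sketch assumes a PENDING stage runs only when no peer thread has an idle core of that shape. The FlushGuard here only records that a tail flush happened; the real guard's interface is not shown on this page.

```cpp
#include <string>
#include <vector>

// Hypothetical inputs for one dispatch pass (names are not from the PR).
struct PassInputs {
    bool mix_remaining;  // mix tasks left in local_buf + ready_queue after IDLE-MIX
    bool peer_idle_mix;  // some peer scheduler thread has an idle cluster
    bool peer_idle_aic;  // some peer scheduler thread has an idle AIC core
    bool peer_idle_aiv;  // some peer scheduler thread has an idle AIV core
};

// RAII helper: records a tail flush on every return path of the pass.
class FlushGuard {
public:
    explicit FlushGuard(std::vector<std::string>& log) : log_(log) {}
    ~FlushGuard() { log_.push_back("FLUSH"); }
    FlushGuard(const FlushGuard&) = delete;
    FlushGuard& operator=(const FlushGuard&) = delete;

private:
    std::vector<std::string>& log_;
};

// Appends the stages that would run, in order, for one pass.
void dispatch_ready_tasks_sketch(const PassInputs& in, std::vector<std::string>& log) {
    FlushGuard tail_flush(log);

    log.push_back("IDLE-MIX");
    // If mix work remains, AIC and AIV yield both IDLE and PENDING for this pass.
    const bool skip_aic_aiv = in.mix_remaining;
    if (!skip_aic_aiv) {
        log.push_back("IDLE-AIC");
        log.push_back("IDLE-AIV");
    }

    // Mid-pass flush so PENDING-stage checks and peers see IDLE-stage results.
    log.push_back("FLUSH");

    // MIX-PENDING is gated only on peer idleness, not on skip_aic_aiv.
    if (!in.peer_idle_mix) log.push_back("PENDING-MIX");

    if (skip_aic_aiv) return;  // early return: the guard still flushes

    if (!in.peer_idle_aic) log.push_back("PENDING-AIC");
    if (!in.peer_idle_aiv) log.push_back("PENDING-AIV");
}
```

With mix_remaining set and no idle peers, the log reads IDLE-MIX, FLUSH, PENDING-MIX, FLUSH: the early return still triggers the tail flush, which is the point of using a guard rather than a flush call before each return.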
The PMU single-issue short-circuit and the sync_start drain protocol
are preserved unchanged. a5 picks up the PMU guard alongside the new
policy (its prior implementation lacked it); there's no automated test
for this — PMU profiling correctness requires hardware PMU counters
and a single-issue baseline to compare against, neither of which the
sim suite provides. The change brings a5 in line with a2a3.
Cross-thread peer-tracker reads in has_idle_in_other_threads stay
plain (not atomic) and consume the value as a hint; the comment on
the implementation spells out the aarch64 single-copy-atomicity
argument and the drain-protocol exclusion.
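For comparison, the alternative raised in review (making the cross-thread read a std::atomic load) could look like the sketch below. It is not what this commit does. PeerTracker, idle_mask, and has_idle_in_other_threads_sketch are hypothetical stand-ins, since the real layout of core_states_ is not shown here. A relaxed load keeps the "stale hint" semantics while avoiding a data race under the C++ memory model; on aarch64 a relaxed 8-byte atomic load typically compiles to an ordinary load instruction.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical per-thread tracker; the bit encoding is an assumption.
struct PeerTracker {
    std::atomic<std::uint64_t> idle_mask{0};  // bit i set => core i is idle
};

// Returns true if any tracker other than `self` reports an idle core.
// The result is only a scheduling hint: it may be stale by the time it is
// used, and the caller is expected to re-check on its next iteration.
bool has_idle_in_other_threads_sketch(const PeerTracker* trackers, int count, int self) {
    for (int i = 0; i < count; ++i) {
        if (i == self) continue;
        // Relaxed: no ordering is needed for a hint, only a race-free read.
        if (trackers[i].idle_mask.load(std::memory_order_relaxed) != 0) {
            return true;
        }
    }
    return false;
}
```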
PTO2_SCHED_PROFILING note: local_overflow_count now accumulates each
batch separately as flush_local_bufs is called multiple times per
pass (mid flush + RAII tail flush). Each entry is still counted
exactly once (count is zeroed after push_batch), but the per-pass
total reflects "entries pushed to the global queue this pass" rather
than the pre-refactor "buf residual at pass end". When comparing traces
across commits, expect the post-refactor number to be greater than or
equal to the pre-refactor one.
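A small sketch of why the per-pass number can only grow, assuming the counter is bumped by the batch size on each flush. LocalBuf, PassStats, and flush_local_buf are hypothetical names that model the accounting only, not the real flush_local_bufs.

```cpp
#include <cstddef>
#include <vector>

struct LocalBuf {
    std::vector<int> entries;  // task ids waiting in the thread-local buffer
};

struct PassStats {
    std::size_t local_overflow_count = 0;  // entries pushed to the global queue this pass
};

// Pushes the whole buffer to the global queue, counts the batch once, and
// empties the buffer so no entry can be counted again by a later flush.
void flush_local_buf(LocalBuf& buf, std::vector<int>& global_queue, PassStats& stats) {
    stats.local_overflow_count += buf.entries.size();
    global_queue.insert(global_queue.end(), buf.entries.begin(), buf.entries.end());
    buf.entries.clear();
}
```

If the IDLE stage leaves 3 entries that the mid flush pushes, and the PENDING stage leaves 2 more for the tail flush, the per-pass total is 5, whereas a single end-of-pass residual count would have reported only the final 2.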
ChaoWao
approved these changes
May 26, 2026
Summary
Reshape Phase 4 of SchedulerContext::resolve_and_dispatch from shape-outer/phase-inner into a new dispatch_ready_tasks pass with phase-split semantics and cross-thread idle gating. Applied to both a2a3 and a5 tensormap_and_ringbuffer runtimes.
* MIX-PENDING is gated on has_idle_in_other_threads(MIX), not on the skip_aic_aiv flag; pending slots keep draining mix even when AIC/AIV are blocked.
* Local buffers are flushed between the IDLE and PENDING stages so the PENDING stage sees IDLE-stage release_fanin output; again at function end so PENDING-stage release_fanin output does not carry across iterations.
* PMU single-issue short-circuit and sync_start drain protocol preserved unchanged. a5 picks up the PMU guard alongside the new policy (its prior implementation lacked it).
* has_idle_in_other_threads reads peer trackers' core_states_ without explicit synchronization; aarch64 8-byte aligned single-copy atomicity covers the load, and the value is consumed as a scheduling hint (stale reads self-correct on the next iteration).
Testing
* test_scheduler_state (10/10), test_ready_queue (25/25); a5: test_a5_fatal (3/3); all green
* spmd_multiblock_mix, spmd_sync_start, spmd_sync_start_stress, spmd_sync_start_edge, spmd_sync_start_aiv, spmd_starvation, mixed_example: 7/7 passed (60.96s)
* (… benchmark_bgemm) on both arches: recommended before merge