Missing MTE2→V synchronization when loop body has zero iterations #533

@YunjiQin

Summary

With --enable-insert-sync, ptoas does not insert a synchronization between the TLOAD (MTE2 pipe) operations before a loop and the TROWEXPANDDIV (V pipe) operation after it when the loop executes zero iterations at runtime.

Version

ptoas v0.26 (v0.25 is likely affected as well)

Reproduction

Input .pto pattern (simplified from a real qwen3 attention kernel):

// Before loop: 3 TLOADs from GM to UB (MTE2 pipe)
TLOAD(oi_tile, gm_oi)       // MTE2: load to UB[6304]
TLOAD(mi_tile, gm_mi)       // MTE2: load to UB[10400]
TLOAD(li_tile, gm_li)       // MTE2: load to UB[10432]

// Loop: iterates from 1 to ctx_blocks (dynamic scalar)
for (v41 = 1; v41 < ctx_blocks; v41++) {
    // Loop body has proper MTE2<->V synchronization via wait_flag/set_flag
    ...
}

// After loop: V pipe reads from UB
TROWEXPANDDIV(result, oi_tile, li_tile)   // V pipe: reads UB[6304] and UB[10432]

Compile command:

ptoas input.pto -o output.cpp --enable-insert-sync --pto-level=level3

Generated C++ (relevant section):

// Lines 74-75: V pipe grants MTE2 permission to proceed
set_flag(PIPE_V, PIPE_MTE2, EVENT_ID0);
set_flag(PIPE_V, PIPE_MTE2, EVENT_ID1);

// Lines 76-88: MTE2 loads (async, no completion wait)
TLOAD(v29, v32);   // oi → UB[6304]
TLOAD(v33, v36);   // mi → UB[10400]
TLOAD(v37, v40);   // li → UB[10432]

// Line 89: Loop — when ctx_blocks=1, range is [1,1) → ZERO iterations
for (size_t v41 = 1; v41 < ctx_blocks; v41++) {
    // Contains wait_flag(PIPE_V, PIPE_MTE2, EVENT_ID0) at line 96
    // This implicitly syncs the initial TLOADs — but ONLY if the loop executes
    ...
}

// Line 204: Only syncs V pipe, NOT MTE2!
pipe_barrier(PIPE_V);

// Line 205: V pipe reads from UB — TLOADs may not have completed!
TROWEXPANDDIV(v88, v29, v37);

Root Cause

When the loop executes ≥1 iteration, the wait_flag(PIPE_V, PIPE_MTE2, ...) inside the loop body implicitly ensures the initial TLOADs have completed before any V pipe operation reads the UB data.

When the loop executes zero iterations, this synchronization is skipped entirely. The pipe_barrier(PIPE_V) at line 204 only orders the V pipe against itself; it does not wait for MTE2 to complete. The subsequent TROWEXPANDDIV can therefore read stale or uninitialized UB data.
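
To make the control-flow argument concrete, here is a toy model in plain C++. It does not call the real pipe intrinsics; PostLoopBarrier and post_loop_read_is_ordered are hypothetical names for illustration, and the model simply takes the analysis above at face value (the in-loop sync covers the pre-loop TLOADs, pipe_barrier(PIPE_V) does not).

```cpp
#include <cstddef>

// Toy model only: plain C++ stand-ins, not the real pipe intrinsics.
enum class PostLoopBarrier { PipeV, PipeAll };

// Returns true when, in this model, the V-pipe read after the loop is
// guaranteed to be ordered after the pre-loop TLOADs.
bool post_loop_read_is_ordered(std::size_t ctx_blocks, PostLoopBarrier barrier) {
    // The pre-loop TLOADs are asynchronous, so nothing is guaranteed yet.
    bool v_ordered_after_loads = false;

    // Same bounds as the generated loop: [1, ctx_blocks).
    for (std::size_t v41 = 1; v41 < ctx_blocks; ++v41) {
        // Per the analysis above, the flag-based sync in the loop body
        // also covers the pre-loop TLOADs.
        v_ordered_after_loads = true;
    }

    // pipe_barrier(PIPE_V) orders the V pipe only against itself;
    // pipe_barrier(PIPE_ALL) waits on every pipe, including MTE2.
    if (barrier == PostLoopBarrier::PipeAll) {
        v_ordered_after_loads = true;
    }
    return v_ordered_after_loads;
}
```

In this model, ctx_blocks = 1 with a PIPE_V barrier is the only combination from the report that leaves the read unordered; either one loop iteration or a PIPE_ALL barrier closes the gap.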

Impact

  • Deterministic wrong results (not intermittent) when the loop trip count is 0
  • Output values are wildly incorrect (e.g., ±398 vs expected ±0.1), with occasional NaN
  • Affects any kernel where TLOADs feed into post-loop V-pipe operations via a potentially-zero-trip loop

Workaround

Manually changing line 204 from pipe_barrier(PIPE_V) to pipe_barrier(PIPE_ALL) in the generated C++ fixes the issue. All tests pass after this change.

Expected Fix

The --enable-insert-sync pass should insert an MTE2→V synchronization after the loop (e.g., a set_flag(PIPE_MTE2, PIPE_V, ...) / wait_flag(PIPE_MTE2, PIPE_V, ...) pair, or pipe_barrier(PIPE_ALL)) when all of the following hold:

  1. There are pending MTE2 operations (TLOADs) before the loop
  2. The loop trip count is dynamic (may be zero)
  3. V pipe operations after the loop read from UB addresses written by those TLOADs
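
The three conditions above could be combined into a single check along these lines. This is only a sketch of the decision logic, not the pass's actual implementation: UbRange, overlaps, and needs_post_loop_mte2_to_v_sync are hypothetical names, and a loop whose trip count is not known at compile time is modeled as an empty std::optional.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical description of a UB byte range: [begin, begin + size).
struct UbRange {
    std::uint64_t begin;
    std::uint64_t size;
};

// True when the two half-open byte ranges share at least one byte.
bool overlaps(const UbRange& a, const UbRange& b) {
    return a.begin < b.begin + b.size && b.begin < a.begin + a.size;
}

// Sketch of the decision: is an extra MTE2->V sync needed after the loop?
bool needs_post_loop_mte2_to_v_sync(
    const std::vector<UbRange>& pending_mte2_writes,   // TLOADs before the loop
    std::optional<std::uint64_t> static_trip_count,    // empty = dynamic
    const std::vector<UbRange>& post_loop_v_reads) {   // V-pipe reads after the loop
    // Condition 1: there must be pending MTE2 writes before the loop.
    if (pending_mte2_writes.empty()) return false;

    // Condition 2: the loop may run zero times (dynamic, or statically zero).
    const bool may_be_zero =
        !static_trip_count.has_value() || *static_trip_count == 0;
    if (!may_be_zero) return false;

    // Condition 3: a post-loop V read touches UB bytes written by a TLOAD.
    for (const UbRange& w : pending_mte2_writes) {
        for (const UbRange& r : post_loop_v_reads) {
            if (overlaps(w, r)) return true;
        }
    }
    return false;
}
```

For the kernel in this report, the writes would start at UB offsets 6304, 10400, and 10432 and the reads at 6304 and 10432, with a dynamic trip count, so the check would return true. The tile sizes are not stated in the report and would have to be supplied by the pass.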

Context

Discovered while investigating qwen3 decode attention kernel failures in hw-native-sys/pypto#1098. The kernel implements online-softmax with a dynamic number of context blocks. When seq_len ≤ SEQ_TILE (context blocks = 1), the accumulation loop has zero iterations, triggering this bug.
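
The trip-count arithmetic behind that trigger can be sketched as follows. The ceiling division is an assumption about how the kernel derives the number of context blocks (the report only states that seq_len ≤ SEQ_TILE gives one block), and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: the number of context blocks is ceil(seq_len / seq_tile).
std::size_t ctx_blocks_for(std::size_t seq_len, std::size_t seq_tile) {
    assert(seq_tile > 0);
    return (seq_len + seq_tile - 1) / seq_tile;
}

// Trip count of `for (v41 = 1; v41 < ctx_blocks; v41++)`.
std::size_t accumulation_trip_count(std::size_t ctx_blocks) {
    return ctx_blocks > 1 ? ctx_blocks - 1 : 0;
}
```

Under that assumption, any seq_len from 1 up to and including seq_tile yields one context block and therefore a trip count of zero, which is exactly the case where the post-loop read loses its synchronization.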

Status

In Progress
