Summary
--enable-insert-sync fails to insert a synchronization barrier between TLOAD (MTE2 pipe) operations before a loop and TROWEXPANDDIV (V pipe) after the loop, when the loop executes zero iterations at runtime.
Version
ptoas 0.26 (likely also affects 0.25)
Reproduction
Input .pto pattern (simplified from a real qwen3 attention kernel):
// Before loop: 3 TLOADs from GM to UB (MTE2 pipe)
TLOAD(oi_tile, gm_oi) // MTE2: load to UB[6304]
TLOAD(mi_tile, gm_mi) // MTE2: load to UB[10400]
TLOAD(li_tile, gm_li) // MTE2: load to UB[10432]
// Loop: iterates from 1 to ctx_blocks (dynamic scalar)
for (v41 = 1; v41 < ctx_blocks; v41++) {
    // Loop body has proper MTE2<->V synchronization via wait_flag/set_flag
    ...
}
// After loop: V pipe reads from UB
TROWEXPANDDIV(result, oi_tile, li_tile) // V pipe: reads UB[6304] and UB[10432]
Compile command:
ptoas input.pto -o output.cpp --enable-insert-sync --pto-level=level3
Generated C++ (relevant section):
// Lines 74-75: V pipe grants MTE2 permission to proceed
set_flag(PIPE_V, PIPE_MTE2, EVENT_ID0);
set_flag(PIPE_V, PIPE_MTE2, EVENT_ID1);
// Lines 76-88: MTE2 loads (async, no completion wait)
TLOAD(v29, v32); // oi → UB[6304]
TLOAD(v33, v36); // mi → UB[10400]
TLOAD(v37, v40); // li → UB[10432]
// Line 89: Loop — when ctx_blocks=1, range is [1,1) → ZERO iterations
for (size_t v41 = 1; v41 < ctx_blocks; v41++) {
    // Contains wait_flag(PIPE_V, PIPE_MTE2, EVENT_ID0) at line 96
    // This implicitly syncs the initial TLOADs, but ONLY if the loop executes
    ...
}
// Line 204: Only syncs V pipe, NOT MTE2!
pipe_barrier(PIPE_V);
// Line 205: V pipe reads from UB — TLOADs may not have completed!
TROWEXPANDDIV(v88, v29, v37);
Root Cause
When the loop executes ≥1 iteration, the wait_flag(PIPE_V, PIPE_MTE2, ...) inside the loop body implicitly ensures the initial TLOADs have completed before any V pipe operation reads the UB data.
When the loop executes zero iterations, this synchronization is skipped entirely. The pipe_barrier(PIPE_V) at line 204 only synchronizes the V pipe with itself; it does not wait for MTE2 to complete. Nothing orders the subsequent TROWEXPANDDIV after the three TLOADs, so it can read stale or uninitialized UB data.
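The two cases above can be illustrated with a small host-side model. This is only a sketch: it does not call the real set_flag / wait_flag / pipe_barrier intrinsics, the names postLoopReadIsOrdered and PostLoopBarrier are hypothetical, and it takes at face value the assumption from this report that the loop body's flag waits also order the pre-loop TLOADs.

```cpp
#include <cassert>
#include <cstddef>

// Host-side toy model of the hazard; NOT the real device intrinsics.
enum class PostLoopBarrier { PipeV, PipeAll };

// Returns true if the post-loop V-pipe read is ordered after the pre-loop TLOADs.
inline bool postLoopReadIsOrdered(std::size_t ctx_blocks, PostLoopBarrier barrier) {
    assert(ctx_blocks >= 1 && "model assumes at least one context block");

    // Three TLOADs were issued on MTE2 before the loop and may still be in flight.
    bool loads_in_flight = true;

    for (std::size_t v41 = 1; v41 < ctx_blocks; ++v41) {
        // Assumption from the report: the loop body's flag waits also order
        // the pre-loop TLOADs before any later V-pipe work.
        loads_in_flight = false;
    }

    // pipe_barrier(PIPE_V) orders the V pipe against itself only;
    // pipe_barrier(PIPE_ALL) waits for every pipe, including MTE2.
    if (barrier == PostLoopBarrier::PipeAll) {
        loads_in_flight = false;
    }

    // TROWEXPANDDIV is safe only if no pre-loop load is still pending.
    return !loads_in_flight;
}
```

In this model, ctx_blocks = 1 with the PIPE_V barrier is the only combination that leaves the loads unordered, which matches the failing configuration described here.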
Impact
- Deterministic wrong results (not intermittent) when the loop trip count is 0
- Output values are wildly incorrect (e.g., ±398 vs expected ±0.1), with occasional NaN
- Affects any kernel where TLOADs feed into post-loop V-pipe operations via a potentially-zero-trip loop
Workaround
Manually changing line 204 from pipe_barrier(PIPE_V) to pipe_barrier(PIPE_ALL) in the generated C++ fixes the issue. All tests pass after this change.
Expected Fix
The --enable-insert-sync pass should insert an MTE2→V barrier (e.g., set_flag(PIPE_MTE2, PIPE_V, ...); wait_flag(PIPE_MTE2, PIPE_V, ...) or pipe_barrier(PIPE_ALL)) after the loop, when:
- There are pending MTE2 operations (TLOADs) before the loop
- The loop trip count is dynamic (may be zero)
- V pipe operations after the loop read from UB addresses written by those TLOADs
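The three conditions above could be combined into a single predicate along these lines. This is a sketch only and does not reflect ptoas internals: LoopSyncQuery and needsPostLoopMte2ToVSync are hypothetical names, and comparing UB base addresses for equality is a simplification (a real pass would presumably check address-range overlap).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical summary of what a sync-insertion pass might know about one loop.
struct LoopSyncQuery {
    std::vector<std::uint32_t> preLoopMte2Writes;  // UB base addresses written by pre-loop TLOADs
    std::vector<std::uint32_t> postLoopVReads;     // UB base addresses read by post-loop V-pipe ops
    bool tripCountMayBeZero;                       // true when the trip count is dynamic
};

// True when an explicit MTE2->V synchronization is needed after the loop.
inline bool needsPostLoopMte2ToVSync(const LoopSyncQuery& q) {
    // If the loop always runs, the report assumes the body's own sync covers the loads.
    if (!q.tripCountMayBeZero) {
        return false;
    }
    // Simplification: match on base address equality instead of range overlap.
    for (std::uint32_t written : q.preLoopMte2Writes) {
        for (std::uint32_t read : q.postLoopVReads) {
            if (written == read) {
                return true;
            }
        }
    }
    return false;
}
```

With the addresses from the reproduction (TLOADs to UB[6304], UB[10400], UB[10432]; TROWEXPANDDIV reading UB[6304] and UB[10432]) and a dynamic trip count, the predicate returns true.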
Context
Discovered while investigating qwen3 decode attention kernel failures in hw-native-sys/pypto#1098. The kernel implements online-softmax with a dynamic number of context blocks. When seq_len ≤ SEQ_TILE (context blocks = 1), the accumulation loop has zero iterations, triggering this bug.
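The arithmetic behind the zero-trip case can be sketched as follows. Two assumptions are not stated in this report: that the number of context blocks is the ceiling of seq_len / SEQ_TILE, and the tile size of 64 used in the check. The helper names ctxBlocks and loopTripCount are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: context blocks = ceil(seq_len / seq_tile).
constexpr std::size_t ctxBlocks(std::size_t seq_len, std::size_t seq_tile) {
    return (seq_len + seq_tile - 1) / seq_tile;
}

// Trip count of `for (v41 = 1; v41 < ctx_blocks; v41++)`.
constexpr std::size_t loopTripCount(std::size_t ctx_blocks) {
    return ctx_blocks > 1 ? ctx_blocks - 1 : 0;
}

// With a hypothetical SEQ_TILE of 64, any seq_len <= 64 gives one block
// and therefore a zero-iteration accumulation loop.
static_assert(loopTripCount(ctxBlocks(64, 64)) == 0, "one block means zero iterations");
```

Under these assumptions, every sequence that fits in a single tile takes the zero-iteration path, so short-context decode would be the common trigger.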