Add: SPMD paged attention with TPUSH/TPOP MIX kernel pipeline #596

Merged

ChaoWao merged 1 commit into hw-native-sys:main from chenshengxin2026:add/spmd-paged-attention-tpush on Apr 20, 2026

Conversation

@chenshengxin2026 (Contributor) commented Apr 20, 2026

Summary

  • Adds a scene test for SPMD paged attention using TPUSH/TPOP inter-core communication between AIC (Cube) and AIV (Vector) cores
  • Implements a cooperative FlashAttention-style pipeline: AIC handles the QK and PV matmuls, while AIV handles the online softmax and output accumulation (see the sketch after this list)
  • Uses QK-first / SF-first software pipelining to overlap Cube matmuls with Vector softmax across KV block iterations
  • Fixes the hardware block_num at 24, with stride-loop distribution over batch * q_loop logical blocks
  • Supports q_tile=16 and q_tile=64 static dispatch paths, selected according to num_heads at runtime
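
A minimal host-side sketch of the online-softmax update that the AIV side performs (an illustration, not code from this PR: names such as online_softmax_step, sij, and vj are hypothetical, and the pij * vj accumulation shown inline here is the matmul that the real pipeline hands back to AIC via TPUSH):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// One online-softmax step: fold a new score tile sij into the running
// per-row state (max m, sum l, unnormalized output o), FlashAttention-style.
// Layouts are row-major: o is rows x d, sij is rows x cols, vj is cols x d.
// Initialize m to -infinity, l and o to zero before the first KV tile.
void online_softmax_step(std::vector<float>& m, std::vector<float>& l,
                         std::vector<float>& o,
                         const std::vector<float>& sij,
                         const std::vector<float>& vj,
                         int rows, int cols, int d) {
  for (int r = 0; r < rows; ++r) {
    // Running max over this row, including the new tile.
    float m_new = m[r];
    for (int c = 0; c < cols; ++c) m_new = std::max(m_new, sij[r * cols + c]);
    // Rescale the old accumulators to the new max.
    const float scale = std::exp(m[r] - m_new);
    float l_new = l[r] * scale;
    for (int k = 0; k < d; ++k) o[r * d + k] *= scale;
    // Fold in the new tile: pij = exp(sij - m_new), o += pij * vj.
    for (int c = 0; c < cols; ++c) {
      const float p = std::exp(sij[r * cols + c] - m_new);
      l_new += p;
      for (int k = 0; k < d; ++k) o[r * d + k] += p * vj[c * d + k];
    }
    m[r] = m_new;
    l[r] = l_new;
  }
}
// After the last KV tile, the final output row is o[r] / l[r].
```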

@gemini-code-assist (bot) left a comment

Code Review

This pull request implements a paged attention MIX kernel across the AIC and AIV cores using TPUSH/TPOP communication, including the orchestration logic and integration tests. The review flags a critical risk of 32-bit integer overflow in the pointer arithmetic of the QK and PV steps when handling large KV caches (see the sketch below), and recommends replacing the inefficient scalar loop that zeroes the destination buffer with a vectorized approach to improve performance on the AIV core.
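
A hedged illustration of the overflow fix (hypothetical names, not code from this PR). The key point is widening an operand to int64_t before the multiply chain; casting an already-wrapped 32-bit product afterwards would be too late:

```cpp
#include <cstdint>

// Element offset of one KV-cache block. The product block_id * block_size *
// num_heads * head_dim can exceed INT32_MAX for large caches; casting the
// first operand promotes every subsequent multiply to 64-bit.
inline int64_t kv_block_offset(int32_t block_id, int32_t block_size,
                               int32_t num_heads, int32_t head_dim) {
  return static_cast<int64_t>(block_id) * block_size * num_heads * head_dim;
}
```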

@chenshengxin2026 force-pushed the add/spmd-paged-attention-tpush branch 4 times, most recently from ad15be6 to ac14bd8 on April 20, 2026 at 06:48

Add a new scene test for paged attention using TPUSH/TPOP inter-core
communication between AIC (Cube) and AIV (Vector) cores. The kernel
implements a cooperative FlashAttention-style pipeline:
- AIC: QK matmul -> TPUSH(sij) -> TPOP(pij) -> PV matmul -> TPUSH(oi)
- AIV: TPOP(sij) -> online softmax -> TPUSH(pij) -> TPOP(oi) -> update

Uses QK-first / SF-first software pipelining to overlap Cube matmuls
with Vector softmax across iterations. Hardware block_num is fixed at
24 with stride-loop work distribution over batch * q_loop logical
blocks. Supports q_tile=16 and q_tile=64 static dispatch paths.
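
A minimal sketch of the stride-loop distribution described above (hypothetical names; the real kernel dispatches tiles to the cooperating AIC/AIV pipeline rather than printing):

```cpp
#include <cstdint>
#include <cstdio>

constexpr int kBlockNum = 24;  // hardware block_num, fixed by the target

// Each physical core (block_idx in [0, kBlockNum)) walks the flat logical
// block space with stride kBlockNum, so work stays balanced even when
// batch * q_loop is not a multiple of 24.
void stride_loop(int block_idx, int64_t batch, int64_t q_loop) {
  const int64_t total = batch * q_loop;
  for (int64_t i = block_idx; i < total; i += kBlockNum) {
    const int64_t b = i / q_loop;  // batch index of this logical block
    const int64_t q = i % q_loop;  // q-tile index within that batch entry
    std::printf("core %d -> (batch=%lld, q=%lld)\n",
                block_idx, (long long)b, (long long)q);
  }
}
```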
@chenshengxin2026 force-pushed the add/spmd-paged-attention-tpush branch from ac14bd8 to 09840d3 on April 20, 2026 at 06:52
@ChaoWao merged commit 6c87797 into hw-native-sys:main on Apr 20, 2026
14 checks passed