
Refactor: merge paged_attention examples/st into unified test_*.py #556

Merged
ChaoWao merged 1 commit into hw-native-sys:main from doraemonmj:reuse_fix
Apr 15, 2026

Conversation


@doraemonmj doraemonmj commented Apr 14, 2026

Update: tensormap_and_ringbuffer PA kernels with multi-tile dispatch and profiling

  • Add runtime dispatch for 16x128 and 64x128 tile configs in AIC/AIV kernels (qk_matmul, pv_matmul, softmax_prepare, online_update)
  • Add ENABLE_PROFILING conditional compilation and platform-isolated cycle count macros in orchestration
  • Use TRESHAPE for zero-copy UB scalar layout conversion in online_update, eliminating GM round-trip
  • Add pipeline overlap (separate MTE2 events) in qk/pv matmul kernels
  • Add production-scale cases (Case1/2/3) to paged_attention and batch_paged_attention tests
  • Add multi-round cases to multi_round_paged_attention test
  • Fix benchmark_rounds.sh case names for bgemm and batch_paged_attention
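The kernel names softmax_prepare and online_update suggest the usual online-softmax scheme, in which each KV block updates a running maximum, a running sum, and a rescaled output accumulator. The following is a minimal pure-Python sketch of that scheme for a single query row. It is not the kernel code from this PR: the function names `attend_full` and `attend_blockwise` are hypothetical, and the 1/sqrt(head_dim) scaling is omitted for brevity.

```python
import math

def attend_full(q, keys, values):
    """Reference: softmax(q . K^T) V computed over all keys at once."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def attend_blockwise(q, keys, values, block_size):
    """Same result, but keys/values are consumed one block at a time."""
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum
        # (exp(-inf) == 0.0, so the first block starts from zero).
        scale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * scale + sum(weights)
        acc = [a * scale + sum(w * v[d] for w, v in zip(weights, v_blk))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

With a context length of 33 and a block size of 16 (the values used by the small variable-length cases), the loop runs over two full blocks and one single-token tail block, and both functions agree up to floating-point rounding.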

tensormap_and_ringbuffer case inventory

Scope:

  • examples/a2a3/tensormap_and_ringbuffer/
  • tests/st/a2a3/tensormap_and_ringbuffer/

Full case table

| # | Location | Test | Case | Platform | dtype | Tolerance (R/A) | block_dim | thread_num | Inputs | Outputs |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | examples/ | bgemm | default | sim+onboard | fp32 | 1e-3/1e-3 | 3 | 4 | A:[2,4,4,64,64] fp32, B:[2,4,4,64,64] fp32 | C:[2,4,4,64,64] fp32 |
| 2 | examples/ | vector_example | default | sim+onboard | fp32 | 1e-5/1e-5 | 3 | 4 | a:[16384] fp32, b:[16384] fp32 | f:[16384] fp32 |
| 3 | examples/ | mixed_example | case1 | sim+onboard | fp32 | 1e-3/1e-3 | 3 | 4 | A:[128x128] fp32, B:[128x128] fp32, D/E/G/H:[16384] fp32 | C/F/I/J/K/L/M/N/O: each [4x16384] fp32 |
| 4 | examples/ | mixed_example | case2 | sim+onboard | fp32 | 1e-3/1e-3 | 3 | 4 | same as #3 (num_iters=1) | same as #3 (num_iters=1) |
| 5 | examples/ | paged_attention | Case1 | sim+onboard | bf16 | 1e-2/1e-2 | 24 | 4 | query:[1,16,16], kc:[3,16,1,16], vc:[3,16,1,16], bt:[1,16] i32, cl:1 | out:[1,16,16] fp32 |
| 6 | examples/ | paged_attention | Case2 | sim+onboard | bf16 | 1e-2/1e-2 | 24 | 4 | query:[1,16,16], kc:[8,16,1,16], vc:[8,16,1,16], bt:[1,16] i32, cl:1 | out:[1,16,16] fp32 |
| 7 | examples/ | paged_attention | CaseVarSeq2 | sim+onboard | bf16 | 1e-2/1e-2 | 24 | 4 | batch=2, context_lens=[33,17] | out:[2,16,16] fp32 |
| 8 | examples/ | paged_attention | CaseVarSeq4 | sim+onboard | bf16 | 1e-2/1e-2 | 24 | 4 | batch=4, context_lens=[33,64,128,15] | out:[4,16,16] fp32 |
| 9 | examples/ | batch_paged_attention | CaseSmall1 | sim+onboard | bf16 | 1e-3/1e-3 | 24 | 4 | batch=1, heads=16, kv=1, hd=16, bs=16, cl=33 | out fp32 |
| 10 | examples/ | batch_paged_attention | CaseSmall2 | sim+onboard | bf16 | 1e-3/1e-3 | 24 | 4 | batch=1, heads=16, kv=1, hd=16, bs=16, cl=31 | out fp32 |
| 11 | examples/ | batch_paged_attention | CaseSmall3 | sim+onboard | bf16 | 1e-3/1e-3 | 24 | 4 | batch=1, heads=16, kv=1, hd=16, bs=16, cl=128 | out fp32 |
| 12 | examples/ | batch_paged_attention | CaseVarSeq2 | sim+onboard | bf16 | 1e-3/1e-3 | 24 | 4 | batch=2, context_lens=[33,17] | out fp32 |
| 13 | examples/ | batch_paged_attention | CaseVarSeq4 | sim+onboard | bf16 | 1e-3/1e-3 | 24 | 4 | batch=4, context_lens=[33,64,128,15] | out fp32 |
| 14 | examples/ | multi-round-paged-attention | Case1 | sim+onboard | bf16 | 1e-2/1e-2 | 24 | 4 | same as #5 (10 rounds) | same as #5 |
| 15 | examples/ | multi-round-paged-attention | Case2 | sim+onboard | bf16 | 1e-2/1e-2 | 24 | 4 | same as #6 (10 rounds) | same as #6 |
| 16 | examples/ | multi-round-paged-attention | CaseVarSeq2 | sim+onboard | bf16 | 1e-2/1e-2 | 24 | 4 | same as #7 (10 rounds) | same as #7 |
| 17 | examples/ | multi-round-paged-attention | CaseVarSeq4 | sim+onboard | bf16 | 1e-2/1e-2 | 24 | 4 | same as #8 (10 rounds) | same as #8 |
| 18 | examples/ | paged_attention_ringbuffer | ringbuffer_stress | onboard | bf16 | 1e-3/1e-3 | 24 | 4 | batch=32, heads=16, kv=1, hd=128, bs=128, cl=4096 | out fp32 |
| 19 | examples/ | scalar_data_test | default | onboard | fp32 | - | 3 | 4 | a:[16384] fp32, b:[16384] fp32 | result:[16384] fp32, check:[10] fp32 |
| 20 | examples/ | spmd_basic | Case1 | sim+onboard | fp32 | 0/0 | 24 | 4 | (none) | output:[48] fp32 |
| 21 | examples/ | spmd_sync_start | Case1 | sim+onboard | fp32 | 0/0 | 24 | 4 | (none) | output:[1152] fp32 |
| 22 | examples/ | spmd_sync_start_stress | Case1 | sim+onboard | fp32 | 0/0 | 24 | 4 | (none) | output:[13440] fp32 |
| 23 | examples/ | spmd_sync_start_edge | Case1 | sim+onboard | fp32 | 0/0 | 24 | 4 | (none) | output:[2016] fp32 |
| 24 | examples/ | spmd_sync_start_aiv | Case1 | sim+onboard | fp32 | 0/0 | 24 | 4 | (none) | output:[768] fp32 |
| 25 | examples/ | spmd_multiblock_mix | Case1 | sim+onboard | fp32 | 0/0 | 24 | 4 | (none) | output:[4512] fp32 |
| 26 | examples/ | spmd_multiblock_aiv | Case1 | sim+onboard | fp32 | 0/0 | 24 | 4 | (none) | output:[3008] fp32 |
| 27 | examples/ | spmd_starvation | Case1 | sim+onboard | fp32 | 0/0 | 24 | 4 | (none) | output:[4032] fp32 |
| 28 | tests/st/ | paged_attention | Case1 | onboard | bf16 | 1e-3/1e-3 | 24 | 4 | query:[256,16,128], kc:[16384,128,1,128], vc:[16384,128,1,128], bt:[256,256] i32, cl:256 | out:[256,16,128] fp32 |
| 29 | tests/st/ | paged_attention | Case2 | onboard | bf16 | 1e-3/1e-3 | 24 | 4 | query:[64,64,128], kc:[8192,64,1,128], vc:[8192,64,1,128], bt:[64,512] i32, cl:64 | out:[64,64,128] fp32 |
| 30 | tests/st/ | paged_attention | Case3 | onboard | bf16 | 1e-3/1e-3 | 24 | 4 | query:[64,64,256], kc:[8192,64,1,256], vc:[8192,64,1,256], bt:[64,512] i32, cl:64 | out:[64,64,256] fp32 |
| 31 | tests/st/ | paged_attention_unroll | Case1 | onboard | bf16 | 1e-3/1e-3 | 24 | 4 | same as #28 | same as #28 |
| 32 | tests/st/ | paged_attention_unroll | Case2 | onboard | bf16 | 1e-3/1e-3 | 24 | 4 | same as #29 | same as #29 |
| 33 | tests/st/ | paged_attention_unroll | Case3 | onboard | bf16 | 1e-3/1e-3 | 24 | 4 | same as #30 | same as #30 |
| 34 | tests/st/ | paged_attention_unroll_4dims | Case1 | onboard | bf16 | 1e-3/1e-3 | 24 | 4 | query:[256,1,16,128] (4D), rest same as #28 | out:[256,1,16,128] fp32 |
| 35 | tests/st/ | paged_attention_unroll_4dims | Case2 | onboard | bf16 | 1e-3/1e-3 | 24 | 4 | query:[64,1,64,128] (4D), rest same as #29 | out:[64,1,64,128] fp32 |
| 36 | tests/st/ | paged_attention_unroll_4dims | Case3 | onboard | bf16 | 1e-3/1e-3 | 24 | 4 | query:[64,1,64,256] (4D), rest same as #30 | out:[64,1,64,256] fp32 |
| 37 | tests/st/ | batch_paged_attention | Case1 | onboard | bf16 | 1e-3/1e-3 | 24 | 4 | same as #28 | same as #28 |
| 38 | tests/st/ | batch_paged_attention | Case2 | onboard | bf16 | 1e-3/1e-3 | 24 | 4 | same as #29 | same as #29 |
| 39 | tests/st/ | batch_paged_attention | Case3 | onboard | bf16 | 1e-3/1e-3 | 24 | 4 | same as #30 | same as #30 |
| 40 | tests/st/ | benchmark_bgemm | Case0 | onboard | fp32 | 1e-3/1e-3 | 24 | 4 | A:[1000,2,128,128] fp32, B:[1000,2,128,128] fp32, config:[4] i64 | C:[1000,128,128] fp32 |
| 41 | tests/st/ | benchmark_bgemm | Case1 | onboard | fp32 | 1e-3/1e-3 | 24 | 4 | A:[128,2,128,128] fp32, B:[128,2,128,128] fp32, config:[4] i64 | C:[128,128,128] fp32 |
| 42 | tests/st/ | benchmark_bgemm | Case2 | onboard | fp32 | 1e-3/1e-3 | 24 | 4 | A:[512,2,128,128] fp32, B:[512,2,128,128] fp32, config:[4] i64 | C:[512,128,128] fp32 |
| 43 | tests/st/ | benchmark_bgemm | Case3 | onboard | fp32 | 1e-3/1e-3 | 24 | 4 | A:[512,2,128,128] fp32, B:[512,2,128,128] fp32, config:[4] i64 (incore_loop=16) | C:[512,128,128] fp32 |
| 44 | tests/st/ | benchmark_bgemm | Case4 | onboard | fp32 | 1e-3/1e-3 | 24 | 4 | A:[64,4,128,128] fp32, B:[64,4,128,128] fp32, config:[4] i64 (grid_k=4) | C:[64,128,128] fp32 |
| 45 | tests/st/ | alternating_matmul_add | default | onboard | fp32 | 1e-3/1e-3 | 24 | 4 | A:[1,1,128,128] fp32, B:[1,1,128,128] fp32, X:[1,1,128,128] fp32, Y:[1,1,128,128] fp32 | C:[1,1,128,128] fp32, Z:[1,1,128,128] fp32 |
| 46 | tests/st/ | alternating_matmul_add | Case1 | onboard | fp32 | 1e-3/1e-3 | 24 | 4 | A:[500,4,128,128], B:[500,4,128,128], X:[500,4,128,128], Y:[500,4,128,128] | C:[500,4,128,128], Z:[500,4,128,128] |
| 47 | tests/st/ | alternating_matmul_add | Case2 | onboard | fp32 | 1e-3/1e-3 | 24 | 4 | A:[512,2,128,128], B:[512,2,128,128], X:[512,5,128,128], Y:[512,5,128,128] | C:[512,2,128,128], Z:[512,5,128,128] |
| 48 | tests/st/ | test_explicit_fatal | default | sim | - | - | 24 | 4 | (none) | (none; negative test, rc=-9) |
| 49 | tests/st/ | test_l3_dependency | default | sim+onboard | fp32 | 1e-5/1e-5 | 3 | 4 | a:[16384] fp32, b:[16384] fp32 | f:[16384] fp32 |
| 50 | tests/st/ | test_l3_group | default | sim+onboard | fp32 | 1e-5/1e-5 | 3 | 4 | a:[16384] fp32, b:[16384] fp32 (x2 chips) | f:[16384] fp32 (x2 chips) |
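
The R/A column gives two tolerances per case. Assuming R/A means relative/absolute tolerance and that outputs are compared against the golden with the common rule `|actual - expected| <= atol + rtol * |expected|` (an assumption; the comparison code is not shown in this PR), the check could be sketched as below. `within_tolerance` is a hypothetical helper, not a function from the repository.

```python
def within_tolerance(actual, expected, rtol, atol):
    """Element-wise allclose-style check over two flat sequences."""
    if len(actual) != len(expected):
        return False
    return all(abs(a - e) <= atol + rtol * abs(e)
               for a, e in zip(actual, expected))

# An error of about 5e-3 on a value near 1.0 passes at 1e-2/1e-2
# but fails at 1e-3/1e-3, where the bound is only 2e-3.
loose = within_tolerance([1.005], [1.0], rtol=1e-2, atol=1e-2)
tight = within_tolerance([1.005], [1.0], rtol=1e-3, atol=1e-3)
print(loose, tight)
```

Under this rule the 0/0 entries of the SPMD cases require a bit-exact match.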

Summary statistics

examples/a2a3/tensormap_and_ringbuffer/

| Category | Test directories | Total cases |
|---|---|---|
| Compute (bgemm, vector_example, mixed_example) | 3 | 4 |
| PA (paged_attention, batch_pa, multi-round-pa, pa_ringbuffer) | 4 | 14 |
| SPMD (basic, sync_start, stress, edge, aiv, multiblock_mix, multiblock_aiv, starvation) | 8 | 8 |
| Other (scalar_data_test) | 1 | 1 |
| Total | 16 | 27 |

tests/st/a2a3/tensormap_and_ringbuffer/

| Category | Test directories | Total cases |
|---|---|---|
| PA (paged_attention, pa_unroll, pa_unroll_4dims, batch_pa) | 4 | 12 |
| Compute (benchmark_bgemm, alternating_matmul_add) | 2 | 8 |
| L3 tests (test_l3_dependency, test_l3_group) | 2 | 2 |
| Negative test (test_explicit_fatal) | 1 | 1 |
| Total | 9 | 23 |

Overall

| Location | Test directories | Total cases |
|---|---|---|
| examples | 16 | 27 |
| tests/st | 9 | 23 |
| Total | 25 | 50 |
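
As a quick arithmetic check, the per-category case counts from the two summary tables can be summed directly. The dictionary keys are shorthand labels chosen for this sketch; the numbers are copied from the tables.

```python
# Case counts per category, copied from the summary tables.
examples = {"compute": 4, "PA": 14, "SPMD": 8, "other": 1}
tests_st = {"PA": 12, "compute": 8, "L3": 2, "negative": 1}

examples_total = sum(examples.values())
tests_st_total = sum(tests_st.values())
grand_total = examples_total + tests_st_total

print(examples_total, tests_st_total, grand_total)  # 27 23 50
```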

Notes on key differences

  • 14 of the 16 example directories pass CI on sim+onboard; the exceptions, paged_attention_ringbuffer and scalar_data_test, are onboard-only
  • The device_test cases under tests/st (PA/bgemm) run on onboard only; test_l3_dependency and test_l3_group run on sim+onboard
  • test_explicit_fatal is the only sim-only case (negative test, verifies rc=-9)
  • bgemm under examples uses block_dim=3; benchmark_bgemm under tests/st uses block_dim=24
  • batch_paged_attention under examples uses bf16 (matching the kernel's bfloat16_t), tolerance 1e-3/1e-3
  • The PA cases under examples are small scale (head_dim=16, batch<=4); the PA cases under tests/st are large scale (head_dim=128/256, batch=64/256)
  • multi-round-paged-attention is the multi-round benchmark version of PA (10 rounds) and exists only in examples
  • All 8 SPMD cases live only in examples; tests/st has no SPMD cases
  • paged_attention_unroll_4dims is the only PA test that uses a 4D input shape (query shape [batch,1,heads,hd])
  • All cases use thread_num=4

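Post-migration changes

All cases are unified into the @scene_test class format (test_*.py).
golden.py and kernel_config.py are deleted and folded into CALLABLE + CASES + generate_args() + compute_golden().
Small-scale example cases and large-scale st cases are merged into the same file; small cases run automatically on sim+onboard, and large cases are marked onboard-only + manual.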
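
As a rough picture of the merged layout, the sketch below keeps one small case and one production-scale case in a single class and filters them by platform. It is not the project's API: `scene_test` here is a stand-in decorator defined locally, and the class name `PagedAttentionScene`, the helper `select_cases`, and the per-case field names (`platforms`, `manual`, `query_shape`, `rtol`, `atol`) are assumptions made for illustration. Only `CALLABLE`, `CASES`, `generate_args()` and `compute_golden()` are names taken from this description.

```python
REGISTRY = []  # scenes collected by the stand-in decorator

def scene_test(cls):
    """Stand-in for the project's decorator (assumed behaviour: register the class)."""
    REGISTRY.append(cls)
    return cls

@scene_test
class PagedAttentionScene:
    CALLABLE = "paged_attention"  # placeholder for the kernel/orchestration handle

    # Field names are hypothetical; shapes and tolerances follow the tables in this description.
    CASES = [
        {"name": "CaseSmall1", "platforms": ("sim", "onboard"), "manual": False,
         "query_shape": (1, 16, 16), "rtol": 1e-3, "atol": 1e-3},
        {"name": "Case1", "platforms": ("onboard",), "manual": True,
         "query_shape": (256, 16, 128), "rtol": 1e-3, "atol": 1e-3},
    ]

    @staticmethod
    def generate_args(case):
        """Return input/output shapes for one case (real code would also build tensors)."""
        return {"query": case["query_shape"], "out": case["query_shape"]}

    @staticmethod
    def compute_golden(args):
        """Reference result; left unimplemented in this sketch."""
        raise NotImplementedError("reference attention goes here")

def select_cases(scene, platform, include_manual=False):
    """Names of the cases that would run on `platform`."""
    return [c["name"] for c in scene.CASES
            if platform in c["platforms"] and (include_manual or not c["manual"])]

print(select_cases(PagedAttentionScene, "sim"))
print(select_cases(PagedAttentionScene, "onboard", include_manual=True))
```

In this sketch a default sim or onboard run picks up only CaseSmall1, while Case1 runs only on onboard when manual cases are explicitly requested.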

Change details

| Orig # | Test | Orig case | New location | New case | Changed fields | Before | After | Status |
|---|---|---|---|---|---|---|---|---|
| 1 | bgemm | default | examples/.../bgemm/test_bgemm.py | default | (none) | sim+onboard | sim+onboard | ✅ Migrated |
| 2 | vector_example | default | examples/.../vector_example/test_vector_example.py | default | (none) | sim+onboard | sim+onboard | ✅ Migrated |
| 3 | mixed_example | case1 | tests/st/.../mixed_example/test_mixed_example.py | case1 | (none) | sim+onboard | sim+onboard | ✅ Migrated |
| 4 | mixed_example | case2 | tests/st/.../mixed_example/test_mixed_example.py | case2 | (none) | sim+onboard | sim+onboard | ✅ Migrated |
| 5 | paged_attention | Case1 | examples/.../paged_attention/test_paged_attention.py | CaseSmall1 | name, tolerance | sim+onboard, 1e-2/1e-2 | sim+onboard, 1e-3/1e-3 | ✅ Merged |
| 6 | paged_attention | Case2 | examples/.../paged_attention/test_paged_attention.py | CaseSmall2 | name, tolerance | sim+onboard, 1e-2/1e-2 | sim+onboard, 1e-3/1e-3 | ✅ Merged |
| 7 | paged_attention | CaseVarSeq2 | examples/.../paged_attention/test_paged_attention.py | CaseVarSeq2 | tolerance | sim+onboard, 1e-2/1e-2 | sim+onboard, 1e-3/1e-3 | ✅ Merged |
| 8 | paged_attention | CaseVarSeq4 | examples/.../paged_attention/test_paged_attention.py | CaseVarSeq4 | tolerance | sim+onboard, 1e-2/1e-2 | sim+onboard, 1e-3/1e-3 | ✅ Merged |
| 9 | batch_paged_attention | Case1 | examples/.../batch_paged_attention/test_batch_paged_attention.py | CaseSmall1 | name, tolerance, dtype | sim+onboard, 1e-2/1e-2, fp16 | sim+onboard, 1e-3/1e-3, bf16 | ✅ Merged |
| 10 | batch_paged_attention | Case2 | examples/.../batch_paged_attention/test_batch_paged_attention.py | CaseSmall2 | name, tolerance, dtype | sim+onboard, 1e-2/1e-2, fp16 | sim+onboard, 1e-3/1e-3, bf16 | ✅ Merged |
| 11 | batch_paged_attention | Case3 | examples/.../batch_paged_attention/test_batch_paged_attention.py | CaseSmall3 | name, tolerance, dtype | sim+onboard, 1e-2/1e-2, fp16 | sim+onboard, 1e-3/1e-3, bf16 | ✅ Merged |
| 12 | batch_paged_attention | CaseVarSeq2 | examples/.../batch_paged_attention/test_batch_paged_attention.py | CaseVarSeq2 | tolerance, dtype | sim+onboard, 1e-2/1e-2, fp16 | sim+onboard, 1e-3/1e-3, bf16 | ✅ Merged |
| 13 | batch_paged_attention | CaseVarSeq4 | examples/.../batch_paged_attention/test_batch_paged_attention.py | CaseVarSeq4 | tolerance, dtype | sim+onboard, 1e-2/1e-2, fp16 | sim+onboard, 1e-3/1e-3, bf16 | ✅ Merged |
| 14 | multi-round-pa | Case1 | tests/st/.../multi_round_paged_attention/test_multi_round_paged_attention.py | Case1 | (none) | sim+onboard | sim+onboard | ✅ Merged |
| 15 | multi-round-pa | Case2 | tests/st/.../multi_round_paged_attention/test_multi_round_paged_attention.py | Case2 | (none) | sim+onboard | sim+onboard | ✅ Merged |
| 16 | multi-round-pa | CaseVarSeq2 | tests/st/.../multi_round_paged_attention/test_multi_round_paged_attention.py | CaseVarSeq2 | (none) | sim+onboard | sim+onboard | ✅ Merged |
| 17 | multi-round-pa | CaseVarSeq4 | tests/st/.../multi_round_paged_attention/test_multi_round_paged_attention.py | CaseVarSeq4 | (none) | sim+onboard | sim+onboard | ✅ Merged |
| 18 | pa_ringbuffer | ringbuffer_stress | examples/.../paged_attention_ringbuffer/test_paged_attention_ringbuffer.py | ringbuffer_stress | (none) | onboard | onboard | ✅ Migrated |
| 19 | scalar_data_test | default | examples/.../scalar_data_test/test_scalar_data.py | default | (none) | onboard | onboard | ✅ Migrated |
| 20 | spmd_basic | Case1 | tests/st/.../spmd_basic/test_spmd_basic.py | Case1 | (none) | sim+onboard | sim+onboard | ✅ Migrated |
| 21 | spmd_sync_start | Case1 | tests/st/.../spmd_sync_start/test_spmd_sync_start.py | Case1 | (none) | sim+onboard | sim+onboard | ✅ Migrated |
| 22 | spmd_sync_start_stress | Case1 | tests/st/.../spmd_sync_start_stress/test_spmd_sync_start_stress.py | Case1 | (none) | sim+onboard | sim+onboard | ✅ Migrated |
| 23 | spmd_sync_start_edge | Case1 | tests/st/.../spmd_sync_start_edge/test_spmd_sync_start_edge.py | Case1 | (none) | sim+onboard | sim+onboard | ✅ Migrated |
| 24 | spmd_sync_start_aiv | Case1 | tests/st/.../spmd_sync_start_aiv/test_spmd_sync_start_aiv.py | Case1 | (none) | sim+onboard | sim+onboard | ✅ Migrated |
| 25 | spmd_multiblock_mix | Case1 | tests/st/.../spmd_multiblock_mix/test_spmd_multiblock_mix.py | Case1 | (none) | sim+onboard | sim+onboard | ✅ Migrated |
| 26 | spmd_multiblock_aiv | Case1 | tests/st/.../spmd_multiblock_aiv/test_spmd_multiblock_aiv.py | Case1 | (none) | sim+onboard | sim+onboard | ✅ Migrated |
| 27 | spmd_starvation | Case1 | tests/st/.../spmd_starvation/test_spmd_starvation.py | Case1 | (none) | sim+onboard | sim+onboard | ✅ Migrated |
| 28 | paged_attention (st) | Case1 | examples/.../paged_attention/test_paged_attention.py | Case1 | location | onboard (tests/st/) | onboard (examples/, merged) | ✅ Merged |
| 29 | paged_attention (st) | Case2 | examples/.../paged_attention/test_paged_attention.py | Case2 | location | onboard (tests/st/) | onboard (examples/, merged) | ✅ Merged |
| 30 | paged_attention (st) | Case3 | examples/.../paged_attention/test_paged_attention.py | Case3 | location | onboard (tests/st/) | onboard (examples/, merged) | ✅ Merged |
| 31 | pa_unroll (st) | Case1 | tests/st/.../paged_attention_unroll/test_paged_attention_unroll.py | Case1 | (none) | onboard | onboard | ✅ Migrated |
| 32 | pa_unroll (st) | Case2 | tests/st/.../paged_attention_unroll/test_paged_attention_unroll.py | Case2 | (none) | onboard | onboard | ✅ Migrated |
| 33 | pa_unroll (st) | Case3 | tests/st/.../paged_attention_unroll/test_paged_attention_unroll.py | Case3 | (none) | onboard | onboard | ✅ Migrated |
| 34 | pa_unroll_4dims (st) | Case1 | tests/st/.../paged_attention_unroll_4dims/test_paged_attention_unroll_4dims.py | Case1 | (none) | onboard | onboard | ✅ Migrated |
| 35 | pa_unroll_4dims (st) | Case2 | tests/st/.../paged_attention_unroll_4dims/test_paged_attention_unroll_4dims.py | Case2 | (none) | onboard | onboard | ✅ Migrated |
| 36 | pa_unroll_4dims (st) | Case3 | tests/st/.../paged_attention_unroll_4dims/test_paged_attention_unroll_4dims.py | Case3 | (none) | onboard | onboard | ✅ Migrated |
| 37 | batch_paged_attention (st) | Case1 | examples/.../batch_paged_attention/test_batch_paged_attention.py | Case1 | location | onboard (tests/st/) | onboard (examples/, merged) | ✅ Merged |
| 38 | batch_paged_attention (st) | Case2 | examples/.../batch_paged_attention/test_batch_paged_attention.py | Case2 | location | onboard (tests/st/) | onboard (examples/, merged) | ✅ Merged |
| 39 | batch_paged_attention (st) | Case3 | examples/.../batch_paged_attention/test_batch_paged_attention.py | Case3 | location | onboard (tests/st/) | onboard (examples/, merged) | ✅ Merged |
| 40 | benchmark_bgemm (st) | Case0 | tests/st/.../benchmark_bgemm/test_benchmark_bgemm.py | Case0 | platform | onboard | sim+onboard | ✅ Migrated |
| 41 | benchmark_bgemm (st) | Case1 | tests/st/.../benchmark_bgemm/test_benchmark_bgemm.py | Case1 | platform | onboard | sim+onboard | ✅ Migrated |
| 42 | benchmark_bgemm (st) | Case2 | tests/st/.../benchmark_bgemm/test_benchmark_bgemm.py | Case2 | platform | onboard | sim+onboard | ✅ Migrated |
| 43 | benchmark_bgemm (st) | Case3 | tests/st/.../benchmark_bgemm/test_benchmark_bgemm.py | Case3 | platform | onboard | sim+onboard | ✅ Migrated |
| 44 | benchmark_bgemm (st) | Case4 | tests/st/.../benchmark_bgemm/test_benchmark_bgemm.py | Case4 | platform | onboard | sim+onboard | ✅ Migrated |
| 45 | alternating_matmul_add (st) | default | tests/st/.../alternating_matmul_add/test_alternating_matmul_add.py | default | (none) | onboard | onboard | ✅ Migrated |
| 46 | alternating_matmul_add (st) | Case1 | tests/st/.../alternating_matmul_add/test_alternating_matmul_add.py | Case1 | (none) | onboard | onboard | ✅ Migrated |
| 47 | alternating_matmul_add (st) | Case2 | tests/st/.../alternating_matmul_add/test_alternating_matmul_add.py | Case2 | (none) | onboard | onboard | ✅ Migrated |
| 48 | explicit_fatal (st) | default | tests/st/.../test_explicit_fatal.py | default | (none) | sim | sim | ✅ Migrated |
| 49 | test_l3_dependency (st) | default | tests/st/.../test_l3_dependency.py | default | (none) | sim+onboard | sim+onboard | ✅ Migrated |
| 50 | test_l3_group (st) | default | tests/st/.../test_l3_group.py | default | (none) | sim+onboard | sim+onboard | ✅ Migrated |

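Post-migration statistics

| Location | Test directories | Cases | sim+onboard (auto) | onboard (manual) | onboard (auto) |
|---|---|---|---|---|---|
| examples/ | 6 | 19 | 10 | 7 | 2 |
| tests/st/ | 17 | 31 | 19 | 12 | 0 |
| Total | 23 | 50 | 29 | 19 | 2 |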

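Migration summary

  • golden.py + kernel_config.py deleted; everything is unified under @scene_test
  • paged_attention: 4 small example cases + 3 large st cases → merged into 7 cases; small cases renamed CaseSmall1/2
  • batch_paged_attention: 5 small example cases + 3 large st cases → merged into 8 cases; small cases renamed CaseSmall1/2/3
  • multi_round_paged_attention: originally 1 case → extended to 4 cases (taken from the example golden.py)
  • The _PA_KERNELS path still points to examples/.../paged_attention/kernels (same as before the migration)
  • All 50/50 cases confirmed, none missing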


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors paged attention kernels to support production-scale tile configurations via runtime dispatch and optimizes performance through improved pipeline overlap and zero-copy UB reshapes using TRESHAPE. It also introduces profiling in the orchestration layer and updates test suites. Review feedback highlights a critical issue in the dispatch logic where small-scale configurations are incorrectly handled when the tile size is 16. Furthermore, several newly added comments in the kernel entry points are misleading as they reference out-of-bounds arguments.

@doraemonmj doraemonmj force-pushed the reuse_fix branch 9 times, most recently from ecac927 to 8f5bc46 Compare April 15, 2026 02:22
…and profiling

- Add runtime dispatch for 16x128 and 64x128 tile configs in AIC/AIV
  kernels (qk_matmul, pv_matmul, softmax_prepare, online_update)
- Add ENABLE_PROFILING conditional compilation and platform-isolated
  cycle count macros in orchestration
- Use TRESHAPE for zero-copy UB scalar layout conversion in
  online_update, eliminating GM round-trip
- Add pipeline overlap (separate MTE2 events) in qk/pv matmul kernels
- Add production-scale cases (Case1/2/3) to paged_attention and
  batch_paged_attention tests
- Add multi-round cases to multi_round_paged_attention test
- Fix benchmark_rounds.sh case names for bgemm and batch_paged_attention
@ChaoWao ChaoWao merged commit b1ff237 into hw-native-sys:main Apr 15, 2026
15 checks passed