Refactor: merge paged_attention examples/st into unified test_*.py by doraemonmj · Pull Request #556 · hw-native-sys/simpler

doraemonmj · 2026-04-14T15:42:27Z

Update: tensormap_and_ringbuffer PA kernels with multi-tile dispatch and profiling

- Add runtime dispatch for 16x128 and 64x128 tile configs in AIC/AIV kernels (qk_matmul, pv_matmul, softmax_prepare, online_update)
- Add ENABLE_PROFILING conditional compilation and platform-isolated cycle count macros in orchestration
- Use TRESHAPE for zero-copy UB scalar layout conversion in online_update, eliminating GM round-trip
- Add pipeline overlap (separate MTE2 events) in qk/pv matmul kernels
- Add production-scale cases (Case1/2/3) to paged_attention and batch_paged_attention tests
- Add multi-round cases to multi_round_paged_attention test
- Fix benchmark_rounds.sh case names for bgemm and batch_paged_attention
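
The kernel names softmax_prepare and online_update suggest the usual online-softmax recurrence used in tiled attention. The kernels themselves are not reproduced in this description, so the following is only a minimal sketch of that recurrence for one query row, written in Python for brevity; the function names and the (max, sum, accumulator) state layout are assumptions made for illustration.

```python
import math

def online_update(state, scores, values):
    """Fold one tile of scores/values into the running (max, sum, acc) state."""
    m_prev, l_prev, acc_prev = state
    m_new = max(m_prev, max(scores))
    # Rescale factor for everything accumulated under the old maximum.
    scale = math.exp(m_prev - m_new) if m_prev != -math.inf else 0.0
    p = [math.exp(s - m_new) for s in scores]
    l_new = l_prev * scale + sum(p)
    acc_new = [
        a * scale + sum(pi * v[d] for pi, v in zip(p, values))
        for d, a in enumerate(acc_prev)
    ]
    return m_new, l_new, acc_new

def attention_row(scores, values, tile):
    """Process the row tile by tile, normalising once at the end."""
    head_dim = len(values[0])
    state = (-math.inf, 0.0, [0.0] * head_dim)
    for start in range(0, len(scores), tile):
        state = online_update(
            state, scores[start:start + tile], values[start:start + tile]
        )
    _, l, acc = state
    return [a / l for a in acc]

def reference_row(scores, values):
    """Plain softmax(scores) @ values, used to cross-check the tiled version."""
    m = max(scores)
    p = [math.exp(s - m) for s in scores]
    total = sum(p)
    return [
        sum(pi * v[d] for pi, v in zip(p, values)) / total
        for d in range(len(values[0]))
    ]
```

The tiled result should agree with reference_row for any tile size, which is the property a golden comparison relies on.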

tensormap_and_ringbuffer test case inventory

Scope:

examples/a2a3/tensormap_and_ringbuffer/

tests/st/a2a3/tensormap_and_ringbuffer/

Full case table

#	Location	Test name	Case	Platform	dtype	Precision (R/A)	block_dim	thread_num	Inputs	Outputs
1	`examples/`	bgemm	default	sim+onboard	fp32	1e-3/1e-3	3	4	A:[2,4,4,64,64] fp32, B:[2,4,4,64,64] fp32	C:[2,4,4,64,64] fp32
2	`examples/`	vector_example	default	sim+onboard	fp32	1e-5/1e-5	3	4	a:[16384] fp32, b:[16384] fp32	f:[16384] fp32
3	`examples/`	mixed_example	case1	sim+onboard	fp32	1e-3/1e-3	3	4	A:[128x128] fp32, B:[128x128] fp32, D/E/G/H:[16384] fp32	C/F/I/J/K/L/M/N/O: [4x16384] fp32 each
4	`examples/`	mixed_example	case2	sim+onboard	fp32	1e-3/1e-3	3	4	same as #3 (num_iters=1)	same as #3 (num_iters=1)
5	`examples/`	paged_attention	Case1	sim+onboard	bf16	1e-2/1e-2	24	4	query:[1,16,16], kc:[3,16,1,16], vc:[3,16,1,16], bt:[1,16] i32, cl:1	out:[1,16,16] fp32
6	`examples/`	paged_attention	Case2	sim+onboard	bf16	1e-2/1e-2	24	4	query:[1,16,16], kc:[8,16,1,16], vc:[8,16,1,16], bt:[1,16] i32, cl:1	out:[1,16,16] fp32
7	`examples/`	paged_attention	CaseVarSeq2	sim+onboard	bf16	1e-2/1e-2	24	4	batch=2, context_lens=[33,17]	out:[2,16,16] fp32
8	`examples/`	paged_attention	CaseVarSeq4	sim+onboard	bf16	1e-2/1e-2	24	4	batch=4, context_lens=[33,64,128,15]	out:[4,16,16] fp32
9	`examples/`	batch_paged_attention	CaseSmall1	sim+onboard	bf16	1e-3/1e-3	24	4	batch=1, heads=16, kv=1, hd=16, bs=16, cl=33	out fp32
10	`examples/`	batch_paged_attention	CaseSmall2	sim+onboard	bf16	1e-3/1e-3	24	4	batch=1, heads=16, kv=1, hd=16, bs=16, cl=31	out fp32
11	`examples/`	batch_paged_attention	CaseSmall3	sim+onboard	bf16	1e-3/1e-3	24	4	batch=1, heads=16, kv=1, hd=16, bs=16, cl=128	out fp32
12	`examples/`	batch_paged_attention	CaseVarSeq2	sim+onboard	bf16	1e-3/1e-3	24	4	batch=2, context_lens=[33,17]	out fp32
13	`examples/`	batch_paged_attention	CaseVarSeq4	sim+onboard	bf16	1e-3/1e-3	24	4	batch=4, context_lens=[33,64,128,15]	out fp32
14	`examples/`	multi-round-paged-attention	Case1	sim+onboard	bf16	1e-2/1e-2	24	4	same as #5 (10 rounds)	same as #5
15	`examples/`	multi-round-paged-attention	Case2	sim+onboard	bf16	1e-2/1e-2	24	4	same as #6 (10 rounds)	same as #6
16	`examples/`	multi-round-paged-attention	CaseVarSeq2	sim+onboard	bf16	1e-2/1e-2	24	4	same as #7 (10 rounds)	same as #7
17	`examples/`	multi-round-paged-attention	CaseVarSeq4	sim+onboard	bf16	1e-2/1e-2	24	4	same as #8 (10 rounds)	same as #8
18	`examples/`	paged_attention_ringbuffer	ringbuffer_stress	onboard	bf16	1e-3/1e-3	24	4	batch=32, heads=16, kv=1, hd=128, bs=128, cl=4096	out fp32
19	`examples/`	scalar_data_test	default	onboard	fp32	-	3	4	a:[16384] fp32, b:[16384] fp32	result:[16384] fp32, check:[10] fp32
20	`examples/`	spmd_basic	Case1	sim+onboard	fp32	0/0	24	4	(none)	output:[48] fp32
21	`examples/`	spmd_sync_start	Case1	sim+onboard	fp32	0/0	24	4	(none)	output:[1152] fp32
22	`examples/`	spmd_sync_start_stress	Case1	sim+onboard	fp32	0/0	24	4	(none)	output:[13440] fp32
23	`examples/`	spmd_sync_start_edge	Case1	sim+onboard	fp32	0/0	24	4	(none)	output:[2016] fp32
24	`examples/`	spmd_sync_start_aiv	Case1	sim+onboard	fp32	0/0	24	4	(none)	output:[768] fp32
25	`examples/`	spmd_multiblock_mix	Case1	sim+onboard	fp32	0/0	24	4	(none)	output:[4512] fp32
26	`examples/`	spmd_multiblock_aiv	Case1	sim+onboard	fp32	0/0	24	4	(none)	output:[3008] fp32
27	`examples/`	spmd_starvation	Case1	sim+onboard	fp32	0/0	24	4	(none)	output:[4032] fp32
28	`tests/st/`	paged_attention	Case1	onboard	bf16	1e-3/1e-3	24	4	query:[256,16,128], kc:[16384,128,1,128], vc:[16384,128,1,128], bt:[256,256] i32, cl:256	out:[256,16,128] fp32
29	`tests/st/`	paged_attention	Case2	onboard	bf16	1e-3/1e-3	24	4	query:[64,64,128], kc:[8192,64,1,128], vc:[8192,64,1,128], bt:[64,512] i32, cl:64	out:[64,64,128] fp32
30	`tests/st/`	paged_attention	Case3	onboard	bf16	1e-3/1e-3	24	4	query:[64,64,256], kc:[8192,64,1,256], vc:[8192,64,1,256], bt:[64,512] i32, cl:64	out:[64,64,256] fp32
31	`tests/st/`	paged_attention_unroll	Case1	onboard	bf16	1e-3/1e-3	24	4	same as #28	same as #28
32	`tests/st/`	paged_attention_unroll	Case2	onboard	bf16	1e-3/1e-3	24	4	same as #29	same as #29
33	`tests/st/`	paged_attention_unroll	Case3	onboard	bf16	1e-3/1e-3	24	4	same as #30	same as #30
34	`tests/st/`	paged_attention_unroll_4dims	Case1	onboard	bf16	1e-3/1e-3	24	4	query:[256,1,16,128] (4D), rest same as #28	out:[256,1,16,128] fp32
35	`tests/st/`	paged_attention_unroll_4dims	Case2	onboard	bf16	1e-3/1e-3	24	4	query:[64,1,64,128] (4D), rest same as #29	out:[64,1,64,128] fp32
36	`tests/st/`	paged_attention_unroll_4dims	Case3	onboard	bf16	1e-3/1e-3	24	4	query:[64,1,64,256] (4D), rest same as #30	out:[64,1,64,256] fp32
37	`tests/st/`	batch_paged_attention	Case1	onboard	bf16	1e-3/1e-3	24	4	same as #28	same as #28
38	`tests/st/`	batch_paged_attention	Case2	onboard	bf16	1e-3/1e-3	24	4	same as #29	same as #29
39	`tests/st/`	batch_paged_attention	Case3	onboard	bf16	1e-3/1e-3	24	4	same as #30	same as #30
40	`tests/st/`	benchmark_bgemm	Case0	onboard	fp32	1e-3/1e-3	24	4	A:[1000,2,128,128] fp32, B:[1000,2,128,128] fp32, config:[4] i64	C:[1000,128,128] fp32
41	`tests/st/`	benchmark_bgemm	Case1	onboard	fp32	1e-3/1e-3	24	4	A:[128,2,128,128] fp32, B:[128,2,128,128] fp32, config:[4] i64	C:[128,128,128] fp32
42	`tests/st/`	benchmark_bgemm	Case2	onboard	fp32	1e-3/1e-3	24	4	A:[512,2,128,128] fp32, B:[512,2,128,128] fp32, config:[4] i64	C:[512,128,128] fp32
43	`tests/st/`	benchmark_bgemm	Case3	onboard	fp32	1e-3/1e-3	24	4	A:[512,2,128,128] fp32, B:[512,2,128,128] fp32, config:[4] i64 (incore_loop=16)	C:[512,128,128] fp32
44	`tests/st/`	benchmark_bgemm	Case4	onboard	fp32	1e-3/1e-3	24	4	A:[64,4,128,128] fp32, B:[64,4,128,128] fp32, config:[4] i64 (grid_k=4)	C:[64,128,128] fp32
45	`tests/st/`	alternating_matmul_add	default	onboard	fp32	1e-3/1e-3	24	4	A:[1,1,128,128] fp32, B:[1,1,128,128] fp32, X:[1,1,128,128] fp32, Y:[1,1,128,128] fp32	C:[1,1,128,128] fp32, Z:[1,1,128,128] fp32
46	`tests/st/`	alternating_matmul_add	Case1	onboard	fp32	1e-3/1e-3	24	4	A:[500,4,128,128], B:[500,4,128,128], X:[500,4,128,128], Y:[500,4,128,128]	C:[500,4,128,128], Z:[500,4,128,128]
47	`tests/st/`	alternating_matmul_add	Case2	onboard	fp32	1e-3/1e-3	24	4	A:[512,2,128,128], B:[512,2,128,128], X:[512,5,128,128], Y:[512,5,128,128]	C:[512,2,128,128], Z:[512,5,128,128]
48	`tests/st/`	test_explicit_fatal	default	sim	-	-	24	4	(none)	(none; negative test, rc=-9)
49	`tests/st/`	test_l3_dependency	default	sim+onboard	fp32	1e-5/1e-5	3	4	a:[16384] fp32, b:[16384] fp32	f:[16384] fp32
50	`tests/st/`	test_l3_group	default	sim+onboard	fp32	1e-5/1e-5	3	4	a:[16384] fp32, b:[16384] fp32 (x2 chips)	f:[16384] fp32 (x2 chips)
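
To make the PA input columns concrete, here is a small pure-Python reference for a single sequence. It is a sketch under stated assumptions, not the repository's compute_golden(): it assumes kc/vc are laid out as [num_blocks, block_size, kv_heads, head_dim], that bt is a block table mapping logical block index to physical block, that cl is the context length, and that scores are scaled by 1/sqrt(head_dim). The function name paged_attention_ref is hypothetical.

```python
import math

def paged_attention_ref(query, key_cache, value_cache, block_table, context_len):
    """query: [heads][hd]; caches: [blocks][block_size][kv_heads][hd]."""
    block_size = len(key_cache[0])
    kv_heads = len(key_cache[0][0])
    head_dim = len(query[0])
    group = len(query) // kv_heads  # query heads sharing one KV head
    scale = 1.0 / math.sqrt(head_dim)
    out = []
    for h, q in enumerate(query):
        kv = h // group
        scores, rows = [], []
        for pos in range(context_len):
            # Translate the logical token position into (physical block, slot).
            block = block_table[pos // block_size]
            slot = pos % block_size
            k = key_cache[block][slot][kv]
            scores.append(scale * sum(a * b for a, b in zip(q, k)))
            rows.append(value_cache[block][slot][kv])
        m = max(scores)
        p = [math.exp(s - m) for s in scores]
        total = sum(p)
        out.append(
            [sum(pi * v[d] for pi, v in zip(p, rows)) / total for d in range(head_dim)]
        )
    return out
```

With all-zero keys the softmax is uniform, so the output reduces to the mean of the value rows reached through the block table, which makes the indexing easy to check by hand.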

Summary statistics

examples/a2a3/tensormap_and_ringbuffer/

Category	Test directories	Total cases
Compute (bgemm, vector_example, mixed_example)	3	4
PA (paged_attention, batch_pa, multi-round-pa, pa_ringbuffer)	4	14
SPMD (basic, sync_start, stress, edge, aiv, multiblock_mix, multiblock_aiv, starvation)	8	8
Other (scalar_data_test)	1	1
Subtotal	16	27

tests/st/a2a3/tensormap_and_ringbuffer/

Category	Test directories	Total cases
PA (paged_attention, pa_unroll, pa_unroll_4dims, batch_pa)	4	12
Compute (benchmark_bgemm, alternating_matmul_add)	2	8
L3 tests (test_l3_dependency, test_l3_group)	2	2
Negative test (test_explicit_fatal)	1	1
Subtotal	9	23

Grand total

	Test directories	Total cases
examples	16	27
tests/st	9	23
Total	25	50

Notes on key differences

Of the 16 example directories, 14 pass CI on sim+onboard; the other two, paged_attention_ringbuffer and scalar_data_test, are onboard-only
The device_test cases under tests/st (PA/bgemm) run only onboard; test_l3_dependency/test_l3_group run on sim+onboard
test_explicit_fatal is the only sim-only case (a negative test that verifies rc=-9)
bgemm under examples uses block_dim=3; benchmark_bgemm under tests/st uses block_dim=24
batch_paged_attention under examples uses bf16 (matching the kernel's bfloat16_t), with precision 1e-3/1e-3
PA cases under examples are small-scale (head_dim=16, batch<=4); PA cases under tests/st are large-scale (head_dim=128/256, batch=64/256)
multi-round-paged-attention is the multi-round benchmark variant of PA (10 rounds) and exists only under examples
All 8 SPMD cases live under examples; tests/st has no SPMD cases
paged_attention_unroll_4dims is the only PA test that uses a 4D input shape (query shape [batch,1,heads,hd])
All cases use thread_num=4
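
The precision column lists a pair such as 1e-3/1e-3 or 0/0. Assuming R/A stands for relative/absolute tolerance and that the two are combined in the same way as numpy.allclose (an assumption; the comparison code is not shown in this description), the check can be sketched as follows. The helper name within_tolerance is hypothetical.

```python
def within_tolerance(actual, expected, rtol, atol):
    """Element-wise |actual - expected| <= atol + rtol * |expected|."""
    return all(
        abs(a - e) <= atol + rtol * abs(e)
        for a, e in zip(actual, expected)
    )
```

Under this reading, 0/0 (used by the SPMD cases) demands an exact match, while tightening a PA case from 1e-2/1e-2 to 1e-3/1e-3 shrinks the allowed error by a factor of ten.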

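Changes after migration

All cases are unified into the @scene_test class format (test_*.py).
The original golden.py and kernel_config.py are deleted and folded into CALLABLE + CASES + generate_args() + compute_golden().
Small-scale example cases and large-scale st cases are merged into the same file: small-scale cases run automatically on sim+onboard, while large-scale cases are onboard-only and manual.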
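
The shape of such a class can be sketched as below. Only the names @scene_test, CALLABLE, CASES, generate_args() and compute_golden() come from this description; the decorator body, the dictionary keys ("platforms", "manual", "n"), the method signatures and the vector-add kernel are all hypothetical stand-ins so that the example runs on its own.

```python
def scene_test(cls):
    """Stand-in for the project's @scene_test decorator (real behaviour not shown here)."""
    cls.is_scene_test = True
    return cls

@scene_test
class VectorAddScene:
    # Hypothetical description of what gets launched; replaces kernel_config.py.
    CALLABLE = {"kernel": "vector_add"}

    # Small case: automatic on sim+onboard. Large case: onboard-only and manual.
    CASES = [
        {"name": "CaseSmall1", "platforms": ["sim", "onboard"], "manual": False, "n": 8},
        {"name": "Case1", "platforms": ["onboard"], "manual": True, "n": 16384},
    ]

    def generate_args(self, case):
        """Build the inputs for one case; replaces the input half of golden.py."""
        n = case["n"]
        return {"a": [float(i) for i in range(n)], "b": [2.0] * n}

    def compute_golden(self, args):
        """Compute the expected output; replaces the reference half of golden.py."""
        return [x + y for x, y in zip(args["a"], args["b"])]
```

Keeping the case list, the input generator and the reference in one class means a small and a large case can share the same generate_args() and compute_golden() and differ only in their parameters.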

Change details

Orig. #	Test name	Original case	New location	New case	Changed fields	Before	After	Status
1	bgemm	default	`examples/.../bgemm/test_bgemm.py`	default	(no change)	sim+onboard	sim+onboard	✅ Migrated
2	vector_example	default	`examples/.../vector_example/test_vector_example.py`	default	(no change)	sim+onboard	sim+onboard	✅ Migrated
3	mixed_example	case1	`tests/st/.../mixed_example/test_mixed_example.py`	case1	(no change)	sim+onboard	sim+onboard	✅ Migrated
4	mixed_example	case2	`tests/st/.../mixed_example/test_mixed_example.py`	case2	(no change)	sim+onboard	sim+onboard	✅ Migrated
5	paged_attention	Case1	`examples/.../paged_attention/test_paged_attention.py`	CaseSmall1	name, precision	sim+onboard, 1e-2/1e-2	sim+onboard, 1e-3/1e-3	✅ Merged
6	paged_attention	Case2	`examples/.../paged_attention/test_paged_attention.py`	CaseSmall2	name, precision	sim+onboard, 1e-2/1e-2	sim+onboard, 1e-3/1e-3	✅ Merged
7	paged_attention	CaseVarSeq2	`examples/.../paged_attention/test_paged_attention.py`	CaseVarSeq2	precision	sim+onboard, 1e-2/1e-2	sim+onboard, 1e-3/1e-3	✅ Merged
8	paged_attention	CaseVarSeq4	`examples/.../paged_attention/test_paged_attention.py`	CaseVarSeq4	precision	sim+onboard, 1e-2/1e-2	sim+onboard, 1e-3/1e-3	✅ Merged
9	batch_paged_attention	Case1	`examples/.../batch_paged_attention/test_batch_paged_attention.py`	CaseSmall1	name, precision, dtype	sim+onboard, 1e-2/1e-2, fp16	sim+onboard, 1e-3/1e-3, bf16	✅ Merged
10	batch_paged_attention	Case2	`examples/.../batch_paged_attention/test_batch_paged_attention.py`	CaseSmall2	name, precision, dtype	sim+onboard, 1e-2/1e-2, fp16	sim+onboard, 1e-3/1e-3, bf16	✅ Merged
11	batch_paged_attention	Case3	`examples/.../batch_paged_attention/test_batch_paged_attention.py`	CaseSmall3	name, precision, dtype	sim+onboard, 1e-2/1e-2, fp16	sim+onboard, 1e-3/1e-3, bf16	✅ Merged
12	batch_paged_attention	CaseVarSeq2	`examples/.../batch_paged_attention/test_batch_paged_attention.py`	CaseVarSeq2	precision, dtype	sim+onboard, 1e-2/1e-2, fp16	sim+onboard, 1e-3/1e-3, bf16	✅ Merged
13	batch_paged_attention	CaseVarSeq4	`examples/.../batch_paged_attention/test_batch_paged_attention.py`	CaseVarSeq4	precision, dtype	sim+onboard, 1e-2/1e-2, fp16	sim+onboard, 1e-3/1e-3, bf16	✅ Merged
14	multi-round-pa	Case1	`tests/st/.../multi_round_paged_attention/test_multi_round_paged_attention.py`	Case1	(no change)	sim+onboard	sim+onboard	✅ Merged
15	multi-round-pa	Case2	`tests/st/.../multi_round_paged_attention/test_multi_round_paged_attention.py`	Case2	(no change)	sim+onboard	sim+onboard	✅ Merged
16	multi-round-pa	CaseVarSeq2	`tests/st/.../multi_round_paged_attention/test_multi_round_paged_attention.py`	CaseVarSeq2	(no change)	sim+onboard	sim+onboard	✅ Merged
17	multi-round-pa	CaseVarSeq4	`tests/st/.../multi_round_paged_attention/test_multi_round_paged_attention.py`	CaseVarSeq4	(no change)	sim+onboard	sim+onboard	✅ Merged
18	pa_ringbuffer	ringbuffer_stress	`examples/.../paged_attention_ringbuffer/test_paged_attention_ringbuffer.py`	ringbuffer_stress	(no change)	onboard	onboard	✅ Migrated
19	scalar_data_test	default	`examples/.../scalar_data_test/test_scalar_data.py`	default	(no change)	onboard	onboard	✅ Migrated
20	spmd_basic	Case1	`tests/st/.../spmd_basic/test_spmd_basic.py`	Case1	(no change)	sim+onboard	sim+onboard	✅ Migrated
21	spmd_sync_start	Case1	`tests/st/.../spmd_sync_start/test_spmd_sync_start.py`	Case1	(no change)	sim+onboard	sim+onboard	✅ Migrated
22	spmd_sync_start_stress	Case1	`tests/st/.../spmd_sync_start_stress/test_spmd_sync_start_stress.py`	Case1	(no change)	sim+onboard	sim+onboard	✅ Migrated
23	spmd_sync_start_edge	Case1	`tests/st/.../spmd_sync_start_edge/test_spmd_sync_start_edge.py`	Case1	(no change)	sim+onboard	sim+onboard	✅ Migrated
24	spmd_sync_start_aiv	Case1	`tests/st/.../spmd_sync_start_aiv/test_spmd_sync_start_aiv.py`	Case1	(no change)	sim+onboard	sim+onboard	✅ Migrated
25	spmd_multiblock_mix	Case1	`tests/st/.../spmd_multiblock_mix/test_spmd_multiblock_mix.py`	Case1	(no change)	sim+onboard	sim+onboard	✅ Migrated
26	spmd_multiblock_aiv	Case1	`tests/st/.../spmd_multiblock_aiv/test_spmd_multiblock_aiv.py`	Case1	(no change)	sim+onboard	sim+onboard	✅ Migrated
27	spmd_starvation	Case1	`tests/st/.../spmd_starvation/test_spmd_starvation.py`	Case1	(no change)	sim+onboard	sim+onboard	✅ Migrated
28	paged_attention (st)	Case1	`examples/.../paged_attention/test_paged_attention.py`	Case1	location	onboard (tests/st/)	onboard (examples/, merged)	✅ Merged
29	paged_attention (st)	Case2	`examples/.../paged_attention/test_paged_attention.py`	Case2	location	onboard (tests/st/)	onboard (examples/, merged)	✅ Merged
30	paged_attention (st)	Case3	`examples/.../paged_attention/test_paged_attention.py`	Case3	location	onboard (tests/st/)	onboard (examples/, merged)	✅ Merged
31	pa_unroll (st)	Case1	`tests/st/.../paged_attention_unroll/test_paged_attention_unroll.py`	Case1	(no change)	onboard	onboard	✅ Migrated
32	pa_unroll (st)	Case2	`tests/st/.../paged_attention_unroll/test_paged_attention_unroll.py`	Case2	(no change)	onboard	onboard	✅ Migrated
33	pa_unroll (st)	Case3	`tests/st/.../paged_attention_unroll/test_paged_attention_unroll.py`	Case3	(no change)	onboard	onboard	✅ Migrated
34	pa_unroll_4dims (st)	Case1	`tests/st/.../paged_attention_unroll_4dims/test_paged_attention_unroll_4dims.py`	Case1	(no change)	onboard	onboard	✅ Migrated
35	pa_unroll_4dims (st)	Case2	`tests/st/.../paged_attention_unroll_4dims/test_paged_attention_unroll_4dims.py`	Case2	(no change)	onboard	onboard	✅ Migrated
36	pa_unroll_4dims (st)	Case3	`tests/st/.../paged_attention_unroll_4dims/test_paged_attention_unroll_4dims.py`	Case3	(no change)	onboard	onboard	✅ Migrated
37	batch_paged_attention (st)	Case1	`examples/.../batch_paged_attention/test_batch_paged_attention.py`	Case1	location	onboard (tests/st/)	onboard (examples/, merged)	✅ Merged
38	batch_paged_attention (st)	Case2	`examples/.../batch_paged_attention/test_batch_paged_attention.py`	Case2	location	onboard (tests/st/)	onboard (examples/, merged)	✅ Merged
39	batch_paged_attention (st)	Case3	`examples/.../batch_paged_attention/test_batch_paged_attention.py`	Case3	location	onboard (tests/st/)	onboard (examples/, merged)	✅ Merged
40	benchmark_bgemm (st)	Case0	`tests/st/.../benchmark_bgemm/test_benchmark_bgemm.py`	Case0	platform	onboard	sim+onboard	✅ Migrated
41	benchmark_bgemm (st)	Case1	`tests/st/.../benchmark_bgemm/test_benchmark_bgemm.py`	Case1	platform	onboard	sim+onboard	✅ Migrated
42	benchmark_bgemm (st)	Case2	`tests/st/.../benchmark_bgemm/test_benchmark_bgemm.py`	Case2	platform	onboard	sim+onboard	✅ Migrated
43	benchmark_bgemm (st)	Case3	`tests/st/.../benchmark_bgemm/test_benchmark_bgemm.py`	Case3	platform	onboard	sim+onboard	✅ Migrated
44	benchmark_bgemm (st)	Case4	`tests/st/.../benchmark_bgemm/test_benchmark_bgemm.py`	Case4	platform	onboard	sim+onboard	✅ Migrated
45	alternating_matmul_add (st)	default	`tests/st/.../alternating_matmul_add/test_alternating_matmul_add.py`	default	(no change)	onboard	onboard	✅ Migrated
46	alternating_matmul_add (st)	Case1	`tests/st/.../alternating_matmul_add/test_alternating_matmul_add.py`	Case1	(no change)	onboard	onboard	✅ Migrated
47	alternating_matmul_add (st)	Case2	`tests/st/.../alternating_matmul_add/test_alternating_matmul_add.py`	Case2	(no change)	onboard	onboard	✅ Migrated
48	explicit_fatal (st)	default	`tests/st/.../test_explicit_fatal.py`	default	(no change)	sim	sim	✅ Migrated
49	test_l3_dependency (st)	default	`tests/st/.../test_l3_dependency.py`	default	(no change)	sim+onboard	sim+onboard	✅ Migrated
50	test_l3_group (st)	default	`tests/st/.../test_l3_group.py`	default	(no change)	sim+onboard	sim+onboard	✅ Migrated

Post-migration statistics

Location	Test directories	Cases	sim+onboard (auto)	onboard (manual)	onboard (auto)
examples/	6	19	10	7	2
tests/st/	17	31	19	12	0
Total	23	50	29	19	2
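
A quick arithmetic check of this table, copying its numbers verbatim: each row's three platform buckets should add up to its case count, and the total row should equal the column-wise sum of the two location rows. The function name table_is_consistent is made up for this sketch.

```python
# Rows copied from the post-migration statistics table:
# (test directories, cases, sim+onboard auto, onboard manual, onboard auto)
ROWS = {
    "examples/": (6, 19, 10, 7, 2),
    "tests/st/": (17, 31, 19, 12, 0),
}
TOTAL = (23, 50, 29, 19, 2)

def table_is_consistent(rows, total):
    # Each row's three platform buckets must add up to its case count.
    for _dirs, cases, *buckets in rows.values():
        if sum(buckets) != cases:
            return False
    # Each column of the total row must equal the column-wise sum of the rows.
    column_sums = tuple(sum(col) for col in zip(*rows.values()))
    return column_sums == total and sum(total[2:]) == total[1]
```

The check only confirms that the table is internally consistent; it says nothing about which individual case falls into which bucket.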

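Migration summary

The original golden.py and kernel_config.py are deleted; everything is unified into @scene_test classes
paged_attention: 4 small example cases + 3 large st cases → merged into 7 cases; small cases renamed CaseSmall1/2
batch_paged_attention: 5 small example cases + 3 large st cases → merged into 8 cases; small cases renamed CaseSmall1/2/3
multi_round_paged_attention: originally 1 case → extended to 4 cases (taken from the example golden.py)
_PA_KERNELS path still points to examples/.../paged_attention/kernels (same as before the migration)
All 50 of 50 cases confirmed; none missing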
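
The renaming in the paged_attention and batch_paged_attention merges exists because both sides used names such as Case1. A sketch of that merge rule, assuming a hypothetical merge_cases helper and hypothetical "platforms"/"manual" keys (the order of cases in the real files is not stated here):

```python
import re

def merge_cases(small, large):
    """Merge small example cases and large st cases into one case list."""
    merged = []
    for name in small:
        # Case1 -> CaseSmall1; names such as CaseVarSeq2 are kept unchanged.
        match = re.fullmatch(r"Case(\d+)", name)
        new_name = f"CaseSmall{match.group(1)}" if match else name
        merged.append(
            {"name": new_name, "platforms": ("sim", "onboard"), "manual": False}
        )
    for name in large:
        merged.append({"name": name, "platforms": ("onboard",), "manual": True})
    names = [case["name"] for case in merged]
    if len(names) != len(set(names)):
        raise ValueError("case names must stay unique after the merge")
    return merged
```

For paged_attention this yields the 7 cases listed in the change details table: CaseSmall1, CaseSmall2, CaseVarSeq2, CaseVarSeq4, Case1, Case2 and Case3.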

gemini-code-assist

Code Review

This pull request refactors paged attention kernels to support production-scale tile configurations via runtime dispatch and optimizes performance through improved pipeline overlap and zero-copy UB reshapes using TRESHAPE. It also introduces profiling in the orchestration layer and updates test suites. Review feedback highlights a critical issue in the dispatch logic where small-scale configurations are incorrectly handled when the tile size is 16. Furthermore, several newly added comments in the kernel entry points are misleading as they reference out-of-bounds arguments.
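
The kernel dispatch code is not reproduced on this page, so the following only illustrates the kind of problem the review describes, modelled in Python for brevity even though the kernels are C++. It assumes, purely for illustration, that a tile config is a (rows, cols) pair and that a small-scale 16x16 config exists alongside 16x128 and 64x128; keying the lookup on the full pair, with an explicit error for anything else, keeps a tile size of 16 from being routed to the 16x128 kernel by accident. All kernel names here are hypothetical.

```python
# Hypothetical kernel table keyed on the full (tile_rows, tile_cols) pair.
KERNELS = {
    (16, 128): "kernel_16x128",
    (64, 128): "kernel_64x128",
    (16, 16): "kernel_16x16",  # assumed small-scale config
}

def dispatch(tile_rows, tile_cols):
    """Pick a kernel by the whole tile shape and fail loudly on unknown shapes."""
    try:
        return KERNELS[(tile_rows, tile_cols)]
    except KeyError:
        raise ValueError(
            f"unsupported tile config {tile_rows}x{tile_cols}"
        ) from None
```

Dispatching on tile_rows alone would send both (16, 16) and (16, 128) down the same path, which is one way a small-scale case could end up in a production-scale kernel.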

examples/a2a3/tensormap_and_ringbuffer/paged_attention/kernels/aic/aic_pv_matmul.cpp

examples/a2a3/tensormap_and_ringbuffer/paged_attention/kernels/aic/aic_qk_matmul.cpp

examples/a2a3/tensormap_and_ringbuffer/paged_attention/kernels/aiv/aiv_online_update.cpp

gemini-code-assist bot reviewed Apr 14, 2026

doraemonmj force-pushed the reuse_fix branch 9 times, most recently from ecac927 to 8f5bc46 Compare April 15, 2026 02:22

doraemonmj force-pushed the reuse_fix branch from 8f5bc46 to 865efc5 Compare April 15, 2026 02:40

ChaoWao approved these changes Apr 15, 2026

ChaoWao merged commit b1ff237 into hw-native-sys:main Apr 15, 2026
15 checks passed

ChaoWao merged 1 commit into hw-native-sys:main from doraemonmj:reuse_fix
