Which component has the problem?
CuTe DSL
Bug Report
Description
I’m seeing a codegen regression in nvidia-cutlass-dsl[cu13]==4.5.0 compared with 4.4.2. Let me know if this is not the intended use of SSA.
A minimal pattern like this:
r = cute.make_rmem_tensor(((1, 3),), BFloat16)
r[0], r[1], r[2] = a[0], a[1], a[2]
v = r[None].load().to(Float32)
s = Float32(0.0)
for i in cutlass.range(cute.size(r[None]), unroll_full=True):
    s += v[i]
out[0] = s
causes CuTe DSL 4.5.0 to materialize v in .local memory before extracting scalars.
A workaround is to index the rmem tensor first:
s = Float32(0.0)
for i in cutlass.range(cute.size(r[None]), unroll_full=True):
    s += r[None][i].to(Float32)
out[0] = s
Reproducer
Full standalone reproducer: cutedsl_vector_load_index_minimal.py
Observed
With nvidia-cutlass-dsl[cu13]==4.4.2:
Bad: local=0 B
Good: local=0 B
With nvidia-cutlass-dsl[cu13]==4.5.0:
Bad: local=16 B
Good: local=0 B
4.5.0 bad PTX contains:
.local .align 8 .b8 __local_depot0[16];
.reg .b64 %SP;
.reg .b64 %SPL;
mov.b64 %SPL, __local_depot0;
cvta.local.u64 %SP, %SPL;
st.v2.b32 [%SP+8], {%r3, %r2};
ld.b16 %rs7, [%rd5];
st.v2.b32 [%SP], {%r3, %r2};
ld.b16 %rs8, [%rd9];
Corresponding SASS:
ST.E.64 desc[UR4][R4.64+0x8], R10 ;
LD.E.U16 R6, desc[UR4][R6.64] ;
ST.E.64 desc[UR4][R4.64], R10 ;
LD.E.U16 R8, desc[UR4][R8.64] ;
4.4.2 generates no .local memory for the same source.
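For anyone who wants to check their own PTX dumps for the same symptom, here is a minimal sketch that sums the sizes of `.local` depot declarations in a PTX string. The helper name `local_bytes` and the regex are my own assumptions, not part of the reproducer or of CuTe DSL, and it only recognizes `.b8` array declarations of the form shown above.

```python
import re

# Matches PTX local-memory array declarations such as:
#   .local .align 8 .b8 __local_depot0[16];
# Assumption: local scratch is declared as a .b8 array, as in the dump above.
_LOCAL_DECL = re.compile(r"\.local\s+\.align\s+\d+\s+\.b8\s+\w+\[(\d+)\]")


def local_bytes(ptx: str) -> int:
    """Sum the byte sizes of all .local .b8 array declarations in a PTX string."""
    return sum(int(n) for n in _LOCAL_DECL.findall(ptx))


# Excerpt of the 4.5.0 PTX quoted in this report.
bad_ptx = """
.local .align 8 .b8 __local_depot0[16];
.reg .b64 %SP;
.reg .b64 %SPL;
"""

print(f"local={local_bytes(bad_ptx)} B")
```

A PTX dump with no `.local` declarations gives 0, which matches what 4.4.2 produces for the same source.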
Impact
This caused a roughly 17x slowdown in a larger kernel (Mamba-2 backward): ~1.1 ms on 4.4.2 vs ~18.8 ms on 4.5.0 before applying the scalar-index workaround.
Expected
Indexing a vector loaded from rmem should stay in registers / scalar SSA, as in 4.4.2, and should not require .local scratch memory.