[BUG] CuTe DSL 4.5.0 regression: indexing rmem.load() vector spills to .local memory #3210

@tridao

Which component has the problem?

CuTe DSL

Bug Report

Description

I’m seeing a codegen regression in nvidia-cutlass-dsl[cu13]==4.5.0 relative to 4.4.2. Let me know if this is not the intended use of SSA vectors.

A minimal pattern like this:

r = cute.make_rmem_tensor(((1, 3),), BFloat16)
r[0], r[1], r[2] = a[0], a[1], a[2]

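v = r[None].load().to(Float32)  # load the whole rmem tensor as an SSA vector
s = Float32(0.0)
for i in cutlass.range(cute.size(r[None]), unroll_full=True):
    s += v[i]  # indexing the loaded vector is what spills to .local on 4.5.0
out[0] = s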

causes CuTe DSL 4.5.0 to materialize v in .local memory before extracting scalars.

A workaround is to index the rmem tensor first:

  s = Float32(0.0)
  for i in cutlass.range(cute.size(r[None]), unroll_full=True):
      s += r[None][i].to(Float32)
  out[0] = s

Reproducer

Full standalone reproducer: cutedsl_vector_load_index_minimal.py

Observed

With nvidia-cutlass-dsl[cu13]==4.4.2:

  Bad:  local=0 B
  Good: local=0 B

With nvidia-cutlass-dsl[cu13]==4.5.0:

  Bad:  local=16 B
  Good: local=0 B

On 4.5.0, the PTX for the bad variant contains:

  .local .align 8 .b8  __local_depot0[16];
  .reg .b64  %SP;
  .reg .b64  %SPL;

  mov.b64         %SPL, __local_depot0;
  cvta.local.u64  %SP, %SPL;

  st.v2.b32 [%SP+8], {%r3, %r2};
  ld.b16    %rs7, [%rd5];

  st.v2.b32 [%SP], {%r3, %r2};
  ld.b16    %rs8, [%rd9];

Corresponding SASS:

  ST.E.64  desc[UR4][R4.64+0x8], R10 ;
  LD.E.U16 R6, desc[UR4][R6.64] ;
  ST.E.64  desc[UR4][R4.64], R10 ;
  LD.E.U16 R8, desc[UR4][R8.64] ;

4.4.2 generates no .local memory for the same source.
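
For anyone who wants to check their own kernels for the same spill, here is a minimal sketch that scans PTX text for `__local_depot` declarations and reports their size. It is not necessarily how the reproducer computes its `local=... B` figures (that method is not shown in this report); the helper names `local_depot_bytes` and `total_local_bytes` are hypothetical, and the regex only assumes the declaration form seen in the PTX excerpt above.

```python
import re

# Matches PTX local-depot declarations of the form seen in this report, e.g.
#   .local .align 8 .b8  __local_depot0[16];
# Assumption: depots are always declared as .b8 arrays with an explicit size.
_LOCAL_DEPOT = re.compile(
    r"\.local\s+\.align\s+\d+\s+\.b8\s+(__local_depot\d+)\[(\d+)\]\s*;"
)


def local_depot_bytes(ptx: str) -> dict:
    """Return {depot_name: size_in_bytes} for each local depot declared in ptx."""
    return {name: int(size) for name, size in _LOCAL_DEPOT.findall(ptx)}


def total_local_bytes(ptx: str) -> int:
    """Sum the sizes of all local depots declared in ptx (0 if there are none)."""
    return sum(local_depot_bytes(ptx).values())


# Excerpt of the 4.5.0 PTX quoted in this report.
SAMPLE_PTX = """
  .local .align 8 .b8  __local_depot0[16];
  .reg .b64  %SP;
  .reg .b64  %SPL;

  mov.b64         %SPL, __local_depot0;
  cvta.local.u64  %SP, %SPL;
"""

if __name__ == "__main__":
    print(f"local={total_local_bytes(SAMPLE_PTX)} B")
```

A nonzero result means the compiler reserved per-thread stack space for the kernel. The pattern only matches declarations, so references to a depot such as `mov.b64 %SPL, __local_depot0;` are not counted twice.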

Impact

This caused a severe performance regression in a real kernel (Mamba-2 backward): ~1.1 ms on 4.4.2 vs ~18.8 ms on 4.5.0, roughly 17x slower, before applying the scalar-index workaround.

Expected

Indexing a vector loaded from rmem should stay in registers / scalar SSA, as in 4.4.2, and should not require .local scratch memory.
