[BUG] CuTe DSL 4.5.0 regression: indexing `rmem.load()` vector spills to `.local` memory #3210

### Which component has the problem?

CuTe DSL

### Bug Report

## Description

I'm seeing a codegen regression in `nvidia-cutlass-dsl[cu13]==4.5.0` compared with `4.4.2`. Let me know if this is not the intended use of SSA values.

   A minimal pattern like this:

   ```python
   r = cute.make_rmem_tensor(((1, 3),), BFloat16)
   r[0], r[1], r[2] = a[0], a[1], a[2]

   v = r[None].load().to(Float32)
   s = Float32(0.0)
   for i in cutlass.range(cute.size(r[None]), unroll_full=True):
       s += v[i]
   out[0] = s
 ```

causes CuTe DSL 4.5.0 to materialize `v` in `.local` memory before extracting scalars.

A workaround is to index the rmem tensor first and convert each scalar, instead of indexing the loaded vector:

 ```python
   s = Float32(0.0)
   for i in cutlass.range(cute.size(r[None]), unroll_full=True):
       s += r[None][i].to(Float32)
   out[0] = s
 ```

## Reproducer

 Full standalone reproducer: [cutedsl_vector_load_index_minimal.py](https://gist.github.com/tridao/c50c667e18c9e35b4cb97ae9e90e2745)

## Observed

With `nvidia-cutlass-dsl[cu13]==4.4.2` (`Bad` is the vector-index pattern, `Good` is the scalar-index workaround):

 ```text
   Bad:  local=0 B
   Good: local=0 B
 ```

With `nvidia-cutlass-dsl[cu13]==4.5.0`:

 ```text
   Bad:  local=16 B
   Good: local=0 B
 ```

On 4.5.0, the PTX for the `Bad` variant contains:

 ```ptx
   .local .align 8 .b8  __local_depot0[16];
   .reg .b64  %SP;
   .reg .b64  %SPL;

   mov.b64         %SPL, __local_depot0;
   cvta.local.u64  %SP, %SPL;

   st.v2.b32 [%SP+8], {%r3, %r2};
   ld.b16    %rs7, [%rd5];

   st.v2.b32 [%SP], {%r3, %r2};
   ld.b16    %rs8, [%rd9];
 ```

 Corresponding SASS:

 ```sass
   ST.E.64  desc[UR4][R4.64+0x8], R10 ;
   LD.E.U16 R6, desc[UR4][R6.64] ;
   ST.E.64  desc[UR4][R4.64], R10 ;
   LD.E.U16 R8, desc[UR4][R8.64] ;
 ```
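
One thing worth noting in the SASS above: the scratch traffic appears as generic `ST.E` / `LD.E` instructions rather than `STL` / `LDL`, presumably because the address goes through `cvta.local` in the PTX. A search for `STL`/`LDL` alone would therefore miss it. The following is a minimal sketch of a helper that tallies SASS memory instructions by address space; `memory_ops_by_space` is a hypothetical name, not part of CuTe DSL or any NVIDIA tool, and the opcode table covers only plain loads and stores.

```python
import re
from collections import Counter

# Plain load/store opcode stems and the address space each one targets.
# Assumption: atomics, reductions, and async copies are out of scope here.
_SPACE_BY_OPCODE = {
    "LD": "generic", "ST": "generic",
    "LDG": "global", "STG": "global",
    "LDL": "local", "STL": "local",
    "LDS": "shared", "STS": "shared",
}

# Disassembly listings often wrap addresses and encodings in /* ... */.
_COMMENT = re.compile(r"/\*.*?\*/")


def memory_ops_by_space(sass: str) -> Counter:
    """Count load/store instructions in a SASS listing, keyed by address space."""
    counts = Counter()
    for line in sass.splitlines():
        # Drop comments and predicate guards such as "@P0".
        tokens = [t for t in _COMMENT.sub(" ", line).split()
                  if not t.startswith("@")]
        if not tokens:
            continue
        opcode = tokens[0].split(".")[0]  # "ST.E.64" -> "ST"
        space = _SPACE_BY_OPCODE.get(opcode)
        if space is not None:
            counts[space] += 1
    return counts


issue_sass = """
ST.E.64  desc[UR4][R4.64+0x8], R10 ;
LD.E.U16 R6, desc[UR4][R6.64] ;
ST.E.64  desc[UR4][R4.64], R10 ;
LD.E.U16 R8, desc[UR4][R8.64] ;
"""
print(dict(memory_ops_by_space(issue_sass)))  # {'generic': 4}
```

For a kernel like this one, which is expected to stay entirely in registers for `v`, any generic or local loads and stores around the accumulation loop are a hint that the vector was written to scratch memory.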

4.4.2 generates no `.local` memory for the same source.
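
Because the difference between the two versions shows up as a `.local` declaration in the PTX, it can be checked mechanically. Below is a minimal sketch that sums the sizes of `.local` byte arrays in a PTX string. `local_bytes` is a hypothetical helper (the gist may compute its `local=... B` figure differently), it assumes the PTX text has already been obtained by some other means, and it only recognizes `.b8` array declarations like the `__local_depot0` line shown above.

```python
import re

# Matches declarations such as:
#   .local .align 8 .b8  __local_depot0[16];
# It does not match instructions like "cvta.local.u64", because ".local"
# must be followed by whitespace.
_LOCAL_DECL = re.compile(
    r"\.local\s+(?:\.align\s+\d+\s+)?\.b8\s+\w+\[(\d+)\]\s*;"
)


def local_bytes(ptx: str) -> int:
    """Sum the sizes (in bytes) of all `.local` .b8 arrays declared in `ptx`."""
    return sum(int(size) for size in _LOCAL_DECL.findall(ptx))


# Excerpt of the 4.5.0 PTX from this report.
bad_ptx = """
.local .align 8 .b8  __local_depot0[16];
.reg .b64  %SP;
.reg .b64  %SPL;
mov.b64         %SPL, __local_depot0;
cvta.local.u64  %SP, %SPL;
"""

# Stand-in for PTX with no local depot (illustrative, not real compiler output).
good_ptx = ".reg .b64  %rd<4>;\n"

print(local_bytes(bad_ptx))   # 16
print(local_bytes(good_ptx))  # 0
```

A check like `assert local_bytes(ptx) == 0` could then serve as a regression guard for kernels that are expected to stay entirely in registers.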

## Impact

This caused a roughly 17x slowdown in a larger kernel (Mamba-2 backward pass): ~1.1 ms on 4.4.2 vs ~18.8 ms on 4.5.0, before applying the scalar-index workaround.

## Expected

Indexing a vector loaded from rmem should stay in registers / scalar SSA, as in 4.4.2, and should not require `.local` scratch memory.

