This code packs 64-bit integers down to 40 bits (5 bytes) each. From https://stackoverflow.com/questions/79753012/repacking-64-to-40-bits-in-avx2:
```c
#include <stdint.h>
#include <string.h>
#define N 10240   // N isn't shown in the original snippet; any compile-time constant works (10240 matches the Godbolt loop bound)

void repack_memcpy2(uint8_t* __restrict out, uint64_t* in) {
    for (int i = 0; i < N; i++) {
        uint64_t tmp = in[i];        // separate qword load seems to help GCC/Clang.
        memcpy(out + 5*i, &tmp, 5);  // more efficient is overlapping 8-byte stores, but this more clearly demos the bad auto-vectorization
    }
}
```
In Clang 15 and later, with `-O3 -march=x86-64-v3 -mtune=skylake` (or `-mtune=raptorlake`), or with `-march=znver5 -fno-unroll-loops`, we get code which uses `vpaddq` to calculate `i*5` and extracts those values to GPRs for use in addressing modes, instead of just using the immediate displacement part of the addressing mode and a scalar add. This is obviously bad; I haven't benchmarked it to see how much it hurts real-world performance on my Skylake. I expect it's worse on Ice Lake and recent Zen, which can do 2 stores per clock and merge them in the store buffer if they're to the same cache line.
https://godbolt.org/z/PGaPY8qeh
```asm
# current trunk -O3 -march=raptorlake (basically x86-64-v3 features)
# RDI = dst, RSI=src
# ahead of the loop, YMM0 = [0,1,2,3], YMM1 = set1_epi64(20), YMM2 = set1_epi64(8) with broadcast loads
mov eax, 4
.LBB3_1:
vpsllq ymm3, ymm0, 2
vpaddq ymm3, ymm3, ymm0 # ymm3 = ymm0 + 4*ymm0
vmovq r8, xmm3
vpextrq r9, xmm3, 1
vextracti128 xmm4, ymm3, 1
vpextrq rcx, xmm4, 1
mov rdx, qword ptr [rsi + 8*rax - 8]
mov r10, qword ptr [rsi + 8*rax - 16] # normal scalar loads; indexed addressing mode for no reason, although it only costs code-size not speed for mov loads.
mov r11, qword ptr [rsi + 8*rax - 32]
mov rbx, qword ptr [rsi + 8*rax - 24]
mov dword ptr [rdi + r8], r11d # using r8 extracted from YMM3 as the offset
shr r11, 32
mov byte ptr [rdi + r8 + 4], r11b # and store the 5th byte separately
mov dword ptr [rdi + r9], ebx
shr rbx, 32
mov byte ptr [rdi + r9 + 4], bl
vmovq r8, xmm4
mov dword ptr [rdi + r8], r10d
shr r10, 32
mov byte ptr [rdi + r8 + 4], r10b
mov dword ptr [rdi + rcx], edx
shr rdx, 32
mov byte ptr [rdi + rcx + 4], dl
vpaddq ymm3, ymm3, ymm1 # ymm3 += set1(20)
... # repeat most of the above again
vpaddq ymm0, ymm0, ymm2 # ymm0 += 8 scalar elements per iteration, so this is vectorizing the ++i
add rax, 8
cmp rax, 10244
jne .LBB3_1
...
```
If we're not going to invent SIMD shuffles to move the actual data around, we should be making scalar asm that simply increments a pointer by 5*8 = 40 and uses offsets like `[rdi + 5]`, `[rdi + 5+4]`, `[rdi + 10]`, etc. in an unrolled loop, without any address-computation overhead and with equal code-size per store instruction for offsets < 128 where an 8-bit displacement fits.
Stores with a 1-register addressing-mode can also run their store-address uop on port 7 on Haswell/Skylake, unlike these indexed stores. Less of a problem on Ice Lake and later where store-address ports are all fully capable and don't compete with loads.
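For concreteness, here is a minimal C sketch of that strength-reduced shape (my own illustration, not code from the Stack Overflow post; the function name and the assumption that the element count is a multiple of 8 are mine):

```c
#include <stdint.h>
#include <string.h>

// Hand-done strength reduction: constant displacements 0,5,...,35 from 'out',
// plus one pointer increment per 8 elements. Assumes n is a multiple of 8.
void repack_strength_reduced(uint8_t* __restrict out, const uint64_t* in, int n) {
    for (int i = 0; i < n; i += 8) {
        for (int j = 0; j < 8; j++) {    // compilers fully unroll this inner loop
            uint64_t tmp = in[i + j];
            memcpy(out + 5*j, &tmp, 5);  // constant offset, fits in an 8-bit displacement
        }
        out += 40;                       // 5 bytes * 8 elements
    }
}
```

The address math here costs one `add` per 8 stores and nothing else, which is the baseline the auto-vectorized version should be compared against.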
Or if we're feeling ambitious, optimizing it to do 8-byte stores which overlap the previous one by 3 bytes, except on the final loop iteration, is somewhat faster in @swineone's testing on Raptor Lake, and should be a lot faster on Skylake and earlier (1/clock stores). That would still satisfy the as-if rule, since this code unconditionally writes all bytes in the same range and reads the same range, and there's no volatile or atomic. A user can write that manually like this, but it would be nice if they didn't have to:
```c
void repack_memcpy_overlap(uint8_t* __restrict out, uint64_t* in) {
    for (int i = 0; i < N-1; i++) {
        uint64_t tmp = in[i];
        memcpy(&out[5*i], &tmp, sizeof(tmp)); // 8-byte memcpy gives qword stores
    }
    memcpy(&out[5 * (N-1)], &in[N-1], 5);     // avoid overshoot on the last element
}
```
This actually auto-vectorizes in a really bad way, with AVX-512 scatter stores even with `-march=znver5` where they're very slow. (I was worried about a correctness problem with overlapping scatter, but Intel documents scatters as ordered from least-significant to most-significant element when elements overlap, and only unordered otherwise. AMD doesn't yet document their AVX-512 instructions, so I assume the same as Intel.)
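For reference, a hedged intrinsics sketch of roughly what that scatter-based vectorization amounts to (my own code and function name, not the compiler's output; it leans on the documented least-to-most-significant ordering of overlapping scatter elements):

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// Sketch only: each vpscatterqq does eight overlapping qword stores at byte
// offsets 5*i; the final element is handled separately to avoid overshoot.
void repack_scatter_sketch(uint8_t* __restrict out, const uint64_t* in, size_t n) {
    __m512i idx = _mm512_setr_epi64(0, 5, 10, 15, 20, 25, 30, 35); // byte offsets
    size_t i = 0;
    for (; i + 8 < n; i += 8) {                    // never include the last element
        __m512i v = _mm512_loadu_si512(in + i);
        _mm512_i64scatter_epi64(out, idx, v, 1);   // scale=1: idx already in bytes
        idx = _mm512_add_epi64(idx, _mm512_set1_epi64(40));
    }
    for (; i < n; i++) {                           // scalar tail
        uint64_t tmp = in[i];
        memcpy(out + 5*i, &tmp, (i + 1 < n) ? 8 : 5);
    }
}
```

Correctness aside, this is the slow path on current CPUs; the sketch is only to make the overlap/ordering question concrete.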
And for x86-64-v3 with most tuning options, it seems to want to do a manual scatter from vector regs, doing similar vectorized address calculation to feed stores like `vmovq qword ptr [rdi + rdx], xmm5` and `vpextrq qword ptr [rdi + rcx], xmm5, 1`.

This is bad for Zen4/5 since XMM stores are 1/clock, even 64-bit `vmovq mem, xmm`, while GPR stores are 2/clock. But fortunately `-march=x86-64-v3 -mtune=znver5` just goes scalar for this. (With an excessive unroll factor of 20, though. Has anyone checked Clang's tuning settings in a non-small program with realistic I-cache pressure? Spending a lot more code bytes for a tiny bit more speed in a microbenchmark isn't worth it when it causes I-cache misses elsewhere, unless you know from PGO that this loop is very long-running.)
Of course there's a lot to be gained here from good vectorization with AVX-512 `vpermb`, or especially `vpermt2b` on znver4 / znver5, like 2x 64-byte loads to get 60 or 64 bytes of 5-byte integers from a 1-uop (on AMD) 2-input shuffle. This is vastly faster than a `vpscatterqq` could ever hope to be, even on Intel.
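To make the `vpermt2b` idea concrete, a rough untuned sketch (my own code, assuming AVX-512 VBMI; the index math and tail handling are illustrative, not a proposal for exact codegen):

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// Output byte k of each 64-byte chunk comes from byte 8*(k/5) + k%5 of the
// 128-byte concatenation of two input vectors; vpermt2b does that in one shuffle.
void repack_vpermt2b_sketch(uint8_t* __restrict out, const uint64_t* in, size_t n) {
    uint8_t idx_bytes[64];
    for (int k = 0; k < 64; k++)
        idx_bytes[k] = (uint8_t)(8 * (k / 5) + k % 5);
    const __m512i idx = _mm512_loadu_si512(idx_bytes);

    size_t i = 0;
    for (; i + 16 <= n; i += 12) {                  // read 16 elements, consume 12
        __m512i a = _mm512_loadu_si512(in + i);
        __m512i b = _mm512_loadu_si512(in + i + 8);
        __m512i packed = _mm512_permutex2var_epi8(a, idx, b);
        _mm512_storeu_si512(out, packed);           // 64B store; bytes 60..63 get rewritten next iteration
        out += 60;                                  // 12 packed 5-byte values
    }
    for (; i < n; i++) {                            // scalar cleanup, 5-byte copy at the very end
        uint64_t tmp = in[i];
        memcpy(out, &tmp, (i + 1 < n) ? 8 : 5);
        out += 5;
    }
}
```

That's one 2-input shuffle per 60 output bytes, instead of per-element address math and stores.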
And a little to be gained with AVX2 on Raptor Lake, according to test results on the Stack Overflow link at the top: 8758 cycles for the `repack_memcpy_overlap` strategy for N = 1440, vs. 7989 cycles for an AVX2 shuffle strategy that sets up for 32-byte stores, vs. 8049 for a simple 256-bit `vpshufb` strategy with overlapping 16-byte `vmovdqu` / `vextracti128` stores. Of course those times are all from compiling with `clang18 -O3 -march=native` on the Raptor Lake test system, so there's unnecessary overhead in the versions that aren't manually vectorized.
But this issue is only about vectorizing the address-indexing calculations in cases like this; let me know which of the other points are non-duplicates you'd like me to open separate issues for.