
Useless vectorization of simple address/index calculations with most -mtune= options other than generic, like -mtune=raptorlake #156568

Description

@pcordes

This code packs 64-bit integers down to 40-bit (5 bytes). From https://stackoverflow.com/questions/79753012/repacking-64-to-40-bits-in-avx2

#include <stdint.h>
#include <string.h>
// N is a compile-time constant array length (defined elsewhere; see the Godbolt link below)

void repack_memcpy2(uint8_t* __restrict out, uint64_t* in) {
    for (int i=0 ; i < N ; i++) {
        uint64_t tmp = in[i];  // separate qword load seems to help GCC/Clang.
        memcpy(out + 5*i, &tmp, 5);    // more efficient is overlapping 8-byte stores, but this more clearly demos the bad auto-vectorization
    }
}

With Clang 15 and later at -O3 -march=x86-64-v3 -mtune=skylake or -mtune=raptorlake, or with -march=znver5 -fno-unroll-loops, we get code which uses vpaddq to calculate 5*i and then extracts those values to GPRs for use in addressing modes, instead of just using the immediate-displacement part of the addressing mode and a scalar add. This is obviously bad. I haven't benchmarked it to see how much it hurts real-world performance on my Skylake; I expect the penalty is larger on Ice Lake and recent Zen, which can do 2 stores per clock and merge them in the store buffer when they're to the same cache line.

https://godbolt.org/z/PGaPY8qeh

# current trunk  -O3 -march=raptorlake  (basically x86-64-v3 features)
# RDI = dst, RSI=src
# ahead of the loop, YMM0 = [0,1,2,3], YMM1 = set1_epi64(20), YMM2 = set1_epi64(8) with broadcast loads
        mov     eax, 4
.LBB3_1:
        vpsllq  ymm3, ymm0, 2
        vpaddq  ymm3, ymm3, ymm0      # ymm3 = ymm0 + 4*ymm0
        vmovq   r8, xmm3
        vpextrq r9, xmm3, 1
        vextracti128    xmm4, ymm3, 1
        vpextrq rcx, xmm4, 1
        mov     rdx, qword ptr [rsi + 8*rax - 8]
        mov     r10, qword ptr [rsi + 8*rax - 16]   # normal scalar loads; indexed addressing mode for no reason, although it only costs code-size not speed for mov loads.
        mov     r11, qword ptr [rsi + 8*rax - 32]
        mov     rbx, qword ptr [rsi + 8*rax - 24]
        mov     dword ptr [rdi + r8], r11d     # using r8 extracted from YMM3 as the offset
        shr     r11, 32
        mov     byte ptr [rdi + r8 + 4], r11b  # and store the 5th byte separately
        mov     dword ptr [rdi + r9], ebx
        shr     rbx, 32
        mov     byte ptr [rdi + r9 + 4], bl
        vmovq   r8, xmm4
        mov     dword ptr [rdi + r8], r10d
        shr     r10, 32
        mov     byte ptr [rdi + r8 + 4], r10b
        mov     dword ptr [rdi + rcx], edx
        shr     rdx, 32
        mov     byte ptr [rdi + rcx + 4], dl

        vpaddq  ymm3, ymm3, ymm1        # ymm3 += set1(20)
...   # repeat most of the above again

        vpaddq  ymm0, ymm0, ymm2   # ymm0 += set1(8): 8 scalar elements per iteration, i.e. the vectorized ++i
        add     rax, 8
        cmp     rax, 10244
        jne     .LBB3_1
...

If we're not going to invent SIMD shuffles to move the actual data around, we should be emitting scalar asm that simply increments a pointer by 5*8 = 40 per unrolled iteration and uses offsets like [rdi + 5], [rdi + 5+4], [rdi + 10], etc., with no address-computation overhead and with equal code-size per store instruction for offsets < 128, where an 8-bit displacement fits.
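
For example, here's a hand-written sketch of the kind of codegen I'd hope for (unrolled by 4 rather than 8 to keep it short; labels and register choices are mine, this is not actual compiler output):

# RDI = dst, RSI = src, RDX = one-past-the-end source pointer computed before the loop
.Lscalar_loop:
        mov     rax, qword ptr [rsi]
        mov     dword ptr [rdi], eax
        shr     rax, 32
        mov     byte ptr [rdi + 4], al
        mov     rax, qword ptr [rsi + 8]
        mov     dword ptr [rdi + 5], eax
        shr     rax, 32
        mov     byte ptr [rdi + 5 + 4], al
        mov     rax, qword ptr [rsi + 16]
        mov     dword ptr [rdi + 10], eax
        shr     rax, 32
        mov     byte ptr [rdi + 10 + 4], al
        mov     rax, qword ptr [rsi + 24]
        mov     dword ptr [rdi + 15], eax
        shr     rax, 32
        mov     byte ptr [rdi + 15 + 4], al
        add     rsi, 32            # 4 source qwords per iteration
        add     rdi, 20            # 4 * 5 destination bytes per iteration
        cmp     rsi, rdx
        jne     .Lscalar_loop

Every store here uses a 1-register addressing mode with a disp8, and the only per-iteration overhead is two pointer adds.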

Stores with a 1-register addressing-mode can also run their store-address uop on port 7 on Haswell/Skylake, unlike these indexed stores. Less of a problem on Ice Lake and later where store-address ports are all fully capable and don't compete with loads.

Or if we're feeling ambitious, optimizing it to do 8-byte stores which each overlap the previous one by 3 bytes (except on the final loop iteration) is somewhat faster in @swineone's testing on Raptor Lake, and should be a lot faster on Skylake and earlier (1/clock stores). That would still satisfy the as-if rule, since this code unconditionally writes all bytes in the same range and reads the same range, and there's no volatile or atomic. A user can write that manually like this, but it would be nice if they didn't have to.

void repack_memcpy_overlap(uint8_t* __restrict out, uint64_t* in) {
    for (int i=0 ; i < N-1 ; i++) {
        uint64_t tmp = in[i];
        memcpy(&out[5*i], &tmp, sizeof(tmp));  // 8-byte memcpy gives qword stores
    }
    memcpy(&out[5 * (N-1)], &in[N-1], 5);  // avoid overshoot on the last
}

This actually auto-vectorizes in a really bad way, with AVX-512 scatter stores even with -march=znver5, where they're very slow. (I was worried about a correctness problem with overlapping scatter elements, but Intel documents scatters as ordered from least-significant to most-significant element when elements overlap, and only unordered otherwise. AMD doesn't yet document their AVX-512 instructions, so I assume they match Intel.)

And for x86-64-v3 with most tuning options, it seems to want to do a manual scatter from vector regs, doing similar vectorized address calculation to feed stores like vmovq qword ptr [rdi + rdx], xmm5 and vpextrq qword ptr [rdi + rcx], xmm5, 1.
This is bad for Zen4/5 since XMM stores are 1/clock, even 64-bit vmovq mem, xmm, while GPR stores are 2/clock. But fortunately -march=x86-64-v3 -mtune=znver5 just goes scalar for this. (With an excessive unroll factor of 20 though. Has anyone checked Clang's tuning settings in a non-small program with realistic I-cache pressure? Spending a lot more code bytes for a tiny bit more speed in a microbenchmark isn't worth it when it causes I-cache misses elsewhere, unless you know from PGO this loop is very long-running.)

Of course there's a lot to be gained here from good vectorization with AVX-512 vpermb, or especially vpermt2b on znver4 / znver5: 2x 64-byte loads feeding a 1-uop (on AMD) 2-input shuffle that produces 60 or 64 bytes of packed 5-byte integers per store. (A rough intrinsics sketch of that idea is at the end of this report.) This is vastly faster than a vpscatterqq could ever hope to be, even on Intel.
And there's a little to be gained with AVX2 on Raptor Lake, according to test results on the Stack Overflow link at the top: 8758 cycles for the repack_memcpy_overlap strategy with N = 1440, vs. 7989 cycles for an AVX2 shuffle strategy that sets up for 32-byte stores, vs. 8049 for a simple 256-bit vpshufb strategy with overlapping 16-byte vmovdqu / vextracti128 stores. Of course those times are all from compiling with clang18 -O3 -march=native on the Raptor Lake test system, so there's unnecessary overhead in the versions that aren't manually vectorized.
But this issue is only about vectorizing the address/index calculations in cases like this; let me know what else you'd consider a non-duplicate that I should open separate issues for.
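
For reference, here's a minimal, untested sketch of what I mean by the vpermt2b idea, written with intrinsics (the function name, the explicit length parameter n, and the tail handling are mine; no compiler currently generates anything like this). It needs AVX-512VBMI, i.e. Ice Lake / Zen 4 or later:

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

void repack_vpermt2b(uint8_t* __restrict out, const uint64_t* in, size_t n) {
    // Index j selects source byte (j/5)*8 + (j%5): the low 5 bytes of each of
    // qwords 0..12 from the 128-byte concatenation of two 64-byte registers.
    uint8_t idx[64];
    for (int j = 0; j < 64; j++)
        idx[j] = (uint8_t)((j / 5) * 8 + (j % 5));
    __m512i vidx = _mm512_loadu_si512(idx);

    size_t i = 0;
    // Advance 12 qwords per iteration: each 64-byte store holds 12 complete
    // 5-byte results (60 bytes) plus the first 4 bytes of the 13th, which the
    // next iteration (or the scalar tail) rewrites with the same values.
    // The i + 16 <= n bound keeps the second 64-byte load inside the array.
    for (; i + 16 <= n; i += 12) {
        __m512i lo = _mm512_loadu_si512(in + i);      // qwords i .. i+7
        __m512i hi = _mm512_loadu_si512(in + i + 8);  // qwords i+8 .. i+15 (only i+8..i+12 used)
        __m512i packed = _mm512_permutex2var_epi8(lo, vidx, hi);   // vpermt2b
        _mm512_storeu_si512(out + 5 * i, packed);
    }
    for (; i < n; i++) {          // scalar tail, same as repack_memcpy2
        uint64_t tmp = in[i];
        memcpy(out + 5 * i, &tmp, 5);
    }
}

With -march=znver4 or -march=icelake-server the hot loop should come out as two loads, one vpermt2b, one 64-byte store, and pointer/counter updates; the 4 bytes of overlapping rewrite are fine for the same as-if reasons as repack_memcpy_overlap above.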
