This code packs 64-bit integers down to 40 bits (5 bytes) each. From https://stackoverflow.com/questions/79753012/repacking-64-to-40-bits-in-avx2:
```c
#include <stdint.h>
#include <string.h>
#define N 10240   // N isn't shown in the original snippet; any compile-time constant works (10240 matches the Godbolt loop bound)

void repack_memcpy2(uint8_t* __restrict out, uint64_t* in) {
    for (int i = 0; i < N; i++) {
        uint64_t tmp = in[i];        // separate qword load seems to help GCC/Clang.
        memcpy(out + 5*i, &tmp, 5);  // more efficient is overlapping 8-byte stores, but this more clearly demos the bad auto-vectorization
    }
}
```
In Clang 15 and later, with `-O3 -march=x86-64-v3 -mtune=skylake` (or `-mtune=raptorlake`), or with `-march=znver5 -fno-unroll-loops`, we get code which uses `vpaddq` to calculate `i*5` and extracts those values to GPRs for use in addressing modes, instead of just using the immediate displacement part of the addressing mode and a scalar add. This is obviously bad; I haven't benchmarked it to see how much it hurts real-world performance on my Skylake. I expect it's worse on Ice Lake and recent Zen, which can do 2 stores per clock and merge them in the store buffer if they're to the same cache line.
https://godbolt.org/z/PGaPY8qeh
```asm
# current trunk -O3 -march=raptorlake (basically x86-64-v3 features)
# RDI = dst, RSI=src
# ahead of the loop, YMM0 = [0,1,2,3], YMM1 = set1_epi64(20), YMM2 = set1_epi64(8) with broadcast loads
mov eax, 4
.LBB3_1:
vpsllq ymm3, ymm0, 2
vpaddq ymm3, ymm3, ymm0 # ymm3 = ymm0 + 4*ymm0
vmovq r8, xmm3
vpextrq r9, xmm3, 1
vextracti128 xmm4, ymm3, 1
vpextrq rcx, xmm4, 1
mov rdx, qword ptr [rsi + 8*rax - 8]
mov r10, qword ptr [rsi + 8*rax - 16] # normal scalar loads; indexed addressing mode for no reason, although it only costs code-size not speed for mov loads.
mov r11, qword ptr [rsi + 8*rax - 32]
mov rbx, qword ptr [rsi + 8*rax - 24]
mov dword ptr [rdi + r8], r11d # using r8 extracted from YMM3 as the offset
shr r11, 32
mov byte ptr [rdi + r8 + 4], r11b # and store the 5th byte separately
mov dword ptr [rdi + r9], ebx
shr rbx, 32
mov byte ptr [rdi + r9 + 4], bl
vmovq r8, xmm4
mov dword ptr [rdi + r8], r10d
shr r10, 32
mov byte ptr [rdi + r8 + 4], r10b
mov dword ptr [rdi + rcx], edx
shr rdx, 32
mov byte ptr [rdi + rcx + 4], dl
vpaddq ymm3, ymm3, ymm1 # ymm3 += set1(20)
... # repeat most of the above again
vpaddq ymm0, ymm0, ymm2 # ymm0 += 8 scalar elements per iteration, so this is vectorizing the ++i
add rax, 8
cmp rax, 10244
jne .LBB3_1
...
```
If we're not going to invent SIMD shuffles to move the actual data around, we should be making scalar asm that simply increments a pointer by 5*8 = 40 and uses offsets like `[rdi + 5]`, `[rdi + 5+4]`, `[rdi + 10]`, etc. in an unrolled loop, without any address-computation overhead and with equal code-size per store instruction for offsets < 128 where an 8-bit displacement fits.
Stores with a 1-register addressing-mode can also run their store-address uop on port 7 on Haswell/Skylake, unlike these indexed stores. Less of a problem on Ice Lake and later where store-address ports are all fully capable and don't compete with loads.
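For concreteness, here is a minimal C sketch of that strength-reduced shape (my own illustration, not code from the Stack Overflow post; the function name and the assumption that the element count is a multiple of 8 are mine):

```c
#include <stdint.h>
#include <string.h>

// Hand-done strength reduction: constant displacements 0,5,...,35 from 'out',
// plus one pointer increment per 8 elements. Assumes n is a multiple of 8.
void repack_strength_reduced(uint8_t* __restrict out, const uint64_t* in, int n) {
    for (int i = 0; i < n; i += 8) {
        for (int j = 0; j < 8; j++) {    // compilers fully unroll this inner loop
            uint64_t tmp = in[i + j];
            memcpy(out + 5*j, &tmp, 5);  // constant offset, fits in an 8-bit displacement
        }
        out += 40;                       // 5 bytes * 8 elements
    }
}
```

The address math here costs one `add` per 8 stores and nothing else, which is the baseline the auto-vectorized version should be compared against.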
Or if we're feeling ambitious, optimizing it to do 8-byte stores which overlap the previous one by 3 bytes, except on the final loop iteration, is somewhat faster in @swineone's testing on Raptor Lake, and should be a lot faster on Skylake and earlier (1/clock stores). That would still satisfy the as-if rule, since this code unconditionally writes all bytes in the same range and reads the same range, and there's no volatile or atomic. A user can write that manually like this, but it would be nice if they didn't have to:
```c
void repack_memcpy_overlap(uint8_t* __restrict out, uint64_t* in) {
    for (int i = 0; i < N-1; i++) {
        uint64_t tmp = in[i];
        memcpy(&out[5*i], &tmp, sizeof(tmp)); // 8-byte memcpy gives qword stores
    }
    memcpy(&out[5 * (N-1)], &in[N-1], 5);     // avoid overshoot on the last element
}
```
This actually auto-vectorizes in a really bad way, with AVX-512 scatter stores even with `-march=znver5` where they're very slow. (I was worried about a correctness problem with overlapping scatter, but Intel documents scatters as ordered from least-significant to most-significant element when elements overlap, and only unordered otherwise. AMD doesn't yet document their AVX-512 instructions, so I assume the same as Intel.)
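For reference, a hedged intrinsics sketch of roughly what that scatter-based vectorization amounts to (my own code and function name, not the compiler's output; it leans on the documented least-to-most-significant ordering of overlapping scatter elements):

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// Sketch only: each vpscatterqq does eight overlapping qword stores at byte
// offsets 5*i; the final element is handled separately to avoid overshoot.
void repack_scatter_sketch(uint8_t* __restrict out, const uint64_t* in, size_t n) {
    __m512i idx = _mm512_setr_epi64(0, 5, 10, 15, 20, 25, 30, 35); // byte offsets
    size_t i = 0;
    for (; i + 8 < n; i += 8) {                    // never include the last element
        __m512i v = _mm512_loadu_si512(in + i);
        _mm512_i64scatter_epi64(out, idx, v, 1);   // scale=1: idx already in bytes
        idx = _mm512_add_epi64(idx, _mm512_set1_epi64(40));
    }
    for (; i < n; i++) {                           // scalar tail
        uint64_t tmp = in[i];
        memcpy(out + 5*i, &tmp, (i + 1 < n) ? 8 : 5);
    }
}
```

Correctness aside, this is the slow path on current CPUs; the sketch is only to make the overlap/ordering question concrete.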
And for x86-64-v3 with most tuning options, it seems to want to do a manual scatter from vector regs, doing similar vectorized address calculation to feed stores like `vmovq qword ptr [rdi + rdx], xmm5` and `vpextrq qword ptr [rdi + rcx], xmm5, 1`.

This is bad for Zen4/5 since XMM stores are 1/clock, even 64-bit `vmovq mem, xmm`, while GPR stores are 2/clock. But fortunately `-march=x86-64-v3 -mtune=znver5` just goes scalar for this. (With an excessive unroll factor of 20, though. Has anyone checked Clang's tuning settings in a non-small program with realistic I-cache pressure? Spending a lot more code bytes for a tiny bit more speed in a microbenchmark isn't worth it when it causes I-cache misses elsewhere, unless you know from PGO that this loop is very long-running.)
Of course there's a lot to be gained here from good vectorization with AVX-512 `vpermb`, or especially `vpermt2b` on znver4 / znver5, like 2x 64-byte loads to get 60 or 64 bytes of 5-byte integers from a 1-uop (on AMD) 2-input shuffle. This is vastly faster than a `vpscatterqq` could ever hope to be, even on Intel.
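To make the `vpermt2b` idea concrete, a rough untuned sketch (my own code, assuming AVX-512 VBMI; the index math and tail handling are illustrative, not a proposal for exact codegen):

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// Output byte k of each 64-byte chunk comes from byte 8*(k/5) + k%5 of the
// 128-byte concatenation of two input vectors; vpermt2b does that in one shuffle.
void repack_vpermt2b_sketch(uint8_t* __restrict out, const uint64_t* in, size_t n) {
    uint8_t idx_bytes[64];
    for (int k = 0; k < 64; k++)
        idx_bytes[k] = (uint8_t)(8 * (k / 5) + k % 5);
    const __m512i idx = _mm512_loadu_si512(idx_bytes);

    size_t i = 0;
    for (; i + 16 <= n; i += 12) {                  // read 16 elements, consume 12
        __m512i a = _mm512_loadu_si512(in + i);
        __m512i b = _mm512_loadu_si512(in + i + 8);
        __m512i packed = _mm512_permutex2var_epi8(a, idx, b);
        _mm512_storeu_si512(out, packed);           // 64B store; bytes 60..63 get rewritten next iteration
        out += 60;                                  // 12 packed 5-byte values
    }
    for (; i < n; i++) {                            // scalar cleanup, 5-byte copy at the very end
        uint64_t tmp = in[i];
        memcpy(out, &tmp, (i + 1 < n) ? 8 : 5);
        out += 5;
    }
}
```

That's one 2-input shuffle per 60 output bytes, instead of per-element address math and stores.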
And a little to be gained with AVX2 on Raptor Lake, according to test results on the Stack Overflow link at the top: 8758 cycles for the `repack_memcpy_overlap` strategy for N = 1440, vs. 7989 cycles for an AVX2 shuffle strategy that sets up for 32-byte stores, vs. 8049 for a simple 256-bit `vpshufb` strategy with overlapping 16-byte `vmovdqu` / `vextracti128` stores. Of course those times are all from compiling with `clang18 -O3 -march=native` on the Raptor Lake test system, so there's unnecessary overhead in the versions that aren't manually vectorized.
But this issue is only about vectorizing the address-indexing calculations in cases like this; let me know which of the other points are non-duplicates you'd like me to open separate issues for.