Labels
llvm:instcombine (Covers the InstCombine, InstSimplify and AggressiveInstCombine passes), miscompilation
Description
Given code like this (extracted out of a larger example with similar flow):
#include <emmintrin.h>  // SSE2 intrinsics

__m128 square(__m128i data) {
    __m128i y = _mm_srai_epi32(data, 16);
    __m128i x = _mm_or_si128(y, _mm_set1_epi32(3));
    __m128 v = _mm_cvtepi32_ps(x);
    return _mm_mul_ps(v, v);
}
When targeting SSE2, I would expect a more or less straightforward 1:1 lowering into SSE2 instructions, modulo _mm_set1_epi32, which has a couple of different options. Indeed, GCC generates this:
pcmpeqd xmm1, xmm1
psrad xmm0, 16
psrld xmm1, 30
por xmm0, xmm1
cvtdq2ps xmm0, xmm0
mulps xmm0, xmm0
and MSVC generates this, opting to load 3 from memory:
movdqu xmm0, XMMWORD PTR [rcx]
psrad xmm0, 16
orps xmm0, XMMWORD PTR __xmm@00000003000000030000000300000003
cvtdq2ps xmm0, xmm0
mulps xmm0, xmm0
clang, however, generates this, which is basically never a good idea:
psrld xmm0, 16
por xmm0, xmmword ptr [rip + .LCPI0_0]
movdqa xmm1, xmm0
pmulhw xmm1, xmm0
pshuflw xmm1, xmm1, 232
pshufhw xmm1, xmm1, 232
pshufd xmm1, xmm1, 232
pmullw xmm0, xmm0
pshuflw xmm0, xmm0, 232
pshufhw xmm0, xmm0, 232
pshufd xmm0, xmm0, 232
punpcklwd xmm0, xmm1
cvtdq2ps xmm0, xmm0
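It looks like clang decides that it would be a great idea to multiply the integers instead of multiplying the converted floating-point values, since it knows the range of the integer is small enough for the result to be the same. SSE2 has no 32-bit integer multiply (pmulld is SSE4.1), so the product has to be assembled from pmulhw/pmullw and a chain of shuffles, which results in degraded performance.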
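To illustrate why the optimizer may consider the two forms interchangeable, here is a minimal scalar sketch of a single lane. It is not taken from LLVM; the helper names `lane_results_differ` and `count_mismatches` are made up for this example. It assumes the lane value is a sign-extended 16-bit quantity with the low two bits set (what `_mm_srai_epi32(data, 16)` followed by the OR with 3 produces), and that float arithmetic uses IEEE 754 round-to-nearest.

```c
#include <stdint.h>

/* Scalar model of one lane: x plays the role of (data >> 16) | 3.
 * Hypothetical helper, only for illustration. */
static int lane_results_differ(int32_t x) {
    float f = (float)x;                 /* exact: |x| < 2^24 */
    float float_square = f * f;         /* convert first, then multiply (as in the source) */
    float int_square = (float)(x * x);  /* multiply as integers, then convert */
    return float_square != int_square;
}

/* Counts the lane values for which the two orderings disagree. */
int count_mismatches(void) {
    int mismatches = 0;
    /* After an arithmetic shift right by 16, every lane fits in int16_t. */
    for (int32_t y = INT16_MIN; y <= INT16_MAX; ++y) {
        int32_t x = y | 3;              /* still within the 16-bit range */
        mismatches += lane_results_differ(x);
    }
    return mismatches;
}
```

Under those assumptions, both orderings round the same exact product (at most 2^30 in magnitude) to the nearest float, so they agree. That would make the rewrite value-preserving, but it says nothing about whether it is profitable on a target without a 32-bit vector multiply.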
Godbolt link for convenience: https://gcc.godbolt.org/z/746nGe1x5