[X64] Floating-point multiplication can get "optimized" into integer multiplication even though it's inefficient

Given code like this (extracted out of a larger example with similar flow):

```c++
__m128 square(__m128i data) {
    __m128i y = _mm_srai_epi32(data, 16);
    __m128i x = _mm_or_si128(y, _mm_set1_epi32(3));
    __m128 v = _mm_cvtepi32_ps(x);
    return _mm_mul_ps(v, v);
}
```

and targeting SSE2, I would expect a more or less straightforward 1:1 lowering to SSE2 instructions, modulo `_mm_set1_epi32`, which can be materialized in a couple of different ways. Indeed, GCC generates this:

```asm
        pcmpeqd xmm1, xmm1
        psrad   xmm0, 16
        psrld   xmm1, 30
        por     xmm0, xmm1
        cvtdq2ps        xmm0, xmm0
        mulps   xmm0, xmm0
```

and MSVC generates this, opting to load `3` from memory:

```asm
        movdqu  xmm0, XMMWORD PTR [rcx]
        psrad   xmm0, 16
        orps    xmm0, XMMWORD PTR __xmm@00000003000000030000000300000003
        cvtdq2ps xmm0, xmm0
        mulps   xmm0, xmm0
```

Clang, however, generates the following, which is basically never a good idea:

```asm
        psrld   xmm0, 16
        por     xmm0, xmmword ptr [rip + .LCPI0_0]
        movdqa  xmm1, xmm0
        pmulhw  xmm1, xmm0
        pshuflw xmm1, xmm1, 232
        pshufhw xmm1, xmm1, 232
        pshufd  xmm1, xmm1, 232
        pmullw  xmm0, xmm0
        pshuflw xmm0, xmm0, 232
        pshufhw xmm0, xmm0, 232
        pshufd  xmm0, xmm0, 232
        punpcklwd       xmm0, xmm1
        cvtdq2ps        xmm0, xmm0
```

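It looks like Clang decides to multiply the integers instead of the floating-point values, because it knows the integer range is small enough: after the shift by 16 each lane fits in a signed 16-bit value, so the product fits in 32 bits. The effect is that a single `mulps` is replaced by `pmulhw`/`pmullw` plus a chain of shuffles and an unpack, which degrades performance.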
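To illustrate why the rewrite is legal (even though it is not profitable), here is a minimal scalar sketch of a single lane. It is not taken from Clang's source; `lane_matches` and `all_lanes_match` are hypothetical helper names, and the sketch assumes `>>` on a negative `int32_t` is an arithmetic shift (guaranteed since C++20, and what `psrad` does) and that `float` arithmetic is evaluated in single precision, as it is with SSE.

```c++
#include <cassert>   // only needed if you assert on the result
#include <cstdint>

// Hypothetical helper: scalar model of one 32-bit lane of square().
// Returns true when "convert, then multiply as float" and
// "multiply as int32, then convert" produce the same float.
bool lane_matches(std::int32_t lane) {
    // Mirrors _mm_srai_epi32(data, 16) followed by the OR with 3.
    // The result is always within [-32768, 32767].
    std::int32_t x = (lane >> 16) | 3;

    // Path the source code asks for: cvtdq2ps, then mulps.
    float fp_path = static_cast<float>(x) * static_cast<float>(x);

    // Path Clang picks: integer multiply, then cvtdq2ps.
    // x * x is at most 2^30, so it cannot overflow int32_t.
    float int_path = static_cast<float>(x * x);

    return fp_path == int_path;
}

// Exhaustively checks every possible value of the upper 16 bits.
// The lower 16 bits are shifted out, so they do not matter.
bool all_lanes_match() {
    for (std::int32_t hi = -32768; hi <= 32767; ++hi) {
        std::int32_t lane = hi * 65536;  // place hi in the upper 16 bits
        if (!lane_matches(lane)) return false;
    }
    return true;
}
```

Both paths round the same exact product once to the nearest `float`, so `all_lanes_match()` is expected to return `true`. That only shows the two forms compute the same value; it says nothing about cost, which is the point of this issue.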

Godbolt link for convenience: https://gcc.godbolt.org/z/746nGe1x5
