Doubling a single complex float produces unnecessarily inefficient code without -ffast-math #31205

Open
lesshaste (mannequin) opened this issue on Feb 3, 2017 · 8 comments
Labels: bugzilla (Issues migrated from bugzilla)

Comments

@lesshaste (mannequin) commented on Feb 3, 2017

Bugzilla Link: 31857
Version: trunk
OS: Linux
CC: @lesshaste, @hfinkel, @joker-eph, @RKSimon, @rotateright

Extended Description

Consider

#include <complex.h>

complex float f(complex float x[]) {
  complex float p = 1.0;
  for (int i = 0; i < 1; i++)
    p += 2*x[i];
  return p;
}

This code is simply doubling one complex float and adding 1.
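For reference, here is a scalar sketch of the same computation (my own illustration, not compiler output and not part of the original report): for x[0] = a + b*I, the result is (1 + 2a) + (2b)*I, so only two real multiplies and one real add are actually needed.

#include <complex.h>

/* Illustrative only: the computation written out with real arithmetic.
   The helper name f_scalar is mine. */
complex float f_scalar(complex float x[]) {
  float a = crealf(x[0]);
  float b = cimagf(x[0]);
  return (1.0f + 2.0f * a) + (2.0f * b) * I;
}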

In clang trunk with -O3 -march=core-avx2 you get

f: # @f
vmovss xmm3, dword ptr [rdi + 4] # xmm3 = mem[0],zero,zero,zero
vbroadcastss xmm0, xmm3
vmulps xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
vmovss xmm2, dword ptr [rdi] # xmm2 = mem[0],zero,zero,zero
vbroadcastss xmm1, xmm2
vmovss xmm4, dword ptr [rip + .LCPI0_1] # xmm4 = mem[0],zero,zero,zero
vmulps xmm1, xmm1, xmm4
vsubps xmm4, xmm1, xmm0
vaddps xmm1, xmm1, xmm0
vblendps xmm0, xmm4, xmm1, 2 # xmm0 = xmm4[0],xmm1[1],xmm4[2,3]
vucomiss xmm4, xmm4
jnp .LBB0_3
vmovshdup xmm1, xmm1 # xmm1 = xmm1[1,1,3,3]
vucomiss xmm1, xmm1
jp .LBB0_2
.LBB0_3:
vmovss xmm1, dword ptr [rip + .LCPI0_2] # xmm1 = mem[0],zero,zero,zero
vaddps xmm0, xmm0, xmm1
ret
.LBB0_2:
push rax
vmovss xmm0, dword ptr [rip + .LCPI0_1] # xmm0 = mem[0],zero,zero,zero
vxorps xmm1, xmm1, xmm1
call __mulsc3
add rsp, 8
jmp .LBB0_3

Using the Intel Compiler with -O3 -march=core-avx2 -fp-model strict you get:

f:
vmovsd xmm0, QWORD PTR [rdi] #5.12
vmulps xmm2, xmm0, XMMWORD PTR .L_2il0floatpacket.1[rip] #5.12
vmovsd xmm1, QWORD PTR p.152.0.0.1[rip] #3.19
vaddps xmm0, xmm1, xmm2 #5.5
ret

as expected.

The -fp-model strict option tells the compiler to strictly adhere to value-safe optimizations when implementing floating-point calculations and enables floating-point exception semantics. It also turns off fused multiply-add (FMA), which might not be relevant here.

If you turn on -ffast-math in clang trunk you get much better, although still not ideal, code:

f: # @f
vmovss xmm0, dword ptr [rdi] # xmm0 = mem[0],zero,zero,zero
vmovss xmm1, dword ptr [rdi + 4] # xmm1 = mem[0],zero,zero,zero
vaddss xmm1, xmm1, xmm1
vmovss xmm2, dword ptr [rip + .LCPI0_0] # xmm2 = mem[0],zero,zero,zero
vfmadd213ss xmm2, xmm0, dword ptr [rip + .LCPI0_1]
vinsertps xmm0, xmm2, xmm1, 16 # xmm0 = xmm2[0],xmm1[0],xmm2[2,3]
ret

@lesshaste (mannequin, author) commented on Feb 3, 2017

A much better example is this:

#include <complex.h>

complex float f(complex float x) {
  return 2*x;
}

clang trunk gives (using the same options)

f: # @f
vmovaps xmm2, xmm0
vaddps xmm1, xmm2, xmm2
vpermilps xmm0, xmm2, 225 # xmm0 = xmm2[1,0,2,3]
vxorps xmm3, xmm3, xmm3
vmulps xmm3, xmm0, xmm3
vsubps xmm0, xmm1, xmm3
vaddps xmm1, xmm1, xmm3
vucomiss xmm0, xmm0
jnp .LBB0_2
vmovshdup xmm3, xmm1 # xmm3 = xmm1[1,1,3,3]
vucomiss xmm3, xmm3
jp .LBB0_3
.LBB0_2:
vblendps xmm0, xmm0, xmm1, 2 # xmm0 = xmm0[0],xmm1[1],xmm0[2,3]
ret
.LBB0_3:
vmovshdup xmm3, xmm2 # xmm3 = xmm2[1,1,3,3]
vmovss xmm0, dword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero,zero,zero
vxorps xmm1, xmm1, xmm1
jmp __mulsc3 # TAILCALL

Much better would be:

f:
vmulps xmm0, xmm0, XMMWORD PTR .L_2il0floatpacket.1[rip] #3.12
ret

@lesshaste (mannequin, author) commented on Feb 3, 2017

A final comment. Adding -ffast-math to clang does give

f: # @f
vaddps xmm0, xmm0, xmm0
ret

as expected.

As far as I can tell, we shouldn't need -ffast-math to get this optimisation.

@joker-eph (Collaborator) commented:

Apparently we promote 2 (an int) to a complex and perform a cross product in the expression 2*c; however, if the source is written as 2.f * c, we don't perform any cross product.
The optimizer can't optimize the cross-product as efficiently without fast-math because it has to account for NaN.
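For comparison, a minimal sketch (mine, not part of the original report) of the 2.f variant described above, which stays a real-by-complex multiply and avoids the cross product and the __mulsc3 call:

#include <complex.h>

/* Hypothetical example: writing the constant as a float rather than an
   int avoids promoting it to complex. */
complex float g(complex float x) {
  return 2.f * x;
}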

Tim found that it may be this refactoring commit that is the culprit, incorrectly promoting 2 to a complex in the multiplication:

llvm-mirror/clang@8289f49c0c0

@llvmbot (Collaborator) commented on Feb 4, 2017

That's only half the problem.

Consider:

#include <complex.h>

complex float f(complex float x) {
  return (2.f + 0 * I) * x;
}

ICC is fine with this and generates a single multiply; I guess it figures it can disregard NaNs.
GCC falls down just like clang and calls __mulsc3.

When building the original code (with 2, not 2.f) with -ffinite-math-only we get:

f:
vmovshdup %xmm0, %xmm1
vaddss %xmm1, %xmm1, %xmm2
vxorps %xmm3, %xmm3, %xmm3
vmulss %xmm3, %xmm0, %xmm4
vaddss %xmm4, %xmm2, %xmm2
vaddss %xmm0, %xmm0, %xmm0
vmulss %xmm3, %xmm1, %xmm1
vsubss %xmm1, %xmm0, %xmm0
vinsertps $16, %xmm2, %xmm0, %xmm0
retq

The IR looks like:
define <2 x float> @f(<2 x float>) local_unnamed_addr #0 {
  %2 = extractelement <2 x float> %0, i32 0
  %3 = extractelement <2 x float> %0, i32 1
  %4 = fmul nnan ninf float %3, 2.000000e+00
  %5 = fmul nnan ninf float %2, 0.000000e+00
  %6 = fadd nnan ninf float %4, %5
  %7 = fmul nnan ninf float %2, 2.000000e+00
  %8 = fmul nnan ninf float %3, 0.000000e+00
  %9 = fsub nnan ninf float %7, %8
  %10 = insertelement <2 x float> undef, float %9, i32 0
  %11 = insertelement <2 x float> %10, float %6, i32 1
  ret <2 x float> %11
}

Shouldn't we be able to optimize away "fmul nnan ninf float %2, 0.000000e+00"? Or do we really need fast for this?

@joker-eph (Collaborator) commented:

"No NaNs - Allow optimizations to assume the arguments and result are not NaN. Such optimizations are required to retain defined behavior over NaNs, but the value of the result is undefined."

So I'd say clearly yes :)

The fast-math flags are likely not used individually everywhere in the optimizer (i.e. transforms will check if we have "fast").

@llvmbot (Collaborator) commented on Feb 4, 2017

What I meant was - is there a reason except nans and infs not to optimize this out? But I guess not, and you're right, we're just not testing the flags individually because we never bothered.

@TNorthover (Contributor) commented:

Just to add chapter & verse from C99...

6.3.1.8 (the usual arithmetic conversions) says:

"Otherwise, if the corresponding real type of either operand is float, the other operand is converted, without change of type domain, to a type whose corresponding real type is float.[51]
[...]
[51] For example, addition of a double _Complex and a float entails just the conversion of the float operand to double (and yields a double _Complex result)."

The "type domain" referred to is precisely the complex/real distinction. So it's pretty clear that the int shouldn't be converted to complex, but this still doesn't actually say how you multiply a real and a complex.

For that, the only reference seems to be the (informative rather than normative) Appendix G. G.5.1 says:

"If the operands are not both complex, then the result and floating-point exception behavior of the * operator is defined by the usual mathematical formula:
[...] x(u + iv) = (xu) + i(xv)"

which skips the cross-wise terms entirely.
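As a small illustration (mine, not from the standard or the original comment), applying G.5.1's formula to a real x and a complex z = u + i*v gives two independent real multiplies and nothing else; the helper name scale_by_real is hypothetical:

#include <complex.h>

/* x*(u + i*v) = (x*u) + i*(x*v): no cross terms, no __mulsc3. */
complex float scale_by_real(float x, complex float z) {
  return (x * crealf(z)) + (x * cimagf(z)) * I;
}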

@rotateright (Contributor) commented:

> Shouldn't we be able to optimize away "fmul nnan ninf float %2, 0.000000e+00"?

> What I meant was - is there a reason except nans and infs not to optimize this out? But I guess not, and you're right, we're just not testing the flags individually because we never bothered.

Yes, there is a reason: -0.0. We didn't say that it's ok to ignore signed zero (nsz), so -0.0 * 0.0 is not the same as 0.0 * 0.0.
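A small demonstration of this point (my own, not from the original comment): folding x * 0.0f to a literal 0.0f loses the sign of zero when x is negative.

#include <stdio.h>

/* Under IEEE 754, -0.0f * 0.0f evaluates to -0.0f, so replacing the
   multiply with a literal 0.0f would flip the result's sign bit. */
int main(void) {
  float x = -0.0f;
  printf("%g %g\n", x * 0.0f, 0.0f);   /* prints: -0 0 */
  return 0;
}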

In general, I think you're right: we don't test the individual flags as much as we should. But in this case, SimplifyFMulInst already does the right thing:
  // fmul nnan nsz X, 0 ==> 0
  if (FMF.noNaNs() && FMF.noSignedZeros() && match(Op1, m_AnyZero()))
    return Op1;

@llvmbot transferred this issue from llvm/llvm-bugzilla-archive on Dec 10, 2021