Doubling a single complex float produces unnecessarily inefficient code without -ffast-math #31205

Open
lesshaste (mannequin) opened this issue on Feb 3, 2017 · 8 comments
Labels: bugzilla (Issues migrated from bugzilla)

Comments

@lesshaste (mannequin) commented on Feb 3, 2017

Bugzilla Link: 31857
Version: trunk
OS: Linux
CC: @lesshaste, @hfinkel, @joker-eph, @RKSimon, @rotateright

Extended Description

Consider

#include <complex.h>

complex float f(complex float x[]) {
  complex float p = 1.0;
  for (int i = 0; i < 1; i++)
    p += 2*x[i];
  return p;
}

This code is simply doubling one complex float and adding 1.
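For reference, here is a scalar sketch of the same computation (my own illustration, not compiler output and not part of the original report): for x[0] = a + b*I, the result is (1 + 2a) + (2b)*I, so only two real multiplies and one real add are actually needed.

#include <complex.h>

/* Illustrative only: the computation written out with real arithmetic.
   The helper name f_scalar is mine. */
complex float f_scalar(complex float x[]) {
  float a = crealf(x[0]);
  float b = cimagf(x[0]);
  return (1.0f + 2.0f * a) + (2.0f * b) * I;
}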

In clang trunk with -O3 -march=core-avx2 you get

f: # @f
vmovss xmm3, dword ptr [rdi + 4] # xmm3 = mem[0],zero,zero,zero
vbroadcastss xmm0, xmm3
vmulps xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
vmovss xmm2, dword ptr [rdi] # xmm2 = mem[0],zero,zero,zero
vbroadcastss xmm1, xmm2
vmovss xmm4, dword ptr [rip + .LCPI0_1] # xmm4 = mem[0],zero,zero,zero
vmulps xmm1, xmm1, xmm4
vsubps xmm4, xmm1, xmm0
vaddps xmm1, xmm1, xmm0
vblendps xmm0, xmm4, xmm1, 2 # xmm0 = xmm4[0],xmm1[1],xmm4[2,3]
vucomiss xmm4, xmm4
jnp .LBB0_3
vmovshdup xmm1, xmm1 # xmm1 = xmm1[1,1,3,3]
vucomiss xmm1, xmm1
jp .LBB0_2
.LBB0_3:
vmovss xmm1, dword ptr [rip + .LCPI0_2] # xmm1 = mem[0],zero,zero,zero
vaddps xmm0, xmm0, xmm1
ret
.LBB0_2:
push rax
vmovss xmm0, dword ptr [rip + .LCPI0_1] # xmm0 = mem[0],zero,zero,zero
vxorps xmm1, xmm1, xmm1
call __mulsc3
add rsp, 8
jmp .LBB0_3

Using the Intel Compiler with -O3 -march=core-avx2 -fp-model strict you get:

f:
vmovsd xmm0, QWORD PTR [rdi] #5.12
vmulps xmm2, xmm0, XMMWORD PTR .L_2il0floatpacket.1[rip] #5.12
vmovsd xmm1, QWORD PTR p.152.0.0.1[rip] #3.19
vaddps xmm0, xmm1, xmm2 #5.5
ret

as expected.

The -fp-model strict option tells the compiler to strictly adhere to value-safe optimizations when implementing floating-point calculations and enables floating-point exception semantics. It also turns off fused multiply-add (FMA), which might not be relevant here.

If you turn on -ffast-math in clang trunk you get much better, although still not ideal, code:

f: # @f
vmovss xmm0, dword ptr [rdi] # xmm0 = mem[0],zero,zero,zero
vmovss xmm1, dword ptr [rdi + 4] # xmm1 = mem[0],zero,zero,zero
vaddss xmm1, xmm1, xmm1
vmovss xmm2, dword ptr [rip + .LCPI0_0] # xmm2 = mem[0],zero,zero,zero
vfmadd213ss xmm2, xmm0, dword ptr [rip + .LCPI0_1]
vinsertps xmm0, xmm2, xmm1, 16 # xmm0 = xmm2[0],xmm1[0],xmm2[2,3]
ret

@lesshaste (mannequin, author) commented on Feb 3, 2017

A much better example is this:

#include <complex.h>

complex float f(complex float x) {
  return 2*x;
}

clang trunk gives (using the same options)

f: # @f
vmovaps xmm2, xmm0
vaddps xmm1, xmm2, xmm2
vpermilps xmm0, xmm2, 225 # xmm0 = xmm2[1,0,2,3]
vxorps xmm3, xmm3, xmm3
vmulps xmm3, xmm0, xmm3
vsubps xmm0, xmm1, xmm3
vaddps xmm1, xmm1, xmm3
vucomiss xmm0, xmm0
jnp .LBB0_2
vmovshdup xmm3, xmm1 # xmm3 = xmm1[1,1,3,3]
vucomiss xmm3, xmm3
jp .LBB0_3
.LBB0_2:
vblendps xmm0, xmm0, xmm1, 2 # xmm0 = xmm0[0],xmm1[1],xmm0[2,3]
ret
.LBB0_3:
vmovshdup xmm3, xmm2 # xmm3 = xmm2[1,1,3,3]
vmovss xmm0, dword ptr [rip + .LCPI0_0] # xmm0 = mem[0],zero,zero,zero
vxorps xmm1, xmm1, xmm1
jmp __mulsc3 # TAILCALL

Much better would be:

f:
vmulps xmm0, xmm0, XMMWORD PTR .L_2il0floatpacket.1[rip] #3.12
ret

@lesshaste (mannequin, author) commented on Feb 3, 2017

A final comment. Adding -ffast-math to clang does give

f: # @f
vaddps xmm0, xmm0, xmm0
ret

as expected.

As far as I can tell, we shouldn't need -ffast-math to get this optimisation.

@joker-eph (Collaborator) commented:

Apparently we promote 2 (an int) to a complex and perform a cross product in the expression 2*c; however, if the source is written as 2.f * c, we don't perform any cross product.
The optimizer can't optimize the cross-product as efficiently without fast-math because it has to account for NaN.
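For comparison, a minimal sketch (mine, not part of the original report) of the 2.f variant described above, which stays a real-by-complex multiply and avoids the cross product and the __mulsc3 call:

#include <complex.h>

/* Hypothetical example: writing the constant as a float rather than an
   int avoids promoting it to complex. */
complex float g(complex float x) {
  return 2.f * x;
}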

Tim found that it may be this refactoring commit that is the culprit, incorrectly promoting 2 to a complex in the multiplication:

llvm-mirror/clang@8289f49c0c0

@llvmbot (Collaborator) commented on Feb 4, 2017

That's only half the problem.

Consider:

#include <complex.h>

complex float f(complex float x) {
  return (2.f + 0 * I) * x;
}

ICC is fine with this and generates a single multiply; I guess it figures it can disregard NaNs.
GCC falls down just like clang and calls __mulsc3.

When building the original code (with 2, not 2.f) with -ffinite-math-only we get:

f:
vmovshdup %xmm0, %xmm1
vaddss %xmm1, %xmm1, %xmm2
vxorps %xmm3, %xmm3, %xmm3
vmulss %xmm3, %xmm0, %xmm4
vaddss %xmm4, %xmm2, %xmm2
vaddss %xmm0, %xmm0, %xmm0
vmulss %xmm3, %xmm1, %xmm1
vsubss %xmm1, %xmm0, %xmm0
vinsertps $16, %xmm2, %xmm0, %xmm0
retq

The IR looks like:
define <2 x float> @f(<2 x float>) local_unnamed_addr #0 {
  %2 = extractelement <2 x float> %0, i32 0
  %3 = extractelement <2 x float> %0, i32 1
  %4 = fmul nnan ninf float %3, 2.000000e+00
  %5 = fmul nnan ninf float %2, 0.000000e+00
  %6 = fadd nnan ninf float %4, %5
  %7 = fmul nnan ninf float %2, 2.000000e+00
  %8 = fmul nnan ninf float %3, 0.000000e+00
  %9 = fsub nnan ninf float %7, %8
  %10 = insertelement <2 x float> undef, float %9, i32 0
  %11 = insertelement <2 x float> %10, float %6, i32 1
  ret <2 x float> %11
}

Shouldn't we be able to optimize away "fmul nnan ninf float %2, 0.000000e+00"? Or do we really need fast for this?

@joker-eph (Collaborator) commented:

"No NaNs - Allow optimizations to assume the arguments and result are not NaN. Such optimizations are required to retain defined behavior over NaNs, but the value of the result is undefined."

So I'd say clearly yes :)

The fast-math flags are likely not used individually everywhere in the optimizer (i.e. transforms will check if we have "fast").

@llvmbot (Collaborator) commented on Feb 4, 2017

What I meant was - is there a reason except nans and infs not to optimize this out? But I guess not, and you're right, we're just not testing the flags individually because we never bothered.

@TNorthover (Contributor) commented:

Just to add chapter & verse from C99...

6.3.1.8 (the usual arithmetic conversions) says:

"Otherwise, if the corresponding real type of either operand is float, the other operand is converted, without change of type domain, to a type whose corresponding real type is float.[51]
[...]
[51] For example, addition of a double _Complex and a float entails just the conversion of the float operand to double (and yields a double _Complex result)."

The "type domain" referred to is precisely the complex/real distinction. So it's pretty clear that the int shouldn't be converted to complex, but this still doesn't actually say how you multiply a real and a complex.

For that, the only reference seems to be the (informative rather than normative) Appendix G. G.5.1 says:

"If the operands are not both complex, then the result and floating-point exception behavior of the * operator is defined by the usual mathematical formula:
[...] x(u + iv) = (xu) + i(xv)"

which skips the cross-wise terms entirely.
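As a small illustration (mine, not from the standard or the original comment), applying G.5.1's formula to a real x and a complex z = u + i*v gives two independent real multiplies and nothing else; the helper name scale_by_real is hypothetical:

#include <complex.h>

/* x*(u + i*v) = (x*u) + i*(x*v): no cross terms, no __mulsc3. */
complex float scale_by_real(float x, complex float z) {
  return (x * crealf(z)) + (x * cimagf(z)) * I;
}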

@rotateright (Contributor) commented:

> Shouldn't we be able to optimize away "fmul nnan ninf float %2, 0.000000e+00"?

> What I meant was - is there a reason except nans and infs not to optimize this out? But I guess not, and you're right, we're just not testing the flags individually because we never bothered.

Yes, there is a reason: -0.0. We didn't say that it's ok to ignore signed zero (nsz), so -0.0 * 0.0 is not the same as 0.0 * 0.0.
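A small demonstration of this point (my own, not from the original comment): folding x * 0.0f to a literal 0.0f loses the sign of zero when x is negative.

#include <stdio.h>

/* Under IEEE 754, -0.0f * 0.0f evaluates to -0.0f, so replacing the
   multiply with a literal 0.0f would flip the result's sign bit. */
int main(void) {
  float x = -0.0f;
  printf("%g %g\n", x * 0.0f, 0.0f);   /* prints: -0 0 */
  return 0;
}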

In general, I think you're right: we don't test the individual flags as much as we should. But in this case, SimplifyFMulInst already does the right thing:
  // fmul nnan nsz X, 0 ==> 0
  if (FMF.noNaNs() && FMF.noSignedZeros() && match(Op1, m_AnyZero()))
    return Op1;

@llvmbot transferred this issue from llvm/llvm-bugzilla-archive on Dec 10, 2021