Doubling a single complex float produces unnecessarily inefficient code without -ffast-math #31205
Comments
A much better example is this:

`#include <complex.h>`

clang trunk gives (using the same options):

`f: # @f`

Much better would be:

`f:`
A final comment: adding -ffast-math to clang does give `f: # @f` as expected. As far as I can tell, we shouldn't need -ffast-math to get this optimisation.
Apparently we promote the int 2 to a complex and perform a full cross product in the expression. Tim found that it may be this refactoring commit that is the culprit, incorrectly promoting 2 to a complex in the multiplication:
That's only half the problem. Consider:

`#include <complex.h>`

ICC is fine with this and generates a single mul; I guess it figures it can disregard NaNs.
The IR looks like:

Shouldn't we be able to optimize away `fmul nnan ninf float %2, 0.000000e+00`? Or do we really need `fast` for this?
"No NaNs - Allow optimizations to assume the arguments and result are not NaN. Such optimizations are required to retain defined behavior over NaNs, but the value of the result is undefined."

So I'd say clearly yes :) The fast-math flags are likely not used individually everywhere in the optimizer (i.e., transforms will just check whether we have `fast`).
What I meant was: is there a reason, other than NaNs and infs, not to optimize this out? But I guess not, and you're right; we're just not testing the flags individually because we never bothered.
Just to add chapter and verse from C99: 6.3.1.8 (the usual arithmetic conversions) says:

"Otherwise, if the corresponding real type of either operand is float, the other operand is converted, without change of type domain, to a type whose corresponding real type is float.[51]"

The "type domain" referred to is precisely the complex/real division. So that's pretty clear that the int shouldn't be converted, but it still doesn't actually say how you multiply a real and a complex. For that, the only reference seems to be the (informative rather than normative) Annex G. G.5.1 says:

"If the operands are not both complex, then the result and floating-point exception behavior of the * operator is defined by the usual mathematical formula"

(for a real x and a complex u + iv, that formula is x × (u + iv) = (xu) + i(xv)), which skips the cross-wise terms entirely.
Yes, there is a reason: -0.0. We didn't say that it's OK to ignore signed zero (`nsz`), so `-0.0 * 0.0` is not the same as `0.0 * 0.0`.

In general, I think you're right: we don't test the individual flags as much as we should. But in this case, `SimplifyFMulInst` already does the right thing:
Extended Description
Consider
```c
#include <complex.h>

complex float f(complex float x[]) {
    complex float p = 1.0;
    for (int i = 0; i < 1; i++)
        p += 2*x[i];
    return p;
}
```
This code is simply doubling one complex float and adding 1.
In clang trunk with -O3 -march=core-avx2 you get:

```asm
f: # @f
    vmovss xmm3, dword ptr [rdi + 4] # xmm3 = mem[0],zero,zero,zero
    vbroadcastss xmm0, xmm3
    vmulps xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
    vmovss xmm2, dword ptr [rdi] # xmm2 = mem[0],zero,zero,zero
    vbroadcastss xmm1, xmm2
    vmovss xmm4, dword ptr [rip + .LCPI0_1] # xmm4 = mem[0],zero,zero,zero
    vmulps xmm1, xmm1, xmm4
    vsubps xmm4, xmm1, xmm0
    vaddps xmm1, xmm1, xmm0
    vblendps xmm0, xmm4, xmm1, 2 # xmm0 = xmm4[0],xmm1[1],xmm4[2,3]
    vucomiss xmm4, xmm4
    jnp .LBB0_3
    vmovshdup xmm1, xmm1 # xmm1 = xmm1[1,1,3,3]
    vucomiss xmm1, xmm1
    jp .LBB0_2
.LBB0_3:
    vmovss xmm1, dword ptr [rip + .LCPI0_2] # xmm1 = mem[0],zero,zero,zero
    vaddps xmm0, xmm0, xmm1
    ret
.LBB0_2:
    push rax
    vmovss xmm0, dword ptr [rip + .LCPI0_1] # xmm0 = mem[0],zero,zero,zero
    vxorps xmm1, xmm1, xmm1
    call __mulsc3
    add rsp, 8
    jmp .LBB0_3
```
Using the Intel Compiler with -O3 -march=core-avx2 -fp-model strict you get:
```asm
f:
    vmovsd xmm0, QWORD PTR [rdi] #5.12
    vmulps xmm2, xmm0, XMMWORD PTR .L_2il0floatpacket.1[rip] #5.12
    vmovsd xmm1, QWORD PTR p.152.0.0.1[rip] #3.19
    vaddps xmm0, xmm1, xmm2 #5.5
    ret
```
as expected.
The -fp-model strict option tells the compiler to strictly adhere to value-safe optimizations when implementing floating-point calculations, and enables floating-point exception semantics. It also turns off fused multiply-add contraction, which might not be relevant here.
If you turn on -ffast-math in clang trunk, you do get much better, although still not ideal, code:

```asm
f: # @f
    vmovss xmm0, dword ptr [rdi] # xmm0 = mem[0],zero,zero,zero
    vmovss xmm1, dword ptr [rdi + 4] # xmm1 = mem[0],zero,zero,zero
    vaddss xmm1, xmm1, xmm1
    vmovss xmm2, dword ptr [rip + .LCPI0_0] # xmm2 = mem[0],zero,zero,zero
    vfmadd213ss xmm2, xmm0, dword ptr [rip + .LCPI0_1]
    vinsertps xmm0, xmm2, xmm1, 16 # xmm0 = xmm2[0],xmm1[0],xmm2[2,3]
    ret
```