Missing float truncation rounding patterns #35965

Closed
RKSimon opened this issue Mar 6, 2018 · 33 comments
Labels
bugzilla (Issues migrated from bugzilla), llvm:codegen

Comments

@RKSimon
Collaborator

RKSimon commented Mar 6, 2018

Bugzilla Link: 36617
Resolution: FIXED
Resolved on: May 24, 2020 07:01
Version: trunk
OS: Windows NT
CC: @efriedma-quic, @hfinkel, @LebedevRI, @rotateright
Fixed by commit(s): d04db48, 7f4ff78, cceb630

Extended Description

Float/double rounding patterns that go through an integer type could be converted to truncation rounding instructions (roundss on x86/SSE etc.). AFAICT these don't need to be -ffast-math only.

float rnd(float x) {
return (float)((int)x);
}
__v4sf rnd(__v4sf x) {
return __builtin_convertvector(__builtin_convertvector(x, __v4si), __v4sf);
}

define dso_local float @_Z3rndf(float %0) {
%2 = fptosi float %0 to i32
%3 = sitofp i32 %2 to float
ret float %3
}
define dso_local <4 x float> @_Z3rndDv4_f(<4 x float> %0) {
%2 = fptosi <4 x float> %0 to <4 x i32>
%3 = sitofp <4 x i32> %2 to <4 x float>
ret <4 x float> %3
}

_Z3rndf:
cvttss2si %xmm0, %eax
xorps %xmm0, %xmm0
cvtsi2ssl %eax, %xmm0
retq
_Z3rndDv4_f:
cvttps2dq %xmm0, %xmm0
cvtdq2ps %xmm0, %xmm0
retq
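
For reference, a minimal C sketch of the equivalence being exploited (the function names here are illustrative, not from the report): for in-range inputs the integer round-trip behaves like a truncating round, which is what a single roundss/ftrunc can do; the two forms differ only in the sign of zero for negative fractional inputs, which is what the FMF discussion further down is about.

#include <math.h>

/* In-range behaviour only: casting an out-of-range float to int is UB in C,
   which is what lets the optimizer fold the round-trip at all. */
float rnd_via_int(float x)   { return (float)((int)x); }  /* fptosi + sitofp */
float rnd_via_trunc(float x) { return truncf(x); }         /* single ftrunc   */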

@RKSimon
Collaborator Author

RKSimon commented Mar 6, 2018

Truncating to smaller float types should be investigated as well:

float rnd32(double x) {
return (float)((int)x);
}

define float @​_Z5rnd32d(double) {
%2 = fptosi double %0 to i32
%3 = sitofp i32 %2 to float
ret float %3
}

_Z5rnd32d:
cvttsd2si %xmm0, %eax
xorps %xmm0, %xmm0
cvtsi2ssl %eax, %xmm0
retq

@rotateright
Contributor

Proposal for round-trip casting (to the same size):
https://reviews.llvm.org/D44909

@rotateright
Contributor

The examples in the description should be fixed with:
https://reviews.llvm.org/rL328921

@rotateright
Contributor

The examples in the description should be fixed with:
https://reviews.llvm.org/rL328921

Reverted because we broke Chrome:
https://reviews.llvm.org/rL329920

@rotateright
Contributor

Trying the initial case again with more warnings about the potential danger:
https://reviews.llvm.org/rL328921

@rotateright
Contributor

Trying the initial case again with more warnings about the potential danger:
https://reviews.llvm.org/rL328921

Oops - new link:
https://reviews.llvm.org/rL330437

@rotateright
Contributor

We'll have to hide this behind a flag and default it off to start:
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20180423/545906.html

@rotateright
Contributor

I was wrong about having to default this to 'off' (at least for now), but we do have flags to enable/disable the transform:
https://reviews.llvm.org/D46236
https://reviews.llvm.org/rL331209

@​scanon suggested that we add a platform-independent clang builtin that would clamp using either a fixed method (probably the one used by PPC/ARM because they're more useful than x86?) or using the target's method.

I haven't looked at the details yet, but that sounds good. Ie, it's safe to assume that we're going to get more complaints about breaking code with this optimization.

Also for the record, we are not doing this optimization because we think it's fun to break people's programs using a UB loophole. We added this because it can be a big perf improvement (which is why the operation is supported in HW on multiple platforms).
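
For concreteness, a minimal C sketch of the kind of clamping conversion such a builtin could expose; the name sat_fptosi32 and the exact NaN/overflow choices below are guesses, not part of the actual proposal, but they mimic the PPC/ARM-style saturate-on-overflow, NaN-to-zero behaviour.

#include <math.h>
#include <stdint.h>

/* Hypothetical helper: clamp instead of invoking UB on out-of-range input. */
static inline int32_t sat_fptosi32(float x) {
    if (isnan(x)) return 0;                    /* pick a defined result for NaN      */
    if (x >= 2147483648.0f) return INT32_MAX;  /* 2^31 and above saturate            */
    if (x <= -2147483648.0f) return INT32_MIN; /* -2^31 is exact; below it saturates */
    return (int32_t)x;                         /* in range: ordinary truncation      */
}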

@rotateright
Contributor

We missed an FMF constraint (latest comments here):
https://reviews.llvm.org/D44909

Miscompiling the -0.0 case...

The good news is that I expect less chance of the more common out-of-range pitfall once we limit this with FMF (I'm assuming most people don't compile with loosened FP).
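
A tiny standalone example of the -0.0 case mentioned above (assuming no fast-math flags): the integer round-trip produces +0.0 while a plain trunc preserves the sign, so the fold needs the no-signed-zeros FMF.

#include <math.h>
#include <stdio.h>

int main(void) {
    float x = -0.5f;
    float via_int   = (float)(int)x;  /* (int)-0.5f == 0, so this is +0.0f */
    float via_trunc = truncf(x);      /* truncf(-0.5f) is -0.0f            */
    /* copysignf exposes the sign of zero: prints "1 -1" */
    printf("%g %g\n", copysignf(1.0f, via_int), copysignf(1.0f, via_trunc));
    return 0;
}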

@rotateright
Contributor

This exposes a hole in IR FMF. We don't currently parse FMF on the conversion instructions even though they are FPMathOperators (the *itofp ones anyway); we only allow FMF on FP binops, calls, and fcmp.


@RKSimon
Collaborator Author

RKSimon commented Mar 17, 2020

Current Codegen: https://godbolt.org/z/ZEVtp7

@RKSimon
Collaborator Author

RKSimon commented Mar 17, 2020

@​spatel This looks mostly fine to me (thanks for 2 years ago!), the only possible improvement is:

define dso_local double @​_Z3rndd(double %0) local_unnamed_addr #​0 {
%2 = fptosi double %0 to i32
%3 = sitofp i32 %2 to float
%4 = fpext float %3 to double
ret double %4
}

define dso_local float @​_Z5rnd32d(double %0) local_unnamed_addr #​0 {
%2 = fptosi double %0 to i32
%3 = sitofp i32 %2 to float
ret float %3
}

_Z3rndd: # @​_Z3rndd
vcvttsd2si %xmm0, %eax
vcvtsi2ss %eax, %xmm1, %xmm0
vcvtss2sd %xmm0, %xmm0, %xmm0
retq

_Z5rnd32d: # @​_Z5rnd32d
vcvttsd2si %xmm0, %eax
vcvtsi2ss %eax, %xmm1, %xmm0
retq

We could avoid the XMM->GPR->XMM transfers by using vcvttpd2dq+vcvtdq2ps(+vcvtss2sd) - what do you think?
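
For illustration, an intrinsics sketch (the function name is ours and this is not checked against what the backend actually emits) of the packed sequence suggested above for the float-returning case; the double-returning _Z3rndd version would just add a cvtss2sd at the end.

#include <emmintrin.h>

static inline float rnd32_no_gpr(double x) {
    __m128d v = _mm_set_sd(x);        /* low lane = x, high lane zeroed */
    __m128i i = _mm_cvttpd_epi32(v);  /* vcvttpd2dq: truncate to int32  */
    __m128  f = _mm_cvtepi32_ps(i);   /* vcvtdq2ps: int32 back to float */
    return _mm_cvtss_f32(f);          /* result stays in the XMM domain */
}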

@rotateright
Contributor

@​spatel This looks mostly fine to me (thanks for 2 years ago!), the only
possible improvement is:

define dso_local double @​_Z3rndd(double %0) local_unnamed_addr #​0 {
%2 = fptosi double %0 to i32
%3 = sitofp i32 %2 to float
%4 = fpext float %3 to double
ret double %4
}

define dso_local float @​_Z5rnd32d(double %0) local_unnamed_addr #​0 {
%2 = fptosi double %0 to i32
%3 = sitofp i32 %2 to float
ret float %3
}

_Z3rndd: # @​_Z3rndd
vcvttsd2si %xmm0, %eax
vcvtsi2ss %eax, %xmm1, %xmm0
vcvtss2sd %xmm0, %xmm0, %xmm0
retq

_Z5rnd32d: # @​_Z5rnd32d
vcvttsd2si %xmm0, %eax
vcvtsi2ss %eax, %xmm1, %xmm0
retq

We could avoid the XMM->GPR->XMM transfers by using
vcvttpd2dq+vcvtdq2ps(+vcvtss2sd) - what do you think?

Yes, a quick check of Agner timings for recent Intel and AMD says avoiding the GPR round-trip is always a win. We did something similar with:
https://reviews.llvm.org/D56864

@rotateright
Contributor

Yes, a quick check of Agner timings for recent Intel and AMD says avoiding
the GPR round-trip is always a win. We did something similar with:
https://reviews.llvm.org/D56864

There's a potential complication though: if we operate on garbage data in the 128-bit %xmm reg, we might induce some terrible x86 denorm-handling stall and kill perf. Can we rule out a potential denorm stall on cvttps2dq or similar cast instructions?

If the stall concern is valid, then the way around that would be to zero out the high elements, so the asm diff looks something like this:

diff --git a/llvm/test/CodeGen/X86/ftrunc.ll b/llvm/test/CodeGen/X86/ftrunc.ll
index 0a1c1e2a851..d4042b8ceca 100644
--- a/llvm/test/CodeGen/X86/ftrunc.ll
+++ b/llvm/test/CodeGen/X86/ftrunc.ll
@@ -223,9 +223,10 @@ define <4 x double> @​trunc_unsigned_v4f64(<4 x double> %x) #​0 {
define float @​trunc_signed_f32(float %x) #​0 {
; SSE2-LABEL: trunc_signed_f32:
; SSE2: # %bb.0:
-; SSE2-NEXT: cvttss2si %xmm0, %eax
-; SSE2-NEXT: xorps %xmm0, %xmm0
-; SSE2-NEXT: cvtsi2ss %eax, %xmm0
+; SSE2-NEXT: xorps %xmm1, %xmm1
+; SSE2-NEXT: movss {{.*#+}} xmm1 = xmm0[0],xmm1[1,2,3]
+; SSE2-NEXT: cvttps2dq %xmm1, %xmm0
+; SSE2-NEXT: cvtdq2ps %xmm0, %xmm0
; SSE2-NEXT: retq

Still worth doing?

@llvmbot
Collaborator

llvmbot commented Apr 1, 2020

testloop-subnormal-cvt.asm

Can we rule out a potential denorm stall on cvttps2dq or similar cast instructions?

I think this is probably safe for tune=generic but it needs testing on Silvermont and/or Goldmont. Agner Fog's microarch PDF says:

Operations that have subnormal numbers as input or output
or generate underflow take approximately 160 clock cycles
unless the flush-to-zero mode and denormals-are-zero
mode are both used.

But this might not include conversions. There aren't 2 inputs that need to be aligned wrt. each other to match up the place values of mantissas. Anything with an exponent part below 2^-1 simply rounds to zero, including subnormals. But if conversion shares HW with other FPU operations, maybe it would be subject to the same subnormal penalties? I'm hopeful but I wouldn't bet on it.

I tested on Core 2 (Conroe) and Skylake and found no penalties for subnormal cvt. Even though Core 2 has ~150x penalties for basically anything else involving subnormals.

cvtps2dq and cvtpd2dq, and their truncating versions, are all fast with a mix of normal and subnormal inputs. (Not sure if cvtpd2dq is a win over the scalar round trip; it has a shuffle uop both ways. But I tested anyway just for the record.)

In a loop in a static executable, NASM source:

times 8 cvtps2dq  xmm0, xmm1    ; fine with subnormal
times 8 cvtpd2dq  xmm0, xmm2    ; fine with subnormal
times 8 cvttps2dq  xmm0, xmm1   ; fine with subnormal
times 8 cvttpd2dq  xmm0, xmm2   ; fine with subnormal

section .rodata
f1: dd 0.5, 0x12345, 0.123, 123 ; normal, subnormal, normal, subnormal
d2: dq 5.1234567, 0x800000000000ff12 ; normal, -subnormal

Full source attached, in case anyone wants to check on a Bulldozer-family or Zen CPU. Agner Fog says Bulldozer/Piledriver have penalties for subnormal or underflow results, with no mention of inputs.

Agner also doesn't mention input penalties for Ryzen.

Using the same test program with different instructions in the loop, I was able to see huge (factor of > 150) slowdowns for compares with subnormal inputs on Core 2, so I'm sure I had the right values in registers and that I would have detected a problem if it existed. (cmpltps, or ucomiss with a memory source)

nasm -felf64 subnormal-cvt.asm && ld subnormal-cvt.o && time ./a.out

If the run time is something under 1 or 2 seconds (e.g. 390 ms on a 4.1 GHz Skylake where cvt is 2/clock) it's fast. It takes 1.3 seconds on my 2.4GHz E6600 Core 2 Conroe/Merom.

If it was slow, it would take ~150 times that. Or possibly (but unlikely) on Ryzen it could take just a bit longer than you'd expect, if there are penalties for this; time it with perf stat to see whether you get the expected ~1 instruction / cycle on Zen1 or Zen2.

@RKSimon
Collaborator Author

RKSimon commented Apr 2, 2020

We could avoid the XMM->GPR->XMM transfers by using
vcvttpd2dq+vcvtdq2ps(+vcvtss2sd) - what do you think?

To avoid denormal issues, a pre-splat might do the trick - it's still going to be a lot quicker than a GPR transfer.

vcvtdq2ps(vcvttpd2dq(vpermilpd(x,0)))

or

vcvtss2sd(vcvtdq2ps(vcvttpd2dq(vpermilpd(x,0))))

I don't think SimplifyDemandedVectorElts will do anything to remove the splat but we'd have to confirm.

@llvmbot
Collaborator

llvmbot commented Apr 3, 2020

We could avoid the XMM->GPR->XMM transfers by using
vcvttpd2dq+vcvtdq2ps(+vcvtss2sd) - what do you think?

To avoid denormal issues, a pre-splat might do the trick - it's still going
to be a lot quicker than a GPR transfer.

vcvtdq2ps(vcvttpd2dq(vpermilpd(x,0)))

That defeats the throughput advantage, at least on Intel CPUs, but could still be a latency win for tune=generic if there are any CPUs that do have penalties for subnormal FP->int.

We should only consider this if we find evidence that some CPU we care about for tune=generic really does have a penalty for subnormal inputs to FP->int conversion. I haven't found any evidence of that on Intel, including Core 2 which does have subnormal-input penalties for everything else, including compares and float->double. As I said, I'm hopeful that there aren't any such CPUs, and Agner Fog's guide says AMD CPU subnormal penalties are only ever for output, not input.

It's bad for -mtune=sandybridge or tune=core2 where denormals definitely don't hurt FP->integer conversion.

That extra shuffle uop defeats all of the throughput benefit on Sandybridge-family (On Skylake it's nearly break-even for throughput vs. GPR and back). The port breakdown is p5 + p01+p5 + p01 on SKL. (packed conversion to/from double costs a shuffle uop as well as the conversion, because unlike float the element sizes don't match int32. Scalar is also 2: convert and domain transfer in one direction or the other.)

It does still have a latency advantage on Skylake over ss2si / si2sd to a GPR and back. (And a big latency advantage on Bulldozer-family).

SKL and Core 2 latency and throughput results for this loop body. Note that scalar and back wins on Core 2 for both throughput and latency.

xorps   xmm0, xmm0   ;; break the dep chain, comment to test latency

%rep 4
%if 1
; movaps xmm0, xmm2
unpcklpd xmm0, xmm0 ; 2 p5 + 2p01 on SKL
cvtpd2dq xmm0, xmm0 ; rep4: Core2 tput 0.42s SKL tput: 0.192s @ ~4.15 GHz. 800.5 Mcycles 4 front-end uops / round trip
cvtdq2ps xmm0, xmm0 ; rep4: Core2 lat: 1.34s SKL lat: 0.96s
%else
;; SKL ports: p0+p01, p01+p5
cvtsd2si eax, xmm0 ; rep4: Core2 tput 0.33s SKL tput: 0.194s @ ~4.15 GHz. 807.75 Mcycles 4 front-end uops / round trip
cvtsi2ss xmm0, eax ; rep4: Core2 lat 1.00s SKL lat: 1.15s
; note that this has a false dependency on the destination. Thanks, Intel.
%endif

And BTW, you want unpcklpd or movlhps to broadcast the low 64 bits: more compact machine code than vpermilpd, which needs a 3-byte VEX prefix plus an immediate (and requires AVX). Or maybe just movq to copy and zero-extend the low qword: 0.33c throughput on Intel; the downside is a possible bypass delay. Otherwise, without AVX we have to choose between broadcasting in place (putting the shuffle on the critical path for other readers of the FP value), or copy-and-broadcast, which probably costs a movaps plus a shuffle.

I think it's likely that FP->integer won't be slowed by subnormals anywhere, because Core 2 has penalties for comparing subnormals or converting ps -> pd, but still has no penalty for ps -> int.

@llvmbot
Collaborator

llvmbot commented Apr 3, 2020

vcvtss2sd(vcvtdq2ps(vcvttpd2dq(vpermilpd(x,0))))

Note that if the source FP value was float, we could skip the int->float step and go int->double directly instead of int->float->double. A float source guarantees that the integer value either overflowed or is an integer that float can represent exactly, so the int->float step is always exact.

But that's not the case for a double input. If we want to preserve the behaviour of rounding to integer and then to the nearest representable float, we need to keep that step.

Maybe everyone else already noticed that, too, but I thought it was neat.
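
A small C illustration of that point (the values are ours, chosen for the demo): an int produced by truncating a double can need a second rounding when converted to float, whereas one produced by truncating a float never does.

#include <stdio.h>

int main(void) {
    /* Double source: the int can need a second rounding when going to float. */
    double d = 16777217.0;        /* 2^24 + 1: fits in int32, not in float    */
    int    i = (int)d;            /* exact truncation: 16777217               */
    float  f = (float)i;          /* rounds to 16777216.0f                    */
    printf("%d -> %.1f\n", i, f); /* prints "16777217 -> 16777216.0"          */

    /* Float source: any int obtained by truncating a float converts back to
       float exactly, so the int->float step could be skipped for float input. */
    return 0;
}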

If we had AVX512 for VRNDSCALESD, that wouldn't help any more than SSE4.1 ROUNDSD; it can only specify a non-negative number of binary fraction bits to keep, not rounding to a coarser boundary than 2^0.

@llvmbot
Collaborator

llvmbot commented Apr 6, 2020

  • You won't end up in this situation without doing something weird with vector intrinsics or a reinterpret cast; normal scalar code will not find itself in this scenario. If you're doing something weird, you are opting in to dealing with the hazards that result.

  • Subnormals in the unused lanes don't actually cause an issue on anything recent.

In light of these two factors, it's nuts to slow down the common path at all to try to smooth over a hypothetical subnormal penalty that can only happen under very contrived circumstances on decade-plus-old CPUs. If there's a sequence that avoids a possible subnormal with zero penalty for the normal case, use it, but we shouldn't accept any slowdown to the normal case for this.

@llvmbot
Collaborator

llvmbot commented Apr 6, 2020

Steve's comment reminded me of a previous discussion I had about high garbage being a problem in XMM registers for FP-exception correctness. cvtdq2ps can raise Precision (inexact).

The x86-64 SysV calling convention allows it. While it's rare, compilers will sometimes call scalar functions with XMM0's high elements still holding whatever's left over from some auto-vectorization they did. Or return a scalar float after a horizontal sum that used shuffles that don't result in all elements holding the same value.

Clearly code must not rely on high elements being zero in function args or return values for correctness, and LLVM doesn't. But does correctness include FP exception semantics? For clang/LLVM it already doesn't.

Without -ffast-math, return ((a+b)+(c+d))+((e+f)+(g+h)); for float a..h args vectorizes with unpcklps + addps leaving the high elements = 2nd element of the XMM inputs. https://gcc.godbolt.org/z/oRZuN9. (with -ffast-math, clang just uses 7x addss in a way that keeps some but different ILP.)
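
Spelled out as a standalone function (the name hsum8 is ours), the reduction being discussed is:

/* Without -ffast-math, clang vectorizes this with unpcklps + addps, leaving
   the high XMM elements equal to the second element of each input pair. */
float hsum8(float a, float b, float c, float d,
            float e, float f, float g, float h) {
    return ((a + b) + (c + d)) + ((e + f) + (g + h));
}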

That will have performance problems for subnormals on P6-family, and maybe sometimes on Sandybridge-family. But it's not a correctness problem and might still be worth doing if we don't mind raising FP exceptions that the C source wouldn't have.

Not everyone is as cavalier with FP exception semantics, although it's really hard (and bad for performance) to get right if you care about FP exceptions as a visible side-effect (i.e. running an exception handler once per exception). GCC enables -ftrapping-math by default, but it's broken and has apparently never fully worked. (It does allow some optimizations that could change the number and IIRC type of FP exceptions that could be raised, as well as missing some optimizations that are still safe.)

For https://stackoverflow.com/questions/36706721/is-a-sign-or-zero-extension-required-when-adding-a-32bit-offset-to-a-pointer-for/36760539#36760539 I emailed one of the x86-64 System V ABI authors (Michael Matz); he confirmed that XMM scalar function args can have high garbage and added:

There once was a glibc math function (custom AMD implementation) that used
the packed variants and indeed produced FP exception when called from normal
C code (which itself left stuff in the upper bits from unrelated code).
The fix was to not use the packed variant. So, the above clang code seems
buggy without other guards.

(I had asked if that clang3.7 codegen was a sign that the ABI guaranteed something. But no, it's just clang/LLVM being aggressive.)

Apparently cvtdq2ps can raise a Precision exception (but no others), I think when an element is not already an exact integer-valued float, and/or is out of range for int32_t.


Side note: in the unlikely event we ever want to emit that shuffle for some non-default tuning, we can still leave it out with -ffast-math without -fPIC. That means we're compiling code for an executable which will presumably be linked with -ffast-math and run with FTZ and DAZ (denormals are zero) set, the whole purpose of which is to avoid these penalties.

  • You won't end up in this situation without doing something weird with
    vector intrinsics or a reinterpret cast; normal scalar code will not find
    itself in this scenario. If you're doing something weird, you are opting in
    to dealing with the hazards that result.

Auto-vectorized code that calls a scalar function could cause this, e.g. after a horizontal sum that ends up with the right value in the low lane and leftover other values in the high lanes.

Although most likely it already caused some subnormal penalties, in which case one more is probably not a big deal for the rare case where that happens. Especially not if we're only concerned with performance, not FP-exception correctness.

@rotateright
Contributor

  • You won't end up in this situation without doing something weird with
    vector intrinsics or a reinterpret cast; normal scalar code will not find
    itself in this scenario. If you're doing something weird, you are opting in
    to dealing with the hazards that result.

Auto-vectorized code that calls a scalar function could cause this, e.g.
after a horizontal sum that ends up with the right value in the low lane and
leftover other values in the high lanes.

There's some misunderstanding here. We're talking about changing the codegen for this most basic C code (no weird stuff necessary):

float rnd(float x) {
return (float)((int)x);
}

@llvmbot
Collaborator

llvmbot commented Apr 6, 2020

There's some misunderstanding here. We're talking about changing the codegen for this most basic C code (no weird stuff necessary)

Subnormal representations in the high-order lanes have to come from somewhere. They don't just magically appear.

In particular:

  • scalar loads zero the high lanes
  • moves from GPR zero the high lanes

There are a few ways to get a scalar value that preserve the high-order lanes. The most common culprits are the scalar conversions and unary scalar operations coming from memory. In those cases, the correct fix for the resulting performance bug will be to zero before that operation (or convert to separate load + op) to also break the dependency.
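
As a concrete illustration of that fix (hand-written with intrinsics here; compilers do the same thing internally with an xorps/pxor), zeroing the destination before a merging scalar conversion both makes the upper lanes known and breaks the false dependency:

#include <xmmintrin.h>

static inline float int_to_float_clean(int i) {
    /* cvtsi2ss merges into its destination; feed it an explicitly zeroed
       register so the upper lanes are clean and there is no false dependency. */
    return _mm_cvtss_f32(_mm_cvtsi32_ss(_mm_setzero_ps(), i));
}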

Without -ffast-math, return ((a+b)+(c+d))+((e+f)+(g+h)); for float a..h args vectorizes with unpcklps + addps leaving the high elements = 2nd element of the XMM inputs.

I'm totally fine with eating a subnormal stall in a case like this in order to avoid pessimizing "normal code", because if a stall is happening after a reduction like this, you're already eating subnormal stalls in the reduction itself. The ship sailed. You're already not going fast.

w.r.t. modeling floating-point exceptions, a constrained intrinsic used when fenv_access is enabled would of course not be able to take part in this optimization, so that shouldn't be an issue.

@rotateright
Contributor

Subnormal representations in the high-order lanes have to come from
somewhere. They don't just magically appear.

Ok, I agree it's far-fetched that garbage will be up there and has not already caused perf problems. And it's moot if there's no denorm penalty on the convert insts on recent HW.

We don't have to worry about exceptions because programs that enabled exceptions should use strict/constrained ops, and we're dealing with the base opcodes here.

As a final safety check, can someone try Peter's attached asm code (see comment 16) on an AMD CPU to see if that can trigger a denorm penalty there?

@LebedevRI
Member

$ cat /proc/cpuinfo | grep model | sort | uniq
model : 2
model name : AMD FX(tm)-8350 Eight-Core Processor
$ taskset -c 3 perf stat -etask-clock:u,context-switches:u,cpu-migrations:u,page-faults:u,cycles:u,branches:u,instructions:u -r1 ./subnormal-cvt

Performance counter stats for './subnormal-cvt':

        799.31 msec task-clock:u              #    1.000 CPUs utilized          
             0      context-switches:u        #    0.000 K/sec                  
             0      cpu-migrations:u          #    0.000 K/sec                  
             2      page-faults:u             #    0.003 K/sec                  
    3200715964      cycles:u                  #    4.004 GHz                    
     100000415      branches:u                #  125.109 M/sec                  
    3400000428      instructions:u            #    1.06  insn per cycle         
                                              #    0.00  stalled cycles per insn

   0.799595756 seconds time elapsed

   0.799619000 seconds user
   0.000000000 seconds sys

"uops_issued.any:u,uops_executed.thread:u,idq.dsb_uops:u" counters aren't supported here, so i had to drop them.

@llvmbot
Collaborator

llvmbot commented Apr 6, 2020

There's some misunderstanding here. We're talking about changing the codegen for this most basic C code (no weird stuff necessary)

Yes exactly. If that function doesn't inline, its caller could have left high garbage in XMM0.

Normal scalar loads zero the upper elements, which is why I proposed an auto-vectorized caller as a simple example of
when we could in practice see high garbage across function call boundaries.

Subnormal representations in the high-order lanes have to come from
somewhere. They don't just magically appear.

In particular:

  • scalar loads zero the high lanes
  • moves from GPR zero the high lanes

There are a few ways to get a scalar value that preserve the high-order
lanes. The most common culprits are the scalar conversions and unary scalar
operations coming from memory. In those cases, the correct fix for the
resulting performance bug will be to zero before that operation (or
convert to separate load + op) to also break the dependency.

Yup, GCC does dep-breaking PXOR-zeroing to work around Intel's short-sighted design (for PIII)
for scalar conversions that merge into the destination. With AVX we can sometimes do the zeroing once and non-destructively merge into the same register. (It seems braindead that the AVX versions didn't just zero-extend, i.e. implicitly use an internal zeroed reg as the merge target. So Intel missed two more chances to fix their ISA with AVX VEX and AVX512 EVEX.)
IIRC, MSVC does scalar load + packed-conversion, which is good for float.

But yes, these merge ops could have leftover integer vectors. Small integers are the bit patterns for subnormal floats. So that's probably a more likely source of subnormal FP bit patterns than actual subnormal floats from FP ops.
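
A two-line check of that claim (our example, using the same value 123 that appears in the NASM test data above):

#include <stdio.h>
#include <string.h>

int main(void) {
    int   i = 123;              /* any bit pattern in 1..0x007fffff        */
    float f;
    memcpy(&f, &i, sizeof f);   /* reinterpret the integer bits as a float */
    printf("%g\n", f);          /* prints a subnormal, about 1.7e-43       */
    return 0;
}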

Without -ffast-math, return ((a+b)+(c+d))+((e+f)+(g+h)); for float a..h args vectorizes with unpcklps + addps leaving the high elements = 2nd element of the XMM inputs.

I'm totally fine with eating a subnormal stall in a case like this in order
to avoid pessimizing "normal code", because if a stall is happening after a
reduction like this, you're already eating subnormal stalls in the
reduction itself. The ship sailed. You're already not going fast.

Yes, that was my thought, too.
If real code generates subnormal floats as part of its real work, that's something for the programmer to worry about even if we introduce an occasional extra penalty in the rare cases they're left lying around and we trip over them.

w.r.t. modeling floating-point exceptions, a constrained intrinsic used when
fenv_access is enabled would of course not be able to take part in this
optimization, so that shouldn't be an issue.

Oh, you're right, clang supports -ftrapping-math (but unlike GCC it's not on by default).
That does result in evaluation in source bracket-nesting order without shuffling, just 7x addss.
https://gcc.godbolt.org/z/a44S43

(Incidentally, that's probably optimal for most cases of surrounding code,
especially on Skylake with 2/clock FP add but 1/clock FP shuffle.
Fewer total instructions and shorter critical-path latency even on Haswell with 1/clock FP add.)

@llvmbot
Collaborator

llvmbot commented Apr 6, 2020

As a final safety check, can someone try Peter's attached asm code (see
comment 16) on an AMD CPU to see if that can trigger a denorm penalty there?

The only CPUs I'm at all worried about are Atom / Silvermont-family.

Low power Intel have more performance glass jaws than the big-core chips in other areas (e.g. much more picky about too many prefixes). I don't know how much we care about Silvermont for tune=generic, especially for FP code, if it does turn out to be a problem there.

Roman's 1.06 insn per cycle result on Bulldozer-family confirms it's fine. Probably safe to assume Zen is also fine. Agner says AMD only has subnormal penalties on outputs, no mention of inputs, and in this case the output is integer. So that's certainly the result I was expecting, but definitely good to be sure.

@rotateright
Contributor

Still need to handle more type variations, but here are two patterns:
https://reviews.llvm.org/D77895
https://reviews.llvm.org/D78362

@rotateright
Contributor

One more x86-specialization:
https://reviews.llvm.org/D78362

And a generic transform to remove the last cast:
https://reviews.llvm.org/D79116

@rotateright
Contributor

One more x86-specialization:
https://reviews.llvm.org/D78362

Pasted wrong link:
https://reviews.llvm.org/D78758

@RKSimon
Collaborator Author

RKSimon commented May 4, 2020

All cases now covered - thank you!

@rotateright
Contributor

Not shown in the examples here, but there's another FP cast fold proposed in:
https://reviews.llvm.org/D79187

@rotateright
Contributor

Added IR folds for int -> FP -> ext/trunc FP:
https://reviews.llvm.org/rGbfd512160fe0
https://reviews.llvm.org/rGc048a02b5b26

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
This issue was closed.