Missing float truncation rounding patterns #35965
Comments
Truncating to smaller float types should be investigated as well:

float rnd32(double x) {

define float @_Z5rnd32d(double) {

_Z3rndDv4_f:
Proposal for round-trip casting (to the same size):
The examples in the description should be fixed with:
Reverted because we broke Chrome:
Trying the initial case again with more warnings about the potential danger:
Oops - new link:
We'll have to hide this behind a flag and default it off to start:
I was wrong about having to default this to 'off' (at least for now), but we do have flags to enable/disable the transform.

@scanon suggested that we add a platform-independent clang builtin that would clamp using either a fixed method (probably the one used by PPC/ARM, because they're more useful than x86?) or using the target's method. I haven't looked at the details yet, but that sounds good. I.e., it's safe to assume that we're going to get more complaints about breaking code with this optimization.

Also, for the record: we are not doing this optimization because we think it's fun to break people's programs using a UB loophole. We added it because it can be a big perf improvement (which is why the operation is supported in HW on multiple platforms).
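For reference, a minimal sketch of what such a platform-independent clamping conversion might compute, assuming ARM/PPC-style semantics (saturate out-of-range values, map NaN to zero); the function name and exact policy are assumptions, not the proposed builtin:

```c
#include <limits.h>
#include <math.h>

/* Hypothetical clamping double->int conversion (name and NaN policy are
 * assumptions): saturates instead of hitting the fptosi UB loophole. */
static int clamp_d2i(double x) {
    if (isnan(x)) return 0;                   /* ARM/PPC-style: NaN -> 0 */
    if (x >= (double)INT_MAX) return INT_MAX; /* saturate high           */
    if (x <= (double)INT_MIN) return INT_MIN; /* saturate low            */
    return (int)x;                            /* in range: truncate      */
}
```

A builtin with these semantics would let users opt into a defined result on all targets instead of relying on whatever the fptosi lowering happens to do.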
We missed an FMF constraint (latest comments here): miscompiling the -0.0 case. The good news is that I expect less chance of the more common out-of-range pitfall once we limit this with FMF (I'm assuming most people don't compile with loosened FP).
This exposes a hole in IR FMF: we don't currently parse FMF on the conversion instructions even though they are FPMathOperators (the *itofp ones, anyway); we only allow FMF on FP binops, calls, and fcmp.
Current Codegen: https://godbolt.org/z/ZEVtp7
@spatel This looks mostly fine to me (thanks for 2 years ago!); the only possible improvement is in:

define dso_local double @_Z3rndd(double %0) local_unnamed_addr #0 {
define dso_local float @_Z5rnd32d(double %0) local_unnamed_addr #0 {

_Z3rndd: # @_Z3rndd
_Z5rnd32d: # @_Z5rnd32d

We could avoid the XMM->GPR->XMM transfers by using vcvttpd2dq+vcvtdq2ps(+vcvtss2sd) - what do you think?
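As a sketch of the suggested vcvttpd2dq+vcvtdq2ps sequence using SSE2 intrinsics (an illustration of the idea, not LLVM's actual codegen; the function name is made up):

```c
#include <immintrin.h>

/* Sketch of the no-GPR round trip for rnd32: double -> int32 -> float,
 * entirely in XMM registers. _mm_set_sd zeroes the high element. */
static float rnd32_vec(double x) {
    __m128d v = _mm_set_sd(x);       /* low = x, high = 0.0           */
    __m128i i = _mm_cvttpd_epi32(v); /* cvttpd2dq: truncate to int32  */
    __m128  f = _mm_cvtepi32_ps(i);  /* cvtdq2ps: int32 -> float      */
    return _mm_cvtss_f32(f);
}
```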
Yes, a quick check of Agner timings for recent Intel and AMD says avoiding the GPR round-trip is always a win. We did something similar with:
There's a potential complication though: if we operate on garbage data in the 128-bit %xmm reg, we might induce a terrible x86 denorm-handling stall and kill perf. Can we rule out a denorm stall on cvttps2dq or similar cast instructions?

If the stall concern is valid, the way around it would be to zero out the high elements, so the asm diff looks something like this:

diff --git a/llvm/test/CodeGen/X86/ftrunc.ll b/llvm/test/CodeGen/X86/ftrunc.ll

Still worth doing?
I think this is probably safe for tune=generic but it needs testing on Silvermont and/or Goldmont. Agner Fog's microarch PDF says:
But this might not include conversions. There aren't 2 inputs that need to be aligned wrt. each other to match up the place values of mantissas; anything with an exponent below 2^-1 simply rounds to zero, including subnormals. But if conversion shares HW with other FPU operations, maybe it would be subject to the same subnormal penalties? I'm hopeful, but I wouldn't bet on it.

I tested on Core 2 (Conroe) and Skylake and found no penalties for subnormal cvt, even though Core 2 has ~150x penalties for basically anything else involving subnormals. cvtps2dq and cvtpd2dq, and their truncating versions, are both fast with a mix of normal and subnormal inputs. (Not sure if cvtpd2dq is a win over the scalar round trip; it has a shuffle uop both ways. But I tested anyway, just for the record.) In a loop in a static executable, NASM source:
section .rodata

Full source attached, in case anyone wants to check on a Bulldozer-family or Zen. Agner Fog says Bulldozer/Piledriver have penalties for subnormal or underflow results, with no mention of inputs. Agner also doesn't mention input penalties for Ryzen.

Using the same test program with different instructions in the loop, I was able to see huge (factor of > 150) slowdowns for compares with subnormal inputs on Core 2 (cmpltps, or ucomiss with a memory source), so I'm sure I had the right values in registers and that I would have detected a problem if it existed.

nasm -felf64 subnormal-cvt.asm && ld subnormal-cvt.o && time ./a.out

If the run time is something under 1 or 2 seconds (e.g. 390 ms on a 4.1 GHz Skylake, where cvt is 2/clock), it's fast. It takes 1.3 seconds on my 2.4GHz E6600 Core 2 Conroe/Merom; if it were slow, it would take ~150 times that. Or possibly (but unlikely) on Ryzen just a bit longer than you'd expect, if there are penalties for this; time with perf stat to see if you get the expected ~1 instruction / cycle on Zen1 or Zen2.
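For anyone without NASM handy, a rough C analogue of the test's idea (my own sketch, not the attached source): repeatedly convert a subnormal float to int; with no input penalty this should run at full conversion throughput when timed.

```c
#include <stdint.h>
#include <string.h>

/* Loop over float->int conversions with a subnormal input.
 * (int)subnormal truncates to 0 every iteration, so this returns 0;
 * the interesting part is only how fast the loop runs. */
static int convert_subnormal_loop(long iters) {
    uint32_t bits = 1;               /* smallest positive subnormal float */
    float subnormal;
    memcpy(&subnormal, &bits, sizeof subnormal);
    volatile float in = subnormal;   /* keep the conversion in the loop   */
    int sink = 0;
    for (long i = 0; i < iters; i++)
        sink += (int)in;             /* cvttss2si with a subnormal input  */
    return sink;
}
```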
To avoid denormal issues, a pre-splat might do the trick - it's still going to be a lot quicker than a GPR transfer:

vcvtdq2ps(vcvttpd2dq(vpermilpd(x,0)))
or
vcvtss2sd(vcvtdq2ps(vcvttpd2dq(vpermilpd(x,0))))

I don't think SimplifyDemandedVectorElts will do anything to remove the splat, but we'd have to confirm.
That defeats the throughput advantage, at least on Intel CPUs, but could still be a latency win for tune=generic if there are any CPUs that do have penalties for subnormal FP->int. We should only consider this if we find evidence that some CPU we care about for tune=generic really does have a penalty for subnormal inputs to FP->int conversion. I haven't found any evidence of that on Intel, including Core 2, which does have subnormal-input penalties for everything else, including compares and float->double. As I said, I'm hopeful that there aren't any such CPUs, and Agner Fog's guide says AMD CPU subnormal penalties are only ever for output, not input.

It's bad for -mtune=sandybridge or tune=core2, where denormals definitely don't hurt FP->integer conversion. That extra shuffle uop defeats all of the throughput benefit on Sandybridge-family (on Skylake it's nearly break-even for throughput vs. GPR and back). The port breakdown is p5 + p01+p5 + p01 on SKL. (Packed conversion to/from double costs a shuffle uop as well as the conversion, because unlike float the element sizes don't match int32. Scalar is also 2 uops: a convert plus a domain transfer in one direction or the other.) It does still have a latency advantage on Skylake over ss2si / si2sd to a GPR and back (and a big latency advantage on Bulldozer-family).

SKL and Core 2 latency and throughput results for this loop body. Note that scalar-and-back wins on Core 2 for both throughput and latency.
%rep 4

And BTW, you want unpcklpd or movlhps to broadcast the low 64 bits: more compact machine code than a vpermilpd, which needs a 3-byte VEX prefix and an immediate (and requires AVX). Or maybe just movq to copy and zero-extend the low qword: 0.33c throughput on Intel; the downside is a possible bypass delay. Otherwise, without AVX, we have to choose between broadcasting in place (putting the shuffle on the critical path for other readers of the FP value), or copy-and-broadcast, which probably costs a movaps plus a shuffle.

I think it's likely that FP->integer won't be slowed by subnormals anywhere, because Core 2 has penalties for comparing subnormals or converting ps -> pd, but still has no penalty for ps -> int.
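The unpcklpd variant, sketched with SSE2 intrinsics (illustrative only; the function name is made up): broadcast the low double so the high conversion lane sees the same value instead of leftover garbage.

```c
#include <immintrin.h>

/* Pre-splat before converting: unpcklpd duplicates the low element, so
 * both lanes of cvttpd2dq convert the same (presumably normal) value. */
static float rnd32_splat(__m128d x) {
    __m128d s = _mm_unpacklo_pd(x, x); /* unpcklpd: splat low 64 bits */
    __m128i i = _mm_cvttpd_epi32(s);   /* cvttpd2dq                   */
    __m128  f = _mm_cvtepi32_ps(i);    /* cvtdq2ps                    */
    return _mm_cvtss_f32(f);
}
```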
Note that if the source FP value was float, we could skip the int->float step and go int->double directly instead of int->float->double. The first conversion guarantees that the integer value either overflowed or is an integer that float can represent exactly, so the int->float step is always exact. But that's not the case for a double input: if we want to preserve the behaviour of rounding to integer and then to the nearest representable float, we need to keep that step. Maybe everyone else already noticed that, too, but I thought it was neat.

If we had AVX512 for VRNDSCALESD, that wouldn't help any more than SSE4.1 ROUNDSD; it can only specify a non-negative number of binary fraction bits to keep, not rounding to a coarser boundary than 2^0.
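A small worked example of that double-rounding point: 16777217 (2^24 + 1) survives the double->int truncation but is not representable as float, so the int->float step rounds a second time.

```c
/* For a double input, the truncated integer can still need rounding
 * when converted to float: */
static int   trunc_to_int(double x)   { return (int)x; }
static float trunc_to_float(double x) { return (float)(int)x; }
/* trunc_to_int(16777217.5)   -> 16777217
 * trunc_to_float(16777217.5) -> 16777216.0f (2^24+1 rounds to even) */
```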
In light of these two factors, it's nuts to slow down the common path at all to try to smooth over a hypothetical subnormal penalty that can only happen under very contrived circumstances on decade-plus-old CPUs. If there's a sequence that avoids a possible subnormal with zero penalty for the normal case, use it, but we shouldn't accept any slowdown to the normal case for this. |
Steve's comment reminded me of a previous discussion I had about high garbage being a problem in XMM registers for FP-exception correctness: cvtdq2ps can raise Precision (inexact). The x86-64 SysV calling convention allows high garbage. While it's rare, compilers will sometimes call scalar functions with XMM0's high elements still holding whatever's left over from some auto-vectorization they did, or return a scalar float after a horizontal sum that used shuffles that don't leave all elements holding the same value. Clearly code must not rely on high elements being zero in function args or return values for correctness, and LLVM doesn't. But does correctness include FP exception semantics?

For clang/LLVM it already doesn't. Without -ffast-math, return ((a+b)+(c+d))+((e+f)+(g+h)); for float a..h args vectorizes with unpcklps + addps, leaving the high elements = 2nd element of the XMM inputs: https://gcc.godbolt.org/z/oRZuN9. (With -ffast-math, clang just uses 7x addss in a way that keeps some, but different, ILP.) That will have performance problems for subnormals on P6-family, and maybe sometimes on Sandybridge-family. But it's not a correctness problem, and might still be worth doing if we don't mind raising FP exceptions that the C source wouldn't have.

Not everyone is as cavalier with FP exception semantics, although it's really hard (and bad for performance) to get right if you care about FP exceptions as a visible side-effect (i.e. running an exception handler once per exception). GCC enables -ftrapping-math by default.

For https://stackoverflow.com/questions/36706721/is-a-sign-or-zero-extension-required-when-adding-a-32bit-offset-to-a-pointer-for/36760539#36760539 I emailed one of the x86-64 System V ABI authors (Michael Matz); he confirmed that XMM scalar function args can have high garbage and added:
(I had asked if that clang 3.7 codegen was a sign that the ABI guaranteed something. But no, it's just clang/LLVM being aggressive.) Apparently cvtdq2ps can raise a Precision exception (but no others) - I think if an element is not already an exact integer-valued float, and/or is out of range for int32_t. Side note: in the unlikely event we ever want to emit that shuffle for some non-default tuning:
Auto-vectorized code that calls a scalar function could cause this, e.g. after a horizontal sum that ends up with the right value in the low lane and leftover other values in the high lanes. Although most likely it already caused some subnormal penalties, in which case one more is probably not a big deal for the rare case where that happens. Especially not if we're only concerned with performance, not FP-exception correctness. |
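The add-tree example mentioned above, written out (compile without -ffast-math to see the unpcklps + addps vectorization):

```c
/* clang vectorizes this add tree with unpcklps + addps; the returned
 * scalar in XMM0 keeps non-zero high elements from the shuffles. */
float sum8(float a, float b, float c, float d,
           float e, float f, float g, float h) {
    return ((a+b)+(c+d))+((e+f)+(g+h));
}
```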
There's some misunderstanding here. We're talking about changing the codegen for this most basic C code (no weird stuff necessary):

float rnd(float x) {
  return (float)((int)x);
}
Subnormal representations in the high-order lanes have to come from somewhere. They don't just magically appear. In particular:
There are a few ways to get a scalar value that preserve the high-order lanes. The most common culprits are scalar conversions and unary scalar operations with a memory source. In those cases, the correct fix for the resulting performance bug is to zero the register before that operation (or to convert it into a separate load + op), which also breaks the dependency.
I'm totally fine with eating a subnormal stall in a case like this in order to avoid pessimizing "normal" code, because if a stall is happening after a reduction like this, you're already eating subnormal stalls in the reduction itself. That ship has sailed; you're already not going fast. W.r.t. modeling floating-point exceptions: a constrained intrinsic used when fenv_access is enabled would of course not be able to take part in this optimization, so that shouldn't be an issue.
OK, I agree it's far-fetched that garbage would be up there without having already caused perf problems. And it's moot if there's no denorm penalty on the convert instructions on recent HW. We don't have to worry about exceptions, because programs that enable exceptions should use strict/constrained ops, and we're dealing with the base opcodes here. As a final safety check, can someone try Peter's attached asm code (see comment 16) on an AMD CPU to see if it can trigger a denorm penalty there?
$ cat /proc/cpuinfo | grep model | sort | uniq

Performance counter stats for './subnormal-cvt':
"uops_issued.any:u,uops_executed.thread:u,idq.dsb_uops:u" counters aren't supported here, so i had to drop them. |
Yes, exactly. If that function doesn't inline, its caller could have left high garbage in XMM0. Normal scalar loads zero the upper elements, which is why I proposed an auto-vectorized caller as a simple example of
Yup, GCC does dep-breaking PXOR-zeroing to work around Intel's short-sighted design (for PIII). But yes, these merge ops could have leftover integer vectors, and small integers are the bit patterns for subnormal floats. So that's probably a more likely source of subnormal FP bit patterns than actual subnormal floats from FP ops.
Yes, that was my thought, too.
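A quick check of that claim about integer bit patterns (a sketch; the helper name is made up): any 32-bit integer whose magnitude is below 0x00800000 has a zero exponent field when viewed as a float, i.e. it is zero or subnormal.

```c
#include <string.h>
#include <math.h>

/* View an integer lane's bits as a float and classify the result. */
static int int_bits_are_subnormal(int i) {
    float f;
    memcpy(&f, &i, sizeof f);       /* reinterpret the bits, no conversion */
    return fpclassify(f) == FP_SUBNORMAL;
}
```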
Oh, you're right: clang supports -ftrapping-math (but unlike GCC, it's not on by default). (Incidentally, that's probably optimal for most cases of surrounding code,
The only CPUs I'm at all worried about are Atom / Silvermont-family. Low-power Intel chips have more performance glass jaws than the big cores in other areas too (e.g. they're much more picky about too many prefixes). I don't know how much we care about Silvermont for tune=generic, especially for FP code, if it does turn out to be a problem there.

Roman's 1.06 insn-per-cycle result on Bulldozer-family confirms it's fine there, and it's probably safe to assume Zen is also fine: Agner says AMD only has subnormal penalties on outputs, with no mention of inputs, and in this case the output is integer. So that's certainly the result I was expecting, but it's definitely good to be sure.
Still need to handle more type variations, but here are two patterns:
One more x86 specialization:
And a generic transform to remove the last cast:
Pasted wrong link:
All cases now covered - thank you!
Not shown in the examples here, but there's another FP cast fold proposed in:
Added IR folds for int -> FP -> ext/trunc FP:
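For context, the shape of the int -> FP -> ext/trunc FP patterns those folds target (my own examples, not the committed tests): the intermediate cast can only be dropped when it is always exact.

```c
/* Foldable: every short is exactly representable in double, and the
 * result equals a direct (float)x conversion.                        */
static float i16_to_float(short x) { return (float)(double)x; }

/* Not foldable in general: (float)x can lose low bits of a large int,
 * so (double)(float)x != (double)x for e.g. 16777217 (2^24 + 1).     */
static double i32_to_double(int x) { return (double)(float)x; }
```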
Extended Description
Patterns that round floats/doubles via an integer round trip could be converted to truncation rounding instructions (roundss on x86/SSE, etc.). AFAICT these don't need to be -ffast-math only.
float rnd(float x) {
return (float)((int)x);
}
__v4sf rnd(__v4sf x) {
return __builtin_convertvector(__builtin_convertvector(x, __v4si), __v4sf);
}
define dso_local float @_Z3rndf(float %0) {
%2 = fptosi float %0 to i32
%3 = sitofp i32 %2 to float
ret float %3
}
define dso_local <4 x float> @_Z3rndDv4_f(<4 x float> %0) {
%2 = fptosi <4 x float> %0 to <4 x i32>
%3 = sitofp <4 x i32> %2 to <4 x float>
ret <4 x float> %3
}
_Z3rndf:
cvttss2si %xmm0, %eax
xorps %xmm0, %xmm0
cvtsi2ssl %eax, %xmm0
retq
_Z3rndDv4_f:
cvttps2dq %xmm0, %xmm0
cvtdq2ps %xmm0, %xmm0
retq