math: add guaranteed FMA #25819
Comments
Dup of #8037. I don't think we want an explicit FMA function. |
I don't think this is a duplicate of #8037, which is about making the compiler recognize a*b + c expressions and fuse them automatically. Therefore please reconsider reopening this issue; I believe it is about something else. |
Yes, I doubt we would want to expose a function by which you can explicitly request FMA. It is something like an intrinsic, and we usually do not expose those in the standard library. Anyway, I am re-opening this so that the proposal review committee can take a final call. /cc @ianlancetaylor, @rsc |
@TuomLarsen: what would be the benefit of declaring that manually compared to the compiler figuring it out automatically? |
@alexd765 : Because if your application needs the extra precision, you can't rely on compiler optimizations accidentally introducing extra precision for you. For instance, it won't be portable. @agnivade: Do you have a specific application where the extra precision is needed? It's going to be hard to evaluate this proposal without a better understanding of where it is needed. |
For reference, note that the FMA operation is included in the IEEE 754-2008 revision of the floating-point standard, and the fma function was added to the C99 standard math library (ISO/IEC 9899:1999), as well as being included in other languages such as Java. Some brief discussion of the use of an explicit FMA was also mentioned in the original FMA issue #17895. |
Also: @Kunde21 |
I think you meant to ping @TuomLarsen |
@randall77 It is useful whenever one needs to compute a*b + c (repeatedly). So mostly numerical algorithms such as dot products, matrix multiplication, polynomial evaluation, Newton's method, convolutions, artificial neural networks, ... (from Wikipedia). But I guess you knew that already, so taking the first option as an example: the first hit for a "dot product fma" query reveals a poster which compares the classical dot-product computation with various FMA-flavoured ones. The main takeaway from it is the last line: "CompDot is about 6 times faster than ... DotXBLAS", while providing the same accuracy as if it were calculated with the unevaluated sum of two doubles (which is almost as good as quadruple-precision floats). |
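For illustration, the plainest FMA-flavoured variant (not the compensated CompDot algorithm from the poster) simply chains the multiply-add through a single rounding per term. This sketch uses the math.FMA function that this proposal eventually produced:

```go
package fmadot

import "math"

// DotFMA computes the dot product of x and y using one fused multiply-add
// (a single rounding) per term. It assumes len(x) == len(y).
func DotFMA(x, y []float64) float64 {
	var s float64
	for i := range x {
		s = math.FMA(x[i], y[i], s) // s = x[i]*y[i] + s, rounded once
	}
	return s
}
```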
What is the proposed implementation of math.FMA on systems that do not have an FMA instruction in hardware? |
To elaborate on the previous comment ("What is the proposed fallback implementation when there's no hardware support?"): If the proposed fallback is a plain multiply and add (two roundings), then math.FMA guarantees nothing beyond what writing a*b + c already does. If the proposed fallback is "do non-trivial work to somehow produce a result that is higher precision than a plain multiply and add", then it will be much slower on systems without an FMA instruction, and the question becomes which of speed and accuracy the proposal is really about. @TuomLarsen, the Wikipedia article correctly says "A fast FMA can speed up and improve the accuracy of many computations that involve the accumulation of products." If you care about speed alone, then the current Go compiler optimization of compiling a*b + c to a fused multiply-add where the hardware allows it already gives you that. If you care about accuracy more than speed, that's when you'd want math.FMA, to be able to force the bit-precise FMA result, where you'd be willing to run significantly slower than a plain multiply-add on systems without hardware FMA. |
The proposal is about a precise, always-correct (i.e. "strict") a*b + c with only one rounding. If it is faster to use FMA than a multiplication and an addition, that's only good, but it is not the main motivation here. The linked Wikipedia page lists quite a lot of modern processors with FMA support, so calls to the slow fallback should be quite rare. As for the simple a*b + c fallback (multiply and add), I'm afraid it would not make a lot of sense, as the "F" means fused, i.e. only one rounding. So yes, this proposal cares more about accuracy than speed. PS: As for the actual fallback, there is probably more than one way of doing it; this is what the Julia folks have done, for example. |
@TuomLarsen, OK, great, thank you for clarifying that the proposal is about a strict bit-precise FMA. Please help me read the Wikipedia page: what is an example of a motivating, common application that cares so much about accuracy that it would prefer a slow software FMA implementation over a non-fused multiply-add? (The point about "quite a lot of modern processors support FMA" cuts against having a special function: if everything is already doing FMA for a*b + c anyway, there is little need for a function to request it explicitly.) I'm sorry if we're talking past each other a bit. |
One benefit of fma is tracking floating-point errors, which allows carrying higher precision (>float64, e.g. double-double) internally for certain calculations to ensure an accurate result at the end. Calculating x*y exactly is a building block for these extended-precision calculations. The following paper from IBM provides some motivating examples. |
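As a sketch of that building block (written with the math.FMA spelling that was eventually adopted), the error-free product splits x*y into a rounded head and an exact tail:

```go
package fmaeft

import "math"

// TwoProdFMA splits x*y into a rounded head p and an exact tail e such that
// p+e == x*y exactly (absent overflow/underflow). Without an FMA instruction
// this requires Dekker/Veltkamp splitting and many more operations.
func TwoProdFMA(x, y float64) (p, e float64) {
	p = x * y              // the usual rounded product
	e = math.FMA(x, y, -p) // the rounding error of p, computed exactly
	return p, e
}
```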
@bmkessler, thanks for that link. Very interesting. |
@rsc In addition to what @bmkessler said: FMA improves the accuracy of a*b + c, usually at least by just a bit. But then there are these Error-free transformations, which in addition to the normal floating-point operations account for the rounding errors and allow for almost double the working floating-point precision (so in the case of double precision, they are almost as precise as if calculated with quadruple precision). A basic such building block is the EFT product, seen e.g. in the poster I linked to, which when implemented without FMA requires a "magic" constant and 17 FLOPs, whereas with FMA it only takes 2 FLOPs, as @bmkessler showed. So if one implemented the EFT product as a plain multiplication and addition, the algorithms would return incorrect results, as they rely on handling very tiny rounding errors. "If everything is already doing FMA...": First, I wrote "quite a lot" precisely because I'm not sure whether everybody has the FMA instruction, so I presume a proper fallback would be necessary. Second, it is also not desirable to automatically replace all a*b + c with FMA, as it may break some algorithms; see e.g. this notice about FMA (search for "THE FMA PROBLEM"). |
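To make this concrete, here is a sketch of one well-known EFT-style algorithm, Kahan's accurate a*b - c*d, again using the math.FMA spelling that was eventually adopted. It depends on the FMA being placed exactly where the author put it, which is why compiler-chosen fusing is not a substitute for an explicit function:

```go
package fmaeft

import "math"

// Det2 computes a*b - c*d accurately using Kahan's algorithm. The explicit
// FMA recovers the rounding error of c*d; a naive a*b - c*d, with or without
// compiler-chosen fusing, discards it.
func Det2(a, b, c, d float64) float64 {
	w := c * d
	e := math.FMA(-c, d, w) // w - c*d, the rounding error of w, exact via FMA
	f := math.FMA(a, b, -w) // a*b - w with a single rounding
	return f + e            // (a*b - w) + (w - c*d)
}
```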
The IBM paper linked by @bmkessler convinced me that this issue is basically the floating-point equivalent of #24813. In short, the trick is that given float64 values x and y, FMA(x, y, -x*y) recovers exactly the low-order bits that rounding x*y discarded. That is, a guaranteed-precision FMA exposes double-width floating-point multiply, the same way #24813 is about exposing double-width integer multiply (and other operations). Also, given #24813's bits.Mul64, a software implementation of math.FMA (for the systems without FMA hardware) would not be many lines of code. I think we should probably accept this issue. |
@TuomLarsen Note that you can already prevent FMA fusing with an explicit conversion: float64(x*y) + z rounds the product before the addition, so it is never fused. |
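For reference, a small sketch of the three spellings and their fusing behavior, assuming the math.FMA name that was eventually added and the Go spec rule that an explicit floating-point conversion rounds its operand:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	x, y, z := 1.0+1e-8, 1.0-1e-8, -1.0

	maybeFused := x*y + z            // the compiler may or may not fuse this
	neverFused := float64(x*y) + z   // explicit conversion rounds x*y first: never fused
	alwaysFused := math.FMA(x, y, z) // guaranteed single rounding

	fmt.Println(maybeFused, neverFused, alwaysFused)
}
```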
Proposal accepted -- iant for @golang/proposal-review |
I want to clarify the architectures that require runtime feature detection. Please correct me if I'm wrong.
Assuming that internal/cpu aggregates the necessary feature-detection code, the FMA procedure would check cpu.X86.HasFMA on x86 and cpu.ARM.HasVFPv4 on arm.
Failing these checks, it would defer to a software fallback, like FreeBSD's or musl's. |
Change https://golang.org/cl/127458 mentions this issue: |
Change https://golang.org/cl/137156 mentions this issue: |
Change https://golang.org/cl/131959 mentions this issue: |
Change https://golang.org/cl/142117 mentions this issue: |
Can this go in 1.12? All of the CLs have been reviewed except one, namely https://golang.org/cl/127458. The implementation has been well tested and benchmarked, as well as being compared with alternative implementations. |
Yes, we should get this in. I've +2'd your final CL. |
Currently, the precision of the float64 multiply-add operation (x * y) + z varies across architectures. While generated code for ppc64, s390x, and arm64 can guarantee that there is no intermediate rounding on those platforms, other architectures like x86, mips, and arm will exhibit different behavior depending on available instruction set. Consequently, applications cannot rely on results being identical across GOARCH-dependent codepaths.

This CL introduces a software implementation that performs an IEEE 754 double-precision fused-multiply-add operation. The only supported rounding mode is round-to-nearest ties-to-even. Separate CLs include hardware implementations when available. Otherwise, this software fallback is given as the default implementation. Specifically,

- arm64, ppc64, s390x: Uses the FMA instruction provided by all of these ISAs.
- mips[64][le]: Falls back to this software implementation. Only release 6 of the ISA includes a strict FMA instruction with MADDF.D (not implementation defined). Because the number of R6 processors in the wild is scarce, the assembly implementation is left as a future optimization.
- x86: Guards the use of VFMADD213SD by checking cpu.X86.HasFMA.
- arm: Guards the use of VFMA by checking cpu.ARM.HasVFPv4.
- software fallback: Uses mostly integer arithmetic except for input that involves Inf, NaN, or zero.

Updates #25819.

Change-Id: Iadadff2219638bacc9fec78d3ab885393fea4a08
Reviewed-on: https://go-review.googlesource.com/c/go/+/127458
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
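As a rough illustration of that last point (a sketch, not the CL's actual code), the screening for Inf, NaN, or zero inputs can hand those cases back to the ordinary expression, since at most one rounding occurs for them:

```go
package fmasoft

import "math"

// specialCase sketches the screening a software FMA can do before its
// integer-arithmetic core: when an input is Inf, NaN, or zero, the plain
// expression already incurs at most one rounding, so no extra work is needed.
func specialCase(x, y, z float64) (res float64, ok bool) {
	switch {
	case x == 0 || y == 0 || z == 0,
		math.IsNaN(x) || math.IsNaN(y) || math.IsNaN(z),
		math.IsInf(x, 0) || math.IsInf(y, 0):
		return x*y + z, true
	case math.IsInf(z, 0):
		// x and y are finite here, so the exact result is determined by z;
		// returning z avoids a spurious NaN if x*y overflows toward -z.
		return z, true
	}
	return 0, false // finite, nonzero inputs: defer to the full algorithm
}
```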
In order to make math.FMA a compiler intrinsic for ISAs like ARM64, PPC64[le], and S390X, a generic 3-argument opcode "Fma" is provided and rewritten as

ARM64: (Fma x y z) -> (FMADDD z x y)
PPC64: (Fma x y z) -> (FMADD x y z)
S390X: (Fma x y z) -> (FMADD z x y)

Updates #25819.

Change-Id: Ie5bc628311e6feeb28ddf9adaa6e702c8c291efa
Reviewed-on: https://go-review.googlesource.com/c/go/+/131959
Run-TryBot: Akhil Indurti <aindurti@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
To permit ssa-level optimization, this change introduces an amd64 intrinsic that generates the VFMADD231SD instruction for the fused-multiply-add operation on systems that support it. System support is detected via cpu.X86.HasFMA. A rewrite rule can then translate the generic ssa intrinsic ("Fma") to VFMADD231SD.

The benchmark compares the software implementation (old) with the intrinsic (new).

name   old time/op  new time/op  delta
Fma-4  27.2ns ± 1%  1.0ns ± 9%   -96.48%  (p=0.008 n=5+5)

Updates #25819.

Change-Id: I966655e5f96817a5d06dff5942418a3915b09584
Reviewed-on: https://go-review.googlesource.com/c/go/+/137156
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
This change introduces an arm intrinsic that generates the FMULAD instruction for the fused-multiply-add operation on systems that support it. System support is detected via cpu.ARM.HasVFPv4. A rewrite rule translates the generic intrinsic to FMULAD.

Updates #25819.

Change-Id: I8459e5dd1cdbdca35f88a78dbeb7d387f1e20efa
Reviewed-on: https://go-review.googlesource.com/c/go/+/142117
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
The CLs for the hardware implementations (arm64, ppc64, s390x, x86, and arm), as well as the software fallback, have been merged into master. Note that on MIPS the software fallback is used for now, since the correct instruction (MADDF.D) is only available on Release 6 processors with double-precision support. I think the MIPS intrinsic should be added at a later time, given demand and feature-detection support (CLs 200579 and 126657). For those who want to use the software fallback before 1.14 is released, I've extracted it to a separate package: https://github.com/smasher164/fma. Finally, I want to thank everybody who has patiently guided my effort in this process. I've learned a lot while working on this, and I am thankful for the opportunity. |
Thank you very much! |
Change https://golang.org/cl/205317 mentions this issue: |
This API was added for #25819, where it was discussed as math.FMA. The commit adding it used math.Fma, presumably for consistency with the rest of the unusual names in package math (Sincos, Acosh, Erfcinv, Float32bits, etc).

I believe that using an idiomatic Go name is more important here than consistency with these other names, most of which are historical baggage from C's standard library. Early additions like Float32frombits happened before "uppercase for export" (so they were originally like "float32frombits") and they were not properly reconsidered when we uppercased the symbols to export them. That's a mistake we live with. The names of functions we have added since then, and even a few that were legacy, are more properly Go-cased, such as IsNaN, IsInf, and RoundToEven, rather than Isnan, Isinf, and Roundtoeven. And also constants like MaxFloat32. For new API, we should keep using proper Go-cased symbols instead of minimally-upper-cased-C symbols.

So math.FMA, not math.Fma. This API has not yet been released, so this change does not break the compatibility promise.

This CL also modifies cmd/compile, since the compiler knows the name of the function. I could have stopped at changing the string constants, but it seemed to make more sense to use a consistent casing everywhere.

Change-Id: I0f6f3407f41e99bfa8239467345c33945088896e
Reviewed-on: https://go-review.googlesource.com/c/go/+/205317
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Please consider adding fused multiply–add (FMA) to the standard library.
FMA computes a*b + c, but with only one floating-point rounding instead of two. If a CPU instruction is available it might even be faster than a separate multiplication and addition, but the main reason for using it is the increased precision.
The use cases include calculating dot products, evaluating polynomials, matrix multiplication, and many more.
I think the largest difficulty would be to provide a correct fallback in case the CPU does not support it directly.
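As a small usage sketch of one of the use cases above (written with the math.FMA spelling that was eventually added), polynomial evaluation by Horner's rule chains one fused multiply-add per coefficient:

```go
package main

import (
	"fmt"
	"math"
)

// horner evaluates a polynomial with the given coefficients (constant term
// first) at x, using one fused multiply-add per coefficient.
func horner(coeffs []float64, x float64) float64 {
	var r float64
	for i := len(coeffs) - 1; i >= 0; i-- {
		r = math.FMA(r, x, coeffs[i]) // r = r*x + coeffs[i], rounded once
	}
	return r
}

func main() {
	// p(x) = 1 + 2x + 3x^2 evaluated at x = 0.5
	fmt.Println(horner([]float64{1, 2, 3}, 0.5)) // 2.75
}
```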