
math: add guaranteed FMA #25819

Closed
TuomLarsen opened this issue Jun 11, 2018 · 30 comments
Labels
FeatureRequest FrozenDueToAge help wanted NeedsFix The path to resolution is known, but the work has not been done. Proposal-Accepted
Milestone

Comments

@TuomLarsen

TuomLarsen commented Jun 11, 2018

Please consider adding fused multiply–add (FMA) to the standard library.

FMA computes a*b + c with only one floating-point rounding, instead of two. If a CPU instruction is available it may even be faster than a separate multiplication and addition, but the main reason for using it is the increased precision.

Use cases include calculating dot products, evaluating polynomials, matrix multiplication, and many more.

I think the largest difficulty would be providing a correct fallback in case the CPU does not support it directly.
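To give the request some shape, here is a minimal sketch of the kind of consumer the proposal has in mind: Horner's rule for polynomial evaluation, where every step is an a*b + c. It assumes the signature that was eventually adopted, math.FMA(x, y, z float64) float64 (Go 1.14), which did not exist at the time of this comment.

```go
package main

import (
	"fmt"
	"math"
)

// hornerFMA evaluates the polynomial with the given coefficients
// (highest degree first) at x, using one fused multiply-add per step,
// so each step incurs a single rounding instead of two.
func hornerFMA(coeffs []float64, x float64) float64 {
	acc := coeffs[0]
	for _, c := range coeffs[1:] {
		acc = math.FMA(acc, x, c) // acc*x + c, rounded once
	}
	return acc
}

func main() {
	// p(x) = x^2 + 2x + 3 evaluated at x = 2 gives 11.
	fmt.Println(hornerFMA([]float64{1, 2, 3}, 2))
}
```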

@gopherbot gopherbot added this to the Proposal milestone Jun 11, 2018
@agnivade
Contributor

agnivade commented Jun 11, 2018

Dup of #8037. I don't think we want an explicit math.FMA function in the standard library. The compiler should detect statements like a*b + c and replace them with VFMADD instructions.

@TuomLarsen
Author

TuomLarsen commented Jun 11, 2018

I don't think this is a duplicate of #8037, which is about making the compiler recognize a*b + c and emit an FMA instruction at its own discretion. It might choose to replace that expression with FMA or not; sometimes it is even desirable to keep it as is. This proposal is about adding a way to explicitly request FMA, or a fallback with the same semantics, should the CPU not support it.

Therefore, please consider reopening this issue; I believe it is about something else.

@agnivade
Contributor

agnivade commented Jun 11, 2018

Yes, I doubt we would want to expose a function by which you can explicitly request FMA. It is something like an intrinsic, and we usually do not expose those in the standard library.

Anyway, I am re-opening this so that the proposal review committee can take a final call.

/cc @ianlancetaylor , @rsc

@agnivade agnivade reopened this Jun 11, 2018
@agnivade agnivade changed the title Proposal: add math.FMA proposal: add math.FMA Jun 11, 2018
@alexd765
Contributor

alexd765 commented Jun 11, 2018

@TuomLarsen: what would be the benefit of declaring that manually compared to the compiler figuring it out automatically?

@randall77
Contributor

randall77 commented Jun 11, 2018

@alexd765 : Because if your application needs the extra precision, you can't rely on a compiler optimization accidentally introducing that extra precision for you. For one thing, it won't be portable.

@agnivade: Do you have a specific application where the extra precision is needed? It's going to be hard to evaluate this proposal without a better understanding of where it is needed.

@bmkessler
Contributor

bmkessler commented Jun 11, 2018

For reference, note that an FMA function is included in the IEEE 754-2008 revision of the floating-point standard, and fma was added to the C99 math standard library in ISO/IEC 9899:1999, as well as being included in other languages such as Java.

Some brief discussion of the use of an explicit FMA was also mentioned in the original FMA issue #17895

@josharian
Contributor

josharian commented Jun 11, 2018

cc @btracey @kortschak

@btracey
Contributor

btracey commented Jun 11, 2018

Also: @Kunde21

@agnivade
Contributor

agnivade commented Jun 11, 2018

@agnivade: Do you have a specific application where the extra precision is needed? It's going to be hard to evaluate this proposal without a better understanding of where it is needed.

I think you meant to ping @TuomLarsen

@TuomLarsen
Author

TuomLarsen commented Jun 11, 2018

@randall77 It is useful whenever one needs to compute a*b + c (repeatedly). So mostly numerical algorithms such as dot product, matrix multiplication, polynomial evaluation, Newton's method, convolutions and artificial neural networks, ... (from Wikipedia).

But I guess you knew that already, so taking the first option as an example: the first hit for a "dot product fma" query reveals a poster which compares the classical dot product computation with various FMA-flavoured ones. The main takeaway is its last line: "CompDot is about 6 times faster than ... DotXBLAS", while providing the same accuracy as if it were calculated with the unevaluated sum of two doubles (which is almost as good as quadruple-precision floats).

@rsc
Contributor

rsc commented Jun 11, 2018

What is the proposed implementation of math.FMA on systems that do not have an FMA instruction in hardware?

@rsc
Contributor

rsc commented Jun 11, 2018

To elaborate on the previous comment ("What is the proposed fallback implementation when there's no hardware support?"):

If the proposed fallback is return a*b+c, then there is no difference between math.FMA(a,b,c) and a*b+c, so we don't need a separate function, in which case this is a dup of #8037 as @agnivade said. That is, Go's current a*b+c seems to match java.lang.Math.fma, so no new function needed for that.

If the proposed fallback is "do non-trivial work to somehow produce a result that is higher precision than float64(a*b)+c ("FMA disabled") would be", then OK, this is an issue to leave open. That "must be equal in precision and result to IEEE 754 single-rounding FMA" definition would merit its own function math.FMA and would correspond to java.lang.StrictMath.fma. I'm going to assume the proposal is for this "always bit-perfect FMA" meaning, because otherwise we can just close it.

@TuomLarsen, the Wikipedia article correctly says "A fast FMA can speed up and improve the accuracy of many computations that involve the accumulation of products."

If you care about speed alone, then the current Go compiler optimization of compiling a*b+c as an FMA (which Java never does) provides that speed. So if you just wanted a fast dot product, then you'd want to write it with a*b+c expressions, and the compiler would use the fastest implementation available - an FMA where it exists, and otherwise a separate multiply and add. You would not want to call the proposed math.FMA, because it would be very slow on systems without an FMA, where the bit-precise answer would have to be computed in significantly more than 2 floating-point operations.

If you care about accuracy more than speed, that's when you'd want math.FMA, to be able to force the bit-precise FMA results, where you'd be willing to run significantly slower than a*b+c on non-FMA hardware in order to get those bit-precise results. @randall77's question is asking "what is the specific situation where you foresee needing to make that decision?"

@TuomLarsen
Author

TuomLarsen commented Jun 11, 2018

The proposal is about a precise, always correct (i.e. "strict") a*b+c with only one rounding.

If using FMA is faster than a separate multiplication and addition, so much the better, but that is not the main motivation here. The linked Wikipedia page lists quite a lot of modern processors with FMA support, so calls to the slow fallback should be quite rare. As for a simple a*b+c fallback (multiply, then add), I'm afraid it would not make much sense, as the "F" means fused, i.e. only one rounding.

So yes, this proposal cares more about accuracy than speed.

PS: As for the actual fallback, there is probably more than one way of doing it; this is what the Julia folks have done, for example.

@rsc
Contributor

rsc commented Jun 11, 2018

@TuomLarsen, OK, great, thank you for clarifying that the proposal is about a strict bit-precise FMA.

Please help me read the Wikipedia page: what is an example of a motivating, common application that cares so much about accuracy that it would prefer a slow software FMA implementation over a non-fused multiply-add?

(The point about "quite a lot of modern processors support FMA" cuts against having a special function. If everything is already doing FMA for a*b+c then why bother adding a special math.FMA? There must be some compelling use where you absolutely have to have the bit-precise FMA.)

I'm sorry if we're talking past each other a bit.

@bmkessler
Contributor

bmkessler commented Jun 11, 2018

One benefit to fma is tracking floating point errors, which allows carrying higher precision (>float64, e.g. double-double) internally for certain calculations to ensure an accurate result at the end. Calculating x*y exactly is a building block for extended-precision calculations.

zhi := x*y
zlo := fma(x, y, -zhi)  // zlo = x*y-zhi

then x*y == zhi + zlo exactly.

The following paper from IBM provides some motivating examples.

The Fused Multiply-Add Instruction Leads to Algorithms for Extended-Precision Floating Point: Applications to Java and High-Performance Computing

Finally, we demonstrate the accuracy of the algorithms on example problems (matrix multiplication, 2 x 2 determinant, complex multiplication, and triangle area calculation), for which existing computer arithmetic gives completely inaccurate results in certain instances.

@rsc
Contributor

rsc commented Jun 12, 2018

@bmkessler, thanks for that link. Very interesting.

@TuomLarsen
Author

TuomLarsen commented Jun 12, 2018

@rsc In addition to what @bmkessler said: FMA improves the accuracy of a*b+c, usually at least by a bit. But then there are these error-free transformations (EFTs), which in addition to the normal floating-point operations account for the rounding errors, and allow for almost double the working floating-point precision (so in the case of double precision, they are almost as precise as if calculated with quadruple precision). A basic such building block is the EFT product, seen e.g. in the poster I linked to, which when implemented without FMA requires a "magic" constant and 17 FLOPs, whereas with FMA it takes only 2 FLOPs, as @bmkessler showed. So if one implemented the EFT product as a plain multiplication and addition, the algorithms would return incorrect results, as they rely on the handling of very tiny rounding errors.

"If everything is already doing FMA..." First, I wrote "quite a lot" precisely because I'm not sure everybody has the FMA instruction, so I presume a proper fallback would be necessary. Second, it is also not desirable to automatically replace every a*b+c with FMA, as it may break some algorithms; see e.g. this notice about FMA (search for "THE FMA PROBLEM").

@rsc
Contributor

rsc commented Jun 18, 2018

The IBM paper linked by @bmkessler convinced me that this issue is basically the floating-point equivalent of #24813. In short, the trick is that given float64 values x, y, the high word of x*y is given by float64(x*y) and the low word is given by math.FMA(x, y, -float64(x*y)) (i.e. x*y - float64(x*y), computed with a single rounding). Those two words together are an exact representation of the product, for the same reason that high and low uint64s can be an exact representation of a uint64*uint64 product.

That is, a guaranteed-precision FMA exposes double-width floating-point multiply, the same way #24813 is about exposing double-width integer multiply (and other operations). Also, given #24813's bits.Mul64, a software implementation of math.FMA (for the systems without FMA hardware) would not be many lines of code.

I think we should probably accept this issue.

@rsc rsc changed the title proposal: add math.FMA proposal: math: add guaranteed FMA Jun 18, 2018
@btracey
Contributor

btracey commented Jun 18, 2018

@TuomLarsen Note that you can already prevent FMA with float64(a*b) + c

@ianlancetaylor
Contributor

ianlancetaylor commented Jun 18, 2018

Proposal accepted -- iant for @golang/proposal-review

@ianlancetaylor ianlancetaylor added help wanted Proposal-Accepted NeedsFix The path to resolution is known, but the work has not been done. labels Jun 18, 2018
@ianlancetaylor ianlancetaylor changed the title proposal: math: add guaranteed FMA math: add guaranteed FMA Jun 18, 2018
@smasher164
Member

smasher164 commented Jul 29, 2018

I want to clarify the architectures that require runtime feature detection. Please correct me if I'm wrong.

  • x86: Yes, depends on the FMA3 instruction set.
  • arm: Yes, depends on VFPv4.
  • arm64: No, arm64 == armv8 and above in Go.
  • mips[64]: Yes, the proper instruction was introduced in Release 6 of the ISA. The CPU must also support double precision.
  • ppc64: No. POWER8 assembly can safely assume that FMA exists.
  • s390x: No. MADBR was introduced with System/390.

Assuming that internal/cpu aggregates the necessary feature-detection code, the FMA procedure would check

  • arm: cpu.ARM.HasVFPv4
  • mips, mipsle, mips64, mips64le: cpu.MIPS.HasR6 && cpu.MIPS.HasF64
  • 386, amd64, amd64p32: cpu.x86.HasFMA

Failing these checks would defer to a software fallback, like freebsd's or musl's.
The remaining architectures would use an assembly implementation.

@gopherbot

gopherbot commented Aug 2, 2018

Change https://golang.org/cl/127458 mentions this issue: math: add guaranteed-precision FMA intrinsic

@gopherbot

gopherbot commented Oct 20, 2018

Change https://golang.org/cl/137156 mentions this issue: cmd/compile: add fma intrinsic for amd64

@gopherbot

gopherbot commented Oct 20, 2018

Change https://golang.org/cl/131959 mentions this issue: cmd/compile: introduce generic ssa intrinsic for fused-multiply-add

@gopherbot

gopherbot commented Oct 20, 2018

Change https://golang.org/cl/142117 mentions this issue: cmd/compile: add fma intrinsic for arm

@smasher164
Member

smasher164 commented Nov 2, 2018

Can this go into 1.12? All of the CLs except one have been reviewed, namely https://golang.org/cl/127458. The implementation has been well tested and benchmarked, as well as compared with alternative implementations.

@randall77
Contributor

randall77 commented Nov 5, 2018

Yes, we should get this in. I've +2'd your final CL.

@smasher164 smasher164 modified the milestones: Unplanned, Go1.14 Sep 8, 2019
@smasher164 smasher164 self-assigned this Sep 8, 2019
@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@smasher164 smasher164 modified the milestones: Backlog, Go1.14 Oct 11, 2019
gopherbot pushed a commit that referenced this issue Oct 21, 2019
Currently, the precision of the float64 multiply-add operation
(x * y) + z varies across architectures. While generated code for
ppc64, s390x, and arm64 can guarantee that there is no intermediate
rounding on those platforms, other architectures like x86, mips, and
arm will exhibit different behavior depending on available instruction
set. Consequently, applications cannot rely on results being identical
across GOARCH-dependent codepaths.

This CL introduces a software implementation that performs an IEEE 754
double-precision fused-multiply-add operation. The only supported
rounding mode is round-to-nearest ties-to-even. Separate CLs include
hardware implementations when available. Otherwise, this software
fallback is given as the default implementation.

Specifically,
    - arm64, ppc64, s390x: Uses the FMA instruction provided by all
      of these ISAs.
    - mips[64][le]: Falls back to this software implementation. Only
      release 6 of the ISA includes a strict FMA instruction with
      MADDF.D (not implementation defined). Because the number of R6
      processors in the wild is scarce, the assembly implementation
      is left as a future optimization.
    - x86: Guards the use of VFMADD213SD by checking cpu.X86.HasFMA.
    - arm: Guards the use of VFMA by checking cpu.ARM.HasVFPv4.
    - software fallback: Uses mostly integer arithmetic except
      for input that involves Inf, NaN, or zero.

Updates #25819.

Change-Id: Iadadff2219638bacc9fec78d3ab885393fea4a08
Reviewed-on: https://go-review.googlesource.com/c/go/+/127458
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
gopherbot pushed a commit that referenced this issue Oct 21, 2019
In order to make math.FMA a compiler intrinsic for ISAs like ARM64,
PPC64[le], and S390X, a generic 3-argument opcode "Fma" is provided and
rewritten as

    ARM64: (Fma x y z) -> (FMADDD z x y)
    PPC64: (Fma x y z) -> (FMADD x y z)
    S390X: (Fma x y z) -> (FMADD z x y)

Updates #25819.

Change-Id: Ie5bc628311e6feeb28ddf9adaa6e702c8c291efa
Reviewed-on: https://go-review.googlesource.com/c/go/+/131959
Run-TryBot: Akhil Indurti <aindurti@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
gopherbot pushed a commit that referenced this issue Oct 21, 2019
To permit ssa-level optimization, this change introduces an amd64 intrinsic
that generates the VFMADD231SD instruction for the fused-multiply-add
operation on systems that support it. System support is detected via
cpu.X86.HasFMA. A rewrite rule can then translate the generic ssa intrinsic
("Fma") to VFMADD231SD.

The benchmark compares the software implementation (old) with the intrinsic
(new).

name   old time/op  new time/op  delta
Fma-4  27.2ns ± 1%   1.0ns ± 9%  -96.48%  (p=0.008 n=5+5)

Updates #25819.

Change-Id: I966655e5f96817a5d06dff5942418a3915b09584
Reviewed-on: https://go-review.googlesource.com/c/go/+/137156
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
gopherbot pushed a commit that referenced this issue Oct 21, 2019
This change introduces an arm intrinsic that generates the FMULAD
instruction for the fused-multiply-add operation on systems that
support it. System support is detected via cpu.ARM.HasVFPv4. A rewrite
rule translates the generic intrinsic to FMULAD.

Updates #25819.

Change-Id: I8459e5dd1cdbdca35f88a78dbeb7d387f1e20efa
Reviewed-on: https://go-review.googlesource.com/c/go/+/142117
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@smasher164
Member

smasher164 commented Oct 21, 2019

The CLs for the hardware implementations (arm64, ppc64, s390x, x86, and arm) as well as the software fallback have been merged into master. Note that on MIPS the software fallback is used for now, since the correct instruction (MADDF.D) is only available on Release 6 processors with double-precision support. I think the MIPS intrinsic should be added at a later time, given demand and feature-detection support (CLs 200579 and 126657).

For those who want to use the software fallback before 1.14 is released, I've extracted it to a separate package: https://github.com/smasher164/fma.

Finally, I want to thank everybody who has patiently guided my effort in this process. I’ve learned a lot while working on this, and I am thankful for the opportunity.

@TuomLarsen
Author

TuomLarsen commented Oct 22, 2019

Thank you very much!

@gopherbot

gopherbot commented Nov 5, 2019

Change https://golang.org/cl/205317 mentions this issue: math, cmd/compile: rename Fma to FMA

gopherbot pushed a commit that referenced this issue Nov 7, 2019
This API was added for #25819, where it was discussed as math.FMA.
The commit adding it used math.Fma, presumably for consistency
with the rest of the unusual names in package math
(Sincos, Acosh, Erfcinv, Float32bits, etc).

I believe that using an idiomatic Go name is more important here
than consistency with these other names, most of which are historical
baggage from C's standard library.

Early additions like Float32frombits happened before "uppercase for export"
(so they were originally like "float32frombits") and they were not properly
reconsidered when we uppercased the symbols to export them.
That's a mistake we live with.

The names of functions we have added since then, and even a few
that were legacy, are more properly Go-cased, such as IsNaN, IsInf,
and RoundToEven, rather than Isnan, Isinf, and Roundtoeven.
And also constants like MaxFloat32.

For new API, we should keep using proper Go-cased symbols
instead of minimally-upper-cased-C symbols.

So math.FMA, not math.Fma.

This API has not yet been released, so this change does not break
the compatibility promise.

This CL also modifies cmd/compile, since the compiler knows
the name of the function. I could have stopped at changing the
string constants, but it seemed to make more sense to use a
consistent casing everywhere.

Change-Id: I0f6f3407f41e99bfa8239467345c33945088896e
Reviewed-on: https://go-review.googlesource.com/c/go/+/205317
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@golang golang locked and limited conversation to collaborators Nov 4, 2020