
math: add guaranteed FMA #25819

Open
TuomLarsen opened this issue Jun 11, 2018 · 27 comments

Comments

@TuomLarsen commented Jun 11, 2018

Please consider adding fused multiply–add (FMA) to the standard library.

FMA computes a*b + c with only one floating-point rounding instead of two. If a CPU instruction is available it can even be faster than a separate multiplication and addition, but the main reason for using it is the increased precision.

Use cases include dot products, polynomial evaluation, matrix multiplication, and many more.

I think the largest difficulty would be providing a correct fallback in case the CPU does not support it directly.

@gopherbot gopherbot added this to the Proposal milestone Jun 11, 2018

@gopherbot gopherbot added the Proposal label Jun 11, 2018

@agnivade (Member) commented Jun 11, 2018

Dup of #8037. I don't think we want an explicit math.FMA function in the standard library. The compiler should detect expressions like a*b + c and replace them with VFMADD instructions.

@agnivade agnivade closed this Jun 11, 2018

@TuomLarsen (Author) commented Jun 11, 2018

I don't think this is a duplicate of #8037, which is about making the compiler recognize a*b + c and emit an FMA instruction at its own discretion. It may choose to replace that expression with FMA or not; sometimes it is even desirable to keep it as is. This proposal is about adding a way to explicitly request FMA, or a comparable fallback should the CPU not support it directly.

Therefore please reconsider and reopen this issue; I believe it is about something else.

@agnivade (Member) commented Jun 11, 2018

Yes, I doubt we would want to expose a function by which you can explicitly request FMA. It is something like an intrinsic, and we usually do not expose those in the standard library.

Anyway, I am re-opening this so that the proposal review committee can make a final call.

/cc @ianlancetaylor , @rsc

@agnivade agnivade reopened this Jun 11, 2018

@agnivade agnivade changed the title Proposal: add math.FMA proposal: add math.FMA Jun 11, 2018

@alexd765 (Contributor) commented Jun 11, 2018

@TuomLarsen: what would be the benefit of declaring that manually compared to the compiler figuring it out automatically?

@randall77 (Contributor) commented Jun 11, 2018

@alexd765: Because if your application needs the extra precision, you can't rely on a compiler optimization accidentally introducing that extra precision for you; for one thing, it won't be portable.

@agnivade: Do you have a specific application where the extra precision is needed? It's going to be hard to evaluate this proposal without a better understanding of where it is needed.

@bmkessler (Contributor) commented Jun 11, 2018

For reference, note that the FMA operation is included in the IEEE 754-2008 revision of the floating-point standard, and the fma function was added to the C99 math standard library in ISO/IEC 9899:1999, as well as being included in other languages such as Java.

Some brief discussion of an explicit FMA also came up in the original FMA issue #17895.

@josharian (Contributor) commented Jun 11, 2018

@btracey (Contributor) commented Jun 11, 2018

Also: @Kunde21

@agnivade (Member) commented Jun 11, 2018

@agnivade: Do you have a specific application where the extra precision is needed? It's going to be hard to evaluate this proposal without a better understanding of where it is needed.

I think you meant to ping @TuomLarsen

@TuomLarsen (Author) commented Jun 11, 2018

@randall77 It is useful whenever one needs to compute a*b + c (repeatedly). So mostly numerical algorithms such as dot products, matrix multiplication, polynomial evaluation, Newton's method, convolutions and artificial neural networks, ... (from Wikipedia).

But I guess you knew that already, so taking the first of those as an example: the first hit for the query "dot product fma" is a poster which compares the classical dot product computation with various FMA-flavoured ones. The main takeaway is the last line: "CompDot is about 6 times faster than ... DotXBLAS", while providing the same accuracy as if it were calculated with the unevaluated sum of two doubles (which is almost as good as quadruple-precision floats).
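For illustration, here is a minimal, self-contained sketch of such a compensated dot product (the Dot2 scheme of Ogita, Rump and Oishi), assuming the proposed math.FMA(x, y, z float64) float64 signature; the helper names and test values are made up for the example:

package main

import (
    "fmt"
    "math"
)

// twoSum returns s = fl(a+b) and the exact rounding error t, so a+b == s+t.
func twoSum(a, b float64) (s, t float64) {
    s = a + b
    z := s - a
    t = (a - (s - z)) + (b - z)
    return s, t
}

// dot2 computes the dot product of x and y carrying roughly twice the
// working precision: each term's product is split exactly with an FMA and
// the rounding errors are accumulated separately.
func dot2(x, y []float64) float64 {
    var p, s float64
    for i := range x {
        h := x[i] * y[i]
        r := math.FMA(x[i], y[i], -h) // exact low-order part of x[i]*y[i]
        var q float64
        p, q = twoSum(p, h)
        s += q + r
    }
    return p + s
}

func main() {
    x := []float64{1e16, 1, -1e16}
    y := []float64{1, 1, 1}
    fmt.Println(dot2(x, y)) // prints 1; a plain a*b+c loop returns 0 here
}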

@rsc (Contributor) commented Jun 11, 2018

What is the proposed implementation of math.FMA on systems that do not have an FMA instruction in hardware?

@rsc (Contributor) commented Jun 11, 2018

To elaborate on the previous comment ("What is the proposed fallback implementation when there's no hardware support?"):

If the proposed fallback is return a*b+c, then there is no difference between math.FMA(a,b,c) and a*b+c, so we don't need a separate function, in which case this is a dup of #8037 as @agnivade said. That is, Go's current a*b+c seems to match java.lang.Math.fma, so no new function needed for that.

If the proposed fallback is "do non-trivial work to somehow produce a result that is higher precision than float64(a*b)+c ("FMA disabled") would be", then OK, this is an issue to leave open. That "must be equal in precision and result to IEEE-758 single-rounding FMA" definition would merit its own function math.FMA and would correspond to java.lang.StrictMath.fma. I'm going to assume the proposal is for this "always bit-perfect FMA" meaning, because otherwise we can just close it.

@TuomLarsen, the Wikipedia article correctly says "A fast FMA can speed up and improve the accuracy of many computations that involve the accumulation of products."

If you care about speed alone, then the current Go compiler optimization of compiling a*b+c as an FMA (which Java never does) provides that speed. So if you just wanted a fast dot product, then you'd want to write it with a*b+c expressions, and the compiler would use the fastest implementation available - an FMA where it exists, and otherwise a separate multiply and add. You would not want to call the proposed math.FMA, because it would be very slow on systems without an FMA, where the bit-precise answer would have to be computed in significantly more than 2 floating-point operations.

If you care about accuracy more than speed, that's when you'd want math.FMA, to be able to force the bit-precise FMA results, where you'd be willing to run significantly slower than a*b+c on non-FMA hardware in order to get those bit-precise results. @randall77's question is asking "what is the specific situation where you foresee needing to make that decision?"

@TuomLarsen (Author) commented Jun 11, 2018

The proposal is about a precise, always-correct (i.e. "strict") a*b+c with only one rounding.

If FMA is faster than a separate multiplication and addition, that's a bonus, but it is not the main motivation here. The linked Wikipedia page lists quite a lot of modern processors with FMA support, so falling back to the slow path should be quite rare. As far as a simple a*b+c fallback (multiply and add) goes, I'm afraid it would not make a lot of sense, as the "F" means fused, i.e. only one rounding.

So yes, this proposal cares more about accuracy than speed.

PS: As for the actual fallback, there is probably more than one way of doing it; this is what the Julia folks have done, for example.

@rsc (Contributor) commented Jun 11, 2018

@TuomLarsen, OK, great, thank you for clarifying that the proposal is about a strict bit-precise FMA.

Please help me read the Wikipedia page: what is an example of a motivating, common application that cares so much about accuracy that it would prefer a slow software FMA implementation over a non-fused multiply-add?

(The point about "quite a lot of modern processors support FMA" cuts against having a special function. If everything is already doing FMA for a*b+c then why bother adding a special math.FMA? There must be some compelling use where you absolutely have to have the bit-precise FMA.)

I'm sorry if we're talking past each other a bit.

@bmkessler (Contributor) commented Jun 11, 2018

One benefit of fma is tracking floating-point errors, which allows carrying higher precision (more than float64, e.g. double-double) internally for certain calculations to ensure an accurate result at the end. Calculating x*y exactly is a building block for such extended-precision calculations:

zhi := x*y
zlo := fma(x, y, -zhi)  // zlo = x*y-zhi

then x*y == zhi + zlo exactly.

The following paper from IBM provides some motivating examples.

The Fused Multiply-Add Instruction Leads to Algorithms for Extended-Precision Floating Point: Applications to Java and High-Performance Computing

Finally, we demonstrate the accuracy of the algorithms on example problems (matrix multiplication, 2 x 2 determinant, complex multiplication, and triangle area calculation), for which existing computer arithmetic gives completely inaccurate results in certain instances.
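As a concrete sketch of one of those example problems, here is Kahan's FMA-based algorithm for an accurate 2 x 2 determinant, again assuming the proposed math.FMA (and an imported math package):

// det2x2 computes a*d - b*c with high accuracy. The FMA recovers the exact
// rounding error of the product b*c and folds it back into the result.
func det2x2(a, b, c, d float64) float64 {
    w := b * c
    e := math.FMA(b, c, -w) // exact rounding error of b*c
    f := math.FMA(a, d, -w) // a*d - w with a single rounding
    return f + e
}

A naive a*d - b*c can lose every significant digit to cancellation when a*d and b*c are nearly equal; this version stays accurate to within a couple of ulps.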

@rsc (Contributor) commented Jun 12, 2018

@bmkessler, thanks for that link. Very interesting.

@TuomLarsen (Author) commented Jun 12, 2018

@rsc In addition to what @bmkessler said: FMA improves the accuracy of a*b+c, usually at least by a bit. But then there are error-free transformations which, in addition to the normal floating-point operations, account for the rounding errors and allow almost double the working floating-point precision (so in the case of double precision, they are almost as precise as if calculated with quadruple precision). A basic building block is the EFT product, seen e.g. in the poster I linked to, which when implemented without FMA requires a "magic" constant and 17 FLOPS, whereas with FMA it takes only 2 FLOPS, as @bmkessler showed. If one implemented the EFT product as a plain multiplication and addition, these algorithms would return incorrect results, as they rely on handling the very tiny rounding errors.
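For reference, here is a sketch of that FMA-free EFT product (Dekker's algorithm with the 2^27+1 "magic" splitter, i.e. the 17-FLOP variant); the function names are just illustrative:

// split breaks a into two non-overlapping halves (hi + lo == a exactly)
// using Dekker's splitter constant.
func split(a float64) (hi, lo float64) {
    const splitter = 1<<27 + 1 // the "magic" constant, 134217729
    c := splitter * a
    hi = c - (c - a)
    lo = a - hi
    return hi, lo
}

// twoProdDekker returns p = fl(x*y) and the exact error e, so x*y == p + e,
// using 17 floating-point operations instead of the 2 needed with an FMA.
func twoProdDekker(x, y float64) (p, e float64) {
    p = x * y
    xhi, xlo := split(x)
    yhi, ylo := split(y)
    e = ((xhi*yhi - p) + xhi*ylo + xlo*yhi) + xlo*ylo
    return p, e
}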

"If everything is already doing FMA..." First, I wrote "quite a lot" precisely because I'm not sure if everybody has the FMA instruction so I presume a proper fallback would be necessary. Second, it is also not desirable to automatically replace all a*b+c with FMA, as it may break some algorithms, see e.g. this notice about FMA (search for "THE FMA PROBLEM").

@rsc (Contributor) commented Jun 18, 2018

The IBM paper linked by @bmkessler convinced me that this issue is basically the floating-point equivalent of #24813. In short, the trick is that given float64 values x, y, the high word of x*y is given by float64(x*y) and the low word is given by math.FMA(x, y, -float64(x*y)) (that is, x*y - float64(x*y) computed with a single rounding, which FMA-enabled hardware does in one instruction). Those two words together are an exact representation of the product, for the same reason that high and low uint64s are an exact representation of a uint64*uint64 product.

That is, a guaranteed-precision FMA exposes double-width floating-point multiply, the same way #24813 is about exposing double-width integer multiply (and other operations). Also, given #24813's bits.Mul64, a software implementation of math.FMA (for the systems without FMA hardware) would not be many lines of code.
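For comparison, the integer analogue looks like this (a sketch using the bits.Mul64 signature from #24813; the operands are arbitrary):

package main

import (
    "fmt"
    "math/bits"
)

func main() {
    // hi and lo together are the exact 128-bit product of the two uint64
    // arguments, just as float64(x*y) plus the FMA-computed low word are
    // the exact product of two float64s.
    hi, lo := bits.Mul64(1<<40+3, 1<<40+5)
    fmt.Println(hi, lo)
}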

I think we should probably accept this issue.

@rsc rsc changed the title proposal: add math.FMA proposal: math: add guaranteed FMA Jun 18, 2018

@btracey (Contributor) commented Jun 18, 2018

@TuomLarsen Note that you can already prevent FMA with float64(a*b) + c
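For illustration, a short sketch of that distinction (the function names are made up):

// mayFuse may be compiled to a single fused multiply-add on hardware that
// has one, so its rounding can differ between platforms.
func mayFuse(a, b, c float64) float64 {
    return a*b + c
}

// neverFuses rounds the product to float64 before the addition; the explicit
// conversion forces that rounding, so the operations are never fused.
func neverFuses(a, b, c float64) float64 {
    return float64(a*b) + c
}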

@ianlancetaylor (Contributor) commented Jun 18, 2018

Proposal accepted -- iant for @golang/proposal-review

@gopherbot gopherbot added the Proposal label Jun 18, 2018

@ianlancetaylor ianlancetaylor modified the milestones: Proposal, Unplanned Jun 18, 2018

@ianlancetaylor ianlancetaylor changed the title proposal: math: add guaranteed FMA math: add guaranteed FMA Jun 18, 2018

@smasher164 (Member) commented Jul 29, 2018

I want to clarify the architectures that require runtime feature detection. Please correct me if I'm wrong.

  • x86: Yes, depends on the FMA3 instruction set.
  • arm: Yes, depends on VFPv4.
  • arm64: No, arm64 == armv8 and above in Go.
  • mips[64]: Yes, the fused instruction was only introduced in Release 6. The CPU must also support double-precision.
  • ppc64: No. POWER8 assembly can safely assume that FMA exists.
  • s390x: No. MADBR was introduced with the 390.

Assuming that internal/cpu aggregates the necessary feature-detection code, the FMA procedure would check

  • arm: cpu.ARM.HasVFPv4
  • mips, mipsle, mips64, mips64le: cpu.MIPS.HasR6 && cpu.MIPS.HasF64
  • 386, amd64, amd64p32: cpu.X86.HasFMA

Failing these checks, the implementation would fall back to a software version, like FreeBSD's or musl's.
The remaining architectures would use an assembly implementation.

@gopherbot commented Aug 2, 2018

Change https://golang.org/cl/127458 mentions this issue: math: add guaranteed-precision FMA intrinsic

@gopherbot commented Oct 20, 2018

Change https://golang.org/cl/137156 mentions this issue: cmd/compile: add fma intrinsic for amd64

@gopherbot commented Oct 20, 2018

Change https://golang.org/cl/131959 mentions this issue: cmd/compile: introduce generic ssa intrinsic for fused-multiply-add

@gopherbot commented Oct 20, 2018

Change https://golang.org/cl/142117 mentions this issue: cmd/compile: add fma intrinsic for arm

@smasher164 (Member) commented Nov 2, 2018

Can this go in 1.12? All of the CLs have been reviewed except one, namely https://golang.org/cl/127458. The implementation has been well tested and benchmarked, as well as being compared with alternative implementations.

@randall77 (Contributor) commented Nov 5, 2018

Yes, we should get this in. I've +2'd your final CL.
