
spec: allow the use of fused multiply-add floating point instructions #17895

Closed
mundaym opened this issue Nov 11, 2016 · 36 comments

Comments

@mundaym (Member) commented Nov 11, 2016

Fused multiply-add (FMA) floating point instructions typically provide improved accuracy and performance compared to separate floating point multiply and add instructions. However, they may change the result of such an operation because they omit the rounding that would normally take place between the multiply and the add.

This proposal seeks to clarify the guarantees that Go provides as to when rounding to float32 or float64 must be performed so that FMA operations can be safely extracted by SSA rules. I assume that complex{64,128} casts will be lowered to float{32,64} casts for the purposes of this proposal.

The consensus from previous discussions on the subject is that explicit casts should force rounding, as is already specified for constants:

❌ a := float64(x * y) + z  (1)
❌ z += float64(x * y)      (2)

There is also consensus that parentheses should not force rounding. So in the following cases the intermediate rounding stage can be omitted and a FMA used:

✅ a := x * y + z           (3)
✅ a := (x * y) + z         (4)
✅ z += x * y               (5)
✅ z += (x * y)             (6)

It is also proposed that assignments to local variables should not force rounding to take place:

✅ t := x * y; t += z       (7)

I also propose that an assignment to a memory location should force rounding (I lean towards forcing rounding whenever an intermediate result is visible to the program):

❌ *a = x * y; t := *a + z  (8)

(SSA rules could optimize example 8 because they will replace the load from a with a reuse of the result of x * y.)

I think the only real complexity in the implementation is how we plumb the casts from the compiler to the SSA backend so that optimization rules can be blocked as appropriate. I’m not sure if there is a pre-existing mechanism we can use.

See these links for previous discussion of this proposal on golang-dev:
https://groups.google.com/d/topic/golang-dev/qvOqcmAkKnA/discussion
https://groups.google.com/d/topic/golang-dev/cVnE1K08Aks/discussion

@rsc modified the milestone: Proposal Nov 14, 2016
@rsc (Contributor) commented Nov 14, 2016

Note: the important part of this outcome is the precedent, or more general rule, that it establishes for avoiding compiler optimizations on float32/float64 arithmetic.

@tombergan (Contributor) commented Nov 16, 2016

What is a "local variable", what is a "memory location", and what does it mean for an intermediate result to be "visible to the program"? These may seem like simple questions, but in practice the distinction is blurry and depends on choices made by the compiler. Examples:

var global float64
var globalPtr *float64

// z is changed after it escapes. Is rounding forced?
// Note that z is a "local variable", but also a "memory location", since it escapes.
func case1(x, y float64) {
  var z float64
  globalPtr = &z
  z = x + y
}

// z is changed before it escapes. Is rounding forced before or after Println?
func case2(x, y float64) {
  var z float64
  z = x + y
  fmt.Println(z)
  globalPtr = &z
}

// The first store into `global` may not be visible to other goroutines
// since there is no memory barrier between the two stores. The compiler
// may merge these stores into one. Is rounding forced on (x + y), or
// just (x + y + 2)?
func case3(x, y float64) {
  global = x + y
  global += 2
}

// A smart compiler might realize that z doesn't escape, even though its
// address is taken implicitly by the closure. Is rounding forced?
func case4(x, y float64) float64 {
  var z float64
  var wg sync.WaitGroup
  wg.Add(1)
  go func() {
    defer wg.Done()
    z = x + y
  }()
  wg.Wait()
  return z
}

I think I agree with rsc's suggestion from the second linked thread: "I would lean toward allowing FMA aggressively except for an explicit conversion."

@mundaym (Member, Author) commented Nov 16, 2016

What is a "local variable", what is a "memory location", and what does it mean for an intermediate result to be "visible to the program"? These may seem like simple questions, but in practice the distinction is blurry and depends on choices made by the compiler.

Thanks. You're right of course, I was being imprecise. These properties aren't necessarily clear in code. They are somewhat clearer in the backend, but that might change as the compiler gets cleverer.

I think I agree with rsc's suggestion from the second linked thread: "I would lean toward allowing FMA aggressively except for an explicit conversion."

I think I also agree that would be a good rule but I'm starting to wonder if a strict interpretation of the spec already implies that assignment should force rounding. The spec defines float32 and float64 as:

float32     the set of all IEEE-754 32-bit floating-point numbers
float64     the set of all IEEE-754 64-bit floating-point numbers

Since the intermediate result of a fused multiply-add may not be a valid IEEE-754 32- or 64-bit floating-point number, this definition would seem to suggest that an assignment should prevent the optimization. However gri and rsc's comments on the thread would seem to imply that they don't agree (and they obviously know much better than me).

Assignments to variables currently block full precision constant propagation AFAICT, so forcing rounding at assignments would be in line with that behavior. Could the output of the following program change in future?

package main

import "fmt"

const x = 1.0000000000000001

func main() {
    x0 := x*10
    fmt.Println(x0) // prints 10.000000000000002
    x1 := x
    x1 *= 10
    fmt.Println(x1) // prints 10, but could be 10.000000000000002
}

(https://play.golang.org/p/6OMMzRq0pr)

@rsc (Contributor) commented Nov 28, 2016

It sounds like we agree that a float64 conversion should be an explicit signal that a rounded float64 should be materialized, so that for example float64(x*y)+z cannot use an FMA, but x*y+z and (x*y)+z can.

The question raised in @mundaym's latest comment is whether we're sure about case (7) above: if the code does t := x*y; t += z, should we allow t to never be materialized as a float64, so that for example FMA can be used in that case? The argument in favor of allowing optimization here is that generated code or code transformations might introduce temporaries, and we probably don't want that to have optimization effects. The argument against is that there's an explicit variable of type float64. On balance it seems that having a very explicit signal, as in the float64 conversion, would be best, so I would lean toward keeping case (7) allowed to use FMA.

We don't know too much about what other languages do here.

We know Fortran uses parentheses, although for FMA that only helps because * has higher precedence than +, so the parens are otherwise optional. We'd rather not overload parens this way.

We don't think the C or C++ languages give easy control over this (possibly writing to a volatile and reading it back?).

What about Java? How do they provide access to FMA?

/cc @MichaelTJones for thoughts or wisdom about any of this.

@bradfitz (Contributor) commented Nov 28, 2016

Looks like Java is making it explicit with library additions: https://www.mail-archive.com/core-libs-dev@openjdk.java.net/msg39320.html

Commit: http://cr.openjdk.java.net/~darcy/4851642.0/

@ianlancetaylor (Contributor) commented Nov 28, 2016

When using GCC the -ffloat-store option can be used with C/C++ to force rounding when a floating-point value is assigned to a variable.

@MichaelTJones (Contributor) commented Nov 28, 2016

@griesemer (Contributor) commented Nov 28, 2016

@MichaelTJones There's also the opposite explicit form: Have a mechanism (intrinsified function call) to specify when to use FMA (and never, otherwise). Go already allows control over when not to use FMA (if available) by forcing an explicit conversion; e.g. float64(x*y) + z (no FMA) vs x*y + z (possibly FMA).

@rsc (Contributor) commented Dec 5, 2016

@mundaym, I think everyone agrees about cases 1, 2, 3, 4, 5, 6.

We are less sure about 7 and 8, which may be the same case for a sufficiently clever compiler.
The argument for "no" on 7 and 8 is that the assignment implies a fixed storage format.
The argument for "yes" on 7 is that otherwise a source-to-source lowering of a program would inhibit certain floating-point optimizations, and also more generally source-to-source translations will have to consider whether they are changing program semantics by combining or splitting expressions. Another argument for "yes" on 7 is that it results in just one way to disable the optimization, instead of two.

I propose that we tentatively assume "yes" on 7 and 8, with the understanding that we can back down from that if a strong real-world example arrives showing that we've made a mistake.

@mundaym (Member, Author) commented Dec 5, 2016

I propose that we tentatively assume "yes" on 7 and 8, with the understanding that we can back down from that if a strong real-world example arrives showing that we've made a mistake.

Thanks, that sounds good to me. I'll prototype it for ppc64{,le} and s390x.

@ianlancetaylor (Contributor) commented Dec 5, 2016

@rsc Can you clarify what "yes" and "no" mean in your comment?

@rsc (Contributor) commented Dec 12, 2016

"Yes" means check-mark above (FMA optimization allowed here), and "no" means X above (FMA optimization not allowed here).

@rsc (Contributor) commented Dec 12, 2016

On hold until prototype arrives. We still need to figure out wording for the spec.

@gopherbot commented Feb 14, 2017

CL https://golang.org/cl/36963 mentions this issue.

@mundaym (Member, Author) commented Feb 16, 2017

Complex multiplication may be implemented using fused multiply-add instructions. There is no obvious way to prevent them being emitted since multiplication is a single operation. I think this is fine, but I thought I'd note it here.

@rsc (Contributor) commented Feb 16, 2017

Thanks. I agree that's probably fine, certainly until it comes up in practice.

gopherbot pushed a commit that referenced this issue Feb 28, 2017
Explicitly block fused multiply-add pattern matching when a cast is used
after the multiplication, for example:

    - (a * b) + c        // can emit fused multiply-add
    - float64(a * b) + c // cannot emit fused multiply-add

float{32,64} and complex{64,128} casts of matching types are now kept
as OCONV operations rather than being replaced with OCONVNOP operations
because they now imply a rounding operation (and therefore aren't a
no-op anymore).

Operations (for example, multiplication) on complex types may utilize
fused multiply-add and -subtract instructions internally. There is no
way to disable this behavior at the moment.

Improves the performance of the floating point implementation of
poly1305:

name         old speed     new speed     delta
64           246MB/s ± 0%  275MB/s ± 0%  +11.48%   (p=0.000 n=10+8)
1K           312MB/s ± 0%  357MB/s ± 0%  +14.41%  (p=0.000 n=10+10)
64Unaligned  246MB/s ± 0%  274MB/s ± 0%  +11.43%  (p=0.000 n=10+10)
1KUnaligned  312MB/s ± 0%  357MB/s ± 0%  +14.39%   (p=0.000 n=10+8)

Updates #17895.

Change-Id: Ia771d275bb9150d1a598f8cc773444663de5ce16
Reviewed-on: https://go-review.googlesource.com/36963
Run-TryBot: Michael Munday <munday@ca.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@gopherbot commented Mar 13, 2017

CL https://golang.org/cl/38095 mentions this issue.

@laboger (Contributor) commented Mar 13, 2017

Improvements on ppc64le for CL 38095:
poly1305:
benchmark                old ns/op  new ns/op  delta
Benchmark64-16           172        151        -12.21%
Benchmark1K-16           1828       1523       -16.68%
Benchmark64Unaligned-16  172        151        -12.21%
Benchmark1KUnaligned-16  1827       1523       -16.64%

math:
benchmark                old ns/op  new ns/op  delta
BenchmarkAcos-16         43.9       39.9       -9.11%
BenchmarkAcosh-16        57.0       45.8       -19.65%
BenchmarkAsin-16         35.8       33.0       -7.82%
BenchmarkAsinh-16        68.6       60.8       -11.37%
BenchmarkAtan-16         19.8       16.2       -18.18%
BenchmarkAtanh-16        65.5       57.5       -12.21%
BenchmarkAtan2-16        45.4       34.2       -24.67%
BenchmarkGamma-16        37.6       26.0       -30.85%
BenchmarkLgamma-16       40.0       28.2       -29.50%
BenchmarkLog1p-16        35.1       29.1       -17.09%
BenchmarkSin-16          22.7       18.4       -18.94%
BenchmarkSincos-16       31.7       23.7       -25.24%
BenchmarkSinh-16         146        131        -10.27%
BenchmarkY0-16           130        107        -17.69%
BenchmarkY1-16           127        107        -15.75%
BenchmarkYn-16           278        235        -15.47%

gopherbot pushed a commit that referenced this issue Mar 20, 2017
A follow on to CL 36963 adding support for ppc64x.

Performance changes (as posted on the issue):

poly1305:
benchmark               old ns/op new ns/op delta
Benchmark64-16          172       151       -12.21%
Benchmark1K-16          1828      1523      -16.68%
Benchmark64Unaligned-16 172       151       -12.21%
Benchmark1KUnaligned-16 1827      1523      -16.64%

math:
BenchmarkAcos-16        43.9      39.9      -9.11%
BenchmarkAcosh-16       57.0      45.8      -19.65%
BenchmarkAsin-16        35.8      33.0      -7.82%
BenchmarkAsinh-16       68.6      60.8      -11.37%
BenchmarkAtan-16        19.8      16.2      -18.18%
BenchmarkAtanh-16       65.5      57.5      -12.21%
BenchmarkAtan2-16       45.4      34.2      -24.67%
BenchmarkGamma-16       37.6      26.0      -30.85%
BenchmarkLgamma-16      40.0      28.2      -29.50%
BenchmarkLog1p-16       35.1      29.1      -17.09%
BenchmarkSin-16         22.7      18.4      -18.94%
BenchmarkSincos-16      31.7      23.7      -25.24%
BenchmarkSinh-16        146       131       -10.27%
BenchmarkY0-16          130       107       -17.69%
BenchmarkY1-16          127       107       -15.75%
BenchmarkYn-16          278       235       -15.47%

Updates #17895.

Change-Id: I1c16199715d20c9c4bd97c4a950bcfa69eb688c1
Reviewed-on: https://go-review.googlesource.com/38095
Reviewed-by: Carlos Eduardo Seo <cseo@linux.vnet.ibm.com>
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
@rsc removed the Proposal-Hold label Apr 10, 2017
@rsc (Contributor) commented Apr 10, 2017

Thanks for the implementation. We experimented with the ppc64 compiler and confirmed that this behaves the way we expected from #17895 (comment).

@griesemer will send a spec CL.

@rsc modified the milestones: Go1.9, Proposal Apr 10, 2017
@gopherbot commented Apr 11, 2017

CL https://golang.org/cl/40391 mentions this issue.

@mdempsky (Member) commented Apr 11, 2017

I looked through the two linked discussions, and I didn't spot any discussion about supporting FMA via the standard library. For example, we could add

// FMA returns a*b+c.
func FMA(a, b, c float64) float64

to package math, and optimize it via compiler intrinsics like Sqrt.

Considering this is the approach C, C++, and Java have already taken, I think we should at least discuss it.

@mdempsky (Member) commented Apr 11, 2017

I'm not a fan of the current language spec proposal because:

  1. It means assignment to a float64 variable and explicit conversion to float64 now have different semantics; intuitively, I'd expect them to always mean the same thing.
  2. It means redundant conversions now can have side effects on a program's correctness. Tools like mdempsky/unconvert can no longer assume floating-point conversions are redundant.

@btracey (Contributor) commented Apr 11, 2017

From what I understand, both of your points are already true in the language. From the spec:

When converting an integer or floating-point number to a floating-point type,
or a complex number to another complex type, the result value is rounded to
the precision specified by the destination type. For instance, the value of a variable
x of type float32 may be stored using additional precision beyond that of an IEEE-754
32-bit number, but float32(x) represents the result of rounding x's value to 32-bit
precision. Similarly, x + 0.1 may use more than 32 bits of precision, but float32(x + 0.1) does not.

This means a statement like y = x + 0.1 + 0.6 can have a different result than y = float32(x + 0.1) + 0.6, since the second one forces intermediate rounding.

@mdempsky (Member) commented Apr 11, 2017

@btracey Hm, I think you're right. I withdraw my objections.

@MichaelTJones (Contributor) commented Apr 11, 2017

@ianlancetaylor (Contributor) commented Apr 11, 2017

Although it's true that C/C++ provides a fma function, it is also true that optimizing C/C++ compilers generate fused multiply-add instructions when reasonable. When using GCC you can control this using the (processor-specific) -mfma and -mno-fma options. If you use the (processor-independent) -fexcess-precision=standard option, then fused multiply-add may be used except when there is a cast to a specific type (like this proposal) or an assignment to a variable of specific type. The (processor-independent) -ffloat-store option is similar, but permits fused multiply-add across a cast but not an assignment to a variable. These options are in turn affected by other options like -std= and -ffast-math.

gopherbot pushed a commit that referenced this issue Apr 17, 2017
Added a paragraph and examples explaining when an implementation
may use fused floating-point operations (such as FMA) and how to
prevent operation fusion.

For #17895.

Change-Id: I64c9559fc1097e597525caca420cfa7032d67014
Reviewed-on: https://go-review.googlesource.com/40391
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Reviewed-by: Rob Pike <r@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>
@TuomLarsen commented May 24, 2017

Would there be a way to "force" the use of FMA, instead of just "allow"? I mean, would there be a way to make sure that a*b+c rounds just once? Even if that meant to use software emulation on platforms with no FMA instructions. IMHO, the benefit of using FMA is precision, not just speed.

@rsc (Contributor) commented May 24, 2017

@TuomLarsen No, we're not providing a way to do that here. Do other languages do that?

@TuomLarsen commented May 24, 2017

@rsc I apologise, I now realise it might be out of scope for this proposal: what I meant was basically support for simple math.FMA function, I imagined that math.FMA would be kind of symmetrical to emitting FMA instruction (mandatory vs. optional). In any case, are there plans for such a function? See e.g. http://en.cppreference.com/w/c/numeric/math/fma

@josharian (Contributor) commented May 24, 2017

@TuomLarsen please file a new issue to propose/discuss. Thanks!

@rsc (Contributor) commented Jun 20, 2017

With ppc64 doing this and the spec written, I think this is done.

@rsc changed the title from "proposal: allow the use of fused multiply-add floating point instructions" to "spec: allow the use of fused multiply-add floating point instructions" Jun 20, 2017
@rsc closed this as completed Jun 20, 2017
gopherbot pushed a commit that referenced this issue Jun 26, 2017
Fixes #20795
Updates #17895
Updates #20587

Change-Id: Iea375f3a6ffe3f51e3ffdae1fb3fd628b6b3316c
Reviewed-on: https://go-review.googlesource.com/46717
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@agnivade (Contributor) commented Aug 8, 2017

Hi everyone,

I was pretty excited to try out the performance benefits of the FMA operation in 1.9. However, it seems that I am still getting the same performance. According to @laboger's comment there are improvements to the math functions on ppc64, and I was under the impression that I would be able to reap similar benefits on amd64 too.

Here is the code:

package stdtest

import (
	"math"
	"testing"
)

func BenchmarkAtan2(b *testing.B) {
	for n := 0; n < b.N; n++ {
		_ = math.Atan2(480.0, 123.0) * 180 / math.Pi
	}
}

Under 1.8.1

go test  -bench=. -benchmem .
BenchmarkAtan2-4   	100000000	        22.9 ns/op	       0 B/op	       0 allocs/op
PASS
ok  	stdtest	2.318s

Under 1.9.rc2

go1.9rc2 test  -bench=. -benchmem .
goos: linux
goarch: amd64
pkg: stdtest
BenchmarkAtan2-4   	50000000	        22.8 ns/op	       0 B/op	       0 allocs/op
PASS

I have verified that my processor supports FMA (cat /proc/cpuinfo | grep fma).

Is this expected? Or am I doing something wrong?

@bradfitz (Contributor) commented Aug 8, 2017

@agnivade, if you run git grep FMADD src/cmd/compile in your $GOROOT, you'll see there's no FMADD support for amd64 yet. Only s390x and ppc64.

@agnivade (Contributor) commented Aug 8, 2017

Thanks @bradfitz! Yes, I was expecting something like that. The tone of this announcement made it seem like it's there for all architectures. However, the commits seemed to show only s390x and ppc64, hence I was a little confused.

Is adding FMADD to amd64 anywhere on the upcoming roadmap? Thanks. (I have some expensive math operations on a hot path which would benefit tremendously from it 😉)

@bradfitz (Contributor) commented Aug 8, 2017

@agnivade, closed bugs isn't where we conventionally have discussions. But I found #8037 for tracking x86 FMA. You can watch that bug.

@agnivade (Contributor) commented Aug 8, 2017

Thanks a lot ! Appreciate the help :)
