Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: it is not possible to prevent FMA with complex values #36971

Open
kortschak opened this issue Feb 1, 2020 · 15 comments
Open

cmd/compile: it is not possible to prevent FMA with complex values #36971

kortschak opened this issue Feb 1, 2020 · 15 comments

Comments

@kortschak
Copy link
Contributor

@kortschak kortschak commented Feb 1, 2020

What version of Go are you using (go version)?

$ go version
go version go1.13.6 linux/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE="on"
GOARCH="arm64"
GOBIN="/home/user/bin"
GOCACHE="/home/user/.cache/go-build"
GOENV="/home/user/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/user"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/user/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/user/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="0"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build515689865=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Examine the assembly generated for the code at https://play.golang.org/p/JuTC-BPAIJN with GOARCH=arm64

What did you expect to see?

No FMA/FMS instructions emitted.

What did you see instead?

	0x00ac 00172 (/home/user/c.go:19)	PCDATA	ZR, $5
	0x00ac 00172 (/home/user/c.go:19)	FMOVD	8(R8), F4
	0x00b0 00176 (/home/user/c.go:19)	FMSUBD	F1, F3, F4, F3
	0x00b4 00180 (/home/user/c.go:19)	SCVTFD	R2, F5
	0x00b8 00184 (/home/user/c.go:19)	FADDD	F5, F3, F3
	0x00bc 00188 (/home/user/c.go:19)	FMOVD	F3, (R0)(R6)
	0x00c0 00192 (/home/user/c.go:19)	FMULD	F4, F0, F0
	0x00c4 00196 (/home/user/c.go:19)	FMADDD	F1, F0, F2, F0
	0x00c8 00200 (/home/user/c.go:19)	FMOVD	ZR, F1
	0x00cc 00204 (/home/user/c.go:19)	FADDD	F0, F1, F0

No amount of wrapping the operands in complex128 prevents this AFAICS.

@agnivade

This comment has been minimized.

Copy link
Contributor

@agnivade agnivade commented Feb 2, 2020

Does setting "GODEBUG=cpu.fma=off" help ?

@kortschak

This comment has been minimized.

Copy link
Contributor Author

@kortschak kortschak commented Feb 2, 2020

The output remains identical with that env var setting.

@agnivade

This comment has been minimized.

Copy link
Contributor

@agnivade agnivade commented Feb 2, 2020

I think that's a runtime setting. Maybe @martisch can help.

@randall77

This comment has been minimized.

Copy link
Contributor

@randall77 randall77 commented Feb 2, 2020

I think you're right that there's no way to turn this off for arm64.
@agnivade, the GODEBUG setting should only be for x86 (and x86 only uses fma instructions for math.FMA).

There's nothing in the spec that specifies how complex arithmetic is calculated, so I don't think we're breaking anything.

I'm curious as to what is breaking as a result of this. It's not like you can control the rounding in a complex multiply regardless (there are 3 roundings in a non-fma complex multiply, and just 2 roundings in a fma-enabled complex multiply).

@smasher164

This comment has been minimized.

Copy link
Member

@smasher164 smasher164 commented Feb 2, 2020

@agnivade cpu.fma is for runtime feature detection that is necessary for math.FMA. In this issue, the explicit cast described in #17895 to prevent the optimizer from automatically folding x*y+z into fma(x,yz), does not work for the complex128 expression complex128(ev[k]*w[k]) + complex(float64(i), 0).

Edit: accidentally restated @randall77's comment

@kortschak

This comment has been minimized.

Copy link
Contributor Author

@kortschak kortschak commented Feb 2, 2020

I'm in the process of attempting to get gonum's lapack implementation working on arm64. So far, the failures have been due to FMA/FMS. I have gone through the code to attempt to remove the autogeneration of FMA and FMS to get to a point where I can debug the failures. I'm instead getting to the point where I am not going to bother trying to support arm64.

@agnivade

This comment has been minimized.

Copy link
Contributor

@agnivade agnivade commented Feb 2, 2020

I see. Thanks for the clarification Keith and Akhil.

@josharian

This comment has been minimized.

Copy link
Contributor

@josharian josharian commented Feb 2, 2020

I don't think we're breaking anything.

We might be breaking golden tests that attempt to get the same output (perhaps within some tolerance) across different architectures.

I have gone through the code to attempt to remove the autogeneration of FMA and FMS to get to a point where I can debug the failures.

It'd be pretty straightforward to add a compiler command line flag to disable FMA. And to add a flag that logs where FMA is inserted. This would be in keeping with many other compiler flags. Would that help?

(I still think it is also worth having a conversation about whether using explicit rounding should prevent FMA on a case-by-case basis with complex values.)

@kortschak

This comment has been minimized.

Copy link
Contributor Author

@kortschak kortschak commented Feb 2, 2020

We might be breaking golden tests that attempt to get the same output (perhaps within some tolerance) across different architectures.

I suspect that it's worse than that. We already work around this kind of thing by providing sets of acceptable golden values and arch-dependent golden values, but we are working on problems where the intrinsic behaviour of routines as a whole appear to be broken by automatic FMA/FMS emission (I can't guarantee that this is true, because I can't prevent the compiler from emitting all FMA/FMS instructions - hence this issue). The case that brought this up is Eigen decomposition, which is an inherently numerically unstable operation in some cases, I think that the differential precision by fused v non-fused operations is causing amplification of the instability and so the complete (i.e. not just a matter of increasing tolerance) failure on arm64.

@kortschak

This comment has been minimized.

Copy link
Contributor Author

@kortschak kortschak commented Feb 2, 2020

With help from @josharian I've temporarily made these changes and I think my hypothesis is incorrect.

I do still think though that users should be able to prevent fused operations with complex values.

@josharian

This comment has been minimized.

Copy link
Contributor

@josharian josharian commented Feb 2, 2020

This case was noted in passing at #17895 (comment) and the following comment.

@randall77

This comment has been minimized.

Copy link
Contributor

@randall77 randall77 commented Feb 3, 2020

Try computing x*y (as complex128s) using:

func cmul(x, y complex128) complex128 {
	return complex(float64(real(x)*real(y))-float64(imag(x)*imag(y)), float64(real(x)*imag(y))+float64(imag(x)*real(y)))
}

That should inhibit fma computation of floating-point multiply.

@kortschak

This comment has been minimized.

Copy link
Contributor Author

@kortschak kortschak commented Feb 3, 2020

Yes, I'm aware of that. At that point I should just use struct{ re, im float64 }.

@martisch

This comment has been minimized.

Copy link
Member

@martisch martisch commented Feb 4, 2020

I think that's a runtime setting. Maybe @martisch can help.

As @randall77 pointed out FMA setting doesnt exist for arm64 as we assume its always present:

options = []option{

There is no condition for arm64 in the compiler currently to not generate FMA or make it conditional on the GODEBUG setting at runtime:

sys.ARM64, sys.PPC64, sys.S390X)

@kortschak

This comment has been minimized.

Copy link
Contributor Author

@kortschak kortschak commented Feb 22, 2020

The original issues that prompted this have been resolved, in part by changing the compiler to not emit fused operation instructions. The original issues covered the range of differences in precision causing changes in output precision, differences in precision changing gross behaviour and an actual real bug in our code (~50 issues were rolled up in the original problem).

Because of the complexity of the issues we had (and the number of sites where fused operations can be constructed), it would have been very helpful to be able to tell the compiler to not emit fused operation instructions in order to exclude that as a cause. I'll note that discordance between precision at different sites has historically been a cause of catastrophic failures (I hope no-one is using Go for missile guidance).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
6 participants
You can’t perform that action at this time.