Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: inconsistent signaling NaN behavior on mips64le #37455

Closed
randall77 opened this issue Feb 25, 2020 · 17 comments
Closed

cmd/compile: inconsistent signaling NaN behavior on mips64le #37455

randall77 opened this issue Feb 25, 2020 · 17 comments
Milestone

Comments

@randall77
Copy link
Contributor

@randall77 randall77 commented Feb 25, 2020

Compile and run the following C program:

#include <stdio.h>
#include <stdint.h>

double add0(double x) {
  return x+0;
}

int main(int argc, char *argv[]) {
  double snan64;
  *(uint64_t*)&snan64 = 0x7ff0000000000001;
  printf("snan64: %llx\n", *(uint64_t*)&snan64);
  double x = add0(snan64);
  printf("  add0: %llx\n", *(uint64_t*)&x);
}

On the linux-mips64le-mengzhuo builders, it prints

snan64: 7ff0000000000001
  add0: 7ff8000000000001

On the linux-mips64le-rtrk builders, it prints

snan64: 7ff0000000000001
  add0: 7ff0000000000001

Note that on the mengzhuo builder, the quiet bit gets set, whereas on the rtrk builder it does not.

This seems like a strange inconsistency. The assembly code for add0 is the same on both platforms.

It seems strange that the rtrk builder doesn't convert the NaN from signaling to quiet. add0 quiets the NaN for all the other Go platforms we support (amd64, powerpc, etc.). There are also strange inconsistencies with how it handles float32 <-> float64 conversions of signaling NaNs, which is how I first noticed this (#36399).

Is this expected behavor? Is this allowed by the spec? Might it be a bug in the rtrk builder? Is there some floating-point-signal-interrupt-control-word thingy we might need to set?

@mengzhuo @milanknezevic

@randall77 randall77 added this to the Go1.15 milestone Feb 25, 2020
@randall77
Copy link
Contributor Author

@randall77 randall77 commented Feb 25, 2020

Note that this behavior also appears on the other rtrk builders (32/64 bit, and big/little endian).

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Feb 25, 2020

I think it should generate a quiet NaN. The MIPS manual said

An SNaN is never produced as a result value.

Also

SNaN operands cause the Invalid Operation exception for arithmetic operations.
...
The result, when the exception condition occurs without a precise trap, is a quiet NaN.

@bcmills
Copy link
Member

@bcmills bcmills commented Feb 26, 2020

According to http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1011.htm:

Binary operations that have a normal operand and a signaling NaN operand shall trigger the signaling NaN and should produce that signaling NaN made quiet as a result.

but:

Whether an operation that merely returns the value of a numeric operand, changing at most its sign, triggers signaling NaNs is unspecified. Such operations include conversions that do not change precision, the unary + and - operators, and the fabs and copysign functions.

So the behavior on the rtrk builder seems to be a bug, depending on whether you consider +0 to be “an operation that merely returns the value of a numeric operand” or one that has “a normal operand and a signaling NaN operand”.

@bcmills
Copy link
Member

@bcmills bcmills commented Feb 26, 2020

The assembly code for add0 is the same on both platforms.

Could this be due to a difference in the default kernel or libc signal behavior for floating-point exceptions? (What happens if you compile the program with -fsignaling-nans, and/or register an explicit SIGFPE handler?)

@bmkessler
Copy link
Contributor

@bmkessler bmkessler commented Feb 27, 2020

Is it possible the mips rtrk builders may be using a soft-float implementation? https://golang.org/src/runtime/softfloat64.go converts all NaN to signalling NaN as that is the only NaN bit pattern returned for all ops.

const (
	mantbits64 uint = 52
	expbits64  uint = 11
	bias64          = -1<<(expbits64-1) + 1

	nan64 uint64 = (1<<expbits64-1)<<mantbits64 + 1
	inf64 uint64 = (1<<expbits64 - 1) << mantbits64
	neg64 uint64 = 1 << (expbits64 + mantbits64)

	mantbits32 uint = 23
	expbits32  uint = 8
	bias32          = -1<<(expbits32-1) + 1

	nan32 uint32 = (1<<expbits32-1)<<mantbits32 + 1
	inf32 uint32 = (1<<expbits32 - 1) << mantbits32
	neg32 uint32 = 1 << (expbits32 + mantbits32)
)

It seems like nan64 and nan32 should be changed to a quiet NaN, i.e.

	nan64 uint64 = (1<<expbits64-1)<<mantbits64 + 1<<(mantbits64-1) + 1
	nan32 uint32 = (1<<expbits32-1)<<mantbits32 + 1<<(mantbits32-1) + 1

Or just write out the explicit bit pattern for clarity like in the math package https://golang.org/src/math/bits.go?s=671:689#L21

That of course doesn't explain the C results unless they also have the same issue in their soft-float implementation.

@randall77
Copy link
Contributor Author

@randall77 randall77 commented Feb 27, 2020

You're right that the soft float implementation should probably be fixed. But as you noted, that isn't the rtrk problem.

@randall77
Copy link
Contributor Author

@randall77 randall77 commented Feb 27, 2020

I think the softfloat in runtime/softfloat64.go is only used by the arm/GOARM=5 port.
Coincidentally, the arm5 builders are down, so we didn't actually see those break when the fix for #36399 was submitted.

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Feb 27, 2020

The softfloat code is used on GOARM=5, or GOMIPS=softfloat, or compiler's -d softfloat flag. Yeah, the arm5 builder is down, and I don't think the MIPS builders are running in that mode.

@randall77
Copy link
Contributor Author

@randall77 randall77 commented Feb 28, 2020

So the behavior on the rtrk builder seems to be a bug, depending on whether you consider +0 to be “an operation that merely returns the value of a numeric operand” or one that has “a normal operand and a signaling NaN operand”.

+0 is a red herring. The same bug occurs with +7.
(+0 isn't a no-op. It has to convert -0 to +0.)

Who is the owner of the rtrk builders? Can one of those people take a look? (Maybe it is @milanknezevic ?)

@gopherbot
Copy link

@gopherbot gopherbot commented Feb 28, 2020

Change https://golang.org/cl/221433 mentions this issue: runtime: use quiet NaNs in softfloat implementation

@draganmladjenovic
Copy link

@draganmladjenovic draganmladjenovic commented Mar 1, 2020

@randall77
I believe that your example uses what under legacy [1] Mips NaN encoding is considered a QNaN. So the behavior of rtrk builder seems correct. The mengzhuo builder hardware might be implementing NaN2008 behavior. In think you can check out that [2] with the following snippet:

#include <stdio.h>

int main (void)
{
        printf("NaN mode: %s\n",
                   (__builtin_mips_get_fcsr() & (1 << 18)) ? "2008" : "legacy");
        return 0;
}

If I recall correctly Linux kernels prior 4.5 are unaware of NaN2008 encoding and will happily run
legacy binaries, but it will not put the FPU under legacy mode.

[1] https://sourceware.org/binutils/docs/as/MIPS-NaN-Encodings.html
[2] https://github.com/bminor/glibc/blob/master/sysdeps/mips/fpu_control.h

@randall77
Copy link
Contributor Author

@randall77 randall77 commented Mar 1, 2020

Interesting. So we just need to set the EF_MIPS_NAN2008 bit in the elf header?

I can't seem to make that work. The gcc install can't seem to handle -mnam=2008:

$ gomote run user-khr-linux-mips64-rtrk-0 /usr/bin/gcc -mnan=2008 -o foo go/src/issue37455.c
In file included from /usr/include/features.h:448,
                 from /usr/include/mips-linux-gnu/bits/libc-header-start.h:33,
                 from /usr/include/stdio.h:27,
                 from go/src/issue37455.c:1:
/usr/include/mips-linux-gnu/gnu/stubs.h:17:11: fatal error: gnu/stubs-o32_hard_2008.h: No such file or directory
 # include <gnu/stubs-o32_hard_2008.h>
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.

And when I change the Go toolchain to set that bit in executables with this patch:

diff --git a/src/cmd/link/internal/ld/elf.go b/src/cmd/link/internal/ld/elf.go
index b7221f04b3..bc10300910 100644
--- a/src/cmd/link/internal/ld/elf.go
+++ b/src/cmd/link/internal/ld/elf.go
@@ -540,6 +540,13 @@ func Elfinit(ctxt *Link) {
                ehdr.phentsize = ELF32PHDRSIZE /* Must be ELF32PHDRSIZE */
                ehdr.shentsize = ELF32SHDRSIZE /* Must be ELF32SHDRSIZE */
        }
+       switch ctxt.Arch.Family {
+       case sys.MIPS, sys.MIPS64:
+               // Set the NaN2008 bit. This ensures that the quiet/signaling nan
+               // bit has the same behavior on MIPS as other architectures.
+               // See issue 37455.
+               ehdr.flags |= 0x400
+       }
 }

I get the following error:

$ gomote run user-khr-linux-mips64-rtrk-0 go/src/make.bash
Building Go cmd/dist using /usr/local/go-bootstrap. (devel +acbed0372e Mon Oct 28 11:22:20 2019 +0000 linux/mips64)
Building Go toolchain1 using /usr/local/go-bootstrap.
Building Go bootstrap cmd/go (go_bootstrap) using Go toolchain1.
Building Go toolchain2 using go_bootstrap and Go toolchain1.
go tool dist: FAILED: /tmp/gobuilder-mips64/go/pkg/tool/linux_mips64/go_bootstrap install -gcflags=all= -ldflags=all= -i cmd/asm cmd/cgo cmd/compile cmd/link: fork/exec /tmp/gobuilder-mips64/go/pkg/tool/linux_mips64/go_bootstrap: exec format error
Error running run: exit status 2
@randall77
Copy link
Contributor Author

@randall77 randall77 commented Mar 1, 2020

I also tried this patch to set the control register bit on startup. It compiled and ran, but somehow it didn't fix the underlying problem.

diff --git a/src/runtime/asm_mips64x.s b/src/runtime/asm_mips64x.s
index 7330f40e85..03e869ffd7 100644
--- a/src/runtime/asm_mips64x.s
+++ b/src/runtime/asm_mips64x.s
@@ -18,6 +18,13 @@ TEXT runtime·rt0_go(SB),NOSPLIT,$0
        MOVW    R4, 8(R29) // argc
        MOVV    R5, 16(R29) // argv
 
+       // Set the NaN2008 bit. This ensures that the quiet/signaling nan
+       // bit has the same behavior on MIPS as other architectures.
+       // See issue 37455.
+       MOVV    FCR31, R1
+       OR      $(1<<18), R1
+       MOVV    R1, FCR31
+

(I also tried all the other FCRs. Why are there 32 control registers?)

@draganmladjenovic
Copy link

@draganmladjenovic draganmladjenovic commented Mar 1, 2020

@randall77 I believe that other FCSR registers are RW or RO views of FCSR31. You are probably hitting [1]. You technically have to assume that "legacy" behavior is the correct one and not trying to fix it.

You might be able to force nan2008 back to legacy. FPU hardware may or may not support this. Maybe the kernel on mengzhuo builder is older than 4.5 or is running in "lax mode" (can't remember kernel command line argument used for that edit: [2]) that doesn't enforce nan2008/legacy consistency, so you get nan2008 fpu behavior while running "legacy" binary.

As for gcc -mnan2008 requires a separate sysroot and rebuild of whole distribution.

[1] https://github.com/torvalds/linux/blob/2f4c53349961c8ca480193e47da4d44fdb8335a8/arch/mips/kernel/elf.c#L161
[2] https://github.com/torvalds/linux/blob/c60c04021353c55b133519804734415f647f08bd/Documentation/admin-guide/kernel-parameters.txt#L1578

@FlyGoat
Copy link

@FlyGoat FlyGoat commented Mar 2, 2020

FYI:
Mengzhuo's Loongson-3A4000 is running with kernel cmdline ieee754=relaxed, which means kernel won't reject the case if hardware only implemented NAN2008 mode while binary is compiled for legacy NAN case.

Most new MIPS system (e.g. Loongson-3A4000, Ingeinc X1830) implemented NAN2008 mode only while older hardware (e.g. Loongson-3A3000, Ingeinc JZ4780, MediaTek MT7620) are legacy only. In theory, some systems should have configurable NAN, but I know none of them. Due to interlink issue caused by NAN2008 binary, most of the users are using NAN2008 hardware with legacy distro and force ieee754=relaxed.

In my point of view, we'd better keep Golang's FCSR NAN coherent with system-wide setting and treat mengzhuo's builder as an exception. Otherwise, it will become an endless mess.

@mengzhuo
Copy link
Contributor

@mengzhuo mengzhuo commented Mar 2, 2020

@draganmladjenovic

This Loongson box runs on Linux 4.19.
The kernel cmdline

rd_start=0xffffffff89ec0000 rd_size=0x1720044  ieee754=relaxed e1000e.InterruptThrottleRate=4,4,4,4
gopherbot pushed a commit that referenced this issue Mar 2, 2020
Update #37455

Change-Id: Ieac0823aa398d73187c009037be15ba34c84f3d9
Reviewed-on: https://go-review.googlesource.com/c/go/+/221433
Run-TryBot: Emmanuel Odeke <emm.odeke@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
@randall77
Copy link
Contributor Author

@randall77 randall77 commented Mar 3, 2020

Oh well. At least we now understand what is going on.
Maybe someday mips will get on board with IEEE 754-2008. We can revisit if/when that happens.

@randall77 randall77 closed this Mar 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
9 participants
You can’t perform that action at this time.