Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: testing/quick misbehaves on Nexus 9 linux/arm64 #19809

Closed
rsc opened this issue Mar 31, 2017 · 3 comments

Comments

Projects
None yet
3 participants
@rsc
Copy link
Contributor

commented Mar 31, 2017

testing/quick's int64 chooser is written to return values in the range [-2⁶²,2⁶²).
That's a mistake; fixing that is #19808.

But the code should run as written, and yet on @dr2chase's Nexus 9
running a linux/arm64 toolchain built from a android/arm64 toolchain
cross-compiled from elsewhere, empirically it generates values outside
that range. (That helped find #19807.)

Using a linux/arm64 toolchain built from a linux/arm64 toolchain
cross-compiled from elsewhere on an Odroid works correctly.

Before fixing testing/quick to generate the full range, we should figure
out why the current code generates values outside the narrower range
in this configuration.

math/rand's (*Rand).Int63 says:

return r.src.Int63() 

and (*rngSource).Int63 says:

return int64(rng.Uint64() & _MASK)

where:

const (
	_MAX  = 1 << 63
	_MASK = _MAX - 1
)

and then testing/quick's randInt64 function does:

return rand.Int63() - 1<<62

So either the & _MASK or the - 1<<62 is not doing its job.

It could be that the Nexus 9 ARM64 hardware is buggy.
It could be that the android/arm64 toolchain being used for
bootstrap on the Nexus 9 is buggy, causing the natively compiled
linux/arm64 toolchain to be buggy, causing the testing/quick
code to generate unexpected values. We don't know.

@rsc rsc added this to the Go1.9 milestone Mar 31, 2017

@gopherbot

This comment has been minimized.

Copy link

commented Mar 31, 2017

CL https://golang.org/cl/39152 mentions this issue.

@rsc

This comment has been minimized.

Copy link
Contributor Author

commented Mar 31, 2017

The following C program misbehaves on the problematic system.

#include <stdio.h>

typedef unsigned long long uvlong;

uvlong f() { return ~0ull; }
uvlong g() { return (f() << 1) >> 1; }

int main() {
	for (int i = 0;; i++) {
		if ((long long)g() < 0) {
			printf("%d\n", i);
		}
	}
	return 0;
}

After 20k-50k iterations the loop starts printing i on every iteration.

Not Go's fault. We'll see about reporting this elsewhere.

@rsc rsc closed this Mar 31, 2017

gopherbot pushed a commit that referenced this issue Apr 3, 2017

testing/quick: generate all possible int64, uint64 values
When generating a random int8, uint8, int16, uint16, int32, uint32,
quick.Value chooses among all possible values.

But when generating a random int64 or uint64, it only chooses
values in the range [-2⁶², 2⁶²) (even for uint64).
It should, like for all the other integers, use the full range.

If it had, this would have caught #19807 earlier.
Instead it let us discover the presence of #19809.

While we are here, also make the default source of
randomness not completely deterministic.

Fixes #19808.

Change-Id: I070f852531c92b3670bd76523326c9132bfc9416
Reviewed-on: https://go-review.googlesource.com/39152
Run-TryBot: Russ Cox <rsc@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rob Pike <r@golang.org>
@gopherbot

This comment has been minimized.

Copy link

commented Apr 3, 2017

CL https://golang.org/cl/39310 mentions this issue.

gopherbot pushed a commit that referenced this issue Apr 3, 2017

cmd/compile: rewrite upper-bit-clear idiom to use shift-rotate
Old buggy hardware incorrectly executes the shift-left-K
then shift-right-K idiom for clearing K leftmost bits.
Use a right rotate instead of shift to avoid triggering the
bug.

Fixes #19809.

Change-Id: I6dc646b183c29e9d01aef944729f34388dcc687d
Reviewed-on: https://go-review.googlesource.com/39310
Run-TryBot: David Chase <drchase@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>

lparth added a commit to lparth/go that referenced this issue Apr 13, 2017

testing/quick: generate all possible int64, uint64 values
When generating a random int8, uint8, int16, uint16, int32, uint32,
quick.Value chooses among all possible values.

But when generating a random int64 or uint64, it only chooses
values in the range [-2⁶², 2⁶²) (even for uint64).
It should, like for all the other integers, use the full range.

If it had, this would have caught golang#19807 earlier.
Instead it let us discover the presence of golang#19809.

While we are here, also make the default source of
randomness not completely deterministic.

Fixes golang#19808.

Change-Id: I070f852531c92b3670bd76523326c9132bfc9416
Reviewed-on: https://go-review.googlesource.com/39152
Run-TryBot: Russ Cox <rsc@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rob Pike <r@golang.org>

lparth added a commit to lparth/go that referenced this issue Apr 13, 2017

cmd/compile: rewrite upper-bit-clear idiom to use shift-rotate
Old buggy hardware incorrectly executes the shift-left-K
then shift-right-K idiom for clearing K leftmost bits.
Use a right rotate instead of shift to avoid triggering the
bug.

Fixes golang#19809.

Change-Id: I6dc646b183c29e9d01aef944729f34388dcc687d
Reviewed-on: https://go-review.googlesource.com/39310
Run-TryBot: David Chase <drchase@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Russ Cox <rsc@golang.org>

lewurm added a commit to lewurm/mono that referenced this issue May 16, 2017

[android] add hardware specific workaround for Nexus9 in armv7 mode
we got a couple of bug reports, all with the same failure:

https://bugzilla.xamarin.com/show_bug.cgi?id=44907
https://bugzilla.xamarin.com/show_bug.cgi?id=46482
https://bugzilla.xamarin.com/show_bug.cgi?id=51791

The bugs are private, therefore here what I wrote:

``
Thank you Matthias, this was very helpful.

The crash happens on this assignment:
https://github.com/mono/mono/blob/de1865dad5c0350f391fedcaa08f02f610530d3f/mono/mini/mini-generic-sharing.c#L418
Here the according disassembly:
https://gist.github.com/lewurm/b1094749027c9e5ea19fdc4fac7905a7

The crash happens in the last loop iteration (`r9=i`, `r7=slot`).  Looking
at the machine code it just _cannot_ happen, which is confirmed by the C
code as well: `*oti` successfully happens in the if check, but after
returning from `alloc_oti()`, `oti` doesn't contain a valid address anymore.
I suspect some weird hardware issue that fails to restore all registers
properly from the stack.
```
[...]

```
Thanks again Matthias.  Unfortunately, I'm out of ideas, and I can't
blame anything but the hardware. The situation we see is too weird. We
segfault at offset `0xdda60: str sl, [r4]`, but really we should
already segfault at offset `0xdda2c: ldrge   sl, [r4, mono#8]!"`.  So I
suspect two things why this could happen:

(1) The instruction at `dda2c` fails to do the post-increment correctly for *whatever* reason.

(2) Something along the execution path corrupts the stackslot, where
`r4` is saved, in such a way that it *exactly* masks it with `0xf`.
Everything else on the stack looks fine, so this is sort of very
unlikely to be honest.

This only happens on a very specific device: The Nexus 9 is the only
device that was ever shipped with the Tegra K1 T132.  I suspect an issue
in the binary translation layer of `armv7` instruction set to the internal
micro-ops of the CPU.  I tried to stress test the instruction in
question (see https://github.com/lewurm/ldrinsntest), however I was not
able to trigger a crash.  So either, I'm missing some context in order
to trigger the bug or I'm on a completely wrong track.

That said, even if we could proof that it is indeed a hardware issue,
the workaround is also non-trivial (it would then either need a fix in
gcc or require a microcode update by the chip vendor).
```

And then we saw:
golang/go#19809 (comment)

Another hint that this device is buggy.

lewurm added a commit to lewurm/mono that referenced this issue May 16, 2017

[android] add hardware specific workaround for Nexus9 in armv7 mode
we got a couple of bug reports, all with the same failure:

https://bugzilla.xamarin.com/show_bug.cgi?id=44907
https://bugzilla.xamarin.com/show_bug.cgi?id=46482
https://bugzilla.xamarin.com/show_bug.cgi?id=51791

The bugs are private, therefore here what I wrote:

``
Thank you Matthias, this was very helpful.

The crash happens on this assignment:
https://github.com/mono/mono/blob/de1865dad5c0350f391fedcaa08f02f610530d3f/mono/mini/mini-generic-sharing.c#L418
Here the according disassembly:
https://gist.github.com/lewurm/b1094749027c9e5ea19fdc4fac7905a7

The crash happens in the last loop iteration (`r9=i`, `r7=slot`).  Looking
at the machine code it just _cannot_ happen, which is confirmed by the C
code as well: `*oti` successfully happens in the if check, but after
returning from `alloc_oti()`, `oti` doesn't contain a valid address anymore.
I suspect some weird hardware issue that fails to restore all registers
properly from the stack.
```
[...]

```
Thanks again Matthias.  Unfortunately, I'm out of ideas, and I can't
blame anything but the hardware. The situation we see is too weird. We
segfault at offset `0xdda60: str sl, [r4]`, but really we should
already segfault at offset `0xdda2c: ldrge   sl, [r4, mono#8]!"`.  So I
suspect two things why this could happen:

(1) The instruction at `dda2c` fails to do the post-increment correctly
    for *whatever* reason.

(2) Something along the execution path corrupts the stackslot, where
    `r4` is saved, in such a way that it *exactly* masks it with `0xf`.
    Everything else on the stack looks fine, so this is sort of very
    unlikely to be honest.

This only happens on a very specific device: The Nexus 9 is the only
device that was ever shipped with the Tegra K1 T132.  I suspect an issue
in the binary translation layer of `armv7` instruction set to the internal
micro-ops of the CPU.  I tried to stress test the instruction in
question (see https://github.com/lewurm/ldrinsntest), however I was not
able to trigger a crash.  So either, I'm missing some context in order
to trigger the bug or I'm on a completely wrong track.

That said, even if we could proof that it is indeed a hardware issue,
the workaround is also non-trivial (it would then either need a fix in
gcc or require a microcode update by the chip vendor).
```

And then we saw:
golang/go#19809 (comment)

Another hint that this device is buggy.

lewurm added a commit to lewurm/mono that referenced this issue May 16, 2017

[android] add hardware specific workaround for Nexus9 in armv7 mode
we got a couple of bug reports, all with the same failure:

https://bugzilla.xamarin.com/show_bug.cgi?id=44907
https://bugzilla.xamarin.com/show_bug.cgi?id=46482
https://bugzilla.xamarin.com/show_bug.cgi?id=51791

The bugs are private, therefore here what I wrote:

```
Thank you Matthias, this was very helpful.

>>> 09-30 12:24:50.347: A/DEBUG(6211): pid: 6191, tid: 6209, name: Thread-4  >>> com.distech.x50.ui.droid <<<
>>> 09-30 12:24:50.347: A/DEBUG(6211): signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x8
>>> 09-30 12:24:50.347: A/DEBUG(6211):     r0 d6b58728  r1 00004001  r2 00000000  r3 d6b58738
>>> 09-30 12:24:50.347: A/DEBUG(6211):     r4 00000008  r5 ea97427c  r6 00001f3c  r7 00000015
>>> 09-30 12:24:50.347: A/DEBUG(6211):     r8 00001f38  r9 00000015  sl d6b58728  fp d67fed90
>>> 09-30 12:24:50.347: A/DEBUG(6211):     ip ea9742c0  sp d67fed60  lr ea7a4504  pc ea716a58  cpsr 800e0010

The crash happens on this assignment:
https://github.com/mono/mono/blob/de1865dad5c0350f391fedcaa08f02f610530d3f/mono/mini/mini-generic-sharing.c#L418
Here the according disassembly:
https://gist.github.com/lewurm/b1094749027c9e5ea19fdc4fac7905a7

The crash happens in the last loop iteration (`r9=i`, `r7=slot`).  Looking
at the machine code it just _cannot_ happen, which is confirmed by the C
code as well: `*oti` successfully happens in the if check, but after
returning from `alloc_oti()`, `oti` doesn't contain a valid address anymore.
I suspect some weird hardware issue that fails to restore all registers
properly from the stack.
```
[...]

```
Thanks again Matthias.  Unfortunately, I'm out of ideas, and I can't
blame anything but the hardware. The situation we see is too weird. We
segfault at offset `0xdda60: str sl, [r4]`, but really we should
already segfault at offset `0xdda2c: ldrge   sl, [r4, mono#8]!"`.  So I
suspect two things why this could happen:

(1) The instruction at `dda2c` fails to do the post-increment correctly
    for *whatever* reason.

(2) Something along the execution path corrupts the stackslot, where
    `r4` is saved, in such a way that it *exactly* masks it with `0xf`.
    Everything else on the stack looks fine, so this is sort of very
    unlikely to be honest.

This only happens on a very specific device: The Nexus 9 is the only
device that was ever shipped with the Tegra K1 T132.  I suspect an issue
in the binary translation layer of `armv7` instruction set to the internal
micro-ops of the CPU.  I tried to stress test the instruction in
question (see https://github.com/lewurm/ldrinsntest), however I was not
able to trigger a crash.  So either, I'm missing some context in order
to trigger the bug or I'm on a completely wrong track.

That said, even if we could proof that it is indeed a hardware issue,
the workaround is also non-trivial (it would then either need a fix in
gcc or require a microcode update by the chip vendor).
```

And then we saw:
golang/go#19809 (comment)

Another hint that this device is buggy.

lewurm added a commit to mono/mono that referenced this issue May 29, 2017

[android] add hardware specific workaround for Nexus9 in armv7 mode
we got a couple of bug reports, all with the same failure:

https://bugzilla.xamarin.com/show_bug.cgi?id=44907
https://bugzilla.xamarin.com/show_bug.cgi?id=46482
https://bugzilla.xamarin.com/show_bug.cgi?id=51791

The bugs are private, therefore here what I wrote:

```
Thank you Matthias, this was very helpful.

>>> 09-30 12:24:50.347: A/DEBUG(6211): pid: 6191, tid: 6209, name: Thread-4  >>> com.distech.x50.ui.droid <<<
>>> 09-30 12:24:50.347: A/DEBUG(6211): signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x8
>>> 09-30 12:24:50.347: A/DEBUG(6211):     r0 d6b58728  r1 00004001  r2 00000000  r3 d6b58738
>>> 09-30 12:24:50.347: A/DEBUG(6211):     r4 00000008  r5 ea97427c  r6 00001f3c  r7 00000015
>>> 09-30 12:24:50.347: A/DEBUG(6211):     r8 00001f38  r9 00000015  sl d6b58728  fp d67fed90
>>> 09-30 12:24:50.347: A/DEBUG(6211):     ip ea9742c0  sp d67fed60  lr ea7a4504  pc ea716a58  cpsr 800e0010

The crash happens on this assignment:
https://github.com/mono/mono/blob/de1865dad5c0350f391fedcaa08f02f610530d3f/mono/mini/mini-generic-sharing.c#L418
Here the according disassembly:
https://gist.github.com/lewurm/b1094749027c9e5ea19fdc4fac7905a7

The crash happens in the last loop iteration (`r9=i`, `r7=slot`).  Looking
at the machine code it just _cannot_ happen, which is confirmed by the C
code as well: `*oti` successfully happens in the if check, but after
returning from `alloc_oti()`, `oti` doesn't contain a valid address anymore.
I suspect some weird hardware issue that fails to restore all registers
properly from the stack.
```
[...]

```
Thanks again Matthias.  Unfortunately, I'm out of ideas, and I can't
blame anything but the hardware. The situation we see is too weird. We
segfault at offset `0xdda60: str sl, [r4]`, but really we should
already segfault at offset `0xdda2c: ldrge   sl, [r4, #8]!"`.  So I
suspect two things why this could happen:

(1) The instruction at `dda2c` fails to do the post-increment correctly
    for *whatever* reason.

(2) Something along the execution path corrupts the stackslot, where
    `r4` is saved, in such a way that it *exactly* masks it with `0xf`.
    Everything else on the stack looks fine, so this is sort of very
    unlikely to be honest.

This only happens on a very specific device: The Nexus 9 is the only
device that was ever shipped with the Tegra K1 T132.  I suspect an issue
in the binary translation layer of `armv7` instruction set to the internal
micro-ops of the CPU.  I tried to stress test the instruction in
question (see https://github.com/lewurm/ldrinsntest), however I was not
able to trigger a crash.  So either, I'm missing some context in order
to trigger the bug or I'm on a completely wrong track.

That said, even if we could proof that it is indeed a hardware issue,
the workaround is also non-trivial (it would then either need a fix in
gcc or require a microcode update by the chip vendor).
```

And then we saw:
golang/go#19809 (comment)

Another hint that this device is buggy.

jonpryor added a commit to jonpryor/mono that referenced this issue Jun 1, 2017

[android] add hardware specific workaround for Nexus9 in armv7 mode
we got a couple of bug reports, all with the same failure:

https://bugzilla.xamarin.com/show_bug.cgi?id=44907
https://bugzilla.xamarin.com/show_bug.cgi?id=46482
https://bugzilla.xamarin.com/show_bug.cgi?id=51791

The bugs are private, therefore here what I wrote:

```
Thank you Matthias, this was very helpful.

>>> 09-30 12:24:50.347: A/DEBUG(6211): pid: 6191, tid: 6209, name: Thread-4  >>> com.distech.x50.ui.droid <<<
>>> 09-30 12:24:50.347: A/DEBUG(6211): signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x8
>>> 09-30 12:24:50.347: A/DEBUG(6211):     r0 d6b58728  r1 00004001  r2 00000000  r3 d6b58738
>>> 09-30 12:24:50.347: A/DEBUG(6211):     r4 00000008  r5 ea97427c  r6 00001f3c  r7 00000015
>>> 09-30 12:24:50.347: A/DEBUG(6211):     r8 00001f38  r9 00000015  sl d6b58728  fp d67fed90
>>> 09-30 12:24:50.347: A/DEBUG(6211):     ip ea9742c0  sp d67fed60  lr ea7a4504  pc ea716a58  cpsr 800e0010

The crash happens on this assignment:
https://github.com/mono/mono/blob/de1865dad5c0350f391fedcaa08f02f610530d3f/mono/mini/mini-generic-sharing.c#L418
Here the according disassembly:
https://gist.github.com/lewurm/b1094749027c9e5ea19fdc4fac7905a7

The crash happens in the last loop iteration (`r9=i`, `r7=slot`).  Looking
at the machine code it just _cannot_ happen, which is confirmed by the C
code as well: `*oti` successfully happens in the if check, but after
returning from `alloc_oti()`, `oti` doesn't contain a valid address anymore.
I suspect some weird hardware issue that fails to restore all registers
properly from the stack.
```
[...]

```
Thanks again Matthias.  Unfortunately, I'm out of ideas, and I can't
blame anything but the hardware. The situation we see is too weird. We
segfault at offset `0xdda60: str sl, [r4]`, but really we should
already segfault at offset `0xdda2c: ldrge   sl, [r4, mono#8]!"`.  So I
suspect two things why this could happen:

(1) The instruction at `dda2c` fails to do the post-increment correctly
    for *whatever* reason.

(2) Something along the execution path corrupts the stackslot, where
    `r4` is saved, in such a way that it *exactly* masks it with `0xf`.
    Everything else on the stack looks fine, so this is sort of very
    unlikely to be honest.

This only happens on a very specific device: The Nexus 9 is the only
device that was ever shipped with the Tegra K1 T132.  I suspect an issue
in the binary translation layer of `armv7` instruction set to the internal
micro-ops of the CPU.  I tried to stress test the instruction in
question (see https://github.com/lewurm/ldrinsntest), however I was not
able to trigger a crash.  So either, I'm missing some context in order
to trigger the bug or I'm on a completely wrong track.

That said, even if we could proof that it is indeed a hardware issue,
the workaround is also non-trivial (it would then either need a fix in
gcc or require a microcode update by the chip vendor).
```

And then we saw:
golang/go#19809 (comment)

Another hint that this device is buggy.

@golang golang locked and limited conversation to collaborators Apr 22, 2018

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
You can’t perform that action at this time.