Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: test failures introduced on ppc64le with recent change to improve fastrand #22047

Open
laboger opened this issue Sep 26, 2017 · 18 comments
Milestone

Comments

@laboger
Copy link
Contributor

@laboger laboger commented Sep 26, 2017

What version of Go are you using (go version)?

tip

Does this issue reproduce with the latest release?

intermittently, only seen on the golang build page so far

What operating system and processor architecture are you using (go env)?

ppc64le

What did you do?

When the fastrand change was merged e7e4a4f runtime failures started to happen intermittently for ppc64le on the build page. Another change was merged after that resulting in consistent failures on ppc64le that hid this one for a while, but that one has now been fixed.

What did you expect to see?

ok for the ppc64le build on the build page

What did you see instead?

fail for the ppc64le build on the build page intermittently

Here is an example of when it fails https://build.golang.org/log/8552765394dde8879c15b4e35b2f995606b8ed1b.

Looks like the tests are all parked waiting to run and it times out.

If I look at the change where this started happening, the only thing that looks like it could affect this is the code that was moved in mcommoninit to be under control of a lock where it wasn't before, but I can't say that is wrong, just different.

I have not been able to reproduce this anywhere else.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Sep 26, 2017

Unfortunately I can't tell which runtime test is failing. The stack trace shows two goroutines whose stack is not reported because when the timeout occurs they are running on a different thread. It would be interesting to know if the failing test is TestSelectFairness, which was introduced in the same CL as the fastrand change.

@laboger

This comment has been minimized.

Copy link
Contributor Author

@laboger laboger commented Sep 27, 2017

I'm trying to get information from others who are building from upstream to see if anyone else has hit this problem. So far we've only seen it on the build page. I can't seem to find the information about the build machine, such as what distro is it and what would be the default GOMAXPROCS on that system? With that information I could see if it reproduces with the exact same environment.

The test TestSelectFairness could be skipped on ppc64le at least temporarily to see if that prevents the failure from happening and then we would be pretty sure that test introduced the problem.

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Oct 2, 2017

Change https://golang.org/cl/67631 mentions this issue: runtime: skip test that intermittently hangs on ppc64le

@funny-falcon

This comment has been minimized.

Copy link
Contributor

@funny-falcon funny-falcon commented Oct 3, 2017

Could it be due some field reordering in type m struct {} ?
https://go-review.googlesource.com/c/go/+/64193/1/src/runtime/runtime2.go

@funny-falcon

This comment has been minimized.

Copy link
Contributor

@funny-falcon funny-falcon commented Oct 3, 2017

And I found place, where fastrand assumed to be non-zero:

s.ticket = fastrand()

if s.ticket != 0 || cansemacquire(addr) {

@laboger

This comment has been minimized.

Copy link
Contributor Author

@laboger laboger commented Oct 3, 2017

Yes, I agree it could be something else. If we skip this test and it continues to hang then we'll know it's not the test.

gopherbot pushed a commit that referenced this issue Oct 3, 2017
A new testcase TestSelectFairness was recently added, and
since then the ppc64le build tests have intermittently failed.

This adds a change to skip this test on ppc64le using
SkipFlaky to help determine if the problem is with the
test or something else with that commit.

Updates #22047

Change-Id: Idfef72ed791c5bd45c42ff180947fea3df280ea7
Reviewed-on: https://go-review.googlesource.com/67631
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@laboger

This comment has been minimized.

Copy link
Contributor Author

@laboger laboger commented Oct 3, 2017

Test has failed after skipping this test.

@funny-falcon

This comment has been minimized.

Copy link
Contributor

@funny-falcon funny-falcon commented Oct 3, 2017

What happens with such patch:

diff --git a/src/runtime/sema.go b/src/runtime/sema.go
index 8715e07d7a..816e894ae5 100644
--- a/src/runtime/sema.go
+++ b/src/runtime/sema.go
@@ -275,7 +275,7 @@ func (root *semaRoot) queue(addr *uint32, s *sudog, lifo bool) {
        // on the ticket: s.ticket <= both s.prev.ticket and s.next.ticket.
        // https://en.wikipedia.org/wiki/Treap
        // http://faculty.washington.edu/aragon/pubs/rst89.pdf
-       s.ticket = fastrand()
+       s.ticket = fastrand() | 1
        s.parent = last
        *pt = s

???

@ianlancetaylor ianlancetaylor added this to the Go1.10 milestone Oct 3, 2017
@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Oct 3, 2017

There may well be a problem in sema.go but why would it only show up on ppc64?

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Oct 4, 2017

Change https://golang.org/cl/68050 mentions this issue: runtime: fix using fastrand in sema.go

gopherbot pushed a commit that referenced this issue Oct 4, 2017
Before CL 62530 fastrand always returned non-zero value, and one
condition in sema.go depends on this behavior.

fastrand is used to generate random weight for treap of sudog, and
it is checked against zero to verify sudog were inserted into treap or
wait queue.

Since its precision is not very important for correctness, lets just
always set its lowest bit in this place.

Updates #22047
Updates #21806

Change-Id: Iba0b56d81054e6ef9c49ffd293fc5d92a6a31e9b
Reviewed-on: https://go-review.googlesource.com/68050
Reviewed-by: Austin Clements <austin@google.com>
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@laboger

This comment has been minimized.

Copy link
Contributor Author

@laboger laboger commented Oct 5, 2017

ppc64le build tests continue to hang intermittently.

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Oct 5, 2017

Change https://golang.org/cl/68710 mentions this issue: runtime: reorder fields in type m

@rasky

This comment has been minimized.

Copy link
Member

@rasky rasky commented Oct 5, 2017

Fastrand is still guaranteed to always return non-zero values, so surely the patch that went in can’t fix anything.

I still wouldn’t revert it because I don’t think code should rely on this property, it’s confusing and fragile.

@randall77

This comment has been minimized.

Copy link
Contributor

@randall77 randall77 commented Oct 5, 2017

I don't think fastrand is guaranteed to return nonzero any more.
Its internal [2]uint32 state can't be all zero, but we return state[0]+state[1], which could end up zero.

@funny-falcon

This comment has been minimized.

Copy link
Contributor

@funny-falcon funny-falcon commented Oct 6, 2017

Yes, xorshift64 with 32bit result returns zero easily.

@laboger

This comment has been minimized.

Copy link
Contributor Author

@laboger laboger commented Oct 30, 2017

This test is no longer failing.

@aclements

This comment has been minimized.

Copy link
Member

@aclements aclements commented Nov 8, 2017

I wish we understood exactly what fixed this, but closing anyway since it's no longer failing.

@aclements aclements closed this Nov 8, 2017
@golang golang locked and limited conversation to collaborators Nov 8, 2018
@randall77

This comment has been minimized.

Copy link
Contributor

@randall77 randall77 commented Oct 12, 2019

The test is no longer failing because it's now skipped on ppc64le.
Whatever is underlying is probably still happening (although I haven't checked).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
8 participants
You can’t perform that action at this time.