
cmd/compile: fixedbugs/issue10958.go failures #18589

Open
ALTree opened this Issue Jan 10, 2017 · 23 comments

@ALTree
Member

ALTree commented Jan 10, 2017

The fixedbugs/issue10958.go test, introduced in 7f1ff65 (cmd/compile: insert scheduling checks on loop backedges), is flaky.

What version of Go are you using (go version)?

go version devel +7f1ff65 Mon Jan 9 21:01:29 2017 +0000 linux/amd64

What operating system and processor architecture are you using (go env)?

GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="XXX/gocode"
GORACE=""
GOROOT="XXX/go1.7/go"
GOTOOLDIR="XXX/go1.7/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build124360425=/tmp/go-build"
CXX="g++"
CGO_ENABLED="1"
PKG_CONFIG="pkg-config"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"

What did you do?

Ran all.bash 3 times on a CentOS 6.5 server.

What did you see?

It failed reliably with:

##### ../test
# go run run.go -- fixedbugs/issue10958.go
signal: killed

It happened on the linux/mips64 builder too:

https://build.golang.org/log/3f4be6b0c792fd179683d0046a2e6f178f8928d2

@dr2chase dr2chase self-assigned this Jan 10, 2017

@gopherbot

gopherbot commented Jan 10, 2017

CL https://golang.org/cl/35051 mentions this issue.

@dr2chase
Contributor

dr2chase commented Jan 10, 2017

Can you try upping the timeout in that test? See https://go-review.googlesource.com/35051 for an example:

--- a/test/fixedbugs/issue10958.go
+++ b/test/fixedbugs/issue10958.go
@@ -1,5 +1,5 @@
 // +build !nacl
-// buildrun -t 2  -gcflags=-d=ssa/insert_resched_checks/on,ssa/check/on
+// buildrun -t 10  -gcflags=-d=ssa/insert_resched_checks/on,ssa/check/on
 
 // Copyright 2016 The Go Authors. All rights reserved.
 // Use of this source code is governed by a BSD-style

I figure I get one try at not-flaky; if it fails, we yank the test.

@ALTree
Member

ALTree commented Jan 10, 2017

Do I have to launch the test via all.bash, or is running go run issue10958.go enough?

Because if I go run it, it hangs forever (there's a data race in the code; is that normal?).

@dr2chase
Contributor

dr2chase commented Jan 10, 2017

In the test directory: go run run.go -- fixedbugs/issue10958.go.
Standalone: go run -gcflags -d=ssa/insert_resched_checks/on issue10958.go.
I should add a comment to the test explaining that it is normally expected to hang.

The data race is more or less required -- the threads have to enter their actually-but-not-obviously infinite loops, and those loops cannot contain calls. The test is whether the compiler is properly inserting rescheduling checks into those loops, and if they have the intended effect.

@ALTree
Member

ALTree commented Jan 10, 2017

  1. I patched the file, bumping the timeout to 10s, and ran all.bash; it still failed with the same error.

  2. $ go run run.go -- fixedbugs/issue10958.go in test gives:

# go run run.go -- fixedbugs/issue10958.go
exit status 2
# command-line-arguments
compile: Did not find a phase matching insert_resched_checks in -d=ssa/... debug option

  3. go run -gcflags -d=ssa/insert_resched_checks/on issue10958.go in test/fixedbugs gives:

# command-line-arguments
compile: Did not find a phase matching insert_resched_checks in -d=ssa/... debug option

@dr2chase
Contributor

dr2chase commented Jan 10, 2017

That message is from an "old" compiler. There's a new phase in the one paired with that test, and the phase is present whether GOEXPERIMENT is set or not.

@ALTree
Member

ALTree commented Jan 10, 2017

Yeah, sorry, I had set a bad GOROOT while trying.

go run run.go -- fixedbugs/issue10958.go never fails (not even with 2s).

Still seeing failures with all.bash and 10s.

@dr2chase
Contributor

dr2chase commented Jan 10, 2017

Too much concurrency in all.bash, perhaps? Thanks for checking this.

@ALTree
Member

ALTree commented Jan 10, 2017

OK, I'll let you decide what to do with this issue. all.bash has always been quite slow on that machine (not sure why), so maybe it's not worth yanking the test... or maybe we could wait and see how frequent the flake is.

@dr2chase
Contributor

dr2chase commented Jan 10, 2017

It's worth disabling. It's only for an experiment, and flaky tests are vile.

@aclements
Member

aclements commented Jan 10, 2017

A few thoughts:

You should be able to reproduce it with just go run run.go from the test directory, which runs everything in there with full concurrency, but doesn't require the full all.bash. If that succeeds, then something is weird about the all.bash environment.

If you run with GOTRACEBACK=crash and send it a SIGQUIT before it times out (maybe increase the timeout to something really high), the traceback may be useful.
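
A minimal sketch of that procedure, run from test/fixedbugs (the sleep duration and the backgrounding are illustrative, not part of any official recipe):

```shell
# Run the test standalone with rescheduling checks on; it is expected
# to wedge, so background it and capture a traceback with SIGQUIT.
GOTRACEBACK=crash go run -gcflags=-d=ssa/insert_resched_checks/on issue10958.go &
pid=$!
sleep 30            # let the loops get stuck (value is illustrative)
kill -QUIT "$pid"   # SIGQUIT makes the Go runtime dump all goroutine stacks
wait "$pid"         # exit status reflects the SIGQUIT-induced crash
```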

@gopherbot gopherbot closed this in d9a0579 Jan 10, 2017

@ALTree
Member

ALTree commented Jan 10, 2017

Thanks for the suggestion.

go run run.go fails (after a while), as expected:

$ gotip run run.go

# go run run.go -- fixedbugs/issue10958.go
signal: killed

FAIL	fixedbugs/issue10958.go	2.005s

In the meantime another CL was merged; the builders tested it and two failed on this test:

freebsd/arm: https://build.golang.org/log/02cb9ac66d2d9a19feb6fb479917e709a50850d5
linux/mips: https://build.golang.org/log/9f3e8129e95710b42ac5528f523bfbcaf171a5f1

so yeah, it's really flaky.

@ALTree
Member

ALTree commented Jan 10, 2017

With a really high timeout (I tried 600) it does not fail, so I guess the machine is just choking a little on that test. It has been disabled now so we're fine.

@rsc rsc reopened this Jan 10, 2017

@rsc
Contributor

rsc commented Jan 10, 2017

@ALTree can you say more about the machine? Is it a VM? Hardware? Software? How many cores? What is GOMAXPROCS? Are there other things running on the machine?

It's possible the OS descheduled a thread for 10 seconds, but that seems like a long time.

@ALTree
Member

ALTree commented Jan 10, 2017

Just to confirm that the timeout was correctly set to 10, I tried again (with all.bash):

##### ../test
# go run run.go -- fixedbugs/issue10958.go
signal: killed

FAIL	fixedbugs/issue10958.go	10.025s

(note 10.025s at the end).

No VM, it's real hardware. The machine is an old NeXtScale cluster (we'll decommission it during 2017, I believe). The OS is CentOS 6.5; uname -a says:

Linux nodeXXX 2.6.32-431.el6.x86_64 #1 SMP x86_64 GNU/Linux

There are many nodes, and each node has 2 x Intel Xeon E5-2670v2 2.50 GHz CPUs and 128 GB of RAM, but the resources I'm requesting from the job manager are select=1:ncpus=8:mem=32gb (8 cores, 32 GB of RAM, all on the same node).

Maybe what's happening is that all.bash sees the hardware of the whole node and spawns many threads (20? 40?), which are then pinned to the 8 cores I requested, choking them a little.

What is GOMAXPROCS?

Not set.

Are there other things running on the machine?

Possibly. Since I've only requested about half a node, if another user asks for just a few cores the job manager could assign the job to the same node. The 8 cores I requested are mine though (and they're guaranteed).

@bradfitz bradfitz added the Testing label Jan 13, 2017

@bradfitz bradfitz added this to the Go1.8Maybe milestone Jan 13, 2017

@rsc
Contributor

rsc commented Jan 17, 2017

I think all we need to do here is increase the timeout, for Go 1.9.

@rsc rsc modified the milestones: Go1.9Early, Go1.8Maybe Jan 17, 2017

@dr2chase
Contributor

dr2chase commented Feb 22, 2017

I modified the test (and also the rescheduling code) in this CL; it may work better now: https://go-review.googlesource.com/c/36206/

@alexbrainman
Member

alexbrainman commented Apr 7, 2017

@bradfitz
Member

bradfitz commented Apr 13, 2017

The last four openbsd-386 builds all failed with this too.

I'm going to disable the test.

@gopherbot

gopherbot commented Apr 13, 2017

CL https://golang.org/cl/40651 mentions this issue.

gopherbot pushed a commit that referenced this issue Apr 13, 2017

test: disable flaky test/fixedbugs/issue10958.go
Updates #18589

Change-Id: I2c3bbc8257c68295051bd2e63e1e11794d0609c3
Reviewed-on: https://go-review.googlesource.com/40651
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Josh Bleecher Snyder <josharian@gmail.com>
Reviewed-by: David Chase <drchase@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@gopherbot

gopherbot commented May 2, 2017

CL https://golang.org/cl/42431 mentions this issue.

gopherbot pushed a commit that referenced this issue May 2, 2017

cmd/dist: spin up fewer test dir runners
5 shards, each of which spins up NumCPU processes,
each of which is running at GOMAXPROCS=NumCPU,
is too much for one machine. It makes my laptop unusable.

It might also be in part responsible for test flakes
that require a moderately responsive system,
like #18589 (backedge scheduling) and #19276 (locklinear).

It's possible that Go should be a better neighbor in general;
that's #17969. In the meantime, fix this corner of the world.

Builders snapshot the world and run shards on different
machines, so keeping sharding high for them is good.

This is a partial reversion of CL 18199.

Fixes #20141.

Change-Id: I123cf9436f4f4da3550372896265c38117b78071
Reviewed-on: https://go-review.googlesource.com/42431
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>

josharian added a commit to josharian/go that referenced this issue May 2, 2017

cmd/dist: spin up fewer test dir runners

@bradfitz bradfitz modified the milestones: Go1.9, Go1.9Early May 3, 2017

@bradfitz bradfitz modified the milestones: Go1.10, Go1.9 Jun 28, 2017

@mdempsky mdempsky modified the milestones: Go1.10, Go1.11 Nov 30, 2017

@dr2chase dr2chase modified the milestones: Go1.11, Go1.12 Jun 26, 2018

@dr2chase
Contributor

dr2chase commented Jun 26, 2018

Moved to 1.12; the test will likely change or be obsoleted when we add signal-based preemption.

@randall77
Contributor

randall77 commented Nov 26, 2018

Punting to 1.13; we aren't going to do anything for this cycle.

@randall77 randall77 modified the milestones: Go1.12, Go1.13 Nov 26, 2018
