Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: "program exceeds 50-thread limit" in bootstrap build for Go 1.9 RC2 on arm64 on Centriq 2400 #21559

Closed
vielmetti opened this issue Aug 22, 2017 · 31 comments

Comments

Projects
None yet
6 participants
@vielmetti
Copy link

commented Aug 22, 2017

What version of Go are you using (go version)?

go version go1.9rc2 linux/arm64
bootstrap compiler is go version go1.6.2 linux/arm64

Machine is

$ uname -a
Linux lab-centriq 4.7.0-2-generic #5~pdaw1.1+bandera.6-Ubuntu SMP Wed Feb 8 23:10:54 UTC 2017 aarch64 aarch64 aarch64 GNU/Linux

Does this issue reproduce with the latest release?

This issue does not reproduce on the Cavium ThunderX hardware, which passes all tests without flaws.

What operating system and processor architecture are you using (go env)?

./go env
GOARCH="arm64"
GOBIN=""
GOEXE=""
GOHOSTARCH="arm64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/emv/go"
GORACE=""
GOROOT="/home/emv/src/go.googlesource.com/go"
GOTOOLDIR="/home/emv/src/go.googlesource.com/go/pkg/tool/linux_arm64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build997434454=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"

What did you do?

git checkout go1.9rc2
cd src
./all.bash

If possible, provide a recipe for reproducing the error.
A complete runnable program is good.
A link on play.golang.org is best.

What did you expect to see?

All tests pass

What did you see instead?

runtime: program exceeds 50-thread limit
fatal error: thread exhaustion
FAIL    os      0.509s 
Failed: exit status 1

Complete output at https://gist.github.com/017dbb7fa20d184d66a8d9ea4bf61225

@vielmetti

This comment has been minimized.

Copy link
Author

commented Aug 22, 2017

From reading the test code, I think the error is showing up in

// Test that reading from a pipe doesn't use up a thread.
func TestPipeThreads(t *testing.T) {

@vielmetti vielmetti changed the title "runtime: program exceeds 50-thread limit" in bootstrap build for Go 1.9 RC2 "runtime: program exceeds 50-thread limit" in bootstrap build for Go 1.9 RC2 on arm64 Aug 22, 2017

@vielmetti vielmetti changed the title "runtime: program exceeds 50-thread limit" in bootstrap build for Go 1.9 RC2 on arm64 "runtime: program exceeds 50-thread limit" in bootstrap build for Go 1.9 RC2 on arm64 on Centriq 2400 Aug 22, 2017

@vielmetti

This comment has been minimized.

Copy link
Author

commented Aug 22, 2017

Note that all tests pass on the Cavium ThunderX, but this one test fails to pass on the Centriq system. Exploring further to see if I can get different results.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Aug 24, 2017

Is this easily repeatable? If so, please run go test -c os to get an os.test binary, then run ./os.test --testing.short to verify that the problem still occurs, and attach the output of strace -f ./os.test --testing.short. Thanks.

@vielmetti

This comment has been minimized.

Copy link
Author

commented Aug 24, 2017

Odd. It's reproducible from the ./all.bash shell script, but when I make and run the the os.test binary, it passes.

@vielmetti

This comment has been minimized.

Copy link
Author

commented Aug 24, 2017

When I download the binaries for go 1.9 rc2 they pass the os.test test, but when I use those binaries to bootstrap the go 1.9 rc2 sources, the all.bash script gets this failure - only on the Centriq system.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Aug 25, 2017

Is there any difference in the output of ulimit -u on the succeeding and failing systems?

@vielmetti

This comment has been minimized.

Copy link
Author

commented Aug 25, 2017

ulimit -u on the ThunderX: 515386

ulimit -u on Centriq: 385905

Both should be plenty high for any reason.

@ianlancetaylor ianlancetaylor changed the title "runtime: program exceeds 50-thread limit" in bootstrap build for Go 1.9 RC2 on arm64 on Centriq 2400 runtime: "program exceeds 50-thread limit" in bootstrap build for Go 1.9 RC2 on arm64 on Centriq 2400 Aug 25, 2017

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Aug 25, 2017

I haven't been able to think of any reason why this test would behave differently when invoked from all.bash. I can't recall seeing any other reports of this problem. The test is fairly simple, and is basically testing that the pipes are handled by the internal poller. If we could capture it standalone, then running the test under strace -f would likely show the problem. But running all.bash under strace -f will produce too much irrelevant information.

@vielmetti

This comment has been minimized.

Copy link
Author

commented Aug 25, 2017

I'm also puzzled. The only think I can possibly think of is that this machine is running a unique kernel, and that we're planning to update the kernel to something closer to mainline. I'll put this on hold for now, and rerun it the next time there's a new kernel.

@valyala

This comment has been minimized.

Copy link
Contributor

commented Aug 27, 2017

This may be related to #21621 and #21550 during compilation of a package with many files. As I uderstand, currently each file may be concurrently opened and read in a separate OS thread if the corresponding file I/O syscalls (cgo calls) are slow.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Aug 28, 2017

@valyala In this case the calls should not be slow, because when using a pipe all I/O should be done using non-blocking calls.

@vielmetti

This comment has been minimized.

Copy link
Author

commented Aug 28, 2017

OK, adding to the puzzle:

git checkout go1.9
cd src
./all.bash

just worked perfectly on the same machine, no errors; building master also gave me no errors. But building go1.9rc2 did again error out on os just the same way as it did originally.

Curious!

@orivej

This comment has been minimized.

Copy link

commented Sep 13, 2017

I can easily reproduce this on a 56-threaded Intel Xeon E5-2660 v4 Linux. (Out of 100 consecutive runs of os.test --test.short built from go master, 9 panicked with this error.) Unfortunately, the problem vanishes under strace.

@orivej

This comment has been minimized.

Copy link

commented Sep 13, 2017

Here is a test panic from that machine, if it may be of any use: os.txt.

@orivej

This comment has been minimized.

Copy link

commented Sep 13, 2017

In fact this reproduces only when CPU is under some load. Stressing CPU makes it reproduce practically every time, whereas without any load the test may always succeed. I used the following script to bisect go history, c05b06a is the first bad commit.

#!/bin/bash
set -e
cd src
./make.bash
cd os
../../bin/go test -a -c os
stress-ng -t 60 -c 60 &
trap 'killall stress-ng || true' EXIT
for i in `seq 100`; do
    ./os.test --test.short
done
@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Sep 13, 2017

Thanks for the bisect. That is the commit that introduced the failing test in the first place.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Sep 13, 2017

Can you generate a failure with GOTRACEBACK=system set in the environment and show us the stack dump? Thanks.

@orivej

This comment has been minimized.

Copy link

commented Sep 13, 2017

Here is one: os.txt (go is at the current master, 9593b74).

@gopherbot

This comment has been minimized.

Copy link

commented Sep 13, 2017

Change https://golang.org/cl/63650 mentions this issue: os: avoid crashing with a thundering herd in TestPipeThreads

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Sep 13, 2017

Sorry for missing that you attached that before.

The stack trace suggests that the problem is not with the test itself, it is with the cleanup when the test is done. It looks like we can get a thundering herd as we stop all the goroutines.

Can you see if https://golang.org/cl/63650 fixes the problem for you? Thanks.

@orivej

This comment has been minimized.

Copy link

commented Sep 13, 2017

When I repeat the test under load (and it fails every time), the number of goroutines in the dump typically is either 8-10, or 107-108. Above was an example with 10 goroutines, and here is one with 108:
os.txt

I'll test your patch now.

@orivej

This comment has been minimized.

Copy link

commented Sep 13, 2017

The patch fixes the issue. Thanks!

@ianlancetaylor ianlancetaylor modified the milestones: Go1.10, Go1.9.1 Sep 13, 2017

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Sep 13, 2017

Thanks for testing it.

Marking issue as 1.9.1 to consider bringing the test fix into 1.9.1 to avoid spurious build failure reports.

@gopherbot gopherbot closed this in 29415eb Sep 13, 2017

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Sep 13, 2017

Reopening to consider a backport to 1.9.1.

@orivej orivej referenced this issue Sep 17, 2017

Merged

go: fix tests and impurity #29489

6 of 8 tasks complete

@rsc rsc modified the milestones: Go1.9.1, Go1.9.2 Oct 4, 2017

@rsc

This comment has been minimized.

Copy link
Contributor

commented Oct 13, 2017

I'm confused. The test says "Test that reading from a pipe doesn't use up a thread." and it sounds like it was using up a thread on the breaking system. The rewrite of the test may just be making the test not as precise about finding problems, essentially papering over a real problem.

Can you explain why the old test was wrong? I just don't see it. The old channel+waitgroup looks functionally equivalent to the new channel+channel, and I don't see why putting Close into the goroutine (instead of a later cleanup pass) would change anything, unless Close is happening to cause extra serialization that defeats the test.

If you don't mind just disabling the test for Go 1.9.2 because you think it's a bad test for some reason, that's fine. But it seems like a worthwhile test and maybe pointing out a real problem.

@rsc rsc added the NeedsDecision label Oct 13, 2017

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Oct 13, 2017

The test is intended to test that reading from a pipe when there is no data to read doesn't use up a thread. When data arrives, a thread is required to actually do the read. With the original test, all the goroutines would queue up to read, and all is well. Then we would write to all of them at once, and we would blow past the thread limit.

But, yeah, my fix is not well written. I think it may work because of synchronization on the cdone channel. cdone ought to be an unbuffered channel.

For 1.9 we should probably just disable the test.

@rsc

This comment has been minimized.

Copy link
Contributor

commented Oct 13, 2017

OK. Disabling test OK for Go 1.9.2.
TODO: Send a CL on the release branch, but do not submit it. (I want to try automating the actual release branch commits.)

@rsc rsc removed the Testing label Oct 13, 2017

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Oct 13, 2017

Sent https://golang.org/cl/70771 for 1.9 branch.

@gopherbot

This comment has been minimized.

Copy link

commented Oct 13, 2017

Change https://golang.org/cl/70771 mentions this issue: [release-branch.go1.9] os: skip TestPipeThreads as flaky for 1.9

@rsc

This comment has been minimized.

Copy link
Contributor

commented Oct 14, 2017

CL 70771 OK for Go 1.9.2.

gopherbot pushed a commit that referenced this issue Oct 25, 2017

[release-branch.go1.9] os: skip TestPipeThreads as flaky for 1.9
Updates #21559

Change-Id: I90fa8b4ef97c4251440270491ac4c833d76ee872
Reviewed-on: https://go-review.googlesource.com/70771
Run-TryBot: Russ Cox <rsc@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@rsc

This comment has been minimized.

Copy link
Contributor

commented Oct 26, 2017

go1.9.2 has been packaged and includes:

The release is posted at golang.org/dl.

— golang.org/x/build/cmd/releasebot, Oct 26 21:09:08 UTC

@rsc rsc closed this Oct 26, 2017

finlay added a commit to dragonfly-science/nixpkgs that referenced this issue Apr 29, 2018

@golang golang locked and limited conversation to collaborators Oct 26, 2018

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
You can’t perform that action at this time.