Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: large initial values for GOMAXPROCS can cause runtime test failures #47246

Open
laboger opened this issue Jul 16, 2021 · 9 comments
Open

Comments

@laboger
Copy link
Contributor

@laboger laboger commented Jul 16, 2021

What version of Go are you using (go version)?

$ go version
tip

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
ppc64le

What did you do?

./all.bash
Run on systems with > 400 processors without a GOMAXPROCS setting.
On systems with fewer, set GOMAXPROCS=400.

What did you expect to see?

Correct test results.

What did you see instead?

When running on a system with > 400 processors, we are seeing some failures in the runtime package tests during ./all.bash that don't happen an explicit setting of GOMAXPROCS with a setting like 300. Some of these failures can be reproduced on systems with fewer processors if the initial GOMAXPROCS value is set to a number of about 400.

On a power9:

~/golang/plain/go/src/runtime$ GOMAXPROCS=396 go test -test.run=TestGcSys
--- FAIL: TestGcSys (0.31s)
    gc_test.go:30: expected "OK\n", but got "using too much memory: 17104896 bytes\n"
FAIL
exit status 1
FAIL	runtime	0.322s

~/golang/plain/go/src/runtime$ GOMAXPROCS=395 go test -test.run=TestGcSys
PASS
ok  	runtime	0.313s

On another system with 448 processors, the default GOMAXPROCS fails consistently.
go test -test.run=TestGcSys
--- FAIL: TestGcSys (0.58s)
    gc_test.go:30: expected "OK\n", but got "using too much memory: 17301504 bytes\n"
FAIL
exit status 1
FAIL	runtime	0.589s

Setting to a lower value allows it to pass.
GOMAXPROCS=256 go test -test.run=TestGcSys -test.count=4
PASS
ok  	runtime	0.721s

It seems that for ppc64le there should be a maximum default initial GOMAXPROCS value so that the runtime tests don't fail with the default setting.

We have seen other failures intermittently in the runtime package based on the GOMAXPROCS setting. Still trying to narrow down the conditions that cause those to fail.

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Jul 16, 2021

TestGcSys has been flaky, see #37331 . Is there any other failure other than this one?

I doubt this is PPC64-specific.

@laboger
Copy link
Contributor Author

@laboger laboger commented Jul 16, 2021

This fails consistently on a system with > 400 processors. That means by default ./all.bash fails on that system.
You are correct, I can make it fail consistently on an x86 if I set GOMAXPROCS=400 so it is not PPC64 specific.

@laboger laboger changed the title runtime: large initial values for GOMAXPROCS can cause runtime test failures on ppc64le runtime: large initial values for GOMAXPROCS can cause runtime test failures Jul 16, 2021
@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Jul 16, 2021

cc @aclements @mknyszek

@laboger do you see any other test failing? TestGcSys is a known flaky test.

@cherrymui cherrymui added this to the Backlog milestone Jul 16, 2021
@laboger
Copy link
Contributor Author

@laboger laboger commented Jul 16, 2021

I'm still looking.... Some are taking much longer with more GOMAXPROCS, so I'm trying to determine if they are hanging.
(An earlier run had failures/panic traces but I didn't capture that output.)

@laboger
Copy link
Contributor Author

@laboger laboger commented Jul 16, 2021

Here is a test that timed out so I wasn't sure if it was a hang, turns out it just takes a lot longer as the GOMAXPROCS increase. This system has 448. Not sure if that is expected behavior for the test.

[boger@rain6p1 runtime]$ GOMAXPROCS=64 go test -test.run=TestSemaHandoff
PASS
ok  	runtime	10.074s
[boger@rain6p1 runtime]$ GOMAXPROCS=128 go test -test.run=TestSemaHandoff
PASS
ok  	runtime	79.771s
[boger@rain6p1 runtime]$ GOMAXPROCS=245 go test -test.run=TestSemaHandoff
PASS
ok  	runtime	292.193s
[boger@rain6p1 runtime]$ go test -test.run=TestSemaHandoff -test.timeout=30m
PASS
ok  	runtime	977.224s

@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Jul 16, 2021

@laboger Oof. I do think TestSemaHandoff is going to scale pretty poorly with that many processors, if I'm remembering that test correctly. Maybe it should have a cap on GOMAXPROCS.

I know for certain that the runtime is going to have some scalability issues at that many processors. The scheduler and allocator don't take advantage of NUMA at all, for instance.

As Cherry points out, TestGcSys is unfortunately just known to be flaky, and theoretically it's independent of GOMAXPROCS (it sets its own GOMAXPROCS to 1). It could just be that the specific scheduling that triggers the GC issues causing the flakiness is just more common on your system. I'm not sure.

@laboger
Copy link
Contributor Author

@laboger laboger commented Jul 16, 2021

I understand, I'm just suggesting that the default for GOMAXPROCS could have an upper limit instead of always basing it on the number of processors on the system, since there known issues and performance hits above a certain point.

@aclements
Copy link
Member

@aclements aclements commented Jul 27, 2021

Talking with the team, the consensus was that we should just limit GOMAXPROCS in the tests that don't scale well. @prattmic has done benchmarks that show Go itself can scale pretty well to 512 and even 1024 GOMAXPROCS. This is obviously pretty workload-dependent, so I don't think we want to artificially limit GOMAXPROCS. But some of the tests just don't scale well, and we can keep those from timing out.

@prattmic
Copy link
Member

@prattmic prattmic commented Jul 27, 2021

I understand, I'm just suggesting that the default for GOMAXPROCS could have an upper limit instead of always basing it on the number of processors on the system, since there known issues and performance hits above a certain point.

I definitely don't think we should cap explicitly specified values of GOMAXPROCS, however perhaps it could be OK to cap the default value of GOMAXPROCS if there are serious performance issues for real-world applications (though hopefully we could work towards fixing those). I don't think this test is a realistic workload for that purpose, but I'd be curious to see other common applications.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
5 participants