Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: frequent timeouts in {build,runBuilt,run}TestProg #44422

Closed
bcmills opened this issue Feb 19, 2021 · 28 comments
Closed

runtime: frequent timeouts in {build,runBuilt,run}TestProg #44422

bcmills opened this issue Feb 19, 2021 · 28 comments
Labels
NeedsInvestigation release-blocker
Milestone

Comments

@bcmills bcmills added Builders NeedsInvestigation arch-ppc64x labels Feb 19, 2021
@bcmills bcmills added this to the Backlog milestone Feb 19, 2021
@laboger
Copy link
Contributor

@laboger laboger commented Feb 23, 2021

I've been looking at this problem on various systems. If I run 'go tool dist test go_test:runtime' by itself the time varies and sometimes on a power8 it times out, but never on a power9 because it is faster there. But one mysterious result is that if I set GOMAXPROCS to a smaller value, like 2, 4, or 8 then the time is small enough on the power8 so that it doesn't time out. I thought these VMs only had 2 processors each, which I thought was equivalent to running GOMAXPROCS=2. I can only get it to timeout if the number of processors is > 150.

@ceseo
Copy link
Contributor

@ceseo ceseo commented Feb 23, 2021

I don't know if this is related, but a while back I had a bunch of timeouts in my personal OSU power9 system. I asked them to increase it from 2 to 4 cores and 4 to 8GB RAM and now I have zero timeouts there. Maybe these builders are timing out because they have too little CPU time allocated to them in the hypervisor.

@bradfitz can correct me if I'm wrong, but IIRC, the power8 builders are 2 core, 4GB RAM only.

@laboger
Copy link
Contributor

@laboger laboger commented Feb 24, 2021

@laboger
Copy link
Contributor

@laboger laboger commented Mar 11, 2021

It's not clear to me whether the test is hung or just slow. Perhaps we should set a GO_TEST_TIMEOUT_SCALE on this builder and see if that helps?

@dmitshur I think we need to increase the GO_TEST_TIMEOUT_SCALE for all the linux ppc64* builders. I can make the change but don't know how those are tested.

@dmitshur
Copy link
Contributor

@dmitshur dmitshur commented Mar 11, 2021

@laboger Sounds good. GO_TEST_TIMEOUT_SCALE is a part of builder configuration, it doesn't need tests. See CL 158537 for a past example.

@gopherbot
Copy link

@gopherbot gopherbot commented Mar 11, 2021

Change https://golang.org/cl/300870 mentions this issue: build: increase GO_TEST_TIMEOUT_SCALE for ppc64 builders

gopherbot pushed a commit to golang/build that referenced this issue Mar 16, 2021
This sets the GO_TEST_TIMEOUT_SCALE=2 for ppc64 builders. The
runtime tests have been timing out intermittently on ppc64le
power8; it has happened on ppc64 and ppc64le power9 but
very rarely.

Updates golang/go#44422

Change-Id: I663f3f211a368a59e38fbff9ce43c925c6c7a209
Reviewed-on: https://go-review.googlesource.com/c/build/+/300870
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Trust: Lynn Boger <laboger@linux.vnet.ibm.com>
@bcmills bcmills changed the title runtime: timeouts in buildTestProg on linux-ppc64le-buildlet runtime: timeouts in buildTestProg Apr 28, 2021
@bcmills bcmills removed arch-ppc64x Builders labels Apr 28, 2021
@bcmills
Copy link
Member Author

@bcmills bcmills commented May 7, 2021

This may be the same underlying issue as #45885.

@bcmills
Copy link
Member Author

@bcmills bcmills commented Nov 8, 2021

greplogs --dashboard -md -l -e '(?ms)test timed out.*runtime_test\.(?:run|runBuil[dt]|build)TestProg.*FAIL\s+runtime' --since=2021-05-01

2021-11-08T16:15:01-759eaa2/solaris-amd64-oraclerel
2021-11-07T04:56:11-85493d5/solaris-amd64-oraclerel
2021-11-05T22:00:37-f00b43f/solaris-amd64-oraclerel
2021-11-05T21:28:34-90462df/solaris-amd64-oraclerel
2021-10-29T19:51:20-994049a/netbsd-arm64-bsiegert
2021-10-28T03:35:34-b8f928b/netbsd-arm64-bsiegert
2021-10-28T03:35:04-5a3a9d8/netbsd-arm64-bsiegert
2021-10-25T20:39:17-c26a32a/openbsd-386-68
2021-10-16T16:27:40-640a49b/solaris-amd64-oraclerel
2021-10-06T22:42:28-4002616/netbsd-arm64-bsiegert
2021-10-02T13:52:59-07a7c6f/solaris-amd64-oraclerel
2021-10-01T23:51:16-c129af9/netbsd-arm64-bsiegert
2021-10-01T17:40:49-243d65c/solaris-amd64-oraclerel
2021-10-01T17:18:34-dab16c1/netbsd-arm64-bsiegert
2021-09-30T18:10:18-205640e/netbsd-arm64-bsiegert
2021-09-28T23:59:17-435718e/netbsd-arm64-bsiegert
2021-09-24T17:23:19-d5d3f80/netbsd-arm64-bsiegert
2021-09-24T14:52:47-217507e/netbsd-arm64-bsiegert
2021-09-17T22:19:21-3fa35b5/linux-ppc64le-power9osu
2021-09-13T20:17:24-a0c409c/solaris-amd64-oraclerel
2021-09-12T01:06:53-0d8a4bf/linux-riscv64-jsing
2021-09-10T17:11:39-5a4b9f9/solaris-amd64-oraclerel
2021-09-10T16:51:52-5a94a90/openbsd-386-64
2021-09-10T14:03:35-96ab854/solaris-amd64-oraclerel
2021-09-10T02:44:36-2091bd3/netbsd-arm64-bsiegert
2021-09-09T04:47:41-c84f3a4/linux-riscv64-jsing
2021-09-06T19:10:15-a193843/netbsd-arm64-bsiegert
2021-09-06T03:31:43-43b0517/netbsd-arm64-bsiegert
2021-09-01T06:18:05-5a687ee/netbsd-arm64-bsiegert
2021-08-28T06:22:11-044550a/solaris-amd64-oraclerel
2021-08-26T20:39:50-03db2c2/netbsd-arm64-bsiegert
2021-08-25T15:30:19-099b819/solaris-amd64-oraclerel
2021-08-21T11:22:07-bcd146d/solaris-amd64-oraclerel
2021-08-17T04:37:32-a304273/illumos-amd64
2021-08-16T22:27:47-213e157/illumos-amd64
2021-08-16T16:58:24-a0adf91/illumos-amd64
2021-08-16T12:42:58-94002f6/illumos-amd64
2021-08-16T10:46:11-3d679c6/netbsd-arm-bsiegert
2021-08-16T08:37:46-fcdc3c0/illumos-amd64
2021-08-14T15:14:57-456759b/illumos-amd64
2021-08-13T20:51:21-89a4f99/illumos-amd64
2021-08-13T20:49:48-bad1fc1/illumos-amd64
2021-08-13T20:45:17-98f3d7f/illumos-amd64
2021-08-13T00:12:29-1fffedd/illumos-amd64
2021-07-26T14:24:27-1868f82/illumos-amd64
2021-07-26T14:24:27-1868f82/solaris-amd64-oraclerel
2021-07-25T17:16:20-849b791/illumos-amd64
2021-07-25T17:16:20-849b791/solaris-amd64-oraclerel
2021-07-23T18:10:55-0914646/aix-ppc64
2021-07-23T18:10:55-0914646/illumos-amd64
2021-07-23T18:10:55-0914646/solaris-amd64-oraclerel
2021-07-22T20:47:59-052da57/freebsd-arm64-dmgk
2021-07-22T19:26:40-798ec73/solaris-amd64-oraclerel
2021-07-20T19:54:36-d568e6e/illumos-amd64
2021-07-09T18:00:16-ab4085c/illumos-amd64
2021-07-08T09:09:53-2ca44fe/illumos-amd64
2021-07-07T22:27:48-11f5df2/illumos-amd64
2021-06-29T21:54:24-f9d5095/illumos-amd64
2021-06-22T16:59:10-0ebd5a8/illumos-amd64
2021-06-15T18:42:11-219fe9d/illumos-amd64
2021-06-04T14:31:24-c6b6211/illumos-amd64
2021-06-01T21:14:02-567ee86/illumos-amd64
2021-05-30T02:37:38-1607c28/illumos-amd64
2021-05-26T16:11:00-bfd7798/illumos-amd64
2021-05-26T05:45:56-d050238/illumos-amd64
2021-05-14T16:42:01-3d324f1/illumos-amd64
2021-05-12T17:32:39-f93b951/illumos-amd64
2021-05-12T15:23:09-0388670/illumos-amd64
2021-05-12T15:04:42-af0f8c1/illumos-amd64
2021-05-09T17:07:01-bedf2c4/illumos-amd64
2021-05-08T14:59:49-68327e1/illumos-amd64
2021-05-08T05:57:31-4c8f48e/illumos-amd64
2021-05-07T02:17:32-d2b0311/illumos-amd64
2021-05-06T15:33:43-54e20b5/illumos-amd64
2021-05-05T01:28:34-4df662f/illumos-amd64
2021-05-04T15:03:06-86e42c2/linux-ppc64le-buildlet
2021-05-03T01:35:44-2c9f5a1/illumos-amd64

@bcmills bcmills changed the title runtime: timeouts in buildTestProg runtime: frequent timeouts in {build,runBuild,run}TestProg Nov 8, 2021
@bcmills
Copy link
Member Author

@bcmills bcmills commented Nov 8, 2021

This is affecting at least five different builders, with many failures per month.

I think that makes it a release-blocker via #11811. All of the builders need to be passing tests reliably — if that means skipping more tests on non-longtest builders, we may need to prioritize which tests to run, or perhaps move some of the slower ones out to some package that won't also be run by users as part of go test all in their own modules. (Compare #48770.)

@bcmills bcmills removed this from the Backlog milestone Nov 8, 2021
@bcmills bcmills added this to the Go1.18 milestone Nov 8, 2021
@bcmills bcmills added okay-after-beta1 Testing labels Nov 8, 2021
@bcmills
Copy link
Member Author

@bcmills bcmills commented Nov 16, 2021

Looking at the solaris-amd64-oraclerel timeouts, many of the process wait times are quite long (1–2 minutes).

That makes me wonder whether these timeouts are actually deadlocks — perhaps related to the ones observed in #48789?

@bcmills bcmills removed the Testing label Nov 16, 2021
@bcmills
Copy link
Member Author

@bcmills bcmills commented Nov 16, 2021

#49614 also shows an apparent deadlock with processes blocked in runtime.syscall_wait4. (It may be a duplicate of this issue.)

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Nov 17, 2021

It's interesting to note that in all the recent Solaris failures the runtime test times out while building a test with build flags, either from TestCgoExternalThreadSignal (building testprogcgo with -tags=threadprof) or TestCheckPtr (building testprog with -gcflags=all=-d=checkptr=1). The buildTestProg function serializes all builds using a mutex. It does a separate build for each binary (testprog, testprogcgo, ....) and the additional arguments being passed to go build, typically none but sometimes, as in these cases, present. Building testprogcgo naturally requires invoking the C compiler a few times, and also the C linker. So overall these extra builds could definitely slow down the test. Still hard to see how it takes as long as it does, though.

@gopherbot
Copy link

@gopherbot gopherbot commented Nov 17, 2021

Change https://golang.org/cl/364654 mentions this issue: runtime: don't serialize all builds in test

@bcmills
Copy link
Member Author

@bcmills bcmills commented Nov 17, 2021

in all the recent Solaris failures the runtime test times out while building a test with build flags

That is a strong pattern but not universal. The stack traces for the failures listed in #49614 seem to be at a point where the test is running the compiled program (and has been for over a minute), not waiting on a build process.

gopherbot pushed a commit that referenced this issue Nov 17, 2021
Permit a test whose program is already built to run immediately,
rather than waiting for another test to complete its build.

For #44422

Change-Id: I2d1b35d055ee4c4251f4caef3b52dccc82b71a1b
Reviewed-on: https://go-review.googlesource.com/c/go/+/364654
Trust: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Bryan C. Mills <bcmills@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Nov 17, 2021

After CL 364654 the time required to run the runtime tests on a Solaris system I have access to drops from 70 seconds to 51 seconds. CL 364755, if submitted, will drop it farther. I don't know if timeouts are the problem here, but they may be part of it. But there may also be a different problem.

@bcmills
Copy link
Member Author

@bcmills bcmills commented Nov 17, 2021

Seems like a good start, at least. We can watch the builders for a while and see whether it is empirically fixed.

@bcmills bcmills added the WaitingForInfo label Nov 17, 2021
@bcmills bcmills changed the title runtime: frequent timeouts in {build,runBuild,run}TestProg runtime: frequent timeouts in {build,runBuilt,run}TestProg Nov 19, 2021
@bcmills
Copy link
Member Author

@bcmills bcmills commented Nov 19, 2021

The last timeout in these functions was before the aforementioned changes, so I'm provisionally marking this as fixed.
(The reduction in these timeouts did draw my attention to a different runtime timeout, which I have filed as #49680.)

greplogs --dashboard -md -l -e '(?ms)test timed out.*runtime_test\.(?:run|runBuilt|build)TestProg.*FAIL\s+runtime' --since=2021-11-17

2021-11-17T18:14:41-0981724/openbsd-386-68

@bcmills bcmills closed this as completed Nov 19, 2021
@bcmills bcmills removed the WaitingForInfo label Nov 19, 2021
@bcmills
Copy link
Member Author

@bcmills bcmills commented Nov 23, 2021

Nope, still timing out occasionally:

greplogs --dashboard -md -l -e '(?ms)test timed out.*runtime_test\.(?:run|runBuilt|build)TestProg.*FAIL\s+runtime' --since=2021-11-20

2021-11-22T20:34:40-f13fcd9/solaris-amd64-oraclerel

@bcmills bcmills reopened this Nov 23, 2021
@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Nov 23, 2021

In the 2021-11-22 timeout the only tests still running are two waiting for go build -tags=threadprof testprogcgo to complete.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Nov 23, 2021

On a Solaris system I have access to, clearing the cache and running go test -test.short -test.v -test.count=1 runtime takes 38.5 seconds. The last two tests to complete are TestCgoExternalThreadSIGPROF and TestCgoExternalThreadSignal which both take about 13 seconds. One test is building the test program, and the other is waiting for that build to complete (they both build the same test program). Actually running the test program takes a fraction of a second.

This is consistent with the builder failure just being slow and running out of time, especially since cmd/dist will be running other tests in parallel.

@bcmills
Copy link
Member Author

@bcmills bcmills commented Dec 6, 2021

greplogs --dashboard -md -l -e '(?ms)test timed out.*runtime_test\.(?:run|runBuilt|build)TestProg.*FAIL\s+runtime' --since=2021-11-23

2021-12-02T22:06:27-06dbf61/solaris-amd64-oraclerel
2021-12-02T20:45:47-3c6295d/openbsd-386-68
2021-11-29T19:24:34-68da368/solaris-amd64-oraclerel
2021-11-24T15:57:53-14f2b2a/openbsd-386-68
2021-11-24T15:55:56-465b402/openbsd-386-68
2021-11-24T15:55:56-465b402/solaris-amd64-oraclerel

All of these timeouts also appear to be actively compiling in TestCgoExternalThread.*.

Perhaps some of these slower tests should be skipped in short mode?

@cherrymui cherrymui removed the okay-after-beta1 label Dec 14, 2021
@aclements
Copy link
Member

@aclements aclements commented Jan 5, 2022

Perhaps some of these slower tests should be skipped in short mode?

Since we don't have a long builder for these, this seems like it would lose important test coverage.

or perhaps move some of the slower ones out to some package that won't also be run by users as part of go test all in their own modules.

This seems feasible. As you pointed out, this would give it its own timeout. It would require some nontrivial refactoring to pull out the testprog infrastructure, but we could do that.

@bcmills
Copy link
Member Author

@bcmills bcmills commented Jan 7, 2022

greplogs --dashboard -md -l -e '(?ms)test timed out.*runtime_test\.(?:run|runBuilt|build)TestProg.*FAIL\s+runtime' --since=2021-12-07

2022-01-06T23:39:43-042548b/solaris-amd64-oraclerel (building TestFakeTime)
2022-01-06T15:00:16-da7891f/netbsd-386-9_0-n1 (running TestCgoExecSignalMask)
2022-01-03T23:45:06-a893d0f/openbsd-386-68 (building TestCgoExternalThreadSIGPROF)
2021-12-22T23:31:40-d2ce939/solaris-amd64-oraclerel (building TestCgoExternalThreadSignal)
2021-12-16T21:02:37-ea26ce7/netbsd-arm64-bsiegert (building TestFakeTime)
2021-12-13T19:40:29-7bdbc73/openbsd-386-68 (building TestCgoExternalThreadSIGPROF)
2021-12-12T14:36:17-6b89773/solaris-amd64-oraclerel (building TestCgoExternalThreadSIGPROF)

@aclements
Copy link
Member

@aclements aclements commented Jan 18, 2022

A much simpler thing we could do is to increase the timeout for the runtime test. We already do this for the cmd/go tests [1] and I don't think that's any more artificial than factoring the cgo prog tests out into their own test.

[1] Or, at least, we try to. The code is weird.

@bcmills
Copy link
Member Author

@bcmills bcmills commented Jan 18, 2022

If increasing the timeout fixes the builder flakiness, I think it's a reasonable approach — Go users who run go test all should in general expect to either set a very generous timeout or scale it to their particular running environment.

That said, if the problem is just the timeout length, should we raise the scale factors in x/build/dashboard/builders.go instead of just for the specific test?

(It would be nice if we could eliminate the cmd/go special case in that code too, but that seems like a separate issue...)

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jan 19, 2022

The timeouts are from building a variant of threadprogcgo with -tags=threadprof. We don't need that variant. I'll send a CL to remove it. That will remove the extra build time and should keep the tests from timing out. I hope.

@gopherbot
Copy link

@gopherbot gopherbot commented Jan 19, 2022

Change https://golang.org/cl/379294 mentions this issue: runtime: remove -tags=threadprof in tests

jproberts pushed a commit to jproberts/go that referenced this issue Jun 21, 2022
Use an enviroment variable rather than a build tag to control starting
a busy loop thread when testprogcgo starts. This lets us skip another
build that invokes the C compiler and linker, which should avoid
timeouts running the runtime tests.

Fixes golang#44422

Change-Id: I516668d71a373da311d844990236566ff63e6d72
Reviewed-on: https://go-review.googlesource.com/c/go/+/379294
Trust: Ian Lance Taylor <iant@golang.org>
Run-TryBot: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation release-blocker
Projects
None yet
Development

No branches or pull requests

8 participants