
testing: provide a way to work around "race: limit on 8128 simultaneously alive goroutines is exceeded" error #47056

Open
rogpeppe opened this issue Jul 5, 2021 · 12 comments

Comments

@rogpeppe (Contributor) commented Jul 5, 2021

$ go version
go version devel go1.17-dc00dc6c6b Thu Jun 10 12:41:37 2021 +0000 linux/amd64

Issue #38184 suggests this is a reasonably hard limit in the race detector library, but it's not always easy to work around.
In testing, the problem most commonly arises in higher-level tests that run slowly and therefore call testing.T.Parallel to speed themselves up. That means that when tests run on a large machine (for example in continuous integration), GOMAXPROCS is higher, more tests run in parallel, more goroutines are simultaneously alive, and the limit is more likely to be exceeded.

There are a few possibilities for working around this at present, although none of them feels great:

  • avoid calling testing.T.Parallel when the race detector is enabled, but this disables parallelism completely, which feels like overkill.
  • run some packages with a different value for the -test.parallel flag, but this means that the top-level CI script needs to be changed (you'll no longer be able to run go test -race ./...), which has maintainability issues.
  • update the value of -test.parallel within a given package by using flag.Lookup and setting the flag directly (see the sketch below), but this is a nasty hack.
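
For illustration, a minimal sketch of that last workaround, assuming a hypothetical mypkg_test package; the cap of 4 is an arbitrary example value, and this is a sketch of the hack rather than a recommended pattern:

```go
package mypkg_test

import (
	"flag"
	"os"
	"strconv"
	"testing"
)

func TestMain(m *testing.M) {
	// The testing flags (including test.parallel) are registered before
	// TestMain runs, so the flag can be looked up and overridden here.
	flag.Parse()
	if f := flag.Lookup("test.parallel"); f != nil {
		// Cap parallelism at 4 (an arbitrary value for this sketch) so that
		// fewer goroutines are simultaneously alive under the race detector.
		if cur, err := strconv.Atoi(f.Value.String()); err == nil && cur > 4 {
			_ = f.Value.Set("4")
		}
	}
	os.Exit(m.Run())
}
```

With the package capping itself this way, a plain go test -race ./... in CI keeps working without extra flags, which is the main appeal despite the hackiness.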

Ideally, we'd be able to configure the maximum number of active goroutines allowed, but failing that, another possibility might be to have the default value of -test.parallel read from runtime.GOMAXPROCS after TestMain has run rather than before, allowing a test package to adjust the amount of test parallelism itself. Another idea (courtesy of @mvdan) is that the testing package could avoid starting new parallel tests when the number of goroutines grows large.

@mvdan (Member) commented Jul 5, 2021

> Another idea (courtesy of @mvdan) is that the testing package could avoid starting new parallel tests when the number of goroutines grows large.

To expand a little: with the race detector enforcing a limit of 8k (active?) goroutines, presumably the testing package could refuse to start new parallel tests or sub-tests if we are currently over 50% of that limit of (active?) goroutines.

That would still leave the "limit exceeded" error as a possibility, but it would presumably make it far less likely, and should only make hugely parallel tests slightly slower. And, best of all, it should allow go test -race ./... to usually do the right thing without extra knobs.
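
For what it's worth, a rough user-level approximation of this idea can be sketched today without any testing-package changes: a hypothetical helper that delays t.Parallel until the live goroutine count is back under a budget. The budget and poll interval below are arbitrary.

```go
package mypkg_test

import (
	"runtime"
	"testing"
	"time"
)

// goroutineBudget is roughly half of the race detector's 8128-goroutine limit.
const goroutineBudget = 4000

// parallelWhenRoom marks t as parallel only once the number of live goroutines
// in the test binary has dropped below the budget. Because top-level tests run
// sequentially until they call t.Parallel, waiting here also delays the tests
// that follow, which is the throttling effect we want.
func parallelWhenRoom(t *testing.T) {
	t.Helper()
	for runtime.NumGoroutine() > goroutineBudget {
		time.Sleep(10 * time.Millisecond)
	}
	t.Parallel()
}

func TestSomethingSlow(t *testing.T) {
	parallelWhenRoom(t)
	// ... the actual test body ...
}
```

This only throttles goroutines within a single test binary, and it doesn't help when individual tests spawn huge numbers of goroutines themselves, so it's a stopgap rather than a fix.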

@mknyszek (Contributor) commented Jul 7, 2021

@mvdan That kind of heuristic could be effective in the general case, but could end up more confusing in cases where tests themselves spawn many goroutines. I'm not really sure what to do here besides trying to make the race detector more robust (I think we haven't kept up with TSAN...? Don't quote me on that.).

@mknyszek added this to the Backlog milestone Jul 7, 2021
@randall77 (Contributor) commented Jul 7, 2021

I believe this is a limitation of the TSAN library itself: it has a maximum thread limit. We do pull TSAN somewhat periodically; the last pull was 2020-10-20.
The max thread limit is somewhat more painful for Go because goroutines are lightweight, so people make many more of them than they would C++ threads.

All that being said, there's no way to fix this in the Go project directly. Someone would have to redesign TSAN somehow to handle more threads. It's probably a big lift, but @dvyukov might know more about why that limit exists and how we might increase it.

@randall77 (Contributor) commented Jul 7, 2021

@dvyukov (Member) commented Jul 9, 2021

I just happen to have a redesigned tsan runtime that supports an unlimited number of goroutines (it is also twice as fast and consumes less memory). However, there is still a long road to upstreaming it.
If you feel adventurous enough, you may try:
https://go-review.googlesource.com/c/go/+/333529
or build it yourself from:
dvyukov/llvm-project#4
(only linux/amd64 is supported for now)

@mvdan (Member) commented Jul 9, 2021

Getting rid of the limit entirely would be a far better solution, of course :) My "limit parallelism" suggestion was more of a temporary workaround until that happens.

@mvdan (Member) commented Sep 1, 2021

> However, there is still a long road to upstreaming it.

I just used your experimental CL to help debug a very tricky memory corruption bug at work, where the software spawns tens of thousands of goroutines :)

Really, really looking forward to this new tsan runtime being upstreamed, so Go can ship with it. Is there any way I can help make that happen?

@dvyukov (Member) commented Sep 17, 2021

Just got back from a long vacation.
Thanks, that's good to hear (and also that it did not crash :)).
Re helping: hard to say; it depends on what kind of work you are ready to do. Almost all of the work is in C++/LLVM and relates to rebasing, splitting, productionizing, and reviewing this large change:
https://github.com/dvyukov/llvm-project/pull/3/files

@mvdan (Member) commented Sep 17, 2021

That CL's HEAD did end up failing after a few days, for what it's worth :)

ThreadSanitizer: CHECK failed: sanitizer_common.h:493 "((i)) < ((size_))" (0x100a9fc, 0x79e5ac) (tid=3701513)

@dvyukov (Member) commented Sep 17, 2021

Nothing immediately comes to mind. I would need a reproducer to debug it.

@mvdan (Member) commented Sep 17, 2021

The code is open source, but it's a large daemon that ran for days, so there's no reproducer. Not worth pursuing for now, I think; I wanted to bring it up in case it was helpful on its own.

@seebs (Contributor) commented Oct 18, 2021

I seem to be hitting something which isn't this, but might be vaguely related: sometimes go test -race dies in very strange ways, with tracebacks that always show a corrupted stack going back to an errgroup.Go call or something similar, which I guess is not ultimately all that surprising. One specific test was hitting it every time, so I took the errgroup usage out of that test; that test then passed, but a later test failed the same way.

Nothing even close to a viable reproducer yet, unfortunately. I originally saw this with 1.16.9; with 1.17.2 I get similar failures, but the tracebacks can have 3000 lines of fatal error: runtime*unlock: lock count (only with the dot, not the asterisk).
