runtime: newosproc doesn't handle clone returning EAGAIN #49438

Open
asuffield opened this issue Nov 8, 2021 · 1 comment

@asuffield asuffield commented Nov 8, 2021

What version of Go are you using (go version)?

$ go version
go version go1.16.7 linux/amd64

Does this issue reproduce with the latest release?

I haven't tried, but inspection of the code says it will.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOHOSTARCH="amd64"
GOHOSTOS="linux"

(With apologies for pruning)

What did you do?

I don't have a reproduction case for this one - it's very sensitive to something I haven't pinned down yet - but I have uncovered the nature of the bug via inspection.

Go programs running on sufficiently loaded systems sometimes crash with "runtime: failed to create new OS thread (have 2 already; errno=11)". The really interesting thing here is errno=11, which is EAGAIN. The Linux manpages for fork/clone attribute EAGAIN to system limits; I have verified that no such limit is being hit in my scenario. At this point I said to myself (more than once): fork/clone aren't restartable syscalls, surely they can't actually return EAGAIN. Then I started doubting myself and went looking.

It turns out that Linux can and does return EAGAIN in some circumstances which are entirely undocumented in the manpages. The key code path ends up here:

https://elixir.bootlin.com/linux/v5.15.1/source/kernel/fork.c#L1523

And starts out over here:

https://elixir.bootlin.com/linux/v5.15.1/source/fs/exec.c#L1581

Which eventually led me back to this thread:

https://lore.kernel.org/lkml/20090329005343.GA12139@redhat.com/

It appears Linux has been willing to return EAGAIN to fork/clone for over a decade now, which means this code needs to handle that case somehow:

go/src/runtime/os_linux.go

Lines 167 to 173 in a97c527

	if ret < 0 {
		print("runtime: failed to create new OS thread (have ", mcount(), " already; errno=", -ret, ")\n")
		if ret == -_EAGAIN {
			println("runtime: may need to increase max user processes (ulimit -u)")
		}
		throw("newosproc")
	}

It is super unfortunate that the rlimit scenario also returns EAGAIN, but I don't see any solution other than retrying a few times before panicking. Maybe there's something I haven't fully understood here; I'll admit I haven't pieced together exactly what's happening. The only thing I'm fully confident of is: there is some way in which Go processes can crash with an EAGAIN returned from clone() which isn't caused by rlimits.

@ianlancetaylor ianlancetaylor changed the title newosproc doesn't handle clone() returning EAGAIN runtime: newosproc doesn't handle clone returning EAGAIN Nov 8, 2021
@ianlancetaylor ianlancetaylor (Contributor) commented Nov 8, 2021

Note: for the cgo case we use a loop with an increasing delay to handle pthread_create returning EAGAIN; see #18146 and https://go.googlesource.com/go/+/refs/heads/master/src/runtime/cgo/gcc_libinit.c#91. We could certainly do the same thing for the non-cgo case, which is what you are describing. It would be nice to have a test case.
