runtime: failed to create new OS thread #19163

Closed · cherrymui opened this issue Feb 17, 2017 · 11 comments · 6 participants

cherrymui (Contributor) commented Feb 17, 2017

There are test failures on the builder dashboard at various places on various machines with:

runtime: failed to create new OS thread (have 9 already; errno=11)
runtime: may need to increase max user processes (ulimit -u)
fatal error: newosproc

runtime stack:
runtime.throw(0x818701e, 0x9)
	/tmp/workdir/go/src/runtime/panic.go:596 +0x7c
runtime.newosproc(0x18538c80, 0x186fe000)
	/tmp/workdir/go/src/runtime/os_linux.go:163 +0x15f
runtime.newm(0x0, 0x18518000)
	/tmp/workdir/go/src/runtime/proc.go:1614 +0xf9
runtime.startm(0x18518000, 0x0)
	/tmp/workdir/go/src/runtime/proc.go:1684 +0x141
runtime.handoffp(0x18518000)
	/tmp/workdir/go/src/runtime/proc.go:1711 +0x49
runtime.retake(0xe65e8051, 0x1a5e8c, 0x0)
	/tmp/workdir/go/src/runtime/proc.go:3860 +0x10e
runtime.sysmon()
	/tmp/workdir/go/src/runtime/proc.go:3787 +0x272
runtime.mstart1()
	/tmp/workdir/go/src/runtime/proc.go:1166 +0xec
runtime.mstart()
	/tmp/workdir/go/src/runtime/proc.go:1136 +0x4d

The earliest seems to be https://build.golang.org/log/1a62fd950384d62c1782922f55fbaf194691d126, on my commit 98061fa. But I don't know how that CL could be related. Should I revert it?
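For illustration only (not taken from any builder test): the stack above is sysmon's retake → handoffp → startm path asking for a new M when no idle one exists, which is why newosproc is called. A minimal sketch of how an ordinary Go program can drive that path until the per-user process limit (ulimit -u) is hit: each goroutine that locks its OS thread and then blocks keeps that thread to itself, so the scheduler has to create a new thread for the remaining runnable work, and a low enough limit eventually makes that creation fail with errno 11 (EAGAIN).

package main

import (
	"runtime"
	"time"
)

func main() {
	// Each of these goroutines pins its own OS thread and then parks forever,
	// so that thread can never be reused to run other goroutines.
	for i := 0; i < 5000; i++ {
		go func() {
			runtime.LockOSThread()
			select {}
		}()
	}
	// Keep the process alive while the runtime keeps spawning threads for the
	// remaining work; with a low enough `ulimit -u`, newosproc fails as above.
	time.Sleep(time.Minute)
}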

ianlancetaylor (Contributor) commented Feb 17, 2017

This kind of error usually indicates that the system is overloaded, conceivably by some other test running in parallel. However, in this case it can't be running in parallel with the test that I know can cause these kinds of problems, which is misc/cgo/test/issue18146.go. I don't know what is going on here.

bradfitz (Member) commented Feb 17, 2017

I see no evidence that this is the fault of the builders.

The Kubernetes configuration hasn't changed (same node count and size, same pod limits). No pod leaks I can see. No new master or node versions. No logged errors.

Unless one particularly bad CL was running on a trybot, consuming threads like crazy, and Kubernetes' isolation between containers isn't good enough to contain it. But I'm not sure we keep enough logs (or enough association between build-failure logs and GKE logs) to prove that.

/cc @rsc (who mentioned this to me on chat)

bradfitz (Member) commented Feb 17, 2017

I see that Kubernetes doesn't seem to support setting rlimits (kubernetes/kubernetes#3595). So maybe we did just have one bad build somewhere impacting other builds.

Looks like I can modify the builders to set their own limits early in the build to prevent bad builds from impacting other pods.
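A minimal sketch of what such a self-imposed cap could look like (an assumption, not the actual buildlet change; it assumes Linux and that golang.org/x/sys/unix exposes RLIMIT_NPROC): lower the soft and hard process limits before running anything else, so a thread-leaking build trips over its own limit instead of exhausting the node-wide per-user one.

package main

import (
	"log"
	"os"
	"os/exec"

	"golang.org/x/sys/unix"
)

func main() {
	// Cap how many processes/threads this build and its children may create.
	// The rlimit is inherited across fork/exec, so tests started by the build
	// are covered too. 2048 is an arbitrary example value.
	lim := unix.Rlimit{Cur: 2048, Max: 2048}
	if err := unix.Setrlimit(unix.RLIMIT_NPROC, &lim); err != nil {
		log.Fatalf("Setrlimit(RLIMIT_NPROC): %v", err)
	}

	// Hypothetical: run the real build under the reduced limit.
	cmd := exec.Command("go", "tool", "dist", "test")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}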

bradfitz (Member) commented Feb 19, 2017

I kicked off a Go 1.8 trybot run and it also failed on the GKE builders, so I think our GKE nodes are just screwed up somehow.

I don't see any leaked pods, though, and I haven't changed anything about the builders that should affect the GKE builders recently.

I tried to SSH into the GKE nodes via the GCP web UI, and the SSH connection failed.

I tried to kubectl proxy to see their web UI (using the GCP web UI instructions) and I got auth errors. I tried again after updating my gcloud components, but got the same results.

So, I have zero visibility into what is happening on the 5 nodes of the GKE cluster, other than listing their pods and such and seeing that they look fine.

Maybe some system pod or other daemon went crazy and leaked a bunch of threads?

In any case, I have to reboot them anyway, so I'm just updating from GKE 1.4.6 to GKE 1.5.2 (using the GCP web UI option), since bug reports against the Kubernetes/GKE teams will probably be better received if I'm using the latest version.

We'll see if this does anything.

bradfitz (Member) commented Feb 19, 2017

GKE master is updated to 1.5.2.

The 5 nodes are half done updating from 1.4.6 to 1.5.2 (2 done, 1 rebooting, 2 still on 1.4.6).

bradfitz (Member) commented Feb 19, 2017

The master and all five of the n1-standard-32 nodes are now on 1.5.2.

Wait and see now, I guess.

bradfitz (Member) commented Mar 10, 2017

Another instance of this bug, but on Windows:

https://storage.googleapis.com/go-build-log/c984be4c/windows-amd64-gce_664bd878.log

Note that Windows machines are new VMs (with no prior state) per build, so they should not be overloaded or stale or have stray processes running.

bradfitz added this to the Go1.9 milestone Mar 10, 2017

alexbrainman (Member) commented Apr 2, 2017

More:
https://storage.googleapis.com/go-build-log/a9fae47f/windows-386-gce_e4a6dc6c.log

I wonder if this is actually #18253 instead?
In fact #18253 should be fixed now (by CL 34616) ...

Alex

aclements (Member) commented Jun 7, 2017

There haven't been any linux/* failures since Brad kicked the builders, and the Windows failures are almost certainly #18253 based on the errno, so I believe this is fixed.

aclements closed this Jun 7, 2017

golang locked and limited conversation to collaborators Jun 7, 2018
