Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/build: coordinator restarts #22042

Closed
adams-sarah opened this issue Sep 26, 2017 · 7 comments

Comments

Projects
None yet
5 participants
@adams-sarah
Copy link
Contributor

commented Sep 26, 2017

Coordinator sometimes restarts unexpectedly, causing build failures and gomote sessions to be terminated.

Trends:

  • I have always observed this under load (many builders building at once such that we are at our max threshold of GCP CPUs, 500)
  • When it does restart, it will happen in bursts (many restarts over a duration of hours).
  • There is a correlation between coordinator restarts and Go releases (usage of golang.org/x/build/cmd/release). Though, that could be because the release tool generates load.
  • There will be no indication in the logs that coordinator restarted due to any program error - there are no panics, or any other indication of any error on the coordinator side. Just start-up logs. Eg:
2017/10/25 18:59:36 Creating Kubernetes pod "buildlet-linux-kubestd-rn68e80e3" for host-linux-kubestd
2017/10/25 18:59:36 Registering reverse buildlet "go-builder-2" (10.240.0.20:46624) for host type host-linux-arm5spacemonkey
2017/10/25 18:59:36 Buildlet go-builder-2/10.240.0.20:46624: {Version:15} for [host-linux-arm5spacemonkey]
# restart here
2017/10/25 18:59:38 coordinator version "d55be8da83e622767c00a19031b29cb701fcc744" starting

The rest of this issue is dedicated to documenting specific instances of when I observed these restarts.

1. 08/11/2017

Maintner went down, many builders queued. (see #21383).
When we got maintner back up, all queued builders tried to run, and many (most?) failed.
Seemed to me that the failure had to be caused by all the builders sharing some resource, and hitting some limit.
@aclements helped me debug a bit; we concluded that we were probably hitting a disk write IOPS limit.
Cannot remember if we observed coordinator restarts.

2. 09/22/2017

The second instance of this that I've seen is @rsc golang.org/x/build/cmd/release (binary named build.release below) failures.

$ ./build.release boring18
...
2017/09/22 16:45:28 linux-amd64: Start.
2017/09/22 16:45:28 linux-amd64: Creating buildlet.
2017/09/22 16:47:13 linux-amd64: Error: Post https://farmer.golang.org:443/buildlet/create: EOF
$ ./build.release boring18
2017/09/22 16:49:32 linux-amd64: Start.
2017/09/22 16:49:32 linux-amd64: Creating buildlet.
2017/09/22 16:49:47 linux-amd64: Error: Post https://farmer.golang.org:443/buildlet/create: dial tcp 107.178.219.46:443: getsockopt: connection timed out
ls: cannot access go1.8.3b4.linux-amd64.tar.gz: No such file or directory
$ ./build.release boring18
2017/09/22 16:52:48 linux-amd64: Start.
2017/09/22 16:52:48 linux-amd64: Creating buildlet.
2017/09/22 16:55:05 linux-amd64: Pushing source to buildlet.
2017/09/22 16:55:22 linux-amd64: Writing VERSION file.
2017/09/22 16:55:22 linux-amd64: Cleaning goroot (pre-build).
2017/09/22 16:55:22 linux-amd64: Building.
2017/09/22 16:59:24 linux-amd64: Error: error copying response: unexpected EOF
$ ./build.release boring18
2017/09/22 17:02:11 linux-amd64: Start.
2017/09/22 17:02:11 linux-amd64: Creating buildlet.
2017/09/22 17:04:24 linux-amd64: Pushing source to buildlet.
2017/09/22 17:04:40 linux-amd64: Writing VERSION file.
2017/09/22 17:04:40 linux-amd64: Cleaning goroot (pre-build).
2017/09/22 17:04:40 linux-amd64: Building.
2017/09/22 17:11:32 linux-amd64: Error: error copying response: unexpected EOF

The last 3 failures were due to coordinator restarting in the middle of handling the requests.
There is no indication that coordinator restarted due to any program error - there are no panics, or any other indication of any error on the coordinator side.

Buildlet logs were also interspersed with non-program builder failures (just exit status 1) at the same time, eg:

13:45:31.000
2017/09/22 20:45:31 {windows-amd64-2008 f2ab41b02d4aa71fe472c56b21f7004071bd42ab } failed: tests failed: dist test failed: go_test:os: exit status 1

13:45:51.000
2017/09/22 20:45:51 [build windows-386-2008 f2ab41b02d4aa71fe472c56b21f7004071bd42ab]: Execution error running go_test:path/filepath on http://10.240.0.71 GCE VM: buildlet-windows-amd64-2008-rnbeb0f3b: buildlet: Client closed (numFails = 1)

13:46:22.000
2017/09/22 20:46:22 failed to extract snapshot for helper 10.0.3.52:80: Post http://10.0.3.52/writetgz?dir=go: EOF

13:46:33.000
2017/09/22 20:46:33 {linux-amd64-alpine 93c869ec1678e2cbe3baf4c2e48f7c3482012ae2 } failed: tests failed: dist test failed: go_test:runtime: exit status 1

13:46:37.000
2017/09/22 20:46:37 [build linux-amd64-alpine 39d24b6b6d3e25b4e3a5ad6c7b79891fc3020319]: Execution error running runtime:cpu124 on http://10.0.5.111 Kube Pod: buildlet-linux-x86-alpine-rn34e9c04: buildlet: Client closed (numFails = 1)

13:46:47.000
2017/09/22 20:46:47 {plan9-amd64-9front 42cb4dcdb59a0be3ad52bf6cf33489301d039e08 } failed: tests failed: dist test failed: go_test:net: exit status: 'go 29782: 1'

@rsc tried again a few days later to release and everything worked fine; no changes to the builder pipeline (that I am aware of).

3. 10/25/2017

timestamps (eastern)
15:06, 14:59, 13:56

shadams$ kubectl get po
NAME                                                  READY     STATUS    RESTARTS   AGE
coordinator-deployment-1785484075-5gcw0               1/1       Running   3          1h

cc @bradfitz @andybons @broady

@gopherbot gopherbot added this to the Unreleased milestone Sep 26, 2017

@gopherbot gopherbot added the Builders label Sep 26, 2017

@broady

This comment has been minimized.

Copy link
Member

commented Sep 26, 2017

More long-term, we should come up with some plan to prevent these failures (backpressure) and also provide some priority queue (e.g., give priority to release builders).

For cmd/release, I did have a (mental) todo to add support for retries, but it happens infrequently enough that I just retry manually.

@bradfitz

This comment has been minimized.

Copy link
Member

commented Sep 26, 2017

More long-term, we should come up with some plan to prevent these failures (backpressure) and also provide some priority queue (e.g., give priority to release builders).

That is #19178.

@adams-sarah adams-sarah changed the title x/build: failures under load x/build: coordinator restarts under load Oct 25, 2017

@adams-sarah adams-sarah changed the title x/build: coordinator restarts under load x/build: coordinator restarts Oct 25, 2017

@adams-sarah

This comment has been minimized.

Copy link
Contributor Author

commented Nov 2, 2017

Thanks to @kelseyhightower for helping me debug this.
It seems coordinator has been restarting b/c running out of memory.

Coordinator had a resource limit of 2Gi memory in deployment config.
We removed the limit in the deployment config in-place (via kubectl edit deployment coordinator-deployment).
I will submit a CL to remove the limit in the deployment config source (once I find it).

@bradfitz

This comment has been minimized.

Copy link
Member

commented Nov 2, 2017

Sarah, can you or Kelsey note here how this was debugged, for future reference?

The limit is at https://github.com/golang/build/blob/master/cmd/coordinator/deployment-prod.yaml#L26

@adams-sarah

This comment has been minimized.

Copy link
Contributor Author

commented Nov 2, 2017

Yes, totally. Good idea.

kubectl get po coordinator-deployment-$HASH -o yaml

then under the status section, the reason for last restart was OOM.

I will submit the change to the prod.yaml and ref this issue thanks Brad!

@gopherbot

This comment has been minimized.

Copy link

commented Nov 2, 2017

Change https://golang.org/cl/75532 mentions this issue: cmd/coordinator: remove memory limits from deployment config

@kevinburke

This comment has been minimized.

Copy link
Contributor

commented Nov 2, 2017

can we also monitor the amount of memory being used by each builder? GCP has a good metrics aggregator, right?

@golang golang locked and limited conversation to collaborators Nov 2, 2018

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
You can’t perform that action at this time.