The coordinator sometimes restarts unexpectedly, causing build failures and terminating gomote sessions.
The rest of this issue is dedicated to documenting specific instances of when I observed these restarts.
Maintner went down and many builders queued up (see #21383).
The second instance of this that I've seen is @rsc
The last 3 failures were due to coordinator restarting in the middle of handling the requests.
Buildlet logs were also interspersed with non-program builder failures (just
@rsc tried the release again a few days later and everything worked fine, with no changes to the builder pipeline that I am aware of.
Longer term, we should come up with a plan to prevent these failures (backpressure) and also provide a priority queue (e.g., give priority to release builders).
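To make the backpressure and priority idea concrete, here is a minimal sketch in Go. It is not the coordinator's code: `buildRequest`, `boundedQueue`, the priority values, and the builder names are all hypothetical, chosen only for illustration. The queue rejects new work once it is full (backpressure) and always hands out the highest-priority request first, keeping arrival order within a priority.

```go
package main

import (
	"container/heap"
	"errors"
	"fmt"
)

// buildRequest is a hypothetical stand-in for one unit of queued work.
type buildRequest struct {
	name     string
	priority int // higher runs first, e.g. release builds above trybots
	seq      int // arrival order, keeps FIFO order within one priority
}

// requestHeap implements heap.Interface over build requests.
type requestHeap []*buildRequest

func (h requestHeap) Len() int { return len(h) }
func (h requestHeap) Less(i, j int) bool {
	if h[i].priority != h[j].priority {
		return h[i].priority > h[j].priority
	}
	return h[i].seq < h[j].seq
}
func (h requestHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *requestHeap) Push(x interface{}) { *h = append(*h, x.(*buildRequest)) }
func (h *requestHeap) Pop() interface{} {
	old := *h
	n := len(old)
	r := old[n-1]
	old[n-1] = nil
	*h = old[:n-1]
	return r
}

var errQueueFull = errors.New("queue full")

// boundedQueue is a priority queue with a hard capacity.
// It is not safe for concurrent use; a real server would add a mutex.
type boundedQueue struct {
	max  int
	next int
	h    requestHeap
}

func newBoundedQueue(max int) *boundedQueue { return &boundedQueue{max: max} }

// Enqueue applies backpressure: it refuses work instead of growing without bound.
func (q *boundedQueue) Enqueue(name string, priority int) error {
	if q.h.Len() >= q.max {
		return errQueueFull
	}
	heap.Push(&q.h, &buildRequest{name: name, priority: priority, seq: q.next})
	q.next++
	return nil
}

// Dequeue returns the highest-priority request, or false if the queue is empty.
func (q *boundedQueue) Dequeue() (string, bool) {
	if q.h.Len() == 0 {
		return "", false
	}
	return heap.Pop(&q.h).(*buildRequest).name, true
}

func main() {
	q := newBoundedQueue(3)
	q.Enqueue("trybot-1", 0)
	q.Enqueue("release-linux-amd64", 10)
	q.Enqueue("trybot-2", 0)
	if err := q.Enqueue("trybot-3", 0); err != nil {
		fmt.Println("rejected:", err)
	}
	for {
		name, ok := q.Dequeue()
		if !ok {
			break
		}
		fmt.Println(name)
	}
}
```

In this sketch the fourth request is rejected because the queue is full, and the release request is served before the two trybot requests even though it arrived second. Whether to reject, block, or shed low-priority work when full is a design choice the sketch does not settle.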
For cmd/release, I did have a (mental) todo to add support for retries, but it happens infrequently enough that I just retry manually.
Thanks to @kelseyhightower for helping me debug this.
Coordinator had a resource limit of 2Gi memory in deployment config.
Sarah, can you or Kelsey note here how this was debugged, for future reference?