x/build: coordinator restarts #22042
The coordinator sometimes restarts unexpectedly, causing build failures and terminating gomote sessions.
The rest of this issue is dedicated to documenting specific instances of when I observed these restarts.
Maintner went down and many builders queued up (see #21383).
The second instance of this that I've seen is @rsc
The last 3 failures were due to the coordinator restarting while it was handling the requests.
Buildlet logs were also interspersed with non-program builder failures (just
@rsc tried again a few days later to release and everything worked fine; no changes to the builder pipeline (that I am aware of).
Longer term, we should come up with a plan to prevent these failures (backpressure) and also provide a priority queue (e.g., give priority to release builders).
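To make the backpressure and priority idea concrete, here is a rough sketch in Go. It is not the coordinator's code: the `scheduler`, `request`, and `errQueueFull` names, the priority values, and the `maxPending` cap are all hypothetical. It bounds the number of pending requests (rejecting new ones when full) and serves higher-priority requests first, FIFO within a priority.

```go
package main

import (
	"container/heap"
	"errors"
	"fmt"
)

// errQueueFull is returned when the pending queue is at capacity (backpressure).
var errQueueFull = errors.New("queue full, try again later")

// request is a hypothetical build request; a higher priority is served first.
type request struct {
	name     string
	priority int
	seq      int // arrival order, used to keep FIFO order within a priority
}

// requestHeap implements heap.Interface over pending requests.
type requestHeap []*request

func (h requestHeap) Len() int { return len(h) }
func (h requestHeap) Less(i, j int) bool {
	if h[i].priority != h[j].priority {
		return h[i].priority > h[j].priority
	}
	return h[i].seq < h[j].seq
}
func (h requestHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *requestHeap) Push(x any)   { *h = append(*h, x.(*request)) }
func (h *requestHeap) Pop() any {
	old := *h
	n := len(old)
	r := old[n-1]
	*h = old[:n-1]
	return r
}

// scheduler holds at most maxPending requests at a time.
type scheduler struct {
	pending    requestHeap
	maxPending int
	nextSeq    int
}

// submit enqueues a request, or rejects it if the queue is already full.
func (s *scheduler) submit(name string, priority int) error {
	if s.pending.Len() >= s.maxPending {
		return errQueueFull
	}
	heap.Push(&s.pending, &request{name: name, priority: priority, seq: s.nextSeq})
	s.nextSeq++
	return nil
}

// next returns the highest-priority pending request, if any.
func (s *scheduler) next() (string, bool) {
	if s.pending.Len() == 0 {
		return "", false
	}
	return heap.Pop(&s.pending).(*request).name, true
}

func main() {
	s := &scheduler{maxPending: 3}
	s.submit("trybot-1", 0)
	s.submit("release-linux-amd64", 10) // release work gets a higher priority
	s.submit("trybot-2", 0)
	if err := s.submit("trybot-3", 0); err != nil {
		fmt.Println("rejected:", err)
	}
	for {
		name, ok := s.next()
		if !ok {
			break
		}
		fmt.Println("running:", name)
	}
}
```

Rejecting at submit time is only one way to apply backpressure; blocking the caller or shedding the lowest-priority pending request are alternatives with different trade-offs.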
For cmd/release, I did have a (mental) todo to add support for retries, but it happens infrequently enough that I just retry manually.
Thanks to @kelseyhightower for helping me debug this.
The coordinator had a memory resource limit of 2Gi in its deployment config.
Sarah, can you or Kelsey note here how this was debugged, for future reference?