x/build/cmd/coordinator: decide what to do about the global VM/pod timeout #52929

Open
dmitshur opened this issue May 16, 2022 · 2 comments
Labels
Builders x/build issues (builders, bots, dashboards) NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

@dmitshur
Contributor

The comment of CleanUpOldVMs (and CleanUpOldPodsLoop) includes:

This is the safety mechanism to delete VMs which stray from the
normal deleting process. VMs are created to run a single build and
should be shut down by a controlling process. Due to various types
of failures, they might get stranded. To prevent them from getting
stranded and wasting resources forever, we instead set the
"delete-at" metadata attribute on them when created to some time
that's well beyond their expected lifetime.

This mechanism requires maintaining a timeout for builds, one that's always "well beyond their expected lifetime". If that stops being true, then depending on the state of #42699, resources may be wasted due to multiple retries (as happened in #49666 and #52591 in 2021-2022).
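To make the "delete-at" idea concrete, here is a minimal sketch of the expiry check. It is not the coordinator's actual code: the `instance` type, the `deleteAt` and `expired` helpers, and the choice to store the deadline as Unix seconds are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// timeout is the global limit discussed in this issue (2 hours after CL 406216).
const timeout = 2 * time.Hour

// created is an arbitrary creation time used by the example.
var created = time.Date(2022, 5, 16, 12, 0, 0, 0, time.UTC)

// instance is a hypothetical stand-in for a VM or pod record.
type instance struct {
	Name     string
	Metadata map[string]string
}

// deleteAt returns the metadata value to attach at creation time.
// Assumption: the deadline is encoded as Unix seconds.
func deleteAt(created time.Time, timeout time.Duration) string {
	return strconv.FormatInt(created.Add(timeout).Unix(), 10)
}

// expired reports whether inst has a parseable "delete-at" value that is
// at or before now. Instances without one are left alone by this check.
func expired(inst instance, now time.Time) bool {
	v, ok := inst.Metadata["delete-at"]
	if !ok {
		return false
	}
	sec, err := strconv.ParseInt(v, 10, 64)
	if err != nil {
		return false
	}
	return !now.Before(time.Unix(sec, 0))
}

func main() {
	vm := instance{
		Name:     "hypothetical-buildlet",
		Metadata: map[string]string{"delete-at": deleteAt(created, timeout)},
	}
	fmt.Println(expired(vm, created.Add(timeout/2))) // false
	fmt.Println(expired(vm, created.Add(timeout)))   // true
}
```

The sketch also shows the failure mode: a healthy build that runs longer than `timeout` looks exactly like a stranded one.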

Since coordinator knows about all the builds it started, and already deletes builds that it doesn't know about (e.g., because they're left over from a previous instance of coordinator), I don't think a timer is actually needed for that. However, it might still be useful to handle stalls or other unexpected reasons why a build keeps going beyond a "reasonable" timeframe. So maybe we'll always need to maintain such a timeout.
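For illustration, the timer-free alternative amounts to a set difference between what exists and what the coordinator is tracking. The sketch below assumes instances are identified by name; `strays` and the instance names are hypothetical, not part of x/build.

```go
package main

import (
	"fmt"
	"sort"
)

// strays returns the names in existing that are absent from known,
// in sorted order. These are the candidates for deletion.
func strays(existing []string, known map[string]bool) []string {
	var out []string
	for _, name := range existing {
		if !known[name] {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// known holds the builds the current coordinator instance started.
	known := map[string]bool{"buildlet-a": true}
	existing := []string{"buildlet-c", "buildlet-a", "buildlet-b"}
	fmt.Println(strays(existing, known)) // [buildlet-b buildlet-c]
}
```

A check like this cannot catch a known build that has stalled, which is the case where a timeout would still help.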

One of the things we can do in either case is add better metrics/monitoring, so we find out when normal builds start to get dangerously close to the limit before it starts to cause problems.
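One possible shape for such monitoring is a check that flags builds whose duration has used up a chosen fraction of the limit. This is only a sketch: `nearLimit`, the 75% threshold, and the builder names and durations are made up for the example.

```go
package main

import (
	"fmt"
	"time"
)

// globalTimeout is the limit discussed in this issue (2 hours after CL 406216).
const globalTimeout = 2 * time.Hour

// nearLimit reports whether a build that ran for d has consumed at least
// frac (for example 0.75) of timeout.
func nearLimit(d, timeout time.Duration, frac float64) bool {
	return float64(d) >= frac*float64(timeout)
}

func main() {
	// Hypothetical builders and durations, for illustration only.
	builds := []struct {
		name string
		d    time.Duration
	}{
		{"hypothetical-longtest", 100 * time.Minute},
		{"hypothetical-short", 12 * time.Minute},
	}
	for _, b := range builds {
		if nearLimit(b.d, globalTimeout, 0.75) {
			fmt.Printf("%s took %v, close to the %v limit\n", b.name, b.d, globalTimeout)
		}
	}
}
```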

CL 406216 increased the global timeout for builds from 45 minutes to 2 hours to accommodate longtest builders, and this is the tracking issue to figure out what we want to do in this space long term. (Possibly just bump it beyond 2 hours if some builds need even longer in the future.)

CC @golang/release.

@dmitshur dmitshur added Builders x/build issues (builders, bots, dashboards) NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels May 16, 2022
@dmitshur dmitshur added this to the Unreleased milestone May 16, 2022
@gopherbot
Contributor

Change https://go.dev/cl/406216 mentions this issue: cmd/coordinator: consolidate and increase global VM deletion timeout

gopherbot pushed a commit to golang/build that referenced this issue May 16, 2022
We had a lot of flexibility over timeouts, making their maintenance
harder. Consolidate it to a single timeout in the pool package, and
modify it from 45 minutes to 2 hours.

There's room for improvement in how we maintain this timeout,
but I'm leaving that for future work (with a tracking issue).

Fixes golang/go#52591.
Updates golang/go#52929.
Updates golang/go#49666.
Updates golang/go#42699.

Change-Id: I2ad92648d89a714397bd8b0e1ec490fc9f6d6790
Reviewed-on: https://go-review.googlesource.com/c/build/+/406216
Run-TryBot: Dmitri Shuralyov <dmitshur@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Reviewed-by: Carlos Amedee <carlos@golang.org>
Reviewed-by: Heschi Kreinick <heschi@google.com>
@bcmills
Contributor

bcmills commented May 16, 2022

FWIW, the builder triage process I'm currently using imposes a natural limit on builder time: the interval between a CL being submitted and its triage being performed.

I've been using the day boundary as the triage cutoff, and I think the timestamps that fetchlogs uses are in UTC. I'm in UTC-4 and I don't start triage until at least 9AM local time, so that gives a natural limit of ~13h (less scheduling latency) before the tests for a CL committed just before midnight would start to overrun the triage window.
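The ~13h figure can be checked with a short calculation. The sketch assumes the cutoff is midnight UTC and triage starts on the hour in a fixed-offset zone; `triageWindow` is a hypothetical helper and the date is arbitrary.

```go
package main

import (
	"fmt"
	"time"
)

// triageWindow returns the time between the UTC day boundary and the start
// of triage at localHour in a zone utcOffsetHours away from UTC.
func triageWindow(localHour, utcOffsetHours int) time.Duration {
	zone := time.FixedZone("local", utcOffsetHours*3600)
	cutoff := time.Date(2022, 5, 17, 0, 0, 0, 0, time.UTC) // day boundary
	triage := time.Date(2022, 5, 17, localHour, 0, 0, 0, zone)
	return triage.Sub(cutoff)
}

func main() {
	// 9AM in UTC-4 is 13:00 UTC, so the window is 13 hours.
	fmt.Println(triageWindow(9, -4)) // 13h0m0s
}
```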

@dmitshur dmitshur changed the title x/build/cmd/coordinator: decide what to do about a global build time limit x/build/cmd/coordinator: decide what to do about the global VM/pod timeout May 18, 2022