x/build/cmd/coordinator: decide what to do about the global VM/pod timeout #52929
Labels
Builders
x/build issues (builders, bots, dashboards)
NeedsInvestigation
Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
The comment of CleanUpOldVMs (and CleanUpOldPodsLoop) includes:
This mechanism requires maintaining a timeout for builds, one that's always "well beyond their expected lifetime". If that stops being true, then, depending on the state of #42699, resources may be wasted due to multiple retries (as happened in #49666 and #52591 in 2021-2022).
Since coordinator knows about all the builds it started, and already deletes builds that it doesn't know about (e.g., because they're left over from a previous instance of coordinator), I don't think a timer is actually needed for that. However, it might still be useful to handle stalls or other unexpected reasons why a build keeps going beyond a "reasonable" timeframe. So maybe we'll always need to maintain such a timeout.
One of the things we can do in either case is add better metrics/monitoring, so we find out when normal builds start to get dangerously close to the limit before it starts to cause problems.
CL 406216 increased the global timeout for builds from 45 mins to 2 hours to accommodate longtest builders, and this is the tracking issue to figure out what we want to do in this space long term. (Possibly simply bump it up from 2 hours if some builds need even longer in the future.)
CC @golang/release.