DisruptionBudget object to define the max disruption that can be caused to pods #12611
Comments
Copying @bgrant0607's responses to my questions:
Last sentence is an excellent point. I think that policy makes a lot of sense, so maybe we wouldn't want Deployment to ever consult the disruption budget. Anyway, not something we need to decide now.
cc/ @thockin (you mentioned something related to this in a meeting today)
cc @mml
Re: not counting self-inflicted wounds against the budget: Yes, but we do still want to count unplanned failures, such as a machine panic or crash. We need to be able to distinguish these. "Budget" is a good metaphor for one of the things we usually care about around SLO enforcement: evictions/time. We can model these with a token bucket. "Budget" might be a bad metaphor for binary conditions. One example is shard strength. Do we have a good handle on the kinds of SLOs users want to see, or cluster admins want to offer?
Not really. Deployment already has maxUnavailable, and I assume we want that. Presumably we want evictions/time as well, as you mentioned. I'm not sure we need anything else, at least for now.
It occurred to me while working on #17393 that we should define and enforce an upper bound on the minimum time between evictions, not just so we can reasonably make progress, but also so we don't end up in a degenerate case where the drain controller is trying to establish/maintain 2 years' worth of disruption history. There might be other cases where we need boundaries to make implementation feasible.
The issue of control objects with selectors that overlap comes up again and It came up again on network policies - how do I apply policies to sets of Obviously, this doesn't work for RCs and similar.
@mml Forgiveness is similar; it also needs a rate. @thockin Required label keys could also be used to ensure non-overlap: #15390. However, whether pod specifies policy or vice versa depends on which scenarios we want to facilitate. Related policy discussion: #17097
/cc me & @ihmccreery
We really need to collect user feedback about how much flexibility is needed here. If a cluster-wide policy is adequate, it can probably be configured via flags, and it will be easier to implement than the fully general idea of a DisruptionBudget object with selectors.
If these are global (or per-namespace) policies as you are advocating, then you presumably need to express them as a fraction of collection size, since a ReplicaSet or Job or whatever with 100 pods can probably tolerate more simultaneous pods down and higher eviction rate than one with 2 pods. Even as a fraction of collection size I'm not sure you can set a good global number, though. At the very least, I suspect some collections will want to opt out of the policy. Also, we need to track the current state in order to enforce any kind of policy (e.g. how many pods have been disrupted from this collection during the last N minutes). Assuming we want that to be persisted, we need an object anyway, so it seems reasonable to add to the controllers a pointer to a DisruptionBudget object, that contains both the Spec (disruption rate limit and max simultaneously down pods) and the Status (recent disruptions).
Automatic merge from submit-queue
Followup fixes for disruption controller. Part of #12611.
- Record an event when a pod does not have exactly 1 controller.
- Add TODO comment suggesting we simplify the two cases: integer and percentage.
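The integer and percentage cases mentioned in the merge note can be folded into one code path by resolving both to an absolute pod count first. The sketch below is an illustration only: `ResolveBudget` is a hypothetical helper that takes the value as a string such as `"3"` or `"25%"`, and rounding percentages down is an assumption rather than the controller's documented behavior.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ResolveBudget converts an integer ("3") or percentage ("25%") value into an
// absolute number of pods for a collection of the given size.
func ResolveBudget(value string, collectionSize int) (int, error) {
	if strings.HasSuffix(value, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(value, "%"))
		if err != nil {
			return 0, fmt.Errorf("invalid percentage %q: %w", value, err)
		}
		if pct < 0 || pct > 100 {
			return 0, fmt.Errorf("percentage %q out of range 0-100", value)
		}
		return collectionSize * pct / 100, nil // rounds down
	}
	n, err := strconv.Atoi(value)
	if err != nil {
		return 0, fmt.Errorf("invalid integer %q: %w", value, err)
	}
	if n < 0 {
		return 0, fmt.Errorf("value %q must not be negative", value)
	}
	return n, nil
}

func main() {
	for _, v := range []string{"3", "25%", "abc"} {
		n, err := ResolveBudget(v, 8)
		fmt.Println(v, n, err)
	}
}
```

Once both forms resolve to a count, the rest of the controller logic only needs to handle the integer case.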
This is for kubernetes#12611.
@davidopp can you take a look and bump to v1.5 if not needed for 1.4?
This is just waiting on docs.
@davidopp what exactly is missing here to document? Devel documents how to implement other DB types (mentioning we handle Has anyone started working on that?
@mml What remains to be done here?
I think this is done.
Forked from #12236.
@davidopp suggested introducing a DisruptionBudget object in #12236 (comment):
In Deployment, we are currently using a maxUnavailable field. We can eventually switch to using a DisruptionBudget.
cc @bgrant0607