
DisruptionBudget object to define the max disruption that can be caused to pods #12611

Closed
nikhiljindal opened this issue Aug 12, 2015 · 24 comments
Labels: area/api, area/node-lifecycle, kind/design, priority/important-soon

Comments

@nikhiljindal
Contributor

Forked from #12236.
@davidopp suggested introducing a DisruptionBudget object in #12236 (comment):

will it be possible to construct enforceable SLOs if each controller that can cause disruptions has a separate disruption budget spec? (An example of another such controller is rescheduler, but I'm sure there will be others, e.g. maybe one that takes machines offline for kernel upgrades.)

It seems that we may want a new object that lets you define arbitrary collections of Pods (using label selectors) and specifies a corresponding disruption budget (maxUnavailable like you have here, max rate of churn, etc.). The Spec for this object would be the budget, and the Status would be the number currently down, the current churn rate, or whatever. This object could be owned by a service that controllers ask for permission before they initiate a disruption, or the logic for checking whether it is OK to disrupt (and updating the status) could be implemented in a library used by all the controllers that cause disruptions.
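The object sketched above might look roughly like the following. This is a hypothetical shape for illustration only, not the API that was eventually merged; all type and field names (`DisruptionBudget`, `MaxEvictionsPerHour`, `DisruptionAllowed`, etc.) are invented here:

```go
package main

import "fmt"

// DisruptionBudgetSpec is a hypothetical sketch of the budget itself:
// which pods it covers and how much disruption they can tolerate.
type DisruptionBudgetSpec struct {
	Selector            map[string]string // label selector for the covered pods
	MaxUnavailable      int               // max pods that may be down at once
	MaxEvictionsPerHour int               // max rate of churn
}

// DisruptionBudgetStatus tracks observed disruption, so controllers
// can consult it before initiating a new disruption.
type DisruptionBudgetStatus struct {
	CurrentUnavailable int // pods currently down
	RecentEvictions    int // evictions in the current window
}

type DisruptionBudget struct {
	Spec   DisruptionBudgetSpec
	Status DisruptionBudgetStatus
}

// DisruptionAllowed is the check a controller (or shared library)
// would make before voluntarily disrupting a covered pod.
func (db *DisruptionBudget) DisruptionAllowed() bool {
	return db.Status.CurrentUnavailable < db.Spec.MaxUnavailable &&
		db.Status.RecentEvictions < db.Spec.MaxEvictionsPerHour
}

func main() {
	db := DisruptionBudget{
		Spec:   DisruptionBudgetSpec{MaxUnavailable: 1, MaxEvictionsPerHour: 10},
		Status: DisruptionBudgetStatus{CurrentUnavailable: 1},
	}
	fmt.Println(db.DisruptionAllowed()) // false: one pod is already down
}
```

Either ownership model from the quote (a central service, or a library linked into every disrupting controller) would funnel disruptions through a check like `DisruptionAllowed` and then update the Status.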

How will DisruptionBudget in Deployment work? What if a DisruptionBudget (or whatever) object already exists that covers (some of) these Pods? Does it override the one specified in Deployment (i.e. do you only create a DisruptionBudget if one doesn't already exist)? Is it an error to specify a disruption budget in Deployment if there is a DisruptionBudget object for (some of) the Pods covered by the Deployment object?

In Deployment, we are currently using a maxUnavailable field. We can eventually switch to using a DisruptionBudget.

cc @bgrant0607

@nikhiljindal nikhiljindal added area/api Indicates an issue on api area. team/control-plane kind/design Categorizes issue or PR as related to design. labels Aug 12, 2015
@bgrant0607 bgrant0607 added team/api priority/backlog Higher priority than priority/awaiting-more-evidence. labels Aug 12, 2015
@davidopp
Member

Copying @bgrant0607 's responses to my questions:

General issue of how to deal with conflicting overlapping sets, similar to multiple controllers. Added to my list of API conventions to document.

Generation of the DisruptionBudget should be optional, though, in case someone deliberately wants to create it separately.

FWIW, internally, self-inflicted disruptions (updates, crash loops, etc.) are not counted against the infrastructure disruption budget.

Last sentence is an excellent point. I think that policy makes a lot of sense, so maybe we wouldn't want Deployment to ever consult the disruption budget. Anyway, not something we need to decide now.

@davidopp
Member

davidopp commented Sep 2, 2015

cc/ @thockin (you mentioned something related to this in a meeting today)

@bgrant0607
Member

cc @mml

@mml
Contributor

mml commented Nov 4, 2015

Re: not counting self-inflicted wounds against the budget: Yes, but we do still want to count unplanned failures, such as a machine panic or crash. We need to be able to distinguish these.

"Budget" is a good metaphor for one of the things we usually care about around SLO enforcement: evictions/time. We can model these with a token bucket.

"Budget" might be a bad metaphor for binary conditions. One example is shard strength.

Do we have a good handle on the kinds of SLOs users want to see, or cluster admins want to offer?

@davidopp
Member

davidopp commented Nov 7, 2015

Do we have a good handle on the kinds of SLOs users want to see, or cluster admins want to offer?

Not really. Deployment already has maxUnavailable, and I assume we want that. Presumably we want evictions/time as well, as you mentioned. I'm not sure we need anything else, at least for now.

@bgrant0607
Member

Forgiveness is related: #1574

So is service-level readiness, which was discussed in the Component Registration API context #13216.

@mml
Contributor

mml commented Nov 21, 2015

It occurred to me while working on #17393 that we should define and enforce an upper bound on the minimum time between evictions, not just so we can reasonably make progress, but also so we don't end up in a degenerate case where the drain controller is trying to establish/maintain two years' worth of disruption history.

There might be other cases where we need boundaries to make implementation feasible.

@thockin
Member

thockin commented Nov 21, 2015

The issue of control objects with selectors that overlap comes up again and again, and I am not really satisfied with (my understanding of) the answer.

It came up again on network policies: how do I apply policies to sets of pods and know that the policies are not in conflict? It seems easier to have pods declare which policy they belong to rather than policies declare which pods they apply to.

Obviously, this doesn't work for RCs and similar.

@bgrant0607
Member

@mml Forgiveness is similar; it also needs a rate.

@thockin
For controllers the tie-breaking proposal is controllerRef: #2210 (comment), #14961

Required label keys could also be used to ensure non-overlap: #15390

However, whether pod specifies policy or vice versa depends on which scenarios we want to facilitate. Related policy discussion: #17097

@roberthbailey
Contributor

/cc me & @ihmccreery

@davidopp davidopp added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed priority/backlog Higher priority than priority/awaiting-more-evidence. labels Feb 24, 2016
@davidopp davidopp added this to the next-candidate milestone Feb 24, 2016
@mml
Contributor

mml commented Feb 24, 2016

We really need to collect user feedback about how much flexibility is needed here. If a cluster-wide policy is adequate, it can probably be configured via flags, and it will be easier to implement than the fully general idea of a DisruptionBudget object with selectors.

@davidopp
Member

If these are global (or per-namespace) policies as you are advocating, then you presumably need to express them as a fraction of collection size, since a ReplicaSet or Job or whatever with 100 pods can probably tolerate more simultaneous pods down and higher eviction rate than one with 2 pods. Even as a fraction of collection size I'm not sure you can set a good global number, though. At the very least, I suspect some collections will want to opt out of the policy.

Also, we need to track the current state in order to enforce any kind of policy (e.g. how many pods have been disrupted from this collection during the last N minutes). Assuming we want that state to be persisted, we need an object anyway, so it seems reasonable to add to the controllers a pointer to a DisruptionBudget object that contains both the Spec (disruption rate limit and max simultaneously-down pods) and the Status (recent disruptions).

mml added a commit to mml/kubernetes that referenced this issue Aug 2, 2016
mml added a commit to mml/kubernetes that referenced this issue Aug 16, 2016
mml added a commit to mml/kubernetes that referenced this issue Aug 16, 2016
k8s-github-robot pushed a commit that referenced this issue Aug 17, 2016
Automatic merge from submit-queue

Implement DisruptionController.

Part of #12611

This currently also includes a pending commit from #25895
k8s-github-robot pushed a commit that referenced this issue Aug 21, 2016
Automatic merge from submit-queue

Followup fixes for disruption controller.

Part of #12611.
- Record an event when a pod does not have exactly 1 controller.
- Add TODO comment suggesting we simplify the two cases: integer and percentage.
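The two cases mentioned in that TODO (an integer count vs. a percentage of collection size, the fractional form @davidopp argued for above) both reduce to an absolute count once the number of replicas is known. A hypothetical sketch of that resolution; the rounding direction here is an illustrative choice, not a claim about what the merged controller does:

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// resolveMaxUnavailable turns either an integer ("2") or a percentage
// ("25%") into an absolute pod count for a collection of `replicas`
// pods. Percentages are rounded down in this sketch, treating the
// budget as a cap on disruption rather than a guarantee of progress.
func resolveMaxUnavailable(value string, replicas int) (int, error) {
	if strings.HasSuffix(value, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(value, "%"))
		if err != nil {
			return 0, err
		}
		return int(math.Floor(float64(replicas) * float64(pct) / 100.0)), nil
	}
	return strconv.Atoi(value)
}

func main() {
	n, _ := resolveMaxUnavailable("25%", 10)
	fmt.Println(n) // prints: 2
	n, _ = resolveMaxUnavailable("3", 10)
	fmt.Println(n) // prints: 3
}
```

Normalizing to an integer at one boundary is the simplification the TODO suggests: everything downstream of the resolution only ever sees a count.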
mml added a commit to mml/kubernetes that referenced this issue Aug 22, 2016
@goltermann
Contributor

@davidopp can you take a look and bump to v1.5 if not needed for 1.4?

@davidopp
Copy link
Member

davidopp commented Sep 6, 2016

This is just waiting on docs.

@pwittrock pwittrock modified the milestones: v1.4-nonblocking, v1.4 Sep 8, 2016
@ingvagabund
Contributor

@davidopp what exactly is missing here to document? A devel doc describing how to implement other DB types (mentioning we handle MinAvailable for the moment), with an explanation of what a DB is (referring to the rescheduling docs plus additional notes) and how it works based on the current API & controller implementation? Plus comments from related issues/PRs about current limitations, how PodDisruptionAllowed is set, and how it is (about to be) consumed by the /eviction subresource?

Has anyone started working on that?

@bgrant0607
Member

@mml What remains to be done here?

@bgrant0607
Member

ref #35318, #38336

@bgrant0607 bgrant0607 added the area/node-lifecycle Issues or PRs related to Node lifecycle label Mar 23, 2017
@mml
Contributor

mml commented May 31, 2017

I think this is done.
