Add a KEP for graduating PodDisruptionBudget to stable #904
Conversation
> ## Summary
>
> [Pod Disruption Budget (PDB)](https://kubernetes.io/docs/tasks/run-application/configure-pdb/)
> is a Kubernete API that limits the number of pods of a collection that are down simultaneously from voluntary disruptions.
typo: Kubernetes
Fixed
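For readers following along, a minimal PDB manifest looks roughly like this (an illustrative sketch, not taken from the KEP; the object name and labels are made up, and `policy/v1` is the stable API group this KEP proposes graduating to — the API was `policy/v1beta1` at the time of this review):

```yaml
# Illustrative PDB: require that at least 2 pods matching the
# selector stay up during voluntary disruptions (e.g. node drains).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb        # hypothetical name
spec:
  minAvailable: 2          # alternatively maxUnavailable, but not both
  selector:
    matchLabels:
      app: example         # hypothetical label
```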
> #### Mutable PDBs
>
> A mutable PDB object allows its `MinAvailable` and `MaxUnavailable` fields to be
> modified by clients. Components that use PDB must watch such modifications and
what about selector fields? is there a good reason to limit which spec fields can be modified?
Added the selector as well. I am not closely familiar with the internal logic of Disruption controller to say with confidence whether updating the selector (and/or other fields) could cause any issues there, but I don't see a concern after a quick look at the code.
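To illustrate the mutability under discussion: once these fields are mutable, a client can simply re-apply the object with changed values for `minAvailable` and the selector (a hypothetical sketch using made-up names, not text from the KEP):

```yaml
# The same PDB object updated in place: previously immutable
# fields (minAvailable and the selector) are changed by the client.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb        # hypothetical name
spec:
  minAvailable: 5          # raised from 2
  selector:
    matchLabels:
      app: example
      tier: frontend       # selector narrowed; hypothetical label
```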
> the rolling update spec and currently does not take PDB into account. We need to
> change the implementation and use
> `min(PDB.MinAvailable, RollingUpdate.MinAvailable)` and
> `max(PDB.MaxUnavailable, RollingUpdate.MaxUnavailable)` instead.
Do we want deployments looking at PDB objects or using the evict subresource? Just looking at the PDB does not take into account PDB status for selectors that match across deployments, right?
Respecting PDB could easily lead to a situation in which a deployment could not make progress. Given that, is respecting PDB still desired? If so, describe how that state will be detected, communicated, and/or resolved (automatically or manually).
What I had in mind was to only look at the `MinAvailable` and `MaxUnavailable` values and treat those similarly to the `MinAvailable` and `MaxUnavailable` of a rolling update, without making further changes and without using the "evict" subresource. But given your points and reading some of the old comments, I think we should drop this requirement.
> #### Needed Tests
>
> If this is considered a non-optional feature, there should be a conformance test
> for it (this needs to be an e2e test tagged as conformance).
This seems like a good candidate for conformance testing. With an ack from @kubernetes/cncf-conformance-wg I'd make this a more definitive statement
+1
Changed wording to make the test a requirement.
Thanks, @liggitt for your quick review!
> ## Summary
>
> [Pod Disruption Budget (PDB)](https://kubernetes.io/docs/tasks/run-application/configure-pdb/)
@bsalamat @liggitt - I've been thinking a lot about this topic when working on ClusterAPI.
Because the object is very loosely coupled with Pods, and only via label selector, is there any reason why we couldn't have it be more generic and remove the 'Pod' from the resource kind? The link via label denotes the resource that it is tied to.
As a concrete example, we would like to tie disruption budgets to other resources for cluster management (machines).
/cc @ncdc @detiber
For the other scenarios, is delete the standard definition of disruption?
Agree with disruption budgets for machines. There are two orthogonal dimensions and use cases: managing machine upgrades because of organizational or infrastructure policy (and having to create fake pods to represent that with PDB today), and managing impact to applications.
Would just using label selector cause confusion here? Would there be a case where we would want to match both Machines and Pods with the same selector? Or would it be on the consumer to differentiate the type to associate with the label selector?
As to the specific use case around Machines, would it make sense to have the disruption budget exist for the Machines or the underlying Nodes instead? If what we are striving for is to ensure a certain level of capacity is available during an upgrade, then it seems to me that we are more concerned with the Nodes rather than the Machines themselves.
We have protected a pool of machines by running a pod per node with anti-affinity rules set up, and then you can prevent disruption to more than N nodes using that pod as a guard. It's a bit of a hack, but it works. In general, I would prefer we have pod disruption budgets and machine disruption budgets as separate objects; the use cases may vary slightly in practice, similar to having machine sets and replica sets.
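The guard-pod hack described above can be sketched as one pod per node via required anti-affinity, plus a PDB over those pods (a hypothetical reconstruction of the setup, not the commenter's actual manifests; all names and the replica count are assumptions):

```yaml
# One guard pod per node: required anti-affinity on the hostname
# topology key prevents two guards from landing on the same node.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: node-guard                 # hypothetical name
spec:
  replicas: 3                      # one per protected node (assumption)
  selector:
    matchLabels:
      app: node-guard
  template:
    metadata:
      labels:
        app: node-guard
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: node-guard
            topologyKey: kubernetes.io/hostname
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
---
# Draining a node evicts its guard pod, so this budget limits
# concurrent node disruptions to one.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: node-guard-pdb             # hypothetical name
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: node-guard
```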
+1 to having different disruption budgets for pods and nodes for a few reasons:
- PDB is enforced through the `evict` subresource of Pods, which is obviously not applicable to nodes.
- If we used the same API for both objects, we would need to add a field to the API to specify the type of object that it applies to (node vs pod).
- Implementation of a Node Disruption Budget shares little with that of PDB, so we won't save much on the coding side by combining the two APIs either.
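For context on the first point: PDB enforcement happens when an `Eviction` object is POSTed to a pod's `eviction` subresource, and the API server rejects the request if it would violate a matching budget. A sketch of such an eviction body (pod name and namespace are hypothetical; the `policy/v1` version is the GA form, `policy/v1beta1` at the time of this review):

```yaml
# POSTed to /api/v1/namespaces/default/pods/example-pod/eviction;
# the request is rejected with 429 if it would violate a matching PDB.
apiVersion: policy/v1
kind: Eviction
metadata:
  name: example-pod        # hypothetical pod
  namespace: default
```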
> I would prefer we have pod disruption budget, and machine disruption budgets as separate objects, the use cases may vary slightly in practice. similar to having machine sets and replica sets
I'd tend to agree. The concepts are similar, but the application, audience, and scope are different.
PDBs are namespaced, and write access can be granted to users with write access in a namespace.
Machine or Node disruption budgets would be cluster-scoped, and should only be writeable by the cluster admin.
> Machine or Node disruption budgets would be cluster-scoped, and should only be writeable by the cluster admin.
That's not true; they would be namespace-scoped in a management cluster.
FWIW I'm OK if we decide not to rationalize now, but I do think this is an area in the API where there are overlapping concepts, and having N of the same type is also not good.
> FWIW I'm ok if we decide to not rationalize now,
If we think we may change the API at some point, discussing it now is probably better than after the API is GA'ed.
We have workarounds on our side.
I have no major issue with graduating PDB in its current form. Limitations we have encountered recently could be handled via a higher-level controller writing to a PDB, so making them mutable would help a lot. An example scenario was sizing a PDB relative to the number of control plane machines in a cluster, which is easier to do if we support mutability. This is useful if you use daemonsets for some types of workloads but still want to minimize potential disruption to the control plane machines.
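The scenario above could be sketched as a hypothetical higher-level controller keeping a mutable PDB sized relative to the control plane (all names are made up; this is an illustration of the idea, not an existing controller):

```yaml
# Maintained by a hypothetical controller: with N control plane
# machines it sets minAvailable to N - 1, so at most one of the
# daemonset pods running there is disrupted at a time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: control-plane-workload-pdb   # hypothetical name
spec:
  minAvailable: 2                    # e.g. 3 control plane machines - 1
  selector:
    matchLabels:
      app: control-plane-workload    # hypothetical label
```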
> ### Goals
>
> * Plan to promote PDB API to stable version.
> * Propose changes to the deployment controller to take PDB into account.
Where is "taking it into account" described?
I removed that section as it turns out to create complexities which are not easy to address. I'll remove this goal.
Force-pushed from d2515ef to e77d930.
> - [This test](https://github.com/kubernetes/kubernetes/blob/ac56bd502ab96696682c66ebdff94b6e52471aa3/test/integration/scheduler/preemption_test.go#L731)
>   tests effects of PDB on preemption (PDB is honored in a best-effort way)
> * Eviction integration tests
>   - [These tests](https://github.com/kubernetes/kubernetes/blob/master/test/integration/evictions/evictions_test.go) test eviction logic and its interactions with PDB.
Added the test
> - [ ] Ensure that components do not have any logic that relies on immutability
>   of PDBs. For example, if a component builds a cache of pods that match various
>   PDBs, it must add logic to invalidate the cache on updates.
>   - Action Item: sweep for in-tree uses to make sure we’re informer driven and
This should include all repos in kubernetes org, not just k/k. I know Cluster Autoscaler has logic to deal with PDBs (I don't think it will be impacted by this change, but it's worth double checking), I suspect VPA may also use it. Not sure about other projects under kubernetes org.
Added
Force-pushed from e77d930 to aa2d4d4.
I'm looking into adding support for the scale subresource with PDBs, though I'm not sure this would be a requirement for GA. I hope it will be ready for the next release of Kubernetes together with mutable PDBs; then both changes could soak for one release, and we can push PDBs to GA in 1.16. I am interested in taking on the task of getting this to GA.
So we've discussed this as something we want to do. It passes the criteria for approval.
> Kubernetes control plane evicts some of the existing pods, but keeps at least 10
> around. The PDB is updated and states that at least 20 replicas must be kept
> alive. It may appear to an observer that the evictions that happened before the PDB
> update were incorrect, if they don’t notice the PDB update.
We already have this problem before PDBs are mutable.
Immutable PDBs can be deleted and re-created with new spec (20 replicas in this example).
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: bsalamat, janetkuo, kow3ns. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
This is a placeholder PR for the docs for the PDB to GA KEP kubernetes/enhancements#904
See kubernetes/kubernetes#95083 for a potential problem with the debatable treatment of the
/sig apps
/assign @kow3ns @liggitt