-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-3017: Promote unhealthyPodEvictionPolicy for PDBs to Beta #3777
KEP-3017: Promote unhealthyPodEvictionPolicy for PDBs to Beta #3777
Conversation
atiratree
commented
Jan 24, 2023
- One-line PR description: Promote unhealthyPodEvictionPolicy for PDBs to Beta
- Issue link: PodHealthyPolicy for PodDisruptionBudget #3017
- Other comments:
7a21d77
to
2cde840
Compare
/assign @ravisantoshgudimetla |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall lgtm
- As `IfHealthyBudget` is generally the less desirable option, it would be good to inform users that they are using it. | ||
We can achieve this by defaulting `spec.unhealthyPodEvictionPolicy` to a `IfHealthyBudget` value. | ||
This might prompt users to re-evaluate the value and set it to `AlwaysAllow` which is preferable for most of the applications. | ||
The default value should preserve original behavior and handle null values. This should be tested and documented. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm on the fence whether to default the field or not... I'm not sure how much it would actually inform/drive user awareness. @soltysh, thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if defaulting will help with the adoption or rise awareness, @wojtek-t also pointed out that here that it might cause issues for existing CI/CD flows where users have predefined PDBs with nil values for those fields which would cause them being pushed to the server. I'm inclined to leave it as is, for now.
Do we have a precedence before for changing defaults in similar situations? If we decide to do that, I'd probably go the path of adding a warning for nil values and setting the default in the next release, to give users one release to adopt, maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's plenty of precedence for defaulting of newly introduced fields, conditional on their feature being enabled, which is what is being considered here:
git grep 'Gate.Enabled' -- 'pkg/apis/*defaults.go'
pkg/apis/apps/v1/defaults.go: if utilfeature.DefaultFeatureGate.Enabled(features.MaxUnavailableStatefulSet) {
pkg/apis/apps/v1/defaults.go: if utilfeature.DefaultFeatureGate.Enabled(features.StatefulSetAutoDeletePVC) {
pkg/apis/apps/v1beta1/defaults.go: if utilfeature.DefaultFeatureGate.Enabled(features.StatefulSetAutoDeletePVC) {
pkg/apis/apps/v1beta1/defaults.go: if utilfeature.DefaultFeatureGate.Enabled(features.MaxUnavailableStatefulSet) {
pkg/apis/apps/v1beta2/defaults.go: if utilfeature.DefaultFeatureGate.Enabled(features.MaxUnavailableStatefulSet) {
pkg/apis/apps/v1beta2/defaults.go: if utilfeature.DefaultFeatureGate.Enabled(features.StatefulSetAutoDeletePVC) {
pkg/apis/storage/v1/defaults.go: if obj.Spec.SELinuxMount == nil && utilfeature.DefaultFeatureGate.Enabled(features.SELinuxMountReadWriteOncePod) {
pkg/apis/storage/v1beta1/defaults.go: if obj.Spec.SELinuxMount == nil && utilfeature.DefaultFeatureGate.Enabled(features.SELinuxMountReadWriteOncePod) {
that said, there's also evidence that some clients don't handle defaulted fields correctly / gracefully in general (including the kubelet choking on new defaulted fields inside podspec - kubernetes/kubernetes#63814, or clients doing diffs being unhappy with defaulted fields - argoproj/argo-cd#11143), so unless there's a strong reason to default, I'd probably leave the field nil
by default and document which explicit option that corresponds to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so unless there's a strong reason to default, I'd probably leave the field nil by default and document which explicit option that corresponds to
+1 (unless there are other reasons not mentioned here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 for nil, in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack, updated the KEP to keep the field nil by default
2cde840
to
458d468
Compare
@@ -530,6 +532,7 @@ For each of them, fill in the following information by copying the below templat | |||
- 2021-10-24: Proposed KEP for adding the new behavior in alpha status in 1.24. | |||
- 2022-11-11: Initial alpha implementation merged into 1.26 | |||
- 2022-12-07: KEP rewritten to match the implementation (PodHealthyPolicy was renamed to UnhealthyPodEvictionPolicy) | |||
- 2023-01-24: Update for beta promotion |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not change the default behavior. Users will have to specify a policy on the PDB for behavior to be affected.
What happens if a policy is specified and then the feature is disabled?
@@ -530,6 +532,7 @@ For each of them, fill in the following information by copying the below templat | |||
- 2021-10-24: Proposed KEP for adding the new behavior in alpha status in 1.24. | |||
- 2022-11-11: Initial alpha implementation merged into 1.26 | |||
- 2022-12-07: KEP rewritten to match the implementation (PodHealthyPolicy was renamed to UnhealthyPodEvictionPolicy) | |||
- 2023-01-24: Update for beta promotion |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, https://github.com/kubernetes/enhancements/blob/458d4689ad80b05ad938e97c5c29f354acf3df66/keps/sig-apps/3017-pod-healthy-policy-for-pdb/README.md#what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service is not filled out. Neither is https://github.com/kubernetes/enhancements/blob/458d4689ad80b05ad938e97c5c29f354acf3df66/keps/sig-apps/3017-pod-healthy-policy-for-pdb/README.md#how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance.
@@ -277,6 +277,7 @@ https://storage.googleapis.com/k8s-triage/index.html | |||
- <test>: <link to test coverage> | |||
--> | |||
|
|||
[TestEvictionWithUnhealthyPodEvictionPolicy](https://github.com/kubernetes/kubernetes/blob/c8010537913422cc221cdd784936ff99817f621c/test/integration/evictions/evictions_test.go#L417): https://storage.googleapis.com/k8s-triage/index.html?test=UnhealthyPodEvictionPolicy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few more comments (on top of what @logicalhan already pointed below too):
L364: "No, but they will be added for alpha"
Alpha has happened - can you update with the test that were added?
L370: "It does not change the default behavior. Users will have to specify a policy
on the PDB for behavior to be affected."
That's not really answer to the question. The question is not what changes, but how can a rollout fail? (e.g. apiserver can start crashing, etc.)
L375: Please specify the exact metric that reports that
L378: was this test run? If so, please update. If not, please ensure this will happen prior promoting feature gate to beta.
L400: the whole monitoring section is not filled-in
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was this test run? If so, please update. If not, please ensure this will happen prior promoting feature gate to beta.
I will try to run this test soon
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
following up here #3777 (review)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@atiratree try addressing the comments from Han and Wojtek asap
458d468
to
43b197b
Compare
Thanks for reviews everyone. I have tried to resolve all the feedback. Can you please take a look again @logicalhan @wojtek-t @soltysh @liggitt |
|
||
#### GA | ||
|
||
- Every bug report is fixed. | ||
- Introduce E2E tests for this field and confirm their stability. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it enough to have the E2E tests before GA or should it be sooner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, having a test run at beta, when we enable the feature gate as the default is usually sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
from sig-apps pov
|
||
#### GA | ||
|
||
- Every bug report is fixed. | ||
- Introduce E2E tests for this field and confirm their stability. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, having a test run at beta, when we enable the feature gate as the default is usually sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just two more minor comments - other than that LGTM from PRR perspective.
@@ -359,18 +361,25 @@ The eviction API will again start using the `unhealthyPodEvictionPolicy` if prov | |||
|
|||
###### Are there any tests for feature enablement/disablement? | |||
|
|||
No, but they will be added for alpha. | |||
- [TestPodDisruptionBudgetStrategy](https://github.com/kubernetes/kubernetes/blob/06914bdaf51fc1b91501c332bd69d439cd370581/pkg/registry/policy/poddisruptionbudget/strategy_test.go#L46) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those are all great and needed tests, but they are more of a feature tests.
Enablement/disablement tests are tests that verify what happens if we change the FG in the middle of the test.
Wlll you be able to add a test like this one described in the template comment:
https://github.com/kubernetes/enhancements/blame/master/keps/NNNN-kep-template/README.md#L505-L509
[and add a TODO for it here?]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, there are 2 tests of these that do that https://github.com/kubernetes/kubernetes/blob/06914bdaf51fc1b91501c332bd69d439cd370581/pkg/registry/policy/poddisruptionbudget/strategy_test.go#L96-L114. Nevertheless I have added a TODO to add more comprehensive tests for disablement/enablement,
43b197b
to
e63c0c6
Compare
8. Verify that eviction of pods for Deployment A uses the `UnhealthyPodEvictionPolicy` UnhealthyPodEvictionPolicy and eviction of pods for | ||
StatefulSet B uses the default behavior. | ||
8. Verify that eviction of pods for Deployment A and StatefulSet B use the default behavior. | ||
Verify that the `AlwaysAllow` UnhealthyPodEvictionPolicy can be set again to a PDB of Deployment A and test the eviction behavior |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have done the manual upgrade->downgrade->upgrade
test, but the outcome is not the one as originally described by @mortent.
That is, when we downgrade to 1.25 the unhealthyPodEvictionPolicy
field is preserved in etcd, until an update happens to the PDB. Eg by dirsuption controller when updating status, which is triggered for example by eviction that we do in step 5. The field is then removed from etcd and will not be present when we upgrade to 1.26.
Only hacky way to preserve it is to ensure there is no update to the PDB.
I am not sure if this is expected behaviour for new fields. I have found something relevant mentioned here https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api_changes.md
It must be possible to rollback to a previous version of API server that does not include your change and have no impact on API objects which do not use your change. API objects that use your change will be impacted in case of a rollback.
so it seems we should expect impact on API objects in case of a rollback, so the current behaviour would be correct?
I have updated the KEP to reflect that, but I am not sure if I am not missing something. Can @liggitt or anybody else confirm that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK - this was rollback to the version that doesn't recognize this field, so if something was touching this object, the fact that this field is dropped isn't unexpected.
I think it would be useful to make this test again but switching from 1.27 -> 1.26 -> 1.27, in which case you would be downgrading to a version that already supports that and see that (an alternative is to just disable FG within 1.26 version).
If you could add a TODO for doing that, it would be great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for confirming.
Good idea, I have added a TODO to test the 1.27 upgrade path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@atiratree - I just have one super minor comment.
I'm approving it so that you don't have to wait for me, but please apply before merging.
/approve PRR
/hold
No, but they will be added for alpha. | ||
- [TestPodDisruptionBudgetStrategy](https://github.com/kubernetes/kubernetes/blob/06914bdaf51fc1b91501c332bd69d439cd370581/pkg/registry/policy/poddisruptionbudget/strategy_test.go#L96-L114) | ||
|
||
TODO: add more comprehensive enablement/disablement tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing me at the tests above - I missed that somehow when looking into all those tests initially.
This is already great - if you have cycles adding more is of course appreciated, but the ones that you mention above actually make me already happy :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, in that case I have removed the TODO for more tests :)
8. Verify that eviction of pods for Deployment A uses the `UnhealthyPodEvictionPolicy` UnhealthyPodEvictionPolicy and eviction of pods for | ||
StatefulSet B uses the default behavior. | ||
8. Verify that eviction of pods for Deployment A and StatefulSet B use the default behavior. | ||
Verify that the `AlwaysAllow` UnhealthyPodEvictionPolicy can be set again to a PDB of Deployment A and test the eviction behavior |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK - this was rollback to the version that doesn't recognize this field, so if something was touching this object, the fact that this field is dropped isn't unexpected.
I think it would be useful to make this test again but switching from 1.27 -> 1.26 -> 1.27, in which case you would be downgrading to a version that already supports that and see that (an alternative is to just disable FG within 1.26 version).
If you could add a TODO for doing that, it would be great.
e63c0c6
to
f2b84af
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the comments from Wojtek got addressed
/hold cancel
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: atiratree, soltysh, wojtek-t The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
For posterity - this LGTM - thanks! |