Skip to content

Conversation

@sumukha-radhakrishna
Copy link

@sumukha-radhakrishna sumukha-radhakrishna commented Apr 14, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

Currently, we pause pod evictions when a zone experiences partial disruption. This is to prevent unnecessary evictions due to temporary or localized issues.

However, during a full zonal disruption, our current logic continues evicting pods, which can be counterproductive since the entire zone is unavailable.

All workloads in a zone gets evicted during FullZonalDisruption. I understand the idea to evict pods in a zone where we are not able to allocate/provision more capacity since its down.

I am not opposed to the idea of having a flag to turn this on/off, looking for opinion the on the idea.

Which issue(s) this PR fixes:

Fixes #131314

Special notes for your reviewer:

Does this PR introduce a user-facing change?

No

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. labels Apr 14, 2025
@k8s-ci-robot
Copy link
Contributor

Please note that we're already in Test Freeze for the release-1.33 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.33.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Mon Apr 14 13:34:58 UTC 2025.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 14, 2025
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 14, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @rsumukha. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Apr 14, 2025
@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 14, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Apr 14, 2025
@dims
Copy link
Member

dims commented Apr 15, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 15, 2025
@sumukha-radhakrishna
Copy link
Author

sumukha-radhakrishna commented Apr 17, 2025

I am not opposed to having a flag to turn this on/off, looking for opinion the on the idea.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 21, 2025
@k8s-ci-robot k8s-ci-robot added area/conformance Issues or PRs related to kubernetes conformance tests area/release-eng Issues or PRs related to the Release Engineering subproject sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/release Categorizes an issue or PR as relevant to SIG Release. labels Apr 21, 2025
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. do-not-merge/contains-merge-commits Indicates a PR which contains merge commits. labels Apr 21, 2025
@k8s-triage-robot
Copy link

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sumukha-radhakrishna
Once this PR has been reviewed and has the lgtm label, please assign deads2k for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rphillips
Copy link
Member

rphillips commented Apr 22, 2025

However, during a full zonal disruption, our current logic continues evicting pods, which can be counterproductive since the entire zone is unavailable.

If there is a full zonal disruption, then the zone is not guaranteed to come back. Thus the pods are left scheduled to those pods within the dead zone. I'm inclined to say the behavior is correct today by evicting the pods within the dead zone, so they have an opportunity to run in a different zone as a self-healing mechanism.

Thoughts?

@sumukha-radhakrishna
Copy link
Author

I'm inclined to say the behavior is correct today by evicting the pods within the dead zone, so they have an opportunity to run in a different zone as a self-healing mechanism.

You're right, evicting pods during zonal disruption can trigger rescheduling elsewhere, which is a good self-healing strategy.
But it needs to be done carefully, and configurably, because in some clusters, it can degrade availability if capacity isn't there. This is why I made it configurable via cli flag.

@siyuanfoundation
Copy link
Contributor

/remove-sig api-machinery

@k8s-ci-robot k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Apr 22, 2025
@SergeyKanzhelev SergeyKanzhelev moved this from Triage to Archive-it in SIG Node CI/Test Board Apr 23, 2025
@dims
Copy link
Member

dims commented May 13, 2025

@cartermckinnon @kmala please help review!

@lmktfy
Copy link
Member

lmktfy commented Jun 22, 2025

If I had a magic coding wand (and, specifically, one significantly better than Claude):

  • On detecting full zonal disruption, [k-c-m would] capture of snapshot of all nodes in that zone. Using that snapshot, taint any node in that zone that isn't obviously healthy, using a newly minted, official Kubernetes taint. This taint is to indicate that the node is in an unhealthy zone. This first controller can, at least initially, be off-by-default.
  • We also add a second, on-by-default controller to drop that taint once the zone appears to partially or fully recover (and of course, if that specific taint is never set, this controller won't appear to be doing anything). However, you can opt out of the second controller and remove that taint using your own implementation.

Lots of extensibility options and the possibility to prototype it out-of-tree as well. Plus, the mechanism uses existing Kubernetes concepts so it will be very easy to teach.

@dims dims added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Sep 1, 2025
@github-project-automation github-project-automation bot moved this from Needs Triage to Closed in SIG Apps Sep 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/apiserver area/code-generation area/conformance Issues or PRs related to kubernetes conformance tests area/release-eng Issues or PRs related to the Release Engineering subproject area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/release Categorizes an issue or PR as relevant to SIG Release. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

Archived in project
Archived in project

Development

Successfully merging this pull request may close these issues.

Should pause pod evictions when zone is in Full Zonal Disruption

7 participants