
KEP-3902: Decouple TaintManager from NodeLifeCycleController #3901

Merged (1 commit, Jun 12, 2023)

Conversation

@yuanchen8911 (Member) commented Mar 7, 2023:

  • One-line PR description: Code reorganization: decouple the TaintManager from the NodeLifecycleController to enable more flexible extension and enhancement in node taint-based eviction.

The proposal was presented to sig-apps and sig-node. An implementation is WIP.

@k8s-ci-robot added the cncf-cla: yes, kind/kep, and sig/apps labels (Mar 7, 2023)
@k8s-ci-robot added the size/L label (Mar 7, 2023)
@yuanchen8911 changed the title from "Add KEP on a new taint-manager" to "KEP: Decouple taint-manager from NodeLifeCycleController" (Mar 7, 2023)
@yuanchen8911 changed the title from "KEP: Decouple taint-manager from NodeLifeCycleController" to "KEP-3902: Decouple taint-manager from NodeLifeCycleController" (Mar 7, 2023)
@yuanchen8911 force-pushed the master branch 2 times, most recently from ac1a89f to 91ba8c8 (March 7, 2023 18:46)
@yuanchen8911 (Member Author):

sig/node
sig/scheduling

@yuanchen8911 (Member Author):

/cc @ddebroy @atosatto @ravisantoshgudimetla
@k8s-ci-robot (Contributor):

@yuanchen8911: GitHub didn't allow me to request PR reviews from the following users: atosatto.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

> /cc @ddebroy @atosatto @ravisantoshgudimetla

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@yuanchen8911 (Member Author):

/cc @atiratree @Huang-Wei

@yuanchen8911 (Member Author):

/sig node scheduling

@k8s-ci-robot added the sig/node and sig/scheduling labels (Mar 7, 2023)
@kerthcet (Member) commented Mar 7, 2023:

FYI: might kubernetes/kubernetes#115840 be related? I haven't read the KEP.

@atosatto (Contributor) commented Mar 7, 2023:

> FYI: might kubernetes/kubernetes#115840 be related? I haven't read the KEP.

@kerthcet we see this as a preliminary step aiming to remove deprecated features in NodeLifecycleController to enable the work proposed in this KEP.

@bart0sh added this to Needs Reviewer in SIG Node PR Triage (Mar 8, 2023)
@yuanchen8911 (Member Author) commented Mar 20, 2023:

@soltysh, @kow3ns, @alculquicondor, thanks a lot for the discussion of the proposal in today's sig-apps meeting. Your review of the KEP would be much appreciated. @Huang-Wei will provide feedback from the sig-scheduling perspective. Thanks again!

@alculquicondor (Member):

Ownership discussion: kubernetes/kubernetes#116771


> * Fix all reported bugs if any.
>
> ### Upgrade / Downgrade Strategy
Member:
If somebody had replaced the controller, will the second one remain after the split? So they would need to disable one more controller?

Member Author:
For downgrade, disabling the new feature via the feature gate will automatically disable a custom controller and fall back to the current default combined controller.

BTW, Aldo and other reviewers strongly suggested that custom controllers should not be the focus of this proposal.

@yuanchen8911 (Member Author):

@SergeyKanzhelev, thanks for the comments! Can you take another look?

@wojtek-t (Member) commented Jun 6, 2023:

> And after that, we will pursue PRR review from @wojtek-t.

Queued - will take a look, but most probably Wed/Thu.

@wojtek-t self-assigned this (Jun 6, 2023)
@yuanchen8911 force-pushed the master branch 2 times, most recently from 4b5ff1e to 89f2ac7 (June 6, 2023 20:32)
@wojtek-t (Member) left a comment:

The overall proposal makes sense to me (including going directly to Beta), but I have a bunch of comments about the PRR itself - PTAL


> *This section must be completed when targeting alpha to a release.*
>
> * **How can this feature be enabled / disabled in a live cluster?**
Member:
Please update to the new KEP template (as mentioned above) - the feature-gate is then required explicitly :)

> * The feature will work as usual.
> * **Are there any tests for feature enablement/disablement?**
>   * Planned for Beta release
>   * Appropriate tests have been added in the integration tests. See [Integration tests](#integration-tests) for more details.
Member:
These tests haven't been added, rather they are planned, right?

That being said, if I'm reading correctly, those tests are rather checking if things work fine when FG is enabled. They are not trying to test the enablement/disablement - being "run k8s with FG enabled, restart controller-manager with FG disabled and check if things still work as expected (and vice-versa)"

Which BTW is probably fine here, given this is an in-memory feature (so no state is really stored).
So I would be OK with saying something like:
"No enablement/disablement tests are needed since this is an in-memory feature, and regular tests with the feature enabled/disabled do the job."

Member Author:
Updated as suggested.

> *This section must be completed when targeting beta graduation to a release.*
>
> * **How can a rollout or rollback fail? Can it impact already running workloads?**
>   A feature gate `SeparateTaintManager` controls whether to use the split `TaintManager` or the old combined `NodeLifecycleController`.
Member:
Please answer the question. Can it impact already running workloads?

Member Author:
Does this make sense?

"This is an opt-in feature, and it does not change any default behavior. Unless there is a bug in the implementation, a rollout cannot fail. If a rollout does fail, running workloads will not be evicted properly on tainted nodes. We don't expect a rollback to fail."

> A feature gate `SeparateTaintManager` controls whether to use the split `TaintManager` or the old combined `NodeLifecycleController`.

> * **What specific metrics should inform a rollback?**
>   N/A
Member:
This isn't N/A for sure.

Some huge number of pod evictions? [Please add appropriate metric]

Member Author:
How about "A significantly changing number of pod evictions and/or a substantial increase in pod eviction latency"?

> N/A
>
> * **Were upgrade and rollback tested? Was the upgrade→downgrade→upgrade path tested?**
>   They will be covered in unit and e2e tests.
Member:
Will they? We don't really have downgrade e2e tests running anywhere.

Please run them manually and describe the test case.
[kubernetes/enhancements#3658 is a nice example]

Member Author:
Added to the test plan: "manually verify that the rollback and upgrade→downgrade→upgrade path passes e2e testing."

Member Author:

> The overall proposal makes sense to me (including going directly to Beta), but I have a bunch of comments about the PRR itself - PTAL

Thanks for the review, Wojciech! I've updated the doc to (hopefully) address the comments :-). Can you please take another look when you get a chance? Thanks again!

> * **How can an operator determine if the feature is in use by workloads?**
>   * It can be determined by whether the feature gate `SeparateTaintManager` is enabled.
> * **How can someone using this feature know that it is working for their instance?**
>   * Node taints and taint-based pod eviction work as usual. Admins and users should not see any different behavior if everything works as planned.
Member:
How can I determine it as an operator running gazillions clusters?

Please specify a metric (number of pod evictions is probably a best one).

Member Author:
Changed it to "Node taints and taint-based pod eviction should work as usual, and there should be no significant change in the number of pod evictions (taint_manager_pod_evictions_total) or pod eviction latency (taint_manager_pod_eviction_latency)."

> * **What are the reasonable SLOs (Service Level Objectives) for the enhancement?**
>   * The performance of node taint-based eviction should remain at the same level as before.
> * **What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?**
>   * The metrics for both `NodeLifecycleController`'s and `TaintManager`'s queues should stay at the same levels as before.
Member:
What metrics? Please be specific.

Member Author:
How about "the number of pod evictions and pod eviction latency"?

taint_manager_pod_evictions_total
taint_manager_pod_eviction_latency
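As a sketch of what these two metrics could look like, using the metric names from this thread: the real implementation would go through Kubernetes' component-base/metrics wrappers, so treat this plain client_golang version as illustrative only:

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	// Total pods evicted because of node taints.
	podEvictionsTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "taint_manager_pod_evictions_total",
		Help: "Total number of pods evicted by the taint manager.",
	})
	// Seconds from observing a NoExecute taint to completing the eviction.
	podEvictionLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "taint_manager_pod_eviction_latency",
		Help:    "Latency, in seconds, between taint observation and pod eviction.",
		Buckets: prometheus.ExponentialBuckets(0.1, 2, 12), // 0.1s .. ~200s
	})
)

func init() {
	prometheus.MustRegister(podEvictionsTotal, podEvictionLatency)
}

// recordEviction would be called by the taint manager after each successful
// taint-based eviction; taintObserved is when the NoExecute taint was seen.
func recordEviction(taintObserved time.Time) {
	podEvictionsTotal.Inc()
	podEvictionLatency.Observe(time.Since(taintObserved).Seconds())
}

func main() {
	recordEviction(time.Now().Add(-2 * time.Second)) // example observation
}
```

An operator watching these two series across a gate flip can alert on a sharp change in eviction rate or latency, which matches the rollback signal suggested earlier in this thread.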

> * **What are other known failure modes?**
>   * No
> * **What steps should be taken if SLOs are not being met to determine the problem?**
>   * If the default or custom taint-manager is working or not.
Member:
I don't understand it - can you please clarify?

Member Author:
Updated to:

"If the pod eviction latency increases significantly, validate whether the communication between NodeLifecycleController and TaintManager works. If the number of pod evictions is abnormal, run tests to verify that the TaintManager works properly."

@yuanchen8911 force-pushed the master branch 2 times, most recently from 17cd65c to 86dd1f5 (June 9, 2023 20:36)
Commit history (squashed into one commit):

Add diagrams
Update diagrams
Resize diagram images
Update image size
Fix a format issue
Update images
Update README.md
Add issue id
Update toc
Fix typos
Fix format issues
Update decoupled-taint-manager KEP
Fix a typo in kep.yaml
Fix errors
Revise README.md
Update alternatives
Update README
Update keps/sig-scheduling/3902-decoupled-taint-manager/README.md (Co-authored-by: Wei Huang <weih@hey.com>)
Address Wei Huang's comments
Add Story 3
Remove backup files
Address Aldo and Ravi's comments
Update Goals
Reformat code
Focus on code split
Remove the backup file
Address Aldo's comments
Fix toc
Another revision to address comments
Fix diagram error
Update status in README.md
Address sig-apps comments
Address Wei's comments
Update
Address final comments from sig-apps/sig-scheduling
Address sig-node comments
Update risk and mitigations
Address PRR review comments
Update status
@SergeyKanzhelev (Member) left a comment:
/lgtm

lgtm from SIG Node.

@k8s-ci-robot added the lgtm label (Jun 9, 2023)
@wojtek-t (Member):

This seems fine from PRR perspective - thanks!

/lgtm
/approve PRR

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Huang-Wei, SergeyKanzhelev, wojtek-t, yuanchen8911

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
