Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Promote taint based evictions to GA #1450

Merged
merged 1 commit into from Jan 20, 2020

Conversation

damemi
Copy link
Contributor

@damemi damemi commented Jan 14, 2020

Based on this comment in the TBE issue (#166 (comment)) we are creating an enhancement to officially track the promotion of Taint Based Evictions to GA

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 14, 2020
@damemi
Copy link
Contributor Author

damemi commented Jan 14, 2020

/cc @ahg-g @Huang-Wei

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jan 14, 2020
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Jan 14, 2020
## Summary

Taint Based Evictions was introduced as an alpha feature in Kubernetes 1.6 and was promoted to
beta in 1.13. Taint Based Evictions evicts pods from a node based on taints applied to the node
Copy link
Member

@Huang-Wei Huang-Wei Jan 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the tricky part...

Long time ago, when the feature was proposed, the feature was supposed to evict pods based on node conditions. But along with the development, it's finalized with a much smaller scope:

Taint nodes with a NoExecute effect automatically, when the nodes gets not ready or unreachable. While TaintNodeByCondition feature apply taints with a NoSchedule effect. Sometime people confuse with them.

Other functionalities such as evicting pods and setting tolerationSeconds are out of TaintBasedEviction's scope (by taint manager and default toleration seconds admission controller).

So the name is very confusing, it's worth highlighting the tricky parts.

Copy link
Member

@Huang-Wei Huang-Wei Jan 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we disable TaintBasedEviction, the eviction behavior and setting default toleration seconds still work. Which also prove that those are not the scope of TaintBasedEviction.

And maybe we should emphasize the word "automatically", as NoExecute comes with taint long time ago, but system-managed NoExecute taints are exactly managed by this feature.


### Goals

Promote Taint Based Evictions to GA.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should say "Ensure nodes are tainted properly with a NoExecute effect when it's not ready or unreachable, so that scheduler can use taints to make scheduling decisions consistently."


## Motivation

Taint Based Evictions has been in beta since 1.13 and has functioned well since then.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Motivation section applies generally to the feature, not particularly for promoting it to GA.

How about: TaintNodesByCondition has ensured the nodes to be tainted well with NoSchedule effect, upon different node conditions. However, it's also required to taint nodes with NoExecute effect automatically upon some node conditions such as node gets not ready or unreachable.


### Non-Goals

Make any change of functionality to the feature.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto. Non-goals applies to the feature.

@damemi
Copy link
Contributor Author

damemi commented Jan 14, 2020

@Huang-Wei updated with your feedback

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 14, 2020
@Huang-Wei
Copy link
Member

/lgtm

@k82cn @ahg-g do you want to take another look?

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 14, 2020
title: Promote Taint Based Evictions to GA
authors:
- "@damemi"
owning-sig: sig-scheduling
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sig-scheduling is not the owning sig of this feature, the scheduler doesn't include any code specifically for this feature.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liggitt which sig does the node lifecycle controller falls under?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would that be sig-node?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably sig-node

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or sig-cloud-provider for the "not reachable nodes" aspect

* [Node lifecycle controller eviction tests](https://github.com/kubernetes/kubernetes/blob/47d5c3ef8d/pkg/controller/nodelifecycle/node_lifecycle_controller_test.go#L196)

### Integration tests
* [Taint based evictions integration test](https://github.com/kubernetes/kubernetes/blob/47d5c3ef8df2b1b26da739aec0ada15d41f20cf3/test/integration/scheduler/taint_test.go#L580) (note that prior to 1.17, this test existed as an [end-to-end test](https://github.com/kubernetes/kubernetes/blob/001f2cd2b553d06028c8542c8817820ee05d657f/test/e2e/scheduling/taint_based_evictions.go)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before graduating to GA, we should make sure that the tests are moved under the proper sig.

reviewers:
- "@Huang-Wei"
approvers:
- "@liggitt"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend sig-node/sig-cloud-provider for approvers

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 16, 2020
@damemi
Copy link
Contributor Author

damemi commented Jan 16, 2020

@liggitt @ahg-g I've updated to list sig-node as the owning sig and both node and cloud-provider as approvers. I also made a note under the graduation criteria that tests should be moved to the appropriate sig

@ahg-g
Copy link
Member

ahg-g commented Jan 17, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 17, 2020
@ahg-g
Copy link
Member

ahg-g commented Jan 17, 2020

/hold

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jan 17, 2020
@damemi
Copy link
Contributor Author

damemi commented Jan 20, 2020

Just checking, is there anything else this needs? Waiting on this to merge to start breaking out the issues in kubernetes/kubernetes#87161

@ahg-g
Copy link
Member

ahg-g commented Jan 20, 2020

/lgtm

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, damemi, Huang-Wei

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@damemi
Copy link
Contributor Author

damemi commented Jan 20, 2020

/hold cancel

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants