Reconcile NoExecute Taint #89059

Merged

chenkaiyue
Contributor

@chenkaiyue chenkaiyue commented Mar 11, 2020

What type of PR is this?
/kind bug

What happened:
When a node's Ready condition is Unknown or False, node-lifecycle-controller adds NoSchedule and NoExecute taints to it, for example:

taints:
  - effect: NoSchedule
    key: node.kubernetes.io/not-ready
    timeAdded: "2020-03-11T16:27:55Z"
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    timeAdded: "2020-03-11T16:27:57Z"

However, if these two taints are removed by someone else (an easy way is to delete them manually with kubectl edit node), then after a while the NoSchedule taint is added back but the NoExecute taint is not. The taints then look like this:

taints:
  - effect: NoSchedule
    key: node.kubernetes.io/not-ready
    timeAdded: "2020-03-11T16:41:23Z"

What you expected to happen:
The NoSchedule and NoExecute taints depend on the node's condition: the node's Ready condition determines which taints the node should have, and node-lifecycle-controller's job is to update the node's taints accordingly.

Because the node condition has not changed (it is still Unknown or False), the corresponding taints should be kept in place by node-lifecycle-controller's reconcile process.

So what I expected is that, even when the taints are removed, reconciliation adds them back: not only the NoSchedule taint but also the NoExecute taint, like this:

taints:
  - effect: NoSchedule
    key: node.kubernetes.io/not-ready
    timeAdded: "2020-03-11T16:41:23Z"
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    timeAdded: "2020-03-11T16:41:29Z"

How to reproduce it (as minimally and precisely as possible):
1. Stop a node's kubelet so that the node's condition becomes Unknown: systemctl stop kubelet
2. While the node condition is Unknown, run kubectl edit node <node-id> -o yaml and delete the node.kubernetes.io/unreachable NoSchedule and NoExecute taints.
3. Wait for a while, then check with kubectl get node <node-id> -o yaml: only the node.kubernetes.io/unreachable NoSchedule taint has been added back, not the NoExecute taint.

Environment:
As far as I know, 1.14, 1.16, and the master branch of Kubernetes have this bug.

Why this bug happened:
NoSchedule taints are driven by the node informer: whenever the node object changes (deleting a taint is itself such a change), UpdateFunc is called, and the taints to add and to delete are computed from the node condition. So even if we delete the NoSchedule taint, it is added back after this calculation and comparison.
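
The calculate-and-compare step for NoSchedule taints can be sketched roughly as follows. This is a minimal illustration, not the controller's code: the `taint` type and the `desiredNoScheduleTaints` and `diffNoSchedule` helpers are hypothetical names, and conditions are plain strings rather than API types. Only the Ready condition is modeled.

```go
package main

import "fmt"

// taint is a hypothetical, stripped-down stand-in for a node taint.
type taint struct {
	Key    string
	Effect string
}

// managedKeys lists the taint keys this sketch treats as controller-owned.
var managedKeys = map[string]bool{
	"node.kubernetes.io/not-ready":   true,
	"node.kubernetes.io/unreachable": true,
}

// desiredNoScheduleTaints derives the wanted NoSchedule taints from the
// status of the node's Ready condition ("True", "False" or "Unknown").
func desiredNoScheduleTaints(readyStatus string) []taint {
	switch readyStatus {
	case "False":
		return []taint{{Key: "node.kubernetes.io/not-ready", Effect: "NoSchedule"}}
	case "Unknown":
		return []taint{{Key: "node.kubernetes.io/unreachable", Effect: "NoSchedule"}}
	default:
		return nil
	}
}

func contains(ts []taint, t taint) bool {
	for _, x := range ts {
		if x == t {
			return true
		}
	}
	return false
}

// diffNoSchedule compares the node's current taints with the desired ones
// and returns what has to be added and what has to be deleted.
func diffNoSchedule(current []taint, readyStatus string) (toAdd, toDel []taint) {
	desired := desiredNoScheduleTaints(readyStatus)
	for _, d := range desired {
		if !contains(current, d) {
			toAdd = append(toAdd, d)
		}
	}
	for _, c := range current {
		if c.Effect == "NoSchedule" && managedKeys[c.Key] && !contains(desired, c) {
			toDel = append(toDel, c)
		}
	}
	return toAdd, toDel
}

func main() {
	// The node is Unknown and its taints were deleted by hand: the diff
	// asks for the unreachable NoSchedule taint to be added back.
	toAdd, toDel := diffNoSchedule(nil, "Unknown")
	fmt.Println("add:", toAdd, "delete:", toDel)
}
```

Because the desired state is recomputed from the condition on every update, a manual deletion is corrected on the next pass. Nothing in this path remembers that the taint was already applied once.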

The NoExecute taint, by contrast, comes from zoneNoExecuteTainter in monitorNodeHealth. When the node condition is Unknown or False, the node is added to the zoneNoExecuteTainter queue and, at the same time, to a set that belongs to the queue. In the next cycle of monitorNodeHealth the same node is not added to the queue again, because it is already in the set.

I think the set is there to keep the same node from being enqueued repeatedly, since monitorNodeHealth runs every few seconds, but it also causes the bug described above. An easy fix is to check whether the node already has the corresponding NoExecute taint: if not, remove the node from the set and enqueue it so the NoExecute taint is added again; if so, ignore the duplicate as before.
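
The queue-plus-set behaviour and the proposed fix can be sketched as below. This is a simplified model under stated assumptions, not the code in this PR: `uniqueQueue`, `Process`, and this version of `markNodeForTainting` are hypothetical, the queue holds bare node names, and there is no locking or rate limiting. `Remove` here also drops any pending queue entry so that re-adding never creates a duplicate, which is a choice made for the sketch.

```go
package main

import "fmt"

// uniqueQueue is a simplified stand-in for a queue that remembers every
// name it has accepted in a set, so the same name is not enqueued twice.
type uniqueQueue struct {
	queue []string
	set   map[string]bool
}

func newUniqueQueue() *uniqueQueue {
	return &uniqueQueue{set: make(map[string]bool)}
}

// Add enqueues name unless the set already contains it.
func (q *uniqueQueue) Add(name string) bool {
	if q.set[name] {
		return false
	}
	q.set[name] = true
	q.queue = append(q.queue, name)
	return true
}

// Process pops the head of the queue (the taint would be applied here).
// The name stays in the set, which is what the description identifies
// as the cause of the bug.
func (q *uniqueQueue) Process() (string, bool) {
	if len(q.queue) == 0 {
		return "", false
	}
	name := q.queue[0]
	q.queue = q.queue[1:]
	return name, true
}

// Remove forgets name: it leaves the set and any pending queue entry.
func (q *uniqueQueue) Remove(name string) {
	delete(q.set, name)
	for i, n := range q.queue {
		if n == name {
			q.queue = append(q.queue[:i], q.queue[i+1:]...)
			return
		}
	}
}

// markNodeForTainting applies the proposed check: a node that lacks the
// NoExecute taint is forgotten first, so Add can enqueue it again.
func markNodeForTainting(q *uniqueQueue, name string, hasNoExecuteTaint bool) bool {
	if !hasNoExecuteTaint {
		q.Remove(name)
	}
	return q.Add(name)
}

func main() {
	q := newUniqueQueue()
	fmt.Println(q.Add("node-1")) // first health check: enqueued
	q.Process()                  // taint applied, name remains in the set
	fmt.Println(q.Add("node-1")) // taint deleted by hand, next cycle: rejected

	fmt.Println(markNodeForTainting(q, "node-1", false)) // with the check: enqueued again
	fmt.Println(markNodeForTainting(q, "node-1", true))  // taint present: duplicate ignored
}
```

The second `Add` call reproduces the bug: the node no longer carries the taint, yet the set still blocks it. Passing the taint state into the enqueue path lets the set keep suppressing duplicates while still allowing a node that lost its taint to be processed again.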

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added the kind/bug, do-not-merge/release-note-label-needed, size/M, cncf-cla: yes, needs-sig, and needs-priority labels Mar 11, 2020
@k8s-ci-robot
Contributor

Welcome @chenkaiyue!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @chenkaiyue. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-ok-to-test and sig/apps labels and removed the needs-sig label Mar 11, 2020
@chenkaiyue
Contributor Author

chenkaiyue commented Mar 11, 2020

/assign @gmarek

@k8s-ci-robot
Contributor

@chenkaiyue: GitHub didn't allow me to assign the following users: gmare.

Note that only kubernetes members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @Gmare

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@attlee-wang

@chenkaiyue please add functional tests and format the code

@chenkaiyue force-pushed the Reconcile-NoExecute-Taint branch 2 times, most recently from da6880d to 8c62c4a Mar 12, 2020 04:04
@chenkaiyue
Contributor Author

@chenkaiyue please add functional tests and format the code

done

@yuzhiquan
Member

/ok-to-test

@k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label Mar 12, 2020
@chenkaiyue
Contributor Author

/retest

1 similar comment
@chenkaiyue
Contributor Author

/retest

@k82cn self-assigned this Mar 22, 2020
@@ -194,6 +194,15 @@ func (q *UniqueQueue) Clear() {
}
}

// SetRemove remove value from the set if value existed
func (q *UniqueQueue) SetRemove(value string) {
Member

prefer to rename to Remove; the other parts are LGTM :)

Contributor Author

OK, I will correct it.

@@ -280,6 +289,11 @@ func (q *RateLimitedTimedQueue) Clear() {
q.queue.Clear()
}

// SetRemove remove value from the set of the queue
func (q *RateLimitedTimedQueue) SetRemove(value string) {
Member

ditto

Contributor Author

OK!

@k82cn
Member

k82cn commented Mar 22, 2020

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chenkaiyue, k82cn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label Mar 22, 2020
@xiaoxubeii
Member

/lgtm

@k8s-ci-robot added the lgtm label Mar 23, 2020
@xiaoxubeii
Member

/release-note-none

@k8s-ci-robot added the release-note-none label and removed the do-not-merge/release-note-label-needed label Mar 23, 2020
@k8s-ci-robot merged commit 0641e0c into kubernetes:master Mar 23, 2020
@k8s-ci-robot added this to the v1.19 milestone Mar 23, 2020
@k8s-ci-robot added the release-note and release-note-none labels and removed the release-note-none and release-note labels Sep 9, 2023
Labels
approved, cncf-cla: yes, kind/bug, lgtm, needs-priority, ok-to-test, release-note-none, sig/apps, size/M
6 participants