Reconcile NoExecute Taint #89059
Conversation
Welcome @chenkaiyue!
Hi @chenkaiyue. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @gmarek
@chenkaiyue: GitHub didn't allow me to assign the following users: gmarek. Note that only kubernetes members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:

> /assign @gmarek
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@chenkaiyue please add tests for the new functions and fix the code formatting.
force-pushed from da6880d to 8c62c4a (Compare)
done
force-pushed from 8c62c4a to 62dd519 (Compare)
/ok-to-test
force-pushed from 62dd519 to b3637c9 (Compare)
/retest
1 similar comment
/retest
@@ -194,6 +194,15 @@ func (q *UniqueQueue) Clear() {
	}
}

// SetRemove removes value from the set if it exists
func (q *UniqueQueue) SetRemove(value string) {
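The hunk above elides the method body. A minimal sketch of what it plausibly contains, assuming `UniqueQueue` keeps a `sets.String` field named `set` guarded by a `sync.Mutex` named `lock` (both assumptions based on the surrounding queue code, not confirmed by this hunk):

```go
// SetRemove removes value from the set if it exists, so the same value
// can pass the uniqueness check and be enqueued again later.
// Assumes UniqueQueue has fields: lock sync.Mutex, set sets.String.
func (q *UniqueQueue) SetRemove(value string) {
	q.lock.Lock()
	defer q.lock.Unlock()
	q.set.Delete(value)
}
```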
prefer to rename to `Remove`; the other parts are LGTM :)
OK, I will correct it
@@ -280,6 +289,11 @@ func (q *RateLimitedTimedQueue) Clear() {
	q.queue.Clear()
}

// SetRemove removes value from the set of the queue
func (q *RateLimitedTimedQueue) SetRemove(value string) {
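Again the body is elided; a sketch assuming `RateLimitedTimedQueue` simply delegates to the inner `UniqueQueue` shown above (an assumption, mirroring how `Clear` delegates in this hunk):

```go
// SetRemove removes value from the set of the inner unique queue,
// delegating just as Clear does.
func (q *RateLimitedTimedQueue) SetRemove(value string) {
	q.queue.SetRemove(value)
}
```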
ditto
OK!
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: chenkaiyue, k82cn. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/lgtm

/release-note-none
What type of PR is this?
/kind bug
What happened:
When a node's ready condition is Unknown or False, the node-lifecycle-controller adds NoSchedule and NoExecute taints to it, such as the `node.kubernetes.io/unreachable` taints with the NoSchedule and NoExecute effects.

However, these two taints can be removed by others; an easy way is just deleting them manually with `kubectl edit node`. After a while the NoSchedule taint is added back, but the NoExecute taint is not, so the node ends up carrying only `node.kubernetes.io/unreachable:NoSchedule`.

What you expected to happen:
We know that the NoSchedule and NoExecute taints depend on the node's condition: the node ready condition determines which taints a node should have, and the node-lifecycle-controller does exactly this job, updating the node's taints according to the node condition. Since the node condition has not changed and is still Unknown or False, the corresponding taints should be kept in place by the node-lifecycle-controller's reconcile process.

So what I expected to happen is that even if the taints are removed, they are added back by reconciliation: not only the NoSchedule taint but also the NoExecute taint, i.e. both `node.kubernetes.io/unreachable:NoSchedule` and `node.kubernetes.io/unreachable:NoExecute` present again.
How to reproduce it (as minimally and precisely as possible):

1. Stop a node's kubelet to make the node's condition Unknown: `systemctl stop kubelet`.
2. While the node condition is Unknown, run `kubectl edit node <node-id> -o yaml` and delete the `node.kubernetes.io/unreachable` NoSchedule and NoExecute taints.
3. Wait for a while; with `kubectl get node <node-id> -o yaml` you can see that only the `node.kubernetes.io/unreachable` NoSchedule taint has been added back, and no NoExecute taint.

Environment:
As far as I know, Kubernetes 1.14, 1.16, and the master branch have this bug.
Why this bug happened:
The NoSchedule taints are generated from the node informer, so when the node object changes, UpdateFunc is called and the taints to add and to delete are calculated from the node condition. So even if we delete the NoSchedule taint, it is added back after this calculation and comparison, as illustrated below.
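A hedged, self-contained illustration of that diff-and-reapply idea (all identifiers here are illustrative, not the controller's actual ones):

```go
package main

import "fmt"

type taint struct{ key, effect string }

// desiredTaints maps a node's ready condition to the taints it should
// carry, mirroring in spirit what the node-lifecycle-controller computes.
func desiredTaints(ready string) []taint {
	if ready == "Unknown" {
		return []taint{
			{"node.kubernetes.io/unreachable", "NoSchedule"},
			{"node.kubernetes.io/unreachable", "NoExecute"},
		}
	}
	return nil
}

func main() {
	// The node currently carries only the NoSchedule taint; the NoExecute
	// taint was deleted by hand with `kubectl edit node`.
	current := map[taint]bool{
		{"node.kubernetes.io/unreachable", "NoSchedule"}: true,
	}
	for _, want := range desiredTaints("Unknown") {
		if !current[want] {
			fmt.Printf("reconcile would re-add %s:%s\n", want.key, want.effect)
		}
	}
}
```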
For the NoExecute taint, the data source is the zoneNoExecuteTainter in monitorNodeHealth: when the node condition is Unknown or False, the node is added into the queue of zoneNoExecuteTainter and, at the same time, into a set owned by the queue. So the same node won't be added into the queue in the next cycle of monitorNodeHealth, because it is already in the set.

I think the set is meant to prevent the same node object from being enqueued repeatedly, since monitorNodeHealth runs every few seconds, but it also causes the bug stated above. An easy fix is to check whether the node already has the corresponding NoExecute taint: if not, remove the node from the set and enqueue it so the NoExecute taint is added back; if yes, we can ignore the duplicate object as before. A minimal model of this is sketched below.
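A small self-contained model of the bug and the fix (all names illustrative): the set guards the queue against duplicate enqueues, so a node that never leaves the set can never be enqueued again; dropping it from the set when its NoExecute taint is found missing lets the next monitorNodeHealth cycle re-apply the taint.

```go
package main

import "fmt"

// uniqueQueue is a toy stand-in for the controller's UniqueQueue: the set
// suppresses duplicate enqueues of the same node.
type uniqueQueue struct {
	queue []string
	set   map[string]bool
}

// Add enqueues v only if it is not already tracked in the set.
func (q *uniqueQueue) Add(v string) bool {
	if q.set[v] {
		return false // still in the set: the enqueue is silently dropped
	}
	q.set[v] = true
	q.queue = append(q.queue, v)
	return true
}

// SetRemove mirrors the method added in this PR: forget v so a later Add
// succeeds and the NoExecute taint can be applied again.
func (q *uniqueQueue) SetRemove(v string) { delete(q.set, v) }

func main() {
	q := &uniqueQueue{set: map[string]bool{}}
	fmt.Println(q.Add("node-1")) // true: first enqueue, taint gets applied
	fmt.Println(q.Add("node-1")) // false: the bug, node is never re-enqueued

	// The fix: the taint is observed missing, so drop the node from the
	// set and enqueue it again.
	q.SetRemove("node-1")
	fmt.Println(q.Add("node-1")) // true: taint can be re-applied
}
```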
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: