Ignore AWS NodeWithImpairedVolumes taint #3040

Merged (1 commit) on Jul 6, 2020

Conversation

johanneswuerbach
Contributor

@johanneswuerbach johanneswuerbach commented Apr 12, 2020

The NodeWithImpairedVolumes taint is applied to a node on AWS when a volume is stuck in the attaching state for too long.

The taint was introduced in k8s a while ago in kubernetes/kubernetes#55558.

Add it to the ignored node condition taints.
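
As a rough illustration of what "ignored node condition taints" means here, the following is a minimal, self-contained sketch and not the autoscaler's actual code. The `Taint` struct is a simplified stand-in for the Kubernetes type, and `ignoredConditionTaints` and `sanitizeTaints` are hypothetical names chosen for this example. The idea is that taints describing a transient node condition are dropped before a node is used as a template for new nodes.

```go
package main

import "fmt"

// Taint is a simplified stand-in for the Kubernetes core/v1 Taint type.
type Taint struct {
	Key    string
	Value  string
	Effect string
}

// ignoredConditionTaints is a hypothetical set of taint keys that describe
// a transient node condition and should not be copied to a template node.
var ignoredConditionTaints = map[string]bool{
	"NodeWithImpairedVolumes": true,
}

// sanitizeTaints returns the taints whose keys are not in the ignored set,
// preserving their original order.
func sanitizeTaints(taints []Taint, ignored map[string]bool) []Taint {
	kept := make([]Taint, 0, len(taints))
	for _, t := range taints {
		if ignored[t.Key] {
			continue
		}
		kept = append(kept, t)
	}
	return kept
}

func main() {
	// Illustrative taints; the "dedicated" taint is made up for this example.
	nodeTaints := []Taint{
		{Key: "NodeWithImpairedVolumes", Value: "true", Effect: "NoSchedule"},
		{Key: "dedicated", Value: "spot", Effect: "NoSchedule"},
	}
	for _, t := range sanitizeTaints(nodeTaints, ignoredConditionTaints) {
		fmt.Printf("%s=%s:%s\n", t.Key, t.Value, t.Effect) // prints "dedicated=spot:NoSchedule"
	}
}
```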

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 12, 2020
@johanneswuerbach
Contributor Author

/assign @losipiuk

@johanneswuerbach
Contributor Author

/area provider/aws

@k8s-ci-robot k8s-ci-robot added the area/provider/aws Issues or PRs related to aws provider label Apr 12, 2020
@johanneswuerbach
Contributor Author

Sorry, maybe @Jeffwan would be a better assignee as this is AWS specific.

@Jeffwan
Contributor

Jeffwan commented Apr 22, 2020

@johanneswuerbach Code changes look good to me

/lgtm

In your case, what's the root cause of the volume taking that long to attach?

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 22, 2020
@johanneswuerbach
Contributor Author

Not sure. It looks like an issue in AWS, but it might also be k8s related. We currently run on 1.16 provisioned using kops, and I requested a backport of a fix that might be related: kubernetes/kubernetes#89894

@Jeffwan
Contributor

Jeffwan commented Apr 23, 2020

@johanneswuerbach em, good to know, I will ask my team member to track this change and try to get it approved

@johanneswuerbach
Contributor Author

@Jeffwan any update on this?

@johanneswuerbach
Contributor Author

/assign @aleksandra-malinowska

@mwielgus
Contributor

mwielgus commented Jun 1, 2020

@Jeffwan ping :)

@Jeffwan
Contributor

Jeffwan commented Jun 12, 2020

/lgtm

needs someone to approve this change. @mwielgus

@edsonmarquezani

I've seen a side effect of it.

I have an instance group whose applications are shut down on a fixed schedule. When a node is tainted like this, that specific node won't be terminated and will be left as the only one in the instance group. The day after, when the pods are up again, the autoscaler WON'T scale nodes up because of this taint, showing a message like this:

I0622 12:35:50.236346 1 utils.go:196] Pod wololo-67ccdcb765-xd7dq can't be scheduled
on spot-nodes, predicate failed: PodToleratesNodeTaints predicate mismatch, cannot put
forno/wololo-67ccdcb765-xd7dq on template-node-for-spot-nodes-176367186976534907,
reason: node(s) had taints that the pod didn't tolerate

If I scale one node manually on AWS or terminate the tainted node, things get back to work again.
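
To make the log message above more concrete, here is a minimal sketch of the kind of check that fails. It assumes the template node inherited the taint from the one remaining node in the group. The types and the `toleratesAll` function are hypothetical and heavily simplified: real toleration matching also considers operator, value, and effect, which this sketch leaves out.

```go
package main

import "fmt"

// Taint and Toleration are simplified stand-ins for the Kubernetes types.
type Taint struct {
	Key    string
	Effect string
}

type Toleration struct {
	Key string
}

// toleratesAll reports whether every NoSchedule taint is matched by a
// toleration with the same key. Taints with other effects are skipped.
func toleratesAll(taints []Taint, tolerations []Toleration) bool {
	for _, taint := range taints {
		if taint.Effect != "NoSchedule" {
			continue
		}
		matched := false
		for _, tol := range tolerations {
			if tol.Key == taint.Key {
				matched = true
				break
			}
		}
		if !matched {
			return false
		}
	}
	return true
}

func main() {
	// Template node copied from a node that still carries the taint.
	templateTaints := []Taint{{Key: "NodeWithImpairedVolumes", Effect: "NoSchedule"}}

	// A pod without tolerations does not fit the tainted template.
	fmt.Println(toleratesAll(templateTaints, nil)) // prints "false"

	// With the taint removed from the template, the same pod fits.
	fmt.Println(toleratesAll(nil, nil)) // prints "true"
}
```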

@johanneswuerbach
Contributor Author

johanneswuerbach commented Jun 22, 2020

That is actually what this PR is supposed to solve, and it did solve it for us.

As a workaround, adding --ignore-taint=NodeWithImpairedVolumes should also fix your issue with the current autoscaler.
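
For readers unfamiliar with repeatable flags, the sketch below shows one way a flag such as `--ignore-taint` could be collected using only Go's standard `flag` package. It is not the autoscaler's actual flag handling; `multiString` and `parseIgnoredTaints` are hypothetical names introduced for this example.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// multiString collects every occurrence of a repeatable string flag.
type multiString []string

// String and Set implement the flag.Value interface.
func (m *multiString) String() string { return strings.Join(*m, ",") }

func (m *multiString) Set(v string) error {
	*m = append(*m, v)
	return nil
}

// parseIgnoredTaints parses args and returns the taint keys passed via
// --ignore-taint, in the order they were given.
func parseIgnoredTaints(args []string) ([]string, error) {
	fs := flag.NewFlagSet("example", flag.ContinueOnError)
	var ignored multiString
	fs.Var(&ignored, "ignore-taint", "taint key to ignore in node templates (repeatable)")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return ignored, nil
}

func main() {
	keys, err := parseIgnoredTaints([]string{"--ignore-taint=NodeWithImpairedVolumes"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(keys) // prints "[NodeWithImpairedVolumes]"
}
```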

@edsonmarquezani

@joshbranham Thanks for the advice!

@johanneswuerbach
Contributor Author

@mwielgus @Jeffwan anything missing here?

@MaciekPytel
Contributor

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MaciekPytel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 6, 2020
@k8s-ci-robot k8s-ci-robot merged commit af750c3 into kubernetes:master Jul 6, 2020
k8s-ci-robot added a commit that referenced this pull request Jul 27, 2020
…of-#3040-upstream-cluster-autoscaler-release-1.18

Automated cherry pick of #3040: Ignore AWS NodeWithImpairedVolumes taint