Ignore AWS NodeWithImpairedVolumes taint #3040
/assign @losipiuk
/area provider/aws
Sorry, maybe @Jeffwan would be a better assignee, as this is AWS specific.
@johanneswuerbach Code changes look good to me. /lgtm In your case, what is the root cause of the volume taking that long to attach?
Not sure. It looks like an issue in AWS, but it might also be k8s related. We currently run 1.16 provisioned using kops, and I requested a backport of a fix that might be related: kubernetes/kubernetes#89894
@johanneswuerbach Hmm, good to know. I will ask a team member to track this change and try to get it approved.
@Jeffwan any update on this?
/assign @aleksandra-malinowska
@Jeffwan ping :)
/lgtm Needs someone to approve this change. @mwielgus
I've seen a side effect of it. I have an instance group whose applications are shut down on a schedule. When a node is tainted like this, that specific node won't be terminated and will be left as the only one in the instance group. The day after, when the pods are up again, the autoscaler won't scale nodes up because of this taint, showing a message like this.
If I scale up one node manually on AWS or terminate the tainted node, things start working again.
That is actually what the PR is supposed to solve, and it did solve it for us. As a workaround adding
@joshbranham Thanks for the advice!
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: MaciekPytel
…of-#3040-upstream-cluster-autoscaler-release-1.18 Automated cherry pick of #3040: Ignore AWS NodeWithImpairedVolumes taint
The NodeWithImpairedVolumes taint is applied to a node on AWS when a volume is stuck in the attaching state for too long. The taint was introduced in k8s a while ago: kubernetes/kubernetes#55558
Add it to the ignored node condition taints, so that a node carrying it can still serve as a template when the autoscaler evaluates a scale-up.
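For readers unfamiliar with the idea, here is a minimal sketch of what "ignoring" a condition taint could look like. It is not the code from this PR: the Taint struct is a simplified stand-in for the Kubernetes type, and ignoredConditionTaints and sanitizeTaints are hypothetical names chosen for illustration. The only detail taken from the PR is the taint key NodeWithImpairedVolumes. The other key in the list is a standard Kubernetes condition taint included as an example.

```go
package main

import "fmt"

// Taint is a simplified stand-in for the Kubernetes core/v1 Taint type
// (assumption: only the fields needed for this illustration are kept).
type Taint struct {
	Key    string
	Value  string
	Effect string
}

// ignoredConditionTaints is a hypothetical list of taint keys that describe a
// transient node condition rather than a permanent property of a node group.
var ignoredConditionTaints = map[string]bool{
	"node.kubernetes.io/not-ready": true,
	"NodeWithImpairedVolumes":      true, // the key this PR adds to the ignore list
}

// sanitizeTaints returns a copy of taints without the ignored condition
// taints, so a node template built from a tainted node does not make every
// pending pod look unschedulable on new nodes of the same group.
func sanitizeTaints(taints []Taint) []Taint {
	kept := make([]Taint, 0, len(taints))
	for _, t := range taints {
		if ignoredConditionTaints[t.Key] {
			continue
		}
		kept = append(kept, t)
	}
	return kept
}

func main() {
	nodeTaints := []Taint{
		{Key: "NodeWithImpairedVolumes", Value: "true", Effect: "NoSchedule"},
		{Key: "dedicated", Value: "gpu", Effect: "NoSchedule"},
	}
	for _, t := range sanitizeTaints(nodeTaints) {
		fmt.Printf("kept taint: %s=%s:%s\n", t.Key, t.Value, t.Effect)
	}
}
```

Under these assumptions, only the dedicated taint survives, which matches the behaviour described in the thread: a node stuck with the impaired-volumes taint no longer blocks scale-up of its instance group.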