Ignore AWS NodeWithImpairedVolumes taint #3040

johanneswuerbach · 2020-04-12T13:03:28Z

The NodeWithImpairedVolumes taint is applied to a node on AWS when a volume is stuck in the attaching state for too long.

The taint was introduced in k8s a while ago in kubernetes/kubernetes#55558.

Add it to the ignored node condition taints.
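To illustrate the idea, here is a minimal sketch of dropping ignored taints before a node is used as a template. It is not the autoscaler's actual code: the `Taint` class, `IGNORED_CONDITION_TAINTS`, and `sanitize_taints` are hypothetical names, and the `NoSchedule` effect and the `dedicated` taint are assumptions used only as sample data.

```python
from dataclasses import dataclass

# Taint key named in this PR.
IMPAIRED_VOLUMES_TAINT = "NodeWithImpairedVolumes"

# Hypothetical ignore list: taint keys treated as transient node conditions.
IGNORED_CONDITION_TAINTS = {IMPAIRED_VOLUMES_TAINT}


@dataclass(frozen=True)
class Taint:
    """Simplified stand-in for a Kubernetes node taint (assumption)."""
    key: str
    effect: str


def sanitize_taints(taints, ignored=IGNORED_CONDITION_TAINTS):
    """Return only the taints whose key is not in the ignore list."""
    return [t for t in taints if t.key not in ignored]


# Sample node: one transient condition taint, one user-defined taint.
node_taints = [
    Taint("NodeWithImpairedVolumes", "NoSchedule"),  # effect is an assumption
    Taint("dedicated", "NoSchedule"),                # hypothetical user taint
]

template_taints = sanitize_taints(node_taints)
print([t.key for t in template_taints])
```

With the condition taint filtered out, a template built from this node would carry only the taints that new nodes in the group are actually expected to have.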

johanneswuerbach · 2020-04-12T13:04:26Z

/assign @losipiuk

johanneswuerbach · 2020-04-12T19:33:31Z

/area provider/aws

johanneswuerbach · 2020-04-20T20:59:09Z

Sorry, maybe @Jeffwan would be a better assignee as this is AWS specific.

Jeffwan · 2020-04-22T21:59:14Z

@johanneswuerbach

Code changes look good to me

In your case, what's the root cause of the volume taking that long to attach?

Jeffwan · 2020-04-22T22:00:23Z

@johanneswuerbach Code changes look good to me

/lgtm

In your case, what's the root cause of the volume taking that long to attach?

johanneswuerbach · 2020-04-22T22:27:17Z

Not sure. It looks like an issue in AWS, but it might also be k8s related. We currently run 1.16 provisioned using kops, and I requested a backport of a fix that might be related: kubernetes/kubernetes#89894

Jeffwan · 2020-04-23T01:26:22Z

@johanneswuerbach em, good to know, I will ask my team member to track this change and try to get it approved

johanneswuerbach · 2020-05-13T21:02:17Z

@Jeffwan any update on this?

johanneswuerbach · 2020-05-29T20:29:49Z

/assign @aleksandra-malinowska

mwielgus · 2020-06-01T15:07:26Z

@Jeffwan ping :)

Jeffwan · 2020-06-12T16:27:00Z

/lgtm

needs someone to approve this change. @mwielgus

edsonmarquezani · 2020-06-22T13:29:02Z

I've seen a side effect of it.

I have an instance group whose applications are shut down on a certain schedule. When there's a node tainted like this, that specific node won't be terminated and will be left as the only one in that instance group. The day after, when the pods are up again, the autoscaler WON'T scale nodes up because of this taint, showing a message like this:

I0622 12:35:50.236346 1 utils.go:196] Pod wololo-67ccdcb765-xd7dq can't be scheduled
on spot-nodes, predicate failed: PodToleratesNodeTaints predicate mismatch, cannot put
forno/wololo-67ccdcb765-xd7dq on template-node-for-spot-nodes-176367186976534907,
reason: node(s) had taints that the pod didn't tolerate

If I scale up one node manually on AWS or terminate the tainted node, things start working again.

johanneswuerbach · 2020-06-22T13:32:56Z

That is actually what this PR is supposed to solve, and it solved it for us.

As a workaround, adding --ignore-taint=NodeWithImpairedVolumes should also fix your issue with the current autoscaler.
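As a rough sketch of why the flag helps, the example below models the scale-up check from the log message above. It is an illustration under stated assumptions, not the autoscaler's implementation: `parse_ignored_taints` and `pod_fits_template` are hypothetical helpers, the parser only mimics a repeatable `--ignore-taint` flag, and tolerations are simplified to plain key matching.

```python
import argparse


def parse_ignored_taints(argv):
    """Collect taint keys from repeated --ignore-taint flags (hypothetical parser)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--ignore-taint", action="append", default=[], dest="ignored")
    return set(parser.parse_args(argv).ignored)


def pod_fits_template(template_taint_keys, pod_toleration_keys, ignored):
    """Return True if no non-ignored taint on the template node blocks the pod.

    Simplification: a taint is tolerated when the pod lists the same key.
    """
    blocking = [
        key
        for key in template_taint_keys
        if key not in ignored and key not in pod_toleration_keys
    ]
    return not blocking


# The only remaining node in the group carries the AWS taint, so the
# template node derived from it carries the taint as well.
template_taints = ["NodeWithImpairedVolumes"]
pod_tolerations = []  # the pending pod tolerates nothing

without_flag = pod_fits_template(
    template_taints, pod_tolerations, parse_ignored_taints([])
)
with_flag = pod_fits_template(
    template_taints,
    pod_tolerations,
    parse_ignored_taints(["--ignore-taint=NodeWithImpairedVolumes"]),
)
print(without_flag, with_flag)  # False True
```

Without the flag, the taint copied onto the template node makes the pod look unschedulable on any new node, so no scale-up happens. With the key ignored, the check passes.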

edsonmarquezani · 2020-06-24T02:05:57Z

@joshbranham Thanks for the advice!

johanneswuerbach · 2020-07-06T07:11:52Z

@mwielgus @Jeffwan anything missing here?

MaciekPytel · 2020-07-06T09:09:04Z

/approve

k8s-ci-robot · 2020-07-06T09:09:21Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MaciekPytel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cluster-autoscaler/OWNERS~~ [MaciekPytel]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.) and size/XS (Denotes a PR that changes 0-9 lines, ignoring generated files.) labels Apr 12, 2020

k8s-ci-robot requested review from aleksandra-malinowska and losipiuk April 12, 2020 13:03

Ignore AWS NodeWithImpairedVolumes taint

86ba8ee

johanneswuerbach force-pushed the aws-impaired-vols branch from 953748c to 86ba8ee April 12, 2020 13:04

k8s-ci-robot assigned losipiuk Apr 12, 2020

k8s-ci-robot added the area/provider/aws (Issues or PRs related to aws provider) label Apr 12, 2020

k8s-ci-robot assigned Jeffwan Apr 22, 2020

k8s-ci-robot added the lgtm ("Looks good to me", indicates that a PR is ready to be merged.) label Apr 22, 2020

Jeffwan approved these changes Apr 22, 2020

k8s-ci-robot assigned aleksandra-malinowska May 29, 2020

k8s-ci-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files.) label Jul 6, 2020

k8s-ci-robot merged commit af750c3 into kubernetes:master Jul 6, 2020

johanneswuerbach mentioned this pull request Jul 27, 2020

Automated cherry pick of #3040: Ignore AWS NodeWithImpairedVolumes taint #3359

Merged

k8s-ci-robot added a commit that referenced this pull request Jul 27, 2020

Merge pull request #3359 from johanneswuerbach/automated-cherry-pick-…

f05cec0

…of-#3040-upstream-cluster-autoscaler-release-1.18 Automated cherry pick of #3040: Ignore AWS NodeWithImpairedVolumes taint
