TaintNodesByCondition causes pods to become NodeLost #67536

aleksandra-malinowska · 2018-08-17T10:10:15Z

In autoscaling tests, we verify that when a node suddenly disappears from the cluster, Cluster Autoscaler will add another one to replace it. With TaintNodesByCondition (enabled by default in #62111), this is now failing:

https://k8s-testgrid.appspot.com/sig-autoscaling-cluster-autoscaler#gci-gce-autoscaling

Every time a node disappears, a subset of pods become NodeLost and aren't rescheduled on a new node.

/cc @k82cn @bsalamat
/sig node

Huang-Wei · 2018-08-21T02:31:14Z

@k82cn I can work on this.

Automatic merge from submit-queue (batch tested with PRs 64597, 67854, 67734, 67917, 67688). If you want to cherry-pick this change to another branch, please follow the instructions <a href="https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md">here</a>.

fix an issue that scheduling doesn't respect NodeLost status of a node

**What this PR does / why we need it**:
- if Node is in Unknown status, apply unreachable taint with NoSchedule effect
- some internal data structure refactoring
- update unit test

**Which issue(s) this PR fixes**:
Fixes #67733, and very likely #67536

**Special notes for your reviewer**:
See detailed reproducing steps in #67733.

**Release note**:
```release-note
Apply unreachable taint to a node when it lost network connection.
```
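For readers following the thread, the behavior described in the PR summary can be sketched roughly as below. This is a minimal illustration, not the code merged in #67734: the type and function names (`ConditionStatus`, `Taint`, `taintsForReady`) are hypothetical stand-ins for the real Kubernetes API types, and the `Ready=False` branch is an assumption added for contrast. The only part taken from the PR text is that a node whose status is Unknown gets the unreachable taint with the NoSchedule effect.

```go
package main

import "fmt"

// ConditionStatus is a hypothetical stand-in for the status of a node's
// Ready condition (True, False, or Unknown).
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// Taint is a simplified, hypothetical version of a node taint.
type Taint struct {
	Key    string
	Effect string
}

const (
	taintNotReady    = "node.kubernetes.io/not-ready"
	taintUnreachable = "node.kubernetes.io/unreachable"
	effectNoSchedule = "NoSchedule"
)

// taintsForReady returns the scheduling taints implied by the Ready condition.
// Unknown -> unreachable/NoSchedule follows the PR summary; the False branch
// is an assumption included only for contrast.
func taintsForReady(status ConditionStatus) []Taint {
	switch status {
	case ConditionUnknown:
		return []Taint{{Key: taintUnreachable, Effect: effectNoSchedule}}
	case ConditionFalse:
		return []Taint{{Key: taintNotReady, Effect: effectNoSchedule}}
	default:
		// A Ready node needs no condition-based scheduling taint.
		return nil
	}
}

func main() {
	for _, s := range []ConditionStatus{ConditionTrue, ConditionFalse, ConditionUnknown} {
		fmt.Printf("Ready=%s -> %v\n", s, taintsForReady(s))
	}
}
```

Under this reading, a node that silently drops off the network ends up with a NoSchedule taint, so the scheduler stops placing new pods on it.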

Huang-Wei · 2018-08-28T05:23:49Z

#67734 just got merged. Let's check the results of the autoscaling tests tomorrow: https://k8s-testgrid.appspot.com/sig-autoscaling-cluster-autoscaler#gci-gce-autoscaling

Huang-Wei · 2018-08-28T16:48:23Z

@aleksandra-malinowska the latest autoscaling test run is green. I think we can close this issue.

aleksandra-malinowska · 2018-08-28T17:29:14Z

Thanks for fixing this!

k8s-ci-robot added the sig/node label Aug 17, 2018

k82cn self-assigned this Aug 17, 2018

This was referenced Aug 21, 2018

[WIP] fix an issue that scheduling doesn't consider NodeLost status of a node #67677 (Closed)

node wasn't tainted unreachable when TaintNodesByCondition is enabled #67733 (Closed)

fix an issue that scheduling doesn't respect NodeLost status of a node #67734 (Merged)

aleksandra-malinowska closed this as completed Aug 28, 2018
