TaintNodesByCondition causes pods to become NodeLost #67536

Closed
aleksandra-malinowska opened this issue Aug 17, 2018 · 4 comments
Labels
sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@aleksandra-malinowska
Contributor

aleksandra-malinowska commented Aug 17, 2018

In autoscaling tests, we verify that when a node suddenly disappears from the cluster, Cluster Autoscaler will add another one to replace it. With TaintNodesByCondition (enabled by default in #62111), this is now failing:

https://k8s-testgrid.appspot.com/sig-autoscaling-cluster-autoscaler#gci-gce-autoscaling

Every time a node disappears, a subset of pods become NodeLost and aren't rescheduled on a new node.

/cc @k82cn @bsalamat
/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Aug 17, 2018
@k82cn k82cn self-assigned this Aug 17, 2018
@Huang-Wei
Member

@k82cn I can work on this.

k8s-github-robot pushed a commit that referenced this issue Aug 28, 2018
Automatic merge from submit-queue (batch tested with PRs 64597, 67854, 67734, 67917, 67688). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

fix an issue that scheduling doesn't respect NodeLost status of a node

**What this PR does / why we need it**:

- if a Node is in Unknown status, apply the unreachable taint with NoSchedule effect
- some internal data structure refactoring
- update unit test
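
The first bullet can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: the `Taint` type, the `taintsForReady` helper, and the constants are hypothetical stand-ins defined locally, and the `not-ready` mapping for a `False` Ready condition is an assumption added for contrast. Only the `Unknown` to unreachable/NoSchedule mapping comes from the description above.

```go
package main

import "fmt"

// ConditionStatus mirrors the three values a node condition can report.
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// Taint is a simplified, hypothetical stand-in for the Kubernetes taint type.
type Taint struct {
	Key    string
	Effect string
}

const (
	// Taint keys as used by Kubernetes for not-ready and unreachable nodes.
	taintNotReady    = "node.kubernetes.io/not-ready"
	taintUnreachable = "node.kubernetes.io/unreachable"
	effectNoSchedule = "NoSchedule"
)

// taintsForReady returns the NoSchedule taints implied by a node's Ready
// condition. The Unknown case is the one described in the PR; the False
// case is an assumption included for comparison.
func taintsForReady(status ConditionStatus) []Taint {
	switch status {
	case ConditionFalse:
		return []Taint{{Key: taintNotReady, Effect: effectNoSchedule}}
	case ConditionUnknown:
		return []Taint{{Key: taintUnreachable, Effect: effectNoSchedule}}
	default:
		// A Ready node gets no condition-based taint.
		return nil
	}
}

func main() {
	for _, s := range []ConditionStatus{ConditionTrue, ConditionFalse, ConditionUnknown} {
		fmt.Printf("Ready=%s -> %v\n", s, taintsForReady(s))
	}
}
```

With a mapping like this, a node that stops reporting (Ready becomes Unknown) carries a NoSchedule taint, so the scheduler stops placing new pods on it.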

**Which issue(s) this PR fixes**:
Fixes #67733, and very likely #67536

**Special notes for your reviewer**:

See detailed reproducing steps in #67733.

**Release note**:
```release-note
Apply the unreachable taint to a node when it loses its network connection.
```
@Huang-Wei
Member

#67734 just got merged. Let's check the results of the autoscaling tests tomorrow: https://k8s-testgrid.appspot.com/sig-autoscaling-cluster-autoscaler#gci-gce-autoscaling.

@Huang-Wei
Member

@aleksandra-malinowska the latest autoscaling test run is green. I think we can close this issue.

@aleksandra-malinowska
Contributor Author

Thanks for fixing this!


4 participants