All Pods on an unreachable node are marked NotReady once the node turns Unknown #77864

Open
chaudyg opened this issue May 14, 2019 · 9 comments

@chaudyg commented May 14, 2019

What happened:
We hit a corner case in our Kubernetes cluster:

  1. A network partition occurred between the kubelets and the apiserver: ALL our nodes were unable to report to the apiserver (a misconfiguration of the load balancer between the kubelets and the apiserver).
  2. As expected, after 40 seconds, the status of all the nodes turned to Unknown.
  3. However, at the same time, all the pods on all the nodes were marked NotReady. The available endpoints dropped to zero for every Service running in the cluster, 100% of traffic suddenly failed at our ingress, and 100% of DNS queries failed inside the cluster.
  4. As expected, after 5 minutes, some of the pods started being evicted. However, eviction stopped because this was a full-cluster outage.

What you expected to happen:

One of the following two options:

  1. Pods should NOT be marked NotReady when the Node condition turns Unknown. Pods should be marked unready only when they are actually being evicted from the cluster.

The documentation mentions the following corner case.

> The corner case is when all zones are completely unhealthy (i.e. there are no healthy nodes in the cluster). In such case, the node controller assumes that there’s some problem with master connectivity and stops all evictions until some connectivity is restored.
https://kubernetes.io/docs/concepts/architecture/nodes/

  2. OR, the documentation should be updated to explain this behaviour.

How to reproduce it (as minimally and precisely as possible):

  • Stop the kubelet on one of the nodes
  • Wait 40 seconds
  • Watch all the pods running on that node be marked as NotReady (and removed from the Service endpoints); a minimal client-go sketch to observe this is shown below
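
For anyone reproducing this, here is a minimal client-go sketch (assuming a recent client-go and a kubeconfig at the default path; the names are illustrative) that polls the Ready condition of the pods on one node, so you can watch them flip after stopping the kubelet:

```go
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: watch-pod-ready <node-name>")
		os.Exit(1)
	}
	node := os.Args[1] // the node whose kubelet you stopped

	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	for {
		// List every pod scheduled on the node and print its Ready condition.
		pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
			FieldSelector: "spec.nodeName=" + node,
		})
		if err != nil {
			panic(err)
		}
		for _, p := range pods.Items {
			for _, c := range p.Status.Conditions {
				if c.Type == corev1.PodReady {
					fmt.Printf("%s/%s Ready=%s\n", p.Namespace, p.Name, c.Status)
				}
			}
		}
		time.Sleep(5 * time.Second)
	}
}
```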

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.11,1.12,1.13,1.14
  • Cloud provider or hardware configuration: NA
  • OS (e.g: cat /etc/os-release): NA
  • Kernel (e.g. uname -a): NA
  • Install tools: NA
  • Network plugin and version (if this is a network-related bug): NA
  • Others:
@chaudyg (Author) commented May 14, 2019

@kubernetes/sig-node-bugs
@kubernetes/sig-cluster-lifecycle

@k8s-ci-robot added sig/node and removed needs-sig labels May 14, 2019

@k8s-ci-robot (Contributor) commented May 14, 2019

@chaudyg: Reiterating the mentions to trigger a notification:
@kubernetes/sig-node-bugs

In response to this:

> @kubernetes/sig-node-bugs
> @kubernetes/sig-cluster-lifecycle

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@neolit123 (Member) commented May 14, 2019

/sig apps

@mattjmcnaughton (Contributor) commented May 14, 2019

Thanks for your detailed error report @chaudyg :)

My guess is that, given the integration test, it is intentional that pods become NotReady when the node is marked NotReady.

I'm wondering, would that behavior have been acceptable if the documentation had been clearer? How could we update the documentation so there are fewer surprises?

@chaudyg (Author) commented May 14, 2019

Thx @mattjmcnaughton for your reply.

It seems like intentional behaviour, but I am failing to grasp the full rationale behind it.

What I am convinced of is that having the apiserver unreachable for 40+ seconds should NOT result in a full cluster outage.

There are 2 scenarios in this part of the code:

  • The node turns NotReady. In this case I can understand why marking pods as NotReady quickly makes sense: the controller knows the underlying node is unhealthy, and assumes the pods on it won't perform as expected. I would recommend we document this behaviour.
  • The node turns Unknown. In this case the controller cannot make an informed decision. The overall rationale in the controller in an "unknown" scenario seems to be that it acts slowly and carefully. It first keeps the status quo by granting a 5m grace period to the node. Then it starts recreating some of the pods slowly (eviction-rate flag), but only if some nodes are still healthy. If none of the nodes are ready, it won't make any decision. That feels like a sensible approach; removing all the endpoints feels like a rushed decision. I would recommend we don't mark the pods as NotReady in this case (see the sketch after this list).
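
This is not the real node lifecycle controller code, just a minimal Go sketch (the types and names are illustrative) of the two paths described above, including the handling I am proposing for the Unknown case:

```go
// Simplified sketch of the two code paths described above; illustrative
// stand-ins, not Kubernetes internals.
package nodehealth

import "time"

type NodeCondition string

const (
	NodeReady    NodeCondition = "Ready"
	NodeNotReady NodeCondition = "NotReady"
	NodeUnknown  NodeCondition = "Unknown"
)

// Node is a stripped-down stand-in for what the controller tracks per node.
type Node struct {
	Name      string
	Condition NodeCondition
	Since     time.Time // when the current condition was first observed
}

// Decision is what the controller would do with the pods on one node.
type Decision int

const (
	KeepPodsReady Decision = iota
	MarkPodsNotReady
	QueueForEviction
)

// decide sketches the proposed handling. evictionTimeout corresponds to
// --pod-eviction-timeout; anyNodeHealthy reports whether any node in the
// zone is still Ready.
func decide(n Node, now time.Time, evictionTimeout time.Duration, anyNodeHealthy bool) Decision {
	switch n.Condition {
	case NodeNotReady:
		// The kubelet itself reports the node as unhealthy: mark the pods
		// NotReady right away so they drop out of Service endpoints.
		return MarkPodsNotReady
	case NodeUnknown:
		// The control plane has lost contact with the kubelet: keep the
		// status quo through the grace period, then evict slowly, and only
		// if some nodes are still healthy (otherwise assume a master
		// connectivity problem and do nothing).
		if now.Sub(n.Since) < evictionTimeout || !anyNodeHealthy {
			return KeepPodsReady
		}
		return QueueForEviction
	default:
		return KeepPodsReady
	}
}
```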
@hjacobs commented May 14, 2019

@chaudyg thanks for looping me in, please submit the full postmortem on https://github.com/hjacobs/kubernetes-failure-stories when ready 👏

@mattjmcnaughton (Contributor) commented May 15, 2019

> It seems like intentional behaviour, but I am failing to grasp the full rationale behind it. […] I would recommend we don't mark the pods as NotReady in this case.

Thanks for the clarification @chaudyg :)

To ensure I'm understanding, the second scenario you describe (i.e. "Unknown node status results in grace period then slow eviction, provided other nodes are still healthy") is the way you believe it should work, not the way it actually works, correct? From the code, it looks like the way it works is that all Pods are marked NotReady if the node is anything except healthy.
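
For reference, a minimal sketch of what that observed behaviour amounts to, as I read it (illustrative only, not the actual implementation):

```go
// The condition check described above: any non-True node Ready condition
// (False or Unknown) results in every pod on that node having its Ready
// condition set to False, which removes it from Service endpoints.
// Illustrative sketch, not Kubernetes source.
package nodehealth

import corev1 "k8s.io/api/core/v1"

// shouldMarkPodsNotReady returns true for any node Ready status other than
// True, i.e. it treats NotReady and Unknown identically.
func shouldMarkPodsNotReady(nodeReady corev1.ConditionStatus) bool {
	return nodeReady != corev1.ConditionTrue
}
```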

I certainly agree that your preferred behavior sounds reasonable in the use case you described. Thinking about other error scenarios or cluster use cases, can we think of any times when it wouldn't be reasonable? Are there controls we could expose to the cluster admin to make this type of behavior more flexible?

Really interested to hear the community's thoughts on this :)

@chaudyg (Author) commented May 15, 2019

> To ensure I'm understanding, the second scenario you describe (i.e. "Unknown node status results in grace period then slow eviction, provided other nodes are still healthy") is the way you believe it should work, not the way it actually works, correct?

Correct. It's the way I believe it should work. It's also what the documentation currently explains; "all Pods are marked NotReady" is undocumented behaviour.

> can we think of any times that it wouldn't be reasonable?

Maybe the following scenario:

Someone trips over a network cable, or a node suddenly stops.

  • If you run in the cloud: this is a non-problem. The cloud provider will most likely detect this, and the cloud-provider controller will remove the failing node from the Kubernetes API. Then, all the pods on that node will be marked as terminating.
  • However, if you run on bare metal: after 40 seconds, the node will be marked as Unknown, but the pods running on that node will keep receiving traffic until --pod-eviction-timeout is reached (defaults to 5m). Then, and only then, will those pods be added to the eviction queue (a rough timeline is sketched below).
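
A back-of-the-envelope timeline for that bare-metal case, assuming the default kube-controller-manager settings of --node-monitor-grace-period=40s and --pod-eviction-timeout=5m (the numbers are illustrative, not read from a live cluster):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Assumed defaults: the node is marked Unknown once the kubelet has not
	// reported for the node monitor grace period, and pods are queued for
	// eviction once the eviction timeout has also elapsed.
	nodeMonitorGracePeriod := 40 * time.Second // --node-monitor-grace-period
	podEvictionTimeout := 5 * time.Minute      // --pod-eviction-timeout

	fmt.Println("node marked Unknown after:", nodeMonitorGracePeriod)                         // 40s
	fmt.Println("pods queued for eviction after:", nodeMonitorGracePeriod+podEvictionTimeout) // 5m40s
}
```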

> Really interested to hear the community's thoughts on this :)

@freehan you wrote the code to update pod status once the node becomes NotReady. Would you be willing to share your opinion/experience?

@derekwaynecarr (Member) commented May 16, 2019

Keep in mind pods still have to follow graceful termination, and if those pods are backed by a workload controller like a StatefulSet, the pods will not just get recreated. At this time, an administrator would have to force delete those pods (as the platform right now is not able to detect whether the node is network partitioned or powered off, but work is in progress to enable this).
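
For completeness, a force delete via client-go would look roughly like the sketch below (illustrative only; the namespace and pod name are placeholders, and most admins would simply run `kubectl delete pod <name> --grace-period=0 --force`):

```go
package main

import (
	"context"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// A grace period of zero tells the apiserver to remove the pod object
	// immediately, without waiting for the (unreachable) kubelet to confirm.
	zero := int64(0)
	err = client.CoreV1().Pods("default").Delete(context.TODO(), "my-stuck-pod",
		metav1.DeleteOptions{GracePeriodSeconds: &zero})
	if err != nil {
		panic(err)
	}
}
```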
