
Calico will not restore traffic on node return until BIRD's "Graceful restart". #2211

Closed
e-pirate opened this issue Sep 27, 2018 · 27 comments

@e-pirate

Hello!
We are building a redundant HA cluster on Kubernetes and using Calico as the network layer. Recently we were testing ingress nodes and found that under some conditions Calico does not act "as expected". We have two nodes, ingress1 and ingress2, that are dedicated to ingress controllers (nginx-ingress in our case) by setting appropriate roles on these nodes. These nodes run only 3 pods: calico-node, nginx-ingress-controller and default-http-backend. Both of these nodes run Keepalived, which watches (via the Kubernetes API) whether the nginx-ingress-controller pod is Running and Ready, so that the node can accept the VIP for external traffic to come in and switch the VIP to the other node in case of failure.
What we found is that if one of these ingress nodes becomes unavailable (we simply switch the network off on the node), the other node will accept traffic into the cluster via both the NodePort and ingress balancer mechanisms. Then we wait until Kubernetes marks the pods on the "deactivated" node as NodeLost and switch it back on. After calico-node goes through its default "getting alive" procedures, traffic starts to flow through this node almost immediately after the status changes to "Running" and both containers of the calico-node pod become ready (we have the BIRD checks off). Watching the node status with calicoctl on that node during the "return" shows STATE going "up" almost instantly after calico-node restarts. That works just as expected. Please find logs of the node going through this test procedure in
diags-20180927_155249.tar.gz.
Everything changes if we put both nodes down at the same time. We switch the network off on both ingress nodes, wait until both nodes are marked NotReady and the calico-node pods are switched to the NodeLost state, and then switch the network back on on one of the nodes. This node returns shortly, gets marked Ready, the calico-node pod restarts and becomes Running with both containers ready, the nginx-ingress-controller pod becomes Ready as well, and the VIP moves to this node shortly after. So we are all set and ready for traffic to flow into the cluster again, but this does not happen.
We investigated this a lot (mostly because we had no idea whom to suspect: kubelet, kube-proxy, Calico, someone else?). We found that if we start watching the calico-node log right after the pod restarts and becomes Running and Ready, among the Calico output we see lots of the usual BIRD messages showing that it is probing its neighbors and reporting them as available, with one node (the other ingress node that is down) unreachable. At the same time we were watching the node status with the calicoctl tool, and all the nodes except the one that is down were in the "wait" state with "Established" in the info column. Then, some time later (4 minutes, actually), after BIRD's "Graceful restart" happens, all available nodes switch to the "up" state instantly and traffic starts to flow as expected via both the NodePort and ingress balancer mechanisms.
We repeated this several times and always got the same result: routing is not restored and traffic does not start to flow until BIRD performs its "Graceful restart", if we put both ingress nodes down and then return only one of them. The logs of the "procedure" from the returning ingress node can be seen in
diags-20180927_154656.tar.gz.

We investigated further and found BIRD's -R option, which forces it to "restart" and do its magic. It turns out that BIRD will not "restart" and restore routing on forced process kills inside the live pod until it has gone through its "graceful restart wait" timeout, so we actually have to wait out the "graceful restart" timeout after BIRD starts (after the calico-node pod starts, actually).
Then we modified confd's template that generates the BIRD config and added "graceful restart wait 30" to the general section, while preventing BIRD from reaching its default 240-second timeout (and restoring routing in the original manner) by killing its process from time to time in the live pod (yep, that's it). After all was done, we killed BIRD for the last time and 30 seconds later all nodes switched to the "up" state and traffic started to flow.
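For illustration, the change boils down to one extra line in the global section of the generated bird.cfg (the router id below is just a placeholder; only the "graceful restart wait" line is the change described above):

# global section of the generated bird.cfg -- router id is a placeholder,
# only the "graceful restart wait" line is the actual change
router id 10.0.0.1;
graceful restart wait 30;    # BIRD's default is 240 seconds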
Then we made a ConfigMap with the modified template ("graceful restart wait 30"), updated the Calico manifest to replace the original template with the modified one, and rolled this out to the cluster, including the ingress nodes. Then we repeated our test procedure: switched the network off on both ingress nodes, waited for Kubernetes to recognize this, then switched one node back on, and 30 seconds after calico-node became Ready the traffic flow was restored. This is absolutely critical for ingress nodes, because currently it takes calico-node restart time + 4 minutes to get networking back after the node returns and all necessary conditions are met. In some cases this takes up to 10 minutes total due to Kubernetes pod restart timeouts. For that entire period the cluster is unavailable from outside.
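Roughly, the override looks like this (a sketch only; the in-container template path is an assumption and may differ between Calico versions):

# ConfigMap created from the modified bird.cfg.template, mounted over the
# default confd template in the calico-node container; the mountPath is an
# assumed location and should be checked against the image in use.
containers:
  - name: calico-node
    # ... existing container spec ...
    volumeMounts:
      - name: bird-template
        mountPath: /etc/calico/confd/templates/bird.cfg.template
        subPath: bird.cfg.template
volumes:
  - name: bird-template
    configMap:
      name: bird-template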

P.S. We are using APISTORE, but switching to ETCDSTORE makes no difference.

Expected Behavior

Routing is restored and traffic starts to pass through the node shortly after the node becomes available and calico-node restarts and gets Running.

Current Behavior

Routing is not restored and traffic does not start to flow through the node until calico-node restart time + BIRD's default 240-second "graceful restart" timeout, in case more than one (same role?) node goes down.

Possible Solution

Set BIRD's "graceful restart wait" timeout to something more meaningful than the default 240 seconds. Preferably, make this parameter available in an easier way than modifying confd's template for BIRD and storing it in a ConfigMap.

Steps to Reproduce (for bugs)

  1. Put two ingress nodes down, wait until Kubernetes marks the calico-node pods as "NodeLost";
  2. Bring only one back, try to access the cluster via NodePort on the returned node;
  3. Watch calico-node restart and become "Ready", then measure the time needed for traffic flow to be restored via that node (optionally watch the calico-node logs and node state with calicoctl on that node).

Context

Ingress controllers stay unavailable for 4 more minutes after one becomes available and all conditions are met, in the case where all of them fail and then only one comes back.

Your Environment

  • Calico version: 3.2.2
  • Orchestrator version: Kubernetes 1.11.2 (binary kubelet).
  • Operating System and version: CentOS 7.5
@e-pirate e-pirate changed the title Calico will not restore traffic on node return untill DBRD's "Graceful restart". Calico will not restore traffic on node return untill BIRD's "Graceful restart". Sep 27, 2018
@caseydavenport
Member

@e-pirate thanks for the detailed issue!

I think what you're seeing makes sense - BGP won't program routes to the dataplane until it has gracefully restarted and is confident it has the correct routing information.

It sounds like we need a way to make this timeout configurable for nodes performing critical routing duties, like ingress nodes. Perhaps we could add a field to the BGPConfiguration resource: https://docs.projectcalico.org/v3.2/reference/calicoctl/resources/bgpconfig
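For reference, a minimal BGPConfiguration today looks roughly like this; the existing fields are shown, and the graceful-restart knob in the comment is purely hypothetical (it does not exist as of v3.2), only illustrating where such a setting could live:

apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  logSeverityScreen: Info
  nodeToNodeMeshEnabled: true
  asNumber: 64512
  # hypothetical field for the proposal above, not implemented:
  # gracefulRestartWaitSeconds: 30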

@caseydavenport
Member

caseydavenport commented Sep 27, 2018

Another option here is for us to plug into k8s node status and remove "NotReady" nodes from our list of BGPPeers. We'd need to take care to think through the implications of that, but it might provide a more robust solution than simply dropping the GR timer.

I think that approach might be problematic though, since a node could be NotReady in the API but potentially still functional (e.g. just lost connectivity to the API). In that case, we wouldn't want to drop the connection.

@e-pirate
Author

e-pirate commented Sep 27, 2018

I will discuss the second approach with my colleagues tomorrow and maybe run some tests, so we can give a reasoned response on that. But IMHO reducing the default GR timeout to something like 30-60 seconds is a good starting point. There is no reason to wait 240 seconds for BIRD to finish GR. BIRD seems to establish connections to all BGP peers almost instantly after the Calico pod starts up and then wastes the rest of the timeout keeping the node unavailable. At least we have never seen a condition where such a long GR timeout was really needed. And providing a way to change it for some specific conditions is a good idea anyway.

@e-pirate
Author

The reason for reducing the GR timeout is that, IMHO, even if you implement a node deletion mechanism based on current node status, the case where a node goes down or becomes unavailable just before or during Calico startup on another node is still conceivable. It takes Kubernetes a significant amount of time to finally mark a node down, even in our tuned environment. So we will still get an unavailable peer in the BGP table and therefore BIRD sitting and waiting for its GR. On the other hand, there seems to be nothing bad about dropping the GR timeout.

@e-pirate
Author

e-pirate commented Oct 1, 2018

We discussed the approach based on node states, and I also asked some folks who run production clusters, and we came to the conclusion that in general it is a good idea. No one can recall a node with an unresponsive kubelet but with running pods. Folks say that a dead API server is more of a concern than a dead kubelet, but that is a different story. Since this will affect only starting nodes (starting/restarting Calico pods), the window in which this approach could falsely remove some nodes from the BGP peers table is very tight, and after a node with an unresponsive kubelet comes up again, it will be added back to the BGP table. On the other hand, it will prevent starting nodes from waiting in vain. But I think in some cases this mechanism may be unwanted, so administrators should be able to switch it off. To draw a line, IMHO both approaches should be implemented: the default BIRD GR timeout should be shortened to something tolerable like 60 seconds, with a direct and easy way to change it without modifying the BIRD template, and all "NotReady" nodes should be removed from the BGP peers table of starting nodes by default, with an option to switch this mechanism off (back to the current behavior) somewhere in the config for those who for some reason decide this is unacceptable, or for some odd use cases (static pods/containers with no kubelet maybe; we cannot imagine these use cases now, but there must be some). IMHO.

@rachappag

rachappag commented Oct 3, 2018

Looks like there are problems with the liveness probe too.
In my cluster one node became unresponsive and after some time the calico-node pods on the other nodes restarted many times. When examined, the liveness probe was failing. netstat on the node revealed that the listening port 9099 is only opened on the loopback IP 127.0.0.1. This problem happens only when some node is down in the cluster and the rest of the nodes suffer the skew.

# netstat -anp | grep "LISTEN" | grep 9099
tcp        0      0 127.0.0.1:9099          0.0.0.0:*               LISTEN      5157/calico-node

Help / remedy much appreciated.

@caseydavenport
Member

@rachappag can you post the output from the liveness probe?

e.g. kubectl describe pod <X>?

As far as I know, BGP connections are not considered as part of liveness, and so something else may be going wrong in the cluster.

@rachappag

@caseydavenport
List calico-node

# kc get pods -o wide | grep "calico-node"
calico-node-89ptz                                     2/2       Running            224        22h       X.Y.166.151   X.Y.166.151
calico-node-9lljn                                     2/2       Running            224        22h       X.Y.146.166     X.Y.146.166
calico-node-bwqjh                                     2/2       Running            224        22h       X.Y.254.137   X.Y.254.137
calico-node-h65xs                                     2/2       Running            224        22h       X.Y.254.198   X.Y.254.198
calico-node-zhtkl                                     0/2       Pending            0          22h       <none>         X.Y.254.52
# kc describe pod calico-node-9lljn
Events:
  Type     Reason     Age                  From                   Message
  ----     ------     ----                 ----                   -------
  Warning  Unhealthy  4m (x8069 over 22h)  kubelet, X.Y.146.166  Liveness probe failed: Get http://X.Y.146.166:9099/liveness: dial tcp X.Y.146.166:9099: connect: connection refused

The liveness probe socket is not opened on INADDR_ANY; it is opened on 127.0.0.1.
Due to this, the kubelet is not able to perform the liveness check.

# netstat -anp | grep LISTEN | grep 9099
tcp        0      0 127.0.0.1:9099          0.0.0.0:*               LISTEN      28228/calico-node

Liveness probe spec in calico node DS manifest

        livenessProbe:
          failureThreshold: 36
          httpGet:
            path: /liveness
            port: 9099
            scheme: HTTP
          periodSeconds: 10

corrected as:

        livenessProbe:
          failureThreshold: 36
          httpGet:
            host: localhost
            path: /liveness
            port: 9099
            scheme: HTTP
          periodSeconds: 10

And it is working now.

@rachappag

@caseydavenport
The actual problem I am facing is with the readiness probe: calico-node -bird-ready.
When we have a skew in the Calico mesh, for example one node is unresponsive, the readiness probe throws an error.
Node X.Y.254.52 is unresponsive.

# kc get pods -o wide | grep "calico-node"
calico-node-89ptz                                     2/2       Running            224        22h       X.Y.166.151   X.Y.166.151
calico-node-9lljn                                     2/2       Running            224        22h       X.Y.146.166     X.Y.146.166
calico-node-bwqjh                                     2/2       Running            224        22h       X.Y.254.137   X.Y.254.137
calico-node-h65xs                                     2/2       Running            224        22h       X.Y.254.198   X.Y.254.198
calico-node-zhtkl                                     0/2       Pending            0          22h       <none>         X.Y.254.52

If any other running calico-node pod is restarted, or its node is restarted, the readiness probe fails saying:

 # /bin/calico-node -bird-ready
calico/node is not ready: BIRD is not ready: BGP not established with X.Y.254.52/ #

I think the readiness probe calico-node -bird-ready should not account for unresponsive nodes in the mesh.

@caseydavenport
Member

The liveness probe socket is not opened on INADDR_ANY; it is opened on 127.0.0.1.

Yes, this is by design - the liveness probe should have host: localhost set in the manifest to tell the kubelet to use the local address.

I think the readiness probe calico-node -bird-ready should not account for unresponsive nodes in the mesh.

It's tricky, because for rolling updates we do want to consider BGP peers when checking readiness. This makes sure we wait for each node's graceful restart to complete before moving on to the next. If we didn't do that, there would be race conditions where traffic could be dropped.

In your case, why is one of the nodes unresponsive?

@rachappag

@caseydavenport
It is a test scenario in which we bring down one of the nodes in the cluster and restart the calico-node pod on the other node(s). The expectation is that the other calico-node pods should not be affected by the node which is down. The actual BIRD process waits for 240 seconds (the default) and then gives up on the neighbor which is unresponsive. It would be good if the readiness probe also tolerated such skew in the mesh; otherwise all other nodes in the mesh will suffer.
The node can become unresponsive in scenarios where the node gets rebooted due to patching, a crash, maintenance, etc. In some scenarios multiple nodes get rebooted, and if one gets stuck, the rest will also suffer.
When we moved to Calico 3.2.2 (and 3.2.3) from 3.1.3, this case failed due to the new readiness probe.

@caseydavenport
Member

It would be good if the readiness probe also tolerated such skew in the mesh; otherwise all other nodes in the mesh will suffer.

To be clear, the readiness probe does not affect the function of that node, it only reports a status. If the readiness probe fails on a node, that node should still function normally.

If a node is not reporting ready, it will pause a rolling update - this is to help protect against rolling out a bad configuration that could take down the cluster. In the scenario that you're doing a rolling update, you don't want to continue if your new nodes fail to establish peerings.

The node can become unresponsive

Given the above, I'm not sure what you mean by "unresponsive" in this case. Do you mean some nodes are not working in some way?

@e-pirate
Author

@caseydavenport do you have any plans/roadmap/decisions made on how to solve the original issue? We can help test a new (alpha/beta) version in our environment with the test cases that originally brought us here. We are open to helping. Even though we have semi-solved the issue by modifying the BIRD template and reducing the GR timeout to 30 seconds, we would much prefer a robust solution incorporating both approaches! This would drastically reduce downtime in many cases, not only for ingress controllers.

@caseydavenport
Member

@e-pirate one thing we haven't discussed till now is the use of route-reflectors.

I believe part of the issue you're experiencing is due to the use of Calico full-mesh BGP, which means that each node is dependent on each other node. One way to mitigate this is to use BGP route reflectors, which remove the full mesh dependency.

If you configure other nodes to be route reflectors, you should be able to restart both ingress nodes simultaneously and they won't get stuck waiting for the other one to come back up.

More information on how to do this in Calico v3.3 here: https://docs.projectcalico.org/v3.3/usage/configuration/bgp
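A rough sketch of that setup, following the linked doc (the node name, cluster ID, AS number and peer address below are placeholders; check the doc for the exact resources and fields supported by your Calico version):

# 1. Mark a chosen node as a route reflector:
#    calicoctl patch node my-rr-node -p '{"spec": {"bgp": {"routeReflectorClusterID": "224.0.0.1"}}}'
# 2. Peer all nodes with it and turn off the full mesh:
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: peer-to-rr
spec:
  peerIP: 10.0.0.10      # address of the route reflector node (placeholder)
  asNumber: 64512        # placeholder AS number
---
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  nodeToNodeMeshEnabled: false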

@e-pirate
Author

e-pirate commented Nov 3, 2018

We will take a look at this as soon as we get our hands on our cluster again and will report back what we find.

@caseydavenport
Member

Any updates on this?

@e-pirate
Author

@caseydavenport sorry, Casey, we are under severe load with a different project and have abandoned our K8S cluster. I will put this on our whiteboard and test as soon as I have a couple of hours. Sorry for that :(

@caseydavenport
Member

No worries, just wanted to check in and see if there was anything to do here. Totally understand the busy schedule :D

@seh

seh commented Dec 21, 2018

Today I was diagnosing what sounds like a related problem, in our case using Calico version 3.3 atop Kubernetes version 1.13.1 (in AWS).

We noticed that our GPU-bearing EC2 instances would drop off after having been up for about ten minutes. I tracked it down to the following causal chain:

  • calico-node pods' "calico" containers fail their readiness probe too many times due to complaints about BIRD and BGP peers not being reachable.
  • The Kubernetes node transitions from ready to not ready.
  • The node controller (or some other responsible party) evicts all pods on the node after a few minutes.
  • The cluster autoscaler finds the node without any non-DaemonSet-owned pods, and deletes the empty EC2 instance.

The calico-node pods are able to talk to each other; it's not like there's a total blackout on this BGP communication. On some nodes we'll see Kubernetes events posted that one calico-node container can't talk to three others, and then that complaint will drop to two others, then one, and then go away, presumably before the maximum number of probe failures is reached. In some cases, the event indicates that the pod can't talk to the same host on which it's running!

I am surprised that these calico-node pods failing their readiness check would cause the node to not be ready. Grasping here, maybe there's some essential CNI function that the pods stop performing that causes the kubelet to find itself once again nonfunctional.

We tried disabling the -bird-ready flag passed to calico-node in the readiness check, and that appears to fix the problem, but then we're left without some intended safeguard. Supposedly we shouldn't need to decommission nodes ourselves, but it seems that Calico is inducing a positive feedback loop: Due to failures in a readiness check, a node gets killed, which causes more readiness checks to fail, which causes more nodes to get killed.
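For concreteness, "disabling the -bird-ready flag" amounts to a readiness probe along these lines in the calico-node DaemonSet (a sketch only; the stock probe may differ slightly between manifest versions, and keeping -felix-ready is an assumption about what it normally includes):

        readinessProbe:
          exec:
            command:
              - /bin/calico-node
              - -felix-ready    # the BGP/BIRD check (-bird-ready) removed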

Why might calico-node be failing to find so many peers it expects to find? Does this warrant a fresh issue?

@caseydavenport
Member

  • calico-node pods' "calico" containers fail their readiness probe too many times due to complaints about BIRD and BGP peers not being reachable.
  • The Kubernetes node transitions from ready to not ready.

@seh Hmm, I agree with you that the relationship here is likely not causal - IIUC Calico readiness probes failing should not cause a Kubernetes node to go from Ready->NotReady. Likely, there is some other fundamental problem with the node that is causing both Calico and Kubernetes to go not ready. What does the node report as its reason for being unready?

Supposedly we shouldn't need to decommission nodes ourselves

Yes, I agree with this. If you're still seeing old nodes in the Calico datastore, this is likely not working as expected. It might be a configuration issue. Do you still see the old nodes via calicoctl after they are deleted from k8s? Is the node controller running in the kube-controllers pod?

In some cases, the event indicates that the pod can't talk to the same host on which it's running! ...
Why might calico-node be failing to find so many peers it expects to find? Does this warrant a fresh issue?

That's really bizarre. I think this probably does warrant a fresh issue with some more diags attached.

@caseydavenport
Member

Any updates on this?

@seh

seh commented Apr 4, 2019

Whether you consider this fortunate or unfortunate: no, I haven't been able to reproduce it. It's good that it hasn't happened again since, but bad that I can't explain what was happening back then.

@e-pirate
Author

e-pirate commented Apr 4, 2019

Any updates on this?

Well... I got a new job in a different company, so I have no access to our cluster any more. No help from me. Sorry!

@edsonmarquezani

edsonmarquezani commented May 1, 2019

I guess I'm suffering from this same issue. Currently, I'm unable to do rolling updates with the latest KOPS version (1.12.beta) + Calico (3.4.0). New nodes won't get Calico ready until we restart it. The readiness check fails, complaining about itself with this same "BIRD" error message.

Readiness probe failed: calico/node is not ready: BIRD is not ready: BGP not established with 10.21.34.168

The issue goes away once I delete the pod and let it be recreated.

See kubernetes/kops#6784.

@fasaxc
Member

fasaxc commented May 1, 2019

@edsonmarquezani I suggest you jump to v3.6, which is a currently supported version; we made some improvements to the readiness check to better handle rolling upgrades.

@edsonmarquezani

@fasaxc Ok, thank you so much!

@caseydavenport
Member

Going to close this for now, but please open a new issue if you encounter other issues with readiness / liveness reporting.

Thanks!
