HA Linkerd doesn't self-heal when one node goes down on k3s cluster #7232
Hi @tobiasmuehl, apologies for taking a bit of time with my reply; it's been a hectic week for me. We talked about this on Slack, and iirc your pods were never rescheduled when the node was killed, although the node's status was updated. For the purpose of this issue, I focused on reproducing that: I started a three-node k3d cluster, installed Linkerd (and the viz extension) in HA mode, and then killed one of the nodes.
The next step was to register a new node with k3d.
After a while (~ 2 minutes), Kubernetes registered the new state:
However, my pods had not been moved yet. While I let k8s manage its state, I did a bit of searching; I was curious what the default behaviour should be. I came across this issue in the k8s GitHub repo. One of the answers (granted, from 2017):
By this point, my new node had already been registered for ~7 minutes.
Shortly after, I checked my pods again, and they were rescheduled:
Conclusion: based on the comment above (related to the pod eviction timeout), k8s will evict pods from failed nodes after a grace period of around 5 minutes by default (assuming the default hasn't changed over the past couple of versions). It took a bit longer than that for me, but Kubernetes as a system has a lot of syncing to do, so it's possible something simply took longer than expected.

Since I wasn't able to reproduce this, and my pods were rescheduled successfully, I can't help but think this isn't really an issue with the way we do HA, but rather with your k8s environment. We made the changes needed to support HA as suggested by the Kubernetes documentation, i.e. using PodDisruptionBudgets and anti-affinity rules. These seem to hold; I looked at our HA manifests when we talked on Slack and found nothing wrong with them from a logical perspective.

I think in your case the environment misbehaved. As I said, we're not making any scheduling decisions ourselves; rather, we provide additional information to aid the scheduler, and that information seems to be good. We have no control over eviction.

Hope this makes sense. I'll keep this open for a while in case you have any follow-ups.
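For reference, the ~5 minute grace period discussed above comes from the default tolerations Kubernetes injects into every pod for taint-based eviction. A sketch of what those defaults look like (values can be overridden per pod; this is not specific to Linkerd):

```yaml
# Default tolerations added to pods by the admission controller
# (taint-based eviction, on by default since k8s 1.13). When a node
# becomes unreachable, the node controller applies a NoExecute taint,
# and pods are evicted once tolerationSeconds expires (~5 minutes).
tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
```

Lowering `tolerationSeconds` on a workload makes eviction from a dead node faster, at the cost of more churn when the network is merely flaky.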
Thanks for the detailed response. It makes sense, yet I can't reproduce this correct behaviour with clusters that use k3s instead of k3d, even after hours of waiting. The k3s installation isn't customized in any way and strictly follows the docs. At least in my tests, HA works correctly when a human runs the appropriate commands manually.
@tobiasmuehl I am pretty certain it works, but I wanted to test this out before replying. I managed to get around to it today, and here are my results. Just as before, I started a three-node k3d cluster and installed Linkerd along with the viz extension, both in HA mode. I checked that everything ran smoothly by opening a dashboard and querying traffic stats on an example application through the viz extension. Finally, after verifying everything, I removed the node.
Next, I verified that the dashboard was still accessible, which it was. I then went through the logs. You will notice in the logs a warning message about the proxy not connecting to a destination endpoint. Consequently, the success rate also dropped for the pod. This is understandable in my mind: the proxy is still considering an endpoint that can no longer function. However, it does not restrict or have a larger impact on the environment; the control plane still routes pods to where they need to go (the proxy gets all destination endpoints through DNS).

In a normal production environment, we'd see a dip in the success rate until the control plane fully recovered -- that recovery is the responsibility of the k8s scheduler, as I mentioned in my previous post. HA mode should allow everything from the control plane to the viz stack to keep functioning even when a node goes down. You might see a dip in the success rate while endpoints not yet cleaned up by Kubernetes are still being considered -- in my case, k8s didn't deschedule my pods, but again, this is not related to Linkerd and there's nothing we can do here -- but everything should still function. In my case, as soon as Kubernetes recovered and terminated my pods, the success rate also started recovering. Note: I did not add an extra node this time.
With that being said, I did bump into an issue when trying to run
It surfaces when we try to interact with the affected pod. Hope this all makes sense! Raised #7385 to discuss this further.
Thanks for your detailed investigation -- I mostly understand what you said 😅 Sadly, it seems this issue cannot be replicated with k3d. Perhaps there is a substantial difference in how removing nodes works on k3s vs k3d.
Going to close this. There isn't really a lot that can be done on our side. If you have any more questions, or you'd like to discuss things further, let me know and we can reopen it. 😄
Bug Report
Running Linkerd in HA configuration, I expect that if one node's hardware dies, any pods scheduled on that node will be rescheduled on another node automatically. I also expect the viz dashboard to keep working during the node failure, since 2 out of 3 nodes are still available.
What is the issue?
Pods are not rescheduled on new hardware and the dashboard can stop working
How can it be reproduced?
Create a 3-node cluster with k3s. Deploy Linkerd in HA mode and viz in HA mode. Add another node to the k3s cluster. Kill one of the first three nodes' hardware, but don't drain or remove the node with kubectl.
Environment
Server version: stable-2.11.1
Possible solution
Wondering if the anti-affinity rules somehow lock the pods onto the dead node and prevent rescheduling elsewhere
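On the anti-affinity question: a rule of the kind used for HA looks roughly like this (an illustrative sketch, not Linkerd's actual manifests; all names here are placeholders):

```yaml
# Hypothetical deployment illustrating the anti-affinity pattern HA mode uses:
# replicas are spread across nodes by hostname. Note the field name --
# "IgnoredDuringExecution" means the rule applies only at scheduling time;
# it cannot "lock" an already-running pod to a dead node. Eviction from a
# failed node is driven entirely by the node controller's NoExecute taints.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-controller   # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-controller
  template:
    metadata:
      labels:
        app: example-controller
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: example-controller
              topologyKey: kubernetes.io/hostname
      containers:
        - name: app
          image: example/app:latest   # placeholder image
```

So an anti-affinity rule of this shape should not, by itself, prevent rescheduling; it only constrains where replacement pods may land once eviction has happened.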
Additional context
https://linkerd.slack.com/archives/C89RTCWJF/p1635968159355300